AMD OS1354WBJ4BGHBOX Optimization Guide - Page 15

2.7.1.4 Out-of-Page Target Array

The out-of-page target array (OPG) holds the high address bits ([28:12]) for 32 targets that are outside the current page for branches marked in the sparse BTB. Only sparse branches are eligible for out-of-page target prediction. Branches marked by the dense predictor are not eligible for OPG target prediction. Direct dense branches that are out-of-page will have their targets corrected by the branch target address calculator with a 4-cycle penalty. Direct sparse branch targets that cross a 28-bit address block boundary (beyond the range of the out-of-page target array) are also corrected by the branch target address calculator.
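
The three cases described above (target in the same page, target out-of-page but within OPG range, target beyond OPG range) can be sketched as a small address classifier. This is only an illustration under stated assumptions, not a description of the hardware: it assumes 4 KB pages (bits [11:0] are the page offset) and assumes that a target is within OPG range when every address bit above bit 28 matches the branch address, following the [28:12] bit range given in the text. The names `classify_target`, `PAGE_SHIFT`, and `OPG_TOP_BIT` are hypothetical.

```c
#include <stdint.h>

/* Assumption: 4 KB pages, so bits [11:0] are the offset within a page. */
#define PAGE_SHIFT 12
/* Assumption: the OPG supplies target bits [28:12], so bits above 28 must
   already match the branch address for the OPG to cover the target. */
#define OPG_TOP_BIT 28

typedef enum {
    TARGET_IN_PAGE,    /* same page as the branch: no OPG entry needed      */
    TARGET_OPG_RANGE,  /* different page, upper bits match: OPG can cover   */
    TARGET_BEYOND_OPG  /* upper bits differ: corrected by the target        */
                       /* address calculator according to the text          */
} target_class;

/* Classify a direct branch target relative to the branch address. */
static target_class classify_target(uint64_t branch_addr, uint64_t target_addr)
{
    if ((branch_addr >> PAGE_SHIFT) == (target_addr >> PAGE_SHIFT))
        return TARGET_IN_PAGE;
    if ((branch_addr >> (OPG_TOP_BIT + 1)) == (target_addr >> (OPG_TOP_BIT + 1)))
        return TARGET_OPG_RANGE;
    return TARGET_BEYOND_OPG;
}
```

A tool that post-processes a linker map could, for example, call `classify_target(branch, target)` for each direct branch in a hot path to see how many targets leave the page. Whether a given branch is tracked by the sparse or the dense predictor is not modeled here.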

2.7.1.5 Branch Marker Caching

When a cache line is evicted, the sparse marker information for the first two branches in that cache line is slightly compressed and written out into a subset of the L2 ECC bits, but only if the line contains instructions exclusively. These markers are brought back into the core and reloaded into the sparse predictor if their L2 line is reloaded into the L1 instruction cache before eviction from L2 or before the line is the target of a store. Dense branches may or may not remain resident in the dense predictor when the L1 instruction cache is reloaded.

Sparse markers in the shared L2 can be shared with other cores that fetch from the same L2 line. Software with extremely large instruction footprints, especially those with multiple threads that share instruction cache lines, can take advantage of this property by targeting a branch density of no more than 2 branches per cache line.
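
The density guideline above can be checked offline against a list of branch addresses, for instance one taken from a disassembly. The following is a minimal sketch, assuming a 64-byte instruction cache line and an input array already sorted in ascending order. The function name `count_dense_lines` and both constants are hypothetical and chosen for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_BYTES 64u
/* From the text: marker information is cached for the first two branches. */
#define MAX_CACHED_MARKERS 2u

/* Return the number of cache lines that hold more than two branches.
   branch_addrs must be sorted in ascending order. */
static size_t count_dense_lines(const uint64_t *branch_addrs, size_t n)
{
    size_t over_limit = 0;
    size_t i = 0;

    while (i < n) {
        uint64_t line = branch_addrs[i] / CACHE_LINE_BYTES;
        size_t in_line = 0;

        /* Consume every branch that falls in the same cache line. */
        while (i < n && branch_addrs[i] / CACHE_LINE_BYTES == line) {
            in_line++;
            i++;
        }
        if (in_line > MAX_CACHED_MARKERS)
            over_limit++;
    }
    return over_limit;
}
```

A result of zero means every line in the input stays within the two-branch guideline. A nonzero result points at lines whose third and later branches would not have marker information written out on eviction.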

2.7.1.6 Return Address Stack

The Family 16h processor implements a 16-entry return address stack (RAS) to predict return addresses from a near call. As calls are fetched, the address of the following instruction is pushed onto the return address stack. Typically, the return address of the call is correctly predicted by the address popped off the top of the return address stack. However, mispredictions sometimes arise during speculative execution that can cause incorrect pushes and/or pops to the return address stack. The processor implements mechanisms that correctly recover the return address stack in most cases. If the return address stack cannot be recovered, it is invalidated and the execution hardware restores it to a consistent state.

The following commonly used coding practices optimized for other processor microarchitectures are not optimum for the Family 16h processor:

CALL 0h

In prior processor families (for example, Family 10h), a CALL 0h followed by a POP instruction was recommended for 32-bit software to get the RIP value into a general-purpose register. CALL 0h was recognized and treated specially, and the return address stack was kept consistent even though there was no return instruction paired with the call. On the Family 16h processor, CALL 0h is not treated specially, and thus this code sequence will cause the RAS to get out of sync due to the un-paired call. It is recommended to avoid the use of CALL 0h in 32-bit software, and instead use a true subroutine call, a MOV reg,[RSP] instruction, and a paired return to get the value of the RIP register into a general-purpose register.
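
The effect of an un-paired call can be illustrated with a toy software model of a return address stack. This is a sketch for intuition only: the 16-entry depth comes from the text, but the circular-buffer layout, the type `ras_model`, all function names, and the example addresses are assumptions. The hardware recovery mechanisms mentioned above are not modeled.

```c
#include <stdint.h>

#define RAS_ENTRIES 16  /* from the text: 16-entry return address stack */

typedef struct {
    uint64_t entry[RAS_ENTRIES];
    unsigned top;    /* next free slot, wraps modulo RAS_ENTRIES */
    unsigned count;  /* number of valid entries                  */
} ras_model;

static void ras_init(ras_model *r) { r->top = 0; r->count = 0; }

/* A call pushes the address of the instruction that follows it. */
static void ras_on_call(ras_model *r, uint64_t return_addr)
{
    r->entry[r->top] = return_addr;
    r->top = (r->top + 1) % RAS_ENTRIES;
    if (r->count < RAS_ENTRIES)
        r->count++;
}

/* A return pops one entry; the result is 1 if the prediction was correct. */
static int ras_on_return(ras_model *r, uint64_t actual_return_addr)
{
    if (r->count == 0)
        return 0;
    r->top = (r->top + RAS_ENTRIES - 1) % RAS_ENTRIES;
    r->count--;
    return r->entry[r->top] == actual_return_addr;
}

/* CALL 0h / POP inside a function: the call pushes, but no return pops. */
static int mispredicts_with_unpaired_call(void)
{
    ras_model r;
    int miss = 0;
    ras_init(&r);
    ras_on_call(&r, 0x1005);            /* outer caller calls f1          */
    ras_on_call(&r, 0x2005);            /* f1 calls f2                    */
    ras_on_call(&r, 0x3005);            /* f2 executes CALL 0h, then POP  */
    miss += !ras_on_return(&r, 0x2005); /* f2 returns, stale entry popped */
    miss += !ras_on_return(&r, 0x1005); /* f1 returns, still shifted      */
    return miss;
}

/* True subroutine call with a paired return: every push has a pop. */
static int mispredicts_with_paired_call(void)
{
    ras_model r;
    int miss = 0;
    ras_init(&r);
    ras_on_call(&r, 0x1005);
    ras_on_call(&r, 0x2005);
    ras_on_call(&r, 0x3005);            /* f2 calls a helper routine      */
    miss += !ras_on_return(&r, 0x3005); /* helper returns                 */
    miss += !ras_on_return(&r, 0x2005);
    miss += !ras_on_return(&r, 0x1005);
    return miss;
}
```

In this model the un-paired sequence mispredicts both enclosing returns, because the stale entry shifts every later prediction by one slot, while the paired sequence predicts every return correctly.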

REP RET

For prior processor families, such as Family 10h and 12h, a three-byte return-immediate RET instruction had been recommended as an optimization to improve performance over a single-byte near-return. With processor Families 15h and 16h, this is no longer recommended and a single-byte near-return (opcode C3h) can be used with no negative performance impact. This will result in smaller code size over the three-byte method. For the rationale for the former recommendation, see section 6.2 in the Software Optimization Guide for AMD Family 10h and 12h Processors.

52128 Rev. 1.1 March 2013
Software Optimization Guide for AMD Family 16h Processors
Chapter 2 Microarchitecture of the Family 16h Processor 15

Section	Page
Contents	3
List of Figures	4
List of Tables	5
Revision History	6
1 Preface	7
2 Microarchitecture of the Family 16h Processor	8
2.1 Features	8
2.2 Instruction Decomposition	10
2.3 Superscalar Organization	10
2.4 Processor Block Diagram	11
2.5 Processor Cache Operations	11
2.5.1 L1 Instruction Cache	12
2.5.2 L1 Data Cache	12
2.5.3 L2 Cache	12
2.6 Memory Address Translation	13
2.6.1 L1 Translation Lookaside Buffers	13
2.6.2 L2 Translation Lookaside Buffers	13
2.6.3 Hardware Page Table Walker	13
2.7 Optimizing Branching	13
2.7.1 Branch Prediction	13
2.7.1.1 Next Address Logic	14
2.7.1.2 Branch Target Buffer	14
2.7.1.3 Branch Target Address Calculator	14
2.7.1.4 Out-of-Page Target Array	15
2.7.1.5 Branch Marker Caching	15
2.7.1.6 Return Address Stack	15
2.7.1.7 Indirect Target Predictor	16
2.7.1.8 Conditional Branch Predictor	16
2.7.1.9 Fetch Window Tracking Structure	16
2.7.2 Loop Alignment	16
2.7.2.1 Encoding Padding for Loop Alignment	16
2.7.2.2 Aligning Loops to Reduce Power Consumption	17
2.8 Instruction Fetch and Decode	18
2.9 Integer Unit	18
2.9.1 Integer Schedulers	18
2.9.2 Integer Execution Units	18
2.9.3 Retire Control Unit	19
2.10 Floating-Point Unit	19
2.10.1 Denormals	21
2.11 XMM Register Merge Optimization	22
2.12 Load Store Unit	23
Appendix A Instruction Latencies	24
A.1 Instruction Latency Assumptions	24
