AMD OS1354WBJ4BGHBOX Optimization Guide - Page 15


Page 15 highlights

52128 Rev. 1.1 March 2013
Software Optimization Guide for AMD Family 16h Processors

2.7.1.4 Out-of-Page Target Array

The out-of-page target array (OPG) holds the high address bits ([28:12]) for 32 targets that are outside the current page for branches marked in the sparse BTB. Only sparse branches are eligible for out-of-page target prediction; branches marked by the dense predictor are not. Direct dense branches that are out-of-page will have their targets corrected by the branch target address calculator with a 4-cycle penalty. Direct sparse branch targets that cross a 28-bit address block boundary (beyond the range of the out-of-page target array) are also corrected by the branch target address calculator.

2.7.1.5 Branch Marker Caching

When a cache line is evicted, the sparse marker information for the first two branches in that cache line is slightly compressed and written out into a subset of the L2 ECC bits, but only if the line contains instructions exclusively. These markers are brought back into the core and reloaded into the sparse predictor if their L2 line is reloaded into the L1 instruction cache before it is evicted from L2 and before the line is the target of a store. Dense branches may or may not remain resident in the dense predictor when the L1 instruction cache is reloaded.

Sparse markers in the shared L2 can be shared with other cores that fetch from the same L2 line. Software with an extremely large instruction footprint, especially software with multiple threads that share instruction cache lines, can take advantage of this property by targeting a branch density of no more than 2 branches per cache line.

2.7.1.6 Return Address Stack

The Family 16h processor implements a 16-entry return address stack (RAS) to predict return addresses from a near call. As calls are fetched, the address of the following instruction is pushed onto the return address stack.
Typically, the return address of the call is correctly predicted by the address popped off the top of the return address stack. However, mispredictions sometimes arise during speculative execution that can cause incorrect pushes and/or pops to the return address stack. The processor implements mechanisms that correctly recover the return address stack in most cases. If the return address stack cannot be recovered, it is invalidated and the execution hardware restores it to a consistent state.

The following commonly used coding practices optimized for other processor microarchitectures are not optimal for the Family 16h processor:

CALL 0h

In prior processor families (for example, Family 10h), a CALL 0h followed by a POP instruction was recommended for 32-bit software to get the RIP value into a general-purpose register. CALL 0h was recognized and treated specially, and the return address stack was kept consistent even though there was no return instruction paired with the call. On the Family 16h processor, CALL 0h is not treated specially, so this code sequence causes the RAS to get out of sync because of the unpaired call. It is recommended to avoid the use of CALL 0h in 32-bit software, and instead use a true subroutine call, a MOV reg,[RSP] instruction, and a paired return to get the value of the RIP register into a general-purpose register.

REP RET

For prior processor families, such as Family 10h and 12h, a three-byte return-immediate RET instruction had been recommended as an optimization to improve performance over a single-byte near-return. With processor Families 15h and 16h, this is no longer recommended, and a single-byte near-return (opcode C3h) can be used with no negative performance impact. This results in smaller code size than the three-byte method. For the rationale for the former recommendation, see section 6.2 in the Software Optimization Guide for AMD Family 10h and 12h Processors.
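To illustrate why an unpaired call matters, the following is a minimal Python sketch of a return address stack, not a description of the actual hardware. It assumes the simplest possible behavior: every call pushes, every return pops, and the oldest entry is dropped on overflow. The class name, function name, trace format, and addresses are all hypothetical, and the model ignores the recovery mechanisms mentioned in section 2.7.1.6.

```python
from collections import deque

RAS_DEPTH = 16  # entry count stated in section 2.7.1.6


class ReturnAddressStack:
    """Toy RAS model: push on call, pop on return (hypothetical, not AMD's design)."""

    def __init__(self, depth=RAS_DEPTH):
        # Assumption: when the stack is full, the oldest entry is discarded.
        self._entries = deque(maxlen=depth)

    def on_call(self, return_address):
        self._entries.append(return_address)

    def on_return(self):
        """Return the predicted target, or None if the stack is empty."""
        return self._entries.pop() if self._entries else None


def count_return_mispredictions(trace):
    """Count returns whose actual target differs from the RAS prediction.

    trace is a list of ("call", return_address) and ("ret", actual_target).
    """
    ras = ReturnAddressStack()
    misses = 0
    for kind, address in trace:
        if kind == "call":
            ras.on_call(address)
        elif kind == "ret":
            if ras.on_return() != address:
                misses += 1
        else:
            raise ValueError(f"unknown trace event: {kind!r}")
    return misses


# CALL 0h followed by POP: the call at 0x2000 pushes 0x2005, but no RET
# ever consumes it, so the two enclosing returns see stale predictions.
UNPAIRED_TRACE = [
    ("call", 0x0505),  # outer caller
    ("call", 0x1005),  # caller of the function that needs its own address
    ("call", 0x2005),  # CALL 0h; the POP that follows is invisible to the RAS
    ("ret", 0x1005),
    ("ret", 0x0505),
]

# True subroutine call with a paired return: same nesting, balanced stack.
PAIRED_TRACE = [
    ("call", 0x0505),
    ("call", 0x1005),
    ("call", 0x2005),  # call to a small helper that reads its return address
    ("ret", 0x2005),   # helper returns immediately
    ("ret", 0x1005),
    ("ret", 0x0505),
]

if __name__ == "__main__":
    print("unpaired:", count_return_mispredictions(UNPAIRED_TRACE))
    print("paired:  ", count_return_mispredictions(PAIRED_TRACE))
```

In this model, the unpaired sequence mispredicts both enclosing returns, while the paired sequence predicts every return correctly. Real hardware may recover sooner than the model suggests, so treat the counts as an illustration of the mechanism rather than a measurement.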
Chapter 2 Microarchitecture of the Family 16h Processor 15

