AMD OS1354WBJ4BGHBOX Optimization Guide - Page 12


Software Optimization Guide for AMD Family 16h Processors, 52128 Rev. 1.1, March 2013

2.5.1 L1 Instruction Cache

The AMD Family 16h processor contains a 32-Kbyte, 2-way set associative L1 instruction cache. Cache line size is 64 bytes; however, only 32 bytes are fetched in a cycle. Functions associated with the L1 instruction cache are fetching cache lines from the L2 cache, providing instruction bytes to the decoder, prefetching instructions, and predicting branches. Requests that miss in the L1 instruction cache are fetched from the L2 cache or, if not resident in the L2 cache, from system memory.

On misses, the L1 instruction cache generates fill requests for the naturally aligned 64-byte block that includes the miss address, and for one or two sequential blocks (prefetches). Because code typically exhibits spatial locality, prefetching is an effective technique for avoiding decode stalls. Cache-line replacement is based on a least-recently-used replacement algorithm. The L1 instruction cache is protected from error through the use of parity.

Due to the indexing and tagging scheme used in the instruction cache, optimal performance is obtained when two hot cache lines that need to be resident in the instruction cache simultaneously do not share the same virtual address bits [20:6].

2.5.2 L1 Data Cache

The AMD Family 16h processor contains a 32-Kbyte, 8-way set associative L1 data cache. This is a write-back cache that supports one 128-bit load and one 128-bit store per cycle. In addition, the L1 cache is protected from bit errors through the use of parity. A hardware prefetcher brings data into the L1 data cache to avoid misses. The L1 data cache has a 3-cycle integer load-to-use latency and a 5-cycle FPU load-to-use latency.

The data cache natural alignment boundary is 16 bytes. A misaligned load or store operation suffers, at minimum, a one-cycle penalty in the load-store pipeline if it spans a 16-byte boundary.
Throughput for misaligned loads and stores is half that of aligned loads and stores, since a misaligned load or store requires two cycles to access the data cache (versus a single cycle for aligned accesses). For aligned memory accesses, the aligned and unaligned load and store instructions (for example, MOVUPS/MOVAPS) provide identical performance. Natural alignment for both 128-bit and 256-bit vectors is 16 bytes. There is no advantage in aligning 256-bit vectors to a 32-byte boundary on the Family 16h processor, because 256-bit vectors are loaded and stored as two 128-bit halves.

2.5.3 L2 Cache

The AMD Family 16h processor implements a unified, 16-way set associative L2 cache shared by up to four cores. This on-die L2 cache is inclusive of the L1 caches in the cores. The L2 is a write-back cache with a variable load-to-use latency of no less than 25 cycles. The L2 cache size is 1 or 2 Mbytes, depending on configuration. L2 cache entries are protected from errors through the use of an error correcting code (ECC). The L2-to-L1 data path is 16 bytes wide; critical data within a cache line is forwarded first.

The L2 has four 512-Kbyte banks. Bits 7:6 of the cache-line address determine which bank holds the cache line. For a large contiguous block of data, this organization naturally spreads the cache lines over all four banks. The banks can operate on requests in parallel and can each deliver 16 bytes per cycle, for a total peak read bandwidth of 64 bytes per cycle for the L2. Peak bandwidth to any individual core is 16 bytes per cycle, so with four cores, each bank can deliver 16 bytes of data to a different core simultaneously. The banking scheme thus gives all four cores in the processing complex bandwidth at the level that a private per-core L2 would provide.

Chapter 2: Microarchitecture of the Family 16h Processor, page 12

