AMD OS1354WBJ4BGHBOX Optimization Guide - Page 23

Page 23 highlights

52128 Rev. 1.1 March 2013
Software Optimization Guide for AMD Family 16h Processors
Chapter 2 Microarchitecture of the Family 16h Processor, page 23

• CVTSS2SD
• MOVLPS xmm1,[mem]
• CVTSI2SD (32-/64-BIT)
• MOVSD xmm1,xmm2
• MOVLPD xmm1,[mem]
• RCPSS
• ROUNDSS
• ROUNDSD
• RSQRTSS
• SQRTSD
• SQRTSS

2.12 Load Store Unit

The AMD Family 16h processor load-store (LS) unit handles data accesses. The LS unit contains two largely independent pipelines enabling the execution of one 128-bit load memory operation and one 128-bit store memory operation per cycle.

The LS unit includes a 16-entry memory ordering queue (MOQ). The MOQ receives both load and store operations at dispatch. Loads leave the MOQ when the load has completed and delivered data to the integer unit or the floating-point unit. Stores leave the MOQ when their address has been translated. The LS unit utilizes a 20-entry store queue which holds stores from dispatch until the store data can be written to the data cache.

The LS unit dynamically reorders operations, supporting both load operations bypassing older loads and loads bypassing older non-conflicting stores. The LS unit ensures that the processor adheres to the architectural load and store ordering rules as defined by the AMD64 architecture.

The LS unit supports store-to-load forwarding (STLF) when all of the following conditions are met:
• the store address and load address both start on the exact same byte
• the store operation size is the same or larger than the load operation size
• neither the load nor the store operation is misaligned

One STLF pitfall to avoid is aliasing, where store and load virtual address bits [15:4] match but bits [47:16] do not, because it can delay execution of the load.

The LS unit can track up to eight outstanding in-flight cache misses.

The load store pipelines are optimized for zero-segment-base operations. A load or store that has a non-zero segment base suffers a one-cycle penalty in the load-store pipeline. Most modern operating systems use zero segment bases while running user processes and thus applications will not normally experience this penalty.
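
The STLF conditions and the alias pitfall described in section 2.12 can be expressed as simple address checks. The following Python sketch is a toy model for reasoning about address pairs, not a description of the hardware logic. The function names (`is_misaligned`, `stlf_eligible`, `stlf_false_alias`) are hypothetical, and "misaligned" is assumed here to mean "not a multiple of the access size", which is a simplification because this page does not define the term.

```python
def is_misaligned(addr: int, size: int) -> bool:
    # ASSUMPTION: "misaligned" is modeled as "not naturally aligned",
    # i.e. the address is not a multiple of the access size in bytes.
    # The guide page itself does not spell out its definition.
    return addr % size != 0


def stlf_eligible(store_addr: int, store_size: int,
                  load_addr: int, load_size: int) -> bool:
    """True when all three STLF conditions listed in the text hold."""
    return (
        store_addr == load_addr                        # same starting byte
        and store_size >= load_size                    # store covers the load
        and not is_misaligned(store_addr, store_size)  # store is aligned
        and not is_misaligned(load_addr, load_size)    # load is aligned
    )


def bits(value: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit field value[hi:lo]."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def stlf_false_alias(store_addr: int, load_addr: int) -> bool:
    """True when bits [15:4] match but bits [47:16] differ (the pitfall)."""
    return (
        bits(store_addr, 15, 4) == bits(load_addr, 15, 4)
        and bits(store_addr, 47, 16) != bits(load_addr, 47, 16)
    )
```

Under this model, two accesses whose virtual addresses differ by a multiple of 64 KiB and fall in the same 16-byte slot are flagged as aliases, so a check like `stlf_false_alias(src_addr, dst_addr)` could be used when choosing buffer offsets for a hot loop that stores to one buffer and loads from another.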

