AMD OS1354WBJ4BGHBOX Optimization Guide - Page 24


Software Optimization Guide for AMD Family 16h Processors
52128 Rev. 1.1 March 2013

Appendix A Instruction Latencies

The companion file AMD64_16h_InstrLatency_1.1.xlsx distributed with this Software Optimization Guide provides additional detailed information for the AMD Family 16h processor. The first worksheet in the spreadsheet, "Overview," provides reference information related to the second worksheet, "Latencies." This appendix explains the columns and definitions used in the table of latencies. Information in the spreadsheet is based on estimates and is subject to change.

A.1 Instruction Latency Assumptions

The term instruction latency refers to the number of processor clock cycles required to complete the execution of a particular instruction from the time that it is issued. Throughput refers to the number of results that can be generated in a unit of time given the repeated execution of a given instruction.

Many factors affect instruction execution time. For instance, when a source operand must be loaded from a memory location, the time required to read the operand from system memory adds to the execution time. Furthermore, latency is highly variable because a memory operand may or may not be found in one of the levels of data cache. In some cases, the target memory location may not even be resident in system memory because it has been paged out to backing storage.

In estimating the instruction latency and reciprocal throughput, the following assumptions are necessary:

• The instruction is an L1 I-cache hit that has already been fetched and decoded, with the operations loaded into the scheduler.
• Memory operands are in the L1 data cache.
• There is no contention for execution resources or load-store unit resources.

Each latency value listed in the spreadsheet denotes the typical execution time of the instruction when run in isolation on a processor.

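The difference between latency and reciprocal throughput can be made concrete with a back-of-the-envelope model. The sketch below is an illustration only and is not part of the guide or its spreadsheet: the function name, the idealized pipeline model, and the example numbers (latency 3, reciprocal throughput 1) are assumptions, and it relies on the assumptions listed above (L1 hits, no resource contention).

```python
def estimate_cycles(count, latency, reciprocal_throughput, dependent):
    """Rough cycle estimate for `count` back-to-back copies of one instruction.

    Hypothetical helper under an idealized model:
    - dependent chain: each instruction waits for the previous result,
      so the total is governed by latency;
    - independent instructions: a new one can start every
      `reciprocal_throughput` cycles, and the last one still needs
      `latency` cycles to finish.
    """
    if count <= 0:
        return 0
    if dependent:
        return count * latency
    return (count - 1) * reciprocal_throughput + latency


# Made-up example values, not taken from the spreadsheet.
chain = estimate_cycles(4, latency=3, reciprocal_throughput=1, dependent=True)
parallel = estimate_cycles(4, latency=3, reciprocal_throughput=1, dependent=False)
print(chain, parallel)  # prints "12 6"
```

With these assumed values, four dependent instructions cost about 12 cycles, while four independent ones cost about 6, which is why the two figures are reported separately.
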
For real programs executed on this highly aggressive superscalar family of processors, multiple instructions can execute simultaneously; therefore, the effective latency for any given instruction's execution may be overlapped with the latency of other instructions executing in parallel.

The latencies in the spreadsheet reflect the number of cycles from instruction issuance to instruction retirement. This includes the time to write results to registers or the write buffer, but not the time for results to be written from the write buffer to L1 D-cache, which may not occur until after the instruction is retired.

For most instructions, the only forms listed are the ones without memory operands. The latency for instruction forms that load from memory can be calculated by adding the load latencies listed on the overview worksheet to the latency for the register-only form.

To measure the latency of an instruction which stores data to memory, it is necessary to define an end point at which the instruction is said to be complete. This guide has chosen instruction retirement as the end point, and under that definition writes add no additional latency. Choosing another end point, such as the point at which the data has been written to the L1 cache, would result in variable latencies and would not be meaningful without taking into account the context in which the instruction is executed.

There are cases where additional latencies may be incurred in a real program that are not described in the spreadsheet, such as delays caused by L1 cache misses or contention for execution or load-store unit resources.

A.2 Spreadsheet Column Descriptions

The following describes the information provided in each column of the spreadsheet:

Column A    Instruction    Instruction opcodes
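
The rule in A.1 for memory-operand forms (register-only latency plus the load latency from the overview worksheet) amounts to a single addition, and the retirement end point means store forms add nothing. A minimal sketch, assuming hypothetical values; the function names and the numbers are illustrative and do not come from the spreadsheet:

```python
def load_form_latency(register_form_latency, load_latency):
    """Estimated latency of a form that reads a source operand from memory.

    Assumes the operand hits in the L1 data cache, as the guide requires.
    """
    return register_form_latency + load_latency


def store_form_latency(register_form_latency):
    """Estimated latency of a form that writes its result to memory.

    With instruction retirement as the end point, the write adds no latency.
    """
    return register_form_latency


# Placeholder numbers only: a 1-cycle register form and a 3-cycle load.
print(load_form_latency(1, 3))
print(store_form_latency(1))
```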

