AMD OS1354WBJ4BGHBOX Optimization Guide - Page 19

Floating-Point

Page 19 highlights

52128 Rev. 1.1 March 2013
Software Optimization Guide for AMD Family 16h Processors

Figure 2. Integer Schedulers and Execution Units

All integer operations can be handled in the ALUs (ALU0 and ALU1 are fully symmetrical), with the exception of integer multiply, integer divide, and three-operand LEA instructions. While two-operand LEA instructions are mapped as a single-cycle micro-op in the ALUs, three-operand LEA instructions are mapped to the store AGU and have a 2-cycle latency, with results inserted back into the ALU1 pipeline.

The integer multiply unit can handle multiplies of up to 32 bits × 32 bits with a 3-cycle latency, fully pipelined. 64-bit × 64-bit multiplies require data pumping and have a 6-cycle latency with a throughput of one every 4 cycles. If the multiply instruction has two destination registers, it incurs one additional cycle of latency and a one-cycle reduction in throughput.

The radix-4 hardware integer divider unit can compute 2 bits of results per cycle.

2.9.3 Retire Control Unit

The retire control unit (RCU) tracks the completion status of all outstanding operations (integer, load/store, and floating-point) and is the final arbiter for exception processing and recovery. The unit can receive up to two macro-ops dispatched per cycle and track up to 64 macro-ops in flight. A macro-op is eligible to be committed by the retire unit when all corresponding micro-ops have finished execution. For most cases of fastpath double macro-ops (such as when an AVX 256-bit instruction is broken into two 128-bit macro-ops), it is further required that both macro-ops have finished execution before commitment can occur. The retire unit handles in-order commit of up to two macro-ops per cycle.

The retire control unit also manages internal integer register mapping and renaming. The integer physical register file (PRF) consists of 64 registers, with between 20 and 31 mapped to architectural state or microarchitectural temporary state.
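The rename budget follows directly from the PRF figures above. The sketch below only restates that arithmetic; the function and constant names are hypothetical and do not come from the guide.

```python
# Figures taken from the text: a 64-entry integer PRF, of which
# 20 to 31 entries hold architectural or temporary state.
PRF_SIZE = 64
MAPPED_MIN, MAPPED_MAX = 20, 31


def rename_budget(mapped: int) -> int:
    """Return the physical registers left for out-of-order renames
    when `mapped` registers are pinned to architectural/temporary state.
    (Hypothetical helper, for illustration only.)"""
    if not MAPPED_MIN <= mapped <= MAPPED_MAX:
        raise ValueError(f"mapped must be in [{MAPPED_MIN}, {MAPPED_MAX}]")
    return PRF_SIZE - mapped


best_case = rename_budget(MAPPED_MIN)   # fewest registers pinned
worst_case = rename_budget(MAPPED_MAX)  # most registers pinned
print(best_case, worst_case)
```

In practice this means that somewhere between 33 and 44 register-writing integer instructions can be in flight before renaming itself becomes a limit, which is well below the 64 macro-ops the RCU can track.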
The remaining 44 to 33 registers are available for out-of-order renames. Generally, physical register renames are needed for instructions that write to an integer register destination (for example, ADD), but not for those instructions that only write flags (for example, CMP) or perform stores to memory.

2.10 Floating-Point Unit

The AMD Family 16h processor provides native support for 32-bit single precision, 64-bit double precision, and 80-bit extended precision primary floating-point data types as well as 128-bit packed single and double precision vector floating-point data types. The 256-bit packed single and double precision vector floating-point data types are fully supported through the use of two 128-bit macro-ops per instruction.

The floating-point load and store

Chapter 2 Microarchitecture of the Family 16h Processor 19
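As a rough illustration of what the two-macro-op split for 256-bit operations implies, the sketch below counts macro-ops for a mix of vector instructions and derives a lower bound on retire cycles from the two-per-cycle commit width stated in Section 2.9.3. It assumes that every 128-bit instruction is a single macro-op, which the text does not state for all instructions, and it ignores execution latency and scheduling entirely. All names are hypothetical.

```python
RETIRE_WIDTH = 2  # macro-ops committed per cycle (from the text)


def macro_ops(n_128bit: int, n_256bit: int) -> int:
    """Count macro-ops, assuming one per 128-bit instruction (an
    assumption) and two per 256-bit instruction (from the text)."""
    return n_128bit + 2 * n_256bit


def min_retire_cycles(n_128bit: int, n_256bit: int) -> int:
    """Lower bound on cycles needed to commit the given instruction mix.
    This is a bound from retire width alone, not a performance estimate."""
    ops = macro_ops(n_128bit, n_256bit)
    return (ops + RETIRE_WIDTH - 1) // RETIRE_WIDTH  # ceiling division


# Eight 256-bit instructions cost as many retire slots as sixteen
# 128-bit instructions under these assumptions.
print(min_retire_cycles(0, 8), min_retire_cycles(16, 0))
```

Under this model a loop body written with 256-bit AVX instructions does not retire faster than the equivalent 128-bit code; any benefit would have to come from elsewhere, such as fewer instructions to fetch and decode.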

