AMD OS1354WBJ4BGHBOX Optimization Guide - Page 20

Floating-point, Block, Diagram

Page 20 highlights

Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013

paths are 128 bits wide. As a result, the maximum throughput of both single-precision and double-precision floating-point SSE vector operations has improved by a factor of two over the AMD Family 14h processor.

The floating-point unit (FPU) utilizes a coprocessor model. As such, it contains its own scheduler, register files, and renamers and does not share them with the integer units. It can handle dispatch and renaming of 2 floating-point macro-ops per cycle, and the scheduler can issue 1 micro-op per cycle for each pipe. The floating-point scheduler has an 18-entry micro-op capacity. The floating-point retire queue holds up to 44 floating-point micro-ops between dispatch and retire. Any macro-op that has a floating-point micro-op component, and that is dispatched into the integer retire control unit, will be held in the floating-point retire queue until the macro-op retires from the integer retire control unit. Thus, a maximum of 44 macro-ops that have floating-point micro-op components can be in flight in the 64-macro-op in-flight window that the integer retire control unit provides.

Figure 3. Floating-point Unit Block Diagram

The FPU contains a 128-bit floating-point multiply unit (FPM) and a 128-bit floating-point adder unit (FPA). The FPM contains two 76-bit × 27-bit multipliers, which means that double-precision (64-bit) and extended-precision (80-bit) floating-point multiplication computations require iteration. A few selected floating-point micro-ops, primarily logical/move/shuffle micro-ops, can execute in either the FPM or the FPA. The FPU also contains two 128-bit vector arithmetic/logical units (VALUs), which perform arithmetic and logical operations on AVX, SSE, and legacy MMX packed integer data, and a 128-bit integer multiply unit (VIMUL).
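The figures quoted in this passage lend themselves to a quick back-of-the-envelope check. The Python sketch below is illustrative only. The constant and function names are made up for this example. The significand widths (24, 53, and 64 bits) come from the IEEE 754 and x87 formats rather than from this guide. The reading that the 27-bit multiplier input is what forces iteration is an assumption based on the sentence about the FPM multipliers.

```python
# Figures quoted in the text; all names here are hypothetical.
DATAPATH_BITS = 128            # width of the floating-point data paths
FP_RETIRE_QUEUE_ENTRIES = 44   # floating-point retire queue capacity
INT_RETIRE_WINDOW = 64         # macro-ops tracked by the integer retire control unit
MULTIPLIER_NARROW_BITS = 27    # narrow input of each 76-bit x 27-bit multiplier

# Significand widths including the leading bit (IEEE 754 / x87 formats,
# not stated in the guide itself).
SIGNIFICAND_BITS = {"single": 24, "double": 53, "extended": 64}


def lanes_per_operation(element_bits: int) -> int:
    """Elements carried by one 128-bit vector operation."""
    if element_bits <= 0 or DATAPATH_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide the data path")
    return DATAPATH_BITS // element_bits


def max_fp_macro_ops_in_flight() -> int:
    """Upper bound on in-flight macro-ops with a floating-point component."""
    return min(FP_RETIRE_QUEUE_ENTRIES, INT_RETIRE_WINDOW)


def multiply_needs_iteration(precision: str) -> bool:
    """Assumption: a multiply iterates when the significand is wider
    than the narrow multiplier input."""
    return SIGNIFICAND_BITS[precision] > MULTIPLIER_NARROW_BITS


if __name__ == "__main__":
    print(lanes_per_operation(32), lanes_per_operation(64))
    print(max_fp_macro_ops_in_flight())
    for name in SIGNIFICAND_BITS:
        print(name, multiply_needs_iteration(name))
```

Under these assumptions, one 128-bit operation carries four single-precision or two double-precision elements. When the 64-macro-op window is full, at least 20 of its entries must belong to macro-ops without a floating-point component.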
The store/convert unit (STC) primarily handles stores (up to 128-bit operand size), floating-point/integer conversions, and integer/floating-point conversions. The register file and bypass network can also accept one 128-bit load per cycle from the load-store unit.

There are two important organizational dimensions to understand with respect to the execution units. The first is the pipeline binding. Pipe 0 contains vector integer ALU 0 (VALU0), the vector integer multiplier (VIMUL), and the floating-point adder (FPA). Pipe 1 contains vector integer ALU 1 (VALU1), the store/convert unit, and

20 Microarchitecture of the Family 16h Processor Chapter 2
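The pipeline binding described above can be written down as a small lookup table. This Python sketch is a simplified model, not AMD's scheduler. `PIPE_BINDING` and `can_issue_same_cycle` are hypothetical names. The FPM is omitted because the page breaks off before stating its pipe. The rule that two micro-ops can issue together only when bound to different pipes assumes that the limit of one micro-op per pipe per cycle is the only constraint.

```python
# Execution unit -> pipe, as listed in the text. The FPM is left out because
# the excerpt ends before naming the pipe it belongs to.
PIPE_BINDING = {
    "VALU0": 0,
    "VIMUL": 0,
    "FPA": 0,
    "VALU1": 1,
    "STC": 1,
}


def can_issue_same_cycle(unit_a: str, unit_b: str) -> bool:
    """True when the two units sit on different pipes.

    Assumes the only limit is one micro-op issued per pipe per cycle.
    """
    return PIPE_BINDING[unit_a] != PIPE_BINDING[unit_b]


if __name__ == "__main__":
    print(can_issue_same_cycle("VALU0", "VALU1"))
    print(can_issue_same_cycle("VIMUL", "FPA"))
```

In this model, a vector integer multiply and a floating-point add compete for pipe 0, while operations on VALU0 and VALU1 can issue in the same cycle.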

