AMD OS1354WBJ4BGHBOX Optimization Guide - Page 21

Denormals

Page 21 highlights

52128 Rev. 1.1 March 2013
Software Optimization Guide for AMD Family 16h Processors

the floating-point multiply unit (FPM). The floating-point scheduler can issue one micro-op to one unit per pipe per cycle and provides logic to prevent pipeline hazards like resource contention on the result bus.

The second organizational dimension for execution units is forwarding domains. The FPU is divided into three clusters, and forwarding between clusters requires an extra cycle in the bypass network. The three clusters are the Floating-point Cluster (composed of the FPM and FPA units), the Integer Cluster (composed of the VALU0, VALU1, and VIMUL units), and the Store / Convert Cluster (STC).

When the result of an instruction executing in one domain is consumed as input by a subsequent instruction executing in a different domain there is a one cycle forwarding delay. This delay does not increase the time that either of the instructions is occupying the execution units, but the scheduler will not attempt to schedule the second instruction earlier. Most FPU instructions support local forwarding, which eliminates this delay when the consuming instruction executes in the same domain. However some instructions (marked with the note "local forwarding disabled" in the latency spreadsheet) do not support local forwarding and experience the forwarding delay even when the consuming instruction executes in the same domain.

The following table summarizes the majority of instruction latencies in the FPU.

Table 2. Summary of Floating-point Instruction Latencies

Instruction Class                                 Latency    Throughput     Execution Pipe  Unit(s)        Cluster
SIMD ALU (most)                                   1          2 / cycle      Either          VALU0, VALU1   Integer
Floating-point logical                            1          2 / cycle      Either          FPA, FPM       Floating-pt
SIMD IMUL                                         2          1 / cycle      Pipe 0          VIM            Integer
Floating-point multiply single-precision          2          1 / cycle      Pipe 1          FPM            Floating-pt
Floating-point add                                3          1 / cycle      Pipe 0          FPA            Floating-pt
Store/Convert (many)                              3          1 / cycle      Pipe 1          Store/Convert  STC
Floating-point multiply double-precision          4          1 / 2 cycles   Pipe 1          FPM            Floating-pt
Floating-point multiply extended-precision (x87)  5          1 / 3 cycles   Pipe 1          FPM            Floating-pt
Floating-point DIV/SQRT                           Iterative  Iterative      Pipe 1          FPM            Floating-pt

Refer to the AMD64_16h_InstrLatency.xlsx spreadsheet described in Appendix A for more instruction latency and throughput information.

2.10.1 Denormals

Denormal floating-point values (also called subnormals) can be created by a program either by explicitly specifying a denormal value in the source code or by calculations on normal floating-point values. A significant performance cost (more than 100 processor cycles) may be incurred when these values are encountered. For SSE/AVX instructions, the denormal penalties are a function of the configuration of MXCSR and the instruction sequences that are executed in the presence of a denormal value.

Denormal penalties may occur in two phases: usage of a denormal in a computation (pre-computation penalty), and production of a denormal during the execution of an instruction (post-computation penalty).

A sequence of floating-point compute instructions may incur a pre-computation penalty when a denormal value is encountered as an input. This penalty occurs on a floating-point computation instruction, such as [V]ADDPS,

Chapter 2    Microarchitecture of the Family 16h Processor    21
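As an illustration of how denormals can arise "by calculations on normal floating-point values", the following is a minimal Python sketch, not code from the guide. It assumes IEEE 754 double-precision floats (which is what CPython uses on common platforms) and does not measure or reproduce any penalty. The helper names is_denormal and first_denormal_step are hypothetical and exist only for this example.

```python
import math
import sys

# Smallest positive *normal* double; anything nonzero below this magnitude
# is a denormal (subnormal). Assumes IEEE 754 binary64 floats.
MIN_NORMAL = sys.float_info.min


def is_denormal(x: float) -> bool:
    """Return True if x is nonzero, finite, and below the smallest normal magnitude."""
    return x != 0.0 and math.isfinite(x) and abs(x) < MIN_NORMAL


def first_denormal_step(start: float, factor: float, max_steps: int = 10_000) -> int:
    """Repeatedly multiply start by factor (hypothetical helper).

    Returns the 1-based step at which the running product first becomes
    denormal, or -1 if it reaches zero or max_steps without doing so.
    """
    value = start
    for step in range(1, max_steps + 1):
        value *= factor
        if is_denormal(value):
            return step
        if value == 0.0:
            return -1
    return -1


if __name__ == "__main__":
    # A normal input halved once is already denormal when it starts at MIN_NORMAL.
    print(is_denormal(MIN_NORMAL), is_denormal(MIN_NORMAL / 2))
    # Step count at which a decaying value starting from 1.0 turns denormal.
    print(first_denormal_step(1.0, 0.5))
```

A check like this can help locate loops (decaying filters, accumulators that approach zero) where denormal inputs or results are likely, which is where the penalties described above would be worth investigating with a profiler.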
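The forwarding-domain rule described earlier on this page can be turned into a rough back-of-the-envelope model. The Python sketch below is an assumption-laden illustration, not AMD's scheduler: it treats a chain of dependent instructions as the sum of their Table 2 latencies, adds one cycle whenever a result crosses clusters, and also adds one cycle when the producing instruction is flagged as "local forwarding disabled" (placing the flag on the producer is an assumption of this sketch). The Op class and chain_latency function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence

FORWARDING_DELAY = 1  # extra cycle in the bypass network, per the text above


@dataclass(frozen=True)
class Op:
    """One instruction class in a dependency chain (hypothetical model type)."""
    name: str
    latency: int                   # cycles, taken from Table 2
    cluster: str                   # "Integer", "Floating-pt", or "STC"
    local_forwarding: bool = True  # False models "local forwarding disabled"


def chain_latency(ops: Sequence[Op]) -> int:
    """Estimate total latency of a chain where each op consumes the previous result.

    Simplified model: latencies add up, plus one cycle per producer/consumer
    pair that crosses clusters or whose producer lacks local forwarding.
    Resource contention and throughput limits are ignored.
    """
    total = 0
    for i, producer in enumerate(ops):
        total += producer.latency
        if i + 1 < len(ops):
            consumer = ops[i + 1]
            crosses = producer.cluster != consumer.cluster
            if crosses or not producer.local_forwarding:
                total += FORWARDING_DELAY
    return total


# Instruction classes with latencies and clusters copied from Table 2.
FMUL_SP = Op("FP multiply (single)", 2, "Floating-pt")
FADD = Op("FP add", 3, "Floating-pt")
SIMD_ALU = Op("SIMD ALU", 1, "Integer")
STORE_CVT = Op("Store/Convert", 3, "STC")

if __name__ == "__main__":
    print(chain_latency([FMUL_SP, FADD]))    # same cluster: 2 + 3
    print(chain_latency([SIMD_ALU, FADD]))   # crosses clusters: 1 + 1 + 3
    print(chain_latency([FADD, STORE_CVT]))  # crosses clusters: 3 + 1 + 3
```

Under this model, keeping a dependent chain inside one cluster avoids the extra cycle per hop. Real timings depend on details in the latency spreadsheet that the model leaves out.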

