AMD OS1354WBJ4BGHBOX Optimization Guide - Page 22

Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013

if the denormal value was loaded into an XMM or YMM register from memory by a pure load instruction (such as [V]MOVUPS), or was produced by a vector-integer or logical instruction. The penalty occurs only once per new denormal value in a sequence of floating-point instructions. A similar penalty does not occur when the floating-point compute instruction is in load-op form and the memory operand is denormal, for example [V]ADDPS xmm0,[mem] where [mem] holds a denormal value. If a compiler can determine that a memory input to a floating-point sequence is denormal, it can avoid this pre-computation penalty by using a sequence such as XORPS xmm0,xmm0; ADDPS xmm0,[mem] instead of MOVUPS xmm0,[mem].

Vector ALU and logical instructions also incur a pre-computation penalty if they encounter a denormal input that was produced by a floating-point instruction.

If software does not require the precision that denormals provide, it can set MXCSR.DAZ (bit 6). Any denormal input is then treated as zero without a pre-computation penalty.

Post-computation penalties occur when a floating-point compute instruction produces a denormal result and both the precision exception and the underflow exception are masked in the MXCSR (that is, both bit 11, Underflow Mask, and bit 12, Precision Mask, are set). If software does not require the precision that denormals provide, it can set MXCSR.FTZ (bit 15). Any denormal output is then converted to zero without a post-computation penalty. Post-computation penalties generally cannot be eliminated by compilers.

If denormal precision is not required, it is recommended that software set both MXCSR.DAZ and MXCSR.FTZ. Note that setting MXCSR.DAZ or MXCSR.FTZ causes the processor to produce results that are not compliant with the IEEE-754 standard when operating on or producing denormal values.
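As an illustration of the MXCSR.DAZ and MXCSR.FTZ recommendation above, the following is a minimal C sketch that sets both bits through the SSE intrinsics `_mm_getcsr` and `_mm_setcsr`. The helper names (`enable_daz_ftz`, `restore_mxcsr`, `mul_runtime`) and the two macros are hypothetical and are not taken from the guide. The sketch assumes an x86-64 build, where scalar float arithmetic is compiled to SSE instructions, and a processor that supports DAZ.

```c
#include <float.h>      /* FLT_MIN */
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr */

/* MXCSR bit positions as stated in the text: DAZ is bit 6, FTZ is bit 15. */
#define MXCSR_DAZ (1u << 6)   /* treat denormal inputs as zero   */
#define MXCSR_FTZ (1u << 15)  /* flush denormal results to zero  */

/* Hypothetical helper: set DAZ and FTZ for the calling thread and
   return the previous MXCSR value so the caller can restore it. */
static unsigned int enable_daz_ftz(void)
{
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(saved | MXCSR_DAZ | MXCSR_FTZ);
    return saved;
}

/* Hypothetical helper: put back an MXCSR value saved earlier. */
static void restore_mxcsr(unsigned int saved)
{
    _mm_setcsr(saved);
}

/* Hypothetical helper: multiply at run time. The volatile copies keep
   the compiler from folding the product into a constant, so the
   result reflects the MXCSR setting in effect when this is called. */
static float mul_runtime(float a, float b)
{
    volatile float va = a;
    volatile float vb = b;
    return va * vb;
}
```

A caller might bracket a hot loop with `unsigned int saved = enable_daz_ftz();` and `restore_mxcsr(saved);`. With both bits set, `mul_runtime(FLT_MIN, 0.5f)` is expected to return zero instead of a denormal, which is the non-IEEE-754 behavior the text warns about. MXCSR is per-thread state, so each thread that runs the floating-point code needs its own call.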
For x87 instructions both pre-computation and post-computation penalties are incurred when denormals are encountered. A pre-computation penalty is incurred when loading denormal values from memory onto the x87 floating-point stack. A post-computation penalty is incurred when a floating-point compute instruction produces a denormal result and both the precision exception and underflow exception are masked in the x87 floating-point control word (FCW). The x87 FCW does not provide functionality equivalent to MXCSR.DAZ or MXCSR.FTZ, so it is not possible to avoid these denormal penalties when using x87 instructions that encounter or produce denormal values. Programs that call x87 floating-point routines that internally produce denormal values will potentially incur this penalty as well. To completely avoid this penalty, ensure that programs written using legacy x87 instructions do not produce denormal values.

2.11 XMM Register Merge Optimization

The AMD Family 16h processor implements an XMM register merge optimization. The processor keeps track of XMM registers whose upper portions have been cleared to zeros. This information can be followed through multiple operations and register destinations until non-zero data is written into a register. For certain instructions, this information can be used to bypass the usual result merging for the upper parts of the register.

For instance, SQRTSS does not change the upper 96 bits of the destination register. If some instruction clears the upper 96 bits of its destination register and any arbitrary following sequence of instructions fails to write non-zero data in these upper 96 bits, then the SQRTSS instruction can proceed without waiting for any instructions that wrote to that destination register.

The instructions that benefit from this merge optimization are:

• CVTPI2PS
• CVTSI2SS (32-/64-bit)
• MOVSS xmm1,xmm2
• CVTSD2SS

22 Microarchitecture of the Family 16h Processor Chapter 2
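To make the merge optimization of section 2.11 more concrete, here is a small C sketch that converts an integer with CVTSI2SS into a register whose upper 96 bits have just been cleared. The helper name `int_to_ss_zero_upper` is hypothetical. The instructions named in the comments are what compilers typically emit for these intrinsics, not something the guide guarantees, and whether a given zeroing instruction is tracked by the processor as "cleared" is an assumption here.

```c
#include <xmmintrin.h>  /* _mm_setzero_ps, _mm_cvtsi32_ss, _mm_storeu_ps */

/* Hypothetical helper: convert an int to a scalar float in lane 0 of a
   register whose other three lanes are known to be zero. */
static __m128 int_to_ss_zero_upper(int value)
{
    /* Clear all 128 bits first (commonly compiled to XORPS xmm,xmm). */
    __m128 zeroed = _mm_setzero_ps();

    /* CVTSI2SS writes lane 0 and merges lanes 1..3 from its first
       operand, which are all zero in this case. */
    return _mm_cvtsi32_ss(zeroed, value);
}
```

The result can be inspected with `_mm_storeu_ps` into a four-element float array: lane 0 holds the converted value and lanes 1 to 3 hold zero. The point of the pattern is that the conversion merges into a register the processor can already know to be zero in its upper bits, which is the situation the text describes for SQRTSS.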

