AMD OS1354WBJ4BGHBOX Optimization Guide - Page 22

if the denormal value was loaded into an XMM or YMM register from memory by a pure load instruction (such as [V]MOVUPS), or was produced by a vector-integer or logical instruction. The penalty will only occur once per new denormal value in a sequence of floating-point instructions.
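
To make "denormal value" concrete, the following is a minimal sketch of how software might detect single-precision denormals in a buffer before handing it to a floating-point sequence. It assumes `float` is an IEEE-754 binary32 value (exponent field zero, fraction non-zero means denormal). The helper names `f32_from_bits`, `is_denormal_f32`, and `count_denormals` are hypothetical and do not come from the guide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build a float from its raw binary32 bit pattern. */
static inline float f32_from_bits(uint32_t bits) {
    float x;
    assert(sizeof x == sizeof bits);   /* assumption: float is 32 bits wide */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* True if x is a denormal: exponent field is 0 and fraction is non-zero. */
static inline bool is_denormal_f32(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t fraction = bits & 0x007FFFFFu;
    return exponent == 0 && fraction != 0;
}

/* Counts the denormal elements in data[0..n-1] using integer tests only,
   so the scan itself performs no floating-point arithmetic. */
static inline size_t count_denormals(const float *data, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (is_denormal_f32(data[i])) {
            ++count;
        }
    }
    return count;
}
```

A scan like this could be run once on input data during profiling to judge whether the denormal penalties described here are likely to matter for a given workload.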

A similar penalty does not occur when the floating-point compute instruction is in load-op form and the memory operand is denormal, for example on [V]ADDPS xmm0,[mem] where [mem] is a denormal value.

If a compiler can determine that a memory input to a floating-point sequence is denormal, it can avoid this pre-computation penalty using a sequence such as: XORPS xmm0,xmm0; ADDPS xmm0,[mem] instead of MOVUPS xmm0,[mem].

Vector ALU and logical instructions will also incur a pre-computation penalty if they encounter a denormal input that was produced by a floating-point instruction.

If software does not require the precision that denormals provide, it can set MXCSR.DAZ (bit 6). Any denormal input will then be treated as a zero without a pre-computation penalty.

Post-computation penalties occur when a floating-point compute instruction produces a denormal result and both the precision exception and the underflow exception are masked in the MXCSR (that is, both bit 11, the Underflow Mask, and bit 12, the Precision Mask, are set).

If software does not require the precision that denormals provide, it can set MXCSR.FTZ (bit 15). Any denormal output will then be converted to zero without a post-computation penalty.
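
The condition for a post-computation penalty can be written as a small predicate over an MXCSR value. This is only a sketch of the rule as stated in the text: both exception masks set and FTZ clear. The function name `post_penalty_possible` is hypothetical; the bit positions follow the architectural MXCSR layout (underflow mask at bit 11, precision mask at bit 12, FTZ at bit 15).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MXCSR_UM  (1u << 11)   /* underflow exception mask */
#define MXCSR_PM  (1u << 12)   /* precision exception mask */
#define MXCSR_FTZ (1u << 15)   /* flush-to-zero */

/* True if, for this MXCSR value, an SSE/AVX compute instruction that
   produces a denormal result could incur the post-computation penalty:
   both exceptions are masked and results are not flushed to zero. */
static inline bool post_penalty_possible(uint32_t mxcsr) {
    assert((mxcsr >> 16) == 0);   /* upper 16 bits of MXCSR are reserved */
    const uint32_t masks = MXCSR_UM | MXCSR_PM;
    bool both_masked = (mxcsr & masks) == masks;
    bool flush_to_zero = (mxcsr & MXCSR_FTZ) != 0;
    return both_masked && !flush_to_zero;
}
```

With the power-on default MXCSR value of 0x1F80 (all exceptions masked, DAZ and FTZ clear), the predicate returns true, which is why the penalty applies to software that never touches the MXCSR.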

Post-computation penalties generally cannot be eliminated by compilers. If denormal precision is not required, it is recommended that software set both MXCSR.DAZ and MXCSR.FTZ.
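
One possible way to follow this recommendation from C is sketched below, using the `_mm_getcsr` and `_mm_setcsr` intrinsics from `<xmmintrin.h>`. The helper names (`mxcsr_with_daz_ftz`, `enable_daz_ftz`, `restore_mxcsr`) and the `HAVE_MXCSR` macro are hypothetical. The sketch assumes an x86 build on a CPU that supports DAZ; on other targets the functions compile to no-ops so that only the bit arithmetic remains.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr */
#define HAVE_MXCSR 1
#else
#define HAVE_MXCSR 0     /* non-x86 build: only the bit arithmetic is usable */
#endif

#define MXCSR_DAZ (1u << 6)    /* denormal inputs are treated as zero */
#define MXCSR_FTZ (1u << 15)   /* denormal results are flushed to zero */

/* Pure helper: the given MXCSR value with DAZ and FTZ set,
   all other bits unchanged. */
static inline uint32_t mxcsr_with_daz_ftz(uint32_t mxcsr) {
    return mxcsr | MXCSR_DAZ | MXCSR_FTZ;
}

/* Sets DAZ and FTZ and returns the previous MXCSR value so the caller
   can restore it later. Returns 0 and does nothing on non-x86 builds. */
static inline uint32_t enable_daz_ftz(void) {
#if HAVE_MXCSR
    uint32_t old = _mm_getcsr();
    _mm_setcsr(mxcsr_with_daz_ftz(old));
    assert((_mm_getcsr() & (MXCSR_DAZ | MXCSR_FTZ)) == (MXCSR_DAZ | MXCSR_FTZ));
    return old;
#else
    return 0;
#endif
}

/* Restores an MXCSR value previously returned by enable_daz_ftz(). */
static inline void restore_mxcsr(uint32_t saved) {
#if HAVE_MXCSR
    _mm_setcsr(saved);
#else
    (void)saved;
#endif
}
```

A typical use would be to call `enable_daz_ftz()` at the start of a compute-heavy region and `restore_mxcsr()` afterwards. The MXCSR is part of each thread's register state, so the setting has to be applied in every thread that runs the affected code.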

Note that setting MXCSR.DAZ or MXCSR.FTZ will cause the processor to produce results that are not compliant with the IEEE-754 standard when operating on or producing denormal values.

For x87 instructions, both pre-computation and post-computation penalties are incurred when denormals are encountered. A pre-computation penalty is incurred when loading denormal values from memory onto the x87 floating-point stack. A post-computation penalty is incurred when a floating-point compute instruction produces a denormal result and both the precision exception and underflow exception are masked in the x87 floating-point control word (FCW).

The x87 FCW does not provide functionality equivalent to MXCSR.DAZ or MXCSR.FTZ, so it is not possible to avoid these denormal penalties when using x87 instructions that encounter or produce denormal values. Programs that call x87 floating-point routines that internally produce denormal values will potentially incur this penalty as well. To completely avoid this penalty, ensure that programs written using legacy x87 instructions do not produce denormal values.

2.11 XMM Register Merge Optimization

The AMD Family 16h processor implements an XMM register merge optimization. The processor keeps track of XMM registers whose upper portions have been cleared to zeros. This information can be followed through multiple operations and register destinations until non-zero data is written into a register. For certain instructions, this information can be used to bypass the usual result merging for the upper parts of the register.

For instance, SQRTSS does not change the upper 96 bits of the destination register. If some instruction clears the upper 96 bits of its destination register and any arbitrary following sequence of instructions fails to write non-zero data in these upper 96 bits, then the SQRTSS instruction can proceed without waiting for any instructions that wrote to that destination register.

The instructions that benefit from this merge optimization are:

• CVTPI2PS
• CVTSI2SS (32-/64-bit)
• MOVSS xmm1,xmm2
• CVTSD2SS
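
Why a known-zero upper part removes the need to merge can be illustrated with a toy software model of one of the listed instructions, CVTSI2SS. This is not a description of the hardware mechanism, only a sketch of the architectural effect: the instruction writes the low 32 bits and preserves the upper 96 bits of the destination. The type `xmm_model_t` and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit XMM register as four 32-bit lanes. */
typedef struct { float lane[4]; } xmm_model_t;

/* Architectural effect of CVTSI2SS dst, v: lane 0 receives the converted
   integer, lanes 1..3 (the upper 96 bits) keep the old contents of dst.
   The result therefore depends on whatever last wrote dst. */
static inline xmm_model_t cvtsi2ss_model(xmm_model_t dst, int32_t v) {
    dst.lane[0] = (float)v;
    return dst;
}

/* True if the upper 96 bits of r are all zero bits. */
static inline bool upper_96_zero(const xmm_model_t *r) {
    uint32_t bits[4];
    assert(sizeof bits == sizeof r->lane);   /* assumption: 32-bit float */
    memcpy(bits, r->lane, sizeof bits);
    return (bits[1] | bits[2] | bits[3]) == 0;
}

/* Result that can be formed without reading the old destination at all.
   It matches cvtsi2ss_model() whenever the upper 96 bits of the old
   destination are known to be zero. */
static inline xmm_model_t cvtsi2ss_no_merge(int32_t v) {
    xmm_model_t out = { { (float)v, 0.0f, 0.0f, 0.0f } };
    return out;
}
```

In this model, when `upper_96_zero()` holds for the old destination, `cvtsi2ss_model()` and `cvtsi2ss_no_merge()` give bit-identical results, so nothing from the old register value is needed. In practice this suggests clearing a register (for example with XORPS xmm0,xmm0) before using it as the destination of one of the listed instructions, though how much this helps depends on the surrounding code.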

Software Optimization Guide for AMD Family 16h Processors
52128 Rev. 1.1 March 2013
22 Microarchitecture of the Family 16h Processor Chapter 2

Section	Page
Contents	3
List of Figures	4
List of Tables	5
Revision History	6
1 Preface	7
2 Microarchitecture of the Family 16h Processor	8
2.1 Features	8
2.2 Instruction Decomposition	10
2.3 Superscalar Organization	10
2.4 Processor Block Diagram	11
2.5 Processor Cache Operations	11
2.5.1 L1 Instruction Cache	12
2.5.2 L1 Data Cache	12
2.5.3 L2 Cache	12
2.6 Memory Address Translation	13
2.6.1 L1 Translation Lookaside Buffers	13
2.6.2 L2 Translation Lookaside Buffers	13
2.6.3 Hardware Page Table Walker	13
2.7 Optimizing Branching	13
2.7.1 Branch Prediction	13
2.7.1.1 Next Address Logic	14
2.7.1.2 Branch Target Buffer	14
2.7.1.3 Branch Target Address Calculator	14
2.7.1.4 Out-of-Page Target Array	15
2.7.1.5 Branch Marker Caching	15
2.7.1.6 Return Address Stack	15
2.7.1.7 Indirect Target Predictor	16
2.7.1.8 Conditional Branch Predictor	16
2.7.1.9 Fetch Window Tracking Structure	16
2.7.2 Loop Alignment	16
2.7.2.1 Encoding Padding for Loop Alignment	16
2.7.2.2 Aligning Loops to Reduce Power Consumption	17
2.8 Instruction Fetch and Decode	18
2.9 Integer Unit	18
2.9.1 Integer Schedulers	18
2.9.2 Integer Execution Units	18
2.9.3 Retire Control Unit	19
2.10 Floating-Point Unit	19
2.10.1 Denormals	21
2.11 XMM Register Merge Optimization	22
2.12 Load Store Unit	23
Appendix A Instruction Latencies	24
A.1 Instruction Latency Assumptions	24
