AMD OS1354WBJ4BGHBOX Optimization Guide - Page 21

the floating-point multiply unit (FPM). The floating-point scheduler can issue one micro-op to one unit per pipe per cycle and provides logic to prevent pipeline hazards like resource contention on the result bus.

The second organizational dimension for execution units is forwarding domains. The FPU is divided into three clusters, and forwarding between clusters requires an extra cycle in the bypass network.

The three clusters are the Floating-point Cluster (composed of the FPM and FPA units), the Integer Cluster (composed of the VALU0, VALU1, and VIMUL units), and the Store/Convert Cluster (STC).

When the result of an instruction executing in one domain is consumed as input by a subsequent instruction executing in a different domain, there is a one-cycle forwarding delay. This delay does not increase the time that either instruction occupies the execution units, but the scheduler will not attempt to schedule the second instruction earlier.

Most FPU instructions support local forwarding, which eliminates this delay when the consuming instruction executes in the same domain. However, some instructions (marked with the note "local forwarding disabled" in the latency spreadsheet) do not support local forwarding and experience the forwarding delay even when the consuming instruction executes in the same domain.
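The forwarding rules described above can be sketched as a small model. This is only an illustration under stated assumptions: the function names, the dictionary, and the idea of an "earliest issue cycle" are hypothetical helpers and not part of the guide, and the model ignores every other scheduling constraint.

```python
# Toy model of FPU forwarding domains (illustrative names, not from the guide).
# Unit-to-cluster mapping follows the three clusters listed in the text.
CLUSTER_OF_UNIT = {
    "FPM": "floating-point",
    "FPA": "floating-point",
    "VALU0": "integer",
    "VALU1": "integer",
    "VIMUL": "integer",
    "STC": "store/convert",
}


def forwarding_delay(producer_unit, consumer_unit, local_forwarding=True):
    """Return the extra bypass cycles (0 or 1) between producer and consumer.

    local_forwarding=False models a producer marked
    "local forwarding disabled" in the latency spreadsheet.
    """
    same_cluster = CLUSTER_OF_UNIT[producer_unit] == CLUSTER_OF_UNIT[consumer_unit]
    if same_cluster and local_forwarding:
        return 0
    return 1


def earliest_consumer_issue(producer_issue_cycle, producer_latency,
                            producer_unit, consumer_unit,
                            local_forwarding=True):
    """Earliest cycle at which a dependent instruction could issue (assumed model)."""
    delay = forwarding_delay(producer_unit, consumer_unit, local_forwarding)
    return producer_issue_cycle + producer_latency + delay


if __name__ == "__main__":
    # A 3-cycle add on FPA feeding the Store/Convert cluster crosses domains.
    print(earliest_consumer_issue(0, 3, "FPA", "STC"))
    # The same add feeding a multiply on FPM stays inside one cluster.
    print(earliest_consumer_issue(0, 3, "FPA", "FPM"))
```

In this model the cross-domain case costs one cycle more than the same-domain case, which matches the one-cycle forwarding delay the text describes.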

The following table summarizes the majority of instruction latencies in the FPU.

Table 2. Summary of Floating-point Instruction Latencies

Instruction Class                                 Latency    Throughput  Execution Pipe  Unit(s)        Cluster
SIMD ALU (most)                                   1          2/cycle     Either          VALU0, VALU1   Integer
Floating-point logical                            1          2/cycle     Either          FPA, FPM       Floating-pt
SIMD IMUL                                         2          1/cycle     Pipe 0          VIM            Integer
Floating-point multiply single-precision          2          1/cycle     Pipe 1          FPM            Floating-pt
Floating-point add                                3          1/cycle     Pipe 0          FPA            Floating-pt
Store/Convert (many)                              3          1/cycle     Pipe 1          Store/Convert  STC
Floating-point multiply double-precision          4          1/2 cycles  Pipe 1          FPM            Floating-pt
Floating-point multiply extended-precision (x87)  5          1/3 cycles  Pipe 1          FPM            Floating-pt
Floating-point DIV/SQRT                           Iterative  Iterative   Pipe 1          FPM            Floating-pt

Refer to the AMD64_16h_InstrLatency.xlsx spreadsheet described in Appendix A for more instruction latency and throughput information.
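As a rough illustration of how the latency and throughput columns of Table 2 combine, the sketch below estimates cycle counts for a run of instructions of one class. The dictionary keys and function names are hypothetical, and the formulas are the usual simplified pipelining estimates (no other instructions competing for the pipe, no forwarding delays, no memory effects), not a statement of how the hardware actually schedules.

```python
from fractions import Fraction

# (latency in cycles, cycles between issues) per instruction class, from Table 2.
# "2/cycle" becomes 1/2 cycle per instruction; "1/2 cycles" becomes 2 cycles.
# DIV/SQRT is omitted because the table lists it only as "Iterative".
FP_CLASSES = {
    "simd_alu": (1, Fraction(1, 2)),
    "fp_logical": (1, Fraction(1, 2)),
    "simd_imul": (2, Fraction(1)),
    "fp_mul_single": (2, Fraction(1)),
    "fp_add": (3, Fraction(1)),
    "store_convert": (3, Fraction(1)),
    "fp_mul_double": (4, Fraction(2)),
    "fp_mul_extended": (5, Fraction(3)),
}


def independent_cycles(instr_class, count):
    """Estimate cycles for `count` independent instructions (throughput-bound)."""
    latency, issue_interval = FP_CLASSES[instr_class]
    return latency + (count - 1) * issue_interval


def dependent_chain_cycles(instr_class, count):
    """Estimate cycles for `count` instructions that each need the previous result."""
    latency, _ = FP_CLASSES[instr_class]
    return latency * count


if __name__ == "__main__":
    # Eight double-precision multiplies: independent versus fully dependent.
    print(independent_cycles("fp_mul_double", 8))
    print(dependent_chain_cycles("fp_mul_double", 8))
```

The gap between the two estimates shows why breaking long dependency chains into independent streams matters more for high-latency classes.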

2.10.1 Denormals

Denormal floating-point values (also called subnormals) can be created by a program either by explicitly specifying a denormal value in the source code or by calculations on normal floating-point values. A significant performance cost (more than 100 processor cycles) may be incurred when these values are encountered.
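The two ways of creating a denormal mentioned above can be shown with ordinary double-precision values. This sketch assumes Python floats are IEEE 754 doubles with gradual underflow enabled, which is the case on mainstream platforms; the helper name is hypothetical. It only shows how such values arise and says nothing about the penalty itself.

```python
import sys

# Smallest positive normal double; anything nonzero below it is subnormal.
SMALLEST_NORMAL = sys.float_info.min


def is_subnormal(x):
    """True when x is a nonzero value smaller in magnitude than the smallest normal."""
    return x != 0.0 and abs(x) < SMALLEST_NORMAL


# Case 1: a denormal written explicitly in the source code.
explicit = 5e-324

# Case 2: a denormal produced by a calculation on normal values.
computed = SMALLEST_NORMAL / 2.0

if __name__ == "__main__":
    print(is_subnormal(explicit), is_subnormal(computed))
    print(is_subnormal(SMALLEST_NORMAL), is_subnormal(0.0))
```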

For SSE/AVX instructions, the denormal penalties are a function of the configuration of MXCSR and the instruction sequences that are executed in the presence of a denormal value. Denormal penalties may occur in two phases: usage of a denormal in a computation (pre-computation penalty), and production of a denormal during the execution of an instruction (post-computation penalty).
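The two phases can be told apart by looking at the inputs and the result of a single operation. The sketch below is a classification aid only, with hypothetical function names; it assumes IEEE 754 doubles and does not model MXCSR settings or the actual penalty in cycles.

```python
import operator
import sys


def is_subnormal(x):
    """True when x is nonzero and smaller in magnitude than the smallest normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min


def denormal_phases(op, a, b):
    """Return (pre, post, result) for a two-operand floating-point operation.

    pre:  a denormal is consumed as an input (pre-computation phase).
    post: a denormal is produced as the result (post-computation phase).
    """
    pre = is_subnormal(a) or is_subnormal(b)
    result = op(a, b)
    post = is_subnormal(result)
    return pre, post, result


if __name__ == "__main__":
    # Denormal input, normal result: pre-computation phase only.
    print(denormal_phases(operator.add, 5e-324, 1.0)[:2])
    # Normal inputs, denormal result: post-computation phase only.
    print(denormal_phases(operator.mul, 1e-200, 1e-120)[:2])
```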

A sequence of floating-point compute instructions may incur a pre-computation penalty when a denormal value is encountered as an input. This penalty occurs on a floating-point computation instruction, such as [V]ADDPS,

52128 Rev. 1.1 March 2013    Software Optimization Guide for AMD Family 16h Processors
Chapter 2    Microarchitecture of the Family 16h Processor    21

Section	Page
Contents	3
List of Figures	4
List of Tables	5
Revision History	6
1 Preface	7
2 Microarchitecture of the Family 16h Processor	8
2.1 Features	8
2.2 Instruction Decomposition	10
2.3 Superscalar Organization	10
2.4 Processor Block Diagram	11
2.5 Processor Cache Operations	11
2.5.1 L1 Instruction Cache	12
2.5.2 L1 Data Cache	12
2.5.3 L2 Cache	12
2.6 Memory Address Translation	13
2.6.1 L1 Translation Lookaside Buffers	13
2.6.2 L2 Translation Lookaside Buffers	13
2.6.3 Hardware Page Table Walker	13
2.7 Optimizing Branching	13
2.7.1 Branch Prediction	13
2.7.1.1 Next Address Logic	14
2.7.1.2 Branch Target Buffer	14
2.7.1.3 Branch Target Address Calculator	14
2.7.1.4 Out-of-Page Target Array	15
2.7.1.5 Branch Marker Caching	15
2.7.1.6 Return Address Stack	15
2.7.1.7 Indirect Target Predictor	16
2.7.1.8 Conditional Branch Predictor	16
2.7.1.9 Fetch Window Tracking Structure	16
2.7.2 Loop Alignment	16
2.7.2.1 Encoding Padding for Loop Alignment	16
2.7.2.2 Aligning Loops to Reduce Power Consumption	17
2.8 Instruction Fetch and Decode	18
2.9 Integer Unit	18
2.9.1 Integer Schedulers	18
2.9.2 Integer Execution Units	18
2.9.3 Retire Control Unit	19
2.10 Floating-Point Unit	19
2.10.1 Denormals	21
2.11 XMM Register Merge Optimization	22
2.12 Load Store Unit	23
Appendix A Instruction Latencies	24
A.1 Instruction Latency Assumptions	24
