AMD OS1354WBJ4BGHBOX Optimization Guide - Page 19

Figure 2. Integer Schedulers and Execution Units

All integer operations can be handled in the ALUs (ALU0 and ALU1 are fully symmetrical), with the exception of integer multiply, integer divide, and three-operand LEA instructions. While two-operand LEA instructions are mapped as a single-cycle micro-op in the ALUs, three-operand LEA instructions are mapped to the store AGU and have a 2-cycle latency, with results inserted back into the ALU1 pipeline.
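
The LEA mapping described above can be summarized as a small lookup. This is an illustrative sketch only: the name `lea_cost` and the parameter `num_address_operands` are hypothetical, and reading "operand count" as the number of address components the LEA combines is an assumption, not something this page defines.

```python
from typing import NamedTuple


class LeaCost(NamedTuple):
    unit: str     # execution resource the micro-op is mapped to
    latency: int  # cycles until the result is available


def lea_cost(num_address_operands: int) -> LeaCost:
    """Return the mapping this section gives for an LEA form.

    num_address_operands is a hypothetical parameter: how many address
    components the LEA combines. Only the two forms described in this
    section are covered.
    """
    if num_address_operands == 2:
        # Two-operand LEA: single-cycle micro-op in either ALU.
        return LeaCost(unit="ALU0/ALU1", latency=1)
    if num_address_operands == 3:
        # Three-operand LEA: store AGU, 2-cycle latency,
        # result inserted back into the ALU1 pipeline.
        return LeaCost(unit="store AGU", latency=2)
    raise ValueError("only two- and three-operand LEA forms are described here")
```

In a latency-bound dependency chain, this suggests that a two-operand LEA costs half as many cycles as a three-operand one, so it may be worth checking whether a three-operand form can be avoided on the critical path.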

The integer multiply unit can handle multiplies of up to 32 bits × 32 bits with a 3-cycle latency, fully pipelined. 64-bit × 64-bit multiplies require data pumping and have a 6-cycle latency with a throughput rate of 1 every 4 cycles. If the multiply instruction has 2 destination registers, an additional cycle of latency and a one-cycle reduction in throughput are incurred. The radix-4 hardware integer divider unit can compute 2 bits of results per cycle.
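
The multiply and divide figures above can be turned into a rough cost estimate. The sketch below is a simplified model, not a description of the hardware: the function names are hypothetical, the "one-cycle reduction in throughput" is read as one extra cycle between issues (an assumption), and any fixed setup or completion overhead of the divider, which this section does not state, is left out.

```python
import math
from typing import Tuple


def imul_timing(operand_bits: int, two_destinations: bool = False) -> Tuple[int, int]:
    """Return (latency_cycles, cycles_between_issues) for an integer multiply."""
    if operand_bits <= 32:
        latency, interval = 3, 1  # fully pipelined: one multiply per cycle
    elif operand_bits <= 64:
        latency, interval = 6, 4  # data-pumped: one multiply every 4 cycles
    else:
        raise ValueError("operand width not covered by this section")
    if two_destinations:
        # Assumed reading: one extra cycle of latency and one extra
        # cycle between successive multiplies.
        latency += 1
        interval += 1
    return latency, interval


def divider_iteration_cycles(quotient_bits: int) -> int:
    """Cycles a radix-4 divider spends iterating at 2 result bits per cycle.

    Fixed overhead is not given in this section and is not modelled.
    """
    return math.ceil(quotient_bits / 2)
```

Under these assumptions, a 64-bit multiply that writes two destination registers is modelled as `(7, 5)`, and a 32-bit quotient needs 16 iteration cycles in the divider.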

2.9.3 Retire Control Unit

The retire control unit (RCU) tracks the completion status of all outstanding operations (integer, load/store, and floating-point) and is the final arbiter for exception processing and recovery. The unit can receive up to 2 macro-ops dispatched per cycle and track up to 64 macro-ops in-flight. A macro-op is eligible to be committed by the retire unit when all corresponding micro-ops have finished execution. For most cases of fastpath double macro-ops (like when an AVX 256-bit instruction is broken into two 128-bit macro-ops), it is further required that both macro-ops have finished execution before commitment can occur. The retire unit handles in-order commit of up to two macro-ops per cycle.
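
The commit rules above (in order, at most two macro-ops per cycle, both halves of a fastpath double finished first) can be illustrated with a toy model. Everything here is a sketch under stated assumptions: `commit_cycles`, `done_cycle`, and `pair_id` are hypothetical names, a macro-op is assumed able to commit in the same cycle it finishes, and the 64-entry in-flight limit and dispatch bandwidth are not modelled.

```python
from typing import Dict, List, Optional, Sequence, Tuple

COMMIT_WIDTH = 2  # in-order commit of up to two macro-ops per cycle


def commit_cycles(ops: Sequence[Tuple[int, Optional[int]]]) -> List[int]:
    """Compute the commit cycle of each macro-op, given in program order.

    Each entry is (done_cycle, pair_id): done_cycle is the cycle in which
    all micro-ops of the macro-op have finished; pair_id links the two
    halves of a fastpath double (None for an ordinary macro-op).
    """
    # A fastpath double is eligible only once both halves have finished.
    pair_done: Dict[int, int] = {}
    for done, pair in ops:
        if pair is not None:
            pair_done[pair] = max(pair_done.get(pair, 0), done)

    result: List[int] = []
    cycle, used = 0, 0
    for done, pair in ops:
        ready = pair_done[pair] if pair is not None else done
        if ready > cycle:
            cycle, used = ready, 0      # stall until the oldest macro-op is done
        elif used == COMMIT_WIDTH:
            cycle, used = cycle + 1, 0  # both commit slots of this cycle are taken
        result.append(cycle)
        used += 1
    return result
```

For example, `commit_cycles([(3, None), (1, None), (1, None), (5, 0), (2, 0)])` returns `[3, 3, 4, 5, 5]`: the two early finishers wait behind the older macro-op, and the second half of the pair cannot commit before the first half has finished at cycle 5.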

The retire control unit also manages internal integer register mapping and renaming. The integer physical register file (PRF) consists of 64 registers, with between 20 and 31 mapped to architectural state or micro-architectural temporary state. The remaining 44 to 33 registers are available for out-of-order renames. Generally, physical register renames are needed for instructions that write to an integer register destination (for example, ADD), but not for those instructions that only write flags (for example, CMP) or perform stores to memory.
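
The register budget above is simple arithmetic, shown here as a sketch. The helper names and the `(mnemonic, writes_integer_destination)` tuple layout are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Tuple

PRF_SIZE = 64  # integer physical registers


def rename_registers_available(mapped_to_state: int) -> int:
    """Physical registers left for out-of-order renames.

    mapped_to_state is the number of registers holding architectural or
    micro-architectural temporary state (20 to 31 according to the text).
    """
    if not 20 <= mapped_to_state <= 31:
        raise ValueError("the text gives a range of 20 to 31 mapped registers")
    return PRF_SIZE - mapped_to_state


def renames_needed(instructions: Iterable[Tuple[str, bool]]) -> int:
    """Count instructions that consume a rename register.

    Each entry is (mnemonic, writes_integer_destination). Flag-only writers
    and stores to memory are marked False and do not consume a rename.
    """
    return sum(1 for _, writes_int_dest in instructions if writes_int_dest)
```

With these helpers, `rename_registers_available(20)` gives 44 and `rename_registers_available(31)` gives 33, matching the range in the text. A sequence such as `[("ADD", True), ("CMP", False), ("MOV to memory", False)]` needs a single rename.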

2.10 Floating-Point Unit

The AMD Family 16h processor provides native support for 32-bit single precision, 64-bit double precision, and 80-bit extended precision primary floating-point data types as well as 128-bit packed single and double precision vector floating-point data types. The 256-bit packed single and double precision vector floating-point data types are fully supported through the use of two 128-bit macro-ops per instruction. The floating-point load and store

52128 Rev. 1.1 March 2013
Software Optimization Guide for AMD Family 16h Processors
Chapter 2 Microarchitecture of the Family 16h Processor 19

Section	Page
Contents	3
List of Figures	4
List of Tables	5
Revision History	6
1 Preface	7
2 Microarchitecture of the Family 16h Processor	8
2.1 Features	8
2.2 Instruction Decomposition	10
2.3 Superscalar Organization	10
2.4 Processor Block Diagram	11
2.5 Processor Cache Operations	11
2.5.1 L1 Instruction Cache	12
2.5.2 L1 Data Cache	12
2.5.3 L2 Cache	12
2.6 Memory Address Translation	13
2.6.1 L1 Translation Lookaside Buffers	13
2.6.2 L2 Translation Lookaside Buffers	13
2.6.3 Hardware Page Table Walker	13
2.7 Optimizing Branching	13
2.7.1 Branch Prediction	13
2.7.1.1 Next Address Logic	14
2.7.1.2 Branch Target Buffer	14
2.7.1.3 Branch Target Address Calculator	14
2.7.1.4 Out-of-Page Target Array	15
2.7.1.5 Branch Marker Caching	15
2.7.1.6 Return Address Stack	15
2.7.1.7 Indirect Target Predictor	16
2.7.1.8 Conditional Branch Predictor	16
2.7.1.9 Fetch Window Tracking Structure	16
2.7.2 Loop Alignment	16
2.7.2.1 Encoding Padding for Loop Alignment	16
2.7.2.2 Aligning Loops to Reduce Power Consumption	17
2.8 Instruction Fetch and Decode	18
2.9 Integer Unit	18
2.9.1 Integer Schedulers	18
2.9.2 Integer Execution Units	18
2.9.3 Retire Control Unit	19
2.10 Floating-Point Unit	19
2.10.1 Denormals	21
2.11 XMM Register Merge Optimization	22
2.12 Load Store Unit	23
Appendix A Instruction Latencies	24
A.1 Instruction Latency Assumptions	24
