AMD OS1354WBJ4BGHBOX Optimization Guide - Page 20

paths are 128 bits wide. As a result, the maximum throughput of both single-precision and double-precision floating-point SSE vector operations has improved by a factor of two over the AMD Family 14h processor.
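The factor of two can be illustrated with simple width arithmetic. The sketch below is only a model, not a measurement: the 128-bit width comes from the text, while the 64-bit comparison width is an assumption chosen to be consistent with the stated factor of two (the text does not give the Family 14h data-path width). The function names are hypothetical.

```python
DATAPATH_BITS = 128  # data-path width stated in the text


def lanes(element_bits, width_bits=DATAPATH_BITS):
    """Number of elements of the given size that fit in one data-path pass."""
    return width_bits // element_bits


def passes_per_vector(vector_bits, width_bits):
    """Passes needed to move one vector operand through a data path.

    Uses integer ceiling division so no floating-point rounding is involved.
    """
    return -(-vector_bits // width_bits)


if __name__ == "__main__":
    print("single-precision lanes per 128-bit pass:", lanes(32))
    print("double-precision lanes per 128-bit pass:", lanes(64))
    # ASSUMPTION: 64 bits is an illustrative narrower width, not a figure
    # taken from the text.
    print("passes for a 128-bit SSE operand, 64-bit path: ",
          passes_per_vector(128, 64))
    print("passes for a 128-bit SSE operand, 128-bit path:",
          passes_per_vector(128, 128))
```

Under this model a 128-bit SSE operand needs one pass instead of two, which is where a doubling of peak vector throughput would come from.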

The floating-point unit (FPU) utilizes a coprocessor model. As such, it contains its own scheduler, register files, and renamers and does not share them with the integer units. It can handle dispatch and renaming of 2 floating-point macro-ops per cycle, and the scheduler can issue 1 micro-op per cycle for each pipe. The floating-point scheduler has an 18-entry micro-op capacity.

The floating-point retire queue holds up to 44 floating-point micro-ops between dispatch and retire. Any macro-op that has a floating-point micro-op component, and that is dispatched into the integer retire control unit, will be held in the floating-point retire queue until the macro-op retires from the integer retire control unit. Thus a maximum of 44 macro-ops which have floating-point micro-op components can be in-flight in the 64-macro-op in-flight window that the integer retire control unit provides.
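The interaction between the two limits can be sketched as a small occupancy model. This is a simplification, not a description of the hardware: it assumes a steady instruction mix in which a fixed percentage of in-flight macro-ops have a floating-point component, and that each such macro-op occupies exactly one floating-point retire queue entry. The function name and the percentage parameter are hypothetical; only the constants 44 and 64 come from the text.

```python
FP_RETIRE_QUEUE = 44    # floating-point retire queue entries (from the text)
INT_RETIRE_WINDOW = 64  # macro-op window of the integer retire control unit


def max_in_flight(fp_percent):
    """Upper bound on total in-flight macro-ops for a given FP share.

    fp_percent is the integer percentage (0..100) of macro-ops that carry a
    floating-point micro-op component. ASSUMPTION: each of them takes one
    floating-point retire queue entry.
    """
    if not 0 <= fp_percent <= 100:
        raise ValueError("fp_percent must be between 0 and 100")
    if fp_percent == 0:
        return INT_RETIRE_WINDOW
    # N macro-ops in flight imply N * fp_percent / 100 FP entries, which must
    # not exceed the FP retire queue capacity.
    fp_limited = (FP_RETIRE_QUEUE * 100) // fp_percent
    return min(INT_RETIRE_WINDOW, fp_limited)


if __name__ == "__main__":
    for pct in (0, 50, 80, 100):
        print(pct, "% FP ->", max_in_flight(pct), "macro-ops in flight")
```

In this model the floating-point retire queue only becomes the limiting structure once more than about two thirds of the in-flight macro-ops have a floating-point component (44 of 64); below that, the 64-macro-op window is the bound.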

Figure 3. Floating-point Unit Block Diagram

The FPU contains a 128-bit floating-point multiply unit (FPM) and a 128-bit floating-point adder unit (FPA). The FPM contains two 76-bit × 27-bit multipliers, which means that double precision (64-bit) and extended precision (80-bit) floating-point multiplication computations require iteration. A few selected floating-point micro-ops, primarily logical/move/shuffle micro-ops, can execute in either the FPM or the FPA. The FPU also contains two 128-bit vector arithmetic/logical units (VALUs) which perform arithmetic and logical operations on AVX, SSE, and legacy MMX packed integer data, and a 128-bit integer multiply unit (VIMUL). The store/convert unit (STC) primarily handles stores (up to 128-bit operand size), floating-point/integer conversions, and integer/floating-point conversions. The register file and bypass network can also accept one 128-bit load per cycle from the load-store unit.
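Why double and extended precision need iteration can be seen from the significand widths. The following is a back-of-envelope estimate only: it assumes that each pass through a multiplier consumes up to 27 bits of one operand's significand while the 76-bit input holds the other operand whole. The text states only that iteration is required, not how many passes the hardware takes, so the pass counts here are illustrative. The significand widths (24, 53, and 64 bits) are the standard IEEE 754 single, IEEE 754 double, and x87 extended values; the function name is hypothetical.

```python
MULTIPLIER_NARROW_BITS = 27  # narrower multiplier input (from the text)

# Significand widths in bits, including the leading integer bit.
SIGNIFICAND_BITS = {
    "single": 24,    # IEEE 754 binary32
    "double": 53,    # IEEE 754 binary64
    "extended": 64,  # x87 80-bit extended precision
}


def estimated_passes(fmt):
    """Estimated multiplier passes for one multiplication in the given format.

    ASSUMPTION: one pass covers MULTIPLIER_NARROW_BITS bits of one operand.
    Integer ceiling division avoids any floating-point rounding.
    """
    bits = SIGNIFICAND_BITS[fmt]
    return -(-bits // MULTIPLIER_NARROW_BITS)


if __name__ == "__main__":
    for fmt in SIGNIFICAND_BITS:
        print(fmt, "precision: estimated passes =", estimated_passes(fmt))
```

The estimate agrees with the text in the one respect that matters: a 24-bit single-precision significand fits within the 27-bit input, while the 53-bit and 64-bit significands do not and therefore need more than one pass.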

There are two important organizational dimensions to understand with respect to the execution units. The first is the pipeline binding. Pipe 0 contains vector integer ALU 0 (VALU0), the vector integer multiplier (VIMUL), and the floating-point adder (FPA). Pipe 1 contains vector integer ALU 1 (VALU1), the store/convert unit, and

Software Optimization Guide for AMD Family 16h Processors, 52128 Rev. 1.1, March 2013
Page 20, Chapter 2: Microarchitecture of the Family 16h Processor

Section	Page
Contents	3
List of Figures	4
List of Tables	5
Revision History	6
1 Preface	7
2 Microarchitecture of the Family 16h Processor	8
2.1 Features	8
2.2 Instruction Decomposition	10
2.3 Superscalar Organization	10
2.4 Processor Block Diagram	11
2.5 Processor Cache Operations	11
2.5.1 L1 Instruction Cache	12
2.5.2 L1 Data Cache	12
2.5.3 L2 Cache	12
2.6 Memory Address Translation	13
2.6.1 L1 Translation Lookaside Buffers	13
2.6.2 L2 Translation Lookaside Buffers	13
2.6.3 Hardware Page Table Walker	13
2.7 Optimizing Branching	13
2.7.1 Branch Prediction	13
2.7.1.1 Next Address Logic	14
2.7.1.2 Branch Target Buffer	14
2.7.1.3 Branch Target Address Calculator	14
2.7.1.4 Out-of-Page Target Array	15
2.7.1.5 Branch Marker Caching	15
2.7.1.6 Return Address Stack	15
2.7.1.7 Indirect Target Predictor	16
2.7.1.8 Conditional Branch Predictor	16
2.7.1.9 Fetch Window Tracking Structure	16
2.7.2 Loop Alignment	16
2.7.2.1 Encoding Padding for Loop Alignment	16
2.7.2.2 Aligning Loops to Reduce Power Consumption	17
2.8 Instruction Fetch and Decode	18
2.9 Integer Unit	18
2.9.1 Integer Schedulers	18
2.9.2 Integer Execution Units	18
2.9.3 Retire Control Unit	19
2.10 Floating-Point Unit	19
2.10.1 Denormals	21
2.11 XMM Register Merge Optimization	22
2.12 Load Store Unit	23
Appendix A Instruction Latencies	24
A.1 Instruction Latency Assumptions	24
