AMD OS1354WBJ4BGHBOX Optimization Guide - Page 25

Columns B–E (Opn): Instruction operands. The following notations are used in these columns:

• imm: an immediate operand (value range left unspecified)
• imm8: an 8-bit immediate operand
• m: an 8, 16, 32 or 64-bit memory operand (128 and 256 bit memory operands are always explicitly specified as m128 or m256)
• mm: any 64-bit MMX register
• mN: an N-bit memory operand
• r: any general purpose (integer) register
• rN: an N-bit general purpose register
• xmmN: any xmm register, the N distinguishes among multiple operands of the same type
• ymmN: any ymm register, the N distinguishes among multiple operands of the same type

A slash denotes an alternative; for example, m64/m32 is a 32-bit or 64-bit memory operand. The notation "<xmm0>" denotes that the register xmm0 is an implicit operand of the instruction.
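The operand notation above is regular enough to decode mechanically. The following is a minimal Python sketch, not something taken from the guide: the function names `describe_token` and `describe_operand`, the returned tuple layout, and the wording of the description strings are all assumptions made for illustration.

```python
import re

# Fixed notations from the list above (descriptions paraphrased).
FIXED_NOTATIONS = {
    "imm": "immediate operand",
    "imm8": "8-bit immediate operand",
    "m": "8, 16, 32 or 64-bit memory operand",
    "mm": "64-bit MMX register",
    "r": "general purpose (integer) register",
}


def describe_token(token):
    """Describe a single notation such as 'm64', 'r32' or 'xmm1' (hypothetical helper)."""
    if token in FIXED_NOTATIONS:
        return FIXED_NOTATIONS[token]
    # xmmN / ymmN: the trailing number only tells operands of the same type apart.
    match = re.fullmatch(r"(xmm|ymm)\d*", token)
    if match:
        return f"{match.group(1)} register"
    # mN / rN: the number is the operand width in bits.
    match = re.fullmatch(r"([mr])(\d+)", token)
    if match:
        kind = "memory operand" if match.group(1) == "m" else "general purpose register"
        return f"{match.group(2)}-bit {kind}"
    raise ValueError(f"unrecognized operand notation: {token!r}")


def describe_operand(entry):
    """Return (is_implicit, descriptions) for one operand column entry."""
    entry = entry.strip()
    # Angle brackets mark a specific register used as an implicit operand.
    if entry.startswith("<") and entry.endswith(">"):
        return True, [f"register {entry[1:-1]}"]
    # A slash separates alternative operand forms.
    return False, [describe_token(t) for t in entry.split("/")]
```

For example, `describe_operand("m64/m32")` yields two alternatives, while `describe_operand("<xmm0>")` reports an implicit operand naming the register itself rather than a generic xmm operand.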

Column F (Cpuid flag): CPUID feature flag for the instruction.

Column G (Macro Ops): Number of macro-ops for the instruction. Any number greater than 2 implies that the instruction is microcoded, with the given number of macro-ops in the micro-program. If the entry in this column is simply ‘ucode’ then the instruction is microcoded but the exact number of macro-ops either has not been determined or is variable.
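The rule for reading the Macro Ops column can be written down as a small function. This is a sketch under stated assumptions: the name `classify_macro_ops` and the `(microcoded, count)` return shape are illustrative, and entries are assumed to be either a plain integer or the literal text `ucode`.

```python
def classify_macro_ops(entry):
    """Return (microcoded, count) for a Macro Ops entry (hypothetical helper).

    count is None when the entry is 'ucode', meaning the exact number of
    macro-ops has not been determined or is variable.
    """
    text = entry.strip().lower()
    if text == "ucode":
        return True, None
    count = int(text)
    # Any number greater than 2 implies a microcoded instruction.
    return count > 2, count
```

With this reading, entries of 1 or 2 are not microcoded, an entry of 3 or more is microcoded with a known micro-program length, and `ucode` is microcoded with an unknown length.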

Column H (Unit): Execution units. The following abbreviations are used:
• ALU: Arithmetic/logical unit.
• FPA: Floating-point add functional element within the floating-point cluster of the floating-point unit.
• FPM: Floating-point multiply functional element in the floating-point cluster of the floating-point unit.
• DIV: Integer divide functional element within the integer unit.
• MUL: Integer multiply functional element within the integer unit.
• SAGU: Store address generation unit within the integer unit.
• STC: Store/convert functional element in the store/convert cluster of the floating-point unit.
• VALU: Either of the vector ALUs (VALU0 or VALU1) within the integer cluster of the floating-point unit.
• VIMUL: Vector integer multiply functional element within the integer cluster of the floating-point unit.
• ST: Store unit.

In this column, a vertical bar indicates that the instruction can use either of two alternative resources. A comma indicates that both of the comma-separated resources are required.

A number of instructions are floating-point load-ops, which combine a transfer of data from the integer unit to the floating-point unit with a floating-point operation. This transfer is implemented by storing the data from the integer unit to a private scratch memory location, then loading it back into the floating-point unit. The Unit column indicates this with "ST, LD-fpunit", where fpunit is the floating-point unit required for the load-op.
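The conventions for the Unit column (vertical bar for alternatives, comma for jointly required resources, and the "ST, LD-fpunit" form for floating-point load-ops) can be illustrated with a short Python sketch. The helper names `parse_unit` and `fp_load_op_unit` are hypothetical, the sample entries in the usage note are made up rather than quoted from the latency tables, and treating the comma as binding more loosely than the bar is an assumption the guide does not spell out.

```python
def parse_unit(entry):
    """Split a Unit entry into required resources, each a list of alternatives.

    Assumption: commas separate required resources first, and a vertical
    bar separates alternatives within one required resource.
    """
    return [
        [alt.strip() for alt in part.split("|")]
        for part in entry.split(",")
    ]


def fp_load_op_unit(entry):
    """Return the fpunit of an 'ST, LD-fpunit' entry, or None otherwise."""
    required = parse_unit(entry)
    if (
        len(required) == 2
        and required[0] == ["ST"]
        and len(required[1]) == 1
        and required[1][0].startswith("LD-")
    ):
        # Strip the 'LD-' prefix to get the floating-point unit name.
        return required[1][0][len("LD-"):]
    return None
```

Under these assumptions, an entry written as `FPA|FPM` parses to a single resource with two alternatives, and an entry written as `ST, LD-FPM` is recognized as a floating-point load-op whose operation runs on FPM.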

Software Optimization Guide for AMD Family 16h Processors, 52128 Rev. 1.1, March 2013, page 25

Section	Page
Contents	3
List of Figures	4
List of Tables	5
Revision History	6
1 Preface	7
2 Microarchitecture of the Family 16h Processor	8
2.1 Features	8
2.2 Instruction Decomposition	10
2.3 Superscalar Organization	10
2.4 Processor Block Diagram	11
2.5 Processor Cache Operations	11
2.5.1 L1 Instruction Cache	12
2.5.2 L1 Data Cache	12
2.5.3 L2 Cache	12
2.6 Memory Address Translation	13
2.6.1 L1 Translation Lookaside Buffers	13
2.6.2 L2 Translation Lookaside Buffers	13
2.6.3 Hardware Page Table Walker	13
2.7 Optimizing Branching	13
2.7.1 Branch Prediction	13
2.7.1.1 Next Address Logic	14
2.7.1.2 Branch Target Buffer	14
2.7.1.3 Branch Target Address Calculator	14
2.7.1.4 Out-of-Page Target Array	15
2.7.1.5 Branch Marker Caching	15
2.7.1.6 Return Address Stack	15
2.7.1.7 Indirect Target Predictor	16
2.7.1.8 Conditional Branch Predictor	16
2.7.1.9 Fetch Window Tracking Structure	16
2.7.2 Loop Alignment	16
2.7.2.1 Encoding Padding for Loop Alignment	16
2.7.2.2 Aligning Loops to Reduce Power Consumption	17
2.8 Instruction Fetch and Decode	18
2.9 Integer Unit	18
2.9.1 Integer Schedulers	18
2.9.2 Integer Execution Units	18
2.9.3 Retire Control Unit	19
2.10 Floating-Point Unit	19
2.10.1 Denormals	21
2.11 XMM Register Merge Optimization	22
2.12 Load Store Unit	23
Appendix A Instruction Latencies	24
A.1 Instruction Latency Assumptions	24
