AMD OS1354WBJ4BGHBOX Optimization Guide - Page 10

2.2 Instruction Decomposition

The AMD Family 16h processor implements the AMD64 instruction set by means of macro-ops (the primary units of work managed by the processor) and micro-ops (the primitive operations executed in the processor's execution units). These operations are designed to include direct support for AMD64 instructions and adhere to the high-performance principles of fixed-length encoding, regularized instruction fields, and a large register set. This enhanced microarchitecture enables higher processor core performance and promotes straightforward extensibility for future designs.

Instructions are marked as fastpath single (one macro-op), fastpath double (two macro-ops), or microcode (greater than 2 macro-ops). Macro-ops can normally contain up to 2 micro-ops. The table below lists some examples showing how instructions are mapped to macro-ops and how these macro-ops are mapped into one or more micro-ops.

Table 1. Typical Instruction Mappings

Instruction	Macro-ops	Micro-ops	Comments
MOV reg,[mem]	1	1: load	Fastpath single
MOV [mem],reg	1	1: store	Fastpath single
MOV [mem],imm	1	2: move-imm, store	Fastpath single
REP MOVS [mem],[mem]	Many	Many	Microcode
ADD reg,reg	1	1: add	Fastpath single
ADD reg,[mem]	1	2: load, add	Fastpath single
ADD [mem],reg	1	2: load/store, add	Fastpath single
MOVAPD [mem],xmm	1	2: store, FP-store-data	Fastpath single
VMOVAPD [mem],ymm	2	4: 2 × {store, FP-store-data}	256b AVX, Fastpath double
ADDPD xmm,xmm	1	1: addpd	Fastpath single
ADDPD xmm,[mem]	1	2: load, addpd	Fastpath single
VADDPD ymm,ymm	2	2: 2 × {addpd}	256b AVX, Fastpath double
VADDPD ymm,[mem]	2	4: 2 × {load, addpd}	256b AVX, Fastpath double
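The mappings in Table 1 can be expressed as a small lookup for tallying macro-ops and micro-ops over an instruction sequence. The following Python sketch is illustrative only: the names TABLE_1, classify, and totals are hypothetical, the counts are copied from Table 1, and None stands in for the "Many" entry of the microcoded REP MOVS row.

```python
from typing import Dict, Iterable, Optional, Tuple

# (macro-ops, micro-ops) per instruction form, copied from Table 1.
# None represents "Many" (microcode); the table gives no exact count.
TABLE_1: Dict[str, Tuple[Optional[int], Optional[int]]] = {
    "MOV reg,[mem]": (1, 1),
    "MOV [mem],reg": (1, 1),
    "MOV [mem],imm": (1, 2),
    "REP MOVS [mem],[mem]": (None, None),
    "ADD reg,reg": (1, 1),
    "ADD reg,[mem]": (1, 2),
    "ADD [mem],reg": (1, 2),
    "MOVAPD [mem],xmm": (1, 2),
    "VMOVAPD [mem],ymm": (2, 4),
    "ADDPD xmm,xmm": (1, 1),
    "ADDPD xmm,[mem]": (1, 2),
    "VADDPD ymm,ymm": (2, 2),
    "VADDPD ymm,[mem]": (2, 4),
}


def classify(macro_ops: Optional[int]) -> str:
    """Apply the rule from the text: 1 -> single, 2 -> double, more -> microcode."""
    if macro_ops is None or macro_ops > 2:
        return "microcode"
    return "fastpath single" if macro_ops == 1 else "fastpath double"


def totals(instructions: Iterable[str]) -> Tuple[int, int]:
    """Sum macro-ops and micro-ops for a sequence of Table 1 instruction forms."""
    macro_total = 0
    micro_total = 0
    for form in instructions:
        macro_ops, micro_ops = TABLE_1[form]
        if macro_ops is None or micro_ops is None:
            # Table 1 lists only "Many" for microcoded forms, so no total exists.
            raise ValueError(f"{form} is microcoded; Table 1 gives no exact count")
        macro_total += macro_ops
        micro_total += micro_ops
    return macro_total, micro_total
```

For example, totals(["MOV reg,[mem]", "VADDPD ymm,[mem]", "ADD [mem],reg"]) sums to 4 macro-ops and 7 micro-ops. Every counted row also satisfies the stated limit of at most 2 micro-ops per macro-op.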

2.3 Superscalar Organization

The AMD Family 16h processor is an out-of-order, two-way superscalar AMD64 processor. It can fetch, decode, and retire up to two AMD64 instructions per cycle. The processor uses decoupled execution units to process instructions through fetch/branch-predict, decode, schedule/execute, and retirement pipelines.

The processor can fetch 32 bytes per cycle and can scan two 16-byte instruction windows for up to two instruction decodes per cycle. The decoder marks each instruction as fastpath single, fastpath double, or microcode. The dispatcher can send up to two macro-ops to the retire unit for tracking, as well as sending the corresponding micro-ops to the schedulers. These are upper limits, however. The actual number of bytes fetched or scanned, instructions decoded, or macro-ops dispatched may be lower, depending on a number of factors such as whether instructions can be broken up into 16-byte windows.
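Because the figures above are upper limits, they give only a lower bound on the number of cycles the front end needs for a block of code. The following Python sketch works through that arithmetic. It is a simplified illustration, not a model of the actual pipeline: the function name min_front_end_cycles and its parameters are hypothetical, and it ignores 16-byte window alignment, branches, cache misses, and microcoded instructions.

```python
import math
from typing import Sequence

# Per-cycle upper limits stated in the text above.
FETCH_BYTES_PER_CYCLE = 32
DECODES_PER_CYCLE = 2
DISPATCH_MACRO_OPS_PER_CYCLE = 2


def min_front_end_cycles(instr_lengths: Sequence[int],
                         macro_ops: Sequence[int]) -> int:
    """Lower bound on front-end cycles for a straight-line instruction block.

    instr_lengths[i] is the encoded length in bytes of instruction i and
    macro_ops[i] is its macro-op count (1 or 2 for fastpath instructions).
    The real cycle count may be higher, since these limits are upper bounds.
    """
    if len(instr_lengths) != len(macro_ops):
        raise ValueError("need one macro-op count per instruction")
    fetch = math.ceil(sum(instr_lengths) / FETCH_BYTES_PER_CYCLE)
    decode = math.ceil(len(instr_lengths) / DECODES_PER_CYCLE)
    dispatch = math.ceil(sum(macro_ops) / DISPATCH_MACRO_OPS_PER_CYCLE)
    # The slowest stage sets the bound.
    return max(fetch, decode, dispatch)
```

Under these assumptions, four fastpath single instructions need at least 2 cycles (decode-limited), while four fastpath double instructions need at least 4 cycles (dispatch-limited, 8 macro-ops at two per cycle).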

The processor uses decoupled independent schedulers, consisting of an integer ALU scheduler, an AGU scheduler, and a floating-point scheduler. These three schedulers can simultaneously issue up to six micro-ops to

Software Optimization Guide for AMD Family 16h Processors
52128 Rev. 1.1 March 2013
10 Microarchitecture of the Family 16h Processor Chapter 2

Section	Page
Contents	3
List of Figures	4
List of Tables	5
Revision History	6
1 Preface	7
2 Microarchitecture of the Family 16h Processor	8
2.1 Features	8
2.2 Instruction Decomposition	10
2.3 Superscalar Organization	10
2.4 Processor Block Diagram	11
2.5 Processor Cache Operations	11
2.5.1 L1 Instruction Cache	12
2.5.2 L1 Data Cache	12
2.5.3 L2 Cache	12
2.6 Memory Address Translation	13
2.6.1 L1 Translation Lookaside Buffers	13
2.6.2 L2 Translation Lookaside Buffers	13
2.6.3 Hardware Page Table Walker	13
2.7 Optimizing Branching	13
2.7.1 Branch Prediction	13
2.7.1.1 Next Address Logic	14
2.7.1.2 Branch Target Buffer	14
2.7.1.3 Branch Target Address Calculator	14
2.7.1.4 Out-of-Page Target Array	15
2.7.1.5 Branch Marker Caching	15
2.7.1.6 Return Address Stack	15
2.7.1.7 Indirect Target Predictor	16
2.7.1.8 Conditional Branch Predictor	16
2.7.1.9 Fetch Window Tracking Structure	16
2.7.2 Loop Alignment	16
2.7.2.1 Encoding Padding for Loop Alignment	16
2.7.2.2 Aligning Loops to Reduce Power Consumption	17
2.8 Instruction Fetch and Decode	18
2.9 Integer Unit	18
2.9.1 Integer Schedulers	18
2.9.2 Integer Execution Units	18
2.9.3 Retire Control Unit	19
2.10 Floating-Point Unit	19
2.10.1 Denormals	21
2.11 XMM Register Merge Optimization	22
2.12 Load Store Unit	23
Appendix A Instruction Latencies	24
A.1 Instruction Latency Assumptions	24
