AMD OS1354WBJ4BGHBOX Optimization Guide - Page 24

Appendix A Instruction Latencies

The companion file AMD64_16h_InstrLatency_1.1.xlsx distributed with this Software Optimization Guide provides additional detailed information for the AMD Family 16h processor. The first worksheet in the spreadsheet, "Overview," provides some useful reference information which is related to the second worksheet, "Latencies." This appendix explains the columns and definitions used in the table of latencies. Information in the spreadsheet is based on estimates and is subject to change.

A.1 Instruction Latency Assumptions

The term instruction latency refers to the number of processor clock cycles required to complete the execution of a particular instruction from the time that it is issued. Throughput refers to the number of results that can be generated in a unit of time given the repeated execution of a given instruction.

Many factors affect instruction execution time. For instance, when a source operand must be loaded from a memory location, the time required to read the operand from system memory adds to the execution time. Furthermore, latency is highly variable because a memory operand may or may not be found in one of the levels of data cache. In some cases, the target memory location may not even be resident in system memory because it has been paged out to backing storage.

In estimating the instruction latency and reciprocal throughput, the following assumptions are necessary:

• The instruction is an L1 I-cache hit that has already been fetched and decoded, with the operations loaded into the scheduler.
• Memory operands are in the L1 data cache.
• There is no contention for execution resources or load-store unit resources.

Each latency value listed in the spreadsheet denotes the typical execution time of the instruction when run in isolation on a processor. For real programs executed on this highly aggressive super-scalar family of processors, multiple instructions can execute simultaneously; therefore, the effective latency for any given instruction's execution may be overlapped with the latency of other instructions executing in parallel.
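
As a rough illustration of how overlap changes the effective cost, the sketch below compares a chain of dependent instructions with a stream of independent ones under a simplified pipeline model. The function names and the latency and reciprocal-throughput values are hypothetical placeholders chosen for this example; they are not taken from the spreadsheet, and the model ignores the resource contention and cache effects excluded by the assumptions above.

```python
def dependent_chain_cycles(count, latency):
    """Cycles for `count` instructions where each one consumes the
    previous result, so no overlap is possible and latencies add up."""
    return count * latency


def independent_stream_cycles(count, latency, reciprocal_throughput):
    """Cycles for `count` independent instructions in a simplified model:
    a new instruction starts every `reciprocal_throughput` cycles, and the
    last one completes `latency` cycles after it starts."""
    if count == 0:
        return 0
    return (count - 1) * reciprocal_throughput + latency


# Hypothetical values for illustration only (not spreadsheet data).
LATENCY = 3                # cycles from issue to result
RECIPROCAL_THROUGHPUT = 1  # cycles between successive issues

print(dependent_chain_cycles(100, LATENCY))                            # 300
print(independent_stream_cycles(100, LATENCY, RECIPROCAL_THROUGHPUT))  # 102
```

Under these assumed numbers the dependent chain is bound by latency, while the independent stream is bound almost entirely by reciprocal throughput.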

The latencies in the spreadsheet reflect the number of cycles from instruction issuance to instruction retirement. This includes the time to write results to registers or the write buffer, but not the time for results to be written from the write buffer to L1 D-cache, which may not occur until after the instruction is retired.

For most instructions, the only forms listed are the ones without memory operands. The latency for instruction forms that load from memory can be calculated by adding the load latencies listed on the overview worksheet to the latency for the register-only form.
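
The addition described above can be sketched as a small lookup. Everything in this example is an assumption made for illustration: the table names, the split of load latencies into integer and floating-point kinds, and all numeric values are placeholders, and real figures should be read from the Overview and Latencies worksheets.

```python
# Placeholder load latencies, standing in for the Overview worksheet.
LOAD_LATENCY = {"integer": 3, "floating_point": 5}

# Placeholder register-only latencies, standing in for the Latencies worksheet.
REGISTER_FORM_LATENCY = {"ADD": 1, "ADDPS": 3}


def load_form_latency(mnemonic, load_kind):
    """Estimate the latency of the form that loads from memory by adding
    the load latency to the register-only latency."""
    return REGISTER_FORM_LATENCY[mnemonic] + LOAD_LATENCY[load_kind]


print(load_form_latency("ADD", "integer"))           # 4
print(load_form_latency("ADDPS", "floating_point"))  # 8
```

The estimate only holds under the assumptions listed earlier, in particular that the memory operand is in the L1 data cache.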

To measure the latency of an instruction which stores data to memory, it is necessary to define an end point at which the instruction is said to be complete. This guide has chosen instruction retirement as the end point, and under that definition writes add no additional latency. Choosing another end point, such as the point at which the data has been written to the L1 cache, would result in variable latencies and would not be meaningful without taking into account the context in which the instruction is executed.

There are cases where additional latencies may be incurred in a real program that are not described in the spreadsheet, such as delays caused by L1 cache misses or contention for execution or load-store unit resources.

A.2 Spreadsheet Column Descriptions

The following describes the information provided in each column of the spreadsheet:

Column A	Instruction	Instruction opcodes

Software Optimization Guide for AMD Family 16h Processors, 52128 Rev. 1.1, March 2013, Page 24

Section	Page
Contents	3
List of Figures	4
List of Tables	5
Revision History	6
1 Preface	7
2 Microarchitecture of the Family 16h Processor	8
2.1 Features	8
2.2 Instruction Decomposition	10
2.3 Superscalar Organization	10
2.4 Processor Block Diagram	11
2.5 Processor Cache Operations	11
2.5.1 L1 Instruction Cache	12
2.5.2 L1 Data Cache	12
2.5.3 L2 Cache	12
2.6 Memory Address Translation	13
2.6.1 L1 Translation Lookaside Buffers	13
2.6.2 L2 Translation Lookaside Buffers	13
2.6.3 Hardware Page Table Walker	13
2.7 Optimizing Branching	13
2.7.1 Branch Prediction	13
2.7.1.1 Next Address Logic	14
2.7.1.2 Branch Target Buffer	14
2.7.1.3 Branch Target Address Calculator	14
2.7.1.4 Out-of-Page Target Array	15
2.7.1.5 Branch Marker Caching	15
2.7.1.6 Return Address Stack	15
2.7.1.7 Indirect Target Predictor	16
2.7.1.8 Conditional Branch Predictor	16
2.7.1.9 Fetch Window Tracking Structure	16
2.7.2 Loop Alignment	16
2.7.2.1 Encoding Padding for Loop Alignment	16
2.7.2.2 Aligning Loops to Reduce Power Consumption	17
2.8 Instruction Fetch and Decode	18
2.9 Integer Unit	18
2.9.1 Integer Schedulers	18
2.9.2 Integer Execution Units	18
2.9.3 Retire Control Unit	19
2.10 Floating-Point Unit	19
2.10.1 Denormals	21
2.11 XMM Register Merge Optimization	22
2.12 Load Store Unit	23
Appendix A Instruction Latencies	24
A.1 Instruction Latency Assumptions	24

