AMD OS1354WBJ4BGHBOX Optimization Guide - Page 12


Software Optimization Guide for AMD Family 16h Processors, 52128 Rev. 1.1, March 2013

2.5.1 L1 Instruction Cache

The AMD Family 16h processor contains a 32-Kbyte, 2-way set associative L1 instruction cache. Cache line size is 64 bytes; however, only 32 bytes are fetched in a cycle. Functions associated with the L1 instruction cache are fetching cache lines from the L2 cache, providing instruction bytes to the decoder, prefetching instructions, and predicting branches. Requests that miss in the L1 instruction cache are fetched from the L2 cache or, if not resident in the L2 cache, from system memory.

On misses, the L1 instruction cache generates fill requests for the naturally aligned 64-byte block that includes the miss address, and for one or two sequential blocks (prefetches). Because code typically exhibits spatial locality, prefetching is an effective technique for avoiding decode stalls. Cache-line replacement is based on a least-recently-used replacement algorithm. The L1 instruction cache is protected from error through the use of parity.

Due to the indexing and tagging scheme used in the instruction cache, optimal performance is obtained when two hot cache lines that need to be resident in the instruction cache simultaneously do not share the same virtual address bits [20:6].

2.5.2 L1 Data Cache

The AMD Family 16h processor contains a 32-Kbyte, 8-way set associative L1 data cache. This is a write-back cache that supports one 128-bit load and one 128-bit store per cycle. In addition, the L1 cache is protected from bit errors through the use of parity. A hardware prefetcher brings data into the L1 data cache to avoid misses. The L1 data cache has a 3-cycle integer load-to-use latency and a 5-cycle FPU load-to-use latency.

The data cache natural alignment boundary is 16 bytes. A misaligned load or store operation suffers, at minimum, a one-cycle penalty in the load-store pipeline if it spans a 16-byte boundary.
Throughput for misaligned loads and stores is half that of aligned loads and stores, since a misaligned load or store requires two cycles to access the data cache (versus a single cycle for aligned accesses). For aligned memory accesses, the aligned and unaligned load and store instructions (for example, MOVUPS/MOVAPS) provide identical performance. Natural alignment for both 128-bit and 256-bit vectors is 16 bytes. There is no advantage in aligning 256-bit vectors to a 32-byte boundary on the Family 16h processor, because 256-bit vectors are loaded and stored as two 128-bit halves.

2.5.3 L2 Cache

The AMD Family 16h processor implements a unified, 16-way set associative L2 cache shared by up to four cores. This on-die L2 cache is inclusive of the L1 caches in the cores. The L2 is a write-back cache with a variable load-to-use latency of no less than 25 cycles. The L2 cache size is 1 or 2 Mbytes, depending on configuration. L2 cache entries are protected from errors through the use of an error correcting code (ECC). The L2-to-L1 data path is 16 bytes wide; critical data within a cache line is forwarded first.

The L2 has four 512-Kbyte banks. Bits 7:6 of the cache-line address determine which bank holds the cache line. For a large contiguous block of data, this organization naturally spreads the cache lines over all four banks. The banks can operate on requests in parallel and can each deliver 16 bytes per cycle, for a total peak read bandwidth of 64 bytes per cycle for the L2. Peak bandwidth to any individual core is 16 bytes per cycle, so with four cores, each bank can deliver 16 bytes of data to a different core simultaneously. The banking scheme thus gives all four cores in the processing complex bandwidth at the level that a private per-core L2 would provide.

Chapter 2: Microarchitecture of the Family 16h Processor, page 12

