AMD OS1354WBJ4BGHBOX Optimization Guide - Page 13

2.6 Memory Address Translation

A translation-lookaside buffer (TLB) holds the most-recently-used page mapping information. It assists and accelerates the translation of virtual addresses to physical addresses. A hardware table walker loads page table information into the TLBs. The AMD Family 16h processor utilizes a two-level TLB structure.

2.6.1 L1 Translation Lookaside Buffers

The AMD Family 16h processor contains a fully-associative L1 instruction TLB (ITLB) with 32 4-Kbyte page entries and 8 2-Mbyte page entries. The fully-associative L1 data TLB (DTLB) provides 40 4-Kbyte page entries and 8 2-Mbyte page entries.

2.6.2 L2 Translation Lookaside Buffers

The AMD Family 16h processor provides a 4-way set-associative L2 instruction TLB with 512 4-Kbyte page entries. The L2 data TLB provides two independent translation buffers which are accessed in parallel; a 4-way set-associative buffer with 512 4-Kbyte page entries and a 2-way set-associative buffer with 256 2-Mbyte page entries.
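As a rough aid for sizing working sets, the entry counts in sections 2.6.1 and 2.6.2 can be turned into the amount of memory each TLB can map at once (commonly called TLB reach, a term this guide does not itself use). The sketch below is only arithmetic on the numbers quoted above; the function and macro names are illustrative and not part of any AMD interface.

```c
#include <assert.h>
#include <stdint.h>

/* Page sizes named in the text. */
#define PAGE_4K (UINT64_C(4) * 1024)
#define PAGE_2M (UINT64_C(2) * 1024 * 1024)

/* Memory covered by a TLB: number of entries times bytes mapped per entry. */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_bytes)
{
    /* Guard against a zero page size and against 64-bit overflow. */
    assert(page_bytes != 0 && entries <= UINT64_MAX / page_bytes);
    return entries * page_bytes;
}

/* Number of sets in a set-associative TLB: entries divided by ways. */
uint64_t tlb_sets(uint64_t entries, uint64_t ways)
{
    assert(ways != 0 && entries % ways == 0);
    return entries / ways;
}
```

With the figures from the text, `tlb_reach_bytes(40, PAGE_4K)` gives 160 Kbytes for the L1 DTLB with 4-Kbyte pages and `tlb_reach_bytes(256, PAGE_2M)` gives 512 Mbytes for the L2 DTLB with 2-Mbyte pages. Both L2 data buffers work out to 128 sets (`tlb_sets(512, 4)` and `tlb_sets(256, 2)`).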

2.6.3 Hardware Page Table Walker

The hardware page table walker handles L2 TLB misses. Misses can start speculatively from either the instruction or the data side. The table walker includes a 16-entry Page Directory Cache (PDC) to speed up table walks. The table walker supports 1-Gbyte pages by smashing the page into a 2-Mbyte window, and returning a 2-Mbyte TLB entry. In legacy mode, 4-Mbyte entries are also supported by returning a smashed 2-Mbyte TLB entry.

INVLPG and INVLPGA instructions cause a flush of the entire TLB if any 1-Gbyte smashed entries have been created since the last flush. System software may wish to avoid the use of 1-Gbyte pages. In a nested paging environment, the processor does not create smashed entries if the nested page tables use 1-Gbyte pages but the guest page tables do not use 1-Gbyte pages.

See the definition of the terms smashing and smashed in the Preface.
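The address arithmetic behind a smashed entry can be sketched as follows. The Preface definition is not reproduced on this page, so this is one plausible reading of the text above, not a description of the hardware: the returned 2-Mbyte entry is assumed to cover only the 2-Mbyte-aligned window of the 1-Gbyte page that contains the missing address. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SIZE_2M (UINT64_C(1) << 21)
#define SIZE_1G (UINT64_C(1) << 30)

/* Hypothetical model of a smashed TLB entry: one 2-Mbyte translation. */
struct tlb_entry_2m {
    uint64_t virt_base; /* 2-Mbyte-aligned virtual base of the window */
    uint64_t phys_base; /* 2-Mbyte-aligned physical base of the window */
};

/*
 * Assumed behavior: given a virtual address inside a 1-Gbyte page and the
 * physical base of that page, return the 2-Mbyte window containing vaddr.
 */
struct tlb_entry_2m smash_1g_page(uint64_t vaddr, uint64_t page_phys_base)
{
    /* A 1-Gbyte page must start on a 1-Gbyte physical boundary. */
    assert((page_phys_base & (SIZE_1G - 1)) == 0);

    uint64_t offset_in_page = vaddr & (SIZE_1G - 1);
    uint64_t window_offset = offset_in_page & ~(SIZE_2M - 1);

    struct tlb_entry_2m entry;
    entry.virt_base = vaddr & ~(SIZE_2M - 1);
    entry.phys_base = page_phys_base + window_offset;
    return entry;
}
```

Under this model a single 1-Gbyte page can give rise to as many as 512 distinct 2-Mbyte entries (`SIZE_1G / SIZE_2M`), which is consistent with the text's note that an INVLPG cannot target them individually and instead flushes the entire TLB.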

2.7 Optimizing Branching

Branching can reduce throughput when instruction execution must wait on the completion of the instructions prior to the branch that determine whether the branch is taken. The Family 16h processor integrates logic that is designed to reduce the average cost of conditional branching by attempting to predict the outcome of a branch decision prior to the resolution of the condition upon which the decision is based.

This prediction is used to speculatively fetch, decode, and execute instructions on the predicted path. When the prediction is correct, waiting is avoided and the instruction throughput is increased. The minimum branch misprediction penalty is 14 cycles.

The following topic describes the branch prediction hardware facilities of the processor. This is followed by a discussion of how to align code within a loop to use the loop optimization hardware to its fullest advantage.
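The 14-cycle figure can be used for a back-of-the-envelope estimate of what mispredictions cost. The sketch below is a simple model, not a measurement method: because 14 cycles is the minimum penalty, the results are lower bounds. The branch fraction and misprediction rate are hypothetical inputs that would have to come from profiling.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum branch misprediction penalty stated in the text. */
#define MIN_MISPREDICT_PENALTY_CYCLES 14

/* Lower bound on cycles lost to a given number of mispredicted branches. */
uint64_t min_mispredict_cycles(uint64_t mispredicted_branches)
{
    /* Guard against 64-bit overflow in the multiplication. */
    assert(mispredicted_branches <= UINT64_MAX / MIN_MISPREDICT_PENALTY_CYCLES);
    return mispredicted_branches * MIN_MISPREDICT_PENALTY_CYCLES;
}

/*
 * Lower bound on extra cycles per instruction, where branch_fraction is the
 * share of instructions that are branches and mispredict_rate is the share
 * of those branches that are mispredicted (both in the range 0.0 to 1.0).
 */
double min_mispredict_cpi(double branch_fraction, double mispredict_rate)
{
    return branch_fraction * mispredict_rate * MIN_MISPREDICT_PENALTY_CYCLES;
}
```

For example, if 20% of instructions are branches and 5% of those are mispredicted, the model gives at least 0.2 x 0.05 x 14 = 0.14 extra cycles per instruction.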

2.7.1 Branch Prediction

To predict and accelerate branches the AMD Family 16h processor employs:

52128 Rev. 1.1 March 2013, Software Optimization Guide for AMD Family 16h Processors, Chapter 2 Microarchitecture of the Family 16h Processor, page 13

Section	Page
Contents	3
List of Figures	4
List of Tables	5
Revision History	6
1 Preface	7
2 Microarchitecture of the Family 16h Processor	8
2.1 Features	8
2.2 Instruction Decomposition	10
2.3 Superscalar Organization	10
2.4 Processor Block Diagram	11
2.5 Processor Cache Operations	11
2.5.1 L1 Instruction Cache	12
2.5.2 L1 Data Cache	12
2.5.3 L2 Cache	12
2.6 Memory Address Translation	13
2.6.1 L1 Translation Lookaside Buffers	13
2.6.2 L2 Translation Lookaside Buffers	13
2.6.3 Hardware Page Table Walker	13
2.7 Optimizing Branching	13
2.7.1 Branch Prediction	13
2.7.1.1 Next Address Logic	14
2.7.1.2 Branch Target Buffer	14
2.7.1.3 Branch Target Address Calculator	14
2.7.1.4 Out-of-Page Target Array	15
2.7.1.5 Branch Marker Caching	15
2.7.1.6 Return Address Stack	15
2.7.1.7 Indirect Target Predictor	16
2.7.1.8 Conditional Branch Predictor	16
2.7.1.9 Fetch Window Tracking Structure	16
2.7.2 Loop Alignment	16
2.7.2.1 Encoding Padding for Loop Alignment	16
2.7.2.2 Aligning Loops to Reduce Power Consumption	17
2.8 Instruction Fetch and Decode	18
2.9 Integer Unit	18
2.9.1 Integer Schedulers	18
2.9.2 Integer Execution Units	18
2.9.3 Retire Control Unit	19
2.10 Floating-Point Unit	19
2.10.1 Denormals	21
2.11 XMM Register Merge Optimization	22
2.12 Load Store Unit	23
Appendix A Instruction Latencies	24
A.1 Instruction Latency Assumptions	24
