AMD OS1354WBJ4BGHBOX Optimization Guide - Page 13


Page 13 highlights

52128 Rev. 1.1 March 2013 Software Optimization Guide for AMD Family 16h Processors

2.6 Memory Address Translation

A translation-lookaside buffer (TLB) holds the most-recently-used page mapping information. It assists and accelerates the translation of virtual addresses to physical addresses. A hardware table walker loads page table information into the TLBs. The AMD Family 16h processor utilizes a two-level TLB structure.

2.6.1 L1 Translation Lookaside Buffers

The AMD Family 16h processor contains a fully-associative L1 instruction TLB (ITLB) with 32 4-Kbyte page entries and 8 2-Mbyte page entries. The fully-associative L1 data TLB (DTLB) provides 40 4-Kbyte page entries and 8 2-Mbyte page entries.

2.6.2 L2 Translation Lookaside Buffers

The AMD Family 16h processor provides a 4-way set-associative L2 instruction TLB with 512 4-Kbyte page entries. The L2 data TLB provides two independent translation buffers which are accessed in parallel: a 4-way set-associative buffer with 512 4-Kbyte page entries and a 2-way set-associative buffer with 256 2-Mbyte page entries.

2.6.3 Hardware Page Table Walker

The hardware page table walker handles L2 TLB misses. Misses can start speculatively from either the instruction or the data side. The table walker includes a 16-entry Page Directory Cache (PDC) to speed up table walks.

The table walker supports 1-Gbyte pages by smashing the page into a 2-Mbyte window, and returning a 2-Mbyte TLB entry. In legacy mode, 4-Mbyte entries are also supported by returning a smashed 2-Mbyte TLB entry.

INVLPG and INVLPGA instructions cause a flush of the entire TLB if any 1-Gbyte smashed entries have been created since the last flush. System software may wish to avoid the use of 1-Gbyte pages. In a nested paging environment, the processor does not create smashed entries if the nested page tables use 1-Gbyte pages but the guest page tables do not use 1-Gbyte pages. See the definition of the terms smashing and smashed in the Preface.
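As a rough illustration of the TLB capacities listed in sections 2.6.1 and 2.6.2, the sketch below computes the "reach" of each buffer, meaning the amount of address space covered when every entry maps a distinct page. The entry counts and page sizes come from the text above. The names `TLBS`, `tlb_reach`, `entries_needed`, and `format_size` are hypothetical helpers introduced only for this example, and the notion of reach is a common back-of-the-envelope measure rather than something the guide itself defines.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# (entries, page size in bytes), taken from sections 2.6.1 and 2.6.2.
# The dictionary itself is an illustrative construct, not part of the guide.
TLBS = {
    "L1 ITLB, 4-Kbyte pages": (32, 4 * KIB),
    "L1 ITLB, 2-Mbyte pages": (8, 2 * MIB),
    "L1 DTLB, 4-Kbyte pages": (40, 4 * KIB),
    "L1 DTLB, 2-Mbyte pages": (8, 2 * MIB),
    "L2 ITLB, 4-Kbyte pages": (512, 4 * KIB),
    "L2 DTLB, 4-Kbyte pages": (512, 4 * KIB),
    "L2 DTLB, 2-Mbyte pages": (256, 2 * MIB),
}


def tlb_reach(entries, page_bytes):
    """Bytes of address space covered if every entry maps a distinct page."""
    return entries * page_bytes


def entries_needed(region_bytes, page_bytes):
    """Number of TLB entries needed to map a region (ceiling division)."""
    return -(-region_bytes // page_bytes)


def format_size(num_bytes):
    """Render a byte count in whole MiB when large enough, else whole KiB."""
    if num_bytes >= MIB:
        return f"{num_bytes // MIB} MiB"
    return f"{num_bytes // KIB} KiB"


if __name__ == "__main__":
    for name, (entries, page_bytes) in TLBS.items():
        print(f"{name}: {format_size(tlb_reach(entries, page_bytes))}")
    # A 1-Gbyte page is returned as smashed 2-Mbyte entries (section 2.6.3),
    # so touching all of it would take this many 2-Mbyte entries:
    print("2-Mbyte entries per 1-Gbyte page:", entries_needed(GIB, 2 * MIB))
```

Under these assumptions the L1 DTLB covers 160 KiB with 4-Kbyte pages and 16 MiB with 2-Mbyte pages, while the L2 DTLB covers 2 MiB and 512 MiB respectively. Fully touching one 1-Gbyte page would need 512 smashed 2-Mbyte entries, which is more than the 256 entries the L2 data TLB provides for that page size.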
2.7 Optimizing Branching

Branching can reduce throughput when instruction execution must wait on the completion of the instructions prior to the branch that determine whether the branch is taken. The Family 16h processor integrates logic that is designed to reduce the average cost of conditional branching by attempting to predict the outcome of a branch decision prior to the resolution of the condition upon which the decision is based. This prediction is used to speculatively fetch, decode, and execute instructions on the predicted path. When the prediction is correct, waiting is avoided and the instruction throughput is increased. The minimum branch misprediction penalty is 14 cycles.

The following topic describes the branch prediction hardware facilities of the processor. This is followed by a discussion of how to align code within a loop to use the loop optimization hardware to its fullest advantage.

2.7.1 Branch Prediction

To predict and accelerate branches the AMD Family 16h processor employs:

Chapter 2 Microarchitecture of the Family 16h Processor 13
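To put the 14-cycle minimum misprediction penalty from section 2.7 in perspective, the following sketch estimates a lower bound on the cycles lost to mispredicted branches. Only the 14-cycle figure comes from the text. The function names and the sample counts are hypothetical, and because 14 cycles is a minimum, the real cost on hardware may be higher.

```python
# Minimum branch misprediction penalty stated in section 2.7.
MIN_MISPREDICT_PENALTY_CYCLES = 14


def min_cycles_lost(mispredicted_branches,
                    penalty_cycles=MIN_MISPREDICT_PENALTY_CYCLES):
    """Lower bound on cycles lost to the given number of mispredictions."""
    if mispredicted_branches < 0:
        raise ValueError("mispredicted_branches must be non-negative")
    return mispredicted_branches * penalty_cycles


def min_fraction_lost(mispredicted_branches, total_cycles):
    """Lower bound on the share of total cycles spent on misprediction recovery."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return min_cycles_lost(mispredicted_branches) / total_cycles


if __name__ == "__main__":
    # Hypothetical measurement: 50,000 mispredictions in a 2,000,000-cycle run.
    mispredicts = 50_000
    cycles = 2_000_000
    print("cycles lost (lower bound):", min_cycles_lost(mispredicts))
    print("fraction of run (lower bound):", min_fraction_lost(mispredicts, cycles))
```

With these made-up counts, at least 700,000 of the 2,000,000 cycles (35 percent) would go to misprediction recovery, which suggests why hard-to-predict branches in hot loops are worth examining.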

