AMD OS1354WBJ4BGHBOX Optimization Guide - Page 23

• CVTSS2SD
• MOVLPS xmm1,[mem]
• CVTSI2SD (32-/64-bit)
• MOVSD xmm1,xmm2
• MOVLPD xmm1,[mem]
• RCPSS
• ROUNDSS
• ROUNDSD
• RSQRTSS
• SQRTSD
• SQRTSS

2.12 Load Store Unit

The AMD Family 16h processor load-store (LS) unit handles data accesses. The LS unit contains two largely independent pipelines enabling the execution of one 128-bit load memory operation and one 128-bit store memory operation per cycle.
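As a rough illustration of the per-cycle limits stated above, the following sketch computes an idealized lower bound on LS cycles for a given amount of load and store traffic. The function name `min_ls_cycles` is hypothetical, and the model assumes every operation is a full 128-bit access that hits the L1 data cache with no other stalls, so it is a best-case bound rather than a performance prediction.

```python
BYTES_PER_OP = 16  # one 128-bit memory operation moves 16 bytes


def min_ls_cycles(load_bytes: int, store_bytes: int) -> int:
    """Idealized lower bound on LS cycles (hypothetical helper).

    ASSUMPTION: all accesses are full 128-bit operations, hit the L1 data
    cache, and the load and store pipelines overlap perfectly.
    """
    # Ceiling division: a partial 16-byte chunk still needs one operation.
    load_cycles = (load_bytes + BYTES_PER_OP - 1) // BYTES_PER_OP
    store_cycles = (store_bytes + BYTES_PER_OP - 1) // BYTES_PER_OP
    # One load and one store can issue in the same cycle, so the busier
    # pipeline sets the bound.
    return max(load_cycles, store_cycles)


if __name__ == "__main__":
    # Copying 4 KiB reads 4096 bytes and writes 4096 bytes.
    print(min_ls_cycles(4096, 4096))
```

Under these assumptions a 4 KiB copy needs at least 256 cycles in the LS unit; real code will typically take longer.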

The LS unit includes a 16-entry memory ordering queue (MOQ). The MOQ receives both load and store operations at dispatch. Loads leave the MOQ when the load has completed and delivered data to the integer unit or the floating-point unit. Stores leave the MOQ when their address has been translated. The LS unit utilizes a 20-entry store queue which holds stores from dispatch until the store data can be written to the data cache.

The LS unit dynamically reorders operations, supporting both load operations bypassing older loads and loads bypassing older non-conflicting stores. The LS unit ensures that the processor adheres to the architectural load and store ordering rules as defined by the AMD64 architecture.

The LS unit supports store-to-load forwarding (STLF) when all of the following conditions are met:

• the store address and load address both start on the exact same byte
• the store operation size is the same or larger than the load operation size
• neither the load nor the store operation is misaligned
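The three conditions above can be sketched as a simple predicate. This is only a software model for reasoning about access patterns, not a description of the hardware logic. The names `is_misaligned` and `stlf_eligible` are hypothetical, and the sketch assumes that "misaligned" means "not naturally aligned to the access size", which is a conservative reading because this section does not define the term.

```python
def is_misaligned(addr: int, size: int) -> bool:
    # ASSUMPTION: "misaligned" is taken to mean the address is not a
    # multiple of the access size (natural alignment). The guide does not
    # define the term in this section.
    return addr % size != 0


def stlf_eligible(store_addr: int, store_size: int,
                  load_addr: int, load_size: int) -> bool:
    """Model of the three listed STLF conditions (hypothetical helper)."""
    same_start_byte = store_addr == load_addr
    store_covers_load = store_size >= load_size
    both_aligned = (not is_misaligned(store_addr, store_size)
                    and not is_misaligned(load_addr, load_size))
    return same_start_byte and store_covers_load and both_aligned


if __name__ == "__main__":
    # 8-byte store followed by a 4-byte load of its low half: eligible.
    print(stlf_eligible(0x1000, 8, 0x1000, 4))
    # 8-byte store followed by a 4-byte load of its high half: the load
    # starts on a different byte, so the first condition fails.
    print(stlf_eligible(0x1000, 8, 0x1004, 4))
    # 4-byte store followed by an 8-byte load: the store is smaller.
    print(stlf_eligible(0x1000, 4, 0x1000, 8))
```

A practical reading of the second case: writing a 64-bit value and then reading back only its upper 32 bits does not satisfy the listed conditions, even though the store contains all of the bytes the load needs.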

One STLF pitfall to avoid is aliasing, where store and load virtual address bits [15:4] match but bits [47:16] do not, because this can delay execution of the load.
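To check whether two addresses fall into this aliasing pattern, the bit ranges can be compared directly. The sketch below is an illustration only; `bit_field` and `is_stlf_alias` are hypothetical helper names, and the addresses in the example are made up.

```python
def bit_field(value: int, hi: int, lo: int) -> int:
    """Return bits [hi:lo] of value, inclusive on both ends."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


def is_stlf_alias(store_va: int, load_va: int) -> bool:
    """True when bits [15:4] match but bits [47:16] differ (hypothetical helper)."""
    low_match = bit_field(store_va, 15, 4) == bit_field(load_va, 15, 4)
    high_mismatch = bit_field(store_va, 47, 16) != bit_field(load_va, 47, 16)
    return low_match and high_mismatch


if __name__ == "__main__":
    store_va = 0x7F0000001230          # made-up example address
    load_va = store_va + 0x10000       # exactly 64 KiB higher
    # Bits [15:4] are 0x123 for both; bits [47:16] differ by one.
    print(is_stlf_alias(store_va, load_va))   # True
    # Same 16-byte chunk: upper bits match, so this is not an alias.
    print(is_stlf_alias(store_va, store_va + 8))  # False
```

Because bits [15:4] repeat every 64 KiB, buffers whose hot store and load addresses are separated by a multiple of 64 KiB are the ones that match this pattern.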

The LS unit can track up to eight outstanding in-flight cache misses.

The load-store pipelines are optimized for zero-segment-base operations. A load or store that has a non-zero segment base suffers a one-cycle penalty in the load-store pipeline. Most modern operating systems use zero segment bases while running user processes, and thus applications will not normally experience this penalty.

52128 Rev. 1.1 March 2013, Software Optimization Guide for AMD Family 16h Processors
Chapter 2, Microarchitecture of the Family 16h Processor, page 23

Section	Page
Contents	3
List of Figures	4
List of Tables	5
Revision History	6
1 Preface	7
2 Microarchitecture of the Family 16h Processor	8
2.1 Features	8
2.2 Instruction Decomposition	10
2.3 Superscalar Organization	10
2.4 Processor Block Diagram	11
2.5 Processor Cache Operations	11
2.5.1 L1 Instruction Cache	12
2.5.2 L1 Data Cache	12
2.5.3 L2 Cache	12
2.6 Memory Address Translation	13
2.6.1 L1 Translation Lookaside Buffers	13
2.6.2 L2 Translation Lookaside Buffers	13
2.6.3 Hardware Page Table Walker	13
2.7 Optimizing Branching	13
2.7.1 Branch Prediction	13
2.7.1.1 Next Address Logic	14
2.7.1.2 Branch Target Buffer	14
2.7.1.3 Branch Target Address Calculator	14
2.7.1.4 Out-of-Page Target Array	15
2.7.1.5 Branch Marker Caching	15
2.7.1.6 Return Address Stack	15
2.7.1.7 Indirect Target Predictor	16
2.7.1.8 Conditional Branch Predictor	16
2.7.1.9 Fetch Window Tracking Structure	16
2.7.2 Loop Alignment	16
2.7.2.1 Encoding Padding for Loop Alignment	16
2.7.2.2 Aligning Loops to Reduce Power Consumption	17
2.8 Instruction Fetch and Decode	18
2.9 Integer Unit	18
2.9.1 Integer Schedulers	18
2.9.2 Integer Execution Units	18
2.9.3 Retire Control Unit	19
2.10 Floating-Point Unit	19
2.10.1 Denormals	21
2.11 XMM Register Merge Optimization	22
2.12 Load Store Unit	23
Appendix A Instruction Latencies	24
A.1 Instruction Latency Assumptions	24
