AMD OS1354WBJ4BGHBOX Optimization Guide - Page 10


Software Optimization Guide for AMD Family 16h Processors
52128 Rev. 1.1 March 2013

2.2 Instruction Decomposition

The AMD Family 16h processor implements the AMD64 instruction set by means of macro-ops (the primary units of work managed by the processor) and micro-ops (the primitive operations executed in the processor's execution units). These operations are designed to include direct support for AMD64 instructions and adhere to the high-performance principles of fixed-length encoding, regularized instruction fields, and a large register set. This enhanced microarchitecture enables higher processor core performance and promotes straightforward extensibility for future designs.

Instructions are marked as fastpath single (one macro-op), fastpath double (two macro-ops), or microcode (more than two macro-ops). Macro-ops can normally contain up to two micro-ops. The table below lists some examples showing how instructions are mapped to macro-ops and how these macro-ops are mapped into one or more micro-ops.

Table 1. Typical Instruction Mappings

Instruction            Macro-ops  Micro-ops                         Comments
MOV reg,[mem]          1          1: load                           Fastpath single
MOV [mem],reg          1          1: store                          Fastpath single
MOV [mem],imm          1          2: move-imm, store                Fastpath single
REP MOVS [mem],[mem]   Many       Many                              Microcode
ADD reg,reg            1          1: add                            Fastpath single
ADD reg,[mem]          1          2: load, add                      Fastpath single
ADD [mem],reg          1          2: load/store, add                Fastpath single
MOVAPD [mem],xmm       1          2: store, FP-store-data           Fastpath single
VMOVAPD [mem],ymm      2          4: 2 × {store, FP-store-data}     256b AVX, Fastpath double
ADDPD xmm,xmm          1          1: addpd                          Fastpath single
ADDPD xmm,[mem]        1          2: load, addpd                    Fastpath single
VADDPD ymm,ymm         2          2: 2 × {addpd}                    256b AVX, Fastpath double
VADDPD ymm,[mem]       2          4: 2 × {load, addpd}              256b AVX, Fastpath double

2.3 Superscalar Organization

The AMD Family 16h processor is an out-of-order, two-way superscalar AMD64 processor. It can fetch, decode, and retire up to two AMD64 instructions per cycle.
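The mappings in Table 1 can be tallied mechanically for a short instruction sequence. The sketch below is only an illustration: the dictionary `TABLE_1`, the helper `count_ops`, and the sample `loop_body` are hypothetical names that do not come from the guide, and the lookup covers only the fastpath rows of Table 1 (the microcoded REP MOVS row is omitted because the guide gives its counts only as "Many").

```python
# Illustrative lookup built from the fastpath rows of Table 1.
# Each entry maps an instruction form to (macro-ops, micro-ops, decode class).
# TABLE_1 and count_ops are hypothetical names used only for this sketch.
TABLE_1 = {
    "MOV reg,[mem]":     (1, 1, "fastpath single"),
    "MOV [mem],reg":     (1, 1, "fastpath single"),
    "MOV [mem],imm":     (1, 2, "fastpath single"),
    "ADD reg,reg":       (1, 1, "fastpath single"),
    "ADD reg,[mem]":     (1, 2, "fastpath single"),
    "ADD [mem],reg":     (1, 2, "fastpath single"),
    "MOVAPD [mem],xmm":  (1, 2, "fastpath single"),
    "VMOVAPD [mem],ymm": (2, 4, "fastpath double"),
    "ADDPD xmm,xmm":     (1, 1, "fastpath single"),
    "ADDPD xmm,[mem]":   (1, 2, "fastpath single"),
    "VADDPD ymm,ymm":    (2, 2, "fastpath double"),
    "VADDPD ymm,[mem]":  (2, 4, "fastpath double"),
}


def count_ops(sequence):
    """Return (total macro-ops, total micro-ops) for a list of Table 1 forms."""
    macro = sum(TABLE_1[form][0] for form in sequence)
    micro = sum(TABLE_1[form][1] for form in sequence)
    return macro, micro


# A made-up four-instruction sequence mixing scalar and 256-bit AVX forms.
loop_body = [
    "MOV reg,[mem]",
    "ADD reg,[mem]",
    "VADDPD ymm,[mem]",
    "VMOVAPD [mem],ymm",
]

macro_ops, micro_ops = count_ops(loop_body)
print(f"macro-ops: {macro_ops}, micro-ops: {micro_ops}")
```

In this sample the two 256-bit AVX instructions account for four of the six macro-ops, which reflects the fastpath double entries in the table.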
The processor uses decoupled execution units to process instructions through fetch/branch-predict, decode, schedule/execute, and retirement pipelines.

The processor can fetch 32 bytes per cycle and can scan two 16-byte instruction windows for up to two instruction decodes per cycle. The decoder marks each instruction as fastpath single, fastpath double, or microcode. The dispatcher can send up to two macro-ops to the retire unit for tracking, as well as sending the corresponding micro-ops to the schedulers. These are upper limits, however. The actual number of bytes fetched or scanned, instructions decoded, or macro-ops dispatched may be lower, depending on a number of factors such as whether instructions can be broken up into 16-byte windows.

The processor uses decoupled independent schedulers, consisting of an integer ALU scheduler, an AGU scheduler, and a floating-point scheduler. These three schedulers can simultaneously issue up to six micro-ops to

Chapter 2, Microarchitecture of the Family 16h Processor (page 10)
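The decode and dispatch limits quoted above give a simple best-case bound on how fast a sequence can pass through the front end. The following is a rough sketch under stated assumptions, not a model of the real pipeline: it assumes each macro-op takes one of the two dispatch slots per cycle and each instruction one of the two decode slots, and it ignores every other factor the guide mentions (16-byte window alignment, fetch bandwidth, microcode, scheduler and execution-unit availability). The constant and function names are hypothetical.

```python
import math

# Upper limits taken from the text; the names are hypothetical.
DECODE_WIDTH = 2    # instructions decoded per cycle, at most
DISPATCH_WIDTH = 2  # macro-ops dispatched per cycle, at most


def min_front_end_cycles(macro_ops_per_instruction):
    """Best-case cycle count implied by the decode and dispatch limits alone.

    The argument is a list with one entry per instruction giving its
    macro-op count (1 for fastpath single, 2 for fastpath double).
    """
    n_instructions = len(macro_ops_per_instruction)
    n_macro_ops = sum(macro_ops_per_instruction)
    decode_cycles = math.ceil(n_instructions / DECODE_WIDTH)
    dispatch_cycles = math.ceil(n_macro_ops / DISPATCH_WIDTH)
    return max(decode_cycles, dispatch_cycles)


# Two fastpath singles followed by two fastpath doubles: 4 instructions, 6 macro-ops.
print(min_front_end_cycles([1, 1, 2, 2]))
```

Because these figures are upper limits, a measured cycle count can be higher than this bound. The estimate is useful mainly for spotting sequences where fastpath double instructions, rather than the instruction count, set the front-end floor.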

