AMD OS1354WBJ4BGHBOX Optimization Guide - Page 9

Movbe, Xsave, Xsaveopt, Lzcnt, Popcnt, Rdrand, Invpcid

Page 9 highlights

52128 Rev. 1.1, March 2013. Software Optimization Guide for AMD Family 16h Processors

• 128-bit and 256-bit single-instruction / multiple-data (SIMD) instructions. The following instruction subsets are supported:
    • Streaming SIMD Extensions 1 (SSE1)
    • Streaming SIMD Extensions 2 (SSE2)
    • Streaming SIMD Extensions 3 (SSE3)
    • Supplemental Streaming SIMD Extensions 3 (SSSE3)
    • Streaming SIMD Extensions 4a (SSE4a)
    • Streaming SIMD Extensions 4.1 (SSE4.1)
    • Streaming SIMD Extensions 4.2 (SSE4.2)
    • Advanced Vector Extensions (AVX)
• Half-precision floating-point conversion (F16C)
• Carry-less Multiply (CLMUL) instructions
• Advanced Encryption Standard (AES) acceleration instructions
• Bit Manipulation Instructions (BMI)
• Move Big-Endian instruction (MOVBE)
• XSAVE / XSAVEOPT
• LZCNT / POPCNT
• AMD Virtualization™ technology (AMD-V™)

The AMD Family 16h processor does not support the following instruction subsets:

• Fused Multiply/Add instructions (FMA3 / FMA4)
• XOP instructions
• Trailing bit manipulation (TBM) instructions
• Light-weight profiling (LWP) instructions
• Read and write fsbase and gsbase instructions
• RDRAND and INVPCID instructions

The AMD Family 16h processor includes many features designed to improve software performance.
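Software normally discovers the supported and unsupported subsets listed above at run time through CPUID rather than by hard-coding a processor family. The following is a minimal sketch, not taken from the guide, that decodes a few feature flags from the ECX value of CPUID function 0000_0001h. The helper names (`has_feature`, `example_family16h_ecx`, `print_features`) are hypothetical, and the example ECX value is assembled from the lists on this page purely for illustration; it is not a register dump from real hardware. On GCC or Clang the real value could be obtained with `__get_cpuid` from `<cpuid.h>`. Note that using AVX additionally requires confirming operating-system support (OSXSAVE and XGETBV), which this sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bit positions in ECX returned by CPUID function 0000_0001h. */
enum {
    BIT_SSE3 = 0,  BIT_PCLMULQDQ = 1, BIT_SSSE3 = 9,   BIT_FMA = 12,
    BIT_SSE41 = 19, BIT_SSE42 = 20,   BIT_MOVBE = 22,  BIT_POPCNT = 23,
    BIT_AES = 25,  BIT_XSAVE = 26,    BIT_AVX = 28,    BIT_F16C = 29,
    BIT_RDRAND = 30
};

/* Returns true when the given feature bit is set in the ECX value. */
static bool has_feature(uint32_t ecx, unsigned bit)
{
    assert(bit < 32); /* shifting by 32 or more would be undefined */
    return ((ecx >> bit) & 1u) != 0;
}

/* Illustrative ECX value (an assumption): sets only the bits for subsets
 * this page lists as supported; FMA and RDRAND stay clear. */
static uint32_t example_family16h_ecx(void)
{
    static const unsigned supported[] = {
        BIT_SSE3, BIT_PCLMULQDQ, BIT_SSSE3, BIT_SSE41, BIT_SSE42,
        BIT_MOVBE, BIT_POPCNT, BIT_AES, BIT_XSAVE, BIT_AVX, BIT_F16C
    };
    uint32_t ecx = 0;
    for (size_t i = 0; i < sizeof supported / sizeof supported[0]; ++i)
        ecx |= (uint32_t)1 << supported[i];
    return ecx;
}

/* Prints a short yes/no report for a handful of the decoded flags. */
static void print_features(uint32_t ecx)
{
    printf("MOVBE:  %s\n", has_feature(ecx, BIT_MOVBE)  ? "yes" : "no");
    printf("POPCNT: %s\n", has_feature(ecx, BIT_POPCNT) ? "yes" : "no");
    printf("AVX:    %s\n", has_feature(ecx, BIT_AVX)    ? "yes" : "no");
    printf("FMA:    %s\n", has_feature(ecx, BIT_FMA)    ? "yes" : "no");
    printf("RDRAND: %s\n", has_feature(ecx, BIT_RDRAND) ? "yes" : "no");
}
```

Calling `print_features(example_family16h_ecx())` reports MOVBE, POPCNT, and AVX as present and FMA and RDRAND as absent, mirroring the two lists. Code that selects an FMA or RDRAND path should therefore test the flag and fall back when it is clear.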
The microarchitecture provides the following key features:

• Unified 1-2-Mbyte L2 cache shared by up to 4 cores
• Integrated memory controller with memory prefetcher
• 32-Kbyte L1 instruction cache per core
• 32-Kbyte L1 data cache per core
• Prefetchers for L2 cache, L1 data cache, and L1 instruction cache
• Advanced dynamic branch prediction
• 32-byte instruction fetch
• 2-way x86 instruction decoding with sideband stack optimizer
• Dynamic out-of-order scheduling and speculative execution
• Two-way integer execution
• Two-way address generation (1 load and 1 store)
• Two-way 128-bit wide floating-point and packed integer execution
• Integer hardware divider
• Superforwarding
• L1 Instruction TLB of 32 4-Kbyte entries and L1 Data TLB of 40 4-Kbyte entries
• Four fully-symmetric core performance counters

Chapter 2, Microarchitecture of the Family 16h Processor, page 9
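As a worked illustration of the TLB figures in the feature list, the sketch below computes the "reach" of each L1 TLB, meaning the amount of memory it can map at once with 4-Kbyte pages (entries multiplied by page size). The entry counts and the 32-Kbyte L1 cache size come from the list; the function and constant names are hypothetical, and the guide itself does not present this calculation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Figures taken from the feature list; names are illustrative only. */
enum {
    PAGE_BYTES_4K   = 4096,
    L1_ITLB_ENTRIES = 32,
    L1_DTLB_ENTRIES = 40,
    L1_CACHE_BYTES  = 32 * 1024
};

/* TLB reach: bytes of address space covered when every entry maps one page. */
static size_t tlb_reach_bytes(size_t entries, size_t page_bytes)
{
    assert(page_bytes > 0);
    return entries * page_bytes;
}

/* Prints the reach of both L1 TLBs in Kbytes next to the L1 cache size. */
static void print_tlb_reach(void)
{
    size_t itlb = tlb_reach_bytes(L1_ITLB_ENTRIES, PAGE_BYTES_4K);
    size_t dtlb = tlb_reach_bytes(L1_DTLB_ENTRIES, PAGE_BYTES_4K);
    printf("L1 ITLB reach: %zu Kbytes\n", itlb / 1024);
    printf("L1 DTLB reach: %zu Kbytes\n", dtlb / 1024);
    printf("L1 cache size: %d Kbytes\n", L1_CACHE_BYTES / 1024);
}
```

With these inputs the instruction TLB covers 128 Kbytes and the data TLB 160 Kbytes, so each L1 TLB maps more memory than its 32-Kbyte L1 cache holds. A working set spread across more 4-Kbyte pages than that may start missing in the L1 TLB even if the touched lines would otherwise fit in cache.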

