AMD OS1354WBJ4BGHBOX Optimization Guide

AMD OS1354WBJ4BGHBOX - Third-Generation Opteron 2.2 GHz Processor Manual

AMD OS1354WBJ4BGHBOX manual content summary:

  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 1
    Software Optimization Guide for AMD Family 16h Processors Publication # 52128 Revision: 1.1 Issue Date: March 2013 Advanced Micro Devices
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 2
    AMD's Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty to support or sustain life, or in any other application in which the failure of AMD's AMD concerning such products or this documentation, for any interruption of service
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 3
    Contents: Revision History (6); 1 Preface (7); 2 Microarchitecture of the Family 16h Processor (8); 2.1 Features (8); 2.2 Instruction Decomposition (10); 2.3 Superscalar Organization (10); 2.4 Processor Block Diagram (11); 2.5 Processor Cache Operations
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 4
    List of Figures: Figure 1. Family 16h Processor Block Diagram (11); Figure 2. Integer Schedulers and Execution Units (18); Figure 3. Floating-point Unit Block Diagram (20)
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 5
    List of Tables: Table 1. Typical Instruction Mappings (10); Table 2. Summary of Floating-point Instruction Latencies (21)
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 6
    Revision History: March 2013, Rev. 1.1, Initial Public Release
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 7
    Refer to BIOS and Kernel Developers Guide (BKDG) for AMD Family 16h Models 00h-0Fh Processors (Order # 48751) for more information about machine-specific registers, debug, and performance profiling tools. Notational Convention Instruction mnemonics, micro-instructions, and example code are set in
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 8
    some of the key features of the AMD Family 16h Processor. The AMD Family 16h processor implements a specific subset of the AMD64 instruction set architecture. Instruction set architecture support includes: • General-purpose instructions, including support for 64-bit operands • x87 Floating-point
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 9
    • 128-bit and 256-bit single-instruction / multiple-data (SIMD) instructions. The following instruction subsets are supported: • Streaming SIMD Extensions 1 (SSE1) • Streaming SIMD Extensions 2 (SSE2) • Streaming
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 10
    2.2 Instruction Decomposition: The AMD Family 16h processor implements the AMD64 instruction set by means of macro-ops (the primary units of work managed by the processor) and micro-ops (the primitive operations executed in the processor
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 11
    Processor Block Diagram: A block diagram of the AMD Family 16h processor is shown below. Figure 1. Family 16h Processor Block Diagram. 2.5 Processor Cache Operations: AMD Family 16h processors use three different caches to accelerate instruction
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 12
    lines which need to be resident in the instruction cache simultaneously do not share the same virtual address bits [20:6]. 2.5.2 L1 Data Cache The AMD Family 16h processor contains a 32-Kbyte, 8-way set associative L1 data cache. This is a write-back cache that supports one 128-bit load and one 128
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 13
    2.6 Lookaside Buffers: The AMD Family 16h processor provides a 4-way set-associative L2 instruction TLB with 512 ... 4-Mbyte entries are also supported by returning a smashed 2-Mbyte TLB entry. INVLPG and INVLPGA instructions cause a flush of
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 14
    ... every cycle to support the 32 byte per cycle fetch bandwidth of the processor. When branches are ... sparse branch predictor and maps up to the first two branches per instruction cache line (64 bytes), for a total of 1024 entries. The
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 15
    Dense branches may or may not remain resident in the dense predictor when the L1 instruction cache is reloaded. Sparse markers in the shared L2 can be shared with other cores ... 6.2 in the Software Optimization Guide for AMD Family 10h and 12h Processors.
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 16
    2.7.1.7 Indirect Target Predictor: The processor implements ... Padding for Loop Alignment: Aligning loops is typically accomplished by adding NOP instructions ahead of the loop. This section provides guidance on the proper
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 17
    The table below lists encodings for NOP instructions of lengths from 1 to 15. Beyond length 8, longer NOP instructions are encoded by adding one or more operand size override prefixes (66h) to the beginning of the
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 18
    2.8 Instruction Fetch and Decode: The AMD Family 16h processor fetches instructions in 32-byte naturally aligned blocks. The processor can perform an instruction block fetch every cycle. The first two branches in a
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 19
    to an integer register destination (for example, ADD), but not for those instructions that only write flags (for example, CMP) or perform stores to memory. 2.10 Floating-Point Unit The AMD Family 16h processor provides native support for 32-bit single precision, 64-bit double precision, and 80-bit
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 20
    paths are 128 bits wide. As a result, the maximum throughput of both single-precision and double-precision floating-point SSE vector operations has improved by a factor of two over the AMD Family 14h processor. The
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 21
    the floating-point multiply unit (FPM ... instruction earlier. Most FPU instructions support local forwarding, which eliminates this delay when the consuming instruction executes in the same domain. However, some instructions
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 22
    if the denormal value was loaded into an XMM or YMM register from memory by a pure load instruction (such as [V]MOVUPS), or was produced by a vector-integer or logical instruction. The penalty will only occur once
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 23
    • CVTSS2SD • MOVLPS xmm1,[mem] • CVTSI2SD (32-/64-bit) • MOVSD xmm1,xmm2 • MOVLPD xmm1,[mem] • RCPSS • ROUNDSS • ROUNDSD • RSQRTSS • SQRTSD • SQRTSS. 2.12 Load Store Unit: The AMD Family 16h processor load-store (LS) unit
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 24
    Appendix A Instruction Latencies: The companion file AMD64_16h_InstrLatency_1.1.xlsx distributed with this Software Optimization Guide provides additional detailed information for the AMD Family 16h processor. The
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 25
    Columns Opn B-E: Instruction operands. The following notations are used ... If the entry in this column is simply 'ucode' then the instruction is microcoded but the exact number of macro-ops either has not been
  • AMD OS1354WBJ4BGHBOX | Optimization Guide - Page 26
    Column I Column J Column K The notation x2 or x3 appended to one of the above specifies the number of macro-ops executed on that unit for the instruction. For example, FPMx2 indicates the instruction requires two
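
The Page 26 excerpt describes an x2 or x3 suffix on an execution-unit name in the latency spreadsheet. A minimal sketch of how such an entry could be split when post-processing the spreadsheet, assuming the suffix is written directly after the unit name (as in FPMx2) and that an entry without a suffix means one macro-op. The helper name parse_unit_entry is hypothetical and not part of any AMD tool.

```python
import re

def parse_unit_entry(entry):
    """Split an entry such as 'FPMx2' into (unit name, macro-op count).

    Assumption: only the x2 and x3 suffixes mentioned in the excerpt are
    recognized; an entry without a suffix is treated as one macro-op.
    """
    entry = entry.strip()
    if not entry:
        raise ValueError("empty unit entry")
    match = re.fullmatch(r"(.+?)x([23])", entry)
    if match:
        return match.group(1), int(match.group(2))
    return entry, 1

if __name__ == "__main__":
    for text in ("FPM", "FPMx2"):
        print(text, "->", parse_unit_entry(text))
```

Unit names other than FPM are not visible in the excerpts, so the sketch accepts any name rather than checking against a fixed list.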
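
The Page 16 to Page 18 excerpts mention padding loops with NOP instructions, NOP encodings of 1 to 15 bytes, and instruction fetch in 32-byte naturally aligned blocks. The arithmetic behind such padding can be sketched as follows. This is only an illustration: aligning to the 32-byte fetch block is an assumption made here, the greedy split into the longest available NOPs is not taken from the guide, and the function names are hypothetical.

```python
FETCH_BLOCK_BYTES = 32  # fetch block size stated in the Page 18 excerpt
MAX_NOP_BYTES = 15      # longest NOP length stated in the Page 17 excerpt

def padding_to_boundary(addr, boundary=FETCH_BLOCK_BYTES):
    """Bytes of padding needed so that addr + padding is a multiple of boundary."""
    return (-addr) % boundary

def nop_lengths(pad, max_len=MAX_NOP_BYTES):
    """Greedily split pad bytes into NOP lengths of at most max_len bytes each."""
    lengths = []
    while pad > 0:
        step = min(pad, max_len)
        lengths.append(step)
        pad -= step
    return lengths

if __name__ == "__main__":
    loop_start = 0x401013  # hypothetical address of a loop's first instruction
    pad = padding_to_boundary(loop_start)
    print(f"pad {pad} bytes using NOP lengths {nop_lengths(pad)}")
```

In practice an assembler alignment directive performs this calculation. Whether a few long NOPs or several shorter ones decode better on this processor is a question for the guide's own padding recommendations, which the excerpts cut off.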

Software Optimization Guide for AMD Family 16h Processors
Publication # 52128
Revision: 1.1
Issue Date: March 2013
Advanced Micro Devices