AMD OS1354WBJ4BGHBOX Optimization Guide - Page 19
Floating-Point
UPC - 730143266024
Page 19 highlights
52128 Rev. 1.1 March 2013
Software Optimization Guide for AMD Family 16h Processors

Figure 2. Integer Schedulers and Execution Units

All integer operations can be handled in the ALUs (ALU0 and ALU1 are fully symmetrical) with the exception of integer multiply, integer divide, and three-operand LEA instructions. While two-operand LEA instructions are mapped as a single-cycle micro-op in the ALUs, three-operand LEA instructions are mapped to the store AGU and have a 2-cycle latency, with results inserted back into the ALU1 pipeline.

The integer multiply unit can handle multiplies of up to 32 bits × 32 bits with a 3-cycle latency, fully pipelined. 64-bit × 64-bit multiplies require data pumping and have a 6-cycle latency with a throughput rate of 1 every 4 cycles. If the multiply instruction has 2 destination registers, it incurs one additional cycle of latency and a one-cycle reduction in throughput. The radix-4 hardware integer divider unit can compute 2 bits of results per cycle.

2.9.3 Retire Control Unit

The retire control unit (RCU) tracks the completion status of all outstanding operations (integer, load/store, and floating-point) and is the final arbiter for exception processing and recovery. The unit can receive up to 2 macro-ops dispatched per cycle and track up to 64 macro-ops in flight. A macro-op is eligible to be committed by the retire unit when all corresponding micro-ops have finished execution. For most cases of fastpath double macro-ops (such as when an AVX 256-bit instruction is broken into two 128-bit macro-ops), it is further required that both macro-ops have finished execution before commitment can occur. The retire unit handles in-order commit of up to two macro-ops per cycle.

The retire control unit also manages internal integer register mapping and renaming. The integer physical register file (PRF) consists of 64 registers, with between 20 and 31 mapped to architectural state or microarchitectural temporary state.
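The in-order commit rule in section 2.9.3 can be illustrated with a small C model. This is only a sketch under stated assumptions: the function name `commit_cycles` and the array layout are hypothetical, the 64-entry tracking limit and the dispatch rate are ignored, and a macro-op is assumed to be able to commit in the same cycle in which its last micro-op finishes. Only the width of two macro-ops per cycle comes from the text.

```c
#include <assert.h>

/* From the text: up to two macro-ops commit per cycle, in program order. */
#define RETIRE_WIDTH 2

/*
 * Hypothetical helper, not part of any AMD tool or API.
 * done[i]   : cycle in which macro-op i (program order) finishes execution.
 * commit[i] : filled with the cycle in which macro-op i commits.
 * Assumption: a macro-op may commit in the cycle it finishes.
 */
void commit_cycles(const int *done, int *commit, int n)
{
    assert(n >= 0);
    int cycle = 0; /* cycle currently being filled with commits */
    int used = 0;  /* commit slots already used in that cycle   */

    for (int i = 0; i < n; i++) {
        if (done[i] > cycle) {
            /* An older op that finishes late stalls everything behind it. */
            cycle = done[i];
            used = 0;
        }
        if (used == RETIRE_WIDTH) {
            /* Both slots are taken, so move on to the next cycle. */
            cycle++;
            used = 0;
        }
        commit[i] = cycle;
        used++;
    }
}
```

With finish cycles {1, 5, 2, 2, 2}, this model commits the five macro-ops in cycles {1, 5, 5, 6, 6}: the second macro-op finishes late and holds back the three younger ones even though they completed earlier, and the two-per-cycle limit then spreads them over two cycles.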
The remaining 44 to 33 registers are available for out-of-order renames. Generally, physical register renames are needed for instructions that write to an integer register destination (for example, ADD), but not for those instructions that only write flags (for example, CMP) or perform stores to memory.

2.10 Floating-Point Unit

The AMD Family 16h processor provides native support for 32-bit single precision, 64-bit double precision, and 80-bit extended precision primary floating-point data types as well as 128-bit packed single and double precision vector floating-point data types. The 256-bit packed single and double precision vector floating-point data types are fully supported through the use of two 128-bit macro-ops per instruction. The floating-point load and store

Chapter 2 Microarchitecture of the Family 16h Processor 19
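The integer rename budget from section 2.9.3 is simple arithmetic, and a short C sketch makes it concrete. The names `rename_budget`, `renames_needed`, and `enum op_kind` are hypothetical and exist only for illustration. The classification follows the general rule stated in the text (register writers need a rename; flag-only writers and stores do not), which the guide itself qualifies with "generally", so treat the count as an estimate.

```c
#include <assert.h>

/* From the text: the integer physical register file has 64 entries. */
#define PRF_SIZE 64

/* Hypothetical classification of instructions for this sketch. */
enum op_kind {
    WRITES_INT_REG,    /* e.g. ADD: generally needs a rename      */
    WRITES_FLAGS_ONLY, /* e.g. CMP: generally needs no rename     */
    STORE_TO_MEMORY    /* stores:   generally need no rename      */
};

/* Registers left for out-of-order renames when `mapped` entries hold
 * architectural or temporary state (20 to 31 according to the text). */
int rename_budget(int mapped)
{
    assert(mapped >= 20 && mapped <= 31);
    return PRF_SIZE - mapped;
}

/* Estimate how many of the n instructions consume a rename register. */
int renames_needed(const enum op_kind *ops, int n)
{
    assert(n >= 0);
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (ops[i] == WRITES_INT_REG)
            count++;
    }
    return count;
}
```

`rename_budget(20)` gives 44 and `rename_budget(31)` gives 33, matching the range quoted in the text. A window of ADD, CMP, store, ADD would need two renames under this estimate.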