AMD OS1354WBJ4BGHBOX Optimization Guide - Page 21
52128 Rev. 1.1 March 2013 Software Optimization Guide for AMD Family 16h Processors

the floating-point multiply unit (FPM). The floating-point scheduler can issue one micro-op to one unit per pipe per cycle and provides logic to prevent pipeline hazards such as resource contention on the result bus.

The second organizational dimension for execution units is forwarding domains. The FPU is divided into three clusters, and forwarding between clusters requires an extra cycle in the bypass network. The three clusters are the Floating-point Cluster (composed of the FPM and FPA units), the Integer Cluster (composed of the VALU0, VALU1, and VIMUL units), and the Store / Convert Cluster (STC).

When the result of an instruction executing in one domain is consumed as input by a subsequent instruction executing in a different domain, there is a one-cycle forwarding delay. This delay does not increase the time that either instruction occupies the execution units, but the scheduler will not attempt to schedule the second instruction earlier. Most FPU instructions support local forwarding, which eliminates this delay when the consuming instruction executes in the same domain. However, some instructions (marked with the note "local forwarding disabled" in the latency spreadsheet) do not support local forwarding and experience the forwarding delay even when the consuming instruction executes in the same domain.

The following table summarizes the majority of instruction latencies in the FPU.

Table 2. Summary of Floating-point Instruction Latencies

Instruction Class                                  Latency     Throughput      Execution Pipe   Unit(s)         Cluster
SIMD ALU (most)                                    1           2 / cycle       Either           VALU0, VALU1    Integer
Floating-point logical                             1           2 / cycle       Either           FPA, FPM        Floating-pt
SIMD IMUL                                          2           1 / cycle       Pipe 0           VIM             Integer
Floating-point multiply single-precision           2           1 / cycle       Pipe 1           FPM             Floating-pt
Floating-point add                                 3           1 / cycle       Pipe 0           FPA             Floating-pt
Store/Convert (many)                               3           1 / cycle       Pipe 1           Store/Convert   STC
Floating-point multiply double-precision           4           1 / 2 cycles    Pipe 1           FPM             Floating-pt
Floating-point multiply extended-precision (x87)   5           1 / 3 cycles    Pipe 1           FPM             Floating-pt
Floating-point DIV/SQRT                            Iterative   Iterative       Pipe 1           FPM             Floating-pt

Refer to the AMD64_16h_InstrLatency.xlsx spreadsheet described in Appendix A for more instruction latency and throughput information.

2.10.1 Denormals

Denormal floating-point values (also called subnormals) can be created by a program either by explicitly specifying a denormal value in the source code or by calculations on normal floating-point values. A significant performance cost (more than 100 processor cycles) may be incurred when these values are encountered. For SSE/AVX instructions, the denormal penalties are a function of the configuration of MXCSR and the instruction sequences that are executed in the presence of a denormal value.

Denormal penalties may occur in two phases: usage of a denormal in a computation (pre-computation penalty), and production of a denormal during the execution of an instruction (post-computation penalty).

A sequence of floating-point compute instructions may incur a pre-computation penalty when a denormal value is encountered as an input. This penalty occurs on a floating-point computation instruction, such as [V]ADDPS,

Chapter 2 Microarchitecture of the Family 16h Processor 21
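As an illustration of the two points made in section 2.10.1 (denormals can arise from calculations on normal values, and their handling depends on the MXCSR configuration), the following is a minimal C sketch, not code from the guide. It assumes an x86-64 target where scalar float arithmetic is performed with SSE instructions. The bit positions of the FTZ (flush-to-zero, bit 15) and DAZ (denormals-are-zero, bit 6) controls are architectural; the helper names enable_ftz_daz, half_of_min_normal, and is_denormal are hypothetical and chosen only for this example. The sketch makes no claim about how much either setting changes the penalties on Family 16h processors.

```c
#include <float.h>      /* FLT_MIN: smallest positive normal float */
#include <stdint.h>     /* uint32_t */
#include <string.h>     /* memcpy */
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr */

/* MXCSR control bits (architectural on x86 SSE). */
#define MXCSR_DAZ 0x0040u  /* bit 6: treat denormal inputs as zero      */
#define MXCSR_FTZ 0x8000u  /* bit 15: flush denormal results to zero    */

/* Hypothetical helper: turn on FTZ and DAZ for the calling thread and
   return the previous MXCSR value so the caller can restore it later. */
static unsigned int enable_ftz_daz(void)
{
    unsigned int old = _mm_getcsr();
    _mm_setcsr(old | MXCSR_FTZ | MXCSR_DAZ);
    return old;
}

/* Hypothetical helper: multiply the smallest normal float by 0.5.
   Both operands are normal, but the exact result lies below the normal
   range. The volatile qualifiers force the multiply to happen at run
   time, so the current MXCSR setting applies. */
static float half_of_min_normal(void)
{
    volatile float x = FLT_MIN;
    volatile float half = 0.5f;
    return x * half;
}

/* Hypothetical helper: test the IEEE 754 binary32 encoding directly
   (exponent field zero, fraction nonzero). Inspecting the bits avoids
   floating-point compares, which DAZ would otherwise influence. */
static int is_denormal(float v)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    return (bits & 0x7F800000u) == 0 && (bits & 0x007FFFFFu) != 0;
}
```

A caller would typically save the value returned by enable_ftz_daz(), run the numeric kernel, and then pass the saved value back to _mm_setcsr(). With both bits clear, half_of_min_normal() yields a denormal; with FTZ set and the underflow exception masked (the default), the same multiply yields zero. Flushing changes numerical results, so whether it is acceptable depends on the application.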