AMD OS1354WBJ4BGHBOX Optimization Guide - Page 22
Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013

if the denormal value was loaded into an XMM or YMM register from memory by a pure load instruction (such as [V]MOVUPS), or was produced by a vector-integer or logical instruction. The penalty will only occur once per new denormal value in a sequence of floating-point instructions.

No such penalty occurs when the floating-point compute instruction is in load-op form and the memory operand is denormal, for example [V]ADDPS xmm0,[mem] where [mem] holds a denormal value. If a compiler can determine that a memory input to a floating-point sequence is denormal, it can avoid this pre-computation penalty by using a sequence such as XORPS xmm0,xmm0; ADDPS xmm0,[mem] instead of MOVUPS xmm0,[mem].

Vector ALU and logical instructions will also incur a pre-computation penalty if they encounter a denormal input that was produced by a floating-point instruction.

If software does not require the precision that denormals provide, it can set MXCSR.DAZ (bit 6). Any denormal input will then be treated as a zero without a pre-computation penalty.

Post-computation penalties occur when a floating-point compute instruction produces a denormal result and both the precision exception and the underflow exception are masked in the MXCSR (that is, both bit 11, the Underflow Mask, and bit 12, the Precision Mask, are set). If software does not require the precision that denormals provide, it can set MXCSR.FTZ (bit 15). Any denormal output will then be converted to zero without a post-computation penalty. Post-computation penalties generally cannot be eliminated by compilers.

If denormal precision is not required, it is recommended that software set both MXCSR.DAZ and MXCSR.FTZ. Note that setting MXCSR.DAZ or MXCSR.FTZ will cause the processor to produce results that are not compliant with the IEEE-754 standard when operating on or producing denormal values.
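The recommendation to set both MXCSR.DAZ and MXCSR.FTZ can be sketched in C with the SSE control-register intrinsics. This is a minimal illustration and not code from the guide: the helper names enable_daz_ftz and restore_mxcsr and the two macros are hypothetical, and the bit positions are the ones given in the text above (bit 6 and bit 15). It assumes an x86-64 compiler that provides <xmmintrin.h>. MXCSR is per-thread state, so the setting applies only to the calling thread.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr */

/* Bit positions as stated in the text above (names are illustrative). */
#define MXCSR_DAZ (1u << 6)   /* denormal inputs are treated as zero */
#define MXCSR_FTZ (1u << 15)  /* denormal results are flushed to zero */

/* Set DAZ and FTZ for the calling thread.
   Returns the previous MXCSR value so the caller can restore it. */
static unsigned int enable_daz_ftz(void)
{
    unsigned int old = _mm_getcsr();
    _mm_setcsr(old | MXCSR_DAZ | MXCSR_FTZ);
    return old;
}

/* Restore an MXCSR value previously returned by enable_daz_ftz(). */
static void restore_mxcsr(unsigned int saved)
{
    _mm_setcsr(saved);
}
```

A caller would typically save the old value around a performance-critical region and restore it afterwards, since code outside that region may rely on IEEE-754 denormal behavior.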
For x87 instructions, both pre-computation and post-computation penalties are incurred when denormals are encountered. A pre-computation penalty is incurred when loading denormal values from memory onto the x87 floating-point stack. A post-computation penalty is incurred when a floating-point compute instruction produces a denormal result and both the precision exception and underflow exception are masked in the x87 floating-point control word (FCW). The x87 FCW does not provide functionality equivalent to MXCSR.DAZ or MXCSR.FTZ, so it is not possible to avoid these denormal penalties when using x87 instructions that encounter or produce denormal values. Programs that call x87 floating-point routines that internally produce denormal values will potentially incur this penalty as well. To completely avoid this penalty, ensure that programs written using legacy x87 instructions do not produce denormal values.

2.11 XMM Register Merge Optimization

The AMD Family 16h processor implements an XMM register merge optimization. The processor keeps track of XMM registers whose upper portions have been cleared to zeros. This information can be followed through multiple operations and register destinations until non-zero data is written into a register. For certain instructions, this information can be used to bypass the usual result merging for the upper parts of the register. For instance, SQRTSS does not change the upper 96 bits of the destination register. If some instruction clears the upper 96 bits of its destination register and any arbitrary following sequence of instructions fails to write non-zero data in these upper 96 bits, then the SQRTSS instruction can proceed without waiting for any instructions that wrote to that destination register.

The instructions that benefit from this merge optimization are:
• CVTPI2PS
• CVTSI2SS (32-/64-bit)
• MOVSS xmm1,xmm2
• CVTSD2SS

22 Microarchitecture of the Family 16h Processor Chapter 2
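The SQRTSS case described in Section 2.11 can be approximated from C by zeroing the upper lanes of a register before the scalar operation. The sketch below is an assumption-laden illustration rather than code from the guide: the function name sqrt_low_lane is hypothetical, and the compiler is free to choose different instructions than the intrinsics suggest. It only shows the data pattern (upper 96 bits known to be zero, then a scalar operation that passes them through). Whether the hardware actually applies the merge optimization cannot be observed from C.

```c
#include <xmmintrin.h>  /* _mm_set_ss, _mm_sqrt_ss */

/* Compute sqrt(x) in lane 0 of a register whose upper three lanes are zero. */
static __m128 sqrt_low_lane(float x)
{
    __m128 v = _mm_set_ss(x);  /* lane 0 = x, lanes 1..3 = 0.0f */
    return _mm_sqrt_ss(v);     /* lane 0 = sqrt(x), lanes 1..3 copied from v */
}
```

The result keeps zeros in the upper lanes, so a following scalar instruction on the same register sees the same cleared-upper-bits pattern.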