AMD OS1354WBJ4BGHBOX Optimization Guide - Page 15
52128 Rev. 1.1 March 2013
Software Optimization Guide for AMD Family 16h Processors

2.7.1.4 Out-of-Page Target Array
The out-of-page target array (OPG) holds the high address bits ([28:12]) for 32 targets that are outside the current page for branches marked in the sparse BTB. Only sparse branches are eligible for out-of-page target prediction; branches marked by the dense predictor are not. Direct dense branches that are out-of-page have their targets corrected by the branch target address calculator with a 4-cycle penalty. Direct sparse branch targets that cross a 28-bit address block boundary (beyond the range of the out-of-page target array) are also corrected by the branch target address calculator.

2.7.1.5 Branch Marker Caching
When a cache line is evicted, the sparse marker information for the first two branches in that cache line is slightly compressed and written out into a subset of the L2 ECC bits, but only if the line contains instructions exclusively. These markers are brought back into the core and reloaded into the sparse predictor if their L2 line is reloaded into the L1 instruction cache before it is evicted from L2 or becomes the target of a store. Dense branches may or may not remain resident in the dense predictor when the L1 instruction cache is reloaded. Sparse markers in the shared L2 can be shared with other cores that fetch from the same L2 line. Software with an extremely large instruction footprint, especially software with multiple threads that share instruction cache lines, can take advantage of this property by targeting a branch density of no more than 2 branches per cache line.

2.7.1.6 Return Address Stack
The Family 16h processor implements a 16-entry return address stack (RAS) to predict return addresses from a near call. As calls are fetched, the address of the following instruction is pushed onto the return address stack.
Typically, the return address of the call is correctly predicted by the address popped off the top of the return address stack. However, mispredictions sometimes arise during speculative execution that can cause incorrect pushes and/or pops to the return address stack. The processor implements mechanisms that correctly recover the return address stack in most cases. If the return address stack cannot be recovered, it is invalidated and the execution hardware restores it to a consistent state.

The following commonly used coding practices optimized for other processor microarchitectures are not optimal for the Family 16h processor:

CALL 0h
In prior processor families (for example, Family 10h), a CALL 0h followed by a POP instruction was recommended for 32-bit software to get the RIP value into a general-purpose register. CALL 0h was recognized and treated specially, and the return address stack was kept consistent even though there was no return instruction paired with the call. On the Family 16h processor, CALL 0h is not treated specially, so this code sequence causes the RAS to get out of sync because of the unpaired call. It is recommended to avoid the use of CALL 0h in 32-bit software and instead use a true subroutine call, a MOV reg,[RSP] instruction, and a paired return to get the value of the RIP register into a general-purpose register.

REP RET
For prior processor families, such as Family 10h and 12h, a three-byte return-immediate RET instruction had been recommended as an optimization to improve performance over a single-byte near-return. With processor Families 15h and 16h, this is no longer recommended, and a single-byte near-return (opcode C3h) can be used with no negative performance impact. This results in smaller code size than the three-byte method. For the rationale for the former recommendation, see section 6.2 in the Software Optimization Guide for AMD Family 10h and 12h Processors.
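
The effect of an unpaired CALL 0h on the return address stack can be illustrated with a small software model. This is only a sketch under stated assumptions, not a description of the actual hardware: the class `ReturnAddressStack`, the function `simulate`, and all addresses are hypothetical names and values chosen for illustration. The model assumes that an overflowing stack drops its oldest entry, and it ignores speculative execution and the recovery mechanisms described above.

```python
from collections import deque

RAS_ENTRIES = 16  # entry count stated in section 2.7.1.6


class ReturnAddressStack:
    """Toy RAS: push on call, pop on return, count wrong predictions."""

    def __init__(self, entries=RAS_ENTRIES):
        # Assumption: on overflow the oldest entry is dropped (deque maxlen).
        self._entries = deque(maxlen=entries)
        self.mispredicts = 0

    def call(self, return_address):
        """A fetched call pushes the address of the following instruction."""
        self._entries.append(return_address)

    def ret(self, actual_target):
        """A return pops the prediction and compares it with the real target."""
        predicted = self._entries.pop() if self._entries else None
        if predicted != actual_target:
            self.mispredicts += 1
        return predicted


def simulate(depth, paired):
    """Run `depth` nested calls whose innermost routine reads its own address.

    paired=True  models: CALL helper ; helper: MOV reg,[RSP] ; RET
    paired=False models: CALL 0h ; POP reg  (no matching return)
    """
    ras = ReturnAddressStack()
    # Hypothetical return addresses for the nested ordinary calls.
    frames = [0x1000 + 0x10 * i for i in range(depth)]
    for address in frames:
        ras.call(address)

    after_call = 0x2005  # hypothetical address of the instruction after the CALL
    ras.call(after_call)  # both idioms push this address onto the RAS
    if paired:
        ras.ret(after_call)  # the helper's RET pops the entry again
    # Otherwise the POP only adjusts the software stack; the RAS keeps the entry.

    for address in reversed(frames):
        ras.ret(address)
    return ras.mispredicts
```

In this model, `simulate(3, paired=True)` reports 0 mispredicted returns, while `simulate(3, paired=False)` reports 3: the stale entry left by the unpaired call shifts every later prediction by one slot, so each enclosing return is predicted with the wrong address.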
Chapter 2 Microarchitecture of the Family 16h Processor 15
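
The branch density guideline in section 2.7.1.5 (no more than 2 branches per cache line) can be checked offline against a list of branch addresses taken from a disassembly. The following is a minimal sketch, not an AMD tool: the function names are hypothetical, the 64-byte line size is an assumption supplied as a parameter, and each branch is attributed to the line containing its first byte, which may differ from how the hardware marks branches.

```python
from collections import Counter

CACHE_LINE_BYTES = 64    # assumed instruction cache line size
MAX_SPARSE_BRANCHES = 2  # per-line target stated in section 2.7.1.5


def branches_per_line(branch_addresses, line_bytes=CACHE_LINE_BYTES):
    """Count branches per cache line, keyed by the line's base address."""
    counts = Counter()
    for address in branch_addresses:
        # Assumption: a branch belongs to the line holding its first byte.
        counts[address - (address % line_bytes)] += 1
    return counts


def lines_over_limit(branch_addresses, limit=MAX_SPARSE_BRANCHES,
                     line_bytes=CACHE_LINE_BYTES):
    """Return the sorted base addresses of lines holding more than `limit` branches."""
    counts = branches_per_line(branch_addresses, line_bytes)
    return sorted(line for line, n in counts.items() if n > limit)


# Hypothetical branch addresses, for illustration only.
example = [0x400010, 0x400028, 0x40003C, 0x400044, 0x400090]
flagged = lines_over_limit(example)
```

For the example list, only the line at base address 0x400000 is flagged, because it holds three branches; the other two lines hold one branch each. Lines flagged this way are candidates for padding or code reordering if the code is expected to be refetched from L2 often.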