AMD OS1354WBJ4BGHBOX Optimization Guide - Page 23
UPC - 730143266024
Page 23 highlights
52128 Rev. 1.1 March 2013 Software Optimization Guide for AMD Family 16h Processors
Chapter 2 Microarchitecture of the Family 16h Processor

• CVTSS2SD
• MOVLPS xmm1,[mem]
• CVTSI2SD (32-/64-BIT)
• MOVSD xmm1,xmm2
• MOVLPD xmm1,[mem]
• RCPSS
• ROUNDSS
• ROUNDSD
• RSQRTSS
• SQRTSD
• SQRTSS

2.12 Load Store Unit

The AMD Family 16h processor load-store (LS) unit handles data accesses. The LS unit contains two largely independent pipelines enabling the execution of one 128-bit load memory operation and one 128-bit store memory operation per cycle.

The LS unit includes a 16-entry memory ordering queue (MOQ). The MOQ receives both load and store operations at dispatch. Loads leave the MOQ when the load has completed and delivered data to the integer unit or the floating-point unit. Stores leave the MOQ when their address has been translated. The LS unit utilizes a 20-entry store queue which holds stores from dispatch until the store data can be written to the data cache.

The LS unit dynamically reorders operations, supporting both load operations bypassing older loads and loads bypassing older non-conflicting stores. The LS unit ensures that the processor adheres to the architectural load and store ordering rules as defined by the AMD64 architecture.

The LS unit supports store-to-load forwarding (STLF) when all of the following conditions are met:
• the store address and load address both start on the exact same byte
• the store operation size is the same or larger than the load operation size
• neither the load nor the store operation is misaligned

One STLF pitfall to avoid is aliasing, where the store and load virtual addresses match in bits [15:4] but differ in bits [47:16], because this can delay execution of the load.

The LS unit can track up to eight outstanding in-flight cache misses.

The load-store pipelines are optimized for zero-segment-base operations. A load or store that has a non-zero segment base suffers a one-cycle penalty in the load-store pipeline. Most modern operating systems use zero segment bases while running user processes, and thus applications will not normally experience this penalty.
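
The STLF conditions and the aliasing pitfall described above can be written down as two small address predicates. What follows is only an illustrative sketch in C, not anything taken from the guide: the function names `is_misaligned`, `stlf_eligible`, and `stlf_alias_hazard` are hypothetical, and the guide does not define "misaligned" on this page, so the sketch assumes the conservative reading that an access is misaligned when its address is not a multiple of its size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION of this sketch: "misaligned" means the address is not a
 * multiple of the access size. The guide does not define the term here. */
static bool is_misaligned(uint64_t addr, size_t size)
{
    return size != 0 && (addr % size) != 0;
}

/* Models the three listed STLF conditions:
 *   1. store and load start on the exact same byte
 *   2. store size is the same as or larger than the load size
 *   3. neither operation is misaligned (per the assumption above) */
static bool stlf_eligible(uint64_t st_addr, size_t st_size,
                          uint64_t ld_addr, size_t ld_size)
{
    return st_addr == ld_addr
        && st_size >= ld_size
        && !is_misaligned(st_addr, st_size)
        && !is_misaligned(ld_addr, ld_size);
}

/* Models the aliasing pitfall: virtual address bits [15:4] match
 * while bits [47:16] differ. */
static bool stlf_alias_hazard(uint64_t st_addr, uint64_t ld_addr)
{
    const uint64_t bits_15_4  = 0x000000000000FFF0ULL;
    const uint64_t bits_47_16 = 0x0000FFFFFFFF0000ULL;
    const uint64_t diff = st_addr ^ ld_addr;

    return (diff & bits_15_4) == 0 && (diff & bits_47_16) != 0;
}
```

Under this model, a store to one buffer followed closely by a load from another buffer located an exact multiple of 64 KB away reports an alias hazard, which suggests checking the relative placement of hot buffers when loads appear to stall behind unrelated stores.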