Software Optimization Guide for AMD Family 16h Processors   52128   Rev. 1.1   March 2013

2.5.1 L1 Instruction Cache

The AMD Family 16h processor contains a 32-Kbyte, 2-way set associative L1 instruction cache. The cache line size is 64 bytes; however, only 32 bytes are fetched per cycle. The L1 instruction cache fetches cache lines from the L2 cache, provides instruction bytes to the decoder, prefetches instructions, and predicts branches. Requests that miss in the L1 instruction cache are fetched from the L2 cache or, if not resident in the L2 cache, from system memory.

On a miss, the L1 instruction cache generates fill requests for the naturally aligned 64-byte block that includes the miss address and for one or two sequential blocks (prefetches). Because code typically exhibits spatial locality, prefetching is an effective technique for avoiding decode stalls. Cache-line replacement is based on a least-recently-used replacement algorithm. The L1 instruction cache is protected from errors through the use of parity.

Due to the indexing and tagging scheme used in the instruction cache, optimal performance is obtained when two hot cache lines that need to be resident in the instruction cache simultaneously do not share the same virtual address bits [20:6].

2.5.2 L1 Data Cache

The AMD Family 16h processor contains a 32-Kbyte, 8-way set associative L1 data cache. This is a write-back cache that supports one 128-bit load and one 128-bit store per cycle. The L1 data cache is likewise protected from bit errors through the use of parity, and a hardware prefetcher brings data into it to avoid misses. The L1 data cache has a 3-cycle integer load-to-use latency and a 5-cycle FPU load-to-use latency.

The data cache's natural alignment boundary is 16 bytes. A misaligned load or store operation suffers, at minimum, a one-cycle penalty in the load-store pipeline if it spans a 16-byte boundary.
Throughput for misaligned loads and stores is half that of aligned loads and stores, since a misaligned load or store requires two cycles to access the data cache (versus a single cycle for aligned accesses). For aligned memory accesses, the aligned and unaligned load and store instructions (for example, MOVAPS and MOVUPS) provide identical performance. Natural alignment for both 128-bit and 256-bit vectors is 16 bytes; there is no advantage in aligning 256-bit vectors to a 32-byte boundary on the Family 16h processor, because 256-bit vectors are loaded and stored as two 128-bit halves.

2.5.3 L2 Cache

The AMD Family 16h processor implements a unified, 16-way set associative L2 cache shared by up to four cores. This on-die L2 cache is inclusive of the L1 caches in the cores and is a write-back cache. The L2 cache has a variable load-to-use latency of no less than 25 cycles. The L2 cache size is 1 or 2 Mbytes, depending on configuration, and L2 cache entries are protected from errors through the use of an error-correcting code (ECC). The L2-to-L1 data path is 16 bytes wide; critical data within a cache line is forwarded first.

The L2 cache is divided into four 512-Kbyte banks. Bits 7:6 of the cache-line address determine which bank holds the line, so a large contiguous block of data naturally spreads its cache lines over all four banks. The banks can operate on requests in parallel, and each can deliver 16 bytes per cycle, for a total peak read bandwidth of 64 bytes per cycle. Peak bandwidth to any individual core is 16 bytes per cycle, so with four cores, the four banks can deliver 16 bytes of data to each core simultaneously. This banking scheme gives all four cores in the processing complex bandwidth comparable to what a private per-core L2 would provide.

12   Microarchitecture of the Family 16h Processor   Chapter 2