AMD OS1354WBJ4BGHBOX Optimization Guide - Page 13
UPC - 730143266024
52128 Rev. 1.1 March 2013
Software Optimization Guide for AMD Family 16h Processors

2.6 Memory Address Translation

A translation-lookaside buffer (TLB) holds the most-recently-used page mapping information. It assists and accelerates the translation of virtual addresses to physical addresses. A hardware table walker loads page table information into the TLBs. The AMD Family 16h processor utilizes a two-level TLB structure.

2.6.1 L1 Translation Lookaside Buffers

The AMD Family 16h processor contains a fully-associative L1 instruction TLB (ITLB) with 32 4-Kbyte page entries and 8 2-Mbyte page entries. The fully-associative L1 data TLB (DTLB) provides 40 4-Kbyte page entries and 8 2-Mbyte page entries.

2.6.2 L2 Translation Lookaside Buffers

The AMD Family 16h processor provides a 4-way set-associative L2 instruction TLB with 512 4-Kbyte page entries. The L2 data TLB provides two independent translation buffers which are accessed in parallel; a 4-way set-associative buffer with 512 4-Kbyte page entries and a 2-way set-associative buffer with 256 2-Mbyte page entries.

2.6.3 Hardware Page Table Walker

The hardware page table walker handles L2 TLB misses. Misses can start speculatively from either the instruction or the data side. The table walker includes a 16-entry Page Directory Cache (PDC) to speed up table walks.

The table walker supports 1-Gbyte pages by smashing the page into a 2-Mbyte window, and returning a 2-Mbyte TLB entry. In legacy mode, 4-Mbyte entries are also supported by returning a smashed 2-Mbyte TLB entry. INVLPG and INVLPGA instructions cause a flush of the entire TLB if any 1-Gbyte smashed entries have been created since the last flush. System software may wish to avoid the use of 1-Gbyte pages.

In a nested paging environment, the processor does not create smashed entries if the nested page tables use 1-Gbyte pages but the guest page tables do not use 1-Gbyte pages. See the definition of the terms smashing and smashed in the Preface.
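The entry counts in sections 2.6.1 and 2.6.2 can be turned into an approximate "TLB reach" (the amount of memory each structure can map at once) with simple arithmetic. The sketch below is illustrative and not part of the guide: the table layout and the function names (`reach_bytes`, `num_sets`, `smashed_entries_per_1g_page`) are hypothetical, and it assumes that Kbyte, Mbyte, and Gbyte mean 2^10, 2^20, and 2^30 bytes.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# (entries, page size in bytes, ways) as quoted in sections 2.6.1 and 2.6.2.
# ways=None marks a fully-associative structure.
TLBS = {
    "L1 ITLB 4K": (32, 4 * KIB, None),
    "L1 ITLB 2M": (8, 2 * MIB, None),
    "L1 DTLB 4K": (40, 4 * KIB, None),
    "L1 DTLB 2M": (8, 2 * MIB, None),
    "L2 ITLB 4K": (512, 4 * KIB, 4),
    "L2 DTLB 4K": (512, 4 * KIB, 4),
    "L2 DTLB 2M": (256, 2 * MIB, 2),
}


def reach_bytes(name):
    """Memory covered when every entry of the structure is in use."""
    entries, page_size, _ = TLBS[name]
    return entries * page_size


def num_sets(name):
    """Number of sets; a fully-associative structure is treated as one set."""
    entries, _, ways = TLBS[name]
    return 1 if ways is None else entries // ways


def smashed_entries_per_1g_page():
    """How many distinct 2-Mbyte windows one 1-Gbyte page spans."""
    return GIB // (2 * MIB)


if __name__ == "__main__":
    for name in TLBS:
        kib = reach_bytes(name) // KIB
        print(f"{name}: reach {kib} KiB, sets {num_sets(name)}")
    print("2-Mbyte windows per 1-Gbyte page:", smashed_entries_per_1g_page())
```

Under these assumptions a single 1-Gbyte page spans 512 2-Mbyte windows, which is more than the 256 2-Mbyte entries in the L2 data TLB. This is derived arithmetic rather than a statement from the guide, but it is consistent with the guide's suggestion that system software may wish to avoid 1-Gbyte pages.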
2.7 Optimizing Branching

Branching can reduce throughput when instruction execution must wait on the completion of the instructions prior to the branch that determine whether the branch is taken. The Family 16h processor integrates logic that is designed to reduce the average cost of conditional branching by attempting to predict the outcome of a branch decision prior to the resolution of the condition upon which the decision is based. This prediction is used to speculatively fetch, decode, and execute instructions on the predicted path. When the prediction is correct, waiting is avoided and the instruction throughput is increased. The minimum branch misprediction penalty is 14 cycles.

The following topic describes the branch prediction hardware facilities of the processor. This is followed by a discussion of how to align code within a loop to use the loop optimization hardware to its fullest advantage.

2.7.1 Branch Prediction

To predict and accelerate branches the AMD Family 16h processor employs:

Chapter 2 Microarchitecture of the Family 16h Processor 13