AMD OS1354WBJ4BGHBOX Optimization Guide - Page 24
UPC - 730143266024
Software Optimization Guide for AMD Family 16h Processors
52128 Rev. 1.1 March 2013

Appendix A Instruction Latencies

The companion file AMD64_16h_InstrLatency_1.1.xlsx distributed with this Software Optimization Guide provides additional detailed information for the AMD Family 16h processor. The first worksheet in the spreadsheet, "Overview," provides some useful reference information which is related to the second worksheet, "Latencies." This appendix explains the columns and definitions used in the table of latencies. Information in the spreadsheet is based on estimates and is subject to change.

A.1 Instruction Latency Assumptions

The term instruction latency refers to the number of processor clock cycles required to complete the execution of a particular instruction from the time that it is issued. Throughput refers to the number of results that can be generated in a unit of time given the repeated execution of a given instruction.

Many factors affect instruction execution time. For instance, when a source operand must be loaded from a memory location, the time required to read the operand from system memory adds to the execution time. Furthermore, latency is highly variable due to the fact that a memory operand may or may not be found in one of the levels of data cache. In some cases, the target memory location may not even be resident in system memory due to being paged out to backing storage.

In estimating the instruction latency and reciprocal throughput, the following assumptions are necessary:

• The instruction is an L1 I-cache hit that has already been fetched and decoded, with the operations loaded into the scheduler.
• Memory operands are in the L1 data cache.
• There is no contention for execution resources or load-store unit resources.

Each latency value listed in the spreadsheet denotes the typical execution time of the instruction when run in isolation on a processor.
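The difference between latency and reciprocal throughput can be illustrated with a simplified pipeline model. The sketch below is not taken from the guide: the function names are hypothetical, the cycle counts are placeholder values rather than spreadsheet entries, and the model assumes the same ideal conditions listed above (L1 hits, no resource contention).

```python
def dependent_chain_cycles(n, latency):
    """Cycles for n instructions where each one consumes the previous result.

    No overlap is possible, so every instruction pays its full latency.
    """
    return n * latency


def independent_stream_cycles(n, latency, recip_throughput):
    """Cycles for n independent instructions of the same kind.

    A new instruction can start every `recip_throughput` cycles, and the
    last one still needs its full latency before its result is available.
    """
    if n == 0:
        return 0
    return (n - 1) * recip_throughput + latency


# Hypothetical figures for illustration only, NOT values from the spreadsheet.
LATENCY = 3            # cycles from issue to result
RECIP_THROUGHPUT = 1   # cycles between successive independent issues

print(dependent_chain_cycles(100, LATENCY))                       # 300
print(independent_stream_cycles(100, LATENCY, RECIP_THROUGHPUT))  # 102
```

Under these assumptions, a loop bound by a dependency chain is limited by latency, while a loop of independent operations is limited by reciprocal throughput.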
For real programs executed on this highly aggressive super-scalar family of processors, multiple instructions can execute simultaneously; therefore, the effective latency for any given instruction's execution may be overlapped with the latency of other instructions executing in parallel.

The latencies in the spreadsheet reflect the number of cycles from instruction issuance to instruction retirement. This includes the time to write results to registers or the write buffer, but not the time for results to be written from the write buffer to L1 D-cache, which may not occur until after the instruction is retired.

For most instructions, the only forms listed are the ones without memory operands. The latency for instruction forms that load from memory can be calculated by adding the load latencies listed on the overview worksheet to the latency for the register-only form.

To measure the latency of an instruction which stores data to memory, it is necessary to define an end-point at which the instruction is said to be complete. This guide has chosen instruction retirement as the end point, and under that definition writes add no additional latency. Choosing another end point, such as the point at which the data has been written to the L1 cache, would result in variable latencies and would not be meaningful without taking into account the context in which the instruction is executed.

There are cases where additional latencies may be incurred in a real program that are not described in the spreadsheet, such as delays caused by L1 cache misses or contention for execution or load-store unit resources.

A.2 Spreadsheet Column Descriptions

The following describes the information provided in each column of the spreadsheet:

Column A - Instruction: Instruction opcodes
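As a worked instance of the A.1 rule for forms that load from memory, the short sketch below adds a load latency to a register-only latency. The function name and both numbers are hypothetical placeholders; the real values should be read from the "Overview" and "Latencies" worksheets.

```python
def memory_form_latency(register_form_latency, load_latency):
    """Estimated latency of an instruction form with a memory source operand.

    Follows the rule stated in A.1: register-only latency plus the load
    latency, assuming the operand hits in the L1 data cache.
    """
    return register_form_latency + load_latency


# Placeholder values for illustration only, NOT taken from the spreadsheet.
REGISTER_FORM_LATENCY = 2  # cycles, hypothetical register-only form
LOAD_LATENCY = 3           # cycles, hypothetical L1 D-cache load latency

print(memory_form_latency(REGISTER_FORM_LATENCY, LOAD_LATENCY))  # 5
```

Store forms need no such adjustment under the guide's definition, because retirement is the chosen end point.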