AMD AMD-K6-2/500AFX Data Sheet - Page 38

Branch History Table, Branch Target Cache, Return Address Stack, bytes. In total

Page 38 highlights

AMD-K6®-2 Processor Data Sheet Preliminary Information 21850J/0-February 2000 Branch History Table Branch Target Cache Return Address Stack program behavior and its negative effects on instruction execution, such as stalls due to delayed instruction fetching and the draining of the processor pipeline. The branch logic contains an 8192-entry branch history table, a 16-entry by 16-byte branch target cache, a 16-entry return address stack, and a branch execution unit. The AMD-K6-2 processor handles unconditional branches without any penalty by redirecting instruction fetching to the target address of the unconditional branch. However, conditional branches require the use of the dynamic branch-prediction mechanism built into the AMD-K6-2 p ro c e s s o r. A t wo -l eve l a d a p t ive h i s t o ry a l g o r i t h m i s implemented in an 8192-entry branch history table. This table stores executed branch information, predicts individual branches, and predicts the behavior of groups of branches. To accommodate the large branch history table, the AMD-K6-2 processor does not store predicted target addresses. Instead, the branch target addresses are calculated on-the-fly using ALUs during the decode stage. The adders calculate all possible target addresses before the instructions are fully decoded and the processor chooses which addresses are valid. To avoid a one clock cache-fetch penalty when a branch is predicted taken, a built-in branch target cache supplies the first 16 bytes of instructions directly to the instruction buffer (assuming the target address hits this cache). (See Figure 3 on page 11.) The branch target cache is organized as 16 entries of 16 bytes. In total, the branch prediction logic achieves branch prediction rates greater than 95%. The return address stack is a special device designed to optimize CALL and RET pairs. Software is typically compiled with subroutines that are frequently called from various places in a program. This is usually done to save space. Entry into the subroutine occurs with the execution of a CALL instruction. At that time, the processor pushes the address of the next instruction in memory following the CALL instruction onto the stack (allocated space in memory). When the processor encounters a RET instruction (within or at the end of the subroutine), the branch logic pops the address from the stack and begins fetching from that location. To avoid the latency of main memory accesses during CALL and RET operations, the return address stack caches the pushed addresses. 18 Internal Architecture Chapter 2

  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60
  • 61
  • 62
  • 63
  • 64
  • 65
  • 66
  • 67
  • 68
  • 69
  • 70
  • 71
  • 72
  • 73
  • 74
  • 75
  • 76
  • 77
  • 78
  • 79
  • 80
  • 81
  • 82
  • 83
  • 84
  • 85
  • 86
  • 87
  • 88
  • 89
  • 90
  • 91
  • 92
  • 93
  • 94
  • 95
  • 96
  • 97
  • 98
  • 99
  • 100
  • 101
  • 102
  • 103
  • 104
  • 105
  • 106
  • 107
  • 108
  • 109
  • 110
  • 111
  • 112
  • 113
  • 114
  • 115
  • 116
  • 117
  • 118
  • 119
  • 120
  • 121
  • 122
  • 123
  • 124
  • 125
  • 126
  • 127
  • 128
  • 129
  • 130
  • 131
  • 132
  • 133
  • 134
  • 135
  • 136
  • 137
  • 138
  • 139
  • 140
  • 141
  • 142
  • 143
  • 144
  • 145
  • 146
  • 147
  • 148
  • 149
  • 150
  • 151
  • 152
  • 153
  • 154
  • 155
  • 156
  • 157
  • 158
  • 159
  • 160
  • 161
  • 162
  • 163
  • 164
  • 165
  • 166
  • 167
  • 168
  • 169
  • 170
  • 171
  • 172
  • 173
  • 174
  • 175
  • 176
  • 177
  • 178
  • 179
  • 180
  • 181
  • 182
  • 183
  • 184
  • 185
  • 186
  • 187
  • 188
  • 189
  • 190
  • 191
  • 192
  • 193
  • 194
  • 195
  • 196
  • 197
  • 198
  • 199
  • 200
  • 201
  • 202
  • 203
  • 204
  • 205
  • 206
  • 207
  • 208
  • 209
  • 210
  • 211
  • 212
  • 213
  • 214
  • 215
  • 216
  • 217
  • 218
  • 219
  • 220
  • 221
  • 222
  • 223
  • 224
  • 225
  • 226
  • 227
  • 228
  • 229
  • 230
  • 231
  • 232
  • 233
  • 234
  • 235
  • 236
  • 237
  • 238
  • 239
  • 240
  • 241
  • 242
  • 243
  • 244
  • 245
  • 246
  • 247
  • 248
  • 249
  • 250
  • 251
  • 252
  • 253
  • 254
  • 255
  • 256
  • 257
  • 258
  • 259
  • 260
  • 261
  • 262
  • 263
  • 264
  • 265
  • 266
  • 267
  • 268
  • 269
  • 270
  • 271
  • 272
  • 273
  • 274
  • 275
  • 276
  • 277
  • 278
  • 279
  • 280
  • 281
  • 282
  • 283
  • 284
  • 285
  • 286
  • 287
  • 288
  • 289
  • 290
  • 291
  • 292
  • 293
  • 294
  • 295
  • 296
  • 297
  • 298
  • 299
  • 300
  • 301
  • 302
  • 303
  • 304
  • 305
  • 306
  • 307
  • 308
  • 309
  • 310
  • 311
  • 312
  • 313
  • 314
  • 315
  • 316
  • 317
  • 318
  • 319
  • 320
  • 321
  • 322
  • 323
  • 324
  • 325
  • 326
  • 327
  • 328
  • 329
  • 330

18
Internal Architecture
Chapter 2
AMD-K6
®
-2 Processor Data Sheet
21850J/0—February 2000
Preliminary Information
program behavior and its negative effects on instruction
execution, such as stalls due to delayed instruction fetching and
the draining of the processor pipeline. The branch logic
contains an 8192-entry branch history table, a 16-entry by
16-byte branch target cache, a 16-entry return address stack,
and a branch execution unit.
Branch History Table
The AMD-K6-2 processor handles unconditional branches
without any penalty by redirecting instruction fetching to the
target address of the unconditional branch. However,
conditional branches require the use of the dynamic
branch-prediction mechanism built into the AMD-K6-2
processor. A two-level adaptive history algorithm is
implemented in an 8192-entry branch history table. This table
stores executed branch information, predicts individual
branches, and predicts the behavior of groups of branches. To
accommodate the large branch history table, the AMD-K6-2
processor does not store predicted target addresses. Instead,
the branch target addresses are calculated on-the-fly using
ALUs during the decode stage. The adders calculate all
possible target addresses before the instructions are fully
decoded and the processor chooses which addresses are valid.
Branch Target Cache
To avoid a one clock cache-fetch penalty when a branch is
predicted taken, a built-in branch target cache supplies the first
16 bytes of instructions directly to the instruction buffer
(assuming the target address hits this cache). (See Figure 3 on
page 11.) The branch target cache is organized as 16 entries of
16 bytes. In total, the branch prediction logic achieves branch
prediction rates greater than 95%.
Return Address Stack
The return address stack is a special device designed to
optimize CALL and RET pairs. Software is typically compiled
with subroutines that are frequently called from various places
in a program. This is usually done to save space. Entry into the
subroutine occurs with the execution of a CALL instruction. At
that time, the processor pushes the address of the next
instruction in memory following the CALL instruction onto the
stack (allocated space in memory). When the processor
encounters a RET instruction (within or at the end of the
subroutine), the branch logic pops the address from the stack
and begins fetching from that location. To avoid the latency of
main memory accesses during CALL and RET operations, the
return address stack caches the pushed addresses.