IBM 88554RU Installation Guide

Chapter 1. Technical description

Caches also improve performance because they reduce queuing time of
accesses that miss the caches and require a physical memory access.
For most commercial applications, cache hit rates are usually greater than 70
percent. In this case, the cache greatly reduces memory latency because
most processor memory requests are serviced by the faster cache. The
caches act as filters and reduce the load on the memory controller, which
results in lower queuing delays (waiting in line) at the memory controller,
thereby speeding up the average memory access time.
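
To make this concrete, the short Python sketch below works through the standard
average-memory-access-time calculation. The 70 percent hit rate comes from the
text above; the 10 ns and 200 ns latency figures are placeholder assumptions
chosen only to show the shape of the calculation, not measured values for this
system.

    # Illustrative sketch only: the latencies are assumed placeholder values,
    # not measurements of this system.
    def average_access_time(hit_rate, cache_latency_ns, memory_latency_ns):
        # Hits are served at cache speed; misses pay the full memory latency.
        return hit_rate * cache_latency_ns + (1.0 - hit_rate) * memory_latency_ns

    # A 70 percent hit rate, typical for commercial workloads per the text above:
    print(f"{average_access_time(0.70, 10, 200):.1f} ns")   # 67.0 ns versus 200 ns uncached

Even a modest hit rate therefore cuts the average latency to roughly one third
of the raw memory latency, and every hit is also one less request queuing at the
memory controller.
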
Another bottleneck in many SMP systems is the front-side bus, which connects
the processors to the shared memory controller. Processor-to-memory requests
travel across the front-side bus, which can become overloaded when three or
more high-speed CPUs are added to the same bus. This, in turn, leads to a
performance bottleneck and lower system scalability.

Large processor caches also help improve performance because they filter many
of the requests that would otherwise travel over the front-side bus (a
processor cache hit does not require a memory transaction on the front-side
bus).

However, even with a large L3 cache, the number of memory transactions that
miss the cache is still great enough that the memory controller often becomes
a bottleneck. This typically happens when more than three or four processors
are installed in the same system.

Non-uniform Memory Access (NUMA) is an architecture designed to improve
performance and solve latency problems inherent in large (greater than four
processors) SMP systems. The x455 implements a NUMA-based architecture
and can scale up to 16 processors using multiple servers.

Each server contains up to four CPUs and 28 memory DIMMs. Each server also has
a dedicated local Cache and Scalability Controller, Memory Controller, and
64 MB XceL4 Level 4 cache. The additional fourth level of cache greatly
improves performance for the four processors in the server because it is able
to respond to a majority of processor-to-memory requests, thereby reducing the
load on the memory controller and speeding up average memory access times.
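
As a rough illustration of this filtering effect, the sketch below estimates
how many requests still reach the memory controller after both the processor
caches and the XceL4 cache have had a chance to service them. The hit rates
used here are assumptions for illustration only; actual rates depend on the
workload.

    # Illustrative sketch only: hit rates are assumed values, not x455 measurements.
    def requests_reaching_memory(total_requests, processor_hit_rate, l4_hit_rate):
        misses_leaving_processor = total_requests * (1.0 - processor_hit_rate)
        # The XceL4 cache services a share of those misses before they reach memory.
        return misses_leaving_processor * (1.0 - l4_hit_rate)

    # If 70 percent of requests hit the processor caches and 70 percent of the
    # remainder hit the XceL4 cache, only 9 percent of all requests reach memory:
    print(f"{requests_reaching_memory(1000, 0.70, 0.70):.0f}")   # 90 of 1000

With most traffic absorbed before it arrives, the memory controller sees only a
small fraction of the original request stream, which keeps queuing delays low
inside each server.
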
As shown in Figure 1-12 on page 23, each server is connected to another server
using three independent 3.2 GBps scalability cables. These scalability cables
mirror front-side bus operations to all other servers and are key to building large
multiprocessing multinode systems.
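
For a sense of scale, the aggregate link bandwidth per server follows directly
from the figures above, assuming all three cables can be driven concurrently:

    # Simple aggregate-bandwidth arithmetic for the scalability links; assumes
    # all three cables are active at once.
    cables_per_server = 3
    bandwidth_per_cable_gbps = 3.2           # GBps per cable, as stated above
    print(f"{cables_per_server * bandwidth_per_cable_gbps:.1f} GBps per server")   # 9.6 GBps
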
By mirroring transactions on the front-side bus across the scalability links to other
processors, the x455 is able to run standard SMP software. All SMP systems
must perform processor-to-processor communication (also known as “snooping”)
to ensure that all processors receive the most recent copy of requested data.
Since any processor can store data in a local cache and modify that data at any