HP DL360 The Intel processor roadmap for industry-standard servers technology - Page 8

Since multi-processing operating systems such as Microsoft Windows and Linux are designed to divide their workload into threads that can be independently scheduled, these operating systems can send two distinct threads to work their way through execution in the same device. This provides the opportunity for a higher level of parallelism: at the thread level rather than simply at the instruction level, as in the Pentium 4 design. To illustrate this concept, refer to Table 3. Instruction-level parallelism takes advantage of opportunities in the instruction stream to execute independent instructions at the same time. Thread-level parallelism, shown in Table 4, takes this a step further, since two independent instruction streams are available for simultaneous execution.

Note that the performance gain from adding HT Technology does not equal the expected gain from adding a second physical processor or processor core. The overhead of maintaining the threads and the requirement to share processor resources limit HT Technology performance. Nevertheless, HT Technology was a valuable and cost-effective addition to the Pentium 4 design.

Table 3. Example of instruction-level parallelism

Instruction number	Instruction thread	Instruction execution
1	Read register A	Operations 1, 2, and 3 are independent and can execute simultaneously if resources permit.
2	Write register B
3	Read register C
4	Add A + B	This operation must wait for instructions 1 and 2 to complete, but it can execute in parallel with operation 3.
5	Inc A	This operation needs to wait for the completion of instruction 4 before executing.
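
The dependencies in Table 3 can be sketched in a few lines of Python. This is only an illustration, not a model of the actual Pentium 4 scheduler: it assumes unlimited execution units and a one-cycle latency for every instruction, and the names deps and earliest_cycles are hypothetical.

```python
# Dependencies taken from Table 3: instruction -> instructions it must wait for.
deps = {
    1: [],      # Read register A
    2: [],      # Write register B
    3: [],      # Read register C
    4: [1, 2],  # Add A + B waits for instructions 1 and 2
    5: [4],     # Inc A waits for instruction 4
}

def earliest_cycles(deps):
    """Earliest cycle in which each instruction can execute.

    Assumes unlimited execution units and one-cycle latency (illustrative only).
    """
    cycle = {}
    for instr in sorted(deps):  # program order; dependencies point backward
        cycle[instr] = 1 + max((cycle[d] for d in deps[instr]), default=0)
    return cycle

schedule = earliest_cycles(deps)
print(schedule)  # {1: 1, 2: 1, 3: 1, 4: 2, 5: 3}
```

Under these assumptions, instructions 1, 2, and 3 share the first cycle, instruction 4 follows in the second, and instruction 5 in the third.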

Table 4. Example of thread-level parallelism

Instruction number	Instruction thread (Thread 1)	Instruction number	Instruction thread (Thread 2)
1a	Read register A	1b	Add D + E
2a	Write register B	2b	Inc E
3a	Read register C	3b	Read F
4a	Add A + B	4b	Add E + F
5a	Inc A	5b	Write E

Instruction execution: None of the instructions in Thread 2 depend on those in Thread 1; therefore, to the extent that execution units are available, any of them can execute in parallel with those in Thread 1. As an example, instruction 2b must wait for instruction 1b, but does not need to wait for 1a. Similarly, if two arithmetic units are available, 4a and 4b can execute at the same time.
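
The effect of sharing execution units between two threads can be sketched with a simple greedy scheduler. This is a rough illustration under stated assumptions, not Intel's implementation: every instruction takes one cycle, at most UNITS instructions issue per cycle, and the Thread 2 dependencies other than "2b waits for 1b" are inferred from the register names rather than stated in Table 4. The names schedule, thread1, thread2, and UNITS are hypothetical.

```python
def schedule(deps, units):
    """Greedy issue: each cycle, run up to `units` instructions whose
    dependencies finished in an earlier cycle. One-cycle latency assumed."""
    done = {}              # instruction -> cycle in which it executed
    pending = list(deps)   # kept in the order the instructions are listed
    cycle = 0
    while pending:
        cycle += 1
        ready = [i for i in pending if all(d in done for d in deps[i])]
        for instr in ready[:units]:
            done[instr] = cycle
            pending.remove(instr)
    return done

# Thread 1 dependencies follow Table 3.
thread1 = {"1a": [], "2a": [], "3a": [], "4a": ["1a", "2a"], "5a": ["4a"]}
# Thread 2: only "2b waits for 1b" is stated in the table; the rest is assumed.
thread2 = {"1b": [], "2b": ["1b"], "3b": [], "4b": ["2b", "3b"], "5b": ["4b"]}

UNITS = 2  # assumed number of execution units
one_after_another = (max(schedule(thread1, UNITS).values())
                     + max(schedule(thread2, UNITS).values()))
together = max(schedule({**thread1, **thread2}, UNITS).values())
print(one_after_another, together)
```

With these assumptions the two threads need 3 and 4 cycles when run one after the other (7 in total) and 6 cycles when their instructions share the same two units. The gain is real but modest, which is consistent with the limits on HT Technology performance described in the text.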

According to Intel’s simulations, HT Technology achieves its objective of improving the microarchitecture utilization rate significantly. Improved performance is the real goal, though, and Intel reports that the performance gain can be as high as 30 percent.

The performance gained by these design changes is limited because two threads now share and compete for processor resources, such as the execution pipeline and the Level 1 (L1) and L2 caches. There is some risk that data needed by one thread can be replaced in a cache by data that the other thread is using, resulting in a higher turnover of cache data (referred to as thrashing) and a reduced hit rate.
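
Cache thrashing between two threads can be illustrated with a toy simulation. The sketch below assumes a tiny direct-mapped cache with one-word lines and two working sets chosen deliberately so that they map to the same cache slots. It is a worst case for illustration and does not reflect the real L1 or L2 cache organization; hit_rate, NUM_LINES, and the address ranges are hypothetical.

```python
def hit_rate(addresses, num_lines):
    """Hit rate of a direct-mapped cache with `num_lines` one-word lines."""
    cache = [None] * num_lines
    hits = 0
    for addr in addresses:
        slot = addr % num_lines
        if cache[slot] == addr:
            hits += 1
        else:
            cache[slot] = addr  # miss: evict whatever occupied the slot
    return hits / len(addresses)

NUM_LINES = 8   # assumed cache size, in lines
ROUNDS = 100    # each thread loops over its working set this many times

# Two working sets that collide: address 64 + k maps to the same slot as k.
thread_a = [addr for _ in range(ROUNDS) for addr in range(0, 8)]
thread_b = [addr for _ in range(ROUNDS) for addr in range(64, 72)]

alone = hit_rate(thread_a, NUM_LINES)
# Interleave the two threads' accesses through one shared cache.
interleaved = [addr for pair in zip(thread_a, thread_b) for addr in pair]
shared = hit_rate(interleaved, NUM_LINES)
print(alone, shared)
```

Running alone, the first thread misses only on its first pass through the working set. When the two threads alternate accesses in the shared cache, each access evicts the line the other thread needs next, so in this contrived case the hit rate drops to zero.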


Section	Page
Abstract	2
Introduction	2
Intel processor architecture and microarchitectures	2
NetBurst® microarchitecture	5
Hyper-pipeline and clock frequency	5
Hyper-Threading Technology	7
NetBurst microarchitecture on 90nm silicon process technology	9
Extended hyper-pipeline	10
SSE3 instructions	10
64-bit extensions —Intel 64	10
Two-core technology	11
Intel Core™ microarchitecture	12
Processors	12
Xeon two-core processors	12
Xeon four-core processors	13
Enhanced SpeedStep® Technology	14
Intel Virtualization® Technology	15
Intel® Microarchitecture Nehalem	15
Integrated memory controller	15
Intel® QuickPath Technology	16
Three-level cache hierarchy	17
Intel® Hyper-Threading Technology	18
Intel® Turbo Boost Technology	18
Dynamic Power Management	19
Performance comparisons	20
TPC-C performance	20
SPEC performance	20
Conclusion	21
For more information	22
