
Introduction:

Throughout the evolution of the IA-32 Intel Architecture, Intel has continuously added innovations to the architecture to improve processor performance and to address specific needs of compute-intensive applications. The latest of these innovations is Hyper-Threading Technology, which Intel has developed to improve the performance of IA-32 processors when executing multiprocessor (MP) capable operating systems and multi-threaded applications.

Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.

What is Hyper-Threading Technology?

Hyper-Threading Technology allows a single physical processor to execute multiple threads (instruction streams) simultaneously, potentially providing greater throughput and improved performance.



Hyper-Threading Technology Architecture:

Hyper-Threading Technology makes a single physical processor appear as multiple logical processors. To do this, there is one copy of the architecture state for each logical processor, and the logical processors share a single set of physical execution resources. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on conventional physical processors in a multiprocessor system. From a microarchitecture perspective, this means that instructions from logical processors will persist and execute simultaneously on shared execution resources.

Diagram: The above figure shows a multiprocessor system with two physical processors that are not Hyper-Threading Technology capable.

Diagram: The above figure shows a multiprocessor system with two physical processors that are Hyper-Threading Technology capable. With two copies of the architectural state on each physical processor, the system appears to have four logical processors.

The first implementation of Hyper-Threading Technology is being made available on the Intel® Xeon™ processor family for dual and multiprocessor servers, with two logical processors per physical processor. By more efficiently using existing processor resources, the Intel Xeon processor family can significantly improve performance at virtually the same system cost. This implementation of Hyper-Threading Technology added less than 5% to the relative chip size and maximum power requirements, but can provide performance benefits much greater than that.

Each logical processor maintains a complete set of the architecture state. The architecture state consists of registers, including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors needed to store the architecture state is an extremely small fraction of the total. Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses.

Each logical processor has its own interrupt controller, or APIC. Interrupts sent to a specific logical processor are handled only by that logical processor.
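The division described above, a duplicated architecture state on top of a single set of shared execution resources, can be sketched as a data structure. The C model below is purely illustrative; the field names and sizes are invented and greatly simplified, not Intel's actual layout:

```c
#include <stdint.h>

/* Hypothetical model: each logical processor carries its own copy of the
 * architecture state, while both point at one shared set of execution
 * resources on the physical processor. Names and sizes are illustrative. */
typedef struct {
    uint32_t general_regs[8];   /* duplicated: general-purpose registers   */
    uint32_t control_regs[5];   /* duplicated: control registers           */
    uint32_t apic_regs[4];      /* duplicated: local APIC (simplified)     */
} arch_state_t;

typedef struct {
    int cache_lines;            /* shared: caches, execution units, buses  */
    int execution_units;
} exec_resources_t;

typedef struct {
    arch_state_t state[2];      /* one architecture state per logical CPU  */
    exec_resources_t *shared;   /* a single set of physical resources      */
} physical_processor_t;
```

The point of the sketch is the asymmetry: two independent `arch_state_t` copies, but only one `exec_resources_t` behind a shared pointer.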

FIRST IMPLEMENTATION ON THE INTEL XEON PROCESSOR FAMILY

Several goals were at the heart of the microarchitecture design choices made for the Intel® Xeon™ processor MP implementation of Hyper-Threading Technology. One goal was to minimize the die area cost of implementing Hyper-Threading Technology. Since the logical processors share the vast majority of microarchitecture resources and only a few small structures were replicated, the die area cost of the first implementation was less than 5% of the total die area.

A second goal was to ensure that when one logical processor is stalled, the other logical processor can continue to make forward progress. A logical processor may be temporarily stalled for a variety of reasons, including servicing cache misses, handling branch mispredictions, or waiting for the results of previous instructions. Independent forward progress was ensured by managing the buffering queues such that no logical processor can use all the entries when two active software threads are executing. This is accomplished by either partitioning or limiting the number of active entries each thread can have.

A third goal was to allow a processor running only one active software thread to run at the same speed on a processor with Hyper-Threading Technology as on a processor without this capability. This means that partitioned resources should be recombined when only one software thread is active. A high-level view of the microarchitecture pipeline is shown in Figure 4. As shown, buffering queues separate major pipeline logic blocks. The buffering queues are either partitioned or duplicated to ensure independent forward progress through each logic block.


Active software threads include the operating system idle loop, because it runs a sequence of code that continuously checks the work queue(s). The operating system idle loop can consume considerable execution resources.

Instruction Scheduling:

The schedulers are at the heart of the out-of-order execution engine. Five uop schedulers are used to schedule different types of uops for the various execution units. Collectively, they can dispatch up to six uops each clock cycle. The schedulers determine when uops are ready to execute based on the readiness of their dependent input register operands and the availability of the execution unit resources.

The memory instruction queue and general instruction queues send uops to the five scheduler queues as fast as they can, alternating between uops for the two logical processors every clock cycle, as needed.

Each scheduler has its own scheduler queue of eight to twelve entries from which it selects uops to send to the execution units. The schedulers choose uops regardless of whether they belong to one logical processor or the other. The schedulers are effectively oblivious to logical processor distinctions. The uops are simply evaluated based on dependent inputs and availability of execution resources. For example, the schedulers could dispatch two uops from one logical processor and two uops from the other logical processor in the same clock cycle. To avoid deadlock and ensure fairness, there is a limit on the number of active entries that a logical processor can have in each scheduler's queue. This limit is dependent on the size of the scheduler queue.

FRONT END:

The front end of the pipeline is responsible for delivering instructions to the later pipe stages. Instructions generally come from the Execution Trace Cache (TC), which is the primary or Level 1 (L1) instruction cache. Figure 5b shows that only when there is a TC miss does the machine fetch and decode instructions from the integrated Level 2 (L2) cache. Near the TC is the Microcode ROM, which stores decoded instructions for the longer and more complex IA-32 instructions.


Diagram

Execution Trace Cache (TC):

The TC stores decoded instructions, called micro-operations or "uops." Most instructions in a program are fetched and executed from the TC. Two sets of next-instruction pointers independently track the progress of the two software threads executing. The two logical processors arbitrate access to the TC every clock cycle. If both logical processors want access to the TC at the same time, access is granted to one then the other in alternating clock cycles. For example, if one cycle is used to fetch a line for one logical processor, the next cycle would be used to fetch a line for the other logical processor, provided that both logical processors requested access to the trace cache. If one logical processor is stalled or is unable to use the TC, the other logical processor can use the full bandwidth of the trace cache, every cycle.

The TC entries are tagged with thread information and are dynamically allocated as needed. The TC is 8-way set associative, and entries are replaced based on a least-recently-used (LRU) algorithm that is based on the full 8 ways. The shared nature of the TC allows one logical processor to have more entries than the other if needed.
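The per-cycle arbitration described above can be sketched as a small decision function. This is a hypothetical model of the policy only (alternate when both logical processors request the TC, full bandwidth when one is stalled), not Intel's actual logic:

```c
#include <stdbool.h>

/* Illustrative sketch of per-cycle trace-cache arbitration. If both logical
 * processors request access, the grant alternates with the previous owner;
 * if only one requests, it gets the full bandwidth every cycle.
 * Returns 0 or 1 for the winning logical processor, -1 if neither requests. */
int tc_arbitrate(bool req0, bool req1, int last_owner)
{
    if (req0 && req1)
        return last_owner == 0 ? 1 : 0;  /* alternate clock by clock      */
    if (req0) return 0;                  /* LP1 stalled: LP0 every cycle  */
    if (req1) return 1;                  /* LP0 stalled: LP1 every cycle  */
    return -1;                           /* no fetch this cycle           */
}
```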

Microcode ROM:

When a complex instruction is encountered, the TC sends a microcode-instruction pointer to the Microcode ROM. The Microcode ROM controller then fetches the uops needed and returns control to the TC. Two microcode instruction pointers are used to control the flows independently if both logical processors are executing complex IA-32 instructions.

Both logical processors share the Microcode ROM entries. Access to the Microcode ROM alternates between logical processors just as in the TC.


ITLB and Branch Prediction:

If there is a TC miss, then instruction bytes need to be fetched from the L2 cache and decoded into uops to be placed in the TC. The Instruction Translation Lookaside Buffer (ITLB) receives the request from the TC to deliver new instructions, and it translates the next-instruction-pointer address to a physical address. A request is sent to the L2 cache, and instruction bytes are returned. These bytes are placed into streaming buffers, which hold the bytes until they can be decoded.

The ITLBs are duplicated. Each logical processor has its own ITLB and its own set of instruction pointers to track the progress of instruction fetch for the two logical processors. The instruction fetch logic in charge of sending requests to the L2 cache arbitrates on a first-come, first-served basis, while always reserving at least one request slot for each logical processor. In this way, both logical processors can have fetches pending simultaneously.

Each logical processor has its own set of two 64-byte streaming buffers to hold instruction bytes in preparation for the instruction decode stage. The ITLBs and the streaming buffers are small structures, so the die size cost of duplicating these structures is very low.

The branch prediction structures are either duplicated or shared. The return stack buffer, which predicts the target of return instructions, is duplicated because it is a very small structure and the call/return pairs are better predicted for software threads independently. The branch history buffer used to look up the global history array is also tracked independently for each logical processor. However, the large global history array is a shared structure with entries that are tagged with a logical processor ID.

SUPPORTING IA-32 PROCESSORS:

An IA-32 processor with Hyper-Threading Technology will appear to software as two independent IA-32 processors, similar to two physical processors in a traditional dual-processor (DP) platform. This configuration allows operating system and application software that is already designed to run on a traditional DP or MP system to run unmodified on a platform that uses one or more IA-32 processors with Hyper-Threading Technology. Here, the multiple threads that would be dispatched to two or more physical processors are now dispatched to the logical processors in one or more IA-32 processors with Hyper-Threading Technology. At the firmware (BIOS) level, the basic procedures to initialize multiple processors with Hyper-Threading Technology in an MP platform closely resemble those for a traditional MP platform. An operating system designed to run on a traditional DP or MP platform can use the CPUID instruction to detect the presence of IA-32 processors with Hyper-Threading Technology. The same mechanisms that are described in the multiprocessor specification version 1.4 to wake physical processors apply to the logical processors in an IA-32 processor with Hyper-Threading Technology. Although existing operating system and application code will run correctly on a processor with Hyper-Threading Technology, some relatively simple code modifications are recommended to get the optimum benefit from Hyper-Threading Technology.
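The CPUID-based detection mentioned above can be sketched in C. Bit 28 of EDX from CPUID leaf 1 reports Hyper-Threading Technology support, and bits 23:16 of EBX report the number of logical processors per physical package. The helper names below are invented, and the register values in the test are synthetic; real code would execute the CPUID instruction (e.g., via `<cpuid.h>` on GCC) to obtain EDX and EBX:

```c
#include <stdint.h>

/* Decode CPUID leaf 1 (EAX = 1) results. EDX bit 28 is the HTT feature
 * flag; EBX bits 23:16 hold the number of logical processors per physical
 * package. Helper names are illustrative, not a standard API. */
int ht_supported(uint32_t edx)
{
    return (edx >> 28) & 1;
}

int logical_per_package(uint32_t ebx)
{
    return (ebx >> 16) & 0xff;
}
```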

IMPLEMENTATION OF IA-32 PROCESSORS:

IA-32 Instruction Decode:

IA-32 instructions are cumbersome to decode because the instructions have a variable number of bytes and have many different options. A significant amount of logic and intermediate state is needed to decode these instructions. Fortunately, the TC provides most of the uops, and decoding is only needed for instructions that miss the TC.

The decode logic takes instruction bytes from the streaming buffers and decodes them into uops. When both threads are decoding instructions simultaneously, the streaming buffers alternate between threads so that both threads share the same decoder logic. The decode logic has to keep two copies of all the state needed to decode IA-32 instructions for the two logical processors, even though it only decodes instructions for one logical processor at a time. In general, several instructions are decoded for one logical processor before switching to the other logical processor. The decision to use a coarser level of granularity in switching between logical processors was made in the interest of die size and to reduce complexity. Of course, if only one logical processor needs the decode logic, the full decode bandwidth is dedicated to that logical processor. The decoded instructions are written into the TC and forwarded to the uop queue.


Uop Queue:

After uops are fetched from the trace cache or the Microcode ROM, or forwarded from the instruction decode logic, they are placed in a "uop queue." This queue decouples the Front End from the Out-of-order Execution Engine in the pipeline flow. The uop queue is partitioned such that each logical processor has half the entries. This partitioning allows both logical processors to make independent forward progress regardless of front-end stalls (e.g., TC miss) and execution stalls.

OUT-OF-ORDER EXECUTION ENGINE:

The out-of-order execution engine consists of the allocation, register renaming, scheduling, and execution functions, as shown in the figure. This part of the machine re-orders instructions and executes them as quickly as their inputs are ready, without regard to the original program order.

Allocator:

The out-of-order execution engine has several buffers to perform its re-ordering, tracing, and sequencing operations. The allocator logic takes uops from the uop queue and allocates many of the key machine buffers needed to execute each uop, including the 126 re-order buffer entries, 128 integer and 128 floating-point physical registers, and 48 load and 24 store buffer entries. Some of these key buffers are partitioned such that each logical processor can use at most half the entries. Specifically, each logical processor can use up to a maximum of 63 re-order buffer entries, 24 load buffers, and 12 store buffer entries.

If there are uops for both logical processors in the uop queue, the allocator will alternate selecting uops from the logical processors every clock cycle to assign resources. If a logical processor has used its limit of a needed resource, such as store buffer entries, the allocator will signal "stall" for that logical processor and continue to assign resources for the other logical processor. In addition, if the uop queue only contains uops for one logical processor, the allocator will try to assign resources for that logical processor every cycle to optimize allocation bandwidth, though the resource limits would still be enforced.
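The allocation policy described above, fixed per-logical-processor limits with a "stall" signal when a limit is reached, can be sketched as follows. This is a behavioral sketch using the limits quoted above; the structure and function names are invented, and it models only the policy, not the hardware:

```c
#include <stdbool.h>

/* Per-logical-processor resource limits quoted in the text: half of the
 * 126 re-order buffer entries, 48 load buffers, and 24 store buffers. */
enum { ROB_LIMIT = 63, LOAD_LIMIT = 24, STORE_LIMIT = 12 };

typedef struct { int rob, loads, stores; } lp_usage_t;

/* Try to allocate resources for one uop on behalf of a logical processor.
 * Returns true on success; false models the "stall" signal, after which
 * the allocator would keep serving the other logical processor. */
bool try_allocate(lp_usage_t *u, int needs_load, int needs_store)
{
    if (u->rob >= ROB_LIMIT) return false;
    if (needs_load  && u->loads  >= LOAD_LIMIT)  return false;
    if (needs_store && u->stores >= STORE_LIMIT) return false;
    u->rob    += 1;
    u->loads  += needs_load;
    u->stores += needs_store;
    return true;
}
```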


By limiting the maximum resource usage of key buffers, the machine helps enforce fairness and prevents deadlocks.

Register Rename:

The register rename logic renames the architectural IA-32 registers onto the machine's physical registers. This allows the 8 general-use IA-32 integer registers to be dynamically expanded to use the available 128 physical registers. The renaming logic uses a Register Alias Table (RAT) to track the latest version of each architectural register, to tell the next instruction(s) where to get its input operands.

Since each logical processor must maintain and track its own complete architecture state, there are two RATs, one for each logical processor. The register renaming process is done in parallel to the allocator logic described above, so the register rename logic works on the same uops to which the allocator is assigning resources.

Once uops have completed the allocation and register rename processes, they are placed into two sets of queues, one for memory operations (loads and stores) and another for all other operations. The two sets of queues are called the memory instruction queue and the general instruction queue, respectively. The two sets of queues are also partitioned such that uops from each logical processor can use at most half the entries.
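A minimal sketch may make RAT-based renaming concrete. The code below is a toy model, not the real hardware: a bump allocator stands in for the physical-register free list, and reclamation of registers at retirement is omitted:

```c
/* Toy register-rename sketch: a Register Alias Table (RAT) maps the 8
 * architectural integer registers onto a pool of 128 physical registers.
 * There is one RAT per logical processor; this models a single one. */
enum { ARCH_REGS = 8, PHYS_REGS = 128 };

typedef struct {
    int rat[ARCH_REGS];   /* latest physical register per arch register */
    int next_free;        /* toy free-list cursor (no reclamation)      */
} renamer_t;

/* Rename one uop: the source reads the RAT, the destination gets a fresh
 * physical register, and the RAT is updated so later readers see it. */
void rename_uop(renamer_t *r, int src, int dst, int *phys_src, int *phys_dst)
{
    *phys_src = r->rat[src];      /* where to get the input operand */
    *phys_dst = r->next_free++;   /* fresh destination register     */
    r->rat[dst] = *phys_dst;      /* publish the new version        */
}
```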

Execution Units:

The execution core and memory hierarchy are also largely oblivious to logical processors. Since the source and destination registers were renamed earlier to physical registers in a shared physical register pool, uops merely access the physical register file to get their destinations, and they write results back to the physical register file. Comparing physical register numbers enables the forwarding logic to forward results to other executing uops without having to understand logical processors.

After execution, the uops are placed in the re-order buffer. The re-order buffer decouples the execution stage from the retirement stage. The re-order buffer is partitioned such that each logical processor can use half the entries.


Retirement:

Uop retirement logic commits the architecture state in program order. The retirement logic tracks when uops from the two logical processors are ready to be retired, then retires the uops in program order for each logical processor by alternating between the two logical processors. Retirement logic will retire uops for one logical processor, then the other, alternating back and forth. If one logical processor is not ready to retire any uops, then all retirement bandwidth is dedicated to the other logical processor.

Once stores have retired, the store data needs to be written into the level-one data cache. Selection logic alternates between the two logical processors to commit store data to the cache.

MEMORY SUBSYSTEM:

The memory subsystem includes the DTLB, the low-latency Level 1 (L1) data cache, the Level 2 (L2) unified cache, and the Level 3 (L3) unified cache (the Level 3 cache is only available on the Intel® Xeon™ processor MP). Access to the memory subsystem is also largely oblivious to logical processors. The schedulers send load or store uops without regard to logical processors, and the memory subsystem handles them as they come.

DTLB:

The DTLB translates addresses to physical addresses. It has 64 fully associative entries; each entry can map either a 4KB or a 4MB page. Although the DTLB is a shared structure between the two logical processors, each entry includes a logical processor ID tag. Each logical processor also has a reservation register to ensure fairness and forward progress in processing DTLB misses.
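The shared-but-tagged organization described above can be sketched as a lookup that must match both the virtual page and the logical-processor ID, so the two logical processors never see each other's translations. The entry layout below is illustrative, not the real DTLB format:

```c
#include <stdint.h>
#include <stdbool.h>

enum { DTLB_ENTRIES = 64 };  /* fully associative, as described above */

typedef struct {
    bool     valid;
    int      lp_id;       /* which logical processor owns this entry */
    uint32_t vpage;       /* virtual page number                     */
    uint32_t ppage;       /* physical page number                    */
} dtlb_entry_t;

/* A hit requires both the virtual page and the LP tag to match. */
bool dtlb_lookup(const dtlb_entry_t *tlb, int lp, uint32_t vpage,
                 uint32_t *ppage)
{
    for (int i = 0; i < DTLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].lp_id == lp && tlb[i].vpage == vpage) {
            *ppage = tlb[i].ppage;
            return true;          /* hit */
        }
    }
    return false;                 /* miss: walk page tables and refill */
}
```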

L1 Data Cache, L2 Cache, L3 Cache:

The L1 data cache is 4-way set associative with 64-byte lines. It is a write-through cache, meaning that writes are always copied to the L2 cache. The L1 data cache is virtually addressed and physically tagged.


The L2 and L3 caches are 8-way set associative with 128-byte lines. The L2 and L3 caches are physically addressed. Both logical processors, without regard to which logical processor's uops may have initially brought the data into the cache, can share all entries in all three levels of cache.

Because logical processors can share data in the cache, there is the potential for cache conflicts, which can result in lower observed performance. However, there is also the possibility of sharing data in the cache. For example, one logical processor may prefetch instructions or data needed by the other into the cache; this is common in server application code. In a producer-consumer usage model, one logical processor may produce data that the other logical processor wants to use. In such cases, there is the potential for good performance benefits.

BUS:

Logical processor memory requests not satisfied by the cache hierarchy are serviced by the bus logic. The bus logic includes the local APIC interrupt controller, as well as off-chip system memory and I/O space. Bus logic also deals with cacheable address coherency (snooping) of requests originated by other external bus agents, plus incoming interrupt request delivery via the local APICs.

From a service perspective, requests from the logical processors are treated on a first-come basis, with queue and buffering space appearing shared. Priority is not given to one logical processor above the other.

Distinctions between requests from the logical processors are reliably maintained in the bus queues nonetheless. Requests to the local APIC and interrupt delivery resources are unique and separate per logical processor. Bus logic also carries out portions of barrier fence and memory ordering operations, which are applied to the bus request queues on a per logical processor basis.

For debug purposes, and as an aid to forward progress mechanisms in clustered multiprocessor implementations, the logical processor ID is visibly sent onto the processor external bus in the request phase portion of a transaction. Other bus transactions, such as cache line eviction or prefetch transactions, inherit the logical processor ID of the request that generated the transaction.

SINGLE-TASK AND MULTI-TASK MODES:

To optimize performance when there is one software thread to execute, there are two modes of operation, referred to as single-task (ST) and multi-task (MT). In MT-mode, there are two active logical processors and some of the resources are partitioned as described earlier. There are two flavors of ST-mode: single-task logical processor 0 (ST0) and single-task logical processor 1 (ST1). In ST0- or ST1-mode, only one logical processor is active, and resources that were partitioned in MT-mode are recombined to give the single active logical processor use of all of the resources. The IA-32 Intel Architecture has an instruction called HALT that stops processor execution and normally allows the processor to go into a lower-power mode. HALT is a privileged instruction, meaning that only the operating system or other ring-0 processes may execute this instruction. User-level applications cannot execute HALT.

On a processor with Hyper-Threading Technology, executing HALT transitions the processor from MT-mode to ST0- or ST1-mode, depending on which logical processor executed the HALT. For example, if logical processor 0 executes HALT, only logical processor 1 would be active; the physical processor would be in ST1-mode and partitioned resources would be recombined, giving logical processor 1 full use of all processor resources. If the remaining active logical processor also executes HALT, the physical processor would then be able to go to a lower-power mode.

In ST0- or ST1-mode, an interrupt sent to the halted logical processor would cause a transition to MT-mode. The operating system is responsible for managing MT-mode transitions (described in the next section).
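The transitions described above form a small state machine, sketched below under exactly the stated behavior: HALT in MT-mode drops the physical processor into the other logical processor's ST-mode, and an interrupt to the halted logical processor returns it to MT-mode. The enum and function names are invented for illustration:

```c
/* ST0/ST1/MT mode-transition sketch, modeling the behavior in the text. */
typedef enum { MT, ST0, ST1 } task_mode_t;

/* Logical processor `lp` executes HALT. */
task_mode_t on_halt(task_mode_t m, int lp)
{
    if (m == MT)
        return lp == 0 ? ST1 : ST0;  /* remaining LP gets all resources */
    return m;  /* HALT in ST-mode: processor may enter a lower-power mode */
}

/* An interrupt is delivered to logical processor `target_lp`. */
task_mode_t on_interrupt(task_mode_t m, int target_lp)
{
    if ((m == ST0 && target_lp == 1) || (m == ST1 && target_lp == 0))
        return MT;                   /* halted LP wakes: repartition */
    return m;
}
```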

The Intel® Xeon™ processor family delivers the highest server system performance of any IA-32 Intel architecture processor introduced to date. Initial benchmark tests show up to a 65% performance increase on high-end server applications when compared to the previous-generation Pentium® III Xeon™ processor on 4-way server platforms. A significant portion of those gains can be attributed to Hyper-Threading Technology.


Diagram

The online transaction processing performance scales from a single-processor configuration through to a 4-processor system with Hyper-Threading Technology enabled. This graph is normalized to the performance of the single-processor system. It can be seen that there is a significant overall performance gain attributable to Hyper-Threading Technology, 21% in the cases of the single and dual-processor systems.

The benefit of Hyper-Threading Technology also shows when executing other server-centric benchmarks. The workloads chosen were two different benchmarks that are designed to exercise data and Web server characteristics, and a workload that focuses on exercising a server-side Java environment. In these cases the performance benefit ranged from 16 to 28%.

All the performance results quoted above are normalized to ensure that readers focus on the relative performance and not the absolute performance.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

Hyper-Threading Technology and Windows Streams:

Windows-based servers receive processor information from the BIOS. Each server vendor creates its own BIOS using specifications provided by Intel. Assuming the BIOS is written according to Intel specifications, it begins counting processors using the first logical processor on each physical processor. Once it has counted a logical processor on all of the physical processors, it will count the second logical processor on each physical processor, and so on.


Diagram: Numbers indicate the order in which logical processors are recognized by the BIOS when written according to the Intel specifications. This diagram shows a four-way system enabled with Hyper-Threading Technology.

It is critical that the BIOS count logical processors in the manner described; otherwise, Windows 2000 or its applications may use logical processors when they should be using physical processors instead. For example, consider an application that is licensed to use two processors on the system diagrammed above. Such an application will achieve better performance using two separate physical processors than it would using two logical processors on the same physical processor.
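Under the counting order described above, mapping an OS processor number back to a (physical package, logical processor) pair is simple arithmetic. The sketch below assumes that order holds and that every package has the same number of logical processors; in practice the enumeration comes from the BIOS/MP tables rather than a formula:

```c
/* BIOS counting order described above: the first logical processor on
 * each physical package is numbered before any second logical processor.
 * For P physical packages, OS processor n therefore maps to package
 * n % P and logical processor n / P. Illustrative only. */
void decode_os_cpu(int n, int num_physical, int *package, int *logical)
{
    *package = n % num_physical;
    *logical = n / num_physical;
}
```

On the four-way Hyper-Threading system in the diagram, OS processors 0-3 would be logical processor 0 on packages 0-3, and OS processors 4-7 would be logical processor 1 on the same packages.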

CONCLUSION:

Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. This is a significant new technology direction for Intel's future processors. It will become increasingly important going forward, as it adds a new technique for obtaining additional performance at lower transistor and power costs. The first implementation of Hyper-Threading Technology was done on the Intel® Xeon™ processor MP. In this implementation there are two logical processors on each physical processor. The logical processors have their own independent architecture state, but they share nearly all the physical execution and hardware resources of the processor. The goal was to implement the technology at minimum cost while ensuring forward progress on one logical processor even if the other is stalled, and to deliver full performance even when there is only one active logical processor. These goals were achieved through efficient logical processor selection algorithms and the creative partitioning and recombining algorithms of many key resources. Measured performance on the Intel Xeon processor MP with Hyper-Threading Technology shows performance gains of up to 30% on common server application benchmarks for this technology.

The potential of Hyper-Threading Technology is tremendous; the current implementation has only just begun to tap into this potential. Hyper-Threading Technology is expected to be viable from mobile processors to servers; its introduction into market segments other than servers is only gated by the availability and prevalence of threaded applications and workloads in those markets.
