
From Instruction Level To Thread Level Parallelism

John Paul Shen
Director of Microarchitecture Research

Intel Labs

October 29, 2002
Microprocessor Research Forum

"Iron Law" of Microprocessor Performance

\[
\frac{1}{\text{Processor Performance}} = \frac{\text{Time}}{\text{Program}}
= \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{inst. count}}
\times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{CPI}}
\times \underbrace{\frac{\text{Time}}{\text{Cycle}}}_{\text{cycle time}}
\]

\[
\text{Processor Performance} = \frac{\text{IPC} \times \text{GHz}}{\text{inst. count}}
\]
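As a worked illustration of the Iron Law, the sketch below evaluates both forms of the equation for one workload. The function names and the workload numbers (instruction count, CPI, clock frequency) are hypothetical and chosen only to make the arithmetic easy to follow; they do not come from the slides.

```python
def execution_time_s(inst_count, cpi, cycle_time_s):
    """Iron Law: time/program = instructions/program * cycles/instruction * time/cycle."""
    return inst_count * cpi * cycle_time_s


def performance(inst_count, ipc, freq_hz):
    """Performance = 1 / execution time = (IPC * frequency) / instruction count."""
    return ipc * freq_hz / inst_count


# Hypothetical workload: 2 billion instructions, CPI of 1.25, 2 GHz clock.
inst_count = 2_000_000_000
cpi = 1.25
freq_hz = 2.0e9

time_s = execution_time_s(inst_count, cpi, 1.0 / freq_hz)
perf = performance(inst_count, 1.0 / cpi, freq_hz)
print(f"execution time: {time_s:.3f} s, performance: {perf:.3f} programs/s")
```

Because IPC is the reciprocal of CPI and frequency is the reciprocal of cycle time, the two functions return reciprocal values for the same workload.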

Frequency & Performance Boost

• 13X due to process technology
• Additional 4X due to microarchitecture

[Figure: Frequency (MHz), log scale from 10 to 10,000, versus process generation (1.0µ, 0.7µ, 0.5µ, 0.35µ, 0.25µ, 0.18µ) for the i486, Pentium® proc, Pentium® II and III proc, and Pentium® 4 proc. Series: Freq (uArch) and Freq (Process), annotated 13X and 4X.]

Frequency Increased 50X

[Figure: Relative Performance, log scale from 1 to 100, versus process generation (1.0µ, 0.7µ, 0.5µ, 0.35µ, 0.25µ, 0.18µ) for the i486, Pentium® proc, and Pentium® 4 proc. Series: Relative Performance and Relative Frequency, annotated 13X and 6X.]

• 13X due to process technology
• Additional >6X due to microarchitecture

Performance Increased >75X

*Note: Performance measured using SpecINT and SpecFP

Source: Intel Corporation
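The headline figures can be checked with simple arithmetic, assuming the process and microarchitecture gains compose multiplicatively (an assumption; the slides state only the individual factors and the totals). The variable names below are illustrative.

```python
process_gain = 13      # from the slides: gain attributed to process technology
uarch_freq_gain = 4    # from the slides: additional frequency gain from microarchitecture
uarch_perf_gain = 6    # from the slides: additional performance gain (stated as ">6X")

freq_total = process_gain * uarch_freq_gain
perf_total = process_gain * uarch_perf_gain
print(freq_total, perf_total)
```

Under that assumption the products are 52 and 78, consistent with the rounded "50X" frequency claim and the ">75X" performance claim.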

Parallelism in Transition

[Figure: MIPS, log scale from 1 to 1,000,000, versus year (1980 to 2010). Era of Instruction Parallelism: Pentium® Architecture (Super Scalar), Pentium® Pro Architecture (Speculative Out of Order), Pentium® 4 Architecture (Trace Cache). Era of Thread Parallelism: Future Xeon™ Architecture (Multi-Threaded), followed by Multi-Threaded, Multi-Core.]

[Figure: five panels, each plotted against Time: Superscalar Issue; Chip Multiprocessor (CMP), with CPU0 and CPU1; Time-slicing Multi-Threading; Switch-on-Event Multi-Threading; Simultaneous Multi-Threading (SMT).]

Maximum utilization of function units by independent operations
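A toy model can make the utilization argument concrete. The sketch below counts how many issue slots are filled per cycle when one thread owns the machine, and when two threads share the same slots as in SMT. The issue width and the per-cycle counts of ready operations are hypothetical, and the model deliberately ignores dependences, carry-over of unissued operations, and resource conflicts.

```python
def issued_ops(ready_per_cycle, width):
    """Ops issued when one thread owns all `width` issue slots each cycle."""
    return sum(min(ready, width) for ready in ready_per_cycle)


def issued_ops_smt(ready_a, ready_b, width):
    """Ops issued when two threads share the same `width` slots each cycle."""
    return sum(min(a + b, width) for a, b in zip(ready_a, ready_b))


WIDTH = 4  # hypothetical issue width
# Hypothetical per-cycle counts of independent, ready operations per thread.
thread_a = [1, 3, 0, 2, 4, 0, 1, 2]
thread_b = [2, 0, 3, 1, 0, 2, 3, 1]

slots = WIDTH * len(thread_a)
single = issued_ops(thread_a, WIDTH) / slots
smt = issued_ops_smt(thread_a, thread_b, WIDTH) / slots
print(f"single thread: {single:.0%} of slots used, SMT: {smt:.0%}")
```

In this made-up trace the second thread fills slots that the first thread leaves idle, which is the effect the slide attributes to independent operations from multiple threads.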

Accelerate Performance of Threaded Applications

• SMT: most efficient/highest performance option
• More performance for CPU execution resources
  - Uses resources more fully
• Greater performance improvements from additional processor resources

[Figure: Performance versus Processor execution resources, comparing Hyper-Threading Technology with Current processors.]

Hyper-Threading Technology

• Executes two tasks simultaneously
  - Two different applications
  - Two threads of same application
• CPU maintains architecture state for two processors
  - Two logical processors per physical processor
• Demonstrated on prototype Intel® Xeon™ Processor MP
  - Two logical processors for < 5% additional die area
  - Power efficient performance gain
  - Result of significant research, design effort, and validation
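Because each physical processor presents two logical processors, software generally sees the logical count. As a small sketch, Python's standard `os.cpu_count()` reports the number of logical CPUs; the number alone does not reveal whether Hyper-Threading is enabled, so treat it only as the count the operating system schedules onto.

```python
import os

# os.cpu_count() reports logical CPUs, or None if the count cannot be determined.
logical = os.cpu_count()
if logical is None:
    print("logical processor count unavailable")
else:
    print(f"the operating system schedules onto {logical} logical processor(s)")
```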

Replicated vs. Shared Resources

Multi-Processors replicate execution resources
Hyper-Threading Technology shares resources

[Figure: side-by-side diagrams labeled Multiprocessor and Hyper-Threading, built from Arch State and Processor Execution Resources blocks.]

[Figure: processor block diagram, repeated across several build slides, with blocks labeled BTB & I-TLB, BTB, Decoder, Trace Cache, uCode ROM, Rename/Alloc, uop Queues, Schedulers, Integer RF, FP RF, ALU (four), Load AGU, Store AGU, Fmul, FAdd, MMX, SSE, FP load, FP store, L1 D-Cache and D-TLB, Reorder/Retire, L2 Cache, L3 Cache, and L2/L3 Cache Control. Later builds show two Arch State blocks.]

Changes for Hyper-Threading

• Replicate resources
  - All per-CPU architectural state
  - Instruction Pointers, renaming logic
  - Some smaller resources
    - E.g., return stack predictor, ITLB, etc.
• Partition resources
  - Several buffers (Re-order buffer, load/store buffers, queues, etc.)
• Share most resources
  - Out-of-Order execution engine
  - Caches

What Was Added

Instruction TLB
Next IP
Instruction Streaming Buffers
Trace Cache Fill Buffers
Register Alias Tables
Trace Cache Next IP

[Figure: pipeline diagram with labels IP, Trace Cache, I-Fetch, Fetch Queue, Rename, Allocate, Uop Queue, Sched, Register Read, Execute, D-Cache, Register Write, Retire Queue, Registers, Register Rename, ROB, Store Buffer, and L1 D-Cache.]

Execution Pipeline

[Figure: the same pipeline diagram, repeated across build slides and divided into In-Order Pipeline, Out-of-Order Pipeline, and In-Order sections.]

Partition queues between major pipestages of pipeline

Server Performance

[Figure: E-Commerce Workload. Performance (0.00 to 2.50) versus Number of Processors (2, 4), HT Off versus HT On.]

[Figure: Transaction Processing Workload. Performance (0.00 to 3.00) versus Number of Processors (1, 2, 4), HT Off versus HT On.]

Chart annotations: 20%, 20%, 10%, 17%, 14%

Good performance benefit from small die area investment

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/procs/perf/limits.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104

Intel's Long-Term Hyper-Threading Strategy

[Figure: Performance versus Time, with curves for Single-thread performance and Hyper-thread performance, and phases labeled Prototype, Introduce, and Continuous Development.]

Hyper-Threading Technology expected to deliver increasingly higher performance

Challenges for Multithreading

• Multithreaded Applications
  - Multithread Programming
  - Automatic Thread Partitioning
• Design Complexity
  - Additional Validation Overhead
  - Increasing Scalar Inefficiency
  - SMT vs. CMP Tradeoffs?
• Latency of Single-Threaded Applications

Multi-Threaded Processors

• Targeting:
  - Throughput of Multi-tasking Workloads
  - Latency of Multi-threaded Applications
• Not Targeting:
  - Latency of Single-threaded Applications
• Research Challenge:
  - Leverage Multi-threaded CPU to Improve Latency of Single-threaded Applications

Asymmetric Multi-Threading

• Symmetric Multi-Threading
  - Partition single thread into multiple threads
  - Achieve performance by parallel execution of multiple threads
• Asymmetric Multi-Threading
  - Attach helper threads to original single-threaded code
  - Achieve speedup via memory prefetching by helper threads

Asymmetric Helper Threads

• Symmetric Multi-Threading Compiler
  - Partition single thread into multiple threads
  - Must ensure semantic correctness
  - Difficult for common and legacy code
• Asymmetric Multi-Threading Compiler
  - Attach helper threads to original code
  - Leverage side effect of helper threads
  - Can be dynamically invoked/controlled

Pre-fetch via Helper Threads

[Figure: timeline with labels Trigger, Helper Thread, Cache Pre-fetch Initiated, Delinquent Load, and Cache Miss Avoided.]
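The idea of a helper thread that prefetches for a delinquent load can be sketched with a toy timing model. It assumes a pointer-chasing loop over a linked list in which every load misses the cache, a helper that starts at the same time as the main thread and does nothing except follow the pointers, and made-up latencies in cycles. The function names, parameters, and numbers are all hypothetical; this is not the mechanism or the measurement from the slides, only a way to see where the overlap comes from.

```python
def run_without_helper(n_nodes, work, miss_latency):
    """Main thread alone: every load of a list node misses the cache."""
    return n_nodes * (miss_latency + work)


def run_with_helper(n_nodes, work, miss_latency, hit_latency):
    """Main thread plus a helper that only chases pointers to warm the cache."""
    main_time = 0
    misses_avoided = 0
    for i in range(n_nodes):
        # The helper does no other work, so node i is cached once it has
        # followed i + 1 pointers, each costing one full miss.
        ready_at = (i + 1) * miss_latency
        if ready_at <= main_time:
            misses_avoided += 1       # data already in the cache
        else:
            main_time = ready_at      # main thread stalls until the line arrives
        main_time += hit_latency + work
    return main_time, misses_avoided


# Hypothetical parameters, in cycles.
baseline = run_without_helper(n_nodes=1000, work=300, miss_latency=200)
helped, avoided = run_with_helper(n_nodes=1000, work=300, miss_latency=200, hit_latency=2)
print(f"baseline {baseline} cycles, with helper {helped} cycles, {avoided} misses avoided")
```

In this model the helper pulls ahead because it skips the per-node work, so after the first node the main thread finds its data already cached. If the per-node work were smaller than the miss latency, the main thread would instead be paced by the helper.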

Chaining Triggers of Helper Threads

[Figure: timeline with labels Trigger, a series of Helper Threads, Delinquent Load, and Cache Miss Avoided.]

Chaining Trigger Advantages

• Low-cost Thread Spawning:
  - Chaining triggers initiate helper threads without impacting main thread performance
• Long-range Prefetching:
  - Can target delinquent loads far ahead of the main thread
  - Helper threads make progress independent of main thread's lack of progress

Prefetching Effectiveness of Helper Threads on HT Machine

Benchmark | Description | Speedup
Synthetic | Graph traversal in large random graph simulating large database retrieval | 22% - 45%
MST (Olden) | Minimal Spanning Tree algorithm used for Data Clustering | 23% - 40%
Health (Olden) | Hierarchical database modeling health care system | 11% - 24%
MCF (SPECint) | Integer programming algorithm used for bus scheduling | 7.08%

Source: Intel Labs
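To read the speedup column as a change in run time, the sketch below converts each reported percentage into the fraction of baseline run time that remains. It assumes speedup is defined as (old time / new time - 1) x 100, which the slides do not state; the helper name `time_fraction` is illustrative.

```python
def time_fraction(speedup_pct):
    """Fraction of the original run time left after a given percentage speedup."""
    return 1.0 / (1.0 + speedup_pct / 100.0)


# Speedup ranges as reported in the table (a single value for MCF).
reported = {
    "Synthetic": (22.0, 45.0),
    "MST (Olden)": (23.0, 40.0),
    "Health (Olden)": (11.0, 24.0),
    "MCF (SPECint)": (7.08, 7.08),
}
for name, (low, high) in reported.items():
    best, worst = time_fraction(high), time_fraction(low)
    print(f"{name}: run time falls to {best:.2f}-{worst:.2f} of baseline")
```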

Helper Threads on Hyper-Threading Machine

[Figure: Asymmetric MT effectiveness: speedup, L2 miss coverage; % helper thread Icount. Vertical axis from 0% to 120%; Benchmarks: synthetic, mst, health, mcf. Series: Speedup (wall clk), Coverage of L2 Misses, Relative % of Helper Threads Icount.]

Source: Intel Labs

Intel® Processors with NetBurst™ Microarchitecture

Intel Xeon Processor
  - 256KB 2nd-Level Cache
  - .18u process

Intel® Xeon™ MP Processor
  - 256KB 2nd-Level Cache
  - 1MB 3rd-Level Cache

Intel Xeon Processor
  - 512KB 2nd-Level Cache
