From Instruction Level To Thread Level Parallelism
John Paul Shen
Director of Microarchitecture Research
Intel Labs
October 29, 2002
Microprocessor Research Forum
“Iron Law” of Microprocessor Performance

$$\frac{1}{\text{Processor Performance}} = \frac{\text{Time}}{\text{Program}} = \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{(inst. count)}} \times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{(CPI)}} \times \underbrace{\frac{\text{Time}}{\text{Cycle}}}_{\text{(cycle time)}}$$

$$\text{Processor Performance} = \frac{\text{IPC} \times \text{GHz}}{\text{inst. count}}$$
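The Iron Law above can be checked with a few lines of arithmetic. The following is a minimal sketch; the workload numbers (2 billion instructions, CPI of 1.25, 2 GHz clock) and the function names are hypothetical, chosen only to make the two forms of the equation easy to compare.

```python
def execution_time_s(inst_count: float, cpi: float, cycle_time_s: float) -> float:
    """Time/Program = Instructions/Program x Cycles/Instruction x Time/Cycle."""
    return inst_count * cpi * cycle_time_s


def performance(inst_count: float, ipc: float, freq_hz: float) -> float:
    """Performance = 1 / execution time = (IPC x frequency) / instruction count."""
    return ipc * freq_hz / inst_count


# Hypothetical workload, not taken from the slides.
INST_COUNT = 2e9      # instructions per program run
CPI = 1.25            # cycles per instruction
FREQ_HZ = 2e9         # clock frequency (2 GHz)

time_s = execution_time_s(INST_COUNT, CPI, 1.0 / FREQ_HZ)
perf = performance(INST_COUNT, 1.0 / CPI, FREQ_HZ)

print(f"execution time: {time_s:.3f} s")
print(f"performance:    {perf:.3f} program runs per second")
```

Both forms describe the same quantity: `perf` is the reciprocal of `time_s`, with IPC standing in for 1/CPI and frequency for 1/cycle time.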
Frequency & Performance Boost
• 13X due to process technology
• Additional 4X due to microarchitecture
[Figure: Frequency (MHz), log scale from 10 to 10,000, versus process generation (1.0µ, 0.7µ, 0.5µ, 0.35µ, 0.25µ, 0.18µ). Series: Freq (uArch), Freq (Process). Processors marked: i486, Pentium® proc, Pentium® II and III proc, Pentium® 4 proc. Annotations: 13X, 4X.]
Frequency Increased 50X
Performance Increased >75X
• 13X due to process technology
• Additional >6X due to microarchitecture
[Figure: Relative Performance, log scale from 1 to 100, versus process generation (1.0µ, 0.7µ, 0.5µ, 0.35µ, 0.25µ, 0.18µ). Series: Relative Performance, Relative Frequency. Processors marked: i486, Pentium® proc, Pentium® II and III proc, Pentium® 4 proc. Annotations: 13X, 6X.]
*Note: Performance measured using SpecINT and SpecFP
Source: Intel Corporation
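The headline numbers on these two slides appear to be products of the per-source gains. The sketch below works through that arithmetic. The last line is an interpretation under the Iron Law, not something the slides state: if microarchitecture contributed 4X in frequency and more than 6X in performance, the remainder (at least 1.5X) would have to come from the other terms, such as IPC or instruction count.

```python
# Figures quoted on the slides.
PROCESS_GAIN = 13       # gain attributed to process technology
UARCH_FREQ_GAIN = 4     # additional frequency gain attributed to microarchitecture
UARCH_PERF_GAIN = 6     # slides say ">6X", so this is a lower bound

total_freq_gain = PROCESS_GAIN * UARCH_FREQ_GAIN   # slide rounds this to 50X
total_perf_gain = PROCESS_GAIN * UARCH_PERF_GAIN   # slide quotes ">75X"

# Assumption: performance = IPC x frequency / inst. count, so whatever
# microarchitecture gained beyond frequency came from the other terms.
non_freq_uarch_gain = UARCH_PERF_GAIN / UARCH_FREQ_GAIN

print(total_freq_gain, total_perf_gain, non_freq_uarch_gain)
```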
Parallelism in Transition
[Figure: MIPS, log scale from 1 to 1,000,000, versus year (1980 to 2010). Era of Instruction Parallelism: Pentium® Architecture (Super Scalar), Pentium® Pro Architecture (Speculative Out of Order), Pentium® 4 Architecture (Trace Cache). Era of Thread Parallelism: Future Xeon™ Architecture (Multi-Threaded), followed by Multi-Threaded, Multi-Core.]
Superscalar Issue
Chip Multiprocessor (CMP)
Time-slicing Multi-Threading
Switch-on-Event Multi-Threading
Simultaneous Multi-Threading (SMT)
[Figure: for each of the five schemes above, a diagram of issue slots filled over time; the CMP diagram shows CPU0 and CPU1.]
Maximum utilization of function units by independent operations
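The difference between giving one thread all issue slots in a cycle and letting threads share them can be illustrated with a toy model. This is a sketch under loud assumptions, not a model of any real pipeline: the issue width of 4, the two instruction streams, and the function `cycles_to_drain` are all hypothetical. Each thread is a list of steps, a step is the number of independent operations ready at that point, and a 0 is a one-cycle stall that elapses whether or not the thread is selected.

```python
from typing import List

WIDTH = 4  # issue slots per cycle (assumed)


def cycles_to_drain(threads: List[List[int]], width: int, smt: bool) -> int:
    """Cycles needed to issue every operation of every thread.

    smt=False: only the highest-priority thread may issue in a cycle
               (priority rotates every cycle, a simple time-slicing model).
    smt=True:  all threads may fill the free slots of the same cycle.
    """
    work = [list(t) for t in threads]          # copy, steps are consumed
    cycles = 0
    turn = 0
    while any(work):
        cycles += 1
        live = [i for i, steps in enumerate(work) if steps]
        shift = turn % len(live)
        live = live[shift:] + live[:shift]     # rotate priority
        turn += 1
        free = width
        for rank, i in enumerate(live):
            if work[i][0] == 0:
                work[i].pop(0)                 # stall elapses in wall-clock time
            elif smt or rank == 0:
                issued = min(work[i][0], free)
                free -= issued
                work[i][0] -= issued
                if work[i][0] == 0:
                    work[i].pop(0)             # step fully issued
    return cycles


# Hypothetical instruction streams (ops ready per step, 0 = stall).
THREAD_A = [2, 0, 0, 3, 1]
THREAD_B = [1, 2, 0, 2, 2]

back_to_back = (cycles_to_drain([THREAD_A], WIDTH, smt=False)
                + cycles_to_drain([THREAD_B], WIDTH, smt=False))
time_sliced = cycles_to_drain([THREAD_A, THREAD_B], WIDTH, smt=False)
simultaneous = cycles_to_drain([THREAD_A, THREAD_B], WIDTH, smt=True)

total_ops = sum(THREAD_A) + sum(THREAD_B)
for name, cycles in [("back-to-back", back_to_back),
                     ("time-sliced", time_sliced),
                     ("SMT", simultaneous)]:
    print(f"{name:13s} {cycles:2d} cycles, "
          f"slot utilization {total_ops / (WIDTH * cycles):.0%}")
```

In this model time-slicing only hides stalls, while the SMT policy also fills slots that a single thread leaves empty within a cycle, which is the "maximum utilization of function units by independent operations" the slide refers to.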
Accelerate Performance of Threaded Applications
• SMT: most efficient/highest performance option
• More performance for CPU execution resources
  – Uses resources more fully
• Greater performance improvements from additional processor resources
[Figure: Performance versus Processor execution resources. Curves: Hyper-Threading Technology, Current processors.]
Hyper-Threading Technology
• Executes two tasks simultaneously
  – Two different applications
  – Two threads of same application
• CPU maintains architecture state for two processors
  – Two logical processors per physical processor
• Demonstrated on prototype Intel® Xeon™ Processor MP
  – Two logical processors for < 5% additional die area
  – Power efficient performance gain
  – Result of significant research, design effort, and validation
Replicated vs. Shared Resources
Multi-Processors replicate execution resources
Hyper-Threading Technology shares resources
[Figure: processor die diagrams for two cases. Multiprocessor: each Arch State is paired with its own Processor Execution Resources. Hyper-Threading: two Arch States over a single set of Processor Execution Resources. Blocks labeled on each die: BTB & I-TLB, Decoder, Trace Cache, uCode ROM, Rename/Alloc, uop Queues, Schedulers, Integer RF, FP RF, ALU (x4), Load AGU, Store AGU, Fmul, FAdd, MMX, SSE, FP load, FP store, Reorder/Retire, L1 D-Cache and D-TLB, L2 Cache, L3 Cache, L2/L3 Cache Control.]
Changes for Hyper-Threading
• Replicate resources
  – All per-CPU architectural state
  – Instruction Pointers, renaming logic
  – Some smaller resources
    – E.g., return stack predictor, ITLB, etc.
• Partition resources
  – Several buffers (Re-order buffer, load/store buffers, queues, etc.)
• Share most resources
  – Out-of-Order execution engine
  – Caches
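The "partition" category in the list above can be sketched as a buffer whose entries are split statically between logical processors, so that one logical processor cannot occupy the whole buffer. This is an illustrative sketch only: the class name `PartitionedQueue`, the even split, and the 8-entry size are assumptions, not details given in the slides.

```python
from collections import deque


class PartitionedQueue:
    """A buffer statically split between logical processors (hypothetical)."""

    def __init__(self, total_entries: int, logical_cpus: int = 2):
        # Assumed policy: every logical CPU gets an equal share of the entries.
        self.limit = total_entries // logical_cpus
        self.entries = [deque() for _ in range(logical_cpus)]

    def try_push(self, cpu: int, item) -> bool:
        """Enqueue for one logical CPU; refuse when its own partition is full."""
        if len(self.entries[cpu]) >= self.limit:
            return False            # this logical CPU must stall, the other is unaffected
        self.entries[cpu].append(item)
        return True

    def pop(self, cpu: int):
        """Dequeue the oldest entry belonging to one logical CPU."""
        return self.entries[cpu].popleft()


# Logical CPU 0 tries to fill an 8-entry buffer on its own.
buffer = PartitionedQueue(total_entries=8)
accepted_cpu0 = [buffer.try_push(0, f"uop{i}") for i in range(6)]
accepted_cpu1 = buffer.try_push(1, "uop0")

print(accepted_cpu0)   # CPU 0 is capped at its half of the entries
print(accepted_cpu1)   # CPU 1 still has room
```

A fully shared structure would instead use one limit for all logical processors, and a replicated one would give each its own full-size copy.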
What Was Added
[Figure: die diagram with the added structures called out: Instruction TLB, Next IP, Instruction Streaming Buffers, Trace Cache Fill Buffers, Register Alias Tables, Trace Cache Next IP.]
Execution Pipeline
[Figure: pipeline diagram. Stage labels: I-Fetch, Fetch Queue, Allocate/Rename, Uop Queue, Queue, Sched, Register Read, Execute, D-Cache, Register Write, Retire. Structures: IP, Trace Cache, Register Rename, Registers, ROB, Store Buffer, L1 D-Cache. Regions: In-Order Pipeline, Out-of-Order Pipeline, In-Order.]
Partition queues between major pipestages of pipeline
Server Performance
[Figure: two bar charts of relative Performance versus Number of Processors, HT Off versus HT On. E-Commerce Workload: 2 and 4 processors, scale 0.00 to 2.50. Transaction Processing Workload: 1, 2, and 4 processors, scale 0.00 to 3.00. Gains annotated on the charts: 20%, 20%, 10%, 17%, 14%.]
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/procs/perf/limits.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104
Good performance benefit from small die area investment
Hyper-Threading Technology expected to deliver increasingly higher performance
Intel’s Long-Term Hyper-Threading Strategy
[Figure: Performance versus Time. Curves: Single-thread performance, Hyper-thread performance. Phases: Prototype, Introduce, Continuous Development.]
Challenges for Multithreading
• Multithreaded Applications
  – Multithread Programming
  – Automatic Thread Partitioning
• Design Complexity
  – Additional Validation Overhead
  – Increasing Scalar Inefficiency
  – SMT vs. CMP Tradeoffs?
• Latency of Single-Threaded Applications
Multi-Threaded Processors
• Targeting:
  – Throughput of Multi-tasking Workloads
  – Latency of Multi-threaded Applications
• Not Targeting:
  – Latency of Single-threaded Applications
• Research Challenge:
  – Leverage Multi-threaded CPU to Improve Latency of Single-threaded Applications
Asymmetric Multi-Threading
• Symmetric Multi-Threading
  – Partition single thread into multiple threads
  – Achieve performance by parallel execution of multiple threads
• Asymmetric Multi-Threading
  – Attach helper threads to original single-threaded code
  – Achieve speedup via memory prefetching by helper threads
Asymmetric Helper Threads
• Symmetric Multi-Threading Compiler
  – Partition single thread into multiple threads
  – Must ensure semantic correctness
  – Difficult for common and legacy code
• Asymmetric Multi-Threading Compiler
  – Attach helper threads to original code
  – Leverage side effect of helper threads
  – Can be dynamically invoked/controlled
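The "side effect" being leveraged is that a helper thread warms the cache for loads the main thread will issue later. The following timing model is a rough sketch of that idea, with loud assumptions: all latencies are invented, the helper is assumed to chase the same pointers serially (one miss per load) while skipping the main thread's computation, and the helper is assumed not to slow the main thread down, which a real Hyper-Threading machine does not guarantee.

```python
def traverse(n_loads: int, work: int, hit: int, miss: int, helper: bool):
    """Return (total cycles for the main thread, loads whose miss was avoided).

    Hypothetical model: each load is followed by `work` cycles of computation.
    With a helper thread, load i has arrived in cache by cycle (i + 1) * miss.
    """
    t = 0
    avoided = 0
    for i in range(n_loads):
        if not helper:
            t += miss                    # main thread takes the full miss itself
        else:
            arrival = (i + 1) * miss     # when the helper's prefetch completes
            if arrival <= t:
                t += hit                 # line already cached: cache miss avoided
                avoided += 1
            else:
                t = arrival              # prefetch still in flight: wait out the rest
        t += work                        # computation on the loaded data
    return t, avoided


# Invented parameters: 8 delinquent loads, 150 cycles of work per load,
# 2-cycle hit, 100-cycle miss.
base_cycles, _ = traverse(8, work=150, hit=2, miss=100, helper=False)
helped_cycles, avoided = traverse(8, work=150, hit=2, miss=100, helper=True)

print(f"without helper: {base_cycles} cycles")
print(f"with helper:    {helped_cycles} cycles, {avoided} of 8 misses avoided")
print(f"speedup:        {base_cycles / helped_cycles - 1:.0%}")
```

Because the helper only prefetches and never changes program state, it can be wrong or late without affecting correctness, which is why this approach is easier to apply to legacy code than partitioning the thread.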
Pre-fetch via Helper Threads
[Figure: main-thread timeline. A Trigger starts a Helper Thread, which initiates a cache pre-fetch (Cache Pre-fetch Initiated); when the main thread reaches the Delinquent Load, the result is Cache Miss Avoided.]
Chaining Triggers of Helper Threads
[Figure: main-thread timeline. A Trigger starts a Helper Thread, which in turn triggers further Helper Threads in a chain; when the main thread reaches the Delinquent Load, the result is Cache Miss Avoided.]
Chaining Trigger Advantages
• Low-cost Thread Spawning:
  – Chaining triggers initiate helper threads without impacting main thread performance
• Long-range Prefetching:
  – Can target delinquent loads far ahead of the main thread
  – Helper threads make progress independent of main thread’s lack of progress
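The chaining idea can be sketched in software with ordinary threads: the main thread starts only the first helper, and each helper starts its successor before doing its own prefetch. This is an analogy under stated assumptions, not the hardware mechanism. The function `prefetch_with_chain`, the notion of a "chunk", and the use of Python's `threading` module are all illustrative choices, and appending to a list stands in for touching cache lines.

```python
import threading
from typing import List


def prefetch_with_chain(n_chunks: int) -> List[int]:
    """Run a chain of helper threads and return the chunks they covered."""
    prefetched: List[int] = []
    lock = threading.Lock()

    def helper(chunk: int) -> None:
        successor = None
        if chunk + 1 < n_chunks:
            # Chaining trigger: the helper, not the main thread, spawns the
            # next helper, so the chain advances on its own.
            successor = threading.Thread(target=helper, args=(chunk + 1,))
            successor.start()
        with lock:
            prefetched.append(chunk)     # stand-in for prefetching this chunk
        if successor is not None:
            successor.join()             # only to make the sketch easy to wait on

    # The main thread pays for a single spawn: the initial trigger.
    first = threading.Thread(target=helper, args=(0,))
    first.start()
    first.join()
    return sorted(prefetched)


covered = prefetch_with_chain(5)
print(covered)
```

The point of the structure is that the cost of starting helpers 1 through 4 never lands on the main thread, and the chain keeps moving even if the main thread is stalled.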
Prefetching Effectiveness of Helper Threads on HT Machine

Benchmark        Description                                                                  Speedup
Synthetic        Graph traversal in large random graph simulating large database retrieval   22% - 45%
MST (Olden)      Minimal Spanning Tree algorithm used for Data Clustering                     23% - 40%
Health (Olden)   Hierarchical database modeling health care system                            11% - 24%
MCF (SPECint)    Integer programming algorithm used for bus scheduling                        7.08%

Source: Intel Labs
Helper Threads on Hyper-Threading Machine
[Figure: Asymmetric MT effectiveness: speedup, L2 miss coverage; % helper thread Icount. Bar chart, 0% to 120%, over the benchmarks synthetic, mst, health, mcf. Series: Speedup (wall clk), Coverage of L2 Misses, Relative % of Helper Threads Icount.]
Source: Intel Labs
Intel® Processors with Netburst™ Microarchitecture
• Intel Xeon Processor
  – 256KB 2nd-Level Cache
  – .18u process
• Intel® Xeon™ MP Processor
  – 256KB 2nd-Level Cache
  – 1MB 3rd-Level Cache
  – .18u process
• Intel Xeon Processor
  – 512KB 2nd-Level Cache
  – .13u process