Theory: Asleep at the Switch to Many-Core
Phillip B. Gibbons, Intel Research Pittsburgh
Workshop on Theory and Many-Core, May 29, 2009
Slides are © Phillip B. Gibbons
Two Decades after the peak of Theory’s interest in parallel computing…
The Age of Many-Core is finally underway
• Fueled by Moore’s Law: 2X cores per chip every 18 months
All aboard the parallelism train!
• (Almost) The only way to faster apps
All Aboard the Parallelism Train?
Switch to Many-Core… Many Challenges
• Interest waned long ago
• Yet problems were NOT solved
Research needed in all aspects of Many-Core. Who has answered the call?
• Computer Architecture – YES!
• Programming Languages & Compilers – YES!
• Operating & Runtime Systems – YES!
• Theory
Theory: Asleep at the Switch
Theory needs to wake up and regain a leadership role in parallel computing
“Engineer driving derailed Staten Island train may have fallen asleep at the switch.” (12/26/08)
Theory’s Strengths
• Conceptual Models
– Abstract models of computation
• New Algorithmic Paradigms
– New algorithms, new protocols
• Provable Correctness
– Safety, liveness, security, privacy, …
• Provable Performance Guarantees
– Approximation, probabilistic, new metrics
• Inherent Power/Limitations
– Of primitives, features, …
…among others
Five Areas in Which Theory Can (Should) Have an Important Impact
• Parallel Thinking
• Memory Hierarchy
• Asymmetry/Heterogeneity
• Concurrency Primitives
• Power
[Image: Montparnasse derailment, 1895]
Impact Area: Parallel Thinking
Key: Good Model of Parallel Computation
• Express Parallelism
• Good parallel programmer’s model
• Good for teaching, teaching “how to think”
• Can be engineered to good performance
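A tiny illustration of what such a model lets you teach (my own example, not from the talk): in the work-depth style of parallel thinking, an algorithm is judged by its total work W and its critical-path depth D. A divide-and-conquer sum has W = O(n) and D = O(log n), because the two halves are independent. The sketch below simulates this sequentially while tracking both quantities:

```python
def dc_sum(xs):
    """Divide-and-conquer sum, returning (total, work, depth).

    work  = total number of additions/leaf reads, O(n)
    depth = longest chain of dependent operations, O(log n),
            since the two recursive halves are independent and
            could run in parallel on a many-core machine.
    """
    n = len(xs)
    if n == 0:
        return 0, 0, 0
    if n == 1:
        return xs[0], 1, 1
    mid = n // 2
    ls, lw, ld = dc_sum(xs[:mid])   # these two calls are independent:
    rs, rw, rd = dc_sum(xs[mid:])   # parallel in spirit
    # One combining add; depth grows by one level, work by lw+rw+1.
    return ls + rs, lw + rw + 1, max(ld, rd) + 1
```

On 8 elements the depth is 4 (one leaf level plus three combine levels) while the work stays linear, which is exactly the separation a good model of parallel computation makes teachable.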
Impact Area: Memory Hierarchy
• Deep cache/storage hierarchy
• Need conceptual model
• Need smart thread schedulers
Impact Area: Asymmetry/Heterogeneity
• Fat/Thin cores
• SIMD extensions
• Multiple coherence domains
• Mixed-mode parallelism
• Virtual Machines
• ...
Impact Area: Concurrency Primitives
• Parallel prefix
• Hash map [Herlihy08]
• Map reduce [Karloff09]
• Transactional memory
• Memory block transactions [Blelloch08]
• Graphics primitives [Ha08]
• Make the case Many-Core should (not) support
• Improve the algorithm
• Recommend new primitives (prescriptive)
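The first primitive on the list, parallel prefix (scan), is concrete enough to sketch. Below is a sequential simulation of Blelloch's two-phase exclusive scan in Python; the talk gives no code, so the function name and the power-of-two restriction are my illustrative choices. The point is that every inner loop touches independent positions, so each tree level runs in parallel on a many-core machine, giving O(n) work and O(log n) depth.

```python
def exclusive_scan(a, op=lambda x, y: x + y, identity=0):
    """Blelloch-style exclusive prefix scan, simulated sequentially.

    Each `for i in range(...)` loop below iterates over independent
    array positions; on a many-core machine those iterations run in
    parallel, for O(n) total work and O(log n) depth.
    """
    n = len(a)
    assert n and (n & (n - 1)) == 0, "power-of-two length for simplicity"
    t = list(a)
    # Up-sweep (reduce): build partial sums over a balanced tree.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):          # independent positions
            t[i + 2 * d - 1] = op(t[i + d - 1], t[i + 2 * d - 1])
        d *= 2
    # Down-sweep: push prefixes back down the tree.
    t[n - 1] = identity
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):          # independent positions
            left = t[i + d - 1]
            t[i + d - 1] = t[i + 2 * d - 1]
            t[i + 2 * d - 1] = op(left, t[i + 2 * d - 1])
        d //= 2
    return t
```

Exclusive scan is the building block behind pack/filter and several of the other primitives listed above, which is one reason it keeps appearing in arguments about what Many-Core hardware should support.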
Impact Area: Power
Many-cores provide features for reducing power
• Voltage scaling [Albers07]
• Dynamically run on fewer cores, fewer banks
A fertile area for Theory to help
Deep Dive: Memory Hierarchy
• Deep cache/storage hierarchy
• Need conceptual model
• Need smart thread schedulers
Good Performance Requires Effective Use of the Memory Hierarchy
[Diagram: CPU, L1, L2 Cache, Main Memory, Magnetic Disks, from fastest to slowest]
Performance:
• Running/response time
• Throughput
• Power
Two new trends, Pervasive Multicore & Pervasive Flash, bring new challenges and opportunities
New Trend 1: Pervasive Multicore
[Diagram: several CPUs, each with a private L1, sharing an L2 cache above Main Memory and Magnetic Disks]
Makes Effective Use of Hierarchy Much Harder
Challenges
• Cores compete for hierarchy
• Hard to reason about parallel performance
• Hundreds of cores coming soon
• Cache hierarchy design in flux
• Hierarchies differ across platforms
Opportunity
• Rethink apps & systems to take advantage of more CPUs on chip
New Trend 2: Pervasive Flash
New Type of Storage in the Hierarchy
[Diagram: several CPUs with private L1s and a shared L2 above Main Memory, with Flash Devices joining Magnetic Disks at the bottom]
Challenges
• Performance quirks of Flash
• Technology in flux, e.g., Flash Translation Layer (FTL)
Opportunity
• Rethink apps & systems to take advantage
How Hierarchy is Treated Today
Algorithm Designers & Application/System Developers often tend toward one of two extremes:
• Ignorant: API view of Memory + I/O; parallelism often ignored [performance iffy]
• (Pain)fully Aware: hand-tuned to the platform [effort high, not portable, limited sharing scenarios]
Or they focus on one or a few aspects, but without a comprehensive view of the whole.
The Hierarchy-Savvy Parallel Algorithm Design (Hi-Spade) project seeks to enable a hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies:
• Hide what can be hidden
• Expose what must be exposed for good performance
• Robust: many platforms, many resource sharing scenarios
• "Hierarchy-Savvy": the sweet spot between ignorant and (pain)fully aware
http://www.pittsburgh.intel-research.net/projects/hi-spade/
Hi-Spade Research Scope
A hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies
Research agenda includes
• Theory: conceptual models, algorithms, analytical guarantees
• Systems: runtime support, performance tools, architectural features
• Applications: databases, operating systems, application kernels
Cache Hierarchies: Sequential
External Memory (EM) Algorithms [see Vitter's ACM Computing Surveys article]
External Memory Model: Main Memory (size M) above External Memory, with transfers in blocks of size B
+ Simple model
+ Minimize I/Os
– Only 2 levels
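The EM model's cost measure, block transfers, is easy to make tangible. Here is a small LRU-cache simulator (my own illustration, not from the talk) that counts block misses for an address trace under the model's parameters M and B:

```python
from collections import OrderedDict

def count_misses(accesses, cache_blocks, block_size):
    """Count block misses under LRU replacement.

    Models the EM/ideal-cache setup: a fast memory holding
    `cache_blocks` blocks (so M = cache_blocks * block_size) in front
    of a slow memory accessed in units of `block_size` (B).
    """
    cache = OrderedDict()              # block id -> True, in LRU order
    misses = 0
    for addr in accesses:
        blk = addr // block_size
        if blk in cache:
            cache.move_to_end(blk)     # refresh LRU position
        else:
            misses += 1                # one block transfer from slow memory
            cache[blk] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict least recently used
    return misses
```

A sequential scan of n consecutive addresses costs ceil(n/B) transfers regardless of cache size, the canonical EM bound, while a trace that cycles through more distinct blocks than the cache holds misses on every revisit.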
Cache Hierarchies: Sequential
Alternative: Cache-Oblivious Algorithms [Frigo99]
Cache-Oblivious Model: a twist on the EM model in which M & B are unknown to the algorithm
+ Simple model
+ Key goal: good performance for any M & B, which guarantees good cache performance at all levels of the hierarchy
– Single CPU only
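To make "M & B unknown to the algorithm" concrete, here is a classic cache-oblivious kernel in the style of [Frigo99]: matrix transposition by recursive halving, sketched in Python (the talk gives no code; the base-case cutoff is an illustrative constant for recursion overhead, not a tuned machine parameter). No cache size or block size appears anywhere, yet once a subproblem fits within some level of the hierarchy, all of its accesses stay in that level:

```python
def transpose(A, B, r0=0, c0=0, rows=None, cols=None, cutoff=16):
    """Cache-oblivious transpose: writes A^T into B.

    Recursively halves the larger dimension. The code never mentions
    M or B (the 'oblivious' part); the recursion automatically adapts
    to every level of the cache hierarchy at once.
    """
    if rows is None:
        rows, cols = len(A), len(A[0])
    if rows <= cutoff and cols <= cutoff:
        # Base case: small enough that recursion overhead dominates.
        for i in range(r0, r0 + rows):
            for j in range(c0, c0 + cols):
                B[j][i] = A[i][j]
    elif rows >= cols:
        h = rows // 2                      # split the taller dimension
        transpose(A, B, r0, c0, h, cols, cutoff)
        transpose(A, B, r0 + h, c0, rows - h, cols, cutoff)
    else:
        h = cols // 2                      # split the wider dimension
        transpose(A, B, r0, c0, rows, h, cutoff)
        transpose(A, B, r0, c0 + h, rows, cols - h, cutoff)
```

The same divide-and-conquer shape underlies cache-oblivious sorting and matrix multiply, which is why the model pairs so naturally with recursive algorithm design.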
Cache Hierarchies: Parallel
Explicit multi-level hierarchy: Multi-BSP Model [Valiant08]
Hierarchy-Savvy sweet spot goal: approach the simplicity of the cache-oblivious model
Challenge: the theory of cache-oblivious algorithms falls apart once parallelism is introduced.
Good performance for any M & B on 2 levels DOES NOT imply good performance at all levels of the hierarchy.
Key reason: caches are not fully shared.
[Diagram: CPU1, CPU2, CPU3, each with a private L1, sharing an L2 cache]
What's good for CPU1 is often bad for CPU2 & CPU3, e.g., when all want to write block B at ≈ the same time.
– Parallel cache-obliviousness is too strict a goal.
Key new dimension: scheduling of parallel threads has a LARGE impact on cache performance.
Recall our problem scenario: all CPUs want to write block B at ≈ the same time.
[Diagram: CPU1, CPU2, CPU3, each with a private L1, sharing an L2 cache]
Can mitigate (but not solve) the problem if the writes can be scheduled to be far apart in time.
Existing Parallel Cache Models (slide from Rezaul Chowdhury)
• Parallel Private-Cache Model: CPUs 1, 2, …, p, each with a private cache (size C), performing block transfers (size B) to and from main memory
• Parallel Shared-Cache Model: CPUs 1, 2, …, p sharing a single cache (size C), with block transfers (size B) to and from main memory
Competing Demands of Private and Shared Caches (slide from Rezaul Chowdhury)
• Shared cache: cores work on the same set of cache blocks
• Private cache: cores work on disjoint sets of cache blocks
Experimental results have shown that on CMP architectures:
• work stealing, the state-of-the-art scheduler for the private-cache model, can suffer from excessive shared-cache misses
• parallel depth-first, the best scheduler for the shared-cache model, can incur excessive private-cache misses
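The private-cache bias of work stealing is easiest to see in the data structure it is built on. Below is a minimal, hypothetical sketch (coarsely locked for simplicity; real implementations such as the Arora-Blumofe-Plaxton deque are lock-free): the owning core pushes and pops at the bottom, so it keeps working on its most recently spawned, cache-warm tasks, while idle cores steal the oldest task from the top. That locality-preserving behavior is precisely what favors private caches and can hurt a shared one.

```python
import threading
from collections import deque

class WorkStealingDeque:
    """Minimal sketch of the per-core deque behind work stealing.

    Owner pushes/pops at the bottom (LIFO: reuses data still warm in
    its private cache); thieves steal from the top (FIFO: they take
    the oldest, typically largest, pending subcomputation).
    """
    def __init__(self):
        self._dq = deque()
        self._lock = threading.Lock()   # real deques avoid this lock

    def push(self, task):               # called by the owning worker
        with self._lock:
            self._dq.append(task)

    def pop(self):                      # called by the owning worker
        with self._lock:
            return self._dq.pop() if self._dq else None

    def steal(self):                    # called by an idle worker
        with self._lock:
            return self._dq.popleft() if self._dq else None
```

Parallel depth-first scheduling instead prioritizes tasks in the order a sequential depth-first execution would run them, which keeps all cores on nearby data, good for a shared cache, bad for private ones, mirroring the trade-off on the slide.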
Private vs. Shared Caches
[Diagram: CPU1, CPU2, CPU3, each with a private L1, sharing an L2 cache]
• Parallel all-shared hierarchy: provably good cache performance for cache-oblivious algorithms
• 3-level multi-core model: insights on private vs. shared
+ Designed a new scheduler with provably good cache performance for a class of divide-and-conquer algorithms [Blelloch08]
– Results require exposing the working-set size of each recursive subproblem
Parallel Tree of Caches
[Diagram: a tree of caches, where each internal node is a cache shared by all cores in the subtree below it]
Approach [Blelloch09]: design a low-depth cache-oblivious algorithm.
Theorem: for each level i, only O(M_i · P · D / B_i) more misses than the sequential schedule.
Low depth D ⇒ good miss bound.
Five Areas in Which Theory Can (Should) Have an Important Impact
• Parallel Thinking
• Memory Hierarchy
• Asymmetry/Heterogeneity
• Concurrency Primitives
• Power