Theory: Asleep at the Switch to Many- Core Phillip B. Gibbons Intel Research Pittsburgh Workshop on Theory and Many-Core May 29, 2009 Slides are © Phillip B. Gibbons


Theory: Asleep at the Switch to Many-Core

Phillip B. Gibbons
Intel Research Pittsburgh

Workshop on Theory and Many-Core
May 29, 2009

Slides are © Phillip B. Gibbons


Two Decades after the peak of Theory’s interest in parallel computing…

The Age of Many-Core is finally underway

• Fueled by Moore’s Law: 2X cores per chip every 18 months

All aboard the parallelism train!

• (Almost) The only way to faster apps


All Aboard the Parallelism Train?

Switch to Many-Core… Many Challenges

• Interest waned long ago

• Yet problems were NOT solved

Research needed in all aspects of Many-Core. Who has answered the call?

• Computer Architecture: YES!

• Programming Languages & Compilers: YES!

• Operating & Runtime Systems: YES!

• Theory


Theory: Asleep at the Switch

Theory needs to wake up & regain a leadership role in parallel computing

“Engineer driving derailed Staten Island train may have fallen asleep at the switch.” (12/26/08)


Theory’s Strengths

• Conceptual Models: abstract models of computation

• New Algorithmic Paradigms: new algorithms, new protocols

• Provable Correctness: safety, liveness, security, privacy, …

• Provable Performance Guarantees: approximation, probabilistic, new metrics

• Inherent Power/Limitations: of primitives, features, …

…among others


Five Areas in Which Theory Can (Should) Have an Important Impact

• Parallel Thinking

• Memory Hierarchy

• Asymmetry/Heterogeneity

• Concurrency Primitives

• Power

[Image: Montparnasse 1895]


Impact Area: Parallel Thinking

Key: Good Model of Parallel Computation

• Express Parallelism

• Good parallel programmer’s model

• Good for teaching, teaching “how to think”

• Can be engineered to good performance


Impact Area: Memory Hierarchy

• Deep cache/storage hierarchy

• Need conceptual model

• Need smart thread schedulers


Impact Area: Asymmetry/Heterogeneity

• Fat/Thin cores

• SIMD extensions

• Multiple coherence domains

• Mixed-mode parallelism

• Virtual Machines

• ...


Impact Area: Concurrency Primitives

• Parallel prefix

• Hash map [Herlihy08]

• Map reduce [Karloff09]

• Transactional memory

• Memory block transactions [Blelloch08]

• Graphics primitives [Ha08]

For each primitive, Theory can:

• Make the case that Many-Core should (not) support it

• Improve the algorithm

• Recommend new primitives (prescriptive)
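To make the first primitive on this slide concrete, here is a minimal sketch of parallel prefix as a work-efficient exclusive scan (up-sweep, then down-sweep). This is an illustration added to the transcript, not something shown on the slide; the function name `exclusive_scan` and its parameters are assumptions. The code runs sequentially, but each inner loop touches disjoint positions, so each could execute as a single parallel step.

```python
from operator import add


def exclusive_scan(values, op=add, identity=0):
    """Exclusive prefix scan via up-sweep and down-sweep (illustrative sketch).

    Every inner loop below updates disjoint indices, so on a parallel
    machine each one could be a single parallel step of O(log n) total depth.
    """
    n = len(values)
    size = 1
    while size < n:
        size *= 2
    tree = list(values) + [identity] * (size - n)  # pad to a power of two

    # Up-sweep: combine pairs, leaving partial results at the right ends.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            tree[i] = op(tree[i - stride], tree[i])
        stride *= 2

    # Down-sweep: push prefixes back down the implicit tree.
    tree[size - 1] = identity
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = tree[i - stride]
            tree[i - stride] = tree[i]       # left child inherits parent's prefix
            tree[i] = op(tree[i], left)      # right child adds the left subtree
        stride //= 2
    return tree[:n]


print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
```

The scan does O(n) work in O(log n) parallel steps, which is one reason it is a natural candidate when asking which primitives many-core hardware should support.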


Impact Area: Power

Many-cores provide features for reducing power

• Voltage scaling [Albers07]

• Dynamically run on fewer cores, fewer banks

Fertile area for Theory help


Deep Dive: Memory Hierarchy

• Deep cache/storage hierarchy

• Need conceptual model

• Need smart thread schedulers


Good Performance Requires Effective Use of the Memory Hierarchy

[Diagram: CPU → L1 → L2 Cache → Main Memory → Magnetic Disks]

Performance:

• Running/response time

• Throughput

• Power

Two new trends, Pervasive Multicore & Pervasive Flash, bring new challenges and opportunities


New Trend 1: Pervasive Multicore

Makes Effective Use of Hierarchy Much Harder

[Diagram: three CPUs, each with its own L1 → Shared L2 Cache → Main Memory → Magnetic Disks]

Challenges

• Cores compete for hierarchy

• Hard to reason about parallel performance

• Hundred cores coming soon

• Cache hierarchy design in flux

• Hierarchies differ across platforms

Opportunity

• Rethink apps & systems to take advantage of more CPUs on chip


New Trend 2: Pervasive Flash

New Type of Storage in the Hierarchy

[Diagram: three CPUs, each with its own L1 → Shared L2 Cache → Main Memory → Magnetic Disks, with Flash Devices added to the hierarchy]

Challenges

• Performance quirks of Flash

• Technology in flux, e.g., Flash Translation Layer (FTL)

Opportunity

• Rethink apps & systems to take advantage


How Hierarchy is Treated Today

Algorithm Designers & Application/System Developers often tend towards one of two extremes

• Ignorant: API view of Memory + I/O; parallelism often ignored [Performance iffy]

• (Pain)-Fully Aware: hand-tuned to platform [Effort high, not portable, limited sharing scenarios]

Or they focus on one or a few aspects, but without a comprehensive view of the whole


Hierarchy-Savvy Parallel Algorithm Design (Hi-Spade) project

…seeks to enable:

A hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies

“Hierarchy-Savvy”

• Hide what can be hidden

• Expose what must be exposed for good performance

• Robust: many platforms, many resource sharing scenarios

• Sweet spot between ignorant and (pain)fully aware

http://www.pittsburgh.intel-research.net/projects/hi-spade/


Hi-Spade Research Scope

A hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies

Research agenda includes

• Theory: conceptual models, algorithms, analytical guarantees

• Systems: runtime support, performance tools, architectural features

• Applications: databases, operating systems, application kernels


Cache Hierarchies: Sequential

External Memory (EM) Algorithms [see Vitter’s ACM Computing Surveys article]

External Memory Model

• Main Memory (size M)

• External Memory

• Block size B

+ Simple model

+ Minimize I/Os

– Only 2 levels
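As a rough illustration of what "minimize I/Os" means in the External Memory model, the sketch below counts block transfers for a scan and for a textbook external merge sort, given N items, main memory of size M, and block size B (all measured in items). The function names and the example numbers are assumptions made for illustration; they do not come from the slides.

```python
import math


def scan_ios(n, block):
    """Block transfers needed to read n contiguous items: ceil(n / B)."""
    return math.ceil(n / block)


def merge_sort_ios(n, memory, block):
    """Block transfers for a textbook external merge sort (illustrative count).

    Pass 1 forms sorted runs of M items each. Every later pass merges up to
    (M/B - 1) runs at a time, keeping one block free as an output buffer.
    Each pass reads and writes every block once.
    """
    if memory < 3 * block:
        raise ValueError("need room for at least two input blocks and one output block")
    blocks = math.ceil(n / block)
    runs = math.ceil(n / memory)
    fan_in = memory // block - 1
    passes = 1  # run formation
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        passes += 1
    return 2 * blocks * passes


# Hypothetical sizes: 500,000 items, memory holds 10,000, blocks hold 100.
print(scan_ios(500_000, 100))                # 5000
print(merge_sort_ios(500_000, 10_000, 100))  # 20000
```

The count matches the familiar sorting bound of order (N/B) log_{M/B}(N/B) I/Os: the number of passes grows only logarithmically, with base roughly M/B.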


Cache Hierarchies: Sequential

Alternative: Cache-Oblivious Algorithms [Frigo99]

Cache-Oblivious Model

• Main Memory (size M)

• External Memory

• Block size B

Twist on EM Model: M & B unknown to Algorithm

Key Goal: Good performance for any M & B

+ Simple model

+ Guaranteed good cache performance at all levels of hierarchy

– Single CPU only
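A standard example of the cache-oblivious style is recursive matrix transpose: the code never mentions M or B, yet the recursion eventually produces subproblems that fit in whatever cache is present. The sketch below is an illustration added here, not code from the talk; `co_transpose` is a hypothetical name, the input is assumed to be a non-empty rectangular list of lists, and Python lists only show the recursion pattern rather than a real memory layout.

```python
def co_transpose(src, dst, r0=0, r1=None, c0=0, c1=None):
    """Write the transpose of src[r0:r1][c0:c1] into dst by recursive halving.

    No cache size or block size appears anywhere: the larger dimension is
    split in half until single elements remain.
    """
    if r1 is None:
        r1, c1 = len(src), len(src[0])
    rows, cols = r1 - r0, c1 - c0
    if rows == 0 or cols == 0:
        return
    if rows == 1 and cols == 1:
        dst[c0][r0] = src[r0][c0]
        return
    if rows >= cols:
        mid = r0 + rows // 2          # split the row range
        co_transpose(src, dst, r0, mid, c0, c1)
        co_transpose(src, dst, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2          # split the column range
        co_transpose(src, dst, r0, r1, c0, mid)
        co_transpose(src, dst, r0, r1, mid, c1)


matrix = [[1, 2, 3], [4, 5, 6]]
result = [[0] * 2 for _ in range(3)]
co_transpose(matrix, result)
print(result)  # [[1, 4], [2, 5], [3, 6]]
```

A practical implementation would stop the recursion at a small base-case tile for speed; that cutoff is a tuning constant, not knowledge of M or B.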


Cache Hierarchies: Parallel

Explicit Multi-level Hierarchy: Multi-BSP Model [Valiant08]

Goal: Approach simplicity of cache-oblivious model

Hierarchy-Savvy sweet spot


Challenge:

– Theory of cache-oblivious algorithms falls apart once parallelism is introduced:

Good performance for any M & B on 2 levels DOES NOT imply good performance at all levels of hierarchy

Key reason: Caches not fully shared

[Diagram: CPU1, CPU2, CPU3, each with its own L1 → Shared L2 Cache, with a block B]

What’s good for CPU1 is often bad for CPU2 & CPU3, e.g., all want to write B at ≈ the same time

– Parallel cache-obliviousness too strict a goal


Key new dimension: Scheduling of parallel threads

Has LARGE impact on cache performance

Recall our problem scenario: all CPUs want to write B at ≈ the same time

[Diagram: CPU1, CPU2, CPU3, each with its own L1 → Shared L2 Cache, with a block B]

Can mitigate (but not solve) if we can schedule the writes to be far apart in time
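The effect of scheduling on the "everyone writes B" scenario can be seen in a toy model. The sketch below assumes, purely for illustration, that block B can be held for writing by only one CPU's private cache at a time, and that every change of writer costs one ownership transfer. The function names and the counts of CPUs and writes are hypothetical, and real coherence protocols are more involved.

```python
def ownership_transfers(schedule):
    """Count how often block B changes hands, given the order of writers."""
    owner = None
    transfers = 0
    for cpu in schedule:
        if cpu != owner:
            transfers += 1   # B must move to this CPU's private cache
            owner = cpu
    return transfers


def interleaved(cpus, writes_per_cpu):
    """All CPUs write B at about the same time: 0, 1, 2, 0, 1, 2, ..."""
    return [cpu for _ in range(writes_per_cpu) for cpu in range(cpus)]


def spread_apart(cpus, writes_per_cpu):
    """Each CPU finishes its writes to B before the next starts."""
    return [cpu for cpu in range(cpus) for _ in range(writes_per_cpu)]


print(ownership_transfers(interleaved(3, 100)))   # 300
print(ownership_transfers(spread_apart(3, 100)))  # 3
```

Separating the writes in time reduces the transfers in this model but never eliminates them, which is in line with the slide's "mitigate (but not solve)".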


Existing Parallel Cache Models

Parallel Private-Cache Model:

[Diagram: CPUs 1, 2, …, p, each with a private cache (size = C); block transfers (size = B) between each private cache and main memory]

Parallel Shared-Cache Model:

[Diagram: CPUs 1, 2, …, p sharing one cache (size = C); block transfers (size = B) between the shared cache and main memory]

Slide from Rezaul Chowdhury


Competing Demands of Private and Shared Caches

[Diagram: CPUs 1, 2, …, p, each with a private cache → shared cache → main memory]

Shared cache: cores work on the same set of cache blocks

Private cache: cores work on disjoint sets of cache blocks

Experimental results have shown that on CMP architectures

• work-stealing, i.e., the state-of-the-art scheduler for the private-cache model, can suffer from excessive shared-cache misses

• parallel depth first, i.e., the best scheduler for the shared-cache model, can incur excessive private-cache misses

Slide from Rezaul Chowdhury
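The tension between the two cache types can be illustrated with a small LRU simulation. This is a toy sketch added here, not an experiment from the talk: the cache capacities, core count, and access traces are arbitrary assumptions, and the private caches are given the same capacity as the shared one only to keep the example short.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses of an LRU cache holding `capacity` blocks."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)         # mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)    # evict least recently used
    return misses


def interleave(per_core_traces):
    """Round-robin merge of equal-length per-core traces."""
    return [block for step in zip(*per_core_traces) for block in step]


CORES, BLOCKS, ROUNDS, CAPACITY = 4, 8, 10, 8

# Every core sweeps the same 8 blocks, versus each core sweeping its own 8.
same_set = [list(range(BLOCKS)) * ROUNDS for _ in range(CORES)]
disjoint = [[(core, b) for b in range(BLOCKS)] * ROUNDS for core in range(CORES)]

shared_same = lru_misses(interleave(same_set), CAPACITY)
shared_disjoint = lru_misses(interleave(disjoint), CAPACITY)
private_disjoint = sum(lru_misses(trace, CAPACITY) for trace in disjoint)

print(shared_same, shared_disjoint, private_disjoint)  # 8 320 32
```

In this toy setting the shared cache behaves well only when the cores touch the same blocks, while disjoint working sets thrash it but sit comfortably in private caches. A scheduler tuned for one level can therefore hurt the other.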


Private vs. Shared Caches

• Parallel all-shared hierarchy:

+ Provably good cache performance for cache-oblivious algs

• 3-level multi-core model: insights on private vs. shared

+ Designed new scheduler with provably good cache performance for class of divide-and-conquer algorithms [Blelloch08]

– Results require exposing working set size for each recursive subproblem

[Diagram: CPU1, CPU2, CPU3, each with its own L1 → Shared L2 Cache]


Parallel Tree of Caches

[Diagram: tree of caches]

Approach: [Blelloch09] Design low-depth cache-oblivious algorithm

Thrm: for each level $i$, only $O(M_i P D / B_i)$ misses more than the sequential schedule

Low depth $D$ ⇒ good miss bound
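To see why low depth matters in this bound, the sketch below evaluates the extra-miss term M_i · P · D / B_i for a single cache level, where M_i is the cache size at level i, B_i its block size, P the number of processors, and D the depth of the computation. The hidden big-O constant is taken as 1, and all numbers are made up for illustration; `extra_miss_bound` is a hypothetical helper, not part of the cited work.

```python
def extra_miss_bound(cache_size, block_size, processors, depth):
    """Evaluate M_i * P * D / B_i, ignoring the constant hidden by big-O.

    cache_size and block_size must use the same unit (e.g. bytes).
    """
    return cache_size * processors * depth / block_size


# Hypothetical level: 32 KiB cache, 64-byte blocks, 8 processors.
shallow = extra_miss_bound(2**15, 64, processors=8, depth=20)
deep = extra_miss_bound(2**15, 64, processors=8, depth=2000)

print(shallow)         # 81920.0
print(deep / shallow)  # 100.0
```

The term grows linearly in D, so an algorithm with polylogarithmic depth keeps the overhead small compared with the sequential miss count, while a deep computation can let it dominate.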


Five Areas in Which Theory Can (Should) Have an Important Impact

• Parallel Thinking

• Memory Hierarchy

• Asymmetry/Heterogeneity

• Concurrency Primitives

• Power