berkeley erkeley p lab b p l ar ab the future ofthe future ... · michael anderson, bryan...

18
2/11/2010 © Kurt Keutzer 1 BERKELEY PAR LAB The Future of The Future of Microprocessor-based Computation (is Highly Parallel) (is Highly Parallel) Kurt Keutzer and the PALLAS Team Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing Su, Narayanan Sundaram UC Berkeley BERKELEY PAR LAB 1E+10 Moore’s Law: Transistor Count 0000000 0000000 1E+09 1E 10 10000 100000 1000000 0000000 1000 10000 1970 1980 1990 2000 2010 2020 2

Upload: others

Post on 17-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 1

BERKELEY PAR LABBERKELEY PAR LAB

The Future ofThe Future of Microprocessor-based

Computation (is Highly Parallel)(is Highly Parallel)

Kurt Keutzerand the PALLAS Team

Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai,

Mark Murphy, Bor-Yiing Su, Narayanan Sundaram

UC Berkeley

BERKELEY PAR LAB

1E+10

Moore’s Law: Transistor Count

0000000

0000000

1E+09

1E 10

10000

100000

1000000

0000000

1000

10000

1970 1980 1990 2000 2010 2020

2

Page 2: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 2

BERKELEY PAR LAB

Pipeline Evolution

Instruction Instruction F t hF t h

Instruction Instruction D dD d ExecuteExecute MemoryMemory WritebackWritebackFetchFetch DecodeDecode ecuteecute e o ye o y tebactebac

DEC1DEC1 DEC2DEC2 RATRAT ROBROB DISDIS EXEX RET1RET1 RET2RET2IFU1IFU1 IFU2IFU2 IFU3IFU3

MIPS

QueQue SchSch SchSch SchSch DispDisp DispDisp RFRF RFRF ExEx FlagsFlags Br CkBr Ck DriveDriveTCTCNxtIPNxtIP

TCTCFetchFetch DriveDriveTCTC

NxtIPNxtIPTCTC

FetchFetch AllocAlloc ReRe--namename

ReRe--namename

P6

Netburst

3

Prescott

BERKELEY PAR LAB

What’s wrong with that?

5 stage pizza pipeline – 1 pizza/minute

10 stage pizza pipeline – 1 pizza 30/seconds

4

20 stage pizza pipeline – 1 pizza/20 secondshttp://jansdough.janktheproofer.com/Make-Pizza-tutorial.htm

31 stage pizza pipeline – too much heat in the kitchen

Page 3: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 3

BERKELEY PAR LAB

Results: Pipelined Sequential Performance

1970 1980 1990 2000 2010 2020

5

BERKELEY PAR LAB

Alternative: Parallelism

10 stage pizza pipeline – 1 pizza 30/seconds

two10 stage pizza pipelines – 1 pizza 15/seconds

6four10 stage pizza pipelines – 1 pizza 7.5/second???

Page 4: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 4

BERKELEY PAR LAB

Parallelism in HW

Intel: Larrabee Nvidia: Fermi

7

32 processorseach 16-wide vector unit

16 processorseach 32-wide vector unit

BERKELEY PAR LAB

We can build them … but …

No question we can build these processorsBig question: How do we program 32 512 processors to solve yourBig question: How do we program 32 – 512 processors to solve your favorite application?

Perform content-based image retrievalLarge-vocabulary speech recognitionMRI image reconstructionV l t i k l i i tit ti fiValue-at-risk analysis in quantitative financeEtc.

Parallel processors are more efficient, if we can program them. So … the future of microprocessors is really about software.That’s why the rest of the talk is all about programming applications on parallel microprocessors

8

Page 5: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 5

BERKELEY PAR LAB

How NOT to parallel program

Initial CodeInitial Code

ProfilerRe-code with more threads

N t f t

Performanceprofile

Not fastenough

Fast enough

9

Fast enough

Ship it

Lots of failuresLots of failures

Software running on N Software running on N processor cores is processor cores is slower slower than than on 1on 1

BERKELEY PAR LAB

What is this person thinking of?

Re-code with more threadsmore threads

Edward Lee,

10Mood: anxious, depressedMood: anxious, depressed

“The Problem with Threads”

Threads, locks, Threads, locks, semaphores, data racessemaphores, data races

Page 6: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 6

BERKELEY PAR LABOur PALLAS Methodology

Application Specification

Architect SW toIdentify Parallelism

Performanceprofile

Not fast enoughThought Experiment:

Map SW Arch. to HW Arch.

Write Code

11

profile

Fast enough

Ship it

BERKELEY PAR LAB

What is this person thinking of?

Today’s code is serial but the world is parallel we just need to find the parallelism in ourwe just need to find the parallelism in our applications and reflect it in our software

Re-architectwith patterns

t t l ttt t l tt

12

Software architectureSoftware architecture

structural patternsstructural patternscomputational patternscomputational patterns

Mood: wellMood: well--adjusted, optimistic adjusted, optimistic

Page 7: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 7

BERKELEY PAR LAB

Our formula

Tall-skinny

Domain experts

State-of-the-art algorithms

Tall-skinny parallel

programmers

Architect software with

patterns

Game changing speed-upsCompelling

patterns

Parallel speed ups(with scalability)

applications hardware

13

BERKELEY PAR LAB

Applications

Our formula in a bit of detail: MRI reconstructionOther applications where our formula has proven itselfOther applications where our formula has proven itselfApplications in progress

14

Page 8: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 8

BERKELEY PAR LAB

Compelling Application:Fast, Robust Pediatric MRI

Pediatric MRI is difficult:Children cannot sit still breathholdChildren cannot sit still, breathholdLow tolerance for long examsAnesthesia is costly and risky

Like to accelerate MRI acquisitionAdvanced MRI techniques exist, but

i d t d t i trequire data- and compute- intense algorithms for image reconstruction

Reconstruction must be fast, or time saved in accelerated acquisition is lost in computing reconstruction

Non-starter for clinical useNon-starter for clinical use

15

BERKELEY PAR LAB

Domain Experts andState-of-the-Art Algorithms

Collaboration with MRI Researchers:Miki Lustig Ph D Berkeley EECSMiki Lustig, Ph.D., Berkeley EECSMarc Alley, Ph.D., Stanford EEShreyas Vasanawala, M.D./Ph.D., Stanford Radiology

Advanced MRI: Parallel Imaging and Compressed Sensing to dramatically reduce MRI image acquisition time

Computational IOU: Must solve constrained L1 minimization

16

Page 9: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 9

BERKELEY PAR LAB

SW architecture of image reconstruction

17

BERKELEY PAR LAB

Game-Changing Speedup

100X faster reconstructionHigher quality faster MRIHigher-quality, faster MRIThis image: 8 month-old patient with cancerous mass in liver

256 x 84 x 154 x 8 data sizeSerial Recon: 1 hourP ll l R 1 i tParallel Recon: 1 minute

Fast enough for clinical useSoftware currently deployed at Lucile Packard Children's Hospital for clinical study of the reconstruction techniquereconstruction technique

18

Page 10: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 10

BERKELEY PAR LAB

Applications

Our formula in a bit of detail: MRI reconstructionOther applications where our formula has proven itselfOther applications where our formula has proven itselfApplications in progress

19

BERKELEY PAR LAB

Support-Vector Machine Mini-Framework

Algorithmic changes and parallel i l i l d fimplementation lead to performance speedup: Core 2 Duo versus G80

Computation LIBSVM Our algorithm2-core Parallel

Our algorithm, 16-coreParallelParallel

CPUParallel GPU

SVM Training(geo-mean)

771.8 s --- 38.5 s

SVM Classification

41.42 s 4.21 s 0.38 s

20

Fast support vector machine training and classification, Catanzaro, Sundaram, Keutzer, International Conference on Machine Learning 2008

100X speed‐up793 downloads since release in 10/2008

Classification

Page 11: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 11

BERKELEY PAR LAB

Image Contour Detection Mini-Framework

Original Serial Code(Nehalem)

C + Pthreads, our algorithms8 threads, 2 sockets (Barcelona)

Our algorithms, GTX280

222 seconds 29.79 seconds 1.8 seconds

"Efficient, High-Quality Image Contour Detection" Catanzaro, Su, Sundaram, Lee, Murphy, Keutzer International Conference on Computer Vision, 2009

130X speed‐up490 downloads since release in 6/2009

BERKELEY PAR LABSpeech Recognition Framework

Input: Speech audio waveformInput: Speech audio waveform Output: Recognized word sequences

Achieved 11x speedup over sequential versionAll 3 5 f t th l ti itiAllows 3.5x faster than real time recognition

Our technique is being deployed in a hotline call-center data analytics companyUsed to search content, track service quality and provide early d t ti f i idetection of service issues

22

Scalable HMM based Inference Engine in Large Vocabulary Continuous Speech Recognition,Kisun You, Jike Chong, Youngmin Yi, Ekaterina Gonina, Christopher Hughes, Wonyong Sung and Kurt Keutzer, IEEE Signal Processing Magazine, March 2010

Page 12: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 12

BERKELEY PAR LAB

Computational Finance

Value-at-Risk Computation with Monte Carlo MethodMonte Carlo MethodSummarizes a portfolio’s vulnerabilities to market movementsImportant to algorithmic trading, derivative usage and highly leveraged hedge fundsgImproved implementation to run 60x faster on a parallel microprocessor

f ( )

Four Steps of Monte Carlo Method in Finance

Uniform RandomNumber Generation

Market Parameter Transformation

Instrument Pricing

f (x)Data

Assimilation

Matthew Dixon, Jike Chong, Kurt Keutzer, “Acceleration of Market Value-at-Risk Estimation”, Workshop on High Performance Computing in Finance at Super Computing 2009, November 15, 2009.

BERKELEY PAR LAB

Applications

Our formula in a bit of detail: MRI reconstructionOther applications where our formula has proven itselfOther applications where our formula has proven itselfApplications in progress

24

Page 13: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 13

BERKELEY PAR LAB

Poselet Human Detection

Can locate humans in images20x speedup through algorithmic improvements and parallel implementation

Work can be extended to pose estimation for controller free

25

estimation for controller-free video game interfaces using ordinary web cameras

BERKELEY PAR LAB

Option Pricing Application

Price an option - a tradable financial security whose value depends on the value of an underlying asset and market parameters.on the value of an underlying asset and market parameters. Black-Sholes equation:

Speedup:

6x pricing 1 option25x pricing 128 options on Larrabee

26

Page 14: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 14

BERKELEY PAR LAB

Optical Flow

• Optical Flow involves computing the motion vectors (“flow field”) between the consecutivevectors (“flow field”) between the consecutive frames of a video

• Involves solving a non‐linear optimization problem

Speedup

32x linear solver

Linear solver

Matrix creation

Interpolation & WarpingDownsampling

27

7x overall ow s p g

Filter

Other

Memcopy CPU-GPU

Serial Parallel

BERKELEY PAR LAB2009 ITRS - Functions/chip and Chip Size

10000000

2009 ITRS Cost Performance MPU Functions

Figure 9b Logic

Including ’11 snapshot AnalysisITRS Roadmap

100000

1000000

and

Squa

re M

illim

eter

s

2009 ITRS Cost-Performance MPU Functionsper chip at production (Mtransistorst)

2009 ITRS High-Performance MPU Functionsper chip at production (Mtransistors)

2009 Cost-Performance MPU Chip size atproduction (mm2)

2009 High-Performance MPU Chip size at

MPU= 2x/3yrsMPU

= 2x/2yrs

Average "Moore's Law" = 2x/2yrs

100

1000

10000

Mill

ion

Tran

sist

ors

(1e6

) a g pproduction (mm2)

<260mm2<140mm2

2011: “22nm”/(38nm M1)MPU Model Generations

Work in Progress – Do Not Publish! 28

101995 2000 2005 2010 2015 2020 2025

Year of Production 2009 ITRS: 2009-2024

Page 15: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 15

BERKELEY PAR LABParallelism Delivers Moore’s Law (and More) in the Future

T iTransistorsSerial PerformanceParallel Performance

1970 1980 1990 2000 2010 2020

29

BERKELEY PAR LAB

Tip of the Iceberg

Today we’re only exploiting the smallest tip of the iceberg of computation that will be available in the futurecomputation that will be available in the future

30

Page 16: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 16

BERKELEY PAR LAB

Limits of our formula

Tall-skinny

Domain experts

State-of-the-art algorithms

Tall-skinny parallel

programmers

Architecting software with

patterns

Game changing speed-upsCompelling

patterns

Parallel speed ups(with scalability)

applications hardware

31

BERKELEY PAR LAB

Our research formula

D iT ll ki Application frameworks

Domain experts use frameworks

Tall skinny programmers

Game-changing

d

Compellingapplications

Parallel hardware

speed-ups

32

Page 17: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 17

BERKELEY PAR LAB

How do we deliver all those new applications to the user?

I’ve talked about how to translate Moore’s Law i t f l li ti i iinto useful applications via microprocessors.Our next speakers will talk about how to deliver those applications to you.

The Future of the CloudMichael Franklin

The Future of MobileEric Brewer

33

BERKELEY PAR LAB

Backup

34

Page 18: BERKELEY ERKELEY P LAB B P L AR AB The Future ofThe Future ... · Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing

2/11/2010

© Kurt Keutzer 18

BERKELEY PAR LAB

SW architecture of reconstruction

Pipe and Filter1. Apply SPIRiT Operator:

Iterative POCS Algorithm:

Fork-Join

Linear Alg. Linear Alg.

2. Wavelet Soft-Thresholding

3. Fourier-space projection

Data Parallelism / Fourier Transforms

Data Parallelism / Fourier Transforms

Fork-Join

Iter. Refinement / Spectral Method

Data Parallelism / Convolutions

Data Parallelism / Wavelet xforms

35

Data Parallelism / Fourier xformsData Parallelism / Fourier Transforms