TRANSCRIPT
2/11/2010
© Kurt Keutzer 1
BERKELEY PAR LAB
The Future of Microprocessor-based Computation (is Highly Parallel)
Kurt Keutzer and the PALLAS Team
Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Dorothea Kolossa, Chao-Yue Lai, Mark Murphy, Bor-Yiing Su, Narayanan Sundaram
UC Berkeley
Moore's Law: Transistor Count
[Figure: transistor count per chip, log scale (10^3 to 10^10), 1970-2020]
Pipeline Evolution
[Figure: pipeline stages across three microprocessor generations]
MIPS: Instruction Fetch -> Instruction Decode -> Execute -> Memory -> Writeback
P6: IFU1 -> IFU2 -> IFU3 -> DEC1 -> DEC2 -> RAT -> ROB -> DIS -> EX -> RET1 -> RET2
Netburst/Prescott: TC NxtIP -> TC Fetch -> Drive -> Alloc -> Rename -> Que -> Sch -> Sch -> Sch -> Disp -> Disp -> RF -> RF -> Ex -> Flags -> Br Ck -> Drive
What's wrong with that?
5-stage pizza pipeline – 1 pizza/minute
10-stage pizza pipeline – 1 pizza/30 seconds
20-stage pizza pipeline – 1 pizza/20 seconds
31-stage pizza pipeline – too much heat in the kitchen
(Pizza image: http://jansdough.janktheproofer.com/Make-Pizza-tutorial.htm)
Results: Pipelined Sequential Performance
[Figure: sequential performance over time, 1970-2020]
Alternative: Parallelism
10-stage pizza pipeline – 1 pizza/30 seconds
Two 10-stage pizza pipelines – 1 pizza/15 seconds
Four 10-stage pizza pipelines – 1 pizza/7.5 seconds???
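The arithmetic behind the pizza analogy is plain throughput scaling; a toy sketch using the slide's numbers:

```python
def seconds_per_pizza(base_seconds, num_pipelines):
    """Steady-state delivery interval: identical pipelines running in
    parallel divide the interval between finished pizzas."""
    return base_seconds / num_pipelines

# One 10-stage pipeline delivers a pizza every 30 seconds; adding
# pipelines scales throughput linearly (in the ideal case).
for k in (1, 2, 4):
    print(k, "pipeline(s):", seconds_per_pizza(30, k), "s/pizza")
```

The "???" on the slide is the catch: the ideal linear scaling above only holds if the work (and the kitchen staff) can actually be divided across pipelines.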
Parallelism in HW
Intel Larrabee: 32 processors, each with a 16-wide vector unit
Nvidia Fermi: 16 processors, each with a 32-wide vector unit
We can build them … but …
No question we can build these processors. Big question: how do we program 32–512 processors to solve your favorite application?
Perform content-based image retrieval
Large-vocabulary speech recognition
MRI image reconstruction
Value-at-risk analysis in quantitative finance
Etc.
Parallel processors are more efficient, if we can program them. So … the future of microprocessors is really about software. That's why the rest of the talk is all about programming applications on parallel microprocessors.
How NOT to parallel program
[Flowchart: Initial Code -> Profiler -> Performance profile -> not fast enough? -> Re-code with more threads -> (repeat); fast enough? -> Ship it]
Lots of failures: software running on N processor cores is slower than on 1.
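One reason for those failures is easy to demonstrate. The sketch below (not from the talk; a minimal illustration) shows the classic lost-update data race: an unsynchronized `counter += 1` compiles to separate load/add/store steps, so concurrent threads can interleave and lose increments, while a lock makes the read-modify-write atomic:

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n):
    # Racy: two threads can both read the same old value of 'counter',
    # each add 1, and store, losing one of the increments.
    global counter
    for _ in range(n):
        counter += 1

def safe_increment(n):
    # Correct: the lock serializes the read-modify-write.
    global counter
    for _ in range(n):
        with lock:
            counter += 1

def run(worker, n=100_000, num_threads=4):
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(n,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(run(safe_increment))    # always 400000
print(run(unsafe_increment))  # may come up short
```

The locked version is correct but slower; getting both correctness and speed is exactly the difficulty the slide is pointing at.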
What is this person thinking of?
Re-code with more threads: threads, locks, semaphores, data races.
Mood: anxious, depressed
Edward Lee, "The Problem with Threads"
Our PALLAS Methodology
[Flowchart: Application Specification -> Architect SW to Identify Parallelism -> Thought Experiment: Map SW Arch. to HW Arch. -> Write Code -> Performance profile -> not fast enough? -> back to architecting; fast enough? -> Ship it]
What is this person thinking of?
Today's code is serial, but the world is parallel; we just need to find the parallelism in our applications and reflect it in our software.
Re-architect with patterns: software architecture = structural patterns + computational patterns.
Mood: well-adjusted, optimistic
Our formula
Domain experts + state-of-the-art algorithms + tall-skinny parallel programmers + architect software with patterns + parallel hardware => game-changing speed-ups (with scalability) => compelling applications
Applications
Our formula in a bit of detail: MRI reconstruction
Other applications where our formula has proven itself
Applications in progress
Compelling Application: Fast, Robust Pediatric MRI
Pediatric MRI is difficult:
Children cannot sit still or breath-hold
Low tolerance for long exams
Anesthesia is costly and risky
We would like to accelerate MRI acquisition. Advanced MRI techniques exist, but require data- and compute-intense algorithms for image reconstruction. Reconstruction must be fast, or the time saved in accelerated acquisition is lost in computing the reconstruction – a non-starter for clinical use.
Domain Experts and State-of-the-Art Algorithms
Collaboration with MRI researchers:
Miki Lustig, Ph.D., Berkeley EECS
Marc Alley, Ph.D., Stanford EE
Shreyas Vasanawala, M.D./Ph.D., Stanford Radiology
Advanced MRI: parallel imaging and compressed sensing to dramatically reduce MRI image acquisition time
Computational IOU: must solve a constrained L1 minimization
SW architecture of image reconstruction
[Figure: software architecture diagram; the full diagram appears in the backup slide "SW architecture of reconstruction"]
Game-Changing Speedup
100X faster reconstruction; higher-quality, faster MRI
This image: 8-month-old patient with cancerous mass in liver
256 x 84 x 154 x 8 data size
Serial recon: 1 hour; parallel recon: 1 minute
Fast enough for clinical use. Software currently deployed at Lucile Packard Children's Hospital for a clinical study of the reconstruction technique.
Applications
Our formula in a bit of detail: MRI reconstruction
Other applications where our formula has proven itself
Applications in progress
Support-Vector Machine Mini-Framework
Algorithmic changes and parallel implementation lead to performance speedup: Core 2 Duo versus G80

Computation              LIBSVM,               Our algorithm,         Our algorithm,
                         2-core parallel CPU   16-core parallel CPU   parallel GPU
SVM Training (geo-mean)  771.8 s               ---                    38.5 s
SVM Classification       41.42 s               4.21 s                 0.38 s

100X speed-up; 793 downloads since release in 10/2008
Fast support vector machine training and classification, Catanzaro, Sundaram, Keutzer, International Conference on Machine Learning 2008
Image Contour Detection Mini-Framework

Original serial code   C + Pthreads, our algorithms,      Our algorithms,
(Nehalem)              8 threads, 2 sockets (Barcelona)   GTX280
222 seconds            29.79 seconds                      1.8 seconds

130X speed-up; 490 downloads since release in 6/2009
"Efficient, High-Quality Image Contour Detection", Catanzaro, Su, Sundaram, Lee, Murphy, Keutzer, International Conference on Computer Vision, 2009
Speech Recognition Framework
Input: speech audio waveform. Output: recognized word sequences.
Achieved 11x speedup over sequential version; allows 3.5x faster-than-real-time recognition.
Our technique is being deployed at a hotline call-center data analytics company, used to search content, track service quality, and provide early detection of service issues.
Scalable HMM-based Inference Engine in Large Vocabulary Continuous Speech Recognition, Kisun You, Jike Chong, Youngmin Yi, Ekaterina Gonina, Christopher Hughes, Wonyong Sung and Kurt Keutzer, IEEE Signal Processing Magazine, March 2010
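The core of HMM-based recognition is finding the most likely hidden state sequence; a minimal Viterbi decoder in log space (a textbook sketch on a made-up two-state HMM, not the PALLAS inference engine) looks like this:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi, obs):
    """Most-likely state path for an HMM, computed in log space.
    log_A: (S,S) transition log-probs; log_B: (S,V) emission log-probs;
    log_pi: (S,) initial log-probs; obs: list of observed symbol indices."""
    S = log_pi.shape[0]
    T = len(obs)
    delta = log_pi + log_B[:, obs[0]]       # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A     # scores[i, j]: prev state i -> state j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_B[:, obs[t]]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):           # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example with invented parameters: state 0 tends to emit symbol 0,
# state 1 tends to emit symbol 1.
log_A = np.log(np.array([[0.6, 0.4], [0.2, 0.8]]))
log_B = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))
log_pi = np.log(np.array([0.7, 0.3]))
print(viterbi(log_A, log_B, log_pi, [0, 1, 1]))  # [0, 1, 1]
```

The inner `delta[:, None] + log_A` step is the part the cited paper parallelizes at large scale: each time step is a dense max-plus product over tens of thousands of states.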
Computational Finance
Value-at-Risk computation with the Monte Carlo method:
Summarizes a portfolio's vulnerabilities to market movements
Important to algorithmic trading, derivative usage, and highly leveraged hedge funds
Improved implementation to run 60x faster on a parallel microprocessor
Four steps of the Monte Carlo method in finance:
1. Uniform random number generation
2. Market parameter transformation
3. Instrument pricing
4. Data assimilation
Matthew Dixon, Jike Chong, Kurt Keutzer, “Acceleration of Market Value-at-Risk Estimation”, Workshop on High Performance Computing in Finance at Super Computing 2009, November 15, 2009.
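The four steps can be sketched end to end. The portfolio size, drift, and volatility below are invented illustration numbers (not from the talk or the cited paper), and the "pricing" step is deliberately trivial:

```python
import numpy as np

rng = np.random.default_rng(0)
n_scenarios = 100_000
portfolio_value = 1_000_000.0   # hypothetical $1M linear portfolio
mu, sigma = 0.0005, 0.02        # assumed one-day return drift and volatility

# 1. Uniform random number generation
u1 = rng.uniform(size=n_scenarios)
u2 = rng.uniform(size=n_scenarios)

# 2. Market parameter transformation: Box-Muller turns uniforms into
#    standard normals, then scale them into one-day market returns
z = np.sqrt(-2.0 * np.log(1.0 - u1)) * np.cos(2.0 * np.pi * u2)
returns = mu + sigma * z

# 3. Instrument pricing: reprice the portfolio under each scenario
pnl = portfolio_value * returns

# 4. Data assimilation: aggregate scenario P&L into the risk estimate
var_99 = -np.quantile(pnl, 0.01)   # loss exceeded in only 1% of scenarios
print(f"99% one-day VaR: ${var_99:,.0f}")
```

Each step is embarrassingly parallel across scenarios, which is why this workload maps so well onto a manycore processor.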
Applications
Our formula in a bit of detail: MRI reconstruction
Other applications where our formula has proven itself
Applications in progress
Poselet Human Detection
Can locate humans in images
20x speedup through algorithmic improvements and parallel implementation
Work can be extended to pose estimation for controller-free video game interfaces using ordinary web cameras
Option Pricing Application
Price an option – a tradable financial security whose value depends on the value of an underlying asset and market parameters.
Black-Scholes equation: [equation shown on slide]
Speedup:
6x pricing 1 option
25x pricing 128 options on Larrabee
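For reference, the closed-form Black-Scholes price of a European call (standard textbook formula; the example parameters are a common textbook case, not figures from the talk):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call: spot S, strike K,
    time to expiry T (years), risk-free rate r, volatility sigma."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# At-the-money call: 1 year, 5% rate, 20% volatility
print(round(black_scholes_call(100, 100, 1.0, 0.05, 0.2), 2))  # 10.45
```

Pricing many independent options like this is data-parallel across options, which is what the 128-option speedup on Larrabee exploits.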
Optical Flow
Optical flow involves computing the motion vectors ("flow field") between consecutive frames of a video. It involves solving a non-linear optimization problem.
Speedup: 32x on the linear solver; 7x overall
[Figure: serial vs. parallel runtime breakdown – linear solver, matrix creation, interpolation & warping, downsampling, filter, memcopy CPU-GPU, other]
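The gap between the 32x solver speedup and the 7x overall speedup is Amdahl's law. A back-of-envelope check (the 88.5% fraction is inferred to make the slide's numbers consistent, not stated in the talk):

```python
def overall_speedup(f, s):
    """Amdahl's law: fraction f of the serial runtime is sped up by
    factor s; the remaining (1 - f) runs at the original speed."""
    return 1.0 / ((1.0 - f) + f / s)

# If the linear solver were ~88.5% of serial runtime and sped up 32x,
# the whole application would speed up about 7x:
print(round(overall_speedup(0.885, 32), 1))  # 7.0
```

This is why the figure's "other" categories (matrix creation, warping, memcopy) matter: once the solver is fast, they dominate.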
2009 ITRS – Functions/chip and Chip Size
[Figure 9b (Logic), 2009 ITRS roadmap analysis including the '11 snapshot: cost-performance and high-performance MPU functions per chip at production (Mtransistors) and chip sizes at production (<260 mm2, <140 mm2), years of production 1995–2024. MPU functions grow at 2x/3yrs to 2x/2yrs; average "Moore's Law" = 2x/2yrs. 2011: "22nm" (38nm M1) MPU model generation. Chart marked "Work in Progress – Do Not Publish!"]
Parallelism Delivers Moore's Law (and More) in the Future
[Figure: transistors, serial performance, and parallel performance vs. year, 1970-2020]
Tip of the Iceberg
Today we're only exploiting the smallest tip of the iceberg of computation that will be available in the future.
Limits of our formula
Domain experts + state-of-the-art algorithms + tall-skinny parallel programmers + architecting software with patterns + parallel hardware => game-changing speed-ups (with scalability) => compelling applications
Our research formula
Tall-skinny programmers build application frameworks; domain experts use the frameworks; frameworks + parallel hardware => game-changing speed-ups => compelling applications
How do we deliver all those new applications to the user?
I've talked about how to translate Moore's Law into useful applications via microprocessors. Our next speakers will talk about how to deliver those applications to you.
The Future of the Cloud – Michael Franklin
The Future of Mobile – Eric Brewer
Backup
SW architecture of reconstruction
Pipe and Filter. Iterative POCS algorithm:
1. Apply SPIRiT operator – Fork-Join; linear algebra; data parallelism / convolutions; data parallelism / Fourier transforms
2. Wavelet soft-thresholding – data parallelism / wavelet transforms
3. Fourier-space projection – data parallelism / Fourier transforms
Outer loop: iterative refinement / spectral method
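The soft-thresholding in step 2 is an elementwise operation (the proximal operator of the L1 norm). A minimal sketch on a generic coefficient array (illustrative values, not actual wavelet data):

```python
import numpy as np

def soft_threshold(x, lam):
    """Soft-thresholding, applied elementwise to (possibly complex)
    coefficients: shrink magnitudes by lam, zeroing anything smaller."""
    mag = np.abs(x)
    scale = np.maximum(mag - lam, 0.0) / np.maximum(mag, 1e-12)
    return x * scale

coeffs = np.array([-3.0, -0.5, 0.0, 0.2, 2.5])
# Small coefficients go to zero; large ones shrink toward zero by lam.
print(soft_threshold(coeffs, 1.0))
```

Because every coefficient is processed independently, this step maps directly onto the data-parallel hardware discussed earlier.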