TRANSCRIPT

The Landscape of Parallel Computing Research: A View from Berkeley 2.0
Krste Asanovic, Ras Bodik, Jim Demmel, John Kubiatowicz, Kurt Keutzer, Edward Lee, George Necula, Dave Patterson, Koushik Sen, John Shalf, John Wawrzynek, and Kathy Yelick
June 2007
Parallel Revolution, Ready or Not
Power Wall + Memory Wall + ILP Wall = Brick Wall
End of uniprocessors
New "Moore's Law" is 2X "cores" per chip every technology generation, at the same clock rate
"not a breakthrough… but a retreat from even greater challenges that thwart efficient silicon implementation of traditional uniprocessor architectures"
Sea change for HW & SW industries, since it changes the programming model and responsibilities
HW/SW industries bet their future that parallelism will succeed despite a track record of failures
Dead Computer Society: Illiac IV, Encore, Sequent, Transputer, Thinking Machines, MasPar, NCUBE, Kendall Square Research, Convex, (SGI) …
Need a Fresh Approach
Berkeley researchers from many backgrounds meeting since Feb. 2005 to discuss parallelism: Krste Asanovic, Ras Bodik, Jim Demmel, John Kubiatowicz, Kurt Keutzer, Edward Lee, George Necula, Dave Patterson, Koushik Sen, John Shalf, John Wawrzynek, Kathy Yelick, …
Circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis
Tried to learn from successes in parallel embedded (BWRC) and high performance computing (LBNL)
"Berkeley View" Tech. Report: see view.eecs.berkeley.edu
This talk is the next step in these ideas: Berkeley View 2.0
A. The computing platforms of the future
B. Novel approach to design of future systems
C. 7 questions to frame parallel research
Bell's Evolution of Computer Classes
Technology enables two paths:
1. increasing performance, same cost (and form factor)
2. constant performance, decreasing cost (and form factor)
[Figure: log price vs. time; computer classes (Mainframe, Mini, WS, PC, Laptop, Handset, Ubiquitous) each start high and decline, with new, cheaper classes emerging below the old ones through the 2010's.]
A: Re-inventing Client/Server
Laptop/Handheld as future client, Datacenter as future server
"The Datacenter is the Computer"
Building-sized computers: Google, MS, …
"The Laptop/Handheld is the Computer"
'08: Dell number of laptops > desktops? '06 profit?
1B cell phones/yr, increasing in function
Apple iPhone
Otellini demoed the "Universal Communicator": a combination cell phone, PC, and video device
Laptop/Handheld Reality
Use with large displays at home or office
What % of time disconnected? 10%? 30%? 50%?
Disconnectedness due to economics
Cell towers and support systems are expensive to maintain, so carriers charge for use to recover costs; it costs to communicate
Policy varies, but most countries allow wireless investment where it makes the most money: cities are well covered; rural areas will never be covered
Disconnectedness due to technology
No coverage deep inside buildings
Satellite communication uses up batteries
Need computation & storage in the L/H
B: Try Application-Driven Research?
Conventional Wisdom in CS Research
Users don't know what they want
Computer scientists solve individual parallel problems with a clever language feature (e.g., dataflow), a new compiler pass, or a novel hardware widget (e.g., SIMD)
Approach: Push (foist?) CS nuggets/solutions on users
Problem: Stupid users don't learn/use the proper solution
Another Approach
Work with domain experts developing compelling apps
Provide HW/SW infrastructure necessary to build, compose, and understand parallel software written in multiple languages
Research guided by commonly recurring patterns actually observed in compelling application codes
C: 7 Questions for Parallelism
Applications:
1. What are the apps?
2. What are kernels of apps?
Hardware:
3. What are the HW building blocks?
4. How to connect them?
Programming Model & Systems Software:
5. How to describe apps and kernels?
6. How to program the HW?
Evaluation:
7. How to measure success?
(Inspired by a view of the Golden Gate Bridge from Berkeley)
Apps and Kernels Tower: What are the problems?
"Who needs 100 cores to run M/S Word?"
Failure of imagination? (CS education?)
Need compelling apps that use 100s of cores
What about parallel benchmarks?
Few examples (e.g., SPLASH, NAS)
Optimized to old models, languages, architectures…
How to invent the parallel systems of the future while tied to old code and the programming models of the past?
Coronary Heart Disease
Modeling to help patient compliance?
• 400k deaths/year, 16M w. symptoms, 72M w. high BP
Massively parallel, real-time variations
• CFD, FE solid (non-linear), fluid (Newtonian), pulsatile
• Blood pressure, activity, habitus, cholesterol
[Figure: before and after simulation images]
Content-Based Image Retrieval
[Figure: query-by-example pipeline over 1000's of images: an image database feeds a similarity metric, producing candidate results; relevance feedback refines these into the final result.]
Built around key characteristics of personal databases:
Very large number of pictures (>5K)
Non-labeled images
Many pictures of few people
Complex pictures including people, events, places, and objects
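To make the retrieval loop concrete, here is a minimal sketch (hypothetical Python, assuming each image has already been summarized as a fixed-length feature vector; plain Euclidean distance stands in for the real similarity metric):

# Query-by-example sketch: rank database images by distance to the query.
# Illustrative only; a real similarity metric plus relevance feedback
# would replace the plain Euclidean distance used here.
import numpy as np

def query_by_example(query_vec, database_vecs, k=10):
    """Return indices of the k database images most similar to the query."""
    dists = np.linalg.norm(database_vecs - query_vec, axis=1)
    return np.argsort(dists)[:k]

db = np.random.rand(5000, 64)             # e.g., 5,000 images, 64-dim features
candidates = query_by_example(db[0], db)  # candidate results for image 0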
Compelling Laptop/Handheld Apps
Health Coach
Since the laptop/handheld is always with you: record images of all meals, weigh the plate before and after, analyze calories consumed so far. "What if I order a pizza for my next meal? A salad?"
Since the laptop/handheld is always with you: record the amount of exercise so far, show how your body would look if you maintained this exercise and diet pattern for the next 3 months. "What would I look like if I regularly ran 2 miles? 4 miles?"
Face Recognizer/Name Whisperer
Laptop/handheld scans faces, matches them against an image database, whispers the name in your ear (relies on Content-Based Image Retrieval)
Compelling Laptop/Handheld Apps
Meeting Diarist
Laptops/handhelds at a meeting coordinate to create a speaker-identified, partially translated text diary of the meeting
Teleconference speaker identifier, speech helper
L/Hs used for a teleconference identify who is speaking and give a "closed caption" hint of what is being said
Compelling Laptop/Handheld Apps
Music Enhancer
Enhanced sound delivery systems for home sound systems using large microphone and speaker arrays
Laptop/Handheld recreates 3D sound over ear buds
Hearing Augmenter
Laptop/Handheld as accelerator for a hearing aid
Novel Instrument User Interface
New composition and performance systems beyond keyboards
New forms of bodily engagement when performing musical material
Input device for Laptop/Handheld
[Photo caption:] Berkeley Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array in a single location with a radiation pattern controlled by an onboard FPGA: a 10-inch-diameter icosahedron incorporating 120 1.25-inch drivers developed by Meyer Sound, each driven from a separate audio channel, plus circuit boards for all of the control and class-D amplification functions. A laptop controls the array via Gigabit Ethernet.
Can we find widely used patterns?
Look for common patterns of communication and computation:
1. Embedded Computing (EEMBC benchmark)
2. Data Base / Text Mining Software (advice from Jim Gray of Microsoft and Joe Hellerstein of UC Berkeley)
3. Desktop/Server Computing (SPEC2006)
4. Games/Graphics/Vision
5. Machine Learning (advice from Mike Jordan and Dan Klein of UC Berkeley)
6. High Performance Computing (original "7 Dwarfs")
Result: 13 "Dwarfs"
Dwarf Popularity (Red = Hot, Blue = Cool)
[Heat-map table: rows are the 13 dwarfs (Finite State Mach., Combinational, Graph Traversal, Structured Grid, Dense Matrix, Sparse Matrix, Spectral (FFT), Dynamic Prog., N-Body, MapReduce, Backtrack/B&B, Graphical Models, Unstructured Grid); columns are Embed, SPEC, DB, Games, ML, HPC; cell colors indicate how hot each dwarf is in each domain and are not recoverable from the transcript.]
13 Dwarfs (so far)
1. Finite State Machine
2. Combinational Logic
3. Graph Traversal
4. Structured Grids
5. Dense Linear Algebra
6. Sparse Linear Algebra
7. Spectral Methods (FFT)
8. Dynamic Programming
9. N-Body Methods
10. MapReduce
11. Back-track / Branch & Bound
12. Graphical Model Inference
13. Unstructured Grids
• Claim: parallel arch., lang., compiler, … must do at least these well to do future parallel apps well
• Note: MapReduce is embarrassingly parallel; perhaps FSM is embarrassingly sequential?
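To make one dwarf concrete, here is a minimal sketch of the Structured Grid pattern (illustrative Python, not from the report): one Jacobi relaxation sweep, where each interior point becomes the average of its four neighbors; this regular, nearest-neighbor communication is the signature shared by all structured-grid apps.

# One Jacobi sweep on a 2D structured grid: every interior point is
# replaced by the average of its four neighbors.
import numpy as np

def jacobi_sweep(grid):
    new = grid.copy()
    new[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                              grid[1:-1, :-2] + grid[1:-1, 2:])
    return new

grid = np.zeros((16, 16)); grid[0, :] = 100.0   # hot top edge
grid = jacobi_sweep(grid)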
How do compelling apps relate to the 13 dwarfs?
Dwarf Popularity (Red = Hot, Blue = Cool)
[Heat-map table: same 13 dwarf rows as before; columns are Embed, SPEC, DB, Games, ML, HPC plus the new app columns Health, Image, Speech, Music; cell colors are not recoverable from the transcript.]
Dwarf Popularity (Red = Hot, Blue = Cool)
[Heat-map table: same 13 dwarf rows; columns are Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music plus Delaunay mesh generation, Gene Sequencing, Kmeans Clustering, and Vacation Reservation System from the Stanford Transactional Apps for Multi-Processing (STAMP) suite (http://stamp.stanford.edu); cell colors are not recoverable from the transcript.]
4 Valuable Roles of Dwarfs
1. "Anti-benchmarks": not tied to code or language artifacts, so they encourage innovation in algorithms, languages, data structures, and/or hardware
2. Universal, understandable vocabulary to talk across disciplinary boundaries
3. Define building blocks for creating libraries & frameworks that cut across app domains
4. They decouple research, allowing analysis of HW & SW engineering and programming without waiting years for full apps
Berkeley View 2.0: Apps Tower
Application-driven research vs. CS solution-driven research
Drill down on 4 app areas to guide the research agenda
Dwarfs represent a broader set of apps to guide the research agenda
Dwarfs help break through traditional interfaces: benchmarking, multidisciplinary conversations, targets for libraries, and parallelizing parallel research
How do we describe apps and kernels?
Observation: Use dwarfs. Dwarfs are of 2 types:
Libraries: dense matrices, sparse matrices, spectral, combinational, finite state machines
Patterns/Frameworks: MapReduce; graph traversal, graphical models; dynamic programming; backtracking/B&B; N-Body; (un)structured grid
Algorithms in the dwarfs can be implemented either as:
• compact parallel computations within a traditional library, or
• a compute/communicate pattern implemented as a framework
• Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a MapReduce framework, mapping 1D FFTs and then transposing (a generalized reduce)
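A minimal sketch of that last point (assuming numpy; illustrative, not the report's code): a 2D FFT assembled from a map of 1D FFTs over rows, a transpose, and a second map. The map steps are exactly what a MapReduce-style framework would farm out to cores.

# 2D FFT built from maps of 1D FFTs plus a transpose.
import numpy as np

def fft2_via_map(a):
    rows = np.array([np.fft.fft(r) for r in a])       # map: 1D FFT per row
    cols = np.array([np.fft.fft(c) for c in rows.T])  # transpose, map again
    return cols.T

a = np.random.rand(8, 8)
assert np.allclose(fft2_via_map(a), np.fft.fft2(a))   # matches library FFT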
Composing dwarfs to build apps
Any parallel application of arbitrary complexity may be built by composing parallel and serial components:
Parallel patterns with serial plug-ins, e.g., MapReduce
Serial code invoking parallel libraries, e.g., FFT, matrix ops, …
Composition is hierarchical
Programming the Hardware
2 types of programmers, 2 layers
Productivity Layer (90% of today's programmers)
Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
Frameworks & libraries are composed to form app frameworks
Efficiency Layer (10% of today's programmers)
Expert programmers build:
Frameworks: software that supports general structural patterns of computation and communication, e.g., MapReduce
Libraries: software that supports compact computational expressions, e.g., Sketch for combinational or grid computation
"Bare metal" efficiency possible at the Efficiency Layer
Effective composition techniques allow the efficiency programmers to be highly leveraged
Content-Based Image Retrieval: Extract Features
[Figure: RGB histogram, bins 0-250]
Summarize the image meaningfully and distinctively:
Global and segment RGB and HSV histograms
Average gray value
Color moments (mean, variance, kurtosis)
DCT coefficients
Edge direction histogram
RGB and HSV correlograms
Wavelet transforms (tree-structured and pyramid)
Scale Invariant Feature Transform histogram
Color Coherence Vector
Geometry from extracted faces
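For instance, the simplest feature in this list, a global RGB histogram, might look like this minimal sketch (hypothetical code, assuming an 8-bit HxWx3 numpy image):

# Global RGB histogram feature: concatenate per-channel histograms
# into one normalized feature vector. Illustrative only.
import numpy as np

def rgb_histogram(img, bins=32):
    feats = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    return np.concatenate(feats) / img[..., 0].size  # normalize by pixel count

img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
print(rgb_histogram(img).shape)  # (96,) = 3 channels x 32 bins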
Coordination & Composition in the CBIR Application
Parallelism in CBIR is hierarchical
Mostly independent tasks/data with combining
[Figure: input is a stream of images (stream parallelism over images); feature extraction is task parallel over the extraction algorithms (DCT extractor, DWT, Face Recog., …); the DCT extractor is data parallel, mapping DCT over tiles, then a combine step does a reduction on the histograms from each tile to output one histogram (feature vector); a final combine concatenates feature vectors; output is a stream of feature vectors.]
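A minimal sketch of that hierarchy (hypothetical Python; dct_feature and dwt_feature are stand-ins for the real extractors): task parallelism over extraction algorithms, data parallelism over tiles with a per-algorithm reduction, then a combine that concatenates feature vectors.

# Hierarchical composition: task parallel over algorithms, data
# parallel over tiles, reduce per algorithm, concatenate to combine.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def dct_feature(tile): return np.array([tile.mean()])  # stand-in extractors
def dwt_feature(tile): return np.array([tile.std()])

def tiles(img, n=4):   # data decomposition into n*n tiles
    return [t for row in np.array_split(img, n)
              for t in np.array_split(row, n, axis=1)]

def extract(img):
    with ThreadPoolExecutor() as pool:                    # task parallel
        per_algo = [sum(pool.map(f, tiles(img)))          # data parallel over
                    for f in (dct_feature, dwt_feature)]  # tiles, then reduce
    return np.concatenate(per_algo)                       # combine: concatenate

stream = (np.random.rand(64, 64) for _ in range(3))       # stream of images
features = [extract(img) for img in stream]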
Coordination & Composition Language
Coordination & Composition language for productivity
2 key challenges:
1. Correctness: ensuring independence using decomposition operators, copying, and requirements specifications on frameworks
2. Efficiency: resource management during composition; domain-specific OS/runtime support
Language control features hide core resources, e.g., "map DCT over tiles" in the language becomes a set of DCTs/tiles per core
Hierarchical parallelism managed using OS mechanisms
Data structures hide memory structures
Partitioners on arrays, graphs, trees produce independent data (see the sketch after this list)
Framework interfaces give independence requirements: e.g., a map-reduce function must be independent, either by copying or by application to a partitioned data object (a set of tiles from a partitioner)
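A minimal sketch of what such a partitioner might look like (hypothetical interface, assuming numpy): it hands out disjoint views of an array, so a framework can treat map tasks over the pieces as independent by construction.

# Partitioner sketch: yield non-overlapping views of an array.
import numpy as np

def partition(a, ntiles):
    """No two tiles share elements, so tasks on them are independent."""
    for tile in np.array_split(a, ntiles):
        yield tile

for t in partition(np.arange(12), 4):
    t *= 2   # safe to run these in parallel: tiles are disjoint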
How do we program the HW? What are the problems?
For parallelism to succeed, must provide productivity, efficiency, and correctness simultaneously for scalable hardware
Can't make SW productivity even worse! Why do it in parallel if efficiency doesn't matter?
Correctness is usually considered an orthogonal problem, yet productivity slows if code is incorrect or inefficient
Most programmers are not ready to produce correct parallel programs; in IBM SP customer escalations, concurrency bugs are the worst and can take months to fix
Ensuring Correctness
• Productivity Layer:
• Enforce independence of tasks using decomposition (partitioning) and copying operators
• Goal: remove concurrency errors (nondeterminism from execution order, not just low-level data races)
• E.g., the race-free program "atomic delete" + "atomic insert" does not compose to an "atomic replace"; need higher-level properties, not solved with transactions (or locks); see the sketch after this list
• Efficiency Layer: check for subtle concurrency bugs (races, deadlocks, and so on)
• Mixture of verification and automated directed testing
• Error detection on frameworks and libraries; some techniques applicable to third-party software
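A minimal sketch of that composition pitfall (hypothetical Python): delete and insert are each atomic under a lock, yet another thread can observe the table between the two calls, so the composition is not an atomic replace.

# "Atomic delete" + "atomic insert" is not an "atomic replace":
# each operation holds the lock, but not the sequence of the two.
import threading

table, lock = {"key": "old"}, threading.Lock()

def atomic_delete(k):
    with lock:
        table.pop(k, None)

def atomic_insert(k, v):
    with lock:
        table[k] = v

def replace(k, v):        # NOT atomic as a whole:
    atomic_delete(k)      # <-- a reader scheduled here sees no "key" at all
    atomic_insert(k, v)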
Support Software: What are the problems?
Compilers and operating systems are large, complex, and resistant to innovation
(The worst possible time to be resistant to innovation!)
Takes a decade for compiler innovations to show up in production compilers?
Time for an idea in SOSP to appear in a production OS?
Traditional OSes are brittle, insecure memory hogs
A traditional monolithic OS image uses lots of precious memory, replicated 100s-1000s of times (e.g., AIX uses GBs of DRAM per CPU)
Deconstructing Operating Systems
Resurgence of interest in virtual machines
Hypervisor: a thin SW layer between the guest OS and HW
Future OS: libraries where only the functions needed are linked into the app, on top of a thin hypervisor providing protection and sharing of resources
Opportunity for OS innovation
Leverage HW partitioning support for very thin hypervisors, and to allow software full access to hardware within its partition
21st Century Code Generation
[Figure: search space for matmul block sizes; axes are block dimensions, color (temperature) is speed.]
Problem: generating optimal code is like searching for a needle in a haystack
New approach: "auto-tuners" first run variations of the program on the computer to heuristically search for the best combination of optimizations (blocking, padding, …) and data structures, then produce C code to be compiled for that computer; see the sketch below
E.g., PHiPAC (BLAS), Atlas (BLAS), Spiral (DSP), FFTW
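A toy auto-tuner in that spirit (illustrative Python, not PHiPAC or Atlas): time several blocked matrix-multiply variants on the machine at hand and keep the best block size; a real system would then emit specialized C for that choice.

# Tiny autotuner: search block sizes for a blocked matmul empirically.
import time
import numpy as np

def blocked_matmul(A, B, bs):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

A = B = np.random.rand(256, 256)

def timed(bs):
    t = time.perf_counter()
    blocked_matmul(A, B, bs)
    return time.perf_counter() - t

best = min([16, 32, 64, 128], key=timed)   # keep the empirically fastest
print("best block size on this machine:", best)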
Example: Sparse Matrix-Vector multiply (SpMV) for 3 multicores
Fastest SpMV: 2X OSKI/PETSc on Intel Clovertown, 4X on AMD Opteron
Optimization space: register blocking, cache blocking, TLB blocking, prefetching/DMA options, NUMA, BCOO vs. BCSR data structures, 16b vs. 32b indices, …
Example: Sparse Matrix * Vector

Name             Clovertown              Opteron                 Cell
Chips*Cores      2*4 = 8                 2*2 = 4                 1*8 = 8
Architecture     4-issue, 2-SSE3, OOO,   3-issue, 1-SSE3, OOO,   2-VLIW, SIMD,
                 caches, prefetch        caches, prefetch        local store, DMA
Clock Rate       2.3 GHz                 2.2 GHz                 3.2 GHz
Peak MemBW       21.3 GB/s               21.3 GB/s               25.6 GB/s
Peak GFLOPS      74.6 GF                 17.6 GF                 14.6 GF (DP Fl. Pt.)
Naïve SpMV       1.0 GF                  0.6 GF                  --
(median of many matrices)
Efficiency %     1%                      3%                      --
Autotuned SpMV   1.5 GF                  1.9 GF                  3.4 GF
Auto Speedup     1.5X                    3.2X                    ∞
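For reference, a minimal sketch of the kernel being tuned above (illustrative Python, not OSKI): sparse matrix-vector multiply in CSR form; autotuners pick register/cache block sizes and index widths around this loop.

# SpMV over a CSR matrix: y = A*x.
import numpy as np

def spmv_csr(vals, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):                       # one row at a time
        for jj in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[jj] * x[col_idx[jj]]
    return y

# 2x2 example: [[1, 0], [2, 3]] times [1, 1] -> [1, 5]
print(spmv_csr([1.0, 2.0, 3.0], [0, 0, 1], [0, 1, 3], np.ones(2)))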
Software Span Summary
Parallelizing compiler + multicore + multilevel caches + SW and HW prefetching, vs. autotuner + multicore + local store + DMA
SpMV: easier to autotune a single local store + DMA than multilevel caches + HW and SW prefetching
• Greater productivity and efficiency for SpMV?
• Deconstructed OS relying on thin hypervisors
• Productivity Layer & Efficiency Layer
• C&C Language to compose Libraries/Frameworks
Hardware Tower: What are the problems?
Multicore (2, 4, …) vs. Manycore (64, 128, …)?
How can novel architectural support improve productivity, efficiency, and correctness for scalable hardware?
"Efficiency" instead of "performance", to capture energy as well as performance
Also: power, design and verification costs, low yield, higher error rates
HW Solution: Small is Beautiful
Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector and SIMD PEs; small cores are not much slower than large cores
Parallelism is the energy-efficient path to performance (dynamic power ∝ CV²F): lower threshold and supply voltages lower the energy per op; see the worked example after this list
Redundant processors can improve chip yield: Cisco Metro has 188 CPUs + 4 spares; Sun Niagara sells 6- or 8-CPU versions
Small, regular processing elements are easier to verify
One size fits all?
Amdahl's Law suggests heterogeneous processors
Special function units to accelerate popular functions
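A worked version of the CV²F argument; an idealized sketch assuming the supply voltage can scale down linearly with frequency (threshold voltages limit this in practice):

% dynamic power of CMOS switching logic
P_{\mathrm{dyn}} \approx C V^2 f
% two cores at half voltage and half frequency, same aggregate throughput
P_{\text{2 cores}} = 2 \cdot C \left(\frac{V}{2}\right)^2 \frac{f}{2} = \frac{1}{4}\, C V^2 f

Under these idealized assumptions, two half-speed cores deliver the same aggregate throughput at one quarter of the dynamic power, which is why many small cores are the energy-efficient path.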
HW features supporting parallel SW
Want composable primitives, not packaged solutions; Transactional Memory is usually a packaged solution
Partitions
Fast barrier synchronization & atomic fetch-and-op (see the sketch after this list)
Active messages plus user-level event handling; used by parallel language runtimes to provide fast communication, synchronization, thread scheduling
Configurable memory hierarchy (Cell vs. Clovertown): can configure on-chip memory as cache or local store
Programmable DMA to move data without occupying the CPU
Cache coherence: mostly HW, but SW handlers for complex cases
Hardware logging of memory writes to allow rollback
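A sketch of how a runtime might use those two primitives, emulated in Python with a lock and threading.Barrier (hardware versions are fast enough for inner loops): fetch-and-add hands out work items, and the barrier separates phases.

# Fetch-and-op hands out work; a barrier ends the phase.
import threading

counter = [0]
lock = threading.Lock()
barrier = threading.Barrier(4)

def fetch_and_add(delta=1):
    with lock:                     # HW would do this in one atomic op
        old = counter[0]
        counter[0] += delta
        return old

def worker():
    while (item := fetch_and_add()) < 100:
        pass                       # ... process work item `item` ...
    barrier.wait()                 # all 4 threads meet before the next phase

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()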
Partitions
[Figure: InfiniCore chip with a 16x16 tile array]
Partition: a hardware-isolated group
• Chip divided into hardware-isolated partitions, under control of supervisor software
• User-level software has almost complete control of hardware inside its partition
• Power-of-2 size, naturally aligned
Multiple Fast Barrier Networks
[Figure: tile array with distributed barrier logic; taps shown for 4-way and 8-way partitions]
Multiple parallel barrier networks (1 bit each) at each level
Signals propagate combinationally, with a varying clock divisor (optimized for latency, not bandwidth)
Hypervisor sets taps saying where each partition sees its barrier
Measuring Success: What are the problems?
1. Only companies can build HW, and it takes years
2. Software people don't start working hard until the hardware arrives
• 3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW
3. How to quickly get 100-1000 CPU systems into the hands of researchers to let them innovate in algorithms, compilers, languages, OS, architectures, … ASAP?
4. Can we avoid waiting 4 years + 3 months between HW/SW iterations?
Build Academic Manycore from FPGAs
As ~10 CPUs will fit in a Field Programmable Gate Array (FPGA), a 1000-CPU system from ~100 FPGAs?
• 8 32-bit simple "soft core" RISC processors at 100 MHz in 2004 (Virtex-II)
• FPGA generations every 1.5 yrs: 2X CPUs, 1.2X clock rate (see the projection below)
HW research community does the logic design ("gate shareware") to create an out-of-the-box manycore, e.g., a 1000-processor, standard-ISA binary-compatible, 64-bit, cache-coherent supercomputer @ 150 MHz/CPU in 2007
Ideal for heterogeneous chip architectures
RAMPants: 10 faculty at Berkeley, CMU, MIT, Stanford, Texas, and Washington
"Research Accelerator for Multiple Processors": a vehicle to lure more researchers to the parallel challenge and decrease the time to parallel salvation
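A quick back-of-envelope script for those rules of thumb (hypothetical projection only, starting from the slide's 8 soft cores @ 100 MHz per FPGA in 2004):

# Project soft cores per FPGA: each 1.5-year FPGA generation
# brings 2X CPUs and 1.2X clock rate.
cores, clock, year = 8, 100.0, 2004.0
while year < 2010:
    year += 1.5
    cores, clock = cores * 2, clock * 1.2
    print(f"{year:.1f}: {cores} cores/FPGA @ {clock:.0f} MHz")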
768-CPU "RAMP Blue" (Wawrzynek, Krasnov, … at Berkeley)
768 = 12 32-bit RISC cores/FPGA, 4 FPGAs/board, 16 boards, $10k/board; simple MicroBlaze soft cores @ 90 MHz
Full star connection between modules
1008-node RAMP Blue soon (21 boards)
NASA Advanced Supercomputing (NAS) Parallel Benchmarks (all class S), UPC versions (C plus a shared-memory abstraction): CG, EP, IS, MG; DEMO?
RAMPants creating HW & SW for the manycore community using next-gen FPGAs; Chuck Thacker & Microsoft designing the next boards
3rd party to manufacture and sell boards: 1H08
Universities helped start 19 $1B+ industries
Source: Innovation in Information Technology, National Research Council Press, 2003.
$1B+ industries: 1 Timesharing, 2 Client/server, 3 Graphics, 4 Entertainment, 5 Internet, 6 LANs, 7 Workstations, 8 GUI, 9 VLSI design, 10 RISC processors, 11 Relational DB, 12 Parallel DB, 13 Data mining, 14 Parallel computing, 15 RAID disk arrays, 16 Portable comm., 17 World Wide Web, 18 Speech recognition, 19 Broadband last mile
[Table: which university contributed to which industry is not recoverable from the transcript; column totals are Cal 7, Caltech 2, CERN 2, CMU 3, Illinois 2, MIT 5, Purdue 1, Rochester 1, Stanford 4, Tokyo 1, UCLA 2, Utah 1, Wisconsin 2.]
Change directions of research funding for parallelism?
[Matrix: rows are Application, Language, Compiler, Libraries, Networks, Architecture, Hardware, CAD; columns are Cal, UIUC, MIT, Stanford, …]
Historically: get the leading experts per discipline (across the US) working together on parallelism
Change directions of research funding for parallelism?
[Same matrix: rows are Application, Language, Compiler, Libraries, Networks, Architecture, Hardware, CAD; columns are Cal, UIUC, MIT, Stanford, …]
To increase cross-disciplinary bandwidth: get the experts at each site collaborating on (app-driven) parallelism
Summary: A Berkeley View 2.0
Our industry has bet its future on parallelism (!)
Goal: productive, efficient, correct parallel programs while doubling the number of cores every 2 years (!)
Try apps-driven vs. CS solution-driven research; laptops/handhelds and datacenters as the modern client/server
13 dwarfs as lingua franca, anti-benchmarks
Composition is critical to parallel computing success; composition is an open problem for Transactional Memory
Productivity layer for ≈90% of today's programmers: use the C&C language to reuse experts' code
Efficiency layer for ≈10% of today's programmers: create libraries, frameworks, … for use in the productivity layer
Autotuners over parallelizing compilers
OS & HW: composable primitives over CS solutions
Acknowledgments
Berkeley View material comes from discussions with:
Profs. Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, John Wawrzynek, Kathy Yelick
UCB grad students Bryan Catanzaro, Jike Chong, Joe Gebis, William Plishker, and Sam Williams
LBNL: Parry Husbands, Bill Kramer, Lenny Oliker, John Shalf
See view.eecs.berkeley.edu
RAMP based on the work of the RAMP developers:
Krste Asanovic (MIT/Berkeley), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley, Co-PI), and John Wawrzynek (Berkeley, PI)
See ramp.eecs.berkeley.edu