
Ubiquitous Parallel Computing from Berkeley, Illinois, and Stanford

The ParLab at Berkeley, UPCRC-Illinois, and the Pervasive Parallel Laboratory at Stanford are studying how to make parallel programming succeed given industry's recent shift to multicore computing. All three centers assume that future microprocessors will have hundreds of cores and are working on applications, programming environments, and architectures that will meet this challenge. This article briefly surveys the similarities and differences in their research.

Bryan Catanzaro, Armando Fox, Kurt Keutzer, David Patterson, and Bor-Yiing Su, University of California, Berkeley
Marc Snir, University of Illinois at Urbana-Champaign
Kunle Olukotun, Pat Hanrahan, and Hassan Chafi, Stanford University

0272-1732/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

For more than three decades, the microelectronics industry has followed the trajectory set by Moore's Law. The microprocessor industry has leveraged this evolution to increase uniprocessor performance by decreasing cycle time and increasing the average number of executed instructions per cycle (IPC).

This evolution stopped a few years ago, however. Power constraints are resulting in stagnant clock rates, and new microarchitecture designs yield limited IPC improvements. Instead, the industry uses continued increases in transistor counts to populate chips with an increasing number of cores—multiple independent processors. This change has profound implications for the IT industry. In the past, each generation of hardware brought increased performance on existing applications, with no code rewrite, and enabled new, performance-hungry applications. This is still true but only for applications written to run in parallel and to scale to an increasing number of cores.

This is less of a problem for server applications, which can leverage parallelism by serving a larger number of clients or by consolidating server functions onto fewer systems. Parallelism for client and mobile applications is harder because they are turnaround oriented, and the use of parallelism to reduce response time requires more algorithmic work. Furthermore, mobile systems are power constrained, and improved wireless connectivity enables shifting computations to the server (or the cloud).

Given these conditions, can we identify applications that will run on clients and will require significant added compute power? Can we develop a programming environment that allows a large programmer community to develop parallel codes for such applications? In addition, can multicore architectures and their software scale to the hundreds of cores that hardware will be able to support in a decade? We believe we can answer such questions positively, but a timely solution will require a significant acceleration of the transfer of research ideas into practice.

Research on such topics is ongoing at the Parallel Computing Laboratory (ParLab) at the University of California, Berkeley, the Universal Parallel Computing Research Center at the University of Illinois, Urbana-Champaign (UPCRC-Illinois), and Stanford University's Pervasive Parallelism Laboratory (PPL). All three centers are parallelizing specific applications, rather than developing technology for undefined future applications as is traditional in research. All focus primarily on client computing, designing technology that can work well up through hundreds of cores. All three also reject a single-solution approach, assuming instead that software is developed by teams of programmers with different specialties requiring different sets of tools. Even though the work at the three centers bears many similarities, they focus on different programming environments, approaches for the production of parallel code, and architectural supports.

ParLab at Berkeley

Recognizing the difficulty of the multicore challenge, Intel and Microsoft invited 25 universities to propose parallel computing centers. They selected the Berkeley ParLab in 2008. Since then, another six companies have joined Intel and Microsoft as affiliates: National Instruments, NEC, Nokia, NVIDIA, Samsung, and Sun Microsystems. Embracing open source software and welcoming all collaborators, the ParLab is a team of 50 PhD students and a dozen faculty leaders from many fields who work together toward making parallel computing productive, performant, energy-efficient, scalable, portable, and at least as correct as sequential programs.1

Patterns and frameworks

Many presume the challenge is for today's programmers to become efficient parallel programmers with only modest training. Our goal is to enable the productive development of efficient parallel programs by tomorrow's programmers. We believe future programmers will be either domain experts or sophisticated computer scientists because few domain experts have the time to develop performance programming skills and few computer scientists have the time to develop domain expertise. The latter group will create frameworks and software stacks that enable domain experts to develop applications without understanding details of the underlying platforms.

We believe that software architecture is the key to designing parallel programs, and the key to these programs' efficient implementation is frameworks. In our approach, the basis of both is design patterns and a pattern language. Borrowed from civil architecture, a design pattern is a solution to a recurring design problem that domain experts learn. A pattern language is an organized way of navigating through a collection of design patterns to produce a design. Our pattern language consists of a series of computational patterns drawn largely from 13 motifs.1,2 We see these as the fundamental software building blocks that are composed using a series of structural patterns drawn from common software architectural styles.

A software architecture is then the hierarchical composition of computational and structural patterns, which we refine using lower-level design patterns. This software architecture and its refinement, although useful, are entirely conceptual. To implement the software, we rely on frameworks. We define a pattern-oriented software framework as an environment built around a software architecture in which customization is only allowed in harmony with the framework's architecture. For example, if based on the pipe-and-filter style, customization involves only modifying pipes or filters.
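To make the framework idea concrete, here is a minimal Python sketch of a pipe-and-filter framework in which the only permitted customization is supplying filters; the composition structure itself belongs to the framework. This is our own illustration of the concept, not ParLab code.

from typing import Callable, Iterable, Iterator

class PipeAndFilter:
    """A toy pipe-and-filter framework: users supply filters, not structure."""

    def __init__(self):
        self._filters = []

    def add_filter(self, f: Callable):
        # Customization is restricted to adding filters; the pipeline
        # topology (a linear chain of pipes) is owned by the framework.
        self._filters.append(f)
        return self

    def run(self, items: Iterable) -> Iterator:
        stream = iter(items)
        for f in self._filters:
            stream = map(f, stream)   # each "pipe" is a lazy map
        return stream

# A domain expert composes filters without writing any parallel code;
# the framework is free to parallelize the stages internally.
pipeline = PipeAndFilter().add_filter(str.strip).add_filter(str.lower)
print(list(pipeline.run(["  Foo ", " BAR"])))   # ['foo', 'bar']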

We envision a two-layer software stack. In the productivity layer, domain experts principally develop applications using application frameworks. In the efficiency layer, computer scientists develop these high-level application frameworks as well as other supporting software frameworks in the stack. Application frameworks have two advantages. First, the application programmer works within a familiar environment using concepts drawn from the application domain. Second, we prevent expression of parallel programming's many annoying problems, such as nondeterminism, races, deadlock, starvation, and so on. To reduce such problems inside these frameworks, we use dynamic testing to execute problematic schedules.

SEJITS

Framework writers are more productive when they write in high-level productivity-layer languages (PLLs) such as Python or Ruby with abstractions that match the application domain; studies have reported factors of three to 10 fewer lines of code and three to five times faster development when using PLLs rather than efficiency-level languages (ELLs) such as C++ or Java. However, PLL performance might be orders of magnitude worse than ELL code, in part due to interpreted execution.

We're developing a new approach to bridge the gap between PLLs and ELLs. Modern scripting languages such as Python and Ruby include facilities to allow late binding of an ELL module to execute a PLL function. In particular, introspection lets a function inspect its own abstract syntax tree (AST) when first called to determine whether the AST can be transformed into one that matches a computation performed by some ELL module. If it can, PLL metaprogramming support then specializes the function at runtime by generating, compiling, and linking the ELL code to the running PLL interpreter. Indeed, the ELL code generation might include syntax-directed translation of the AST into the ELL. If the AST cannot be matched to an existing ELL module or the module targets the wrong hardware, the PLL interpreter just continues executing as usual. This approach preserves portability because it doesn't modify the source PLL program.

Although just-in-time (JIT) code generation and specialization is well established with Java and Microsoft .NET, our approach selectively specializes only those functions that have a matching ELL code generator, rather than having to generate code dynamically for the entire PLL. Also, the introspection and metaprogramming facilities let us embed the specialization machinery in the PLL directly rather than having to modify the PLL interpreter. Hence, we call our approach selective, embedded, just-in-time specialization (SEJITS).3

SEJITS helps domain experts use the work of efficiency programmers. Efficiency programmers can "drop" new modules specialized for particular computations into the SEJITS framework, which makes runtime decisions about when to use them. By separating the concerns of ELL and PLL programmers, SEJITS lets them concentrate on their respective specialties.
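To make the mechanism concrete, the following Python sketch mimics the SEJITS idea: on first call, a decorator inspects a function's AST, and if the AST matches a pattern for which an efficiency-layer implementation has been registered, calls dispatch to that implementation; otherwise the ordinary interpreted definition runs. The registry, the pattern matcher, and all names here are our own simplifications, not the ParLab implementation.

import ast
import inspect

# Hypothetical registry mapping recognized AST shapes to specialized
# implementations supplied by efficiency-layer (ELL) programmers.
_SPECIALIZERS = {}

def register_specializer(pattern_name, impl):
    _SPECIALIZERS[pattern_name] = impl

def _match_pattern(tree):
    # Toy analysis: treat a function whose body is a single returned
    # list comprehension as an "elementwise map" pattern.
    body = tree.body[0].body
    if (len(body) == 1 and isinstance(body[0], ast.Return)
            and isinstance(body[0].value, ast.ListComp)):
        return "elementwise_map"
    return None

def specialize(func):
    """Inspect the function's AST once; dispatch to an ELL module if one
    matches, otherwise fall back to the ordinary Python definition."""
    tree = ast.parse(inspect.getsource(func))
    impl = _SPECIALIZERS.get(_match_pattern(tree))

    def wrapper(*args, **kwargs):
        if impl is not None:
            return impl(func, *args, **kwargs)   # specialized path
        return func(*args, **kwargs)             # interpreter fallback
    return wrapper

# An efficiency programmer "drops in" a module for the map pattern; this
# stand-in just applies the function elementwise, where a real module
# would execute generated, compiled parallel code.
register_specializer("elementwise_map",
                     lambda f, xs: [f([x])[0] for x in xs])

@specialize
def double_all(xs):
    return [x * 2 for x in xs]

print(double_all([1, 2, 3]))   # [2, 4, 6], via the specialized path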

An example is Copperhead, which is a data-parallel Python dialect. Every Copperhead program is valid Python, executable by the Python interpreter. However, when it consists of data-parallel operations taken from a Python library, the Copperhead runtime specializes the resulting program into efficient parallel code. Preliminary experiments show encouraging efficiency, while enabling parallel programming at a much higher level of abstraction. Our goal is to specialize an entire image-processing computation, such as in Figure 1.
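To give a flavor of what such code looks like, here is a small data-parallel kernel written in the comprehension style that Copperhead accepts; the decorator below is a placeholder we wrote for illustration, standing in for Copperhead's specialization entry point, so the snippet runs as plain Python even without the Copperhead runtime installed.

def cu(f):
    # Placeholder for Copperhead's specialization entry point. In Copperhead,
    # a decorated function built from data-parallel operations is specialized
    # into efficient parallel code; otherwise it runs in the Python interpreter.
    return f

@cu
def saxpy(a, x, y):
    # A pure, elementwise comprehension: valid Python, and also the kind of
    # construct a data-parallel specializer can map onto a GPU or multicore.
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy(2.0, [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))   # [2.5, 4.5, 6.5]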

Platforms and applications

We believe future applications will have footholds in both mobile clients and cloud computing. We're investigating parallel browsers because many client applications are downloaded and run inside a browser.4

We note that both client and cloud are concerned with responsiveness and power efficiency. The application client challenge is responsiveness while preserving battery life given various platforms. The application cloud challenge is responsiveness and throughput while minimizing cost in a pay-as-you-go environment.5 In addition, cloud computing providers want to lower costs, which are primarily power and cooling.

To illustrate the client/cloud split, suppose a domain expert wants to create an application that suggests the name of a person walking toward you. Once it has identified this person, the client device whispers the information to you. Responsiveness is key, as seconds separate timely from irrelevant. Depending on wireless connectivity and battery state, a "name whisperer" could send photos to the cloud to do the search, or the client could search locally from a cache of people you've met.


Object recognition systems

Clearly, an object recognition system will be a major component of such an application. Figure 1 shows the computational flow of a local object recognizer,6 which gets good results on vision benchmarks. Although high quality, it is computationally intensive; it takes 5.5 minutes to identify five kinds of objects in a 0.15-Mbyte image, which is far too slow for our application.

The main goal of an application framework is to help application developers design their applications efficiently. For computer vision, computations include image contour detection, texture extraction, image segmentation, feature extraction, classification, clustering, dimensionality reduction, and so on. Many algorithms have been proposed for these computations, each with different trade-offs. Each method corresponds to an application pattern. The application framework integrates these patterns. As a result, the application developers can compose their applications by arranging the computations together and let the framework figure out which algorithm to use, how to set parameters for the algorithm, and how to communicate between different computations.

Figure 1 shows the object recognizer's four main computations. We can try out application patterns for different contour detectors, image segmentors, feature extractors, and trainers/classifiers and find the most accurate composition. An alternative is to only specify the computation's composition and let the application framework choose the proper application pattern to realize the computation.

We choose application patterns manually. The contour detector, for example, accounts for 75 percent of the execution time. Among all contour detectors, our version of the gPb algorithm achieves the highest accuracy (see Figure 1). The major computational bottlenecks are the localcue computation and the generalized eigensolver. For the localcue computation, we replaced the explicit histogram accumulation with an integral image and applied parallel scan to realize the integral image. For the generalized eigensolver, we proposed a highly parallel SpMV kernel and investigated appropriate reorthogonalization approaches for the Lanczos algorithm. By selecting good patterns to form a proper software architecture that reveals parallelism, and then exploring appropriate algorithmic approaches and parallel implementations within that software architecture, we accelerated contour execution time by 140 times, from 4.2 minutes to 1.8 seconds on a GPU.7

Figure 1. Computation flow of object recognition using regions. The major computational bottlenecks are the localcue computation and the generalized eigensolver.
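The integral-image substitution mentioned above is easy to sketch: an integral image (summed-area table) is two nested prefix sums, and a prefix sum is exactly the scan primitive that parallelizes well on GPUs. The NumPy sketch below is our own illustration of the idea, not the ParLab kernel.

import numpy as np

def integral_image(img):
    """Summed-area table: ii[y, x] = img[:y+1, :x+1].sum().

    Each cumulative sum is a prefix sum (scan); on a GPU both passes map
    onto parallel scan kernels, which is how explicit histogram
    accumulation can be replaced in the localcue computation.
    """
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1+1, x0:x1+1] in O(1) using the integral image."""
    total = ii[y1, x1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

img = np.arange(16, dtype=np.float64).reshape(4, 4)
ii = integral_image(img)
assert box_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()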

Performance measurement and autotuning

The prevailing hardware trend of dynamically improving performance with little software visibility has become counterproductive; software must adapt if parallel programs are going to be portable, fast, and energy efficient. Hence, parallel programs must be able to understand and measure any computer so that they can adapt effectively. This perspective suggests architectures with transparent performance and energy consumption and Standard Hardware Operation Trackers (SHOT).8 SHOT enables parallel programming environments to deliver portability, performance, and energy efficiency. For example, we used SHOT to evaluate alternative data structures for the image contour detector by measuring realized memory bandwidth for each data layout.

Autotuners produce high-quality code by generating many variants and measuring each variant on the target platform. The search process tirelessly tries many unusual variants of a particular routine. Unlike libraries, autotuners also allow tuning to the particular problem size. Autotuners also preserve clarity and help portability by reducing the temptation to mangle the source code to improve performance for a particular computer.
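The structure of an autotuner fits in a few lines of Python: enumerate variants (here, block sizes for a blocked summation), time each one on the target machine and on the actual problem size, and keep the fastest. Production autotuners search far larger spaces and often generate code, but this sketch, which is our own illustration rather than a ParLab tool, shows the shape of the search.

import time
import numpy as np

def blocked_sum(x, block):
    # One tunable variant: traverse the array in blocks of a given size.
    return sum(x[i:i + block].sum() for i in range(0, len(x), block))

def time_once(fn, *args):
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def autotune(x, candidate_blocks, repeats=3):
    """Measure each variant on the actual data and platform; return the
    fastest block size. Tuning to the particular problem size is what
    distinguishes autotuning from a one-size-fits-all library call."""
    best_block, best_time = None, float("inf")
    for block in candidate_blocks:
        t = min(time_once(blocked_sum, x, block) for _ in range(repeats))
        if t < best_time:
            best_block, best_time = block, t
    return best_block

x = np.random.rand(1_000_000)
print("selected block size:", autotune(x, [1024, 8192, 65536]))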

The Copperhead specialization machinery has built-in heuristics that guide decisions about data layout, parallelization strategy, execution configuration, and so on. Autotuning lets the Copperhead specializer gain efficiency portably because the optimal execution strategy differs by platform.

Future architecture and operating system

We expect the client hardware of 2020 will contain hundreds of cores in replicated hardware tiles. Each tile will contain one processor designed for instruction-level parallelism for sequential code and a descendant of vector and GPU architectures for data-level parallelism. Task-level parallelism occurs across the tiles. The number of tiles per chip varies depending on cost-performance goals. Thus, although the tiles will be identical for ease of design and fabrication, the chip supports heterogeneous parallelism. The memory hierarchy will be a hybrid of traditional caches and software-controlled scratchpads.9 We believe that such mechanisms for mobile clients will also aid servers in the cloud.

Because we rarely run just one program, the hardware will partition to provide performance isolation and security between multiprogrammed applications. Partitioning suggests restructuring systems services as a set of interacting distributed components. The resulting deconstructed Tessellation operating system implements scheduling and resource management of partitions.10

Applications and operating system services (such as file systems) run within partitions. Partitions are lightweight and can be resized or suspended with overhead comparable to a context swap. The operating system kernel is a thin layer responsible for only the coarse-grained scheduling and assignment of resources to partitions and secure restricted communications among partitions. It avoids the performance issues with traditional microkernels by providing operating system services through secure messaging to spatially coresident service partitions, rather than context-switching to time-multiplexed service processes. User-level schedulers are used within a single partition to schedule application tasks onto processors across potentially multiple different libraries and frameworks, and the Lithe layer offers an interface to schedule independent libraries efficiently.11

ParLab today

We have identified the architectural and software patterns and are using them in the development of applications in vision, music, health, speech, and browsers. These applications drive the rest of the research. We have two SEJITS prototypes and autotuners for several motifs and architectures. We have a 64-core implementation of our tiled architecture in field-programmable gate arrays (FPGAs) that we use for architectural experiments, including the first SHOT implementation. FPGAs afforded a 250-fold increase in simulation speed over software simulators, and the lengthier experiments often led to opposite conclusions.12


The initial Tessellation operating system boots on our prototype and can create a partition, and Lithe has demonstrated efficient composition of code written using OpenMP libraries.

UPCRC-Illinois

The Universal Parallel Computing Research Center at the University of Illinois, Urbana-Champaign was established in 2008 as a result of the same competition that led to the ParLab. UPCRC-Illinois involves about 50 faculty and students and is codirected by Wen-mei Hwu and Marc Snir. It is one of several major projects in parallel computing at Illinois (http://parallel.illinois.edu), continuing a tradition that started with the Illiac projects.

Our work on applications focuses on the creation of compelling 3D Web applications and human-centered interfaces. We have a strong focus on programming language, compiler, and runtime technologies aimed at supporting parallel programming models that provide simple, sequential-by-default semantics with parallel performance models and that avoid concurrency bugs. Our architecture work is focused on efficiently supporting shared memory with many hundreds of cores. (The full list of UPCRC-Illinois ongoing projects is available at http://www.upcrc.illinois.edu.)

Applications

The 3D Internet will enable many new compelling applications. For example, the Teeve 3D teleimmersion framework and its descendants have supported remote collaborative dancing, remote Tai-Chi training, manipulation of virtual archeological artifacts, training of wheelchair-bound basketball players, and gaming.13 Applications are limited by a low frame rate and inability to handle visually noisy environments. Broad real-time usage in areas such as multiplayer gaming or telemedicine requires several orders of magnitude performance improvements in tasks such as the synthesis of 3D models from multiple 2D images; the recognition of faces, facial expressions, objects, and gestures; and the rendering of dynamically created synthetic environments. Such tasks must execute on mobile clients to reduce the impact of Internet latencies on real-time interactions. Many of the same tasks will be at the core of future human-centered ubiquitous computing environments that can understand human behavior and anticipate needs, but this will require significant technical progress.

We're developing new parallel algorithms for these tasks that can achieve required performance levels. We have implemented new parallel algorithms for depth image-based rendering—creating a virtual view of a 3D object from a new angle based on information from a depth camera and multiple optical cameras—and shown speedups of more than 74 times. We've implemented parallel versions of algorithms for analyzing video streams, demonstrating speedups in excess of 400 times on a hand-tracking task and for B-spline image interpolation. The algorithms have been used for the NIST TREC Video Retrieval Challenge, which requires the identification of specific events in surveillance videos. The complete application achieves speedups in excess of 13 times. We collected the algorithms into the publicly available Vivid library (http://libvivid.sourceforge.net) in the form of parallel functions callable from Python code.14

Spatial data structures form the backbone of many computationally intensive 3D applications and data-mining and machine-learning algorithms. To support such applications, we're developing ParKD, a comprehensive framework for parallelized spatial queries and updates through scalable, parallel k-D tree implementations. ParKD algorithms speed up the generation of k-D trees and generate k-D trees that enable efficient parallel rendering. We plan to integrate all these components into an end-to-end teleimmersive application.
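To illustrate the kind of spatial data structure ParKD targets, the sketch below builds a k-D tree by recursive median splits; the two subtrees created at each split cover disjoint point sets, which is precisely the independence a parallel builder or renderer exploits. This is a generic textbook construction, not ParKD code.

from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]

@dataclass
class KDNode:
    point: Point
    axis: int
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None

def build_kdtree(points: List[Point], depth: int = 0) -> Optional[KDNode]:
    """Median-split k-D tree. The left and right recursive calls touch
    disjoint point sets, so a parallel builder can hand them to separate
    cores with no synchronization beyond the final join."""
    if not points:
        return None
    axis = depth % 3
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(
        point=points[mid],
        axis=axis,
        left=build_kdtree(points[:mid], depth + 1),
        right=build_kdtree(points[mid + 1:], depth + 1),
    )

tree = build_kdtree([(2, 3, 1), (5, 4, 2), (9, 6, 3), (4, 7, 8), (8, 1, 5)])
print(tree.point, tree.axis)   # the root splits on the x axis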

Performance constraints of mobile platforms restrict the functionality of mobile applications and lengthen their development time. For example, to achieve high-quality, real-time graphics in interactive games, we need to precompute data structures used for rendering. This restricts the number of supported scenarios and increases development time and cost. Less constrained multiplayer games use lower quality graphics. An added advantage of our work will be enabling applications that provide better user experience and that can be developed faster.

Parallelism can also help achieve other goals, such as better quality of service and improved security. Our work on Teeve shows how the use of multiple cores can simplify QoS provision for multiple concurrent real-time tasks. Our work on the Opus Palladianum (OP) Web browser shows that parallelism both improves browser performance and reduces vulnerabilities by using compartmentalization.15

Programming environment

We distinguish between concurrent programming, which focuses on problems where concurrency is part of the specification, and parallel programming, which focuses on problems where concurrent execution is used only for improving a computation's performance. Reactive systems (such as an operating system, GUI, or online transaction-processing system), where computations (or transactions) are triggered by nondeterministic, possibly concurrent, requests or events, use concurrent programming. Parallel programming is used in transformational systems, such as in scientific computing or signal processing, where an initial input (or an input stream) is mapped through a chain of (usually) deterministic transformations into an output (or an output stream). The prevalence of multicore platforms doesn't increase the need for concurrent programming or make it harder; it increases the need for parallel programming. We contend that parallel programming is much easier than concurrent programming; in particular, it is seldom necessary to use nondeterministic code.

Figure 2 presents a schematic view of the parallel software creation process. Developers often start with a sequential code, although starting from a high-level specification is preferred. It's often possible to identify outer-loop parallelism where we can encapsulate all or most of the sequential code logic into a parallel execution framework (pipeline, master-slave, and so on) with little or no change in the sequential code. If this doesn't achieve the desired performance, developers identify compute-intensive kernels, encapsulate them into libraries, and tune these libraries to leverage parallelism at a finer grain, including single-instruction, multiple-data (SIMD) parallelism. Whereas the production of carefully tuned parallel libraries will involve performance coding experts, simple, coarse-level parallelism should be accessible to all.
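The "easy parallelism" step often amounts to wrapping unchanged sequential logic in a standard execution framework, as the following sketch shows; it uses Python's multiprocessing pool as a stand-in for such a master-worker framework and is our own illustration of the workflow in Figure 2, not UPCRC code.

from multiprocessing import Pool

def process_item(n):
    # The sequential kernel is reused unchanged; only the outer loop moves
    # into the parallel execution framework.
    return sum(i * i for i in range(n))

def run_sequential(items):
    return [process_item(n) for n in items]

def run_coarse_grained(items, workers=4):
    # Outer-loop (master-worker) parallelism over independent items.
    with Pool(processes=workers) as pool:
        return pool.map(process_item, items)

if __name__ == "__main__":
    items = [10_000, 20_000, 30_000, 40_000]
    assert run_sequential(items) == run_coarse_grained(items)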

We propose helping "easy parallelism" by developing refactoring tools that let us convert sequential code into parallel code written using existing parallel frameworks in C# or Java,16 as well as debugging tools that help identify concurrency bugs.17 More fundamentally, we propose supporting simple parallelism with languages that are deterministic by default and concurrency safe.

Figure 2. Parallel code development workflow. Coarse-grained parallelism is often sufficient and easy to achieve; fine-grained parallelism is the work of experts.

Languages such as Java and C# provide safety guarantees (such as type or memory safety) that significantly reduce the opportunities for hard-to-track bugs and improve programmer productivity. We plan to bring the same benefits to parallel and concurrent programming; a concurrency-safe language prevents, by design, the occurrence of data races. As a result, the semantics of concurrent execution are well defined, and hard-to-track bugs are avoided. A deterministic-by-default language will also ensure that, unless explicit nondeterministic constructs are used, any execution of a parallel code will have the same outcome as a sequential execution. This significantly reduces the effort of testing parallel code and facilitates the porting of sequential code.

Our work on Deterministic Parallel Java (DPJ)18 shows we can satisfy both properties, even for modern object-oriented programming, without undue loss of expressiveness and with good performance. Moreover, we can enforce the race-free guarantee with simple, compile-time type checking. Although such a safe language requires some initial programming effort, this effort is small compared with that of designing and developing a parallel program and can be significantly reduced via interactive porting tools.19 Our tool, DPJizer, can infer much of the added information DPJ requires. The effort has a large long-term payoff in terms of greatly improved documentation, maintainability, and robustness under future software changes.
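DPJ obtains this guarantee statically, through a type-and-effect system checked by the Java compiler. The Python sketch below only illustrates the underlying idea in dynamic form: each task declares the regions it reads and writes, and the runtime refuses to run tasks together when their effects overlap, so any accepted schedule is deterministic. It is a conceptual illustration, not DPJ.

from dataclasses import dataclass, field
from typing import Callable, Set

@dataclass
class Task:
    run: Callable[[], None]
    reads: Set[str] = field(default_factory=set)
    writes: Set[str] = field(default_factory=set)

def conflicts(a: Task, b: Task) -> bool:
    # Two tasks conflict if one writes a region the other reads or writes.
    return bool(a.writes & (b.reads | b.writes)) or bool(b.writes & a.reads)

def run_parallel(tasks):
    """Reject task sets whose declared effects overlap; DPJ performs the
    analogous check at compile time, so checked programs are race free
    and deterministic by construction."""
    for i, a in enumerate(tasks):
        for b in tasks[i + 1:]:
            if conflicts(a, b):
                raise ValueError("potential data race: effects overlap")
    for t in tasks:        # safe to execute in any order, or concurrently
        t.run()

data = {"left": [1, 2], "right": [3, 4]}
run_parallel([
    Task(run=lambda: data["left"].append(0), writes={"left"}),
    Task(run=lambda: data["right"].append(0), writes={"right"}),
])
print(data)   # the same result for every legal schedule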

The development of high-performance parallel libraries requires a different environment that enables programmers to fine-tune code and facilitates porting to new platforms. We believe that such work should happen in an environment where compiler analysis and code development interact closely so that the performance programmer works with the compiler, not against it, as is often the case today. We're working on refactoring environments where code changes are mostly annotations that add information not otherwise available in the source code. The annotation information is extended by advanced compiler analysis to enable robust deployment of transformations such as data layout adjustment and loop tiling transformations. One particular goal of our Gluon work is to enable a portable common parallel code base for both fine-grained engines such as GPUs and coarser-grained multicore CPUs, as well as their associated memory hierarchies. Autotuning by manipulating the algorithm or code is another technique we apply to this problem.

Parallel libraries are most easily integrated into sequential code when they follow the SIMD model. Our work on hierarchically tiled arrays (HTAs) confirms earlier results showing that this model can also be used to support a range of complex applications with good performance.20 The use of tiles as a first-class object gives programmers good control of locality, granularity, and load balancing.
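The sketch below mimics the flavor of that programming model in Python with NumPy: the array is partitioned into tiles that are first-class objects, and operations are expressed tile by tile, which is where the programmer controls locality, granularity, and load balance. It is our own illustration, not the HTA library.

import numpy as np

class TiledArray:
    """A flat array viewed as a sequence of equally sized tiles."""

    def __init__(self, data, tile_size):
        assert len(data) % tile_size == 0
        self.tile_size = tile_size
        self.tiles = [data[i:i + tile_size]
                      for i in range(0, len(data), tile_size)]

    def map_tiles(self, fn):
        # Each tile is an independent unit of work: granularity and data
        # placement are set by the tiling chosen by the programmer.
        return TiledArray(np.concatenate([fn(t) for t in self.tiles]),
                          self.tile_size)

    def to_array(self):
        return np.concatenate(self.tiles)

a = TiledArray(np.arange(8, dtype=float), tile_size=2)
b = a.map_tiles(lambda tile: tile * 10)   # a tile-wise, SIMD-style update
print(b.to_array())                       # [ 0. 10. 20. ... 70.]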

Integrating parallel libraries and frameworks into a concurrency-safe programming environment requires a careful design of interfaces to ensure that the safety guarantees are not violated. We can use simple language features as specialized contracts at framework interfaces to ensure that client code using a parallel framework (with internal parallelism) doesn't violate assumptions made within the framework implementation. Such an approach gives an expert framework implementer freedom to use low-level and/or highly tuned techniques within the framework while enforcing a safety net for less expert application programmers using the framework. The expert framework implementation can be subject to extensive testing and analysis techniques, including proofs of concurrency safety using manual annotations.

Architecture

The continued scaling of feature sizes will enable systems with hundreds of conventional cores, and possibly thousands of lightweight cores, within a decade. Current cache coherence protocols don't scale to such numbers. With current protocols, each shared memory access by a core is considered to be a potential communication or synchronization with any other core. In fact, parallel programs communicate and synchronize in stylized ways. A key to shared memory scaling is adjusting coherence protocols to leverage the prevalent structure of shared memory codes for performance.

We're exploring three approaches to do so. The Bulk Architecture executes coherence operations in bulk, committing large groups of loads and stores at a time.21 In this architecture, memory accesses appear to interleave in a total order, even in the presence of data races—which helps software debugging and productivity—while the performance is high through aggressive reordering of loads and stores within each group. The DeNovo architecture codesigns the hardware with concurrency-safe programming models, resulting in a much simplified and scalable coherence protocol.22 The Rigel architecture plans to shift more coherence activities to software.23 In addition, a higher-level view of communication and synchronization across threads enables the architecture to help program development, for example, by supporting deterministic replay of parallel programs or by tracking races in codes that don't prevent them by design.24

Stanford University PPL

In May 2008, Stanford University officially launched the Pervasive Parallelism Laboratory. PPL's goal is to make parallelism accessible to average software developers so that it can be freely used in all computationally demanding applications. The PPL pools the efforts of many leading Stanford computer scientists and electrical engineers with support from Sun Microsystems, NVIDIA, IBM, Advanced Micro Devices, Intel, NEC, and Hewlett-Packard under an open industrial affiliates program. The lab's open nature lets other companies join the effort and doesn't provide any member company with exclusive intellectual property rights to the research results.

PPL approach

A fundamental premise of the PPL is that parallel computing hardware will be heterogeneous. This is already true today; personal computer systems currently shipping consist of a chip multiprocessor and a highly data-parallel GPU coprocessor. Large clusters of such nodes have already been deployed, and most future high-performance computing environments will contain GPUs. To fully leverage the computational capabilities of these systems, an application developer must contend with multiple, sometimes incompatible, programming models. Shared memory multiprocessors are generally programmed using threads and locks (such as pthreads and OpenMP), while GPUs are programmed using data-parallel languages (such as CUDA and OpenCL), and communication between the nodes in a cluster is programmed with a message passing library (such as MPI). Heterogeneous hardware systems are driven by the desire to improve hardware productivity, measured by performance per watt and per dollar. This desire will continue to drive even greater hardware heterogeneity that will include special-purpose processing units.

As the degree of hardware heterogeneity increases, developing software for these systems will become even more complex. Our hypothesis is that the only way to radically simplify the process of developing parallel applications and improve programmer productivity, measured by programmer effort required for a given level of performance, is to use very-high-level domain-specific programming languages and environments. These environments will capture parallelism implicitly and will optimize and map this parallelism to heterogeneous hardware using domain-specific knowledge.

Thus, our vision for the future of parallel application programming is to replace a disparate collection of programming models, which require specialized architecture knowledge, with domain-specific languages that match application developer knowledge and understanding.

PPL research agenda

To drive PPL research, we are developing new applications in areas that have the potential to exploit significant amounts of parallelism but also present significant software development challenges. These application areas demand enormous amounts of computing power to process large amounts of information, often in real time. They include traditional scientific and engineering applications from geosciences, mechanical engineering, and bioengineering; a massive virtual world including a client-side game engine and a scalable world server; personal robotics including autonomous driving vehicles and robots that can navigate home and office environments; and sophisticated data-analysis applications that can extract information from huge amounts of data. These applications will be developed by domain experts in collaboration with PPL researchers.

The core of our research agenda is to allow a domain expert to develop parallel software without becoming an expert in parallel programming. Our approach is to use a layered system (Figure 3) based on implicitly parallel domain-specific languages (DSLs), a domain embedding language, a common parallel runtime system, and a heterogeneous architecture that provides efficient mechanisms for communication, synchronization, and performance monitoring.

We expect that most programming of future parallel systems will be done in DSLs at the abstraction level of Matlab or SQL. DSLs enable the average programmer to be highly productive in writing parallel programs by isolating the programmer from the details of parallelism, synchronization, and locality. The use of DSLs also recognizes that most applications aren't written from scratch, but rather built by combining existing systems and libraries.25 The DSL environment uses its high-level view of the computation and its domain knowledge to direct placement and scheduling to optimize parallel execution. Our goal is to build the underlying technology that makes it easy to create implicitly parallel DSLs and to design and implement at least three specific languages using this technology to demonstrate that it can support multiple DSLs. To support science and engineering applications, we're developing a mesh-based PDE DSL called Liszt and a physics DSL that is based on the physics simulation library PhysBAM, which has been extensively used in biomechanics, virtual surgery, and visual special effects. To support the many algorithms in robotics and data informatics that are based on machine learning, we're developing a machine-learning DSL called OptiML. Finally, we're using the Scala programming language as the language for embedding the DSLs.26 Scala integrates key features of object-oriented and functional languages, and Scala's extensibility makes it possible to define new DSLs naturally.

Figure 3. A layered approach to the problem of parallel computing. The use of a common embedding language allows interoperability between DSLs, and the DSL infrastructure allows new DSLs to be developed more easily.

Our common parallel runtime system (CPR), called Delite, supports both implicit task-level parallelism for generality and explicit data-level parallelism for efficiency. The CPR maps the parallelism extracted from a DSL-based application to heterogeneous architectures and manages the allocation and scheduling of processing and memory resources. The mapping process begins with a task graph, which exposes task- and data-level parallelism and retains the high-level domain knowledge expressed by the DSL. This domain knowledge is used to optimize the graph by reducing the total amount of work and by exposing more parallelism. The CPR scheduler uses this task graph to reason about large portions of the program and make allocation and scheduling decisions that reduce communication and improve locality of data access. Each node of the task graph can have multiple implementations that target different architectures, which supports heterogeneous parallelism.

To support the CPR, we're developing a set of architecture mechanisms that provide communication and synchronization with low overhead. Such mechanisms will support both efficient fine-grained exploitation of parallelism to get many processors working together on a fixed-size data set and coarse-grained parallelism to gain efficiency through bulk operations. A key challenge is the design of a memory hierarchy that simultaneously supports both implicit reactive mechanisms (caching with coherence and transactions) and explicit proactive mechanisms (explicit staging of data to local memory). Our goal is to develop a simple set of mechanisms that is general enough to support execution models ranging from speculative threads (to support legacy codes) to streaming (to support explicitly scheduled data-parallel DSLs).27 To evaluate these architecture mechanisms with full-size applications at hardware speeds, we're using a prototyping system called the Flexible Architecture Research Machine (FARM). FARM tightly couples commodity processor chips for performance with FPGA chips for flexibility, using a cache-coherent shared address space.

Liszt

Liszt is a domain-specific programming environment developed in Scala for implementing PDE solvers on unstructured meshes for hypersonic fluid simulation. It abstracts the representation of the common objects and operations used in flow simulation. Because 3D vectors are common in physical simulation, Liszt implements them as objects with common methods for dot and cross products. In addition, Liszt completely abstracts the mesh data structure, performing all mesh access through standard interfaces. Field variables are associated with topological elements such as cells, faces, edges, and vertices, but they are accessed through methods so their representation is not exposed. Finally, sparse matrices are indexed by topological elements, not integers. This code appeals to computational scientists because it is written in a form they understand. It also appeals to computer scientists because it hides the details of the machine.
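Liszt itself is embedded in Scala; the Python sketch below only mirrors the kind of interface the paragraph above describes, with vectors as objects that provide dot and cross products and with fields attached to topological elements behind accessor methods, so the storage layout never leaks into solver code. All names here are our own.

from dataclasses import dataclass

@dataclass(frozen=True)
class Vec3:
    x: float
    y: float
    z: float

    def dot(self, o):
        return self.x * o.x + self.y * o.y + self.z * o.z

    def cross(self, o):
        return Vec3(self.y * o.z - self.z * o.y,
                    self.z * o.x - self.x * o.z,
                    self.x * o.y - self.y * o.x)

class Mesh:
    """Mesh access goes only through methods, so the representation (and the
    layout chosen for a particular architecture) is never exposed."""

    def __init__(self, n_cells):
        self._cells = list(range(n_cells))

    def cells(self):
        return iter(self._cells)

class Field:
    """A field indexed by topological elements (here, cells)."""

    def __init__(self, mesh, init):
        self._data = {c: init for c in mesh.cells()}

    def get(self, cell):
        return self._data[cell]

    def set(self, cell, value):
        self._data[cell] = value

mesh = Mesh(n_cells=4)
velocity = Field(mesh, Vec3(0.0, 0.0, 0.0))
for c in mesh.cells():
    velocity.set(c, Vec3(1.0, 0.0, 0.0))
print(velocity.get(0).dot(Vec3(2.0, 0.0, 0.0)))   # 2.0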

One key to good parallel performance is good memory locality, and the choice of data structure representation and data decomposition can have an enormous impact on locality. The use of DSLs and domain knowledge make it possible to pick the data structure that best fits the characteristics of the architecture. The DSL compiler for Liszt, for example, knows what a mesh, a cell, and a face are. As a result, the compiler has the information needed to select the data structure representations, data decomposition, and the layout of field variables that are optimized for a specific architecture. Using this domain-specific approach, it is possible to generate a special version of the code for a given mesh running on a given architecture. No general-purpose compiler could possibly do this type of superoptimization.

Delite

Delite's layered approach to parallel heterogeneous programming hides the complexity of the underlying machine behind a collection of DSLs. The use of DSLs creates two types of programmers: application developers, who use DSLs and are shielded from parallel or heterogeneous programming constructs (Figure 4a); and DSL developers, who use the Delite framework to implement a DSL. The DSL developer defines the mapping of DSL methods into domain-specific units of execution called OPs. OPs hold the implementation of a particular domain-specific operation and other information needed by the runtime for efficient parallel execution (operation cost, possible side effects, and so on). OPs also encode any dependencies on data objects or existing OPs.

When an application calls a DSL method, an OP is submitted to the Delite runtime through a defer method and receives a proxy in return (Figure 4b). The application is oblivious to the fact that computation has been deferred and runs ahead, allowing more OPs to be submitted. These OPs form a dynamic task graph that can then be scheduled to run in parallel, if independent (Figure 4c). Multiple variants of each OP in the domain-specific language can be generated to target different available parallel architectures (Figure 4d). As new architectures emerge, Delite can be enhanced to generate code for these new architectures. The application doesn't need to be rewritten to benefit from these new architectures, and unless the DSL interface changes, the application need not even be recompiled.
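The deferral mechanism can be pictured with a small Python sketch: DSL operations submit OP nodes to a runtime and get proxies back, the runtime accumulates a dynamic task graph, and OPs whose inputs do not depend on each other could then be dispatched in parallel or to different devices. This mirrors the structure shown in Figure 4 but is our own simplification, not Delite.

class OP:
    """A deferred domain-specific operation plus its data dependencies."""

    def __init__(self, fn, *deps):
        self.fn, self.deps = fn, deps
        self.result = None

class Runtime:
    def __init__(self):
        self.graph = []          # dynamic task graph of submitted OPs

    def defer(self, fn, *deps):
        op = OP(fn, *deps)
        self.graph.append(op)
        return op                # the OP object doubles as the proxy

    def execute(self):
        # Toy scheduler: evaluate in submission order. OPs with disjoint
        # dependencies could instead be mapped to different cores or GPUs.
        for op in self.graph:
            args = [d.result if isinstance(d, OP) else d for d in op.deps]
            op.result = op.fn(*args)
        return self.graph[-1].result

def matmul(a, b):                # stand-ins for DSL kernels
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def matadd(a, b):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]

# The "application" runs ahead, building the graph for ab + cd without
# computing anything until execute() is called.
rt = Runtime()
a = [[1, 0], [0, 1]]
b = [[2, 3], [4, 5]]
ab = rt.defer(matmul, a, b)
cd = rt.defer(matmul, b, a)
total = rt.defer(matadd, ab, cd)
print(rt.execute())              # [[4, 6], [8, 10]]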

Delite can be used to extract and optimize both task- and data-level parallelism for several machine-learning kernels written in OptiML. Our results show significant potential for this approach to uncover large amounts of parallelism from applications written using a Delite DSL. Future work will demonstrate how the Delite infrastructure can be used to target a heterogeneous parallel architecture composed of multicores and GPUs and interface with more specialized parallel runtimes that we are developing, such as Sequoia28 (divide and conquer) and GRAMPS29 (producer/consumer).

Figure 4. Delite example application execution overview. Delite simplifies parallel programming using implicitly parallel DSLs: (a) the application calls matrix DSL methods; (b) the DSL defers OP execution to Delite; (c) the Delite runtime maps the resulting DAG to resources; (d) the hardware schedule across devices over time.

The stagnation in uniprocessor performance and the shift to multicore architectures is a technological discontinuity that will force major changes in the IT industry. The three centers are proud of the opportunity given to them to have a major role in driving this change and humbled by the magnitude of the task.

Acknowledgments

UC Berkeley ParLab research is sponsored by the Universal Parallel Computing Research Center, which is funded by Intel and Microsoft (award no. 20080469) and by matching funds from UC Discovery (award no. DIG07-10227). Additional support comes from six ParLab affiliate companies: National Instruments, NEC, Nokia, NVIDIA, Samsung, and Sun Microsystems. UPCRC-Illinois is funded by Intel and Microsoft and by matching funds from the University of Illinois. Stanford PPL is funded by Sun Microsystems, NVIDIA, IBM, Advanced Micro Devices, Intel, and NEC.

References
1. K. Asanovic et al., "A View of the Parallel Computing Landscape," Comm. ACM, vol. 52, no. 10, Oct. 2009, pp. 56-67.
2. K. Keutzer and T. Mattson, "A Design Pattern Language for Engineering (Parallel) Software," to appear in Intel Technical J., vol. 13, no. 4, 2010.
3. B. Catanzaro et al., "SEJITS: Getting Productivity and Performance with Selective Embedded JIT Specialization," Proc. 1st Workshop Programmable Models for Emerging Architecture (PMEA), EECS Dept., Univ. of California, Berkeley, tech. report UCB/EECS-2010-23, Mar. 2010.
4. C.G. Jones et al., "Parallelizing the Web Browser," Proc. 1st Usenix Workshop Hot Topics in Parallelism (HotPar 09), Usenix Assoc., 2009; http://parlab.eecs.berkeley.edu/publication/220.
5. M. Armbrust et al., "Above the Clouds: A Berkeley View of Cloud Computing," Comm. ACM, vol. 53, no. 4, Apr. 2010, pp. 50-58.
6. C. Gu et al., "Recognition Using Regions," Proc. Computer Vision and Pattern Recognition (CVPR 09), IEEE CS Press, 2009, pp. 1030-1037.
7. B. Catanzaro et al., "Efficient, High-Quality Image Contour Detection," Proc. IEEE Int'l Conf. Computer Vision (ICCV 09), 2009; http://www.cs.berkeley.edu/~catanzar/Damascene/iccv2009.pdf.
8. S. Bird et al., "Software Knows Best: Portable Parallelism Requires Standardized Measurements of Transparent Hardware," EECS Dept., Univ. of California, Berkeley, tech. report, Mar. 2010.
9. H. Cook, K. Asanovic, and D.A. Patterson, Virtual Local Stores: Enabling Software-Managed Memory Hierarchies in Mainstream Computing Environments, EECS Dept., Univ. of California, Berkeley, tech. report UCB/EECS-2009-131, Sept. 2009.
10. R. Liu et al., "Tessellation: Space-Time Partitioning in a Manycore Client OS," Proc. 1st Usenix Workshop Hot Topics in Parallelism (HotPar 09), Usenix Assoc., 2009; http://parlab.eecs.berkeley.edu/publication/221.
11. H. Pan, B. Hindman, and K. Asanovic, "Lithe: Enabling Efficient Composition of Parallel Libraries," Proc. 1st Usenix Workshop Hot Topics in Parallelism (HotPar 09), Usenix Assoc., 2009; http://parlab.eecs.berkeley.edu/publication/222.
12. Z. Tan et al., "A Case for FAME: FPGA Architecture Model Execution," to be published in Proc. Int'l Symp. Computer Architecture, June 2010.
13. W. Wu et al., "MobileTI: A Portable Teleimmersive System," Proc. 17th ACM Int'l Conf. Multimedia, ACM Press, 2009, pp. 877-880.
14. D. Lin et al., "The Parallelization of Video Processing," Signal Processing, vol. 26, no. 6, 2009, pp. 103-112.
15. C. Grier, S. Tang, and S.T. King, "Secure Web Browsing with the OP Web Browser," Proc. 2008 IEEE Symp. Security and Privacy, IEEE CS Press, 2008, pp. 402-416.
16. D. Dig et al., "ReLooper: Refactoring for Loop Parallelism in Java," Companion Proc. Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 09), ACM Press, 2009, pp. 793-794.
17. A. Farzan, P. Madhusudan, and F. Sorrentino, "Meta-analysis for Atomicity Violations under Nested Locking," Proc. Int'l Conf. Computer Aided Verification (CAV), Springer, 2009, pp. 248-262.
18. R. Bocchino et al., "A Type and Effect System for Deterministic Parallel Java," Proc. Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), ACM Press, 2009, pp. 97-116.
19. M. Vakilian et al., "Inferring Method Effect Summaries for Nested Heap Regions," Proc. 24th IEEE/ACM Conf. Automated Software Eng. (ASE 2009), ACM Press, 2009, pp. 421-432.
20. G. Bikshandi et al., "Programming for Parallelism and Locality with Hierarchically Tiled Arrays," Proc. 11th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP), ACM Press, 2006, pp. 48-57.
21. J. Torrellas et al., "The Bulk Multicore for Improved Programmability," Comm. ACM, Dec. 2009, pp. 58-65.
22. S. Adve and H.-J. Boehm, "Memory Models: A Case for Rethinking Parallel Languages and Hardware," to appear in Comm. ACM; http://rsim.cs.uiuc.edu/Pubs/10-cacm-memory-models.pdf.
23. J. Kelm et al., "Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator," Proc. 36th Int'l Symp. Computer Architecture, IEEE CS Press, 2009, pp. 140-151.
24. A. Nistor, D. Marinov, and J. Torrellas, "Light64: Lightweight Hardware Support for Race Detection during Systematic Testing of Parallel Programs," Proc. Int'l Symp. Microarchitecture (MICRO), IEEE CS Press, 2009, pp. 541-552.
25. N. Bronson et al., "A Practical Concurrent Binary Search Tree," Proc. 15th Ann. Symp. Principles and Practice of Parallel Programming (PPoPP), ACM Press, 2010, pp. 257-268.
26. M. Odersky, L. Spoon, and B. Venners, Programming in Scala: A Comprehensive Step-by-Step Guide, Artima Press, 2008.
27. D. Sanchez, R. Yoo, and C. Kozyrakis, "Flexible Architectural Support for Fine-Grain Scheduling," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), ACM Press, 2010; http://csl.stanford.edu/~christos/publications/2010.adm.asplos.pdf.
28. K. Fatahalian et al., "Sequoia: Programming the Memory Hierarchy," Proc. 2006 ACM/IEEE Conf. Supercomputing, ACM Press, 2006; http://graphics.stanford.edu/papers/sequoia/sequoia_sc06.pdf.
29. J. Sugerman et al., "GRAMPS: A Programming Model for Graphics Pipelines," ACM Trans. Graphics, vol. 28, no. 1, 2009, pp. 1-11.

Bryan Catanzaro is a PhD candidate in the electrical engineering and computer science department at the University of California, Berkeley. His research interests center on programming models for many-core computers, with an applications-driven emphasis. Catanzaro has an MS in electrical engineering from Brigham Young University. He is a member of IEEE.

Armando Fox is an adjunct associate professor in the electrical engineering and computer science department at the University of California, Berkeley, and is active in the Parallel Computing Laboratory and the Reliable Adaptive Distributed Systems Laboratory (RAD Lab). His research interests include cloud computing, programming languages and frameworks, and applied machine learning. Fox has a PhD in computer science from the University of California, Berkeley. He is a senior member of the ACM.

Kurt Keutzer is a professor of electrical engineering and computer science at the University of California, Berkeley, and the principal investigator in the Parallel Computing Laboratory. His research interests include patterns and frameworks for efficient parallel programming. Keutzer has a PhD in computer science from Indiana University. He is a fellow of IEEE.

David Patterson is the director of the Parallel Computing Laboratory, director of the Reliable Adaptive Distributed Systems Laboratory, and Pardee Professor of Computer Science at the University of California, Berkeley. His research interests include hardware and software issues in many-core systems and cloud computing. Patterson has a PhD in computer science from the University of California, Los Angeles. He is a fellow of IEEE and the ACM.

Bor-Yiing Su is a PhD candidate in the electrical engineering and computer science department at the University of California, Berkeley. His research interests focus on developing programming frameworks for computer vision applications on many-core platforms. Su has a BS in electrical engineering from National Taiwan University.

Marc Snir is the Michael Faiman and Saburo Muroga Professor of Computer Science at the University of Illinois, Urbana-Champaign, codirector of UPCRC-Illinois, and chief software architect for the Petascale Blue Waters system. His research interests include high-performance computing, parallel programming models and programming environments, and computing education. Snir has a PhD in mathematics from the Hebrew University of Jerusalem. He is a fellow of IEEE, the ACM, and the AAAS.

Kunle Olukotun is a professor of electrical engineering and computer science at Stanford University and director of the Stanford Pervasive Parallelism Lab. His research interests include computer architecture and parallel programming environments. Olukotun has a PhD in computer engineering from the University of Michigan. He is a fellow of IEEE and the ACM.

Pat Hanrahan is the Canon Professor of Computer Science and Electrical Engineering at Stanford University. His research interests include programming environments for parallel computing, graphics systems and architectures, and visualization. Hanrahan has a PhD in biophysics from the University of Wisconsin.

Hassan Chafi is a PhD candidate in electrical engineering at Stanford University. His research interests include transactional memory and parallel programming models. Chafi has an MS in electrical engineering from Stanford University.

Direct questions and comments to Marc Snir, Univ. of Illinois, Siebel Center for Computer Science, 201 N. Goodwin Ave., Urbana, IL, 60801; [email protected].
