
PROGRAMMING MULTICORES: DO APPLICATIONS PROGRAMMERS NEED TO WRITE EXPLICITLY PARALLEL PROGRAMS?

IN THIS PANEL DISCUSSION FROM THE 2009 WORKSHOP ON COMPUTER ARCHITECTURE RESEARCH DIRECTIONS, DAVID AUGUST AND KESHAV PINGALI DEBATE WHETHER EXPLICITLY PARALLEL PROGRAMMING IS A NECESSARY EVIL FOR APPLICATIONS PROGRAMMERS, ASSESS THE CURRENT STATE OF PARALLEL PROGRAMMING MODELS, AND DISCUSS POSSIBLE ROUTES TOWARD FINDING THE PROGRAMMING MODEL FOR THE MULTICORE ERA.

Arvind, Massachusetts Institute of Technology
David August, Princeton University
Joshua J. Yi, University of Texas School of Law
Resit Sendag, University of Rhode Island
Keshav Pingali, University of Texas at Austin
Derek Chiou, University of Texas at Austin

Moderator's introduction: Arvind

Do applications programmers need to write explicitly parallel programs? Most people believe that the current method of parallel programming is impeding the exploitation of multicores. In other words, the number of cores in a microprocessor is likely to track Moore's law in the near future, but the programming of multicores might remain the biggest obstacle in the forward march of performance.

Let's assume that this premise is true. Now, the real question becomes: how should applications programmers exploit the potential of multicores? There have been two main ideas in exploiting parallelism: implicitly and explicitly.

Implicit parallelism

The concept of the implicit exploitation of the parallelism in a program has its roots in the 1970s and 1980s, when two main approaches were developed.

The first approach required that the compilers do all the work in finding the parallelism. This was often referred to as the "dusty decks" problem—that is, how to exploit parallelism in existing programs. This approach taught us a lot about compiling. But most importantly, it taught us how to write a program in the first place, so that the compiler had a chance of finding the parallelism.

The second approach, to which I also contributed, was to write programs in a manner such that the inherent (or obvious) parallelism in the algorithm is not obscured in the program. I explored declarative languages for this purpose. This line of research also taught us a lot. It showed us that we can express all kinds of parallelism in a program, but even after all the parallelism has been exposed, it is fairly difficult to efficiently map the exposed parallelism on a given hardware substrate.


So, we are faced with two problems:

- How do we expose the parallelism in a program?
- How do we package the parallelism for a particular machine?

Explicit parallelism

The other approach is to program machines explicitly to exploit parallelism. This means that the programmer should be made aware of all the machine resources: the type of interconnection, the number and configuration of caches in the memory hierarchy, and so on. But, it is obvious that if too much information were disclosed about the machine, programming difficulty would increase rapidly. Perhaps a more systematic manner of exposing machine details could alleviate this problem.

Difficulty is always at the forefront of any discussion of explicit parallel programming. In the earliest days of the message passing interface (MPI), experts were concerned that people would not be able to write parallel programs at all because humans' sequential manner of thinking would make writing these programs difficult. In addition, the more resources a machine exposes, the less abstract the programming becomes. Hence, it is not surprising that such programs' portability becomes a difficult issue. How will the program be transferred to the next generation of a given machine from the same vendor or to an entirely different machine? Today, the issue of portability is so important that giving it up is not really an option. To do so would require that every program be rewritten for each machine configuration. The last problem, which is equally important as the first two, is that of composability. After all, the main purpose of parallel programming is to enhance performance. Composing parallel programs in a manner that is informative about the performance of the composed program remains one of the most daunting challenges facing us.

Implicit versus explicit debate

This minipanel addresses the implicit versus explicit debate. How much about the machine should be exposed? Professors David August and Keshav Pingali will debate the following questions:

- How should applications programmers exploit the potential of multicores?
- Is explicitly parallel programming inherently more difficult than implicitly parallel programming?
- Can we design languages, compilers, and runtime systems so that applications programmers can get away with writing only implicit parallel programs, without sacrificing performance?
- Does anyone other than a few compiler writers and systems programmers need explicitly parallel programming?
- Do programmers need to be able to express nondeterminism (for example, selection from a set) to exploit parallelism in an application?
- Is a speculative execution model essential for exploiting parallelism?

The case for an implicitly parallel programming model and dynamic parallelization: David August

Let's address Arvind's questions directly, starting with the first.

How should applications programmers exploit the potential of multicores? We appear to have two options. The first is explicitly parallel programming (parallel programming for short); the second is parallelizing compilers. In one approach, humans perform all of the parallelism extraction. In the other, tools, such as compilers, do the extraction. Historically, both approaches have been dismal failures. We need a new approach, called the implicitly parallel programming model and dynamic parallelization. This is a hybrid approach that is fundamentally different from simply combining explicitly parallel programming with parallelizing compilers.

To understand this new approach, we must understand the difference between "explicit" and "implicit."

Consider the __INLINE__ directive. You can, using some compilers, mark a function with this directive, and the compiler will automatically and reliably inline it for you. This directive is explicit.


The tool makes inlining more convenient than inlining manually. However, it is not as convenient as simply not having to concern yourself with questions of inlining, because you are still responsible for making the decision and informing the inlining tool of your decision. Explicit inlining is now unnecessary because compilers are better at making inlining decisions than programmers. Instead, by using functions and methods, programmers provide compilers with enough information to decide on their own what and where to inline. Inlining has become implicit since each function and method by itself (without the __INLINE__ directive) is a suggestion to the compiler that it should make a decision about whether to inline.
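As a concrete illustration (mine, not the panel's), the two styles might look as follows in C++; the forced-inline attribute is the GCC/Clang spelling, standing in for the generic __INLINE__ directive mentioned above:

    // Explicit: the programmer makes the inlining decision and informs the tool.
    // (GCC/Clang spelling; other compilers provide their own directives.)
    __attribute__((always_inline)) inline int square_explicit(int x) { return x * x; }

    // Implicit: an ordinary function is itself the "suggestion"; the compiler
    // decides on its own whether inlining pays off at each call site.
    int square_implicit(int x) { return x * x; }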

Parallel programming is explicit in that the programmer concretely specifies how to partition the program. In parallel programming, we have many choices. Figure 1 shows a partial list of more than 150 parallel programming languages that programmers can choose from if they want to be explicit about how to parallelize a program. Somehow, despite all these options, the problem still isn't solved. Programmers have a hard time making good decisions about how to parallelize codes, so forcing them to be explicit about it to address the multicore problem isn't a solution. As we'll see, it actually makes the problem worse in the long run. That's why I'm not a proponent of explicitly parallel programming.

Is explicitly parallel programming inherently more difficult than implicitly parallel programming? For this question, I'm not going to make a strong argument. Instead, my opponent (Keshav Pingali) will make a strong argument for me. In his PLDI (Programming Language Design and Implementation) 2007 paper,1 he quotes Tim Sweeney, who "designed the first multithreaded Unreal 3 game engine." Sweeney estimates that "writing multithreaded code tripled software code cost at Epic games." This means that explicitly parallel programming is going to hurt. And, it's going to hurt more as the number of cores increases. Through Sweeney, I think my opponent has typified the universal failure of explicitly parallel programming to solve the problem.


Figure 1. A partial list of parallel programming languages.


To support the case for implicitly parallel programming languages, I refer to the same paper by my opponent, in which he says "it is likely that more SQL programs are executed in parallel than programs in any other programming language. However, most SQL programmers do not write explicitly parallel code . . . ." This is a great example of implicitly parallel programming offering success in the real world. What we should be doing as a community is broadening the scope of implicitly parallel programming beyond the database domain.

Can we design languages, compilers, and runtime systems so that applications programmers can get away with writing only implicitly parallel programs, without sacrificing performance? I'd like to answer this question by describing an experiment we've done. In Figure 2, we have a function called rand. It's just a pseudorandom number generator with some internal state, called state. Elsewhere, we have an apparently DoAll loop that calls rand. Unfortunately, a dependence exists between each pair of rand calls and must be respected, creating a serialization of the calls to rand. Scheduling iteration 1 on core 1, iteration 2 on core 2, and iteration 3 on core 3 provides no benefit as rand is called at the start and end of each iteration. This serializes each iteration of the loop with the next.

With this constraint, parallelism seems impossible to extract, because this dependence must be respected for correct execution, yet we can't speculate it, schedule it, or otherwise deal with it in the compiler. As programmers, though, we know that the rand function meets its contractual obligations regardless of the order in which it's called. The value order returned by rand is irrelevant in our example. The sequential programming language constrains the compiler to maintaining the single original sequential order of dependences between calls to rand. This is unnecessary, so we should relax this constraint.

To let the programmer relax this constraint, we add an annotation to the sequential programming model that tells the compiler that this function can be invoked in any order. We call this annotation commutative, referring to the programmer's intuitive notion of a commutative mathematical operator. The commutative annotation says that any invocation order is legal so long as there is an order. Armed with this information, the compiler can now execute iterations in parallel, as Figure 3 shows. The actual dependence pattern becomes a dynamic dependence pattern. The order in which rand is called is opportunistic, and large amounts of parallelism are exposed. The details of the commutative annotation are available elsewhere.2

There are several important things to note. First, the annotation is not an explicitly parallel annotation like an OpenMP pragma. Second, the annotation is not a hint to the compiler to do something that it could possibly figure out on its own if it were smart enough, like the register and inline keywords in C are now. It is an annotation that communicates information about dependences to the compiler, information that is only knowable through an understanding of the real-world task.

Figure 2. Parallelization diagram: pseudocode (a) and dynamic execution schedule (b). The dependence between each rand call creates a serialization of the calls to rand.
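The pseudocode in the figure can be reconstructed roughly as the following C++ sketch. It is illustrative rather than the authors' exact code: f is the unspecified state-update function from the figure, the generator is renamed rand_gen to avoid colliding with the standard library's rand, and work stands in for the rest of the loop body.

    #include <cstddef>

    static char *state;            // internal generator state shared by every call
    char *f(char *s);              // assumed state-update function from the figure

    char *rand_gen() {             // "rand" in the figure
        state = f(state);          // each call reads the state the previous call wrote
        return state;
    }

    void work(char *value);        // assumed per-iteration computation

    void doall_loop(std::size_t n) {
        for (std::size_t i = 0; i != n; ++i) {
            work(rand_gen());      // a call at the start of each iteration...
            work(rand_gen());      // ...and at the end chains iteration i to i+1,
        }                          // serializing the apparently DoAll loop
    }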


Additionally, although we arrived at this use of commutative by examining this program's execution plan, we don't expect the programmer to have to do so. Instead, we expect the programmer to integrate commutative directly into the programming process, where it should be annotated on any function that is semantically commutative. That is, commutative is a property of the function as used in this program, not a property of the underlying hardware, the execution plan, nor any of the other concerns explicitly parallel programmers must consider.

The idea isn't new; my opponent also refers to this idea in his PLDI 2007 paper and talks about "some serial order" and "commutativity in the semantic sense." Note that our commutative isn't inherently commutative (like addition), but semantically commutative. So, it states that receiving random numbers out of order, as Figure 3 shows, is acceptable for this program's task even though it might not be acceptable for other tasks. For our implicitly parallel programming experiment, we took all of the C programs in the SPEC CPU 2000 integer benchmark suite and modified 50 of the roughly 500,000 lines of code (LOC). Most of the modifications involved adding the commutative annotation. This annotation adds a little nondeterminism in the form of new legal orderings. As Figure 4 illustrates, by using this method, we restored the performance trend lost in 2004.

Do programmers need to be able to express nondeterminism (for example, selection from a set) to exploit parallelism in an application? The answer is yes. Figure 5 shows performance with and without those annotations—the effect of those 50 lines of code out of 500,000. Without the annotations, we can only talk about speedup in percentages, such as 10 or 50 percent—what one normally expects from a compiler. Using the annotations with today's compiler technology, we can talk about speedup in terms of multiples, such as 5 or 20 times.

Is a speculative execution model essential for exploiting parallelism? Yes, the compiler is speculating a lot, but we don't expose it to the programmer. Speculation management can and should be hidden from the programmer, just as speculative hardware has done for the last few decades.

And, really the final question is: Does anyone other than a few compiler writers and systems programmers need explicitly parallel programming? Obviously, compiler writers and system programmers have to do it. For now, they will have to deal with these issues. For everyone lucky enough not to be compiler writers or systems programmers, my answer is: No, you don't need explicitly parallel programming.

My opponent might argue, as he does in his paper, that the programmer must provide the means to support speculation rollback. Well, in our approach, we didn't need that. He might also argue, as he does in his paper, that a need for opportunistic parallelism exists.

Figure 3. Parallelization diagram: pseudocode with commutative annotation (a) and dynamic execution schedule (b). Since the value order returned by rand is irrelevant in the example given in Figure 2, we can relax the constraint on maintaining a single original sequential order of dependences between calls to rand. We do this by adding the commutative annotation to the sequential programming model that tells the compiler that this function can be invoked in any order. With this annotation, the compiler can now execute iterations in parallel.
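In the same illustrative style as before (again not the authors' exact code), the annotated version of the generator from the figure looks like this; C++ has no @Commutative annotation, so a comment stands in for the marker shown in the figure:

    static char *state;            // internal generator state, as in Figure 2
    char *f(char *s);              // assumed state-update function

    // @Commutative: any invocation order is legal, so long as there is an order.
    char *rand_gen() {
        state = f(state);
        return state;              // callers may now observe values out of order,
    }                              // so the compiler is free to run iterations in parallel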


As our results in Table 1 show, our implicitly parallel approach provides less than 2 times speedup on some programs. You might think that this is a failure of implicitly parallel programming and that falling back to explicitly parallel programming is the only way to speed up perl. That would be a disaster.

Figure 4. Restoration of performance trend lost in 2004. The compiler technology in this figure is the implicitly parallel approach developed by the Liberty Research Group, led by David August at Princeton University. (The figure plots SPEC CPU integer performance on a logarithmic scale from 1993 through 2012 for the CPU 92, CPU 95, CPU 2000, and CPU 2006 suites and for this technology.)

Figure 5. Performance with and without the implicitly parallel approach with annotations. Approximately 50 of 500,000 lines of code have been modified. (The figure plots speedup over single-threaded execution, on a logarithmic scale, for the SPEC CPU 2000 integer benchmarks 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, and 300.twolf, plus their geometric mean, comparing the existing framework with the framework plus annotations.)


Given more time, I could tell you how we can speed that up without falling back to explicitly parallel programming. This leads me to one final thought.

I have an ally in this debate in the form of the full employment theorem for compiler writers. It says, "Compilers and runtime systems can always be made better." Every time you see a tool fail to make good use of implicit parallelism, you can fix it. You can always make it good enough to avoid reverting back to explicit parallelism and all of its problems. The human effort that went into improving a single application's performance could have been better applied to improving the tools, likely improving the performance of many programs, existing and not yet written.

The case for explicitly parallel programming: Keshav Pingali

I believe that applications programmers do need to write explicitly parallel programs. I can summarize my thoughts in three points.

First, explicitly parallel programming is indeed more complex than implicitly parallel programming. I think we all agree about that. Unfortunately, I think it is a necessary evil for applications programmers, particularly for those who write what we call "irregular" programs (mainly programs that deal with very large pointer-based data structures, such as trees and graphs).

Second, exploiting nondeterminism is important for performance, and by that I don't mean race conditions. What I mean here is what experts call "don't-care nondeterminism." There are many programs that are allowed to produce multiple outputs, and any one of those outputs is acceptable. You want the system to figure out which output is the most efficient to produce. This kind of nondeterminism is extremely important. I think David pretty much agrees with that, so I won't spend too much time on it.

Finally, optimistic or speculative parallel execution is absolutely essential. This is a point that many people don't appreciate, even those in the parallel programming community. Parallelism in many programs depends on runtime values. And therefore, unless something is done at runtime, such as speculation, we really can't exploit parallelism in those kinds of programs.

Delaunay mesh refinement

I find it easiest to make arguments with concrete examples. A popular example in the community is Delaunay mesh refinement (DMR).3 The input to this algorithm is a triangulation of a region in the plane such as the mesh in Figure 6a. Some of the triangles in the mesh might be badly shaped according to certain shape criteria (the dark gray triangles in Figure 6a). If so, we use an iterative refinement procedure to eliminate them from the mesh.

Table 1. Maximum speedup achieved on up to 32 threads over single-threaded execution and minimum number of threads at which the maximum speedup occurred.

Benchmark       No. of threads    Speedup
164.gzip        32                29.91
175.vpr         15                 3.59
176.gcc         16                 5.06
181.mcf         32                 2.84
186.crafty      32                25.18
197.parser      32                24.5
253.perlbmk      5                 1.21
254.gap         10                 1.94
255.vortex      32                 4.92
256.bzip2       12                 6.72
300.twolf        8                 2.06
GeoMean         17                 5.54
ArithMean       20                 9.81

Figure 6. Delaunay mesh refinement example. Through an iterative refinement procedure, triangles within the neighborhood, or cavity, of a badly shaped triangle (dark gray triangles in (a)) are retriangulated, creating new triangles (light gray triangles in (b)).


In each step, the refinement procedure

- picks a bad triangle from the worklist,
- collects several triangles in the neighborhood (the cavity) of that bad triangle (dark gray regions in Figure 6a), and
- retriangulates the cavity, creating the light gray triangles in Figure 6b. If this retriangulation creates new badly shaped triangles, they are added to the worklist.
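The refinement loop just described can be sketched sequentially as follows. The Mesh, Triangle, and Cavity interfaces are assumed for illustration only; they are not the actual DMR or Galois classes, and a parallel version would also need the conflict detection and rollback discussed below.

    #include <deque>
    #include <vector>

    struct Triangle { bool isBad() const; };          // shape test (assumed interface)
    struct Cavity   { /* triangles around one bad triangle */ };
    struct Mesh {
        bool contains(const Triangle *t) const;       // the triangle may already be gone
        Cavity collectCavity(Triangle *bad);          // gather the neighborhood
        std::vector<Triangle *> retriangulate(const Cavity &c);
    };

    void refine(Mesh &mesh, std::deque<Triangle *> &worklist) {
        while (!worklist.empty()) {
            Triangle *bad = worklist.front();         // pick a bad triangle
            worklist.pop_front();
            if (!mesh.contains(bad)) continue;        // an earlier retriangulation may
                                                      // have removed it already
            Cavity cavity = mesh.collectCavity(bad);
            for (Triangle *t : mesh.retriangulate(cavity))
                if (t->isBad())                       // newly created bad triangles
                    worklist.push_back(t);            // re-enter the worklist
        }
    }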

There is room for nondeterminism in this algorithm because the final mesh's shape depends on the order in which we process the bad triangles. But we can show that every processing order terminates and produces a mesh without badly shaped triangles. The fact that any one of these output meshes can be produced is important for performance.

Another important point is that parallelism in this application depends on runtime values. Whether bad triangles can be refined concurrently depends on whether their cavities overlap. Because dependences between computations on different bad triangles depend on the input mesh and on the modifications made to the mesh at runtime, the parallelism can't be exposed by compile-time program analyses such as points-to or shape analysis.4,5 To exploit this "amorphous data-parallelism," as we call it, it's necessary in general to use speculative or optimistic parallel execution.1 Different threads process worklist items concurrently, but to ensure that the sequential program's semantics are respected, the runtime system detects dependence violations between concurrent computations and rolls back conflicting computations as needed. (DMR can be executed in parallel without speculation by repeatedly building interference graphs and finding maximal independent sets, and so on, but this doesn't work in general.)
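As a rough illustration of the optimistic execution being described (a minimal sketch of my own, not the Galois runtime): each iteration tries to claim every element it touches, a failed claim is treated as a detected conflict, and the iteration then releases what it holds and is retried later. A real system must also undo any partial updates when it rolls back.

    #include <atomic>
    #include <vector>

    struct Element { std::atomic<bool> owned{false}; };   // e.g., a triangle in a cavity

    // Returns true if the item was processed, false if it conflicted and must be retried.
    bool try_process(const std::vector<Element *> &neighborhood) {
        std::vector<Element *> acquired;
        for (Element *e : neighborhood) {
            if (e->owned.exchange(true)) {                 // already claimed: conflict detected
                for (Element *a : acquired)                // roll back: release everything
                    a->owned.store(false);                 // claimed so far
                return false;
            }
            acquired.push_back(e);
        }
        /* ... apply the update to the neighborhood here ... */
        for (Element *a : acquired) a->owned.store(false); // commit and release
        return true;
    }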

Framework

David and I disagree about what to call these kinds of applications. The framework in Figure 7 is useful for thinking about the implicitly and explicitly parallel programming models.

Let's go back to David's SQL programming example. Database programming, in my opinion, is a great example of an area where implicitly parallel programming succeeded. But it's important to remember that there are two kinds of programs and two kinds of programmers. Applications programmers write SQL programs that are implicitly parallel. So, I accept that SQL programming has been a success story for implicitly parallel programming. However, application programmers rely on implementations of relations, such as B-trees, which have been carefully coded in parallel by a small number of systems programmers. The question is whether we can generalize this to general-purpose parallel programs. I think about it in terms of Niklaus Wirth's equation: program = algorithm + data structure. We need to think about an application in terms of both the algorithm, which the applications programmer codes using an abstract data type; and the concrete data structure that implements that abstract data type. In database programming, algorithms are indeed being expressed with an implicitly parallel language such as SQL. However, the data structures are being implemented with an explicitly parallel language such as C++ or Java.

Figure 7. A framework for understanding the implicitly and explicitly parallel programming models. Applications programmers write SQL programs, which are implicitly parallel, and they rely on implementations of relations, such as B-trees, which systems programmers have carefully coded in parallel.


Can we make this software development model general purpose? We have had some success doing so; however, we've also run into some stumbling blocks, which make us more skeptical about this particular model.

A programming model example and the need for explicitly parallel programming

A recent effort on the parallel execution of irregular programs (which are organized around pointer-based data structures such as trees and graphs) has led to the development of the Galois programming model.1,6,7 This programming model is sequential and object-oriented (sequential Java). Applications programmers write algorithms using Galois set iterators, which essentially give them nondeterministic choices, sets, and so on. They then rely on the equivalent of a collections library, in which all the data structures, such as graphs and trees, are implemented. An applications programmer can easily write an algorithm, such as DMR, with this sequential language. However, they're relying on a collections library that someone else implemented using explicitly parallel programming with locks and so on. Is this good enough? In other words, can we ship a collections library of data structures that have been implemented with explicitly parallel programming that applications programmers can then use in their sequential object-oriented models and thus avoid having to write any concurrent programs at all? Unfortunately, here we run into three fundamental problems, which make me skeptical as to whether this programming model will work.
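To make the division of labor concrete, here is an illustrative rendering (not the actual Galois API, which is a Java library) of the kind of set iterator being described: the applications programmer writes an ordinary-looking loop over an unordered worklist, and everything beneath the iterator (locking, conflict detection, rollback) belongs to the library and runtime. The stand-in below simply runs sequentially.

    #include <deque>
    #include <functional>

    // Hypothetical stand-in for a Galois-style unordered set iterator. A real
    // runtime would pull items in parallel, speculate, and retry conflicting
    // iterations; this version just runs them one at a time.
    template <typename T>
    void for_each_unordered(std::deque<T> &worklist,
                            const std::function<void(T, std::deque<T> &)> &body) {
        while (!worklist.empty()) {
            T item = worklist.front();
            worklist.pop_front();
            body(item, worklist);      // the body may push newly discovered work
        }
    }

With such an iterator, the DMR loop sketched earlier reduces to a single call that passes the per-triangle body as a function, and none of the concurrency machinery appears in the application code.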

The first problem is that no current fixed set of data structures meets all applications' needs. For example, the Java collections library contains many concurrent data structures, but there are many other data structures that aren't in there, such as kd-trees and suffix trees. As library writers interact with applications programmers, they find more of these data structures. They then have to code concurrent versions using explicitly parallel languages. Just to give you an idea of how many data structures there are, a Google search on "handbook of data structures" returns a great book by Sartaj Sahni.8 It's 1,392 pages and lists hundreds of data structures. What that means is that there's a large variety of data structures that we must code in terms of some explicitly parallel programming language. We might then be able to use the Galois programming model on top to express the algorithm. If the data structures that we need to implement the algorithm aren't supported in the library, however, they must still be coded in explicitly parallel programming languages.

Second, even generic data structures must be tuned for a particular application. For example, if you need a graph, you find it in the library and use it for your program. However, if your application has some properties that you must exploit to get good performance, you have no option but to start mucking around with the data structure implementation, which is of course in an explicitly parallel language.

The third problem is tied to the first two. In many applications, the API functions of the various classes (graph, tree, data structure, and so on) do relatively little computation. In databases, this model works well because most of the execution time is spent in library functions that system programmers have carefully implemented in explicitly parallel languages. Hence, there's no need to worry too much about the top-level program. However, many general-purpose applications don't spend a lot of time on data structure operations. Therefore, to get really good performance, you need to squish the application code and data structure implementation together in some way. Unfortunately, we don't currently have the compiler technology to generate good code of that kind from a separate specification of the algorithm and the data structure.

We run into these three issues when we try to shield applications programmers from the complexities of explicitly parallel programming. Are these fundamental problems? If they are, you have no choice but to write some amount of explicitly parallel code even if you're an applications programmer. I see no solutions for these problems right now. So, I'm putting these issues up for debate and will see if my opponent has a viewpoint on them.


Parallel libraries and automatic parallelization

August: There's a lot of talk about solving the parallel programming model problem through the use of libraries. It occurs to me, especially after hearing that one book lists numerous data structures, that there are more libraries than programs. If that's so, maybe the problem is not how to write parallel programs, but how to write parallel libraries. So, we're back to where we started. We need a solution that works for writing libraries and applications.

Pingali: I don't think that's going to solve anything. I agree that explicitly parallel programming is difficult. Therefore, we want to minimize the amount of explicitly parallel code we write. However, if applications programmers want really good performance, they might need to dig into data structure implementations, whether or not thousands of them exist. No technology that I know of lets you write these optimized data structure implementations for particular applications. For example, I don't know of any automatic way of taking some implicitly parallel descriptions of data structures, such as kd-trees and suffix trees, and producing optimized parallel implementations. Currently, the only way is to write explicitly parallel code. Transactional memory might make programming those kinds of data structures easier, but ultimately you don't get the performance you want. And, even for the transactions, there's some notion of explicit parallelism.

August: If we were to explicitly parallelize a large number of libraries, we would start to see some patterns, and if we start to see patterns, the effort might be better spent automating those patterns by building them into a compiler.

In the past, when compilers were the 15-percent solution, the economics weren't there for many of us to get involved in writing automatic parallelization techniques or transforms. But, now that we have a choice between relying on some kind of system or turning to the recurring expense of writing explicitly parallel programs, the economics might be there to take that approach. What do you think?

Pingali: That's an interesting idea, David. I would be happy about this because I don't particularly like explicitly parallel programming, either. However, almost all of the compiler technology developed over the past 40 to 50 years has been targeted toward dense matrix computation. We know very little about how to analyze irregular pointer-based data structures (such as large trees and graphs). Some work has been done, but it hasn't gone anywhere, in my opinion. So, I've become somewhat skeptical about the possibility of developing compiler technology for generating a parallel implementation for these data structures from high-level specifications. But it's an interesting goal.

August: Perhaps we should give up the idea that we can fully understand code statically, an idea that I think has misdirected effort in compiler development for decades. Let's refer to the poor performer benchmark, perl, in our implicitly parallel program experiment. In perl, the parallelism is fully encoded in the input set. The perl script has parallelism in it, but, most likely, the input defines where the parallelism is. Therefore, no static analysis can tell you what the input to that program is or where the parallelism will be because they haven't yet been determined. Therefore, maybe we should think about this as the fundamental problem and move to a model where we look at the code and do transformations at runtime or generationally. We should abandon doing things like shape analysis and heroic compiler analyses in general.

Where should it be done in the software stack?

Audience: I never thought that parallelism would be a problem, at least for many applications. The problem arises in actually running these programs effectively on a machine. A lot of that has to do with locality. For example, where to get data, where to get the structures, how quickly they're obtained, and how much bandwidth is obtained. Unfortunately, I haven't seen anything in the discussion so far to address that. I was just wondering first of all whether your systems work.


That's what was so good about an API. It's terrible that you have to do everything yourself, but it really works when you're sure things are in the right place. What are your thoughts on this?

August: From the implicitly parallel programming model viewpoint, we're going to rely on a dynamic component—a layer between the program and the multicore processor—to do the tuning. There has been some success in the iterative compilation world. Compilers don't know the right thing to do unless they can observe the effects of their transformations. Once they have the chance to do so, they can make increasingly better decisions. The goal there is to reduce the burden on the applications programmers to do that kind of tuning.

Arvind: Just for clarification, are you talking about an interactive system?

August: Because of the problem that I mentioned with the perl benchmark, we need to deal with the program after it starts running. We want to observe dependences on values, to speculate upon them, and to tune. We also want to understand if there are other things happening dynamically in the system, such as events that might reduce the effective core count. We'd like to adjust for such events as well. So, this is a runtime system.

Arvind: Is this for future program runs?

August: It could be the same run. We have extra cores, so we can expend some effort adjusting the program. Some of the most successful models are those in which the static compiler only partially encodes the program and leaves the rest to be adjusted at runtime. It isn't a dynamic optimizer as such since the program isn't fully defined. Instead, the program can be parameterized by feedback from the performance at runtime. So, it's a hybrid between a compiler running statically and purely runtime optimization.

Pingali: One mistake we keep making in the programming languages and architecture community, or the systems community more broadly, is thinking about programmers as monolithic because we're used to the idea of SPEC benchmarks. So, there is one big piece of code and someone has to put something in that C (or other language) code to exploit parallelism. We need to get away from this kind of thinking because domains in which people have successfully exploited parallelism have two classes of programmers. There are "Joe" programmers—applications programmers with little knowledge about parallel programming although they're domain experts. And, there are "Steve" or "Stephanie" programmers—highly trained parallel programming experts who know a lot about the machine, but little about a particular application domain. These programmers write the middle layers.

In the database domain, for example, Joe programmers write SQL code. They don't know a lot about concurrency control and all that other stuff, but they get the job done. There are a small number of people at IBM, Oracle, and other corporations who implement, for example, WebSphere carefully and painfully in C++ or Java. Therefore, clearly we need to manage resources and locality to get good performance. The real question we should be asking is where in the software stack this management should take place. If there's any hope for bringing parallelism into the mainstream and for having millions of people write programs, we can't leave resource and locality management to applications programmers. Rather, it must be done at a lower level, even if it's less efficient than doing it with some knowledge of the application. So, I don't think we have any choice. We shouldn't be considering whether to do it, but rather, where in the software stack to do it and how to make sure that as much as possible is done lower in the stack. Otherwise, multicores will never go mainstream.

August: That sounds like an argument for implicit programming.

Pingali: We should do as much implicit programming as we can at the Joe programmer level. There is no debate about that. The question is whether we can get away from explicitly parallel programming altogether, and my position is no.


Are libraries really the solution?

Audience: Professor Pingali, you were talking about how people always handcraft libraries. Let's talk about the C++ Standard Template Library. In many software companies today, if you handcraft data structures or operations that can't be expressed using STL, you're fired.

Pingali: We still haven't dealt with parallelism in regard to STL. So, can we build a parallel implementation of STL that covers such a broad application scope that companies can really start firing people the moment they say they're getting terrible performance for a particular application and they need to code something at the STL level using explicitly parallel programming? I don't think we're there yet. I don't know if we'll ever get there for reasons I already mentioned.

Audience: The flip side of this is that STL consists not only of data structure manipulations, linked lists, and so on, but also of math applications such as those applying an operation in mathematical parallel. Almost no one uses math operations in the STL library. Professor August, can your compiler convert some of the code used in this area by employing the math implementation?

August: Let's take an example of an iterator traversing a linked list. We wish that a programmer had used the array library. We also wish that the access order didn't matter—random access to the array elements would be nice, especially in the DoAll style of parallelism. But that doesn't stop us from finding parallelism. Perhaps we can't think of a more serial operation than the traversal of a linked list. But, there are techniques that can transform serial data structures, such as a linked list, into other data structures that enable parallelism, such as arrays or skip lists specific to particular traversals. There are some automatic techniques to deal with these kinds of problems.

Paradigm shift and education problem

Audience: To roughly characterize the two arguments, both of you are saying that explicit programming is hard. One of you argues that it's so hard we shouldn't use it. The other argues that it's hard but we have to do it. I'm curious about the quote on the Unreal engine that David mentioned. How difficult was developing or building the parallel version of that engine for the first time? How much of this is really a transient effect? How much extra programmer time and cost did it entail the next time they tried to build an engine like that? Is this really an education problem because it's a big paradigm shift from single-threaded to multithreaded code?

August: Maybe one thing that will happen is that programmers will change. Perhaps there won't be any Joe programmers anymore. That's possible, but I doubt it. The other thing that we know will change is the number of cores. We're talking about 4, 6, and 8 now. What will happen when we have 32, 64, or more? What about thousands? The multithreaded version of the Unreal engine, which was too hard to build the first time, needs to be built again. The first version didn't explore speculation; however, now you have to do speculation because we need much better speedups than only 3 times. It seems like just as you solved the problem, the microprocessor manufacturers made the problem harder. So, we're continually chasing the rabbit, and there must be a better way.

Audience: Well, that implies that eventually you must do explicit programming. Unless you're saying that your implicit infrastructure can do a better job?

August: Yes, that’s my claim.

Pingali: The way I see it, the fewer opportunities for making errors and bugs, the better off Joe programmers are. Perhaps with enough training, we can even get them to write reasonably good concurrent code. But the point is, it's almost always better if you don't have to worry about these concurrency bugs, even if you're a Steve or Stephanie programmer. So, what we should do is try to move concurrency management somewhere in the software stack, where it can be dealt with in a controlled way as opposed to just letting it out and saying everyone must become a Stephanie programmer.


I don't think that's going to happen because it hasn't happened in the few domains, such as dense numerical linear algebra, that have successfully harnessed parallelism. In the dense numerical algebra domain, a small number of people write matrix factorization libraries, and a large number of Matlab programmers who simply write Matlab use these libraries. If you start looking at the library implementations, such as the matrix factorization code and matrix multiplication, the code is somewhat horrific. To have everyone write their own explicitly parallel code isn't a scalable model of software development.

Implicit versus explicit

Audience: David, you have annotations that you need to add to your programs. Where do you draw the line between implicit and explicit?

August: If the annotation brings information that can't be determined otherwise, it's implicitly parallel. If a compiler or runtime system can determine the existence of a dependence automatically, annotations confirming its existence or indicating what to do about it are explicit. Consider the situation where you can give the user the numbers from a pseudorandom number generator in any order, changing the program's output. If the programmer states that this is legal, that statement is implicit because it's otherwise indeterminable. The compiler can't know whether the user needs the numbers in the original order, because sometimes the order is important and sometimes it isn't, depending on the real-world task. To me, that isn't only an implicitly parallel annotation, but it's a good language annotation since it isn't automatically determinable.

How should parallel programming be introduced in a computer science and engineering curriculum?

Arvind: When I was an assistant professor, I went to the department head and said, I have to teach this course, it's very important. He asked how important; I said very important. Then, he said, why don't you teach it at the freshmen level. I said no, no, it isn't that important. So, the question really is how important is parallel programming? Is it separate from sequential programming? And within that context, how important is explicitly parallel programming? Should we teach it at the freshmen level, or is it like quantum mechanics, where you must master classical mechanics before you can touch quantum mechanics? I would like both of you to express your opinion on this.

August: It's as important as computer architecture. So, it should be taught at the junior/senior level.

Pingali: I'd like to go back to Niklaus Wirth's equation: program = algorithm + data structure. We need to teach algorithms and data structures first. When teaching students sequential algorithms and data structures, we can also teach them parallelism in algorithms, which we can do without talking about a particular programming model or implementation of locks, transactional memory, synchronization, and so on. We can talk about locality, parallelism, and algorithms without talking about all of those other things. After that, we should introduce implicitly parallel programming at the Joe programmer level. That is, Joe programmers write these Joe programs using somebody's data structures in libraries, which are painfully written, like a collection of classes written over many years. This way, they get a feel of what they can get out of parallel programming. Then, you take the covers off and tell them what's in these data structure libraries, and expose them to their implementations. That's the sequence that we have to go through. I would start with data structures and algorithms at the abstract level. Then, show them implicit parallel programming at the Joe programmer level using parallel libraries. Then, expose them to its implementation.

References

1. M. Kulkarni et al., "Optimistic Parallelism Requires Abstractions," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI 07), ACM Press, 2007, pp. 211-222.
2. M. Bridges et al., "Revisiting the Sequential Programming Model for Multi-Core," Proc. 40th Ann. IEEE/ACM Int'l Symp. Microarchitecture, IEEE CS Press, 2007, pp. 69-84.
3. L. Paul Chew, "Guaranteed-Quality Mesh Generation for Curved Surfaces," Proc. 9th Ann. Symp. Computational Geometry (SCG 93), ACM Press, 1993, pp. 274-280.
4. L. Hendren and A. Nicolau, "Parallelizing Programs with Recursive Data Structures," IEEE Trans. Parallel and Distributed Systems, vol. 1, no. 1, Jan. 1990, pp. 35-47.
5. M. Sagiv, T. Reps, and R. Wilhelm, "Solving Shape-Analysis Problems in Languages with Destructive Updating," ACM Trans. Programming Languages and Systems, vol. 20, no. 1, Jan. 1998, pp. 1-50.
6. M. Kulkarni et al., "Optimistic Parallelism Benefits from Data Partitioning," Proc. Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 08), ACM Press, 2008, pp. 233-243.
7. M. Kulkarni et al., "Scheduling Strategies for Optimistic Parallel Execution of Irregular Programs," Proc. Symp. Parallelism in Algorithms and Architectures (SPAA 08), ACM Press, 2008, pp. 217-228.
8. D. Mehta, ed., Handbook of Data Structures and Applications, Chapman and Hall, 2004.

Arvind is the Johnson Professor of Computer Science and Engineering at the Massachusetts Institute of Technology. His current research focus is on enabling rapid development of embedded systems. He is coauthor of Implicit Parallel Programming in pH. Arvind has a PhD in computer science from the University of Minnesota. He is a Fellow of both IEEE and ACM, and a member of the National Academy of Engineering.

David I. August is an associate professor in the Department of Computer Science at Princeton University, where he directs the Liberty Research Group, which aims to solve the multicore problem. August has a PhD in electrical engineering from the University of Illinois at Urbana-Champaign. He is a member of IEEE and ACM.

Keshav Pingali is a professor in the Department of Computer Science at the University of Texas at Austin, where he holds the W.A. "Tex" Moncrief Chair of Grid and Distributed Computing in the Institute for Computational Engineering and Science. His research interests include programming languages, compilers, and high-performance computing. Pingali has an ScD in electrical engineering from the Massachusetts Institute of Technology. He is a Fellow of IEEE and of the American Association for the Advancement of Science (AAAS).

Resit Sendag is an associate professor in the Department of Electrical, Computer, and Biomedical Engineering at the University of Rhode Island, Kingston. His research interests include high-performance computer architecture, memory systems performance issues, and parallel computing. Sendag has a PhD in electrical and computer engineering from the University of Minnesota, Minneapolis. He is a member of IEEE and the IEEE Computer Society.

Derek Chiou is an assistant professor in the Electrical and Computer Engineering Department at the University of Texas at Austin. His research interests include computer system simulation, computer architecture, parallel computer architecture, and Internet router architecture. Chiou has a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology. He is a senior member of IEEE and a member of ACM.

Joshua J. Yi is a first-year JD student at the University of Texas School of Law. His research interests include high-performance computer architecture, performance methodology, and deep submicron effects. Yi has a PhD in electrical engineering from the University of Minnesota, Minneapolis. He is a member of IEEE and the IEEE Computer Society.

Direct questions and comments to Arvind at the Stata Center, MIT, 32 Vassar St., 32-G866, Cambridge, MA 02139; [email protected].
