
HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs

Simone Campanoni Kevin Brownell Svilen Kanev Timothy M. Jones+ Gu-Yeon Wei David Brooks

Harvard University University of Cambridge+

Cambridge, MA, USA Cambridge, UK

Abstract

Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.

1. Introduction

In today's multicore era, program performance largely depends on the amount of thread level parallelism (TLP) available. While some computing problems often translate to either inherently parallel or easy-to-parallelize numerical programs, sequentially designed, non-numerical programs with complicated control (e.g., execution path) and data flow (e.g., aliasing) are much more common, but difficult to analyze precisely. These non-numerical programs are the focus of this paper. While conventional wisdom is that non-numerical programs cannot make good use of multiple cores, research in the last decade has made steady progress towards extracting TLP from complex, sequentially-designed programs such as the integer benchmarks from SPEC CPU suites [3, 6, 27, 45, 48]. To further extend this body of research, this paper presents lightweight architectural enhancements for fast inter-core communication in order to support advances in a custom compiler framework that parallelizes loop iterations across multiple cores within a chip multiprocessor.

Performance gains sought by parallelizing loop iterations of non-numerical programs depend on two key factors: (i) accuracy of the data dependence analysis and (ii) speed of communication provided by the underlying computer architecture to satisfy the dependences. Unfortunately, complex control and data flow in non-numerical programs—both exacerbated by ambiguous pointers and ambiguous indirect calls—make accurate data dependence analysis difficult. In addition to actual dependences that require communication between cores, a compiler must conservatively handle apparent dependences never realized at runtime. While thread level speculation (TLS) avoids the need for accurate data dependence analysis by speculating that some apparent dependences are not realized [29, 38, 39], TLS suffers overheads to support misspeculation and must therefore target relatively large loops to amortize penalties.

In contrast to existing parallelization solutions, we propose an alternate strategy that instead targets small loops, which are much easier to analyze via state-of-the-art control and data flow analysis, significantly improving accuracy. Furthermore, this ease of analysis enables transformations that simply re-compute shared variables in order to remove a large fraction of actual dependences. This strategy increases TLP and reduces core-to-core communication. Such optimizations do not readily translate to TLS because the complexity of TLS-targeted code typically spans multiple procedures in larger loops. Finally, our data shows parallelizing small hot loops yields high program coverage and produces meaningful speedups for the non-numerical programs in the SPEC CPU2000 suite.

Targeting small loops presents its own set of challenges. Even after extensive code analysis and optimizations, small hot loops will retain actual dependences, typically to share dynamically allocated data. Moreover, since loop iterations of small loops tend to be short in duration (less than 25 clock cycles on average), they require frequent, memory-mediated communication. Attempting to run these iterations in parallel demands low-latency core-to-core communication for memory traffic, something not available in commodity multicore processors.

To meet these demands, we present HELIX-RC, a co-designed architecture-compiler parallelization framework for chip multiprocessors. The compiler identifies what data must be shared between cores and the architecture proactively circulates this data along with synchronization signals among cores. Rather than waiting for a request, this proactive communication immediately circulates shared data, as early as possible—decoupling communication from computation. HELIX-RC builds on the HCCv1 compiler, developed for the first iteration of HELIX [6, 7], that automatically generates parallel code for commodity multicore processors. Because performance improvements from HCCv1 saturate at four cores due to communication latency, we propose ring cache—an architectural enhancement that facilitates low-latency core-to-core communication to satisfy inter-thread memory dependences—which relies on guarantees provided by the co-designed HCCv3 compiler to keep it lightweight.

HELIX-RC automatically parallelizes non-numerical programs with unmatched performance improvements. Across a range of SPEC CINT2000 benchmarks, decoupling communication enables a three-fold improvement in performance when compared


to HCCv1, on a simulated multicore processor consisting of 16 Atom-like, in-order cores with a ring cache with 1KB per node of memory (32× smaller than the L1 data cache). The proposed system offers an average speedup of 6.85× when compared to running un-parallelized code on a single core. Detailed evaluations show that even with a conservative ring cache configuration, HELIX-RC is able to achieve 95% of the speedup possible with unlimited resources (i.e., unbounded bandwidth, instantaneous inter-core communication, and unconstrained size). Moreover, simulations for a HELIX-RC system comprising 16 out-of-order cores show 3.8× performance speedup for the same set of non-numerical programs. This result confirms HELIX-RC's ability to extract TLP on top of the instruction level parallelism (ILP) provided by an out-of-order processor.

The remainder of this paper further describes the motivation for and results of implementing HELIX-RC. We first review the limitations of compiler-only improvements and identify co-design opportunities to improve TLP of loop iterations. Next, we explore the speedups obtained by decoupling communication from computation with compiler support. After describing the overall HELIX-RC approach, we delve deeper into both the compiler and the hardware enhancement. Finally, we use a detailed simulation framework to evaluate the performance of HELIX-RC and analyze its sensitivity to architectural parameters.

2. Background and Opportunities

2.1. Limits of compiler-only improvements

To understand what limits the performance of parallel code extracted from non-numerical programs, we started with HCCv1 [6, 7], a state-of-the-art parallelizing compiler.

HCCv1. This first generation compiler automatically generates parallel threads from sequential programs by distributing successive loop iterations across adjacent cores within a single multicore processor, similar to conventional DOACROSS parallelism [10]. Since there are data dependences between loop iterations (i.e., loop-carried dependences), some segments of a loop's body—called sequential segments—must execute in iteration order on the separate cores to preserve the semantics of sequential code. Synchronization operations mark the beginning and end of each sequential segment.
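
To make the distinction concrete, the following minimal C sketch (our own illustration, not code from the paper or its benchmarks; all names are invented) shows a loop whose body splits into parallel code and a sequential segment:

    /* A loop with one (possible) loop-carried dependence: the histogram
       update must execute in iteration order, so it would become a
       sequential segment, while transform() is independent parallel code. */
    static int transform(int x) { return (x * 2654435761u) & 0xFF; }

    void process(const int *in, int *histogram, int n) {
        for (int i = 0; i < n; i++) {
            int bucket = transform(in[i]);   /* parallel code               */
            histogram[bucket]++;             /* sequential segment: shared  */
        }                                    /* memory, ordered updates     */
    }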

HCCv1 includes a large set of code optimizations (e.g., code scheduling, method inlining, loop unrolling), most of which are specifically tuned to extract TLP. Despite this, performance improvements obtained by the original HCCv1 compiler saturate at four cores, due to high core-to-core communication latency.

HCCv2. We first improved code analysis and transformation. Specifically, we increased the accuracy of both data dependence and induction variable analysis, and we added other transformations to extract more parallelism (e.g., scalar expansion, scalar renaming, parallel reductions, and loop splitting [1]). We call this improved compiler HCCv2.

Figure 1 compares speedups for HCCv1 and HCCv2 based on simulations of parallel code generated by each when targeting a 16-core processor with an optimistic 10-cycle core-to-core communication latency.1

Figure 1: Improving the HCCv1 compiler alone does not improve performance for SPEC CINT2000 benchmarks. (Program speedup for numerical and non-numerical programs, HCCv1 vs. HCCv2.)

The engineering improvements of HCCv2 significantly raised speedups for numerical programs (SPEC CFP2000) over HCCv1 from 2.4× to 11×. HCCv2 successfully parallelized the numerical programs because the accuracy of the data dependence analysis is high for loops at almost any level of the loop nesting hierarchy. Furthermore, the improved compiler removed the remaining actual dependences among registers (e.g., via parallel reduction) to generate loops with long iterations that can run in parallel on different cores.

Unfortunately, non-numerical programs (SPEC CINT) are not as amenable to compiler improvements and saw little to no benefit from HCCv2. Because core-to-core communication in conventional systems is expensive, the compiler must parallelize large loops (the larger the loop with loop-carried dependences, the less frequently cores synchronize), which limits the accuracy of dependence analysis and thereby limits TLP extraction. This is why HELIX-RC focuses on small (hot) loops to parallelize this class of programs. Our hypothesis is that modest architectural enhancements, co-designed with a compiler that targets small loops, can successfully parallelize non-numerical programs.

2.2. Opportunity

There is an opportunity to aggressively parallelize non-numerical programs based on the following insights: (i) small loops are easier to analyze with high accuracy; (ii) predictable computation means most of the required communication updates shared memory locations; (iii) we can efficiently satisfy communication demands of actual dependences for small loops with low-latency, core-to-core communication; and (iv) proactive communication efficiently hides communication latencies.

Accurate data dependence analysis is possible for small loops in non-numerical programs. The accuracy of data dependence analysis increases for smaller loops because (i) there is less code—therefore less complexity—to analyze and (ii) the number of possible aliases of a pointer in the code scales down with code size. In other words, we can avoid conservative pointer aliasing assumptions that lower accuracy for large loops.

To evaluate the accuracy of data dependence analysis for small loops using modern compilers, we started with a state-of-the-art analysis called VLLPA [13].

1 Details of this experiment are in Section 6.

Figure 2: Accuracy of data dependence analysis for small hot loops in SPEC CINT2000 benchmarks (VLLPA baseline: 48%; with flow-sensitive, path-based, data-type, and library-call extensions: 81%).

Figure 2 shows the starting accuracy (i.e., average number of actual data dependences compared to all dependences identified for our set of loops) of this analysis is 48%. To improve accuracy, we extended VLLPA (i) to be fully flow sensitive [8], which tracks values of both registers and memory locations according to their position in the code; (ii) to be path-based, which names runtime locations by how they are accessed from program variables [11]; (iii) to exploit data type and type casting information to conservatively eliminate incompatible aliases; and (iv) to exploit standard library call semantics. Figure 2 shows that these extensions increase accuracy for small loops to 81%. As a result, most of the loop-carried data dependences identified by the compiler are actual and therefore require core-to-core communication.

Most required communication is to update shared memory locations. Sharing data among loop iterations requires core-to-core communication to propagate new values when loop iterations run on different cores. However, if new values are predictable (e.g., incrementing a shared variable at every iteration), communication can be avoided. We extended the variable analysis in HCCv1 to capture the following predictable variables: (i) induction variables where the update function is a polynomial up to the second order; (ii) accumulative, maximum, and minimum variables; (iii) variables set but not used until after the loop; and (iv) variables set in every iteration even when the updated value is not constant.2 If a variable falls into any of these categories, each core can re-compute its correct value independently.
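
As a concrete illustration of such re-computation (our own sketch, not HCCv3 output; the round-robin iteration assignment and helper names are assumptions), an induction variable can be rebuilt per core in closed form instead of being forwarded between cores:

    /* Each core runs iterations first, first+num_cores, first+2*num_cores, ...
       and re-derives the induction variable 'idx' locally, so no value has to
       be communicated between cores for it. */
    void run_iterations(long first, long n, long num_cores,
                        long base, long stride, void (*work)(long)) {
        for (long i = first; i < n; i += num_cores) {
            long idx = base + i * stride;   /* closed-form re-computation */
            work(idx);
        }
    }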

Exploiting the predictability of variables, again for small loops in non-numerical programs, allows the compiler to remove a large fraction of the communication required to share registers. Figure 3 compares a naive solution that propagates new values for all loop-carried data dependences (100%) versus a solution that exploits variable predictability. By re-computing variables, the majority of the remaining communication is to share memory locations rather than registers.

Communication for small hot loops must be fast. While the simplicity of small loops allows for easy analysis, small loops have short iterations—typically less than 100 clock cycles long. However, because these short iterations require (at least) some communication to run in parallel, efficient parallel execution demands a low-latency core-to-core communication mechanism.

To better understand this need for fast communication, Figure 4a plots a cumulative distribution of average iteration execution times on a single Atom-like core (described in Section 6) for the set of hot loops from SPEC CINT2000 benchmarks chosen for parallelization by HELIX-RC.

2 This is an example of code replication. Details about this transformation are outside the scope of this paper.

Figure 3: Predictability of variables reduces register communication.

The shaded portion of the plot shows that more than half of the loop iterations complete within 25 clock cycles. The plot also delineates the measured core-to-core round trip communication latencies for three modern multicore processors. Even for the shortest-latency machine, Ivy Bridge, 75 cycles is much too long for the majority of these short loops. Of course, a conventional region-extending transformation such as loop unrolling could lengthen the duration of these inner loops, but this would also increase the lengths of sequential segments, reducing exploitable parallelism.

Proactive communication achieves low latency by decoupling communication from computation. A compiler must conservatively assume that dependences exist between all iterations for most loop-carried dependences in non-numerical programs. Because of the complexity of control and data flow in such programs, a compiler cannot easily infer the distance between a loop iteration that generates data and the ones that consume it. For conventional synchronization approaches [6, 25, 26, 43, 47, 48], this assumption of dependences between all subsequent iterations leads to sequential chains that severely limit the performance sought by running loop iterations in parallel.3

These sequential chains, which include both communication and computation, have two sources of inefficiency. First, adjacent-core synchronization often turns out to not be necessary for every link of these chains. Second, when data forwarding is initiated lazily (at request time), it blocks computation while waiting for data transfers between cores.

Finally, for loops parallelized by HELIX-RC, most communication is not between successive loop iterations. Hence, because HELIX-RC distributes successive loop iterations to adjacent cores, most communication is not between adjacent cores. Figure 4b charts the distribution of undirected distances between data-producing cores and the first consumer core on a platform with 16 cores organized in a ring. Only 15% of those transfers are between adjacent cores. Moreover, Figure 4c shows that most of the shared values (86%) from these loops are consumed by multiple cores. Since consumers of shared values are not known at compile time, HELIX-RC implements a mechanism that proactively broadcasts data and signals to all other cores. Such proactive communication, which does not block computation, is the cornerstone of the HELIX-RC approach.

3. The HELIX-RC Solution

The goal of HELIX-RC is to decouple all communication required to efficiently run iterations of small hot loops in parallel. This is realized by decoupling value forwarding from value generation and by decoupling signal transmission from synchronization. We now show how HELIX-RC achieves such decoupling.

3 Others have called this chain a critical forwarding path [40, 48].

Figure 4: Small hot loops have short iterations that send data over multiple hops and to multiple cores. (a) Short loop iterations: cumulative distribution of loop iteration execution time in cycles, with measured cache coherence latencies for Atom, Ivy Bridge, and Nehalem. (b) Distance: hops between the producing core and the first consumer. (c) Consumers: number of consumer cores per shared value.

3.1. Approach

HELIX-RC is a co-design of its compiler (HCCv3) and architectural enhancements (the ring cache). HCCv3 distinguishes parallel code (i.e., code outside any sequential segment) from sequential code (i.e., code within sequential segments) by using two instructions that extend the instruction set. The ring cache is a ring network that connects together ring nodes attached to each core in the processor to operate during sequential segments as a distributed first-level cache that precedes the private L1 cache. Because it relies on compiler-guaranteed properties of the code, the hardware support can be simple and efficient. The next paragraphs summarize the main components of HELIX-RC.

ISA. We introduce a pair of instructions—wait and signal—that mark the beginning and end of a sequential segment. Each of these instructions has an integer value as a parameter that identifies the particular sequential segment. The wait instruction blocks execution of the core that issued it (e.g., wait 3) until all other cores have finished executing the corresponding sequential segment, which they signify by executing the appropriate signal instruction (e.g., signal 3).
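
The guarantee these instructions provide can be approximated in software. The C11 sketch below is our own emulation, not the hardware mechanism; it assumes the conventional in-order chaining described in Section 3.2 (each iteration executes its wait before its signal), and all names are invented.

    #include <stdatomic.h>

    #define NUM_SEGMENTS 8

    /* next_iter[s] = iteration number allowed to enter sequential segment s next */
    static atomic_long next_iter[NUM_SEGMENTS];

    /* Emulated 'wait s': spin until all earlier iterations have signalled
       segment s. A ring cache delivers the signal proactively instead of
       being polled like this. */
    static void emulated_wait(int s, long iter) {
        while (atomic_load_explicit(&next_iter[s], memory_order_acquire) < iter)
            ;
    }

    /* Emulated 'signal s': mark segment s of iteration 'iter' as done,
       unblocking the core that runs iteration iter + 1. */
    static void emulated_signal(int s, long iter) {
        atomic_store_explicit(&next_iter[s], iter + 1, memory_order_release);
    }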

Compiler. HCCv3 takes sequential programs and parallelizes loops that are most likely to speed up performance when their iterations execute in parallel. Only one loop runs in parallel at a time and its successive iterations run on cores organized as a unidirectional ring.

To satisfy loop-carried data dependences, HCCv3 keeps the execution of sequential segments in iteration order by inserting wait and signal instructions to delimit the entry and exit points of these segments. In this way, HCCv3 guarantees that accesses to a variable or another memory location that might need to be shared between cores are always within sequential segments. Moreover, shared variables (normally allocated to registers in sequential code) are mapped to specially-allocated memory locations. Hence, their accesses within sequential segments occur via memory operations.
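
The sketch below (again our own abstraction, reusing the hypothetical emulated_wait/emulated_signal helpers from the previous sketch) shows the shape of this transformation for one iteration of a histogram-style loop: the shared update is confined to a sequential segment and performed through memory.

    /* Body executed by the core assigned iteration i. The histogram update,
       a may-dependence across iterations, is bracketed by segment 1 and its
       memory accesses are routed through the ring cache. */
    void iteration_body(const int *in, int *histogram, long i) {
        int bucket = in[i] & 0xFF;   /* parallel code: runs concurrently   */
        emulated_wait(1, i);         /* enter sequential segment 1         */
        histogram[bucket]++;         /* shared-memory access, in order     */
        emulated_signal(1, i);       /* exit: unblock the successor core   */
    }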

Core. A core forwards all memory accesses within sequential segments to its local ring node. All other memory accesses (not within a sequential segment) go through the private L1 cache. To determine whether the executing code is part of a sequential segment or not, a core simply counts the number of executed wait and signal instructions. If more waits have been executed than matching signals, then the executing code belongs to a sequential segment.
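
That test amounts to a counter comparison; a minimal sketch (field names are our own):

    /* Per-core bookkeeping used to steer memory accesses. */
    struct core_state { long waits_executed, signals_executed; };

    /* Inside a sequential segment iff more waits than matching signals so far. */
    static inline int in_sequential_segment(const struct core_state *c) {
        return c->waits_executed > c->signals_executed;
    }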

Memory. The ring cache is a connected ring of nodes, one per core. Each ring node has a cache array that satisfies both loads and stores received from its attached core. HELIX-RC does not require other changes to the existing memory hierarchy because the ring cache orchestrates interactions with it. To avoid any changes to conventional cache coherence protocols, the ring cache permanently maps each memory address to a unique ring node. All accesses from the distributed ring cache to the next cache level (L1) go through the associated node for a corresponding address.

3.2. Decoupling communication from computation

Having seen the main components of HELIX-RC, we describe how they interact to efficiently decouple communication from computation.

Shared data communication. HELIX-RC decouples communication of variables and other shared data locations by propagating new shared data through the ring cache as soon as it is generated. Once a ring node receives a store, it records the new value and proactively forwards its address and value to an adjacent node in the ring cache, all without interrupting the execution of the attached core. The value then propagates from node to node through the rest of the ring without interrupting the computation of any core—decoupling communication from computation.

Synchronization. Given the difficulty of determining which iteration depends on which in non-numerical programs, compilers typically make the conservative assumption that an iteration depends on all of its predecessor iterations. Therefore, a core cannot execute sequential code until it is unblocked by its predecessor [6, 25, 40]. Moreover, an iteration unblocks its successor only if both it and its predecessors have executed this sequential segment or if they are not going to. This execution model leads to a chain of signal propagation across loop iterations that includes unnecessary synchronization: even if an iteration is not going to execute sequential code, it still needs to synchronize with its predecessor before unblocking its successor.

HELIX-RC removes these synchronization overheads by enabling an iteration to detect the readiness of all predecessor iterations, not just one. Therefore, once an iteration forgoes executing the sequential segment, it immediately notifies its successor without waiting for its predecessor. Unfortunately, while HELIX-RC removes unnecessary synchronization, it increases the number of signals that can be in flight simultaneously.

HELIX-RC relies on the new signal instruction to handle synchronization signals efficiently. Synchronization between a producer and a consumer includes (i) the producer generating a signal, (ii) the consumer requesting that signal, and (iii) signal transmission between the two. On a conventional multicore, which relies on a pull-based memory hierarchy for communication, signal transmission is inherently lazy, and signal request and transmission get serialized.

Figure 5: Example illustrating benefits of decoupling communication from computation. (a) Parallel code; (b) coupled communication; (c) decoupled communication.

On the other hand, in HELIX-RC, signal instructs the ring cache to proactively forward a signal to all other nodes in the ring without interrupting any of the cores, thereby decoupling signal transmission from synchronization.

Code example. Given the importance of these decoupling mechanisms to fully realize performance benefits, let's explore how HELIX-RC implements them using a concrete example. The code in Figure 5(a), abstracted for clarity, represents a small hot loop from 175.vpr of SPEC CINT2000 that is responsible for 55% of the total execution time of that program. It contains a sequential segment with two possible execution paths. The left path contains an actual dependence where instances of instruction 1 in an iteration use values from previous iterations. The right path does not depend on prior data. Because the compiler cannot predict the execution path of a particular iteration (due to complex control flow), it must assume that instruction 1, in any given iteration, depends on the previous iteration. Therefore, it must synchronize all successive iterations by inserting wait and signal instructions on every execution path. Figure 5(b) highlights this sequential chain in red. Now, assume only iterations 0 and 2, running on cores 0 and 2, respectively, execute instruction 1. Then, this sequential chain is unnecessarily long because of the superfluous wait in iteration 1. Each iteration waits (via the wait instruction) for the signal generated by the signal instruction of the previous iteration. Also, iterations that update a (iterations 0 and 2) must load previous values first (using a regular load). Hence, two sets of stalls slow down the chain. First, iteration 1 performs unnecessary synchronization (signal stalls), because it only contains parallel code. Second, lazy forwarding of the shared data leads to data stalls, because the transfer only begins when requested, at a load, rather than when generated, at a store.

HELIX-RC proactively communicates data and synchronization signals between cores, which leads to the more efficient scenario shown in Figure 5(c). The sequential chain now only includes the delay required to satisfy the dependence—communication updating a shared value.
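
Since Figure 5(a) is not reproduced here, the following sketch (our own reconstruction of its shape, not the actual 175.vpr source; it reuses the hypothetical emulated_wait/emulated_signal helpers and invented names) shows the structure under discussion: both paths are bracketed by wait/signal even though only one touches shared data.

    /* Body of iteration i of the 175.vpr-like loop. */
    void vpr_like_iteration(long i, int cond, long *a) {
        emulated_wait(1, i);     /* segment 1 entry, on every path             */
        if (cond)
            *a = *a + 1;         /* "instruction 1": consumes a value written  */
                                 /* by an earlier iteration (actual dep.)      */
        /* else-path reads and writes no shared data                           */
        emulated_signal(1, i);   /* segment 1 exit: unblock the next iteration */
    }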

4. Compiler

The decoupled execution model of HELIX-RC described so far is possible given the tight co-design of the compiler and architecture. In this section, we focus on compiler-guaranteed code properties that enable a lightweight ring cache design, and follow up with code optimizations that make use of the ring cache.

Code properties.
• Only one loop can run in parallel at a time. Apart from a dedicated core responsible for executing code outside parallel loops, each core is either executing an iteration of the current loop or waiting for the start of the next one.
• Successive loop iterations are distributed to threads in a round-robin manner. Since each thread is pinned to a predefined core, and cores are organized in a unidirectional ring, successive iterations form a logical ring.
• Communication between cores executing a parallelized loop occurs only within sequential segments.
• Different sequential segments always access different shared data. HCCv3 only generates multiple sequential segments when there is no intersection of shared data. Consequently, instances of distinct sequential segments may run in parallel.
• At most two signals per sequential segment emitted by a given core can be in flight at any time. Hence, only two signals per segment need to be tracked by the ring cache.

This last property eliminates unnecessary wait instructions while keeping the architectural enhancement simple. Eliminating waits allows a core to execute a later loop iteration than its successor (significantly boosting parallelism). Future iterations, however, produce signals that must be buffered. The last code property prevents a core from getting more than one “lap” ahead of its successor. So when buffering signals, each ring cache node only needs to recognize two types—those from the past and those from the future.

Code optimizations. In addition to the optimizations of HCCv2, HCCv3 includes ones that are essential for best performance of non-numerical programs on a ring-cache-enhanced architecture: aggressive splitting of sequential segments into smaller code blocks; identification and selection of small hot loops; and elimination of unnecessary wait instructions.

Sizing sequential segments poses a tradeoff. Additional segments created by splitting run in parallel with others, but extra segments entail extra synchronization, which adds communication overhead. Thanks to decoupling, HCCv3 can split more aggressively than HCCv2 to significantly increase TLP.

Figure 6: Ring cache architecture overview. From left to right: overall system; single core slice; ring node internal structure.

Note that segments cannot be split indefinitely—each shared location must belong to only one segment.

To identify small hot loops that are most likely to speed up when their iterations run in parallel, HCCv3 includes a profiler to capture the behavior of the ring cache. Whereas HCCv1 relies on an analytical performance model to select the loops to parallelize, HCCv3 profiles loops on representative inputs. During profiling, instrumentation code emulates execution with the ring cache, resulting in an estimate of time saved by parallelization. Finally, HCCv3 uses a loop nesting graph, annotated with the profiling results, to choose the most promising loops.

5. Architecture Enhancements

Adding a ring cache to a multicore architecture enables the proactive circulation of data and signals that boost parallelization. This section describes the design of the ring cache and its constituent ring nodes. The design is guided by the following objectives:

Low-latency communication. HELIX-RC relies on fast communication between cores in a multicore processor for synchronization and for data sharing between loop iterations. Since low-latency communication is possible between physically adjacent cores in modern processors, the ring cache implements a simple unidirectional ring network.

Caching shared values. A compiler cannot easily guarantee whether and when shared data generated by a loop iteration will be consumed by other cores running subsequent iterations. Hence, the ring cache must cache shared data. Keeping shared data on local ring nodes provides quick access for the associated cores. As with data, it is also important to buffer signals in each ring node for immediate consumption.

Easy integration. The ring cache is a minimally-invasive extension to existing multicore systems, easy to adopt and integrate. It does not require modifications to the existing memory hierarchy or to cache coherence protocols.

With these objectives in mind, we now describe the internals of the ring cache and its interaction with the rest of the architecture.

5.1. Ring Cache Architecture

The ring cache architecture relies on properties of compiled code including: (i) parallelized loop iterations execute in separate threads on separate cores, arranged in a logical ring; and (ii) data shared between iterations moves between cores from current to future iterations. These properties imply that the data involved in timing-critical dependences that potentially limit overall performance are both produced and consumed in the same order as loop iterations. Furthermore, a ring network topology captures this data flow, as sketched in Figure 6. The following paragraphs describe the structure and purpose of each ring cache component.

Ring node structure. The internal structure of a per-core ring node is shown in the right half of Figure 6. Parts of this structure resemble a simple network router. Unidirectional links connect a node to its two neighbors to form the ring backbone. Bidirectional connections to the core and private L1 cache allow injection of data into and extraction of data from the ring. There are three separate sets of data links and buffers. A primary set forwards data and signals between cores. Two other sets manage infrequent traffic for integration with the rest of the memory hierarchy (see Section 5.2). Separating these three traffic types simplifies the design and avoids deadlock. Finally, signals move in lockstep with forwarded data to ensure that a shared memory location is not accessed before the data arrives.

In addition to these router-like elements, a ring node also contains structures more common to caches. A set associative cache array stores all data values (and their tags) received by the ring node, whether from a predecessor node or from its associated core. The line size of this cache array is kept at one machine word. While the small line is contrary to typical cache designs, it ensures there will be no false data sharing by independent values from the same line.

The final structural component of the ring node is the signal buffer, which stores signals until they are consumed.

Node-to-node connection. The main purpose of the ring cache is to proactively provide many-to-many core communication in a scalable and low-latency manner. In the unidirectional ring formed by the ring nodes, data propagates by value circulation. Once a ring node receives an (address, value) pair, either from its predecessor, or from its associated core, it stores a local copy in its cache array and propagates the same pair to its successor node. The pair eventually propagates through the entire ring (stopping after a full cycle) so that any core can consume the data value from its local ring node, as needed.
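
A minimal sketch of this value-circulation step (our own model, not the hardware design; the message format, the tiny direct-mapped array, and the send_to_successor hook are assumptions):

    #include <stdint.h>

    #define SETS 128   /* stand-in for the per-node cache array */

    typedef struct { uintptr_t addr; uint64_t value; unsigned origin; } ring_msg;

    typedef struct {
        unsigned  id, num_nodes;
        uintptr_t tags[SETS];
        uint64_t  data[SETS];
    } ring_node;

    /* Node-to-node link; in hardware, a single-cycle unidirectional hop. */
    void send_to_successor(unsigned node_id, ring_msg m);

    /* Handle one (address, value) pair from the predecessor node or the
       attached core: keep a local copy, then forward it, stopping once the
       pair has travelled the full ring back to its origin. */
    void ring_node_step(ring_node *n, ring_msg m) {
        unsigned set = (unsigned)(m.addr / sizeof(uint64_t)) % SETS;
        n->tags[set] = m.addr;    /* local copy for the attached core */
        n->data[set] = m.value;
        unsigned succ = (n->id + 1) % n->num_nodes;
        if (succ != m.origin)     /* full circle completed */
            send_to_successor(succ, m);
    }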

This value circulation mechanism allows the ring cache to communicate between cores faster than reactive systems (like most coherent cache hierarchies). In a reactive system, data transfer begins once the receiver requests the shared data, which adds transfer latency to an already latency-critical code path. In contrast, a proactive scheme overlaps transfer latencies with computation to lower the receiver's perceived latency.

The ring cache prioritizes the common case, where data generated within sequential segments must propagate to all other nodes as quickly as possible. Assuming no contention over the network and single-cycle node-to-node latency, the design shown in Figure 6 allows us to bound the latency for a full trip around the ring to N clock cycles, where N is the number of cores. Each ring node prioritizes data received from the ring and stalls injection from its local core.

In order to eliminate buffering delays within the node that are not due to L1 traffic, the number of write ports in each node's cache array must match the link bandwidth between two nodes. While this may seem like an onerous design constraint for the cache array, Section 6.3 shows that just one write port is sufficient to reap more than 99% of the ideal-case benefits.

To ensure correctness under network contention, the ring cache is sometimes forced to stall all messages (data and signals) traveling along the ring. The only events that can cause contention and stalls are ring cache misses and evictions, which may then need to fetch data from a remote L1 cache. While these ring stalls are necessary to guarantee correctness, they are infrequent.

The ring cache relies on credit-based flow control [17] and is deadlock free. Each ring node has at least two buffers attached to the incoming links to guarantee forward progress. The network maintains the invariant that there is always at least one empty buffer per set of links somewhere in the ring. That is why a node only injects new data from its associated core into the ring when there is no data from a predecessor node to forward.

Node-core integration. Ring nodes are connected to their respective cores as the closest level in the cache hierarchy (Figure 6). The core's interface to the ring cache is through regular loads and stores for memory accesses in sequential segments.

As previously discussed, wait and signal instructions delineate code within a sequential segment. A thread that needs to enter a sequential segment first executes a wait, which only returns from the associated ring node when matching signals have been received from all other cores executing prior loop iterations. The signal buffer within the ring node enforces this. Specialized core logic detects the start of the sequential segment and routes memory operations to the ring cache.4 Finally, executing the corresponding signal marks the end of the sequential segment.

The wait and signal instructions require special treatment in out-of-order cores. Since they may have system-wide side effects, these instructions must issue non-speculatively from the core's store queue and regular loads and stores cannot be reordered around them. Our implementation reuses logic from load-store queues for memory disambiguation and holds a lightweight local fence in the load queue until the wait returns to the senior store queue. This is not a concern for in-order cores.

4 This feature may add one multiplexer delay to the critical delay path from the core to L1.

Benchmark        Phases   Parallel loop coverage
                          HELIX-RC   HCCv2   HCCv1
Integer benchmarks
164.gzip         12       98.2%      42.3%   42.3%
175.vpr          28       99%        55.1%   55.1%
197.parser       19       98.7%      60.2%   60.2%
300.twolf        18       99%        62.4%   62.4%
181.mcf          19       99%        65.3%   65.3%
256.bzip2        23       99%        72.3%   72.1%
Floating point benchmarks
183.equake        7       99%        99%     77.1%
179.art          11       99%        99%     84.1%
188.ammp         23       99%        99%     60.2%
177.mesa          8       99%        99%     64.3%

Table 1: Characteristics of parallelized benchmarks.

5.2. Memory Hierarchy Integration

The ring cache is a level within the cache hierarchy and as such must not break any consistency guarantees that the hierarchy normally provides. Consistency between the ring cache and the conventional memory hierarchy results from the following invariants: (i) shared memory can only be accessed within sequential segments through the ring cache (compiler-enforced); (ii) only a uniquely assigned owner node can read or write a particular shared memory location through the L1 cache on a ring cache miss (ring cache-enforced); and (iii) the cache coherence protocol preserves the order of stores to a memory location through a particular L1 cache.5

Sequential consistency. To preserve the semantics of a parallelized single-threaded program, memory operations on shared values require sequential consistency. The ring cache meets this requirement by leveraging the unidirectional data flow guaranteed by the compiler. Sequential consistency must be preserved when ring cache values reach lower-level caches, but the consistency model provided by conventional memory hierarchies is weaker. We resolve this difference by introducing a single serialization point per memory location, namely a unique owner node responsible for all interactions with the rest of the memory hierarchy. When a shared value is moved between the ring cache and L1 caches (owing to occasional ring cache load misses and evictions), only its owner node can perform the required L1 cache accesses. This solution preserves existing consistency models with minimal impact on performance.

Cache flush. Finally, to guarantee coherence between parallelized loops and serial code between loop invocations, each ring node flushes the dirty values of memory locations it owns to L1 once a parallel loop has finished execution. This is equivalent to executing a distributed fence at the end of loops. In a multi-program scenario, signal buffers must also be flushed/restored at program context switches.

6. Evaluation

By co-designing the compiler along with the architecture, HELIX-RC more than triples the performance of parallelized code when compared to a compiler-only solution (HCCv2).

5 Most cache coherence protocols (including Intel, AMD, and ARM implementations) provide this minimum guarantee.

Figure 7: HELIX-RC triples the speedup obtained by HCCv2. Speedups are relative to sequential program execution.

This section investigates HELIX-RC's performance benefits and their sensitivity to ring cache parameters. We confirm that the majority of speedups come from decoupling all types of communication and synchronization. We conclude by analyzing the remaining overheads of the execution model.

6.1. Experimental Setup

We ran experiments on two sets of architectures. The first relies on a conventional memory hierarchy to share data among the cores. The second relies on the ring cache.

Simulated conventional hardware. Unless otherwise noted, we simulate a multicore in-order x86 processor by adding multiple-core support to the XIOSim simulator. The single-core XIOSim models have been extensively validated against an Intel Atom processor [19]. We use XIOSim because it is a publicly-available simulator that is able to simulate fine-grained microarchitectural events with high precision. For one of the sensitivity experiments, we also simulate out-of-order cores modeled after Intel Nehalem using the models from Zesto [21].

The simulated cache hierarchy has two levels: a per-core 32KB, 8-way associative L1 cache and a shared 8MB 16-bank L2 cache. We vary the core count from 1 to 16, but do not vary the amount of L2 cache with the number of cores, keeping it at 8MB for all configurations. Scaling the cache size as well would make it difficult to distinguish the benefits of parallelizing a workload from the benefits of fitting its working set into the larger cache, causing misleading results. Finally, we use DRAMSim2 [33] for cycle-accurate simulation of memory controllers and DRAM.

We extended XIOSim with a cache coherence protocol assuming an optimistic cache-to-cache latency of 10 clock cycles. This 10-cycle latency is optimistically low even compared to research prototypes of low-latency coherence [23]. In fact, it is the minimum reasonably possible with a 4×4 2D mesh network. (Running microbenchmarks in our testbed, we found that Intel Ivy Bridge is 75 cycles, Intel Sandy Bridge is 95 cycles, and Intel Nehalem is 110 cycles.)

Figure 8: Breakdown of benefits of decoupling communication from computation (HCCv2; decoupled register communication; decoupled register communication and synchronization; decoupled register and memory communication; HELIX-RC with all communication decoupled).

We only use this low-latency model to simulate conventional hardware, and later (Section 6.2) show that low latency alone is not enough to compensate for the lazy nature of its coherence protocol.

Simulated ring cache. We extended XIOSim to simulate the ring cache as described in Section 5. Unless otherwise noted, it has the following configuration: a 1KB 8-way associative array size, one-word data bandwidth, five-signal bandwidth, single-cycle adjacent core latency, and two cycles of core-to-ring-node injection latency to minimally impact the already delay-critical path from the core to the L1 cache. We use a simple bit mask as the hash function to distribute memory addresses to their owner nodes. To avoid triggering the cache coherence protocol, all words of a cache line have the same owner. Lastly, XIOSim simulates changes made to the core to route memory accesses either to the attached ring node or to the private L1.
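
The owner mapping can be as simple as the sketch below (our own; the 64-byte line size and the 16-node count are assumptions matching the evaluated configuration, and the node count must be a power of two for a bit mask to work):

    #include <stdint.h>

    #define LINE_BYTES 64u   /* assumed L1 line size                 */
    #define NUM_NODES  16u   /* one ring node per core; power of two */

    /* Every word of a cache line hashes to the same owner node, so a line is
       never split across owners and the coherence protocol is not triggered. */
    static inline unsigned owner_node(uintptr_t addr) {
        return (unsigned)((addr / LINE_BYTES) & (NUM_NODES - 1u));
    }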

Benchmarks. We use 10 out of the 15 C benchmarks from the SPEC CPU2000 suite: 4 floating point (CFP2000) and 6 integer benchmarks (CINT2000). For engineering reasons, the data dependence analysis that HCCv3 relies on [13] requires either too much memory or too much time to handle the rest. This limitation is orthogonal to the results described in this paper.

Compiler. We extended the ILDJIT compilation framework [5], version 1.1, to use LLVM 3.0 for backend machine code generation. We generated both single- and multi-threaded versions of the benchmarks. The single-threaded programs are the unmodified versions of benchmarks, optimized (O3) and generated by LLVM. This code outperforms GCC 4.8.1 by 8% on average and underperforms ICC 14.0.0 by 1.9%.6 The multi-threaded programs were generated by HCCv3 and HCCv2 to run on ring-cache-enhanced and conventional architectures, respectively. Both compilers produce code automatically and do not require any human intervention. During compilation, they use SPEC training inputs to select the loops to parallelize.

Measuring performance. We compute speedups relative to sequential simulation.

6 As an aside, automatic parallelization features of ICC led to a geomean slowdown of 2.6% across SPEC CINT2000 benchmarks, suggesting ICC cannot parallelize non-numerical programs.

Figure 9: While code generated by HCCv3 speeds up with a ring cache (R), it slows down on conventional hardware (C). (Percentage of execution time split into communication and computation for each SPEC CINT2000 benchmark.)

Both single- and multi-threaded runs use reference inputs. To make simulation feasible, we simulate multiple phases of 100M instructions as identified by SimPoint [14].

6.2. Speedup Analysis

In our 16-core processor evaluation system, HELIX-RC boosts the performance of sequentially-designed programs (CINT2000), assumed not to be amenable to parallelization. Figure 7 shows that HELIX-RC raises the geometric mean of speedups for these benchmarks from 2.2× for HCCv2 without ring cache to 6.85×.

HELIX-RC not only maintains the performance increases of HCCv2 (compared to HCCv1) on numerical programs (SPEC CFP2000), but also increases the geometric mean of speedups for CFP2000 benchmarks from 11.4×7 to almost 12×.

We now turn our attention to understanding where the speedups come from.

Communication. Speedups obtained by HELIX-RC come from decoupling both synchronization and data communication from computation in loop iterations, which significantly reduces communication overhead, allows the compiler to split sequential segments into smaller blocks, and cuts down the critical path of the generated parallel code. Figure 8 compares the speedups gained by multiple combinations of decoupling synchronization, register-, and memory-based communication. As expected, fast register transfers alone do not provide much speedup since most in-register dependences can be satisfied by re-computing the shared variables involved (Section 2). Instead, most of the speedups come from decoupling communication for both synchronization and memory-carried actual dependences. To the best of our knowledge, HELIX-RC is the only solution that accelerates all three types of transfers for actual dependences.

In order to assess the impact of decoupling communication from computation for CINT2000 benchmarks, we executed the parallel code generated by HCCv3—assuming a decoupling architecture like a ring cache—on a simulated conventional system which does not decouple. The loops selected under this assumption do require frequent communication (every 24 instructions on average).

7 These speedups are possible even with a cache coherence latency of conventional processors (e.g., 75 cycles).

Figure 10: Speedup obtained by changing the complexity of the core from a 2-way in-order to a 4-way out-of-order (upper panel: program speedup for 4-way OoO, 2-way OoO, and 2-way IO cores; lower panel: normalized sequential execution time).

Figure 9 shows that such code, running on a conventional multicore (left bars), performs no better than sequential execution (100%), even with the optimistic 10-cycle core-to-core latency. These results further stress the importance of selecting loops based on the core-to-core latency of the architecture.

Sequential segments. While more splitting offers higher TLP (more sequential segments can run in parallel), more splitting also requires more synchronization at run time. Hence, the high synchronization cost for conventional multicores discourages aggressive splitting of sequential segments.8 In contrast, the ring cache enables aggressive splitting to maximize TLP.

To analyze the relationship between splitting and TLP, we computed the number of instructions that execute concurrently for the following two scenarios: (i) conservative splitting constrained by a contemporary multicore processor with high synchronization penalty (100 cycles) and (ii) aggressive splitting for HELIX-RC with low-latency communication (<10 cycles) provided by the ring cache. In order to compute TLP independent of both the communication overhead and core pipeline advantages, we used a simple abstracted model of a multicore system that has no communication cost and is able to execute one instruction at a time. Using the same set of loops chosen by HELIX-RC and used in Figure 7, TLP increased from 6.4 to 14.2 instructions with aggressive splitting. Moreover, the average number of instructions per sequential segment dropped from 8.5 to 3.2 instructions.

Coverage. Despite all the loop-level speedups possible via decoupling communication and aggressively splitting sequential segments, Amdahl's law states that program coverage dictates the overall speedup of a program. Prior parallelization techniques have avoided selecting loops with small bodies because communication would slow down execution on conventional processors [6, 39]. Since HELIX-RC does not suffer from this problem, the compiler can freely select small hot loops to cover almost the entirety of the original program. Table 1 shows that HELIX-RC achieves >98% coverage for all of the benchmarks evaluated.

6.3. Sensitivity to Architectural Parameters

Speedup results so far assumed the default configuration (in Section 6.2) for the ring cache. We now investigate the impact of different architectural parameters on speedup.

8 This is the rationale behind DOACROSS parallelization [10].

Figure 11: Sensitivity to core count and ring cache parameters. Only SPEC CINT benchmarks are shown. (a) Core count (2, 4, 8, 16 cores). (b) Adjacent node link latency (1, 4, 8, 16, 32 cycles). (c) Signal bandwidth (unbounded, 4, 2, 1 signals). (d) Node memory size (unbounded, 32 KB, 1 KB, 256 B).

Figure 12: Breakdown of overheads that prevent achieving ideal speedup (per-benchmark shares attributed to additional instructions, wait/signal instructions, memory, iteration imbalance, low trip count, communication, and dependence waiting, alongside the HELIX-RC speedup).


Core type. While HELIX-RC successfully improves TLP for in-order cores, one may ask how ILP provided by more complex cores impacts speedups. Figure 10 (upper) plots speedups for two additional core types, 4-way and 2-way out-of-order (OoO) cores, in addition to the default 2-way in-order (IO) core for CINT2000 benchmarks. The lower graph plots the sequential execution time normalized to that of the 4-way OoO core. Although the OoO cores can extract more ILP for the same workloads (the 4-way OoO core is on average 1.9× faster than the default 2-way IO core), HELIX-RC speeds up most of the benchmarks, except for 164.gzip. Thoroughly characterizing and accounting for this tradeoff between HELIX-RC-extracted TLP and ILP with different architectures is the subject of future work.

Core count. Figure 11a shows that HELIX-RC efficiently scales parallel performance with core count, from 2 to 16 cores.

Link latency. Figure 11b shows the speedups obtained versus the minimum communication latency between adjacent ring nodes. As expected, HELIX-RC performance degrades at longer latencies for most of the benchmarks. It is important to note that current technologies can satisfy single-cycle adjacent-core latencies, as confirmed by commercial designs [46] and CACTI [24] wire models of interconnect lengths for dimensions in modern multicore processors.

Link bandwidth. A ring cache uses separate dedicated wires for data and signals to simplify design. Simulations confirm that a minimum data bandwidth of one machine word (hence, a single write port) sustains more than 99.9% of the performance obtained by a data link with unbounded bandwidth for all benchmarks. In contrast, reducing signal bandwidth can degrade performance, as shown in Figure 11c, due to synchronization stalls. However, the physical overhead of adding additional signals (up to 4) is negligible.

Memory size. Figure 11d shows the impact of memory size. The finite-size cases assume LRU replacement. Reducing the cache array size within the ring node only impacts 197.parser, which has the largest ring cache working set.

6.4. Analysis of Overhead

To understand areas for improvement, we categorize every overhead cycle (preventing ideal speedup) based on a set of simulator statistics and the methodology presented by Burger et al. [4]. Figure 12 shows the results of this categorization for HELIX-RC, again implemented on a 16-core processor.

Most importantly, the small fraction of communication overheads suggests that HELIX-RC successfully eliminates the core-to-core latency for data transfer in most benchmarks. For several benchmarks, notably 175.vpr, 300.twolf, 256.bzip2, and 179.art, the major source of overhead is the low number of iterations per parallelized loop (low trip count). While many hot loops are frequently invoked, low iteration counts (ranging from 8 to 20) lead to idle cores. Other benchmarks, such as 164.gzip, 197.parser, 181.mcf, and 188.ammp, suffer from dependence waiting due to large sequential segments. Finally, HCCv3 must sometimes add a large number of wait and signal instructions (i.e., many sequential segments) to increase TLP, as seen for 164.gzip, 197.parser, 181.mcf, and 256.bzip2.

7. Related Work

To compare HELIX-RC to a broad set of related work, Table 2 summarizes different parallelization schemes proposed for non-numerical programs, organized with respect to the types of communication decoupling implemented (register vs. memory) and the types of dependences targeted (actual vs. false). HELIX-RC covers the entire design space and is the only one to decouple memory accesses from computation for actual dependences.

              Actual dependences                  False dependences
Register      HELIX-RC, Multiscalar, TRIPS, T3    HELIX-RC, Multiscalar, TRIPS, T3
Memory        HELIX-RC                            HELIX-RC, TLS-based approaches,
                                                  Multiscalar, TRIPS, T3

Table 2: Only HELIX-RC decouples communication for all types of dependences.


Multiscalar register file. Multiscalar processors [38] extract both ILP and TLP from an ordinary application. While a ring cache's structure resembles a Multiscalar register file, there are fundamental differences. For the Multiscalar register file, there is a fixed and relatively small number of shared elements that must be known at compile time. Furthermore, the Multiscalar register file cannot handle memory updates by simply mapping memory to a fixed number of registers without a replacement mechanism. In contrast, the ring cache does not require compile-time knowledge to handle an arbitrary number of elements shared between cores (i.e., memory locations allocated at runtime) and can readily handle register updates by deallocating a register to a memory location. In other words, HELIX-RC proposes to use a distributed cache to handle both register and memory updates.
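As a rough illustration of why replacement matters, the sketch below implements a tiny address-keyed buffer with LRU eviction; the capacity, lookup policy, and data layout are made-up simplifications, not the ring cache's actual organization. Because entries are keyed by runtime addresses and evicted on demand, the structure needs no compile-time list of shared elements, unlike a fixed register file.

    /* Minimal sketch: a small fully-associative buffer with LRU eviction can
     * hold any runtime-allocated shared location, whereas a fixed register
     * file must know every shared element at compile time. Illustrative only. */
    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4                        /* illustrative node capacity */

    struct entry { uint64_t addr, data; long last_use; int valid; };
    static struct entry node[WAYS];
    static long now;

    /* Return the slot for addr, evicting the least recently used entry on a miss. */
    static struct entry *lookup(uint64_t addr) {
        struct entry *victim = &node[0];
        for (int w = 0; w < WAYS; w++) {
            if (node[w].valid && node[w].addr == addr) {
                node[w].last_use = ++now;             /* hit: refresh recency */
                return &node[w];
            }
            if (!node[w].valid || node[w].last_use < victim->last_use)
                victim = &node[w];                    /* track LRU candidate  */
        }
        victim->addr = addr; victim->data = 0;        /* miss: reuse LRU slot */
        victim->valid = 1;   victim->last_use = ++now;
        return victim;
    }

    int main(void) {
        /* shared locations whose addresses are only known at run time */
        uint64_t heap_addrs[] = {0x1000, 0x2040, 0x30c0, 0x1000, 0x4f00, 0x2040};
        for (unsigned i = 0; i < sizeof heap_addrs / sizeof *heap_addrs; i++)
            lookup(heap_addrs[i])->data = i;          /* store a shared value */
        printf("buffer holds the %d most recently used shared locations\n", WAYS);
        return 0;
    }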

Cache coherence protocols. The ring cache addresses an entirely different set of communication demands. Cache coherence protocols target relatively small amounts of data shared infrequently between cores. Hence, cores can communicate lazily, but the resulting communication almost always lies in the critical sequential chain. In contrast, the ring cache targets frequent and time-critical data sharing between cores.

On-chip networks. While on-chip networks (OCNs) can take several forms, they commonly implement reactive coherence protocols [34, 37, 41, 44, 46] that do not fulfill the low-latency communication requirements of HELIX-RC. Scalar operand networks [12, 42] somewhat resemble a ring cache in that they enable tight coupling between known producers and consumers of specific operands, but they suffer from the same limitations as the Multiscalar register file. Hence, HELIX-RC implements a relatively simple OCN, supported by compiler guarantees and additional logic to implement automatic forwarding.

Off-chip networks. Networks that improve bandwidth between processors have been studied extensively [36, 46]. While they work well for CMT parallelization techniques [9, 28] that require less frequent data sharing, there is less overall parallelism. Moreover, networks that target chip-to-chip communication do not meet the very different low-latency core-to-core communication demands of HELIX-RC [17]. Our results show that HELIX-RC is much more sensitive to latency than to bandwidth.

Non-commodity processors. Multiscalar [38], TRIPS [35], and T3 [32] are polymorphous architectures that target parallelism at different granularities. They differ from HELIX-RC in that (i) they require a significantly larger design effort and (ii) they only decouple register-to-register communication and/or false memory dependence communication by speculating.

An iWarp system [2] implements special-purpose arrays that execute fine- and coarse-grained parallel numerical programs. However, without an efficient broadcast mechanism, iWarp's fast communication cannot reach the speedups offered by HELIX-RC.

Automatic parallelization of non-numerical programs. Several automatic methods to extract TLP have demonstrated respectable speedups on commodity multicore processors for non-numerical programs [6, 16, 27, 29, 30, 43, 49]. All of these methods transform loops into parallel threads. Decoupled software pipelining (DSWP) [27] reduces sensitivity to communication latency by restructuring a loop to create a pipeline among the extracted threads, with unidirectional communication between pipeline stages. Demonstrated both on simulators and on real systems, DSWP performance is largely insensitive to latency. However, significant restructuring of the loop makes speedups difficult to predict, and generated code can sometimes be slower than the original. Moreover, DSWP faces the challenges of selecting appropriate loops to parallelize and keeping the pipeline balanced at runtime. While DSWP-based approaches focus more on restructuring loops to hide communication latency [16, 27, 30], HELIX-RC proposes an architecture-compiler co-design strategy that selects the most appropriate loops for parallelization.

Combining DSWP with HELIX-RC has the potential to yield significantly better performance than either alone. DSWP cannot easily scale beyond four cores [31] without being combined with approaches that exploit parallelism among loop iterations [16] (e.g., DOALL [22]). While DSWP + DOALL can scale beyond several cores, DOALL parallelism is not easy to find in non-numerical code. Instead, DSWP + HELIX-RC presents an opportunity to parallelize a much broader set of loops.

Several TLS-based techniques [15, 18, 20, 39, 48, 49], including STAMPede, Stanford Hydra, and POSH, combine hardware-assisted thread-level speculation (TLS) with compiler optimizations to manage dependences between loop iterations executing in different threads. When the compiler identifies the sources and destinations of frequent dependences, it synchronizes them using wait and signal primitives; otherwise, it relies on speculation. HELIX-RC, on the other hand, optimizes code assuming all dependences are actual. While we believe adding speculation may help HELIX-RC, Figure 7 shows that decoupled communication already yields significant speedups without misspeculation overheads.
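The wait and signal primitives referred to above can be emulated in software; the following is a minimal pthread-based sketch, with thread count, iteration count, and function names chosen purely for illustration. It assumes one ordering counter per sequential segment that releases iterations in order; HELIX-RC itself serves these primitives through the ring cache rather than through software synchronization.

    /* Minimal sketch of counter-based wait/signal for one sequential segment,
     * emulated with pthreads: iteration i may enter the segment only after
     * iteration i-1 has signalled it. Illustrative emulation only. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ITERS   8
    #define THREADS 4

    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    static long next_iter = 0;            /* iteration allowed in the segment */

    static void seg_wait(long iter) {     /* inserted before the segment body */
        pthread_mutex_lock(&m);
        while (next_iter != iter)
            pthread_cond_wait(&cv, &m);
        pthread_mutex_unlock(&m);
    }

    static void seg_signal(long iter) {   /* inserted after the segment body */
        pthread_mutex_lock(&m);
        next_iter = iter + 1;
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&m);
    }

    static void *worker(void *arg) {
        long first = (long)(intptr_t)arg;
        for (long i = first; i < ITERS; i += THREADS) {  /* round-robin iterations */
            /* ... parallel portion of the loop body runs here ... */
            seg_wait(i);
            printf("iteration %ld executes the sequential segment\n", i);
            seg_signal(i);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[THREADS];
        for (long c = 0; c < THREADS; c++)
            pthread_create(&t[c], NULL, worker, (void *)(intptr_t)c);
        for (long c = 0; c < THREADS; c++)
            pthread_join(t[c], NULL);
        return 0;
    }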

8. Conclusion

Decoupling communication from computation makes non-numerical programs easier to parallelize automatically by compiling loop iterations as parallel threads. While numerical programs can often be parallelized by compilation alone, non-numerical programs greatly benefit from a combined compiler-architecture approach. Our HELIX-RC prototype shows that a minimally invasive architecture extension co-designed with a parallelizing compiler can liberate enough parallelism to make good use of 16 cores for non-numerical benchmarks commonly thought not to be parallelizable.

Acknowledgements

We thank the anonymous reviewers for their feedback on numerous manuscripts. Moreover, we would like to thank Glenn Holloway for his invaluable contributions to the HELIX project. This work was possible thanks to the sponsorship of the Royal Academy of Engineering, EPSRC, and the National Science Foundation (award number IIS-0926148). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.

References
[1] Randy Allen and Ken Kennedy. Optimizing compilers for modern architectures. Morgan Kaufmann, 2002.
[2] Shekhar Borkar, Robert Cohn, George Cox, Sha Gleason, Thomas Gross, H. T. Kung, Monica Lam, Brian Moore, Craig Peterson, John Pieper, Linda Rankin, P. S. Tseng, Jim Sutton, John Urbanski, and Jon Webb. iWarp: An integrated solution to high-speed parallel computing. In Supercomputing, 1988.
[3] Matthew J. Bridges, Neil Vachharajani, Yun Zhang, Thomas Jablin, and David I. August. Revisiting the sequential programming model for multi-core. In MICRO, 2007.
[4] Doug Burger, James R. Goodman, and Alain Kägi. Memory bandwidth limitations of future microprocessors. In ISCA, 1996.
[5] Simone Campanoni, Giovanni Agosta, Stefano Crespi Reghizzi, and Andrea Di Biagio. A Highly Flexible, Parallel Virtual Machine: Design and Experience of ILDJIT. In Software: Practice and Experience, 2010.
[6] Simone Campanoni, Timothy M. Jones, Glenn Holloway, Vijay Janapa Reddi, Gu-Yeon Wei, and David Brooks. HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing. In CGO, 2012.
[7] Simone Campanoni, Timothy M. Jones, Glenn Holloway, Gu-Yeon Wei, and David Brooks. HELIX: Making the Extraction of Thread-Level Parallelism Mainstream. In IEEE Micro, 2012.
[8] Ramkrishna Chatterjee, Barbara G. Ryder, and William A. Landi. Relevant Context Inference. In POPL, 1999.
[9] Lynn Choi and Pen-Chung Yew. Compiler and hardware support for cache coherence in large-scale multiprocessors: Design considerations and performance study. In ISCA, 1996.
[10] Ron Cytron. DOACROSS: Beyond vectorization for multiprocessors. In ICPP, 1986.
[11] Alain Deutsch. A storeless model of aliasing and its abstractions using finite representations of right-regular equivalence relations. In ICCL, 1992.
[12] Paul Gratz, Changkyu Kim, Karthikeyan Sankaralingam, Heather Hanson, Premkishore Shivakumar, Stephen W. Keckler, and Doug Burger. On-Chip Interconnection Networks of the TRIPS Chip. In IEEE Micro, 2007.
[13] Bolei Guo, Matthew J. Bridges, Spyridon Triantafyllis, Guilherme Ottoni, Easwaran Raman, and David I. August. Practical and accurate low-level pointer analysis. In CGO, 2005.
[14] Greg Hamerly, Erez Perelman, and Brad Calder. How to use SimPoint to pick simulation points. In ACM SIGMETRICS Performance Evaluation Review, 2004.
[15] Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael K. Chen, and Kunle Olukotun. The Stanford Hydra CMP. In IEEE Micro, 2000.
[16] Jialu Huang, Arun Raman, Thomas B. Jablin, Yun Zhang, Tzu-Han Hung, and David I. August. Decoupled software pipelining creates parallelization opportunities. In CGO, 2010.
[17] Natalie Enright Jerger and Li-Shiuan Peh. On-Chip Networks. Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2009.
[18] Troy A. Johnson, Rudolf Eigenmann, and T. N. Vijaykumar. Speculative thread decomposition through empirical optimization. In PPoPP, 2007.
[19] Svilen Kanev, Gu-Yeon Wei, and David Brooks. XIOSim: power-performance modeling of mobile x86 cores. In ISLPED, 2012.
[20] Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau, and Josep Torrellas. POSH: A TLS compiler that exploits program structure. In PPoPP, 2006.
[21] Gabriel H. Loh, Samantika Subramaniam, and Yuejian Xie. Zesto: A cycle-level simulator for highly detailed microarchitecture exploration. In ISPASS, 2009.
[22] Stephen F. Lundstrom and George H. Barnes. A controllable MIMD architecture. In Advanced computer architecture, 1986.
[23] Milo M. K. Martin. Token coherence. PhD thesis, University of Wisconsin-Madison, 2003.
[24] Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. CACTI 6.0: A tool to model large caches. Technical Report 85, HP Laboratories, 2009.
[25] Alexandru Nicolau, Guangqiang Li, and Arun Kejariwal. Techniques for efficient placement of synchronization primitives. In PPoPP, 2009.
[26] Alexandru Nicolau, Guangqiang Li, Alexander V. Veidenbaum, and Arun Kejariwal. Synchronization optimizations for efficient execution on multi-cores. In ICS, 2009.
[27] Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August. Automatic thread extraction with decoupled software pipelining. In MICRO, 2005.
[28] David K. Poulsen and Pen-Chung Yew. Data prefetching and data forwarding in shared memory multiprocessors. In ICPP, 1994.
[29] Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and David I. August. Speculative parallelization using software multi-threaded transactions. In ASPLOS, 2010.
[30] Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew J. Bridges, and David I. August. Parallel-stage decoupled software pipelining. In CGO, 2008.
[31] Ram Rangan, Neil Vachharajani, Guilherme Ottoni, and David I. August. Performance scalability of decoupled software pipelining. In ACM TACO, 2008.
[32] Behnam Robatmili, Dong Li, Hadi Esmaeilzadeh, Sibi Govindan, Aaron Smith, Andrew Putnam, Doug Burger, and Stephen W. Keckler. How to Implement Effective Prediction and Forwarding for Fusable Dynamic Multicore Architectures. In HPCA, 2013.
[33] Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. DRAMSim2: A Cycle Accurate Memory System Simulator. In IEEE Computer Architecture Letters, 2011.
[34] Daniel Sanchez, Richard M. Yoo, and Christos Kozyrakis. Flexible architectural support for fine-grain scheduling. In ASPLOS, 2010.
[35] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Nitya Ranganathan, Doug Burger, Stephen W. Keckler, Robert G. McDonald, and Charles R. Moore. TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP. In ACM TACO, 2004.
[36] Steven L. Scott. Synchronization and Communication in the T3E Multiprocessor. In ASPLOS, 1996.
[37] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: a many-core x86 architecture for visual computing. In ACM Transactions on Graphics, 2008.
[38] Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. Multiscalar processors. In ISCA, 1995.
[39] J. Gregory Steffan, Christopher Colohan, Antonia Zhai, and Todd C. Mowry. The STAMPede approach to thread-level speculation. In ACM Transactions on Computer Systems, 2005.
[40] J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. Improving value communication for thread-level speculation. In HPCA, 2002.
[41] Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. The RAW microprocessor: A computational fabric for software circuits and general-purpose programs. In IEEE Micro, 2002.
[42] Michael Bedford Taylor, Walter Lee, Saman P. Amarasinghe, and Anant Agarwal. Scalar Operand Networks. In IEEE Transactions on Parallel and Distributed Systems, 2005.
[43] Georgios Tournavitis, Zheng Wang, Björn Franke, and Michael F. P. O'Boyle. Towards a holistic approach to auto-parallelization. In PLDI, 2009.
[44] Rob F. van der Wijngaart, Timothy G. Mattson, and Werner Haas. Light-weight communications on Intel's single-chip cloud computer processor. In SIGOPS Operating Systems Review, 2011.
[45] Hans Vandierendonck, Sean Rul, and Koen De Bosschere. The Paralax infrastructure: Automatic parallelization with a helping hand. In PACT, 2010.
[46] David Wentzlaff, Patrick Griffin, Henry Hoffmann, Liewei Bao, Bruce Edwards, Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John F. Brown, III, and Anant Agarwal. On-chip interconnection architecture of the Tile processor. In IEEE Micro, 2007.
[47] Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan, and Todd C. Mowry. Compiler optimization of scalar value communication between speculative threads. In ASPLOS, 2002.
[48] Antonia Zhai, J. Gregory Steffan, Christopher B. Colohan, and Todd C. Mowry. Compiler and hardware support for reducing the synchronization of speculative threads. In ACM TACO, 2008.
[49] Hongtao Zhong, Mojtaba Mehrara, Steve Lieberman, and Scott Mahlke. Uncovering hidden loop level parallelism in sequential applications. In HPCA, 2008.