
SARC Coherence: Scaling Directory Cache Coherence in Performance and Power

Stefanos Kaxiras, Uppsala University, Sweden
Georgios Keramidas, Industrial Systems Institute, Greece

The SARC project seeks to improve the power scalability of shared-memory chip multiprocessors (CMPs) by making directory coherence more efficient in both power and performance. The authors describe how they eliminate two major sources of inefficiency for directory coherence protocols: invalidation traffic on writes and directory indirection for finding the writer.

To scale application performance in the multicore era, we must go beyond instruction-level parallelism (ILP) and instead rely on explicit parallelism. An important issue for advances in this direction is ease of parallel programming. To this end, the shared-memory programming model offers a good starting point. However, we cannot ignore the issue of power efficiency. Poor power efficiency, that is, increasing power for diminishing performance gains, killed the development of wider ILP architectures. This danger is also visible in multicores: a parallel application experiences sublinear speedup and/or its power consumption increases faster than the number of cores allocated to it.

Hill describes various notions of scalability.1 For multicores, it is useful to think of the scalability of a parallel program in terms of power performance. In this article, we use the energy-delay product (EDP), recalled below, to study the scalability of parallel programs. Three forces shape a parallel application's power efficiency (EDP):

- the application's performance scalability (speedup),
- the application's communication-to-computation growth rate, and
- the working set size and how it fits in the core caches.2
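As a reminder of the metric (the notation P, E, and T here is ours), the energy-delay product of a run on P cores is

    \[ \mathrm{EDP}(P) = E(P) \times T(P), \]

where E(P) is the energy consumed (cores, caches, and network) and T(P) is the parallel execution time. Lower is better; EDP scales well only if energy does not grow faster than execution time shrinks as cores are added.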

We focus on improving power efficiency for parallel applications in shared-memory CMPs. The SARC project (http://www.sarc-ip.org) provides the framework. Our approach uses tear-off cache blocks (blocks that are not registered in the directory but self-invalidate on synchronization) to eliminate invalidation traffic and reduce false sharing and upgrade traffic, and writer prediction to eliminate directory indirection and go to the writers directly. We thus achieve both power (in the network and caches) and performance (for reads and writes) benefits. We evaluate our approach using Gems and show significant improvements in EDP over a base MESI protocol, improvements that increase with core count.





Energy-delay product scalability in shared-memory CMPs

The SARC architecture assumes a heterogeneous multicore processor composed of general-purpose cores and application-specific accelerators.3 General-purpose cores have a private cache hierarchy that we call level one (L1), although it might consist of more than one level; the last-level cache before the external memory is called level two (L2). General-purpose parallel programs, such as the Splash-2 benchmarks, run on the general-purpose cores. All on-chip communication is point-to-point messaging via a network on chip (NoC). A directory-based coherence protocol keeps the distributed private (L1) caches coherent.4 The directory is colocated with the shared on-chip L2 cache; that is, the directory tracks only the on-chip cache blocks.

EDP scalability depends on the behavior of core, cache, and network power. Core power, at first approximation, is fairly stable per core and increases about linearly with the number of cores. This leads to poor power efficiency if we use more cores for relatively less speedup. Figure 1 shows the speedups achieved in our simulations of a 16-core CMP for the Splash-2 benchmarks (we discuss the evaluation setup later in the article). Because core EDP can be derived from the performance scalability and average core power, we can concentrate instead on the more interesting problems of cache and network EDP scalability:

- Network power is an interplay between the capacity-miss traffic at low core counts and the increased coherence-communication traffic at higher core counts. Furthermore, network power is also affected by the distances the messages travel. As we use larger networks, the energy spent per message increases.

- Cache power is also directly related to coherence. A significant source of inefficiency is the directory itself; specifically, the directory's central role in all coherence operations necessitates costly (in power and latency) indirections through it.


[Figure 1. Speedup (a), normalized core energy-delay product (EDP) (b), normalized network and cache EDP averaged over all Splash-2 benchmarks (c), and normalized network and cache EDP for fft (d), radix (e), and ocean non-cont (f). Each panel plots results for 1 to 16 processors, comparing the base and proposed protocols.]


- Directory coherence, due to the large number of messages it sends and the large number of cache (directory) accesses it causes, eventually does not scale well in terms of power and performance (EDP). Indeed, as Figure 1 shows, cache-and-network EDP scalability suffers significantly as we run the Splash-2 benchmarks on an increasing number of cores.

To address these problems, we combine the use of tear-off cache blocks and writer prediction. This combination is important because tear-off blocks make going to the directory optional. This greatly simplifies writer prediction, which can be built on top of a tear-off protocol practically for free: guess a writer, go to it directly; if it is indeed the writer, simply create a tear-off copy. The combined protocol is minimal in its messaging and requires minimal hardware support over a standard MESI protocol.
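The read-miss path of the combined protocol can be sketched roughly as follows. This is only an illustrative C++ skeleton of the idea just described, not the authors' implementation; the function and type names (handle_read_miss, send_direct_transparent_read, and so on) are our own assumptions, and the messaging hooks are left as declarations.

    #include <cstdint>

    using Addr = uint64_t;
    using PC = uint64_t;
    using NodeId = int;
    constexpr NodeId kNoPrediction = -1;

    struct CacheLine { /* data, tag, TRO state bit, ... */ };
    struct Reply { NodeId supplier; CacheLine data; };

    // Hooks assumed to exist in the node's protocol engine (declarations only).
    NodeId predict_writer(PC pc, Addr addr);                    // PC-indexed predictor, address fallback
    Reply  send_direct_transparent_read(NodeId target, Addr a); // 2 hops if the target is the writer;
                                                                // otherwise the wrong node bumps the
                                                                // request to the directory (4 hops total)
    Reply  send_transparent_read_via_directory(Addr a);         // normal 3-hop directory path
    CacheLine install_TRO_copy(Addr a, const CacheLine& data);  // tear-off copy: not registered in the
                                                                // directory, discarded at the next sync
    void   update_predictor(PC pc, Addr a, NodeId supplier);

    CacheLine handle_read_miss(Addr addr, PC pc) {
        NodeId target = predict_writer(pc, addr);
        Reply r = (target != kNoPrediction)
                      ? send_direct_transparent_read(target, addr)   // skip the directory if we can
                      : send_transparent_read_via_directory(addr);   // safety net: go to the directory
        update_predictor(pc, addr, r.supplier);  // learn who actually supplied the data
        return install_TRO_copy(addr, r.data);   // the reader leaves no trace in the directory
    }

Whether the prediction is right or wrong, the block is installed as a tear-off copy, so the directory never needs to be updated on the read path; it is consulted only when the prediction misses or no writer exists.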

Furthermore, we use frugal (very power-efficient) predictors for the writer prediction. We use instruction-based prediction with a small PC-indexed predictor (near the L1) and fall back to simple address-based prediction when needed. Our predictors provide the necessary accuracy and improve EDP by as much as 82.62 percent in 16 cores. Furthermore, the EDP improvement in SARC coherence is not a trade-off between energy and delay, but rather an improvement in both.

Transparent reads and tear-off copies

Transparent reads are reads that do not register in the directory; instead, they create a tear-off, read-only (TRO) copy.5 The TRO copy is thrown away (self-invalidates) at the first synchronization event experienced by the core that issued the transparent read. Figure 2 shows an example. In the figure, the node on the left transparently reads a clean block (1) by going to the L2/directory. A transparent read leaves no trace in the directory but gets a clean cache block in read-only mode (RO); its state is designated tear-off or TRO. A writer (on the right of Figure 2) goes to the directory (2) and obtains the block with read/write permissions (R/W). The writer cannot invalidate the reader because the latter did not register in the directory. However, the transparent reader is bound to throw away the TRO block at the first synchronization point (3). The writer, on the other hand, registered in the directory, does not throw away its copy (4). The self-invalidated reader sends a new transparent read (5), which, after reaching the directory, proceeds to fetch the latest value from the writer (6). A second writer (7) invalidates the previous one (8) and registers its own ID in the directory.

With transparent reads, coherence in the traditional sense (that is, sending invalidations upon a write) is maintained only among writers. We call this writer coherence. Writers must still register in the directory so the latest value of the cache block can be safely tracked. A new write simply invalidates the previous writer of the block (typically, there are no readers registered in the directory). A significant difference between our work and prior work5 is that the writer does not downgrade with a transparent read. This significantly reduces upgrade traffic. Normal reads still downgrade a writer from read/write permissions to read-only.


[Figure 2. Basic operations with transparent reads and tear-off copies (tear-off, read-only [TRO] protocol). The directory timeline shows a transparent reader (node 0) and two writers: no invalidation is sent to the T-reader, the reader's cached TRO copy is destroyed at a sync event, writer copies survive syncs, and a second writer invalidates the first. RO: read only; R/W: read/write; T-reader: transparent reader.]


The downgraded copy, however, is not a TRO copy and is registered in the directory, which means that it will be invalidated by the next writer. Lastly, because the invalidation protocol is always active underneath the transparent-read and tear-off mechanisms, we can freely mix invalidation and tear-off copies, simultaneously, in the same directory entry.

Correctness with tear-off cache copies requires that programs be written as if for a weakly ordered memory system, which means that tear-off copies require correctly synchronized programs.5,6 As long as synchronization in the program is clearly identified and exposed to the hardware (to purge the TRO copies), programs run correctly. TRO copies are incompatible with data races such as those used for flag synchronization (flag synchronization based on ordinary reads and writes) because reads might completely miss all writes in the absence of any synchronization among them. Banning such code practices, however, might ultimately lead to safer code. Where needed, replacing flag synchronization with semaphore synchronization corrects the problem. In our case, the Splash-2 benchmarks run unmodified, without a glitch, on our protocol by simply making the synchronization primitives (atomic instructions and lock releases) visible to the hardware for the purpose of self-invalidating the TRO copies.
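To make the restriction concrete, the following small C++ sketch (our own illustration; the article describes the issue only in prose) contrasts flag synchronization, which TRO copies break, with semaphore synchronization, which they support:

    #include <semaphore.h>

    int data = 0;
    volatile int flag = 0;
    sem_t ready;   // assume sem_init(&ready, 0, 0) was called at startup

    // Flag synchronization (a data race): incompatible with TRO copies.
    // The consumer may keep spinning on a stale tear-off copy of 'flag',
    // because no invalidation is ever sent to it and no synchronization
    // event occurs inside the loop to purge the copy.
    void producer_flag() { data = 42; flag = 1; }
    void consumer_flag() { while (flag == 0) { /* spin */ } /* use data */ }

    // Semaphore synchronization: sem_post/sem_wait are built on atomic
    // instructions that the hardware sees as synchronization, so the
    // consumer's TRO copies are self-invalidated and fresh values are
    // fetched before 'data' is used.
    void producer_sem() { data = 42; sem_post(&ready); }
    void consumer_sem() { sem_wait(&ready); /* use data */ }

In the Splash-2 runs described above, only the synchronization primitives themselves need to be visible to the hardware; the application code is untouched.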

Transparent reads and tear-off copies have both performance and power implications. In terms of performance, writes become faster because reader invalidation (and its acknowledgment) is no longer required; in addition, upgrade traffic is reduced. When transparent reads and tear-off copies are coupled with a weakly ordered consistency model, the improvement in write performance does not contribute significantly to overall performance. In large part, weak consistency models hide write latency,6 so its reduction is unimportant in this context. Secondary performance benefits come from removing the invalidation and upgrade traffic from the network, reducing congestion and average message latency. On the other hand, significant power benefits come from removing much of the invalidation traffic as well as upgrade traffic.

A possible pitfall when using transparent reads is to needlessly discard, at synchronization, copies that do not change (read-mostly data). This could negatively impact both performance and power by constantly refetching such copies after synchronization events. To guard against this, we can use the directory to classify whether a cache block is read-mostly or frequently written. TRO copies are appropriate for frequently written lines (because self-invalidation would replace the frequent invalidations), whereas invalidation copies are most appropriate for read-mostly lines because they are rarely invalidated and can stay live in the caches for as long as needed. In our case, the tell-tale sign for read-mostly TRO copies is that the directory sees the same nodes repeating their transparent reads for the same line without an intervening write. Lebeck and Wood proposed a similar but distributed adaptation for tear-off copies based on block versioning, in which the nodes themselves decide whether to request TRO copies.5 We implemented our adaptation mechanism in the directory and, although it adds marginal benefit, it also increases the directory's complexity and cost. For this reason, in the rest of this article, we omit directory adaptation in the interest of clarity.

Writer prediction: Avoiding directory indirection

Although transparent reads and tear-off copies reduce the invalidation traffic, another significant source of inefficiency remains in directory protocols: the indirection through the directory. The directory (whether centralized or distributed) is a fixed point of reference to locate and obtain the latest version of the data from the writer or invalidate its sharers. It is difficult and rather complex to avoid an indirection via the directory,7 which leads to the classic three- or four-hop invalidation protocols, shown in Figures 3a and 3b.

Our main contribution is in exploiting the properties of transparent reads and tear-off copies and, with a simple and minimal protocol, going directly to the writers (by predicting their identity), skipping the directory when possible.


In essence, we let the writers assume the role of the directory (the central role of coherence) but revert back to directory indirection (the safety net) when we cannot locate the writers. We can freely intermix the TRO protocol and the TRO protocol enhanced with writer prediction with the base invalidation protocol at any time and for any cache block.

Reads

In directory protocols, a read goes to the directory to find the location of the latest version of the data. The last writer then forwards the correct data using a three-hop protocol (Figure 3b) or sends it back via the directory with a four-hop protocol (Figure 3a). TRO copies, however, let the reads avoid going to the directory altogether. Rather, reads try to obtain the data directly from the writer if they can locate it.

We use prediction to locate the current writer. Based on history, a reader sends a direct request to a predicted writer (Figure 3d). If this node has write permissions (read/write) for the requested cache block, the prediction is correct: the node is the (one and only) writer of the block and has the latest copy of the data. If the node is in any other state (including knowing nothing about the block), it bumps the request to the directory (Figure 3e). From there, the request is routed to the correct writer and back to the reader (as in a normal three-hop protocol). The penalty for a misprediction is one extra message (indirection via the wrong node).

Avoiding directory indirection for reads is important in two ways. First, reads, in contrast to writes, are performance-critical, meaning a reduction of their latency directly reflects on overall performance. Second, directory indirection accounts for a significant part of the read traffic. Eliminating it immediately impacts network power and performance. Writer prediction yields a two-hop transaction when correct, or a four-hop transaction on a misprediction. A simple calculation reveals that any prediction accuracy over 50 percent yields both performance and power benefits.
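The 50 percent threshold follows directly from the hop counts above. Writing p for the prediction accuracy, the expected read latency with prediction (two hops when correct, four on a misprediction) beats the three-hop directory path when

    \[ 2p + 4(1 - p) < 3 \iff 4 - 2p < 3 \iff p > 0.5. \]

The same inequality holds for message count (two messages when correct, four on a misprediction, three via the directory), so any accuracy above 50 percent improves both performance and network energy.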

Writes/upgrades

Under an invalidation protocol, a (new) writer sends its request to the directory, which either forwards the request to the previous writer if one exists (Figure 4a) or invalidates multiple readers (Figure 4b). In our case, the new writer predicts the previous writer and sends a direct request to it. Figures 4c, 4d, and 4e show the sequence of hops for the base three-hop invalidation as well as for correct prediction and misprediction.

- No prediction (Figure 4c). In the base three-hop invalidation, data from the old writer directly transfers to the new writer along with read/write permissions. The previous writer returns its acknowledgment to the directory (3b). This is a three-hop-latency protocol with an additional overlapping hop (3a and 3b overlap), which we designate as a (3+1)-hop protocol.

- Correct prediction (Figure 4d). A direct request from the new writer arrives at the predicted node. If the predicted node has read/write privileges, it is the block's writer. It returns the data to the requester, passing along its read/write privileges. It also informs the directory that it has relinquished its rights to the new writer and has self-invalidated. This is a (2+1)-hop protocol because it overlaps the last two messages. The previous writer's acknowledgment to the directory (message 2b) carries the new writer's identity and plays the same role as acknowledgment (message 3b) in the no-prediction case.


[Figure 3. Writer prediction aims to avoid the indirection via the directory (when possible) and go directly to the writer. Directory indirection protocols are either three-hop (a) or four-hop (b), depending on whether the data returns to the directory. Writer prediction protocols: no prediction (c), correct prediction (d), and misprediction (e). I: invalid; RO: read only; TRO: tear-off read only; R/W: read/write.]


- Misprediction (Figure 4e). The incorrectly routed request bumps off the wrong node, which is not the block's writer, and is rerouted to the directory. The penalty in this case is an extra request message indirection via the wrong node, resulting in a (4+1)-hop protocol.

The main idea here is that the previous writer assumes the directory role, passing out its read/write privileges. Similarly to the writer prediction for reads, when the prediction is correct, the predicted node indeed has write permissions to the cache block and returns data and read/write privileges to the requestor. The previous writer also notifies the directory of the new writer's identity and invalidates its local copy. This notification from the previous writer to the directory is the basis for resolving possible races among (new) writers, some of which might use prediction while others might go directly to the directory. We carefully checked the writer prediction's correctness for writes and upgrades and have emphasized resolving writer races safely.8

Correct writer prediction reduces the message count from four messages to three because it coalesces the two directory-indirection messages into one direct message to the writer. In addition to the power benefit of eliminating a message, there is a performance benefit because the critical latency further drops from three hops to two hops; hence, a (2+1)-hop protocol. However, a misprediction results in a four-hop, five-message protocol with negative impact on both power and performance. A simple calculation again reveals that a prediction accuracy of more than 50 percent starts to yield benefits.
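For completeness, the calculation for writes mirrors the one for reads. With p again denoting the prediction accuracy, the expected number of messages per write or upgrade with prediction beats the four messages of the base (3+1)-hop case when

    \[ 3p + 5(1 - p) < 4 \iff 5 - 2p < 4 \iff p > 0.5, \]

and the expected critical-path latency, 2p + 4(1 - p) hops versus three hops, gives the same 50 percent threshold.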

Although simple predictor schemes can achieve an accuracy of 50 percent or greater, writer prediction is futile in certain situations. Writer prediction in migratory sharing or unstructured sharing cannot, in any useful sense, be performed using history information. In this case, we could adapt the directory to prevent writer prediction for migratory data.9,10 We do not use migratory sharing classification in our evaluation, to avoid the increased cost and complexity and to keep our approach simple. In this respect, our results without it are conservative. This classification is not the same as the migratory optimization, that is, collapsing a read miss and a write miss into one;9-11 it simply prevents writer prediction in hopeless cases. The migratory optimization is orthogonal to our approach and can easily be incorporated in the same instruction-based predictor for additional performance and power benefits.

Implementation

Implementing transparent reads requires minimal changes in the cores, protocol engines, and caches.5


[Figure 4. Writer prediction on writes and upgrades: four-hop invalidation of a single writer (a), four-hop invalidation of multiple readers when no writer exists (b), base three-hop invalidation when there is no prediction (c), correct prediction (d), and misprediction (e).]


Protocol logic changes and modifications to MESI DiriNB12 or DASH-like directory protocols are simple.4 Discarding the TRO copies upon synchronization requires some support in the caches that is similar to but simpler than hierarchical decay counters, which minimally impact power consumption.13
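As a rough illustration of that cache support (our own sketch; the per-line TRO bit and the structure names are assumptions, and real hardware would gang-clear the bits rather than loop), self-invalidation at a synchronization event amounts to dropping every line marked TRO:

    // Drop all tear-off (TRO) copies when the local core executes a
    // hardware-visible synchronization primitive (atomic instruction,
    // lock release). No coherence messages are sent.
    struct L1Line {
        bool valid = false;
        bool tro   = false;   // set when the line was filled by a transparent read
        // ... tag, coherence state, data ...
    };

    void self_invalidate_tro(L1Line* lines, int num_lines) {
        for (int i = 0; i < num_lines; ++i) {
            if (lines[i].valid && lines[i].tro) {
                lines[i].valid = false;   // registered (non-TRO) copies are untouched
            }
        }
    }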

Prior work has shown that instruction-based prediction can efficiently capture the access behavior of programs and relate it to a small set of PCs.11 This efficiency is due to the fact that, unlike address prediction, which is based on data behavior, the behavior of the code is constant over time and can be learned quickly.11,14 In contrast to other work that aims to predict the destination set of readers,14 we predict the writer.

The predictor is a small structure indexed and tagged by the PC of instructions causing misses in the L1. By using just 64 entries (8-way set associative with least-recently-used replacement), we can capture 99 percent of the benefit of a predictor of any larger size. Each entry holds the predicted writer and a 2-bit saturating confidence counter. A side buffer holds the miss address and the PC for updating the predictor when a miss is satisfied. We implement this side structure with minimal cost as part of the L1 MSHRs.

Although the instruction-based predictor works well for instructions causing read and write misses, its performance for stores that cause upgrade misses lags behind. The reason is that a read miss precedes each upgrade miss. Because the predictor is updated when misses are satisfied, the writer information in this case is related to the load that caused the read miss. By the time the store that caused the upgrade miss is updated, the writer information is not available. To solve this problem, we rely on simple address-based prediction, which is carried by the TRO cache blocks. Each TRO copy carries the ID of the last known writer, that is, the node from which it got the data. The overhead for a 16-core CMP is just 4 bits per cache line, which is negligible. The performance of this simple scheme is enough to give stores a prediction accuracy comparable to, or better than, the loads.
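The following C++ sketch summarizes the two prediction structures. The sizes (64 entries, 8-way, 2-bit confidence, 4-bit writer ID per TRO line) follow the text, but the field layout, names, and replacement/allocation details are our own simplifying assumptions.

    #include <cstdint>

    // PC-indexed writer predictor: 64 entries organized as 8 sets x 8 ways.
    // Each entry holds a predicted writer and a 2-bit saturating confidence
    // counter (the modeled hardware table uses 2-byte tags and 1 byte of
    // prediction data per entry; full PCs are kept here for simplicity).
    struct PredEntry {
        uint64_t pc     = 0;
        uint8_t  writer = 0;     // predicted writer (node ID)
        uint8_t  conf   = 0;     // 2-bit saturating counter, 0..3
        bool     valid  = false;
    };

    struct WriterPredictor {
        static constexpr int kSets = 8, kWays = 8;
        PredEntry table[kSets][kWays];

        // Predict the writer for a miss caused by the instruction at 'pc'.
        bool lookup(uint64_t pc, uint8_t& writer) const {
            const PredEntry* set = table[pc % kSets];
            for (int w = 0; w < kWays; ++w) {
                if (set[w].valid && set[w].pc == pc && set[w].conf >= 2) {
                    writer = set[w].writer;
                    return true;
                }
            }
            return false;   // no confident prediction: use the fallback or go to the directory
        }

        // Called when a miss is satisfied; the PC and miss address come from
        // a small side buffer folded into the L1 MSHRs.
        void update(uint64_t pc, uint8_t actual_writer) {
            PredEntry* set = table[pc % kSets];
            for (int w = 0; w < kWays; ++w) {
                if (set[w].valid && set[w].pc == pc) {
                    if (set[w].writer == actual_writer) {
                        if (set[w].conf < 3) ++set[w].conf;
                    } else {
                        set[w].writer = actual_writer;
                        set[w].conf = 0;
                    }
                    return;
                }
            }
            // Simplified allocation (the real table uses LRU replacement).
            set[pc % kWays] = PredEntry{pc, actual_writer, 0, true};
        }
    };

    // Address-based fallback used for upgrade misses: each TRO line carries
    // the 4-bit ID of the node that supplied it (the last known writer);
    // 4 bits suffice for a 16-core CMP.
    struct TroLineMeta {
        uint8_t last_writer : 4;
    };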

Evaluation

We implemented our protocols using Gems on top of Virtutech Simics (http://www.virtutech.com). We configured Simics to simulate a 16-core SPARC running Solaris 10, and configured Gems/Ruby to model a CMP with a mesh interconnect. Table 1 shows the simulator parameters, and Table 2 shows the Splash-2 benchmarks with their inputs. As Figure 1 shows, all of the benchmarks show relatively good speedups up to 16 cores, except for radiosity, due to its small input (larger inputs for radiosity lead to long simulation times). We use full-run simulations (from start to completion of the parallel part of the application) to correctly compute EDP.


Table 1. Simulator configuration.

  Parameter                  Chip configuration
  ---------                  ------------------
  Processor                  16 cores
  Cores                      3 GHz, in-order, single-issue, blocking model (Simics)
  Block size                 64 bytes
  Data and instruction       256 Kbytes, 8-way, 3-cycle latency, pseudo-least-recently-used
  L1 caches                  (LRU) replacement (32 Kbytes, 64 Kbytes, 256 Kbytes, 512 Kbytes,
                             and 1 Mbyte for sensitivity studies8,15)
  Shared L2 cache            8-Mbyte NUCA, 16-way, 4 banks, 15-cycle latency for tag accesses
                             and 30-cycle latency for data accesses, pseudo-LRU; each L2 bank
                             is connected to the four edge routers of the 2D mesh network
  Memory                     4 Gbytes, 256-cycle latency
  Interconnection network    2D mesh topology, 2-cycle link latency, 16-byte flit size

Table 2. Benchmarks used in our evaluation.

  Benchmark           Input
  ---------           -----
  fft                 64 K complex doubles
  radix               2 M keys
  barnes              8 K bodies, 4 time steps
  radiosity           Room
  fmm                 8 K particles
  volrend             Head
  cholesky            tk29.0
  ocean non-contig.   258 x 258 grid
  ocean contig.       258 x 258 grid
  water-SP            512 molecules, 3 time steps
  water-Nsq           512 molecules, 3 time steps


Relying on a small part of the execution (for example, a fixed number of instructions) cannot yield reliable EDP results because instructions per cycle (IPC) is not the correct metric for CMPs. We concentrate on cache and network EDP because they are critical for scaling the coherence protocol. Individual core power (which we can compute using partial Gems/Opal runs) is roughly constant per benchmark, regardless of the core count used to run the benchmark. Thus, core EDP largely follows the speedup curves and so is not interesting in our study.

We rely on Orion 2.0 to compute network energy and Cacti 6.5 to compute cache energy. We used a 45-nanometer process technology for both Orion and Cacti. We derived all process-specific parameters used by Cacti 6.5 from the ITRS roadmap and modeled the caches in Cacti using the parameters in Table 1. We also modeled the instruction-based predictor in Cacti as an 8-way, 64-entry cache (2-byte tags and 1-byte prediction data). Our Cacti simulations reveal that the instruction-based predictor's energy per access is 10.06 percent of the energy per access consumed by a 256-Kbyte L1 cache (12.12 percent of a 32-Kbyte cache, 1.63 percent of the 8-Mbyte L2 cache, and 8.2 percent of a router). The extra power introduced by our instruction-based predictor is more than compensated for by reducing the data and control messages as well as by substituting L2 bank accesses with L1 accesses. Finally, we assume that the network link width for L1-to-router communication and for router-to-router communication is 16 bytes (the same as the flit size), while the network link width for L2-directory-to-router communication is 2 bytes.

Oracle writer prediction

To determine writer prediction's potential, we explore an oracle predictor with 100 percent accuracy. Its coverage gives the number of times a writer actually exists in a local cache and can supply the data with a direct transfer to a requestor. For the misses not covered by the oracle predictor, no writer exists in the system at the time of the miss; the data must be fetched by going to the directory (L2). The oracle predictor simply peeks at the directory to see who the writer is (the node holding the specific cache line in exclusive/modified state) whenever a prediction opportunity exists.

Figure 5 shows the potential benefit across core counts (from 2 to 16) and for a wide range of L1 cache sizes (from 32 Kbytes to 1 Mbyte) for the Splash-2 benchmarks. The potential benefit is significant in many of the benchmarks, but varies significantly within a benchmark in relation to core count and cache size. The potential benefit largely depends on the total amount of cache space available.


[Figure 5. Potential benefit from writer prediction, measured as the percentage of times a writer exists, for 2, 4, 8, and 16 processors and L1 cache sizes from 32 Kbytes to 1,024 Kbytes across the Splash-2 benchmarks. The benefit largely depends on how much cache space is available.]


Increasing either the core count or the cache size increases the total amount of L1 cache space available to the application. With more cache space, writers remain longer in the L1 caches and can satisfy direct requests.

Results

Figure 6 presents the overall results for execution time, energy, and EDP (network and cache) for 11 of the Splash-2 benchmarks with 256-Kbyte L1 caches and realistic writer prediction (comparable results for 32-Kbyte L1 caches, accounting for the reduced prediction potential, are available elsewhere15). Each graph shows the percentage reduction (improvement) in execution time, energy, and EDP normalized to the base MESI protocol. For each benchmark, we show results for 2, 4, 8, and 16 cores. In each case, we present results for the TRO protocol, for TRO with writer prediction (TRO + WP), and for TRO with oracle prediction (Oracle WP). As the graphs show, the TRO protocol alone achieves a modest reduction in execution time over the base protocol, from 0.3 percent (2 cores) to 9.48 percent (16 cores) on (arithmetic) average over all benchmarks, while writer prediction yields a reduction from 2.57 percent (2 cores) to 15.37 percent (16 cores). Both TRO and TRO + WP yield significant reductions in network and cache energy, ranging from 10.58 percent (2 cores) to 35.88 percent (16 cores) for TRO and from 27.66 percent to 52.8 percent for TRO + WP (Figure 6b). This is the result of eliminating a portion of the overall network traffic and reducing L1 misses and L2 accesses. The reduction in cache and network energy, combined with the execution time reduction, gives the EDP reduction shown in Figure 6c. Overall, the reduction in EDP is 8.66 percent (2 cores) to 39.77 percent (16 cores) for TRO and 17.38 percent to 50.78 percent for TRO + WP. These results support our initial argument for better EDP scaling.

To gain a better understanding of these results, especially with respect to individual benchmarks' behavior, we also present the reduction in network traffic in Figure 7. Because of space limitations, we present only the results for eight and 16 cores. We place traffic in three categories: invalidation traffic (invalidations and acknowledgments), data messages, and control messages (requests). According to our previous analysis, the TRO protocol almost eliminates the invalidation traffic. This is indeed the case for most benchmarks. Only cholesky sees just a modest reduction in invalidation traffic. In this case, although the actual invalidation messages are sufficiently reduced (up to 75 percent in 16 nodes), those that remain correspond to a relatively large number of acknowledgment messages.

The basic protocol transactions of TRO + WP remain invariant with respect to data traffic compared to the base protocol. TRO copies, however, can be self-invalidated and refetched even in the absence of intervening writes. This results in additional data traffic that does not exist in the base protocol. Recall that we do not use any directory classification (that is, of frequently written versus read-mostly lines) to avoid this. Increased data traffic due to this phenomenon appears in barnes, radiosity, and cholesky. Even when protocol classification is present, barnes cannot benefit from tear-off copies.5 In many benchmarks (fft, fmm, volrend, ocean-cont, ocean-ncont, water-spa, and water-ns), we see fewer data messages. In the base protocol, data are put back to the directory when a writer is downgraded because there is no "owner" state. In contrast, because we do not downgrade the writers on TRO reads, we achieve the same benefit as having an owner state, in addition to a significant reduction in L1 upgrade misses (32.06 percent on average in 16 cores).

Writer prediction also has a significant effect on the control (request) messages. The reduction in requests depends on both the correct predictions and the mispredictions (which introduce additional control messages). In general, the benchmarks with good accuracy also show a significant decrease in control messages. Prediction accuracy ranges from 91.43 percent in two cores to 73.97 percent in 16 cores, averaged over all benchmarks. Detailed accuracy and coverage results are available elsewhere.15

The combination of the changes in the three message categories (invalidation traffic, data messages, and control messages), weighted by their relative frequency in the program, actually represents the reduction in L2/directory accesses, which ranges from 34.08 percent for two nodes to 55.91 percent for 16 nodes (averaged over all benchmarks), reaching as high as 78.12 percent for ocean-cont in 16 nodes.15 L2 cache accesses contribute significantly to the total energy. Their reduction, combined with the reduced network traffic, constitutes the bulk of our energy savings.


[Figure 6. Execution time reduction (a), cache and network energy reduction (b), and EDP reduction (c), normalized to the base MESI protocol, for 256-Kbyte L1 caches. Each benchmark is shown for 2, 4, 8, and 16 processors, comparing the tear-off, read only (TRO) protocol, TRO with writer prediction, and oracle writer prediction.]



By simply relying on the independence of tear-off blocks from costly (and often complex) directory updates, our writer prediction approach avoids many of the problems and much of the complexity of earlier attempts to graft prediction onto cache coherence. It thus opens the way to efficiently incorporate further optimizations without inordinately complicating the coherence protocols. Inspiration for many optimizations can be drawn from the vast body of prior work on cache coherence, spanning hardware caches to software virtual memory systems. Our future work concentrates on incorporating such optimizations to further reduce power consumption and improve network performance. MICRO

References

1. M. Hill, "What Is Scalability?" Computer Architecture News, vol. 18, no. 4, 1990, pp. 18-21.
2. J.P. Singh and D. Culler, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998.
3. A. Ramirez et al., "The SARC Architecture," IEEE Micro, vol. 30, no. 5, 2010, pp. 16-29.
4. D. Lenoski et al., "The Stanford DASH Multiprocessor," Computer, vol. 25, no. 3, 1992, pp. 63-79.
5. A.R. Lebeck and D. Wood, "Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors," Proc. Int'l Symp. Computer Architecture (ISCA-22), IEEE Press, 1995, pp. 48-59.
6. K. Gharachorloo et al., "Programming for Different Memory Consistency Models," J. Parallel and Distributed Computing, vol. 15, no. 4, 1992, pp. 399-407.
7. M.E. Acacio et al., "Owner Prediction for Accelerating Cache-to-Cache Transfer Misses in a cc-NUMA Architecture," Proc. Int'l Conf. Supercomputing (ICS-02), IEEE Press, 2002, pp. 1-12.
8. S. Kaxiras, G. Keramidas, and I. Oikonomou, "Power-Efficient Scaling of CMP Directory Coherence," Proc. Workshop Programmability Issues for Multi-Core Computers, 2009; http://multiprog.ac.upc.edu/multiprog09/resources/proceedings2009.pdf.
9. A.L. Cox and R.J. Fowler, "Adaptive Cache Coherency for Detecting Migratory Shared Data," Proc. Int'l Symp. Computer Architecture (ISCA-20), IEEE Press, 1993, pp. 98-108.
10. P. Stenstrom, M. Brorsson, and L. Sandberg, "An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing," Proc. Int'l Symp. Computer Architecture (ISCA-20), IEEE Press, 1993, pp. 109-118.
11. S. Kaxiras and J.R. Goodman, "Improving CC-NUMA Performance Using Instruction-Based Prediction," Proc. Int'l Symp. High-Performance Computer Architecture (HPCA-5), IEEE Press, 1999, pp. 161-172.
12. A. Agarwal, M. Horowitz, and J. Hennessy, "An Evaluation of Directory Schemes for Cache Coherence," Proc. Int'l Symp. Computer Architecture (ISCA-15), IEEE Press, 1988, pp. 280-298.
13. S. Kaxiras, Z. Hu, and M. Martonosi, "Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power," Proc. Int'l Symp. Computer Architecture (ISCA-28), IEEE Press, 2001, pp. 240-251.
14. S. Kaxiras and C. Young, "Coherence Communication Prediction in Shared-Memory Multiprocessors," Proc. Int'l Symp. High-Performance Computer Architecture (HPCA-6), IEEE Press, 2000, pp. 156-167.
15. S. Kaxiras and G. Keramidas, "Power-Scalable Coherence," HiPEAC tech. report, 2010; http://www.hipeac.net/system/files/Power_Scalable_Coherence_TR.pdf.


[Figure 7. Relative traffic reduction (TRO + WP versus the base MESI protocol) for 256-Kbyte L1 caches, broken into invalidations, data messages, and control messages, for 8 and 16 processors across the Splash-2 benchmarks.]



Stefanos Kaxiras is a professor in the Information Technology Department at Uppsala University, Sweden. His research interests include power-efficient processor/memory design and modeling of power consumption and performance in multicore architectures. Kaxiras has a PhD in computer science from the University of Wisconsin.

Georgios Keramidas is a postdoctoral fellow at the Industrial Systems Institute, Greece, and a visiting assistant professor at the University of Patras, Greece. His research interests include computer architecture, in particular the design of the memory subsystem of single-core and multicore systems. Keramidas has a PhD in electrical and computer engineering from the University of Patras.

Direct questions and comments about this article to Georgios Keramidas, ISI, Patras Science Park S.A., Stadiou str., Platani, Patras GR-26504, Greece; [email protected].
