homepages.inf.ed.ac.uk/vnagaraj/papers/micro2015.pdf


Efficient Persist Barriers for Multicores

Arpit Joshi, University of Edinburgh ([email protected])
Vijay Nagarajan, University of Edinburgh ([email protected])
Marcelo Cintra, Intel, Germany ([email protected])
Stratis Viglas, University of Edinburgh ([email protected])

ABSTRACT
Emerging non-volatile memory technologies enable fast, fine-grained persistence compared to slow block-based devices. In order to ensure consistency of persistent state, dirty cache lines need to be periodically flushed from caches and made persistent in an order specified by the persistency model. A persist barrier is one mechanism for enforcing this ordering.

In this paper, we first show that current persist barrier implementations, owing to certain ordering dependencies, add cache line flushes to the critical path. Our main contribution is an efficient persist barrier that reduces the number of cache line flushes happening in the critical path. We evaluate our proposed persist barrier by using it to enforce two persistency models: buffered epoch persistency with programmer-inserted barriers, and buffered strict persistency in bulk mode with hardware-inserted barriers. Experimental evaluations using micro-benchmarks (buffered epoch persistency) and multi-threaded workloads (buffered strict persistency) show that using our persist barrier improves performance by 22% and 20% respectively over the state of the art.

Keywords
non-volatile memory, data persistence, persist barrier, multicore

1. INTRODUCTION
Recoverability in case of system crashes has long been an important problem for programmers and system designers. To recover from a crash, a consistent state of program data has to be maintained on some persistent storage medium. Traditionally, disks were the only medium for persistence, which led to the development of recoverability techniques designed for slow, block-based devices. But emerging non-volatile memory (NVRAM) technologies like Phase Change Memory (PCM) and Spin-Transfer Torque RAM (STT-MRAM) provide a fast, byte-addressable alternative for persistence.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. MICRO-48, December 05-09, 2015, Waikiki, HI, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-4034-2/15/12 ...$15.00. DOI: http://dx.doi.org/10.1145/2830772.2830805

Since NVRAM sits on the memory bus, existing programs can use NVRAM for persistence using CPU load and store instructions. Therefore, using NVRAM for persistence avoids the hardware (I/O device delay) and software (legacy file system interfaces) overheads of persistence, which can greatly improve application performance. An important challenge in designing NVRAM-based persistent memory is maintaining the consistency of data structures in persistent memory. Consistency of data structures is required to ensure correct recovery of program state after a crash. For example, consider a program which, while adding a node to a linked list, first writes to the new node and then updates a pointer to point to the new node. At the time of failure, the cache might have flushed the pointer write to persistent memory, but the write to the node might still be in the cache, leaving the data structure in persistent memory in an inconsistent state. Hence, to ensure consistency of memory state, the correct ordering of cache line flushes needs to be enforced. Most modern processors reorder writes to memory at multiple levels (e.g., load-store queue, cache hierarchy, memory controller) to optimize performance. Because of this reordering, a system failure might leave the data structures in persistent memory in an inconsistent state.
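The linked-list example can be sketched in C. The pm_flush and pm_fence primitives below are our own stubbed stand-ins for a real cache-line write-back plus ordering fence (e.g., clwb followed by a fence); the helper names are illustrative, not the paper's.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for persistent-memory primitives: on real hardware these
 * would issue a cache-line write-back and an ordering fence. Stubbed
 * here so the sketch stays portable. */
static void pm_flush(const void *p, size_t n) { (void)p; (void)n; }
static void pm_fence(void) { }

struct node { int value; struct node *next; };

/* Crash-consistent insert at head: the node's contents are made
 * durable BEFORE the pointer that makes the node reachable. Allowing
 * these two persists to reorder is exactly the inconsistency
 * described above. */
static void insert(struct node **head, struct node *n, int v)
{
    n->value = v;
    n->next  = *head;
    pm_flush(n, sizeof *n);       /* persist the node first...   */
    pm_fence();
    *head = n;                    /* ...then publish the pointer */
    pm_flush(head, sizeof *head);
    pm_fence();
}
```

On a real machine the stubs would be replaced by platform intrinsics; the ordering structure is what matters.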

To ensure consistency of persisted data, a mechanism like a persist barrier is needed to enforce the ordering of writes to NVRAM. A persist barrier ensures that stores appearing before the barrier persist before the stores appearing after the barrier. One way to implement a persist barrier is by using existing instructions like clflush and mfence 1 [3, 4, 5, 6, 7]. However, this implementation tightly couples visibility and persistence. This coupling forces persistence (e.g., cache line flushes) to happen in the critical path of execution, which can lead to significant performance degradation [8].

1 Although these instructions ensure the correct order of cache line flushes, they provide no guarantees on the order in which these cache lines persist (are written to NVRAM by the memory controller) [1]. For this, additional instructions like the new pcommit [2] need to be used.

Condit et al. [9] propose hardware support for realising an improved persist barrier that enforces persist ordering lazily. We refer to this barrier as the Lazy Barrier (LB). Their key idea is to decouple visibility from persistence, allowing program execution to continue beyond the persist barrier without waiting for stores from previous epochs 2 to persist; their memory system ensures that stores persist in the correct order, out of the critical path.
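The critical-path coupling of a clflush+mfence-style barrier can be made concrete with a small software model. The dirty-line log and counters below are our own illustrative stand-ins (a real implementation issues the CPU instructions directly); the point is that every line dirtied since the last barrier is flushed synchronously before execution proceeds.

```c
#include <assert.h>

/* Illustrative model of a clflush+mfence-style persist barrier. The
 * dirty-line log is a stand-in for the set of cache lines written
 * since the last barrier. */
#define MAX_DIRTY 64
static const void *dirty[MAX_DIRTY];
static int ndirty = 0;
static int flushes_in_critical_path = 0;

static void tracked_store(int *addr, int val)
{
    *addr = val;
    if (ndirty < MAX_DIRTY)
        dirty[ndirty++] = addr;     /* remember the line for flushing */
}

static void persist_barrier(void)
{
    /* Every flush here stalls the program: visibility and persistence
     * are coupled, so writes must become durable before execution can
     * move past the barrier. */
    for (int i = 0; i < ndirty; i++)
        flushes_in_critical_path++; /* stand-in for clflush(dirty[i]) */
    ndirty = 0;                     /* stand-in for mfence ordering   */
}
```

The Lazy Barrier removes exactly this synchronous loop from the common case.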

In LB, the memory system delays the flushing of cache lines; a cache line is flushed only due to natural eviction (e.g., replacement) or due to a forcible eviction. Forcible evictions are required in case of epoch conflicts, where old epochs need to be flushed in the critical path to ensure consistency of data in persistent memory. For example, before a dirty cache line belonging to a newer epoch can be replaced, cache lines written in older epochs need to be flushed first. Thus, in case of an epoch conflict, the conflicting request has to wait until all the relevant epochs have been flushed to memory. Epoch conflicts bring persist ordering constraints, and consequently cache line flushes, back into the critical path. This again couples visibility with persistence for the duration of conflicts.

In this paper, we design and implement a persist barrier (LB++) which improves upon LB. We first categorize epoch conflicts into inter-thread and intra-thread conflicts. We then propose optimizations to reduce the overhead due to these conflicts. We propose an Inter-thread Dependence Tracking (IDT) mechanism for dynamically tracking inter-thread dependencies in hardware, which allows us to reduce the overhead of inter-thread conflicts. We then propose a Proactive Flushing (PF) scheme to flush epochs proactively, as opposed to the reactive approach of LB. Once an epoch completes, the values of all its cache lines are final. PF exploits this property and starts flushing cache lines on completion of epochs. A related issue in multi-threaded programs is that deadlocks can occur on inter-thread epoch conflicts if the programmer does not correctly place persist barriers. We present a solution that breaks persistence deadlocks by splitting epochs on detecting scenarios which could potentially lead to deadlocks. Finally, we propose a detailed protocol for flushing epochs in the correct order for a system with multi-banked caches.

We demonstrate the efficacy of LB++ by employing it to enforce two persistency models [8]. First, we use it to enforce Buffered Epoch Persistency (BEP) [8]. Using micro-benchmarks, we show that using LB++ (as opposed to LB) improves performance by 22%. Second, we show how LB++ coupled with logging support can be used to enforce Buffered Strict Persistency (BSP) [8] in bulk mode. Our experiments using a subset of the PARSEC, SPLASH and STAMP benchmarks show that LB++ allows us to support BSP with a 1.3× execution time overhead over a non-persistent execution, an improvement of 20% over LB.

2 Epochs are groups of instructions delimited by persist barriers.

2. BACKGROUND
In this section, we first provide an overview of persistency models [8] and highlight the limitations of the current implementation [9] of those models. We then present the system configuration that we consider for our work.

2.1 Persistency Models
Our focus in this paper is on leveraging NVRAM technologies to enable fast persistence for programs. In order to enable correct recovery, program state that is persistent in NVRAM needs to be in a consistent state. The definition of what constitutes a consistent state depends on the programming model or, more specifically, on the persistency model [8]. One easy way to understand persistency models is to think about them in relation to memory consistency models. Just as consistency models allow us to reason about the visibility of stores, persistency models allow us to reason about the durability of stores. Pelley et al. [8] introduce three persistency models: strict, epoch and strand persistency. Here we focus only on strict and epoch persistency.

Strict persistency (SP) couples memory persistency with memory consistency. So at the point of failure, whatever updates are visible are guaranteed to have been persisted. For example, Total-Store-Order (TSO) systems under strict persistency operate under the following rules: S1) stores persist in program order, and S2) a store cannot be made visible until the previous store (in program order) has persisted. A sequence of stores to different cache lines under SP is shown in Figure 1(a). As shown in the figure, SP creates persist ordering constraints at the level of each store operation. Hence, caches effectively have write-through behaviour. Essentially, these fine-grained persist ordering constraints conflict with two key optimizations employed in most modern processors. First, multiple stores to a cache line are coalesced in the caches and only written back to memory on a cache line replacement. Under SP, since a store operation cannot be issued until the previous store operation persists (rule S2), multiple stores to a cache line cannot be coalesced (as shown in Figure 1(a) for cache line a). Second, processors reorder cache line persists to improve performance by exploiting temporal and spatial locality. This reordering happens in caches as well as in memory controllers.
But under SP, cache lines have to be flushed in program order (rule S1), eliminating any possible performance gain from reordering writes to memory.

Figure 1: Timeline for completion of memory requests for various persistency models: (a) strict persistency, (b) epoch persistency, (c) buffered epoch persistency.

Epoch persistency (EP) relaxes persist ordering constraints compared to SP and enforces ordering at the granularity of epochs [9]. An epoch is a contiguous group of instructions demarcated using a primitive known as a persist barrier. A system with EP operates under the following rules: E1) stores belonging to different epochs persist in the order of their respective epochs, and E2) a new epoch cannot begin until all stores belonging to the previous epoch have persisted. Thus, EP allows coalescing of stores and reordering of persists for stores belonging to the same epoch. A sequence of stores to different cache lines under EP is shown in Figure 1(b). As shown in the figure, EP allows coalescing for cache line a, which reduces the overall time taken to complete the sequence of accesses compared to SP. Moreover, since cache lines belonging to the same epoch can persist out of order, cache line b can persist before cache line a. In EP, persist operations are in the critical path of execution upon completion of an epoch. Even though EP allows write coalescing and reordering of persists within an epoch, it still has a high performance overhead over volatile execution.

Buffered Epoch Persistency (BEP) 3: The fundamental reason for the overhead of EP is that persist operations are in the critical path of execution (because of rule E2). BEP further relaxes constraint E2. Thus, BEP only requires that stores belonging to different epochs persist in the order of their respective epochs. BEP allows program execution to continue across epoch boundaries without waiting for previous epochs to persist. In this case, the cache sub-system has to ensure that epochs are flushed in correct epoch order. Figure 1(c) shows the timeline for a sequence of stores under BEP. The persist barrier after Epoch1 does not prevent Epoch2 from executing before all the cache lines modified by Epoch1 persist. While program execution continues, modified cache lines can persist naturally because of replacement, as shown in the figure for cache lines b and a. In BEP, persist operations are not in the critical path of execution as long as there are no epoch conflicts. An epoch conflict is a scenario where a memory request triggers an epoch flush. In Figure 1(c), a store to cache line f conflicts with Epoch2 because cache line f has been modified in Epoch2, which has not yet persisted. The store request has to wait until all epochs up to Epoch2 are flushed. Only epoch conflicts bring the persist operation into the critical path of execution for BEP.

3 Pelley et al. [8] do not explicitly differentiate between epoch and buffered epoch persistency.

Figure 2: System configuration: multiple cores (C), a volatile shared multi-banked last level cache (LLC) and multiple memory controllers (MC) connected by an on-chip interconnection network.

The persist barrier proposed by Condit et al. [9] (LB) is essentially an implementation of BEP. To track the current epoch, each core is extended with an epoch ID counter which is incremented by one each time a persist barrier is encountered. Whenever a store completes, it is tagged with the value of the current epoch ID and core ID 4. To track the status of cache lines, cache tags are extended to include epoch ID and core ID fields. The core ID identifies the core that last modified the cache line, and the epoch ID identifies the epoch in which the cache line was modified. Using these hardware extensions, cache controllers can track and enforce persist ordering dependencies between epochs belonging to the same core.
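This bookkeeping can be sketched in software as follows. The struct layout and field widths are illustrative only: as noted below, Condit et al. actually pack the core ID and epoch ID into a single partitioned counter.

```c
#include <assert.h>

/* Illustrative sketch of LB's epoch tracking: each core keeps an epoch
 * ID counter bumped at every persist barrier, and every dirty cache
 * line is tagged with the (core ID, epoch ID) that last wrote it. */
struct line_tag { unsigned core_id; unsigned epoch_id; int dirty; };
struct core     { unsigned id; unsigned epoch_ctr; };

static void persist_barrier(struct core *c)
{
    c->epoch_ctr++;               /* subsequent stores open a new epoch */
}

static void on_store(struct core *c, struct line_tag *t)
{
    t->dirty    = 1;
    t->core_id  = c->id;          /* core that last modified the line */
    t->epoch_id = c->epoch_ctr;   /* epoch in which it was modified   */
}
```

With these tags, the cache controller can compare the epoch ID of a line against older epochs from the same core to enforce per-core persist ordering.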

We call persists happening in the critical path online persists, and persists happening out of the critical path offline persists. Online persists have a direct performance impact because they delay program execution while waiting for persist operations to complete. Offline persists, on the other hand, have no direct performance impact since they are not in the critical path. The current implementation, LB, delays persist operations by buffering epochs and relies on offline persists in the form of natural cache line replacements. Although this design improves performance, it is not optimal for two reasons. First, conflicts trigger online persists, thus delaying program execution. Second, this design does not actively reduce the number of online persist operations. We elaborate on these limitations and present solutions in Section 3.

Buffered Strict Persistency (BSP) [8] is the result of relaxing constraint S2 of strict persistency. Although this removes persistence from the critical path, the problems of not being able to coalesce writes and reorder persists remain. These problems in turn trigger frequent conflicts, resulting in a larger percentage of persist operations being online persists. We present an optimized implementation of BSP in bulk mode with logging support in Section 7.2.

4 Condit et al. [9] partition the epoch ID counter, using the high-order bits as the core ID and the remaining bits as the epoch ID. We show them as two fields for the sake of simplicity.


2.2 System Configuration
We consider a multicore system as shown in Figure 2. In this system, each core (C) has a private cache and all the cores share a multi-banked last level cache (LLC). All the caches are volatile. The system has multiple memory controllers (MC) to provide sufficient memory bandwidth for a large number of cores. These memory controllers are connected to NVRAM. This system is similar to most modern server processors, the only difference being that memory in our system is non-volatile.

3. PERSIST BARRIER DESIGN
The goal of LB is to decouple persistence from visibility, which allows persist operations to happen out of the critical path. LB achieves this goal as long as there are no epoch conflicts. In this section, we first describe two types of conflicts and propose optimizations to reduce the overheads they cause. We also illustrate the problem of epoch deadlocks and present a solution for it.

3.1 Resolving Inter-thread Conflicts with IDT
An inter-thread conflict is a scenario where a thread tries to read or write a cache line which has been modified by some other thread in an epoch which has not yet been flushed. Consider the example shown in Figure 3(a). Thread T0 consists of epochs E00 and E01. Thread T1 consists of epochs E10 and E11. T0 tries to read address Y in epoch E01 after T1 has written to it in epoch E11. This creates a new persist ordering constraint: epoch E11 of thread T1 should persist before epoch E01 of thread T0. The epoch tracking hardware in LB can only track persist ordering constraints between epochs from the same core. Since this is an inter-thread ordering constraint, before completing the Ld Y request from T0, epoch E11 needs to be persisted; if not, epoch E01 might persist before epoch E11, leaving persistent data in an inconsistent state. It is important to note here that even a read request (Ld Y) creates an epoch conflict, since it leads to a persist ordering constraint which LB cannot track.

Inter-thread conflicts can lead to significant performance degradation, as shown in Figure 4(a), which shows memory requests issued by two threads T0 and T1. T1 issues request RB to read cache line B. Since B has been modified by T0 in epoch E00, it triggers an inter-thread conflict, which in turn triggers an online persist of E00. This delays the completion of request RB.

Inter-thread Dependence Tracking (IDT). If hardware support is provided for tracking inter-thread ordering constraints, the impact of inter-thread conflicts can be reduced. We define the epoch from which a request triggered an inter-thread conflict as the dependent epoch, and the epoch which last modified the requested cache line as the source epoch.

To avoid online persists of epochs in case of inter-thread conflicts, we propose a mechanism called IDT. On detecting a conflict, instead of waiting for the conflicting epoch to flush, IDT records the source and dependent epochs and enforces this dependence offline. Thus a conflicting request does not have to wait for older epochs to persist. Figure 4(b) illustrates the possible performance improvement from using IDT. Request RB from thread T1 does not wait for the persist of epoch E00 to complete. IDT records the dependence between epochs E00 and E11 and allows request RB to complete. When epoch E11 completes, cache line E is not allowed to persist until E00 has persisted. Thus the overall completion time is reduced while the correct persist ordering constraints are still enforced.

Figure 3: Examples illustrating epoch conflicts. (a) An inter-thread conflict, where epoch E01 tries to read cache line Y modified in epoch E11. (b) An intra-thread conflict, where epoch E02 tries to modify cache line B modified in epoch E00.
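A minimal software model of IDT might look like the following. The table sizes, the persisted[][] array and the function names are our illustrative stand-ins for hardware state, not the paper's actual design.

```c
#include <assert.h>

/* Illustrative model of Inter-thread Dependence Tracking: on a
 * conflict, record a (source epoch -> dependent epoch) edge instead of
 * flushing the source epoch online. The dependent epoch is then held
 * back from persisting until its source epoch is durable. */
#define MAX_CORES  4
#define MAX_EPOCHS 16
#define MAX_DEPS   32

struct dep { int src_core, src_epoch, dep_core, dep_epoch; };
static struct dep deps[MAX_DEPS];
static int ndeps = 0;
static int persisted[MAX_CORES][MAX_EPOCHS]; /* stand-in for HW state */

static void record_dependence(int sc, int se, int dc, int de)
{
    struct dep d = { sc, se, dc, de };
    if (ndeps < MAX_DEPS)
        deps[ndeps++] = d;        /* conflicting request completes now */
}

/* Checked offline, before the dependent epoch's lines are flushed. */
static int may_persist(int core, int epoch)
{
    for (int i = 0; i < ndeps; i++)
        if (deps[i].dep_core == core && deps[i].dep_epoch == epoch &&
            !persisted[deps[i].src_core][deps[i].src_epoch])
            return 0;             /* source not yet durable: wait */
    return 1;
}
```

The key property is that the conflicting request completes immediately; the ordering check moves off the critical path, into the flush logic.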

3.2 Resolving Intra-thread Conflicts with PF
An intra-thread conflict is a scenario where a thread tries to write to a cache line which it has already modified in some prior epoch and which has not yet been flushed. Consider the example shown in Figure 3(b). Thread T0 writes to address B in epoch E00. It then tries to write to the same address again in epoch E02. At this point in time, the previous value of B has not yet persisted. If the operation St B in epoch E02 completes, the value of B will be overwritten. Now if the system crashes after persisting epoch E00 but before persisting epoch E01, persistent memory will be left in an inconsistent state. To prevent this scenario, before completing St B in epoch E02, epoch E00 needs to be persisted. It is important to note here that a read request (Ld A) does not create a conflict, as the persist ordering constraint between epochs within a thread is already tracked by LB. On an intra-thread conflict, the epoch that last wrote to the conflicting cache line (B in the example) and all the epochs before it need to be flushed. The only way to minimize the performance impact of this type of conflict is to minimize the number of such conflicts.

Figure 4: Example showing the benefit of the IDT optimization. (a) Completion of a conflicting request is delayed waiting for the persist of the source epoch to complete. (b) With the same example, IDT improves performance by reducing the completion time of the conflicting request and allowing the source epoch to persist offline, while still enforcing persist ordering constraints.

Proactive Flushing (PF). To mitigate the problem of intra-thread conflicts, we propose to persist epochs proactively. Proactive flushing increases the number of epochs persisting offline. An intra-thread conflict happens because a cache line modified by some older epoch has not yet persisted through natural cache line eviction. By flushing a cache line proactively, we reduce the probability of a conflict arising from a subsequent access to it. This decrease in the probability of an intra-thread conflict results in improved performance. It is worth noting that proactive flushing similarly reduces the probability of inter-thread conflicts too. For ease of explanation, we have introduced it in relation to intra-thread conflicts.

While persisting epochs proactively, care also needs to be taken to ensure that we do not increase the number of flushes to memory. In other words, a cache line should be persisted only when its value is final, which avoids multiple writes to memory to persist the same cache line. Therefore, we propose persisting epochs proactively after they complete. Naturally, we cannot start the proactive persist of an epoch if the previous epoch has not yet fully persisted.
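The policy above can be modelled as a queue of completed epochs drained oldest-first; the queue structure and names below are our sketch, not the hardware interface.

```c
#include <assert.h>

/* Illustrative model of Proactive Flushing: epochs become eligible for
 * flushing only once they complete (their values are then final), and
 * they drain strictly oldest-first so persist ordering is preserved. */
#define MAX_Q 16
static int completed[MAX_Q];   /* epoch IDs, in completion order */
static int q_head = 0, q_tail = 0;

static void epoch_completed(int epoch_id)
{
    completed[q_tail++ % MAX_Q] = epoch_id;  /* eligible for offline flush */
}

/* Background drain step: returns the epoch flushed, or -1 if none is
 * pending. Always the oldest completed epoch, never a newer one. */
static int proactive_flush_one(void)
{
    if (q_head == q_tail)
        return -1;
    return completed[q_head++ % MAX_Q];
}
```

Because flushing starts at completion rather than at replacement, a later access to one of the epoch's lines is far less likely to find it still unpersisted.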

An epoch persist operation consists of durably writing all the modified cache lines belonging to the epoch being persisted to memory. An important aspect of persisting epochs proactively is to ensure that the epoch persist operation does not invalidate cache lines.

Figure 5: (a) An example of a persistent epoch deadlock. Epochs Ei and Ej, belonging to threads T0 and T1 respectively, have a circular dependence: Ej reads cache line A modified by Ei, and Ei reads cache line X modified by Ej. (b) The possible epoch deadlock between epochs Ei and Ej is avoided by splitting the ongoing epoch Ei into epochs Ei1 and Ei2 on detecting the conflict with epoch Ej.

If the hardware employs mechanisms similar to the clflush instruction, cache lines are invalidated while being written back to memory. This has a negative impact on overall system performance, as the persist operation starts evicting working sets from the cache. We instead implement a non-invalidating flush operation similar to the new clwb instruction [2]. This mechanism does not invalidate the cache line being persisted, thus avoiding any negative impact on performance.
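The difference between an invalidating and a non-invalidating write-back can be shown with a toy cache-line model; clflush_like and clwb_like are our illustrative stand-ins for the behaviour of the respective instructions, not their real implementations.

```c
#include <assert.h>

/* Toy model of a cache line: a clflush-style write-back invalidates
 * the line (forcing a later miss), while a clwb-style write-back
 * leaves it valid in the cache. Both make the data durable. */
struct cache_line { int valid, dirty, data; };
static int nvram_copy;            /* stand-in for the persistent copy */

static void clflush_like(struct cache_line *l)
{
    nvram_copy = l->data;
    l->dirty = 0;
    l->valid = 0;                 /* working set evicted */
}

static void clwb_like(struct cache_line *l)
{
    nvram_copy = l->data;
    l->dirty = 0;                 /* line stays valid for future hits */
}
```

Under proactive flushing, the clwb-style operation is what keeps the flush from degrading cache hit rates.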

3.3 Epoch Deadlocks and their Avoidance
In multi-threaded applications, epochs belonging to different threads are independent and can persist in parallel as long as there are no inter-thread dependencies. In the presence of dependencies, epochs need to persist in the order specified by the dependencies. For the example shown in Figure 3(a), epochs E00 and E10 are independent and can persist in parallel, whereas epoch E01 is dependent on epoch E11, so E01 cannot persist before E11 if memory state is to remain consistent. This dependence is enforced by the epoch persistence mechanism either by flushing epoch E11 before completing the Ld Y request of epoch E01, or by tracking the inter-thread dependence relation between epochs E01 and E11 and ensuring that E11 persists before E01.

The epoch persistence mechanism can enforce epoch ordering constraints when the dependence relation between epochs is linear. Consider the example shown in Figure 5(a). Epochs Ei and Ej have a circular dependence between them. On encountering the Ld A request by T1, LB identifies the Ei-to-Ej dependence and tries to flush Ei before completing the request. But a flush of epoch Ei cannot complete because the epoch is ongoing. Then, on encountering the Ld X request by T0, the epoch persistence mechanism tries to flush Ej before completing that request. This leads to a deadlock.

To prevent deadlocks, scenarios in which a circular dependence relation arises need to be avoided. We propose a solution to epoch deadlocks that conservatively prevents the scenario which can lead to a circular dependence. This solution is based on the observation that a circular dependence can only occur if a request triggers an inter-thread dependence with an ongoing epoch. By ongoing epoch we mean an epoch whose persist barrier has not yet occurred; in other words, an epoch which has not yet completed. If a request triggers an inter-thread dependence with a completed epoch, then there is no chance of an inverse dependence, since no memory operations are pending in the completed epoch. On detecting an inter-thread dependence with an ongoing epoch, our proposal is to divide the source epoch into two parts: the first part includes all the operations completed at the time of detection, and the second part is the remaining portion of the epoch. Without IDT we would have had to flush the first part of the epoch, whereas with IDT it suffices to register the inter-thread dependence in hardware before completing the request. It is worth noting that, by breaking the source epoch, we have ensured that there is no chance of an inverse dependence. Figure 5(b) illustrates the solution with an example. When the first dependence is detected with respect to ongoing epoch Ei, Ei is split into epoch Ei1, which consists of the part of the epoch that has already completed (until St C), and the remaining part, which is called Ei2.

Discussion. To prevent epoch deadlock scenarios from happening, persist barriers need to be placed appropriately. One way to prevent epoch deadlocks is to place persist barriers at the start and end of each critical section [10]. However, there can be programs where it is difficult to identify where to place epoch barriers to prevent deadlock (e.g., lock-free programs, programs with user-defined synchronisation, etc.). The deadlock avoidance scheme described above can be useful in such programs.
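The epoch-split rule can be summarized in a few lines of pseudo-model. The following Python sketch is illustrative only (class and function names are ours, not the paper's hardware): on an inter-thread dependence with an ongoing source epoch, the completed prefix is closed off as Ei1 so that the dependent epoch waits only on a completed epoch, removing any possibility of an inverse dependence.

```python
# Illustrative model of the split-on-conflict deadlock-avoidance rule.
# Epoch/split_on_conflict are our own names, not from the paper.

class Epoch:
    def __init__(self, tid, eid):
        self.tid, self.eid = tid, eid
        self.completed_stores = []   # stores already performed in this epoch
        self.ongoing = True          # persist barrier not yet reached

def split_on_conflict(source):
    """On an inter-thread dependence targeting an ongoing source epoch,
    split it so the dependence targets a *completed* sub-epoch."""
    if not source.ongoing:
        # Dependence on a completed epoch: no inverse dependence is
        # possible, so IDT just records it -- nothing to split.
        return source
    first = Epoch(source.tid, (source.eid, 1))
    first.completed_stores = source.completed_stores
    first.ongoing = False            # Ei1 is closed at the detection point
    source.completed_stores = []     # remainder becomes Ei2
    source.eid = (source.eid, 2)
    return first                     # dependent epoch now waits on Ei1 only

ei = Epoch(tid=0, eid='Ei')
ei.completed_stores = ['St A', 'St B', 'St C']
src = split_on_conflict(ei)
assert not src.ongoing and src.completed_stores == ['St A', 'St B', 'St C']
assert ei.ongoing and ei.completed_stores == []
```

The key property is that the returned sub-epoch is always completed, so registering the dependence against it can never participate in a cycle.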

4. PERSIST BARRIER IMPLEMENTATION

In this section, we describe the implementation of our persist barrier. We first present a detailed epoch flush protocol that enforces persist ordering correctly in systems with multi-banked caches. We then describe how IDT and PF are implemented. We summarize by highlighting the additional hardware required for our persist barrier implementation.

4.1 Epoch Flush Protocol

To ensure consistency of data in NVRAM, epochs need to be persisted in the correct order. The union of the intra-thread program order and the inter-thread shared memory dependencies defines this epoch happens-before order. The goal of the epoch flush protocol is to ensure that the order in which epochs are persisted is consistent with this happens-before order.

In the system presented in Section 2.2, caches are volatile, so persistence happens only when epochs have been written back to NVRAM. Hence, it is sufficient to ensure that for any two epochs E1 and E2 such that E1


Figure 6: Line diagram explaining the handshaking protocol for the Epoch Flush implementation in a multicore with a monolithic last level cache.

happens-before E2, the last level cache (LLC) will not flush a cache line belonging to epoch E2 until all the cache lines belonging to epoch E1 have persisted.

To satisfy the above constraint, the LLC has to identify two pieces of information: first, the set of all the cache lines belonging to each epoch; second, when a cache line has persisted. These two pieces of information identify when an epoch has persisted, at which point the LLC can potentially start persisting its successor(s). The LLC also needs to know when the L1 cache has written back all the cache lines belonging to an epoch. To convey this information, the L1 controller sends an epoch completion message (EpochCMP) to the LLC after writing back all the cache lines belonging to an epoch. This informs the LLC that it has seen all the cache lines belonging to that epoch. It is important to note that receiving an EpochCMP message for a given epoch is a prerequisite for completing the flush of that epoch. If the LLC has to flush an epoch but has not yet received the corresponding EpochCMP message, it can request L1 to flush all the cache lines belonging to that epoch.

Monolithic LLC. If the LLC is monolithic, it can flush epochs independently after receiving EpochCMP messages for the epochs being flushed. Figure 6 illustrates the protocol. After flushing the epoch preceding epoch E, the LLC in step 1 starts flushing cache lines belonging to epoch E. Memory controllers respond with a PersistACK message after durably writing cache lines to NVRAM in step 2. On receiving PersistACK messages for all the cache lines flushed, the LLC registers epoch E as having persisted and can start persisting the subsequent epoch. Note that the LLC has already received the EpochCMP message for epoch E, and hence it can consider the flush of E complete.

Multi-banked LLC. The two-step protocol presented above works for monolithic caches, but it does not directly extend to a system with multi-banked caches. Consider the example shown in Figure 7(a). The system consists of an L1 cache and two banks of LLC. Epoch E1 consists of two cache lines A and B mapping to LLCB0 and LLCB1 respectively. Epoch E2 consists of cache line C mapping to LLCB1. L1 first flushes


Figure 7: (a) An example of how the epoch ordering constraint is violated: cache line C belonging to epoch E2 persists before cache line B belonging to the previous epoch E1. (b) Correct enforcement of the epoch ordering constraint: LLCB1 delays persisting cache line C belonging to epoch E2 until all the cache lines belonging to the previous epoch E1 have persisted.

epoch E1 and then epoch E2. LLCB1 decides to flush epoch E1 and hence flushes cache line B. Meanwhile, LLCB0 delays flushing cache line A. When LLCB1 receives cache line C belonging to epoch E2, it flushes that line too: LLCB1 is allowed to flush epoch E2 because all of its own cache lines belonging to the previous epoch E1 have already been flushed. This violates the epoch ordering constraint, since a cache line belonging to epoch E2 persists before the previous epoch E1 has been flushed completely. If the system crashes at this point, persistent memory will be left in an inconsistent state. The violation shown in Figure 7(a) happened because, in a multi-banked cache organisation, each bank only handles a range of addresses and has no information about the status of cache lines outside that range. In the example, LLCB1 had no information about the pending cache line belonging to epoch E1 in LLCB0. To avoid this scenario, a bank of the LLC should not start flushing a cache line until all the banks have completed persisting the previous epoch. With this constraint, LLCB1 will not flush cache line C until LLCB0 has also flushed all the cache lines belonging to epoch E1. This correctly enforced ordering is shown in Figure 7(b), where LLCB1 does not flush cache line C belonging to epoch E2 until LLCB0 has flushed cache line A belonging to epoch E1.

To persist epochs in the correct order, all the banks of the LLC need to communicate with each other to coordinate the flushing of every epoch. If all the banks directly send epoch-completion messages to each other, O(n^2) messages are required, where n is the number of banks. This can be a prohibitively large overhead, especially considering that it must be incurred for each epoch that is flushed. Instead, we propose using an arbiter module to control the flushing of epochs. All the banks inform the arbiter when they have completed persisting an epoch. The arbiter, on receiving messages from all the banks, broadcasts a message indicating that the relevant epoch has persisted. After receiving this message,

Figure 8: Line diagram explaining the handshaking protocol for the Epoch Flush implementation in a multicore with a multi-banked last level cache.

LLC banks can start flushing the next epoch. Using an arbiter in this way requires only O(n) messages. In a multicore, a single arbiter responsible for persisting the epochs of all threads can become a bottleneck. Instead, we propose a per-thread arbiter, responsible for coordinating the persist operations of a single thread. This per-thread arbiter is placed alongside the private L1 cache in every core and coordinates the persisting of epochs belonging to the thread executing on that core.

We propose a handshaking protocol that uses an arbiter module sitting in the L1 cache controller to orchestrate the epoch flush. The protocol is shown in Figure 8. The arbiter starts the epoch flush by first flushing, from L1, all the cache lines belonging to the epoch being flushed. In step 1, L1 flushes all the cache lines belonging to that epoch to the LLC and also sends an epoch flush message (FlushEpoch) to all LLC banks. In step 2, each LLC bank starts flushing all the cache lines belonging to the epoch being flushed. On receiving PersistAcks for all the cache lines flushed, each LLC bank sends a BankAck message to the arbiter in the L1 controller in step 3. Finally, in step 4, after receiving a BankAck from all LLC banks, the arbiter signals flush completion (PersistCMP) to all the LLC banks. This final step updates the state corresponding to the last flushed epoch in all the banks.
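The four-step handshake can be sketched as an event-driven model. The following Python sketch is our own simplification (classes and message names mirror the protocol but are not the paper's RTL); memory writes are modeled as immediately acknowledged, so a bank's PersistAcks collapse into its flush loop.

```python
# Illustrative model of the FlushEpoch -> FlushLines/PersistAcks ->
# BankAck -> PersistCMP handshake. All names here are our own sketch.

class Bank:
    def __init__(self, bid):
        self.bid = bid
        self.lines = {}            # epoch -> dirty lines held by this bank
        self.last_persisted = None

    def flush_epoch(self, epoch, memory):
        # Steps 2-3: write back this bank's lines for the epoch; the
        # PersistAck from the memory controller is assumed immediate.
        for line in self.lines.pop(epoch, []):
            memory.append((epoch, line))
        return ('BankAck', self.bid)

    def persist_cmp(self, epoch):
        self.last_persisted = epoch        # step 4: per-bank state update

class Arbiter:
    """Per-core arbiter in the L1 controller: O(n) messages per epoch."""
    def __init__(self, banks):
        self.banks = banks

    def flush(self, epoch, memory):
        # Step 1: FlushEpoch to all banks; collect one BankAck per bank.
        acks = [b.flush_epoch(epoch, memory) for b in self.banks]
        assert len(acks) == len(self.banks)
        for b in self.banks:               # step 4: broadcast PersistCMP
            b.persist_cmp(epoch)

nvram = []
b0, b1 = Bank(0), Bank(1)
b0.lines['E1'] = ['A']; b1.lines['E1'] = ['B']; b1.lines['E2'] = ['C']
arb = Arbiter([b0, b1])
arb.flush('E1', nvram)
arb.flush('E2', nvram)     # C persists only after all of E1 has persisted
assert nvram == [('E1', 'A'), ('E1', 'B'), ('E2', 'C')]
```

Because the arbiter does not issue FlushEpoch for E2 until every bank has acknowledged E1, the Figure 7(a) violation (C persisting before A) cannot occur in this model.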

4.2 IDT and PF Implementation

Enforcing inter-thread persist ordering constraints out of the critical path requires two things. The first is preventing the dependent epoch from persisting before the source epoch persists. For this, an entry called a dependence register is created in the arbiter corresponding to the dependent epoch. The arbiter before persisting an


Figure 9: Hardware extensions.

epoch will check the dependence register (in addition to the epoch's predecessors in program order) to see whether that epoch is dependent on an epoch belonging to some other thread. If so, the arbiter will not flush until the source epoch has been flushed. The second requirement is informing the dependent epoch when the source epoch has been flushed. For this, an entry called an inform register is created in the arbiter corresponding to the source epoch. On completing the persist of an epoch, the arbiter sends an epoch persist completion message to the dependent epochs listed in the inform registers.

To implement the proactive flushing scheme, once an epoch completes, a request is sent to the corresponding arbiter to start flushing the epoch. The arbiter starts flushing the completed epoch after ensuring that all of its predecessor epochs have persisted.
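The interplay of the dependence and inform registers can be sketched as follows. This Python model is an assumption-laden simplification (per-core arbiters as objects, registers as dictionaries, messages as direct method calls): a dependent epoch's flush is deferred, off the critical path, until the source arbiter's inform register notifies it.

```python
# Sketch of the dependence/inform register mechanism (IDT) combined
# with proactive flush (PF). Layout and names are our own invention.

class EpochArbiter:
    def __init__(self, core):
        self.core = core
        self.dep = {}      # dependence registers: epoch -> (src core, src epoch)
        self.inform = {}   # inform registers: epoch -> dependents to notify
        self.persisted = set()
        self.peers = {}    # core id -> arbiter (message transport stand-in)

    def add_dependence(self, epoch, src_core, src_epoch):
        # Register locally and in the source arbiter's inform registers.
        self.dep[epoch] = (src_core, src_epoch)
        self.peers[src_core].inform.setdefault(src_epoch, []).append(
            (self.core, epoch))

    def try_flush(self, epoch):
        src = self.dep.get(epoch)
        if src is not None:
            sc, se = src
            if se not in self.peers[sc].persisted:
                return False   # defer: source epoch not yet persisted
        self.persisted.add(epoch)
        # Epoch persist completion: notify dependents in inform registers,
        # which (PF) retry their flush proactively.
        for core, dep_epoch in self.inform.pop(epoch, []):
            self.peers[core].try_flush(dep_epoch)
        return True

a0, a1 = EpochArbiter(0), EpochArbiter(1)
a0.peers = a1.peers = {0: a0, 1: a1}
a1.add_dependence('E01', src_core=0, src_epoch='E11')
assert not a1.try_flush('E01')   # deferred, not flushed in the critical path
a0.try_flush('E11')              # persisting the source cascades via inform
assert 'E01' in a1.persisted
```

Note that the failed first try_flush does no work in the critical path of the memory request; the flush is simply retried when the completion message arrives.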

4.3 Hardware Extensions

The hardware extensions required to implement a persist barrier are shown in Figure 9. To track the epoch status of each cache line, cache tags in both L1 and LLC are extended with an EpochID. Cache tags in the shared LLC additionally need to be extended with CoreID information to detect inter-thread conflicts. Apart from the cache tags, we add a flush engine to each cache controller to flush epochs as and when required. The flush engine triggers a flush for the dirty cache lines of the epoch being flushed. An epoch arbiter is added in the L1 cache controller to coordinate the epoch flush operation in the presence of multi-banked caches. To track inter-thread dependencies as proposed in IDT, we add a pair of registers, called the dependence and inform registers, in the L1 cache controller to identify the source and dependent epochs (using a combination of EpochID and CoreID). These registers are added per in-flight epoch.

In our implementation we support 8 in-flight epochs in a 32-core machine, so the EpochID is 3 bits wide and the CoreID is 5 bits wide. The overhead of tagging cache lines is 5 bits and 8 bits per cache line in L1 and LLC respectively. We add 4 pairs of IDT registers per epoch, allowing that many inter-thread dependencies to be tracked per epoch. The overhead of adding these registers is 64 bytes in each L1 cache. The per-core arbiter contains a 5-bit counter to track the BankAck messages received from all the banks. Even though multiple epochs can be in flight, only one of them is flushed at a time, so one

QUEUE_INSERT (Head, Entry)
1. Persist Barrier
2. Copy (data[Head], Entry)    <- Epoch A
3. Persist Barrier
4. Head = Head + EntryLen      <- Epoch B
5. Persist Barrier
END

(a) Queue insert pseudo-code

(b) Example
Figure 10: (a) Pseudo-code for a queue insert operation using persist barriers for recovery in case of a system crash. (b) Example illustrating the status of the queue on completion of the different epochs within the insert function.

counter is sufficient. Our flush engine maintains bookkeeping information similar to [9] in order to reduce the overhead of searching. In our implementation, we maintain a bitmap per epoch, where each bit corresponds to 64 sets in the cache, amounting to an overhead of 512 bytes for a 16-way 1MB LLC bank.

5. ENFORCING PERSISTENCY MODELS

In this section, we illustrate how our efficient persist barrier can be used to implement two models of persistency: Buffered Epoch Persistency (BEP) and Buffered Strict Persistency (BSP) in bulk mode.

5.1 Buffered Epoch Persistency

In BEP [8], programmer-inserted persist barriers divide the program into epochs, and persist ordering is enforced at epoch granularity. Consider the sample queue insert function (similar to [8]) shown in Figure 10(a). The queue insert operation consists of two epochs. In Epoch A, spanning lines 2 to 3, a new entry is copied into the queue at the location pointed to by the Head pointer. In Epoch B, spanning lines 4 to 5, the Head pointer is updated to point to the next empty location. Figure 10(b) shows the status of the queue on completion of Epoch A and Epoch B. If the system crashes after Epoch A persists but before Epoch B persists, the new entry Ek is ignored on recovery. If the system crashes after Epoch B persists, then on recovery the program sees a successfully completed insert operation.
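The recovery argument for Figure 10(a) can be rendered as a small software model. This Python sketch is our own (persist_barrier, persist_log, and a unit EntryLen are assumptions for illustration; it omits the leading barrier, which orders the insert against earlier updates): the invariant is that Head never persists ahead of the entry it publishes.

```python
# Minimal software rendering of the two-epoch queue insert of
# Figure 10(a). persist_barrier() stands in for the hardware barrier;
# persist_log records the order in which epochs become durable.

persist_log = []

def persist_barrier(epoch_writes):
    persist_log.append(tuple(epoch_writes))   # epoch becomes durable here
    epoch_writes.clear()

def queue_insert(queue, head, entry):
    writes = []
    # Epoch A: copy the entry into the slot at Head (lines 2-3)
    queue[head] = entry
    writes.append(('data', head, entry))
    persist_barrier(writes)
    # Epoch B: publish the entry by bumping Head (lines 4-5);
    # we use +1 in place of EntryLen for simplicity
    head += 1
    writes.append(('head', head))
    persist_barrier(writes)
    return head

q, head = {}, 0
head = queue_insert(q, head, 'Ek')
# A crash between the two barriers leaves the old Head durable, so the
# half-inserted entry Ek is simply ignored on recovery.
assert persist_log == [(('data', 0, 'Ek'),), (('head', 1),)]
```

If the barrier between the epochs were removed, the Head update could persist before the entry data, and recovery could observe a published slot containing garbage.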

5.2 Buffered Strict Persistency in Bulk Mode

We enforce BSP in bulk mode to minimize the overheads of strict persistency. Instead of enforcing persist ordering constraints at memory-operation granularity, we enforce them at epoch granularity. This is similar in spirit to the way Sequential Consistency is enforced by BulkSC [11].

BSP is implemented completely in hardware; hence, no programmer annotations in the form of persist barriers are required. A hardware persistence engine divides the sequence of stores from a program execution into epochs. Persistency is enforced at the granularity


of epochs using the optimized persist barrier presented in Section 3. At the end of each epoch, along with the modified cache lines, the processor state is also saved to persistent memory. This state can be used to restart the process, similar to the way it is done in [12]. It is worth noting that epoch boundaries are the points at which BSP holds. At the time of a crash, though, some epochs might have persisted only partially, and therefore BSP might be violated. To overcome this problem, we propose to use logging to undo any partially persisted epochs in such situations.

Another issue that can arise in the proposed implementation is the possibility of epoch deadlocks. Since the hardware creates epochs dynamically, it is oblivious to dependencies between threads, which could lead to epoch deadlocks. We use the solution presented in Section 3.3 to overcome this problem.

5.2.1 Logging

In order to enforce BSP, epoch updates to persistent memory need to be atomic. The granularity (atomic unit) at which memory can be updated by the hardware is typically much smaller than the size of an epoch (which spans multiple cache lines). To ensure atomic updates of an epoch, logging support is needed. This logging support can be provided by the hardware, or it can be managed explicitly in software through system libraries or by the application itself. In this work, we make use of a hardware undo logging mechanism, described below, to enforce BSP.

Undo logging requires that, before a cache line is modified, its old value be written to the log area. Therefore, whenever a cache line is modified for the first time in an epoch, the old value of the cache line (which is either already in the cache or has been brought into the cache on a cache miss) is written back from the cache to the log in NVRAM before the modified cache line is written back. The first modification to a cache line can be identified by looking at the epoch identifier: if the epoch identifier of the cache line is the same as that of the epoch in which it is being modified, then this is not the first modification to the cache line in this epoch.
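The first-touch test on the epoch identifier can be sketched in a few lines. The following Python model is illustrative only (a dictionary stands in for the cache, and the log write is assumed durable before the store completes): the old value is logged exactly once per line per epoch.

```python
# Sketch of the first-touch test for hardware undo logging: a line's
# old value is logged only on its first modification in an epoch,
# detected by comparing the line's epoch tag with the current epoch.

class UndoLogCache:
    def __init__(self):
        self.data = {}       # addr -> current value (the cache contents)
        self.epoch_tag = {}  # addr -> EpochID of the last modification
        self.log = []        # undo log in NVRAM (old values, write-ahead)

    def store(self, addr, value, epoch):
        if self.epoch_tag.get(addr) != epoch:
            # First modification in this epoch: log the old value before
            # the new value can reach NVRAM.
            self.log.append((epoch, addr, self.data.get(addr)))
            self.epoch_tag[addr] = epoch
        self.data[addr] = value

c = UndoLogCache()
c.store('A', 1, epoch=7)
c.store('A', 2, epoch=7)     # same epoch: no second log entry
c.store('A', 3, epoch=8)     # new epoch: old value 2 is logged
assert c.log == [(7, 'A', None), (8, 'A', 2)]
```

On a crash mid-epoch, replaying the log entries for the partially persisted epoch in reverse restores the pre-epoch values, re-establishing BSP at the last epoch boundary.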

It is important to note, however, that our key contribution is the persist barrier implementation, and our work is orthogonal to the logging mechanism. In other words, our work applies equally with any kind of underlying logging infrastructure, whether provided by the hardware or implemented in software.

6. EXPERIMENTAL METHODOLOGY

We evaluate our proposed persist barrier (LB++) using gem5 [13] with Ruby in full-system simulation mode. The on-chip interconnect is modelled using Garnet [14]. We evaluate a 32-core multicore (1 thread per core) with a multi-banked LLC and 4 memory controllers placed at the 4 corners of the chip. Table 1 shows the parameters of the system.

Our aim is to evaluate BEP for applications that maintain persistent data structures (using micro-benchmarks) and BSP for long-running applications that require checkpointing (using real workloads).

Cores                  32 OoO cores @ 2GHz
ROB Size               192 entries
Write Buffer           32 entries
L1 I/D Cache           32KB, 64B lines, 4-way
L1 Access Latency      3 cycles
L2 Cache               1MB x 32 tiles, 64B lines, 16-way
L2 Access Latency      30 cycles
Memory Controllers     4
NVRAM Access Latency   360 (240) cycles write (read)
On-chip network        2D Mesh, 4 rows, 16B flits

Table 1: System Parameters

Hash     Insert/delete entries in a hash table
Queue    Insert/delete entries in a queue
RBTree   Insert/delete nodes in a red-black tree
SDG      Insert/delete edges in a scalable graph
SPS      Random swaps between entries in an array

Table 2: Micro-benchmarks used in our experiments

Workloads: We use the micro-benchmarks listed in Table 2 to evaluate the proposal of using LB++ to implement BEP. These micro-benchmarks implement data structures similar to those in the benchmark suite used by NV-Heaps [4], except for the queue micro-benchmark, which is similar to the copy-while-locked queue presented in [8]. The size of a data entry (table entry, tree node, queue entry, etc.) for each micro-benchmark is 512 bytes. Each benchmark performs search, delete and insert operations on the corresponding data structure. We inserted persist barriers at appropriate points to ensure persistency (as illustrated in Figure 10(a)).

To evaluate buffered strict persistency (BSP) we use benchmarks from the PARSEC [15], SPLASH-2 [16] and STAMP [17] benchmark suites. The benchmarks are unmodified; persist barriers are inserted transparently by the hardware to ensure BSP. In our experiments, we model the overhead of checkpointing all general-purpose, special, privilege and (non-AVX) floating-point registers as part of the processor state. We ran all the workloads to completion.

7. RESULTS

In this section we evaluate our key contribution, the proposed persist barrier (LB++). More specifically, we evaluate the additional speedup provided by LB++ over the state-of-the-art persist barrier LB [4] in enforcing BEP (Section 5.1) and BSP (Section 5.2). We also present the performance improvements provided by the inter-thread dependence tracking (LB+IDT) and proactive flush (LB+PF) optimizations individually on top of the unoptimized barrier. Recall that LB++ is the result of combining both the IDT and PF optimizations.

Both the optimized and unoptimized barrier implementations involve cache line flushes. Should the cache line


Figure 11: Transaction throughput normalized to LB.

flush be an invalidating flush (similar to the clflush instruction) or a non-invalidating flush (similar to the recently introduced clwb instruction)?5 We analyzed the performance impact and found that using a non-invalidating flush is significantly (around 30%) faster. This is not surprising, since an invalidating flush disrupts locality by evicting lines from the cache, which on subsequent accesses need to be fetched again from NVRAM. We do not present detailed results owing to space constraints. For the remainder of the section we only consider non-invalidating cache line flushes to implement all persist barriers.

7.1 Buffered Epoch Persistency

Impact of Optimizations. We study the performance improvement due to the two optimizations, first individually and then in combination. Figure 11 shows the transaction throughput for the micro-benchmarks, normalized to the throughput of LB. On average, LB+IDT improves throughput by only 3%. Recall that IDT improves performance by reducing the latency of memory requests that trigger inter-thread conflicts. The primary reason LB+IDT yields only a small improvement is that performance is dominated by intra-thread conflicts for these micro-benchmarks (note that IDT provides a significant performance improvement for BSP, as shown in Section 7.2). LB+PF, on the other hand, improves transaction throughput by 17%. This improvement arises because LB+PF reduces the number of conflicts, thereby reducing the overall latency of memory requests. LB++, obtained by combining the IDT and PF optimizations, achieves a throughput improvement of 22% over LB.

Epoch Conflicts. In the presence of epoch conflicts, persist operations happen in the critical path. This is contrary to the objective of LB, which is to perform offline persists. Figure 12 shows the percentage of epochs that are flushed because of a conflict. On average, 90% of epochs are flushed because of a conflict in LB. LB+IDT has a similar percentage of epoch conflicts, because the IDT optimization does not directly affect the percentage of conflicting epochs but only reduces the latency of conflicting requests. The PF optimization, on

5 The reason for making this comparison is that many processors today only offer flush instructions that invalidate the cache line on completion.

Figure 12: Percentage of conflicting epochs (out of the total number of epochs).

the other hand, decreases the probability of an epoch conflicting by persisting epochs proactively. On average, LB+PF reduces the percentage of epoch conflicts from 90% to 77%. LB++ (LB+IDT+PF) reduces the epoch conflict percentage further, down to 75%, because IDT helps PF: PF starts flushing an epoch only after the epoch completes, and since IDT allows epochs to complete faster, the scope for flushing epochs proactively increases.

7.2 Buffered Strict Persistency in Bulk Mode

In this section, we evaluate the performance overhead of achieving BSP in bulk mode when using LB++, in comparison to the overhead when using LB. Recall that we target the x86 architecture, which supports a variant of TSO; hence the resultant persistency model is also the same. We compare the performance of achieving BSP relative to a baseline that provides no guarantees: No Persistency (NP). It is worth noting that NP also uses NVRAM as the memory and incurs NVRAM latencies. We use the NP baseline because it is interesting to know the overhead of enforcing BSP. A naive implementation of BSP would require caches to be write-through. We analyzed the performance of such a design and found it to be about 8x slower than NP. This is a prohibitively large overhead, so we do not consider (optimizing) this design further, and only present the bulk BSP results.

Epoch Size. In BSP, store operations are divided into epochs dynamically by hardware. How large should the epoch size be? Smaller epochs are desirable, as they mean less work lost on a crash, whereas larger epochs are expected to be more efficient. We therefore analyze the performance impact of varying the epoch size, considering sizes (in dynamic stores) of 300 (LB300), 1000 (LB1K) and 10000 (LB10K). We use the unoptimized persist barrier (LB) for this study. Figure 13 shows the execution time of the benchmarks for designs with varying epoch sizes, normalized to NP.

We observe that, on average, performance improves with increasing epoch size. LB300 has an execution time overhead of 1.9x, whereas LB1K has a significantly lower overhead (a 40% reduction with respect to the baseline). This is because increasing the epoch size provides the opportunity for multiple writes to the same cache line (belonging to the same epoch) to be coalesced, thereby decreasing the number of persists. For the same rea-


Figure 13: Execution time with varying epoch sizes, normalized to NP.

son, we observe that LB10K performs marginally better than LB1K, although, interestingly, on some benchmarks such as canneal, dedup, intruder and vacation, LB1K outperforms LB10K. We believe this is because a larger epoch size increases the number of epoch conflicts; with persist coalescing providing diminishing returns, epoch conflicts start to dominate. This also explains why the performance improvement saturates beyond an epoch size of 10000 stores (not shown).

Impact of Proposed Persist Barrier LB++. The overhead of ensuring strict persistency using the unoptimized barrier LB is significant even with a large epoch size (1.5x). This is the gap we seek to close with our optimized barrier. We consider an epoch size of 10000 for this study, as it gave the best results.

We observe (Figure 14) that LB+IDT reduces the overhead of strict persistency from 1.5x in LB to 1.35x. LB+IDT achieves this 15% improvement because a large fraction (86%) of conflicts are inter-thread conflicts, which IDT is able to optimize. LB++ further reduces the overhead to 1.3x, an improvement of 20% with respect to the baseline. The performance improvement is much more pronounced for some benchmarks. For instance, ssca2 sees a reduction from 4.22x to 2.62x; ssca2 is a write-intensive benchmark with fine-grained interaction between threads, and the number of epochs that need to persist for it is very high.

Although our optimized persist barrier provides a significant improvement, there is still a residual overhead of 30% over NP. Since our implementation of BSP requires logging, we wanted to understand how much of this residual overhead is due to logging. To this end, we also present the execution time of an implementation of bulk persistency using the optimized persist barrier without logging (LB++NOLOG). It has an overhead of about 16% over NP, from which we conclude that about half of the 30% residual overhead is owing to logging.

8. RELATED WORK

There have been many proposals for enabling fast persistence with NVRAM [3, 4, 6, 7, 9]. All these techniques provide a programming framework to expose NVRAM to programmers. BPFS and NV-Heaps [4, 9] rely on LB for ensuring the correct order of persists, whereas

Figure 14: Execution time normalized to NP.

others [3, 6, 7] rely on instructions like clflush and mfence provided by existing processors. It is important to note that these instructions are neither optimal nor sufficient to enforce the correct order of persists. The newly proposed clwb instruction is optimal in that it does not invalidate a cache line while writing it back to memory, and another instruction, pcommit, is required to avoid reordering of persists at the memory controller level. All of these techniques can seamlessly benefit from our efficient persist barrier implementation.

LOC [18] provides hardware logging support to reduce the overhead of persistence. It would be interesting to use LOC in conjunction with LB++ for enforcing BSP in bulk mode. Kiln [19] proposes a technique to reduce persist latency by using a non-volatile last level cache (NVLLC) along with NVRAM. Using a non-volatile cache also eliminates the requirement of logging by allowing the NVLLC and NVRAM to store two versions of a cache line, one of which can conceptually be considered a log entry. NVM Duet [20], FIRM [21] and DP2 [22] propose memory controller optimizations to improve the performance of persistent applications using NVRAM. All these proposals broadly help in reducing persist latency, which is complementary to our proposal of an efficient persist barrier, in which we reduce conflicts and online persists.

Techniques like WSP [12] have been proposed which save the entire execution state in NVRAM on a power failure. They rely on a small battery backup to flush caches and store processor state. Although this technique works in the case of power failure, it is not clear how it can be used in the case of other failures such as software crashes. In contrast, BSP guarantees persistence and recovery for any kind of failure. TSP [23], on the other hand, discusses tradeoffs in fault-tolerance mechanisms depending on the failure model (software crashes, power failures, etc.). Pelley et al. [24] present designs for implementing ACID transactions for a system with NVRAM. Central to their design is the notion of persisting a batch of transactions together to amortize cost. However, their persists happen in the critical path; in contrast, we seek to move persists out of the critical path using hardware support. Memory persistency [8] proposes various persistency models, including epoch and strict persistency. It also identifies the possibility of inter-thread dependence tracking and of enforcing strict persistency in bulk mode; however, it does not discuss how these can be realized.


Our work presents a detailed design and implementation of both.

9. CONCLUSION

Many techniques and programming models have been proposed for achieving persistence using NVRAM. All of them require persists to be ordered to ensure consistency of persisted data, i.e., they all benefit from a persist barrier mechanism. We first illustrate the performance bottlenecks in the state-of-the-art persist barrier (LB). We then propose solutions to these bottlenecks and incorporate them in the design and implementation of an efficient persist barrier (LB++) for multicores with multi-banked caches. We evaluate LB++ by using it to enforce two recently proposed persistency models. We show that enforcing buffered epoch persistency (BEP) using LB++ (as opposed to LB) improves performance by 22%. We show that enforcing buffered strict persistency (BSP) using LB++ (as opposed to LB) improves performance by 20%, allowing us to support BSP at a 1.3× execution time overhead.

10. ACKNOWLEDGEMENTS

We would like to thank the anonymous reviewers for their helpful comments. This work is supported by the Intel University Research Office and by EPSRC grants EP/M001202/1 and EP/M027317/1 to the University of Edinburgh.

11. REFERENCES

[1] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson, "System software for persistent memory," in Proceedings of the 9th European Conference on Computer Systems, ACM, 2014.

[2] Intel Corporation, Intel® Architecture Instruction Set Extensions Programming Reference. No. 319433-022, 2014.

[3] H. Volos, A. J. Tack, and M. M. Swift, "Mnemosyne: Lightweight persistent memory," in Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, 2011.

[4] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson, "NV-Heaps: Making persistent objects fast and safe with next-generation, non-volatile memories," in Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, 2011.

[5] X. Wu and A. L. N. Reddy, "SCMFS: A file system for storage class memory," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, 2011.

[6] D. R. Chakrabarti, H.-J. Boehm, and K. Bhandari, "Atlas: Leveraging locks for non-volatile memory consistency," in Proceedings of the International Conference on Object Oriented Programming Systems Languages & Applications, ACM, 2014.

[7] A. Chatzistergiou, M. Cintra, and S. D. Viglas, "REWIND: Recovery write-ahead system for in-memory non-volatile data-structures," Proceedings of the VLDB Endowment, vol. 8, no. 5, 2015.

[8] S. Pelley, P. M. Chen, and T. F. Wenisch, "Memory persistency," in Proceedings of the 41st Annual International Symposium on Computer Architecture, IEEE, 2014.

[9] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee, "Better I/O through byte-addressable, persistent memory," in Proceedings of the 22nd Symposium on Operating Systems Principles, ACM, 2009.

[10] D. R. Chakrabarti and H.-J. Boehm, "Durability semantics for lock-based multithreaded programs," in Proceedings of the 5th USENIX Workshop on Hot Topics in Parallelism, USENIX, 2013.

[11] L. Ceze, J. Tuck, P. Montesinos, and J. Torrellas, "BulkSC: Bulk enforcement of sequential consistency," in Proceedings of the 34th Annual International Symposium on Computer Architecture, ACM, 2007.

[12] D. Narayanan and O. Hodson, "Whole-system persistence with non-volatile memories," in Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, 2012.

[13] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," SIGARCH Comput. Archit. News, vol. 39, no. 2, 2011.

[14] N. Agarwal, T. Krishna, L.-S. Peh, and N. Jha, "GARNET: A detailed on-chip network model inside a full-system simulator," in Proceedings of the International Symposium on Performance Analysis of Systems and Software, IEEE, 2009.

[15] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, ACM, 2008.

[16] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," in Proceedings of the 22nd Annual International Symposium on Computer Architecture, ACM, 1995.

[17] C. C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun, "STAMP: Stanford transactional applications for multiprocessing," in Proceedings of the 4th International Symposium on Workload Characterization, IEEE, 2008.

[18] Y. Lu, J. Shu, L. Sun, and O. Mutlu, "Loose-ordering consistency for persistent memory," in Proceedings of the 32nd International Conference on Computer Design, IEEE, 2014.

[19] J. Zhao, S. Li, D. H. Yoon, Y. Xie, and N. P. Jouppi, "Kiln: Closing the performance gap between systems with and without persistence support," in Proceedings of the 46th Annual International Symposium on Microarchitecture, ACM, 2013.

[20] R.-S. Liu, D.-Y. Shen, C.-L. Yang, S.-C. Yu, and C.-Y. M. Wang, "NVM Duet: Unified working memory and persistent store architecture," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, 2014.

[21] J. Zhao, O. Mutlu, and Y. Xie, "FIRM: Fair and high-performance memory control for persistent memory systems," in Proceedings of the 47th Annual International Symposium on Microarchitecture, IEEE Computer Society, 2014.

[22] L. Sun, Y. Lu, and J. Shu, "DP2: Reducing transaction overhead with differential and dual persistency in persistent memory," in Proceedings of the 12th International Conference on Computing Frontiers, ACM, 2015.

[23] F. Nawab, D. R. Chakrabarti, T. Kelly, and C. B. M. III, "Procrastination beats prevention: Timely sufficient persistence for efficient crash resilience," in Proceedings of the 18th International Conference on Extending Database Technology, 2015.

[24] S. Pelley, T. F. Wenisch, B. T. Gold, and B. Bridge, "Storage management in the NVRAM era," Proceedings of the VLDB Endowment, vol. 7, no. 2, 2013.