

ACTA UNIVERSITATIS UPSALIENSIS
UPPSALA 2017

Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1521

Advances Towards Data-Race-Free Cache Coherence Through Data Classification

MAHDAD DAVARI

ISSN 1651-6214
ISBN 978-91-554-9925-9
urn:nbn:se:uu:diva-320595


Dissertation presented at Uppsala University to be publicly examined in 2446, Lägerhyddsvägen 2, Hus 2, Uppsala, Thursday, 8 June 2017 at 13:15 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Manuel Eugenio Acacio Sánchez (Universidad de Murcia).

Abstract

Davari, M. 2017. Advances Towards Data-Race-Free Cache Coherence Through Data Classification. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1521. 64 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9925-9.

Providing a consistent view of the shared memory based on precise and well-defined semantics—memory consistency model—has been an enabling factor in the widespread acceptance and commercial success of shared-memory architectures. Moreover, cache coherence protocols have been employed by the hardware to remove from the programmers the burden of dealing with the memory inconsistency that emerges in the presence of the private caches. The principle behind all such cache coherence protocols is to guarantee that consistent values are read from the private caches at all times.

In its most stringent form, a cache coherence protocol eagerly enforces two invariants before each data modification: i) no other core has a copy of the data in its private caches, and ii) all other cores know where to receive the consistent data should they need the data later. Nevertheless, by partly transferring the responsibility for maintaining those invariants to the programmers, commercial multicores have adopted weaker memory consistency models, namely the Total Store Order (TSO), in order to optimize the performance for more common cases.

Moreover, memory models with more relaxed invariants have been proposed based on the observation that more and more software is written in compliance with the Data-Race-Free (DRF) semantics. The semantics of DRF software can be leveraged by the hardware to infer when data in the private caches might be inconsistent. As a result, hardware ignores the inconsistent data and retrieves the consistent data from the shared memory. DRF semantics therefore removes from the hardware the burden of eagerly enforcing the strong consistency invariants before each data modification. Instead, consistency is guaranteed only when needed. This results in manifold optimizations, such as reducing the energy consumption and improving the performance and scalability. The efficiency of detecting and discarding the inconsistent data is an important factor affecting the efficiency of such coherence protocols. For instance, discarding the consistent data does not affect the correctness, but results in performance loss and increased energy consumption.

In this thesis we show how data classification can be leveraged as an effective tool to simplify the cache coherence based on the DRF semantics. In particular, we introduce simple but efficient hardware-based private/shared data classification techniques that can be used to efficiently detect the inconsistent data, thus enabling low-overhead and scalable cache coherence solutions based on the DRF semantics.

Keywords: Shared Memory Architectures, Multicore, Memory Hierarchy, Cache Coherence, Data Classification

Mahdad Davari, Department of Information Technology, Division of Computer Systems, Box 337, Uppsala University, SE-75105 Uppsala, Sweden. Department of Information Technology, Computer Architecture and Computer Communication, Box 337, Uppsala University, SE-75105 Uppsala, Sweden.

© Mahdad Davari 2017

ISSN 1651-6214
ISBN 978-91-554-9925-9
urn:nbn:se:uu:diva-320595 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-320595)


To my parents, who always valued knowledge, prioritized my education, and supported me throughout.


List of Papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Ros, A., Davari, M., Kaxiras, S. (2015) Hierarchical Private/Shared Classification: The Key to Simple and Efficient Coherence for Clustered Cache Hierarchies. International Symposium on High Performance Computer Architecture (HPCA)

II Davari, M., Ros, A., Hagersten, E., Kaxiras, S. (2015) The Effects of Granularity and Adaptivity on Private/Shared Classification for Coherence. ACM Transactions on Architecture and Code Optimization (TACO)

III Davari, M., Ros, A., Hagersten, E., Kaxiras, S. (2015) An Efficient, Self-Contained, On-Chip Directory: DIR1-SISD. International Symposium on Parallel Architectures and Compilation Techniques (PACT)

IV Davari, M., Hagersten, E., Kaxiras, S. (2017) Scope-Aware Classification: Taking the Hierarchical Private/Shared Data Classification to the Next Level. (Under submission) Technical Report 2017-008, Department of Information Technology, Uppsala University

V Davari, M., Hagersten, E., Kaxiras, S. (2017) The Best of Both Works: A Hybrid Data-Race-Free Cache Coherence Scheme. (Accepted with revision) ACM Transactions on Architecture and Code Optimization (TACO)

Reprints were made with permission from the respective publishers. All the papers have been reformatted to the one-column format of this book.


Other publications not included in this thesis:

• Davari, M., Ros, A., Hagersten, E., Kaxiras, S. (2014) The Effects of Granularity and Adaptivity on Private/Shared Classification for Coherence. Seventh Swedish Workshop on Multicore Computing (MCC’14)

• Davari, M., Ros, A., Hagersten, E., Kaxiras, S. (2015) An Efficient, Self-Contained, On-Chip Directory: DIR1-SISD. Eighth Swedish Workshop on Multicore Computing (MCC’15)


Contents

1. Introduction

2. Generational Classification: Data Classification for DRF Semantics
   2.1 Simplifying the Coherence for DRF Semantics
   2.2 Generational Classification
      2.2.1 Classification storage and mechanisms
      2.2.2 Classification Adaptivity
      2.2.3 Adaptivity using Dead-Block Prediction
   2.3 Generational Coherence
   2.4 Results
   2.5 Summary

3. Dir1-SISD: a Minimal Directory for DRF Semantics
   3.1 The Directory for DRF Semantics
      3.1.1 Classification-Aware Inclusion Policy
      3.1.2 Self-Correcting Classification
      3.1.3 Implicit Classification Adaptivity
   3.2 Directory Compression
   3.3 Results
   3.4 Summary

4. Hierarchical Classification and Coherence for DRF Semantics
   4.1 Private/Shared Classification in Hierarchies
      4.1.1 Topology and Terminology
      4.1.2 Plain Hierarchical Classification
      4.1.3 Scope-Aware Classification
   4.2 Scoped Synchronization
   4.3 Dir1-H: The Hierarchical Coherence
      4.3.1 Hierarchical Inclusion Policies
   4.4 Results
   4.5 Summary

5. The Best of Both Schemes: A Hybrid Classification Approach
   5.1 The Penalty of Self-Invalidation and Self-Downgrade
      5.1.1 Impact of Shared Classification on Self-Invalidations
      5.1.2 Impact of Shared Classification on Self-Downgrades
   5.2 Self-Downgrade vs. Demand-Downgrade
   5.3 Hybrid Data Classification
      5.3.1 Hybrid Approach using Speculation
      5.3.2 Hybrid Approach Optimized for Migratory Sharing
      5.3.3 Hybrid Approach Optimized for Producer-Consumer Sharing
   5.4 Results
   5.5 Summary

6. Future Directions

7. Summary

8. Svensk Sammanfattning

Acknowledgements

References


Abbreviations

ACK    Acknowledgement
AI     Artificial Intelligence
CPU    Central Processing Unit
DRF    Data-Race-Free
GC     Generational Classification/Coherence
GPU    Graphics Processing Unit
INV    Invalidation
IoT    Internet of Things
L1     First-Level Cache
L2     Second-Level Cache
L3     Third-Level Cache
LLC    Last-Level Cache
MSHR   Miss Status Holding Register
NACK   Negative Acknowledgement
OS     Operating System
P2P    Private-to-Private
P2S    Private-to-Shared
REQ    Request
RESP   Response
RO     Read-Only
RW     Read-Write
S2P    Shared-to-Private
SC     Sequential Consistency
SI     Self-Invalidation
SD     Self-Downgrade
WB     Write-Back
WT     Write-Through


1. Introduction

Simplicity is the ultimate sophistication.

⎯ Leonardo da Vinci1

Historical records trace early ideas of multiprocessing back to the 19th century, when Luigi Federico Menabrea commented on Charles Babbage’s Analytical Engine in 1842: “… the machine can be brought into play so as to give several results at the same time, which will greatly abridge the whole amount of the processes [1].” Before moving into the mainstream in the modern era, multiprocessors had already been built and studied as early as the 1960s, such as the Burroughs B5000 [2, 3] and Carnegie Mellon’s C.mmp [4, 5]. The basic idea behind multiprocessing is to leverage multiple processors of lower performance in order to achieve the higher aggregate performance needed to accomplish tasks whose computational demands cannot be satisfied by an individual processor. Such a paradigm is inevitable due to the inability of technology to deliver cost-effective individual processors with higher performance, which has led to the emergence of today’s chip multiprocessors, better known as multicores.

Shared memory is considered an attractive choice in the multiprocessor architectural design space [6], enabling higher computational performance at a lower hardware cost as well as a simple programming model that can be adopted to solve a variety of real-world problems. Typically, the entire memory is shared between the processors using a single logical address space, which allows simple communication between the cores in the form of regular memory store and load operations. Moreover, using a single address space allows a single instance of an operating system to serve all the processors in the system [7].

However, the benefits offered by the shared-memory architecture cannot be harnessed unless all the processors have a consistent view of the shared memory [6, 8, 9], which necessitates a formal definition of consistency in shared-memory architectures, better known as the memory consistency model, or memory model [9, 14, 15, 17, 19]. Once defined, the memory model serves as the standard that describes the consistent behavior of the shared memory, which the hardware designers guarantee to provide for the programmers.

1 Also attributed to Clare Boothe Luce, among others.


Figure 1.1. An abstraction of a multicore architecture consisting of four cores, each with a private cache, sharing the off-chip memory. A consistent view of the shared memory is shown in (a) where cores read the same datum from their private caches and observe the same value (all the private caches are coherent up to this point due to the read-only accesses by all the cores). Cores 0, 1, and 3 shown in (b) begin to observe an inconsistent view of the shared memory if they read from their private caches the datum that Core 2 has modified in its private cache (Cores 0, 1, and 3 are said to receive stale data due to the cache incoherence problem).

Based on an intuitive model that is appealing to the programmers, where each core1 executes and finishes its instructions one after another in the order specified by its program, the shared memory is considered consistent if, at any point in time, there exists only one valid value for any datum across the whole shared-memory system [19]. Nevertheless, such a strong definition of consistency has implications for hardware designers: individual cores in shared-memory architectures employ private cache memories in order to reduce the latency of accessing a large shared memory [5, 12]. Private caches can lead to the cache incoherence problem which, based on the aforementioned strong definition of consistency, results in an inconsistent view of the shared memory, as illustrated by the example in Figure 1.1.

1 We use the term [multi]core hereafter instead of [multi]processor; however, our discussions hold true for any shared-memory architecture.

Besides the cache incoherence problem shown in Figure 1.1, there are other sources involved in the memory inconsistency problem. Among myriad optimizations, today’s cores employ so-called store buffers in order to hide the latency of resolving the cache misses caused by stores [9]. From a core’s perspective, a store operation is deemed completed as soon as the store result is written into a high-speed buffer. This technique keeps the core active by allowing the subsequent loads to resume execution and to access the private cache even if the result of the preceding stores cannot be immediately written to the cache. However, as a consequence of using store buffers, the memory inconsistency problem shown in Figure 1.1 emerges even before cache contents are modified. Moreover, in the presence of hardware-based out-of-order processing and compiler-based instruction re-ordering, the memory inconsistency problem is exacerbated by the fact that today’s cores execute instructions in an order different from the one specified by the programmer [9]. The latter example leads to a broader definition of shared-memory consistency: consistency models deal with the order in which load and store operations are performed and are made visible to all the cores. The goal of defining the memory consistency model is therefore to create order out of the shared-memory chaos by specifying how the memory operations are performed and made visible to all the cores.
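To make the store-buffer effect concrete, the classic store-buffering litmus test is sketched below in C++. This example is illustrative and not taken from the thesis; relaxed atomics stand in for the reordering that store buffers permit. Each core buffers its store and lets the following load run ahead, so the outcome r0 == 0 and r1 == 0 becomes possible, even though it is forbidden under the strong ordering described above.

```cpp
#include <atomic>
#include <thread>
#include <cstdio>

// Store-buffering litmus test (illustrative sketch, not from the thesis).
// With relaxed ordering, emulating what store buffers allow, the outcome
// r0 == 0 && r1 == 0 is permitted, although it is impossible under
// sequential consistency.
std::atomic<int> x{0}, y{0};
int r0, r1;

void core0() {
    x.store(1, std::memory_order_relaxed);   // store may linger in the store buffer
    r0 = y.load(std::memory_order_relaxed);  // load may bypass the buffered store
}

void core1() {
    y.store(1, std::memory_order_relaxed);
    r1 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread t0(core0), t1(core1);
    t0.join(); t1.join();
    std::printf("r0=%d r1=%d\n", r0, r1);  // "0 0" is a legal outcome here
}
```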

The examples given so far highlighted that the memory inconsistency problem arises in shared-memory architectures and has to be dealt with in order to take advantage of the benefits of such architectures. To address the inconsistency caused by the cache incoherence problem, multicores employ cache-coherence mechanisms to keep the private caches coherent at all times [6, 8, 9]. The ultimate goal of cache coherence is therefore to ensure that loads always return consistent values from the private caches, where “consistent values” are defined via precise semantics given by the underlying memory model1 [13, 14, 15, 16, 17, 18]. While software and hybrid implementations exist, multicores typically employ hardware coherence schemes to hide the complexities from the programmers [10, 11, 20, 21, 22].

To address the inconsistency caused by the store buffers, there has been a broad industrial consensus in favor of abandoning strong memory ordering for higher performance. Towards this end, the programmers’ convenience is partly traded off against the higher performance offered by the acceptance of weaker memory models, such as the Total-Store-Order (TSO) model [9], or even weaker “relaxed” memory models. Nevertheless, programmers are still able to enforce the strongest memory ordering and consistency by using specific ordering instructions, when needed [9]. This thesis addresses simplified coherence schemes for relaxed memory models. An overview of coherence schemes therefore helps to better highlight the contributions of this thesis.

Early shared-memory architectures used centralized directories to keep the private caches coherent [23, 24, 25]. However, distributed schemes using shared buses2 were later adopted to overcome the storage overhead and high latency of the centralized directories [26, 59, 60]. Today, coherence schemes based on distributed sparse directories prevail, due to the inability of the shared buses to scale to higher numbers of cores [38, 39]. Regardless of being bus-based or directory-based, coherence schemes guarantee that memory load operations always return consistent values defined and agreed by the accepted memory consistency model [5, 6, 8, 9, 13].

1 This implies that values that are considered consistent under a specific memory model can be deemed inconsistent under other models.
2 Also known as snoop-based, snoopy, or snoopy-bus protocols.


Figure 1.2. Data-Race-Free (DRF) semantics divides the execution into epochs marked by explicit synchronizations—acquire/release and barrier semantics, represented here by dashed lines—that guarantee the single-writer-multiple-readers invariant. The figure illustrates a hypothetical sharing pattern for a datum based on the DRF semantics (RW: Read-Write, RO: Read-Only).

Ensuring the consistency of the values returned by the memory read operations implies that cache hits return consistent values and cache misses know where to find the consistent data. Coherence protocols guarantee the consistency of the values returned by the cache hits by adopting either the write-update or the write-invalidate policy1. In the write-update scheme, the coherence protocol ensures that the modified value is propagated to all the cores currently holding the datum in their private caches, before a core is allowed to modify the datum in its private cache. However, most coherence protocols adopt the write-invalidate approach. In an invalidation-based scheme, the coherence protocol is required to maintain an invariant known as single-writer-multiple-readers, which ensures that other cores do not have the datum in their private caches before a core is allowed to modify the datum in its private cache. At this point, the core is said to have the permission to modify the datum in its private cache2 [9].

Moreover, coherence protocols also ensure that once a core has modified a datum, all the successive readers receive the same value for the datum. This implies that the modifying core shall not relinquish the permission to modify the datum before ensuring that the value is accessible to all the successive readers once the permission to modify is relinquished3 [9].

1 Write-Propagation property.
2 Note that no such requirement is imposed on the store buffers, i.e. a store buffer may contain the modified value of a datum while the old value still exists as valid in other caches, hence “cache coherence” and not “store-buffer coherence”.
3 Also known as the Write-Serialization property or Data-Value invariant.
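For contrast with the lazy approach developed in this thesis, the following minimal sketch shows how a conventional invalidation-based directory might enforce the two requirements above on a write. All names (DirEntry, send_invalidation, wait_for_ack, kNumCores) are assumptions made for illustration; this is textbook-style logic expressed in C++, not the protocol proposed here.

```cpp
#include <bitset>
#include <cstddef>

constexpr std::size_t kNumCores = 4;  // hypothetical core count

// One directory entry in a conventional invalidation-based protocol (sketch).
struct DirEntry {
    std::bitset<kNumCores> sharers;  // which private caches hold a copy
    int owner = -1;                  // core with write permission, -1 if none
};

// Stand-ins for protocol messages; a real system sends these over the network.
void send_invalidation(std::size_t /*core*/) {}
void wait_for_ack(std::size_t /*core*/) {}

// Invariant (i): before granting write permission, every other copy is eagerly
// invalidated and acknowledged (single-writer-multiple-readers).
// Invariant (ii): the directory records the writer as owner, so later readers
// know where the consistent data can be found.
void handle_write_request(DirEntry& e, std::size_t writer) {
    for (std::size_t c = 0; c < kNumCores; ++c) {
        if (c != writer && e.sharers.test(c)) {
            send_invalidation(c);
            wait_for_ack(c);
        }
    }
    e.sharers.reset();
    e.sharers.set(writer);
    e.owner = static_cast<int>(writer);
}
```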

Nevertheless, software written in compliance with the data-race-free (DRF) semantics [16, 34, 35, 36, 37] partitions the execution into epochs (Figure 1.2), where the beginning and the end of each epoch are explicitly marked by synchronizations—acquire/release and barrier semantics. DRF semantics guarantees that, given any datum and any epoch, at most one core is allowed to modify the datum in the epoch. In other words, in any epoch cores do not access the data that are being modified by another core1 within the same epoch. This software guarantee enables a hybrid scheme where the burden of eagerly invalidating the data copies in other cores before each data modification is removed from the coherence protocol. Instead, the DRF semantics of the software—in the form of explicit synchronizations—informs the coherence layer to enforce cache coherence only when needed. A viable means to this end is to have each core self-invalidate its incoherent data each time the core encounters a synchronization [27, 28, 29, 30, 31, 32, 33]. Such software-driven hybrid coherence schemes eliminate the coherence traffic caused by the excess invalidation and acknowledgement messages.
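The epoch structure of Figure 1.2 can be illustrated with a small data-race-free program. In the sketch below (illustrative only, not from the thesis), every conflicting access to shared_counter is enclosed by the same lock, so writes by different threads are always separated by a release followed by an acquire; each critical section therefore falls into its own epoch.

```cpp
#include <mutex>
#include <thread>
#include <vector>
#include <cstdio>

// Minimal data-race-free program: every access to shared_counter is protected
// by the same lock, so conflicting accesses are always separated by a release
// (unlock) followed by an acquire (lock), the epoch boundaries of Figure 1.2.
std::mutex m;
long shared_counter = 0;

void worker(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        m.lock();          // acquire: start of an epoch in which this thread may write
        ++shared_counter;  // at most one thread modifies the datum in this epoch
        m.unlock();        // release: the modification becomes visible to the next acquirer
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(worker, 1000);
    for (auto& th : threads) th.join();
    std::printf("counter = %ld\n", shared_counter);  // always 4000: no data race
}
```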

At this point we are faced with the question of how to detect the incoherent data that have to be self-invalidated upon synchronizations. The answer to this question is the main contribution of this work. This thesis investigates simple techniques that can be employed by such software-driven hybrid coherence schemes to efficiently identify the incoherent data that need to be self-invalidated upon synchronizations. In particular, we study how private/shared data classification can be leveraged as an efficient tool to enable simple coherence schemes based on the DRF semantics under relaxed memory consistency models.

The impact of private/shared data classification on coherence has been previously studied. For instance, coherence directories can be made smaller by disabling the coherence tracking for the private data [40, 49, 50], or coherence traffic can be reduced by optimal placement of private and shared data in the memory hierarchy [41]. This thesis, however, studies private/shared data classification in the context of coherence protocols whose mechanisms critically depend on it.

The thesis starts by introducing a fine-grained hardware-based private/shared classification scheme and the mechanisms needed to perform such data classification. It then continues to show how the classification can be leveraged by directory-less coherence schemes under relaxed memory models based on the DRF semantics. Next it introduces a minimal coherence directory tailored for private/shared data classification and coherence schemes based on the DRF semantics. The thesis continues by extending the classification approach to hierarchical topologies, and introduces an efficient hierarchical private/shared classification and coherence scheme. Finally, the thesis shows how private/shared classification can be tailored to the sharing patterns in order to provide optimized results. Experimental results are provided throughout the thesis in support of the effectiveness of the proposed approaches in reducing the coherence-related network traffic.

1 Core and thread are used interchangeably hereafter.


2. Generational Classification: Data Classification for DRF Semantics

Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.

⎯ Antoine de Saint-Exupéry

This chapter investigates the simplification of cache coherence by removing the coherence directory in its entirety and eliminating the associated mechanisms such as invalidations, indirections, multicasts, and broadcasts. Towards this end, we introduce a hardware-based scheme to classify the cache lines as private or shared based on the notion of cache-line generations. Such data classification enables a simplified directory-less coherence where cores self-invalidate their shared data upon synchronizations, and self-downgrade their shared data by writing through to the shared last-level cache.

2.1 Simplifying the Coherence for DRF Semantics

Cache coherence design has seen recurring trends since its employment in the early multiprocessor computers, where centralized directories were used to enforce coherence between the private caches. However, schemes based on shared snoopy buses took over to overcome the overheads associated with the centralized directories and to allow higher scalability [26, 59, 60]. In essence, shared-bus schemes transformed coherence from a centralized solution into a distributed solution where the processors collectively participate in maintaining the coherency. Nevertheless, once certain levels of scalability were reached, shared-bus schemes also failed to scale to a higher number of processors due to bottlenecks such as the bandwidth problem. As a result, industry once again reverted to directory-based coherence schemes and leveraged techniques such as distributed and sparse directory organizations to accommodate the ever-increasing scalability demands.

Based on the observation that more and more software is being written in compliance with the DRF semantics, coherence can be simplified by once again reverting to non-centralized, directory-less schemes, where the responsibility for maintaining the coherence is distributed among the cores. The DRF semantics guarantees that once a core is modifying a datum, other cores do not access the datum, even if they have the datum in their private caches. Synchronizations in the form of acquire/release and barrier semantics signal phase changes where: (i) modification of a datum by a core is finished and other cores can access the datum in the successive phase, or (ii) accesses to a datum by some cores are finished and another core can begin to modify the datum in the successive phase. The synchronizations can therefore be used to inform the hardware about the possible incoherency of the cached data, allowing the cores to self-invalidate their entire private caches, as well as updating the shared caches with their modified data in order to make the valid data accessible to all the cores in the successive phases (self-downgrade).

While the penalty is deemed acceptable for GPU/streaming workloads with coarse-grained sharing and global synchronizations, CPU workloads with fine-grained sharing suffer significantly from self-invalidating valid data that could be re-used in the successive phases. To achieve optimal results, self-invalidation should only be applied to the data that have been externally modified and are stale, which implies solutions that are more complex than the initial directory-based coherence schemes. The research question is therefore how we can efficiently restrict the self-invalidation to only the stale data or, if that is not possible, how we can minimize the amount of valid data that are needlessly self-invalidated.

There have been proposals based on disciplined programming languages, such as Deterministic Parallel Java (DPJ), where data are partitioned into regions with read-only or read-write access [42, 43, 44, 45]. The region information is passed to the hardware, which uses it to self-invalidate only the data that belong to the read-write regions.

At the other end of the spectrum sit proposals that use a more general approach to address the self-invalidation of the stale data [46, 47]. Such proposals depend on the operating system to detect the pages that are accessed by more than one core. This information is saved along with the page attributes in the page table and is leveraged by the cores to self-invalidate only the cache lines that belong to such shared pages. While such approaches do not require support from deterministic programming languages, the operating system is required to support the detection of the shared pages at runtime. Nevertheless, detecting shared data at page granularity renders the self-invalidation of non-shared cache lines inevitable. Moreover, once classified as shared, a page is considered shared throughout the runtime, although the page may be shared only for a short period before being accessed privately by a single core for the rest of the runtime.

In this chapter we introduce a data classification technique for cache coherence, called generational classification, which requires neither OS support nor deterministic programming languages. Cache lines are dynamically classified as private or shared based on the notion of cache-line generations [48]. Furthermore, the generational classification allows the shared data to be re-classified as private when possible, which we refer to as adaptation.


Figure 2.1. Creation and termination of the cache-line1 generations. A generation of a cache line is created when the cache line is brought into a private cache upon a cache miss. The cache line is frequently accessed until the locality shifts away from the cache line—start of the dead time. The cache line remains inactive in the cache until being replaced due to inactivity. At this point, the generation is said to terminate. Based on this notion, a cache line can have concurrent generations up to the total number of cores in the system.

Figure 2.2. Private/shared classification based on the notion of cache-line generations. A generation of the cache line containing datum A is created each time a core brings the cache line into its private cache following a cache miss on loading the datum A. (a) A cache line is classified as private if at any point in time there exists only one generation of the cache line (shown in blue). (b) A cache line is classified as shared if there exist concurrent generations (shown in orange), regardless of the generations being alive (active) or dead (dead time started but the generation not yet evicted/terminated).

1 Without loss of generality, cache line is the granularity of data at which the private/shared classification is performed and coherence is maintained.


Figure 2.3. System-level abstraction of the generational classification. Upon cache misses, cores send data requests to the LLC, and receive responses from the LLC. The LLC therefore observes all the requests needed to perform the generational private/shared data classification. The owner/count metadata is used to locate the cores that are the owners of private generations. For the shared generations, however, the metadata is used to keep the number of the sharers rather than the sharers themselves.

2.2 Generational Classification

Figure 2.1 and Figure 2.2 describe the notion of the cache-line generations [48] and the private/shared generational classification, respectively.

2.2.1 Classification storage and mechanisms

Data requests from all the cores are sent to the shared LLC if the hierarchy of the private caches cannot resolve the cache misses. The LLC therefore performs the generational classification, since the LLC observes the beginning of all the generations due to the private cache misses. Cache lines in the LLC as well as the private caches contain a single bit that denotes whether the cache line is private or shared (P/S bit)1. Moreover, cache lines in the LLC use additional metadata to locate the cores that are the owners of the private generations2 (Figure 2.3). The following scenarios are possible when a data request is received by the shared LLC:

Creation of a private generation: occurs when the data are not found in the LLC, or when the data exist in the LLC but are classified as neither private nor shared. If the data are not found in the LLC, the data are first requested from the memory. The LLC then sets the P/S bit to private, and registers the requestor as the private owner of the generation before responding to the requestor.

1 Used by the cores to self-invalidate their shared data upon synchronizations.
2 Used by the LLC to locate and downgrade the private data before responding to the new sharers.


Recovery: occurs when a non-owner core requests a private generation (Core 3 in Figure 2.2b). Recovery begins with a notification sent from the LLC to the registered owner of the private generation. Depending on the response from the owner, either the generation remains private with a new owner, or a shared generation is created, as described in Figure 2.4.

Update of the degree-of-sharing: The cores sharing the shared generations are not registered. The metadata for registering the private owners is instead used to keep the number of the sharers. Each new request for a shared generation increments the number of sharers. As an example, an additional request for the datum A in Figure 2.4b increments the number of the sharers to 3.
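The three scenarios above can be condensed into a single request handler at the LLC. The sketch below is a simplified model under assumed names (Cls, LLCLine, notify_owner_recovery, respond_with_data); it is not the thesis implementation, and it omits the memory fetch path, refresh requests, and the shared-to-private adaptation discussed in the next section.

```cpp
#include <cstdint>

// Simplified per-line LLC metadata for generational classification
// (assumed field names; a sketch, not the thesis implementation).
enum class Cls : std::uint8_t { None, Private, Shared };

struct LLCLine {
    Cls cls = Cls::None;
    int owner = -1;    // valid only while cls == Private
    int sharers = 0;   // degree-of-sharing, valid while cls == Shared
};

// Hypothetical hooks standing in for protocol messages.
bool notify_owner_recovery(int /*owner*/) { return false; }  // true = ACK (alive), false = NACK (terminated)
void respond_with_data(int /*requestor*/) {}

// Handle a private-cache miss that reaches the LLC.
void llc_handle_request(LLCLine& line, int requestor) {
    switch (line.cls) {
    case Cls::None:                      // creation of a private generation
        line.cls = Cls::Private;         // (data fetched from memory first if absent)
        line.owner = requestor;
        break;
    case Cls::Private:                   // recovery: a non-owner asks for a private line
        if (requestor != line.owner) {
            if (notify_owner_recovery(line.owner)) {
                line.cls = Cls::Shared;  // P2S: old generation still alive
                line.sharers = 2;        // previous owner plus the new requestor
            } else {
                line.owner = requestor;  // P2P: old generation already terminated
            }
        }
        break;
    case Cls::Shared:                    // update of the degree-of-sharing
        ++line.sharers;
        break;
    }
    respond_with_data(requestor);
}
```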

2.2.2 Classification Adaptivity

We consider the adaptivity of a private/shared classification scheme to be its ability to re-classify the shared data as private, when possible [49, 50]. In the absence of such adaptivity, eventually all the data in the system are classified as shared, inhibiting the optimizations for private data. For instance, shared migratory data can temporarily be classified as private while the data are being accessed by one core, before migration to other cores. While temporarily classified as private, the migratory data are exempt from self-invalidations and self-downgrades in the presence of synchronizations. The same is true for the producer-consumer sharing pattern. Temporarily classifying the shared data as private provides opportunities to mitigate the penalties of the self-invalidations and the self-downgrades.

Our generational classification so far exhibits a weaker form of this property, where private generations remain private with new owners during the recovery mechanism (P2P transition shown in Figure 2.4a). Nevertheless, we would like to have a stronger form of adaptivity, where the shared data are re-classified as private, thus minimizing the penalty of the self-invalidations.

Figure 2.4. Recovery. The LLC notifies the registered owner when a non-owner core requests a private generation. (a) The registered owner replies with a NACK if the generation is terminated. The requestor is registered as the new owner of the private generation (private-to-private transition). (b) The registered owner internally re-classifies the cache line as shared and replies with an ACK. In case the owner has modified the cache line (dirty bit is set), the ACK message shall also contain the data. The cache line is re-classified as shared by the LLC (private-to-shared transition) before responding to the new sharer.

In essence, a shared generation in the LLC effectively becomes private once the degree-of-sharing reaches one. However, the shared-to-private transition cannot take place at that point, since the core owning the only shared copy of the data is not known1 and therefore cannot be notified by the LLC to internally re-classify the cache line as private. The shared-to-private transition therefore takes place when a new core requests a shared generation in the LLC with a degree-of-sharing of zero.

The degree-of-sharing is incremented each time a new generation is created due to a request by a new core after a private cache miss. In order to be able to decrement the degree-of-sharing, cores need to inform the LLC when shared generations terminate. This is achieved by requiring the cores to send an Explicit Eviction Notification (EEN) upon the termination of a shared generation. However, EENs are required to be sent to the LLC only when evicting clean shared generations. Dirty cache lines are written back to the LLC, which implicitly signals the termination of the shared generations.
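The corresponding eviction logic in a private cache might look as follows. The sketch uses assumed names (L1Line, write_back_to_llc, send_een_to_llc) and is only illustrative: dirty lines terminate their generation implicitly through the write-back, while clean shared lines need an explicit EEN so that the LLC can decrement the degree-of-sharing.

```cpp
// Per-line state in a private (L1) cache for this sketch (assumed names).
struct L1Line {
    bool valid = false;
    bool dirty = false;
    bool shared = false;   // P/S bit as seen by the core
};

// Hypothetical message hooks toward the LLC.
void write_back_to_llc(const L1Line& /*line*/) {}  // also signals generation termination
void send_een_to_llc(const L1Line& /*line*/) {}    // explicit eviction notification

// On eviction, the LLC must learn that this generation has terminated so that
// the degree-of-sharing can be decremented (enabling S2P adaptation later).
void evict_line(L1Line& line) {
    if (!line.valid) return;
    if (line.dirty) {
        write_back_to_llc(line);   // implicit termination: the write-back carries the news
    } else if (line.shared) {
        send_een_to_llc(line);     // clean shared line: the EEN is the only notification
    }
    // Clean private lines may be dropped silently; the LLC learns of the
    // termination lazily, e.g. through a later recovery NACK (Figure 2.4a).
    line.valid = false;
}
```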

2.2.3 Adaptivity using Dead-Block Prediction

Dead-block predictors [51] can be used to terminate dead private generations early. We use the cache-decay technique—based on saturating counting bits—as a simple dead-block predictor [52] and study the impact of dead-block prediction on the private/shared classification (Paper II).
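A minimal sketch of the cache-decay idea follows; the counter width and decay interval are illustrative assumptions rather than the configuration evaluated in Paper II. Each line carries a small saturating counter that is cleared on every access and incremented on a coarse periodic tick; a saturated counter predicts that the generation is dead.

```cpp
#include <cstdint>

// Cache-decay style dead-block predictor (sketch). Counter width and decay
// interval are illustrative assumptions, not the values used in Paper II.
struct DecayState {
    std::uint8_t counter = 0;                  // 2-bit saturating counter
    static constexpr std::uint8_t kMax = 3;
};

// Reset on every access: the line is clearly still live.
void on_access(DecayState& s) { s.counter = 0; }

// Called once per decay interval (e.g. every few thousand cycles).
void on_decay_tick(DecayState& s) {
    if (s.counter < DecayState::kMax) ++s.counter;
}

// A saturated counter predicts that the generation is dead, so a recovery can
// terminate it immediately and hand out a fresh private generation
// (the case shown in Figure 2.5b).
bool predicted_dead(const DecayState& s) { return s.counter == DecayState::kMax; }
```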

Figure 2.5. Adaptive classification using dead-block prediction. (a) Access by Core 3 results in a P2S transition, although the private generation in Core 1 is dead. (b) The dead-block predictor predicts that the private generation in Core 1 is dead at point X, resulting in the termination of the generation upon the recovery and allowing Core 3 to receive a private generation.

1 We only keep the degree-of-sharing for the shared generations, not the sharers themselves.


2.3 Generational Coherence

The generational classification enables a simplified coherence scheme based on the DRF semantics with the following mechanisms:

Self-invalidation: cores self-invalidate their shared data upon synchronizations. Self-invalidation satisfies the write-propagation requirement for the coherence of the shared data.

Self-downgrade: to satisfy write-serialization (the data-value invariant), we require that cores self-downgrade their modified shared data to the LLC before crossing each synchronization point. Self-downgrade is performed in the form of lazy write-through using the MSHRs [46, 47]. The combination of self-invalidation and self-downgrade based on the DRF semantics eliminates the need for invalidations and indirections, resulting in a simplified coherence scheme where cores are guaranteed to receive consistent data from the LLC.

Refresh: unlike cache evictions, self-invalidations do not terminate the shared generations1. Having self-invalidations terminate all the shared generations would require EENs for all the shared data to be sent to the LLC before crossing the synchronizations, which would impose a bottleneck. Self-invalidated cache lines therefore remain in the caches, and EENs are sent to the LLC upon the eviction of those self-invalidated cache lines. Requesting the self-invalidated cache lines from the LLC should not increase the degree-of-sharing for those cache lines, since generations for the self-invalidated cache lines have already been created and exist in the LLC. In order to draw a distinction, cores issue refresh requests to the LLC to re-access the self-invalidated cache lines. In the presence of a refresh request, the LLC provides the data response without increasing the degree-of-sharing.

Shared-to-private (S2P) adaptation: based on the DRF semantics and the aforementioned mechanisms, refresh requests reveal the owners of the shared cache lines with a degree-of-sharing of one. As a result, the LLC adapts such generations to private upon receiving refresh requests.
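Taken together, these mechanisms amount to a small amount of per-core logic. The following sketch (assumed names and structures, not the actual hardware) shows what a core does at a synchronization point and on a later access to a self-invalidated line.

```cpp
#include <vector>

// Per-line private-cache state for this sketch (assumed names).
struct CacheLine {
    bool valid = false;
    bool dirty = false;
    bool shared = false;            // P/S classification bit
    bool self_invalidated = false;
};

// Hypothetical hooks toward the LLC.
void write_through_to_llc(CacheLine& /*line*/) {}   // lazy write-through via an MSHR
void send_refresh_request(CacheLine& /*line*/) {}   // re-fetch without creating a new generation

// Executed when the core crosses a synchronization point (acquire/release/barrier).
void on_synchronization(std::vector<CacheLine>& cache) {
    for (auto& line : cache) {
        if (!line.valid || !line.shared) continue;  // private lines are exempt
        if (line.dirty) {
            write_through_to_llc(line);  // self-downgrade: make the value visible in the LLC
            line.dirty = false;
        }
        line.self_invalidated = true;    // self-invalidate: ignore this copy from now on
    }
}

// A later access to a self-invalidated line issues a refresh rather than a normal
// miss, so the LLC does not increment the degree-of-sharing; with a degree-of-sharing
// of one, the LLC can adapt the generation back to private (S2P).
void on_reaccess(CacheLine& line) {
    if (line.valid && line.self_invalidated) {
        send_refresh_request(line);
        line.self_invalidated = false;
    }
}
```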

Figure 2.6 provides an LLC-centric view of the generational data classification in tandem with the generational coherence. The classification starts as private, transitions into shared, and adapts back to private, which highlights the adaptivity of the classification scheme.

1 Represented by decrementing the degree-of-sharing in the LLC.


2.4 Results

Our results reveal that the generational classification increases the amount of data classified as private by 50% compared to a coarse-grained non-adaptive classification scheme. However, classifying more data as private has no perceptible impact on the miss ratio and the execution time. This is also evident in Figure 2.7, which shows that the network traffic due to data movement (the red bars in the figure) hardly decreases in the presence of generational classification and coherence. This is attributed to how the S2P adaptation works. The S2P adaptation takes place on two occasions:

<S, 0>: when all the shared generations have terminated. In other words, a shared generation does not adapt back to private while the generation resides in the private cache. All the shared generations have to be terminated, either by being written back to the LLC—in case of being modified—or by sending the EENs to the LLC—in case of being clean—before a new generation can be born as private. As a result, the S2P adaptation involves data movement from the LLC to the private caches pertaining to the newly born private generations.

<S, 1>: when the only self-invalidated shared generation is re-accessed by the core before the generation terminates, i.e. before the cache line is evicted from the cache. Under this situation, the core issues a refresh request to the LLC to obtain a new copy of the data without creating a new generation. As discussed in Section 2.3, the LLC adapts the classification back to private before responding to the core. As illustrated, the S2P adaptation in this case also involves data movement from the LLC to the private caches.

Figure 2.6. LLC-centric view of the generational classification. The private classification requires the known owners of the private generations, whereas the shared classification only keeps track of the degree-of-sharing. The rightmost states <S, 0> and <P, Owner> are identical to the leftmost states NP and <P, Owner>, respectively; they are replicated only for readability.


The question can therefore be formulated as follows: how can coherence benefit from the adaptive classification if the S2P adaptation incurs data movement?

The answer lies in how the data re-classified as private via adaptation are re-accessed. Our results confirm that the generations re-classified as private are re-accessed once all the shared generations have terminated, and they remain private for considerably long epochs while being heavily modified by their private owners. This behavior reduces the network traffic by eliminating the write-through traffic. As shown in Figure 2.7, the reduction in the write-through traffic outweighs the overheads of the adaptive classification due to the EEN traffic (referred to as GC-Control) in the fft, fmm, water-sp, and swaptions benchmarks.

On the other hand, in benchmarks such as barnes, lu, raytrace, volrend, and water-nsq, the overheads of the adaptive classification outweigh the negligible reduction in the write-through traffic, and therefore manifest as a total increase in the network traffic.

There are also benchmarks, such as lu-nc and ocean, where the reduction in the write-through traffic is nullified by the increase in the write-back traffic, resulting in a net increase in traffic due to the adaptivity overheads.

[Figure 2.7: bar chart of the normalized network-traffic breakdown (Control, Data, Write-Back, Write-Through, GC-Control) for the MESI, VIPS-M, and GC protocols across the benchmarks barnes, cholesky, fft, fmm, lu, lu-nc, ocean, radiosity, raytrace, volrend, water-nsq, water-sp, blackscholes, canneal, and swaptions, plus the average.]

Figure 2.7. The generational data classification (GC) increases the amount of data classified as private and reduces the amount of data classified as shared, which manifests itself as a reduction in the write-through traffic caused by self-downgrading the shared data, compared to a coarse-grained non-adaptive classification [46, 47] (VIPS-M). Moreover, GC reduces the average network traffic compared to an invalidation-based MESI protocol by eliminating the invalidation traffic (Paper I). The adaptive classification incurs overheads due to the EENs (GC-Control). This increases the network traffic (compared to the non-adaptive schemes) when the data re-classified as private through adaptation are not re-used.


2.5 Summary

In this chapter, which corresponds to Paper II in this thesis, we introduced a fine-grained, adaptive, hardware-based private/shared data classification approach that enables a simplified directory-less cache coherence scheme based on the DRF semantics. Under this scheme, the responsibility for maintaining the cache coherence is distributed among the cores, where cores (i) self-invalidate their stale—shared—data upon synchronizations, resulting in the elimination of the write-induced invalidations in the form of multicasts or broadcasts, and (ii) self-downgrade their modified shared data by writing through to the LLC, resulting in the elimination of request forwarding and indirections.

We did not address the loss-of-information problem in this chapter. The classification metadata are embedded in the LLC tags, which are lost when the LLC entries are evicted. Nevertheless, we assumed unbounded storage for the classification information in our experiments, since the LLC miss ratio was negligible in our setup. We address this issue in the next chapter, where we design a minimal directory for the DRF semantics.


3. Dir1-SISD: a Minimal Directory for DRF Semantics

Everything should be made as simple as possible, but not simpler.

⎯ Albert Einstein

In this chapter we address the problem concerning the loss of the classification information introduced in the previous chapter. Instead of removing the directory in its entirety, we observe that a simplified directory offers manifold advantages to our generational classification and coherence scheme, including an elegant solution to the problem of losing the classification information as well as enabling the S2P classification adaptation at no cost.

3.1 The Directory for DRF Semantics

Figure 3.1 shows the design space for the directory cache [53, 54, 55, 56, 57, 58]. As depicted, communication and storage are opposing factors that are often difficult to balance. Nevertheless, by leveraging the private/shared classification and based on DRF semantics, we introduce a directory scheme that inherits the best properties of the entire design space.

Figure 3.1. The invalidation-based directory-cache design space. The hypothetical Dir0-B incurs no storage overhead; however, every memory store operation would incur an invalidation broadcast. At the other extreme, Dirn-NB incurs no broadcast; however, each directory entry is required to allocate storage for tracking all the cores.


A Dir1-X scheme lends itself well to the generational classification, since the generational classification only requires a single pointer that is used to track the owner of the private data. Sharers are not tracked by the directory; instead, the task of maintaining the coherence for the shared data is delegated to the sharers, where the sharers self-invalidate and self-downgrade their shared data upon synchronizations. Therefore, we have a Dir1 directory scheme that employs Self-Invalidation and Self-Downgrade as the directory actions, hence the name Dir1-SISD.

Dir1-SISD is a hybrid scheme where the responsibility for maintaining the coherence is distributed between the directory and the cores. The directory tracks the owners of the private data, and notifies the owner when P2P or P2S transitions are required. Once the data are classified as shared, the responsibility for maintaining the coherence is transferred to the cores by requiring the cores to self-invalidate and self-downgrade the shared data based on the DRF semantics. The resulting directory scheme encompasses the best features of the design space shown in Figure 3.1 by enabling a maximum degree of sharing—up to the total number of cores—using minimum storage—only a single pointer for the private data—with minimum communication—no multicasts or broadcasts, only a unicast per P2P/P2S transition.
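The storage cost of Dir1-SISD can be made explicit with a small sketch of a directory entry. The field widths below are illustrative assumptions, not the thesis configuration; the point is that one classification bit and one owner pointer suffice, independent of the number of sharers.

```cpp
#include <cstdint>

// Sketch of a Dir1-SISD entry (assumed layout). A full-map directory would
// need one presence bit per core; Dir1-SISD needs only a single owner pointer,
// because sharers are never tracked -- they self-invalidate and self-downgrade
// on synchronization instead.
struct Dir1SisdEntry {
    std::uint64_t tag    : 48;  // line address tag (width is illustrative)
    std::uint64_t shared : 1;   // 0 = private, 1 = shared
    std::uint64_t owner  : 8;   // owner core id, meaningful only when private
};
// The whole entry fits in a single 64-bit word on typical ABIs.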

3.1.1 Classification-Aware Inclusion Policy

We did not discuss the loss of classification information when introducing the generational classification and coherence in the previous chapter. The low miss ratio of the LLC would render the problem infrequent, which would justify the modeling of unbounded storage for the classification information.

Today’s applications’ large memory footprints make it impractical to back up the directory in memory. As a result, inclusion is maintained between the directory cache and the private caches, which incurs invalidation traffic and performance degradation. Although silent eviction of the directory entries is possible, broadcasts are required upon each directory miss in order to discover and re-build the sharing vector [56].

Nevertheless, by leveraging the DRF semantics, Dir1-SISD offers an elegant solution that does not require in-memory directory backup, does not maintain inclusion, and does not require broadcasts to discover and re-build the classification. Shared entries of the directory are evicted silently, since they contain no information about the sharers and the responsibility for maintaining the coherence for the shared data has already been transferred to the cores. Private entries, however, cannot be silently evicted, since they contain information about the owners of the private data and the responsibility for maintaining the coherence has not yet been transferred to the owners of the private data. As a result, the responsibility for maintaining the coherence has to be transferred to the private owners (via a P2S transition explained in the previous chapter) before the private entries are evicted from the directory.
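The classification-aware inclusion policy then reduces to a short eviction routine, sketched below with assumed names (notify_owner_p2s and a simplified entry view): shared entries disappear silently, while private entries first hand responsibility over to the owner through a P2S notification.

```cpp
// Simplified view of a Dir1-SISD entry, only the fields needed here (assumed names).
struct DirSlot {
    bool shared = false;
    int  owner  = -1;
};

// Hypothetical hook: tells the owner core to re-classify its copy as shared,
// so that it will self-invalidate/self-downgrade it at the next synchronization.
void notify_owner_p2s(int /*owner*/) {}

// Classification-aware eviction: no in-memory backup, no inclusion, no broadcast.
void evict_directory_entry(DirSlot& e) {
    if (e.shared) {
        // Silent eviction: sharers are not tracked and already maintain
        // coherence themselves, so no information is lost.
    } else {
        // Responsibility still rests with the directory, so transfer it to the
        // private owner (P2S transition) before dropping the entry.
        notify_owner_p2s(e.owner);
    }
    // The entry can now be reused for another line.
}
```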

3.1.2 Self-Correcting Classification

An astute observer is now faced with the scenario where inconsistent copies of a cache line exist in different cores, as shown in Figure 3.2, due to the eviction of the private directory entries and the re-creation of those private entries after the evictions. However, a private generation of a datum under the Dir1-SISD scheme is allowed to co-exist with the shared generations of the same datum. The DRF semantics guarantees that, despite being replicated in different caches, conflicting accesses to a datum never occur. Three scenarios are possible: (i) the owner might modify its private generation before the other sharers access the datum. Based on the DRF semantics, the sharers shall pass a synchronization point before accessing the shared data. As a result, the sharers self-invalidate their shared data and receive the up-to-date data from the LLC1; (ii) sharers might downgrade their modified data by writing through to the LLC. However, the directory detects the inconsistency of the classification and notifies the private owner to re-classify the cache line as shared (P2S notification); (iii) in case none of the aforementioned scenarios occur, the concurrent existence of private and shared generations is harmless and the generations shall eventually terminate without conflicts.

Figure 3.2. Self-correcting classification. (a) The cache line is classified as private, owned by Core 0. (b) Before the private entry in the directory could be evicted, the responsibility for the coherence for the private cache line is transferred to the owner by notifying Core 0 to re-classify the cache line as shared, although only one generation of the cache line exists (harmless due to the DRF semantics). (c) Core 1 receives the cache line as private since the entry is not found in the directory, resulting in two copies of the cache line residing in two different cores with both private and shared classification (harmless due to the DRF semantics.)

1 The directory invokes the P2S transition since the owner of the private data is known, and the up-to-date data are written into the LLC before the data response is sent to the requestor.
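Scenario (ii) can be illustrated, again extending the Dir1SISD sketch above (a hypothetical directory-side check, not the exact implementation): a write-through from a core that is not the registered private owner exposes the stale private classification, and the directory corrects it with a P2S notification.

def on_write_through(directory, addr, writer):
    # Check performed by the LLC/directory on every write-through.
    entry = directory.entries.get(addr)
    if entry is not None:
        state, owner = entry
        if state is State.PRIVATE and writer != owner:
            directory.notify_p2s(owner, addr)          # owner reclassifies as shared
            directory.entries[addr] = (State.SHARED, None)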


Figure 3.3. Network traffic comparison between the generational coherence (GC) and Dir1-SISD (results are normalized to a full-map directory implementing the MESI states.) Both protocols employ the same mechanisms such as self-invalidation and self-downgrade of the shared data, as well as the P2S transition. Additionally, GC employs explicit EEN messages to enable S2P classification adaptivity.

3.1.3 Implicit Classification Adaptivity

The adaptivity of the generational classification was discussed in detail in the previous chapter. EENs were used to mark the termination of the generations. The shared classification would adapt back to private once the degree of sharing reached zero. Nevertheless, as shown in Figure 2.7, the extra traffic due to the EENs nullifies the benefits of having adaptive classification in a number of benchmarks (Barnes, LU, Raytrace, Volrend, Water-Nsq).

Figure 3.3 compares the network traffic between the generational coherence (GC) and the Dir1-SISD. As shown in the figure, Dir1-SISD results in equal or less traffic compared to GC, although Dir1-SISD does not employ explicit mechanisms to enable adaptive classification. Rather, the adaptivity of classification is an intrinsic property of the Dir1-SISD. Directory evictions provide an intrinsic and natural means for adaptive classification with zero overhead, allowing subsequent requestors to classify the data as private, as shown in Figure 3.2. One can go a step further and use this Dir1-SISD feature to modulate the adaptivity by modulating the rate of the eviction of the shared directory entries, for instance by giving preference to the eviction of the shared entries, by decaying the shared entries after a period of inactivity, or even by dynamically re-sizing the directory at run-time, all of which are interesting directions for future work.

3.2 Directory Compression

In contrast to other directories, the primary functionality of Dir1-SISD is to track the private data and their owners, rather than the shared data. However, private data are far more common than shared data, which means that Dir1-SISD may face increased pressure compared to directories that only aim to track the shared data.


Figure 3.4. (a) Dual-grain directory organization. Page-directory entries point to the owners of the private pages. Line-directory entries can either point to the owners of the private cache lines, or represent the shared cache lines (sharers are not tracked). (b) Priority is given to the line-directory in case both page-directory and line-directory lookups are hits. Both directories can be looked up simultaneously to reduce the latency (the given algorithm assumes no P2S recovery for the line directory).

Dir1-SISD lends itself very well to low-complexity dual-grain directory organizations. As shown in Figure 3.4a, the dual-grain directory is composed of a line-directory and a page-directory. All the cache lines belonging to a private page are represented by a single entry in the page-directory. In a system with pages consisting of 64 lines, this translates into a compression ratio of 1/64, which significantly reduces the directory area. Furthermore, the page-directory and the line-directory can be looked up simultaneously to reduce the directory access latency.

The first access to a page allocates an entry in the page-directory. Further accesses by the page owner do not change the directory state. Upon receiving a request for the same page from a core other than the owner, an entry in the line-directory is allocated to resolve the classification for the conflicting cache line. While the entries in the page-directory always point to private owners (the first core accessing a page), entries in the line-directory might be classified as private or shared, depending on the outcome of the P2S recovery (Figure 3.4). Evictions from the page-directory are more expensive than the line-directory evictions, since all the lines belonging to the page need to be re-classified as shared in the owner’s private cache. However, page-directory evictions are expected to be rare compared to the line-directory evictions.
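A Python sketch of this dual-grain lookup follows (hypothetical structures; as in Figure 3.4b, no P2S recovery is modeled for the line-directory, so a conflicting line is simply installed as shared).

def classify(addr, core, page_dir, line_dir, page_size=64 * 64):
    # page_size assumes 64 lines of 64 bytes per page (an illustrative choice).
    page = addr // page_size
    line_hit = line_dir.get(addr)       # per-line entry: ("P", owner) or ("S", None)
    page_hit = page_dir.get(page)       # per-page entry: owner core id
    if line_hit is not None:
        return line_hit                 # line-directory has priority over page-directory
    if page_hit is None:
        page_dir[page] = core           # first touch: whole page private to this core
        return ("P", core)
    if page_hit == core:
        return ("P", core)              # the owner keeps hitting its private page
    # Conflict: another core touches a line of a private page; allocate a
    # line-directory entry for that line only and classify it as shared.
    line_dir[addr] = ("S", None)
    return line_dir[addr]

# Usage: core 0 owns page 0; core 1 touching one of its lines creates a shared
# line-directory entry while the rest of the page stays private to core 0.
page_dir, line_dir = {}, {}
print(classify(0x0040, 0, page_dir, line_dir))   # ('P', 0)
print(classify(0x0040, 1, page_dir, line_dir))   # ('S', None)
print(classify(0x0080, 0, page_dir, line_dir))   # ('P', 0)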

3.3 Results

Our results reveal that Dir1-SISD reduces the network traffic by about 20% compared to a coherence protocol that uses the same self-invalidation and self-downgrade mechanisms in tandem with a non-adaptive, page-granular private/shared data classification that is handled by the OS (Figure 3.5.)


Figure 3.5. Dir1-SISD results in 20% reduction in the network traffic compared to a non-adaptive, OS-based, page-granular private/shared data classification and coherence scheme (VIPS-M) (results are normalized to a full-map directory implementing the MESI states.)

3.4 Summary

In this chapter, which corresponds to Paper III in this thesis, we introduced a minimal directory scheme (Dir1) that lends itself very well to our generational classification scheme and the cache coherence based on the DRF semantics. Instead of advocating the elimination of the directory, we observe that such a minimal directory offers manifold advantages to our generational classification and coherence scheme, including an elegant solution to the problem of losing the classification information, as well as enabling classification adaptivity for free. In the next chapter we extend our directory to address the generational data classification and coherence for the clustered and hierarchical topologies.


4. Hierarchical Classification and Coherence for DRF Semantics

All problems in computer science can be solved by another level of indirection.

⎯ David Wheeler

In this chapter we extend our Dir1-SISD directory to address a holistic private/shared classification and coherence scheme suited for the clustered and hierarchical topologies. Unlike the flat schemes discussed so far, a hierarchical classification approach needs to incorporate into classification the scope in which the data are being classified. A datum can therefore be concurrently classified as private and shared due to concurrently being in nested scopes. Such scope-aware data classification enables scoped synchronization that, by enabling more selective self-invalidation of the shared data, mitigates the penalty of self-invalidations in the shared caches in the hierarchies.

4.1 Private/Shared Classification in Hierarchies

Hierarchical design has proven effective in facing the high demands of computing and the ever-increasing number of cores in multi-/many-core architectures [61-73]. We are therefore interested in studying the generational classification and coherence in the context of hierarchical and clustered systems.

Figure 4.1 shows the private/shared data classification in the hierarchies using a non-hierarchical data classification scheme. Although the uniformity of the classification across the whole hierarchy requires less complex mechanisms, the classification suffers from unnecessarily classifying the data as shared. As an example, the green data in Figure 4.1 are shared between cores 0 and 1, although the same data are considered private between cores 0 and 2. The ultimate goal of hierarchical private/shared classification is to adjust the classification across the hierarchy and to preserve the private classification, where possible, in order to mitigate the penalty of bulk self-invalidation of the shared data in the hierarchies. Unless duly addressed, self-invalidations in the shared intermediate caches in the hierarchies are considerably more expensive than self-invalidating the private L1 caches, due to the larger capacity of the intermediate shared caches.


Figure 4.1. Non-hierarchical private/shared data classification applied to a hierarchical organization (P: Private, S: Shared.) Green data are shared by cores 0 and 1, blue data are shared by cores 3 and 4, and orange data are shared by cores 5 and 6. Black data are private to core 7. Green data are unnecessarily classified as shared in the L3 cache and the LLC. Similarly, orange data can be classified as private in the LLC.

4.1.1 Topology and Terminology

L1 caches in Figure 4.1 are referred to as sources, where initial requests originate upon L1 cache misses. Caches between the L1s and the LLC are referred to as intermediate caches. A cluster, marked by dashed lines in Figure 4.1, represents an entity comprised of sources and a hierarchy of shared intermediate caches. The highest intermediate cache in a cluster, which is shared by all the sources in that cluster, is referred to as the sink. A cluster is represented by its sink. As an example, the leftmost L2 cache in Figure 4.1 is the sink for the cluster encompassing sources 0 and 1, and the rightmost L3 cache is the sink that represents the cluster encompassing sources 4 to 7. Similarly, the LLC is considered as the global sink that represents the global cluster (super-cluster) encompassing all the clusters (sub-clusters.) Moreover, each cluster defines a scope that corresponds to the sink-centric view of how a datum is shared between the sub-clusters that share the sink.

4.1.2 Plain Hierarchical Classification

Figure 4.2 shows the plain hierarchical private/shared classification scheme [74] applied to the same configuration used in Figure 4.1. As shown in Figure 4.2, the plain hierarchical classification preserves the private classification in the scopes where data are private. More specifically, the scope in which data are classified as shared does not span over the sink of the cluster where data are shared between the sources inside that cluster. Although the data might be shared between the cores inside the clusters, the same data are classified as private when observed from any scope outside those clusters.

Nevertheless, there are opportunities that can be leveraged in order to preserve the private classification in the intermediate caches inside the scopes where data are shared, leading us to the scope-aware classification.


Figure 4.2. Plain hierarchical classification. Data that are shared inside the clusters are classified as private outside the clusters. More specifically, the scope in which data are classified as shared does not span over the sink of the cluster in which data are shared between the sources.

4.1.3 Scope-Aware Classification

The orange data in Figure 4.2 are shared between cores 5 and 6 in the cluster marked by the dashed line, and are therefore classified as shared anywhere in the corresponding scope (in the sources, in the intermediate caches, and in the sink of the cluster). Nevertheless, the orange data are private in the scopes defined by the sub-clusters within the mentioned cluster. In other words, the plain hierarchical classification is not aware of the scope of the classification, and therefore overrides the internal-scope classification with the external-scope1 classification. To remedy this problem, we introduce the scope-aware classification that preserves the external-scope classification without losing the internal-scope classification.

Scope-aware classification can be thought of as a multimorphic classification, where the classification can be both private and shared at the same time. In such a scheme, a unique private or shared classification is only meaningful with respect to the scope to which the classification refers2. Scope-aware classification is implemented by extending the plain hierarchical classification to capture both internal and external scopes at each intermediate cache, as shown in Figure 4.3. The scope-aware classification further enables the scoped synchronization that allows more selective self-invalidation of the shared data in the intermediate caches.
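As a minimal illustration, the scope-aware state kept at an intermediate cache can be sketched as a pair of flags (a hypothetical encoding, two logical bits per line), here shown in Python with the orange data of Figure 4.3 as the example.

from dataclasses import dataclass

@dataclass
class ScopedClass:
    internal: str  # "P" or "S": how the line is shared among this cache's sub-clusters
    external: str  # "P" or "S": how the line is shared outside this cache's cluster

# Orange data: private internally but shared externally at the L2 sink
# (only source 5 touches it inside that sub-cluster), and shared internally
# but private externally at the L3 sink (both sub-clusters share it there).
orange_at_L2 = ScopedClass(internal="P", external="S")
orange_at_L3 = ScopedClass(internal="S", external="P")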

4.2 Scoped Synchronization

The penalty of self-invalidating the shared data is exacerbated in hierarchical topologies due to the larger capacities of the intermediate shared caches.

1 Internal and external scopes are defined w.r.t. the clusters/sinks where data are classified. 2 Analogous to the ideas behind general relativity and quantum mechanics.


Figure 4.3. Scope-aware classification. Classification is encoded as <internal-scope, external-scope>. The first element (preceded by an upward pointing arrow) corresponds to the internal scope inside the clusters and the second element (preceded by a downward pointing arrow) corresponds to the external scope outside the clusters. As an example, the orange data are internally classified as private and externally classified as shared in the cluster represented by the sink at L2. This is due to the fact that the orange data are only accessed by source 5 in the mentioned sub-cluster; however, the same data are shared by the two sub-clusters when seen from the external scope represented by the sink at L3. Based on the same reasoning, the orange data are classified as shared at internal scope and private at external scope in the sink at L3.

Unlike the sources, where self-invalidations only affect the shared data that belong to the source, self-invalidations in the intermediate caches affect the shared data that belong to the cores other than the source that is performing the self-invalidation. The generational classification does not track the sharers1 and therefore self-invalidations cannot detect and invalidate only the shared data that belong to the core issuing the synchronizations. As a result, each self-invalidation invalidates all the shared data in the intermediate caches. Software and hardware solutions have been proposed to remedy this problem for the heterogeneous architectures [75, 76, 77, 78]. Nevertheless, the scope-aware classification enables the scoped synchronization, which offers an elegant solution to the problem without complicating the memory model. In other words, the goal of scoped synchronization is to mitigate the penalty of the self-invalidations by applying the self-invalidations only to the scopes for which the synchronizations were meant.

At each intermediate cache, the data that are externally classified as private are exempt from being self-invalidated, regardless of being classified internally as private or shared. Such data are guaranteed not to be externally modified due to being classified as private at external scope. The data that are classified as shared at the external scope are likely to be externally modified in the current phase, and are therefore required to be self-invalidated before crossing a synchronization point. However, consider the case in Figure 4.3 where core 5 and core 6 are crossing a synchronization point for the shared orange data. A self-invalidation request from core 5 is sent to the L2, where both externally-shared blue and orange data reside. The blue data are internally classified as private to core 4 and the orange data are internally classified as private to core 5. Self-invalidating the externally-shared blue data is redundant in this case since core 5 is synchronizing for the orange data only, and core 4 may lose its valid externally-shared data which are being frequently accessed by the core. Self-invalidating all the externally-shared data is therefore not required, and can be mitigated through scoped synchronization, which requires that the externally-shared internally-private data be self-invalidated only if the synchronizations come from the owners of those data in the internal scope. The data that are classified as shared both at internal and external scopes are nevertheless required to be self-invalidated, since sharers are not tracked and therefore relations cannot be established between the synchronizations and the intended data sets.

1 Recall that each directory entry has only a single pointer used to track the private owner. The responsibility for maintaining coherence for the shared data is transferred to the cores.

On the other hand, in the aforementioned example core 5 may also need to access the blue data after the synchronization. Core 5 is then likely to receive stale data from the L2 cache, since the blue data are not self-invalidated due to the scoped synchronization. In order to address this case, scoped synchronization requires that accesses to the valid <internally-private, externally-shared> data by cores other than the private owner in the internal scope read the value through the higher shared caches (e.g. the sink or the LLC), where the valid value is guaranteed to exist. Once the values are read through, the data are re-classified as <internally-shared, externally-shared>, which makes the data susceptible to all the future synchronizations.

As the previous example showed, the scoped synchronization aims to mitigate the penalty of the self-invalidations in the intermediate shared caches in the hierarchies. Instead of bulk self-invalidation of all the shared data, only the data that belong to the core crossing the synchronization are self-invalidated. Nevertheless, the data are read through on demand, which translates into more selective self-invalidations. Furthermore, as the example showed, the scoped synchronization does not rely on the software to obtain information about the scopes, thus not complicating the memory model. The scope of the synchronization is dynamically detected in the hardware based on the identity of the core issuing the synchronization and the internal and external scopes of the classification.

4.3 Dir1-H: The Hierarchical Coherence

The Dir1-SISD coherence scheme introduced in the previous chapter can be extended to hierarchical topologies by employing the scope-aware classification. The resulting coherence scheme requires the following mechanisms:


Table 4.1. Scoped-synchronization behavior.

Scope-Aware Classification                Response to Synchronizations

Internally-Private, Externally-Private    Insensitive (private data only accessed by one core)
Internally-Shared, Externally-Private     Insensitive (sinks always contain consistent data)
Internally-Private, Externally-Shared     Self-invalidate on synchronizations by private owner
Internally-Shared, Externally-Shared      Always self-invalidate

Self-invalidation: L1s are required to self-invalidate their shared data upon synchronizations. Intermediate caches up to the LLC self-invalidate their data based on the scope of the classification and the synchronization (i.e. the scoped synchronization) summarized in Table 4.1.

Self-Downgrade: L1s self-downgrade their shared data by writing through to the intermediate caches. Intermediate caches write through to higher levels only if data are externally classified as shared, regardless of the internal classification. Writing the data through to the intermediate caches terminates at the sink of the cluster or the LLC, whichever is reached first.

Read-Through: accesses to <Internally-Private, Externally-Shared> data by cores other than the private owner have to read the valid value through the sink, as described in section 4.2.
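As an illustration of how Table 4.1 and the scoped synchronization combine at an intermediate cache, the following Python sketch (hypothetical structures; owner denotes the internal-scope private owner of a line) selects the lines to self-invalidate on a synchronization by a given core.

def lines_to_self_invalidate(cache_lines, sync_core):
    victims = []
    for addr, line in cache_lines.items():
        internal, external = line["int"], line["ext"]
        owner = line.get("owner")
        if external == "P":
            continue                      # externally private: never stale, exempt
        if internal == "P" and owner != sync_core:
            continue                      # scoped sync: only the internal owner's data
        victims.append(addr)              # <S,S>, or <P,S> owned by the syncing core
    return victims

# Usage, mirroring the blue/orange example: core 5's synchronization leaves
# the blue line (internally private to core 4) untouched at the L2 sink.
l2 = {
    0x100: {"int": "P", "ext": "S", "owner": 5},   # orange
    0x140: {"int": "P", "ext": "S", "owner": 4},   # blue
    0x180: {"int": "S", "ext": "S"},
    0x1c0: {"int": "S", "ext": "P"},
}
print([hex(a) for a in lines_to_self_invalidate(l2, sync_core=5)])  # ['0x100', '0x180']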

4.3.1 Hierarchical Inclusion Policies

The complexity of the coherence mechanisms discussed in the previous section is affected by the inclusion policy chosen for the shared data in the intermediate caches. The mechanisms discussed so far are designed for an inclusive hierarchy. Similar to private data, shared data are replicated in all the levels of the hierarchy. Although an inclusive policy for the shared data reduces the associated miss latency, self-invalidations and self-downgrades become more complex as they need to traverse the hierarchy towards the sinks. On the other hand, not enforcing inclusion for the shared data reduces the complexity and the latency of the self-invalidations and the self-downgrades, as the self-invalidations are only required to affect the sources and the self-downgrades only need to write directly to the sinks, which also reduces the network traffic caused by the write-throughs. These trade-offs are studied in detail in Papers IV and I in this thesis.

4.4 Results

Our results reveal that a hierarchical coherence scheme based on the scope-aware classification reduces the average network traffic and the execution time by 30% and 5%, respectively, compared to a hierarchical coherence scheme not adopting the scope-aware classification (Figures 4.4 and 4.5.)


Figure 4.4. Comparison of the execution times of hierarchical coherence based on scope-aware data classification (Dir1-H) vs. coherence based on plain hierarchical data classification (VIPS-H.) Results are normalized to an invalidation-based coherence protocol implementing the MOESI states (see Papers IV and I.)

4.5 Summary

In this chapter, which corresponds to Papers IV and I in this thesis, we introduced a holistic classification scheme, referred to as scope-aware classification, which extends the private/shared data classification to the hierarchical and clustered topologies. The scope-aware scheme is a multimorphic classification, where the same datum is classified as both private and shared due to the existence of nested scopes. As a result, the classification can be interpreted as either private or shared, depending on the scope in which data are being classified. Scope-aware classification further enables the scoped synchronization, which can be leveraged to mitigate the penalty of excess self-invalidations of the shared data in the intermediate shared caches in the hierarchies.



Figure 4.5. The figure compares the network traffic of the Dir1-H coherence scheme (scope-aware hierarchical classification), the VIPS-H coherence scheme (plain hierarchical classification), and MOESI (an invalidation-based protocol). Traffic is broken down into control, data, write-back, and write-through messages, and normalized to MOESI. Dir1-H reduces the network traffic by eliminating the invalidation messages (compared to MOESI) and by reducing the write-through traffic generated by self-downgrading the shared data (compared to VIPS-H).


5. The Best of Both Schemes: A Hybrid Classification Approach

The best way to have a good idea is to have lots of ideas.

⎯ Linus Pauling

The idea behind the private/shared classification is to limit the self-invalidations to the shared data only, thus minimizing the penalty of the self-invalidations. The adaptive generational classification aims to further mitigate the penalty of the self-invalidations by minimizing the amount of data classified as shared. In this chapter, however, we investigate the impact of the private/shared data classification on the self-downgrades. Under our DRF coherence scheme, cores self-downgrade their modified shared data by writing through to the shared caches. Fewer shared data thus translate into fewer write-throughs, minimizing the network traffic. This leads us to a hybrid classification scheme, where both the cache-line generations and the amount of write-through traffic affect the outcome of the data classification.

5.1 The Penalty of Self-Invalidation and Self-Downgrade

The private/shared data classification for DRF coherence schemes was extensively addressed in the previous chapters. An ideal private/shared data classification scheme does not misclassify the private data as shared due to using coarser classification granularities or due to being temporality-agnostic [49, 50]. Nevertheless, such a precise classification incurs overheads that would nullify the benefits of simple DRF coherence based on private/shared data classification. As a result, fine-grained and adaptive generational data classification schemes were introduced in the previous chapters to minimize the data classified as shared, thus minimizing the penalty of the self-invalidations. However, besides mitigating the penalty of the self-invalidations, fewer shared data also reduce the amount of the write-through traffic. In other words, the benefits of classifying fewer data as shared are twofold: reducing the data traffic (LLC-to-Core) as well as reducing the write-through traffic (Core-to-LLC).


Figure 5.1. The figure shows a hypothetical data classification pattern for the data accessed by a core within an epoch. Frequent synchronizations due to fine-grained sharing (lock-acquire/lock-release) within epochs (barriers) result in frequent self-invalidation of the shared data accessed within each epoch outside the critical section. Temporarily classifying such shared data as private without violating the DRF semantics can result in fewer cache misses within each epoch. Nevertheless, such temporal private classification is only useful if the data re-classified as private are frequently accessed within the epoch, thus amortizing the overheads of temporarily re-classifying the data as private.

5.1.1 Impact of Shared Classification on Self-Invalidations

The penalty of unnecessarily classifying the private data as shared manifests as the increase in the data traffic from the LLC to the cores caused by self-invalidating the valid data that are in use. Such redundant data movement is exacerbated in the presence of frequent synchronizations due to fine-grained sharing in larger DRF epochs, as depicted in Figure 5.1.

5.1.2 Impact of Shared Classification on Self-Downgrades

In order to eliminate the complexities associated with directory indirections, the DRF coherence schemes discussed in the previous chapters require that cores self-downgrade their modified shared data by writing through to the shared caches (LLC.) This enables a simplified DRF coherence scheme where cores always request data from the LLC and receive a direct response from the LLC. The LLC is guaranteed to contain consistent data provided that the DRF semantics are adhered to. Nevertheless, increased write-through traffic can potentially nullify the benefits of such simple coherence mechanisms and inhibit the scalability of the system. Figure 5.2 explains the problem by giving an example.

In this chapter we investigate how the intensity of the write-through traffic can be used as a feedback to modulate the outcome of the generational private/shared data classification, so that the network traffic due to the write-through activity is reduced.


Figure 5.2. Cores self-downgrade their shared data by writing through to the shared caches (LLC). Nevertheless, frequent write-through is undesirable, in particular when the write-through traffic generated by a cache line exceeds the traffic caused by writing back the whole cache line. The intensity of the orange color in the figure is proportional to the intensity of the write-through traffic generated by the shared data. The network traffic can be reduced by temporarily re-classifying as private the shared data with high write-through traffic (data in dark-orange color in the figure).

5.2 Self-Downgrade vs. Demand-Downgrade

Self-downgrade of the shared data in the form of writing through to the LLC has the advantage of having the valid data available in the LLC once the synchronization point is crossed, therefore reducing the latency of accessing the valid data for the successive readers. Nevertheless, we are interested in studying the trade-off between the increase in the latency and the reduction in the network traffic. To study the trade-off, we investigate the other extreme in the design space, where self-downgrade and the associated write-through traffic are eliminated in their entirety. Instead, data are downgraded on demand when requested by readers.

Once the self-downgrade of the modified shared data is eliminated, the valid data are not found in the LLC anymore. Measures need to be taken so that the successive readers know where to receive the valid data. We achieve this by requiring that cores register themselves before modifying the data [42, 45]. This is equivalent to a hybrid hardware/software approach for enforcing the single-writer-multiple-readers invariant: the DRF semantics of the software guarantee that at most one core is allowed to modify the data at any given time, and the identity of that core is tracked by the hardware1. In essence, this translates into temporarily classifying the cache lines as private upon modifications. This allows the modified data to be exempt from being self-downgraded via write-through to the LLC. Instead, the data will be downgraded on demand via P2S recovery (results shown in Figures 5.3 and 5.4).

1 This invariant is already enforced by the DRF semantics of the software at the smallest data granularity. However, the hardware tracks the writers per cache-line basis.
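The following Python sketch (hypothetical names such as Dir1SI; a simplification of the scheme evaluated in this chapter) illustrates the write registration and the demand-downgrade: a writer is merely recorded, and the dirty data are recovered through a P2S transition only when another core actually reads the line.

class Dir1SI:
    def __init__(self):
        self.writer_of = {}    # line address -> core registered as the current writer

    def write(self, addr, core):
        # DRF guarantees at most one writer per datum in a phase; the hardware
        # just records who it is, per cache line. No write-through is sent.
        self.writer_of[addr] = core

    def read(self, addr, core):
        owner = self.writer_of.get(addr)
        if owner is not None and owner != core:
            # Demand-downgrade: recover the dirty data from the owner (P2S),
            # update the LLC, then answer the reader from the LLC.
            self.writer_of.pop(addr)
            return f"P2S recovery from core {owner}, data served via LLC"
        return "served directly from the LLC"

d = Dir1SI()
d.write(0x200, core=0)
print(d.read(0x200, core=1))   # triggers the on-demand downgrade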


Figure 5.3. The figure compares the execution time of the Dir1-SISD coherence scheme (discussed in chapter 3) and the Dir1-SI scheme explained in this chapter, which only employs the self-invalidation mechanism and replaces the self-downgrade with the demand-downgrade. As shown in the figure, Dir1-SI performs on average equal to Dir1-SISD, which emphasizes that self-downgrade in the form of write-through does not have a significant impact on the performance across the benchmarks used in the evaluation.


Figure 5.4. The figure compares the network traffic of the Dir1-SISD coherence scheme (discussed in chapter 3) and the Dir1-SI scheme explained in this chapter (which only employs the self-invalidation mechanism and replaces the self-downgrade with the demand-downgrade). Traffic is broken down into data, write-through (WT), write-back (WB), and other messages, and normalized to Dir1-SISD. As shown in the figure, both protocols behave similarly on average.


While some benchmarks generate less traffic with the Dir1-SI scheme, Figure 5.4 also reveals benchmarks whose network traffic is increased. The detailed traffic breakdown shows that the following characteristics prevail in all the benchmarks whose network traffic decreases with Dir1-SI (barnes, fmm, radiosity, and fluidanimate in Figure 5.4, referred to as group I benchmarks):

• Data traffic decreases,
• P2S traffic does not increase, or the slight increase is balanced out by the reduction in the data and write-through traffic.

The decrease in the data traffic with Dir1-SI is the result of temporarily classifying more data as private upon data modifications. Having more data classified as private mitigates the penalty of the self-invalidations in the presence of the frequent synchronizations for fine-grained sharing.

The increase in the P2S traffic is also attributed to the classification of more data as private: downgrading the modified shared data by writing through to the LLC eliminates the need to downgrade the data on demand, whereas with Dir1-SI those downgrades happen via the P2S recovery.

On the other hand, the following characteristics prevail in the benchmarks whose network traffic increases with Dir1-SI (ocean, volrend, water-sp, and swaptions in Figure 5.4, referred to as group II benchmarks):

• Data traffic increases,
• P2S traffic significantly increases.

The increase in the P2S traffic in the group II benchmarks is higher than the eliminated write-through traffic. Moreover, the increase in the data traffic in those benchmarks is attributed to the data that are sent by the LLC to the cores in response to the requests for modifying the data. This is necessary in order to guarantee the data consistency, although a core might already have the cache line in its private cache in the valid state1. While this overhead traffic is balanced out in the group I benchmarks, classifying more data as private is ineffective in mitigating the penalty of the self-invalidations in the benchmarks of group II, primarily due to the fact that the data re-classified as private are not reused. As a result, no savings are possible with respect to the self-invalidations, and therefore the overhead traffic manifests itself. The situation is exacerbated in the swaptions benchmark in the presence of a significant increase in the P2S traffic due to the false-sharing effect [81]. In addition to the false-sharing problem, the write-back traffic significantly increases in swaptions due to the replacement of the partially written cache lines.

1 Although unnecessary data movement can be avoided by using techniques such as version numbers [27], they are not adopted in order to maintain the simplicity of the scheme.


5.3 Hybrid Data Classification

Based on the results from the previous section, the optimal coherence scheme is a hybrid approach that applies demand-downgrade (write-back) to the shared data in the benchmarks of group I and self-downgrade (write-through) to the shared data in the benchmarks of group II (section 5.2). From the data-classification point of view, this is equivalent to a hybrid classification approach where both the cache-line generations and the write-through traffic affect the outcome of the private/shared classification:

• Cache lines with only one generation are classified as private.
• Cache lines with more than one generation are classified as shared; however, the shared generations adapt back to private if the generated write-through traffic for a shared cache line exceeds the traffic caused by writing back the whole cache line.
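In its idealized form, the hybrid rule can be sketched as follows (hypothetical counters; the accumulated write-through bytes of a line are compared against the cost of writing back the whole line).

LINE_SIZE = 64  # bytes, an illustrative cache-line size

def classify_hybrid(generations, wt_bytes):
    if generations <= 1:
        return "private"
    # Shared, but adapt back to private once the write-through traffic of the
    # line would exceed writing back the whole line.
    return "private (adapted)" if wt_bytes > LINE_SIZE else "shared"

print(classify_hybrid(generations=1, wt_bytes=0))     # private
print(classify_hybrid(generations=3, wt_bytes=16))    # shared
print(classify_hybrid(generations=3, wt_bytes=128))   # private (adapted)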

Modulating the classification based on the intensity of the write-through traffic highlights the adaptivity of this hybrid approach, which complements the adaptivity gained through the directory evictions in sparse directories (chapter 3, Dir1-SISD). Nevertheless, measuring the write-through intensity imposes complexity and cost, which should be avoided in order not to nullify the advantages of having adaptive classification (discussed next).

5.3.1 Hybrid Approach Using Speculation

Mechanisms similar to the 1-bit branch predictors can be used to speculate on the density of the write-through activity for the shared data (Figure 5.5). However, this approach may not be accurate in the presence of the lazy write-through approach [46], where the writes are coalesced in the private caches before being written through to the LLC.

Figure 5.5. Shared classification is adapted back to private when two consecutive WTs from the last reader are observed in the LLC (the owner pointer, which is unused in shared classification, is utilized to register the last reader). The WT response from the LLC to the issuing core notifies the core to internally re-classify the data as private. If required, the precision can be tuned by adding more states (S-w2, S-w3, etc.) The adaptation is cancelled by going from S-w1 to S-w0 due to requests not belonging to the last reader, which happens in case of false sharing.
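A Python sketch of this two-state speculation (a hypothetical encoding of the S-w0/S-w1 states; the otherwise-unused owner pointer is modeled as a last_reader field) could look as follows.

class SharedEntrySpec:
    def __init__(self):
        self.state = "S-w0"
        self.last_reader = None   # kept in the otherwise-unused owner pointer

    def on_read(self, core):
        self.last_reader = core

    def on_write_through(self, core):
        if core != self.last_reader:
            self.state = "S-w0"          # request not from the last reader
            return "stay shared"         # (e.g. false sharing): cancel the adaptation
        if self.state == "S-w0":
            self.state = "S-w1"
            return "stay shared"
        return "adapt to private"        # second consecutive WT from the last reader

entry = SharedEntrySpec()
entry.on_read(core=2)
print(entry.on_write_through(core=2))    # stay shared (S-w0 -> S-w1)
print(entry.on_write_through(core=2))    # adapt to private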



5.3.2 Hybrid Approach Optimized for Migratory Sharing

Optimizing the coherence for the migratory sharing pattern has been studied in the previous works [79, 80]. The migratory sharing pattern is formulated by the regular expression:

Ri+ Wi (Ri | Wi)*  Rj+ Wj (Rj | Wj)*  …

where + and * denote one-or-more and zero-or-more instances, and Ri and Wi denote read and write access to the same datum by the same core, respectively [79]. Shared migratory data may generate high write-through traffic before each migration, which can be avoided by temporarily classifying the migratory data as private. The DRF Dir1 directory organization introduced in the previous chapters lends itself very well to the detection of the migratory data. The directory pointer is used to register the last reader. Classification is adapted back to private in case the last reader writes its modified data through to the LLC. The response to the write-through notifies the issuing core to internally adapt the classification back to private.

5.3.3 Hybrid Approach Optimized for Producer-Consumer Sharing

The DRF semantics in the software guarantee that data being modified by a core are not accessed by another core before crossing a synchronization point. Based on this guarantee, the shared classification can be adapted back to private immediately upon receiving a read-to-modify request1, without even waiting for the modified data to be written through. This is the case for the producer-consumer sharing pattern, where the producer modifies the data (read-to-modify) before the consumer reads the data.
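The two adaptation triggers can be contrasted in a small sketch (hypothetical event names; GETX denotes the read-to-modify request): migratory sharing adapts when the last reader writes the line through, whereas producer-consumer sharing adapts immediately on the read-to-modify request.

def adapt_to_private(event, requesting_core, last_reader):
    if event == "GETX":
        return True                      # producer-consumer: adapt immediately
    if event == "WT" and requesting_core == last_reader:
        return True                      # migratory: the last reader writes the line
    return False

print(adapt_to_private("GETX", requesting_core=3, last_reader=None))  # True
print(adapt_to_private("WT",   requesting_core=1, last_reader=1))     # True
print(adapt_to_private("WT",   requesting_core=2, last_reader=1))     # False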

5.4 Results

Our results reveal that the hybrid private/shared classification discussed in this chapter provides an optimum scheme that reduces the network traffic by up to 15%.

1 The notations GETS and GETX are typically used to refer to read-to-shared and read-to-modify requests, respectively.


Figure 5.6. The figure compares the network traffic of the coherence schemes discussed in this chapter. Dir1-SISD and Dir1-SI were discussed in detail in the previous sections. Dir1-SI-Hybrid-1 adds to Dir1-SI the optimization for the migratory sharing. Dir1-SI-Hybrid-2 adds the optimization for the producer-consumer sharing to Dir1-SI-Hybrid-1. Traffic is broken down into data, write-through (WT), write-back (WB), and other messages, and normalized to Dir1-SISD.


Figure 5.7. Comparison of the execution time of the different coherence schemes discussed in this chapter. Dir1-SISD and Dir1-SI were discussed in detail in Figure 5.3. Dir1-SI-Hybrid-1 adds the migratory-sharing optimization to the Dir1-SI scheme. Dir1-SI-Hybrid-2 adds to Dir1-SI-Hybrid-1 the optimization for the producer-consumer sharing.

5.5 Summary

In this chapter, which corresponds to Paper V in this thesis, we introduced a hybrid private/shared data classification and coherence scheme, where the decision to classify the data as private or shared does not only rely on the notion of the cache-line generations (introduced in chapter 2), but also depends on the density of the self-downgrade activity, which manifests as write-through traffic. In other words, the density of the write-through traffic is leveraged as a feedback mechanism to temporarily classify the shared data as private, resulting in an overall reduction in the network traffic.



6. Future Directions

Research is what I am doing when I do not know what I am doing.

⎯ Wernher von Braun

This thesis concerns data classification and coherence for multi-/many-core architectures based on the software DRF semantics. Nevertheless, data classification is a rather general concept that can be leveraged in a wider variety of applications, including optimizations targeting data movement and placement, distributed systems, cloud and big data, IoT, machine learning and AI, etc. Moreover, the scope-aware classification discussed in chapter 4 has the potential to be leveraged in the context of heterogeneous systems.

Being dynamic and hardware-based, the data classification scheme discussed in this thesis can be extended to capture more details about the data access patterns (similar to the optimizations discussed in chapter 5) in order to enable further optimizations, including but not limited to low-overhead detection of the Read-Only accesses, etc.

Using the cache-decay technique as a simple mechanism to detect the dead cache lines was discussed in chapter 2 (Paper II). Modulating the span of the generations using low-overhead predictors is an open research question that can be studied in more detail in order to enhance the quality of the classification. Similarly, as discussed in chapter 3 (Paper III), the private/shared classification can be modulated by the choice of the replacement policy employed by the directory, e.g. prioritizing the shared entries over the private entries, etc.

Moreover, a more fine-grained hardware/software co-design would help to better understand the design space for the classification problem. A more profound insight can be gained by trying to correlate the results to the specific parts of code in the benchmarks, which reveals more details about the behavior of the software and enables hardware solutions tailored to the software requirements. Such a study in the context of the cloud and big data workloads would therefore be enlightening.


7. Summary

Coherent shared memory is an appealing choice from the programmers’ perspective, enabling the programmers to harness the power of the shared memory architecture without explicitly dealing with issues concerning data movement and replication. Nonetheless, cache coherence imposes a bottleneck that limits the scalability of the coherent shared memory architectures.

Today’s parallel programming disciplines, such as the data-race-free paradigm, obviate the need for the strongest cache coherence schemes, where the coherence layer eagerly reacts to the data modifications. Instead, the semantics of the data-race-free software can be leveraged to ensure a coherent and consistent view of the shared memory only when required according to those semantics. The end result is an on-demand coherence scheme that allows more scalable systems by reducing the invocation rate of the coherence mechanisms, from the more common case of data modifications to the less common case of synchronizations.

In its simplest realization, inter-core coherence communication is eliminated in favor of the intra-core coherence activities, where cores are guaranteed to receive valid data directly from the shared LLC. To achieve this, each core invalidates, upon synchronizations, its own data that might become stale after the synchronization. Moreover, each core updates the shared LLC with its own modified data that might be accessed by other cores after the synchronization. Data classification serves as a useful tool that can be leveraged by such coherence schemes: each core invalidates its own data classified as shared, and updates the shared LLC with its own modified shared data. The data classified as private are not affected by the coherence actions.

In this work we studied the private/shared data classification for cache coherence based on the data-race-free semantics. We showed the trade-offs between different granularities of the classification, the adaptivity of the classification, as well as the software-based versus the hardware-based implementations. We also showed how the private/shared classification could be effectively applied to the hierarchical topologies. Moreover, we showed the advantages of such coherence schemes with respect to reduction in the network traffic and the improved performance.


8. Svensk Sammanfattning

Koherent delat minne i parallella processorer är en attraktiv egenskap hos en minnesarkitektur ur programmerarens synvinkel. Den gör det möjligt för programmerarna att enkelt använda det delade minnet utan att explicit hantera problem rörande flytt och replikering av data. Inte desto mindre innebär cache-koherens en flaskhals som begränsar skalbarheten hos minnesarkitekturer med delat minne.

Dagens parallella programmeringsdiscipliner, som det data-race-fria paradigmet, undanröjer behovet av de striktaste cachekoherenssystemen, där koherensprotokollet reagerar på alla dataändringar. Istället kan semantiken1 i den data-race-fria mjukvaran utnyttjas för att säkerställa en enhetlig och konsekvent vy av det delade minnet, i mån av behov (som definieras av semantiken). Slutresultatet är ett system för koherens som möjliggör mer skalbara system, genom att reducera hur ofta koherensmekanismerna måste aktiveras: i stället för att aktiveras vid varje data läsning eller skrivning, behöver de bara aktiveras vid mer sällan förekommande synkroniseringar.

I sin enklaste form eliminerar ett sådant koherensprotokoll kommunikation mellan processorkärnor, som i stället ersätts av koherens-aktiviteter inom kärnan. Kärnor garanteras att få giltig data direkt från den delade sista-nivå-cachen (LLC, kort för Last-Level Cache). För att uppnå detta, ogiltigförklarar varje kärna sin egen data när en synkronisering sker, eftersom datat skulle kunna ha blivit modifierat av en annan kärna. Dessutom uppdaterar varje kärna den delade LLC:n med sin egen modifierade data vid synkronisering, för att garantera att kärnor som läser datat efter synkroniseringen får de uppdaterade värdena. Dataklassificering fungerar sålunda som ett användbart verktyg som kan utnyttjas av sådana koherenssystem: varje kärna ogiltigförklarar sin egen data klassificerad som delad, och uppdaterar den delade LLC:n med egna modifierade data som är klassificerad som delad. De data som klassificeras som privata påverkas inte av koherensmekanismerna. Fördelarna med en sådan privat/delad dataklassificering för cache-koherens är tvåfaldig: förutom att eliminera overheaden för koherenstrafik för privat data, elimineras dessutom lagrings-overhead av privat data.

Effektiviteten av koherensprotokoll som använder sig av dataklassificering beror på kvaliteten av denna dataklassificering. Att klassificera mer data än nödvändigt som delad påverkar inte korrektheten, men det medför extra overhead i form av onödig datainvalidering, vilket leder till ökad nätverkstrafik och i slutändan längre exekveringstid av programmet som körs. Ett effektivt dataklassificeringssystem får därför inte klassificera privat data som delat, till exempel på grund av att en olämplig granularitet av klassificeringen har använts. Vidare bör ett effektivt dataklassificeringssystem tillfälligt kunna omklassificera delad data som privat när det är möjligt. Utan den här egenskapen, som vi kallar klassificeringsadaptivitet, förblir data delad under hela körtiden när den väl en gång är klassificerad som delad, vilket förhindrar optimeringar som tillåts av att ha mer privat data. Vidare bör en adaptiv klassificering inte införa någon overhead i sig.

1 Synkronisering medelst barriärer och lås.

I det här arbetet studerar vi privat/delad dataklassificering för cache-koherens baserat på data-race-fri semantik. Vi visar hur cachekoherens för data-race-fri programvara kan dra nytta av en dynamisk privat/delad dataklassificering med låg overhead. Vi studerar olika avvägningar mellan att implementera klassificeringen i programvara med hjälp av operativsystemet, och med dynamisk hårdvarukonfiguration, som en klassificeringskatalog. Vi studerar avvägningarna mellan olika granulariteter av klassificeringen, det vill säga klassificering med cacheradsgranularitet och sidgranularitet (Kapitel II och Papper II). Vi studerar också effekten av den adaptiva dataklassificeringen på koherensen. Våra resultat bekräftar att en minimal privat/delad klassificeringskatalog utan uttryckliga mekanismer för hantering av klassificeringsadaptiviteten möjliggör effektiva data-race-fria koherensprotokoll där avhysning av de delade katalogposterna implicit ger klassificeringsadaptivitet (Kapitel III och Papper III).

Dessutom visar vi hur den privata/delade klassificeringen kan effektivt tillämpas på de hierarkiska topologierna (Kapitel IV och Papper IV). Att bevara den privata klassificeringen i hierarkierna blir viktigare, eftersom ogiltigförklaring av det delade datat i hierarkierna medför allvarliga påföljder på grund av de större mellanliggande delade cacherna i hierarkin.

Vi introducerar och studerar också ett hybrid-dataklassificeringssystem som dynamiskt tar hänsyn till effekterna av den privata/delade dataklassificeringen på nätverkstrafiken (Kapitel V och Papper V). Nätverkstrafiken som genereras på grund av koherensskiktet används som återkoppling för att påverka resultatet av den privata/delade klassificeringen, så att den genererade nätverkstrafiken minimeras. Det resulterande klassificeringssystemet är ett hybridsystem där klassificeringsresultatet beror på både delningsmönstret och den resulterande nätverkstrafiken. Med andra ord tillhandahåller hybridsystemet en lösning för klassificeringsadaptivitet med låg overhead, där den delade datan tillfälligt kan klassificeras som privat.


Acknowledgements

I would like to thank Prof. Stefanos Kaxiras for accepting to be my advisor and supporting my research, introducing me to the topic of memory consistency and cache coherence1, and helping me to get started with writing my research papers. Moreover, I would like to thank my deputy advisor, Prof. Erik Hagersten, for his tremendous support and enlightening advice throughout my PhD research. Furthermore, I would like to thank Dr. Alberto Ros Bardisa from the University of Murcia, Spain, who assisted us with our simulation infrastructure and experimental setup during his visit to our research group as a post-doctorate researcher in the early phases of my research. I would also like to thank the director of the undergraduate studies at Uppsala University, Dr. Mats Daniels, as well as my PhD candidate colleagues Jonatan Lindén and Johan Janzén, for assisting me with preparing the “Swedish Summary” chapter of this thesis. My greatest thanks go to, among other colleagues at the Information Technology department, Michael Thuné, the head of the IT department, and Arnold Pears, the head of the Computer Systems division, who always strived towards creating a healthy and productive working environment. And last but not least, I would like to thank Uppsala University for making all this happen by providing the opportunity for all of us to gather here.

I dedicated this thesis to my parents at its beginning, and I also wish to close it by returning to my parents, to whom I owe a great debt of gratitude for their unlimited support.

1 Starting with the book by Daniel J. Sorin et al. [9].

References

1. BYTE Magazine, Vol. 10, No. 05 (May 1985), p. 169.

2. The Operational Characteristics of the Processors for the Burroughs B5000, Revision A, 5000-21005. http://bitsavers.informatik.uni-stuttgart.de/pdf/burroughs

3. Alastair J. W. Mayer. 1982. The architecture of the Burroughs B5000: 20 years later and still ahead of the times? SIGARCH Comput. Archit. News 10, 4 (June 1982), 3-10. DOI=http://dx.doi.org/10.1145/641542.641543

4. William A. Wulf and Samuel P. Harbison. 2000. Reflections in a pool of processors—an experience report on C.mmp/Hydra. In Readings in Computer Architecture, Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 561-573.

5. Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi (Eds.). 2000. Readings in Computer Architecture. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

6. Per Stenström. 1990. A Survey of Cache Coherence Schemes for Multiprocessors. Computer 23, 6 (June 1990), 12-24.

7. Andrew S. Tanenbaum and Herbert Bos. 2014. Modern Operating Systems (4th ed.). Prentice Hall Press, Upper Saddle River, NJ, USA.

8. Marshall C. Yovits and Marvin Zelkowitz. 1995. Advances in Computers, Volume 40. Academic Press, San Diego, CA, USA.

9. Daniel J. Sorin, Mark D. Hill, and David A. Wood. 2011. A Primer on Memory Consistency and Cache Coherence (1st ed.). Morgan & Claypool Publishers.

10. Erik G. Hallnor and Steven K. Reinhardt. 2000. A fully associative software-managed cache design. In Proceedings of the 27th annual international symposium on Computer architecture (ISCA '00). ACM, New York, NY, USA, 107-116. DOI=http://dx.doi.org/10.1145/339647.339660

11. Chao Li, Yi Yang, Hongwen Dai, Shengen Yan, Frank Mueller, and Huiyang Zhou. 2014. Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs. In Proceedings of the Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2014.6844487

12. M. V. Wilkes. 2000. Slave memories and dynamic storage allocation. In Readings in Computer Architecture, Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 371-372.

13. Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Efficient GPU synchronization without scopes: saying no to complex consistency models. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, 647-659. DOI: https://doi.org/10.1145/2830772.2830821

14. Sarita V. Adve and Kourosh Gharachorloo. 1996. Shared Memory Consistency Models: A Tutorial. Computer 29, 12 (December 1996), 66-76. DOI=http://dx.doi.org/10.1109/2.546611

15. Kourosh Gharachorloo. 1996. Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. Dissertation. Stanford University, Stanford, CA, USA. UMI Order No. GAX96-20480.

16. Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. 1990. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th annual international symposium on Computer Architecture (ISCA '90). ACM, New York, NY, USA, 15-26. DOI=http://dx.doi.org/10.1145/325164.325102

17. Sarita Vikram Adve. 1993. Designing Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. Dissertation. University of Wisconsin at Madison, Madison, WI, USA. UMI Order No. GAX94-07354.

18. Robert C. Steinke and Gary J. Nutt. 2004. A unified theory of shared memory consistency. J. ACM 51, 5 (September 2004), 800-849. DOI=http://dx.doi.org/10.1145/1017460.1017464

19. L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., C-28(9):690–691, 1979.

20. Sarita V. Adve, Vikram S. Adve, Mark D. Hill, and Mary K. Vernon. 1991. Comparison of hardware and software cache coherence schemes. In Proceedings of the 18th annual international symposium on Computer architecture (ISCA '91). ACM, 298-308. DOI=http://dx.doi.org/10.1145/115952.115982

21. Ioannis Schoinas, Babak Falsafi, Alvin R. Lebeck, Steve K. Reinhardt, James R. Larus, and David A. Wood. Fine-grain Access Control for Distributed Shared Memory. Submitted for publication, March 1994.

22. David Chaiken, John Kubiatowicz, and Anant Agarwal. LimitLESS Directories: A Scalable Cache Coherence Scheme. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), pages 224–234, April 1991.

23. C. K. Tang. 1976. Cache system design in the tightly coupled multiprocessor system. In Proceedings of the June 7-10, 1976, national computer conference and exposition (AFIPS '76). ACM, New York, NY, USA, 749-753. DOI=http://dx.doi.org/10.1145/1499799.1499901

24. L. M. Censier and P. Feautrier. 1978. A New Solution to Coherence Problems in Multicache Systems. IEEE Trans. Comput. 27, 12 (December 1978), 1112-1118. DOI=http://dx.doi.org/10.1109/TC.1978.1675013

25. W. C. Yen and K. S. Fu. Coherence Problem in a Multicache System. Proc. of 1982 Int. Conf. on Parallel Processing, IEEE, 1982, pp. 332-339.

26. James R. Goodman. 1983. Using cache memory to reduce processor-memory traffic. In Proceedings of the 10th annual international symposium on Computer architecture (ISCA '83). ACM, New York, NY, USA, 124-131. DOI=http://dx.doi.org/10.1145/800046.801647

27. Alvin R. Lebeck and David A. Wood. 1995. Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors. In Proceedings of the 22nd annual international symposium on Computer architecture (ISCA '95). ACM, 48-59. DOI=http://dx.doi.org/10.1145/223982.223995

28. Hoichi Cheong and Alexander V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. IEEE Computer, 23(6):39-48, June 1990.

29. R. Cytron, S. Karlovsky, and K. P. McAuliffe, "Automatic Management of Programmable Caches," Proc. 1988 Int'l Conf. Parallel Processing, CS Press, Los Alamitos, Calif., Order No. 889, Vol. II, Aug. 1988, pp. 229-238.

30. E. Darnell and K. Kennedy. 1993. Cache coherence using local knowledge. In Proceedings of the 1993 ACM/IEEE conference on Supercomputing (Supercomputing '93). ACM, 720-729. DOI=http://dx.doi.org/10.1145/169627.169821

31. S. L. Min and J. L. Baer. 1992. Design and Analysis of a Scalable Cache Coherence Scheme Based on Clocks and Timestamps. IEEE Trans. Parallel Distrib. Syst. 3, 1 (January 1992), 25-44. DOI=http://dx.doi.org/10.1109/71.113080

32. Mark D. Hill, James R. Larus, Steven K. Reinhardt, and David A. Wood. 1993. Cooperative shared memory: software and hardware for scalable multiprocessors. ACM Trans. Comput. Syst. 11, 4 (November 1993), 300-318. DOI=http://dx.doi.org/10.1145/161541.161544

33. David A. Wood, Satish Chandra, Babak Falsafi, Mark D. Hill, James R. Larus, Alvin R. Lebeck, James C. Lewis, Shubhendu S. Mukherjee, Subbarao Palacharla, and Steven K. Reinhardt. 1993. Mechanisms for cooperative shared memory. In Proceedings of the 20th annual international symposium on Computer architecture (ISCA '93). ACM, New York, NY, USA, 156-167. DOI=http://dx.doi.org/10.1145/165123.165151

34. S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 2–14, May 1990. doi:10.1109/ISCA.1990.134502

35. S. V. Adve and M. D. Hill. A Unified Formalization of Four Shared-Memory Models. IEEE Transactions on Parallel and Distributed Systems, June 1993. doi:10.1109/71.242161

36. S. V. Adve, M. D. Hill, B. P. Miller, and R. H. B. Netzer. Detecting Data Races on Weak Memory Systems. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 234–43, May 1991. doi:10.1145/115952.115976

37. L. Valiant, "A bridging model for parallel computation," Communications of the ACM, vol. 33, pp. 103–111, Aug. 1990.

38. A. Gupta, W.-D. Weber, and T. C. Mowry, "Reducing memory and traffic requirements for scalable directory-based cache coherence schemes," in Int'l Conf. on Parallel Processing (ICPP), Aug. 1990, pp. 312–321.

39. The Intel® Xeon Phi™ Coprocessor Architecture Overview. Available: https://software.intel.com/sites/default/files/Intel®_Xeon_Phi™_Coprocessor_Architecture_Overview.pdf

40. Blas A. Cuesta, Alberto Ros, María E. Gómez, Antonio Robles, and José F. Duato. 2011. Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In Proceedings of the 38th annual international symposium on Computer architecture (ISCA '11). ACM, New York, NY, USA, 93-104. DOI=http://dx.doi.org/10.1145/2000064.2000076

41. Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. 2009. Reactive NUCA: near-optimal block placement and replication in distributed caches. SIGARCH Comput. Archit. News 37, 3 (June 2009), 184-195. DOI=http://dx.doi.org/10.1145/1555815.1555779

42. Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, and Ching-Tsun Chou. 2011. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT '11). IEEE Computer Society, Washington, DC, USA, 155-166. DOI=http://dx.doi.org/10.1109/PACT.2011.21

43. Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve. 2013. DeNovoND: efficient hardware support for disciplined non-determinism. In Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems (ASPLOS '13). ACM, New York, NY, USA, 13-26. DOI=http://dx.doi.org/10.1145/2451116.2451119

44. Hyojin Sung and Sarita V. Adve. 2015. DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, New York, NY, USA, 545-559. DOI: http://dx.doi.org/10.1145/2694344.2694356

45. Hyojin Sung. 2015. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism. Ph.D. Dissertation. University of Illinois at Urbana-Champaign.

46. Alberto Ros and Stefanos Kaxiras. 2012. Complexity-effective multicore coherence. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques (PACT '12). ACM, New York, NY, USA, 241-252. DOI=http://dx.doi.org/10.1145/2370816.2370853

47. Stefanos Kaxiras and Alberto Ros. 2013. A new perspective for efficient virtual-cache coherence. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 535-546. DOI: http://dx.doi.org/10.1145/2485922.2485968

48. David A. Wood, Mark D. Hill, and Richard E. Kessler. 1991. A model for estimating trace-sample miss ratios. In Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. 79–89.

49. Mohammad Alisafaee. 2012. Spatiotemporal coherence tracking. In Proceedings of the 45th IEEE/ACM International Symposium on Microarchitecture (MICRO). 341–350.

50. Jason Zebchuk, Babak Falsafi, and Andreas Moshovos. 2013. Multi-grain coherence directories. In Proceedings of the 46th IEEE/ACM International Symposium on Microarchitecture (MICRO). 359–370.

51. An-Chow Lai, Cem Fide, and Babak Falsafi. 2001. Dead-block prediction & dead-block correlating prefetchers. In Proceedings of the 28th annual international symposium on Computer architecture (ISCA '01). ACM, New York, NY, USA, 144-154. DOI=http://dx.doi.org/10.1145/379240.379259

52. Stefanos Kaxiras, Zhigang Hu, and Margaret Martonosi. 2001. Cache decay: exploiting generational behavior to reduce cache leakage power. In Proceedings of the 28th annual international symposium on Computer architecture (ISCA '01). ACM, 240-251. DOI=http://dx.doi.org/10.1145/379240.379268

53. J. Archibald and J. L. Baer, “An economical solution to the cache coherence problem,” in 12th Int’l Symp. on Computer Architecture (ISCA), Jun. 1985, pp. 355–362.

54. A. Gupta, W.-D. Weber, and T. C. Mowry, "Reducing memory and traffic requirements for scalable directory-based cache coherence schemes," in Int'l Conf. on Parallel Processing (ICPP), Aug. 1990, pp. 312–321.

55. M. Ferdman, P. Lotfi-Kamran, K. Balet, and B. Falsafi, “Cuckoo directory: A scalable directory for many-core systems,” in 17th Int’l Symp. on High-Performance Computer Architecture (HPCA), Feb. 2011, pp. 169–180.

56. S. Demetriades and S. Cho, "Stash directory: A scalable directory for many-core coherence," in 20th Int'l Symp. on High-Performance Computer Architecture (HPCA), Feb. 2014, pp. 177–188.

57. A. Agarwal, R. Simoni, J. L. Hennessy, and M. A. Horowitz, "An evaluation of directory schemes for cache coherence," in 15th Int'l Symp. on Computer Architecture (ISCA), May 1988, pp. 280–289.

58. W.-D. Weber and A. Gupta, "Analysis of cache invalidation patterns in multiprocessors," in 3rd Int'l Conf. on Architectural Support for Programming Language and Operating Systems (ASPLOS), Apr. 1989, pp. 243–256.

59. R. H. Katz, S. J. Eggers, D. A. Wood, C. L. Perkins, and R. G. Sheldon, "Implementing a cache consistency protocol," in 12th Int'l Symp. on Computer Architecture (ISCA), Jun. 1985, pp. 276–283.

60. M. S. Papamarcos and J. H. Patel, “A low-overhead coherence solution for multiprocessors with private cache memories,” in 11th Int’l Symp. on Computer Architecture (ISCA), Jun. 1984, pp. 348–354.

61. J. A. W. Wilson, "Hierarchical cache/bus architecture for shared memory multiprocessors," in 14th Int'l Symp. on Computer Architecture (ISCA), Jun. 1987, pp. 244–252.

62. M. Butler, L. Barnes, D. D. Sarma, and B. Gelinas, “Bulldozer: An approach to multithreaded compute performance,” IEEE Micro, vol. 31, no. 2, pp. 6–15, Mar. 2011.

63. N. Eisley, L.-S. Peh, and L. Shang, “In-network cache coherence,” in 39th IEEE/ACM Int’l Symp. on Microarchitecture (MICRO), Dec. 2006, pp. 321–332.

64. D. B. Gustavson, “Scalable coherent interface and related standards,” IEEE Micro, vol. 12, no. 1, pp. 10–22, Jan. 1992.

65. S. Kaxiras and J. R. Goodman, "The GLOW cache coherence protocol extensions for widely shared data," in 10th Int'l Conf. on Supercomputing (ICS), Jan. 1996, pp. 35–43.

66. J. H. Kelm, D. R. Johnson, M. R. Johnson, N. C. Crago, W. Tuohy, A. Mahesri, S. S. Lumetta, M. I. Frank, and S. J. Patel, “Rigel: an architecture and scalable programming interface for a 1000-core accelerator,” in 36th Int’l Symp. on Computer Architecture (ISCA), Jun. 2009, pp. 140–151.

67. D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. L. Hennessy, M. A. Horowitz, and M. S. Lam, "The Stanford DASH multiprocessor," IEEE Computer, vol. 25, no. 3, pp. 63–79, Mar. 1992.

68. Y. Maa, D. Pradhan, and D. Thiebaut, "Two economical directory schemes for large-scale cache coherent multiprocessors," ACM SIGARCH Computer Architecture News, vol. 19, p. 10, Sep. 1991.

69. M. M. K. Martin, M. D. Hill, and D. J. Sorin, “Why on-chip cache coherence is here to stay,” Communications of the ACM, vol. 55, pp. 78–89, Jul. 2012.

70. M. M. Martin, M. Hill, and D. Wood, "Token coherence: Decoupling performance and correctness," in 30th Int'l Symp. on Computer Architecture (ISCA), Jun. 2003, pp. 182–193.

71. J. Nickolls and W. J. Dally, “The GPU computing era,” IEEE Micro, vol. 30, no. 2, pp. 56–69, Mar. 2010.

72. H. Nilsson and P. Stenstrom, “The scalable tree protocol - A cache coherence approach for large-scale multiprocessors,” in 4th Int’l Conference on Parallel and Distributed Computing, Dec. 1992, pp. 498–506.

73. M. Shah, J. Barreh, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Saha, D. Sheahan, L. Spracklen, and A. Wynn, "UltraSPARC T2: A highly-threaded, power-efficient, SPARC SoC," in IEEE Asian Solid-State Circuits Conference, Nov. 2007, pp. 22–25.

74. A. Ros, M. Davari, and S. Kaxiras, “Hierarchical private/shared classification: key to simple coherence for clustered hierarchies,” in 21st Int’l Symp. on High-Performance Computer Architecture (HPCA), Feb. 2015, pp. 186–197.

75. B. R. Gaster, D. Hower, and L. Howes, “HRF-relaxed: Adapting HRF to the complexities of industrial heterogeneous memory models,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 12, no. 1, pp. 7:1–7:26, 2015. [Online]. Available: http://doi.acm.org/10.1145/2701618

76. D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous-race-free memory models," in 19th Int'l Conf. on Architectural Support for Programming Language and Operating Systems (ASPLOS), Feb. 2014, pp. 427–440.

77. M. S. Orr, S. Che, A. Yilmazer, B. M. Beckmann, M. D. Hill, and D. A. Wood, "Synchronization using remote-scope promotion," in 20th Int'l Conf. on Architectural Support for Programming Language and Operating Systems (ASPLOS), Mar. 2015, pp. 73–86.

78. M. D. Sinclair, J. Alsop, and S. V. Adve, "Efficient GPU synchronization without scopes: Saying no to complex consistency models," in 48th IEEE/ACM Int'l Symp. on Microarchitecture (MICRO), Dec. 2015, pp. 647–659.

79. P. Stenstrom, M. Brorsson, and L. Sandberg, "An adaptive cache coherence protocol optimized for migratory sharing," in 20th Int'l Symp. on Computer Architecture (ISCA), May 1993, pp. 109–118.

80. Alan L. Cox and Robert J. Fowler. 1993. Adaptive cache coherency for detecting migratory shared data. In Proceedings of the 20th International Symposium on Computer Architecture (ISCA), May 1993, 98–108.

81. H. Luo, X. Xiang, and C. Ding. Characterizing active data sharing in threaded applications using shared footprint. In Proceedings of the 11th International Workshop on Dynamic Analysis, 2013.

Acta Universitatis Upsaliensis
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1521

Editor: The Dean of the Faculty of Science and Technology

A doctoral dissertation from the Faculty of Science and Technology, Uppsala University, is usually a summary of a number of papers. A few copies of the complete dissertation are kept at major Swedish research libraries, while the summary alone is distributed internationally through the series Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology. (Prior to January, 2005, the series was published under the title “Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology”.)

Distribution: publications.uu.se
urn:nbn:se:uu:diva-320595

ACTA UNIVERSITATIS UPSALIENSIS
UPPSALA 2017