
Copyright

by

Rajesh Ganesan

2017

The Thesis Committee for Rajesh Ganesan certifies that this is the approved version of the following thesis:

Reducing Cache Misses due to Frequent Context

Switching using a Cache Context Store

APPROVED BY

SUPERVISING COMMITTEE:

Jacob Abraham, Supervisor

Mark McDermott

Reducing Cache Misses due to Frequent Context

Switching using a Cache Context Store

by

Rajesh Ganesan, B.E.

Thesis

Presented to the Faculty of the Graduate School of

The University of Texas at Austin

in Partial Fulfillment

of the Requirements

for the Degree of

Master of Science in Engineering

The University of Texas at Austin

May 2017

Abstract

Reducing Cache Misses due to Frequent Context

Switching using a Cache Context Store

Rajesh Ganesan, M.S.E.

The University of Texas at Austin, 2017

Supervisor: Jacob Abraham

Computer system performance has been pushed further and further for

decades, and hence the complexity of the designs has been increasing as well.

This is true for both hardware and software. Problems at the interface of hard-

ware and software are particularly interesting, as are solutions which include

interaction between hardware and software. Cache misses due to frequent context switching are one such problem, and the solution proposed in this thesis is

to save the cache context of processes in a memory called a Cache Context

Store (CCS). The CCS is designed as a byte-addressable memory close to the

processor capable of holding multiple cache contexts. Speedup achievable by

such a system is calculated analytically using experimental data from running

SPEC CPU2006 benchmarks on a real system.

Table of Contents

Abstract

List of Figures

Chapter 1. Introduction

Chapter 2. Context Switching
  2.1 Process
  2.2 Context Switch
  2.3 Performance Penalty due to Context Switching
    2.3.1 Direct Penalty
    2.3.2 Indirect Penalty

Chapter 3. Related Work
  3.1 Reducing the Performance Penalty due to Context Switching
    3.1.1 CLOCS, 1990
    3.1.2 ETS, 2014
  3.2 Changing the Traditional Memory Hierarchy
    3.2.1 CAMEO, 2014
    3.2.2 ThyNVM, 2015
    3.2.3 N3XT, 2015

Chapter 4. Emerging Memory Technologies
  4.1 Spin Transfer Torque RAM
  4.2 Resistive RAM
  4.3 Phase Change Memory
  4.4 eDRAM
  4.5 Design Space Exploration using DESTINY

Chapter 5. Architecture Description & Evaluation
  5.1 Cache Context Store
  5.2 Design Decisions for Real Systems
  5.3 Experiment Setup & Results
  5.4 Speedup Relative to the Worst Case Performance

Chapter 6. Limitations & Future Improvements
  6.1 Limitations of the Proposed Design
  6.2 Feasibility for Deployment of this Solution in a Real System
  6.3 Possible Future Improvements

Bibliography

List of Figures

2.1 Address Space of a Process
2.2 Timer Interrupt
3.1 Traditional Memory Hierarchy Vs. CLOCS Memory Hierarchy
3.2 Impact of CS misses on different workloads
4.1 STT-RAM Storage Cell
4.2 ReRAM Storage Cell
4.3 PCRAM Storage Cell
4.4 Comparison of Emerging Memory Technologies
4.5 An Overview of the DESTINY Framework
4.6 Write Latency
4.7 Write Latency for PCRAM
4.8 Read Latency
4.9 Read Bandwidth
4.10 Write Bandwidth
4.11 Read Dynamic Energy
4.12 Write Dynamic Energy
4.13 Write Dynamic Energy for PCRAM
4.14 Leakage Power
5.1 Fields in a Physical Address
5.2 A Generic m-way Set Associative Cache
5.3 CCS Address Look-up Table
5.4 Cache Context Store Update Mechanism
5.5 Variation of Cache Misses, Context Switches and Runtime
5.6 Speedup for a 64 MB 3D eDRAM Cache Context Store

Chapter 1

Introduction

The objective of this thesis is to develop techniques for improving the

performance of a computer system by identifying non-obvious recurring events

that cause a degradation in maximum achievable performance. It proposes a

potential solution that can eliminate or at least mitigate this degradation.

Since these events are recurring, even a simple solution could have an impact

larger than would be expected.

To be more specific, the recurring event mentioned above is a Context

Switch, and the degradation is caused by CPU stall cycles resulting from cache misses that follow each context switch. The proposed solution is to eliminate

these cache misses using a Cache Context Store (CCS) that can save the cache

contexts of multiple processes. The Cache Context Store is proposed as a

tightly integrated byte-addressable memory entirely managed by the hardware

and invisible to the software. A design exploration study is performed to find

a good candidate for the type of memory technology to be used for the Cache

Context Store.

The rest of this thesis is arranged as follows. In Chapter 2, basic princi-

ples of context switching are explained. In Chapter 3, prior work on handling

cache misses due to context switching is described and a few published hybrid

memory systems are surveyed to establish familiarity with such emerging mem-

ory hierarchies. In Chapter 4, potential memory technologies for the proposed

Cache Context Store are surveyed. In Chapter 5, the proposed architecture is

described and results are analyzed. In Chapter 6, limitations of this design are

mentioned and potential future design choices to mitigate them are suggested.

Chapter 2

Context Switching

An Operating System (OS) is the software that manages a computer

system and allows programs to make judicious use of the available hardware

resources such as CPU, Memory, I/O, etc. It primarily provides virtualiza-

tion, concurrency and persistence [1]. Virtualization provides the illusion to

each running program that it has all the hardware resources available in the

system for itself. Concurrency is the ability to allow multiple applications to

seemingly run at the same time. Persistence is necessary to save volatile main

memory contents to a non-volatile memory such as a hard disk or an SSD.

This is necessary so that programs which need more memory than the available physical memory can still run and produce correct results. It is also necessary so that

the system state is consistent across machine reboots.

2.1 Process

A process is a running instance of a program. It has its own address

space. This is shown in Figure 2.1 taken from Operating Systems: Three Easy

Pieces [1]. At any time, there are multiple processes ready to run if given

the chance. All processes time share the available hardware resources and the

OS decides how to schedule them. The context of a process consists of the

program counter, general purpose registers, floating point registers, and its

process ID. This context of a process resides in the Operating System kernel

memory in a data structure called Process Control Block (PCB). When a

process A is suspended and another process B starts (or resumes) executing,

this is called a context switch. The PCB of the suspended process A is saved

from corresponding registers to memory, and the PCB of the resuming process

B is copied from memory to the corresponding registers. The OS scheduling

can be either co-operative or pre-emptive. In co-operative scheduling, a process

yields to others. In pre-emptive scheduling, there is a time slice allocated for

each process and once the time slice expires, the OS schedules another process

for the next time slice. Mac OS 9 was the last major OS to use co-operative

scheduling, but today all modern OSs use pre-emptive scheduling including

Mac OS X.

Figure 2.1: Address Space of a Process

2.2 Context Switch

As shown in Figure 2.2 taken from [1], during a timer interrupt that

causes a context switch, the following happens.

1. Process A’s CPU register state is saved to its PCB in kernel memory.

2. The scheduler chooses a Process B from the list of available processes.

3. Process B’s PCB is copied to the CPU registers from kernel memory.

Figure 2.2: Timer Interrupt
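The following is a minimal C sketch of the save/restore sequence in the three steps above; the register-file layout, PCB fields and function names are illustrative assumptions and do not correspond to any particular OS or ISA.

#include <stdint.h>
#include <string.h>

#define NUM_GPRS 16
#define NUM_FPRS 16

/* Hypothetical register file and PCB layout, for illustration only. */
struct cpu_regs {
    uint64_t pc;                  /* program counter           */
    uint64_t gprs[NUM_GPRS];      /* general purpose registers  */
    double   fprs[NUM_FPRS];      /* floating point registers   */
};

struct pcb {                      /* Process Control Block in kernel memory */
    int             pid;
    struct cpu_regs regs;
};

static struct cpu_regs cpu;       /* stand-in for the hardware register file */

/* Direct cost of a context switch: save A's register state to its PCB,
   then load B's register state from its PCB. */
static void context_switch(struct pcb *a, struct pcb *b)
{
    memcpy(&a->regs, &cpu, sizeof cpu);
    memcpy(&cpu, &b->regs, sizeof cpu);
}

int main(void)
{
    struct pcb a = { .pid = 1 }, b = { .pid = 2 };
    context_switch(&a, &b);       /* timer interrupt: switch from A to B */
    return 0;
}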

2.3 Performance Penalty due to Context Switching

There are two distinct types of performance penalty due to context

switching. They are explained below.

2.3.1 Direct Penalty

This is the sum of time spent in copying the CPU register state of

Process A to memory, and in copying the saved CPU register state of Process

B to the actual CPU registers. This penalty is deterministic and does not depend on which process is being switched in or out. In today's CPUs, the direct penalty is on the order of a few hundred nanoseconds, given that clock speeds are in the GHz range.

2.3.2 Indirect Penalty

Every time a context switch occurs, the old cache contents become

obsolete for the new process. The new process that is resuming execution

takes more cache misses than it would have if it had not been interrupted.

Since the CPU state is only a few registers, the direct penalty is relatively

small when compared to the indirect penalty, especially with increasing cache

sizes [2]. This is not deterministic and changes from process to process. So,

the indirect penalty could be hard to quantify. It depends on the following.

• The memory accesses made by a process.

• The phase a process is currently in with respect to memory accesses.

Chapter 3

Related Work

This chapter discusses related work, and is divided into two sections.

The first section deals with prior work on reducing the performance penalty

due to context switching and the second section deals with published hybrid

memory systems where the traditional memory hierarchy is augmented using

newer memory technologies.

3.1 Reducing the Performance Penalty due to Context Switching

In Section 2.3, the performance penalty due to Context Switching was

classified into direct and indirect penalty. The direct penalty is the cost of

moving the context of the registers in and out. The indirect penalty is the

cost of warming up the cache every time a context switch occurs. In the

following two subsections, the first discusses techniques to reduce the direct

penalty, and the second addresses ways of reducing the indirect penalty.

3.1.1 CLOCS, 1990

CLOCS stands for Computer for LOw Context Switch time [3]. The

main idea in this work is that the context switch time can be reduced by

reducing the size of the ‘context’. The proposed architecture reduces the size

of the ‘context’ to just one register, i.e., the program counter. It is a memory

to memory architecture, where each instruction uses memory addresses as the

operands, thereby not needing a set of general purpose registers at all.

Figure 3.1: Traditional Memory Hierarchy Vs. CLOCS Memory Hierarchy

As shown in Figure 3.1 from [3], the CLOCS architecture removes the

caches from the memory hierarchy while retaining support for virtual memory.

This effectively eliminates updates to the memory hierarchy necessary due to

a context switch (for example, updating TLB and MMU with new virtual-to-

physical address mappings, writing back dirty cache lines to memory, etc.). It

also eliminates the indirect penalty of warming up the caches by removing the

caches altogether.

3.1.2 ETS, 2014

ETS stands for Elastic Time Slicing [4]. Here, the system considered

is an aggressively multi-tasked virtualized environment. The specific problem

solved was the performance impact of cache misses due to displaced cache

state after Context Switching (called CS misses). In the proposed solution,

ETS, the OS dynamically finds a time slice duration to optimize for system

performance instead of having a fixed time slice for each process, by utilizing

the CS miss count exposed to it by the hardware.

Figure 3.2: Impact of CS misses on different workloads

Figure 3.2 is from [4], where (a) shows a workload that suffers very low performance penalties due to context switching, (b) shows a workload

that suffers relatively more cache misses than (a), and (c) shows that if the time

slice for workloads like (b) is increased, then it can sustain peak performance

for longer periods.

A balance between performance and latency is found and constantly re-

evaluated using dynamic hardware measurements of the impact of CS misses

using an Auxiliary Tag Directory in addition to the Main Tag Directory. An-

other interesting section in this paper makes an observation that cache opti-

mizations such as advanced cache replacement policies (e.g., Re-Reference In-

terval Prediction [5]) exacerbate this problem. Under the related work section

in this paper, a subsection titled Performance Impact of Context Switching

contains this note quoted below.

“There were many studies which aimed at understanding the perfor-

mance impact of CS events. They considered the influence of additional cache

misses and page faults on performance. The proposed solutions include job

speculative prefetching, CPU scheduling guided by memory scheduling, and

intelligent process scheduling. Most of the works concluded that the indirect

overhead, due to cache perturbation, associated with CS events is significant.

We attempted to address this overhead through our proposal.”

This corroborates the motivation for this thesis since it also tries to

solve this indirect overhead.

3.2 Changing the Traditional Memory Hierarchy

Chapter 4 explains the need to change the traditional memory hierarchy to accommodate emerging memory technologies, so that these newer technologies can fill gaps in that hierarchy. Here, a few

such hybrid memory systems are described to help in understanding the mem-

ory hierarchy envisioned for this thesis.

3.2.1 CAMEO, 2014

CAMEO stands for CAche-like MEmory Organization [6]. Chang-

ing the memory hierarchy to accommodate emerging memory technologies poses a fundamental question to the computer architect: should these memories be exposed to the software, or managed micro-architecturally (i.e., invisible to the software)? The focus here is on stacked

DRAM memories and how to find a place for them in the traditional memory

hierarchy. The proposed solution uses the stacked DRAM memory available in

the system efficiently at a fine granularity like a cache, while it also augments

the main memory visible and available to the software. It makes the stacked

DRAM usable as a cache in the traditional sense wherein data can be accessed

at a fine cache line granularity and at the same time it also does not hide it

from the software and makes it part of the software visible main memory. It is

possible to do this because the design can dynamically move cache lines from

the stacked DRAM to Main Memory and vice-versa. This paper also proposes

a Line Location Table and a Line Location Predictor to track the location of

all data lines.

3.2.2 ThyNVM, 2015

ThyNVM stands for Transparent Hybrid NVM [7]. This work also

addresses the problem of accommodating emerging NVM technologies in the

existing memory hierarchy. A few lines from the abstract of this paper are quoted below.

“Emerging byte-addressable nonvolatile memories (NVMs) promise per-

sistent memory, which allows processors to directly access persistent data in

main memory.”

It also addresses the concern of increased programmer effort to take

advantage of the persistent nature of emerging NVM technologies. In today’s

systems, an NVM library provided by the hardware manufacturer is needed

to utilize persistent storage directly from a program [8]. This library typically

provides a set of APIs to directly malloc persistent memory, especially targeted

at server applications such as databases. The specific problem solved by this

paper is this need of the software (or programmer) to ensure memory consis-

tency in the event of a power failure or a system crash. The proposed solution

provides software transparent crash consistency using a hardware assisted hy-

brid DRAM+NVM persistent memory design.

3.2.3 N3XT, 2015

N3XT is a Nano-Engineered Computing Systems Technology [9] that

is targeted at providing energy efficiency for abundant data workloads. It

promises a 1000-fold improvement in energy efficiency by integrating vastly

different technologies into a new form of computing system. The main ideas

proposed in this work are the following.

1. Use energy-efficient 1D carbon nanotube transistors (CNT FETs).

2. Use a hybrid memory system with 3D RRAM and STT-MRAM to derive

the benefits of both massive storage and quick access.

3. Exploit fine-grained monolithic 3D integration of compute and memory

components using ultradense nanoscale vias.

4. Take computation nearer to the memory and interleave memory and

computation for optimum latency while employing efficient nanoscale

heat removal methods.

In such a system, the software and hardware interact dynamically to

optimize for the design point; the design point could be either power or per-

formance. So, improvements in the context switch performance will improve

the energy efficiency of the system at both these design points.

Chapter 4

Emerging Memory Technologies

From an ideal memory, we expect infinite capacity and zero latency.

No single memory technology can meet both requirements. This is because,

very often, these requirements oppose each other. For example, a particular

memory technology might be able to provide high capacity but might not be

able to provide low latency (e.g., disks) or a particular memory technology

might be able to provide low latency but might not be able to provide high

capacity (e.g., on-chip caches). Hence, in most systems, we see a memory

hierarchy where, closer to the CPU we have a smaller but faster memory; as

we go further away from the CPU, the memories tend to become larger, but

also slower compared to the ones closer to the CPU. DRAM technology scaling

has slowed considerably and we see reliability and security problems due to

the extent of this scaling. One example of this problem is demonstrated in [10], which shows that by repeatedly accessing a row in a DRAM chip, we can flip

bits in adjacent rows.

For future systems, there is a need to accommodate emerging memory

technologies such as Phase Change Memory, Spin Transfer Torque RAM, Re-

sistive RAM, and eDRAM within the existing memory hierarchy. These new

memory technologies are byte-addressable and are also non-volatile, with the

exception of eDRAM. Therefore, they can either replace portions of the ex-

isting memory hierarchy or can augment it. Hybrid memory systems [11–13]

have been proposed to leverage the best of multiple technologies. In tradi-

tional charge storage based memories (e.g., DRAM, Flash) data is written

by capturing charge Q, and read by detecting the voltage V. With scaling,

the charge stored decreases from generation to generation, and the detection

circuitry limits further scaling. Conversely, in resistive memories (e.g., PCM,

STT-RAM, ReRAM), data is written by pulsing current dQ/dt and read by

detecting resistance R. In the following sections, data storage mechanisms of

these emerging memory technologies are described and the pictures [14–16]

are shown for visualization.

4.1 Spin Transfer Torque RAM

Figure 4.1: STT-RAM Storage Cell

STT-RAM uses the resistance difference between two configurations of

a Magnetic Tunnel Junction (MTJ) to store a “1” or a “0”. A Magnetic Tun-

nel Junction consists of a thin non-magnetic material (1 nm to 2 nm thick)

wedged between two ferro-magnetic materials (2 nm to 5 nm thick). One of

the layers is called the reference or fixed layer which has a fixed magnetic po-

larization and the other is called the free layer and has a magnetic polarization

that can be changed by applying a current to it. The resistance of the MTJ is

up to 600% higher when the two layers are polarized in the opposite direction

as compared to when they are polarized in the same direction [17]. This dif-

ference in resistance is further increased if the non-magnetic material between

the two ferro-magnetic materials is an electrical insulator. The conductivity

modulation is attributed to Tunnelling Magneto-Resistance (TMR), a quan-

tum mechanical phenomenon. This effect is due to spin-polarized electrons

tunnelling between the ferro-magnetic layers through the insulator. Read and

write latencies are around 10 ns and the endurance is more than 10^14 cycles.

4.2 Resistive RAM

Figure 4.2: ReRAM Storage Cell

In ReRAM, an insulating dielectric is used to form a conducting path

between two metal electrodes by applying a sufficiently high voltage between

them. Oxygen vacancies in such a conducting path lead to a low resistance

which represents a “1” and when this path is broken, the resistance increases

and hence the state represented is a “0” [18]. Read latency is 6 to 8 ns and

write latency is around 20 to 30 ns. ReRAM has a low write endurance of 10^11

cycles which is very often a limiting factor.

4.3 Phase Change Memory

Figure 4.3: PCRAM Storage Cell

Phase Change Memory material like GeSbTe (GST) can exist either

in amorphous or in crystalline form depending on the current pulse duration

applied to it. In the amorphous form it has a high resistance (10^6 to 10^7 Ω) and in the crystalline form it has a low resistance (10^3 to 10^4 Ω) [11]. A write

operation consists of injecting current to change the phase of the material. A

SET pulse is a sustained current pulse (∼150 ns) to heat the cell above Tcryst

(∼300 °C) and RESET pulse is a comparatively shorter current pulse (∼100

ns) to heat it above Tmelt (∼800 °C). A read operation consists of detecting

the phase using the difference in resistance between the two phases. Because

of this, an important difference between PCM and other memory technologies

is that PCM has two different write latencies for SET and RESET operations

while the other memory technologies have a uniform latency regardless of

whether the write data is a 1 or a 0. PCM arrays are non-volatile and can

retain data for more than 10 years at 85 °C, but have a very low endurance, which is capped at 10^9 cycles. Multi-level cells with up to 4 bits per cell have

been demonstrated as prototypes. So, the density of such memories can be

very high.

4.4 eDRAM

Similar to a conventional DRAM, embedded DRAM consists of a ca-

pacitive storage element, but it can be integrated on the same die as the

processor. Read and write latencies are around 2 ns to 3 ns, which is more than an order of magnitude lower than that of conventional DRAM [18]. Since the

stored charge leaks, a refresh is necessary periodically. The retention time is

around 4 ms for eDRAM whereas it is typically 64 ms for commodity DRAM

parts. So, eDRAM needs more frequent refreshes than DRAM. This causes

issues in scaling eDRAM to future technology nodes due to difficulty in precise

charge placement and data sensing.

                         STT-RAM    ReRAM      PCM          eDRAM
Cell Element             1T1R       1T1R       1T1R         1T1C
Cell Area (F^2)          6          4          4            38
Read Latency             10 ns      6-8 ns     12 ns        2 ns
Write Latency            10 ns      20-30 ns   100-150 ns   2 ns
Retention                >10 y      >10 y      >10 y        4 ms
Endurance (r/w cycles)   10^14      10^11      10^9         10^16

Figure 4.4: Comparison of Emerging Memory Technologies

4.5 Design Space Exploration using DESTINY

DESTINY is a 3D dEsign Space exploraTIon tool for SRAM, eDRAM

and Non-volatile memorY [19]. It can model 2D and 3D SRAM, eDRAM and

ReRAM memory structures. It can also model 2D structures of STT-RAM and

PCM memories. The user can configure it to optimize for various optimization

targets such as latency, area, leakage, refresh rate, and energy delay product.

An overview of the tool’s framework is shown in Figure 4.5 taken from

[20]. DESTINY can also be configured to model various types of regular mem-

ory array structures such as a RAM, a CAM, or a cache. Further, it can

model process technologies ranging from 22nm to 180nm. Another parameter

that is configurable in the tool is the data width.

Figure 4.5: An Overview of the DESTINY Framework

In order to identify a particular memory technology for the Cache Con-

text Store, the tool is configured to optimize for latency. In this thesis, the type

of memory structure modeled is specified as a RAM and the process technol-

21

Page 29: Copyright by Rajesh Ganesan 2017

ogy is specified as 22nm. Using a Python script, different configuration files are generated for a range of memory sizes from 1MB to 128MB, and also for each memory technology type except SRAM. Further, the data width was swept

across 64b, 128b and 256b configurations. In these simulations, changing the

data width did not change the read and write latencies considerably but had

a direct relation to the read and write bandwidths. Therefore, the reported

results are for the 256b bus width configuration. Since PCRAM has different

write latencies depending on whether the written bit is a 0 (RESET) or a 1

(SET), write latency and write dynamic energy plots are given separately for

PCRAM.

From Figures 4.6 to 4.14, it can be seen that 3D eDRAM is the best

candidate when all parameters are considered for the Cache Context Store de-

sign. It has the best balance between read and write latencies for the memory

sizes considered; that means the cache context can be stored and retrieved in

about the same time. From Figures 4.6 and 4.8, the read and write latency

for a 64MB eDRAM is around 2 ns to 4 ns. It also has a read and write

dynamic energy (∼1 nJ) that is comparatively lower than that of the other memory

technologies. The leakage power is higher in eDRAM (∼ 3W) as compared

to other technologies due to the need to refresh the stored data periodically.

However, since the Cache Context Store is not going to be always needed as

in the case of main memory, it can be sent to a low power mode when not in

use. Similarly, the endurance of eDRAM is 10^16 cycles, which is many orders

of magnitude better than other memory technologies as shown in Figure 4.4.
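As a rough illustration only (assuming the ∼1 nJ figure is the dynamic energy per 256-bit access, and taking the 6 MB cache context used in Chapter 5), a single full context save would require about 6 MB / 32 B ≈ 197,000 write accesses, i.e., roughly 197,000 * 1 nJ ≈ 0.2 mJ of dynamic energy, which supports the claim that the dynamic energy cost of using the Cache Context Store is modest.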

Figure 4.6: Write Latency (ns) of 3D eDRAM, 2D eDRAM, STTRAM, 2D ReRAM and 3D ReRAM for capacities from 1 MB to 128 MB

Figure 4.7: Write Latency for PCRAM (RESET and SET latencies, ns) for capacities from 1 MB to 128 MB

Figure 4.8: Read Latency (ns) of 3D eDRAM, 2D eDRAM, STTRAM, 2D ReRAM, 3D ReRAM and PCRAM for capacities from 1 MB to 128 MB

Figure 4.9: Read Bandwidth (GB/s) of 3D eDRAM, 2D eDRAM, STTRAM, 2D ReRAM, 3D ReRAM and PCRAM for capacities from 1 MB to 128 MB

Figure 4.10: Write Bandwidth (GB/s) of 3D eDRAM, 2D eDRAM, STTRAM, 2D ReRAM, 3D ReRAM and PCRAM for capacities from 1 MB to 128 MB

Figure 4.11: Read Dynamic Energy (pJ) of 3D eDRAM, 2D eDRAM, STTRAM, 2D ReRAM, 3D ReRAM and PCRAM for capacities from 1 MB to 128 MB

Figure 4.12: Write Dynamic Energy (pJ) of 3D eDRAM, 2D eDRAM, STTRAM, 2D ReRAM and 3D ReRAM for capacities from 1 MB to 128 MB

Figure 4.13: Write Dynamic Energy for PCRAM (RESET and SET dynamic energy, nJ) for capacities from 1 MB to 128 MB

Figure 4.14: Leakage Power (mW) of 3D eDRAM, 2D eDRAM, STTRAM, 2D ReRAM, 3D ReRAM and PCRAM for capacities from 1 MB to 128 MB

Chapter 5

Architecture Description & Evaluation

The main idea of this thesis is to solve the cache cold start problem

due to context switching using a cache context store made out of an emerging

memory technology. This chapter discusses the micro-architectural changes

needed for such a system and shows the speedup that can be achieved for

SPEC CPU2006 benchmarks. In order to simplify the problem and come up

with a primitive solution, the system considered here has the following features.

• A single core CPU without simultaneous multithreading.

• A single level cache that is divided between instruction and data.

• A write-through policy so that there are no dirty lines in the cache.

Figure 5.4 shows the control flow for saving the cache context into the cache

context store and retrieving it. Once a context switch is identified by the

hardware, the PID of the old process is saved to an address look-up table

along with the address of the memory location in which the cache context is

going to be stored. Then, the cache context is saved to the cache context store

at that address. If the PID of the new process is present in the address look-up

table then the stored cache context is copied back from the cache context store

using the corresponding address field. The address look-up table is shown in

Figure 5.3. Some of the problems associated with more advanced systems and

the design decisions that need to be made for such real systems are discussed

in Section 5.2.

5.1 Cache Context Store

For a byte-addressable memory, a generic m-way set associative cache, with p bits of physical address divided into t bits of tag, i bits of index and o bits of offset, has the following parameters.

• A cache line size of 2^o bytes.

• 2^i sets.

• m cache lines in a set.

• m * 2^i cache lines in total.

Physical Address (bits p-1 down to 0):   | Tag: t bits | Index: i bits | Offset: o bits |

Figure 5.1: Fields in a Physical Address

A cache is typically organized into tag and data stores as shown below in Figure 5.2. If the tag store has u additional bits per line for bookkeeping (replacement policy, coherence state, etc.), then

• the data store size is m * 2^i * 2^o bytes,

• and the tag store size is m * 2^i * (t + u) bits.

The total cache context is thus m * 2^i * (2^o + (t + u)/8) bytes. This also includes the cache lines that are invalid, which need not be saved. Assuming there are n cache lines that are invalid at the time of the context switch, the cache context that actually needs to be saved is (m * 2^i - n) * (2^o + (t + u)/8) bytes.
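As a concrete illustration, the short C program below evaluates these expressions for a hypothetical cache configuration (a 64 KB, 2-way cache with 64-byte lines, a 48-bit physical address and 2 bookkeeping bits per line); the numbers are examples only and are not the configuration used in the experiments.

#include <stdio.h>

int main(void)
{
    const int m = 2;          /* ways per set                          */
    const int i = 9;          /* index bits  -> 2^9 = 512 sets          */
    const int o = 6;          /* offset bits -> 64-byte cache lines     */
    const int p = 48;         /* physical address bits                  */
    const int u = 2;          /* bookkeeping bits per tag entry         */
    const int t = p - i - o;  /* tag bits                               */

    long   lines      = (long)m << i;            /* m * 2^i cache lines */
    long   data_bytes = lines << o;               /* m * 2^i * 2^o bytes */
    double tag_bytes  = lines * (t + u) / 8.0;    /* m * 2^i * (t+u)/8   */

    printf("lines = %ld, data store = %ld B, tag store = %.0f B\n",
           lines, data_bytes, tag_bytes);
    printf("full cache context = %.0f B\n", data_bytes + tag_bytes);
    return 0;
}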

(The tag store and the data store are each organized as 2^i sets of m ways.)

Figure 5.2: A Generic m-way Set Associative Cache

Once a context switch is detected from an ISA-specific register update (e.g., the CR control registers in x86), if the cache tag and data stores can be

accessed as a byte addressable memory, then a cache context save can be done

by copying the contents to a predetermined location in the Cache Context

Store. We can compare this data movement overhead to the performance

penalty due to context switch caused cache misses to understand if there is a

potential benefit in storing the cache context.

The Cache Context Store mentioned above is envisioned to be a byte

addressable memory that can hold several tens of cache contexts. It should

have a reasonable read and write latency. In order to save and retrieve the

cache contexts from this memory, a look up table is necessary to store the

process ID and the address corresponding to the saved cache context of every

process in the Cache Context Store. The hardware implementation for this

look up table is similar to that of a Translation Lookaside Buffer (TLB), where

the data in the PID column needs to be compared against the current PID using a content-addressable memory to provide a hit or miss output.

(A table with one row per stored context, with columns PID and Address.)

Figure 5.3: CCS Address Look-up Table

(Flow: a cache context save is triggered by the control register update; the old PID is saved to the CCS address look-up table; the cache context is saved to the CCS and the LUT entry is updated; if the new PID is found in the CCS address look-up table, the stored cache context is copied from the CCS back into the cache; in either case execution then continues.)

Figure 5.4: Cache Context Store Update Mechanism
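A software sketch of this update mechanism is given below. It follows the flow of Figure 5.4, but the structure sizes, names and the slot-allocation handling are illustrative assumptions; in the proposed design this logic is implemented entirely in hardware, with the PID column of the look-up table realized as a content-addressable memory.

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define CCS_ENTRIES   10          /* contexts the CCS can hold (assumption)    */
#define CONTEXT_BYTES (6u << 20)  /* one cache context, 6 MB as in Section 5.4 */

struct ccs_lut_entry {
    bool     valid;
    uint32_t pid;                 /* process ID                                 */
    uint32_t ccs_addr;            /* slot in the CCS holding this PID's context */
};

static struct ccs_lut_entry lut[CCS_ENTRIES];
static uint8_t ccs[CCS_ENTRIES][CONTEXT_BYTES];   /* the Cache Context Store   */
static uint8_t cache_image[CONTEXT_BYTES];        /* stand-in for tag + data stores */

static int lut_lookup(uint32_t pid)               /* CAM-style match on PID    */
{
    for (int e = 0; e < CCS_ENTRIES; e++)
        if (lut[e].valid && lut[e].pid == pid)
            return e;
    return -1;
}

/* Triggered by the control-register update that signals a context switch. */
static void ccs_on_context_switch(uint32_t old_pid, uint32_t new_pid)
{
    int slot = lut_lookup(old_pid);
    if (slot < 0) {                               /* allocate a free slot (no eviction policy shown) */
        for (int e = 0; e < CCS_ENTRIES; e++)
            if (!lut[e].valid) { slot = e; break; }
    }
    if (slot >= 0) {                              /* save the old PID's cache context */
        lut[slot] = (struct ccs_lut_entry){ true, old_pid, (uint32_t)slot };
        memcpy(ccs[slot], cache_image, CONTEXT_BYTES);
    }
    int hit = lut_lookup(new_pid);                /* restore the new PID's context if present */
    if (hit >= 0)
        memcpy(cache_image, ccs[hit], CONTEXT_BYTES);
    /* otherwise execution simply continues with a cold cache */
}

int main(void)
{
    ccs_on_context_switch(100, 200);   /* process 100 switched out, 200 in        */
    ccs_on_context_switch(200, 100);   /* later, 100 resumes: its context restored */
    return 0;
}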

5.2 Design Decisions for Real Systems

For the experiment and results in the following sections, we consider

a very simple system in order to keep the complexity tractable. For a single

core CPU with only L1 instruction and data caches, and a write through

policy, there are not many tradeoffs to make in the implementation. However,

in a real system with a multi-core CPU, each core might have one or two levels of private cache, and all the cores might typically share a larger last

level cache. There has been research done on how to improve performance by

partitioning shared caches based on utility [21]. Further, if these caches utilize

a writeback policy instead of a write through policy, a number of questions

need to be answered to make the Cache Context Store architecture discussion

more complete for such real systems.

1. Is it enough to back up the private caches only? Or is it also necessary

to back up the cache lines in the shared last level cache?

2. Is it necessary to evict the dirty cachelines from the cache and store only

clean cache lines in the Cache Context Store? Or, can the saved cache

context contain dirty lines too?

3. In a shared memory multi processor, there could be communication be-

tween processes by referring to the same physical page from two different

virtual pages. In such a system, there could be accesses to stored cache

lines of a particular process from another process. The memory model

should define if shared pages are cacheable and if they are, the cache

controller and the cache context store need to make sure that the shared

cache lines are up to date.

These are some examples of the complexity involved in implementing this

scheme in a real system.

5.3 Experiment Setup & Results

In order to quantitatively evaluate the performance impact of context

switches, an experiment to determine the increase in cache misses due to con-

text switching on the SPEC CPU2006 benchmark suite was performed. The experiment consisted of running a subset of the SPEC CPU2006 benchmarks [22] on an AMD Phenom(tm) II X6 1055T processor and collecting the number of cache misses, context switches and runtime using the on-chip performance monitoring counters via the perf tool available in Linux. The machine has a 6-core processor, with each core having 64KB L1 instruction and data caches and a 512KB L2 cache. All cores share a 6144KB L3 cache and the processor does

not support simultaneous multi-threading.

Since the processor in this system has 6 cores, care was taken to en-

sure that the benchmark and the co-executing processes run on the same core

throughout the experiment, thus making sure that an increase in the num-

ber of instances of the co-executing processes increases the number of context

switches proportionally, since the processes are time-sharing a single core. A

Linux tool called taskset was used to ensure this. This processor has a 6MB

34

Page 42: Copyright by Rajesh Ganesan 2017

shared last level cache and since only one process was allowed to run at any

time, that process had the cache for itself at all times. Care was also taken

to run only the benchmark and the co-executing processes on the system; this included logging out of the desktop UI and using only a lightweight ssh connection so that the desktop UI did not interfere with the measurements.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int *array;
    int i, j;

    /* Repeatedly walk an array of 2,000,000 ints (about 8 MB with 4-byte ints),
       larger than the 6 MB shared L3 cache, so that the benchmark's cached
       data is displaced on every time slice. */
    array = (int *) malloc(2000000 * sizeof(int));
    for (i = 0; i < 2000000; i++)
        array[i] = i;
    while (1) {
        for (i = 0; i < 2000000; i = i + 16)
            for (j = 1; j < 16; j++)
                array[i] = array[i] + array[i + j];
    }
    return 0;
}

Listing 5.1: Code to Flush the Cache

Each benchmark was run along with 0 to 9 instances of a co-executing

process that flushes useful data in the cache left by the benchmark after each

time slice. The co-executing processes ensure that the setup simulates a sce-

nario where an application is running in an aggressive multitasking environment, which is when cache misses due to context switches noticeably impact application performance. The code in Listing 5.1 was used to flush the

cache to simulate the above scenario. Running an increasing number of in-

stances of this process ensured that the number of cache misses due to context

switching increases proportionally as expected.

In the following plots, we can see the results of this experiment. The

cache misses are expressed in millions, the number of context switches is ex-

pressed in thousands and the runtime is expressed in seconds. We can see that

for some benchmarks – bzip2, astar, gcc, hmmer – the cache misses, context switches and runtime increase as the number of co-executing processes is increased. For a few benchmarks – lbm, mcf, sphinx3, xalancbmk – the cache misses and context switches do not follow the expected trend, although

the performance impact is still very clear from the runtime plot. xalancbmk

had a runtime of 350 seconds while having the CPU for itself. This increased

to 420 seconds when there were 9 instances of the co-executing process run-

ning alongside the benchmark. This is almost a 20% increase in runtime directly

related to the effect of cache misses due to context switching.

As the number of co-executing processes increased, the perceived load

in the system increased, and hence the time slice allocated for each process

was shortened by the Linux scheduler. For most of the benchmarks, the average time slice allocated was 11.82ms when the benchmark was running on its own. This reduced

to 8ms, 6ms, 4ms, 3ms, 2ms or 1ms depending on the load in the system.

Such behaviour is very representative of most heavily loaded aggressive multi-

tasking systems, where the time slice allocated to a process changes dynam-

ically depending on the load. As the time slice reduces, the number of time

slices needed by a benchmark increases. This also increases the performance

penalty due to a cold cache at the beginning of each time slice.

Figure 5.5: Variation of Cache Misses (millions), Context Switches (thousands) and Runtime (seconds) with the number of co-executing processes, for (a) bzip2, (b) perlbench, (c) astar, (d) gcc, (e) calculix, (f) dealII, (g) hmmer, (h) lbm, (i) libquantum, (j) mcf, (k) povray, (l) sphinx3 and (m) xalancbmk

5.4 Speedup Relative to the Worst Case Performance

Using the write bandwidth for 3D eDRAM and the number of context

switches for each of the benchmarks, the time spent in saving the cache con-

text can be analytically calculated. The write bandwidth for the 3D eDRAM

depends on the data width and the memory capacity. A 256-bit data width and 64MB memory capacity are used so that cache contexts for at least 10 dif-

ferent processes can be saved and retrieved from the Cache Context Store. To

make a fair comparison, the cache context size is assumed to be 6 MB which

was the size of the cache for the experiment in the previous section.

Optimizations such as not saving invalid and outdated cache lines are

not taken into account, so this is a very pessimistic analysis. Let t_save be the time taken to save the cache context once, and t_overhead be the total time spent doing this over the entire runtime of a benchmark. Then,

t_save = Cache Size / Write Bandwidth

t_overhead = t_save * No. of Context Switches
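For illustration, the small C program below evaluates these two expressions for one hypothetical data point; the 100 GB/s write bandwidth and the context switch count are placeholder values, not measurements from this thesis.

#include <stdio.h>

int main(void)
{
    const double cache_size_bytes = 6.0 * 1024 * 1024;  /* 6 MB cache context                       */
    const double write_bw_bytes   = 100.0e9;            /* assumed CCS write bandwidth: 100 GB/s    */
    const long   context_switches = 100000;             /* assumed switch count for one benchmark   */

    double t_save     = cache_size_bytes / write_bw_bytes;   /* seconds per save */
    double t_overhead = t_save * context_switches;           /* total seconds    */

    printf("t_save     = %.1f us per context switch\n", t_save * 1e6);
    printf("t_overhead = %.2f s over the whole run\n", t_overhead);
    return 0;
}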

The total saving overhead for an entire program depends on three factors.

• Size of the Context to be saved on every context switch.

• Bandwidth of the Cache Context Store.

• Number of time slices necessary to run to completion.

Of the above, the number of time slices a program needs to run to

completion corresponds to the number of context switches. For Linux, this de-

pends on the load of the system since the time slice can vary from 1ms to 10ms

or more. So, if we use the average time slice from when the benchmark was

running on its own, i.e., closer to the maximum available time slice value, that

would result in a very optimistic speedup estimation. If we use the minimum

time slice a benchmark can get from the OS, then it would result in a very

pessimistic speedup estimation. A more realistic speedup estimation can be

obtained by using the median time slice value from running 0 to 9 co-executing

processes.

Figure 5.6 shows these speedups due to saving and reusing the cache

context when compared to the worst case runtime without saving the cache

context. The achieved speedup ranges from -1.3% to 19.1%, with an average of 7.4%, for the optimistic estimation. It ranges from -5.4% to 12.0%, with an average of 3.3%, for the realistic estimation, and from -5.9% to 11.1%, with an average of 0.9%, for the pessimistic estimation. A consistent improvement is

observed for many benchmarks across all estimations. All benchmarks with

negative speedup can be stopped from using the Cache Context Store using

dynamic performance measurements similar to Elastic Time Slicing in [4].

Figure 5.6: Speedup for a 64 MB 3D eDRAM Cache Context Store for each benchmark (with GMEAN, AMEAN and HMEAN) under the (a) Optimistic, (b) Realistic and (c) Pessimistic estimations

Chapter 6

Limitations & Future Improvements

This chapter addresses limitations of the proposed design, the feasibility

of deploying this solution in a real system, and possible future improvements

based on learning from this experiment.

6.1 Limitations of the Proposed Design

All programs have unique memory access patterns. Even a single pro-

gram can go through different phases of execution; in a particular phase the memory access pattern might look different from that in a previous phase. There-

fore, solutions such as Elastic Time Slicing [4] evaluate the impact of context

switches on a running program dynamically and hence are better suited for implementation in today's systems. On the other hand, such solutions require

the Operating System to be modified to make use of this information exposed

to the software from the hardware.

The proposed design has the ability to be a pure microarchitectural

solution without the need for software to interfere in normal use cases. Having

said that, the main drawback of the proposed design is that it spends time

in saving the cache context to the Cache Context Store every time a context

switch occurs. This means that the design is unable to differentiate between

various programs and program phases and adapt its behaviour accordingly.

6.2 Feasibility for Deployment of this Solution in a Real System

Section 5.2 briefly referred to some potential challenges in extending

this solution to a practical design. There have been a number of simplifying

assumptions made initially to reduce the problem to a tractable level. But the

idea that the cache context is valuable runtime information known to the hardware remains useful, provided it can be saved and retrieved swiftly, and it extends to any multitasking system with a cache. Therefore, even

though the proposed solution is not feasible today, a more practical solution

is very much possible in the near future.

6.3 Possible Future Improvements

As mentioned in Section 6.1, this design spends time in saving the cache

context by moving the tag and data store contents to the byte-addressable

Cache Context Store. Although this overhead is theoretically lower than the

penalty due to cache misses caused by context switching, this is still an over-

head every time a context switch occurs. To eliminate this time spent in

saving the cache context to the Cache Context Store completely, the obvious

next step is to snoop the memory transactions and build an independent copy

of the cache context in the Cache Context Store. This would mean that there

47

Page 55: Copyright by Rajesh Ganesan 2017

is no time spent in copying the tag and data store contents every time a context switch occurs; instead, the cache context is incrementally built within the

Cache Context Store at runtime. The tradeoff then would be between power

and performance requirements of the system.
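A rough sketch of what such a snooping mechanism might look like is shown below; the function, its arguments and the flat layout of the stored context are illustrative assumptions only, since this mechanism is not defined in detail here.

#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

/* Conceptually invoked by hardware on every cache line fill: mirror the
   filled line into the running process's copy in the Cache Context Store,
   so the stored context stays up to date without a bulk copy at each
   context switch. */
static void ccs_snoop_fill(uint8_t *ccs_context, uint64_t set, uint64_t way,
                           uint64_t ways_per_set, const uint8_t *line)
{
    uint64_t offset = (set * ways_per_set + way) * LINE_BYTES;
    memcpy(ccs_context + offset, line, LINE_BYTES);
}

int main(void)
{
    static uint8_t ccs_context[2 * 2 * LINE_BYTES]; /* toy context: 2 sets x 2 ways */
    uint8_t line[LINE_BYTES] = { 0 };
    ccs_snoop_fill(ccs_context, 1, 0, 2, line);     /* mirror a fill of set 1, way 0 */
    return 0;
}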

Bibliography

[1] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau, Operating Systems:

Three Easy Pieces, 0th ed. Arpaci-Dusseau Books, May 2015.

[2] D. Daly and H. W. Cain, “Cache Restoration for Highly Partitioned Vir-

tualized Systems,” in High Performance Computer Architecture (HPCA),

2012 IEEE 18th International Symposium on. IEEE, 2012, pp. 1–10.

[3] M. C. Davis, “A Computer for Low Context-Switch Time,” Tech. Rep.

90-012, 1990. [Online]. Available: http://www.cs.unc.edu/techreports/

90-012.pdf

[4] N. Jammula, M. Qureshi, A. Gavrilovska, and J. Kim, “Balancing Context

Switch Penalty and Response Time with Elastic Time Slicing,” in High

Performance Computing (HiPC), 2014 21st International Conference on.

IEEE, 2014, pp. 1–10.

[5] A. Jaleel, K. B. Theobald, S. C. Steely Jr, and J. Emer, “High Perfor-

mance Cache Replacement Using Re-Reference Interval Prediction (RRIP),”

in ACM SIGARCH Computer Architecture News, vol. 38, no. 3. ACM,

2010, pp. 60–71.

[6] C. Chou, A. Jaleel, and M. K. Qureshi, “CAMEO: A Two-Level Memory

Organization with Capacity of Main Memory and Flexibility of Hardware-

Managed Cache,” in Proceedings of the 47th Annual IEEE/ACM Interna-

tional Symposium on Microarchitecture. IEEE Computer Society, 2014,

pp. 1–12.

[7] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu, “ThyNVM:

Enabling Software-Transparent Crash Consistency in Persistent Memory

Systems,” in Proceedings of the 48th International Symposium on Mi-

croarchitecture. ACM, 2015, pp. 672–685.

[8] Persistent Memory Programming, NVM Library. [Online]. Available:

http://pmem.io/

[9] M. M. S. Aly, M. Gao, G. Hills, C.-S. Lee, G. Pitner, M. M. Shu-

laker, T. F. Wu, M. Asheghi, J. Bokor, F. Franchetti et al., “Energy-

Efficient Abundant-Data Computing: The N3XT 1,000x,” Computer,

vol. 48, no. 12, pp. 24–33, 2015.

[10] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai,

and O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An

Experimental Study of DRAM Disturbance Errors,” in ACM SIGARCH

Computer Architecture News, vol. 42, no. 3. IEEE Press, 2014, pp. 361–

372.

[11] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting Phase Change

Memory as a Scalable DRAM Alternative,” in ACM SIGARCH Computer

Architecture News, vol. 37, no. 3. ACM, 2009, pp. 2–13.

[12] E. Kultursay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Eval-

uating STT-RAM as an Energy-Efficient Main Memory Alternative,” in

Performance Analysis of Systems and Software (ISPASS), 2013 IEEE

International Symposium on. IEEE, 2013, pp. 256–267.

[13] J. Meza, Y. Luo, S. Khan, J. Zhao, Y. Xie, and O. Mutlu, “A Case

for Efficient Hardware/Software Cooperative Management of Storage and

Memory,” 2013.

[14] Schematic Diagram of the High and Low Resistance States in a Spin

Valve. [Online]. Available: https://commons.wikimedia.org/wiki/File:

Spin valve schematic.svg

[15] Cross Section of two PRAM Memory Cells. [Online]. Available:

https://commons.wikimedia.org/wiki/File:PRAM cell structure.svg

[16] Structure of an RRAM Memory Cell. [Online]. Available:

http://spectrum.ieee.org/image/MTY3ODUwMw

[17] M. K. Qureshi, S. Gurumurthi, and B. Rajendran, “Phase Change Mem-

ory: From Devices to Systems,” Synthesis Lectures on Computer Archi-

tecture, vol. 6, no. 4, pp. 1–134, 2011.

[18] S. Mittal, J. S. Vetter, and D. Li, “A Survey of Architectural Approaches

for Managing Embedded DRAM and Non-volatile On-chip Caches,” IEEE

Transactions on Parallel and Distributed Systems, vol. 26, no. 6, pp.

1524–1537, 2015.

[19] M. Poremba, S. Mittal, D. Li, J. S. Vetter, and Y. Xie, “DESTINY: A Tool

for Modeling Emerging 3D NVM and eDRAM caches,” in Proceedings of

the 2015 Design, Automation & Test in Europe Conference & Exhibition.

EDA Consortium, 2015, pp. 1543–1546.

[20] S. Mittal, M. Poremba, J. Vetter, and Y. Xie, “Exploring Design Space

of 3D NVM and eDRAM Caches Using DESTINY Tool,” Oak Ridge Na-

tional Laboratory, USA, Tech. Rep. ORNL/TM-2014/636, 2014.

[21] M. K. Qureshi and Y. N. Patt, “Utility-Based Cache Partitioning: A Low-

Overhead, High-Performance, Runtime Mechanism to Partition Shared

Caches,” in Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM

International Symposium on. IEEE, 2006, pp. 423–432.

[22] SPEC CPU2006. [Online]. Available: https://www.spec.org/cpu2006/

[23] X. Dong, C. Xu, N. Jouppi, and Y. Xie, “NVSim: A Circuit-Level Per-

formance, Energy, and Area Model for Emerging Nonvolatile Memory,”

in Emerging Memory Technologies. Springer, 2014, pp. 15–50.

[24] M. M. Shulaker, G. Hills, H.-S. P. Wong, and S. Mitra, “Transforming

Nanodevices to Next Generation Nanosystems,” in Embedded Computer

Systems: Architectures, Modeling and Simulation (SAMOS), 2016 Inter-

national Conference on. IEEE, 2016, pp. 288–292.

[25] M. M. Shulaker, T. F. Wu, A. Pal, L. Zhao, Y. Nishi, K. Saraswat, H.-S. P.

Wong, and S. Mitra, “Monolithic 3D Integration of Logic and Memory:

Carbon Nanotube FETs, Resistive RAM, and Silicon FETs,” in Electron

Devices Meeting (IEDM), 2014 IEEE International. IEEE, 2014, pp.

27–4.

[26] S. Mittal, J. S. Vetter, and D. Li, “Improving Energy Efficiency of Em-

bedded DRAM Caches for High-end Computing Systems,” in Proceedings

of the 23rd international symposium on High-performance parallel and

distributed computing. ACM, 2014, pp. 99–110.

[27] Y. Patt and S. Patel, Introduction to Computing Systems. McGraw-Hill,

2003.

[28] A. Agarwal, J. Hennessy, and M. Horowitz, “Cache Performance of Op-

erating System and Multiprogramming Workloads,” ACM Transactions

on Computer Systems (TOCS), vol. 6, no. 4, pp. 393–431, 1988.
