    Software Controlled Memories for Scalable Many-Core Architectures

    Luis Angel D. Bathen

Center for Embedded Computer Systems, University of California, Irvine

    [email protected]

    Nikil D. Dutt

Center for Embedded Computer Systems, University of California, Irvine

    [email protected]

Abstract: Technology scaling, along with the ever-evolving demand for media-rich software stacks, has motivated the need for many-core platforms. With the increase in compute power and its inherent demand for high memory bandwidth comes the need for vast amounts of on-chip memory space. Thus, designers must carefully provision the memory real-estate to meet their applications' needs. It has been shown in the embedded systems domain that both software-controlled memories (e.g., scratchpad memories) and hardware-controlled memories (e.g., caches) have their pros and cons: some application domains such as multimedia fit very well in the software-controlled memory model, while other domains such as databases work well with caches. As a result, efficient memory management is extremely critical, as it has a great impact on the system's power consumption and throughput. Traditional memory hierarchies primarily consist of SRAM-based on-chip caches; however, with the emergence of non-volatile memories (NVMs) and mixed-criticality systems, on-chip memories will be heterogeneous, not only in type (cache vs. scratchpad) but also in technology (e.g., SRAM vs. NVM). This paper surveys the state of the art in memory subsystems for many-core platforms and presents strategies for efficiently managing software-controlled memories in the many-core domain, while addressing the various challenges designers face in deploying such memory subsystems (e.g., sharing the memory resources, accounting for variations in the subsystem, etc.).

Keywords: many-core; multi-core; multiprocessors; memory management; scratchpads; virtualization; security; reliability

    I. INTRODUCTION

The many-core revolution is driven by two tightly coupled factors: user demand for media-rich applications (e.g., streaming content from the Cloud), resulting in complex software stacks, and the need for new ways to improve system performance to cope with those increasingly complex software stacks. Gone are the days when designers would simply scale down processes, add more transistors, and bump up the frequencies of the devices. These limitations were observed in the uni-processor domain, where technology barriers such as the Power Wall and the ILP (instruction-level parallelism) Wall [2], [78] made it nearly impossible to keep up with the computational demands of new software systems, thus motivating the need for multi- and many-core platforms [75], [2]. With the increase in compute resources,

designers must deploy energy-conscious and high-performance memory subsystems to cope with the bandwidth demands of the

    software stack (and processors). Efficient memory management is

    a challenge since not only does the real-estate greatly increase,

    but the memory hierarchy is gradually becoming more and more

heterogeneous due to variations in physical characteristics

    (e.g., distance from processors, read vs. write latencies and energy

    differences, endurance, reliability, etc.). In this paper we present

    the state-of-the-art in memory subsystems for many-core platforms,

    motivate the need for memory-conscious software stacks that take

    full advantage of the different characteristics in the memory sub-

    system, and present the concept of virtual address spaces colored

    by different characteristics (e.g., low-power, high-performance,

    reliable, secure, etc.), allowing them to be exploited efficiently.

II. MULTI-CORE ERA

    Multiprocessors have been extensively studied in the realms of

    parallel and distributed computing [22] and embedded systems

    [88]. However, general purpose commercial multi-core platforms

were not introduced until Intel released the Core Duo processor [30] in response to the thermal, power, and performance issues that

    arise while operating at high frequencies (e.g., beyond 3.0GHz)

    [2]. Higher operating frequencies led to higher power dissipation,

    and higher power dissipation meant high temperatures over time,

    rendering standard cooling solutions inadequate. This issue is

    exacerbated in mobile devices, which have much tighter energy

    budgets than their desktop and server counterparts.

    Commercial Multi-Processor System-on-Chips (MPSoCs) have

    been around a lot longer than commercial multi-core platforms

    [99]. Wolf et al. [99] classify MPSoCs as a distinct branch of mul-

    tiprocessors given that they tend to be heterogeneous or specialized

    rather than the more general purpose multi-cores introduced by

    Olukotun et al. [75]. In this context, MPSoCs span a wide variety

    of classes targeted for specific application domains, from wireless

    base stations [1], packet processing [37], multimedia [36], [74], to

    mobile platforms [89].

    General purpose platforms have been expected to run generic

    workloads (e.g., office productivity tools such as word processing,

    browsers, multimedia, and gaming), while MPSoCs have been

expected to run customized software stacks for each of the different application domains (e.g., cell phones, automobiles, etc.). As the em-

    bedded software stack continues to evolve, the line between mobile

    and desktop software stacks and platforms blurs significantly (e.g.,

    iPad, Android Phones, etc.), despite the fact that both architecture

    domains run on very different resource and energy constraints.

III. FROM MULTI-CORE TO MANY-CORE

Figure 1 illustrates the growth in complexity of both the general-purpose mobile software stack and the hardware. Figure 1 (a) shows a traditional software stack and computing platform consisting

    of a set of applications, an OS (proprietary), a CPU with two-

    level memory hierarchy, off-chip memory, and connected through

    a bus. Figure 1 (b) shows a multi-tasking Chip-Multiprocessor

    (CMP) [36], [45] software stack running on a number of complex

    CPUs, with shared memory, and connected through an on-chip bus.

    Figure 1 (c) shows a shared-memory CMP running a much more

    complex software stack (e.g., with support for multi-level security

    [46]) capable of running a light-weight (or full) virtualization layer,

    where one stack handles proprietary services such as voice and



    nity. ITRS predicts that over the next decade performance variabil-

    ity will increase from 48% to 66% and total power consumption

    variability will increase by up to 100% [41], [83], [96]. There are

    many factors that influence the variation within and across devices

    (e.g., temperature, voltage, wear-out, etc.). The memory hierarchy

    is also affected by variability [31]. Moreover, variability plays a

    major role not only in system performance and power consumption

but also in production costs, since high degrees of variability might cause a device to be discarded [16]. In order to cope with this

    expected increase in variability, designers must build adaptable and

    tunable software/hardware that can opportunistically exploit this

    variability. An example of such an ambitious project is the NSF

    Variability Expedition (http://variability.org).

    D. Exploiting Emerging Non-Volatile Memories

Cache (Size)   Norm. Density   Latency (cycles)        Dyn. Energy (nJ)          Static Power (W)
SRAM (1MB)     1               8                       0.388                     1.36
MRAM (4MB)     4               Read: 20 / Write: 60    Read: 0.4 / Write: 2.3    0.15
PRAM (16MB)    16              Read: 40 / Write: 200   Read: 0.8 / Write: 1.5    0.3

Figure 3. Memory Technology Characteristics [101].

    In order to reduce leakage power in SRAM-based memories,

    designers have proposed emerging Non-Volatile Memories (NVMs)

    as alternatives to SRAM for on-chip memories [92], [44], [69].

    Typically, NVMs (e.g., PCRAM [56]) offer high densities, low

    leakage power, comparable read latencies and dynamic read power

    with respect to traditional embedded memories (SRAM/eDRAM).

These characteristics can be observed in Figure 3, which shows the differences in read and write energy/latency between SRAMs, PRAMs, and MRAMs. One major drawback

    across NVMs is the expensive write operation (high latencies and

    dynamic energy). This is illustrated in Figure 3, where MRAM

    memories provide 4x the density of SRAMs, similar read energy,

less static power, but substantially larger write overheads (several times higher latency and dynamic energy per write). To mitigate the drawbacks of the write operation in NVMs,

    designers have made the case for deploying hybrid on-chip memory

    hierarchies (e.g., SRAM, NVM) [92], which have shown up to 37%

    reduction in leakage power [35], and increased IPC as a byproduct

    of the higher density provided by NVMs [69].
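As a rough illustration of this trade-off (our own back-of-envelope reading of Figure 3, with r and w denoting per-block read and write counts): a block held in SRAM consumes about 0.388(r + w) nJ of dynamic energy, versus roughly 0.4r + 2.3w nJ in MRAM, so MRAM never wins on dynamic energy alone; the payoff comes from its roughly 9x lower static power (0.15 W vs. 1.36 W) and 4x density, which is why read-mostly, highly utilized data is the natural candidate for on-chip NVM placement in such hybrid hierarchies.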

V. MEMORY HIERARCHIES IN MANY-CORE PLATFORMS

    The processor-memory gap is a well-known concern; as a result,

    state-of-the-art platforms are deployed with complex memory hier-

archies consisting of one to three levels of caches on the order of

    megabytes. There are two basic models for the on-chip memory in

    CMP systems: hardware-managed, implicitly-addressed, coherent

    caches and software-managed, explicitly-addressed, local memo-

    ries (also called streaming memories [60], scratchpad memories

    [77], local stores [36], tightly coupled memories [5], or software

    controlled caches [102]).

    The cache-based model has built-in hardware policies where all

    on-chip storage is used for private and shared caches that are kept

    coherent. The on-chip memory space is not directly addressed

    through software, thus providing the advantage of transparent

    memory management, which exploits locality (best effort) and

    automatically handles the communication between memories. Even

    on occasions when accesses are too irregular to capture, the cache

    may still be able to exploit some degree of inherent locality. On

    the other hand, the software controlled memory model assumes

    that part of the on-chip storage is organized as independently

    addressable structures. Explicit accesses and DMA transfers are

    needed to move data to and from off-chip memory or between

    two on-chip structures. The address space can be localized to

each individual core or distributed (shared memory model); this is illustrated in Figure 4, where each processing element consists

    of locally addressable SPMs (SP) and globally addressable SPMs

    (GM).

    Figure 4. IBM Cyclops-64 Diagram [103].

Software-controlled memories have the advantage of providing

    software with full flexibility on locality and communication man-

    agement in terms of addressing, granularity, and replacement policy

    [60]. Since communication is explicit, it can also be proactive,

    unlike the mostly reactive behavior of the cache-based model.

    Hence, this model allows software to exploit producer-consumer

    locality, avoid redundant data moves, and perform application-

specific caching and prefetching [38], [60]. The main drawback of this scheme is that the on-chip address space needs to be explicitly managed, so the programmer or compiler must explicitly tell the platform what data to map where and when; this may result in higher programming costs (e.g., a steeper learning curve, reduced productivity, and the need for platform-aware programmers).
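To make this cost concrete, the following C fragment sketches the kind of explicit staging the software-controlled model requires. It is an illustration only: spm_base(), dma_get(), dma_put(), and dma_wait() are hypothetical stand-ins for a platform's SPM/DMA primitives, not an API from any of the surveyed systems.

    /* Process a large off-chip array in SPM-sized tiles. All SPM/DMA
       calls below are hypothetical placeholders. */
    #include <stddef.h>
    #include <stdint.h>

    #define TILE 1024                        /* elements staged on-chip at a time */

    extern int32_t *spm_base(void);          /* base address of this core's SPM   */
    extern void dma_get(void *dst, const void *src, size_t bytes);  /* off -> on  */
    extern void dma_put(void *dst, const void *src, size_t bytes);  /* on -> off  */
    extern void dma_wait(void);              /* wait for outstanding DMA transfers */

    void scale_in_place(int32_t *data, size_t n, int32_t k)
    {
        int32_t *buf = spm_base();           /* the SPM is just an address range  */
        for (size_t off = 0; off < n; off += TILE) {
            size_t len = (n - off < TILE) ? (n - off) : TILE;
            dma_get(buf, data + off, len * sizeof *buf);   /* stage tile into SPM */
            dma_wait();
            for (size_t i = 0; i < len; i++)               /* compute out of SPM  */
                buf[i] *= k;
            dma_put(data + off, buf, len * sizeof *buf);   /* write tile back     */
            dma_wait();
        }
    }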

    These issues motivate the need for transparent and efficient

    management of the physical on-chip software-controlled memory

    resources. The software control includes efficient compile-time

    optimizations (front-end), minimal changes to the traditional pro-

    gramming model (e.g., use of standard C-like APIs), as well as a

    light-weight run-time memory management system that allows for

    efficient dynamic allocation of the memory space (back-end). These

    issues must be efficiently addressed as we migrate to many-core

    platforms, where the total number of on-chip software-controlled

    memories might be in the thousands [34], [66], [94], [15].

    Figure 5 illustrates some possible memory hierarchy configura-

tions for many-core platforms. Figure 5 (a) shows Intel's Single-

    Chip Cloud Computer [34], which is a NoC-interconnected many-

    core platform consisting of a mesh of tiles, and four memory

    controllers. Each tile in the NoC contains two simple IA-32 cores,

    connected to two 256KB L2 caches and the router. This model

    represents the pure cache-based memory model with two levels (L1

and L2). Each IA-32 core can boot Linux, and coherency is maintained in software (e.g., via programming models such as MPI and OpenMP). Figure 5 (b) shows Tilera's TILEPro64 [94], which consists of a similar 8x8

    Mesh NoC-based interconnect, four memory controllers, where

    3

  • 7/26/2019 Software Controlled

    4/10

  • 7/26/2019 Software Controlled

    5/10

  • 7/26/2019 Software Controlled

    6/10

    policies). The PHiLOSoftware Run-Time System (RTS) will take

the application's allocation policies and enforce them at run-

    time by dynamically allocating the physical memory resources

    accordingly (e.g., highly utilized read-only data to on-chip Non-

    Volatile Memory).

Figure 7. Variation-aware Virtual Address Space Partitioning. (The diagram shows CPUs with voltage-scaled and nominal-voltage SPMs, on-chip NVMs, and a memory controller driving low-power and high-power DRAM DIMMs; on-chip space may be locally or remotely allocated. The nine memory classes are: (1) voltage-scaled, low-power / mid-latency; (2) voltage-scaled, fault-tolerant; (3) nominal-voltage, high-power / low-latency; (4) nominal-voltage, low-priority; (5) nominal-voltage, higher power / latency; (6) NVM, high write power / latency; (7) NVM, higher write power / latency; (8) low-power DRAM; (9) high-power DRAM.)

For the sake of illustration, Figure 7 shows various address spaces defined by the programmer, where each address space has different requirements (classes 1-9), ranging from (1) a low-voltage, mid-latency on-chip address space (e.g., aggressively voltage-scaled SRAM), (2) a low-power, fault-tolerant on-chip address space (SRAM), (3) a nominal-voltage, high-performance on-chip memory space (SRAM), and (7) remotely allocated non-volatile on-chip space (NVM), to (8) low-power off-chip memory space (DRAM). Each address space is created on demand, and its allocation policies are generated by the programmer and/or compiler, while the run-time system (PHiLOSoftware RTS) enforces each policy on a best-effort basis.
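As a sketch of what such programmer-defined policies could look like at the source level (illustrative only; region_alloc() and the mem_attr flags are hypothetical and are not the PHiLOSoftware interface):

    /* Each region of the virtual address space is "colored" with the
       guarantees it needs; a run-time system would map it onto whichever
       physical memory class (Figure 7, classes 1-9) best satisfies the
       request, on a best-effort basis. All names are hypothetical. */
    #include <stddef.h>

    enum mem_attr {
        MEM_LOW_POWER      = 1 << 0,
        MEM_FAULT_TOLERANT = 1 << 1,
        MEM_HIGH_PERF      = 1 << 2,
        MEM_SECURE         = 1 << 3
    };

    void *region_alloc(size_t bytes, unsigned attrs);    /* hypothetical API */

    void setup_regions(void)
    {
        /* read-only, highly utilized lookup tables: low power, protected */
        void *luts   = region_alloc(2048, MEM_LOW_POWER | MEM_FAULT_TOLERANT);
        /* pixel scratch buffer: low power is enough, occasional errors tolerable */
        void *pixels = region_alloc(64 * 1024, MEM_LOW_POWER);
        /* isolation-critical data (e.g., for sha): secure, high-performance space */
        void *digest = region_alloc(256, MEM_SECURE | MEM_HIGH_PERF);
        (void)luts; (void)pixels; (void)digest;
    }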

    PHiLOSoftware captures three key ideas: 1) Application intent

    (requirements), 2) memory management for scalability, and 3)

    adaptability to changes in the memory subsystem, thus spanning

    across three mutually dependent layers (Figure 6):

Application and Compilation Layer: This layer statically defines the functionality of the application (in a programming language such as C/C++), the resources needed to run the application (e.g., memory, CPUs), and how the application's data should be managed (e.g., allocation policies). The application is optimized to meet a given goal (e.g., performance) while adhering to a number of constraints (e.g., performance, reliability, etc.).

Runtime System (RTS) Layer: This layer takes the application's requirements and manages the system memory resources dynamically while trying to meet the application's

    goal, adhering to one or more constraints (e.g., performance,

    reliability, etc.), and adapting to changes in the underlying

    platform (e.g., Figure 6).

    Platform Layer: This layer defines the platform template for

    which an application may be optimized. Though the number

    of cores, type of communication fabric, and memory hierarchy

    may be fixed, there may be hardware process variations (e.g.,

    memory power and error rates), different memory technology

    characteristics that affect the performance and power con-

    sumption of the system, as well as failures due to wear-out.

    A. Multi-tasking and NVM Support

    In order to present the compiler/programmer with an abstracted

    view of the memory hierarchy and minimize the complexity of

our run-time system, PHiLOSoftware proposes the use of virtual SPMs and virtual NVMs. PHiLOSoftware leverages the concept of vSPMs [10], which enables a program to view and manage a set of vSPMs as if they were physical SPMs. In order to virtualize

    SPMs, a small part of main memory (DRAM) called protected evict

    memory (PEM) space is locked and used as extra storage. The run-

    time system would then prioritize the data mapping to SPM and

    PEM space based on a utilization metric. Similarly, to manage on-

    chip hybrid memory space, designers can exploit the concept of

    virtual NVMs (vNVMs) [11], which behave similarly to vSPMs,

    meaning that the run-time environment transparently allows each

    application to manage their own set of vNVMs.

    Management of virtual memories is done through a small set of

    APIs [8], [11], which send management commands to the memory

    manager. The memory manager then presents each application

    with intermediate physical addresses (IPAs), which point to their

    virtual SPMs/NVMs. Traditional SPM-based memory management

    requires the data layout to use physical addresses by pointing to

the base register of the SPMs; as a result, the same is expected of SPM/NVM-based memory hierarchies [35]. In PHiLOSoftware, all

    policies use virtual SPM and NVM base addresses, so any run-time

    re-mapping of data will remain transparent to the initial allocation

    policies as the IPAs will not change.
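A minimal sketch of how an application might drive such a virtualized memory (the names vspm_create, vspm_base, and vspm_map are hypothetical; the actual APIs are described in [8], [11]):

    /* The application sees a stable intermediate physical address (IPA);
       the memory manager decides whether the bytes behind it live in
       physical SPM or in the locked PEM region of DRAM, and may re-map
       them at run time without invalidating the IPA that the allocation
       policy was written against. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct vspm vspm_t;

    vspm_t   *vspm_create(size_t bytes);      /* request a virtual SPM (hypothetical) */
    uintptr_t vspm_base(vspm_t *v);           /* IPA of its first byte                */
    int       vspm_map(vspm_t *v, uintptr_t ipa, const void *data, size_t n);

    void place_coefficients(const int16_t *coeffs, size_t n)
    {
        vspm_t   *v   = vspm_create(n * sizeof *coeffs);
        uintptr_t ipa = vspm_base(v);          /* policies record IPAs, not physical addresses */
        vspm_map(v, ipa, coeffs, n * sizeof *coeffs);  /* manager picks SPM or PEM backing */
    }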

Table I
FILTER INEQUALITIES AND PREFERRED MEMORY TYPE

Filter   Pref.    Inequalities
F1       sram     E(D_i^spm) > E(D_i^dram);  E(D_i^nvm) < E(D_i^dram);  V > T_vol
F2       nvm      E(D_i^nvm) > E(D_i^dram);  V < T_vol
F3       either   E(D_i^spm) > E(D_i^dram);  E(D_i^nvm) > E(D_i^dram)
F4       dram     E(D_i^spm) < E(D_i^dram);  E(D_i^nvm) < E(D_i^dram)

Table II
CONFIGURATIONS

Config.  Applications                     CPUs   vSPM/vNVM Space   SPM/NVM Space
C1       adpcm, aes                       1      32/128 KB         16/64 KB
C2       adpcm, aes, blowfish, gsm        1      64/256 KB         16/64 KB
C3       C2 & h263, jpeg, motion, sha     1      128/512 KB        16/64 KB
C4       same as C2                       2      64/256 KB         32/128 KB
C5       same as C3                       2      128/512 KB        32/128 KB
C6       same as C3                       4      128/512 KB        64/256 KB

PHiLOSoftware defines four block-based allocation policies: 1) Temporal allocation, which combines temporal SPM allocation [93] and hybrid memory allocation [35], and adheres to the initial layout obtained through static analysis; however, the application's SPM and NVM contents must be swapped on a context switch to avoid conflicts with other tasks. 2) FixedDynamic allocation, which combines dynamic-spatial SPM allocation [79] and hybrid memory allocation [35], and maps each data block to the preferred memory type (adhering to the initial layout) as long as there is space; otherwise, data is mapped to DRAM. 3) FilterDynamic allocation, which exploits the concept of filtering and volatility to find the best placement. Each request is filtered according to a set of inequalities (shown in Table I), which


    determine the preferred memory type. Finally, 4) Oracle-based

    allocation, which is a near-optimal policy because on every block-

    allocation request, it feeds the entire memory map to the same

    policy generator the compiler uses to generate policies statically.
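The decision step of the FilterDynamic policy can be sketched as below, under our reading of Table I: E(.) is taken as a per-memory figure of merit (e.g., estimated energy benefit) for block D_i, V is the block's write volatility, T_vol a threshold, and the order in which filters are checked when more than one holds is our own assumption.

    /* Table I filter logic (sketch). e_spm, e_nvm, e_dram correspond to
       E(D_i^spm), E(D_i^nvm), E(D_i^dram); v and t_vol to V and T_vol.
       The exact metric definitions are in [11] and not reproduced here. */
    typedef enum { PREF_SRAM, PREF_NVM, PREF_EITHER, PREF_DRAM } pref_t;

    pref_t filter_block(double e_spm, double e_nvm, double e_dram,
                        double v, double t_vol)
    {
        if (e_spm > e_dram && e_nvm < e_dram && v > t_vol)  /* F1 */
            return PREF_SRAM;
        if (e_nvm > e_dram && v < t_vol)                    /* F2 */
            return PREF_NVM;
        if (e_spm > e_dram && e_nvm > e_dram)               /* F3 */
            return PREF_EITHER;
        if (e_spm < e_dram && e_nvm < e_dram)               /* F4 */
            return PREF_DRAM;
        return PREF_DRAM;   /* no filter matched: fall back to DRAM */
    }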

Figure 8. Normalized Execution Time and Energy Comparison for Performance-Optimized Policies (Temporal, Oracle, FixedDyn, FilterDyn) over configurations C1-C6 and their average, for MRAM and PRAM, with Goal=Performance and page size = 4KB.

Table II shows each configuration (C1-6), which has a set of applications running concurrently over a number of CPUs, and a predefined hybrid memory physical space. The idea is to show that PHiLOSoftware's FilterDynamic policy (backward-slashed bars in Figure 8) achieves allocation solutions of competitive quality with the more complex Oracle policy. Figure 8 shows the normalized execution time and energy for each of the different configurations (C1-6, Goal=Min Execution Time, denoted as G=P) using 4KB blocks and different memory types with respect to the Temporal policy. The FixedDynamic policy (forward-slashed bars in Figure 8) suffers the greatest impact on energy and execution time as memory space increases (C4-6), since it adheres to the decisions made at compile-time and does not efficiently allocate memory blocks at run-time. In general, we see that the FilterDynamic policy performs almost as well as the Oracle policy (within 8.45% in execution time). Compared with the Temporal policy, PHiLOSoftware's FilterDynamic policy is able to reduce execution time and energy by an average of 75.42% and 62.88%, respectively, when the initial application policies have been optimized for execution time minimization [11].

    B. Mixed Criticality Support

    In order to efficiently exploit the available memory variability

    (on- and off-chip [12]), programmers need to provide PHiLOSoft-

    ware with data mapping hints in the form of policies. A pro-

    grammer will consider the applications requirements and partition

    its virtual address space into regions, which are then associated

    with a mapping policy that dictates how to map the data into

    physical address space and the type of guarantee needed (power,

    performance, fault-tolerance). Iyer et al. [43] proposed a QoS-

    enabled memory architecture to enable more cache space andmemory bandwidth for high priority applications based on guidance

    from the operating environment. PHiLOSoftware takes a similar

    approach, however, it opportunistically exploits the variations in

    the memory subsystem to achieve better performance and minimize

    power consumption.

Figure 9 shows a sample address space partitioning for JPEG [57], where the programmer has identified: 1) Read-only and highly utilized data, i.e., look-up tables (red blocks), 2) A temporary buffer for inter-task communication (gray block), 3) Read-only pixel data (black blocks), and 4) Irregularly accessed data (green blocks).

Figure 9. Sample User Annotations and Policies (e.g., tables (2KB): low-power / E-RAID 1; pixel data: low-power / no E-RAID; DCT, Q, ZZ data: low-power; Huffman data: low-power / irregular access).

Figure 10. Partitioning the Application's Memory Space: a) traditional mapping onto full physical SPMs (4KB) and virtual memory/DRAM; b) PHiLOSoftware mapping onto voltage-scaled SPMs, E-RAID 1 protected space, PEM, and high-/low-power (HP/LP) DRAM.

Figure 10 (a) shows a traditional mapping of these data blocks,

    where variability is not taken into account. Figure 10 (b) shows

the result of PHiLOSoftware's virtualization layer mapping that

    exploits: 1) Data mapping policies customized by the programmer

    and used to make dynamic memory allocation decisions. 2) On-chip

    memory voltage scaling (using E-RAIDs [9] to deal with process

    variation), and 3) DRAM variability. For the sake of illustration,

    PHiLOSoftware maps commonly used read-only data to voltage

    scaled SRAM protected by an E-RAID 1 level, pixel data to

    voltage scaled SRAM (NO ERAID), and irregular commonly used

data to low-power DRAM. A programmer with knowledge of the application's requirements can create custom data mapping policies with low-power (LP) memory space in mind. PHiLOSoftware then takes these policies and tries to opportunistically enforce them (best effort), regardless of how the LP memory space is implemented by

    the hardware layer. For instance: 1) If there is no noticeable DRAM

    power variability, then PHiLOSoftware will not prioritize DRAMs

    and follow a more traditional memory management scheme (e.g.,

    malloc()), or 2) If voltage scaling on-chip memories is not possible,

    PHiLOSoftware will proceed to treat all on-chip memories the

    same.

    Recall Figure 6 and consider the case where there is power

    and latency variation in both on-chip (due to voltage scaling the

    SPMs) and off-chip memories (due to the inherent hardware-

    variability in DRAM memories). The programmer then partitions

each application's virtual address space as shown in Figure 9 and defines allocation policies for each virtual address space as shown in Figure 10 (adding a color to each virtual address space according to its requirements). These annotations are used by PHiLOSoftware's run-time system, which opportunistically exploits the variations in the memory subsystem. Each application will then have different requirements (e.g., fault-tolerant memory space, secure memory space, etc.), some being more critical than others (e.g., h263's higher memory footprint requirements vs. sha's need for full

    memory isolation).

    Figure 11 shows various configurations (x-axis):


App.       2x2x1   4x2x2   8x2x4
adpcm              Y       Y
aes                Y       Y
blowfish                   Y
gsm                        Y
h263       Y       Y       Y
jpeg       Y       Y       Y
motion                     Y
sha                        Y

Figure 11. Dynamic Policy-driven Variability-aware Allocation.

{#Apps}x{#OSes}x{#CPUs}, with 4x8KB physical SPMs, and the set of applications run for each configuration (marked by a Y in their respective row/column). The baseline policy (P) utilized the entire physical space with context-switching (CX) enabled [93] (e.g., swap SPM data on CX). Policy M1 uses vSPMs and allows PHiLOSoftware to dynamically map each application's data. Because we are running various applications concurrently, PHiLOSoftware needs to prioritize and judiciously map different data to on-chip and off-chip memory. The data sets (T1-4) in Figure 11 represent a high-level abstraction of the applications' workload. In this experiment, despite mapping all T1-3 data to vSPM for a given application, it is not guaranteed that the data will go into physical SPM space, as the resources are limited (only 4x8KB). So PHiLOSoftware prioritizes among all data blocks of each category (T1-4) and, based on their priorities (BlkPriority = block utilization), decides where to map the data. For example, H.263 has much higher on-chip/off-chip memory requirements; as a result, higher priority is given to H.263's T1-4 blocks than to those of the other applications (on-chip SRAM and low-power DRAM). User-defined policies (M1) managed to reduce dynamic power consumption by 63% on average while reducing total execution time by an average of 34% because: 1) there are up to {8 Apps}x{4 OSes}x{4 CPUs} competing for memory resources, and traditional malloc (P) is unable to efficiently cope with the demand, and 2) PHiLOSoftware efficiently manages the memory space by exploiting the idea of variability-aware, dynamic, policy-driven memory allocation.
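The prioritization itself can be sketched as a simple greedy pass (our illustration of the "BlkPriority = block utilization" rule; the block descriptor and the free-space bookkeeping below are hypothetical):

    /* Rank all blocks from all concurrent applications by utilization and
       grant the limited physical SPM space (e.g., 4x8KB) to the hottest
       blocks first; everything else spills to low-power DRAM. */
    #include <stdlib.h>

    struct blk { unsigned app_id; unsigned util; size_t bytes; int in_spm; };

    static int by_util_desc(const void *a, const void *b)
    {
        const struct blk *x = a, *y = b;
        return (y->util > x->util) - (y->util < x->util);   /* descending utilization */
    }

    void place_blocks(struct blk *blks, size_t n, size_t spm_bytes_free)
    {
        qsort(blks, n, sizeof *blks, by_util_desc);          /* hottest blocks first */
        for (size_t i = 0; i < n; i++) {
            if (blks[i].bytes <= spm_bytes_free) {           /* fits on-chip?        */
                blks[i].in_spm = 1;
                spm_bytes_free -= blks[i].bytes;
            } else {
                blks[i].in_spm = 0;                          /* spill to LP DRAM     */
            }
        }
    }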

    C. PHiLOSoftware Summary

    This section presented the concept of PHiLOSoftware, which

    proposes the idea of tagging or coloring virtual address spaces

    with different characteristics ranging from low-power virtual mem-

    ory space, secure virtual memory space, to fault-tolerant virtual

    memory space, etc. Each application will define different virtual

mappings (policies) according to its needs. PHiLOSoftware's run-

    time system will then opportunistically enforce these memory al-

    location policies. PHiLOSoftware, like [43], is a plausible solution

    to support the efficient management of the memory subsystem

    in scalable many-core platforms, thereby allowing for mixed-

    criticality systems to execute in an energy efficient and high-

    performance manner.

    VIII. CONCLUSION

    By deploying simpler processing elements and exploiting task-

    level and application-level parallelism, the many-core era promises

    high performance with reduced power consumption. In order to

support the memory bandwidth requirements of such a massively parallel system, the memory subsystem must be carefully designed and managed. However, as we move forward, the memory hierarchy will be heterogeneous in nature, and thus traditional

    memory management schemes will not cope with the demands of

    the processing elements. In this paper we surveyed the state-of-

    the-art in distributed memory models and motivated the need for

    efficient management of software-controlled many-core memories.

    We showed how virtualization may help address some of the

emerging issues in the memory subsystem, from the adoption of emerging non-volatile memories to the opportunistic exploitation

    of the inherent variation in the memory subsystem.

    ACKNOWLEDGMENT

    This work was partially supported by NSF Variability Expedition

    Grant Number CCF-1029783.

REFERENCES

[1] B. Ackland et al. A single-chip, 1.6-billion, 16-b mac/s multi-

    processor dsp. Solid-State Circuits, IEEE Journal of, 35, 2000.

    [2] V. Agarwal et al. Clock rate versus ipc: the end of the road for

    conventional microarchitectures. SIGARCH Comp. Arch. News,

    28, 2000.

    [3] F. Angiolini et al. Reliability support for on-chip memories using

networks-on-chip. In ICCD, Oct. 2006.

[4] A. Ansari et al. Zerehcache: armoring cache architectures in

    high defect density technologies. In MICRO 42, 2009.

    [5] ARM. Arm cortex-m3 processor. In http://www.arm.com/, 2012.

    [6] N. Azizi et al. Low-leakage asymmetric-cell sram. TVLSI, Vol.

    11, aug. 2003.

    [7] A. BanaiyanMofrad et al. Fft-cache: a flexible fault-tolerant

    cache architecture for ultra low voltage operation. CASES, 2011.

[8] L. Bathen. PHiLOSoftware: A low power, high performance, reliable, and secure virtualization layer for on-chip software-controlled memories. PhD thesis, University of California,

    Irvine, 2012.

    [9] L. Bathen et al. E-RoC: Embedded raids-on-chip for low power

distributed dynamically managed reliable memories. In DATE,

    2011.

    [10] L. Bathen et al. Spmvisor: Dynamic scratchpad memory virtual-

    ization for secure, low power and high performance, distributed

on-chip memories. In CODES+ISSS, 2011.

    [11] L. Bathen et al. HaVOC: A hybrid-memory-aware virtualization

    layer for on-chip distributed scratchpad and non-volatile mem-

    ories. In DAC, 2012.

    [12] L. Bathen et al. VaMV: Variability-aware memory virtualization.

    In DATE, 2012.

    [13] L. Benini and G. D. Micheli. Networks on chips: A new soc

    paradigm. IEEE Computer, 35(1), 2002.

    [14] K. Bernstein et al. High-performance cmos variability in the

    65-nm regime and beyond. IBM J. Res. Dev., 50, 2006.

    [15] S. Borkar. Thousand core chips: a technology perspective. In

    DAC, 2007.

    [16] S. Borkar et al. Parameter variations and impact on circuits and

    microarchitecture. In DAC 03, 2003.

    [17] B. Calhoun et al. A 256-kb 65-nm sub-threshold sram design

for ultra-low-voltage operation. IEEE J. of Solid-State Circuits

    (JSSC), 2007.

    [18] A. Chakraborty et al. E


    [19] X. Chen, Z. Lu, A. Jantsch, and S. Chen. Run-time partitioning

    of hybrid distributed shared memory on multi-core network-on-

chips. In PAAP, 2010.

    [20] X. Chen et al. Supporting distributed shared memory on multi-

    core network-on-chips using a dual microcoded controller. In

    DATE, 2010.

    [21] S.-H. Chou et al. No cache-coherence: A single-cycle ring

interconnection for multi-core l1-nuca sharing on 3d chips. In DAC, 2009.

    [22] D. E. Culler, A. Gupta, and J. P. Singh. Parallel Computer Ar-

    chitecture: A Hardware/Software Approach. Morgan Kaufmann

    Publishers Inc., San Francisco, CA, USA, 1st edition, 1997.

[23] A. Das et al. Pad: Power-aware directory placement in distributed

    caches. TR Northwestern University (NWU-EECS-10-11), 2010.

    [24] A. Devgan et al. Power variability and its impact on design. In

    VLSID, 2005.

    [25] B. Egger et al. Dynamic scratchpad memory management for

    code in portable systems with an mmu. ACM TECS, 7, January

    2008.

    [26] B. Egger et al. Scratchpad memory management techniques

    for code in embedded systems without an mmu. Comp., IEEE

Trans. on, 2010.

[27] A. Ferreira et al. Using pcm in next-generation embedded space applications. In RTAS 10, April 2010.

    [28] L. Gauthier et al. Minimizing inter-task interferences in scratch-

    pad memory usage for reducing the energy consumption of

    multi-task systems. In CASES 10, 2010.

    [29] S. Ghosh et al. Reducing power consumption in memory ecc

checkers. In ITC, 2004.

    [30] S. Gochman, A. Mendelson, A. Naveh, and E. Rotem. Intro-

    duction to intel core duo processor architecture. Intel Technol.

J., vol. 10, no. 2, pp. 89-97, 2006.

    [31] M. Gottscho et al. Analyzing power variability of ddr3 dual

    inline memory modules for applications. TR UCLA-EE, 2011.

    [32] D. S. Gracia et al. Lp-nuca: Networks-in-cache for high-

    performance low-power embedded processors. TVLSI, PP(99):1,2011.

    [33] T. Granlund, B. Granbom, and N. Olsson. Soft error rate increase

    for new generations of srams. Nuclear Science, IEEE Trans. on,

50(6):2065-2068, Dec. 2003.

    [34] J. Howard et al. A 48-core ia-32 message-passing processor

    with dvfs in 45nm cmos. In ISSCC, feb. 2010.

    [35] J. Hu et al. Towards energy efficient hybrid on-chip scratch pad

    memory with non-volatile memory. In DATE 11, 2011.

    [36] IBM. The cell project. http://www.research.ibm.com/cell/, 2005.

    [37] Intel Corp. Intel ixp2855 network processor. Available:

    http://www.intel.com, 2005.

    [38] I. Issenin et al. Multiprocessor system-on-chip data reuse

    analysis for exploring customized memory hierarchies. In DAC

    06, 2006.

    [39] ITRS. System drivers. http://www.itrs.net/, 2003.

    [40] ITRS. System drivers. http://www.itrs.net/, 2005.

    [41] ITRS. Process integration, device and structures.

    http://www.itrs.net/, 2007.

    [42] ITRS. 2008 update overview. http://www.itrs.net/, 2008.

    [43] R. Iyer et al. Qos policies and architecture for cache/memory in

    cmp platforms. SIGMETRICS Perform. Eval. Rev., 35(1), 2007.

    [44] Y. Joo et al. Energy- and endurance-aware design of phase

    change memory caches. In DATE 10, 2010.

    [45] S. Kaneko et al. A 600mhz single-chip multiprocessor with

    4.8gb/s internal shared pipelined bus and 512kb internal mem-

ory. In ISSCC 03, 2003.

    [46] P. A. Karger. Multi-level security requirements for hypervisors.

    In ACSAC, 2005.

[47] J. Kelm et al. Waypoint: scaling coherence to thousand-core architectures. In PACT, pages 99-110, 2010.

    [48] O. Khan et al. Dcc: A dependable cache coherence multicore

    architecture. IEEE CAL, 10, 2011.

    [49] S. Kim. Area-efficient error protection for caches. In DATE,

    2006.

    [50] C. Kim et al. An adaptive, non-uniform cache structure for

    wire-delay dominated on-chip caches. SIGOPS Oper. Syst. Rev.,

36:211-222, 2002.

    [51] J. Kim et al. Multi-bit error tolerant caches using two-

    dimensional error coding. In MICRO, 2007.

[52] N. Kim et al. Leakage current: Moore's law meets static power.

    Computer, 36(12), dec. 2003.

    [53] J. Kulkarni et al. A 160 mv, fully differential, robust schmitt

trigger based sub-threshold sram. In ISLPED 07, 2007.

[54] F. Kurdahi et al. Low-power multimedia system design by

    aggressive voltage scaling. TVLSI, 18(5), 2010.

    [55] G. Kurian et al. Atac: a 1000-core cache-coherent processor

with on-chip optical network. In PACT, 2010.

    [56] B. Lee et al. Architecting phase change memory as a scalable

    dram alternative. SIGARCH Comput. Archit. News, 37, 2009.

    [57] C. Lee et al. Mediabench: a tool for evaluating and synthesizing

multimedia and communications systems. In MICRO 97, 1997.

    [58] H. Lee et al. Cloudcache: Expanding and shrinking private

caches. In HPCA, 2011.

    [59] K. Lee et al. Mitigating soft error failures for multimedia

    applications by selective data protection. In CASES 06, 2006.

    [60] J. Leverich et al. Comparing memory systems for chip multi-

processors. SIGARCH Comput. Archit. News, 35(2), June 2007.

[61] F. Li et al. Improving scratch-pad memory reliability through

    compiler-guided data block duplication. In ICCAD, 2005.

    [62] J. Lira et al. Lru-pea: A smart replacement policy for non-

    uniform cache architectures on chip multiprocessors. In ICCD,

    2009.

    [63] T. Liu et al. Power-aware variable partitioning for dsps with

hybrid pram and dram main memory. In DAC 11, 2011.

    [64] A. Marongiu et al. An openmp compiler for efficient use of

    distributed scratchpad memory in mpsocs. Computers, IEEE

    Transactions on, PP(99):1, 2010.

    [65] R. Mastipuram and E. C. Wee. Soft errors impact on system re-

liability. http://www.edn.com/article/CA454636, September

    2004.

[66] T. Mattson et al. The 48-core scc processor: the programmer's view. In SC, 2010.

    [67] D. Melpignano et al. Platform 2012, a many-core computing

    accelerator for embedded socs: performance evaluation of visual

    analytics applications. In DAC, 2012.

    [68] L. Mieszko et al. Shared memory via execution migration.

    ASPLOS, 2011.

    [69] A. Mishra et al. Architecting on-chip interconnects for stacked

3d stt-ram caches in cmps. In ISCA 11, 2011.


    [70] J. Mogul et al. Operating system support for nvm+dram hybrid

main memory. In HotOS 09, 2009.

    [71] M. Monchiero et al. Exploration of distributed shared memory

    architectures for noc-based multiprocessors. Journal of Systems

    Architecture, 53(10), 2007.

    [72] F. Moradi et al. 65nm sub-threshold 11t-sram for ultra low

voltage applications. In SOCC 08, pages 113-118, Sept. 2008.

[73] S. Nassif. Modeling and analysis of manufacturing variations. In IEEE Conf. on Custom Integrated Circuits, 2001.

    [74] NXP. Nxp nexperia mobile multimedia processor pnx4101.

    Available: http://www.nxp.com/, 2007.

    [75] K. Olukotun et al. The case for a single-chip multiprocessor.

    SIGPLAN Not., 31, September 1996.

    [76] K. Osada et al. 16.7 fa/cell tunnel-leakage-suppressed 16 mb

    sram for handling cosmic-ray-induced multi-errors. In ISSCC

    03, 2003.

    [77] P. Panda et al. Efficient utilization of scratch-pad memory in

    embedded processor applications. In EDTC 97, 1997.

    [78] D. A. Patterson and J. L. Hennessy. Computer Organization

and Design, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann Publishers Inc., 4th edition, 2008.

    [79] F. Poletti et al. An integrated hardware/software approach for

    run-time scratchpad management. In DAC 04, 2004.

    [80] R. Pyka et al. Os integrated energy aware scratchpad allocation

    strategies for multiprocess applications. In SCOPES, 2007.

    [81] A. Ros et al. A direct coherence protocol for many-core chip

    multiprocessors. TPDS, 21(12), Dec. 2010.

[82] F. Ruckerbauer et al. Soft error rates in 65nm srams: analysis of new phenomena. In IOLTS, 2007.

    [83] J. Sartori et al. Variation-aware speed binning of multi-core

    processors. In ISQED 10, 2010.

    [84] A. Sasan et al. Process variation aware sram/cache for aggressive

    voltage-frequency scaling. In DATE 09, 2009.

[85] M. Shalan et al. A dynamic memory management unit for embedded real-time system-on-a-chip. In CASES 00, 2000.

    [86] P. Shirvani et al. Padded cache: A new fault-tolerance technique

    for cache memories. In VTS, 1999.

    [87] P. Shivakumar et al. Modeling the effect of technology trends

on the soft error rate of combinational logic. In DSN, 2002.

    [88] S. K. Shukla et al. A brief history of multiprocessors and eda.

    IEEE Design & Test of Computers, 28(3):96, 2011.

    [89] STMicroeletronics. Nomadik - open multimedia platform for

    next generation mobile devices. Technical Article TA305, Avail-

    able: http:// www.st.com, 2003.

    [90] V. Suhendra et al. Integrated scratchpad memory optimization

    and task scheduling for mpsoc architectures. In CASES, 2006.

    [91] V. Suhendra et al. Scratchpad allocation for concurrent embed-

ded software. In CODES+ISSS, 2008.

[92] G. Sun et al. A novel architecture of the 3D stacked MRAM L2 cache for CMPs. In HPCA 09, Feb. 2009.

    [93] H. Takase et al. Partitioning and allocation of scratch-pad

    memory for priority-based preemptive multi-task systems. In

    DATE 10 , 2010.

[94] Tilera. TILEPro64. http://www.tilera.com/, 2010.

    [95] Y. Wang et al. Temperature-constrained power control for chip

    multiprocessors with online model estimation. In ISCA, 2009.

    [96] L. Wanner et al. A case for opportunistic embedded sensing in

    presence of hardware power variability. In HotPower10, 2010.

    [97] C. Wilkerson et al. Trading off cache capacity for reliability to

    enable low voltage operation. In ISCA 08, 2008.

    [98] C. Wilkerson et al. Reducing cache power with low-cost, multi-

    bit error-correcting codes. SIGARCH Comput. Archit. News,

38(3):83-93, June 2010.

[99] W. Wolf et al. Multiprocessor system-on-chip (mpsoc) technology. TCAD, 27(10), 2008.

    [100] B. Wongchaowart et al. A content-aware block placement

    algorithm for reducing pram storage bit writes. In MSST 10,

    2010.

    [101] X. Wu et al. Hybrid cache architecture with disparate memory

    technologies. In ISCA 09, 2009.

    [102] C.-L. Yang et al. Software-controlled cache architecture for

energy efficiency. TCSVT, 15(5):634-644, 2005.

    [103] N. Yanwei et al. Performance modelling and optimization of

    memory access on cellular computer architecture cyclops64. In

    NPC, 2005.

    [104] W. Zhang. Enhancing data cache reliability by the addition of

    a small fully-associative replication cache. In ICS, 2004.

[105] P. Zhou et al. A durable and energy efficient main memory using phase change memory technology. In ISCA 09, 2009.
