Software Controlled Memories for Scalable Many-Core Architectures
Luis Angel D. Bathen
Center for Embedded Computer Systems, University of California, Irvine
Nikil D. Dutt
Center for Embedded Computer Systems, University of California, Irvine
Abstract—Technology scaling along with the ever-evolving demand for media-rich software stacks has motivated the need for many-core platforms. With the increase in compute power and its inherent demand for high memory bandwidth comes the need for vast amounts of on-chip memory space. Thus, designers must carefully provision the memory real-estate to meet their applications' needs. It has been shown in the embedded systems domain that both software-controlled memories (e.g., scratchpad memories) and hardware-controlled memories (e.g., caches) have their pros and cons: some application domains such as multimedia fit very well in the software-controlled memory model, while other domains such as databases work well with caches. As a result, efficient memory management is extremely critical, as it has a great impact on the system's power consumption and throughput. Traditional memory hierarchies primarily consist of SRAM-based on-chip caches; however, with the emergence of non-volatile memories (NVMs) and mixed-criticality systems, on-chip memories will be heterogeneous, not only in type (cache vs. scratchpad) but also in technology (e.g., SRAM vs. NVM). This paper surveys the state of the art in memory subsystems for many-core platforms, and presents strategies for efficiently managing software-controlled memories in the many-core domain, while addressing the various challenges designers face in deploying such memory subsystems (e.g., sharing the memory resources, accounting for variations in the subsystem, etc.).
Keywords—many-core; multi-core; multiprocessors; memory management; scratchpads; virtualization; security; reliability
I. INTRODUCTION
The many-core revolution is driven by two tightly coupled
factors: user demand for media-rich applications (e.g., streaming content from the Cloud), which results in complex software stacks, and the sheer need for new ways to improve system performance to cope with those increasingly complex software stacks. Gone are the days when designers could simply scale down processes, add more transistors, and bump up the frequencies of their devices.
These limitations were observed in the uni-processor domain,
where technology barriers such as the Power Wall and the
ILP (instruction-level parallelism) Wall [2], [78] made it nearly
impossible to keep up with the computational demands of new
software systems, thus motivating the need for multi- and many-
core platforms [75], [2]. With the increase in compute resources,
designers must deploy energy-conscious and high-performance memory subsystems to cope with the bandwidth demands of the
software stack (and processors). Efficient memory management is
a challenge since not only does the memory real-estate greatly increase, but the memory hierarchy is also gradually becoming more and more heterogeneous due to variations in its physical characteristics
(e.g., distance from processors, read vs. write latencies and energy
differences, endurance, reliability, etc.). In this paper we present
the state-of-the-art in memory subsystems for many-core platforms,
motivate the need for memory-conscious software stacks that take
full advantage of the different characteristics in the memory sub-
system, and present the concept of virtual address spaces colored
by different characteristics (e.g., low-power, high-performance,
reliable, secure, etc.), allowing them to be exploited efficiently.
II. MULTI-CORE ERA
Multiprocessors have been extensively studied in the realms of
parallel and distributed computing [22] and embedded systems
[88]. However, general purpose commercial multi-core platforms
were not introduced until Intel released the Core Duo processor [30] in response to the thermal, power, and performance issues that
arise while operating at high frequencies (e.g., beyond 3.0GHz)
[2]. Higher operating frequencies led to higher power dissipation,
and higher power dissipation meant higher temperatures over time,
rendering standard cooling solutions inadequate. This issue is
exacerbated in mobile devices, which have much tighter energy
budgets than their desktop and server counterparts.
Commercial Multi-Processor System-on-Chips (MPSoCs) have
been around considerably longer than commercial multi-core platforms
[99]. Wolf et al. [99] classify MPSoCs as a distinct branch of mul-
tiprocessors given that they tend to be heterogeneous or specialized
rather than the more general purpose multi-cores introduced by
Olukotun et al. [75]. In this context, MPSoCs span a wide variety
of classes targeted for specific application domains, from wireless
base stations [1], packet processing [37], multimedia [36], [74], to
mobile platforms [89].
General purpose platforms have been expected to run generic
workloads (e.g., office productivity tools such as word processing,
browsers, multimedia, and gaming), while MPSoCs have been expected to run customized software stacks for each of the different application domains (e.g., cell phones, automobiles, etc.). As the em-
bedded software stack continues to evolve, the line between mobile
and desktop software stacks and platforms blurs significantly (e.g.,
iPad, Android Phones, etc.), despite the fact that both architecture
domains run on very different resource and energy constraints.
III. FROM MULTI-CORE TO MANY-CORE
Figure 1 illustrates the growth in complexity of both the general-purpose mobile software stack and the hardware. Figure 1 (a) shows a traditional software stack and computing platform consisting of a set of applications, a (proprietary) OS, a CPU with a two-level memory hierarchy, and off-chip memory, all connected through
a bus. Figure 1 (b) shows a multi-tasking Chip-Multiprocessor
(CMP) [36], [45] software stack running on a number of complex
CPUs, with shared memory, and connected through an on-chip bus.
Figure 1 (c) shows a shared-memory CMP running a much more
complex software stack (e.g., with support for multi-level security
[46]) capable of running a light-weight (or full) virtualization layer,
where one stack handles proprietary services such as voice and
ITRS predicts that over the next decade performance variabil-
ity will increase from 48% to 66% and total power consumption
variability will increase by up to 100% [41], [83], [96]. There are
many factors that influence the variation within and across devices
(e.g., temperature, voltage, wear-out, etc.). The memory hierarchy
is also affected by variability [31]. Moreover, variability plays a
major role not only in system performance and power consumption
but also in production costs, since high degrees of variability might cause a device to be discarded [16]. In order to cope with this
expected increase in variability, designers must build adaptable and
tunable software/hardware that can opportunistically exploit this
variability. An example of such an ambitious project is the NSF
Variability Expedition (http://variability.org).
D. Exploiting Emerging Non-Volatile Memories
Cache Size    Norm. Density   Latency (cycles)        Dyn. Energy (nJ)          Static Power (W)
SRAM (1MB)    1               8                       0.388                     1.36
MRAM (4MB)    4               Read: 20 / Write: 60    Read: 0.4 / Write: 2.3    0.15
PRAM (16MB)   16              Read: 40 / Write: 200   Read: 0.8 / Write: 1.5    0.3
Figure 3. Memory Technology Characteristics [101].
In order to reduce leakage power in SRAM-based memories,
designers have proposed emerging Non-Volatile Memories (NVMs)
as alternatives to SRAM for on-chip memories [92], [44], [69].
Typically, NVMs (e.g., PCRAM [56]) offer high densities, low
leakage power, comparable read latencies and dynamic read power
with respect to traditional embedded memories (SRAM/eDRAM).
These characteristics can be observed in Figure 3, where the
difference in both read and write power/latency between SRAMs,
PRAMs, and MRAMs can be observed. One major drawback
across NVMs is the expensive write operation (high latencies and
dynamic energy). This is illustrated in Figure 3, where MRAM memories provide 4x the density of SRAMs, similar read energy, and less static power, but substantially larger write overheads (roughly 6x the dynamic energy and 7.5x the latency of an SRAM access). To mitigate the drawbacks of the write operation in NVMs,
designers have made the case for deploying hybrid on-chip memory
hierarchies (e.g., SRAM, NVM) [92], which have shown up to 37%
reduction in leakage power [35], and increased IPC as a byproduct
of the higher density provided by NVMs [69].
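As a back-of-the-envelope illustration of this trade-off, the following sketch (our own, not from the surveyed work) estimates the dynamic energy of serving a block's access profile from SRAM versus MRAM using the Figure 3 numbers. The SRAM write energy is assumed equal to its read energy, since Figure 3 reports a single value.

#include <stdio.h>

/* Per-access dynamic energy (nJ) from Figure 3 [101]. SRAM writes are
 * assumed to cost the same as reads (Figure 3 reports one value). */
#define SRAM_RD 0.388
#define SRAM_WR 0.388
#define MRAM_RD 0.4
#define MRAM_WR 2.3

/* Total dynamic energy (nJ) for a block with the given access counts. */
static double dyn_energy(double rd_nj, double wr_nj, long reads, long writes)
{
    return rd_nj * (double)reads + wr_nj * (double)writes;
}

int main(void)
{
    long reads = 100000, writes = 20000;   /* a mildly write-heavy block */
    printf("SRAM: %.0f nJ\n", dyn_energy(SRAM_RD, SRAM_WR, reads, writes));
    printf("MRAM: %.0f nJ\n", dyn_energy(MRAM_RD, MRAM_WR, reads, writes));
    return 0;
}

Even with this mild profile, MRAM already costs nearly double the SRAM dynamic energy (86,000 nJ vs. 46,560 nJ), which is why hybrid hierarchies steer write-heavy blocks away from NVM.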
V. MEMORY HIERARCHIES IN MANY-CORE PLATFORMS
The processor-memory gap is a well-known concern; as a result,
state-of-the-art platforms are deployed with complex memory hier-
archies consisting of one to three levels of caches on the order of
megabytes. There are two basic models for the on-chip memory in
CMP systems: hardware-managed, implicitly-addressed, coherent
caches and software-managed, explicitly-addressed, local memo-
ries (also called streaming memories [60], scratchpad memories
[77], local stores [36], tightly coupled memories [5], or software
controlled caches [102]).
The cache-based model has built-in hardware policies where all
on-chip storage is used for private and shared caches that are kept
coherent. The on-chip memory space is not directly addressed
through software, thus providing the advantage of transparent
memory management, which exploits locality (best effort) and
automatically handles the communication between memories. Even
on occasions when accesses are too irregular to capture, the cache
may still be able to exploit some degree of inherent locality. On
the other hand, the software controlled memory model assumes
that part of the on-chip storage is organized as independently
addressable structures. Explicit accesses and DMA transfers are
needed to move data to and from off-chip memory or between
two on-chip structures. The address space can be localized to
each individual core or distributed (shared memory model); this is illustrated in Figure 4, where each processing element consists
of locally addressable SPMs (SP) and globally addressable SPMs
(GM).
Figure 4. IBM Cyclops-64 Diagram [103].
Software-controlled memories have the advantage of providing
software with full flexibility on locality and communication man-
agement in terms of addressing, granularity, and replacement policy
[60]. Since communication is explicit, it can also be proactive,
unlike the mostly reactive behavior of the cache-based model.
Hence, this model allows software to exploit producer-consumer
locality, avoid redundant data moves, and perform application-
specific caching and prefetching [38], [60]. The main drawback of this scheme is that the on-chip address space needs to be explicitly managed, so the programmer or compiler must explicitly tell the platform what data to map where and when; this may result in higher programming costs (e.g., steeper learning curves, reduced productivity, and the need for the programmer to be platform-aware).
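To make this explicit management concrete, consider the double-buffering idiom commonly used on scratchpad-based platforms, sketched below in C. The dma_copy/dma_wait primitives and the SPM placement of spm_buf are hypothetical stand-ins for platform-specific mechanisms; here they are modeled with synchronous copies so the sketch is self-contained.

#include <stddef.h>
#include <string.h>

#define TILE 1024

/* Stand-ins for platform DMA primitives (hypothetical API): on a real
 * scratchpad platform these would program a DMA engine asynchronously;
 * here they are modeled with a synchronous memcpy so the sketch runs. */
static void dma_copy(void *spm_dst, const void *dram_src, size_t bytes)
{
    memcpy(spm_dst, dram_src, bytes);
}
static void dma_wait(void) { /* all transfers complete in this model */ }

/* Two SPM-resident tile buffers; a real build would place these in
 * scratchpad space via a linker section attribute. */
static float spm_buf[2][TILE];

/* Double buffering: overlap computation on one buffer with the DMA
 * transfer filling the other. */
void process(float *out, const float *in, size_t n_tiles)
{
    int cur = 0;
    dma_copy(spm_buf[cur], in, TILE * sizeof(float));      /* prefetch tile 0 */
    for (size_t t = 0; t < n_tiles; t++) {
        dma_wait();                                        /* tile t on-chip  */
        if (t + 1 < n_tiles)                               /* prefetch t + 1  */
            dma_copy(spm_buf[1 - cur], in + (t + 1) * TILE,
                     TILE * sizeof(float));
        for (int i = 0; i < TILE; i++)                     /* compute on SPM  */
            out[t * TILE + i] = 2.0f * spm_buf[cur][i];
        cur = 1 - cur;                                     /* swap buffers    */
    }
}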
These issues motivate the need for transparent and efficient
management of the physical on-chip software-controlled memory
resources. Such software control includes efficient compile-time
optimizations (front-end), minimal changes to the traditional pro-
gramming model (e.g., use of standard C-like APIs), as well as a
light-weight run-time memory management system that allows for
efficient dynamic allocation of the memory space (back-end). These
issues must be efficiently addressed as we migrate to many-core
platforms, where the total number of on-chip software-controlled
memories might be in the thousands [34], [66], [94], [15].
Figure 5 illustrates some possible memory hierarchy configura-
tions for many-core platforms. Figure 5 (a) shows Intel's Single
Chip Cloud Computer [34], which is a NoC-interconnected many-
core platform consisting of a mesh of tiles, and four memory
controllers. Each tile in the NoC contains two simple IA-32 cores,
connected to two 256KB L2 caches and the router. This model
represents the pure cache-based memory model with two levels (L1
and L2). Each IA-32 core can boot Linux, and coherency is maintained via software protocols such as MPI and OpenMP. Figure 5 (b) shows Tilera's TILEPro64 [94], which consists of a similar 8x8 mesh NoC-based interconnect and four memory controllers, where
policies). The PHiLOSoftware Run-Time System (RTS) will take the application's allocation policies and enforce them at run-
time by dynamically allocating the physical memory resources
accordingly (e.g., highly utilized read-only data to on-chip Non-
Volatile Memory).
[Figure: virtual address spaces mapped across voltage-scaled and nominal-voltage SPMs, on-chip NVM, and low/high-power DRAM DIMMs behind a memory controller: (1) voltage-scaled low-power/mid-latency, (2) voltage-scaled fault-tolerant, (3) nominal-voltage high-power/low-latency, (4) nominal-voltage low-priority, (5) nominal-voltage higher power/latency, (6) NVM high write power/latency, (7) NVM higher write power/latency, (8) low-power DRAM, (9) high-power DRAM]
Figure 7. Variation-aware Virtual Address Space Partitioning.
For the sake of illustration, Figure 7 shows various address
spaces defined by the programmer, where each address space
will have different requirements (e.g., 1-9), ranging from (1) low-
voltage and mid-latency on-chip address space (e.g., aggressively
voltage scaled SRAM), (2) low-power fault-tolerant on-chip ad-
dress space (SRAM), (3) nominal-voltage high-performance on-
chip memory space (SRAM), (7) remotely allocated non-volatile on-chip space (NVM), to (8) low-power off-chip memory space (DRAM). Each address space is created on demand, and its allocation policies are generated by the programmer and/or compiler, while the run-time system (the PHiLOSoftware RTS) enforces each policy on a best-effort basis.
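A plausible programmer-facing interface for such colored address spaces might look like the following C sketch. All names (vas_create, vas_alloc, the property flags) are illustrative assumptions, loosely inspired by the APIs of [8], [11]; the region is modeled with plain heap memory, whereas a real RTS would map it to SRAM/NVM/DRAM best effort.

#include <stdlib.h>

/* Illustrative property flags a programmer could attach to a region. */
enum vas_props {
    VAS_LOW_POWER      = 1 << 0,
    VAS_LOW_LATENCY    = 1 << 1,
    VAS_FAULT_TOLERANT = 1 << 2,  /* e.g., E-RAID-protected SPM space */
    VAS_NONVOLATILE    = 1 << 3,  /* prefer on-chip NVM */
    VAS_SECURE         = 1 << 4
};

struct vas_region { unsigned props; size_t size; };

/* Hypothetical interface: create a colored region; the RTS would map it
 * to physical memory best effort. Modeled here with heap memory. */
static struct vas_region *vas_create(size_t bytes, unsigned props)
{
    struct vas_region *r = malloc(sizeof *r);
    if (r) { r->props = props; r->size = bytes; }
    return r;
}
static void *vas_alloc(struct vas_region *r, size_t bytes)
{
    (void)r;                     /* the RTS would consult r->props here */
    return malloc(bytes);
}

int main(void)
{
    /* Read-only, highly utilized look-up tables: low power, fault tolerant. */
    struct vas_region *tables =
        vas_create(2048, VAS_LOW_POWER | VAS_FAULT_TOLERANT);
    void *lut = vas_alloc(tables, 2048);
    free(lut); free(tables);
    return 0;
}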
PHiLOSoftware captures three key ideas: 1) Application intent
(requirements), 2) memory management for scalability, and 3)
adaptability to changes in the memory subsystem, thus spanning three mutually dependent layers (Figure 6):
Application and Compilation Layer: This layer statically defines the functionality of the application (in a programming language such as C/C++), the necessary resources to run the application (e.g., memory, CPUs), and how the application's data
should be managed (e.g., allocation policies). The application
is optimized to meet a given goal (e.g., performance) while
adhering to a number of constraints (e.g., performance, relia-
bility, etc.).
Runtime System (RTS) Layer: This layer takes the application's requirements and manages the system memory resources dynamically while trying to meet the application's
goal, adhering to one or more constraints (e.g., performance,
reliability, etc.), and adapting to changes in the underlying
platform (e.g., Figure 6).
Platform Layer: This layer defines the platform template for
which an application may be optimized. Though the number
of cores, type of communication fabric, and memory hierarchy
may be fixed, there may be hardware process variations (e.g.,
memory power and error rates), different memory technology
characteristics that affect the performance and power con-
sumption of the system, as well as failures due to wear-out.
A. Multi-tasking and NVM Support
In order to present the compiler/programmer with an abstracted
view of the memory hierarchy and minimize the complexity of
our run-time system, PHiLOSoftware proposes the use of virtual SPMs and virtual NVMs. PHiLOSoftware leverages the concept of vSPMs [10], which enables a program to view and manage a set of vSPMs as if they were physical SPMs. In order to virtualize SPMs, a small part of main memory (DRAM) called protected evict
memory (PEM) space is locked and used as extra storage. The run-
time system would then prioritize the data mapping to SPM and
PEM space based on a utilization metric. Similarly, to manage on-
chip hybrid memory space, designers can exploit the concept of
virtual NVMs (vNVMs) [11], which behave similarly to vSPMs,
meaning that the run-time environment transparently allows each application to manage its own set of vNVMs.
Management of virtual memories is done through a small set of
APIs [8], [11], which send management commands to the memory
manager. The memory manager then presents each application
with intermediate physical addresses (IPAs), which point to their
virtual SPMs/NVMs. Traditional SPM-based memory management
requires the data layout to use physical addresses by pointing to
the base register of the SPMs, as a result, the same is expected of
SPM/NVM-based memory hierarchies [35]. InPHiLOSoftware, all
policies use virtual SPM and NVM base addresses, so any run-time
re-mapping of data will remain transparent to the initial allocation
policies as the IPAs will not change.
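A minimal sketch of such an indirection layer is shown below; the data structures and names are our own, not those of [8], [11]. The key point is that remapping a block (e.g., demoting it to PEM space) changes only the table entry, never the IPA handed to the application.

#include <stdint.h>

enum backing { IN_SPM, IN_NVM, IN_PEM };  /* PEM: locked DRAM evict space */

/* One entry per fixed-size block of a virtual SPM/NVM. */
struct ipa_entry {
    enum backing where;   /* which physical pool currently backs the block */
    uintptr_t    phys;    /* base address within that pool */
};

#define BLOCKS 64
static struct ipa_entry ipa_table[BLOCKS];

/* Translate an IPA (block + offset) to the block's current physical
 * location; applications only ever hold the stable block/offset pair. */
uintptr_t ipa_translate(unsigned block, unsigned offset)
{
    return ipa_table[block].phys + offset;
}

/* The RTS may demote or promote a block without informing the app:
 * only the table entry changes, never the IPA the application holds. */
void ipa_remap(unsigned block, enum backing to, uintptr_t new_phys)
{
    /* (copy the block's contents from the old location to new_phys) */
    ipa_table[block].where = to;
    ipa_table[block].phys  = new_phys;
}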
Table I. FILTER INEQUALITIES AND PREFERRED MEMORY TYPE

Filter   Pref.    Inequalities
F1       sram     E(D_spm_i) > E(D_dram_i); E(D_nvm_i) < E(D_dram_i); V > T_vol
F2       nvm      E(D_nvm_i) > E(D_dram_i); V < T_vol
F3       either   E(D_spm_i) > E(D_dram_i); E(D_nvm_i) > E(D_dram_i)
F4       dram     E(D_spm_i) < E(D_dram_i); E(D_nvm_i) < E(D_dram_i)
Table II. CONFIGURATIONS

Config   Applications                    CPUs   vSPM/vNVM Space   SPM/NVM Space
C1       adpcm, aes                      1      32/128 KB         16/64 KB
C2       adpcm, aes, blowfish, gsm       1      64/256 KB         16/64 KB
C3       C2 & h263, jpeg, motion, sha    1      128/512 KB        16/64 KB
C4       same as C2                      2      64/256 KB         32/128 KB
C5       same as C3                      2      128/512 KB        32/128 KB
C6       same as C3                      4      128/512 KB        64/256 KB
PHiLOSoftware defines four block-based allocation policies:
1) Temporal allocation, which combines temporal SPM alloca-
tion ([93]) and hybrid memory allocation ([35]), and adheres to
the initial layout obtained through static analysis; however, the application's SPM and NVM contents must be swapped on a context-switch to avoid conflicts with other tasks. 2) FixedDynamic
allocation, which combines dynamic-spatial SPM allocation ([79])
and hybrid memory allocation [35], and maps the data block
to the preferred memory type (adhering to the initial layout) as
long as there is space, otherwise, data is mapped to DRAM. 3)
FilterDynamic allocation, which exploits the concept of filtering
and volatility to find the best placement. Each request is filtered
according to a set of inequalities (shown in Table I) which
determine the preferred memory type. Finally, 4) Oracle-based
allocation, which is a near-optimal policy because on every block-
allocation request, it feeds the entire memory map to the same
policy generator the compiler uses to generate policies statically.
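Read operationally, the Table I filters amount to a simple decision function. The C sketch below is our reconstruction, with the E() benefit estimates and the volatility threshold supplied by the caller:

enum mem_type { MEM_SRAM, MEM_NVM, MEM_EITHER, MEM_DRAM };

/* e_spm, e_nvm, e_dram: estimated benefit E(D_i) of placing block i in
 * each memory; v: the block's volatility; t_vol: volatility threshold. */
enum mem_type filter(double e_spm, double e_nvm, double e_dram,
                     double v, double t_vol)
{
    if (e_spm > e_dram && e_nvm > e_dram)        /* F3: either on-chip pool */
        return MEM_EITHER;
    if (e_spm > e_dram && e_nvm < e_dram && v > t_vol)
        return MEM_SRAM;                         /* F1: volatile, SPM wins  */
    if (e_nvm > e_dram && v < t_vol)
        return MEM_NVM;                          /* F2: stable, NVM wins    */
    return MEM_DRAM;                             /* F4: keep off-chip       */
}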
[Figure: bar charts of normalized execution time (top) and normalized energy (bottom) for the Temporal, Oracle, FixedDyn, and FilterDyn policies across configurations C1-C6 and their average, for MRAM and PRAM hierarchies (Goal=Performance, pageSize=4KB)]
Figure 8. Normalized Execution Time and Energy Comparison for
Performance-Optimized Policies.
Table II shows each configuration (C1-6), which has a set of applications running concurrently over a number of CPUs and a predefined hybrid physical memory space. The idea is to show that PHiLOSoftware's FilterDynamic policy (backward-slashed bars in Figure 8) achieves allocation solutions of competitive quality with the more complex Oracle policy. Figure 8 shows the normalized execu-
tion time and energy for each of the different configurations (C1-6,
Goal=Min Execution Time denoted as G=P) using 4KB blocks and
different memory types with respect to the Temporal policy. The
FixedDynamic policy (forward-slashed bars in Figure 8) suffers the
greatest impact on energy and execution time as memory space
increases (C4-6) since it adheres to the decisions made at compile-
time and does not efficiently allocate memory blocks at run-time.
In general, we see that the FilterDynamic policy performs almost as well as the Oracle policy (within 8.45% in execution time). Compared with the Temporal policy, PHiLOSoftware's FilterDynamic policy is able to reduce execution time and energy by an average of 75.42% and 62.88%, respectively, when the initial application policies have been optimized for execution time minimization [11].
B. Mixed Criticality Support
In order to efficiently exploit the available memory variability
(on- and off-chip [12]), programmers need to provide PHiLOSoft-
ware with data mapping hints in the form of policies. A pro-
grammer will consider the applications requirements and partition
its virtual address space into regions, which are then associated
with a mapping policy that dictates how to map the data into
physical address space and the type of guarantee needed (power,
performance, fault-tolerance). Iyer et al. [43] proposed a QoS-
enabled memory architecture to provide more cache space and memory bandwidth for high-priority applications based on guidance from the operating environment. PHiLOSoftware takes a similar approach; however, it opportunistically exploits the variations in
the memory subsystem to achieve better performance and minimize
power consumption.
Figure 9 shows a sample address space partitioning for JPEG
[57], where the programmer has identified: 1) Read-only and highly
utilized data, i.e., look-up tables (red blocks), 2) A temporary buffer for inter-task communication (gray block), 3) Read-only pixel data (black blocks), and 4) Irregularly accessed data (green blocks).

[Figure: per-region annotations for JPEG: Tables (2KB): low-power/E-RAID1; Pixel Data: low-power/no E-RAID; DCT, Q, ZZ Data: low-power/no E-RAID; Huffman Data: low-power/irregular access]
Figure 9. Sample User Annotations and Policies.

[Figure: (a) a traditional mapping of RDBMP, DCT, Q, ZIGZAG, and HUFFMAN data onto full physical SPMs (4KB) and virtual memory/DRAM; (b) the PHiLOSoftware mapping onto voltage-scaled SPMs (E-RAID 1, read-once/backup vs. often-updated regions), PEM, and high-power/low-power DRAM]
Figure 10. Partitioning the Application's Memory Space.

Figure 10 (a) shows a traditional mapping of these data blocks,
where variability is not taken into account. Figure 10 (b) shows
the result of PHiLOSoftware's virtualization layer mapping that
exploits: 1) Data mapping policies customized by the programmer
and used to make dynamic memory allocation decisions. 2) On-chip
memory voltage scaling (using E-RAIDs [9] to deal with process
variation), and 3) DRAM variability. For the sake of illustration,
PHiLOSoftware maps commonly used read-only data to voltage
scaled SRAM protected by an E-RAID 1 level, pixel data to
voltage-scaled SRAM (no E-RAID), and irregular commonly used data to low-power DRAM. A programmer with knowledge of the application's requirements can create custom data mapping policies with low-power (LP) memory space in mind. PHiLOSoftware then takes these policies and tries to opportunistically enforce them (best effort), regardless of how the LP memory space is implemented by
the hardware layer. For instance: 1) If there is no noticeable DRAM
power variability, then PHiLOSoftware will not prioritize DRAMs
and follow a more traditional memory management scheme (e.g.,
malloc()), or 2) If voltage scaling on-chip memories is not possible,
PHiLOSoftware will proceed to treat all on-chip memories the
same.
Recall Figure 6 and consider the case where there is power
and latency variation in both on-chip (due to voltage scaling the
SPMs) and off-chip memories (due to the inherent hardware-
variability in DRAM memories). The programmer then partitions each application's virtual address space as shown in Figure 9 and defines allocation policies for each virtual address space as shown in Figure 10 (coloring each virtual address space according to its requirements). These annotations are used by PHiLOSoftware's
run-time system, which opportunistically exploits the variations in
the memory subsystem. Each application will then have different
requirements (e.g., fault-tolerant memory space, secure memory
space, etc.), some being more critical than others (e.g., h263's higher memory footprint requirements vs. sha's need for full
memory isolation).
Figure 11 shows various configurations (x-axis):
{#Apps}x{#OSes}x{#CPUs}, with 4x8KB physical SPMs, and the set of applications run for each configuration (marked by a Y in the respective row/column):

App.       2x2x1   4x2x2   8x2x4
adpcm      -       Y       Y
aes        -       Y       Y
blowfish   -       -       Y
gsm        -       -       Y
h263       Y       Y       Y
jpeg       Y       Y       Y
motion     -       -       Y
sha        -       -       Y

Figure 11. Dynamic Policy-driven Variability-aware Allocation.

The base-line policy (P)
utilized the entire physical space with context-switching (CX)
enabled [93] (e.g., swap SPM data on CX). Policy M1 uses
vSPMs and allows PHiLOSoftware to dynamically map each
application's data. Because we are running various applications
concurrently, PHiLOSoftware needs to prioritize and judiciously
map different data to on-chip and off-chip memory. The data
sets (T1-4) in Figure 11 represent a high-level abstraction of the application's workload. In this experiment, despite mapping all T1-3 data to vSPM for a given application, it is not guaranteed that the data will go into physical SPM space, as the resources are limited (only 4x8KB). PHiLOSoftware therefore prioritizes among all
data blocks of each category (T1-4), and based on their priorities
(BlkPriority=block utilization), decides where to map the data.
For example, H.263 has much higher on-chip/off-chip memory requirements; as a result, higher priority is given to H.263's T1-4 blocks than to the other applications' (for both on-chip SRAM and low-power DRAM). User-defined policies (M1) managed to reduce dynamic
power consumption by 63% on average while reducing total
execution time by an average of 34% because: 1) there are up to {8Apps}x{4OSes}x{4CPUs} competing for memory resources, and traditional malloc (P) is unable to efficiently cope with the demand, and 2) PHiLOSoftware efficiently manages the memory space by exploiting the idea of variability-aware, dynamic, policy-driven memory allocation.
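The prioritization step can be pictured as a greedy pass over all applications' blocks: sort by utilization (the BlkPriority metric), grant the limited physical SPM space to the highest-utilization blocks, and spill the rest to (low-power) DRAM. A minimal C sketch of this idea (our illustration, not PHiLOSoftware's actual code):

#include <stdlib.h>

struct blk { int app; long utilization; int on_chip; };

/* Sort descending by utilization (the BlkPriority metric). */
static int by_util_desc(const void *a, const void *b)
{
    long ua = ((const struct blk *)a)->utilization;
    long ub = ((const struct blk *)b)->utilization;
    return (ub > ua) - (ub < ua);
}

/* Greedily grant the limited SPM space (e.g., 4x8KB => n_spm_blocks
 * block frames) to the highest-utilization blocks across all running
 * applications; everything else spills to (low-power) DRAM. */
void prioritize(struct blk *blks, int n, int n_spm_blocks)
{
    qsort(blks, (size_t)n, sizeof *blks, by_util_desc);
    for (int i = 0; i < n; i++)
        blks[i].on_chip = (i < n_spm_blocks);
}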
C. PHiLOSoftware Summary
This section presented the concept of PHiLOSoftware, which
proposes the idea of tagging or coloring virtual address spaces
with different characteristics, ranging from low-power virtual memory space, to secure virtual memory space, to fault-tolerant virtual memory space. Each application will define different virtual mappings (policies) according to its needs. PHiLOSoftware's run-
time system will then opportunistically enforce these memory al-
location policies. PHiLOSoftware, like [43], is a plausible solution
to support the efficient management of the memory subsystem
in scalable many-core platforms, thereby allowing for mixed-
criticality systems to execute in an energy efficient and high-
performance manner.
VIII. CONCLUSION
By deploying simpler processing elements and exploiting task-
level and application-level parallelism, the many-core era promises
high performance with reduced power consumption. In order to
support the memory bandwidth requirements of such massively parallel systems, the memory subsystem must be carefully designed and managed. However, as we move forward, the memory hierarchy will be heterogeneous in nature, and thus traditional
memory management schemes will not cope with the demands of
the processing elements. In this paper we surveyed the state-of-
the-art in distributed memory models and motivated the need for
efficient management of software-controlled many-core memories.
We showed how virtualization may help address some of the
emerging issues in the memory subsystem, from the adoption of emerging non-volatile memories to the opportunistic exploitation
of the inherent variation in the memory subsystem.
ACKNOWLEDGMENT
This work was partially supported by NSF Variability Expedition
Grant Number CCF-1029783.
REFERENCES
[1] B. Ackland et al. A single-chip, 1.6-billion, 16-b mac/s multi-
processor dsp. Solid-State Circuits, IEEE Journal of, 35, 2000.
[2] V. Agarwal et al. Clock rate versus ipc: the end of the road for
conventional microarchitectures. SIGARCH Comp. Arch. News,
28, 2000.
[3] F. Angiolini et al. Reliability support for on-chip memories using
networks-on-chip. In ICCD, Oct. 2006.
[4] A. Ansari et al. Zerehcache: armoring cache architectures in
high defect density technologies. In MICRO 42, 2009.
[5] ARM. Arm cortex-m3 processor. In http://www.arm.com/, 2012.
[6] N. Azizi et al. Low-leakage asymmetric-cell sram. TVLSI, Vol.
11, aug. 2003.
[7] A. BanaiyanMofrad et al. Fft-cache: a flexible fault-tolerant
cache architecture for ultra low voltage operation. CASES, 2011.
[8] L. Bathen. PHiLOSoftware: A low power, high performance, reliable, and secure virtualization layer for on-chip software-controlled memories. PhD thesis, University of California, Irvine, 2012.
[9] L. Bathen et al. E-RoC: Embedded raids-on-chip for low power
distributed dynamically managed reliable memories. In DATE,
2011.
[10] L. Bathen et al. Spmvisor: Dynamic scratchpad memory virtual-
ization for secure, low power and high performance, distributed
on-chip memories. In CODES+ISSS, 2011.
[11] L. Bathen et al. HaVOC: A hybrid-memory-aware virtualization
layer for on-chip distributed scratchpad and non-volatile mem-
ories. In DAC, 2012.
[12] L. Bathen et al. VaMV: Variability-aware memory virtualization.
In DATE, 2012.
[13] L. Benini and G. D. Micheli. Networks on chips: A new soc
paradigm. IEEE Computer, 35(1), 2002.
[14] K. Bernstein et al. High-performance cmos variability in the
65-nm regime and beyond. IBM J. Res. Dev., 50, 2006.
[15] S. Borkar. Thousand core chips: a technology perspective. In
DAC, 2007.
[16] S. Borkar et al. Parameter variations and impact on circuits and
microarchitecture. In DAC 03, 2003.
[17] B. Calhoun et al. A 256-kb 65-nm sub-threshold sram design
forultra-low-voltage operation. InIEEE J. of Solid-State Circuits
(JSSC), 2007.
[18] A. Chakraborty et al. E
[19] X. Chen, Z. Lu, A. Jantsch, and S. Chen. Run-time partitioning
of hybrid distributed shared memory on multi-core network-on-
chips. In PAAP, 2010.
[20] X. Chen et al. Supporting distributed shared memory on multi-
core network-on-chips using a dual microcoded controller. In
DATE, 2010.
[21] S.-H. Chou et al. No cache-coherence: A single-cycle ring
interconnection for multi-core l1-nuca sharing on 3d chips. In DAC, 2009.
[22] D. E. Culler, A. Gupta, and J. P. Singh. Parallel Computer Ar-
chitecture: A Hardware/Software Approach. Morgan Kaufmann
Publishers Inc., San Francisco, CA, USA, 1st edition, 1997.
[23] A. Das et al. Pad: Power-aware directory placement in distributed
caches. TR Northwestern University (NWU-EECS-10-11), 2010.
[24] A. Devgan et al. Power variability and its impact on design. In
VLSID, 2005.
[25] B. Egger et al. Dynamic scratchpad memory management for
code in portable systems with an mmu. ACM TECS, 7, January
2008.
[26] B. Egger et al. Scratchpad memory management techniques
for code in embedded systems without an mmu. Comp., IEEE Trans. on, 2010.
[27] A. Ferreira et al. Using pcm in next-generation embedded space applications. In RTAS 10, April 2010.
[28] L. Gauthier et al. Minimizing inter-task interferences in scratch-
pad memory usage for reducing the energy consumption of
multi-task systems. In CASES 10, 2010.
[29] S. Ghosh et al. Reducing power consumption in memory ecc
checkers. In ITC, 2004.
[30] S. Gochman, A. Mendelson, A. Naveh, and E. Rotem. Intro-
duction to intel core duo processor architecture. Intel Technol.
J., vol. 10, no. 2, pp. 89-97, 2006.
[31] M. Gottscho et al. Analyzing power variability of ddr3 dual
inline memory modules for applications. TR UCLA-EE, 2011.
[32] D. S. Gracia et al. Lp-nuca: Networks-in-cache for high-
performance low-power embedded processors. TVLSI, PP(99):1, 2011.
[33] T. Granlund, B. Granbom, and N. Olsson. Soft error rate increase
for new generations of srams. Nuclear Science, IEEE Trans. on,
50(6):2065-2068, Dec. 2003.
[34] J. Howard et al. A 48-core ia-32 message-passing processor
with dvfs in 45nm cmos. In ISSCC, feb. 2010.
[35] J. Hu et al. Towards energy efficient hybrid on-chip scratch pad
memory with non-volatile memory. In DATE 11, 2011.
[36] IBM. The cell project. http://www.research.ibm.com/cell/, 2005.
[37] Intel Corp. Intel ixp2855 network processor. Available:
http://www.intel.com, 2005.
[38] I. Issenin et al. Multiprocessor system-on-chip data reuse
analysis for exploring customized memory hierarchies. In DAC
06, 2006.
[39] ITRS. System drivers. http://www.itrs.net/, 2003.
[40] ITRS. System drivers. http://www.itrs.net/, 2005.
[41] ITRS. Process integration, device and structures.
http://www.itrs.net/, 2007.
[42] ITRS. 2008 update overview. http://www.itrs.net/, 2008.
[43] R. Iyer et al. Qos policies and architecture for cache/memory in
cmp platforms. SIGMETRICS Perform. Eval. Rev., 35(1), 2007.
[44] Y. Joo et al. Energy- and endurance-aware design of phase
change memory caches. In DATE 10, 2010.
[45] S. Kaneko et al. A 600mhz single-chip multiprocessor with
4.8gb/s internal shared pipelined bus and 512kb internal mem-
ory. In ISSCC 03, 2003.
[46] P. A. Karger. Multi-level security requirements for hypervisors.
In ACSAC, 2005.
[47] J. Kelm et al. Waypoint: scaling coherence to thousand-core architectures. In PACT, pages 99-110, 2010.
[48] O. Khan et al. Dcc: A dependable cache coherence multicore
architecture. IEEE CAL, 10, 2011.
[49] S. Kim. Area-efficient error protection for caches. In DATE,
2006.
[50] C. Kim et al. An adaptive, non-uniform cache structure for
wire-delay dominated on-chip caches. SIGOPS Oper. Syst. Rev.,
36:211-222, 2002.
[51] J. Kim et al. Multi-bit error tolerant caches using two-
dimensional error coding. In MICRO, 2007.
[52] N. Kim et al. Leakage current: Moore's law meets static power.
Computer, 36(12), dec. 2003.
[53] J. Kulkarni et al. A 160 mv, fully differential, robust schmitt
trigger based sub-threshold sram. In ISLPED 07, 2007.
[54] F. Kurdahi et al. Low-power multimedia system design by
aggressive voltage scaling. TVLSI, 18(5), 2010.
[55] G. Kurian et al. Atac: a 1000-core cache-coherent processor
with on-chip optical network. In PACT, 2010.
[56] B. Lee et al. Architecting phase change memory as a scalable
dram alternative. SIGARCH Comput. Archit. News, 37, 2009.
[57] C. Lee et al. Mediabench: a tool for evaluating and synthesizing
multimedia and communications systems. In MICRO 97, 1997.
[58] H. Lee et al. Cloudcache: Expanding and shrinking private
caches. In HPCA, 2011.
[59] K. Lee et al. Mitigating soft error failures for multimedia
applications by selective data protection. In CASES 06, 2006.
[60] J. Leverich et al. Comparing memory systems for chip multi-
processors. SIGARCH Comput. Archit. News, 35(2), June 2007.
[61] F. Li et al. Improving scratch-pad memory reliability through
compiler-guided data block duplication. In ICCAD, 2005.
[62] J. Lira et al. Lru-pea: A smart replacement policy for non-
uniform cache architectures on chip multiprocessors. In ICCD,
2009.
[63] T. Liu et al. Power-aware variable partitioning for dsps with
hybrid pram and dram main memory. In DAC 11, 2011.
[64] A. Marongiu et al. An openmp compiler for efficient use of
distributed scratchpad memory in mpsocs. Computers, IEEE
Transactions on, PP(99):1, 2010.
[65] R. Mastipuram and E. C. Wee. Soft errors impact on system re-
liability. In http://www.edn.com/ article/ CA454636, September
2004.
[66] T. Mattson et al. The 48-core scc processor: the programmer's view. In SC, 2010.
[67] D. Melpignano et al. Platform 2012, a many-core computing
accelerator for embedded socs: performance evaluation of visual
analytics applications. In DAC, 2012.
[68] L. Mieszko et al. Shared memory via execution migration.
ASPLOS, 2011.
[69] A. Mishra et al. Architecting on-chip interconnects for stacked
3d stt-ram caches in cmps. In ISCA 11, 2011.
[70] J. Mogul et al. Operating system support for nvm+dram hybrid
main memory. In HotOS 09, 2009.
[71] M. Monchiero et al. Exploration of distributed shared memory
architectures for noc-based multiprocessors. Journal of Systems
Architecture, 53(10), 2007.
[72] F. Moradi et al. 65nm sub-threshold 11t-sram for ultra low
voltage applications. In SOCC 08, pages 113-118, Sept. 2008.
[73] S. Nassif. Modeling and analysis of manufacturing variations. In IEEE Conf. on Custom Integrated Circuits, 2001.
[74] NXP. Nxp nexperia mobile multimedia processor pnx4101.
Available: http://www.nxp.com/, 2007.
[75] K. Olukotun et al. The case for a single-chip multiprocessor.
SIGPLAN Not., 31, September 1996.
[76] K. Osada et al. 16.7 fa/cell tunnel-leakage-suppressed 16 mb
sram for handling cosmic-ray-induced multi-errors. In ISSCC
03, 2003.
[77] P. Panda et al. Efficient utilization of scratch-pad memory in
embedded processor applications. In EDTC 97, 1997.
[78] D. A. Patterson and J. L. Hennessy. Computer Organization and Design, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann Publishers Inc., 4th edition, 2008.
[79] F. Poletti et al. An integrated hardware/software approach for
run-time scratchpad management. In DAC 04, 2004.
[80] R. Pyka et al. Os integrated energy aware scratchpad allocation
strategies for multiprocess applications. In SCOPES, 2007.
[81] A. Ros et al. A direct coherence protocol for many-core chip
multiprocessors. TPDS, 21(12), Dec. 2010.
[82] F. Ruckerbauer et al. Soft error rates in 65nm srams: analysis of new phenomena. In IOLTS, 2007.
[83] J. Sartori et al. Variation-aware speed binning of multi-core
processors. In ISQED 10, 2010.
[84] A. Sasan et al. Process variation aware sram/cache for aggressive
voltage-frequency scaling. In DATE 09, 2009.
[85] M. Shalan et al. A dynamic memory management unit for embedded real-time system-on-a-chip. In CASES 00, 2000.
[86] P. Shirvani et al. Padded cache: A new fault-tolerance technique
for cache memories. In VTS, 1999.
[87] P. Shivakumar et al. Modeling the effect of technology trends
on the soft error rate of combinational logic. In DSN, 2002.
[88] S. K. Shukla et al. A brief history of multiprocessors and eda.
IEEE Design & Test of Computers, 28(3):96, 2011.
[89] STMicroeletronics. Nomadik - open multimedia platform for
next generation mobile devices. Technical Article TA305, Avail-
able: http:// www.st.com, 2003.
[90] V. Suhendra et al. Integrated scratchpad memory optimization
and task scheduling for mpsoc architectures. In CASES, 2006.
[91] V. Suhendra et al. Scratchpad allocation for concurrent embed-
ded software. In CODES+ISSS, 2008.
[92] G. Sun et al. A novel architecture of the 3D stacked MRAM L2 cache for CMPs. In HPCA 09, Feb. 2009.
[93] H. Takase et al. Partitioning and allocation of scratch-pad
memory for priority-based preemptive multi-task systems. In
DATE 10, 2010.
[94] Tilera. TILEPro64. http://www.tilera.com/, 2010.
[95] Y. Wang et al. Temperature-constrained power control for chip
multiprocessors with online model estimation. In ISCA, 2009.
[96] L. Wanner et al. A case for opportunistic embedded sensing in
presence of hardware power variability. In HotPower10, 2010.
[97] C. Wilkerson et al. Trading off cache capacity for reliability to
enable low voltage operation. In ISCA 08, 2008.
[98] C. Wilkerson et al. Reducing cache power with low-cost, multi-
bit error-correcting codes. SIGARCH Comput. Archit. News,
38(3):83-93, June 2010.
[99] W. Wolf et al. Multiprocessor system-on-chip (mpsoc) technology. TCAD, 27(10), 2008.
[100] B. Wongchaowart et al. A content-aware block placement
algorithm for reducing pram storage bit writes. In MSST 10,
2010.
[101] X. Wu et al. Hybrid cache architecture with disparate memory
technologies. In ISCA 09, 2009.
[102] C.-L. Yang et al. Software-controlled cache architecture for
energy efficiency. TCSVT, 15(5):634-644, 2005.
[103] N. Yanwei et al. Performance modelling and optimization of
memory access on cellular computer architecture cyclops64. In
NPC, 2005.
[104] W. Zhang. Enhancing data cache reliability by the addition of
a small fully-associative replication cache. In ICS, 2004.
[105] P. Zhou et al. A durable and energy efficient main memory using phase change memory technology. In ISCA 09, 2009.