    Software Controlled Memories for Scalable Many-Core Architectures

    Luis Angel D. Bathen

Center for Embedded Computer Systems, University of California, Irvine

    [email protected]

    Nikil D. Dutt

Center for Embedded Computer Systems, University of California, Irvine

    [email protected]

Abstract: Technology scaling, along with the ever-evolving demand for media-rich software stacks, has motivated the need for many-core platforms. With the increase in compute power and its inherent demand for high memory bandwidth comes the need for vast amounts of on-chip memory space. Thus, designers must carefully provision the memory real-estate to meet their applications' needs. It has been shown in the embedded systems domain that both software-controlled memories (e.g., scratchpad memories) and hardware-controlled memories (e.g., caches) have their pros and cons: some application domains such as multimedia fit very well in the software-controlled memory model, while other domains such as databases work well with caches. As a result, efficient memory management is extremely critical, as it has a great impact on the system's power consumption and throughput. Traditional memory hierarchies primarily consist of SRAM-based on-chip caches; however, with the emergence of non-volatile memories (NVMs) and mixed-criticality systems, on-chip memories will be heterogeneous, not only in type (cache vs. scratchpad) but also in technology (e.g., SRAM vs. NVM). This paper surveys the state of the art in memory subsystems for many-core platforms and presents strategies for efficiently managing software-controlled memories in the many-core domain, while addressing the various challenges designers face in deploying such memory subsystems (e.g., sharing the memory resources, accounting for variations in the subsystem, etc.).

Keywords: many-core; multi-core; multiprocessors; memory management; scratchpads; virtualization; security; reliability

    I. INTRODUCTION

The many-core revolution is driven by two tightly coupled factors: user demand for media-rich applications (e.g., streaming content from the Cloud), resulting in complex software stacks, and the need for new ways to improve system performance to cope with those increasingly complex software stacks. Gone are the days when designers would simply scale down processes, add more transistors, and bump up the frequencies of the devices. These limitations were observed in the uni-processor domain, where technology barriers such as the Power Wall and the ILP (instruction-level parallelism) Wall [2], [78] made it nearly impossible to keep up with the computational demands of new software systems, thus motivating the need for multi- and many-core platforms [75], [2]. With the increase in compute resources,

designers must deploy energy-conscious and high-performance memory subsystems to cope with the bandwidth demands of the

    software stack (and processors). Efficient memory management is

    a challenge since not only does the real-estate greatly increase,

    but the memory hierarchy is gradually becoming more and more

heterogeneous due to variations in physical characteristics

    (e.g., distance from processors, read vs. write latencies and energy

    differences, endurance, reliability, etc.). In this paper we present

    the state-of-the-art in memory subsystems for many-core platforms,

    motivate the need for memory-conscious software stacks that take

    full advantage of the different characteristics in the memory sub-

    system, and present the concept of virtual address spaces colored

    by different characteristics (e.g., low-power, high-performance,

    reliable, secure, etc.), allowing them to be exploited efficiently.

II. MULTI-CORE ERA

    Multiprocessors have been extensively studied in the realms of

    parallel and distributed computing [22] and embedded systems

    [88]. However, general purpose commercial multi-core platforms

were not introduced until Intel released the Core Duo processor [30] in response to the thermal, power, and performance issues that

    arise while operating at high frequencies (e.g., beyond 3.0GHz)

    [2]. Higher operating frequencies led to higher power dissipation,

    and higher power dissipation meant high temperatures over time,

    rendering standard cooling solutions inadequate. This issue is

    exacerbated in mobile devices, which have much tighter energy

    budgets than their desktop and server counterparts.

    Commercial Multi-Processor System-on-Chips (MPSoCs) have

    been around a lot longer than commercial multi-core platforms

    [99]. Wolf et al. [99] classify MPSoCs as a distinct branch of mul-

    tiprocessors given that they tend to be heterogeneous or specialized

    rather than the more general purpose multi-cores introduced by

    Olukotun et al. [75]. In this context, MPSoCs span a wide variety

    of classes targeted for specific application domains, from wireless

    base stations [1], packet processing [37], multimedia [36], [74], to

    mobile platforms [89].

    General purpose platforms have been expected to run generic

    workloads (e.g., office productivity tools such as word processing,

    browsers, multimedia, and gaming), while MPSoCs have been

expected to run customized software stacks for each of the different application domains (e.g., cell phones, automobiles, etc.). As the em-

    bedded software stack continues to evolve, the line between mobile

    and desktop software stacks and platforms blurs significantly (e.g.,

    iPad, Android Phones, etc.), despite the fact that both architecture

    domains run on very different resource and energy constraints.

III. FROM MULTI-CORE TO MANY-CORE

Figure 1 illustrates the growth in complexity of both the general-purpose mobile software stack and the hardware. Figure 1 (a) shows a traditional software stack and computing platform consisting

    of a set of applications, an OS (proprietary), a CPU with two-

    level memory hierarchy, off-chip memory, and connected through

    a bus. Figure 1 (b) shows a multi-tasking Chip-Multiprocessor

    (CMP) [36], [45] software stack running on a number of complex

    CPUs, with shared memory, and connected through an on-chip bus.

    Figure 1 (c) shows a shared-memory CMP running a much more

    complex software stack (e.g., with support for multi-level security

    [46]) capable of running a light-weight (or full) virtualization layer,

    where one stack handles proprietary services such as voice and



    nity. ITRS predicts that over the next decade performance variabil-

    ity will increase from 48% to 66% and total power consumption

    variability will increase by up to 100% [41], [83], [96]. There are

    many factors that influence the variation within and across devices

    (e.g., temperature, voltage, wear-out, etc.). The memory hierarchy

    is also affected by variability [31]. Moreover, variability plays a

    major role not only in system performance and power consumption

but also in production costs, since high degrees of variability might cause a device to be discarded [16]. In order to cope with this

    expected increase in variability, designers must build adaptable and

    tunable software/hardware that can opportunistically exploit this

    variability. An example of such an ambitious project is the NSF

    Variability Expedition (http://variability.org).

    D. Exploiting Emerging Non-Volatile Memories

Cache (Size)   Norm. Density   Latency (cycles)        Dyn. Energy (nJ)          Static Power (W)
SRAM (1MB)     1               8                       0.388                     1.36
MRAM (4MB)     4               Read: 20 / Write: 60    Read: 0.4 / Write: 2.3    0.15
PRAM (16MB)    16              Read: 40 / Write: 200   Read: 0.8 / Write: 1.5    0.3

Figure 3. Memory Technology Characteristics [101].

    In order to reduce leakage power in SRAM-based memories,

    designers have proposed emerging Non-Volatile Memories (NVMs)

    as alternatives to SRAM for on-chip memories [92], [44], [69].

    Typically, NVMs (e.g., PCRAM [56]) offer high densities, low

    leakage power, comparable read latencies and dynamic read power

    with respect to traditional embedded memories (SRAM/eDRAM).

These characteristics can be observed in Figure 3, which shows the differences in read and write energy/latency between SRAMs, PRAMs, and MRAMs. One major drawback

    across NVMs is the expensive write operation (high latencies and

    dynamic energy). This is illustrated in Figure 3, where MRAM

    memories provide 4x the density of SRAMs, similar read energy,

less static power, but substantially larger write overheads (several times higher latency and dynamic energy per write). To mitigate the drawbacks of the write operation in NVMs,

    designers have made the case for deploying hybrid on-chip memory

    hierarchies (e.g., SRAM, NVM) [92], which have shown up to 37%

    reduction in leakage power [35], and increased IPC as a byproduct

    of the higher density provided by NVMs [69].
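As a rough illustration of this trade-off (our own back-of-envelope reading of Figure 3, with r and w denoting per-block read and write counts): a block held in SRAM consumes about 0.388(r + w) nJ of dynamic energy, versus roughly 0.4r + 2.3w nJ in MRAM, so MRAM never wins on dynamic energy alone; the payoff comes from its roughly 9x lower static power (0.15 W vs. 1.36 W) and 4x density, which is why read-mostly, highly utilized data is the natural candidate for on-chip NVM placement in such hybrid hierarchies.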

V. MEMORY HIERARCHIES IN MANY-CORE PLATFORMS

    The processor-memory gap is a well-known concern; as a result,

    state-of-the-art platforms are deployed with complex memory hier-

archies consisting of one to three levels of caches on the order of

    megabytes. There are two basic models for the on-chip memory in

    CMP systems: hardware-managed, implicitly-addressed, coherent

    caches and software-managed, explicitly-addressed, local memo-

    ries (also called streaming memories [60], scratchpad memories

    [77], local stores [36], tightly coupled memories [5], or software

    controlled caches [102]).

    The cache-based model has built-in hardware policies where all

    on-chip storage is used for private and shared caches that are kept

    coherent. The on-chip memory space is not directly addressed

    through software, thus providing the advantage of transparent

    memory management, which exploits locality (best effort) and

    automatically handles the communication between memories. Even

    on occasions when accesses are too irregular to capture, the cache

    may still be able to exploit some degree of inherent locality. On

    the other hand, the software controlled memory model assumes

    that part of the on-chip storage is organized as independently

    addressable structures. Explicit accesses and DMA transfers are

    needed to move data to and from off-chip memory or between

    two on-chip structures. The address space can be localized to

each individual core or distributed (shared memory model); this is illustrated in Figure 4, where each processing element consists

    of locally addressable SPMs (SP) and globally addressable SPMs

    (GM).

    Figure 4. IBM Cyclops-64 Diagram [103].

Software-controlled memories have the advantage of providing

    software with full flexibility on locality and communication man-

    agement in terms of addressing, granularity, and replacement policy

    [60]. Since communication is explicit, it can also be proactive,

    unlike the mostly reactive behavior of the cache-based model.

    Hence, this model allows software to exploit producer-consumer

    locality, avoid redundant data moves, and perform application-

specific caching and prefetching [38], [60]. The main drawback of this scheme is that the on-chip address space needs to be explicitly managed, so the programmer or compiler must explicitly tell the platform what data to map where and when; this may result in higher programming costs (e.g., a steeper learning curve, reduced productivity, and the need for platform-aware programmers).
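To make this cost concrete, the following C fragment sketches the kind of explicit staging the software-controlled model requires. It is an illustration only: spm_base(), dma_get(), dma_put(), and dma_wait() are hypothetical stand-ins for a platform's SPM/DMA primitives, not an API from any of the surveyed systems.

    /* Process a large off-chip array in SPM-sized tiles. All SPM/DMA
       calls below are hypothetical placeholders. */
    #include <stddef.h>
    #include <stdint.h>

    #define TILE 1024                        /* elements staged on-chip at a time */

    extern int32_t *spm_base(void);          /* base address of this core's SPM   */
    extern void dma_get(void *dst, const void *src, size_t bytes);  /* off -> on  */
    extern void dma_put(void *dst, const void *src, size_t bytes);  /* on -> off  */
    extern void dma_wait(void);              /* wait for outstanding DMA transfers */

    void scale_in_place(int32_t *data, size_t n, int32_t k)
    {
        int32_t *buf = spm_base();           /* the SPM is just an address range  */
        for (size_t off = 0; off < n; off += TILE) {
            size_t len = (n - off < TILE) ? (n - off) : TILE;
            dma_get(buf, data + off, len * sizeof *buf);   /* stage tile into SPM */
            dma_wait();
            for (size_t i = 0; i < len; i++)               /* compute out of SPM  */
                buf[i] *= k;
            dma_put(data + off, buf, len * sizeof *buf);   /* write tile back     */
            dma_wait();
        }
    }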

    These issues motivate the need for transparent and efficient

    management of the physical on-chip software-controlled memory

    resources. The software control includes efficient compile-time

    optimizations (front-end), minimal changes to the traditional pro-

    gramming model (e.g., use of standard C-like APIs), as well as a

    light-weight run-time memory management system that allows for

    efficient dynamic allocation of the memory space (back-end). These

    issues must be efficiently addressed as we migrate to many-core

    platforms, where the total number of on-chip software-controlled

    memories might be in the thousands [34], [66], [94], [15].

    Figure 5 illustrates some possible memory hierarchy configura-

tions for many-core platforms. Figure 5 (a) shows Intel's Single-

    Chip Cloud Computer [34], which is a NoC-interconnected many-

    core platform consisting of a mesh of tiles, and four memory

    controllers. Each tile in the NoC contains two simple IA-32 cores,

    connected to two 256KB L2 caches and the router. This model

    represents the pure cache-based memory model with two levels (L1

and L2). Each IA-32 core can boot Linux, and coherency is maintained in software (e.g., via programming models such as MPI and OpenMP). Figure 5 (b) shows Tilera's TILEPro64 [94], which consists of a similar 8x8

    Mesh NoC-based interconnect, four memory controllers, where

    3

  • 7/26/2019 Software Controlled

    4/10

  • 7/26/2019 Software Controlled

    5/10

  • 7/26/2019 Software Controlled

    6/10

    policies). The PHiLOSoftware Run-Time System (RTS) will take

the application's allocation policies and enforce them at run-

    time by dynamically allocating the physical memory resources

    accordingly (e.g., highly utilized read-only data to on-chip Non-

    Volatile Memory).

Figure 7. Variation-aware Virtual Address Space Partitioning. (The diagram shows CPUs with voltage-scaled and nominal-voltage SPMs, on-chip NVMs, and a memory controller driving low-power and high-power DRAM DIMMs; on-chip space may be locally or remotely allocated. The nine memory classes are: (1) voltage-scaled, low-power / mid-latency; (2) voltage-scaled, fault-tolerant; (3) nominal-voltage, high-power / low-latency; (4) nominal-voltage, low-priority; (5) nominal-voltage, higher power / latency; (6) NVM, high write power / latency; (7) NVM, higher write power / latency; (8) low-power DRAM; (9) high-power DRAM.)

For the sake of illustration, Figure 7 shows various address spaces defined by the programmer, where each address space has different requirements (classes 1-9), ranging from (1) a low-voltage, mid-latency on-chip address space (e.g., aggressively voltage-scaled SRAM), (2) a low-power, fault-tolerant on-chip address space (SRAM), (3) a nominal-voltage, high-performance on-chip memory space (SRAM), and (7) remotely allocated non-volatile on-chip space (NVM), to (8) low-power off-chip memory space (DRAM). Each address space is created on demand, and its allocation policies are generated by the programmer and/or compiler, while the run-time system (PHiLOSoftware RTS) enforces each policy on a best-effort basis.
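As a sketch of what such programmer-defined policies could look like at the source level (illustrative only; region_alloc() and the mem_attr flags are hypothetical and are not the PHiLOSoftware interface):

    /* Each region of the virtual address space is "colored" with the
       guarantees it needs; a run-time system would map it onto whichever
       physical memory class (Figure 7, classes 1-9) best satisfies the
       request, on a best-effort basis. All names are hypothetical. */
    #include <stddef.h>

    enum mem_attr {
        MEM_LOW_POWER      = 1 << 0,
        MEM_FAULT_TOLERANT = 1 << 1,
        MEM_HIGH_PERF      = 1 << 2,
        MEM_SECURE         = 1 << 3
    };

    void *region_alloc(size_t bytes, unsigned attrs);    /* hypothetical API */

    void setup_regions(void)
    {
        /* read-only, highly utilized lookup tables: low power, protected */
        void *luts   = region_alloc(2048, MEM_LOW_POWER | MEM_FAULT_TOLERANT);
        /* pixel scratch buffer: low power is enough, occasional errors tolerable */
        void *pixels = region_alloc(64 * 1024, MEM_LOW_POWER);
        /* isolation-critical data (e.g., for sha): secure, high-performance space */
        void *digest = region_alloc(256, MEM_SECURE | MEM_HIGH_PERF);
        (void)luts; (void)pixels; (void)digest;
    }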

    PHiLOSoftware captures three key ideas: 1) Application intent

    (requirements), 2) memory management for scalability, and 3)

    adaptability to changes in the memory subsystem, thus spanning

    across three mutually dependent layers (Figure 6):

Application and Compilation Layer: This layer statically defines the functionality of the application (in a programming language such as C/C++), the resources needed to run the application (e.g., memory, CPUs), and how the application's data should be managed (e.g., allocation policies). The application is optimized to meet a given goal (e.g., performance) while adhering to a number of constraints (e.g., performance, reliability, etc.).

Runtime System (RTS) Layer: This layer takes the application's requirements and manages the system memory resources dynamically while trying to meet the application's

    goal, adhering to one or more constraints (e.g., performance,

    reliability, etc.), and adapting to changes in the underlying

    platform (e.g., Figure 6).

    Platform Layer: This layer defines the platform template for

    which an application may be optimized. Though the number

    of cores, type of communication fabric, and memory hierarchy

    may be fixed, there may be hardware process variations (e.g.,

    memory power and error rates), different memory technology

    characteristics that affect the performance and power con-

    sumption of the system, as well as failures due to wear-out.

    A. Multi-tasking and NVM Support

    In order to present the compiler/programmer with an abstracted

    view of the memory hierarchy and minimize the complexity of

our run-time system, PHiLOSoftware proposes the use of virtual SPMs and virtual NVMs. PHiLOSoftware leverages the concept of vSPMs [10], which enables a program to view and manage a set of vSPMs as if they were physical SPMs. In order to virtualize

    SPMs, a small part of main memory (DRAM) called protected evict

    memory (PEM) space is locked and used as extra storage. The run-

    time system would then prioritize the data mapping to SPM and

    PEM space based on a utilization metric. Similarly, to manage on-

    chip hybrid memory space, designers can exploit the concept of

    virtual NVMs (vNVMs) [11], which behave similarly to vSPMs,

    meaning that the run-time environment transparently allows each

    application to manage their own set of vNVMs.

    Management of virtual memories is done through a small set of

    APIs [8], [11], which send management commands to the memory

    manager. The memory manager then presents each application

    with intermediate physical addresses (IPAs), which point to their

    virtual SPMs/NVMs. Traditional SPM-based memory management

    requires the data layout to use physical addresses by pointing to

the base register of the SPMs; as a result, the same is expected of SPM/NVM-based memory hierarchies [35]. In PHiLOSoftware, all

    policies use virtual SPM and NVM base addresses, so any run-time

    re-mapping of data will remain transparent to the initial allocation

    policies as the IPAs will not change.
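A minimal sketch of how an application might drive such a virtualized memory (the names vspm_create, vspm_base, and vspm_map are hypothetical; the actual APIs are described in [8], [11]):

    /* The application sees a stable intermediate physical address (IPA);
       the memory manager decides whether the bytes behind it live in
       physical SPM or in the locked PEM region of DRAM, and may re-map
       them at run time without invalidating the IPA that the allocation
       policy was written against. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct vspm vspm_t;

    vspm_t   *vspm_create(size_t bytes);      /* request a virtual SPM (hypothetical) */
    uintptr_t vspm_base(vspm_t *v);           /* IPA of its first byte                */
    int       vspm_map(vspm_t *v, uintptr_t ipa, const void *data, size_t n);

    void place_coefficients(const int16_t *coeffs, size_t n)
    {
        vspm_t   *v   = vspm_create(n * sizeof *coeffs);
        uintptr_t ipa = vspm_base(v);          /* policies record IPAs, not physical addresses */
        vspm_map(v, ipa, coeffs, n * sizeof *coeffs);  /* manager picks SPM or PEM backing */
    }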

Table I
FILTER INEQUALITIES AND PREFERRED MEMORY TYPE

Filter   Pref.    Inequalities
F1       sram     E(D_i^spm) > E(D_i^dram);  E(D_i^nvm) < E(D_i^dram);  V > T_vol
F2       nvm      E(D_i^nvm) > E(D_i^dram);  V < T_vol
F3       either   E(D_i^spm) > E(D_i^dram);  E(D_i^nvm) > E(D_i^dram)
F4       dram     E(D_i^spm) < E(D_i^dram);  E(D_i^nvm) < E(D_i^dram)

Table II
CONFIGURATIONS

Config.  Applications                     CPUs   vSPM/vNVM Space   SPM/NVM Space
C1       adpcm, aes                       1      32/128 KB         16/64 KB
C2       adpcm, aes, blowfish, gsm        1      64/256 KB         16/64 KB
C3       C2 & h263, jpeg, motion, sha     1      128/512 KB        16/64 KB
C4       same as C2                       2      64/256 KB         32/128 KB
C5       same as C3                       2      128/512 KB        32/128 KB
C6       same as C3                       4      128/512 KB        64/256 KB

PHiLOSoftware defines four block-based allocation policies: 1) Temporal allocation, which combines temporal SPM allocation [93] and hybrid memory allocation [35], and adheres to the initial layout obtained through static analysis; however, the application's SPM and NVM contents must be swapped on a context switch to avoid conflicts with other tasks. 2) FixedDynamic allocation, which combines dynamic-spatial SPM allocation [79] and hybrid memory allocation [35], and maps each data block to the preferred memory type (adhering to the initial layout) as long as there is space; otherwise, data is mapped to DRAM. 3) FilterDynamic allocation, which exploits the concept of filtering and volatility to find the best placement. Each request is filtered according to a set of inequalities (shown in Table I), which


    determine the preferred memory type. Finally, 4) Oracle-based

    allocation, which is a near-optimal policy because on every block-

    allocation request, it feeds the entire memory map to the same

    policy generator the compiler uses to generate policies statically.
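The decision step of the FilterDynamic policy can be sketched as below, under our reading of Table I: E(.) is taken as a per-memory figure of merit (e.g., estimated energy benefit) for block D_i, V is the block's write volatility, T_vol a threshold, and the order in which filters are checked when more than one holds is our own assumption.

    /* Table I filter logic (sketch). e_spm, e_nvm, e_dram correspond to
       E(D_i^spm), E(D_i^nvm), E(D_i^dram); v and t_vol to V and T_vol.
       The exact metric definitions are in [11] and not reproduced here. */
    typedef enum { PREF_SRAM, PREF_NVM, PREF_EITHER, PREF_DRAM } pref_t;

    pref_t filter_block(double e_spm, double e_nvm, double e_dram,
                        double v, double t_vol)
    {
        if (e_spm > e_dram && e_nvm < e_dram && v > t_vol)  /* F1 */
            return PREF_SRAM;
        if (e_nvm > e_dram && v < t_vol)                    /* F2 */
            return PREF_NVM;
        if (e_spm > e_dram && e_nvm > e_dram)               /* F3 */
            return PREF_EITHER;
        if (e_spm < e_dram && e_nvm < e_dram)               /* F4 */
            return PREF_DRAM;
        return PREF_DRAM;   /* no filter matched: fall back to DRAM */
    }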

Figure 8. Normalized Execution Time and Energy Comparison for Performance-Optimized Policies (Temporal, Oracle, FixedDyn, FilterDyn) over configurations C1-C6 and their average, for MRAM and PRAM, with Goal=Performance and page size = 4KB.

Table II shows each configuration (C1-6), which has a set of applications running concurrently over a number of CPUs, and a predefined hybrid memory physical space. The idea is to show that PHiLOSoftware's FilterDynamic policy (backward-slashed bars in Figure 8) achieves allocation solutions of competitive quality with the more complex Oracle policy. Figure 8 shows the normalized execution time and energy for each of the different configurations (C1-6, Goal=Min Execution Time, denoted as G=P) using 4KB blocks and different memory types with respect to the Temporal policy. The FixedDynamic policy (forward-slashed bars in Figure 8) suffers the greatest impact on energy and execution time as memory space increases (C4-6), since it adheres to the decisions made at compile-time and does not efficiently allocate memory blocks at run-time. In general, we see that the FilterDynamic policy performs almost as well as the Oracle policy (within 8.45% in execution time). Compared with the Temporal policy, PHiLOSoftware's FilterDynamic policy is able to reduce execution time and energy by an average of 75.42% and 62.88%, respectively, when the initial application policies have been optimized for execution time minimization [11].

    B. Mixed Criticality Support

    In order to efficiently exploit the available memory variability

    (on- and off-chip [12]), programmers need to provide PHiLOSoft-

    ware with data mapping hints in the form of policies. A pro-

    grammer will consider the applications requirements and partition

    its virtual address space into regions, which are then associated

    with a mapping policy that dictates how to map the data into

    physical address space and the type of guarantee needed (power,

    performance, fault-tolerance). Iyer et al. [43] proposed a QoS-

    enabled memory architecture to enable more cache space andmemory bandwidth for high priority applications based on guidance

    from the operating environment. PHiLOSoftware takes a similar

    approach, however, it opportunistically exploits the variations in

    the memory subsystem to achieve better performance and minimize

    power consumption.

Figure 9 shows a sample address space partitioning for JPEG [57], where the programmer has identified: 1) Read-only and highly utilized data, i.e., look-up tables (red blocks), 2) A temporary buffer for inter-task communication (gray block), 3) Read-only pixel data (black blocks), and 4) Irregularly accessed data (green blocks).

Figure 9. Sample User Annotations and Policies (e.g., tables (2KB): low-power / E-RAID 1; pixel data: low-power / no E-RAID; DCT, Q, ZZ data: low-power; Huffman data: low-power / irregular access).

Figure 10. Partitioning the Application's Memory Space: a) traditional mapping onto full physical SPMs (4KB) and virtual memory/DRAM; b) PHiLOSoftware mapping onto voltage-scaled SPMs, E-RAID 1 protected space, PEM, and high-/low-power (HP/LP) DRAM.

Figure 10 (a) shows a traditional mapping of these data blocks,

    where variability is not taken into account. Figure 10 (b) shows

the result of PHiLOSoftware's virtualization layer mapping that

    exploits: 1) Data mapping policies customized by the programmer

    and used to make dynamic memory allocation decisions. 2) On-chip

    memory voltage scaling (using E-RAIDs [9] to deal with process

    variation), and 3) DRAM variability. For the sake of illustration,

    PHiLOSoftware maps commonly used read-only data to voltage

    scaled SRAM protected by an E-RAID 1 level, pixel data to

    voltage scaled SRAM (NO ERAID), and irregular commonly used

data to low-power DRAM. A programmer with knowledge of the application's requirements can create custom data mapping policies with low-power (LP) memory space in mind. PHiLOSoftware then takes these policies and tries to opportunistically enforce them (best effort), regardless of how the LP memory space is implemented by

    the hardware layer. For instance: 1) If there is no noticeable DRAM

    power variability, then PHiLOSoftware will not prioritize DRAMs

    and follow a more traditional memory management scheme (e.g.,

    malloc()), or 2) If voltage scaling on-chip memories is not possible,

    PHiLOSoftware will proceed to treat all on-chip memories the

    same.

    Recall Figure 6 and consider the case where there is power

    and latency variation in both on-chip (due to voltage scaling the

    SPMs) and off-chip memories (due to the inherent hardware-

    variability in DRAM memories). The programmer then partitions

each application's virtual address space as shown in Figure 9 and defines allocation policies for each virtual address space as shown in Figure 10 (adding a color to each virtual address space according to its requirements). These annotations are used by PHiLOSoftware's run-time system, which opportunistically exploits the variations in the memory subsystem. Each application will then have different requirements (e.g., fault-tolerant memory space, secure memory space, etc.), some being more critical than others (e.g., h263's higher memory footprint requirements vs. sha's need for full

    memory isolation).

    Figure 11 shows various configurations (x-axis):


App.       2x2x1   4x2x2   8x2x4
adpcm              Y       Y
aes                Y       Y
blowfish                   Y
gsm                        Y
h263       Y       Y       Y
jpeg       Y       Y       Y
motion                     Y
sha                        Y

Figure 11. Dynamic Policy-driven Variability-aware Allocation.

{#Apps}x{#OSes}x{#CPUs}, with 4x8KB physical SPMs, and the set of applications run for each configuration (marked by a Y in their respective row/column). The baseline policy (P) utilized the entire physical space with context-switching (CX) enabled [93] (e.g., swap SPM data on CX). Policy M1 uses vSPMs and allows PHiLOSoftware to dynamically map each application's data. Because we are running various applications concurrently, PHiLOSoftware needs to prioritize and judiciously map different data to on-chip and off-chip memory. The data sets (T1-4) in Figure 11 represent a high-level abstraction of the applications' workload. In this experiment, despite mapping all T1-3 data to vSPM for a given application, it is not guaranteed that the data will go into physical SPM space, as the resources are limited (only 4x8KB). So PHiLOSoftware prioritizes among all data blocks of each category (T1-4) and, based on their priorities (BlkPriority = block utilization), decides where to map the data. For example, H.263 has much higher on-chip/off-chip memory requirements; as a result, higher priority is given to H.263's T1-4 blocks than to those of the other applications (on-chip SRAM and low-power DRAM). User-defined policies (M1) managed to reduce dynamic power consumption by 63% on average while reducing total execution time by an average of 34% because: 1) there are up to {8 Apps}x{4 OSes}x{4 CPUs} competing for memory resources, and traditional malloc (P) is unable to efficiently cope with the demand, and 2) PHiLOSoftware efficiently manages the memory space by exploiting the idea of variability-aware, dynamic, policy-driven memory allocation.
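The prioritization itself can be sketched as a simple greedy pass (our illustration of the "BlkPriority = block utilization" rule; the block descriptor and the free-space bookkeeping below are hypothetical):

    /* Rank all blocks from all concurrent applications by utilization and
       grant the limited physical SPM space (e.g., 4x8KB) to the hottest
       blocks first; everything else spills to low-power DRAM. */
    #include <stdlib.h>

    struct blk { unsigned app_id; unsigned util; size_t bytes; int in_spm; };

    static int by_util_desc(const void *a, const void *b)
    {
        const struct blk *x = a, *y = b;
        return (y->util > x->util) - (y->util < x->util);   /* descending utilization */
    }

    void place_blocks(struct blk *blks, size_t n, size_t spm_bytes_free)
    {
        qsort(blks, n, sizeof *blks, by_util_desc);          /* hottest blocks first */
        for (size_t i = 0; i < n; i++) {
            if (blks[i].bytes <= spm_bytes_free) {           /* fits on-chip?        */
                blks[i].in_spm = 1;
                spm_bytes_free -= blks[i].bytes;
            } else {
                blks[i].in_spm = 0;                          /* spill to LP DRAM     */
            }
        }
    }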

    C. PHiLOSoftware Summary

    This section presented the concept of PHiLOSoftware, which

    proposes the idea of tagging or coloring virtual address spaces

    with different characteristics ranging from low-power virtual mem-

    ory space, secure virtual memory space, to fault-tolerant virtual

    memory space, etc. Each application will define different virtual

mappings (policies) according to its needs. PHiLOSoftware's run-

    time system will then opportunistically enforce these memory al-

    location policies. PHiLOSoftware, like [43], is a plausible solution

    to support the efficient management of the memory subsystem

    in scalable many-core platforms, thereby allowing for mixed-

    criticality systems to execute in an energy efficient and high-

    performance manner.

    VIII. CONCLUSION

    By deploying simpler processing elements and exploiting task-

    level and application-level parallelism, the many-core era promises

    high performance with reduced power consumption. In order to

support the memory bandwidth requirements of such a massively parallel system, the memory subsystem must be carefully designed and managed. However, as we move forward, the memory hierarchy will be heterogeneous in nature, and thus traditional

    memory management schemes will not cope with the demands of

    the processing elements. In this paper we surveyed the state-of-

    the-art in distributed memory models and motivated the need for

    efficient management of software-controlled many-core memories.

    We showed how virtualization may help address some of the

emerging issues in the memory subsystem, from the adoption of emerging non-volatile memories to the opportunistic exploitation

    of the inherent variation in the memory subsystem.

    ACKNOWLEDGMENT

    This work was partially supported by NSF Variability Expedition

    Grant Number CCF-1029783.

REFERENCES

[1] B. Ackland et al. A single-chip, 1.6-billion, 16-b mac/s multi-

    processor dsp. Solid-State Circuits, IEEE Journal of, 35, 2000.

    [2] V. Agarwal et al. Clock rate versus ipc: the end of the road for

    conventional microarchitectures. SIGARCH Comp. Arch. News,

    28, 2000.

    [3] F. Angiolini et al. Reliability support for on-chip memories using

networks-on-chip. In ICCD, Oct. 2006.

[4] A. Ansari et al. Zerehcache: armoring cache architectures in

    high defect density technologies. In MICRO 42, 2009.

    [5] ARM. Arm cortex-m3 processor. In http://www.arm.com/, 2012.

    [6] N. Azizi et al. Low-leakage asymmetric-cell sram. TVLSI, Vol.

    11, aug. 2003.

    [7] A. BanaiyanMofrad et al. Fft-cache: a flexible fault-tolerant

    cache architecture for ultra low voltage operation. CASES, 2011.

[8] L. Bathen. PHiLOSoftware: A low power, high performance, reliable, and secure virtualization layer for on-chip software-controlled memories. PhD thesis, University of California,

    Irvine, 2012.

    [9] L. Bathen et al. E-RoC: Embedded raids-on-chip for low power

distributed dynamically managed reliable memories. In DATE,

    2011.

    [10] L. Bathen et al. Spmvisor: Dynamic scratchpad memory virtual-

    ization for secure, low power and high performance, distributed

on-chip memories. In CODES+ISSS, 2011.

    [11] L. Bathen et al. HaVOC: A hybrid-memory-aware virtualization

    layer for on-chip distributed scratchpad and non-volatile mem-

    ories. In DAC, 2012.

    [12] L. Bathen et al. VaMV: Variability-aware memory virtualization.

    In DATE, 2012.

    [13] L. Benini and G. D. Micheli. Networks on chips: A new soc

    paradigm. IEEE Computer, 35(1), 2002.

    [14] K. Bernstein et al. High-performance cmos variability in the

    65-nm regime and beyond. IBM J. Res. Dev., 50, 2006.

    [15] S. Borkar. Thousand core chips: a technology perspective. In

    DAC, 2007.

    [16] S. Borkar et al. Parameter variations and impact on circuits and

    microarchitecture. In DAC 03, 2003.

    [17] B. Calhoun et al. A 256-kb 65-nm sub-threshold sram design

for ultra-low-voltage operation. IEEE J. of Solid-State Circuits

    (JSSC), 2007.

    [18] A. Chakraborty et al. E


    [19] X. Chen, Z. Lu, A. Jantsch, and S. Chen. Run-time partitioning

    of hybrid distributed shared memory on multi-core network-on-

chips. In PAAP, 2010.

    [20] X. Chen et al. Supporting distributed shared memory on multi-

    core network-on-chips using a dual microcoded controller. In

    DATE, 2010.

    [21] S.-H. Chou et al. No cache-coherence: A single-cycle ring

interconnection for multi-core l1-nuca sharing on 3d chips. In DAC, 2009.

    [22] D. E. Culler, A. Gupta, and J. P. Singh. Parallel Computer Ar-

    chitecture: A Hardware/Software Approach. Morgan Kaufmann

    Publishers Inc., San Francisco, CA, USA, 1st edition, 1997.

[23] A. Das et al. Pad: Power-aware directory placement in distributed

    caches. TR Northwestern University (NWU-EECS-10-11), 2010.

    [24] A. Devgan et al. Power variability and its impact on design. In

    VLSID, 2005.

    [25] B. Egger et al. Dynamic scratchpad memory management for

    code in portable systems with an mmu. ACM TECS, 7, January

    2008.

    [26] B. Egger et al. Scratchpad memory management techniques

    for code in embedded systems without an mmu. Comp., IEEE

Trans. on, 2010.

[27] A. Ferreira et al. Using pcm in next-generation embedded space applications. In RTAS 10, April 2010.

    [28] L. Gauthier et al. Minimizing inter-task interferences in scratch-

    pad memory usage for reducing the energy consumption of

    multi-task systems. In CASES 10, 2010.

    [29] S. Ghosh et al. Reducing power consumption in memory ecc

checkers. In ITC, 2004.

    [30] S. Gochman, A. Mendelson, A. Naveh, and E. Rotem. Intro-

    duction to intel core duo processor architecture. Intel Technol.

J., vol. 10, no. 2, pp. 89-97, 2006.

    [31] M. Gottscho et al. Analyzing power variability of ddr3 dual

    inline memory modules for applications. TR UCLA-EE, 2011.

    [32] D. S. Gracia et al. Lp-nuca: Networks-in-cache for high-

    performance low-power embedded processors. TVLSI, PP(99):1,2011.

    [33] T. Granlund, B. Granbom, and N. Olsson. Soft error rate increase

    for new generations of srams. Nuclear Science, IEEE Trans. on,

50(6):2065-2068, Dec. 2003.

    [34] J. Howard et al. A 48-core ia-32 message-passing processor

    with dvfs in 45nm cmos. In ISSCC, feb. 2010.

    [35] J. Hu et al. Towards energy efficient hybrid on-chip scratch pad

    memory with non-volatile memory. In DATE 11, 2011.

    [36] IBM. The cell project. http://www.research.ibm.com/cell/, 2005.

    [37] Intel Corp. Intel ixp2855 network processor. Available:

    http://www.intel.com, 2005.

    [38] I. Issenin et al. Multiprocessor system-on-chip data reuse

    analysis for exploring customized memory hierarchies. In DAC

    06, 2006.

    [39] ITRS. System drivers. http://www.itrs.net/, 2003.

    [40] ITRS. System drivers. http://www.itrs.net/, 2005.

    [41] ITRS. Process integration, device and structures.

    http://www.itrs.net/, 2007.

    [42] ITRS. 2008 update overview. http://www.itrs.net/, 2008.

    [43] R. Iyer et al. Qos policies and architecture for cache/memory in

    cmp platforms. SIGMETRICS Perform. Eval. Rev., 35(1), 2007.

    [44] Y. Joo et al. Energy- and endurance-aware design of phase

    change memory caches. In DATE 10, 2010.

    [45] S. Kaneko et al. A 600mhz single-chip multiprocessor with

    4.8gb/s internal shared pipelined bus and 512kb internal mem-

ory. In ISSCC 03, 2003.

    [46] P. A. Karger. Multi-level security requirements for hypervisors.

    In ACSAC, 2005.

[47] J. Kelm et al. Waypoint: scaling coherence to thousand-core architectures. In PACT, pages 99-110, 2010.

    [48] O. Khan et al. Dcc: A dependable cache coherence multicore

    architecture. IEEE CAL, 10, 2011.

    [49] S. Kim. Area-efficient error protection for caches. In DATE,

    2006.

    [50] C. Kim et al. An adaptive, non-uniform cache structure for

    wire-delay dominated on-chip caches. SIGOPS Oper. Syst. Rev.,

36:211-222, 2002.

    [51] J. Kim et al. Multi-bit error tolerant caches using two-

    dimensional error coding. In MICRO, 2007.

[52] N. Kim et al. Leakage current: Moore's law meets static power.

    Computer, 36(12), dec. 2003.

    [53] J. Kulkarni et al. A 160 mv, fully differential, robust schmitt

trigger based sub-threshold sram. In ISLPED 07, 2007.

[54] F. Kurdahi et al. Low-power multimedia system design by

    aggressive voltage scaling. TVLSI, 18(5), 2010.

    [55] G. Kurian et al. Atac: a 1000-core cache-coherent processor

with on-chip optical network. In PACT, 2010.

    [56] B. Lee et al. Architecting phase change memory as a scalable

    dram alternative. SIGARCH Comput. Archit. News, 37, 2009.

    [57] C. Lee et al. Mediabench: a tool for evaluating and synthesizing

multimedia and communications systems. In MICRO 97, 1997.

    [58] H. Lee et al. Cloudcache: Expanding and shrinking private

caches. In HPCA, 2011.

    [59] K. Lee et al. Mitigating soft error failures for multimedia

    applications by selective data protection. In CASES 06, 2006.

    [60] J. Leverich et al. Comparing memory systems for chip multi-

processors. SIGARCH Comput. Archit. News, 35(2), June 2007.

[61] F. Li et al. Improving scratch-pad memory reliability through

    compiler-guided data block duplication. In ICCAD, 2005.

    [62] J. Lira et al. Lru-pea: A smart replacement policy for non-

    uniform cache architectures on chip multiprocessors. In ICCD,

    2009.

    [63] T. Liu et al. Power-aware variable partitioning for dsps with

hybrid pram and dram main memory. In DAC 11, 2011.

    [64] A. Marongiu et al. An openmp compiler for efficient use of

    distributed scratchpad memory in mpsocs. Computers, IEEE

    Transactions on, PP(99):1, 2010.

    [65] R. Mastipuram and E. C. Wee. Soft errors impact on system re-

liability. http://www.edn.com/article/CA454636, September

    2004.

[66] T. Mattson et al. The 48-core scc processor: the programmer's view. In SC, 2010.

    [67] D. Melpignano et al. Platform 2012, a many-core computing

    accelerator for embedded socs: performance evaluation of visual

    analytics applications. In DAC, 2012.

    [68] L. Mieszko et al. Shared memory via execution migration.

    ASPLOS, 2011.

    [69] A. Mishra et al. Architecting on-chip interconnects for stacked

3d stt-ram caches in cmps. In ISCA 11, 2011.


    [70] J. Mogul et al. Operating system support for nvm+dram hybrid

main memory. In HotOS 09, 2009.

    [71] M. Monchiero et al. Exploration of distributed shared memory

    architectures for noc-based multiprocessors. Journal of Systems

    Architecture, 53(10), 2007.

    [72] F. Moradi et al. 65nm sub-threshold 11t-sram for ultra low

voltage applications. In SOCC 08, pages 113-118, Sept. 2008.

[73] S. Nassif. Modeling and analysis of manufacturing variations. In IEEE Conf. on Custom Integrated Circuits, 2001.

    [74] NXP. Nxp nexperia mobile multimedia processor pnx4101.

    Available: http://www.nxp.com/, 2007.

    [75] K. Olukotun et al. The case for a single-chip multiprocessor.

    SIGPLAN Not., 31, September 1996.

    [76] K. Osada et al. 16.7 fa/cell tunnel-leakage-suppressed 16 mb

    sram for handling cosmic-ray-induced multi-errors. In ISSCC

    03, 2003.

    [77] P. Panda et al. Efficient utilization of scratch-pad memory in

    embedded processor applications. In EDTC 97, 1997.

    [78] D. A. Patterson and J. L. Hennessy. Computer Organization

and Design, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann Publishers Inc., 4th edition, 2008.

    [79] F. Poletti et al. An integrated hardware/software approach for

    run-time scratchpad management. In DAC 04, 2004.

    [80] R. Pyka et al. Os integrated energy aware scratchpad allocation

    strategies for multiprocess applications. In SCOPES, 2007.

    [81] A. Ros et al. A direct coherence protocol for many-core chip

    multiprocessors. TPDS, 21(12), Dec. 2010.

[82] F. Ruckerbauer et al. Soft error rates in 65nm srams: analysis of new phenomena. In IOLTS, 2007.

    [83] J. Sartori et al. Variation-aware speed binning of multi-core

    processors. In ISQED 10, 2010.

    [84] A. Sasan et al. Process variation aware sram/cache for aggressive

    voltage-frequency scaling. In DATE 09, 2009.

[85] M. Shalan et al. A dynamic memory management unit for embedded real-time system-on-a-chip. In CASES 00, 2000.

    [86] P. Shirvani et al. Padded cache: A new fault-tolerance technique

    for cache memories. In VTS, 1999.

    [87] P. Shivakumar et al. Modeling the effect of technology trends

on the soft error rate of combinational logic. In DSN, 2002.

    [88] S. K. Shukla et al. A brief history of multiprocessors and eda.

    IEEE Design & Test of Computers, 28(3):96, 2011.

    [89] STMicroeletronics. Nomadik - open multimedia platform for

    next generation mobile devices. Technical Article TA305, Avail-

    able: http:// www.st.com, 2003.

    [90] V. Suhendra et al. Integrated scratchpad memory optimization

    and task scheduling for mpsoc architectures. In CASES, 2006.

    [91] V. Suhendra et al. Scratchpad allocation for concurrent embed-

ded software. In CODES+ISSS, 2008.

[92] G. Sun et al. A novel architecture of the 3D stacked MRAM L2 cache for CMPs. In HPCA 09, Feb. 2009.

    [93] H. Takase et al. Partitioning and allocation of scratch-pad

    memory for priority-based preemptive multi-task systems. In

    DATE 10 , 2010.

[94] Tilera. TILEPro64. http://www.tilera.com/, 2010.

    [95] Y. Wang et al. Temperature-constrained power control for chip

    multiprocessors with online model estimation. In ISCA, 2009.

    [96] L. Wanner et al. A case for opportunistic embedded sensing in

    presence of hardware power variability. In HotPower10, 2010.

    [97] C. Wilkerson et al. Trading off cache capacity for reliability to

    enable low voltage operation. In ISCA 08, 2008.

    [98] C. Wilkerson et al. Reducing cache power with low-cost, multi-

    bit error-correcting codes. SIGARCH Comput. Archit. News,

38(3):83-93, June 2010.

[99] W. Wolf et al. Multiprocessor system-on-chip (mpsoc) technology. TCAD, 27(10), 2008.

    [100] B. Wongchaowart et al. A content-aware block placement

    algorithm for reducing pram storage bit writes. In MSST 10,

    2010.

    [101] X. Wu et al. Hybrid cache architecture with disparate memory

    technologies. In ISCA 09, 2009.

    [102] C.-L. Yang et al. Software-controlled cache architecture for

energy efficiency. TCSVT, 15(5):634-644, 2005.

    [103] N. Yanwei et al. Performance modelling and optimization of

    memory access on cellular computer architecture cyclops64. In

    NPC, 2005.

    [104] W. Zhang. Enhancing data cache reliability by the addition of

    a small fully-associative replication cache. In ICS, 2004.

[105] P. Zhou et al. A durable and energy efficient main memory using phase change memory technology. In ISCA 09, 2009.
