Extreme Resilience and Deeper Memory Hierarchy


  • http://www.gsic.titech.ac.jp/sc15

Fault Tolerant Infrastructure for Billion-Way Parallelization

Project Introduction
Thanks to supercomputers, large-scale simulations can be achieved. However, the increasing number of nodes and components leads to a very high failure frequency: on exa-scale supercomputers, the MTBF is expected to be no more than tens of minutes, which means that, in effect, computing nodes cannot make progress between failures. We are seeking a solution to this problem.

FMI: Fault Tolerant Messaging Interface
FMI is an MPI-like survivable messaging interface that enables scalable failure detection, dynamic node allocation, and fast, transparent recovery.

[Figure: FMI overview. User's view: FMI ranks 0-7 (virtual ranks) run as processes P0-P7 on Nodes 0-3, with spare processes P8-P9 on Node 4 used for scalable failure detection and dynamic node allocation. FMI's view: each process's checkpoint is split into chunks (P0-0, P0-1, P0-2, ...) protected by parity blocks (Parity 0 through Parity 7) distributed across processes.]
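The parity blocks in the figure indicate XOR-style redundant checkpoint encoding across processes. The sketch below is a minimal, hypothetical illustration of that idea in C (the function names are ours, not FMI's API): a parity block is the bytewise XOR of a group's checkpoint chunks, and any single lost chunk can be rebuilt from the parity and the survivors.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch (not FMI's API): a parity block is the bytewise
     * XOR of the checkpoint chunks held by one encoding group. */
    void compute_parity(uint8_t *parity, uint8_t *const chunks[],
                        int nchunks, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            uint8_t p = 0;
            for (int c = 0; c < nchunks; c++)
                p ^= chunks[c][i];
            parity[i] = p;
        }
    }

    /* If exactly one chunk is lost, XOR-ing the parity block with all
     * surviving chunks reproduces it. */
    void recover_chunk(uint8_t *lost, const uint8_t *parity,
                       uint8_t *const survivors[], int nsurv, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            uint8_t v = parity[i];
            for (int c = 0; c < nsurv; c++)
                v ^= survivors[c][i];
            lost[i] = v;
        }
    }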

MPI-like interface

    int main(int argc, char *argv[]) {
        int rank, n;
        FMI_Init(&argc, &argv);
        FMI_Comm_rank(FMI_COMM_WORLD, &rank);
        /* Application's initialization */
        while ((n = FMI_Loop(…)) < numloop) {
            /* Application's program */
        }
        /* Application's finalization */
        FMI_Finalize();
    }

Fast checkpoint/restart

[Figure: Himeno benchmark performance. x-axis: # of processes (12 processes/node), 0 to 1500; y-axis: performance (GFlops), 0 to 2500; series: MPI, FMI, MPI + C, FMI + C, FMI + C/R.]

Lossy Compression for Checkpoint/Restart
To reduce checkpoint time, lossy compression is applied to the checkpoint data, reducing the checkpoint size.
Target
● 1D, 2D, and 3D mesh data without pointers
Approach (a sketch of step 1 appears below the pipeline figure):
1. wavelet transformation
2. quantization
3. encoding
4. gzip

[Figure: compression pipeline. Original checkpoint data (floating-point array) → 1. wavelet transformation → low-frequency and high-frequency band arrays → 2. quantization → 3. encoding → bitmap arrays plus low- and high-frequency band arrays → 4. formatting → 5. applying gzip → compressed data.]
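To make step 1 concrete, here is a minimal sketch of one level of a wavelet transform on a 1D float array. The poster does not name the wavelet, so the simple Haar transform is assumed here: it splits the input into a low-frequency band of pairwise averages and a high-frequency band of pairwise differences. For smooth mesh data the high band clusters near zero, which is what makes the later quantization and gzip steps effective.

    #include <stddef.h>

    /* One Haar wavelet level (Haar assumed for illustration). in[] has
     * 2*half elements; low[] gets pairwise averages (low-frequency band),
     * high[] gets pairwise differences (high-frequency band). */
    void haar_step(const float *in, float *low, float *high, size_t half)
    {
        for (size_t i = 0; i < half; i++) {
            low[i]  = (in[2*i] + in[2*i + 1]) * 0.5f;
            high[i] = (in[2*i] - in[2*i + 1]) * 0.5f;
        }
    }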

[Figure: simple vs. proposed quantization. Histograms of the values in the high-frequency band (x-axis: value, roughly -3 to 4; y-axis: frequency) are built. Simple quantization divides the whole value range into n = 4 levels. The proposed quantization marks the "spike" parts of the distribution (red elements) with a bitmap (e.g. 0 1 1 1 1 1 1 0; d = 10 in the figure) and quantizes the spike and non-spike parts separately, each with n = 4 levels.]
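For comparison with the figure, below is a minimal sketch of the simple quantizer: it cuts the band's value range into n uniform bins and stores one bin index per value (assuming n <= 256 so an index fits in a byte; all names are illustrative). The proposed quantizer differs in that it histograms the band, records the "spike" bins in a bitmap, and quantizes spike and non-spike values with separate level sets; that variant is not shown.

    #include <stddef.h>

    /* Simple uniform quantization into n levels (n <= 256 assumed).
     * Each float becomes the index of its bin; a decoder reconstructs
     * the bin midpoint, so the worst-case error is half a bin width. */
    void quantize_uniform(const float *band, unsigned char *out, size_t len,
                          float vmin, float vmax, int n)
    {
        float width = (vmax - vmin) / n;     /* uniform bin width */
        for (size_t i = 0; i < len; i++) {
            int q = (int)((band[i] - vmin) / width);
            if (q < 0) q = 0;                /* clamp below vmin */
            if (q >= n) q = n - 1;           /* clamp vmax into top bin */
            out[i] = (unsigned char)q;
        }
    }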

[Figure: compression rate [%] (0 to 100) of the checkpoint for: original, gzip only, simple quantization (n = 128), and proposed quantization (n = 128).]

[Figure: average relative error [%] versus division number (1 to 128) for simple vs. proposed quantization; one panel for the pressure array (y-axis up to 0.007%) and one for the temperature array (y-axis up to 0.9%).]

Target Application: NICAM [M. Satoh, 2008]
Climate simulation of pressure, temperature, and velocity
3D array, 1156 * 82 * 2, 720 steps

Machine Spec
CPU: Intel Core i7-3930K (6 cores, 3.20 GHz)
Mem: 16 GB

This work was supported by JSPS KAKENHI Grant Number 23220003.

Checkpoint size is much reduced with low error in this particular setting.

Dealing with Deeper Memory Hierarchy

Highly Optimized Stencils Larger than GPU Memory
For extremely large stencil simulations, we implemented the temporal blocking (TB) technique and clever optimizations on GPUs [1][2]; a minimal sketch of TB follows the figure below.
● Eliminating redundant computation
● Reducing the memory footprint of the TB algorithm

[Figure: 7-point stencil performance on a K20X GPU.]
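As background for the results above, here is a minimal CPU sketch of temporal blocking for a 1D 3-point stencil (the papers [1][2] target 7-point stencils on GPUs and add the optimizations listed above; sizes and names here are illustrative). Each spatial block is loaded once with a halo of BT cells per side, advanced BT time steps while it stays in fast memory, and written back, so one pass over slow memory yields BT time steps at the cost of some redundant halo computation.

    #include <string.h>

    #define N   1024  /* domain size (illustrative) */
    #define BLK  128  /* spatial block size */
    #define BT     4  /* time steps fused per block */

    /* One 3-point stencil sweep on a local buffer; endpoints held fixed. */
    static void step(const float *src, float *dst, int len)
    {
        for (int i = 1; i < len - 1; i++)
            dst[i] = (src[i-1] + src[i] + src[i+1]) / 3.0f;
        dst[0] = src[0];
        dst[len-1] = src[len-1];
    }

    /* Advance the whole domain BT steps, one block at a time. A halo of
     * BT cells per side keeps the block's interior valid after BT sweeps. */
    void temporal_blocking(const float *a, float *anext)
    {
        float buf0[BLK + 2*BT], buf1[BLK + 2*BT];
        for (int b = 0; b < N; b += BLK) {
            int lo = b - BT, hi = b + BLK + BT;   /* window incl. halo */
            if (lo < 0) lo = 0;
            if (hi > N) hi = N;
            int len = hi - lo;
            memcpy(buf0, &a[lo], len * sizeof(float));
            for (int t = 0; t < BT; t++) {        /* BT sweeps in fast memory */
                step(buf0, buf1, len);
                memcpy(buf0, buf1, len * sizeof(float));
            }
            /* write back only the cells still valid after BT sweeps */
            memcpy(&anext[b], &buf0[b - lo], BLK * sizeof(float));
        }
    }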

HHRT: System Software for GPU Memory Swap

Execution Model
For easier programming, we implemented system software named HHRT (hybrid hierarchical runtime) [3].
● HHRT supports user programs written in MPI and CUDA with little modification
● Oversubscription-based execution model
● HHRT implicitly supports memory swapping between GPU memory and host memory

Integration with Real Simulation Application

We integrated our techniques with a city airflow simulation. The original MPI+CUDA code was developed by Naoyuki Onodera, Tokyo Tech; we integrated TB into it and executed it on HHRT.

[Figure: airflow simulation performance on a K20X GPU.]

[1] G. Jin, T. Endo, S. Matsuoka. A Parallel Optimization Method for Stencil Computation on the Domain that is Bigger than Memory Capacity of GPUs. IEEE Cluster 2013.
[2] G. Jin, J. Lin, T. Endo. Efficient Utilization of Memory Hierarchy to Enable the Computation on Bigger Domains for Stencil Computation in CPU-GPU Based Systems. IEEE ICHPA 2014.
[3] T. Endo, G. Jin. Software Technologies Coping with Memory Hierarchy of GPGPU Clusters for Stencil Computations. IEEE Cluster 2014.

    PI: Toshio Endo (endo@is.titech.ac.jp), supported by JST-CREST

8x larger simulation than device memory!

[Figure: HHRT execution model. Without HHRT (typically), one process per node holds its data in device memory, with MPI communication between nodes and cudaMemcpy between host and device. With HHRT, many processes per node are oversubscribed and each process's data is implicitly swapped between device memory and host memory.]
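To make the execution model concrete, the sketch below shows what HHRT's implicit swap could look like around one blocking MPI call, following the figure above. This is a hypothetical rendering of the mechanism, not HHRT's actual API; the hhrt_* names and the swap_region structure are invented for illustration.

    #include <stddef.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Hypothetical sketch (names invented): with oversubscription, several
     * MPI processes share one GPU, so a process that blocks in MPI can have
     * its device-resident data evicted to host memory for the others. */
    struct swap_region { void *dev; void *host; size_t len; };

    static void hhrt_evict(struct swap_region *r)    /* device -> host */
    {
        cudaMemcpy(r->host, r->dev, r->len, cudaMemcpyDeviceToHost);
        cudaFree(r->dev);
        r->dev = NULL;
    }

    static void hhrt_restore(struct swap_region *r)  /* host -> device */
    {
        cudaMalloc(&r->dev, r->len);
        cudaMemcpy(r->dev, r->host, r->len, cudaMemcpyHostToDevice);
    }

    /* A wrapped blocking call: evict before blocking, restore afterwards,
     * so application code can stay plain MPI + CUDA. */
    static int hhrt_MPI_Barrier(MPI_Comm comm, struct swap_region *r)
    {
        hhrt_evict(r);
        int rc = MPI_Barrier(comm);
        hhrt_restore(r);
        return rc;
    }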

Overview of Project

On exa-scale supercomputers, the "Memory Wall" problem will become even more severe, preventing the realization of extremely fast & big simulations. This project promotes research on this problem via a co-design approach among application algorithms, system software, and architecture.

Target Architecture

A deeper memory hierarchy that consists of heterogeneous memory devices:

Hybrid Memory Cube (HMC): DRAM chips are stacked with TSV technology. HMC will have an advantage in bandwidth over DDR, but its capacity will be smaller.

NAND Flash: SSDs are already commodity hardware. Newer products, such as IO-drive, have O(GB/s) bandwidth.

Next-gen non-volatile RAM (NVRAM): Several kinds of NVRAM, such as STT-MRAM, ReRAM, and FeRAM, will become available in the next few years.

Target: Realizing extremely fast & big simulations of O(100 PB/s) & O(100 PB) in the exa-scale era.

[Figure: co-design layers. Communication- and bandwidth-reducing algorithms; system software for memory hierarchy management; HPC architecture with hybrid memory devices (HMC, HBM; O(GB/s) Flash; next-gen NVM).]
