Extreme Resilience and Deeper Memory Hierarchy


  • http://www.gsic.titech.ac.jp/sc15

Fault Tolerant Infrastructure for Billion-Way Parallelization

Project Introduction
Thanks to supercomputers, large-scale simulations can be achieved. However, the increasing number of nodes and components leads to a very high failure frequency: on exa-scale supercomputers, the MTBF is expected to be no more than tens of minutes, which means that, in effect, computing nodes cannot make progress between failures. We are seeking a solution to this problem.

FMI: Fault Tolerant Messaging Interface
FMI is an MPI-like survivable messaging interface that enables scalable failure detection, dynamic node allocation, and fast, transparent recovery.

[Figure: FMI overview. User's view: FMI ranks 0-7 (virtual ranks) run as processes P0-P7 on Nodes 0-3, with spare processes P8-P9 on Node 4 used for scalable failure detection and dynamic node allocation. FMI's view: each process's checkpoint is split into chunks (P0-0, P0-1, P0-2, ...) protected by parity blocks (Parity 0 through Parity 7) distributed across processes.]
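The parity blocks in the figure indicate XOR-style redundant checkpoint encoding across processes. The sketch below is a minimal, hypothetical illustration of that idea in C (the function names are ours, not FMI's API): a parity block is the bytewise XOR of a group's checkpoint chunks, and any single lost chunk can be rebuilt from the parity and the survivors.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch (not FMI's API): a parity block is the bytewise
     * XOR of the checkpoint chunks held by one encoding group. */
    void compute_parity(uint8_t *parity, uint8_t *const chunks[],
                        int nchunks, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            uint8_t p = 0;
            for (int c = 0; c < nchunks; c++)
                p ^= chunks[c][i];
            parity[i] = p;
        }
    }

    /* If exactly one chunk is lost, XOR-ing the parity block with all
     * surviving chunks reproduces it. */
    void recover_chunk(uint8_t *lost, const uint8_t *parity,
                       uint8_t *const survivors[], int nsurv, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            uint8_t v = parity[i];
            for (int c = 0; c < nsurv; c++)
                v ^= survivors[c][i];
            lost[i] = v;
        }
    }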

MPI-like interface

    int main(int argc, char *argv[]) {
        int rank, n;
        FMI_Init(&argc, &argv);
        FMI_Comm_rank(FMI_COMM_WORLD, &rank);
        /* Application's initialization */
        while ((n = FMI_Loop(…)) < numloop) {
            /* Application's program */
        }
        /* Application's finalization */
        FMI_Finalize();
    }

Fast checkpoint/restart

[Figure: Himeno benchmark performance. x-axis: # of processes (12 processes/node), 0 to 1500; y-axis: performance (GFlops), 0 to 2500; series: MPI, FMI, MPI + C, FMI + C, FMI + C/R.]

Lossy Compression for Checkpoint/Restart
To reduce checkpoint time, lossy compression is applied to the checkpoint data, reducing the checkpoint size.
Target
● 1D, 2D, and 3D mesh data without pointers
Approach (a sketch of step 1 appears below the pipeline figure):
1. wavelet transformation
2. quantization
3. encoding
4. gzip

[Figure: compression pipeline. Original checkpoint data (floating-point array) → 1. wavelet transformation → low-frequency and high-frequency band arrays → 2. quantization → 3. encoding → bitmap arrays plus low- and high-frequency band arrays → 4. formatting → 5. applying gzip → compressed data.]
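To make step 1 concrete, here is a minimal sketch of one level of a wavelet transform on a 1D float array. The poster does not name the wavelet, so the simple Haar transform is assumed here: it splits the input into a low-frequency band of pairwise averages and a high-frequency band of pairwise differences. For smooth mesh data the high band clusters near zero, which is what makes the later quantization and gzip steps effective.

    #include <stddef.h>

    /* One Haar wavelet level (Haar assumed for illustration). in[] has
     * 2*half elements; low[] gets pairwise averages (low-frequency band),
     * high[] gets pairwise differences (high-frequency band). */
    void haar_step(const float *in, float *low, float *high, size_t half)
    {
        for (size_t i = 0; i < half; i++) {
            low[i]  = (in[2*i] + in[2*i + 1]) * 0.5f;
            high[i] = (in[2*i] - in[2*i + 1]) * 0.5f;
        }
    }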

[Figure: simple vs. proposed quantization. Histograms of the values in the high-frequency band (x-axis: value, roughly -3 to 4; y-axis: frequency) are built. Simple quantization divides the whole value range into n = 4 levels. The proposed quantization marks the "spike" parts of the distribution (red elements) with a bitmap (e.g. 0 1 1 1 1 1 1 0; d = 10 in the figure) and quantizes the spike and non-spike parts separately, each with n = 4 levels.]
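For comparison with the figure, below is a minimal sketch of the simple quantizer: it cuts the band's value range into n uniform bins and stores one bin index per value (assuming n <= 256 so an index fits in a byte; all names are illustrative). The proposed quantizer differs in that it histograms the band, records the "spike" bins in a bitmap, and quantizes spike and non-spike values with separate level sets; that variant is not shown.

    #include <stddef.h>

    /* Simple uniform quantization into n levels (n <= 256 assumed).
     * Each float becomes the index of its bin; a decoder reconstructs
     * the bin midpoint, so the worst-case error is half a bin width. */
    void quantize_uniform(const float *band, unsigned char *out, size_t len,
                          float vmin, float vmax, int n)
    {
        float width = (vmax - vmin) / n;     /* uniform bin width */
        for (size_t i = 0; i < len; i++) {
            int q = (int)((band[i] - vmin) / width);
            if (q < 0) q = 0;                /* clamp below vmin */
            if (q >= n) q = n - 1;           /* clamp vmax into top bin */
            out[i] = (unsigned char)q;
        }
    }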

[Figure: compression rate [%] (0 to 100) of the checkpoint for: original, gzip only, simple quantization (n = 128), and proposed quantization (n = 128).]

[Figure: average relative error [%] versus division number (1 to 128) for simple vs. proposed quantization; one panel for the pressure array (y-axis up to 0.007%) and one for the temperature array (y-axis up to 0.9%).]

Target Application: NICAM [M. Satoh, 2008]
Climate simulation of pressure, temperature, and velocity
3D array, 1156 * 82 * 2, 720 steps

Machine Spec
CPU: Intel Core i7-3930K (6 cores, 3.20 GHz)
Mem: 16 GB

This work was supported by JSPS KAKENHI Grant Number 23220003.

Checkpoint size is much reduced with low error in this particular setting.

Dealing with Deeper Memory Hierarchy

Highly Optimized Stencils Larger than GPU Memory
For extremely large stencil simulations, we implemented the temporal blocking (TB) technique and clever optimizations on GPUs [1][2]; a minimal sketch of TB follows the figure below.
● Eliminating redundant computation
● Reducing the memory footprint of the TB algorithm

[Figure: 7-point stencil performance on a K20X GPU.]
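As background for the results above, here is a minimal CPU sketch of temporal blocking for a 1D 3-point stencil (the papers [1][2] target 7-point stencils on GPUs and add the optimizations listed above; sizes and names here are illustrative). Each spatial block is loaded once with a halo of BT cells per side, advanced BT time steps while it stays in fast memory, and written back, so one pass over slow memory yields BT time steps at the cost of some redundant halo computation.

    #include <string.h>

    #define N   1024  /* domain size (illustrative) */
    #define BLK  128  /* spatial block size */
    #define BT     4  /* time steps fused per block */

    /* One 3-point stencil sweep on a local buffer; endpoints held fixed. */
    static void step(const float *src, float *dst, int len)
    {
        for (int i = 1; i < len - 1; i++)
            dst[i] = (src[i-1] + src[i] + src[i+1]) / 3.0f;
        dst[0] = src[0];
        dst[len-1] = src[len-1];
    }

    /* Advance the whole domain BT steps, one block at a time. A halo of
     * BT cells per side keeps the block's interior valid after BT sweeps. */
    void temporal_blocking(const float *a, float *anext)
    {
        float buf0[BLK + 2*BT], buf1[BLK + 2*BT];
        for (int b = 0; b < N; b += BLK) {
            int lo = b - BT, hi = b + BLK + BT;   /* window incl. halo */
            if (lo < 0) lo = 0;
            if (hi > N) hi = N;
            int len = hi - lo;
            memcpy(buf0, &a[lo], len * sizeof(float));
            for (int t = 0; t < BT; t++) {        /* BT sweeps in fast memory */
                step(buf0, buf1, len);
                memcpy(buf0, buf1, len * sizeof(float));
            }
            /* write back only the cells still valid after BT sweeps */
            memcpy(&anext[b], &buf0[b - lo], BLK * sizeof(float));
        }
    }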

HHRT: System Software for GPU Memory Swap

Execution Model
For easier programming, we implemented system software named HHRT (hybrid hierarchical runtime) [3].
● HHRT supports user programs written in MPI and CUDA with little modification
● Oversubscription-based execution model
● HHRT implicitly supports memory swapping between GPU memory and host memory

Integration with Real Simulation Application

We integrated our techniques with a city airflow simulation. The original MPI+CUDA code was developed by Naoyuki Onodera, Tokyo Tech; we integrated TB into it and executed it on HHRT.

[Figure: airflow simulation performance on a K20X GPU.]

[1] G. Jin, T. Endo, S. Matsuoka. A Parallel Optimization Method for Stencil Computation on the Domain that is Bigger than Memory Capacity of GPUs. IEEE Cluster 2013.
[2] G. Jin, J. Lin, T. Endo. Efficient Utilization of Memory Hierarchy to Enable the Computation on Bigger Domains for Stencil Computation in CPU-GPU Based Systems. IEEE ICHPA 2014.
[3] T. Endo, G. Jin. Software Technologies Coping with Memory Hierarchy of GPGPU Clusters for Stencil Computations. IEEE Cluster 2014.

    PI: Toshio Endo (endo@is.titech.ac.jp), supported by JST-CREST

8x larger simulation than device memory!

[Figure: HHRT execution model. Without HHRT (typically), one process per node holds its data in device memory, with MPI communication between nodes and cudaMemcpy between host and device. With HHRT, many processes per node are oversubscribed and each process's data is implicitly swapped between device memory and host memory.]
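To make the execution model concrete, the sketch below shows what HHRT's implicit swap could look like around one blocking MPI call, following the figure above. This is a hypothetical rendering of the mechanism, not HHRT's actual API; the hhrt_* names and the swap_region structure are invented for illustration.

    #include <stddef.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Hypothetical sketch (names invented): with oversubscription, several
     * MPI processes share one GPU, so a process that blocks in MPI can have
     * its device-resident data evicted to host memory for the others. */
    struct swap_region { void *dev; void *host; size_t len; };

    static void hhrt_evict(struct swap_region *r)    /* device -> host */
    {
        cudaMemcpy(r->host, r->dev, r->len, cudaMemcpyDeviceToHost);
        cudaFree(r->dev);
        r->dev = NULL;
    }

    static void hhrt_restore(struct swap_region *r)  /* host -> device */
    {
        cudaMalloc(&r->dev, r->len);
        cudaMemcpy(r->dev, r->host, r->len, cudaMemcpyHostToDevice);
    }

    /* A wrapped blocking call: evict before blocking, restore afterwards,
     * so application code can stay plain MPI + CUDA. */
    static int hhrt_MPI_Barrier(MPI_Comm comm, struct swap_region *r)
    {
        hhrt_evict(r);
        int rc = MPI_Barrier(comm);
        hhrt_restore(r);
        return rc;
    }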

Overview of Project

On exa-scale supercomputers, the "Memory Wall" problem will become even more severe, preventing the realization of extremely fast & big simulations. This project promotes research on this problem via a co-design approach among application algorithms, system software, and architecture.

Target Architecture

A deeper memory hierarchy that consists of heterogeneous memory devices:

Hybrid Memory Cube (HMC): DRAM chips are stacked with TSV technology. HMC will have an advantage in bandwidth over DDR, but its capacity will be smaller.

NAND Flash: SSDs are already commodity hardware. Newer products, such as IO-drive, have O(GB/s) bandwidth.

Next-gen non-volatile RAM (NVRAM): Several kinds of NVRAM, such as STT-MRAM, ReRAM, and FeRAM, will become available in the next few years.

Target: Realizing extremely fast & big simulations of O(100 PB/s) & O(100 PB) in the exa-scale era.

[Figure: co-design layers. Communication- and bandwidth-reducing algorithms; system software for memory hierarchy management; HPC architecture with hybrid memory devices (HMC, HBM; O(GB/s) Flash; next-gen NVM).]
