Extreme Resilience and Deeper Memory Hierarchy
http://www.gsic.titech.ac.jp/sc15
Project Introduction
Supercomputers make large-scale simulations achievable. However, the increasing number of nodes and components leads to a very high failure frequency: on exascale supercomputers, the MTBF is expected to be no more than tens of minutes, which means compute nodes will effectively always be failing. We are seeking a solution to this problem.
FMI: Fault Tolerant Messaging Interface
FMI is an MPI-like survivable messaging interface that enables scalable failure detection, dynamic node allocation, and fast, transparent recovery.
Scalable failure detection and dynamic node allocation
[Figure: User's view vs. FMI's view — the user program sees only FMI ranks (virtual ranks) 0-7, which FMI maps onto processes P0-P9 across Nodes 0-4; in FMI's view, each process's checkpoint is split into chunks (P0-0, P0-1, P0-2, ...) protected by parity blocks (Parity 0 ... Parity 7) distributed across the nodes, so the checkpoints lost with a failed node can be reconstructed on a newly allocated node]
Fast checkpoint/restart
[Figure: Himeno benchmark performance (GFlops) vs. number of processes (12 processes/node), comparing MPI, FMI, MPI + C, FMI + C, and FMI + C/R]

MPI-like interface

int main(int argc, char *argv[]) {
    int rank, n;
    FMI_Init(&argc, &argv);
    FMI_Comm_rank(FMI_COMM_WORLD, &rank);
    /* Application's initialization */
    while ((n = FMI_Loop(…)) < numloop) {
        /* Application's program */
    }
    /* Application's finalization */
    FMI_Finalize();
}
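The poster elides FMI_Loop's arguments with "…". As a point of reference, here is a minimal sketch of how an application might register its state for checkpointing; the signature FMI_Loop(void **ckpt, size_t *sizes, int len), the header name, and the constants NX and NUMLOOP are our assumptions, not taken from the poster.

/* Hedged sketch: assumes FMI_Loop(void **ckpt, size_t *sizes, int len)
 * checkpoints the registered buffers on each call and, after a failure,
 * restores them and returns the iteration index to resume from. */
#include <fmi.h>              /* assumed header name */
#include <stddef.h>

#define NX      (1 << 20)     /* illustrative domain size */
#define NUMLOOP 1000          /* illustrative step count  */

static double domain[NX];     /* simulation state to protect */

int main(int argc, char *argv[]) {
    int rank, n;
    void  *ckpt[]  = { domain };
    size_t sizes[] = { sizeof(domain) };

    FMI_Init(&argc, &argv);
    FMI_Comm_rank(FMI_COMM_WORLD, &rank);
    /* On a restart after a failure, FMI_Loop restores 'domain' from the
     * parity-encoded in-memory checkpoint and returns the step to redo. */
    while ((n = FMI_Loop(ckpt, sizes, 1)) < NUMLOOP) {
        /* one simulation step updating 'domain' */
    }
    FMI_Finalize();
    return 0;
}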
Lossy Compression for Checkpoint/Restart
To reduce checkpoint time, lossy compression is applied to the checkpoint data, shrinking the checkpoint size.
Target: 1D/2D/3D mesh data without pointers.
Approach:
1. wavelet transformation
2. quantization
3. encoding
4. gzip
[Figure: Compression pipeline — the original checkpoint data (a floating-point array) is (1) wavelet-transformed into one low-frequency and several high-frequency band arrays, (2) quantized, (3) encoded together with bitmap arrays, (4) formatted into a single stream, and (5) compressed with gzip]
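To make steps 1 and 2 concrete, here is a minimal sketch assuming a one-level 1D Haar transform and equal-width quantization; the poster does not specify the wavelet family, level count, or quantizer, so these are illustrative choices.

/* Step 1 (wavelet transformation), assuming a one-level 1D Haar
 * transform; the poster's 1-3D meshes extend this dimension by
 * dimension. Pairwise averages form the low-frequency band,
 * pairwise differences the high-frequency band. */
void haar_1d(const float *in, float *low, float *high, int n /* even */) {
    for (int i = 0; i < n / 2; i++) {
        low[i]  = (in[2*i] + in[2*i + 1]) * 0.5f;
        high[i] = (in[2*i] - in[2*i + 1]) * 0.5f;
    }
}

/* Step 2 (simple quantization): map a value onto one of n equal-width
 * levels over [min, max]; small integer indices compress far better
 * under step 5's gzip than raw floats, at the cost of bounded error. */
unsigned char quantize(float v, float min, float max, int n) {
    int q = (int)((v - min) / (max - min) * (n - 1) + 0.5f);
    if (q < 0)     q = 0;
    if (q > n - 1) q = n - 1;
    return (unsigned char)q;
}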
[Figure: Simple vs. proposed quantization of the high-frequency band — a histogram of the band values is built (d = 10, n = 4 shown); under the proposed scheme, red elements belonging to "spike" parts of the distribution are recorded in a bitmap (0 1 1 1 1 1 1 0) and handled separately from the densely populated range]
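The exact details of the proposed scheme are only hinted at by the figure, so the sketch below is our hedged reading: rarely populated histogram bins are treated as "spikes", a bitmap records which elements fall in the dense range, and the n quantization levels are spent on that dense range alone. BINS and the 1% threshold are illustrative, not from the poster.

#include <float.h>

#define BINS 16   /* histogram resolution; illustrative, cf. d = 10 */

/* Hedged sketch of the proposed quantization: elements in sparsely
 * populated histogram bins ("spike" parts) are stored exactly and
 * flagged 0 in the bitmap; everything else is flagged 1 and quantized
 * with n levels over the dense value range only, giving a finer step
 * for typical values than simple quantization over the full range. */
void proposed_quantize(const float *band, int len, int n,
                       unsigned char *q,      /* quantized indices   */
                       unsigned char *inband, /* bitmap, 1 byte/flag */
                       float *spike, int *nspike) {
    float min = FLT_MAX, max = -FLT_MAX;
    int hist[BINS] = {0};
    for (int i = 0; i < len; i++) {
        if (band[i] < min) min = band[i];
        if (band[i] > max) max = band[i];
    }
    float w = (max > min) ? (max - min) / BINS : 1.0f;
    for (int i = 0; i < len; i++) {
        int b = (int)((band[i] - min) / w);
        hist[b >= BINS ? BINS - 1 : b]++;
    }
    /* shrink to the densely populated bin range; bins holding under
     * 1% of the elements are treated as spike parts */
    int lo = 0, hi = BINS - 1;
    while (lo < hi && hist[lo] < len / 100) lo++;
    while (hi > lo && hist[hi] < len / 100) hi--;
    float dmin = min + lo * w, dmax = min + (hi + 1) * w;
    *nspike = 0;
    for (int i = 0; i < len; i++) {
        if (band[i] >= dmin && band[i] <= dmax) {
            inband[i] = 1;  /* dense part: quantize */
            q[i] = (unsigned char)((band[i] - dmin) / (dmax - dmin)
                                   * (n - 1) + 0.5f);
        } else {
            inband[i] = 0;  /* spike part: keep exact value */
            spike[(*nspike)++] = band[i];
        }
    }
}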
[Figures: (a) compression rate [%] for the original data, gzip, simple quantization (n = 128), and proposed quantization (n = 128); (b) average relative error [%] on the pressure array and (c) on the temperature array vs. division number (1-128), comparing simple and proposed quantization]
Target application: NICAM [M. Satoh, 2008], a climate simulation; pressure, temperature, and velocity fields; 3D arrays of 1156 × 82 × 2 over 720 steps.
Machine spec — CPU: Intel Core i7-3930K (6 cores, 3.20 GHz); memory: 16 GB.
This work was supported by JSPS KAKENHI Grant Number 23220003.
Checkpoint size is greatly reduced with low error in the targeted situations.
Highly Optimized Stencils Larger than GPU Memory
For extremely large stencil simulations, we implemented the temporal blocking (TB) technique and careful optimizations on GPUs [1][2]:
● Eliminating redundant computation
● Reducing the memory footprint of the TB algorithm
A generic sketch of the TB idea follows the figure below.
[Figure: 7-point stencil performance on a K20X GPU]
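The poster does not spell the TB algorithm out, so the following is a generic, hedged 1D illustration of temporal blocking with a 3-point stencil; the real implementations in [1][2] are multi-dimensional GPU versions whose optimizations remove the redundant halo recomputation visible here. B and T are illustrative.

#include <string.h>

#define B 256   /* spatial block width (illustrative)         */
#define T 4     /* time steps fused per block (illustrative)  */

/* Advance a[0..n-1] (fixed end points) by T steps of a 3-point stencil
 * into anew, touching each block of B cells only once: the block plus a
 * halo of width T is staged in a small buffer (the part that fits in
 * fast memory), stepped T times locally, and its interior written back.
 * Halo cells are recomputed redundantly by neighboring blocks. */
void tb_sweep(const float *a, float *anew, int n) {
    float cur[B + 2 * T], nxt[B + 2 * T];
    for (int s = 0; s < n; s += B) {
        int lo = s - T > 0 ? s - T : 0;
        int hi = s + B + T < n ? s + B + T : n;
        int w  = hi - lo;
        memcpy(cur, a + lo, w * sizeof(float));
        for (int t = 0; t < T; t++) {
            nxt[0]     = cur[0];       /* buffer edges held fixed */
            nxt[w - 1] = cur[w - 1];
            for (int i = 1; i < w - 1; i++)
                nxt[i] = 0.25f * cur[i-1] + 0.5f * cur[i] + 0.25f * cur[i+1];
            memcpy(cur, nxt, w * sizeof(float));
        }
        /* only the block interior is still valid after T steps */
        for (int i = s; i < s + B && i < n; i++)
            anew[i] = cur[i - lo];
    }
}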
HHRT: System Software for GPU Memory Swap
Execution model: for easier programming, we implemented system software named HHRT (Hybrid Hierarchical Runtime) [3].
● HHRT supports user programs written in MPI and CUDA with little modification
● Oversubscription-based execution model
● HHRT implicitly swaps memory between GPU device memory and host memory
A hedged sketch of the user's view follows.
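Since the poster says HHRT runs MPI + CUDA programs with little modification, the sketch below is just ordinary MPI + CUDA; the comments mark where, per the poster, HHRT's oversubscription and implicit swapping would act. The buffer size is illustrative and no HHRT-specific API is shown.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[]) {
    int rank;
    float *d_grid;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process allocates its sub-domain on the GPU. Under HHRT's
     * oversubscription, the total over co-located processes may exceed
     * device memory; HHRT swaps to host memory behind the scenes. */
    cudaMalloc((void **)&d_grid, (size_t)512 << 20);

    /* ... stencil kernels and halo exchange via MPI_Send/MPI_Recv;
     * while a process blocks in communication, HHRT may evict its
     * device memory so another process can run ... */

    cudaFree(d_grid);
    MPI_Finalize();
    return 0;
}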
Integration with Real Simulation Application
We integrated our techniques with a city airflow simulation. The original MPI+CUDA code was developed by Naoyuki Onodera, Tokyo Tech; we integrated TB into it and executed it on HHRT.
[Figure: Airflow simulation performance on a K20X GPU]
[1] G. Jin, T. Endo, S. Matsuoka. A Parallel Optimization Method for Stencil Computation on the Domain that is Bigger than Memory Capacity of GPUs. IEEE Cluster 2013.
[2] G. Jin, J. Lin, T. Endo. Efficient Utilization of Memory Hierarchy to Enable the Computation on Bigger Domains for Stencil Computation in CPU-GPU Based Systems. IEEE ICHPA 2014.
[3] T. Endo, G. Jin. Software Technologies Coping with Memory Hierarchy of GPGPU Clusters for Stencil Computations. IEEE Cluster 2014.
PI: Toshio Endo (endo@is.titech.ac.jp), supported by JST-CREST
[Figure: Execution model — without HHRT (typically), one process per node keeps its data in device memory, communicating via MPI and explicit cudaMemcpy; with HHRT, many processes per node are oversubscribed and each process's data is implicitly swapped between device memory and host memory, enabling simulations 8x larger than device memory]
Overview of Project
On exascale supercomputers, the "memory wall" problem will become even more severe, preventing the realization of extremely fast and big simulations. This project attacks the problem through a co-design approach spanning application algorithms, system software, and architecture.
Target Architecture
A deeper memory hierarchy consisting of heterogeneous memory devices.
Hybrid Memory Cube (HMC): DRAM chips stacked using TSV technology. It will have a bandwidth advantage over DDR, but its capacity will be smaller.
NAND Flash: SSDs are already commodity; newer products, such as IO-drive, reach O(GB/s) bandwidth.
Next-generation non-volatile RAM (NVRAM): several kinds of NVRAM, such as STT-MRAM, ReRAM, and FeRAM, will become available within a few years.
Target: realizing extremely fast & big simulations of O(100 PB/s) & O(100 PB) in the exascale era, by combining:
● communication/bandwidth-reducing algorithms
● system software for memory hierarchy management
● HPC architecture with hybrid memory devices (HMC, HBM, O(GB/s) Flash, next-gen NVM)
Fault Tolerant Infrastructure for Billion-Way Parallelization
Dealing with Deeper Memory Hierarchy