Feasibility Study of a Future HPC System for Memory-Intensive Applications
Hiroaki Kobayashi, Director and Professor
Cyberscience Center, Tohoku University ([email protected])
SC13 NEC Booth Presentation, November 19, 2013
Denver, U.S.A.
MEXT New Program toward Exascale Computing: Feasibility Study of Future HPCI Systems
The objectives of the program are to
• discuss and clarify social and scientific demands for HPC in the next five to ten years in Japan,
• complete conceptual designs of future high-end systems capable of satisfying those demands, and
• investigate hardware and software key technologies for developing the systems defined above, making them available around 2018.
It is a 2-year program; three system teams and one application team have been selected:
• Tohoku Univ., NEC, and JAMSTEC: a team that investigates the feasibility of a multi-vector-core architecture with high memory bandwidth
• Univ. of Tokyo and Fujitsu: a team that investigates the feasibility of a K-Computer-compatible(?) many-core architecture
• Tsukuba and Hitachi: a team that investigates the feasibility of an accelerator-based architecture
• Riken-TokyoTech: an application team that investigates the social and scientific demands of 2020 and designs the roadmap of R&D on target applications satisfying those demands
Fallacy 1 in HPC: Peak Performance Tracks Observed Performance (Also, Pitfall 1 in HPC: Accelerating Peak Performance for Increasing Sustained Performance)
[Figure: roofline chart of attainable performance (Gflop/s, log scale from 0.5 to 256) versus application B/F (memory access intensity, from 8 down to 0.01). Plotted systems (system B/F, peak performance, and measured Stream BW where given): SX-9 2.5 B/F (102.4 Gflop/s); SX-ACE 1 B/F (256 Gflop/s), Stream BW 256 GB/s; SX-8R 2 B/F (35.2 Gflop/s); Tesla C1060 1.3 B/F (78 Gflop/s), Stream BW 72.95 GB/s; Power7 0.52 B/F (245.1 Gflop/s), Stream BW 58.61 GB/s; Nehalem EX 0.47 B/F (72.48 Gflop/s), Stream BW 17.6 GB/s; Nehalem EP 0.55 B/F (46.93 Gflop/s), Stream BW 17.0 GB/s; Sandy Bridge 0.27 B/F (187.5 Gflop/s), Stream BW 34.8 GB/s; FX-1 1.0 B/F (40.32 Gflop/s) and FX-10 0.36 B/F (236.5 Gflop/s), Stream BW 10.0 and 43.3 GB/s; K computer 0.5 B/F (128 Gflop/s), Stream BW 64.7 GB/s. The high-B/F side is labeled "For memory-intensive applications", the low-B/F side "For computation-intensive applications".]
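The chart encodes the standard roofline argument: for an application demanding a given number of bytes per flop, attainable performance is the smaller of the machine's peak flop/s and its sustained memory bandwidth divided by that B/F demand. A minimal sketch of the calculation, using values from the chart (the function and the choice of systems are illustrative, not from the talk):

```python
# Roofline estimate: attainable flop/s is limited either by the peak
# arithmetic rate or by how fast memory can feed the required bytes.
def attainable_gflops(peak_gflops, stream_bw_gbs, app_bf):
    """app_bf: bytes the application needs per floating-point operation."""
    return min(peak_gflops, stream_bw_gbs / app_bf)

systems = {
    "SX-ACE":       (256.0, 256.0),   # (peak Gflop/s, Stream BW GB/s)
    "Sandy Bridge": (187.5, 34.8),
    "K computer":   (128.0, 64.7),
}

# A memory-intensive code needing 2 bytes per flop:
for name, (peak, bw) in systems.items():
    gf = attainable_gflops(peak, bw, app_bf=2.0)
    print(f"{name:>12}: {gf:6.1f} Gflop/s attainable")
# SX-ACE sustains 128 Gflop/s, while Sandy Bridge sustains only 17.4
# despite a higher peak than the K computer: peak does not track observed.
```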
Fallacy 2: Massively Parallel Processors are a Silver Bullet (Also, Pitfall 2: A Large-Scale Aggregation of Fine-Grain Cores with Limited Memory BW)
[Figure, left: sustained Gflop/s (0 to 1000) versus parallel efficiency (90 to 100%) for two 1 Tflop/s socket configurations, 100 Gflop/s cores × 10 per socket and 10 Gflop/s cores × 100 per socket, with reference lines at 20% and 10% efficiency.]
Scalability of direct numerical simulation of MHD turbulent flow for a fusion reactor design on SX-9 and the K computer:
Prof. Y. Yamamoto, Yamanashi Univ., Progress Report of JHPCN12-NA14, http://jhpcn-kyoten.itc.u-tokyo.ac.jp/ja/docH24/FinalRep/JHPCN12-NA14_FinalRep.pdf
[Figure, right: sustained Tflop/s (0 to 35) versus the number of nodes of the K computer and FX10 (0 to 8192) and the number of CPUs of SX-9 (Tohoku & ES, 0 to 2048), for SX-9 (1D), K computer (1D and 2D), and FX10 (1D); and sustained Tflop/s versus peak Tflop/s (0 to 600), with reference lines at 5% and 2.5% efficiency.]
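If the curves in the first plot follow Amdahl's law with a parallelized fraction f (my assumption; the slide does not state the model), the advantage of a few fat cores over many fine-grain cores falls out directly:

```python
# Amdahl's-law view of the two 1 Tflop/s socket designs from the plot.
# Assumption: sustained = core_perf * 1 / ((1 - f) + f / n_cores),
# where f is the parallelized fraction of the work.
def sustained_gflops(core_gflops, n_cores, f):
    speedup = 1.0 / ((1.0 - f) + f / n_cores)
    return core_gflops * speedup

for f in (0.90, 0.95, 0.99):
    fat  = sustained_gflops(100.0, 10, f)    # 100 Gflop/s x 10 cores
    fine = sustained_gflops(10.0, 100, f)    # 10 Gflop/s x 100 cores
    print(f"f={f:.2f}: 10 fat cores {fat:6.1f} Gflop/s, "
          f"100 fine cores {fine:6.1f} Gflop/s")
# Even at f=0.99 the fine-grain socket sustains ~500 Gflop/s versus ~900
# for the fat-core socket: massive parallelism alone is no silver bullet.
```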
Flops Racing without Users??? ~Many Important Applications Need 1B/F or More!~
Balanced improvement in both flop/s and memory BW is needed for high efficiency across wide application areas; a flop/s-oriented, memory-limited design restricts the application areas served.
[Figure: the same roofline chart of attainable performance (Gflop/s, 0.5 to 256, log scale) versus application B/F (8 down to 0.01), annotated with the application spectrum, sustained performance, and LINPACK. Plotted systems: SX-9 2.5 B/F (102.4 Gflop/s); NGV 1 B/F (256 Gflop/s), Stream BW 256 GB/s; SX-8 4 B/F (35.2 Gflop/s); Tesla C1060 1.3 B/F (78 Gflop/s), Stream BW 72.95 GB/s; Power7 0.52 B/F (245.1 Gflop/s), Stream BW 58.61 GB/s; Nehalem EX 0.47 B/F (72.48 Gflop/s), Stream BW 17.6 GB/s; Nehalem EP 0.55 B/F (46.93 Gflop/s), Stream BW 17.0 GB/s; Sandy Bridge 0.27 B/F (187.5 Gflop/s), Stream BW 34.8 GB/s; FX-1 1.0 B/F (40.32 Gflop/s) and FX-10 0.36 B/F (236.5 Gflop/s), Stream BW 10.0 and 43.3 GB/s; K computer 0.5 B/F (128 Gflop/s), Stream BW 64.7 GB/s. The high-B/F region marks our target, a high-B/F-oriented design; the low-B/F region marks the flop/s-oriented, memory-limited designs.]
Team Organization
Holistic collaboration of application, architecture, and device researchers and engineers
PI: Hiroaki Kobayashi, Tohoku University
Groups: Application Research Group, System Research Group, Device Technologies Research Group
• Tohoku University (Leader: Hiroaki Kobayashi)
• JAMSTEC (Co-Leaders: Yoshiyuki Kaneda & Kunihiko Watanabe): Takahashi, Hori, Itakura, Uehara, et al.
• NEC (Leader: Yukiko Hashimoto): Hagiwara, Momose, Musa, Watanabe, Hayashi, Nakazato, et al.
• Imamura, Koshimura, Yamamoto, Matsuoka, Toyokuni, et al.: Applications
• Egawa: Architecture
• Takizawa: System Software
• Muraoka: Storage
• Koyanagi, et al.: 3D Device
• Hanyu: NV Memory
• Furumura, Hori, et al., University of Tokyo: Applications
• Nakahashi, JAXA: Applications
• Hasegawa, Arakawa, Osaka University: Network
• Sato, JAIST: System Software
• Yokokawa, Uno, Riken: System Software, I/O & Storage
• Motoyoshi, Tohoku MicroTec: 3D Device Fabrication
• Sano: Network
• Komatsu: Benchmark
IRIDeS
Social and Scientific Demands: Compound Analysis for Disaster Mitigation by HPC
[Diagram: scientific prediction for disaster prevention and mitigation.]
• Earthquake occurrence
• Seismic wave/tsunami propagation
• Extreme weather: torrential rain
• Prediction of disaster scenarios
• Building motion: assessment of structural impact, damage by drifted objects
• Damage prediction: structural damage, flooded area
• Evacuation prediction
Compound Analysis for Disaster Mitigation and Coupled Simulation Flow
[Diagram: parallel process mapping over time (less than 3 hours) of the coupled simulation components: RSGDX (earthquake), Seism3D (strong motion), STOC-CADMAS (tsunami/land flooding), MMA1 (macro-scale ground vibration), and MMA2 (micro-scale ground vibration).]
Memory, file I/O, and network are also key components, in addition to flop/s units!
Social and Scientific Demands: Highly Productive Engineering by HPC
[Diagram: design flow from concept design through detailed design to product, replacing experiment (higher cost) with simulation (lower cost), with feedback to manufacturing.]
Simulations can reduce both design cost and design time. Utilizing digital design will boost innovation in industry, enhancing the global competitiveness of Japanese industries.
• Exploring a wider design space
• Reducing experiments with models
• Improving reliability, safety, and productivity
• Providing greener products and energy saving
Airplane design:
• Enabling digital flight (simulating steady to unsteady phenomena)
• Developing lower-noise airplanes (aerodynamic acoustic analysis)
Design of turbo machinery:
• Developing more efficient turbines (thermal flow analysis of an entire turbine)
• Simulation by multi-physics CFD (phase change, erosion/corrosion, cracks)
Both require multi-scale simulations covering macro flows and micro phenomena.
Exascale Computing Requirements in 2020
Purpose / Application / B/F / Required Memory (TB) / Required Floating-Point Operations (×10^18) / Expected Execution Time (hours)

Single ultra-high-resolution simulation:
• RSGDX: 8.00 B/F, 14 TB, 520, 24 h
• Seism3D: 2.14 B/F, 2,900 TB, 1,000, 8 h
• MSSG: 4.00 B/F, 175 TB, 720, 6 h
• BCM: 5.47 B/F, 13.6 TB, 15, 0.5 h

Ensemble computations of moderate-resolution simulations:
• Compound Analysis of Natural Disaster (1,000 cases): 2.1-8.0 B/F, 98 TB, 25×1,000, 3×1,000 h (≈4 months)
• Numerical Turbine (20 cases): 2.33 B/F, 163.5 TB, 140, 20 h

Target applications: Compound Simulation of Natural Disaster, comprising earthquake occurrence (RSGDX), strong motion propagation (Seism3D), tsunami/land flooding (STOC-CADMAS), and ground vibration (MMA); Data Assimilation (CDA); Global/Urban Weather (MSSG); Aerodynamics, incompressible flow (BCM); Aerodynamics, compressible flow (BCM); Aeroacoustics (BCM-LEE); Turbo Machinery (Numerical Turbine); Evacuation Guidance (MAS).
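Each single-run row implies a sustained performance target: the required operations divided by the allowed time, and a memory bandwidth equal to that rate times the application's B/F. A quick arithmetic check (my calculation, not given on the slide):

```python
# Required sustained rate = required operations / allowed time;
# required memory BW = sustained rate * application B/F.
rows = {  # name: (B/F, required ops in units of 1e18, hours)
    "RSGDX":   (8.00, 520.0, 24.0),
    "Seism3D": (2.14, 1000.0, 8.0),
    "MSSG":    (4.00, 720.0, 6.0),
    "BCM":     (5.47, 15.0, 0.5),
}
for name, (bf, exa_ops, hours) in rows.items():
    pflops = exa_ops * 1e18 / (hours * 3600) / 1e15   # sustained Pflop/s
    pbs = pflops * bf                                  # sustained PB/s
    print(f"{name:>7}: {pflops:5.1f} Pflop/s sustained, {pbs:6.1f} PB/s")
# Seism3D alone needs ~35 Pflop/s sustained at ~74 PB/s of memory
# bandwidth; these are sustained figures, so the peak must be larger still.
```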
Goals and Strategies for our Target System Design
Memory issues: high BW and balanced B/F
• High memory BW of 1-2 B/F at low power, using advanced device technologies such as 2.5D/3D die stacking
Vector processing issues: efficient short-vector and list-vector processing
• An advanced vector architecture hiding the short-vector penalty, supported by a large on-chip vector load/store buffer at 4 B/F with a random-access mechanism
Node issues: large-grain nodes for reducing the total number of MPI processes
• High-performance nodes composed of a small number of large cores of 256 Gflop/s each
Network issues: well-balanced local/neighboring and global communication capability
• A hierarchical network with high-radix switches
Storage/IO issues: a scalable storage system for data assimilation and a checkpointing/restart mechanism
• A hierarchical distributed storage system with locally high BW and globally large shared storage capacity
System software issues: compliance with standard programming models and tools; new functionalities for fault tolerance and resource-aware/power-aware job management
• Linux with advanced functionalities
Target Architecture of the Tohoku-NEC-JAMSTEC Team ~Keep 1 B/F or More!~
[Diagram: nodes connected by a hierarchical network to a hierarchical distributed storage system. Each 4 Tflop/s node contains four 1 Tflop/s CPUs (CPU0 to CPU3) on 2.5D/3D die-stacked shared memory. Each CPU has four cores (1 Tflop/s = 256 Gflop/s × 4 cores), a 32 MB vector load/store buffer (VLSB) at ~1 TB/s per core (4 B/F), and ~256 GB of 2.5D/3D die-stacked shared memory at ~2 TB/s (2 B/F).]
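These figures are mutually consistent: each bandwidth is just the B/F target times the corresponding peak rate. A small sketch of the check (variable names are mine):

```python
# B/F targets imply bandwidths: BW (bytes/s) = B/F * flop/s.
CORE_GFLOPS = 256                 # one core
CPU_GFLOPS = 4 * CORE_GFLOPS      # 4 cores = 1 Tflop/s per CPU
NODE_GFLOPS = 4 * CPU_GFLOPS      # 4 CPUs = 4 Tflop/s per node

vlsb_bw_per_core = 4.0 * CORE_GFLOPS   # 4 B/F -> GB/s from the VLSB
mem_bw_per_cpu = 2.0 * CPU_GFLOPS      # 2 B/F -> GB/s from stacked memory

print(f"VLSB per core:  {vlsb_bw_per_core / 1000:.2f} TB/s")  # ~1 TB/s
print(f"Memory per CPU: {mem_bw_per_cpu / 1000:.2f} TB/s")    # ~2 TB/s
print(f"Node peak:      {NODE_GFLOPS / 1000:.0f} Tflop/s")    # 4 Tflop/s
```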
Preliminary Evaluation of the Target System
[Table (heavily garbled in extraction; recovered structure): columns are Application, Required Floating-Point Operations (×10^18), Required Memory, Expected Execution Time (hours), Estimated Execution Time (hours), and Required No. of CPUs (peak flop/s), with the VLSB and 3D memory subsystem. Single-run rows: RSGDX, Seism3D, MSSG, and BCM. Ensemble-run rows: Compound Simulation of Natural Disaster (1,000-case ensemble) and Numerical Turbine (20-case ensemble). The numeric entries are illegible in the extracted text.]
Performance Comparison of Our Target System with a Commodity-Based Future System
[Diagram comparing one node of each system.]
Our system: SMP (UMA) architecture; 4 TF, ~8 TB/s (~2 B/F), ~512 GB, 1-1.6 kW per node
• Each socket: 1 TF (four 256 GF cores) with a 32 MB VLSB at 2-4 B/F; 128-256 GB shared memory (SMP) at 1-2 TB/s; high-dimensional torus network, 40 GB/s × 2 per node
Commodity-based system: NUMA architecture; 3.3 TF, 0.2 TB/s (0.1 B/F), 128 GB × 2, 0.4 kW per node
• Each socket: 1.6 TF (18 cores of 90 GF) with a 32 MB cache (annotated 1 B/F, 0.05 TB/s); memory DDR4-3200 × 4 channels at 0.1 TB/s, 128 GB × 2 (NUMA); 20 GB/s inter-socket link; fat-tree network, 10 GB/s × 2 per node
Performance Estimation (1/2)
For the same number of processes (100,000), performance normalized to the commodity-based system:
◆ 4x-8x speedup per watt for Seism3D and RSGDX because of their high B/F requirements

                          Our System    Commodity-Based    Ratio
Threads/proc              4             4
Peak perf./proc           1 TF          0.36 TF            2.8x
Peak perf./100,000 proc   100 PF        35.84 PF           2.8x
Nodes/100,000 proc        25,000        11,111             2.3x
Total memory BW           200 PB/s      3.4 PB/s           59x
Total cache capacity      3.2 TB        0.8 TB             4x

[Bar charts: speedup of our system over the future commodity-based system, Seism3D 72.8x and RSGDX 43.0x; speedup per watt, Seism3D 8.1x and RSGDX 4.8x.]
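The per-watt bars are consistent with the node counts in the table and the node powers quoted on the previous slide (1.6 kW and 0.4 kW); a reconstruction of the arithmetic (mine, not shown in the talk):

```python
# Speedup/W = speedup * (commodity system power / our system power),
# using node counts from the table and per-node powers from the
# architecture-comparison slide.
our_mw = 25_000 * 1.6 / 1000        # 25,000 nodes x 1.6 kW = 40 MW
commodity_mw = 11_111 * 0.4 / 1000  # 11,111 nodes x 0.4 kW = 4.4 MW

for app, speedup in (("Seism3D", 72.8), ("RSGDX", 43.0)):
    per_watt = speedup * commodity_mw / our_mw
    print(f"{app:>8}: {per_watt:.1f}x speedup per watt")
# ~8.1x and ~4.8x, matching the bar chart.
```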
Performance Estimation (2/2)
Scalability analysis when increasing the number of processes (Seism3D):
[Chart: execution time (s) versus number of processes. Our system: 5.8, 2.9, 1.4, 0.7, 0.4, and 0.1 s at 102K, 204K, 409K, 819K, 1,638K, and 6,553K processes (30-40 MW). Future commodity-based system: 538.9, 250.6, 117.5, 52.1, 25.6, and 6.5 s (290 MW at the largest scale). The Xeon-based system scales up to about 1M processes, then its scalability stalls.]
The Xeon-based system needs 6.4M processes (64x more than our system) to achieve performance equivalent to ours, but it would consume 291 MW, which is impractical. Its large on-chip cache works well up to 1M processes, but memory accesses without locality become dominant beyond that point.
Summary
Well-balanced HEC systems, in terms of memory performance, are still the key to high productivity in science and engineering in the post-petascale era.
We explore the great potential of the new-generation vector architecture for future HPC systems, together with new device technologies such as 2.5D/3D die stacking.
✴ High sustained memory BW to fuel the vector function units at lower power/energy.
The on-chip vector load/store buffer can boost sustained memory bandwidth energy-efficiently.
When will such new technologies be available as production services?
✴ Design tools, fabrication, and markets steer the future of these technologies!
Acknowledgements

Riken AICS
• Mitsuo Yokokawa
• Atsuya Uno
University of Tokyo
• Takashi Furumura
• Muneo Hori
JAXA
• Kazuhiro Nakahashi
JAIST
• Yukinori Sato
Osaka University
• Go Hasegawa
• Shinichi Arakawa
JMA
• Hideo Tada
• Chiashi Muroi
• Keiichi Katayama
• Eiji Toyoda
• Junichi Ishida
• Takafumi Kanehama
Meisei University
• Kanji Otsuka
Tohoku University
• Hiroyuki Takizawa
• Kentaro Sano
• Ryusuke Egawa
• Kazuhiko Komatsu
• Mitsumasa Koyanagi
• Shunichi Koshimura
• Fumihiko Iwamura
• Takafumi Fukushima
• Kangwook Lee
• Hiroaki Muraoka
• Takahiro Hanyu
• Hiroshi Matsuoka
JAMSTEC
• Yoshiyuki Kaneda
• Kunihiko Watanabe
• Keiko Takahashi
• Takane Hori
• Mamoru Hyodo
• Kenichi Itakura
• Hitoshi Uehara
NEC
• Yukiko Hashimoto
• Yoko Isobe
• Akihiko Musa
• Takashi Soga
• Osamu Watanabe
• Akira Azami
• Takashi Hagiwara
• Shintaro Momose
• Yasuharu Hayashi
• Umezawa Kazuhiko
• Other NEC Engineers
Tohoku Micro-Tech
• Makoto Motoyoshi
Disclaimer: Information provided in this talk does not reflect any future NEC products.
For more details, see Booth 607 of Tohoku University