Feasibility Study of a Future HPC System for Memory-Intensive Applications
Hiroaki Kobayashi, Director and Professor
Cyberscience Center, Tohoku University ([email protected])
SC13 NEC Booth Presentation, November 19, 2013
Denver, U.S.A.
MEXT New Program toward Exascale Computing: Feasibility Study of Future HPCI Systems
The objectives of the program are to
• discuss and clarify social and scientific demands for HPC in the next five to ten years in Japan,
• complete conceptual designs of future high-end systems capable of satisfying those demands, and
• investigate hardware and software key technologies for developing the systems defined above, making them available around 2018.
It is a 2-year program; three system teams and one application team have been selected:
• Tohoku Univ., NEC, and JAMSTEC: a team that investigates the feasibility of a multi-vector-core architecture with high memory bandwidth
• Univ. of Tokyo and Fujitsu: a team that investigates the feasibility of a K-Computer-compatible(?) many-core architecture
• Tsukuba and Hitachi: a team that investigates the feasibility of an accelerator-based architecture
• Riken-TokyoTech: an application team that investigates the social and scientific demands of 2020 and designs the roadmap of R&D on target applications satisfying those demands
Fallacy 1 in HPC: Peak Performance Tracks Observed Performance (Also, Pitfall 1 in HPC: Accelerating Peak Performance for Increasing Sustained Performance)
[Figure: roofline chart of attainable performance (Gflop/s, log scale from 0.5 to 256) versus application B/F (memory access intensity, from 8 down to 0.01). Plotted systems (system B/F, peak performance, and measured Stream BW where given): SX-9 2.5 B/F (102.4 Gflop/s); SX-ACE 1 B/F (256 Gflop/s), Stream BW 256 GB/s; SX-8R 2 B/F (35.2 Gflop/s); Tesla C1060 1.3 B/F (78 Gflop/s), Stream BW 72.95 GB/s; Power7 0.52 B/F (245.1 Gflop/s), Stream BW 58.61 GB/s; Nehalem EX 0.47 B/F (72.48 Gflop/s), Stream BW 17.6 GB/s; Nehalem EP 0.55 B/F (46.93 Gflop/s), Stream BW 17.0 GB/s; Sandy Bridge 0.27 B/F (187.5 Gflop/s), Stream BW 34.8 GB/s; FX-1 1.0 B/F (40.32 Gflop/s) and FX-10 0.36 B/F (236.5 Gflop/s), Stream BW 10.0 and 43.3 GB/s; K computer 0.5 B/F (128 Gflop/s), Stream BW 64.7 GB/s. The high-B/F side is labeled "For memory-intensive applications", the low-B/F side "For computation-intensive applications".]
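The chart encodes the standard roofline argument: for an application demanding a given number of bytes per flop, attainable performance is the smaller of the machine's peak flop/s and its sustained memory bandwidth divided by that B/F demand. A minimal sketch of the calculation, using values from the chart (the function and the choice of systems are illustrative, not from the talk):

```python
# Roofline estimate: attainable flop/s is limited either by the peak
# arithmetic rate or by how fast memory can feed the required bytes.
def attainable_gflops(peak_gflops, stream_bw_gbs, app_bf):
    """app_bf: bytes the application needs per floating-point operation."""
    return min(peak_gflops, stream_bw_gbs / app_bf)

systems = {
    "SX-ACE":       (256.0, 256.0),   # (peak Gflop/s, Stream BW GB/s)
    "Sandy Bridge": (187.5, 34.8),
    "K computer":   (128.0, 64.7),
}

# A memory-intensive code needing 2 bytes per flop:
for name, (peak, bw) in systems.items():
    gf = attainable_gflops(peak, bw, app_bf=2.0)
    print(f"{name:>12}: {gf:6.1f} Gflop/s attainable")
# SX-ACE sustains 128 Gflop/s, while Sandy Bridge sustains only 17.4
# despite a higher peak than the K computer: peak does not track observed.
```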
Fallacy 2: Massively Parallel Processors are a Silver Bullet (Also, Pitfall 2: A Large-Scale Aggregation of Fine-Grain Cores with Limited Memory BW)
[Figure, left: sustained Gflop/s (0 to 1000) versus parallel efficiency (90 to 100%) for two 1 Tflop/s socket configurations, 100 Gflop/s cores × 10 per socket and 10 Gflop/s cores × 100 per socket, with reference lines at 20% and 10% efficiency.]
Scalability of direct numerical simulation of MHD turbulent flow for a fusion reactor design on SX-9 and the K computer:
Prof. Y. Yamamoto, Yamanashi Univ., Progress Report of JHPCN12-NA14, http://jhpcn-kyoten.itc.u-tokyo.ac.jp/ja/docH24/FinalRep/JHPCN12-NA14_FinalRep.pdf
[Figure, right: sustained Tflop/s (0 to 35) versus the number of nodes of the K computer and FX10 (0 to 8192) and the number of CPUs of SX-9 (Tohoku & ES, 0 to 2048), for SX-9 (1D), K computer (1D and 2D), and FX10 (1D); and sustained Tflop/s versus peak Tflop/s (0 to 600), with reference lines at 5% and 2.5% efficiency.]
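If the curves in the first plot follow Amdahl's law with a parallelized fraction f (my assumption; the slide does not state the model), the advantage of a few fat cores over many fine-grain cores falls out directly:

```python
# Amdahl's-law view of the two 1 Tflop/s socket designs from the plot.
# Assumption: sustained = core_perf * 1 / ((1 - f) + f / n_cores),
# where f is the parallelized fraction of the work.
def sustained_gflops(core_gflops, n_cores, f):
    speedup = 1.0 / ((1.0 - f) + f / n_cores)
    return core_gflops * speedup

for f in (0.90, 0.95, 0.99):
    fat  = sustained_gflops(100.0, 10, f)    # 100 Gflop/s x 10 cores
    fine = sustained_gflops(10.0, 100, f)    # 10 Gflop/s x 100 cores
    print(f"f={f:.2f}: 10 fat cores {fat:6.1f} Gflop/s, "
          f"100 fine cores {fine:6.1f} Gflop/s")
# Even at f=0.99 the fine-grain socket sustains ~500 Gflop/s versus ~900
# for the fat-core socket: massive parallelism alone is no silver bullet.
```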
Flops Racing without Users??? ~Many Important Applications Need 1B/F or More!~
Balanced improvement in both flop/s and memory BW is needed for high efficiency across wide application areas; a flop/s-oriented, memory-limited design restricts the application areas served.
[Figure: the same roofline chart of attainable performance (Gflop/s, 0.5 to 256, log scale) versus application B/F (8 down to 0.01), annotated with the application spectrum, sustained performance, and LINPACK. Plotted systems: SX-9 2.5 B/F (102.4 Gflop/s); NGV 1 B/F (256 Gflop/s), Stream BW 256 GB/s; SX-8 4 B/F (35.2 Gflop/s); Tesla C1060 1.3 B/F (78 Gflop/s), Stream BW 72.95 GB/s; Power7 0.52 B/F (245.1 Gflop/s), Stream BW 58.61 GB/s; Nehalem EX 0.47 B/F (72.48 Gflop/s), Stream BW 17.6 GB/s; Nehalem EP 0.55 B/F (46.93 Gflop/s), Stream BW 17.0 GB/s; Sandy Bridge 0.27 B/F (187.5 Gflop/s), Stream BW 34.8 GB/s; FX-1 1.0 B/F (40.32 Gflop/s) and FX-10 0.36 B/F (236.5 Gflop/s), Stream BW 10.0 and 43.3 GB/s; K computer 0.5 B/F (128 Gflop/s), Stream BW 64.7 GB/s. The high-B/F region marks our target, a high-B/F-oriented design; the low-B/F region marks the flop/s-oriented, memory-limited designs.]
Team Organization
Holistic collaboration of application, architecture, and device researchers and engineers
PI: Hiroaki Kobayashi, Tohoku University
Groups: Application Research Group, System Research Group, Device Technologies Research Group
• Tohoku University (Leader: Hiroaki Kobayashi)
• JAMSTEC (Co-Leaders: Yoshiyuki Kaneda & Kunihiko Watanabe): Takahashi, Hori, Itakura, Uehara, et al.
• NEC (Leader: Yukiko Hashimoto): Hagiwara, Momose, Musa, Watanabe, Hayashi, Nakazato, et al.
• Imamura, Koshimura, Yamamoto, Matsuoka, Toyokuni, et al.: Applications
• Egawa: Architecture
• Takizawa: System Software
• Muraoka: Storage
• Koyanagi, et al.: 3D Device
• Hanyu: NV Memory
• Furumura, Hori, et al., University of Tokyo: Applications
• Nakahashi, JAXA: Applications
• Hasegawa, Arakawa, Osaka University: Network
• Sato, JAIST: System Software
• Yokokawa, Uno, Riken: System Software, I/O & Storage
• Motoyoshi, Tohoku MicroTec: 3D Device Fabrication
• Sano: Network
• Komatsu: Benchmark
IRIDeS
Social and Scientific Demands: Compound Analysis for Disaster Mitigation by HPC
[Diagram: scientific prediction for disaster prevention and mitigation.]
• Earthquake occurrence
• Seismic wave/tsunami propagation
• Extreme weather: torrential rain
• Prediction of disaster scenarios
• Building motion: assessment of structural impact, damage by drifted objects
• Damage prediction: structural damage, flooded area
• Evacuation prediction
Compound Analysis for Disaster Mitigation and Coupled Simulation Flow
[Diagram: parallel process mapping over time (less than 3 hours) of the coupled simulation components: RSGDX (earthquake), Seism3D (strong motion), STOC-CADMAS (tsunami/land flooding), MMA1 (macro-scale ground vibration), and MMA2 (micro-scale ground vibration).]
Memory, file I/O, and network are also key components, in addition to flop/s units!
Social and Scientific Demands: Highly Productive Engineering by HPC
[Diagram: design flow from concept design through detailed design to product, replacing experiment (higher cost) with simulation (lower cost), with feedback to manufacturing.]
Simulations can reduce both design cost and design time. Utilizing digital design will boost innovation in industry, enhancing the global competitiveness of Japanese industries.
• Exploring a wider design space
• Reducing experiments with models
• Improving reliability, safety, and productivity
• Providing greener products and energy saving
Airplane design:
• Enabling digital flight (simulating steady to unsteady phenomena)
• Developing lower-noise airplanes (aerodynamic acoustic analysis)
Design of turbo machinery:
• Developing more efficient turbines (thermal flow analysis of an entire turbine)
• Simulation by multi-physics CFD (phase change, erosion/corrosion, cracks)
Both require multi-scale simulations covering macro flows and micro phenomena.
Exascale Computing Requirements in 2020
Purpose / Application / B/F / Required Memory (TB) / Required Floating-Point Operations (×10^18) / Expected Execution Time (hours)

Single ultra-high-resolution simulation:
• RSGDX: 8.00 B/F, 14 TB, 520, 24 h
• Seism3D: 2.14 B/F, 2,900 TB, 1,000, 8 h
• MSSG: 4.00 B/F, 175 TB, 720, 6 h
• BCM: 5.47 B/F, 13.6 TB, 15, 0.5 h

Ensemble computations of moderate-resolution simulations:
• Compound Analysis of Natural Disaster (1,000 cases): 2.1-8.0 B/F, 98 TB, 25×1,000, 3×1,000 h (≈4 months)
• Numerical Turbine (20 cases): 2.33 B/F, 163.5 TB, 140, 20 h

Target applications: Compound Simulation of Natural Disaster, comprising earthquake occurrence (RSGDX), strong motion propagation (Seism3D), tsunami/land flooding (STOC-CADMAS), and ground vibration (MMA); Data Assimilation (CDA); Global/Urban Weather (MSSG); Aerodynamics, incompressible flow (BCM); Aerodynamics, compressible flow (BCM); Aeroacoustics (BCM-LEE); Turbo Machinery (Numerical Turbine); Evacuation Guidance (MAS).
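Each single-run row implies a sustained performance target: the required operations divided by the allowed time, and a memory bandwidth equal to that rate times the application's B/F. A quick arithmetic check (my calculation, not given on the slide):

```python
# Required sustained rate = required operations / allowed time;
# required memory BW = sustained rate * application B/F.
rows = {  # name: (B/F, required ops in units of 1e18, hours)
    "RSGDX":   (8.00, 520.0, 24.0),
    "Seism3D": (2.14, 1000.0, 8.0),
    "MSSG":    (4.00, 720.0, 6.0),
    "BCM":     (5.47, 15.0, 0.5),
}
for name, (bf, exa_ops, hours) in rows.items():
    pflops = exa_ops * 1e18 / (hours * 3600) / 1e15   # sustained Pflop/s
    pbs = pflops * bf                                  # sustained PB/s
    print(f"{name:>7}: {pflops:5.1f} Pflop/s sustained, {pbs:6.1f} PB/s")
# Seism3D alone needs ~35 Pflop/s sustained at ~74 PB/s of memory
# bandwidth; these are sustained figures, so the peak must be larger still.
```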
Goals and Strategies for our Target System Design
Memory issues: high BW and balanced B/F
• High memory BW of 1-2 B/F at low power, using advanced device technologies such as 2.5D/3D die stacking
Vector processing issues: efficient short-vector and list-vector processing
• An advanced vector architecture hiding the short-vector penalty, supported by a large on-chip vector load/store buffer at 4 B/F with a random-access mechanism
Node issues: large-grain nodes for reducing the total number of MPI processes
• High-performance nodes composed of a small number of large cores of 256 Gflop/s each
Network issues: well-balanced local/neighboring and global communication capability
• A hierarchical network with high-radix switches
Storage/IO issues: a scalable storage system for data assimilation and a checkpointing/restart mechanism
• A hierarchical distributed storage system with locally high BW and globally large shared storage capacity
System software issues: compliance with standard programming models and tools; new functionalities for fault tolerance and resource-aware/power-aware job management
• Linux with advanced functionalities
Target Architecture of the Tohoku-NEC-JAMSTEC Team ~Keep 1 B/F or More!~
[Diagram: nodes connected by a hierarchical network to a hierarchical distributed storage system. Each 4 Tflop/s node contains four 1 Tflop/s CPUs (CPU0 to CPU3) on 2.5D/3D die-stacked shared memory. Each CPU has four cores (1 Tflop/s = 256 Gflop/s × 4 cores), a 32 MB vector load/store buffer (VLSB) at ~1 TB/s per core (4 B/F), and ~256 GB of 2.5D/3D die-stacked shared memory at ~2 TB/s (2 B/F).]
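These figures are mutually consistent: each bandwidth is just the B/F target times the corresponding peak rate. A small sketch of the check (variable names are mine):

```python
# B/F targets imply bandwidths: BW (bytes/s) = B/F * flop/s.
CORE_GFLOPS = 256                 # one core
CPU_GFLOPS = 4 * CORE_GFLOPS      # 4 cores = 1 Tflop/s per CPU
NODE_GFLOPS = 4 * CPU_GFLOPS      # 4 CPUs = 4 Tflop/s per node

vlsb_bw_per_core = 4.0 * CORE_GFLOPS   # 4 B/F -> GB/s from the VLSB
mem_bw_per_cpu = 2.0 * CPU_GFLOPS      # 2 B/F -> GB/s from stacked memory

print(f"VLSB per core:  {vlsb_bw_per_core / 1000:.2f} TB/s")  # ~1 TB/s
print(f"Memory per CPU: {mem_bw_per_cpu / 1000:.2f} TB/s")    # ~2 TB/s
print(f"Node peak:      {NODE_GFLOPS / 1000:.0f} Tflop/s")    # 4 Tflop/s
```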
Preliminary Evaluation of the Target System
[Table (heavily garbled in extraction; recovered structure): columns are Application, Required Floating-Point Operations (×10^18), Required Memory, Expected Execution Time (hours), Estimated Execution Time (hours), and Required No. of CPUs (peak flop/s), with the VLSB and 3D memory subsystem. Single-run rows: RSGDX, Seism3D, MSSG, and BCM. Ensemble-run rows: Compound Simulation of Natural Disaster (1,000-case ensemble) and Numerical Turbine (20-case ensemble). The numeric entries are illegible in the extracted text.]
Performance Comparison of Our Target System with a Commodity-Based Future System
[Diagram comparing one node of each system.]
Our system: SMP (UMA) architecture; 4 TF, ~8 TB/s (~2 B/F), ~512 GB, 1-1.6 kW per node
• Each socket: 1 TF (four 256 GF cores) with a 32 MB VLSB at 2-4 B/F; 128-256 GB shared memory (SMP) at 1-2 TB/s; high-dimensional torus network, 40 GB/s × 2 per node
Commodity-based system: NUMA architecture; 3.3 TF, 0.2 TB/s (0.1 B/F), 128 GB × 2, 0.4 kW per node
• Each socket: 1.6 TF (18 cores of 90 GF) with a 32 MB cache (annotated 1 B/F, 0.05 TB/s); memory DDR4-3200 × 4 channels at 0.1 TB/s, 128 GB × 2 (NUMA); 20 GB/s inter-socket link; fat-tree network, 10 GB/s × 2 per node
Performance Estimation (1/2)
For the same number of processes (100,000), performance normalized to the commodity-based system:
◆ 4x-8x speedup per watt for Seism3D and RSGDX because of their high B/F requirements

                          Our System    Commodity-Based    Ratio
Threads/proc              4             4
Peak perf./proc           1 TF          0.36 TF            2.8x
Peak perf./100,000 proc   100 PF        35.84 PF           2.8x
Nodes/100,000 proc        25,000        11,111             2.3x
Total memory BW           200 PB/s      3.4 PB/s           59x
Total cache capacity      3.2 TB        0.8 TB             4x

[Bar charts: speedup of our system over the future commodity-based system, Seism3D 72.8x and RSGDX 43.0x; speedup per watt, Seism3D 8.1x and RSGDX 4.8x.]
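The per-watt bars are consistent with the node counts in the table and the node powers quoted on the previous slide (1.6 kW and 0.4 kW); a reconstruction of the arithmetic (mine, not shown in the talk):

```python
# Speedup/W = speedup * (commodity system power / our system power),
# using node counts from the table and per-node powers from the
# architecture-comparison slide.
our_mw = 25_000 * 1.6 / 1000        # 25,000 nodes x 1.6 kW = 40 MW
commodity_mw = 11_111 * 0.4 / 1000  # 11,111 nodes x 0.4 kW = 4.4 MW

for app, speedup in (("Seism3D", 72.8), ("RSGDX", 43.0)):
    per_watt = speedup * commodity_mw / our_mw
    print(f"{app:>8}: {per_watt:.1f}x speedup per watt")
# ~8.1x and ~4.8x, matching the bar chart.
```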
Performance Estimation (2/2)
Scalability analysis when increasing the number of processes (Seism3D):
[Chart: execution time (s) versus number of processes. Our system: 5.8, 2.9, 1.4, 0.7, 0.4, and 0.1 s at 102K, 204K, 409K, 819K, 1,638K, and 6,553K processes (30-40 MW). Future commodity-based system: 538.9, 250.6, 117.5, 52.1, 25.6, and 6.5 s (290 MW at the largest scale). The Xeon-based system scales up to about 1M processes, then its scalability stalls.]
The Xeon-based system needs 6.4M processes (64x more than our system) to achieve performance equivalent to ours, but it would consume 291 MW, which is impractical. Its large on-chip cache works well up to 1M processes, but memory accesses without locality become dominant beyond that point.
Summary
Well-balanced HEC systems, in terms of memory performance, are still the key to high productivity in science and engineering in the post-petascale era.
We explore the great potential of the new-generation vector architecture for future HPC systems, together with new device technologies such as 2.5D/3D die stacking.
✴ High sustained memory BW to fuel the vector function units at lower power/energy.
The on-chip vector load/store buffer can boost sustained memory bandwidth energy-efficiently.
When will such new technologies be available as production services?
✴ Design tools, fabrication, and markets steer the future of these technologies!
Acknowledgements

Riken AICS
• Mitsuo Yokokawa
• Atsuya Uno
University of Tokyo
• Takashi Furumura
• Muneo Hori
JAXA
• Kazuhiro Nakahashi
JAIST
• Yukinori Sato
Osaka University
• Go Hasegawa
• Shinichi Arakawa
JMA
• Hideo Tada
• Chiashi Muroi
• Keiichi Katayama
• Eiji Toyoda
• Junichi Ishida
• Takafumi Kanehama
Meisei University
• Kanji Otsuka
Tohoku University
• Hiroyuki Takizawa
• Kentaro Sano
• Ryusuke Egawa
• Kazuhiko Komatsu
• Mitsumasa Koyanagi
• Shunichi Koshimura
• Fumihiko Iwamura
• Takafumi Fukushima
• Kangwook Lee
• Hiroaki Muraoka
• Takahiro Hanyu
• Hiroshi Matsuoka
JAMSTEC
• Yoshiyuki Kaneda
• Kunihiko Watanabe
• Keiko Takahashi
• Takane Hori
• Mamoru Hyodo
• Kenichi Itakura
• Hitoshi Uehara
NEC
• Yukiko Hashimoto
• Yoko Isobe
• Akihiko Musa
• Takashi Soga
• Osamu Watanabe
• Akira Azami
• Takashi Hagiwara
• Shintaro Momose
• Yasuharu Hayashi
• Umezawa Kazuhiko
• Other NEC Engineers
Tohoku Micro-Tech
• Makoto Motoyoshi
Disclaimer: Information provided in this talk does not reflect any future NEC products.
For more details, see Booth 607 of Tohoku University