what we did for co-design in development of...
TRANSCRIPT
What We Did for Co-Design in Development of “Fugaku”
Mitsuhisa Sato Team Leader of Architecture Development Team
Deputy project leader, FLAGSHIP 2020 projectDeputy Director, RIKEN Center for Computational Science (R-CCS)
Professor (Cooperative Graduate School Program), University of Tsukuba
ORAP Forum, November 29, 2019 @Paris
FLAGSHIP2020 Project “Fugaku”Missions
• Building the Japanese national flagship supercomputer “Fugaku “(a.k. a post K), and
• Developing wide range of HPC applications, running on Fugaku, in order to solve social and science issues in Japan (application development proj will be over at the end of march)
Overview of Fugaku architectureNode: Manycore architecture
• Armv8-A + SVE (Scalable Vector Extension)• SIMD Length: 512 bits• # of Cores: 48 + (2/4 for OS) (> 2.7 TF / 48 core)• Co-design with application developers and high memory
bandwidth utilizing on-package stacked memory (HBM2) 1 TB/s B/W
• Low power : 15GF/W (dgemm)Network: TofuD• Chip-Integrated NIC, 6D mesh/torus Interconnect
Status and Update• March 2019: The official contract with Fujitsu to
manufacture, ship, and install hardware for Fugaku is done• RIKEN revealed #nodes > 150K• March 2019: The Name of the system was decided as
“Fugaku”• Aug. 2019: The K computer decommissioned, stopped the
services and shutdown (removed from the computer room)• Oct 2019: access to the test chips was started.• Nov. 2019: Fujitsu announce FX1000 and FX700, and
business with Cray.• Nov 2019: Fugaku clock frequency will be 2.0GHz and boost
to 2.2 GHz.• Mov 2019: Green 500 1st position!• Oct-Nov 2019: MEXT announced the Fugaku “early access
program” to begin around Q2/CY2020• Around Jan 2020: Installation of “Fugaku” will be started.
Nov/29/2019 2
No.1 in Green500 at SC19!
Nov/29/2019 3
Announce from Fujitsu at SC19
KPIs on Fugaku development in FLAGSHIP 2020 project
3 KPIs (key performance indicator) were defined for Fugaku development
1. Extreme Power-Efficient System Maximum performance under Power consumption of 30 - 40MW (for system) Approx. 15 GF/W (dgemm) confirmed by the prototype CPU => 1st in Green 500 !!!
2. Effective performance of target applications It is expected to exceed 100 times higher than the K computer’s performance in some
applications 125 times faster in GENESIS (MD application), 120 times faster in NICAM+LETKF (climate
simulation and data assimilation) were estimated
3. Ease-of-use system for wide-range of users Co-design with application developers Shared memory system with high-bandwidth on-package memory must make existing
OpenMP-MPI program ported easily. No programming effort for accelerators such as GPUs is required.
Nov/29/2019 4
Target Application’s Performance Performance Targets
100 times faster than K for some applications (tuning included) 30 to 40 MW power consumption
Area Priority Issue Performance Speedup over K Application Brief description
Health and longevity
1. Innovative computing infrastructure for drug discovery x125+ GENESIS MD for proteins
2. Personalized and preventive medicine using big data x8+ Genomon Genome processing
(Genome alignment)
Disaster prevention and Environment
3. Integrated simulation systems induced by earthquake and tsunami x45+ GAMERA Earthquake simulator (FEM in unstructured & structured grid)
4. Meteorological and global environmental prediction using big data x120+ NICAM+
LETKFWeather prediction system using Big data (structured grid stencil &
ensemble Kalman filter)
Energy issue
5. New technologies for energy creation, conversion / storage, and use x40+ NTChem Molecular electronic
(structure calculation)
6. Accelerated development of innovative clean energy systems x35+ Adventure Computational Mechanics System for Large Scale Analysis and Design
(unstructured grid)
Industrial competitiveness enhancement
7. Creation of new functional devices and high-performance materials x30+ RSDFT Ab-initio program
(density functional theory)
8. Development of innovative design and production processes x25+ FFB Large Eddy Simulation (unstructured grid)
Basic science 9. Elucidation of the fundamental laws and evolution of the universe x25+ LQCD Lattice QCD simulation (structured grid Monte Carlo)
Predicted Performance of 9 Target Applications As of 2019/05/14
https://postk-web.r-ccs.riken.jp/perf.html
5
Co-design in HPC
6
Co-design” in Wikipedia “Co-design or codesign is a product, service, or organization development process where design
professionals empower, encourage, and guide users to develop solutions for themselves.”… ….
“The phrase co-design is also used in reference to the simultaneous development of interrelated software and hardware systems. The term co-design has become popular in mobile phone development, where the two perspectives of hardware and software design are brought into a co-design process“
The co-design of HPC must optimize and maximize the benefits to cover many applications as possible. different from "co-design" in embedded systems. For example, in embedded field, co-
design sometimes includes "specialization" for a particular applications. On the other hands, in HPC, the system will be shared by many applications.
Why “co-design” is needed in very high-end HPC and exascale?
In modern very high-end parallel system, more performance can be delivered (even upto “exascale”) by increasing the number of nodes, but …
We need to design the system by trade-off between “energy/power” and “cost” and performance to satisfy constraints of “energy/power” and “cost” to maximize the performance.
The "co-design" in our post-K project must be … We design processor/network and system with the selected partner vender. Different from supercomputer acquisition in universities.
7
We need to design the system by taking characteristics of applications in account
⇒ “codesign” in HPC
Co-design in HPC
Richard F. BARRETT, et.al. “On the Role of Co-design in High Performance Computing”, Transition of HPC Towards Exascale Computing
8Nov/29/2019
Before starting FLAGSHIP2020 Project
9
The K computer project (2016-2012) The public service of the K computer started from 2012.
Feasibility study project (2012-2013) Application study team leaded by RIKEN AICS (Tomita) System study team leaded by U Tokyo (Ishikawa)
Next-generation “General-Purpose” Supercomputer System study team leaded by U Tsukuba (Sato)
Exascale heterogeneous systems with accelerators System study team leaded by Tohoku U (Kobayashi)
Initial Plan at the beginning (2014) was a combined system with “General-Purpose” Supercomputer and “accelerators” But, the development of “accelerators” was canceled due to
budget problem.
Nov/29/2019
Almost co-design done in this phase
The Basic design follows “general purpose” processor by FS of U. Tokyo
Technologies and Architectural Parameters to be determined
10
Chip configuration #core, #NUMA node Cache configuration (shared within NUMA node), cache size Chip connection, Memory technologies (DDR, HBM, HMC …) Frequency, Silicon Fabrication technology (10nm, 7nm FinFET, … ),
Chip die-size, power consumption Core specification #SIMD pipe, SIMD vector length Cache in Core (size, latency, …) O3 resources specialized hardware
Interconnect SerDes, “Tofu” or other network?
We had to examine the technologies available 3 or 4 year after the design, and Cost (project budget)!
Technologies trends
11
7nm FinFET10nm FinFET16nm FinFETSilicon Technology
HBM2HBM1HMCMemory
TechnologyCoWoS (Si Interposer)
4x28.05G
56G ???PCIe Gen3
PCIe Gen4
NetworkTechnology
IOtechnology
Almost co-design done in this phase
Chip configuration
12
#cluster, #core/cluster, NOC Chip size is a major factor of cost. MCM & interposer (organic, silicon, glass …) Memory tech.
HMC (mem size↑、Power ↑) HBM (mem size↓、Power ↓) DDR (mem size ↑ 、Power ↑, cost ↓)
Performance was estimated by “analytical models” for simple kernel such as HPL
HBM
NOC
ClusterCluster
Cluster Cluster
HBM
HBM HBM
NIC
&PCIe
HMC
NOC
ClusterCluster
Cluster Cluster
HMC
HMC HMCN
IC&
PCIe
HBM
ClusterCluster
Cluster Cluster
HBM
HBM HBM
NIC
&PCIe
MCM
Network
NIC
&PCIe
HMC
ClusterCluster
HMC
NIC
&PCIe
HMC
ClusterCluster
HMC
NIC
&PCIe
NIC
&PCIe
Small chip, HMC Large chip, HMC Small chip + HMC with MCM
Large chip + HBM(selected as A64FX)
cost
Chip size
cost getting higher beyond some size
Target applications for co-design
13
Typical benchmarks such as HPL (dgemm) and stream
Target applications provided by each application projects (“9 priority issues”)
Target applications are representatives of almost all our applications in terms of computational methods and communication patterns in order to design architectural features.
Target science: 9 Priority Issues重点課題① 生体分子システムの機能制御による 革新的創薬基盤の構築
①Innovative Drug Discovery
RIKEN Quant. Biology Center
重点課題② 個別化・予防医療を支援する 統合計算生命科学②Personalized and Preventive Medicine
Inst. Medical Science, U. Tokyo
重点課題③ 地震・津波による複合災害の統合的予測システムの構築③Hazard and Disaster induced by
Earthquake and Tsunami
Earthquake Res. Inst., U. Tokyo
重点課題④ 観測ビッグデータを活用した気象と地球環境の予測の高度化④Environmental Predictions
with Observational Big Data
Center for Earth Info., JAMSTEC
重点課題⑥ 革新的クリーンエネルギー システムの実用化⑥Innovative Clean Energy Systems
Grad. Sch. Engineering, U. Tokyo
重点課題⑦ 次世代の産業を支える新機能デバイス・高性能材料の創成⑦New Functional Devices and High-Performance
Inst. For Solid State Phys., U. Tokyo
重点課題⑧ 近未来型ものづくりを先導する 革新的設計・製造プロセスの開発⑧ Innovative Design and Production Processes for the Manufacturing Industry in the
Near Future
Cent. for Earth Info., JAMSTEC重点課題⑤ エネルギーの高効率な創出、変換・貯蔵、利用の新規基盤技術の開発⑤High-Efficiency Energy Creation,
Conversion/Storage and Use
Inst. Molecular Science, NINS
重点課題⑨ 宇宙の基本法則と進化の解明⑨Fundamental Laws and Evolution of the Universe
Cent. for Comp. Science, U. Tsukuba
20018/02/27 14
Tools for co-design
15
Performance estimation tool Enables performance projection using Fujitsu FX100 execution profile to a set of
arch. parameters. Modeled by Fujitsu micro-architecture
Fujitsu in-house simulators Extended FX100 (SPARC) simulator and compiler for preliminary studies Armv8+SVE simulator and compiler (gem5 simulator has been developed after co-design for architecture verification
and performance tuning by RIKEN
Emulator for logic design verification
Performance estimation tool Estimate the performance of multithreaded programs on “new architecture” (post-K) from the
profile data taken on Fujitsu FX100 Estimate a maximum performance when busy-time of each component is hidden by others,
based on Fujitsu’s micro-architecture.
16
Break-down exec time
Memory accbusy time
Memory W acc busy time
L2 busy time L1 busy time20018/02/27
Fujitsu FX100 execution profileAnd execute-time break-down
Estimated execute-time break-down on new architecture by projection
Projectionby throughput of
each element (similar to “roof-
line mode”)
Co-design Methodology
17
Extract kernels from each application benchmark, and breakdown to a set of kernels
Co-design process for each kernels1. Setting a set of system parameters2. Tuning target applications under the system parameters3. Evaluating execution time using “estimation tool”4. Identifying hardware bottlenecks and changing the set of
system parameters
Analytical models for simple kernel such as HPL and dgemm, stream
Identify kernel loops
18
Our performance estimation tool estimates performance according to “throughput” of each block in core, according to Fujitsu micro-architecture.
Identify kernels from each application benchmark, and breakdown to a set of kernels to estimate execution time of each kernel.
do I = 1,N… enddo
Kernel1
do J = 1, N1….
enddoKernel2
do K = 1, …… enddo Kernel3
PerformanceEstimation
Tool
profile2profile1
profile3
exec T2Exec T1
Exec T3
TotalExecution
Time
Extract Kernels
19
Extract kernels from each application benchmark for more detail analysis. This set of kernels is useful to evaluate performance on the simulator,
which takes long time to execute.
do I = 1,N… enddo
Kernel1
do J = 1, N1….
enddoKernel2
do K = 1, …… enddo Kernel3
do I = 1,N… enddo
Kernel1
do J = 1, N1….
enddoKernel2
do K = 1, …… enddo Kernel3
PerformanceEstimation
Tool
OR
Simulator
profile2profile1
profile3
exec T2Exec T1
Exec T3
A64FX Optimized Load Efficiency for Apps Performance
20
128 bytes/cycle sustained bandwidth even for unaligned SIMD load
“Combined Gather” doubles gather (indirect) load’s data throughput, when target elements are within a “128-byte aligned block” for a pair of two regs, even & odd
L1D cacheRead port0
Read port1
Read data064B/cycle
Read data164B/cycle
Mem.
128B
0 1 2 3 4 5 6 7
0 1 3 2 6 7 45
8B
Regs
flow-1
flow-2
flow-4 flow-3
Maximizes BW to 32 bytes/cyc.
Suggested through Co-design work w/ app teams
Low-power Design & Power Management Leading-edge Si-technology (7nm FinFET) Low power logic design (15 GF/W @ dgemm)
A64FX provides power management function called “Power Knob” FL pipeline usage: FLA only, EX pipeline usage : EXA only, Frequency reduction … User program can change “Power Knob” for power optimization “Energy monitor” facility enables chip-level power monitoring and detailed power analysis
of applications
“Eco-mode” : FLA only with lower “stand-by” power for ALUs Reduce the power-consumption for memory intensive apps. 4 apps out of 9 target applications select “eco-mode” for the max performance under the
limitation of our power capacity (Even using HBM2!) Retention mode: power state for de-activation of CPU with keeping network alive Large reduction of system power-consumption at idle time
2019/5/17
Co-design of network
22
We have selected “Tofu” network for performance compatibility in large-scale applications.
Select Link-speed 28Gps x 4 due to technology availability around 2019 Communication patterns ware extracted, and the communication performance
was estimated by “analytical model”. Many target applications have neighbor communication
pattern, or communication to near nodes. ⇒ “Tofu” and 28Gbp link were sufficient.
Some apps have all-to-all communication. We studied the benefits or feasibility of additional
“dedicated” all-to-all network, but it ware not selected due to cost
Support 3 DP reduction by Tofu TBI for QCD apps “Common” programing model will be to run each
MPI process on a NUMA node (CMG) with OpenMP-MPI hybrid programming. 48 threads OpenMP is also supported.
K w/ TBI
TBI: Tofu Barrier Interface
23HBM2
• TSMC 7nm FinFET• CoWoS technologies
for HBM2
CPU A64FX
Courtesy of FUJITSU LIMITED
Architecture Armv8.2-A SVE (512 bit SIMD)
Core
48 cores for compute and 2/4 for OS activities
Normal: 2.0 GHz DP: 3.072 TF, SP: 6.144 TF, HP: 12.288 TF
Boost: 2.2 GHz DP: 3.3792TF, SP: 6.7584 TF, HP: 13.5168 TF
Cache L1 64 KiB, 4 way, 230+ GB/s(load), 115+ GB/s (store)
Cache L2CMG(NUMA): 8 MiB, 16wayNode: 3.6+ TB/sCore: 115+ GB/s (load), 57+ GB/s (store)
Memory HBM2 32 GiB, 1024 GB/s
Interconnect TofuD (28 Gbps x 2 lane x 10 port)
I/O PCIe Gen3 x 16 lane
Technology 7nm FinFET
PerformanceStream triad: 830+ GB/sDgemm: 2.5+ TF (90+% efficiency)
ref. Toshio Yoshida, “Fujitsu High Performance CPU for the Post-K Computer,” IEEE Hot Chips: A Symposium on High Performance Chips, San Jose, August 21, 2018.
4 NUMA Nodes
TofuD Interconnect
8B Put latency 0.49 – 0.54 usec
1MiB Put throughput 6.35 GB/s
rf. Yuichiro Ajima, et al. , “The Tofu Interconnect D,” IEEE Cluster 2018, 2018.
• 6 RDMA Engines• Hardware barrier support• Network operation offloading capability
TNI: Tofu Network Interface (RDMA engine)
TNI0
TNI1
TNI2
TNI3
TNI4
TNI5
TNR(Tofu Network Router)
2 lanes x 10 ports
40.8 GB/s(6.8 GB/s x 6)
26
Fugaku prototype board and rack
Shelf: 48 CPUs (24 CMU)Rack: 8 shelves = 384 CPUs (8x48)
2 CPU / CMU
60mm
60mm
Water
Water
Electrical signals
AOC
QSFP28 (X)
QSFP28 (Y)
QSFP28 (Z)
AOC
AOC
Nov/29/2019
HBM2
27
Fugaku System Configuration
3-level hierarchical storage system 1st Layer
One of 16 compute nodes, called Compute & Storage I/O Node, has SSD about 1.6 TB
Services- Cache for global file system- Temporary file systems
- Local file system for compute node- Shared file system for a job
2nd Layer Fujitsu FEFS: Lustre-based global file
system 3rd Layer
Cloud storage services
Boost mode: 3.3792TF x 150k+ = 500+ PF 150k+ node Two types of nodes Compute Node and Compute & I/O Node connected by Fujitsu TofuD, 6D mesh/torus Interconnect
Advances from the K computer
SVE increases core performance Silicon tech. and scalable architecture (CMG) to increase node performance HBM enables high bandwidth
K computer Fugaku ratio# core 8 48
Si tech. (nm) 45 7Core perf. (GFLOPS) 16 > 64 4
Chip(node) perf. (TFLOPS) 0.128 >3.0 24Memory BW (GB/s) 64 1024
B/F (Bytes/FLOP) 0.5 0.4#node / rack 96 384 4
Rack perf. (TFLOPS) 12.3 >1179.6 96#node/system 82,944 > 150,000
System perf.(DP PFLOPS) 10.6 > 460.8 43
SVECMG&Si TechHBM
Si Tech
More than 7.5 M General-purpose cores!
Nov/29/2019
Boost mode: 3.3792TF x 150k+ = 500+ PF
29
Co-design in HPC
Richard F. BARRETT, et.al. “On the Role of Co-design in High Performance Computing”, Transition of HPC Towards Exascale Computing
30Nov/29/2019
31
Tools for performance tuning Performance estimation tool
Performance projection using Fujitsu FX100 execution profile
Gives “target” performance Post-K processor simulator
Based on gem5, O3, cycle-level simulation Very slow, so limited to kernel-level evaluation
Co-design of apps 1. Estimate “target” performance using
performance estimation tool 2. Extract kernel code for simulator 3. Measure exec time using simulator 4. Feed-back to code optimization 5. Feed-back to compiler
Co-design of Apps for Architecture (at early stage)
Perform-anceestimationtoolcd
Simulator Simulator Simulator
As is Tuning 1
Tuning 2
Targetperformance
Exec
utio
n tim
e
Now, Test chips of A64FX is available!We are evaluating the performance
and doing performance tuning, using the test chip
Nov/29/2019
Benchmark Results on test chip A64FX
CloverLeaf (UK Mini-App Consortium), Fortran/C A hydrodynamics mini-app to solve the compressible Euler equations in
2D, using an explicit, second-order method Stencil calculation
TeaLeaf (UK Mini-App Consortium), Fortran A mini-application to enable design-space explorations for iterative sparse
linear solvers https://github.com/UK-MAC/TeaLeaf_ref.git Problem size: Benchmarks/tea_bm_5.in, end_step=10 -> 3
LULESH (LLNL), C Mini-app representative of simplified 3D Lagrangian hydrodynamics on an
unstructured mesh, indirect memory accessNov/29/2019 32
Benchmark Results on test chip A64FX
Platform A64FX test chip (2.0 GHz) ThunderX2 @ Apollo70
28C/2S @ 2.0GHz Arm HPC compiler 19.1
Broadwell (Xeon E5-2680 v4) 14C/2S @ 2.4GHz Intel compiler 2019.0.045
Skylake (Xeon Gold 6126) @ Cygnus, Univ. of Tsukuba 12C/2S @ 2.6GHz Intel compiler 19.0.3.199
Nov/29/2019
Compiler Options Fujitsu compiler
-Kfast,openmp
Arm HPC compiler -Ofast -march=armv8-a(+sve)
Intel compiler -O3 -qopenmp -march=native
Disclaimer:The software used for the evaluation, such as the compiler, is still under development and its performance may be different when the supercomputer Fugaku starts its operation.
33
Evaluation using one CMG(NUMA node) without MPI Good scalability by increasing the number of threads within CMG. One GMG performance is comparable to Intel one. (Chip contains 4 CMG!)
CloverLeaf
Nov/29/2019
0
5
10
15
20
25
Fujitsu Arm Intel
A64FX TX2 Broadwell
Exec
utio
n tim
e [s
ec]
# of threads 1 # of threads 4 # of threads 8 # of threads 12
0
1
2
3
4
5
6
7
8
9
Fujitsu Arm Intel
A64FX TX2 Broadwell
Rela
tive
perf
orm
ance
# of threads 1 # of threads 4 # of threads 8 # of threads 12
Execution time Relative performance (to 1T/A64FX)
34
0
200
400
600
800
1000
1200
1400
1 4 8 12 1 4 8 12 1 4 8 12
1 2 4
Wal
l clo
ck ti
me
[sec
]
# of threads# of processes
A64FX TX2 Xeon
Evaluation of MPI program within one chip (upto 4 MPI process)
Changing #threads within CMG The speedup is limited for more
than 4 threads due to the memory bandwidth (?)
We need more performance analysis.
TeaLeaf
Nov/29/2019
Execution time
0
2
4
6
8
10
12
14
1 4 8 12 1 4 8 12 1 4 8 12
1 2 4
Rela
tive
perfo
rman
ce (/
A64F
X 1p
roc
1thr
ead)
# of threads# of processes
A64FX TX2 XeonRelative performance(to 1T/A64FX)
35
Xeon @ Cygnus, Univ. of TsukubaIntel Xeon Gold 6126 2.6GHz; 12 core x 2 socket
0
500
1000
1500
2000
2500
3000
3500
Fujitsu Arm Intel
A64FX TX2 Broadwell
FOM
(z/
s)
# of threads 1 # of threads 4 # of threads 8 # of threads 12
Evaluation using one CMG(NUMA node) without MPI One CMG performance is less than Thx2 and Intel one We found low vectorization (SIMD (SVE) instructions ratio is a few %) We need more code tuning for more vectorization using SIMD
LULESH
Nov/29/2019 36
37Nov/29/2019
38Nov/29/2019