what we did for co-design in development of...

What We Did for Co-Design in Development of “Fugaku”

Mitsuhisa Sato Team Leader of Architecture Development Team

Deputy project leader, FLAGSHIP 2020 projectDeputy Director, RIKEN Center for Computational Science (R-CCS)

Professor (Cooperative Graduate School Program), University of Tsukuba

ORAP Forum, November 29, 2019 @Paris

FLAGSHIP2020 Project “Fugaku”Missions

• Building the Japanese national flagship supercomputer “Fugaku “(a.k. a post K), and

• Developing wide range of HPC applications, running on Fugaku, in order to solve social and science issues in Japan (application development proj will be over at the end of march)

Overview of Fugaku architectureNode: Manycore architecture

• Armv8-A + SVE (Scalable Vector Extension)• SIMD Length: 512 bits• # of Cores: 48 + (2/4 for OS) (> 2.7 TF / 48 core)• Co-design with application developers and high memory

bandwidth utilizing on-package stacked memory (HBM2) 1 TB/s B/W

• Low power : 15GF/W (dgemm)Network: TofuD• Chip-Integrated NIC, 6D mesh/torus Interconnect

Status and Update• March 2019: The official contract with Fujitsu to

manufacture, ship, and install hardware for Fugaku is done• RIKEN revealed #nodes > 150K• March 2019: The Name of the system was decided as

“Fugaku”• Aug. 2019: The K computer decommissioned, stopped the

services and shutdown (removed from the computer room)• Oct 2019: access to the test chips was started.• Nov. 2019: Fujitsu announce FX1000 and FX700, and

business with Cray.• Nov 2019: Fugaku clock frequency will be 2.0GHz and boost

to 2.2 GHz.• Mov 2019: Green 500 1st position!• Oct-Nov 2019: MEXT announced the Fugaku “early access

program” to begin around Q2/CY2020• Around Jan 2020: Installation of “Fugaku” will be started.

Nov/29/2019 2

No.1 in Green500 at SC19!

Nov/29/2019 3

Announce from Fujitsu at SC19

KPIs on Fugaku development in FLAGSHIP 2020 project

3 KPIs (key performance indicator) were defined for Fugaku development

1. Extreme Power-Efficient System Maximum performance under Power consumption of 30 - 40MW (for system) Approx. 15 GF/W (dgemm) confirmed by the prototype CPU => 1st in Green 500 !!!

2. Effective performance of target applications It is expected to exceed 100 times higher than the K computer’s performance in some

applications 125 times faster in GENESIS (MD application), 120 times faster in NICAM+LETKF (climate

simulation and data assimilation) were estimated

3. Ease-of-use system for wide-range of users Co-design with application developers Shared memory system with high-bandwidth on-package memory must make existing

OpenMP-MPI program ported easily. No programming effort for accelerators such as GPUs is required.

Nov/29/2019 4

Target Application’s Performance Performance Targets

100 times faster than K for some applications (tuning included) 30 to 40 MW power consumption

Area Priority Issue Performance Speedup over K Application Brief description

Health and longevity

1. Innovative computing infrastructure for drug discovery x125+ GENESIS MD for proteins

2. Personalized and preventive medicine using big data x8+ Genomon Genome processing

(Genome alignment)

Disaster prevention and Environment

3. Integrated simulation systems induced by earthquake and tsunami x45+ GAMERA Earthquake simulator (FEM in unstructured & structured grid)

4. Meteorological and global environmental prediction using big data x120+ NICAM+

LETKFWeather prediction system using Big data (structured grid stencil &

ensemble Kalman filter)

Energy issue

5. New technologies for energy creation, conversion / storage, and use x40+ NTChem Molecular electronic

(structure calculation)

6. Accelerated development of innovative clean energy systems x35+ Adventure Computational Mechanics System for Large Scale Analysis and Design

(unstructured grid)

Industrial competitiveness enhancement

7. Creation of new functional devices and high-performance materials x30+ RSDFT Ab-initio program

(density functional theory)

8. Development of innovative design and production processes x25+ FFB Large Eddy Simulation (unstructured grid)

Basic science 9. Elucidation of the fundamental laws and evolution of the universe x25+ LQCD Lattice QCD simulation (structured grid Monte Carlo)

Predicted Performance of 9 Target Applications As of 2019/05/14

https://postk-web.r-ccs.riken.jp/perf.html

5

https://postk-web.r-ccs.riken.jp/perf.html

Co-design in HPC

6

Co-design” in Wikipedia “Co-design or codesign is a product, service, or organization development process where design

professionals empower, encourage, and guide users to develop solutions for themselves.”… ….

“The phrase co-design is also used in reference to the simultaneous development of interrelated software and hardware systems. The term co-design has become popular in mobile phone development, where the two perspectives of hardware and software design are brought into a co-design process“

The co-design of HPC must optimize and maximize the benefits to cover many applications as possible. different from "co-design" in embedded systems. For example, in embedded field, co-

design sometimes includes "specialization" for a particular applications. On the other hands, in HPC, the system will be shared by many applications.

Why “co-design” is needed in very high-end HPC and exascale?

In modern very high-end parallel system, more performance can be delivered (even upto “exascale”) by increasing the number of nodes, but …

We need to design the system by trade-off between “energy/power” and “cost” and performance to satisfy constraints of “energy/power” and “cost” to maximize the performance.

The "co-design" in our post-K project must be … We design processor/network and system with the selected partner vender. Different from supercomputer acquisition in universities.

7

We need to design the system by taking characteristics of applications in account

⇒ “codesign” in HPC

Co-design in HPC

Richard F. BARRETT, et.al. “On the Role of Co-design in High Performance Computing”, Transition of HPC Towards Exascale Computing

8Nov/29/2019

Before starting FLAGSHIP2020 Project

9

The K computer project (2016-2012) The public service of the K computer started from 2012.

Feasibility study project (2012-2013) Application study team leaded by RIKEN AICS (Tomita) System study team leaded by U Tokyo (Ishikawa)

Next-generation “General-Purpose” Supercomputer System study team leaded by U Tsukuba (Sato)

Exascale heterogeneous systems with accelerators System study team leaded by Tohoku U (Kobayashi)

Initial Plan at the beginning (2014) was a combined system with “General-Purpose” Supercomputer and “accelerators” But, the development of “accelerators” was canceled due to

budget problem.

Nov/29/2019

Almost co-design done in this phase

The Basic design follows “general purpose” processor by FS of U. Tokyo

Technologies and Architectural Parameters to be determined

10

Chip configuration #core, #NUMA node Cache configuration (shared within NUMA node), cache size Chip connection, Memory technologies (DDR, HBM, HMC …) Frequency, Silicon Fabrication technology (10nm, 7nm FinFET, … ),

Chip die-size, power consumption Core specification #SIMD pipe, SIMD vector length Cache in Core (size, latency, …) O3 resources specialized hardware

Interconnect SerDes, “Tofu” or other network?

We had to examine the technologies available 3 or 4 year after the design, and Cost (project budget)!

Technologies trends

11

7nm FinFET10nm FinFET16nm FinFETSilicon Technology

HBM2HBM1HMCMemory

TechnologyCoWoS (Si Interposer)

4x28.05G

56G ???PCIe Gen3

PCIe Gen4

NetworkTechnology

IOtechnology

Almost co-design done in this phase

Chip configuration

12

#cluster, #core/cluster, NOC Chip size is a major factor of cost. MCM & interposer (organic, silicon, glass …) Memory tech.

HMC (mem size↑、Power ↑) HBM (mem size↓、Power ↓) DDR (mem size ↑ 、Power ↑, cost ↓)

Performance was estimated by “analytical models” for simple kernel such as HPL

HBM

NOC

ClusterCluster

Cluster Cluster

HBM

HBM HBM

NIC

&PCIe

HMC

NOC

ClusterCluster

Cluster Cluster

HMC

HMC HMCN

IC&

PCIe

HBM

ClusterCluster

Cluster Cluster

HBM

HBM HBM

NIC

&PCIe

MCM

Network

NIC

&PCIe

HMC

ClusterCluster

HMC

NIC

&PCIe

HMC

ClusterCluster

HMC

NIC

&PCIe

NIC

&PCIe

Small chip, HMC Large chip, HMC Small chip + HMC with MCM

Large chip + HBM(selected as A64FX)

cost

Chip size

cost getting higher beyond some size

Target applications for co-design

13

Typical benchmarks such as HPL (dgemm) and stream

Target applications provided by each application projects (“9 priority issues”)

Target applications are representatives of almost all our applications in terms of computational methods and communication patterns in order to design architectural features.

Target science: 9 Priority Issues重点課題① 生体分子システムの機能制御による革新的創薬基盤の構築

①Innovative Drug Discovery

RIKEN Quant. Biology Center

重点課題② 個別化・予防医療を支援する統合計算生命科学②Personalized and Preventive Medicine

Inst. Medical Science, U. Tokyo

重点課題③ 地震・津波による複合災害の統合的予測システムの構築③Hazard and Disaster induced by

Earthquake and Tsunami

Earthquake Res. Inst., U. Tokyo

重点課題④ 観測ビッグデータを活用した気象と地球環境の予測の高度化④Environmental Predictions

with Observational Big Data

Center for Earth Info., JAMSTEC

重点課題⑥ 革新的クリーンエネルギーシステムの実用化⑥Innovative Clean Energy Systems

Grad. Sch. Engineering, U. Tokyo

重点課題⑦ 次世代の産業を支える新機能デバイス・高性能材料の創成⑦New Functional Devices and High-Performance

Inst. For Solid State Phys., U. Tokyo

重点課題⑧ 近未来型ものづくりを先導する革新的設計・製造プロセスの開発⑧ Innovative Design and Production Processes for the Manufacturing Industry in the

Near Future

Cent. for Earth Info., JAMSTEC重点課題⑤ エネルギーの高効率な創出、変換・貯蔵、利用の新規基盤技術の開発⑤High-Efficiency Energy Creation,

Conversion/Storage and Use

Inst. Molecular Science, NINS

重点課題⑨ 宇宙の基本法則と進化の解明⑨Fundamental Laws and Evolution of the Universe

Cent. for Comp. Science, U. Tsukuba

20018/02/27 14

Tools for co-design

15

Performance estimation tool Enables performance projection using Fujitsu FX100 execution profile to a set of

arch. parameters. Modeled by Fujitsu micro-architecture

Fujitsu in-house simulators Extended FX100 (SPARC) simulator and compiler for preliminary studies Armv8+SVE simulator and compiler (gem5 simulator has been developed after co-design for architecture verification

and performance tuning by RIKEN

Emulator for logic design verification

Performance estimation tool Estimate the performance of multithreaded programs on “new architecture” (post-K) from the

profile data taken on Fujitsu FX100 Estimate a maximum performance when busy-time of each component is hidden by others,

based on Fujitsu’s micro-architecture.

16

Break-down exec time

Memory accbusy time

Memory W acc busy time

L2 busy time L1 busy time20018/02/27

Fujitsu FX100 execution profileAnd execute-time break-down

Estimated execute-time break-down on new architecture by projection

Projectionby throughput of

each element (similar to “roof-

line mode”)

Co-design Methodology

17

Extract kernels from each application benchmark, and breakdown to a set of kernels

Co-design process for each kernels1. Setting a set of system parameters2. Tuning target applications under the system parameters3. Evaluating execution time using “estimation tool”4. Identifying hardware bottlenecks and changing the set of

system parameters

Analytical models for simple kernel such as HPL and dgemm, stream

Identify kernel loops

18

Our performance estimation tool estimates performance according to “throughput” of each block in core, according to Fujitsu micro-architecture.

Identify kernels from each application benchmark, and breakdown to a set of kernels to estimate execution time of each kernel.

do I = 1,N… enddo

Kernel1

do J = 1, N1….

enddoKernel2

do K = 1, …… enddo Kernel3

PerformanceEstimation

Tool

profile2profile1

profile3

exec T2Exec T1

Exec T3

TotalExecution

Time

Extract Kernels

19

Extract kernels from each application benchmark for more detail analysis. This set of kernels is useful to evaluate performance on the simulator,

which takes long time to execute.

do I = 1,N… enddo

Kernel1

do J = 1, N1….

enddoKernel2


do I = 1,N… enddo

Kernel1

do J = 1, N1….

enddoKernel2


PerformanceEstimation

Tool

OR

Simulator

profile2profile1

profile3

exec T2Exec T1

Exec T3

A64FX Optimized Load Efficiency for Apps Performance

20

128 bytes/cycle sustained bandwidth even for unaligned SIMD load

“Combined Gather” doubles gather (indirect) load’s data throughput, when target elements are within a “128-byte aligned block” for a pair of two regs, even & odd

L1D cacheRead port0

Read port1

Read data064B/cycle

Read data164B/cycle

Mem.

128B

0 1 2 3 4 5 6 7

0 1 3 2 6 7 45

8B

Regs

flow-1

flow-2

flow-4 flow-3

Maximizes BW to 32 bytes/cyc.

Suggested through Co-design work w/ app teams

Low-power Design & Power Management Leading-edge Si-technology (7nm FinFET) Low power logic design (15 GF/W @ dgemm)

A64FX provides power management function called “Power Knob” FL pipeline usage: FLA only, EX pipeline usage : EXA only, Frequency reduction … User program can change “Power Knob” for power optimization “Energy monitor” facility enables chip-level power monitoring and detailed power analysis

of applications

“Eco-mode” : FLA only with lower “stand-by” power for ALUs Reduce the power-consumption for memory intensive apps. 4 apps out of 9 target applications select “eco-mode” for the max performance under the

limitation of our power capacity (Even using HBM2!) Retention mode: power state for de-activation of CPU with keeping network alive Large reduction of system power-consumption at idle time

2019/5/17

Co-design of network

22

We have selected “Tofu” network for performance compatibility in large-scale applications.

Select Link-speed 28Gps x 4 due to technology availability around 2019 Communication patterns ware extracted, and the communication performance

was estimated by “analytical model”. Many target applications have neighbor communication

pattern, or communication to near nodes. ⇒ “Tofu” and 28Gbp link were sufficient.

Some apps have all-to-all communication. We studied the benefits or feasibility of additional

“dedicated” all-to-all network, but it ware not selected due to cost

Support 3 DP reduction by Tofu TBI for QCD apps “Common” programing model will be to run each

MPI process on a NUMA node (CMG) with OpenMP-MPI hybrid programming. 48 threads OpenMP is also supported.

K w/ TBI

TBI: Tofu Barrier Interface

23HBM2

• TSMC 7nm FinFET• CoWoS technologies

for HBM2

CPU A64FX

Courtesy of FUJITSU LIMITED

Architecture Armv8.2-A SVE (512 bit SIMD)

Core

48 cores for compute and 2/4 for OS activities

Normal: 2.0 GHz DP: 3.072 TF, SP: 6.144 TF, HP: 12.288 TF

Boost: 2.2 GHz DP: 3.3792TF, SP: 6.7584 TF, HP: 13.5168 TF

Cache L1 64 KiB, 4 way, 230+ GB/s(load), 115+ GB/s (store)

Cache L2CMG(NUMA): 8 MiB, 16wayNode: 3.6+ TB/sCore: 115+ GB/s (load), 57+ GB/s (store)

Memory HBM2 32 GiB, 1024 GB/s

Interconnect TofuD (28 Gbps x 2 lane x 10 port)

I/O PCIe Gen3 x 16 lane

Technology 7nm FinFET

PerformanceStream triad: 830+ GB/sDgemm: 2.5+ TF (90+% efficiency)

ref. Toshio Yoshida, “Fujitsu High Performance CPU for the Post-K Computer,” IEEE Hot Chips: A Symposium on High Performance Chips, San Jose, August 21, 2018.

4 NUMA Nodes

TofuD Interconnect

8B Put latency 0.49 – 0.54 usec

1MiB Put throughput 6.35 GB/s

rf. Yuichiro Ajima, et al. , “The Tofu Interconnect D,” IEEE Cluster 2018, 2018.

• 6 RDMA Engines• Hardware barrier support• Network operation offloading capability

TNI: Tofu Network Interface (RDMA engine)

TNI0

TNI1

TNI2

TNI3

TNI4

TNI5

TNR(Tofu Network Router)

2 lanes x 10 ports

40.8 GB/s(6.8 GB/s x 6)

Fugaku prototype board and rack

Shelf: 48 CPUs (24 CMU)Rack: 8 shelves = 384 CPUs (8x48)

2 CPU / CMU

60mm

60mm

Water

Water

Electrical signals

AOC

QSFP28 (X)

QSFP28 (Y)

QSFP28 (Z)

AOC

AOC

Nov/29/2019

HBM2

27

Fugaku System Configuration

3-level hierarchical storage system 1st Layer

One of 16 compute nodes, called Compute & Storage I/O Node, has SSD about 1.6 TB

Services- Cache for global file system- Temporary file systems

- Local file system for compute node- Shared file system for a job

2nd Layer Fujitsu FEFS: Lustre-based global file

system 3rd Layer

Cloud storage services

Boost mode: 3.3792TF x 150k+ = 500+ PF 150k+ node Two types of nodes Compute Node and Compute & I/O Node connected by Fujitsu TofuD, 6D mesh/torus Interconnect

Advances from the K computer

SVE increases core performance Silicon tech. and scalable architecture (CMG) to increase node performance HBM enables high bandwidth

K computer Fugaku ratio# core 8 48

Si tech. (nm) 45 7Core perf. (GFLOPS) 16 > 64 4

Chip(node) perf. (TFLOPS) 0.128 >3.0 24Memory BW (GB/s) 64 1024

B/F (Bytes/FLOP) 0.5 0.4#node / rack 96 384 4

Rack perf. (TFLOPS) 12.3 >1179.6 96#node/system 82,944 > 150,000

System perf.(DP PFLOPS) 10.6 > 460.8 43

SVECMG&Si TechHBM

Si Tech

More than 7.5 M General-purpose cores!

Nov/29/2019

Boost mode: 3.3792TF x 150k+ = 500+ PF

29

Co-design in HPC

Richard F. BARRETT, et.al. “On the Role of Co-design in High Performance Computing”, Transition of HPC Towards Exascale Computing

30Nov/29/2019

31

Tools for performance tuning Performance estimation tool

Performance projection using Fujitsu FX100 execution profile

Gives “target” performance Post-K processor simulator

Based on gem5, O3, cycle-level simulation Very slow, so limited to kernel-level evaluation

Co-design of apps 1. Estimate “target” performance using

performance estimation tool 2. Extract kernel code for simulator 3. Measure exec time using simulator 4. Feed-back to code optimization 5. Feed-back to compiler

Co-design of Apps for Architecture (at early stage)

Perform-anceestimationtoolcd

Simulator Simulator Simulator

As is Tuning 1

Tuning 2

Targetperformance

Exec

utio

n tim

e

Now, Test chips of A64FX is available!We are evaluating the performance

and doing performance tuning, using the test chip

Nov/29/2019

Benchmark Results on test chip A64FX

CloverLeaf (UK Mini-App Consortium), Fortran/C A hydrodynamics mini-app to solve the compressible Euler equations in

2D, using an explicit, second-order method Stencil calculation

TeaLeaf (UK Mini-App Consortium), Fortran A mini-application to enable design-space explorations for iterative sparse

linear solvers https://github.com/UK-MAC/TeaLeaf_ref.git Problem size: Benchmarks/tea_bm_5.in, end_step=10 -> 3

LULESH (LLNL), C Mini-app representative of simplified 3D Lagrangian hydrodynamics on an

unstructured mesh, indirect memory accessNov/29/2019 32

https://github.com/UK-MAC/TeaLeaf_ref.git

Benchmark Results on test chip A64FX

Platform A64FX test chip (2.0 GHz) ThunderX2 @ Apollo70

28C/2S @ 2.0GHz Arm HPC compiler 19.1

Broadwell (Xeon E5-2680 v4) 14C/2S @ 2.4GHz Intel compiler 2019.0.045

Skylake (Xeon Gold 6126) @ Cygnus, Univ. of Tsukuba 12C/2S @ 2.6GHz Intel compiler 19.0.3.199

Nov/29/2019

Compiler Options Fujitsu compiler

-Kfast,openmp

Arm HPC compiler -Ofast -march=armv8-a(+sve)

Intel compiler -O3 -qopenmp -march=native

Disclaimer:The software used for the evaluation, such as the compiler, is still under development and its performance may be different when the supercomputer Fugaku starts its operation.

33

Evaluation using one CMG(NUMA node) without MPI Good scalability by increasing the number of threads within CMG. One GMG performance is comparable to Intel one. (Chip contains 4 CMG!)

CloverLeaf

Nov/29/2019

0

5

10

15

20

25

Fujitsu Arm Intel

A64FX TX2 Broadwell

Exec

utio

n tim

e [s

ec]

# of threads 1 # of threads 4 # of threads 8 # of threads 12

0

1

2

3

4

5

6

7

8

9

Fujitsu Arm Intel

A64FX TX2 Broadwell

Rela

tive

perf

orm

ance


Execution time Relative performance (to 1T/A64FX)

34

0

200

400

600

800

1000

1200

1400

1 4 8 12 1 4 8 12 1 4 8 12

1 2 4

Wal

l clo

ck ti

me

[sec

]

# of threads# of processes

A64FX TX2 Xeon

Evaluation of MPI program within one chip (upto 4 MPI process)

Changing #threads within CMG The speedup is limited for more

than 4 threads due to the memory bandwidth (?)

We need more performance analysis.

TeaLeaf

Nov/29/2019

Execution time

0

2

4

6

8

10

12

14

1 4 8 12 1 4 8 12 1 4 8 12

1 2 4

Rela

tive

perfo

rman

ce (/

A64F

X 1p

roc

1thr

ead)

# of threads# of processes

A64FX TX2 XeonRelative performance(to 1T/A64FX)

35

Xeon @ Cygnus, Univ. of TsukubaIntel Xeon Gold 6126 2.6GHz; 12 core x 2 socket

0

500

1000

1500

2000

2500

3000

3500

Fujitsu Arm Intel

A64FX TX2 Broadwell

FOM

(z/

s)


Evaluation using one CMG(NUMA node) without MPI One CMG performance is less than Thx2 and Intel one We found low vectorization (SIMD (SVE) instructions ratio is a few %) We need more code tuning for more vectorization using SIMD

LULESH

Nov/29/2019 36

37Nov/29/2019

38Nov/29/2019

what we did for co-design in development of...

Documents