Post-K Supercomputer with Fujitsu's Original CPU, A64FX Powered by Arm ISA
Toshiyuki Shimizu
Copyright 2018 FUJITSU LIMITED
Nov. 15th, 2018
Post-K is under development, information in these slides is subject to change without notice
SC18, Exhibitor Forum, Dallas, Nov. 15th, 2018
Agenda
Background
K computer and Post-K
Post-K goals and approaches
Fujitsu new Arm CPU A64FX
Post-K system
System software
Configuration
Performance discussion
Summary and development status
K computer, PRIMEHPC, and Post-K
K computer, PRIMEHPC FX
Many applications are running and being developed for science and various industries
System software TCS supports hardware with newly developed technologies
Post-K
RIKEN and Fujitsu are working together for Post-K
OS is running and design verification is proceeding as scheduled
Figure: Development timeline, 2011 to 2020
Japan's National Projects: Development; Operation of K computer; FS projects; HPCI strategic apps program; App. review; Post-K computer development
FUJITSU: PRIMEHPC FX10 (1.8x CPU perf., easier installation); PRIMEHPC FX100 (4x (DP) / 8x (SP) CPU & Tofu2, high-density pkg & lower energy); Post-K
Technical Computing Suite (TCS): handles millions of parallel jobs
FEFS: super scalable file system
MPI: ultra scalable collective communication libraries
OS: lower OS jitter w/ assistant core
Post-K goals and approaches
Post-K goals
High application performance and good power efficiency
Good usability and better accessibility for users
Keeping application compatibility while advancing from predecessors
Our approaches
Develops high performance and scalable, custom CPU cores
【Performance】 Wider SIMD & mathematical acc. primitives, high memory BW
【Scalability】 Scalable interconnect “Tofu”
【Power efficiency】 The best device tech, power control functions, optimal resources
Adopts Arm ISA, binary compatibility
Maintains performance balance and supports advanced features
Fujitsu new Arm CPU: A64FX
A64FX approach
Specs and technologies
Tofu interconnect D
Power management
A64FX summary
A64FX: Approach
High performance Arm CPU
Develops original CPU core supporting Arm SVE (Scalable Vector Extension)
Targeting high performance servers
Floating point calculations
High memory bandwidth
Low power consumption
Scalable performance and configuration
Extension for emerging applications
Half precision (FP16) & INT16/8 partial dot product support
High BW/Calc. is the key for Real apps.
A64FX: Specs & technologies
CPU generations and key parameters
Combination of technologies
SVE increases core performance
CMG is scalable architecture to increase # of cores
HBM enables high bandwidth
CPU | SPARC64 VIIIfx | SPARC64 IXfx | SPARC64 XIfx | A64FX
1st system w/ CPU | K | FX10 | FX100 | Post-K
Si tech. (nm) | 45 | 40 | 20 | 7
Core perf. (GFLOPS) | 16 | 14.8 | 34 | 57~
# of cores | 8 | 16 | 32 | 48
Chip perf. (TFLOPS) | 0.13 | 0.24 | 1.1 | 2.7~
Memory BW (GB/s) | 64 | 85 | 480 | 1024
B/F (Bytes/FLOP) | 0.5 | 0.4 | 0.4 | 0.4
Key technologies marked on the table: SVE (core perf.), CMG (# of cores), HBM (memory BW)
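The derived rows of the table (chip performance and B/F) follow from the per-core performance, core count, and memory bandwidth. The sketch below recomputes them in Python as a consistency check; the function and variable names are illustrative, and the A64FX values are lower bounds because the slide gives them as "57~" and "2.7~".

```python
# Values copied from the table; only the derived rows are recomputed.
CPUS = {
    # name: (core perf. in GFLOPS, number of cores, memory BW in GB/s)
    "SPARC64 VIIIfx": (16.0, 8, 64),
    "SPARC64 IXfx": (14.8, 16, 85),
    "SPARC64 XIfx": (34.0, 32, 480),
    "A64FX": (57.0, 48, 1024),  # "57~" on the slide, so a lower bound
}


def chip_tflops(core_gflops, cores):
    """Chip peak = per-core peak times number of cores, in TFLOPS."""
    return core_gflops * cores / 1000.0


def bytes_per_flop(mem_bw_gbs, core_gflops, cores):
    """B/F = memory bandwidth (GB/s) divided by chip peak (GFLOPS)."""
    return mem_bw_gbs / (core_gflops * cores)


if __name__ == "__main__":
    for name, (gflops, cores, bw) in CPUS.items():
        print(f"{name:15s} {chip_tflops(gflops, cores):5.2f} TFLOPS  "
              f"{bytes_per_flop(bw, gflops, cores):4.2f} B/F")
```

Rounded to the precision used on the slide, the recomputed values match the table rows.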
A64FX technologies: Core performance
High calc. throughput of Fujitsu original CPU core w/ SVE
512-bit wide SIMD x 2 pipelines and new integer functions
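The peak figures quoted for this slide (>57, >115, >230, >460 GOPS for 64-, 32-, 16-, and 8-bit elements) double each time the element size halves. The Python sketch below shows one way to account for that, assuming a multiply-add counts as two operations per lane. The clock frequency it derives is only implied by these numbers; the slides do not state it.

```python
SIMD_BITS = 512    # from the slide
PIPELINES = 2      # from the slide
OPS_PER_LANE = 2   # assumption: a multiply-add counts as 2 operations


def ops_per_cycle(element_bits):
    """Peak operations per cycle per core for a given element size."""
    lanes = SIMD_BITS // element_bits
    return lanes * PIPELINES * OPS_PER_LANE


def implied_clock_ghz(peak_gops, element_bits):
    """Clock frequency that the quoted peak would imply (derived, not stated)."""
    return peak_gops / ops_per_cycle(element_bits)


if __name__ == "__main__":
    for bits in (64, 32, 16, 8):
        print(f"{bits:2d}-bit elements: {ops_per_cycle(bits):3d} ops/cycle")
    print(f"implied clock from >57 GFLOPS: at least "
          f"{implied_clock_ghz(57, 64):.2f} GHz")
```

Under this accounting the 8-bit case gives 256 operations per cycle, eight times the 64-bit case, which matches the ratio between >460 and >57.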
Figure: INT8 partial dot product. Four 8-bit elements A0 to A3 are multiplied by B0 to B3, and the products are added into a 32-bit element C (multiply and add).
Figure: Core peak performance (GOPS) by element size: 64-bit >57, 32-bit >115, 16-bit >230, 8-bit >460 (INT8 partial dot product).
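The INT8 partial dot product can be modeled in a few lines of Python. This is a behavioral sketch of the operation as drawn on the slide, not the instruction's exact definition: the function names are illustrative, and signed inputs with 32-bit wraparound are assumptions.

```python
def to_int32(x):
    """Wrap an integer to signed 32-bit two's complement (assumed behavior)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def int8_partial_dot(acc, a, b):
    """For each 32-bit accumulator, add the sum of four 8-bit products.

    acc: list of 32-bit accumulators (C in the figure)
    a, b: lists of signed 8-bit values, four per accumulator (A0..A3, B0..B3)
    """
    if not (len(a) == len(b) == 4 * len(acc)):
        raise ValueError("need four 8-bit elements per 32-bit accumulator")
    out = []
    for i, c in enumerate(acc):
        partial = sum(a[4 * i + k] * b[4 * i + k] for k in range(4))
        out.append(to_int32(c + partial))
    return out


if __name__ == "__main__":
    # One 32-bit lane: 10 + (1*5 + 2*6 + 3*7 + 4*8)
    print(int8_partial_dot([10], [1, 2, 3, 4], [5, 6, 7, 8]))
```

A 512-bit register holds sixteen such 32-bit lanes, so one operation covers 64 multiply-add pairs in this model.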
A64FX technologies: Scalable architecture
Core Memory Group (CMG)
12 compute cores and an assistant core for OS daemon, I/O, etc.
Shared L2 cache
Dedicated memory controller
Four CMGs maintain cache coherence w/ on-chip directory
Binding threads within a CMG allows linear speedup up to 48 compute cores
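As a rough illustration of binding threads within a CMG, the Python sketch below builds a per-process core list for a one-process-per-CMG layout. The core numbering (compute cores 0 to 47, contiguous per CMG, assistant cores excluded) is a hypothetical convention chosen for the example; the slides do not describe how cores are numbered.

```python
CMGS = 4             # from the slide
CORES_PER_CMG = 12   # compute cores per CMG, from the slide


def binding_plan(processes=CMGS, threads_per_process=CORES_PER_CMG):
    """Map each process rank to compute cores that all sit in one CMG.

    Assumes a hypothetical numbering in which CMG k owns cores
    12*k .. 12*k + 11.
    """
    if not 1 <= processes <= CMGS:
        raise ValueError("at most one process per CMG in this layout")
    if not 1 <= threads_per_process <= CORES_PER_CMG:
        raise ValueError("threads of a process must fit in one CMG")
    plan = {}
    for rank in range(processes):
        first = rank * CORES_PER_CMG
        plan[rank] = list(range(first, first + threads_per_process))
    return plan


if __name__ == "__main__":
    for rank, cores in binding_plan().items():
        print(f"rank {rank}: cores {cores[0]}..{cores[-1]}")
```

On Linux, a core list like this could be handed to `os.sched_setaffinity`, or the same layout could be expressed through the job launcher or OpenMP runtime settings, depending on the environment.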
Figure: CMG configuration. 13 cores (12 compute cores and one assistant core) connect through an X-Bar connection to the L2 cache (8 MiB, 16-way), which connects to the memory controller (HBM2) and the network on chip.
Figure: A64FX package configuration. Four CMGs, each with its own HBM2, are connected by the network on chip together with the Tofu controller and the PCIe controller.
A64FX: L1D cache uncompromised BW
128B/cycle sustained BW even for unaligned SIMD load
“Combined Gather” doubles gather (indirect) load’s data throughput, when target elements are within a “128-byte aligned block” for a pair of two regs, even & odd
Figure: L1D$ read paths. Read port0 and read port1 each deliver 64B/cycle of read data.
Figure: Combined gather. Eight 8B elements (0 to 7) within one 128B memory block are gathered into registers in four flows (flow-1 to flow-4).
Maximizes BW to 32 bytes/cyc.
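The effect of the combined gather can be illustrated with a small counting model in Python. It is a simplification based on one reading of the slide: elements are taken in even/odd pairs, and a pair costs one load flow when both addresses fall in the same 128-byte aligned block, two flows otherwise. The function name and the model itself are assumptions, not a description of the actual hardware logic.

```python
BLOCK_BYTES = 128  # size of the aligned block, from the slide


def gather_flows(addresses):
    """Count load flows for a gather under the simplified pairing model.

    addresses: byte address of each gathered element, in element order.
    Elements 2k and 2k+1 share one flow if they lie in the same
    128-byte aligned block; otherwise each needs its own flow.
    """
    if len(addresses) % 2:
        raise ValueError("expected an even number of elements")
    flows = 0
    for k in range(0, len(addresses), 2):
        same_block = addresses[k] // BLOCK_BYTES == addresses[k + 1] // BLOCK_BYTES
        flows += 1 if same_block else 2
    return flows


if __name__ == "__main__":
    # Index order from the figure: eight 8-byte elements inside one 128B block.
    clustered = [i * 8 for i in (0, 1, 3, 2, 6, 7, 4, 5)]
    # Every element in a different 128B block.
    scattered = [i * 256 for i in range(8)]
    print(gather_flows(clustered), gather_flows(scattered))
```

In this model the clustered case needs half as many flows as the fully scattered case, which corresponds to the doubling of gather throughput described on the slide.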
A64FX: Tofu interconnect D
Integrated w/ rich resources
Increased TNIs achieve higher injection BW & flexible comm. patterns
Increased barrier resources allow flexible collective comm. algorithms
Memory bypassing achieves low latency
Direct descriptor & cache injection
TofuD spec
Data rate | 28.05 Gbps
Link bandwidth | 6.8 GB/s
Injection bandwidth | 40.8 GB/s
Measured
Put throughput | 6.35 GB/s
PingPong latency | 0.49~0.54 µs
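The TofuD numbers relate to each other in a simple way, sketched below in Python. The six TNIs (TNI0 to TNI5) and the two lanes per port are read off the slide's diagram; the variable names are illustrative.

```python
DATA_RATE_GBPS = 28.05   # per lane, from the spec table
LANES_PER_PORT = 2       # "2 lanes x 10 ports" in the diagram
LINK_BW_GBS = 6.8        # from the spec table
TNIS = 6                 # TNI0..TNI5 in the diagram
PUT_THROUGHPUT_GBS = 6.35  # measured, from the slide


def raw_port_bw_gbs():
    """Raw signalling rate of one port in GB/s, before any encoding overhead."""
    return DATA_RATE_GBPS * LANES_PER_PORT / 8


def injection_bw_gbs():
    """Injection bandwidth when all TNIs drive one link each."""
    return TNIS * LINK_BW_GBS


def put_efficiency():
    """Measured Put throughput as a fraction of the link bandwidth."""
    return PUT_THROUGHPUT_GBS / LINK_BW_GBS


if __name__ == "__main__":
    print(f"raw port rate : {raw_port_bw_gbs():.4f} GB/s")
    print(f"injection BW  : {injection_bw_gbs():.1f} GB/s")
    print(f"Put efficiency: {put_efficiency():.1%}")
```

The gap between the raw port rate and the 6.8 GB/s link bandwidth is close to a 64/66 ratio, which would be consistent with 64b/66b line encoding, although the slides do not state the encoding.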
Figure: A64FX with TofuD. Four CMGs, each with HBM2, connect through the NOC to PCIe and to TofuD, which contains six TNIs (TNI0 to TNI5) and the Tofu network router (2 lanes × 10 ports).
A64FX: Power monitor and analyzer
Energy monitor (per chip)
Node power via Power API*1 (~msec)
Averaged power of a node, CMG (cores, an L2 cache, a memory) etc.
Energy analyzer (per core)
Power profiler via PAPI*2 (~nsec)
Fine grained power analysis of a core, an L2 cache and a memory
*1: Sandia National Laboratories
*2: Performance Application Programming Interface
Figure: Energy analyzer. Within each CMG (CMG#0 to CMG#3) of the A64FX, power is calculated from core activity, L2 cache activity, and memory activity.
Figure: Energy monitor. Node-level view covering the CMGs (12 cores, L2 cache, HBM2 memory), the assistant core, Tofu, and PCIe, including power calculations for PCIe and Tofu.
A64FX: Power Knobs to reduce power consumption
“Power knob” limits units’ activity via user APIs
Performance/W would be optimized by utilizing Power knobs, Energy monitor & analyzer
Figure: A64FX core pipeline (Fetch, Issue, Dispatch, Reg-read, Execute, Cache and Memory, Commit) and chip-level blocks (52 cores, L2$, HBM2 controller, HBM2, Tofu controller, PCI controller), annotated with the power knobs:
Decode width: 4 to 2
EXA pipeline only
FLA pipeline only
HBM2 B/W adjust (units of 10%)
Frequency reduction
A64FX: Summary
Arm SVE, high performance and efficiency
DP performance >2.7 TFLOPS, >90%@DGEMM
Memory BW 1024 GB/s, >80%@STREAM Triad
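The efficiency claims above can be checked against the measured values shown in the deck's preliminary performance chart (2.5 TF for DGEMM, 840 GB/s for STREAM Triad). The Python sketch below does that arithmetic and also splits the memory figures per CMG. Since the peak is quoted as ">2.7 TFLOPS", the DGEMM ratio is approximate; all names are illustrative.

```python
PEAK_DP_TFLOPS = 2.7     # ">2.7 TFLOPS" on the slide, so a lower bound
PEAK_MEM_BW_GBS = 1024   # from the slide
DGEMM_TFLOPS = 2.5       # measured, from the performance chart
STREAM_TRIAD_GBS = 840   # measured, from the performance chart
HBM2_STACKS = 4          # one per CMG
MEMORY_GIB = 32          # total HBM2 capacity


def efficiency(measured, peak):
    """Fraction of peak that was achieved."""
    return measured / peak


if __name__ == "__main__":
    print(f"DGEMM       : {efficiency(DGEMM_TFLOPS, PEAK_DP_TFLOPS):.1%} of peak")
    print(f"STREAM Triad: {efficiency(STREAM_TRIAD_GBS, PEAK_MEM_BW_GBS):.1%} of peak")
    print(f"per CMG     : {PEAK_MEM_BW_GBS // HBM2_STACKS} GB/s, "
          f"{MEMORY_GIB // HBM2_STACKS} GiB")
```

Both ratios land above the ">90%" and ">80%" thresholds quoted on this slide.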
A64FX specifications
ISA (Base, extension) | Armv8.2-A, SVE
Process technology | 7 nm
Peak DP performance | >2.7 TFLOPS
SIMD width | 512-bit
# of compute cores | 48
Memory capacity | 32 GiB (HBM2 x4)
Memory peak bandwidth | 1024 GB/s
PCIe | Gen3 16 lanes
High speed interconnect | TofuD integrated
Figure: A64FX block diagram. Four CMGs (each with 12x compute cores and 1x assistant core) and four HBM2 stacks are connected by the NOC, together with the PCIe controller and the Tofu interface.
CMG: Core Memory Group, NOC: Network on Chip
Software development: System software
RIKEN and Fujitsu are developing software stacks for Post-K
Figure: Post-K software stack, top to bottom
Post-K applications
Fujitsu Technical Computing Suite / RIKEN developing system software
Management software: System management for high availability & power saving operation; Job management for higher system utilization & power efficiency
File system: LLIO (NVM-based file I/O accelerator); FEFS (Lustre-based distributed file system)
Programming environment: XcalableMP; MPI (Open MPI, MPICH); OpenMP, COARRAY, Math. libs.; Compilers (C, C++, Fortran); Debugging and tuning tools
Linux OS / McKernel (Lightweight kernel)
Post-K system hardware
Legend: Post-K; Under development w/ RIKEN
FEFS vs Lustre, LLIO vs Burst buffer
Post-K supports superior performance file system, FEFS + LLIO
Item | FEFS | Lustre
Storage network | Tofu, InfiniBand, Ether, OPA | InfiniBand, Ether, OPA
Injection BW per node | 40.8 GB/s | 12.5 GB/s (IB EDR)
MDS scalability | Yes | Yes (ver. 2.4~)
Interoperability | x86, Arm (A64FX) | x86, Power, Arm?

Item | FEFS + LLIO* | Lustre + Burst buffer**
Local temporary file acc. | Yes | No
Cache file acc. | Yes | Yes
Shared file within a job | Yes | Yes
Explicit function call | Not required (by mount point) | Required (source modification)
*: Lightweight Layered IO-Accelerator
**: DDN IME
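The "not required (by mount point)" entry means an application keeps using ordinary file I/O and only the directory it writes to changes. The Python sketch below illustrates that idea in general terms. The environment variable `SCRATCH_DIR` is a hypothetical name invented for this example, not an LLIO setting, and no LLIO-specific call is used or implied.

```python
import os
import tempfile


def write_result(directory, name, data):
    """Write bytes with plain file I/O; nothing here is file-system specific."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)
    return path


def pick_output_dir():
    """Choose the output directory from a hypothetical environment variable.

    Pointing SCRATCH_DIR at a differently mounted file system changes where
    the data lands without touching the application code.
    """
    return os.environ.get("SCRATCH_DIR", tempfile.gettempdir())


if __name__ == "__main__":
    path = write_result(pick_output_dir(), "result.bin", b"example payload")
    print("wrote", path)
```

By contrast, the "required (source modification)" entry describes a design in which the application must call an accelerator-specific interface explicitly.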
LLIO (Lightweight Layered IO-accelerator) overview
Maximizing application file I/O performance
Easy access to User Data: File Cache of Global File System
Higher Data Access Performance: Temporary Local FS (in a process)
Higher Data Sharing Performance: Temporary Shared FS (among processes)
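Figure: LLIO overview for two jobs (Job A, Job B). 1st level: on the nodes, temporary local file systems (file1, file2), a temporary shared file system (File3), and a cache. 2nd level: the global file system (Lustre based, file4). Applications issue file I/O from the compute cores; the assistant cores are also shown on the node.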
Post-K system configuration
Scalable design
Unit | # of nodes | Description
CPU | 1 | Single socket node with HBM2 & Tofu interconnect D
CMU | 2 | CPU Memory Unit: 2x CPU
BoB | 16 | Bunch of Blades: 8x CMU
Shelf | 48 | 3x BoB
Rack | 384 | 8x Shelf
Figure: photos of each unit, from CPU through CMU, BoB, and Shelf to Rack.
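The node counts in the table follow from multiplying the packaging factors level by level. The Python sketch below reproduces them and derives two per-rack totals using the 48 compute cores and the ">2.7 TFLOPS" peak quoted for the A64FX; the per-rack peak is therefore a lower bound, and the names are illustrative.

```python
# (unit name, how many of the previous unit it contains), from the table
HIERARCHY = [("CMU", 2), ("BoB", 8), ("Shelf", 3), ("Rack", 8)]
COMPUTE_CORES_PER_NODE = 48   # from the A64FX summary
PEAK_DP_TFLOPS_PER_NODE = 2.7  # ">2.7 TFLOPS", so a lower bound


def nodes_per_unit():
    """Return the number of nodes (CPUs) contained in each packaging unit."""
    counts = {"CPU": 1}
    nodes = 1
    for unit, multiplier in HIERARCHY:
        nodes *= multiplier
        counts[unit] = nodes
    return counts


if __name__ == "__main__":
    counts = nodes_per_unit()
    for unit, nodes in counts.items():
        print(f"{unit:5s} {nodes:4d} nodes")
    rack = counts["Rack"]
    print(f"rack: {rack * COMPUTE_CORES_PER_NODE} compute cores, "
          f"more than {rack * PEAK_DP_TFLOPS_PER_NODE:.0f} TFLOPS peak DP")
```

With 384 nodes, one rack carries 18,432 compute cores and a peak of just over 1 PFLOPS in double precision under these figures.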
CMU: CPU Memory Unit
A64FX CPU x2
QSFP28 x3 for Active Optical Cables
Single-side blind mate of signals & water
~100% direct water cooling
Figure: CMU photo with labels for the water connections, electrical signals, AOCs, and the three QSFP28 ports (X, Y, Z).
Preliminary performance evaluation results
Over 2.5x faster in HPC & AI benchmarks than SPARC64 XIfx
Arm SVE, Fujitsu’s microarchitecture, and high bandwidth
Figure: A64FX chip performance measurements & architectural contributions
Y-axis: performance increase normalized by SPARC64 XIfx (baseline: SPARC64 XIfx)
HPC throughput tests: DGEMM (2.5 TF), STREAM Triad (840 GB/s)
HPC application kernels: Fluid dynamics, Atmosphere, Seismic wave propagation
AI: Convolution FP32, Convolution low precision
Bar labels as extracted: 9.4x, 2.5x, 3.4x, 2.8x, 3.0x
Legend (architectural contributions): 512-bit SIMD, Combined gather, INT8 partial dot product, L1$ BW, L2$ BW, Memory BW
Summary of Post-K
High application performance and good power efficiency
High memory bandwidth & wider SIMD
Arm SVE support
Scalable Vector Extension, state-of-the-art Arm instruction set architecture
Binary compatibility advances and expands the power of ecosystem by incorporating HPC technologies and applications
Optimizing for new supercomputer standards
Low power consumption in many aspects
Focusing on both existing and emerging applications
Post-K inheriting the strong and proven microarchitecture with HPC system software will be more powerful with Arm and its ecosystem
System development status: Post-K
Fujitsu is currently developing Post-K, successor to K computer, with RIKEN
OS is running and design verification is underway as scheduled
RIKEN announced Post-K early access program to begin around Q2/CY2020
Post-K prototype rack
https://postk-web.r-ccs.riken.jp/sched.html