TRANSCRIPT
Accelerated Computing Activities at ITC/University of Tokyo
Kengo Nakajima, Information Technology Center (ITC), The University of Tokyo
ADAC-3 Workshop, January 25-27, 2017, Kashiwa, Japan
Three Major Campuses of University of Tokyo
Kashiwa
Hongo
Komaba

Kashiwa (柏): Oak; Kashiwa-no-Ha: Leaf of Oak
“Kashi (樫)” is also called “Oak”; “Oak” covers many kinds of trees in Japan
• Overview of Information Technology Center (ITC), The University of Tokyo
– Reedbush: Integrated Supercomputer System for Data Analyses & Scientific Simulations; our first system with GPUs
– Oakforest-PACS: #1 system in Japan!
• Performance of the New Systems
– 3D Poisson Solver by ICCG
– Matrix Assembly for 3D FEM
Information Technology Center, The University of Tokyo (ITC/U.Tokyo)
• Campus- and nation-wide services on infrastructure for information, and related research & education
• Established in 1999
– Campus-wide Communication & Computation Division
– Digital Library/Academic Information Science Division
– Network Division
– Supercomputing Division
• Core institute of nation-wide infrastructure services / collaborative research projects
– Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) (2010-)
– HPCI (HPC Infrastructure)
Supercomputing Research Division of ITC/U.Tokyo (SCD/ITC/UT)
http://www.cc.u-tokyo.ac.jp
• Services & Operations of Supercomputer Systems, Research, Education
• History
– Supercomputing Center, U.Tokyo (1965-1999)
• Oldest academic supercomputer center in Japan
• Nation-wide, joint-use facility: users are not limited to researchers and students of U.Tokyo
– Information Technology Center (1999-) (4 divisions)
• 10+ faculty members (including part-time members): architecture, system S/W, algorithms, applications, GPU
• 8 technical staff
Research Activities
• Collaboration with Users
– Linear solvers, parallel visualization, performance tuning
• Research Projects
– FP3C (collaboration with French institutes) (FY.2010-2013): Tsukuba, Tokyo Tech, Kyoto
– Feasibility Study of Advanced HPC in Japan (towards the Japanese Exascale Project) (FY.2012-2013): 1 of 4 teams (general-purpose processors, latency cores)
– ppOpen-HPC (FY.2011-2015, 2016-2018)
– Post-K with RIKEN AICS (FY.2014-)
– ESSEX-II (FY.2016-2018): German-Japanese collaboration
• International Collaboration
– Lawrence Berkeley National Laboratory (USA)
– National Taiwan University (Taiwan)
– National Central University (Taiwan)
– Intel Parallel Computing Center
– ESSEX-II/SPPEXA/DFG (Germany)
Plan of 9 Tier-2 Supercomputer Centers (October 2016)
[Roadmap chart, FY2014-FY2025, one row per center (Hokkaido, Tohoku, Tsukuba, Tokyo, Tokyo Tech., Nagoya, Kyoto, Osaka, Kyushu), showing each center's systems in operation (e.g., Fujitsu FX10 (1 PFlops) and Reedbush (1.8-1.9 PF) at Tokyo, TSUBAME 2.5 (5.7 PF) at Tokyo Tech., SX-ACE at Tohoku and Osaka, HA-PACS (1,166 TF) and COMA (PACS-IX, 1,001 TF) at Tsukuba) and planned procurements through FY2025 (e.g., Post Open Supercomputer 25 PF (4.2 MW), TSUBAME 3.0 (20 PF, 2.0 MW) and TSUBAME 4.0 (100+ PF), PACS-X 10 PF, and 50-200+ PF class systems at several centers). Power consumption indicates the maximum of power supply (including the cooling facility).]
Supercomputers in ITC/U.Tokyo: 2 big systems, 6-year cycle
[Timeline, FY2008-FY2022:
– Hitachi SR11K/J2 (IBM POWER5+): 18.8 TFLOPS, 16.4 TB
– Yayoi: Hitachi SR16000/M1 (IBM POWER7): 54.9 TFLOPS, 11.2 TB
– Hitachi HA8000 (T2K) (AMD Opteron): 140 TFLOPS, 31.3 TB
– Oakleaf-FX: Fujitsu PRIMEHPC FX10 (SPARC64 IXfx): 1.13 PFLOPS, 150 TB
– Oakbridge-FX: 136.2 TFLOPS, 18.4 TB
– Reedbush (SGI, Broadwell + Pascal): 1.80-1.93 PFLOPS
– Oakforest-PACS (Fujitsu, Intel KNL; JCAHPC: Tsukuba & Tokyo): 25 PFLOPS, 919.3 TB
– GPU Cluster: 1.40+ PFLOPS
– BDEC System (Big Data & Extreme Computing): 50+ PFLOPS (?)
(K computer and Post-K shown for reference)]
We are now operating 5 systems !!
• Yayoi (Hitachi SR16000, IBM Power7)
– 54.9 TF, Nov. 2011 – Oct. 2017
• Oakleaf-FX (Fujitsu PRIMEHPC FX10)
– 1.135 PF, commercial version of K, Apr. 2012 – Mar. 2018
• Oakbridge-FX (Fujitsu PRIMEHPC FX10)
– 136.2 TF, for long-time use (up to 168 hr), Apr. 2014 – Mar. 2018
• Reedbush (SGI, Intel BDW + NVIDIA P100 (Pascal))
– Integrated Supercomputer System for Data Analyses & Scientific Simulations
– 1.93 PF, Jul. 2016 – Jun. 2020
– Our first GPU system (Mar. 2017), DDN IME (burst buffer)
• Oakforest-PACS (OFP) (Fujitsu, Intel Xeon Phi (KNL))
– JCAHPC (U.Tsukuba & U.Tokyo)
– 25 PF, #6 in the 48th TOP500 (Nov. 2016) (#1 in Japan)
– Omni-Path Architecture, DDN IME (burst buffer)
Visiting Tour this Afternoon: the five systems listed above.
Work Ratio: 80+% on average (Oakleaf-FX + Oakbridge-FX)
[Chart: monthly work ratio (%, 0-100) of Oakleaf-FX, Oakbridge-FX, Yayoi, and Reedbush-U.]
Research Area based on CPU Hours: FX10 (Oakleaf-FX + Oakbridge-FX) in FY.2015 (Apr. 2015 - Mar. 2016)
[Pie chart: Engineering, Earth/Space, Material, Energy/Physics, Information Sci., Education, Industry, Bio, Economics.]
Services for Industry
• Originally, only academic users were allowed to access our supercomputer systems.
• Since FY.2008, we have offered services for industry
– support for starting large-scale computing for future business
– we do not compete with private data centers, cloud services, etc.
– basically, results must be open to the public
– a maximum of 10% of the total computational resources is open for usage by industry
– special qualification processes and a special (higher) usage fee
• Various types of services
– Normal usage (more expensive than for academic users): 3-4 groups per year, fundamental research
– Trial usage with a discount rate
– Research collaboration at the academic rate (e.g., Taisei)
– Open-source/in-house codes only (NO ISV/commercial applications)
Training & Education
• 2-day “hands-on” tutorials for parallel programming by faculty members of SCD/ITC (free)
– Participants from industry are accepted.
• Graduate/undergraduate classes with the supercomputer systems (free)
– We encourage faculty members to introduce hands-on tutorials on the supercomputer systems into graduate/undergraduate classes.
– Up to 12 nodes (192 cores) of Oakleaf-FX, 8 nodes (288 cores) of Reedbush-U
– Proposal-based
– Not limited to classes of the University of Tokyo (2-3 of 10)
• RIKEN AICS Summer/Spring School (2011-)
Reasons why we did not introduce systems with GPUs before ...
• CUDA
• We have 2,000+ users
• Although we are proud that they are very smart and diligent ...
• Therefore, we decided to adopt Intel XXX for the Post-T2K system in Summer 2010.
Why have we decided to introduce a system with GPUs this time?
• OpenACC
– Much easier than CUDA
– Performance has improved recently
– Efforts by Akira Naruse (NVIDIA)
• Experts in our division
– Profs. Hanawa, Ohshima & Hoshino
• Data science, deep learning
– Development of new types of users other than traditional CSE (Computational Science & Engineering)
• Research Organization for Genome Medical Science, U.Tokyo
• U.Tokyo Hospital: processing of medical images by deep learning
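To illustrate the difference for users, here is a minimal sketch of OpenACC offloading (our own illustration, not code from the talk): a single directive offloads the loop, whereas the CUDA version of the same daxpy would need a kernel, device allocation, and explicit host-device transfers.

  ! Hedged sketch: daxpy offloaded with one OpenACC directive.
  subroutine axpy_acc(n, a, x, y)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(in)    :: a, x(n)
    real(8), intent(inout) :: y(n)
    integer :: i
  !$acc parallel loop copyin(x) copy(y)
    do i = 1, n
      y(i) = a * x(i) + y(i)
    end do
  !$acc end parallel loop
  end subroutine axpy_acc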
Reedbush (1/2)
• SGI was awarded the contract (Mar. 22, 2016)
• Compute nodes (CPU only): Reedbush-U
– Intel Xeon E5-2695 v4 (Broadwell-EP, 2.1 GHz, 18 cores) x 2 sockets (1.210 TF), 256 GiB (153.6 GB/sec)
– InfiniBand EDR, full-bisection fat-tree
– Total system: 420 nodes, 508.0 TF
• Compute nodes (with accelerators): Reedbush-H
– Intel Xeon E5-2695 v4 (Broadwell-EP, 2.1 GHz, 18 cores) x 2 sockets, 256 GiB (153.6 GB/sec)
– NVIDIA Pascal GPU (Tesla P100): (4.8-5.3 TF, 720 GB/sec, 16 GiB) x 2 per node
– InfiniBand FDR x 2 channels (one per GPU), full-bisection fat-tree
– 120 nodes: 145.2 TF (CPU) + 1.15-1.27 PF (GPU) = 1.30-1.42 PF
Configuration of Each Compute Node of Reedbush-H
[Node diagram: two Intel Xeon E5-2695 v4 (Broadwell-EP) sockets connected by QPI (76.8 GB/s), each with 128 GB of DDR4 memory (76.8 GB/s over 4 channels per socket); two NVIDIA Pascal (Tesla P100) GPUs, connected to each other by NVLink (20 GB/s) and to the CPUs through PCIe switches (PCIe Gen3 x16, 16 GB/s); two InfiniBand FDR HCAs, one per GPU, uplinked to the EDR switch.]

Reedbush System Configuration
[System diagram:
– Compute nodes, 1.925 PFlops in total:
– Reedbush-U (CPU only), 508.03 TFlops: 420 nodes (SGI Rackable C2112-4GP3, 56 Gbps x 2 per node); CPU: Intel Xeon E5-2695 v4 x 2 sockets (Broadwell-EP, 2.1 GHz, 18 cores, 45 MB L3 cache); memory: 256 GB (DDR4-2400, 153.6 GB/sec)
– Reedbush-H (with accelerators), 1297.15-1417.15 TFlops: 120 nodes (SGI Rackable C1100 series, dual-port InfiniBand FDR 4x); same CPU and memory, plus GPU: NVIDIA Tesla P100 x 2 (Pascal, SXM2, 4.8-5.3 TF, 16 GB, 720 GB/sec, PCIe Gen3 x16, NVLink 20 GB/sec x 2 bricks)
– Interconnect: InfiniBand EDR 4x, full-bisection fat-tree, 100 Gbps per node (Mellanox CS7500 634-port + SB7800/7890 36-port x 14)
– Parallel file system: Lustre, 5.04 PB (DDN SFA14KE x 3), 145.2 GB/s
– High-speed file cache system: 209 TB (DDN IME14K x 6), 436.2 GB/s
– Login nodes x 6 and management servers, connected to UTnet and users]
Why “Reedbush” ?
• “L'homme est un roseau pensant.” / “Man is a thinking reed.” (人間は考える葦である)
– Pensées, Blaise Pascal (1623-1662)
Reedbush (2/2)
• Storage / file systems
– Shared parallel file system (Lustre): 5.04 PB, 145.2 GB/sec
– Fast file cache system: burst buffer, DDN IME (Infinite Memory Engine): SSD, 209.5 TB, 450 GB/sec
• Power, cooling, space
– Air cooling only; < 500 kVA (378 kVA without A/C); < 90 m2
• Software & toolkits for data analysis, deep learning ...
– OpenCV, Theano, Anaconda, ROOT, TensorFlow
– Torch, Caffe, Chainer, GEANT4
• Pilot system towards the BDEC system planned after Fall 2018
– New research areas: data analysis, deep learning, etc.
Supercomputers in ITC/U.Tokyo: 2 big systems, 6-year cycle
[Same roadmap timeline as above, here annotated with “2 GPUs/node” and “4 GPUs/node” on the GPU-accelerated systems.]
Oakforest-PACS
• Full operation started on December 1, 2016
• 8,208 Intel Xeon Phi (KNL) nodes, 25 PF peak performance
– Fujitsu
• TOP500 #6 (#1 in Japan), HPCG #3 (#2 in Japan), Green500 #6 (#2 in Japan) (November 2016)
• JCAHPC: Joint Center for Advanced High Performance Computing
– University of Tsukuba
– University of Tokyo
• The new system is installed at the Kashiwa-no-Ha (Leaf of Oak) Campus of U.Tokyo, which lies between Tokyo and Tsukuba
– http://jcahpc.jp
JCAHPC: Joint Center for Advanced High Performance Computing (http://jcahpc.jp)
• JCAHPC was established in 2013 under an agreement between the University of Tsukuba and the University of Tokyo for the collaborative promotion of HPC by:
– the Center for Computational Sciences (CCS) at the University of Tsukuba, and
– the Information Technology Center (ITC) at the University of Tokyo.
• Researchers belonging to the two universities design, operate, and manage a next-generation supercomputer system.
Kashiwa is the ideal place for collaboration between U.Tokyo & U.Tsukuba.
Features of Oakforest-PACS (1/2)
• Computing node
– 68 cores/node, 3 TFLOPS x 8,208 nodes = 25 PFLOPS
– 2 types of memory
• MCDRAM: high-speed, 16 GB
• DDR4: medium-speed, 96 GB
– Variety of selections for memory mode / cluster mode
• Node-to-node communication
– Fat-tree network with full bisection bandwidth
– Intel's Omni-Path Architecture
• High efficiency for applications using the full nodes of the system
• Flexible and efficient operation of multiple jobs
Specification of the Oakforest-PACS system
– Total peak performance: 25 PFLOPS
– Total number of compute nodes: 8,208
– Compute node
– Product: Fujitsu PRIMERGY CX600 M1 (2U) + CX1640 M1 x 8 nodes
– Processor: Intel Xeon Phi 7250 (code name: Knights Landing), 68 cores, 1.4 GHz
– Memory: high-BW 16 GB, 490 GB/sec (MCDRAM, effective rate); low-BW 96 GB, 115.2 GB/sec (peak rate)
– Interconnect
– Product: Intel Omni-Path Architecture
– Link speed: 100 Gbps
– Topology: fat-tree with (completely) full bisection bandwidth
[Photos: computation node (Fujitsu next-generation PRIMERGY) with a single-chip Intel Xeon Phi (Knights Landing, 3+ TFLOPS) and an Intel Omni-Path Architecture card (100 Gbps); a 2U chassis holds 8 nodes; 15 chassis (120 nodes) per rack.]

Full-bisection-bandwidth fat-tree by Intel Omni-Path Architecture
[Network diagram: 12 director switches of 768 ports and 362 edge switches of 48 ports; each edge switch has 24 uplinks and 24 downlinks. (Source: Intel)]
At first, to reduce switches and cables, we considered connecting all the nodes within each subgroup with a full-bisection-bandwidth (FBB) fat-tree and connecting the subgroups to each other at >20% of FBB. However, the hardware quantity is not so different from a globally FBB network, and global FBB is preferred for flexible job management.
Features of Oakforest-PACS (2/2)
• File I/O
– Shared file system: Lustre, 26 PB
– File cache system (DDN IME, burst buffer): 1+ TB/sec performance for IOR benchmarks
– Useful not only for computational sciences, but also for big-data analyses, machine learning, etc.
• Power consumption
– #6 in Green500 (Linpack)
• 4,986 MFLOPS/W (OFP)
• 830 MFLOPS/W (K computer)
Specification of the Oakforest-PACS system (cont'd)
– Parallel file system
– Type: Lustre file system
– Total capacity: 26.2 PB
– Product: DataDirect Networks SFA14KE
– Aggregate BW: 500 GB/sec
– File cache system
– Type: burst buffer, Infinite Memory Engine (by DDN)
– Total capacity: 940 TB (NVMe SSD, including parity data by erasure coding)
– Product: DataDirect Networks IME14K
– Aggregate BW: 1,560 GB/sec
– Power consumption: 4.2 MW (including cooling)
– Number of racks: 102
Benchmarks
• TOP500 (Linpack, HPL (High-Performance Linpack))
– Direct linear solvers, FLOPS rate
– Regular dense matrices, continuous memory access
– Measures computing performance
• HPCG
– Preconditioned iterative solvers, FLOPS rate
– Irregular sparse matrices derived from FEM applications, with many zero components
• Irregular/random memory access
• Closer to “real” applications than HPL (see the sketch below)
– Measures performance of memory and communications
• Green500
– FLOPS/W rate for HPL (TOP500)
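To see why HPCG stresses memory rather than arithmetic, consider the sparse matrix-vector product at the heart of a preconditioned CG iteration. A minimal CRS-format sketch (ours, not HPCG source code); the indirect load x(col_idx(j)) is what produces the irregular/random access:

  ! Hedged sketch: y = A*x with A stored in CRS format.
  subroutine spmv_crs(n, row_ptr, col_idx, val, x, y)
    implicit none
    integer, intent(in)  :: n, row_ptr(n+1), col_idx(*)
    real(8), intent(in)  :: val(*), x(n)
    real(8), intent(out) :: y(n)
    integer :: i, j
  !$omp parallel do private(j)
    do i = 1, n
      y(i) = 0.0d0
      do j = row_ptr(i), row_ptr(i+1) - 1
        y(i) = y(i) + val(j) * x(col_idx(j))   ! indirect, irregular access
      end do
    end do
  !$omp end parallel do
  end subroutine spmv_crs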
48th TOP500 List (November 2016) — see http://www.top500.org/
(Rmax: Linpack performance (TFLOPS); Rpeak: peak performance (TFLOPS); Power: kW)
1. Sunway TaihuLight (Sunway MPP, Sunway SW26010 260C 1.45 GHz, 2016, NRCPC), National Supercomputing Center in Wuxi, China: 10,649,600 cores, Rmax 93,015 (= 93.0 PF), Rpeak 125,436, 15,371 kW
2. Tianhe-2 (Intel Xeon E5-2692, TH Express-2, Xeon Phi, 2013, NUDT), National Supercomputing Center in Tianjin, China: 3,120,000 cores, Rmax 33,863 (= 33.9 PF), Rpeak 54,902, 17,808 kW
3. Titan (Cray XK7/NVIDIA K20x, 2012, Cray), Oak Ridge National Laboratory, USA: 560,640 cores, Rmax 17,590, Rpeak 27,113, 8,209 kW
4. Sequoia (BlueGene/Q, 2011, IBM), Lawrence Livermore National Laboratory, USA: 1,572,864 cores, Rmax 17,173, Rpeak 20,133, 7,890 kW
5. Cori (Cray XC40, Intel Xeon Phi 7250 68C 1.4 GHz, Cray Aries, 2016, Cray), DOE/SC/LBNL/NERSC, USA: 632,400 cores, Rmax 14,015, Rpeak 27,881, 3,939 kW
6. Oakforest-PACS (PRIMERGY CX600 M1, Intel Xeon Phi 7250 68C 1.4 GHz, Intel Omni-Path, 2016, Fujitsu), Joint Center for Advanced High Performance Computing, Japan: 557,056 cores, Rmax 13,555, Rpeak 24,914, 2,719 kW
7. K computer (SPARC64 VIIIfx, 2011, Fujitsu), RIKEN AICS, Japan: 705,024 cores, Rmax 10,510, Rpeak 11,280, 12,660 kW
8. Piz Daint (Cray XC30/NVIDIA P100, 2013, Cray), Swiss National Supercomputing Centre, Switzerland: 206,720 cores, Rmax 9,779, Rpeak 15,988, 1,312 kW
9. Mira (BlueGene/Q, 2012, IBM), Argonne National Laboratory, USA: 786,432 cores, Rmax 8,587, Rpeak 10,066, 3,945 kW
10. Trinity (Cray XC40, Xeon E5-2698v3 16C 2.3 GHz, 2016, Cray), DOE/NNSA/LANL/SNL, USA: 301,056 cores, Rmax 8,101, Rpeak 11,079, 4,233 kW
HPCG Ranking (SC16, November 2016) — see http://www.hpcg-benchmark.org/
1. K computer, RIKEN AICS, Japan: 705,024 cores, HPL Rmax 10.510 PF, TOP500 #7, HPCG 0.6027 PF, HPCG/HPL 5.73%
2. Tianhe-2, NSCC/Guangzhou, China: 3,120,000 cores, HPL 33.863 PF, TOP500 #2, HPCG 0.5800 PF, 1.71%
3. Oakforest-PACS, JCAHPC, Japan: 557,056 cores, HPL 13.555 PF, TOP500 #6, HPCG 0.3855 PF, 2.84%
4. Sunway TaihuLight, National Supercomputing Center in Wuxi, China: 10,649,600 cores, HPL 93.015 PF, TOP500 #1, HPCG 0.3712 PF, 0.399%
5. Cori, DOE/SC/LBNL/NERSC, USA: 632,400 cores, HPL 13.832 PF, TOP500 #5, HPCG 0.3554 PF, 2.57%
6. Sequoia, DOE/NNSA/LLNL, USA: 1,572,864 cores, HPL 17.173 PF, TOP500 #4, HPCG 0.3304 PF, 1.92%
7. Titan, DOE/SC/Oak Ridge National Laboratory, USA: 560,640 cores, HPL 17.590 PF, TOP500 #3, HPCG 0.3223 PF, 1.83%
8. Trinity, DOE/NNSA/LANL/SNL, USA: 301,056 cores, HPL 8.101 PF, TOP500 #10, HPCG 0.1826 PF, 2.25%
9. Pleiades (SGI ICE X), NASA/Mountain View, USA: 243,008 cores, HPL 5.952 PF, TOP500 #13, HPCG 0.1752 PF, 2.94%
10. Mira (IBM BlueGene/Q), DOE/SC/Argonne National Laboratory, USA: 786,432 cores, HPL 8.587 PF, TOP500 #9, HPCG 0.1670 PF, 1.94%
Green500 Ranking (SC16, November 2016) — see http://www.top500.org/
1. DGX SATURNV (NVIDIA DGX-1, Xeon E5-2698v4 20C 2.2 GHz, InfiniBand EDR, NVIDIA Tesla P100), NVIDIA Corporation: HPL Rmax 3.307 PF, TOP500 #28, 0.350 MW, 9.462 GFLOPS/W
2. Piz Daint (Cray XC50, Xeon E5-2690v3 12C 2.6 GHz, Aries interconnect, NVIDIA Tesla P100), Swiss National Supercomputing Centre (CSCS): 9.779 PF, #8, 1.312 MW, 7.454 GFLOPS/W
3. Shoubu (ZettaScaler-1.6 etc.), RIKEN ACCS: 1.001 PF, #116, 0.150 MW, 6.674 GFLOPS/W
4. Sunway TaihuLight (Sunway MPP, Sunway SW26010 260C 1.45 GHz, Sunway), National SC Center in Wuxi: 93.01 PF, #1, 15.37 MW, 6.051 GFLOPS/W
5. QPACE3 (PRIMERGY CX1640 M1, Intel Xeon Phi 7210 64C 1.3 GHz, Intel Omni-Path), SFB/TR55 at Fujitsu Tech. Solutions GmbH: 0.447 PF, #375, 0.077 MW, 5.806 GFLOPS/W
6. Oakforest-PACS (PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4 GHz, Intel Omni-Path), JCAHPC: 13.555 PF, #6, 2.719 MW, 4.986 GFLOPS/W
7. Theta (Cray XC40, Intel Xeon Phi 7230 64C 1.3 GHz, Aries interconnect), DOE/SC/Argonne National Lab.: 5.096 PF, #18, 1.087 MW, 4.688 GFLOPS/W
8. XStream (Cray CS-Storm, Intel Xeon E5-2680v2 10C 2.8 GHz, InfiniBand FDR, Nvidia K80), Stanford Research Computing Center: 0.781 PF, #162, 0.190 MW, 4.112 GFLOPS/W
9. Camphor 2 (Cray XC40, Intel Xeon Phi 7250 68C 1.4 GHz, Aries interconnect), ACCMS, Kyoto University: 3.057 PF, #33, 0.748 MW, 4.087 GFLOPS/W
10. SciPhi XVI (KOI Cluster, Intel Xeon Phi 7230 64C 1.3 GHz, Intel Omni-Path), Jefferson Natl. Accel. Facility: 0.426 PF, #397, 0.111 MW, 3.837 GFLOPS/W
Software of Oakforest-PACS
• OS: Red Hat Enterprise Linux (login nodes); CentOS or McKernel (compute nodes, dynamically switchable)
– McKernel: an OS for many-core CPUs developed by RIKEN AICS; ultra-lightweight compared with Linux, with no background noise to user programs; expected to be installed on the post-K computer
• Compilers: GCC, Intel Compiler, XcalableMP
– XcalableMP: a parallel programming language developed by RIKEN AICS and the University of Tsukuba; makes it easy to develop high-performance parallel applications by adding directives to original code written in C or Fortran
• Applications: open-source software, e.g., ppOpen-HPC, OpenFOAM, ABINIT-MP, PHASE system, FrontFlow/blue, and so on
Applications on Oakforest-PACS
• ARTED: Ab-initio Real-Time Electron Dynamics simulator
• Lattice QCD: Quantum Chromodynamics
• NICAM & COCO: atmosphere & ocean coupling
• GAMERA/GOJIRA: earthquake simulations
• Seism3D: seismic wave propagation
Atmosphere-Ocean Coupling on OFP by NICAM/COCO/ppOpen-MATH/MP
• A high-resolution global atmosphere-ocean coupled simulation by NICAM and COCO (ocean simulation) through ppOpen-MATH/MP has been achieved on the K computer.
– ppOpen-MATH/MP is coupling software for models employing various discretization methods.
• An O(km)-mesh NICAM-COCO coupled simulation is planned on the Oakforest-PACS system.
– A big challenge for optimization of the codes on the new Intel Xeon Phi processor
– New insights for understanding global climate dynamics
[c/o M. Satoh (AORI/U.Tokyo) @ SC16]
Schedule for Operations
• NO charge until the end of March 2017
• There are no explicit partitions: all resources are shared in order to make full use of the 8,208 nodes / 25 PF
• Various types of programs for usage
– Each university's own programs
– Group Course (up to 2,048 nodes)
– Personal Course (U.Tokyo only)
– Industry Use (U.Tokyo only)
– International Collaboration (part of the Group Course)
– HPCI: 20% of the 8,208 nodes/yr. by “JCAHPC”, including industry use
– JHPCN (Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures): 5% of the 8,208 nodes/yr.; industry, international (U.Tokyo only)
– Education (tutorials, classes)
– Large-Scale HPC Challenge: full nodes, once per month for 24 hours, proposal-based
Greeting from Professor Olaf Schenk
• Institute of Computational Science, Università della Svizzera italiana, Lugano, Switzerland
• Program Director, SIAM Activity Group on Supercomputing (2016-2017)
• Developer of the PARDISO library (the direct solver in Intel MKL)
• KFLOPS/person for each country, based on the TOP500 list
• Overview of Information Technology Center (ITC), The University of Tokyo
– Reedbush: Integrated Supercomputer System for Data Analyses & Scientific Simulations; our first system with GPUs
– Oakforest-PACS: #1 system in Japan!
• Performance of the New Systems
– 3D Poisson Solver by ICCG
– Matrix Assembly for 3D FEM
Target Application: Poisson3D-OMP
• Finite Volume Method, Poisson equation (128³ cells)
– FDM-type mesh (7-pt. stencil), unstructured data structure
– SPD matrix
– ICCG: data dependency in the incomplete Cholesky factorization
• Fortran 90 + OpenMP
• Thread parallelization by OpenMP: reordering needed
– CM-RCM + coalesced/sequential numbering
• Storage of the matrix
– CRS, ELL (Ellpack-Itpack)
• Outer loops
– Row-wise, column-wise
[Figure: the 128³ finite-volume mesh; NX x NY x NZ cells along the x/y/z axes.]
Elimination of Data Dependency in ICCG
• Coloring + reordering
• More colors: better convergence, but larger synchronization overhead
– MC: good parallel performance, bad convergence
– RCM: good convergence, many colors
– CM-RCM(k): combinations of MC + RCM (a forward-substitution sketch follows the figure below)
[Figures: node orderings on an 8 x 8 grid for RCM (Reverse Cuthill-McKee), MC (multicoloring, color # = 4), and CM-RCM(4) (cyclic multicoloring + RCM).]
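To make the trade-off concrete, here is a minimal sketch of IC(0) forward substitution after coloring (our illustration, with hypothetical array names; rows are assumed renumbered so that each color occupies a contiguous index range). Rows of one color are mutually independent and can be threaded; the colors themselves run sequentially, so every additional color adds one synchronization point:

  ! Hedged sketch: forward substitution z = L^{-1} b after CM-RCM coloring.
  subroutine forward_sub_colored(n, ncolor, color_ptr, row_ptr, diag_ptr, &
                                 col_idx, val, b, z)
    implicit none
    integer, intent(in)  :: n, ncolor
    integer, intent(in)  :: color_ptr(ncolor+1), row_ptr(n+1), diag_ptr(n)
    integer, intent(in)  :: col_idx(*)
    real(8), intent(in)  :: val(*), b(n)
    real(8), intent(out) :: z(n)
    integer :: ic, i, j
    real(8) :: w
    do ic = 1, ncolor                             ! colors: sequential
  !$omp parallel do private(i, j, w)
      do i = color_ptr(ic), color_ptr(ic+1) - 1   ! rows of color ic: parallel
        w = b(i)
        do j = row_ptr(i), diag_ptr(i) - 1        ! strictly lower part of row i
          w = w - val(j) * z(col_idx(j))          ! refers only to earlier colors
        end do
        z(i) = w / val(diag_ptr(i))
      end do
  !$omp end parallel do                           ! implicit barrier per color
    end do
  end subroutine forward_sub_colored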
Storing Formats for Sparse Matrices
• ELL/Sliced ELL: excellent performance thanks to prefetching
• SELL-C-σ: good for vector/SIMD architectures (by FAU Erlangen)
– This work is the first example where SELL-C-σ is applied to IC/ILU preconditioning (an ELL kernel sketch follows below).
[Figure: CRS, ELL, Sliced ELL, and SELL-C-σ (SELL-2-8) storage layouts.]
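For contrast with the CRS kernel sketched earlier, a minimal ELL-format SpMV (again our illustration): every row is padded to a fixed width w, and making the ELL width the outer loop gives unit-stride, prefetch- and SIMD-friendly access over the rows; this corresponds to the “column-wise” outer-loop variant mentioned above.

  ! Hedged sketch: y = A*x with A in ELL format, column-wise outer loop.
  ! Rows shorter than w are padded with val = 0 and idx = i (harmless).
  subroutine spmv_ell(n, w, idx, val, x, y)
    implicit none
    integer, intent(in)  :: n, w, idx(n, w)
    real(8), intent(in)  :: val(n, w), x(n)
    real(8), intent(out) :: y(n)
    integer :: i, k
    y(1:n) = 0.0d0
    do k = 1, w          ! outer loop over the ELL width
  !$omp parallel do
      do i = 1, n        ! unit-stride inner loop over rows
        y(i) = y(i) + val(i, k) * x(idx(i, k))
      end do
  !$omp end parallel do
    end do
  end subroutine spmv_ell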
Simulation of Geologic CO2 Storage [Dr. Hajime Yamamoto, Taisei]
• 3D multiphase flow (liquid/gas) + 3D mass transfer
• TOUGH2-MP on Fujitsu FX10 (Oakleaf-FX), 30 million DOF (10 million grid nodes x 3 DOF per node): 2x-3x improvement
[Log-log plot: average time for solving the matrix for one time step (sec.) vs. number of processors, showing the 2-3x speedup on FX10.]
Code names and hardware (ICCG benchmark):
– KNC: Intel Xeon Phi 5110P (Knights Corner); 1.053 GHz; 60 cores (240 threads max); peak 1,010.9 GFLOPS; memory 8 GB; 159 GB/sec (Stream Triad); in-order
– KNL-0: Intel Xeon Phi 7210 (Knights Landing); 1.30 GHz; 64 cores (256 threads); peak 2,662.4 GFLOPS; MCDRAM 16 GB + DDR4 96 GB; MCDRAM 454 GB/sec, DDR4 72.5 GB/sec (Stream Triad); out-of-order; system: OFP-mini
– BDW: Intel Xeon E5-2695 v4 (Broadwell-EP); 2.10 GHz; 18 cores (18 threads); peak 604.8 GFLOPS; memory 128 GB; 65.5 GB/sec (Stream Triad); out-of-order; system: Reedbush-U
Comp. Time for ICCG (best ELL case)
• Effects of synchronization overhead are significant on KNL and KNC if the number of colors is large
• Generally, the optimum number of colors is around 10 for KNL/KNC
• Down is good !!
[Plot: ICCG computation time (sec., 0-6) vs. number of colors (1-100) for KNC, BDW, KNL-DDR4, and KNL-MCDRAM.]
Code names and hardware (GPU comparison):
– KNC: Intel Xeon Phi 5110P (Knights Corner); 1.053 GHz; 60 cores (240 threads max); peak 1,010.9 GFLOPS; 8 GB; 159 GB/sec (Stream Triad)
– KNL-0: Intel Xeon Phi 7210 (Knights Landing); 1.30 GHz; 64 cores (256 threads); peak 2,662.4 GFLOPS; MCDRAM 16 GB + DDR4 96 GB; MCDRAM 454 GB/sec, DDR4 72.5 GB/sec
– K40: NVIDIA Tesla K40; 0.745 GHz; 2,880 cores; peak 1,430 GFLOPS; 12 GB; 218 GB/sec
– P100: NVIDIA Tesla P100 (Pascal); 1.328 GHz; 1,792 cores; peak 4,759 GFLOPS; 16 GB; 530 GB/sec
Comp. Time for ICCG (best ELL case)
• Xeon Phi (KNL) and P100 are competitive
• Down is good !!
• Please contact Kengo Nakajima (nakajima(at)cc.u-tokyo.ac.jp) if you need the results on P100
• Overview of Information Technology Center (ITC), The University of Tokyo
– Reedbush: Integrated Supercomputer System for Data Analyses & Scientific Simulations; our first system with GPUs
– Oakforest-PACS: #1 system in Japan!
• Performance of the New Systems
– 3D Poisson Solver by ICCG
– Matrix Assembly for 3D FEM
Matrix Assembly in FEM
• Integration on each element: element matrix (dense)
• Accumulation of element matrices: global matrix (sparse)
[Figure: 3 x 3-element mesh with 16 nodes, showing element matrices accumulated into the global matrix.]
• Integration/matrix: per element
• Matrix computation: per node
– Components of each node: accumulation from neighboring elements
– Data dependency may happen
– Coloring needed: multicoloring (a sketch follows below)
• Hardware
– Intel Xeon Phi (KNC/KNL)
– NVIDIA Tesla K40/P100
• Atomic operations supported by H/W
[KN, Naruse et al. HPC-152, 2015]
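Below is a minimal sketch of the coloring-based assembly strategy (our illustration, with hypothetical arrays; the element matrix is simplified to one nodal contribution per element node). Elements of one color share no nodes, so their contributions can be accumulated in parallel without write conflicts:

  ! Hedged sketch: colored accumulation of element contributions into
  ! a node-based array; colors are processed sequentially.
  subroutine assemble_colored(nelem, nne, ncolor, elem_ptr, elem_node, &
                              elem_val, a)
    implicit none
    integer, intent(in)    :: nelem, nne, ncolor
    integer, intent(in)    :: elem_ptr(ncolor+1), elem_node(nne, nelem)
    real(8), intent(in)    :: elem_val(nne, nelem)
    real(8), intent(inout) :: a(*)
    integer :: ic, e, k
    do ic = 1, ncolor
  !$omp parallel do private(e, k)
      do e = elem_ptr(ic), elem_ptr(ic+1) - 1     ! elements of color ic
        do k = 1, nne                             ! nodes of element e
          a(elem_node(k, e)) = a(elem_node(k, e)) + elem_val(k, e)
        end do
      end do
  !$omp end parallel do
    end do
  end subroutine assemble_colored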
Target: GeoFEM/Cube
• Parallel FEM code (& benchmarks)
• 3D static linear elastic (solid mechanics)
• Performance of parallel preconditioned iterative solvers
– 3D tri-linear hexahedral elements
– SPD matrices: CG solver
– Fortran90 + MPI + OpenMP
– Distributed data structure
– Reordering for IC/SGS preconditioning: MC, RCM, CM-RCM
– MPI, OpenMP, OpenMP/MPI hybrid (only the OpenMP case in this study)
• Focusing on matrix assembly in this study: 128³ cubes
[Figure: (Nx-1) x (Ny-1) x (Nz-1) elements, Nx x Ny x Nz nodes; boundary conditions: Ux=0 @ x=Xmin, Uy=0 @ y=Ymin, Uz=0 @ z=Zmin; uniform distributed force in the z-direction @ z=Zmax.]
Code names and hardware (matrix assembly benchmark):
– KNC: Intel Xeon Phi 5110P (Knights Corner); 1.053 GHz; 60 cores (240 threads max); peak 1,010.9 GFLOPS; 8 GB; 159 GB/sec (Stream Triad)
– KNL-1 (real OFP hardware): Intel Xeon Phi 7250 (Knights Landing); 1.40 GHz; 68 cores (272 threads); peak 3,046.4 GFLOPS; MCDRAM 16 GB + DDR4 96 GB; MCDRAM 490 GB/sec, DDR4 84.5 GB/sec
– K40: NVIDIA Tesla K40; 0.745 GHz; 2,880 cores; peak 1,430 GFLOPS; 12 GB; 218 GB/sec
– P100: NVIDIA Tesla P100 (Pascal); 1.328 GHz; 1,792 cores; peak 4,759 GFLOPS; 16 GB; 530 GB/sec
Results (fastest case), sec. (GFLOPS):
– KNC: 0.690 (94.0) with Coloring + OpenMP
– KNL-1: 0.304 (213.4) with Coloring + OpenMP
– K40: 0.675 (96.2) with Coloring + OpenACC; 0.507 (128.1) with Atomic + OpenACC; 0.362 (179.4) with Atomic + CUDA
– P100: please contact Kengo Nakajima (nakajima(at)cc.u-tokyo.ac.jp) if you need the P100 results
We need further improvement on KNL.
Why is “Atomic” faster? [Naruse]
• 4 colors for a 2D problem: elements of the same color are computed in parallel
• NO cache reuse: each node is “touched” only once during the operations for a given color; this cache-miss effect is larger than the overhead of the atomic operations
• Not a problem on traditional vector processors
[KN, Naruse et al. HPC-152, 2015]
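For comparison, an atomic-based sketch of the same accumulation (our illustration; !$acc atomic update is the standard OpenACC construct): all elements are swept in one parallel pass and conflicting updates are serialized by the hardware, which on the GPUs measured here turned out cheaper than the cache misses caused by color-wise sweeps.

  ! Hedged sketch: atomic accumulation, no coloring (OpenACC).
  subroutine assemble_atomic(nelem, nne, nnode, elem_node, elem_val, a)
    implicit none
    integer, intent(in)    :: nelem, nne, nnode
    integer, intent(in)    :: elem_node(nne, nelem)
    real(8), intent(in)    :: elem_val(nne, nelem)
    real(8), intent(inout) :: a(nnode)
    integer :: e, k
  !$acc parallel loop copyin(elem_node, elem_val) copy(a)
    do e = 1, nelem
      do k = 1, nne
  !$acc atomic update
        a(elem_node(k, e)) = a(elem_node(k, e)) + elem_val(k, e)
      end do
    end do
  !$acc end parallel loop
  end subroutine assemble_atomic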
Summary
• Overview of Information Technology Center (ITC), The University of Tokyo
– Reedbush: Integrated Supercomputer System for Data Analyses & Scientific Simulations, our first system with GPUs
– Oakforest-PACS: #1 system in Japan!
• Performance of the New Systems
– 3D Poisson Solver by ICCG
– Matrix Assembly for 3D FEM
16th SIAM Conference on Parallel Processing for Scientific Computing (PP18), March 7-10, 2018, Tokyo, Japan
• Venue
– Nishiwaseda Campus, Waseda University (in Tokyo)
• Organizing Committee Co-Chairs
– Satoshi Matsuoka (Tokyo Institute of Technology, Japan)
– Kengo Nakajima (The University of Tokyo, Japan)
– Olaf Schenk (Università della Svizzera italiana, Switzerland)
• Contact
– Kengo Nakajima, nakajima(at)cc.u-tokyo.ac.jp
– http://nkl.cc.u-tokyo.ac.jp/SIAMPP18/
Waseda University, Tokyo, Japan: northwest of the center of Tokyo
[Map: Waseda University relative to the Imperial Palace, the University of Tokyo (Hongo), Shinjuku, Ginza, Akihabara, Ueno, Shibuya, the Olympic Stadium (1964 & 2020), and Tokyo Central Station.]