TRANSCRIPT
-
Japanese HPC Update: Exascale Research and the Next-Generation Flagship Supercomputer
Naoya Maruyama (RIKEN Advanced Institute for Computational Science)
3rd Workshop on Extreme-Scale Programming Tools @ SC14
Nov 17, 2014
-
Acknowledgments
Yutaka Ishikawa (RIKEN AICS)
Mitsuhisa Sato (University of Tsukuba)
Satoshi Matsuoka (Tokyo Institute of Technology)
Toshio Endo (Tokyo Institute of Technology)
-
Towards the Next Flagship Machine & Beyond
[Timeline chart, 2008-2020 (PFLOPS scale): T2K (U. Tsukuba, U. Tokyo, Kyoto U.) and TSUBAME2.0 (Tokyo Tech.)]
-
Japan's High Performance Computing Infrastructure (HPCI)
HPCI: a nation-wide HPC infrastructure
- Supercomputers: ~25 PFlops (2013)
- National storage: 22 PB HDDs + tape
- Research network (SINET4): 40+10 Gbps
- Single sign-on (HPCI-ID), distributed file system (Gfarm)
- National HPCI allocation process
[Map of HPCI sites: Hokkaido U.; Tohoku U.; U. Tsukuba (1 PF); U. Tokyo (supercomputers 1.1 PF, HPCI storage 12 PB); Tokyo Tech TSUBAME2.5 (5.7 Petaflops); Nagoya U.; Kyoto U.; Osaka U.; Kyushu U. (1.7 PF); JAMSTEC; RIKEN AICS (K computer, 11 Petaflops, HPCI storage 10 PB); NII (management of SINET and single sign-on)]
-
Site of AICS and K computer
Kobe: 423 km (263 miles) west of Tokyo, near Kobe Airport
[Aerial photo of the AICS site: Research Building, Computer Building, Chillers, Substation/Supply; map showing Tokyo and Kobe]
-
Advanced Institute for Computational Science (AICS)
- Foundation: June 2010
- Missions:
  - Operation of the K computer for research, including industry applications
  - Leading-edge research through strong collaborations between computer and computational scientists
  - Development of Japan's future strategy for computational science, including the path to exa-scale computing
- Personnel: 222 (1 April 2014)
[Organization chart: Director, Deputy Director, Administration Division, Operations & Computer Technology Division, Research Division (16 teams and 3 units), Exascale Project]
-
AICS Research Teams
Promoting strong collaborations between computer scientists and computational scientists.

Computational Science Research Teams: provide a shared infrastructure to support a wide range of fields in making sophisticated use of the K computer, by developing methodologies required by computational science.
- Particle Physics (Y. Kuramashi)
- Astrophysics (J. Makino)
- Solid State Physics (S. Yunoki)
- Quantum Chemistry (T. Nakajima)
- Computational Chemistry (K. Hirao)
- Biophysics (Y. Sugita)
- Drug Design (F. Tama)
- Earth Science (M. Hori)
- Climate Science (H. Tomita)
- Engineering (M. Tsubokura)
- Discrete Event Simulation (N. Ito)

Computer Science Research Teams: solve issues surrounding the K computer through research in major elemental computer technologies.
- Processor (M. Taiji)
- System Software (Y. Ishikawa)
- Programming Environment (M. Sato)
- Large-scale Parallel Numerical Computing Technology (T. Imamura)
- HPC Usability (T. Maeda)
- HPC Programming Framework (N. Maruyama)
- Advanced Visualization (K. Ono)
- Data Assimilation (T. Miyoshi)
-
K computer Overview
Node: CPU x1, ICC x1, DRAM; 128 GFLOPS, 16 GB
System board: node x4; 512 GFLOPS, 64 GB
Compute rack: system board x24, IO system board x6; 12.3 TFLOPS, 1.5 TB
Rack section: compute rack x8, disk rack x2; 98.4 TFLOPS, 12 TB
Whole system: compute rack x864; 10.62 PFLOPS, 1.2 PB memory
Courtesy of: Fujitsu
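As a consistency check, the peak figures above compose multiplicatively: 128 GFLOPS/node x 4 nodes/board x 24 boards/rack x 864 racks = 10,616,832 GFLOPS, i.e. roughly the 10.62 PFLOPS quoted for the whole system.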
-
SPARC64 VIIIfx Specification
- Performance: 128 GFLOPS (16 GFLOPS x 8 cores)
- Number of cores: 8
- Clock frequency: 2.0 GHz
- FP units: FMA x4 (2 SIMD), DIV x2
- Registers: FP registers (64 bit): 256; GP registers (64 bit): 188
- Cache: L1I$ 32 KB (2-way), L1D$ 32 KB (2-way), L2$ shared 6 MB (12-way)
- Memory BW: 64 GB/s (0.5 B/F)
- 45 nm CMOS process; chip size 22.7 mm x 22.6 mm; transistor count 760M; power 58 W

Vendor | Name | Cores | Process rule (nm) | Peak performance (GFLOPS) | Cache (MB) | Power (W) | GF/W | System (w/ planned)
IBM | PowerPC A2 | 16 | 45 | 204.80 | 32 | 55 | 3.72 | Sequoia (BlueGene/Q)
Intel | E3-1260L | 4 | 32 | 105.60 | 8 | 45 | 2.35 |
Fujitsu | SPARC64 VIIIfx | 8 | 45 | 128.00 | 6 | 58 | 2.21 | K computer
IBM | Power7 | 8 | 45 | 256.00 | 32 | 200 | 1.28 |
AMD | Opteron 6172 | 12 | 45 | 100.80 | 12 | 80 | 1.26 | XE6, etc.
Intel | Xeon X5670 | 6 | 32 | 79.92 | 12 | 95 | 0.84 | TSUBAME2.0, etc.

Courtesy of: Fujitsu
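The byte-per-flop and efficiency figures follow directly from the specification above: 64 GB/s / 128 GFLOPS = 0.5 B/F, and 128 GFLOPS / 58 W is approximately 2.21 GFLOPS/W, the GF/W value listed for the K computer's chip in the comparison table.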
-
Tofu Interconnect
[Diagram of the Tofu mesh/torus interconnect: links of 5 GB/s per direction; 64 GB/s memory bandwidth per node]
Courtesy of: Fujitsu
-
Graph500 performance
- Measures the edge-traversing speed of a large graph
- Performance: #vertices: 2^40; #edges: 2^44; #nodes used: 65,536; time: 0.98 sec; speed: 17,977 GTEPS (*TEPS: Traversed Edges Per Second)
- Work done by K. Ueno (Tokyo Tech & RIKEN), et al.
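As a rough consistency check (assuming the traversed-edge count is close to the 2^44 input edges), 2^44 edges / 0.98 s is approximately 1.8 x 10^13 TEPS, i.e. about 18,000 GTEPS, in line with the reported 17,977 GTEPS.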
-
HPCG performance
- Measures the floating-point speed of a preconditioned conjugate gradient (PCG) solve with a sparse matrix
- Performance on K: matrix dimension 1.74x10^11; #nodes used: 82,944; run time: 3,600 sec; speed: 0.427 PFLOPS; ratio to peak: 4.02%
- Work done by K. Kumahata and K. Minami (RIKEN)

HPL and HPCG results:
Site | Computer | Cores | HPL Rmax (Pflops) | HPL Rank | HPCG (Pflops) | HPCG/HPL
NSCC / Guangzhou | Tianhe-2, NUDT Xeon 12C 2.2GHz + Intel Xeon Phi 57C + Custom | 3,120,000 | 33.9 | 1 | 0.580 | 1.7%
RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx 8C + Custom | 705,024 | 10.5 | 4 | 0.427 | 4.1%
DOE/OS Oak Ridge Nat Lab | Titan, Cray XK7, AMD 16C + Nvidia Kepler GPU 14C + Custom | 560,640 | 17.6 | 2 | 0.322 | 1.8%
DOE/OS Argonne Nat Lab | Mira, BlueGene/Q, Power BQC 16C 1.60GHz + Custom | 786,432 | 8.59 | 5 | 0.101 # | 1.2%
Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler 14C + Custom | 115,984 | 6.27 | 6 | 0.099 | 1.6%
Leibniz Rechenzentrum | SuperMUC, Intel 8C + IB | 147,456 | 2.90 | 12 | 0.0833 | 2.9%
CEA/TGCC-GENCI | Curie thin nodes, Bullx B510, Intel Xeon 8C 2.7 GHz + IB | 79,504 | 1.36 | 26 | 0.0491 | 3.6%
Exploration and Production Eni S.p.A. | HPC2, Intel Xeon 10C 2.8 GHz + Nvidia Kepler 14C + IB | 62,640 | 3.00 | 11 | 0.0489 | 1.6%
DOE/OS Lawrence Berkeley Nat Lab | Edison, Cray XC30, Intel Xeon 12C 2.4GHz + Custom | 132,840 | 1.65 | 18 | 0.0439 # | 2.7%
Texas Advanced Computing Center | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | 78,848 | 0.881* | 7 | 0.0161 | 1.8%
Meteo France | Beaufix, Bullx B710, Intel Xeon 12C 2.7 GHz + IB | 24,192 | 0.469 (0.467*) | 79 | 0.0110 | 2.4%
Meteo France | Prolix, Bullx B710, Intel Xeon 2.7 GHz 12C + IB | 23,760 | 0.464 (0.415*) | 80 | 0.00998 | 2.4%
U of Toulouse | CALMIP, Bullx DLC, Intel Xeon 10C 2.8 GHz + IB | 12,240 | 0.255 | 184 | 0.00725 | 2.8%
Cambridge U | Wilkes, Intel Xeon 6C 2.6 GHz + Nvidia Kepler 14C + IB | 3,584 | 0.240 | 201 | 0.00385 | 1.6%
TiTech | TSUBAME-KFC, Intel Xeon 6C 2.1 GHz + IB | 2,720 | 0.150 | 436 | 0.00370 | 2.5%
* scaled to reflect the same number of cores; # unoptimized implementation
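For reference, the reported 0.427 PFLOPS against the K computer's 10.62 PFLOPS peak gives the quoted 4.02%. The kernel HPCG times is a preconditioned conjugate-gradient iteration over a sparse matrix; below is a minimal, illustrative CG sketch in C on a CSR matrix. The 1D Laplacian test problem, the spmv routine, and all sizes are assumptions for illustration only, not the HPCG benchmark code.

/* Minimal CG sketch on a CSR sparse matrix (illustrative only; not the HPCG code). */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

typedef struct { int n; int *rowptr; int *col; double *val; } csr_t;

static void spmv(const csr_t *A, const double *x, double *y) {   /* y = A*x */
    for (int i = 0; i < A->n; i++) {
        double s = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i+1]; k++)
            s += A->val[k] * x[A->col[k]];
        y[i] = s;
    }
}

static double dot(int n, const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Unpreconditioned CG: solve A x = b starting from x = 0. */
static int cg(const csr_t *A, const double *b, double *x, int maxit, double tol) {
    int n = A->n, it;
    double *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p), *Ap = malloc(n * sizeof *Ap);
    for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
    double rr = dot(n, r, r);
    for (it = 0; it < maxit && sqrt(rr) > tol; it++) {
        spmv(A, p, Ap);
        double alpha = rr / dot(n, p, Ap);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(n, r, r), beta = rr_new / rr;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    free(r); free(p); free(Ap);
    return it;
}

int main(void) {
    int n = 100, nnz = 3 * n - 2, k = 0;            /* 1D Laplacian as a stand-in problem */
    csr_t A = { n, malloc((n + 1) * sizeof(int)), malloc(nnz * sizeof(int)), malloc(nnz * sizeof(double)) };
    for (int i = 0; i < n; i++) {
        A.rowptr[i] = k;
        if (i > 0) { A.col[k] = i - 1; A.val[k++] = -1.0; }
        A.col[k] = i; A.val[k++] = 2.0;
        if (i < n - 1) { A.col[k] = i + 1; A.val[k++] = -1.0; }
    }
    A.rowptr[n] = k;
    double *b = malloc(n * sizeof *b), *x = malloc(n * sizeof *x);
    for (int i = 0; i < n; i++) b[i] = 1.0;
    int iters = cg(&A, b, x, 1000, 1e-8);
    printf("CG converged in %d iterations, x[0] = %f\n", iters, x[0]);
    return 0;
}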
-
Operation of K computer (I)
Statistics for JFY2013 (April 2013 to March 2014):
- In operation: 94.7%
- Scheduled maintenance: 4.0%
- System failure (unscheduled downtime): 1.2% (= 4.38 days)
- Weekly node utilization rate kept at around 80%
[Chart: weekly node utilization rate (0-100%) over JFY2013]
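(For scale: 1.2% of the year is 0.012 x 365 days = 4.38 days, the unscheduled downtime quoted above.)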
-
Operation of K computer (II)
[Chart: job size (#nodes) per month in JFY2013, binned as 1-1024, 1025-2048, 2049-4096, 4097-10000, and 10001-82944 nodes]
-
NICAM global climate simulation
- Previous NICAM simulation with 3.5 km resolution on the Earth Simulator, starting from August 25, 2012
- Quite accurate, but not able to resolve individual cumulonimbus clouds
Visualized by Dr. Ryuji Yoshida, Computational Climate Science Research Team, AICS, RIKEN
Joint research of JAMSTEC, AORI, the University of Tokyo (HPCI Strategic Program 3), and AICS, RIKEN
-
Zooming in on Typhoon #15
Using K computer: 1st simulation with [...]
Visualized by Dr. Ryuji Yoshida, Computational Climate Science Research Team, AICS, RIKEN
Joint research of JAMSTEC, AORI, the University of Tokyo (HPCI Strategic Program 3), and AICS, RIKEN
-
3.11 Tohoku Earthquake & Tsunami
Coupled calculation of:
- Earthquake
- Crustal deformation
- Tsunami
This enables:
- Direct comparison with observed records
- Planning countermeasures against complex disasters involving multiple elements
Maeda et al., 2013, Bull. Seism. Soc. Am.
#Nodes: 2,304; Time: 15 hours (now reduced to 3 hours)
-
Gordon Bell Finalist: 3D Fast Scalable FE-based Seismic Simulation
Algorithm of GAMERA's linear solver:
- Outer loop: CG loop on the equation to be solved (second-order tetrahedral mesh, double precision), with a preconditioning equation solved at each iteration
- Inner coarse loop: solve the preconditioning equation roughly using a CG solver (linear tetrahedral mesh, single precision); the result is used as the initial solution for the inner fine loop
- Inner fine loop: solve the preconditioning equation roughly using a CG solver (second-order tetrahedral mesh, single precision); the result is used as the preconditioner of the outer loop
[Charts: weak scaling of GAMERA running on the K computer; efficacy of the algorithm when using the whole K computer]
"Physics-based urban earthquake simulation enhanced by 10.7 BlnDOF x 30 K time-step unstructured FE non-linear seismic wave simulation", Tsuyoshi Ichimura, Kohei Fujita, Seizo Tanaka, Muneo Hori, Maddegedara Lalith, Yoshihisa Shizawa, and Hiroshi Kobayashi, SC14 Gordon Bell Finalist
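To make the nesting concrete, here is a minimal structural sketch in C of the idea described above: an outer double-precision CG whose preconditioning equation is solved approximately by a few single-precision CG iterations. This only illustrates the control and precision structure; GAMERA's actual solver additionally moves between second-order and linear tetrahedral meshes, runs distributed over MPI, and uses tuned kernels, none of which is shown. The matrix and sizes below are assumptions.

/* Structural sketch only (not GAMERA): outer double-precision PCG whose
 * preconditioning equation is solved by a few single-precision CG iterations.
 * GAMERA's inner solves also move to a coarser, linear-element mesh; here the
 * same small 1D Laplacian is reused for both levels for simplicity. */
#include <stdio.h>
#include <math.h>

#define N 64
#define INNER_ITERS 8                      /* fixed, inexact inner solve */

static void spmv_d(const double *x, double *y) {   /* y = A*x, double precision */
    for (int i = 0; i < N; i++)
        y[i] = 2.0 * x[i] - (i > 0 ? x[i-1] : 0.0) - (i < N-1 ? x[i+1] : 0.0);
}
static void spmv_s(const float *x, float *y) {     /* y = A*x, single precision */
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] - (i > 0 ? x[i-1] : 0.0f) - (i < N-1 ? x[i+1] : 0.0f);
}

/* Inner loop: a few single-precision CG iterations, returning z ~= A^{-1} r. */
static void inner_cg(const double *r, double *z) {
    float x[N] = {0}, res[N], p[N], Ap[N], rr = 0.0f;
    for (int i = 0; i < N; i++) { res[i] = (float)r[i]; p[i] = res[i]; rr += res[i] * res[i]; }
    for (int it = 0; it < INNER_ITERS && rr > 0.0f; it++) {
        spmv_s(p, Ap);
        float pAp = 0.0f;
        for (int i = 0; i < N; i++) pAp += p[i] * Ap[i];
        float alpha = rr / pAp, rr2 = 0.0f;
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; res[i] -= alpha * Ap[i]; rr2 += res[i] * res[i]; }
        float beta = rr2 / rr; rr = rr2;
        for (int i = 0; i < N; i++) p[i] = res[i] + beta * p[i];
    }
    for (int i = 0; i < N; i++) z[i] = (double)x[i];
}

/* Outer loop: double-precision CG; the preconditioning equation z = M^{-1} r
 * is solved approximately by inner_cg at every iteration. */
int main(void) {
    double b[N], x[N] = {0}, r[N], z[N], p[N], Ap[N], rz = 0.0;
    for (int i = 0; i < N; i++) { b[i] = 1.0; r[i] = b[i]; }
    inner_cg(r, z);
    for (int i = 0; i < N; i++) { p[i] = z[i]; rz += r[i] * z[i]; }
    int it = 0;
    for (; it < 1000; it++) {
        double rr = 0.0;
        for (int i = 0; i < N; i++) rr += r[i] * r[i];
        if (sqrt(rr) < 1e-8) break;
        spmv_d(p, Ap);
        double pAp = 0.0;
        for (int i = 0; i < N; i++) pAp += p[i] * Ap[i];
        double alpha = rz / pAp;
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        inner_cg(r, z);                    /* solve the preconditioning equation */
        double rz2 = 0.0;
        for (int i = 0; i < N; i++) rz2 += r[i] * z[i];
        double beta = rz2 / rz; rz = rz2;
        for (int i = 0; i < N; i++) p[i] = z[i] + beta * p[i];
    }
    printf("outer PCG converged in %d iterations, x[0] = %f\n", it, x[0]);
    return 0;
}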
-
Mass evacuation simulation based on a multi-agent model
- Mass evacuation simulation (200,000 agents, real city in Japan)
- Helps formulate an effective evacuation plan and precautions against tsunamis and earthquakes
- Integration of the seismic response of structures into disaster simulations is under way
Courtesy of [...], 2013
#Nodes: 2,000; Time: 1 hour
-
Turbulent flow in the solar global convection
- Crucial for understanding the formation of the magnetic field and sunspots
- High-resolution (5x10^8 to 30x10^8 mesh) calculation on K with a new algorithm (Reduced Speed of Sound Technique)
- Successfully resolved the structure of the turbulent flow; a first step toward understanding the solar global convection and sunspots (11-year cycle)
[Figure: radiative zone, convective zone, core zone, energy flow]
#Nodes: 3,072; Time: 30 hours
H. Hotta, T. Yokoyama (U. Tokyo), M. Rempel (HAO, USA), Astrophysical Journal, 2014
Press release (in Japanese), Apr 11, 2014: http://www.s.u-tokyo.ac.jp/ja/press/2014/15.html
Movie courtesy of Dr. Hideyuki Hotta: http://www-space.eps.s.u-tokyo.ac.jp/~hotta/movie/conv_spe.html
-
[Slide background: reproduced page (including Figure 5, supercells and projected density of states from DFT-MD simulations of dilute and superconcentrated LiTFSA/AN solutions) from the cited article, J. Am. Chem. Soc. 2014, 136, 5039-5046, dx.doi.org/10.1021/ja412807w]
Super-concentrated electrolyte for Lithium-Ion Batteries
- A functional and stable electrolyte is a key to high performance of LIBs
- First-principles molecular dynamics simulations show that a super-concentrated electrolyte has:
  - a special network of anions and solvents, which leads to
  - remarkably fast Li-ion transport (1/3 charging time)
  - high stability against reduction
Yoshitaka Tateyama (NIMS), Keitaro Sodeyama (Kyoto Univ. & NIMS), J. Am. Chem. Soc., 2014
Press release, March 24, 2014: http://www.t.u-tokyo.ac.jp/etpage/release/2014/2014032401.html
#Nodes: 1,536; Time: 500 hours
-
Industrial usage
[Chart: accumulated number of trial-use projects, Sep. 2012 to Mar. 2014; participating companies include JSR, JFE, IHI, and CAE vendors]
- OSS used: OpenFOAM, LAMMPS, OCTA/COGNAC 8.3, SUSHI 9.1, FrontISTR, FrontFlow/blue, REVOCAP_Coupler
- ISV software used: Poynting, VASP, CzeekS, VSOP
Source: RIST report at HPCIC, May 2014
Area of industry | Representative firms | Typical usage
Pharmaceutical | Dainippon Sumitomo Pharma, Daiichi Sankyo | Drug design
Chemical | Sumitomo Chemical, Bridgestone | New material development; tire tread pattern design
Construction | Shimizu Corporation, Takenaka Corporation | Wind pressure on buildings; building structural design
Automobile / manufacturing | Toyota, Kawasaki Heavy Industries | Engine combustion; efficiency of turbine generators
IT / software | Mizuho Information & Research Institute | Software development and consulting service
-
Towards the Next Flagship Machine & Beyond
[Timeline chart, 2008-2020 (PFLOPS scale): T2K (U. Tsukuba, U. Tokyo, Kyoto U.) and TSUBAME2.0 (Tokyo Tech.)]
-
TSUBAME2.0, Nov. 1, 2010: The Greenest Production Supercomputer in the World
TSUBAME 2.0 new development (32 nm / 40 nm silicon)
[System hierarchy diagram: >400 GB/s mem BW, 80 Gbps NW BW, ~1 kW max per node; >1.6 TB/s mem BW; >12 TB/s mem BW, 35 kW max; >600 TB/s mem BW, 220 Tbps NW bisection BW, 1.4 MW max for the full system]
-
TSUBAME2.0 Awards
- Greenest Production Supercomputer in the World: the Green500 (#3 overall), Nov. 2010 and June 2011 (#4 Top500 Nov. 2010)
- ACM Gordon Bell Prize 2011: 2.0 Petaflops dendrite simulation
-
TSUBAME-KFC (Kepler Fluid Cooling)
A TSUBAME3.0 prototype system with advanced next-gen cooling: 40 compute nodes are oil-submerged in 1,200 liters of oil (Exxon PAO, ~1 ton). #1 on the Nov. 2013 Green500!
- Single node: 5.26 TFLOPS DFP
- System (40 nodes): 210.61 TFLOPS DFP, 630 TFLOPS SFP
- Storage (3 SSDs/node): 1.2 TB of SSD per node, 50 TB total, ~50 GB/s BW
-
#1 in Green500 List (Nov. 2013)
- First such achievement by a Japanese supercomputer (TSUBAME-KFC)
- #1 again in June 2014
- TSUBAME 2.5 is also ranked #6
-
Beyond TSUBAME-KFC: GoldenBox Proto 1 (NVIDIA K1-based)
To be shown at SC14, Tokyo Tech. booth
- 36-node Tegra K1, 11 TFLOPS SFP
- ~700 GB/s memory BW
- ~350 Watts
- Integrated mSATA SSD, ~7 GB/s I/O
- Ultra-dense, oil-immersive cooling
- Same SW stack as TSUBAME
2022: x10 FLOPS, x10 memory bandwidth, silicon photonics, x10 NVM, x10 node density
-
Towards the Next Flagship Machine & Beyond
[Timeline chart, 2008-2020 (PFLOPS scale): T2K (U. Tsukuba, U. Tokyo, Kyoto U.), TSUBAME2.0 and TSUBAME3.0 (Tokyo Tech.), Post-T2K (9 universities and national laboratories), and Flagship 2020 / Post-K (RIKEN), leading toward a future exascale system]
- The Flagship 2020 project: the next national flagship system for 2020
- Alternative leading machines
- Co-design is a primary key; some academic-led R&D (esp. system SW and overall architecture)
- International collaboration; new targets, e.g. power, big data, etc.
-
Feasibility Studies on Future HPC R&D in Japan (FY2012-2013)
1 application study team and 3 system study teams, working in co-design.
Application study team (RIKEN AICS and TITECH, in collaboration with application fields):
- Identification of scientific and social issues to be solved in the future
- Drawing a science roadmap until 2020
- Selection of the applications that play key roles in the roadmap
- Review of the architectures using those applications
System study teams (Tohoku Univ. and NEC; U. of Tsukuba, Titech, and Hitachi; U. of Tokyo, Kyushu U., Fujitsu, Hitachi, and NEC):
- Design of computer systems solving scientific and social issues
- Identification of R&D issues to realize the systems
- Review of the systems using the application codes
- Estimation of the systems' cost
Team leaders: Hirofumi Tomita, Satoshi Matsuoka, Mitsuhisa Sato, Yutaka Ishikawa, Hiroaki Kobayashi
-
Post-K Computer (Flagship 2020)
Codesign: Compute Node
- CPU: many-core, with an integrated interconnect interface and a power-knob feature
- Interconnect: Tofu (mesh/torus network)
Features under codesign:
- FP performance
- Memory hierarchy: control, capacity, and bandwidth
- Network performance
- I/O performance
[System diagram: compute nodes, I/O network, hierarchical storage system, login servers, maintenance servers, portal servers]
-
Current status of the post-K project
- The project is in the procurement process for the development of the post-K computer system. Fujitsu was selected as the vendor partner.
- In the specification of the RFP, the constraints are:
  - Power capacity (about 30 MW)
  - Space for system installation (in the Kobe AICS building)
  - Budget for development (NRE) and production
  - Some degree of compatibility with the current K computer
- The system should be designed to maximize the performance of applications in each computational science field.
- To be installed in 2018-2019, becoming operational in 2020.
-
Co-design elements in HPC systems
Hardware/architecture:
- Node architecture (#cores, #SIMD, etc.)
- Cache (size and bandwidth)
- Network (topologies, latency, and bandwidth)
- Memory technologies (HBM and HMC, ...)
- Specialized hardware
- #nodes
- Storage, file systems
- ... system configurations
System software:
- Operating system for many-core architectures
- Communication library (low-level layer, MPI, PGAS)
- Programming models and languages
- DSLs, ...
Algorithms and math libraries:
- Dense and sparse solvers
- Eigensolvers
- ... domain-specific libraries and frameworks
Applications
-
Linux + McKernel
Concerns:
- Reducing memory contention
- Reducing data movement among cores
- Providing new memory management
- Providing fast communication
- Parallelizing OS functions while achieving less data movement
- New OS mechanisms and APIs are created (revolutionarily/evolutionarily), examined, and selected
Approach: Linux with a lightweight micro kernel
- IHK (Interface for Heterogeneous Kernels): loads a kernel onto cores and handles communication between Linux and that kernel
- McKernel: a customizable OS environment, e.g. an environment without a CPU scheduler (without timer interrupts)
[Diagram: the Linux kernel and daemons run on some cores while McKernel and user processes run on the others, connected through the Interface for Heterogeneous Kernels; system calls are routed either to the lightweight kernel or to Linux]
Running on both Xeon and Xeon Phi environments.
IHK and McKernel have been developed at the University of Tokyo and RIKEN with Hitachi, NEC, and Fujitsu.
-
XcalableMP (XMP)  http://www.xcalablemp.org
What is XcalableMP (XMP for short)?
- A PGAS programming model and language for distributed memory, proposed by the XMP Spec WG.
- The XMP Spec WG is a special interest group that designs and drafts the specification of the XcalableMP language. It is now organized under the PC Cluster Consortium, Japan. Mainly active in Japan, but open to everybody.
Project status (as of Nov. 2013):
- XMP Spec Version 1.2 is available at the XMP site. New features: mixed OpenMP and OpenACC, libraries for collective communications.
- Reference implementation by U. Tsukuba and RIKEN AICS: Version 0.7 (C and Fortran 90) is available for PC clusters, Cray XT, and the K computer. It is a source-to-source compiler that translates to code calling the runtime on top of MPI and GASNet.
[Chart: possibility of performance tuning vs. programming cost, positioning MPI, automatic parallelization, PGAS, HPF, Chapel, and XcalableMP]
Code example: the directives are added to the serial code (incremental parallelization).

int array[YMAX][XMAX];

/* data distribution */
#pragma xmp nodes p(4)
#pragma xmp template t(YMAX)
#pragma xmp distribute t(block) on p
#pragma xmp align array[i][*] with t(i)

main(){
  int i, j, res;
  res = 0;
/* work sharing and data synchronization */
#pragma xmp loop on t(i) reduction(+:res)
  for(i = 0; i < 10; i++)
    for(j = 0; j < 10; j++){
      array[i][j] = func(i, j);
      res += array[i][j];
    }
}
Language features:
- Directive-based language extensions for Fortran and C for the PGAS model.
- Global-view programming with global-view distributed data structures for data parallelism:
  - SPMD execution model, as in MPI
  - Pragmas for data distribution of global arrays
  - Work-mapping constructs to map work and iterations with affinity to data explicitly
  - Rich communication and sync directives such as gmove and shadow
  - Many concepts are inherited from HPF
- The co-array feature of CAF is adopted as part of the language spec for local-view programming (also defined in C).
XMP provides a global view for data-parallel programs in the PGAS model.
-
CREST: Development of System Software Technologies for post-Peta Scale High Performance Computing (2010-2018)
Objectives:
- Co-design of system software with applications and post-peta scale computer architectures
- Development of deliverable software pieces
Research Supervisor: Akinori Yonezawa, Deputy Director of RIKEN AICS
Run by JST (Japan Science and Technology Agency)
Budget and formation (2010 to 2018):
- About 60M$ in total (47M$ at the normal rate)
- Round 1: from 2010, for 5.5 years
- Round 2: from 2011, for 5.5 years
- Round 3: from 2012, for 5.5 years
http://www.postpeta.jst.go.jp/en/
-
Overview of PPC CREST (slide 1 of 3)
CREST: Development of System Software Technologies for post-Peta Scale High Performance Computing
[Timeline 2013-2017: Round 1, 5 teams; Round 2, 5 teams; Round 3, 4 teams]
- Taisuke Boku, U. of Tsukuba: Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era
- Atsushi Hori, RIKEN AICS: Parallel System Software for Multi-core and Many-core
- Toshio Endo, Tokyo Tech.: Software Technology that Deals with Deeper Memory Hierarchy in Post-petascale Era
- Takeshi Nanri, Kyushu University: Development of Scalable Communication Library with Technologies for Memory Saving and Runtime Optimization
- Osamu Tatebe, U. of Tsukuba: System Software for Post Petascale Data Intensive Science
- Masaaki Kondo, U. of Electro-Communications: Power Management Framework for Post-Petascale Supercomputers
-
Overview of PPC CREST (slide 2 of 3)
CREST: Development of System Software Technologies for post-Peta Scale High Performance Computing
[Timeline 2013-2017: Round 1, 5 teams; Round 2, 5 teams; Round 3, 4 teams]
- Naoya Maruyama, RIKEN AICS: Highly Productive, High Performance Application Frameworks for Post Petascale Computing
- Hiroyuki Takizawa, Tohoku University: An Evolutionary Approach to Construction of a Software Development Environment for Massively-Parallel Heterogeneous Systems
- Shigeru Chiba, Tokyo Tech.: Software Development for Post Petascale Supercomputing - Modularity for Supercomputing
- Itsuki Noda, AIST: Framework for Administration of Social Simulations on Massively Parallel Computers
-
Overview of PPC CREST (slide 3 of 3)
CREST: Development of System Software Technologies for post-Peta Scale High Performance Computing
[Timeline 2013-2017: Round 1, 5 teams; Round 2, 5 teams; Round 3, 4 teams]
- Kengo Nakajima, University of Tokyo: ppOpen-HPC
- Tetsuya Sakurai, University of Tsukuba: Development of an Eigen-Supercomputing Engine using a Post-Petascale Hierarchical Model
- Ryuji Shioya, Toyo University: Development of a Numerical Library based on Hierarchical Domain Decomposition for Post Petascale Simulation
- Katsuki Fujisawa, Chuo University: Advanced Computing and Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers
-
Thank you!
RIKEN AICS booth: #2431
Tokyo Tech booth: #1857
JST CREST booth: #3807