
  • Japanese HPC Update: Exascale Research and the Next-Generation
    Flagship Supercomputer
    Naoya Maruyama (RIKEN Advanced Institute for Computational Science)
    3rd Workshop on Extreme-Scale Programming Tools @ SC14
    Nov 17, 2014

  • Acknowledgments
    Yutaka Ishikawa (RIKEN AICS)
    Mitsuhisa Sato (University of Tsukuba)
    Satoshi Matsuoka (Tokyo Institute of Technology)
    Toshio Endo (Tokyo Institute of Technology)

  • Towards the Next Flagship Machine & Beyond
    [Roadmap chart: performance (PF) vs. year (2008-2020), showing the T2K systems (U. Tsukuba, U. Tokyo, Kyoto U.) and Tokyo Tech. TSUBAME2.0]

  • Japan's High Performance Computing Infrastructure (HPCI)
    HPCI: a nation-wide HPC infrastructure
    - Supercomputers: ~25 PFlops (2013)
    - National storage: 22 PB HDDs + tape
    - Research network (SINET4), 40+10 Gbps
    - Single sign-on (HPCI-ID), distributed FS (Gfarm)
    - National HPCI allocation process
    [Map of HPCI sites: Hokkaido U., Tohoku U., U. Tsukuba (1 PF), U. Tokyo (supercomputers 1.1 PF, HPCI storage 12 PB), Tokyo Tech TSUBAME2.5 (5.7 Petaflops), Nagoya U., Kyoto U., Osaka U., Kyushu U. (1.7 PF), JAMSTEC, and RIKEN AICS (K computer, 11 Petaflops, HPCI storage 10 PB); NII manages SINET and single sign-on]

  • Site of AICS and the K computer
    [Map and aerial photos: the AICS site in Kobe, 423 km (263 miles) west of Tokyo, near Kobe Airport; the site comprises the Research Building, the Computer Building, the substation/power supply, and chillers]

  • Advanced Institute for Computational Science (AICS)
    - Foundation: June 2010
    - Missions:
      - Operation of the K computer for research, including industry applications
      - Leading-edge research through strong collaborations between computer and computational scientists
      - Development of Japan's future strategy for computational science, including the path to exa-scale computing
    - Personnel: 222 (1 April 2014)
    [Organization chart: Director, Deputy Director, Administration Division, Operations & Computer Technology Division, Research Division (16 teams and 3 units), Exascale Project]

  • AICS Research Teams: promoting strong collaborations between computer scientists and computational scientists
    Computational Science Research Teams: provide a shared infrastructure to support a wide range of fields in making sophisticated use of the K computer, by developing methodologies required by computational science.
    - Particle Physics (Y. Kuramashi), Astrophysics (J. Makino), Solid State Physics (S. Yunoki), Quantum Chemistry (T. Nakajima), Computational Chemistry (K. Hirao), Biophysics (Y. Sugita), Drug Design (F. Tama), Earth Science (M. Hori), Climate Science (H. Tomita), Engineering (M. Tsubokura), Discrete Event Simulation (N. Ito)
    Computer Science Research Teams: solve issues surrounding the K computer through research in major elemental computer technologies.
    - Processor (M. Taiji), System Software (Y. Ishikawa), Programming Environment (M. Sato), Large-scale Parallel Numerical Computing Technology (T. Imamura), HPC Usability (T. Maeda), HPC Programming Framework (N. Maruyama), Advanced Visualization (K. Ono), Data Assimilation (T. Miyoshi)

  • K computer Overview (Courtesy of: Fujitsu)
    - Node: 1 CPU, 1 ICC, DRAM - 128 GFLOPS, 16 GB
    - System board: 4 nodes - 512 GFLOPS, 64 GB
    - Compute rack: 24 system boards + 6 I/O system boards - 12.3 TFLOPS, 1.5 TB
    - Rack section: 8 compute racks + 2 disk racks - 98.4 TFLOPS, 12 TB
    - Whole system: 864 compute racks - 10.62 PFLOPS, 1.2 PB memory
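
    The hierarchy above can be cross-checked with simple arithmetic: 4 nodes per board, 24 compute boards per rack, and 864 racks give 82,944 nodes, and scaling the per-node peak up the hierarchy reproduces the quoted figures. A minimal sketch in C, using only numbers from the slide:

      #include <stdio.h>

      int main(void) {
          const double node_gflops   = 128.0;  /* per-node peak (slide) */
          const int nodes_per_board  = 4;
          const int boards_per_rack  = 24;     /* compute boards only; I/O boards add no FLOPS */
          const int racks            = 864;

          double board_gflops  = node_gflops * nodes_per_board;              /* 512 GFLOPS */
          double rack_tflops   = board_gflops * boards_per_rack / 1e3;       /* ~12.3 TFLOPS */
          double system_pflops = rack_tflops * racks / 1e3;                  /* ~10.62 PFLOPS */
          int    total_nodes   = nodes_per_board * boards_per_rack * racks;  /* 82,944 */

          printf("board %.0f GFLOPS, rack %.1f TFLOPS, system %.2f PFLOPS, %d nodes\n",
                 board_gflops, rack_tflops, system_pflops, total_nodes);
          return 0;
      }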

  • SPARC64 VIIIfx Specification (Courtesy of: Fujitsu)
    - Performance: 128 GFLOPS (16 GFLOPS x 8 cores)
    - Number of cores: 8
    - Clock frequency: 2.0 GHz
    - FP units: FMA x 4 (2 SIMD), DIV x 2
    - Registers: FP registers (64-bit): 256; GP registers (64-bit): 188
    - Cache: L1I$: 32 KB (2-way), L1D$: 32 KB (2-way), L2$: shared 6 MB (12-way)
    - Memory BW: 64 GB/s (0.5 B/F)
    - 45 nm CMOS process; chip size: 22.7 mm x 22.6 mm; transistor count: 760M; power: 58 W

    Vendor | Name | Cores | Process rule (nm) | Peak performance (GFLOPS) | Cache (MB) | Power (W) | GF/W | System (w/ planned)
    IBM | PowerPC A2 | 16 | 45 | 204.80 | 32 | 55 | 3.72 | Sequoia (BlueGene/Q)
    Intel | E3-1260L | 4 | 32 | 105.60 | 8 | 45 | 2.35 |
    Fujitsu | SPARC64 VIIIfx | 8 | 45 | 128.00 | 6 | 58 | 2.21 | K computer
    IBM | Power7 | 8 | 45 | 256.00 | 32 | 200 | 1.28 |
    AMD | Opteron 6172 | 12 | 45 | 100.80 | 12 | 80 | 1.26 | XE6, etc.
    Intel | Xeon X5670 | 6 | 32 | 79.92 | 12 | 95 | 0.84 | TSUBAME2.0, etc.
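
    The two derived ratios above follow directly from the specification: bytes-per-flop is memory bandwidth divided by peak, and GF/W is peak divided by chip power. A quick check in C with the slide's numbers:

      #include <stdio.h>

      int main(void) {
          const double peak_gflops = 16.0 * 8;  /* 16 GFLOPS/core x 8 cores */
          const double mem_bw_gbs  = 64.0;      /* GB/s */
          const double power_w     = 58.0;      /* W */

          printf("peak: %.0f GFLOPS\n", peak_gflops);        /* 128 */
          printf("B/F:  %.2f\n", mem_bw_gbs / peak_gflops);  /* 0.50 */
          printf("GF/W: %.2f\n", peak_gflops / power_w);     /* 2.21 */
          return 0;
      }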

  • Tofu Interconnect (Courtesy of: Fujitsu)
    [Diagram of the Tofu 6D mesh/torus interconnect: 64 GB/s memory bandwidth at the node, 5 GB/s per direction per link]

  • Graph500 performance
    - Measures the edge-traversing speed of a large graph
    - Performance:
      - #vertices: 2^40
      - #edges: 2^44
      - #nodes used: 65,536
      - Time: 0.98 sec
      - Speed in TEPS*: 17,977 GTEPS (*Traversed Edges Per Second)
    - Work done by K. Ueno (Tokyo Tech & RIKEN), et al.
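
    To first order, the TEPS figure is just the number of edges divided by the BFS time. The sketch below, which assumes exactly 2^44 traversed edges and the 0.98 s quoted above, lands within a fraction of a percent of the reported 17,977 GTEPS; the small gap presumably reflects how the benchmark counts traversed edges and averages over runs.

      #include <stdio.h>

      int main(void) {
          const double edges   = 1ULL << 44;  /* 2^44 edges (scale-40 problem) */
          const double seconds = 0.98;        /* BFS time from the slide */

          printf("approx. %.0f GTEPS (reported: 17,977 GTEPS)\n",
                 edges / seconds / 1e9);
          return 0;
      }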

  • HPL and HPCG
    Site | Computer | Cores | HPL Rmax (Pflops) | HPL Rank | HPCG (Pflops) | HPCG/HPL
    NSCC / Guangzhou | Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi 57C + Custom | 3,120,000 | 33.9 | 1 | .580 | 1.7%
    RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx 8C + Custom | 705,024 | 10.5 | 4 | .427 | 4.1%
    DOE/OS Oak Ridge Nat Lab | Titan, Cray XK7, AMD 16C + Nvidia Kepler GPU 14C + Custom | 560,640 | 17.6 | 2 | .322 | 1.8%
    DOE/OS Argonne Nat Lab | Mira, BlueGene/Q, Power BQC 16C 1.60GHz + Custom | 786,432 | 8.59 | 5 | .101 # | 1.2%
    Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler 14C + Custom | 115,984 | 6.27 | 6 | .099 | 1.6%
    Leibniz Rechenzentrum | SuperMUC, Intel 8C + IB | 147,456 | 2.90 | 12 | .0833 | 2.9%
    CEA/TGCC-GENCI | Curie thin nodes, Bullx B510, Intel Xeon 8C 2.7 GHz + IB | 79,504 | 1.36 | 26 | .0491 | 3.6%
    Exploration and Production Eni S.p.A. | HPC2, Intel Xeon 10C 2.8 GHz + Nvidia Kepler 14C + IB | 62,640 | 3.00 | 11 | .0489 | 1.6%
    DOE/OS Lawrence Berkeley Nat Lab | Edison, Cray XC30, Intel Xeon 12C 2.4GHz + Custom | 132,840 | 1.65 | 18 | .0439 # | 2.7%
    Texas Advanced Computing Center | Stampede, Dell, Intel (8c) + Intel Xeon Phi (61c) + IB | 78,848 | .881* | 7 | .0161 | 1.8%
    Meteo France | Beaufix, Bullx B710, Intel Xeon 12C 2.7 GHz + IB | 24,192 | .469 (.467*) | 79 | .0110 | 2.4%
    Meteo France | Prolix, Bullx B710, Intel Xeon 2.7 GHz 12C + IB | 23,760 | .464 (.415*) | 80 | .00998 | 2.4%
    U of Toulouse | CALMIP, Bullx DLC, Intel Xeon 10C 2.8 GHz + IB | 12,240 | .255 | 184 | .00725 | 2.8%
    Cambridge U | Wilkes, Intel Xeon 6C 2.6 GHz + Nvidia Kepler 14C + IB | 3,584 | .240 | 201 | .00385 | 1.6%
    TiTech | TSUBAME-KFC, Intel Xeon 6C 2.1 GHz + IB | 2,720 | .150 | 436 | .00370 | 2.5%
    * scaled to reflect the same number of cores
    # unoptimized implementation

    HPCG performance
    - Measures the floating-point speed of PCG with a sparse matrix
    - Performance:
      - Dimension of the matrix: 1.74x10^11
      - #nodes used: 82,944
      - Time: 3,600 sec
      - Speed in FLOPS: 0.427 PFLOPS
      - Ratio to peak speed: 4.02%
    - Work done by K. Kumahata and K. Minami (RIKEN)
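
    The two percentages quoted for the K computer are easy to reproduce: HPCG against the 10.62 PFLOPS theoretical peak gives the ~4.0% ratio to peak, and HPCG against the 10.5 PFLOPS HPL Rmax gives the ~4.1% HPCG/HPL entry in the table. A quick check in C:

      #include <stdio.h>

      int main(void) {
          const double hpcg_pflops = 0.427;  /* K computer HPCG result */
          const double hpl_pflops  = 10.5;   /* K computer HPL Rmax */
          const double peak_pflops = 10.62;  /* K computer theoretical peak */

          printf("HPCG / peak: %.2f%%\n", 100.0 * hpcg_pflops / peak_pflops);  /* ~4.02 */
          printf("HPCG / HPL:  %.2f%%\n", 100.0 * hpcg_pflops / hpl_pflops);   /* ~4.07, i.e. 4.1% in the table */
          return 0;
      }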

  • Operation of the K computer (I)
    Statistics for JFY2013 (2013.4 - 2014.3):
    - In operation: 94.7%
    - Scheduled maintenance: 4.0%
    - System failure (unscheduled downtime): 1.2% (= 4.38 days)
    [Chart: weekly node utilization rate (0-100%), kept at around 80%]
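
    As a sanity check on the downtime figure, 1.2% of a year is indeed about 4.38 days, and the three categories add up to (essentially) 100%. In C, with the slide's numbers:

      #include <stdio.h>

      int main(void) {
          const double failure_pct = 1.2, maintenance_pct = 4.0, operation_pct = 94.7;
          printf("unscheduled downtime: %.2f days/year\n", failure_pct / 100.0 * 365.0);  /* 4.38 */
          printf("total: %.1f%%\n", failure_pct + maintenance_pct + operation_pct);       /* 99.9 */
          return 0;
      }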

  • Operation of the K computer (II)
    [Chart: job size per month in JFY2013, binned by #nodes: 1-1024, 1025-2048, 2049-4096, 4097-10000, 10001-82944]

  • NICAM global climate simulation
    - Previous NICAM simulation with 3.5 km resolution on the Earth Simulator, starting from August 25, 2012
    - Quite accurate, but not able to resolve individual cumulonimbus clouds
    Visualized by Dr. Ryuji Yoshida, Computational Climate Science Research Team, AICS, RIKEN
    Joint research of JAMSTEC, AORI, the University of Tokyo (HPCI Strategic Program 3) and AICS, RIKEN

  • Zooming in on Typhoon #15
    - Using the K computer: 1st simulation with ...
    Visualized by Dr. Ryuji Yoshida, Computational Climate Science Research Team, AICS, RIKEN
    Joint research of JAMSTEC, AORI, the University of Tokyo (HPCI Strategic Program 3) and AICS, RIKEN

  • 3.11 Tohoku Earthquake & Tsunami
    Coupled calculation of:
    - Earthquake
    - Crustal deformation
    - Tsunami
    This enables:
    - Direct comparison with observed records
    - Planning countermeasures against complex disasters involving multiple elements
    Maeda et al., 2013, Bull. Seism. Soc. Am.
    #nodes: 2,304; time: 15 hours (now reduced to 3 hours)

  • Gordon Bell Finalist: 3D Fast Scalable FE-based Seismic Simulation
    Algorithm of GAMERA's linear solver:
    - Outer loop: CG solve of the target equation (second-order tetrahedral mesh, double precision); each iteration requires solving a preconditioning equation, handled by the inner loops, plus the other computations of the CG loop
    - Inner coarse loop: solve the preconditioning equation roughly with a CG solver (linear tetrahedral mesh, single precision); the result is used as the initial solution for the inner fine loop
    - Inner fine loop: solve the preconditioning equation roughly with a CG solver (second-order tetrahedral mesh, single precision); the result is used as the preconditioner of the outer loop
    [Charts: weak scaling of GAMERA on the K computer; efficacy of the algorithm when using the whole K computer]
    "Physics-based urban earthquake simulation enhanced by 10.7 BlnDOF x 30 K time-step unstructured FE non-linear seismic wave simulation," Tsuyoshi Ichimura, Kohei Fujita, Seizo Tanaka, Muneo Hori, Maddegedara Lalith, Yoshihisa Shizawa, and Hiroshi Kobayashi, SC14 Gordon Bell Finalist
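
    The nesting described above can be illustrated on a toy problem. The sketch below is not the GAMERA code: it drops the coarse linear-tetrahedron level and all distributed-memory parallelism, and only shows the general pattern of an outer double-precision CG whose preconditioning equation is solved approximately by a fixed, small number of single-precision CG iterations (here on a 1D Laplacian).

      #include <stdio.h>
      #include <math.h>

      #define N 64

      /* A = 1D Laplacian (tridiagonal, SPD): double- and single-precision mat-vec */
      static void matvec_d(const double *x, double *y) {
          for (int i = 0; i < N; i++) {
              y[i] = 2.0 * x[i];
              if (i > 0)     y[i] -= x[i - 1];
              if (i < N - 1) y[i] -= x[i + 1];
          }
      }
      static void matvec_s(const float *x, float *y) {
          for (int i = 0; i < N; i++) {
              y[i] = 2.0f * x[i];
              if (i > 0)     y[i] -= x[i - 1];
              if (i < N - 1) y[i] -= x[i + 1];
          }
      }

      /* Preconditioner z ~= A^{-1} r: a few single-precision CG iterations
         (in GAMERA this role is played by the inner coarse/fine CG solvers). */
      static void inner_cg_single(const double *r_in, double *z_out, int iters) {
          float z[N], r[N], p[N], Ap[N], rr = 0.0f;
          for (int i = 0; i < N; i++) { z[i] = 0.0f; r[i] = (float)r_in[i]; p[i] = r[i]; rr += r[i] * r[i]; }
          for (int k = 0; k < iters && rr > 0.0f; k++) {
              matvec_s(p, Ap);
              float pAp = 0.0f;
              for (int i = 0; i < N; i++) pAp += p[i] * Ap[i];
              float alpha = rr / pAp, rr_new = 0.0f;
              for (int i = 0; i < N; i++) { z[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; rr_new += r[i] * r[i]; }
              for (int i = 0; i < N; i++) p[i] = r[i] + (rr_new / rr) * p[i];
              rr = rr_new;
          }
          for (int i = 0; i < N; i++) z_out[i] = (double)z[i];
      }

      int main(void) {
          double b[N], x[N], r[N], z[N], p[N], Ap[N], rz = 0.0;
          for (int i = 0; i < N; i++) { b[i] = 1.0; x[i] = 0.0; r[i] = b[i]; }  /* x0 = 0, r0 = b */

          inner_cg_single(r, z, 8);                     /* first preconditioning step */
          for (int i = 0; i < N; i++) { p[i] = z[i]; rz += r[i] * z[i]; }

          for (int k = 1; k <= 200; k++) {              /* outer CG loop, double precision */
              matvec_d(p, Ap);
              double pAp = 0.0;
              for (int i = 0; i < N; i++) pAp += p[i] * Ap[i];
              double alpha = rz / pAp, rnorm = 0.0;
              for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; rnorm += r[i] * r[i]; }
              if (sqrt(rnorm) < 1e-8) { printf("converged after %d outer iterations\n", k); break; }

              inner_cg_single(r, z, 8);                 /* preconditioning: cheap low-precision solve */
              double rz_new = 0.0;
              for (int i = 0; i < N; i++) rz_new += r[i] * z[i];
              for (int i = 0; i < N; i++) p[i] = z[i] + (rz_new / rz) * p[i];
              rz = rz_new;
          }
          return 0;
      }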

  • Mass evacuation simulation based on a multi-agent model
    - Mass evacuation simulation (200,000 agents, real city in Japan)
    - Helps formulate an effective evacuation plan and precautions against tsunamis and earthquakes
    - Integration of the seismic response of structures into the disaster simulations is under way
    Courtesy of ...
    #Nodes: 2,000; time: 1 hour

  • Turbulent flow in the solar global convection
    - Crucial for understanding the formation of the magnetic field and sunspots
    - High-resolution (5x10^8 to 30x10^8 mesh) calculation on K with a new algorithm (Reduced Speed of Sound Technique)
    - Successfully resolved the structure of the turbulent flow; a first step toward understanding the solar global convection and sunspots (11-year cycle)
    [Figure: radiative zone, convective zone, core zone, energy flow]
    #Nodes: 3,072; time: 30 hours
    H. Hotta, T. Yokoyama (U. Tokyo), M. Rempel (HAO, USA), Astrophysical Journal, 2014
    Press release in Japanese (Apr 11, 2014): http://www.s.u-tokyo.ac.jp/ja/press/2014/15.html
    Movie courtesy of Dr. Hideyuki Hotta: http://www-space.eps.s.u-tokyo.ac.jp/~hotta/movie/conv_spe.html
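
    For context (my own summary of the technique named above, not taken from the slide): the Reduced Speed of Sound Technique relaxes the acoustic CFL time-step limit by scaling the divergence term of the continuity equation, so that the effective sound speed is reduced by a chosen factor xi while the slow convective dynamics are left essentially unchanged. Schematically, in LaTeX:

      % RSST-modified continuity equation (schematic);
      % with xi > 1 the effective sound speed becomes c_s / xi,
      % allowing roughly xi-times larger explicit time steps.
      \frac{\partial \rho_1}{\partial t} = -\frac{1}{\xi^{2}}\, \nabla \cdot (\rho\, \boldsymbol{v})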

  • Super-concentrated electrolyte for Lithium-Ion Batteries
    - A functional and stable electrolyte is key to high performance of LIBs
    - First-principles molecular dynamics simulations show that a super-concentrated electrolyte has:
      - a special network of anions and solvents, which leads to
      - remarkably fast Li-ion transport (1/3 charging time), and
      - high stability against reduction
    Yoshitaka Tateyama (NIMS), Keitaro Sodeyama (Kyoto Univ. & NIMS)
    J. Am. Chem. Soc., 2014
    Press release (March 24, 2014): http://www.t.u-tokyo.ac.jp/etpage/release/2014/2014032401.html
    #Nodes: 1,536; time: 500 hours
    [Facsimile of the journal article page (J. Am. Chem. Soc. 2014, 136, dx.doi.org/10.1021/ja412807w), including its Figure 5: supercells and projected density of states from DFT-MD simulations of dilute (0.4 mol dm^-3) and superconcentrated (4.2 mol dm^-3) LiTFSA/AN solutions]

  • Industrial usage
    [Chart: accumulated number of trial-use projects, 9.2012 - 3.2014; named users include JSR, JFE, and IHI]
    - OSS used: OpenFOAM, LAMMPS, OCTA/COGNAC 8.3, SUSHI 9.1, FrontISTR, FrontFlow/blue, REVOCAP_Coupler
    - ISV software: Poynting, VASP, CzeekS, VSOP
    Source: RIST report at HPCIC, 5.2014

    Area of industry | Representative firms | Typical usage
    Pharmaceutical | Dainippon Sumitomo Pharma; Daiichi Sankyo | Drug design
    Chemical | Sumitomo Chemical; Bridgestone | New material development; tire tread pattern design
    Construction | Shimizu Corporation; Takenaka Corporation | Wind pressure on buildings; building structural design
    Automobile / manufacturing | Toyota; Kawasaki Heavy Industries | Engine combustion; efficiency of turbine generators
    IT / software | Mizuho Information & Research Institute | Software development and consulting service

  • Towards the Next Flagship Machine & Beyond
    [Roadmap chart: performance (PF) vs. year (2008-2020), showing the T2K systems (U. Tsukuba, U. Tokyo, Kyoto U.) and Tokyo Tech. TSUBAME2.0]

  • TSUBAME2.0, Nov. 1, 2010: The Greenest Production Supercomputer in the World
    [Diagram of the TSUBAME 2.0 new development (32 nm / 40 nm): >400 GB/s memory BW, 80 Gbps network BW, ~1 kW max; >1.6 TB/s memory BW; >12 TB/s memory BW, 35 kW max; >600 TB/s memory BW, 220 Tbps network bisection BW, 1.4 MW max]

  • TSUBAME2.0 Awards
    - Greenest Production Supercomputer in the World: Green 500 (#3 overall), Nov. 2010 and June 2011 (#4 in the Top500, Nov. 2010)
    - ACM Gordon Bell Prize 2011: 2.0 Petaflops dendrite simulation

  • TSUBAME-KFC (Kepler Fluid Cooling)
    A TSUBAME3.0 prototype system with advanced next-generation cooling: 40 compute nodes are oil-submerged in 1,200 liters of oil (Exxon PAO, ~1 ton). #1 on the Nov. 2013 Green 500!
    - Single node: 5.26 TFLOPS DFP
    - System (40 nodes): 210.61 TFLOPS DFP, 630 TFlops SFP
    - Storage (3 SSDs/node): 1.2 TBytes of SSD per node, 50 TBytes total, ~50 GB/s BW

  • #1 in the Green 500 List (Nov. 2013)
    - First such achievement by a Japanese supercomputer
    - #1 again in June 2014
    - TSUBAME 2.5 is also ranked #6
    [Photos: TSUBAME-KFC and TSUBAME 2.5]

  • Beyond TSUBAME-KFC: GoldenBox Proto1 (NVIDIA K1-based)
    To be shown at SC14, Tokyo Tech. booth
    - 36-node Tegra K1, 11 TFlops SFP
    - ~700 GB/s memory BW
    - ~350 Watts
    - Integrated mSATA SSD, ~7 GB/s I/O
    - Ultra-dense, oil-immersion cooling
    - Same SW stack as TSUBAME
    2022: x10 Flops, x10 memory bandwidth, silicon photonics, x10 NVM, x10 node density

  • Towards the Next Flagship Machine & Beyond
    [Roadmap chart: performance (PF) vs. year (2008-2020), now extended with Flagship 2020 (Post K, RIKEN), PostT2K, and TSUBAME3.0 on top of the T2K systems (U. Tsukuba, U. Tokyo, Kyoto U.) and Tokyo Tech. TSUBAME2.0, leading toward future exascale]
    - The Flagship 2020 project: the next national flagship system for 2020 (RIKEN)
    - Alternative leading machines at 9 universities and national laboratories
    - Co-design is key; some academic-led R&D (esp. system SW and overall architecture)
    - International collaboration; new targets, e.g., power, big data, etc.
    - Future exascale

  • Feasibility Studies on Future HPC R&D in Japan (FY2012-2013)
    - Design of computer systems solving scientific and social issues
      - Identification of R&D issues to realize the systems
      - Review of the systems using the application codes
      - Estimation of the systems' cost
    - Identification of scientific and social issues to be solved in the future
      - Drawing a science roadmap until 2020
      - Selection of the applications that play key roles in the roadmap
      - Review of the architectures using those applications
    Co-design between 1 application study team and 3 system study teams:
    - Application study: RIKEN AICS and Titech, in collaboration with application fields
    - System studies: Tohoku Univ. and NEC; U. of Tsukuba, Titech, and Hitachi; U. of Tokyo, Kyushu U., Fujitsu, Hitachi, and NEC
    (Hirofumi Tomita, Satoshi Matsuoka, Mitsuhisa Sato, Yutaka Ishikawa, Hiroaki Kobayashi)

  • Post K Computer (Flagship 2020)
    Co-design: compute node
    - CPU: many-core, with an interconnect interface and a power-knob feature
    - Interconnect: Tofu (mesh/torus network)
    Features under co-design: FP performance; memory hierarchy, control, capacity, and bandwidth; network performance; I/O performance
    [System diagram: compute nodes, login servers, maintenance servers, portal servers, I/O network, hierarchical storage system]

  • Current status of the post-K project
    - The project is in the procurement process for the development of the post-K computer system; Fujitsu was selected as the vendor partner.
    - Constraints in the specification of the RFP:
      - Power capacity (about 30 MW)
      - Space for system installation (in the Kobe AICS building)
      - Budget for development (NRE) and production
      - ... some degree of compatibility with the current K computer
    - The system should be designed to maximize the performance of applications in each computational science field.
    - To be installed in 2018-2019; to become operational in 2020.

  • Co-design elements in HPC systems
    Hardware/architecture:
    - Node architecture (#cores, #SIMD, etc.)
    - Cache (size and bandwidth)
    - Network (topologies, latency, and bandwidth)
    - Memory technologies (HBM and HMC, ...)
    - Specialized hardware
    - #nodes
    - Storage, file systems
    - ... system configurations
    System software:
    - Operating system for many-core architectures
    - Communication libraries (low-level layer, MPI, PGAS)
    - Programming models and languages
    - DSLs, ...
    Algorithms and math libraries:
    - Dense and sparse solvers
    - Eigensolvers
    - ... domain-specific libraries and frameworks
    Applications

  • Linux + McKernel (2014/09/03)
    Concerns:
    - Reducing memory contention
    - Reducing data movement among cores
    - Providing new memory management
    - Providing fast communication
    - Parallelizing OS functions while achieving less data movement
    - New OS mechanisms and APIs are created (revolutionarily or evolutionarily), examined, and selected
    Linux with a lightweight micro kernel:
    - IHK (Interface for Heterogeneous Kernels)
      - Loads a kernel onto cores
      - Handles communication between Linux and that kernel
    - McKernel: a customizable OS environment, e.g., an environment without a CPU scheduler (without timer interrupts)
    [Diagram: some cores run the Linux kernel with daemons and user processes, other cores run McKernel with user processes; IHK routes system calls either to the lightweight micro kernel or to Linux]
    - Runs on both Xeon and Xeon Phi environments
    - IHK and McKernel have been developed at the University of Tokyo and RIKEN with Hitachi, NEC, and Fujitsu

  • XcalableMP (XMP)   http://www.xcalablemp.org
    What is XcalableMP (XMP for short)?
    - A PGAS programming model and language for distributed memory, proposed by the XMP Spec WG
    - The XMP Spec WG is a special interest group to design and draft the specification of the XcalableMP language. It is now organized under the PC Cluster Consortium, Japan. Mainly active in Japan, but open for everybody.
    Project status (as of Nov. 2013):
    - XMP Spec Version 1.2 is available at the XMP site. New features: mixed OpenMP and OpenACC, libraries for collective communications.
    - Reference implementation by U. Tsukuba and RIKEN AICS: Version 0.7 (C and Fortran90) is available for PC clusters, Cray XT, and the K computer. Source-to-source compiler to code with the runtime on top of MPI and GASNet.
    [Chart: possibility of performance tuning vs. programming cost for MPI, automatic parallelization, PGAS, HPF, Chapel, and XcalableMP]
    Language features:
    - Directive-based language extensions for Fortran and C for the PGAS model
    - Global-view programming with global-view distributed data structures for data parallelism
      - SPMD execution model, as in MPI
      - Pragmas for data distribution of global arrays
      - Work-mapping constructs to map work and iterations with affinity to data explicitly
    - Rich communication and synchronization directives such as gmove and shadow
    - Many concepts are inherited from HPF
    - The co-array feature of CAF is adopted as part of the language spec for local-view programming (also defined in C)
    XMP provides a global view for data-parallel programs in the PGAS model.
    Code example (data distribution, work sharing, and data synchronization directives are added to the serial code: incremental parallelization):

      int array[YMAX][XMAX];

      #pragma xmp nodes p(4)                    /* 4 executing nodes */
      #pragma xmp template t(YMAX)              /* template spanning the row index */
      #pragma xmp distribute t(block) on p      /* block distribution over the nodes */
      #pragma xmp align array[i][*] to t(i)     /* align rows of array with the template */

      main() {
        int i, j, res;
        res = 0;
      #pragma xmp loop on t(i) reduction(+:res) /* each node works on its own rows; partial sums are reduced */
        for (i = 0; i < 10; i++)
          for (j = 0; j < 10; j++) {
            array[i][j] = func(i, j);
            res += array[i][j];
          }
      }

  • CREST: Development of System Software Technologies for post-Peta Scale High Performance Computing (2010-2018)
    Objectives:
    - Co-design of system software with applications and post-peta-scale computer architectures
    - Development of deliverable software pieces
    Research supervisor: Akinori Yonezawa, Deputy Director of RIKEN AICS
    Run by JST (Japan Science and Technology Agency)
    Budget and formation (2010 to 2018):
    - About $60M ($47M at the normal exchange rate) in total
    - Round 1: from 2010, for 5.5 years
    - Round 2: from 2011, for 5.5 years
    - Round 3: from 2012, for 5.5 years
    http://www.postpeta.jst.go.jp/en/

  • Overview of PPC CREST (slide 1 of 3)
    CREST: Development of System Software Technologies for post-Peta Scale High Performance Computing
    [Timeline 2013-2017: Round 1, 5 teams; Round 2, 5 teams; Round 3, 4 teams]
    - Taisuke Boku, U. of Tsukuba: Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era
    - Atsushi Hori, RIKEN AICS: Parallel System Software for Multi-core and Many-core
    - Toshio Endo, Tokyo Tech.: Software Technology that Deals with Deeper Memory Hierarchy in Post-petascale Era
    - Takeshi Nanri, Kyushu University: Development of Scalable Communication Library with Technologies for Memory Saving and Runtime Optimization
    - Osamu Tatebe, U. of Tsukuba: System Software for Post Petascale Data Intensive Science
    - Masaaki Kondo, U. of Electro-Communications: Power Management Framework for Post-Petascale Supercomputers

  • Overview of PPC CREST (slide 2 of 3)
    - Naoya Maruyama, RIKEN AICS: Highly Productive, High Performance Application Frameworks for Post Petascale Computing
    - Hiroyuki Takizawa, Tohoku University: An Evolutionary Approach to Construction of a Software Development Environment for Massively-Parallel Heterogeneous Systems
    - Shigeru Chiba, Tokyo Tech.: Software Development for Post Petascale Supercomputing --- Modularity for Supercomputing
    - Itsuki Noda, AIST: Framework for Administration of Social Simulations on Massively Parallel Computers

  • Overview of PPC CREST (slide 3 of 3)
    - Kengo Nakajima, University of Tokyo: ppOpen-HPC
    - Tetsuya Sakurai, University of Tsukuba: Development of an Eigen-Supercomputing Engine using a Post-Petascale Hierarchical Model
    - Ryuji Shioya, Toyo University: Development of a Numerical Library based on Hierarchical Domain Decomposition for Post Petascale Simulation
    - Katsuki Fujisawa, Chuo University: Advanced Computing and Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers

  • Thank you!
    RIKEN AICS booth: #2431
    Tokyo Tech booth: #1857
    JST CREST booth: #3807