TRANSCRIPT

Page 1

Office of Science

The present and future at OLCF

Bronson Messer, Acting Group Leader, Scientific Computing

November 28, 2012

Page 2

Architectural Trends – No more free lunch

•  CPU clock rates quit increasing in 2003
•  P = C V^2 f: power consumed is proportional to the frequency and to the square of the voltage
•  Voltage can't go any lower, so frequency can't go higher without increasing power
•  Power is capped by heat dissipation and $$$
•  Performance increases have been coming through increased parallelism

Herb Sutter, Dr. Dobb's Journal: http://www.gotw.ca/publications/concurrency-ddj.htm
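To make the trade-off concrete, here is an illustrative (and deliberately idealized) version of the argument, assuming dynamic power dominates and that the supply voltage could be scaled in proportion to the clock frequency:

P = C V^2 f \propto f^3 \quad (\text{if } V \propto f)

P_{\text{two cores at } f/2} \approx 2 \cdot C \left(\tfrac{V}{2}\right)^2 \tfrac{f}{2} = \tfrac{1}{4} C V^2 f

Under those assumptions, two cores at half the clock deliver the same nominal throughput as one core at full clock for roughly a quarter of the dynamic power. Since voltage scaling has largely stopped, adding parallelism is the remaining lever.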

Page 3

astro-ph/9912202

Page 4

ORNL's "Titan" Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors

SYSTEM SPECIFICATIONS:
•  Peak performance of 27.1 PF (24.5 GPU + 2.6 CPU)
•  18,688 compute nodes, each with:
   •  16-core AMD Opteron CPU
   •  NVIDIA Tesla "K20x" GPU
   •  32 + 6 GB memory
•  512 service and I/O nodes
•  200 cabinets
•  710 TB total system memory
•  Cray Gemini 3D torus interconnect
•  8.8 MW peak power
•  4,352 ft² (404 m²)

Page 5

Cray XK7 Compute Node

[Node diagram: the AMD Opteron host connects to the Tesla K20x over PCIe Gen2 and to the Gemini interconnect over HT3 links; Gemini provides the X, Y, and Z torus connections.]

XK7 Compute Node Characteristics:
•  AMD Opteron 6274 16-core processor
•  Tesla K20x @ 1311 GF
•  Host memory: 32 GB 1600 MHz DDR3
•  Tesla K20x memory: 6 GB GDDR5
•  Gemini high-speed interconnect

Slide courtesy of Cray, Inc.

Page 6

Titan: Cray XK7 System

•  Compute node: 1.45 TF, 38 GB
•  Board: 4 compute nodes, 5.8 TF, 152 GB
•  Cabinet: 24 boards, 96 nodes, 139 TF, 3.6 TB
•  System: 200 cabinets, 18,688 nodes, 27 PF, 710 TB
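As a quick consistency check, simple arithmetic ties these per-node figures to the system totals on the specification slide:

18{,}688 \times 1.45\ \mathrm{TF} \approx 27.1\ \mathrm{PF}
18{,}688 \times 38\ \mathrm{GB} \approx 710\ \mathrm{TB}
200 \times 96 = 19{,}200 = 18{,}688\ \text{compute nodes} + 512\ \text{service and I/O nodes}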

Page 7

Titan is an upgrade of Jaguar
Phase 1: Replaced all of the Cray XT5 node boards with XK7 node boards, replaced fans, added power supplies and a 3.3 MW transformer

Reused parts from Jaguar:
•  Cabinets
•  Backplanes
•  Interconnect cables
•  Power supplies
•  Liquid cooling system
•  RAS system
•  File system

The upgrade saved $25M over the cost of a new system!

Page 8

Titan's Power & Cooling: Designed for Efficiency

•  Flywheel-based UPS for highest efficiency
•  Variable-speed chillers save energy
•  Liquid cooling is 1,000 times more efficient than air cooling
•  13,800-volt power into the building saves on transmission losses
•  480-volt power to the computers saves $1M in installation costs and reduces losses
•  Vapor barriers and positive air pressure keep humidity out of the computer center

Result: With a PUE of 1.25, ORNL has one of the world's most efficient data centers
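For reference, power usage effectiveness (PUE) is the standard ratio of total facility power to the power actually delivered to the computing equipment:

\mathrm{PUE} = \frac{P_{\text{total facility}}}{P_{\text{IT equipment}}}

so a PUE of 1.25 means roughly 0.25 W of cooling and power-distribution overhead for every watt reaching the machines.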

Page 9

Why GPUs? High Performance and Power Efficiency on a Path to Exascale

•  Hierarchical parallelism – improves scalability of applications
•  Exposing more parallelism through code refactoring and source code directives
•  Heterogeneous multi-core processor architecture – use the right type of processor for each task
•  Data locality – keep the data near the processing; the GPU has high bandwidth to local memory for rapid access and a large internal cache
•  Explicit data management – explicitly manage data movement between CPU and GPU memories (a directive-based sketch of this follows below)
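As a minimal sketch of that last point (not code from any application named in these slides; the array size and the saxpy_twice routine are hypothetical), OpenACC directives in C let the programmer state explicitly when data should move and how long it stays resident on the GPU:

#include <stdlib.h>

#define N 1000000   /* hypothetical problem size */

/* Apply y <- a*x + y twice. The enclosing acc data region keeps x and y
   resident in GPU memory for both loops, so each array crosses the bus
   once per direction instead of once per kernel launch. */
void saxpy_twice(float a, float *restrict x, float *restrict y)
{
    #pragma acc data copyin(x[0:N]) copy(y[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            y[i] = a * x[i] + y[i];

        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            y[i] = a * x[i] + y[i];
    }
}

int main(void)
{
    float *x = malloc(N * sizeof *x);
    float *y = malloc(N * sizeof *y);
    if (!x || !y) return 1;
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy_twice(2.0f, x, y);   /* each y[i] ends up 6.0f */
    free(x);
    free(y);
    return 0;
}

Built with an OpenACC-capable compiler (e.g. pgcc -acc), the data region performs one host-to-device and one device-to-host transfer around both kernels; a compiler without OpenACC support simply ignores the pragmas and runs the loops on the CPU.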

Page 10

How Effective are GPUs on Scalable Applications?
OLCF-3 Early Science codes – very early performance measurements on Titan

XK7 (w/ K20x) vs. XE6
•  Cray XK7: K20x GPU plus AMD 6274 CPU
•  Cray XE6: dual AMD 6274, no GPU
•  Cray XK6 w/o GPU: single AMD 6274, no GPU

Application | Performance ratio | Comments
S3D | 1.8 | Turbulent combustion; 6% of Jaguar workload
Denovo sweep | 3.8 | Sweep kernel of 3D neutron transport for nuclear reactors; 2% of Jaguar workload
LAMMPS | 7.4* (mixed precision) | High-performance molecular dynamics; 1% of Jaguar workload
WL-LSMS | 1.6 | Statistical mechanics of magnetic materials; 2% of Jaguar workload; 2009 Gordon Bell winner
CAM-SE | 1.5 | Community atmosphere model; 1% of Jaguar workload

Page 11

Additional Applications from Community Efforts
Current performance measurements on Titan or the CSCS system

XK7 (w/ K20x) vs. XE6
•  Cray XK7: K20x GPU plus AMD 6274 CPU
•  Cray XE6: dual AMD 6274, no GPU
•  Cray XK6 w/o GPU: single AMD 6274, no GPU

Application | Performance ratio | Comment
NAMD | 1.4 | High-performance molecular dynamics; 2% of Jaguar workload
Chroma | 6.1 | High-energy nuclear physics; 2% of Jaguar workload
QMCPACK | 3.0 | Electronic structure of materials; new to OLCF, common to …
SPECFEM-3D | 2.5 | Seismology; 2008 Gordon Bell finalist
GTC | 1.6 | Plasma physics for fusion energy; 2% of Jaguar workload
CP2K | 1.5 | Chemical physics; 1% of Jaguar workload

Page 12

Hierarchical Parallelism

•  MPI parallelism between nodes (or PGAS)
•  On-node, SMP-like parallelism via threads (or subcommunicators, or …)
•  Vector parallelism
   •  SSE/AVX/etc. on CPUs
   •  GPU threaded parallelism
•  Exposure of unrealized parallelism is essential to exploit all near-future architectures; a minimal sketch of these levels follows after this list.
•  Uncovering unrealized parallelism and improving data locality improves the performance of even CPU-only code.
•  Experience with vanguard codes at OLCF suggests 1-2 person-years are required to "port" extant codes to GPU platforms.
   •  Likely less if begun today, due to better tools/compilers
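A deliberately trivial C sketch of those levels on a CPU-only node (the array size and the sum-of-squares kernel are made up for illustration): MPI between processes, OpenMP threads within a process, and a simd clause for the vector lanes inside each thread. The GPU level would be expressed separately, e.g. with OpenACC directives as sketched elsewhere in this transcript.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N (1 << 20)   /* hypothetical per-rank array size */

static double x[N];   /* static so the array lives off the stack */

int main(int argc, char **argv)
{
    /* Level 1: distributed-memory parallelism between nodes via MPI. */
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N; ++i)
        x[i] = (double)(i % 7);

    /* Level 2: SMP-like threading across the cores of one node (OpenMP).
       Level 3: the "simd" clause asks the compiler to vectorize the loop
       body with SSE/AVX lanes inside each thread.                        */
    double local = 0.0, global = 0.0;
    #pragma omp parallel for simd reduction(+:local)
    for (int i = 0; i < N; ++i)
        local += x[i] * x[i];

    /* Combine the per-rank partial sums across the whole machine. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum of squares = %f\n", global);

    MPI_Finalize();
    return 0;
}

Such a code is typically built with the MPI compiler wrapper plus the compiler's OpenMP flag (e.g. mpicc -fopenmp with GCC underneath).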


Page 13

How do you program these nodes?

•  Compilers
   –  OpenACC is a set of compiler directives that allows the user to express hierarchical parallelism in the source code so that the compiler can generate parallel code for the target platform, be it GPU, MIC, or vector SIMD on a CPU (see the sketch after this list)
   –  The Cray compiler supports XK7 nodes and is OpenACC compatible
   –  The CAPS HMPP compiler supports C, C++ and Fortran compilation for heterogeneous nodes with OpenACC support
   –  The PGI compiler supports OpenACC and CUDA Fortran
•  Tools
   –  The Allinea DDT debugger scales to full system size and, with ORNL support, will be able to debug heterogeneous (x86/GPU) apps
   –  ORNL has worked with the Vampir team at TUD to add support for profiling codes on heterogeneous nodes
   –  CrayPAT and Cray Apprentice support XK6 programming
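To connect the OpenACC bullet to actual source, here is a small self-contained sketch (a hypothetical dense matrix-vector product, not taken from any code mentioned on these slides) of how directives expose two levels of hierarchical parallelism that the compiler can map either to a GPU or to multicore/SIMD on a CPU:

#include <stdio.h>

#define NROWS 2048
#define NCOLS 2048

/* Statically sized arrays keep the example self-contained; the compiler
   knows their extents, so the data clauses need no explicit bounds. */
static double A[NROWS][NCOLS], x[NCOLS], y[NROWS];

int main(void)
{
    for (int j = 0; j < NCOLS; ++j) x[j] = 1.0;
    for (int i = 0; i < NROWS; ++i)
        for (int j = 0; j < NCOLS; ++j)
            A[i][j] = (i == j) ? 2.0 : 0.0;   /* simple diagonal matrix */

    /* Two levels of parallelism, expressed with directives and mapped by
       the compiler: rows across gangs (GPU thread blocks or CPU cores),
       the inner dot product across vector lanes (GPU threads or SIMD).  */
    #pragma acc parallel loop gang copyin(A, x) copyout(y)
    for (int i = 0; i < NROWS; ++i) {
        double sum = 0.0;
        #pragma acc loop vector reduction(+:sum)
        for (int j = 0; j < NCOLS; ++j)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }

    printf("y[0] = %f (expect 2.0 for this diagonal A)\n", y[0]);
    return 0;
}

The same annotated loop nest can be built for the XK7's K20x with an OpenACC-capable compiler such as those listed above, or compiled as ordinary serial C by a compiler that ignores the directives.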

Page 14

Unified x86/Accelerator Development Environment enhances productivity with a common look and feel

[Diagram of the unified programming environment spanning x86-64 and accelerators:]
•  Cray Compiling Environment
•  Cray Scientific & Math Libraries
•  Cray Performance Monitoring and Analysis Tools
•  Cray Message Passing Toolkit
•  Cray Debug Support Tools
•  CUDA SDK
•  GNU

Slide Courtesy of Cray Inc

Page 15

Filesystems

•  The Spider center-wide file system is the operational work file system for most NCCS systems. It is a large-scale Lustre file system, with over 26,000 clients, providing 10.7 PB of disk space. It also has a demonstrated bandwidth of 240 GB/s.

•  New Spider procurement is underway:
   •  roughly double the capacity
   •  roughly 4x BW
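Taken at face value (the multipliers above are rough, so these are order-of-magnitude targets rather than quoted specifications):

2 \times 10.7\ \mathrm{PB} \approx 21\ \mathrm{PB}, \qquad 4 \times 240\ \mathrm{GB/s} \approx 960\ \mathrm{GB/s} \approx 1\ \mathrm{TB/s}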

Page 16

Three primary ways for access to LCFs
Distribution of allocable hours:
•  60% INCITE (4.7 billion core-hours in CY2013)
•  30% ASCR Leadership Computing Challenge
•  10% Director's Discretionary

Leadership-class computing

DOE/SC capability computing
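If the 4.7 billion core-hours is read as the 60% INCITE share (an assumption; the slide does not state the overall total), the implied CY2013 figures are roughly:

\text{total} \approx 4.7\,\mathrm{B} / 0.60 \approx 7.8\ \text{billion core-hours}, \quad \text{ALCC (30\%)} \approx 2.4\,\mathrm{B}, \quad \text{Director's Discretionary (10\%)} \approx 0.8\,\mathrm{B}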

Page 17

LCFs' support models for INCITE

•  The "two-pronged" support model is shared
•  Specific organizational implementations differ slightly, but the user perspective is virtually identical

[Diagram comparing the two centers' support organizations, with labels: Scientific computing; Liaisons; Visualization; Performance; End-to-end workflows; Data analytics and visualization; Catalysts; Performance engineering; User assistance and outreach; User service and outreach]

Page 18

Basics

• User Assistance group provides “front-line” support for day-to-day computing issues

• SciComp Liaisons provide advanced algorithmic and implementation assistance

• Assistance with data analytics and workflow management, visualization, and performance engineering is also provided for each project (these tasks are "housed" in SciComp at OLCF)

Page 19

Most SciComp interactions fall into one of three bins

•  "user support +"
   –  The SciComp liaison answers basic questions when asked and serves as an internal advocate for the project at the OLCF
   –  Constant "pings" from liaisons
•  "rainmakers"
   –  The SciComp liaison "parachutes in" and undertakes a short, intense burst of development activity to surmount a singular application problem
   –  The usual duration is less than 2 months in wallclock time and 1 FTE-month in effort
•  collaborators
   –  The SciComp liaison is a member of (in several cases, one of the leaders of) the code development team
   –  The liaison is a co-author on scientific papers

Page 20

Questions? [email protected]


The research and activities described in this presentation were performed using the resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.