TRANSCRIPT

Page 1

Office of Science

The present and future at OLCF

Bronson Messer, Acting Group Leader, Scientific Computing

November 28, 2012

Page 2

Architectural Trends – No more free lunch

•  CPU clock rates quit increasing in 2003
•  P = C V^2 f: power consumed is proportional to the frequency and to the square of the voltage
•  Voltage can't go any lower, so frequency can't go higher without increasing power
•  Power is capped by heat dissipation and $$$
•  Performance increases have been coming through increased parallelism

Herb Sutter, Dr. Dobb's Journal: http://www.gotw.ca/publications/concurrency-ddj.htm
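To make the trade-off concrete, here is an illustrative (and deliberately idealized) version of the argument, assuming dynamic power dominates and that the supply voltage could be scaled in proportion to the clock frequency:

P = C V^2 f \propto f^3 \quad (\text{if } V \propto f)

P_{\text{two cores at } f/2} \approx 2 \cdot C \left(\tfrac{V}{2}\right)^2 \tfrac{f}{2} = \tfrac{1}{4} C V^2 f

Under those assumptions, two cores at half the clock deliver the same nominal throughput as one core at full clock for roughly a quarter of the dynamic power. Since voltage scaling has largely stopped, adding parallelism is the remaining lever.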

Page 3

astro-ph/9912202

Page 4

ORNL's "Titan" Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors

SYSTEM SPECIFICATIONS:
•  Peak performance of 27.1 PF (24.5 GPU + 2.6 CPU)
•  18,688 compute nodes, each with:
   •  16-core AMD Opteron CPU
   •  NVIDIA Tesla "K20x" GPU
   •  32 + 6 GB memory
•  512 service and I/O nodes
•  200 cabinets
•  710 TB total system memory
•  Cray Gemini 3D torus interconnect
•  8.8 MW peak power
•  4,352 ft² (404 m²)

Page 5

Cray XK7 Compute Node

[Node diagram: the AMD Opteron host connects to the Tesla K20x over PCIe Gen2 and to the Gemini interconnect over HT3 links; Gemini provides the X, Y, and Z torus connections.]

XK7 Compute Node Characteristics:
•  AMD Opteron 6274 16-core processor
•  Tesla K20x @ 1311 GF
•  Host memory: 32 GB 1600 MHz DDR3
•  Tesla K20x memory: 6 GB GDDR5
•  Gemini high-speed interconnect

Slide courtesy of Cray, Inc.

Page 6

Titan: Cray XK7 System

•  Compute node: 1.45 TF, 38 GB
•  Board: 4 compute nodes, 5.8 TF, 152 GB
•  Cabinet: 24 boards, 96 nodes, 139 TF, 3.6 TB
•  System: 200 cabinets, 18,688 nodes, 27 PF, 710 TB
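As a quick consistency check, simple arithmetic ties these per-node figures to the system totals on the specification slide:

18{,}688 \times 1.45\ \mathrm{TF} \approx 27.1\ \mathrm{PF}
18{,}688 \times 38\ \mathrm{GB} \approx 710\ \mathrm{TB}
200 \times 96 = 19{,}200 = 18{,}688\ \text{compute nodes} + 512\ \text{service and I/O nodes}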

Page 7

Titan is an upgrade of Jaguar
Phase 1: Replaced all of the Cray XT5 node boards with XK7 node boards, replaced fans, added power supplies and a 3.3 MW transformer

Reused parts from Jaguar:
•  Cabinets
•  Backplanes
•  Interconnect cables
•  Power supplies
•  Liquid cooling system
•  RAS system
•  File system

The upgrade saved $25M over the cost of a new system!

Page 8

Titan's Power & Cooling: Designed for Efficiency

•  Flywheel-based UPS for highest efficiency
•  Variable-speed chillers save energy
•  Liquid cooling is 1,000 times more efficient than air cooling
•  13,800-volt power into the building saves on transmission losses
•  480-volt power to the computers saves $1M in installation costs and reduces losses
•  Vapor barriers and positive air pressure keep humidity out of the computer center

Result: With a PUE of 1.25, ORNL has one of the world's most efficient data centers
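For reference, power usage effectiveness (PUE) is the standard ratio of total facility power to the power actually delivered to the computing equipment:

\mathrm{PUE} = \frac{P_{\text{total facility}}}{P_{\text{IT equipment}}}

so a PUE of 1.25 means roughly 0.25 W of cooling and power-distribution overhead for every watt reaching the machines.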

Page 9

Why GPUs? High Performance and Power Efficiency on a Path to Exascale

•  Hierarchical parallelism – improves scalability of applications
•  Exposing more parallelism through code refactoring and source code directives
•  Heterogeneous multi-core processor architecture – use the right type of processor for each task
•  Data locality – keep the data near the processing; the GPU has high bandwidth to local memory for rapid access and a large internal cache
•  Explicit data management – explicitly manage data movement between CPU and GPU memories (a directive-based sketch of this follows below)
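As a minimal sketch of that last point (not code from any application named in these slides; the array size and the saxpy_twice routine are hypothetical), OpenACC directives in C let the programmer state explicitly when data should move and how long it stays resident on the GPU:

#include <stdlib.h>

#define N 1000000   /* hypothetical problem size */

/* Apply y <- a*x + y twice. The enclosing acc data region keeps x and y
   resident in GPU memory for both loops, so each array crosses the bus
   once per direction instead of once per kernel launch. */
void saxpy_twice(float a, float *restrict x, float *restrict y)
{
    #pragma acc data copyin(x[0:N]) copy(y[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            y[i] = a * x[i] + y[i];

        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            y[i] = a * x[i] + y[i];
    }
}

int main(void)
{
    float *x = malloc(N * sizeof *x);
    float *y = malloc(N * sizeof *y);
    if (!x || !y) return 1;
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy_twice(2.0f, x, y);   /* each y[i] ends up 6.0f */
    free(x);
    free(y);
    return 0;
}

Built with an OpenACC-capable compiler (e.g. pgcc -acc), the data region performs one host-to-device and one device-to-host transfer around both kernels; a compiler without OpenACC support simply ignores the pragmas and runs the loops on the CPU.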

Page 10

How Effective are GPUs on Scalable Applications?
OLCF-3 Early Science codes – very early performance measurements on Titan

XK7 (w/ K20x) vs. XE6
•  Cray XK7: K20x GPU plus AMD 6274 CPU
•  Cray XE6: dual AMD 6274, no GPU
•  Cray XK6 w/o GPU: single AMD 6274, no GPU

Application | Performance ratio | Comments
S3D | 1.8 | Turbulent combustion; 6% of Jaguar workload
Denovo sweep | 3.8 | Sweep kernel of 3D neutron transport for nuclear reactors; 2% of Jaguar workload
LAMMPS | 7.4* (mixed precision) | High-performance molecular dynamics; 1% of Jaguar workload
WL-LSMS | 1.6 | Statistical mechanics of magnetic materials; 2% of Jaguar workload; 2009 Gordon Bell winner
CAM-SE | 1.5 | Community atmosphere model; 1% of Jaguar workload

Page 11

Additional Applications from Community Efforts
Current performance measurements on Titan or the CSCS system

XK7 (w/ K20x) vs. XE6
•  Cray XK7: K20x GPU plus AMD 6274 CPU
•  Cray XE6: dual AMD 6274, no GPU
•  Cray XK6 w/o GPU: single AMD 6274, no GPU

Application | Performance ratio | Comment
NAMD | 1.4 | High-performance molecular dynamics; 2% of Jaguar workload
Chroma | 6.1 | High-energy nuclear physics; 2% of Jaguar workload
QMCPACK | 3.0 | Electronic structure of materials; new to OLCF, common to …
SPECFEM-3D | 2.5 | Seismology; 2008 Gordon Bell finalist
GTC | 1.6 | Plasma physics for fusion energy; 2% of Jaguar workload
CP2K | 1.5 | Chemical physics; 1% of Jaguar workload

Page 12

Hierarchical Parallelism

•  MPI parallelism between nodes (or PGAS)
•  On-node, SMP-like parallelism via threads (or subcommunicators, or …)
•  Vector parallelism
   •  SSE/AVX/etc. on CPUs
   •  GPU threaded parallelism
•  Exposure of unrealized parallelism is essential to exploit all near-future architectures; a minimal sketch of these levels follows after this list.
•  Uncovering unrealized parallelism and improving data locality improves the performance of even CPU-only code.
•  Experience with vanguard codes at OLCF suggests 1-2 person-years are required to "port" extant codes to GPU platforms.
   •  Likely less if begun today, due to better tools/compilers
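A deliberately trivial C sketch of those levels on a CPU-only node (the array size and the sum-of-squares kernel are made up for illustration): MPI between processes, OpenMP threads within a process, and a simd clause for the vector lanes inside each thread. The GPU level would be expressed separately, e.g. with OpenACC directives as sketched elsewhere in this transcript.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N (1 << 20)   /* hypothetical per-rank array size */

static double x[N];   /* static so the array lives off the stack */

int main(int argc, char **argv)
{
    /* Level 1: distributed-memory parallelism between nodes via MPI. */
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N; ++i)
        x[i] = (double)(i % 7);

    /* Level 2: SMP-like threading across the cores of one node (OpenMP).
       Level 3: the "simd" clause asks the compiler to vectorize the loop
       body with SSE/AVX lanes inside each thread.                        */
    double local = 0.0, global = 0.0;
    #pragma omp parallel for simd reduction(+:local)
    for (int i = 0; i < N; ++i)
        local += x[i] * x[i];

    /* Combine the per-rank partial sums across the whole machine. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum of squares = %f\n", global);

    MPI_Finalize();
    return 0;
}

Such a code is typically built with the MPI compiler wrapper plus the compiler's OpenMP flag (e.g. mpicc -fopenmp with GCC underneath).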


Page 13

How do you program these nodes?

•  Compilers
   –  OpenACC is a set of compiler directives that allows the user to express hierarchical parallelism in the source code so that the compiler can generate parallel code for the target platform, be it GPU, MIC, or vector SIMD on a CPU (see the sketch after this list)
   –  The Cray compiler supports XK7 nodes and is OpenACC compatible
   –  The CAPS HMPP compiler supports C, C++ and Fortran compilation for heterogeneous nodes with OpenACC support
   –  The PGI compiler supports OpenACC and CUDA Fortran
•  Tools
   –  The Allinea DDT debugger scales to full system size and, with ORNL support, will be able to debug heterogeneous (x86/GPU) apps
   –  ORNL has worked with the Vampir team at TUD to add support for profiling codes on heterogeneous nodes
   –  CrayPAT and Cray Apprentice support XK6 programming
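To connect the OpenACC bullet to actual source, here is a small self-contained sketch (a hypothetical dense matrix-vector product, not taken from any code mentioned on these slides) of how directives expose two levels of hierarchical parallelism that the compiler can map either to a GPU or to multicore/SIMD on a CPU:

#include <stdio.h>

#define NROWS 2048
#define NCOLS 2048

/* Statically sized arrays keep the example self-contained; the compiler
   knows their extents, so the data clauses need no explicit bounds. */
static double A[NROWS][NCOLS], x[NCOLS], y[NROWS];

int main(void)
{
    for (int j = 0; j < NCOLS; ++j) x[j] = 1.0;
    for (int i = 0; i < NROWS; ++i)
        for (int j = 0; j < NCOLS; ++j)
            A[i][j] = (i == j) ? 2.0 : 0.0;   /* simple diagonal matrix */

    /* Two levels of parallelism, expressed with directives and mapped by
       the compiler: rows across gangs (GPU thread blocks or CPU cores),
       the inner dot product across vector lanes (GPU threads or SIMD).  */
    #pragma acc parallel loop gang copyin(A, x) copyout(y)
    for (int i = 0; i < NROWS; ++i) {
        double sum = 0.0;
        #pragma acc loop vector reduction(+:sum)
        for (int j = 0; j < NCOLS; ++j)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }

    printf("y[0] = %f (expect 2.0 for this diagonal A)\n", y[0]);
    return 0;
}

The same annotated loop nest can be built for the XK7's K20x with an OpenACC-capable compiler such as those listed above, or compiled as ordinary serial C by a compiler that ignores the directives.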

Page 14

Unified x86/Accelerator Development Environment enhances productivity with a common look and feel

[Diagram of the unified programming environment spanning x86-64 and accelerators:]
•  Cray Compiling Environment
•  Cray Scientific & Math Libraries
•  Cray Performance Monitoring and Analysis Tools
•  Cray Message Passing Toolkit
•  Cray Debug Support Tools
•  CUDA SDK
•  GNU

Slide Courtesy of Cray Inc

Page 15

Filesystems

•  The Spider center-wide file system is the operational work file system for most NCCS systems. It is a large-scale Lustre file system, with over 26,000 clients, providing 10.7 PB of disk space. It also has a demonstrated bandwidth of 240 GB/s.

•  New Spider procurement is underway:
   •  roughly double the capacity
   •  roughly 4x BW
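Taken at face value (the multipliers above are rough, so these are order-of-magnitude targets rather than quoted specifications):

2 \times 10.7\ \mathrm{PB} \approx 21\ \mathrm{PB}, \qquad 4 \times 240\ \mathrm{GB/s} \approx 960\ \mathrm{GB/s} \approx 1\ \mathrm{TB/s}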

Page 16

Three primary ways for access to LCFs
Distribution of allocable hours:
•  60% INCITE (4.7 billion core-hours in CY2013)
•  30% ASCR Leadership Computing Challenge
•  10% Director's Discretionary

Leadership-class computing

DOE/SC capability computing
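If the 4.7 billion core-hours is read as the 60% INCITE share (an assumption; the slide does not state the overall total), the implied CY2013 figures are roughly:

\text{total} \approx 4.7\,\mathrm{B} / 0.60 \approx 7.8\ \text{billion core-hours}, \quad \text{ALCC (30\%)} \approx 2.4\,\mathrm{B}, \quad \text{Director's Discretionary (10\%)} \approx 0.8\,\mathrm{B}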

Page 17

LCFs' support models for INCITE

•  The "two-pronged" support model is shared
•  Specific organizational implementations differ slightly, but the user perspective is virtually identical

[Diagram comparing the two centers' support organizations, with labels: Scientific computing; Liaisons; Visualization; Performance; End-to-end workflows; Data analytics and visualization; Catalysts; Performance engineering; User assistance and outreach; User service and outreach]

Page 18

Basics

• User Assistance group provides “front-line” support for day-to-day computing issues

• SciComp Liaisons provide advanced algorithmic and implementation assistance

• Assistance with data analytics and workflow management, visualization, and performance engineering is also provided for each project (these tasks are "housed" in SciComp at OLCF)

Page 19

Most SciComp interactions fall into one of three bins

•  "user support +"
   –  The SciComp liaison answers basic questions when asked and serves as an internal advocate for the project at the OLCF
   –  Constant "pings" from liaisons
•  "rainmakers"
   –  The SciComp liaison "parachutes in" and undertakes a short, intense burst of development activity to surmount a singular application problem
   –  The usual duration is less than 2 months in wallclock time and 1 FTE-month in effort
•  collaborators
   –  The SciComp liaison is a member of (in several cases, one of the leaders of) the code development team
   –  The liaison is a co-author on scientific papers

Page 20

Questions? [email protected]


The research and activities described in this presentation were performed using the resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.