TRANSCRIPT
XSEDE 12
July 16, 2012
Preparing for Stampede: Programming
Heterogeneous Many-Core Supercomputers
Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld
dan/lars/bbarth/[email protected]
Tutorial
Agenda
08:00 – 08:45 Stampede and MIC Overview
08:45 – 10:00 Vectorization
10:00 – 10:15 Break
10:15 – 11:15 OpenMP and Offloading
11:15 – 12:00 Lab (on a Sandy Bridge machine)
Stampede and MIC Overview
• The TACC Stampede System
• Overview of Intel’s MIC ecosystem
– MIC: Many Integrated Cores
– Basic Design
– Roadmap
– Coprocessor vs. Accelerator
• Programming Models
– OpenMP, MPI
– MKL, TBB, and Cilk Plus
• Roadmap & Preparation
MIC is pronounced “Mike”
Stampede
• A new NSF-funded XSEDE resource at TACC
• Funded two components of a new system:
– Two-petaflop, 100,000+ core Xeon-based Dell cluster to be the new workhorse of the NSF open science community (>3x Ranger)
– Eight additional petaflops of Intel Many Integrated Core (MIC) processors to provide a revolutionary new capability: a new generation of supercomputing
– Change the power/performance curves of supercomputing without changing the programming model (terribly much)
Stampede - Headlines
• Up to 10PF peak performance in initial system (early 2013)
– 100,000+ conventional Intel processor cores
– Almost 500,000 total cores with Intel MIC
• Upgrade with the future “Knights” generation in 2015
• 14PB+ disk
• 272TB+ RAM
• 56Gb FDR InfiniBand interconnect
• Nearly 200 racks of compute hardware (10,000 sq ft)
• >SIX MEGAWATTS total power
The Stampede Project
• A Complete Advanced Computing Ecosystem:
– HPC Compute, storage, interconnect
– Visualization subsystem (144 high-end GPUs)
– Large memory support (16 x 1TB nodes)
– Integration with archive and (coming soon) the TACC global work filesystem and other data systems
– People, training, and documentation to support computational science
• Hosted in an expanded building at TACC; massive power upgrades
– 12MW new total datacenter power
– Thermal energy storage to reduce operating costs
Stampede Datacenter – February 20th
Stampede Datacenter – March 22nd
Stampede Datacenter – May 16th
Stampede Datacenter – June 20th
Stampede Footprint
Stampede InfiniBand (fat-tree): ~75 miles of InfiniBand cables
Ranger: 3,000 ft², 0.6 PF, 3 MW
Stampede: 8,000 ft², 10 PF, 6.5 MW
Capabilities: 20x; Footprint: 2x
Facilities at TACC
Stampede and MIC Overview
• The TACC Stampede System
• Overview of Intel’s MIC ecosystem
– MIC: Many Integrated Cores
– Basic Design
– Roadmap
– Coprocessor vs. Accelerator
• Programming Models
– OpenMP, MPI
– MKL, TBB, and Cilk Plus
• Roadmap & Preparation
An Inflection Point (or Two) in High Performance Computing
• Relatively “boring”, but rapidly improving architectures for the last 15 years
• Performance rising much faster than Moore’s Law…
– But power rising faster
– And concurrency rising faster than that, with serial performance decreasing
• Something had to give…
Consider this Example: Homogeneous vs. Heterogeneous
Acquisition costs and power consumption
• Homogeneous system: K computer in Japan
– 10 PF unaccelerated
– Rumored $1.2 billion acquisition cost
– Large power bill and footprint
• Heterogeneous system: Stampede at TACC
– Near 10 PF (2 PF Sandy Bridge + up to 8 PF MIC)
– Modest footprint: 8,000 ft2, 6.5MW
– $27.5 million
Accelerated Computing for Exascale
• Exascale systems, predicted for 2018, would have required 500MW on the “old” curves.
• Something new was clearly needed.
• The “accelerated computing” movement was reborn (this happens periodically, starting with 387 math coprocessors).
Key aspects of acceleration
• We have lots of transistors… Moore’s Law is holding; this isn’t necessarily the problem.
• We don’t really need lower power per transistor; we need lower power per *operation*.
• How to do this?
– NVIDIA GPUs, AMD Fusion
– FPGAs, ARM cores
Intel’s MIC approach
• Since the days of RISC vs. CISC, Intel has mastered the art of figuring out what is important about a new processing technology, and saying “why can’t we do this in x86?”
• The Intel Many Integrated Core (MIC) architecture is about a large die, simpler circuits, and much more parallelism in the x86 line.
What is MIC?
• MIC is Intel’s solution to try and change the power curves
• Stage 1: Knights Ferry (KNF)
– SDP: Software Development Platform
– Intel has granted early access to KNF to several dozen institutions
– Training has started
– TACC has had access to KNF hardware since early March 2011 and has hosted a system since mid-March 2011
– TACC continues to evaluate the hardware and programming models
• Stage 2: Knights Corner (KNC) -- Now “Xeon Phi”
– Early silicon now at TACC for Stampede development
– First product expected early 2013
MIC Ancestors
• 80-core Terascale research program
• Larrabee many-core visual computing
• Single-chip Cloud Computer (SCC)
What is MIC?
• Basic Design Ideas
– Leverage x86 architecture (CPU with many cores)
– Leverage (Intel’s) existing x86 programming models
– Dedicate much of the silicon to floating point ops., keep some cache(s)
– Keep cache-coherency protocol (was debated at the beginning)
– Increase floating-point throughput per core
– Implement as a separate device
• Pre-product Software Development Platform: Knights Ferry
– x86 cores that are simpler, but allow for more compute throughput
– Strip expensive features (out-of-order execution, branch prediction, etc.)
– Widen SIMD registers for more throughput
– Fast (GDDR5) memory on card
– “GPU” form factor: size, power, PCIe connection
MIC Architecture
• Many cores on the die
• L1 and L2 cache
• Bidirectional ring network
• Memory and PCIe connection
Knights Ferry SDP
• Up to 32 cores
• 1-2 GB of GDDR5 RAM
• 512-bit wide SIMD registers
• L1/L2 caches
• Multiple threads (up to 4) per core
• Slow operation in double precision
Knights Corner (first product)
• 50+ cores
• Increased amount of RAM
• Details are under NDA
• Double precision at half the speed of single precision (canonical ratio)
• 22 nm technology
MIC (KNF) architecture block diagram
MIC vs. Xeon
                    MIC (KNF)   Xeon
Number of cores     30-32       4-8
Clock speed (GHz)   ≥1          ~3
SIMD width (bits)   512         128-256
• Xeon chip designed to work well for all workloads
• MIC designed to work well for number crunching
• A single MIC core is slower than a Xeon core
• The number of cores on the card is much higher
What is Vectorization?
• Instruction set: Streaming SIMD Extensions (SSE)
• Single Instruction Multiple Data (SIMD)
• Multiple data elements are loaded into SSE registers
• One operation produces 2 (SSE), 4 (AVX), or 8 (MIC) results in double precision
• Effective vectorization (by the compiler) crucial for performance
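A minimal, hypothetical sketch (not from the tutorial materials) of the kind of loop the compiler can vectorize: unit stride and no loop-carried dependence, so each iteration maps onto one SIMD lane.

/* daxpy.c -- compile with e.g. "icc -O3 -std=c99 daxpy.c" (flag spellings
 * vary by compiler and version); request a vectorization report to see
 * whether the loop actually vectorized. */
#include <stdio.h>

#define N 1024

void daxpy(int n, double a, const double *restrict x, double *restrict y)
{
    /* Unit-stride, independent iterations: a candidate for auto-vectorization
     * (2 doubles per operation with SSE, 4 with AVX, 8 on MIC). */
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }
    daxpy(N, 3.0, x, y);
    printf("y[0] = %f\n", y[0]);   /* expect 5.0 */
    return 0;
}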
What we at TACC like about MIC
• Intel’s MIC is based on x86 technology
– x86 cores w/ caches and cache coherency
– SIMD instruction set
• Programming for MIC is similar to programming for CPUs
– Familiar languages: C/C++ and Fortran
– Familiar parallel programming models: OpenMP & MPI
– MPI on host and on the coprocessor
– Any code can run on MIC, not just kernels
• Optimizing for MIC is similar to optimizing for CPUs
– “Optimize once, run anywhere”
– Our early MIC porting efforts for codes “in the field” are frequently doubling performance on Sandy Bridge
Coprocessor vs. Accelerator
• GPUs are often called “Accelerators”
• Intel calls MIC a “Coprocessor”
• Similarities
– Large die device, all silicon dedicated to compute
– Fast GDDR5 memory
– Connected through PCIe bus, two physical address spaces
– Number of MIC cores similar to number of GPU streaming multiprocessors
Coprocessor vs. Accelerator
• Differences
– Architecture: x86 cores vs. streaming processors; coherent caches vs. shared memory and caches
– HPC programming model: extensions to C/C++/Fortran (plus OpenCL support) vs. CUDA/OpenCL
– Threading/MPI: OpenMP and multithreading vs. threads in hardware; MPI on host and/or MIC vs. MPI on host only
– Programming details: offloaded regions vs. kernels
– Support for any code (serial, scripting, etc.): yes vs. no
• Native mode: MIC-compiled code may be run as an independent process on the coprocessor.
The MIC Promise
• Competitive Performance: Peak and Sustained
• Familiar Programming Model
– HPC: C/C++, Fortran, and Intel’s TBB
– Parallel Programming: OpenMP, MPI, Pthreads, Cilk Plus, OpenCL
– Intel’s MKL library (later: third party libraries)
– Serial and Scripting, etc. (anything a CPU core can do)
• “Easy” transition for OpenMP code
– Pragmas/directives added to “offload” OMP parallel regions onto MIC
– See examples later
• Support for MPI
– MPI tasks on host, communication through offloading
– MPI tasks on MIC, communication through MPI
Stampede and MIC Overview
• The TACC Stampede System
• Overview of Intel’s MIC ecosystem
– MIC: Many Integrated Cores
– Basic Design
– Roadmap
– Coprocessor vs. Accelerator
• Programming Models
– OpenMP, MPI
– MKL, TBB, and Cilk Plus
• Roadmap & Preparation
Adapting and Optimizing Code for MIC
• Hybrid Programming
– MPI + Threads (OpenMP, etc.); see the sketch after this slide
• Optimize for higher thread performance
– Minimize serial sections and synchronization
• Test different execution models
– Symmetric vs. Offloaded
• Optimize for L1/L2 caches
• Test performance on MIC
• Optimize for specific architecture
• Start production
Timeline: today, on “any” resource ➞ today / KNF / early KNC ➞ Stampede MIC, 2013+
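A minimal hybrid MPI + OpenMP sketch (hypothetical example, not from the tutorial): one MPI rank per node or coprocessor, with OpenMP threads inside each rank. Build with an MPI wrapper and OpenMP enabled, e.g. "mpicc -openmp hybrid.c" (Intel) or "mpicc -fopenmp hybrid.c" (GCC).

/* hybrid.c -- each MPI rank spawns an OpenMP team and reports its size. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        /* Only one thread per rank prints, to keep the output readable. */
        #pragma omp master
        printf("rank %d of %d: running %d OpenMP threads\n",
               rank, size, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}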
Levels of Parallel Execution
Multi-core CPUs, Coprocessors (MIC), and Accelerators (GPUs)
exploit hardware parallelism on 2 levels
1. Processing unit (CPU, MIC, GPU) with multiple functional units
– CPU/MIC: cores
– GPU: streaming multiprocessor / SIMD engine
– Functional units operate independently
2. Each functional unit executes in lock-step (in parallel)
– CPU/MIC: vector processing unit utilizes SIMD lanes
– GPU: warp/wavefront executes on “CUDA” or “Stream” cores
SIMD vs. CUDA/Stream Cores
Each functional unit executes in lock-step (in parallel)
– CPU/MIC: vector processing unit utilizes SIMD lanes
– GPU: warp/wavefront executing on “CUDA” or “Stream” cores
• GPU programming model exposes low-level parallelism
– All CUDA/Stream cores must be utilized for good performance; this is stating the obvious, and it is transparent to the programmer
– Programmer is forced to manually set up blocks and grids of threads
– Steep learning curve at the beginning, but it leads naturally to the (data-parallel) utilization of all cores
• CPU programming model leaves vectorization to the compiler
– Programmers may be unaware of the parallel hardware
– Failing to write vectorizable code leads to a significant performance penalty: 87.5% on MIC (using only 1 of 8 double-precision SIMD lanes) and 50-75% on Xeon (1 of 2 SSE or 1 of 4 AVX lanes)
Early MIC Programming Experiences at TACC
• Codes port easily
– Minutes to days, depending mostly on library dependencies
• Performance requires real work
– While the silicon continues to evolve
– Getting codes to run *at all* is almost too easy; you really need to put in the effort to get what you expect
• Scalability is pretty good
– Multiple threads per core *really important*
– Getting your code to vectorize *really important*
Stampede Programming
Five Possible Models
• Host-only
• Offload
• Symmetric
• Reverse Offload
• MIC Native
Programming Intel® MIC-based Systems: MPI+Offload
• MPI ranks on Intel® Xeon® processors (only)
• All messages into/out of processors
• Offload models used to accelerate MPI ranks
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* within Intel® MIC
• Homogeneous network of hybrid nodes:
[Diagram: homogeneous network of hybrid Xeon+MIC nodes; MPI messages travel between the Xeon hosts over the network, and data moves between each Xeon and its MIC via offload]
Programming Intel® MIC-based Systems: Many-core Hosted
• MPI ranks on Intel® MIC (only)
• All messages into/out of Intel® MIC
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads used directly within MPI processes
• Programmed as a homogeneous network of many-core CPUs:
[Diagram: network of Xeon+MIC nodes in which MPI messages and data flow directly to/from each MIC over the network; the Xeon hosts hold no MPI ranks]
Programming Intel® MIC-based Systems: Symmetric
• MPI ranks on Intel® MIC and Intel® Xeon® processors
• Messages to/from any core
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes
• Programmed as a heterogeneous network of homogeneous nodes:
[Diagram: network of Xeon+MIC nodes in which MPI ranks run on both the Xeons and the MICs; MPI messages and data flow to/from any of them over the network]
MIC Programming with OpenMP
• MIC specific pragma precedes OpenMP pragma
– Fortran: !dir$ omp offload target(mic) <…>
– C: #pragma offload target(mic) <…> (a minimal C sketch follows this slide)
• All data transfer is handled by the compiler (optional keywords provide user control)
• Offload and OpenMP section will go over examples for:
– Automatic data management
– Manual data management
– I/O from within offloaded region
• Data can “stream” through the MIC; no need to leave MIC to fetch new data
• Also very helpful when debugging (print statements)
– Offloading a subroutine and using MKL
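A minimal sketch of the offload-plus-OpenMP pattern described above (hypothetical example; assumes the Intel compiler with MIC offload support, and uses the in/out length clauses as one common way to manage the transfers explicitly):

/* offload.c -- the offload pragma precedes the OpenMP pragma; the compiler
 * copies 'a' to the coprocessor, runs the parallel loop there, and copies
 * 'b' back.  With no coprocessor present, the region falls back to the host. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 4096;
    int i;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));

    for (i = 0; i < n; i++) { a[i] = (double)i; b[i] = 0.0; }

    #pragma offload target(mic) in(a:length(n)) out(b:length(n))
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        b[i] = 2.0 * a[i];

    printf("b[n-1] = %f\n", b[n-1]);   /* expect 2*(n-1) = 8190.0 */
    free(a);
    free(b);
    return 0;
}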
Programming Models
Ready to use on day one!
• TBBs will be available to C++ programmers
• MKL will be available
– Automatic offloading for some MKL features (probably the most transparent mode); a rough sketch follows this slide
• Cilk Plus
– Useful for task-parallel programming (add-on to OpenMP)
– May become available for Fortran users as well
• OpenMP
– TACC expects that OpenMP will be the most interesting programming model for our HPC users
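A rough host-side illustration of the automatic-offload point above (hypothetical sketch; the call is ordinary MKL code, and whether MKL splits the DGEMM between host and coprocessor is governed by MKL’s Automatic Offload mode, e.g. setting MKL_MIC_ENABLE=1, and depends on problem size and MKL version):

/* dgemm_ao.c -- link against MKL, e.g. "icc -mkl dgemm_ao.c".
 * No source changes are needed for Automatic Offload; it is enabled
 * through the environment. */
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    int n = 4096;               /* offload only pays off for large matrices */
    long i;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));

    for (i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* C = 1.0 * A * B + 0.0 * C (row-major) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %f\n", C[0]);   /* expect 2.0 * n = 8192.0 */
    free(A); free(B); free(C);
    return 0;
}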
Stampede and MIC Overview
• The TACC Stampede System
• Overview of Intel’s MIC ecosystem
– MIC: Many Integrated Cores
– Basic Design
– Roadmap
– Coprocessor vs. Accelerator
• Programming Models
– OpenMP, MPI
– MKL, TBB, and Cilk Plus
• Roadmap & Preparation
Roadmap: What comes next? How to get prepared?
• General expectation:
– Many of the upcoming large systems will be accelerated
• Stampede’s base Sandy Bridge cluster is being constructed now
• Intel’s MIC coprocessor is on track for 2013
– Expect that (large) systems with MIC coprocessors will become available in 2013, at/after product launch
• MPI + OpenMP is the main HPC programming model
– If you are not using TBBs
– If you are not spending all your time in libraries (MKL, etc.)
• Early user access sometime after SC12
Roadmap: What comes next? How to get prepared?
• Many HPC applications are pure-MPI codes
• Start thinking about upgrading to a hybrid scheme
• Adding OpenMP is a larger effort than adding MIC directives
• Special MIC/OpenMP considerations
– Many more threads will be needed: 50+ cores (production KNC) ➞ 50+/100+/200+ threads
– Good OpenMP scaling will be (much) more important
– Vectorize, vectorize, vectorize (a sketch combining both levels of parallelism follows this list)
– There is no unified last-level cache (LLC) on MIC; CPUs share an LLC of several MB among their cores
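A hypothetical sketch of the loop structure these considerations point toward: enough outer-loop work to keep 100+ OpenMP threads busy, with a unit-stride inner loop the compiler can vectorize. The thread count comes from the environment, e.g. "export OMP_NUM_THREADS=120".

/* scale.c -- compile with e.g. "icc -O3 -std=c99 -openmp scale.c".
 * Outer loop: thread-level parallelism; inner loop: SIMD-level parallelism. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define ROWS 2048
#define COLS 2048

int main(void)
{
    double *a = malloc((size_t)ROWS * COLS * sizeof(double));
    double *b = malloc((size_t)ROWS * COLS * sizeof(double));

    #pragma omp parallel for              /* one or more rows per thread */
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {  /* unit stride: vectorizes */
            a[(size_t)i * COLS + j] = 1.0;
            b[(size_t)i * COLS + j] = 2.0 * a[(size_t)i * COLS + j];
        }
    }

    printf("max threads: %d, b[0] = %f\n", omp_get_max_threads(), b[0]);
    free(a); free(b);
    return 0;
}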
Stay tuned
The coming years will be exciting …