TRANSCRIPT
XSEDE 12
July 16, 2012
Preparing for Stampede: Programming
Heterogeneous Many-Core Supercomputers
Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld
dan/lars/bbarth/[email protected]
Tutorial
Agenda
08:00 – 08:45 Stampede and MIC Overview
08:45 – 10:00 Vectorization
10:00 – 10:15 Break
10:15 – 11:15 OpenMP and Offloading
11:15 – 12:00 Lab (on a Sandy Bridge machine)
Stampede and MIC Overview
• The TACC Stampede System
• Overview of Intel’s MIC ecosystem
– MIC: Many Integrated Cores
– Basic Design
– Roadmap
– Coprocessor vs. Accelerator
• Programming Models
– OpenMP, MPI
– MKL, TBB, and Cilk Plus
• Roadmap & Preparation
MIC is pronounced “Mike”
Stampede
• A new NSF-funded XSEDE resource at TACC
• Funded two components of a new system:
– Two-petaflop, 100,000+ core Xeon-based Dell cluster to be the new workhorse of the NSF open science community (>3x Ranger)
– Eight additional petaflops of Intel Many Integrated Core (MIC) processors to provide a revolutionary new capability: a new generation of supercomputing
– Change the power/performance curves of supercomputing without changing the programming model (terribly much)
Stampede - Headlines
• Up to 10PF peak performance in initial system (early 2013)
– 100,000+ conventional Intel processor cores
– Almost 500,000 total cores with Intel MIC
• Upgrade with the future “Knights” generation in 2015
• 14PB+ disk
• 272TB+ RAM
• 56Gb FDR InfiniBand interconnect
• Nearly 200 racks of compute hardware (10,000 sq ft)
• >SIX MEGAWATTS total power
The Stampede Project
• A Complete Advanced Computing Ecosystem:
– HPC Compute, storage, interconnect
– Visualization subsystem (144 high-end GPUs)
– Large memory support (16 x 1TB nodes)
– Integration with archive and (coming soon) the TACC global work filesystem and other data systems
– People, training, and documentation to support computational science
• Hosted in an expanded building at TACC; massive power upgrades
– 12MW new total datacenter power
– Thermal energy storage to reduce operating costs
Stampede Datacenter – February 20th
Stampede Datacenter – March 22nd
Stampede Datacenter – May 16th
Stampede Datacenter – June 20th
Stampede Footprint
Stampede InfiniBand (fat-tree): ~75 miles of InfiniBand cables
Ranger: 3,000 ft², 0.6 PF, 3 MW
Stampede: 8,000 ft², 10 PF, 6.5 MW
Capabilities: 20x; Footprint: 2x
Facilities at TACC
Stampede and MIC Overview
• The TACC Stampede System
• Overview of Intel’s MIC ecosystem
– MIC: Many Integrated Cores
– Basic Design
– Roadmap
– Coprocessor vs. Accelerator
• Programming Models
– OpenMP, MPI
– MKL, TBB, and Cilk Plus
• Roadmap & Preparation
An Inflection Point (or Two) in High Performance Computing
• Relatively “boring”, but rapidly improving architectures for the last 15 years
• Performance rising much faster than Moore’s Law…
– But power rising faster
– And concurrency rising faster than that, with serial performance decreasing
• Something had to give…
Consider this Example: Homogeneous vs. Heterogeneous
Acquisition costs and power consumption
• Homogeneous system: K computer in Japan
– 10 PF unaccelerated
– Rumored $1.2 billion acquisition cost
– Large power bill and footprint
• Heterogeneous system: Stampede at TACC
– Near 10 PF (2 PF Sandy Bridge + up to 8 PF MIC)
– Modest footprint: 8,000 ft2, 6.5MW
– $27.5 million
Accelerated Computing for Exascale
• Exascale systems, predicted for 2018, would have required 500MW on the “old” curves.
• Something new was clearly needed.
• The “accelerated computing” movement was reborn (this happens periodically, starting with 387 math coprocessors).
Key aspects of acceleration
• We have lots of transistors… Moore’s Law is holding; this isn’t necessarily the problem.
• We don’t really need lower power per transistor; we need lower power per *operation*.
• How to do this?
– NVIDIA GPUs, AMD Fusion
– FPGAs, ARM cores
Intel’s MIC approach
• Since the days of RISC vs. CISC, Intel has mastered the art of figuring out what is important about a new processing technology, and saying “why can’t we do this in x86?”
• The Intel Many Integrated Core (MIC) architecture is about a large die, simpler circuits, and much more parallelism in the x86 line.
What is MIC?
• MIC is Intel’s solution to try and change the power curves
• Stage 1: Knights Ferry (KNF)
– SDP: Software Development Platform
– Intel has granted early access to KNF to several dozen institutions
– Training has started
– TACC has had access to KNF hardware since early March 2011 and has hosted a system since mid-March 2011
– TACC continues to evaluate the hardware and programming models
• Stage 2: Knights Corner (KNC) -- Now “Xeon Phi”
– Early silicon now at TACC for Stampede development
– First product expected early 2013
MIC Ancestors
• 80-core Terascale research program
• Larrabee many-core visual computing
• Single-chip Cloud Computer (SCC)
What is MIC?
• Basic Design Ideas
– Leverage x86 architecture (CPU with many cores)
– Leverage (Intel’s) existing x86 programming models
– Dedicate much of the silicon to floating point ops., keep some cache(s)
– Keep cache-coherency protocol (was debated at the beginning)
– Increase floating-point throughput per core
– Implement as a separate device
• Pre-product Software Development Platform: Knights Ferry
– x86 cores that are simpler, but allow for more compute throughput
– Strip expensive features (out-of-order execution, branch prediction, etc.)
– Widen SIMD registers for more throughput
– Fast (GDDR5) memory on card
– “GPU” form factor: size, power, PCIe connection
MIC Architecture
• Many cores on the die
• L1 and L2 cache
• Bidirectional ring network
• Memory and PCIe connection
Knights Ferry SDP
• Up to 32 cores
• 1-2 GB of GDDR5 RAM
• 512-bit wide SIMD registers
• L1/L2 caches
• Multiple threads (up to 4) per core
• Slow operation in double precision
Knights Corner (first product)
• 50+ cores
• Increased amount of RAM
• Details are under NDA
• Double precision at half the speed of single precision (canonical ratio)
• 22 nm technology
MIC (KNF) architecture block diagram
MIC vs. Xeon
                    MIC (KNF)   Xeon
Number of cores     30-32       4-8
Clock speed (GHz)   ≥1          ~3
SIMD width (bits)   512         128-256
• Xeon chip designed to work well for all workloads
• MIC designed to work well for number crunching
• A single MIC core is slower than a Xeon core
• The number of cores on the card is much higher
What is Vectorization?
• Instruction set: Streaming SIMD Extensions (SSE)
• Single Instruction Multiple Data (SIMD)
• Multiple data elements are loaded into SSE registers
• One operation produces 2 (SSE), 4 (AVX), or 8 (MIC) results in double precision
• Effective vectorization (by the compiler) crucial for performance
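A minimal, hypothetical sketch (not from the tutorial materials) of the kind of loop the compiler can vectorize: unit stride and no loop-carried dependence, so each iteration maps onto one SIMD lane.

/* daxpy.c -- compile with e.g. "icc -O3 -std=c99 daxpy.c" (flag spellings
 * vary by compiler and version); request a vectorization report to see
 * whether the loop actually vectorized. */
#include <stdio.h>

#define N 1024

void daxpy(int n, double a, const double *restrict x, double *restrict y)
{
    /* Unit-stride, independent iterations: a candidate for auto-vectorization
     * (2 doubles per operation with SSE, 4 with AVX, 8 on MIC). */
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }
    daxpy(N, 3.0, x, y);
    printf("y[0] = %f\n", y[0]);   /* expect 5.0 */
    return 0;
}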
What we at TACC like about MIC
• Intel’s MIC is based on x86 technology
– x86 cores w/ caches and cache coherency
– SIMD instruction set
• Programming for MIC is similar to programming for CPUs
– Familiar languages: C/C++ and Fortran
– Familiar parallel programming models: OpenMP & MPI
– MPI on host and on the coprocessor
– Any code can run on MIC, not just kernels
• Optimizing for MIC is similar to optimizing for CPUs
– “Optimize once, run anywhere”
– Our early MIC porting efforts for codes “in the field” are frequently doubling performance on Sandy Bridge
Coprocessor vs. Accelerator
• GPUs are often called “Accelerators”
• Intel calls MIC a “Coprocessor”
• Similarities
– Large die device, all silicon dedicated to compute
– Fast GDDR5 memory
– Connected through PCIe bus, two physical address spaces
– Number of MIC cores similar to number of GPU streaming multiprocessors
Coprocessor vs. Accelerator
• Differences
– Architecture: x86 cores vs. streaming processors; coherent caches vs. shared memory and caches
– HPC programming model: extensions to C/C++/Fortran (plus OpenCL support) vs. CUDA/OpenCL
– Threading/MPI: OpenMP and multithreading vs. threads in hardware; MPI on host and/or MIC vs. MPI on host only
– Programming details: offloaded regions vs. kernels
– Support for any code (serial, scripting, etc.): yes vs. no
• Native mode: MIC-compiled code may be run as an independent process on the coprocessor.
The MIC Promise
• Competitive Performance: Peak and Sustained
• Familiar Programming Model
– HPC: C/C++, Fortran, and Intel’s TBB
– Parallel Programming: OpenMP, MPI, Pthreads, Cilk Plus, OpenCL
– Intel’s MKL library (later: third party libraries)
– Serial and Scripting, etc. (anything a CPU core can do)
• “Easy” transition for OpenMP code
– Pragmas/directives added to “offload” OMP parallel regions onto MIC
– See examples later
• Support for MPI
– MPI tasks on host, communication through offloading
– MPI tasks on MIC, communication through MPI
Stampede and MIC Overview
• The TACC Stampede System
• Overview of Intel’s MIC ecosystem
– MIC: Many Integrated Cores
– Basic Design
– Roadmap
– Coprocessor vs. Accelerator
• Programming Models
– OpenMP, MPI
– MKL, TBB, and Cilk Plus
• Roadmap & Preparation
Adapting and Optimizing Code for MIC
• Hybrid Programming
– MPI + Threads (OpenMP, etc.); see the sketch after this slide
• Optimize for higher thread performance
– Minimize serial sections and synchronization
• Test different execution models
– Symmetric vs. Offloaded
• Optimize for L1/L2 caches
• Test performance on MIC
• Optimize for specific architecture
• Start production
Timeline: today, on “any” resource ➞ today / KNF / early KNC ➞ Stampede MIC, 2013+
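A minimal hybrid MPI + OpenMP sketch (hypothetical example, not from the tutorial): one MPI rank per node or coprocessor, with OpenMP threads inside each rank. Build with an MPI wrapper and OpenMP enabled, e.g. "mpicc -openmp hybrid.c" (Intel) or "mpicc -fopenmp hybrid.c" (GCC).

/* hybrid.c -- each MPI rank spawns an OpenMP team and reports its size. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        /* Only one thread per rank prints, to keep the output readable. */
        #pragma omp master
        printf("rank %d of %d: running %d OpenMP threads\n",
               rank, size, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}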
Levels of Parallel Execution
Multi-core CPUs, Coprocessors (MIC), and Accelerators (GPUs)
exploit hardware parallelism on 2 levels
1. Processing unit (CPU, MIC, GPU) with multiple functional units
– CPU/MIC: cores
– GPU: streaming multiprocessor / SIMD engine
– Functional units operate independently
2. Each functional unit executes in lock-step (in parallel)
– CPU/MIC: vector processing unit utilizes SIMD lanes
– GPU: warp/wavefront executes on “CUDA” or “Stream” cores
SIMD vs. CUDA/Stream Cores
Each functional unit executes in lock-step (in parallel)
– CPU/MIC: vector processing unit utilizes SIMD lanes
– GPU: warp/wavefront executing on “CUDA” or “Stream” cores
• GPU programming model exposes low-level parallelism
– All CUDA/Stream cores must be utilized for good performance; this is stating the obvious, and it is transparent to the programmer
– Programmer is forced to manually set up blocks and grids of threads
– Steep learning curve at the beginning, but it leads naturally to the (data-parallel) utilization of all cores
• CPU programming model leaves vectorization to the compiler
– Programmers may be unaware of the parallel hardware
– Failing to write vectorizable code leads to a significant performance penalty: 87.5% on MIC (using only 1 of 8 double-precision SIMD lanes) and 50-75% on Xeon (1 of 2 SSE or 1 of 4 AVX lanes)
Early MIC Programming Experiences at TACC
• Codes port easily
– Minutes to days, depending mostly on library dependencies
• Performance requires real work
– While the silicon continues to evolve
– Getting codes to run *at all* is almost too easy; you really need to put in the effort to get what you expect
• Scalability is pretty good
– Multiple threads per core *really important*
– Getting your code to vectorize *really important*
Stampede Programming
Five Possible Models
• Host-only
• Offload
• Symmetric
• Reverse Offload
• MIC Native
Programming Intel® MIC-based Systems: MPI+Offload
• MPI ranks on Intel® Xeon® processors (only)
• All messages into/out of processors
• Offload models used to accelerate MPI ranks
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* within Intel® MIC
• Homogeneous network of hybrid nodes:
[Diagram: homogeneous network of hybrid Xeon+MIC nodes; MPI messages travel between the Xeon hosts over the network, and data moves between each Xeon and its MIC via offload]
Programming Intel® MIC-based Systems: Many-core Hosted
• MPI ranks on Intel® MIC (only)
• All messages into/out of Intel® MIC
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads used directly within MPI processes
• Programmed as a homogeneous network of many-core CPUs:
[Diagram: network of Xeon+MIC nodes in which MPI messages and data flow directly to/from each MIC over the network; the Xeon hosts hold no MPI ranks]
Programming Intel® MIC-based Systems: Symmetric
• MPI ranks on Intel® MIC and Intel® Xeon® processors
• Messages to/from any core
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes
• Programmed as a heterogeneous network of homogeneous nodes:
[Diagram: network of Xeon+MIC nodes in which MPI ranks run on both the Xeons and the MICs; MPI messages and data flow to/from any of them over the network]
MIC Programming with OpenMP
• MIC specific pragma precedes OpenMP pragma
– Fortran: !dir$ omp offload target(mic) <…>
– C: #pragma offload target(mic) <…> (a minimal C sketch follows this slide)
• All data transfer is handled by the compiler (optional keywords provide user control)
• Offload and OpenMP section will go over examples for:
– Automatic data management
– Manual data management
– I/O from within offloaded region
• Data can “stream” through the MIC; no need to leave MIC to fetch new data
• Also very helpful when debugging (print statements)
– Offloading a subroutine and using MKL
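A minimal sketch of the offload-plus-OpenMP pattern described above (hypothetical example; assumes the Intel compiler with MIC offload support, and uses the in/out length clauses as one common way to manage the transfers explicitly):

/* offload.c -- the offload pragma precedes the OpenMP pragma; the compiler
 * copies 'a' to the coprocessor, runs the parallel loop there, and copies
 * 'b' back.  With no coprocessor present, the region falls back to the host. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 4096;
    int i;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));

    for (i = 0; i < n; i++) { a[i] = (double)i; b[i] = 0.0; }

    #pragma offload target(mic) in(a:length(n)) out(b:length(n))
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        b[i] = 2.0 * a[i];

    printf("b[n-1] = %f\n", b[n-1]);   /* expect 2*(n-1) = 8190.0 */
    free(a);
    free(b);
    return 0;
}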
Programming Models
Ready to use on day one!
• TBBs will be available to C++ programmers
• MKL will be available
– Automatic offloading for some MKL features (probably the most transparent mode); a rough sketch follows this slide
• Cilk Plus
– Useful for task-parallel programming (add-on to OpenMP)
– May become available for Fortran users as well
• OpenMP
– TACC expects that OpenMP will be the most interesting programming model for our HPC users
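A rough host-side illustration of the automatic-offload point above (hypothetical sketch; the call is ordinary MKL code, and whether MKL splits the DGEMM between host and coprocessor is governed by MKL’s Automatic Offload mode, e.g. setting MKL_MIC_ENABLE=1, and depends on problem size and MKL version):

/* dgemm_ao.c -- link against MKL, e.g. "icc -mkl dgemm_ao.c".
 * No source changes are needed for Automatic Offload; it is enabled
 * through the environment. */
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    int n = 4096;               /* offload only pays off for large matrices */
    long i;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));

    for (i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* C = 1.0 * A * B + 0.0 * C (row-major) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %f\n", C[0]);   /* expect 2.0 * n = 8192.0 */
    free(A); free(B); free(C);
    return 0;
}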
Stampede and MIC Overview
• The TACC Stampede System
• Overview of Intel’s MIC ecosystem
– MIC: Many Integrated Cores
– Basic Design
– Roadmap
– Coprocessor vs. Accelerator
• Programming Models
– OpenMP, MPI
– MKL, TBB, and Cilk Plus
• Roadmap & Preparation
Roadmap: What comes next? How to get prepared?
• General expectation:
– Many of the upcoming large systems will be accelerated
• Stampede’s base Sandy Bridge cluster is being constructed now
• Intel’s MIC coprocessor is on track for 2013
– Expect that (large) systems with MIC coprocessors will become available in 2013, at/after product launch
• MPI + OpenMP is the main HPC programming model
– If you are not using TBBs
– If you are not spending all your time in libraries (MKL, etc.)
• Early user access sometime after SC12
Roadmap: What comes next? How to get prepared?
• Many HPC applications are pure-MPI codes
• Start thinking about upgrading to a hybrid scheme
• Adding OpenMP is a larger effort than adding MIC directives
• Special MIC/OpenMP considerations
– Many more threads will be needed: 50+ cores (production KNC) ➞ 50+/100+/200+ threads
– Good OpenMP scaling will be (much) more important
– Vectorize, vectorize, vectorize (a sketch combining both levels of parallelism follows this list)
– There is no unified last-level cache (LLC) on MIC; CPUs share an LLC of several MB among their cores
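A hypothetical sketch of the loop structure these considerations point toward: enough outer-loop work to keep 100+ OpenMP threads busy, with a unit-stride inner loop the compiler can vectorize. The thread count comes from the environment, e.g. "export OMP_NUM_THREADS=120".

/* scale.c -- compile with e.g. "icc -O3 -std=c99 -openmp scale.c".
 * Outer loop: thread-level parallelism; inner loop: SIMD-level parallelism. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define ROWS 2048
#define COLS 2048

int main(void)
{
    double *a = malloc((size_t)ROWS * COLS * sizeof(double));
    double *b = malloc((size_t)ROWS * COLS * sizeof(double));

    #pragma omp parallel for              /* one or more rows per thread */
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {  /* unit stride: vectorizes */
            a[(size_t)i * COLS + j] = 1.0;
            b[(size_t)i * COLS + j] = 2.0 * a[(size_t)i * COLS + j];
        }
    }

    printf("max threads: %d, b[0] = %f\n", omp_get_max_threads(), b[0]);
    free(a); free(b);
    return 0;
}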
Stay tuned
The coming years will be exciting …