debugging numerical simulations on accelerated architectures - totalview for openpower, cuda and...

43
TotalView for OpenPOWER, CUDA and OpenMP Chris Gottbrath Dean Stewart May 18, 2015 ScicomP 2015

Upload: rogue-wave-software

Post on 06-Aug-2015

100 views

Category:

Technology


2 download

TRANSCRIPT

TotalView for OpenPOWER, CUDA

and OpenMP

Chris GottbrathDean Stewart

May 18, 2015

ScicomP 2015

Agenda

• Introduction

• TotalView Overview

• OpenMP, OMPT and OMPD

• Current work and future plans

• Questions and wrap-up

© 2015 Rogue Wave Software, Inc. All Rights Reserved

Introduction

About Rogue Wave Software

Rogue Wave proven technical solutions simplify the growing

complexity of building and testing quality software code. Rogue Wave

customers improve software quality and ensure code integrity, while

shortening development cycle times.

Founded: 1989

Headquarters: Boulder, CO

Employees: 250

Offices Worldwide: 9

© 2015 Rogue Wave Software, Inc. All Rights Reserved

Company timeline

1989Cross-platform commercial math & statistics libraries for C++

Technology Timeline

Corporate Timeline

1994First commercial database library for C++

1989 Rogue Wave establishedTools.h++

1996 Rogue Wave publicly listed on NASDAQ

2003 Rogue Wave acquired by Quovadx

2007 Rogue Wave spun out of Quovadx by Battery Ventures

2009 Rogue Wave acquires TotalView and Visual Numerics

2010 Rogue Wave acquires Acumem

2008First graphical reverse debugger for C, C++ and Fortran on Linux

2011First InfinibandCluster-capable reverse debugger

First cache optimization product to market

2010First GPU enabled commercial analytics in FORTRAN

2012 Rogue Wave acquires Visualizations for C++

Audax Group acquires Rogue Wave

2013 Rogue Wave acquires OpenLogic & Klocwork

1989 2015

© 2015 Rogue Wave Software, Inc. All Rights Reserved

Rogue Wave Solution Portfolio

© 2015 Rogue Wave Software, Inc. All Rights Reserved

HPC Trends

• What do we see

– NVIDIA Tesla GP-GPU computational accelerators

– Intel Xeon Phi Coprocessors

– Complex memory hierarchies (numa, device vs host, etc)

– Custom languages such as CUDA and OpenCL

– Directive based programming such as OpenACC and OpenMP

– Core and thread counts going up

• A lot of complexity to deal with if you want performance

– C or Fortran with MPI starts to look “simple”

– Everything is Multiple Languages / Parallel Paradigms

– Up to 4 “kinds” of parallelism (cluster, thread, heterogeneous,

vector)

– Data movement and load balancing

© 2015 Rogue Wave Software, Inc. All Rights Reserved

How does Rogue Wave help?

• Troubleshooting and analysis tool

– Visibility into applications

– Control over applications

• Scalability

• Usability

• Support for HPC platforms and languages

TotalView debugger

© 2015 Rogue Wave Software, Inc. All Rights Reserved

TotalView Overview

Application Analysis and Debugging Tool: Code Confidently

• Debug and Analyse C/C++ and Fortran on Linux™, Unix or

Mac OS X

• Laptops to supercomputers

• Makes developing, maintaining, and supporting critical apps

easier and less risky

Major Features

• Easy to learn graphical user interface with data visualization

• Parallel Debugging

– MPI, Pthreads, OpenMP™, Fortran Coarrays

– CUDA™, OpenACC®, and Intel® Xeon Phi™ coprocessor

• Low tool overhead resource usage

• Includes a Remote Display Client which frees you to work

from anywhere

• Memory Debugging with MemoryScape™

• Deterministic Replay Capability Included on Linux/x86-64

• Non-interactive Batch Debugging with TVScript and the CLI

• TTF & C++View to transform user defined objects

What is TotalView®?

© 2015 Rogue Wave Software, Inc. All Rights Reserved

What Is MemoryScape®?

• Runtime Memory Analysis : Eliminate Memory Errors– Detects memory leaks before they are a problem– Explore heap memory usage with powerful analytical tools– Use for validation as part of a quality software development process

• Major Features– Included in TotalView, or Standalone– Detects

• Malloc API misuse• Memory leaks• Buffer overflows

– Supports• C, C++, Fortran• Linux, Unix, and Mac OS X• Intel® Xeon Phi™• MPI, pthreads, OMP, and remote apps

– Low runtime overhead– Easy to use

• Works with vendor libraries• No recompilation or instrumentation

© 2015 Rogue Wave Software, Inc. All Rights Reserved

Deterministic Replay Debugging

• Reverse Debugging: Radically simplify your debugging

– Captures and Deterministically Replays Execution• Not just “checkpoint and restart”

– Eliminate the Restart Cycle and Hard-to-Reproduce Bugs– Step Back and Forward by Function, Line, or Instruction

• Specifications– A feature included in TotalView on Linux x86 and x86-64

• No recompilation or instrumentation• Explore data and state in the past just like in a

live process, including C++View transformations– Replay on Demand: enable it when you want it– Supports MPI on Ethernet, Infiniband, Cray XE Gemini– Supports Pthreads, and OpenMP– New: Save / Load Replay Information

© 2015 Rogue Wave Software, Inc. All Rights Reserved

Supported Platforms

© 2015 Rogue Wave Software, Inc. All Rights Reserved

Platforms C/C++ Compilers

Fortran Compilers

Linux x86 Gcc, Intel, PGI Absoft, GNU, Intel, PGI

Linux x86-64 Gcc, Intel, PGI Absoft, GNU, Intel, PGI

Power Linux Gcc, XL C++ GNU, XL Fortran

RS6000 Power AIX Gcc, XL C++ XL Fortran

BlueGene Gcc, XL C++ XL Fortran

TotalView for the NVIDIA ® GPU Accelerator

• NVIDIA CUDA 6.0, 6.5 and 7.0• Features and capabilities include

– Support for dynamic parallelism

– Support for MPI based clusters and multi-card configurations

– Flexible Display and Navigation on the CUDA device

• Physical (device, SM, Warp, Lane)

• Logical (Grid, Block) tuples

– CUDA device window reveals what is running where

– Support for types and separate memory address spaces

– Leverages CUDA memcheck

© 2015 Rogue Wave Software, Inc. All Rights Reserved

SC ‘14

• Setting breakpoints in CUDA kernels

– Start debugging (e.g. “Go”)

– Message box when

kernel is loaded:

– Set kernel

breakpoints as in

host code

11.17.2014

SC ‘14

• Debugger thread IDs in Linux CUDA process

– Host thread: positive no.

– CUDA thread: negative no.

• GPU thread navigation

– Logical coordinates: blocks (3 dimensions),

threads (3 dimensions)

– Physical coordinates: device, SM, warp, lane

– Only valid selections are permitted

11.17.2014

SC ‘14

• Single Stepping

– Advances all GPU hardware threads within same warp

– Stepping over a __syncthreads() call advances all threads

within the block

• Advancing more than just one warp

– “Run To” a selected line

number in the source pane

– Set a breakpoint and

“Continue” the process

• Halt

– Stops all the host and

device threads

11.17.2014

t0 t1 t31

t32 t63

warpgroup of 32 threadssame program counter (PC)

SC ‘14

• Displaying CUDA device

properties

– “Tools” - “CUDA Devices”

– Helps mapping between

logical & physical

coordinates

• PCs across SMs, warps, lanes

– valid, active, divergent

11.17.2014

program counter (PC) within warp

SC ‘14

• Displaying GPU data

– “Dive” into variable or

watch “Type” in “Expression

List”

– Device memory spaces: “@”

notation

11.17.2014

Storage Qualifier Meaning of address

@global Offset within global storage

@shared Offset within shared storage

@local Offset within local storage

@register PTX register name

@generic Offset within generic address space (e.g. pointer to global, local or shared memory)

@constant Offset within constant storage

@parameter Offset within parameter storage (TV built-in type)

SC ‘14

• Checking GPU memory

– Enable “CUDA Memory checking” during startup or in the

“Debug” menu

– Detects global memory addressing violations and misaligned

memory accesses

• Further features

– Multi-device support

– Host-pinned memory support

– MPI-CUDA applications

11.17.2014

Note: Recent cuda-memcheck versions are also able to detect race conditions:cuda-memcheck -–tool racecheck <prog>

OpenMP, OMPT and OMPD

The Importance of OpenMP

• Programming models are changing to accommodate changes in system architectures

• Higher degree of on-node parallelism: many-core CPUs and/or GPUs

• Hybrid programming models: MPI+X, where OpenMP is an important X

– MPI across the nodes

– OpenMP shared memory parallelism across the cores in a node

• Why use OpenMP?

– The most widely used standard for SMP systems, implemented by many vendors

– Supports the Fortran, C, and C++ languages

– Relatively small and simple specification, and supports incremental parallelism

– OpenMP research keeps it up to date with the latest hardware developments

– OpenMP 4 allows targeting GPUs

• We see momentum building around OpenMP

© 2015 Rogue Wave Software, Inc. All Rights Reserved

OpenMP Debugging Challenges

• Programmers will attempt to exploit MPI+OpenMP hybrid

parallelism

• Porting existing large applications from MPI to a hybrid model is

nontrivial and arduous, and having GPUs in the mix makes it even

more challenging

• Programming errors such as memory corruption, logic errors and

concurrency bugs are inevitable

• Bottom line

– MPI+OpenMP+GPUs will present programmers with

unprecedented debugging challenges

– They need good debugging tools for MPI+OpenMP+GPUs

© 2015 Rogue Wave Software, Inc. All Rights Reserved

The following are some features that TotalView supports:

• Source-level debugging of the original OpenMP code.

• The ability to plant breakpoints throughout the OpenMP code,

including lines that are executed in parallel.

• Visibility of OpenMP worker threads.

• Access to SHARED and PRIVATE variables in OpenMP PARALLEL

code.

• Access to OMP THREADPRIVATE data in code compiled by

supported compilers.

Debugging OpenMP Applications

© 2015 Rogue Wave Software, Inc. All Rights Reserved

Sample OpenMP Debugging Session

OpenMP worker threads

Local variables

© 2015 Rogue Wave Software, Inc. All Rights Reserved

OpenMP code high and low level

• Intention is expressed in the OpenMP code

– Serial-correct code with OMP directives expressing parallelism

– Higher level expression of the ideas

• Compiler can create either serial or parallel executable programs

from this source

• A parallel executable includes both the program logic and a runtime

– Teams of threads on the device, the host or both

– Outlined routines

– Runtime calls to dispatch work to worker threads

– Work created on thread A may be executed on threads M-N

© 2015 Rogue Wave Software, Inc. All Rights Reserved

What’s Needed?

• Debugging and performance analysis support from OpenMP

implementations

• “OMPT and OMPD: OpenMP Tools Application Programming Interfaces for

Performance Analysis and Debugging” technical report (TR)

– First TR combined OMPT and OMPD in one document

– OMPD was redacted to allow OMPT to progress

• “TR2: OMPT: An OpenMP Tools Application Programming Interface for

Performance Analysis”

– Accepted by the OpenMP ARB (March 2014)

– OMPT is an API for first-party performance tools

• It is now time to circle back and finish OMPD!

– OMPD is similar to OMPT in its functionality, but…

– OMPD is an API for third-party debugging tools

© 2015 Rogue Wave Software, Inc. All Rights Reserved

What Does OMPT/OMPD Do?

• OMPT

– Enable performance tools to gather execution program/runtime costs

– Allow construction of low-overhead performance tools

– Allow logical stack unwinding (to handle outlined parallel regions)

– Provide the state of a thread at any point in time (e.g., idle, work, wait)

– Asynchronous signal safe

• OMPD

– Enable debugging tools to inspect the state of a live process or core file

– Third-party versions of the OpenMP runtime inquiry functions

– Third-party versions of the OMPT inquiry functions

– Intercept the beginning/end of parallel/task regions (e.g., stepping

in/out)

– Enable the debugger to construct a “global view” of the process

© 2015 Rogue Wave Software, Inc. All Rights Reserved

How is OMPD Structured?

• Based on a commonly used idiom

– pthread thread_db, MPI MQD, MPI Handles, and others

– The OMPD DLL is “paired” with the OpenMP runtime library

• The debugger

– Attaches to the target OpenMP application

– Loads the OMPD DLL (e.g., via dlopen())

– Registers callbacks in the OMPD DLL

– Makes “requests” into the OMPD DLL to query runtime state

• The OMPD DLL

– Makes callbacks into the debugger (lookup symbols, read/write

memory, etc.)

– Returns the result to the debugger

© 2015 Rogue Wave Software, Inc. All Rights Reserved

OMPD DLL

OMPD DLL loaded intodebugger and callbacks registered

OMPD “In Action”

Application Process

OpenMPRuntime Library

(RTL)

Application address space

AttachDebugger

Debugger address space

Request OpenMP

state

1

• Handles for threads, parallel regions, tasks• Parent / child relationships• State of handles (wait, work, idle)

1

Request types

Request symbols and

address information

2

• Lookup symbols in the target process• Read/write target address spaces• Support for GPUs

2

Callback functions

Callback ops

3Result

© 2015 Rogue Wave Software, Inc. All Rights Reserved

OMPD Status

• Collaboration between LLNL and Rogue Wave Software (TotalView)

– LLNL: Dong Ahn, Ignacio Laguna, Joachim Protze, Martin

Schulz

– RWS: Ariel Burton, John DelSignore

• Resurrect OMPD with the ultimate goal of having it accepted by

the ARB

– Fix the current OMPD spec

– Implement OMPD DLL in the Intel OpenMP runtime

– Implement OMPD-based features in TotalView

• IWOMP Paper (International Workshop on OpenMP)

© 2015 Rogue Wave Software, Inc. All Rights Reserved

Current Work and

Future Plans

TotalView 8.15

New Features

• Scalable Infrastructure

• Faster start up on Linux

• Scales to O(100,000) processes

& O(1,000,000) threads

• Updated CUDA support

• CUDA 7.0

• Support updates including:

• Clang 3.5

• Intel 15.0

• MPT 2.12

• SLES 12, Fedora 21

TV Client

MRNet CP MRNet CP

TV Server TV Server TV Server TV Server

© 2015 Rogue Wave Software, Inc. All Rights Reserved

TotalView’s Scalability StrategyM

ultic

ast R

eductionTotalView uses an “MRNet tree” of servers

TV Client

MRNet CP MRNet CP

TV Server TV Server TV Server TV Server

Remain lightweight in the backend!

Sm

artsS

mart

s

Push debugger “smarts” to the

backend, not the whole debugger!

Use classic optimization

techniques too: caching, hoisting invariants, etc.

© 2015 Rogue Wave Software, Inc. All Rights Reserved

Linux Start up Performance in TV 8.15.4

5x faster (600s / 120s) at 16k between 8.14.1 and 8.15.4.Note that we switched to mrnet by default in 8.15.0

© 2015 Rogue Wave Software, Inc. All Rights Reserved

BG Start up Performance in TV 8.15.4

6.4x faster (180s / 28s) at 16k between 8.14.1 and 8.15.4.Note that we switched to mrnet by default in 8.15.0

© 2015 Rogue Wave Software, Inc. All Rights Reserved

TotalView debugs 786,432 cores.Climb with Rogue Wave towards exacale.

Some more details on the 786,432 core test

• The test was performed on 48 racks of Sequoia

• The test code

– Implements a Jacobi Linear Equation Solver

– The test code is a hybrid MPI + OpenMP code

– 16 threads per process, one process per node

• The test operations

– Start up

– Setting breakpoints / removing breakpoints

– Single stepping all threads

• Tests performed at a variety of scales to understand

scalability

© 2015 Rogue Wave Software, Inc. All Rights Reserved

40

TotalView’s Memory Efficiency

• TotalView is lightweight in the back-end (server)

• Servers don’t “steal” memory from the application

• Each server is a multi-process debugger agent

– One server can debug thousands of processes

– Not a conglomeration of single process debuggers

– TotalView’s architecture provides flexibility (e.g., P/SVR)

– No artificial limits to accommodate the debugger (e.g., BG/Q 1 P/CN)

• Symbols are read, stored, and shared in the front-end (client)

• Example: LLNL APP ADB, 920 shlibs, Linux, 64 P, 4 CN, 16 P/CN, 1 SVR/CN

Process VSZ (largest, MB)

RSS (largest, MB)

Where

TV Client 4,469 3,998 Front End ONLY

MRNet CP 497 4 Compute Nodes

TV Server 304 53 Compute Nodes

© 2015 Rogue Wave Software, Inc. All Rights Reserved

Future plans

• Contact [email protected] with any inquires about our future

plans with regard to TotalView product.

Questions

Thanks!

• Visit the website

– http://www.roguewave.com/products/totalview.aspx

– Documentation

– Sign up for an evaluation

– Contact customer support & post on the user forum

© 2015 Rogue Wave Software, Inc. All Rights Reserved