Debugging Numerical Simulations on Accelerated Architectures: TotalView for OpenPOWER, CUDA and...
TRANSCRIPT
Agenda
• Introduction
• TotalView Overview
• OpenMP, OMPT and OMPD
• Current work and future plans
• Questions and wrap-up
© 2015 Rogue Wave Software, Inc. All Rights Reserved
About Rogue Wave Software
Rogue Wave proven technical solutions simplify the growing
complexity of building and testing quality software code. Rogue Wave
customers improve software quality and ensure code integrity, while
shortening development cycle times.
Founded: 1989
Headquarters: Boulder, CO
Employees: 250
Offices Worldwide: 9
Company timeline
Corporate timeline
• 1989 Rogue Wave established; Tools.h++
• 1996 Rogue Wave publicly listed on NASDAQ
• 2003 Rogue Wave acquired by Quovadx
• 2007 Rogue Wave spun out of Quovadx by Battery Ventures
• 2009 Rogue Wave acquires TotalView and Visual Numerics
• 2010 Rogue Wave acquires Acumem
• 2012 Rogue Wave acquires Visualizations for C++; Audax Group acquires Rogue Wave
• 2013 Rogue Wave acquires OpenLogic & Klocwork
Technology timeline
• 1989 Cross-platform commercial math & statistics libraries for C++
• 1994 First commercial database library for C++
• 2008 First graphical reverse debugger for C, C++ and Fortran on Linux
• 2010 First GPU-enabled commercial analytics in FORTRAN
• 2011 First Infiniband cluster-capable reverse debugger; first cache optimization product to market
HPC Trends
• What do we see?
– NVIDIA Tesla GP-GPU computational accelerators
– Intel Xeon Phi Coprocessors
– Complex memory hierarchies (NUMA, device vs. host, etc.)
– Custom languages such as CUDA and OpenCL
– Directive based programming such as OpenACC and OpenMP
– Core and thread counts going up
• A lot of complexity to deal with if you want performance
– C or Fortran with MPI starts to look “simple”
– Everything is Multiple Languages / Parallel Paradigms
– Up to 4 “kinds” of parallelism (cluster, thread, heterogeneous,
vector)
– Data movement and load balancing
How does Rogue Wave help?
• Troubleshooting and analysis tool
– Visibility into applications
– Control over applications
• Scalability
• Usability
• Support for HPC platforms and languages
TotalView debugger
Application Analysis and Debugging Tool: Code Confidently
• Debug and Analyse C/C++ and Fortran on Linux™, Unix or
Mac OS X
• Laptops to supercomputers
• Makes developing, maintaining, and supporting critical apps
easier and less risky
Major Features
• Easy to learn graphical user interface with data visualization
• Parallel Debugging
– MPI, Pthreads, OpenMP™, Fortran Coarrays
– CUDA™, OpenACC®, and Intel® Xeon Phi™ coprocessor
• Low tool overhead and resource usage
• Includes a Remote Display Client which frees you to work
from anywhere
• Memory Debugging with MemoryScape™
• Deterministic Replay Capability Included on Linux/x86-64
• Non-interactive Batch Debugging with TVScript and the CLI
• TTF & C++View to transform user defined objects
What is TotalView®?
What Is MemoryScape®?
• Runtime Memory Analysis: Eliminate memory errors
– Detects memory leaks before they are a problem
– Explore heap memory usage with powerful analytical tools
– Use for validation as part of a quality software development process
• Major Features
– Included in TotalView, or standalone
– Detects
• Malloc API misuse
• Memory leaks
• Buffer overflows
– Supports
• C, C++, Fortran
• Linux, Unix, and Mac OS X
• Intel® Xeon Phi™
• MPI, pthreads, OpenMP, and remote apps
– Low runtime overhead
– Easy to use
• Works with vendor libraries
• No recompilation or instrumentation
Deterministic Replay Debugging
• Reverse Debugging: Radically simplify your debugging
– Captures and deterministically replays execution
• Not just “checkpoint and restart”
– Eliminate the restart cycle and hard-to-reproduce bugs
– Step back and forward by function, line, or instruction
• Specifications
– A feature included in TotalView on Linux x86 and x86-64
• No recompilation or instrumentation
• Explore data and state in the past just like in a live process, including C++View transformations
– Replay on Demand: enable it when you want it
– Supports MPI on Ethernet, Infiniband, Cray XE Gemini
– Supports Pthreads and OpenMP
– New: Save / Load replay information
Supported Platforms
Platform          C/C++ Compilers    Fortran Compilers
Linux x86         GCC, Intel, PGI    Absoft, GNU, Intel, PGI
Linux x86-64      GCC, Intel, PGI    Absoft, GNU, Intel, PGI
Power Linux       GCC, XL C++        GNU, XL Fortran
RS6000 Power AIX  GCC, XL C++        XL Fortran
BlueGene          GCC, XL C++        XL Fortran
TotalView for the NVIDIA ® GPU Accelerator
• NVIDIA CUDA 6.0, 6.5 and 7.0
• Features and capabilities include
– Support for dynamic parallelism
– Support for MPI based clusters and multi-card configurations
– Flexible Display and Navigation on the CUDA device
• Physical (device, SM, Warp, Lane)
• Logical (Grid, Block) tuples
– CUDA device window reveals what is running where
– Support for types and separate memory address spaces
– Leverages CUDA memcheck
SC ‘14
• The following 6 slides are from an SC14 tutorial by:
• Damian Alvarez
• Dr. Mike Ashworth
• Vincent Betro, Ph. D.
• Chris Gottbrath
• Nikolay Piskun, Ph.D.
• Sandra Wienke
11.17.2014
• Setting breakpoints in CUDA kernels
– Start debugging (e.g. “Go”)
– A message box appears when the kernel is loaded
– Set kernel breakpoints as in host code
• Debugger thread IDs in a Linux CUDA process
– Host thread: positive number
– CUDA thread: negative number
• GPU thread navigation
– Logical coordinates: blocks (3 dimensions), threads (3 dimensions)
– Physical coordinates: device, SM, warp, lane
– Only valid selections are permitted
• Single stepping
– Advances all GPU hardware threads within the same warp
– Stepping over a __syncthreads() call advances all threads within the block
• Advancing more than just one warp
– “Run To” a selected line number in the source pane
– Set a breakpoint and “Continue” the process
• Halt
– Stops all the host and device threads
[Diagram: warps are groups of 32 threads (t0–t31, t32–t63, …) that share the same program counter (PC)]
• Displaying CUDA device properties
– “Tools” > “CUDA Devices”
– Helps mapping between logical & physical coordinates
• PCs across SMs, warps, lanes
– valid, active, divergent
[Diagram: program counter (PC) positions within a warp]
• Displaying GPU data
– “Dive” into a variable, or watch “Type” in the “Expression List”
– Device memory spaces use the “@” notation
Storage qualifier  Meaning of address
@global            Offset within global storage
@shared            Offset within shared storage
@local             Offset within local storage
@register          PTX register name
@generic           Offset within generic address space (e.g. pointer to global, local or shared memory)
@constant          Offset within constant storage
@parameter         Offset within parameter storage (TV built-in type)
• Checking GPU memory
– Enable “CUDA Memory checking” during startup or in the
“Debug” menu
– Detects global memory addressing violations and misaligned
memory accesses
• Further features
– Multi-device support
– Host-pinned memory support
– MPI-CUDA applications
Note: Recent cuda-memcheck versions are also able to detect race conditions: cuda-memcheck --tool racecheck <prog>
The Importance of OpenMP
• Programming models are changing to accommodate changes in system architectures
• Higher degree of on-node parallelism: many-core CPUs and/or GPUs
• Hybrid programming models: MPI+X, where OpenMP is an important X
– MPI across the nodes
– OpenMP shared memory parallelism across the cores in a node
• Why use OpenMP?
– The most widely used standard for SMP systems, implemented by many vendors
– Supports the Fortran, C, and C++ languages
– Relatively small and simple specification, and supports incremental parallelism
– OpenMP research keeps it up to date with the latest hardware developments
– OpenMP 4 allows targeting GPUs
• We see momentum building around OpenMP
OpenMP Debugging Challenges
• Programmers will attempt to exploit MPI+OpenMP hybrid
parallelism
• Porting existing large applications from MPI to a hybrid model is
nontrivial and arduous, and having GPUs in the mix makes it even
more challenging
• Programming errors such as memory corruption, logic errors and
concurrency bugs are inevitable
• Bottom line
– MPI+OpenMP+GPUs will present programmers with
unprecedented debugging challenges
– They need good debugging tools for MPI+OpenMP+GPUs
The following are some features that TotalView supports:
• Source-level debugging of the original OpenMP code.
• The ability to plant breakpoints throughout the OpenMP code,
including lines that are executed in parallel.
• Visibility of OpenMP worker threads.
• Access to SHARED and PRIVATE variables in OpenMP PARALLEL
code.
• Access to OMP THREADPRIVATE data in code compiled by
supported compilers.
Debugging OpenMP Applications
Sample OpenMP Debugging Session
[Screenshot: TotalView showing OpenMP worker threads and local variables]
OpenMP code high and low level
• Intention is expressed in the OpenMP code
– Serial-correct code with OMP directives expressing parallelism
– Higher level expression of the ideas
• Compiler can create either serial or parallel executable programs
from this source
• A parallel executable includes both the program logic and a runtime
– Teams of threads on the device, the host or both
– Outlined routines
– Runtime calls to dispatch work to worker threads
– Work created on thread A may be executed on threads M-N
What’s Needed?
• Debugging and performance analysis support from OpenMP
implementations
• “OMPT and OMPD: OpenMP Tools Application Programming Interfaces for
Performance Analysis and Debugging” technical report (TR)
– First TR combined OMPT and OMPD in one document
– OMPD was withdrawn to allow OMPT to progress
• “TR2: OMPT: An OpenMP Tools Application Programming Interface for
Performance Analysis”
– Accepted by the OpenMP ARB (March 2014)
– OMPT is an API for first-party performance tools
• It is now time to circle back and finish OMPD!
– OMPD is similar to OMPT in its functionality, but…
– OMPD is an API for third-party debugging tools
What Does OMPT/OMPD Do?
• OMPT
– Enable performance tools to gather program and runtime execution costs
– Allow construction of low-overhead performance tools
– Allow logical stack unwinding (to handle outlined parallel regions)
– Provide the state of a thread at any point in time (e.g., idle, work, wait)
– Asynchronous signal safe
• OMPD
– Enable debugging tools to inspect the state of a live process or core file
– Third-party versions of the OpenMP runtime inquiry functions
– Third-party versions of the OMPT inquiry functions
– Intercept the beginning/end of parallel/task regions (e.g., stepping
in/out)
– Enable the debugger to construct a “global view” of the process
How is OMPD Structured?
• Based on a commonly used idiom
– pthread thread_db, MPI MQD, MPI Handles, and others
– The OMPD DLL is “paired” with the OpenMP runtime library
• The debugger
– Attaches to the target OpenMP application
– Loads the OMPD DLL (e.g., via dlopen())
– Registers callbacks in the OMPD DLL
– Makes “requests” into the OMPD DLL to query runtime state
• The OMPD DLL
– Makes callbacks into the debugger (lookup symbols, read/write
memory, etc.)
– Returns the result to the debugger
OMPD “In Action”
[Diagram: the debugger attaches to the application process, whose address space contains the OpenMP runtime library (RTL); the OMPD DLL is loaded into the debugger’s address space and callbacks are registered.]
1. The debugger requests OpenMP state
• Handles for threads, parallel regions, tasks
• Parent / child relationships
• State of handles (wait, work, idle)
2. The OMPD DLL requests types, symbols, and address information through the registered callback functions
• Lookup symbols in the target process
• Read/write target address spaces
• Support for GPUs
3. The result is returned to the debugger
OMPD Status
• Collaboration between LLNL and Rogue Wave Software (TotalView)
– LLNL: Dong Ahn, Ignacio Laguna, Joachim Protze, Martin
Schulz
– RWS: Ariel Burton, John DelSignore
• Resurrect OMPD with the ultimate goal of having it accepted by
the ARB
– Fix the current OMPD spec
– Implement OMPD DLL in the Intel OpenMP runtime
– Implement OMPD-based features in TotalView
• IWOMP Paper (International Workshop on OpenMP)
TotalView 8.15
New Features
• Scalable Infrastructure
• Faster start up on Linux
• Scales to O(100,000) processes
& O(1,000,000) threads
• Updated CUDA support
• CUDA 7.0
• Support updates including:
• Clang 3.5
• Intel 15.0
• MPT 2.12
• SLES 12, Fedora 21
TotalView’s Scalability Strategy
• TotalView uses an “MRNet tree” of servers: multicast down the tree, reduction back up
[Diagram: TV Client at the root, MRNet communication processes (CPs) in the middle, TV Servers at the leaves]
• Remain lightweight in the backend!
• Push debugger “smarts” to the backend, not the whole debugger!
• Use classic optimization techniques too: caching, hoisting invariants, etc.
Linux Start up Performance in TV 8.15.4
5x faster (600 s vs. 120 s) at 16k processes between 8.14.1 and 8.15.4. Note that we switched to MRNet by default in 8.15.0.
© 2015 Rogue Wave Software, Inc. All Rights Reserved
BG Start up Performance in TV 8.15.4
6.4x faster (180 s vs. 28 s) at 16k processes between 8.14.1 and 8.15.4. Note that we switched to MRNet by default in 8.15.0.
Some more details on the 786,432 core test
• The test was performed on 48 racks of Sequoia
• The test code
– Implements a Jacobi Linear Equation Solver
– The test code is a hybrid MPI + OpenMP code
– 16 threads per process, one process per node
• The test operations
– Start up
– Setting breakpoints / removing breakpoints
– Single stepping all threads
• Tests performed at a variety of scales to understand
scalability
TotalView’s Memory Efficiency
• TotalView is lightweight in the back-end (server)
• Servers don’t “steal” memory from the application
• Each server is a multi-process debugger agent
– One server can debug thousands of processes
– Not a conglomeration of single process debuggers
– TotalView’s architecture provides flexibility (e.g., P/SVR)
– No artificial limits to accommodate the debugger (e.g., BG/Q 1 P/CN)
• Symbols are read, stored, and shared in the front-end (client)
• Example: LLNL APP ADB, 920 shlibs, Linux, 64 P, 4 CN, 16 P/CN, 1 SVR/CN
Process    VSZ (largest, MB)  RSS (largest, MB)  Where
TV Client  4,469              3,998              Front end only
MRNet CP   497                4                  Compute nodes
TV Server  304                53                 Compute nodes
Future plans
• Contact [email protected] with any inquiries about our future plans for the TotalView product.
Thanks!
• Visit the website
– http://www.roguewave.com/products/totalview.aspx
– Documentation
– Sign up for an evaluation
– Contact customer support & post on the user forum