cuda kernel profiling using nvidia nsight compute · system-wide application algorithm tuning...

42
Sanjiv Satoor, Magnus Strengert CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE

Upload: others

Post on 28-Oct-2019

21 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

Sanjiv Satoor, Magnus Strengert

CUDA KERNEL PROFILINGUSING NVIDIA NSIGHT COMPUTE

Page 2: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

2

Overview of NVIDIA’s Devtools

What is New for CUDA Tools in CUDA 10.1

Working with NVIDIA Nsight Compute (Demo)

AGENDA

Page 3: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

3

TOOLS OFFERINGS

Debug

• Nsight

• CUDA-GDB• CUDA Debug API

IDE

• Nsight Eclipse Edition (EE) (plugin)

• Nsight Visual Studio Edition (VSE) (plugin)

Memcheck

• CUDA-memcheck

• Nsight Visual Studio Edition built-in

Profile

• Nsight Systems

• Nsight Compute

• CUDA Visual Profiler

• nvprof• Nsight Visual Studio Edition

• CUPTI

NVTX (NVidia Tools eXtension)

Page 4: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

4

CUDA PROFILING TOOLS

General:

• Support for latest Turing GPUs

Visual Profiler/nvprof/CUPTI:

• Support for NVTX string registration API nvtxDomainRegisterStringA().

Nsight Visual Studio Edition:

• Nsight Compute improvements

Updates for CUDA 10.1

Page 5: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

5

NSIGHT COMPUTE

Updates:

• Lower performance overhead and reduced memory overhead

• Source Page reports metrics by functions or files

• Updates sections and rules; section descriptions

Improved Parity to NVPROF:

• Profiling of child processes (MPI, …)

• CLI options: --summary, --quite, …

Extended NVTX Support:

• Trigger profiling by NVTX ranges

• Print NVTX state in CLI

Updates for CUDA 10.1

Page 6: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

6

NSIGHT PRODUCT FAMILY

Standalone Performance Tools

Nsight Systems - System-wide application algorithm tuning

Nsight Compute - Debug/optimize specific CUDA kernel

Nsight Graphics - Debug/optimize specific graphics shader

IDE Plugins

Nsight Eclipse Edition/Visual Studio – editor, debugger, some perf analysis

Workflow

NsightSystems

NsightCompute

NsightGraphics

Page 7: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

7

NSIGHT SYSTEMS

System-wide application algorithm tuningMulti-process tree support

Locate optimization opportunitiesVisualize millions of events on a very fast GUI timelineOr gaps of unused CPU and GPU time

Balance your workload across multiple CPUs and GPUsCPU algorithms, utilization, and thread stateGPU streams, kernels, memory transfers, etc

Overview

Page 8: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

8

Processes and

threads

CUDA and OpenGL

API trace

Multi-GPU

Kernel and memory

transfer activities

cuDNN and

cuBLAS trace

Thread/core

migration

Thread state

Page 9: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

9

TRANSITIONING TO PROFILE A KERNELDive into kernel analysis

Page 10: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

10

NSIGHT COMPUTE INTRODUCTION

Page 11: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

11

NVIDIA NSIGHT COMPUTE

Key Features:

• Interactive CUDA API debugging and kernel profiling

• Graphical profile report

• Comparison of multiple kernel reports

• Fully Customizable (Reports and Analysis Rules)

• Command Line, Standalone, IDE Integration

OS: Linux, Windows, ARM, MacOSX (host only)

GPUs: Pascal (GP10x), Volta, Turing

Docs/product: https://developer.nvidia.com/nsight-compute

Next-Gen Kernel Profiling Tool

Page 12: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

12

NSIGHT COMPUTE DEMO

Page 13: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

13

WARP SCHEDULERVolta Architecture

4 Warp Scheduler per SM

Manages a pool of warps:

Volta: 16 warp slots

Turing: 8 warp slots

Each scheduler can issue 1 warp/cycle

Offers simplified mental model for profiling and SM metrics

Page 14: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

14

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Page 15: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

15

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Cycle:

Warp

Slo

ts

N

Issue Slot:

Each Cycle:Out of all eligible warps,select one to issue on that cycle

Page 16: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

16

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Cycle:

Warp

Slo

ts

N

Issue Slot:

Each Cycle:Out of all eligible warps,select one to issue on that cycle

3

Page 17: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

17

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Cycle:

Warp

Slo

ts

Issue Slot:

N

3

N+1

4

Warp selected in cycle N,is not eligible in N+1.E.g. instructions with longer instruction latencies

Page 18: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

18

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Cycle:

Warp

Slo

ts

Issue Slot:

N

3

N+1

4

N+2

Warp exits

No eligible warp! Issue slot unused

Page 19: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

19

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Cycle:

Warp

Slo

ts

Issue Slot:

N

3

N+1

4

N+2 N+3

2

New warps scheduled

Page 20: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

20

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Cycle:

Warp

Slo

ts

Issue Slot:

N

3

N+1

4

N+2 N+3

2

Metrics (aggregated):

Page 21: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

21

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Cycle:

Warp

Slo

ts

Issue Slot:

N

3

N+1

4

N+2 N+3

2

Metrics (aggregated):

cycles_active 4

Page 22: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

22

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Cycle:

Warp

Slo

ts

Issue Slot:

N

3

N+1

4

N+2 N+3

2

Metrics (aggregated):

cycles_active 4warps_active 20

Page 23: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

23

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Cycle:

Warp

Slo

ts

Issue Slot:

N

3

N+1

4

N+2 N+3

2

Metrics (aggregated):

cycles_active 4warps_active 20

warps_active/cycles_active 5achieved_occupancy 62.5%

Page 24: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

24

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Cycle:

Warp

Slo

ts

Issue Slot:

N

3

N+1

4

N+2 N+3

2

Metrics (aggregated):

cycles_active 4warps_active 20

warps_active/cycles_active 5achieved_occupancy 62.5%

warps_stalled 15

Page 25: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

25

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Cycle:

Warp

Slo

ts

Issue Slot:

N

3

N+1

4

N+2 N+3

2

Metrics (aggregated):

cycles_active 4warps_active 20

warps_active/cycles_active 5achieved_occupancy 62.5%

warps_stalled 15warps_eligible 5

Page 26: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

26

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Cycle:

Warp

Slo

ts

Issue Slot:

N

3

N+1

4

N+2 N+3

2

Metrics (aggregated):

cycles_active 4warps_active 20

warps_active/cycles_active 5achieved_occupancy 62.5%

warps_stalled 15warps_eligible 5warps_issued 3

Page 27: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

27

WARP SCHEDULER Mental Model for Profiling

7

6

5

4

3

2

1

0

Warp

Slo

ts

Warp States:

Unused

Active

Stalled

Eligible

Selected

Cycle:

Warp

Slo

ts

Issue Slot:

N

3

N+1

4

N+2 N+3

2

Metrics (aggregated):

cycles_active 4warps_active 20

warps_active/cycles_active 5achieved_occupancy 62.5%

warps_stalled 15warps_eligible 5warps_issued 3

warps_issued/cycles_active 0.75issue_slot_utilization 75%

Page 28: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

28

WARP SCHEDULER Application to kernel_B

N N+1 N+2 N+3 N+4 N+5 N+6 N+7 N+8 Metrics (theoretical; every 8 cycles):

warps_active 8 1warps_stalled 7 7/8warps_eligible 1 1/8warps_selected 1 1/8

Warp States:Unused

Active

StalledEligible

Selected

Page 29: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

29

WARP SCHEDULER Application to kernel_B

N N+1 N+2 N+3 N+4 N+5 N+6 N+7 N+8 Metrics (theoretical; every 8 cycles):

warps_active 8 1warps_stalled 7 7/8warps_eligible 1 1/8warps_selected 1 1/8

Metrics (from report):

warps_active 1.00warps_stalled 0.87warps_eligible 0.13warps_selected 0.13

Warp States:Unused

Active

StalledEligible

Selected

Page 30: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

30

WARP SCHEDULER Application to kernel_A

FP64 Pipeline on Volta:

Dependent Issue Rate: 8 cyclesIssue Rate: 4 cycles

N N+1 N+8N+2 N+3 N+5 N+6 N+7N+4

Warp States:Unused

Active

StalledEligible

Selected

Page 31: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

31

WARP SCHEDULER Application to kernel_A

N N+1 N+8N+2 N+3 N+5 N+6 N+7N+4

Warp States:Unused

Active

StalledEligible

Selected

FP64 Pipeline on Volta:

Dependent Issue Rate: 8 cyclesIssue Rate: 4 cycles

warp

0

Page 32: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

32

WARP SCHEDULER Application to kernel_A

N N+1 N+8N+2 N+3 N+5 N+6 N+7N+4

Warp States:Unused

Active

StalledEligible

Selected

FP64 Pipeline on Volta:

Dependent Issue Rate: 8 cyclesIssue Rate: 4 cycles

warp

0

Page 33: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

33

WARP SCHEDULER Application to kernel_A

N N+1 N+8N+2 N+3 N+5 N+6 N+7N+4

Warp States:Unused

Active

StalledEligible

Selected

FP64 Pipeline on Volta:

Dependent Issue Rate: 8 cyclesIssue Rate: 4 cycles

warp

0

Page 34: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

34

WARP SCHEDULER Application to kernel_A

N N+1 N+8N+2 N+3 N+5 N+6 N+7N+4

Warp States:Unused

Active

StalledEligible

Selected

FP64 Pipeline on Volta:

Dependent Issue Rate: 8 cyclesIssue Rate: 4 cycles

warp

0w

arp

1

Page 35: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

35

WARP SCHEDULER Application to kernel_A

N N+1 N+8N+2 N+3 N+5 N+6 N+7N+4

Warp States:Unused

Active

StalledEligible

Selected

Metrics (theoretical; every 8 cycles):

warps_active 96warps_stalled 74warps_eligible 22warps_selected 2

Page 36: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

36

WARP SCHEDULER Application to kernel_A

Metrics (theoretical; every 8 cycles):

warps_active 96 96/8=12.00warps_stalled 74 74/8= 9.25warps_eligible 22 22/8= 2.75warps_selected 2 2/8= 0.25

Metrics (from report):

warps_active 12.27warps_stalled 9.08warps_eligible 2.80warps_selected 0.26

N N+1 N+8N+2 N+3 N+5 N+6 N+7N+4

Warp States:Unused

Active

StalledEligible

Selected

Page 37: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

37

TAKEAWAYS

General:

• Use tools whenever possible

• Understanding app/gpu/system and tools go hand in hand(on us to make this as easy as possible)

Nsight Compute:

• Use SOL metrics to understand overall limiter(reading real-world reports is a lot more difficult; but the very same workflow applies)

• Read rules’ output for guidance throughout the report(we’ll add more rules in future releases)

• Use top-down approach; no need to jump directly into SASS code

• Do not optimize stalls, if you already use all/sufficient issue slots

• Let us know what works for you and what doesn’t:devtalk.nvidia.com > Development Tools > Nsight Compute

Profiling with Nsight Compute

Page 38: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

38

DEVELOPER TOOLS AT GTC19Talks:S9751: Accelerate Your CUDA Development with Latest Debugging and Code Analysis Developer Tools, Tue @9amS9866 - Optimizing Facebook AI Workloads for NVIDIA GPUs, Tue @9amS9345: CUDA Kernel Profiling using NVIDIA Nsight Compute, Tue @1pmS9661: Nsight Graphics - DXR/Vulkan Profiling/Vulkan Raytracing, Wed @10amS9503: Using Nsight Tools to Optimize the NAMD Molecular Dynamics Simulation Program, Wed @1pm

Hands-on labs:L9102: Jetson Developer Tools Training Lab, Mon @9am, 11:30amL9124: Debugging and optimizing CUDA applications with Nsight products on Linux training lab, Tue @8am, 10am

Connect with the Experts (where Developer Tools will be available):CE9123: CUDA & Graphics Developer Tools, Tue @2pm, Wed @3pmCE9137: Jetson Embedded Platform, Tue @12pm, 5pm, Wed @1pm, 4pm, Thu @12pm

Podium: Demos of DevTools products on Linux, DRIVE AGX & Jetson AGX at the showfloorTue @12pm – 7pmWed @12pm – 7pmThu @11am – 2pm

Page 39: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

Any Questions?

THANK YOU

Page 40: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

40

BACKUP SLIDES

Page 41: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

49

CUDA PROFILING TOOLS

NVIDIA® Visual Profiler – CUDA trace/kernel profiling

GUI – runs on Linux, Windows, Mac

nvprof – CUDA trace/kernel profiling

Command line tool – runs on Linux, Windows, Mac

NVIDIA® Nsight™ Visual Studio Edition – CUDA, OpenGL and system trace and CUDA kernel profiling

Integration into Visual Studio 2012 and newer – Windows only

Page 42: CUDA KERNEL PROFILING USING NVIDIA NSIGHT COMPUTE · System-wide application algorithm tuning Multi-process tree support Locate optimization opportunities Visualize millions of events

50

NSIGHT SYSTEMS

Multicore CPU and multi-GPU system activity trace

CPU utilization, thread state and core/thread migration

OS library trace with blocking call backtraces

pthread, semaphore, file I/O, network I/O, poll/select, sleep/yield, ioctl, syscall, fork

CUDA, OpenACC, OpenGL API and GPU Workload traceIncludes UVM eventscuDNN and cuBLAS

NVidia Tools eXtension (NVTX) user annotations

Command line

Features