OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2...

47
Mayank Jain, 11 May 2017 OPTIMIZING APPLICATION PERFORMANCE WITH CUDA PROFILING TOOLS

Transcript of OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2...

Page 1: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

Mayank Jain, 11 May 2017

OPTIMIZING APPLICATION PERFORMANCE WITH CUDA PROFILING TOOLS

Page 2: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

2

AGENDA

• CUDA Profiling Tools

• Unified Memory Profiling

• NVLink Profiling

• PC Sampling

• MPI Profiling

• Multi-hop Remote Profiling

• Volta Support

• Other Improvements

Page 3: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

3

CUDA PROFILING TOOLS

• NVIDIA® Visual Profiler

• nvprof *

• NVIDIA® Nsight™ Visual Studio Edition

* Android CUDA APK profiling not supported (yet)

Page 4: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

4

3RD PARTY PROFILING TOOLS

TAU Performance System ® VampirTrace

PAPI CUDA Component HPC Toolkit

Page 5: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

5

UNIFIED MEMORY PROFILING

Page 6: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

6

UNIFIED MEMORY EVENTS

Events

Page 7: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

7

SEGMENT MODE TIMELINE OPTIONS

Option to enable segment mode

Option to specify number of segments

Page 8: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

8

SEGMENT MODE TIMELINE

Segment mode interval Heat map for CPU page faults

Page 9: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

9

SWITCH TO NON-SEGMENT VIEW

Select settings view

Select this tab Load data within a specific time range

Uncheck

Page 10: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

10

NON-SEGMENTED MODE TIMELINE

Time range

Page 11: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

11

CPU PAGE FAULT SOURCE CORRELATION

Selected interval Source location

Page 12: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

12

CPU PAGE FAULT SOURCE CORRELATION

Source line causing CPU page fault

Page 13: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

13

CPU PAGE FAULT SOURCE CORRELATION

Unguided Analysis

Option to collect Unified Memory

information

Summary of all CPU page faults

Page 14: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

14

NEW UNIFIED MEMORY EVENTSPage throttling, Memory thrashing, Remote map

Memory thrashing

Page throttling

Remote map

Page 15: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

15

FILTER AND ANALYZE

1 Select unified memory in the unguided analysis section

2 Select required events and click on ‘Filter and Analyze’

Summary of filtered intervals

Page 16: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

16

FILTER AND ANALYZEUnfiltered

Page 17: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

17

FILTER AND ANALYZEFiltered

Filtered intervals

Page 18: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

18

UNOPTIMIZED APPLICATION

Memory Thrashing

Analyze read access page faults

and thrashing

12.2 ms

Read access page faults

Page 19: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

19

OPTIMIZATION

int threadsPerBlock = 256;int numBlocks = (length + threadsPerBlock – 1) / threadsPerBlock;

kernel<<< numBlocks , threadsPerBlock >>>(A, B, C, length);

OLD

NEW

int threadsPerBlock = 256;int numBlocks = (length + threadsPerBlock – 1) / threadsPerBlock;

cudaMemAdvise(A, size, cudaMemAdviseSetReadMostly, 0);cudaMemAdvise(B, size, cudaMemAdviseSetReadMostly, 0);

kernel<<< numBlocks , threadsPerBlock >>>(A, B, C, length);

Page 20: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

20

OPTIMIZED APPLICATION

No DtoH Migrations and thrashing

2.9 ms

Speedup 4x (2.9 vs 12.2)

Page 21: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

21

NVLINK PROFILING

Page 22: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

22

NVLINK VISUALIZATIONVisual Profiler

Color codes for

NVLink

Topology

Static

properties

Runtime

values

Option to collect

NVLink information

Unguided Analysis

Achieved

throughputSelected

NVLink

Page 23: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

23

DGX-1V NVLINK TOPOLOGY

Page 24: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

24

NVLINK EVENTS ON TIMELINE

MemCpy API

NVLink Events on

Timeline

Color Coding of

NVLink Events

Page 25: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

25

NVLINK ANALYSISStage I: Data Movement Over PCIe

216 milliseconds

Page 26: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

26

NVLINK ANALYSISStage II: Data Movement Over NVLink

65 milliseconds

Under-utilized

NVLink

Minimal-Unused

NVLinks

Speedup 4x (216 vs 65)

Page 27: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

27

NVLINK ANALYSISStage III: Data Movement Over NVLink with Streams

20 milliseconds

Changed Color based

on achieved

bandwidth

Changed color based

on selection

Speedup 3x (65 vs 20)

Page 28: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

28

INSTRUCTION LEVEL PROFILING(PC SAMPLING)

Page 29: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

29

PC SAMPLING

PC sampling feature is available for device with CC >= 5.2

Provides CPU PC sampling parity + additional information for warp states/stalls reasons for GPU kernels

Effective in optimizing large kernels, pinpoints performance bottlenecks at specific lines in source code or assembly instructions

Samples warp states periodically in round robin order over all active warps

No overheads in kernel runtime, CPU overheads to parse the records

Page 30: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

30

PC SAMPLING UI

Pie chart for sample distribution for a CUDA function

Source-Assembly view

Page 31: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

31

MPI PROFILING

Page 32: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

32

MPI PROFILING

$ mpirun -n 4 nvprof --process-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}" --context-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}" -o timeline.%q{OMPI_COMM_WORLD_RANK}.pdm ./simpleMPIRunning on 4 nodes==21977== NVPROF is profiling process 21977, command: ./simpleMPI==21983== NVPROF is profiling process 21983, command: ./simpleMPI==21979== NVPROF is profiling process 21979, command: ./simpleMPI==21982== NVPROF is profiling process 21982, command: ./simpleMPI<program output>==21982== Generated result file: timeline.0.pdm==21977== Generated result file: timeline.3.pdm==21983== Generated result file: timeline.1.pdm==21979== Generated result file: timeline.2.pdm

nvprof

Page 33: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

33

MPI PROFILINGnvprof daemon mode

$ nvprof --profile-all-processes-o out.%h.%p.%q{OMPI_COMM_WORLD_RANK}<nvprof listens in daemon mode>

<profiling data is generated>

$ mpirun -n 4 ./simpleMPI

1

2

3

Shell 1 Shell 2

Page 34: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

34

MPI PROFILING

1 4

Importing into the Visual Profiler

2 3

Page 35: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

35

MPI PROFILINGVisual Profiler

MPI Rank-based naming

NVTX Markers & Ranges

Page 36: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

36

MPI PROFILINGMPI + NVTX

nvtxEventAttributes_t range = {0};range.message.ascii = "MPI_Scatter";nvtxRangePushEx(range);int result = MPI_Scatter(...);nvtxRangePop();

Auto-generate mpi_interception.so

LD_PRELOAD=mpi_interception.so

Run your MPI app with nvprof.

MPI calls will be auto-annotated using NVTX.

Manual mode Interception mode

1

2

3

https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler/

Page 37: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

37

MPI PROFILINGInterception

int res = MPI_Scatter(...);

MPI app Interception library(LD_PRELOAD)

int MPI_Scatter(...) { nvtxRangePushEx(range);int res = PMPI_Scatter(...);nvtxRangePop();return res;

}

MPI library

int PMPI_Scatter(...)

Page 38: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

38

REMOTE PROFILING

Page 39: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

39

NVVP: MULTI-HOP REMOTE PROFILING

Host Compute NodeLogin Node

Visual Profiler ScriptCUDA

Applicationssh

ssh

scp

Page 40: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

40

NVVP: MULTI-HOP REMOTE PROFILING

1 2 3

Login Node

Script

Connect Visual Profiler to the login node

Configure script on the login node

Use the custom script option

One-Time Setup

Page 41: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

41

NVVP: MULTI-HOP REMOTE PROFILING

1 2 Application transparently runs on compute node and profiling data is displayed in the Visual Profiler

Select custom script, then create a remote session as usual

Application Profiling

Page 42: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

42

VOLTA SUPPORT

Page 43: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

43

VOLTA SUPPORT

GV100 Device

Attributes

GPU Trace

Page 44: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

44

OTHER IMPROVEMENTS

Page 45: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

45

OTHER IMPROVEMENTS

Tracing and profiling of Cooperative Kernel launches is supported

The Visual Profiler supports remote profiling to systems supporting ssh algorithms

with a key length of 2048 bits

OpenACC profiling is now supported on systems without CUDA setup

nvprof flushes all profiling data when a SIGINT or SIGKILL signal is encountered

Page 46: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling

46

REFERENCES

NVIDIA toolkit documentation: http://docs.nvidia.com

CUDA Profiler User Guide: http://docs.nvidia.com/cuda/profiler-users-guide/index.html

Other GTC 2017 sessions:

S7824 - DEVELOPER TOOLS UPDATE IN CUDA 9

S7519 - DEVELOPER TOOLS FOR AUTOMOTIVE, DRONES AND INTELLIGENT CAMERAS APPLICATIONS

S7445 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING WHOLE APPLICATION PERFORMANCE

S7444 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS

Page 47: OPTIMIZING APPLICATION PERFORMANCE WITH …on-demand.gputechconf.com/gtc/2017/presentation/s7495...2 AGENDA • CUDA Profiling Tools • Unified Memory Profiling • NVLink Profiling