NVIDIA Visual Profiler - uni-graz.at · NVIDIA Visual Profiler & CUDA-MEMCHECK . Visual Profiler...

Post on 12-Aug-2020



NVIDIA Visual Profiler & CUDA-MEMCHECK

Visual Profiler – Overview

• Included in CUDA Toolkit

• Visualize and optimize performance of a CUDA application

• Shows timeline on CPU and GPU

• nvvp (GUI)

• nvprof (Terminal)

• Two types of session:

– Executable session

– Imported session (importing data generated by nvprof)

• Generate PDF report

Getting started

Timeline View

• CPU activity

• GPU activity

• Shows start & end of

– Threads

– Kernels

– Memcpy

– …

• Zoom, filter, reorder, …
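A minimal sketch of a program whose activity would show up in the timeline: one host-to-device memcpy, one kernel, one device-to-host memcpy. The file name `timeline_demo.cu`, the kernel `scale`, and the helper `scale_on_gpu` are hypothetical names chosen for illustration; the build and profiling commands in the comments are one possible way to produce data for an imported session.

```cuda
// timeline_demo.cu (hypothetical file name)
// Build:   nvcc -lineinfo -o timeline_demo timeline_demo.cu
// Profile: nvprof -o timeline.nvvp ./timeline_demo
//          (the output file can then be opened in nvvp as an imported session)
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Multiplies every element by `factor`; one launch is one kernel interval.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Copies `host` to the device, scales it there, and copies it back.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t scale_on_gpu(std::vector<float>& host, float factor, int threads = 256) {
    assert(threads > 0 && threads <= 1024);
    if (host.empty()) return cudaSuccess;  // nothing to do, avoid an empty launch

    const int n = static_cast<int>(host.size());
    const size_t bytes = n * sizeof(float);
    float* dev = nullptr;

    cudaError_t err = cudaMalloc(&dev, bytes);
    if (err != cudaSuccess) return err;

    // Host-to-device memcpy interval
    err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int blocks = (n + threads - 1) / threads;  // round up
        scale<<<blocks, threads>>>(dev, n, factor);       // kernel interval
        err = cudaGetLastError();
    }
    // Device-to-host memcpy interval (also waits for the kernel to finish)
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    return err;
}
```

To try it, call `scale_on_gpu` from a `main` function. Calling `cudaDeviceReset()` before the program exits helps ensure that buffered profile data is flushed.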

Analysis View

• Guided or unguided

– For unguided, compile with SET(LOCAL_CUDA_NVCC_FLAGS ${LOCAL_CUDA_NVCC_FLAGS} -lineinfo)

• CUDA Application Analysis

– Application's overall GPU utilization

– Kernel performance (orders kernels by optimization importance, based on execution time and achieved occupancy)

• Performance-Critical Kernels

– Detailed analysis of a selected kernel

• Compute, Bandwidth, or Latency Bound

• Instruction and memory latency

– Examine occupancy: how many warps the kernel has active on the GPU, relative to the maximum number of warps supported by the GPU

– Examine stall reasons: can give insight into why latency is still an issue for the kernel

• Compute resources: GPU compute resources can limit the performance of a kernel if they are insufficient or poorly utilized
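The occupancy definition above (active warps relative to the maximum number of warps supported) can be sketched with the CUDA runtime occupancy API. This is an illustration under assumptions: the kernel `saxpy` and the helpers `occupancy_ratio` and `theoretical_occupancy` are hypothetical names, device 0 is assumed, and the result is the theoretical upper bound for a launch configuration, not the achieved occupancy that the profiler measures at run time.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Occupancy as a ratio: active warps divided by the maximum supported warps.
double occupancy_ratio(int active_warps, int max_warps) {
    assert(active_warps >= 0 && max_warps > 0);
    return static_cast<double>(active_warps) / max_warps;
}

// Placeholder kernel (hypothetical) whose occupancy is queried.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Theoretical occupancy of `saxpy` per multiprocessor on device 0 for a
// given block size. Returns -1.0 if a CUDA call fails (e.g. no device).
double theoretical_occupancy(int block_size) {
    assert(block_size > 0);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1.0;

    // Maximum number of resident blocks of this kernel per multiprocessor
    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, saxpy, block_size, 0) != cudaSuccess)
        return -1.0;

    // A partially filled warp still occupies a full warp slot, so round up.
    const int warps_per_block = (block_size + prop.warpSize - 1) / prop.warpSize;
    const int active_warps = blocks_per_sm * warps_per_block;
    const int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return occupancy_ratio(active_warps, max_warps);
}
```

For example, 32 active warps on a multiprocessor that supports 64 give a ratio of 0.5. Comparing `theoretical_occupancy` for a few block sizes against the achieved occupancy reported in the analysis view can indicate whether the launch configuration or the runtime behavior is the limiting factor.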

CUDA-MEMCHECK

• Detects memory access errors

• Runtime error detection

• Included in CUDA Toolkit

• Getting started:

– cuda-memcheck [options] executable [executable options]

best case:

Supported error detection

• Memory access error: errors due to an out-of-bounds or misaligned access to memory by a global, local, shared, or global atomic access

• Hardware exception: errors reported by the hardware error reporting mechanism

• Malloc/Free errors: errors due to incorrect use of malloc or free

• CUDA API errors: failure of a CUDA API call

• cudaMalloc memory leaks: allocations of device memory which have not been freed

• Device heap memory leaks: allocations of device memory in device code which have not been freed
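The two leak categories above can be provoked with a small sketch. The file name `leak_demo.cu`, the kernel `heap_user`, and the helper `run_leak_demo` are hypothetical; the assumption is that leak checking is enabled with `--leak-check full` and that the CUDA context is destroyed at the end (here via `cudaDeviceReset()`), so that unfreed allocations can be reported.

```cuda
// leak_demo.cu (hypothetical file name)
// Build: nvcc -lineinfo -o leak_demo leak_demo.cu
// Check: cuda-memcheck --leak-check full ./leak_demo
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread allocates from the device heap; if `release` is false the
// allocation is never freed (device heap memory leak).
__global__ void heap_user(bool release) {
    int* scratch = static_cast<int*>(malloc(16 * sizeof(int)));
    if (scratch == nullptr) return;
    scratch[0] = static_cast<int>(threadIdx.x);
    if (release) free(scratch);
}

// With release == false, both the cudaMalloc buffer and the device heap
// allocations are left unfreed. Returns the kernel's completion status.
cudaError_t run_leak_demo(bool release, int threads = 1) {
    assert(threads > 0 && threads <= 1024);

    float* buffer = nullptr;
    cudaError_t err = cudaMalloc(&buffer, 256 * sizeof(float));
    if (err != cudaSuccess) return err;

    heap_user<<<1, threads>>>(release);
    err = cudaDeviceSynchronize();

    if (release) cudaFree(buffer);  // skipped in the leaking variant

    // Destroy the context so that leak checking can report what is left.
    cudaDeviceReset();
    return err;
}
```

Calling `run_leak_demo(false)` from `main` and running the binary under the command shown in the comment should list the unfreed allocations; `run_leak_demo(true)` is the clean variant for comparison.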

Example

• __global__: for device global memory

• __shared__: for per-block shared memory

• __local__: for per-thread local memory

• Information about type of access (read / write)

• Size of access in bytes

• Source file and line number

• Thread indices and block indices

• Memory address being accessed and type of access error
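A minimal sketch of a kernel that triggers such a report: with the bounds guard disabled, threads past the end of the array perform an out-of-bounds 4-byte write to global memory. The file name `oob_demo.cu` and the names `fill`, `run_fill`, and `blocks_for` are hypothetical; compiling with `-lineinfo` is assumed so that source file and line number can be shown.

```cuda
// oob_demo.cu (hypothetical file name)
// Build: nvcc -lineinfo -o oob_demo oob_demo.cu
// Check: cuda-memcheck ./oob_demo
#include <cassert>
#include <cuda_runtime.h>

// Number of blocks needed to cover n elements (rounds up).
int blocks_for(int n, int threads) {
    assert(n > 0 && threads > 0);
    return (n + threads - 1) / threads;
}

// Writes out[i] = i. With guarded == false, threads with i >= n write past
// the end of the allocation: an invalid __global__ write of size 4.
__global__ void fill(int* out, int n, bool guarded) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (!guarded || i < n) out[i] = i;
}

// Allocates n ints, launches `fill`, and returns the completion status.
cudaError_t run_fill(int n, bool guarded) {
    int* out = nullptr;
    cudaError_t err = cudaMalloc(&out, n * sizeof(int));
    if (err != cudaSuccess) return err;

    const int threads = 256;
    fill<<<blocks_for(n, threads), threads>>>(out, n, guarded);
    err = cudaDeviceSynchronize();

    cudaFree(out);
    return err;
}
```

With `n = 1000` the launch uses 4 blocks of 256 threads, so in the unguarded variant threads 1000 to 1023 access memory outside the array. Without cuda-memcheck such a small overrun may go unnoticed; under cuda-memcheck each offending thread should be reported with the items listed above.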