Lecture: Manycore GPU Architectures and Programming, Part 3
-- Streaming, Library and Tuning

CSCE569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
yanyh@cse.sc.edu
https://passlab.github.io/CSCE569/
Manycore GPU Architectures and Programming: Outline

• Introduction
  – GPU architectures, GPGPUs, and CUDA
• GPU Execution model
• CUDA Programming model
• Working with Memory in CUDA
  – Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and library
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming model
  – OpenACC and OpenMP
Offloading Processing Flow

1. Copy input data from CPU memory to GPU memory (over the PCI bus)
2. Load the GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory back to CPU memory (over the PCI bus)
Overlapping Communication and Computation

[Figure: timeline of PCIe-bus copies interleaved with GPU compute, so that
each chunk's transfer overlaps another chunk's computation]

• Three sequential steps for a single kernel execution
• Multiple kernels
  – Asynchrony is a first-class citizen of most GPU programming frameworks
  – Computation-communication overlap is a common technique in GPU
    programming
Abstract Concurrency

• Four kinds of action overlap are possible in CUDA:
  1. Overlapped host computation and device computation
  2. Overlapped host computation and host-device data transfer
  3. Overlapped host-device data transfer and device computation
  4. Concurrent device computation (multiple kernels)
• CUDA streams are the mechanism for achieving each of these types of
  overlap (a minimal sketch of type 1 follows)
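Overlap type 1 comes essentially for free, because kernel launches are
asynchronous with respect to the host. A minimal sketch (the kernel and the
host function names are hypothetical):

    my_kernel<<<grid, block>>>(d_data);  // returns immediately; GPU works in background
    do_host_work();                      // host computation overlaps device computation
    cudaDeviceSynchronize();             // block the host until the kernel finishes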
CUDA Streams

• A CUDA stream is a FIFO queue of CUDA actions to be performed
  – Adding a new action to a stream is asynchronous with respect to the host
  – Queued actions are executed in FIFO order as CUDA resources allow
  – Every action (kernel launch, cudaMemcpy, etc.) runs in an implicit or
    explicit stream

[Figure: a CUDA application enqueues actions (kernel, cudaMemcpy,
cudaMemcpy) at the tail of a stream; the CUDA runtime and GPU consume them
from the head]
CUDA Streams

• Two types of streams in a CUDA program
  – The implicitly declared stream (NULL stream)
  – Explicitly declared streams (non-NULL streams)
• Up until now, all code has been using the NULL stream by default:

    cudaMemcpy(...);
    kernel<<<...>>>(...);
    cudaMemcpy(...);

• Non-NULL streams require manual allocation and management by the CUDA
  programmer
CUDA Streams

• To create a CUDA stream:
    cudaError_t cudaStreamCreate(cudaStream_t *stream);
• To destroy a CUDA stream:
    cudaError_t cudaStreamDestroy(cudaStream_t stream);
• To wait for all actions in a CUDA stream to finish:
    cudaError_t cudaStreamSynchronize(cudaStream_t stream);
• To check whether all actions in a CUDA stream have finished:
    cudaError_t cudaStreamQuery(cudaStream_t stream);
CUDA Streams

• cudaMemcpyAsync: asynchronous memcpy
    cudaError_t cudaMemcpyAsync(void *dst, const void *src, size_t count,
                                cudaMemcpyKind kind, cudaStream_t stream = 0);
• cudaMemcpyAsync does the same as cudaMemcpy, but may return before the
  transfer is actually complete
• Pinned host memory is a requirement for cudaMemcpyAsync
  – Memory that is resident in physical pages and cannot be swapped out,
    also referred to as page-locked
    • Recall that malloc normally reserves virtual address space first;
      physical pages are allocated later, on demand
  – DMA performs the actual data transfer
CUDA Streams

• Performing a cudaMemcpyAsync:

    int *h_arr, *d_arr;
    cudaStream_t stream;
    cudaMalloc((void **)&d_arr, nbytes);
    cudaMallocHost((void **)&h_arr, nbytes);   // page-locked memory allocation
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d_arr, h_arr, nbytes, cudaMemcpyHostToDevice,
                    stream);                   // call returns before transfer completes
    ...                                        // do something while data is being moved
    cudaStreamSynchronize(stream);             // sync to make sure operations complete

    cudaFree(d_arr); cudaFreeHost(h_arr); cudaStreamDestroy(stream);
CUDA Streams

• Associate kernel launches with a non-NULL stream using the fourth launch
  configuration parameter:

    kernel<<<nblocks, threads_per_block, smem_size, stream>>>(...);

• The effects of cudaMemcpyAsync and kernel launching:
  – Operations are put in the stream's queue for execution
  – The actual operations may not have happened yet when the call returns
• A host-side timer therefore measures only the time of the calls themselves
  – Not the actual duration of the operations
CUDA Streams

• Vector sum example, A + B = C
• Partition the vectors and use CUDA streams to overlap copy and compute

[Figure: in the NULL stream, Copy A, Copy B, vector_sum<<<...>>>, and Copy C
run back to back; with four streams A-D, each stream runs its own
copy-copy-compute-copy pipeline and the stages of different streams overlap]
CUDA Streams

• How can this be implemented in code? (Note: h_A, h_B, and h_C must be
  pinned for the asynchronous copies to actually overlap.)

    cudaStream_t streams[nstreams];
    for (int i = 0; i < nstreams; i++)
        cudaStreamCreate(&streams[i]);      /* initialize the streams */
    int eles_per_stream = N / nstreams;

    for (int i = 0; i < nstreams; i++) {
        int offset = i * eles_per_stream;
        cudaMemcpyAsync(&d_A[offset], &h_A[offset],
                        eles_per_stream * sizeof(int),
                        cudaMemcpyHostToDevice, streams[i]);
        cudaMemcpyAsync(&d_B[offset], &h_B[offset],
                        eles_per_stream * sizeof(int),
                        cudaMemcpyHostToDevice, streams[i]);
        vector_sum<<<..., streams[i]>>>(d_A + offset, d_B + offset,
                                        d_C + offset);
        cudaMemcpyAsync(&h_C[offset], &d_C[offset],
                        eles_per_stream * sizeof(int),
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    for (int i = 0; i < nstreams; i++)
        cudaStreamSynchronize(streams[i]);
CUDA Events

• Timing asynchronous operations
  – A host-side timer only measures the time for the call, not the actual
    time of the data movement or kernel execution
• Events mark specific points in stream execution

[Figure: an Event recorded in a stream between vector_sum<<<...>>> and Copy C]

• Events are manually created and destroyed:
    cudaError_t cudaEventCreate(cudaEvent_t *event);
    cudaError_t cudaEventDestroy(cudaEvent_t event);
CUDA Events

• To add an event to a CUDA stream:
    cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream);
  – The event marks the point in time after all preceding actions in stream
    complete, and before any actions added after cudaEventRecord run
• For the host to wait for some CUDA actions to finish:
    cudaError_t cudaEventSynchronize(cudaEvent_t event);
  – Waits for all the operations before this event to complete, but not
    those after it

[Figure: the host can synchronize on an Event recorded after
vector_sum<<<...>>> without waiting for Copy C]
CUDA Events

• To check whether an event has been reached without waiting for it:
    cudaError_t cudaEventQuery(cudaEvent_t event);
• To get the elapsed milliseconds between two events:
    cudaError_t cudaEventElapsedTime(float *ms, cudaEvent_t start,
                                     cudaEvent_t stop);

[Figure: start and stop events bracketing Copy A, Copy B,
vector_sum<<<...>>>, and Copy C]
CUDA Events

• In code:

    float time;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernel<<<grid, block>>>(arguments);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&time, start, stop);   // elapsed kernel time in ms
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
Implicit and Explicit Synchronization

• Two types of host-device synchronization:
  – Implicit synchronization causes the host to wait on the GPU as a side
    effect of other CUDA actions
  – Explicit synchronization causes the host to wait on the GPU because the
    programmer has asked for that behavior
Implicit and Explicit Synchronization

• Five CUDA operations that include implicit synchronization:
  1. A pinned host memory allocation (cudaMallocHost, cudaHostAlloc)
  2. A device memory allocation (cudaMalloc)
  3. A device memset (cudaMemset)
  4. A memory copy between two addresses on the same device
     (cudaMemcpy(..., cudaMemcpyDeviceToDevice))
  5. A modification of the L1/shared memory configuration
     (cudaThreadSetCacheConfig, cudaDeviceSetCacheConfig)
Implicit and Explicit Synchronization

• Four ways to explicitly synchronize in CUDA:
  1. Synchronize on a device:
     cudaError_t cudaDeviceSynchronize();
  2. Synchronize on a stream:
     cudaError_t cudaStreamSynchronize(cudaStream_t stream);
  3. Synchronize on an event:
     cudaError_t cudaEventSynchronize(cudaEvent_t event);
  4. Synchronize across streams using an event:
     cudaError_t cudaStreamWaitEvent(cudaStream_t stream,
                                     cudaEvent_t event);
Implicit and Explicit Synchronization

• cudaStreamWaitEvent adds inter-stream dependencies
  – Causes the specified stream to wait on the specified event before
    executing any further actions
  – The event does not need to be an event recorded in stream

    cudaEventRecord(event, stream1);
    ...
    cudaStreamWaitEvent(stream2, event);
    ...

  – No actions added to stream2 after the call to cudaStreamWaitEvent will
    execute until the event is satisfied (see the fuller sketch below)
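A minimal sketch of the pattern (assumes d_in, h_in, nbytes, both streams,
and the event were created earlier; the kernel name is hypothetical):

    cudaMemcpyAsync(d_in, h_in, nbytes, cudaMemcpyHostToDevice, stream1);
    cudaEventRecord(event, stream1);             // marks completion of the copy
    cudaStreamWaitEvent(stream2, event);         // stream2 waits for the copy
    process<<<grid, block, 0, stream2>>>(d_in);  // safe: runs only after the copy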
Suggested Readings

1. Chapter 6 in Professional CUDA C Programming
2. Justin Luitjens. CUDA Streams: Best Practices and Common Pitfalls. GTC 2014.
   http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf
3. Steve Rennich. CUDA C/C++ Streams and Concurrency. 2011.
   http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
Manycore GPU Architectures and Programming: Outline

• Introduction
  – GPU architectures, GPGPUs, and CUDA
• GPU Execution model
• CUDA Programming model
• Working with Memory in CUDA
  – Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and library
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming model
  – OpenACC and OpenMP
CUDA Libraries

• Pre-packaged, expertly optimized functions that implement commonly useful
  operations
  – Vector addition, matrix-vector, matrix-matrix, FFT, etc.
• Advantages of CUDA libraries:
  – Support a wide range of application domains
  – Highly usable, high-level APIs that are familiar to domain experts
  – Tuned by CUDA experts to perform well across platforms and datasets
  – Often offer the quickest route for porting: simply swap out API calls
  – Low maintenance: the developer of the library takes on responsibility
    for bug fixes and feature requests
CUDA Libraries

[Figure: the NVIDIA GPU-accelerated library ecosystem]
https://developer.nvidia.com/gpu-accelerated-libraries
Workflow to Use a CUDA Library

1. Create a library-specific handle that manages contextual information
   useful for the library's operation
   – Many CUDA libraries have the concept of a handle, which stores opaque
     library-specific information on the host that many library functions
     access
   – It is the programmer's responsibility to manage this handle
   – For example: cublasHandle_t, cufftHandle, cusparseHandle_t,
     curandGenerator_t
2. Allocate device memory for inputs and outputs to the library function
   – Use cudaMalloc as usual
Common Library Workflow

3. If inputs are not already in a library-supported format, convert them to
   be accessible by the library
   – Many CUDA libraries only accept data in a specific format
   – For example: column-major vs. row-major arrays
4. Populate the pre-allocated device memory with inputs in a supported
   format
   – In many cases this step simply implies a cudaMemcpy or one of its
     variants to make the data accessible on the GPU
   – Some libraries provide custom transfer functions; for example,
     cublasSetVector optimizes strided copies for the cuBLAS library
Common Library Workflow

5. Configure the library computation to be executed
   – In some libraries, this is a no-op
   – Others require additional metadata to execute the library computation
     correctly
   – In some cases this configuration takes the form of extra parameters
     passed to library functions; in others, fields are set in the library
     handle
6. Execute a library call that offloads the desired computation to the GPU
   – No GPU-specific knowledge required
Common Library Workflow

7. Retrieve the results of that computation from device memory, possibly in
   a library-determined format
   – Again, this may be as simple as a cudaMemcpy or may require a
     library-specific function
8. If necessary, convert the retrieved data to the application's native
   format
   – If a conversion to a library-specific format was necessary, this step
     ensures the application can now use the calculated data
   – In general, it is best to keep the application format and the library
     format the same, reducing overhead from repeated conversions
Common Library Workflow

9. Release CUDA resources
   – Includes the usual CUDA cleanup (cudaFree, cudaStreamDestroy, etc.)
     plus any library-specific cleanup
10. Continue with the remainder of the application
Common Library Workflow

• Not all libraries follow this workflow, and not all libraries require
  every step in it
  – In fact, for many libraries many steps are skipped
  – Keeping this workflow in mind will help give you context on what the
    library might be doing behind the scenes and where you are in the
    process
• Next, we'll take a look at two commonly useful libraries
  – Try to keep the common workflow in mind while we work with them
cuBLAS

• cuBLAS is a port of a popular linear algebra library, BLAS
• cuBLAS (like BLAS) splits its subroutines into multiple levels based on
  the data types processed:
  – Level 1: vector-only operations (e.g. vector addition)
  – Level 2: matrix-vector operations (e.g. matrix-vector multiplication)
  – Level 3: matrix-matrix operations (e.g. matrix multiplication)
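Representative single-precision routines at each level (names from the
cuBLAS API): cublasSaxpy (Level 1: y = alpha*x + y), cublasSgemv (Level 2:
y = alpha*A*x + beta*y), cublasSgemm (Level 3: C = alpha*A*B + beta*C).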
cuBLAS Idiosyncrasies

• For legacy compatibility, cuBLAS operates on column-major matrices
  – For example, the matrix
        3 0 0
        6 0 0
        0 2 1
    is stored in memory as the linear array 3 6 0 0 0 2 0 0 1
• cuBLAS also has a legacy API, deprecated as of CUDA 4.0; this lecture
  uses the new cuBLAS API
  – If you find cuBLAS code that doesn't quite match up, you may be looking
    at the old cuBLAS API
cuBLAS Data Management

• Device memory in cuBLAS is allocated as you're used to: cudaMalloc
• Transferring data to/from the device uses cuBLAS-specific functions:
  – cublasGetVector / cublasSetVector
  – cublasGetMatrix / cublasSetMatrix
cuBLAS Data Management

• Example:
    cublasStatus_t cublasSetVector(int n, int elemSize, const void *x,
                                   int incx, void *y, int incy);
  where:
  • n is the number of elements to transfer to the GPU
  • elemSize is the size of each element (e.g. sizeof(int))
  • x is the vector on the host to copy from
  • incx is the stride between consecutive elements of x
  • y is the vector on the GPU to copy to
  • incy is the stride between consecutive elements of y
cuBLAS Data Management

• Example:
    cublasSetVector(5, sizeof(int), h_x, 3, d_x, 2);
  copies five ints, reading every third element of h_x (h_x[0], h_x[3],
  h_x[6], h_x[9], h_x[12]) and writing every second element of d_x (d_x[0],
  d_x[2], d_x[4], d_x[6], d_x[8])
cuBLAS Data Management

• Similarly:
    cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize,
                                   const void *A, int lda, void *B,
                                   int ldb);
  where:
  • rows is the number of rows in the matrix to copy
  • cols is the number of columns in the matrix to copy
  • elemSize is the size of each cell in the matrix (e.g. sizeof(int))
  • A is the source matrix on the host
  • lda is the leading dimension (number of rows) of the underlying array
    for A
  • B is the destination matrix on the GPU
  • ldb is the leading dimension (number of rows) of the underlying array
    for B
cuBLAS Data Management

• Similarly:
    cublasSetMatrix(3, 3, sizeof(int), h_A, 4, d_A, 5);
  copies a 3 x 3 matrix out of a host array with leading dimension 4 into a
  device array with leading dimension 5
cuBLAS Example

• Matrix-vector multiplication
  – Uses 6 of the 10 steps in the common library workflow:
  1. Create a cuBLAS handle using cublasCreate
  2. Allocate device memory for inputs and outputs using cudaMalloc
  3. Populate device memory using cublasSetVector, cublasSetMatrix
  4. Call cublasSgemv to run matrix-vector multiplication on the GPU
  5. Retrieve results from the GPU using cublasGetVector
  6. Release CUDA and cuBLAS resources using cudaFree, cublasDestroy
cuBLAS Example

• You can build and run the example cublas.cu:

    cublasCreate(&handle);

    cudaMalloc((void **)&dA, sizeof(float) * M * N);
    cudaMalloc((void **)&dX, sizeof(float) * N);
    cudaMalloc((void **)&dY, sizeof(float) * M);

    cublasSetVector(N, sizeof(float), X, 1, dX, 1);
    cublasSetVector(M, sizeof(float), Y, 1, dY, 1);
    cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);

    /* Y = alpha * A * X + beta * Y (column-major, no transpose) */
    cublasSgemv(handle, CUBLAS_OP_N, M, N, &alpha, dA, M, dX, 1,
                &beta, dY, 1);

    cublasGetVector(M, sizeof(float), dY, 1, Y, 1);

    /* for sgemm (matrix-matrix multiplication) */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA,
                &alpha, d_B, matrix_size.uiWB,
                d_A, matrix_size.uiWA,
                &beta, d_C, matrix_size.uiWA);
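To compile, link against the cuBLAS library (file name as given above):

    $ nvcc cublas.cu -lcublas -o cublas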
cuBLAS Portability

• Porting to cuBLAS from BLAS is a straightforward process. In general, it
  requires:
  – Adding device memory allocation/freeing (cudaMalloc, cudaFree)
  – Adding device transfer functions (cublasSetVector, cublasSetMatrix,
    etc.)
  – Transforming library routine calls from BLAS to cuBLAS
    (e.g. cblas_sgemv → cublasSgemv)
cuBLAS Portability

• Some common optimizations following a naive BLAS → cuBLAS port are:
  – Reusing device memory allocations
  – Removing redundant data transfers to and from the device
  – Adding streamed execution using cublasSetStream
cuBLAS Summary

• cuBLAS makes accelerating legacy BLAS applications simple and easy
  – Very little added code
  – Straightforward mapping from BLAS routines to cuBLAS routines
  – Flexible API improves portability
• For new linear algebra applications, cuBLAS offers a high-performance
  alternative to BLAS
  – High-performance kernels with very little programmer time
cuFFT

• cuFFT offers an optimized implementation of the fast Fourier transform
cuFFT Configuration

• In cuFFT terminology, plans == handles
  – A cuFFT plan defines a single FFT transformation to be performed
• cuFFT uses plans to derive the internal memory allocations, transfers,
  and kernels required to implement the desired transform
• Plans are created with:
    cufftResult cufftPlan1d(cufftHandle *plan, int nx, cufftType type,
                            int batch);
    cufftResult cufftPlan2d(cufftHandle *plan, int nx, int ny,
                            cufftType type);
    cufftResult cufftPlan3d(cufftHandle *plan, int nx, int ny, int nz,
                            cufftType type);
cuFFT Configuration

• cufftType refers to the data types of a transformation, for example:
  – Complex-to-complex: CUFFT_C2C
  – Real-to-complex: CUFFT_R2C
  – Complex-to-real: CUFFT_C2R
cuFFT Example

• Creating a complex-to-complex 1D cuFFT plan and executing it uses 6 of
  the 10 steps in the common library workflow:
  1. Create and configure a cuFFT plan
  2. Allocate GPU memory for the input samples and output frequencies using
     cudaMalloc
  3. Populate GPU memory with input samples using cudaMemcpy
  4. Execute the plan using a cufftExec* function
  5. Retrieve the calculated frequencies from GPU memory using cudaMemcpy
  6. Release CUDA and cuFFT resources using cudaFree, cufftDestroy
cuFFT Example

• You can build and run an example cufft.cu:

    cufftPlan1d(&plan, N, CUFFT_C2C, 1);

    cudaMalloc((void **)&dComplexSamples, sizeof(cufftComplex) * N);

    cudaMemcpy(dComplexSamples, complexSamples, sizeof(cufftComplex) * N,
               cudaMemcpyHostToDevice);

    /* in-place forward transform */
    cufftExecC2C(plan, dComplexSamples, dComplexSamples, CUFFT_FORWARD);

    cudaMemcpy(complexFreq, dComplexSamples, sizeof(cufftComplex) * N,
               cudaMemcpyDeviceToHost);
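To compile, link against the cuFFT library:

    $ nvcc cufft.cu -lcufft -o cufft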
cuFFT Summary

• Like cuBLAS, cuFFT offers a high-level and usable API for porting legacy
  FFT applications or writing new ones
  – cuFFT's API is deliberately similar to the industry-standard library
    FFTW to improve programmability
  – Offers higher performance for little developer effort
Drop-In CUDA Libraries

• Drop-in CUDA libraries allow seamless integration of CUDA performance
  with existing code bases
  – Full compatibility with industry-standard libraries; they expose the
    same external APIs
  – BLAS → NVBLAS
  – FFTW → cuFFTW
• Two ways to use drop-in libraries:
  – Re-link to the CUDA library
  – LD_PRELOAD the CUDA library before its host equivalent
Drop-In CUDA Libraries

• Re-linking legacy applications to CUDA libraries:
  – Suppose you have a legacy application that relies on BLAS:
      $ gcc app.c -lblas -o app
  – Recompiling with NVBLAS linked in will automatically accelerate all
    BLAS calls:
      $ gcc app.c -lnvblas -o app
• Alternatively, simply set LD_PRELOAD when executing the application:
      $ env LD_PRELOAD=libnvblas.so ./app
Survey of CUDA Library Performance

• We've seen that cuBLAS and cuFFT are high-level, programmable libraries
  (like their host counterparts)
  – No CUDA-specific concepts (e.g. thread blocks, pinned memory, etc.)
• Let's do a brief survey of CUDA library performance to see the
  performance improvements possible
  – We focus on the same libraries (cuBLAS and cuFFT), but similar data on
    other libraries is available in the book and online
Survey of CUDA Library Performance

[Figures: performance comparisons of cuBLAS and cuFFT against their
host-library counterparts]
Suggested Readings

1. All sections in Chapter 8 of Professional CUDA C Programming except
   "Using OpenACC"
2. cuSPARSE User Guide. 2014. http://docs.nvidia.com/cuda/cusparse/
3. cuBLAS User Guide. 2014. http://docs.nvidia.com/cuda/cublas/
4. cuRAND User Guide. 2014. http://docs.nvidia.com/cuda/curand/
5. cuFFT User Guide. 2014. http://docs.nvidia.com/cuda/cufft/
6. CUDA Toolkit 5.0 Performance Report. 2013.
   http://on-demand.gputechconf.com/gtc-express/2013/presentations/cuda--5.0-math-libraries-performance.pdf
Manycore GPU Architectures and Programming: Outline

• Introduction
  – GPU architectures, GPGPUs, and CUDA
• GPU Execution model
• CUDA Programming model
• Working with Memory in CUDA
  – Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and library
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming model
  – OpenACC and OpenMP
GPU Parallelization

• A many-faceted process
  – Performance varies dramatically depending on the implementation of the
    same algorithm
    • From naive to highly optimized versions
• Many types of optimizations for GPUs:
  – Shared memory
  – Constant memory
  – Global memory access patterns
  – Warp shuffle instructions
  – Computation-communication overlap
  – CUDA compiler flags, e.g. loop unrolling, etc.
  – Increasing parallelism
  – ...
Optimization Opportunities

• Kernel-level optimization:
  – Exposing sufficient parallelism
  – Optimizing memory access
  – Optimizing instruction execution
• Host-GPU optimization:
  – E.g. kernel and data transfer overlap using CUDA streams
• Profile-driven optimization improves the selection of optimizations
Kernel-Level Optimization

• Exposing sufficient parallelism
  – Increase the amount of concurrent work on the GPU so as to saturate
    instruction and memory bandwidth
• Can be accomplished by:
  1. More concurrently active warps per SM (warp-level parallelism)
  2. More independent work assigned to each thread (instruction-level
     parallelism)
Kernel-Level Optimization

• Increasing the number of warps per SM/thread block does not guarantee a
  performance improvement
  – It results in fewer per-SM resources assigned to each thread
    (e.g. registers, shared memory)
  – The trade-off: less parallelism with more per-thread resources vs. more
    parallelism with fewer per-thread resources
Kernel-Level Optimization

• Creating more independent work per thread
  – Loop unrolling and other code transformations expose instruction-level
    parallelism
  – But they may also increase per-thread resource requirements

    int sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += a[i];
    }

  Requires 2 registers (sum, i); no instruction-level parallelism.

    int i1 = a[0];
    int i2 = a[1];
    int i3 = a[2];
    int i4 = a[3];
    int sum = i1 + i2 + i3 + i4;

  Requires 5 registers (i1, i2, i3, i4, sum); four-way instruction-level
  parallelism.
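nvcc can apply the same transformation automatically; a minimal sketch
using CUDA's #pragma unroll directive:

    int sum = 0;
    #pragma unroll
    for (int i = 0; i < 4; i++) {   // compiler replicates the body 4 times
        sum += a[i];
    }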
Kernel-Level Optimization

• Optimizing memory access to maximize:
  – Memory bandwidth utilization (efficiency of memory access patterns)
  – Memory access concurrency (sufficient memory requests to hide memory
    latency)
Kernel-Level Optimization

• Aligned, coalesced global and shared memory accesses optimize memory
  bandwidth utilization
  – Constant memory prefers a broadcast access pattern
Kernel-Level Optimization

• Optimizing instruction execution focuses on:
  – Hiding instruction latency by keeping a sufficient number of warps
    active
  – Avoiding divergent execution paths within warps
    • e.g. if statements inside a kernel
• Experimenting with thread execution configurations can produce unexpected
  performance gains from more or fewer active warps
• Divergent execution within a warp reduces parallelism, because the warp's
  execution of multiple code paths is serialized
Profile-Driven Optimization

• Profile-driven optimization is an iterative process that optimizes a
  program based on quantitative profiling information
  – As we apply optimization techniques, we analyze the results using
    nvprof and decide whether they are beneficial
Profile-Driven Optimization

• The optimization cycle: gather profiling information → identify hotspots
  → determine performance inhibitors → optimize → repeat
• The key challenge in profile-driven optimization is determining the
  performance inhibitors in hotspots
  – nvvp and nvprof are invaluable tools for this

Profile-Driven Optimization
• nvprof profiling modes (example command lines below):
  – Summary mode: the default mode; displays execution-time information on
    high-level actions such as kernels or data transfers
  – Trace mode: provides a timeline of CUDA events or actions in
    chronological order
  – Event/metric summary mode: aggregates event/metric counts across all
    kernel invocations
  – Event/metric trace mode: displays event/metric counts for each kernel
    invocation
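The corresponding nvprof command lines (./app stands for the application
being profiled; the metric and event names are illustrative):

    $ nvprof ./app                                # summary mode (default)
    $ nvprof --print-gpu-trace ./app              # trace mode
    $ nvprof --metrics achieved_occupancy ./app   # metric summary mode
    $ nvprof --aggregate-mode off --events branch ./app   # event trace mode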
Profile-Driven Optimization
• The NVIDIA Visual Profiler (nvvp) is also a powerful tool for guiding
  profile-driven optimization
  – It offers a number of views to inspect different parts of a CUDA
    application
CUDA Debugging

• An important part of CUDA software development is the ability to debug
  CUDA applications
• CUDA offers a number of debugging tools, split into two categories:
  – Kernel debugging
  – Memory debugging
CUDA Debugging

• Kernel debugging tools help us analyze the correctness of running CUDA
  kernels by inspecting running application state
• Memory debugging tools help us detect application bugs by observing
  irregular or out-of-bounds memory accesses performed by CUDA kernels
CUDA Kernel Debugging

• The primary tool for the job: cuda-gdb
  – Intentionally built to be similar to the host debugging tool gdb
  – Requires compilation with special flags to be useful:
      $ nvcc -g -G foo.cu -o foo
• Once an application is compiled in debug mode, it can be run under
  cuda-gdb using:
      $ cuda-gdb foo
      ...
      (cuda-gdb)
CUDA Kernel Debugging

• cuda-gdb uses most of the same commands as gdb
• One main difference is the idea of CUDA focus: the current thread that
  cuda-gdb is focused on and against which all commands run
  – Query the current focus using:
      (cuda-gdb) cuda thread lane warp block sm grid device kernel
  – Example of setting focus to the 128th thread in the current block:
      (cuda-gdb) cuda thread (128)
CUDA Kernel Debugging

• printf is another form of CUDA kernel debugging
  – Only available on devices of compute capability 2.0 or higher
• Prints are buffered on the device and periodically transferred back to
  the host for display
  – The size of this buffer is configurable with cudaDeviceSetLimit
• Buffer contents are transferred to the host after any CUDA kernel launch,
  any host-side explicit synchronization, or any synchronous memory copy
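A minimal sketch (the kernel name is hypothetical; cudaLimitPrintfFifoSize
is the relevant device limit):

    __global__ void debug_kernel(int *data) {
        printf("thread %d sees %d\n", threadIdx.x, data[threadIdx.x]);
    }

    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 10 * 1024 * 1024); // 10 MB buffer
    debug_kernel<<<1, 32>>>(d_data);
    cudaDeviceSynchronize();   // also flushes the device printf buffer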
CUDA Memory Debugging

• Memory debugging detects memory errors in CUDA kernels that are likely
  indicative of bugs in the code
  – For example: out-of-bounds memory accesses
• There is a single tool for memory debugging, cuda-memcheck, which
  contains two utilities:
  – The memcheck tool
  – The racecheck tool
CUDA Memory Debugging

• The compilation process for cuda-memcheck is more involved than for
  cuda-gdb
  – Building with full debug options affects performance, which may make
    memory errors harder to hit
  – Applications should always be compiled with -lineinfo
  – Applications should also be compiled to include symbol information, but
    doing this varies by platform:
    • Linux: -Xcompiler -rdynamic
    • Windows: -Xcompiler /Zi
    • ...
CUDA Memory Debugging

• Once the application is compiled, memcheck can check for 6 different
  types of memory errors:
  1. Memory access error: out-of-bounds or misaligned memory access
  2. Hardware exception: error reported by hardware
  3. malloc/free errors: improper use of CUDA dynamic memory allocation
  4. CUDA API errors: any error return code from a CUDA API call
  5. cudaMalloc memory leaks: cudaMalloc allocations that are never
     cudaFree'd
  6. Device heap memory leaks: dynamic memory allocations on the device
     heap that are never freed
CUDA Memory Debugging

• The two cuda-memcheck utilities offer very different capabilities:
  – memcheck performs a wide range of memory correctness checks
  – racecheck verifies that __shared__ memory usage is correct in an
    application, a particularly difficult task to perform manually
• cuda-memcheck offers a more automated approach to debugging than
  cuda-gdb (example invocations below)
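Typical invocations (./foo stands for the compiled application):

    $ cuda-memcheck ./foo                     # run the default memcheck tool
    $ cuda-memcheck --tool racecheck ./foo    # check for __shared__ memory races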
CUDA Error Handling

• Proper error handling is an important part of robust CUDA deployment
  – Every CUDA function returns an error code that must be checked
  – If asynchronous operations are used, this error may be the result of a
    different asynchronous operation failing
  – A return code of cudaSuccess indicates success
• CUDA also offers a number of error-handling functions
CUDA Error Handling

    cudaError_t cudaGetLastError();
  – Retrieves the latest CUDA error and clears the CUDA runtime's internal
    error state to cudaSuccess

    cudaError_t cudaPeekAtLastError();
  – Retrieves the latest CUDA error, but does not clear the CUDA runtime's
    internal error state

    const char *cudaGetErrorString(cudaError_t error);
  – Fetches a human-readable string for the provided error
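A common idiom (a sketch, not from the slides; requires <stdio.h> and
<stdlib.h>) wraps these functions in a macro so that every call site is
checked:

    #define CHECK(call)                                               \
    {                                                                 \
        const cudaError_t err = (call);                               \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s:%d: %s\n",                 \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(1);                                                  \
        }                                                             \
    }

    /* usage */
    CHECK(cudaMemcpy(d_arr, h_arr, nbytes, cudaMemcpyHostToDevice));
    kernel<<<grid, block>>>(d_arr);
    CHECK(cudaGetLastError());   /* catches kernel launch errors */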
Suggested Readings

1. Chapter 10 in Professional CUDA C Programming
2. Adam DeConinck. Introduction to the CUDA Toolkit as an Application Build
   Tool. GTC 2013.
   http://on-demand.gputechconf.com/gtc/2013/webinar/cuda-toolkit-as-build-tool.pdf
3. Sandarbh Jain. CUDA Profiling Tools. GTC 2014.
   http://on-demand.gputechconf.com/gtc/2014/presentations/S4587-cuda-profiling-tools.pdf
4. Thomas Bradley. GPU Performance Analysis and Optimization. 2012.
   http://people.maths.ox.ac.uk/gilesm/cuda/lecs/NV_Profiling_lowres.pdf
5. Julien Demouth. CUDA Optimization with NVIDIA Nsight(TM) Visual Studio
   Edition: A Case Study. GTC 2014.
   http://on-demand.gputechconf.com/gtc/2014/presentations/S4160-cuda-optimization-nvidia-nsight-vse-case-study.pdf