Lecture: Manycore GPU Architectures and Programming, Part 3
-- Streaming, Library and Tuning

CSCE569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
yanyh@cse.sc.edu
https://passlab.github.io/CSCE569/
Manycore GPU Architectures and Programming: Outline

• Introduction
  – GPU architectures, GPGPUs, and CUDA
• GPU Execution model
• CUDA Programming model
• Working with Memory in CUDA
  – Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and library
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming model
  – OpenACC and OpenMP
Offloading Processing Flow

1. Copy input data from CPU memory to GPU memory (over the PCI bus)
2. Load the GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory back to CPU memory (over the PCI bus)
Overlapping Communication and Computation

[Figure: timeline of PCIe-bus copies interleaved with GPU compute, so that
each chunk's transfer overlaps another chunk's computation]

• Three sequential steps for a single kernel execution
• Multiple kernels
  – Asynchrony is a first-class citizen of most GPU programming frameworks
  – Computation-communication overlap is a common technique in GPU
    programming
Abstract Concurrency

• Four kinds of action overlap are possible in CUDA:
  1. Overlapped host computation and device computation
  2. Overlapped host computation and host-device data transfer
  3. Overlapped host-device data transfer and device computation
  4. Concurrent device computation (multiple kernels)
• CUDA streams are the mechanism for achieving each of these types of
  overlap (a minimal sketch of type 1 follows)
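Overlap type 1 comes essentially for free, because kernel launches are
asynchronous with respect to the host. A minimal sketch (the kernel and the
host function names are hypothetical):

    my_kernel<<<grid, block>>>(d_data);  // returns immediately; GPU works in background
    do_host_work();                      // host computation overlaps device computation
    cudaDeviceSynchronize();             // block the host until the kernel finishes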
CUDA Streams

• A CUDA stream is a FIFO queue of CUDA actions to be performed
  – Adding a new action to a stream is asynchronous with respect to the host
  – Queued actions are executed in FIFO order as CUDA resources allow
  – Every action (kernel launch, cudaMemcpy, etc.) runs in an implicit or
    explicit stream

[Figure: a CUDA application enqueues actions (kernel, cudaMemcpy,
cudaMemcpy) at the tail of a stream; the CUDA runtime and GPU consume them
from the head]
CUDA Streams

• Two types of streams in a CUDA program
  – The implicitly declared stream (NULL stream)
  – Explicitly declared streams (non-NULL streams)
• Up until now, all code has been using the NULL stream by default:

    cudaMemcpy(...);
    kernel<<<...>>>(...);
    cudaMemcpy(...);

• Non-NULL streams require manual allocation and management by the CUDA
  programmer
CUDA Streams

• To create a CUDA stream:
    cudaError_t cudaStreamCreate(cudaStream_t *stream);
• To destroy a CUDA stream:
    cudaError_t cudaStreamDestroy(cudaStream_t stream);
• To wait for all actions in a CUDA stream to finish:
    cudaError_t cudaStreamSynchronize(cudaStream_t stream);
• To check whether all actions in a CUDA stream have finished:
    cudaError_t cudaStreamQuery(cudaStream_t stream);
CUDA Streams

• cudaMemcpyAsync: asynchronous memcpy
    cudaError_t cudaMemcpyAsync(void *dst, const void *src, size_t count,
                                cudaMemcpyKind kind, cudaStream_t stream = 0);
• cudaMemcpyAsync does the same as cudaMemcpy, but may return before the
  transfer is actually complete
• Pinned host memory is a requirement for cudaMemcpyAsync
  – Memory that is resident in physical pages and cannot be swapped out,
    also referred to as page-locked
    • Recall that malloc normally reserves virtual address space first;
      physical pages are allocated later, on demand
  – DMA performs the actual data transfer
CUDA Streams

• Performing a cudaMemcpyAsync:

    int *h_arr, *d_arr;
    cudaStream_t stream;
    cudaMalloc((void **)&d_arr, nbytes);
    cudaMallocHost((void **)&h_arr, nbytes);   // page-locked memory allocation
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d_arr, h_arr, nbytes, cudaMemcpyHostToDevice,
                    stream);                   // call returns before transfer completes
    ...                                        // do something while data is being moved
    cudaStreamSynchronize(stream);             // sync to make sure operations complete

    cudaFree(d_arr); cudaFreeHost(h_arr); cudaStreamDestroy(stream);
CUDA Streams

• Associate kernel launches with a non-NULL stream using the fourth launch
  configuration parameter:

    kernel<<<nblocks, threads_per_block, smem_size, stream>>>(...);

• The effects of cudaMemcpyAsync and kernel launching:
  – Operations are put in the stream's queue for execution
  – The actual operations may not have happened yet when the call returns
• A host-side timer therefore measures only the time of the calls themselves
  – Not the actual duration of the operations
CUDA Streams

• Vector sum example, A + B = C
• Partition the vectors and use CUDA streams to overlap copy and compute

[Figure: in the NULL stream, Copy A, Copy B, vector_sum<<<...>>>, and Copy C
run back to back; with four streams A-D, each stream runs its own
copy-copy-compute-copy pipeline and the stages of different streams overlap]
CUDA Streams

• How can this be implemented in code? (Note: h_A, h_B, and h_C must be
  pinned for the asynchronous copies to actually overlap.)

    cudaStream_t streams[nstreams];
    for (int i = 0; i < nstreams; i++)
        cudaStreamCreate(&streams[i]);      /* initialize the streams */
    int eles_per_stream = N / nstreams;

    for (int i = 0; i < nstreams; i++) {
        int offset = i * eles_per_stream;
        cudaMemcpyAsync(&d_A[offset], &h_A[offset],
                        eles_per_stream * sizeof(int),
                        cudaMemcpyHostToDevice, streams[i]);
        cudaMemcpyAsync(&d_B[offset], &h_B[offset],
                        eles_per_stream * sizeof(int),
                        cudaMemcpyHostToDevice, streams[i]);
        vector_sum<<<..., streams[i]>>>(d_A + offset, d_B + offset,
                                        d_C + offset);
        cudaMemcpyAsync(&h_C[offset], &d_C[offset],
                        eles_per_stream * sizeof(int),
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    for (int i = 0; i < nstreams; i++)
        cudaStreamSynchronize(streams[i]);
CUDA Events

• Timing asynchronous operations
  – A host-side timer only measures the time for the call, not the actual
    time of the data movement or kernel execution
• Events mark specific points in stream execution

[Figure: an Event recorded in a stream between vector_sum<<<...>>> and Copy C]

• Events are manually created and destroyed:
    cudaError_t cudaEventCreate(cudaEvent_t *event);
    cudaError_t cudaEventDestroy(cudaEvent_t event);
CUDA Events

• To add an event to a CUDA stream:
    cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream);
  – The event marks the point in time after all preceding actions in stream
    complete, and before any actions added after cudaEventRecord run
• For the host to wait for some CUDA actions to finish:
    cudaError_t cudaEventSynchronize(cudaEvent_t event);
  – Waits for all the operations before this event to complete, but not
    those after it

[Figure: the host can synchronize on an Event recorded after
vector_sum<<<...>>> without waiting for Copy C]
CUDA Events

• To check whether an event has been reached without waiting for it:
    cudaError_t cudaEventQuery(cudaEvent_t event);
• To get the elapsed milliseconds between two events:
    cudaError_t cudaEventElapsedTime(float *ms, cudaEvent_t start,
                                     cudaEvent_t stop);

[Figure: start and stop events bracketing Copy A, Copy B,
vector_sum<<<...>>>, and Copy C]
CUDA Events

• In code:

    float time;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernel<<<grid, block>>>(arguments);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&time, start, stop);   // elapsed kernel time in ms
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
Implicit and Explicit Synchronization

• Two types of host-device synchronization:
  – Implicit synchronization causes the host to wait on the GPU as a side
    effect of other CUDA actions
  – Explicit synchronization causes the host to wait on the GPU because the
    programmer has asked for that behavior
Implicit and Explicit Synchronization

• Five CUDA operations that include implicit synchronization:
  1. A pinned host memory allocation (cudaMallocHost, cudaHostAlloc)
  2. A device memory allocation (cudaMalloc)
  3. A device memset (cudaMemset)
  4. A memory copy between two addresses on the same device
     (cudaMemcpy(..., cudaMemcpyDeviceToDevice))
  5. A modification of the L1/shared memory configuration
     (cudaThreadSetCacheConfig, cudaDeviceSetCacheConfig)
Implicit and Explicit Synchronization

• Four ways to explicitly synchronize in CUDA:
  1. Synchronize on a device:
     cudaError_t cudaDeviceSynchronize();
  2. Synchronize on a stream:
     cudaError_t cudaStreamSynchronize(cudaStream_t stream);
  3. Synchronize on an event:
     cudaError_t cudaEventSynchronize(cudaEvent_t event);
  4. Synchronize across streams using an event:
     cudaError_t cudaStreamWaitEvent(cudaStream_t stream,
                                     cudaEvent_t event);
Implicit and Explicit Synchronization

• cudaStreamWaitEvent adds inter-stream dependencies
  – Causes the specified stream to wait on the specified event before
    executing any further actions
  – The event does not need to be an event recorded in stream

    cudaEventRecord(event, stream1);
    ...
    cudaStreamWaitEvent(stream2, event);
    ...

  – No actions added to stream2 after the call to cudaStreamWaitEvent will
    execute until the event is satisfied (see the fuller sketch below)
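A minimal sketch of the pattern (assumes d_in, h_in, nbytes, both streams,
and the event were created earlier; the kernel name is hypothetical):

    cudaMemcpyAsync(d_in, h_in, nbytes, cudaMemcpyHostToDevice, stream1);
    cudaEventRecord(event, stream1);             // marks completion of the copy
    cudaStreamWaitEvent(stream2, event);         // stream2 waits for the copy
    process<<<grid, block, 0, stream2>>>(d_in);  // safe: runs only after the copy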
Suggested Readings

1. Chapter 6 in Professional CUDA C Programming
2. Justin Luitjens. CUDA Streams: Best Practices and Common Pitfalls. GTC 2014.
   http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf
3. Steve Rennich. CUDA C/C++ Streams and Concurrency. 2011.
   http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
Manycore GPU Architectures and Programming: Outline

• Introduction
  – GPU architectures, GPGPUs, and CUDA
• GPU Execution model
• CUDA Programming model
• Working with Memory in CUDA
  – Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and library
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming model
  – OpenACC and OpenMP
CUDA Libraries

• Pre-packaged, expertly optimized functions that implement commonly useful
  operations
  – Vector addition, matrix-vector, matrix-matrix, FFT, etc.
• Advantages of CUDA libraries:
  – Support a wide range of application domains
  – Highly usable, high-level APIs that are familiar to domain experts
  – Tuned by CUDA experts to perform well across platforms and datasets
  – Often offer the quickest route for porting: simply swap out API calls
  – Low maintenance: the developer of the library takes on responsibility
    for bug fixes and feature requests
CUDA Libraries

[Figure: the NVIDIA GPU-accelerated library ecosystem]
https://developer.nvidia.com/gpu-accelerated-libraries
Workflow to Use a CUDA Library

1. Create a library-specific handle that manages contextual information
   useful for the library's operation
   – Many CUDA libraries have the concept of a handle, which stores opaque
     library-specific information on the host that many library functions
     access
   – It is the programmer's responsibility to manage this handle
   – For example: cublasHandle_t, cufftHandle, cusparseHandle_t,
     curandGenerator_t
2. Allocate device memory for inputs and outputs to the library function
   – Use cudaMalloc as usual
Common Library Workflow

3. If inputs are not already in a library-supported format, convert them to
   be accessible by the library
   – Many CUDA libraries only accept data in a specific format
   – For example: column-major vs. row-major arrays
4. Populate the pre-allocated device memory with inputs in a supported
   format
   – In many cases this step simply implies a cudaMemcpy or one of its
     variants to make the data accessible on the GPU
   – Some libraries provide custom transfer functions; for example,
     cublasSetVector optimizes strided copies for the cuBLAS library
Common Library Workflow

5. Configure the library computation to be executed
   – In some libraries, this is a no-op
   – Others require additional metadata to execute the library computation
     correctly
   – In some cases this configuration takes the form of extra parameters
     passed to library functions; in others, fields are set in the library
     handle
6. Execute a library call that offloads the desired computation to the GPU
   – No GPU-specific knowledge required
Common Library Workflow

7. Retrieve the results of that computation from device memory, possibly in
   a library-determined format
   – Again, this may be as simple as a cudaMemcpy or may require a
     library-specific function
8. If necessary, convert the retrieved data to the application's native
   format
   – If a conversion to a library-specific format was necessary, this step
     ensures the application can now use the calculated data
   – In general, it is best to keep the application format and the library
     format the same, reducing overhead from repeated conversions
Common Library Workflow

9. Release CUDA resources
   – Includes the usual CUDA cleanup (cudaFree, cudaStreamDestroy, etc.)
     plus any library-specific cleanup
10. Continue with the remainder of the application
Common Library Workflow

• Not all libraries follow this workflow, and not all libraries require
  every step in it
  – In fact, for many libraries many steps are skipped
  – Keeping this workflow in mind will help give you context on what the
    library might be doing behind the scenes and where you are in the
    process
• Next, we'll take a look at two commonly useful libraries
  – Try to keep the common workflow in mind while we work with them
cuBLAS

• cuBLAS is a port of a popular linear algebra library, BLAS
• cuBLAS (like BLAS) splits its subroutines into multiple levels based on
  the data types processed:
  – Level 1: vector-only operations (e.g. vector addition)
  – Level 2: matrix-vector operations (e.g. matrix-vector multiplication)
  – Level 3: matrix-matrix operations (e.g. matrix multiplication)
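Representative single-precision routines at each level (names from the
cuBLAS API): cublasSaxpy (Level 1: y = alpha*x + y), cublasSgemv (Level 2:
y = alpha*A*x + beta*y), cublasSgemm (Level 3: C = alpha*A*B + beta*C).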
cuBLAS Idiosyncrasies

• For legacy compatibility, cuBLAS operates on column-major matrices
  – For example, the matrix
        3 0 0
        6 0 0
        0 2 1
    is stored in memory as the linear array 3 6 0 0 0 2 0 0 1
• cuBLAS also has a legacy API, deprecated as of CUDA 4.0; this lecture
  uses the new cuBLAS API
  – If you find cuBLAS code that doesn't quite match up, you may be looking
    at the old cuBLAS API
cuBLAS Data Management

• Device memory in cuBLAS is allocated as you're used to: cudaMalloc
• Transferring data to/from the device uses cuBLAS-specific functions:
  – cublasGetVector / cublasSetVector
  – cublasGetMatrix / cublasSetMatrix
cuBLAS Data Management

• Example:
    cublasStatus_t cublasSetVector(int n, int elemSize, const void *x,
                                   int incx, void *y, int incy);
  where:
  • n is the number of elements to transfer to the GPU
  • elemSize is the size of each element (e.g. sizeof(int))
  • x is the vector on the host to copy from
  • incx is the stride between consecutive elements of x
  • y is the vector on the GPU to copy to
  • incy is the stride between consecutive elements of y
cuBLAS Data Management

• Example:
    cublasSetVector(5, sizeof(int), h_x, 3, d_x, 2);
  copies five ints, reading every third element of h_x (h_x[0], h_x[3],
  h_x[6], h_x[9], h_x[12]) and writing every second element of d_x (d_x[0],
  d_x[2], d_x[4], d_x[6], d_x[8])
cuBLAS Data Management

• Similarly:
    cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize,
                                   const void *A, int lda, void *B,
                                   int ldb);
  where:
  • rows is the number of rows in the matrix to copy
  • cols is the number of columns in the matrix to copy
  • elemSize is the size of each cell in the matrix (e.g. sizeof(int))
  • A is the source matrix on the host
  • lda is the leading dimension (number of rows) of the underlying array
    for A
  • B is the destination matrix on the GPU
  • ldb is the leading dimension (number of rows) of the underlying array
    for B
cuBLAS Data Management

• Similarly:
    cublasSetMatrix(3, 3, sizeof(int), h_A, 4, d_A, 5);
  copies a 3 x 3 matrix out of a host array with leading dimension 4 into a
  device array with leading dimension 5
cuBLAS Example

• Matrix-vector multiplication
  – Uses 6 of the 10 steps in the common library workflow:
  1. Create a cuBLAS handle using cublasCreate
  2. Allocate device memory for inputs and outputs using cudaMalloc
  3. Populate device memory using cublasSetVector, cublasSetMatrix
  4. Call cublasSgemv to run matrix-vector multiplication on the GPU
  5. Retrieve results from the GPU using cublasGetVector
  6. Release CUDA and cuBLAS resources using cudaFree, cublasDestroy
cuBLAS Example

• You can build and run the example cublas.cu:

    cublasCreate(&handle);

    cudaMalloc((void **)&dA, sizeof(float) * M * N);
    cudaMalloc((void **)&dX, sizeof(float) * N);
    cudaMalloc((void **)&dY, sizeof(float) * M);

    cublasSetVector(N, sizeof(float), X, 1, dX, 1);
    cublasSetVector(M, sizeof(float), Y, 1, dY, 1);
    cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);

    /* Y = alpha * A * X + beta * Y (column-major, no transpose) */
    cublasSgemv(handle, CUBLAS_OP_N, M, N, &alpha, dA, M, dX, 1,
                &beta, dY, 1);

    cublasGetVector(M, sizeof(float), dY, 1, Y, 1);

    /* for sgemm (matrix-matrix multiplication) */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA,
                &alpha, d_B, matrix_size.uiWB,
                d_A, matrix_size.uiWA,
                &beta, d_C, matrix_size.uiWA);
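To compile, link against the cuBLAS library (file name as given above):

    $ nvcc cublas.cu -lcublas -o cublas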
cuBLAS Portability

• Porting to cuBLAS from BLAS is a straightforward process. In general, it
  requires:
  – Adding device memory allocation/freeing (cudaMalloc, cudaFree)
  – Adding device transfer functions (cublasSetVector, cublasSetMatrix,
    etc.)
  – Transforming library routine calls from BLAS to cuBLAS
    (e.g. cblas_sgemv → cublasSgemv)
cuBLAS Portability

• Some common optimizations following a naive BLAS → cuBLAS port are:
  – Reusing device memory allocations
  – Removing redundant data transfers to and from the device
  – Adding streamed execution using cublasSetStream
cuBLAS Summary

• cuBLAS makes accelerating legacy BLAS applications simple and easy
  – Very little added code
  – Straightforward mapping from BLAS routines to cuBLAS routines
  – Flexible API improves portability
• For new linear algebra applications, cuBLAS offers a high-performance
  alternative to BLAS
  – High-performance kernels with very little programmer time
cuFFT

• cuFFT offers an optimized implementation of the fast Fourier transform
cuFFT Configuration

• In cuFFT terminology, plans == handles
  – A cuFFT plan defines a single FFT transformation to be performed
• cuFFT uses plans to derive the internal memory allocations, transfers,
  and kernels required to implement the desired transform
• Plans are created with:
    cufftResult cufftPlan1d(cufftHandle *plan, int nx, cufftType type,
                            int batch);
    cufftResult cufftPlan2d(cufftHandle *plan, int nx, int ny,
                            cufftType type);
    cufftResult cufftPlan3d(cufftHandle *plan, int nx, int ny, int nz,
                            cufftType type);
cuFFT Configuration

• cufftType refers to the data types of a transformation, for example:
  – Complex-to-complex: CUFFT_C2C
  – Real-to-complex: CUFFT_R2C
  – Complex-to-real: CUFFT_C2R
cuFFT Example

• Creating a complex-to-complex 1D cuFFT plan and executing it uses 6 of
  the 10 steps in the common library workflow:
  1. Create and configure a cuFFT plan
  2. Allocate GPU memory for the input samples and output frequencies using
     cudaMalloc
  3. Populate GPU memory with input samples using cudaMemcpy
  4. Execute the plan using a cufftExec* function
  5. Retrieve the calculated frequencies from GPU memory using cudaMemcpy
  6. Release CUDA and cuFFT resources using cudaFree, cufftDestroy
cuFFT Example

• You can build and run an example cufft.cu:

    cufftPlan1d(&plan, N, CUFFT_C2C, 1);

    cudaMalloc((void **)&dComplexSamples, sizeof(cufftComplex) * N);

    cudaMemcpy(dComplexSamples, complexSamples, sizeof(cufftComplex) * N,
               cudaMemcpyHostToDevice);

    /* in-place forward transform */
    cufftExecC2C(plan, dComplexSamples, dComplexSamples, CUFFT_FORWARD);

    cudaMemcpy(complexFreq, dComplexSamples, sizeof(cufftComplex) * N,
               cudaMemcpyDeviceToHost);
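To compile, link against the cuFFT library:

    $ nvcc cufft.cu -lcufft -o cufft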
cuFFT Summary

• Like cuBLAS, cuFFT offers a high-level and usable API for porting legacy
  FFT applications or writing new ones
  – cuFFT's API is deliberately similar to the industry-standard library
    FFTW to improve programmability
  – Offers higher performance for little developer effort
Drop-In CUDA Libraries

• Drop-in CUDA libraries allow seamless integration of CUDA performance
  with existing code bases
  – Full compatibility with industry-standard libraries; they expose the
    same external APIs
  – BLAS → NVBLAS
  – FFTW → cuFFTW
• Two ways to use drop-in libraries:
  – Re-link to the CUDA library
  – LD_PRELOAD the CUDA library before its host equivalent
Drop-In CUDA Libraries

• Re-linking legacy applications to CUDA libraries:
  – Suppose you have a legacy application that relies on BLAS:
      $ gcc app.c -lblas -o app
  – Recompiling with NVBLAS linked in will automatically accelerate all
    BLAS calls:
      $ gcc app.c -lnvblas -o app
• Alternatively, simply set LD_PRELOAD when executing the application:
      $ env LD_PRELOAD=libnvblas.so ./app
Survey of CUDA Library Performance

• We've seen that cuBLAS and cuFFT are high-level, programmable libraries
  (like their host counterparts)
  – No CUDA-specific concepts (e.g. thread blocks, pinned memory, etc.)
• Let's do a brief survey of CUDA library performance to see the
  performance improvements possible
  – We focus on the same libraries (cuBLAS and cuFFT), but similar data on
    other libraries is available in the book and online
Survey of CUDA Library Performance

[Figures: performance comparisons of cuBLAS and cuFFT against their
host-library counterparts]
Suggested Readings

1. All sections in Chapter 8 of Professional CUDA C Programming except
   "Using OpenACC"
2. cuSPARSE User Guide. 2014. http://docs.nvidia.com/cuda/cusparse/
3. cuBLAS User Guide. 2014. http://docs.nvidia.com/cuda/cublas/
4. cuRAND User Guide. 2014. http://docs.nvidia.com/cuda/curand/
5. cuFFT User Guide. 2014. http://docs.nvidia.com/cuda/cufft/
6. CUDA Toolkit 5.0 Performance Report. 2013.
   http://on-demand.gputechconf.com/gtc-express/2013/presentations/cuda--5.0-math-libraries-performance.pdf
Manycore GPU Architectures and Programming: Outline

• Introduction
  – GPU architectures, GPGPUs, and CUDA
• GPU Execution model
• CUDA Programming model
• Working with Memory in CUDA
  – Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and library
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming model
  – OpenACC and OpenMP
GPU Parallelization

• A many-faceted process
  – Performance varies dramatically depending on the implementation of the
    same algorithm
    • From naive to highly optimized versions
• Many types of optimizations for GPUs:
  – Shared memory
  – Constant memory
  – Global memory access patterns
  – Warp shuffle instructions
  – Computation-communication overlap
  – CUDA compiler flags, e.g. loop unrolling, etc.
  – Increasing parallelism
  – ...
Optimization Opportunities

• Kernel-level optimization:
  – Exposing sufficient parallelism
  – Optimizing memory access
  – Optimizing instruction execution
• Host-GPU optimization:
  – E.g. kernel and data transfer overlap using CUDA streams
• Profile-driven optimization improves the selection of optimizations
Kernel-Level Optimization

• Exposing sufficient parallelism
  – Increase the amount of concurrent work on the GPU so as to saturate
    instruction and memory bandwidth
• Can be accomplished by:
  1. More concurrently active warps per SM (warp-level parallelism)
  2. More independent work assigned to each thread (instruction-level
     parallelism)
Kernel-Level Optimization

• Increasing the number of warps per SM/thread block does not guarantee a
  performance improvement
  – It results in fewer per-SM resources assigned to each thread
    (e.g. registers, shared memory)
  – The trade-off: less parallelism with more per-thread resources vs. more
    parallelism with fewer per-thread resources
Kernel-Level Optimization

• Creating more independent work per thread
  – Loop unrolling and other code transformations expose instruction-level
    parallelism
  – But they may also increase per-thread resource requirements

    int sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += a[i];
    }

  Requires 2 registers (sum, i); no instruction-level parallelism.

    int i1 = a[0];
    int i2 = a[1];
    int i3 = a[2];
    int i4 = a[3];
    int sum = i1 + i2 + i3 + i4;

  Requires 5 registers (i1, i2, i3, i4, sum); four-way instruction-level
  parallelism.
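nvcc can apply the same transformation automatically; a minimal sketch
using CUDA's #pragma unroll directive:

    int sum = 0;
    #pragma unroll
    for (int i = 0; i < 4; i++) {   // compiler replicates the body 4 times
        sum += a[i];
    }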
Kernel-Level Optimization

• Optimizing memory access to maximize:
  – Memory bandwidth utilization (efficiency of memory access patterns)
  – Memory access concurrency (sufficient memory requests to hide memory
    latency)
Kernel-Level Optimization

• Aligned, coalesced global and shared memory accesses optimize memory
  bandwidth utilization
  – Constant memory prefers a broadcast access pattern
Kernel-Level Optimization

• Optimizing instruction execution focuses on:
  – Hiding instruction latency by keeping a sufficient number of warps
    active
  – Avoiding divergent execution paths within warps
    • e.g. if statements inside a kernel
• Experimenting with thread execution configurations can produce unexpected
  performance gains from more or fewer active warps
• Divergent execution within a warp reduces parallelism, because the warp's
  execution of multiple code paths is serialized
Profile-Driven Optimization

• Profile-driven optimization is an iterative process that optimizes a
  program based on quantitative profiling information
  – As we apply optimization techniques, we analyze the results using
    nvprof and decide whether they are beneficial
Profile-Driven Optimization

• The optimization cycle: gather profiling information → identify hotspots
  → determine performance inhibitors → optimize → repeat
• The key challenge in profile-driven optimization is determining the
  performance inhibitors in hotspots
  – nvvp and nvprof are invaluable tools for this

Profile-Driven Optimization
• nvprof profiling modes (example command lines below):
  – Summary mode: the default mode; displays execution-time information on
    high-level actions such as kernels or data transfers
  – Trace mode: provides a timeline of CUDA events or actions in
    chronological order
  – Event/metric summary mode: aggregates event/metric counts across all
    kernel invocations
  – Event/metric trace mode: displays event/metric counts for each kernel
    invocation
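The corresponding nvprof command lines (./app stands for the application
being profiled; the metric and event names are illustrative):

    $ nvprof ./app                                # summary mode (default)
    $ nvprof --print-gpu-trace ./app              # trace mode
    $ nvprof --metrics achieved_occupancy ./app   # metric summary mode
    $ nvprof --aggregate-mode off --events branch ./app   # event trace mode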
Profile-Driven Optimization
• The NVIDIA Visual Profiler (nvvp) is also a powerful tool for guiding
  profile-driven optimization
  – It offers a number of views to inspect different parts of a CUDA
    application
CUDA Debugging

• An important part of CUDA software development is the ability to debug
  CUDA applications
• CUDA offers a number of debugging tools, split into two categories:
  – Kernel debugging
  – Memory debugging
CUDA Debugging

• Kernel debugging tools help us analyze the correctness of running CUDA
  kernels by inspecting running application state
• Memory debugging tools help us detect application bugs by observing
  irregular or out-of-bounds memory accesses performed by CUDA kernels
CUDA Kernel Debugging

• The primary tool for the job: cuda-gdb
  – Intentionally built to be similar to the host debugging tool gdb
  – Requires compilation with special flags to be useful:
      $ nvcc -g -G foo.cu -o foo
• Once an application is compiled in debug mode, it can be run under
  cuda-gdb using:
      $ cuda-gdb foo
      ...
      (cuda-gdb)
CUDA Kernel Debugging

• cuda-gdb uses most of the same commands as gdb
• One main difference is the idea of CUDA focus: the current thread that
  cuda-gdb is focused on and against which all commands run
  – Query the current focus using:
      (cuda-gdb) cuda thread lane warp block sm grid device kernel
  – Example of setting focus to the 128th thread in the current block:
      (cuda-gdb) cuda thread (128)
CUDA Kernel Debugging

• printf is another form of CUDA kernel debugging
  – Only available on devices of compute capability 2.0 or higher
• Prints are buffered on the device and periodically transferred back to
  the host for display
  – The size of this buffer is configurable with cudaDeviceSetLimit
• Buffer contents are transferred to the host after any CUDA kernel launch,
  any host-side explicit synchronization, or any synchronous memory copy
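A minimal sketch (the kernel name is hypothetical; cudaLimitPrintfFifoSize
is the relevant device limit):

    __global__ void debug_kernel(int *data) {
        printf("thread %d sees %d\n", threadIdx.x, data[threadIdx.x]);
    }

    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 10 * 1024 * 1024); // 10 MB buffer
    debug_kernel<<<1, 32>>>(d_data);
    cudaDeviceSynchronize();   // also flushes the device printf buffer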
CUDA Memory Debugging

• Memory debugging detects memory errors in CUDA kernels that are likely
  indicative of bugs in the code
  – For example: out-of-bounds memory accesses
• There is a single tool for memory debugging, cuda-memcheck, which
  contains two utilities:
  – The memcheck tool
  – The racecheck tool
CUDA Memory Debugging

• The compilation process for cuda-memcheck is more involved than for
  cuda-gdb
  – Building with full debug options affects performance, which may make
    memory errors harder to hit
  – Applications should always be compiled with -lineinfo
  – Applications should also be compiled to include symbol information, but
    doing this varies by platform:
    • Linux: -Xcompiler -rdynamic
    • Windows: -Xcompiler /Zi
    • ...
CUDA Memory Debugging

• Once the application is compiled, memcheck can check for 6 different
  types of memory errors:
  1. Memory access error: out-of-bounds or misaligned memory access
  2. Hardware exception: error reported by hardware
  3. malloc/free errors: improper use of CUDA dynamic memory allocation
  4. CUDA API errors: any error return code from a CUDA API call
  5. cudaMalloc memory leaks: cudaMalloc allocations that are never
     cudaFree'd
  6. Device heap memory leaks: dynamic memory allocations on the device
     heap that are never freed
CUDA Memory Debugging

• The two cuda-memcheck utilities offer very different capabilities:
  – memcheck performs a wide range of memory correctness checks
  – racecheck verifies that __shared__ memory usage is correct in an
    application, a particularly difficult task to perform manually
• cuda-memcheck offers a more automated approach to debugging than
  cuda-gdb (example invocations below)
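Typical invocations (./foo stands for the compiled application):

    $ cuda-memcheck ./foo                     # run the default memcheck tool
    $ cuda-memcheck --tool racecheck ./foo    # check for __shared__ memory races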
CUDA Error Handling

• Proper error handling is an important part of robust CUDA deployment
  – Every CUDA function returns an error code that must be checked
  – If asynchronous operations are used, this error may be the result of a
    different asynchronous operation failing
  – A return code of cudaSuccess indicates success
• CUDA also offers a number of error-handling functions
CUDA Error Handling

    cudaError_t cudaGetLastError();
  – Retrieves the latest CUDA error and clears the CUDA runtime's internal
    error state to cudaSuccess

    cudaError_t cudaPeekAtLastError();
  – Retrieves the latest CUDA error, but does not clear the CUDA runtime's
    internal error state

    const char *cudaGetErrorString(cudaError_t error);
  – Fetches a human-readable string for the provided error
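A common idiom (a sketch, not from the slides; requires <stdio.h> and
<stdlib.h>) wraps these functions in a macro so that every call site is
checked:

    #define CHECK(call)                                               \
    {                                                                 \
        const cudaError_t err = (call);                               \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s:%d: %s\n",                 \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(1);                                                  \
        }                                                             \
    }

    /* usage */
    CHECK(cudaMemcpy(d_arr, h_arr, nbytes, cudaMemcpyHostToDevice));
    kernel<<<grid, block>>>(d_arr);
    CHECK(cudaGetLastError());   /* catches kernel launch errors */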
Suggested Readings

1. Chapter 10 in Professional CUDA C Programming
2. Adam DeConinck. Introduction to the CUDA Toolkit as an Application Build
   Tool. GTC 2013.
   http://on-demand.gputechconf.com/gtc/2013/webinar/cuda-toolkit-as-build-tool.pdf
3. Sandarbh Jain. CUDA Profiling Tools. GTC 2014.
   http://on-demand.gputechconf.com/gtc/2014/presentations/S4587-cuda-profiling-tools.pdf
4. Thomas Bradley. GPU Performance Analysis and Optimization. 2012.
   http://people.maths.ox.ac.uk/gilesm/cuda/lecs/NV_Profiling_lowres.pdf
5. Julien Demouth. CUDA Optimization with NVIDIA Nsight(TM) Visual Studio
   Edition: A Case Study. GTC 2014.
   http://on-demand.gputechconf.com/gtc/2014/presentations/S4160-cuda-optimization-nvidia-nsight-vse-case-study.pdf