CUDA - Wikipedia, The Free Encyclopedia


CUDA
A parallel computing platform and programming model

Developer(s): NVIDIA Corporation
Initial release: June 23, 2007
Stable release: 7.0 / March 17, 2015
Operating system: Windows XP and later, Mac OS X, Linux
Platform: Supported GPUs
Type: GPGPU
License: Freeware
Website: www.nvidia.com/object/cuda_home_new.html

From Wikipedia, the free encyclopedia

CUDA, which stands for Compute Unified Device Architecture,[1] is a parallel computing platform and application programming interface (API) model created by NVIDIA.[2] It allows software developers to use a CUDA-enabled graphics processing unit (GPU) for general-purpose processing, an approach known as GPGPU. The CUDA platform is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements.[3]

The CUDA platform is designed to work with programming languages such as C, C++ and Fortran. This accessibility makes it easier for specialists in parallel programming to utilize GPU resources, as opposed to previous API solutions like Direct3D and OpenGL, which required advanced skills in graphics programming. CUDA also supports programming frameworks such as OpenACC and OpenCL.[3]

Contents

1 Background
2 Programming capabilities
3 Advantages
4 Limitations
5 Supported GPUs
6 Version features and specifications
7 Example
8 Language bindings
9 Current and future usages of CUDA architecture
10 See also
11 References
12 External links


Example of CUDA processing flow: 1. Copy data from main memory to GPU memory. 2. CPU instructs the GPU to start processing. 3. The GPU executes the work in parallel in each core. 4. Copy the result from GPU memory back to main memory.
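A minimal host-side sketch of that four-step flow (not from the article; the kernel name scale, the 256-thread block size and the scaling factor are illustrative):

#include <cuda_runtime.h>

// Illustrative kernel: each thread scales one element in place.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void run(float* host_data, int n) {
    float* dev_data;
    cudaMalloc(&dev_data, n * sizeof(float));
    cudaMemcpy(dev_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice); // 1. main memory -> GPU memory
    scale<<<(n + 255) / 256, 256>>>(dev_data, n, 2.0f);                         // 2.-3. CPU launches the kernel, GPU runs it in parallel
    cudaMemcpy(host_data, dev_data, n * sizeof(float), cudaMemcpyDeviceToHost); // 4. GPU memory -> main memory
    cudaFree(dev_data);
}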

    Background

The GPU, as a specialized processor, addresses the demands of real-time, compute-intensive, high-resolution 3D graphics. As of 2012, GPUs have evolved into highly parallel multi-core systems allowing very efficient manipulation of large blocks of data. This design is more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel, such as:

push-relabel maximum flow algorithm
fast sort algorithms of large lists
two-dimensional fast wavelet transform
molecular dynamics simulations

Programming capabilities

The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives such as OpenACC, and extensions to industry-standard programming languages including C, C++ and Fortran. C/C++ programmers use 'CUDA C/C++', compiled with "nvcc", NVIDIA's LLVM-based C/C++ compiler.[4] Fortran programmers can use 'CUDA Fortran', compiled with the PGI CUDA Fortran compiler from The Portland Group.
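A rough sketch of those language extensions (a hypothetical saxpy kernel, not taken from the article): a .cu source file mixes ordinary host C++ with device code, and nvcc compiles both.

// saxpy.cu -- built with NVIDIA's nvcc, e.g.: nvcc saxpy.cu -o saxpy
#include <cuda_runtime.h>

// __device__ marks a helper compiled for, and callable from, the GPU.
__device__ float axpy(float a, float x, float y) { return a * x + y; }

// __global__ marks a kernel: the host launches it, the GPU executes it.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = axpy(a, x[i], y[i]);
}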

In addition to libraries, compiler directives, CUDA C/C++ and CUDA Fortran, the CUDA platform supports other computational interfaces, including the Khronos Group's OpenCL,[5] Microsoft's DirectCompute, OpenGL Compute Shaders (http://www.opengl.org/wiki/Compute_Shader) and C++ AMP.[6] Third-party wrappers are also available for Python, Perl, Fortran, Java, Ruby, Lua, Haskell, R, MATLAB, IDL, and native support in Mathematica.

In the computer game industry, GPUs are used not only for graphics rendering but also in game physics calculations (physical effects such as debris, smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate non-graphical applications in computational biology, cryptography and other fields by an order of magnitude or more.[7][8][9][10][11]

CUDA provides both a low-level API and a higher-level API. The initial CUDA SDK was made public on 15 February 2007, for Microsoft Windows and Linux. Mac OS X support was later added in version 2.0,[12] which supersedes the beta released February 14, 2008.[13] CUDA works with all Nvidia GPUs from the G8x series onwards, including GeForce, Quadro and the Tesla line. CUDA is compatible with most standard operating systems. Nvidia states that programs developed for the G8x series will also work without modification on all future Nvidia video cards, due to binary compatibility.
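A sketch of the difference between the two layers (illustrative; the vecAdd kernel and the PTX file name are assumptions, not from the article). The higher-level runtime API launches kernels directly from C++ with the <<<...>>> syntax, while the low-level driver API manages devices, contexts and modules explicitly:

#include <cuda.h>           // low-level driver API
#include <cuda_runtime.h>   // higher-level runtime API

// Runtime API: implicit context, kernel launched straight from host code:
//   vecAdd<<<blocks, threads>>>(a, b, c, n);

// Driver API: every step is explicit.
void launch_with_driver_api(CUdeviceptr a, CUdeviceptr b, CUdeviceptr c, int n) {
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "vecAdd.ptx");          // PTX/cubin compiled separately
    cuModuleGetFunction(&fn, mod, "vecAdd");
    void* args[] = { &a, &b, &c, &n };
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1,  // grid dimensions
                   256, 1, 1,                  // block dimensions
                   0, 0, args, 0);             // shared memory, stream, arguments
}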

    Advantages


CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs:

Scattered reads: code can read from arbitrary addresses in memory
Unified virtual memory (CUDA 4.0 and above)
Unified memory (CUDA 6.0 and above)
Shared memory: CUDA exposes a fast shared memory region that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups[14] (see the sketch after this list)
Faster downloads and readbacks to and from the GPU
Full support for integer and bitwise operations, including integer texture lookups
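A minimal sketch of the shared-memory point above (illustrative, not from the article): each block stages a tile of input in the on-chip __shared__ region once, synchronizes, and its threads then read neighbours from that user-managed cache instead of global memory.

// Assumes a launch such as blur1d<<<blocks, 256>>>(in, out, n).
__global__ void blur1d(const float* in, float* out, int n) {
    __shared__ float tile[256 + 2];                   // user-managed cache: one tile per block plus a halo element on each side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int lid = threadIdx.x + 1;                        // local index inside the tile

    tile[lid] = (gid < n) ? in[gid] : 0.0f;           // each thread stages one element
    if (threadIdx.x == 0) {                           // the first thread also stages the two halo elements
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
        int right = gid + blockDim.x;
        tile[blockDim.x + 1] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();                                  // wait until the whole tile is in shared memory

    if (gid < n)                                      // neighbours now come from fast on-chip memory
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}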

    Limitations

CUDA does not support the full C standard, as it runs host code through a C++ compiler, which makes some valid C (but invalid C++) code fail to compile.[15][16]
Interoperability with rendering languages such as OpenGL is one-way, with OpenGL having access to registered CUDA memory but CUDA not having access to OpenGL memory.
Copying between host and device memory may incur a performance hit due to system bus bandwidth and latency (this can be partly alleviated with asynchronous memory transfers, handled by the GPU's DMA engine; see the sketch after this list).
Threads should be running in groups of at least 32 for best performance, with the total number of threads numbering in the thousands. Branches in the program code do not affect performance significantly, provided that each of 32 threads takes the same execution path; the SIMD execution model becomes a significant limitation for any inherently divergent task (e.g. traversing a space partitioning data structure during ray tracing).
Unlike OpenCL, CUDA-enabled GPUs are only available from Nvidia.[17]
No emulator or fallback functionality is available for modern revisions.
Valid C/C++ may sometimes be flagged and prevent compilation due to optimization techniques the compiler is required to employ to use limited resources.
A single process must run spread across multiple disjoint memory spaces, unlike other C language runtime environments.
C++ Run-Time Type Information (RTTI) is not supported in CUDA code, due to lack of support in the underlying hardware.
Exception handling is not supported in CUDA code due to the performance overhead that would be incurred with many thousands of parallel threads running.
CUDA (with compute capability 2.x) allows a subset of C++ class functionality; for example, member functions may not be virtual (this restriction will be removed in some future release). [See CUDA C Programming Guide 3.1, Appendix D.6]
In single precision on first-generation CUDA compute capability 1.x devices, denormal numbers are not supported and are instead flushed to zero, and the precision of the division and square root operations is slightly lower than IEEE 754-compliant single-precision math. Devices that support compute capability 2.0 and above support denormal numbers, and the division and square root operations are IEEE 754 compliant by default. However, users can obtain the previous faster gaming-grade math of compute capability 1.x devices if desired by setting compiler flags to disable accurate divisions, disable accurate square roots, and enable flushing denormal numbers to zero.[18]
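A sketch of the asynchronous-transfer mitigation mentioned in the list above (illustrative; the process kernel and the two-stream layout are assumptions): with page-locked host memory and cudaMemcpyAsync, the GPU's DMA engine can copy one chunk while the compute units work on another.

#include <cuda_runtime.h>
#include <cstring>

__global__ void process(float* data, int n) {          // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// dev_buf[0] and dev_buf[1] are device buffers holding chunk elements each.
void process_in_chunks(float* dev_buf[2], const float* src, int chunk, int chunks) {
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    float* pinned;                                      // page-locked host memory is needed
    cudaMallocHost((void**)&pinned, (size_t)chunks * chunk * sizeof(float));  // for truly asynchronous copies
    memcpy(pinned, src, (size_t)chunks * chunk * sizeof(float));

    for (int i = 0; i < chunks; ++i) {
        int s = i % 2;                                  // alternate between the two streams
        cudaMemcpyAsync(dev_buf[s], pinned + (size_t)i * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(dev_buf[s], chunk);
    }
    cudaDeviceSynchronize();                            // wait for all streams to finish
    cudaFreeHost(pinned);
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}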

Supported GPUs

Compute capability table (version of CUDA supported) by GPU and card. Also available directly from Nvidia (http://developer.nvidia.com/cudagpus):


Compute capability (version), microarchitecture, GPUs and cards:

Compute capability 1.0 (Tesla)
GPUs: G80, G92, G92b, G94, G94b
Cards: GeForce GT 420*, GeForce 8800 Ultra, GeForce 8800 GTX, GeForce GT 340*, GeForce GT 330*, GeForce GT 320*, GeForce 315*, GeForce 310*, GeForce 9800 GT, GeForce 9600 GT, GeForce 9400 GT, Quadro FX 5600, Quadro FX 4600, Quadro Plex 2100 S4, Tesla C870, Tesla D870, Tesla S870

Compute capability 1.1 (Tesla)
GPUs: G86, G84, G98, G96, G96b, G94, G94b, G92, G92b
Cards: GeForce G110M, GeForce 9300M GS, GeForce 9200M GS, GeForce 9100M G, GeForce 8400M GT, GeForce 8600 GT, GeForce 8600 GTS, GeForce G105M, Quadro FX 4700 X2, Quadro FX 3700, Quadro FX 1800, Quadro FX 1700, Quadro FX 580, Quadro FX 570, Quadro FX 470, Quadro FX 380, Quadro FX 370, Quadro FX 370 Low Profile, Quadro NVS 450, Quadro NVS 420, Quadro NVS 290, Quadro NVS 295, Quadro Plex 2100 D4, Quadro FX 3800M, Quadro FX 3700M, Quadro FX 3600M, Quadro FX 2800M, Quadro FX 2700M, Quadro FX 1700M, Quadro FX 1600M, Quadro FX 770M, Quadro FX 570M, Quadro FX 370M, Quadro FX 360M, Quadro NVS 320M, Quadro NVS 160M, Quadro NVS 150M, Quadro NVS 140M, Quadro NVS 135M, Quadro NVS 130M, Quadro NVS 450, Quadro NVS 420, Quadro NVS 295

Compute capability 1.2 (Tesla)
GPUs: GT218, GT216, GT215
Cards: GeForce GT 240, GeForce GT 220*, GeForce 210*, GeForce GTS 360M, GeForce GTS 350M, GeForce GT 335M, GeForce GT 330M, GeForce GT 325M, GeForce GT 240M, GeForce G210M, GeForce 310M, GeForce 305M, Quadro FX 380 Low Profile, NVIDIA NVS 300, Quadro FX 1800M, Quadro FX 880M, Quadro FX 380M, NVIDIA NVS 300, NVS 5100M, NVS 3100M, NVS 2100M, ION

Compute capability 1.3 (Tesla)
GPUs: GT200, GT200b
Cards: GeForce GTX 280, GeForce GTX 275, GeForce GTX 260, Quadro FX 5800, Quadro FX 4800, Quadro FX 4800 for Mac, Quadro FX 3800, Quadro CX, Quadro Plex 2200 D2, Tesla C1060, Tesla S1070, Tesla M1060

Compute capability 2.0 (Fermi)
GPUs: GF100, GF110
Cards: GeForce GTX 590, GeForce GTX 580, GeForce GTX 570, GeForce GTX 480, GeForce GTX 470, GeForce GTX 465, GeForce GTX 480M, Quadro 6000, Quadro 5000, Quadro 4000, Quadro 4000 for Mac, Quadro Plex 7000, Quadro 5010M, Quadro 5000M, Tesla C2075, Tesla C2050/C2070, Tesla M2050/M2070/M2075/M2090

Compute capability 2.1 (Fermi)
GPUs: GF104, GF106, GF108, GF114, GF116, GF119
Cards: GeForce GTX 560 Ti, GeForce GTX 550 Ti, GeForce GTX 460, GeForce GTS 450, GeForce GTS 450*, GeForce GT 640 (GDDR3), GeForce GT 630, GeForce GT 620, GeForce GT 610, GeForce GT 520, GeForce GT 440, GeForce GT 440*, GeForce GT 430, GeForce GT 430*, GeForce GTX 675M, GeForce GTX 670M, GeForce GT 635M, GeForce GT 630M, GeForce GT 625M, GeForce GT 720M, GeForce GT 620M, GeForce 710M, GeForce 610M, GeForce GTX 580M, GeForce GTX 570M, GeForce GTX 560M, GeForce GT 555M, GeForce GT 550M, GeForce GT 540M, GeForce GT 525M, GeForce GT 520MX, GeForce GT 520M, GeForce GTX 485M, GeForce GTX 470M, GeForce GTX 460M, GeForce GT 445M, GeForce GT 435M, GeForce GT 420M, GeForce GT 415M, GeForce 710M, GeForce 410M, Quadro 2000, Quadro 2000D, Quadro 600, Quadro 410, Quadro 4000M, Quadro 3000M, Quadro 2000M, Quadro 1000M, NVS 5400M, NVS 5200M, NVS 4200M

Compute capability 3.0 (Kepler)
GPUs: GK104, GK106, GK107
Cards: GeForce GTX 770, GeForce GTX 760, GeForce GT 740, GeForce GTX 690, GeForce GTX 680, GeForce GTX 670, GeForce GTX 660 Ti, GeForce GTX 660, GeForce GTX 650 Ti BOOST, GeForce GTX 650 Ti, GeForce GTX 650, GeForce GTX 880M, GeForce GTX 780M, GeForce GTX 770M, GeForce GTX 765M, GeForce GTX 760M, GeForce GTX 680MX, GeForce GTX 680M, GeForce GTX 675MX, GeForce GTX 670MX, GeForce GTX 660M, GeForce GT 750M, GeForce GT 650M, GeForce GT 745M, GeForce GT 645M, GeForce GT 740M, GeForce GT 730M, GeForce GT 640M, GeForce GT 640M LE, GeForce GT 735M, GeForce GT 730M, Quadro K5000, Quadro K4200, Quadro K4000, Quadro K2000, Quadro K2000D, Quadro K600, Quadro K420, Quadro K500M, Quadro K510M, Quadro K610M, Quadro K1000M, Quadro K2000M, Quadro K1100M, Quadro K2100M, Quadro K3000M, Quadro K3100M, Quadro K4000M, Quadro K5000M, Quadro K4100M, Quadro K5100M, Tesla K10

Compute capability 3.2 (Kepler)
GPUs: Tegra K1
Cards: Jetson TK1 (SoC)

Compute capability 3.5 (Kepler)
GPUs: GK110, GK208
Cards: GeForce GTX TITAN Z, GeForce GTX TITAN Black, GeForce GTX TITAN, GeForce GTX 780 Ti, GeForce GTX 780, GeForce GT 640 (GDDR5), GeForce GT 630 v2, GeForce GT 730, GeForce GT 720, Quadro K6000, Quadro K5200, Tesla K40, Tesla K20x, Tesla K20

Compute capability 3.7 (Kepler)
GPUs: GK210
Cards: Tesla K80

Compute capability 5.0 (Maxwell)
GPUs: GM107, GM108
Cards: GeForce GTX 750 Ti, GeForce GTX 750, GeForce GTX 960M, GeForce GTX 950M, GeForce 940M, GeForce 930M, GeForce GTX 860M, GeForce GTX 850M, GeForce 845M, GeForce 840M, GeForce 830M, Quadro K2200, Quadro K1200, Quadro K620, Quadro K620M

Compute capability 5.2 (Maxwell)
GPUs: GM200, GM204, GM206
Cards: GeForce GTX TITAN X, GeForce GTX 980 Ti, GeForce GTX 980, GeForce GTX 970, GeForce GTX 960, GeForce GTX 950, GeForce GTX 980M, GeForce GTX 970M, GeForce GTX 965M, Quadro M6000, Quadro M5000, Quadro M4000

Compute capability 5.3 (Maxwell)
GPUs: Tegra X1

'*' OEM-only products

A table of devices officially supporting CUDA:[17]

Nvidia GeForce: GeForce GTX TITAN X, GeForce GTX 980 Ti, GeForce GTX 980, GeForce GTX 970, GeForce GTX 960, GeForce GTX 950, GeForce GTX Titan Z, GeForce GTX TITAN Black, GeForce GTX TITAN, GeForce GTX 780 Ti, GeForce GTX 780, GeForce GTX 770, GeForce GTX 760, GeForce GTX 750 Ti, GeForce GTX 750, GeForce GT 740, GeForce GT 730, GeForce GTX 690, GeForce GTX 680, GeForce GTX 670, GeForce GTX 660 Ti, GeForce GTX 660, GeForce GTX 650 Ti BOOST, GeForce GTX 650 Ti, GeForce GTX 650, GeForce GT 640, GeForce GT 630, GeForce GT 620, GeForce GT 610, GeForce GTX 590, GeForce GTX 580, GeForce GTX 570, GeForce GTX 560 Ti, GeForce GTX 560, GeForce GTX 550 Ti, GeForce GT 520, GeForce GTX 480, GeForce GTX 470, GeForce GTX 465, GeForce GTX 460, GeForce GTX 460 SE, GeForce GTS 450, GeForce GT 440, GeForce GT 430, GeForce GT 420, GeForce GTX 295, GeForce GTX 285, GeForce GTX 280, GeForce GTX 275, GeForce GTX 260, GeForce GTS 250, GeForce GTS 240, GeForce GT 240, GeForce GT 220, GeForce 210/G210, GeForce GT 140, GeForce 9800 GX2, GeForce 9800 GTX+, GeForce 9800 GTX, GeForce 9800 GT, GeForce 9600 GSO, GeForce 9600 GT, GeForce 9500 GT, GeForce 9400 GT, GeForce 9400 mGPU, GeForce 9300 mGPU, GeForce 9100 mGPU, GeForce 8800 Ultra, GeForce 8800 GTX, GeForce 8800 GTS, GeForce 8800 GT, GeForce 8800 GS, GeForce 8600 GTS, GeForce 8600 GT, GeForce 8600m GT, GeForce 8500 GT, GeForce 8400 GS, GeForce 8300 mGPU, GeForce 8200 mGPU, GeForce 8100 mGPU

Nvidia GeForce Mobile: GeForce GTX 980M, GeForce GTX 970M, GeForce GTX 965M, GeForce GTX 960M, GeForce GTX 950M, GeForce 940M, GeForce 930M, GeForce GTX 880M, GeForce GTX 870M, GeForce GTX 860M, GeForce GTX 850M, GeForce 845M, GeForce 840M, GeForce 830M, GeForce GTX 780M, GeForce GTX 770M, GeForce GTX 765M, GeForce GTX 760M, GeForce GT 750M, GeForce GT 745M, GeForce GT 740M, GeForce GT 735M, GeForce GT 730M, GeForce GTX 680MX, GeForce GTX 680M, GeForce GTX 675MX, GeForce GTX 675M, GeForce GTX 670MX, GeForce GTX 670M, GeForce GTX 660M, GeForce GT 650M, GeForce GT 645M, GeForce GT 640M, GeForce GTX 580M, GeForce GTX 570M, GeForce GTX 560M, GeForce GT 555M, GeForce GT 550M, GeForce GT 540M, GeForce GT 525M, GeForce GT 520M, GeForce GTX 480M, GeForce GTX 470M, GeForce GTX 460M, GeForce GT 445M, GeForce GT 435M, GeForce GT 425M, GeForce GT 420M, GeForce GT 415M, GeForce GTX 285M, GeForce GTX 280M, GeForce GTX 260M, GeForce GTS 360M, GeForce GTS 350M, GeForce GTS 260M, GeForce GTS 250M, GeForce GT 335M, GeForce GT 330M, GeForce GT 325M, GeForce GT 320M, GeForce 310M, GeForce GT 240M, GeForce GT 230M, GeForce GT 220M, GeForce G210M, GeForce GTS 160M, GeForce GTS 150M, GeForce GT 130M, GeForce GT 120M, GeForce G110M, GeForce G105M, GeForce G103M, GeForce G102M, GeForce G100, GeForce 9800M GTX, GeForce 9800M GTS, GeForce 9800M GT, GeForce 9800M GS, GeForce 9700M GTS, GeForce 9700M GT, GeForce 9650M GT, GeForce 9650M GS, GeForce 9600M GT, GeForce 9600M GS, GeForce 9500M GS, GeForce 9500M G, GeForce 9400M G, GeForce 9300M GS, GeForce 9300M G, GeForce 9200M GS, GeForce 9100M G, GeForce 8800M GTX, GeForce 8800M GTS, GeForce 8700M GT, GeForce 8600M GT, GeForce 8600M GS, GeForce 8400M GT, GeForce 8400M GS, GeForce 8400M G, GeForce 8200M G

Nvidia Quadro: Quadro M6000, Quadro M5000, Quadro M4000, Quadro K6000, Quadro K5200, Quadro K5000, Quadro K4200, Quadro K4000, Quadro K2200, Quadro K2000D, Quadro K2000, Quadro K1200, Quadro K620, Quadro K600, Quadro K420, Quadro 6000, Quadro 5000, Quadro 4000, Quadro 2000, Quadro 600, Quadro FX 5800, Quadro FX 5600, Quadro FX 4800, Quadro FX 4700 X2, Quadro FX 4600, Quadro FX 3800, Quadro FX 3700, Quadro FX 1800, Quadro FX 1700, Quadro FX 580, Quadro FX 570, Quadro FX 380, Quadro FX 370, Quadro NVS 510, Quadro NVS 450, Quadro NVS 420, Quadro NVS 295, Quadro Plex 1000 Model IV, Quadro Plex 1000 Model S4

Nvidia Quadro Mobile: Quadro K5100M, Quadro K5000M, Quadro K4100M, Quadro K4000M, Quadro K3100M, Quadro K3000M, Quadro K2100M, Quadro K2000M, Quadro K1100M, Quadro K1000M, Quadro K620M, Quadro K610M, Quadro K510M, Quadro K500M, Quadro 5010M, Quadro 5000M, Quadro 4000M, Quadro 3000M, Quadro 2000M, Quadro 1000M, Quadro FX 3800M, Quadro FX 3700M, Quadro FX 3600M, Quadro FX 2800M, Quadro FX 2700M, Quadro FX 1800M, Quadro FX 1700M, Quadro FX 1600M, Quadro FX 880M, Quadro FX 770M, Quadro FX 570M, Quadro FX 380M, Quadro FX 370M, Quadro FX 360M, Quadro NVS 320M, Quadro NVS 160M, Quadro NVS 150M, Quadro NVS 140M, Quadro NVS 135M, Quadro NVS 130M

Nvidia Tesla: Tesla K80, Tesla K40, Tesla K20X, Tesla K20, Tesla K10, Tesla C2050/2070, Tesla M2050/M2070, Tesla S2050, Tesla S1070, Tesla M1060, Tesla C1060, Tesla C870, Tesla D870, Tesla S870

Version features and specifications


Feature support (unlisted features are supported for all compute capabilities), listed with the compute capability (version) from which each feature is available:

Integer atomic functions operating on 32-bit words in global memory: 1.1 and above
atomicExch() operating on 32-bit floating point values in global memory: 1.1 and above
Integer atomic functions operating on 32-bit words in shared memory: 1.2 and above
atomicExch() operating on 32-bit floating point values in shared memory: 1.2 and above
Integer atomic functions operating on 64-bit words in global memory: 1.2 and above
Warp vote functions: 1.2 and above
Double-precision floating-point operations: 1.3 and above
Atomic functions operating on 64-bit integer values in shared memory: 2.x and above
Floating-point atomic addition operating on 32-bit words in global and shared memory: 2.x and above
__ballot(): 2.x and above
__threadfence_system(): 2.x and above
__syncthreads_count(), __syncthreads_and(), __syncthreads_or(): 2.x and above
Surface functions: 2.x and above
3D grid of thread block: 2.x and above
Warp shuffle functions: 3.0 and above
Funnel shift: 3.5 and above
Dynamic parallelism: 3.5 and above

Technical specifications, by compute capability (version) 1.0, 1.1, 1.2, 1.3, 2.x, 3.0, 3.5, 3.7, 5.0, 5.2. Where several values are listed, the value changed across compute capability versions; values are given in order from the earliest to the latest compute capability:

Maximum dimensionality of grid of thread blocks: 2 / 3
Maximum x-dimension of a grid of thread blocks: 65535 / 2^31 - 1
Maximum y- or z-dimension of a grid of thread blocks: 65535
Maximum dimensionality of thread block: 3
Maximum x- or y-dimension of a block: 512 / 1024
Maximum z-dimension of a block: 64
Maximum number of threads per block: 512 / 1024
Warp size: 32
Maximum number of resident blocks per multiprocessor: 8 / 16 / 32
Maximum number of resident warps per multiprocessor: 24 / 32 / 48 / 64
Maximum number of resident threads per multiprocessor: 768 / 1024 / 1536 / 2048
Number of 32-bit registers per multiprocessor: 8 K / 16 K / 32 K / 64 K / 128 K / 64 K
Maximum number of 32-bit registers per thread: 128 / 63 / 255
Maximum amount of shared memory per multiprocessor: 16 KB / 48 KB / 112 KB / 64 KB / 96 KB
Number of shared memory banks: 16 / 32
Amount of local memory per thread: 16 KB / 512 KB
Constant memory size: 64 KB
Cache working set per multiprocessor for constant memory: 8 KB / 10 KB
Cache working set per multiprocessor for texture memory: device dependent, between 6 KB and 8 KB / 12 KB / between 12 KB and 48 KB / 24 KB
Maximum width for 1D texture reference bound to a CUDA array: 8192 / 65536
Maximum width for 1D texture reference bound to linear memory: 2^27
Maximum width and number of layers for a 1D layered texture reference: 8192 x 512 / 16384 x 2048
Maximum width and height for 2D texture reference bound to a CUDA array: 65536 x 32768 / 65536 x 65535
Maximum width and height for 2D texture reference bound to linear memory: 65000 x 65000
Maximum width and height for 2D texture reference bound to a CUDA array supporting texture gather: N/A / 16384 x 16384
Maximum width, height, and number of layers for a 2D layered texture reference: 8192 x 8192 x 512 / 16384 x 16384 x 2048
Maximum width, height and depth for a 3D texture reference bound to linear memory or a CUDA array: 2048 x 2048 x 2048 / 4096 x 4096 x 4096
Maximum width (and height) for a cubemap texture reference: N/A / 16384
Maximum width (and height) and number of layers for a cubemap layered texture reference: N/A / 16384 x 2046
Maximum number of textures that can be bound to a kernel: 128 / 256
Maximum width for a 1D surface reference bound to a CUDA array: not supported / 65536
Maximum width and number of layers for a 1D layered surface reference: not supported / 65536 x 2048
Maximum width and height for a 2D surface reference bound to a CUDA array: not supported / 65536 x 32768
Maximum width, height, and number of layers for a 2D layered surface reference: not supported / 65536 x 32768 x 2048
Maximum width, height, and depth for a 3D surface reference bound to a CUDA array: not supported / 65536 x 32768 x 2048
Maximum width (and height) for a cubemap surface reference bound to a CUDA array: not supported / 32768
Maximum width (and height) and number of layers for a cubemap layered surface reference: not supported / 32768 x 2046
Maximum number of surfaces that can be bound to a kernel: 8 / 16
Maximum number of instructions per kernel: 2 million / 512 million

Architecture specifications, by compute capability (version) 1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 3.0, 3.5, 3.7, 5.0, 5.2; where several values are listed, they are given in order from the earliest to the latest compute capability:

Number of ALU lanes for integer and floating-point arithmetic operations: 8[19] / 32 / 48 / 192 / 128
Number of special function units for single-precision floating-point transcendental functions: 2 / 4 / 8 / 32
Number of texture filtering units for every texture address unit or render output unit (ROP): 2 / 4 / 8 / 16 / 8
Number of warp schedulers: 1 / 2 / 4
Number of instructions issued at once by scheduler: 1 / 2[20]

For more information please visit this site: http://www.geeks3d.com/20100606/gpucomputingnvidiacudacomputecapabilitycomparativetable/ and also read the Nvidia CUDA programming guide.[21]

    Example

This example code in C++ loads a texture from an image into an array on the GPU:

// width, height, image and d_data are assumed to be declared elsewhere.
texture<float, 2, cudaReadModeElementType> tex;

__global__ void kernel(float* odata, int height, int width);

void foo()
{
    cudaArray* cu_array;

    // Allocate array
    cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
    cudaMallocArray(&cu_array, &description, width, height);

    // Copy image data to array
    cudaMemcpyToArray(cu_array, 0, 0, image, width * height * sizeof(float), cudaMemcpyHostToDevice);

    // Set texture parameters (default)
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;
    tex.normalized = false;    // do not normalize coordinates

    // Bind the array to the texture
    cudaBindTextureToArray(tex, cu_array);

    // Run kernel
    dim3 blockDim(16, 16, 1);
    dim3 gridDim((width + blockDim.x - 1) / blockDim.x, (height + blockDim.y - 1) / blockDim.y, 1);
    kernel<<<gridDim, blockDim, 0>>>(d_data, height, width);

    // Unbind the array from the texture
    cudaUnbindTexture(tex);
}  // end foo()

__global__ void kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Read the texel at (x, y) and write it to the output array.
        float c = tex2D(tex, x, y);
        odata[y * width + x] = c;
    }
}


Language bindings

Mathematica: CUDALink (http://reference.wolfram.com/mathematica/CUDALink/tutorial/Overview.html)
MATLAB: Parallel Computing Toolbox, MATLAB Distributed Computing Server,[24] and 3rd party packages like Jacket
.NET: CUDA.NET (http://www.casshpc.com/solutions/libraries/cudanet), ManagedCUDA (https://managedcuda.codeplex.com), CUDAfy.NET (http://www.hybriddsp.com) .NET kernel and host code, CURAND, CUBLAS, CUFFT
Perl: KappaCUDA (http://psilambda.com/download/kappaforperl), CUDA::Minimal (https://github.com/run4flat/perlCUDAMinimal)
Python: Numba, NumbaPro, PyCUDA (http://mathema.tician.de/software/pycuda), KappaCUDA (http://psilambda.com/download/kappaforpython), Theano
Ruby: KappaCUDA (http://psilambda.com/download/kappaextras)
R: gputools (http://brainarray.mbni.med.umich.edu/brainarray/rgpgpu/)

Current and future usages of CUDA architecture

Accelerated rendering of 3D graphics
Accelerated interconversion of video file formats
Accelerated encryption, decryption and compression
Distributed calculations, such as predicting the native conformation of proteins
Medical analysis simulations, for example virtual reality based on CT and MRI scan images
Physical simulations, in particular in fluid dynamics
Neural network training in machine learning problems
Distributed computing
Molecular dynamics
Mining cryptocurrencies

See also

Allinea DDT: a debugger for CUDA, OpenACC, and parallel applications
OpenCL: a standard for programming a variety of platforms, including GPUs
BrookGPU: the Stanford University graphics group's compiler
Array programming
Parallel computing
Stream processing
rCUDA: an API for computing on remote computers
Molecular modeling on GPU

References

1. Shimpi, Anand Lal; Wilson, Derek (November 8, 2006). "NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10" (http://www.anandtech.com/show/2116/8). AnandTech. Retrieved May 16, 2015.
2. NVIDIA CUDA Home Page (http://www.nvidia.com/object/cuda_home_new.html)
3. Abi-Chahla, Fedy (June 18, 2008). "Nvidia's CUDA: The End of the CPU?" (http://www.tomshardware.com/reviews/nvidiacudagpu,1954.html). Tom's Hardware. Retrieved May 17, 2015.
4. CUDA LLVM Compiler (http://developer.nvidia.com/cuda/cudallvmcompiler)
5. First OpenCL demo on a GPU (https://www.youtube.com/watch?v=r1sN1ELJfNo) on YouTube
6. DirectCompute Ocean Demo Running on Nvidia CUDA-enabled GPU (https://www.youtube.com/watch?v=K1I4kts5mqc) on YouTube
7. Giorgos Vasiliadis; Spiros Antonatos; Michalis Polychronakis; Evangelos P. Markatos; Sotiris Ioannidis (September 2008). "Gnort: High Performance Network Intrusion Detection Using Graphics Processors" (http://www.ics.forth.gr/dcs/Activities/papers/gnort.raid08.pdf) (PDF). Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection (RAID).



8. Schatz, M. C.; Trapnell, C.; Delcher, A. L.; Varshney, A. (2007). "High-throughput sequence alignment using Graphics Processing Units" (http://www.biomedcentral.com/14712105/8/474). BMC Bioinformatics 8: 474. doi:10.1186/1471-2105-8-474. PMC 2222658 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2222658). PMID 18070356 (https://www.ncbi.nlm.nih.gov/pubmed/18070356).
9. Manavski, Svetlin A.; Giorgio Valle (2008). "CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment" (http://www.biomedcentral.com/14712105/9/S2/S10). BMC Bioinformatics 9: S10. doi:10.1186/1471-2105-9-S2-S10. PMC 2323659 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2323659). PMID 18387198 (https://www.ncbi.nlm.nih.gov/pubmed/18387198).
10. Pyrit, Google Code: https://code.google.com/p/pyrit/
11. Use your Nvidia GPU for scientific computing (http://boinc.berkeley.edu/cuda.php), BOINC official site (December 18, 2008)
12. Nvidia CUDA Software Development Kit (CUDA SDK) Release Notes Version 2.0 for MAC OS X (http://developer.download.nvidia.com/compute/cuda/sdk/website/doc/CUDA_SDK_release_notes_macosx.txt)
13. CUDA 1.1 Now on Mac OS X (http://news.developer.nvidia.com/2008/02/cuda11nowo.html) (Posted on Feb 14, 2008)
14. Silberstein, Mark; Schuster, Assaf; Geiger, Dan; Patney, Anjul; Owens, John D. (2008). Efficient computation of sum-products on GPUs through software-managed cache. Proceedings of the 22nd annual international conference on Supercomputing, ICS '08. pp. 309-318. doi:10.1145/1375527.1375572. ISBN 9781605581583.
15. NVCC forces c++ compilation of .cu files (https://devtalk.nvidia.com/default/topic/508479/cudaprogrammingandperformance/nvccforcesccompilationofcufiles/#entry1340190)
16. C++ keywords on CUDA C code (http://stackoverflow.com/questions/15362678/ckeywordsoncudaccode/15362798)
17. "CUDA-Enabled Products" (http://www.nvidia.com/object/cuda_learn_products.html). CUDA Zone. Nvidia Corporation. Retrieved 2008-11-03.
18. Whitehead, Nathan; Fit-Florea, Alex. "Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs" (https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIACUDAFloatingPoint.pdf) (PDF). Nvidia. Retrieved November 18, 2014.
19. ALUs perform only single-precision floating-point arithmetic. There is 1 double-precision floating-point unit.
20. No more than one scheduler can issue 2 instructions at once. The first scheduler is in charge of the warps with an odd ID and the second scheduler is in charge of the warps with an even ID.
21. Appendix F. Features and Technical Specifications (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf) (PDF) (3.2 MiB), Page 148 of 175 (Version 5.0, October 2012)
22. PyCUDA (http://mathema.tician.de/software/pycuda)
23. pycublas (http://kered.org/blog/20090413/easypythonnumpycudacublas/)
24. "MATLAB Adds GPGPU Support" (http://www.hpcwire.com/features/MATLABAddsGPGPUSupport103307084.html). 2010-09-20.

External links

Official website (http://www.nvidia.com/object/cuda_home.html)
CUDA Community (https://plus.google.com/communities/114632076318201174454) on Google+
A little tool to adjust the VRAM size (https://devtalk.nvidia.com/default/topic/726765/needalittletooltoadjustthevramsize/)

Retrieved from "https://en.wikipedia.org/w/index.php?title=CUDA&oldid=674383050"

Categories: Computer physics engines, GPGPU, GPGPU libraries, Graphics hardware, Nvidia software, Parallel computing, Video cards, Video game hardware

This page was last modified on 3 August 2015, at 15:46. Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.