NVIDIA’s Fermi: The First Complete GPU Computing Architecture A white paper by Peter N. Glaskowsky Prepared under contract with NVIDIA Corporation Copyright © September 2009, Peter N. Glaskowsky


Peter N. Glaskowsky is a consulting computer architect, technology analyst, and professional blogger in Silicon Valley. Glaskowsky was the principal system architect of chip startup Montalvo Systems. Earlier, he was Editor in Chief of the award-winning industry newsletter Microprocessor Report.

Glaskowsky writes the Speeds and Feeds blog for the CNET Blog Network:

http://www.speedsnfeeds.com/

This document is licensed under the Creative Commons Attribution-ShareAlike 3.0 License. In short: you are free to share and make derivative works of the file under the conditions that you appropriately attribute it, and that you distribute it only under a license identical to this one.

http://creativecommons.org/licenses/by-sa/3.0/

Company and product names may be trademarks of the respective companies with which they are associated.


Executive Summary

After 38 years of rapid progress, conventional microprocessor technology is beginning to see diminishing returns. The pace of improvement in clock speeds and architectural sophistication is slowing, and while single-threaded performance continues to improve, the focus has shifted to multicore designs.

These too are reaching practical limits for personal computing; a quad-core CPU isn't worth twice the price of a dual-core, and chips with even higher core counts aren't likely to be a major driver of value in future PCs.

CPUs will never go away, but GPUs are assuming a more prominent role in PC system architecture. GPUs deliver more cost-effective and energy-efficient performance for applications that need it.

The rapidly growing popularity of GPUs also makes them a natural choice for high-performance computing (HPC). Gaming and other consumer applications create a demand for millions of high-end GPUs each year, and these high sales volumes make it possible for companies like NVIDIA to provide the HPC market with fast, affordable GPU computing products.

NVIDIA's next-generation CUDA architecture (codenamed Fermi) is the latest and greatest expression of this trend. With many times the performance of any conventional CPU on parallel software, and new features to make it easier for software developers to realize the full potential of the hardware, Fermi-based GPUs will bring supercomputer performance to more users than ever before.

Fermi is the first architecture of any kind to deliver all of the features required for the most demanding HPC applications: unmatched double-precision floating-point performance, IEEE 754-2008 compliance including fused multiply-add operations, ECC protection from the registers to DRAM, a straightforward linear addressing model with caching at all levels, and support for languages including C, C++, FORTRAN, Java, Matlab, and Python.

With these features, plus many other performance and usability enhancements, Fermi is the first complete architecture for GPU computing.


CPU Computing—the Great Tradition

The history of the microprocessor over the last 38 years describes the greatest period of sustained technical progress the world has ever seen. Moore's Law, which describes the rate of this progress, has no equivalent in transportation, agriculture, or mechanical engineering. Think how different the Industrial Revolution would have been 300 years ago if, for example, the strength of structural materials had doubled every 18 months from 1771 to 1809. Never mind steam; the 19th century could have been powered by pea-sized internal-combustion engines compressing hydrogen to produce nuclear fusion.

CPU performance is the product of many related advances:

• Increased transistor density
• Increased transistor performance
• Wider datapaths
• Pipelining
• Superscalar execution
• Speculative execution
• Caching
• Chip- and system-level integration

The first thirty years of the microprocessor focused almost exclusively on serial workloads: compilers, managing serial communication links, user-interface code, and so on. More recently, CPUs have evolved to meet the needs of parallel workloads in markets from financial transaction processing to computational fluid dynamics.

CPUs are great things. They're easy to program, because compilers evolved right along with the hardware they run on. Software developers can ignore most of the complexity in modern CPUs; microarchitecture is almost invisible, and compiler magic hides the rest. Multicore chips have the same software architecture as older multiprocessor systems: a simple coherent memory model and a sea of identical computing engines.

But CPU cores continue to be optimized for single-threaded performance at the expense of parallel execution. This fact is most apparent when one considers that integer and floating-point execution units occupy only a tiny fraction of the die area in a modern CPU.

Figure 1 shows the portion of the die area used by ALUs in the Core i7 processor (the chip code-named Bloomfield) based on Intel's Nehalem microarchitecture.


Figure 1. Intel's Core i7 processor (the chip code-named Bloomfield, based on the Nehalem microarchitecture) includes four CPU cores with simultaneous multithreading, 8MB of L3 cache, and on-chip DRAM controllers. Made with 45nm process technology, each chip has 731 million transistors and consumes up to 130W of thermal design power. Red outlines highlight the portion of each core occupied by execution units. (Source: Intel Corporation except red highlighting)

With such a small part of the chip devoted to performing direct calculations, it's no surprise that CPUs are relatively inefficient for high-performance computing applications. Most of the circuitry on a CPU, and therefore most of the heat it generates, is devoted to invisible complexity: those caches, instruction decoders, branch predictors, and other features that are not architecturally visible but which enhance single-threaded performance.

Speculation

At the heart of this focus on single-threaded performance is a concept known as speculation. At a high level, speculation encompasses not only speculative execution (in which instructions begin executing even before it is possible to know their results will be needed), but many other elements of CPU design.


Caches, for example, are fundamentally speculative: storing data in a cache represents a bet that the data will be needed again soon. Caches consume die area and power that could otherwise be used to implement and operate more execution units. Whether the bet pays off depends on the nature of each workload.

Similarly, multiple execution units, out-of-order processing, and branch prediction also represent speculative optimizations. All of these choices tend to pay off for code with high data locality (where the same data items, or those nearby in memory, are frequently accessed), a mix of different operations, and a high percentage of conditional branches.

But when executing code consisting of many sequential operations of the same type—like scientific workloads—these speculative elements can sit unused, consuming die area and power.

The effect of process technology

The need for CPU designers to maximize single-threaded performance is also behind the use of aggressive process technology to achieve the highest possible clock rates. But this decision also comes with significant costs. Faster transistors run hotter, leak more power even when they aren't switching, and cost more to manufacture.

Companies that make high-end CPUs spend staggering amounts of money on process technology just to improve single-threaded performance. Between them, IBM and Intel have invested tens of billions of dollars on R&D for process technology and transistor design. The results are impressive when measured in gigahertz, but less so from the perspective of GFLOPS per dollar or per watt.

Processor microarchitecture also contributes to performance. Within the PC and server markets, the extremes of microarchitectural optimization are represented by two classes of CPU design: relatively simple dual-issue cores and more complex multi-issue cores.

Dual-issue CPUs

The simplest CPU microarchitecture used in the PC market today is the dual-issue superscalar core. Such designs can execute up to two operations in each clock cycle, sometimes with special “pairing rules” that define which instructions can be executed together. For example, some early dual-issue CPUs could issue two simple


integer operations at the same time, or one integer and one floating-point operation, but not two floating-point operations.

Dual-issue cores generally process instructions in program order. They deliver improved performance by exploiting the natural instruction-level parallelism (ILP) in most programs. The amount of available ILP varies from one program to another, but there's almost always enough to take advantage of a second pipeline.

Intel's Atom processor is a good example of a fully evolved dual-issue processor. Like other advanced x86 chips, Atom translates x86 instructions into internal “micro-ops” that are more like the instructions in old RISC (reduced instruction set computing) processors. In Atom, each micro-op can typically perform one ALU operation plus one or more supporting operations such as a memory load or store.

Dual-issue processors like Atom usually occupy the low end of the market where cost-efficiency is paramount. For this reason, Atom has fewer performance-oriented optimizations than more expensive Intel chips. Atom executes in order, with no speculative execution. Much of the new engineering work in Atom went into improving its power efficiency when not operating at full speed.

Atom has six execution pipelines (two for floating-point operations, two for integer operations, and two for address calculations; the latter are common in the x86 architecture because instruction operands can specify memory locations). Only two instructions, however, can be issued to these pipelines in a single clock period. This low utilization means that some execution units will always go unused in each cycle.

Like any x86 processor, a large part of Atom is dedicated to instruction caching, decoding (in this case, translating to micro-ops), and a microcode store to implement the more complex x86 instructions. It also supports Atom's two-way simultaneous multithreading (SMT) feature. This circuitry, which Intel calls the “front-end cluster,” occupies more die area than the chip's floating-point unit.

SMT is basically a way to work around cases that further limit utilization of the execution units. Sometimes a single thread is stalled waiting for data from the cache, or has multiple instructions pending for a single pipeline. In these cases, the second thread may be able to issue an instruction or two. The net performance benefit is usually low, only 10%–20% on some applications, but SMT adds only a few percent to the size of the chip.


As a result, the Atom core is suitable for low-end consumer systems, but provides very low net performance, well below what is available from other Intel processors.

Intel's Larrabee

Larrabee is Intel's code name for a future graphics-processing architecture based on the x86 architecture. The first Larrabee chip is said to use dual-issue cores derived from the original Pentium design, but modified to include support for 64-bit x86 operations and a new 512-bit vector-processing unit.

Apart from the vector unit, the Larrabee core is simpler than Atom's. It doesn't support Intel's MMX or SSE extensions, instead relying solely on the new vector unit, which has its own new instructions. The vector unit is wide enough to perform 16 single-precision FP operations per clock, and also provides double-precision FP support at a lower rate.

Several features in Larrabee's vector unit are new to the x86 architecture, including scatter-gather loads and stores (forming a vector from 16 different locations in memory—a convenient feature, though one that must be used judiciously), fused multiply-add, predicated execution, and three-operand floating-point instructions.

Larrabee also supports four-way multithreading, but not in the same way as Atom. Where Atom can simultaneously execute instructions from two threads (hence the SMT name), Larrabee simply maintains the state of multiple threads to speed the process of switching to a new thread when the current thread stalls.

Larrabee's x86 compatibility reduces its performance and efficiency without delivering much benefit for graphics. As with Atom, a significant (if not huge) part of the Larrabee die area and power budget will be consumed by instruction decoders. As a graphics chip, Larrabee will be impaired by its lack of optimized fixed-function logic for rasterization, interpolation, and alpha blending. Lacking cost-effective performance for 3D games, it will be difficult for Larrabee to achieve the kind of sales volumes and profit margins Intel expects of its major product lines.

Larrabee will be Intel's second attempt to enter the PC graphics-chip market, after the i740 program of 1998, which was commercially unsuccessful but laid the foundation for Intel's later integrated-graphics chipsets. (Intel made an even earlier run at the video-controller business with the i750, and before that, the company's i860 RISC processor was used as a graphics accelerator in some workstations.)


Intel's Nehalem microarchitecture

Nehalem is the most sophisticated microarchitecture in any x86 processor. Its features are like a laundry list of high-performance CPU design: four-wide superscalar, out of order, speculative execution, simultaneous multithreading, multiple branch predictors, on-die power gating, on-die memory controllers, large caches, and multiple interprocessor interconnects. Figure 2 shows the Nehalem microarchitecture.

Figure 2. The Nehalem core includes multiple x86 instruction decoders, queues, reordering buffers, and six execution pipelines to support speculative out-of-order multithreaded execution. (Source: "File:Intel Nehalem arch.svg." Wikimedia Commons)


Four instruction decoders are provided in each Nehalem core; these run in parallel wherever possible, though only one can decode complex x86 instructions, which are relatively infrequent. The micro-ops generated by the decoders are queued and dispatched out of order through six ports to 12 computational execution units. There is also one load unit and two store units for data and address values.

Nehalem's 128-bit SIMD floating-point units are similar to those found on previous-generation Intel processors: one for FMUL and FDIV (floating-point multiply and divide), one for FADD, and one for FP shuffle operations. The “shuffle” unit is used to rearrange data values within the SIMD registers, and does not contribute to the performance of multiply-add intensive algorithms.

The peak single-precision floating-point performance of a four-core Nehalem processor (not counting the shuffle unit) can be calculated as:

4 cores * 2 SIMD ops/clock * 4 values/op * clock rate

Also, while Nehalem processors provide 32GB/s of peak DRAM bandwidth—a commendable figure for a PC processor—this figure represents a little less than one byte of DRAM I/O for each three floating-point operations. As a result, many high-performance computing applications will be bottlenecked by DRAM performance before they saturate the chip's floating-point ALUs.

The Xeon W5590 is Intel's high-end quad-core workstation processor based on the Nehalem-EP “Gainestown” chip. The W5590 is priced at $1,600 each when purchased in 1,000-unit quantities (as of August 2009).

At its 3.33GHz clock, the W5590 delivers a peak single-precision floating-point rate of 106.56 GFLOPS. The W5590 has a 130W thermal design power (TDP) rating, or 1.22 watts/GFLOPS—not including the necessary core-logic chipset.

Nehalem has been optimized for single-threaded performance and clock speed at the expense of sustained throughput. This is a desirable tradeoff for a chip intended to be a market-leading PC desktop and server processor, but it makes Nehalem an expensive, power-hungry choice for high-performance computing.


“The Wall”

The market demands general-purpose processors that deliver high single-threaded performance as well as multi-core throughput for a wide variety of workloads on client, server, and high-performance computing (HPC) systems. This pressure has given us almost three decades of progress toward higher complexity and higher clock rates.

This progress hasn't always been steady. Intel cancelled its “Tejas” processor, which was rumored to have a 40-stage pipeline, and later killed off the entire Pentium 4 “NetBurst” product family because of its relative inefficiency. The Pentium 4 ultimately reached a clock rate of 3.8GHz in the 2004 “Prescott” model, a speed that Intel has been unable to match since.

In the more recent Core 2 (Conroe/Penryn) and Core i7 (Nehalem) processors, Intel uses increased complexity to deliver substantial performance improvements over the Pentium 4 line, but the pace of these improvements is slowing. Each new generation of process technology requires ever more heroic measures to improve transistor characteristics; each new core microarchitecture must work disproportionately harder to find and exploit instruction-level parallelism (ILP).

As these challenges became more apparent in the 1990s, CPU architects began referring to the “power wall,” the “memory wall,” and the “ILP wall” as obstacles to the kind of rapid progress seen up until that time. It may be better to think of these issues as mountains rather than walls—mountains that begin as mild slopes and become steeper with each step, making further progress increasingly difficult.

Nevertheless, the inexorable advance of process technology provided CPU designers with more transistors in each generation. By 2005, the competitive pressure to use these additional transistors to deliver improved performance (at the chip level, if not at the core level) drove AMD and Intel to introduce dual-core processors. Since then, the primary focus of PC processor design has been continuing to increase the core count on these chips.

That approach, however, has reached a point of diminishing returns. Dual-core CPUs provide noticeable benefits for most PC users, but are rarely fully utilized except when working with multimedia content or multiple performance-hungry applications. Quad-core CPUs are only a slight improvement, most of the time. By 2010, there will be eight-core CPUs in desktops, but it will likely be difficult to sell most customers on the value of the additional cores. Selling further increases will be even more problematic.


Once the increase in core count stalls, the focus will return to single-threaded performance, but with all the low-hanging fruit long gone, further improvements will be hard to find. In the near term, AMD and Intel are expected to emphasize vector floating-point improvements with the forthcoming Advanced Vector Extensions (AVX). Like SSE, AVX's primary value will be for applications where vectorizable floating-point computations need to be closely coupled with the kind of control-flow code for which modern x86 processors have been optimized.

CPU core design will continue to progress. There will continue to be further improvements in process technology, faster memory interfaces, and wider superscalar cores. But about ten years ago, NVIDIA's processor architects realized that CPUs were no longer the preferred solution for certain problems, and started from a clean sheet of paper to create a better answer.


The History of the GPU

It's one thing to recognize the future potential of a new processing architecture. It's another to build a market before that potential can be achieved. There were attempts to build chip-scale parallel processors in the 1990s, but the limited transistor budgets in those days favored more sophisticated single-core designs.

The real path toward GPU computing began, not with GPUs, but with non-programmable 3D-graphics accelerators. Multi-chip 3D rendering engines were developed by multiple companies starting in the 1980s, but by the mid-1990s it became possible to integrate all the essential elements onto a single chip. From 1994 to 2001, these chips progressed from the simplest pixel-drawing functions to implementing the full 3D pipeline: transforms, lighting, rasterization, texturing, depth testing, and display.

NVIDIA's GeForce 3 in 2001 introduced programmable pixel shading to the consumer market. The programmability of this chip was very limited, but later GeForce products became more flexible and faster, adding separate programmable engines for vertex and geometry shading. This evolution culminated in the GeForce 7800, shown in Figure 3.

Figure 3. The GeForce 7800 had three kinds of programmable engines for different stages of the 3D pipeline plus several additional stages of configurable and fixed-function logic. (Source: NVIDIA)


So-called general-purpose GPU (GPGPU) programming evolved as a way to perform non-graphics processing on these graphics-optimized architectures, typically by running carefully crafted shader code against data presented as vertex or texture information and retrieving the results from a later stage in the pipeline. Though sometimes awkward, GPGPU programming showed great promise.

Managing three different programmable engines in a single 3D pipeline led to unpredictable bottlenecks; too much effort went into balancing the throughput of each stage. In 2006, NVIDIA introduced the GeForce 8800, as Figure 4 shows. This design featured a “unified shader architecture” with 128 processing elements distributed among eight shader cores. Each shader core could be assigned to any shader task, eliminating the need for stage-by-stage balancing and greatly improving overall performance.

Figure 4. The GeForce 8800 introduced a unified shader architecture with just one kind of programmable processing element that could be used for multiple purposes. Some simple graphics operations still used special-purpose logic. (Source: NVIDIA)

The 8800 also introduced CUDA, the industry's first C-based development environment for GPUs. (CUDA originally stood for “Compute Unified Device Architecture,” but the longer name is no longer spelled out.) CUDA delivered an easier and more effective programming model than earlier GPGPU approaches.


To bring the advantages of the 8800 architecture and CUDA to new markets such as HPC, NVIDIA introduced the Tesla product line. Current Tesla products use the more recent GT200 architecture.

The Tesla line begins with PCI Express add-in boards—essentially graphics cards without display outputs—and with drivers optimized for GPU computing instead of 3D rendering. With Tesla, programmers don't have to worry about making tasks look like graphics operations; the GPU can be treated like a many-core processor.

Unlike the early attempts at chip-scale multiprocessing back in the '90s, Tesla was a high-volume hardware platform right from the beginning. This is due in part to NVIDIA's strategy of supporting the CUDA software development platform on the company's GeForce and Quadro products, making it available to a much wider audience of developers. NVIDIA says it has shipped over 100 million CUDA-capable chips.

At the time of this writing, the price for the entry-level Tesla C1060 add-in board is under $1,500 from some Internet mail-order vendors. That's lower than the price of a single Intel Xeon W5590 processor—and the Tesla card has a peak GFLOPS rating more than eight times higher than the Xeon processor.

The Tesla line also includes the S1070, a 1U-height rackmount server that includes four GT200-series GPUs running at a higher speed than that in the C1060 (up to 1.5GHz core clock vs. 1.3GHz), so the S1070's peak performance is over 4.6 times higher than a single C1060 card. The S1070 connects to a separate host computer via a PCI Express add-in card.

This widespread availability of high-performance hardware provides a natural draw for software developers. Just as the high-volume x86 architecture attracts more developers than the IA-64 architecture of Intel's Itanium processors, the high sales volumes of GPUs—although driven primarily by the gaming market—make GPUs more attractive for developers of high-performance computing applications than dedicated supercomputers from companies like Cray, Fujitsu, IBM, NEC, and SGI.

Although GPU computing is only a few years old now, it's likely there are already more programmers with direct GPU computing experience than have ever used a Cray. Academic support for GPU computing is also growing quickly. NVIDIA says over 200 colleges and universities are teaching classes in CUDA programming; the availability of OpenCL (such as in the new “Snow Leopard” version of Apple's Mac OS X) will drive that number even higher.


Introducing Fermi

GPU computing isn't meant to replace CPU computing. Each approach has advantages for certain kinds of software. As explained earlier, CPUs are optimized for applications where most of the work is being done by a limited number of threads, especially where the threads exhibit high data locality, a mix of different operations, and a high percentage of conditional branches.

GPU design aims at the other end of the spectrum: applications with multiple threads that are dominated by longer sequences of computational instructions. Over the last few years, GPUs have become much better at thread handling, data caching, virtual memory management, flow control, and other CPU-like features, but the distinction between computationally intensive software and control-flow intensive software is fundamental.

The state of the art in GPU design is represented by NVIDIA's next-generation CUDA architecture, codenamed Fermi. Figure 5 shows a high-level block diagram of the first Fermi chip.

Figure 5. NVIDIA's Fermi GPU architecture consists of multiple streaming multiprocessors (SMs), each consisting of 32 cores, each of which can execute one floating-point or integer instruction per clock. The SMs are supported by a second-level cache, host interface, GigaThread scheduler, and multiple DRAM interfaces. (Source: NVIDIA)


At this level of abstraction, the GPU looks like a sea of computational units with only a few support elements—an illustration of the key GPU design goal, which is to maximize floating-point throughput.

Since most of the circuitry within each core is dedicated to computation, rather than speculative features meant to enhance single-threaded performance, most of the die area and power consumed by Fermi goes into the application's actual algorithmic work.

The Programming Model

The complexity of the Fermi architecture is managed by a multi-level programming model that allows software developers to focus on algorithm design rather than the details of how to map the algorithm to the hardware, thus improving productivity. This is a concern that conventional CPUs have yet to address because their structures are simple and regular: a small number of cores presented as logical peers on a virtual bus.

In NVIDIA's CUDA software platform, as well as in the industry-standard OpenCL framework, the computational elements of algorithms are known as kernels (a term here adapted from its use in signal processing rather than from operating systems). An application or library function may consist of one or more kernels.

Kernels can be written in the C language (specifically, the ANSI-standard C99 dialect) extended with additional keywords to express parallelism directly rather than through the usual looping constructs.

Once compiled, kernels consist of many threads that execute the same program in parallel: one thread is like one iteration of a loop. In an image-processing algorithm, for example, one thread may operate on one pixel, while all the threads together—the kernel—may operate on a whole image.

Multiple threads are grouped into thread blocks containing up to 1,536 threads. All of the threads in a thread block will run on a single SM, so within the thread block, threads can cooperate and share memory. Thread blocks can coordinate the use of global shared memory among themselves but may execute in any order, concurrently or sequentially.

Thread blocks are divided into warps of 32 threads. The warp is the fundamental unit of dispatch within a single SM. In Fermi, two warps from different thread blocks (even different kernels) can be issued and executed concurrently, increasing hardware utilization and energy efficiency.


Thread blocks are grouped into grids, each of which executes a unique kernel.

Thread blocks and threads each have identifiers (IDs) that specify their relationship to the kernel. These IDs are used within each thread as indexes to their respective input and output data, shared memory locations, and so on.

At any one time, the entire Fermi device is dedicated to a single application. As mentioned above, an application may include multiple kernels. Fermi supports simultaneous execution of multiple kernels from the same application, each kernel being distributed to one or more SMs on the device. This capability avoids the situation where a kernel is only able to use part of the device and the rest goes unused.

Switching from one application to another is about 20 times faster on Fermi (just 25 microseconds) than on previous-generation GPUs. This time is short enough that a Fermi GPU can still maintain high utilization even when running multiple applications, like a mix of compute code and graphics code. Efficient multitasking is important for consumers (e.g., for video games using physics-based effects) and professional users (who often need to run computationally intensive simulations and simultaneously visualize the results).

This switching is managed by the chip-level GigaThread hardware thread scheduler, which manages 1,536 simultaneously active threads for each streaming multiprocessor across 16 kernels.

This centralized scheduler is another point of departure from conventional CPU design. In a multicore or multiprocessor server, no one CPU is “in charge.” All tasks, including the operating system's kernel itself, may be run on any available CPU. This approach allows each operating system to follow a different philosophy in kernel design, from large monolithic kernels like Linux's to the microkernel design of QNX and hybrid designs like Windows 7. But the generality of this approach is also its weakness, because it requires complex CPUs to spend time and energy performing functions that could also be handled by much simpler hardware.

With Fermi, the intended applications, the principles of stream processing, and the kernel and thread model were all known in advance, so that a more efficient scheduling method could be implemented in the GigaThread engine.

In addition to C-language support, Fermi can also accelerate all the same languages as the GT200, including FORTRAN (with independent solutions from The Portland Group and NOAA, the National Oceanic and Atmospheric Administration), Java, Matlab, and Python. Supported software platforms include NVIDIA's own CUDA


development environment, the OpenCL standard managed by the Khronos Group, and Microsoft's DirectCompute API.

The Portland Group (PGI) supports two ways to use GPUs to accelerate FORTRAN programs: the PGI Accelerator programming model, in which regions of code within a FORTRAN program can be offloaded to a GPU, and CUDA FORTRAN, which allows the programmer direct control over the operation of attached GPUs, including managing local and shared memory, thread synchronization, and so on. NOAA provides a language translator that converts FORTRAN code into CUDA C.

Fermi brings an important new capability to the market with new instruction-level support for C++, including instructions for C++ virtual functions, function pointers, dynamic object allocation, and the C++ exception-handling operations “try” and “catch.” The popularity of the C++ language, previously unsupported on GPUs, will make GPU computing more widely available than ever.

The Streaming Multiprocessor

Fermi's streaming multiprocessors, shown in Figure 6, comprise 32 cores, each of which can perform floating-point and integer operations, along with 16 load-store units for memory operations, four special-function units, and 64K of local SRAM split between cache and local memory.


Figure 6. Each Fermi SM includes 32 cores, 16 load/store units, four special-function units, a 4K-word register file, 64K of configurable RAM, and thread control logic. Each core has both floating-point and integer execution units. (Source: NVIDIA)

Floating-point operations follow the IEEE 754-2008 floating-point standard. Each core can perform one single-precision fused multiply-add (FMA) operation in each clock period and one double-precision FMA in two clock periods. At the chip level, Fermi performs more than 8× as many double-precision operations per clock as the previous GT200 generation, in which double-precision processing was handled by a dedicated unit per SM with much lower throughput.

IEEE floating-point compliance includes all four rounding modes, and subnormal numbers (numbers closer to zero than a normalized format can represent) are handled correctly by the Fermi hardware rather than being flushed to zero or requiring additional processing in a software exception handler.


Fermi’s support for fused multiply-add (FMA) also follows the IEEE 754-2008 standard, improving the accuracy of the commonly used multiply-add sequence by not rounding off the intermediate result, as otherwise happens between the multiply and add operations. In Fermi, this intermediate result carries a full 106-bit mantissa; in fact, 161 bits of precision are maintained during the add operation to handle worst-case denormalized numbers before the final double-precision result is computed. The GT200 supported FMA for double-precision operations only; Fermi brings the benefits of FMA to single precision as well.

FMA support also increases the accuracy and performance of other mathematical operations such as division and square root, and of more complex functions such as extended-precision arithmetic, interval arithmetic, and linear algebra.

The integer ALU supports the usual mathematical and logical operations, including multiplication, on both 32-bit and 64-bit values.

Memory operations are handled by a set of 16 load/store units in each SM. The load/store instructions can now refer to memory in terms of two-dimensional arrays, providing addresses in terms of x and y values. Data can be converted from one format to another (for example, from integer to floating point or vice versa) as it passes between DRAM and the core registers at the full rate. These formatting and converting features are further examples of optimizations unique to GPUs: not worthwhile in general-purpose CPUs, but here they will be used sufficiently often to justify their inclusion.

A set of four Special Function Units (SFUs) is also available to handle transcendental and other special operations such as sin, cos, exp, and rcp (reciprocal). Four of these operations can be issued per cycle in each SM.

Within the SM, the cores are divided into two execution blocks of 16 cores each. Along with the group of 16 load/store units and the four SFUs, there are four execution blocks per SM. In each cycle, a total of 32 instructions can be dispatched from one or two warps to these blocks. It takes two cycles for the 32 instructions in each warp to execute on the cores or load/store units. A warp of 32 special-function instructions is issued in a single cycle but takes eight cycles to complete on the four SFUs. Figure 7 shows a sequence of instructions being distributed among the available execution blocks.


Figure 7. A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of 16 load/store units. This figure shows how instructions are issued to the execution blocks. (Source: NVIDIA)

ISA Improvements

Fermi debuts the Parallel Thread eXecution (PTX) 2.0 instruction-set architecture (ISA). PTX 2.0 defines an instruction set and a new virtual machine architecture that amounts to an idealized processor designed for parallel thread operation.

Because this virtual machine model doesn’t literally model the Fermi hardware, it can be portable from one generation to the next. NVIDIA intends PTX 2.0 to span multiple generations of GPU hardware and multiple GPU sizes within each generation, just as PTX 1.0 did.

Compilers supporting NVIDIA GPUs provide PTX-compliant binaries that act as a hardware-neutral distribution format for GPU computing applications and middleware. When applications are installed on a target machine, the GPU driver translates the PTX binaries into the low-level machine instructions that are directly executed by the hardware. (PTX 1.0 binaries can also be translated by Fermi GPU drivers into native instructions.)


This final translation step imposes no further performance penalties. Kernels and libraries for even the most performance-sensitive applications can be hand-coded to the PTX 2.0 ISA, making them portable across GPU generations and implementations.

All of the architecturally visible improvements in Fermi are represented in PTX 2.0. Predication is one of the more significant enhancements in the new ISA.

All instructions support predication. Each instruction can be executed or skipped based on condition codes. Predication allows each thread (each core) to perform different operations as needed while execution continues at full speed. Where predication isn’t sufficient, Fermi also supports the usual if-then-else structure with branch statements.

Most CPUs rely exclusively on conditional branches and incorporate branch-prediction hardware to allow speculation along the likely path. That’s a reasonable solution for branch-intensive serial code, but less efficient than predication for streaming applications.

Another major improvement in Fermi and PTX 2.0 is a new unified addressing model. All addresses in the GPU are allocated from a continuous 40-bit (one-terabyte) address space. Global, shared, and local addresses are defined as ranges within this address space and can be accessed by common load/store instructions. (The load/store instructions support 64-bit addresses to allow for future growth.)

The Cache and Memory Hierarchy

Like earlier GPUs, the Fermi architecture provides local memory in each SM. New to Fermi is the ability to use some of this local memory as a first-level (L1) cache for global memory references. The local memory is 64K in size and can be split 16K/48K or 48K/16K between L1 cache and shared memory.

Shared memory, the traditional use for local SM memory, provides low-latency access to moderate amounts of data (such as intermediate results in a series of calculations, one row or column of data for matrix operations, a line of video, etc.). Because the access latency to this memory is also completely predictable, algorithms can be written to interleave loads, calculations, and stores with maximum efficiency.

The decision to allocate 16K or 48K of the local memory as cache usually depends on two factors: how much shared memory is needed, and how predictable the kernel’s accesses to global memory (usually the off-chip DRAM) are likely to be.


A larger shared-memory requirement argues for less cache; more frequent or unpredictable accesses to larger regions of DRAM argue for more cache.

Some embedded processors support local memory in a similar way, but this feature is almost never available on a PC or server processor because mainstream operating systems have no way to manage local memory; there is no support for it in their programming models. This is one of the reasons why high-performance computing applications running on general-purpose processors are so frequently bottlenecked by memory bandwidth: the application has no way to manage where memory is allocated, and algorithms can’t be fully optimized for access latency.

Each Fermi GPU is also equipped with an L2 cache (768KB in size for a 512-core chip). The L2 cache covers GPU local DRAM as well as system memory.

The L2 cache subsystem also implements another feature not found on CPUs: a set of memory read-modify-write operations that are atomic (that is, uninterruptible) and thus ideal for managing access to data that must be shared across thread blocks or even kernels. Normally this functionality is provided through a two-step process: a CPU uses an atomic test-and-set instruction to manage a semaphore, and the semaphore manages access to a predefined location or region in memory.

Fermi can implement that same solution when needed, but it’s much simpler from the software perspective to issue a standard integer ALU operation that performs the atomic operation directly, rather than having to wait until a semaphore becomes available.

Fermi’s atomic operations are implemented by a set of integer ALUs that can logically lock access to a single memory address while the read-modify-write sequence is completed. This memory address can be in system memory, in the GPU’s locally connected DRAM, or even in the memory spaces of other PCI Express-connected devices. During the brief lock interval, the rest of memory continues to operate normally. Locks in system memory are atomic with respect to the operations of the GPU performing the atomic operation; software synchronization is ordinarily used to assign regions of memory to GPU control, thus avoiding conflicting writes from the CPU or other devices.

Consider a kernel designed to calculate a histogram for an image, where the histogram consists of one counter for each brightness level in the image. A CPU might loop through the whole image and increment the appropriate counter value based on the brightness of each pixel. A GPU without atomic operations might assign one SM to each part of the image, let them run until they’re all done (by imposing a synchronization barrier), and then run a second short program to add up all the results.

With the atomic operations in Fermi, once the regional histograms are computed, those results can be combined into the final histogram using atomic add operations; no second pass is required.

Similar improvements can be made in other applications such as ray tracing, pattern recognition, and linear algebra routines such as the matrix multiplication implemented in the commonly used Basic Linear Algebra Subprograms (BLAS). According to NVIDIA, atomic operations on Fermi are 5× to 20× faster than on previous GPUs using conventional synchronization methods.

The final stage of the local memory hierarchy is the GPU’s directly connected DRAM. Fermi provides six 64-bit DRAM channels that support SDDR3 and GDDR5 DRAMs. Up to 6GB of GDDR5 DRAM can be connected to the chip for a significant boost in capacity and bandwidth over NVIDIA’s previous products.

Fermi is the first GPU to provide ECC (error-correcting code) protection for DRAM; the chip’s register files, shared memories, and L1 and L2 caches are also ECC protected. The level of protection is known as SECDED: single (bit) error correction, double error detection. SECDED is the usual level of protection in most ECC-equipped systems.

Fermi’s ECC protection for DRAM is unique among GPUs; so is its implementation. Instead of each 64-bit memory channel carrying eight extra bits for ECC information, NVIDIA has a proprietary (and undisclosed) solution for packing the ECC bits into reserved lines of memory.

The GigaThread controller that manages application context switching (described earlier) also provides a pair of streaming data-transfer engines, each of which can fully saturate Fermi’s PCI Express host interface. Typically, one will be used to move data from system memory to GPU memory when setting up a GPU computation, while the other will be used to move result data from GPU memory to system memory.


Conclusions

In just a few years, NVIDIA has advanced the state of the art in GPU design from almost purely graphics-focused products like the 7800 series to the flexible Fermi architecture.

Fermi is still derived from NVIDIA’s graphics products, which ensures that NVIDIA will sell millions of software-compatible chips to PC gamers. In the PC market, Fermi’s capacity for GPU computing will deliver substantial improvements in gameplay, multimedia encoding and enhancement, and other popular PC applications.

Those sales to PC users generate an indirect benefit for customers interested primarily in high-performance computing: Fermi’s affordability and availability will be unmatched by any other computing architecture in its performance range.

Although it may sometimes appear that GPUs are becoming more CPU-like with each new product generation, fundamental technical differences among computing applications lead to lasting differences in how CPUs and GPUs are designed and used.

CPUs will continue to be best for dynamic workloads marked by short sequences of computational operations and unpredictable control flow. Modern CPUs, which devote large portions of their silicon real estate to caches, large branch predictors, and complex instruction-set decoders, will always be optimized for this kind of code.

At the other extreme, workloads that are dominated by computational work performed within a simpler control flow need a different kind of processor architecture, one optimized for streaming calculations but also equipped with the ability to support popular programming languages. Fermi is just such an architecture.

Fermi is the first computing architecture to deliver such a high level of double-precision floating-point performance from a single chip with a flexible, error-protected memory hierarchy and support for languages including C++ and FORTRAN. As such, Fermi is the world’s first complete GPU computing architecture.