Recent Progress in Autotuning at Berkeley


Jim Demmel, UC Berkeley

bebop.cs.berkeley.edu
parlab.eecs.berkeley.edu

Outline

• Introduction to ParLab
• Summary of autotuning activities
 – Also Shoaib Kamil's talk on Thursday @ 10:15
• Dense Linear Algebra
• Sparse Linear Algebra (summary)
• Communication collectives
• Responses to questions from the organizers

Overview of the Parallel Laboratory

Krste Asanovic, Ras Bodik, Jim Demmel, Tom Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, Kathy Yelick

parlab.eecs.berkeley.edu

How do compelling apps relate to 13 motifs?

[Figure, built up over several slides: "motif" popularity across the compelling apps (red hot → …).]

ParLab Research Overview: easy to write correct programs that run efficiently on manycore

[Figure: the ParLab stack. Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser. Motifs; Composition & Coordination Language (C&CL); C&CL Compiler/Interpreter; Parallel Libraries; Parallel Frameworks; Sketching; Efficiency Languages; Efficiency Language Compilers; Autotuners; Type Systems; Legacy Code; Schedulers; Communication & Synch. Primitives; Static Verification; Dynamic Checking; Debugging with Replay; Directed Testing; Diagnosing Power/Performance; Correctness. Platform: OS Libraries & Services, Legacy OS, Hypervisor, Multicore/GPGPU, RAMP Manycore.]


Summary of Autotuning (1/2)

• Dense linear algebra
 – New algorithms that attain lower bounds on communication (parallel and sequential)
 – New algorithms for GPUs
• Sparse linear algebra (summary)
 – New algorithms that attain lower bounds on communication
• Collective communications
 – Choosing the right tree for reductions/broadcasts/etc.

Summary of Autotuning (2/2)

• To be presented by Shoaib Kamil on Thursday
• Recent work autotuning three parallel kernels
 – Sparse Matrix-Vector Multiply (SpMV)
 – Lattice Boltzmann MHD
 – Stencils (Heat Equation)
• Lessons learned / commonalities between the autotuners
• Towards a framework for building autotuners
 – What is the role of the compiler?

Motivation

• Running time of an algorithm is the sum of 3 terms:
 – #flops × time_per_flop
 – #words moved / bandwidth
 – #messages × latency
 (the last two terms are communication)
• Exponentially growing gaps between the terms:
 – time_per_flop << 1/network bandwidth << network latency
  • Improving 59%/year vs 26%/year vs 15%/year
 – time_per_flop << 1/memory bandwidth << memory latency
  • Improving 59%/year vs 23%/year vs 5.5%/year
• Goal: reorganize/tune motifs to avoid communication
 – Not just hiding communication (speedup ≤ 2x)
 – Arbitrary speedups possible for linear algebra
 – Success metric: attain lower bounds on communication
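To make the model concrete, here is a minimal sketch in Python; all machine parameters are illustrative assumptions, not measurements from the talk. It shows why, as the gaps grow, the latency term dominates unless an algorithm sends fewer messages:

    # Three-term model of running time (illustrative parameters only).
    def run_time(flops, words, messages,
                 time_per_flop=1e-11,   # s/flop
                 bandwidth=1e9,         # words/s
                 latency=1e-6):         # s/message
        return flops * time_per_flop + words / bandwidth + messages * latency

    # Same flops and words, 100 messages vs 1 batched message:
    print(run_time(flops=1e6, words=1e4, messages=100))  # latency term dominates
    print(run_time(flops=1e6, words=1e4, messages=1))    # ~100x smaller latency term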

Linear Algebra Collaborators (so far)

• UC Berkeley
 – Kathy Yelick, Ming Gu
 – Mark Hoemmen, Marghoob Mohiyuddin, Kaushik Datta, George Petropoulos, Sam Williams, BeBOP group
 – Lenny Oliker, John Shalf
 – ParLab/UPCRC – new parallel computing research center funded by Intel and Microsoft
• CU Denver
 – Julien Langou
• INRIA
 – Laura Grigori, Hua Xiang
• Georgia Tech
 – Rich Vuduc
• Dense work ⊆ ongoing development of Sca/LAPACK
 – Joint with UTK

Communication Lower Bounds for Dense Linear Algebra (1/2)

• Hong & Kung (1981), Irony/Tiskin/Toledo (2004)
• Usual matmul using 2n³ flops
• Sequential case, with fast memory of size W < 3n²
 – Thm: #words moved = Ω(n³ / W^(1/2))
• Parallel case, P procs, O(n²/P) memory per proc
 – Thm: #words moved = Ω(n² / P^(1/2))
• Bandwidth lower bound ⇒ latency lower bound
 – #messages ≥ #words moved / (fast, local) memory size
 – Sequential: #messages = Ω(n³ / W^(3/2))
 – Parallel: #messages = Ω(P^(1/2))

Communication Lower Bounds for Dense Linear Algebra (2/2)

• Same lower bounds apply to LU and QR
 – Assumption: O(n³) algorithms; LU is easy, but QR is subtle
• LAPACK and ScaLAPACK do not attain these bounds
 – ScaLAPACK attains the bandwidth bound
 – But sends O((mn/P)^(1/2)) times more messages
 – LAPACK attains neither; O((m²/W)^(1/2))× more words moved
• But new algorithms do attain them, mod polylog factors
 – Parallel QR: requires a new representation of Q, redundant flops
 – Sequential QR: can use the new one, or recursive
 – Parallel LU: new pivoting strategy needed, stable in practice
 – Sequential LU: can use the new one or recursive, but higher latency

Minimizing Communication in Parallel QR

[Figure: reduction tree on W = [W0; W1; W2; W3]: local factors R00, R10, R20, R30 combine into R01, R11, then R02.]

• QR decomposition of an m×n matrix W, m >> n
• TSQR = "Tall Skinny QR"
• P processors, block row layout
• Usual parallel algorithm
 – Compute a Householder vector for each column
 – Number of messages ∝ n log P
• Communication-avoiding algorithm
 – Reduction operation, with QR as the operator
 – Number of messages ∝ log P – optimal

TSQR in more detail

Q is represented implicitly as a product (tree of factors).
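As a concrete illustration, here is a minimal numpy sketch of the R-factor reduction (my sketch, not the BeBOP implementation): a binary tree of local QRs over block rows, where each tree level would cost one message in the parallel setting.

    import numpy as np

    def tsqr_R(W_blocks):
        """R factor of [W0; W1; ...] via a binary reduction tree of local QRs."""
        Rs = [np.linalg.qr(W, mode='r') for W in W_blocks]   # leaf QRs: no communication
        while len(Rs) > 1:                                   # one "message" per level
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    rng = np.random.default_rng(0)
    W = rng.standard_normal((4000, 50))            # tall skinny: m >> n
    R = tsqr_R(np.split(W, 4))                     # 4 "processors", block row layout
    R_ref = np.linalg.qr(W, mode='r')
    print(np.allclose(np.abs(R), np.abs(R_ref)))   # same R up to row signs -> True

The implicit tree of local Q factors is exactly the "product (tree of factors)" representation above; it is applied, never formed.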

Minimizing Communication in TSQR

[Figure: reduction trees on W = [W0; W1; W2; W3].
 Parallel: R00, R10, R20, R30 → R01, R11 → R02 (binary tree).
 Sequential: R00 → R01 → R02 → R03 (flat tree, one block at a time).
 Dual core: a hybrid tree mixing the two.]

Choose the reduction tree dynamically.

Multicore / multisocket / multirack / multisite / out-of-core: ?

Performance of TSQR vs Sca/LAPACK

• Parallel
 – Pentium III cluster, Dolphin interconnect, MPICH
  • Up to 6.7x speedup (16 procs, 100K × 200)
 – BlueGene/L
  • Up to 4x speedup (32 procs, 1M × 50)
 – Both use Elmroth-Gustavson locally – enabled by TSQR
• Sequential
 – Out-of-core on a PowerPC laptop
  • As little as 2x slowdown vs (predicted) infinite DRAM
• See UC Berkeley EECS Tech Report 2008-74
 – Being revised to add optimality results!
• General rectangular QR
 – TSQR for the panel factorization, then right-looking
 – Implementation under way (Julien Langou)

Modeled Speedups of CAQR vs ScaLAPACK

• Petascale: up to 22.9x
• IBM Power5: up to 9.7x
• "Grid": up to 11x

(Petascale machine: 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s.)

TSLU: LU factorization of a tall skinny matrix

First, try the obvious generalization of TSQR.

[Figure: growth factor for TSLU-based factorization; courtesy of H. Xiang.]

Unstable for large P and large matrices. When P = #rows, TSLU is equivalent to parallel pivoting.

MakingTSLUStable

•  Ateachnodeintree,TSLUselectsbpivotrowsfrom2bcandidatesfromits2childnodes

•  Ateachnode,doLUon2boriginalrowsselectedbychildnodes,notUfactorsfromchildnodes

•  WhenTSLUdone,permutebselectedrowstotopoforiginalmatrix,redobstepsofLUwithoutpivo?ng

•  CALU–Communica?onAvoidingLUforgeneralA–  UseTSLUforpanelfactoriza?ons–  Applytorestofmatrix–  Cost:redundantpanelfactoriza?ons

•  Benefit:–  Stableinprac?ce,butnot samepivotchoiceasGEPP–  b?mesfewermessagesoverall‐faster
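A minimal numpy sketch of this tournament (my illustration of the selection step only, assuming a generic full-rank matrix; no U factors or permutations are produced):

    import numpy as np

    def gepp_pivot_rows(B, b):
        """Indices of the b rows partial pivoting would pick from block B."""
        B, rows = B.astype(float).copy(), np.arange(B.shape[0])
        for j in range(b):
            p = j + np.argmax(np.abs(B[j:, j]))                      # pivot in column j
            B[[j, p]], rows[[j, p]] = B[[p, j]], rows[[p, j]]
            B[j+1:, j:] -= np.outer(B[j+1:, j] / B[j, j], B[j, j:])  # eliminate below
        return rows[:b]

    def tslu_pivot_rows(W, P, b):
        """Tournament pivoting: each of P blocks proposes b candidate rows; pairs
        of candidate sets are merged by redoing GEPP on the ORIGINAL rows."""
        cand = [blk[gepp_pivot_rows(W[blk], b)]
                for blk in np.split(np.arange(W.shape[0]), P)]   # leaf selections
        while len(cand) > 1:                                     # one tree level
            merged = []
            for i in range(0, len(cand), 2):
                idx = np.concatenate(cand[i:i + 2])              # up to 2b original rows
                merged.append(idx[gepp_pivot_rows(W[idx], b)])
            cand = merged
        return cand[0]   # permute these b rows to the top, then LU without pivoting

    W = np.random.default_rng(4).standard_normal((1024, 8))
    print(tslu_pivot_rows(W, P=4, b=8))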

Growth factor for better CALU approach

Like threshold pivoting with worst-case threshold = 0.33, so |L| ≤ 3. Testing shows about the same residual as GEPP.

Performance vs ScaLAPACK

• TSLU
 – IBM Power5
  • Up to 4.37x faster (16 procs, 1M × 150)
 – Cray XT4
  • Up to 5.52x faster (8 procs, 1M × 150)
• CALU
 – IBM Power5
  • Up to 2.29x faster (64 procs, 1000 × 1000)
 – Cray XT4
  • Up to 1.81x faster (64 procs, 1000 × 1000)
• Optimality analysis analogous to QR
• See INRIA Tech Report 6523 (2008)

Speedup prediction for a Petascale machine – up to 81x faster, P = 8192

(Petascale machine: 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s.)

Related Work and Contributions for Dense Linear Algebra

• Related work
 – Pothen & Raghavan (1989)
 – Toledo (1997), Rabani & Toledo (2001)
 – da Cunha, Becker & Patterson (2002)
 – Gunter & van de Geijn (2005)
• Our contributions
 – QR: 2D parallel, efficient 1D parallel QR
 – Unified approach to sequential, parallel, …
 – Parallel LU
 – Communication lower bounds, proving optimality

Heterogeneous (CPU + GPU) Performance Libraries

Vasily Volkov

Challenges in Programming GPUs

• Relatively little tuning information for GPU hardware
 – Hardware is exposed via an abstract programming model
 – Hard to develop accurate performance models
 – Hardware is changing rapidly
• We get ~2x speedups vs. the best other work on a variety of libraries
 – Matrix-matrix multiply (1.6x speedup)
 – FFT (2x speedup)
 – LU/Cholesky factorization (3x/2x speedups)
 – Tridiagonal eigenvalue solver (2x speedup)
• Example: partial pivoting in LU factorization on an NVIDIA GPU
 – Ad: "GPU supports gather/scatter with arbitrary access patterns"
 – LAPACK code (sgetrf) spends 50% of its time doing pivoting
 – Gets worse with larger matrices
 – Only 1% of the time is spent in pivoting if the matrix is transposed
 – 2x speedup! (The layout effect is sketched below.)
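The 50%-vs-1% numbers above are for sgetrf on the GPU; the snippet below only demonstrates the underlying layout effect, on the CPU with numpy, as a rough analogy: swapping rows is cheap when rows are contiguous and expensive when they are strided.

    import numpy as np, time

    n, trials = 4000, 1000
    A_rows = np.zeros((n, n), order='C')   # rows contiguous ("transposed" storage)
    A_cols = np.zeros((n, n), order='F')   # columns contiguous (LAPACK's layout)

    def time_pivot_swaps(A):
        t0 = time.perf_counter()
        for t in range(trials):
            i, j = t % n, (7 * t + 1) % n
            A[[i, j], :] = A[[j, i], :]    # one partial-pivoting row swap
        return time.perf_counter() - t0

    print("contiguous rows:", time_pivot_swaps(A_rows))  # fast
    print("strided rows:   ", time_pivot_swaps(A_cols))  # gather/scatter: slower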

Matrix-Matrix Multiply (SGEMM)

• Uses software prefetching, strip-mining, register blocking
• Resembles algorithms used for the Cray X1 and IBM 3090 VF
• 60% of peak, bound by access to on-chip shared memory

FFT, complex-to-complex, batched

• Results are on an NVIDIA GeForce 8800GTX
• Resembles algorithms used for vector processors
• Radix-8 in registers, transposes using shared memory

[Figure: benchFFT Gflop/s (0-250) vs N (1 to 32768) for register storage, CUFFT 1.1, and our best.]

Heterogeneous Computing

• Question: should you port entire applications to the GPU? Maybe not.
• Example: eigenvalue solver (bisection, LAPACK's sstebz)
 – Most of the work (O(n²)) is in vectorized, embarrassingly parallel code
  • Do it on the GPU and tune it finely
  • May do many redundant flops on the GPU to go faster (multisection)
  • Run on the CPU when the problem is too small for the GPU
  • Use a performance model to choose the CPU or GPU version
 – Little work (O(n)) is in nearly serial code
  • Do it on the CPU: time-consuming and non-trivial to port
• Independent project:
 – Did everything on the GPU
 – Used advanced parallel algorithms for the O(n) work
 – Up to 2x slower (compared to us), even running on a faster GPU

Breakdown of Runtime (sstebz)

• "Count(x)" is the O(n²) work; use either SSE or the GPU (a minimal version of this kernel is sketched below)
• "The rest" is the O(n) work; it isn't worth optimizing
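Count(x) here is the classic Sturm-count kernel of bisection: for a symmetric tridiagonal matrix it counts the eigenvalues below a shift x with one pass of a scalar recurrence, and bisection/multisection calls it for many shifts. A minimal sketch (mine, not Volkov's GPU kernel; it assumes no pivot is exactly zero):

    import numpy as np

    def count_less_than(d, e, x):
        """Number of eigenvalues < x of the symmetric tridiagonal matrix with
        diagonal d and off-diagonal e, via the Sturm / LDL^T recurrence."""
        count, q = 0, 1.0
        for i in range(len(d)):
            off2 = e[i - 1] ** 2 if i > 0 else 0.0
            q = (d[i] - x) - off2 / q       # next pivot (assumed nonzero)
            count += q < 0                  # one negative pivot per eigenvalue < x
        return count

    rng = np.random.default_rng(1)
    n = 200
    d, e = rng.standard_normal(n), rng.standard_normal(n - 1)
    T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
    print(count_less_than(d, e, 0.5), (np.linalg.eigvalsh(T) < 0.5).sum())  # equal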

LU Factorization

• LU factorization:
 – O(n³) work in bulky BLAS3 operations
 – O(n²) work in fine-grain BLAS1/BLAS2 (panel factorization)
• Panel factorization is usually faster on the CPU
 – Peak sustained bandwidth of the 8800GTX is 75 GB/s
 – Kernel launch overhead is 5 µs
 – BLAS1 and BLAS2 are bandwidth bound, so run at …
 – Compare vs. a CPU handicapped by CPU-GPU transfers
 – Faster on the CPU up to n ~ 16K on an NVIDIA 8800GTX (a toy crossover model is sketched below)
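A toy version of the crossover reasoning, with assumed parameters: the 75 GB/s and 5 µs figures come from the slide, while the 10 GB/s CPU bandwidth and the cost shape are my assumptions, so the predicted crossover is illustrative rather than the slide's ~16K.

    # Bandwidth-bound panel factorization modeled as n fine-grain BLAS1/2 steps,
    # each touching ~n words; the GPU pays a kernel launch per step.
    def t_cpu(n, bw=1.0e10, word=8):                 # assumed 10 GB/s CPU bandwidth
        return n * (n * word) / bw

    def t_gpu(n, bw=7.5e10, word=8, launch=5e-6):    # 75 GB/s, 5 us launch (slide)
        return n * (launch + (n * word) / bw)

    for n in (1024, 4096, 8192, 16384, 32768):
        print(f"n = {n:6d}: {'CPU' if t_cpu(n) < t_gpu(n) else 'GPU'} panel is faster")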

Overall Performance

Overlap in LU Factorization

Overlap the panel factorization on the CPU with sgemm on the GPU.

LU slowdowns from omitting optimizations

Summary of GPU Experience

• Heterogeneity expands the tuning space
 – Dividing work between the CPU and the GPU
  • Depends a lot on flop rate / bandwidth / latency / startup
 – Different data layouts may be best for the CPU and the GPU
  • Rowwise vs columnwise for LU
 – May want to do (many) extra flops on the GPU
  • Multisection vs bisection in the eigensolver
  • TRINV + TRMM vs TRSM in LU
• Rapid evolution of GPU architectures motivates autotuning

Why all our problems are solved for dense linear algebra – in theory

• Thm (D., Dumitriu, Holtz, Kleinberg) (Numer. Math. 2007)
 – Given any matmul running in O(n^ω) ops for some ω > 2, it can be made stable and still run in O(n^(ω+ε)) ops, for any ε > 0.
• Current record: ω ≈ 2.38
• Thm (D., Dumitriu, Holtz) (Numer. Math. 2008)
 – Given any stable matmul running in O(n^(ω+ε)) ops, one can do backward stable dense linear algebra in O(n^(ω+ε)) ops:
  • GEPP, QR (recursive)
  • Rank-revealing QR (randomized)
  • (Generalized) Schur decomposition, SVD (randomized)
• Also reduces communication to O(n^(ω+ε))
• But constants?

Avoiding Communication in Sparse Linear Algebra – Summary

• Take k steps of a Krylov subspace method
 – GMRES, CG, Lanczos, Arnoldi
 – Assume the matrix is "well-partitioned," with a modest surface-to-volume ratio
 – Parallel implementation
  • Conventional: O(k log p) messages
  • "New": O(log p) messages – optimal
 – Serial implementation
  • Conventional: O(k) moves of data from slow to fast memory
  • "New": O(1) moves of data – optimal
 – (The kernel behind both counts is sketched below.)
• Can incorporate some preconditioners
 – Hierarchical, semiseparable matrices, …
• Lots of speedup possible (modeled and measured)
 – Price: some redundant computation
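A minimal sketch of the matrix powers idea behind these counts, for the tridiagonal example pictured in the extra slides: a processor fetches a depth-k ghost region once, then computes its rows of [x, Ax, …, A^k x] with no further communication, redundantly recomputing entries in the overlap. (My illustration; the real kernel handles general sparse A.)

    import numpy as np

    def stencil(v):
        """One application of the tridiagonal A (1D Laplacian) to the interior of v."""
        return -v[:-2] + 2.0 * v[1:-1] - v[2:]

    def local_powers(x, lo, hi, k):
        """Rows lo:hi of [x, Ax, ..., A^k x] from ONE fetch of k ghost entries
        per side (instead of k fetches of 1), plus redundant overlap flops."""
        v = x[lo - k : hi + k].copy()        # the single communication step
        out = [v[k:-k].copy()]               # owned part of x
        for j in range(1, k + 1):
            v = stencil(v)                   # valid region shrinks by 1 per side
            m = k - j
            out.append(v[m:len(v) - m].copy())
        return out

    n, k, lo, hi = 64, 8, 24, 40             # one interior "processor's" range
    x = np.random.default_rng(2).standard_normal(n)
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    vecs = local_powers(x, lo, hi, k)
    ref, ok = x.copy(), [np.allclose(vecs[0], x[lo:hi])]
    for j in range(1, k + 1):
        ref = A @ ref
        ok.append(np.allclose(vecs[j], ref[lo:hi]))
    print(all(ok))   # True: owned rows match the global computation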

Tuning Collectives (Rajesh Nishtala)

• Collectives are called by all threads to perform global communication
 – Broadcast, reduction, all-to-all
• Need to tune, because the best algorithm depends on
 – #procs
 – Network latency
 – Network bandwidth
 – Transfer size, etc.
• Focus on PGAS languages and one-sided communication models
 – E.g. UPC, Titanium, Co-Array Fortran

Example Tree Topologies

[Figure: radix-4 k-nomial tree (quadnomial); radix-2 k-nomial tree (binomial); binary tree; fork tree.]
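For concreteness, a minimal sketch of how a radix-r k-nomial broadcast tree can be generated (my construction; radix 2 gives the binomial tree in the figure):

    def knomial_children(rank, P, radix):
        """Children of `rank` in a radix-`radix` k-nomial broadcast tree rooted
        at rank 0 over P ranks (radix 2 = binomial tree)."""
        children, stride = [], 1
        while stride < P:
            if rank % (stride * radix) == 0:        # rank is a subtree root here
                children += [rank + i * stride
                             for i in range(1, radix) if rank + i * stride < P]
            stride *= radix
        return children

    for r in range(8):
        print(r, "->", knomial_children(r, 8, 2))   # binomial: 0->[1,2,4], 2->[3], ...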

Distributed Memory Broadcast

• 1000 16-byte broadcasts
• The best implementation is platform specific
• Low cost to inject a message ⇒ want a flat tree (6-ary)
• High cost ⇒ want a deep tree to parallelize the overhead (binomial)

[Figure: broadcast performance across platforms; "GOOD" marks the better direction.]

Distributed Memory All-to-All

• The optimal algorithm varies based on transfer size
 – The O(P log P) algorithm doubles the message size after every round
 – Expensive as the processor count grows
• The cross-over point is platform specific

[Figure: all-to-all performance; "GOOD" marks the better direction.]

Collective Synchronization

• UPC presents new semantics for collective synchronization
• Loose synchronization lets the collective begin after any thread arrives (looser than MPI semantics)
• Strict synchronization requires all threads to arrive before communication can start (stricter than MPI)
• The synchronization semantics affect the choice of algorithm

Multicore Barrier Performance Results

• Time many back-to-back barriers on a dual-socket Sun Niagara 2 (128 hardware threads)
• Flat tree is just one level, with all threads reporting to thread 0
 – Leverages shared memory, but is non-scalable
• Architecture-independent tree (radix = 2)
 – Pick a generic "good" radix that is suitable for many platforms
 – Mismatched to the architecture
• Architecture-dependent tree
 – Search over all radices to pick the tree that best matches the architecture (see the sketch after the figure)
 – The search revealed radix 8 to be the best

[Figure: Niagara 2 (Maramba) barrier performance. Execution time (µs, 0-18) vs thread count (2-128) for Flat Tree, Architecture Independent (radix = 2), and Architecture Dependent (radix = best); "GOOD" marks the better direction.]
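A minimal sketch of that architecture-dependent search, using Python threads and a toy two-level tree barrier as a stand-in for the real radix-r trees (timings include thread startup and the GIL, so this only illustrates the search structure, not Niagara 2 behavior):

    import threading, time

    def time_barrier(n=16, radix=4, iters=500):
        """Avg time of back-to-back two-level radix-`radix` tree barriers over
        n threads; radix >= n degenerates to the flat (single-level) barrier."""
        n_groups = (n + radix - 1) // radix
        size = lambda g: min(radix, n - g * radix)
        g_in  = [threading.Barrier(size(g)) for g in range(n_groups)]
        g_out = [threading.Barrier(size(g)) for g in range(n_groups)]
        top   = threading.Barrier(n_groups)

        def worker(tid):
            g, leader = tid // radix, tid % radix == 0
            for _ in range(iters):
                g_in[g].wait()           # gather within the group
                if leader:
                    top.wait()           # group leaders synchronize
                g_out[g].wait()          # release the group

        threads = [threading.Thread(target=worker, args=(t,)) for t in range(n)]
        t0 = time.perf_counter()
        for t in threads: t.start()
        for t in threads: t.join()
        return (time.perf_counter() - t0) / iters

    # Architecture-dependent tuning: search the radices, keep the fastest.
    best_time, best_radix = min((time_barrier(radix=r), r) for r in (2, 4, 8, 16))
    print(f"best radix = {best_radix} ({best_time * 1e6:.1f} us per barrier)")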

Summary and Conclusions (1/2)

• It is possible to minimize the communication complexity of much dense and sparse linear algebra
 – Practical speedups
 – Approaching theoretical lower bounds
• Hardware trends mean the time has come to do this
• Optimal asymptotic complexity algorithms for dense linear algebra – also lower communication
• Important to optimize the communication collectives themselves

Summary and Conclusions (2/2)

• Many open problems
 – How to add these to the automatic tuning design space
  • PLASMA (Jakub Kurzak)
 – Extend optimality proofs and algorithms to more general architectures
  • Heterogeneity, including multiple bandwidths and latencies
 – Dense eigenvalue problems – SBR or spectral D&C?
 – Sparse direct solvers – CALU or SuperLU?
 – Which preconditioners work?
 – Other motifs

Answers to Questions from the Organizers (1/5)

• What about tuning at petascale?
 – Communication-avoiding algorithms help address it
 – Can leverage multicore tuning, if we can mirror the hierarchy of machines in a hierarchy of algorithms/tuners
• How do we measure success for tuning?
 – Performance gains attained
 – Ease of use by the end user
 – Ease of development/evolution/maintenance of the tuners
• What architectures should we target?
 – Multicore and petascale, as above
 – Emerging architectures like GPUs

Answers to Questions from the Organizers (2/5)

• T/F: "Parameter tuning" is not enough
 – Many of us have been changing data structures and algorithms for a while, which goes beyond "parameters"
• T/F: Self-tuned libraries will beat compilers
 – What is the starting point for the compiler? The most naïve possible code? Then yes (but it may work for some motifs)
 – But autotuners use compilers to compile the tuned source code
• Will compilers change data structures/algorithms?
 – Would they still be called compilers, or something else?
 – How much wisdom about algorithms, numerical analysis, and applied math can we expect compilers to incorporate?

Answers to Questions from the Organizers (3/5)

• Simple performance models will make search unnecessary, e.g. "cache-oblivious"
 – If search is easy compared to figuring out a "simple" performance model, search is more productive
 – "Cache-oblivious" would not have found the new communication-avoiding algorithms
 – Simple models do help decide when to stop tuning
• Boundaries between SW layers are too rigid
 – Yes! See the ParLab structure for one take on breaking the usual layering of the HW/SW stack, e.g. schedulers

Answers to Questions from the Organizers (4/5)

• What issues/technologies are we ignoring?
 – Need more compiler-based expertise to build autotuners, so they are easier to design/develop/maintain, and not just big Perl scripts that must be written from scratch for each new architecture or slightly different function
• What common tools should we build?
 – See the answer to the last question; see Pluto, talk by Chen
• Tuning systems are too narrow; invest in compilers instead
 – If compilers do not change data structures or algorithms, they will miss a lot of performance gains
 – The ParLab hypothesis is that 13 motifs/tuners is enough
 – Invest in compiler tools to help build autotuners

Answers to Questions from the Organizers (5/5)

• Runtime optimization is needed
 – Yes! A number of autotuners do this now, e.g. JIT, OSKI
• If all layers of the SW stack are autotuned, how will they be composed?
 – A big question for ParLab too! One answer for multicore is to have schedulers for each library that can accept or give up resources (cores) on the fly, so that higher-level schedulers can balance work at a higher level

EXTRA SLIDES


[Figure series: locally dependent entries for [x, Ax], [x, Ax, A²x], …, [x, Ax, …, A⁸x], A tridiagonal, 2 processors (Proc 1 / Proc 2). At each step the marked entries can be computed without communication; for k = 8 this gives k-fold reuse of A.]

Remotely Dependent Entries for [x, Ax, …, A⁸x], A tridiagonal, 2 processors

[Figure: type (1) remote dependencies for k = 8.]

One message to get the data needed to compute the remotely dependent entries, not k = 8. This minimizes the number of messages = latency cost. Price: redundant work ∝ "surface/volume ratio".

Fewer Remotely Dependent Entries for [x, Ax, …, A⁸x], A tridiagonal, 2 processors

[Figure: type (2) remote dependencies for k = 8.]

Reduces the redundant work by half.

Remotely Dependent Entries for [x, Ax, A²x, A³x], A irregular, multiple processors

Sequential [x, Ax, …, A⁴x], with a memory hierarchy

One read of the matrix from slow memory, not k = 4. This minimizes words moved = bandwidth cost. No redundant work.

Performance Results

• Measured
 – Sequential/OOC speedup up to 3x
• Modeled
 – Sequential/multicore speedup up to 2.5x
 – Parallel/Petascale speedup up to 6.9x
 – Parallel/Grid speedup up to 22x
• See bebop.cs.berkeley.edu/#pubs

Optimizing Communication Complexity of Sparse Solvers

• Example: GMRES for Ax = b on a "2D mesh"
 – x lives on an n-by-n mesh
 – Partitioned on a p½-by-p½ grid
 – A has a "5-point stencil" (Laplacian)
  • (Ax)(i,j) = linear_combination(x(i,j), x(i,j±1), x(i±1,j))
 – Ex: 18-by-18 mesh on a 3-by-3 grid

Minimizing Communication of GMRES

• What is the cost = (#flops, #words, #messages) of k steps of standard GMRES?

GMRES, ver. 1:
 for i = 1 to k
  w = A · v(i-1)
  MGS(w, v(0), …, v(i-1))
  update v(i), H
 end for
 solve LSQ problem with H

(Each processor owns an n/p½ × n/p½ block of the mesh.)

• Cost(A·v) = k · (9n²/p, 4n/p½, 4)
• Cost(MGS) = k²/2 · (4n²/p, log p, log p)
• Total cost ~ Cost(A·v) + Cost(MGS)
• Can we reduce the latency?

Minimizing Communication of GMRES

• Cost(GMRES, ver. 1) = Cost(A·v) + Cost(MGS)
 = (9kn²/p, 4kn/p½, 4k) + (2k²n²/p, k²·log p / 2, k²·log p / 2)
• How much of the latency cost from A·v can you avoid? Almost all of it.

GMRES, ver. 2:
 W = [v, Av, A²v, …, A^k v]
 [Q, R] = MGS(W)
 Build H from R, solve LSQ problem

• Cost(W) = (~same, ~same, 8)
 – Latency cost independent of k – optimal
• Cost(MGS) unchanged
• Can we reduce the latency more?

Minimizing Communication of GMRES

• Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
 = (9kn²/p, 4kn/p½, 8) + (2k²n²/p, k²·log p / 2, k²·log p / 2)
• How much of the latency cost from MGS can you avoid? Almost all of it.

GMRES, ver. 3:
 W = [v, Av, A²v, …, A^k v]
 [Q, R] = TSQR(W) … "Tall Skinny QR"
 Build H from R, solve LSQ problem

[Figure: TSQR tree on W = [W1; W2; W3; W4]: R1, R2, R3, R4 → R12, R34 → R1234.]

• Cost(TSQR) = (~same, ~same, log p)
 – Latency cost independent of k – optimal
• Oops – W comes from the power method, so precision is lost!

Minimizing Communication of GMRES

• Cost(GMRES, ver. 3) = Cost(W) + Cost(TSQR)
 = (9kn²/p, 4kn/p½, 8) + (2k²n²/p, k²·log p / 2, log p)
• Latency cost independent of k, just log p – optimal
• Oops – W comes from the power method, so precision is lost. What to do?
• Use a different polynomial basis (illustrated below)
 – Not the monomial basis W = [v, Av, A²v, …]; instead…
 – Newton basis: WN = [v, (A − θ₁I)v, (A − θ₂I)(A − θ₁I)v, …], or
 – Chebyshev basis: WC = [v, T₁(v), T₂(v), …]
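A small numpy illustration of why the basis matters (my sketch; in CA-GMRES the Newton shifts come from Ritz values, while here we cheat and sample the known spectrum of a toy diagonal matrix):

    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 200, 16
    A = np.diag(np.linspace(1.0, 100.0, n))   # toy SPD matrix with known spectrum
    v = rng.standard_normal(n)

    def scaled_basis(shifts):
        """[v, (A - s1·I)v, (A - s2·I)(A - s1·I)v, ...], columns normalized as
        practical implementations do; all-zero shifts give the monomial basis."""
        w = v / np.linalg.norm(v)
        W = [w]
        for s in shifts:
            w = A @ w - s * w
            w /= np.linalg.norm(w)
            W.append(w)
        return np.column_stack(W)

    W_mono   = scaled_basis(np.zeros(k))                  # power-method directions
    W_newton = scaled_basis(np.linspace(1.0, 100.0, k))   # shifts ~ spectrum samples
    print("cond(monomial) = %.1e" % np.linalg.cond(W_mono))    # blows up with k
    print("cond(Newton)   = %.1e" % np.linalg.cond(W_newton))  # far smaller

The monomial columns all converge toward the dominant eigenvector, so W becomes numerically rank deficient; the shifted products keep the columns well separated, which is what lets TSQR replace MGS safely.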

Related Work and Contributions for Sparse Linear Algebra

• Related work
 – s-step GMRES: De Sturler (1991), Bai, Hu & Reichel (1991), Joubert et al. (1992), Erhel (1995)
 – s-step CG: Van Rosendale (1983), Chronopoulos & Gear (1989), Toledo (1995)
 – Matrix powers kernel: Pfeifer (1963), Hong & Kung (1981), Leiserson, Rao & Toledo (1993), Toledo (1995), Douglas, Hu, Kowarschik, Rüde & Weiss (2000), Strout, Carter & Ferrante (2001)
• Our contributions
 – Unified approach to serial and parallel, use of TSQR
 – Optimizing the serial case via TSP
 – Unified approach to stability
 – Incorporating preconditioning

Summary and Conclusions (1/2)

• It is possible to minimize the communication complexity of much dense and sparse linear algebra
 – Practical speedups
 – Approaching theoretical lower bounds
• Optimal asymptotic complexity algorithms for dense linear algebra – also lower communication
• Hardware trends mean the time has come to do this
• Lots of prior work (see pubs) – and some new

Summary and Conclusions (2/2)

• Many open problems
 – Automatic tuning – build and optimize complicated data structures, communication patterns, and code automatically: bebop.cs.berkeley.edu
 – Extend optimality proofs to general architectures
 – Dense eigenvalue problems – SBR or spectral D&C?
 – Sparse direct solvers – CALU or SuperLU?
 – Which preconditioners work?
 – Why stop at linear algebra?