Multi-GPU Graph Analytics
Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D. Owens, University of California, Davis
{ychpan, yzhwang, yudwu, ctcyang, jowens}@ucdavis.edu
Introduction: about Gunrock
What we aimed and achieved
Multi-GPU Framework
Results
Future Work
Acknowledgements
References
[1] Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D. Owens. "Multi-GPU Graph Analytics." CoRR, abs/1504.04804, Apr. 2016.
[2] Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. "Gunrock: A High-Performance Graph Processing Library on the GPU." In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP 2016, Mar. 2016. http://escholarship.org/uc/item/6xz7z9k0
[3] M. Bisson, M. Bernaschi, and E. Mastrostefano. "Parallel distributed breadth first search on the Kepler architecture." IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, Sep. 2015.
[4] M. Bernaschi, G. Carbone, E. Mastrostefano, M. Bisson, and M. Fatica. "Enhanced GPU-based distributed breadth first search." In Proceedings of the 12th ACM International Conference on Computing Frontiers, ser. CF '15, 2015, pp. 10:1–10:8.
Gunrock is a multi-GPU graph processing library that targets:
• High-performance analytics of large graphs
• Low programming complexity in implementing parallel graph algorithms on GPUs
Homepage: http://gunrock.github.io
The copyright of Gunrock is owned by The Regents of the University of California, 2016. All source code is released under Apache 2.0.
| graph (vertices, edges, UD/D)  | ref.          | ref. hw.   | ref. perf.  | our hw. | our perf.   | comp. |
|--------------------------------|---------------|------------|-------------|---------|-------------|-------|
| com-orkut (3M, 117M, UD)       | Bisson [3]    | 1xK20X x4  | 2.67 GTEPS  | 4xK40   | 11.42 GTEPS | 5.33X |
| com-Friendster (66M, 1.81B, UD)| Bisson [3]    | 1xK20X x64 | 15.68 GTEPS | 4xK40   | 14.1 GTEPS  | 0.90X |
| kron_n23_16 (8M, 256M, UD)     | Bernaschi [4] | 1xK20X x4  | ~1.3 GTEPS  | 4xK40   | 30.8 GTEPS  | 23.7X |
| kron_n25_16 (32M, 1.07B, UD)   | Bernaschi [4] | 1xK20X x16 | ~3.2 GTEPS  | 6xK40   | 31.0 GTEPS  | 9.69X |
| kron_n25_32 (32M, 1.07B, UD)   | Fu [5]        | 2xK20 x32  | 22.7 GTEPS  | 4xK40   | 32.0 GTEPS  | 1.41X |
| kron_n23_32 (8M, 256M, D)      | Fu [5]        | 2xK20 x2   | 6.3 GTEPS   | 4xK40   | 27.9 GTEPS  | 4.43X |
| kron_n24_32 (16.8M, 1.07B, UD) | Liu [6]       | 2xK40 x1   | 15 GTEPS    | 2xK40   | 77.7 GTEPS  | 5.18X |
| kron_n24_32 (16.8M, 1.07B, UD) | Liu [6]       | 4xK40 x1   | 18 GTEPS    | 4xK40   | 67.7 GTEPS  | 3.76X |
| kron_n24_32 (16.8M, 1.07B, UD) | Liu [6]       | 8xK40 x1   | 18.4 GTEPS  | 4xK80   | 40.2 GTEPS  | 2.18X |
| twitter-mpi (52.6M, 1.96B, D)  | Bebee [7]     | 1xK40 x16  | 0.2242 sec  | 3xK40   | 94.31 ms    | 2.38X |
| rmat_n21_64 (2M, 128M, D)      | Merrill [8]   | 4xC2050 x1 | 8.3 GTEPS   | 4xK40   | 23.7 GTEPS  | 2.86X |
• Extending Gunrock onto multiple nodes
• Asynchronous graph algorithms
• Other partitioning methods
• More algorithms
Comparison with previous work on multi-GPU BFS
Scalability and speedup. Top-Left: BFS, Top-Right: DOBFS, Bottom-Left: PR, Bottom-Right: overall speedup with 16 datasets
• Programmability: easy to develop graph primitives on multiple GPUs -> the programmer only needs to specify a few things (refer to Framework)
• Algorithm generality: supports a wide range of graph algorithms -> BFS, DOBFS, SSSP, CC, BC, PR, and more to come
• Hardware compatibility: usable on most single-node GPU systems -> with or without peer GPU access
• Performance: low runtime, and leverages the underlying hardware well -> more than 500 GTEPS peak BFS performance on 6 GPUs
• Scalability: scalable in terms of both performance and memory usage -> 2.63X, 2.57X, 2.00X, 1.96X, and 3.86X geometric-mean speedups over 16 datasets for BFS, SSSP, CC, BC, and PR -> graphs with 3.62B edges processed using 6 GPUs
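The geometric mean is the standard way to average speedup ratios across datasets, since it treats a 2X gain and a 0.5X loss symmetrically. A minimal sketch of the computation (the input numbers below are made up for illustration, not the poster's per-dataset results):

```python
import math

def geometric_mean(speedups):
    """Geometric mean of a list of speedup ratios."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical per-dataset speedups, for illustration only:
print(geometric_mean([1.5, 2.0, 3.0, 4.0]))
```

The arithmetic mean of the same hypothetical numbers would be 2.625, slightly higher; the geometric mean is the more conservative and conventional summary for ratios.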
[Figure: scaling plots. Three panels plot billion traversed edges per second (GTEPS) against the number of GPUs (1–8), each with Strong_Scaling, Weak_Edge_Scaling, and Weak_Vertex_Scaling series; their y-axis ranges are 0–50, 0–300, and 0–40 GTEPS. A fourth panel plots per-algorithm speedup (BC, BFS, CC, DOBFS, PR, SSSP) for 2–6 GPUs, with speedups ranging from 1 to 4.]
[Diagram: the multi-GPU framework. The input graph is partitioned across GPUs. In each iteration, every GPU runs the single-GPU primitive on its local input frontier and local data, packages its remote output frontier, and pushes it to the owning peer GPUs; each GPU then combines the received remote input frontiers with its local output frontier to form the next local input frontier. Iterate till all GPUs converge.]
What the programmer needs to specify:
• Core single-GPU primitives
• Data to communicate
• How to combine remote and local data
• Stop condition
What the framework takes care of:
• Split frontiers
• Package data
• Push to remote GPUs
• Merge local & received data
• Manage GPUs
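A minimal single-process sketch of the data flow the framework manages, using BFS as the single-GPU primitive. The function name, the toy modulo partitioning, and the plain-Python "partitions" standing in for GPUs are all illustrative assumptions, not Gunrock's actual API:

```python
# Sketch: BFS over a graph partitioned across "GPUs" (here, plain Python
# partitions). Each iteration mirrors the framework's loop: run the
# single-GPU primitive on the local frontier, split discovered vertices
# into local vs. remote, push remote ones to their owning partition,
# combine, and stop once every partition's frontier is empty.

def multi_partition_bfs(adj, num_parts, source):
    owner = lambda v: v % num_parts          # toy 1D partitioning scheme
    depth = {v: -1 for v in adj}             # -1 marks "unvisited"
    depth[source] = 0
    frontiers = [[] for _ in range(num_parts)]
    frontiers[owner(source)].append(source)
    it = 0
    while any(frontiers):                    # iterate till all parts converge
        outbox = [[] for _ in range(num_parts)]
        for p in range(num_parts):
            for u in frontiers[p]:           # "advance": visit neighbor lists
                for v in adj[u]:
                    if depth[v] == -1:
                        depth[v] = it + 1
                        outbox[owner(v)].append(v)  # package for owner part
        frontiers = outbox                   # merge local & "received" data
        it += 1
    return depth
```

For example, on `adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}`, `multi_partition_bfs(adj, 2, 0)` assigns depths 0, 1, 1, 2, 3 to vertices 0 through 4. A real implementation would overlap the package/push step with computation and keep each partition's subgraph resident on its own GPU.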
Optimizations:
• Direction-optimizing traversal
• Compute/communication overlap
• Just-enough memory allocation
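Direction-optimizing traversal switches BFS between push (frontier expands to neighbors) and pull (unvisited vertices probe for visited parents) depending on frontier size, in the style of Beamer's heuristic. A sketch of that decision; the function name, parameters, and the constant `alpha` are illustrative, not Gunrock's tuned values:

```python
# Sketch of a direction-optimizing BFS switch heuristic: push while the
# frontier is small, switch to pull once the frontier's outgoing edge count
# grows large relative to the edges still touching unvisited vertices.

def choose_direction(frontier_edges, unvisited_edges, alpha=14):
    """Return 'pull' when scanning unvisited vertices is likely cheaper."""
    return "pull" if frontier_edges * alpha > unvisited_edges else "push"
```

In a multi-GPU setting this choice interacts with communication: pull iterations touch every local unvisited vertex, so the heuristic is evaluated against per-GPU partition sizes rather than the whole graph.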
Graph algorithm as a data-centric process. Frontier: compact queue of nodes or edges
• Advance: visit neighbor lists
• Filter: select and reorganize
• Compute: per-element computations, combinable with advance or filter
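The three operators above can be sketched on a CSR graph. This is a minimal host-side illustration of what each operator does to a frontier, not Gunrock's CUDA implementation; `filter_frontier` is a name chosen here to avoid shadowing Python's built-in `filter`:

```python
# Minimal CSR-based sketches of the three data-centric operators.
# row_offsets/col_indices form the usual compressed-sparse-row layout.

def advance(row_offsets, col_indices, frontier):
    """Advance: visit the neighbor list of every frontier element."""
    return [col_indices[i]
            for u in frontier
            for i in range(row_offsets[u], row_offsets[u + 1])]

def filter_frontier(frontier, keep):
    """Filter: select elements passing a predicate and compact them."""
    return [v for v in frontier if keep(v)]

def compute(frontier, op):
    """Compute: apply a per-element operation across the frontier."""
    for v in frontier:
        op(v)
```

On the GPU these become data-parallel kernels, and Compute is typically fused into the Advance or Filter kernel rather than launched on its own.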
• Atomic avoidance
• Kernel fusion
All tailored to better support the multi-GPU environment
The GPU hardware and cluster access was provided by NVIDIA. This work was funded by the DARPA XDATA program under AFRL Contract FA8750-13-C-0002 and by NSF awards CCF-1017399 and OCI-1032859.
[5] Z. Fu, H. K. Dasari, B. Bebee, M. Berzins, and B. Thompson. "Parallel breadth first search on GPU clusters." In IEEE International Conference on Big Data, Oct. 2014, pp. 110–118.
[6] H. Liu and H. H. Huang. "Enterprise: Breadth-first graph traversal on GPUs." In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '15, Nov. 2015, pp. 68:1–68:12.
[7] B. Bebee. "What to do with all that bandwidth? GPUs for graph and predictive analytics." 21 Mar. 2016. https://devblogs.nvidia.com/parallelforall/gpus-graph-predictive-analytics/
[8] D. Merrill, M. Garland, and A. Grimshaw. "Scalable GPU graph traversal." In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '12, Feb. 2012, pp. 117–128.