
Multi-GPU Graph Analytics
Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang and John D. Owens, University of California, Davis
{ychpan, yzhwang, yudwu, ctcyang, jowens}@ucdavis.edu

Introduction - about Gunrock

What we aimed and achieved

Multi-GPU Framework

Results

Future Work

Acknowledgement

References

[1] Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D. Owens. Multi-GPU Graph Analytics. CoRR, abs/1504.04804, Apr. 2016.

[2] Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. Gunrock: A High-Performance Graph Processing Library on the GPU. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP 2016, Mar. 2016. http://escholarship.org/uc/item/6xz7z9k0

[3] M. Bisson, M. Bernaschi, and E. Mastrostefano, "Parallel distributed breadth first search on the Kepler architecture," IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, Sep. 2015.

[4] M. Bernaschi, G. Carbone, E. Mastrostefano, M. Bisson, and M. Fatica, "Enhanced GPU-based distributed breadth first search," in Proceedings of the 12th ACM International Conference on Computing Frontiers, ser. CF '15, 2015, pp. 10:1–10:8.

Gunrock is a multi-GPU graph processing library that targets:
• High-performance analytics of large graphs
• Low programming complexity in implementing parallel graph algorithms on GPUs
Homepage: http://gunrock.github.io
The copyright of Gunrock is owned by The Regents of the University of California, 2016. All source code is released under Apache 2.0.

graph                           | ref.          | ref. hw.    | ref. perf.  | our hw. | our perf.   | comp.
com-orkut (3M, 117M, UD)        | Bisson [3]    | 1x K20X x4  | 2.67 GTEPS  | 4x K40  | 11.42 GTEPS | 5.33X
com-Friendster (66M, 1.81B, UD) | Bisson [3]    | 1x K20X x64 | 15.68 GTEPS | 4x K40  | 14.1 GTEPS  | 0.90X
kron_n23_16 (8M, 256M, UD)      | Bernaschi [4] | 1x K20X x4  | ~1.3 GTEPS  | 4x K40  | 30.8 GTEPS  | 23.7X
kron_n25_16 (32M, 1.07B, UD)    | Bernaschi [4] | 1x K20X x16 | ~3.2 GTEPS  | 6x K40  | 31.0 GTEPS  | 9.69X
kron_n25_32 (32M, 1.07B, UD)    | Fu [5]        | 2x K20 x32  | 22.7 GTEPS  | 4x K40  | 32.0 GTEPS  | 1.41X
kron_n23_32 (8M, 256M, D)       | Fu [5]        | 2x K20 x2   | 6.3 GTEPS   | 4x K40  | 27.9 GTEPS  | 4.43X
kron_n24_32 (16.8M, 1.07B, UD)  | Liu [6]       | 2x K40 x1   | 15 GTEPS    | 2x K40  | 77.7 GTEPS  | 5.18X
kron_n24_32 (16.8M, 1.07B, UD)  | Liu [6]       | 4x K40 x1   | 18 GTEPS    | 4x K40  | 67.7 GTEPS  | 3.76X
kron_n24_32 (16.8M, 1.07B, UD)  | Liu [6]       | 8x K40 x1   | 18.4 GTEPS  | 4x K80  | 40.2 GTEPS  | 2.18X
twitter-mpi (52.6M, 1.96B, D)   | Bebee [7]     | 1x K40 x16  | 0.2242 sec  | 3x K40  | 94.31 ms    | 2.38X
rmat_n21_64 (2M, 128M, D)       | Merrill [8]   | 4x C2050 x1 | 8.3 GTEPS   | 4x K40  | 23.7 GTEPS  | 2.86X

• Extending Gunrock onto multiple nodes
• Asynchronous graph algorithms
• Other partitioning methods
• More algorithms

Comparison with previous work on multi-GPU BFS

Scalability and speedup. Top-Left: BFS, Top-Right: DOBFS, Bottom-Left: PR, Bottom-Right: Overall with 16 datasets

• Programmability: easy to develop graph primitives on multiple GPUs
-> the programmer only needs to specify a few things (refer to Framework)

• Algorithm generality: supports a wide range of graph algorithms
-> BFS, DOBFS, SSSP, CC, BC, PR, and more to come

• Hardware compatibility: usable on most single-node GPU systems
-> with or without peer GPU access

• Performance: low runtime, and leverages the underlying hardware well
-> more than 500 GTEPS peak BFS performance on 6 GPUs

• Scalability: scalable in terms of both performance and memory usage
-> 2.63X, 2.57X, 2.00X, 1.96X, and 3.86X geometric mean speedups over 16 datasets for BFS, SSSP, CC, BC, and PR
-> graphs with 3.62B edges processed using 6 GPUs

[Figure: three panels (BFS, DOBFS, PR) plot Billion Traversed Edges per Second (GTEPS) against Number of GPUs (1-8) under Strong_Scaling, Weak_Edge_Scaling, and Weak_Vertex_Scaling; the fourth panel plots Speedup per algorithm (BC, BFS, CC, DOBFS, PR, SSSP) for 2-6 GPUs.]


[Framework diagram: the graph is partitioned across GPUs (GPU 0, GPU 1, ...), each holding its local data. In every iteration, each GPU runs the single-GPU primitive on its local input frontier to produce a local output frontier; entries destined for other GPUs are packaged as remote output frontiers, pushed to their owners, and combined there with local data as remote input frontiers. Iterate until all GPUs converge.]
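The Partition step in the diagram splits the graph across GPUs before iteration begins. The sketch below is only a rough illustration of the idea, assuming a simple modulo hash that assigns each vertex to a GPU and keeps every edge with its source vertex; the `CsrGraph` type and `partition_by_hash` function are hypothetical, and Gunrock's actual partitioner supports other strategies and layouts.

```
#include <cstdint>
#include <vector>

// Hypothetical CSR container; not Gunrock's actual graph data structure.
struct CsrGraph {
    std::vector<int64_t> row_offsets;   // size = number of vertices + 1
    std::vector<int32_t> col_indices;   // destination vertex of each edge
};

// Assign each vertex to a GPU by a modulo hash and keep every edge with its
// source vertex. Destinations remain global vertex ids; an edge whose
// destination is owned by another GPU is what later produces a
// "remote output frontier" entry during traversal.
std::vector<CsrGraph> partition_by_hash(const CsrGraph& g, int num_gpus) {
    const int64_t num_vertices = static_cast<int64_t>(g.row_offsets.size()) - 1;
    std::vector<CsrGraph> parts(num_gpus);
    for (auto& p : parts) p.row_offsets.push_back(0);

    for (int64_t v = 0; v < num_vertices; ++v) {
        CsrGraph& p = parts[v % num_gpus];          // owner GPU of vertex v
        for (int64_t e = g.row_offsets[v]; e < g.row_offsets[v + 1]; ++e)
            p.col_indices.push_back(g.col_indices[e]);
        p.row_offsets.push_back(static_cast<int64_t>(p.col_indices.size()));
    }
    return parts;
}
```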

What the programmer needs to specify (sketched below):
• Core single-GPU primitives
• Data to communicate
• How to combine remote and local data
• Stop condition

What the framework takes care of:
• Split frontiers
• Package data
• Push to remote GPUs
• Merge local & received data
• Manage GPUs
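To make this division of labor concrete, here is a minimal host-side sketch assuming the same modulo ownership rule as the partitioning sketch above. The `MultiGpuPrimitive` interface and `Enact` loop are invented for illustration and are not Gunrock's actual API; the real framework runs the per-GPU work concurrently on the devices rather than sequentially on the host.

```
#include <vector>

// Hypothetical programmer-facing interface; not Gunrock's actual API.
struct MultiGpuPrimitive {
    // Core single-GPU primitive: consume the input frontier, produce its output.
    virtual std::vector<int> RunLocal(int gpu, const std::vector<int>& frontier) = 0;
    // Data to communicate for frontier entries owned by another GPU.
    virtual std::vector<char> Package(int gpu, const std::vector<int>& remote) = 0;
    // Combine received remote data with local data; return the vertices that
    // must enter this GPU's next input frontier.
    virtual std::vector<int> Combine(int gpu, const std::vector<char>& received) = 0;
    // Stop condition for this GPU.
    virtual bool Converged(int gpu) const = 0;
    virtual ~MultiGpuPrimitive() = default;
};

// What the framework takes care of, sequentialized on the host for clarity.
void Enact(MultiGpuPrimitive& p, int num_gpus,
           std::vector<std::vector<int>> frontiers) {   // one frontier per GPU
    for (bool done = false; !done;) {
        std::vector<std::vector<int>> next(num_gpus);
        for (int g = 0; g < num_gpus; ++g) {
            // Split this GPU's output frontier by owner (modulo hash ownership).
            std::vector<std::vector<int>> by_owner(num_gpus);
            for (int v : p.RunLocal(g, frontiers[g]))
                by_owner[v % num_gpus].push_back(v);
            for (int dst = 0; dst < num_gpus; ++dst) {
                if (by_owner[dst].empty()) continue;
                if (dst == g) {     // local part stays on this GPU
                    next[g].insert(next[g].end(), by_owner[g].begin(), by_owner[g].end());
                } else {            // package, push to the remote GPU, merge there
                    std::vector<int> extra = p.Combine(dst, p.Package(g, by_owner[dst]));
                    next[dst].insert(next[dst].end(), extra.begin(), extra.end());
                }
            }
        }
        frontiers = std::move(next);
        done = true;                // stop when every GPU reports convergence
        for (int g = 0; g < num_gpus; ++g) done = done && p.Converged(g);
    }
}
```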

Optimizations
• Direction-optimizing traversal
• Compute/communication overlap (sketched below)
• Just-enough memory allocation
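Compute/communication overlap is the standard CUDA pattern of issuing this iteration's local kernel and the GPU-to-GPU frontier copy on different streams so they proceed concurrently. A minimal sketch using only standard CUDA runtime calls; the kernel and buffer names are placeholders, not Gunrock code.

```
#include <cuda_runtime.h>

// Placeholder for the local single-GPU primitive of one iteration.
__global__ void local_primitive(int* frontier, int n) { /* ... */ }

// Overlap local compute with pushing the remote output frontier to a peer GPU.
void iterate_with_overlap(int* d_local_frontier, int local_n,
                          const int* d_remote_out, int src_dev,
                          int* d_peer_in, int peer_dev, int remote_n,
                          cudaStream_t compute, cudaStream_t comm) {
    // Launch this iteration's compute on the compute stream ...
    local_primitive<<<(local_n + 255) / 256, 256, 0, compute>>>(d_local_frontier, local_n);

    // ... while the remote output frontier travels GPU-to-GPU on the
    // communication stream (a direct copy when peer access is enabled).
    cudaMemcpyPeerAsync(d_peer_in, peer_dev, d_remote_out, src_dev,
                        remote_n * sizeof(int), comm);

    // Both must finish before remote and local data are combined.
    cudaStreamSynchronize(compute);
    cudaStreamSynchronize(comm);
}
```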

Graph algorithm as a data-centric process
Frontier: compact queue of nodes or edges
Advance: visit neighbor lists
Filter: select and reorganize
Compute: per-element computations, combinable with advance or filter
(a simplified BFS sketch of this model appears below)

• Atomic avoidance
• Kernel fusion

All tailored to better support the multi-GPU environment
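As a concrete illustration of the data-centric operators above, the following serial sketch expresses one BFS iteration as an advance over neighbor lists followed by a filter that keeps unvisited vertices, with the per-element compute (writing the depth label) fused into the filter. It is a simplified host-side analogue under assumed CSR inputs, not Gunrock's GPU implementation.

```
#include <vector>

// One BFS iteration in the frontier / advance / filter style on a CSR graph.
// 'depth' holds -1 for unvisited vertices and the BFS depth otherwise.
std::vector<int> bfs_iteration(const std::vector<int>& row_offsets,
                               const std::vector<int>& col_indices,
                               const std::vector<int>& frontier,
                               std::vector<int>& depth, int iteration) {
    // Advance: visit the neighbor list of every vertex in the input frontier.
    std::vector<int> expanded;
    for (int v : frontier)
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
            expanded.push_back(col_indices[e]);

    // Filter: keep only unvisited vertices; the per-element compute
    // (writing the depth label) is fused into the same pass.
    std::vector<int> next_frontier;
    for (int u : expanded) {
        if (depth[u] < 0) {
            depth[u] = iteration + 1;
            next_frontier.push_back(u);
        }
    }
    return next_frontier;   // becomes the next iteration's input frontier
}
```

Starting from depth = -1 everywhere except depth[source] = 0 and frontier = {source}, repeating this step until the frontier is empty yields BFS depths; on the GPU, Gunrock parallelizes each operator and applies the optimizations listed above.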

The GPU hardware and cluster access was provided by NVIDIA. This work was funded by the DARPA XDATA program under AFRL Contract FA8750-13-C-0002 and by NSF awards CCF-1017399 and OCI-1032859.

[5] Z. Fu, H. K. Dasari, B. Bebee, M. Berzins, and B. Thompson, "Parallel breadth first search on GPU clusters," in IEEE International Conference on Big Data, Oct. 2014, pp. 110–118.

[6] H. Liu and H. H. Huang, "Enterprise: Breadth-first graph traversal on GPUs," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '15, Nov. 2015, pp. 68:1–68:12.

[7] B. Bebee, "What to do with all that bandwidth? GPUs for graph and predictive analytics," 21 Mar. 2016, https://devblogs.nvidia.com/parallelforall/gpus-graph-predictive-analytics/.

[8] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU graph traversal. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '12, Feb. 2012, pp. 117–128.