
Multi-GPU Graph Analytics
Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D. Owens, University of California, Davis
{ychpan, yzhwang, yudwu, ctcyang, jowens}@ucdavis.edu

Introduction - about Gunrock

Gunrock is a multi-GPU graph processing library that targets:
• High-performance analytics of large graphs
• Low programming complexity when implementing parallel graph algorithms on GPUs

Homepage: http://gunrock.github.io
Gunrock is copyright The Regents of the University of California, 2016. All source code is released under the Apache 2.0 license.

What we aimed, and achieved

• Programmability: easy to develop graph primitives on multiple GPUs -> the programmer only needs to specify a few things (refer to Framework)
• Algorithm generality: support for a wide range of graph algorithms -> BFS, DOBFS, SSSP, CC, BC, PR, and more to come
• Hardware compatibility: usable on most single-node GPU systems -> with or without peer GPU access
• Performance: low runtime that leverages the underlying hardware well -> more than 500 GTEPS peak BFS performance on 6 GPUs
• Scalability: scalable in both performance and memory usage -> 2.63X, 2.57X, 2.00X, 1.96X, and 3.86X geometric-mean speedups over 16 datasets for BFS, SSSP, CC, BC, and PR -> graphs with 3.62B edges processed using 6 GPUs

Results

Comparison with previous work on multi-GPU BFS. Each graph is listed with its vertex count, edge count, and directedness (UD: undirected, D: directed); a worked example of the "comp." column follows the table.

graph | ref. | ref. hw. | ref. perf. | our hw. | our perf. | comp.
com-orkut (3M, 117M, UD) | Bisson [3] | 1 x K20X x 4 | 2.67 GTEPS | 4 x K40 | 11.42 GTEPS | 5.33X
com-Friendster (66M, 1.81B, UD) | Bisson [3] | 1 x K20X x 64 | 15.68 GTEPS | 4 x K40 | 14.1 GTEPS | 0.90X
kron_n23_16 (8M, 256M, UD) | Bernaschi [4] | 1 x K20X x 4 | ~1.3 GTEPS | 4 x K40 | 30.8 GTEPS | 23.7X
kron_n25_16 (32M, 1.07B, UD) | Bernaschi [4] | 1 x K20X x 16 | ~3.2 GTEPS | 6 x K40 | 31.0 GTEPS | 9.69X
kron_n25_32 (32M, 1.07B, UD) | Fu [5] | 2 x K20 x 32 | 22.7 GTEPS | 4 x K40 | 32.0 GTEPS | 1.41X
kron_n23_32 (8M, 256M, D) | Fu [5] | 2 x K20 x 2 | 6.3 GTEPS | 4 x K40 | 27.9 GTEPS | 4.43X
kron_n24_32 (16.8M, 1.07B, UD) | Liu [6] | 2 x K40 x 1 | 15 GTEPS | 2 x K40 | 77.7 GTEPS | 5.18X
kron_n24_32 (16.8M, 1.07B, UD) | Liu [6] | 4 x K40 x 1 | 18 GTEPS | 4 x K40 | 67.7 GTEPS | 3.76X
kron_n24_32 (16.8M, 1.07B, UD) | Liu [6] | 8 x K40 x 1 | 18.4 GTEPS | 4 x K80 | 40.2 GTEPS | 2.18X
twitter-mpi (52.6M, 1.96B, D) | Bebee [7] | 1 x K40 x 16 | 0.2242 sec | 3 x K40 | 94.31 ms | 2.38X
rmat_n21_64 (2M, 128M, D) | Merrill [8] | 4 x C2050 x 1 | 8.3 GTEPS | 4 x K40 | 23.7 GTEPS | 2.86X
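As a worked example of the "comp." column, take the twitter-mpi row, which is reported as runtime rather than throughput. GTEPS is billions of traversed edges per second; since the exact TEPS definition varies between BFS papers (all edges vs. only tree edges), the converted throughputs below are illustrative, while the 2.38X ratio depends only on the two runtimes:

GTEPS ≈ edges traversed / (10^9 x runtime in seconds)
ours: 1.96x10^9 / (10^9 x 0.09431 s) ≈ 20.8 GTEPS
ref.: 1.96x10^9 / (10^9 x 0.2242 s) ≈ 8.7 GTEPS
comp. = 0.2242 s / 0.09431 s ≈ 2.38X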

Scalability and speedup. Top-left: BFS; top-right: DOBFS; bottom-left: PR; bottom-right: overall speedup with 16 datasets.
[Figures: billion traversed edges per second (GTEPS) vs. number of GPUs (1-8), with curves for strong scaling, weak edge scaling, and weak vertex scaling for BFS, DOBFS, and PR; and per-algorithm speedup (BC, BFS, CC, DOBFS, PR, SSSP) for 2-6 GPUs.]

Multi-GPU Framework

The input graph is partitioned across the GPUs, which then iterate until all GPUs converge. In each iteration, every GPU runs the single-GPU primitive on its local input frontier together with the remote input frontier received from its peers, operating on its local data. The resulting output is split into a local output frontier, kept for the next iteration, and remote output frontiers, which are packaged, pushed to their owner GPUs (e.g., from GPU 0 to GPU 1 and vice versa), and combined with the local data there. A sketch of this per-iteration flow follows the two lists below.

What the programmer needs to specify:
• Core single-GPU primitives
• Data to communicate
• How to combine remote and local data
• Stop condition

What the framework takes care of:
• Split frontiers
• Package data
• Push to remote GPUs
• Merge local & received data
• Manage GPUs
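A minimal host-side sketch of this bulk-synchronous flow, tying together the diagram description and the two lists above. This is plain serial C++ for illustration only; names such as PrimitiveHooks and run_multi_gpu are hypothetical and are not Gunrock's API, and a real implementation runs the per-GPU work concurrently and moves the packages with GPU-to-GPU copies.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using VertexId = int32_t;
using Frontier = std::vector<VertexId>;  // compact queue of vertices

// What the programmer specifies for one graph primitive (hypothetical names).
struct PrimitiveHooks {
  // Core single-GPU primitive: consume this GPU's input frontier, produce its output.
  std::function<Frontier(int gpu, const Frontier& input)> advance_local;
  // Data to communicate: which GPU owns a given vertex (decided by the partitioner).
  std::function<int(VertexId v)> owner_of;
  // How to combine remote (received) and local data.
  std::function<void(int gpu, const Frontier& received, Frontier& local)> combine;
  // Stop condition, checked once per iteration.
  std::function<bool(const std::vector<Frontier>& inputs)> converged;
};

// What the framework takes care of: split, package, push, merge, manage GPUs.
void run_multi_gpu(const PrimitiveHooks& hooks, int num_gpus,
                   std::vector<Frontier> input /* one frontier per GPU */) {
  while (!hooks.converged(input)) {
    std::vector<Frontier> next_local(num_gpus);
    // packages[src][dst] holds frontier elements produced on src but owned by dst.
    std::vector<std::vector<Frontier>> packages(num_gpus,
                                                std::vector<Frontier>(num_gpus));

    for (int g = 0; g < num_gpus; ++g) {                 // manage GPUs
      Frontier out = hooks.advance_local(g, input[g]);   // single-GPU primitive
      for (VertexId v : out) {                           // split the output frontier
        int owner = hooks.owner_of(v);
        if (owner == g) next_local[g].push_back(v);
        else            packages[g][owner].push_back(v); // package data for peers
      }
    }

    for (int g = 0; g < num_gpus; ++g) {                 // push + merge
      for (int peer = 0; peer < num_gpus; ++peer)
        if (peer != g)
          hooks.combine(g, packages[peer][g], next_local[g]);
      input[g] = std::move(next_local[g]);               // next iteration's input
    }
  }
}
```

Here convergence could simply mean that every GPU's input frontier is empty; in the real library, the per-GPU work overlaps with the package transfers (see Optimizations).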

Optimizations:
• Direction-optimizing traversal (sketched below)
• Compute / communication overlap
• Just-enough memory allocation
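A minimal sketch of the push/pull switch behind direction-optimizing traversal, written as plain C++ on a host-side CSR graph. The 1/20 threshold and the scan order are illustrative assumptions rather than the library's actual heuristic, and the pull step assumes an undirected (symmetrized) graph.

```cpp
#include <cstdint>
#include <vector>

struct CsrGraph {
  std::vector<int64_t> row_offsets;   // size |V| + 1
  std::vector<int32_t> col_indices;   // size |E|
};

// One BFS level: "push" expands the frontier's neighbor lists; "pull" lets each
// unvisited vertex look for a parent already at the current level. Switching to
// pull when the frontier is large avoids touching most edges from the frontier.
std::vector<int32_t> bfs_level(const CsrGraph& g,
                               const std::vector<int32_t>& frontier,
                               std::vector<int32_t>& depth, int32_t level) {
  const int32_t n = static_cast<int32_t>(g.row_offsets.size()) - 1;
  std::vector<int32_t> next;
  if (frontier.size() < static_cast<size_t>(n) / 20) {    // push (top-down)
    for (int32_t u : frontier)
      for (int64_t e = g.row_offsets[u]; e < g.row_offsets[u + 1]; ++e) {
        int32_t v = g.col_indices[e];
        if (depth[v] < 0) { depth[v] = level + 1; next.push_back(v); }
      }
  } else {                                                 // pull (bottom-up)
    for (int32_t v = 0; v < n; ++v) {
      if (depth[v] >= 0) continue;                         // already visited
      for (int64_t e = g.row_offsets[v]; e < g.row_offsets[v + 1]; ++e)
        if (depth[g.col_indices[e]] == level) {            // neighbor is in frontier
          depth[v] = level + 1;
          next.push_back(v);
          break;
        }
    }
  }
  return next;
}
```

The other two optimizations concern the multi-GPU loop above: overlapping local compute with the package transfers, and allocating frontier and temporary buffers only as large as an iteration actually needs.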

Graph algorithm as a data-centric process: a frontier is a compact queue of nodes or edges, and operators act on frontiers:
• Advance: visit neighbor lists
• Filter: select and reorganize
• Compute: per-element computations, combinable with advance or filter
Together with optimizations such as atomic avoidance and kernel fusion, all of these are tailored to better support the multi-GPU environment (a minimal sketch follows).
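A minimal single-GPU sketch of the frontier / advance / filter abstraction, with BFS as the example primitive. This is plain serial C++; the function names are illustrative rather than the library's operator API, and the per-element compute step is fused into the filter predicate.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

using VertexId = int32_t;
using Frontier = std::vector<VertexId>;   // compact queue of vertices

struct CsrGraph {
  std::vector<int64_t> row_offsets;   // size |V| + 1
  std::vector<VertexId> col_indices;  // size |E|
};

// Advance: visit the neighbor list of every frontier element.
Frontier advance(const CsrGraph& g, const Frontier& in) {
  Frontier out;
  for (VertexId u : in)
    for (int64_t e = g.row_offsets[u]; e < g.row_offsets[u + 1]; ++e)
      out.push_back(g.col_indices[e]);
  return out;
}

// Filter: select and reorganize, keeping only elements that pass a predicate.
Frontier filter(const Frontier& in, const std::function<bool(VertexId)>& keep) {
  Frontier out;
  for (VertexId v : in)
    if (keep(v)) out.push_back(v);
  return out;
}

// BFS as repeated advance + filter; the per-element compute (recording depth)
// is fused into the filter predicate.
std::vector<int32_t> bfs(const CsrGraph& g, VertexId src) {
  std::vector<int32_t> depth(g.row_offsets.size() - 1, -1);
  depth[src] = 0;
  Frontier frontier{src};
  for (int32_t level = 0; !frontier.empty(); ++level) {
    Frontier neighbors = advance(g, frontier);
    frontier = filter(neighbors, [&](VertexId v) {
      if (depth[v] >= 0) return false;  // already visited: drop
      depth[v] = level + 1;             // compute step
      return true;                      // keep for the next frontier
    });
  }
  return depth;
}
```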

Future Work
• Extending Gunrock onto multiple nodes
• Asynchronous graph algorithms
• Other partitioning methods
• More algorithms

Acknowledgement
The GPU hardware and cluster access were provided by NVIDIA. This work was funded by the DARPA XDATA program under AFRL Contract FA8750-13-C-0002 and by NSF awards CCF-1017399 and OCI-1032859.

References
[1] Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D. Owens. Multi-GPU graph analytics. CoRR, abs/1504.04804, Apr. 2016.
[2] Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. Gunrock: A high-performance graph processing library on the GPU. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2016, Mar. 2016. http://escholarship.org/uc/item/6xz7z9k0
[3] M. Bisson, M. Bernaschi, and E. Mastrostefano. Parallel distributed breadth first search on the Kepler architecture. IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, Sep. 2015.
[4] M. Bernaschi, G. Carbone, E. Mastrostefano, M. Bisson, and M. Fatica. Enhanced GPU-based distributed breadth first search. In Proceedings of the 12th ACM International Conference on Computing Frontiers, CF '15, 2015, pp. 10:1-10:8.
[5] Z. Fu, H. K. Dasari, B. Bebee, M. Berzins, and B. Thompson. Parallel breadth first search on GPU clusters. In IEEE International Conference on Big Data, Oct. 2014, pp. 110-118.
[6] H. Liu and H. H. Huang. Enterprise: Breadth-first graph traversal on GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, Nov. 2015, pp. 68:1-68:12.
[7] B. Bebee. What to do with all that bandwidth? GPUs for graph and predictive analytics. 21 Mar. 2016. https://devblogs.nvidia.com/parallelforall/gpus-graph-predictive-analytics/
[8] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU graph traversal. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, pp. 117-128, Feb. 2012.