How well do CPU, GPU and Hybrid Graph Processing ...

How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia

Page 1: How well do CPU, GPU and Hybrid Graph Processing ...

How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform?

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu
Networked Systems Laboratory (NetSysLab), University of British Columbia

Page 2: How well do CPU, GPU and Hybrid Graph Processing ...
Page 3: How well do CPU, GPU and Hybrid Graph Processing ...

Networked Systems Laboratory (NetSysLab), University of British Columbia

A golf course…

…a (nudist) beach

(…and 199 days of rain each year)

Page 4: How well do CPU, GPU and Hybrid Graph Processing ...

Graphs are Everywhere

1B users 150B friendships

100B neurons 700T connections

Page 5: How well do CPU, GPU and Hybrid Graph Processing ...

Challenges in Graph Processing

Data-dependent memory access patterns

Large memory footprint

Poor locality

Low compute-to-memory access ratio

Varying degrees of parallelism (both intra- and inter-stage)

Graph500 "mini" graph requires 128 GB.

Page 6: How well do CPU, GPU and Hybrid Graph Processing ...

Processing Elements Characteristics

Challenges:
Data-dependent memory access patterns
Large memory footprint
Poor locality
Low compute-to-memory access ratio
Varying degrees of parallelism (both intra- and inter-stage)

CPUs: large caches, >1 TB memory

GPUs: massive hardware multithreading, caches, ~16 GB memory

Graph500 "mini" graph requires 128 GB.

Assemble a hybrid platform?

Page 7: How well do CPU, GPU and Hybrid Graph Processing ...

Graph Processing Frameworks

Programming Model (Vertex Programming / Linear Algebra)

Architecture (Single-node or Distributed)

High Performance

CPU / GPU / Hybrid

Page 8: How well do CPU, GPU and Hybrid Graph Processing ...

Motivation

How does the combination of architecture and programming model affect the performance and efficiency of the system as a whole?

Page 9: How well do CPU, GPU and Hybrid Graph Processing ...

Graph Processing Frameworks

Framework                   Architecture       Programming Model
Galois (UTexas, Austin)     CPU                Vertex Programming
GraphMat (Intel)            CPU + Distributed  Linear Algebra
Gunrock (UC, Davis)         Multi-GPU          Vertex Programming
Nvgraph (Nvidia)            GPU                Linear Algebra
Totem (UBC)                 CPU + multi-GPU    Vertex Programming

Page 10: How well do CPU, GPU and Hybrid Graph Processing ...

Benchmark Algorithms

• PageRank
  • Ranking web pages
  • Compute intensive

• Single Source Shortest Paths (SSSP)
  • IP routing, transportation networks

• Breadth-First Search (BFS)
  • Finding connected components, subroutine
  • Memory intensive

Page 11: How well do CPU, GPU and Hybrid Graph Processing ...

Evaluation Metrics

§ Raw Performance
  § Traversed Edges Per Second (TEPS): Traversed Edges / Execution Time

§ Energy Consumption
  § Average Power Consumed * Execution Time

§ Scalability
  § Strong scaling w.r.t. processing units
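The three metrics above reduce to simple arithmetic. The following is a minimal Python sketch of them; the function names and the sample measurements are hypothetical and are not taken from the slides.

```python
def teps(traversed_edges: int, execution_time_s: float) -> float:
    """Traversed Edges Per Second = traversed edges / execution time."""
    if execution_time_s <= 0:
        raise ValueError("execution time must be positive")
    return traversed_edges / execution_time_s


def energy_watt_seconds(average_power_w: float, execution_time_s: float) -> float:
    """Energy = average power consumed * execution time (watt-seconds)."""
    return average_power_w * execution_time_s


def strong_scaling_speedup(baseline_time_s: float, scaled_time_s: float) -> float:
    """Speedup on the same input after adding processing units."""
    return baseline_time_s / scaled_time_s


# Hypothetical measurements, for illustration only.
example_teps = teps(traversed_edges=128_000_000, execution_time_s=0.5)
example_energy = energy_watt_seconds(average_power_w=300.0, execution_time_s=0.5)
example_speedup = strong_scaling_speedup(baseline_time_s=2.0, scaled_time_s=1.25)
```

Because energy is the product of power and time, a faster run can use less energy even when it draws more power.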

Page 12: How well do CPU, GPU and Hybrid Graph Processing ...

Testbed

Characteristics     System 1
CPU                 2x Intel Xeon E5-2695 v3 (Haswell)
# CPU Cores         28
Host Memory         512 GB DDR4
L3 Cache            70 MB
PCIe                3.0 x16
GPU                 2x Nvidia Tesla K40c
GPU Thread Count    2880
GPU Memory          12 GB

Page 13: How well do CPU, GPU and Hybrid Graph Processing ...

Datasets

Graph          #Vertices   #Edges   Max Degree   Avg. Degree

Real World
Com-Orkut      3M          234M     33,313       78
LiveJournal    4.8M        68M      20,292       14
Road-USA       28.8M       47.9M    9            1.6
Twitter        52M         3.9B     3,691,240    75

Synthetic
RMAT22         4M          128M     168,729      32
RMAT23         8M          256M     272,808      32
RMAT24         16M         512M     439,994      32
RMAT27         128M        4B       3,910,241    32

Page 14: How well do CPU, GPU and Hybrid Graph Processing ...

WDC, 2012

Page 15: How well do CPU, GPU and Hybrid Graph Processing ...

Memory Consumption

Framework     Memory layout                           PageRank       SSSP           BFS
Nvgraph       CSC (PageRank, SSSP) and CSR (BFS)      1,159 (1.8x)   1,111 (1.0x)   683 (1.0x)
Gunrock       CSR and COO                             641 (1.0x)     1,582 (1.4x)   1,443 (2.1x)
Galois        CSR                                     1,599 (2.5x)   2,074 (1.9x)   1,432 (2.1x)
GraphMat*     DCSC                                    2,818 (4.4x)   2,786 (2.5x)   2,980 (4.4x)
Totem-2S      CSR                                     1,275 (2.0x)   2,198 (2.0x)   1,282 (1.9x)
Totem-2S2G    CSR                                     1,628 (2.5x)   2,587 (2.3x)   1,658 (2.4x)

Memory consumption (in MB) for the RMAT22 graph (edge list size: 512 MB)

* 9,354 MB during the pre-processing step

Page 16: How well do CPU, GPU and Hybrid Graph Processing ...

Experimental Results 1. Raw Performance - PageRank

[Figure: bar chart. Y-axis: Billion TEPS/Iteration (0 to 18). X-axis: Orkut, LiveJournal, RMAT22, RMAT23, RMAT24, RMAT27, Twitter. Series: Nvgraph, Gunrock, Totem-1G, Galois, GraphMat, Totem-2S, Totem-2S2G.]

Fastest: Totem-2S

Nvgraph vs GraphMat

Page 17: How well do CPU, GPU and Hybrid Graph Processing ...

Experimental Results 1. Raw Performance - SSSP

[Figure: bar chart. Y-axis: Billion TEPS (0.00 to 4.50). X-axis: Orkut, LiveJournal, Road_USA, RMAT22, RMAT24, RMAT27, Twitter. Series: Nvgraph, Gunrock, Totem-1G, Galois, GraphMat, Totem-2S, Totem-2S2G.]

Fastest: Totem-2S

CSC is suitable for PageRank

Page 18: How well do CPU, GPU and Hybrid Graph Processing ...

Graph Layout in Memory

[Figure: example directed graph with vertices 0 to 4]

CSR Representation
VertexId   0 1 2 3 4 5*
rowPtr     0 1 3 3 6 8
Index      0 1 2 3 4 5 6 7
edgeList   1 2 3 0 2 4 0 2

CSC Representation
VertexId   0 1 2 3 4 5*
colPtr     0 2 3 6 7 8
Index      0 1 2 3 4 5 6 7
edgeList   3 4 0 1 3 4 1 3
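To make the two layouts concrete, here is a minimal Python sketch that builds both from an edge list. The edge list is an assumption: it is inferred from the rowPtr, colPtr and edgeList arrays on this slide, not read from the original figure. The helper name `build_compressed` is hypothetical.

```python
from typing import List, Tuple

Edge = Tuple[int, int]  # (source, destination)


def build_compressed(num_vertices: int, edges: List[Edge],
                     by_source: bool) -> Tuple[List[int], List[int]]:
    """Return (ptr, edge_list).

    by_source=True  -> CSR: ptr is rowPtr, edge_list holds out-neighbors.
    by_source=False -> CSC: ptr is colPtr, edge_list holds in-neighbors.
    """
    # Key each edge by the vertex that owns it in the chosen layout.
    keyed = [(s, d) if by_source else (d, s) for s, d in edges]
    # Count edges per owning vertex, shifted by one slot.
    ptr = [0] * (num_vertices + 1)
    for key, _ in keyed:
        ptr[key + 1] += 1
    # Prefix sum turns counts into start offsets; the last entry is the edge count.
    for v in range(num_vertices):
        ptr[v + 1] += ptr[v]
    # Scatter neighbors into place, in ascending neighbor order per vertex.
    edge_list = [0] * len(edges)
    cursor = ptr[:-1].copy()
    for key, other in sorted(keyed):
        edge_list[cursor[key]] = other
        cursor[key] += 1
    return ptr, edge_list


# Edge list inferred from the slide's arrays (assumption).
EDGES = [(0, 1), (1, 2), (1, 3), (3, 0), (3, 2), (3, 4), (4, 0), (4, 2)]
row_ptr, csr_edges = build_compressed(5, EDGES, by_source=True)
col_ptr, csc_edges = build_compressed(5, EDGES, by_source=False)
```

The extra pointer entry (index 5* on the slide) stores the total edge count, so the neighbors of vertex v are always `edge_list[ptr[v]:ptr[v + 1]]`, including for the last vertex.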

Page 19: How well do CPU, GPU and Hybrid Graph Processing ...

Experimental Results 1. Raw Performance - BFS

[Figure: bar chart. Y-axis: Billion TEPS (0 to 120). X-axis: Orkut, LiveJournal, RMAT22, RMAT24, RMAT27, Twitter. Series: Nvgraph, Gunrock, Totem-1G, Galois, GraphMat, Totem-2S, Totem-2S2G.]

Fastest: Totem-2S

Nvgraph vs GraphMat

CSR suitable for BFS

Hybrid: ~2x
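One way to see why CSR fits BFS: a top-down BFS scans the out-neighbors of each frontier vertex, and CSR stores them contiguously. The sketch below is a plain level-synchronous BFS in Python, not the implementation of any framework in the comparison. The example arrays are the CSR arrays read from the earlier Graph Layout in Memory slide. Counting every scanned edge as "traversed" is an assumption, since the slides do not say how edges are counted for TEPS.

```python
from typing import List, Tuple


def bfs_csr(row_ptr: List[int], edge_list: List[int],
            source: int) -> Tuple[List[int], int]:
    """Level-synchronous top-down BFS over a CSR graph.

    Returns (levels, traversed_edges); a level of -1 marks an unreachable vertex.
    """
    num_vertices = len(row_ptr) - 1
    levels = [-1] * num_vertices
    levels[source] = 0
    frontier = [source]
    traversed_edges = 0
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for v in frontier:
            # Out-neighbors of v are one contiguous slice of edge_list.
            for i in range(row_ptr[v], row_ptr[v + 1]):
                traversed_edges += 1
                u = edge_list[i]
                if levels[u] == -1:
                    levels[u] = depth
                    next_frontier.append(u)
        frontier = next_frontier
    return levels, traversed_edges


# CSR arrays as read from the Graph Layout in Memory slide.
ROW_PTR = [0, 1, 3, 3, 6, 8]
EDGE_LIST = [1, 2, 3, 0, 2, 4, 0, 2]
levels, traversed = bfs_csr(ROW_PTR, EDGE_LIST, source=0)
```

Dividing `traversed` by the measured execution time gives a TEPS figure under this counting assumption.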

Page 20: How well do CPU, GPU and Hybrid Graph Processing ...

Experimental Results 2. Energy Consumption – GPU Frameworks – Orkut Workload

[Figure: bar chart, one group per algorithm (PageRank, SSSP, BFS). Y-axis: Energy (watt-sec), ticks at 1, 10, 100, 1,000. Bars per group: Nvgraph, Gunrock, Totem-1G, Totem-2S, Totem-2S2G.]

Page 22: How well do CPU, GPU and Hybrid Graph Processing ...

Experimental Results 2. Energy Consumption – CPU Frameworks – Twitter Workload

[Figure: bar chart, one group per algorithm (PageRank, SSSP, BFS). Y-axis: Energy (watt-second), ticks at 1, 10, 100, 1,000, 10,000, 100,000. Bars per group: Galois, GraphMat, Totem-2S, Totem-2S2G.]

Energy efficient: Totem-2S

Page 24: How well do CPU, GPU and Hybrid Graph Processing ...

Summary

• GPU + Linear Algebra | CPU + Vertex Programming = Good Match
• GPU-based frameworks: ?
• CPU-based frameworks: Totem-2S
  • Totem Hybrid: Greenest
• CSC: PageRank
• CSR: BFS, SSSP
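A likely reason CSC pairs well with PageRank is that each vertex can pull rank from its in-neighbors, which CSC stores contiguously, and then write only its own value. The Python sketch below illustrates that pull-style iteration under stated assumptions. The damping factor 0.85, the fixed iteration count and the even redistribution of rank from vertices with no out-edges are not given in the slides. The example arrays are the CSC arrays read from the Graph Layout in Memory slide.

```python
from typing import List


def pagerank_pull(col_ptr: List[int], in_edges: List[int], out_degree: List[int],
                  damping: float = 0.85, iterations: int = 20) -> List[float]:
    """Pull-based PageRank over a CSC graph (in-neighbor lists)."""
    n = len(col_ptr) - 1
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by vertices with no out-edges is spread evenly (assumption).
        dangling = sum(rank[v] for v in range(n) if out_degree[v] == 0)
        new_rank = []
        for v in range(n):
            # In-neighbors of v are one contiguous slice of in_edges.
            incoming = sum(rank[u] / out_degree[u]
                           for u in in_edges[col_ptr[v]:col_ptr[v + 1]])
            new_rank.append((1.0 - damping) / n
                            + damping * (incoming + dangling / n))
        rank = new_rank
    return rank


# CSC arrays as read from the Graph Layout in Memory slide.
COL_PTR = [0, 2, 3, 6, 7, 8]
IN_EDGES = [3, 4, 0, 1, 3, 4, 1, 3]
OUT_DEGREE = [1, 2, 0, 3, 2]  # derived from the matching CSR rowPtr
ranks = pagerank_pull(COL_PTR, IN_EDGES, OUT_DEGREE)
```

Each vertex writes only its own entry in `new_rank`, so a parallel version of this loop needs no atomic updates. A push-style version over CSR would have many vertices adding to the same destination.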

Page 25: How well do CPU, GPU and Hybrid Graph Processing ...

Discussion

Page 26: How well do CPU, GPU and Hybrid Graph Processing ...

Does hybrid have future potential?

[Figure: dual-axis chart of Execution Time (seconds) and Energy (Watt-Sec) for BFS, SSSP and PR on 4S and on 2S2G. One axis runs 0 to 18,000, the other 0 to 18.]

Totem-4S vs Totem-2S2G for RMAT30 (edge list size: 128 GB)

4S machine: 4x Intel Xeon E7-4870 v2 (Ivy Bridge), with 1,536 GB memory

Page 27: How well do CPU, GPU and Hybrid Graph Processing ...

Hybrid Graph Processing

Challenges:
Data-dependent memory access patterns
Large memory footprint
Poor locality
Low compute-to-memory access ratio
Varying degrees of parallelism (both intra- and inter-stage)

CPUs: large caches + summary data structures, >1 TB memory

GPUs: massive hardware multithreading, caches + summary data structures, 16 GB!

Graph Processing

Low Degree | High Degree
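The low-degree / high-degree split can be illustrated with a small partitioning sketch in Python. This is a hypothetical helper, not Totem's partitioner. The slide does not say which processor receives which group or how the cut-off is chosen. Here the cut-off is an edge budget that stands in for a limited memory such as the GPU's.

```python
from typing import List, Tuple


def partition_by_degree(row_ptr: List[int],
                        max_low_degree_edges: int) -> Tuple[List[int], List[int]]:
    """Split vertex ids into (low_degree, high_degree) groups.

    Vertices are visited in ascending degree order and added to the low-degree
    group until the next vertex would exceed the edge budget; that vertex and
    every later one go to the high-degree group.
    """
    n = len(row_ptr) - 1
    degree = [row_ptr[v + 1] - row_ptr[v] for v in range(n)]
    order = sorted(range(n), key=lambda v: (degree[v], v))
    low: List[int] = []
    high: List[int] = []
    used = 0
    for v in order:
        if not high and used + degree[v] <= max_low_degree_edges:
            low.append(v)
            used += degree[v]
        else:
            high.append(v)
    return low, high


# CSR rowPtr as read from the Graph Layout in Memory slide; the budget is hypothetical.
low_group, high_group = partition_by_degree([0, 1, 3, 3, 6, 8], max_low_degree_edges=3)
```

On skewed graphs such as Twitter or RMAT, a few vertices hold a large share of the edges, so a split like this puts many vertices but relatively few edges in the low-degree group.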

Page 28: How well do CPU, GPU and Hybrid Graph Processing ...

Questions

Code @ netsyslab.ece.ubc.ca