Which Graph Representation to Select for Static Graph ...

32
Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU Thorsten Blaß and Michael Philippsen

Transcript of Which Graph Representation to Select for Static Graph ...

Page 1: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU

Thorsten Blaß and Michael Philippsen

Page 2: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 2

Motivation (I)

Your task: write static graph algorithm on the GPU.

Preparation: check literature□ several data structures to represent graphs.□ all papers pick CRS data structure

and focus on optimizing algorithmic aspects.

You wonder: □ Is data structure irrelevant for performance? □ Is it right to follow the crowed?

This paper: Graph representation matters!

Page 3: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 3

1.0

1.1

1.2

1.3

1.4

1.5

1.6

norm

alize

d fil

e-to

-file

runt

ime

w.r.

t. be

st se

lect

ion

benchmarks

With fastest data structure LonestarGPU codes with CRS data structure

BFS MST SSSP

Motivation (II)

§ Adequate data structure speeds up graph algorithms.

Our improvements: Fastest runs when choosing the adequate data structure.

Available Lonestar GPU benchmark implementations.

Higher = slower

Page 4: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 4

Study Setup

§ What is the adequate data structure?

§ A total of 754,000 measurements:□ 10 state-of-the-art static graph algorithms (do not modify the graph).□ 19 input graphs.□ 4 architecturally different Nvidia GPUs.□ 10 graph data structures.□ 3 widely used graph exchange formats.

Page 5: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 5

Study Setup – Benchmark Graph Algorithms

§ 10 state-of-the-art implementations from recent research and benchmark suites with different characteristics.

Page 6: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 6

§ 19 input graphs with different characteristics.

Study Setup – Input Graphs

Page 7: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 7

Study Setup – GPUs

§ 4 architecturally different Nvidia GPUs□ Titan XP, 12.2 GB, Pascal□ Geforce GTX 980, 4.0 GB, Maxwell□ Geforce GTX 680, 2.0 GB, Kepler□ Geforce GTX 580, 1.0 GB, Fermi

§ Only Titan XP can run all measurement configurations.

§ All measurements scale w.r.t. nominal GPU performance.à Suffices to discuss the Titan XP measurements.

Page 8: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 8

Decision Tree (file-to-file)

Excess Rate < 20%

N

N

Y

Y

Y

N

ELL HYBavg

RB [symm]

Best data structure

Best exchg. format

ELL C?S

RB [symm]

C?S

RB [symm]

COO/EL

GR/MTX[symm]

Touch Count < 5

Edge-Centric

Page 9: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 9

Decision Tree – First Layer (I)

Excess Rate < 20%

N

N

Y

Y

Y

N

ELL HYBavg

RB [symm]

Best data structure

Best exchg. format

ELL C?S

RB [symm]

C?S

RB [symm]

COO/EL

GR/MTX[symm]

Touch Count < 5

Edge-Centric

Page 10: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 10

Decision Tree – First Layer (II)

§ First decision depends on whether the algorithm is edge-centric or node-centric.□ Splits data structures into two groups.

§ For edge-centric algorithms COO and EL perform best.□ The other formats are 25% slower in our measurements.

§ COOrdinate format□ Stores source, target, and weight of an

edge in different arrays.

§ Edge List format □ Stores an edge as a struct in an array.

0

15

3

32 26

5

2

64

4

1

𝑠𝑟𝑐 = 0 1 3 3 3 4 5𝑡𝑟𝑔𝑡 = 1 3 0 2 5 3 6

𝑤𝑒𝑖𝑔ℎ𝑡𝑠 = 5 3 2 6 2 1 4

𝑒𝑑𝑔𝑒𝑠 =015,133,302,⋯

Page 11: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 11

Decision Tree – First Layer (III)

§ COO and EL show the same runtime behavior.□ Either COO or EL can be picked for edge-centric algorithms.□ Runtime can be minimized with MTX/GR exchange format.

0.95

1

1.05

1.1

1.15

1.2

1.25

1.3

1.35

norm

alize

d fil

e-to

-file

exe

cutio

n tim

e w

.r.t.

aver

age

of th

e fa

stes

t com

bina

tion

of in

put g

raph

and

ex

chan

ge fo

rmat

per

dat

a st

ruct

and

ben

chm

ark

benchmarks

COO ELL

RBsymm

RB

MTX, MTXsymm, GR, GRsymm

APSP BFS MIS MST PR-n PR-e SpMV SSSP WCC-n WCC-e

Higher = slower

Page 12: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 12

Decision Tree – First Layer (IV)

§ The MTX and GR input file formats are dumps of the COO and EL formats.

§ No costly conversion from file to representation in CPU memory.

§ The RB format is 20% slower in our measurements.

< 𝑀𝑇𝑋 ℎ𝑒𝑎𝑑𝑒𝑟 >7 7 70 1 51 3 33 0 23 2 63 5 24 3 15 6 4

< 𝐺𝑅 ℎ𝑒𝑎𝑑𝑒𝑟 >𝑝 𝑠𝑝 7 7 7𝑎 0 1 5𝑎 1 3 3𝑎 3 0 2𝑎 3 2 6𝑎 3 5 2𝑎 4 3 1𝑎 5 6 4

Encoded edges

Size of the array(s)

Page 13: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 13

Decision Tree (file-to-file)

Excess Rate < 20%

N

N

Y

Y

Y

N

ELL HYBavg

RB [symm]

Best data structure

Best exchg. format

ELL C?S

RB [symm]

C?S

RB [symm]

COO/EL

GR/MTX[symm]

Touch Count < 5

Edge-Centric

Page 14: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 14

Decision Tree – Second Layer (I)

§ Node-centric algorithms need a deeper analysis.

§ Compressed Row Storage format□ Most memory efficient format. □ Indirect addressing step necessary.

§ Compressed Column Storage format □ Same as CRS but stores the

predecessors of a node.

s𝑟𝑐@AB = 0 1 2 2 5 6 7 7𝒔𝒖𝒄𝒄 = 1 3 0 2 5 3 6𝑤𝑒𝑖𝑔ℎ𝑡𝑠 = 5 3 2 6 2 1 4

s𝑟𝑐@AB = 0 1 2 3 5 5 7 8𝒑𝒓𝒆𝒅 = 3 0 3 1 4 3 5𝑤𝑒𝑖𝑔ℎ𝑡 = 2 5 6 3 1 2 4

0

15

3

32 26

5

2

64

4

1

Page 15: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 15

§ ELL format□ Fixed row length reserved for successors.□ Well suited for vector processors.□ Memory efficiency depends on degree

distribution of a graph.

§ HYBrid format□ Better memory efficiency than ELL.□ Heuristics (avg, dstr) find smaller the row length.□ Additional successors are stored in EL.

𝑒𝑙𝑙L53∗214∗

𝑒𝑙𝑙N13∗036∗

Decision Tree – Second Layer (II)

0

15

3

32 26

5

2

64

4

1

𝑠𝑢𝑐𝑐1 ∗ ∗3 ∗ ∗∗ ∗ ∗0 2 53 ∗ ∗6 ∗ ∗∗ ∗ ∗

𝑤𝑒𝑖𝑔ℎ𝑡𝑠5 ∗ ∗3 ∗ ∗∗ ∗ ∗2 6 21 ∗ ∗4 ∗ ∗∗ ∗ ∗

𝑒𝑙326,352

Page 16: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 16

Decision Tree – Second Layer (III)

§ Data structures cause costs for construction, shipping, ...

1.0

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2.0

0 5 10 15 20 25 30

norm

alize

d fil

e-to

-file

exe

cutio

n tim

e w

.r.t.

best

dat

a st

ruct

ure

touch count

C?S (RB) ELL (RB) HYB_avg (RB)Higher = slower

Page 17: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 17

1.0

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2.0

0 5 10 15 20 25 30

norm

alize

d fil

e-to

-file

exe

cutio

n tim

e w

.r.t.

best

dat

a st

ruct

ure

touch count

C?S (RB) ELL (RB) HYB_avg (RB)

Decision Tree – Second Layer (III)

§ Data structures cause costs for construction, shipping, ...

Touch count splits the measurements into two

groups.

Higher = slower

Page 18: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 18

1.0

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2.0

0 5 10 15 20 25 30

norm

alize

d fil

e-to

-file

exe

cutio

n tim

e w

.r.t.

best

dat

a st

ruct

ure

touch count

C?S (RB) ELL (RB) HYB_avg (RB)

Decision Tree – Second Layer (III)

§ Data structures cause costs for construction, shipping, ...

We will discuss this later.

Higher = slower

Page 19: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 19

1.0

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2.0

0 5 10 15 20 25 30

norm

alize

d fil

e-to

-file

exe

cutio

n tim

e w

.r.t.

best

dat

a st

ruct

ure

touch count

C?S (RB) ELL (RB) HYB_avg (RB)Higher = slower

Decision Tree – Second Layer (III)

§ Data structures cause costs for construction, shipping, ...

CRS or CCS perform best, followed by ELL and HYBavg.

Page 20: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 20

Decision Tree – Second Layer (IV)

§ The Rutherford Boeing (RB) format is a dump of the CRS format.

§ Other file formats are 40% slower in our measurements.

8 2 2 2𝑖𝑔𝑎 7 7 711𝐼4 11𝐼30 1 2 25 6 71 3 0 25 3 65 3 2 62 1 4

𝐼𝑛𝑓𝑜𝑟𝑚𝑎𝑡𝑖𝑜𝑛 𝑎𝑏𝑜𝑢𝑡 𝑡ℎ𝑒𝑠𝑡𝑟𝑢𝑐𝑡𝑢𝑟𝑒 𝑜𝑓 𝑡ℎ𝑒 𝑒𝑛𝑐𝑜𝑑𝑒𝑑 𝑔𝑟𝑎𝑝ℎ𝑎𝑛𝑑 𝑡ℎ𝑒 𝑓𝑖𝑙𝑒.

𝑠𝑟𝑐@AB

𝑠𝑢𝑐𝑐/𝑝𝑟𝑒𝑑

𝑤𝑒𝑖𝑔ℎ𝑡𝑠

Page 21: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 21

Decision Tree (file-to-file)

Excess Rate < 20%

N

N

Y

Y

Y

N

ELL HYBavg

RB [symm]

Best data structure

Best exchg. format

ELL C?S

RB [symm]

C?S

RB [symm]

COO/EL

GR/MTX[symm]

Touch Count < 5

Edge-Centric

Page 22: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 22

1.0

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2.0

0 5 10 15 20 25 30

norm

alize

d fil

e-to

-file

exe

cutio

n tim

e w

.r.t.

best

dat

a st

ruct

ure

touch count

C?S (RB) ELL (RB) HYB_avg (RB)

Decision Tree – Third Layer (I)

Higher = slower

Page 23: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 23

1.0

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2.0

0 5 10 15 20 25 30

norm

alize

d fil

e-to

-file

exe

cutio

n tim

e w

.r.t.

best

dat

a st

ruct

ure

touch count

C?S (RB) ELL (RB) HYB_avg (RB)Higher = slower

No best single data structure.

Inconclusive! Performance also depends on input graphs.

Decision Tree – Third Layer (I)

Page 24: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 24

0

10

20

30

40

50

60

1.0

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2.0

2.1

2.2

2.3

2.4

2.5

2.6

perc

enta

ge o

f nod

es w

ith h

ighe

r deg

ree

than

the

aver

age

norm

alize

d fil

e-to

-file

exe

cutio

n tim

e w

.r.t.

best

dat

a st

ruct

ure

input graphs

ELL (RB) avg. ELL (RB) ELL (RB_sym) C?S (RB) avg. C?S (RB)C?S (RB_sym) HYB_avg (RB) avg. HYB_avg (RB) HYB_avg (RB_sym) excess

road

_BAY

road

_CAL

road

_COL

road

_CTR

road

_FLA

road

_USA

msdoor

pwtk

rgg_n

_2_2

4

amaz

on-08

coAuth

ors

dblp-10

hollywood-09

indochina-0

4

it-04

kron_g

500

ljourn

al-08

twitt

er-ret

weed

soc-o

rkut

Decision Tree – Third Layer (II)

Graphs are sorted in ascending order by their excess rate.

Page 25: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 25

0

10

20

30

40

50

60

1.0

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2.0

2.1

2.2

2.3

2.4

2.5

2.6

perc

enta

ge o

f nod

es w

ith h

ighe

r deg

ree

than

the

aver

age

norm

alize

d fil

e-to

-file

exe

cutio

n tim

e w

.r.t.

best

dat

a st

ruct

ure

input graphs

ELL (RB) avg. ELL (RB) ELL (RB_sym) C?S (RB) avg. C?S (RB)C?S (RB_sym) HYB_avg (RB) avg. HYB_avg (RB) HYB_avg (RB_sym) excess

road

_BAY

road

_CAL

road

_COL

road

_CTR

road

_FLA

road

_USA

msdoor

pwtk

rgg_n

_2_2

4

amaz

on-08

coAuth

ors

dblp-10

hollywood-09

indochina-0

4

it-04

kron_g

500

ljourn

al-08

twitt

er-ret

weed

soc-o

rkut

Decision Tree – Third Layer (II)

If ELL fits into memory it is always the fastest choice.

Shaded = ELL too large

Page 26: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 26

0

10

20

30

40

50

60

1.0

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2.0

2.1

2.2

2.3

2.4

2.5

2.6

perc

enta

ge o

f nod

es w

ith h

ighe

r deg

ree

than

the

aver

age

norm

alize

d fil

e-to

-file

exe

cutio

n tim

e w

.r.t.

best

dat

a st

ruct

ure

input graphs

ELL (RB) avg. ELL (RB) ELL (RB_sym) C?S (RB) avg. C?S (RB)C?S (RB_sym) HYB_avg (RB) avg. HYB_avg (RB) HYB_avg (RB_sym) excess

road

_BAY

road

_CAL

road

_COL

road

_CTR

road

_FLA

road

_USA

msdoor

pwtk

rgg_n

_2_2

4

amaz

on-08

coAuth

ors

dblp-10

hollywood-09

indochina-0

4

it-04

kron_g

500

ljourn

al-08

twitt

er-ret

weed

soc-o

rkut

Decision Tree – Third Layer (II)

HYBavg faster

Excess Rate < 20%

C?S faster

Shaded = ELL too large

Y N

Page 27: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 27

Decision Tree (file-to-file)

Excess Rate < 20%

N

N

Y

Y

Y

N

ELL HYBavg

RB [symm]

Best data structure

Best exchg. format

ELL C?S

RB [symm]

C?S

RB [symm]

COO/EL

GR/MTX[symm]

Touch Count < 5

Edge-Centric

How often touches the algorithm every node (on average).

Percentage of nodes that have an above-average degree.

Page 28: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 28

Decision Tree – Kernel only

N

Y

ELL HYBdstr

Best data structure

COO/EL

Edge-Centric n.a.

n.a.

n.a.

Page 29: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 29

0

10

20

30

40

50

60

1

1.2

1.4

1.6

1.8

2

2.2

2.4

2.6

2.8

input graphs

perc

enta

ge o

f no

des w

ith h

ighe

r deg

ree

than

the

aver

age

norm

alize

d ke

rnel

runt

ime

w.r.

t. be

st d

ata

stru

ctur

e

ELL avg. ELL

CRS avg. CRS

HYB_avg avg. HYB_avg

HYB_dstr avg. HYB_dstr

excess

road

_BAY

road

_CAL

road

_COL

road

_CTR

road

_FLA

road

_USA

msdoor

pwtk

rgg_n

_2_2

4

amaz

on-08

coAuth

ors

dblp-10

hollywood-09

indochina-0

4

it-04

kron_g

500

ljourn

al-08

twitt

er-ret

weed

soc-o

rkut

Kernel-only results

§ Without I/O, data structure construction, and CPU ßà GPU shipping.

ELL remains fastest graph structure if it fits into memory.

Shaded = ELL too large

Page 30: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 30

0

10

20

30

40

50

60

1

1.2

1.4

1.6

1.8

2

2.2

2.4

2.6

2.8

input graphs

perc

enta

ge o

f no

des w

ith h

ighe

r deg

ree

than

the

aver

age

norm

alize

d ke

rnel

runt

ime

w.r.

t. be

st d

ata

stru

ctur

e

ELL avg. ELL

CRS avg. CRS

HYB_avg avg. HYB_avg

HYB_dstr avg. HYB_dstr

excess

road

_BAY

road

_CAL

road

_COL

road

_CTR

road

_FLA

road

_USA

msdoor

pwtk

rgg_n

_2_2

4

amaz

on-08

coAuth

ors

dblp-10

hollywood-09

indochina-0

4

it-04

kron_g

500

ljourn

al-08

twitt

er-ret

weed

soc-o

rkut

Kernel-only results

§ Without I/O, data structure construction, and CPU ßà GPU shipping.

Otherwise HYBdstr is the best choice.

Shaded = ELL too large

Page 31: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 31

Conclusion

§ There is no single best graph data structure for GPUs.§ Three decision layers:

□ Edge-centric vs. node-centric□ Touch count threshold around 5□ Excess rate threshold around 20%

§ Our Decision Tree can help developers to pick an adequate data structure for their use case.

§ Choosing the adequate data structure canspeed up graph algorithms by up to 45%.

Page 32: Which Graph Representation to Select for Static Graph ...

Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 32

Conclusion

§ There is no single best graph data structure for GPUs.§ Three decision layers:

□ Edge-centric vs. node-centric□ Touch count threshold around 5□ Excess rate threshold around 20%

§ Our Decision Tree can help developers to pick an adequate data structure for their use case.

§ Choosing the adequate data structure canspeed up graph algorithms by up to 45%. Thank you for your attention!Any questions?