Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all...

36
Fast Triangle Counting on the GPU Oded Green

Transcript of Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all...

Page 1: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Fast Triangle Counting

on the GPU

Oded Green

Page 2: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Triangle Counting

Building block for clustering coefficients

Defined by [Watts & Strogatz; 1998]

Used to state how tightly bound vertices are in a network

Used for the analysis many types of networks:

Communication

Social

Biological

[email protected], GTC, 2015 Twitter social network using Large

Graph Layout

Page 3: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Relevant network properties

Sparse-network

System utilization

Power-law distribution

[Faloutsos & Faloutsos & Faloutsos; 1999]

[Broder et al.; 2000]

Load balancing issues

[email protected], GTC, 2015

Page 4: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Computational Approaches

Enumerating over all node-triples - 𝑂(𝑉3).

Using matrix multiplication - 𝑂(𝑉𝑤), 𝑤 ≤ 2.376.

Adjacency list intersection - 𝑂 𝑉 ⋅ 𝑑𝑚𝑎𝑥2 ,

where 𝑑 𝑚𝑎𝑥 is the vertex with largest

adjacency.

[email protected], GTC, 2015

Page 5: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

For given e = 𝑢, 𝑣 ∈ 𝐸:

Take:

u’s list

v’s list

Intersection and find common neighbors

Then intersect (𝑢, 𝑥) and (𝑢, 𝑦)

Adjacency List Intersection

x

v u

y

[email protected], GTC, 2015

Page 6: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

GPU Challenges: List Intersection

Partitioning / Load balancing

Must be computationally cheap

Must be parallel (high utilization)

Minimal synchronization/communication

Scalable

Simple

[email protected], GTC, 2015

Page 7: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

The only other GPU triangle counting algorithm

Uses the GPU like a CPU

One CUDA thread per

intersection

Load balancing issues

Low utilization

Limited scalability

[Heist et al. ;2012]

[email protected], GTC, 2015

Page 8: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Merge-Path and GPU Triangle Counting

[email protected], GTC, 2015

Page 9: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Merge-Path

Visual approach for merging

Highly scalable

Load-balanced

Two legal moves

Right

Down

Reminder: Merging and intersecting are similar

[email protected], GTC, 2015 Merge Path : [Odeh et al. ; 2012]

GPU Merge Path : [Green et al. ;2012]

B[1] B[2] B[3] B[4] B[5] B[6] B[7] B[8]

1 2 4 5 8 9 12 13

A[1] 4

A[2] 6

A[3] 7

A[4] 11

A[5] 12

A[6] 14

A[7] 15

A[8] 16

Page 10: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Partitioning

Find the intersection of the

cross-diagonal and the

path

Uses binary search

[email protected], GTC, 2015

B[1] B[2] B[3] B[4] B[5] B[6] B[7] B[8]

1 2 4 5 8 9 12 13

A[1] 4

A[2] 6

A[3] 7

A[4] 11

A[5] 12

A[6] 14

A[7] 15

A[8] 16

Page 11: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Merge-Path Phases

Partitioning

Log(n)

Synchronization

Merging

Synchronization

[email protected], GTC, 2015

Page 12: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Intersect-Path

Building block for triangle counting and clustering coefficients

Modified Merge-Path

Right

Down

Right-down – when values are equal

[email protected], GTC, 2015

B[1] B[2] B[3] B[4] B[5] B[6] B[7] B[8]

1 2 4 5 8 9 12 13

A[1] 4

A[2] 6

A[3] 7

A[4] 11

A[5] 12

A[6] 14

A[7] 15

A[8] 16

Page 13: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

B[1] B[2] B[3] B[4] B[5] B[6] B[7] B[8]

1 2 4 5 8 9 12 13

A[1] 4

A[2] 6

A[3] 7

A[4] 11

A[5] 12

A[6] 14

A[7] 15

A[8] 16

Differences

Typically unique values in each set

Two possible start locations when cross-diagonal goes through “equal-path”.

Requires:

Checking the initial condition

Avoiding double counting

Shared-memory and synchronization

[email protected], GTC, 2015

Page 14: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Triangle-Counting

Split vertices to thread blocks

|V|/Thread_Blocks

Iteratively (over the vertices)

Partition

Log(n)

Warp Synchronization

Multiple Intersection

Thread-block Synchronization

Vertices

Thread-block

Warp 0 Warp 1

[email protected], GTC, 2015

Page 15: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Algorithm Control Parameters

𝐼: # intersection in a thread block

Small number : under utilization

Large number : not useful for sparse networks

𝑇ℎ: # threads per intersection

Small number : load-balancing, under utilization

Large number : adds overhead compared to the intersection

𝐼 ⋅ 𝑇ℎ ≥ 𝑤𝑎𝑟𝑝 𝑠𝑖𝑧𝑒

[email protected], GTC, 2015

Page 16: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Performance Analysis

[email protected], GTC, 2015

Page 17: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Experiments

Graphs taken from 10th DIMACS Challenge

GPU

NVIDIA K40

12 GB

14 MPs x 192 SPs (per MP) = 2880 CUDA threads

[email protected], GTC, 2015

Page 18: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Experiments

CPU - OpenMP based [Green et al. ;2014]

4x10 Intel Xeon E7-8870 processers

256 GB

30 MB L3 cache per processor

[email protected], GTC, 2015

Page 19: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Implementation comparison

GPU algorithm as discussed here

CPU algorithms

Threads are independent

Partitioning stage not needed

[email protected], GTC, 2015

Page 20: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

GPU Vs. Sequential CPU

[email protected], GTC, 2015

0

5

10

15

20

25

30

35

40

Spe

edu

p

GPU IntersectPath Sequential (Baseline)

Page 21: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

GPU Vs. CPU – OpenMP Straightforward

OpenMP results from[Green et al. ;2014] [email protected], GTC, 2015

0

5

10

15

20

25

30

35

40

Spe

edu

p

Naïve OMP GPU IntersectPath Sequential (Baseline) Maximal for 40 cores

Page 22: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

GPU Vs. CPU OpenMP Optimized I

[email protected], GTC, 2015 [Green et al. ;2014]

0

5

10

15

20

25

30

35

40

Spe

edu

p

Optimized OMP I GPU IntersectPath Sequential (Baseline) Maximal for 40 cores

Page 23: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

[email protected], GTC, 2015

Page 24: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

GPU Vs. CPU OpenMP Optimized II

[email protected], GTC, 2015 [Green et al. ;2014]

0

5

10

15

20

25

30

35

40

Spe

edu

p

Optimized OMP II GPU IntersectPath Sequential (Baseline) Maximal for 40 cores

Page 25: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

GPU – Parameter control -prefAttachment

[email protected], GTC, 2015

Single thread per intersection

X-Y-Z configuration:

• X: threads per

block

• Y: thread per

intersection

• Z: number of

intersection per

thread block

Page 26: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

[email protected], GTC, 2015

Single Intersection per warp

GPU – Parameter control -prefAttachment

X-Y-Z configuration:

• X: threads per

block

• Y: thread per

intersection

• Z: number of

intersection per

thread block

Page 27: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

These parameters count

3𝑋 − 7𝑋 difference between slowest and

fastest

Offer better system utilization

Offer better load-balancing

[email protected], GTC, 2015

Page 28: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Final thoughts

Intersect-Path

Modified Merge-Path

Performance

9X-32X faster than a sequential

[email protected], GTC, 2015

Page 29: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Additional Challenges (Future work)

Fine-tuning the GPU parameters based on

graph properties.

More load-balancing across the MPs.

Create the capability to analyze large

graphs on the GPU

Limited memory size

[email protected], GTC, 2015

Page 30: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Thank you!

Join our Open-Source community

https://github.com/arrayfire/arrayfire

[email protected], GTC, 2015

Page 31: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Backup slides

[email protected], GTC, 2015

Page 32: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Examples – Clustering Coefficient

𝐶𝐶 = 𝐶𝐶𝑣 =

𝑣

𝑡𝑟𝑖(𝑣)

deg 𝑣 ⋅ deg 𝑣 − 1𝑣

[email protected], GTC, 2015

Page 33: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Immense volume of data

Facebook: >1B users

average 130 friends

30B pieces of content shared / month

Twitter: 200M active users

400M tweets / day

Goal: Use the interaction

network to understand

and characterize

information flow

Motivation: Social Media

Sources: Facebook, Twitter

[email protected], GTC, 2015

Page 34: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

GPU Vs. CPU

[email protected], GTC, 2015

0

5

10

15

20

25

30

35

40

Spee

du

p

Naïve OMP Optimized OMP I Optimized OMP II

GPU IntersectPath Sequential (Baseline) Maximal for 40 cores

Page 35: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

Speedup/Dollar

[email protected], GTC, 2015

0

0.001

0.002

0.003

0.004

0.005

0.006

0.007

0.008

Spee

du

p/d

olla

r

Naïve OMP Optimized OMP I Optimized OMP II GPU IntersectPath

Assuming

CPU : $20K

GPU : $5K

Page 36: Fast Triangle Counting on the GPU€¦ · Computational Approaches Enumerating over all node-triples - 𝑂(𝑉3). Using matrix multiplication - 𝑂(𝑉 ), ≤ t. u y x. Adjacency

GPU – Parameter control

[email protected], GTC, 2015

Growing thread block size