
Graph Partitioning using Bayesian Inference on GPU

Carl Yang, Steven Dalton, Maxim Naumov, Michael Garland, Aydın Buluç, John D. Owens

UC Davis, NVIDIA intern

ctcyang@ucdavis.edu

March 26, 2018


Overview

1. Introduction
2. Stochastic Block Model
3. Bayesian inference for graph partitioning
4. Parallelization strategy
5. Experiments

Problem: How can we break this graph up into smaller pieces so we can understand it?

Problem definition

Problem 1: Can MCMC be sped up by using a GPU?

Problem 2: How is convergence affected?

Problem 3: Is MCMC a scalable solution to the graph clustering problem?


Related work

Minimum-cut method

Hierarchical clustering

Girvan–Newman algorithm

Modularity maximization

Clique-based methods


Generative models

Idea

Before thinking of how to partition, we should come up with a model that generates what we are looking for.

Want:

The parameters should describe block structure in a graph.

The parameter values are unknown, but can be inferred from the data and the current state in a principled, statistical way.

Stochastic Block Model (SBM)

Holland, Laskey, and Leinhardt. "Stochastic blockmodels: First steps." Social Networks 5.2 (1983).

Parameters: η_i → probability a node belongs to block i; M_rs → probability an edge exists between block r and block s.

Rules for placing N nodes in B blocks:

1. Sample b_i ∼ Cat(η) to obtain each node's colour.
2. Sample e_ij ∼ Poisson(M) to determine which two blocks r and s the edge connects.
3. Sample i ∼ Uniform(n_r) and j ∼ Uniform(n_s) to get two nodes in blocks r and s, respectively, for edge e_ij.

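To make the generative process concrete, here is a minimal NumPy sketch of the three sampling rules (our illustration, not the authors' implementation; it takes the common reading that the edge count between each block pair is Poisson with rate M_rs, and the function name sample_sbm is ours):

import numpy as np

def sample_sbm(N, eta, M, rng=None):
    """Sample a directed multigraph from the three SBM rules above.
    eta: length-B block membership probabilities; M: B x B matrix of
    expected interblock edge counts (treated as Poisson rates here)."""
    rng = np.random.default_rng() if rng is None else rng
    B = len(eta)
    b = rng.choice(B, size=N, p=eta)               # rule 1: colours ~ Cat(eta)
    edges = []
    for r in range(B):
        for s in range(B):
            nodes_r = np.flatnonzero(b == r)
            nodes_s = np.flatnonzero(b == s)
            if nodes_r.size == 0 or nodes_s.size == 0:
                continue
            for _ in range(rng.poisson(M[r, s])):  # rule 2: e_rs ~ Poisson(M_rs)
                i = rng.choice(nodes_r)            # rule 3: endpoints uniform
                j = rng.choice(nodes_s)            #         within each block
                edges.append((int(i), int(j)))
    return b, edges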

Formulate clustering as an exact recovery problem

1. Given G and b^(t), find M^(t).
2. Given G and M^(t), find arg max_b P(b | G, M). This becomes b^(t+1).



Bayesian inference

We want to find the partition b that maximizes:

$$P(b \mid G, M) = \frac{P(G \mid b, M)\,P(b, M)}{P(G)}$$

Taking negative logs of both sides, we want to minimize Σ:

$$\Sigma = \underbrace{-\log P(G \mid b, M)}_{S} \;\underbrace{-\;\log P(b, M)}_{L} \;+\; \underbrace{\log P(G)}_{\text{constant}}$$

S is the amount of information required to describe the graph when the model is known. L is the amount of information required to describe the model.

Computing terms

S can be found by counting the number of configurations of the graph. The fewer configurations, the better our model fits the graph:

$$S = \log\frac{1}{\Omega} = \log\left(\frac{\prod_{rs} M_{rs}!}{\prod_r k_r^{+}!\,\prod_r k_r^{-}!}\right)^{-1}$$

L can be found by counting (double parentheses denote multiset coefficients):

$$L = \underbrace{\log\left(\!\!\binom{B}{N}\!\!\right) + \log N! - \sum_r \log n_r!}_{b\ \text{term}} \;+\; \underbrace{\log\left(\!\!\binom{B^{2}}{E}\!\!\right)}_{M\ \text{term}}$$

Design decision: Ignore L for now in the prototype, but leave room for it to be added in the future.

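As a sanity check on the S term, a small NumPy/SciPy sketch (ours, not the authors' kernel) can evaluate the formula above with log-factorials; it assumes k_r^+ and k_r^- are the row and column sums of the interblock edge-count matrix M, i.e. the block out- and in-degrees:

import numpy as np
from scipy.special import gammaln

def log_fact(x):
    # log(x!) computed stably via the gamma function
    return gammaln(np.asarray(x, dtype=np.float64) + 1.0)

def entropy_S(M):
    """Evaluate S = -log( prod_rs M_rs! / (prod_r k_r^+! prod_r k_r^-!) )
    for a B x B interblock edge-count matrix M (assumption: k_r^+ / k_r^-
    are its row/column sums)."""
    k_out = M.sum(axis=1)
    k_in = M.sum(axis=0)
    return -(log_fact(M).sum() - log_fact(k_out).sum() - log_fact(k_in).sum())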

Intuition

[Two figure-only slides; the illustrations were not preserved in the transcript.]

Combinatorial optimization problem

So we want to find the partition b such that Σ is minimized.

However, for a graph of B blocks and N nodes, there are B^N possible partitions b for which we would need to compute that quantity.

We need an efficient way to traverse this large state space.

MCMC sampling

1. Propose a move.
2. Calculate the move acceptance probability.
3. Commit the move.

Upside: the stationary distribution will converge to the probability distribution we are trying to find.

Merge phase

[Figure-only slides illustrating a block merge move; images not preserved in the transcript.]

Nodal (MCMC) phase

[Figure-only slides illustrating a nodal move; images not preserved in the transcript.]

MCMC sampling applied to solve graph partitioning

Merge phase:
1. Propose a move.
2. Calculate the change in the objective function.
3. Get the block move that improves the objective function the most.
4. Commit the move.
5. Go to 1) until n_blocks_initial / r blocks are left.

MCMC phase:
1. Propose a move.
2. Calculate the change in the objective function.
3. Calculate the move acceptance probability.
4. Commit the move.
5. Go to 1) until the MCMC chain has converged.

Do Merge phase, MCMC phase, Merge phase, MCMC phase, etc., until the target cluster count has been reached (sketched below).

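A hedged Python sketch of that outer alternation (our own scaffolding, not the authors' code; merge_phase and mcmc_phase are supplied callables standing in for the two phases above, and the shrink factor r is a parameter):

def alternate_phases(b, n_blocks_initial, n_blocks_target,
                     merge_phase, mcmc_phase, r=2):
    """Outer driver: alternate greedy block merges and nodal MCMC sweeps
    until the target block count is reached (sketch of the loop only).
    merge_phase(b, n_blocks) merges down to n_blocks blocks;
    mcmc_phase(b) runs nodal moves until the chain converges (assumptions)."""
    n_blocks = n_blocks_initial
    while n_blocks > n_blocks_target:
        n_blocks = max(n_blocks // r, n_blocks_target)
        b = merge_phase(b, n_blocks)  # merge until n_blocks blocks remain
        b = mcmc_phase(b)             # nodal moves until convergence
    return b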

1. Propose move

A counter-based RNG allows O(1) skip-ahead for each thread.

This allows independent random numbers to be generated within a device function.
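NumPy ships a counter-based generator (Philox) that illustrates the skip-ahead property on the host side; a small sketch of the idea (ours, not the authors' CUDA code, which would use an equivalent device-side generator):

import numpy as np

# Philox is a counter-based RNG: advancing the counter is O(1), so each
# logical thread can jump straight to its own disjoint substream.
base = np.random.Philox(seed=42)

def thread_rng(thread_id):
    # jumped(n) returns a bit generator advanced by n * 2**128 draws in O(1)
    return np.random.Generator(base.jumped(thread_id))

# Each "thread" draws independent proposals without any coordination.
proposals = [int(thread_rng(t).integers(0, 10)) for t in range(4)]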

2. Calculate objective function

Problem: How do we compute the objective function as if we have already made the move, but without actually changing our graph?

Key insight: A merge move and a node move can both be expressed as the simultaneous element-wise addition of rows and columns of a matrix.

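A hedged NumPy sketch of that insight (ours): evaluate a proposed node move on the B x B interblock matrix by adding the node's edge-count contributions to the destination row/column and subtracting them from the source, without touching the graph. The name propose_move and the inputs d_out/d_in (the node's edge counts into each block) are our assumptions:

import numpy as np

def propose_move(M, r, s, d_out, d_in):
    """Return a *proposed* interblock matrix for moving one node from
    block r to block s, leaving M itself unchanged.
    d_out[t]: node's out-edges into block t; d_in[t]: in-edges from block t."""
    Mp = M.astype(np.int64, copy=True)
    Mp[r, :] -= d_out; Mp[s, :] += d_out   # move the out-edge row contribution
    Mp[:, r] -= d_in;  Mp[:, s] += d_in    # move the in-edge column contribution
    # Note: edges among blocks r and s themselves need the usual diagonal
    # correction, omitted here for brevity (assumption: caller handles it).
    return Mp

The change in the objective for the proposal is then entropy_S(Mp) - entropy_S(M), using the S sketch shown earlier.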

We have a graph

How to express in matrix notation node 1 being moved from blue to yellow?

Elementwise, move node 1's out-edge contribution from blue to yellow, then move node 1's in-edge contribution from blue to yellow. Move complete.

[Figure-only slide sequence; the step-by-step matrix illustrations were not preserved in the transcript.]

2. Calculate objective function

For sparse matrices, elementwise addition is equivalent to doing a set union.

A warp-wide sorting network allows us to do set unions using register memory.
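On the host side, the same set-union view of sparse row addition can be sketched in a few lines (our illustration; the GPU version would run the sorted merge in registers across a warp rather than calling NumPy):

import numpy as np

def add_sparse_rows(idx_a, val_a, idx_b, val_b):
    """Elementwise addition of two sparse rows as a set union:
    concatenate, sort by column index (the role of the warp-wide
    sorting network), then sum duplicate indices."""
    idx = np.concatenate([idx_a, idx_b])
    val = np.concatenate([val_a, val_b])
    order = np.argsort(idx, kind="stable")
    idx, val = idx[order], val[order]
    uniq, start = np.unique(idx, return_index=True)  # segment starts
    return uniq, np.add.reduceat(val, start)         # per-index sums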

3. Commit move

A triple matrix product is used to update the model between the Merge and MCMC phases.

Hypothesis 1: Committing merge moves in parallel does not affect the convergence rate.

Hypothesis 2: Committing MCMC moves in parallel does not affect the convergence rate.
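One plausible reading of that triple product (our assumption; the slide does not spell it out): with P the N x B one-hot block-assignment matrix, the interblock edge-count matrix is M = Pᵀ A P. A SciPy sketch:

import numpy as np
import scipy.sparse as sp

def rebuild_model(A, b, B):
    """Recompute the B x B interblock matrix M = P^T A P, where P is the
    one-hot N x B assignment matrix built from the block vector b and
    A is the sparse N x N adjacency matrix."""
    N = len(b)
    P = sp.csr_matrix((np.ones(N), (np.arange(N), b)), shape=(N, B))
    return (P.T @ A @ P).toarray()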

Parallelization summary

                        Reference impl.        Our contribution
                        CPU Seq    CPU Par     GPU Seq    GPU Par
Merge   Propose move    par        par         par        par
        Calculate obj   par        par         par        par
        Commit move     seq        seq         seq        par
MCMC    Propose move    seq        par         par        par
        Calculate obj   seq        par         par        par
        Commit move     seq        seq         par        par


Experimental Setup

Hardware:

CPU: Intel Core i7-5820K @ 3.30 GHz, 32 GB RAM
GPU: Titan Xp, 12 GB RAM

Datasets: synthetic datasets with ground-truth partitions for each node.

Nodes   50    100   1K    5K     20K    50K   500K
Edges   319   6K    20K   102K   409K   1M    10M

Speedup comparison

[Figure: Speedup comparison across the four implementations (CPU Seq, CPU Par, GPU Seq, GPU Par), plotted against number of nodes from 50 to 500000.]

Runtime breakdown

[Figure: Runtime breakdown (Build, Merge, MCMC) for the four implementations (CPU Seq, CPU Par, GPU Seq, GPU Par) on graphs of 50 to 50000 nodes.]

Rate of convergence

[Figure: Change in objective function plotted against number of moves, for GPU, CPU Seq, and CPU Par.]

Rate of convergence (in runtime)

[Figure: Change in objective function plotted against runtime in seconds, for GPU, CPU Seq, and CPU Par.]

Raw runtime numbers and accuracy

                CPU Seq            CPU Par            GPU Seq            GPU Par
Nodes     Time (s)  Acc (%)   Time (s)  Acc (%)   Time (s)  Acc (%)   Time (s)  Acc (%)
50           0.519      100      0.519      100     0.0876      100     0.0603      100
100          0.802      100      0.531       82     0.2249      100     0.1779      100
1000         5.193    81.41      0.939      100     3.153       100     1.5649      100
5000        16.443       90      2.255     81.7    27.093    92.943     3.113      87.6
20000      118.201     94.6     29.97     93.93    51.519      96.5     7.671      88.5
50000      272.249     89.8     97.68     87.15  2902.4        97.6    23.707      89.2

Takeaways

It is surprisingly easy to make MCMC converge.

However, it’s a different story to make MCMC scalable.


Future work

Use a specialized triple matrix product kernel to take advantage of knowledge about the matrix structure.

Use load-balancing methods such as TWC to handle unbalanced data.

Try newer Bayesian inference methods such as minibatch MCMC and ADVI (automatic differentiation variational inference) that claim to scale better with data size than standard MCMC.

Add multi-GPU support.

Questions?


Stochastic Block Model (SBM)

Holland, Laskey, and Leinhardt. "Stochastic blockmodels: First steps." Social Networks 5.2 (1983).

Given N nodes in B blocks:

State: b_i → block that node i belongs to

Parameters: η_i → probability a node belongs in block i; λ_rs → probability an edge exists between block r and block s

1. Sample each node i.i.d. over η_i to obtain each node's colour.
2. Sample each edge i.i.d. over Poi(λ_rs) to obtain the blocks r and s it connects. For each edge, sample one node in block r with probability 1/n_r and one node in block s with probability 1/n_s to determine which two nodes the edge connects.

Stochastic Block Model (SBM)

The probability of generating a graph G and partition b given parameters η, λ, assuming a Bernoulli edge distribution, is:

$$P(G \mid b, M) = \prod_i \eta_{b_i} \prod_{i<j} \lambda_{b_i b_j}^{A_{ij}} \left(1 - \lambda_{b_i b_j}\right)^{1 - A_{ij}}$$
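A direct log-space transcription of that likelihood (our sketch; it assumes a dense 0/1 adjacency matrix A, an integer block vector b, and parameter arrays eta and lam):

import numpy as np

def log_likelihood(A, b, eta, lam):
    """Log of the Bernoulli-SBM expression above: sum of log eta_{b_i}
    plus, over pairs i < j, log lambda or log(1 - lambda)."""
    N = A.shape[0]
    Lrs = lam[np.ix_(b, b)]              # lambda_{b_i b_j} for every pair
    iu = np.triu_indices(N, k=1)         # keep i < j pairs only
    edge_terms = np.where(A[iu] == 1,
                          np.log(Lrs[iu]),
                          np.log1p(-Lrs[iu]))
    return np.log(eta[b]).sum() + edge_terms.sum()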

Variant of SBM we will use

Non-parametric: use a Bayesian formulation instead of maximum likelihood. This solves the over-fitting problem.

Degree-corrected: add additional parameters k_i for every node i representing its propensity for high degree. This accounts for the power-law degree distribution that many real-world graphs exhibit.

Expression

Taking negative logs of both sides:

$$-\log P(b \mid G, M) = \underbrace{-\log P(G \mid b, M)}_{S} \;\underbrace{-\;\log P(b, M)}_{L} \;+\; \underbrace{\log P(G)}_{\text{constant}}$$

Sequential MCMC for graph partitioning

Input: b: N × 1 current block assignment vector; M: B × B interblock edge-count matrix; A: N × N adjacency matrix.

1: procedure MCMCSequential(b, M, A)
2:   for each node i do
3:     Propose a random move for i: block r → s
4:     Acceptance probability:
5:       $p_{\text{accept}} = \min\left[\exp(-\beta\,\Delta S)\,\frac{p_{s \to r}}{p_{r \to s}},\; 1\right]$
6:     Perform the move by updating b, M
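A minimal host-side sketch of one sweep of this procedure (ours, not the authors' kernel; delta_S and proposal_prob are supplied callables standing in for the ΔS computation and the proposal probabilities p_{r→s} above):

import numpy as np

def mcmc_sweep(b, B, delta_S, proposal_prob, beta=1.0, rng=None):
    """One variable-at-a-time sweep: for each node, propose a block move
    r -> s and accept with min(exp(-beta * dS) * p(s->r)/p(r->s), 1).
    delta_S(i, r, s) and proposal_prob(r, s) are assumed helpers."""
    rng = np.random.default_rng() if rng is None else rng
    for i in range(len(b)):
        r = int(b[i])
        s = int(rng.integers(B))          # propose a random destination block
        if s == r:
            continue
        dS = delta_S(i, r, s)
        hastings = proposal_prob(s, r) / proposal_prob(r, s)
        if rng.random() < min(np.exp(-beta * dS) * hastings, 1.0):
            b[i] = s                      # commit the move
    return b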


Generative models

Idea

Before thinking of how to partition, we should come up with a model of what we are looking for.

The parameters should describe block structure.

The parameter values are unknown, but can be inferred from the data and the current state in a principled, statistical way.

Generative models: Sketch of algorithm

Given data G and an initial guess of the partition b^(0), we can compute M^(1) and b^(1):

1. Compute model parameters M^(1) using G and b^(0).
2. Make a better guess for partition b^(1) using Bayesian inference:

$$\arg\max_b P(b \mid G, M) = \arg\max_b \frac{P(G \mid b, M)\,P(b, M)}{P(G)}$$


Variable-at-a-time Metropolis-Hastings

Algorithm 1: Sequential MCMC.

Input: b^0: N × 1 state vector initialized randomly
Output: b^T: N × 1 vector distributed according to the stationary distribution

1: for iteration t = 1, 2, ... do
2:   for node i = 1, 2, ..., N do
3:     Propose: $b_i^{\text{cand}} \sim q(b_i^t \mid b^{t-1})$
4:     Acceptance probability:
       $$\alpha = \min\left(\frac{q(b_i^{t-1} \mid b_i^{\text{cand}})\,\pi(b_i^{\text{cand}})}{q(b_i^{\text{cand}} \mid b_i^{t-1})\,\pi(b_i^{t-1})},\; 1\right)$$
5:     u ∼ Uniform(0, 1)
6:     if u < α then
7:       Accept proposal: $b_i^t \leftarrow b_i^{\text{cand}}$
8:     else
9:       Reject proposal: $b_i^t \leftarrow b_i^{t-1}$

Where SBM fits into machine learning

Hidden Markov Model

Latent Variable Model

Variational auto-encoders
