Distributed Parallel Inference on Large Factor Graphs
Transcript of slides

Slide 1
Carnegie Mellon
Distributed Parallel Inference on Large Factor Graphs
Joseph E. Gonzalez, Yucheng Low, Carlos Guestrin, David O'Hallaron
Slide 2: Exponential Parallelism
[Figure: processor speed (GHz, 0.01–10 log scale) vs. release date (1988–2010): exponentially increasing sequential performance gives way to constant sequential performance, while parallel performance increases exponentially.]
Slide 3: Distributed Parallel Setting
Opportunities: access to larger systems (8 CPUs → 1000 CPUs); linear increase in RAM, cache capacity, and memory bandwidth.
Challenges: distributed state, communication, and load balancing.
[Figure: nodes (CPU, bus, memory, cache) connected by a fast reliable network.]
Slide 4: Graphical Models and Parallelism
Graphical models provide a common language for general purpose parallel algorithms in machine learning
A parallel inference algorithm would improve:
Protein Structure Prediction
Movie Recommendation
Computer Vision
Inference is a key step in Learning Graphical Models
Slide 5: Belief Propagation (BP)
A message-passing algorithm.
Naturally Parallel Algorithm
Slide 6: Parallel Synchronous BP
Given the old messages, all new messages can be computed in parallel:
[Figure: CPUs 1 through n each compute new messages from the old messages.]
Map-Reduce Ready!
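The synchronous update above can be sketched as one round over the directed edges of a pairwise model. This is a minimal illustrative sketch, not the authors' implementation; the dictionary-based message store and the names `unary` and `pairwise` are assumptions:

```python
import numpy as np

def synchronous_bp_iteration(old_msgs, unary, pairwise, edges):
    """One round of synchronous sum-product BP on a pairwise model.

    old_msgs[(u, v)] is the current message from u to v (a numpy vector);
    unary[u] is u's unary potential; pairwise[(u, v)] is the potential
    matrix with rows indexed by u's states and columns by v's states.
    Every new message depends only on the old ones, so each directed edge
    could be handled by a different CPU (or one map-reduce round).
    """
    incoming = {}
    for (u, v) in edges:
        incoming.setdefault(v, []).append(u)

    new_msgs = {}
    for (u, v) in edges:
        prod = unary[u].copy()
        for w in incoming.get(u, []):
            if w != v:                       # exclude the message from v itself
                prod = prod * old_msgs[(w, u)]
        msg = pairwise[(u, v)].T @ prod      # sum over u's states
        new_msgs[(u, v)] = msg / msg.sum()   # normalize for numerical stability
    return new_msgs
```

Because `new_msgs` is built entirely from `old_msgs`, the loop body is embarrassingly parallel, which is exactly what makes this schedule "map-reduce ready."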
Slide 7: Hidden Sequential Structure
Slide 8: Hidden Sequential Structure (continued)
Slide 9: Hidden Sequential Structure (continued)
[Figure: a chain with evidence at both ends.]
Running time = (time for a single parallel iteration) × (number of iterations).
Slide 10: Optimal Sequential Algorithm
[Figure: running time on a chain of length n. Forward-backward (optimal sequential, p = 1): 2n. Naturally parallel synchronous BP: 2n²/p for p ≤ 2n. Optimal parallel (p = 2): n. A gap separates the naturally parallel schedule from the optimal one.]
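The forward-backward schedule can be written as two sweeps over the chain, costing 2n message computations in total. A hedged sketch assuming discrete unary potential vectors and pairwise potential matrices (all names illustrative):

```python
import numpy as np

def forward_backward(unary, pairwise):
    """Optimal sequential schedule on a chain: one forward sweep and one
    backward sweep (2n message computations) yield every marginal.

    unary: list of n potential vectors; pairwise: list of n-1 matrices,
    pairwise[i][a, b] scoring (x_i = a, x_{i+1} = b).
    """
    n = len(unary)
    fwd = [None] * n   # fwd[i]: message passed into i from the left
    bwd = [None] * n   # bwd[i]: message passed into i from the right
    fwd[0] = np.ones_like(unary[0])
    for i in range(1, n):                          # forward pass
        m = pairwise[i - 1].T @ (unary[i - 1] * fwd[i - 1])
        fwd[i] = m / m.sum()
    bwd[n - 1] = np.ones_like(unary[n - 1])
    for i in range(n - 2, -1, -1):                 # backward pass
        m = pairwise[i] @ (unary[i + 1] * bwd[i + 1])
        bwd[i] = m / m.sum()
    beliefs = []
    for i in range(n):                             # combine both sweeps
        b = unary[i] * fwd[i] * bwd[i]
        beliefs.append(b / b.sum())
    return beliefs
```

Each sweep is inherently sequential, which is why the naive two-processor split (one sweep per processor) is already the optimal parallel schedule for an exact chain.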
Slide 11: Parallelism by Approximation
τε represents the minimal sequential structure.
[Figure: true messages vs. a τε-approximation on a 10-vertex chain.]
Slide 12: Optimal Parallel Scheduling
In [AIStats 09] we demonstrated that this algorithm is optimal.
[Figure: synchronous schedule vs. optimal schedule across processors 1–3, showing a parallel component, a sequential component, and the gap between them.]
Slide 13: The Splash Operation
Generalize the optimal chain algorithm to arbitrary cyclic graphs:
1) Grow a BFS Spanning tree with fixed size
2) Forward Pass computing all messages at each vertex
3) Backward Pass computing all messages at each vertex
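The three steps above can be sketched as follows. This is a minimal illustration, not the authors' code; the adjacency-list graph and the `send_messages` callback (which recomputes all messages at a vertex) are assumptions:

```python
from collections import deque

def splash(root, adjacency, max_size, send_messages):
    """One Splash: grow a bounded BFS tree from `root`, then do a forward
    pass (leaves toward root) and a backward pass (root toward leaves),
    recomputing all messages at each visited vertex.
    """
    # 1) Grow a BFS spanning tree with at most max_size vertices.
    order, seen = [], {root}
    queue = deque([root])
    while queue and len(order) < max_size:
        v = queue.popleft()
        order.append(v)
        for u in adjacency[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    # 2) Forward pass: update vertices from the leaves toward the root.
    for v in reversed(order):
        send_messages(v)
    # 3) Backward pass: update vertices from the root back outward.
    for v in order:
        send_messages(v)
    return order
```

On a chain, a Splash rooted at the middle reproduces exactly the forward-backward sweeps of the optimal sequential algorithm, which is the sense in which it generalizes that schedule.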
Slide 14: Running Parallel Splashes
Partition the graph, schedule Splashes locally, and transmit the messages along the boundary of the partition.
[Figure: three CPUs, each with local state, running Splashes on its own partition.]
Key challenges: (1) How do we schedule Splashes? (2) How do we partition the graph?
Slide 15: Where do we Splash?
Assign priorities and use a scheduling queue to select roots.
[Figure: a CPU with local state and a scheduling queue selecting the next Splash root.]
How do we assign priorities?
Slide 16: Message Scheduling
Residual Belief Propagation [Elidan et al., UAI 06]: assign priorities based on the change in inbound messages.
[Figure: two vertices, one receiving a large inbound message change, the other a small one.]
Small change: an expensive no-op. Large change: an informative update.
Slide 17: Problem with Message Scheduling
Small changes in messages do not imply small changes in belief:
[Figure: a small change in all inbound messages produces a large change in the belief.]
Slide 18: Problem with Message Scheduling (continued)
Large changes in a single message do not imply large changes in belief:
[Figure: a large change in a single message produces only a small change in the belief.]
Slide 19: Belief Residual Scheduling
Assign priorities based on the cumulative change in belief: the belief residual rv accumulates the change contributed by each new inbound message since the vertex was last updated.
A vertex whose belief has changed substantially since last being updated will likely produce informative new messages.
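One way to realize this scheduler is a max-priority queue with lazy deletion. The sketch below is illustrative only: the class name is invented, and using the L1 message change as the residual increment is an assumption, not necessarily the metric used in the paper:

```python
import heapq
import numpy as np

class BeliefResidualScheduler:
    """Priority queue keyed by belief residual: each inbound message change
    adds to the receiving vertex's residual r_v, and r_v resets to zero
    when the vertex is chosen as a Splash root (i.e., updated).
    """
    def __init__(self):
        self.residual = {}
        self.heap = []

    def message_changed(self, v, old_msg, new_msg):
        # Accumulate the change contributed by this message update.
        delta = float(np.abs(new_msg - old_msg).sum())      # L1 change (assumed metric)
        self.residual[v] = self.residual.get(v, 0.0) + delta
        heapq.heappush(self.heap, (-self.residual[v], v))   # negate for max-priority

    def pop_root(self):
        # Pop the vertex with the largest residual, skipping stale entries.
        while self.heap:
            neg_r, v = heapq.heappop(self.heap)
            if -neg_r == self.residual.get(v, 0.0) and -neg_r > 0:
                self.residual[v] = 0.0      # vertex updated now; reset residual
                return v
        return None
```

Because priorities accumulate across all inbound messages, a vertex whose belief drifts through many small message changes eventually rises to the top, which is exactly the case message-residual scheduling misses.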
Slide 20: Message vs. Belief Scheduling
Belief scheduling improves accuracy more quickly and improves convergence.
[Figure: (left) percent of runs converged in 4 hours, belief residuals vs. message residuals; (right) L1 error in beliefs vs. time (seconds), with belief scheduling reaching lower error faster than message scheduling.]
Slide 21: Splash Pruning
Belief residuals can be used to dynamically reshape and resize Splashes:
[Figure: vertices with low belief residual are pruned from the Splash.]
Slide 22: Splash Size
Using Splash pruning, our algorithm is able to dynamically select the optimal Splash size.
[Figure: running time (seconds) vs. Splash size (messages), with and without pruning.]
Slide 23: Example
[Figure: a synthetic noisy image, its factor graph, and a map of vertex update counts showing many updates in some regions and few in others.]
The algorithm identifies and focuses on the hidden sequential structure.
Slide 24: Distributed Belief Residual Splash (DBRSplash) Algorithm
Partition the factor graph over processors, schedule Splashes locally using belief residuals, and transmit messages on the boundary.
[Figure: three CPUs, each with local state and a scheduling queue, connected by a fast reliable network.]
Theorem: given a uniform partitioning of the chain graphical model, DBRSplash will run in the time bound given on the slide, retaining optimality.
Slide 25: Partitioning Objective
The partitioning of the factor graph determines storage, computation, and communication.
Goal: balance computation and minimize communication.
[Figure: a factor graph cut between CPU 1 and CPU 2; edges crossing the cut incur communication cost, and each side must carry a balanced share of the work.]
Slide 26: The Partitioning Problem
Objective: minimize communication while ensuring work balance.
Both the work per partition and the communication across the cut depend on the vertex update counts.
The problem is NP-hard; we use the fast METIS partitioning heuristic.
Update counts are not known!
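The two competing quantities can be made concrete with a small scoring function. This is an illustrative sketch (function and argument names are assumptions); `work[v]` stands in for the per-vertex update cost that, as the slide notes, is not known in advance:

```python
def partition_quality(edges, work, assignment, num_parts):
    """Score a partition by work imbalance and edge-cut communication cost.

    work[v]: estimated update cost of vertex v (unknown before running,
    which is exactly the difficulty described on the slide);
    assignment[v]: the partition index assigned to v.
    """
    loads = [0.0] * num_parts
    for v, w in work.items():
        loads[assignment[v]] += w
    # Ratio of the busiest partition to the average load; 1.0 is perfect.
    imbalance = max(loads) / (sum(loads) / num_parts)
    # Each edge crossing the cut requires transmitting messages.
    comm = sum(1 for (u, v) in edges if assignment[u] != assignment[v])
    return imbalance, comm
```

A partitioner like METIS minimizes the edge cut subject to a balance constraint on the loads; the catch here is that the true `work` values are only revealed as belief-residual scheduling runs.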
Slide 27: Unknown Update Counts
Update counts are determined by belief scheduling and depend on the graph structure, the factors, and more; there is little correlation between past and future update counts.
[Figure: a noisy image, its update counts, and an uninformed cut.]
Slide 28: Uninformed Cuts
Uninformed cuts produce greater work imbalance but lower communication cost than cuts informed by update counts.
[Figure: work imbalance and communication cost of uninformed vs. optimal cuts on the Denoise and UW-Systems problems; the uninformed cut gives some partitions too much work and others too little.]
Slide 29: Over-Partitioning
Over-cut the graph into k·p partitions and randomly assign them to CPUs.
This increases balance but also increases communication cost (more boundary).
[Figure: a 2-CPU partition without over-partitioning vs. an over-partitioned assignment with k = 6, where each CPU owns several scattered blocks.]
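The random assignment of the k·p pieces can be sketched as a shuffle followed by round-robin dealing, so each CPU receives about k pieces. This is an illustrative scheme; the slides only say the assignment is random, so the exact dealing strategy is an assumption:

```python
import random

def over_partition_assign(pieces, num_cpus, seed=0):
    """Randomly assign k*p over-partitioned pieces to p CPUs.

    With k pieces per CPU, each CPU's load is a sum of k roughly
    independent piece loads, so random assignment evens out the work
    (at the price of more boundary edges, i.e., more communication).
    """
    rng = random.Random(seed)
    shuffled = list(pieces)
    rng.shuffle(shuffled)
    # Deal the shuffled pieces round-robin so each CPU gets ~k of them.
    return {piece: i % num_cpus for i, piece in enumerate(shuffled)}
```

The partition factor k is the knob: larger k averages out load variance across pieces (better balance) but creates more cut edges (higher communication), matching the trade-off shown on the next slide.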
Slide 30: Over-Partitioning Results
Over-partitioning provides a simple method to trade between work balance and communication cost.
[Figure: communication cost and work imbalance vs. partition factor k; as k grows, communication cost rises while work imbalance falls.]
Slide 31: CPU Utilization
Over-partitioning improves CPU utilization:
[Figure: active CPUs vs. time (seconds) on the UW-Systems MLN and Denoise problems, with no over-partitioning vs. 10x over-partitioning; with 10x over-partitioning, more CPUs stay active throughout the run.]
Slide 32: DBRSplash Algorithm
Over-partition the factor graph, randomly assign pieces to processors, schedule Splashes locally using belief residuals, and transmit messages on the boundary.
[Figure: three CPUs, each with local state and a scheduling queue, connected by a fast reliable network.]
Slide 33: Experiments
Implemented in C++ using MPICH2 as a message-passing API.
Ran on the Intel OpenCirrus cluster: 120 processors (15 nodes, each with 2 quad-core Intel Xeon processors, connected by a gigabit Ethernet switch).
Tested on Markov Logic Networks obtained from Alchemy [Domingos et al. SSPR 08].
We present results on the largest (UW-Systems) and smallest (UW-Languages) MLNs.
Slide 34: Parallel Performance (Large Graph)
UW-Systems: 8K variables, 406K factors. Single-processor running time: 1 hour.
Linear to super-linear speedup up to 120 CPUs, aided by cache efficiency.
[Figure: speedup vs. number of CPUs, with no over-partitioning and with 5x over-partitioning, compared against the linear-speedup line.]
Slide 35: Parallel Performance (Small Graph)
UW-Languages: 1K variables, 27K factors. Single-processor running time: 1.5 minutes.
Linear to super-linear speedup up to 30 CPUs; beyond that, network costs quickly dominate the short running time.
[Figure: speedup vs. number of CPUs, with no over-partitioning and with 5x over-partitioning, compared against the linear-speedup line.]
Slide 36: Summary
The Splash operation generalizes the optimal parallel schedule on chain graphs.
Belief-based scheduling addresses the shortcomings of message scheduling and improves both accuracy and convergence.
DBRSplash is an efficient distributed parallel inference algorithm that uses over-partitioning to improve work balance.
Experimental results on large factor graphs show linear to super-linear speedup on up to 120 processors.
Slide 37: Thank You
Carnegie Mellon
Acknowledgements: Intel Research Pittsburgh (OpenCirrus cluster), AT&T Labs, DARPA.
Slide 38: Exponential Parallelism
From Saman Amarasinghe:
[Figure: number of cores (1–512, log scale) vs. release year (1970–20??). Single-core processors (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon) give way to multicore designs (Power4, Opteron, Power6, Niagara, Yonah, Pentium Extreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045, Raw), with core counts growing exponentially.]
Slide 39: AIStats09 Speedup
[Figure: speedup results from [AIStats 09] on 3D video prediction and protein side-chain prediction.]