Probabilistic Inference in Distributed Systems


1

Probabilistic Inference in Distributed Systems

Stanislav Funiak

Disclaimer: Statements made in this talk are the sole opinions of the presenter and do not necessarily represent the official position of the University or the presenter's advisor.

2

Monitoring in Emergency Response Systems

Firefighters enter a building

As they run around, they place a bunch of sensors

Want to monitor the temperature in various places

p(temperature at location i | temperature observed at all sensors), i.e., p(Xi | z)

3

Monitoring in Emergency Response Systems

Nice model + efficient inference:

You ask a 10-701 graduate for help: “learn the model”

You ask a 10-708 graduate for help: “implement efficient inference”

Put them in an Intel™ Core-Trio machine with 30 GB RAM

Simulation experiments work great

[Figure: graphical model with hidden state variables X1–X6 and observed temperatures Z2, Z4, Z6]

Done!

4

D-Day arrives…

You start up your machine and…

Firefighters deploy the sensors

The network goes down. Got flooded.

You call up an old friend at MIT, who sends you a patch in 24 minutes: highly optimized routing.

Oops! Part of the ceiling just came down; lost the connection again.

5

Last-minute Link Stats

Hmm, communication is lossy. Hmm, link qualities change.

* Joke warning: = 1 week

Maybe having good routing was not such a bad idea…

6

What’s wrong here?

• Cannot rely on centralized infrastructure
  – too costly to gather all observations
  – need to be robust against node failures and message losses
  – may want to perform online control (nodes equipped with actuators)

• Want to perform inference directly on network nodes

Also:

Autonomous teams of mobile robots

7

Distributed Inference – The Big Picture

Each node n issues a query p(Qn | temperature observed at all sensors), i.e., p(Qn | z), where Qn is some subset of variables, e.g. the temperature at locations 1, 2, 3.

Nodes collaborate to compute the query.

8

Probabilistic model vs. physical layer

[Figure: the probabilistic model (variables X1–X6, observations Z2, Z4, Z6) vs. the physical layer of the sensor network (physical nodes and available communication links)]

9

Natural solution: Loopy B.P.

Suppose: network nodes = variables

[Figure: a network of eight nodes (1–8), one variable per node]

10

Natural solution: Loopy B.P.

Suppose: network nodes = variables [Pfeffer, 2003, 2005]

Then we could run loopy B.P. directly on the network, sending messages such as 4→6, 5→6, 6→8 along the links; node 4's belief could be viewed as p(X4).

[Figure: variables X1–X8 assigned to network nodes 1–8, with messages 4→6, 5→6, 6→8]

Issues:
• may not observe the network structure
• potentially non-converging (not fully resolved)
• definitely over-confident (will revisit in the experimental results):
  belief: 99% hot.
  Truth: 51% hot, 49% cold
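For reference, the standard sum-product updates being run here, in my rendering for a pairwise model (the slide itself only shows the message arrows), are:

```latex
\mu_{i \to j}(x_j) \;=\; \sum_{x_i} \psi_i(x_i)\,\psi_{ij}(x_i, x_j) \prod_{k \in N(i)\setminus\{j\}} \mu_{k \to i}(x_i),
\qquad
b_i(x_i) \;\propto\; \psi_i(x_i) \prod_{k \in N(i)} \mu_{k \to i}(x_i)
```

so that, for example, the belief b_4 at node 4 plays the role of an approximation to p(X4 | z).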

11

Want the Following Properties

1. Global correctness: eventually, each node obtains the true distribution p(Qn | z)

2. Partial correctness: before convergence, a node can form a meaningful approximation of p(Qn | z)

3. Local correctness: without seeing other nodes' beliefs, each node can condition on its own observations

12

Outline

[Figure: the input model (X1–X6 with observations Z2, Z4, Z6) is reparametrized and distributed onto the sensor network; the available communication links support a routing tree]

Offline: distribute the input model (BN / MRF) [Paskin & Guestrin, 2004]

Then, online:
1. Nodes make local observations
2. Nodes establish a routing structure
3. Nodes communicate to compute the query

13

Standard parameterization not robust

Exact model: a Bayesian network over X1, X2, X3, X4 with CPDs p(X2 | X1), p(X3 | X1,X2), p(X4 | X2,X3), so that

p(X4 | X1) = ∑_{X2,X3} p(X2 | X1) × p(X3 | X1,X2) × p(X4 | X2,X3)

Suppose we "lose" a CPD / potential (not communicated yet, or a node failed), e.g. p(X2 | X1): effectively, we are assuming a uniform prior on X2, and the distribution changes dramatically (observe a high temperature; what is the probability of high temperature now?).

Much better: inference in a simpler model.

Now, suppose someone told us p(X2 | X3) and p(X3 | X1) instead. Construct an approximation that assumes X2 ⊥ X1 | X3; it preserves the correlation between X1 and X3.

15

Review: Junction Tree representation

BN / MN → junction tree (family-preserving, satisfies the running intersection property)

Cliques: X1,X2 – X2,X3,X4 – X3,X4,X5 – X4,X5,X6; e.g., the separator between the cliques X3,X4,X5 and X4,X5,X6 is X4,X5.

The junction tree is parameterized by clique marginals (we'll keep these) and separator marginals (not important; they can be computed from the clique marginals).

(Think of it as writing the CPDs p(X6 | X4,X5), etc.)
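In symbols, this is the standard junction tree identity (not spelled out in the transcript): the joint is the product of clique marginals divided by the product of separator marginals,

```latex
p(x_1,\ldots,x_6) \;=\; \frac{p(x_1,x_2)\,p(x_2,x_3,x_4)\,p(x_3,x_4,x_5)\,p(x_4,x_5,x_6)}{p(x_2)\,p(x_3,x_4)\,p(x_4,x_5)}
```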

16

Properties used by the Algorithm

Key properties:

1. Marginalization amounts to pruning cliques: in the junction tree T with cliques X1,X2 – X2,X3,X4 – X3,X4,X5 – X4,X5,X6 (separators X2; X3,X4; X4,X5), marginalizing out X1 simply prunes the leaf clique X1,X2.

2. Using a subset of cliques amounts to KL-projection: keeping only the cliques X2,X3,X4 and X4,X5,X6 (missing clique X3,X4,X5, separator X4) gives a smaller tree T', which asserts X2,X3 ⊥ X5,X6 | X4; the result is the projection of the exact distribution onto all distributions that factor as T'.
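Concretely, for the running example (my notation): marginalizing out X1, which appears only in the leaf clique X1,X2, just deletes that clique and its separator X2 from the factorization above; and keeping only the cliques X2,X3,X4 and X4,X5,X6 with separator X4 yields

```latex
\hat{p}(x_2,\ldots,x_6) \;=\; \frac{p(x_2,x_3,x_4)\,p(x_4,x_5,x_6)}{p(x_4)}
\;=\; \arg\min_{q \in \mathcal{F}(T')} \mathrm{KL}\!\left(p \,\middle\|\, q\right),
```

where F(T') denotes the family of distributions that factor according to the reduced tree T'.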

17

How are these structures used for distributed inference?

From clique marginals to distributed inference

[Figure: clique marginals X1,X2; X2,X3,X4; X3,X4,X5; X4,X5,X6 assigned to network nodes 1, 3, 4, 6; stronger and weaker links shown, with edges labeled by the variables they carry, e.g. X2, X3, X4, X5]

Clique marginals are assigned to network nodes.

Network junction tree [Paskin et al., 2005]:
• used for communication
• satisfies the running intersection property
• adaptive, can be optimized

18

Robust message passing algorithm

Global model: an external junction tree with cliques X1,X2 – X2,X3,X4 – X3,X4,X5 – X4,X5,X6.

[Figure: the clique marginals assigned to network nodes 1, 3, 4, 6 of the network junction tree]

Network junction tree: nodes communicate clique marginals along the network junction tree.

Local cliques: each node locally decides which cliques are sufficient for its neighbors (e.g., node 3 has obtained exact clique marginals such as X2,X3,X4 and X3,X4,X5).

19

Message passing = pruning leaf cliques

Theorem: On a path towards some network node, cliques that are not passed form branches of an external junction tree.

Corollary: At convergence, each node obtains a subtree of the external junction tree.

[Ch 6, Paskin, 2004]

External junction tree: X1,X2 – X2,X3,X4 – X3,X4,X5 – X4,X5,X6

[Figure (replay): network nodes 1, 3, 4, 6; cliques obtained by node 1 vs. pruned cliques]
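A minimal sketch of this pruning view in code; the clique chain is the running example from the slides, but the function itself is my own toy illustration, not the authors' implementation.

```python
# Toy illustration: marginalizing the external junction tree down to a set of query
# variables amounts to repeatedly pruning leaf cliques.  Cliques are frozensets of
# variable names; the tree is an adjacency list between clique indices.
from collections import defaultdict

def prune_for_query(cliques, edges, query):
    """Return the indices of cliques that must be kept to answer the query.

    A leaf clique can be pruned when every query variable it contains also appears
    in its single neighbor, so dropping it loses no needed variable.
    """
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    alive = set(range(len(cliques)))
    changed = True
    while changed:
        changed = False
        for i in list(alive):
            nbrs = adj[i] & alive
            if len(nbrs) != 1 or len(alive) == 1:
                continue                       # not a leaf (or last clique): keep
            (j,) = nbrs
            if cliques[i] & query <= cliques[j]:
                alive.remove(i)                # neighbor covers the needed variables
                changed = True
    return alive

# Running example from the slides: chain X1,X2 - X2,X3,X4 - X3,X4,X5 - X4,X5,X6.
cliques = [frozenset(s) for s in ({"X1", "X2"}, {"X2", "X3", "X4"},
                                  {"X3", "X4", "X5"}, {"X4", "X5", "X6"})]
edges = [(0, 1), (1, 2), (2, 3)]

# Marginalizing out X1 prunes only the leaf clique {X1, X2}.
print(sorted(prune_for_query(cliques, edges, {"X2", "X3", "X4", "X5", "X6"})))  # [1, 2, 3]
# A node whose query is X4, X5, X6 only needs the last clique.
print(sorted(prune_for_query(cliques, edges, {"X4", "X5", "X6"})))              # [3]
```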

20

Incorporating observations

Original model: [Figure: X1–X6 with observation variables Z1, Z3, Z4, Z6]

Reparametrized as a junction tree: X1,X2 – X2,X3,X4 – X3,X4,X5 – X4,X5,X6

Suppose all observation variables are leaves:

Can associate each likelihood with any clique that covers its parents:
• the algorithm will pass around clique priors and clique likelihoods
• marginalization still amounts to pruning (e.g., suppose we marginalize out X1)
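For example (my notation), the likelihood of the observation z4, whose only parent is X4, can be multiplied into any clique containing X4, say X3,X4,X5, without changing the rest of the factorization:

```latex
p(x_{1:6} \mid z_4) \;\propto\;
\frac{p(x_1,x_2)\; p(x_2,x_3,x_4)\; \bigl[\,p(x_3,x_4,x_5)\, p(z_4 \mid x_4)\,\bigr]\; p(x_4,x_5,x_6)}
     {p(x_2)\; p(x_3,x_4)\; p(x_4,x_5)}
```

Pruning the leaf clique X1,X2 still marginalizes out X1 exactly as before, since the likelihood term does not involve X1.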

21

Putting it all together

Theorem (global correctness): At convergence, each node n obtains the exact distribution over its query variables, conditioned on all observations.

Theorem (partial correctness): Before convergence, each node n obtains the KL projection onto the junction tree formed by its collected cliques, over its query variables and conditioned on the collected observations E.

22

Results: Convergence

Model: nodes estimate the temperature as well as an additive bias.

[Plot: error vs. iteration (lower is better). Robust message passing converges early, close to the global optimum; the standard sum-product algorithm gives bad answers for a long time, then "snaps in".]

23

Results: Robustness

[Plot: error over time for the robust message passing algorithm (lower is better). Communication is partitioned at t=60 and restored at t=120; node failures are marked. The algorithm converges close to the global optimum and is insensitive to node failures.]

24

How about dynamic inference?

Firefighters get fancier equipment…

Place wireless cameras around an environment

Want to determine the locations automatically

[Figure: a camera makes a local observation; what is its location Ci?]

[Funiak et al., 2006]

25

Firefighters get fancier equipment…

Distributed camera localization: camera locations Ci, object trajectory M1:T

This is a dynamic inference problem

26

How localization works in practice…

27

Model: (Dynamic) Bayesian Network

[Figure: DBN with camera locations C1, C2, object locations M1, M2, M5 at t = 1, 2, 5, and image observations O(t), e.g. observations by camera 1 at t = 1, 2, 5 and by camera 2 at t = 5]

The camera locations and the object location are the state processes.

Transition model: relates the state at time t-1 to the state at time t.

Measurement model: relates the observed image to the state.

Filtering: compute the posterior distribution over the current state given the observations.

28

Filtering: Summary

Each time step transforms the prior distribution into the posterior distribution in three stages: prediction, estimation, and roll-up.
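In standard filtering notation (my rendering; the slide's equations did not survive extraction), with state x_t and observation z_t the three stages are:

```latex
\begin{aligned}
\text{prediction:} &\quad p(x_t, x_{t+1} \mid z_{1:t}) \;=\; p(x_{t+1} \mid x_t)\, p(x_t \mid z_{1:t}) \\
\text{estimation:} &\quad p(x_t, x_{t+1} \mid z_{1:t+1}) \;\propto\; p(z_{t+1} \mid x_{t+1})\, p(x_t, x_{t+1} \mid z_{1:t}) \\
\text{roll-up:}    &\quad p(x_{t+1} \mid z_{1:t+1}) \;=\; \textstyle\sum_{x_t} p(x_t, x_{t+1} \mid z_{1:t+1})
\end{aligned}
```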

29

Observations & transitions introduce dependencies

Suppose a person is observed by cameras 1 & 2 at two consecutive time steps, t and t+1.

At time t: cliques C1,Mt – C2,Mt – C3

At time t+1: cliques C1,C2,Mt+1 – C3; there are no independence assertions among C1, C2, Mt+1

Typically, after a while, there are no independence assertions among the state variables C1, C2, …, CN, Mt+1

30

Junction Tree Assumed Density Filtering

Prior distribution at time t: a Markov network over A, B, C, D, E, represented by the junction tree ABC – BCD – CDE.

After prediction and estimation, the exact prior at time t+1 has larger cliques, e.g. ABCD – BCDE.

Roll-up and KL projection: periodically project back to a "small" junction tree, giving the approximate belief ABC – BCD – CDE at time t+1. [Boyen, Koller 1998]
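A self-contained toy sketch of the projection step (my own code, under the simplifying assumption of small discrete variables; the actual system works with Gaussians): the approximate belief is the product of clique marginals over separator marginals of the thin junction tree, which is exactly the KL projection onto distributions that factor over that tree.

```python
import numpy as np

rng = np.random.default_rng(0)
# Exact (too-entangled) belief at time t+1 over binary variables (A, B, C, D, E).
p = rng.random((2, 2, 2, 2, 2))
p /= p.sum()

def marginal(joint, keep_axes):
    """Sum out all axes except keep_axes; keepdims=True so results broadcast."""
    drop = tuple(ax for ax in range(joint.ndim) if ax not in keep_axes)
    return joint.sum(axis=drop, keepdims=True)

# Clique marginals of the thin junction tree ABC - BCD - CDE ...
p_abc = marginal(p, (0, 1, 2))
p_bcd = marginal(p, (1, 2, 3))
p_cde = marginal(p, (2, 3, 4))
# ... and its separator marginals BC and CD.
p_bc = marginal(p, (1, 2))
p_cd = marginal(p, (2, 3))

# KL projection onto the thin tree: clique marginals over separator marginals.
q = p_abc * p_bcd * p_cde / (p_bc * p_cd)

print("q sums to one:", bool(np.isclose(q.sum(), 1.0)))
print("KL(p || q) =", float((p * np.log(p / q)).sum()))
```

The projected q keeps the same clique marginals as p, so filtering can continue with the compact representation at the next time step.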

31

Distributed Assumed Density Filtering

[Figure: cliques X1,X2 – X2,X3,X4 – X3,X4,X5 – X4,X5,X6 assigned to network nodes 1, 3, 4, 6]

At each time step, a node computes a marginal over its clique(s):

1. Initialization
2. Estimation: condition on evidence (distributed)
3. Prediction: advance to the next time step (local)

32

Results: Convergence

[Plot: RMS error (lower is better) vs. iterations per time step (3, 5, 10, 15, 20), compared against the centralized solution]

Theorem: Given sufficient communication at each time step, the distribution obtained by the algorithm is equal to that obtained by the B&K98 algorithm.

33

Convergence: Temperature monitoring

[Plot: error (lower is better) vs. iterations per time step]

34

Comparison with Loopy B.P.

[Plot: error (lower is better) for loopy B.P. run on the unrolled DBN (t = 1, …, 5) with windows 1 and 5, vs. the distributed filter with 1 and 3 iterations per step]

35

Partitions introduce inconsistencies

[Figure: real camera network with a network partition; camera poses and object location as computed by the nodes on the left vs. the nodes on the right]

The beliefs obtained by the left and the right sub-networks do not agree on the shared variables and do not represent a globally consistent distribution.

Good news: the beliefs are not too different. The main difference is how certain they are.

36

The “two Bayesians meet on a street” problem

I believe the sun is up. Man, isn’t it down?

Hard problem, in general. Need samples to decide…

37

Alignment

Idea: formulate as an optimization problem.

Suppose we define the aligned distribution to match the inconsistent prior clique marginals.

Not so great for Gaussians…

[Figure: two beliefs over x, belief 1 uncertain and belief 2 certain, together with the resulting aligned distribution]

This objective tends to forget information…

38

Alignment

Suppose we instead use the KL divergence in the "wrong" order, between the aligned distribution and the inconsistent prior marginals.

Good: tends to prefer more certain distributions q.

For Gaussians, this is a convex problem: determinant maximization [Vandenberghe et al., SIAM 1998]; linear regression, can be distributed [Guestrin et al., IPSN 04].
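Writing the two objectives out explicitly (my notation and my reading of these two slides, since the original equations did not survive): let p̃_i be node i's inconsistent prior belief over clique C_i and let q_{C_i} be the corresponding marginal of the aligned distribution q. Matching the clique marginals, roughly argmin_q Σ_i KL(p̃_i ‖ q_{C_i}), tends to forget information, whereas the "wrong" order swaps the arguments:

```latex
q^{*} \;=\; \arg\min_{q} \; \sum_{i} \mathrm{KL}\!\left(q_{C_i} \,\middle\|\, \tilde{p}_i\right)
```

which prefers more certain q and, for Gaussians, becomes the convex determinant-maximization problem cited above.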

39

Results: Partition

Progressively partition the communication graph.

[Plot: error (lower is better) vs. number of partition components, comparing KL minimization, a simpler alignment, and the omniscient best / worst unaligned solutions]

KL minimization performs as well as the best unaligned solution.

40

Conclusion

Distributed inference presents many interesting challenges
• perform inference directly on the sensor nodes
• robust to message losses, node failures

Static inference: message passing on a routing tree
• message = collections of clique marginals and likelihoods
• obtain the joint distribution
• convergence and partial correctness properties

Dynamic inference: assumed density filtering
• address inconsistencies