Transcript of Graphical Models - A Brief Introduction (SEEM5680 lecture slides)

Page 1:

Graphical Models: A Brief Introduction

Reference: Pattern Recognition and Machine Learning by C.M. Bishop, Springer, Chapter 8.2

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/05/Bishop-PRML-sample.pdf

Page 2:

[Diagram: Probabilistic Model -> Real-World Data, arrow labeled P(Data | Parameters)]

Page 3:

[Diagram: Probabilistic Model <-> Real-World Data, arrows labeled P(Data | Parameters) and P(Parameters | Data)]

Page 4:

[Diagram: Probabilistic Model <-> Real-World Data; the arrow P(Data | Parameters) is labeled "Generative Model, Probability"; the arrow P(Parameters | Data) is labeled "Inference, Statistics"]

Page 5:

Notation and Definitions

• X is a random variable
– Lower-case x is some possible value for X
– "X = x" is a logical proposition: that X takes value x
– There is uncertainty about the value of X
• e.g., X is the Hang Seng index at 5pm tomorrow

• p(X = x) is the probability that proposition X = x is true
– often shortened to p(x)

• If the set of possible x's is finite, we have a probability distribution and Σ_x p(x) = 1

• If the set of possible x's is infinite, p(x) is a density function, and p(x) integrates to 1 over the range of X

Page 6:

Multiple Variables

• p(x, y, z)
– Probability that X=x AND Y=y AND Z=z
– Possible values: cross-product of X, Y, Z

– e.g., X, Y, Z each take 10 possible values
• (x, y, z) can take 10³ possible values
• p(x, y, z) is a 3-dimensional array/table
– Defines 10³ probabilities
• Note the exponential increase as we add more variables

– e.g., X, Y, Z are all real-valued
• x, y, z live in a 3-dimensional vector space
• p(x, y, z) is a positive function defined over this space, integrates to 1
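To make the table view concrete, here is a minimal Python/NumPy sketch (not from the slides; the random values are purely illustrative) of a discrete joint p(x, y, z) over three variables with 10 values each, stored as a 10 x 10 x 10 array that sums to 1:

```python
import numpy as np

# Hypothetical example: three discrete variables X, Y, Z, each with 10 values.
K = 10

# Build an arbitrary positive 10 x 10 x 10 table and normalize it so it sums to 1.
rng = np.random.default_rng(0)
joint = rng.random((K, K, K))
joint /= joint.sum()          # now a valid joint distribution p(x, y, z)

print(joint.shape)            # (10, 10, 10) -> 10**3 = 1000 probabilities
print(joint.sum())            # 1.0 (up to floating-point error)
```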

Page 7:

Conditional Probability

• p(x | y, z)
– Probability of x given that Y=y and Z=z
– Could be
• hypothetical, e.g., "if Y=y and if Z=z"
• observational, e.g., we observed values y and z
– can also have p(x, y | z), etc.
– "all probabilities are conditional probabilities"

• Computing conditional probabilities is the basis of many prediction and learning problems, e.g.,
– p(DJI tomorrow | DJI index last week)
– expected value of [DJI tomorrow | DJI index last week]
– most likely value of parameter θ given observed data

Page 8:

Computing Conditional Probabilities

• Variables A, B, C, D
– All distributions of interest related to A, B, C, D can be computed from the full joint distribution p(a, b, c, d)

• Examples, using the Law of Total Probability
– p(a) = Σ_{b,c,d} p(a, b, c, d)
– p(c, d) = Σ_{a,b} p(a, b, c, d)
– p(a, c | d) = Σ_{b} p(a, b, c | d), where p(a, b, c | d) = p(a, b, c, d) / p(d)

• These are standard probability manipulations: however, we will see how to use these to make inferences about parameters and unobserved variables, given data
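These manipulations can be written directly against a joint probability table. The sketch below is illustrative only (each variable takes 3 values and the table is random, not taken from the slides); it computes p(a), p(c, d), and p(a, c | d) using the Law of Total Probability:

```python
import numpy as np

rng = np.random.default_rng(1)
# Full joint p(a, b, c, d): axes 0..3 correspond to A, B, C, D, each with 3 values here.
joint = rng.random((3, 3, 3, 3))
joint /= joint.sum()

# Law of Total Probability: sum out the unwanted variables.
p_a  = joint.sum(axis=(1, 2, 3))          # p(a)   = sum_{b,c,d} p(a,b,c,d)
p_cd = joint.sum(axis=(0, 1))             # p(c,d) = sum_{a,b}   p(a,b,c,d)

# Conditional: p(a, b, c | d) = p(a,b,c,d) / p(d)
p_d = joint.sum(axis=(0, 1, 2))           # p(d)
p_abc_given_d = joint / p_d               # broadcasting divides each d-slice by p(d)
p_ac_given_d  = p_abc_given_d.sum(axis=1) # p(a, c | d) = sum_b p(a, b, c | d)

print(p_a.sum(), p_cd.sum())              # both 1.0
print(p_ac_given_d.sum(axis=(0, 1)))      # sums to 1 for each value of d
```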

Page 9:

Two Practical Problems

(Assume for simplicity each variable takes K values)

• Problem 1: Computational Complexity
– Conditional probability computations scale as O(K^N)
• where N is the number of variables being summed over

• Problem 2: Model Specification
– To specify a joint distribution we need a table of O(K^N) numbers
– Where do these numbers come from?
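A quick numerical illustration of Problem 2 (K and N chosen arbitrarily): the size of a full joint table over N variables, each taking K values, grows as K^N:

```python
# Illustrative only: size of a full joint table over N variables with K values each.
K = 10
for N in (2, 5, 10, 20):
    print(N, "variables ->", K ** N, "entries")   # grows as O(K^N)
```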

Page 10:

Two Key Ideas

• Problem 1: Computational Complexity
– Idea: Graphical models
• Structured probability models lead to tractable inference

• Problem 2: Model Specification
– Idea: Probabilistic learning

• General principles for learning from data

Page 11:

Conditional Independence

• A is conditionally independent of B given C iff
p(a | b, c) = p(a | c)
(this also implies that B is conditionally independent of A given C)

• In words, B provides no information about A if the value of C is known

• Example:
– a = "reading ability"
– b = "height"
– c = "age"

• Note that conditional independence does not imply marginal independence
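A small numerical sketch of this definition, using the reading-ability / height / age example above (all probability values below are invented): build a joint in which A and B are conditionally independent given C, then check that p(a | b, c) = p(a | c):

```python
import numpy as np

# Toy model (numbers invented): reading ability (A) and height (B) are
# conditionally independent given age (C): p(a, b, c) = p(a | c) p(b | c) p(c).
p_c = np.array([0.3, 0.7])                          # two age groups
p_a_given_c = np.array([[0.8, 0.2],                 # p(a | c), rows indexed by c
                        [0.3, 0.7]])
p_b_given_c = np.array([[0.9, 0.1],                 # p(b | c), rows indexed by c
                        [0.4, 0.6]])

# Joint p(a, b, c), axes ordered (a, b, c).
joint = np.einsum('ca,cb,c->abc', p_a_given_c, p_b_given_c, p_c)

# Check p(a | b, c) == p(a | c): knowing height adds nothing once age is known.
p_bc = joint.sum(axis=0)                            # p(b, c)
p_a_given_bc = joint / p_bc                         # shape (a, b, c)
print(np.allclose(p_a_given_bc, p_a_given_c.T[:, None, :]))   # True
```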

Page 12:

Graphical Models

• Represent dependency structure with a directed graph
– Node <-> random variable
– Edges encode dependencies
• Absence of edge -> conditional independence
– Directed and undirected versions

• Why is this useful?
– A language for communication
– A language for computation

Page 13:

Examples of 3‐way Graphical Models

[Diagram: three disconnected nodes A, B, C]

Marginal Independence: p(A,B,C) = p(A) p(B) p(C)

Page 14:

Examples of 3‐way Graphical Models

[Diagram: node A with directed edges to B and to C]

Conditionally independent effects: p(A,B,C) = p(B|A) p(C|A) p(A)

B and C are conditionally independent given A

e.g., A is a disease, and we model B and C as conditionally independent symptoms given A

Page 15:

Examples of 3‐way Graphical Models

[Diagram: nodes A and B, each with a directed edge to C]

Independent Causes: p(A,B,C) = p(C|A,B) p(A) p(B)

Page 16:

Examples of 3‐way Graphical Models

[Diagram: chain A -> B -> C]

Markov dependence: p(A,B,C) = p(C|B) p(B|A) p(A)

Page 17:

Directed Graphical Models

[Diagram: nodes A and B, each with a directed edge to C]

p(A,B,C) = p(C|A,B) p(A) p(B)

Page 18:

Directed Graphical Models

[Diagram: nodes A and B, each with a directed edge to C]

In general, p(X1, X2, ..., XN) = ∏_i p(Xi | parents(Xi))

p(A,B,C) = p(C|A,B) p(A) p(B)
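As an illustrative sketch (CPT values invented, not from the slides), the factored form can be assembled directly from the local conditional probability tables for the graph A -> C <- B:

```python
import numpy as np

# Illustrative CPTs (values invented) for the graph A -> C <- B, all variables binary.
p_a = np.array([0.6, 0.4])                    # p(A)
p_b = np.array([0.7, 0.3])                    # p(B)
p_c_given_ab = np.array([[[0.9, 0.1],         # p(C | A, B), indexed as [a, b, c]
                          [0.5, 0.5]],
                         [[0.4, 0.6],
                          [0.2, 0.8]]])

# General rule: p(X1, ..., XN) = prod_i p(Xi | parents(Xi)).
# For this graph: p(a, b, c) = p(a) p(b) p(c | a, b)
joint = np.einsum('a,b,abc->abc', p_a, p_b, p_c_given_ab)

print(joint.sum())                            # 1.0: the product of local CPTs is a valid joint
print(joint[1, 0, 1])                         # p(A=1, B=0, C=1) = 0.4 * 0.7 * 0.6 = 0.168
```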

Page 19:

Directed Graphical Models

[Diagram: nodes A and B, each with a directed edge to C]

• Probability model has simple factored form

• Directed edges => direct dependence

• Absence of an edge => conditional independence

• Also known as belief networks, Bayesian networks, causal networks

In general, p(X1, X2, ..., XN) = ∏_i p(Xi | parents(Xi))

p(A,B,C) = p(C|A,B)p(A)p(B)

Page 20:

Reminders from Probability….

• Law of Total Probability
P(a) = Σ_b P(a, b) = Σ_b P(a | b) P(b)
– Conditional version:
P(a | c) = Σ_b P(a, b | c) = Σ_b P(a | b, c) P(b | c)

• Factorization or Chain Rule
– P(a, b, c, d) = P(a | b, c, d) P(b | c, d) P(c | d) P(d), or
= P(b | a, c, d) P(c | a, d) P(d | a) P(a), or
= .....
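For a concrete check (the joint table below is random and purely illustrative, not from the slides), the conditional version of the Law of Total Probability can be verified numerically on a small table:

```python
import numpy as np

rng = np.random.default_rng(2)
joint = rng.random((3, 3, 3))        # arbitrary p(a, b, c); random values, for illustration
joint /= joint.sum()

p_c  = joint.sum(axis=(0, 1))                    # p(c)
p_bc = joint.sum(axis=0)                         # p(b, c)
p_ac = joint.sum(axis=1)                         # p(a, c)

p_a_given_c  = p_ac / p_c                        # p(a | c)
p_a_given_bc = joint / p_bc                      # p(a | b, c), shape (a, b, c)
p_b_given_c  = p_bc / p_c                        # p(b | c)

# Conditional Law of Total Probability: P(a | c) = sum_b P(a | b, c) P(b | c)
reconstructed = np.einsum('abc,bc->ac', p_a_given_bc, p_b_given_c)
print(np.allclose(reconstructed, p_a_given_c))   # True
```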

Page 21:

Probability Calculations on Graphs

• General algorithms exist - beyond trees
– Complexity is typically O(m^(number of parents)) (where m = arity of each node)
– If single parents (e.g., tree) -> O(m)
– The sparser the graph, the lower the complexity

• Technique can be "automated"
– i.e., a fully general algorithm for arbitrary graphs
– For continuous variables:
• replace sum with integral
– For identification of most likely values:
• replace sum with max operator
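A sketch of why sparse structure lowers the cost, using the Markov chain A -> B -> C from the earlier slide (CPT values are random and purely illustrative): the marginal p(C) can be computed by pushing the sums inside the factored form instead of materializing the full joint:

```python
import numpy as np

# Markov chain A -> B -> C with m states each (CPT values invented for illustration).
m = 4
rng = np.random.default_rng(3)
p_a = rng.random(m);              p_a /= p_a.sum()
p_b_given_a = rng.random((m, m)); p_b_given_a /= p_b_given_a.sum(axis=1, keepdims=True)
p_c_given_b = rng.random((m, m)); p_c_given_b /= p_c_given_b.sum(axis=1, keepdims=True)

# Brute force: build the full joint p(a,b,c) = p(c|b) p(b|a) p(a), then sum (O(m^3) work).
joint = np.einsum('a,ab,bc->abc', p_a, p_b_given_a, p_c_given_b)
p_c_brute = joint.sum(axis=(0, 1))

# Exploit the chain structure: push the sums inside (O(m^2) work per step).
p_b = p_a @ p_b_given_a                 # p(b) = sum_a p(a) p(b|a)
p_c = p_b @ p_c_given_b                 # p(c) = sum_b p(b) p(c|b)

print(np.allclose(p_c, p_c_brute))      # True, without ever building the full joint
```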

Page 22:

[Diagram: Probabilistic Model <-> Real-World Data; the arrow P(Data | Parameters) is labeled "Generative Model, Probability"; the arrow P(Parameters | Data) is labeled "Inference, Statistics"]

Page 23:

The Likelihood Function

• Likelihood = p(data | parameters) = p(D | θ) = L(θ)

• Likelihood tells us how likely the observed data are conditioned on a particular setting of the parameters

• Details
– Constants that do not involve θ can be dropped in defining L(θ)
– Often easier to work with log L(θ)

Page 24:

Comments on the Likelihood Function

• Constructing a likelihood function L(θ) is the first step in probabilistic modeling

• The likelihood function implicitly assumes an underlying probabilistic model M with parameters θ

• L(θ) connects the model to the observed data

• Graphical models provide a useful language for constructing likelihoods 

Page 25:

Binomial Likelihood

• Binomial model
– n memoryless trials, 2 outcomes
– probability θ of success at each trial

• Observed data
– r successes in n trials
– Defines a likelihood:

L(θ) = p(D | θ) = p(successes) p(non-successes) = θ^r (1-θ)^(n-r)
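A minimal sketch (the counts below are invented, for illustration only) that simply evaluates this likelihood at a few values of θ:

```python
import numpy as np

# Evaluate the binomial likelihood L(theta) = theta**r * (1 - theta)**(n - r)
# for invented counts, to see how it varies as a function of theta.
n, r = 20, 14

for theta in (0.3, 0.5, 0.7, 0.9):
    L = theta**r * (1 - theta)**(n - r)
    print(theta, L)          # peaks near theta = r / n = 0.7
```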

Page 26:

Binomial Likelihood Examples

Page 27:

Graphical Models

• Represent using a graphical model:
• Left: data points are conditionally independent given θ
• Right: plate notation (same model as left)
– repeated nodes are inside a box (plate)
– the number in the lower right-hand corner specifies the number of repetitions of the node

Page 28:

Graphical Models

• Assume each data case was generated independently but from the same distribution
• Data cases are only independent conditional on the parameters θ
• Marginally, the data cases are dependent
• The order in which the data cases arrive makes no difference to our beliefs about θ (all orderings have the same sufficient statistics): the data is exchangeable

Page 29:

Plate Notation

• Avoid visual clutter: use a form of syntactic sugar, called plates
• Draw a little box around the repeated variables
• With the convention that nodes within the box are repeated when the model is unrolled
• Bottom right corner of the box: number of copies or repetitions
• The corresponding joint distribution has the form:

p(θ, D) = p(θ) ∏_{i=1}^{n} p(wi | θ)

Page 30:

Multinomial Likelihood

• Multinomial model
– n memoryless trials, K outcomes
– Probability vector θ = (θ1, ..., θK) for the outcomes at each trial

• Observed data
– nj occurrences of outcome j in n trials
– Defines a likelihood:

L(θ) = p(D | θ) = ∏_{j=1}^{K} θj^{nj}
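Assuming the standard multinomial form above, a short illustrative sketch (the counts and the candidate θ are invented) of evaluating the log-likelihood and the maximizing proportions:

```python
import numpy as np

# Multinomial log-likelihood: log L(theta) = sum_j n_j * log(theta_j),
# with invented counts n_j for K = 3 outcomes.
counts = np.array([5, 12, 3])                    # n_j, with n = 20 trials
n = counts.sum()

theta = np.array([0.2, 0.5, 0.3])                # a candidate probability vector
log_L = np.sum(counts * np.log(theta))
print(log_L)

# The maximizing theta (subject to sum(theta) = 1) is the vector of observed proportions.
theta_ml = counts / n
print(theta_ml, np.sum(counts * np.log(theta_ml)))   # higher log-likelihood than the candidate
```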

Page 31:

Graphical Model for Multinomial

[Diagram: parameter node θ = [ p(w1), p(w2), ..., p(wK) ] (Parameters) with directed edges to observed data nodes w1, w2, ..., wn (Observed data)]

Page 32:

"Plate" Notation

[Diagram: parameter node θ (Model parameters) with a directed edge to node wi inside a plate labeled i = 1:n; Data = D = {w1, ..., wn}]

• Plate (rectangle) indicates replicated nodes in a graphical model
• Variables within a plate are conditionally independent given their parent

Page 33:

Learning in Graphical Models

[Diagram: parameter node θ (Model parameters) with a directed edge to node wi inside a plate labeled i = 1:n; Data = D = {w1, ..., wn}]

• Can view learning in a graphical model as computing the most likely value of the parameter node given the data nodes

Page 34:

Maximum Likelihood (ML) Principle

[Diagram: parameter node θ (Model parameters) with a directed edge to node wi inside a plate labeled i = 1:n; Data = {w1, ..., wn}]

L(θ) = p(Data | θ) = ∏_i p(wi | θ)

Maximum Likelihood: θ_ML = arg max_θ { Likelihood(θ) }

Select the parameters that make the observed data most likely
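A hedged sketch of the ML principle for the simplest case, coin-flip data wi in {0, 1} (the data vector below is invented): maximize log L(θ) = Σ_i log p(wi | θ) over a grid of θ values:

```python
import numpy as np

# Invented coin-flip data: Data = {w1, ..., wn}, each wi in {0, 1}.
w = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def log_likelihood(theta):
    # log L(theta) = sum_i log p(w_i | theta), with p(1|theta) = theta, p(0|theta) = 1 - theta
    return np.sum(w * np.log(theta) + (1 - w) * np.log(1 - theta))

grid = np.linspace(0.01, 0.99, 99)
theta_ml = grid[np.argmax([log_likelihood(t) for t in grid])]
print(theta_ml)     # approximately mean(w) = 0.7, the closed-form ML estimate for this model
```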

Page 35:

The Bayesian Approach to Learning

[Diagram: parameter node θ with a directed edge to node wi inside a plate labeled i = 1:n]

Fully Bayesian: p(θ | Data) = p(Data | θ) p(θ) / p(Data)

Maximum A Posteriori: θ_MAP = arg max_θ { Likelihood(θ) x Prior(θ) }

Prior(θ) = p(θ)
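An illustrative sketch of MAP estimation for the binomial model, assuming a Beta(α, β) prior (a standard conjugate choice, not specified on the slide; the counts and prior values are invented):

```python
import numpy as np

# MAP estimation for the binomial model with a Beta(alpha, beta) prior.
n, r = 20, 14                 # r successes in n trials (invented counts)
alpha, beta = 2.0, 2.0        # prior "pseudo-counts" (invented)

grid = np.linspace(0.01, 0.99, 99)
log_post = (r * np.log(grid) + (n - r) * np.log(1 - grid)                  # log Likelihood(theta)
            + (alpha - 1) * np.log(grid) + (beta - 1) * np.log(1 - grid))  # log Prior(theta)

theta_map = grid[np.argmax(log_post)]
print(theta_map)              # close to (r + alpha - 1) / (n + alpha + beta - 2) = 15/22 ~ 0.68
```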

Page 36:

Summary of Bayesian Learning

• Can use graphical models to describe relationships between parameters and data
• P(data | parameters) = Likelihood function
• P(parameters) = prior
– In applications such as text mining, prior can be "uninformative", i.e., flat
– Prior can also be optimized for prediction (e.g., on validation data)

• We can compute P(parameters | data, prior) or a "point estimate" (e.g., posterior mode or mean)

• Computation of posterior estimates can be computationally intractable
– Monte Carlo techniques often used
