CS b553: Algorithms for Optimization and Learning


Page 1: CS b553: Algorithms for Optimization and  Learning

CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Structure Learning

Page 2: CS b553: Algorithms for Optimization and  Learning

AGENDA

Learning probability distributions from example data

To what extent can Bayes net structure be learned?

Constraint methods (inferring conditional independence)

Scoring methods (learning => optimization)

Page 3: CS b553: Algorithms for Optimization and  Learning

BASIC QUESTION

Given examples drawn from a distribution P* with independence relations given by the Bayesian structure G*, can we recover G*?

Page 4: CS b553: Algorithms for Optimization and  Learning

BASIC QUESTION

Given examples drawn from a distribution P* with independence relations given by the Bayesian structure G*, can we recover G*, or at least construct a network that encodes the same independence relations as G*?

[Figure: example networks G*, G1, G2]

Page 5: CS b553: Algorithms for Optimization and  Learning

LEARNING IN THE FACE OF NOISY DATA

Ex: flip two independent coins. Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT

Model 1: X  Y (no edge)    Model 2: X → Y

Page 6: CS b553: Algorithms for Optimization and  Learning

LEARNING IN THE FACE OF NOISY DATA

Ex: flip two independent coins. Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT

Model 1: X  Y (no edge)    Model 2: X → Y

ML parameters

Model 1: P(X=H) = 9/20, P(Y=H) = 8/20
Model 2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11

Page 7: CS b553: Algorithms for Optimization and  Learning

LEARNING IN THE FACE OF NOISY DATA

Ex: flip two independent coins. Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT

Model 1: X  Y (no edge)    Model 2: X → Y

ML parameters

Model 1: P(X=H) = 9/20, P(Y=H) = 8/20
Model 2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11

Errors in Model 2's estimates are likely to be larger!
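To make the estimates above concrete, here is a minimal sketch (not from the slides; variable names are illustrative) that derives the maximum-likelihood parameters directly from the 20-flip dataset:

from collections import Counter

# Dataset of 20 flips of two coins (X, Y), as on the slide: 3 HH, 6 HT, 5 TH, 6 TT
data = [("H", "H")] * 3 + [("H", "T")] * 6 + [("T", "H")] * 5 + [("T", "T")] * 6

counts = Counter(data)
M = len(data)                                                # 20 samples
n_xh = sum(c for (x, _), c in counts.items() if x == "H")    # X = H occurs 9 times
n_yh = sum(c for (_, y), c in counts.items() if y == "H")    # Y = H occurs 8 times

# Model 1 (no edge): one marginal parameter per coin.
p_xh = n_xh / M          # P(X=H) = 9/20
p_yh = n_yh / M          # P(Y=H) = 8/20

# Model 2 (X -> Y): Y's parameters are estimated separately for each value of X,
# so each estimate uses only a fraction of the data (data fragmentation).
p_yh_given_xh = counts[("H", "H")] / n_xh          # 3/9
p_yh_given_xt = counts[("T", "H")] / (M - n_xh)    # 5/11

print(p_xh, p_yh, p_yh_given_xh, p_yh_given_xt)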

Page 8: CS b553: Algorithms for Optimization and  Learning

PRINCIPLE

Learning structure must trade off fit of the data vs. complexity of the network.

Complex networks:
More parameters to learn
More data fragmentation = greater sensitivity to noise

Page 9: CS b553: Algorithms for Optimization and  Learning

APPROACH #1: CONSTRAINT-BASED LEARNING

First, identify an undirected skeleton of edges in G*:
If an edge X-Y is in G*, then no subset of evidence variables can make X and Y independent
If X-Y is not in G*, then we can find evidence variables to make X and Y independent

Then, assign directionality to preserve independences

Page 10: CS b553: Algorithms for Optimization and  Learning

BUILD-SKELETON ALGORITHM

Given X = {X1, …, Xn} and a query Independent?(X, Y, U):
H = complete graph over X
For all pairs Xi, Xj, test separation as follows:
  Enumerate all possible separating sets U
  If Independent?(Xi, Xj, U), then remove Xi-Xj from H

In practice:
• Must restrict to bounded-size subsets |U| ≤ d (i.e., assume G* has bounded degree), giving O(n^2 (n-2)^d) tests
• Independence can't be tested exactly
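A minimal Python sketch of this loop follows (not from the slides). It assumes a caller-supplied independence oracle independent(a, b, U), in practice a statistical test on the data, and bounds candidate separating sets to size d:

from itertools import combinations

def build_skeleton(variables, independent, d=3):
    """Return (edges, separating_sets) for the undirected skeleton.

    variables:   list of variable names
    independent: function (Xi, Xj, U) -> bool, a conditional-independence oracle
    d:           maximum size of candidate separating sets, |U| <= d
    """
    # Start from the complete undirected graph over the variables.
    edges = {frozenset((a, b)) for a, b in combinations(variables, 2)}
    sepsets = {}

    for a, b in combinations(variables, 2):
        others = [v for v in variables if v not in (a, b)]
        # Enumerate candidate separating sets of size 0..d.
        for k in range(d + 1):
            found = False
            for U in combinations(others, k):
                if independent(a, b, set(U)):
                    edges.discard(frozenset((a, b)))        # remove edge a - b
                    sepsets[frozenset((a, b))] = set(U)     # remember why it was removed
                    found = True
                    break
            if found:
                break
    return edges, sepsets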

Page 11: CS b553: Algorithms for Optimization and  Learning

ASSIGNING DIRECTIONALITY

Note that v-structures X→Y←Z introduce a dependency between X and Z given Y.

In structures X→Y→Z, X←Y←Z, and X←Y→Z, X and Z are independent given Y; in fact, Y must be given for X and Z to be independent.

Idea: look at the separating sets for all triples X-Y-Z in the skeleton without edge X-Z.

[Case: X, Y, Z form a triangle: directionality is irrelevant]

Page 12: CS b553: Algorithms for Optimization and  Learning

ASSIGNING DIRECTIONALITY

Note that v-structures X→Y←Z introduce a dependency between X and Z given Y.

In structures X→Y→Z, X←Y←Z, and X←Y→Z, X and Z are independent given Y; in fact, Y must be given for X and Z to be independent.

Idea: look at the separating sets for all triples X-Y-Z in the skeleton without edge X-Z.

[Case: X, Y, Z form a triangle: directionality is irrelevant]
[Case: Y separates X and Z: not a v-structure]

Page 13: CS b553: Algorithms for Optimization and  Learning

ASSIGNING DIRECTIONALITY

Note that v-structures X→Y←Z introduce a dependency between X and Z given Y.

In structures X→Y→Z, X←Y←Z, and X←Y→Z, X and Z are independent given Y; in fact, Y must be given for X and Z to be independent.

Idea: look at the separating sets for all triples X-Y-Z in the skeleton without edge X-Z.

[Case: X, Y, Z form a triangle: directionality is irrelevant]
[Case: Y separates X and Z: not a v-structure]
[Case: Y ∉ U, where U separates X and Z: a v-structure X→Y←Z]
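A minimal sketch of this orientation rule, continuing the hypothetical build_skeleton output above (illustrative, not from the slides):

def orient_v_structures(edges, sepsets, variables):
    """Orient X -> Y <- Z for every triple X - Y - Z with no X - Z edge
    whose separating set does not contain Y."""
    directed = set()   # set of ordered pairs (parent, child)
    for y in variables:
        # Neighbors of y in the skeleton.
        nbrs = [v for v in variables if frozenset((v, y)) in edges]
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                x, z = nbrs[i], nbrs[j]
                if frozenset((x, z)) in edges:
                    continue                          # triangle: rule does not apply
                U = sepsets.get(frozenset((x, z)), set())
                if y not in U:                        # Y not needed to separate X and Z
                    directed.add((x, y))              # X -> Y
                    directed.add((z, y))              # Z -> Y
    return directed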

Page 15: CS b553: Algorithms for Optimization and  Learning

STATISTICAL INDEPENDENCE TESTING

Question: are X and Y independent?
Null hypothesis H0: X and Y are independent
Alternative hypothesis HA: X and Y are not independent

Page 16: CS b553: Algorithms for Optimization and  Learning

STATISTICAL INDEPENDENCE TESTING

Question: are X and Y independent?
Null hypothesis H0: X and Y are independent
Alternative hypothesis HA: X and Y are not independent

χ² test: use the statistic
χ² = Σ_{x,y} ( M[x,y] − M·P̂(x)·P̂(y) )² / ( M·P̂(x)·P̂(y) )
with M the number of samples, M[x,y] the number of samples with X=x and Y=y, and P̂ the empirical probability.

We can compute (by table lookup) the probability of getting a value at least this extreme if H0 is true (the p-value). If p < some threshold, e.g. 1 − 0.95 = 0.05, then H0 is rejected.
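As an illustration (not part of the slides, and using scipy as an added dependency), the coin-flip dataset above could be tested for independence from its 2x2 contingency table:

from scipy.stats import chi2_contingency

# Contingency table of the 20 flips:
#              Y=H  Y=T
table = [[3,   6],    # X = H
         [5,   6]]    # X = T

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
# A large p-value means H0 (independence) is not rejected, which is the right
# conclusion here: the coins really are independent.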

Page 17: CS b553: Algorithms for Optimization and  Learning

APPROACH #2: SCORE-BASED METHODS

Learning => optimization: define a scoring function Score(G;D) that evaluates the quality of structure G, and optimize it. This is a combinatorial optimization problem.

Issues:
Choice of scoring function: maximum likelihood score, Bayesian score
Efficient optimization techniques

Page 18: CS b553: Algorithms for Optimization and  Learning

MAXIMUM-LIKELIHOOD SCORES

ScoreL(G;D) = likelihood of the BN with the most likely parameter settings under structure G

Let L(θG, G; D) be the likelihood of the data using parameters θG with structure G
Let θG* = arg maxθ L(θ, G; D), as described in the last lecture
Then ScoreL(G;D) = L(θG*, G; D)

Page 19: CS b553: Algorithms for Optimization and  Learning

ISSUE WITH ML SCORE

Page 20: CS b553: Algorithms for Optimization and  Learning

ISSUE WITH ML SCORE

Independent coin example

G1: X  Y (no edge)    G2: X → Y

ML parameters

G1: P(X=H) = 9/20, P(Y=H) = 8/20
G2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11

Likelihood score

log L(θG1*, G1; D) = 9 log(9/20) + 11 log(11/20) + 8 log(8/20) + 12 log(12/20)
log L(θG2*, G2; D) = 9 log(9/20) + 11 log(11/20) + 3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)
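A quick numerical check of these two scores (added for illustration, not part of the slides):

from math import log

# log-likelihood of G1 (X and Y independent) at its ML parameters
logL_G1 = 9*log(9/20) + 11*log(11/20) + 8*log(8/20) + 12*log(12/20)

# log-likelihood of G2 (X -> Y) at its ML parameters
logL_G2 = (9*log(9/20) + 11*log(11/20)
           + 3*log(3/9) + 6*log(6/9) + 5*log(5/11) + 6*log(6/11))

print(logL_G1, logL_G2)      # approx -27.22 vs -27.07
print(logL_G1 - logL_G2)     # approx -0.15: the denser graph G2 scores higher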

Page 21: CS b553: Algorithms for Optimization and  Learning

ISSUE WITH ML SCORE

G1: X  Y (no edge)    G2: X → Y

Likelihood score

log L(θG1*, G1; D) − log L(θG2*, G2; D)
= 8 log(8/20) + 12 log(12/20) − [3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)]

Page 22: CS b553: Algorithms for Optimization and  Learning

ISSUE WITH ML SCORE

G1: X  Y (no edge)    G2: X → Y

Likelihood score

log L(θG1*, G1; D) − log L(θG2*, G2; D)
= 8 log(8/20) + 12 log(12/20) − 8 [3/8 log(3/9) + 5/8 log(5/11)] − 12 [6/12 log(6/9) + 6/12 log(6/11)]

Page 23: CS b553: Algorithms for Optimization and  Learning

ISSUE WITH ML SCORE

G1: X  Y (no edge)    G2: X → Y

Likelihood score

log L(θG1*, G1; D) − log L(θG2*, G2; D)
= 8 log(8/20) + 12 log(12/20) − 8 [3/8 log(3/9) + 5/8 log(5/11)] − 12 [6/12 log(6/9) + 6/12 log(6/11)]
= 8 [log(8/20) − 3/8 log(3/9) − 5/8 log(5/11)] + 12 [log(12/20) − 6/12 log(6/9) − 6/12 log(6/11)]

Page 24: CS b553: Algorithms for Optimization and  Learning

ISSUE WITH ML SCORE

G1: X  Y (no edge)    G2: X → Y

Likelihood score

log L(θG1*, G1; D) − log L(θG2*, G2; D)
= 8 log(8/20) + 12 log(12/20) − 8 [3/8 log(3/9) + 5/8 log(5/11)] − 12 [6/12 log(6/9) + 6/12 log(6/11)]
= 8 [log(8/20) − 3/8 log(3/9) − 5/8 log(5/11)] + 12 [log(12/20) − 6/12 log(6/9) − 6/12 log(6/11)]
= −20 Σ_{x,y} P̂(x,y) log[ P̂(y|x) / P̂(y) ]
= −20 Σ_{x,y} P̂(x,y) log[ P̂(x,y) / (P̂(x) P̂(y)) ]
= −20 · I_P̂(X;Y) ≤ 0   (with P̂ the empirical distribution)

Page 27: CS b553: Algorithms for Optimization and  Learning

MUTUAL INFORMATION PROPERTIES

I_P(X;Y) = Σ_{x,y} P(x,y) log[ P(x,y) / (P(x) P(y)) ]   (the mutual information between X and Y)

I_P(X;Y) = D(P(X,Y) || Q) with Q(x,y) = P(x)P(y), so I_P(X;Y) ≥ 0 by nonnegativity of the KL divergence.

Implication: ML scores do not decrease for more connected graphs => overfitting to data!
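To tie this back to the coin example, here is a small check (illustrative, not from the slides) that the likelihood-score gap computed earlier equals −M · I_P̂(X;Y):

from math import log

M = 20
# empirical joint over (X, Y) from the 20 flips
p_joint = {("H", "H"): 3/20, ("H", "T"): 6/20, ("T", "H"): 5/20, ("T", "T"): 6/20}
p_x = {"H": 9/20, "T": 11/20}
p_y = {"H": 8/20, "T": 12/20}

# empirical mutual information I(X;Y)
mi = sum(p * log(p / (p_x[x] * p_y[y])) for (x, y), p in p_joint.items())

print(mi)          # approx 0.0077
print(-M * mi)     # approx -0.15, matching log L(G1*) - log L(G2*) above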

Page 28: CS b553: Algorithms for Optimization and  Learning

POSSIBLE SOLUTIONS

Fix the complexity of graphs (e.g., bounded in-degree); see HW7
Penalize complex graphs
Bayesian scores

Page 29: CS b553: Algorithms for Optimization and  Learning

IDEA OF BAYESIAN SCORING

Note that parameters are uncertain. Bayesian approach: put a prior on parameter values and marginalize them out:
P(D|G) = ∫ P(D | θG, G) P(θG | G) dθG

For example, use Beta/Dirichlet priors => the marginal is manageable to compute
E.g., uniform hyperparameters over the network: set virtual counts to 2^(−|Pa_Xi|)
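For reference, a sketch of the standard closed form of this marginal under independent Dirichlet priors (not taken from the slides; here α_{i,u,x} are the virtual counts for X_i = x given parent assignment u, α_{i,u} = Σ_x α_{i,u,x}, and M[u,x], M[u] are the corresponding data counts):

P(D \mid G) = \prod_{i} \prod_{u \in \mathrm{Val}(\mathrm{Pa}_{X_i})}
  \frac{\Gamma(\alpha_{i,u})}{\Gamma(\alpha_{i,u} + M[u])}
  \prod_{x \in \mathrm{Val}(X_i)}
  \frac{\Gamma(\alpha_{i,u,x} + M[u,x])}{\Gamma(\alpha_{i,u,x})}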

Page 30: CS b553: Algorithms for Optimization and  Learning

LARGE SAMPLE APPROXIMATION

log P(D|G) = log L(θG*, G; D) − (½ log M) Dim[G] + O(1)

with M the number of samples and Dim[G] the number of free parameters of G.

Bayesian Information Criterion (BIC) score:
ScoreBIC(G;D) = log L(θG*, G; D) − (½ log M) Dim[G]

Page 31: CS b553: Algorithms for Optimization and  Learning

LARGE SAMPLE APPROXIMATION

log P(D|G) = log L(θG*, G; D) − (½ log M) Dim[G] + O(1)

with M the number of samples and Dim[G] the number of free parameters of G.

Bayesian Information Criterion (BIC) score:
ScoreBIC(G;D) = log L(θG*, G; D) − (½ log M) Dim[G]

The likelihood term rewards fitting the data set; the penalty term prefers simple models.
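A direct transcription of the BIC score into code (illustrative; log_lik, M, and dim_g are assumed to be supplied by the caller):

from math import log

def bic_score(log_lik: float, M: int, dim_g: int) -> float:
    """Bayesian Information Criterion score for a structure G.

    log_lik: log-likelihood at the ML parameters, log L(theta_G*, G; D)
    M:       number of samples
    dim_g:   number of free parameters of G
    """
    return log_lik - 0.5 * log(M) * dim_g

# Coin example: G1 has 2 free parameters, G2 has 3.
print(bic_score(-27.22, 20, 2))   # approx -30.22
print(bic_score(-27.07, 20, 3))   # approx -31.56: BIC now prefers the simpler G1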

Page 32: CS b553: Algorithms for Optimization and  Learning

STRUCTURE OPTIMIZATION, GIVEN A SCORE…

The problem is well-defined, but combinatorially complex! The number of structures is superexponential in the number of variables.

Idea: search locally through the space of graphs using graph operators:
Add edge
Delete edge
Reverse edge

Page 33: CS b553: Algorithms for Optimization and  Learning

SEARCH STRATEGIES

Greedy: pick the operator that leads to the greatest score. Local optima? Plateaux?

Overcoming plateaux:
Search with basin flooding
Tabu search
Perturbation methods (similar to simulated annealing, except on data weighting)

Implementation details:
Evaluate Δ's between structures quickly (local decomposability)
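A minimal sketch of greedy hill-climbing with the three edge operators (illustrative, not from the slides; it assumes a caller-supplied score function such as the BIC score above, and it does not include tabu search or the fast local Δ evaluation mentioned here):

from itertools import permutations

def neighbors(edges, variables):
    """All graphs reachable by one add / delete / reverse edge operation."""
    result = []
    for a, b in permutations(variables, 2):
        if (a, b) in edges:
            result.append(edges - {(a, b)})                   # delete edge
            result.append((edges - {(a, b)}) | {(b, a)})      # reverse edge
        elif (b, a) not in edges:
            result.append(edges | {(a, b)})                   # add edge
    return result

def is_acyclic(edges, variables):
    """Simple DFS cycle check over the directed edge set."""
    children = {v: [b for (a, b) in edges if a == v] for v in variables}
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in variables}
    def dfs(v):
        color[v] = GRAY
        for c in children[v]:
            if color[c] == GRAY or (color[c] == WHITE and dfs(c)):
                return True              # found a cycle
        color[v] = BLACK
        return False
    return not any(color[v] == WHITE and dfs(v) for v in variables)

def greedy_search(variables, score, max_iters=100):
    """Hill-climb on Score(G;D) using add/delete/reverse-edge moves.

    score: function taking a set of directed edges and returning its score
           (e.g., BIC computed on the data); supplied by the caller.
    """
    current, current_score = frozenset(), score(frozenset())
    for _ in range(max_iters):
        candidates = [frozenset(g) for g in neighbors(current, variables)
                      if is_acyclic(g, variables)]
        if not candidates:
            break
        best = max(candidates, key=score)
        if score(best) <= current_score:
            break                        # local optimum (or plateau) reached
        current, current_score = best, score(best)
    return current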

Page 34: CS b553: Algorithms for Optimization and  Learning

RECAP

Bayes net structure learning: at best, we recover an equivalence class of networks that encode the same conditional independences.

Constraint-based methods: statistical independence tests
Score-based methods: learning => optimization