CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Structure Learning
AGENDA
Learning probability distributions from example data
To what extent can Bayes net structure be learned?
Constraint methods (inferring conditional independence)
Scoring methods (learning => optimization)
BASIC QUESTION
Given examples drawn from a distribution P* with independence relations given by the Bayesian network structure G*, can we recover G*, or at least construct a network that encodes the same independence relations as G*?
[Figure: the true structure G* alongside two candidate structures G1 and G2]
LEARNING IN THE FACE OF NOISY DATA
Ex: flip two independent coins. Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT.
Model 1: X and Y with no edge (independent). Model 2: X → Y.
ML parameters:
Model 1: P(X=H) = 9/20, P(Y=H) = 8/20
Model 2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11
In Model 2 the errors are likely to be larger, since each conditional estimate is based on fewer samples!
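As a concrete illustration (my own sketch, not from the slides), a few lines of Python recover these ML estimates from the raw counts; the variable names are my choices:

```python
from collections import Counter

# Dataset from the slides: 20 flips of two independent coins.
data = [("H", "H")] * 3 + [("H", "T")] * 6 + [("T", "H")] * 5 + [("T", "T")] * 6

n = len(data)  # 20 samples
counts = Counter(data)
n_x_h = sum(1 for x, _ in data if x == "H")  # 9
n_y_h = sum(1 for _, y in data if y == "H")  # 8

# Model 1 (no edge): X and Y each get a marginal ML estimate.
p_x_h = n_x_h / n  # 9/20
p_y_h = n_y_h / n  # 8/20

# Model 2 (X -> Y): Y's parameters are estimated separately per value of X,
# so each estimate uses fewer samples (data fragmentation).
p_y_h_given_x_h = counts[("H", "H")] / n_x_h        # 3/9
p_y_h_given_x_t = counts[("T", "H")] / (n - n_x_h)  # 5/11

print(f"Model 1: P(X=H)={p_x_h:.3f}, P(Y=H)={p_y_h:.3f}")
print(f"Model 2: P(Y=H|X=H)={p_y_h_given_x_h:.3f}, P(Y=H|X=T)={p_y_h_given_x_t:.3f}")
```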
PRINCIPLE
Learning structure must trade off fit of data vs. complexity of network.
Complex networks have more parameters to learn and more data fragmentation = greater sensitivity to noise.
APPROACH #1: CONSTRAINT-BASED LEARNING
First, identify an undirected skeleton of edges in G*
If an edge X-Y is in G*, then no subset of evidence variables can make X and Y independent
If X-Y is not in G*, then we can find evidence variables to make X and Y independent
Then, assign directionality to preserve independences
BUILD-SKELETON ALGORITHM
Given X = {X1, …, Xn} and a query Independent?(X, Y, U):
H = complete graph over X. For all pairs Xi, Xj, test separation as follows:
Enumerate all possible separating sets U. If Independent?(Xi, Xj, U), then remove Xi—Xj from H.
In practice:
• Must restrict to bounded-size subsets |U| ≤ d (i.e., assume G* has bounded degree), giving O(n²(n−2)^d) tests
• Independence can't be tested exactly
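A minimal Python sketch of Build-Skeleton under these assumptions: `independent` stands in for a statistical independence oracle (a hypothetical interface, wrapping a test such as the χ² test below), and `d` bounds the size of candidate separating sets:

```python
from itertools import combinations

def build_skeleton(variables, independent, d):
    """Sketch of Build-Skeleton: prune a complete graph using an
    independence oracle independent(i, j, u) -> bool (hypothetical)."""
    # H starts as the complete undirected graph over the variables.
    edges = {frozenset(p) for p in combinations(variables, 2)}
    sepsets = {}  # records the separating set found for each removed edge
    for i, j in combinations(variables, 2):
        others = [v for v in variables if v not in (i, j)]
        separated = False
        # Enumerate candidate separating sets of size at most d:
        # O(n^2 (n-2)^d) independence tests overall.
        for size in range(d + 1):
            if separated:
                break
            for u in combinations(others, size):
                if independent(i, j, set(u)):
                    edges.discard(frozenset((i, j)))
                    sepsets[frozenset((i, j))] = set(u)
                    separated = True
                    break
    return edges, sepsets
```

The separating sets are kept because the directionality step that follows consults them.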
ASSIGNING DIRECTIONALITY
Note that v-structures X → Y ← Z introduce a dependency between X and Z given Y. In the structures X → Y → Z, X ← Y ← Z, and X ← Y → Z, X and Z are independent given Y; in fact, Y must be given for X and Z to be independent.
Idea: look at the separating sets for all triples X—Y—Z in the skeleton without the edge X—Z:
Triangle (edge X—Z present): directionality is irrelevant.
Y ∈ U for the set U that separates X, Z: not a v-structure.
Y ∉ U for the set U that separates X, Z: a v-structure X → Y ← Z.
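Continuing the sketch, v-structures can be read off the skeleton and separating sets returned by `build_skeleton` above (again, the names and representation are my own, not the slides'): for each unshielded triple X—Y—Z, orient X → Y ← Z exactly when Y is absent from the set that separated X and Z.

```python
from itertools import combinations

def find_v_structures(edges, sepsets):
    """edges: set of frozenset({a, b}) undirected edges (the skeleton);
    sepsets: maps frozenset({x, z}) -> the separating set found for x, z."""
    v_structures = []
    nodes = {v for e in edges for v in e}
    for x, z in combinations(sorted(nodes), 2):
        if frozenset((x, z)) in edges:
            continue  # X-Z edge present: a triangle, no orientation here
        for y in nodes - {x, z}:
            adjacent = (frozenset((x, y)) in edges and
                        frozenset((y, z)) in edges)
            if adjacent and y not in sepsets.get(frozenset((x, z)), set()):
                # Y did not separate X and Z => orient X -> Y <- Z
                v_structures.append((x, y, z))
    return v_structures
```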
STATISTICAL INDEPENDENCE TESTING
Question: are X and Y independent?
Null hypothesis H0: X and Y are independent. Alternative hypothesis HA: X and Y are not independent.
χ² test: use the statistic
χ² = Σ_{x,y} (N(x,y) − M·P̂(x)·P̂(y))² / (M·P̂(x)·P̂(y))
with P̂ the empirical probability and N(x,y) the observed count for (x, y) out of M samples. We can compute (by table lookup) the probability of getting a value at least this extreme if H0 is true (the p-value).
If p < some threshold, e.g. 0.05 = 1 − 0.95, H0 is rejected.
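For instance, a χ² independence test on the two-coin contingency table might look like this in Python, using scipy's standard `chi2_contingency` (note that it applies a continuity correction to 2×2 tables by default, so its statistic differs slightly from the raw formula above):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table of observed counts N(x, y); rows = X, columns = Y.
# The two-coin dataset from earlier: HH=3, HT=6, TH=5, TT=6.
table = np.array([[3, 6],
                  [5, 6]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")

# Reject H0 (independence) only if p falls below the chosen threshold.
if p_value < 0.05:
    print("Reject H0: X and Y appear dependent")
else:
    print("Cannot reject H0: consistent with independence")
```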
APPROACH #2: SCORE-BASED METHODS
Learning => optimization: define a scoring function Score(G;D) that evaluates the quality of structure G, and optimize it. This is a combinatorial optimization problem.
Issues:
Choice of scoring function: maximum likelihood score, Bayesian score
Efficient optimization techniques
MAXIMUM-LIKELIHOOD SCORES
ScoreL(G;D) = likelihood of the BN with the most likely parameter settings under structure G.
Let L(θG, G; D) be the likelihood of the data using parameters θG with structure G.
Let θG* = arg max_θ L(θ, G; D), as described in the last lecture.
Then ScoreL(G;D) = L(θG*, G; D).
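Here is a minimal sketch of ScoreL (in log form) for complete discrete data, with a structure represented as a parent tuple per variable; the representation and the name `ml_score` are my choices, not the slides':

```python
import math
from collections import Counter

def ml_score(rows, parents):
    """Score_L(G; D): log-likelihood of the data under ML parameters.

    rows:    list of dicts, each mapping variable name -> observed value
    parents: dict mapping variable name -> tuple of its parents in G
    """
    score = 0.0
    for var, pa in parents.items():
        joint = Counter((tuple(r[p] for p in pa), r[var]) for r in rows)
        marg = Counter(tuple(r[p] for p in pa) for r in rows)
        # Each cell contributes N(x, pa) * log P_hat(x | pa),
        # where P_hat(x | pa) = N(x, pa) / N(pa) is the ML estimate.
        for (pa_val, _x_val), n in joint.items():
            score += n * math.log(n / marg[pa_val])
    return score
```

For the coin example, comparing `ml_score(rows, {'X': (), 'Y': ()})` against `ml_score(rows, {'X': (), 'Y': ('X',)})` reproduces the two scores derived next.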
ISSUE WITH ML SCORE
Independent coin example: G1 has X, Y with no edge; G2 has X → Y.
ML parameters:
G1: P(X=H) = 9/20, P(Y=H) = 8/20
G2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11
Likelihood scores:
log L(θG1*, G1; D) = 9 log(9/20) + 11 log(11/20) + 8 log(8/20) + 12 log(12/20)
log L(θG2*, G2; D) = 9 log(9/20) + 11 log(11/20) + 3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)
Taking the difference:
log L(θG1*, G1; D) − log L(θG2*, G2; D)
= 8 log(8/20) + 12 log(12/20) − [3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)]
= 8 log(8/20) + 12 log(12/20) − 8 [3/8 log(3/9) + 5/8 log(5/11)] − 12 [6/12 log(6/9) + 6/12 log(6/11)]
= 8 [log(8/20) − 3/8 log(3/9) − 5/8 log(5/11)] + 12 [log(12/20) − 6/12 log(6/9) − 6/12 log(6/11)]
= −20 Σ_{x,y} P̂(x,y) log [P̂(x,y) / (P̂(x) P̂(y))]
= −M Î(X;Y)
where Î(X;Y) is the mutual information between X and Y under the empirical distribution P̂, and M = 20 is the number of samples.
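The last step can be verified numerically from the counts on the slide (a quick sketch of my own): the gap between the two likelihood scores equals −M·Î(X;Y) up to floating point.

```python
import math

M = 20
# Empirical joint counts N(x, y) for the coin dataset.
N = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
Nx = {"H": 9, "T": 11}
Ny = {"H": 8, "T": 12}

# The terms in which the two scores differ, taken from the derivation.
score_g1_part = sum(n * math.log(n / M) for n in Ny.values())
score_g2_part = sum(n * math.log(n / Nx[x]) for (x, y), n in N.items())

# Mutual information I(X;Y) under the empirical distribution.
mi = sum((n / M) * math.log((n / M) / ((Nx[x] / M) * (Ny[y] / M)))
         for (x, y), n in N.items())

print(score_g1_part - score_g2_part)  # log L(G1) - log L(G2)
print(-M * mi)                        # the same value
```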
MUTUAL INFORMATION PROPERTIES
I(X;Y) = Σ_{x,y} P(x,y) log [P(x,y) / (P(x) P(y))] (the mutual information between X and Y)
= D(P(X,Y) || Q) with Q(x,y) = P(x)P(y)
≥ 0 by nonnegativity of KL divergence
Implication: ML scores do not decrease for more connected graphs => overfitting to data!
POSSIBLE SOLUTIONS
Fix complexity of graphs (e.g., bounded in-degree); see HW7.
Penalize complex graphs: Bayesian scores.
IDEA OF BAYESIAN SCORING
Note that parameters are uncertain. Bayesian approach: put a prior on parameter values and marginalize them out:
P(D|G) = ∫ P(D | θG, G) P(θG | G) dθG
For example, using Beta/Dirichlet priors makes the marginal manageable to compute: e.g., a uniform hyperparameter over the network, setting virtual counts to 2^−|Pa(Xi)|.
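As a sketch of why Dirichlet priors keep this integral manageable: the marginal likelihood for one variable's CPT has a closed form in Gamma functions. The function below is my own illustration with a symmetric prior of virtual count `alpha` per cell, not the exact prior choice on the slide:

```python
from collections import Counter
from scipy.special import gammaln

def log_marginal_family(rows, var, pa, alpha=1.0):
    """log P(D_var | Pa(var)) under a symmetric Dirichlet(alpha) prior.

    rows: list of dicts mapping variable name -> observed value.
    A sketch: the slide's uniform-hyperparameter choice would shrink
    alpha with the parent set size, e.g. proportional to 2^-|Pa(Xi)|.
    """
    vals = sorted({r[var] for r in rows})
    joint = Counter((tuple(r[p] for p in pa), r[var]) for r in rows)
    marg = Counter(tuple(r[p] for p in pa) for r in rows)
    logp = 0.0
    for pa_val, m in marg.items():
        # Dirichlet-multinomial marginal for this parent configuration:
        # Gamma(a)/Gamma(a+m) * prod_x Gamma(a_x + n_x)/Gamma(a_x).
        a_total = alpha * len(vals)
        logp += gammaln(a_total) - gammaln(a_total + m)
        for v in vals:
            logp += gammaln(alpha + joint[(pa_val, v)]) - gammaln(alpha)
    return logp
```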
LARGE SAMPLE APPROXIMATION
log P(D|G) = log L(θG*, G; D) − ½ log M · Dim[G] + O(1)
with M the number of samples and Dim[G] the number of free parameters of G.
Bayesian Information Criterion (BIC) score:
ScoreBIC(G;D) = log L(θG*, G; D) − ½ log M · Dim[G]
The first term rewards fitting the data set; the second prefers simple models.
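Combining the two terms, a BIC scorer is a thin wrapper over the `ml_score` sketch from the maximum-likelihood section (again my own names; `arities` gives the number of values per variable):

```python
import math

def bic_score(rows, parents, arities):
    """Score_BIC(G;D) = log L(theta_G*, G; D) - (1/2) log M * Dim[G].

    Reuses the ml_score sketch defined earlier; arities maps each
    variable name to the number of values it can take.
    """
    M = len(rows)
    dim = 0
    for var, pa in parents.items():
        # One CPT row per parent configuration, each contributing
        # (arity - 1) free parameters; summing gives Dim[G].
        n_configs = 1
        for p in pa:
            n_configs *= arities[p]
        dim += n_configs * (arities[var] - 1)
    return ml_score(rows, parents) - 0.5 * math.log(M) * dim
```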
STRUCTURE OPTIMIZATION, GIVEN A SCORE…
The problem is well-defined, but combinatorially complex! The number of structures is superexponential in the number of variables.
Idea: search locally through the space of graphs using graph operators: add edge, delete edge, reverse edge.
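A compact sketch of such a local search (the greedy strategy detailed on the next slide): it applies the add/delete/reverse operators, keeps the graph acyclic, and, purely for clarity, rescores each candidate from scratch rather than exploiting decomposability. All names are mine:

```python
from itertools import permutations

def greedy_structure_search(variables, score, max_iters=1000):
    """Greedy local search over DAG structures with add/delete/reverse
    edge operators. score maps a parents dict to a number (e.g., BIC)."""
    parents = {v: set() for v in variables}  # start from the empty graph

    def is_acyclic(pa):
        # Kahn-style check: repeatedly strip nodes with no remaining parents.
        remaining = {v: set(ps) for v, ps in pa.items()}
        while remaining:
            roots = [v for v, ps in remaining.items() if not ps]
            if not roots:
                return False  # every remaining node has a parent: a cycle
            for r in roots:
                del remaining[r]
            for ps in remaining.values():
                ps.difference_update(roots)
        return True

    def neighbors(pa):
        for x, y in permutations(variables, 2):
            new = {v: set(ps) for v, ps in pa.items()}
            if x in pa[y]:
                new[y].discard(x)      # delete edge x -> y
                yield new
                rev = {v: set(ps) for v, ps in new.items()}
                rev[x].add(y)          # reverse x -> y into y -> x
                yield rev
            else:
                new[y].add(x)          # add edge x -> y
                yield new

    best = score(parents)
    for _ in range(max_iters):
        moves = [(score(n), n) for n in neighbors(parents) if is_acyclic(n)]
        if not moves:
            break
        top_score, top = max(moves, key=lambda m: m[0])
        if top_score <= best:
            break  # local optimum or plateau: plain greedy stops here
        best, parents = top_score, top
    return parents, best
```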
SEARCH STRATEGIES
Greedy: pick the operator that leads to the greatest score improvement. Local optima? Plateaux?
Overcoming plateaux: search with basin flooding, tabu search, perturbation methods (similar to simulated annealing, except on data weighting).
Implementation detail: evaluate score Δ's between structures quickly (local decomposability).
RECAP
Bayes net structure learning: at best we recover the equivalence class of networks that encode the same conditional independences.
Constraint-based methods: statistical independence tests.
Score-based methods: learning => optimization.