CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Structure Learning

AGENDA

Learning probability distributions from example data

To what extent can Bayes net structure be learned?

Constraint methods (inferring conditional independence)

Scoring methods (learning => optimization)

BASIC QUESTION

Given examples drawn from a distribution P* with independence relations given by the Bayesian network structure G*, can we recover G*, or at least construct a network that encodes the same independence relations as G*?

[Figure: example networks G*, G1, G2]

LEARNING IN THE FACE OF NOISY DATA

Ex: flip two independent coins. Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT.

Model 1: X  Y (no edge)    Model 2: X → Y

ML parameters:

Model 1: P(X=H) = 9/20, P(Y=H) = 8/20

Model 2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11 (errors are likely to be larger!)
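To make the estimates concrete, here is a minimal Python sketch (not from the slides) that recovers both models' ML parameters from the raw counts:

    # ML parameter estimation for the two-coin example.
    counts = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
    M = sum(counts.values())  # 20 flips

    # Model 1 (no edge): marginal frequencies.
    p_x = sum(c for (x, _), c in counts.items() if x == "H") / M  # 9/20
    p_y = sum(c for (_, y), c in counts.items() if y == "H") / M  # 8/20
    print(f"Model 1: P(X=H) = {p_x}, P(Y=H) = {p_y}")

    # Model 2 (X -> Y): conditional frequencies of Y given X.
    for x in ("H", "T"):
        n_x = sum(c for (xx, _), c in counts.items() if xx == x)
        print(f"Model 2: P(Y=H | X={x}) = {counts[(x, 'H')]}/{n_x}")

Note how Model 2's conditionals are estimated from only 9 or 11 samples each, rather than all 20: this is the data fragmentation discussed on the next slide.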

PRINCIPLE

Learning structure must trade off fit to the data vs. complexity of the network.

Complex networks have more parameters to learn, and more data fragmentation means greater sensitivity to noise.

APPROACH #1: CONSTRAINT-BASED LEARNING

First, identify an undirected skeleton of edges in G*

If an edge X-Y is in G*, then no subset of evidence variables can make X and Y independent

If X-Y is not in G*, then we can find evidence variables to make X and Y independent

Then, assign directionality to preserve independences

BUILD-SKELETON ALGORITHM

Given X = {X1,…,Xn} and an independence query Independent?(X, Y, U):

H = complete undirected graph over X
For all pairs Xi, Xj, test separation as follows:
  Enumerate all possible separating sets U
  If Independent?(Xi, Xj, U), then remove Xi-Xj from H

In practice:
• Must restrict to bounded-size subsets |U| ≤ d (i.e., assume G* has bounded degree): O(n^2 (n-2)^d) tests
• Independence can't be tested exactly
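A minimal Python sketch of Build-Skeleton under these assumptions (the `independent` callable is a stand-in for whatever statistical test is used):

    from itertools import combinations

    def build_skeleton(variables, independent, d):
        """Recover the undirected skeleton, trying separating sets up to size d.

        independent(x, y, cond) is the (statistical) independence test;
        sep_sets records a witness separating set for each removed edge,
        which is needed later when assigning directionality.
        """
        edges = {frozenset(p) for p in combinations(variables, 2)}
        sep_sets = {}
        for x, y in combinations(variables, 2):
            others = [v for v in variables if v not in (x, y)]
            for size in range(d + 1):  # bounded-size candidate sets only
                found = next((set(c) for c in combinations(others, size)
                              if independent(x, y, set(c))), None)
                if found is not None:
                    edges.discard(frozenset((x, y)))
                    sep_sets[frozenset((x, y))] = found
                    break
        return edges, sep_sets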

ASSIGNING DIRECTIONALITY

Note that v-structures X→Y←Z introduce a dependency between X and Z given Y. In the structures X→Y→Z, X←Y←Z, and X←Y→Z, X and Z are independent given Y; in fact, Y must be given for X and Z to be independent. Idea: look at the separating sets for all triples X-Y-Z in the skeleton without edge X-Z.

Three cases for a triple X, Y, Z:

Triangle (edge X-Z present): directionality is irrelevant; the triple gives no orientation information.

Y ∈ U, where U separates X, Z: not a v-structure.

Y ∉ U, where U separates X, Z: a v-structure; orient X→Y←Z.
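A sketch of this orientation rule, reusing the hypothetical edges/sep_sets returned by build_skeleton above:

    from itertools import combinations

    def find_v_structures(variables, edges, sep_sets):
        """Orient X - Y - Z (no X-Z edge) as X -> Y <- Z whenever Y is absent
        from the separating set that made X and Z independent."""
        directed = set()
        for y in variables:
            nbrs = [v for v in variables if frozenset((v, y)) in edges]
            for x, z in combinations(nbrs, 2):
                if frozenset((x, z)) in edges:
                    continue  # triangle: no orientation information here
                if y not in sep_sets.get(frozenset((x, z)), set()):
                    directed.add((x, y))  # orient x -> y
                    directed.add((z, y))  # orient z -> y
        return directed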

STATISTICAL INDEPENDENCE TESTING

Question: are X and Y independent?

Null hypothesis H0: X and Y are independent.
Alternative hypothesis HA: X and Y are not independent.

χ² test: use the statistic

χ² = M Σx,y (P̂(x,y) − P̂(x)P̂(y))² / (P̂(x)P̂(y))

with P̂ the empirical probability and M the number of samples. We can compute (by table lookup) the probability of getting a value at least this extreme if H0 is true (the p-value).

If p < some threshold, e.g. 1 − 0.95 = 0.05, H0 is rejected.
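For the two-coin dataset, the test can be run with SciPy (a sketch; chi2_contingency computes a χ² statistic of this kind from a contingency table of counts):

    from scipy.stats import chi2_contingency

    # Contingency table for the 20-flip dataset: rows X in {H,T}, cols Y in {H,T}.
    table = [[3, 6],   # X=H: (Y=H, Y=T)
             [5, 6]]   # X=T: (Y=H, Y=T)

    # correction=False: skip the Yates correction to match the raw statistic.
    chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
    print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
    # A large p-value means we cannot reject H0: the data are consistent
    # with X and Y being independent.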

APPROACH #2: SCORE-BASED METHODS

Learning => optimization. Define a scoring function Score(G; D) that evaluates the quality of structure G, and optimize it. This is a combinatorial optimization problem.

Issues:
• Choice of scoring function: maximum likelihood score, Bayesian score
• Efficient optimization techniques

MAXIMUM-LIKELIHOOD SCORES

ScoreL(G; D) = likelihood of the BN with the most likely parameter settings under structure G.

Let L(θG, G; D) be the likelihood of the data using parameters θG with structure G. Let θG* = argmaxθ L(θ, G; D), as described in the last lecture.

Then ScoreL(G; D) = L(θG*, G; D).
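Since the ML score decomposes over families, it can be computed directly from counts. A sketch (not from the slides), where `data` and `parents` are hypothetical inputs, a list of assignments and a parent map for G:

    from collections import Counter
    from math import log

    def ml_log_score(data, parents):
        """log L(theta_G*, G; D) = sum over families of N[x,pa] * log(N[x,pa]/N[pa]).

        data: list of dicts mapping variable name -> value.
        parents: dict mapping each variable to a tuple of its parents in G.
        """
        score = 0.0
        for var, pa in parents.items():
            fam = Counter((tuple(d[p] for p in pa), d[var]) for d in data)
            pa_counts = Counter(tuple(d[p] for p in pa) for d in data)
            for (pa_val, x_val), n in fam.items():
                score += n * log(n / pa_counts[pa_val])
        return score

    # The coin example as a list of assignments:
    data = [{"X": "H", "Y": "H"}] * 3 + [{"X": "H", "Y": "T"}] * 6 \
         + [{"X": "T", "Y": "H"}] * 5 + [{"X": "T", "Y": "T"}] * 6
    print(ml_log_score(data, {"X": (), "Y": ()}))      # G1: no edges
    print(ml_log_score(data, {"X": (), "Y": ("X",)}))  # G2: X -> Y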

ISSUE WITH ML SCORE

Independent coin example: G1: X  Y (no edge); G2: X → Y.

ML parameters:

G1: P(X=H) = 9/20, P(Y=H) = 8/20

G2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11

Likelihood score:

log L(θG1*, G1; D) = 9 log(9/20) + 11 log(11/20) + 8 log(8/20) + 12 log(12/20)

log L(θG2*, G2; D) = 9 log(9/20) + 11 log(11/20) + 3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)

Taking the difference:

log L(θG1*, G1; D) − log L(θG2*, G2; D)
= 8 log(8/20) + 12 log(12/20) − [3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)]
= 8 log(8/20) + 12 log(12/20) − 8 [3/8 log(3/9) + 5/8 log(5/11)] − 12 [6/12 log(6/9) + 6/12 log(6/11)]
= 8 [log(8/20) − 3/8 log(3/9) − 5/8 log(5/11)] + 12 [log(12/20) − 6/12 log(6/9) − 6/12 log(6/11)]
= −20 Σx,y P̂(x,y) log [P̂(x,y) / (P̂(x)P̂(y))]
= −20 · I(X;Y)

where I(X;Y) is the empirical mutual information between X and Y (next slide).
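A quick numeric check of this derivation (a sketch using the example's counts):

    from math import log

    # Log-likelihood scores of the two structures on the 20-flip dataset.
    ll_g1 = 9*log(9/20) + 11*log(11/20) + 8*log(8/20) + 12*log(12/20)
    ll_g2 = (9*log(9/20) + 11*log(11/20)
             + 3*log(3/9) + 6*log(6/9) + 5*log(5/11) + 6*log(6/11))

    print(ll_g1 - ll_g2)  # ≈ -0.153: the denser G2 never scores worse than G1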

MUTUAL INFORMATION PROPERTIES

I(X;Y) = Σx,y P(x,y) log [P(x,y) / (P(x)P(y))] = D(P(X,Y) ‖ Q)

(the mutual information between X and Y), with Q(x,y) = P(x)P(y).

I(X;Y) ≥ 0 by nonnegativity of KL divergence.

Implication: ML scores do not decrease for more connected graphs => overfitting to data!
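A sketch verifying both facts on the example's empirical distribution: I(X;Y) ≥ 0, and the score gap equals −M · I(X;Y):

    from math import log

    counts = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
    M = sum(counts.values())

    def p_joint(x, y): return counts[(x, y)] / M
    def p_x(x): return sum(counts[(x, y)] for y in "HT") / M
    def p_y(y): return sum(counts[(x, y)] for x in "HT") / M

    # Empirical mutual information I(X;Y) = KL(P(X,Y) || P(X)P(Y)).
    mi = sum(p_joint(x, y) * log(p_joint(x, y) / (p_x(x) * p_y(y)))
             for x in "HT" for y in "HT")

    print(mi)        # ≈ 0.0076 nats, and always >= 0
    print(-M * mi)   # ≈ -0.153, matching log L(G1*) - log L(G2*)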

POSSIBLE SOLUTIONS

Fix the complexity of graphs (e.g., bounded in-degree); see HW7.

Penalize complex graphs: Bayesian scores.

IDEA OF BAYESIAN SCORING

Note that parameters are uncertain. Bayesian approach: put a prior on parameter values and marginalize them out:

P(D|G) = ∫ P(D | θG, G) P(θG | G) dθG

For example, use Beta/Dirichlet priors => the marginal is manageable to compute. E.g., a uniform hyperparameter over the network, setting virtual counts to 2^−|Pa(Xi)|.
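For a single binary variable with a Beta(α, α) prior, this marginal has a closed form. A sketch (the specific choice Beta(1, 1), i.e. a uniform prior, is an assumption for illustration):

    from math import lgamma, exp

    def log_marginal_bernoulli(heads, tails, alpha=1.0):
        """log P(D) for a Bernoulli variable under a Beta(alpha, alpha) prior:
        log [B(alpha + heads, alpha + tails) / B(alpha, alpha)]."""
        def log_beta(a, b):
            return lgamma(a) + lgamma(b) - lgamma(a + b)
        return log_beta(alpha + heads, alpha + tails) - log_beta(alpha, alpha)

    # Marginal likelihood of the X coin's 9 heads / 11 tails:
    print(exp(log_marginal_bernoulli(9, 11)))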

LARGE SAMPLE APPROXIMATION

log P(D|G) = log L(θG*; D) − ½ (log M) Dim[G] + O(1)

with M the number of samples and Dim[G] the number of free parameters of G.

Bayesian Information Criterion (BIC) score:

ScoreBIC(G; D) = log L(θG*; D) − ½ (log M) Dim[G]

The first term rewards fit to the data set; the second prefers simple models.
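A sketch comparing the BIC scores of G1 and G2 on the coin data (Dim[G1] = 2 free parameters, Dim[G2] = 3):

    from math import log

    M = 20
    ll_g1 = 9*log(9/20) + 11*log(11/20) + 8*log(8/20) + 12*log(12/20)
    ll_g2 = (9*log(9/20) + 11*log(11/20)
             + 3*log(3/9) + 6*log(6/9) + 5*log(5/11) + 6*log(6/11))

    bic_g1 = ll_g1 - 0.5 * log(M) * 2  # 2 params: P(X=H), P(Y=H)
    bic_g2 = ll_g2 - 0.5 * log(M) * 3  # 3 params: P(X=H), P(Y=H|X=H), P(Y=H|X=T)

    print(bic_g1, bic_g2)  # BIC prefers the simpler (and true) G1 here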

STRUCTURE OPTIMIZATION, GIVEN A SCORE…

The problem is well-defined, but combinatorially complex! The number of structures is superexponential in the number of variables.

Idea: search locally through the space of graphs using graph operators: add edge, delete edge, reverse edge.

SEARCH STRATEGIES

Greedy: pick the operator that leads to the greatest score. Local optima? Plateaux?

Overcoming plateaux: search with basin flooding, tabu search, perturbation methods (similar to simulated annealing, except on data weighting).

Implementation detail: evaluate score Δ's between structures quickly (local decomposability).
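A minimal greedy hill-climbing sketch (the `score` function, e.g. BIC, and the acyclicity check are assumptions, not from the slides):

    def greedy_search(variables, data, score, max_iters=100):
        """Greedy local search over DAGs using add/delete/reverse edge operators.

        score(edges, data) is assumed to be a structure score such as BIC.
        """
        def creates_cycle(edges, u, v):
            # Adding u -> v creates a cycle iff u is already reachable from v.
            stack, seen = [v], set()
            while stack:
                node = stack.pop()
                if node == u:
                    return True
                if node not in seen:
                    seen.add(node)
                    stack.extend(w for (p, w) in edges if p == node)
            return False

        edges, best = set(), score(set(), data)
        for _ in range(max_iters):
            candidates = []
            for u in variables:
                for v in variables:
                    if u == v:
                        continue
                    if (u, v) in edges:
                        candidates.append(edges - {(u, v)})  # delete edge
                        if not creates_cycle(edges - {(u, v)}, v, u):
                            candidates.append((edges - {(u, v)}) | {(v, u)})  # reverse
                    elif (v, u) not in edges and not creates_cycle(edges, u, v):
                        candidates.append(edges | {(u, v)})  # add edge
            if not candidates:
                break
            top = max(candidates, key=lambda e: score(e, data))
            if score(top, data) <= best:
                break  # local optimum or plateau: no operator improves the score
            edges, best = top, score(top, data)
        return edges, best

In practice, score deltas are computed locally: an operator touching edge (u, v) only changes the family score of the affected child, which is what makes greedy search affordable.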

RECAP

Bayes net structure learning: at best we can recover the equivalence class of networks that encode the same conditional independences.

Constraint-based methods: statistical independence tests.

Score-based methods: learning => optimization.