Learning I

Excerpts from tutorial at: http://robotics.stanford.edu/people/nir/tutorials/index.html

© 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved.

Bayesian Networks

Qualitative part: a directed acyclic graph (DAG) encoding statistical independence statements (causality!)

  Nodes - random variables of interest (with exhaustive and mutually exclusive states)

  Edges - direct (causal) influence

Quantitative part: local probability models - a set of conditional probability distributions

[Figure: example network — Earthquake → Radio; Earthquake, Burglary → Alarm; Alarm → Call]

Example CPT, P(A | E, B) (row order assumed from the slide):

  E   B   |  P(a)   P(¬a)
  e   b   |  0.9    0.1
  e   ¬b  |  0.2    0.8
  ¬e  b   |  0.9    0.1
  ¬e  ¬b  |  0.01   0.99
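Below is a minimal sketch (not from the tutorial) of how the qualitative/quantitative split can be encoded: the structure as a parent list, the quantitative part as one CPD per node, and the joint obtained by multiplying the local models. The P(A | E, B) values follow the table above; the numbers for E, B, R, and C are placeholders I invented for illustration.

```python
# Qualitative part: parents of each node (the DAG).
parents = {'E': [], 'B': [], 'R': ['E'], 'A': ['E', 'B'], 'C': ['A']}

# Quantitative part: P(node = true | parent values), keyed by parent tuple.
# P(A | E, B) is from the slide; the other entries are assumed placeholders.
cpds = {
    'E': {(): 0.01},
    'B': {(): 0.02},
    'R': {(True,): 0.95, (False,): 0.01},
    'A': {(True, True): 0.9, (True, False): 0.2,
          (False, True): 0.9, (False, False): 0.01},
    'C': {(True,): 0.7, (False,): 0.05},
}

def joint(assignment):
    """P(E, B, R, A, C) as the product of the local conditional models."""
    p = 1.0
    for var in parents:
        key = tuple(assignment[pa] for pa in parents[var])
        p_true = cpds[var][key]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

print(joint({'E': False, 'B': True, 'R': False, 'A': True, 'C': True}))
```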


Learning Bayesian networks (reminder)

[Figure: Data + prior information → Inducer → network structure (E → R; E, B → A; A → C) together with its CPTs, e.g. P(A | E, B) (row order assumed):

  E   B   |  P(a)   P(¬a)
  e   b   |  .9     .1
  e   ¬b  |  .7     .3
  ¬e  b   |  .8     .2
  ¬e  ¬b  |  .99    .01
]


The Learning Problem

                   Known Structure                      Unknown Structure
  Complete Data    Statistical parametric estimation    Discrete optimization over
                   (closed-form eq.)                    structures (discrete search)
  Incomplete Data  Parametric optimization              Combined (Structural EM,
                   (EM, gradient descent...)            mixture models...)


The Learning Problem: Known Structure, Complete Data

(Same table as above; this is the statistical-parametric-estimation cell, with closed-form equations.)

[Figure: the Inducer is given the structure E, B → A with unknown CPT entries ("?") and complete data over (E, B, A):
  <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>
It outputs the same structure with the CPT P(A | E, B) filled in (.9/.1, .7/.3, .8/.2, .99/.01).]


The Learning Problem: Known Structure, Incomplete Data

(Same table; this is the parametric-optimization cell: EM, gradient descent, ...)

[Figure: the Inducer is given the structure E, B → A with unknown CPT entries ("?") and incomplete data over (E, B, A), where "?" marks a missing value:
  <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>
It outputs the structure with P(A | E, B) filled in.]


The Learning Problem: Unknown Structure, Complete Data

(Same table; this is the discrete-optimization-over-structures cell: discrete search.)

[Figure: the Inducer is given only complete data over (E, B, A):
  <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>
and must recover both the structure E, B → A and its CPTs.]


The Learning Problem: Unknown Structure, Incomplete Data

(Same table; this is the combined cell: Structural EM, mixture models, ...)

[Figure: the Inducer is given only incomplete data over (E, B, A):
  <Y,N,N>, <Y,?,Y>, <N,N,Y>, <?,Y,Y>, ..., <N,Y,?>
and must recover both the structure E, B → A and its CPTs.]


Learning Parameters for the Burglary Story

[Figure: network with E, B → A and A → C]

Training data: $D = \langle E[1], B[1], A[1], C[1] \rangle, \ldots, \langle E[M], B[M], A[M], C[M] \rangle$

$L(\Theta : D) = \prod_{m=1}^{M} P(E[m], B[m], A[m], C[m] : \Theta)$    (i.i.d. samples)

$= \prod_m P(E[m] : \Theta)\, P(B[m] : \Theta)\, P(A[m] \mid B[m], E[m] : \Theta)\, P(C[m] \mid A[m] : \Theta)$    (network factorization)

$= \prod_m P(E[m] : \Theta_E) \cdot \prod_m P(B[m] : \Theta_B) \cdot \prod_m P(A[m] \mid B[m], E[m] : \Theta_{A \mid B,E}) \cdot \prod_m P(C[m] \mid A[m] : \Theta_{C \mid A})$

⇒ We have 4 independent estimation problems
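A minimal sketch of what this decomposition buys with complete data: each CPD is estimated by counting, independently of the others (the closed-form cell of the table earlier). The data rows and the helper function are illustrative, not from the tutorial.

```python
from collections import Counter

# Complete instances <E, B, A, C> (made-up example data)
data = [('Y','N','N','N'), ('Y','Y','Y','Y'), ('N','N','N','N'),
        ('N','Y','Y','N'), ('Y','Y','Y','Y')]

def mle_conditional(data, child_idx, parent_idx):
    """Maximum-likelihood estimate of P(child | parents) by counting."""
    joint, marg = Counter(), Counter()
    for row in data:
        pa = tuple(row[i] for i in parent_idx)
        joint[(pa, row[child_idx])] += 1
        marg[pa] += 1
    return {k: v / marg[k[0]] for k, v in joint.items()}

# Four independent estimation problems, one per CPD:
p_e = mle_conditional(data, 0, [])       # P(E)
p_b = mle_conditional(data, 1, [])       # P(B)
p_a = mle_conditional(data, 2, [0, 1])   # P(A | E, B)
p_c = mle_conditional(data, 3, [2])      # P(C | A)
```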


Incomplete Data

Data is often incomplete: some variables of interest are not assigned a value

This happens when we have:
  Missing values
  Hidden variables


Missing Values

Examples:
  Survey data
  Medical records - not all patients undergo all possible tests


Missing Values (cont.)

Complicating issue: the fact that a value is missing might be indicative of its value
  e.g., the patient did not undergo an X-ray since she complained about fever and not about broken bones...

To learn from incomplete data we need the following assumption:

Missing at Random (MAR): the probability that the value of Xi is missing is independent of its actual value, given the other observed values


Hidden (Latent) Variables

Attempt to learn a model with variables we never observe
  In this case, MAR always holds

Why should we care about unobserved variables? They can yield far more compact models:

[Figure: two networks over binary variables. With hidden variable H (X1, X2, X3 → H; H → Y1, Y2, Y3): 17 parameters. With H marginalized out, each Yi comes to depend on all the Xi (and the Yi become interconnected): 59 parameters.]
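A quick check of the counts (binary variables, one free parameter per parent configuration): with H, $3 \cdot 1 + 2^3 + 3 \cdot 2 = 17$ parameters (the three $X_i$, $H$ given $X_1, X_2, X_3$, and each $Y_i$ given $H$). One way to read the marginalized figure that reproduces the slide's count: each $Y_i$ depends on all the $X_i$ and on the preceding $Y_j$, giving $3 + 2^3 + 2^4 + 2^5 = 59$.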


Learning Parameters from Incomplete Data (cont.)

In the presence of incomplete data, the likelihood can have multiple global maxima

Example: we can rename the values of a hidden variable H
  If H has two values, the likelihood has two global maxima

Similarly, local maxima are also replicated
  With many hidden variables, this becomes a serious problem

[Figure: network H → Y]


Gradient Ascent

Main result:

  $\frac{\partial \log P(D \mid \Theta)}{\partial \theta_{x_i, pa_i}} = \frac{1}{\theta_{x_i, pa_i}} \sum_m P(x_i, pa_i \mid o[m], \Theta)$

Requires computation of $P(x_i, pa_i \mid o[m], \Theta)$ for all i, m

Pros:
  Flexible
  Closely related to methods in neural network training

Cons:
  Need to project the gradient onto the space of legal parameters
  To get reasonable convergence we need to combine it with "smart" optimization techniques
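To make the update concrete, here is a hedged sketch of one gradient-ascent step for the tiny network H → Y from the earlier slide, with H hidden and only Y observed. The learning rate, the clip-and-renormalize projection, the choice to hold P(H) fixed, and all names are my assumptions, not the tutorial's.

```python
import numpy as np

rng = np.random.default_rng(0)
th_h = 0.6                           # P(H=1); held fixed here for brevity
th_y = np.array([0.3, 0.8])          # th_y[h] = P(Y=1 | H=h)
data = rng.integers(0, 2, size=100)  # observed values of Y only

def ascent_step(th_h, th_y, data, lr=0.01):
    grad = np.zeros((2, 2))          # gradient w.r.t. theta[h, y] = P(Y=y | H=h)
    for y in data:
        p_h = np.array([1.0 - th_h, th_h])
        lik = th_y if y == 1 else 1.0 - th_y       # P(y | h) for both h
        post = p_h * lik
        post /= post.sum()                          # P(h | o[m], Theta), exact here
        for h in (0, 1):
            theta = th_y[h] if y == 1 else 1.0 - th_y[h]
            grad[h, y] += post[h] / theta           # the slide's formula, term by term
    probs = np.stack([1.0 - th_y, th_y], axis=1)    # rows: h, cols: y
    probs = np.clip(probs + lr * grad, 1e-6, None)
    probs /= probs.sum(axis=1, keepdims=True)       # crude projection onto the simplex
    return probs[:, 1]

for _ in range(20):
    th_y = ascent_step(th_h, th_y, data)
```

The gradient of each entry is exactly the slide's sum of posteriors divided by the current parameter value; the final renormalization is a simplistic stand-in for projecting onto the space of legal parameters.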


Expectation Maximization (EM)

A general purpose method for learning from incomplete data

Intuition:
  If we had access to complete counts, we could estimate the parameters
  However, missing values do not allow us to perform these counts
  So we "complete" the counts using the current parameter assignment

[Figure: a small data set over X, Y, Z with missing entries ("?"). Using the current model, e.g. P(Y=H | X=T, Θ) = 0.4 and P(Y=H | X=H, Z=T, Θ) = 0.3, each incomplete row contributes fractionally, yielding expected counts N(X, Y) such as 1.3, 0.4, 1.7, 1.6.]
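The following small sketch (with made-up numbers) shows the completion step: observed rows contribute whole counts, while a row with a missing Y contributes fractionally according to the current P(Y | X, Θ).

```python
from collections import defaultdict

# Current model: P(Y='H' | X=x) -- hypothetical numbers in the slide's spirit
p_y_given_x = {'H': 0.3, 'T': 0.4}

# Data over (X, Y); None marks a missing value
data = [('T', 'H'), ('H', None), ('T', None), ('H', 'T'), ('H', 'H')]

counts = defaultdict(float)          # expected counts N(X, Y)
for x, y in data:
    if y is not None:
        counts[(x, y)] += 1.0        # observed: an ordinary count
    else:
        p = p_y_given_x[x]           # missing: split the count by the posterior
        counts[(x, 'H')] += p
        counts[(x, 'T')] += 1.0 - p

print(dict(counts))
```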


EM (cont.)

[Figure: one round of EM on the network X1, X2, X3 → H → Y1, Y2, Y3.

  Initial network (G, Θ0) + training data
    → (E-step) computation of expected counts:
        N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)
    → (M-step) reparameterize
    → updated network (G, Θ1)
    → reiterate]


EM (cont.)

Formal guarantees:
  $L(\Theta_1 : D) \geq L(\Theta_0 : D)$ - each iteration improves the likelihood
  If $\Theta_1 = \Theta_0$, then $\Theta_0$ is a stationary point of $L(\Theta : D)$
    Usually, this means a local maximum

Main cost: computation of the expected counts in the E-step
  Requires a computation (inference) pass for each instance in the training set
  These are exactly the same computations as for gradient ascent!


Example: EM in clustering

Consider the clustering model (Cluster → X1, X2, ..., Xn)

E-step:
  Compute $P(C[m] \mid x_1[m], \ldots, x_n[m], \Theta)$ for each instance m
    This corresponds to a "soft" assignment to clusters
  Compute the expected sufficient statistics:

    $E[N(x_i, c)] = \sum_{m : X_i[m] = x_i} P(c \mid x_1[m], \ldots, x_n[m], \Theta)$

M-step:
  Re-estimate P(Xi | C) and P(C)
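A compact sketch of the whole loop for this clustering model, for binary features. The data, the cluster count, and the iteration cap are illustrative, and a real implementation would work in log space and smooth the counts.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 3))   # 200 instances, n = 3 binary features
K = 2                                    # number of clusters (assumed)

pi = np.full(K, 1.0 / K)                 # P(C)
theta = rng.uniform(0.3, 0.7, size=(K, X.shape[1]))   # P(Xi=1 | C)

for _ in range(50):
    # E-step: soft assignment P(C=c | x[m], Theta) for every instance
    lik = np.ones((X.shape[0], K))
    for c in range(K):
        lik[:, c] = np.prod(np.where(X == 1, theta[c], 1 - theta[c]), axis=1)
    resp = lik * pi
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate P(C) and P(Xi | C) from the expected counts
    Nc = resp.sum(axis=0)                # E[N(c)]
    pi = Nc / Nc.sum()
    theta = (resp.T @ X) / Nc[:, None]   # E[N(c, xi=1)] / E[N(c)]
```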


EM in Practice

Initial parameters:
  Random parameter setting
  "Best" guess from another source

Stopping criteria:
  Small change in the likelihood of the data
  Small change in the parameter values

Avoiding bad local maxima:
  Multiple restarts
  Early "pruning" of unpromising ones

Speed-up: various methods to accelerate convergence


Why Struggle for an Accurate Structure?

Adding an arc:
  Increases the number of parameters to be fitted
  Makes wrong assumptions about causality and domain structure

Missing an arc:
  Cannot be compensated for by accurate fitting of the parameters
  Also misses causality and domain structure

[Figure: the network Earthquake, Burglary → Alarm Set → Sound, flanked by a variant with an added arc and a variant with a missing arc.]


Minimum Description Length (cont.)

Computing the description length of the data, we get:

  $DL(D : G) = DL(G) + \frac{\log M}{2}\dim(G) - l(G : D)$

where DL(G) is the # bits to encode G, $\frac{\log M}{2}\dim(G)$ is the # bits to encode $\Theta_G$, and $-l(G : D)$ is the # bits to encode D using $(G, \Theta_G)$.

Minimizing this term is equivalent to maximizing the MDL score:

  $MDL(G : D) = l(G : D) - \frac{\log M}{2}\dim(G) - DL(G)$
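As a sketch, the score can be computed directly from counts for binary variables; the per-family decomposition of l(G:D) at the ML parameters is standard, while the simple structure-encoding term DL(G) used here is an illustrative assumption.

```python
import math
from collections import Counter

def mdl_score(data, parents):
    """MDL(G:D) = l(G:D) - (log M / 2) * dim(G) - DL(G), binary variables.

    data: list of dicts mapping variable name -> 0/1
    parents: dict mapping each variable -> list of its parents (defines G)
    """
    M, n = len(data), len(parents)
    loglik, dim = 0.0, 0
    for child, pa in parents.items():
        joint, marg = Counter(), Counter()
        for row in data:
            key = tuple(row[p] for p in pa)
            joint[(key, row[child])] += 1
            marg[key] += 1
        # the log-likelihood at the ML parameters decomposes per family
        for (key, _), c in joint.items():
            loglik += c * math.log(c / marg[key])
        dim += 2 ** len(pa)              # free parameters for this family
    dl_graph = sum(len(pa) for pa in parents.values()) * math.log2(max(n, 2))
    return loglik - (math.log(M) / 2) * dim - dl_graph
```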


Heuristic Search

We address the problem by using heuristic search

Define a search space:
  nodes are possible structures
  edges denote adjacency of structures

Traverse this space looking for high-scoring structures

Search techniques:
  Greedy hill-climbing
  Best-first search
  Simulated annealing
  ...


Heuristic Search (cont.)

Typical operations:

[Figure: a network over S, C, E, D, with three neighbors obtained by the operations:
  Add C → D
  Remove C → E
  Reverse C → E]


Exploiting Decomposability in Local Search

Caching: to update the score after a local change, we only need to re-score the families that were changed in the last move

[Figure: the same S, C, E, D networks; after each one-edge move, only the affected family's score must be recomputed.]


Greedy Hill-Climbing

Simplest heuristic local search

Start with a given network:
  • empty network
  • best tree
  • a random network

At each iteration:
  • Evaluate all possible changes
  • Apply the change that leads to the best improvement in score
  • Reiterate

Stop when no modification improves the score

Each step requires evaluating approximately n new changes
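A sketch of this loop over structures, using the add/remove/reverse operators from the earlier slide; `score` could be, e.g., the mdl_score sketch above. For clarity it re-scores whole structures rather than caching family scores as the decomposability slide recommends; the data structures and acyclicity check are illustrative.

```python
import itertools

def is_acyclic(parents):
    seen, done = set(), set()
    def visit(v):
        if v in done: return True
        if v in seen: return False          # back edge: cycle
        seen.add(v)
        ok = all(visit(p) for p in parents[v])
        done.add(v)
        return ok
    return all(visit(v) for v in parents)

def neighbors(parents):
    """All structures one edge-change away (add / remove / reverse)."""
    for u, v in itertools.permutations(parents, 2):
        g = {k: set(ps) for k, ps in parents.items()}
        if u in g[v]:
            g[v].discard(u)
            yield g                          # remove u -> v (always acyclic)
            g2 = {k: set(ps) for k, ps in g.items()}
            g2[u].add(v)                     # reverse u -> v
            if is_acyclic(g2): yield g2
        else:
            g[v].add(u)                      # add u -> v
            if is_acyclic(g): yield g

def hill_climb(data, parents, score):
    best = score(data, parents)
    while True:
        cand = max(neighbors(parents), key=lambda g: score(data, g), default=None)
        if cand is None or score(data, cand) <= best:
            return parents                   # no modification improves the score
        parents, best = cand, score(data, cand)
```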


Greedy Hill-Climbing (cont.)

Greedy hill-climbing can get stuck in:

  Local maxima:
    • All one-edge changes reduce the score

  Plateaus:
    • Some one-edge changes leave the score unchanged

Both occur in the search space


Greedy Hill-Climbing (cont.)

To avoid these problems, we can use:

TABU search:
  Keep a list of the K most recently visited structures
  Apply the best move that does not lead to a structure in the list
  This escapes plateaus and local maxima whose "basin" is smaller than K structures

Random restarts:
  Once stuck, apply some fixed number of random edge changes and restart the search
  This can escape from the basin of one maximum to another


Other Local Search Heuristics

Stochastic first-ascent hill-climbing:
  Evaluate possible changes at random
  Apply the first one that leads "uphill"
  Stop after a fixed number of "unsuccessful" attempts to change the current candidate

Simulated annealing:
  Similar idea, but also apply "downhill" changes, with a probability that depends on the change in score
  Use a temperature to control the amount of random downhill steps
  Slowly "cool" the temperature to reach a regime where only strict uphill moves are performed
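A small sketch of the annealing acceptance rule just described: uphill moves are always taken, downhill moves with probability exp(Δ/T), and the temperature is slowly lowered. The schedule and the stand-in move generator are assumptions for illustration.

```python
import math, random

def accept(delta, T):
    """Accept a candidate move whose score change is `delta` at temperature T."""
    return delta > 0 or random.random() < math.exp(delta / T)

T = 10.0
for step in range(1000):
    delta = random.gauss(0, 1)      # stand-in for scoring a candidate move
    if accept(delta, T):
        pass                         # apply the move here
    T *= 0.995                       # slowly "cool" the temperature
```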


Examples

Predicting heart disease
  Features: cholesterol, chest pain, angina, age, etc.
  Class: {present, absent}

Finding lemons in cars
  Features: make, brand, miles per gallon, acceleration, etc.
  Class: {normal, lemon}

Digit recognition
  Features: matrix of pixel descriptors
  Class: {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}

Speech recognition
  Features: signal characteristics, language model
  Class: {pause/hesitation, retraction}


Some Applications

Biostatistics -- Medical Research Council (Bugs)

Data Analysis -- NASA (AutoClass)

Collaborative filtering -- Microsoft (MSBN)

Fraud detection -- AT&T

Classification -- SRI (TAN-BLT)

Speech recognition -- UC Berkeley