© 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved.
Learning I
Excerpts from Tutorial at: http://robotics.stanford.edu/people/nir/tutorials/index.html
Bayesian Networks
Qualitative part: statistical independence statements (causality!)
A directed acyclic graph (DAG):
Nodes: random variables of interest (with exhaustive, mutually exclusive states)
Edges: direct (causal) influence
Quantitative part: local probability models, i.e. a set of conditional probability distributions, one per node given its parents.
[Figure: the burglary network with nodes Earthquake, Burglary, Alarm, Radio, Call; edges Earthquake → Alarm, Burglary → Alarm, Earthquake → Radio, Alarm → Call; and a CPT P(A | E, B) with one row per (E, B) configuration (entries 0.9/0.1, 0.2/0.8, 0.01/0.99, 0.9/0.1).]
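As an illustration (not part of the original slides), the factorization encoded by the network can be sketched in code. All CPT numbers below are made up for the example, and the assumed edge set is E → A, B → A, E → R, A → C.

```python
from itertools import product

def joint(e, b, r, a, c):
    """P(E=e, B=b, R=r, A=a, C=c) via the network factorization
    P(E) P(B) P(R|E) P(A|E,B) P(C|A). All numbers are illustrative."""
    def p(prob_true, val):               # Bernoulli lookup
        return prob_true if val else 1.0 - prob_true
    P_E = 0.01                           # P(Earthquake)
    P_B = 0.02                           # P(Burglary)
    P_R = {True: 0.9, False: 0.01}       # P(Radio | Earthquake)
    P_A = {(True, True): 0.95, (True, False): 0.9,
           (False, True): 0.9, (False, False): 0.01}  # P(Alarm | E, B)
    P_C = {True: 0.8, False: 0.05}       # P(Call | Alarm)
    return (p(P_E, e) * p(P_B, b) * p(P_R[e], r)
            * p(P_A[(e, b)], a) * p(P_C[a], c))

# Sanity check: the 32 joint entries sum to 1.
total = sum(joint(*vals) for vals in product([True, False], repeat=5))
```

The point of the factorization is that five local tables (10 free numbers here) replace a full joint table over 32 states.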
Learning Bayesian networks (reminder)
Data + prior information are fed to an Inducer, which outputs a Bayesian network: a structure plus its CPTs.
[Figure: the Inducer maps data and prior information to the network over E, R, B, A, C, including the CPT P(A | E, B) (entries .9/.1, .7/.3, .99/.01, .8/.2).]
The Learning Problem
                  Known Structure                     Unknown Structure
Complete Data     Statistical parametric estimation   Discrete optimization over
                  (closed-form eq.)                   structures (discrete search)
Incomplete Data   Parametric optimization             Combined (Structural EM,
                  (EM, gradient descent, ...)         mixture models, ...)
Learning Problem: Known Structure, Complete Data
(The first cell of the table above: statistical parametric estimation with closed-form equations.)
E, B, A data, no missing values: <Y,N,N> <Y,Y,Y> <N,N,Y> <N,Y,Y> . . <N,Y,Y>
[Figure: the Inducer takes this data plus the fixed structure E → A ← B with unknown ("?") entries in P(A | E, B), and outputs the same structure with the CPT filled in (entries .9/.1, .7/.3, .99/.01, .8/.2).]
Learning Problem: Known Structure, Incomplete Data
(Parametric optimization: EM, gradient descent, ...)
E, B, A data with missing values: <Y,N,N> <Y,?,Y> <N,N,Y> <N,Y,?> . . <?,Y,Y>
[Figure: the Inducer takes this data plus the fixed structure E → A ← B with unknown ("?") entries in P(A | E, B), and outputs the same structure with the CPT filled in.]
Learning Problem: Unknown Structure, Complete Data
(Discrete optimization over structures: discrete search.)
E, B, A data, no missing values: <Y,N,N> <Y,Y,Y> <N,N,Y> <N,Y,Y> . . <N,Y,Y>
[Figure: the Inducer must now recover both the structure E → A ← B and the CPT P(A | E, B) from the data.]
Learning Problem: Unknown Structure, Incomplete Data
(Combined methods: Structural EM, mixture models, ...)
E, B, A data with missing values: <Y,N,N> <Y,?,Y> <N,N,Y> <?,Y,Y> . . <N,Y,?>
[Figure: the Inducer must recover both the structure E → A ← B and the CPT P(A | E, B) from incomplete data.]
Learning Parameters for the Burglary Story
[Figure: network with edges E → A, B → A, A → C]

Using i.i.d. samples and the network factorization:

  L(Θ : D) = Π_m P(E[m], B[m], A[m], C[m] : Θ)
           = Π_m P(C[m] | A[m] : Θ_C|A) · P(A[m] | B[m], E[m] : Θ_A|B,E) · P(B[m] : Θ_B) · P(E[m] : Θ_E)
           = [Π_m P(C[m] | A[m] : Θ_C|A)] · [Π_m P(A[m] | B[m], E[m] : Θ_A|B,E)] · [Π_m P(B[m] : Θ_B)] · [Π_m P(E[m] : Θ_E)]

We therefore have 4 independent estimation problems, one per family.
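To make the closed-form estimation concrete, here is a small sketch (with hypothetical data, not from the tutorial) of estimating one family's CPT from complete data by counting:

```python
from collections import Counter

def mle_cpt(samples):
    """Estimate P(A = 'Y' | E, B) from complete (E, B, A) samples by counting.

    samples: list of (e, b, a) tuples with values 'Y'/'N'.
    Returns a dict mapping each observed (e, b) to the ML estimate
    N(e, b, A='Y') / N(e, b).
    """
    joint = Counter()   # counts N(e, b, a)
    parent = Counter()  # counts N(e, b)
    for e, b, a in samples:
        joint[(e, b, a)] += 1
        parent[(e, b)] += 1
    return {(e, b): joint[(e, b, 'Y')] / n for (e, b), n in parent.items()}

# Hypothetical complete dataset in the <E, B, A> format of the slides.
data = [('Y', 'N', 'N'), ('Y', 'Y', 'Y'), ('N', 'N', 'Y'),
        ('N', 'Y', 'Y'), ('N', 'Y', 'Y')]
cpt = mle_cpt(data)
```

Each of the 4 families is estimated this way, independently of the others.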
Incomplete Data
Data is often incomplete: some variables of interest are not assigned values.
This phenomenon arises when we have missing values or hidden variables.
Missing Values
Examples: survey data; medical records (not all patients undergo all possible tests).
Missing Values (cont.)
Complicating issue: the fact that a value is missing might be indicative of its value. The patient did not undergo an X-ray since she complained about fever and not about broken bones...
To learn from incomplete data we need the following assumption:
Missing at Random (MAR): the probability that the value of Xi is missing is independent of its actual value, given the other observed values.
Hidden (Latent) Variables
Attempt to learn a model with variables we never observe; in this case, MAR always holds.
Why should we care about unobserved variables? They can make the model much more compact.
[Figure: two networks over X1, X2, X3 and Y1, Y2, Y3: one with a hidden variable H between the two layers (17 parameters), and one connecting the layers directly without H (59 parameters).]
Learning Parameters from Incomplete Data (cont.)
In the presence of incomplete data, the likelihood can have multiple global maxima.
Example: we can rename the values of a hidden variable H. If H has two values, the likelihood has two global maxima.
Similarly, local maxima are also replicated; many hidden variables make this a serious problem.
[Figure: H → Y]
Gradient Ascent
Main result:

  ∂ log P(D | Θ) / ∂ θ_{x_i, pa_i} = (1 / θ_{x_i, pa_i}) · Σ_m P(x_i, pa_i | o[m], Θ)

Requires the computation of P(x_i, Pa_i | o[m], Θ) for all i, m.
Pros: flexible; closely related to methods used in neural network training.
Cons: we need to project the gradient onto the space of legal parameters, and to get reasonable convergence we need to combine it with "smart" optimization techniques.
Expectation Maximization (EM)
A general-purpose method for learning from incomplete data.
Intuition: if we had access to complete counts, we could estimate parameters directly. However, missing values do not allow us to perform these counts. Instead, we "complete" the counts using the current parameter assignment.
[Figure: a data table over X, Y, Z with missing entries ('?'); under the current model, e.g. P(Y=H | X=T, Θ) = 0.4 and P(Y=H | X=H, Z=T, Θ) = 0.3, the missing entries are filled in softly, yielding expected counts N(X, Y) such as 1.3, 0.4, 1.7, 1.6.]
EM (cont.)
[Figure: the EM loop for the network over X1, X2, X3, hidden H, and Y1, Y2, Y3.]
Start with an initial network (G, Θ0).
E-step: from the training data and the current parameters, compute the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H).
M-step: reparameterize from the expected counts, yielding an updated network (G, Θ1).
Reiterate.
EM (cont.)
Formal guarantees:
L(Θ1 : D) ≥ L(Θ0 : D): each iteration improves the likelihood.
If Θ1 = Θ0, then Θ0 is a stationary point of L(Θ : D); usually, this means a local maximum.
Main cost: the computation of expected counts in the E-step requires an inference pass for each instance in the training set.
These are exactly the same computations as for gradient ascent!
Example: EM in clustering
Consider the clustering example: a naive Bayes model in which Cluster C is the parent of X1, X2, ..., Xn.
E-step: compute P(C[m] | x1[m], ..., xn[m], Θ). This corresponds to a "soft" assignment to clusters. Then compute the expected statistics:

  E[N(x_i, c)] = Σ_{m : X_i[m] = x_i} P(c | x_1[m], ..., x_n[m])

M-step: re-estimate P(Xi | C) and P(C) from these statistics.
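A minimal sketch of this E-step/M-step loop for a two-cluster naive Bayes model over binary features. The function name, the symmetry-breaking initialization, and the Laplace smoothing are choices made for this illustration, not prescribed by the tutorial.

```python
import numpy as np

def em_naive_bayes(X, K=2, iters=50):
    """EM for a naive Bayes clustering model with binary features.

    X: (M, n) array of 0/1 observations; the cluster variable C is hidden.
    Learns pi[k] = P(C=k) and theta[k, i] = P(X_i = 1 | C = k).
    Returns (pi, theta, resp) where resp[m, k] = P(C=k | x[m], theta).
    """
    M, n = X.shape
    pi = np.full(K, 1.0 / K)
    # Deterministic symmetry-breaking initialization of P(X_i = 1 | C = k).
    theta = np.linspace(0.3, 0.7, K)[:, None] * np.ones((K, n))
    for _ in range(iters):
        # E-step: soft cluster assignments P(C = k | x[m], theta).
        log_p = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)   # for numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate P(C) and P(X_i | C) from expected counts,
        # with Laplace smoothing to keep probabilities away from 0 and 1.
        Nk = resp.sum(axis=0)
        pi = Nk / M
        theta = (resp.T @ X + 1.0) / (Nk[:, None] + 2.0)
    return pi, theta, resp
```

On well-separated data the soft assignments quickly harden into the two obvious groups.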
EM in Practice
Initial parameters: random parameter settings, or a "best" guess from another source.
Stopping criteria: small change in the likelihood of the data, or small change in parameter values.
Avoiding bad local maxima: multiple restarts; early "pruning" of unpromising ones.
Speed-up: various methods to speed convergence.
Why Struggle for Accurate Structure
Adding an arc: increases the number of parameters to be fitted, and encodes wrong assumptions about causality and domain structure.
Missing an arc: cannot be compensated for by accurate fitting of parameters, and also misses causality and domain structure.
[Figure: three versions of the network over Earthquake, Burglary, Alarm Set, and Sound: the true structure, one with an added arc, and one with a missing arc.]
Minimum Description Length (cont.)
Computing the description length of the data, we get

  DL(D : G) = DL(G) + (log M / 2) · dim(G) − l(G : D)

where DL(G) is the number of bits to encode G, (log M / 2) · dim(G) is the number of bits to encode Θ_G, and −l(G : D) is the number of bits to encode D using (G, Θ_G).
Minimizing this term is equivalent to maximizing

  MDL(G : D) = l(G : D) − (log M / 2) · dim(G) − DL(G)
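The score above can be sketched numerically. This assumes binary variables, uses the natural log (the choice of base only rescales all scores), and treats DL(G) as a constant passed in; the structure encoding and data are hypothetical.

```python
import math
from collections import Counter

def family_loglik(data, child, parents):
    """Maximal log-likelihood contribution of one family (the child
    variable given its parents), with ML (frequency) parameters."""
    joint, par = Counter(), Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[child])] += 1
        par[pa] += 1
    # sum over (pa, x) of N(pa, x) * log( N(pa, x) / N(pa) )
    return sum(n * math.log(n / par[pa]) for (pa, _x), n in joint.items())

def mdl_score(data, structure, arity=2, dl_graph=0.0):
    """MDL(G : D) = l(G : D) - (log M / 2) * dim(G) - DL(G).

    structure: {child: (parent, ...)} over variables indexed 0..n-1,
    all with `arity` values; dl_graph stands in for DL(G).
    """
    M = len(data)
    loglik = sum(family_loglik(data, c, ps) for c, ps in structure.items())
    dim = sum((arity - 1) * arity ** len(ps) for ps in structure.values())
    return loglik - 0.5 * math.log(M) * dim - dl_graph

# X determines Y perfectly, so the arc X -> Y pays for its extra parameters.
data = [(0, 0)] * 8 + [(1, 1)] * 8
no_arc = {0: (), 1: ()}
with_arc = {0: (), 1: (0,)}
```

The score decomposes into per-family terms, which is exactly what the local-search caching below exploits.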
Heuristic Search
We address the problem by using heuristic search.
Define a search space: nodes are possible structures; edges denote adjacency of structures (one local change apart).
Traverse this space looking for high-scoring structures.
Search techniques: greedy hill-climbing, best-first search, simulated annealing, ...
Heuristic Search (cont.)
Typical operations, each a single local change to the current structure:
• Add an edge (e.g. add C → D)
• Remove an edge (e.g. remove C → E)
• Reverse an edge (e.g. reverse C → E)
[Figure: a network over S, C, E, and D, and the three neighboring structures produced by these operations.]
Exploiting Decomposability in Local Search
Caching: to update the score of a structure after a local change, we only need to re-score the families (a node and its parents) that were changed in the last move.
[Figure: the network over S, C, E, and D before and after local moves; only the affected families are re-scored.]
Greedy Hill-Climbing
Simplest heuristic local search. Start with a given network:
• the empty network
• the best tree
• a random network
At each iteration:
• evaluate all possible changes
• apply the change that leads to the best improvement in score
• reiterate
Stop when no modification improves the score.
Each step requires evaluating approximately n new changes.
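The loop above can be sketched generically; here `neighbors` and `score` are placeholders for the structure-change operators and the network score, so the same skeleton works for any search space.

```python
def greedy_hill_climb(initial, neighbors, score):
    """Generic greedy hill-climbing.

    neighbors(s) yields candidates one local change away (for structure
    search: add / remove / reverse a single edge); score(s) is the value
    to maximize. Stops at the first point where no single change helps.
    Returns (best_state, best_score).
    """
    current, current_score = initial, score(initial)
    while True:
        best, best_score = None, current_score
        for cand in neighbors(current):          # evaluate all changes
            s = score(cand)
            if s > best_score:                   # keep the best improvement
                best, best_score = cand, s
        if best is None:                         # local maximum or plateau
            return current, current_score
        current, current_score = best, best_score
```

A toy usage: climbing the integer line toward the maximum of -(x - 3)^2 with moves x ± 1 returns (3, 0).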
Greedy Hill-Climbing (cont.)
Greedy hill-climbing can get stuck in:
Local maxima:
• all one-edge changes reduce the score
Plateaus:
• some one-edge changes leave the score unchanged
Both occur in the search space.
Greedy Hill-Climbing (cont.)
To avoid these problems, we can use:
TABU search: keep a list of the K most recently visited structures, and apply the best move that does not lead to a structure in the list. This escapes plateaus and local maxima whose "basin" is smaller than K structures.
Random restarts: once stuck, apply some fixed number of random edge changes and restart the search. This can escape from the basin of one maximum to another.
Other Local Search Heuristics
Stochastic first-ascent hill-climbing: evaluate possible changes at random and apply the first one that leads "uphill". Stop after a fixed number of "unsuccessful" attempts to change the current candidate.
Simulated annealing: similar idea, but also apply "downhill" changes, with a probability that depends on the change in score. A temperature controls the amount of random downhill steps; slowly "cool" the temperature to reach a regime of strictly uphill moves.
Examples
Predicting heart disease Features: cholesterol, chest pain, angina, age, etc. Class: {present, absent}
Finding lemons in cars Features: make, brand, miles per gallon, acceleration, etc. Class: {normal, lemon}
Digit recognition Features: matrix of pixel descriptors Class: {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}
Speech recognition Features: Signal characteristics, language model Class: {pause/hesitation, retraction}
Some Applications
Biostatistics -- Medical Research Council (Bugs)
Data Analysis -- NASA (AutoClass)
Collaborative filtering -- Microsoft (MSBN)
Fraud detection -- AT&T
Classification -- SRI (TAN-BLT)
Speech recognition -- UC Berkeley