School of Computer Science
State Space Models
Probabilistic Graphical Models (10-708)
Lecture 13, part II Nov 5th, 2007
Eric Xing
Reading: Jordan Chap. 15, K&F Chapter 19.1-19.3
A road map to more complex dynamic models
[Figure: a grid of models organized by whether the latent variable X and the observation Y are discrete or continuous. Single time slice: mixture model (e.g., mixture of multinomials) for discrete X and Y; mixture model (e.g., mixture of Gaussians) for discrete X and continuous Y; factor analysis for continuous X and Y. Unrolled over a sequence x1, ..., xN with observations y1, ..., yN: HMM (for discrete sequential data, e.g., text); HMM (for continuous sequential data, e.g., speech signal); state space model. Richer structures: factorial HMM and switching SSM.]
State space models (SSM):
A sequential factor analysis, or a continuous-state HMM
[Graphical model: a chain X1 → X2 → ... → XN, with an observation Yt emitted from each Xt.]

$$x_t = A x_{t-1} + G w_t, \qquad w_t \sim \mathcal{N}(0, Q)$$
$$y_t = C x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R)$$
$$x_1 \sim \mathcal{N}(0, \Sigma_0)$$

This is a linear dynamical system.

In general,
$$x_t = f(x_{t-1}) + G w_t, \qquad y_t = g(x_t) + v_t$$
where f is an (arbitrary) dynamic model, and g is an (arbitrary) observation model
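As a concrete illustration, here is a minimal NumPy sketch of simulating the linear case above; the function name simulate_lds and the use of a NumPy random generator are illustrative choices, not part of the lecture.

```python
import numpy as np

def simulate_lds(A, G, Q, C, R, x0, T, rng=None):
    """Simulate x_t = A x_{t-1} + G w_t, y_t = C x_t + v_t with Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    xs, ys = [], []
    for _ in range(T):
        w = rng.multivariate_normal(np.zeros(Q.shape[0]), Q)   # process noise w_t ~ N(0, Q)
        v = rng.multivariate_normal(np.zeros(R.shape[0]), R)   # observation noise v_t ~ N(0, R)
        x = A @ x + G @ w                                      # state transition
        ys.append(C @ x + v)                                   # noisy observation
        xs.append(x)
    return np.array(xs), np.array(ys)
```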
LDS for 2D tracking

Dynamics: new position = old position + Δ × velocity + noise (constant velocity model, Gaussian noise).
Observation: project out first two components (we observe Cartesian position of object - linear!)
$$\begin{pmatrix} x_{1,t} \\ x_{2,t} \\ \dot{x}_{1,t} \\ \dot{x}_{2,t} \end{pmatrix} =
\begin{pmatrix} 1 & 0 & \Delta & 0 \\ 0 & 1 & 0 & \Delta \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x_{1,t-1} \\ x_{2,t-1} \\ \dot{x}_{1,t-1} \\ \dot{x}_{2,t-1} \end{pmatrix} + \text{noise}$$

$$\begin{pmatrix} y_{1,t} \\ y_{2,t} \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}
\begin{pmatrix} x_{1,t} \\ x_{2,t} \\ \dot{x}_{1,t} \\ \dot{x}_{2,t} \end{pmatrix} + \text{noise}$$
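A small sketch of these matrices in NumPy; the sampling interval dt stands in for Δ, and plugging A and C into the simulation sketch above is an assumed usage, not from the lecture.

```python
import numpy as np

dt = 1.0                                   # assumed sampling interval Δ
A = np.array([[1, 0, dt, 0],               # new position = old position + Δ·velocity
              [0, 1, 0, dt],
              [0, 0, 1,  0],               # velocity persists; noise perturbs it
              [0, 0, 0,  1]], dtype=float)
C = np.array([[1, 0, 0, 0],                # observe only the Cartesian position
              [0, 1, 0, 0]], dtype=float)
```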
The inference problem 1
Filtering: given $y_1, \ldots, y_t$, estimate $x_t$, i.e., compute $P(x_t \mid y_{1:t})$.

The Kalman filter is a way to perform exact online inference (sequential Bayesian updating) in an LDS. It is the Gaussian analog of the forward algorithm for HMMs:
$$\alpha_t^i \equiv p(X_t = i \mid y_{1:t}) \propto p(y_t \mid X_t = i) \sum_j p(X_t = i \mid X_{t-1} = j)\, \alpha_{t-1}^j$$
The inference problem 2

Smoothing: given $y_1, \ldots, y_T$, estimate $x_t$ ($t < T$).

The Rauch-Tung-Striebel smoother is a way to perform exact offline inference in an LDS. It is the Gaussian analog of the forwards-backwards (alpha-gamma) algorithm:
$$\gamma_t^i \equiv p(X_t = i \mid y_{1:T}) \propto \alpha_t^i \sum_j \gamma_{t+1}^j\, p(X_{t+1} = j \mid X_t = i)$$
2D tracking
[Figure: filtered 2D tracking trajectories plotted in the (X1, X2) plane.]
Kalman filtering in the brain?
Kalman filtering derivation

Since all CPDs are linear Gaussian, the system defines a large multivariate Gaussian. Hence all marginals are Gaussian. Hence we can represent the belief state $p(X_t \mid y_{1:t})$ as a Gaussian with
mean $\hat{x}_{t|t} \equiv E[X_t \mid y_1, \ldots, y_t]$ and
covariance $P_{t|t} \equiv \mathrm{Cov}[X_t \mid y_1, \ldots, y_t]$.
Hence, instead of marginalization for message passing, we will directly estimate the means and covariances of the required marginals. It is common to work with the inverse covariance (precision) matrix $P_{t|t}^{-1}$; this is called information form.
Kalman filtering derivation

Kalman filtering is a recursive procedure to update the belief state:
Predict step: compute p(Xt+1|y1:t) from prior belief p(Xt|y1:t) and dynamical model p(Xt+1|Xt) --- time update
Update step: compute new belief p(Xt+1|y1:t+1) from prediction p(Xt+1|y1:t), observation yt+1 and observation model p(yt+1|Xt+1) --- measurement update
Predict step

Dynamical model:
$$x_{t+1} = A x_t + G w_t, \qquad w_t \sim \mathcal{N}(0, Q)$$

One-step-ahead prediction of the state:
$$\hat{x}_{t+1|t} = A\hat{x}_{t|t}, \qquad P_{t+1|t} = A P_{t|t} A^T + G Q G^T$$

Observation model:
$$y_t = C x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R)$$

One-step-ahead prediction of the observation:
$$\hat{y}_{t+1|t} = C\hat{x}_{t+1|t}, \qquad \mathrm{Cov}(y_{t+1} \mid y_{1:t}) = C P_{t+1|t} C^T + R$$
Update step

Summarizing the results from the previous slide, we have $p(X_{t+1}, Y_{t+1} \mid y_{1:t}) \sim \mathcal{N}(m_{t+1}, V_{t+1})$, where
$$m_{t+1} = \begin{pmatrix} \hat{x}_{t+1|t} \\ C\hat{x}_{t+1|t} \end{pmatrix}, \qquad
V_{t+1} = \begin{pmatrix} P_{t+1|t} & P_{t+1|t} C^T \\ C P_{t+1|t} & C P_{t+1|t} C^T + R \end{pmatrix}$$

Remember the formulas for conditional Gaussian distributions: if
$$p\!\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) =
\mathcal{N}\!\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \,\middle|\,
\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right),$$
then
$$p(x_2) = \mathcal{N}(x_2 \mid m_2, V_2), \qquad m_2 = \mu_2, \qquad V_2 = \Sigma_{22},$$
$$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2}), \qquad
m_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad
V_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
Kalman Filter

Measurement updates:
$$\hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}\,(y_{t+1} - C\hat{x}_{t+1|t})$$
$$P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t}$$
where $K_{t+1}$ is the Kalman gain matrix:
$$K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1}$$
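Putting the time update and the measurement update together, here is a minimal NumPy sketch of one filter recursion; kalman_step and its argument names are illustrative, not from the lecture.

```python
import numpy as np

def kalman_step(x, P, y, A, G, Q, C, R):
    """One Kalman filter recursion: time update followed by measurement update.

    x, P : posterior mean and covariance of the state at time t, given y_{1:t}
    y    : new observation y_{t+1}
    Returns the posterior mean and covariance at time t+1, given y_{1:t+1}.
    """
    # Predict step (time update): push the belief through the dynamics.
    x_pred = A @ x                          # x̂_{t+1|t} = A x̂_{t|t}
    P_pred = A @ P @ A.T + G @ Q @ G.T      # P_{t+1|t} = A P_{t|t} Aᵀ + G Q Gᵀ

    # Update step (measurement update): correct with the new observation.
    S = C @ P_pred @ C.T + R                # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)     # Kalman gain K_{t+1}
    x_new = x_pred + K @ (y - C @ x_pred)   # mean update with the innovation
    P_new = P_pred - K @ C @ P_pred         # covariance update
    return x_new, P_new
```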
Example of KF in 1D

Consider noisy observations of a 1D particle doing a random walk:
$$x_t = x_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, \sigma_x), \qquad z_t = x_t + v_t, \quad v_t \sim \mathcal{N}(0, \sigma_z)$$

KF equations (here $A = G = C = 1$, and $\sigma_t$ denotes $P_{t|t}$):
$$\hat{x}_{t+1|t} = A\hat{x}_{t|t} = \hat{x}_{t|t}$$
$$P_{t+1|t} = A P_{t|t} A^T + G Q G^T = \sigma_t + \sigma_x$$
$$\hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}\,(z_{t+1} - C\hat{x}_{t+1|t}) = \frac{(\sigma_t + \sigma_x)\, z_{t+1} + \sigma_z\, \hat{x}_{t|t}}{\sigma_t + \sigma_x + \sigma_z}$$
$$P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t} = \frac{(\sigma_t + \sigma_x)\, \sigma_z}{\sigma_t + \sigma_x + \sigma_z}$$
where
$$K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1} = \frac{\sigma_t + \sigma_x}{\sigma_t + \sigma_x + \sigma_z}.$$
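A small numeric sketch of this 1D case, with assumed noise levels sigma_x and sigma_z and an assumed initial belief:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_x, sigma_z = 0.1, 1.0              # process and observation noise variances (assumed)
T = 50

# Simulate the random walk and its noisy observations.
x = np.cumsum(rng.normal(0, np.sqrt(sigma_x), T))
z = x + rng.normal(0, np.sqrt(sigma_z), T)

# Scalar Kalman filter: sigma_t plays the role of P_{t|t}.
x_hat, sigma_t = 0.0, 1.0
estimates = []
for t in range(T):
    sigma_pred = sigma_t + sigma_x                   # P_{t+1|t}
    K = sigma_pred / (sigma_pred + sigma_z)          # Kalman gain
    x_hat = x_hat + K * (z[t] - x_hat)               # mean update
    sigma_t = sigma_pred - K * sigma_pred            # covariance update
    estimates.append(x_hat)
```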
KF intuition

The KF update of the mean is
$$\hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}\,(z_{t+1} - C\hat{x}_{t+1|t}) = \frac{(\sigma_t + \sigma_x)\, z_{t+1} + \sigma_z\, \hat{x}_{t|t}}{\sigma_t + \sigma_x + \sigma_z}$$

The term $(z_{t+1} - C\hat{x}_{t+1|t})$ is called the innovation.

The new belief is a convex combination of the updates from the prior and the observation, weighted by the Kalman gain matrix:
$$K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1} = \frac{\sigma_t + \sigma_x}{\sigma_t + \sigma_x + \sigma_z}$$

If the observation is unreliable, $\sigma_z$ (i.e., $R$) is large, so $K_{t+1}$ is small and we pay more attention to the prediction. If the old prior is unreliable (large $\sigma_t$) or the process is very unpredictable (large $\sigma_x$), we pay more attention to the observation.
KF, RLS and LMS

The KF update of the mean is
$$\hat{x}_{t+1|t+1} = A\hat{x}_{t|t} + K_{t+1}\,(y_{t+1} - C\hat{x}_{t+1|t})$$

Consider the special case where the hidden state is a constant, $x_t = \theta$, but the "observation matrix" $C$ is a time-varying vector, $C = x_t^T$. Hence the observation model at each time slice,
$$y_t = x_t^T \theta + v_t,$$
is a linear regression.

We can estimate $\theta$ recursively using the Kalman filter:
$$\hat{\theta}_{t+1} = \hat{\theta}_t + P_{t+1|t+1} R^{-1}\, x_{t+1}\,(y_{t+1} - x_{t+1}^T \hat{\theta}_t)$$
This is called the recursive least squares (RLS) algorithm.

We can approximate $P_{t+1|t+1} R^{-1} \approx \eta_{t+1}$ by a scalar constant. This is called the least mean squares (LMS) algorithm. We can adapt $\eta_t$ online using stochastic approximation theory.
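A minimal sketch of both special cases in NumPy; the function names rls and lms, the prior variance, and the fixed step size are assumptions for illustration.

```python
import numpy as np

def rls(X, y, R=1.0, prior_var=1e3):
    """Recursive least squares: the KF specialized to a constant hidden state theta.

    X : (T, d) regressors x_t; y : (T,) responses y_t; R : observation noise variance.
    """
    d = X.shape[1]
    theta, P = np.zeros(d), np.eye(d) * prior_var     # broad Gaussian prior on theta
    for x_t, y_t in zip(X, y):
        S = x_t @ P @ x_t + R                         # innovation variance
        K = P @ x_t / S                               # Kalman gain (a d-vector here)
        theta = theta + K * (y_t - x_t @ theta)       # move along the innovation
        P = P - np.outer(K, x_t) @ P                  # posterior covariance of theta
    return theta

def lms(X, y, eta=0.01):
    """Least mean squares: replace the gain P R^{-1} by a fixed scalar step size eta."""
    theta = np.zeros(X.shape[1])
    for x_t, y_t in zip(X, y):
        theta = theta + eta * x_t * (y_t - x_t @ theta)
    return theta
```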
Complexity of one KF step

Let $X_t \in \mathbb{R}^{N_x}$ and $Y_t \in \mathbb{R}^{N_y}$.

Computing $P_{t+1|t} = A P_{t|t} A^T + G Q G^T$ takes $O(N_x^3)$ time, assuming dense $P$ and dense $A$.

Computing $K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1}$ takes $O(N_y^3)$ time.

So the overall time is, in general, $\max\{N_x^3, N_y^3\}$.
The inference problem 2

Smoothing: given $y_1, \ldots, y_T$, estimate $x_t$ ($t < T$).

The Rauch-Tung-Striebel smoother is a way to perform exact offline inference in an LDS. It is the Gaussian analog of the forwards-backwards (alpha-gamma) algorithm:
$$\gamma_t^i \equiv p(X_t = i \mid y_{1:T}) \propto \alpha_t^i \sum_j \gamma_{t+1}^j\, p(X_{t+1} = j \mid X_t = i)$$
RTS smoother derivation

Smoothing: given $y_1, \ldots, y_T$, estimate $P(x_t \mid y_{1:T})$ ($t < T$).

Step 1: the joint distribution of $x_t$ and $x_{t+1}$ conditioned on $y_{1:t}$.

Use the dynamical model $x_{t+1} = A x_t + G w_t, \; w_t \sim \mathcal{N}(0, Q)$.
RTS smoother derivation

Following the results from the previous slide, we need to derive $p(X_{t+1}, X_t \mid y_{1:t}) \sim \mathcal{N}(m, V)$, where
$$m = \begin{pmatrix} \hat{x}_{t|t} \\ \hat{x}_{t+1|t} \end{pmatrix}, \qquad
V = \begin{pmatrix} P_{t|t} & P_{t|t} A^T \\ A P_{t|t} & P_{t+1|t} \end{pmatrix}.$$
All the quantities here are available after a forward KF pass.

Remember the formulas for conditional Gaussian distributions (as on the update-step slide):
$$p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22}), \qquad
p(x_1 \mid x_2) = \mathcal{N}\!\left(x_1 \,\middle|\, \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$

Applying them here gives the smoother gain
$$L_t = P_{t|t} A^T P_{t+1|t}^{-1}.$$
RTS smoother derivation
Step 2: compute $\hat{x}_{t|T} = E[x_t \mid y_{1:T}]$ using the results above. Use the tower property $E[X \mid Z] = E[\,E[X \mid Y, Z] \mid Z\,]$.
RTS derivation

Repeat the same process for the variance.
Refer to Jordan chapter 15
The RTS smoother results:
$$\hat{x}_{t|T} = \hat{x}_{t|t} + L_t\,(\hat{x}_{t+1|T} - \hat{x}_{t+1|t})$$
$$P_{t|T} = P_{t|t} + L_t\,(P_{t+1|T} - P_{t+1|t})\,L_t^T$$
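A minimal backward-pass sketch in NumPy, assuming the filtered and one-step-ahead predicted quantities were stored during the forward KF pass; rts_smooth and its argument layout are illustrative.

```python
import numpy as np

def rts_smooth(x_filt, P_filt, x_pred, P_pred, A):
    """Backward RTS pass given the stored forward KF quantities.

    x_filt[t], P_filt[t] : filtered x̂_{t|t}, P_{t|t}, for t = 0..T-1
    x_pred[t], P_pred[t] : one-step predictions x̂_{t+1|t}, P_{t+1|t}, for t = 0..T-2
    Returns the smoothed means and covariances x̂_{t|T}, P_{t|T}.
    """
    T = len(x_filt)
    x_smooth, P_smooth = list(x_filt), list(P_filt)
    for t in range(T - 2, -1, -1):
        L = P_filt[t] @ A.T @ np.linalg.inv(P_pred[t])        # smoother gain L_t
        x_smooth[t] = x_filt[t] + L @ (x_smooth[t + 1] - x_pred[t])
        P_smooth[t] = P_filt[t] + L @ (P_smooth[t + 1] - P_pred[t]) @ L.T
    return x_smooth, P_smooth
```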
Learning SSMs

Complete log likelihood:
$$\ell_c(\theta; D) = \sum_n \log p(x_n, y_n)
= \sum_n \log p(x_{n,1}) + \sum_n \sum_t \log p(x_{n,t} \mid x_{n,t-1}) + \sum_n \sum_t \log p(y_{n,t} \mid x_{n,t})$$
$$= f_1\big(x_1; \Sigma_0\big) + f_2\big(\{x_t x_t^T,\, x_t x_{t-1}^T,\, x_t : \forall t\};\; A, Q, G\big) + f_3\big(\{x_t x_t^T,\, x_t : \forall t\};\; C, R\big)$$

EM

E-step: compute the expected sufficient statistics
$$\langle X_t \rangle, \quad \langle X_t X_t^T \rangle, \quad \langle X_t X_{t-1}^T \rangle$$
given $y_1, \ldots, y_T$. These quantities can be inferred via the KF and RTS smoother, e.g.,
$$\langle X_t X_t^T \rangle \equiv \mathrm{var}(X_t) + E(X_t)\,E(X_t)^T = P_{t|T} + \hat{x}_{t|T}\,\hat{x}_{t|T}^T$$

M-step: MLE using the expected complete log likelihood
$$\langle \ell_c(\theta; D) \rangle = f_1\big(\langle X_1 \rangle; \Sigma_0\big) + f_2\big(\{\langle X_t X_t^T \rangle, \langle X_t X_{t-1}^T \rangle, \langle X_t \rangle : \forall t\};\; A, Q, G\big) + f_3\big(\{\langle X_t X_t^T \rangle, \langle X_t \rangle : \forall t\};\; C, R\big)$$
cf. the M-step in factor analysis.
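As one concrete M-step piece, here is a sketch of the update for the dynamics matrix A from the expected sufficient statistics, assuming G = I; the function name and argument layout are illustrative, not from the lecture.

```python
import numpy as np

def m_step_dynamics(Exx, Exx_cross):
    """M-step update for A (a sketch, assuming G = I).

    Exx[t]       : E[x_t x_t^T | y_{1:T}] = P_{t|T} + x̂_{t|T} x̂_{t|T}^T, for t = 0..T-1
    Exx_cross[t] : E[x_{t+1} x_t^T | y_{1:T}], for t = 0..T-2
    MLE: A = (sum_t E[x_{t+1} x_t^T]) (sum_t E[x_t x_t^T])^{-1}
    """
    num = sum(Exx_cross)          # sum of cross second moments
    den = sum(Exx[:-1])           # sum of second moments over t = 0..T-2
    return num @ np.linalg.inv(den)
```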
Nonlinear systems

In robotics and other problems, the motion model and the observation model are often nonlinear:
$$x_t = f(x_{t-1}) + w_t, \qquad y_t = g(x_t) + v_t$$

An optimal closed-form solution to the filtering problem is no longer possible. The nonlinear functions f and g are sometimes represented by neural networks (multi-layer perceptrons or radial basis function networks). The parameters of f and g may be learned offline using EM, where we do gradient descent (back propagation) in the M-step, cf. learning an MRF/CRF with hidden nodes. Or we may learn the parameters online by adding them to the state space: $x_t' = (x_t, \theta)$. This makes the problem even more nonlinear.
Extended Kalman Filter (EKF)

The basic idea of the EKF is to linearize f and g using a first-order Taylor expansion, and then apply the standard KF.
i.e., we approximate a stationary nonlinear system with a non-stationary linear system.
$$x_t = f(\hat{x}_{t-1|t-1}) + A_{\hat{x}_{t-1|t-1}}\,(x_{t-1} - \hat{x}_{t-1|t-1}) + w_t$$
$$y_t = g(\hat{x}_{t|t-1}) + C_{\hat{x}_{t|t-1}}\,(x_t - \hat{x}_{t|t-1}) + v_t$$
where $\hat{x}_{t|t-1} = f(\hat{x}_{t-1|t-1})$, $\; A_{\hat{x}} \stackrel{\text{def}}{=} \left.\frac{\partial f}{\partial x}\right|_{\hat{x}}$, and $\; C_{\hat{x}} \stackrel{\text{def}}{=} \left.\frac{\partial g}{\partial x}\right|_{\hat{x}}$.

The noise covariances ($Q$ and $R$) are not changed, i.e., the additional error due to linearization is not modeled.
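A minimal sketch of one EKF recursion under these approximations; ekf_step and the Jacobian-function arguments are illustrative names, not from the lecture.

```python
import numpy as np

def ekf_step(x, P, y, f, g, F_jac, G_jac, Q, R):
    """One EKF recursion: linearize f and g at the current estimate, then apply the KF.

    f, g         : nonlinear dynamics and observation functions
    F_jac, G_jac : functions returning their Jacobians at a given point
    """
    # Predict: propagate the mean through f, the covariance through the Jacobian A.
    A = F_jac(x)                              # A = ∂f/∂x evaluated at x̂_{t-1|t-1}
    x_pred = f(x)                             # x̂_{t|t-1} = f(x̂_{t-1|t-1})
    P_pred = A @ P @ A.T + Q

    # Update: linearize g around the predicted state.
    C = G_jac(x_pred)                         # C = ∂g/∂x evaluated at x̂_{t|t-1}
    S = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (y - g(x_pred))      # innovation uses the nonlinear g
    P_new = P_pred - K @ C @ P_pred
    return x_new, P_new
```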
School of Computer Science
Complex Graphical Models
Probabilistic Graphical Models (10-708)
Lecture 14, Nov 5th, 2007
Eric Xing
Reading: K&F Chapter 20.1-20.3
A road map to more complex dynamic models
[Figure: the same road map of dynamic models shown at the start of Lecture 13, part II: mixture models (multinomial and Gaussian), factor analysis, HMMs for discrete and continuous sequential data, the state space model, the factorial HMM, and the switching SSM.]
NLP and Data Mining

We want:
- Semantic-based search: infer topics and categorize documents
- Multimedia inference
- Automatic translation
- Predict how topics evolve
- …
[Figure: research topics evolving over time, 1900-2000.]
The Vector Space Model

Represent each document by a high-dimensional vector in the space of words.
Latent Semantic Indexing

$$X_{(m \times n)} = T_{(m \times k)}\; \Lambda_{(k \times k)}\; D^T_{(k \times n)}$$
where the rows of the term-document matrix $X$ index terms and the columns index documents. Each document is then represented as
$$\vec{w}_d = \sum_{k=1}^K \lambda_k\, d_k\, \vec{T}_k$$
LSA does not define a properly normalized probability distribution of observed and latent entities
Does not support probabilistic reasoning under uncertainty and data fusion
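A minimal sketch of LSI as a rank-k truncated SVD; the function name lsi and the NumPy SVD route are illustrative choices, not from the lecture.

```python
import numpy as np

def lsi(X, k):
    """Latent semantic indexing sketch: rank-k truncated SVD of the term-document matrix.

    X : (m terms x n documents) count matrix.
    Returns T (m x k), the top-k singular values, and D (n x k).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T, lam, D = U[:, :k], s[:k], Vt[:k, :].T
    return T, lam, D

# Each document d is then represented in the k-dimensional latent space,
# w_d ≈ sum_k lambda_k * d_k * T_k.
```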
Latent Semantic Structure
Latent structure $\ell$ → words $w$.

Distribution over words:
$$P(w) = \sum_{\ell} P(w, \ell)$$

Inferring latent structure:
$$P(\ell \mid w) = \frac{P(w \mid \ell)\, P(\ell)}{P(w)}$$

Prediction:
$$P(w_{n+1} \mid \mathbf{w}) = \ldots$$
Admixture Models

Objects are bags of elements.
Mixtures are distributions over elements.
Objects have a mixing vector θ, which represents each mixture's contribution.
An object is generated as follows: pick a mixture component from θ, then pick an element from that component.
Example documents over the vocabulary {money, bank, loan, river, stream}, with each word tagged by its topic assignment, e.g.:
money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2
Each document has its own mixing vector, e.g., (0.1, 0.1, 0.5, …), (0.1, 0.5, 0.1, …), (0.5, 0.1, 0.1, …).
Topic Models = Admixture Models

Generating a document:

[Plate diagram: a prior generates θ for each of N documents; for each of the Nd words, a topic indicator z is drawn from θ and a word w is drawn from the corresponding topic distribution β, with K topics in total.]
Draw θ from the prior.
For each word n:
- Draw z_n from multinomial(θ)
- Draw w_n | z_n, {β_{1:k}} from multinomial(β_{z_n})
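A minimal sketch of this generative process in NumPy; the Dirichlet choice for the prior anticipates the next slide, and generate_document is an illustrative name.

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng=None):
    """Sample one document from the admixture (topic model) generative process.

    alpha : (K,) Dirichlet prior parameter (an assumed choice of prior on theta)
    beta  : (K, V) matrix whose rows are topic-word multinomials
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.dirichlet(alpha)                                # mixing vector theta from the prior
    z = rng.choice(len(alpha), size=n_words, p=theta)           # topic z_n for each word
    words = [rng.choice(beta.shape[1], p=beta[zn]) for zn in z] # w_n from multinomial(beta_{z_n})
    return words, z, theta
```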
Which prior to use?
Choice of Prior
Dirichlet (LDA) (Blei et al. 2003): a conjugate prior means efficient inference.
Can only capture variations in each topic’s intensity independently
Logistic Normal (CTM=LoNTAM) (Blei & Lafferty 2005, Ahmed & Xing 2006)
Captures the intuition that some topics are highly correlated and can rise up in intensity together. Not a conjugate prior, which implies hard inference.
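To make the contrast concrete, here is a small sketch of drawing a topic-proportion vector from each prior; the covariance Sigma is an assumed example, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3

# Dirichlet prior: draws live on the simplex directly; components vary independently.
theta_dir = rng.dirichlet(alpha=np.ones(K))

# Logistic normal prior: draw a Gaussian with a full covariance, then map to the
# simplex with the softmax; correlations in Sigma let topics rise and fall together.
mu = np.zeros(K)
Sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
eta = rng.multivariate_normal(mu, Sigma)
theta_ln = np.exp(eta) / np.exp(eta).sum()
```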
Logistic Normal vs. Dirichlet

[Figure: the Dirichlet prior.]

[Figure: the logistic normal prior.]
Mixed Membership Model (M3)

Mixture versus admixture:
A Bayesian mixture model ⇒ a Bayesian admixture model (mixed membership model).
Latent Dirichlet Allocation: M3 in text mining
A document is a bag of words each generated from a randomly selected topic
Population admixture: M3 in genetics
The genetic material of each modern individual is inherited from multiple ancestral populations; each DNA locus may have a different genetic origin …
Ancestral labels may have (e.g., Markovian) dependencies
Inference in Mixed Membership Models
Mixture versus admixture
Inference is very hard in M3, all hidden variables are coupled and not factorizable!
$$p(D) = \sum_{\{z_{m,n}\}} \int\!\!\int \prod_m \left( \prod_n p(x_{m,n} \mid z_{m,n}, \phi)\, p(z_{m,n} \mid \pi_m) \right) p(\pi_m \mid \alpha)\; p(\phi \mid G)\; d\pi_1 \cdots d\pi_M\, d\phi$$

$$p(\pi_m \mid D) \sim \sum_{\{z_{m,n}\}} \int \prod_n p(x_{m,n} \mid z_{m,n}, \phi)\, p(z_{m,n} \mid \pi_m)\; p(\pi_m \mid \alpha)\; p(\phi \mid G)\; d\pi_{-m}\, d\phi$$
Approaches to inference

Exact inference algorithms:
- The elimination algorithm
- The junction tree algorithms

Approximate inference techniques:
- Monte Carlo algorithms: stochastic simulation / sampling methods, Markov chain Monte Carlo methods
- Variational algorithms: belief propagation, variational inference