[Slide 1]
Information Bottleneck EM
School of Engineering & Computer Science
The Hebrew University, Jerusalem, Israel
Gal Elidan and Nir Friedman
[Slide 2]

Learning with Hidden Variables

Input: DATA over X1 … XN, with a hidden variable T
Output: A model P(X,T)

Problem: No closed-form solution for ML estimation
→ Use Expectation Maximization (EM)

Problem: EM gets stuck in inferior local maxima
→ Random restarts, deterministic / simulated annealing

This talk: EM + information regularization for learning parameters

[Figure: likelihood as a function of the parameters, with several local maxima; example network with hidden T as parent of X1, X2, X3]
[Slide 3]

Learning Parameters

Input: DATA over X1 … XN (M instances)
Output: A model P(X)

Empirical distribution Q(X):
Q(x1, x2, x3) = #(x1, x2, x3) / M

Parametrization of P (structure: X1 → X2, X1 → X3):
P(X1) = Q(X1)
P(X2|X1) = Q(X2|X1)
P(X3|X1) = Q(X3|X1)
[Slide 4]

Learning with Hidden Variables

Input: DATA over X1 … XN (M instances, with instance ID Y = 1 … M); T is hidden
Desired structure: T as parent of X1, X2, X3

Empirical distribution Q(X,T) = ?  (T is never observed)

EM iterations: for each instance ID, a guess of the complete value of T, giving the empirical distribution
Q(X,T,Y) = Q(X,Y) Q(T|Y)

Parametrization for P as above.
[Slide 5]

The EM Algorithm [Neal and Hinton, 1998]

E-Step: Generate the empirical distribution  Q(T|Y) = P(T | X[Y])
M-Step: Maximize  argmax_P E_Q[log P(T,X)]

EM is equivalent to optimizing a functional F[Q,P] of Q and P:
F[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y)
Each step increases the value of the functional.
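The E-step, M-step, and functional above can be sketched end to end on a toy naive Bayes model with one binary hidden variable. This is an illustrative reconstruction under stated assumptions, not the authors' code: the random data, the binary cardinalities, and all function names (`e_step`, `m_step`, `em_functional`) are choices of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: M instances of N = 3 binary observed variables X and one
# hidden binary T, with a naive Bayes model P(X,T) = P(T) prod_i P(X_i|T).
M, N = 200, 3
X = rng.integers(0, 2, size=(M, N))

def loglik_xt(X, pT, pX_T):
    """log P(x[y], t) for every instance y and value t; shape (M, 2)."""
    return np.log(pT)[None, :] + (
        X[:, :, None] * np.log(pX_T)[None, :, :]
        + (1 - X[:, :, None]) * np.log(1 - pX_T)[None, :, :]
    ).sum(axis=1)

def e_step(X, pT, pX_T):
    """Q(t|y) = P(t | x[y]): the standard E-step."""
    logp = loglik_xt(X, pT, pX_T)
    logp -= logp.max(axis=1, keepdims=True)
    Q = np.exp(logp)
    return Q / Q.sum(axis=1, keepdims=True)

def m_step(X, Q):
    """argmax_P E_Q[log P(T,X)]: expected-count ML estimates."""
    pT = Q.mean(axis=0)
    pX_T = (X[:, :, None] * Q[:, None, :]).sum(axis=0) / Q.sum(axis=0)[None, :]
    return np.clip(pT, 1e-6, 1 - 1e-6), np.clip(pX_T, 1e-6, 1 - 1e-6)

def em_functional(X, Q, pT, pX_T):
    """F[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y)  (Neal & Hinton, 1998)."""
    e_logp = (Q * loglik_xt(X, pT, pX_T)).sum() / M
    h_t_y = -(Q * np.log(Q + 1e-12)).sum() / M   # H_Q(T|Y), Y uniform
    return e_logp + h_t_y

pT = np.array([0.5, 0.5])
pX_T = rng.uniform(0.3, 0.7, size=(N, 2))
values = []
for _ in range(20):
    Q = e_step(X, pT, pX_T)        # E-step increases F for fixed P
    pT, pX_T = m_step(X, Q)        # M-step increases F for fixed Q
    values.append(em_functional(X, Q, pT, pX_T))
print(values[0], values[-1])
```

Tracking `values` shows the slide's claim directly: each E/M pair never decreases F[Q,P].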
[Slide 6]

Information Bottleneck EM

Target:
L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(EM target)   (information between hidden T and instance ID Y)

In the rest of the talk…
Understanding this objective
How to use it to learn better models
[Slide 7]

Information Regularization

Motivating idea:
Fit the training data: set T to be the instance ID in order to “predict” X
Generalization: “forget” the ID and keep only the essence of X

Objective:
L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
a (lower bound of the) likelihood of P vs. compression of the instance ID:
a parameter-free regularization of Q [Tishby et al., 1999]
[Slide 8]

Clustering example

L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(EM target)   (compression measure)

At γ = 0 the compression measure dominates: total compression, all instances collapse into a single cluster.

[Figure: 11 data points, all assigned to one cluster at γ = 0]
[Slide 9]

Clustering example

L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(EM target)   (compression measure)

At γ = 1 the EM target dominates: total preservation, T ↔ ID, each instance becomes its own cluster.

[Figure: 11 data points, each in its own singleton cluster at γ = 1]
[Slide 10]

Clustering example

L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(EM target)   (compression measure)

At some intermediate γ = ? we get the desired clustering (here |T| = 2).

[Figure: 11 data points grouped into the two desired clusters]
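The two extremes of the clustering example, total compression at γ = 0 and total preservation at γ = 1, can be checked numerically through the I_Q(T;Y) term alone. A minimal sketch (the helper name `mutual_information_ty` is hypothetical; Y is taken uniform over the instances as in the slides):

```python
import numpy as np

def mutual_information_ty(Q):
    """I_Q(T;Y) in nats for a table Q[y, t] = Q(t|y), with Y uniform
    over the M instances."""
    Qt = Q.mean(axis=0)                                  # marginal Q(t)
    ratio = np.log(Q + 1e-12) - np.log(Qt + 1e-12)[None, :]
    return (Q * ratio).sum() / Q.shape[0]

# gamma = 0, total compression: every instance has the same posterior.
Q_compressed = np.full((8, 2), 0.5)
# gamma = 1, total preservation: T deterministically identifies the instance.
Q_preserved = np.eye(8)
```

Here `mutual_information_ty(Q_compressed)` is 0, while `mutual_information_ty(Q_preserved)` reaches its maximum log M, the cost of memorizing the instance ID.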
[Slide 11]

Information Bottleneck EM

Formal equivalence with the Information Bottleneck:
L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
At γ = 1, EM and the Information Bottleneck coincide
[generalizing the result of Slonim and Weiss for the univariate case].
[Slide 12]

Information Bottleneck EM

Formal equivalence with the Information Bottleneck:
L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)

The maximum over Q(T|Y) is obtained when
Q(t|y) = (1/Z(y)) · Q(t)^(1−γ) · P(t | x[y])^γ
where Z(y) is a normalization constant, Q(t) is the marginal of T in Q, and P(t|x[y]) is the prediction of T using P.
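A sketch of where this fixed point comes from, written from the objective as stated above (Y uniform over the M instances; the marginal Q(t) is treated as fixed during the differentiation, as in the standard Bottleneck derivation, and λ(y) enforces normalization):

```latex
\begin{align*}
\frac{\partial \mathcal{L}_{\mathrm{IB\text{-}EM}}}{\partial Q(t|y)}
  &= \gamma\bigl(\log P(x[y],t) - \log Q(t|y) - 1\bigr)
     - (1-\gamma)\bigl(\log Q(t|y) - \log Q(t)\bigr) + \lambda(y) = 0 \\
\Longrightarrow\;
  \log Q(t|y) &= (1-\gamma)\log Q(t) + \gamma \log P(x[y],t) + \mathrm{const}(y) \\
\Longrightarrow\;
  Q(t|y) &= \frac{1}{Z(y)}\, Q(t)^{1-\gamma}\, P(t \mid x[y])^{\gamma},
\end{align*}
```

where the instance-dependent factor P(x[y])^γ has been absorbed into Z(y). At γ = 1 this is exactly the standard E-step Q(t|y) = P(t|x[y]).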
[Slide 13]

The IB-EM Algorithm for fixed γ

Iterate until convergence:
E-Step: Maximize L_IB-EM by optimizing Q:
  Q(t|y) = (1/Z(y)) · Q(t)^(1−γ) · P(t | x[y])^γ
M-Step: Maximize L_IB-EM by optimizing P:
  argmax_P E_Q[log P(T,X)]  (same as the standard M-step)

Each step improves L_IB-EM; guaranteed to converge.
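For a fixed γ, the E-step above can be sketched as a self-consistent iteration, assuming the model posteriors P(t|x[y]) are already available as a table. The function name, the iteration count, and the initialization from the fully compressed Q are choices of this sketch, not part of the algorithm's specification:

```python
import numpy as np

def ib_e_step(post, gamma, n_iter=50):
    """Self-consistent IB-EM E-step for a fixed gamma:
        Q(t|y) = (1/Z(y)) * Q(t)**(1-gamma) * P(t|x[y])**gamma
    post[y, t] holds P(t | x[y]) under the current model P.
    At gamma = 1 this reduces to the standard E-step Q(t|y) = P(t|x[y])."""
    M, K = post.shape
    Q = np.full((M, K), 1.0 / K)           # start from the fully compressed Q
    for _ in range(n_iter):
        Qt = Q.mean(axis=0)                # marginal Q(t), Y uniform
        logQ = (1 - gamma) * np.log(Qt + 1e-12)[None, :] + \
               gamma * np.log(post + 1e-12)
        logQ -= logQ.max(axis=1, keepdims=True)
        Q = np.exp(logQ)
        Q /= Q.sum(axis=1, keepdims=True)  # the Z(y) normalization
    return Q

post = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
Q1 = ib_e_step(post, 1.0)   # recovers the standard E-step
Q0 = ib_e_step(post, 0.0)   # total compression: rows collapse to the marginal
```

The two endpoint calls reproduce the clustering extremes from the earlier slides: at γ = 1 the table equals P(t|x[y]), at γ = 0 every instance gets the same posterior.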
[Slide 14]

Information Bottleneck EM

Target:
L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(EM target)   (information between hidden T and instance ID Y)

In the rest of the talk…
Understanding this objective
How to use it to learn better models
[Slide 15]

Continuation

Follow the ridge of L_IB-EM from the optimum at γ = 0 (easy) to γ = 1 (hard).

[Figure: surface of L_IB-EM over (Q, γ) for γ ∈ [0, 1], with a ridge leading from the easy γ = 0 optimum to the hard γ = 1 optimum]
[Slide 16]

Continuation

Recall: if Q is a local maximum of L_IB-EM then, for all t and y,
G_t,y(Q, γ) ≡ Q(t|y) − (1/Z(y)) · Q(t)^(1−γ) · P(t | x[y])^γ = 0

We want to follow a path in (Q, γ) space so that
G_t,y(Q, γ) = 0 for all t and y
(i.e., Q is a local maximum for every γ along the path).
[Slide 17]

Continuation Step

1. Start at (Q, γ) so that G_t,y(Q, γ) = 0
2. Compute the gradient ∇G
3. Take the direction tangent to the path (orthogonal to ∇G)
4. Take a step in the desired direction

[Figure: the (Q, γ) plane for γ ∈ [0, 1], with the path G = 0 and a tangent step from the starting point]
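One hedged way to realize steps 2-3 numerically: build a finite-difference Jacobian of G over the stacked variables (Q, γ) and take its null vector as the tangent direction. The helper below is an illustration on a one-dimensional toy ridge, not the paper's implementation; the function name and the SVD-based null-space computation are choices of this sketch.

```python
import numpy as np

def continuation_direction(G, z, eps=1e-6):
    """Tangent to the ridge G(z) = 0, where z stacks the free entries of Q
    together with gamma. G maps R^(n+1) -> R^n, so its Jacobian has a
    one-dimensional null space; the tangent is the null vector, found here
    from the SVD of a central-difference Jacobian."""
    n_out = len(G(z))
    J = np.zeros((n_out, len(z)))
    for j in range(len(z)):
        dz = np.zeros_like(z)
        dz[j] = eps
        J[:, j] = (G(z + dz) - G(z - dz)) / (2 * eps)
    _, _, Vt = np.linalg.svd(J)      # last row of Vt spans the null space
    d = Vt[-1]
    return d if d[-1] > 0 else -d    # orient the step so gamma increases

# Toy ridge: G(q, gamma) = q - gamma**2 = 0, so dq/dgamma = 2*gamma.
G = lambda z: np.array([z[0] - z[1] ** 2])
d = continuation_direction(G, np.array([0.25, 0.5]))
```

At (q, γ) = (0.25, 0.5) the analytic tangent slope is dq/dγ = 2γ = 1, which the returned direction matches.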
[Slide 18]

Staying on the Ridge

Potential problem: the direction is tangent to the path, so a straight step can drift off the ridge and miss the optimum.
Solution: use EM steps to regain the path.

[Figure: the (Q, γ) plane, showing a tangent step leaving the ridge and EM steps bringing it back]
[Slide 19]

The IB-EM Algorithm

Set γ = 0 (start at the easy solution)
Iterate until γ = 1 (the EM solution is reached):
  Iterate (stay on the ridge):
    E-Step: Maximize L_IB-EM by optimizing Q
    M-Step: Maximize L_IB-EM by optimizing P
  γ-Step (follow the ridge):
    Compute the gradient and direction
    Take the step by changing γ and Q
[Slide 20]

Calibrating the Step Size

Potential problem:
Step size too small → too slow
Step size too large → overshoot the target → inferior solution

[Figure: the (Q, γ) plane, showing an overshooting step that lands on an inferior branch]
![Page 21: Information Bottleneck EM School of Engineering & Computer Science The Hebrew University, Jerusalem, Israel Gal Elidan and Nir Friedman.](https://reader036.fdocuments.us/reader036/viewer/2022062713/56649ceb5503460f949b73df/html5/thumbnails/21.jpg)
Non-parametric: involves only QCan be bounded: I(T;Y) ≤ log2|T|
Calibrating the step size
0 0.2 0.4 0.6 0.8 1
0.5
1
1.5
Use change in I(T;Y)
0 0.2 0.4 0.6 0.8 1
0.5
1
1.5
I(T
;Y)
Naive
Recall that I(T;Y) measures compression of ID
When I(T;Y) rises more of data is captured
Too sparse
“Interesting”area
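A minimal sketch of such a calibration loop, assuming a callable that evaluates (or approximates) I(T;Y) at a candidate γ; the halving rule, the thresholds, and the function name are illustrative choices of this sketch, not the paper's procedure:

```python
def calibrated_gamma_schedule(info_of_gamma, d_gamma=0.1,
                              max_delta_i=0.05, min_step=1e-4):
    """Choose gamma steps so that I(T;Y) never jumps by more than
    max_delta_i between consecutive evaluations: halve the step whenever
    the change in information is too large, so the 'interesting' region
    where I(T;Y) rises sharply is sampled densely."""
    gammas, gamma = [0.0], 0.0
    while gamma < 1.0 - 1e-9:
        step = min(d_gamma, 1.0 - gamma)
        while step > min_step and \
                abs(info_of_gamma(gamma + step) - info_of_gamma(gamma)) > max_delta_i:
            step /= 2
        gamma += step
        gammas.append(gamma)
    return gammas

# Toy I(T;Y) profile: flat, then a sharp rise starting around gamma = 0.6.
info = lambda g: 0.0 if g < 0.6 else (g - 0.6) / 0.4

schedule = calibrated_gamma_schedule(info)
steps = [b - a for a, b in zip(schedule, schedule[1:])]
```

On this toy profile the schedule keeps the full step in the flat region and shrinks it where I(T;Y) changes fast, matching the "interesting area" picture above.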
[Slide 22]

The Stock Dataset [Boyen et al., 1999]

Naive Bayes model; daily changes of 20 NASDAQ stocks; 1213 train / 303 test instances.

IB-EM outperforms the best of the EM solutions
I(T;Y) follows the changes of the likelihood
Continuation approximately follows the region of change (marks show the evaluated γ values)

[Figure: I(T;Y) and train likelihood (IB-EM vs. best of EM) as functions of γ ∈ [0, 1]]
[Slide 23]

Multiple Hidden Variables

We want to learn a model with many hidden variables T1 … TK.

Naive: the joint Q(T|Y) = Q(T1 … TK | Y) is potentially exponential in the number of hiddens.

Variational approximation: use a factorized form (Mean Field)
Q(T|Y) = ∏_i Q(Ti|Y)

L_IB-EM = γ·(Variational EM functional) − (1−γ)·(Regularization)
[Friedman et al., 2002]
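The storage argument for the factorized form can be made concrete with the slide's 21 hidden variables: one table per Q(Ti|Y) is linear in the number of hiddens, while the exact joint Q(T1 … TK|Y) is exponential. The sizes below assume binary hiddens and a hypothetical count of M = 100 instances:

```python
import numpy as np

# Mean-field factorization of the posterior over K hidden variables:
#     Q(T_1 ... T_K | Y) = prod_i Q(T_i | Y)
# One (M x |T_i|) table per hidden variable is linear in K, while the
# exact joint needs an (M x |T|^K) table, exponential in K.
M, K, card = 100, 21, 2                    # M is a hypothetical sample count
factored = [np.full((M, card), 1.0 / card) for _ in range(K)]
factored_size = sum(q.size for q in factored)
exact_size = M * card ** K
print(factored_size, exact_size)
```

With these numbers the factorized representation needs 4,200 entries against roughly 2×10^8 for the exact joint, which is why the mean-field form is what makes the 21- and 25-hidden experiments on the next slides feasible.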
[Slide 24]

The USPS Digits Dataset

400 samples, 21 hidden variables.

Mean Field EM: 1 min/run; exact EM: 25 min/run; a single IB-EM run: 27 min
IB-EM is superior to all Mean Field EM runs, at the time of a single exact EM run
Only 3 of 50 exact EM runs do as well as IB-EM, so EM needs ×17 the time for similar results
Offers good value for your time!

[Figure: test log-loss per instance against percentage of random runs, comparing IB-EM to Mean Field EM and exact EM]
![Page 25: Information Bottleneck EM School of Engineering & Computer Science The Hebrew University, Jerusalem, Israel Gal Elidan and Nir Friedman.](https://reader036.fdocuments.us/reader036/viewer/2022062713/56649ceb5503460f949b73df/html5/thumbnails/25.jpg)
0 20 40 60 80 100
-151.5
-150.5
-149.5
-148.5
-147.5
Precentage of random runs
Te
st lo
g-l
os
s / i
ns
tan
ce
Mean Field EM~0.5 hours
Yeast Stress Response173 experiments (variables)
6152 genes (samples)
25 hidden variables
Superior to all Mean Field EM runs An order of magnitude faster then exact EM
Effective when exact solution becomes intractable!
IB-EM ~6 hours
Exact EM>60 hours
5-24 experiments
[Slide 26]

Summary

• New framework for learning with hidden variables
• Formal relation between the Information Bottleneck and EM
• Continuation for bypassing local maxima
• Flexible: structure / variational approximation

Future Work

• Learn the optimal γ ≤ 1 for better generalization
• Explore other approximations of Q(T|Y)
• Model selection: learning cardinality and enriching structure
[Slide 27]

Relation to Weight Annealing [Elidan et al., 2002]

Init: temp = hot
Iterate until temp = cold:
  Perturb the instance weights w proportionally to temp
  Use the reweighted empirical distribution QW and optimize
  Cool down

Similarities:
• Change in the empirical Q
• Morph towards the EM solution

Differences:
• IB-EM uses information regularization
• IB-EM uses continuation; WA requires a cooling policy
• WA is applicable to a wider range of problems
[Slide 28]

Relation to Deterministic Annealing

Init: temp = hot
Iterate until temp = cold:
  “Insert” entropy proportional to temp into the model
  Optimize the noisy model
  Cool down

Similarities:
• Use an information measure
• Morph towards the EM solution

Differences:
• DA is parameterization dependent
• IB-EM uses continuation; DA requires a cooling policy
• DA is applicable to a wider range of problems