Applications of Information Geometry
• Clustering based on Divergence: Cluster Center
• Stochastic Reasoning: Belief Propagation in Graphical Models
• Support Vector Machine
• Bayesian Framework and Restricted Boltzmann Machine
• Natural Gradient in Multilayer Perceptron Learning
• Independent Component Analysis
• Sparse Signal Analysis and Minkovskian Gradient
• Convex Optimization
Clustering of Points: $D = \{x_1, x_2, x_3, \ldots, x_N\}$
Clustering: center of a cluster of points
Partition $D$ into clusters $C_1, \ldots, C_m$. The center of a cluster $C$ is
$$\bar{x}_C = \arg\min_{c} \sum_{x_i \in C} D[x_i : c],$$
which is simply the arithmetic average in the dual coordinate system.
k-means: clustering algorithm
For a Bregman (canonical) divergence $D[x_i : c]$, setting the derivative of the total divergence to zero gives
$$\sum_{x_i \in C} \frac{\partial}{\partial c} D[x_i : c] = 0 \quad\Longrightarrow\quad \eta^*_C = \frac{1}{|C|}\sum_{x_i \in C} \eta_i,$$
so the cluster center is the mean of the points in the dual ($\eta$) coordinates.
k-means: clustering algorithm
1. Choose k cluster centers
2. Classify patterns of D into clusters
3. Calculate new cluster centers (see the sketch below)
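A minimal sketch of the three-step loop above, assuming the squared Euclidean distance as the Bregman divergence (for which the dual average is the ordinary arithmetic mean); the data, k, and stopping rule are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """k-means with the squared Euclidean divergence (a Bregman divergence)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]     # 1. choose k centers
    for _ in range(n_iter):
        # 2. classify points: nearest center under D(x:c) = |x - c|^2 / 2
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # 3. new centers: arithmetic average (dual coordinates) of each cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2) + m for m in ([0, 0], [5, 5], [0, 5])])
centers, labels = kmeans(X, k=3)
```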
Voronoi Diagram: Boundaries of Clusters
The boundary between two clusters is the set where $D[x : c_1^*] = D[x : c_2^*]$.
Total Bregman divergence (Vemuri)
$$\mathrm{tBD}(p : q) = \frac{f(p) - f(q) - \langle p - q,\, \nabla f(q)\rangle}{\sqrt{1 + |\nabla f(q)|^2}}$$
t-center of C:
$$\eta^* = \frac{\sum_i w_i\, \eta_i}{\sum_i w_i}, \qquad w_i = \frac{1}{\sqrt{1 + |\nabla f(x_i)|^2}},$$
a weighted average in the dual coordinates.
Conformal change of divergence
$$\tilde{D}[p : q] = \sigma(q)\, D[p : q]$$
$$\tilde{g}_{ij} = \sigma(p)\, g_{ij}$$
$$\tilde{T}_{ijk} = \sigma\, T_{ijk} + s_k g_{ij} + s_j g_{ik} + s_i g_{jk}, \qquad s_i = \partial_i \log \sigma$$
t-center is robust
Contaminate the data $x_1, \ldots, x_n$ with a fraction $\varepsilon$ of outliers at $y$ and consider the influence function $z(y)$ of the resulting estimator. The estimator is robust when $\|z(y)\|$ stays bounded as $y \to \infty$.
For the t-center, the weight $w(y) = 1/\sqrt{1 + |\nabla f(y)|^2}$ decreases for distant outliers, so the influence of $y$ is discounted.
Robust: $z(y)$ is bounded.
Euclidean case: $f(x) = \frac{1}{2}|x|^2$, so $\nabla f(y) = y$ and
$$w(y) = \frac{1}{\sqrt{1 + |y|^2}}.$$
The t-center minimizes the weighted average $\frac{1}{N}\sum_i w_i\, D[x_i : y]$, and its influence function $z(y)$ remains bounded, whereas the ordinary center (minimizing $\frac{1}{N}\sum_i D[x_i : y]$) is not robust.
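A small numerical illustration of the Euclidean case, assuming $f(x) = \frac{1}{2}|x|^2$ so that the total-Bregman weight is $w(y) = 1/\sqrt{1 + |y|^2}$ and the t-center becomes a weighted mean that discounts distant points; the data and the outlier are made up.

```python
import numpy as np

def t_center(X):
    """t-center for f(x) = |x|^2 / 2: weighted mean with w = 1 / sqrt(1 + |x|^2)."""
    w = 1.0 / np.sqrt(1.0 + (X ** 2).sum(axis=1))
    return (w[:, None] * X).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # inliers around the origin
X_out = np.vstack([X, [[50.0, 50.0]]])         # one large outlier

print("mean    :", X_out.mean(axis=0))         # dragged toward the outlier
print("t-center:", t_center(X_out))            # stays near the origin
```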
MPEG7 database
• Great intraclass variability, and small interclass dissimilarity.
Shape representation
First clustering then retrieval
Other TBD applications
Diffusion tensor imaging (DTI) analysis [Vemuri]
• Interpolation
• Segmentation
Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari and Frank Nielsen, Total Bregman Divergence and its Applications to DTI Analysis, IEEE TMI, to appear
Information Geometry of Stochastic Reasoning
Belief Propagation in Graphical Model
• Shun-ichi Amari (RIKEN BSI)
• Shiro Ikeda (Inst. Statist. Math.)
• Toshiyuki Tanaka (Kyoto U.)
Stochastic Reasoning: Graphical Model
$p(x, y, z, r, s)$
$p(x, y, z \mid r, s), \qquad x, y, z, \ldots \in \{1, -1\}$
Stochastic Reasoning
$q(x_1, x_2, x_3, \ldots \mid \text{observation})$
$\mathbf{x} = (x_1, x_2, x_3, \ldots), \qquad x_i \in \{1, -1\}$
$\hat{\mathbf{x}} = \arg\max q(x_1, x_2, x_3, \ldots)$ : maximum likelihood
$\hat{x}_i = \mathrm{sgn}\, E[x_i]$ : least bit error rate estimator
Mean Value Marginalization: projection to independent distributions
$$q_0(\mathbf{x}) = q_1(x_1)\, q_2(x_2) \cdots q_n(x_n)$$
$$q_i(x_i) = \int q(x_1, \ldots, x_n)\, dx_1 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_n$$
$$E_q[\mathbf{x}] = E_{q_0}[\mathbf{x}]$$
$$q(\mathbf{x}) = \exp\Big\{\sum_{i < j} w_{ij}\, x_i x_j + \sum_i h_i x_i - \psi\Big\}, \qquad x_i \in \{1, -1\}$$
In terms of cliques $c_r(\mathbf{x})$,
$$q(\mathbf{x}) = \exp\Big\{\sum_r c_r(\mathbf{x}) + \sum_i h_i x_i - \psi\Big\}.$$
Boltzmann machine, spin glass, neural networks; Turbo codes, LDPC codes
Computationally Difficult
$$q(\mathbf{x}) = \exp\Big\{\sum_r c_r(\mathbf{x}) - \psi\Big\}$$
Computing the expectation $E_q[\mathbf{x}]$ (exact marginalization) is computationally difficult: the sum runs over all $2^n$ configurations.
mean‐field approximation
belief propagation
tree propagation, CCCP (convex‐concave)
Information Geometry of Mean Field Approximation
• m-projection: $\hat{p} = \arg\min_{p \in M} D[q : p]$
• e-projection: $\hat{p} = \arg\min_{p \in M} D[p : q]$
$$D[q : p] = \sum_{\mathbf{x}} q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})}$$
$$M_0 = \Big\{ p_0(\mathbf{x}) = \prod_i p_i(x_i) \Big\}, \qquad p_0(\mathbf{x}) \in M_0$$
m-projection of $q$ onto $M_0$:
$$p^* = \arg\min_{p \in M_0} D[q : p].$$
At the minimizer, for every tangent direction $t(\mathbf{x})$ of $M_0$,
$$\sum_{\mathbf{x}} \big(q(\mathbf{x}) - p^*(\mathbf{x})\big)\, t(\mathbf{x}) = 0, \qquad\text{hence}\qquad E_q[\mathbf{x}] = E_{p^*}[\mathbf{x}]:$$
the m-projection keeps the expectation of $\mathbf{x}$.
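A tiny numpy check of this statement, assuming an arbitrary joint distribution over three binary variables: the m-projection onto the independent family is the product of marginals, and it preserves $E[\mathbf{x}]$.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
states = np.array(list(product([-1, 1], repeat=3)))    # x in {-1,+1}^3
q = rng.random(len(states)); q /= q.sum()              # an arbitrary joint q(x)

# m-projection onto M0 = {prod_i p_i(x_i)}: take the marginals q_i(x_i)
marg_plus = np.array([q[states[:, i] == 1].sum() for i in range(3)])
p0 = np.array([np.prod(np.where(s == 1, marg_plus, 1 - marg_plus)) for s in states])

print(np.allclose(q @ states, p0 @ states))   # True: E_q[x] = E_{p0}[x]
```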
Information Geometry
$$M_0 = \big\{ p_0(\mathbf{x}, \boldsymbol{\theta}_0) = \exp\{\boldsymbol{\theta}_0 \cdot \mathbf{x} - \psi_0\} \big\}$$
$$M_r = \big\{ p_r(\mathbf{x}, \boldsymbol{\zeta}_r) = \exp\{c_r(\mathbf{x}) + \boldsymbol{\zeta}_r \cdot \mathbf{x} - \psi_r\} \big\}, \qquad r = 1, \ldots, L$$
$$q(\mathbf{x}) = \exp\Big\{\sum_r c_r(\mathbf{x}) + \mathbf{h} \cdot \mathbf{x} - \psi\Big\}$$
Belief Prop Algorithm
Submanifolds $M_0$, $M_r$, $M_{r'}$; messages $\boldsymbol{\xi}_r$. At step $t$:
$$\boldsymbol{\zeta}_r^{t} = \sum_{r' \ne r} \boldsymbol{\xi}_{r'}^{t}, \qquad p_r(\mathbf{x}, \boldsymbol{\zeta}_r^{t}) = \exp\{c_r(\mathbf{x}) + \boldsymbol{\zeta}_r^{t} \cdot \mathbf{x} - \psi_r\} : \text{belief for } c_r(\mathbf{x}).$$
m-project each belief $p_r(\mathbf{x}, \boldsymbol{\zeta}_r^{t})$ onto $M_0$ to obtain $\boldsymbol{\theta}_r^{t+1}$, and update the messages:
$$\boldsymbol{\xi}_r^{t+1} = \boldsymbol{\theta}_r^{t+1} - \boldsymbol{\zeta}_r^{t}, \qquad \boldsymbol{\theta}^{t+1} = \sum_r \boldsymbol{\xi}_r^{t+1}.$$
Belief Propagation: $p_r(\mathbf{x}, \boldsymbol{\zeta}_r) = \exp\{c_r(\mathbf{x}) + \boldsymbol{\zeta}_r \cdot \mathbf{x} - \psi_r\}$
Equilibrium of BP: $(\boldsymbol{\theta}^*, \boldsymbol{\xi}_r^*)$
1) m-condition: the m-projection of every belief $p_r(\mathbf{x}, \boldsymbol{\zeta}_r^*)$ onto $M_0$ is $p_0(\mathbf{x}, \boldsymbol{\theta}^*)$; $q$, the beliefs and $p_0(\mathbf{x}, \boldsymbol{\theta}^*)$ lie in a common m-flat submanifold.
2) e-condition: $\boldsymbol{\theta}^* = \frac{1}{L-1}\sum_r \boldsymbol{\zeta}_r^*$; $p_0(\mathbf{x}, \boldsymbol{\theta}^*)$ and the beliefs lie in a common e-flat submanifold containing $q$.
Belief Propagation e‐condition OK
CCCP m‐condition OK
The BP iteration is a map
$$(\boldsymbol{\theta}; \boldsymbol{\xi}_1, \boldsymbol{\xi}_2, \ldots, \boldsymbol{\xi}_L) \;\longmapsto\; (\boldsymbol{\theta}'; \boldsymbol{\xi}_1', \boldsymbol{\xi}_2', \ldots, \boldsymbol{\xi}_L'),$$
$$\boldsymbol{\theta}_r' :\;\; \pi_0 \circ p_r(\mathbf{x}, \boldsymbol{\zeta}_r) = p_0(\mathbf{x}, \boldsymbol{\theta}_r'), \qquad \boldsymbol{\xi}_r' = \boldsymbol{\theta}_r' - \boldsymbol{\zeta}_r, \qquad \boldsymbol{\theta}' = \sum_r \boldsymbol{\xi}_r'.$$
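For concreteness, a plain sum-product loopy BP sketch on a small binary pairwise model (written directly in terms of messages rather than the $\boldsymbol{\xi}_r$ parameterization above); the couplings J and fields h are arbitrary test values.

```python
import numpy as np

def loopy_bp(J, h, n_iter=50):
    """Sum-product BP on q(x) ∝ exp(Σ J_ij x_i x_j + Σ h_i x_i), x_i in {-1,+1}."""
    n = len(h)
    xs = np.array([-1.0, 1.0])
    edges = [(i, j) for i in range(n) for j in range(n) if i != j and J[i, j] != 0]
    msg = {e: np.ones(2) / 2 for e in edges}           # messages m_{i->j}(x_j)
    for _ in range(n_iter):
        new = {}
        for (i, j) in edges:
            # incoming messages to node i, excluding the one coming from j
            inc = np.ones(2)
            for (k, l) in edges:
                if l == i and k != j:
                    inc = inc * msg[(k, i)]
            psi = np.exp(J[i, j] * np.outer(xs, xs) + h[i] * xs[:, None])  # (x_i, x_j)
            m = (psi * inc[:, None]).sum(axis=0)       # marginalize over x_i
            new[(i, j)] = m / m.sum()
        msg = new
    beliefs = np.zeros((n, 2))
    for i in range(n):
        b = np.exp(h[i] * xs)
        for (k, l) in edges:
            if l == i:
                b = b * msg[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs        # beliefs[i] = approximate [q_i(-1), q_i(+1)]

J = np.array([[0, .5, .5], [.5, 0, .5], [.5, .5, 0]])  # a small loopy graph
h = np.array([0.2, -0.1, 0.3])
print(loopy_bp(J, h))
```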
Convex-Concave Computational Procedure (CCCP): A. Yuille
Decompose the free energy as a difference of convex functions,
$$F(\boldsymbol{\theta}) = F_1(\boldsymbol{\theta}) - F_2(\boldsymbol{\theta}),$$
and iterate
$$\nabla F_1(\boldsymbol{\theta}^{t+1}) = \nabla F_2(\boldsymbol{\theta}^{t}).$$
Elimination of double loops.
Geometry of Support Vector Machine: modification of kernel
Simple perceptron: decision surface
$$f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b, \qquad d = \frac{|\mathbf{w} \cdot \mathbf{x} + b|}{|\mathbf{w}|}$$
minimize $|\mathbf{w}|^2$ under the constraints $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$
$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}|\mathbf{w}|^2 - \sum_i \alpha_i\, \big\{ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \big\}$$
$\alpha_i > 0$: support vector ($\alpha_i = 0$ otherwise)
$$\mathbf{w} = \sum_i \alpha_i y_i\, \mathbf{x}_i, \qquad f(\mathbf{x}) = \sum_i \alpha_i y_i\, \mathbf{x}_i \cdot \mathbf{x} + b$$
Embedding in a higher-dimensional space
$$\mathbf{x} \mapsto \mathbf{z} = \boldsymbol{\varphi}(\mathbf{x}), \qquad f(\mathbf{x}) = \mathbf{w} \cdot \boldsymbol{\varphi}(\mathbf{x}) + b$$
$\mathbf{z}$: infinite-dimensional
Kernel trick:
$$K(\mathbf{x}, \mathbf{x}') = \boldsymbol{\varphi}(\mathbf{x}) \cdot \boldsymbol{\varphi}(\mathbf{x}')$$
$$f(\mathbf{x}) = \sum_i \alpha_i y_i\, K(\mathbf{x}, \mathbf{x}_i) + b$$
Riemannian metric induced by the kernel:
$$ds^2 = |\mathbf{z}(\mathbf{x} + d\mathbf{x}) - \mathbf{z}(\mathbf{x})|^2 = \sum_{i,j} g_{ij}(\mathbf{x})\, dx_i\, dx_j$$
$$g_{ij}(\mathbf{x}) = \frac{\partial \mathbf{z}(\mathbf{x})}{\partial x_i} \cdot \frac{\partial \mathbf{z}(\mathbf{x})}{\partial x_j} = \frac{\partial^2}{\partial x_i\, \partial x_j'} K(\mathbf{x}, \mathbf{x}') \Big|_{\mathbf{x}' = \mathbf{x}}$$
Conformal Transformation of Kernel
$$\tilde{K}(\mathbf{x}, \mathbf{x}') = \sigma(\mathbf{x})\, \sigma(\mathbf{x}')\, K(\mathbf{x}, \mathbf{x}'), \qquad \sigma(\mathbf{x}) = \exp\{\kappa f(\mathbf{x})\} \;\text{ or }\; \sigma(\mathbf{x}) = \exp\{-\kappa\, |\mathbf{x} - \mathbf{x}^*|^2\}$$
$$\tilde{g}_{ij}(\mathbf{x}) = \sigma(\mathbf{x})^2\, g_{ij}(\mathbf{x}) + s_i(\mathbf{x})\, s_j(\mathbf{x}), \qquad s_i = \partial_i \sigma(\mathbf{x})$$
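A brief sketch of the conformal modification applied to a Gaussian kernel, with $\sigma$ peaked at a chosen point $\mathbf{x}^*$ so that the induced metric is magnified there; the point $\mathbf{x}^*$, the width $\kappa$, and the data are illustrative.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def conformal_rbf(X, Y, x_star, kappa=1.0, gamma=1.0):
    """K~(x, x') = sigma(x) sigma(x') K(x, x'), sigma(x) = exp(-kappa |x - x*|^2)."""
    sx = np.exp(-kappa * ((X - x_star) ** 2).sum(-1))
    sy = np.exp(-kappa * ((Y - x_star) ** 2).sum(-1))
    return sx[:, None] * rbf(X, Y, gamma) * sy[None, :]

X = np.random.randn(5, 2)
print(conformal_rbf(X, X, x_star=np.zeros(2)))
```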
Basic Principles of Unsupervised and Supervised Learning Toward Deep Learning
Shun-ichi Amari (RIKEN Brain Science Institute)
collaborators: R. Karakida, M. Okada (U. Tokyo)
Deep Learning
• Self-Organization + Supervised Learning
• RBM: Restricted Boltzmann Machine
• Auto-Encoder, Recurrent Net
• Dropout
• Contrastive divergence
Boltzmann Machine
RBM: Restricted Boltzmann Machine
RBM
Simple Hebbian Self‐Organization
self‐organization of
Self‐Organization
Interaction of Hidden Neurons
Bayesian Duality in Exponential Family
Data x Parameter (higher‐order concepts)
Curved exponential family
RBM: visible units v, hidden units h, coupling matrix W
Gaussian Boltzmann Machine
Equilibrium Solution
Equilibrium Solution
General Solution
• You can choose m (≤ k) eigenvalues to form the solution
• diagonalized by
Stable Solution: the case of m = k
‐ No connections within layers
Contrastive Divergence
• How to train an RBM: Maximum Likelihood (ML) learning is hard
• RBM ‐ 2‐layered probabilistic neural network
[Figure: Gibbs sampling chain from the input toward equilibrium, alternating between visible units v and hidden units h through the weights W.]
Many iterations of Gibbs sampling demand too much computational time.
Contrastive Divergence Solution
Two Manifolds
Geometry of CDn (contrastive divergence)
Bernoulli‐Gaussian RBM
ICA (R. Karakida)
Simulation
[Simulation setup: independent sources p(s) are mixed and fed as input; the output is compared with the ICA solution; uniformly distributed sources.]
The number of Neurons: N = M = 2, σ = 1/2
Independent sources are extracted in G‐B RBM
CD
Information Geometry of Neuromanifolds
Natural Gradient and Singularities
Shun-ichi Amari (RIKEN Brain Science Institute)
Multilayer Perceptron
Mathematical Neurons
$$y = \varphi\Big(\sum_i w_i x_i - h\Big) = \varphi(\mathbf{w} \cdot \mathbf{x})$$
$\mathbf{x} \to y$, activation $\varphi(u)$
Multilayer Perceptrons
$$y = \sum_i v_i\, \varphi(\mathbf{w}_i \cdot \mathbf{x}) + n$$
$$p(y \mid \mathbf{x}; \boldsymbol{\theta}) = c \exp\Big\{-\frac{1}{2}\big(y - f(\mathbf{x}, \boldsymbol{\theta})\big)^2\Big\}, \qquad f(\mathbf{x}, \boldsymbol{\theta}) = \sum_i v_i\, \varphi(\mathbf{w}_i \cdot \mathbf{x})$$
$$\mathbf{x} = (x_1, x_2, \ldots, x_n), \qquad \boldsymbol{\theta} = (\mathbf{w}_1, \ldots, \mathbf{w}_m; v_1, \ldots, v_m)$$
Multilayer Perceptron: Neuromanifold
$$y = f(\mathbf{x}, \boldsymbol{\theta}) + n = \sum_i v_i\, \varphi(\mathbf{w}_i \cdot \mathbf{x}) + n, \qquad \boldsymbol{\theta} = (\mathbf{w}_1, \ldots, \mathbf{w}_m; v_1, \ldots, v_m)$$
neuromanifold $\{f(\mathbf{x}, \boldsymbol{\theta})\}$: a space of functions
Universal Function Approximator
Any smooth function can be approximated by
$$f(\mathbf{x}) \approx \sum_{i=1}^{m} v_i\, \varphi(\mathbf{w}_i \cdot \mathbf{x}),$$
with the approximation improving as the number of hidden units $m$ grows.
Learning from examples
examples: $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$
learning = estimation of $\hat{f}(\mathbf{x})$, i.e. of $\boldsymbol{\theta}$
Backpropagation --- gradient learning
examples (training set): $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_t, y_t)$, with $y = f(\mathbf{x}, \boldsymbol{\theta}) + n$
$$E(\mathbf{x}, y; \boldsymbol{\theta}) = \frac{1}{2}\big(y - f(\mathbf{x}, \boldsymbol{\theta})\big)^2 = -\log p(y \mid \mathbf{x}; \boldsymbol{\theta}) + \text{const}$$
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \frac{\partial E(\mathbf{x}_t, y_t; \boldsymbol{\theta}_t)}{\partial \boldsymbol{\theta}}, \qquad f(\mathbf{x}, \boldsymbol{\theta}) = \sum_i v_i\, \varphi(\mathbf{w}_i \cdot \mathbf{x})$$
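A minimal numpy sketch of this stochastic-gradient (backprop) rule for a one-hidden-layer perceptron with tanh units; the teacher signal, learning rate, and sizes are illustrative.

```python
import numpy as np

def mlp_grad_step(x, y, W, v, eta=0.05):
    """One backprop step for f(x) = sum_i v_i * tanh(w_i . x), minimizing E = (y - f)^2 / 2."""
    u = W @ x                        # pre-activations of the m hidden units
    z = np.tanh(u)                   # hidden outputs phi(w_i . x)
    err = v @ z - y
    grad_v = err * z                                            # dE/dv_i
    grad_W = err * (v * (1 - z ** 2))[:, None] * x[None, :]     # dE/dw_i
    return W - eta * grad_W, v - eta * grad_v

rng = np.random.default_rng(0)
n, m = 3, 5
W, v = rng.normal(size=(m, n)), rng.normal(size=m)
for _ in range(1000):
    x = rng.normal(size=n)
    y = np.sin(x.sum())              # toy teacher signal
    W, v = mlp_grad_step(x, y, W, v)
```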
Multilayer Perceptron: Neuromanifold (Metric and Topology)
$$y = f(\mathbf{x}, \boldsymbol{\theta}) + n = \sum_i v_i\, \varphi(\mathbf{w}_i \cdot \mathbf{x}) + n, \qquad \boldsymbol{\theta} = (\mathbf{w}_1, \ldots, \mathbf{w}_m; v_1, \ldots, v_m)$$
neuromanifold $\{f(\mathbf{x}, \boldsymbol{\theta})\}$: a space of functions
Metric: Riemannian manifold
$$ds^2 = \sum_{i,j} g_{ij}\, d\theta^i\, d\theta^j = d\boldsymbol{\theta}^T G\, d\boldsymbol{\theta}$$
$$g_{ij}(\boldsymbol{\theta}) = E\Big[\frac{\partial \log p(y \mid x; \boldsymbol{\theta})}{\partial \theta^i}\, \frac{\partial \log p(y \mid x; \boldsymbol{\theta})}{\partial \theta^j}\Big] \qquad \text{(Fisher information)}$$
Topology: Neuromanifold
• Metrical structure
• Topological structure
singularities
Geometry of singular model
$$y = v\, \varphi(\mathbf{w} \cdot \mathbf{x}) + n$$
The model is singular where $v = 0$ or $|\mathbf{w}| = 0$.
Parameter Space S
Equivalence
1) $v = 0$: $\mathbf{w}$ is arbitrary
2) $\mathbf{w}_i = \mathbf{w}_j$: only $v_i + v_j$ matters
$$M = S / \sim, \qquad y = \sum_i v_i\, \varphi(\mathbf{w}_i \cdot \mathbf{x}) + n$$
Topological Singularities
$S$: parameter space; $M$: behavioral space; $M = S / \sim$
2 hidden units:
$$y = v_1\, \varphi(\mathbf{w}_1 \cdot \mathbf{x}) + v_2\, \varphi(\mathbf{w}_2 \cdot \mathbf{x}) + n$$
singular region of $S$: $\mathbf{w}_1 = \mathbf{w}_2$, or $v_1 = 0$, or $v_2 = 0$
Gaussian mixtures
$$p(x) = \sum_i v_i\, \frac{1}{\sqrt{2\pi}} \exp\Big\{-\frac{1}{2}(x - w_i)^2\Big\}$$
Two-component Gaussian mixture:
$$p(x; v, w_1, w_2) = (1 - v)\, \varphi(x - w_1) + v\, \varphi(x - w_2), \qquad \varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$
singular: $w_1 = w_2$, or $v(1 - v) = 0$
Learning, Estimation, and Model Selection
$$E_{\mathrm{gen}} = D\big[p_0(y \mid \mathbf{x}) : p(y \mid \mathbf{x}; \hat{\boldsymbol{\theta}})\big], \qquad E_{\mathrm{train}} = D\big[p_{\mathrm{emp}}(y \mid \mathbf{x}) : p(y \mid \mathbf{x}; \hat{\boldsymbol{\theta}})\big]$$
For a regular model ($d$: dimension),
$$\langle E_{\mathrm{gen}} \rangle = \frac{d}{2n}, \qquad \langle E_{\mathrm{gen}} \rangle = \langle E_{\mathrm{train}} \rangle + \frac{d}{n}.$$
Problem of Backprop
• slow convergence --- plateau --- saddle
• local minima
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \nabla l(\mathbf{x}_t, y_t; \boldsymbol{\theta}_t)$$
Flaws of MLP: slow convergence (plateau); local minima
Boosting, Bagging, SVM
Steepest Direction --- Natural Gradient
The steepest descent direction of $l(\boldsymbol{\theta})$ in the Riemannian space is not $-\nabla l$ but the natural gradient
$$\tilde{\nabla} l(\boldsymbol{\theta}) = G^{-1}(\boldsymbol{\theta})\, \nabla l(\boldsymbol{\theta}), \qquad |d\boldsymbol{\theta}|^2 = \sum_{i,j} g_{ij}\, d\theta^i\, d\theta^j = d\boldsymbol{\theta}^T G\, d\boldsymbol{\theta},$$
with $l = l(\mathbf{x}_t, y_t; \boldsymbol{\theta}_t)$.
Natural Gradient
$$\max_{d\boldsymbol{\theta}}\; dl = \nabla l \cdot d\boldsymbol{\theta} \quad \text{under} \quad \sum_{i,j} g_{ij}\, d\theta^i\, d\theta^j = \varepsilon^2$$
$$\Longrightarrow\quad d\boldsymbol{\theta} \propto G^{-1}\, \nabla l, \qquad l = l(\mathbf{x}_t, y_t; \boldsymbol{\theta}_t)$$
Information Geometry of MLP: Natural Gradient Learning (S. Amari; H. Y. Park)
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \hat{G}_t^{-1}\, \nabla l_t$$
Adaptive natural gradient learning:
$$\hat{G}_{t+1}^{-1} = (1 + \varepsilon_t)\, \hat{G}_t^{-1} - \varepsilon_t\, \hat{G}_t^{-1}\, \nabla f\, (\nabla f)^T\, \hat{G}_t^{-1}$$
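A sketch of one step of the adaptive natural-gradient recursion above; the toy driving gradients, step sizes, and the reuse of the same vector for $\nabla l$ and $\nabla f$ are illustrative assumptions, not part of the slides.

```python
import numpy as np

def adaptive_natgrad_step(theta, grad_l, grad_f, G_inv, eta=0.01, eps=0.01):
    """One adaptive natural-gradient step (a sketch): precondition the loss gradient
    by the running estimate of G^{-1}, then refresh G^{-1} with a rank-one update."""
    theta_new = theta - eta * G_inv @ grad_l
    Gg = G_inv @ grad_f
    G_inv_new = (1.0 + eps) * G_inv - eps * np.outer(Gg, Gg)
    return theta_new, G_inv_new

# toy driving gradients (in an MLP these would come from backprop)
rng = np.random.default_rng(0)
d = 4
theta, G_inv = np.zeros(d), np.eye(d)
for _ in range(100):
    g = rng.normal(size=d)
    theta, G_inv = adaptive_natgrad_step(theta, g, g, G_inv)
print(G_inv)
```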
Regular statistical model $M = \{p(\mathbf{x}, \boldsymbol{\theta})\}$
$G$: Fisher information matrix
$$E\big[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^T\big] = \frac{1}{n}\, G^{-1}$$
$$E\Big[\mathrm{KL}\big\{p(\mathbf{x}, \boldsymbol{\theta}_0) : p(\mathbf{x}, \hat{\boldsymbol{\theta}})\big\}\Big] = \frac{d}{2n}$$
AIC, MDL
Landscape of error at singularity: Milnor attractor
Dynamics of Learning
$$\frac{d\boldsymbol{\theta}}{dt} = -\eta\, \nabla l(\boldsymbol{\theta}), \qquad \frac{d\boldsymbol{\theta}}{dt} = -\eta\, G^{-1} \nabla l(\boldsymbol{\theta})$$
Reduced dynamics near the singularity in the coordinates $(u, z)$:
$$\frac{du}{dt} = f(u, z), \qquad \frac{dz}{dt} = k(u, z), \qquad \frac{du}{dz} = \frac{f(u, z)}{k(u, z)},$$
whose trajectories form a one-parameter family indexed by an integration constant $c$.
Coordinate Transformation
For the two-hidden-unit model, replace $(\mathbf{w}_1, \mathbf{w}_2, v_1, v_2)$ by coordinates $(\mathbf{w}, v, \mathbf{u}, z)$ adapted to the singularity, with $\mathbf{u} \propto \mathbf{w}_1 - \mathbf{w}_2$ and $z \propto v_1 - v_2$; the singular region is $\mathbf{u} = 0$.
Dynamics of Learning: Natural gradient works!!
$$\frac{d\boldsymbol{\theta}}{dt} = -\eta\, G^{-1} \nabla l, \qquad \frac{du}{dt} = f(u, z), \qquad \frac{dz}{dt} = k(u, z)$$
Dynamic vector fields: General case (|z|<1 part stable)
Dynamic vector fields: General case (|z|>1 part stable )
Adaptive Natural Gradient works well
Signal Processing
ICA: Independent Component Analysis
$$\mathbf{x}_t = A\,\mathbf{s}_t, \qquad x_i = \sum_j A_{ij}\, s_j$$
• sparse component analysis
• positive matrix factorization
mixture and unmixing of independent signals: observations $x_1, \ldots, x_m$ generated from sources $s_1, \ldots, s_n$
Independent Component Analysis
$$\mathbf{x} = A\mathbf{s}, \qquad x_i = \sum_j A_{ij}\, s_j, \qquad \mathbf{y} = W\mathbf{x}, \qquad W \to A^{-1}$$
observations: $\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(t)$; recover: $\mathbf{s}(1), \mathbf{s}(2), \ldots, \mathbf{s}(t)$
Semiparametric Statistical Model
$$p(\mathbf{x}; W, r) = |\det W|\, \prod_i r_i\big((W\mathbf{x})_i\big), \qquad W = A^{-1}$$
$r(\mathbf{s})$: unknown source densities; observations $\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(t)$
Natural Gradient
$$\tilde{\nabla} l(W) = \frac{\partial l}{\partial W}\, W^T W, \qquad \mathbf{y} = W\mathbf{x}$$
Space of Matrices: Lie group
$$dX = dW\, W^{-1} \quad (\text{non-holonomic basis}), \qquad ds^2 = \mathrm{tr}\big(dX\, dX^T\big)$$
The steepest direction with respect to this invariant metric gives
$$\tilde{\nabla} l = \frac{\partial l}{\partial W}\, W^T W, \qquad W \mapsto (I + dX)\, W.$$
Information Geometry of ICA
natural gradient, estimating functions, stability, efficiency
$$S = \{p(\mathbf{y})\}, \qquad I = \{q_1(y_1)\, q_2(y_2) \cdots q_n(y_n)\}, \qquad \{p(W\mathbf{x})\}$$
$$l(W, \mathbf{y}) = \mathrm{KL}\big[p(\mathbf{y}; W) : q(\mathbf{y})\big], \qquad q(\mathbf{y}) = \prod_i r_i(y_i)$$
Estimating Functions
$$E_{W, r}\big[\mathbf{F}(\mathbf{y}, W)\big] = 0 \;\text{ for all } W, r, \qquad \ne 0 \;\text{ for } W' \ne W$$
estimating equation:
$$\sum_t \mathbf{F}(\mathbf{y}_t, W) = 0, \qquad \mathbf{y}_t = W\mathbf{x}_t$$
learning:
$$\Delta W_t = -\eta_t\, \mathbf{F}(\mathbf{y}_t, W_t)$$
Admissible class (on-line learning)
canonical estimating function:
$$\mathbf{F}(\mathbf{y}, W) = \big(I - \boldsymbol{\varphi}(\mathbf{y})\, \mathbf{y}^T\big)\, W$$
Any admissible estimating function has the form $R(W)\, \mathbf{F}(\mathbf{y}, W)$, giving the learning rule
$$W_{t+1} = W_t - \eta_t\, R(W_t)\, \mathbf{F}(\mathbf{y}_t, W_t), \qquad \mathbf{y}_t = W_t \mathbf{x}_t,$$
with components $\{\delta_{ij} - \varphi_i(y_i)\, y_j\}\, W$ in the canonical case.
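A compact numpy sketch of the natural-gradient ICA rule $\Delta W \propto (I - \boldsymbol{\varphi}(\mathbf{y})\mathbf{y}^T)W$ with $\varphi = \tanh$ (a common choice for super-Gaussian sources); the mixing matrix, source model, and learning schedule are illustrative.

```python
import numpy as np

def natgrad_ica(X, eta=0.02, n_iter=500, seed=0):
    """Natural-gradient ICA:  W <- W + eta * (I - phi(y) y^T) W,  y = W x."""
    n, T = X.shape
    W = np.eye(n) + 0.01 * np.random.default_rng(seed).normal(size=(n, n))
    for _ in range(n_iter):
        Y = W @ X
        W = W + eta * (np.eye(n) - (np.tanh(Y) @ Y.T) / T) @ W
    return W

rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 5000))               # two super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])        # mixing matrix
W = natgrad_ica(A @ S)
print(np.round(W @ A, 2))   # should be close to a scaled permutation of the identity
```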
Basis Given: overcomplete case; Sparse Solution
$$\mathbf{x} = A\hat{\mathbf{s}} = \sum_i \hat{s}_i\, \mathbf{a}_i \qquad (A \text{ given, more columns than rows})$$
many solutions; many components can be 0
generalized-inverse ($L_2$-norm) solution:
$$\min \sum_i \hat{s}_i^{\,2} \quad \text{s.t.} \quad \mathbf{x} = A\hat{\mathbf{s}}$$
sparse ($L_1$-norm) solution:
$$\min \sum_i |\hat{s}_i| \quad \text{s.t.} \quad \mathbf{x} = A\hat{\mathbf{s}}$$
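A short numerical contrast of the two solutions above: the generalized-inverse (minimum $L_2$-norm) answer to an underdetermined $\mathbf{x} = A\hat{\mathbf{s}}$ spreads energy over all components even when the true source is sparse; the dimensions and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 8))                            # overcomplete basis: 8 atoms in R^3
s_true = np.zeros(8); s_true[[1, 5]] = [2.0, -1.5]     # a sparse source
x = A @ s_true

s_l2 = np.linalg.pinv(A) @ x                           # generalized-inverse (min L2-norm) solution
print(np.round(s_l2, 2))                               # satisfies x = A s but is not sparse
```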
Minkovskian Gradient
minimize a convex function $\varphi(\boldsymbol{\beta})$ under a constraint $F(\boldsymbol{\beta}) \le c$
typical case:
$$\varphi(\boldsymbol{\beta}) = \frac{1}{2}\, \|\mathbf{y} - X\boldsymbol{\beta}\|^2 = \frac{1}{2}(\boldsymbol{\beta} - \boldsymbol{\beta}^*)^T G\, (\boldsymbol{\beta} - \boldsymbol{\beta}^*) + \text{const}, \qquad G = X^T X$$
$$F_p(\boldsymbol{\beta}) = \sum_i |\beta_i|^p, \qquad p = 2,\; 1,\; 1/2$$
Optimization under Sparsity Condition
$$\mathbf{y} = X\boldsymbol{\beta} + \mathbf{n}, \qquad \boldsymbol{\beta}: k\text{-sparse}, \qquad m \sim 2k\log n \text{ observations}$$
Linear regression:
$$y_t = \sum_i \beta_i\, x_{ti} + n_t, \qquad t = 1, 2, \ldots, N$$
Overcomplete case ($N < k$): under-determined, infinitely many solutions.
Sparsity constraint solves the problem.
Maximum likelihood estimator (Euclidean case, Gaussian noise):
$$\hat{\boldsymbol{\beta}} = \arg\min \frac{1}{2} \sum_t \big(y_t - u_t\big)^2, \qquad u_t = \boldsymbol{\beta} \cdot \mathbf{x}_t$$
$$\sum_t \big(y_t - \boldsymbol{\beta} \cdot \mathbf{x}_t\big)\, \mathbf{x}_t = 0 \quad\Longrightarrow\quad G\hat{\boldsymbol{\beta}} = X^T\mathbf{y}, \qquad G = X^T X$$
$$\hat{\boldsymbol{\beta}} = \big(X^T X\big)^{-} X^T \mathbf{y} = X^{\dagger}\mathbf{y} \qquad (\text{generalized inverse})$$
Not sparse.
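A minimal ISTA (iterative soft-thresholding) sketch for the $\lambda$-form $\min \frac{1}{2}\|\mathbf{y} - X\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|_1$; ISTA itself is not discussed on the slides, it is just one standard way to obtain the sparse solution contrasted here with the non-sparse least-squares estimate.

```python
import numpy as np

def ista(X, y, lam, n_iter=500):
    """Minimize 0.5*|y - X b|^2 + lam*|b|_1 by iterative soft-thresholding."""
    L = np.linalg.norm(X, 2) ** 2             # Lipschitz constant of the quadratic gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = X.T @ (X @ b - y)                  # gradient of the quadratic part
        z = b - g / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))                 # N = 30 observations, 100 coefficients
beta = np.zeros(100); beta[[3, 40, 77]] = [1.5, -2.0, 1.0]
y = X @ beta + 0.01 * rng.normal(size=30)
print(np.nonzero(np.abs(ista(X, y, lam=0.1)) > 1e-3)[0])   # should recover the support {3, 40, 77}
```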
Sparse Solution
$$\min D[\boldsymbol{\beta}^* : \boldsymbol{\beta}] = \min \frac{1}{2}(\boldsymbol{\beta} - \boldsymbol{\beta}^*)^T G\, (\boldsymbol{\beta} - \boldsymbol{\beta}^*) \quad \text{under the penalty} \quad F_p(\boldsymbol{\beta}) = \sum_i |\beta_i|^p \le c$$
• $F_0 = \#\{i : \beta_i \ne 0\}$ : $L_0$, the sparsest solution
• $F_1 = \sum_i |\beta_i|$ : $L_1$ solution
• $F_p$, $0 < p < 1$ : sparse, non-convex
• $F_2 = \sum_i \beta_i^2$ : generalized inverse solution (not sparse)
Sparse solution: overcomplete case
$$\min_{\boldsymbol{\beta}} D(\boldsymbol{\beta}^* : \boldsymbol{\beta}) \quad \text{s.t.} \quad F(\boldsymbol{\beta}) = c:$$
orthogonal projection of $\boldsymbol{\beta}^*$ onto the surface $F(\boldsymbol{\beta}_c) = c$ along a dual geodesic.
[Fig. 1: projection onto $F(\boldsymbol{\beta}_c) = c$ in $R^n$ ($n = 2$): a) $p = 2$; b) $p = 1$; c) $p < 1$ (non-convex).]
L1-constrained optimization: LASSO, LARS
$P_c$ problem: $\min \varphi(\boldsymbol{\beta})$ under $F(\boldsymbol{\beta}) \le c$; solution $\boldsymbol{\beta}_c \to 0$ as $c \to 0$.
$P_\lambda$ problem: $\min \varphi(\boldsymbol{\beta}) + \lambda F(\boldsymbol{\beta})$; solution $\boldsymbol{\beta}_\lambda \to 0$ as $\lambda \to \infty$.
For $p \ge 1$ the two solution families coincide: $\boldsymbol{\beta}^*_c = \boldsymbol{\beta}^*_\lambda$ with $\lambda = \lambda(c)$.
For $p < 1$ the correspondence $\lambda = \lambda(c)$ may be multiple-valued and non-continuous, and the stability of the two problems differs.
Projection from $\boldsymbol{\beta}^*$ to $F = c$ (information geometry).
[Fig. 5: subgradient of $F$ on the surface $F(\boldsymbol{\beta}_c) = c$.]
LASSO path and LARS path (stagewise solution)
$$P_c: \min \varphi(\boldsymbol{\beta}) \;\text{ s.t. }\; F(\boldsymbol{\beta}) \le c, \qquad P_\lambda: \min \varphi(\boldsymbol{\beta}) + \lambda F(\boldsymbol{\beta}), \qquad c \leftrightarrow \lambda \text{ correspondence}$$
Active set and gradient
Active set $A = \{ i : \beta_i \ne 0 \}$. The (sub)gradient of $F_p$ is
$$\partial_i F_p = \begin{cases} p\, \mathrm{sgn}(\beta_i)\, |\beta_i|^{p-1}, & i \in A, \\ \in [-1, 1] \;(\text{for } p = 1), & i \notin A. \end{cases}$$
Solution path
Differentiating the optimality condition $\nabla\varphi(\boldsymbol{\beta}_c) + \lambda_c \nabla F(\boldsymbol{\beta}_c) = 0$ along $c$ gives
$$\frac{d\boldsymbol{\beta}_c}{dc} \propto K_c^{-1}\, \nabla F(\boldsymbol{\beta}_c), \qquad K_c = G + \lambda_c\, \nabla\nabla F(\boldsymbol{\beta}_c),$$
where for $p = 1$, $\nabla\nabla F_1 = 0$ and $\nabla F_1 = (\mathrm{sgn}\,\beta_i)$ on the active set.
Solution path in the subspace of the active set:
$$\frac{d\boldsymbol{\beta}_A}{dc} \propto K_A^{-1}\, \nabla F_A \qquad (\text{active direction}),$$
with turning points where the active set $A$ changes.
Gradient Descent Method
$\{\partial_i L(\boldsymbol{\theta})\}$: covariant; $\{\sum_j g^{ij} \partial_j L\}$: contravariant
$$\min L(\boldsymbol{\theta} + \mathbf{a}) \;\text{ under }\; G(\mathbf{a}) = \varepsilon^2, \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - c_t\, G^{-1} \nabla L(\boldsymbol{\theta}_t)$$
• Riemannian metric: $G(\mathbf{a}) = \sum_{i,j} g_{ij}\, a_i a_j$
• Minkovskian metric ($L_p$-norm): $G(\mathbf{a}) = \sum_i |a_i|^p$, $p \ge 1$
Steepest direction of L
With $f_i = \partial_i L$ and $\partial G / \partial a_i = p\, \mathrm{sgn}(a_i)\, |a_i|^{p-1}$, the steepest direction under the $L_p$ constraint is
$$a_i = c\, \mathrm{sgn}(f_i)\, |f_i|^{1/(p-1)}.$$
• Riemannian case $G(\mathbf{a}) = \sum g_{ij} a_i a_j$: $\mathbf{a} \propto G^{-1}\mathbf{f}$ (natural gradient)
• Euclidean case ($p = 2$): $\mathbf{a} \propto \mathbf{f}$
• $p \to 1$:
$$F_i = \begin{cases} \mathrm{sgn}(f_i), & i = \arg\max_j |f_j|, \\ 0, & \text{otherwise}, \end{cases} \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \mathbf{F} \quad (\text{LASSO})$$
Try for various $p$, $p \to 1$; try for various noise functions; LASSO and flat geometry.
Extended LARS (p = 1) and Minkovskian gradient
$L_p$-norm: $\|\mathbf{a}\|_p = \big(\sum_i |a_i|^p\big)^{1/p}$
$$\max \nabla L \cdot \mathbf{a} \;\text{ under }\; \|\mathbf{a}\|_p = 1, \qquad p \to 1:$$
$$a_i = \begin{cases} \mathrm{sgn}(f_i), & i = \arg\max_j |f_j|, \\ 0, & \text{otherwise}, \end{cases} \qquad \boldsymbol{\beta}_{t+1} = \boldsymbol{\beta}_t - \eta\, \mathbf{F} \quad (\text{LARS})$$
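A sketch of the $p \to 1$ greedy rule above (a stagewise, LARS-like path): at each step only the coordinate with the largest gradient magnitude moves, by a small step in the direction of its sign; the step size and stopping rule are illustrative.

```python
import numpy as np

def stagewise_path(X, y, step=0.01, n_steps=2000):
    """Greedy p -> 1 (Minkovskian-gradient) path: update only argmax_i |f_i|, f = gradient."""
    b = np.zeros(X.shape[1])
    path = [b.copy()]
    for _ in range(n_steps):
        f = X.T @ (y - X @ b)                  # negative gradient of 0.5*|y - Xb|^2
        i = np.argmax(np.abs(f))
        b[i] += step * np.sign(f[i])           # move the single most correlated coordinate
        path.append(b.copy())
    return np.array(path)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
beta = np.zeros(20); beta[[2, 9]] = [1.0, -0.7]
y = X @ beta
path = stagewise_path(X, y)
print(np.round(path[-1], 2))      # approximately recovers beta
```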
c-trajectory and λ-trajectory
Example (1-dimensional): $\varphi(\beta) = \frac{1}{2}(\beta - \beta^*)^2$, $\varphi + \lambda F = \frac{1}{2}(\beta - \beta^*)^2 + \lambda\, |\beta|^{1/2}$
$L_{1/2}$ constraint: non-convex optimization
$$P_c: \min \varphi \;\text{ s.t. }\; \sum_i |\beta_i|^{1/2} \le c, \qquad P_\lambda: \min \varphi + \lambda \sum_i |\beta_i|^{1/2}$$
$\hat{\boldsymbol{\beta}}_\lambda$: Xu Zongben's thresholding operator $R_\lambda$; the c-trajectory $\boldsymbol{\beta}_c$ and the λ-trajectory $\boldsymbol{\beta}_\lambda$ differ.
An Example of the greedy path
[Figure: greedy solution path in the (β1, β2) plane.]
LASSO and LARS:
$p = 1$: the solution $\boldsymbol{\beta}_c$ is non-sparse in general; $p < 1$: sparse.
Minkovskian gradient: solution path
$$\min D[\boldsymbol{\beta}^* : \boldsymbol{\beta}] \;\text{ s.t. }\; F(\boldsymbol{\beta}) = c, \qquad \frac{d\boldsymbol{\beta}_c}{dc} \propto \big(G + \lambda_c H_c\big)^{-1} \nabla F(\boldsymbol{\beta}_c), \qquad H_c = \nabla\nabla F(\boldsymbol{\beta}_c)$$
Solution Path for $p < 1$: not continuous, not monotone; jumps occur along $c$.
Linear Programming
$$\max \sum_i c_i x_i \quad \text{s.t.} \quad \sum_j A_{ij} x_j \le b_i$$
inner method (log barrier):
$$\varphi(\mathbf{x}) = -\sum_i \log\Big(b_i - \sum_j A_{ij} x_j\Big)$$
Convex Cone Programming
P: positive semi-definite matrices; convex potential (barrier) function; dual geodesic approach
$$A\mathbf{x} = \mathbf{b}, \qquad \min \mathbf{c} \cdot \mathbf{x}$$
Support vector machine
Convex Programming: Inner Method
$$\mathrm{LP}: \quad A\mathbf{x} \le \mathbf{b}, \quad \min \mathbf{c} \cdot \mathbf{x}, \qquad \varphi(\mathbf{x}) = -\sum_i \log\big(b_i - (A\mathbf{x})_i\big)$$
Simplex method; inner method with the barrier function.
Polynomial-Time Algorithm: the number of iterations (step size) is governed by the curvature of the central trajectory.
$$\mathbf{x}_t = \arg\min \big\{ t\, \mathbf{c} \cdot \mathbf{x} + \varphi(\mathbf{x}) \big\}: \text{the central trajectory (geodesic)}$$
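A small sketch of the log-barrier inner method for the LP above: minimize $t\,\mathbf{c}\cdot\mathbf{x} - \sum_i \log\big(b_i - (A\mathbf{x})_i\big)$ by damped Newton steps while increasing $t$; the example problem and the schedule are made up.

```python
import numpy as np

def barrier_lp(c, A, b, x0, mu=5.0, t0=1.0, n_outer=12, n_newton=30):
    """Inner (barrier) method for  min c.x  s.t.  A x <= b."""
    def f(z, t):
        r = b - A @ z
        return t * c @ z - np.log(r).sum() if np.all(r > 0) else np.inf

    x, t = np.asarray(x0, dtype=float), t0
    for _ in range(n_outer):
        for _ in range(n_newton):
            r = b - A @ x                              # slacks, must stay positive
            grad = t * c + A.T @ (1.0 / r)
            H = A.T @ ((1.0 / r ** 2)[:, None] * A)    # barrier Hessian
            dx = -np.linalg.solve(H, grad)
            alpha = 1.0
            while f(x + alpha * dx, t) > f(x, t) - 1e-12 and alpha > 1e-12:
                alpha *= 0.5                           # backtrack: feasible and decreasing
            x = x + alpha * dx
        t *= mu                                        # follow the central trajectory
    return x

# min -x1 - x2  s.t.  x1 + x2 <= 1,  x1 >= 0,  x2 >= 0
c = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 0.0])
print(barrier_lp(c, A, b, x0=np.array([0.2, 0.2])))    # approaches the optimal face x1 + x2 = 1
```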
Integration of evidences $x_1, x_2, \ldots, x_m$:
arithmetic mean, geometric mean, harmonic mean, α-mean
Generalized mean: f-mean
$f(u)$: monotone; the f-representation of $u$
$$m_f(a, b) = f^{-1}\Big\{\frac{f(a) + f(b)}{2}\Big\}$$
scale free: $m_f(ca, cb) = c\, m_f(a, b)$ holds iff
$$f(u) = u^{\frac{1-\alpha}{2}} \;(\alpha \ne 1) \qquad \text{or} \qquad f(u) = \log u.$$
α-representation; α-mean $m_\alpha(a, b)$:
• $\alpha = -1$: $\dfrac{a + b}{2}$ (arithmetic)
• $\alpha = 1$: $\sqrt{ab}$ (geometric)
• $\alpha = 3$: $\dfrac{2ab}{a + b}$ (harmonic)
• $\alpha = 0$: $\dfrac{1}{4}\big(a + b + 2\sqrt{ab}\big)$
• $\alpha = \pm\infty$: $\min(a, b)$, $\max(a, b)$
Applied to distributions: $m_\alpha\big(p_1(s), p_2(s)\big)$.
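A tiny sketch of the α-mean with the representation $f(u) = u^{(1-\alpha)/2}$ (and $f = \log$ at $\alpha = 1$), reproducing the special cases listed above; the test values are arbitrary.

```python
import numpy as np

def alpha_mean(a, b, alpha):
    """f-mean with the alpha-representation: f(u) = u^((1-alpha)/2), f = log at alpha = 1."""
    if alpha == 1:
        return np.sqrt(a * b)                         # geometric mean
    e = (1.0 - alpha) / 2.0
    return ((a ** e + b ** e) / 2.0) ** (1.0 / e)

a, b = 2.0, 8.0
print(alpha_mean(a, b, -1))   # 5.0  arithmetic
print(alpha_mean(a, b,  1))   # 4.0  geometric
print(alpha_mean(a, b,  3))   # 3.2  harmonic
print(alpha_mean(a, b,  0))   # 4.5  = (a + b + 2*sqrt(ab)) / 4
```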
Various Means
$$\frac{a + b}{2} \;(\text{arithmetic}), \qquad \sqrt{ab} \;(\text{geometric}), \qquad \frac{2}{\frac{1}{a} + \frac{1}{b}} \;(\text{harmonic})$$
Any other mean?
Family of Distributions $p_1(s), \ldots, p_k(s)$
mixture family:
$$p_{\mathrm{mix}}(s) = \sum_{i=1}^{k} t_i\, p_i(s), \qquad \sum_i t_i = 1$$
exponential family:
$$p_{\exp}(s) \propto \exp\Big\{\sum_i t_i \log p_i(s)\Big\}$$
f-family (α-family):
$$p(x; t) = f^{-1}\Big(\sum_i t_i\, f\big(p_i(x)\big)\Big)$$
α‐geodesic projection
robust estimator