Machine Learning SS: Kyoto U.
Information Geometry and Its Applications to Machine Learning
Shun-ichi Amari, RIKEN Brain Science Institute
Information Geometry
-- Manifolds of Probability Distributions
M = { p(x) }
Information Geometry
Manifold of Probability Distributions: a Riemannian manifold with dual affine connections.
Related fields: Systems Theory, Information Theory, Statistics, Neural Networks, Combinatorics, Physics, Information Sciences, Mathematics, AI, Vision, Optimization.
Information Geometry?
Example: Gaussian distributions S = { p(x; θ) }, θ = (μ, σ):
  p(x; μ, σ) = 1/√(2πσ²) · exp{ −(x − μ)²/(2σ²) }
S is a two-dimensional manifold with coordinates (μ, σ).
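As a numerical sketch (not from the slides; sample size and parameter values are illustrative), the Fisher information metric of this Gaussian family, which appears a few slides below, can be estimated by Monte Carlo and compared with the closed form diag(1/σ², 2/σ²):

```python
import numpy as np

# Monte Carlo estimate of the Fisher metric g_ij = E[∂_i log p · ∂_j log p]
# for the Gaussian family p(x; mu, sigma); closed form is diag(1/s^2, 2/s^2).
rng = np.random.default_rng(0)
mu, sigma = 0.5, 2.0
x = rng.normal(mu, sigma, size=200_000)

# score functions: derivatives of log p with respect to mu and sigma
d_mu = (x - mu) / sigma**2
d_sigma = (x - mu) ** 2 / sigma**3 - 1.0 / sigma

scores = np.stack([d_mu, d_sigma])
G = scores @ scores.T / x.size            # empirical Fisher information matrix
G_exact = np.diag([1 / sigma**2, 2 / sigma**2])
```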
Manifold of Probability Distributions
Discrete case: x = 1, 2, 3; S = { p(x) }, p = (p₁, p₂, p₃), pᵢ > 0, p₁ + p₂ + p₃ = 1 (the open simplex).
A statistical model is a submanifold M = { p(x; θ) }.
Invariance
The geometry of S = { p(x, θ) } should be invariant under different representations of the data: y = y(x), p(y), with y a sufficient statistic.
The L₂-type distance is not invariant:
  ∫ { p(x, θ₁) − p(x, θ₂) }² dx ≠ ∫ { p(y, θ₁) − p(y, θ₂) }² dy
Two Geometrical Structures
1) Riemannian metric; 2) affine connection (geodesic).
  ds² = Σ g_ij(θ) dθ_i dθ_j
Fisher information:
  g_ij = E[ ∂_i log p(x, θ) · ∂_j log p(x, θ) ]
Orthogonality via the inner product: ⟨dθ₁, dθ₂⟩ = dθ₁ᵀ G dθ₂
Affine Connection
Covariant derivative ∇_X Y; parallel transport Π of a vector along a curve.
Geodesic X = X(t): ∇_Ẋ Ẋ = 0, the analogue of a straight line; it gives the minimal distance
  s(c) = ∫ √( Σ g_ij dθ_i dθ_j )
Duality: Two Affine Connections
Two parallel transports Π and Π* are dual when they jointly preserve the inner product:
  ⟨X, Y⟩ = ⟨Π X, Π* Y⟩,  ⟨X, Y⟩ = Σ g_ij X^i Y^j
Riemannian geometry is the self-dual case Π* = Π.
The structure is { S, g, ∇, ∇* }.
Dual Affine Connections (∇, ∇*), (Π, Π*)
e-geodesic from q(x) to p(x):
  log r(x, t) = t log p(x) + (1 − t) log q(x) + c(t)
m-geodesic:
  r(x, t) = t p(x) + (1 − t) q(x)
Geodesic equations: ∇_ẋ ẋ = 0 and ∇*_ẋ ẋ = 0.
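A small sketch (not from the slides; the two distributions are arbitrary examples) of the two geodesics on the discrete simplex: the m-geodesic mixes linearly, the e-geodesic mixes in the log domain and renormalizes, which is exactly the role of the c(t) term above:

```python
import numpy as np

# m-geodesic: linear mixture r_t = t*p + (1-t)*q (already normalized).
# e-geodesic: mix log-probabilities, then renormalize (the c(t) term).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

def m_geodesic(t):
    return t * p + (1 - t) * q

def e_geodesic(t):
    r = np.exp(t * np.log(p) + (1 - t) * np.log(q))
    return r / r.sum()            # normalization supplies exp(c(t))

mid_m = m_geodesic(0.5)           # arithmetic midpoint
mid_e = e_geodesic(0.5)           # geometric midpoint, renormalized
```

The two curves share endpoints but differ in between, which is the whole point of having two connections.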
Mathematical structure of S = { p(x, ξ) }
  g_ij(ξ) = E[ ∂_i l ∂_j l ],  T_ijk(ξ) = E[ ∂_i l ∂_j l ∂_k l ],  l = log p(x; ξ),  ∂_i = ∂/∂ξ_i
α-connection:
  Γ^(α)_ijk = { ij; k } − (α/2) T_ijk
∇^(α) ↔ ∇^(−α): dually coupled,
  X⟨Y, Z⟩ = ⟨∇_X Y, Z⟩ + ⟨Y, ∇*_X Z⟩
The structure is { M, g, T }.
Divergence D[z : y]
  D[z : y] ≥ 0, with equality iff z = y
  D[z : z + dz] = ½ Σ g_ij dz_i dz_j,  g_ij positive-definite
Kullback-Leibler Divergence (a quasi-distance)
  D[p(x) : q(x)] = Σ_x p(x) log( p(x)/q(x) )
  D[p : q] ≥ 0, = 0 iff p(x) = q(x);  D[p : q] ≠ D[q : p]
f-divergence (f convex, f(1) = 0):
  D_f[p̃ : q̃] = Σ p̃_i f( q̃_i / p̃_i ) ≥ 0,  D_f[p̃ : q̃] = 0 ⇔ p̃ = q̃
On S̃ it is not invariant under f̃(u) = f(u) − c(u − 1), where
  S̃ = { p̃ : p̃_i > 0 } (positive measures; the normalization Σ p̃_i = 1 need not hold)
α-divergence:
  D_α[p̃ : q̃] = 4/(1 − α²) Σ_i { (1−α)/2 · p̃_i + (1+α)/2 · q̃_i − p̃_i^{(1−α)/2} q̃_i^{(1+α)/2} }
KL-divergence (the limit α → −1):
  D[p̃ : q̃] = Σ_i { p̃_i log( p̃_i / q̃_i ) − p̃_i + q̃_i }
(α, β)-divergence:
  D_{α,β}[p̃ : q̃] = Σ_i { α/(α+β) · p̃_i^{α+β} + β/(α+β) · q̃_i^{α+β} − p̃_i^α q̃_i^β }
  α + β = 1: reduces to the α-divergence (up to reparametrization)
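A quick numerical check (not from the slides; the positive measures chosen are arbitrary) that the α-divergence above recovers the extended KL divergence in the limit α → −1, and vanishes only when the arguments coincide:

```python
import numpy as np

# alpha-divergence for positive measures (the 4/(1-alpha^2) convention);
# the limit alpha -> -1 recovers the extended KL sum(p log p/q - p + q).
def alpha_div(p, q, alpha):
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * np.sum(a * p + b * q - p**a * q**b)

def kl_ext(p, q):
    return np.sum(p * np.log(p / q) - p + q)

p = np.array([0.5, 1.2, 0.3])     # positive measure; need not sum to 1
q = np.array([0.4, 0.9, 0.7])

d_near = alpha_div(p, q, -0.999)  # close to the KL limit
d_kl = kl_ext(p, q)
```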
Metric and Connections Induced by Divergence (Eguchi)
  g_ij(z) = −∂_i ∂'_j D[z : y] |_{y=z},  D[z : z + dz] = ½ Σ g_ij dz_i dz_j
  Γ_ijk(z) = −∂_i ∂_j ∂'_k D[z : y] |_{y=z}
  Γ*_ijk(z) = −∂'_i ∂'_j ∂_k D[z : y] |_{y=z}
where ∂_i = ∂/∂z_i and ∂'_i = ∂/∂y_i.
A divergence thus induces a Riemannian metric and a pair of dual affine connections { ∇, ∇* }.
Duality:
  ∂_k g_ij = Γ_kij + Γ*_kji,  Γ*_ijk = Γ_ijk − T_ijk
  X⟨Y, Z⟩ = ⟨∇_X Y, Z⟩ + ⟨Y, ∇*_X Z⟩
The structure is { M, g, T }.
Dually Flat Manifolds
Examples: exponential family; mixture family; { p(x) : x discrete }.
θ-coordinates: affine coordinates; ∇-flat; e-geodesics are linear in θ.
η-coordinates: dual affine coordinates; ∇*-flat; m-geodesics are linear in η.
Canonical divergence D(P : P'): the KL divergence.
The manifold is dually flat but not Riemannian-flat.
Exponential family:
  p(x, θ) = exp{ Σ θ_i x_i − ψ(θ) }
Potential functions ψ(θ), φ(η); Legendre transformation:
  η_i = ∂ψ(θ)/∂θ_i,  θ_i = ∂φ(η)/∂η_i
  ψ(θ) + φ(η) − Σ θ_i η_i = 0
  g_ij = ∂²ψ/∂θ_i ∂θ_j,  g^{ij} = ∂²φ/∂η_i ∂η_j
ψ: the cumulant generating function; φ: the negative entropy.
Canonical divergence:
  D(P : P') = ψ(θ) + φ(η') − Σ θ_i η'_i
Dually Flat Manifold: Manifold with a Convex Function
S: coordinates θ = (θ₁, …, θ_n); ψ(θ): a convex function.
Examples: ψ(θ) = ½ Σ (θ_i)²; the negative entropy φ(p) = ∫ p(x) log p(x) dx.
Bregman divergence:
  D(θ, θ') = ψ(θ) − ψ(θ') − grad ψ(θ') · (θ − θ')
  D(θ, θ + dθ) = ½ Σ g_ij dθ_i dθ_j,  g_ij = ∂_i ∂_j ψ(θ)
Riemannian metric and flatness (affine structure): θ-geodesics are straight lines in θ (not Levi-Civita geodesics).
The structure is { S, ψ(θ), θ }.
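A small sketch of the Bregman divergence just defined (not from the slides; the two convex functions are the slide's own examples): with the quadratic ψ it is half the squared Euclidean distance, and with the negative entropy on the positive orthant it gives the extended KL divergence:

```python
import numpy as np

# Bregman divergence D(t, t') = psi(t) - psi(t') - grad psi(t').(t - t')
# for a smooth convex psi supplied together with its gradient.
def bregman(theta, theta_p, psi, grad_psi):
    theta, theta_p = np.asarray(theta), np.asarray(theta_p)
    return psi(theta) - psi(theta_p) - grad_psi(theta_p) @ (theta - theta_p)

# quadratic psi: half the squared Euclidean distance
quad = bregman([1.0, 2.0], [0.0, 0.0],
               lambda t: 0.5 * np.sum(t**2), lambda t: t)

# negative entropy on the positive orthant: extended KL divergence
negent = lambda t: np.sum(t * np.log(t))
dnegent = lambda t: np.log(t) + 1.0
p, q = np.array([0.5, 0.3]), np.array([0.4, 0.4])
kl_like = bregman(p, q, negent, dnegent)
```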
Legendre Transformation
  η_i = ∂_i ψ(θ),  θ ↔ η one-to-one
  φ(η) + ψ(θ) − Σ θ_i η_i = 0,  θ_i = ∂φ(η)/∂η_i
  φ(η) = max_θ { Σ θ_i η_i − ψ(θ) }
  D(θ : θ') = ψ(θ) + φ(η') − θ · η'
Two affine coordinate systems (θ, η):
θ: geodesic (e-geodesic); η: dual geodesic (m-geodesic).
"Dually orthogonal" coordinate tangent vectors:
  ⟨ ∂/∂θ_i, ∂/∂η_j ⟩ = δ_i^j,
  X⟨Y, Z⟩ = ⟨∇_X Y, Z⟩ + ⟨Y, ∇*_X Z⟩
Pythagorean Theorem (dually flat manifold)
  D[P : Q] + D[Q : R] = D[P : R]
when the m-geodesic from P to Q is orthogonal to the e-geodesic from Q to R.
Euclidean space is the self-dual case: θ = η, ψ(θ) = ½ Σ (θ_i)².
Projection Theorem
  min_{Q ∈ M} D[P : Q]: the minimizer Q̂ is the m-geodesic projection of P to M.
  min_{Q ∈ M} D[Q : P]: the minimizer Q̂' is the e-geodesic projection of P to M.
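A concrete check of the Pythagorean relation (not from the slides; the joint distribution is an arbitrary example): take the KL divergence on a 2×2 joint P, let Q be its m-projection onto the independence family (the product of its marginals), and let R be any other product distribution. The relation then holds exactly:

```python
import numpy as np

# Pythagorean relation D[P:R] = D[P:Q] + D[Q:R], with Q the m-projection
# of a joint P onto the independence family and R any product distribution.
def kl(p, q):
    return np.sum(p * np.log(p / q))

P = np.array([[0.3, 0.1],
              [0.2, 0.4]])                  # joint distribution over (x, y)
Q = np.outer(P.sum(axis=1), P.sum(axis=0))  # product of marginals = m-projection
R = np.outer([0.6, 0.4], [0.3, 0.7])        # arbitrary independent distribution

lhs = kl(P, R)
rhs = kl(P, Q) + kl(Q, R)
```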
Two Types of Divergence
Invariant divergence (Chentsov, Csiszár): f-divergence, inducing the Fisher structure:
  D[p : q] = ∫ p(x) f( q(x)/p(x) ) dx
Flat divergence (Bregman): derived from a convex function.
The KL divergence belongs to both classes (flat and invariant):
  D[p : q] = ∫ p(x) log( p(x)/q(x) ) dx
S = { p }: the space of probability distributions — invariant divergences: f-divergence, Fisher information metric, α-connection.
S̃ = { p̃ : p̃_i > 0 }: the space of positive measures (vectors, matrices, arrays) — dually flat spaces from convex functions: Bregman divergence; also f-divergence and α-divergence.
Applications of Information Geometry
Statistical Inference; Machine Learning and AI; Computer Vision; Convex Programming; Signal Processing (ICA; sparse); Information Theory; Systems Theory; Quantum Information Geometry
Applications to Statistics
Curved exponential family:
  p(x, u) = exp{ θ(u) · x − ψ(θ(u)) }
  x̄ = (1/n) Σ_k x_k,  η̂ = x̄,  û(x₁, …, x_n): estimator
Example: discrete x, X = {0, 1, …, n}; S = { p(x) } is an exponential family:
  p(x) = exp{ Σ_i θ_i δ_i(x) − ψ(θ) },  θ_i = log( p_i / p₀ ),  ψ(θ) = −log p₀
A statistical model p(x, u) is a submanifold.
High-Order Asymptotics
  p(x, u): x = (x₁, …, x_n), estimator û = û(x₁, …, x_n), η̂ = x̄
Error covariance:
  e = E[ (û − u)(û − u)ᵀ ] = (1/n) G₁ + (1/(2n²)) G₂ + ⋯
  G₁ ≥ G^{−1}: Cramér-Rao (linear theory; the quadratic approximation)
Second-order term (bias-corrected efficient estimators):
  G₂ = (Γ^(m)_M)² + 2 (H^(e)_M)² + (H^(m)_A)²
squares of the mixture-connection term of the model M, the e-curvature of M, and the m-curvature of the ancillary family A.
Information Geometry of Belief Propagation
Shun-ichi Amari (RIKEN BSI); Shiro Ikeda (Inst. Statist. Math.); Toshiyuki Tanaka (Kyoto U.)
Stochastic Reasoning
  p(x, y, z, r, s) → q(x, y, z | r, s),  x, y, z, … = 1, −1
  q(x₁, x₂, x₃, … | observation),  x = (x₁, x₂, x₃, …), x_i = ±1
  x̂ = argmax q(x₁, x₂, x₃, …): maximum likelihood
  x̂_i = sgn E[x_i]: least bit-error-rate estimator
Mean value by marginalization, i.e. projection to the independent distributions:
  Π₀(q) = q₁(x₁) q₂(x₂) ⋯ q_n(x_n) = q₀(x)
  q_i(x_i) = Σ_{x_j, j ≠ i} q(x₁, …, x_n)
  η = E_q[x] = E_{q₀}[x]
  q(x) = exp{ Σ_r c_r(x) − ψ },  c_r(x): clique interaction terms,  x_i = ±1
  e.g.  q(x) = exp{ Σ w_ij x_i x_j + Σ h_i x_i − ψ }
Boltzmann machines, spin glasses, neural networks; turbo codes, LDPC codes.
Computationally Difficult
  q(x) = exp{ Σ_r c_r(x) − ψ }  →  η = E_q[x]
Exact marginalization is intractable; approximation schemes:
mean-field approximation
belief propagation
tree propagation, CCCP (convex-concave procedure)
Information Geometry of Mean-Field Approximation
m-projection Π₀^m and e-projection Π₀^e onto M₀ = { Π_i p_i(x_i) }, the independent distributions:
  D[q : p] = Σ_x q(x) log( q(x)/p(x) )
  q̂ = argmin_{q ∈ M₀} D[q : p];  or  q̂ = argmin_{q ∈ M₀} D[p : q]
  p₀(x) ∈ M₀
Information Geometry
  M₀ = { p₀(x, θ) = exp{ θ · x − ψ₀ } }
  M_r = { p_r(x, ξ_r) = exp{ c_r(x) + ξ_r · x − ψ_r } },  r = 1, …, L
  q(x) = exp{ Σ_r c_r(x) − ψ }
Belief Propagation
  M_r: p_r(x, ξ_r) = exp{ c_r(x) + ξ_r · x − ψ_r }
Update, with p_r(x, ξ_r^t) the belief for c_r(x):
  θ_r^{t+1} = Π₀ ∘ p_r(x, ξ_r^t) − ξ_r^t
  ξ_r^{t+1} = Σ_{r' ≠ r} θ_{r'}^{t+1}
BP algorithm: messages ζ_r exchanged between M₀ and the submanifolds M_r via the projection Π₀.
Equilibrium of BP: (θ*, ξ_r*)
1) m-condition:
  p₀(x, θ*) = Π₀ ∘ p_r(x, ξ_r*)
  (the true q and the beliefs lie in a common m-flat submanifold M(θ*))
2) e-condition:
  θ* = 1/(L − 1) Σ_r ξ_r*
  (q(x) lies in the e-flat submanifold E(θ*))
Free energy:
  F(θ, ζ₁, …, ζ_L) = D[p₀ : q] − Σ_r D[p₀ : p_r]
Critical point:
  ∂F/∂θ = 0: e-condition;  ∂F/∂ζ_r = 0: m-condition
F is not convex.
Belief propagation keeps the e-condition at every step; CCCP keeps the m-condition at every step.
BP:
  θ(ξ₁, …, ξ_L) = 1/(L − 1) Σ_r ξ_r;  (ξ₁, …, ξ_L) → (ξ'₁, …, ξ'_L)
  with the new ξ'_r determined from the projections p₀(x, θ'(ξ')) = Π₀ ∘ p_r(x, ξ_r), iterating toward the m-condition.
CCCP:
  p₀(x, θ^{t+1}) = Π₀ ∘ p_r(x, ξ_r^{t+1})  (m-condition enforced at each step)
  iterating toward the e-condition θ = 1/(L − 1) Σ_r ξ_r.
Convex-Concave Computational Procedure (CCCP) (Yuille)
  F(θ) = F₁(θ) − F₂(θ),  F₁, F₂ convex
  ∇F₁(θ^{t+1}) = ∇F₂(θ^t)
Elimination of double loops.
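A one-dimensional sketch of the CCCP iteration just stated (not from the slides; the decomposition F₁ = θ²/2, F₂ = log(1 + e^θ) is an illustrative choice, both convex). Since ∇F₁ is the identity and ∇F₂ the sigmoid, the update ∇F₁(θ^{t+1}) = ∇F₂(θ^t) becomes θ^{t+1} = sigmoid(θ^t), and F never increases along the iterates:

```python
import math

# CCCP on F(t) = F1(t) - F2(t), F1 = t^2/2 and F2 = log(1+e^t), both convex.
def F(t):
    return 0.5 * t * t - math.log(1.0 + math.exp(t))

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

theta = 5.0
values = [F(theta)]
for _ in range(50):
    theta = sigmoid(theta)        # grad F1 = identity, grad F2 = sigmoid
    values.append(F(theta))
```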
Boltzmann Machine
  p(x_i = 1) = φ( Σ_j w_ij x_j − h_i )
  p(x) = exp{ Σ w_ij x_i x_j − Σ h_i x_i − ψ(w) }
q(x): target distribution; p(x) ∈ B, the manifold of Boltzmann-machine distributions.
Boltzmann machine with hidden units: the EM algorithm as alternating e-projection and m-projection.
Hidden variables y:
  model M = { p(x, y; u) },  observed data D = { x₁, …, x_N }
  data manifold D = { p(x, y) : p(x) = p_D(x) }
EM algorithm:
  min_u KL[ p̂(x, y) : p(x, y; u) ]: m-projection to M
  min_{p̂ ∈ D} KL[ p̂(x, y) : p(x, y; u) ]: e-projection to D
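The alternating-projection view of EM can be sketched numerically (not from the slides; the model, a two-component Gaussian mixture with known components and an unknown mixing weight, is an illustrative choice). The E-step computes responsibilities under the current model, the M-step refits the weight to the completed data:

```python
import numpy as np

# EM for a mixture pi*N(4,1) + (1-pi)*N(0,1) with unknown weight pi.
rng = np.random.default_rng(1)
z = rng.random(5000) < 0.3                    # true weight 0.3
x = np.where(z, rng.normal(4, 1, 5000), rng.normal(0, 1, 5000))

def phi(x, mu):                               # N(mu, 1) density
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

pi = 0.9                                      # deliberately bad start
for _ in range(100):
    r = pi * phi(x, 4) / (pi * phi(x, 4) + (1 - pi) * phi(x, 0))  # E-step
    pi = r.mean()                                                  # M-step
```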
SVM: Support Vector Machine
Embedding, kernel, conformal change of kernel:
  z = φ(x),  f(x) = Σ w_i φ_i(x) = Σ α_i y_i K(x, x_i)
  K(x, x') = Σ φ_i(x) φ_i(x')
Conformal change of the kernel:
  K̃(x, x') = ρ(x) ρ(x') K(x, x'),  ρ(x) = exp{ −κ |f(x)|² }
Signal Processing
ICA: Independent Component Analysis
  x_t = A s_t → recover s_t
(related: sparse component analysis, positive matrix factorization)
Mixture and unmixing of independent signals:
  x = A s,  x_i = Σ_{j=1}^n A_ij s_j
Observations x(1), x(2), …, x(t); recover s(1), s(2), …, s(t):
  y = W x,  s recovered when W = A^{−1}
Space of Matrices: a Lie Group
  dX = dW W^{−1}: non-holonomic basis
  ds² = tr( dX dXᵀ ) = tr( dW W^{−1} (dW W^{−1})ᵀ )
  ∇̃ l = (∂l/∂W) Wᵀ W
Translation: I → I + dX corresponds to W → W + dW.
Information Geometry of ICA
Natural gradient; estimating functions; stability, efficiency.
  S = { p(y) };  I = { q₁(y₁) q₂(y₂) ⋯ q_n(y_n) }: independent distributions;  { p(W x) }
  l(W) = KL[ p(y; W) : Π r_i(y_i) ]
Semiparametric statistical model:
  p(x; W, r) = |W| r(W x),  r unknown,  r(s) = Π r_i(s_i),  W = A^{−1}
Natural gradient:
  ΔW = −η (∂ l(y, W)/∂W) Wᵀ W
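A runnable sketch (not from the slides; mixing matrix, sources and step size are illustrative) of the standard natural-gradient ICA rule that the update above leads to, W ← W + η (I − E[φ(y) yᵀ]) W with φ = tanh for super-Gaussian sources. The Wᵀ W factor is what removes the matrix inversion an ordinary gradient would need:

```python
import numpy as np

# Natural-gradient ICA on a 2x2 instantaneous mixture of Laplacian sources.
rng = np.random.default_rng(2)
T = 10_000
S = rng.laplace(size=(2, T))             # two super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])   # mixing matrix
X = A @ S

W = np.eye(2)
eta = 0.1
for _ in range(2000):
    Y = W @ X
    W += eta * (np.eye(2) - np.tanh(Y) @ Y.T / T) @ W

Y = W @ X
# correlation of each recovered signal with each true source
C = np.corrcoef(np.vstack([Y, S]))[:2, 2:]
```

After convergence each output correlates strongly with exactly one source (up to permutation and scale, the usual ICA indeterminacy).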
Basis Given: Overcomplete Case, Sparse Solution
  x = Σ s_i a_i = A s: more basis vectors than dimensions, so many solutions s; a sparse solution has many s_i = 0.
  min Σ ŝ_i² (L₂-norm): generalized inverse,  x = A ŝ
  min Σ |ŝ_i| (L₁-norm): sparse solution,  x = A ŝ
Overcomplete Basis and Sparse Solution
  x = Σ s_i a_i = A s
  min Σ_i |s_i|^p subject to x = A s,  or the penalized form  min ‖x − A s‖² + α Σ_i |s_i|^{p'}
Non-linear denoising.
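The contrast between the two norms above can be sketched as follows (not from the slides; dictionary size, sparsity level and α are illustrative). The minimum-L₂ solution via the pseudoinverse is dense, while iterative soft-thresholding (ISTA) on the penalized L₁ problem recovers the sparse one:

```python
import numpy as np

# Overcomplete dictionary: 10 equations, 30 unknowns, 2-sparse ground truth.
rng = np.random.default_rng(3)
A = rng.normal(size=(10, 30))
s_true = np.zeros(30)
s_true[[3, 17]] = [1.5, -2.0]
x = A @ s_true

s_l2 = np.linalg.pinv(A) @ x             # generalized-inverse solution (dense)

def soft(u, t):                          # soft-thresholding operator
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

alpha = 0.05
L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
s = np.zeros(30)
for _ in range(3000):                    # ISTA on min ||x-As||^2/2 + alpha*||s||_1
    s = soft(s - (A.T @ (A @ s - x)) / L, alpha / L)
```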
Sparse Solution
  min φ(β) with penalty F(β) (a Bayes prior):
  F₀(β) = #{ i : β_i ≠ 0 }: L₀ — sparsest solution
  F₁(β) = Σ |β_i|: L₁ — sparse solution
  F_p(β) = Σ |β_i|^p,  0 ≤ p ≤ 1
  F₂(β) = Σ β_i²: generalized-inverse solution
Sparse solution: the overcomplete case.
Optimization under Sparsity Condition
  min φ(β): a convex function, under the constraint F(β) ≤ c
Typical case:
  φ(β) = ½ ‖y − X β‖² = ½ (β − β*)ᵀ G (β − β*)
  F_p(β) = (1/p) Σ |β_i|^p;  p = 2, 1, 1/2
L1-Constrained Optimization: LASSO, LARS
c-problem P_c:
  min φ(β) under F(β) ≤ c;  solution β*(c):  β*(0) = 0,  β*(c) → β̂ as c → ∞
λ-problem P_λ:
  min φ(β) + λ F(β);  solution β*(λ):  β*(∞) = 0,  β*(λ) → β̂ as λ → 0
For p ≥ 1 the solutions β*(c) and β*(λ) coincide under a correspondence λ = λ(c).
For p < 1, λ(c) can be multiple-valued and discontinuous; the stability of the two problems differs.
β*(c) is the projection of β* onto F(β) = c (information geometry).
Convex Cone Programming
P: the cone of positive semi-definite matrices; convex potential function; dual-geodesic approach.
  A x = b,  min c · x
(Support vector machines fit this framework.)
Fig. 1: the constraint region R_c in n = 2 dimensions: a) p > 1 (smooth convex), b) p = 1 (convex with corners), c) p < 1 (non-convex).
  β*(c) = argmin { φ(β) : F(β) = c }: an orthogonal projection along the dual geodesic,
  η* − η*_c ∝ ∇F(β*_c)
Fig. 5: the subgradient case, n = ∇F.
LASSO Path and LARS Path (stagewise solution)
  min φ(β): F(β) = c;  min φ(β) + λ F(β)
Correspondence β*(c) ⇔ β*(λ),  c ⇔ λ.
Active Set and Gradient
  A = { i : β_i ≠ 0 }
  ∇_i F(β) = p · sgn(β_i) |β_i|^{p−1} for i ∈ A
For i ∉ A the (sub)gradient set is [−1, 1] when p = 1, and (−∞, ∞) when p < 1.
Solution Path
Stationarity:
  ∇φ(β*_c) + λ ∇F(β*_c) = 0
Differentiating along the path:
  { ∇∇φ(β*_c) + λ ∇∇F(β*_c) } · β̇*_c = −λ̇ ∇F(β*_c)
  β̇*_c = −λ̇ K_c^{−1} ∇F(β*_c),  K_c = G + λ ∇∇F(β*_c)
For p = 1 (L₁): ∇∇F = 0 and ∇_i F = sgn(β_i).
Solution Path in the Subspace of the Active Set
  ∇_A φ(β*) + λ ∇_A F(β*) = 0
  β̇* = −λ̇ K_A^{−1} ∇_A F: the active direction
At a turning point the active set changes: A → A'.
Gradient Descent Method
  ∇L = { ∂L/∂x_i }: covariant
  ∇̃L = { Σ_j g^{ij} ∂L/∂x_j }: contravariant (natural gradient)
It solves min L(x + a) under ‖a‖² = Σ g_ij a_i a_j = ε².
  x_{t+1} = x_t − c_t ∇̃L(x_t)
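A minimal sketch of the update above (not from the slides; the quadratic objective and the metric are illustrative). With the metric chosen as the Hessian of a badly-scaled quadratic, the contravariant step converges geometrically while the covariant step with the same step size diverges along the stiff direction:

```python
import numpy as np

# Natural-gradient descent on L(x) = x^T G x / 2, metric = G itself.
G = np.array([[10.0, 0.0], [0.0, 0.1]])

def grad_L(x):
    return G @ x

x = np.array([1.0, 1.0])
for _ in range(20):
    x = x - 0.5 * np.linalg.solve(G, grad_L(x))   # contravariant step

y = np.array([1.0, 1.0])
for _ in range(20):
    y = y - 0.5 * grad_L(y)                       # covariant step, same size
```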
Extended LARS (p = 1) and the Minkovskian Gradient
Steepest descent under an L_p-norm constraint: max L(β + a) under ‖a‖_p = ε.
The resulting (Minkovskian) gradient is
  ∇̃_i F ∝ sgn(f_i) |f_i|^{1/(p−1)},  f = ∇F
In the limit p → 1, only the maximal components survive:
  i* = argmax_i |f_i|;  ∇̃_i F = sgn(f_i) for i with |f_i| = |f_{i*}|, 0 otherwise,
  ∇̃F = c (0, …, sgn f_{i*}, …, 0)ᵀ
  β_{t+1} = β_t − η ∇̃F: this is LARS.
Euclidean case: p = 2 recovers the ordinary gradient ∇̃F = ∇F.
c-Trajectory and λ-Trajectory
Example (1-dim): φ(β) = ½ (β − β*)²
  f_λ(β) = φ(β) + λ F(β) = ½ (β − β*)² + λ |β|^{1/2}
L^{1/2} constraint: non-convex optimization.
  P_c: min (β − β*)² subject to the constraint  ⇒  β_c = c
  P_λ: ∇f_λ = 0:  β − β* + λ ∇F(β) = 0;  β̂_λ = R_λ(β*) (Xu Zongben's half-thresholding operator)
  λ(c) relating the two trajectories follows from the stationarity condition at β = c.
ICCN, Huangshan (黄山)
Sparse Signal Analysis
Shun-ichi Amari (甘利俊一), RIKEN Brain Science Institute
(Collaborator: Masahiro Yukawa, Niigata University)
Solution Path
Not continuous, not monotone: jumps occur.
  λ ↔ c,  β_λ ⇔ β_c
An example of the greedy path (in the (β₁, β₂)-plane).
Linear Programming
  max Σ c_i x_i subject to Σ_j A_ij x_j ≥ b_i
Inner (interior-point) method with a log-barrier potential:
  ψ(x) = −Σ_i log( Σ_j A_ij x_j − b_i )
Convex Programming — Inner Method
  LP: A x ≥ b, x ≥ 0, min c · x
  ψ(x) = −Σ_i log( Σ_j A_ij x_j − b_i ) − Σ_i log x_i
  η = ∂_i ψ(x): the dual coordinates
Simplex method vs. inner method; polynomial-time algorithm:
the curvature (H^(m))² of the central path governs the admissible step size;
  min { c · x + t ψ(x) }: the solution curve x*(t) (the central path) is analyzed as a geodesic.
Neural Networks
Higher-order correlations
Synchronous firing
Multilayer Perceptron
Multilayer Perceptrons
  y = Σ v_i φ(w_i · x) + n
  p(y | x; θ) = c exp{ −½ ( y − f(x, θ) )² },  f(x, θ) = Σ v_i φ(w_i · x)
  x = (x₁, …, x_n),  θ = (w₁, …, w_m; v₁, …, v_m)
Multilayer Perceptron
  y = f(x, θ) = Σ v_i φ(w_i · x),  θ = (w₁, …, w_m; v₁, …, v_m)
Neuromanifold: the space of functions realized by the network; it has singularities.
Geometry of the singular model:
  y = v φ(w · x) + n — the parametrization degenerates where |w| = 0 or v = 0.
Backpropagation: Gradient Learning
Examples (x₁, y₁), …, (x_t, y_t):
  E(θ) = ½ ( y − f(x, θ) )² = −log p(y | x; θ),  f(x, θ) = Σ v_i φ(w_i · x)
  Δθ_t = −η ∂E(x_t, θ)/∂θ
The natural gradient (Riemannian) gives the steepest-descent direction:
  ∇̃E = G^{−1} ∇E
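The gradient Δθ for this model can be written out explicitly (not from the slides; sizes, φ = tanh and all values are illustrative) and checked against a finite difference:

```python
import numpy as np

# Backprop for f(x) = sum_i v_i tanh(w_i . x), E = (y - f)^2 / 2.
rng = np.random.default_rng(5)
W = rng.normal(size=(3, 4))       # hidden weights w_i (3 units, 4 inputs)
v = rng.normal(size=3)            # output weights v_i
x = rng.normal(size=4)
y = 0.7

def loss(W, v):
    return 0.5 * (y - v @ np.tanh(W @ x)) ** 2

h = np.tanh(W @ x)
err = v @ h - y                           # dE/df
g_v = err * h                             # gradient w.r.t. v
g_W = np.outer(err * v * (1 - h**2), x)   # gradient w.r.t. W

# finite-difference check on one entry of W
eps = 1e-6
Wp = W.copy(); Wp[1, 2] += eps
fd = (loss(Wp, v) - loss(W, v)) / eps
```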
q-Fisher Information
A conformal transformation of the Fisher metric:
  g_ij^(q)(p) = (1/h_q(p)) g_ij^F(p)
q-divergence:
  D_q[ p(x) : r(x) ] = 1/((1 − q) h_q(p)) · ( 1 − ∫ p(x)^q r(x)^{1−q} dx )
Total Bregman Divergence and its Applications to
Shape Retrieval
•Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari, Frank Nielsen
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010
Total Bregman Divergence
  TD[x : y] = D[x : y] / √( 1 + ‖∇f(y)‖² )
Rotational invariance; conformal geometry.
Total Bregman divergence (Vemuri):
  TBD(p : q) = ( φ(p) − φ(q) − ∇φ(q) · (p − q) ) / √( 1 + ‖∇φ(q)‖² )
Clustering: the t-Center
  E = { x₁, …, x_m }
  x* = argmin_x Σ_i TD[x : x_i]: the t-center of E
It satisfies
  ∇f(x*) = Σ w_i ∇f(x_i) / Σ w_i,  w_i = 1 / √( 1 + ‖∇f(x_i)‖² )
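A small numerical sketch of the t-center formula (not from the slides; the point set with one gross outlier is illustrative). With f(x) = ½‖x‖², ∇f = x, so the t-center is a closed-form weighted mean whose weights 1/√(1 + ‖x_i‖²) suppress outliers, unlike the arithmetic mean (the ordinary Bregman centroid for this f):

```python
import numpy as np

# t-center vs. arithmetic mean for f(x) = ||x||^2 / 2 (grad f = x).
pts = np.array([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [100.0, 100.0]])

w = 1.0 / np.sqrt(1.0 + np.sum(pts**2, axis=1))   # outlier gets tiny weight
t_center = (w[:, None] * pts).sum(axis=0) / w.sum()
mean = pts.mean(axis=0)                           # dragged by the outlier
```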
q-Super-Robust Estimator (Eguchi)
The maximum-likelihood estimator ξ̂: max_ξ Π p(x_i, ξ) is replaced by a q-likelihood estimator, using the bias-corrected q-estimating function
  s_q(x, ξ) = ∂_ξ log p(x, ξ) · p(x, ξ)^q − c_q(ξ)
(c_q(ξ): the bias-correction term). The estimator solves
  (1/N) Σ_i s_q(x_i, ξ̂) = 0  ⇔  the maximum of the q-likelihood.
Conformal Change of Divergence
  D̃[p : q] = σ(p) D[p : q]
  g̃_ij = σ(p) g_ij
  T̃_ijk = T_ijk + s_k g_ij + s_j g_ik + s_i g_jk,  s_i = ∂_i log σ
The t-Center is Robust
Contaminate E = { x₁, …, x_n } by an outlier y with weight ε = 1/n:
  x̃* = x* + ε z(x*; y),  z: the influence function
Robust when ‖z‖ < c as y → ∞.
Influence function:
  z(x*, y) = G*^{−1} w(y) ( ∇f(y) − ∇f(x*) ),  G* = (1/n) Σ w(x_i) ∇∇f(x_i)
Since
  w(y) ‖∇f(y)‖ = ‖∇f(y)‖ / √( 1 + ‖∇f(y)‖² ) < 1,
z is bounded: the t-center is robust.
Euclidean case f = ½ ‖x‖²:
  z(x*, y) = G*^{−1} y / √( 1 + ‖y‖² ),  bounded in ‖y‖.
MPEG7 database: great intraclass variability and small interclass dissimilarity.
Other TBD applications: diffusion tensor imaging (DTI) analysis [Vemuri] — interpolation, segmentation.
Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari and Frank Nielsen, Total Bregman Divergence and its Applications to DTI Analysis, IEEE TMI, to appear.
TBD application: shape retrieval, using the MPEG7 database; 70 classes, with 20 shapes per class. (Meizhu Liu)
Multiterminal Information and Statistical Inference
  X: x^n, Y: y^n, encoded at rates R_X, R_Y:  M_X = 2^{n R_X},  M_Y = 2^{n R_Y}
  p(x, y; θ)
The Fisher information splits into marginal and correlation parts; 0-rate case (Slepian-Wolf):
  G = G_M + G_C
  p(x, y),  p(x, y, z)
Linear Systems: ARMA
  x_t = ( 1 + b₁ z^{−1} + ⋯ + b_q z^{−q} ) / ( 1 + a₁ z^{−1} + ⋯ + a_p z^{−p} ) · u_t
  θ = (a₁, …, a_p; b₁, …, b_q),  x_t = f(z; θ) u_t
u_t: input noise; x_t: output.
AR models: e-flat; MA models: m-flat.
Machine Learning — Boosting: Combination of Weak Learners
  D = { (x₁, y₁), …, (x_N, y_N) },  y_i = ±1
Weak learners h(x, u) = sgn f(x, u); combined classifier:
  H(x) = sgn( Σ_t α_t h_t(x) )
  ε_t = Prob{ h_t(x_i) ≠ y_i ; W_t }: weighted error
  W_{t+1}(i) = c W_t(i) exp{ −α_t y_i h_t(x_i) }: weight-distribution update
Boosting as generalization (information-geometric view):
  Q_t(y | x) = Q_{t−1}(y | x) exp{ −α_t y h_t(x) } (normalized)
  F = { P(y | x) : E[ y h(x) ] = const }
  α_t: min_α D[ P̃ : Q_t ];  D[ P̃ : Q_{t+1} ] < D[ P̃ : Q_t ]
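The weight update above can be sketched as a minimal AdaBoost (not from the slides; threshold stumps on synthetic 1-D data are an illustrative choice of weak learner):

```python
import numpy as np

# Minimal AdaBoost with threshold stumps, using the slide's update
# W_{t+1}(i) ∝ W_t(i) exp(-alpha_t y_i h_t(x_i)).
rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 200)
y = np.where(np.abs(x) < 0.5, 1, -1)      # not separable by one stump

def stumps(x):
    # weak learners: sign(x - b) and its flip over a grid of thresholds
    for b in np.linspace(-1, 1, 41):
        yield np.where(x > b, 1, -1)
        yield np.where(x > b, -1, 1)

W = np.ones(200) / 200
H = np.zeros(200)                          # running weighted vote
for t in range(50):
    h, eps = min(((h, W[h != y].sum()) for h in stumps(x)),
                 key=lambda p: p[1])       # best weak learner under W
    alpha = 0.5 * np.log((1 - eps) / eps)
    H += alpha * h
    W *= np.exp(-alpha * y * h)
    W /= W.sum()

train_err = np.mean(np.sign(H) != y)
```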
Integration of Evidences x₁, x₂, …, x_m
Various means: arithmetic (a + b)/2; geometric √(ab); harmonic 2/(1/a + 1/b).
Any other mean? → the α-mean.
Generalized Mean: f-Mean
f(u): monotone — the f-representation of u:
  m_f(a, b) = f^{−1}( ( f(a) + f(b) ) / 2 )
Requiring scale-freeness, m_f(ca, cb) = c m_f(a, b), forces
  f(u) = u^{(1−α)/2} (α ≠ 1),  f(u) = log u (α = 1)
the α-representation.
α-Mean
  α = −1: arithmetic mean (a + b)/2
  α = 0: ( (√a + √b)/2 )² = ¼ ( a + b + 2√(ab) )
  α = 1: geometric mean √(ab)
  α = 3: harmonic mean 2ab/(a + b)
  α = ∞: min(a, b);  α = −∞: max(a, b)
Applied to distributions: m_α( p₁(s), p₂(s) ).
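The α-mean table can be verified directly (not from the slides; the numbers 2 and 8 are arbitrary test values) from the f-representation f(u) = u^{(1−α)/2}:

```python
import math

# alpha-mean via the f-representation f(u) = u^{(1-alpha)/2} (log u at alpha=1).
def alpha_mean(a, b, alpha):
    if alpha == 1:
        return math.sqrt(a * b)              # geometric mean (log representation)
    e = (1 - alpha) / 2
    return ((a**e + b**e) / 2) ** (1 / e)

a, b = 2.0, 8.0
arith = alpha_mean(a, b, -1)   # arithmetic: 5.0
geo = alpha_mean(a, b, 1)      # geometric: 4.0
harm = alpha_mean(a, b, 3)     # harmonic: 3.2
```

Large positive α tends toward min(a, b), large negative α toward max(a, b).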
Family of Distributions { p₁(s), …, p_k(s) }
Mixture family:
  p_mix(s) = Σ t_i p_i(s),  Σ t_i = 1
Exponential family:
  log p_exp(s) = Σ t_i log p_i(s) − ψ
α-family (via the α-representation f_α):
  p_α(x; θ) = f_α^{−1}( Σ θ_i f_α( p_i(x) ) ) (normalized)