Machine Learning SS: Kyoto U.
Information Geometry and Its Applications to Machine Learning
Shun-ichi Amari, RIKEN Brain Science Institute
Information Geometry
-- Manifolds of Probability Distributions
M = { p(x) }
Information Geometry
Manifold of Probability Distributions: a Riemannian manifold with dual affine connections.
Related fields: Systems Theory, Information Theory, Statistics, Neural Networks, Combinatorics, Physics, Information Sciences, Mathematics, AI, Vision, Optimization.
Information Geometry?
Example: Gaussian distributions S = { p(x; θ) }, θ = (μ, σ):
  p(x; μ, σ) = 1/√(2πσ²) · exp{ −(x − μ)²/(2σ²) }
S is a two-dimensional manifold with coordinates (μ, σ).
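As a numerical sketch (not from the slides; sample size and parameter values are illustrative), the Fisher information metric of this Gaussian family, which appears a few slides below, can be estimated by Monte Carlo and compared with the closed form diag(1/σ², 2/σ²):

```python
import numpy as np

# Monte Carlo estimate of the Fisher metric g_ij = E[∂_i log p · ∂_j log p]
# for the Gaussian family p(x; mu, sigma); closed form is diag(1/s^2, 2/s^2).
rng = np.random.default_rng(0)
mu, sigma = 0.5, 2.0
x = rng.normal(mu, sigma, size=200_000)

# score functions: derivatives of log p with respect to mu and sigma
d_mu = (x - mu) / sigma**2
d_sigma = (x - mu) ** 2 / sigma**3 - 1.0 / sigma

scores = np.stack([d_mu, d_sigma])
G = scores @ scores.T / x.size            # empirical Fisher information matrix
G_exact = np.diag([1 / sigma**2, 2 / sigma**2])
```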
Manifold of Probability Distributions
Discrete case: x = 1, 2, 3; S = { p(x) }, p = (p₁, p₂, p₃), pᵢ > 0, p₁ + p₂ + p₃ = 1 (the open simplex).
A statistical model is a submanifold M = { p(x; θ) }.
Invariance
The geometry of S = { p(x, θ) } should be invariant under different representations of the data: y = y(x), p(y), with y a sufficient statistic.
The L₂-type distance is not invariant:
  ∫ { p(x, θ₁) − p(x, θ₂) }² dx ≠ ∫ { p(y, θ₁) − p(y, θ₂) }² dy
Two Geometrical Structures
1) Riemannian metric; 2) affine connection (geodesic).
  ds² = Σ g_ij(θ) dθ_i dθ_j
Fisher information:
  g_ij = E[ ∂_i log p(x, θ) · ∂_j log p(x, θ) ]
Orthogonality via the inner product: ⟨dθ₁, dθ₂⟩ = dθ₁ᵀ G dθ₂
Affine Connection
Covariant derivative ∇_X Y; parallel transport Π of a vector along a curve.
Geodesic X = X(t): ∇_Ẋ Ẋ = 0, the analogue of a straight line; it gives the minimal distance
  s(c) = ∫ √( Σ g_ij dθ_i dθ_j )
Duality: Two Affine Connections
Two parallel transports Π and Π* are dual when they jointly preserve the inner product:
  ⟨X, Y⟩ = ⟨Π X, Π* Y⟩,  ⟨X, Y⟩ = Σ g_ij X^i Y^j
Riemannian geometry is the self-dual case Π* = Π.
The structure is { S, g, ∇, ∇* }.
Dual Affine Connections (∇, ∇*), (Π, Π*)
e-geodesic from q(x) to p(x):
  log r(x, t) = t log p(x) + (1 − t) log q(x) + c(t)
m-geodesic:
  r(x, t) = t p(x) + (1 − t) q(x)
Geodesic equations: ∇_ẋ ẋ = 0 and ∇*_ẋ ẋ = 0.
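A small sketch (not from the slides; the two distributions are arbitrary examples) of the two geodesics on the discrete simplex: the m-geodesic mixes linearly, the e-geodesic mixes in the log domain and renormalizes, which is exactly the role of the c(t) term above:

```python
import numpy as np

# m-geodesic: linear mixture r_t = t*p + (1-t)*q (already normalized).
# e-geodesic: mix log-probabilities, then renormalize (the c(t) term).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

def m_geodesic(t):
    return t * p + (1 - t) * q

def e_geodesic(t):
    r = np.exp(t * np.log(p) + (1 - t) * np.log(q))
    return r / r.sum()            # normalization supplies exp(c(t))

mid_m = m_geodesic(0.5)           # arithmetic midpoint
mid_e = e_geodesic(0.5)           # geometric midpoint, renormalized
```

The two curves share endpoints but differ in between, which is the whole point of having two connections.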
Mathematical structure of S = { p(x, ξ) }
  g_ij(ξ) = E[ ∂_i l ∂_j l ],  T_ijk(ξ) = E[ ∂_i l ∂_j l ∂_k l ],  l = log p(x; ξ),  ∂_i = ∂/∂ξ_i
α-connection:
  Γ^(α)_ijk = { ij; k } − (α/2) T_ijk
∇^(α) ↔ ∇^(−α): dually coupled,
  X⟨Y, Z⟩ = ⟨∇_X Y, Z⟩ + ⟨Y, ∇*_X Z⟩
The structure is { M, g, T }.
Divergence D[z : y]
  D[z : y] ≥ 0, with equality iff z = y
  D[z : z + dz] = ½ Σ g_ij dz_i dz_j,  g_ij positive-definite
Kullback-Leibler Divergence (a quasi-distance)
  D[p(x) : q(x)] = Σ_x p(x) log( p(x)/q(x) )
  D[p : q] ≥ 0, = 0 iff p(x) = q(x);  D[p : q] ≠ D[q : p]
f-divergence (f convex, f(1) = 0):
  D_f[p̃ : q̃] = Σ p̃_i f( q̃_i / p̃_i ) ≥ 0,  D_f[p̃ : q̃] = 0 ⇔ p̃ = q̃
On S̃ it is not invariant under f̃(u) = f(u) − c(u − 1), where
  S̃ = { p̃ : p̃_i > 0 } (positive measures; the normalization Σ p̃_i = 1 need not hold)
α-divergence:
  D_α[p̃ : q̃] = 4/(1 − α²) Σ_i { (1−α)/2 · p̃_i + (1+α)/2 · q̃_i − p̃_i^{(1−α)/2} q̃_i^{(1+α)/2} }
KL-divergence (the limit α → −1):
  D[p̃ : q̃] = Σ_i { p̃_i log( p̃_i / q̃_i ) − p̃_i + q̃_i }
(α, β)-divergence:
  D_{α,β}[p̃ : q̃] = Σ_i { α/(α+β) · p̃_i^{α+β} + β/(α+β) · q̃_i^{α+β} − p̃_i^α q̃_i^β }
  α + β = 1: reduces to the α-divergence (up to reparametrization)
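A quick numerical check (not from the slides; the positive measures chosen are arbitrary) that the α-divergence above recovers the extended KL divergence in the limit α → −1, and vanishes only when the arguments coincide:

```python
import numpy as np

# alpha-divergence for positive measures (the 4/(1-alpha^2) convention);
# the limit alpha -> -1 recovers the extended KL sum(p log p/q - p + q).
def alpha_div(p, q, alpha):
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * np.sum(a * p + b * q - p**a * q**b)

def kl_ext(p, q):
    return np.sum(p * np.log(p / q) - p + q)

p = np.array([0.5, 1.2, 0.3])     # positive measure; need not sum to 1
q = np.array([0.4, 0.9, 0.7])

d_near = alpha_div(p, q, -0.999)  # close to the KL limit
d_kl = kl_ext(p, q)
```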
Metric and Connections Induced by Divergence (Eguchi)
  g_ij(z) = −∂_i ∂'_j D[z : y] |_{y=z},  D[z : z + dz] = ½ Σ g_ij dz_i dz_j
  Γ_ijk(z) = −∂_i ∂_j ∂'_k D[z : y] |_{y=z}
  Γ*_ijk(z) = −∂'_i ∂'_j ∂_k D[z : y] |_{y=z}
where ∂_i = ∂/∂z_i and ∂'_i = ∂/∂y_i.
A divergence thus induces a Riemannian metric and a pair of dual affine connections { ∇, ∇* }.
Duality:
  ∂_k g_ij = Γ_kij + Γ*_kji,  Γ*_ijk = Γ_ijk − T_ijk
  X⟨Y, Z⟩ = ⟨∇_X Y, Z⟩ + ⟨Y, ∇*_X Z⟩
The structure is { M, g, T }.
Dually Flat Manifolds
Examples: exponential family; mixture family; { p(x) : x discrete }.
θ-coordinates: affine coordinates; ∇-flat; e-geodesics are linear in θ.
η-coordinates: dual affine coordinates; ∇*-flat; m-geodesics are linear in η.
Canonical divergence D(P : P'): the KL divergence.
The manifold is dually flat but not Riemannian-flat.
Exponential family:
  p(x, θ) = exp{ Σ θ_i x_i − ψ(θ) }
Potential functions ψ(θ), φ(η); Legendre transformation:
  η_i = ∂ψ(θ)/∂θ_i,  θ_i = ∂φ(η)/∂η_i
  ψ(θ) + φ(η) − Σ θ_i η_i = 0
  g_ij = ∂²ψ/∂θ_i ∂θ_j,  g^{ij} = ∂²φ/∂η_i ∂η_j
ψ: the cumulant generating function; φ: the negative entropy.
Canonical divergence:
  D(P : P') = ψ(θ) + φ(η') − Σ θ_i η'_i
Dually Flat Manifold: Manifold with a Convex Function
S: coordinates θ = (θ₁, …, θ_n); ψ(θ): a convex function.
Examples: ψ(θ) = ½ Σ (θ_i)²; the negative entropy φ(p) = ∫ p(x) log p(x) dx.
Bregman divergence:
  D(θ, θ') = ψ(θ) − ψ(θ') − grad ψ(θ') · (θ − θ')
  D(θ, θ + dθ) = ½ Σ g_ij dθ_i dθ_j,  g_ij = ∂_i ∂_j ψ(θ)
Riemannian metric and flatness (affine structure): θ-geodesics are straight lines in θ (not Levi-Civita geodesics).
The structure is { S, ψ(θ), θ }.
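A small sketch of the Bregman divergence just defined (not from the slides; the two convex functions are the slide's own examples): with the quadratic ψ it is half the squared Euclidean distance, and with the negative entropy on the positive orthant it gives the extended KL divergence:

```python
import numpy as np

# Bregman divergence D(t, t') = psi(t) - psi(t') - grad psi(t').(t - t')
# for a smooth convex psi supplied together with its gradient.
def bregman(theta, theta_p, psi, grad_psi):
    theta, theta_p = np.asarray(theta), np.asarray(theta_p)
    return psi(theta) - psi(theta_p) - grad_psi(theta_p) @ (theta - theta_p)

# quadratic psi: half the squared Euclidean distance
quad = bregman([1.0, 2.0], [0.0, 0.0],
               lambda t: 0.5 * np.sum(t**2), lambda t: t)

# negative entropy on the positive orthant: extended KL divergence
negent = lambda t: np.sum(t * np.log(t))
dnegent = lambda t: np.log(t) + 1.0
p, q = np.array([0.5, 0.3]), np.array([0.4, 0.4])
kl_like = bregman(p, q, negent, dnegent)
```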
Legendre Transformation
  η_i = ∂_i ψ(θ),  θ ↔ η one-to-one
  φ(η) + ψ(θ) − Σ θ_i η_i = 0,  θ_i = ∂φ(η)/∂η_i
  φ(η) = max_θ { Σ θ_i η_i − ψ(θ) }
  D(θ : θ') = ψ(θ) + φ(η') − θ · η'
Two affine coordinate systems (θ, η):
θ: geodesic (e-geodesic); η: dual geodesic (m-geodesic).
"Dually orthogonal" coordinate tangent vectors:
  ⟨ ∂/∂θ_i, ∂/∂η_j ⟩ = δ_i^j,
  X⟨Y, Z⟩ = ⟨∇_X Y, Z⟩ + ⟨Y, ∇*_X Z⟩
Pythagorean Theorem (dually flat manifold)
  D[P : Q] + D[Q : R] = D[P : R]
when the m-geodesic from P to Q is orthogonal to the e-geodesic from Q to R.
Euclidean space is the self-dual case: θ = η, ψ(θ) = ½ Σ (θ_i)².
Projection Theorem
  min_{Q ∈ M} D[P : Q]: the minimizer Q̂ is the m-geodesic projection of P to M.
  min_{Q ∈ M} D[Q : P]: the minimizer Q̂' is the e-geodesic projection of P to M.
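A concrete check of the Pythagorean relation (not from the slides; the joint distribution is an arbitrary example): take the KL divergence on a 2×2 joint P, let Q be its m-projection onto the independence family (the product of its marginals), and let R be any other product distribution. The relation then holds exactly:

```python
import numpy as np

# Pythagorean relation D[P:R] = D[P:Q] + D[Q:R], with Q the m-projection
# of a joint P onto the independence family and R any product distribution.
def kl(p, q):
    return np.sum(p * np.log(p / q))

P = np.array([[0.3, 0.1],
              [0.2, 0.4]])                  # joint distribution over (x, y)
Q = np.outer(P.sum(axis=1), P.sum(axis=0))  # product of marginals = m-projection
R = np.outer([0.6, 0.4], [0.3, 0.7])        # arbitrary independent distribution

lhs = kl(P, R)
rhs = kl(P, Q) + kl(Q, R)
```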
Two Types of Divergence
Invariant divergence (Chentsov, Csiszár): f-divergence, inducing the Fisher structure:
  D[p : q] = ∫ p(x) f( q(x)/p(x) ) dx
Flat divergence (Bregman): derived from a convex function.
The KL divergence belongs to both classes (flat and invariant):
  D[p : q] = ∫ p(x) log( p(x)/q(x) ) dx
S = { p }: the space of probability distributions — invariant divergences: f-divergence, Fisher information metric, α-connection.
S̃ = { p̃ : p̃_i > 0 }: the space of positive measures (vectors, matrices, arrays) — dually flat spaces from convex functions: Bregman divergence; also f-divergence and α-divergence.
Applications of Information Geometry
Statistical Inference; Machine Learning and AI; Computer Vision; Convex Programming; Signal Processing (ICA; sparse); Information Theory; Systems Theory; Quantum Information Geometry
Applications to Statistics
Curved exponential family:
  p(x, u) = exp{ θ(u) · x − ψ(θ(u)) }
  x̄ = (1/n) Σ_k x_k,  η̂ = x̄,  û(x₁, …, x_n): estimator
Example: discrete x, X = {0, 1, …, n}; S = { p(x) } is an exponential family:
  p(x) = exp{ Σ_i θ_i δ_i(x) − ψ(θ) },  θ_i = log( p_i / p₀ ),  ψ(θ) = −log p₀
A statistical model p(x, u) is a submanifold.
High-Order Asymptotics
  p(x, u): x = (x₁, …, x_n), estimator û = û(x₁, …, x_n), η̂ = x̄
Error covariance:
  e = E[ (û − u)(û − u)ᵀ ] = (1/n) G₁ + (1/(2n²)) G₂ + ⋯
  G₁ ≥ G^{−1}: Cramér-Rao (linear theory; the quadratic approximation)
Second-order term (bias-corrected efficient estimators):
  G₂ = (Γ^(m)_M)² + 2 (H^(e)_M)² + (H^(m)_A)²
squares of the mixture-connection term of the model M, the e-curvature of M, and the m-curvature of the ancillary family A.
Information Geometry of Belief Propagation
Shun-ichi Amari (RIKEN BSI); Shiro Ikeda (Inst. Statist. Math.); Toshiyuki Tanaka (Kyoto U.)
Stochastic Reasoning
  p(x, y, z, r, s) → q(x, y, z | r, s),  x, y, z, … = 1, −1
  q(x₁, x₂, x₃, … | observation),  x = (x₁, x₂, x₃, …), x_i = ±1
  x̂ = argmax q(x₁, x₂, x₃, …): maximum likelihood
  x̂_i = sgn E[x_i]: least bit-error-rate estimator
Mean value by marginalization, i.e. projection to the independent distributions:
  Π₀(q) = q₁(x₁) q₂(x₂) ⋯ q_n(x_n) = q₀(x)
  q_i(x_i) = Σ_{x_j, j ≠ i} q(x₁, …, x_n)
  η = E_q[x] = E_{q₀}[x]
  q(x) = exp{ Σ_r c_r(x) − ψ },  c_r(x): clique interaction terms,  x_i = ±1
  e.g.  q(x) = exp{ Σ w_ij x_i x_j + Σ h_i x_i − ψ }
Boltzmann machines, spin glasses, neural networks; turbo codes, LDPC codes.
Computationally Difficult
  q(x) = exp{ Σ_r c_r(x) − ψ }  →  η = E_q[x]
Exact marginalization is intractable; approximation schemes:
mean-field approximation
belief propagation
tree propagation, CCCP (convex-concave procedure)
Information Geometry of Mean-Field Approximation
m-projection Π₀^m and e-projection Π₀^e onto M₀ = { Π_i p_i(x_i) }, the independent distributions:
  D[q : p] = Σ_x q(x) log( q(x)/p(x) )
  q̂ = argmin_{q ∈ M₀} D[q : p];  or  q̂ = argmin_{q ∈ M₀} D[p : q]
  p₀(x) ∈ M₀
Information Geometry
  M₀ = { p₀(x, θ) = exp{ θ · x − ψ₀ } }
  M_r = { p_r(x, ξ_r) = exp{ c_r(x) + ξ_r · x − ψ_r } },  r = 1, …, L
  q(x) = exp{ Σ_r c_r(x) − ψ }
Belief Propagation
  M_r: p_r(x, ξ_r) = exp{ c_r(x) + ξ_r · x − ψ_r }
Update, with p_r(x, ξ_r^t) the belief for c_r(x):
  θ_r^{t+1} = Π₀ ∘ p_r(x, ξ_r^t) − ξ_r^t
  ξ_r^{t+1} = Σ_{r' ≠ r} θ_{r'}^{t+1}
BP algorithm: messages ζ_r exchanged between M₀ and the submanifolds M_r via the projection Π₀.
Equilibrium of BP: (θ*, ξ_r*)
1) m-condition:
  p₀(x, θ*) = Π₀ ∘ p_r(x, ξ_r*)
  (the true q and the beliefs lie in a common m-flat submanifold M(θ*))
2) e-condition:
  θ* = 1/(L − 1) Σ_r ξ_r*
  (q(x) lies in the e-flat submanifold E(θ*))
Free energy:
  F(θ, ζ₁, …, ζ_L) = D[p₀ : q] − Σ_r D[p₀ : p_r]
Critical point:
  ∂F/∂θ = 0: e-condition;  ∂F/∂ζ_r = 0: m-condition
F is not convex.
Belief propagation keeps the e-condition at every step; CCCP keeps the m-condition at every step.
BP:
  θ(ξ₁, …, ξ_L) = 1/(L − 1) Σ_r ξ_r;  (ξ₁, …, ξ_L) → (ξ'₁, …, ξ'_L)
  with the new ξ'_r determined from the projections p₀(x, θ'(ξ')) = Π₀ ∘ p_r(x, ξ_r), iterating toward the m-condition.
CCCP:
  p₀(x, θ^{t+1}) = Π₀ ∘ p_r(x, ξ_r^{t+1})  (m-condition enforced at each step)
  iterating toward the e-condition θ = 1/(L − 1) Σ_r ξ_r.
Convex-Concave Computational Procedure (CCCP) (Yuille)
  F(θ) = F₁(θ) − F₂(θ),  F₁, F₂ convex
  ∇F₁(θ^{t+1}) = ∇F₂(θ^t)
Elimination of double loops.
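A one-dimensional sketch of the CCCP iteration just stated (not from the slides; the decomposition F₁ = θ²/2, F₂ = log(1 + e^θ) is an illustrative choice, both convex). Since ∇F₁ is the identity and ∇F₂ the sigmoid, the update ∇F₁(θ^{t+1}) = ∇F₂(θ^t) becomes θ^{t+1} = sigmoid(θ^t), and F never increases along the iterates:

```python
import math

# CCCP on F(t) = F1(t) - F2(t), F1 = t^2/2 and F2 = log(1+e^t), both convex.
def F(t):
    return 0.5 * t * t - math.log(1.0 + math.exp(t))

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

theta = 5.0
values = [F(theta)]
for _ in range(50):
    theta = sigmoid(theta)        # grad F1 = identity, grad F2 = sigmoid
    values.append(F(theta))
```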
Boltzmann Machine
  p(x_i = 1) = φ( Σ_j w_ij x_j − h_i )
  p(x) = exp{ Σ w_ij x_i x_j − Σ h_i x_i − ψ(w) }
q(x): target distribution; p(x) ∈ B, the manifold of Boltzmann-machine distributions.
Boltzmann machine with hidden units: the EM algorithm as alternating e-projection and m-projection.
Hidden variables y:
  model M = { p(x, y; u) },  observed data D = { x₁, …, x_N }
  data manifold D = { p(x, y) : p(x) = p_D(x) }
EM algorithm:
  min_u KL[ p̂(x, y) : p(x, y; u) ]: m-projection to M
  min_{p̂ ∈ D} KL[ p̂(x, y) : p(x, y; u) ]: e-projection to D
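The alternating-projection view of EM can be sketched numerically (not from the slides; the model, a two-component Gaussian mixture with known components and an unknown mixing weight, is an illustrative choice). The E-step computes responsibilities under the current model, the M-step refits the weight to the completed data:

```python
import numpy as np

# EM for a mixture pi*N(4,1) + (1-pi)*N(0,1) with unknown weight pi.
rng = np.random.default_rng(1)
z = rng.random(5000) < 0.3                    # true weight 0.3
x = np.where(z, rng.normal(4, 1, 5000), rng.normal(0, 1, 5000))

def phi(x, mu):                               # N(mu, 1) density
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

pi = 0.9                                      # deliberately bad start
for _ in range(100):
    r = pi * phi(x, 4) / (pi * phi(x, 4) + (1 - pi) * phi(x, 0))  # E-step
    pi = r.mean()                                                  # M-step
```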
SVM: Support Vector Machine
Embedding, kernel, conformal change of kernel:
  z = φ(x),  f(x) = Σ w_i φ_i(x) = Σ α_i y_i K(x, x_i)
  K(x, x') = Σ φ_i(x) φ_i(x')
Conformal change of the kernel:
  K̃(x, x') = ρ(x) ρ(x') K(x, x'),  ρ(x) = exp{ −κ |f(x)|² }
Signal Processing
ICA: Independent Component Analysis
  x_t = A s_t → recover s_t
(related: sparse component analysis, positive matrix factorization)
Mixture and unmixing of independent signals:
  x = A s,  x_i = Σ_{j=1}^n A_ij s_j
Observations x(1), x(2), …, x(t); recover s(1), s(2), …, s(t):
  y = W x,  s recovered when W = A^{−1}
Space of Matrices: a Lie Group
  dX = dW W^{−1}: non-holonomic basis
  ds² = tr( dX dXᵀ ) = tr( dW W^{−1} (dW W^{−1})ᵀ )
  ∇̃ l = (∂l/∂W) Wᵀ W
Translation: I → I + dX corresponds to W → W + dW.
Information Geometry of ICA
Natural gradient; estimating functions; stability, efficiency.
  S = { p(y) };  I = { q₁(y₁) q₂(y₂) ⋯ q_n(y_n) }: independent distributions;  { p(W x) }
  l(W) = KL[ p(y; W) : Π r_i(y_i) ]
Semiparametric statistical model:
  p(x; W, r) = |W| r(W x),  r unknown,  r(s) = Π r_i(s_i),  W = A^{−1}
Natural gradient:
  ΔW = −η (∂ l(y, W)/∂W) Wᵀ W
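A runnable sketch (not from the slides; mixing matrix, sources and step size are illustrative) of the standard natural-gradient ICA rule that the update above leads to, W ← W + η (I − E[φ(y) yᵀ]) W with φ = tanh for super-Gaussian sources. The Wᵀ W factor is what removes the matrix inversion an ordinary gradient would need:

```python
import numpy as np

# Natural-gradient ICA on a 2x2 instantaneous mixture of Laplacian sources.
rng = np.random.default_rng(2)
T = 10_000
S = rng.laplace(size=(2, T))             # two super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])   # mixing matrix
X = A @ S

W = np.eye(2)
eta = 0.1
for _ in range(2000):
    Y = W @ X
    W += eta * (np.eye(2) - np.tanh(Y) @ Y.T / T) @ W

Y = W @ X
# correlation of each recovered signal with each true source
C = np.corrcoef(np.vstack([Y, S]))[:2, 2:]
```

After convergence each output correlates strongly with exactly one source (up to permutation and scale, the usual ICA indeterminacy).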
Basis Given: Overcomplete Case, Sparse Solution
  x = Σ s_i a_i = A s: more basis vectors than dimensions, so many solutions s; a sparse solution has many s_i = 0.
  min Σ ŝ_i² (L₂-norm): generalized inverse,  x = A ŝ
  min Σ |ŝ_i| (L₁-norm): sparse solution,  x = A ŝ
Overcomplete Basis and Sparse Solution
  x = Σ s_i a_i = A s
  min Σ_i |s_i|^p subject to x = A s,  or the penalized form  min ‖x − A s‖² + α Σ_i |s_i|^{p'}
Non-linear denoising.
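The contrast between the two norms above can be sketched as follows (not from the slides; dictionary size, sparsity level and α are illustrative). The minimum-L₂ solution via the pseudoinverse is dense, while iterative soft-thresholding (ISTA) on the penalized L₁ problem recovers the sparse one:

```python
import numpy as np

# Overcomplete dictionary: 10 equations, 30 unknowns, 2-sparse ground truth.
rng = np.random.default_rng(3)
A = rng.normal(size=(10, 30))
s_true = np.zeros(30)
s_true[[3, 17]] = [1.5, -2.0]
x = A @ s_true

s_l2 = np.linalg.pinv(A) @ x             # generalized-inverse solution (dense)

def soft(u, t):                          # soft-thresholding operator
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

alpha = 0.05
L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
s = np.zeros(30)
for _ in range(3000):                    # ISTA on min ||x-As||^2/2 + alpha*||s||_1
    s = soft(s - (A.T @ (A @ s - x)) / L, alpha / L)
```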
Sparse Solution
  min φ(β) with penalty F(β) (a Bayes prior):
  F₀(β) = #{ i : β_i ≠ 0 }: L₀ — sparsest solution
  F₁(β) = Σ |β_i|: L₁ — sparse solution
  F_p(β) = Σ |β_i|^p,  0 ≤ p ≤ 1
  F₂(β) = Σ β_i²: generalized-inverse solution
Sparse solution: the overcomplete case.
Optimization under Sparsity Condition
  min φ(β): a convex function, under the constraint F(β) ≤ c
Typical case:
  φ(β) = ½ ‖y − X β‖² = ½ (β − β*)ᵀ G (β − β*)
  F_p(β) = (1/p) Σ |β_i|^p;  p = 2, 1, 1/2
L1-Constrained Optimization: LASSO, LARS
c-problem P_c:
  min φ(β) under F(β) ≤ c;  solution β*(c):  β*(0) = 0,  β*(c) → β̂ as c → ∞
λ-problem P_λ:
  min φ(β) + λ F(β);  solution β*(λ):  β*(∞) = 0,  β*(λ) → β̂ as λ → 0
For p ≥ 1 the solutions β*(c) and β*(λ) coincide under a correspondence λ = λ(c).
For p < 1, λ(c) can be multiple-valued and discontinuous; the stability of the two problems differs.
β*(c) is the projection of β* onto F(β) = c (information geometry).
Convex Cone Programming
P: the cone of positive semi-definite matrices; convex potential function; dual-geodesic approach.
  A x = b,  min c · x
(Support vector machines fit this framework.)
Fig. 1: the constraint region R_c in n = 2 dimensions: a) p > 1 (smooth convex), b) p = 1 (convex with corners), c) p < 1 (non-convex).
  β*(c) = argmin { φ(β) : F(β) = c }: an orthogonal projection along the dual geodesic,
  η* − η*_c ∝ ∇F(β*_c)
Fig. 5: the subgradient case, n = ∇F.
LASSO Path and LARS Path (stagewise solution)
  min φ(β): F(β) = c;  min φ(β) + λ F(β)
Correspondence β*(c) ⇔ β*(λ),  c ⇔ λ.
Active Set and Gradient
  A = { i : β_i ≠ 0 }
  ∇_i F(β) = p · sgn(β_i) |β_i|^{p−1} for i ∈ A
For i ∉ A the (sub)gradient set is [−1, 1] when p = 1, and (−∞, ∞) when p < 1.
Solution Path
Stationarity:
  ∇φ(β*_c) + λ ∇F(β*_c) = 0
Differentiating along the path:
  { ∇∇φ(β*_c) + λ ∇∇F(β*_c) } · β̇*_c = −λ̇ ∇F(β*_c)
  β̇*_c = −λ̇ K_c^{−1} ∇F(β*_c),  K_c = G + λ ∇∇F(β*_c)
For p = 1 (L₁): ∇∇F = 0 and ∇_i F = sgn(β_i).
Solution Path in the Subspace of the Active Set
  ∇_A φ(β*) + λ ∇_A F(β*) = 0
  β̇* = −λ̇ K_A^{−1} ∇_A F: the active direction
At a turning point the active set changes: A → A'.
Gradient Descent Method
  ∇L = { ∂L/∂x_i }: covariant
  ∇̃L = { Σ_j g^{ij} ∂L/∂x_j }: contravariant (natural gradient)
It solves min L(x + a) under ‖a‖² = Σ g_ij a_i a_j = ε².
  x_{t+1} = x_t − c_t ∇̃L(x_t)
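A minimal sketch of the update above (not from the slides; the quadratic objective and the metric are illustrative). With the metric chosen as the Hessian of a badly-scaled quadratic, the contravariant step converges geometrically while the covariant step with the same step size diverges along the stiff direction:

```python
import numpy as np

# Natural-gradient descent on L(x) = x^T G x / 2, metric = G itself.
G = np.array([[10.0, 0.0], [0.0, 0.1]])

def grad_L(x):
    return G @ x

x = np.array([1.0, 1.0])
for _ in range(20):
    x = x - 0.5 * np.linalg.solve(G, grad_L(x))   # contravariant step

y = np.array([1.0, 1.0])
for _ in range(20):
    y = y - 0.5 * grad_L(y)                       # covariant step, same size
```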
Extended LARS (p = 1) and the Minkovskian Gradient
Steepest descent under an L_p-norm constraint: max L(β + a) under ‖a‖_p = ε.
The resulting (Minkovskian) gradient is
  ∇̃_i F ∝ sgn(f_i) |f_i|^{1/(p−1)},  f = ∇F
In the limit p → 1, only the maximal components survive:
  i* = argmax_i |f_i|;  ∇̃_i F = sgn(f_i) for i with |f_i| = |f_{i*}|, 0 otherwise,
  ∇̃F = c (0, …, sgn f_{i*}, …, 0)ᵀ
  β_{t+1} = β_t − η ∇̃F: this is LARS.
Euclidean case: p = 2 recovers the ordinary gradient ∇̃F = ∇F.
c-Trajectory and λ-Trajectory
Example (1-dim): φ(β) = ½ (β − β*)²
  f_λ(β) = φ(β) + λ F(β) = ½ (β − β*)² + λ |β|^{1/2}
L^{1/2} constraint: non-convex optimization.
  P_c: min (β − β*)² subject to the constraint  ⇒  β_c = c
  P_λ: ∇f_λ = 0:  β − β* + λ ∇F(β) = 0;  β̂_λ = R_λ(β*) (Xu Zongben's half-thresholding operator)
  λ(c) relating the two trajectories follows from the stationarity condition at β = c.
ICCN, Huangshan (黄山)
Sparse Signal Analysis
Shun-ichi Amari (甘利俊一), RIKEN Brain Science Institute
(Collaborator: Masahiro Yukawa, Niigata University)
Solution Path
Not continuous, not monotone: jumps occur.
  λ ↔ c,  β_λ ⇔ β_c
An example of the greedy path (in the (β₁, β₂)-plane).
Linear Programming
  max Σ c_i x_i subject to Σ_j A_ij x_j ≥ b_i
Inner (interior-point) method with a log-barrier potential:
  ψ(x) = −Σ_i log( Σ_j A_ij x_j − b_i )
Convex Programming — Inner Method
  LP: A x ≥ b, x ≥ 0, min c · x
  ψ(x) = −Σ_i log( Σ_j A_ij x_j − b_i ) − Σ_i log x_i
  η = ∂_i ψ(x): the dual coordinates
Simplex method vs. inner method; polynomial-time algorithm:
the curvature (H^(m))² of the central path governs the admissible step size;
  min { c · x + t ψ(x) }: the solution curve x*(t) (the central path) is analyzed as a geodesic.
Neural Networks
Higher-order correlations
Synchronous firing
Multilayer Perceptron
Multilayer Perceptrons
  y = Σ v_i φ(w_i · x) + n
  p(y | x; θ) = c exp{ −½ ( y − f(x, θ) )² },  f(x, θ) = Σ v_i φ(w_i · x)
  x = (x₁, …, x_n),  θ = (w₁, …, w_m; v₁, …, v_m)
Multilayer Perceptron
  y = f(x, θ) = Σ v_i φ(w_i · x),  θ = (w₁, …, w_m; v₁, …, v_m)
Neuromanifold: the space of functions realized by the network; it has singularities.
Geometry of the singular model:
  y = v φ(w · x) + n — the parametrization degenerates where |w| = 0 or v = 0.
Backpropagation: Gradient Learning
Examples (x₁, y₁), …, (x_t, y_t):
  E(θ) = ½ ( y − f(x, θ) )² = −log p(y | x; θ),  f(x, θ) = Σ v_i φ(w_i · x)
  Δθ_t = −η ∂E(x_t, θ)/∂θ
The natural gradient (Riemannian) gives the steepest-descent direction:
  ∇̃E = G^{−1} ∇E
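The gradient Δθ for this model can be written out explicitly (not from the slides; sizes, φ = tanh and all values are illustrative) and checked against a finite difference:

```python
import numpy as np

# Backprop for f(x) = sum_i v_i tanh(w_i . x), E = (y - f)^2 / 2.
rng = np.random.default_rng(5)
W = rng.normal(size=(3, 4))       # hidden weights w_i (3 units, 4 inputs)
v = rng.normal(size=3)            # output weights v_i
x = rng.normal(size=4)
y = 0.7

def loss(W, v):
    return 0.5 * (y - v @ np.tanh(W @ x)) ** 2

h = np.tanh(W @ x)
err = v @ h - y                           # dE/df
g_v = err * h                             # gradient w.r.t. v
g_W = np.outer(err * v * (1 - h**2), x)   # gradient w.r.t. W

# finite-difference check on one entry of W
eps = 1e-6
Wp = W.copy(); Wp[1, 2] += eps
fd = (loss(Wp, v) - loss(W, v)) / eps
```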
q-Fisher Information
A conformal transformation of the Fisher metric:
  g_ij^(q)(p) = (1/h_q(p)) g_ij^F(p)
q-divergence:
  D_q[ p(x) : r(x) ] = 1/((1 − q) h_q(p)) · ( 1 − ∫ p(x)^q r(x)^{1−q} dx )
Total Bregman Divergence and its Applications to
Shape Retrieval
•Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari, Frank Nielsen
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010
Total Bregman Divergence
  TD[x : y] = D[x : y] / √( 1 + ‖∇f(y)‖² )
Rotational invariance; conformal geometry.
Total Bregman divergence (Vemuri):
  TBD(p : q) = ( φ(p) − φ(q) − ∇φ(q) · (p − q) ) / √( 1 + ‖∇φ(q)‖² )
Clustering: the t-Center
  E = { x₁, …, x_m }
  x* = argmin_x Σ_i TD[x : x_i]: the t-center of E
It satisfies
  ∇f(x*) = Σ w_i ∇f(x_i) / Σ w_i,  w_i = 1 / √( 1 + ‖∇f(x_i)‖² )
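A small numerical sketch of the t-center formula (not from the slides; the point set with one gross outlier is illustrative). With f(x) = ½‖x‖², ∇f = x, so the t-center is a closed-form weighted mean whose weights 1/√(1 + ‖x_i‖²) suppress outliers, unlike the arithmetic mean (the ordinary Bregman centroid for this f):

```python
import numpy as np

# t-center vs. arithmetic mean for f(x) = ||x||^2 / 2 (grad f = x).
pts = np.array([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [100.0, 100.0]])

w = 1.0 / np.sqrt(1.0 + np.sum(pts**2, axis=1))   # outlier gets tiny weight
t_center = (w[:, None] * pts).sum(axis=0) / w.sum()
mean = pts.mean(axis=0)                           # dragged by the outlier
```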
q-Super-Robust Estimator (Eguchi)
The maximum-likelihood estimator ξ̂: max_ξ Π p(x_i, ξ) is replaced by a q-likelihood estimator, using the bias-corrected q-estimating function
  s_q(x, ξ) = ∂_ξ log p(x, ξ) · p(x, ξ)^q − c_q(ξ)
(c_q(ξ): the bias-correction term). The estimator solves
  (1/N) Σ_i s_q(x_i, ξ̂) = 0  ⇔  the maximum of the q-likelihood.
Conformal Change of Divergence
  D̃[p : q] = σ(p) D[p : q]
  g̃_ij = σ(p) g_ij
  T̃_ijk = T_ijk + s_k g_ij + s_j g_ik + s_i g_jk,  s_i = ∂_i log σ
The t-Center is Robust
Contaminate E = { x₁, …, x_n } by an outlier y with weight ε = 1/n:
  x̃* = x* + ε z(x*; y),  z: the influence function
Robust when ‖z‖ < c as y → ∞.
Influence function:
  z(x*, y) = G*^{−1} w(y) ( ∇f(y) − ∇f(x*) ),  G* = (1/n) Σ w(x_i) ∇∇f(x_i)
Since
  w(y) ‖∇f(y)‖ = ‖∇f(y)‖ / √( 1 + ‖∇f(y)‖² ) < 1,
z is bounded: the t-center is robust.
Euclidean case f = ½ ‖x‖²:
  z(x*, y) = G*^{−1} y / √( 1 + ‖y‖² ),  bounded in ‖y‖.
MPEG7 database: great intraclass variability and small interclass dissimilarity.
Other TBD applications: diffusion tensor imaging (DTI) analysis [Vemuri] — interpolation, segmentation.
Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari and Frank Nielsen, Total Bregman Divergence and its Applications to DTI Analysis, IEEE TMI, to appear.
TBD application: shape retrieval, using the MPEG7 database; 70 classes, with 20 shapes per class. (Meizhu Liu)
Multiterminal Information and Statistical Inference
  X: x^n, Y: y^n, encoded at rates R_X, R_Y:  M_X = 2^{n R_X},  M_Y = 2^{n R_Y}
  p(x, y; θ)
The Fisher information splits into marginal and correlation parts; 0-rate case (Slepian-Wolf):
  G = G_M + G_C
  p(x, y),  p(x, y, z)
Linear Systems: ARMA
  x_t = ( 1 + b₁ z^{−1} + ⋯ + b_q z^{−q} ) / ( 1 + a₁ z^{−1} + ⋯ + a_p z^{−p} ) · u_t
  θ = (a₁, …, a_p; b₁, …, b_q),  x_t = f(z; θ) u_t
u_t: input noise; x_t: output.
AR models: e-flat; MA models: m-flat.
Machine Learning — Boosting: Combination of Weak Learners
  D = { (x₁, y₁), …, (x_N, y_N) },  y_i = ±1
Weak learners h(x, u) = sgn f(x, u); combined classifier:
  H(x) = sgn( Σ_t α_t h_t(x) )
  ε_t = Prob{ h_t(x_i) ≠ y_i ; W_t }: weighted error
  W_{t+1}(i) = c W_t(i) exp{ −α_t y_i h_t(x_i) }: weight-distribution update
Boosting as generalization (information-geometric view):
  Q_t(y | x) = Q_{t−1}(y | x) exp{ −α_t y h_t(x) } (normalized)
  F = { P(y | x) : E[ y h(x) ] = const }
  α_t: min_α D[ P̃ : Q_t ];  D[ P̃ : Q_{t+1} ] < D[ P̃ : Q_t ]
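The weight update above can be sketched as a minimal AdaBoost (not from the slides; threshold stumps on synthetic 1-D data are an illustrative choice of weak learner):

```python
import numpy as np

# Minimal AdaBoost with threshold stumps, using the slide's update
# W_{t+1}(i) ∝ W_t(i) exp(-alpha_t y_i h_t(x_i)).
rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 200)
y = np.where(np.abs(x) < 0.5, 1, -1)      # not separable by one stump

def stumps(x):
    # weak learners: sign(x - b) and its flip over a grid of thresholds
    for b in np.linspace(-1, 1, 41):
        yield np.where(x > b, 1, -1)
        yield np.where(x > b, -1, 1)

W = np.ones(200) / 200
H = np.zeros(200)                          # running weighted vote
for t in range(50):
    h, eps = min(((h, W[h != y].sum()) for h in stumps(x)),
                 key=lambda p: p[1])       # best weak learner under W
    alpha = 0.5 * np.log((1 - eps) / eps)
    H += alpha * h
    W *= np.exp(-alpha * y * h)
    W /= W.sum()

train_err = np.mean(np.sign(H) != y)
```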
Integration of Evidences x₁, x₂, …, x_m
Various means: arithmetic (a + b)/2; geometric √(ab); harmonic 2/(1/a + 1/b).
Any other mean? → the α-mean.
Generalized Mean: f-Mean
f(u): monotone — the f-representation of u:
  m_f(a, b) = f^{−1}( ( f(a) + f(b) ) / 2 )
Requiring scale-freeness, m_f(ca, cb) = c m_f(a, b), forces
  f(u) = u^{(1−α)/2} (α ≠ 1),  f(u) = log u (α = 1)
the α-representation.
α-Mean
  α = −1: arithmetic mean (a + b)/2
  α = 0: ( (√a + √b)/2 )² = ¼ ( a + b + 2√(ab) )
  α = 1: geometric mean √(ab)
  α = 3: harmonic mean 2ab/(a + b)
  α = ∞: min(a, b);  α = −∞: max(a, b)
Applied to distributions: m_α( p₁(s), p₂(s) ).
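The α-mean table can be verified directly (not from the slides; the numbers 2 and 8 are arbitrary test values) from the f-representation f(u) = u^{(1−α)/2}:

```python
import math

# alpha-mean via the f-representation f(u) = u^{(1-alpha)/2} (log u at alpha=1).
def alpha_mean(a, b, alpha):
    if alpha == 1:
        return math.sqrt(a * b)              # geometric mean (log representation)
    e = (1 - alpha) / 2
    return ((a**e + b**e) / 2) ** (1 / e)

a, b = 2.0, 8.0
arith = alpha_mean(a, b, -1)   # arithmetic: 5.0
geo = alpha_mean(a, b, 1)      # geometric: 4.0
harm = alpha_mean(a, b, 3)     # harmonic: 3.2
```

Large positive α tends toward min(a, b), large negative α toward max(a, b).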
Family of Distributions { p₁(s), …, p_k(s) }
Mixture family:
  p_mix(s) = Σ t_i p_i(s),  Σ t_i = 1
Exponential family:
  log p_exp(s) = Σ t_i log p_i(s) − ψ
α-family (via the α-representation f_α):
  p_α(x; θ) = f_α^{−1}( Σ θ_i f_α( p_i(x) ) ) (normalized)