
Transcript of Information Theoretic Learning Finding structure in data ...

Page 1: Information Theoretic Learning Finding structure in data ...

Information Theoretic Learning: Finding structure in data ...

Jose Principe and Sudhir Rao

University of [email protected]

Page 2: Information Theoretic Learning Finding structure in data ...

Outline

Structure

Connection to Learning

Learning Structure – the old view

A new framework

Applications

Page 3: Information Theoretic Learning Finding structure in data ...

Structure

Patterns / Regularities

Amorphous/chaos

Interdependence between subsystems

White Noise

Page 4: Information Theoretic Learning Finding structure in data ...

Connection to Learning

Page 5: Information Theoretic Learning Finding structure in data ...

Type of Learning

Supervised learning: data + desired signal / teacher

Reinforcement learning: data + rewards / punishments

Unsupervised learning: only the data

Page 6: Information Theoretic Learning Finding structure in data ...

Unsupervised Learning

What can be done only with the data?

First principles and examples:

Preserve maximum information: auto-associative memory, ART, PCA, Linsker's "infomax" rule, ...

Extract independent features: Barlow's minimum redundancy principle, ICA, etc.

Learn the probability distribution: Gaussian mixture models, the EM algorithm, parametric density estimation.

Page 7: Information Theoretic Learning Finding structure in data ...

Connection to Self Organization

“If cell 1 is one of the cells providing input to cell 2, and if cell 1’s activity tends to be “high” whenever cell 2’s activity is “high”, then the future contributions that the firing of cell 1 makes to the firing of cell 2 should increase..”

-Donald Hebb, 1949, Neuropsychologist.

[Figure: cells A and B making excitatory (+) and inhibitory (−) synapses onto cell C]

Increase $w_B$ in proportion to the activity of B and C.

What is the purpose????

“Does the Hebb-type algorithm cause a developing perceptual network to optimize some property that is deeply connected with the mature network’s functioning as an information processing system?”

- Linsker, 1988

Page 8: Information Theoretic Learning Finding structure in data ...

Linsker’s Infomax principle

$R = I(X,Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)$

[Figure: linear network with inputs $X_1, X_2, \ldots, X_{L-1}, X_L$, weights $w_1, \ldots, w_L$, additive noise, and output $Y$]

Maximize the Shannon rate $I(X,Y)$.

Under Gaussian assumptions and uncorrelated noise, the rate for a linear network is

$R = \frac{1}{2}\ln\frac{\sigma_y^2}{\sigma_n^2}$

Maximize rate = $\max_w E[y^2]$

$w_i(n+1) = w_i(n) + \eta\, x_i\, y$   Hebbian Rule!!

Page 9: Information Theoretic Learning Finding structure in data ...

Barlow’s redundancy principle

MinimumEntropyCoding

Feature 1

Feature N

Stimulus 1

Stimulus M

Independencefeaturesno redundancy

Converting an M dimensional problem N one dimensional problems

2M conditional probabilitiesrequired for an event V

P(V|stimuli)

N conditional probabilitiesrequired for an event V

P(V|Feature i)

ICA!!!

Page 10: Information Theoretic Learning Finding structure in data ...

Summary 1

Self-organizing rule (example: Hebbian rule)

Global objective function (example: Infomax)

Discovering structure in data (example: PCA)

Unsupervised learning: extracting the desired signal from the data itself, revealing the structure through the interaction of the data points.

Page 11: Information Theoretic Learning Finding structure in data ...

Questions

Can we go beyond these preprocessing stages?

Can we create a global cost function which extracts "goal-oriented structures" from the data?

Can we derive a self-organizing principle from such a cost function?

A big YES!!!

Page 12: Information Theoretic Learning Finding structure in data ...

What is Information Theoretic Learning?

ITL is a methodology to adapt linear or nonlinear systems using criteria based on the information descriptors of entropy and divergence.

The centerpiece is a non-parametric estimator of entropy that:
• Does not require an explicit estimation of the pdf
• Uses the Parzen window method, which is known to be consistent and efficient
• Is smooth
• Is readily integrated in conventional gradient descent learning
• Provides a link to kernel learning and SVMs
• Allows an extension to random processes

Page 13: Information Theoretic Learning Finding structure in data ...

ITL is a different way of thinking about data quantification

Moment expansions, and in particular second-order moments, are still today the workhorse of statistics. We automatically translate deep concepts (e.g., similarity, Hebb's postulate of learning) into second-order statistical equivalents.

ITL replaces 2nd order moments with a geometric statistical interpretation of data in probability spaces.

• Variance by entropy
• Correlation by correntropy
• Mean square error (MSE) by minimum error entropy (MEE)
• Distances in data space by distances in probability spaces

Page 14: Information Theoretic Learning Finding structure in data ...

Information Theoretic Learning: Entropy

• Entropy quantifies the degree of uncertainty in a r.v. Claude Shannon defined entropy as

Not all random variables (r.v.) are equally random!

[Figure: two example pdfs $f_X(x)$, one peaked and one broad, illustrating different degrees of uncertainty]

$H_S(X) = -\sum_k p(x_k)\log p(x_k)$     $H_S(X) = -\int f_X(x)\log f_X(x)\,dx$

Page 15: Information Theoretic Learning Finding structure in data ...

Information Theoretic Learning: Renyi's Entropy

$H_\alpha(X) = \frac{1}{1-\alpha}\log\sum_k p^\alpha(x_k)$     $H_\alpha(X) = \frac{1}{1-\alpha}\log\int f_X^\alpha(x)\,dx$

Norm of the pdf: $V_\alpha = \int f^\alpha(y)\,dy$, so $V_\alpha = \|f\|_\alpha^\alpha$ and Renyi's entropy is, up to scaling, the log of the $\alpha$-norm of the pdf ($\|p\|_\alpha = (\sum_{k=1}^{n} p_k^\alpha)^{1/\alpha}$ in the discrete case).

[Figure: probability simplexes $p_1 + p_2 = 1$ and $p_1 + p_2 + p_3 = 1$, illustrating the $\alpha$-norm of the probability vector $p$]

Renyi's entropy equals Shannon's as $\alpha \to 1$.

Page 16: Information Theoretic Learning Finding structure in data ...

Information Theoretic Learning: Parzen windowing

Given only samples $\{x_1, \ldots, x_N\}$ drawn from a distribution $p(x)$, the Parzen estimate is

$\hat{p}(x) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(x - x_i)$    ($G_\sigma$ is the kernel function)

Convergence: $\lim_{N\to\infty}\hat{p}_N(x) = p(x) * G_{\sigma(N)}(x)$, provided that $\sigma(N)\to 0$ and $N\sigma(N)\to\infty$.

[Figure: Parzen pdf estimates with N = 10 and N = 1000 samples; the estimate approaches the true pdf as N grows]
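A minimal NumPy sketch of the Parzen estimate above, assuming a 1-D Gaussian kernel; the function names and the toy bimodal data are illustrative.

import numpy as np

def gaussian_kernel(u, sigma):
    # 1-D Gaussian kernel G_sigma(u)
    return np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def parzen_pdf(x, samples, sigma):
    # Parzen estimate: p_hat(x) = (1/N) * sum_i G_sigma(x - x_i)
    x = np.atleast_1d(x)
    diffs = x[:, None] - samples[None, :]      # (evaluation point, sample) differences
    return gaussian_kernel(diffs, sigma).mean(axis=1)

# Illustrative usage: estimate a bimodal pdf from N = 1000 samples
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])
grid = np.linspace(-5, 5, 201)
p_hat = parzen_pdf(grid, samples, sigma=0.3)
print(p_hat.sum() * (grid[1] - grid[0]))       # close to 1: the estimate integrates to one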

Page 17: Information Theoretic Learning Finding structure in data ...

Information Theoretic Learning Renyi’s Quadratic Entropy

Order-2 entropy & Gaussian kernels:

$H_2(X) = -\log\int p^2(x)\,dx \approx -\log\int\Big(\frac{1}{N}\sum_{i=1}^{N} G_\sigma(x - x_i)\Big)^2 dx = -\log\frac{1}{N^2}\sum_{j=1}^{N}\sum_{i=1}^{N}\int G_\sigma(x - x_j)\,G_\sigma(x - x_i)\,dx = -\log\frac{1}{N^2}\sum_{j=1}^{N}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)$

The double sum consists of pairwise interactions between samples, O(N²). Its argument is the information potential, $V_2(X)$.

Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.

$\hat{p}(x)$ provides a potential field over the space of the samples, parameterized by the kernel size.
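A small NumPy sketch of the O(N²) pairwise estimator above, assuming a Gaussian kernel of size sigma; the function names and the toy comparison are illustrative.

import numpy as np

def information_potential(X, sigma):
    # V2(X) = (1/N^2) sum_j sum_i G_{sigma*sqrt(2)}(x_j - x_i), Gaussian kernel
    X = np.atleast_2d(X)                       # N x d samples
    N, d = X.shape
    s2 = 2 * sigma**2                          # (sigma*sqrt(2))^2
    sq_dists = ((X[:, None, :] - X[None, :, :])**2).sum(-1)   # pairwise ||x_j - x_i||^2
    norm = (2 * np.pi * s2) ** (d / 2)
    return np.exp(-sq_dists / (2 * s2)).sum() / (norm * N**2)

def renyi_quadratic_entropy(X, sigma):
    # H2(X) = -log V2(X)
    return -np.log(information_potential(X, sigma))

# Illustrative check: a tighter cluster has a higher potential, hence lower entropy
rng = np.random.default_rng(0)
tight = rng.normal(0, 0.2, size=(200, 2))
spread = rng.normal(0, 2.0, size=(200, 2))
print(renyi_quadratic_entropy(tight, 0.5) < renyi_quadratic_entropy(spread, 0.5))  # True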

Page 18: Information Theoretic Learning Finding structure in data ...

Information Theoretic Learning Information Force

In adaptation, samples become information particles that interact through information forces.

Information potential:  $V_2(X) = \frac{1}{N^2}\sum_{j}\sum_{i} G_{\sigma\sqrt{2}}(x_j - x_i)$

Information force:  $F(x_j) = \frac{\partial V_2(X)}{\partial x_j} = \frac{1}{2\sigma^2 N^2}\sum_{i} G_{\sigma\sqrt{2}}(x_j - x_i)\,(x_i - x_j)$

Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.

Erdogmus, Principe, Hild, Natural Computing, 2002.

[Figure: the information force that sample $x_i$ exerts on sample $x_j$]
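A sketch of the information forces, computed as the exact gradient of the V2 estimator defined above (proportional to the force expression on the slide); the Gaussian kernel and the function name are assumptions.

import numpy as np

def information_forces(X, sigma):
    # Gradient dV2/dx_j of V2(X) = (1/N^2) sum_j sum_i G_{sigma*sqrt(2)}(x_j - x_i).
    # Row j is the net force the other samples exert on x_j (attractive, toward
    # regions where samples concentrate).
    X = np.atleast_2d(X)
    N, d = X.shape
    s2 = 2 * sigma**2
    diffs = X[:, None, :] - X[None, :, :]              # x_j - x_i, shape N x N x d
    k = np.exp(-(diffs**2).sum(-1) / (2 * s2))
    k /= (2 * np.pi * s2) ** (d / 2)
    # each pair appears twice in the double sum, hence the overall factor 2 / N^2
    return 2.0 * (k[:, :, None] * (-diffs)).sum(axis=1) / (s2 * N**2)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
F = information_forces(X, sigma=0.5)
print(F.shape)   # (50, 2): one force vector per sample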

Page 19: Information Theoretic Learning Finding structure in data ...

[Figure: information forces within a dataset arising due to H(X)]

What will happen if we allow the particles to move under the influence of these forces?

Page 20: Information Theoretic Learning Finding structure in data ...

Information Theoretic Learning Backpropagation of Information Forces

Information forces become the injected error to the dual or adjoint network that determines the weight updates for adaptation.

[Figure: block diagram. The input drives the adaptive system; its output and the desired response feed the IT criterion; the resulting information forces are injected into the adjoint network, which produces the weight updates]

$\frac{\partial J}{\partial w_{ij}} = \sum_{n=1}^{N}\sum_{p}\frac{\partial J}{\partial e_p(n)}\,\frac{\partial e_p(n)}{\partial w_{ij}}$

Page 21: Information Theoretic Learning Finding structure in data ...

Information Theoretic Learning Quadratic divergence measures

Kullback-Leibler divergence:  $D_{KL}(X;Y) = \int p(x)\log\frac{p(x)}{q(x)}\,dx$

Renyi's divergence:  $D_\alpha(X;Y) = \frac{1}{\alpha-1}\log\int p(x)\left(\frac{p(x)}{q(x)}\right)^{\alpha-1}dx$

Euclidean distance:  $D_E(X;Y) = \int\big(p(x)-q(x)\big)^2\,dx$

Cauchy-Schwarz distance:  $D_C(X;Y) = -\log\frac{\left(\int p(x)\,q(x)\,dx\right)^2}{\int p^2(x)\,dx\,\int q^2(x)\,dx}$

Mutual information is a special case (the divergence between the joint and the product of the marginals).

Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.

Page 22: Information Theoretic Learning Finding structure in data ...

[Figure: learning system $Y = q(X, W)$ with input signal $X$, output signal $Y$, and desired signal $D$; an information measure $I(Y, D)$ drives the optimization of the system parameters $W$]

Information Theoretic Learning Unifying criterion for learning from samples

Page 23: Information Theoretic Learning Finding structure in data ...

Training ADALINE sample by sample: the stochastic information gradient (SIG)

Theorem: The expected value of the stochastic information gradient (SIG), is the gradient of Shannon’s entropy estimated from the samples using Parzen windowing.

$\hat{H}_{S,n}(e) = -\log\frac{1}{M}\sum_{i=n-M}^{n-1} G_\sigma(e_n - e_i)$

$\frac{\partial \hat{H}_{S,n}}{\partial w_k} = -\frac{\sum_{i=n-M}^{n-1} G'_\sigma(e_n - e_i)\left(\frac{\partial e_n}{\partial w_k} - \frac{\partial e_i}{\partial w_k}\right)}{\sum_{i=n-M}^{n-1} G_\sigma(e_n - e_i)}$

For the Gaussian kernel and M = 1 (ADALINE, $e(n) = d(n) - w^T x(n)$):

$\frac{\partial \hat{H}_{S,n}}{\partial w_k} = -\frac{1}{\sigma^2}\,(e_n - e_{n-1})\,\big(x_k(n) - x_k(n-1)\big)$

The form is the same as for LMS except that entropy learning works with differences in samples.

The SIG works implicitly with the L1 norm of the error.

Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.
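A sketch of sample-by-sample ADALINE training with the SIG, for the Gaussian kernel and M = 1 as in the last formula above; the step size, kernel size and the toy system-identification setup are illustrative assumptions.

import numpy as np

def sig_adaline(x, d, n_taps=4, sigma=1.0, eta=0.05):
    # SIG training of an ADALINE, Gaussian kernel, window M = 1.
    # Compare with LMS, which uses e(n) * x(n): here the update uses differences
    # of consecutive errors and consecutive input vectors.
    w = np.zeros(n_taps)
    e_prev = 0.0
    u_prev = np.zeros(n_taps)
    for n in range(n_taps - 1, len(x)):
        u = x[n - n_taps + 1:n + 1][::-1]      # current tap-delay input vector
        e = d[n] - w @ u                       # current error
        # minimize error entropy: w <- w - eta * dH/dw, with
        # dH/dw = -(1/sigma^2) * (e_n - e_{n-1}) * (u_n - u_{n-1})
        w += (eta / sigma**2) * (e - e_prev) * (u - u_prev)
        e_prev, u_prev = e, u
    return w

# Illustrative use: identify a short FIR filter from input/output data
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
true_w = np.array([0.5, -0.3, 0.2, 0.1])
d = np.convolve(x, true_w)[:len(x)] + 0.01 * rng.normal(size=len(x))
print(np.round(sig_adaline(x, d), 2))          # should be close to true_w in this toy example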

Page 24: Information Theoretic Learning Finding structure in data ...

SIG Hebbian updates

In a linear network the Hebbian update is

$\Delta w_k = \eta\, y_k\, x_k$

The update maximizing Shannon output entropy with the SIG becomes

$\Delta w_k = \frac{\eta}{\sigma^2}\,(y_k - y_{k-1})\,(x_k - x_{k-1})$

Which is more powerful and biologically plausible?

Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.

[Figure: direction of the learned weight vector vs. epochs (500 epochs), for the Gaussian and the Cauchy cases]

Generated 50 samples of a 2D distribution where the x axis is uniform and the y axis is Gaussian, and the sample covariance matrix is the identity.

Hebbian updates would converge to an arbitrary direction, but the SIG consistently found the 90-degree direction!

[Figure: scatter plot of the 50 2D samples]

Page 25: Information Theoretic Learning Finding structure in data ...

ITL - Applications

ITL

System identification, feature extraction, blind source separation, clustering

The ITL page at www.cnel.ufl.edu has examples and Matlab code.

Page 26: Information Theoretic Learning Finding structure in data ...

Renyi's cross entropy

Let $X = \{x_i\}_{i=1}^{N}$ and $Y = \{y_j\}_{j=1}^{M}$ be two r.v.s with iid samples. Then Renyi's (quadratic) cross entropy is given by

$H_2(X;Y) = -\log\int p_X(t)\,p_Y(t)\,dt = -\log V(X;Y)$

where the cross information potential is $V(X;Y) = \int p_X(t)\,p_Y(t)\,dt = E_X\big[p_Y(X)\big]$.

Using the Parzen estimate for the pdfs gives

$\hat{V}(X;Y) = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} G_{\sigma\sqrt{2}}(x_i - y_j)$
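A sketch of the cross information potential and cross entropy estimators above, assuming a Gaussian kernel; the names and toy data are illustrative.

import numpy as np

def cross_information_potential(X, Y, sigma):
    # V(X;Y) = (1/(N*M)) sum_i sum_j G_{sigma*sqrt(2)}(x_i - y_j), Gaussian kernel
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    N, d = X.shape
    M = Y.shape[0]
    s2 = 2 * sigma**2
    sq = ((X[:, None, :] - Y[None, :, :])**2).sum(-1)
    return np.exp(-sq / (2 * s2)).sum() / (((2 * np.pi * s2) ** (d / 2)) * N * M)

def renyi_cross_entropy(X, Y, sigma):
    # H2(X;Y) = -log V(X;Y)
    return -np.log(cross_information_potential(X, Y, sigma))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 2))
Y_near = rng.normal(0.0, 1.0, size=(300, 2))
Y_far = rng.normal(5.0, 1.0, size=(300, 2))
# overlapping datasets have a larger cross potential, hence a smaller cross entropy
print(renyi_cross_entropy(X, Y_near, 1.0) < renyi_cross_entropy(X, Y_far, 1.0))  # True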

Page 27: Information Theoretic Learning Finding structure in data ...

“Cross” information potential and “cross” information force

$V(x_i; Y) = \frac{1}{NM}\sum_{j=1}^{M} G_{\sigma\sqrt{2}}(x_i - y_j)$

$F(x_i; Y) = \frac{\partial V(x_i; Y)}{\partial x_i} = \frac{1}{2\sigma^2 NM}\sum_{j=1}^{M} G_{\sigma\sqrt{2}}(x_i - y_j)\,(y_j - x_i)$

Force between particles of two datasets

Page 28: Information Theoretic Learning Finding structure in data ...

[Figure: cross information forces between two datasets arising due to H(X;Y)]

Page 29: Information Theoretic Learning Finding structure in data ...

Cauchy-Schwarz Divergence

A measure of similarity between two datasets

$D_{cs}(X;Y) = -\log\frac{\left(\int p_X(u)\,p_Y(u)\,du\right)^2}{\int p_X^2(u)\,du\,\int p_Y^2(u)\,du} = 2H_2(X;Y) - H_2(X) - H_2(Y)$

$D_{cs}(X;Y) = 0 \iff$ the two datasets have the same probability density function.
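A sketch of the Cauchy-Schwarz divergence estimated from two sample sets with Parzen windows, using the identity $D_{cs} = 2H_2(X;Y) - H_2(X) - H_2(Y)$ above; the helper name and toy data are illustrative.

import numpy as np

def _avg_kernel(A, B, sigma):
    # (1/(|A||B|)) sum_{a,b} G_{sigma*sqrt(2)}(a - b), Gaussian kernel
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    d = A.shape[1]
    s2 = 2 * sigma**2
    sq = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-sq / (2 * s2)).mean() / ((2 * np.pi * s2) ** (d / 2))

def cauchy_schwarz_divergence(X, Y, sigma):
    # D_cs(X;Y) = -log( V(X;Y)^2 / (V(X) * V(Y)) ), zero when the pdfs coincide
    vxy = _avg_kernel(X, Y, sigma)   # cross information potential
    vxx = _avg_kernel(X, X, sigma)   # information potential of X
    vyy = _avg_kernel(Y, Y, sigma)   # information potential of Y
    return -np.log(vxy**2 / (vxx * vyy))

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
print(cauchy_schwarz_divergence(X, rng.normal(size=(400, 2)), 1.0))            # near 0
print(cauchy_schwarz_divergence(X, rng.normal(3.0, 1.0, size=(400, 2)), 1.0))  # clearly > 0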

Page 30: Information Theoretic Learning Finding structure in data ...

A New ITL Framework: Information Theoretic Mean Shift

STATEMENT

Consider a dataset $X_o = \{x_i^o\}_{i=1}^{N_o}$ with iid samples. We wish to find a new dataset $X = \{x_i\}_{i=1}^{N}$, with $N \le N_o$, which captures the "interesting structures" of the original dataset $X_o$.

FORMULATION

Cost = Redundancy reduction term + Similarity measure term (a weighted combination)

Page 31: Information Theoretic Learning Finding structure in data ...

Information Theoretic Mean Shift

$J(X) = \min_X\; H(X) + \lambda\, D_{cs}(X, X_o)$    (Form 1)

This cost looks like a reaction diffusion equation:

The entropy term implements diffusion; the Cauchy-Schwarz term implements attraction to the original data.

Page 32: Information Theoretic Learning Finding structure in data ...

Analogy

[Figure: information bottleneck analogy, $X_o \to X$, with the $H(X)$ term squeezing the representation and the $D_{cs}(X;X_o)$ term keeping it faithful to the data]

The weighting parameter λ squeezes the information flow through a bottleneck, extracting different levels of structure in the data.

[Figure: $D_{cs}(X;X_o)$ versus $H(X)$, with λ as the slope] We can also visualize λ as a slope parameter. The previous methods used only λ = 1, i.e. $D_{cs}(X;X_o)$ weighted equally against $H(X)$.

Page 33: Information Theoretic Learning Finding structure in data ...

Self organizing rule

Rewriting the cost function as

$J(X) = \min_X\; -(1-\lambda)\log V(X) - 2\lambda\log V(X;X_o)$

differentiating with respect to $x_k$, $k = 1, 2, \ldots, N$, and setting the gradient to zero,

$\frac{(1-\lambda)}{V(X)}\,\frac{\partial V(X)}{\partial x_k} + \frac{2\lambda}{V(X;X_o)}\,\frac{\partial V(X;X_o)}{\partial x_k} = 0,$

rearranging gives

$x_k = \frac{(1-\lambda)\,c\sum_{j=1}^{N} G_\sigma(x_k - x_j)\,x_j + \lambda\sum_{j=1}^{N_o} G_\sigma(x_k - x_j^o)\,x_j^o}{(1-\lambda)\,c\sum_{j=1}^{N} G_\sigma(x_k - x_j) + \lambda\sum_{j=1}^{N_o} G_\sigma(x_k - x_j^o)}, \qquad c = \frac{N_o\,V(X;X_o)}{N\,V(X)}$

Fixed Point Update!!
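A sketch of the ITMS fixed-point iteration reconstructed above, with the constant c taken as $N_o V(X;X_o) / (N V(X))$; the function names, kernel size and toy two-cluster data are illustrative assumptions.

import numpy as np

def _gauss(sq_dists, sigma, d):
    return np.exp(-sq_dists / (2 * sigma**2)) / ((2 * np.pi * sigma**2) ** (d / 2))

def itms(Xo, lam, sigma, n_iter=50):
    # Information theoretic mean shift: fixed-point update for
    # J(X) = (1 - lam) * H2(X) + 2 * lam * H2(X; Xo).
    # lam = 0 collapses the data, lam = 1 moves the samples to the modes,
    # large lam keeps X close to the original data Xo.
    Xo = np.atleast_2d(Xo)
    No, d = Xo.shape
    X = Xo.copy()
    N = X.shape[0]
    for _ in range(n_iter):
        Kxx = _gauss(((X[:, None, :] - X[None, :, :])**2).sum(-1), sigma, d)    # N x N
        Kxo = _gauss(((X[:, None, :] - Xo[None, :, :])**2).sum(-1), sigma, d)   # N x No
        V = Kxx.sum() / N**2                 # information potential V(X)
        Vc = Kxo.sum() / (N * No)            # cross information potential V(X; Xo)
        c = (No * Vc) / (N * V)              # constant from the stationarity condition
        num = (1 - lam) * c * Kxx @ X + lam * Kxo @ Xo
        den = (1 - lam) * c * Kxx.sum(1, keepdims=True) + lam * Kxo.sum(1, keepdims=True)
        X = num / den
    return X

rng = np.random.default_rng(0)
Xo = np.concatenate([rng.normal(-2, 0.3, (100, 2)), rng.normal(2, 0.3, (100, 2))])
modes = itms(Xo, lam=1.0, sigma=0.6)          # lam = 1: samples collapse onto the modes
print(np.unique(np.round(modes, 1), axis=0))  # roughly two distinct points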

Page 34: Information Theoretic Learning Finding structure in data ...

An Example

[Figure: crescent shaped dataset]

Page 35: Information Theoretic Learning Finding structure in data ...

Effect of λ

Page 36: Information Theoretic Learning Finding structure in data ...

Summary 2

Starting with the data:

λ = 0  →  single point
λ = 1  →  modes
λ → ∞  →  back to the data

Page 37: Information Theoretic Learning Finding structure in data ...

Applications- Clustering

Segment data into different groups such that samples belonging to the same group are "closer" to each other than samples of different groups.

Statement

The idea

Mode-finding ability → Clustering

Page 38: Information Theoretic Learning Finding structure in data ...

Mean Shift – a review

$\hat{p}_X(x) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(x - x_i)$

$\nabla\hat{p}_X(x) = \frac{1}{N\sigma^2}\sum_{i=1}^{N} G_\sigma(x - x_i)\,(x_i - x) = 0$

Modes are stationary points of this equation, which yields the fixed-point iteration

$x^{(n+1)} = m(x^{(n)}) = \frac{\sum_{i=1}^{N} G_\sigma(x^{(n)} - x_i)\,x_i}{\sum_{i=1}^{N} G_\sigma(x^{(n)} - x_i)}$

Page 39: Information Theoretic Learning Finding structure in data ...

Two variants: GBMS and GMS

Gaussian Blurring Mean Shift (GBMS): a single dataset X, initialized as X = Xo; the kernel centers are the current (blurred) samples themselves.

$x_k^{(n+1)} = m(x_k^{(n)}) = \frac{\sum_{i=1}^{N} G_\sigma(x_k - x_i)\,x_i}{\sum_{i=1}^{N} G_\sigma(x_k - x_i)}$

Gaussian Mean Shift (GMS): two datasets X and Xo, initialized as X = Xo; the kernel centers stay fixed at the original data Xo.

$x_k^{(n+1)} = m(x_k^{(n)}) = \frac{\sum_{i=1}^{N_o} G_\sigma(x_k - x_i^o)\,x_i^o}{\sum_{i=1}^{N_o} G_\sigma(x_k - x_i^o)}$
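A compact sketch of the two variants, assuming Gaussian kernels; the function names and toy data are illustrative. Note that GBMS keeps blurring the dataset, so it is normally stopped after a few iterations.

import numpy as np

def _weights(A, B, sigma):
    # Gaussian kernel weights G_sigma(a - b) for all pairs (a in A, b in B)
    sq = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def gms(Xo, sigma, n_iter=100):
    # Gaussian mean shift: kernel centers stay fixed at the original data Xo
    X = Xo.copy()
    for _ in range(n_iter):
        W = _weights(X, Xo, sigma)
        X = W @ Xo / W.sum(1, keepdims=True)
    return X

def gbms(Xo, sigma, n_iter=10):
    # Gaussian blurring mean shift: the dataset itself is blurred every iteration
    X = Xo.copy()
    for _ in range(n_iter):
        W = _weights(X, X, sigma)
        X = W @ X / W.sum(1, keepdims=True)
    return X

rng = np.random.default_rng(0)
Xo = np.concatenate([rng.normal(-1, 0.2, (80, 2)), rng.normal(1, 0.2, (80, 2))])
print(np.round(gms(Xo, 0.4), 2)[:3])   # points sit near the modes of the Parzen estimate
print(np.round(gbms(Xo, 0.4), 2)[:3])  # each cluster has collapsed after a few iterations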

Page 40: Information Theoretic Learning Finding structure in data ...

Connection to ITMS

$J(X) = \min_X\;(1-\lambda)\,H(X) + 2\lambda\,H(X;X_o)$

λ = 0:  $J(X) = \min_X H(X)$, whose fixed point is the GBMS update

$x_k^{(n+1)} = \frac{\sum_{i=1}^{N} G_\sigma(x_k - x_i)\,x_i}{\sum_{i=1}^{N} G_\sigma(x_k - x_i)}$

λ = 1:  $J(X) = \min_X H(X;X_o)$, whose fixed point is the GMS update

$x_k^{(n+1)} = \frac{\sum_{i=1}^{N_o} G_\sigma(x_k - x_i^o)\,x_i^o}{\sum_{i=1}^{N_o} G_\sigma(x_k - x_i^o)}$

Page 41: Information Theoretic Learning Finding structure in data ...

Applications- Clustering

[Figure: 10 random Gaussian clusters and their pdf estimate]

Page 42: Information Theoretic Learning Finding structure in data ...

[Figures: GBMS result and GMS result on the 10-cluster dataset]

Page 43: Information Theoretic Learning Finding structure in data ...

Image segmentation

Page 44: Information Theoretic Learning Finding structure in data ...

[Figures: image segmentation results with GMS and GBMS]

Page 45: Information Theoretic Learning Finding structure in data ...

Applications- Principal Curves

Nonlinear extension of PCA: "self-consistent" smooth curves which pass through the "middle" of a d-dimensional probability distribution or data cloud.

A new definition (Erdogmus et al.)

A point $x$ is an element of the d-dimensional principal set, denoted $\wp^d$, iff the gradient $g(x)$ of $p(x)$ is orthogonal to at least $(n-d)$ eigenvectors of the Hessian $U(x)$, and $p(x)$ is a strict local maximum in the subspace spanned by those eigenvectors.

Page 46: Information Theoretic Learning Finding structure in data ...

PC continued…

$\wp^0$ is the 0-dimensional principal set, corresponding to the modes of the data; $\wp^1$ is the 1-dimensional principal curve; $\wp^2$ is a 2-dimensional principal surface, and so on.

Hierarchical structure: $\wp^0 \subset \wp^1 \subset \wp^2 \subset \ldots \subset \wp^{d-1} \subset \wp^d$. ITMS satisfies this definition (experimentally) and gives the principal curve for intermediate values of λ.

Page 47: Information Theoretic Learning Finding structure in data ...

[Figure: principal curve of spiral data passing through the modes]

Page 48: Information Theoretic Learning Finding structure in data ...

Denoising

[Figure: chain-of-rings dataset]

Page 49: Information Theoretic Learning Finding structure in data ...

Applications -Vector Quantization

Limiting case of ITMS (λ → ∞).

$D_{cs}(X;X_o)$ can be seen as a distortion measure between X and Xo.

Initialize X with far fewer points than Xo.

$J(X) = \min_X\, D_{cs}(X;X_o)$

Page 50: Information Theoretic Learning Finding structure in data ...

Comparison

[Figures: vector quantization of a 2D dataset with ITVQ and LBG]

Page 51: Information Theoretic Learning Finding structure in data ...

Unsupervised Learning Tasks: choose a point in a 2D space!

[Figure: the (σ, λ) plane; σ spans different scales, λ spans different tasks: clustering, principal curves, vector quantization]

Page 52: Information Theoretic Learning Finding structure in data ...

Conclusions

Extracting "goal-oriented structures" goes beyond preprocessing stages and helps us obtain abstract representations of the data.

A common framework binds these "interesting structures" as different levels of information extraction from the data. ITMS achieves this and can be used for clustering, principal curves, vector quantization, and more.

Page 53: Information Theoretic Learning Finding structure in data ...

What’s Next?

Page 54: Information Theoretic Learning Finding structure in data ...

Correntropy: A new generalized similarity measure

Correlation is one of the most widely used functions in signal processing and pattern recognition.

But, correlation only quantifies similarity fully if the random variables are Gaussian distributed.

Can we define a new function that measures similarity but is not restricted to second-order statistics?

Use the ITL framework.

Page 55: Information Theoretic Learning Finding structure in data ...

Correntropy: A new generalized similarity measure

Define the correntropy of a random process {x_t} as

$V_x(t,s) = E\big[\kappa(x_t - x_s)\big] = \iint \kappa(x_t - x_s)\,p(x_t, x_s)\,dx_t\,dx_s$

The name correntropy comes from the fact that the average over the lags (or the dimensions) is the information potential (the argument of Renyi's quadratic entropy).

For a strictly stationary and ergodic random process we can easily estimate correntropy from samples using kernels:

$\hat{V}(m) = \frac{1}{N}\sum_{n=1}^{N}\kappa_\sigma(x_n - x_{n-m})$

Santamaria I., Pokharel P., Principe J., “Generalized Correlation Function: Definition, Properties and Application to Blind Equalization”, IEEE Trans. Signal Proc. vol 54, no 6, pp 2187- 2186, 2006.
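A sketch of the lag-domain correntropy estimator above for a 1-D time series, assuming a normalized Gaussian kernel; the sinewave test is illustrative.

import numpy as np

def correntropy(x, max_lag, sigma):
    # V_hat(m) = (1/N) * sum_n G_sigma(x[n] - x[n-m]) for lags m = 0 .. max_lag,
    # assuming a strictly stationary, ergodic series x
    x = np.asarray(x, dtype=float)
    norm = 1.0 / (np.sqrt(2 * np.pi) * sigma)
    V = np.empty(max_lag + 1)
    for m in range(max_lag + 1):
        diff = x[m:] - x[:len(x) - m]
        V[m] = norm * np.exp(-diff**2 / (2 * sigma**2)).mean()
    return V

# Illustrative use: correntropy of a noisy sinewave retains its periodicity
n = np.arange(2000)
x = np.sin(2 * np.pi * n / 20) + 0.1 * np.random.default_rng(0).normal(size=n.size)
V = correntropy(x, max_lag=60, sigma=0.5)
print(V[20] > V[10])   # True: a lag equal to the period scores much higher than an off-period lag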

Page 56: Information Theoretic Learning Finding structure in data ...

Correntropy: A new generalized similarity measure

What does it look like? The sinewave example.

Page 57: Information Theoretic Learning Finding structure in data ...

Correntropy: A new generalized similarity measure

Properties of correntropy:
• It has a maximum at the origin ($= \frac{1}{\sqrt{2\pi}\,\sigma}$ for the Gaussian kernel)
• It is a symmetric positive function
• Its mean value is the information potential
• Correntropy includes higher order moments of the data:

$V(s,t) = \frac{1}{\sqrt{2\pi}\,\sigma}\sum_{n=0}^{\infty}\frac{(-1)^n}{2^n\,n!\,\sigma^{2n}}\,E\big[(x_s - x_t)^{2n}\big]$

• The matrix whose elements are the correntropy at different lags is Toeplitz

Page 58: Information Theoretic Learning Finding structure in data ...

Correntropy: A new generalized similarity measure

• Correntropy as a cost function versus MSE:

$\mathrm{MSE}(X,Y) = E\big[(X-Y)^2\big] = \iint (x-y)^2 f_{XY}(x,y)\,dx\,dy = \int e^2 f_E(e)\,de$

$V(X,Y) = E\big[\kappa(X-Y)\big] = \iint \kappa(x-y)\,f_{XY}(x,y)\,dx\,dy = \int \kappa(e)\,f_E(e)\,de$

Page 59: Information Theoretic Learning Finding structure in data ...

Correntropy: A new generalized similarity measure

• Correntropy induces a metric (CIM) in the sample space defined by

$\mathrm{CIM}(X,Y) = \big(V(0,0) - V(X,Y)\big)^{1/2}$

• Therefore correntropy can be used as an alternative similarity criterion in the space of samples.

[Figure: contours of the CIM metric in a 2D sample space]

Liu W., Pokharel P., Principe J., “Correntropy: Properties and Applications in Non Gaussian Signal Processing”, accepted in IEEE Trans. Sig. Proc

Page 60: Information Theoretic Learning Finding structure in data ...

RKHS induced by Correntropy: Definition

For a stochastic process {X_t, t ∈ T}, with T being an index set, the correntropy defined as

$V_X(t,s) = E\big[\kappa(X_t, X_s)\big]$

is symmetric and positive definite. Thus it induces a new RKHS, denoted VRKHS ($V_H$). There is a kernel mapping Ψ such that

$V_X(t,s) = \langle \Psi(t), \Psi(s)\rangle_{V_H}$

Any symmetric non-negative definite kernel is the covariance kernel of a random function and vice versa (Parzen).

Therefore, given a random function {Xt, t є T} there exists another random function {ft, t є T} such that

E[ft fs]=VX(t,s)

Page 61: Information Theoretic Learning Finding structure in data ...

RKHS induced by Correntropy: Definition

This RKHS seems very appropriate for nonlinear signal processing.

In this space we can use linear algorithms to compute systems that are nonlinear in the input space, such as:

– Matched filters
– Wiener filters
– Principal component analysis
– Solve constrained optimization problems
– Do adaptive filtering and controls