Clase08-Mlss05au Hyvarinen Ica 02
Independent Component Analysis
Aapo Hyvarinen
HIIT Basic Research Unit
University of Helsinki, Finland
http://www.cs.helsinki.fi/aapo.hyvarinen/
1
-
Blind source separation
Four source signals:
[Figure: time plots of the four source signals.]
Due to some external circumstances, only linear mixtures of the source signals are observed.
[Figure: time plots of the four observed mixture signals.]
Estimate (separate) original signals!
2
-
Solution by independence
Use only information on statistical independence to recover:
[Figure: time plots of the four recovered (estimated) signals.]
These are the independent components!
3
-
Independent Component Analysis.
(Herault and Jutten, 1984-1991)
Observed random vector x is modelled by a linear latent variable model
x_i = Σ_{j=1}^{m} a_{ij} s_j,   i = 1, ..., n   (1)
or in matrix form:
x = As   (2)
where
The mixing matrix A is constant (a parameter matrix).
The s_i are latent random variables called the independent components.
Estimate both A and s, observing only x.
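A minimal numerical sketch of this generative model (not from the slides; the dimensions, the random seed, and the uniform source distribution are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 10000                                   # number of components and samples (assumed)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), (n, T))  # nongaussian (uniform) unit-variance sources
A = rng.normal(size=(n, n))                       # unknown square mixing matrix
X = A @ S                                         # observed data, x = As
# The ICA task: recover A and S (up to scaling and ordering) from X alone.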
4
-
Basic properties of the ICA model. Must assume:
The s_i are mutually independent.
The s_i are nongaussian. For simplicity: the matrix A is square.
The s_i are defined only up to a multiplicative constant.
The s_i are not ordered.
5
-
ICA and decorrelation
First approach: decorrelate variables.
Whitening or sphering: decorrelate and normalize, E{xx^T} = I.
Simple by eigenvalue decomposition of the covariance matrix. But: decorrelation uses only the correlation matrix: n^2/2 equations, and A has n^2 elements.
Not enough information!
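A sketch of whitening by eigenvalue decomposition, assuming the observations are stored as the rows of a zero-mean array X (illustrative code, not from the lecture):

import numpy as np

def whiten(X):
    # X is n x T, zero-mean; returns Z with E{zz^T} = I and the whitening matrix V
    C = np.cov(X)                                # sample covariance matrix
    d, E = np.linalg.eigh(C)                     # eigenvalues and orthonormal eigenvectors
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T      # V = E D^{-1/2} E^T
    return V @ X, V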
6
-
Independence is better
Fortunately, independence is stronger than uncorrelatedness. For independent variables we have
E{h1(y1) h2(y2)} − E{h1(y1)} E{h2(y2)} = 0.   (3)
Still, decorrelation (whitening) is usually done before ICA for various technical reasons.
For example: after decorrelation and standardization, A can be considered orthogonal.
Gaussian data are determined by correlations alone, so the model cannot be estimated for gaussian data.
7
-
Illustration of whitening
Two ICs with uniform distributions:
[Figure: original (latent) variables, observed mixtures, and whitened mixtures.]
Cf. gaussian density: symmetric in all directions.
8
-
Basic intuitive principle of ICA estimation.
(Sloppy version of) the Central Limit Theorem (Donoho, 1982).
Consider a linear combination w^T x = q^T s. A sum q_i s_i + q_j s_j is more gaussian than s_i alone. Maximizing the nongaussianity of q^T s, we can find s_i.
Also known as projection pursuit.
9
-
Marginal and joint densities, uniform distributions.
Marginal and joint densities, whitened mixtures of uniform ICs
10
-
Marginal and joint densities, supergaussian distributions.
Whitened mixtures of supergaussian ICs
11
-
Kurtosis as nongaussianity measure.
Problem: how to measure nongaussianity?
Definition: kurt(x) = E{x^4} − 3 (E{x^2})^2   (4)
If variance is constrained to unity, essentially the 4th moment.
Simple algebraic properties because it is a cumulant: for independent s1, s2,
kurt(s1 + s2) = kurt(s1) + kurt(s2)   (5)
kurt(α s1) = α^4 kurt(s1)   (6)
Zero for a gaussian RV, non-zero for most nongaussian RVs. Positive vs. negative kurtosis correspond to typical forms of pdf.
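A quick empirical check of definition (4) on simulated samples (the sample sizes and the particular distributions are only illustrative):

import numpy as np

def kurt(x):
    # sample version of Eq. (4): E{x^4} - 3 (E{x^2})^2
    return np.mean(x**4) - 3 * np.mean(x**2)**2

rng = np.random.default_rng(0)
print(kurt(rng.normal(size=100000)))             # close to 0 (gaussian)
print(kurt(rng.laplace(size=100000)))            # positive (supergaussian)
print(kurt(rng.uniform(-1, 1, size=100000)))     # negative (subgaussian)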
12
-
Left: Laplacian pdf, positive kurt (supergaussian). Right: uniform pdf, negative kurt (subgaussian).
13
-
The extrema of kurtosis
By the properties of kurtosis:
kurt(w^T x) = kurt(q^T s) = q_1^4 kurt(s_1) + q_2^4 kurt(s_2)   (7)
Constrain the variance to equal unity:
E{(w^T x)^2} = E{(q^T s)^2} = q_1^2 + q_2^2 = 1   (8)
For simplicity, consider kurtoses equal to one. Maxima of kurtosis give the independent components (see figure). General result: the absolute value of kurtosis is maximized by the s_i (Delfosse and Loubaton, 1995).
Note: the extrema are orthogonal due to whitening.
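A small sketch of this optimization landscape: for two whitened uniform ICs, scan the kurtosis of w^T z over the angle of w (the sources, the mixing matrix, and the grid of angles are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), (2, 50000))   # two uniform, unit-variance ICs
x = rng.normal(size=(2, 2)) @ s                        # random mixing
d, E = np.linalg.eigh(np.cov(x))                       # whitening
z = (E @ np.diag(1.0 / np.sqrt(d)) @ E.T) @ x

kurt = lambda y: np.mean(y**4) - 3 * np.mean(y**2)**2
for angle in np.linspace(0, np.pi, 9):
    w = np.array([np.cos(angle), np.sin(angle)])       # unit-norm direction
    print(f"angle {angle:.2f}  kurt {kurt(w @ z):+.3f}")
# |kurtosis| is largest when w points at one of the independent components.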
14
-
Optimization landscape for kurtosis. Thick curve is unit sphere, thin
curves are contours where kurtosis is constant.
15
-
[Figure: kurtosis of w^T x as a function of the angle of w.]
Kurtosis as a function of the direction of projection. For positive kurtosis,
kurtosis (and its absolute value) are maximized in the directions of the
independent components.
16
-
[Figure: kurtosis as a function of the angle of w, negative-kurtosis case.]
Case of negative kurtosis. Kurtosis is minimized, and its absolute value
maximized, in the directions of the independent components.
17
-
Basic ICA estimation procedure
1. Whiten the data to give z.
2. Set iteration count i = 1.
3. Take a random vector w_i.
4. Maximize the nongaussianity of w_i^T z, under the constraints ‖w_i‖^2 = 1 and w_i^T w_j = 0, j < i (a sketch of this deflationary procedure follows below).
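A rough sketch of this deflationary scheme, using gradient ascent on |kurtosis| as the nongaussianity measure and Gram-Schmidt projections for the orthogonality constraint (these particular choices are assumptions; the slides leave the maximization method open):

import numpy as np

def deflation_ica(Z, n_components, n_iter=200, step=0.1, seed=0):
    # Z is whitened data, n x T; components are estimated one by one
    rng = np.random.default_rng(seed)
    W = []
    for _ in range(n_components):
        w = rng.normal(size=Z.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            y = w @ Z
            k = np.mean(y**4) - 3 * np.mean(y**2)**2         # kurtosis of w^T z
            grad = 4 * (Z * y**3).mean(axis=1) - 12 * np.mean(y**2) * (Z * y).mean(axis=1)
            w = w + step * np.sign(k) * grad                 # ascend |kurtosis|
            for wj in W:                                     # w_i^T w_j = 0 for earlier j
                w -= (w @ wj) * wj
            w /= np.linalg.norm(w)                           # ||w_i|| = 1
        W.append(w)
    return np.array(W)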
-
Why kurtosis is not optimal
Sensitive to outliers: consider a sample of 1000 values with unit variance, and one value equal to 10.
Kurtosis then equals at least 10^4/1000 − 3 = 7.
For supergaussian variables, statistical performance is not optimal even
without outliers.
Other measures of nongaussianity should be considered.
19
-
Differential entropy as nongaussianity measure
Generalization of ordinary discrete Shannon entropy: H(x) = −E{log p(x)}   (9)
For fixed variance, maximized by the gaussian distribution. Often normalized to give negentropy:
J(x) = H(x_gauss) − H(x)   (10)
Good statistical properties, but computationally difficult.
20
-
Approximation of negentropy
Approximations of negentropy (Hyvarinen, 1998):
J_G(x) = (E{G(x)} − E{G(x_gauss)})^2   (11)
where G is a nonquadratic function.
Generalization of the (square of) kurtosis (which is G(x) = x^4). A good compromise?
statistical properties not bad (for suitable choice of G)
computationally simple
Further possibility: Skewness (for nonsymmetric ICs)
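A sketch of the approximation (11) with G(x) = log cosh x; the gaussian reference value E{G(x_gauss)} is estimated here by simulation, which is an assumption made for simplicity:

import numpy as np

def negentropy_approx(x, n_ref=10**6, seed=0):
    # J_G(x) = (E{G(x)} - E{G(x_gauss)})^2 for standardized x, with G = log cosh
    G = lambda u: np.log(np.cosh(u))
    x = (x - x.mean()) / x.std()
    gauss = np.random.default_rng(seed).normal(size=n_ref)   # unit-variance gaussian reference
    return (np.mean(G(x)) - np.mean(G(gauss)))**2

rng = np.random.default_rng(1)
print(negentropy_approx(rng.normal(size=100000)))    # near 0 for a gaussian variable
print(negentropy_approx(rng.laplace(size=100000)))   # clearly positive for a sparse variable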
21
-
Information-theoretic approach.
(Comon 1994)
Mutual information of y = (y_1, ..., y_n)^T:
I(y_1, ..., y_n) = Σ_{i=1}^{n} H(y_i) − H(y)   (12)
where H is differential entropy.
A measure of the redundancy of y. Equals zero iff the y_i are independent.
For y = Wx, we obtain
I(y) = Σ_{i=1}^{n} H(y_i) − log |det W| + const.   (13)
22
-
Mutual information and nongaussianity
If W is constrained to be orthogonal (whitened data):
I(y_1, ..., y_n) = Σ_{i=1}^{n} H(y_i) + const.   (14)
Sum of nongaussianities! (though opposite sign)
Rigorous derivation of maximization of nongaussianities.
23
-
Maximum likelihood estimation.
(Pham et al, 1992)
Log-likelihood of the model (W = A^{-1}):
L = Σ_{t=1}^{T} Σ_{i=1}^{n} log p_{s_i}(w_i^T x(t)) + T log |det W|   (15)
Equivalent to the infomax approach in neural networks.
Needs estimates of the p_{s_i}, but these need not be exact at all. Roughly: consistent if p_{s_i} is of the right type (sub- or supergaussian).
Very similar to mutual information:
I(y) = −E{Σ_{i=1}^{n} log p_{y_i}(y_i)} − log |det W| + C   (16)
24
-
Overview of ICA estimation principles.
Most approaches can be interpreted as maximizing the
nongaussianity of ICs.
Basic choice: the nonquadratic function in the nongaussianity measure:
kurtosis: fourth power
entropy/likelihood: log of density
approx of entropy: G(s) = log cosh s, or others.
One-by-one estimation vs. estimation of the whole model. Estimates constrained to be white vs. no constraint
25
-
Algorithms (1). Adaptive gradient methods
Gradient methods for one-by-one estimation are straightforward. Stochastic gradient ascent for the likelihood (Bell-Sejnowski, 1995):
ΔW ∝ (W^{-1})^T + g(Wx) x^T   (17)
with g = (log p_s)'. Problem: needs matrix inversion! Better: natural/relative gradient ascent of the likelihood
(Amari et al, 1996; Cardoso and Laheld, 1994):
ΔW ∝ [I + g(y) y^T] W   (18)
with y = Wx. Obtained by multiplying the gradient by W^T W.
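A sketch of the natural-gradient rule (18), written here as a batch update with g(y) = -tanh(y), a common choice for supergaussian sources (both the batch form and this nonlinearity are assumptions; the slide describes the adaptive, sample-by-sample version):

import numpy as np

def natural_gradient_ica(X, n_epochs=300, lr=0.1):
    # X is n x T (zero-mean mixtures); returns the unmixing matrix W
    n, T = X.shape
    W = np.eye(n)
    g = lambda y: -np.tanh(y)                          # score-like nonlinearity, supergaussian case
    for _ in range(n_epochs):
        Y = W @ X
        W += lr * (np.eye(n) + (g(Y) @ Y.T) / T) @ W   # ΔW ∝ [I + E{g(y) y^T}] W
    return W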
26
-
Algorithms (2). The FastICA fixed-point algorithm
(Hyvarinen 1997,1999)
An approximate Newton method in block (batch) mode.
No matrix inversion, but still quadratic (or cubic) convergence. No parameters to be tuned.
For a single IC (whitened data): w ← E{x g(w^T x)} − E{g'(w^T x)} w, then normalize w,
where g is the derivative of G.
For the likelihood:
W ← W + D_1 [D_2 + E{g(y) y^T}] W, then orthonormalize W
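A sketch of the one-unit fixed-point iteration on whitened data, with G = log cosh so that g = tanh and g' = 1 - tanh^2 (this particular nonlinearity is an assumed choice):

import numpy as np

def fastica_one_unit(Z, n_iter=100, tol=1e-6, seed=0):
    # Z is whitened data, n x T; returns one weight vector w
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ Z
        g, g_prime = np.tanh(y), 1 - np.tanh(y)**2
        w_new = (Z * g).mean(axis=1) - g_prime.mean() * w   # w <- E{z g(w^T z)} - E{g'(w^T z)} w
        w_new /= np.linalg.norm(w_new)                      # normalize w
        if abs(abs(w_new @ w) - 1) < tol:                   # converged (up to sign)
            return w_new
        w = w_new
    return w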
27
-
[Figure: kurtosis as a function of the iteration count.]
Convergence of FastICA. Vectors after 1 and 2 iterations, values of
kurtosis.
28
-
[Figure: kurtosis as a function of the iteration count.]
Convergence of FastICA (2). Vectors after 1 and 2 iterations, values of
kurtosis.
29
-
Relations to other methods (1): Projection pursuit (Friedman and Tukey, 1974; Huber, 1985)
Projection pursuit is a method for visualization and exploratory data analysis.
Attempts to show clustering structure of data by finding interesting
projections.
PCA is not designed to find clustering structure.
Interestingness is usually measured by nongaussianity.
For example, bimodal distributions are very nongaussian.
30
-
Illustration of projection pursuit. The projection pursuit direction is
horizontal, the principal component vertical.
31
-
Relations to other methods. (2)
Factor analysis: ICA is a nongaussian (usually noise-free) version
Blind deconvolution: obtained by constraining the mixing matrix
Principal component analysis: often the same applications, but very different statistical principles.
32
-
Basic ICA estimation: conclusions
ICA is very simple as a model:
linear nongaussian latent variables model.
Estimation not so simple due to nongaussianity: objective functions cannot be quadratic.
Estimation by maximizing nongaussianity of independent components.
Equivalently (?), maximum likelihood or minimization of mutual information.
Algorithms: adaptive (natural gradient descent) vs. block/batch mode
(FastICA).
Choice of nonlinearity: cubic (kurtosis) vs. non-polynomial functions
33
-
Applications (1)
Brain Imaging Data
34
-
Main application areas so far:
Audio noise cancelling (cocktail-party problem). Very difficult... mainly historical.
Biomedical signals: electro-????-grams
Brain images
Microarray data (gene expression)
Vision modelling and image processing
Econometric time series
Telecommunications
35
-
Brain imaging data analysis
EEG, MEG: high temporal resolution. PET, fMRI: global activity maps.
Huge amounts of data: need for neuroinformatics. Physiological models vs. unsupervised methods.
36
-
Electric and magnetic fields
[Figure: magnetic field and electric potential.] EEG and MEG measurements over the scalp.
(From Vigario et al, 2000.)
Many sources mixed in the measurements.
37
-
Magnetoencephalography
[Figure: dewar with liquid helium, Neuromag-122 sensor array, planar and axial gradiometers.]
Neuromag-122 whole scalp magnetometer.
(From Vigario et al, 2001.)
38
-
Artefact removal from MEG
[Figure: MEG channels with saccade, blinking, and biting artefacts.]
A subset of 12 spontaneous MEG signals.
(From Vigario et al, 1998.)
39
-
[Figure: independent components IC1-IC9, 10 s of data.]
Artefacts found from MEG data, using the FastICA algorithm. (From Vigario et al, 1998.)
40
-
Analysis of evoked magnetic fields
[Figure: left-side and right-side channels MEG25, MEG83, MEG60 (MEG-L), MEG10 (MEG-R).]
Averaged auditory evoked responses to 200 tones, using MEG. (From
Vigario et al, 1998.)
41
-
[Figure: a) principal components PC1-PC5, b) independent components IC1-IC4.]
Principal (a) and independent (b) components found from the auditory evoked field study. (From Vigario et al, 1998.)
42
-
Applications (2)
Image and Vision Modelling
43
-
ICA and image data
Models of image data are always useful. In computational neuroscience: evolution + development give optimal receptive fields.
In image processing: essential for denoising, prediction, etc. ICA gives an interesting model
(Olshausen and Field, 1996; Bell and Sejnowski, 1997)
Important connection to sparse coding.
44
-
Linear models of images.
Observed variables x_i are gray-scale values of pixels in an image. Modelled by a linear latent variable model:
x = As = Σ_i a_i s_i   (19)
Columns a_i are called basis vectors.
Image is a superposition of basis vectors. Well-known basis vector sets:
Fourier analysis (sines, cosines)
DCT, wavelets
Gabor analysis
What could be the best basis vectors?
45
-
Some DCT (top) and wavelet (bottom) basis vectors
46
-
What is sparseness?
A form of nongaussianity (higher-order structure) often encountered
in natural signals
Variable is active only rarely
[Figure: a gaussian signal vs. a sparse signal.]
47
-
What is sparseness? (2)
A random variable is sparse if its density has heavy tails, and a peak at zero.
Kurtosis is (strongly) positive, i.e. supergaussianity. Typical sparse pdf (Laplace):
[Figure: Laplace density (dash-dot: gaussian density).]
48
-
Linear Sparse Coding
For a random vector x, find a linear representation: x = As   (20)
so that the components s_i are as sparse as possible.
A given data point x(t) is represented using only a limited number of active components s_i.
ICA is sparse coding since sparseness is supergaussianity.
Sparse coding/ICA gives an optimal basis.
49
-
ICA basis vectors of image windows.
50
-
Why sparse coding?
Good fit to V1 simple cell receptive fields (Van Hateren et al, 1998).
Compress images: code only nonzero components.
In biological networks: saves energy.
Internal model for recovery of structure.
Denoising: use thresholding to leave only components that are really active.
Wavelet methods use the same principles.
51
-
Sparse Coding: Denoising by Shrinkage
(Hyvarinen, 1999)
Assume the data is corrupted by white Gaussian noise n:
x = As + n   (21)
and constrain A to be orthogonal.
Estimate the s_i from x by the ML method, assuming the s_i to be independent:
ŝ_i = f(w_i^T x)   (22)
and reconstruct x̂ = Aŝ.
52
-
Shrinkage nonlinearity as denoising
E.g. if the s_i have a (unit-variance) Laplace distribution, we have
f(u) = sign(u) max(0, |u| − √2 σ^2)   (23)
where σ^2 is the noise variance.
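A one-line sketch of this shrinkage function (σ, the noise standard deviation, is an input; the closed form assumes the unit-variance Laplacian prior above):

import numpy as np

def laplace_shrink(u, sigma):
    # Eq. (23): soft thresholding; small components are set exactly to zero
    return np.sign(u) * np.maximum(0.0, np.abs(u) - np.sqrt(2) * sigma**2)

print(laplace_shrink(np.array([-2.0, -0.1, 0.05, 1.5]), sigma=0.5))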
53
-
Sparse Code Shrinkage algorithm
1. Estimate the sparse coding matrix W = A^{-1}, and the shrinkage nonlinearities f_i.
2. For each noisy observation x(t), compute the corresponding noisy sparse components w_i^T x(t).
3. Reduce noise by applying the shrinkage nonlinearity f_i(·) on the noisy sparse components: ŝ_i(t) = f_i(w_i^T x(t)).
4. Invert the coding to obtain the reconstruction x̂(t) = W^T ŝ(t) (using the orthogonality of W).
Can be considered as an adaptive version of wavelet shrinkage (Donoho
et al, 1995)
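Putting steps 2-4 together as a sketch (W is assumed orthogonal, as on the previous slides, so the coding is inverted with W^T; the shrinkage is the Laplace nonlinearity sketched above):

import numpy as np

def sparse_code_shrinkage(X_noisy, W, sigma):
    # X_noisy is n x T; W is the (orthogonal) sparse coding matrix from step 1
    U = W @ X_noisy                                                          # step 2: noisy sparse components
    S_hat = np.sign(U) * np.maximum(0.0, np.abs(U) - np.sqrt(2) * sigma**2)  # step 3: shrink
    return W.T @ S_hat                                                       # step 4: invert the coding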
54
-
Experiments on Sparse Code Shrinkage
Input data x were 8×8 windows from images.
Basis vectors were estimated from noise-free images, using a modification of FastICA.
Sparse code shrinkage was applied to sliding windows in noisy images; averages of the 8×8 reconstructions were taken as the final reconstructions.
55
-
Noise level: 0.3
Left: Noisy image. Middle: Wiener filtered. Right: Sparse code shrinkage result.
56
-
Conclusion: Image feature extraction by ICA
ICA gives an interesting model for image data. Takes into account nongaussianity, here: sparseness.
Performs sparse coding. Features related to Gabor functions, wavelets, V1 simple cells.
Shrinkage denoising possible as with wavelets.
57
-
Extensions:
Subspace and topography formalisms
58
-
Relaxing independence
For most data sets, the estimated components are not very independent.
In fact, independent components cannot be found in general.
We attempt to model some of the remaining dependencies. Basic models group components:
Multidimensional ICA, and
Independent Subspace Analysis.
59
-
Multidimensional ICA (Cardoso, 1998)
One approach to relaxing independence. The s_i can be divided into n-tuples, such that
the s_i inside a given n-tuple may be dependent on each other,
dependencies between different n-tuples are not allowed.
Every n-tuple corresponds to a subspace.
60
-
Invariant-feature subspaces (Kohonen, 1996)
Linear filters (like in ICA) necessarily lack any invariance.
Invariant-feature subspaces are an abstract approach to representing invariant features.
Principle: an invariant feature is a linear subspace in a feature space. The value of the invariant feature is given by the (squared) norm of the projection on that subspace:
Σ_{i=1}^{k} (w_i^T x)^2   (24)
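A sketch of Eq. (24), where the rows of W_sub are the filters w_1, ..., w_k spanning one invariant-feature subspace (the names and shapes are illustrative):

import numpy as np

def subspace_energy(W_sub, x):
    # value of the invariant feature: sum_i (w_i^T x)^2
    return np.sum((W_sub @ x)**2)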
61
-
Independent Subspace Analysis (Hyvarinen and Hoyer, 2000)
Combination of multidimensional ICA and invariant-feature subspaces.
The probability density inside each subspace is spherically symmetric, i.e. depends only on the norm of the projection.
Simplifies the model considerably.
The nature of the invariant features is not specified.
62
-
[Figure: network diagram of independent subspace analysis: the input is passed through linear filters, the outputs are squared, and the squares are pooled within each subspace.]
63
-
Application on image data
Applied on image data, our model shows emergence of complex-cell properties:
We have phase and some translation invariance, as well as orientation and frequency selectivity.
Each subspace can be interpreted as a complex cell. Similar to energy models for complex cells (norm is like local
energy).
64
-
Independent Subspaces of natural image data.
65
-
Independent Subspace Analysis: Conclusions
A simple way of relaxing the independence constraint in ICA. Instead of scalar components, only subspaces are independent.
Densities inside subspaces are spherically symmetric. Can be interpreted as invariant-feature subspaces.
When applied on image data, complex cell properties emerge.
66
-
Problem: Dependencies still remain
Linear decomposition often does not give independence, even for subspaces.
Remaining dependencies could be visualized or else utilized.
Components can be decorrelated, so only higher-order correlations are interesting.
How to visualize them? E.g. using topographic order
67
-
Extending the model to include topography
Instead of having unordered components,
they are arranged on a two-dimensional lattice
[Figure: components on a two-dimensional lattice; near-by components dependent, distant ones independent.]
The components are typically sparse, but not independent. Near-by components have higher-order correlations.
68
-
Dependence through local variances
Often encountered in e.g. image data
Components are independent given their variances
In our model, variances are not independent
instead: correlated for near-by components
e.g. generated by another ICA model, with topographic mixing
[Figure: independent components vs. topographic variance dependence.]
69
-
Two signals that are independent given their variances.
70
-
Topographic ICA model (Hyvarinen et al, 2000)
[Figure: diagram of the generative model: variance-generating variables u_i, components s_i, mixing matrix A, observations x_i.]
Variance-generating variables u_i are generated randomly, and mixed linearly inside their topographic neighbourhoods. The mixtures are transformed using a nonlinearity, thus giving the variances σ_i of the s_i.
Finally, ordinary linear mixing.
71
-
Approximation of likelihood
Likelihood of the model intractable
Approximation:
Σ_{t=1}^{T} Σ_{j=1}^{n} G( Σ_{i=1}^{n} h(i, j) (w_i^T x(t))^2 ) + T log |det W|.   (25)
where h(i, j) is a neighborhood function, and G a nonlinear function.
Generalization of independent subspace analysis. Function of local energies only!
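A sketch of evaluating approximation (25) for a given W and neighborhood matrix H (with H[i, j] = h(i, j), assumed symmetric); the choice G(u) = -sqrt(u + eps) is only one plausible nonlinearity, not necessarily the one used in the lecture:

import numpy as np

def topographic_ica_loglik(W, X, H, G=lambda u: -np.sqrt(u + 1e-9)):
    # X is n x T data, W the unmixing matrix, H the n x n neighborhood matrix
    T = X.shape[1]
    Y2 = (W @ X)**2                       # (w_i^T x(t))^2
    local_energy = H @ Y2                 # sum_i h(i, j) (w_i^T x(t))^2 for each j
    return G(local_energy).sum() + T * np.log(np.abs(np.linalg.det(W)))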
72
-
Top-down modulated Hebbian learning
Approximation of likelihood can be maximized by gradient ascent
Learning rule:
Δw_i ∝ E{x (w_i^T x) r_i} + normalization + feedback   (26)
where
r_i = Σ_{k=1}^{n} h(i, k) g( Σ_{j=1}^{n} h(k, j) (w_j^T x)^2 ).   (27)
Hebbian learning with r_i a function of the outputs of higher-order (complex) cells.
73
-
Topographic ICA of natural image data. Topographically ordered
Gabor-like basis vectors for image patches.
74
-
Independent subspace analysis and topographic ICA
In ISA, single components are not independent, but subspaces are.
In topographic ICA, dependencies modelled continuously. No strict division into subspaces.
For image data, each neighbourhood is a complex cell. Local energies are their outputs.
Topographic ICA is a generalization of ISA, incorporating the invariant-feature subspace principle as invariant-feature neighbourhoods.
75
-
Topographic ICA: Conclusion
A more sophisticated way of relaxing independence.
Dependencies that cannot be cancelled by ICA define a similarity measure.
New principle for topographic mappings. Formulated as a modification of the ICA model. Approximation of the likelihood gives tractable algorithms. For image data, the topography is similar to V1.
76
-
Using time dependencies
77
-
Using autocorrelations for ICA estimation
Take the basic linear mixture model
x(t) = As(t) (28)
Cannot be estimated in general (take gaussian RVs). Usually in ICA, we assume the s_i to be nongaussian:
higher-order statistics provide missing information.
Alternatively: assume the s_i are time-dependent signals; use time correlations to give more information.
For example, a lagged covariance matrix
C_x^τ = E{x(t) x(t − τ)^T}   (29)
measures covariances of lagged signals.
78
-
The AMUSE algorithm for using autocorrelations
(Tong et al, 1991; Molgedey and Schuster, 1994)
Basic principle: decorrelate each signal in y = Wx with the other signals, lagged as well as not lagged.
In other words: E{y_i(t) y_j(t − τ)} = 0 for all i ≠ j. To do this:
1. Whiten the data to obtain z(t) = Vx(t).
2. Find an orthogonal transformation W so that the lagged covariance matrix of y(t) = Wz(t) is diagonal.
Matrix diagonalization problem
C_x^τ = E{x(t) x(t − τ)^T} = E{A s(t) s(t − τ)^T A^T} = A C_s^τ A^T
of a (more or less) symmetric matrix.
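A sketch of AMUSE as described here: whiten, then eigendecompose a symmetrized lagged covariance matrix (the lag value and the explicit symmetrization are standard details, treated as assumptions):

import numpy as np

def amuse(X, tau=1):
    # X is n x T mixtures of sources with differing autocorrelations
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T               # whitening matrix, z(t) = V x(t)
    Z = V @ X
    C = Z[:, :-tau] @ Z[:, tau:].T / (Z.shape[1] - tau)   # lagged covariance of z
    C = (C + C.T) / 2                                     # make it exactly symmetric
    _, U = np.linalg.eigh(C)                              # orthogonal transform diagonalizing C
    return U.T @ Z, U.T @ V                               # estimated sources and separating matrix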
79
-
Pros and cons of separation by autocorrelations
Very fast to compute: a single eigenvalue decomposition, like PCA.
Can only separate ICs with different autocorrelations, because the lagged covariance matrix must have different eigenvalues.
Some improvement can be achieved by using several lags in the
algorithm (Belouchrani et al, 1997, SOBI).
but if signals have identical Fourier spectra, autocorrelations just
cannot separate them
80
-
Combining nongaussianity and autocorrelations
Best results should be obtained by using these two kinds of information.
E.g.: model the temporal structure of signals with e.g. ARMA models.
A more general approach: minimize coding complexity.
Find a decomposition y = Wx so that the y_i are easy to code. Rigorously defined by Kolmogoroff Complexity.
Signals are easy to code if they are nongaussian and have time dependencies.
81
-
Coding complexity as a general framework
(Pajunen, 1998)
For whitened data z, and an orthogonal W: minimize the sum of coding lengths of the y = Wz.
If only marginal distributions are used,
coding length is given by entropy, i.e. nongaussianity.
If only autocorrelations are used, coding length is related to
autocorrelations.
Thus we have a generalization of both frameworks.
82
-
Approximation of coding complexity
The value of y(t) is predicted from the preceding values:
ŷ(t) = f(y(t − 1), y(t − 2), ..., y(1)).   (30)
The residuals y(t) − ŷ(t) are coded independently from each other.
The predictor could be linear. The coding length is approximated by the entropy of the residuals:
H(y − ŷ)   (31)
Many other approximations can be developed.
83
-
Estimation using variance nonstationarity
(Matsuoka et al, 1995)
An alternative to autocorrelations (and nongaussianity)
Variance changes slowly over time
[Figure: a signal whose variance changes slowly over time.]
This gives enough information to estimate the model.
84
-
Convolutive ICA
Often the signals do not arrive at the same time at the sensors.
There may be echoes as well (multi-path phenomena). Include convolution in the model:
x_i(t) = Σ_{j=1}^{n} a_{ij}(t) ∗ s_j(t) = Σ_{j=1}^{n} Σ_k a_{ij}(k) s_j(t − k),   for i = 1, ..., n   (32)
In theory: Estimation by the same principles as ordinary ICA
In practice: huge number of parameters since (de)convolving filters
may be very long
special methods may need to be used
85
-
Final Summary
ICA is a very simple model. Simplicity implies wide applicability. A nongaussian alternative to PCA or factor analysis. Decorrelation or whitening is only half ICA.
The other half uses the higher-order statistics of nongaussian
variables
(or alternatively: autocorrelations, variance nonstationarity,
complexity)
Basic principle is to find maximally nongaussian directions. Essentially equivalent to maximum likelihood or
information-theoretic formulations.
86
-
Final Summary (2)
Applications:
Blind source separation: biomedical signals, econometrics etc. Feature extraction: images etc.
Exploratory data analysis: like projection pursuit
New ones coming all the time.
Since dependencies cannot always be cancelled, subspace or topographic versions may be useful.
Alternatively, separation is possible using time dependencies. Nongaussianity is beautiful!?
87