Independent Component Analysis
Lecture 14
ICA: Motivation
- Cocktail party problem:
  - Imagine you are in a room where two people are speaking simultaneously.
  - You have two microphones placed in two different locations.
- The microphones give you two recorded time signals, denoted x1(t) and x2(t), where x1 and x2 are the amplitudes and t is the time index.
- This is of course a simplified model in which time delays and room reverberation are omitted.

    x1(t) = a11 s1(t) + a12 s2(t)
    x2(t) = a21 s1(t) + a22 s2(t)
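As a quick illustration of this mixing model, one can simulate two toy sources and mix them with the coefficients a_ij (the signals and the numbers below are invented for the demo):

```python
import numpy as np

t = np.linspace(0, 1, 500)                 # time index

# Two toy "speakers": a sinusoid and a sawtooth (stand-ins for speech)
s1 = np.sin(2 * np.pi * 5 * t)
s2 = 2 * ((3 * t) % 1) - 1

S = np.vstack([s1, s2])                    # sources s(t), shape (2, 500)
A = np.array([[0.8, 0.3],                  # made-up mixing coefficients a_ij
              [0.4, 0.9]])
X = A @ S                                  # microphone signals x1(t), x2(t)
x1, x2 = X
```

Each microphone records a different weighted sum of the same two sources; ICA's job is to undo this mixing given only X.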
ICA: Motivation
- Cocktail party problem:
[Figure: Sources (si) -> Mixing Matrix (A) -> Observed Mixture Signals (x)]
ICA
[Figure: Sources (si) and Observed Mixture Signals (x)]
ICA
- Image processing
(Introduction to ICA, Hyvarinen)
ICA Motivation
- Fetal ECG
(Gari D. Clifford, MIT; images © B. Campbell 2006, Creative Commons License)
ICA: Motivation
- Electroencephalogram (EEG):
  - EEG data consists of recordings of electrical potentials generated by mixing some underlying components of brain activity.
  - We would like to recover the original components of brain activity, but we can only observe a mixture of the components.
ICA
[Diagram: the generative model x = As, with the sources s and the mixing matrix A unknown and only the observations x given]
From Gutierrez-Osuna
ICA Problem
- Assume that we observe n linear mixtures x1, x2, ..., xn of n independent sources:

    x_j(t) = a_j1 s1(t) + a_j2 s2(t) + ... + a_jn sn(t)

- Or, using matrix notation: x = As
- Our goal is to find a de-mixing matrix W such that s = Wx
- Assumptions:
  - Both mixture signals and source signals are zero-mean (i.e., E[xi] = E[sj] = 0 for all i, j)
    - If not, we simply subtract their means.
  - The sources have non-Gaussian distributions (more on this in a minute).
  - The mixing matrix is square, i.e., there are as many sources as mixture signals.
    - This assumption, however, can sometimes be relaxed.
From Gutierrez-Osuna
An Example
- Given the observed signals, can we find the sources?
[Figure: Sources (si) and Observed Mixture Signals (x)]
Independence vs. uncorrelatedness
- What is independence?
  - Two random variables y1 and y2 are said to be independent if knowledge of the value of y1 does not provide any information about the value of y2, and vice versa:

    p(y1 | y2) = p(y1)  <=>  p(y1, y2) = p(y1) p(y2)

  - Note: the sources s1 and s2 are independent, but the mixture variables x1 and x2 are not.
- What is uncorrelatedness?
  - Two random variables y1 and y2 are said to be uncorrelated if their covariance is zero:

    E[y1 y2] - E[y1] E[y2] = 0

- Equivalences
  - Independence implies uncorrelatedness.
  - Uncorrelatedness DOES NOT imply independence...
    - Unless the random variables y1 and y2 are Gaussian, in which case uncorrelatedness and independence are equivalent.
From Gutierrez-Osuna
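A small numerical sketch of the distinction: y2 = y1^2 is completely determined by y1 (so the two are dependent), yet their sample covariance is essentially zero:

```python
import numpy as np

rng = np.random.default_rng(1)

y1 = rng.uniform(-1, 1, 100_000)   # symmetric around 0
y2 = y1 ** 2                       # a deterministic function of y1

# Sample covariance E[y1 y2] - E[y1] E[y2]: close to zero -> "uncorrelated"
cov = np.mean(y1 * y2) - np.mean(y1) * np.mean(y2)

# But knowing y1 pins down y2 exactly, so the variables are NOT independent.
```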
Geometric View
- Linearly independent variables are those whose vectors do not fall along the same line; that is, there is no multiplicative constant that will expand, contract, or reflect one vector onto the other.
- Orthogonal variables are a special case of linearly independent variables.
- "Uncorrelated" means that once each variable is centered (i.e., the mean of each vector is subtracted from its elements), the vectors are perpendicular.
Rodgers et al., 1984
Independence and non-Gaussianity
- A necessary condition for ICA to work is that the signals be non-Gaussian; otherwise, ICA cannot resolve the independent directions due to symmetries.
  - The joint density of unit-variance Gaussian s1 and s2 is rotationally symmetric, so it contains no information about the directions of the columns of the mixing matrix A; hence A cannot be estimated.
  - Besides, if the signals are Gaussian, one may just use PCA to solve the problem (!)
- We will now show that finding the independent components is equivalent to finding the directions of largest non-Gaussianity.
  - For simplicity, let us assume that all the sources have identical distributions.
  - Our goal is to find the vector w such that y = w^T x is equal to one of the sources.
From Gutierrez-Osuna
Why non-Gaussianity?
- Consider an example with n = 2 sources, such that s ~ N(0, I)
  - I is the 2x2 identity matrix.
  - The density contours are circles centered at the origin; the distribution is rotationally symmetric.
- We observe x = As
  - x is Gaussian, with zero mean and covariance

    E[x x^T] = E[A s s^T A^T] = A A^T

  - i.e., x ~ N(0, A A^T)
From Ng's notes, Stanford
Why non-Gaussianity?
- Now let R be an arbitrary orthogonal matrix: R R^T = R^T R = I
- Let A' = AR. If the data had been mixed using A' instead of A, we would observe x' = A's.
  - x' is also Gaussian distributed, with zero mean and covariance

    E[x' x'^T] = E[A' s s^T A'^T] = A' A'^T = A R R^T A^T = A A^T

  - i.e., x' ~ N(0, A A^T)
- This implies that R is an arbitrary rotational component that cannot be determined from the data.
From Ng's notes, Stanford
Why can't Gaussian variables be used with ICA?
From Gutierrez-Osuna
Central Limit Theorem
- The distribution of a sum of independent random variables, which is itself a random variable, tends toward a Gaussian distribution as the number of terms in the sum increases.
[Figure: probability mass function of X, uniform on {1, 2, 3} with probability 1/3 each]
From Wikipedia
Central Limit Theorem
- Sum of two independent copies of X
[Figure: probability mass function on {2, ..., 6}, values 1/9, 2/9, 3/9, 2/9, 1/9]
From Wikipedia
Central Limit Theorem
- Sum of three independent copies of X
[Figure: probability mass function on {3, ..., 9}, values 1/27, 3/27, 6/27, 7/27, 6/27, 3/27, 1/27]
From Wikipedia
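The three PMFs above can be reproduced by convolving the PMF of X with itself, since the PMF of a sum of independent variables is the convolution of their PMFs:

```python
import numpy as np

pmf = np.array([1, 1, 1]) / 3        # X uniform on {1, 2, 3}

pmf2 = np.convolve(pmf, pmf)         # X + X', support {2, ..., 6}
pmf3 = np.convolve(pmf2, pmf)        # X + X' + X'', support {3, ..., 9}

# pmf2 = [1, 2, 3, 2, 1] / 9 and pmf3 = [1, 3, 6, 7, 6, 3, 1] / 27:
# the shape is already visibly closer to a Gaussian bell.
```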
ICA
[Diagram: the generative model x = As, with the sources s and the mixing matrix A unknown and only the observations x given]
From Gutierrez-Osuna
ICA and Central Limit Theorem
- According to the CLT, the signal y = w^T x is more Gaussian than the sources s, since it is a linear combination of them; it becomes least Gaussian when it equals one of the sources.
- Therefore, the optimal w is the vector that maximizes the non-Gaussianity of w^T x, since this will make y equal to one of the sources.
- The trick is now how to measure "non-Gaussianity"...
Gaussian Distribution
- The Gaussian distribution, also known as the normal distribution, is a widely used model for the distribution of continuous variables.
  - One-dimensional Gaussian distribution:

    P(x | µ, σ^2) = (1 / (2πσ^2)^{1/2}) exp( -(x - µ)^2 / (2σ^2) )

    where µ is the mean and σ^2 the variance.

  - Multi-dimensional Gaussian distribution:

    P(x | µ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp( -(1/2) (x - µ)^T Σ^{-1} (x - µ) )

    where x is D-dimensional, µ is the mean, Σ is the D x D covariance matrix, and |Σ| is its determinant.
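Both densities can be written down directly from these formulas; a small sketch:

```python
import numpy as np

def gauss_pdf(x, mu, sigma2):
    """One-dimensional Gaussian density P(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def mvn_pdf(x, mu, Sigma):
    """D-dimensional Gaussian density P(x | mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm
```

At x = µ these reduce to 1/sqrt(2πσ^2) and 1/((2π)^{D/2} |Σ|^{1/2}), which makes a convenient sanity check.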
Geometry of Multivariate Gaussian
- The red curve shows the elliptical surface of constant probability density for a 2-D Gaussian.
  - The axes of the ellipse are defined by the eigenvectors u_i of the covariance matrix.
  - The scaling factors in the directions of u_i are given by sqrt(λ_i).
- The exponent (x - µ)^T Σ^{-1} (x - µ) is the squared Mahalanobis distance.
From Bishop PRML
Moments of a Gaussian Distribution
- First moment (mean): E[x] = µ
- Second moment: E[x^2] = µ^2 + σ^2
- Third moment: E[x^3] = µ^3 + 3µσ^2
- Fourth moment: E[x^4] = µ^4 + 6µ^2σ^2 + 3σ^4
Skewness
- Characterizes the asymmetry of a distribution around its mean.
  - A positively skewed distribution has a "tail" pulled in the positive direction.
  - A negatively skewed distribution has a "tail" pulled in the negative direction.
Measures of Gaussianity
- Kurtosis: kurt(y) = E[y^4] - 3 (E[y^2])^2
  - Measures the "peakedness" or "flatness" of a distribution relative to a normal distribution.
- Kurtosis can be both positive and negative.
  - When kurtosis is zero, the variable is Gaussian.
  - When kurtosis is positive, the variable is said to be supergaussian or leptokurtic.
    - Supergaussians are characterized by a "spiky" pdf with heavy tails, e.g., the Laplace pdf.
  - When kurtosis is negative, the variable is said to be subgaussian or platykurtic.
    - Subgaussians are characterized by a rather "flat" pdf.
- Thus, the absolute value of the kurtosis can be used as a measure of non-Gaussianity.
  - Kurtosis has the advantage of being computationally cheap.
  - Unfortunately, kurtosis is rather sensitive to outliers.
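These sign conventions are easy to check empirically; here the excess kurtosis E[y^4] - 3 is estimated for standardized samples (the distributions below are just the classic examples, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

def excess_kurtosis(y):
    """E[y^4] - 3 for a standardized (zero-mean, unit-variance) sample."""
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3

g = rng.standard_normal(n)        # Gaussian: kurtosis ~ 0
lap = rng.laplace(size=n)         # supergaussian / leptokurtic: > 0 (theory: +3)
uni = rng.uniform(-1, 1, n)       # subgaussian / platykurtic: < 0 (theory: -1.2)
```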
Preprocessing for ICA
- The computation of independent components can be made simpler and better conditioned if the data is preprocessed prior to the analysis.
- Centering
  - This step consists of subtracting the mean of the observation vector: x' = x - E[x]
  - The mean vector can be added back to the estimates of the sources afterwards.
- Whitening
  - Whitening consists of applying a linear transform W~ to the observations so that the components of z = W~ x are uncorrelated and have unit variance: E[z z^T] = I
  - This can be achieved through principal components:

    z = V D^{-1/2} V^T x        (note: Cx^{-1/2} = V D^{-1/2} V^T)

  - where the columns of V and the diagonal of D are the eigenvectors and eigenvalues of E[x x^T], respectively.
Preprocessing for ICA
- Whitening
  - Note that whitening makes the mixing matrix orthogonal:

    z = V D^{-1/2} V^T x = V D^{-1/2} V^T A s = A~ s
    E[z z^T] = E[A~ s s^T A~^T] = A~ E[s s^T] A~^T = A~ A~^T = I

  - This has the advantage of roughly halving the number of parameters that need to be estimated, since an orthogonal n x n matrix has only n(n-1)/2 free parameters.
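The centering and whitening steps can be sketched in NumPy (the sources and the mixing matrix here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Demo data: two unit-variance uniform sources, arbitrary mixing matrix
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 10_000))
A = np.array([[2.0, 1.0], [1.0, 1.0]])
X = A @ S

# Centering: subtract the mean of the observation vector
X = X - X.mean(axis=1, keepdims=True)

# Whitening: z = V D^{-1/2} V^T x, with the sample covariance C = V D V^T
C = X @ X.T / X.shape[1]
d, V = np.linalg.eigh(C)
Z = V @ np.diag(d ** -0.5) @ V.T @ X

C_z = Z @ Z.T / Z.shape[1]        # equals the identity after whitening
```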
FastICA algorithm for kurtosis maximization
- For whitened data z (E[z z^T] = I) and a unit-norm weight vector (w^T w = 1), the objective is the kurtosis of the projection y = w^T z:

    J(w) = kurt(w^T z) = E[(w^T z)^4] - 3 E[(w^T z)^2]^2

- Since E[(w^T z)^2] = E[w^T z z^T w] = w^T w (using E[z z^T] = I), the objective simplifies to

    J(w) = E[(w^T z)^4] - 3 (w^T w)^2

- Approximating the expectation by a sample average over t = 1, ..., N and differentiating:

    dJ(w)/dw = d/dw [ (1/N) sum_{t=1}^{N} (w^T z(t))^4 - 3 (w^T w)^2 ]
             = (4/N) sum_{t=1}^{N} z(t) (w^T z(t))^3 - 3 * 2 (w^T w) * 2w
             = 4 ( E[z (w^T z)^3] - 3 w (w^T w) )
             = 4 ( E[z (w^T z)^3] - 3 w )        [using w^T w = 1]
FastICA algorithm for kurtosis maximization
- Fast ICA algorithm: iterate the fixed-point update

    w_{i+1} = E[z (w_i^T z)^3] - 3 w_i
    w_{i+1} = w_{i+1} / norm(w_{i+1})
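A minimal sketch of this one-unit fixed-point iteration (NumPy; the uniform sources and mixing matrix are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(4)

# Demo data: two uniform (non-Gaussian) sources, invented mixing matrix
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20_000))
A = np.array([[1.0, 2.0], [0.5, 1.5]])
X = A @ S
X -= X.mean(axis=1, keepdims=True)
d, V = np.linalg.eigh(X @ X.T / X.shape[1])
Z = V @ np.diag(d ** -0.5) @ V.T @ X       # whitened observations

# Fixed-point iteration: w <- E[z (w^T z)^3] - 3w, then renormalize
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(100):
    w = (Z * (w @ Z) ** 3).mean(axis=1) - 3 * w
    w /= np.linalg.norm(w)

y = w @ Z                                  # one recovered source (up to sign/scale)
```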
FastICA algorithm for kurtosis maximization
- To estimate several independent components, we run the one-unit FastICA with several units w1, w2, ..., wn.
  - To prevent several of these vectors from converging to the same solution, we decorrelate the outputs w1^T x, w2^T x, ..., wn^T x at each iteration.
  - This can be done using a deflation scheme based on Gram-Schmidt:
    - We estimate each independent component one by one.
    - With p estimated components w1, w2, ..., wp, we run the one-unit ICA iteration for w_{p+1}.
    - After each iteration, we subtract from w_{p+1} its projections (w_{p+1}^T w_j) w_j on the previous vectors w_j:

        w_{p+1} = w_{p+1} - sum_{j=1}^{p} (w_{p+1}^T w_j) w_j

    - Then, we renormalize w_{p+1}:

        w_{p+1} = w_{p+1} / sqrt(w_{p+1}^T w_{p+1})
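The deflation scheme can be sketched by extending the one-unit iteration (again with invented demo sources and mixing matrix):

```python
import numpy as np

rng = np.random.default_rng(5)

# Demo data: two uniform (subgaussian) sources, invented mixing matrix
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20_000))
A = np.array([[1.0, 0.6], [0.4, 1.2]])
X = A @ S
X -= X.mean(axis=1, keepdims=True)
d, V = np.linalg.eigh(X @ X.T / X.shape[1])
Z = V @ np.diag(d ** -0.5) @ V.T @ X          # whitened observations

W = []                                        # estimated de-mixing rows
for p in range(2):
    w = rng.standard_normal(2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        w = (Z * (w @ Z) ** 3).mean(axis=1) - 3 * w
        for wj in W:                          # Gram-Schmidt deflation
            w -= (w @ wj) * wj
        w /= np.linalg.norm(w)
    W.append(w)
W = np.vstack(W)
Y = W @ Z                                     # sources, up to order and sign
```

The deflation step keeps the rows of W orthonormal, so each unit is forced onto a different source direction.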
ICA Ambiguities
- The variances of the independent components cannot be determined.
  - Since both s and A are unknown, any multiplicative factor in s, including a change of sign, could be absorbed by the coefficients of A.
  - To resolve this ambiguity, the source signals are assumed to have unit variance.
- The order of the independent components cannot be determined.
  - Since both s and A are unknown, any permutation of the mixing terms would yield the same result.
  - Compare this with Principal Components Analysis, where the order of the components can be determined by their eigenvalues (their variances).
PCA vs. ICA
Matrix Calculus
- Let x be an n-by-1 vector and y an m-by-1 vector, where each component y_i may be a function of all the x_j:

    x = [x1, x2, ..., xn]^T,   y = [y1, y2, ..., ym]^T,   y = f(x)
Matrix Calculus
- The derivative of the vector y with respect to the vector x is the n-by-m matrix whose (i, j) entry is dy_j/dx_i:

    dy/dx = [ dy1/dx1  dy2/dx1  ...  dym/dx1 ]
            [ dy1/dx2  dy2/dx2  ...  dym/dx2 ]
            [   ...      ...    ...    ...   ]
            [ dy1/dxn  dy2/dxn  ...  dym/dxn ]
Matrix Calculus
- The derivative of a scalar y with respect to the vector x is the n-by-1 vector

    dy/dx = [ dy/dx1, dy/dx2, ..., dy/dxn ]^T
Matrix Calculus
- The derivative of a vector y with respect to a scalar x is the 1-by-m vector

    dy/dx = [ dy1/dx  dy2/dx  ...  dym/dx ]
Matrix Calculus
- An example:

    x = [x1, x2]^T,   A = [ 2   1 ]
                          [ 1  -2 ]
                          [ 0   1 ]

    y = A x = [ 2x1 + x2 ]
              [ x1 - 2x2 ]
              [    x2    ]
Matrix Calculus
- Continuing the example, the derivative of y = Ax with respect to x is

    dy/dx = [ 2   1   0 ]
            [ 1  -2   1 ]
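A finite-difference check of this example (the evaluation point and step size below are arbitrary): each row i of dy/dx holds the sensitivities of all the y_j to x_i, so for y = Ax the result is A^T.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, -2.0],
              [0.0, 1.0]])

def f(x):
    return A @ x                       # y = Ax, y in R^3, x in R^2

x0 = np.array([0.7, -1.3])             # arbitrary evaluation point
eps = 1e-6
J = np.zeros((2, 3))                   # n-by-m, per the slides' convention
for i in range(2):
    e = np.zeros(2)
    e[i] = eps
    J[i] = (f(x0 + e) - f(x0)) / eps   # row i: dy_j/dx_i for all j

# For a linear map the Jacobian matches A^T = [[2, 1, 0], [1, -2, 1]]
```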
Matrix Calculus
- Note: A is a matrix. Under this convention, for example:

    d(Ax)/dx = A^T
    d(x^T A x)/dx = (A + A^T) x
LDA worked example
- Compute the Linear Discriminant projection for the following two-dimensional dataset:
  - X1 = (x1, x2) = {(4,1), (2,4), (2,3), (3,6), (4,4)}
  - X2 = (x1, x2) = {(9,10), (6,8), (9,5), (8,7), (10,8)}

    S_i = (1/N_i) sum_{x in C_i} (x - µ_i)(x - µ_i)^T
    S_W = S1 + S2
    S_B = (µ1 - µ2)(µ1 - µ2)^T
From Gutierrez-Osuna
LDA worked example
- Class means:

    µ1 = [3.00, 3.60]^T
    µ2 = [8.40, 7.60]^T
LDA worked example
- Within-class scatter (class 1). The centered samples x - µ1, x in C1, are

    [  1.0  -1.0  -1.0   0.0   1.0 ]
    [ -2.6   0.4  -0.6   2.4   0.4 ]

  so

    S1 = (1/5) sum_{x in C1} (x - µ1)(x - µ1)^T = [  0.8  -0.4  ]
                                                  [ -0.4   2.64 ]
LDA worked example
- Within-class scatter (class 2). The centered samples x - µ2, x in C2, are

    [  0.6  -2.4   0.6  -0.4   1.6 ]
    [  2.4   0.4  -2.6  -0.6   0.4 ]

  so

    S2 = (1/5) sum_{x in C2} (x - µ2)(x - µ2)^T = [  1.84  -0.04 ]
                                                  [ -0.04   2.64 ]
LDA worked example
- Total within-class scatter:

    S_W = S1 + S2 = [  0.8  -0.4  ] + [  1.84  -0.04 ] = [  2.64  -0.44 ]
                    [ -0.4   2.64 ]   [ -0.04   2.64 ]   [ -0.44   5.28 ]
LDA worked example
- Between-class scatter:

    S_B = (µ1 - µ2)(µ1 - µ2)^T = [ -5.4 ] (-5.4  -4) = [ 29.16  21.6 ]
                                 [ -4   ]              [ 21.6   16   ]
LDA worked example
- The LDA projection is then obtained as the solution of the generalized eigenvalue problem:

    S_W^{-1} S_B w = λw  =>  |S_W^{-1} S_B - λI| = 0

    | 11.89 - λ     8.81     |
    |  5.08         3.76 - λ | = 0  =>  λ = 15.65

    [ 11.89   8.81 ] [ w1 ] = 15.65 [ w1 ]  =>  [ w1 ] = [ 0.91 ]
    [  5.08   3.76 ] [ w2 ]         [ w2 ]      [ w2 ]   [ 0.39 ]
From Gutierrez-Osuna
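The whole worked example can be reproduced in a few lines of NumPy (only the dataset comes from the slides; the helper names are mine):

```python
import numpy as np

X1 = np.array([(4, 1), (2, 4), (2, 3), (3, 6), (4, 4)], dtype=float)
X2 = np.array([(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

def scatter(X, mu):
    D = X - mu
    return D.T @ D / len(X)            # S_i = (1/N_i) sum (x - mu)(x - mu)^T

S1, S2 = scatter(X1, mu1), scatter(X2, mu2)
Sw = S1 + S2                           # within-class scatter
diff = (mu1 - mu2).reshape(-1, 1)
Sb = diff @ diff.T                     # between-class scatter

# LDA projection: leading eigenvector of Sw^{-1} Sb
vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = vecs[:, np.argmax(vals.real)].real
w = w / np.linalg.norm(w) * np.sign(w[0])   # fix the sign for readability
```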