Chapter 5 Independent Component Analysis Part II: Algorithms


DMM, summer 2017 Pauli Miettinen

ICA definition
• Given n observations of m random variables in matrix X, find n observations of m independent components in S and an m-by-m invertible mixing matrix A s.t. X = SA

• Components are statistically independent

• At most one component is Gaussian

• We can assume A is orthogonal (by whitening X); a small sketch of this setup follows below
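A minimal NumPy sketch of this setup (my own illustration, not from the lecture): it draws independent non-Gaussian sources, mixes them as X = SA, and whitens X. The whiten helper and the √n rescaling that makes the sample covariance of Z approximately the identity are choices of this sketch; the slides simply use Z = XVΣ⁻¹ = U.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n observations (rows) of m independent, non-Gaussian sources.
n, m = 10_000, 3
S = rng.laplace(size=(n, m))      # super-Gaussian sources
A = rng.normal(size=(m, m))       # invertible mixing matrix
X = S @ A                         # observed data, X = SA

def whiten(X):
    """Center X and whiten it via the SVD X = U Sigma V^T."""
    Xc = X - X.mean(axis=0)
    U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt.T / sigma         # Z = X V Sigma^{-1} (this equals U)
    return Z * np.sqrt(len(X))    # rescale so that cov(Z) is approximately I

Z = whiten(X)
print(np.round(np.cov(Z, rowvar=False), 2))   # close to the identity matrix
```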


Maximal non-Gaussianity

Skillicorn chapter 7; Hyvärinen & Oja 2000


Central limit theorem
• The average of i.i.d. variables converges to a normal distribution

• as n → ∞

• Hence (X₁ + X₂)/2 is “more Gaussian” than X₁ or X₂ alone

• For i.i.d. zero-centered non-Gaussian X₁ and X₂

• Hence, we can try to find components s that are “maximally non-Gaussian” (a small numerical demo follows after the formula below)


$$\sqrt{n}\,\Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big) \xrightarrow{d} N(0, \sigma^2)$$
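A quick numerical check of this effect (my own demo; the uniform distribution, sample sizes, and the excess-kurtosis helper are assumptions of the demo, not from the slides): averaging more i.i.d. non-Gaussian variables pushes the excess kurtosis toward 0, i.e. toward Gaussianity.

```python
import numpy as np

rng = np.random.default_rng(1)

def excess_kurtosis(x):
    """Sample excess kurtosis E[(x - mu)^4] / sigma^4 - 3 (0 for a Gaussian)."""
    c = x - x.mean()
    return np.mean(c**4) / np.mean(c**2) ** 2 - 3

samples = rng.uniform(-1, 1, size=(100_000, 16))    # i.i.d. sub-Gaussian variables
for k in (1, 2, 4, 16):
    avg = samples[:, :k].mean(axis=1)               # average of k of them
    print(k, round(excess_kurtosis(avg), 3))        # moves toward 0 as k grows
```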


Re-writing ICA
• Recall, in ICA x = sA ⇔ s = xA⁻¹

• Hence, each sⱼ is a linear combination of the xᵢ

• Approximate sⱼ ≈ y = xbᵀ (b to be determined)

• Now y = sAbᵀ, so y is a linear combination of s

• Let qᵀ = Abᵀ and write y = xbᵀ = sqᵀ


More re-writings
• Now sⱼ ≈ y = xbᵀ = sqᵀ

• If bᵀ is a column of A⁻¹, then sⱼ = y, i.e. qⱼ = 1 and q is 0 elsewhere

• CLT: sqᵀ is least Gaussian when q looks correct (one entry 1, the rest 0)

• We don’t know s, so we can’t vary q

• But we can vary b and study xbᵀ

• Approach: find b s.t. xbᵀ is least Gaussian


Kurtosis
• One way to measure how Gaussian a random variable is, is its kurtosis

• kurt(y) = E[(y – μ)⁴] – 3(E[(y – μ)²])²

• E[y] = μ

• Normalized version of the fourth central moment E[(y – μ)⁴]

• If y ~ N(μ, σ²), kurt(y) = 0; most other distributions have non-zero kurtosis (positive or negative), as the sample estimates below illustrate
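A small helper to estimate kurtosis from a sample (a sketch; the helper name kurt and the example distributions are my own choices):

```python
import numpy as np

def kurt(y):
    """Sample estimate of kurt(y) = E[(y - mu)^4] - 3 (E[(y - mu)^2])^2."""
    c = np.asarray(y, dtype=float)
    c = c - c.mean()
    return np.mean(c**4) - 3 * np.mean(c**2) ** 2

rng = np.random.default_rng(2)
print(round(kurt(rng.normal(size=100_000)), 3))          # roughly 0 (Gaussian)
print(round(kurt(rng.laplace(size=100_000)), 3))         # positive (super-Gaussian)
print(round(kurt(rng.uniform(-1, 1, size=100_000)), 3))  # negative (sub-Gaussian)
```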


Computing with kurtosis

• If x and y are independent random variables:

• kurt(x + y) = kurt(x) + kurt(y)

• Homework

• If α is a constant:

• kurt(αx) = α⁴ kurt(x)

• E[(αx)⁴] – 3(E[(αx)²])² = α⁴E[x⁴] – α⁴·3(E[x²])²


Sub- and super-Gaussian distributions

• Distributions with negative kurtosis are sub-Gaussian (or platykurtic)

• Flatter than Gaussian

• Distributions with positive kurtosis are super-Gaussian (or leptokurtic)

• Spikier than Gaussian


Examples

(Figure: standard symmetric pdfs with differing kurtosis; https://en.wikipedia.org/wiki/Kurtosis#/media/File:Standard_symmetric_pdfs.png)


Negentropy
• Another measure of non-Gaussianity

• The entropy of a discrete r.v. X is H(X) = –∑ᵢ Pr[X=i] log Pr[X=i]

• The differential entropy of a continuous random vector x with density f(x) is H(x) = –∫ f(x) log f(x) dx

• A Gaussian x has the largest entropy over all random variables of equal variance

• Negentropy is J(x) = H(x_Gauss) – H(x)

• x_Gauss is a Gaussian r.v. with the same covariance matrix as x


Approximating negentropy

• Computing the negentropy requires estimating the (unknown) pdfs

• It can be approximated as J(y) ≈ ∑ᵢ kᵢ (E[Gᵢ(y)] – E[Gᵢ(v)])²

• v ∼ N(0, 1), the kᵢ are positive constants, and the Gᵢ are some non-quadratic functions

• With only one function G(y) = y⁴, this reduces to (squared) kurtosis

• One choice: G₁(y) = log(cosh(ay))/a, G₂(y) = –exp(–y²/2), as in the sketch below
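A sketch of this approximation with a single G = G₁ and a = 1 (the Monte-Carlo estimate of E[G(v)], the standardization step, and the omitted constant k are choices of this sketch):

```python
import numpy as np

rng = np.random.default_rng(3)

def negentropy_approx(y, G=lambda u: np.log(np.cosh(u))):
    """Approximate J(y) by (E[G(y)] - E[G(v)])^2 with v ~ N(0, 1)."""
    y = (y - y.mean()) / y.std()          # compare at zero mean, unit variance
    v = rng.normal(size=200_000)          # Monte-Carlo estimate of E[G(v)]
    return (G(y).mean() - G(v).mean()) ** 2

print(negentropy_approx(rng.normal(size=100_000)))   # close to 0 for a Gaussian
print(negentropy_approx(rng.laplace(size=100_000)))  # clearly larger for a super-Gaussian
```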


Back to optimization (using kurtosis)

• Recall: with two components y = xbᵀ = sqᵀ = q₁s₁ + q₂s₂

• The sᵢ have unit variance

• We want to find ±b = argmax |kurt(xbᵀ)|

• We can’t determine the sign

• We want y to be either s₁ or s₂; hence E[y²] = q₁² + q₂² = 1


Whitening, again
• Generally, ‖q‖₂ = 1

• Recall: Z = U = XVΣ⁻¹ is the whitened version of X

• The target becomes ±w = argmax |kurt(zwᵀ)|

• Now ‖q‖₂² = (wUᵀ)(Uwᵀ) = ‖w‖₂²

• Hence we have the constraint ‖w‖₂ = 1


Gradient-based algorithm
• The gradient with kurtosis is

$$\frac{\partial\,|\mathrm{kurt}(zw^{\mathsf{T}})|}{\partial w} = 4\,\mathrm{sign}\big(\mathrm{kurt}(zw^{\mathsf{T}})\big)\Big(E[(zw^{\mathsf{T}})^{3}z] - 3w\|w\|_2^2\Big)$$

• E[(zwᵀ)²] = ‖w‖₂² for whitened data

• We can optimize this using standard gradient methods

• To satisfy the constraint ‖w‖₂ = 1, we divide w by its norm after every update (see the sketch below)
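A gradient-ascent sketch of this update on whitened data Z with rows as observations (the step size, the sample estimates of the expectations, and the usage loop are assumptions of the sketch):

```python
import numpy as np

def kurt_gradient_step(Z, w, step=0.1):
    """One gradient step on |kurt(Z w^T)|, followed by renormalization to ||w|| = 1."""
    y = Z @ w                                      # y = z w^T for every observation
    k = np.mean(y**4) - 3 * np.mean(y**2) ** 2     # sample kurtosis of y
    grad = 4 * np.sign(k) * ((Z * y[:, None] ** 3).mean(axis=0) - 3 * w * (w @ w))
    w = w + step * grad
    return w / np.linalg.norm(w)

# Usage sketch (Z whitened, rows = observations):
# rng = np.random.default_rng(0)
# w = rng.normal(size=Z.shape[1]); w /= np.linalg.norm(w)
# for _ in range(200):
#     w = kurt_gradient_step(Z, w)
```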


FastICA for one IC and kurtosis

• Noticing that ‖w‖₂ = 1 by the constraint and taking an “infinite step” update, we get w ← E[(zwᵀ)³z] – 3w

• Again set w ← w/‖w‖ after every update

• The expectation naturally has to be estimated from the sample

• No theoretical guarantees, but works in practice (a sketch follows below)
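A minimal sketch of this fixed-point iteration (the convergence test, iteration cap, and random initialization are my own choices; Z is assumed to be whitened with rows as observations):

```python
import numpy as np

def fastica_kurtosis_one_unit(Z, max_iter=200, tol=1e-6, seed=0):
    """FastICA-style fixed-point iteration for one component, kurtosis nonlinearity."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        y = Z @ w
        w_new = (Z * y[:, None] ** 3).mean(axis=0) - 3 * w   # E[(zw^T)^3 z] - 3w
        w_new /= np.linalg.norm(w_new)
        # w and -w describe the same component, so compare up to sign
        if min(np.linalg.norm(w_new - w), np.linalg.norm(w_new + w)) < tol:
            return w_new
        w = w_new
    return w
```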


FastICA with approximations of negentropy

• Let g be the derivative of a function used to approximate the negentropy

• g₁(x) = G₁′(x) = tanh(ax)

• The general fixed-point update rule is w ← E[g(zwᵀ)z] – E[g′(zwᵀ)]w (sketched below)
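The same one-unit iteration with the negentropy-based nonlinearity g = tanh (i.e. a = 1); again only a sketch, with my own stopping rule and with the initial w passed in explicitly:

```python
import numpy as np

def fastica_tanh_one_unit(Z, w, max_iter=200, tol=1e-6):
    """One-unit FastICA update w <- E[g(zw^T)z] - E[g'(zw^T)]w with g = tanh."""
    w = w / np.linalg.norm(w)
    for _ in range(max_iter):
        y = Z @ w
        g, g_prime = np.tanh(y), 1.0 - np.tanh(y) ** 2
        w_new = (Z * g[:, None]).mean(axis=0) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        if min(np.linalg.norm(w_new - w), np.linalg.norm(w_new + w)) < tol:
            return w_new
        w = w_new
    return w
```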


Multiple components
• So far we have found only one component

• To find more, remember that the vectors wᵢ are orthogonal (columns of the invertible A)

• General idea (a deflation-style sketch follows below):

• Find one vector w

• Find a second one that is orthogonal to the first

• Find a third that is orthogonal to the two previous ones, etc.
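A sketch of this deflation idea using the tanh fixed-point update from the previous slide; the Gram–Schmidt projection after every inner update, the iteration count, and the initialization are my own choices:

```python
import numpy as np

def deflation_ica(Z, n_components, n_iter=200, seed=0):
    """Estimate components one by one; keep each new w orthogonal to the earlier rows of W."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_components, Z.shape[1]))
    for i in range(n_components):
        w = rng.normal(size=Z.shape[1])
        for _ in range(n_iter):
            y = Z @ w
            # one-unit tanh update, then project away the already-found components
            w = (Z * np.tanh(y)[:, None]).mean(axis=0) - (1 - np.tanh(y) ** 2).mean() * w
            w -= W[:i].T @ (W[:i] @ w)
            w /= np.linalg.norm(w)
        W[i] = w
    return W
```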


Symmetric orthogonalization

• We can compute the wᵢ in parallel

• Update the wᵢ independently

• Run orthogonalization after every update step (see the sketch below)

• W ← (WWᵀ)^(–1/2) W

• Iterate until convergence
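A sketch of this parallel (symmetric) variant: every row of W gets the tanh fixed-point update, followed by the symmetric orthogonalization W ← (WWᵀ)^(–1/2)W computed from an eigendecomposition. Iteration count and initialization are my own choices.

```python
import numpy as np

def sym_decorrelate(W):
    """Return (W W^T)^{-1/2} W via the eigendecomposition of W W^T."""
    vals, vecs = np.linalg.eigh(W @ W.T)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T @ W

def fastica_symmetric(Z, n_components, n_iter=200, seed=0):
    """Parallel FastICA sketch: tanh update on every row, then symmetric orthogonalization."""
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.normal(size=(n_components, Z.shape[1])))
    for _ in range(n_iter):
        Y = Z @ W.T                                   # current component estimates
        G, G_prime = np.tanh(Y), 1.0 - np.tanh(Y) ** 2
        W = (G.T @ Z) / len(Z) - G_prime.mean(axis=0)[:, None] * W
        W = sym_decorrelate(W)
    return W
```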


Maximum Likelihood

Skillicorn chapter 7; Hyvärinen & Oja 2000


Maximum-likelihood algorithms

• Idea: We are given observations X that are drawn from some parameterized family of distributions D(Θ)

• The likelihood of X given Θ, L(Θ; X) = pD(X; Θ), where pD(·; Θ) is the probability density function of D with parameters Θ

• In maximum-likelihood estimation (MLE) we try to find Θ that maximizes the likelihood given X


ICA as MLE
• If pₓ(x) is the pdf of x = sA, then

$$p_x(x) = p_s(s)\,|\det B| = |\det B|\prod_i p_i(s_i) = |\det B|\prod_i p_i(x b_i^{\mathsf{T}})$$

• here B = A⁻¹ and bᵢᵀ is the i-th column of B

• In general, if x is an r.v. and y = Bx has pdf p_y, then pₓ(x) = p_y(Bx)|det B|

• For T observations x₁, x₂, …, x_T the log-likelihood of B given X is

$$\log L(B; X) = \sum_{t=1}^{T}\sum_{i=1}^{m} \log p_i(x_t b_i^{\mathsf{T}}) + T\log|\det B|$$
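A direct transcription of this log-likelihood (a sketch: the log cosh source model is the unnormalized one from a later slide, so the value is only correct up to an additive constant that does not depend on B):

```python
import numpy as np

def ica_log_likelihood(X, B, log_p=lambda s: -np.log(np.cosh(s))):
    """log L(B; X) = sum_t sum_i log p_i(x_t b_i^T) + T log|det B|,
    where column i of B holds b_i^T."""
    T = len(X)
    S = X @ B                    # row t holds (x_t b_1^T, ..., x_t b_m^T)
    return log_p(S).sum() + T * np.log(abs(np.linalg.det(B)))
```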


Problems with MLE

• The likelihood is expressed as a function of B

• But we also need to estimate the pdfs pᵢ(·)

• A non-parametric problem: there are infinitely many possible pdfs

• Very hard problem…


If we know the pdfs
• Sometimes we know the pdfs of the components

• We only need to estimate their parameters and B

• Sometimes we know only that the pdfs are super-Gaussian (for example)

• We can use log pᵢ(sᵢ) = –log cosh(sᵢ)

• Requires normalization


–log cosh(x) ≈ –|x|


(Figure: –log cosh(x) plotted for x from –2.4 to 2.4, illustrating its –|x|-like shape)


Nothing is known about the pdfs
• We might not know whether the pdfs of the components are sub- or super-Gaussian

• It is enough to estimate which of the two they are!

• For super-Gaussian: log pᵢ⁺(sᵢ) = α₁ – 2 log cosh(sᵢ)

• For sub-Gaussian: log pᵢ⁻(sᵢ) = α₂ – (sᵢ²/2 – log cosh(sᵢ))


The αᵢ are only needed to make these proper log-pdfs (normalizing constants); they play no role in the optimization


Log-likelihood gradient
• The gradient is

$$\frac{\partial \log L}{\partial B} = (B^{\mathsf{T}})^{-1} + \sum_{t=1}^{T} g(x_t B^{\mathsf{T}})^{\mathsf{T}} x_t$$

• Here g(y) = (g₁(y₁), …, gₙ(yₙ)) with gᵢ(yᵢ) = (log pᵢ(yᵢ))′ = pᵢ′(yᵢ)/pᵢ(yᵢ)

• This gives us the update B ← B + δ((Bᵀ)⁻¹ + ∑ₜ g(xₜBᵀ)ᵀxₜ), where δ is the step size

• Multiplying from the right with BᵀB and defining yₜ = xₜBᵀ gives B ← B + δ(I + ∑ₜ g(yₜ)ᵀyₜ)B

• The so-called infomax algorithm


Setting g()

• We compute E[–tanh(sᵢ)sᵢ + (1 – tanh(sᵢ)²)]

• If positive, set g(y) = –2 tanh(y)

• If negative (or zero), set g(y) = tanh(y) – y

• Use current estimates of sᵢ


Putting it all together
• Start with random B and γ, choose learning rates δ and δ_γ

• Iterate until convergence (a code sketch follows below):

• yₜ ← xₜBᵀ and normalize each component of y to unit variance

• γᵢ ← (1 – δ_γ)γᵢ + δ_γ E[–tanh(yᵢ)yᵢ + (1 – tanh(yᵢ)²)]  (an exponentially smoothed estimate)

• If γᵢ > 0, use the super-Gaussian g; otherwise the sub-Gaussian g

• B ← B + δ(I + ∑ₜ g(yₜ)ᵀyₜ)B
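A compact sketch of this loop (the learning rates, iteration count, per-column variance normalization, and the 1/T averaging absorbed into the step size are my own choices; the update rules themselves are the ones on this slide):

```python
import numpy as np

def infomax_ica(X, n_iter=500, delta=0.01, delta_gamma=0.1, seed=0):
    """Infomax-style sketch: estimate B and the sub/super-Gaussian switches gamma jointly."""
    rng = np.random.default_rng(seed)
    T, m = X.shape
    B = rng.normal(size=(m, m))
    gamma = np.zeros(m)
    for _ in range(n_iter):
        Y = X @ B.T                              # rows y_t = x_t B^T
        Y = Y / Y.std(axis=0)                    # normalize components to unit variance
        th = np.tanh(Y)
        # smoothed estimate deciding super- (gamma > 0) vs sub-Gaussian per component
        gamma = (1 - delta_gamma) * gamma + delta_gamma * np.mean(-th * Y + (1 - th**2), axis=0)
        G = np.where(gamma > 0, -2 * th, th - Y) # g(y) chosen per component
        B = B + delta * (np.eye(m) + (G.T @ Y) / T) @ B
    return B
```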


ICA summary
• ICA can recover independent source signals

• if they are non-Gaussian

• Does not reduce rank

• Many applications, special case of blind source separation

• Standard algorithmic technique is to maximize non-Gaussianity of the recovered components


ICA literature

• Hyvärinen & Oja (2000): Independent Component Analysis: Algorithms and Applications. Neural Networks 13(4–5), 411–430

• Hyvärinen (2013): Independent component analysis: recent advances. Phil. Trans. R. Soc. A 371:20110534
