
A Comparative Study between ICA (Independent Component Analysis) and PCA (Principal Component Analysis)

Md. Sahidul Islam, Roll No. 08054718

Department of Statistics, University of Rajshahi

ripon.ru.statistics@gmail.com

Department of Statistics, University of Rajshahi-6205

Overview

Motivation of the study

Objective

Definition of ICA

FastICA algorithm

Results of the study

Latent structure

Cluster analysis

Outlier detection

Conclusions


Motivation of the study

o In multivariate statistics, PCA is a well-established technique for latent structure detection, cluster analysis, and outlier detection.

o In many cases ICA performs better than PCA.

o Our motivation in this thesis is to perform latent structure detection, cluster analysis, and outlier detection using ICA and compare the results with those of PCA.

Objectives

o Study the algorithms of ICA.

o Apply ICA to latent structure detection, cluster analysis, and outlier detection.

o Compare its performance with that of PCA.

Independent Component Analysis

The simple “Cocktail Party” Problem

Two independent sources s_1, s_2 (the speakers) are recorded as observations x_1, x_2 (the microphones) through an unknown mixing matrix A:

x_1 = a_{11} s_1 + a_{12} s_2
x_2 = a_{21} s_1 + a_{22} s_2

or, in matrix form,

\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} s_1 \\ s_2 \end{pmatrix}, i.e. x = As.

ICA: recover the independent sources s from x = As.
PCA: compute uncorrelated projections y = W^T x.
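As an illustration of the difference (not taken from the thesis), both decompositions can be run with scikit-learn, assuming it is available; the mixing matrix A below is arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)

# Two independent non-Gaussian sources (uniform), 1000 samples each.
s = rng.uniform(-1, 1, size=(1000, 2))

# Mix them with a known 2x2 mixing matrix A: x = As.
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
x = s @ A.T

# PCA finds uncorrelated directions of maximal variance; ICA seeks
# statistically independent, non-Gaussian components.
y_pca = PCA(n_components=2).fit_transform(x)
y_ica = FastICA(n_components=2, random_state=0).fit_transform(x)
```

On data like this, the ICA components match the original sources up to sign and order, while the PCA components remain mixtures of them.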


Non-Gaussianity indicates independence

Central limit theorem: the distribution of a sum of independent random variables tends toward a Gaussian distribution.

Observed signal = a_1 s_1 + a_2 s_2 + … + a_n s_n

Each source s_i is non-Gaussian, but their weighted sum is pulled toward the Gaussian.
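This drift toward the Gaussian is easy to check numerically; a minimal sketch, assuming NumPy and SciPy are available, using excess kurtosis as the non-Gaussianity measure:

```python
import numpy as np
from scipy.stats import kurtosis  # excess kurtosis: 0 for a Gaussian

rng = np.random.default_rng(1)

# A single uniform variable is sub-Gaussian: excess kurtosis is -1.2.
u = rng.uniform(-1, 1, size=100_000)

# A sum of 30 independent uniforms is far closer to Gaussian,
# illustrating the central limit theorem.
mix = rng.uniform(-1, 1, size=(100_000, 30)).sum(axis=1)

print(kurtosis(u))    # close to -1.2 (sub-Gaussian)
print(kurtosis(mix))  # close to 0 (near-Gaussian)
```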


Non-Gaussianity estimates the independent components

Consider y = w^T x = w^T A s = z^T s, where z = A^T w.

y is a linear combination of the s_i; by the central limit theorem, z^T s is therefore more Gaussian than any individual s_i.

z^T s is least Gaussian when it equals one of the s_i, i.e. when only one entry of z is nonzero.

In that case w^T x = z^T s is an independent component, so maximizing the non-Gaussianity of w^T x yields one of the independent components.

FastICA algorithm

Iteration procedure for maximizing non-Gaussianity:

Step 1: Choose an initial weight vector w.

Step 2: Let w+ = E[x g(w^T x)] − E[g′(w^T x)] w, where g is the derivative of a non-quadratic contrast function.

Step 3: Let w = w+ / ‖w+‖.

Step 4: If not converged, go back to Step 2.
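The iteration above can be sketched in NumPy, assuming the data are already centered and whitened and taking g = tanh (one common non-quadratic choice); this is an illustrative one-unit version, not the thesis's code:

```python
import numpy as np

def fastica_one_unit(x, max_iter=200, tol=1e-6, seed=0):
    """One-unit FastICA: a weight vector w maximizing the
    non-Gaussianity of w^T x.  x is (n_features, n_samples),
    assumed already centered and whitened."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(x.shape[0])      # Step 1: initial weight vector
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        wx = w @ x                           # projections w^T x
        g = np.tanh(wx)                      # g(w^T x)
        g_prime = 1.0 - g ** 2               # g'(w^T x)
        # Step 2: w+ = E[x g(w^T x)] - E[g'(w^T x)] w
        w_new = (x * g).mean(axis=1) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)       # Step 3: normalize
        # Step 4: converged when w and w+ point the same way (up to sign)
        if abs(abs(w_new @ w) - 1.0) < tol:
            return w_new
        w = w_new
    return w
```

Sign and order of the recovered components are indeterminate, so any comparison with the true sources must allow a sign flip.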


Results and Discussions

Latent structure detection


Simulated dataset -1

Figure: Matrix plot of the original sources: 10 uniformly distributed variables.


Simulated dataset -1

Figure: (a) Matrix plot of 10 principal components. (b) Matrix plot of source variables.


Simulated dataset -1

Figure: (a) Matrix plot of 10 independent components. (b) Matrix plot of source variables


Simulated dataset-2

Simulated dataset-2 consists of 5 variables drawn from the Laplace (super-Gaussian), uniform (sub-Gaussian), binomial, multinomial, and normal distributions, each with 10,000 observations.

Figure: Matrix plot of the original sources: 5 variables, each from a different distribution.
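A dataset of this shape can be reconstructed in NumPy; the distribution parameters below are assumptions, since the thesis does not state them:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Five independent sources, one per named distribution; the
# parameters are illustrative, not the thesis's exact choices.
S = np.column_stack([
    rng.laplace(0.0, 1.0, n),                  # Laplace (super-Gaussian)
    rng.uniform(-1.0, 1.0, n),                 # uniform (sub-Gaussian)
    rng.binomial(10, 0.5, n),                  # binomial
    rng.choice(4, n, p=[0.1, 0.2, 0.3, 0.4]),  # multinomial (categorical)
    rng.normal(0.0, 1.0, n),                   # normal
])

# Mix with a random square matrix to obtain the observed data X = S A^T.
A = rng.normal(size=(5, 5))
X = S @ A.T
```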


Simulated dataset-2

Figure: (Left) Matrix plot of principal components. (Right) Original sources: 5 variables, each from a different distribution.

Simulated dataset-2

Figure: (Left) Matrix plot of independent components. (Right) Original sources: 5 variables, each from a different distribution.

Cluster Analysis


Australian Crabs dataset

The first real data set used for clustering is the Australian crabs data set: 200 rows and 8 columns, including 5 morphological measurements (frontal lobe size, rear width, carapace length, carapace width, body depth). The data cover two species of the genus Leptograpsus, each with both sexes (male, female); there are 50 specimens of each sex of each species, collected on site at Fremantle, Western Australia (N. A. Campbell et al., 1974).

Fisher Iris dataset

The second real data set is Fisher's famous Iris data set, which reports four characteristics (sepal length, sepal width, petal length, petal width) of three species (setosa, versicolor, virginica) of Iris flower.
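One way the comparison could be run on this data set, sketched with scikit-learn (the thesis's exact procedure is not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, FastICA
from sklearn.metrics import adjusted_rand_score

# Fisher's iris data: 150 flowers, 4 measurements, 3 species.
X, species = load_iris(return_X_y=True)

# Reduce to 2 components with each method, cluster with k-means,
# and compare the clusters against the true species labels.
results = {}
for name, model in [("PCA", PCA(n_components=2)),
                    ("ICA", FastICA(n_components=2, random_state=0))]:
    z = model.fit_transform(X)
    pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(z)
    results[name] = adjusted_rand_score(species, pred)

print(results)  # agreement with the true species, per method
```

The adjusted Rand index is 1 for perfect agreement with the species labels and near 0 for random cluster assignments.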

Outlier detection

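The thesis's exact scoring rule is not reproduced here; one common ICA-based approach flags observations whose independent-component scores are extreme. A sketch on synthetic data with a planted outlier, assuming scikit-learn:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Synthetic data: 200 observations on 3 variables built from
# heavy-tailed sources, with one gross outlier planted at row 0.
S = rng.laplace(size=(200, 3))
X = S @ rng.normal(size=(3, 3))
X[0] += 25.0                          # the planted outlier

# Score each observation by its largest absolute independent-
# component value; extreme scores mark candidate outliers.
ica = FastICA(n_components=3, random_state=0)
Z = ica.fit_transform(X)
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
score = np.abs(Z).max(axis=1)
outliers = np.where(score > 4.0)[0]
```

The cutoff of 4 standardized units is an arbitrary illustrative choice; a robust rule would use median/MAD-based standardization instead.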

Scottish hill racing dataset

The data give the record winning times for 35 hill races in Scotland (Atkinson, 1986). The purpose of that study was to investigate how the record time relates to the characteristics of the 35 races.

Epilepsy dataset

Thall and Vail reported data from a clinical trial of 59 patients with epilepsy, 31 of whom were randomized to receive the anti-epilepsy drug Progabide and 28 to receive a placebo.

Stackloss data

This data set covers 21 days of operation of a plant oxidizing ammonia as a stage in the production of nitric acid. The response, called stack loss, is the percentage of ammonia that escapes unabsorbed from the plant. There are three explanatory variables and one response variable in the dataset.

Education expenditure dataset

These data are used by Chatterjee, Hadi, and Price as an example of heteroscedasticity. The data give the education expenditures of U.S. states as projected in 1975.

Conclusions

If the subject domain supports the assumption of independent, non-Gaussian source variables, we recommend using ICA in place of PCA for latent structure detection, clustering, and outlier detection.

Future Research

The following are areas for future study:

o Use kernel ICA for shape study, clustering, and outlier detection.

o Separation of nonlinear mixtures.

o Data mining (sometimes called data or knowledge discovery) is the most recent multivariate technique for extracting information from a data set and transforming it into an understandable structure for further use. Text data mining or medical data mining using ICA would be future research.

Thank you
