
Effective Feature Selection Framework for Cluster Analysis of Microarray Data

Gouchol Pok
Computer Science Dept., Yanbian University, China

Keun Ho Ryu
DB/Bioinformatics Lab, Chungbuk Nat’l University, Korea


Outline

Background
Motivation
Proposed Method
Experiments
Conclusion


Feature Selection

Definition: process of selecting a subset of relevant features for building robust learning models

Objectives:
  Alleviating the effect of the curse of dimensionality
  Enhancing generalization capability
  Speeding up the learning process
  Improving model interpretability

from Wikipedia: http://en.wikipedia.org/wiki/Feature_selection


Issues in Feature Selection

How to compute the degree to which a feature is relevant to the class (discrimination)

How to decide whether a selected feature is redundant with other features (strongly correlated)

How to select features so that classifying power is not diminished (and ideally is increased)

Removal of irrelevancy
Removal of redundancy
Maintenance of class-discriminating power


Selection Modes

Univariate method:
  considers one feature at a time, based on score ranking
  measures include correlation, information measures, the K-S statistic, etc.

Multivariate method:
  considers subsets of features together
  Bayesian and PCA-based selection
  in principle more powerful than the univariate method, but not always in practice (Guyon2008)
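As an illustration of the univariate mode, here is a minimal sketch that scores each gene by the absolute Pearson correlation between its expression values and the class labels, then ranks the genes by that score. The function name `rank_features_by_correlation` and the toy data are hypothetical and do not come from the slides; correlation is just one of the measures listed above.

```python
import numpy as np

def rank_features_by_correlation(X, y):
    """Rank features (rows of X) by |Pearson correlation| with the labels y.

    X: genes x samples matrix, y: one numeric class label per sample.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=1, keepdims=True)   # center each gene
    yc = y - y.mean()                        # center the labels
    denom = np.sqrt((Xc ** 2).sum(axis=1) * (yc ** 2).sum())
    # Constant genes have zero variance; give them score 0 instead of NaN.
    scores = np.divide(np.abs(Xc @ yc), denom,
                       out=np.zeros(X.shape[0]), where=denom > 0)
    order = np.argsort(-scores, kind="stable")  # best-scoring gene first
    return order, scores

# Hypothetical toy data: 3 genes x 6 samples, two classes.
X_toy = np.array([
    [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],   # tracks the class label closely
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],   # constant, carries no information
    [0.5, 2.5, 1.0, 2.0, 0.4, 2.6],   # only weakly related to the class
])
y_toy = np.array([0, 0, 0, 1, 1, 1])
order, scores = rank_features_by_correlation(X_toy, y_toy)
```

Each gene is scored in isolation, so this kind of ranking cannot detect redundancy between two highly scored genes.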


Hard Case in Univariate method (Guyon2008*)

*Adopted from Guyon’s tutorial at IPAM summer school


Proposed method: Motivation

Method that fits 2-D microarray data
  typical form: thousands of genes (rows) and hundreds of samples (columns)

Multivariate approach
  Feature relevancy and redundancy are addressed simultaneously


System Flow

[Figure: system flow diagram; the input matrix has samples and genes as its two axes]


System Flow (cont.)


Methods: Step1

Perform column-based difference operation
  Di(N,M) = C(N,M) - Ci(N,1), i = 1, 2, …, M, where Ci(N,1) is the i-th column of C(N,M)
  The difference operator may depend on the application, e.g. Euclidean or Manhattan distance
  Di(N,M) contains class-specific information with respect to each gene
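The column-based difference can be sketched with NumPy broadcasting. This is a minimal sketch under assumptions: the function name `column_difference` is hypothetical, and reading "Manhattan" as a per-gene absolute difference and "Euclidean" as a per-gene squared difference is an interpretation, since the slides do not spell out the operator.

```python
import numpy as np

def column_difference(C, i, metric="manhattan"):
    """Return D_i: the difference between every column of C and column i.

    C is an N x M matrix (genes x samples); the result is also N x M.
    """
    C = np.asarray(C, dtype=float)
    diff = C - C[:, [i]]       # broadcast the N x 1 reference column over all M columns
    if metric == "manhattan":
        return np.abs(diff)    # assumed per-gene absolute difference
    if metric == "euclidean":
        return diff ** 2       # assumed per-gene squared difference
    raise ValueError(f"unknown metric: {metric}")

# Hypothetical toy matrix: 2 genes x 3 samples.
C_toy = np.array([[1.0, 2.0, 5.0],
                  [4.0, 4.0, 0.0]])
D0 = column_difference(C_toy, 0)   # differences relative to sample 0
```

Repeating the call for i = 0, …, M-1 gives one difference matrix per reference sample; the reference column of each matrix is all zeros.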


Methods: Step2

Apply thresholds
  Find a kind of “emerging patterns” that contrast the 2 classes
  Suppose samples 1, 2, …, j ∈ C1 and samples j+1, j+2, …, M ∈ C2
  Sort the values in each column of Di(N,M)
  Apply a 25% threshold to the same-class differences and a 75% threshold to the different-class differences

[Figure: columns of Di(N,M) grouped by class (C1, C2), with the 25% and 75% thresholds marked]
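One possible reading of the thresholding step is sketched below. It is an interpretation, not the authors' stated procedure: for a column in the same class as the reference sample, a gene is marked 1 when its difference falls at or below that column's 25th percentile (the gene behaves consistently within the class); for a column in the other class, a gene is marked 1 when its difference falls at or above the 75th percentile (the gene contrasts the classes). The function name `binarize_differences` and the toy values are hypothetical.

```python
import numpy as np

def binarize_differences(D_i, labels, i, same_q=25, diff_q=75):
    """Binarize one difference matrix D_i (genes x samples) for reference sample i."""
    D_i = np.asarray(D_i, dtype=float)
    labels = np.asarray(labels)
    B = np.zeros(D_i.shape, dtype=int)
    for col in range(D_i.shape[1]):
        if col == i:
            continue  # the reference column holds only zeros, so it is skipped
        column = D_i[:, col]
        if labels[col] == labels[i]:
            # Same class: mark genes with the smallest differences.
            B[:, col] = column <= np.percentile(column, same_q)
        else:
            # Different class: mark genes with the largest differences.
            B[:, col] = column >= np.percentile(column, diff_q)
    return B

# Hypothetical toy input: 5 genes x 3 samples, reference sample 0.
D_toy = np.array([[0.0, 0.1, 5.0],
                  [0.0, 0.2, 0.3],
                  [0.0, 0.9, 0.1],
                  [0.0, 0.8, 0.2],
                  [0.0, 0.7, 0.4]])
labels_toy = np.array([0, 0, 1])
B_toy = binarize_differences(D_toy, labels_toy, i=0)
```

In the toy input, gene 0 is marked in both columns: it barely differs from the same-class sample and differs strongly from the other-class sample.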


Methods: Step3

Extract class-specific features
  Within-class summation of binary values (count the 1’s)

[Figure: summation of the binary values within C1 and within C2]
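The within-class summation amounts to counting, for each gene, the 1's among the columns of each class. A minimal sketch, assuming a binary genes x samples matrix and one class label per sample; the function name `within_class_counts` and the toy matrix are hypothetical.

```python
import numpy as np

def within_class_counts(B, labels):
    """Count the 1's per gene within each class.

    B: binary genes x samples matrix; labels: one class label per sample.
    Returns the sorted class labels and a genes x classes count matrix.
    """
    B = np.asarray(B)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    counts = np.stack([B[:, labels == c].sum(axis=1) for c in classes], axis=1)
    return classes, counts

# Hypothetical toy input: 3 genes x 4 samples, two samples per class.
B_toy = np.array([[1, 1, 0, 1],
                  [0, 0, 1, 1],
                  [1, 0, 0, 0]])
labels_toy = np.array([0, 0, 1, 1])
classes, counts = within_class_counts(B_toy, labels_toy)
# counts[g, c] is the number of 1's gene g has among the samples of class c
```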


Methods: Step4

Gene selection
  Apply a different threshold value for each class
  After gene selection, the row-wise reduction is complete

[Figure: threshold applied to the within-class sums]
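Gene selection with class-specific thresholds might look like the sketch below. It assumes a genes x classes matrix of within-class counts and keeps a gene when it reaches the threshold of at least one class; whether the original method requires one class or all classes is not stated in the slides, so that rule, the function name `select_genes`, and the threshold values are assumptions.

```python
import numpy as np

def select_genes(counts, thresholds):
    """Return the indices of genes whose count reaches the threshold of any class.

    counts: genes x classes matrix; thresholds: one threshold per class.
    """
    counts = np.asarray(counts)
    thresholds = np.asarray(thresholds)
    keep = (counts >= thresholds).any(axis=1)  # assumed rule: pass in at least one class
    return np.flatnonzero(keep)

# Hypothetical toy input: 3 genes, 2 classes, one threshold per class.
counts_toy = np.array([[2, 1],
                       [0, 2],
                       [1, 0]])
selected = select_genes(counts_toy, thresholds=[2, 2])
```

Indexing the data matrix with `selected` along the gene axis then yields the row-reduced matrix passed on to the clustering step.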


Methods: Step5

Column-wise reduction by clustering
  Classification of samples
  Applied the NMF method


Nonnegative Matrix Factorization (NMF)

Matrix factorization: A ≈ VH
  A: n × m matrix of n genes and m samples
  V: n × k matrix; its k columns are called basis vectors
  H: k × m matrix; describes how strongly each building block is present in the measurement vectors

[Figure: A (n × m) ≈ V (n × k) · H (k × m)]
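The factorization A ≈ VH can be computed in several ways, and the slides do not say which update rule was used. The sketch below uses the well-known multiplicative updates for the squared-error objective as a stand-in; the function name `nmf`, the iteration count, the small constant `eps`, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

def nmf(A, k, n_iter=500, eps=1e-9, seed=0):
    """Factor a nonnegative n x m matrix A into V (n x k) and H (k x m).

    Uses multiplicative updates that reduce the squared error ||A - VH||^2
    while keeping every entry of V and H nonnegative.
    """
    A = np.asarray(A, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = A.shape
    V = rng.random((n, k)) + eps   # random positive starting point
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (V.T @ A) / (V.T @ V @ H + eps)   # update coefficients
        V *= (A @ H.T) / (V @ H @ H.T + eps)   # update basis vectors
    return V, H

# Hypothetical toy data: 6 genes x 4 samples with two obvious sample groups.
A_toy = np.array([[5, 5, 0, 0],
                  [4, 4, 0, 0],
                  [3, 3, 0, 0],
                  [0, 0, 2, 2],
                  [0, 0, 6, 6],
                  [0, 0, 1, 1]], dtype=float)
V, H = nmf(A_toy, k=2)
rel_err = np.linalg.norm(A_toy - V @ H) / np.linalg.norm(A_toy)
```

Because the result depends on the random starting point, practical use typically repeats the factorization with several seeds and compares the outcomes.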


NMF: Parts-based Clustering (Brunet2004)

Brunet et al. introduce the concept of metagenes


Experiments: Datasets

Leukemia Data
  5000 genes
  38 samples of two classes (ALL and AML), with ALL split into two subtypes
  19 samples of ALL-B type, 8 samples of ALL-T type, and 11 samples of AML type

Medulloblastoma Data
  5893 genes
  34 samples of two classes
  25 classic type and 9 desmoplastic type

Central Nervous System Tumors Data
  7129 genes
  34 samples of four classes
  10 classic medulloblastomas, 10 malignant gliomas, 10 rhabdoids, and 4 normals


Classification

Given a target sample, its class is predicted by the highest value in the corresponding k-dimensional column vector of H

[Figure: A (n × m) ≈ V (n × k) · H (k × m)]
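The prediction rule reduces to an argmax over each column of H. A minimal sketch, assuming H is a k × m coefficient matrix from an NMF of the data; the function name `assign_clusters` and the values of `H_toy` are hypothetical.

```python
import numpy as np

def assign_clusters(H):
    """Assign each sample (column of H) to the row with the largest coefficient."""
    H = np.asarray(H, dtype=float)
    return H.argmax(axis=0)   # one index in 0..k-1 per sample

# Hypothetical coefficient matrix: k = 2 rows, m = 4 samples.
H_toy = np.array([[0.9, 0.8, 0.1, 0.2],
                  [0.1, 0.3, 0.7, 0.6]])
cluster_of_sample = assign_clusters(H_toy)
```

The returned indices are arbitrary cluster labels, so they have to be matched to the known classes before any accuracy can be computed.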


Results

Leukemia Data (ALL-T vs. ALL-B vs. AML)


Results

Medulloblastoma Data (Classic vs. Desmoplastic)


Results

Central Nervous System Tumors Data (4 classes)


Conclusions & Future work

Our approach tries to capture a group of features but, in contrast to holistic methods such as PCA and ICA, it preserves the intrinsic structure of the data distribution in the reduced space.

Still, PCA and ICA can be used as an aid for looking into the structure of the data distribution, and they can provide useful information for further processing by other methods.

Our ongoing research is on how to combine PCA and ICA with the proposed method.


References

Wikipedia, http://en.wikipedia.org/wiki/Feature_selection

J.-P. Brunet, P. Tamayo, T. Golub, and J. P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. PNAS, 101(12):4164-4169, 2004.

L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proc. 20th Int. Conf. on Machine Learning (ICML-03), pages 856-863, 2003.

J. Biesiada and W. Duch. Feature selection for high-dimensional data: A Kolmogorov-Smirnov correlation-based filter solution. In (CORES'05) Advances in Soft Computing, Springer Verlag, pages 95-104, 2005.

D. D. Lee and H. S. Seung. Learning the parts of objects by nonnegative matrix factorization.


Questions?