1
Effective Feature Selection Framework for
Cluster Analysis of Microarray Data
Gouchol Pok, Computer Science Dept., Yanbian University, China
Keun Ho Ryu, DB/Bioinformatics Lab, Chungbuk Nat'l University, Korea
2
Outline
Background
Motivation
Proposed Method
Experiments
Conclusion
3
Feature Selection
Definition: the process of selecting a subset of relevant features for building robust learning models
Objectives:
Alleviating the effect of the curse of dimensionality
Enhancing generalization capability
Speeding up the learning process
Improving model interpretability
from Wikipedia: http://en.wikipedia.org/wiki/Feature_selection
4
Issues in Feature Selection
How to measure the degree to which a feature is relevant to the class (discrimination)
How to decide whether a selected feature is redundant with other features (strongly correlated)
How to select features so that class-discriminating power is not diminished, and is ideally increased
In short: remove irrelevance, remove redundancy, and maintain class-discriminating power
5
Selection Modes
Univariate method: considers one feature at a time, ranked by a score; measures include correlation, information measures, the K-S statistic, etc. (see the sketch below)
Multivariate method: considers subsets of features together, e.g. Bayesian and PCA-based selection; in principle more powerful than the univariate method, but not always in practice (Guyon2008)
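As an illustration of univariate scoring with the K-S statistic mentioned above, a minimal sketch; the function and variable names are assumptions for illustration, not from the slides:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_rank(X, labels):
    """Rank genes (rows of X) by how well the K-S statistic separates two classes.

    X: genes x samples expression matrix; labels: 0/1 class per sample.
    """
    labels = np.asarray(labels)
    scores = np.array([ks_2samp(row[labels == 0], row[labels == 1]).statistic
                       for row in X])
    return np.argsort(scores)[::-1]   # indices of best-separating genes first
```

Each gene is scored independently of all others, which is exactly why a univariate filter can miss features that discriminate only in combination.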
6
Hard Case in the Univariate Method (Guyon2008*)
*Adapted from Guyon's tutorial at the IPAM summer school
7
Proposed method: Motivation
A method that fits 2-D microarray data in its typical form: thousands of genes (rows) and hundreds of samples (columns)
A multivariate approach: feature relevancy and redundancy are addressed simultaneously
8
System Flow
[Figure: system flow over the gene-expression matrix; genes are rows, samples are columns]
9
System Flow (cont.)
10
Methods: Step1
Perform a column-based difference operation: D_i(N, M) = C(N, M) − C_i(N, 1), i = 1, 2, …, M, where C is the N × M expression matrix (genes × samples) and C_i is its i-th column
The difference operator may depend on the application, e.g. Euclidean or Manhattan distance
D_i(N, M) contains class-specific information w.r.t. each gene (see the sketch below)
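A minimal numpy sketch of the difference operation, under the assumption that it is applied element-wise per gene; matrix sizes and names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 38                       # genes x samples (illustrative sizes)
C = rng.random((N, M))                # stands in for the expression matrix

def column_difference(C, i, metric="manhattan"):
    """D_i(N, M): gene-wise difference between every sample and sample i."""
    col = C[:, [i]]                   # C_i(N, 1), kept 2-D so it broadcasts
    if metric == "manhattan":
        return np.abs(C - col)        # element-wise absolute difference
    return (C - col) ** 2             # squared (Euclidean-style) difference

D = [column_difference(C, i) for i in range(M)]   # one D_i per sample i
```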
11
Methods: Step2
Apply thresholds to find a kind of "emerging pattern" that contrasts the two classes
Suppose samples 1, 2, …, j ∈ C1 and samples j+1, j+2, …, M ∈ C2
Sort the values in each column of D_i(N, M); apply a 25% threshold to the same-class differences and a 75% threshold to the different-class differences (see the sketch below)
[Figure: columns of D_i grouped into C1/C2 blocks, with the 25% and 75% thresholds marked]
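A hedged sketch of Step 2: the binarization rule below (small same-class differences and large cross-class differences both mark a discriminative entry) is my reading of the slide, and the names are illustrative:

```python
import numpy as np

def binarize(D_i, i, labels, lo=0.25, hi=0.75):
    """Turn D_i(N, M) into a 0/1 matrix using per-column quantile thresholds."""
    labels = np.asarray(labels)
    B = np.zeros(D_i.shape, dtype=int)
    for j in range(D_i.shape[1]):
        col = D_i[:, j]
        if labels[j] == labels[i]:    # same-class column: small diffs score 1
            B[:, j] = col <= np.quantile(col, lo)
        else:                         # cross-class column: large diffs score 1
            B[:, j] = col >= np.quantile(col, hi)
    return B
```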
12
Methods: Step3
Extract class-specific features: within-class summation of the binary values, i.e. count the 1's each gene receives from each class's columns (see the sketch below)
[Figure: within-class summation of the binary matrix into C1 and C2 count vectors]
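A minimal sketch of the within-class summation, assuming B is the binary matrix from Step 2 and labels assigns each column to a class; names are illustrative:

```python
import numpy as np

def class_counts(B, labels):
    """S(N, #classes): per-gene count of 1's within each class's columns."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    return np.stack([B[:, labels == c].sum(axis=1) for c in classes], axis=1)
```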
13
Methods: Step4
Gene selection: apply a different threshold value for each class
With the genes selected, the row-wise reduction is done (see the sketch below)
[Figure: per-class thresholds applied to the count vectors to select genes]
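A sketch of the gene-selection rule; whether a gene must pass its own class's threshold or any class's threshold is not specified on the slide, so the union rule below is an assumption:

```python
import numpy as np

def select_genes(S, thresholds):
    """Keep a gene if its count for some class reaches that class's threshold."""
    keep = np.zeros(S.shape[0], dtype=bool)
    for c, t in enumerate(thresholds):
        keep |= S[:, c] >= t          # union over per-class threshold tests
    return np.flatnonzero(keep)       # row indices of the retained genes
```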
14
Methods: Step5
Column-wise reduction by clustering: samples are classified by applying the NMF method
15
Nonnegative Matrix Factorization (NMF)
Matrix factorization: A ≈ VH
A: an n × m matrix of n genes and m samples
V (n × k): the k columns of V are called basis vectors
H (k × m): describes how strongly each building block is present in the measurement vectors
[Figure: A (n × m) ≈ V (n × k) × H (k × m)]
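The slides do not name an NMF implementation; the sketch below uses scikit-learn's NMF as a stand-in to show the A ≈ VH factorization, with illustrative sizes:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((1000, 38))            # n genes x m samples; must be nonnegative

k = 3                                 # number of basis vectors (metagenes)
model = NMF(n_components=k, init="nndsvd", random_state=0, max_iter=500)
V = model.fit_transform(A)            # V: n x k basis matrix
H = model.components_                 # H: k x m encoding matrix
```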
16
NMF: Parts-based Clustering (Brunet2004)
Brunet et al. introduced the metagene concept: each basis vector of V is interpreted as a metagene, and samples are grouped by their dominant metagene
17
Experiments: Datasets
Leukemia Data: 5000 genes, 38 samples in two classes (ALL vs. AML):
19 samples of ALL-B type, 8 samples of ALL-T type, and 11 samples of AML type
Medulloblastoma Data: 5893 genes, 34 samples in two classes:
25 of classic type and 9 of desmoplastic medulloblastoma type
Central Nervous System Tumors Data: 7129 genes, 34 samples in four classes:
10 classic medulloblastomas, 10 malignant gliomas, 10 rhabdoids, and 4 normals
18
Classification
Given a target sample, its class is predicted by the index of the highest value in the k-dimensional column vector of H corresponding to that sample (see the sketch below)
[Figure: A (n × m) ≈ V (n × k) × H (k × m); the column of H for the target sample gives its cluster membership]
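Continuing the NMF sketch above, this prediction rule reduces to a column-wise argmax over H; a minimal sketch:

```python
import numpy as np

def predict(H):
    """Assign each sample to the metagene with the largest coefficient."""
    return np.argmax(H, axis=0)       # H is k x m; returns m cluster labels
```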
19
Results
Leukemia Data (ALL-T vs. ALL-B vs. AML)
20
Results
Medulloblastoma Data (Classic vs. Desmoplastic)
21
Results
Central Nervous System Tumors Data (4 classes)
22
Conclusions & Future work
Our approach captures a group of features, but in contrast to holistic methods such as PCA and ICA, the intrinsic structure of the data distribution is preserved in the reduced space.
Still, PCA and ICA can be used as an aid to inspect the structure of the data distribution, providing useful information for further processing by other methods.
Our ongoing research is on how to combine PCA and ICA with the proposed method.
23
References
Wikipedia, Feature selection. http://en.wikipedia.org/wiki/Feature_selection
J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. PNAS, 101(12):4164-4169, 2004.
L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. In Proc. 12th Int. Conf. on Machine Learning (ICML-03), pages 856-863, 2003.
J. Biesiada and W. Duch. Feature selection for high-dimensional data: a Kolmogorov-Smirnov correlation-based filter solution. In Proc. CORES'05, Advances in Soft Computing, Springer, pages 95-104, 2005.
D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.
24
Questions?