1
Effective Feature Selection Framework for
Cluster Analysis of Microarray Data
Gouchol Pok, Computer Science Dept., Yanbian University, China
Keun Ho Ryu, DB/Bioinformatics Lab, Chungbuk Nat'l University, Korea
2
Outline
Background
Motivation
Proposed Method
Experiments
Conclusion
3
Feature Selection
Definition: the process of selecting a subset of relevant features for building robust learning models
Objectives:
Alleviating the effect of the curse of dimensionality
Enhancing generalization capability
Speeding up the learning process
Improving model interpretability
from Wikipedia: http://en.wikipedia.org/wiki/Feature_selection
4
Issues in Feature Selection
How to measure the degree to which a feature is relevant to the class (discrimination)
How to decide whether a selected feature is redundant with other features (strongly correlated)
How to select features so that class-discriminating power is not diminished, and is ideally increased
In short: remove irrelevance, remove redundancy, and maintain class-discriminating power
5
Selection Modes
Univariate method: considers one feature at a time, ranked by a score; measures include correlation, information measures, the K-S statistic, etc. (see the sketch below)
Multivariate method: considers subsets of features together, e.g. Bayesian and PCA-based selection; in principle more powerful than the univariate method, but not always in practice (Guyon2008)
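As an illustration of univariate scoring with the K-S statistic mentioned above, a minimal sketch; the function and variable names are assumptions for illustration, not from the slides:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_rank(X, labels):
    """Rank genes (rows of X) by how well the K-S statistic separates two classes.

    X: genes x samples expression matrix; labels: 0/1 class per sample.
    """
    labels = np.asarray(labels)
    scores = np.array([ks_2samp(row[labels == 0], row[labels == 1]).statistic
                       for row in X])
    return np.argsort(scores)[::-1]   # indices of best-separating genes first
```

Each gene is scored independently of all others, which is exactly why a univariate filter can miss features that discriminate only in combination.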
6
Hard Case in the Univariate Method (Guyon2008*)
*Adapted from Guyon's tutorial at the IPAM summer school
7
Proposed method: Motivation
A method that fits 2-D microarray data in its typical form: thousands of genes (rows) and hundreds of samples (columns)
A multivariate approach: feature relevancy and redundancy are addressed simultaneously
8
System Flow
[Figure: system flow over the gene-expression matrix; genes are rows, samples are columns]
9
System Flow (cont.)
10
Methods: Step1
Perform a column-based difference operation: D_i(N, M) = C(N, M) − C_i(N, 1), i = 1, 2, …, M, where C is the N × M expression matrix (genes × samples) and C_i is its i-th column
The difference operator may depend on the application, e.g. Euclidean or Manhattan distance
D_i(N, M) contains class-specific information w.r.t. each gene (see the sketch below)
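A minimal numpy sketch of the difference operation, under the assumption that it is applied element-wise per gene; matrix sizes and names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 38                       # genes x samples (illustrative sizes)
C = rng.random((N, M))                # stands in for the expression matrix

def column_difference(C, i, metric="manhattan"):
    """D_i(N, M): gene-wise difference between every sample and sample i."""
    col = C[:, [i]]                   # C_i(N, 1), kept 2-D so it broadcasts
    if metric == "manhattan":
        return np.abs(C - col)        # element-wise absolute difference
    return (C - col) ** 2             # squared (Euclidean-style) difference

D = [column_difference(C, i) for i in range(M)]   # one D_i per sample i
```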
11
Methods: Step2
Apply thresholds to find a kind of "emerging pattern" that contrasts the two classes
Suppose samples 1, 2, …, j ∈ C1 and samples j+1, j+2, …, M ∈ C2
Sort the values in each column of D_i(N, M); apply a 25% threshold to the same-class differences and a 75% threshold to the different-class differences (see the sketch below)
[Figure: columns of D_i grouped into C1/C2 blocks, with the 25% and 75% thresholds marked]
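A hedged sketch of Step 2: the binarization rule below (small same-class differences and large cross-class differences both mark a discriminative entry) is my reading of the slide, and the names are illustrative:

```python
import numpy as np

def binarize(D_i, i, labels, lo=0.25, hi=0.75):
    """Turn D_i(N, M) into a 0/1 matrix using per-column quantile thresholds."""
    labels = np.asarray(labels)
    B = np.zeros(D_i.shape, dtype=int)
    for j in range(D_i.shape[1]):
        col = D_i[:, j]
        if labels[j] == labels[i]:    # same-class column: small diffs score 1
            B[:, j] = col <= np.quantile(col, lo)
        else:                         # cross-class column: large diffs score 1
            B[:, j] = col >= np.quantile(col, hi)
    return B
```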
12
Methods: Step3
Extract class-specific features: within-class summation of the binary values, i.e. count the 1's each gene receives from each class's columns (see the sketch below)
[Figure: within-class summation of the binary matrix into C1 and C2 count vectors]
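A minimal sketch of the within-class summation, assuming B is the binary matrix from Step 2 and labels assigns each column to a class; names are illustrative:

```python
import numpy as np

def class_counts(B, labels):
    """S(N, #classes): per-gene count of 1's within each class's columns."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    return np.stack([B[:, labels == c].sum(axis=1) for c in classes], axis=1)
```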
13
Methods: Step4
Gene selection: apply a different threshold value for each class
With the genes selected, the row-wise reduction is done (see the sketch below)
[Figure: per-class thresholds applied to the count vectors to select genes]
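A sketch of the gene-selection rule; whether a gene must pass its own class's threshold or any class's threshold is not specified on the slide, so the union rule below is an assumption:

```python
import numpy as np

def select_genes(S, thresholds):
    """Keep a gene if its count for some class reaches that class's threshold."""
    keep = np.zeros(S.shape[0], dtype=bool)
    for c, t in enumerate(thresholds):
        keep |= S[:, c] >= t          # union over per-class threshold tests
    return np.flatnonzero(keep)       # row indices of the retained genes
```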
14
Methods: Step5
Column-wise reduction by clustering: samples are classified by applying the NMF method
15
Nonnegative Matrix Factorization (NMF)
Matrix factorization: A ≈ VH
A: an n × m matrix of n genes and m samples
V (n × k): the k columns of V are called basis vectors
H (k × m): describes how strongly each building block is present in the measurement vectors
[Figure: A (n × m) ≈ V (n × k) × H (k × m)]
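The slides do not name an NMF implementation; the sketch below uses scikit-learn's NMF as a stand-in to show the A ≈ VH factorization, with illustrative sizes:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((1000, 38))            # n genes x m samples; must be nonnegative

k = 3                                 # number of basis vectors (metagenes)
model = NMF(n_components=k, init="nndsvd", random_state=0, max_iter=500)
V = model.fit_transform(A)            # V: n x k basis matrix
H = model.components_                 # H: k x m encoding matrix
```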
16
NMF: Parts-based Clustering (Brunet2004)
Brunet et al. introduced the metagene concept: each basis vector of V is interpreted as a metagene, and samples are grouped by their dominant metagene
17
Experiments: Datasets
Leukemia Data: 5000 genes, 38 samples in two classes (ALL vs. AML):
19 samples of ALL-B type, 8 samples of ALL-T type, and 11 samples of AML type
Medulloblastoma Data: 5893 genes, 34 samples in two classes:
25 of classic type and 9 of desmoplastic medulloblastoma type
Central Nervous System Tumors Data: 7129 genes, 34 samples in four classes:
10 classic medulloblastomas, 10 malignant gliomas, 10 rhabdoids, and 4 normals
18
Classification
Given a target sample, its class is predicted by the index of the highest value in the k-dimensional column vector of H corresponding to that sample (see the sketch below)
[Figure: A (n × m) ≈ V (n × k) × H (k × m); the column of H for the target sample gives its cluster membership]
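Continuing the NMF sketch above, this prediction rule reduces to a column-wise argmax over H; a minimal sketch:

```python
import numpy as np

def predict(H):
    """Assign each sample to the metagene with the largest coefficient."""
    return np.argmax(H, axis=0)       # H is k x m; returns m cluster labels
```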
19
Results
Leukemia Data (ALL-T vs. ALL-B vs. AML)
20
Results
Medulloblastoma Data (Classic vs. Desmoplastic)
21
Results
Central Nervous System Tumors Data (4 classes)
22
Conclusions & Future work
Our approach captures a group of features, but in contrast to holistic methods such as PCA and ICA, the intrinsic structure of the data distribution is preserved in the reduced space.
Still, PCA and ICA can be used as an aid to inspect the structure of the data distribution, providing useful information for further processing by other methods.
Our ongoing research is on how to combine PCA and ICA with the proposed method.
23
References
Wikipedia, Feature selection. http://en.wikipedia.org/wiki/Feature_selection
J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. PNAS, 101(12):4164-4169, 2004.
L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. In Proc. 12th Int. Conf. on Machine Learning (ICML-03), pages 856-863, 2003.
J. Biesiada and W. Duch. Feature selection for high-dimensional data: a Kolmogorov-Smirnov correlation-based filter solution. In Proc. CORES'05, Advances in Soft Computing, Springer, pages 95-104, 2005.
D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.
24
Questions?