Jochen Jäger
University of Washington
Department of Computer Science
Advisors: Larry Ruzzo, Rimli Sengupta
Improved gene selection in microarrays by combining clustering and statistical techniques
Motivation
• Think of a complicated question:
• Will it be sunny tomorrow?
• How can you answer it correctly if you DO NOT know the answer?
• Ask around or better, make a poll
Majority vote
• Student: I heard it is supposed to be sunny
• TV: partly sunny
• Yourself: Considering the past few days and looking outside I would guess it will rain
• Weather.com: partly cloudy with scattered showers
• Result: 2 (sunny) : 2 (not sunny)
• Better: Use weights
• Idea: remove redundant answers as well
Outline
• Motivating example
• Biological background
• Problem statement
• Current solution
• Proposed attack
• Results
• Future work
Biological task
• Find informative genes (e.g. genes which can discriminate between cancer and normal)
• Use series of microarrays
• Compare results from different tissues
Microarrays
[Diagram: DNA → select genes → spot genes; cell tissue → extract cDNA → label cDNA; annealing phase]
Finding informative genes
• Microarrays from different tissues: cancerous and normal
Current solution
• Use a test statistic on all genes
• Rank them
• Select top k
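| Gene | Tumor 1 | Tumor 2 | Tumor 3 | Normal 1 | Normal 2 | Normal 3 | t-test P-value |
|------|---------|---------|---------|----------|----------|----------|----------------|
| A | 80 | 72 | 85 | 50 | 44 | 15 | 0.0448836 |
| B | 80 | 72 | 85 | 50 | 44 | 51 | 0.0048027 |
| C | 71 | 53 | 62 | 57 | 64 | 70 | 0.8024078 |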
$$t = \frac{\bar{x}_{tumor} - \bar{x}_{normal}}{\sqrt{\dfrac{\sigma^2_{tumor}}{n_{tumor}} + \dfrac{\sigma^2_{normal}}{n_{normal}}}}$$
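The rank-and-select procedure on this slide can be sketched in a few lines. This is a minimal illustration, not the author's code: it computes the unequal-variance t-statistic for genes A, B and C from the table and ranks by |t| instead of by P-value, so that only the Python standard library is needed. The function names `t_statistic` and `top_k` are made up for this example.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(tumor, normal):
    """Unequal-variance t-statistic between two groups of expression values."""
    se = sqrt(variance(tumor) / len(tumor) + variance(normal) / len(normal))
    return (mean(tumor) - mean(normal)) / se

# Expression values of genes A, B and C from the slide's table.
genes = {
    "A": ([80, 72, 85], [50, 44, 15]),
    "B": ([80, 72, 85], [50, 44, 51]),
    "C": ([71, 53, 62], [57, 64, 70]),
}

def top_k(genes, k):
    """Score every gene independently, rank by |t| (largest first), keep the top k."""
    ranked = sorted(genes, key=lambda g: abs(t_statistic(*genes[g])), reverse=True)
    return ranked[:k]

ranking = top_k(genes, k=3)
print(ranking)
```

For these three genes the |t| ranking agrees with the ordering of the P-values in the table (B, then A, then C). In general the two orderings can differ slightly, because the degrees of freedom of this test vary from gene to gene.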
Problem with current solution
• Each gene is scored independently
• Top k ranking genes might be very similar and therefore add no additional information
• Reason: genes in similar pathways probably all have very similar scores
• What happens if several pathways are involved in the perturbation but one has the main influence?
• It may be possible to describe this pathway with fewer genes
Problem of redundancy
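| Accession Number | Adenoma 1 | Adenoma 2 | Adenoma 3 | Adenoma 4 | Normal 1 | Normal 2 | Normal 3 | Normal 4 | t-test P-value |
|------------------|-----------|-----------|-----------|-----------|----------|----------|----------|----------|----------------|
| AF001548 | 54.55 | 43.93 | 55.69 | 28.47 | 1354.36 | 1565.42 | 1459.48 | 1612.85 | 0.00012 |
| M12125 | 35.9 | 46.64 | 35.73 | 35.27 | 642.46 | 577.81 | 580.5 | 707.35 | 0.00028 |
| X13839 | 46.16 | 47.72 | 26.79 | 17 | 652.66 | 653.14 | 546.12 | 720.43 | 0.0003 |
| X15882 | 13.52 | 15.73 | 27.32 | 16.15 | 209.3 | 209.64 | 221.24 | 267.43 | 0.0004 |
| AB002533 | 659.25 | 958.82 | 812.77 | 786.24 | 407.91 | 558.33 | 529.68 | 379.84 | 0.00557 |
| M93651 | 40.1 | 54.77 | 39.93 | 40.37 | 8.74 | 21.07 | 14.45 | 32.94 | 0.01038 |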
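Pairwise correlation between the genes:

| | AF001548 | M12125 | X13839 | X15882 | AB002533 | M93651 |
|----------|----------|--------|--------|--------|----------|--------|
| AF001548 | 1 | | | | | |
| M12125 | 0.99 | 1 | | | | |
| X13839 | 0.991 | 0.996 | 1 | | | |
| X15882 | 0.992 | 0.995 | 0.988 | 1 | | |
| AB002533 | -0.87 | -0.898 | -0.891 | -0.888 | 1 | |
| M93651 | -0.8 | -0.802 | -0.789 | -0.776 | 0.808 | 1 |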
Top 3 genes highly correlated!
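The redundancy can be checked directly from the expression values. The sketch below recomputes the Pearson correlation for three of the listed genes and then applies a simple greedy filter that skips any gene too strongly correlated with one already kept. The filter and its `threshold=0.95` are assumptions made for illustration; they are not the clustering approach proposed later in the talk.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Expression values (4 adenoma, 4 normal) for three genes from the table.
expr = {
    "AF001548": [54.55, 43.93, 55.69, 28.47, 1354.36, 1565.42, 1459.48, 1612.85],
    "M12125": [35.9, 46.64, 35.73, 35.27, 642.46, 577.81, 580.5, 707.35],
    "AB002533": [659.25, 958.82, 812.77, 786.24, 407.91, 558.33, 529.68, 379.84],
}

def drop_redundant(ranked, expr, threshold=0.95):
    """Walk the ranked list; keep a gene only if |r| with every kept gene is below threshold."""
    kept = []
    for gene in ranked:
        if all(abs(pearson(expr[gene], expr[other])) < threshold for other in kept):
            kept.append(gene)
    return kept

kept = drop_redundant(["AF001548", "M12125", "AB002533"], expr)
print(kept)
```

With this hypothetical threshold, M12125 is skipped because it is almost perfectly correlated with AF001548, while AB002533 is kept.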
Proposed solution
• Several possible approaches
– next neighbors
– correlation
– Euclidean distance
• Approach: instead use clustering
• Advantages of using clustering techniques
– natural embedding
– many different distance functions possible
– different shapes, models possible
Hard clustering – k-means
Randomly assign a cluster to each point
Find centroids
Reassign points to nearest center
Iterate until convergence
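The k-means loop on this slide can be sketched as follows. This is a minimal standard-library version that uses squared Euclidean distance; the handling of empty clusters (reseeding with a random point) and the `seed` parameter are assumptions added to keep the example runnable, not details taken from the talk.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    # Randomly assign a cluster to each point.
    labels = [rng.randrange(k) for _ in points]
    centers = []
    for _ in range(max_iter):
        # Find the centroid of each cluster.
        centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if not members:
                # Assumption: an empty cluster is reseeded with a random point.
                members = [rng.choice(points)]
            centers.append([sum(p[d] for p in members) / len(members) for d in range(dim)])
        # Reassign every point to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Iterate until the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels)
```

For gene selection the points would be gene expression profiles, so that genes with similar profiles end up in the same cluster.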
Soft - Fuzzy Clustering
Instead of a hard assignment, each point gets a membership probability for each cluster
Very similar to k-means, but a fuzzy softness factor m (between 1 and infinity) determines how hard the assignment is: m close to 1 gives nearly hard assignments, larger m gives softer ones
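The slide does not spell out the membership formula. The sketch below uses the standard fuzzy c-means update, in which the membership of a point in cluster i is 1 / Σ_k (d_i / d_k)^(2/(m-1)), on the assumption that this is the variant meant here. The function name `fuzzy_memberships` is illustrative.

```python
from math import dist

def fuzzy_memberships(point, centers, m=1.3):
    """Membership of one point in each cluster for softness factor m > 1."""
    dists = [dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The point coincides with one or more centers: split membership among them.
        hits = [d == 0.0 for d in dists]
        return [1.0 / sum(hits) if h else 0.0 for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]

centers = [(0.0, 0.0), (4.0, 0.0)]
soft = fuzzy_memberships((1.0, 0.0), centers, m=2.0)    # distances 1 and 3
hard = fuzzy_memberships((1.0, 0.0), centers, m=1.05)   # close to a k-means assignment
print(soft, hard)
```

The memberships of a point always sum to 1. Lowering m towards 1 pushes them towards 0 and 1, which is the progression the following "Fuzzy softness" slides step through.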
Fuzzy examples
Notterman's carcinoma dataset:
18 colon adenocarcinoma and 18 normal tissues
data from 7457 genes and ESTs
cluster all 36 tissues
Fuzzy softness 1.3
Fuzzy softness 1.25
Fuzzy softness 1.2
Fuzzy softness 1.15
Fuzzy softness 1.05
Selecting genes from clusters
• Two way filter: exclude redundant genes, select informative genes
• Get as many pathways as possible
• Consider cluster size and quality as well as discriminative power
How many genes per cluster?
• Constraints:
– minimum: one gene per cluster
– maximum: as many as possible
• Take genes proportionally to cluster quality and size of cluster
• Take more genes from bad clusters
• Smaller quality value indicates tighter cluster
• Quality for k-means: sum of intra-cluster distances
• Quality for fuzzy c-means: average cluster membership probability
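The slide gives the constraints but no exact allocation formula, so the following is only one possible reading. It assumes a cluster's weight is size × quality, where a larger quality value means a looser ("bad") cluster, starts every cluster at one gene, and hands out the remaining genes greedily. The function `genes_per_cluster` and the weighting are hypothetical.

```python
def genes_per_cluster(sizes, qualities, total):
    """Split `total` genes across clusters, at least one each, roughly
    proportional to size * quality (assumed weighting, not from the slides)."""
    k = len(sizes)
    if not k <= total <= sum(sizes):
        raise ValueError("total must lie between the cluster count and the gene count")
    weights = [s * q for s, q in zip(sizes, qualities)]
    alloc = [1] * k  # minimum: one gene per cluster
    for _ in range(total - k):
        # Only clusters that still have unused genes can receive another one.
        open_clusters = [i for i in range(k) if alloc[i] < sizes[i]]
        # Give the next gene to the cluster with the most weight per allocated gene.
        best = max(open_clusters, key=lambda i: weights[i] / alloc[i])
        alloc[best] += 1
    return alloc

alloc = genes_per_cluster(sizes=[10, 10, 10], qualities=[1.0, 2.0, 5.0], total=8)
print(alloc)
```

In this example the loosest cluster receives the most genes, while the tightest one is represented by a single gene.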
Which genes to pick?
• Choices:
– Genes closest to center
– Genes farthest away
– Sample according to probability function
– Genes with best discriminative power
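Three of these choices reduce to sorting the genes of one cluster by a different key. The sketch below illustrates that idea; the function `pick_genes`, its arguments and the toy values are hypothetical, and the sampling variant is left out.

```python
from math import dist

def pick_genes(members, n, strategy, center=None, scores=None):
    """Pick n genes from one cluster.

    members: gene name -> expression vector
    center:  cluster center (needed for 'closest' and 'farthest')
    scores:  gene name -> test statistic, larger = more discriminative
    """
    if strategy == "closest":
        key, reverse = (lambda g: dist(members[g], center)), False
    elif strategy == "farthest":
        key, reverse = (lambda g: dist(members[g], center)), True
    elif strategy == "best_score":
        key, reverse = (lambda g: scores[g]), True
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(members, key=key, reverse=reverse)[:n]

# Toy cluster with made-up values.
members = {"g1": (0.0, 0.0), "g2": (1.0, 0.0), "g3": (5.0, 0.0)}
scores = {"g1": 1.2, "g2": 7.0, "g3": 3.7}
closest = pick_genes(members, 2, "closest", center=(0.0, 0.0))
farthest = pick_genes(members, 2, "farthest", center=(0.0, 0.0))
best = pick_genes(members, 2, "best_score", scores=scores)
print(closest, farthest, best)
```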
Comparison Evaluation
microarray data: n examples with m expression levels each
Repeat for each of the n examples:
– leave out one sample (test data); the rest is train data
– extract features
– train learner
– apply same feature extraction to left-out sample
– classify held-out sample
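The evaluation loop can be sketched as below. The key point is that feature extraction runs inside the loop on the training data only, and the same features are then read from the left-out sample. The learner here is a nearest-centroid classifier and the feature selector ranks by difference of class means; both are stand-ins chosen to avoid external dependencies, whereas the talk uses an SVM and the clustering-based selection. All function names and the toy data are hypothetical.

```python
from statistics import mean

def loocv_error(samples, labels, select_features, train, predict):
    """Fraction of left-out samples that are misclassified."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # Feature extraction sees only the training data.
        features = select_features(train_x, train_y)
        model = train([[row[j] for j in features] for row in train_x], train_y)
        # The same features are then taken from the left-out sample.
        held_out = [samples[i][j] for j in features]
        errors += predict(model, held_out) != labels[i]
    return errors / len(samples)

def top_mean_difference(k):
    """Stand-in selector: the k features with the largest gap between class means."""
    def select(xs, ys):
        first, second = sorted(set(ys))
        def gap(j):
            a = mean(x[j] for x, y in zip(xs, ys) if y == first)
            b = mean(x[j] for x, y in zip(xs, ys) if y == second)
            return abs(a - b)
        return sorted(range(len(xs[0])), key=gap, reverse=True)[:k]
    return select

def train_centroids(xs, ys):
    """Stand-in learner: one centroid per class."""
    return {c: [mean(col) for col in zip(*[x for x, y in zip(xs, ys) if y == c])]
            for c in set(ys)}

def predict_centroid(model, x):
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, model[c])))

# Toy data: feature 0 separates the classes, feature 1 is noise.
samples = [[80, 5], [72, 9], [85, 1], [50, 8], [44, 2], [51, 6]]
labels = ["tumor"] * 3 + ["normal"] * 3
err = loocv_error(samples, labels, top_mean_difference(1), train_centroids, predict_centroid)
print(err)
```

Selecting features once on all n samples before the loop would leak information from the held-out sample into the selection and bias the error estimate downwards.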
Support Vector machines
• Find separating hyperplane with maximal distance to closest training example
• Advantages:
– avoids overfitting
– can handle higher order interactions and noise using kernel functions and soft margin
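For reference, the usual soft-margin formulation of this idea is given below. The symbols are standard textbook notation rather than taken from the slides: $w$ and $b$ define the hyperplane, $y_i \in \{-1, +1\}$ are the class labels, $\xi_i$ are slack variables, and $C$ trades margin width against training errors.

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

Maximizing the distance to the closest training example corresponds to minimizing $\lVert w \rVert$. The slack terms are the soft margin, and replacing the dot product $w \cdot x_i$ by a kernel function allows non-linear decision boundaries.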
Experimental setup
• Datasets:
– Alon's Colon (40 tumor and 22 normal colon adenocarcinoma tissue samples)
– Golub's Leukemia (47 ALL, 25 AML)
– Notterman's Carcinoma and Adenoma (18 adenocarcinoma, 4 adenomas and paired normal tissue)
• Experimental setup:
– calculate LOOCV using SVM on feature subsets
– do this for feature size 10-100 (in steps of 10) and 1-30 clusters
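The parameter sweep amounts to a grid over two settings. A small sketch of enumerating it follows; the names `feature_sizes`, `cluster_counts` and `grid` are illustrative, and each pair would be passed to one LOOCV run.

```python
feature_sizes = list(range(10, 101, 10))   # 10, 20, ..., 100
cluster_counts = list(range(1, 31))        # 1, 2, ..., 30

# One LOOCV run per (feature size, number of clusters) combination.
grid = [(f, c) for f in feature_sizes for c in cluster_counts]
print(len(grid))
```

This gives 300 configurations per dataset and clustering method.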
Results
fuzzy c-means vs k-means
Different test-statistics
Comparing best results
How about randomly choosing?
Related work
• Tusher, Tibshirani and Chu (2001): Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98: 5116-5121, 2001.
• Ben-Dor, A., L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini (2000): Tissue classification with gene expression profiles. In Proceedings of the fourth annual international conference on computational molecular biology, pp. 54-64.
• Park, P.J., Pagano, M., Bonetti, M.: A nonparametric scoring algorithm for identifying informative genes from microarray data. Pac Symp Biocomput: 52-63, 2001.
• Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh M, Downing JR, Caligiuri MA, Bloomfield CD, and Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531-537, 1999.
• J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik: Feature selection for SVMs. In Sara A Solla, Todd K Leen, and Klaus-Robert Muller, editors, Advances in Neural Information Processing Systems 13. MIT Press, 2001.
Future work
• Problem: how to find the best parameters (model selection, model based clustering, BIC)
• Combine good solutions
• Incorporate overall cluster discriminative power into quality score
• Use of non-integer error score
• ROC analysis
Summary
• Used clustering as a pre-filter for feature selection in order to get rid of redundant data
• Defined a quality measurement for clustering techniques
• Incorporated cluster quality, size and statistical property into feature selection
• Improved LOOCV error for almost all feature sizes and different related tests
Result Notterman
Result Golub
Result Alon
Result Alon 2