Knowledge-based Analysis of Microarray Gene Expression Data using Support Vector Machines
Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Sugnet, Terrence S. Furey, Manuel Ares, Jr., David Haussler
Proceedings of the National Academy of Sciences. 2000
Overview
• Objective: Classify genes based on functionality
• Observation: Genes of similar function yield similar expression patterns in microarray hybridization experiments
• Method: Use SVMs to build classifiers from microarray gene expression data
Previous Methods
• Most methods at the time of publication employ unsupervised learning
• Genes are grouped using clustering algorithms based on a distance measure
  • Hierarchical clustering
  • Self-organizing maps
DNA Microarray Data
• Each data point represents the ratio of expression levels of a particular gene in an experimental condition and a reference condition
• n genes on a single chip
• m experiments performed
• The result is an n by m matrix of expression-level ratios
[Figure: n × m expression matrix. Rows are the n genes, columns are the m experiments; each row is the m-element expression vector for a single gene.]
DNA Microarray Data
Normalized logarithmic ratio
• For gene X, in experiment i, define:
  • $E_i$ is the expression level in the experiment
  • $R_i$ is the expression level in the reference state
  • $X_i$ is the normalized logarithmic ratio of $E_i$ to $R_i$; the gene's expression vector is $X = (X_1, X_2, \ldots, X_m)$
• $X_i$ is positive when the gene is induced (turned up)
• $X_i$ is negative when the gene is repressed (turned down)
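The normalization itself is not written out on the slide. The sketch below assumes the definition used in the original paper, where each log ratio log(E_i / R_i) is divided by the Euclidean length of the gene's whole log-ratio vector, so every gene ends up with a unit-length expression vector. The function name and the sample measurements are hypothetical.

```python
import math

def normalized_log_ratios(expression, reference):
    """Return the unit-length vector of log(E_i / R_i) values for one gene.

    Assumed definition: X_i = log(E_i / R_i) / sqrt(sum_j log(E_j / R_j)^2).
    """
    logs = [math.log(e / r) for e, r in zip(expression, reference)]
    norm = math.sqrt(sum(v * v for v in logs))
    if norm == 0.0:
        # The gene never departs from the reference level.
        return [0.0 for _ in logs]
    return [v / norm for v in logs]

# Hypothetical gene measured in three experiments against a reference of 1.0:
# induced in the first, unchanged in the second, repressed in the third.
x = normalized_log_ratios([4.0, 1.0, 0.5], [1.0, 1.0, 1.0])
print(x)
```

The sign of each entry follows the rule above: positive where the gene is induced, negative where it is repressed, zero where it is unchanged.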
Support Vector Machines
* Edda Leopold† and Jörg Kindermann
• Searches for a hyperplane that
  • Maximizes the margin
  • Minimizes the violation of the margin
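The two goals can be made concrete for a given hyperplane w·x + b = 0. The sketch below assumes the standard soft-margin formulation, which the slides do not spell out: the margin width is 2/||w||, each point's violation is the slack max(0, 1 - y(w·x + b)), and the trade-off parameter C is an assumption. All names and data points are hypothetical.

```python
import math

def margin_width(w):
    """Distance between the two margin boundaries w.x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def margin_violations(w, b, points, labels):
    """Slack max(0, 1 - y * (w.x + b)) for each labelled point."""
    slacks = []
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        slacks.append(max(0.0, 1.0 - y * score))
    return slacks

def soft_margin_objective(w, b, points, labels, C=1.0):
    """Quantity a soft-margin SVM minimizes: small ||w|| and small total slack."""
    total_slack = sum(margin_violations(w, b, points, labels))
    return 0.5 * sum(wi * wi for wi in w) + C * total_slack

# Hypothetical 2-D points; the last negative point sits on the wrong side.
points = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

print(margin_width(w))
print(margin_violations(w, b, points, labels))
print(soft_margin_objective(w, b, points, labels))
```

Only the misplaced point contributes slack; points that lie outside the margin on the correct side contribute nothing.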
Linear Inseparability
• What if data points are not linearly separable?
* Andrew W. Moore
Linear Inseparability
• Map the data to a higher-dimensional space
* Andrew W. Moore
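A toy illustration of the mapping idea, not taken from the slides: six points on a line, with the outer ones positive and the inner ones negative, cannot be split by any single threshold, but become separable once each point x is mapped to (x, x²). The data, the feature map, and the function names are all hypothetical.

```python
def feature_map(x):
    """Map a 1-D point to 2-D by appending its square."""
    return (x, x * x)

def separates(w, b, points, labels):
    """True if the sign of w.p + b matches the label of every point."""
    return all(
        y * (sum(wi * pi for wi, pi in zip(w, p)) + b) > 0
        for p, y in zip(points, labels)
    )

xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, 1, 1]  # outer points positive, inner points negative

# In 1-D a linear classifier is a threshold t with an orientation s.
# These thresholds cover every distinct way of splitting the six points.
thresholds = [-3.5, -2.5, -1.5, 0.0, 1.5, 2.5, 3.5]
separable_1d = any(
    separates((s,), -s * t, [(x,) for x in xs], ys)
    for t in thresholds
    for s in (1.0, -1.0)
)

# In the mapped space the horizontal line x^2 = 2.5 splits the classes.
mapped = [feature_map(x) for x in xs]
separable_2d = separates((0.0, 1.0), -2.5, mapped, ys)

print(separable_1d, separable_2d)
```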
Linear Inseparability
Problems with mapping data to a higher-dimensional space:
1. Overfitting
   • SVM chooses the maximum-margin hyperplane, which helps guard against overfitting
2. High computational cost
   • SVM kernels only involve dot products between points (cheap!)
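The second point can be checked numerically. For 2-D inputs, the degree-2 polynomial kernel (X·Y + 1)² equals an ordinary dot product in a 6-dimensional feature space, yet it is computed from the 2-D dot product alone. The explicit feature map below is the standard expansion for this kernel; the function names and the sample vectors are hypothetical.

```python
import math

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel computed directly in the input space."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** 2

def poly2_features(x):
    """Explicit 6-D feature map whose dot product equals poly2_kernel (2-D input only)."""
    x1, x2 = x
    r = math.sqrt(2.0)
    return (1.0, r * x1, r * x2, x1 * x1, x2 * x2, r * x1 * x2)

x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly2_kernel(x, y)
explicit = sum(a * b for a, b in zip(poly2_features(x), poly2_features(y)))
print(implicit, explicit)
```

The kernel never builds the 6-D vectors, which is why the cost stays close to that of a plain dot product even as the implied feature space grows.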
SVM Kernels
• $K(X, Y)$ is a function that calculates a measure of similarity between X and Y
• Dot product
  • $K(X, Y) = X \cdot Y$
  • Simplest kernel; gives a linear hyperplane
• Degree-d polynomial
  • $K(X, Y) = (X \cdot Y + 1)^d$
• Gaussian
  • $K(X, Y) = \exp\left(-\|X - Y\|^2 / 2\sigma^2\right)$
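The three kernels translate directly into code. This is a minimal sketch in plain Python; the width parameter of the Gaussian kernel is written `sigma` here, and the `gram_matrix` helper and sample vectors are hypothetical additions for illustration.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    """Plain dot product."""
    return dot(x, y)

def polynomial_kernel(x, y, d):
    """Degree-d polynomial kernel (x.y + 1)^d."""
    return (dot(x, y) + 1.0) ** d

def gaussian_kernel(x, y, sigma):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def gram_matrix(kernel, vectors):
    """Pairwise similarities between all vectors under the given kernel."""
    return [[kernel(u, v) for v in vectors] for u in vectors]

genes = [(1.0, 2.0), (3.0, 4.0), (0.0, 0.0)]
K = gram_matrix(lambda u, v: polynomial_kernel(u, v, 2), genes)
print(K)
```

A kernel SVM only ever sees the data through such a matrix of pairwise similarities, so swapping kernels changes the decision surface without changing the training procedure.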
Experimental Dataset
• Expression data from the budding yeast
  • 2467 genes (n)
  • 79 experiments (m)
  • Dataset available on Stanford web site
• Six functional classes
  • From the Munich Information Centre for Protein Sequences Yeast Genome Database
  • Class definitions come from biochemical and genetic studies
• Training data
  • Positive labels: set of genes that have a common function
  • Negative labels: set of genes known not to be members of this functional class
Experimental Design
• Compare the performance of
  • SVM (degree 1 kernel, i.e. linear)
  • SVM (degree 2 kernel)
  • SVM (degree 3 kernel)
  • SVM (Gaussian kernel)
  • Parzen windows
  • Fisher's linear discriminant
  • C4.5 decision trees
  • MOC1 decision trees
Experimental Design
• Define the cost of method M: $C(M) = fp(M) + 2 \cdot fn(M)$
  • False negatives are weighted higher because negative examples far outnumber positive ones
• The cost of each method is compared to $C(N)$, the cost of classifying everything as negative
• The cost saving of method M is $S(M) = C(N) - C(M)$
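A short sketch of the cost arithmetic. Classifying everything as negative produces no false positives and one false negative per positive gene, so C(N) is twice the number of positives. The function names and the counts in the example are made up for illustration and are not results from the paper.

```python
def cost(fp, fn):
    """C(M) = fp(M) + 2 * fn(M): a false negative costs twice a false positive."""
    return fp + 2 * fn

def cost_savings(fp, fn, n_positives):
    """S(M) = C(N) - C(M), where N labels every gene as negative."""
    baseline = cost(fp=0, fn=n_positives)
    return baseline - cost(fp, fn)

# Made-up counts: a class with 17 positive genes, a method that produces
# 4 false positives and misses 5 of the positives.
s = cost_savings(fp=4, fn=5, n_positives=17)
print(s)  # 34 - 14 = 20
```

A negative saving means the method did worse than labelling every gene as negative.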
Experimental Results
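| Class | SVM (d=1) | SVM (d=2) | SVM (d=3) | SVM (Gauss) | Parzen Windows | Fisher's LD | C4.5 | MOC1 |
|-------|-----------|-----------|-----------|-------------|----------------|-------------|------|------|
| TCA   | 6         | 9         | 12        | 11          | 6              | 5           | -7   | -1   |
| Resp  | 31        | 39        | 38        | 33          | 18             | 30          | 8    | -4   |
| Ribo  | 224       | 229       | 229       | 226         | 220            | 217         | 169  | 164  |
| Prot  | 35        | 48        | 51        | 52          | 39             | 39          | 33   | 26   |
| Hist  | 18        | 18        | 18        | 18          | 14             | 16          | 16   | 10   |
| HTH   | -56       | -3        | -1        | 0           | -14            | -14         | -2   | -6   |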
• SVMs outperform other methods
• All classifiers fail to recognize the HTH proteins
  • This is expected
  • Members of this class are not “similarly regulated”
Consistently Misclassified Genes
• 20 genes are consistently misclassified by all 4 SVM kernels, in different experiments
• These errors reflect a difference between the expression data and class definitions based on protein structure
• Many of the false positives are known to be important for the functional class (even though they are not included as part of the class)