![Page 1: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/1.jpg)
A Comprehensive Evaluation of Multicategory Classification
Methods for Microarray Gene Expression Cancer Diagnosis
Presented by: Renikko Alleyne
![Page 2: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/2.jpg)
Outline
• Motivation
• Major Concerns
• Methods
– SVMs
– Non-SVMs
– Ensemble Classification
• Datasets
• Experimental Design
• Gene Selection
• Performance Metrics
• Overall Design
• Results
• Discussion & Limitations
• Contributions
• Conclusions
![Page 3: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/3.jpg)
Why?
Clinical Applications of Gene Expression Microarray Technology:
• Gene Discovery
• Disease Diagnosis (Cancer, Infectious Diseases)
• Drug Discovery
• Prediction of clinical outcomes in response to treatment
![Page 4: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/4.jpg)
GEMS (Gene Expression Model Selector)
• Evaluation of the major algorithms for multicategory classification, gene selection methods, ensemble classifier methods, and 2 cross-validation designs on microarray data
• 11 datasets spanning 74 diagnostic categories, 41 cancer types, and 12 normal tissue types
• Goal: creation of powerful and reliable cancer diagnostic models, equipped with the best classifier, gene selection, and cross-validation methods
![Page 5: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/5.jpg)
Major Concerns
• Previous studies conducted limited experiments in terms of the number of classifiers, gene selection algorithms, number of datasets, and types of cancer involved.
• From those studies it cannot be determined which classifier performs best.
• The best combinations of classification and gene selection algorithms across most array-based cancer datasets are poorly understood.
• Overfitting.
• Underfitting.
![Page 6: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/6.jpg)
Goals for the Development of an Automated System that creates high-quality diagnostic models for use in clinical applications
• Investigate which classifier currently available for gene expression diagnosis performs the best across many cancer types
• How classifiers interact with existing gene selection methods in datasets with varying sample size, number of genes and cancer types
• Whether it is possible to increase diagnostic performance further using meta-learning in the form of ensemble classification
• How to parameterize the classifiers and gene selection procedures to avoid overfitting
![Page 7: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/7.jpg)
Why use Support Vector Machines (SVMs)?
• Achieve superior classification performance compared to other learning algorithms
• Fairly insensitive to the curse of dimensionality
• Efficient enough to handle very large-scale classification problems in both samples and variables
![Page 8: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/8.jpg)
How SVMs Work
• Objects in the input space are mapped using a set of mathematical functions (kernels).
• The mapped objects in the feature (transformed) space are linearly separable, and instead of drawing a complex curve, an optimal line (maximum-margin hyperplane) can be found to separate the two classes.
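As a concrete illustration of this idea (not part of the original slides), here is a minimal scikit-learn sketch in which an RBF kernel maps the inputs into a feature space where a maximum-margin hyperplane can be fit; the toy data and the kernel choice are assumptions.

```python
# Minimal sketch: fitting a kernel SVM with scikit-learn.
# The RBF kernel and the toy data are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))           # 100 samples, 20 features
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # a class that is not linearly separable in input space

clf = SVC(kernel="rbf", C=1.0)           # kernel maps inputs to a feature space
clf.fit(X, y)                            # where a maximum-margin hyperplane separates the classes
print("support vectors:", clf.support_vectors_.shape[0])
```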
![Page 9: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/9.jpg)
SVM Classification Methods
SVMs
Binary SVMs Multiclass SVMs
OVR OVO DAGSVM WW CS
![Page 10: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/10.jpg)
Binary SVMs
• Main idea is to identify the maximum-margin hyperplane that separates training instances.
• Selects a hyperplane that maximizes the width of the gap between the two classes.
• The hyperplane is specified by support vectors.
• New instances are classified according to the side of the hyperplane on which they fall.
(Figure: support vectors and the maximum-margin hyperplane)
![Page 11: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/11.jpg)
1. Multiclass SVMs: one-versus-rest (OVR)
• Simplest MC-SVM
• Constructs k binary SVM classifiers: each class (positive) versus all other classes (negative).
• Computationally Expensive because there are k quadratic programming (QP) optimization problems of size n to solve.
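A minimal sketch of the OVR scheme, assuming scikit-learn as a stand-in for the study's Matlab implementation; the synthetic data are illustrative only.

```python
# Hedged sketch: one-versus-rest with k binary SVMs, using scikit-learn.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=120, n_features=50, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0))  # trains k = 4 binary SVMs
ovr.fit(X, y)
print(len(ovr.estimators_), "binary classifiers")       # one per class
```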
![Page 12: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/12.jpg)
2. Multiclass SVMs: one-versus-one (OVO)
• Involves construction of binary SVM classifiers for all pairs of classes
• A decision function assigns an instance to a class that has the largest number of votes (Max Wins strategy)
• Computationally less expensive
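A corresponding OVO sketch under the same assumptions; with k = 4 classes it builds k(k-1)/2 = 6 pairwise SVMs and predicts by Max Wins voting.

```python
# Hedged sketch: one-versus-one voting ("Max Wins") with scikit-learn.
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=120, n_features=50, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

ovo = OneVsOneClassifier(SVC(kernel="linear"))  # k(k-1)/2 = 6 pairwise SVMs
ovo.fit(X, y)
print(len(ovo.estimators_), "pairwise classifiers")
```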
![Page 13: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/13.jpg)
3. Multiclass SVMs: DAGSVM
• Constructs a decision tree
• Each node is a binary SVM for a pair of classes
• k leaves: k classification decisions
• Each non-leaf node (p, q) has two edges:
  – Left edge: "not p" decision
  – Right edge: "not q" decision
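A rough sketch (my own paraphrase, not the authors' code) of how a DAGSVM-style decision list evaluates a sample: each pairwise node eliminates one candidate class until a single class remains. `pairwise_predict` is a hypothetical stand-in for the trained binary SVMs.

```python
# Illustrative sketch of the DAGSVM decision procedure: walk a list of
# candidate classes, letting each pairwise binary decision eliminate one
# class, until a single class remains.
def dag_predict(x, classes, pairwise_predict):
    """pairwise_predict(x, p, q) returns the winning class, p or q, for sample x."""
    remaining = list(classes)
    while len(remaining) > 1:
        p, q = remaining[0], remaining[-1]     # node (p, q) of the DAG
        winner = pairwise_predict(x, p, q)
        loser = q if winner == p else p        # follow the "not q" or "not p" edge
        remaining.remove(loser)
    return remaining[0]
```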
![Page 14: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/14.jpg)
4 & 5. Multiclass SVMs: Weston & Watkins (WW) and Crammer & Singer (CS)
• Construct a single classifier by maximizing the margin between all the classes simultaneously
• Both require the solution of a single QP problem of size (k−1)n, but the CS MC-SVM uses fewer slack variables in the constraints of the optimization problem, making it computationally less expensive
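For reference, the Weston & Watkins single-machine problem is commonly written as the QP below (my transcription of the standard formulation, not taken from the slides); the CS variant differs mainly in how the slack variables enter the constraints.

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\sum_{m=1}^{k}\lVert w_m\rVert^{2}
  \;+\; C\sum_{i=1}^{n}\sum_{m\neq y_i}\xi_i^{m}
\quad\text{s.t.}\quad
w_{y_i}^{\top}x_i + b_{y_i} \;\ge\; w_m^{\top}x_i + b_m + 2 - \xi_i^{m},
\qquad \xi_i^{m}\ge 0,\;\; m\neq y_i .
```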
![Page 15: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/15.jpg)
Non-SVM Classification Methods
Non-SVMs
KNN NN PNN
![Page 16: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/16.jpg)
K-Nearest Neighbors (KNN)
• For each case to be classified, locate the k closest members of the training dataset.
• A Euclidean distance measure is used to calculate the distance between the training dataset members and the target case.
• The weighted sum of the variable of interest is found for the k nearest neighbors.
• Repeat this procedure for the remaining target-set cases.
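A minimal KNN sketch (scikit-learn assumed, not the study's code); Euclidean distance is the default metric, and distance weighting mirrors the weighted-sum idea above.

```python
# Minimal KNN sketch with scikit-learn.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=120, n_features=30, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5, weights="distance", metric="euclidean")
knn.fit(X, y)
print(knn.predict(X[:3]))
```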
![Page 17: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/17.jpg)
Backpropagation Neural Networks (NNs) & Probabilistic Neural Networks (PNNs)
• Backpropagation Neural Networks:
  – Feed-forward neural networks with signals propagated forward through the layers of units.
  – The unit connections have weights, which are adjusted when there is an error by the backpropagation learning algorithm.
• Probabilistic Neural Networks:
  – Similar in design to NNs, except that the hidden layer is made up of a competitive layer and a pattern layer, and the unit connections do not have weights.
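As a loose illustration of the backpropagation half only (PNNs have no scikit-learn counterpart, and this is not the study's implementation):

```python
# Sketch of a feed-forward network trained by backpropagation
# (scikit-learn's MLPClassifier); data are synthetic placeholders.
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=150, n_features=40, n_classes=3,
                           n_informative=8, random_state=0)
nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
nn.fit(X, y)   # connection weights are adjusted from the output error
print(nn.score(X, y))
```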
![Page 18: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/18.jpg)
Ensemble Classification Methods
To improve performance, the outputs of several classifiers are combined:
• Classifier 1, Classifier 2, …, Classifier N each produce Output 1, Output 2, …, Output N.
• These outputs are combined into a single ensemble classifier.
• Combination techniques: majority voting, decision trees, MC-SVM (OVR, OVO, DAGSVM).
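A hedged sketch of one of the listed techniques, majority voting, using scikit-learn's VotingClassifier; the base classifiers and data are illustrative choices, not those of the study.

```python
# Ensemble classification by majority ("hard") voting over several base classifiers.
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=150, n_features=40, n_classes=3,
                           n_informative=8, random_state=0)
ensemble = VotingClassifier(
    estimators=[("svm", SVC(kernel="linear")),
                ("knn", KNeighborsClassifier(3)),
                ("tree", DecisionTreeClassifier(random_state=0))],
    voting="hard",                     # majority vote over the N outputs
)
ensemble.fit(X, y)
print(ensemble.score(X, y))
```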
![Page 19: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/19.jpg)
Datasets & Data Preparatory Steps
• Nine multicategory cancer diagnosis datasets
• Two binary cancer diagnosis datasets
• All datasets were produced by oligonucleotide-based technology
• The oligonucleotides or genes with absent calls in all samples were excluded from analysis to reduce any noise.
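A sketch of that preparatory filter, assuming Affymetrix-style "A"/"P"/"M" detection calls and random placeholder data rather than the study's datasets:

```python
# Drop probes flagged "A" (absent) in every sample.
import numpy as np

expression = np.random.rand(5000, 60)                      # genes x samples
calls = np.random.choice(["A", "P", "M"], size=(5000, 60))  # placeholder detection calls

keep = ~(calls == "A").all(axis=1)    # keep a gene if it is present in at least one sample
expression_filtered = expression[keep]
print(expression_filtered.shape)
```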
![Page 20: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/20.jpg)
Datasets
![Page 21: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/21.jpg)
Experimental Designs
• Two experimental designs were used to obtain reliable performance estimates and avoid overfitting.
• Data are split into mutually exclusive sets.
• The outer loop estimates performance by training on all splits but one (used for testing).
• The inner loop determines the best parameters of the classifier.
![Page 22: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/22.jpg)
Experimental Designs
• Design I uses stratified 10-fold cross-validation in both loops, while Design II uses 10-fold cross-validation in its inner loop and leave-one-out cross-validation in its outer loop.
• Building the final diagnostic model involves:
  – Finding the best parameters for the classifier using a single loop of cross-validation
  – Building the classifier on all data using the previously found best parameters
  – Estimating a conservative bound on the classifier's accuracy using either design
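A rough Design I analogue in scikit-learn (an assumption; the GEMS implementation was in Matlab): the inner stratified 10-fold loop tunes the classifier parameter, and the outer stratified 10-fold loop estimates performance on splits never used for tuning.

```python
# Nested cross-validation: inner loop for parameter selection, outer loop for estimation.
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=100, n_classes=4,
                           n_informative=12, n_clusters_per_class=1, random_state=0)

inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

tuned_svm = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10]}, cv=inner)
scores = cross_val_score(tuned_svm, X, y, cv=outer)   # outer estimate, untouched by tuning
print(scores.mean())
```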
![Page 23: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/23.jpg)
Gene Selection
Gene selection methods:
• Ratio of genes' between-categories to within-category sum of squares (BW)
• Signal-to-noise scores (S2N), in one-versus-rest (S2N-OVR) and one-versus-one (S2N-OVO) variants
• Kruskal-Wallis non-parametric one-way ANOVA (KW)
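As an illustration, a minimal sketch of the S2N score in its one-versus-rest form, using the standard definition |mean_in − mean_rest| / (std_in + std_rest); details of the paper's implementation may differ.

```python
# One-versus-rest signal-to-noise scores for every gene and class.
import numpy as np

def s2n_ovr(X, y):
    """X: samples x genes, y: integer class labels. Returns classes x genes scores."""
    scores = []
    for c in np.unique(y):
        in_c, rest = X[y == c], X[y != c]
        scores.append(np.abs(in_c.mean(0) - rest.mean(0)) /
                      (in_c.std(0) + rest.std(0) + 1e-12))
    return np.array(scores)

# rank genes by their best score over classes and keep the top 50 (arbitrary cutoff):
# top_genes = np.argsort(s2n_ovr(X, y).max(0))[::-1][:50]
```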
![Page 24: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/24.jpg)
Performance Metrics
• Accuracy
  – Easy to interpret
  – Simplifies statistical testing
  – Sensitive to prior class probabilities
  – Does not describe the actual difficulty of the decision problem for unbalanced distributions
• Relative classifier information (RCI)
  – Corrects for differences in:
    • Prior probabilities of the diagnostic categories
    • Number of categories
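A heavily hedged sketch of an RCI-style metric, under the assumption that it can be computed as the mutual information between true and predicted labels normalized by the entropy of the true labels (0 = uninformative, 1 = perfect); the paper's exact formula may differ.

```python
# RCI-style score: mutual information of the confusion, normalized by class entropy.
import numpy as np
from sklearn.metrics import mutual_info_score

def relative_classifier_information(y_true, y_pred):
    _, counts = np.unique(y_true, return_counts=True)
    p = counts / counts.sum()
    h_true = -(p * np.log(p)).sum()        # entropy of the diagnostic categories
    return mutual_info_score(y_true, y_pred) / h_true

print(relative_classifier_information([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 1]))
```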
![Page 25: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/25.jpg)
Overall Research Design
Stage 1: Conducted a factorial design involving datasets & classifiers without gene selection
Stage 2: Conducted a factorial design with gene selection, using the datasets for which the full gene sets yielded poor performance
2.6 million diagnostic models were generated
One model was selected for each combination of algorithm and dataset
![Page 26: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/26.jpg)
Statistical Comparison Among Classifiers
To test that differences between the best method and the other methods are non-random:
• Null hypothesis (H0): classification algorithm X is as good as algorithm Y.
• Obtain the permutation distribution of the difference Δ_XY by repeatedly rearranging the outcomes of X and Y at random.
• Compute the p-value of Δ_XY being greater than or equal to the observed difference over 10,000 permutations.
• If p < 0.05: reject H0; algorithm X is not as good as Y in terms of classification accuracy.
• If p > 0.05: accept H0; algorithm X is as good as Y in terms of classification accuracy.
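A sketch of the permutation test just described (my paraphrase, not the authors' code): per-sample outcomes of X and Y are randomly swapped, and the p-value is the fraction of permuted differences at least as large as the observed one.

```python
# Paired permutation test on the accuracy difference between algorithms X and Y.
import numpy as np

def permutation_p_value(correct_x, correct_y, n_perm=10000, seed=0):
    """correct_x, correct_y: boolean arrays, one entry per test sample."""
    rng = np.random.default_rng(seed)
    correct_x = np.asarray(correct_x, float)
    correct_y = np.asarray(correct_y, float)
    observed = correct_x.mean() - correct_y.mean()
    count = 0
    for _ in range(n_perm):
        swap = rng.random(correct_x.size) < 0.5          # randomly rearrange X/Y outcomes
        dx = np.where(swap, correct_y, correct_x).mean()
        dy = np.where(swap, correct_x, correct_y).mean()
        if dx - dy >= observed:
            count += 1
    return count / n_perm
```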
![Page 27: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/27.jpg)
Performance Results (Accuracies) without Gene Selection Using Design I
![Page 28: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/28.jpg)
Performance Results (RCI) without Gene Selection Using Design I
![Page 29: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/29.jpg)
Total Time of Classification Experiments w/o Gene Selection for all 11 Datasets and Two Experimental Designs
• Executed in a Matlab R13 environment on 8 dual-CPU workstations connected in a cluster.
• Fastest MC-SVMs: WW & CS
• Fastest overall algorithm: KNN
• Slowest MC-SVM: OVR
• Slowest overall algorithms: NN and PNN
![Page 30: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/30.jpg)
Performance Results (Accuracies) with Gene Selection Using Design I
Applied the 4 gene selection methods to the 4 most challenging datasets
(Chart annotation: improvement by gene selection)
![Page 31: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/31.jpg)
Performance Results (RCI) with Gene Selection Using Design I
Applied the 4 gene selection methods to the 4 most challenging datasets
(Chart annotation: improvement by gene selection)
![Page 32: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/32.jpg)
Discussion & Limitations
• Limitations:
  – Use of the two performance metrics
  – Choice of KNN, PNN and NN classifiers
• Future Research:
  – Improve existing gene selection procedures with selection of the optimal number of genes by cross-validation
  – Apply multivariate Markov blanket and local neighborhood algorithms
  – Extend comparisons with more MC-SVMs as they become available
  – Update the GEMS system to make it more user-friendly
![Page 33: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/33.jpg)
Contributions of Study
• Conducted the most comprehensive systematic evaluation to date of multicategory diagnosis algorithms applied to the majority of multicategory cancer-related human gene expression datasets.
• Created the GEMS system, which automates the experimental procedures in the study in order to:
  – Develop optimal classification models for the domain of cancer diagnosis with microarray gene expression data.
  – Estimate their performance in future patients.
![Page 34: Renikko](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6cc4c4a79593f718b4581/html5/thumbnails/34.jpg)
Conclusions
• MC-SVMs are the best family of algorithms for these types of data and medical tasks; they outperform non-SVM machine learning techniques.
• Among MC-SVM methods, OVR, CS and WW are the best with respect to classification performance.
• Gene selection can improve the performance of both MC-SVM and non-SVM methods.
• Ensemble classification does not further improve the classification performance of the best MC-SVM methods.