Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay...
![Page 1: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/1.jpg)
Predictive Analysis of
Gene Expression Data from
Human SAGE Libraries
Alexessander Alves* Nikolay Zagoruiko+ Oleg Okun§
Olga Kutnenko+ Irina Borisova+
* University of Porto, Portugal
+ Russian Academy of Sciences, Russia
§ University of Oulu, Finland
![Page 2: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/2.jpg)
Outline
1. Goals
2. Background
3. SAGE Data
4. Gene Expression Data
5. Feature Selection
6. GRAD
7. Experiments
8. Conclusions
![Page 3: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/3.jpg)
Goal
Predictive Analysis:
• Feature Selection Methods in Bioinformatics and Machine Learning
• Cancer Classification
![Page 4: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/4.jpg)
Background
Genes code for proteins and other large biomolecules
Genes are expressed in a two-step process (Central Dogma of Biology)
Several technologies measure transcription: SAGE, microarrays…
Central Dogma of Biology
Gene Expression Process
1- Transcribed into an RNA Sequence
2- Translated into a protein
Molla et al, 2003
![Page 5: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/5.jpg)
SAGE DATA
Advantages:
• Compare samples between different organs and patients (no normalisation required)
• Collects the complete gene expression profile of a cell/tissue without prior knowledge of the mRNA to be profiled
![Page 6: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/6.jpg)
SAGE DATA
Drawbacks:
• Very expensive to collect data using the SAGE method
• Very few examples (as a consequence)
![Page 7: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/7.jpg)
GENE EXPRESSION DATA
Challenges posed to Machine Learning
• Number of genes dramatically exceeds the number of examples
• Curse of dimensionality (the data are not dense enough to estimate the model accurately)
• Over-fitting (higher probability of finding chance relationships among data attributes)
![Page 8: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/8.jpg)
Feature Selection
Methods to remove irrelevant and redundant genes:
• Wrapper
– Fit a classifier to a subset of the data and use its classification accuracy to drive the search for relevant genes (e.g. C4.5 accuracy)
• Filtering
– Use a function to assess the goodness of a subset of genes (e.g. Euclidean distance, entropy, correlation, etc.)
Problem Complexity
• O(2^n), where n is the number of genes
• Smaller dataset: n = 822
• O(2^n) ≈ 2.8 × 10^247: intractable using a simple exhaustive search
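As a quick check of the search-space size, the number of gene subsets for n = 822 can be computed exactly with Python's arbitrary-precision integers. This is only an illustrative calculation; the variable names are not from the slides.

```python
# Count all subsets of n genes (including the empty set): 2**n.
n_genes = 822
n_subsets = 2 ** n_genes

# Python integers are exact, so the digit count gives the order of magnitude.
n_digits = len(str(n_subsets))

# Scientific notation, obtained by formatting the integer as a float.
approx = f"{n_subsets:.1e}"

print(n_digits, approx)
```

Even at a billion subset evaluations per second, a count of this size rules out enumerating every subset, which is why the methods on these slides rely on heuristic search.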
![Page 9: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/9.jpg)
Gene Selection in Bioinformatics
Filtering is usually preferred because it is computationally less expensive
Several works on classification select genes with:
• Wilcoxon test
• t-test
• Additionally, they remove genes with low entropy, variability, or absolute expression level
Cons
• Redundancy
• Unaware of interdependency
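A minimal sketch of a filter of this kind, assuming a two-class problem and using Welch's t-statistic as the per-gene score (the slides name the t-test but not the variant). The function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples of expression values."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_genes(expr, labels):
    """Rank gene indices by |t| between class 1 and class 0 samples.

    expr: list of samples, each a list of per-gene expression values.
    labels: list of 0/1 class labels, one per sample.
    """
    scores = []
    for g in range(len(expr[0])):
        pos = [s[g] for s, y in zip(expr, labels) if y == 1]
        neg = [s[g] for s, y in zip(expr, labels) if y == 0]
        scores.append((abs(welch_t(pos, neg)), g))
    return [g for _, g in sorted(scores, reverse=True)]


# Toy data (made up): gene 0 separates the classes, gene 1 is noise,
# and gene 2 is an exact copy of gene 0, so it adds no new information.
expr = [[9, 3, 9], [8, 5, 8], [10, 4, 10], [1, 4, 1], [2, 5, 2], [0, 3, 0]]
labels = [1, 1, 1, 0, 0, 0]
ranking = rank_genes(expr, labels)
print(ranking)
```

Because each gene is scored on its own, the duplicated gene ranks as highly as the original. This is the redundancy problem listed under "Cons".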
![Page 10: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/10.jpg)
Our Proposals
Study Bioinformatics Filtering Techniques
Compare with Machine Learning Algorithms
• Avoid Redundancy
• Consider interdependency and lowly expressed genes
Introduce a new Filtering Algorithm GRAD
![Page 11: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/11.jpg)
GRAD
Search Strategy
1. Use exhaustive search to form informative groups of attributes (“granules”)
2. Use AdDel to choose subsets of granules
• AdDel: a combination of forward sequential search (FSS) and backward sequential search (BSS)
• The number of attributes to include in a subset is estimated by the algorithm
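The combination of forward and backward search can be sketched as follows. This is not the authors' implementation: the step sizes `n_add` and `n_del`, the stopping rule `max_rounds`, and the toy quality function are all assumptions made for illustration.

```python
def addel(features, quality, n_add=2, n_del=1, max_rounds=10):
    """Alternate greedy forward (add) and backward (delete) steps.

    quality: callable scoring a frozenset of features (higher is better).
    Returns the best-scoring subset seen during the search and its score.
    """
    current = frozenset()
    best, best_q = current, quality(current)

    def consider(subset):
        nonlocal best, best_q
        q = quality(subset)
        if q > best_q:
            best, best_q = subset, q

    for _ in range(max_rounds):
        # Forward steps: add the feature that gives the highest quality.
        for _ in range(n_add):
            candidates = [current | {f} for f in features if f not in current]
            if not candidates:
                break
            current = max(candidates, key=quality)
            consider(current)
        # Backward steps: drop the feature whose removal hurts the least.
        for _ in range(n_del):
            if len(current) <= 1:
                break
            current = max([current - {f} for f in current], key=quality)
            consider(current)
    return best, best_q


def toy_quality(subset):
    """Made-up score: 'a' and 'b' are only useful together."""
    pair_bonus = 2.0 if {"a", "b"} <= subset else 0.0
    single_bonus = 0.5 if "c" in subset else 0.0
    return pair_bonus + single_bonus - 0.3 * len(subset)


best, best_q = addel(["a", "b", "c", "d"], toy_quality)
print(sorted(best), round(best_q, 2))
```

In this toy run the search recovers the pair `a`, `b` even though neither scores well alone, and the size of the returned subset is an outcome of the search rather than an input.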
![Page 12: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/12.jpg)
GRAD
Algorithm
P0: x1, x2, …, xn – initial set of features
Formation of granules:
• Ordering by individual relevance: G1: x7, x33, x12, …, xn
• All pairs by exhaustive search: G2: x3x8, x15x88, …, xi xj
• All triplets by exhaustive search: G3: x75x1x35, x11x49x55, …, xi xj xk
Top-level most relevant granules using AdDel:
• G = <G1, G2, G3> … AdDel
[Equation: distance function f in terms of r1 and r2]
r1 and r2 are the distances to the closest neighbors, one from each class
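The granule-formation step can be sketched with `itertools.combinations`. The relevance measure used by GRAD is not given in the slides, so `toy_relevance` below is a hypothetical stand-in, and `top_k` is an assumed cut-off.

```python
from itertools import combinations
from math import comb


def form_granules(n_features, relevance, top_k=2, max_size=3):
    """Exhaustively score all feature groups of size 1..max_size and
    keep the top_k most relevant groups ("granules") of each size."""
    granules = {}
    for size in range(1, max_size + 1):
        groups = combinations(range(n_features), size)
        granules[size] = sorted(groups, key=relevance, reverse=True)[:top_k]
    return granules


# Hypothetical relevance: mean of per-feature weights, plus a bonus when
# features 0 and 3 appear together (an interdependent pair).
weights = [0.1, 0.9, 0.5, 0.2]


def toy_relevance(group):
    bonus = 1.0 if {0, 3} <= set(group) else 0.0
    return sum(weights[i] for i in group) / len(group) + bonus


granules = form_granules(4, toy_relevance)
print(granules)

# Number of candidate groups the exhaustive step would score for 822 genes.
n_pairs, n_triplets = comb(822, 2), comb(822, 3)
print(n_pairs, n_triplets)
```

Limiting the exhaustive step to small groups keeps the candidate count polynomial in the number of genes, in contrast to enumerating all subsets.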
![Page 13: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/13.jpg)
Experiments
Comparison
1. GRAD
2. Wrapper C4.5
3. Original Dataset
4. Filtering
– Wilcoxon Test, low entropy, variability, and very low absolute expression level
Classifiers
1. C4.5
2. SVM
3. RBF
4. NN-MLP
Data
• Small Dataset: 74x822
![Page 14: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/14.jpg)
Data Characterization
Not all organs have samples of both classes
Unbalanced number of cases:
• 50 Cancer Samples
• 24 Normal Samples
Most data is relatively lowly expressed
Mean quite far from median:
Potentially due to outliers
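Because the classes are unbalanced, raw accuracy is easier to interpret against a trivial reference: always predicting the majority class. The calculation below uses only the sample counts given above. The baseline itself is not reported in the slides.

```python
# Class counts from the data characterization.
n_cancer, n_normal = 50, 24
n_total = n_cancer + n_normal

# Accuracy of a classifier that always predicts the larger class.
majority_baseline = max(n_cancer, n_normal) / n_total

print(n_total, f"{majority_baseline:.1%}")
```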
![Page 15: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/15.jpg)
Data Characterization
Plots: average vs. standard deviation; average vs. range
Both the range and the standard deviation have a roughly linear relationship with the average gene expression level
![Page 16: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/16.jpg)
Experimental Results
Predictive Accuracy
GRAD   Wrapper   Original   Filtering
86%    82%       79%        78%
GRAD is significantly better than using the original or the filtered dataset
The wrapper approach is not significantly better than using the original or the filtered dataset
![Page 17: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/17.jpg)
GRAD Results
Importance of considering dependence
Distance Function: [Equation: f in terms of r1 and r2]
• 10 best by GRAD: P = 100%
• 10 most individually informative: P = 75.7%
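The slides describe the inputs of the distance function as the distances to the closest neighbors, one from each class. The sketch below computes only those two distances for a query sample. It assumes Euclidean distance, and it does not reproduce the function f itself, whose formula is not legible in this transcript. The names and toy points are hypothetical.

```python
from math import dist


def nearest_per_class(x, samples, labels):
    """Return {label: distance from x to its closest sample of that class}."""
    nearest = {}
    for s, y in zip(samples, labels):
        d = dist(x, s)  # Euclidean distance (an assumption)
        if y not in nearest or d < nearest[y]:
            nearest[y] = d
    return nearest


# Toy points (made up) restricted to two selected attributes.
samples = [(0, 0), (3, 4), (6, 8)]
labels = ["normal", "cancer", "cancer"]
r = nearest_per_class((3, 0), samples, labels)
print(r)
```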
![Page 18: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/18.jpg)
GRAD Results
Scatter Plot of GRAD Attributes
Interdependency relationship between two non-differentially expressed genes selected with GRAD
Two differentially expressed genes selected with GRAD.
![Page 19: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/19.jpg)
GRAD Results
Examples ordered by the value of the Distance Function
In the future, this could make it possible to estimate the degree of risk, make early diagnoses, and monitor a course of treatment
![Page 20: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/20.jpg)
Induced Classifiers
C4.5 induced on GRAD attributes
C4.5 induced using a wrapper approach
![Page 21: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/21.jpg)
Conclusions
1. Coping with redundancy and dependency between attributes is very important.
2. The GRAD algorithm is an effective means of selecting a subset of attributes from a very large initial set.
3. The presented results are illustrative only.
4. We are open to cooperation with those who are interested in the biological interpretation of the results.
![Page 22: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/22.jpg)
Questions
…
![Page 23: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/23.jpg)
GRAD
As n increases, the relevance first grows, then stops growing and begins to decrease as less informative, noisy attributes are added.
The maximum of the quality curve indicates the optimal number of attributes. Only algorithms of the AdDel family have this property.
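Reading the optimal number of attributes off the quality curve amounts to taking the position of its maximum. A minimal sketch follows. The curve values are made up to show the rise-then-fall shape and are not results from the experiments.

```python
def optimal_size(quality_by_n):
    """Return the subset size n at which the quality curve peaks."""
    return max(quality_by_n, key=quality_by_n.get)


# Made-up quality values indexed by the number of attributes n.
curve = {1: 0.4, 2: 0.6, 3: 0.9, 4: 0.7, 5: 0.5}
best_n = optimal_size(curve)
print(best_n)
```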
![Page 24: Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649c795503460f9492e65f/html5/thumbnails/24.jpg)
Feature Selection
Wrapper:
• Considers the classifier while searching for the best subset
• Accuracy improves
• May overfit due to small sample sizes and huge dimensionality
• Computationally more expensive
Filtering:
• Potentially less accurate
• Faster: does not require the induction of a predictor
• Commonly preferred approach in bioinformatics
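To make the cost difference concrete, here is a minimal wrapper sketch: every candidate subset is scored by training and testing a classifier. It assumes a 1-nearest-neighbour classifier with leave-one-out evaluation and greedy forward search, purely for brevity. The slides use C4.5, and all names and data here are hypothetical.

```python
from math import dist


def loo_accuracy(samples, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the features in `subset`."""
    def project(s):
        return [s[i] for i in subset]

    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        others = [(dist(project(x), project(s)), lab)
                  for j, (s, lab) in enumerate(zip(samples, labels)) if j != i]
        correct += min(others)[1] == y  # label of the closest other sample
    return correct / len(samples)


def wrapper_forward(samples, labels, n_features, max_size=2):
    """Greedy forward selection driven by classifier accuracy."""
    selected = []
    for _ in range(max_size):
        remaining = [f for f in range(n_features) if f not in selected]
        best = max(remaining,
                   key=lambda f: loo_accuracy(samples, labels, selected + [f]))
        selected.append(best)
    return selected


# Toy data (made up): feature 0 is noise, feature 1 separates the classes.
samples = [[5, 0], [1, 1], [4, 0], [2, 9], [5, 10], [1, 9]]
labels = [0, 0, 0, 1, 1, 1]
first = wrapper_forward(samples, labels, n_features=2, max_size=1)
print(first, loo_accuracy(samples, labels, first))
```

Each candidate feature costs a full round of classifier evaluations, which is the source of the extra expense noted above. With few samples, the accuracy estimate driving the search is also noisy, which is where the overfitting risk comes from.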