Gene Feature Recognition
description
Transcript of Gene Feature Recognition
![Page 1: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/1.jpg)
NUS-KI Course on Bioinformatics, Nov 2005
For written notes on this lecture, please read Chapters 4 and 7 of The Practical Bioinformatician
Gene Feature Recognition
Limsoon Wong
![Page 2: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/2.jpg)
NUS-KI Course on Bioinformatics, Nov 2005
Recognition of Splice Sites
A simple example to start the day
![Page 3: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/3.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Splice Sites
AcceptorDonor
![Page 4: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/4.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Acceptor Site (Human Genome)
• If we align all known acceptor sites (with their splice junction site aligned), we have the following nucleotide distribution
• Acceptor site: CAG | TAG | coding regionImage credit: Xu
![Page 5: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/5.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Donor Site (Human Genome)
• If we align all known donor sites (with their splice junction site aligned), we have the following nucleotide distribution
• Donor site: coding region | GTImage credit: Xu
![Page 6: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/6.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
What Positions Have “High” Information Content?
• For a weight matrix, information content of each column is calculated as
– X{A,C,G,T} Prob(X)*log (Prob(X)/0.25)
• When a column has evenly distributed nucleotides, its information content is lowest
• Only need to look at positions having high information content
![Page 7: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/7.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Information Content Around Donor Sites in Human Genome
• Information content column –3 = – .34*log (.34/.25) – .363*log
(.363/.25) – .183* log (.183/.25) – .114* log (.114/.25) = 0.04
column –1 = – .092*log (.92/.25) – .03*log (.033/.25) – .803* log (.803/.25) – .073* log (.73/.25) = 0.30
Image credit: Xu
![Page 8: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/8.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Weight Matrix Model for Splice Sites
• Weight matrix model – build a weight matrix for donor, acceptor,
translation start site, respectively– use positions of high information content
Image credit: Xu
![Page 9: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/9.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Splice Site Prediction: A Procedure
• Add up freq of corr letter in corr positions:
• Make prediction on splice site based on some threshold
AAGGTAAGT: .34 + .60 + .80 +1.0 + 1.0 + .52 + .71 + .81 + .46 = 6.24
TGTGTCTCA: .11 + .12 + .03 +1.0 + 1.0 + .02 + .07 + .05 + .16 = 2.56
Image credit: Xu
![Page 10: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/10.jpg)
NUS-KI Course on Bioinformatics, Nov 2005
Recognition of Translation Initiation Sites
An introduction to the World’s simplest TIS recognition system
A simple approach to accuracy and understandability
![Page 11: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/11.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Translation Initiation Site
![Page 12: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/12.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
A Sample cDNA
• What makes the second ATG the TIS?
299 HSU27655.1 CAT U27655 Homo sapiensCGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT............................................................ 80................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
![Page 13: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/13.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Approach
• Training data gathering• Signal generation
– k-grams, distance, domain know-how, ...• Signal selection
– Entropy, 2, CFS, t-test, domain know-how...• Signal integration
– SVM, ANN, PCL, CART, C4.5, kNN, ...
![Page 14: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/14.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Training & Testing Data
• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]
• 3312 sequences• 13503 ATG sites• 3312 (24.5%) are TIS• 10191 (75.5%) are non-TIS• Use for 3-fold x-validation expts
![Page 15: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/15.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Signal Generation
• K-grams (ie., k consecutive letters)– K = 1, 2, 3, 4, 5, …– Window size vs. fixed position– Up-stream, downstream vs. any where in window– In-frame vs. any frame
0
0.5
1
1.5
2
2.5
3
A C G T
seq1
seq2
seq3
![Page 16: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/16.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Signal Generation: An Example
• Window = 100 bases• In-frame, downstream
– GCT = 1, TTT = 1, ATG = 1…• Any-frame, downstream
– GCT = 3, TTT = 2, ATG = 2…• In-frame, upstream
– GCT = 2, TTT = 0, ATG = 0, ...
299 HSU27655.1 CAT U27655 Homo sapiensCGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
![Page 17: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/17.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
An Example File Resulting From Feature
Generation
![Page 18: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/18.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Too Many Signals
• For each value of k, there are 4k * 3 * 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms
![Page 19: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/19.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Signal Selection (Basic Idea)
• Choose a signal w/ low intra-class distance• Choose a signal w/ high inter-class distance
![Page 20: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/20.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Signal Selection (eg., t-statistics)
![Page 21: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/21.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Signal Selection (eg., 2)
![Page 22: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/22.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Signal Selection (eg., CFS)
• Instead of scoring individual signals, how about scoring a group of signals as a whole?
• CFS– Correlation-based Feature Selection– A good group contains signals that are highly
correlated with the class, and yet uncorrelated with each other
![Page 23: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/23.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Sample k-grams Selected by CFS
• Position –3• in-frame upstream ATG• in-frame downstream
– TAA, TAG, TGA, – CTG, GAC, GAG, and GCC
Kozak consensusLeaky scanning
Stop codon
Codon bias?
![Page 24: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/24.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Signal Integration
• kNN– Given a test sample, find the k training samples
that are most similar to it. Let the majority class win
• SVM– Given a group of training samples from two
classes, determine a separating plane that maximises the margin of error
• Naïve Bayes, ANN, C4.5, ...
![Page 25: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/25.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Neighborhood
5 of class
3 of class
=
Illustration of kNN (k=8)
Image credit: ZakiTypical “distance” measure =
![Page 26: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/26.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Using WEKA for TIS Prediction
![Page 27: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/27.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Results (3-fold x-validation)
TP/(TP + FN) TN/(TN + FP) TP/(TP + FP) Accuracy
Naïve Bayes 84.3% 86.1% 66.3% 85.7%
SVM 73.9% 93.2% 77.9% 88.5%
Neural Network 77.6% 93.2% 78.8% 89.4%
Decision Tree 74.0% 94.4% 81.1% 89.4%
3-NN* 73.2% 92.9% 77.2% 88.0%
* Using top 20 2-selected features from amino-acid features
![Page 28: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/28.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Validation Results (on Chr X and Chr 21)
• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s
ATGpr
Ourmethod
![Page 29: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/29.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Technique Comparisons
• Pedersen&Nielsen [ISMB’97]
– 85% accuracy– Neural network– No explicit features
• Zien [Bioinformatics’00]
– 88% accuracy– SVM+kernel engineering– No explicit features
• Hatzigeorgiou [Bioinformatics’02]
– 94% accuracy (with scanning rule)
– Multiple neural networks– No explicit features
• Our approach– 89% accuracy (94% with
scanning rule)– Explicit feature
generation– Explicit feature selection– Use any machine
learning method w/o any form of complicated tuning
![Page 30: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/30.jpg)
NUS-KI Course on Bioinformatics, Nov 2005
Recognition of Transcription Start Sites
An introduction to the World’s best TSS recognition system
A heavy tuning approach
![Page 31: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/31.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Transcription Start Site
![Page 32: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/32.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Structure of Dragon Promoter Finder
-200 to +50window size
Model selected based on desired sensitivity
![Page 33: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/33.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Each model has two submodels based on GC content
GC-rich submodel
GC-poor submodel
(C+G) =#C + #GWindow Size
![Page 34: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/34.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Data Analysis Within Submodel
K-gram (k = 5) positional weight matrix
p
e
i
![Page 35: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/35.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Promoter, Exon, Intron Sensors
• These sensors are positional weight matrices of k-grams, k = 5 (aka pentamers)
• They are calculated as s below using promoter, exon, intron data respectively Pentamer at ith
position in input
jth pentamer atith position in training window
Frequency of jthpentamer at ith positionin training window
Window size
![Page 36: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/36.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Data Preprocessing & ANNTuning parameters
tanh(x) =ex e-x
ex e-x
sIE
sI
sEtanh(net)
Simple feedforward ANN trained by the Bayesian regularisation method
wi
net = si * wi
Tunedthreshold
![Page 37: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/37.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Accuracy Comparisons
without C+G submodels
with C+G submodels
![Page 38: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/38.jpg)
NUS-KI Course on Bioinformatics, Nov 2005
Notes
![Page 39: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/39.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
References (TIS Recognition)
• A. G. Pedersen, H. Nielsen, “Neural network prediction of translation initiation sites in eukaryotes”, ISMB 5:226--233, 1997
• H.Liu, L. Wong, “Data Mining Tools for Biological Sequences”, Journal of Bioinformatics and Computational Biology, 1(1):139--168, 2003
• A. Zien et al., “Engineering support vector machine kernels that recognize translation initiation sites”, Bioinformatics 16:799--807, 2000
• A. G. Hatzigeorgiou, “Translation initiation start prediction in human cDNAs with high accuracy”, Bioinformatics 18:343--350, 2002
![Page 40: Gene Feature Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062410/56815b14550346895dc8c342/html5/thumbnails/40.jpg)
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
References (TSS Recognition)
• V. B. Bajic et al., “Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates”, J. Mol. Graph. & Mod. 21:323--332, 2003
• J. W. Fickett, A. G. Hatzigeorgiou, “Eukaryotic promoter recognition”, Gen. Res. 7:861--878, 1997
• A. G. Pedersen et al., “The biology of eukaryotic promoter prediction---a review”, Computer & Chemistry 23:191--207, 1999
• M. Scherf et al., “Highly specific localisation of promoter regions in large genome sequences by PromoterInspector”, JMB 297:599--606, 2000