(H)MMs in gene prediction and similarity searches
![Page 1: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/1.jpg)
(H)MMs in gene prediction and similarity searches
![Page 2: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/2.jpg)
What is an HMM? (Eddy2004)
•States
•Transition Probabilities
•Emission Probabilities
![Page 3: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/3.jpg)
What is hidden? (Eddy2004)
•State Path
Log (product of transition and emission probabilities)
Log (1 x 0.25 x 0.9 x 0.25 x 0.9 x 0.25 …0.9 x 0.4) = -41.22
![Page 4: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/4.jpg)
What is hidden? (Eddy2004)
•State Path
![Page 5: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/5.jpg)
Using HMMs
• Given the parameters of the model, compute the probability of a particular output sequence. This problem is solved by the forward algorithm.
• Given the parameters of the model, find the most likely sequence of hidden states that could have generated a given output sequence. This problem is solved by the Viterbi algorithm.
• Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities. In other words, train the parameters of the HMM given a dataset of sequences. This problem is solved by the Baum-Welch algorithm.
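The first two problems can be sketched in a few lines of Python. This is a minimal illustration, not the model from the slides: the two states (`coding`, `noncoding`) and every probability below are made-up assumptions chosen only to keep the example small.

```python
import math

# Hypothetical two-state HMM; all numbers are illustrative assumptions.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def forward(seq):
    """Forward algorithm: P(seq | model), summed over all state paths."""
    f = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for x in seq[1:]:
        f = {s: EMIT[s][x] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())

def viterbi(seq):
    """Viterbi algorithm: most probable state path and its log probability."""
    v = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for x in seq[1:]:
        ptr, new_v = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this position.
            best = max(STATES, key=lambda r: v[r] + math.log(TRANS[r][s]))
            ptr[s] = best
            new_v[s] = v[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][x])
        backpointers.append(ptr)
        v = new_v
    last = max(STATES, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]

seq = "GCGCATAT"
print(forward(seq))   # total probability of the sequence
print(viterbi(seq))   # (state path, log probability of that path)
```

Working in log space in `viterbi` avoids numerical underflow on long sequences; a production forward implementation would normally do the same (or rescale), which this sketch omits for brevity.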
![Page 6: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/6.jpg)
Profile Hidden Markov Models
• Statistical model of multiple sequence alignments
• Position-specific description of the level of conservation and the probabilities of observing each type of amino acid (nucleotide) at that position
• Protein domain alignments (Pfam, TIGRFAMs, …)
• Regulator binding site alignments
![Page 7: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/7.jpg)
Simple Profile HMM – no gaps
Emission Probabilities determined from distribution of amino acids at each site of the alignment
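As a rough sketch of that idea, the per-column emission probabilities of a gap-free alignment can be estimated by counting residues in each column. The four-sequence alignment and the add-one pseudocount below are assumptions made for illustration; real profile HMM builders use more elaborate priors and sequence weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical gap-free alignment (not taken from the slides).
ALIGNMENT = ["ACDE", "ACDF", "GCDE", "ACEE"]

def emission_probabilities(alignment, pseudocount=1.0):
    """Return one emission distribution (dict) per alignment column."""
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        total = len(alignment) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total
                        for aa in AMINO_ACIDS})
    return profile

def log_prob(profile, seq):
    """Log probability of a sequence of the same length as the profile."""
    return sum(math.log(column[aa]) for column, aa in zip(profile, seq))

profile = emission_probabilities(ALIGNMENT)
print(profile[1]["C"])            # fully conserved column: high probability
print(log_prob(profile, "ACDE"))  # family-like sequence
print(log_prob(profile, "WWWW"))  # unrelated sequence scores much lower
```

The pseudocount keeps residues that never appear in a column from getting probability zero, which would otherwise make any sequence containing them impossible under the model.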
![Page 8: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/8.jpg)
Allowing gaps in a position-specific way
Need to allow a sequence to contain one or more residues not found in the model (Insert) and also be missing regions that are present in the model (Delete)
![Page 9: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/9.jpg)
![Page 10: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/10.jpg)
![Page 11: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/11.jpg)
Pfam
• Database of protein domains and families available as multiple alignments and HMMs
• Pfam-A is curated. Pfam-B is automated.
![Page 12: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/12.jpg)
A sample Pfam: MCPsignal
![Page 13: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/13.jpg)
Pfam – Seed Alignment
![Page 14: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/14.jpg)
Pfam – scoring members
• Trusted cut-off
– Bit score for lowest scoring match included in the full alignment
• Noise cut-off
– Bit score for highest scoring match not included in the full alignment
• Gathering cut-off
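The relationship between the three cut-offs can be sketched as follows. The slide does not define the gathering cut-off; this example assumes it is the bit-score threshold that decides which matches go into the full alignment, so the trusted and noise cut-offs fall just above and just below it. The function name and the scores are hypothetical.

```python
def pfam_cutoffs(bit_scores, gathering):
    """Return (trusted, noise) cut-offs implied by a gathering threshold.

    Assumption: matches scoring >= gathering are included in the full
    alignment; everything else is excluded.
    """
    included = [s for s in bit_scores if s >= gathering]
    excluded = [s for s in bit_scores if s < gathering]
    trusted = min(included) if included else None  # lowest included score
    noise = max(excluded) if excluded else None    # highest excluded score
    return trusted, noise

# Hypothetical bit scores from searching a family HMM against a database.
scores = [120.5, 88.0, 31.2, 24.9, 12.0]
print(pfam_cutoffs(scores, gathering=25.0))
```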
![Page 15: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/15.jpg)
ATTTATCCGCCGAAGCCATTACATAGTATCGCGCTTGGCAGTCGGATTCCGGCGCTGCGTGAAGACTATA AACTTGGGCGTTTATGCGGTCGTTATTTCCTCGCCACGGTTGGCAAGCTATTAACTGAAAAAGCGCCGCT TACCCGCCATCTGGTGCCAGTGGTGACGCCGGAATCGATTGTCATTCCGCCTGCGCCAGTCGCCAACGAT ACGCTGGTTGCCGAAGTGAGCGACGCTCCGCAGGCGAACGACCCGACATTTAACAATGAGGATCTGGCTT GATTTGCCGTTTTATCGACACCCACTGCCATTTTGATTTCCCGCCGTTTAGTGGCGATGAAGAGGCCAGC CTGCAACGCGCGGCACAAGCGGGCGTAGGCAAGATCATTGTTCCGGCAACAGAGGCGGAAAATTTTGCCC GTGTGTTGGCATTAGCGGAAAATTATCAACCGCTGTATGCCGCATTGGGCTTGCATCCTGGTATGTTGGA AAAACATAGCGATGTGTCTCTTGAGCAGCTACAGCAGGCGCTGGAAAGGCGTCCGGCGAAGGTGGTGGCG GTGGGGGAGATCGGTCTGGATCTCTTTGGCGACGATCCGCAATTTGAGAGGCAGCAGTGGTTACTCGACG AACAACTGAAACTGGCGAAACGCTACGATCTGCCGGTGATCCTGCATTCACGGCGCACGCACGACAAACT GGCGATGCATCTTAAACGCCACGATTTACCGCGCACTGGCGTGGTTCACGGTTTTTCCGGCAGCCTGCAA CAGGCCGAACGGTTTGTACAGCTGGGCTACAAAATTGGCGTAGGCGGTACTATCACCTATCCACGCGCCA GTAAAACCCGCGATGTCATCGCAAAATTACCGCTGGCATCGTTATTGCTGGAAACCGACGCGCCGGATAT GCCGCTCAACGGTTTTCAGGGGCAGCCTAACCGCCCGGAGCAGGCTGCCCGTGTGTTCGCCGTGCTTTGC GAGTTGCGCCGGGAACCGGCGGATGAGATTGCGCAAGCGTTGCTTAATAACACGTATACGTTGTTTAACG TGCCGTAGGCCGGATAAGGCGTTCACGCCGCATCCGGCAGTTGGCGCACAATGCCTGATGCGACGCTTAA CGCGTCTTATCATGCCTACAGGTTTGTGCCGAACCGTAGGCCGGATAAGGCGTTCACGCCGCATCCGGCA GTTGGCGCACAATGCCTGATGCGACGCTTGTCGCGTCTTATCATGCCTACAAGTCTGTGCCGAACCGTAG GCCGGATAAGGCGTTCACGCCGCATCCGGCAGTCGGCGCATAATGCCTGATGCGACGCTTGTCGCGTCTT ATCATGCCTACAGGTTTGTGCCGAACCGTAGGCCGGATAAGGCGTTCGCGCCGCATCCGGCAGTTGGCGC ACAATGCCTGATGCGACGCTTGACGCGTCTTATCAGGCCTACAAGTCTGTGCCGAACCGTAGGCCGTATC CGGCATGTCACAAATAGAGCGCCGGAAATATCAACCGGCTCACCCCGCGCACCTTTAACGCATCAGCCAA CGGCTCAACGTCTTCCGGCGTGGCGCTCGCCCAGCTTTGCGCCTCGCCATACACGCCGTGGGCATGAAAC GCGTTCAGGCGTACCGGAACATCGCCGAGTCCCTTGATAAACGCCGCCAGTTCTTCGATGTGTTGCAAAT AATCCACCTGGCCAGGGATCACCAGCAAACGCAGTTCCGCCAGCTTGCCGCGCTCTGCCAGCAAATAGAT GCTGCGCTTAATCTGCTGATTATCGCGTCCGGTGAGTTGTTGATGACATTCGCTCCCCCACGCTTTGAGA TCGAGCATTGCGCCGTCGCACACCGGGAGCAATTTTTCCCAGCCGGTTTCGCTCAACATGCCGTTACTGT CCACCAGACAGGTGAGATGGCGCAGTTGCGGATCGTTTTTGATAGCAGTAAACAGCGCCACCACAAACGG 
CAGCTGGGTCGTGGCTTCACCGCCACTCACCGTTATCCCTTCGATAAACAGCACTGCTTTGCGGACATGG CTAAGCACTTCGTCCACGCTCATGGATTGCGCCATGGGCGTGGCATGTTGCGGACACCTCTTCAGGCAGG TATCACACTGCTCGCAAACCACAGCGTTCCACACCACTTTGCCGTCAACAATCTGCAACGCCTGATGCGG ACACTGTGGCACGCACTCCCCACAGTCATTGCAACGTCCCATCGTCCACGGATTGTGACAGTTTTTGCAG CGCAGATTGCAGCCCTGCAAAAACAGAGCCAGACGACTGCCTGGCCCGTCAACGCAGGAGAAGGGGATAA TCTTACTGACTAAAGCGCATCTGCTGTTCATGGCTTATCACGCGCGGCTGGCGTTCCAGAATACGAGTGT TGCGTGCGGCTTCTTCGCCCAGCCAGGTGGTGTTGGTGCGTGAACCTTCGGCGCGATATTTTTCTAAATC CGACAAACGCACCATATAACCGGTAACGCGAACCAGATCGTTACCGCTGACATTGGCGGTAAATTCACGC ATTCCGGCTTTAAAGGCACCGAGGCAAAGCTGTACCAGTGCCTGCGGGTTACGTTTGATGGTTTCGTCGA
Gene Discovery
![Page 16: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/16.jpg)
Prokaryotes: 10 kb DNA → 3 mRNAs → 9 proteins
Eukaryotes: 10 kb DNA → Unprocessed mRNA → Processed mRNA → 1 protein
![Page 17: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/17.jpg)
Two Approaches
• Ab initio
– Based exclusively on computational models
– Error prone, esp. for eukaryotes
– Generally requires manual clean up
• Comparative
– Find genes corresponding to sequenced cDNAs
– Find the genes already predicted for a closely related organism
• If you can... use both strategies
![Page 18: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/18.jpg)
Attributes that prove useful for gene prediction
Begin with a start codon
End with a stop codon
Have a length divisible by 3
Splice sites
Tend to have a species-specific codon usage
Exhibit even higher order biases in composition
Tend to be more conserved between organisms than non-coding regions
ORF: Open Reading Frame
![Page 19: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/19.jpg)
Detecting Signal Amid the Noise
Each sequence can be translated in each of 6 reading frames, 3 for the sequenced strand and 3 for the reverse complement.
There are far more open reading frames than there are genes.
How do we know which reading frame contains real genes?
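A naive six-frame ORF scan makes the problem concrete. The sketch below assumes the standard start codon ATG and stop codons TAA, TAG and TGA, ignores alternative start codons, and reports coordinates on whichever strand was scanned; the function names and the `min_codons` parameter are illustrative choices, not part of any tool mentioned in the slides.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=2):
    """List (strand, frame, start, end) for every ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, on the scanned strand.
    The length in codons includes the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs

print(find_orfs("CCATGAAATAGGG"))
```

On real genomes a scan like this returns many more ORFs than there are genes, which is why the composition-based scoring on the following slides is needed.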
![Page 20: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/20.jpg)
Organism-specific Composition Biases
![Page 21: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/21.jpg)
Codon usage in the E. coli K-12 and H. influenzae genomes
E. coli K-12: 51.8% GC coding, preference for GGC glycine codons
H. influenzae: 38.1% GC coding, preference for GGU glycine codons
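Statistics like these can be computed directly from coding sequences. The snippet below is a small sketch with a made-up coding sequence; it works on DNA, so the GGU codon appears as GGT.

```python
from collections import Counter

# Glycine codons in DNA notation (GGT corresponds to GGU in mRNA).
GLYCINE_CODONS = ("GGA", "GGC", "GGG", "GGT")

def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def glycine_codon_usage(cds):
    """Relative usage of the four glycine codons in an in-frame CDS."""
    codons = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(codons[c] for c in GLYCINE_CODONS)
    if total == 0:
        return {}
    return {c: codons[c] / total for c in GLYCINE_CODONS}

cds = "ATGGGCGGCGGTTAA"  # hypothetical coding sequence
print(gc_content(cds))
print(glycine_codon_usage(cds))
```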
![Page 22: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/22.jpg)
Gene Discovery using Markov Models and HMMs
Example of a 1st order Markov model for gene prediction:
The probability that base X is part of a coding region depends only on the base immediately preceding X.
AX, TX, CX, GX
How frequently does AX occur in a coding region vs. a non-coding region?
A 5th order model: AAAAAX, AAAATX, AAAACX, … GGGGGX
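Estimating such a model amounts to counting how often each base follows each context in a training set. The sketch below handles any order k; the training sequences and the add-one pseudocount are assumptions for illustration only.

```python
from collections import defaultdict

BASES = "ACGT"

def train_markov(seqs, order, pseudocount=1.0):
    """Return a function prob(context, base) = P(base | preceding k bases)."""
    counts = defaultdict(lambda: dict.fromkeys(BASES, 0.0))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        # Unseen contexts fall back to the pseudocounts alone (uniform).
        c = counts.get(context, dict.fromkeys(BASES, 0.0))
        return (c[base] + pseudocount) / (sum(c.values()) + len(BASES) * pseudocount)

    return prob

# Hypothetical training data for the two region types.
coding = train_markov(["ATGGCGGCGGCG"], order=1)
noncoding = train_markov(["ATATATATTTAA"], order=1)

# How likely is C after G in coding vs. non-coding sequence?
print(coding("G", "C"), noncoding("G", "C"))
```

Passing `order=5` gives the 5th order model from the slide, with contexts such as `AAAAA` or `GGGGG`.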
![Page 23: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/23.jpg)
Model Order – which is best?
• In general, higher order models better describe the properties of real genes, but training higher order models requires more data and the training sets are limiting.
• The probabilities of rare sequences in higher order models can be low enough that the model performs worse.
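The trade-off is easy to quantify: an order-k model over four bases has 4^k contexts, each with four next-base probabilities. The sketch below counts parameters and the average number of training observations available per parameter; the training-set size of one million bases is a hypothetical figure.

```python
def n_parameters(order):
    """Number of conditional probabilities in an order-k DNA Markov model."""
    return 4 ** order * 4  # 4^k contexts x 4 possible next bases

def observations_per_parameter(training_bases, order):
    """Average training observations per parameter (rough guide only)."""
    return training_bases / n_parameters(order)

TRAINING_BASES = 1_000_000  # hypothetical amount of known coding sequence
for k in (1, 2, 5, 8):
    print(k, n_parameters(k), observations_per_parameter(TRAINING_BASES, k))
```

Because real sequence is far from uniform, rare contexts receive many fewer observations than this average suggests, which is where high-order estimates become unreliable.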
![Page 24: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/24.jpg)
Gene Prediction Models based on Markov Chains
Basic Method:
• Build at least 6 submodels (one for each reading frame) for coding regions and 1 for noncoding
• Find ORFs – start codon, stop codon, length divisible by 3
• Score each ORF by calculating the probability that it was generated by each model. Choose the model with the highest probability – if it exceeds a user-specified threshold, you have a gene.
Two popular applications: GLIMMER, GeneMark
Hidden Markov Models go one step further and model the gene boundaries as transitions between “hidden” states.
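The scoring step of the basic method can be sketched as below. To stay short, the example simplifies heavily: it uses a single first-order coding model instead of six frame-specific ones, and it thresholds the log-likelihood margin over the noncoding model. The training sequences, function names and threshold are all illustrative assumptions and do not reproduce GLIMMER or GeneMark.

```python
import math
from collections import Counter

BASES = "ACGT"

def train_first_order(seqs, pseudocount=1.0):
    """Estimate P(next base | previous base) from training sequences."""
    pairs = Counter()
    for s in seqs:
        pairs.update(zip(s, s[1:]))
    model = {}
    for a in BASES:
        total = sum(pairs[(a, b)] for b in BASES) + len(BASES) * pseudocount
        for b in BASES:
            model[(a, b)] = (pairs[(a, b)] + pseudocount) / total
    return model

def log_likelihood(model, seq):
    """Log probability of seq under the model, given its first base."""
    return sum(math.log(model[pair]) for pair in zip(seq, seq[1:]))

def classify_orf(orf, models, threshold=0.0):
    """Pick the best-scoring model; call a gene if it beats noncoding by > threshold."""
    scores = {name: log_likelihood(m, orf) for name, m in models.items()}
    best = max(scores, key=scores.get)
    margin = scores[best] - scores["noncoding"]
    return best, margin, (best != "noncoding" and margin > threshold)

# Hypothetical training sets.
models = {
    "coding": train_first_order(["ATGGCGGCGGCGGCC"]),
    "noncoding": train_first_order(["ATATATTTAATATAT"]),
}
print(classify_orf("GCGGCGGCG", models))
print(classify_orf("ATATATTA", models))
```

Extending the `models` dictionary with one entry per reading frame would bring the sketch closer to the six-submodel scheme described above.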
![Page 25: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/25.jpg)
![Page 26: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/26.jpg)
GLIMMER
Reference: A.L. Delcher, D. Harmon, S. Kasif, O. White and S.L. Salzberg. Improved microbial gene identification with GLIMMER. NAR, 1999, Vol. 27, No. 23, pp. 4636-4641.
• GLIMMER can be “trained” using the genome itself
– Finds the longest ORFs in the genome and assumes they are real genes to estimate emission probabilities
• Interpolated Markov model
– Not necessary to “fix” the order of the model
• Analysis of 10 microbial genomes:
– GLIMMER 2 finds 97.4-99.7% of annotated genes
– PLUS another 7-25% !!!
• GLIMMER 3 has a much lower False Positive Rate
Specificity vs. Sensitivity
![Page 27: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/27.jpg)
W.H. Majoros, M. Pertea, and S.L. Salzberg. TigrScan and GlimmerHMM: two open-source ab initio eukaryotic gene-finders
http://www.tigr.org/software/GlimmerHMM/index.shtml
Sensitivity: TP/(TP+FN)
How much of what you hoped to detect did you get?
Specificity: TP/(TP+FP)
How much of what you detected is real?
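Both measures follow directly from the counts of true positives (TP), false negatives (FN) and false positives (FP). The counts in this sketch are hypothetical. Note that TP/(TP+FP), called specificity here as is common in gene prediction, is known elsewhere as precision or positive predictive value.

```python
def sensitivity(tp, fn):
    """Fraction of real genes that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Fraction of predictions that are real genes: TP / (TP + FP)."""
    return tp / (tp + fp)

# Hypothetical evaluation: 100 annotated genes, 120 predictions, 90 correct.
tp, fn, fp = 90, 10, 30
print(sensitivity(tp, fn), specificity(tp, fp))
```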
![Page 28: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/28.jpg)
![Page 29: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/29.jpg)
![Page 30: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/30.jpg)
![Page 31: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/31.jpg)
![Page 32: (H)MMs in gene prediction and similarity searches](https://reader036.fdocuments.us/reader036/viewer/2022062502/5681560c550346895dc3ceb5/html5/thumbnails/32.jpg)