Gene Recognition
description
Transcript of Gene Recognition
![Page 1: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/1.jpg)
CS262 Lecture 9, Win07, Batzoglou
Gene Recognition
![Page 2: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/2.jpg)
CS262 Lecture 9, Win07, Batzoglou
Gene structure
exon1 exon2 exon3intron1 intron2
transcription
translation
splicing
exon = protein-codingintron = non-coding
Codon:A triplet of nucleotides that is converted to one amino acid
![Page 3: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/3.jpg)
CS262 Lecture 9, Win07, Batzoglou
Needles in a Haystack
![Page 4: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/4.jpg)
CS262 Lecture 9, Win07, Batzoglou
• Classes of Gene predictors Ab initio
• Only look at the genomic DNA of target genome De novo
• Target genome + aligned informant genome(s)
EST/cDNA-based & combined approaches• Use aligned ESTs or cDNAs + any other kind of evidence
Gene Finding
EXON EXON EXON EXON EXON
Human tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Macaque tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Mouse ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca Rat ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca Rabbit t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Dog t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta Cow t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-ctaArmadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta Elephant gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Tenrec tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Opossum ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg Chicken ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg
![Page 5: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/5.jpg)
CS262 Lecture 9, Win07, Batzoglou
Signals for Gene Finding
1. Regular gene structure
2. Exon/intron lengths
3. Codon composition
4. Motifs at the boundaries of exons, introns, etc.Start codon, stop codon, splice sites
5. Patterns of conservation
6. Sequenced mRNAs
7. (PCR for verification)
![Page 6: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/6.jpg)
CS262 Lecture 9, Win07, Batzoglou
Next Exon:Frame 0
Next Exon:Frame 1
![Page 7: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/7.jpg)
CS262 Lecture 9, Win07, Batzoglou
Exon and Intron Lengths
![Page 8: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/8.jpg)
CS262 Lecture 9, Win07, Batzoglou
Nucleotide Composition• Base composition in exons is characteristic due to the genetic code
Amino Acid SLC DNA CodonsIsoleucine I ATT, ATC, ATALeucine L CTT, CTC, CTA, CTG, TTA, TTGValine V GTT, GTC, GTA, GTGPhenylalanine F TTT, TTCMethionine M ATGCysteine C TGT, TGCAlanine A GCT, GCC, GCA, GCG Glycine G GGT, GGC, GGA, GGG Proline P CCT, CCC, CCA, CCGThreonine T ACT, ACC, ACA, ACGSerine S TCT, TCC, TCA, TCG, AGT, AGCTyrosine Y TAT, TACTryptophan W TGGGlutamine Q CAA, CAGAsparagine N AAT, AACHistidine H CAT, CACGlutamic acid E GAA, GAGAspartic acid D GAT, GACLysine K AAA, AAGArginine R CGT, CGC, CGA, CGG, AGA, AGG
![Page 9: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/9.jpg)
CS262 Lecture 9, Win07, Batzoglou
atg
tga
ggtgag
ggtgag
ggtgag
caggtg
cagatg
cagttg
caggccggtgag
![Page 10: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/10.jpg)
CS262 Lecture 9, Win07, Batzoglou
Splice Sites
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
![Page 11: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/11.jpg)
CS262 Lecture 9, Win07, Batzoglou
HMMs for Gene Recognition
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
exon exon exonintronintronintergene intergene
Intergene State
First Exon State
IntronState
![Page 12: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/12.jpg)
CS262 Lecture 9, Win07, Batzoglou
HMMs for Gene Recognition
exon exon exonintronintronintergene intergene
Intergene State
First Exon State
IntronState
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
![Page 13: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/13.jpg)
CS262 Lecture 9, Win07, Batzoglou
Duration HMMs for Gene Recognition
TAA A A A A A A A A A A AA AAT T T T TT TT T T TT T T TG GGG G G G GGGG G G G GCC C C C C C
Exon1 Exon2 Exon3
Duration d
iPINTRON(xi | xi-1…xi-w)
PEXON_DUR(d)iPEXON((i – j + 2)%3)) (xi | xi-1…xi-w)
j+2
P5’SS(xi-3…xi+4)
PSTOP(xi-4…xi+3)
![Page 14: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/14.jpg)
CS262 Lecture 9, Win07, Batzoglou
Genscan
• Burge, 1997
• First competitive HMM-based gene finder, huge accuracy jump
• Only gene finder at the time, to predict partial genes and genes in both strands
Features– Duration HMM– Four different parameter sets
• Very low, low, med, high GC-content
![Page 15: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/15.jpg)
CS262 Lecture 9, Win07, Batzoglou
Using Comparative Information
![Page 16: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/16.jpg)
CS262 Lecture 9, Win07, Batzoglou
Using Comparative Information
• Hox cluster is an example where everything is conserved
![Page 17: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/17.jpg)
CS262 Lecture 9, Win07, Batzoglou
Patterns of Conservation
30% 1.3%0.14%
58%14%
10.2%
Genes Intergenic
Mutations Gaps Frameshifts
Separation
2-fold10-fold75-fold
![Page 18: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/18.jpg)
CS262 Lecture 9, Win07, Batzoglou
Comparison-based Gene Finders
• Rosetta, 2000• CEM, 2000
– First methods to apply comparative genomics (human-mouse) to improve gene prediction
• Twinscan, 2001– First HMM for comparative gene prediction in two genomes
• SLAM, 2002– Generalized pair-HMM for simultaneous alignment and gene prediction in two
genomes
• NSCAN, 2006– Best method to-date based on a phylo-HMM for multiple genome gene
prediction
![Page 19: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/19.jpg)
CS262 Lecture 9, Win07, Batzoglou
Twinscan
1. Align the two sequences (eg. from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), match ( | )
New “alphabet”: 4 x 3 = 12 lettersS = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
3. Run Viterbi using emissions ek(b) where b { A-, A:, A|, …, T| }
Emission distributions ek(b) estimated from real genes from human/mouse
eI(x|) < eE(x|): matches favored in exonseI(x-) > eE(x-): gaps (and mismatches) favored in introns
ExampleHuman: ACGGCGACGUGCACGUMouse: ACUGUGACGUGCACUUAlignment: ||:|:|||||||||:|
![Page 20: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/20.jpg)
CS262 Lecture 9, Win07, Batzoglou
SLAM – Generalized Pair HMM
d
e
Exon GPHMM1.Choose exon lengths (d,e).2.Generate alignment of length d+e.
![Page 21: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/21.jpg)
CS262 Lecture 9, Win07, Batzoglou
NSCAN—Multiple Species Gene Prediction• GENSCAN
• TWINSCAN
• N-SCAN
Target GGTGAGGTGACCAAGAACGTGTTGACAGTA
Target GGTGAGGTGACCAAGAACGTGTTGACAGTAConservation |||:||:||:|||||:||||||||......sequence
Target GGTGAGGTGACCAAGAACGTGTTGACAGTAInformant1 GGTCAGC___CCAAGAACGTGTAG......Informant2 GATCAGC___CCAAGAACGTGTAG......Informant3 GGTGAGCTGACCAAGATCGTGTTGACACAA
...
),...,,...,|( 1 oiioiii TTP III),...,|( 1 oiii TTTP
),...,,,...,|,( 11 oiioiiii TTTP III
Target sequence:
Informant sequences (vector):
Joint prediction (use phylo-HMM):
![Page 22: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/22.jpg)
CS262 Lecture 9, Win07, Batzoglou
NSCAN—Multiple Species Gene Prediction
X
C Y
Z H
M R
)|()|()|()|()|()|()(
),,,,,,(
1
ZRPZMPYZPYHPXYPXCPAP
ZYXRMCHP
X
C
Y
Z
H
M R
)|()|()|()|()|()|()(
),,,,,,(
ZRPZMPXCPYZPYXPHYPHP
ZYXRMCHP
![Page 23: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/23.jpg)
CS262 Lecture 9, Win07, Batzoglou
Performance Comparison
GENSCANGeneralized HMMModels human sequence
TWINSCANGeneralized HMMModels human/mouse alignments
N-SCANPhylo-HMMModels multiple sequence evolution
NSCAN human/mouse
>Human/multiple
informants
![Page 24: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/24.jpg)
CS262 Lecture 9, Win07, Batzoglou
• 2-level architecture• No Phylo-HMM that models alignments
CONTRAST
Human tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Macaque tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Mouse ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca Rat ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca Rabbit t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Dog t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta Cow t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-ctaArmadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta Elephant gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Tenrec tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Opossum ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg Chicken ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg
SVM SVM
CRF
X
Y
a b a b
![Page 25: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/25.jpg)
CS262 Lecture 9, Win07, Batzoglou
CONTRAST
![Page 26: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/26.jpg)
CS262 Lecture 9, Win07, Batzoglou
• log P(y | x) ~ wTF(x, y)
• F(x, y) = Si f(yi-1, yi, i, x)
• f(yi-1, yi, i, x):
1{yi-1 = INTRON, yi = EXON_FRAME_1} 1{yi-1 = EXON_FRAME_1, xhuman,i-2,…, xhuman,i+3 = ACCGGT) 1{yi-1 = EXON_FRAME_1, xhuman,i-1,…, xdog,i+1 = ACC, AGC) (1-c)1{a<SVM_DONOR(i)<b} (optional) 1{EXON_FRAME_1, EST_EVIDENCE}
CONTRAST - Features
![Page 27: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/27.jpg)
CS262 Lecture 9, Win07, Batzoglou
• Accuracy increases as we add informants
• Diminishing returns after ~5 informants
CONTRAST – SVM accuraciesSN SP
![Page 28: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/28.jpg)
CS262 Lecture 9, Win07, Batzoglou
CONTRAST - Decoding
Viterbi Decoding:
maximize P(y | x)
Maximum Expected Boundary Accuracy Decoding:
maximize Si,B 1{yi-1, yi is exon boundary B} Accuracy(yi-1, yi, B | x)
Accuracy(yi-1, yi, B | x) = P(yi-1, yi is B | x) – (1 – P(yi-1, yi is B | x))
![Page 29: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/29.jpg)
CS262 Lecture 9, Win07, Batzoglou
CONTRAST - Training
Maximum Conditional Likelihood Training:
maximize L(w) = Pw(y | x)
Maximum Expected Boundary Accuracy Training:
ExpectedBoundaryAccuracy(w) = Si Accuracyi
Accuracyi = SB 1{(yi-1, yi is exon boundary B} Pw(yi-1, yi is B | x) -
SB’ ≠ B P(yi-1, yi is exon boundary B’ | x)
![Page 30: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/30.jpg)
CS262 Lecture 9, Win07, Batzoglou
N-SCAN(Mouse)
CONTRAST(Mouse)
CONTRAST(Eleven
Informants)Gene Sn 35.6 50.8 58.6Gene Sp 25.1 29.3 35.5
Exon Sn 84.2 90.8 92.8Exon Sp 64.6 70.5 72.5Nucleotide Sn 90.8 96.0 96.9
Nucleotide Sp 67.9 70.0 72.0
Performance Comparison
N-SCAN(Mouse)
CONTRAST(Mouse)
CONTRAST(Eleven
Informants)
Gene Sn 46.8 60.7 65.4Gene Sp 31.7 40.6 46.2Exon Sn 89.7 92.6 93.9Exon Sp 66.9 74.8 76.2
Nucleotide Sn 93.7 95.7 96.7
Nucleotide Sp 69.3 74.3 75.8
De Novo
EST-assisted
HumanMacaqueMouseRatRabbitDogCowArmadilloElephantTenrecOpossumChicken
![Page 31: Gene Recognition](https://reader035.fdocuments.us/reader035/viewer/2022062813/5681658c550346895dd855a7/html5/thumbnails/31.jpg)
CS262 Lecture 9, Win07, Batzoglou
Performance Comparison