Post on 01-Apr-2015
Center for Biological Sequence Analysis
Prokaryotic gene finding
Marie Skovgaard, Ph.D. student
marie@cbs.dtu.dk
Prokarya
>AE006641GTATACTCTTCTTCCCTATACATTGTCGCAGCAAGCTTAGTTTCTTTAGCCTCTCTGCTTTCATTATTACTTATAATCTTAATAGCAAGGAGACATATGATAGAGTATTTCTATATGATTCCTTCGTTCGTTTATATGAACTTTATTGTCGCACTAAACTTCACTGCAATATTTTTAGAGTTAATAAGAGCACCTAGAGTGTGGGTAAAAACTGAAAGAAGTGCCAAGGTTACGGGGGAGGTCATGGGATGATAACTGAATTTTTACTTAAAAAGAAATTAGAAGAACATTTAAGCCATGTAAAGGAAGAGAATACGATATATGTAACAGATTTAGTAAGATGCCCCAGAAGAGTAAGATATGAGAGTGAATACAAGGAGCTTGCAATCTCTCAGGTTTACGCGCCTTCAGCTATTTTAGGGGACATATTGCATCTCGGTCTTGAAAGCGTATTAAAAGGGAACTTTAATGCAGAAACTGAAGTTGAAACTCTGAGAGAAATTAACGTCGGAGGTAAAGTTTATAAAATTAAAGGAAGAGCCGATGCAATAATTAGAAATGACAACGGGAAGAGTATTGTAATTGAGATAAAAACTTCTAGAAGTGATAAAGGATTACCTCTAATTCATCATAAAATGCAGCTACAGATATATTTATGGTTATTTAGTGCAGAAAAAGGTATACTAGTTTACATAACTCCAGATAGGATAGCTGAGTATGAAATAAACGAACCTTTAGATGAAGCAACAATAGTAAGACTTGCAGAGGATACAATAATGTTACAAAACTCACCTAGATTCAACTGGGAATGTAAATATTGCATATTTTCCGTCATTTGCCCAGCTAAACTAACCTAAAATTAAAATCTCTCATCGATATAATTAAATTGTGCACACTAGACCAGTAGTTGCCACAATAGCTGGGAGTGACAGTGGAGGAGGTGCTGGATTACAGGCTGATCTAAAGACGTTTAGCGCATTAGGAGTTTTTGGTACAACAATAATAACCGGTTTAACAGCACAGAATACAAGAACAGTTACAAAAGTATTAGAGATACCATTAGATTTCATTGAAGCTCAGTTTGATGCGGTTTGCCTAGATTTACATCCAACTCACGCCAAAACTGGAATGTTAGCTTCTGGTAAAGTGGTAGAACTTGTACTGAGAAAAATTAGAGAGTATAACATAAAACTAGTTTTAGATCCAGTGATGGTTGCGAAATCTGGATCATTATTGGTAACAGAGGATATCTCGGAGCAAATAAAAAAGGCGATGAAGGAGGCCATAATATCTACTCCAAACAGATATGAAGCTGAGATAATAAATAAGACAAAGATTAATAGTCAAGATGATGTTATAAAAGCGGCAAGGGAAATTTATTCTAAGTATGGGAATGTTGTAGTTAAAGGATTTAATGGAGTAGATTACGCCATAATTGACGGAGAAGAAATAGAGTTAAAAGGTGATTACATCAGTACTAAAAATACACATGGTAGTGGAGACGTATTTTCTGCCTCCATAACTGCATATCTTGCCTTGGGATACAAACTTAAAGATGCATTAATAAGAGCTAAAAAATTCGCTACAATGACAGTCAAATACGGTTTGGACTTAGGAGGAGGATATGGACCAGTAGATCCCTTTGCCCCTATAGAGTCCATAGTGAAGAGAGAAGAAGGAAGAAATCAGCTAGAAAACTTACTTTGGTACTTAGAGTCTAATCTTAACGTTATACTTAAACTAATTAACG
Can you spot the gene?
/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/
Identifying open reading frames
/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/
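A minimal Python sketch of how the pattern on this slide could be applied. It assumes the pattern is read as a start codon, a lazily repeated run of three-letter codons, and a stop codon, and that only the forward strand is scanned. The function name find_orfs and the min_codons parameter are illustrative, not part of the slides.

```python
import re

# Assumption: start codon, then as few whole codons as possible, then a stop codon.
ORF_PATTERN = re.compile(r"(ATG|TTG|GTG)((?:...)*?)(TGA|TAG|TAA)")

def find_orfs(dna, min_codons=0):
    """Return (start, end, sequence) for every ORF on the forward strand.

    Every position is tried as a possible start, so overlapping and nested
    ORFs are all reported. min_codons counts codons between start and stop.
    """
    orfs = []
    for start in range(len(dna)):
        m = ORF_PATTERN.match(dna, start)
        if m and len(m.group(2)) // 3 >= min_codons:
            orfs.append((m.start(), m.end(), m.group(0)))
    return orfs

print(find_orfs("CCATGAAATAGGG"))
```

Because the codon group is lazy, each candidate start is extended only to the first in-frame stop codon. Even so, almost any sequence contains many such matches, which is why ORF detection alone over-predicts genes.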
A. pernix (43% AT)
Why care about over-annotated genes?
Genome comparison:
• Fraction of known proteins
• Average gene length
• Amino acid composition
The quality of our databases
To gain biological knowledge
Regular expression
Regular expression: /[AT][CG][AC][ACGT]*A[TG][CG]/
The regular expression is able to find all possible sequences, but it does not distinguish between the consensus sequence and a highly unlikely sequence:
ACAC--ATC or TGCT--AGG
Weight matrices can be used to score the sequence, but they do not deal with insertions and deletions.
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC
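A small Python check of the point made on this slide: both example sequences are accepted by the regular expression, so a plain match cannot rank them. Removing the gap characters before matching is an assumption made here, and matches_motif is an illustrative helper name.

```python
import re

# The motif from the slide, written as a Python regular expression.
MOTIF = re.compile(r"[AT][CG][AC][ACGT]*A[TG][CG]")

def matches_motif(aligned):
    """Return True if the sequence, with gap characters removed, fits the motif."""
    return MOTIF.fullmatch(aligned.replace("-", "")) is not None

for candidate in ("ACAC--ATC", "TGCT--AGG"):
    print(candidate, matches_motif(candidate))
```

The answer is yes or no for each sequence. There is no score that would separate the consensus-like sequence from the unlikely one.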
Markov model
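Emission probabilities (letters without a value on the slide are omitted):
State 1: A 0.8, T 0.2
State 2: C 0.8, G 0.2
State 3: A 0.8, C 0.2
State 4: A 1.0
State 5: G 0.2, T 0.8
State 6: C 0.8, G 0.2
Insert state: A 0.2, C 0.4, G 0.2, T 0.2
Transition probabilities shown in the diagram: 1.0, 1.0, 0.4, 1.0, 1.0, 0.6, 0.6, 0.4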
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC
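A hedged Python sketch of how this Markov model assigns a probability to a sequence. The emission values are those on the slide. Which transition each number on the slide belongs to is an assumption made here: 0.6 to enter the insert state, 0.4 to stay in it, 0.6 to leave it, 0.4 to skip it, and 1.0 everywhere else. The function name path_probability is illustrative.

```python
# Emission probabilities of the six main states, in order (values from the slide).
MAIN_STATES = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"G": 0.2, "T": 0.8},
    {"C": 0.8, "G": 0.2},
]
INSERT_STATE = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}

# Assumed assignment of the transition values printed on the slide.
ENTER_INSERT, STAY_INSERT, LEAVE_INSERT, SKIP_INSERT = 0.6, 0.4, 0.6, 0.4

def path_probability(seq):
    """Probability of an ungapped sequence: the first three letters use main
    states 1-3, the last three use main states 4-6, and anything in between
    is emitted by the insert state."""
    if len(seq) < 6:
        raise ValueError("sequence must have at least six letters")
    head, middle, tail = seq[:3], seq[3:-3], seq[-3:]
    p = 1.0
    for state, letter in zip(MAIN_STATES[:3], head):
        p *= state.get(letter, 0.0)
    if middle:
        p *= ENTER_INSERT
        for i, letter in enumerate(middle):
            if i > 0:
                p *= STAY_INSERT  # one self-loop per extra inserted letter
            p *= INSERT_STATE.get(letter, 0.0)
        p *= LEAVE_INSERT
    else:
        p *= SKIP_INSERT
    for state, letter in zip(MAIN_STATES[3:], tail):
        p *= state.get(letter, 0.0)
    return p

print(path_probability("ACACATC"), path_probability("TGCTAGG"))
```

Under these assumptions the consensus-like sequence ACAC--ATC scores about 0.047 and the unlikely TGCT--AGG about 0.000023. Unlike the regular expression, the model separates the two while still allowing insertions.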
Profile HMM
Profile HMMs have a predefined architecture, and the parameters are estimated from multiple sequence alignments.
Profile HMMs are not useful for gene finding, since all genes in an organism cannot be aligned in a meaningful way.
Begin End
Markov model for gene finding
Define a simple architecture:
/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/
S1: ATG, GTG, TTG (start codons)
S2: A, T, G, C
S3: A, T, G, C
S4: A, T, G, C
S5: TAG, TAA, TGA (stop codons)
Markov models
Knowledge of the structure of genes is used to define the architecture of the model.
Sequences (x) from known genes are used to estimate the parameters of the model – training of the model.
The training is done by counting the number of times a nucleotide occurs in a given state and dividing this count by the number of sequences used in training, which gives the frequencies.
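The counting procedure described above can be sketched in Python as follows. This is a toy illustration that assumes every training sequence has the same length and that position i of each sequence belongs to state i. The function name train_emissions and the three-letter training sequences are made up for the example.

```python
from collections import Counter

def train_emissions(sequences):
    """Estimate per-state emission frequencies by counting.

    Assumption: all sequences have equal length and position i is emitted
    by state i, so each state sees exactly one letter per sequence.
    """
    n = len(sequences)
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all training sequences must have the same length")
    model = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)
        # Divide each count by the number of training sequences.
        model.append({letter: count / n for letter, count in counts.items()})
    return model

model = train_emissions(["ACA", "TCA", "ACA", "AGA", "ACC"])
print(model)
```

With five training sequences, a letter seen four times in a state gets frequency 0.8, and a letter never seen gets no entry, which means probability zero. Real gene finders usually smooth such zero counts. That step is not covered here.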
Training
Sequence: x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... xn
States: S1, S2, S3, S4, S5
S1: ATG, GTG, TTG (start codons)
S2: A, T, G, C
S3: A, T, G, C
S4: A, T, G, C
S5: TAG, TAA, TGA (stop codons)
Model after training
S1: ATG 0.77, TTG 0.11, GTG 0.12, CTG 0.00
S2: A 0.22, T 0.24, G 0.27, C 0.27
S3: A 0.25, T 0.23, G 0.27, C 0.25
S4: A 0.26, T 0.24, G 0.25, C 0.25
S5: TAG 0.6, TAA 0.3, TGA 0.1
0.98
The trained model can be used to search for genes in DNA sequences.
Searching with the HMM
Sequence: ATG A T T T C G C G C G A T ... T A G

State   ATG     A              T                   T                        ...
S1      0.77    0.00           0.00                0.00
S2      0.00    (0.22*0.77)    0.00                0.00
S3      0.00    0.00           (0.23*0.22*0.77)    0.00
S4      0.00    0.00           0.00                (0.24*0.23*0.22*0.77)
S5      0.00    0.00           0.00                                         = P(x|M)
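The first columns of the search table can be reproduced with a short Python sketch. The emission values are those of the trained model. Assigning them to S1 through S4 in this order is an assumption, chosen because it matches the products shown in the table. Transition probabilities are left out because the table shows only emission products. The name prefix_probabilities is illustrative.

```python
# Trained emission probabilities (assumed assignment to states S1..S4).
S1 = {"ATG": 0.77, "TTG": 0.11, "GTG": 0.12, "CTG": 0.00}
S2 = {"A": 0.22, "T": 0.24, "G": 0.27, "C": 0.27}
S3 = {"A": 0.25, "T": 0.23, "G": 0.27, "C": 0.25}
S4 = {"A": 0.26, "T": 0.24, "G": 0.25, "C": 0.25}

def prefix_probabilities(start_codon, following):
    """Running probability along the path S1 -> S2 -> S3 -> S4.

    start_codon is emitted by S1; the next (up to three) letters are
    emitted by S2, S3 and S4 in turn.
    """
    p = S1[start_codon]
    running = [p]
    for state, letter in zip((S2, S3, S4), following):
        p *= state[letter]
        running.append(p)
    return running

print(prefix_probabilities("ATG", "ATT"))
```

Each entry is the previous one multiplied by one more emission probability, so the values shrink with every letter read.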
Log-Odds score
The probability of a sequence becomes vanishingly small as the sequence x becomes longer.
This is solved by defining a background (NULL) model, for example a random distribution: A=T=C=G=0.25
From this the Log-Odds score can be calculated: log(P(x|M)/P(x|NULL))
A high Log-Odds score corresponds to a sequence that looks more like the gene model than the background model.
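A minimal Python sketch of a log-odds score, with the sign chosen so that a higher score means more gene-like. For brevity it assumes a zero-order model (one probability per letter) in place of the full gene model. The natural logarithm and the example probabilities in GENE_LIKE are illustrative choices, not values from the slides.

```python
import math

# Hypothetical zero-order "gene" model used only for illustration.
GENE_LIKE = {"A": 0.5, "C": 0.1, "G": 0.1, "T": 0.3}

def log_odds(seq, model, null=0.25):
    """Sum of per-letter log ratios, equal to log(P(x|M) / P(x|NULL)).

    Adding logarithms avoids multiplying many small probabilities, which
    would underflow to zero for long sequences.
    """
    return sum(math.log(model[letter] / null) for letter in seq)

print(log_odds("AATA", GENE_LIKE))   # positive: resembles the model
print(log_odds("CGCG", GENE_LIKE))   # negative: resembles the background
print(0.25 ** 1000)                  # the raw probability underflows
```

The score grows or shrinks linearly with sequence length, so it stays representable where the raw probability would underflow.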
Is the model too simple?
S1: ATG, GTG, TTG (start codons)
S2: A, T, G, C
S3: A, T, G, C
S4: A, T, G, C
S5: TAG, TAA, TGA (stop codons)
Codon usage
Synonymous codons encode the same amino acid. At random, synonymous codons would be expected to be used with equal frequencies. In real life, synonymous codons have different frequencies.
Different species have consistent and characteristic codon biases. Laterally transferred genes and genes from plasmids and phages will have atypical codon usage.
Variations in codon usage within an organism can be modelled by different coding models in the HMM.
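Codon bias can be measured by counting codons in frame and comparing the counts within one synonymous family. The Python sketch below does this for the six leucine codons. The helper names and the short test sequence are illustrative only.

```python
from collections import Counter

# The six codons that encode leucine in the standard genetic code (DNA alphabet).
LEUCINE = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]

def codon_counts(cds):
    """Count in-frame codons of a coding sequence, ignoring a trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def synonymous_fractions(counts, codons):
    """Fraction of each codon among the given synonymous codons."""
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {}
    return {c: counts[c] / total for c in codons}

counts = codon_counts("CTGCTGCTGTTA")
print(synonymous_fractions(counts, LEUCINE))
```

With no bias, each of the six leucine codons would have a fraction near 1/6. A strong deviation from that, computed over many genes, is the kind of signal a coding model can exploit.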
1st  3rd  2nd position
pos  pos  U            C            A             G
U    U    30,407 Phe   11,523 Ser   22,048 Tyr     7,062 Cys
U    C    22,581 Phe   11,766 Ser   16,669 Tyr     8,846 Cys
U    A    18,943 Leu    9,793 Ser    2,706 Stop    1,260 Stop
U    G    18,629 Leu   12,195 Ser      326 Stop   20,756 Trp
C    U    15,018 Leu    9,569 Pro   17,631 His    28,458 Arg
C    C    15,104 Leu    7,491 Pro   13,272 His    29,968 Arg
C    A     5,316 Leu   11,496 Pro   20,912 Gln     4,860 Arg
C    G    71,710 Leu   31,614 Pro   39,285 Gln     7,404 Arg
A    U    41,375 Ile   12,223 Thr   24,189 Asn    11,982 Ser
A    C    34,261 Ile   31,889 Thr   29,529 Asn    21,907 Ser
A    A     5,967 Ile    9,683 Thr   45,812 Lys     2,899 Arg
A    G    37,994 Met   19,682 Thr   14,076 Lys     1,694 Arg
G    U    24,910 Val   20,808 Ala   43,817 Asp    33,731 Gly
G    C    20,800 Val   34,770 Ala   25,996 Asp    40,396 Gly
G    A    14,850 Val   27,468 Ala   53,780 Glu    10,902 Gly
G    G    35,979 Val   45,862 Ala   24,312 Glu    15,118 Gly
Fields: [number] [amino acid]
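Is the model too simple?
[Figure: model with states S1, S2 and S3 in which the coding region emits whole codons]
Start codons: ATG, GTG, TTG
Stop codons: TAG, TAA, TGA
Codon state, emitting one of the 64 codons:
AAA ATA AGA ACA TAA TTA TGA TCA
AAT ATT AGT ACT TAT TTT TGT TCT
AAG ATG AGG ACG TAG TTG TGG TCG
AAC ATC AGC ACC TAC TTC TGC TCC
GAA GTA GGA GCA CAA CTA CGA CCA
GAT GTT GGT GCT CAT CTT CGT CCT
GAG GTG GGG GCG CAG CTG CGG CCG
GAC GTC GGC GCC CAC CTC CGC CCC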
HMM for gene finding
[Figure: HMM with states S1, S2, S3 and S4]
Start codons: ATG, GTG, TTG
Stop codons: TAG, TAA, TGA
Two codon states, each emitting one of the 64 codons:
AAA ATA AGA ACA TAA TTA TGA TCA
AAT ATT AGT ACT TAT TTT TGT TCT
AAG ATG AGG ACG TAG TTG TGG TCG
AAC ATC AGC ACC TAC TTC TGC TCC
GAA GTA GGA GCA CAA CTA CGA CCA
GAT GTT GGT GCT CAT CTT CGT CCT
GAG GTG GGG GCG CAG CTG CGG CCG
GAC GTC GGC GCC CAC CTC CGC CCC
Multiple coding models
[Figure: HMM with a start state S (ATG, GTG, TTG), an end state E (TAG, TAA, TGA) and 12 codon states between them, each listing all 64 codons]
Order of the model
A zero-order Markov model (state) has a probability for each letter in the state; the probabilities are independent of the previous sequence. The NULL model is a zero-order Markov model (A=T=G=C=0.25).
The probability of a letter in a first-order Markov model depends on the previous letter (di-nucleotide distributions).
A second-order model depends on the two previous letters (corresponding to a codon).
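The idea of model order can be made concrete with a short Python sketch that estimates a Markov model of any order by counting. The function name train_markov and the toy sequence are assumptions for illustration. With order 0 the context is the empty string, so the result is a plain letter distribution.

```python
from collections import Counter, defaultdict

def train_markov(sequences, order):
    """Estimate P(letter | previous `order` letters) from example sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]   # "" when order == 0
            counts[context][seq[i]] += 1
    model = {}
    for context, letters in counts.items():
        total = sum(letters.values())
        model[context] = {letter: n / total for letter, n in letters.items()}
    return model

print(train_markov(["AAAC"], 0))   # zero order: letter frequencies
print(train_markov(["AAAC"], 1))   # first order: conditioned on one letter
print(train_markov(["AAAC"], 2))   # second order: conditioned on two letters
```

Each increase in order conditions the probability on one more preceding letter, so the same data are split across more contexts.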
Order of the coding model
Inter-codon dependencies are correlations between amino acids typically found in proteins. They reflect typical features of proteins and can be used to improve the performance of the gene finder.
The use of higher order coding models in gene finding is a way to capture these inter-codon dependencies.
Higher order models require more training data and more computational time when searching.
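Why higher order models need more training data can be seen from the number of probabilities that must be estimated. The Python sketch below assumes a nucleotide-level model with a four-letter alphabet. The function name is illustrative.

```python
def n_conditional_probabilities(order, alphabet_size=4):
    """Number of probabilities in a Markov model of the given order:
    alphabet_size ** order contexts, each with alphabet_size letters."""
    return alphabet_size ** (order + 1)

for order in range(6):
    print(order, n_conditional_probabilities(order))
```

The count grows by a factor of four with each step in order (4, 16, 64, 256, ...), while the amount of known gene sequence available for counting stays the same.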
The Shine-Dalgarno sequence
The ribosome binds to the messenger RNA through base pairing to the 30S ribosomal subunit. The binding site is the Shine-Dalgarno sequence (SD). The SD is a purine-rich sequence (consensus sequence: AGGAG) at the 5' end of most prokaryotic mRNAs. The SD is found 5-10 base pairs upstream from the start codon.
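A simplified Python sketch of how a candidate start codon could be checked for an upstream SD site. It assumes an exact match to the consensus AGGAG, and it reads "5-10 base pairs upstream" as the number of bases between the end of the motif and the start codon. Real SD sites vary, so a gene finder would score them with a model and not with an exact match. The function name has_shine_dalgarno is hypothetical.

```python
def has_shine_dalgarno(dna, start, motif="AGGAG", min_gap=5, max_gap=10):
    """Return True if `motif` ends between min_gap and max_gap bases
    upstream of the start codon at index `start` (assumed spacing rule)."""
    for gap in range(min_gap, max_gap + 1):
        end = start - gap
        begin = end - len(motif)
        if begin >= 0 and dna[begin:end] == motif:
            return True
    return False

# Toy sequence: motif, a 7-base spacer, then a start codon at index 14.
dna = "TT" + "AGGAG" + "ACCTTCA" + "ATGAAATAA"
print(has_shine_dalgarno(dna, 14))
```

A check like this can help choose between several in-frame start codons of the same ORF.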
EasyGene
R. prowazekii
GeneMark.hmm: http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi
Lukashin A. and Borodovsky M., “GeneMark.hmm: new solutions for gene finding”, NAR, 1998, Vol. 26, No. 4, pp. 1107-1115.
EasyGene: http://cbs.dtu.dk/services/EasyGene
Schou Larsen T. and Krogh A., “EasyGene – A prokaryotic gene finder that ranks ORFs by statistical significance”, BMC Bioinformatics, 2003, 4:21.