Center for Biological Sequence Analysis Prokaryotic gene finding Marie Skovgaard Ph.D. student...

Marie Skovgaard, Ph.D. student

marie@cbs.dtu.dk

Prokarya

>AE006641GTATACTCTTCTTCCCTATACATTGTCGCAGCAAGCTTAGTTTCTTTAGCCTCTCTGCTTTCATTATTACTTATAATCTTAATAGCAAGGAGACATATGATAGAGTATTTCTATATGATTCCTTCGTTCGTTTATATGAACTTTATTGTCGCACTAAACTTCACTGCAATATTTTTAGAGTTAATAAGAGCACCTAGAGTGTGGGTAAAAACTGAAAGAAGTGCCAAGGTTACGGGGGAGGTCATGGGATGATAACTGAATTTTTACTTAAAAAGAAATTAGAAGAACATTTAAGCCATGTAAAGGAAGAGAATACGATATATGTAACAGATTTAGTAAGATGCCCCAGAAGAGTAAGATATGAGAGTGAATACAAGGAGCTTGCAATCTCTCAGGTTTACGCGCCTTCAGCTATTTTAGGGGACATATTGCATCTCGGTCTTGAAAGCGTATTAAAAGGGAACTTTAATGCAGAAACTGAAGTTGAAACTCTGAGAGAAATTAACGTCGGAGGTAAAGTTTATAAAATTAAAGGAAGAGCCGATGCAATAATTAGAAATGACAACGGGAAGAGTATTGTAATTGAGATAAAAACTTCTAGAAGTGATAAAGGATTACCTCTAATTCATCATAAAATGCAGCTACAGATATATTTATGGTTATTTAGTGCAGAAAAAGGTATACTAGTTTACATAACTCCAGATAGGATAGCTGAGTATGAAATAAACGAACCTTTAGATGAAGCAACAATAGTAAGACTTGCAGAGGATACAATAATGTTACAAAACTCACCTAGATTCAACTGGGAATGTAAATATTGCATATTTTCCGTCATTTGCCCAGCTAAACTAACCTAAAATTAAAATCTCTCATCGATATAATTAAATTGTGCACACTAGACCAGTAGTTGCCACAATAGCTGGGAGTGACAGTGGAGGAGGTGCTGGATTACAGGCTGATCTAAAGACGTTTAGCGCATTAGGAGTTTTTGGTACAACAATAATAACCGGTTTAACAGCACAGAATACAAGAACAGTTACAAAAGTATTAGAGATACCATTAGATTTCATTGAAGCTCAGTTTGATGCGGTTTGCCTAGATTTACATCCAACTCACGCCAAAACTGGAATGTTAGCTTCTGGTAAAGTGGTAGAACTTGTACTGAGAAAAATTAGAGAGTATAACATAAAACTAGTTTTAGATCCAGTGATGGTTGCGAAATCTGGATCATTATTGGTAACAGAGGATATCTCGGAGCAAATAAAAAAGGCGATGAAGGAGGCCATAATATCTACTCCAAACAGATATGAAGCTGAGATAATAAATAAGACAAAGATTAATAGTCAAGATGATGTTATAAAAGCGGCAAGGGAAATTTATTCTAAGTATGGGAATGTTGTAGTTAAAGGATTTAATGGAGTAGATTACGCCATAATTGACGGAGAAGAAATAGAGTTAAAAGGTGATTACATCAGTACTAAAAATACACATGGTAGTGGAGACGTATTTTCTGCCTCCATAACTGCATATCTTGCCTTGGGATACAAACTTAAAGATGCATTAATAAGAGCTAAAAAATTCGCTACAATGACAGTCAAATACGGTTTGGACTTAGGAGGAGGATATGGACCAGTAGATCCCTTTGCCCCTATAGAGTCCATAGTGAAGAGAGAAGAAGGAAGAAATCAGCTAGAAAACTTACTTTGGTACTTAGAGTCTAATCTTAACGTTATACTTAAACTAATTAACG

Can you spot the gene?

/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/
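The pattern above can be tried out directly. The following is a minimal Python sketch, not the method used by any particular gene finder: the function name `find_orfs`, the `min_codons` parameter and the toy sequence are illustrative assumptions, and the codon group is written as `[ACGT]{3}` rather than three wildcards. Only the forward strand is scanned.

```python
import re

# Start codon, then whole codons (non-greedy), then the first in-frame stop codon.
ORF_PATTERN = r"(ATG|TTG|GTG)((?:[ACGT]{3})*?)(TGA|TAG|TAA)"


def find_orfs(dna, min_codons=0):
    """Return (start, end, sequence) for every ORF on the forward strand.

    The pattern is wrapped in a lookahead so that matches may overlap,
    which means every start codon in every frame is tried.
    """
    orfs = []
    for m in re.finditer(r"(?=(" + ORF_PATTERN + r"))", dna):
        seq = m.group(1)
        inner_codons = len(seq) // 3 - 2  # codons between start and stop
        if inner_codons >= min_codons:
            orfs.append((m.start(), m.start() + len(seq), seq))
    return orfs


# Hypothetical toy sequence, not taken from AE006641.
for start, end, seq in find_orfs("CCATGAAATTTTAGGG"):
    print(start, end, seq)
```

Run on a real genome, a scan like this reports a very large number of candidates, most of them not genes, which is why a statistical model is needed.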


Identifying open reading frames


A. pernix (43% AT)


Why care about over-annotated genes?

Genome comparison:
• Fraction of known proteins
• Average gene length
• Amino acid composition

The quality of our databases

To gain biological knowledge


Regular expression

Regular expression: /[AT][CG][AC][ACGT]*A[TG][CG]/

The regular expression is able to find all possible sequences, but it does not distinguish between the consensus sequence and a highly unlikely sequence:

ACAC--ATC or TGCT--AGG
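A short Python check illustrates the point: both sequences are accepted by the regular expression, so a plain match carries no information about how typical a sequence is. The helper name `matches` is an illustrative assumption; gap characters are stripped before matching.

```python
import re

MOTIF = re.compile(r"[AT][CG][AC][ACGT]*A[TG][CG]")


def matches(seq):
    """True if the ungapped sequence is fully described by the motif."""
    return MOTIF.fullmatch(seq.replace("-", "")) is not None


for seq in ["ACAC--ATC", "TGCT--AGG"]:
    print(seq, matches(seq))
```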

Weight matrices can be used to score the sequence but do not deal with insertions and deletions.

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC
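As a rough illustration of weight-matrix scoring, the sketch below estimates per-column nucleotide frequencies from the six ungapped columns of the alignment as printed on this slide and multiplies them along a candidate word. The column choice, the function names and the absence of pseudocounts are simplifying assumptions. The variable-length middle region is skipped entirely, which is exactly the limitation mentioned above.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATG", "ACCG--ATC"]
UNGAPPED_COLUMNS = [0, 1, 2, 6, 7, 8]  # columns without gap characters


def column_frequencies(alignment, columns):
    """Relative nucleotide frequency in each selected alignment column."""
    freqs = []
    for col in columns:
        counts = Counter(row[col] for row in alignment)
        total = sum(counts.values())
        freqs.append({base: n / total for base, n in counts.items()})
    return freqs


def score(word, freqs):
    """Product of the column frequencies of the letters in `word`."""
    p = 1.0
    for base, column in zip(word, freqs):
        p *= column.get(base, 0.0)  # unseen letters get probability 0
    return p


FREQS = column_frequencies(ALIGNMENT, UNGAPPED_COLUMNS)
print(score("ACAATC", FREQS))  # flanks of the consensus-like sequence
print(score("TGCAGG", FREQS))  # flanks of the unlikely sequence
```

Unlike the regular expression, the score separates the two candidates, but only for the fixed-length flanks.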


Markov model

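Emission probabilities per state, in the order shown in the figure (- = no value given):

State  A    C    G    T
1      0.8  -    -    0.2
2      -    0.8  0.2  -
3      0.8  0.2  -    -
4      1.0  -    -    -
5      -    -    0.2  0.8
6      -    0.8  0.2  -
7      0.2  0.4  0.2  0.2

Transition probabilities shown in the figure: 1.0, 1.0, 0.4, 1.0, 1.0

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC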


Profile HMM

Profile HMMs have a predefined architecture, and their parameters are estimated from multiple sequence alignments.

Profile HMMs are not useful for gene finding, since all the genes in an organism cannot be aligned in a meaningful way.

Begin End


Markov model for gene finding

Define a simple architecture:

Stop codons: TAG, TAA, TGA

Start codons: ATG, GTG, TTG

States: S1 S2 S3 S4 S5


Markov models

Knowledge of the structure of genes is used to define the architecture of the model.

Sequences (x) from known genes are used to estimate the parameters of the model – training of the model.

Training is done by counting the number of times each nucleotide occurs in a given state and dividing this count by the total number of observations in that state (for the start and stop states, this is the number of sequences used in training). This gives the frequencies.
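Training by counting can be sketched in a few lines of Python. The gene list is a hypothetical toy set, the function names are illustrative, and all interior nucleotides are pooled into a single state for brevity; a real model would keep separate counts for each state.

```python
from collections import Counter


def train_start_state(genes):
    """Frequency of each start codon among the training genes."""
    counts = Counter(gene[:3] for gene in genes)
    return {codon: n / len(genes) for codon, n in counts.items()}


def train_nucleotide_state(genes):
    """Nucleotide frequencies between the start and stop codons (pooled)."""
    counts = Counter()
    for gene in genes:
        counts.update(gene[3:-3])
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


# Hypothetical training set; real training uses known genes.
TOY_GENES = ["ATGAAATAG", "ATGCCCTAA", "GTGGGGTGA", "ATGTTTTAG"]
print(train_start_state(TOY_GENES))
print(train_nucleotide_state(TOY_GENES))
```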


Training

Sequence: x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... xn

States: S1

Stop codons: TAG, TAA, TGA

Start codons: ATG, GTG, TTG

States: S1 S2 S3 S4 S5


Model after training

A: 0.22  T: 0.24  G: 0.27  C: 0.27

TAG: 0.6  TAA: 0.3  TGA: 0.1

A: 0.25  T: 0.23  G: 0.27  C: 0.25

A: 0.26  T: 0.24  G: 0.25  C: 0.25

ATG: 0.77  TTG: 0.11  GTG: 0.12  CTG: 0.00

S1 S2 S3 S4 S5

The trained model can be used to search for genes in DNA sequences.


Searching with the HMM

Sequence: ATG A T T T C G C G C G A T ... T A G

States:

0.77  0.00  0.00  0.00
0.00  (0.22*0.77)  0.00  0.00
0.00  0.00  (0.23*0.22*0.77)  0.00
0.00  0.00  0.00  (0.24*0.23*0.22*0.77)
0.00  0.00  0.00  = P(x|M)


Log-Odds score

The probability of a sequence becomes vanishingly small as the sequence x becomes longer.

This is solved by defining a background (NULL) model, for example a random distribution: A=T=C=G=0.25

From this the Log-Odds score can be calculated: log(P(x|M)/P(x|NULL))

A high Log-Odds score corresponds to a sequence that looks more like the gene model than the background model.
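The score can be sketched in Python using the trained probabilities from the earlier slide. Several things here are assumptions made for illustration: the three nucleotide tables are assigned to states in the order implied by the search example (0.22, then 0.23, then 0.24 for A, T, T), the three states are cycled over the interior of the ORF, and transition probabilities are ignored. Working in log space avoids the numerical underflow described above.

```python
import math

START = {"ATG": 0.77, "TTG": 0.11, "GTG": 0.12}
STOP = {"TAG": 0.6, "TAA": 0.3, "TGA": 0.1}
# Assumed order of the three nucleotide states between start and stop.
CODING = [
    {"A": 0.22, "T": 0.24, "G": 0.27, "C": 0.27},
    {"A": 0.25, "T": 0.23, "G": 0.27, "C": 0.25},
    {"A": 0.26, "T": 0.24, "G": 0.25, "C": 0.25},
]


def log_prob_model(orf):
    """log P(x|M): start codon, interior nucleotides, stop codon."""
    start, interior, stop = orf[:3], orf[3:-3], orf[-3:]
    if start not in START or stop not in STOP or len(interior) % 3:
        return float("-inf")
    logp = math.log(START[start]) + math.log(STOP[stop])
    for i, base in enumerate(interior):
        logp += math.log(CODING[i % 3][base])
    return logp


def log_prob_null(orf):
    """log P(x|NULL) with A=T=C=G=0.25."""
    return len(orf) * math.log(0.25)


def log_odds(orf):
    return log_prob_model(orf) - log_prob_null(orf)


print(log_odds("ATGATTTAG"))
```

A positive value means the sequence is better explained by the gene model than by the background.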


Is the model too simple?

Stop codons: TAG, TAA, TGA

Start codons: ATG, GTG, TTG

States: S1 S2 S3 S4 S5


Codon usage

Synonymous codons encode the same amino acid. If codon choice were random, synonymous codons would be expected to be used with equal frequencies. In real genomes, synonymous codons are used with different frequencies.

Different species have consistent and characteristic codon biases. Laterally transferred genes and genes from plasmids and phages typically have atypical codon usage.

Variations in codon usage within an organism can be modelled in different coding models in the HMM.


Codon usage table (columns: 2nd position; rows: 1st and 3rd position)

1st  3rd   2nd = U      2nd = C      2nd = A       2nd = G
U    U     30,407 Phe   11,523 Ser   22,048 Tyr    7,062 Cys
U    C     22,581 Phe   11,766 Ser   16,669 Tyr    8,846 Cys
U    A     18,943 Leu   9,793 Ser    2,706 Stop    1,260 Stop
U    G     18,629 Leu   12,195 Ser   326 Stop      20,756 Trp
C    U     15,018 Leu   9,569 Pro    17,631 His    28,458 Arg
C    C     15,104 Leu   7,491 Pro    13,272 His    29,968 Arg
C    A     5,316 Leu    11,496 Pro   20,912 Gln    4,860 Arg
C    G     71,710 Leu   31,614 Pro   39,285 Gln    7,404 Arg
A    U     41,375 Ile   12,223 Thr   24,189 Asn    11,982 Ser
A    C     34,261 Ile   31,889 Thr   29,529 Asn    21,907 Ser
A    A     5,967 Ile    9,683 Thr    45,812 Lys    2,899 Arg
A    G     37,994 Met   19,682 Thr   14,076 Lys    1,694 Arg
G    U     24,910 Val   20,808 Ala   43,817 Asp    33,731 Gly
G    C     20,800 Val   34,770 Ala   25,996 Asp    40,396 Gly
G    A     14,850 Val   27,468 Ala   53,780 Glu    10,902 Gly
G    G     35,979 Val   45,862 Ala   24,312 Glu    15,118 Gly

Fields : [number] [amino acid]
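The unequal use of synonymous codons can be read straight from these counts. The sketch below computes the relative usage of the six leucine codons; the codon labels are an assumption based on the standard layout of the genetic code table (third position in the order U, C, A, G within each cell), and the function name is illustrative.

```python
# Leucine counts from the table; codon labels assume the standard table layout.
LEU_COUNTS = {
    "UUA": 18943, "UUG": 18629,
    "CUU": 15018, "CUC": 15104, "CUA": 5316, "CUG": 71710,
}


def relative_usage(counts):
    """Fraction of an amino acid's occurrences encoded by each codon."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


usage = relative_usage(LEU_COUNTS)
for codon, fraction in sorted(usage.items(), key=lambda kv: -kv[1]):
    print(codon, round(fraction, 3))
```

With equal usage each leucine codon would account for about 17% of the total; here a single codon accounts for roughly half.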


Is the model too simple?

Start codons: ATG, GTG, TTG

Stop codons: TAG, TAA, TGA

AAA ATA AGA ACA TAA TTA TGA TCA
AAT ATT AGT ACT TAT TTT TGT TCT
AAG ATG AGG ACG TAG TTG TGG TCG
AAC ATC AGC ACC TAC TTC TGC TCC
GAA GTA GGA GCA CAA CTA CGA CCA
GAT GTT GGT GCT CAT CTT CGT CCT
GAG GTG GGG GCG CAG CTG CGG CCG
GAC GTC GGC GCC CAC CTC CGC CCC


HMM for gene finding

Stop codons: TAG, TAA, TGA

Start codons: ATG, GTG, TTG


Multiple coding models

Stop codons: TAG, TAA, TGA

Start codons: ATG, GTG, TTG


Order of the model

A zero-order Markov model (state) assigns a probability to each letter in the state; the probabilities are independent of the preceding sequence. The NULL model is a zero-order Markov model (A=T=G=C=0.25).

The probability of a letter in a first-order Markov model depends on the previous letter (di-nucleotide distributions).

A second-order model depends on the two previous letters (three letters in total, corresponding to a codon).
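The difference between the orders can be made concrete with a small Python sketch that estimates the conditional probabilities from a sequence by counting. The function name `train_markov` and the toy sequence are illustrative assumptions; with order 0 the context is the empty string.

```python
from collections import Counter, defaultdict


def train_markov(sequence, order):
    """Estimate P(letter | previous `order` letters) by counting."""
    counts = defaultdict(Counter)
    for i in range(order, len(sequence)):
        context = sequence[i - order:i]  # empty string when order == 0
        counts[context][sequence[i]] += 1
    return {
        context: {base: n / sum(c.values()) for base, n in c.items()}
        for context, c in counts.items()
    }


toy = "ATATAT"
print(train_markov(toy, 0))  # letter frequencies only
print(train_markov(toy, 1))  # depends on the previous letter
print(train_markov(toy, 2))  # depends on the two previous letters
```

The number of contexts grows as 4 to the power of the order, which is why higher orders need more training data.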


Order of the coding model

Inter-codon dependencies are correlations between amino acids typically found in proteins. They reflect typical features of proteins and can be used to improve the performance of the gene finder.

The use of higher-order coding models in gene finding is a way to capture these inter-codon dependencies.

Higher-order models require more training data and more computational time when searching.


The Shine-Dalgarno sequence

The ribosome binds to the messenger RNA through base pairing between the mRNA and the 16S rRNA of the 30S ribosomal subunit. The binding site on the mRNA is the Shine-Dalgarno sequence (SD). The SD is a purine-rich sequence (consensus sequence: AGGAG) at the 5' end of most prokaryotic mRNAs. The SD is found 5-10 base pairs upstream of the start codon.
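A simple way to use this signal is to look for the consensus motif in a window upstream of a candidate start codon. The sketch below is an illustration only: the function name, the exact-match requirement and the way the distance is measured (from the end of the motif to the first base of the start codon) are assumptions, and the example sequence is hypothetical. Real gene finders typically score the site with a weight matrix or model rather than demanding an exact match.

```python
def find_sd_sites(dna, start_index, motif="AGGAG", min_gap=5, max_gap=10):
    """Positions of `motif` ending min_gap..max_gap bases before the start codon."""
    hits = []
    for gap in range(min_gap, max_gap + 1):
        pos = start_index - gap - len(motif)
        if pos >= 0 and dna[pos:pos + len(motif)] == motif:
            hits.append(pos)
    return hits


# Hypothetical upstream region followed by an ATG start codon.
dna = "TTAGGAGGTAACATATGAAA"
start = dna.index("ATG")
print(start, find_sd_sites(dna, start))
```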


EasyGene


R. prowazekii


GeneMark.hmm: http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi

Lukashin A. and Borodovsky M., “GeneMark.hmm: new solutions for gene finding”, NAR, 1998, Vol. 26, No. 4, pp. 1107-1115.

EasyGene: http://cbs.dtu.dk/services/EasyGene

Schou Larsen T. and Krogh A., “EasyGene – A prokaryotic gene finder that ranks ORFs by statistical significance”, BMC Bioinformatics, 2003, 4:21.
