Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev

Post on 08-Jan-2016


Simple cluster structure of triplet distributions in genetic texts

Andrei Zinovyev

Institut des Hautes Études Scientifiques, Bures-sur-Yvette

Markov chain models

Transition probabilities = frequencies of N-grams

…AGGTCGATC…

Sliding window of width W:

AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC

Triplet frequencies in the window: $f_{AAA}, f_{AAC}, \dots, f_{GGG} = f_{ijk}$, with $i, j, k \in \{A, C, G, T\}$
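The link between N-gram frequencies and transition probabilities can be illustrated with a small Python sketch. It assumes overlapping N-grams counted over the whole text and a Markov chain of order N-1; the function names are illustrative and do not come from the talk.

```python
from collections import Counter


def ngram_frequencies(seq, n=3):
    """Relative frequencies of all overlapping n-grams in seq."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def transition_probabilities(freqs):
    """P(last letter | preceding letters) = f(gram) / sum of f over grams sharing the context."""
    context_totals = Counter()
    for gram, f in freqs.items():
        context_totals[gram[:-1]] += f
    return {gram: f / context_totals[gram[:-1]] for gram, f in freqs.items()}


freqs = ngram_frequencies("AACAAG", n=3)
probs = transition_probabilities(freqs)
```

Here `probs["AAC"]` is the estimated probability of seeing C after the context AA.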

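Protein-coding sequences

AGGTCGATGAATCCGTATTGACAAATGAATCCGTAATGACATGACAATCCAACATGACAAT

[Figure: a bacterial gene marked on the sequence; labels: correct frame, $f_{ijk}$, $f^{(1)}_{ijk}$, $f^{(2)}_{ijk}$]

$$P^{(1)} f_{ijk} = \sum_{l,m,n} f_{lij} \, f_{kmn}$$

$$P^{(2)} f_{ijk} = \sum_{l,m,n} f_{lmi} \, f_{jkn}$$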
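The two phase-shift operators can be sketched with NumPy as follows. The sketch assumes the codon distribution is stored as a 4×4×4 array indexed in the order A, C, G, T and that neighbouring codons are independent, which is what the product form of the sums implies; the function names are hypothetical.

```python
import numpy as np


def shift_phase_1(f):
    """P1 f[i, j, k] = sum over l, m, n of f[l, i, j] * f[k, m, n]."""
    last_two = f.sum(axis=0)        # indexed [i, j]: last two letters of a codon
    first_one = f.sum(axis=(1, 2))  # indexed [k]: first letter of the next codon
    return last_two[:, :, None] * first_one[None, None, :]


def shift_phase_2(f):
    """P2 f[i, j, k] = sum over l, m, n of f[l, m, i] * f[j, k, n]."""
    last_one = f.sum(axis=(0, 1))   # indexed [i]: last letter of a codon
    first_two = f.sum(axis=2)       # indexed [j, k]: first two letters of the next codon
    return last_one[:, None, None] * first_two[None, :, :]


# Toy check: a text made only of the codon ATG (A=0, C=1, G=2, T=3).
f = np.zeros((4, 4, 4))
f[0, 3, 2] = 1.0
f1 = shift_phase_1(f)  # all mass on TGA
f2 = shift_phase_2(f)  # all mass on GAT
```

For the repeated codon ATG, reading the text one position later gives only TGA and two positions later only GAT, which is what the two arrays contain.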

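“Shadow” genes

TCCAGCTTA TGAGGCATAACTGTTTACTGAGGCCAT ACT GTACTGTTAGGTTGTACTGTTA

AGGTCGAATACTCCGTATTGACAAATGACTCCGGTATGACATGACAATCCAACATGACAAT

[Figure label: shadow gene]

$$\hat{f}_{ijk} = C f_{ijk} = f_{\hat{k}\hat{j}\hat{i}}, \qquad \hat{A} = T, \quad \hat{C} = G$$

$$\hat{f}^{(1)}_{ijk} = P^{(1)} \hat{f}_{ijk}, \qquad \hat{f}^{(2)}_{ijk} = P^{(2)} \hat{f}_{ijk}$$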
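A minimal sketch of the complementary-strand ("shadow") operator, assuming it acts as a reverse complement on triplets and that the distribution is a 4×4×4 NumPy array indexed in the order A, C, G, T, so that the complement of index a is 3 - a. The function name is hypothetical.

```python
import numpy as np


def shadow(f):
    """C f[i, j, k] = f[comp(k), comp(j), comp(i)], with comp(a) = 3 - a for the order A, C, G, T."""
    complemented = f[::-1, ::-1, ::-1]      # complement each of the three letters
    return complemented.transpose(2, 1, 0)  # reverse the reading direction


# Toy check: a gene made only of ATG looks like CAT when read from the other strand.
f = np.zeros((4, 4, 4))
f[0, 3, 2] = 1.0      # ATG
f_shadow = shadow(f)  # all mass on CAT = [1, 0, 3]
```

Applying the operator twice returns the original distribution, as expected for a reverse complement.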

When can we detect genes (by their content)?

1. When non-coding regions are very different in base composition (e.g., different GC-content)

2. When distances between the phases are large:

[Figure labels: $P^{(1)} f_{ijk}$, $P^{(2)} f_{ijk}$, $f_{ijk}$, non-coding]

$$M = \sum_{ijk} f_{ijk} \log_2 \frac{f_{ijk}}{p_i \, p_j \, p_k}$$
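The quantity M can be evaluated numerically as in the sketch below. The slide does not say how the single-letter frequencies p are obtained, so the code assumes they are the overall letter frequencies averaged over the three triplet positions; that choice and the function names are assumptions.

```python
import numpy as np


def letter_frequencies(f):
    """Overall letter frequencies, averaged over the three triplet positions (an assumption)."""
    return (f.sum(axis=(1, 2)) + f.sum(axis=(0, 2)) + f.sum(axis=(0, 1))) / 3.0


def mutual_information(f):
    """M = sum over ijk of f_ijk * log2(f_ijk / (p_i p_j p_k)), skipping zero frequencies."""
    p = letter_frequencies(f)
    product = p[:, None, None] * p[None, :, None] * p[None, None, :]
    mask = f > 0
    return float(np.sum(f[mask] * np.log2(f[mask] / product[mask])))


uniform = np.full((4, 4, 4), 1.0 / 64)
single = np.zeros((4, 4, 4))
single[0, 3, 2] = 1.0  # only the triplet ATG occurs

m_uniform = mutual_information(uniform)
m_single = mutual_information(single)
```

With this choice of p, a uniform triplet distribution gives M = 0, while a text built from a single codon with three distinct letters gives M = log2(27).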

Simple experiment

1. Only the forward strands of genomes are used for triplet counting

2. Every p positions in the sequence, open a window (x-W/2, x+W/2) of size W centered at position x

3. Every window, starting from the first base-pair, is divided into W/3 non-overlapping triplets, and the frequencies of all triplets fijk are calculated

4. The dataset consists of N = [L/p] points, where L is the entire length of the sequence

5. Every data point Xi={xis} corresponds to one window and has 64 coordinates, corresponding to the frequencies of all possible triplets s = 1,…,64
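The five steps above can be sketched in Python as follows. The slides do not say how windows at the sequence ends or triplets with ambiguous letters are handled; this sketch keeps only windows that fit entirely inside the sequence and skips triplets containing anything other than A, C, G, T. Both choices, and all names, are assumptions.

```python
from itertools import product

import numpy as np

TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]
INDEX = {t: s for s, t in enumerate(TRIPLETS)}  # triplet -> coordinate s = 0..63


def window_triplet_frequencies(seq, W, p):
    """Return an array with one row per window and 64 triplet frequencies per row."""
    rows = []
    for start in range(0, len(seq) - W + 1, p):  # a new window every p positions
        window = seq[start:start + W]
        counts = np.zeros(64)
        for t in range(W // 3):                  # non-overlapping triplets from the first base
            triplet = window[3 * t:3 * t + 3]
            if triplet in INDEX:
                counts[INDEX[triplet]] += 1
        total = counts.sum()
        rows.append(counts / total if total > 0 else counts)
    return np.array(rows)


seq = "ATG" * 100
X = window_triplet_frequencies(seq, W=30, p=3)   # every window is in phase with ATG
X1 = window_triplet_frequencies(seq, W=30, p=1)  # windows cycle through three phases
```

With step p = 1 the windows alternate between the triplets ATG, TGA and GAT, which is a toy version of the three coding phases.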

Principal Component Analysis

[Figure: data cloud with the 1st principal axis along the direction of maximal dispersion and the 2nd principal axis orthogonal to it]

ViDaExpert tool
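The slides use the ViDaExpert tool for visualization. The sketch below is not that tool; it is only a plain NumPy stand-in, assuming the data are stored as an N×64 array with one row per window, that projects the points onto the first principal axes via a singular value decomposition. The function name and the synthetic test data are hypothetical.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto the first principal axes; also return axes and variances."""
    Xc = X - X.mean(axis=0)                          # centre the data
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    axes = Vt[:n_components]                         # rows are principal axes
    variances = S ** 2 / (len(X) - 1)                # dispersion along each axis
    return Xc @ axes.T, axes, variances


# Synthetic check: points scattered along one direction in 64 dimensions.
rng = np.random.default_rng(0)
direction = rng.normal(size=64)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(200, 1)) * direction + 0.01 * rng.normal(size=(200, 64))
scores, axes, variances = pca_project(X)
```

The first returned axis should be close to the planted direction, up to sign, since that is where the dispersion is maximal.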

Caulobacter crescentus (GenBank NC_002696)

[Figure: cluster labels $f_{ijk}$, $P^{(1)} f_{ijk}$, $P^{(2)} f_{ijk}$]

“Path” of sliding window

Helicobacter pylori (GenBank NC_000921)

Saccharomyces cerevisiae chromosome IV

Model sequences (random codon usage)

Model sequences (random codon usage + 50% of frequencies set to 0)

Graph of coding phase

Assessment

Sequence                                             L        W    % of coding bases  Sn1   Sp1   Sn2   Sp2
Helicobacter pylori, complete genome (NC_000921)     1643831  300  90                 0.93  0.97  0.93  0.98
Caulobacter crescentus, complete genome (NC_002696)  4016947  300  91                 0.93  0.97  0.94  0.98
Prototheca wickerhamii mitochondrion (NC_001613)     55328    120  49                 0.82  0.93  0.84  0.95
Saccharomyces cerevisiae chromosome III (NC_001135)  316613   399  69                 0.90  0.88  0.90  0.90
Saccharomyces cerevisiae chromosome IV (NC_001136)   1531929  399  73                 0.89  0.91  0.92  0.92
Model text RANDOM                                    100000   500  49                 0.90  0.61  0.82  0.77
Model text RANDOM_BIAS                               100000   500  45                 0.99  0.83  0.94  0.90

$$Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TP}{TP + FP}$$

Completely blind prediction
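The two accuracy measures can be computed as in the sketch below. The slides do not state whether TP, FN and FP are counted per base or per window; the sketch assumes per-position labels given as two equal-length sequences of truthy and falsy values, and the function name is hypothetical.

```python
def sensitivity_specificity(truth, predicted):
    """Sn = TP / (TP + FN), Sp = TP / (TP + FP), from per-position coding labels."""
    tp = sum(1 for t, q in zip(truth, predicted) if t and q)
    fn = sum(1 for t, q in zip(truth, predicted) if t and not q)
    fp = sum(1 for t, q in zip(truth, predicted) if not t and q)
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp


truth = [1, 1, 1, 0, 0]      # annotated coding positions
predicted = [1, 1, 0, 1, 0]  # positions predicted as coding
sn, sp = sensitivity_specificity(truth, predicted)
```

In this toy case TP = 2, FN = 1 and FP = 1, so both measures equal 2/3. Note that Sp as defined on the slide is the fraction of predicted coding positions that are truly coding.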

Dependence on window size

[Plot: Sn and Sp versus window size; x-axis: window size, 0 to 1300; y-axis: 0.75 to 1]

[Figure panels: W = 51, W = 252, W = 900, W = 2000]

State of the art: GLIMMER strategy

1. Use a Markov model (MM) of 5th order (hexamers)
2. Use interpolation for transition probabilities
3. Use long ORFs (>500 bp) as the learning dataset

Problems:
1. The number of hexamers to be evaluated is still big (4^6 = 4096)
2. Applicable only to collected genomes of good quality (<1 frameshift/1000 bp)

What can we learn from this game?

• Learning can be replaced with self-learning
• Bacterial gene-finders work relatively well when the concentration of coding sequences is high
• Correlations in the order of codons are small
• Codon usage is approximately the same along the genome
• The method presented allows self-learning on pieces of even uncollected DNA (>150 bp)
• The method gives an alternative to the HMM view on the problem of gene recognition

Acknowledgements

Professor Alexander Gorban
Professor Misha Gromov

My coordinates: http://www.ihes.fr/~zinovyev