Hierarchical Cluster Structures and
Symmetries in Genomic Sequences
Andrei Zinovyev
Institut des Hautes Études Scientifiques
Math@Bio group of M.Gromov
Plan of the talk
Genomic sequences: geometric approach, clustering
Genomic sequence as text
Basic 7-cluster structure
Global structure of codon frequencies
Internal structure of codon frequencies
Applications
Genomic sequence as a text in unknown language
tagggrcgcacgtggtgagctgatgctaggg
frequency dictionaries:
singles ($N = 4 = 4^1$): t a g g g r c g c a c g t g g t g a g c t g a t g c t a g g g
doublets ($N = 16 = 4^2$): ta gg gr cg ca cg tg gt ga gc tg at gc ta gg
triplets ($N = 64 = 4^3$): tag ggr cgc acg tgg tga gct gat gct agg
quadruplets ($N = 256 = 4^4$): tagg grcg cacg tggt gagc tgat gcta gggr
gggrcgccacgttggtgagctgatgctagggrcgacgtgg
tagggrcgcacgtggtgagctgatgctagggrcgacgtgg
agggrcgcacgtggtgagctgatgctagggrcgacgtggc
..cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc…
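The frequency dictionaries above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the function name `frequency_dictionary` is hypothetical, and the text is cut into non-overlapping words of length k starting at the first letter, as in the example rows of the slide.

```python
from collections import Counter

def frequency_dictionary(text, k):
    """Relative frequencies of non-overlapping words of length k.

    Words are read from position 0; a trailing piece shorter than k is dropped.
    """
    words = [text[i:i + k] for i in range(0, len(text) - k + 1, k)]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

text = "tagggrcgcacgtggtgagctgatgctaggg"
for k in (1, 2, 3, 4):
    freqs = frequency_dictionary(text, k)
    # 4**k is the dictionary size for the alphabet {a, c, g, t};
    # the sample text also contains the ambiguity letter "r".
    print(k, 4 ** k, len(freqs))
```

For a short text most of the $4^k$ possible words never occur, so the dictionary is sparse; on a whole genome the triplet and quadruplet dictionaries fill up.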
From text to geometry
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
$10^7$
cgtggtgagctgatgctagggrcgcacggtgagctgatgctagggrcgcacacttgagctgatgctagggrcgcacaattcgtgagctgatgctagggrcgcacggtg……gagctgatgctagggrcgcacaagtga
length ~ 300-400
3000-4000 fragments
$\mathbb{R}^N$
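A rough sketch of the step from text to geometry: a window slides along the genome, and each fragment becomes a point in N-dimensional space whose coordinates are its word frequencies. The function name `fragment_vectors`, the default step of 100, and the choice to skip words containing letters outside {a, c, g, t} are assumptions made for illustration; only the fragment length of about 300-400 comes from the slide.

```python
from itertools import product

ALPHABET = "acgt"

def fragment_vectors(genome, k=3, window=300, step=100):
    """Map each sliding-window fragment to a vector of k-word frequencies.

    Each vector has N = 4**k coordinates. Words inside a fragment are
    non-overlapping and read from the start of the fragment.
    """
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    vectors = []
    for start in range(0, len(genome) - window + 1, step):
        fragment = genome[start:start + window]
        vec = [0.0] * len(words)
        n = 0
        for i in range(0, len(fragment) - k + 1, k):
            j = index.get(fragment[i:i + k])
            if j is not None:  # assumption: ignore words with letters such as "r"
                vec[j] += 1.0
                n += 1
        if n:
            vec = [v / n for v in vec]
        vectors.append(vec)
    return vectors

# Toy input; a real genome would give thousands of points.
points = fragment_vectors("acg" * 200, k=3, window=300, step=100)
print(len(points), len(points[0]))
```

The resulting cloud of points can then be passed to any clustering or projection method.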
Caulobacter crescentus
singles N=4
doublets N=16
triplets N=64
quadruplets N=256
!!!
the information in a genomic sequence is encoded by non-overlapping triplets
tga tgc tag ggr cgc acg tgg
ctg atg cta ggg rcg cac gtg
Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc
gct gat gct agg grc gca cgt
gtgaatcggtgggtgagtgtgctgctatgagc
atc ggt ggg tga gtg tgc tgc
tcg gtg ggt gag tgt gct gct
cgg tgg gtg agt gtg ctg ctg
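The three rows of triplets shown for each fragment correspond to the three possible shifts of the reading frame. A minimal sketch of how the same fragment yields three different triplet distributions follows; the function names are hypothetical, and reading the complementary strand as well is not shown here.

```python
from collections import Counter

def triplets_in_frame(seq, shift):
    """Non-overlapping triplets of seq, starting at offset shift (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(shift, len(seq) - 2, 3)]

def frame_distributions(seq):
    """Triplet frequency dictionary for each of the three frame shifts."""
    result = []
    for shift in range(3):
        counts = Counter(triplets_in_frame(seq, shift))
        total = sum(counts.values())
        result.append({t: n / total for t, n in counts.items()})
    return result

fragment = "gtgagctgatgctagggrcgcacgtggtgagc"
for shift in range(3):
    print(shift, " ".join(triplets_in_frame(fragment, shift)))
```

In a coding region only one of the three shifts matches the true codon boundaries, so the three distributions differ; in a non-coding region they are expected to look alike.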
Mean-field approximation for triplet frequencies
$F_{IJK} \approx P^1_I P^2_J P^3_K$
$F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$):
$F_{AAA}, F_{AAT}, F_{AAC}, \ldots, F_{GGC}, F_{GGG}$: 64 numbers
letter frequency + correlations
$P^j_i$: 12 numbers
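As a sketch of the mean-field approximation: the 12 position-specific letter frequencies are the marginals of the 64 triplet frequencies, and the approximation replaces each triplet frequency by the product of its three marginals. The function names `position_frequencies` and `mean_field` are hypothetical.

```python
from collections import defaultdict
from itertools import product

def position_frequencies(F):
    """P[j][letter]: frequency of letter at position j (0, 1, 2) of a triplet.

    F maps triplets to frequencies that sum to 1. For a four-letter
    alphabet this gives 3 * 4 = 12 numbers.
    """
    P = [defaultdict(float) for _ in range(3)]
    for triplet, f in F.items():
        for j, letter in enumerate(triplet):
            P[j][letter] += f
    return P

def mean_field(F, alphabet="acgt"):
    """Product of the three positional letter frequencies for every triplet."""
    P = position_frequencies(F)
    return {
        "".join(t): P[0][t[0]] * P[1][t[1]] * P[2][t[2]]
        for t in product(alphabet, repeat=3)
    }

# Toy distribution with strong correlations between positions.
F = {"aaa": 0.5, "ccc": 0.5}
M = mean_field(F)
print(M["aaa"], M["acc"])
```

The difference between the triplet frequencies and their mean-field product is what carries the correlations between positions.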
Genome codon usage and mean-field approximation
ggtgaATG gat gct agg … gtc gca cgc TAAtgagct
…
correct frameshift
64 frequencies $F_{IJK}$
…
ggtgaATG gat gct agg … gtc gca cgc TAAtgagct
12 frequencies $P^1_I$, $P^2_J$, $P^3_K$
Four symmetry types of the basic 7-cluster structure
eubacteria
flower-like
degenerated
perpendicular triangles
parallel triangles
Fast-growing bacteria
[Figure: gene classes I, II, III, IV]
Genes of class I (most genes)
Genes of class II (highly expressed)
Genes of class III (unusual)
Genes of class IV (hydrophobic proteins)
Escherichia coli
Genes of class I (most genes)
Genes of class II (highly expressed)
Genes of class III (unusual)
Genes of class IV (hydrophobic proteins)
Protein expression optimization
[Figure: gene classes I, II, III, IV]
gene sequence S, protein A
gene sequence S’, same protein A, higher expression
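The slide states only that a different sequence S’ can encode the same protein A. One simple way to produce such a sequence, sketched here under loud assumptions, is to swap each codon for a synonymous one that is more frequent in a reference codon-usage table. The function `recode` and the `usage` numbers below are hypothetical and are not taken from the talk; the codon table is the standard genetic code.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons ordered TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(gene):
    """Translate an in-frame upper-case DNA sequence ('*' marks a stop codon)."""
    return "".join(CODON_TABLE[gene[i:i + 3]] for i in range(0, len(gene) - 2, 3))

def recode(gene, preferred_usage):
    """Replace each codon by the synonymous codon with the highest usage score.

    preferred_usage is a hypothetical table {codon: frequency}, for example
    measured on highly expressed genes. Codons of amino acids with no
    information in the table are left unchanged.
    """
    best = {}
    for codon, aa in CODON_TABLE.items():
        score = preferred_usage.get(codon, 0.0)
        if aa not in best or score > best[aa][0]:
            best[aa] = (score, codon)
    out = []
    for i in range(0, len(gene) - 2, 3):
        codon = gene[i:i + 3]
        score, candidate = best[CODON_TABLE[codon]]
        out.append(candidate if score > 0 else codon)
    return "".join(out)

usage = {"CTG": 0.5, "AAA": 0.3, "AAG": 0.1}  # made-up numbers for illustration
S = "ATGCTTAAATAA"
S_prime = recode(S, usage)
print(S_prime, translate(S) == translate(S_prime))
```

Whether such a substitution actually raises expression depends on the organism and is not something this sketch can show; it only guarantees that the protein sequence is unchanged.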
Papers
Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences. 2004. arXiv e-print.
Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology. V.3, 0039.
Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).