Comparative genomics: functional characterization of new genes and regulatory interactions using computer analysis
Mikhail Gelfand
Institute for Information Transmission Problems (The Kharkevich Institute), RAS
Workshop at the Landau Institute of Theoretical Physics, RAS
September 27-28, 2007, Moscow
The genome is deciphered!
Is it?
To intercept a message does not mean to understand it
Fragment of a genome (0.1% of E. coli)
A typical bacterial genome: several million nucleotides
~600 through ~9,000 genes (~90% of the genome encodes proteins)
Propaganda
[Figure: sequences in GenBank (~genes) versus articles in PubMed (~experiments) by year, 1982-2000; vertical axis logarithmic, 100 to 10,000,000]
More propaganda
Most genes will never be studied in experiment
Even in E. coli: only 20-30 new genes per year (hundreds are still uncharacterized)
• “Universally missing genes” – not a single known gene even for ~10% of the reactions of central metabolism. No genes for >40% of reactions overall.
• “Conserved hypothetical genes” (5-15% of any bacterial genome) – essential, but of unknown function.
The local goal: to characterize the genes
• What? – function (rather, role)
• When? – regulation (conditions)
  • gene expression
  • lifetime (mRNA, protein)
• Where? – localization
  • cellular/membrane/secreted
• How? – mechanism of action
  • specificity, regulation (biochemistry)
2007: > 1200 bacterial genomes
Propaganda-2: complete genomes
[Figure: number of complete genomes by year, 1995-2002; vertical axis 0 to 90]
The global goal: to predict the organism’s properties given its genome
(plus some additional information, e.g. the initial state after cell division)
and “to understand” the evolution of genomes/organisms
Haemophilus influenzae, 1995
Vibrio cholerae, 2000
The metabolic map, the bird’s view
Metabolic pathways, the eagle’s view
A submap (metabolism of arginine and proline)
Approaches
• Similarity => homology (common origin)
• Homology => common function
• “The Pearson Principle” (after Karl Pearson): important features are conserved
– functional sites in proteins
– regulatory (protein-binding) sites in DNA
– not necessarily sequences:
  • structure of protein and RNA
  • gene localization on chromosomes
  • co-expression of genes
• Allows one to annotate 50-75% of genes in a bacterial genome
• Necessary first step, may be automated (to some extent)
… but not so simple
• Similarity ≠ homology
– Low-complexity regions, unstructured domains, transmembrane segments and other regions with non-standard amino acid composition
  • The need for correct similarity measures
– Does homology always follow from structural similarity?
  • What is structural similarity? How can it be measured?
  • Convergent evolution of structures? Independent emergence of folds?
• Homology ≠ same function
– What is «the same function»?
  • Biochemical details and cellular role
“The Fermi principle” (after Enrico Fermi)
Purely homology-based annotation: boring (nothing radically new)
It turns out, one can predict something completely new
Comparative genomics
Positional clustering
• Genes that are located in immediate proximity tend to be involved in the same metabolic pathway or functional subsystem
– caused by operon structure, but not only
  • horizontal transfer of loci containing several functionally linked operons
  • compartmentalisation of products in the cytoplasm
– very weak evidence
  • stronger if observed in many unrelated genomes
• May be measured
– e.g. the STRING database/server (P. Bork, EMBL)
– and other sources
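A minimal sketch of how such positional evidence might be counted: for a pair of genes, list the genomes in which they lie within a few positions of each other. The gene orders, the function name and the `window` parameter are illustrative assumptions; this is not the STRING scoring scheme, and chromosome circularity is ignored.

```python
def neighbor_genomes(genomes, gene_a, gene_b, window=2):
    """Return names of genomes where gene_a and gene_b are at most
    `window` positions apart in the (linearized) gene order."""
    hits = []
    for name, order in genomes.items():
        if gene_a in order and gene_b in order:
            if abs(order.index(gene_a) - order.index(gene_b)) <= window:
                hits.append(name)
    return hits

# Toy gene orders (hypothetical, for illustration only).
genomes = {
    "genome1": ["trpE", "trpD", "trpC", "trpB", "trpA"],
    "genome2": ["geneX", "trpB", "trpA", "geneY"],
    "genome3": ["trpB", "geneX", "geneY", "geneZ", "trpA"],
}

support = neighbor_genomes(genomes, "trpB", "trpA")
print(support)
```

The more unrelated genomes appear in the result, the stronger the evidence; a single genome is very weak support.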
STRING: trpB – positional clusters
Functionally dependent genes tend to cluster on chromosomes in many different organisms
Vertical axis: number of gene pairs with association score exceeding a threshold.
Control: same graph, random re-labeling of vertices
More genomes (stronger links) => highly significant clustering
Fusions
• If two (or more) proteins form a single multidomain protein in some organism, they all are likely to be tightly functionally related
• Very useful for the analysis of eukaryotes
• Sometimes useful for the analysis of prokaryotes
STRING: trpB – fusions
Phyletic patterns
• Functionally linked genes tend to occur together
• Enzymes with the same function (isozymes) have complementary phyletic profiles
STRING: trpB – co-occurrence (phyletic patterns)
Phyletic patterns in the Phe/Tyr pathway
Chorismate biosynthesis pathway (E. coli)
shikimate kinase
Archaeal shikimate kinase
Arithmetics of phyletic patterns
3-dehydroquinate dehydratase (EC 4.2.1.10):
  Class I (AroD)        COG0710  aompkzyq---lb-e----n---i--
+ Class II (AroQ)       COG0757  ------y-vdr-bcefghs-uj----
= Two forms combined             aompkzyqvdrlbcefghsnuj-i--
5-enolpyruvylshikimate 3-phosphate synthase (EC 2.5.1.19):
  AroA                  COG0128  aompkzyqvdrlbcefghsnuj-i--
Shikimate dehydrogenase (EC 1.1.1.25):
  AroE                  COG0169  aompkzyqvdrlbcefghsnuj-i--
Shikimate kinase (EC 2.7.1.71):
  Typical (AroK)        COG0703  ------yqvdrlbcefghsnuj-i--
+ Archaeal-type         COG1685  aompkz--------------------
= Two forms combined             aompkzyqvdrlbcefghsnuj-i--
Chorismate synthase (EC 4.2.3.5):
  AroC                  COG0082  aompkzyqvdrlbcefghsnuj-i--
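The pattern arithmetic above can be reproduced with a short sketch: a position-wise union of two phyletic patterns, plus a check of where the two forms overlap. The patterns are copied from the slide; the function and variable names are illustrative assumptions.

```python
def combine(*patterns):
    """Position-wise union of phyletic patterns ('-' means the gene is absent)."""
    assert len({len(p) for p in patterns}) == 1, "patterns must have equal length"
    return "".join(
        next((ch for ch in column if ch != "-"), "-")
        for column in zip(*patterns)
    )

def overlap(p, q):
    """Species codes in which both forms are present."""
    return "".join(a for a, b in zip(p, q) if a != "-" and b != "-")

aro_d = "aompkzyq---lb-e----n---i--"   # class I dehydratase, COG0710
aro_q = "------y-vdr-bcefghs-uj----"   # class II dehydratase, COG0757
aro_k = "------yqvdrlbcefghsnuj-i--"   # typical shikimate kinase, COG0703
arch_k = "aompkz" + "-" * 20           # archaeal-type shikimate kinase, COG1685
aro_a = "aompkzyqvdrlbcefghsnuj-i--"   # EPSP synthase, COG0128

dehydratase = combine(aro_d, aro_q)
kinase = combine(aro_k, arch_k)
print(dehydratase == aro_a, kinase == aro_a)
print(repr(overlap(aro_k, arch_k)), repr(overlap(aro_d, aro_q)))
```

In this data the two shikimate kinase forms are strictly complementary, while the two dehydratase classes co-occur in a few genomes.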
Distribution of association scores: monotonic for subunits, bimodal for isozymes
Comparative analysis of regulation
• Phylogenetic footprinting: regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions
• Consistency filtering: regulons (sets of co-regulated genes) are conserved =>
– true sites occur upstream of orthologous genes
– false sites are scattered at random
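A minimal sketch of consistency filtering, assuming candidate sites have already been found and assigned to ortholog groups: a group is kept only if candidate sites occur upstream of its members in several genomes. The input format, the `min_genomes` threshold and the toy data are assumptions made for illustration.

```python
from collections import defaultdict

def consistent_sites(candidates, min_genomes=2):
    """candidates: iterable of (genome, ortholog_group) pairs, one per
    candidate site found upstream of a gene from that ortholog group.
    Keep groups with candidate sites in at least `min_genomes` genomes."""
    genomes_by_group = defaultdict(set)
    for genome, group in candidates:
        genomes_by_group[group].add(genome)
    return {
        group: sorted(found)
        for group, found in genomes_by_group.items()
        if len(found) >= min_genomes
    }

# Toy input (hypothetical).
candidates = [
    ("genome1", "ribD"), ("genome2", "ribD"), ("genome3", "ribD"),
    ("genome1", "ypaA"), ("genome3", "ypaA"),
    ("genome2", "geneX"),   # isolated hit: likely a false positive
]

kept = consistent_sites(candidates)
print(kept)
```

An isolated hit is not proof of a false site, only weaker evidence; the threshold trades sensitivity for specificity.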
Riboflavin (vitamin B2) biosynthesis pathway
[Figure: pathway diagram.
Precursors: GTP (purine biosynthesis pathway); ribulose-5-phosphate (pentose-phosphate pathway).
Intermediates: 2,5-diamino-6-hydroxy-4-(5'-phosphoribosylamino)pyrimidine; 5-amino-6-(5'-phosphoribosylamino)uracil; 5-amino-6-(5'-phosphoribitylamino)uracil; 3,4-dihydroxy-2-butanone-4-phosphate; 6,7-dimethyl-8-ribityllumazine.
Product: riboflavin.
Enzymes: GTP cyclohydrolase II; pyrimidine deaminase; pyrimidine reductase; 3,4-DHBP synthase; riboflavin synthase (two chains).
Gene labels: ribA, ribB, ribD, ribG, ribH, ribE, ypaA.]
5’ UTR regions of riboflavin genes from bacteria
1 2 2’ 3 Add. 3’ Variable 4 4’ 5 5’ 1’
=========> ==> <== ===> -><- <=== -> <- ====> <==== ==> <== <=========
BS  TTGTATCTTCGGGG-CAGGGTGGAAATCCCGACCGGCGGT 21 AGCCCGTGAC-- 8 4 8 -----TGGATTCAGTTTAA-GCTGAAGCCGACAGTGAA-AGTCTGGAT-GGGAGAAGGATGAT
BQ  AGCATCCTTCGGGG-TCGGGTGAAATTCCCAACCGGCGGT 19 AGTCCGTGAC-- 8 5 8 -----TGGATCTAGTGAAACTCTAGGGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGGATATG
BE  TGCATCCTTCGGGG-CAGGGTGAAATTCCCGACCGGCGGT 20 AGCCCGCGA--- 3 4 3 -----AGGATCCGGTGCGATTCCGGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGGATGCC
HD  TTTATCCTTCGGGG-CTGGGTGGAAATCCCGACCGGCGGT 19 AGTCCGTGAC-- 10 4 10 -----TGGACCTGGTGAAAATCCGGGACCGACAGTGAA-AGTCTGGAT-GGGAGAAGGAAACG
Bam TGTATCCTTCGGGG-CTGGGTGAAAATCCCGACCGGCGGT 23 AGCCCGTGAC-- 8 4 8 -----TGGATTCAGTGAAAAGCTGAAGCCGACAGTGAA-AGTCTGGAT-GGGAGAAGGATGAG
CA  GATGTTCTTCAGGG-ATGGGTGAAATTCCCAATCGGCGGT 2 AGCCCGCAA--- 3 4 3 ------AGATCCGGTTAAACTCCGGGGCCGACAGTTAA-AGTCTGGAT-GAAAGAAGAAATAG
DF  CTTAATCTTCGGGG-TAGGGTGAAATTCCCAATCGGCGGT 2 AGCCCGCG---- 7 6 7 --------ATTTGGTTAAATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GGAAGAAGATATTT
SA  TAATTCTTTCGGGG-CAGGGTGAAATTCCCAACCGGCAGT 6 AGCCTGCGAC-- 11 3 11 -----CTGATCTAGTGAGATTCTAGAGCCGACAGTTAA-AGTCTGGAT-GGGAGAAAGAATGT
LLX ATAAATCTTCAGGG-CAGGGTGTAATTCCCTACCGGCGGT 2 AGCCCGCGA--- 4 4 4 -----ATGATTCGGTGAAACTCCGAGGCCGACAGT-AT-AGTCTGGAT-GAAAGAAGATAATA
PN  AACTATCTTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT 2 AGCCCACGA--- 3 4 3 -----ATGATTTGGTGAAATTCCAAAGCCGACAGT-AT-AGTCTGGAT-GAAAGAAGATAAAA
TM  AAACGCTCTCGGGG-CAGGGTGGAATTCCCGACCGGCGGT 3 AGCCCGCGAG-- 5 4 5 -----TTGACCCGGTGGAATTCCGGGGCCGACGGTGAA-AGTCCGGAT-GGGAGAGAGCGTGA
DR  GACCTCTTTCGGGG-CGGGGCGAAATTCCCCACCGGCGGT 15 AGCCCGCGAA-- 8 12 9 -----CCGATGCCGCGCAACTCGGCAGCCGACGGTCAC-AGTCCGGAC-GAAAGAAGGAGGAG
TQ  CACCTCCTTCGGGG-CGGGGTGGAAGTCCCCACCGGCGGT 3 AGCCCGCGAA-- 5 4 5 -----CCGACCCGGTGGAATTCCGGGGCCGACGGTGAA-AGTCCGGAT-GGGAGAAGGAGGGC
AO  AATAATCTTCAGGG-CAGGGTGAAATTCCCGATCGGCGGT 2 AGTCCGCGA--- 7 7 7 -----AGGAACCGGTGAGATTCCGGTACCGACAGT-AT-AGTCTGGAT-GGAAGAAGATGAAA
DU  TTTAATCTTCAGGG-CAGGGTGAAATTCCCGATCGGTGGT 2 AGTCCGCGA--- 13 4 12 -----AGGAACTAGTGAAATTCTAGTACCGACAGT-AT-AGTCTGGAT-GGAAGAAGAGCAGA
CAU GAAGACCTTCGGGG-CAAGGTGAAATTCCTGATCGGCGGT 20 AGCCCGCGA--- 3 4 3 -----AGGACCCGGTGTGATTCCGGGGCCGACGGT-AT-AGTCCGGAT-GGGAGAAGGTCGGC
FN  TAAAGTCTTCAGGG-CAGGGTGAAATTCCCGACCGGTGGT 2 AGTCCACG---- 5 4 5 -------GATTTGGTGAAATTCCAAAACCGACAGT-AG-AGTCTGGAT-GGGAGAAGAATTAG
TFU ACGCGTGCTCCGGG-GTCGGTGAAAGTCCGAACCGGCGGT 3 AGTCCGCGAC-- 8 5 8 -----TGGAACCGGTGAAACTCCGGTACCGACGGTGAA-AGTCCGGAT-GGGAGGTAGTACGTG
SX  -AGCGCACTCCGGG-GTCGGTGAAAGTCCGAACCGGCGGT 3 AGTCCGCGAC-- 8 5 8 -----TTGACCAGGTGAAATTCCTGGACCGACGGTTAA-AGTCCGGAT-GGGAGGCAGTGCGCG
BU  GTGCGTCTTCAGGG-CGGGGTGAAATTCCCCACCGGCGGT 30 AGCCCGCGAGCG 137 GTCAGCAGATCTGGTGAGAAGCCAGAGCCGACGGTTAG-AGTCCGGAT-GGAAGAAGATGTGC
BPS GTGCGTCTTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 21 AGCCCGCGAGCG 8 4 8 GTCAGCAGATCTGGTCCGATGCCAGAGCCGACGGTCAT-AGTCCGGAT-GAAAGAAGATGTGC
REU TTACGTCTTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 31 AGCCCGCGAGCG 7 5 7 GTCAGCAGATCTGGTGAGAGGCCAGGGCCGACGGTTAA-AGTCCGGAT-GAAAGAAGATGGGC
RSO GTACGTCTTCAGGG-CGGGGTGGAATTCCCCACCGGCGGT 21 AGCCCGCGAGCG 11 3 11 GTCAGCAGATCCGGTGAGATGCCGGGGCCGACGGTCAG-AGTCCGGAT-GGAAGAAGATGTGC
EC  GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 17 AGCCCGCGAGCG 8 4 8 GACAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAG-AGTCCGGAT-GGGAGAGAGTAACG
TY  GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 67 AGCCCGCGAGCG 8 3 8 GTCAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAA-AGTCCGGAT-GGGAGAGGGTAACG
KP  GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 20 AGCCCGCGAGCG 8 4 8 GTCAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAA-AGTCCGGAT-GGGAGAGAGTAACG
HI  TCGCATTCTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT 2 AGCCCACGAGCG 26 9 30 GTCAGCAGATTTGGTGAAATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GAAAGAGAATAAAA
VK  GCGCATTCTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT 14 AGCCCACGAGCG 11 9 11 GTCAGCAGATTTGGTGAGAATCCAAAGCCGACAGT-AT-AGTCTGGAT-GAAAGAGAATAAGC
VC  CAATATTCTCAGGG-CGGGGCGAAATTCCCCACCGGTGGT 13 AGCCCACGAGCG 5 4 5 GTCAGCAGATCTGGTGAGAAGCCAGGGCCGACGGTTAC-AGTCCGGAT-GAGAGAGAATGACA
YP  GCTTATTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT 40 AGCCCGCGAGCG 16 6 16 GTCAGCAGACCCGGTGTAATTCCGGGGCCGACGGTTAT-AGTCCGGAT-GGGAGAGAGTAACG
AB  GCGCATTCTCAGGG-CAGGGTGAAAGTCCCTACCGGTGGT 25 AGCCCACGAGCG 16 4 27 GTCAGCAGATTTGGTGCGAATCCAAAGCCGACAGTGAC-AGTCTGGAT-GAAAGAGAATAAAA
BP  GTACGTCTTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 18 AGCCCGCGAGCG 10 4 10 GTCAGCAGACCTGGTGAGATGCCAGGGCCGACGGTCAT-AGTCCGGAT-GAGAGAAGATGTGC
AC  ACATCGCTTCAGGG-CGGGGCGTAATTCCCCACCGGCGGT 16 AGCCCGCGAGCA 10 3 11 ---CGCAGATCTGGTGTAAATCCAGAGCCGACGGT-AT-AGTCCGGAT-GAAAGAAGACGACG
Spu AACAATTCTCAGGG-CGGGGTGAAACTCCCCACCGGCGGT 34 AGCCCGCGAGCG 6 6 6 GTCAGCAGATCTGGTG 52 TCCAGAGCCGACGGT 31 AGTCCGGAT-GGAAGAGAATGTAA
PP  GTCGGTCTTCAGGG-CGGGGTGTAAGTCCCCACCGGCGGT 13 AGCCCGCGAGCG 7 3 7 GTCAGCAGATCTGGTGCAACTCCAGAGCCGACGGTCAT-AGTCCGGAT-GAAAGAAGGCGTCA
AU  GGTTGTTCTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 17 AGCCCGCGAGCG 7 9 7 GTCAGCAGATCCGGTGAGAGGCCGGAGCCGACGGT-AT-AGTCCGGAT-GGAAGAGGACAAGG
PU  AAACGTTCTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 19 AGCCCGCGAGCG 19 4 18 GTCAGCAGACCCGGTGTGATTCCGGGGCCGACGGTCAC-AGTCCGGATGAAGAGAGAACGGGA
PY  TAACGTTCTCAGGG-CGGGGTGCAACTCCCCACCGGCGGT 19 AGCCCGCGAGCG 15 4 16 GTCAGCAGACCCGGTGTGATTCCGGGGCCGACGGTCAT-AGTCCGGATGAAGAGAGAGCGGGA
PA  TAACGTTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT 19 AGCCCGCGAGCG 14 4 13 GTCAGCAGACCCGGTGCGATTCCGGGGCCGACGGTCAT-AGTCCGGATAAAGAGAGAACGGGA
MLO TAAAGTTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT 16 AGCCCGCGAGCG 8 5 8 GTCAGCAGATCCGGTGTGATTCCGGAGCCGACGGTTAG-AGTCCGGAT-GAAAGAGGACGAAA
SM  AAGCGTTCTCAGGG-CGGGGTGAAATTCCCCACCGGCGGT 34 AGCCCGCGAGCG 8 3 8 GTCAGCAGATCCGGTCGAATTCCGGAGCCGACGGTTAT-AGTCCGGAT-GGAAGAGAGCAAGC
BME GCTTGTTCTCGGGG-CGGGGTGAAACTCCCCACCGGCGGT 17 AGCCCGCGAGCG 10 15 10 GTCAGCAGATCCGGTGAGATGCCGGAGCCGACGGTTAA-AGTCCGGAT-GGAAGAGAGCGAAT
BS  ATCAATCTTCGGGG-CAGGGTGAAATTCCCTACCGGCGGT 18 AGCCCGCGA--- 5 4 5 -----AGGATTCGGTGAGATTCCGGAGCCGACAGT-AC-AGTCTGGAT-GGGAGAAGATGGAG
BQ  GTCTATCTTCGGGG-CAGGGTGAAAATCCCGACCGGCGGT 27 AGCCCGCGA--- 3 5 3 -----AGGATTTGGTGTGATTCCAAAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGGAG
BE  ATTCATCTTCGGGG-CAGGGTGAAATTCCCGACCGGCGGT 20 AGCCCGCGA--- 3 4 3 -----AGGATCCGGTGCGAGTCCGGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGAAG
CA  AATGATCTTCAGGG-CAGGGTGAAATTCCCTACCGGCGGT 2 AGCCCGCGAG-- 3 4 3 ----TATGATCCGGTTTGATTCCGGAGCCGACAGT-AA-AGTCTGGAT-GAAAGAAGATATAT
DF  GAAGATCTTCGGGG-CAGGGTGAAATTCCCTACCGGCGGT 2 AGCCCGCG---- 6 4 6 -------GATTTGGTGAGATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GAGAGAAGATATTT
EF  GTTCGTCTTCAGGGGCAGGGTGTAATTCCCGACCGGTGGT 3 AGTCCACGAC-- 5 3 5 ----ATTGAATTGGTGTAATTCCAATACCGACAGT-AT-AGTCTGGAT--AAAGAAGATAGGG
LLX AAATATCTTCAGGG-CACCGTGTAATTCGGGACCGGCGGT 21 ACTCCGCGAT-- 4 4 4 -----TTGAAGCAGTGAGAATCTGCTAGCGACAGT-AA-AGTCTGGAT-GGAAGAAGATGAAC
LO  GTTCATCTTCGGGG-CAGGGTGCAATTCCCGACCGGTGGT 3 AGTCCACGAT-- 3 10 3 ----TTGACTCTGGTGTAATTCCAGGACCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGTTG
PN  AAGAGTCTTCAGGG-CAGGGTGAAATTCCCGACCGGCGGT 125 AGTCCGTG---- 3 4 3 -------GATGTGGTGAGATTCCACAACCGACAGT-AT-AGTCTGGAT-GGGAGAAGACGAAA
ST  AAGTGTCTTCAGGG-CAGGGTGTGATTCCCGACCGGCGGT 14 AGTCCGCG---- 3 4 3 -------GATGTGGTGTAACTCCACAACCGACAGT-AT-AGTCTGGAT-GAGAGAAGACCGGG
MN  AAGTGTCTTCAGGG-CAGGGTGAGATTCCCGACCGGCGGT 104 AGTCCGCG---- 3 4 3 -------GATGTGGTGAAATTCCACAACCGACAGT-AA-AGTCTGGAT-GGGAGAAGACTGAG
SA  ATTCATCTTCGGGG-TCGGGTGTAATTCCCAACCGGCAGT 6 AGCCTGCGAC-- 11 3 11 -----CTGATCTAGTGAGATTCTAGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGGAG
AMI TCACAGTTTCAGGG-CGGGGTGCAATTCCCCACTGGCGGT 14 AGCCCGCGC--- 5 5 5 ------TGATCTGGTGCAAATCCAGAGCCAACGGT-AT-AGTCCGGAT-GGAAGAAACGGAGC
DHA ACGAACCTTCGAGG-TAGGGTGAAATTCCCGACCGGCGGT 20 AGCCCGCAAC-- 11 4 11 --CGACTGACTTGGTGAGACTCCAAGGCCGACGGT-AT-AGTCCGGAT-GGGAGAAGGTACAA
FN  AATAATCTTCGGGG-CAGGGTGAAATTCCCGACCGGTGGT 2 AGTCCACG---- 4 6 4 -------GATTTGGTGAAATTCCAAAACCGACAGT-AG-AGTCTGGAT-GAGAGAAGAAAAGA
GLU ---TGTTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 28 AGCCCGCGAGCG 10 4 10 GTCAGCAGATCCGGTTAAATTCCGGAGCCGACGGTCAT-AGTCCGGAT-GCAAGAGAACC---
Conserved secondary structure of the RFN-element
[Figure: consensus secondary structure of the RFN element, 5’ to 3’: five numbered stems (1-5), a variable stem-loop and an additional stem-loop, annotated with consensus nucleotides and base-pair marks]
Capitals: invariant (absolutely conserved) positions. Lower case letters: strongly conserved positions. Dashes and stars: obligatory and facultative base pairs. Degenerate positions: R = A or G; Y = C or U; K = G or U; B = not A; V = not U. N: any nucleotide. X: any nucleotide or deletion.
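As an illustration of the degenerate codes in this legend, the sketch below tests whether a sequence fragment fits a consensus string. The function name and example fragments are assumptions; case is ignored, DNA T is read as U, and the X symbol (nucleotide or deletion) is deliberately not handled.

```python
# Degenerate symbols as defined in the legend above.
CODES = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "K": "GU",
    "B": "CGU",   # not A
    "V": "ACG",   # not U
    "N": "ACGU",  # any nucleotide
}

def matches(consensus, sequence):
    """True if `sequence` fits `consensus` position by position."""
    rna = sequence.upper().replace("T", "U")
    if len(consensus) != len(rna):
        return False
    return all(base in CODES[sym.upper()] for sym, base in zip(consensus, rna))

print(matches("GRCCYG", "GACCUG"))
print(matches("GRCCYG", "GUCCUG"))
```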
RFN: the mechanism of regulation
• Transcription attenuation
• Translation attenuation
Early observation: an uncharacterized gene (ypaA) with an upstream RFN element
Phylogenetic tree of RFN-elements (regulation of riboflavin biosynthesis)
[Figure: phylogenetic tree; branch annotations: "duplications"; "no riboflavin biosynthesis" (two labels)]
YpaA a.k.a. RibU: riboflavin transporter in Gram-positive bacteria
• 5 predicted transmembrane segments => a transporter
• Upstream RFN element (likely co-regulation with riboflavin genes) => transport of riboflavin or a precursor
• S. pyogenes, E. faecalis, Listeria sp.: ypaA, no riboflavin pathway => transport of riboflavin
Prediction: YpaA is a riboflavin transporter (Gelfand et al., 1999)
Validation:
• YpaA transports flavins (riboflavin, FMN, FAD):
– by genetic analysis (Kreneva et al., 2000)
– by direct measurement (Burgess et al., 2006; Vogl et al., 2007)
• ypaA is regulated by riboflavin: by microarray expression study (Lee et al., 2001)
• … via attenuation of transcription (and to some extent inhibition of translation) (Winkler et al., 2003)
Conserved structures of riboswitches (circled: X-ray)
[Figure: consensus secondary structures of six riboswitches: RFN-element, THI-element, B12-element (with the B12 box), S-box, G-box and LYS-element. Each is drawn 5' to 3' from a base stem, with numbered helices (P1 to P7 where present) and with variable (Var) and additional (Add, Add I-III) stem-loops]
Mechanisms
[Figure: riboswitch mechanisms, panels A, B, C, shown with (+ Effector) and without (- Effector) the ligand. Labels: RNA-element; regulatory hairpin (terminator of transcription and/or RBS-sequestor); antiterminator/antisequestor; poly-U tract (UUUUUUUU) in the case of regulation of transcription; the case of regulation of translation; downstream GENES; numbered complementary segments 1, 2, 3]
glmS: ribozyme, cleaves its mRNA (the Breaker group)
THI-box in plants: inhibition of splicing (the Breaker and Hanamoto groups)
Characterized riboswitches (more are predicted)
Element | Regulated processes | Ligand | Distribution
RFN | Riboflavin biosynthesis and transport | FMN (flavin mononucleotide) | Bacillus/Clostridium group, proteobacteria, actinobacteria, other bacteria
THI | Biosynthesis and transport of thiamin and related compounds | TPP (thiamin pyrophosphate) | Bacillus/Clostridium group, proteobacteria, actinobacteria, cyanobacteria, other bacteria, archaea (thermoplasmas), plants, fungi
B12 | Biosynthesis of cobalamin, transport of cobalt, cobalamin-dependent enzymes | Coenzyme B12 (adenosylcobalamin) | Bacillus/Clostridium group, proteobacteria, actinobacteria, cyanobacteria, spirochaetes, other bacteria
S-box, SAM-II, SAM-III | Metabolism of methionine and cysteine | SAM (S-adenosylmethionine) | S-box: Bacillus/Clostridium group and some other bacteria; SAM-II (alpha); SAM-III (Streptococci)
LYS | Lysine metabolism | lysine | Bacillus/Clostridium group, enterobacteria, other bacteria
G-box | Metabolism of purines | purines | Bacillus/Clostridium group and some other bacteria
glmS (ribozyme) | Synthesis of glucosamine-6-phosphate | glucosamine-6-phosphate | Bacillus/Clostridium group
gcvT (tandem) | Catabolism of glycine | glycine | Bacillus/Clostridium group
Properties of riboswitches
• Direct binding of ligands
• High conservation
– Including “unpaired” regions: tertiary interactions, ligand binding
• Same structure – different mechanisms: transcription, translation, splicing, (RNA cleavage)
• Distribution in all taxonomic groups
– diverse bacteria
– archaea: thermoplasmas
– eukaryotes: plants and fungi
• Correlation of the mechanism and taxonomy:
– attenuation of transcription (anti-anti-terminator) – Bacillus/Clostridium group
– attenuation of translation (anti-anti-sequestor of translation initiation) – proteobacteria
– attenuation of translation (direct sequestor of translation initiation) – actinobacteria
• Evolution: horizontal transfer, duplications, lineage-specific loss
• Sometimes very narrow distribution: evolution from scratch?
Conserved signal upstream of nrd genes
Identification of the candidate regulator by the analysis of phyletic patterns
COG1327: the only COG with exactly the same phylogenetic pattern as the signal
– “large scale” on the level of major taxa
– “small scale” within major taxa:
  • absent in small parasites among alpha- and gamma-proteobacteria
  • absent in Desulfovibrio spp. among delta-proteobacteria
  • absent in Nostoc sp. among cyanobacteria
  • absent in Oenococcus and Leuconostoc among Firmicutes
  • present only in Treponema denticola among four spirochetes
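The search for a gene family with the same phyletic pattern as a regulatory signal can be sketched as a set comparison: rank each COG by the number of genomes in which its presence disagrees with the presence of the signal. The COG identifiers, genome names and function name below are hypothetical placeholders, not real data.

```python
def rank_cogs(signal, cog_patterns):
    """Sort COG ids by the number of genomes in which the COG and the
    signal disagree (0 means an identical phyletic pattern)."""
    scored = [(len(signal ^ genomes), cog) for cog, genomes in cog_patterns.items()]
    return sorted(scored)

# Genomes in which the conserved signal is found (hypothetical).
signal = {"g1", "g2", "g4", "g5"}

# Genomes in which each COG has a member (hypothetical).
cog_patterns = {
    "COG_A": {"g1", "g2", "g4", "g5"},
    "COG_B": {"g1", "g2", "g3", "g4", "g5"},
    "COG_C": {"g1", "g6"},
}

ranking = rank_cogs(signal, cog_patterns)
exact = [cog for mismatches, cog in ranking if mismatches == 0]
print(ranking)
print(exact)
```

Ranking by mismatch count, rather than demanding exact identity, leaves room for annotation gaps; an exact match that also holds within major taxa is the strongest case.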
COG1327 “Predicted transcriptional regulator, consists of a Zn-ribbon and ATP-cone domains”: regulator of the riboflavin pathway (RibX)?
Additional evidence: co-localization
nrdR is sometimes clustered with nrd genes or with replication genes dnaB, dnaI, polA
Additional evidence: co-regulated genes
In some genomes, candidate NrdR-binding sites are found upstream of other replication-related genes
– dNTP salvage
– topoisomerase I, replication initiator dnaA, chromosome partitioning, DNA helicase II
Multiple sites (nrd genes): FNR, DnaA, NrdR
Mode of regulation
• Repressor (overlaps with promoters)
• Co-operative binding:
– most sites occur in tandem (> 90% cases)
– the distance between the copies (centers of palindromes) equals an integer number of DNA turns:
  • mainly (94%) 30-33 bp, in 84% 31-32 bp – 3 turns
  • 21 bp (2 turns) in Vibrio spp.
  • 41-42 bp (4 turns) in some Firmicutes
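The spacing argument can be checked with simple arithmetic, assuming an average helical repeat of about 10.5 bp per turn of B-DNA (an assumption added here; the slide only says "DNA turns"). The sketch converts a center-to-center distance into the nearest whole number of turns and the residual offset.

```python
HELICAL_REPEAT_BP = 10.5  # assumed average repeat of B-DNA, bp per turn

def turns(distance_bp):
    """Return (nearest integer number of turns, deviation from it in bp)."""
    n = round(distance_bp / HELICAL_REPEAT_BP)
    return n, distance_bp - n * HELICAL_REPEAT_BP

for d in (21, 31, 32, 41, 42):
    n, deviation = turns(d)
    print(f"{d} bp: {n} turns, deviation {deviation:+.1f} bp")
```

A deviation near zero places the two sites on the same face of the helix, which is consistent with co-operative binding; a deviation near half a repeat (about 5 bp) would put them on opposite faces.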
Experimental validations
Acknowledgements
• Dmitry Rodionov (comparative genomics)
• Andrei Mironov (software)
• Alexei Vitreschak (riboswitches)
• Funding:
– Howard Hughes Medical Institute
– Russian Foundation of Basic Research
– RAS, program “Molecular and Cellular Biology”
– INTAS