From Genomes to Genes
![Page 1: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/1.jpg)
From Genomes to Genes
Rui Alves
![Page 2: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/2.jpg)
How to make sense of genome sequences?
…atgattattggcggaatcggcggtgcaaggacacaaacaggactcagattcgaagaacgtacagacttacgaaagttgtttgaagaaattcc…
How do I know where genes are?
![Page 3: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/3.jpg)
Predicting ORFs is easy, predicting genes is hard
• An ORF is a sequence of nucleotides that goes from a start codon (ATG, GTG,…) to a stop codon (TAA, TAG, TGA)
• Finding them is as easy as reading the DNA sequence
• How do we know if an ORF is a gene?
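To make "as easy as reading the DNA sequence" concrete, here is a minimal sketch of an ORF scan over the three forward reading frames. The function name `find_orfs` and the choice to accept only ATG and GTG as start codons are assumptions made for illustration; a real scan would also cover the reverse strand.

```python
START_CODONS = {"ATG", "GTG"}          # the two start codons named on the slide
STOP_CODONS = {"TAA", "TAG", "TGA"}    # the three standard stop codons


def find_orfs(dna):
    """Return (start, end, frame) for every ORF on the forward strand.

    `start` is the index of the first base of the start codon and `end`
    is one past the last base of the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None                    # index of the currently open start codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, frame))
                start = None
    return orfs


print(find_orfs("ccatgaaatttggctaagg"))  # [(2, 17, 2)]
```

The scan finds every start-to-stop stretch, which is exactly why it cannot answer the last question on the slide: nothing in it distinguishes an expressed gene from a chance ORF.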
![Page 4: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/4.jpg)
There are several ways to predict genes
• By homology
![Page 5: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/5.jpg)
Homology predictions
[Diagram: the sequence of a known gene is aligned against the sequenced genome to locate the homologous gene]
![Page 6: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/6.jpg)
How are sequences aligned?
Substitution probability table

|   | A | C | - | … |
|---|---|---|---|---|
| A | 1 | 0.001 | … | … |
| C | 0.001 | 1 | … | … |
| - | … | … | 1 | … |

Alignment 1 (score S1):
…UUACAUUUCCCGUCCGCUCU…
…GGGGUUAAUUUGCCCGUCCA…
Alignment 2 (score S2):
…UUACAUUUCCCGUCCGCUCU…
…GGGGUUAAUUUGCCCGUCCA…
S2 > S1
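The slide does not name an alignment algorithm. One standard choice is dynamic programming (Needleman-Wunsch), sketched below. The scoring values `match`, `mismatch` and `gap` are illustrative assumptions, standing in for the substitution probability table, and `global_alignment_score` is a hypothetical name.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of the best global alignment of a and b."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGU", "ACU"))  # 1: three matches and one gap
```

Comparing the scores of candidate alignments is what lets one alignment be preferred over another.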
![Page 7: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/7.jpg)
Problems of homology predictions: The genetic code
…UUAAUUUCCCGUCCG…
…CUUAUAAGUAGACCA…
…LISRP…
NO HOMOLOGY!!
Yet both sequences code for the same peptide
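The claim can be checked directly. The sketch below translates both RNA sequences with the standard genetic code and measures how many nucleotide positions agree. The helper names `translate` and `identity` are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, with codons enumerated in UCAG order at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(rna):
    """Translate an RNA string codon by codon, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


seq1 = "UUAAUUUCCCGUCCG"
seq2 = "CUUAUAAGUAGACCA"
print(translate(seq1), translate(seq2))   # LISRP LISRP
print(identity(seq1, seq2))               # 0.4
```

Only 6 of the 15 nucleotides match, while the two peptides are identical.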
![Page 8: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/8.jpg)
Solution for redundancy of genetic code:
Use synonymous substitution when doing the DNA alignment
The problem of doing this:
…UUAAUUUCCCGUCCG…
…UUAAUUUCCCGUCCA…
…UUAAUUUCCAGACCG…
…
…CUUAUAAGUAGACCA…
Combinatorial explosion!
Solutions? Not many: efficient algorithms, more computer power, patience
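The size of the explosion is easy to estimate: the number of nucleotide sequences that encode a peptide is the product of the number of synonymous codons for each residue. A minimal sketch using the standard genetic code follows; `count_encodings` is a hypothetical helper name.

```python
from collections import Counter
from itertools import product
from math import prod

BASES = "UCAG"
# Standard genetic code, with codons enumerated in UCAG order at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Number of codons that encode each amino acid (its degeneracy).
DEGENERACY = Counter(CODON_TABLE.values())


def count_encodings(peptide):
    """Number of distinct RNA sequences that translate to `peptide`."""
    return prod(DEGENERACY[aa] for aa in peptide)


print(count_encodings("LISRP"))  # 6 * 3 * 6 * 6 * 4 = 2592
```

Five residues already give 2592 synonymous sequences, and the count multiplies with every residue added.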
![Page 9: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/9.jpg)
Homology predictions most effective for closely related organisms
Thus, homology-based gene prediction works best when the genome of a closely related organism has been fully sequenced and annotated!
![Page 10: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/10.jpg)
There are other ways to predict if ORFs are genes
• By homology
• Ab initio methods
  – Signal sensors
    • ATG sites
    • Promoter element identification
    • Regulatory element identification
    • Shine-Dalgarno sequence identification (i.e. ribosome binding sites)
    • …
![Page 11: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/11.jpg)
Using initiation and termination codons to identify ORFs
• ATG is the start codon
  – GTG, CTG, TTG are minor start codons
• If a termination codon is too close to the ATG, the ORF is unlikely to be a gene
atgaatgaatgctgccgaagatctctggcaccaaattttggagcggttgcag…
atgaatgaatgctgccgaagatctctggcaccaaattttggagcggtgacag…
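The two sequences above differ near the end, and in the frame of the second ATG the second sequence reaches a TGA. The sketch below counts codons between a chosen start and the first in-frame stop. The 100-codon default in `long_enough` is an arbitrary assumption, since the slide gives no cutoff, and both function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_before_stop(dna, start):
    """Count codons from `start` up to, but not including, the first in-frame stop.

    Returns None when the sequence ends before any stop codon is reached.
    """
    dna = dna.upper()
    for i in range(start, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            return (i - start) // 3
    return None


def long_enough(dna, start, min_codons=100):
    """True if the ORF opening at `start` is closed and has at least `min_codons` codons."""
    n = codons_before_stop(dna, start)
    return n is not None and n >= min_codons


seq_a = "atgaatgaatgctgccgaagatctctggcaccaaattttggagcggttgcag"
seq_b = "atgaatgaatgctgccgaagatctctggcaccaaattttggagcggtgacag"
print(codons_before_stop(seq_a, 4))  # None: no stop within the fragment shown
print(codons_before_stop(seq_b, 4))  # 14
```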
![Page 12: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/12.jpg)
Using Promoter sequences to identify ORFs
• Many promoters have a known structure
• Identifying Promoters close to initiation codons increases likelihood of ORF being gene
Lac promoter
![Page 13: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/13.jpg)
Using response elements to identify ORFs
• Regulatory binding sites (RBS) have a known structure
• Identifying RBS close to initiation codons increases likelihood of ORF being gene
![Page 14: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/14.jpg)
Using ribosomal binding sequences to identify ORFs
• Ribosomal binding sites (Shine-Dalgarno sequences, SDS) have a known structure
• Identifying an SDS close to an initiation codon increases the likelihood of the ORF being a gene
Consensus Shine-Dalgarno sequence: AGGAGG
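A signal sensor for this motif can be sketched as a near-match search upstream of a candidate start codon. The 20-base `window` and the tolerance of one mismatch are assumptions chosen for illustration, not values from the slides, and `has_sd_site` is a hypothetical name.

```python
SD_CONSENSUS = "AGGAGG"


def has_sd_site(dna, start, max_mismatches=1, window=20):
    """Look for a near-match to the SD consensus in the `window` bases upstream of `start`."""
    dna = dna.upper()
    upstream = dna[max(0, start - window):start]
    k = len(SD_CONSENSUS)
    for i in range(len(upstream) - k + 1):
        mismatches = sum(a != b for a, b in zip(upstream[i:i + k], SD_CONSENSUS))
        if mismatches <= max_mismatches:
            return True
    return False


# Toy sequence: the start codon ATG sits at index 14, with AGGAGG upstream of it.
print(has_sd_site("ttaggaggtaacatatgaaa", 14))  # True
print(has_sd_site("ttttttttttttttatgaaa", 14))  # False
```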
![Page 15: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/15.jpg)
There are several ways to predict genes
• By homology
• Ab initio methods
  – Signal sensors
    • Promoter element identification
    • Regulatory element identification
    • Shine-Dalgarno sequence identification (i.e. ribosome binding sites)
    • ATG sites
    • …
  – Content sensors
    • Codon usage
    • GC content
    • Position asymmetry
    • CpG islands
    • …
![Page 16: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/16.jpg)
Using codon bias to predict expressed ORFs
• The frequencies of synonymous codons in an organism are not uniform
• The frequency of synonymous codons in coding sequences differs from that in non-coding sequences
• This can be used to predict coding open reading frames

Average codon usage for Ile, overall and in each reading frame (RF):

| | ATT | ATC | ATA |
|---|---|---|---|
| Average | 0.34 | 0.46 | 0.20 |
| RF1 | 0.34 | 0.26 | 0.40 |
| RF2 | 0.40 | 0.20 | 0.40 |
| RF3 | 0.32 | 0.42 | 0.25 |

atgaatgcatgctgccgaagatctctggcaccaaattttggagcggttgcag…
The third reading frame is the most likely to be a gene
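One way to turn "most likely" into a number is to measure how far each frame's Ile codon usage lies from the average. The sketch below uses the values from this slide and the Euclidean distance. The choice of distance is an assumption, since the slides do not say how the comparison is scored.

```python
from math import dist

# Ile codon usage values copied from the slide.
REFERENCE = {"ATT": 0.34, "ATC": 0.46, "ATA": 0.20}
FRAMES = {
    "RF1": {"ATT": 0.34, "ATC": 0.26, "ATA": 0.40},
    "RF2": {"ATT": 0.40, "ATC": 0.20, "ATA": 0.40},
    "RF3": {"ATT": 0.32, "ATC": 0.42, "ATA": 0.25},
}


def usage_distance(usage, reference=REFERENCE):
    """Euclidean distance between two codon usage profiles over the same codons."""
    codons = sorted(reference)
    return dist([usage[c] for c in codons], [reference[c] for c in codons])


best = min(FRAMES, key=lambda name: usage_distance(FRAMES[name]))
print(best)  # RF3
```

A real content sensor would combine all amino acids rather than Ile alone.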
![Page 17: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/17.jpg)
Using GC content to predict expressed ORFs
• The G+C content of the third position of codons in coding sequences is biased
gtgattagctctgccgaagatctctggcaccaaattttggagcggttgcag…

G+C count at the third codon position:

| Frame 1 | Frame 2 | Frame 3 |
|---|---|---|
| 11 | 9 | 5 |

Genes have a very high (or very low) G+C content at the third position of the codons in the reading frame, so Frame 1 (or Frame 3, if the bias is towards low G+C) is more likely to be expressed.
Not very useful for eukaryotes
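The three counts on this slide can be reproduced by counting G and C at every third base of each forward frame. This is a minimal sketch, with `gc3_counts` as a hypothetical helper name, applied to the fragment shown on the slide without its trailing ellipsis.

```python
def gc3_counts(dna):
    """Count G or C at the third codon position for each of the three forward frames."""
    dna = dna.upper()
    counts = []
    for frame in range(3):
        # Every third base starting at frame + 2 is the last base of a complete codon.
        thirds = dna[frame + 2::3]
        counts.append(sum(base in "GC" for base in thirds))
    return counts


fragment = "gtgattagctctgccgaagatctctggcaccaaattttggagcggttgcag"
print(gc3_counts(fragment))  # [11, 9, 5]
```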
![Page 18: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/18.jpg)
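Using position asymmetry to predict expressed ORFs
gtgaatgtatgctctgccgaagatctctggcaccaaattttggagcggttgcag…

Average gene:

| | A | T | C | G |
|---|---|---|---|---|
| Position 1 | 0.20 | 0.20 | 0.22 | 0.40 |
| Position 2 | 0.38 | 0.22 | 0.20 | 0.20 |
| Position 3 | 0.30 | 0.22 | 0.24 | 0.24 |

RF1:

| | A | T | C | G |
|---|---|---|---|---|
| Position 1 | 0.19 | 0.19 | 0.24 | 0.38 |
| Position 2 | 0.38 | 0.24 | 0.19 | 0.19 |
| Position 3 | 0.29 | 0.24 | 0.24 | 0.24 |

RF2:

| | A | T | C | G |
|---|---|---|---|---|
| Position 1 | 0.38 | 0.24 | 0.19 | 0.19 |
| Position 2 | 0.19 | 0.38 | 0.24 | 0.19 |
| Position 3 | 0.25 | 0.25 | 0.25 | 0.25 |

RF3:

| | A | T | C | G |
|---|---|---|---|---|
| Position 1 | 0.45 | 0.15 | 0.15 | 0.25 |
| Position 2 | 0.20 | 0.18 | 0.30 | 0.32 |
| Position 3 | 0.11 | 0.36 | 0.25 | 0.25 |
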
• Coding sequences have a characteristic distribution of nucleotides in each of the three positions of codons
![Page 19: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/19.jpg)
Using position asymmetry to predict expressed ORFs
[Figure: four panels, "Position Asymmetry For A", "Position Asymmetry For T", "Position Asymmetry For C" and "Position Asymmetry For G". Each panel plots Frequency (y-axis) against codon Position 1 to 3 (x-axis) for the average gene and for reading frames R1, R2 and R3.]
Reading Frame 1 is the most likely because its position asymmetry is the most similar to that of known genes.
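The comparison can be sketched in two steps: tabulate nucleotide frequencies at each codon position, then measure how far a frame's table lies from the average-gene table. The sum of absolute differences used here is an assumed similarity measure, since the slides do not specify one, and both function names are hypothetical. The tables are the ones given for the average gene and RF1 to RF3.

```python
# Rows are codon positions 1 to 3; columns are A, T, C, G (values from the slides).
AVERAGE_GENE = [[0.20, 0.20, 0.22, 0.40], [0.38, 0.22, 0.20, 0.20], [0.30, 0.22, 0.24, 0.24]]
FRAMES = {
    "RF1": [[0.19, 0.19, 0.24, 0.38], [0.38, 0.24, 0.19, 0.19], [0.29, 0.24, 0.24, 0.24]],
    "RF2": [[0.38, 0.24, 0.19, 0.19], [0.19, 0.38, 0.24, 0.19], [0.25, 0.25, 0.25, 0.25]],
    "RF3": [[0.45, 0.15, 0.15, 0.25], [0.20, 0.18, 0.30, 0.32], [0.11, 0.36, 0.25, 0.25]],
}


def position_frequencies(dna, frame=0):
    """Frequencies of A, T, C, G at codon positions 1 to 3 (needs one complete codon)."""
    dna = dna.upper()
    n = (len(dna) - frame) // 3                 # number of complete codons
    codons = dna[frame:frame + 3 * n]
    return [[codons[pos::3].count(b) / n for b in "ATCG"] for pos in range(3)]


def asymmetry_distance(table, reference=AVERAGE_GENE):
    """Sum of absolute differences between two position-by-nucleotide tables."""
    return sum(abs(x - y) for row, ref in zip(table, reference) for x, y in zip(row, ref))


best = min(FRAMES, key=lambda name: asymmetry_distance(FRAMES[name]))
print(best)  # RF1
```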
![Page 20: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/20.jpg)
CpG Islands are signals for transcription initiation
• Near the promoter of known genes, the content of CG dinucleotides is higher than it is away from transcription initiation sites
• Thus, ORFs whose ATG is preceded by a CpG island are more likely to be genes
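A common way to quantify CpG enrichment in a window is the observed/expected ratio: the number of CG dinucleotides divided by the number expected from the C and G counts alone. The slides do not give a formula, so the one below is an assumption borrowed from common practice, and the cutoff for calling an island is left open.

```python
def cpg_obs_exp(dna):
    """Observed/expected CpG ratio: count(CG) * length / (count(C) * count(G))."""
    dna = dna.upper()
    c, g = dna.count("C"), dna.count("G")
    if c == 0 or g == 0:
        return 0.0                      # no CpG is possible without both bases
    return dna.count("CG") * len(dna) / (c * g)


print(cpg_obs_exp("CGCGCGCG"))  # 2.0
print(cpg_obs_exp("CCGG"))      # 1.0
```

In practice the ratio would be computed in a sliding window upstream of each candidate ATG.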
![Page 21: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/21.jpg)
Other asymmetry measures of gene likelihood
• Dinucleotide bias
• Hexanucleotide bias
• …
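Dinucleotide and hexanucleotide bias both start from the same quantity, the frequency of each k-mer in a sequence (k = 2 or k = 6). A minimal sketch follows, with `kmer_frequencies` as a hypothetical helper. Comparing these frequencies between known coding and non-coding sequence is left out.

```python
from collections import Counter


def kmer_frequencies(dna, k):
    """Relative frequency of every overlapping k-mer in `dna`."""
    dna = dna.upper()
    total = len(dna) - k + 1            # number of overlapping windows
    counts = Counter(dna[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


print(kmer_frequencies("ACGACG", 2))  # {'AC': 0.4, 'CG': 0.4, 'GA': 0.2}
```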
![Page 22: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/22.jpg)
Summary
• Genes can be predicted by
  – Homology
  – Content sensors
  – Signal sensors
![Page 23: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/23.jpg)
How are eukaryotic genes different?
DNA → (RNA Pol) → mRNA → (Ribosome) → Protein
![Page 24: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/24.jpg)
How are eukaryotic genes different?
DNA → (RNA Pol) → mRNA → (Spliceosome) → spliced mRNAs → (Ribosome) → Protein
Correctly Identifying Splicing sites is not a trivial task
![Page 25: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/25.jpg)
How do we predict splicing sites?
• By Homology
• Ab initio
  – SS motifs
  – Codon usage
  – Exonic Splicing Enhancers
  – Intronic Splicing Enhancers
  – Exonic Splicing Silencers
  – Intronic Splicing Silencers
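As an illustration of the splice site motif approach, the sketch below lists every stretch that begins with GT and ends with AG, the canonical intron boundary dinucleotides. The slides do not show the motifs themselves, so the GT-AG rule, the `min_len` cutoff and the name `candidate_introns` are assumptions made here.

```python
def candidate_introns(dna, min_len=20):
    """Return (start, end) pairs where dna[start:end] begins with GT and ends with AG."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]


print(candidate_introns("CCGTAAAAAAAGCC", min_len=4))  # [(2, 12)]
```

On real sequence this produces far more candidates than true introns, which is why the other items in the list (codon usage, enhancers, silencers) are needed to rank them.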
![Page 26: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/26.jpg)
Homology Splice Site Prediction
Known spliced gene
Predicted spliced gene
![Page 27: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/27.jpg)
Splice Site Motifs
![Page 28: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/28.jpg)
Exonic Splicing Enhancers
![Page 29: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/29.jpg)
Exonic Splicing Silencers
Genes & Development 18:1241-1250
![Page 30: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/30.jpg)
Interaction between SE and SI
![Page 31: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/31.jpg)
Rules for Splicing
• 3’ end likely target for repression
• Distance between SE and 3’ end < 100bp
• Splicing efficiency p(interaction SEC-3’ end)
![Page 32: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/32.jpg)
Methods for splicing detection
[Diagram: a set of known spliced genes is divided into a training set and a test set. The training set is used to fit an algorithm (GA, NN, HMM, Bayesian, ME); the fitted algorithm is then applied to the test set to produce predictions.]
![Page 33: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/33.jpg)
A Genetic Algorithm Method
[Table: matrix of splicing motifs (DM1, …, AMi, …, EM, IM) against each other; cells hold probabilities p(i)]
Shuffle lines and columns k times and each time calculate the probability of a given combination of motifs getting spliced.
Select the m best combinations and continue to evolve the algorithm until it predicts the training set.
![Page 34: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/34.jpg)
A Neural Net Method
[Diagram: neural net with sequences as input, a weight table for splice elements, hidden nodes, and predicted splicing as output; the predictions feed back into a corrected weight table for splice elements]
![Page 35: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/35.jpg)
Summary
• Eukaryotic genes are split into exons separated by introns
• Biological rules combined with mathematical and statistical approaches can be used to predict the boundaries for the exons and to predict the splice variants
![Page 36: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/36.jpg)
How to find what genes a string of DNA contains
Rui Alves
![Page 37: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/37.jpg)
Simple steps
• Go to a known gene prediction server (or google for one)
• Input sequence and wait for prediction
• Get prediction(s), either as cDNA or as a translated protein sequence, and do homology searches to identify them in a known database (e.g. NCBI or Swiss-Prot)
![Page 38: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/38.jpg)
Simple steps a)
• Go to a known gene prediction server (or google for one)
• Input sequence and wait for prediction
• Get prediction(s), either as cDNA or as a translated protein sequence and do homology searches to identify them
![Page 39: From Genomes to Genes](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681320d550346895d9860b9/html5/thumbnails/39.jpg)
Paper Presentation
The human genome (Science) vs. The human genome (Nature)
Nature: Pages 875 to 901
Science: Pages 1317 to 1337
Compare the differences in methods and results for the annotation
DO NOT SPEND TIME TALKING ABOUT THE SEQUENCING OR ASSEMBLY ITSELF
Do not go into the comparative genome analysis