Bioinformatics and Computational Biology KFC/BIN V. Prediction...
Bioinformatics and Computational Biology
KFC/BIN
VI. Gene prediction
RNDr. Karel Berka, Ph.D.
Palacký University Olomouc
Gene prediction
A gene is "a locatable region of genomic sequence,
corresponding to a unit of inheritance, which is
associated with regulatory regions, transcribed regions,
and/or other functional sequence regions".
An allele is one variant of that gene (e.g. "good genes", "hair
color gene").
Gregor Mendel
Prediction:
different information content of coding (CDS) and non-coding
(UTR) sequences in the genome.
Information content
i l-ve you
hr-jlka ds
Genome: What is in a chromosome?
Hierarchical vs. Whole Genome shotgun
The value of genome sequences
lies in their annotation
• Annotation – Characterizing genomic
features using computational and
experimental methods
• Genes: Four levels of annotation
– Gene Prediction – Where are genes?
– What do they look like?
– Domains – What do the proteins do?
– Role – What pathway(s) are they involved in?
How many genes does a human have?
Consortium: 35,000 genes?
Celera: 30,000 genes?
Affymetrix: 60,000 human genes on GeneChips?
Incyte and HGS: over 120,000 genes?
GenBank: 49,000 unique gene coding sequences?
UniGene: > 89,000 clusters of unique ESTs?
Current consensus (in flux …)
• 20,000 known genes (2010)
– (similarity to previously isolated genes and
expressed sequences from a large variety of
different organisms)
– 15,000 known in 2003
• 22,333 predicted (RefSeq)
– problems with prediction algorithms (low
efficiency) (Nature blog 2010)
How do we get from here …
to here,
What are genes? - 1
• Complete DNA segments responsible for
making functional products
• Products
– Proteins
– Functional RNA molecules
• RNAi (interfering RNA)
• rRNA (ribosomal RNA)
• snRNA (small nuclear)
• snoRNA (small nucleolar)
• tRNA (transfer RNA)
What are genes? - 2
• Definition vs. dynamic concept
• Consider
– Prokaryotic vs. eukaryotic gene models
– Introns/exons
– Posttranscriptional modifications
– Alternative splicing
– Differential expression
– Genes-in-genes
– Genes-ad-genes
– Posttranslational modifications
– Multi-subunit proteins
Prokaryotic gene model: ORF-genes
• “Small” genomes, high gene density
– Haemophilus influenzae genome 85% genic
• Operons
– One transcript, many genes
• No introns
– One gene, one protein
• Open reading frames (ORF)
– One ORF per gene
– ORFs begin with a start codon and end with a stop codon (DNA / RNA):
- TAG / UAG ("amber")
- TAA / UAA ("ochre")
- TGA / UGA ("opal" or "umber")
Mnemonic: UGA: "U Go Away", UAA: "U Are Away", UAG: "U Are Gone"
TIGR: http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi
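The ORF rule on this slide (start codon, a run of triplets, stop codon) can be sketched in a few lines of Python. This is only an illustration: it assumes ATG as the sole start codon, scans the forward strand only, and uses a made-up toy sequence and a hypothetical `min_codons` threshold.

```python
# Minimal ORF scan on the forward strand (illustrative sketch, not a real gene finder).
START = "ATG"                     # assumption: only ATG is treated as a start codon
STOPS = {"TAG", "TAA", "TGA"}     # amber, ochre, opal

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples; end is exclusive and includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF in this frame
    return orfs

example = "CCATGAAATTTGGGTAGCC"   # hypothetical toy sequence
for start, end, frame in find_orfs(example):
    print(frame, start, end, example[start:end])
```

A real ORF detector would also scan the reverse complement and apply a much larger minimum length.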
Eukaryotic gene model: spliced genes
• Posttranscriptional modification
– 5’-CAP, polyA tail, splicing
• Open reading frames
– Mature mRNA contains ORF
– All internal exons contain open “read-through”
– Pre-start and post-stop sequences are UTRs
• Multiple translates
– One gene – many proteins via alternative splicing
Expansions and Clarifications
• ORFs
– Start – triplets – stop
– Prokaryotes: gene = ORF
– Eukaryotes: spliced genes or ORF genes
• Exons
– Remain after introns have been removed
– Flanking parts contain non-coding sequence (5’- and 3’-UTRs)
Where do genes live?
• In genomes
• Example: the human genome
– 3,274,571,503 bp (Ensembl 2010)
– 25 chromosomes: 1-22, X, Y, mt
– 22,333 genes (RefSeq estimate 2010)
– Gene size from 128 nucleotides (RNA gene) to 2,800 kb (DMD)
– Ca. 25% of the genome is genes (introns, exons)
– Ca. 1% of the genome codes for amino acids (CDS)
– 30 kb gene length (average)
– 1.4 kb ORF length (average)
– 3 transcripts per gene (average)
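As a rough plausibility check, the averages on this slide can be multiplied out to see whether they agree with the quoted genome fractions. The sketch below assumes that genes do not overlap and that the averages are simple means, neither of which the slide states.

```python
# Back-of-the-envelope check of the human genome figures quoted on the slide.
genome_bp = 3_274_571_503   # Ensembl 2010
genes = 22_333              # RefSeq estimate 2010
avg_gene_bp = 30_000        # 30 kb average gene length
avg_orf_bp = 1_400          # 1.4 kb average ORF length

# Assumption: genes do not overlap, so the lengths simply add up.
genic_fraction = genes * avg_gene_bp / genome_bp
coding_fraction = genes * avg_orf_bp / genome_bp

print(f"genic: {genic_fraction:.1%}, coding: {coding_fraction:.1%}")
```

Both values come out in the same range as the "ca. 25%" and "ca. 1%" figures, so the numbers on the slide are mutually consistent to within rounding.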
Sample genomes
Species          Size       Genes    Genes/Mb
D.melanogaster   137Mb      13,338   97
E.coli           4.6Mb      4,300    934
S.cerevisiae     15Mb       6,144    410
A.thaliana       115Mb      25,800   224
C.elegans        85.5Mb     18,266   214
H.sapiens        3,200Mb    35,000   11
List of 68 eukaryotes, 141 bacteria, and 17 archaea at
http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html
Genomic sequence features
• Repeats (“Junk DNA”)
– Transposable elements, simple repeats
– RepeatMasker (http://www.repeatmasker.org/)
• Genes
– Vary in density, length, structure
– Identification depends on evidence and methods and may require concerted application of bioinformatics methods and lab research
• Pseudogenes
– Look-alikes of genes that obstruct gene-finding efforts.
• Non-coding RNAs (ncRNA)
– tRNA, rRNA, snRNA, snoRNA, miRNA
– tRNASCAN-SE, COVE (http://selab.janelia.org/software.html)
Gene identification
• Homology-based gene prediction
– Similarity searches (e.g. BLAST, BLAT)
– Genome browsers
– RNA evidence (ESTs, expressed sequence tags from cDNA)
• Ab initio gene prediction
– Gene prediction programs
– Prokaryotes
• ORF identification
– Eukaryotes
• Promoter prediction
• PolyA-signal prediction
• Splice site, start/stop-codon predictions
Gene prediction through comparative
genomics
• Highly similar (conserved) regions between two
genomes are likely functional; otherwise they would have diverged
• If genomes are too closely related, all regions are similar,
not just genes
• If genomes are too far apart, analogous regions may be
too dissimilar to be found
Genome Browsers
Generic Genome Browser (CSHL)
www.wormbase.org/db/seq/gbrowse
NCBI Map Viewer
www.ncbi.nlm.nih.gov/mapview/
Ensembl Genome Browser
www.ensembl.org/
Apollo Genome Browser
www.bdgp.org/annot/apollo/
UCSC Genome Browser
genome.ucsc.edu/cgi-bin/hgGateway?org=human
Gene discovery using ESTs
• Expressed Sequence Tags (ESTs)
represent sequences from expressed
genes.
• If a region matches an EST with high
consensus, then the region is probably a gene
or pseudogene.
– An EST overlapping an exon boundary gives an
accurate prediction of that exon boundary.
Ab initio gene prediction
• Prokaryotes
– ORF-Detectors
• Eukaryotes
– Position, extent & direction: through promoter
and polyA-signal predictors
– Structure: through splice site predictors
– Exact location of coding sequences: through
determination of relationships between
potential start codons, splice sites, ORFs, and
stop codons
How it works I – ORF (animation)
How it works II – Motif identification
Exon-intron borders = splice sites
    Exon                 Intron                 Exon
~~gaggcatcag|gtttgtagac~~~~~~~~~~~tgtgtttcag|tgcacccact~~
~~ccgccgctga|gtgagccgtg~~~~~~~~~~~tctattctag|gacgcgcggg~~
~~tgtgaattag|gtaagaggtt~~~~~~~~~~~atatctccag|atggagatca~~
~~ccatgaggag|gtgagtgcca~~~~~~~~~~~ttatttccag|gtatgagacg~~
       Splice site                     Splice site
    Exon                 Intron                 Exon
~~gaggcatcag|GTttgtagac~~~~~~~~~~~tgtgtttcAG|tgcacccact~~
~~ccgccgctga|GTgagccgtg~~~~~~~~~~~tctattctAG|gacgcgcggg~~
~~tgtgaattag|GTaagaggtt~~~~~~~~~~~atatctccAG|atggagatca~~
~~ccatgaggag|GTgagtgcca~~~~~~~~~~~ttatttccAG|gtatgagacg~~
       Splice site                     Splice site
Motif Extraction Programs at http://www-btls.jst.go.jp/
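The GT...AG pattern highlighted in the alignment can be checked mechanically. The sketch below reuses the four example sequences from the slide (with "~" standing for omitted intron bases); the function names are made up for illustration. It also shows why the dinucleotides alone are weak evidence: a naive scan reports every GT or AG as a candidate site.

```python
# Records copied from the slide: exon|intron|exon, "~" marks omitted intron bases.
RECORDS = [
    "gaggcatcag|gtttgtagac~~~~~~~~~~~tgtgtttcag|tgcacccact",
    "ccgccgctga|gtgagccgtg~~~~~~~~~~~tctattctag|gacgcgcggg",
    "tgtgaattag|gtaagaggtt~~~~~~~~~~~atatctccag|atggagatca",
    "ccatgaggag|gtgagtgcca~~~~~~~~~~~ttatttccag|gtatgagacg",
]

def split_record(record):
    """Split 'exon|intron|exon' into its three parts."""
    exon5, intron, exon3 = record.split("|")
    return exon5, intron, exon3

def obeys_gt_ag(intron):
    """True when the intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.lower()
    return intron.startswith("gt") and intron.endswith("ag")

def candidate_sites(seq):
    """Naive scan: every GT is a potential donor, every AG a potential acceptor."""
    seq = seq.lower()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "gt"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "ag"]
    return donors, acceptors

for record in RECORDS:
    exon5, intron, exon3 = split_record(record)
    print(obeys_gt_ag(intron), candidate_sites(exon5))
```

Even the short first exon contains AG dinucleotides that are not splice sites, which is why splice site predictors score the wider sequence context rather than the two bases alone.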
How it works III – The (ugly) truth
Gene prediction programs
• Homology-based programs – Use BLAST-like similarity searches.
– Examples: Exofish, CRITICA
• Rule-based programs – Use an explicit set of rules to make decisions.
– Example: GeneFinder
• Neural network-based programs – Use a data set to build rules.
– Examples: Grail, GrailEXP
• Hidden Markov Model-based programs
– Use probabilities of states and transitions between these states to predict features.
– Examples: Genscan, GenomeScan, Genemark
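To make the idea of "states and transitions" concrete, here is a toy two-state hidden Markov model decoded with the Viterbi algorithm. It is only a sketch: the two states, all probabilities, and the test sequence are invented for illustration and are far simpler than the models used by Genscan or similar programs.

```python
import math

STATES = ("noncoding", "coding")
# All probabilities are made-up toy values, not trained parameters.
START_P = {"noncoding": 0.9, "coding": 0.1}
TRANS_P = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT_P = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},     # GC-rich
}

def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    seq = seq.upper()
    score = {s: math.log(START_P[s]) + math.log(EMIT_P[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS_P[p][s]))
            new_score[s] = score[prev] + math.log(TRANS_P[prev][s]) + math.log(EMIT_P[s][base])
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi("ATATATATGCGCGCGCGCATATATAT")
print("".join("C" if s == "coding" else "n" for s in path))
```

With these toy parameters the GC-rich middle stretch is labelled as coding, because ten well-fitting emissions outweigh the cost of two state switches.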
Tools
• ORF detectors
– NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
• Promoter predictors
– CSHL: http://rulai.cshl.org/software/index1.htm
– BDGP: fruitfly.org/seq_tools/promoter.html
– ICG: TATA-Box predictor
• PolyA signal predictors
– CSHL: argon.cshl.org/tabaska/polyadq_form.html
• Splice site predictors
– BDGP: http://www.fruitfly.org/seq_tools/splice.html
• Start-/stop-codon identifiers
– DNALC: Translator/ORF-Finder
– BCM: Searchlauncher
CRITICA
prediction of prokaryotic genes
search for RBS (ribosomal binding site, Shine-Dalgarno
sequence)
Principle:
• TBLASTP against a protein database and selection of clearly
coding parts (usually only parts of the genes).
• Calculation of a statistical model.
• Prediction of genes.
• New statistical model and new prediction, and so on iteratively.
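The ribosomal binding site search mentioned for CRITICA can be illustrated with a simple motif check upstream of a candidate start codon. This is not CRITICA's actual procedure: the consensus `AGGAGG`, the spacer range of 4 to 10 bases, the mismatch tolerance, and both test sequences are assumptions chosen for the sketch.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def has_rbs(dna, start, motif="AGGAGG", max_mismatch=1, spacers=range(4, 11)):
    """True if a Shine-Dalgarno-like motif ends 4-10 bases upstream of `start`."""
    dna = dna.upper()
    for spacer in spacers:
        begin = start - spacer - len(motif)
        if begin < 0:
            continue                      # motif would run off the sequence
        window = dna[begin:begin + len(motif)]
        if mismatches(window, motif) <= max_mismatch:
            return True
    return False

with_rbs = "TTAGGAGGTTTACATATGAAA"      # hypothetical: AGGAGG, 7-base spacer, ATG
without_rbs = "TTTTTTTTTTTTTTTATGAAA"   # hypothetical: no motif upstream of ATG

print(has_rbs(with_rbs, with_rbs.find("ATG")))
print(has_rbs(without_rbs, without_rbs.find("ATG")))
```

A start codon preceded by such a motif is a more credible translation start than one without, which helps a prokaryotic gene finder choose among several in-frame ATG candidates.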
Genscan
prediction of eukaryotic genes
different statistical models for the first and last
exon
search for promoters, terminators, polyA signals
different statistical parameters for different GC content
www:
http://genes.mit.edu/GENSCAN.html
Genscan
Probability    Exons   Exactly   Partially   Overlaps   Error
0.99 - 1.00    917     97.7%     0.9%        0.0%       1.4%
0.95 - 0.99    551     92.4%     3.4%        0.2%       4.0%
0.90 - 0.95    263     87.8%     6.1%        0.4%       5.7%
0.75 - 0.90    337     74.8%     16.0%       1.2%       8.0%
0.50 - 0.75    362     54.1%     26.2%       2.2%       17.4%
0.00 - 0.50    248     29.8%     27.8%       4.0%       38.3%
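The table can be summarized by weighting each probability bin by its number of exons. The short sketch below does this for all predicted exons and for the subset with probability of at least 0.90; the values are copied from the table, and the choice of 0.90 as a cutoff is only an example.

```python
# (probability bin, number of exons, fraction predicted exactly) from the table.
ROWS = [
    ("0.99-1.00", 917, 0.977),
    ("0.95-0.99", 551, 0.924),
    ("0.90-0.95", 263, 0.878),
    ("0.75-0.90", 337, 0.748),
    ("0.50-0.75", 362, 0.541),
    ("0.00-0.50", 248, 0.298),
]

def exact_fraction(rows):
    """Exon-weighted fraction of exactly predicted exons."""
    total = sum(n for _, n, _ in rows)
    exact = sum(n * frac for _, n, frac in rows)
    return total, exact / total

total_all, exact_all = exact_fraction(ROWS)
total_high, exact_high = exact_fraction(ROWS[:3])   # bins with P >= 0.90

print(total_all, round(exact_all, 3))
print(total_high, round(exact_high, 3))
```

Restricting attention to high-probability exons trades coverage for reliability: fewer exons are kept, but a larger share of them is exactly right.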
GENSCAN 1.0 Date run: 31-Oct-100 Time: 15:54:20
Sequence HERV17_004640 : 40714 bp : 37.79% C+G : Isochore 1 ( 0.00 - 43.00 C+G%)
Parameter matrix: HumanIso.smat
Predicted genes/exons:
Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
1.01 Init + 1825 1853 29 0 2 86 71 45 0.579 1.72
1.02 Term + 3886 4075 190 1 1 85 44 198 0.941 11.04
1.03 PlyA + 4961 4966 6 1.05
2.00 Prom + 6668 6707 40 -4.65
2.01 Init + 17251 17375 125 0 2 45 72 80 0.590 1.81
2.02 Term + 20137 20329 193 1 1 85 43 196 0.990 10.71
2.03 PlyA + 20809 20814 6 1.05
3.08 PlyA - 21608 21603 6 -3.24
3.07 Term - 22315 21651 665 2 2 -17 55 522 0.952 31.44
3.06 Intr - 24268 22592 1677 2 0 81 94 2124 0.885 198.67
…
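Output like this is easier to work with once it is parsed into records. The sketch below assumes the whitespace-separated layout shown on the slide, with 13 columns for exon rows and 7 columns for promoter and polyA rows; the `Feature` type and the 0.9 probability cutoff are illustrative choices, not part of GENSCAN.

```python
from typing import NamedTuple, Optional

class Feature(NamedTuple):
    gene_exon: str          # Gn.Ex column, e.g. "1.02"
    kind: str               # Type column: Init, Intr, Term, Prom, PlyA
    strand: str             # "+" or "-"
    begin: int
    end: int
    length: int
    prob: Optional[float]   # P column; absent on Prom/PlyA rows
    score: float            # Tscr column

def parse_line(line):
    """Parse one row of the predicted genes/exons table; return None for other lines."""
    f = line.split()
    if len(f) not in (7, 13) or f[2] not in ("+", "-") or not f[3].isdigit():
        return None
    prob = float(f[11]) if len(f) == 13 else None
    return Feature(f[0], f[1], f[2], int(f[3]), int(f[4]), int(f[5]), prob, float(f[-1]))

SAMPLE = """\
Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
1.01 Init + 1825 1853 29 0 2 86 71 45 0.579 1.72
1.02 Term + 3886 4075 190 1 1 85 44 198 0.941 11.04
1.03 PlyA + 4961 4966 6 1.05
2.00 Prom + 6668 6707 40 -4.65
"""

features = [ft for ft in map(parse_line, SAMPLE.splitlines()) if ft is not None]
confident = [ft.gene_exon for ft in features if ft.prob is not None and ft.prob >= 0.9]
print(len(features), confident)
```

Filtering on the P column is a simple way to separate well-supported exons from weaker predictions such as the initial exon 1.01.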
Genscan - example
Suboptimal exons with probability > 0.100
Exnum Type S .Begin ...End .Len Fr Ph B/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
S.001 Init + 2937 3136 200 2 2 67 -22 154 0.301 0.72
S.002 Intr + 3239 3325 87 2 0 43 23 121 0.358 -0.73
S.003 Intr + 17250 17375 126 0 0 66 72 94 0.141 4.47
S.004 Init + 17311 17375 65 0 2 55 72 45 0.204 0.27
S.005 Intr - 24927 24728 200 2 2 12 91 115 0.146 2.27
S.006 Intr - 25129 25003 127 2 1 51 92 37 0.117 -0.78
S.007 Intr - 29973 29878 96 1 0 44 111 87 0.473 5.66
S.008 Intr - 32589 32418 172 2 1 19 70 151 0.336 5.42
S.009 Intr - 32563 32427 137 2 2 46 70 116 0.122 4.97
S.010 Intr - 32589 32427 163 2 1 19 70 135 0.114 3.86
S.011 Intr - 32857 32804 54 0 0 104 103 2 0.262 0.48
S.012 Init - 33114 33008 107 0 2 79 17 87 0.296 0.46
S.013 Init + 37062 37067 6 2 0 53 68 1 0.115 -4.38
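Prediction quality
Sensitivity = TP / (TP + FN)
How many genes were found out of all present?
Specificity = TP / (TP + FP)
How many predicted genes are indeed genes?
Correlation coefficient:
$$CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{PP \cdot PN \cdot RP \cdot RN}}$$
where PP = TP + FP (predicted positive), PN = TN + FN (predicted negative),
RP = TP + FN (real positive), RN = TN + FP (real negative).
                 real +    real −
prediction +     TP        FP        PP
prediction −     FN        TN        PN
                 RP        RN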
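The three measures are easy to compute from the four counts. The sketch below uses the definitions from this slide, where "specificity" is TP / (TP + FP), and the standard correlation coefficient with a minus sign in the numerator and a square root in the denominator; the example counts are invented.

```python
import math

def prediction_quality(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient)."""
    sn = tp / (tp + fn)              # found genes / all real genes
    sp = tp / (tp + fp)              # correct predictions / all predictions
    pp, pn = tp + fp, tn + fn        # predicted positive / negative
    rp, rn = tp + fn, tn + fp        # real positive / negative
    cc = (tp * tn - fp * fn) / math.sqrt(pp * pn * rp * rn)
    return sn, sp, cc

# Hypothetical counts, e.g. nucleotides classified as coding or non-coding.
sn, sp, cc = prediction_quality(tp=80, fp=20, tn=880, fn=20)
print(round(sn, 3), round(sp, 3), round(cc, 3))
```

The correlation coefficient is useful because it stays informative when true negatives dominate, which is the usual situation for genomic sequence.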
Gene prediction accuracies
• Nucleotide level: 95% Sn, 90% Sp (lows below 50%)
• Exon level: 75% Sn, 68% Sp (lows below 30%)
• Gene level: 40% Sn, 30% Sp (lows below 10%)
• Programs that combine statistical evaluations with
similarity searches are the most powerful.
Common difficulties
• First and last exons difficult to annotate because
they contain UTRs.
• Smaller genes are not statistically significant so
they are thrown out.
• Algorithms are trained with sequences from
known genes which biases them against genes
about which nothing is known.
• Masking repeats frequently removes potentially
indicative chunks from the untranslated regions
of genes that contain repetitive elements.
The annotation pipeline
• Mask repeats using RepeatMasker.
• Run sequence through several programs.
• Take predicted genes and do similarity
search against ESTs and genes from
other organisms.
• Do similarity search for non-coding
sequences to find ncRNA.
Annotation nomenclature
• Known Gene – Predicted gene matches the entire length of a known gene.
• Putative Gene – Predicted gene contains region conserved with known gene. Also referred to as “like” or “similar to”.
• Unknown Gene – Predicted gene matches a gene or EST of which the function is not known.
• Hypothetical Gene – Predicted gene that does not contain significant similarity to any known gene or EST.
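The four categories amount to a small decision rule. The sketch below encodes them as a function; the three boolean inputs and the order in which they are tested are assumptions made for illustration, since the slide does not say how overlapping cases are resolved.

```python
def annotation_label(full_length_match, conserved_region, matches_unknown_function):
    """Map similarity evidence for a predicted gene to a nomenclature category."""
    if full_length_match:              # matches the entire length of a known gene
        return "Known Gene"
    if conserved_region:               # shares a conserved region with a known gene
        return "Putative Gene"
    if matches_unknown_function:       # hits a gene or EST of unknown function
        return "Unknown Gene"
    return "Hypothetical Gene"         # no significant similarity at all

print(annotation_label(False, True, False))
print(annotation_label(False, False, False))
```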
Credits
• http://www.dnalc.org/bioinformatics/presentations/hhmi_2003/2003_3.ppt
• Paces and Vondrasek, Bioinformatics course,
Charles University (UK)