Bioinformatics and Computational Biology KFC/BIN V. Prediction...

Bioinformatics and Computational Biology

KFC/BIN

VI. Gene Prediction RNDr. Karel Berka, Ph.D.

Palacký University Olomouc

Gene prediction

A gene is "a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions".

An allele is one variant of a gene; in colloquial usage, "gene" (e.g. "good genes", "hair color gene") often refers to an allele.

Gregor Mendel

Prediction:

coding (CDS) and non-coding (UTR) sequences in the genome differ in information content.

http://en.wikipedia.org/wiki/Locus_(genetics)

http://en.wikipedia.org/wiki/Genome

http://en.wikipedia.org/wiki/Gregor_Mendel

Information content

i l-ve you

hr-jlka ds
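One common way to put a number on "information content" is Shannon entropy. The sketch below is only an illustration of that idea, not the method used by any particular gene finder; the function names `shannon_entropy` and `codon_entropy` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy in bits per symbol of an iterable of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def codon_entropy(dna, frame=0):
    """Entropy of the non-overlapping triplets read in the given frame (0, 1 or 2)."""
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
    return shannon_entropy(codons)
```

A sequence that uses all four bases equally often reaches the maximum of 2 bits per base; biased codon usage in a coding region lowers the triplet entropy relative to a random sequence of the same composition.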

Genome: What is in a chromosome?

Hierarchical vs. Whole Genome shotgun

http://www.pnas.org/content/vol99/issue6/images/large/pq0426924001.jpeg

The value of genome sequences

lies in their annotation

• Annotation – Characterizing genomic

features using computational and

experimental methods

• Genes: Four levels of annotation

– Gene Prediction – Where are genes?

– What do they look like?

– Domains – What do the proteins do?

– Role – Which pathway(s) are they involved in?

How many genes does a human have?

Consortium: 35,000 genes?

Celera: 30,000 genes?

Affymetrix: 60,000 human genes on GeneChips?

Incyte and HGS: over 120,000 genes?

GenBank: 49,000 unique gene coding sequences?

UniGene: > 89,000 clusters of unique ESTs?

Current consensus (in flux …)

• 20,000 known genes (2010)

– (similarity to previously isolated genes and expressed sequences from a large variety of different organisms)

– 15,000 known in 2003

• 22,333 predicted (RefSeq)

– problems with prediction algorithms (low accuracy) (Nature blog 2010)

How do we get from here …

to here,

What are genes? - 1

• Complete DNA segments responsible for making functional products

• Products

– Proteins

– Functional RNA molecules

  • RNAi (interfering RNA)

  • rRNA (ribosomal RNA)

  • snRNA (small nuclear)

  • snoRNA (small nucleolar)

  • tRNA (transfer RNA)

What are genes? - 2

• Definition vs. dynamic concept

• Consider

– Prokaryotic vs. eukaryotic gene models

– Introns/exons

– Posttranscriptional modifications

– Alternative splicing

– Differential expression

– Genes-in-genes

– Genes-ad-genes

– Posttranslational modifications

– Multi-subunit proteins

Prokaryotic gene model: ORF-genes

• "Small" genomes, high gene density

– Haemophilus influenzae genome 85% genic

• Operons

– One transcript, many genes

• No introns

– One gene, one protein

• Open reading frames (ORF)

– One ORF per gene

– ORFs begin with a start codon and end with a stop codon (DNA / RNA):

  - TAG / UAG ("amber")

  - TAA / UAA ("ochre")

  - TGA / UGA ("opal" or "umber")

Mnemonic: UGA "U Go Away", UAA "U Are Away", UAG "U Are Gone"
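The start-triplets-stop rule can be sketched as a small scanner over the three forward reading frames. This is a minimal illustration under stated assumptions: only ATG is treated as a start codon (bacteria also use alternatives such as GTG and TTG), and the names `find_orfs`, `reverse_complement` and `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # ochre, amber, opal
START_CODON = "ATG"                  # assumption: alternative starts are ignored


def reverse_complement(dna):
    """Return the reverse complement so the opposite strand can be scanned too."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(dna.upper()))


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the given strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts every triplet, start and stop codon included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

To cover both strands, call `find_orfs(dna)` and `find_orfs(reverse_complement(dna))`; coordinates from the second call refer to the reverse-complemented sequence.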

TIGR: http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi

http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl

Eukaryotic gene model: spliced genes

Posttranscriptional modification

5’-CAP, polyA tail, splicing

Open reading frames

Mature mRNA contains ORF

All internal exons contain open “read-through”

Pre-start and post-stop sequences are UTRs

Multiple translates

One gene – many proteins via alternative splicing

Expansions and Clarifications

• ORFs

– Start – triplets – stop

– Prokaryotes: gene = ORF

– Eukaryotes: spliced genes or ORF genes

• Exons

– Remain after introns have been removed

– Flanking parts contain non-coding sequence (5’-

and 3’-UTRs)

Where do genes live?

• In genomes

• Example: the human genome

– 3,274,571,503 bp (Ensembl 2010)

– 25 chromosomes: 1-22, X, Y, mt

– 22,333 genes (RefSeq estimate 2010)

– Gene size ranges from 128 nucleotides (RNA gene) to 2,800 kb (DMD)

– Ca. 25% of the genome is genes (introns, exons)

– Ca. 1% of the genome codes for amino acids (CDS)

– 30 kb gene length (average)

– 1.4 kb ORF length (average)

– 3 transcripts per gene (average)

Sample genomes

Species          Size      Genes    Genes/Mb
D.melanogaster   137Mb     13,338   97
E.coli           4.6Mb     4,300    934
S.cerevisiae     15Mb      6,144    410
A.thaliana       115Mb     25,800   224
C.elegans        85.5Mb    18,266   214
H.sapiens        3,200Mb   35,000   11
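The Genes/Mb column is simply the gene count divided by the genome size in megabases. A short sketch using the numbers from the table (the dictionary name `GENOMES` and the function `gene_density` are hypothetical):

```python
# (genome size in Mb, number of genes), values as listed in the table
GENOMES = {
    "D.melanogaster": (137.0, 13338),
    "E.coli": (4.6, 4300),
    "S.cerevisiae": (15.0, 6144),
    "A.thaliana": (115.0, 25800),
    "C.elegans": (85.5, 18266),
    "H.sapiens": (3200.0, 35000),
}


def gene_density(size_mb, genes):
    """Genes per megabase."""
    return genes / size_mb


if __name__ == "__main__":
    for species, (size_mb, genes) in GENOMES.items():
        print(f"{species:15s} {gene_density(size_mb, genes):7.1f} genes/Mb")
```

The densities span almost two orders of magnitude, from roughly 11 genes/Mb in H.sapiens to over 900 genes/Mb in E.coli.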

List of 68 eukaryotes, 141 bacteria, and 17 archaea at

http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html



Genomic sequence features

• Repeats (“Junk DNA”)

– Transposable elements, simple repeats

– RepeatMasker (http://www.repeatmasker.org/)

• Genes

– Vary in density, length, structure

– Identification depends on evidence and methods and may require concerted application of bioinformatics methods and lab research

• Pseudogenes

– Look-alikes of genes that obstruct gene-finding efforts.

• Non-coding RNAs (ncRNA)

– tRNA, rRNA, snRNA, snoRNA, miRNA

– tRNAscan-SE, COVE (http://selab.janelia.org/software.html)

http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker

http://www.genetics.wustl.edu/eddy/software/

Gene identification

• Homology-based gene prediction

– Similarity searches (e.g. BLAST, BLAT)

– Genome browsers

– RNA evidence (ESTs, expressed sequence tags from cDNA)

• Ab initio gene prediction

– Gene prediction programs

– Prokaryotes

  • ORF identification

– Eukaryotes

  • Promoter prediction

  • PolyA-signal prediction

  • Splice site, start/stop-codon predictions

Gene prediction through comparative genomics

• Highly similar (conserved) regions between two genomes are likely functional, or else they would have diverged

• If genomes are too closely related, all regions are similar, not just genes

• If genomes are too far apart, analogous regions may be too dissimilar to be found
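The idea of spotting conserved regions can be sketched as a sliding-window identity scan over two already aligned sequences. This is an illustrative sketch only: it assumes the pairwise alignment (with "-" for gaps) has been produced elsewhere, and the function names, window size and threshold are hypothetical choices.

```python
def window_identity(aln_a, aln_b, window=10, step=5):
    """Return (start, fraction_identical) for each window of two aligned sequences."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        # gap columns never count as matches
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, matches / window))
    return results


def conserved_windows(aln_a, aln_b, window=10, step=5, threshold=0.8):
    """Start positions of windows whose identity reaches the threshold."""
    return [start
            for start, identity in window_identity(aln_a, aln_b, window, step)
            if identity >= threshold]
```

With two very closely related genomes nearly every window passes the threshold, and with very distant genomes almost none does, which is the trade-off described in the bullets above.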

Genome Browsers

• Generic Genome Browser (CSHL): http://www.wormbase.org/db/seq/gbrowse

• NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview/ (http://www.ncbi.nlm.nih.gov/mapview/static/MVstart.html)

• Ensembl Genome Browser: http://www.ensembl.org/

• Apollo Genome Browser: http://www.bdgp.org/annot/apollo/

• UCSC Genome Browser: http://genome.ucsc.edu/cgi-bin/hgGateway?org=human

Gene discovery using ESTs

• Expressed Sequence Tags (ESTs)

represent sequences from expressed

genes.

• If region matches EST with high

consensus then region is probably a gene

or pseudogene.

– EST overlapping exon boundary gives an

accurate prediction of exon boundary.

Ab initio gene prediction

• Prokaryotes

– ORF-Detectors

• Eukaryotes

– Position, extent & direction: through promoter

and polyA-signal predictors

– Structure: through splice site predictors

– Exact location of coding sequences: through

determination of relationships between

potential start codons, splice sites, ORFs, and

stop codons

How it works I – ORF (SWF animation)

How it works II – Motif identification

Exon-intron borders = splice sites. Introns begin with GT and end with AG (capitalized):

     Exon    |            Intron             |    Exon
~~gaggcatcag|GTttgtagac~~~~~~~~~~~tgtgtttcAG|tgcacccact~~
~~ccgccgctga|GTgagccgtg~~~~~~~~~~~tctattctAG|gacgcgcggg~~
~~tgtgaattag|GTaagaggtt~~~~~~~~~~~atatctccAG|atggagatca~~
~~ccatgaggag|GTgagtgcca~~~~~~~~~~~ttatttccAG|gtatgagacg~~
        Splice site                     Splice site

Motif Extraction Programs at http://www-btls.jst.go.jp/
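The GT...AG pattern in the examples above can be explored with a few lines of code. The sketch below tallies base counts per position for the four donor sites shown on the slide and lists naive GT...AG intron candidates; real splice-site predictors use far richer statistical models, and the function names and `min_len` parameter are hypothetical.

```python
from collections import Counter

# First ten intron bases after each exon|intron border in the slide examples
DONOR_SITES = ["gtttgtagac", "gtgagccgtg", "gtaagaggtt", "gtgagtgcca"]


def position_counts(sites):
    """Count bases at each position of equally long, aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(site[i].upper() for site in sites) for i in range(length)]


def find_sites(dna, motif):
    """0-based start positions of every occurrence of motif, overlaps included."""
    dna, motif = dna.upper(), motif.upper()
    return [i for i in range(len(dna) - len(motif) + 1)
            if dna[i:i + len(motif)] == motif]


def candidate_introns(dna, min_len=4):
    """All (start, end) spans that begin with GT and end with AG (end exclusive)."""
    donors = find_sites(dna, "GT")
    acceptors = find_sites(dna, "AG")
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]
```

Because GT and AG are so short, the naive scan reports many false candidates on real sequence; the position counts hint at why predictors also score the bases around each site.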

How it works III – The (ugly) truth

Gene prediction programs

• Homology-based programs – Use BLAST-like similarity searches.

– Examples: Exofish, CRITICA

• Rule-based programs – Use an explicit set of rules to make decisions.

– Example: GeneFinder

• Neural network-based programs – Use a data set to build rules.

– Examples: Grail, GrailEXP, GeneMark

• Hidden Markov Model-based programs – Use probabilities of states and transitions between these states to predict features.

– Examples: Genscan, GenomeScan
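To make "probabilities of states and transitions" concrete, here is a toy two-state hidden Markov model (N = non-coding, C = coding) decoded with the Viterbi algorithm. It is a sketch only: all probabilities are invented for illustration, and real programs such as Genscan use far richer state sets and trained parameters.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding
# All numbers below are hypothetical, chosen only to make the example work.
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1},
         "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},   # AT-rich
        "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}   # GC-rich


def viterbi(seq):
    """Return the most probable state path (one letter per base) for seq."""
    seq = seq.upper()
    if not seq:
        return ""
    # log-probabilities avoid numerical underflow on long sequences
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

For example, `viterbi("ATATATATATGCGCGCGCGC")` labels the AT-rich half as N and the GC-rich half as C, because one state switch costs less than emitting ten unlikely bases.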

http://argon.cshl.org/genefinder/

http://compbio.ornl.gov/Grail-1.3/

http://compbio.ornl.gov/grailexp/

http://genes.mit.edu/GENSCAN.html

http://genes.mit.edu/genomescan.html

Tools

• ORF detectors

– NCBI: http://www.ncbi.nih.gov/gorf/gorf.html

• Promoter predictors

– CSHL: http://rulai.cshl.org/software/index1.htm

– BDGP: http://www.fruitfly.org/seq_tools/promoter.html

– ICG: TATA-Box predictor, http://wwwmgs.bionet.nsc.ru/mgs/programs/bdna/tata_bdna.html

• PolyA signal predictors

– CSHL: http://argon.cshl.org/tabaska/polyadq_form.html

• Splice site predictors

– BDGP: http://www.fruitfly.org/seq_tools/splice.html

• Start-/stop-codon identifiers

– DNALC: Translator/ORF-Finder, http://www.dnalc.org/bioinformatics/dnalc_nucleotide_analyzer.htm

– BCM: Searchlauncher, http://searchlauncher.bcm.tmc.edu/seq-util/seq-util.html

CRITICA

prediction of prokaryotic genes

• Searches for the RBS (ribosomal binding site, Shine-Dalgarno sequence)

Principle:

• TBLASTP search against a protein database, selecting the clearly coding parts (usually only parts of the genes).

• Calculation of a statistical model.

• Prediction of genes.

• New statistical model, new prediction, and so on iteratively.

Genscan

prediction of eukaryotic genes

• different statistical models for the first and last exon

• searches for promoters, terminators, and polyA signals

• different statistical parameters for different GC content

www: http://genes.mit.edu/GENSCAN.html

Genscan

Probability   Exons   Exactly   Partially   Overlaps   Error
0.99 - 1.00   917     97.7%     0.9%        0.0%       1.4%
0.95 - 0.99   551     92.4%     3.4%        0.2%       4.0%
0.90 - 0.95   263     87.8%     6.1%        0.4%       5.7%
0.75 - 0.90   337     74.8%     16.0%       1.2%       8.0%
0.50 - 0.75   362     54.1%     26.2%       2.2%       17.4%
0.00 - 0.50   248     29.8%     27.8%       4.0%       38.3%

GENSCAN 1.0 Date run: 31-Oct-100 Time: 15:54:20

Sequence HERV17_004640 : 40714 bp : 37.79% C+G : Isochore 1 ( 0.00 -

43.00 C+G%)

Parameter matrix: HumanIso.smat

Predicted genes/exons:

Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..

----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------

1.01 Init + 1825 1853 29 0 2 86 71 45 0.579 1.72

1.02 Term + 3886 4075 190 1 1 85 44 198 0.941 11.04

1.03 PlyA + 4961 4966 6 1.05

2.00 Prom + 6668 6707 40 -4.65

2.01 Init + 17251 17375 125 0 2 45 72 80 0.590 1.81

2.02 Term + 20137 20329 193 1 1 85 43 196 0.990 10.71

2.03 PlyA + 20809 20814 6 1.05

3.08 PlyA - 21608 21603 6 -3.24

3.07 Term - 22315 21651 665 2 2 -17 55 522 0.952 31.44

3.06 Intr - 24268 22592 1677 2 0 81 94 2124 0.885 198.67

…

Genscan - example

Suboptimal exons with probability > 0.100

Exnum Type S .Begin ...End .Len Fr Ph B/Ac Do/T CodRg P.... Tscr..

----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------

S.001 Init + 2937 3136 200 2 2 67 -22 154 0.301 0.72

S.002 Intr + 3239 3325 87 2 0 43 23 121 0.358 -0.73

S.003 Intr + 17250 17375 126 0 0 66 72 94 0.141 4.47

S.004 Init + 17311 17375 65 0 2 55 72 45 0.204 0.27

S.005 Intr - 24927 24728 200 2 2 12 91 115 0.146 2.27

S.006 Intr - 25129 25003 127 2 1 51 92 37 0.117 -0.78

S.007 Intr - 29973 29878 96 1 0 44 111 87 0.473 5.66

S.008 Intr - 32589 32418 172 2 1 19 70 151 0.336 5.42

S.009 Intr - 32563 32427 137 2 2 46 70 116 0.122 4.97

S.010 Intr - 32589 32427 163 2 1 19 70 135 0.114 3.86

S.011 Intr - 32857 32804 54 0 0 104 103 2 0.262 0.48

S.012 Init - 33114 33008 107 0 2 79 17 87 0.296 0.46

S.013 Init + 37062 37067 6 2 0 53 68 1 0.115 -4.38
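Output like the tables above is easy to post-process, for example to keep only exons with a high probability P. The parser below is a sketch that assumes the whitespace-separated column layout visible in this sample (13 columns for exon rows, 7 for PlyA/Prom rows); it is not an official Genscan parser, and the function names are hypothetical.

```python
def parse_genscan_line(line):
    """Parse one row of the exon table; return a dict, or None for non-data rows."""
    fields = line.split()
    if len(fields) < 6:
        return None
    try:
        record = {
            "id": fields[0],
            "type": fields[1],
            "strand": fields[2],
            "begin": int(fields[3]),   # on the - strand begin is larger than end
            "end": int(fields[4]),
            "length": int(fields[5]),
        }
        if len(fields) == 13:
            # exon rows: Fr Ph I/Ac Do/T CodRg P Tscr follow the length
            record["prob"] = float(fields[11])
            record["score"] = float(fields[12])
        else:
            # PlyA / Prom rows carry only a score
            record["score"] = float(fields[-1])
    except ValueError:
        return None  # header or separator row
    return record


def confident_exons(lines, min_prob=0.9):
    """Keep rows that have a probability column with P >= min_prob."""
    records = (parse_genscan_line(line) for line in lines)
    return [r for r in records if r and r.get("prob", 0.0) >= min_prob]
```

Applied to the first predicted gene above, a threshold of 0.9 keeps the terminal exon 1.02 (P = 0.941) and drops the initial exon 1.01 (P = 0.579).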

Genscan - example

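Prediction quality

Sensitivity: $Sn = \frac{TP}{TP + FN}$
How many genes were found out of all present?

Specificity: $Sp = \frac{TP}{TP + FP}$
How many predicted genes are indeed genes?

Correlation coefficient: $CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{PP \cdot PN \cdot RP \cdot RN}}$

                       real positive   real negative
predicted positive     TP              FP              PP = TP + FP
predicted negative     FN              TN              PN = TN + FN
                       RP = TP + FN    RN = TN + FP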
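These measures follow directly from the four counts TP, FP, FN and TN. A minimal sketch, using the gene-prediction convention in which specificity is TP / (TP + FP); the function name `prediction_quality` and the example counts are hypothetical.

```python
import math


def prediction_quality(tp, fp, fn, tn):
    """Return (sensitivity, specificity, correlation coefficient)."""
    sn = tp / (tp + fn)              # fraction of real features that were found
    sp = tp / (tp + fp)              # fraction of predictions that are real
    pp, pn = tp + fp, tn + fn        # predicted positives / negatives
    rp, rn = tp + fn, tn + fp        # real positives / negatives
    cc = (tp * tn - fp * fn) / math.sqrt(pp * pn * rp * rn)
    return sn, sp, cc


if __name__ == "__main__":
    # invented counts, e.g. nucleotides of a test sequence
    sn, sp, cc = prediction_quality(tp=90, fp=10, fn=10, tn=890)
    print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.3f}")
```

The correlation coefficient combines all four counts into one number between -1 and 1, so it is not inflated by a predictor that simply calls everything coding or everything non-coding.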

Gene prediction accuracies

• Nucleotide level: 95% Sn, 90% Sp (lows below 50%)

• Exon level: 75% Sn, 68% Sp (lows below 30%)

• Gene level: 40% Sn, 30% Sp (lows below 10%)

• Programs that combine statistical evaluations with similarity searches are the most powerful.

Common difficulties

• First and last exons difficult to annotate because

they contain UTRs.

• Smaller genes are not statistically significant so

they are thrown out.

• Algorithms are trained with sequences from

known genes which biases them against genes

about which nothing is known.

• Masking repeats frequently removes potentially

indicative chunks from the untranslated regions

of genes that contain repetitive elements.

The annotation pipeline

• Mask repeats using RepeatMasker.

• Run sequence through several programs.

• Take predicted genes and do similarity

search against ESTs and genes from

other organisms.

• Do similarity search for non-coding

sequences to find ncRNA.

Annotation nomenclature

• Known Gene – Predicted gene matches the entire length of a known gene.

• Putative Gene – Predicted gene contains region conserved with known gene. Also referred to as “like” or “similar to”.

• Unknown Gene – Predicted gene matches a gene or EST of which the function is not known.

• Hypothetical Gene – Predicted gene that does not contain significant similarity to any known gene or EST.

Credits

• http://www.dnalc.org/bioinformatics/presentations/hhmi_2003/2003_3.ppt

• Paces and Vondrasek, Bioinformatics course, Charles University (UK)
