Bioinformatics and Computational Biology KFC/BIN, VI. Gene Prediction. RNDr. Karel Berka, Ph.D., Univerzita Palackého v Olomouci

Page 1: Bioinformatika a výpočetní biologie KFC/BIN V. Predikce …fch.upol.cz/wp-content/uploads/2015/07/06_gene_prediction_vz1.pdf · require concerted application of bioinformatics

Bioinformatics and Computational Biology
KFC/BIN
VI. Gene Prediction
RNDr. Karel Berka, Ph.D.
Univerzita Palackého v Olomouci

Gene prediction
A gene is "a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions".
Colloquial uses of "gene" (e.g. "good genes", "hair color gene") often actually refer to an allele: an allele is one variant of a gene.
Gregor Mendel
Prediction: relies on the different information content of coding (CDS) and non-coding (UTR) sequences in the genome.

Information content
i l-ve you
hr-jlka ds
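One way to put a number on "information content" is Shannon entropy over codon frequencies. The sketch below is only an illustration: the function names and the two toy sequences are assumptions, not part of the lecture, and real gene finders use richer statistics than this.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Return the Shannon entropy (bits per symbol) of a list of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # p * log2(1/p) summed over all observed symbols
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def codons(seq):
    """Split a DNA string into non-overlapping triplets (frame 0)."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# Toy sequences (hypothetical): one repeats a single codon, one uses four codons.
repetitive = "ATGATGATGATG"
varied = "ATGGCCTTTAAA"

print(shannon_entropy(codons(repetitive)))  # one codon only: 0 bits
print(shannon_entropy(codons(varied)))      # four equally frequent codons: 2 bits
```

Comparing such statistics between known coding and non-coding regions is the kind of signal a predictor can be trained on.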

Genome: What is in a chromosome?

The value of genome sequences lies in their annotation
• Annotation – Characterizing genomic features using computational and experimental methods
• Genes: Four levels of annotation
  – Gene Prediction – Where are genes?
  – What do they look like?
  – Domains – What do the proteins do?
  – Role – What pathway(s) are they involved in?

How many genes does a human have?
Consortium: 35,000 genes?
Celera: 30,000 genes?
Affymetrix: 60,000 human genes on GeneChips?
Incyte and HGS: over 120,000 genes?
GenBank: 49,000 unique gene coding sequences?
UniGene: > 89,000 clusters of unique ESTs?

Current consensus (in flux …)
• 20,000 known genes (2010)
  – (similarity to previously isolated genes and expressed sequences from a large variety of different organisms)
  – 15,000 known in 2003
• 22,333 predicted (RefSeq)
  – problems with prediction algorithms (low effectiveness) (Nature blog 2010)

How do we get from here …

to here,

What are genes? - 1
• Complete DNA segments responsible for making functional products
• Products
  – Proteins
  – Functional RNA molecules
    • RNAi (interfering RNA)
    • rRNA (ribosomal RNA)
    • snRNA (small nuclear)
    • snoRNA (small nucleolar)
    • tRNA (transfer RNA)

What are genes? - 2
• Definition vs. dynamic concept
• Consider
  – Prokaryotic vs. eukaryotic gene models
  – Introns/exons
  – Posttranscriptional modifications
  – Alternative splicing
  – Differential expression
  – Genes-in-genes
  – Genes-ad-genes
  – Posttranslational modifications
  – Multi-subunit proteins

Prokaryotic gene model: ORF-genes
• “Small” genomes, high gene density
  – Haemophilus influenzae genome 85% genic
• Operons
  – One transcript, many genes
• No introns
  – One gene, one protein
• Open reading frames (ORF)
  – One ORF per gene
  – ORFs begin with a start codon and end with a stop codon (DNA / RNA spelling):
    - TAG / UAG ("amber")
    - TAA / UAA ("ochre")
    - TGA / UGA ("opal" or "umber")
Mnemonic: UGA "U Go Away", UAA "U Are Away", UAG "U Are Gone"
TIGR: http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi
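The "start, triplets, stop" rule for prokaryotic ORFs can be sketched as a simple scan. This is a minimal illustration under stated assumptions: it looks only at the forward strand, treats ATG as the only start codon, and the function name `find_orfs` and its `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # ochre, amber, opal (DNA spelling)
START_CODON = "ATG"                  # assumption: ATG is the only start codon


def find_orfs(dna, min_codons=2):
    """Yield (start, end, frame) for ORFs on the forward strand.

    start/end are 0-based and end-exclusive; the span includes the stop codon.
    min_codons counts the start codon and every codon before the stop.
    """
    dna = dna.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield (start, i + 3, frame)
                start = None  # close the ORF and keep scanning this frame


print(list(find_orfs("CCATGAAATAGGG")))  # [(2, 11, 2)]
```

A fuller detector would also scan the reverse complement and apply a realistic minimum length, since short ORFs occur by chance.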

Eukaryotic gene model: spliced genes
Posttranscriptional modification
  5’-CAP, polyA tail, splicing
Open reading frames
  Mature mRNA contains ORF
  All internal exons contain open “read-through”
  Pre-start and post-stop sequences are UTRs
Multiple translates
  One gene – many proteins via alternative splicing

Expansions and Clarifications
• ORFs
  – Start – triplets – stop
  – Prokaryotes: gene = ORF
  – Eukaryotes: spliced genes or ORF genes
• Exons
  – Remain after introns have been removed
  – Flanking parts contain non-coding sequence (5’- and 3’-UTRs)

Where do genes live?
• In genomes
• Example: the human genome
  – 3,274,571,503 bp (Ensembl 2010)
  – 25 chromosomes: 1-22, X, Y, mt
  – 22,333 genes (RefSeq estimate 2010)
  – Gene size from 128 nucleotides (RNA gene) to 2,800 kb (DMD)
  – Ca. 25% of the genome is genes (introns, exons)
  – Ca. 1% of the genome codes for amino acids (CDS)
  – 30 kb gene length (average)
  – 1.4 kb ORF length (average)
  – 3 transcripts per gene (average)

Sample genomes
Species          Size      Genes    Genes/Mb
H.sapiens        3,200Mb   35,000   11
D.melanogaster   137Mb     13,338   97
C.elegans        85.5Mb    18,266   214
A.thaliana       115Mb     25,800   224
S.cerevisiae     15Mb      6,144    410
E.coli           4.6Mb     4,300    934
List of 68 eukaryotes, 141 bacteria, and 17 archaea at http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html
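The Genes/Mb column is simply gene count divided by genome size, which a few lines can reproduce. The sketch assumes the slide's "13.338" for D. melanogaster means 13,338 genes; the function name `genes_per_mb` is hypothetical.

```python
# (species, genome size in Mb, gene count) as listed on the slide
GENOMES = [
    ("H.sapiens", 3200.0, 35000),
    ("D.melanogaster", 137.0, 13338),  # assumption: "13.338" means 13,338
    ("C.elegans", 85.5, 18266),
    ("A.thaliana", 115.0, 25800),
    ("S.cerevisiae", 15.0, 6144),
    ("E.coli", 4.6, 4300),
]


def genes_per_mb(size_mb, genes):
    """Gene density in genes per megabase."""
    return genes / size_mb


for species, size_mb, genes in GENOMES:
    print(f"{species:15s} {genes_per_mb(size_mb, genes):7.1f} genes/Mb")
```

The computed densities agree with the slide to within rounding, and they make the contrast plain: E. coli packs roughly 85 times more genes per megabase than the human genome.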

Genomic sequence features
• Repeats (“Junk DNA”)
  – Transposable elements, simple repeats
  – RepeatMasker (http://www.repeatmasker.org/)
• Genes
  – Vary in density, length, structure
  – Identification depends on evidence and methods and may require concerted application of bioinformatics methods and lab research
• Pseudogenes
  – Look-alikes of genes that obstruct gene-finding efforts
• Non-coding RNAs (ncRNA)
  – tRNA, rRNA, snRNA, snoRNA, miRNA
  – tRNAscan-SE, COVE (http://selab.janelia.org/software.html)

Gene identification
• Homology-based gene prediction
  – Similarity searches (e.g. BLAST, BLAT)
  – Genome browsers
  – RNA evidence (ESTs: expressed sequence tags from cDNA)
• Ab initio gene prediction
  – Gene prediction programs
  – Prokaryotes
    • ORF identification
  – Eukaryotes
    • Promoter prediction
    • PolyA-signal prediction
    • Splice site, start/stop-codon predictions

Gene prediction through comparative genomics
• Highly similar (conserved) regions between two genomes are likely to be functional; otherwise they would have diverged
• If genomes are too closely related, all regions are similar, not just genes
• If genomes are too far apart, analogous regions may be too dissimilar to be found
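Conservation between two genomes is often inspected as percent identity in sliding windows along an alignment. The following is a minimal sketch that assumes the two sequences are already aligned to equal length; the function name, window size and step are illustrative choices, not values from the lecture.

```python
def window_identity(aligned_a, aligned_b, window=10, step=5):
    """Return (start, fraction identical) for sliding windows over two
    pre-aligned, equal-length sequences. Gap columns ('-') count as mismatches."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to equal length")
    results = []
    for start in range(0, len(aligned_a) - window + 1, step):
        pairs = zip(aligned_a[start:start + window],
                    aligned_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, matches / window))
    return results


# Toy alignment (hypothetical): first half conserved, second half diverged.
a = "AAAAAAAAAACCCCCCCCCC"
b = "AAAAAAAAAAGGGGGGGGGG"
print(window_identity(a, b, window=10, step=10))  # [(0, 1.0), (10, 0.0)]
```

Windows with unusually high identity between moderately distant genomes are candidates for functional sequence. Between very close genomes nearly every window scores high, which is the limitation the slide points out.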

Gene discovery using ESTs
• Expressed Sequence Tags (ESTs) represent sequences from expressed genes.
• If a region matches an EST with high consensus, the region is probably a gene or pseudogene.
  – An EST overlapping an exon boundary gives an accurate prediction of that boundary.

Ab initio gene prediction
• Prokaryotes
  – ORF-Detectors
• Eukaryotes
  – Position, extent & direction: through promoter and polyA-signal predictors
  – Structure: through splice site predictors
  – Exact location of coding sequences: through determination of relationships between potential start codons, splice sites, ORFs, and stop codons

How it works I – ORF (SWF film)

How it works II – Motif identification
Exon-Intron Borders = Splice Sites
    Exon                 Intron                 Exon
~~gaggcatcag|gtttgtagac~~~~~~~~~~~tgtgtttcag|tgcacccact~~
~~ccgccgctga|gtgagccgtg~~~~~~~~~~~tctattctag|gacgcgcggg~~
~~tgtgaattag|gtaagaggtt~~~~~~~~~~~atatctccag|atggagatca~~
~~ccatgaggag|gtgagtgcca~~~~~~~~~~~ttatttccag|gtatgagacg~~
       Splice site                     Splice site
The same alignment with the conserved intron ends (GT … AG) highlighted:
    Exon                 Intron                 Exon
~~gaggcatcag|GTttgtagac~~~~~~~~~~~tgtgtttcAG|tgcacccact~~
~~ccgccgctga|GTgagccgtg~~~~~~~~~~~tctattctAG|gacgcgcggg~~
~~tgtgaattag|GTaagaggtt~~~~~~~~~~~atatctccAG|atggagatca~~
~~ccatgaggag|GTgagtgcca~~~~~~~~~~~ttatttccAG|gtatgagacg~~
       Splice site                     Splice site
Motif Extraction Programs at http://www-btls.jst.go.jp/
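The GT … AG pattern can be recovered mechanically by asking which alignment columns are identical in every example. The sketch below uses the four intron starts and four intron ends shown on the slide; the function name `conserved_columns` is hypothetical, and real motif extraction works with position weight matrices over many more sites.

```python
def conserved_columns(sites):
    """Return a pattern with the shared base (upper case) at columns where
    every site agrees, and '.' at variable columns."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("all sites must have the same length")
    pattern = []
    for column in zip(*sites):
        pattern.append(column[0].upper() if len(set(column)) == 1 else ".")
    return "".join(pattern)


# First ten intron bases after each donor site (copied from the slide)
donor_sites = ["gtttgtagac", "gtgagccgtg", "gtaagaggtt", "gtgagtgcca"]
# Last ten intron bases before each acceptor site (copied from the slide)
acceptor_sites = ["tgtgtttcag", "tctattctag", "atatctccag", "ttatttccag"]

print(conserved_columns(donor_sites))     # GT..G.....
print(conserved_columns(acceptor_sites))  # .....T..AG
```

With only four examples some columns agree by chance (the extra G and T above), which is why practical splice site predictors are trained on large sets of sites and score partial matches rather than demanding exact agreement.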

How it works III – The (ugly) truth

Gene prediction programs
• Homology-based programs – Use BLAST-like similarity searches.
  – Examples: Exofish, CRITICA
• Rule-based programs – Use explicit set of rules to make decisions.
  – Example: GeneFinder
• Neural Network-based programs – Use data set to build rules.
  – Examples: Grail, GrailEXP, Genemark
• Hidden Markov Model-based programs
  – Use probabilities of states and transitions between these states to predict features.
  – Examples: Genscan, GenomeScan
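To make "probabilities of states and transitions" concrete, here is a toy two-state hidden Markov model decoded with the Viterbi algorithm. It is a sketch only: every probability is a made-up illustration value, the states are just "coding" and "non-coding", and programs such as Genscan use far richer models with many more states.

```python
import math

STATES = ("C", "N")  # C = coding, N = non-coding (toy two-state model)

# All probabilities below are made-up illustration values, not trained parameters.
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # GC-rich
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # AT-rich
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of C/N labels."""
    if not seq:
        return ""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointer[s] = best_prev
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][base]))
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("GGGGGGAAAAAA"))  # CCCCCCNNNNNN
```

The decoder labels the GC-rich half as coding and the AT-rich half as non-coding because one state switch costs less than six poorly fitting emissions.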

Tools
• ORF detectors
  – NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
• Promoter predictors
  – CSHL: http://rulai.cshl.org/software/index1.htm
  – BDGP: fruitfly.org/seq_tools/promoter.html
  – ICG: TATA-Box predictor
• PolyA signal predictors
  – CSHL: argon.cshl.org/tabaska/polyadq_form.html
• Splice site predictors
  – BDGP: http://www.fruitfly.org/seq_tools/splice.html
• Start-/stop-codon identifiers
  – DNALC: Translator/ORF-Finder
  – BCM: Searchlauncher

CRITICA
prediction of prokaryotic genes
search for RBS (ribosomal binding site, Shine-Dalgarno sequence)
Principle:
• TBLASTP against a protein database and choosing clearly coding parts (usually only parts of the genes).
• Calculation of a statistical model.
• Prediction of genes.
• New statistical model and new prediction, and so on iteratively.

Genscan
prediction of eukaryotic genes
different statistical models for the first and last exon
search for promoters, terminators, polyA signal
different statistical parameters for different GC content
www: http://genes.mit.edu/GENSCAN.html

Genscan
Probability    Exons   Exactly   Partially   Overlaps   Error
0.99 - 1.00    917     97.7%     0.9%        0.0%       1.4%
0.95 - 0.99    551     92.4%     3.4%        0.2%       4.0%
0.90 - 0.95    263     87.8%     6.1%        0.4%       5.7%
0.75 - 0.90    337     74.8%     16.0%       1.2%       8.0%
0.50 - 0.75    362     54.1%     26.2%       2.2%       17.4%
0.00 - 0.50    248     29.8%     27.8%       4.0%       38.3%

Genscan - example
GENSCAN 1.0 Date run: 31-Oct-100 Time: 15:54:20
Sequence HERV17_004640 : 40714 bp : 37.79% C+G : Isochore 1 ( 0.00 - 43.00 C+G%)
Parameter matrix: HumanIso.smat
Predicted genes/exons:
Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
 1.01 Init +   1825   1853   29  0  2   86   71    45 0.579   1.72
 1.02 Term +   3886   4075  190  1  1   85   44   198 0.941  11.04
 1.03 PlyA +   4961   4966    6                               1.05
 2.00 Prom +   6668   6707   40                              -4.65
 2.01 Init +  17251  17375  125  0  2   45   72    80 0.590   1.81
 2.02 Term +  20137  20329  193  1  1   85   43   196 0.990  10.71
 2.03 PlyA +  20809  20814    6                               1.05
 3.08 PlyA -  21608  21603    6                              -3.24
 3.07 Term -  22315  21651  665  2  2  -17   55   522 0.952  31.44
 3.06 Intr -  24268  22592 1677  2  0   81   94  2124 0.885 198.67
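Rows of the GENSCAN table above can be read into records with plain whitespace splitting. This is a sketch based only on the rows shown on the slide: exon rows carry 13 fields, while Prom and PlyA rows carry 7. The function name and the dictionary keys are hypothetical.

```python
def parse_genscan_line(line):
    """Parse one row of the 'Predicted genes/exons' table into a dict.

    Returns None for lines that are not table rows (header, rule, blank).
    """
    fields = line.split()
    if len(fields) not in (7, 13) or fields[2] not in ("+", "-"):
        return None
    gene, _, exon = fields[0].partition(".")
    if not (gene.isdigit() and exon.isdigit()):
        return None  # e.g. the dashed rule under the header
    record = {
        "gene": int(gene),
        "exon": int(exon),
        "type": fields[1],
        "strand": fields[2],
        "begin": int(fields[3]),
        "end": int(fields[4]),
        "length": int(fields[5]),
        "score": float(fields[-1]),  # Tscr column
    }
    if len(fields) == 13:
        record["probability"] = float(fields[11])  # P column, exon rows only
    return record


row = parse_genscan_line("1.02 Term + 3886 4075 190 1 1 85 44 198 0.941 11.04")
print(row["type"], row["begin"], row["end"], row["probability"])
```

Filtering such records on the probability value is a natural next step, given how strongly exon accuracy depends on it in the Genscan accuracy table.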

Genscan - example
Suboptimal exons with probability > 0.100
Exnum Type S .Begin ...End .Len Fr Ph B/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
S.001 Init +   2937   3136  200  2  2   67  -22   154 0.301   0.72
S.002 Intr +   3239   3325   87  2  0   43   23   121 0.358  -0.73
S.003 Intr +  17250  17375  126  0  0   66   72    94 0.141   4.47
S.004 Init +  17311  17375   65  0  2   55   72    45 0.204   0.27
S.005 Intr -  24927  24728  200  2  2   12   91   115 0.146   2.27
S.006 Intr -  25129  25003  127  2  1   51   92    37 0.117  -0.78
S.007 Intr -  29973  29878   96  1  0   44  111    87 0.473   5.66
S.008 Intr -  32589  32418  172  2  1   19   70   151 0.336   5.42
S.009 Intr -  32563  32427  137  2  2   46   70   116 0.122   4.97
S.010 Intr -  32589  32427  163  2  1   19   70   135 0.114   3.86
S.011 Intr -  32857  32804   54  0  0  104  103     2 0.262   0.48
S.012 Init -  33114  33008  107  0  2   79   17    87 0.296   0.46
S.013 Init +  37062  37067    6  2  0   53   68     1 0.115  -4.38

Genscan - example

Prediction quality
Sensitivity = TP / (TP + FN)
  How many genes were found out of all present?
Specificity = TP / (TP + FP)
  How many predicted genes are indeed genes?
Correlation Coefficient = (TP · TN − FP · FN) / sqrt(PP · PN · RP · RN)
                      real positive   real negative   total
predicted positive    TP              FP              PP
predicted negative    FN              TN              PN
total                 RP              RN
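The three quality measures can be computed directly from the four counts. The sketch below uses the slide's gene-prediction convention for specificity, TP / (TP + FP), and the standard Matthews form of the correlation coefficient, (TP · TN − FP · FN) / sqrt(PP · PN · RP · RN), where PP, PN, RP and RN are the predicted and real positive and negative totals. The function name and the example counts are hypothetical.

```python
import math


def prediction_quality(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Specificity follows the gene-prediction convention TP / (TP + FP).
    """
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)
    pp, pn = tp + fp, tn + fn  # predicted positives / predicted negatives
    rp, rn = tp + fn, tn + fp  # real positives / real negatives
    cc = (tp * tn - fp * fn) / math.sqrt(pp * pn * rp * rn)
    return sensitivity, specificity, cc


# Hypothetical counts: 1000 nucleotides, 100 of them truly coding
sn, sp, cc = prediction_quality(tp=90, fp=10, tn=890, fn=10)
print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.3f}")  # Sn=0.90 Sp=0.90 CC=0.889
```

The correlation coefficient is undefined when any of the four marginal totals is zero, so a careful implementation would guard against that case.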

Gene prediction accuracies
• Nucleotide level: 95% Sn, 90% Sp (lows less than 50%)
• Exon level: 75% Sn, 68% Sp (lows less than 30%)
• Gene level: 40% Sn, 30% Sp (lows less than 10%)
• Programs that combine statistical evaluations with similarity searches are the most powerful.

Common difficulties
• First and last exons difficult to annotate because they contain UTRs.
• Smaller genes are not statistically significant so they are thrown out.
• Algorithms are trained with sequences from known genes which biases them against genes about which nothing is known.
• Masking repeats frequently removes potentially indicative chunks from the untranslated regions of genes that contain repetitive elements.

The annotation pipeline
• Mask repeats using RepeatMasker.
• Run sequence through several programs.
• Take predicted genes and do similarity search against ESTs and genes from other organisms.
• Do similarity search for non-coding sequences to find ncRNA.

Annotation nomenclature
• Known Gene – Predicted gene matches the entire length of a known gene.
• Putative Gene – Predicted gene contains region conserved with known gene. Also referred to as “like” or “similar to”.
• Unknown Gene – Predicted gene matches a gene or EST of which the function is not known.
• Hypothetical Gene – Predicted gene that does not contain significant similarity to any known gene or EST.

Credits
• http://www.dnalc.org/bioinformatics/presentations/hhmi_2003/2003_3.ppt
• Paces and Vondrasek, Bioinformatics course, UK