Information Theory, Statistical Measures and Bioinformatics approaches to gene expression
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational...
-
Upload
harry-mason -
Category
Documents
-
view
219 -
download
1
Transcript of Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational...
![Page 1: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/1.jpg)
Gene finding and genestructure prediction
M. Fatih BÜYÜKAKÇALI
2008639500
Computational Bioinformatics
2012
![Page 2: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/2.jpg)
Outline
• Introduction to genes and proteins
• Genetic code
• Open reading frames
![Page 3: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/3.jpg)
Outline
• Ab initio methods• Principles: signal detection and coding
statistics• Methods to integrate signal detection and
coding statistics• Examples of software
![Page 4: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/4.jpg)
Outline
• Homology methods• Principles• An overview of the homology methods
![Page 5: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/5.jpg)
Introduction to genes and proteins• Proteins are the main building block for
many tasks in living organisms. They are themselves build up as a chain of amino-acids (AA) (200-300, typically).
• The chain of amino-acids of a protein is produced by translation of an RNA sequence (via the ribosome; while translation takes place, the protein folds progressively to take its three-dimensional structure).
![Page 6: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/6.jpg)
Introduction to genes and proteins• The RNA sequence needed to
produce a given protein is normally obtained by transcribing a part of the DNA contained in the genome (it is then called mRNA) and the corresponding subsequence of DNA is called a gene coding for that protein.
![Page 7: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/7.jpg)
Introduction to genes and proteins• A given genome can contain as few as
500 genes or as many as 30,000 genes.
• Central dogma: DNARNAProtein
![Page 8: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/8.jpg)
Introduction: gene structure
![Page 9: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/9.jpg)
Genetic code• Correspondence between tri-mers
(codons) of nucleotides and amino-acids
• 20 amino-acids, but 64 codons (see book, or Internet, for explanations)
• Some amino-acids correspond to several codons: A (Alanine) corresponds to GCA, GCG, GCT
![Page 10: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/10.jpg)
Genetic code• Some codons do not correspond to an
amino-acid: TAA, TAG, TGA (these are stop codons, see below).
• One codon is special: ATG, it is the sole codon corresponding to Methionine, and is also called start codon (see below).
NB. Although it is RNA that is translated into amino-acids, we use the DNA alphabet (T instead of U) to describe the genetic code, because we will directly search DNA sequences for protein coding sequences.
![Page 11: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/11.jpg)
Open reading framesAn open reading frame, is a sequence of DNA nucleotides that could be translated into a protein. We know that:
• Translation goes from 5’ to 3’ end of a strain (sense, or anti-sense)
• Translation always starts with a methionine codon (ATG)
![Page 12: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/12.jpg)
Open reading frames
![Page 13: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/13.jpg)
Open reading frames
• Translation always stops, as soon as a stop codon is found (and the AA-sequence ends with the AA corresponding to the last non-stop codon).
![Page 14: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/14.jpg)
What is gene finding?
• From a genomic DNA sequence we want to predict the regions that will encode for a protein: the genes.
• Gene finding is about detecting these coding regions and infer the gene structure starting from genomic DNA sequences.
![Page 15: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/15.jpg)
What is gene finding?
• We need to distinguish coding from non-coding regions using properties specific to each type of DNA region.
• Gene finding is not an easy task!
![Page 16: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/16.jpg)
What is gene finding?• Gene finding is not an easy task!
• DNA sequence signals have low information content (small alphabet and short sequences);
• It is difficult to discriminate real signals from noise (degenerated and highly unspecific signals);
• Gene structure can be complex (sparse exons, alternative splicing, ...);
• DNA signals may vary in different organisms;• Sequencing errors (frame shifts, ...).
![Page 17: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/17.jpg)
Gene structure in prokaryotes
• High gene density and simple gene structure.
• Short genes have little information.• Overlapping genes.
![Page 18: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/18.jpg)
Gene structure in eukaryotes
• Low gene density and complex gene structure.
• Alternative splicing.• Pseudo-genes.
![Page 19: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/19.jpg)
Gene finding strategies
• Ab initio methods:• Based on statistical signals within the DNA:
• Signals: short DNA motifs (promoters, start/stop codons, splice sites, ...)
• Coding statistics: nucleotide compositional bias in coding and non-coding regions
![Page 20: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/20.jpg)
Gene finding strategies
• Strengths:• easy to run and fast execution time• only require the DNA sequence as input
![Page 21: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/21.jpg)
Gene finding strategies
• Weaknesses:• prior knowledge is required (training sets)• high number of mispredicted gene structures
![Page 22: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/22.jpg)
Gene finding strategies
• Homology methods:• Gene structure is deduced using
homologous sequences (EST, mRNA, protein).
• Very accurate results when using homologous sequences with high similarity.
![Page 23: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/23.jpg)
Gene finding strategies
• Strengths:• accurate
• Weaknesses:• need of good homologous sequences• execution is slow
![Page 24: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/24.jpg)
Gene finding:
Ab initio methods
![Page 25: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/25.jpg)
Ab initio methods: a simple view
![Page 26: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/26.jpg)
Methods for signal detection
• Detect short DNA motifs (promoters, start/stop codons, splice sites, intron branching point, ...).
![Page 27: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/27.jpg)
Methods for signal detection• A number of methods are used for
signal detection:• Consensus string: based on most frequently
observed residues at a given position.
• Pattern recognition: flexible consensus strings.
• Weight matrices: based on observed frequencies of residues at a given position. Uses standard alignment algorithms.
![Page 28: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/28.jpg)
Methods for signal detection• A number of methods are used for
signal detection:• Weight array matrices: weight matrices based on
dinucleotides frequencies. Takes into account the non-independence of adjacent positions in the sites.
• Maximal dependence decomposition (MDD): MDD generates a model which captures significant dependencies between non-adjacent as well adjacent positions, starting from an aligned set of signals.
![Page 29: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/29.jpg)
Methods for signal detection• Methods for signal detection:
• Hidden Markov Models (HMMs):• HMMs use a probabilistic framework to infer
the probability that a sequence correspond to a real signal.
• Neural Networks (NNs):• NNs are trained with positive and negative
examples. NNs ”discover” the features that distinguish the two sets.
![Page 30: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/30.jpg)
Methods for signal detection
![Page 31: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/31.jpg)
Signal detection limitations• Problems with signal detection:
• DNA sequence signals have low information content.
• Signals are highly unspecific and degenerated.• Difficult to distinguish between true and false
positive.
• How to improve signal detection:• Take context into consideration (ex. acceptor site
must be flanked by an intron and an exon).• Combine with coding statistics (compositional
bias).
![Page 32: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/32.jpg)
Types of coding statistics• Inter-genic regions, introns, and exons
have different nucleotides contents.• This compositional differences can be
used to infer gene structure.• Examples of coding statistics:
• ORF length:• Assuming an uniform random distribution,
stop codons are present every 64/3 codons (≈ 21 codons) in average.
• In coding regions stop codon average decrease
![Page 33: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/33.jpg)
Types of coding statistics
• This measure is sensitive to frame shift errors.• Can’t detect short coding regions
• Bias in nucleotide content in coding regions:• Generally coding regions are G+C rich.• There are exceptions! For example coding
regions of P. falciparum are A+T rich.
![Page 34: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/34.jpg)
Integrating signal and compositional information for gene structure prediction
• A number of methods exists for gene structure prediction which integrate different techniques to detect signals (splicing sites, promoters, etc.) and coding statistics.
• All these methods are classifiers based on machine learning theory.
• Training sets are required to train the algorithms.
![Page 35: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/35.jpg)
Ab initio methods: Generalized HMMs
![Page 36: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/36.jpg)
Ab initio methods: Generalized HMMs
![Page 37: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/37.jpg)
The Other Ab initio methods
• GENSCAN• HMMgene• Linear and quadratic discrimination analysis• FGENES• MZEF• Decision trees• Neural network• GRAIL
![Page 38: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/38.jpg)
![Page 39: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/39.jpg)
![Page 40: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/40.jpg)
Gene finding:
Homology methods
![Page 41: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/41.jpg)
Homology methods: a simple view
![Page 42: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/42.jpg)
Homology methods: Procrustes
Procrustes: robber who altered his victims to fit his bed by stretching them or cutting off their legs (Classical Mythology)
![Page 43: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/43.jpg)
Homology methods: Procrustes
![Page 44: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/44.jpg)
Homology methods: Genewise
• Uses HMMs to compare DNA sequences to protein sequences at the level of its conceptual translation, regardless of sequencing errors and introns.
• Principle:• The exon model used in genewise is a HMM with
3 base states (match, insert, delete) with the addition of more transitions between states to consider frame-shifts.
• Intron states have been added to the base model.• Genewise directly compare HMM-profiles of
proteins or domains to the gene structure HMM model.
![Page 45: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/45.jpg)
Homology methods: Genewise
• Genewise is a powerful tool, but time consuming.
• Requires strong similarities (>70% identity) to produce good predictions.
• Genewise is part of the Wise2 package: http://www.ebi.ac.uk/Wise2/.
![Page 46: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/46.jpg)
Homology methods: Genewise
![Page 47: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/47.jpg)
Homology methods: sim4
• Align cDNA to genomic sequences.
• sim4 performs standard dynamic programming:• models splice sites• introns are treated as special kind of gaps with
low penalties
• sim4 performs very well, but needs strong similarity between the sequences.
![Page 48: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/48.jpg)
Homology methods: BLAST
• BLAST can be used to find genomic sequences similar to proteins, ESTs, cDNAs.
• A BLAST hit doesn’t mean necessarily an exon. Some post-processing is required.
• BLAST can indicate the rough position of exons, but nothing about the gene structure.
![Page 49: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/49.jpg)
Homology methods: BLAST
• However, BLAST is fast! and can reduce the search space for others programs.
![Page 50: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/50.jpg)
Homology methods: Trimming with BLAST
![Page 51: Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012.](https://reader035.fdocuments.us/reader035/viewer/2022062408/56649ee65503460f94bf6c8e/html5/thumbnails/51.jpg)
REFERENCES
• http://www.ch.embnet.org/CoursEMBnet/Zurich04/slides/gene_ho.pdf
• http://www.montefiore.ulg.ac.be/~lwh/IBIOINFO/• http://www.montefiore.ulg.ac.be/~
lwh/IBIOINFO/ibioinfo9-10-07.pdf• http://tr.wikipedia.org/wiki/Gen_bulma• http://tr.wikipedia.org/wiki/Genetik_kod• http://blast.ncbi.nlm.nih.gov/Blast.cgi• http://www.cs.utoronto.ca/~
brudno/csc2427/Lec12Notes.pdf