Gene Prediction

Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

Transcript of Gene Prediction

  • Gene Prediction: Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

  • Gene Prediction: Introduction; Protein-coding gene prediction; RNA gene prediction; Modification and finishing; Project schema

  • Why gene prediction? Experimental way?

  • Why gene prediction? Exponential growth of sequences. Metagenomics: only ~1% of organisms grow in the lab. New sequencing technology.

  • How to do it? It is a complicated task, so let's break it into parts. (Diagram: Genome)

  • How to do it? Protein-coding gene prediction. Phillip Lee & Divya Anjan Kumar: homology search. ab initio approach: Nadeem Bulsara & Neha Gupta.

  • How to do it? RNA gene prediction: Amanda McCook & Chengwei Luo. tRNA, rRNA, sRNA.

  • Gene Prediction: Introduction; Protein-coding gene prediction; RNA gene prediction; Modification and finishing; Project schema

  • Homology Search

  • Strategy

  • Open reading frame (ORF)

  • How/Why find ORF?

  • Protein Database Searches

  • SWISSPROT: Statistics

  • Pfam: Statistics. 11,912 families, with 1,808 new families and 236 families deleted. Updated to include metagenomic samples. Involves MSA and HMM. Only 63% of the Pfam families match the proteins in SWISSPROT and TrEMBL.

  • Domain searches

  • Integrating the results. 3 possible outcomes: complete consensus, partial consensus, no consensus.

    How do we choose? Scores like E-values, percentage similarity, relevance.
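The three outcomes above can be sketched in code. This is a minimal illustration, not the presenters' procedure. It assumes each database contributes at most one top hit as an (annotation, E-value) pair. It also assumes that "complete" means every database agrees, "partial" means at least two agree, and that with no agreement the hit with the lowest E-value is reported as the fallback. The function name `integrate_hits` and the database names are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

Hit = Tuple[str, float]  # (annotation, E-value); assumed input shape


def integrate_hits(hits: Dict[str, Optional[Hit]]) -> Tuple[str, Optional[str]]:
    """Classify agreement between databases and pick an annotation.

    `hits` maps a database name to its top hit, or None if nothing was found.
    """
    found = [h for h in hits.values() if h is not None]
    if not found:
        return "no consensus", None
    counts = Counter(name for name, _ in found)
    top_name, top_count = counts.most_common(1)[0]
    if top_count == len(hits):
        # Every database reported the same annotation.
        return "complete consensus", top_name
    if top_count >= 2:
        # Some, but not all, databases agree.
        return "partial consensus", top_name
    # No agreement: fall back to the hit with the smallest E-value.
    best = min(found, key=lambda h: h[1])
    return "no consensus", best[0]
```

Percentage similarity and relevance, the other criteria on the slide, could be added as further tie-breakers in the same way.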

  • Limitations of Extrinsic Prediction

  • ab initio Prediction

  • Homology Search is not Enough! Biased and incomplete databases. Sequenced genomes are not evenly distributed on the tree of life and do not reflect its diversity. (Figure annotation: number of sequenced genomes clustered here)

  • ab initio Gene Prediction

  • Features

  • ORFs (6 frames)
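A six-frame ORF scan can be sketched as follows. This is a simplified illustration, not the method of any tool named in this deck. It assumes the bacterial start codons ATG, GTG, and TTG and the standard stop codons. The names `find_orfs` and `min_len` are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed start set for this sketch
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 90) -> List[Tuple[str, int, int, int]]:
    """Return (strand, frame, start, end) for ORFs in all six frames.

    Coordinates are 0-based, end-exclusive, and relative to the strand that
    was scanned. An ORF runs from the first start codon seen after the
    previous in-frame stop through the next in-frame stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i
                elif codon in STOP_CODONS:
                    if start is not None and i + 3 - start >= min_len:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs
```

For example, `find_orfs("CCATGAAATTTGGGTAACC", min_len=9)` reports a single ORF on the forward strand in frame 2, spanning positions 2 to 17.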

  • Codon Statistics
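Codon statistics start from codon frequencies in known or candidate coding sequences. The sketch below shows only the counting step. It assumes the input is an in-frame coding sequence, and the function name `codon_frequencies` is hypothetical.

```python
from collections import Counter
from typing import Dict


def codon_frequencies(cds: str) -> Dict[str, float]:
    """Return the relative frequency of each codon in an in-frame sequence.

    Trailing bases that do not form a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}
```

Comparing these frequencies between a candidate ORF and a set of trusted genes is one way to judge whether the ORF looks coding.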

  • Features (Contd.)

  • Probabilistic View

  • Supervised Techniques

  • Unsupervised Techniques

  • Commonly Used Tools: GeneMark, GLIMMER, EasyGene, PRODIGAL

  • GeneMark: Shortcomings

    Inability to find exact gene boundaries

  • GeneMark.hmm

    The probability that a functional sequence X underlies the DNA sequence S is calculated as $P(X \mid S) = P(x_1, x_2, \ldots, x_L \mid b_1, b_2, \ldots, b_L)$, where $b_i$ is the nucleotide at position $i$ of S and $x_i$ is its functional label. The Viterbi algorithm then finds the functional sequence $X^*$ such that $P(X^* \mid S)$ is the largest among all possible values of X. A ribosome binding site model was also added to improve accuracy in the prediction of translational start sites.
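The Viterbi step can be illustrated with a toy two-state HMM. This is a sketch under stated assumptions, not the GeneMark.hmm model. The states ("N" for noncoding, "C" for coding) and all probabilities below are hypothetical values chosen for illustration, and the real model uses far richer states and emissions.

```python
import math
from typing import Dict, List

# Hypothetical toy parameters (not taken from GeneMark.hmm).
TOY_STATES = ["N", "C"]
TOY_START = {"N": 0.5, "C": 0.5}
TOY_TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
TOY_EMIT = {
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # AT-rich noncoding
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # GC-rich coding
}


def viterbi(obs: str, states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable state path for `obs`, computed in log space."""
    if not obs:
        return []
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for ch in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best previous state for arriving in state s at this position.
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (scores[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][ch]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With these toy parameters, the sequence `"ATATGCGCGC"` is labelled noncoding for the first four bases and coding for the remaining six.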

  • GeneMark

    The RBS feature overcomes this problem by defining a five-position nucleotide matrix based on an alignment of 325 E. coli genes whose RBS signals have already been annotated. It uses the consensus sequence AGGAG to search upstream of any alternative start codons for genes predicted by the HMM.

    GeneMarkS: Considered the best gene prediction tool. Based on unsupervised learning.
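An upstream search for the AGGAG consensus can be sketched as a simple mismatch scan. This is an illustration only: the real feature scores a position nucleotide matrix rather than counting mismatches. The window size, the mismatch limit, and the name `find_rbs_upstream` are assumptions.

```python
from typing import Optional, Tuple


def find_rbs_upstream(seq: str, start: int, motif: str = "AGGAG",
                      window: int = 20,
                      max_mismatches: int = 1) -> Optional[Tuple[int, int]]:
    """Find the best motif match in the `window` bases upstream of `start`.

    Returns (position, mismatches) for the match with the fewest mismatches
    (leftmost on ties), or None if no match is within `max_mismatches`.
    `start` is the 0-based index of the first base of the start codon.
    """
    seq = seq.upper()
    lo = max(0, start - window)
    best = None
    for i in range(lo, start - len(motif) + 1):
        mismatches = sum(a != b for a, b in zip(seq[i:i + len(motif)], motif))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (i, mismatches)
    return best
```

Running this for each alternative start codon of a predicted gene gives one piece of evidence for choosing among them.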

  • GLIMMER

    Used IMMs (Interpolated Markov Models) for the first time. Predictions are based on variable context (oligomers of variable lengths). More flexible than fixed-order Markov models. (Figure: Principle)

  • Glimmer development

    Glimmer 2 (1999): Increased the sensitivity of prediction by adding the concept of the ICM (Interpolated Context Model).

    Glimmer 3 (2007): Overcomes the shortcomings of previous models by taking into account the sum of the RBS score, the IMM coding potential, and a score for start codons that depends on the relative frequency of each possible start codon in the same training set used for RBS determination. The algorithm uses reverse scoring of the IMM, scoring all ORFs (open reading frames) in reverse, from the stop codon to the start codon. The score is the sum of the log-likelihoods of the bases contained in the ORF.
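The idea of scoring an ORF as a sum of per-base log-likelihoods can be sketched with a plain fixed-order Markov chain. Note the substitution: this is not an interpolated Markov model and not Glimmer's code. It scores left to right rather than in reverse, and it uses add-one pseudocounts. The names `train_markov` and `log_likelihood` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable


def train_markov(seqs: Iterable[str], order: int = 2) -> Dict[str, Dict[str, float]]:
    """Estimate P(base | previous `order` bases) with add-one pseudocounts."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", 1))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            context, base = s[i - order:i], s[i]
            if base in "ACGT":
                counts[context][base] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values())
        model[context] = {b: n / total for b, n in c.items()}
    return model


def log_likelihood(seq: str, model: Dict[str, Dict[str, float]], order: int = 2) -> float:
    """Sum of log P(base | context); unseen contexts or bases score as uniform (0.25)."""
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        p = probs.get(seq[i], 0.25) if probs else 0.25
        total += math.log(p)
    return total
```

An interpolated model would blend such estimates across several context lengths, weighting each by how much training data supports it.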

  • Glimmer3.02

  • PRODIGAL: Prokaryotic Dynamic Programming Gene Finding Algorithm. Developed at Oak Ridge National Laboratory and the University of Tennessee.

  • PRODIGAL-Features

  • EasyGene: Developed at the University of Copenhagen. Statistical significance is the measure for gene prediction.

  • Comparison of Different Tools

  • Gene Prediction: Introduction; Protein-coding gene prediction; RNA gene prediction; Modification and finishing; Project schema

  • RNA Gene Prediction

  • Why Predict RNA?

  • Regulatory sRNA

  • sRNA Challenges

  • Fundamental Methodology

  • RFAM

  • What Is Covariance?

  • Noncomparative Prediction

  • Comparative + Noncomparative: Effective sRNA prediction in V. cholerae. Non-enterobacteria. sRNAPredict2. 32 novel sRNAs predicted, 9 tested, 6 confirmed.

    Jonathan Livny et al. Nucleic Acids Res. (2005) 33:4096

  • Software

  • Gene Prediction: Introduction; Protein-coding gene prediction; RNA gene prediction; Modification and finishing; Project schema

  • Modification & Finishing: Consensus strategy to integrate ab initio results; broken gene recruiting; TIS correcting; IS calling; operon annotating; gene presence/absence analysis
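One way to integrate ab initio results by consensus is a simple vote. The sketch below is not the project's actual rule. It assumes each tool reports every gene once as (strand, start, stop), that predictions sharing a strand and stop coordinate are the same gene, and that a gene passes when at least `min_votes` tools call it. The tool names and the function `consensus_genes` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Gene = Tuple[str, int, int]  # (strand, start, stop); assumed input shape


def consensus_genes(predictions: Dict[str, List[Gene]],
                    min_votes: int = 2) -> Tuple[List[Gene], List[Gene]]:
    """Split predictions into (passed, failed) by voting on shared stop sites.

    For each (strand, stop) group, the most frequently predicted start is kept.
    """
    starts_by_stop = defaultdict(list)
    for genes in predictions.values():
        for strand, start, stop in genes:
            starts_by_stop[(strand, stop)].append(start)
    passed, failed = [], []
    for (strand, stop), starts in sorted(starts_by_stop.items()):
        start = Counter(starts).most_common(1)[0][0]
        target = passed if len(starts) >= min_votes else failed
        target.append((strand, start, stop))
    return passed, failed
```

Genes in the failed list could then be handed to a homology search before being accepted or discarded.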

  • Modification & Finishing: Consensus strategy and broken gene recruiting (flowchart labels: ab initio results, pass, pass, fail, homology search, candidate fragments)

  • Modification & Finishing: TIS correcting. Start codon redundancy: ATG, GTG, TTG, CTG. Markov iteration, experimentally verified data. Leaderless genes.
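Start codon redundancy means that a single ORF can contain several plausible translation initiation sites. The sketch below only enumerates the in-frame candidates using the four start codons listed above. Choosing among them, for example by Markov iteration, is not shown. The name `candidate_starts` is hypothetical.

```python
from typing import List, Tuple

TIS_START_CODONS = ("ATG", "GTG", "TTG", "CTG")


def candidate_starts(orf: str) -> List[Tuple[int, str]]:
    """List (offset, codon) for every in-frame start codon in `orf`.

    `orf` is assumed to be in frame, beginning at the most upstream candidate.
    """
    orf = orf.upper()
    return [(i, orf[i:i + 3])
            for i in range(0, len(orf) - 2, 3)
            if orf[i:i + 3] in TIS_START_CODONS]
```

For the sequence `"GTGAAAATGCCCTTGTAA"` this yields candidates at offsets 0 (GTG), 6 (ATG), and 12 (TTG).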

  • Modification & Finishing: IS calling; operon annotating; IS Finder DB

  • Modification & Finishing: Gene presence/absence analysis
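A presence/absence analysis can be represented as a gene-by-genome table. This sketch assumes genes have already been matched across genomes and share identifiers. The strain names, gene names, and the function `presence_absence` are hypothetical.

```python
from typing import Dict, Set


def presence_absence(genomes: Dict[str, Set[str]]) -> Dict[str, Dict[str, bool]]:
    """Map each gene to a {genome: present?} row over all genomes."""
    all_genes = sorted(set().union(*genomes.values()))
    return {gene: {name: gene in genes for name, genes in genomes.items()}
            for gene in all_genes}
```

Rows that are True in every genome are shared genes, and the remaining rows mark strain-specific gains or losses.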

  • Gene Prediction: Introduction; Protein-coding gene prediction; RNA gene prediction; Modification and finishing; Project schema

  • Schema (proposed): assembly group