Gene Prediction

Gene Prediction Preliminary Results Computational Genomics February 20, 2012


Transcript of Gene Prediction

Page 1: Gene  Prediction

Gene Prediction: Preliminary Results

Computational Genomics

February 20, 2012

Page 2: Gene  Prediction

ab initio Gene Prediction

Using Glimmer3, RAST, Prodigal and GeneMarkS

Page 3: Gene  Prediction

Prodigal

• Low complexity (no Hidden Markov Model, no Interpolated Markov Model).

• Based on dynamic programming.

• Remains accurate in high-GC-content genomes.

• Tends to predict longer genes rather than more genes.

Page 4: Gene  Prediction

Prodigal Protocol

Page 5: Gene  Prediction
Page 6: Gene  Prediction

Prodigal Options

Page 7: Gene  Prediction

Build Training File

Page 8: Gene  Prediction

Running Prodigal

Page 9: Gene  Prediction

Screenshot of Results

Page 10: Gene  Prediction

GeneMarkS

Gene prediction in prokaryotic genomes with unsupervised model parameter estimation

Page 11: Gene  Prediction
Page 12: Gene  Prediction

Web based version

Page 13: Gene  Prediction

Command line version

Syntax: runGeneMarkS <input_file> <output folder>

The output folder contains 3 types of files:

• .out file: contains the default output

• .faa file: contains the amino acid sequences of the corresponding ORFs in FASTA format

• .fnn file: contains the nucleotide sequences of the corresponding ORFs in FASTA format

Page 14: Gene  Prediction

Strand: + = normal strand, - = reverse strand

Left end: begin position; Right end: end position

Screenshot of the .out file

Page 15: Gene  Prediction

Screenshot of the .faa file

Page 16: Gene  Prediction

Screenshot of the .fnn file

Page 17: Gene  Prediction

Glimmer3

• A system for finding genes in microbial DNA

• Works by creating a variable-length Markov model from a training set of genes

• Uses the model to identify all genes in a DNA sequence

Page 18: Gene  Prediction

Running Glimmer3

• Two-step process

• 1. A probability model of coding sequences, called an interpolated context model, must be built from a set of training sequences:
– 1. genes identified by homology, or known genes
– 2. long, non-overlapping ORFs
– 3. genes from a highly similar species

• 2. The program is run to analyze the sequences and make gene predictions.
– Best results require the longest possible training set of genes.

Page 19: Gene  Prediction

Glimmer3 programs

• long-orfs: uses an amino-acid distribution model to filter the set of ORFs

• extract: builds the training set from long, non-overlapping ORFs

• build-icm: builds the interpolated context model from the training sequences

• glimmer3: analyzes sequences and makes predictions
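The four programs are normally chained, each consuming the previous one's output. The sketch below only assembles the commands; the option values (`-n -t 1.15`, `-r`, `-o50 -g110 -t30`) are recalled from the from-scratch example script shipped with Glimmer3 and are an assumption to check against the installed version. The file names and the `tag` prefix are hypothetical.

```python
import subprocess


def glimmer3_steps(genome, tag):
    """Return the four pipeline steps as (argv, stdin_path, stdout_path) tuples."""
    return [
        # 1. Find long, non-overlapping ORFs to serve as training candidates.
        (["long-orfs", "-n", "-t", "1.15", genome, f"{tag}.longorfs"], None, None),
        # 2. Extract their sequences; extract writes the training set to stdout.
        (["extract", "-t", genome, f"{tag}.longorfs"], None, f"{tag}.train"),
        # 3. Build the interpolated context model; build-icm reads from stdin.
        (["build-icm", "-r", f"{tag}.icm"], f"{tag}.train", None),
        # 4. Predict genes in the genome with the trained model.
        (["glimmer3", "-o50", "-g110", "-t30", genome, f"{tag}.icm", tag], None, None),
    ]


def run_steps(steps):
    """Run each step in order, wiring up file redirection where a step needs it."""
    for argv, stdin_path, stdout_path in steps:
        stdin = open(stdin_path, "rb") if stdin_path else None
        stdout = open(stdout_path, "wb") if stdout_path else None
        try:
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
        finally:
            for handle in (stdin, stdout):
                if handle is not None:
                    handle.close()


# Assemble (but do not execute) the steps for one hypothetical input file.
steps = glimmer3_steps("M19107.fasta", "M19107")
```

Calling `run_steps(steps)` would execute the pipeline, provided the Glimmer3 binaries are on the PATH.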

Page 20: Gene  Prediction

Interpolated Context Model

Page 21: Gene  Prediction
Page 22: Gene  Prediction
Page 23: Gene  Prediction

RAST

• RAST (Rapid Annotation using Subsystem Technology) is a system for annotating bacterial and archaeal genomes.

• Pipeline: tRNAscan-SE, Glimmer2, and comparison against other prokaryotic genes that are universal across species.

Page 24: Gene  Prediction
Page 25: Gene  Prediction

Number of Genes Predicted

ID       Glimmer3  Prodigal  RAST  GeneMarkS
M19107   1728      1728      1784  1808
M19501   1914      1867      2015  1933
M21127   2370      2317      2456  2413
M21621   1937      1914      1838  1972
M21639   2698      2665      2823  2797
M21709   1924      1881      2004  1925
Average  2095      2062      2153  2141
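The Average row can be reproduced from the per-genome counts. This is a minimal sketch with the values copied from the table; the variable names are illustrative.

```python
from statistics import mean

# Predicted gene counts per tool, in the order
# M19107, M19501, M21127, M21621, M21639, M21709.
gene_counts = {
    "Glimmer3": [1728, 1914, 2370, 1937, 2698, 1924],
    "Prodigal": [1728, 1867, 2317, 1914, 2665, 1881],
    "RAST": [1784, 2015, 2456, 1838, 2823, 2004],
    "GeneMarkS": [1808, 1933, 2413, 1972, 2797, 1925],
}

# Round each mean to the nearest whole gene, as in the table.
averages = {tool: round(mean(counts)) for tool, counts in gene_counts.items()}

for tool, avg in averages.items():
    print(f"{tool}: {avg}")
```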

Page 26: Gene  Prediction

Gene Length of Predicted Genes

ID       Glimmer3  RAST    GeneMarkS
M19107   791.43    793.56  801.50
M19501   806.71    809.12  840.52
M21127   987.09    692.20  708.70
M21621   851.47    900.93  885.61
M21639   740.28    751.85  762.46
M21709   840.49    843.18  873.15
Average  836.25    798.47  811.99

Page 27: Gene  Prediction

Homology-based Gene Prediction using BLAT

Page 28: Gene  Prediction
Page 29: Gene  Prediction

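Homology-based Gene Prediction using BLAT

Workflow:

• Query: 1709 protein-coding genes from Haemophilus influenzae

• Targets: Haemophilus haemolyticus assemblies (number of contigs in parentheses): M19107.fasta (99), M19501.fasta (17), M21127.fasta (29), M21621.fasta (24), M21639.fasta (49), M21709.fasta (31)

• BLAT (UCSC) produces Output.pslx

• Query coverage (%) is computed and plotted as frequency graphs

• A cutoff is defined

• Hits passing the cutoff are the predicted genes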
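The coverage and cutoff steps of this workflow can be sketched as follows. The slides do not define query coverage, so the formula here, aligned query span divided by query length, is an assumption; the column positions follow the standard PSL layout (qName, qSize, qStart, qEnd, tName in columns 10 to 14). The example rows are made up for illustration and are not taken from the actual Output.pslx.

```python
def query_coverage(psl_line):
    """Return (query name, target name, query coverage in %) for one PSL row."""
    fields = psl_line.rstrip("\n").split("\t")
    q_name, q_size = fields[9], int(fields[10])
    q_start, q_end = int(fields[11]), int(fields[12])
    t_name = fields[13]
    # Assumed definition: aligned span of the query over its full length.
    return q_name, t_name, 100.0 * (q_end - q_start) / q_size


def call_genes(psl_lines, cutoff=90.0):
    """Keep, per query gene, the best hit whose coverage reaches the cutoff."""
    best = {}
    for line in psl_lines:
        if not line[:1].isdigit():
            continue  # header and separator lines do not start with a match count
        q_name, t_name, cov = query_coverage(line)
        if cov >= cutoff and cov > best.get(q_name, ("", 0.0))[1]:
            best[q_name] = (t_name, cov)
    return best


def _row(q_name, q_size, q_start, q_end, t_name):
    """Build a made-up 21-column PSL row; only the columns used above are meaningful."""
    cols = ["0"] * 21
    cols[8] = "+"
    cols[9], cols[10] = q_name, str(q_size)
    cols[11], cols[12] = str(q_start), str(q_end)
    cols[13] = t_name
    return "\t".join(cols)


example = [
    "psLayout version 3",
    _row("geneA", 1000, 20, 980, "contig_1"),  # 96% coverage, kept
    _row("geneB", 1000, 0, 500, "contig_2"),   # 50% coverage, dropped
]
hits = call_genes(example)
```

Keeping only the best hit per query is one possible design choice; counting every hit above the cutoff would give larger totals.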

Page 30: Gene  Prediction

[Figure: frequency histogram of query coverage (%), with the cut-off marked]

Page 31: Gene  Prediction

Homology-based Gene Prediction using BLAT: Results

Strain   Contigs  Query-coverage cutoff (%)  Predicted genes  Average length
M19107   99       90                         787              1049
M19501   17       90                         1063             996
M21127   29       90                         901              963
M21621   24       90                         930              685
M21639   49       90                         970              1277
M21709*  31       90                         1515             813

Page 32: Gene  Prediction

Gene Calling Protocol

• Number of predicted genes (≥ 90% query coverage):

M19107  M19501  M21127  M21621  M21639  M21709*
787     1063    901     930     970     1515

• Gene scoring system (presence / absence): ≥ 4/5, = 3/5, ≤ 2/5

• Multiple alignment (MUSCLE)

• Consensus sequence

• Final set of homology-based predicted genes

?

Page 33: Gene  Prediction

RNA Prediction

Page 34: Gene  Prediction

tRNAscan-SE

First-pass filters identify "candidate" tRNA regions of the sequence.
• tRNAscan and EufindtRNA

Further analysis confirms the initial tRNA predictions.
• Cove

Page 35: Gene  Prediction

Parameters passed

tRNAscan-SE -B <inputfile> -o <outputfile1> -f <outputfile2> -m <outputfile3>

-B <file> : search for bacterial tRNAs
• This option selects the bacterial covariance model for tRNA analysis, and loosens the search parameters for EufindtRNA to improve detection of bacterial tRNAs.

-o <file> : save final results in <file>
• Specify this option to write results to <file>.

-f <file> : save results and tRNA secondary structures to <file>.

-m <file> : save a statistics summary for the run
• Contains the run options selected, as well as statistics on the number of tRNAs detected at each phase of the search, search speed, and other statistics.
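Running the same command over all six assemblies is easy to script. The sketch below only assembles the argument lists, using the four options described on this slide; the input file names come from the BLAT slide, while the output naming scheme is hypothetical.

```python
import subprocess

STRAINS = ["M19107", "M19501", "M21127", "M21621", "M21639", "M21709"]


def trnascan_command(strain):
    """Build the tRNAscan-SE call for one assembly (output names are assumed)."""
    return [
        "tRNAscan-SE",
        "-B", f"{strain}.fasta",           # bacterial mode, followed by the input FASTA
        "-o", f"{strain}.trna.out",        # final results table
        "-f", f"{strain}.trna.struct",     # results plus secondary structures
        "-m", f"{strain}.trna.stats",      # run statistics summary
    ]


commands = [trnascan_command(strain) for strain in STRAINS]


def run_all(cmds):
    """Execute each command in turn; requires tRNAscan-SE on the PATH."""
    for argv in cmds:
        subprocess.run(argv, check=True)
```

`run_all(commands)` would perform the six runs; it is left uncalled so the block has no side effects.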

Page 36: Gene  Prediction

Output using the "-o" parameter

Output using the "-f" parameter

Page 37: Gene  Prediction
Page 38: Gene  Prediction

Results: output using the "-m" parameter

                              M19107  M19501  M21127  M21621  M21639  M21709
No. of contigs                99      17      29      23      49      29
Contigs with at least 1 tRNA  45      12      22      19      33      21
First-pass tRNAs predicted    103     124     114     123     137     113
Cove-confirmed tRNAs          41      51      50      52      51      51

Page 39: Gene  Prediction

Isotype and Anticodon Count (M19107)

Page 40: Gene  Prediction

RNAmmer

Page 41: Gene  Prediction

Working

• It works using two levels of hidden Markov models.

• The spotter model is constructed from highly conserved loci within a structural alignment of known rRNA sequences.

• Once the spotter model detects the approximate position of a gene, the flanking regions are extracted and passed to the full model, which matches the entire gene.

• The two-level approach avoids running the full model across the entire genome sequence, which allows faster predictions.
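The control flow of this two-level search can be illustrated with a toy example. It is not RNAmmer's implementation: exact matching of a short conserved seed stands in for the spotter HMM, and ungapped percent identity against a reference string stands in for the full HMM. All sequences and thresholds are made up.

```python
def spot(genome, seed):
    """Level 1 (cheap): return every start offset where the seed occurs."""
    hits, pos = [], genome.find(seed)
    while pos != -1:
        hits.append(pos)
        pos = genome.find(seed, pos + 1)
    return hits


def full_score(window, model):
    """Level 2 (expensive): best ungapped identity of the model inside the window."""
    best_identity, best_offset = 0.0, -1
    for offset in range(len(window) - len(model) + 1):
        candidate = window[offset:offset + len(model)]
        identity = sum(a == b for a, b in zip(candidate, model)) / len(model)
        if identity > best_identity:
            best_identity, best_offset = identity, offset
    return best_identity, best_offset


def two_level_search(genome, seed, model, flank, min_identity=0.9):
    """Run the expensive scorer only on windows around the cheap seed hits."""
    results = []
    for hit in spot(genome, seed):
        lo = max(0, hit - flank)
        hi = min(len(genome), hit + len(seed) + flank)
        identity, offset = full_score(genome[lo:hi], model)
        if identity >= min_identity:
            results.append((lo + offset, lo + offset + len(model), identity))
    return results


# Toy data: one true copy of the model and one decoy that only shares the seed.
model = "ACGTTGCAAGGCTA"
seed = "TGCAAG"
genome = "T" * 20 + model + "C" * 20 + "GGTGCAAGGG" + "A" * 10
genes = two_level_search(genome, seed, model, flank=10)
```

Both seed occurrences trigger the second level, but only the window containing the full model passes the identity threshold.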

Page 42: Gene  Prediction

Command line options

rnammer -S (species) -m (molecules) -xml (xml file) -gff (gff file) -h (hmm report file) -f (fasta file)

• -S : specifies the species to use. In our case, it will be bacterial.

• -m : molecules to search for (e.g., large subunit or small subunit).

Page 43: Gene  Prediction

Results

##gff-version2
##source-version RNAmmer-1.2
##date 2012-02-19
##Type DNA
# seqname source feature start end score +/- frame attribute
# ---------------------------------------------------------------------------------------------------------
84 RNAmmer-1.2 rRNA 28110 31006 3556.4 + . 23s_rRNA
84 RNAmmer-1.2 rRNA 31127 31241 82.9 + . 5s_rRNA
1 RNAmmer-1.2 rRNA 116969 117083 82.9 - . 5s_rRNA
60 RNAmmer-1.2 rRNA 338 452 82.9 + . 5s_rRNA
29 RNAmmer-1.2 rRNA 198 312 82.9 + . 5s_rRNA
84 RNAmmer-1.2 rRNA 25977 27507 1872.9 + . 16s_rRNA
# ---------------------------------------------------------------------------------------------------------

Sequence  5s  16s  23s
M19107    4   1    1
M19501    7   1    1
M21127    4   1    0
M21621    4   0    0
M21639    7   2    1
M21709    8   2    2
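The per-genome counts in the table can be tallied directly from the GFF output. This is a minimal sketch using the six records shown on this slide; splitting on any whitespace is an assumption that covers both tab- and space-delimited files.

```python
from collections import Counter


def count_rrna(gff_lines):
    """Count rRNA features by type (the attribute column) in RNAmmer GFF output."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, headers and separators
        cols = line.split()
        if len(cols) >= 9 and cols[2] == "rRNA":
            counts[cols[8]] += 1
    return counts


# Records copied from the GFF output shown on this slide.
gff_lines = [
    "##gff-version2",
    "##source-version RNAmmer-1.2",
    "84 RNAmmer-1.2 rRNA 28110 31006 3556.4 + . 23s_rRNA",
    "84 RNAmmer-1.2 rRNA 31127 31241 82.9 + . 5s_rRNA",
    "1 RNAmmer-1.2 rRNA 116969 117083 82.9 - . 5s_rRNA",
    "60 RNAmmer-1.2 rRNA 338 452 82.9 + . 5s_rRNA",
    "29 RNAmmer-1.2 rRNA 198 312 82.9 + . 5s_rRNA",
    "84 RNAmmer-1.2 rRNA 25977 27507 1872.9 + . 16s_rRNA",
]
rrna_counts = count_rrna(gff_lines)
```

For these records the tally is four 5s, one 16s and one 23s, which matches the M19107 row of the table.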

Page 44: Gene  Prediction

sRNA Prediction

Page 45: Gene  Prediction

Rfam Database Homology Search

• A collection of RNA families
– Non-coding RNA genes
– Structured cis-regulatory elements
– Self-splicing RNAs

• The search uses WU-BLAST and keeps hits with E-value < 1e-5

Page 46: Gene  Prediction

Rfam Preliminary Results

Accession  Total ncRNA  rRNA  tRNA / tmRNA  sRNA  Others (RNase P)  Sequencing coverage
M19107     65           10    43            11    1                 12X
M19501     85           14    53            17    1                 53X
M21127     79           9     52            17    1                 20X
M21621     81           10    54            16    1                 25X
M21639     95           12    53            29    1                 78X
M21709     92           16    54            21    1                 34X

The output format is: <rfam acc> <rfam id> <seq id> <seq start> <seq end> <strand> <score>

Results:

84 Rfam similarity 25970 27512 1477.28 + . evalue=2.08e-50;gc-content=52;id=SSU_rRNA_bacteria.1;model_end=1518;model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_bacteria
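A result record like the one above can be parsed and checked against the E-value < 1e-5 threshold mentioned earlier. This is a minimal sketch: the column positions are read off the example record, and the returned field names are illustrative rather than an Rfam convention.

```python
EVALUE_CUTOFF = 1e-5


def parse_rfam_hit(line):
    """Split one GFF-style Rfam record into coordinates and named attributes."""
    cols = line.rstrip("\n").split(None, 8)
    attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
    return {
        "seq_id": cols[0],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "score": float(cols[5]),
        "strand": cols[6],
        "evalue": float(attrs["evalue"]),
        "rfam_acc": attrs["rfam-acc"],
        "rfam_id": attrs["rfam-id"],
    }


def significant_hits(lines, cutoff=EVALUE_CUTOFF):
    """Keep only the records whose E-value is below the cutoff."""
    hits = (parse_rfam_hit(line) for line in lines if line.strip())
    return [hit for hit in hits if hit["evalue"] < cutoff]


# The record shown on this slide.
record = (
    "84 Rfam similarity 25970 27512 1477.28 + . "
    "evalue=2.08e-50;gc-content=52;id=SSU_rRNA_bacteria.1;"
    "model_end=1518;model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_bacteria"
)
kept = significant_hits([record])
```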

Page 47: Gene  Prediction

Things to be done

• Get GenePRIMP working: we are having problems with the installation, and the web server takes a long time to process.

• Get the further information required to run other RNA prediction tools.

• Compare specific RNA prediction tools with the Rfam predictions.

Page 48: Gene  Prediction

Leading Biocomputational Tools

• eQRNA (Rivas and Eddy 2001)

• RNAz (Washietl et al. 2005; Gruber et al. 2010)

• sRNAPredict3/SIPHT (Livny et al. 2006, 2008)

• NAPP (Marchais et al. 2009)

Lu, X., H. Goodrich-Blair, et al. (2011). "Assessing computational tools for the discovery of small RNA genes in bacteria." RNA 17(9): 1635-1647

All four approaches use comparative genomics!!

Page 49: Gene  Prediction

sRNAPredict3 Pipeline