Genome Annotation

Rosana O. Babu

Sequence to Annotation

Input 1: Variant annotation

Input 2: Structural annotation

Structural annotation was conducted using AUGUSTUS (version 2.5.5) with Magnaporthe_grisea as the species model

However, a gene model trained for oomycetes has to be developed to obtain accurate results

Input 3: Functional annotation

Genome Annotation

The process of identifying the locations of genes and coding regions in a genome and determining what those genes do

Finding the structural elements of the genome and attaching the related function to each genomic location

Genome Annotation

Gene structure prediction

Identifying elements (introns/exons, CDS, start, stop) in the genome

Gene function prediction

Attaching biological information to these elements, e.g. which protein an exon codes for

Eukaryote genome annotation

[Figure: gene expression in a eukaryote. Genome (ATG to STOP) → transcription → primary transcript → RNA processing → processed mRNA → translation → polypeptide → protein folding → folded protein → enzyme activity → functional activity]

Annotation steps:

• Find locus

• Find exons using transcripts

• Find exons using peptides

• Find function

Prokaryote genome annotation

[Figure: gene expression in a prokaryote. Genome (START to STOP) → transcription → primary transcript → RNA processing → processed RNA → translation → polypeptide → protein folding → folded protein → enzyme activity → functional activity]

Annotation steps:

• Find locus

• Find CDS

• Find function
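The "find CDS" step for a prokaryote can be illustrated with a naive open reading frame scan. This is only a sketch: the function names, the `min_codons` threshold and the use of ATG as the sole start codon are assumptions made for illustration, and real gene finders also score codon usage and ribosome binding sites.

```python
# Toy ORF scan: report START..STOP stretches in all six reading frames.
START = "ATG"                      # assumption: only ATG is treated as a start
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end) tuples.

    Coordinates are 0-based, half-open, and refer to the strand that was
    scanned (for "-" they index the reverse-complemented sequence).
    The length check counts the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            open_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if open_start is None and codon == START:
                    open_start = i          # first start codon opens the ORF
                elif open_start is not None and codon in STOPS:
                    if (i + 3 - open_start) // 3 >= min_codons:
                        orfs.append((strand, open_start, i + 3))
                    open_start = None       # look for the next ORF in frame
    return orfs


orfs = find_orfs("CCATGAAATTTTAGCC")
```

Here the only complete ORF is ATG AAA TTT TAG on the forward strand, so `orfs` holds a single tuple.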

Genome annotation - workflow

• Genome sequence

• Repeats: masked or un-masked genome sequence

• Structural annotation (gene finding): protein-coding genes, ncRNAs, introns

• Functional annotation

• Viewed and released in a genome viewer

Genome Repeats & features

Percentage of repetitive sequences in different organisms

Genome              Genome size (Mb)   % Repeat
Aedes aegypti       1,300              ~70
Anopheles gambiae   260                ~30
Culex pipiens       540                ~50

Repeat classes: microsatellite, minisatellite, tandem repeat, short tandem repeat (STR), simple sequence repeat (SSR)

Polymorphic between individuals/populations

Finding repeats as a preliminary to gene prediction

Repeat discovery

Literature and public databanks

Homology based approaches

Automated approaches (e.g. RepeatScout or RECON)

Tandem repeats: Tandem, TRF

Use RepeatMasker to search the genome and mask the sequence
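To make the idea of tandem repeat discovery concrete, the sketch below finds perfect simple sequence repeats with a regular expression. It is not how TRF or RepeatMasker work internally; the function name and the `min_repeats` and `max_motif` thresholds are illustrative assumptions, and imperfect repeats are not detected.

```python
import re


def find_ssrs(seq, min_repeats=4, max_motif=6):
    """Find perfect tandem repeats of a short motif.

    Returns (start, end, motif, copies) tuples with 0-based, half-open
    coordinates. The lazy quantifier makes the regex report the shortest
    repeating unit (e.g. "CA" rather than "CACA").
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        copies = (match.end() - match.start()) // len(motif)
        hits.append((match.start(), match.end(), motif, copies))
    return hits


# A (CA)5 microsatellite flanked by unique sequence.
ssrs = find_ssrs("TTG" + "CA" * 5 + "GGT")
```

The intervals returned by such a scan are the kind of coordinates that a masking step would then act on.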

Masked sequence

A repeat-masked sequence is an artificial construct in which the regions thought to be repetitive are replaced with X's

Widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of transposable elements (TEs) on the final annotation set

>my sequence

atgagcttcgatagcgatcagctagcgatcaggctactattggcttctctagactcgtctatctctattagctatcatctcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggctgatcttaggtcttctgatcttct

>my sequence (repeatmasked)

atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct

Positions/locations are not affected by masking
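A minimal sketch of hard masking, assuming the repeat intervals are already known as 0-based, half-open coordinates (the function name and the example interval are made up for illustration). Because each base is replaced one-for-one, the sequence length and all downstream coordinates stay the same.

```python
def hard_mask(seq, repeats, mask_char="x"):
    """Replace every base inside a repeat interval with mask_char."""
    bases = list(seq)
    for start, end in repeats:
        # Clamp the interval so out-of-range coordinates are ignored.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = mask_char
    return "".join(bases)


seq = "atgagcttcgatagcgatcag"
masked = hard_mask(seq, [(3, 8)])   # hypothetical repeat at positions 3..7
```

Positions outside the interval are untouched, so a feature at position 10 in `seq` is still at position 10 in `masked`.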

Types of masking: hard or soft?

Sometimes we want to mark up repetitive sequence but not exclude it from downstream analyses. This is achieved with soft-masking, in which repeats are written in lower case and the rest of the sequence in upper case

>my sequence

ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTGGCTTCTCTAGACTCGTCTATCTCTATTAGTATCATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT

>my sequence (softmasked)

ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTggcttctctagactcgtctatctctattagtatcATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT

>my sequence (hardmasked)

atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct
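The relationship between the two formats can be sketched as follows: a soft-masked sequence still carries the original bases, so the repeat intervals can be recovered from the case and the sequence can be converted to a hard-masked one at any time, whereas the reverse conversion is not possible. The helper names and the example intervals are assumptions for illustration.

```python
import re


def soft_mask(seq, repeats):
    """Upper-case the sequence, then lower-case the repeat intervals."""
    bases = list(seq.upper())
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


def masked_intervals(soft_seq):
    """Recover 0-based, half-open repeat intervals from lower-case runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", soft_seq)]


def soft_to_hard(soft_seq, mask_char="X"):
    """Turn every lower-case (soft-masked) base into mask_char."""
    return re.sub(r"[a-z]", mask_char, soft_seq)


soft = soft_mask("ATGAGCTTCGATAGC", [(3, 8), (10, 12)])
hard = soft_to_hard(soft)
```

Tools that understand soft-masking can choose per analysis whether to ignore or use the lower-case regions, which is the main reason to prefer it when the choice is available.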

Workflow: genome sequence → map repeats → gene finding (structural annotation) on masked or un-masked sequence

Structural annotation

Identification of genomic elements

• Open reading frames and their localization

• Coding regions

• Location of regulatory motifs

• Start/stop

• Splice sites

• Non-coding regions/RNAs

Methods

• Similarity: similarity between sequences, which does not necessarily imply any evolutionary linkage

• Ab initio prediction: prediction of gene structure from first principles using only the genome sequence

Gene finding: ab initio and similarity-based

Gene-finding resources for homology-based methods

• Transcript: cDNA sequences, EST sequences

• Peptide: non-redundant (nr) protein database, protein sequence data, mass spectrometry data

• Genome: other genomic sequence

Ab initio prediction

Signals read from the genome: coding potential, ATG and stop codons, splice sites

Genefinding - ab initio predictions

Use compositional features of the DNA sequence to define coding segments (essentially exons)

Coding bias

Splice site consensus sequences

Start and stop codons

Methods

Training sets are required

Each feature is assigned a log likelihood score

Use dynamic programming to find the highest-scoring path through the sequence

Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
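The combination of log-likelihood scores and dynamic programming can be sketched with a two-state toy model (non-coding "N" and coding "C") decoded with the Viterbi algorithm. Everything numeric here is an assumption chosen for illustration: the emission probabilities simply make the coding state GC-rich, and the transition probabilities penalise switching state. Real gene finders such as Augustus use far richer models with states for exons, introns and splice sites, trained on known genes.

```python
import math

STATES = ("N", "C")            # N = non-coding, C = coding
# Hypothetical emission probabilities (a trained model would supply these).
EMIT = {
    "N": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}
STAY, SWITCH = 0.9, 0.1        # hypothetical transition probabilities


def transition(prev, cur):
    """Log probability of moving from state prev to state cur."""
    return math.log(STAY if prev == cur else SWITCH)


def viterbi(seq):
    """Return the highest-scoring state path as a string of N/C labels."""
    seq = seq.upper()
    if not seq:
        return ""
    # Best log score of any path ending in each state at the first base.
    score = {s: math.log(0.5) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []                  # back[i][s] = best previous state for s
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + transition(p, s))
            new_score[s] = score[prev] + transition(prev, s) + math.log(EMIT[s][base])
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


path = viterbi("AT" * 5 + "GC" * 5 + "AT" * 5)
```

With these numbers a ten-base GC-rich block is long enough to pay for switching into and out of the coding state, while a two-base block is not, which shows how the path score trades off local evidence against model structure.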

Genefinding - similarity

Use known coding sequence to define coding regions

EST sequences

Peptide sequences

Difficulty: handling fuzzy alignment regions around splice sites

Examples: EST2Genome, exonerate, genewise

Gene-finding - comparative

Use two or more genomic sequences to predict genes based on conservation of exon sequences

Examples: Twinscan and SLAM

Genefinding - non-coding RNA genes

Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples

tRNAscan - uses an HMM and covariance model for prediction of tRNA genes

Rfam - a suite of covariance models (structure-aware relatives of profile HMMs) trained against a large number of different RNA genes

Gene-finding omissions

Alternative isoforms

• Currently there is no good method for predicting alternative isoforms

• Only created where supporting transcript evidence is present

Pseudogenes

• Each genome project has a fuzzy definition of pseudogenes

• Badly curated/described across the board

Promoters

• Rarely a priority for a genome project

• Some algorithms exist but are usually not integrated into an annotation set

Practical- structural annotation

Eukaryotes- AUGUSTUS (gene model)

~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial --singlestrand=true --alternatives-from-evidence=true --alternatives-from-sampling=true --progress=true --gff3=on --uniqueGeneId=true --species=magnaporthe_grisea our_genome.fasta >structural_annotation.gff

Prokaryotes – PRODIGAL (Codon Usage table)

~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 -d nucleotide_exon_seq.fa -f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt

Structural annotation: conducted using AUGUSTUS (version 2.5.5) with Magnaporthe_grisea as the species model

However, a dedicated gene model has to be developed for the target genome to obtain accurate results
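Once a gene finder has written its predictions as GFF3, a quick summary helps to sanity-check the run. The sketch below assumes standard nine-column, tab-separated GFF3; the example lines are invented for illustration and are not real AUGUSTUS output, and the helper names are hypothetical.

```python
from collections import Counter


def parse_gff3(lines):
    """Yield (seqid, type, start, end, strand) for each feature line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                      # skip blank lines and directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue                      # not a well-formed feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, _attrs = fields
        yield seqid, ftype, int(start), int(end), strand


def summarise(lines):
    """Count features by type and collect gene lengths (1-based inclusive)."""
    counts = Counter()
    gene_lengths = []
    for _seqid, ftype, start, end, _strand in parse_gff3(lines):
        counts[ftype] += 1
        if ftype == "gene":
            gene_lengths.append(end - start + 1)
    return counts, gene_lengths


# Invented example records, for illustration only.
example = [
    "##gff-version 3",
    "\t".join(["contig1", "AUGUSTUS", "gene", "101", "700", "0.95", "+", ".", "ID=g1"]),
    "\t".join(["contig1", "AUGUSTUS", "CDS", "101", "400", "0.95", "+", "0", "Parent=g1.t1"]),
    "\t".join(["contig1", "AUGUSTUS", "CDS", "501", "700", "0.95", "+", "0", "Parent=g1.t1"]),
]
counts, gene_lengths = summarise(example)
```

For a real run the same function can be given an open file handle, for example `summarise(open("structural_annotation.gff"))`, since it only iterates over lines.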

Functional annotation

Attaching biological information to genomic elements

• Biochemical function

• Biological function

• Involved regulation and interactions

• Expression

• Apply known structural information to the predicted protein sequence


Genome annotation

[Figure: gene expression pathway from genome (ATG to STOP) through transcription, RNA processing, translation and protein folding to functional activity; annotation step shown: find function]

Functional annotation – Homology Based

Predicted exons/CDS/ORFs are searched against the non-redundant protein database (NCBI, SwissProt) to find similar sequences

Visually assess the top 5-10 hits to identify whether these have been assigned a function

Functions are assigned to the predictions on the basis of these hits

Functional annotation - Other features

Other features which can be determined

• Signal peptides

• Transmembrane domains

• Low complexity regions

• Various binding sites, glycosylation sites etc.

• Protein domains

See http://expasy.org/tools/ for a good list of possible prediction algorithms

Functional annotation - Other features (Ontologies)

Use of ontologies to annotate gene products

Gene Ontology (GO)

• Cellular component

• Molecular function

• Biological process

Practical - FUNCTIONAL ANNOTATION

Homology-based method

• Set up a BLAST database (nucleotide/protein)

• BLAST genome.fasta against it for annotations (nucleotide/protein)

• Filter hits on the E-value cut-off (E-value <= 0.01) for nucleotide/protein

• Further filter for the best BLAST hits (5-15) and assign functions

• Remove positive-strand BLAST hits

• Remove negative-strand BLAST hits
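The filtering steps can be sketched as below, assuming the BLAST search was run with the standard 12-column tabular output, in which column 11 is the E-value and column 12 the bit score. The function name, the cut-off defaults and the example rows are assumptions for illustration; hits are kept when their E-value is at or below the cut-off.

```python
import csv
import io
from collections import defaultdict


def best_hits(tabular_text, max_evalue=0.01, top_n=5):
    """Return {query: [(subject, evalue, bitscore), ...]} for the best hits.

    Hits above max_evalue are dropped; the rest are ranked by E-value
    (ascending) and then bit score (descending), and the top_n are kept.
    """
    hits = defaultdict(list)
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue                      # skip comments and short lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue <= max_evalue:
            hits[query].append((evalue, -bitscore, subject))
    return {
        query: [(subj, ev, -neg_bits) for ev, neg_bits, subj in sorted(rows)[:top_n]]
        for query, rows in hits.items()
    }


# Invented example rows, for illustration only.
rows = [
    ["g1", "sp|A", "90.0", "100", "10", "0", "1", "100", "1", "100", "1e-50", "200"],
    ["g1", "sp|B", "50.0", "100", "50", "0", "1", "100", "1", "100", "0.5", "30"],
    ["g1", "sp|C", "80.0", "100", "20", "0", "1", "100", "1", "100", "1e-30", "150"],
    ["g2", "sp|D", "40.0", "50", "30", "0", "1", "50", "1", "50", "2.0", "20"],
]
sample = "\n".join("\t".join(r) for r in rows)
result = best_hits(sample)
```

The descriptions of the retained subjects are what would then be inspected to assign a function to each query; queries with no surviving hit, like `g2` here, stay unannotated.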

Functional annotation- output

August 2008 Bioinformatics tools for Comparative Genomics of Vectors

Conclusion

Annotation accuracy is only as good as the supporting data available at the time of annotation, so annotations need to be updated

Gene predictions will change over time as new data become available (ESTs, related genomes) that are more closely related than those available previously

Functional assignments will change over time as new data becomes available (characterization of hypothetical proteins)

Thank You
