Post on 11-Jan-2016
Genome Annotation
Rosana O. Babu
Sequence to Annotation
Input 1: Variant annotation
Input 2: Structural annotation
Structural annotation was performed with AUGUSTUS (version 2.5.5), using the pre-trained magnaporthe_grisea species model
However, an Oomycete-specific model needs to be trained to obtain accurate results
Input 3: Functional annotation
Genome Annotation
The process of identifying the locations of genes and coding regions in a genome and determining what those genes do
Finding the structural elements at each genome location and attaching their related functions
Genome Annotation
Gene structure prediction
Identifying elements (introns/exons, CDS, start and stop codons) in the genome
Gene function prediction
Attaching biological information to these elements, e.g. which protein an exon codes for
Eukaryote genome annotation
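[Figure: eukaryotic gene expression, from the genome (ATG to STOP, regions labelled A and B) through transcription (primary transcript), RNA processing (processed mRNA with m7G cap and AAAn tail), translation (polypeptide) and protein folding (folded protein) to functional activity (enzyme activity). Annotation steps marked on the figure: find locus; find exons using transcripts; find exons using peptides; find function.]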
Prokaryote genome annotation
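[Figure: prokaryotic gene expression, from the genome (START to STOP, regions labelled A and B) through transcription (primary transcript), RNA processing (processed RNA), translation (polypeptide) and protein folding (folded protein) to functional activity (enzyme activity). Annotation steps marked on the figure: find locus; find CDS; find function.]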
Genome annotation - workflow
Genome sequence
Repeats
Structural annotation (gene finding)
Protein-coding genes, nc-RNAs, introns
Functional annotation
Viewed and released in genome viewer
Masked or un-masked genome sequence
Genome Repeats & features
Percentage of repetitive sequences in different organisms
Genome               Genome size (Mb)   % Repeat
Aedes aegypti        1,300              ~70
Anopheles gambiae    260                ~30
Culex pipiens        540                ~50
Repeat types: microsatellite, minisatellite, tandem repeat, short tandem repeat, SSR
Polymorphic between individuals/populations
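As an illustration of what a microsatellite/SSR looks like in sequence terms, here is a minimal sketch that finds perfect tandem repeats with a regular expression. The function name `find_ssrs` and its thresholds are assumptions made for this example; dedicated tools such as TRF use more sophisticated models that tolerate mismatches.

```python
import re


def find_ssrs(seq, min_unit=1, max_unit=6, min_repeats=4):
    """Find perfect tandem repeats (hypothetical helper, not a real tool's API).

    Returns a list of (start, end, unit, copies) with 0-based, end-exclusive
    coordinates. Only exact copies of the repeat unit are detected.
    """
    # Lazy unit match so the shortest repeating unit is reported,
    # followed by at least (min_repeats - 1) further copies of it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        copies = (m.end() - m.start()) // len(unit)
        hits.append((m.start(), m.end(), unit, copies))
    return hits


example = "GGCACACACACATT"
print(find_ssrs(example))  # [(2, 12, 'CA', 5)]
```

The number of copies at such a locus is what tends to differ between individuals or populations.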
Finding repeats as a preliminary to gene prediction
Repeat discovery
Literature and public databanks
Homology based approaches
Automated approaches (e.g. RepeatScout or RECON)
Tandem repeats: Tandem, TRF
Use RepeatMasker to search the genome and mask the sequence
Masked sequence
A repeat-masked sequence is an artificial construct in which the regions thought to be repetitive are replaced with X's
Masking is widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of transposable elements (TEs) on the final annotation set
>my sequence
atgagcttcgatagcgatcagctagcgatcaggctactattggcttctctagactcgtctatctctattagctatcatctcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggctgatcttaggtcttctgatcttct
>my sequence (repeatmasked)
atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct
Positions/locations are not affected by masking
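A minimal sketch of why coordinates survive masking: each repeat base is overwritten in place, so the sequence length and every downstream position stay the same. The helper `hard_mask` and the interval used here are assumptions for illustration, not RepeatMasker's own implementation.

```python
def hard_mask(seq, intervals, mask_char="x"):
    """Overwrite each 0-based, end-exclusive interval with mask_char.

    Hypothetical helper: the intervals would normally come from a repeat
    finder; here they are supplied by hand.
    """
    chars = list(seq)
    for start, end in intervals:
        # Clamp to the sequence bounds so bad intervals cannot change length.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


seq = "atgagcttcgatagcgatcagc"
masked = hard_mask(seq, [(6, 12)])
print(masked)                   # atgagcxxxxxxagcgatcagc
print(len(seq) == len(masked))  # True
```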
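Types of masking: hard or soft?
Sometimes we want to mark repetitive sequence without excluding it from downstream analyses. This is achieved with soft-masking, in which repeats are written in lowercase and the rest of the sequence in uppercase; hard-masking replaces the repeat bases with X's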
>my sequence
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTGGCTTCTCTAGACTCGTCTATCTCTATTAGTATCATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT
>my sequence (softmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTggcttctctagactcgtctatctctattagtatcATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT
>my sequence (hardmasked)
atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct
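The relationship between the two formats can be sketched as follows, assuming the convention shown above (lowercase = repeat in a soft-masked sequence). The function names and the short test sequence are made up for this example.

```python
import re


def soft_to_hard(seq, mask_char="X"):
    """Convert a soft-masked sequence to hard-masked by replacing lowercase bases."""
    return re.sub(r"[a-z]", mask_char, seq)


def masked_intervals(seq):
    """Return 0-based, end-exclusive (start, end) intervals of lowercase runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


soft = "ATGAGCttcgatAGCGATcagc"
print(soft_to_hard(soft))      # ATGAGCXXXXXXAGCGATXXXX
print(masked_intervals(soft))  # [(6, 12), (18, 22)]
```

Soft-masking is lossless (the original bases can be recovered with `soft.upper()`), whereas hard-masking discards the repeat sequence.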
Genome annotation - workflow
Genome sequence
Map repeats
Gene finding (structural annotation)
Protein-coding genes, nc-RNAs, introns
Functional annotation
Viewed and released in genome viewer
Masked or un-masked
Structural annotation
Identification of genomic elements
Open reading frames and their localization
Coding regions
Location of regulatory motifs
Start/stop codons
Splice sites
Non-coding regions/RNAs
Methods
Similarity: similarity between sequences, which does not necessarily imply any evolutionary linkage
Ab initio prediction: prediction of gene structure from first principles, using only the genome sequence
Gene finding: ab initio and similarity-based
Gene-finding resources for homology-based methods
Transcript: cDNA sequences, EST sequences
Peptide: non-redundant (nr) protein database, protein sequence data, mass spectrometry data
Genome: other genomic sequence
ab initio prediction
[Figure: ab initio prediction along the genome using three signals: coding potential, ATG and stop codons, and splice sites.]
Genefinding - ab initio predictions
Use compositional features of the DNA sequence to define coding segments (essentially exons)
ORFs
Coding bias
Splice site consensus sequences
Start and stop codons
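The simplest of these features, the ORF, can be sketched in a few lines. This is a toy forward-strand scan assuming the standard stop codons; the function name and the `min_codons` threshold are assumptions for illustration, and real gene finders combine ORFs with the other signals listed above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon
    (stop included). Coordinates are 0-based, end-exclusive.
    min_codons counts all codons from ATG to stop inclusive.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 2)]
```

This only works directly for intron-free genes; in eukaryotes the splice-site signals are needed to join exons before the reading frame can be followed.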
Methods
Training sets are required
Each feature is assigned a log likelihood score
Use dynamic programming to find the highest-scoring path through the sequence, i.e. the most likely gene structure
Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
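To make the "log-likelihood scores plus dynamic programming" idea concrete, here is a minimal Viterbi sketch over a two-state toy model. Everything about the model is an assumption for illustration: the states ("N" for non-coding, "C" for coding-like), the probabilities, and the idea that coding sequence is GC-rich. Real gene finders use many more states (exons, introns, splice sites) and parameters estimated from training sets.

```python
import math


def viterbi(seq, states, log_start, log_trans, log_emit):
    """Return (best_path, best_log_score) for seq under the given model."""
    # best[i][s]: best log score of any state path ending in s at position i
    best = [{s: log_start[s] + log_emit[s][seq[0]] for s in states}]
    back = [{}]
    for i in range(1, len(seq)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[i][s] = score + log_emit[s][seq[i]]
            back[i][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for i in range(len(seq) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy parameters (hypothetical, chosen only to make the example readable)
STATES = ("N", "C")
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {
    "N": {"N": math.log(0.9), "C": math.log(0.1)},
    "C": {"N": math.log(0.1), "C": math.log(0.9)},
}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}
LOG_EMIT = {s: {b: math.log(p) for b, p in probs.items()} for s, probs in EMIT.items()}

path, score = viterbi("AAAAAAGGGGGGAAAAAA", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print("".join(path))
```

Working in log space turns products of probabilities into sums, which avoids numerical underflow on long sequences.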
Genefinding - similarity
Use known coding sequence to define coding regions
EST sequences
Peptide sequences
Difficulty: alignments are often fuzzy around splice sites, so exact exon boundaries are hard to place
Examples: EST2Genome, exonerate, genewise
Gene-finding - comparative
Use two or more genomic sequences to predict genes based on conservation of exon sequences
Examples: Twinscan and SLAM
Genefinding - non-coding RNA genes
Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples
tRNAscan - uses an HMM and a covariance model to predict tRNA genes
Rfam - a collection of RNA families, each represented by a covariance model (a profile capturing both sequence and secondary structure) trained on known examples
Gene-finding omissions
Alternative isoforms
Currently there is no good method for predicting alternative isoforms
Only created where supporting transcript evidence is present
Pseudogenes
Each genome project has a fuzzy definition of pseudogenes
Badly curated/described across the board
Promoters
Rarely a priority for a genome project
Some algorithms exist but are usually not integrated into an annotation set
Practical- structural annotation
Eukaryotes- AUGUSTUS (gene model)
~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial --singlestrand=true --alternatives-from-evidence=true --alternatives-from-sampling=true --progress=true --gff3=on --uniqueGeneId=true --species=magnaporthe_grisea our_genome.fasta >structural_annotation.gff
Prokaryotes – PRODIGAL (Codon Usage table)
~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 -d nucleotide_exon_seq.fa -f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt
Structural annotation was performed with AUGUSTUS (version 2.5.5), using the pre-trained magnaporthe_grisea species model
However, a species-specific model needs to be trained to obtain accurate results
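Once a structural annotation file exists, a first sanity check is to count the predicted features. The sketch below assumes the standard 9-column, tab-separated GFF3 layout with 1-based inclusive coordinates. The sample rows, IDs and function names are made up for illustration and are not real AUGUSTUS output.

```python
from collections import Counter


def parse_gff(gff_text):
    """Yield the 9 columns of each feature line, skipping comments and blanks."""
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 9:
            yield cols


def count_features(gff_text):
    """Count features by type (column 3)."""
    return Counter(cols[2] for cols in parse_gff(gff_text))


def gene_lengths(gff_text):
    """Map gene ID -> length in bp (GFF coordinates are 1-based, inclusive)."""
    lengths = {}
    for cols in parse_gff(gff_text):
        if cols[2] == "gene":
            attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
            lengths[attrs.get("ID")] = int(cols[4]) - int(cols[3]) + 1
    return lengths


# Hypothetical sample rows, joined with tabs
rows = [
    ["contig1", "AUGUSTUS", "gene", "101", "900", ".", "+", ".", "ID=g1"],
    ["contig1", "AUGUSTUS", "transcript", "101", "900", ".", "+", ".", "ID=g1.t1;Parent=g1"],
    ["contig1", "AUGUSTUS", "CDS", "101", "400", ".", "+", "0", "ID=g1.t1.cds;Parent=g1.t1"],
    ["contig1", "AUGUSTUS", "CDS", "601", "900", ".", "+", "0", "ID=g1.t1.cds;Parent=g1.t1"],
]
sample = "##gff-version 3\n" + "\n".join("\t".join(r) for r in rows)

print(count_features(sample))
print(gene_lengths(sample))  # {'g1': 800}
```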
Functional annotation
Attaching biological information to genomic elements
Biochemical function
Biological function
Regulation and interactions involved
Expression
Use known structural information to annotate the predicted protein sequence
Genome annotation
[Figure: eukaryotic gene expression, from the genome (ATG to STOP) through transcription, RNA processing (m7G cap, AAAn tail), translation and protein folding to functional activity. Annotation step marked on the figure: find function.]
Functional annotation – Homology Based
Predicted exons/CDS/ORFs are searched against a non-redundant protein database (NCBI, SwissProt) to find similar sequences
Visually assess the top 5-10 hits to check whether they have been assigned a function
Functions are then assigned to the predicted genes based on these hits
Functional annotation - Other features
Other features which can be determined
Signal peptides
Transmembrane domains
Low complexity regions
Various binding sites, glycosylation sites etc.
Protein domains
See http://expasy.org/tools/ for a good list of possible prediction algorithms
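As one example of such a feature, low complexity can be approximated by the Shannon entropy of the residues in a sliding window. This is only a sketch under stated assumptions: the window size and entropy threshold are arbitrary, the function names are made up, and this is not the algorithm of any particular prediction tool.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy (bits) of the residue composition of s."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())


def low_complexity_windows(protein, window=12, max_entropy=2.0):
    """Return start positions (0-based) of windows with entropy <= max_entropy."""
    return [
        i
        for i in range(len(protein) - window + 1)
        if shannon_entropy(protein[i:i + window]) <= max_entropy
    ]


print(shannon_entropy("ABAB"))                                       # 1.0
print(low_complexity_windows("QQQQQQ", window=4, max_entropy=0.5))   # [0, 1, 2]
```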
Functional annotation - Other features (Ontologies)
Use of ontologies to annotate gene products
Gene Ontology (GO): cellular component, molecular function, biological process
Practical - FUNCTIONAL ANNOTATION
Homology Based Method
Set up a BLAST database (nucleotide/protein)
BLAST genome.fasta against it for annotations (nucleotide/protein)
Filter BLAST hits by E-value (keep hits with E-value <= 0.01) for nucleotide/protein
Further filter for the best BLAST hits (5-15) and assign functions
Remove positive-strand BLAST hits
Remove negative-strand BLAST hits
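The E-value filtering and best-hit selection steps can be sketched as below. This assumes BLAST was run with tabular output (the 12-column `-outfmt 6` layout, where column 11 is the E-value and column 12 the bit score). The function name, the sample rows and the subject identifiers are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def best_hits(blast_tsv, max_evalue=0.01, top_n=5):
    """Keep hits with E-value <= max_evalue and return the top_n per query.

    Hits are ranked by ascending E-value, then by descending bit score.
    """
    hits = defaultdict(list)
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue <= max_evalue:
            hits[query].append((subject, evalue, bitscore))
    return {
        q: sorted(h, key=lambda x: (x[1], -x[2]))[:top_n]
        for q, h in hits.items()
    }


# Hypothetical tabular BLAST rows
rows = [
    ["g1.t1", "subjectB", "45.0", "280", "150", "4", "5", "284", "10", "289", "2e-20", "95.0"],
    ["g1.t1", "subjectA", "98.0", "300", "6", "0", "1", "300", "1", "300", "1e-150", "550"],
    ["g1.t1", "subjectC", "30.0", "60", "40", "2", "1", "60", "3", "62", "0.5", "30.1"],
    ["g2.t1", "subjectD", "25.0", "50", "37", "1", "1", "50", "1", "50", "3.2", "25.0"],
]
blast_output = "\n".join("\t".join(r) for r in rows)

result = best_hits(blast_output)
for query, query_hits in result.items():
    print(query, [h[0] for h in query_hits])
```

Queries with no hit below the threshold (here `g2.t1`) drop out of the result and would typically be labelled as hypothetical proteins.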
Functional annotation- output
August 2008 Bioinformatics tools for Comparative Genomics of Vectors
Conclusion
Annotation accuracy is only as good as the supporting data available at the time of annotation, so annotations need to be updated
Gene predictions will change over time as new data (ESTs, related genomes) become available that are more similar to the genome than earlier data
Functional assignments will change over time as new data become available (e.g. characterization of hypothetical proteins)
Thank You