Genome Annotation
Rosana O. Babu
Sequence to Annotation
Input 1: Variant Annotation
Input 2: Structural Annotation
Structural annotation was conducted using AUGUSTUS (version 2.5.5) with Magnaporthe_grisea as the species model
However, a species model must be developed for the Oomycete to obtain accurate results
Input 3: Functional Annotation
Genome Annotation
The process of identifying the locations of genes and coding regions in a genome and determining what those genes do
Finding the structural elements and attaching their related functions to each genomic location
Genome Annotation
Gene structure prediction
Identifying elements (introns/exons, CDS, start, stop) in the genome
Gene function prediction
Attaching biological information to these elements, e.g. which protein an exon codes for
Eukaryote genome annotation
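[Figure: from genome to functional activity in a eukaryote]
Genome (ATG ... STOP; labels A and B) -> Transcription -> Primary transcript -> RNA processing -> Processed mRNA (m7G cap, AAAn tail) -> Translation -> Polypeptide -> Protein folding -> Folded protein -> Enzyme activity -> Functional activity
Annotation steps shown alongside: find locus; find exons using transcripts; find exons using peptides; find function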
Prokaryote genome annotation
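[Figure: from genome to functional activity in a prokaryote]
Genome (START ... STOP; labels A and B) -> Transcription -> Primary transcript -> RNA processing -> Processed RNA -> Translation -> Polypeptide -> Protein folding -> Folded protein -> Enzyme activity -> Functional activity
Annotation steps shown alongside: find locus; find CDS; find function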
Genome annotation - workflow
Genome sequence
Repeats
Masked or un-masked genome sequence
Structural annotation (gene finding)
Protein-coding genes, ncRNAs, introns
Functional annotation
Viewed and released in a genome viewer
Genome Repeats & features
Percentage of repetitive sequences in different organisms

| Genome | Genome size (Mb) | % Repeat |
| --- | --- | --- |
| Aedes aegypti | 1,300 | ~70 |
| Anopheles gambiae | 260 | ~30 |
| Culex pipiens | 540 | ~50 |
Repeat types: microsatellite, minisatellite, tandem repeat, short tandem repeat, simple sequence repeat (SSR)
Polymorphic between individuals/populations
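As a rough illustration of what a short tandem repeat looks like in sequence terms, the sketch below scans a DNA string for a 1-6 bp unit repeated several times in a row. The function name `find_ssrs`, the unit-length range, and the `min_copies` threshold are assumptions made for this example; it only finds perfect repeats, which dedicated tools such as TRF go well beyond.

```python
import re

def find_ssrs(seq, min_unit=1, max_unit=6, min_copies=4):
    """Return (start, end, unit, copies) for perfect tandem repeats in seq.

    Coordinates are 0-based and half-open. All thresholds are illustrative.
    """
    seq = seq.upper()
    # A short unit (lazy match, so the smallest unit wins) followed by
    # at least min_copies - 1 further copies of that same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), m.end(), unit, copies))
    return hits

# Hypothetical sequence with a (CA)6 microsatellite embedded in it.
example = "GGCT" + "CA" * 6 + "TTGAC"
print(find_ssrs(example))
```

Because copy number at such loci varies between individuals, the `copies` value is the quantity that would differ from one sample to the next.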
Finding repeats as a preliminary to gene prediction
Repeat discovery
Literature and public databanks
Homology based approaches
Automated approaches (e.g. RepeatScout or RECON)
Tandem repeats: Tandem, TRF
Use RepeatMasker to search the genome and mask the sequence
Masked sequence
A repeat-masked sequence is an artificial construct in which regions thought to be repetitive are marked with X's
Masking is widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of transposable elements (TEs) on the final annotation set
>my sequence
atgagcttcgatagcgatcagctagcgatcaggctactattggcttctctagactcgtctatctctattagctatcatctcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggctgatcttaggtcttctgatcttct
>my sequence (repeatmasked)
atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct
Positions/locations are not affected by masking
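The point about positions can be checked with a small sketch: masking replaces characters one for one, so the sequence length and every downstream coordinate stay the same. The function `hard_mask` and the interval used here are hypothetical; in practice the intervals would come from a repeat finder such as RepeatMasker.

```python
def hard_mask(seq, intervals, mask_char="x"):
    """Replace each 0-based, half-open (start, end) interval with mask_char."""
    chars = list(seq)
    for start, end in intervals:
        # Clamp to the sequence bounds so a bad interval cannot change length.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)

seq = "atgagcttcgatagcgatcagctagcgatcagg"
masked = hard_mask(seq, [(6, 12)])  # interval invented for this example

print(masked)
print(len(seq) == len(masked))  # the coordinate system is unchanged
```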
Types of masking: hard or soft?
Sometimes we want to mark repetitive sequence without excluding it from downstream analyses. Soft-masking does this by writing repeats in lowercase and leaving the rest in uppercase, whereas hard-masking replaces repeats with X's
>my sequence
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTGGCTTCTCTAGACTCGTCTATCTCTATTAGTATCATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT
>my sequence (softmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTggcttctctagactcgtctatctctattagtatcATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT
>my sequence (hardmasked)
atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct
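Since soft-masking keeps the underlying bases, a soft-masked sequence can always be turned into a hard-masked one, but not the other way round. A minimal sketch of that conversion follows; the function names and the choice of "X" as the mask character are assumptions for illustration (many tools use "N" instead).

```python
def soft_to_hard(seq, mask_char="X"):
    """Replace every lowercase (soft-masked) base with mask_char."""
    return "".join(mask_char if c.islower() else c for c in seq)

def masked_fraction(seq):
    """Fraction of a soft-masked sequence that is lowercase."""
    if not seq:
        return 0.0
    return sum(c.islower() for c in seq) / len(seq)

soft = "ACGTacgtAC"  # toy soft-masked sequence
print(soft_to_hard(soft))
print(masked_fraction(soft))
```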
Structural annotation
Identification of genomic elements
Open reading frames and their localization
Coding regions
Location of regulatory motifs
Start/Stop
Splice sites
Non-coding regions/RNAs
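To make "open reading frame and localization" concrete, here is a minimal sketch that reports ATG-to-stop stretches in the three forward frames. It is an assumption-laden toy: `find_orfs` and `min_codons` are names chosen for this example, only the forward strand is scanned, and introns are ignored, so it is closer to the prokaryote case than to a eukaryotic gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based, half-open, and include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return orfs

toy = "CCATGAAATTTTAGGG"  # invented sequence
for start, end, frame in find_orfs(toy):
    print(start, end, frame, toy[start:end])
```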
Methods
Similarity
Similarity between sequences, which does not necessarily imply any evolutionary linkage
Ab initio prediction
Prediction of gene structure from first principles using only the genome sequence
Genefinding
[Figure: two approaches to gene finding: ab initio and similarity]
Gene-finding resources for homology-based methods
Transcript: cDNA sequences, EST sequences
Peptide: non-redundant (nr) protein database, protein sequence data, mass spectrometry data
Genome: other genomic sequence
ab initio prediction
[Figure: signals read along the genome in ab initio prediction: coding potential, ATG and stop codons, splice sites]
Genefinding - ab initio predictions
Use compositional features of the DNA sequence to define coding segments (essentially exons)
ORFs
Coding bias
Splice site consensus sequences
Start and stop codons
Methods
Training sets are required
Each feature is assigned a log likelihood score
Dynamic programming is used to find the highest-scoring path through the scored features
Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
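The scoring idea above (log-likelihood scores combined by dynamic programming) can be sketched with a two-state Viterbi decoder. Everything numeric here is an assumption made up for illustration: real gene finders such as Augustus use far richer states (exons, introns, splice sites) and estimate their parameters from training sets, whereas this toy simply treats GC-rich sequence as "coding".

```python
import math

STATES = ("coding", "noncoding")
# Toy parameters, invented for this example (not from any real gene finder).
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def viterbi(seq):
    """Return the highest-scoring state path for seq (A/C/G/T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # score[i][s]: best log-probability of any path ending in state s at i.
    score = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = [{}]
    for base in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[-1][p] + math.log(TRANS[p][s]))
            row[s] = score[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            ptr[s] = prev
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back[1:]):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

toy = "ATATATAT" + "GCGCGCGC" + "ATATATAT"
print("".join("C" if s == "coding" else "n" for s in viterbi(toy)))
```

Working in log space turns products of probabilities into sums, which is why each feature can be given an additive score.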
Genefinding - similarity
Use known coding sequence to define coding regions
EST sequences
Peptide sequences
Limitation: alignments are often fuzzy around splice sites, which makes exact exon boundaries hard to place
Examples: EST2Genome, exonerate, genewise
Gene-finding - comparative
Use two or more genomic sequences to predict genes based on conservation of exon sequences
Examples: Twinscan and SLAM
Genefinding - non-coding RNA genes
Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples
tRNAscan - uses an HMM and co-variance model for prediction of tRNA genes
Rfam - a suite of HMMs trained against a large number of different RNA genes
Gene-finding omissions
Alternative isoforms
Currently there is no good method for predicting alternative isoforms
Only created where supporting transcript evidence is present
Pseudogenes
Each genome project has a fuzzy definition of pseudogenes
Badly curated/described across the board
Promoters
Rarely a priority for a genome project
Some algorithms exist but usually not integrated into an annotation set
Practical- structural annotation
Eukaryotes- AUGUSTUS (gene model)
~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial --singlestrand=true --alternatives-from-evidence=true --alternatives-from-sampling=true --progress=true --gff3=on --uniqueGeneId=true --species=magnaporthe_grisea our_genome.fasta >structural_annotation.gff
Prokaryotes – PRODIGAL (Codon Usage table)
~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 -d nucleotide_exon_seq.fa -f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt
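Both commands above write GFF output, so a quick sanity check is to count how many features of each type were predicted. The sketch below assumes the standard nine-column, tab-separated GFF3 layout; the function name `count_gff_features` and the example records are invented for illustration and are not real AUGUSTUS output.

```python
from collections import Counter

def count_gff_features(lines):
    """Count GFF records by feature type (column 3)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF record
        counts[fields[2]] += 1
    return counts

# Invented records in GFF3 layout, for illustration only.
example = [
    "##gff-version 3",
    "contig1\tAUGUSTUS\tgene\t100\t900\t0.95\t+\t.\tID=g1",
    "contig1\tAUGUSTUS\tmRNA\t100\t900\t0.95\t+\t.\tID=g1.t1;Parent=g1",
    "contig1\tAUGUSTUS\tCDS\t100\t400\t0.95\t+\t0\tID=g1.t1.cds;Parent=g1.t1",
    "contig1\tAUGUSTUS\tCDS\t500\t900\t0.95\t+\t2\tID=g1.t1.cds;Parent=g1.t1",
]
print(count_gff_features(example))

# With a real file, the open file handle can be passed directly:
# with open("structural_annotation.gff") as handle:
#     print(count_gff_features(handle))
```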
Structural annotation
Structural annotation was conducted using AUGUSTUS (version 2.5.5) with Magnaporthe_grisea as the species model
However, a dedicated species model has to be developed to obtain accurate results
Functional annotation
Attaching biological information to genomic elements
Biochemical function
Biological function
Involved regulation and interactions
Expression
Apply known structural information to the predicted protein sequence
Genome annotation
[Figure: the genome-to-function diagram from the eukaryote genome annotation slide (genome, transcription, RNA processing, translation, protein folding, enzyme activity), here with only the final annotation step marked: find function]
Functional annotation – Homology Based
Predicted exons/CDS/ORFs are searched against non-redundant protein databases (NCBI, SwissProt) to find similar sequences
Visually assess the top 5-10 hits to see whether any of them has been assigned a function
Functions are then assigned on the basis of those hits
Functional annotation - Other features
Other features which can be determined
Signal peptides
Transmembrane domains
Low complexity regions
Various binding sites, glycosylation sites etc.
Protein domains
See http://expasy.org/tools/ for a good list of possible prediction algorithms
Functional annotation - Other features (Ontologies)
Use of ontologies to annotate gene products
Gene Ontology (GO): cellular component, molecular function, biological process
Practical - FUNCTIONAL ANNOTATION
Homology-based method
Set up a BLAST database (nucleotide/protein)
BLAST genome.fasta against it for annotations (nucleotide/protein)
Filter hits by an E-value cutoff (E-value <= 0.01) for nucleotide/protein
Further filter for the best BLAST hits (top 5-15) and assign functions
Removing positive-strand BLAST hits
Removing negative-strand BLAST hits
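The filtering steps above can be sketched in a few lines, assuming the hits were saved in BLAST's 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The function name `best_hits`, the 0.01 cutoff as a default, and the example rows are assumptions for illustration; the identifiers are invented.

```python
import csv
import io

def best_hits(tabular_text, max_evalue=0.01):
    """Keep, per query, the highest bit-score hit with E-value <= max_evalue."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # lower E-values are more significant
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best

# Invented hits: g1 has two significant hits, g2 has only a weak one.
example = (
    "g1\tprotA\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t380\n"
    "g1\tprotB\t60.0\t180\t70\t2\t5\t185\t3\t182\t1e-20\t120\n"
    "g2\tprotC\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25\n"
)
for query, (subject, evalue, bitscore) in best_hits(example).items():
    print(query, subject, evalue, bitscore)
```

A query with no surviving hit, like g2 here, would typically stay annotated as a hypothetical protein until better evidence appears.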
Functional annotation- output
August 2008 Bioinformatics tools for Comparative Genomics of Vectors
Conclusion
Annotation accuracy is only as good as the supporting data available at the time of annotation, so annotations need to be updated
Gene predictions will change over time as new data become available (ESTs, related genomes) that are more similar to the genome than earlier data
Functional assignments will change over time as new data become available (characterization of hypothetical proteins)
Thank You