Importing Community annotations into VectorBase
Aims
• Provide the VectorBase community with tools for improving genome annotation.
• Must have low entry requirements, be scalable and (relatively) simple to use
Genome annotation
• First-pass genome annotation is almost always based on “automatic” computational approaches
  • ab initio
  • Similarity based
    • Transcript (ESTs, RNAseq)
    • Protein (nr protein database)
Genome annotation - building a pipeline
[Figure: pipeline flowchart. Box labels: Genome assembly; Map Repeats; Genefinding; Protein-coding genes; Map Transcripts; Map Peptides; nc-RNAs; Functional annotation; Submission to archival databases (Release)]
Current VectorBase annotation pipeline
• MAKER based automatic annotation
• includes SNAP training and ab initio
• RNAseq based transcript similarity prediction
• Taxonomically constrained peptide similarity prediction
• Two rounds of prediction refinement; the final round includes all peptide similarity
• Community annotation phase
• Capture gene structure changes
• Metadata associated with locus (symbol, description, citation)
• Submission to INSDC, propagation to UniProt
• Presentation through VectorBase
[Figure labels: Start; 1.0 set (automatic); 1.1 set (published)]
Processing submissions
• 4 phases
• Capture
• Moderation
• Storage
• Integration
Capture: Community annotation decision tree
Tool of choice: WebApollo
• Web-based
• Eliminates the main drawback of the deprecated CAP system: GFF3 format validation
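To illustrate why GFF3 format validation is a burden for file-based submissions, here is a minimal sketch of the kind of per-line checks involved. It is not the CAP or WebApollo code; the function name `check_gff3_line` and the example values are hypothetical, and only a few rules of the GFF3 specification are covered.

```python
import re

VALID_STRANDS = {"+", "-", ".", "?"}
VALID_PHASES = {"0", "1", "2", "."}


def check_gff3_line(line):
    """Return a list of problems found in one GFF3 feature line (empty if none)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return [f"expected 9 tab-separated columns, found {len(columns)}"]
    seqid, source, ftype, start, end, score, strand, phase, attributes = columns
    problems = []
    # Coordinates are 1-based integers with start <= end
    if not (re.fullmatch(r"[0-9]+", start) and re.fullmatch(r"[0-9]+", end)):
        problems.append("start and end must be positive integers")
    elif int(start) < 1 or int(start) > int(end):
        problems.append("start must be >= 1 and <= end")
    if strand not in VALID_STRANDS:
        problems.append(f"invalid strand {strand!r}")
    if phase not in VALID_PHASES:
        problems.append(f"invalid phase {phase!r}")
    if ftype == "CDS" and phase == ".":
        problems.append("CDS features require a phase of 0, 1 or 2")
    # Column 9 holds semicolon-separated tag=value pairs
    if attributes != ".":
        for pair in attributes.split(";"):
            if pair and "=" not in pair:
                problems.append(f"attribute {pair!r} is not in tag=value form")
    return problems


# Hypothetical example line; an empty list means no problems were found
print(check_gff3_line("2L\tVectorBase\tgene\t100\t900\t.\t+\t.\tID=gene1"))
```

A web-based editor avoids this step because users never write the file by hand.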
WebApollo example
Community annotation decision tree
Tool of choice: Web forms
Moderation & Storage
• Gene metadata captured through forms to spreadsheets
• Batch submissions use similar spreadsheet format
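The moderation step for spreadsheet submissions could be sketched as below. This is an illustration only: the column names (`gene_id`, `symbol`, `description`, `citation`), the tab-separated layout and the acceptance rules are assumptions, not the actual VectorBase spreadsheet format.

```python
import csv
import io

# Hypothetical header names, loosely based on the metadata listed earlier
REQUIRED_COLUMNS = ("gene_id", "symbol", "description", "citation")


def read_metadata(handle):
    """Split rows of a tab-separated metadata sheet into accepted and rejected."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    accepted, rejected = [], []
    for row in reader:
        line_no = reader.line_num  # line of the sheet this row came from
        row = {key: (row.get(key) or "").strip() for key in REQUIRED_COLUMNS}
        if not row["gene_id"]:
            rejected.append((line_no, "no gene identifier"))
        elif not (row["symbol"] or row["description"]):
            rejected.append((line_no, "neither symbol nor description supplied"))
        else:
            accepted.append(row)
    return accepted, rejected


# Made-up example sheet: one usable row, one row without a gene identifier
sheet = io.StringIO(
    "gene_id\tsymbol\tdescription\tcitation\n"
    "GENE0001\tabc1\texample protein\t\n"
    "\tabc2\t\t\n"
)
accepted, rejected = read_metadata(sheet)
print(len(accepted), rejected)
```

Returning rejections with their line numbers lets a moderator send specific feedback to the submitter instead of refusing the whole batch.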
Integration: Dataflow for ‘patch’ build
[Figure: dataflow diagram. Labels: CAP GFF3; WebApollo; Google Form; Google Fusion Table; Reference core; Reference core CAP; Updated geneset; TXT; Patch; Users; Stable IDs; Reports; Updated core; IDs; Xrefs; Metadata; Release core; Release; Commit]
Presentation of community annotation
Usage (as of 2015-03-30)
• 31 WebApollo instances (Organisms)
• 3,407 gene models
• Gene metadata (protein-coding loci)
• 4,987 gene symbols
• 512 gene synonyms
• 57,878 gene descriptions
• 910 loci citations from 208 publications
Supplementing annotations
• Community jamborees
• ‘Standard’ improvement (e.g. Sandfly, snail communities)
• Glossina community (e.g. March 2015, Kenya)
• VectorBase
• Default Xref run includes symbol/description assignment via UniProt
• Projection of gene descriptions via orthology from key marker species (e.g. An. gambiae). Due to be deployed for the June (VB-2015-06) release.
• Supplemental data from genome papers (e.g. 16 Anopheles spp., Musca)
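Projection of descriptions via orthology can be sketched as follows. This is a simplified illustration, not the VectorBase implementation: restricting projection to one-to-one orthologs and never overwriting an existing description are assumptions, and all identifiers are made up.

```python
from collections import Counter


def project_descriptions(orthologs, source_descriptions, target_descriptions):
    """Copy descriptions from a reference species to undescribed one-to-one orthologs."""
    source_counts = Counter(src for src, _ in orthologs)
    target_counts = Counter(tgt for _, tgt in orthologs)
    projected = {}
    for src, tgt in orthologs:
        if source_counts[src] != 1 or target_counts[tgt] != 1:
            continue  # skip one-to-many and many-to-many relationships
        if target_descriptions.get(tgt):
            continue  # keep any description the target gene already has
        description = source_descriptions.get(src)
        if description:
            projected[tgt] = description
    return projected


# Hypothetical ortholog pairs: (reference gene, target gene)
pairs = [("refA", "tgt1"), ("refB", "tgt2"), ("refB", "tgt3"), ("refC", "tgt4")]
reference = {"refA": "odorant receptor", "refB": "kinase", "refC": "transporter"}
existing = {"tgt4": "existing description"}
print(project_descriptions(pairs, reference, existing))
```

Only `tgt1` receives a description here: `refB` has two orthologs, and `tgt4` is already described.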
Deprecated CAP system example