VectorBase BRC4 2006 1
The evolving VectorBase gene build: mixing automated and manual
approaches when annotating vector genomes
Daniel Lawson
VectorBase-EBI, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton UK
VectorBase BRC4 2006 2
Points to cover
VectorBase species
Generic GeneBuild (new genomes)
VectorBase GeneBuild (new developments)
Influence of manual annotation
Progress in manual annotation
Partial GeneBuilds
VectorBase BRC4 2006 3
Annotated: Aedes aegypti, Anopheles gambiae PEST
Sequencing: Ixodes scapularis
Assembly: Culex pipiens quinquefasciatus
Sequencing: Anopheles gambiae M & S form, Pediculus humanus
Initiated: Glossina morsitans morsitans, Lutzomyia longipalpis, Phlebotomus papatasi, Rhodnius prolixus
VectorBase BRC4 2006 4
Annotation of new genomes
Assembled genome
VectorBase gene predictions
Sequencing centre gene predictions
Merge into canonical set
Protein analysis
Display on genome browser
Release to GenBank/EMBL/DDBJ
VectorBase BRC4 2006 5
VectorBase gene prediction pipeline
Blessed predictions
Community submissions
Manual annotations
Species-specific predictions
Similarity predictions
Transcript based predictions
Ab initio gene predictions
Canonical predictions
(Genewise) (Genewise)
(SNAP) (Exonerate)
(Apollo) (Genewise, Exonerate, Apollo)
Protein family HMMs (Genewise)
ncRNA predictions (Rfam)
VectorBase BRC4 2006 6
VectorBase curation database pipeline for manual/community annotation
Curation warehouse db
Manual annotation (Harvard)
Apollo
Apollo
Community annotation (Community representatives)
Chado-XML
Chado-XML
Chado
Ensembl
GFF3
Gene build db
Community annotation (in collaboration with Harvard)
VectorBase BRC4 2006 7
Manual annotation progress
                     Protein-coding   VectorBase        Community
                     gene No.         manual            submission
Anopheles gambiae
  AgamP3.3           13,277            261 ( 2.0 %)     667  ( 5.0 %)
  current                             2474 (18.6 %)     667* ( 5.0 %)
Aedes aegypti
  AaegL1.1           15,419              0 ( 0.0 %)       0  ( 0.0 %)
  current                                0 ( 0.0 %)     341  ( 2.2 %)
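The percentages in the progress table appear to be annotation counts divided by the protein-coding gene number of the named release (the "current" rows list no gene number of their own, so the release figure is assumed here). A minimal Python sketch that recomputes them; the function name `percent_of_gene_set` is hypothetical and not part of any VectorBase tool:

```python
def percent_of_gene_set(annotated: int, total_genes: int) -> float:
    """Return annotated genes as a percentage of the gene set, to one decimal."""
    if total_genes <= 0:
        raise ValueError("total_genes must be positive")
    return round(100.0 * annotated / total_genes, 1)


# Gene numbers and counts copied from the progress table.
AGAMP3_3_GENES = 13277   # Anopheles gambiae, AgamP3.3
AAEGL1_1_GENES = 15419   # Aedes aegypti, AaegL1.1

progress = {
    "A. gambiae VectorBase manual (AgamP3.3)": percent_of_gene_set(261, AGAMP3_3_GENES),
    "A. gambiae VectorBase manual (current)": percent_of_gene_set(2474, AGAMP3_3_GENES),
    "A. gambiae community submission": percent_of_gene_set(667, AGAMP3_3_GENES),
    "Ae. aegypti community submission (current)": percent_of_gene_set(341, AAEGL1_1_GENES),
}

for label, pct in progress.items():
    print(f"{label}: {pct} %")
```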
VectorBase BRC4 2006 8
Manual annotation visualisation
VectorBase BRC4 2006 9
Overview of proposed re-annotation system
Blessed genes
Current gene set
Compare
Species-specific gene prediction
New gene build
Merge
Updated gene set
Full gene build
Partial Gene build
VectorBase BRC4 2006 10
Comparing new gene builds with the old one
• Use of manual annotation for validation of automated gene build improvements
• Simple statistics (CDS length, intron size, CDS matching TEs)
• BRC annotation metrics
  – Supporting evidence for a gene prediction (citation, expression, orthology)
  – Attachment of Standard Operating Procedures (SOPs)
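The simple statistics named on this slide can be sketched in Python as below. This is an illustration only, not the VectorBase implementation: it assumes each transcript is given as a list of CDS exon intervals in 1-based inclusive coordinates (the GFF3 convention), and the helper names `cds_length`, `intron_sizes` and `cds_overlaps_te` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive as in GFF3


def cds_length(cds_exons: List[Interval]) -> int:
    """Total coding length: sum of the lengths of the CDS exons."""
    return sum(end - start + 1 for start, end in cds_exons)


def intron_sizes(exons: List[Interval]) -> List[int]:
    """Sizes of the gaps between consecutive exons, in genomic order."""
    ordered = sorted(exons)
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def cds_overlaps_te(cds_exons: List[Interval], te_intervals: List[Interval]) -> bool:
    """True if any CDS exon overlaps any transposable element interval."""
    return any(
        start <= te_end and te_start <= end
        for start, end in cds_exons
        for te_start, te_end in te_intervals
    )


# Made-up coordinates for illustration.
exons = [(100, 199), (300, 449), (600, 650)]
print(cds_length(exons))                       # 301
print(intron_sizes(exons))                     # [100, 150]
print(cds_overlaps_te(exons, [(440, 500)]))    # True
```

Computed per gene for both the old and the new build, distributions of these values give a quick check on whether a new automated build has drifted away from the manually annotated structures.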
VectorBase BRC4 2006 11
VectorBase gene prediction pipeline (SOP)
Blessed predictions
Community submissions
Manual annotations
Species-specific predictions
Similarity predictions
Transcript based predictions
Ab initio gene predictions
Canonical Gene set
VB:SOP0001 VB:SOP0002 & SOP0003
VB:SOP0005 VB:SOP0004
Protein family HMMs VB:SOP0009
ncRNA predictions VB:SOP0008
VB:SOP0007
VectorBase BRC4 2006 12
Gene build schedules
Full gene build: 4 months
Partial gene build: 1 month
Triggers for re-annotation
• Temporal
• Data
  – New data for species
  – New genomes
  – Re-annotated genomes
VectorBase BRC4 2006 13
3 wise annotators
VectorBase BRC4 2006 14
VectorBase BRC4 2006 15
Merging gene sets
Reduce to single predictions per locus
Compare exon/intron structures
Gene set #1
Gene set #2
Identical structures
Compatible structures
Different structures
Merge/Split structures
Complex
No Map
Add isoform predictions based on EST/Peptide data
Canonical gene set