DAS/2: Next Generation Distributed Annotation System
Gregg Helt (1), Steve Chervitz (1), Tony Cox (2), Andrew Dalke (3), Allen Day (4), Ed Erwin (1), Ed Griffiths (2), and Lincoln Stein (4)
(1) Affymetrix, Inc.; (2) Sanger Institute; (3) Dalke Scientific; (4) Cold Spring Harbor Laboratory; (5) University of Alabama
Distributed Annotation System (DAS) Overview
A specification designed for sharing genome annotations
Defines client requests and server responses
Simplified Web Services approach: HTTP GET, URLs, XML
Intended to be simple to implement
No central annotation authority
Intended to support client-side integration of annotations from different servers
First draft specification Spring 2000
Last major change to DAS1 was Spring 2002
Grant from NIH awarded June 2004 for development of next-generation DAS/2
DAS: Multiple Servers, Multiple Clients
[Diagram: one Reference Server holding the sequences AC003027, AC005122 and M10154, and three Annotation Servers supplying annotations on those sequences; labels shown include WI1029, AFM820, AFM1126 and WI443.]
Widespread Adoption of DAS/1
Server implementations
– Dazzle, ProServer, LDAS
Server sites
– Ensembl, UCSC, TIGR, KEGG, WormBase, Affymetrix, etc.
Clients
– GBrowse, Ensembl, Dasty, IGB
Libraries
– BioPerl, BioJava, JDAS
DAS extensions
– GeneDAS (non-positional annotations)
– DAS web services registry
– SPICE (protein structures)
– DALEC (asynchronous analysis)
Ensembl is an ensemble of DAS servers
GBrowse on Ensembl
Distributed GBrowse
[Diagram labels: MyGBrowse, GBrowse 1, GBrowse 2, MODs, Ensembl and UCSC, connected by DAS links.]
DAS Limitations
No ontology (controlled vocabulary) of feature types.
– Is a “gene” from DAS server 1 the same as a “gene” from DAS server 2?
Not particularly extensible.
Ambiguous semantics for retrieving features that overlap a range on the genome.
Development of DAS/2 Specification
Enhancements have largely been motivated by initial discussions on the DAS mailing list.
– Series of RFCs collected
– Though informal, still a long process!
Most recent DAS/2 draft specification is available at http://biodas.org/documents/das2/das2_protocol.html (tied to CVS repository), so anyone can review and comment
Feedback from the DAS developer and user communities will continue to guide future iterations of the DAS/2 specification
Preserving DAS1 Strengths in DAS/2
Specification is independent of implementation
– Many server implementations
– Many client implementations
Simple, simple, simple
– HTTP for transport
– URLs for queries
– XML for responses
– REST-like style
Ontologies are integral
Focus on location-based annotations of biological sequences
Basic DAS/2 Queries
Sources query: what genomes and versions of those genomes are available?
– http://server/das/genome
Regions query: what annotated sequences are available for a given version of a genome?
– http://server/das/genome/[genome]/[version]/region
Types query: what annotation types are available for a given genome version?
– http://server/das/genome/[genome]/[version]/type
Range query: return all annotations of a given type that overlap a genomic region
– http://server/das/genome/[genome]/[version]/feature?overlaps=[seq/min:max];type=[type]
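The four query templates above can be assembled mechanically. The following is a minimal sketch that only fills in the bracketed placeholders from the slide; the function names and the example genome, version, sequence and type values are illustrative assumptions, and no escaping of special characters is attempted.

```python
# Sketch: fill in the URL templates shown on the slide.
# Function names and example values are hypothetical, not part of the DAS/2 spec.

def sources_url(server: str) -> str:
    """Sources query: which genomes and versions are available?"""
    return f"{server}/das/genome"

def regions_url(server: str, genome: str, version: str) -> str:
    """Regions query: which annotated sequences exist for a genome version?"""
    return f"{server}/das/genome/{genome}/{version}/region"

def types_url(server: str, genome: str, version: str) -> str:
    """Types query: which annotation types exist for a genome version?"""
    return f"{server}/das/genome/{genome}/{version}/type"

def range_url(server: str, genome: str, version: str,
              seq: str, range_min: int, range_max: int, type_name: str) -> str:
    """Range query: annotations of one type overlapping seq/min:max."""
    base = f"{server}/das/genome/{genome}/{version}/feature"
    return f"{base}?overlaps={seq}/{range_min}:{range_max};type={type_name}"

if __name__ == "__main__":
    # "human", "17", "chr1" and "gene" are made-up example values.
    print(range_url("http://server", "human", "17", "chr1", 1000, 2000, "gene"))
```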
DAS/2 Enhancements: Ontologies
All features are required to be described by an ontology
– What is the feature? Gene, mRNA, transposable_element…
– What are attributes of the feature? Polycistronic_mRNA, programmed_frameshift…
Sequence ontology (SO) is the default (song.sourceforge.net)
– Can be changed & extended
– ~500 terms in all
– Standard OBO format
Feature hierarchy allows features to be contained within others: e.g. gene->mRNA->CDS
DAS/2 Enhancements: Performance
One of the biggest complaints about DAS1
– Very verbose annotation XML
DAS/2 Solution #1: Refactoring annotation XML
– Much smaller minimum footprint
DAS/2 Solution #2: Alternative return formats
– All servers can return defined das2xml annotation format
– Servers can also specify additional return formats per annotation type
– Clients can choose from alternative formats if they desire
– Not restricted to XML, or even text
– Examples: GFF3, BED, PSL, GAME
– Extreme performance improvements possible
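The format choice described above can be pictured as a simple preference lookup on the client side. This is a minimal sketch under stated assumptions: the function name, the lowercase format identifiers and the way a server advertises its formats are all hypothetical; the only thing taken from the slide is that das2xml is always available as a fallback.

```python
# Sketch: pick a return format for one annotation type.
# The identifiers ("bed", "psl", ...) and this function are illustrative only.

def choose_format(server_formats, preferred, default="das2xml"):
    """Return the first client-preferred format the server offers for a type.

    Falls back to das2xml, which every server is required to support.
    """
    for fmt in preferred:
        if fmt in server_formats:
            return fmt
    return default

if __name__ == "__main__":
    offered = {"das2xml", "bed", "gff3"}          # assumed to come from a types query
    print(choose_format(offered, ["psl", "bed"]))
```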
DAS/2 Enhancements: Resolving Ambiguities
Example: Ambiguous Range Queries
[Figure: a query range x:y and the differing responses returned by Server 1, Server 2, Server 3 and Server 4.]
Overlap or containment?
Parent based or separate?
DAS/2 Solution #1 – remove spec ambiguity
Specify that if parent meets region filter, also return all children
Specify whether overlap, containment, etc.
Add different region filters for different possibilities
– Overlaps
– Contains
– Within
– Identical
Allow boolean combinations of these and other filters in the query URL
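The region filters and the parent/child rule above can be illustrated with a small in-memory model. This is a sketch, not the specification's definition: it assumes half-open integer coordinates, assumes "contains" means the feature contains the query range and "within" means the feature lies inside it, and applies the filter to top-level features only before returning their children. All names are hypothetical.

```python
# Sketch of the four region filters plus "return all children of a matching parent".
# Coordinate convention and filter semantics are assumptions, see lead-in.
from dataclasses import dataclass, field

@dataclass
class Feature:
    fid: str
    start: int
    end: int
    children: list = field(default_factory=list)

REGION_FILTERS = {
    "overlaps":  lambda f, lo, hi: f.start < hi and f.end > lo,
    "contains":  lambda f, lo, hi: f.start <= lo and f.end >= hi,
    "within":    lambda f, lo, hi: lo <= f.start and f.end <= hi,
    "identical": lambda f, lo, hi: f.start == lo and f.end == hi,
}

def select(top_level, filter_name, lo, hi):
    """Return ids of matching top-level features followed by all their descendants."""
    test = REGION_FILTERS[filter_name]
    result = []

    def add_with_children(feature):
        result.append(feature.fid)
        for child in feature.children:
            add_with_children(child)

    for feature in top_level:
        if test(feature, lo, hi):
            add_with_children(feature)
    return result

# Example hierarchy gene -> mRNA -> CDS, plus an unrelated second gene.
cds = Feature("CDS1", 150, 450)
mrna = Feature("mRNA1", 100, 500, [cds])
gene1 = Feature("gene1", 100, 500, [mrna])
gene2 = Feature("gene2", 600, 700)
TOP_LEVEL = [gene1, gene2]
```

With this model, an "overlaps" query on 400:650 returns gene1 together with its mRNA and CDS even though the CDS alone might not have been requested, which is the unambiguous behaviour the slide asks servers to agree on.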
DAS/2 filter spec allows client query optimization
[Diagram: a new query QueryC over the range x:y, lying between the earlier queries QueryL and QueryR; L and R mark the bounds of those earlier queries.]
Keep track of overlap bounds of all previous queries
Instead of filter = “overlaps:S/x:y”, use filter = “overlaps:S/x:y; within:S/L:R”
If annotation A not contained within L:R, then either:
i) bounds crosses L, in which case must overlap QueryL
ii) bounds crosses R, in which case must overlap QueryR
iii) both
Therefore if client has used this approach for all previous queries (and restricts other filtering to single “type” filter), then for QueryC no annotations will be returned that were already returned in a previous query
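The argument above can be checked with a toy simulation. The sketch below assumes half-open integer coordinates, a client that only ever requests ranges it has not loaded before, and an in-memory stand-in for the server; the class and function names are hypothetical. L is taken as the end of the nearest earlier query to the left (or 0) and R as the start of the nearest earlier query to the right (or the sequence length).

```python
# Sketch: combining "overlaps" with "within" so that no annotation is fetched twice.
from typing import NamedTuple

class Feature(NamedTuple):
    fid: str
    start: int
    end: int

def server_query(features, overlaps, within):
    """Stand-in for a server applying both filters (half-open coordinates)."""
    (x, y), (lo, hi) = overlaps, within
    return [f for f in features
            if f.start < y and f.end > x          # overlaps x:y
            and lo <= f.start and f.end <= hi]    # within L:R

class Client:
    def __init__(self, seq_length):
        self.seq_length = seq_length
        self.loaded = []  # (x, y) ranges already requested, assumed non-overlapping

    def bounds_for(self, x, y):
        """Bounds of the not-yet-loaded gap that surrounds x:y."""
        left = max((hi for lo, hi in self.loaded if hi <= x), default=0)
        right = min((lo for lo, hi in self.loaded if lo >= y), default=self.seq_length)
        return left, right

    def filter_string(self, seq, x, y):
        """Filter text in the form used on the slide."""
        left, right = self.bounds_for(x, y)
        return f"overlaps:{seq}/{x}:{y}; within:{seq}/{left}:{right}"

    def fetch(self, features, x, y):
        result = server_query(features, (x, y), self.bounds_for(x, y))
        self.loaded.append((x, y))
        return result

FEATURES = [
    Feature("a", 10, 20), Feature("b", 15, 45), Feature("c", 40, 60),
    Feature("d", 55, 95), Feature("e", 90, 100), Feature("f", 0, 100),
]
```

Running QueryL (0:30), QueryR (70:100) and then QueryC (30:70) against FEATURES returns each of the six annotations exactly once: the annotations that cross L or R were already delivered by the earlier queries and are excluded from QueryC by the "within" clause.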
Solution #2: DAS/2 Validation Suite
Verify whether a DAS/2 server is compliant with the specification.
– Critical for improving interoperability between clients and servers developed by different groups.
Standalone tool and web application, written in Python
– Enter a URL for a DAS/2 server
– Get an HTML report about DAS/2 compliance
Reference dataset
– Sequences and annotations that can be loaded into a DAS/2 server for additional validation of server implementation/configuration
Source code available at: http://sourceforge.net/projects/dasypus/
More DAS/2 Spec Enhancements
“Writeback” spec to allow DAS/2 clients to create and edit annotations on DAS/2 servers
– Still undergoing development
IDs are URIs
– Could be LSIDs or URLs
– Allows for integration with many other web technologies
– xml:base
Feature hierarchies
And more…
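The "IDs are URIs" and xml:base items above imply that a client may see relative IDs that need resolving against a base URI. A minimal sketch using Python's standard urljoin follows; the helper name, the base URL and the example IDs are illustrative assumptions, not values from the DAS/2 specification.

```python
# Sketch: resolve a feature ID against an xml:base value.
from urllib.parse import urljoin

def resolve_id(xml_base: str, feature_id: str) -> str:
    """Return the absolute URI for an ID that may be relative to xml_base."""
    return urljoin(xml_base, feature_id)

if __name__ == "__main__":
    base = "http://server/das/genome/human/17/"   # hypothetical xml:base
    print(resolve_id(base, "feature/exon1"))       # relative ID
    print(resolve_id(base, "http://other/das/f/1"))  # already absolute, left unchanged
```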
DAS/2 UML Modeling
DAS/2 Reference Server
Implemented as an Apache/mod_perl 2.0 content handler
– Annotations are converted to BioPerl objects and subsequently text-transformed using Template Toolkit.
Datasources are accessible using an adaptor pattern
– Current adaptor is for Chado (GMOD schema)
– Soon any datasource accessible to the Generic Genome Browser (GBrowse) will be accessible from the DAS/2 server.
  Flatfile formats: GenBank, GFF
  Databases: Ensembl, GMOD/Chado, Bio::DB::GFF
  DAS1 web service
Source code released under Artistic License
– Available via anonymous CVS as part of GMOD
– See http://www.gmod.org for access details.
DAS/2 Reference Client
Implemented in Java in the Integrated Genome Browser
– IGB (“ig-bee”): a visualization app developed at Affymetrix
– Supports data loading via a variety of formats and mechanisms
– Full implementation of DAS/2 read client, partial implementation of DAS/2 writeback.
Handles large amounts of genome-scale data
– Loads hundreds of thousands of sequence annotations at once
– Loads dense quantitative graphs with millions of data points
– Maintains real-time responsiveness to user interactions
– Includes features to support exploratory data analysis
– Plugin architecture for customized extensions
Source code released under Common Public License
– http://genoviz.sourceforge.net
Upcoming DAS/2 Developments
Writeback protocol
– Ready for implementation
Registry and discovery protocol
– Various alternatives have been discussed
– A “playpen server” available at EBI
DAS/2 & caBIG
Project 1: Add DAS/2 support to caCORE
– Will enable caCORE to read genome annotations from DAS/2 servers and re-export as caCORE objects.
– Uses a flexible plug-in architecture that will be generally useful.
Project 2: Export HapMap database as DAS/2
– Will make HapMap human variation data available to caBIG grid via caCORE.
Project 3: Export Vertebrate Promoter Database as DAS/2
– Will make curated information on vertebrate transcription factors and their binding sites available to caBIG grid via caCORE.
Acknowledgements
DAS & DAS2 mailing list participants!
Lincoln Stein (CSHL)
Ed Erwin, Steve Chervitz, Eric Blossom, Hari Tammara (Affymetrix)
Tony Cox, Ed Griffiths (Sanger Institute)
Allen Day, Brian O’Connor (UCLA)
Andrew Dalke (Dalke Consulting)
Suzanna Lewis (LBL)
Ann Loraine (U. of Alabama)