Exploring Williams-Beuren Syndrome
GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004
Professor Carole Goble
http://www.mygrid.org.uk
Acknowledgements
myGrid is an EPSRC funded UK eScience Program Pilot Project.
Particular thanks to the other members of the Taverna project, http://taverna.sf.net
Roadmap
• myGrid in a nutshell
• Gene characterisation in Williams-Beuren Syndrome
• Semantic aspects
– Information model
– Service discovery
– Data management: LSID
– Metadata management for provenance: RDF
• Lessons learnt and opportunities
In a nutshell
• Bioinformatics toolkit
• Open (Web) Services
– myGrid components
– External domain services
– No control or influence over service providers
• Open to third party metadata
• Open extensible architecture
– Assemble your own components
– Designed to work together
– Toolkit
– Axis/Apache based
– RDF and DAML+OIL/OWL
– Jena, OilEd, Instance Store & FaCT
[Figure: myGrid components. Labels: Freefluo WfEE, Taverna WfDE, View, UDDI registry, Event Notification, mIR, Pedro, Semantic Discovery, Info. Model, Soaplab, Gowlab, Gateway & CHEF Portal, LSID, Haystack Provenance Browser.]
Williams-Beuren Syndrome
• Microdeletion of ~1.5 Mbases at 7q11.23 on Chromosome 7 (~155 Mbases in total)
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
• Characterise an unknown gene
• Annotation pipelines and gene expression analysis
Services from USA, Japan, various sites in UK
Williams-Beuren Syndrome Microdeletion
[Figure: physical map of the Williams-Beuren Syndrome microdeletion. Chr 7 (~155 Mb), with the ~1.5 Mb region at 7q11.23.]
Genes labelled: GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/E1f4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14
Repeat copies labelled: A-cen, A-mid, A-tel, B-cen, B-mid, B-tel, C-cen, C-mid, C-tel
Block A: STAG3P, PMS2L
Block B: GTF2IP, NCF1P, GTF2IRD2P
Block C: FKBP6T, POM121, NOLR1
Patient deletions labelled: WBS, SVAS
Physical map: clones CTA-315H11 and CTB-51J22, with a gap
Filling a genomic gap
Two major steps:
• Extend into the gap: similarity searches; RepeatMasker, BLAST
• Characterise the new sequence: NIX, InterPro, etc.
Doing this by hand:
• Numerous web-based services (e.g. BLAST, RepeatMasker)
• Cutting and pasting between screens
• Large number of steps
• Frequently repeated, because info is now rapidly added to public databases
• Don’t always get results
• Time consuming
• Huge amount of interrelated data is produced, handled in lab book and files saved to local hard drive
• Mundane
• Much knowledge remains undocumented
• Bioinformatician does the analysis
Point, click, cut, paste
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
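Pulling fields out of a record like the one above by hand is the "cut, paste" step that scripting removes. The following is a minimal Python sketch, assuming the two-letter line codes seen in the record (ID, DE, OS, SQ) and a shortened copy of it. The function name `parse_flatfile` is hypothetical and is not part of myGrid or Taverna.

```python
from collections import defaultdict

# Shortened copy of the record on this slide (sequence cut to one line).
RECORD = """\
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
OS   BACILLUS SUBTILIS.
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP
"""

def parse_flatfile(text):
    """Group lines by their two-letter code; residue lines go under 'seq'."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        code, body = line[:2], line[5:].strip()
        if code.strip():
            fields[code].append(body)
        else:
            # Lines with a blank code follow SQ and carry the residues.
            fields["seq"].append(body.replace(" ", ""))
    return {k: "".join(v) if k == "seq" else " ".join(v) for k, v in fields.items()}

entry = parse_flatfile(RECORD)
print(entry["ID"].split()[0])  # entry name
print(entry["OS"])             # organism
print(entry["seq"])            # residues with the column spacing removed
```

Continuation lines that share a code (the three DE lines here) are joined with a space, which is enough for a sketch but ignores the finer rules of the real format.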
WBS Workflows:
[Figure: WBS workflow diagram. Labels recovered from the slide follow.]
Input: GenBank Accession No, GenBank Entry, Seqret, Nucleotide seq (Fasta); URL inc GB identifier
Nucleotide sequence services:
• GenScan: coding sequence, ORFs
• sixpack: 6 ORFs
• restrict: restriction enzyme map
• cpgreport: CpG island locations and %
• RepeatMasker: repetitive elements
• prettyseq: translation/sequence file, good for records and publications
• ncbiBlastWrapper: Blastn vs nr, est databases
• transeq: amino acid translation
Amino acid sequence services:
• epestfind: identifies PEST seq
• pscan: identifies FingerPRINTS
• pepstats: MW, length, charge, pI, etc.
• pepcoil: predicts coiled-coil regions
• Pepwindow? Octanol?: hydrophobic regions
• SignalP, TargetP, PSORTII: predicts cellular location
• InterPro, PFAM, Prosite, Smart: identifies functional and structural domains/motifs
• ncbiBlastWrapper: tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr
Also shown: query nucleotide sequence, RepeatMasker, ncbiBlastWrapper, sort for appropriate sequences only, RepeatMasker
Colour key: Pink: outputs/inputs of a service. Purple: tailor-made services. Green: EMBOSS Soaplab services. Yellow: Manchester Soaplab services. Grey: unknowns.
Collections of Tasks
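[Figure: collections of tasks and the roles that perform them.]
Task labels: Finding; Description; Service Discovery; Enactment; Building; Workflow; Provenance; Storage; Data Management; Querying; Domain Tasks
Roles: Service Providers, Bioinformaticians, Scientists, Annotation providers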
[Figure: myGrid components and the roles that interact with them.]
Components: Registry, mIR, Feta, Haystack Provenance Browser, FreeFluo WfEE, Taverna WfDE, Pedro annotation tool, Ontology Store, WSDL, Soaplab, others
Roles: Service Providers, Annotation providers, Bioinformaticians, Scientists
Interaction labels: interface description; annotation/description; query & retrieve; workflow execution; store data/knowledge; invoking; querying/sharing/federating/registering; data descriptions; vocabulary
WBS task
• Wrap services as web services
• Register them
• Build a workflow using the services
• Evolve the workflow
• Run it over and over again in case data has changed
• Record results & provenance
• Inspect and compare results & provenance
• Event notification, portal, 3rd party annotation…
User Results
Benchmark: two iterations of workflows (1 day run)
– Reduced gap by 267 693 bp at its centromeric end
– Correctly located all seven known genes in this region
– Identified 33 of the 36 known exons residing in this location
Manually: takes two days (+) including analysis
Now: takes 30 mins to produce results and half a day for analysis.
• Less boring. Less prone to mistakes.
• Once notification is installed, won’t even have to initiate it.
Where is the semantics
Information Model v2
• Resources and Identifiers
• People, teams and organizations
• Representing the e-science process
• Experimental methods for e-science
• Scientific data and the life-science identifier
– Types
– Identifier Types
– Values and Documents
• Provenance information
• Annotation and Argumentation
[Figure: UML class diagram of part of the model.]
Classes shown: Resources.Resource (+getId:URIString), Programme (+name:String), Study (+name:String, +description:String, +startTime:DateTime, +endTime:DateTime, +status:String), ExperimentDesign, ExperimentInstance, Operations.Operation, Investigation, PeopleAndTeams.Person, Agent, StudyRole (+roleName:String, +description:String), StudyParticipation, LabBookView (+name:String, +rule:String), SubjectObject, scmInvestigator
Association labels: uses, contains, selected studies, method, acts in, labBooks, has participants, participates in, has instances
XML messages between services conform to the IMv2
Semantic discovery
• The user does the choosing of services
• A common ontology is used to annotate and query any myGrid object including services.
• Ontology is built using DAML+OIL and reasoning
• Deployed as a static RDF graph
• Discover workflows and services described in the registry via Taverna.
• Look for all workflows that accept an input of semantic type nucleotide sequence.
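The last query on this slide can be sketched in a few lines. This is only an illustration, assuming a toy is-a hierarchy and a hand-written registry: the terms, the workflow names and the function `workflows_accepting` are all hypothetical, and the slides describe a DAML+OIL ontology with reasoning rather than this dictionary lookup.

```python
# Toy is-a hierarchy: child term -> parent term (hypothetical terms).
IS_A = {
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
    "dna_sequence": "nucleotide_sequence",
    "sequence": "data",
}

# Hypothetical registry: workflow name -> semantic type of its input.
REGISTRY = {
    "wbs_gap_extension": "nucleotide_sequence",
    "protein_characterisation": "protein_sequence",
    "generic_sequence_report": "sequence",
    "cpg_island_scan": "dna_sequence",
}

def ancestors(term):
    """Yield the term itself and every more general term above it."""
    while term is not None:
        yield term
        term = IS_A.get(term)

def workflows_accepting(semantic_type):
    """Workflows whose declared input type equals or generalises the given type."""
    accepted = set(ancestors(semantic_type))
    return sorted(name for name, in_type in REGISTRY.items() if in_type in accepted)

print(workflows_accepting("nucleotide_sequence"))
```

In this sketch a workflow that declares the more general input `sequence` also matches, while one that demands the more specific `dna_sequence` does not, since not every nucleotide sequence is DNA. Whether a real registry should match in that direction is a design choice the slides do not settle.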
Role of Ontologies
[Figure: ontologies at the centre, surrounded by the roles they play.]
• Composing and validating workflows and service compositions & negotiations
• Describing & linking provenance records
• Change & event notification topics
• Resource annotations
• Service & resource registration & discovery
• Schema mediation
• Controlling contents of metadata and data
• Knowledge-based guidance and recommendation
• Service matching and provisioning
• Help
Observations
Services
• Practically all the services are remote and third party
• Services are changeable and unreliable
• Redundant services are essential
• WSDL in the wild is poor
• Automated annotation
http://pedro.man.ac.uk
Can you guess what it is yet?
Model of services
[Figure: model of services.]
• service: name, description, author, organisation. Kind shown: WSDL service
• operation: name, description, input, output, task, method, resource, application. Kinds shown: workflow, bioMoby service, WSDL operation, Soaplab service
• parameter: name, description, semantic type, format, transport type, collection type, collection format
SHIM Services
[Figure: SHIM services sitting between the main bioinformatics applications and services.]
• Services that enable domain services to fit together
• Outnumber domain services
• Libraries
• Candidates for automatic selection, composition and substitution
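A shim is often just a few lines of format conversion between two services that almost fit. A minimal sketch, assuming one service emits FASTA and the next accepts only a bare sequence string; both function names are hypothetical stand-ins rather than real myGrid services:

```python
def fasta_to_raw(fasta):
    """Shim: drop FASTA header lines and join the remaining sequence lines."""
    lines = [ln.strip() for ln in fasta.splitlines()]
    return "".join(ln for ln in lines if ln and not ln.startswith(">"))

def gc_fraction(raw_sequence):
    """Stand-in for a domain service that only accepts a bare sequence."""
    seq = raw_sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

fasta = ">example\nAAGCTTTTCT\nGGCACTGTTT\n"
raw = fasta_to_raw(fasta)
print(raw, gc_fraction(raw))
```

Because such adapters carry no domain logic, they are easy to collect into libraries and to swap for one another, which is what makes them candidates for automatic selection.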
Results management
• Automated workflows produce lots of heterogeneous data
• These are just some of the results from one workflow run for Williams-Beuren Syndrome
Amplification
One input
Many outputs
Dealing with results
• FreeFluo is agnostic about the data flowing through it.
• Taverna includes a DataThing class, which can be tagged with terms from ontologies, free text descriptions and MIME types, and which may contain arbitrary collection structures.
• Using the metadata hints we can locate and launch pluggable view components.
• Hybrid typing scheme allows for a ‘best effort’ approach to data typing.
• Typing life science data completely is intractable with reasonable effort.
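The idea of tagging a value with hints and choosing a viewer from them can be sketched as follows. This is not Taverna's DataThing API: the class `TaggedData`, the `VIEWERS` table and `pick_viewer` are hypothetical names, used only to illustrate a "best effort" lookup that falls back gracefully when no hint matches.

```python
from dataclasses import dataclass, field

@dataclass
class TaggedData:
    """A value plus optional hints: MIME types, ontology terms, free text."""
    value: object
    mime_types: list = field(default_factory=list)
    semantic_terms: list = field(default_factory=list)
    description: str = ""

# Hypothetical viewer registry, keyed by the hint that selects each viewer.
VIEWERS = {
    "semantic:BLAST_Report": "blast report viewer",
    "mime:image/png": "image viewer",
    "mime:text/plain": "text viewer",
}

def pick_viewer(item):
    """Best effort: try ontology terms first, then MIME types, then a fallback."""
    hints = [f"semantic:{t}" for t in item.semantic_terms]
    hints += [f"mime:{m}" for m in item.mime_types]
    for hint in hints:
        if hint in VIEWERS:
            return VIEWERS[hint]
    return "raw viewer"

report = TaggedData("...", mime_types=["text/plain"], semantic_terms=["BLAST_Report"])
plot = TaggedData(b"...", mime_types=["image/png"])
unknown = TaggedData([["a", "b"], ["c"]])  # nested collection with no hints

print(pick_viewer(report), pick_viewer(plot), pick_viewer(unknown), sep=" | ")
```

Trying the most specific hint first and keeping a generic fallback means untyped or partly typed data can still be displayed, which is the point of a hybrid scheme.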
Intermediate Results
• Workflows change the way the bioinformatician works
• Before: analyse results as you go along
• After: all results in one go
• So linking intermediate results is important
Life Science IDs
• LSID provides a uniform naming scheme.
• LSID Resolver guarantees to resolve to the same data object.
• LSID Authority dishes them out.
• Also returns metadata of object.
• Used throughout myGrid as an object naming device.
• myGrid Repository acts as an LSID Authority
• LSID allows universal access to results for collaboration, as well as for review.
• RDF+LSID explains the context of results, and provides guidance for further investigations.
Pioneered by myGrid
I3C / IBM / EBI proposal for a Life Science Identifier
http://www.i3c.org/wgr/ta/resources/lsid/docs/
Process Provenance
Link v Data Representation
• Data management questions refer to relationships rather than internal content
– What are the origins of this data?
• Which service produced this data?
• Which data is this derived from?
• Who was this data produced for?
– What is this data telling me?
• Data analysis questions delegated to external services.
Representing links
• Identify each resource
– Life science identifier: URI with associated data and metadata retrieval protocols.
– Understanding that underlying data will not change
urn:lsid:taverna.sf.net:datathing:45fg6
urn:lsid:taverna.sf.net:datathing:23ty3
Representing links II
• Identify link type
– Again use URI
– Allows us to use RDF infrastructure
• Repositories
• Ontologies
[Figure: two resources joined by a typed link.]
urn:lsid:taverna.sf.net:datathing:45fg6
urn:lsid:taverna.sf.net:datathing:23ty3
Link type: http://www.mygrid.org.uk/ontology#derived_from
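Two identified resources plus a link type amount to one RDF statement. The sketch below splits an LSID into its parts, following the `urn:lsid:<authority>:<namespace>:<object>[:<revision>]` layout, and writes the link as an N-Triples line. The helper names are hypothetical, and the direction of the link is an assumption, since the slide shows the two resources and the link type but not which end is the source.

```python
from typing import NamedTuple, Optional

class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

def parse_lsid(urn):
    """Split urn:lsid:<authority>:<namespace>:<object>[:<revision>]."""
    parts = urn.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {urn!r}")
    return LSID(*parts[2:])

DERIVED_FROM = "http://www.mygrid.org.uk/ontology#derived_from"

# Assumed direction: the first item is derived from the second.
triple = (
    "urn:lsid:taverna.sf.net:datathing:45fg6",
    DERIVED_FROM,
    "urn:lsid:taverna.sf.net:datathing:23ty3",
)

def to_ntriples(subject, predicate, obj):
    """Serialise one link as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."

print(parse_lsid(triple[0]))
print(to_ntriples(*triple))
```

Because both the nodes and the link type are plain URIs, the statement can be loaded into any RDF store without further translation.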
[Figure: levels of provenance and the links between them.]
Levels: organisation level provenance; process level provenance; data/knowledge level provenance
Nodes: Person, Organisation, Project, Experiment design, Workflow design, Workflow run, Process, Service, Event, Data items
Link labels:
• instanceOf
• partOf
• run for
• componentProcess, e.g. web service invocation of BLAST @ NCBI
• componentEvent, e.g. completion of a web service invocation at 12.04pm
• runBy, e.g. BLAST @ NCBI
• data derivation, e.g. output data derived from input data
• knowledge statements, e.g. similar protein sequence to
User can add templates to each workflow process to determine links between data items.
GI        Accession    Score  Description
19747251  AC005089.3   831    Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
15145617  AC073846.6   815    Homo sapiens BAC clone RP11-622P13 from 7, complete sequence
15384807  AL365366.20  46.1   Human DNA sequence from clone RP11-553N16 on chromosome 1, complete sequence
7717376   AL163282.2   44.1   Homo sapiens chromosome 21 segment HS21C082
16304790  AL133523.5   44.1   Human chromosome 14 DNA sequence BAC R-775G15 of library RPCI-11 from chromosome 14 of Homo sapiens (Human), complete sequence
34367431  BX648272.1   44.1   Homo sapiens mRNA; cDNA DKFZp686G08119 (from clone DKFZp686G08119)
5629923   AC007298.17  44.1   Homo sapiens 12q22 BAC RPCI11-256L6 (Roswell Park Cancer Institute Human BAC Library) complete sequence
34533695  AK126986.1   44.1   Homo sapiens cDNA FLJ45040 fis, clone BRAWH3020486
20377057  AC069363.10  44.1   Homo sapiens chromosome 17, clone RP11-104J23, complete sequence
4191263   AL031674.1   44.1   Human DNA sequence from clone RP4-715N11 on chromosome 20q13.1-13.2 Contains two putative novel genes, ESTs, STSs and GSSs, complete sequence
17977487  AC093690.5   44.1   Homo sapiens BAC clone RP11-731I19 from 2, complete sequence
17048246  AC012568.7   44.1   Homo sapiens chromosome 15, clone RP11-342M21, complete sequence
14485328  AL355339.7   44.1   Human DNA sequence from clone RP11-461K13 on chromosome 10, complete sequence
5757554   AC007074.2   44.1   Homo sapiens PAC clone RP3-368G6 from X, complete sequence
4176355   AC005509.1   44.1   Homo sapiens chromosome 4 clone B200N5 map 4q25, complete sequence
2829108   AF042090.1   44.1   Homo sapiens chromosome 21q22.3 PAC 171F15, complete sequence

>gi|19747251|gb|AC005089.3| Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
AAGCTTTTCTGGCACTGTTTCCTTCTTCCTGATAACCAGAGAAGGAAAAGATCTCCATTTTACAGATGAGGAAACAGGCTCAGAGAGGTCAAGGCTCTGGCTCAAGGTCACACAGCCTGGGAACGGCAAAGCTGATATTCAAACCCAAGCATCTTGGCTCCAAAGCCCTGGTTTCTGTTCCCACTACTGTCAGTGACCTTGGCAAGCCCTGTCCTCCTCCGGGCTTCACTCTGCACACCTGTAACCTGGGGTTAAATGGGCTCACCTGGACTGTTGAGCG
[Figure: RDF graph around a BLAST report.]
urn:lsid:taverna:datathing:15, rdf:type BLAST_Report
urn:lsid:taverna:datathing:13, rdf:type nucleotide_sequence
Data-level links: similar_sequences_to, masked_sequence_of, filtered_version_of
Other nodes: service invocation, workflow invocation, workflow definition, experiment definition, project, person, group, service description, organisation
Other links: created_by, described_by, run_during, invocation_of, part_of, works_for, author, run_for
Regions A and B: relationship BLAST report has with other items in the repository; other classes of information related to BLAST report
Provenance tracking
• Automated generation of this web of links preferable
• Workflow enactor generates
– LSIDs
– Data derivation links
– Knowledge links
– Process links
– Organisation links
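How an enactor might emit such links as it runs can be sketched briefly. Everything in this example is hypothetical: the `ToyEnactor` class, the `example.org` authority used to mint identifiers, and the step names. It only illustrates minting an identifier per data item, recording a data derivation link and a process link as each step completes, and then answering "which data is this derived from?" by following the links back.

```python
import itertools

class ToyEnactor:
    """Runs steps one at a time, minting an ID per data item and logging links."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.data = {}   # id -> value
        self.links = []  # (subject, predicate, object)

    def _mint(self, value):
        lsid = f"urn:lsid:example.org:datathing:{next(self._counter)}"
        self.data[lsid] = value
        return lsid

    def add_input(self, value):
        return self._mint(value)

    def run_step(self, name, func, input_id):
        output_id = self._mint(func(self.data[input_id]))
        self.links.append((output_id, "derived_from", input_id))  # data derivation link
        self.links.append((output_id, "created_by", name))        # process link
        return output_id

    def origins(self, data_id):
        """Follow derived_from links back to the original input(s)."""
        parents = [o for s, p, o in self.links if s == data_id and p == "derived_from"]
        if not parents:
            return [data_id]
        return [root for parent in parents for root in self.origins(parent)]

enactor = ToyEnactor()
seq = enactor.add_input("aagcttttct")
upper = enactor.run_step("to_upper", str.upper, seq)
rev = enactor.run_step("reverse", lambda s: s[::-1], upper)
print(enactor.data[rev], enactor.origins(rev))
```

Since the links are recorded by the enactor rather than by the user, the provenance web grows with every run at no extra effort, which is why automated generation is preferable.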
Storage
• LSID has no protocol for storage
• Taverna/ Freefluo implements its own data/ metadata storage protocol
[Figure: Taverna/Freefluo sends data and metadata through a publish interface to the data store and the metadata store.]
Retrieval
• LSID protocol used to retrieve data and metadata
• Query handled separately
[Figure: retrieval architecture. Components shown: LSID aware client, LSID interface, RDF aware client, Query, Metadata Store, Data store, and Taverna/Freefluo publishing data and metadata through the publish interface.]
IBM’s BioHaystack
[Figure: screenshot showing a GenBank record, a portion of the web of provenance, and managing a collection of sequences for review.]
Observations
• Managed the transition from generic middleware development to practical day to day useful services
• Real users (plural) fundamental to that
• End to end support for an entire scenario
• Bury the semantics
• Show stoppers for practical adoption are not technical showstoppers
– Can I incorporate my favourite service?
– Can I manage the results?
• By tapping into (de facto) standards and communities we can leverage others’ results and tools: LSID, Haystack, Pedro.
myGrid People
Core
• Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pockock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users
• Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle, UK
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates
• Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
• Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
• Robin McEntire (GSK)
Collaborators
• Keith Decker
Summary
• myGrid offers service based middleware components
• Open source and freely downloadable
• Open Grid Service Architecture-compliant
• Allows the scientist to be at the centre of the Grid: Personalisation
• Generic middleware that suits the creation of bioinformatics applications
• Inclusion of rich semantics to facilitate the scientific process
Available from http://www.mygrid.org.uk