linked data for biopharma 14FEB2013
The Path to Linked Data in BioPharma
Tom Plasterer, PhD
integrated informatics Semantic Framework Lead (i2SF)
Integrated R&D Informatics and Knowledge Management
Abstract
As BioPharma adapts to incorporate nimble networks of suppliers, collaborators, and regulators, the ability to link data is critical for dynamic interoperability. Adoption of the linked data paradigm allows BioPharma to focus on its core business: delivering valuable therapeutics in a timely manner.
Blockbuster ‘Patent Cliff’ Gives Way to Personalized Approach
Drivers & Solutions
• Blockbuster Patent Cliff
• Growth of Generics
• Mergers & Acquisitions
• Personalized Medicine
o Pharmacogenetics
o Biomarkers
Sources:
- Evaluate Pharma World Preview 2018
- From: http://www.liv.ac.uk/pharmacogenetics/
- American Action Forum; Primer: The Pharmaceutical Industry (Han Zhong, Updated June 2012)
- IMAP Pharma & Biotech Industry Global Report 2011
R&D | RDI
Where do the new opportunities arise?
Inside & Outside
Build from within
• Nurture ‘best in class’ programs
• Kill early
• Repositioning
Mergers & Acquisitions
• Partner or Buy?
• Integrate cultures & technology
• Is the disruption worth it?
Pre-Competitive Consortiums
• How much can be shared and still be useful?
• Who is driving?
Finding ‘KOLs’
• Aggressive Regional Partnerships (Pfizer's Centers for Therapeutic Innovation)
• Co-locate near Academic Centers of Excellence (Novartis)
• Cherry pick (GSK, AZ, others)
Distributed Data in a Monolithic Environment
Managing Silos
Partitioned By Content
• Regulated Systems vs. Discovery
Partitioned By Geography & Organization
• US, EU, ASIAPAC
Data Formats
• RDB, Excel, Text, RSS, RDF?
Warehouses & Service Oriented Architecture
• Steps in the right direction?
Collaborative Environment
• eRooms, Sharepoint, Yammer, ‘Lync’ vs. Twitter, Google Docs, Skype
Standards?
• Vendor specific or open?
• Mixed Bag
Where are the ‘smarts’
• UI? Services?
• Metadata?
Requirements of The Informatics Landscape
Maximal Agility
Must span the entire drug development lifecycle
o and back (post-market surveillance to discovery)
Must support large and very heterogeneous data
o single nucleotide polymorphisms to countries
Will change as new science emerges & new regulations come into play
o Medline just under 1M articles/year
Must be able to work with multiple, international regulatory bodies
o Emerging markets
Partners, customers and collaborators will change
o and will have divergent technical aptitudes
Must be able to interoperate with precompetitive consortia
o Can they perform common tasks for the community?
Must be able to work with legacy data
o Lots of unmined gems here!
Linked Data Uses the Web Data Format
RDF: Resource Description Framework
• Resources:
- Represent things on the web (web pages, information resources)
- Represent things NOT on the web (people, places: Non-Information Resources)
- Can represent anything at all
- Named using URIs (usually)
- May not have a name: Blank Nodes
- “nouns”
- (Subjects or Objects)
• Literal Values
- Are values to work with and show users
- Can be just a string of text: Plain Literals
- Can have a language assigned to them using ISO codes
- Can have a specific datatype assigned to them: Typed Literals
• Predicates:
- Relationships between Resources
- Named using URIs
- “Verbs”
- Described in Schema (or vocabularies, or ontologies)
RDF Triple (diagram): Subject PT3445, Predicate participatesIn, Object CT5877
What’s Needed?
Linked Data!
LOD Cloud 2011: http://thedatahub.org/group/lodcloud
The 5 Stars of Open Linked Data
W3C/TBL Guidance
★ Make your stuff available on the web (any format)
★★ Make it available as structured data (e.g. Excel instead of image scan of a table)
★★★ Use a non-proprietary format (e.g. CSV instead of Excel)
★★★★ Use URLs to identify things, so that people can point at your stuff
★★★★★ Link your data to other people’s data to provide context
http://www.w3.org/DesignIssues/LinkedData.html
The 5 Stars of Closed (rather than Open) Linked Data
W3C/TBL Guidance
★ Make your stuff available on the intranet, rather than the web (any format)
★★ Make it available as structured data (e.g. Excel instead of image scan of a table)
★★★ Use a non-proprietary format (e.g. CSV instead of Excel)
★★★★ Use URLs to identify things, so that people can point at your stuff
★★★★★ Link your data to other people’s data to provide context
http://www.w3.org/DesignIssues/LinkedData.html
Towards a Linked Data Architecture
[Architecture diagram. Components: Central Identity Management (Active & Partial PURLs); Vocabulary Server; Semantic Visualization; Search; Catalogues, Mapping, Queries; RDF + Tagging; Triplestores; Content (Structured, Semi-Structured, Unstructured)]
Example URIs shown in the diagram:
http://research.vocab.azlinkeddata.com/id/DOID/2841
http://humandiseaseontology.astrazeneca.net/DOID/2841
Choosing Linked Vocabularies
Current LOD Cloud Adoption
Vocabulary prefix   Vocabulary link                                 Number of usages in data sets
dc                  http://purl.org/dc/elements/1.1/                92 (31.19 %)
foaf                http://xmlns.com/foaf/0.1/                      81 (27.46 %)
skos                http://www.w3.org/2004/02/skos/core#            58 (19.66 %)
geo                 http://www.w3.org/2003/01/geo/wgs84_pos#        25 (8.47 %)
xhtml               http://www.w3.org/1999/xhtml/vocab#             19 (6.44 %)
akt                 http://www.aktors.org/ontology/portal#          17 (5.76 %)
bibo                http://purl.org/ontology/bibo/                  14 (4.75 %)
mo                  http://purl.org/ontology/mo/                    13 (4.41 %)
vcard               http://www.w3.org/2006/vcard/ns#                10 (3.39 %)
sioc                http://rdfs.org/sioc/ns#                        10 (3.39 %)
cc                  http://creativecommons.org/ns#                  8 (2.71 %)
geonames            http://www.geonames.org/ontology#               6 (2.03 %)
http://www4.wiwiss.fu-berlin.de/lodcloud/state/#terms
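Reusing these vocabularies mostly comes down to agreeing on the prefix-to-namespace mapping. The sketch below, in plain Python, expands a prefixed name such as `dc:title` into its full URI and compacts a full URI back. The namespaces are copied from the table; the helper names `expand` and `compact` are hypothetical, and only a few of the prefixes are included.

```python
# Prefix-to-namespace mapping, copied from the adoption table (subset).
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "bibo": "http://purl.org/ontology/bibo/",
}


def expand(prefixed_name: str) -> str:
    """Turn 'prefix:local' into a full URI, e.g. 'dc:title'."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise KeyError(f"unknown prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local


def compact(uri: str) -> str:
    """Turn a full URI into 'prefix:local' when a known namespace matches."""
    # Try the longest namespace first so the most specific match wins.
    for prefix, namespace in sorted(PREFIXES.items(),
                                    key=lambda item: len(item[1]),
                                    reverse=True):
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no known namespace: leave the URI untouched


print(expand("dc:title"))
print(compact("http://www.w3.org/2004/02/skos/core#prefLabel"))
```

A shared mapping like this is what lets two datasets that both use `skos:prefLabel` be queried together without any translation step.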
The 5 Stars of Open Linked Vocabularies
Bernard Vatant (Mondeca) Guidance
★ Publish your vocabulary on the Web at a stable URI
★★ Provide human-readable documentation and basic metadata (e.g. creator, publisher, date of creation, last modification, version number)
★★★ Provide labels and descriptions, if possible in several languages, to make your vocabulary usable in multiple linguistic scopes
★★★★ Make your vocabulary available via its namespace URI, both as a formal file and human-readable documentation, using content negotiation
★★★★★ Link to other vocabularies by re-using elements rather than re-inventing
http://blog.hubjects.com/2012/02/is-your-linked-data-vocabulary-5-star_9588.html
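The fourth star relies on HTTP content negotiation: the same namespace URI returns either a formal file or documentation, depending on the `Accept` header the client sends. The following is a minimal sketch using Python's standard `urllib`. The helper name `vocabulary_request` and the choice of media types are assumptions, and whether a given vocabulary server honours these headers is not something this sketch can guarantee.

```python
from urllib.request import Request

# Media types a client might ask for (an illustrative choice, not a required set).
ACCEPT = {
    "rdf": "text/turtle, application/rdf+xml;q=0.9",  # formal, machine-readable file
    "html": "text/html",                               # human-readable documentation
}


def vocabulary_request(namespace_uri: str, want: str = "rdf") -> Request:
    """Build (but do not send) a request for one representation of a vocabulary."""
    return Request(namespace_uri, headers={"Accept": ACCEPT[want]})


# Same URI, two different representations requested.
machine = vocabulary_request("http://purl.org/dc/elements/1.1/", want="rdf")
human = vocabulary_request("http://purl.org/dc/elements/1.1/", want="html")
print(machine.full_url, "->", machine.get_header("Accept"))
print(human.full_url, "->", human.get_header("Accept"))
```

Passing either request to `urllib.request.urlopen` would perform the actual fetch; that step is left out here because it needs network access.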
Domain Specific Vocabularies
Linked Open Vocabularies, NCBO
http://labs.mondeca.com/dataset/lov/index.html
http://bioportal.bioontology.org/
Building Linked Data Applications
[Cycle diagram]
• Capture Business Questions and Sources
• Domain Expert Concept Map
• Build Formal Ontology
o Reuse Vocabularies!
• Model Business Questions (SPARQL)
• Challenge with Linked Data
• Interact with RDF answer in a Faceted Browser
Improving Internal Interoperability
Scientists, Clinicians, Informaticists can now freely interoperate as:
• The PURL server provides a central identity management authority for resources that are of value (need to persist) across the enterprise. The Persistent URLs are used to connect resources found in multiple locations
• The vocabulary server provides a way of harmonizing concepts across different domains
o Where possible, public vocabularies are used
o Where not, they’re extended
o We don’t want to develop and maintain vocabularies
Inside/Outside Disappears
[Architecture diagram spanning External and Internal. Components: Central Identity Management (Active & Partial PURLs); Vocabulary Server; Semantic Visualization; Catalogues, Mapping, Queries; RDF + Tagging; Triplestores; RESTful APIs; Vendor Content; Consortium Content; Vendor Content APIs; Content (Structured, Semi-Structured, Unstructured)]
Unstructured Content
(or most of the data out there…)
Giving Structure to Unstructured Content
o Entity Recognition
o Use of common vocabularies
o Schemas
o Domain-Specific Content? Open BEL? TMO?
o Compatibility of text indices with triplestores & middleware tools
Encouraging Publishers to Structure Content
o How can this be ‘monetized’ so they don’t lose their ROI?
o What about interoperability & persistence?
o Can this be mandated via funding agencies?
o RDFa to start?
Publishers or ‘Re-publishers’
o Thomson-Reuters
o Elsevier
o Ingenuity
o Open up vocabularies (thanks, Cortellis!)
Pre-Competitive Consortia
Open PHACTS (Innovative Medicines Initiative)
Pistoia Alliance
W3C Health Care & Life Sciences Interest Group
National Center for Biomedical Ontology (NCBO)
Open BEL (Biological Expression Language)
Open PHACTS (Open Pharmacological Space)
• EU/EFPIA Innovative Medicines Initiative (IMI) project
Key Points
Flexible and adaptable
- Dynamic schema-less approach; rapidly incorporate new datasets
- Queries are adaptive, based on scientific profiles (e.g. chemist or biologist)
- Use-case driven & tested by users in industry and academia
Large scale data integration
- Focused on pharmacology
- We integrate so you don’t have to
- Dealing with multiple identifiers for the same concept
- Always up-to-date
- State of the art and industrial strength
Great APIs for building apps
- JSON REST-style APIs
- Also supports XML, Turtle, etc.
- Chemistry services
- Exemplars show how to take advantage of the platform
- Clear licensing details for all data in the system
Focus On Data Quality
- Provenance is critical: know where every data point comes from
- Google-style indexing; data providers keep their own data
- Chemistry Standardization: enhancing chemistry connectivity
- Working with data providers to expose and enhance their data
From: Open PHACTS Architecture - Building the extensible platform (EuroQSAR 2012 in Vienna, 30.08.2012)
W3C HCLS
The mission of the Semantic Web Health Care and Life Sciences Interest Group (HCLS IG) is to develop, advocate for, and support the use of Semantic Web technologies across health care, life sciences, clinical research and translational medicine
Activities:
o Continue to develop high level (e.g. TMO) and architectural (e.g. SWAN) vocabularies.
o Implement proof-of-concept demonstrations and industry-ready code.
o Document guidelines to accelerate the adoption of the technology.
o Disseminate information about the group's work at government, industry, academic events and by participating in community initiatives.
Use Cases/Domains
o Drug Discovery
o Electronic Lab Notebooks
o Comparator Arm Data
o Patient Data Ownership
o Biotech Acquisition
o Supply Chain Automation
o Web Integration
o Bio-surveillance
o Co-development
CDISC2RDF: Making Clinical Data Interchange Standards Consortium (CDISC) available as RDF
• Roche, AZ, TopQuadrant, Vrije Universiteit, Amsterdam
• More at CSHALS in two weeks
http://www.w3.org/blog/hcls/
Pleas & Future Directions
Prognostications
RDF Content Farms
• Vendors: Someone will figure out how to monetize this
• Consortia: Who ‘Owns’ this?
• Government in Health Care & Life Sciences; can we learn from the EPA? open.gov?
Shrinking Pharma
• Smaller (or virtual) footprint
o Back to first principles: what do we do best?
• More modeling & Simulation
• Rise of the informaticist…
Community Help
Resist Silos
• Where is your data? Where is it likely to be in 5, 10 years?
• A single triplestore with all ETL-streams leading to an RDF ‘data warehouse’ is another silo
o Building on top of ‘standards+’ may lead to silos
Need to follow & influence emergence of standards if you have a ‘horse in the race’
Support (business focused) Consortiums
• We’re doing the same job many, many times
Thank You
Listeners & Molecular Med TRI-CON 2013 Organizers