Transparency in the Data Supply Chain
![Page 1: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/1.jpg)
Paul Groth (@pgroth)
Web & Media Group
Department of Computer Science
VU University Amsterdam
http://www.few.vu.nl/~pgroth
Transparency in the Data Supply Chain
![Page 2: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/2.jpg)
![Page 3: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/3.jpg)
Outline
• Data integration for analysis
  – i.e. remixing data
• The need for transparency
• Two solutions
• The future
![Page 5: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/5.jpg)
Why?
Public Domain Drug Discovery Data: Pharma are accessing, processing, storing & re-processing
[Diagram labels: Literature, PubChem, Genbank, Patents, Databases, Downloads; Data Integration; Data Analysis; Firewalled Databases; Repeat @ each company x]
![Page 6: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/6.jpg)
Prioritised Research Questions

| Number | Sum | Nr of 1 | Question |
|---|---|---|---|
| 15 | 12 | 9 | All oxido-reductase inhibitors active <100nM in both human and mouse |
| 18 | 14 | 8 | Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound? |
| 24 | 13 | 8 | Given a target find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives. |
| 32 | 13 | 8 | For a given interaction profile, give me compounds similar to it. |
| 37 | 13 | 8 | The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X. |
| 38 | 13 | 8 | Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not). |
| 41 | 13 | 8 | A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? i.e. return all cmpds active in assays where the resolution is at least at the level of the target family (i.e. PKC) both from structured assay databases and the literature. |
| 44 | 13 | 8 | Give me all active compounds on a given target with the relevant assay data |
| 46 | 13 | 8 | Give me the compound(s) which hit most specifically the multiple targets in a given pathway (disease) |
| 59 | 14 | 8 | Identify all known protein-protein interaction inhibitors |
www.openphacts.org
![Page 7: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/7.jpg)
Research question 15: All oxido reductase inhibitors active < 100nM in both human and mouse
From Mabel Loza - USC team
![Page 8: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/8.jpg)
![Page 9: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/9.jpg)
![Page 10: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/10.jpg)
![Page 11: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/11.jpg)
Research question 15: All oxido reductase inhibitors active < 100nM in both human and mouse
ChEMBL:
Search target Oxidoreductase: 481 targets from different species
Selection of all the oxidoreductases and filtering bioactivities with the criterion IC50 < 100 (no units could be selected): 11497 data points obtained
Table exported to an Excel spreadsheet and manually filtered
From Mabel Loza - USC team
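The manual step above is needed because a bare "IC50 < 100" filter ignores units. A minimal sketch of what a unit-aware filter for research question 15 could look like is given below. The record layout, compound names, and values are hypothetical and purely illustrative; this is not the ChEMBL schema or the team's actual procedure.

```python
# Hypothetical records; field names and values are illustrative only,
# not the ChEMBL schema or real measurements.
UNIT_TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6}


def ic50_in_nm(value, unit):
    """Convert an IC50 value to nanomolar; raises KeyError on unknown units."""
    return value * UNIT_TO_NM[unit]


def active_in_all_species(records, species, threshold_nm=100.0):
    """Return compounds with IC50 below threshold_nm in every listed species."""
    hits = {}
    for rec in records:
        if ic50_in_nm(rec["ic50"], rec["unit"]) < threshold_nm:
            hits.setdefault(rec["compound"], set()).add(rec["species"])
    wanted = set(species)
    return sorted(c for c, seen in hits.items() if wanted <= seen)


records = [
    {"compound": "cmpd-A", "species": "human", "ic50": 50, "unit": "nM"},
    {"compound": "cmpd-A", "species": "mouse", "ic50": 0.02, "unit": "uM"},
    {"compound": "cmpd-B", "species": "human", "ic50": 90, "unit": "uM"},
    {"compound": "cmpd-B", "species": "mouse", "ic50": 10, "unit": "nM"},
    {"compound": "cmpd-C", "species": "human", "ic50": 5, "unit": "nM"},
]

print(active_in_all_species(records, ["human", "mouse"]))
```

In this toy data, a filter on the raw number alone would also accept cmpd-B, whose human value is 90 uM rather than 90 nM.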
![Page 12: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/12.jpg)
5 people
Working 6 hours
![Page 13: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/13.jpg)
Problem: Data Integration
[Diagram, two approaches:]
• Data Warehouse: Data Sources, Extract / Transform / Load, Data Warehouse, Queries
• Mediator: Data Sources, Mediator, Query Reformulation, Queries
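The two integration styles on this slide can be contrasted in a small sketch. Everything here is an assumption made for illustration: the two sources, their field names, and the function names are hypothetical, and real systems differ considerably in scale and mechanics.

```python
# Two hypothetical sources with different local schemas (illustrative data).
SOURCE_A = [{"cmpd": "c1", "target": "T1"}, {"cmpd": "c2", "target": "T2"}]
SOURCE_B = [{"compound_id": "c3", "protein": "T1"}]


def etl_load():
    """Warehouse style: extract, transform to one schema, and load once."""
    warehouse = [
        {"compound": r["cmpd"], "target": r["target"], "source": "A"}
        for r in SOURCE_A
    ]
    warehouse += [
        {"compound": r["compound_id"], "target": r["protein"], "source": "B"}
        for r in SOURCE_B
    ]
    return warehouse


def query_warehouse(warehouse, target):
    """Query the already-integrated copy of the data."""
    return sorted(r["compound"] for r in warehouse if r["target"] == target)


def query_mediator(target):
    """Mediator style: reformulate the query for each source at query time."""
    from_a = [r["cmpd"] for r in SOURCE_A if r["target"] == target]
    from_b = [r["compound_id"] for r in SOURCE_B if r["protein"] == target]
    return sorted(from_a + from_b)


print(query_warehouse(etl_load(), "T1"))
print(query_mediator("T1"))
```

Both paths return the same answer here; the difference is where the schema mapping happens, once at load time or on every query.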
![Page 14: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/14.jpg)
Using the Power of Open PHACTS, London, 22-23 April 2013
[Architecture diagram; recoverable labels:]
• Applications
• Core Platform
• Linked Data API (RDF/XML, TTL, JSON)
• Domain Specific Services
• Semantic Workflow Engine
• Identity Resolution Service
• Identifier Management Service
• Chemistry Registration, Normalisation & Q/C
• Index
• Data Cache (Virtuoso Triple Store)
• Data sources: RDF, Nanopub, Db, VoID
• Content: Public Content, Commercial, Public Ontologies, User Annotations
• Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”
![Page 15: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/15.jpg)
Open PHACTS Explorer
![Page 16: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/16.jpg)
Open PHACTS Explorer
?
![Page 17: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/17.jpg)
Credits: Curt Tilmes, Peter Fox
Tilmes, C.; Fox, P.; Ma, X.; McGuinness, D.L.; Privette, A.P.; Smith, A.; Waple, A.; Zednik, S.; Zheng, J.G., "Provenance Representation for the National Climate Assessment in the Global Change Information System," IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 11, pp. 5160-5168, Nov. 2013
![Page 18: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/18.jpg)
![Page 19: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/19.jpg)
Problem: I don’t trust your assessment. What is it based on?
![Page 20: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/20.jpg)
Tension:
Integrated & Summarized Data
Transparency & Trust
![Page 21: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/21.jpg)
Solution
Integrating and exposing provenance provided by multiple sources
![Page 22: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/22.jpg)
![Page 23: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/23.jpg)
provbook.org
![Page 24: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/24.jpg)
![Page 25: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/25.jpg)
National Climate Change Assessment Provenance
![Page 26: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/26.jpg)
![Page 27: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/27.jpg)
![Page 28: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/28.jpg)
PROV: the database as a black box
Q
![Page 29: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/29.jpg)
Goal
• the capability to trace back, for each query result, the complete list of sources and how they were combined to deliver a result.
![Page 30: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/30.jpg)
Implement In a Graph Database at Scale
Marcin Wylot
Philippe Cudré-Mauroux
Exascale Lab
University of Fribourg
http://diuf.unifr.ch/main/xi/diplodocus
![Page 31: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/31.jpg)
TripleProv [WWW 2014]
![Page 32: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/32.jpg)
Provenance Polynomials
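As a rough illustration of the idea, a provenance polynomial records which sources contributed to a query result and how they were combined. The sketch below assumes the common convention that ⊕ joins alternative sources for the same triple pattern and ⊗ joins sources combined across a join. The quads, source names, and functions are hypothetical; this is not the TripleProv implementation.

```python
from collections import defaultdict

# Hypothetical quads (subject, predicate, object, source); illustrative only.
QUADS = [
    ("aspirin", "type", "Drug", "src1"),
    ("aspirin", "type", "Drug", "src2"),
    ("aspirin", "targets", "COX1", "src3"),
    ("ibuprofen", "type", "Drug", "src1"),
    ("ibuprofen", "targets", "COX1", "src2"),
    ("ibuprofen", "targets", "COX1", "src3"),
]


def sources_for(predicate, obj):
    """Map each subject to the sources asserting (subject, predicate, obj)."""
    found = defaultdict(set)
    for s, p, o, src in QUADS:
        if p == predicate and o == obj:
            found[s].add(src)
    return found


def union(sources):
    """Render the alternative sources for one triple pattern with ⊕."""
    return "(" + " ⊕ ".join(sorted(sources)) + ")"


def drugs_targeting(target):
    """Answer '?x type Drug . ?x targets <target>' with one polynomial per result."""
    is_drug = sources_for("type", "Drug")
    hits = sources_for("targets", target)
    return {
        x: union(is_drug[x]) + " ⊗ " + union(hits[x])
        for x in sorted(is_drug.keys() & hits.keys())
    }


for result, polynomial in drugs_targeting("COX1").items():
    print(result, ":", polynomial)
```

Each result carries its own expression, so a reader can see both the full list of sources and the way they were combined.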
![Page 33: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/33.jpg)
Test on large messy data
• Billion Triple Challenge
  – Crawled from the linked open data cloud
• Web Data Commons
  – RDFa, Microdata extracted from common crawl
• 115 million triples (25 GB)
• 8 Queries defined for BTC
  – T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009.
![Page 34: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/34.jpg)
External + Internal Provenance
• Unified queries over external and database provenance
• Adapting query results based on provenance
• Performance improvements
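One way to read "adapting query results based on provenance" is to keep only the results that can be derived entirely from sources the user trusts. The sketch below assumes lineage is already available as a list of alternative derivations per result; the data, source names, and function name are hypothetical.

```python
# Hypothetical lineage: each result maps to its alternative derivations, and
# each derivation is the set of sources that were joined to produce it.
LINEAGE = {
    "aspirin": [{"src1", "src3"}, {"src2", "src3"}],
    "ibuprofen": [{"src1", "src2"}],
    "naproxen": [{"src4"}],
}


def restrict_to_trusted(lineage, trusted):
    """Keep results with at least one derivation that uses only trusted sources."""
    kept = {}
    for result, derivations in lineage.items():
        valid = [d for d in derivations if d <= trusted]
        if valid:
            kept[result] = valid
    return kept


trusted = {"src1", "src3"}
print(sorted(restrict_to_trusted(LINEAGE, trusted)))
```

The surviving derivations are returned alongside each result, so the answer stays traceable after filtering.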
![Page 35: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/35.jpg)
FUTURE
![Page 36: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/36.jpg)
60 % of time is spent on data preparation
![Page 37: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/37.jpg)
Big Data is often lots of small data
http://www.data2semantics.org/prov-reconstruction-challenge/
![Page 38: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/38.jpg)
Questions?
• More info:
  – openphacts.org
  – data2semantics.org
  – provbook.org
  – Paul Groth, "Transparency and Reliability in the Data Supply Chain," IEEE Internet Computing, vol. 17, no. 2, pp. 69-71, March-April 2013
  – Paul Groth, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013
  – Marcin Wylot, Philippe Cudré-Mauroux and Paul Groth. TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store. WWW 2014
![Page 39: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/39.jpg)
Backup
![Page 40: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/40.jpg)
Hack SPARQL
![Page 41: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/41.jpg)
What’s the overhead? Setup
Source and complete trace (i.e. triple level)
![Page 42: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/42.jpg)
Annotations:
Propagate annotations through the query processing pipeline
![Page 43: Transparency in the Data Supply Chain](https://reader036.fdocuments.us/reader036/viewer/2022081602/5549c952b4c905856d8b46ff/html5/thumbnails/43.jpg)
What’s the overhead?