Post on 10-Feb-2017
Data: The Good, The Bad & The Ugly
Lee Harland @SciBitely
http://www.scibite.com
http://www.slideshare.net/scibitely
Lee Harland, Lilly Global IT Meeting, November 2016
Context
• This is an invited talk I gave at Lilly’s Internal Global IT meeting on the subject of “data”
The Good
http://www.nejm.org/doi/full/10.1056/NEJMp1606181
What matters to me!
The Bad
…. (Promotion of) the nutritional importance of spinach over other foods, led to an increase of over 30 per cent in its consumption during the 1920s and 30s.
The action of S. Oleracea on cardiovascular output and muscular tone
Bad, Bad Data Point
1870 35.2 mg Fe/100g
1937 3.52 mg Fe/100g
The mythical strength-giving properties of spinach are ... credited to a simple mistake concerning the iron content of the vegetable.
In 1870, Dr E von Wolf published figures which were accepted until the 1930s, when they were rechecked
This revealed that a decimal point had been placed wrongly and that the real figure was only one tenth of Dr von Wolf's claim
Still Making Headlines After 140 Years
2013
There Is No Decimal Point Error
Spinach: One Small Data Point, One Huge Mess
1870 35.2 mg Fe/100g ✓
1937 3.52 mg Fe/100g ✓
Both Values Are Correct – The difference is down to the assay conditions
http://www.merriam-webster.com/dictionary/provenance
35.2
35.2
The datapoint + its provenance (experimental context)
What people saw
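A minimal sketch of the idea on this slide: the same number with and without its experimental context. The class, field names and placeholder condition strings are assumptions made for illustration; the slides do not say what the actual assay conditions were.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    """A data point that carries its own experimental context."""
    value: float
    unit: str
    year: int
    # Hypothetical field: the real assay conditions are not given in the talk.
    assay_conditions: Optional[str] = None

    def comparable_with(self, other: "Measurement") -> bool:
        """Values are only comparable when units match and both
        measurements were made under the same, known conditions."""
        return (
            self.unit == other.unit
            and self.assay_conditions is not None
            and self.assay_conditions == other.assay_conditions
        )


# Placeholder condition labels, not the historical ones.
wolf_1870 = Measurement(35.2, "mg Fe/100g", 1870, "condition A (placeholder)")
recheck_1937 = Measurement(3.52, "mg Fe/100g", 1937, "condition B (placeholder)")

# "What people saw": the bare number, with no provenance attached.
bare = Measurement(35.2, "mg Fe/100g", 1870)

print(wolf_1870.comparable_with(recheck_1937))  # different conditions
print(bare.comparable_with(recheck_1937))       # conditions unknown
```

With the context attached, the two spinach values stop looking like a decimal-point slip and start looking like two different experiments.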
So What?
……estimates for the reproducibility of preclinical research range from 51 percent to 89 percent. They estimate that at least half of all U.S. preclinical biomedical research funding—about $28 billion annually—is therefore squandered……
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002165
http://www.merriam-webster.com/dictionary/provenance
Provenance Is A Critical Component of Reproducibility
What L cells, where from, how old, epigenetic profile etc etc?
When, how often, in what way, using what system?????
What, when, how?
Could you accurately reproduce this experiment from this method?
* I was responsible for this paragraph
http://www.nature.com/nrd/journal/v10/n9/full/nrd3439-c1.html
A first-of-a-kind analysis of Bayer's internal efforts to validate 'new drug target' claims now not only supports this view but suggests that 50% may be an underestimate; the company's in-house experimental data do not match literature claims in 65% of target-validation projects, leading to project discontinuation.
This is where Informatics & Data Science can add real value to Drug Discovery
Open PHACTS https://www.openphacts.org/
Open PHACTS: Adding Provenance To Data
http://nanopub.org/
sub:Head {
  this: np:hasAssertion sub:assertion ;
    np:hasProvenance sub:provenance ;
    np:hasPublicationInfo sub:pubinfo ;
    a np:Nanopublication .
}

sub:assertion {
  nx:NX_P35712 bfo:BFO_0000066 ts:TS-0276 ; # Protein NX_P35712 is localized in tissue TS-0276
    ro:has_quality "positive" .
}

sub:provenance {
  <http://www.nextprot.org/help/quality_criteria/silver> a eco:ECO_0000205 ;
    rdfs:label "neXtProt silver"^^xsd:string .
  sub:_1 a efo:EFO_00027688 .
  sub:_10 a eco:ECO_0000218 .
  sub:_2 a eco:ECO_0000218 .
  sub:_9 a efo:EFO_00027688 .
  sub:assertion prv:usedData
      <http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000087&organ_id=EV:0100115&gene_id=ENSG00000110693> ,
      <http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000088&organ_id=EV:0100115&gene_id=ENSG00000110693> ,
      <http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000090&organ_id=EV:0100115&gene_id=ENSG00000110693&stage_children=on> ,
      <http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000092&organ_id=EV:0100115&gene_id=ENSG00000110693&stage_children=on> ,
      <http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000094&organ_id=EV:0100115&gene_id=ENSG00000110693&stage_children=on> ;
    wi:evidence <http://www.nextprot.org/help/quality_criteria/silver> ;
    a eco:ECO_0000220 ;
    rdfs:comment " data, NX_P35712 is expressed in Endometrium"^^xsd:string ;
    prov:wasDerivedFrom sub:_1 , sub:_3 , sub:_5 , sub:_7 , sub:_9 ;
    prov:wasGeneratedBy sub:_10 , sub:_2 , sub:_4 , sub:_6 , sub:_8 .
}

sub:pubinfo {
  sub:_11 a eco:ECO_0000205 .
  sub:_12 a eco:ECO_0000205 .
  sub:_15 a eco:ECO_0000205 .
  this: dcterms:created "2014-09-19T00:00:00.0Z"^^xsd:dateTime ;
    dcterms:rights <http://creativecommons.org/licenses/by/3.0/> ;
    dcterms:rightsHolder <http://nextprot.org> ;
    prv:usedData "neXtProt database" ;
    pav:authoredBy "CALIPHO project" ,
      <http://orcid.org/0000-0001-6710-1373> ,
      <http://orcid.org/0000-0001-6818-334X> ,
      <http://orcid.org/0000-0002-1303-2189> ,
      <http://orcid.org/0000-0003-1813-6857> ;
    pav:versionNumber "3" ;
    prov:wasGeneratedBy sub:_11 , sub:_12 , sub:_13 , sub:_14 , sub:_15 .
}
http://nanopub.org
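The nanopublication above has three parts: the assertion itself, the provenance of that assertion, and the publication info. A rough stdlib-only sketch of that shape follows. The `Nanopub` class and `has_provenance` helper are hypothetical names used for illustration, not part of any nanopub library; the values are copied from the example above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Nanopub:
    """Hypothetical container mirroring the three named graphs."""
    assertion: Tuple[str, str, str]   # (subject, predicate, object)
    provenance: Dict[str, str]        # how the assertion was derived
    pubinfo: Dict[str, str]           # who published it, when, which licence


example = Nanopub(
    assertion=("nx:NX_P35712", "bfo:BFO_0000066", "ts:TS-0276"),
    provenance={
        "wi:evidence": "http://www.nextprot.org/help/quality_criteria/silver",
    },
    pubinfo={
        "dcterms:created": "2014-09-19T00:00:00.0Z",
        "dcterms:rights": "http://creativecommons.org/licenses/by/3.0/",
        "pav:versionNumber": "3",
    },
)


def has_provenance(np: Nanopub) -> bool:
    """True only when the assertion travels with both kinds of context."""
    return bool(np.provenance) and bool(np.pubinfo)


print(has_provenance(example))
```

The point of the structure is that the assertion never travels alone: strip the other two parts and you are back to a bare "35.2".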
https://explorer.openphacts.org
One of the few user interfaces where provenance is intrinsically “there”
The Ugly
80-90% of all potentially usable business information may originate in unstructured form
https://en.wikipedia.org/wiki/Unstructured_data
The Ugly
“Carboxypeptidase B2” “Thrombin-Activatable Fibrinolysis Inhibitor”
“Plasma CPU”
The True Picture (they are the same thing)
It hasn’t just got 3 names, it’s got LOTS:
carboxypeptidase B-like protein OR thrombin-activatable fibrinolysis inhibitor OR CPB type 2 OR Carboxypeptidase type B2 OR plasma carboxypeptidase type B OR carboxypeptidase type B2 OR CPB2 OR Plasma carboxypeptidase type B OR CPB-2 OR carboxypeptidase B2 (plasma), carboxypeptidase U OR Carboxypeptidase type U OR carboxypeptidase type U OR plasma carboxypeptidase B2 OR carboxy-peptidylase U OR thrombin-activable fibrinolysis inhibitor OR plasma carboxypeptidase type B2 OR carboxypeptidase B2 (plasma OR CPU OR carboxypeptidase B2 OR PCPB OR pCPB OR Carboxypeptidase U OR plasma carboxypeptidase B OR TAFI OR Carboxypeptidase B2 OR Plasma carboxypeptidase B OR Thrombin-activable fibrinolysis inhibitor OR carboxypeptidase B2 plasma OR carboxypeptidase R
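A toy sketch of what resolving those names to one entity could look like. The synonyms are a handful taken from the list above; using "CPB2" as the canonical key and the `normalise` helper are assumptions for illustration, not how any particular product does it.

```python
from typing import Optional

# A few of the synonyms from the slide, keyed by an arbitrarily chosen
# canonical label (assumption: "CPB2" is used as the identifier here).
SYNONYMS = {
    "CPB2": [
        "carboxypeptidase B2",
        "thrombin-activatable fibrinolysis inhibitor",
        "plasma CPU",
        "carboxypeptidase U",
        "TAFI",
    ],
}


def _key(name: str) -> str:
    """Case-fold and collapse whitespace so trivial variants match."""
    return " ".join(name.casefold().split())


# Reverse index: every known spelling points at the canonical label.
LOOKUP = {
    _key(name): canonical
    for canonical, names in SYNONYMS.items()
    for name in [canonical, *names]
}


def normalise(name: str) -> Optional[str]:
    """Return the canonical label for a name, or None if unknown."""
    return LOOKUP.get(_key(name))


for seen in ("Carboxypeptidase B2",
             "Thrombin-Activatable Fibrinolysis Inhibitor",
             "Plasma CPU"):
    print(seen, "->", normalise(seen))
```

All three names from the previous slide resolve to the same label, which is exactly what a plain keyword search cannot do.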
“We also manually standardized data related to lab measurement units and terminology related to patient race and ethnicity, geographical study regions, and names of drugs and drug families.”
Yet Another Issue
(an accident waiting to happen)
VARCHAR2
PROJ_TITLE
EXPERIMENT_INFO
ASSAY_DESCRIPTION
KEYWORDS
USER_PROFILE SUMMARY
EXPT_METADATA
SETTINGS_INFO
REPORT_TEXT
EXPT_NAME
Databases: Where Knowledge Goes To Die
MEETING_MINUTES
PROJ_ACTIONS
ASSAY_CONLCUSION
COHORT_DESC
INCLUSION_CRITERIA
POLICY_DETAILS
PROJECT_OVERVIEW
RATIONALE
JUSTIFICATION
Text2Data MicroService
TERMite
TEXT: Supports basic keyword search only
DATA: Rich substrate for search and discovery & insight
Just What Is “The Data”?
• Mentions of all
• Genes, Diseases, Drugs, Tissues, Cells, Techniques, Assays, Measures, Protocols, Compounds, Regimens, Companies, People, Locations, Pathologies, Adverse Events, Pathways, Metabolism, Manufacturing Concepts, QC/QA, Pathogens, Strains, Animals … and so on...
• … And their relationships to each other
• … And their locations (section, database column)
• … Inferring relationships between documents/entries
• … Regardless of actual keyword used
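To make "mentions, plus their locations, regardless of keyword" concrete, here is a deliberately tiny sketch that scans free-text columns and emits structured rows. The vocabulary, the `GENE:`/`TISSUE:` identifiers and the `tag_row` function are all made up for illustration; the column names are borrowed from the earlier slide.

```python
import re
from typing import Dict, List, NamedTuple


class Mention(NamedTuple):
    column: str       # where the mention was found
    entity_type: str  # e.g. Gene, Tissue
    entity_id: str    # one ID regardless of the keyword used
    matched: str      # the text as written


# Hypothetical toy vocabulary: surface form -> (type, illustrative ID).
VOCAB = {
    "tafi": ("Gene", "GENE:CPB2"),
    "carboxypeptidase b2": ("Gene", "GENE:CPB2"),
    "endometrium": ("Tissue", "TISSUE:endometrium"),
}

# Longest terms first so multi-word names win over shorter overlaps.
_terms = sorted(VOCAB, key=len, reverse=True)
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in _terms) + r")\b",
    re.IGNORECASE,
)


def tag_row(row: Dict[str, str]) -> List[Mention]:
    """Turn the free text in each column into structured mentions."""
    mentions = []
    for column, text in row.items():
        for m in PATTERN.finditer(text):
            entity_type, entity_id = VOCAB[m.group(0).lower()]
            mentions.append(Mention(column, entity_type, entity_id, m.group(0)))
    return mentions


row = {
    "ASSAY_DESCRIPTION": "TAFI activity measured in plasma",
    "REPORT_TEXT": "Carboxypeptidase B2 levels were elevated",
}
for mention in tag_row(row):
    print(mention)
```

Two columns, two different keywords, one entity ID: the text has become something you can join, count and query.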
Systems Integration Guide
http://yourcompany.com/termite?text=<content>&app=<application name>&index=<e.g. page, table or column name>
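A small sketch of building that request URL from an application. The host and the `text`, `app` and `index` parameters are the schematic ones from the slide, not a documented API, and `build_request_url` is a hypothetical helper; check the real service documentation before relying on any of it.

```python
from urllib.parse import urlencode

# Placeholder endpoint taken from the slide, not a real service.
BASE_URL = "http://yourcompany.com/termite"


def build_request_url(text: str, app: str, index: str) -> str:
    """Encode the three schematic parameters into a query string."""
    query = urlencode({"text": text, "app": app, "index": index})
    return f"{BASE_URL}?{query}"


url = build_request_url(
    text="TAFI activity in plasma",
    app="ELN",
    index="ASSAY_DESCRIPTION",
)
print(url)
# http://yourcompany.com/termite?text=TAFI+activity+in+plasma&app=ELN&index=ASSAY_DESCRIPTION
```

Using `urlencode` rather than string concatenation keeps free text with spaces, ampersands or non-ASCII characters from breaking the request.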
ELN Screening Registry
PDMRegistry
Project Management SharePoint
What’s going on, right now
Trending Today
Why Give Ugly Data A Makeover?
• ELN annotation using Bioassay Ontology
• “Find all experiments using any Cell Fluorescence technique”
• Pharmacovigilance
• Monitoring newsfeeds & internal data for safety signals
• Automatic Process Notification
• Alert groups based on content of CRO documents etc.
• Synergise Both Semantic Technology & Information Professionals
• Re-energise Therapeutic Area Literature Searching
• Build Knowledge Chains (Assertional Provenance)
• Project Management → ELN Data → Screen SOP
Before I go…..
Spinach: The Truth Is Out There!
Spinach is high in iron (!)
..oxalic acid in spinach prevents more than 90% of iron from being
absorbed..
Acknowledgements
IMI Open PHACTS Team (many more involved, I just don’t have a photo ☹)
http://openphacts.org
SciBite Team
http://scibite.com