|
From Maslow’s Hierarchy to Knowledgegraphs: Experiments in Big and Small Data at Elsevier
Anita de Waard, [email protected]
Research Data Management, Elsevier
Charleston Conference, November 4, 2016
|
Big Data vs. Small Data: What Will I Be Talking About?
Data Type     Small                             Big
User          UX                                User analytics
Performance   Pure                              SciVal
Research      Research Data Management (RDM)    HPC systems (HEP, astronomy, etc.)
Text          Text mining                       KnowledgeGraphs
Health        Medical systems                   Precision Medicine
Legend (color key): Elsevier does; I will talk about
|
When You Leave Your Institution, What Happens To Your Data?
Survey answer options (chart labels):
Stays at institution
Take it with me
Don’t know
Data is lost
Other
Source: Bauer, B. et al. (2015). ‘Forschende und ihre Daten. Ergebnisse einer österreichweiten Befragung’ [Researchers and their data: results of an Austria-wide survey] (eBook, in German). e-Infrastructures Austria, https://phaidra.univie.ac.at/detail_object/o:407736
|
When we talk about data, we really talk about the following:
Machine & environment settings
Raw data
Processed data
Scripts & analyses
Protocols, methods, algorithms
Accessibility
Reproducibility
Reusability
Discoverability
Note: images for illustrative purposes only
|
https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data
A Maslow Hierarchy for Research Data:
|
Preserve Process: Hivebench (http://www.hivebench.com)
|
Preserve Data: Mendeley Data (https://data.mendeley.com/)
Linked to published papers – or not
Linked to GitHub – or not
Versioning and provenance
|
Share and Comprehend: SoftwareX (http://www.journals.elsevier.com/softwarex/)
|
Access: Linking papers to data: www.Scholix.org
• ICSU/WDS/RDA Publishing Data Service Working Group
• Creating a linked-data model for exposing DOI-to-DOI links outside the publisher’s firewall
• Merged with a National Data Service pilot with the same goal
• Collaboration between CrossRef, DataCite, Europe PubMed Central, ANDS, Thomson Reuters, Elsevier, OpenAIRE
Objective: move from
a plethora of (mostly) bilateral arrangements between the different players
... to ...
a one-for-all cross-referencing service for articles and data
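A minimal sketch of what moving from bilateral arrangements to one shared link service could look like, assuming each link is a simple record of source DOI, target DOI, relationship and provider. The `Link` class, its field names, the two helper functions and the example DOIs are hypothetical illustrations, not the Scholix schema or any Scholix API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One article-to-data link (hypothetical field names)."""
    source_doi: str    # e.g. the DOI of the article
    target_doi: str    # e.g. the DOI of the dataset
    relationship: str  # free-text label such as "references" (assumed vocabulary)
    provider: str      # the party that asserted the link


def merge_links(*feeds):
    """Pool several bilateral feeds into one de-duplicated list of links.

    DOIs are compared case-insensitively; the first provider to assert
    a given link is the one that is kept.
    """
    seen = {}
    for feed in feeds:
        for link in feed:
            key = (link.source_doi.lower(), link.target_doi.lower(), link.relationship)
            seen.setdefault(key, link)
    return list(seen.values())


def links_for(doi, links):
    """Return every link in which the given DOI is either source or target."""
    doi = doi.lower()
    return [l for l in links if doi in (l.source_doi.lower(), l.target_doi.lower())]


if __name__ == "__main__":
    # Made-up DOIs, for illustration only.
    publisher_feed = [Link("10.1234/article.1", "10.5678/data.9", "references", "publisher")]
    datacentre_feed = [Link("10.1234/ARTICLE.1", "10.5678/data.9", "references", "data centre")]
    pooled = merge_links(publisher_feed, datacentre_feed)
    print(len(pooled), links_for("10.5678/data.9", pooled))
```

The point of the pooled list is that any party can ask one question ("what is linked to this DOI?") instead of negotiating a separate exchange with every other player.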
|
Discover: Data Search (http://datasearch.elsevier.com)
1. Across repositories
2. (Deep) indexing of data, so not just metadata
3. Data preview
|
https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data
A Maslow Hierarchy for Research Data:
Data at Risk
Reproducibility Papers
|
TOWARDS AN ELSEVIER KNOWLEDGE GRAPH
GOAL: IDENTIFY ENTITIES AND RELATIONSHIPS ACROSS THE ENTIRE ELSEVIER CORPUS IN SCIENCEDIRECT
TEXT MINING + ENTITY IDENTIFICATION, USING OUR TAXONOMIES (EMMeT, COMPENDEX, AND OTHERS)
UNSUPERVISED, SCALABLE AND BUILT WITH OFF-THE-SHELF TECHNOLOGIES
COLLABORATION WITH UNIVERSITY COLLEGE LONDON AND UMASS AMHERST [1]
(Figure labels:)
14M articles from ScienceDirect
3.3M triples
475M triples
49M triples
p x r matrix
p x k, k x r latent factor matrices
~102 triples
920K concepts from EMMeT
[1] Riedel, S., L. Yao, A. McCallum, and B. M. Marlin (2013). "Relation extraction with matrix factorization and universal schemas", http://www.aclweb.org/anthology/N13-1008
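The "p x r matrix" and "p x k, k x r latent factor matrices" labels can be illustrated with a toy factorization. This is a sketch under stated assumptions, not the pipeline described on the slide: rows stand for entity pairs, columns for relations, observed cells are 0 or 1, and training uses plain logistic loss with stochastic gradient descent. The function names, hyperparameters and training scheme are all illustrative choices, and the objective used in [1] is not reproduced here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def factorize(cells, p, r, k=2, lr=0.1, epochs=1000, seed=0):
    """Fit latent factors to the observed cells of a p x r matrix.

    cells maps (row, column) -> 0 or 1 for observed entries only.
    Returns P (p x k) and R (r x k); R is stored transposed, so it
    plays the role of the k x r factor.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(p)]
    R = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(r)]
    for _ in range(epochs):
        for (i, j), y in sorted(cells.items()):
            # Gradient of the logistic log-likelihood for this cell.
            err = y - sigmoid(sum(a * b for a, b in zip(P[i], R[j])))
            for f in range(k):
                p_if, r_jf = P[i][f], R[j][f]
                P[i][f] += lr * err * r_jf
                R[j][f] += lr * err * p_if
    return P, R


def score(P, R, i, j):
    """Estimated probability that cell (i, j) holds."""
    return sigmoid(sum(a * b for a, b in zip(P[i], R[j])))


if __name__ == "__main__":
    # Toy data: 4 entity pairs x 3 relations; cell (1, 1) is left unobserved.
    cells = {
        (0, 0): 1, (0, 1): 1, (0, 2): 0,
        (1, 0): 1,            (1, 2): 0,
        (2, 0): 0, (2, 1): 0, (2, 2): 1,
        (3, 0): 0, (3, 1): 0, (3, 2): 1,
    }
    P, R = factorize(cells, p=4, r=3)
    print("score for the unobserved cell (1, 1):", round(score(P, R, 1, 1), 3))
```

Because every cell is scored from the same k latent dimensions, the model can assign a score to a cell it never saw, which is what makes the factorized form useful for proposing relations that were not stated explicitly in the text.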
|
SAMPLE OUTPUT:
Deduplication/normalization: downsampled from 49M entity-resolved triples:
glaucoma developed many years after chronic inflammation of uveal tract
glaucoma develop following chronic inflammation of uveal tract
glaucoma can appear soon in family history of glaucoma
glaucoma can appear soon in age over 40
glaucoma the risk of functional visual field loss
glaucoma contributing causes of functional visual field loss
glaucoma contributed to functional visual field loss
glaucoma is considered the second leading cause of functional visual field loss
glaucoma remains the second leading cause of functional visual field loss
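One simple reading of the deduplication/normalization step is to group the surface-form relations by entity pair and drop exact repeats. The sketch below assumes triples arrive as (subject, relation, object) strings; the function name, the lowercasing and whitespace rules, and the way the sample lines are split into three parts are illustrative assumptions, not the method used to produce the sample output.

```python
from collections import defaultdict


def group_by_entity_pair(triples):
    """Group relation phrases under a normalized (subject, object) key.

    Normalization here is only lowercasing and whitespace collapsing;
    exact duplicate relation phrases for a pair are kept once, in
    first-seen order.
    """
    grouped = defaultdict(list)
    for subj, rel, obj in triples:
        key = (" ".join(subj.lower().split()), " ".join(obj.lower().split()))
        norm_rel = " ".join(rel.lower().split())
        if norm_rel not in grouped[key]:
            grouped[key].append(norm_rel)
    return dict(grouped)


if __name__ == "__main__":
    # A few of the sample lines, split by hand into subject / relation / object.
    sample = [
        ("glaucoma", "contributed to", "functional visual field loss"),
        ("glaucoma", "remains the second leading cause of", "functional visual field loss"),
        ("Glaucoma", "contributed  to", "Functional visual field loss"),
        ("glaucoma", "can appear soon in", "age over 40"),
    ]
    for pair, relations in group_by_entity_pair(sample).items():
        print(pair, relations)
```

Grouping by entity pair makes it easy to see which phrasings are competing descriptions of the same underlying relationship, as with the several "functional visual field loss" lines in the sample.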
|
Knowledge Graphs for the Life Sciences:
Bradley Allen, DC Conference, Oct 2016, http://www.slideshare.net/bpa777/dc2016-keynote-20161013-67164305/15
|
Trends driving Digital Health & Precision Medicine: need for health data with consent
4500 tests for gene disorders available (2013: 3200; +20% CAGR)
$1245: cost to sequence full genome (10/2014: $5730)
$199: cost of 23andMe test
25 million biomed articles referenced on PubMed
30 days → 1 hour: time needed to develop one prediction model at Elsevier (manual to machine learning)
1.2 million new biomed articles p.a.
76% of US hospitals use at least a basic EMR
130 million patient data sets at large insurer; 21 m complete for last 2 years; 7 m with clinical and lab data. NB: 6 m (no clin, lab) in Germany; 6.5 million in Catalonia
105 mm ECG: high ECG quality, heart rate, respiratory, body temp, activity, body position, watertight, induction charged, Bluetooth, continuous data feed
patientslikeme has 400,000+ members donating data: 31 million data points covering 2,500+ conditions
1. genetic testing
2. information explosion
3. patient data
4. biosensors - IoT in health
5. machine learning
6. patient empowerment
|
The Elsevier Medical Graph is a deep predictive model that relates attributes of over 2000 medical conditions to phenotypes of patients at potential risk of re-admission.
Probability of occurrence within the next five years. 2,083 ICD-10 conditions. Based on a 6-year longitudinal history of 6 million German patients.
|
Big Data vs. Small Data: What Did I Talk About?
Data Type     Small                             Big
User          UX                                User analytics
Performance   Pure                              SciVal
Research      Research Data Management (RDM)    HPC systems (HEP, astronomy, etc.)
Text          Text mining                       KnowledgeGraphs
Health        Medical systems                   Precision Medicine
Legend (color key): Elsevier does; I discussed!
|
Thank you!
Anita de Waard, VP Research Data Collaborations, Elsevier RDM Services
Jericho, VT
[email protected]