Towards Automating Data Narratives
TOWARDS AUTOMATING DATA NARRATIVES
Yolanda Gil, Daniel Garijo
Information Sciences Institute
University of Southern California
@yolandagil, @dgarijov
{gil,dgarijo}@isi.edu
The Scientific Research Process
Towards Automating Data Narratives. Yolanda Gil and Daniel Garijo
Formulate hypothesis
Define the experiment (data + method)
Find data
Run experiments (methods)
Meta-analysis of results
Revise hypothesis
The products of scientific research
Publication
Methods
Data
Software
Execution traces
Reconstructing the Computations from the Text in the Paper
Comparison of Ligand Binding Sites
The SMAP software was used to compare the binding sites of the 749 M.tb protein structures plus 1,446 homology models (a total of 2,195 protein structures) with the 962 binding sites of 274 approved drugs, in an all-against-all manner. While the binding sites of the approved drugs were already defined by the bound ligand, the entire protein surface of each of the 2,195 M.tb protein structures was scanned in order to identify alternative binding sites. For each pairwise comparison, a P-value representing the significance of the binding site similarity was calculated.
“The Mycobacterium Tuberculosis Drugome and Its Polypharmacological Implications.” Kinnings, S. L.; Xie, L.; Fung, K. H.; Jackson, R. M.; Xie, L.; and Bourne, P. E. PLoS Computational Biology, 2011.
Problem with current approaches: what the paper said vs what the software did
[Figure: actual computation]
Problem with current approaches
• Incomplete: missing steps and intermediate data
• Ambiguous: several interpretations about how computations are done
• Inconsistent level of detail: mixing of general methods with execution details
[Figure: workflow sketches illustrating the three problems. Labels: Step 1, Step ??, Step 2; Step 1, Step 2, Step 1’, Step 2’, Implementation 1?, Implementation 2?; Step 1, Step 2, Param1 = 2, File = “Input.txt”]
Our approach: From research outputs to text
Publication
Methods
Data
Execution traces
Report generation
Reports must:
• Be true to actual events
• Enable inspection
• Be human-understandable
• Abstract details
Data Narratives
• Interlinked record of
  • High level workflows (methods)
  • Provenance of results (method executions)
  • Data
  • Software metadata
• Persistent identifiers
• Data narrative accounts
  • Alternative descriptions of a result with a different level of detail
Truth to actual records
Inspectability
Human readable, levels of abstraction
Data Narrative Accounts: An example
How was the dataset used in this visualization generated?
• Execution view: inputs, parameters and main outputs
  “Topic modeling was run on the Reuters R8 dataset (10.6084/m9.figshare.776887), and English Words dataset (10.6084/m9.figshare.776888), with iterations set to 100, stop word size set to 3, number of topics set to 10 and batch size set to 10. The results are at 10.6084/m9.figshare.776856”
• Data view: just the data that influenced the results
  “The topics at 10.6084/m9.figshare.776856 were found in the Reuters R8 dataset (10.6084/m9.figshare.776887) and English Words dataset (10.6084/m9.figshare.776888)”
• Method view: main steps based on their functionality
  “Topic training was run on the input dataset. The results are product of PlotTopics, a visualization step”
• Dependency view: how the steps depend on each other
  “First, the input data is filtered by Stop Words, followed by Small Words, Format Dataset, and Train Topics. The final results are produced by Plot Topics”
• Implementation view: how the steps were implemented in the execution
  “Train topics was implemented using Latent Dirichlet allocation”
• Software view: details on the software used to implement the steps
  “The train topics step was generated with Online LDA open source software, written in Java. Plot topics was generated with the Termite software.”
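The views above can be read as different projections of a single execution record. The following is a minimal Python sketch of that idea, not DANA's implementation: the `ExecutionRecord` structure and the helper names (`join_phrases`, `data_view`, `execution_view`) are hypothetical, and only the dataset names, DOIs, and parameter values are taken from the example.

```python
from dataclasses import dataclass


@dataclass
class ExecutionRecord:
    """Hypothetical, simplified record of one workflow execution."""
    method: str
    inputs: dict       # dataset name -> DOI
    parameters: dict   # parameter name -> value
    output_doi: str


def join_phrases(phrases):
    """Join phrases as 'a, b and c'."""
    phrases = list(phrases)
    if len(phrases) < 2:
        return "".join(phrases)
    return ", ".join(phrases[:-1]) + " and " + phrases[-1]


def describe_inputs(record):
    """List every input dataset together with its identifier."""
    return join_phrases(
        f"{name} dataset ({doi})" for name, doi in record.inputs.items()
    )


def data_view(record):
    """Data view: only the data that influenced the result."""
    return (f"The topics at {record.output_doi} were found in the "
            f"{describe_inputs(record)}")


def execution_view(record):
    """Execution view: inputs, parameters and main outputs."""
    params = join_phrases(
        f"{name} set to {value}" for name, value in record.parameters.items()
    )
    return (f"{record.method} was run on the {describe_inputs(record)}, "
            f"with {params}. The results are at {record.output_doi}")


record = ExecutionRecord(
    method="Topic modeling",
    inputs={"Reuters R8": "10.6084/m9.figshare.776887",
            "English Words": "10.6084/m9.figshare.776888"},
    parameters={"iterations": 100, "stop word size": 3,
                "number of topics": 10, "batch size": 10},
    output_doi="10.6084/m9.figshare.776856",
)

print(data_view(record))
print(execution_view(record))
```

Both functions read the same record. The level of detail is chosen by which fields each view includes, so the text stays tied to the recorded values.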
DANA: DAta NArratives
[Figure: DANA architecture. Components: Experiment Records, Provenance Repository, Software registry, Experiment-specific Knowledge Base, Query patterns, Data Narrative aggregator, DANA Generator, Narrative accounts. Arrow labels: Input, Resource request, Response, Get pattern, Get query pattern result, Output]
1. Identify which experiment records to describe
2. Generation of an Experiment-specific knowledge base
3. Creation of the Data Narrative from templates
4. Produce narrative accounts
Generation of an experiment-specific knowledge base: scientific workflows
WINGS workflow system
• High level workflow templates that can be elaborated through component ontologies
http://www.wings-workflows.org/
Generation of an experiment-specific knowledge base: provenance records as RDF
See a hyperlinked description/visualization at its persistent URL: https://goo.gl/v8EPg5
http://www.opmw.org/export/page/resource/WorkflowExecutionAccount/ACCOUNT1348628778528
10.6084/m9.figshare.776887
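To illustrate how a provenance record can be queried once it is expressed as subject-predicate-object statements, here is a minimal sketch in plain Python. It does not use a real RDF store, and the identifiers (`ex:execution1`, `ex:usedInput`, `ex:generated`, `ex:executedStep`) are hypothetical stand-ins rather than the vocabulary of the published records; only the DOIs come from the example in this talk.

```python
# Provenance statements as (subject, predicate, object) triples.
# All "ex:" names are hypothetical; only the DOIs come from the example.
TRIPLES = [
    ("ex:execution1", "ex:usedInput", "doi:10.6084/m9.figshare.776887"),
    ("ex:execution1", "ex:usedInput", "doi:10.6084/m9.figshare.776888"),
    ("ex:execution1", "ex:generated", "doi:10.6084/m9.figshare.776856"),
    ("ex:execution1", "ex:executedStep", "ex:TrainTopics"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def inputs_of_result(triples, result):
    """Query pattern: which inputs did the execution that generated `result` use?"""
    inputs = []
    for execution, _, _ in match(triples, predicate="ex:generated", obj=result):
        inputs += [
            o for _, _, o in match(triples, subject=execution,
                                   predicate="ex:usedInput")
        ]
    return inputs


print(inputs_of_result(TRIPLES, "doi:10.6084/m9.figshare.776856"))
```

A query pattern of this kind returns the bindings that a narrative account needs, here the input datasets behind a given result.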
Generation of an experiment-specific knowledge base: Software metadata
• Catalog of motifs [Garijo et al 2013]
  • A catalog of common domain independent workflow patterns based on the functionality of workflow steps
• OntoSoft distributed software registry [Gil et al 2016]
  • Descriptions of hundreds of software components
  • Key metadata of software: license, usage, authors, web page, code repository, etc.
[Garijo et al 2013]: Common Motifs in Scientific Workflows: An Empirical Analysis. Garijo, D.; Alper, P.; Belhajjame, K.; Corcho, O.; Gil, Y.; and Goble, C. Future Generation Computer Systems, 2013.
http://purl.org/net/wf-motifs
http://www.ontosoft.org/portal
[Gil et al 2016]: OntoSoft: A Distributed Semantic Registry for Scientific Software. Gil, Y.; Garijo, D.; Mishra, S.; and Ratnakar, V. In Proceedings of the Twelfth IEEE Conference on eScience, Baltimore, MD, 2016.
Generating narrative accounts
RDF
Account template
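The step from query results to text can be sketched as template filling. This is a minimal illustration using Python's standard `string.Template`. The template wording mirrors the implementation and software account examples shown earlier, but the binding names (`step`, `algorithm`, `software`, `language`) and the `render_account` helper are assumptions, not DANA's actual templates.

```python
from string import Template

# Hypothetical account templates keyed by view name.
ACCOUNT_TEMPLATES = {
    "implementation": Template("$step was implemented using $algorithm"),
    "software": Template(
        "The $step step was generated with $software open source software, "
        "written in $language."
    ),
}


def render_account(view, bindings):
    """Fill the template for `view` with values taken from query results.

    Raises KeyError if the view is unknown or a required binding is missing,
    so an incomplete record never yields a partially filled sentence.
    """
    return ACCOUNT_TEMPLATES[view].substitute(bindings)


print(render_account(
    "implementation",
    {"step": "Train topics", "algorithm": "Latent Dirichlet allocation"},
))
print(render_account(
    "software",
    {"step": "train topics", "software": "Online LDA", "language": "Java"},
))
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a missing value fails loudly, which fits the requirement that reports stay true to the actual records.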
Formative evaluation
• Survey with 6 target scenarios
• Each scenario:
  • Description of a situation where a user has to do a task
  • A workflow sketch of the analysis done
  • Six candidate narratives of that workflow sketch
• 12 responses from users
• Results
  • Each narrative is considered appropriate for describing some scenario
  • Different users chose different narratives for each scenario
Summary: Benefits of Data Narratives
Features                 Data Narratives  Provenance Records  Visualizations  Articles  Electronic Notebooks
Truth to actual records  Y                Y                   Just data       Maybe     Maybe
Enable inspection        Y                Y                   Just data       N         Y
Human understandable     Y                N                   Y               Y         Y
Abstract details         Y                N                   Y               Y         N
Part of papers           Y                N                   Y               Y         Maybe
Persistent               Y                Maybe               N               Y         Maybe
Different audiences      Y                N                   N               N         N
Automatically generated  Y                Y                   Maybe           N         N
Conclusions and future work
• Data Narratives
  • Interlink data, software, workflows and provenance of a scientific experiment
  • Persistent identifiers
  • Narrative accounts
• Future work:
  • Ease navigation through levels of detail
  • Mixing details of different narratives
  • Improve summarization of results
  • Additional evaluation of narrative usefulness
See more: http://dgarijo.github.io/DataNarratives/
TOWARDS AUTOMATING DATA NARRATIVES
Yolanda Gil, Daniel Garijo
Information Sciences Institute and Department of Computer Science
@yolandagil, @dgarijov
{gil,dgarijo}@isi.edu