Kurator Project Overview (Brief)

Post on 23-Jun-2015




eScience Research Round Table (ERRT) @ GSLIS on Data Curation for Biodiversity Informatics

Transcript of Kurator Project Overview (Brief)

KURATOR: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data

Bertram Ludäscher

Graduate School of Library and Information Science (GSLIS)
National Center for Supercomputing Applications (NCSA)

ERRT @ GSLIS 10/22/2014 2

• Kurator:
  – What problems is Kurator tackling and for whom?
  – Curation Workflow Example
  – How we’re going about it

• Not Today:
  – Related Biodiversity Informatics Projects
    • Filtered-Push
    • Exploring Taxon Concepts (ETC)
    • Euler
  – Other Informatics Projects
    • DataONE
    • SKOPE



What is Kurator?

• NSF-DBI #1356751 – Collaborative Research: ABI Development: Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data
  – Sept. 2014 – 2017
  – @Illinois:
    • B. Ludäscher, James Macklin, Tim McPhillips, …
  – @Harvard:
    • James Hanken, Paul Morris, Bob Morris, …


Problem: Data & Metadata Quality

• Collections & occurrence data is all over the map
  – … literally (off the map!)

• Issues:
  – Lat/Long transposition, coordinate & projection issues
  – Scientific Names (spelling errors, other)
  – Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … (you name it)

• Precursor:
  – Filtered-Push Collaboration
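As an illustration of the lat/long transposition issue listed above, here is a minimal Python sketch of a range-based check. It is not taken from Kurator or Filtered-Push; the function name `check_coordinates` and its three return labels are assumptions made for this example. It only catches transpositions in which the swapped value falls outside the valid latitude range, so a swapped pair such as (10, 20) would still pass.

```python
def check_coordinates(lat, lon):
    """Classify a (lat, lon) pair as 'ok', 'possibly_transposed', or 'invalid'.

    Hypothetical helper for illustration only; not part of any Kurator API.
    """
    lat_ok = -90.0 <= lat <= 90.0
    lon_ok = -180.0 <= lon <= 180.0
    if lat_ok and lon_ok:
        return "ok"
    # If the pair becomes valid once the two values are swapped,
    # a lat/long transposition is a likely explanation.
    if -180.0 <= lat <= 180.0 and -90.0 <= lon <= 90.0:
        return "possibly_transposed"
    return "invalid"


if __name__ == "__main__":
    for pair in [(42.37, -71.11), (-122.3, 47.6), (200.0, 95.0)]:
        print(pair, check_coordinates(*pair))
```

A record classified as "possibly_transposed" would be flagged for a curator rather than silently swapped, since the range test alone cannot confirm the cause.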


What Problems does Kurator try to solve?

• Detect and flag data quality issues

• Repair if possible

• Keep track of provenance
  – automatic repairs
  – human curator edits
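The three goals above (detect and flag, repair if possible, keep track of provenance) can be sketched in a few lines of Python. This is an illustrative sketch only, not Kurator's actual data model: the classes `CuratedRecord` and `ProvenanceEntry`, the `agent` naming scheme, and the tiny name lookup tables are all assumptions invented for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ProvenanceEntry:
    field_name: str
    old_value: Optional[str]
    new_value: str
    agent: str       # hypothetical convention: "auto:<tool>" or "human:<curator>"
    comment: str
    timestamp: str


@dataclass
class CuratedRecord:
    data: dict
    flags: list = field(default_factory=list)
    provenance: list = field(default_factory=list)

    def flag(self, field_name, message):
        """Record a detected issue without changing the data."""
        self.flags.append((field_name, message))

    def repair(self, field_name, new_value, agent, comment):
        """Change a value and log who changed it, from what, and why."""
        self.provenance.append(ProvenanceEntry(
            field_name, self.data.get(field_name), new_value, agent, comment,
            datetime.now(timezone.utc).isoformat()))
        self.data[field_name] = new_value


# Illustrative lookup tables (assumed, not a real name authority).
KNOWN_NAMES = {"Quercus alba"}
KNOWN_FIXES = {"Quercus abla": "Quercus alba"}


def validate_scientific_name(record):
    name = record.data.get("scientificName", "")
    if name in KNOWN_NAMES:
        return record
    if name in KNOWN_FIXES:
        # Automatic repair, logged as provenance.
        record.repair("scientificName", KNOWN_FIXES[name],
                      agent="auto:name_validator",
                      comment="corrected known misspelling")
    else:
        # Cannot repair automatically: flag for a human curator.
        record.flag("scientificName", "name not recognized; needs review")
    return record
```

A human curator edit goes through the same `repair` method with an agent such as `"human:curator"`, so automatic and manual changes end up in one provenance log.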


Who are the customers?

• Collection Managers
  – … who are managing the collections databases
  – Can run curation workflows periodically
    • … in the presence of new data and/or new curation services

• (Biodiversity) Researchers
  – To perform an analysis in the presence of (partially) dirty data, researchers need to
    • Clean or fix dirty data
    • Throw out unfixable data
  – Pushing changes to the original data collections and collection managers (cf. FPush)


Example: Kepler/Kurator (FPush project)


Simplified Example Workflow

• Related Research (Tianhong Song, UC Davis)
  – Analyze linear workflow “story”
  – Use patterns to discover wf design issues (e.g. use before update); then fix them
  – Parallelize when possible

• Kurator:
  – Allow easy assembly of such workflows
  – For tool makers
  – … and tool users
  – … scalability



Example Output …


… close up …


How we do it

• Build a library of curation services such that curation workflows can be run from various platforms
  – Scientific workflow systems
    • e.g. Restflow, Kepler, Taverna, Galaxy
  – Other platforms
    • e.g. Akka, Python-based, …

• … leveraging existing technologies


How we do it

• Open source, community-friendly approach
  – git repository (NCSA open source projects)

• Agile software development
  – NCSA support tools, e.g. JIRA, Bamboo

• Inspired by
  – Small bioinformatics tools manifesto (post-facto)
  – Unix tenets (small, interoperable tools, … )
  – Experience with other (sometimes not so agile) development projects


Kurator: Agile Development


Q & A …

• What does data curation, quality control mean in your domain / application / research?

• Are there particular issues that are important to you?

• Join us!
  – Kurator & other Biodiversity Interest
    • Hackers welcome, too.
  – Email: ludaesch@illinois.edu


Related Research (Tianhong Song)

• Automated Design, Analysis, Optimization of Curation Workflows.

• Idea:

• Example Workflow: [Scientific Name Validation] [GeoRef Validation] [Date Validation]
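The three-step example workflow above can be written as a plain linear pipeline. The Python sketch below is a simplification under stated assumptions: the field names (`scientificName`, `decimalLatitude`, `decimalLongitude`, `eventDate`) follow Darwin Core terms, which the slides do not mention, and the checks themselves are deliberately crude stand-ins for real validation services.

```python
from datetime import date


def validate_scientific_name(record):
    # Crude stand-in check: expect at least a genus and an epithet.
    name = (record.get("scientificName") or "").strip()
    if len(name.split()) >= 2:
        return []
    return ["scientificName: expected at least genus and epithet"]


def validate_georef(record):
    try:
        lat = float(record.get("decimalLatitude"))
        lon = float(record.get("decimalLongitude"))
    except (TypeError, ValueError):
        return ["georef: missing or non-numeric coordinates"]
    issues = []
    if not -90 <= lat <= 90:
        issues.append("georef: latitude out of range")
    if not -180 <= lon <= 180:
        issues.append("georef: longitude out of range")
    return issues


def validate_date(record):
    # Only single ISO calendar dates are accepted in this sketch.
    try:
        date.fromisoformat(record.get("eventDate"))
    except (TypeError, ValueError):
        return ["eventDate: not an ISO date (YYYY-MM-DD)"]
    return []


# The linear workflow "story": steps run one after another.
WORKFLOW = [validate_scientific_name, validate_georef, validate_date]


def run_workflow(record, steps=WORKFLOW):
    issues = []
    for step in steps:
        issues.extend(step(record))
    return issues
```

Keeping every step as a function from a record to a list of issues is one way to make steps easy to reorder, replace, or call from a different platform.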


Related Research (Tianhong Song)

• Analyze linear workflow “story”

• Use patterns to discover wf design issues (e.g. use before update); then fix them

• Parallelize when possible
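One plausible reading of the "use before update" pattern and the parallelization idea above is sketched below in Python. It is not the method from the cited research, whose details are not given here. The sketch assumes each workflow step declares which record fields it reads and writes; the `Step` class, the helper names, and the example step names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    reads: frozenset
    writes: frozenset


def use_before_update(steps):
    """Find (reader, updater, fields): an earlier step reads fields
    that a later step in the linear workflow still updates."""
    findings = []
    for i, early in enumerate(steps):
        for late in steps[i + 1:]:
            stale = early.reads & late.writes
            if stale:
                findings.append((early.name, late.name, sorted(stale)))
    return findings


def conflicts(a, b):
    """True if step b must stay after step a (shared written fields)."""
    return bool(a.writes & (b.reads | b.writes) or a.reads & b.writes)


def parallel_stages(steps):
    """Group steps into stages; steps within a stage share no written
    fields and could run in parallel."""
    level = {}
    for j, step in enumerate(steps):
        deps = [level[p.name] for p in steps[:j] if conflicts(p, step)]
        level[step.name] = 1 + max(deps, default=-1)
    stages = {}
    for step in steps:
        stages.setdefault(level[step.name], []).append(step.name)
    return [stages[k] for k in sorted(stages)]
```

With the three validators from the example workflow each touching only its own fields, `parallel_stages` puts them in a single stage. A hypothetical report step that reads `eventDate` before a date validator rewrites it would be reported by `use_before_update`.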