Preservation, Publishing, and People: A SEAD View

Preservation, Publishing, and People: a SEAD View Beth Plale Director, Data To Insight Center Indiana University IU Scholarworks


Slides from SEAD's workshop on the Virtual Archive. This took place on Jun 30, 2014 in Bloomington, IN.

Transcript of Preservation, Publishing, and People: A SEAD View

Page 1: Preservation, Publishing, and People: A SEAD View

Preservation, Publishing, and People: a SEAD View

Beth Plale

Director, Data To Insight Center

Indiana University

IU Scholarworks

Page 2: Preservation, Publishing, and People: A SEAD View

Publishable results of computationally-based science rarely take the form of a single data file or a homogeneous collection.

More often they form a bundle: primary results, metadata describing the generated data, software used, configuration parameters used with the software, input data sources, ….

We call these bundles Research Objects

Bechhofer, S., Buchan, I., De Roure, D., Missier, P., Ainsworth, J., Bhagat, J., … & Goble, C. (2011). Why linked data is not enough for scientists. Future Generation Computer Systems, 29(2), 599–611.

Page 3: Preservation, Publishing, and People: A SEAD View

Data lifecycle

• Research occurs over months to years. Example: Praveen Kumar's study of the Mississippi River Basin flood of late April and early May 2011.
• Arrange funding, define objectives (2011)
• Data gathering: sample flood plain at designated locations, take pictures, obtain satellite data, contract with an independent organization to fly over the area with Lidar
• Data cleaning and analysis
• Publish 2-3 papers (2014)
• Decide what data to package for publishing alongside publications
• Publish the datasets
• Each published package we call a Research Object

Page 4: Preservation, Publishing, and People: A SEAD View

Publish‐reuse window

• We focus on one window of time in the lifecycle of research data: it starts when the researcher is ready to make data publicly available and runs through the first use of that data by an unrelated party (reuse). We call this the “publish-reuse window”.
• Why this window?
• Repository services have to be self-documenting to achieve reproducibility. “I derive a new object from an object in SEAD VA” and “I revise an object in SEAD VA” are different actions, by different people, with different implications.

Page 5: Preservation, Publishing, and People: A SEAD View

Publish-reuse window and important actors

[Figure: timeline of stages leading into and through the publish-reuse window]

• Data in shared file system, or other project space: researcher brings together, organizes, cleans, and analyzes data
• Package up into Research Object: researcher organizes and preps data for publishing
• Researcher initiates submission for deposit
• Data curator examines object, augments, and approves
• SEAD VA unpacks RO and processes for deposit to IR
• Download RO, new object created: data scientist uses published RO in his/her research

Actors: Data creator, Curator, Data scientist

Page 6: Preservation, Publishing, and People: A SEAD View

Research Object: what the RO is

• The Research object (RO) is an aggregation of resources that can be transferred, produced, and consumed by common services across organizational boundaries. The RO encapsulates digital knowledge and provides a mechanism for sharing and discovering re‐usable research. 

• ROs are a bundle of primary results, metadata describing the generated data, software used, configuration parameters used with the software, input data sources, …

• An RO can and will likely have multiple manifestations.
• Research object is the publishable object.

Page 7: Preservation, Publishing, and People: A SEAD View

Why is Research Object view important?

• Addresses weaknesses in existing solutions:  The hierarchical “belongs to” organization of information is extremely inadequate for all but simplest cases. 

• Facilitates reproducibility: We can no longer look just at data products:  software is critical for reproducibility (even if repeatability is not the goal.) 

• Allows for uniform handling:  Research object is dropped into a BagIT bag (1 bag = 1 RO).  SEAD VA accepts bags of all colors, but all are bags.  Lifecycle of ROs tracked in SEAD VA

• Just makes sense:  When is the result of a scientific dissertation a uniform collection of files with fixed directory structure?  <answer: never> 

Page 8: Preservation, Publishing, and People: A SEAD View

Research Object, Components of

• Identity: unique ID
• Entities: core data or software objects themselves
• Properties: Aggregation: “belongs to” relationship, used to aggregate within Research Object
• Properties: Relationships: “related to” relationship
• Properties: Descriptive/Annotative: metadata
• Properties: Provenance: “derived from”, “versioned from” relationship as well as others
• Properties: Agents: data creator (author list), curator, data scientist
• State: external to the RO

Page 9: Preservation, Publishing, and People: A SEAD View

Research Object

Page 10: Preservation, Publishing, and People: A SEAD View

Research Object State Transition

An RO is in one of three states: LO, CO, and PO, as follows:

• Live object (LO) – a work in progress.  Data creator assembling content for publication

• Curation object (CO) – an object after creator signaled intention to publish.  Curator works on the curation object; changes are selective. 

• Publication object (PO) - a final version ready to be disseminated widely. Publication objects (POs) are mutable under certain conditions only.

RO described by model:

RO = {s, dm, c}

where s is the state of an RO at any point in time, dm is its descriptive metadata, and c is the entities (core content) and relationships amongst entities.

Page 11: Preservation, Publishing, and People: A SEAD View

RO = {s, dm, c}

where s is the state of an RO at any point in time, dm is its descriptive metadata, and c is the resources and relationships amongst resources.

State transition graph
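The state transition graph itself did not survive in this transcript. As a minimal sketch of the RO = {s, dm, c} model, the Python below assumes the simplest reading of the three states: a live object becomes a curation object when the creator signals intent to publish, and a curation object becomes a publication object on approval. The class names, the `ALLOWED` table, and the sample metadata are illustrative assumptions, not SEAD VA code.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    LO = "live object"
    CO = "curation object"
    PO = "publication object"


# Assumed edges only: LO -> CO -> PO. The slides do not enumerate the
# transitions, and conditional changes to a PO are not modeled here.
ALLOWED = {State.LO: {State.CO}, State.CO: {State.PO}, State.PO: set()}


@dataclass
class ResearchObject:
    s: State = State.LO                     # state
    dm: dict = field(default_factory=dict)  # descriptive metadata
    c: dict = field(default_factory=dict)   # entities and their relationships

    def transition(self, target: State) -> None:
        """Move to `target` if the assumed transition table permits it."""
        if target not in ALLOWED[self.s]:
            raise ValueError(f"cannot move from {self.s.name} to {target.name}")
        self.s = target


ro = ResearchObject(dm={"title": "Flood plain samples"})  # hypothetical metadata
ro.transition(State.CO)  # creator signals intention to publish
ro.transition(State.PO)  # curator approves
```

Keeping the allowed edges in one table means a different reading of the graph only requires changing `ALLOWED`.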

Page 12: Preservation, Publishing, and People: A SEAD View

Architectural implications of RO model

SEAD Virtual Archive

User Interface (GWT web application)

Ingest workflow (Data Conservancy)

Komadu Provenance System

VA Registry

RO Subsystem

Matchmaker

Page 13: Preservation, Publishing, and People: A SEAD View

RO model implemented in SEAD VA

SEAD Virtual Archive

Ingest workflow (Data Conservancy)

Komadu Provenance System

VA Registry

RO Subsystem

Matchmaker

Extended ingest workflow to seamlessly:
- Extract RO from BagIt bag
- Transition from RO to the SIP model of Data Conservancy

User Interface (GWT web application)

Page 14: Preservation, Publishing, and People: A SEAD View

RO model implemented in SEAD VA

SEAD Virtual Archive

Ingest workflow (Data Conservancy)

Komadu Provenance System

VA Registry

RO Subsystem

Matchmaker

Extended SEAD VA with registry and provenance tracking to implement RO lifecycle.  Modular functionality (built outside DC for portability)

User Interface (GWT web application)

Page 15: Preservation, Publishing, and People: A SEAD View

People: Data Creator, Curator, and Data Scientist

Page 16: Preservation, Publishing, and People: A SEAD View

The Data Creator, Curator, and Data Scientist are related to one another through the Research Objects that they create, work on, and use.

This relationship information exists in the form of provenance in SEAD VA. Future work is to capture these nuanced relationships in the SEAD Research Network as well.

Page 17: Preservation, Publishing, and People: A SEAD View

And onto … SEAD VA Workshop Agenda and Resources

http://bit.ly/sead‐va‐workshop063014

Page 18: Preservation, Publishing, and People: A SEAD View

Data Creator in SEAD VA

Inna Kouper

Page 19: Preservation, Publishing, and People: A SEAD View

Overview

The Data Creator collects data and, once done with a study, gathers materials that support the study and submits them for publication and preservation in institutional repositories.

Example: A dissertation that is based on:
• images from USGS
• spreadsheets with numbers and calculations
• computing scripts
• videos of experiments

In VA a data creator can:
• Upload research objects (ROs)
• Preview, review and download ROs
• Check status of ROs in queue to IR

Page 20: Preservation, Publishing, and People: A SEAD View

Background : SEAD Services

• SEAD Research Network
• Project Spaces
• Packaging and Mapping

Page 21: Preservation, Publishing, and People: A SEAD View

Research Network

• Network of data creators, curators and re-use scientists across disciplines
• Rich ontology to support links to data, projects and publications
• Visualizations of co-authorship and co-citation

Page 22: Preservation, Publishing, and People: A SEAD View

ORCiD / SEAD Research Network Integration

• Create empty profile in VIVO
• Execute harvester
• Ingest data

Page 23: Preservation, Publishing, and People: A SEAD View

Project Spaces

• 15 project spaces (incl. an open demo space and an internal testing space)

• Thousands of collections in active curation
• Once a collection is marked for publication, it can be ingested into Virtual Archive

Page 24: Preservation, Publishing, and People: A SEAD View

Project Space = Active Content Repository (ACR)

Page 25: Preservation, Publishing, and People: A SEAD View
Page 26: Preservation, Publishing, and People: A SEAD View

Packaging and Mapping (BagIT / ORE)

• BagIt format
  • standardized “envelopes” (bags)
  • no requirements for “knowing” internal semantics
  • 3 elements: a bag declaration (bagit.txt), a manifest file (manifest-<algorithm>.txt), and a folder with content (data)
• Tools available for bagging
  • SEAD BagIt service
  • LOC Bagger tool (http://sourceforge.net/projects/loc-xferutils/files/loc-bagger/2.1.2/)
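To make the three elements of a bag concrete, here is a minimal sketch that writes a bag by hand using only the Python standard library. It is not the SEAD BagIt service or the LOC Bagger tool; the function name `make_bag`, the choice of SHA-256 for the manifest, and the sample file are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def make_bag(bag_dir: Path, files: Dict[str, bytes]) -> None:
    """Write a minimal bag: bagit.txt, manifest-sha256.txt, and data/."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)

    manifest_lines = []
    for name, content in sorted(files.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Each manifest line pairs a checksum with a path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")

    # The bag declaration names the BagIt version and tag-file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


# Hypothetical one-file research object written to a temporary directory.
bag = Path(tempfile.mkdtemp()) / "test_bag"
make_bag(bag, {"_readme.txt": b"Sample research object\n"})
```

Because the manifest only lists checksums and paths, the receiving service can verify the payload without knowing anything about its internal semantics.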

Page 27: Preservation, Publishing, and People: A SEAD View

Resource Maps

• OAI/ORE standard
  • Exposes rich content
  • Captures semantics of relationships among RO items
  • Identifies aggregations
• SEAD VA OAI/ORE relationship classes:
  • Aggregation
  • Description
  • Authorship
  • Copyright / rights
  • Modification
  • Derivation
  • Citation
  • Processing (calculation, computation, etc.)

Page 28: Preservation, Publishing, and People: A SEAD View

OAI‐ORE Example

[Figure: OAI-ORE example graph. Nodes: Resource Map, Aggregation, Aggregation 2.0, _readme, spreadsheet, spreadsheet 1.1, image, Image 2.0. Edge labels: describes, aggregates, wasDerivedFrom, wasModifiedFrom.]

Page 29: Preservation, Publishing, and People: A SEAD View

OAI/ORE Map Example

<rdf:RDF…
  <rdf:Description rdf:about=URI>  <!-- data item -->
    <ore:isAggregatedBy>ID</ore:isAggregatedBy>
    <dcterms:identifier rdf:datatype=URI>ID</dcterms:identifier>
    <dcterms:title rdf:datatype=URI>Vortex_Mining.xlsx</dcterms:title>
    <dcterms:source rdf:datatype=URI>test_bag/data/Vortex_Mining.xlsx</dcterms:source>
    <!-- A related resource from which the described resource is derived. -->
  </rdf:Description>
  …..
</rdf:RDF>
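As a rough illustration of how such a map can be read, the sketch below parses a small, well-formed stand-in for the excerpt above with Python's standard `xml.etree.ElementTree`. The `urn:example:` identifiers and the `list_items` helper are hypothetical; the three namespace URIs are the standard ones for RDF, ORE terms, and Dublin Core terms. A real SEAD VA oaiore.xml may carry more properties than are shown here.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ore": "http://www.openarchives.org/ore/terms/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Stand-in resource map; identifiers are made up for this example.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ore="http://www.openarchives.org/ore/terms/"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="urn:example:item-1">
    <ore:isAggregatedBy>urn:example:aggregation-1</ore:isAggregatedBy>
    <dcterms:title>Vortex_Mining.xlsx</dcterms:title>
    <dcterms:source>test_bag/data/Vortex_Mining.xlsx</dcterms:source>
  </rdf:Description>
</rdf:RDF>"""


def list_items(xml_text):
    """Return one dict per rdf:Description with its aggregation, title, and source."""
    root = ET.fromstring(xml_text)
    items = []
    for desc in root.findall("rdf:Description", NS):
        items.append({
            "about": desc.get(f"{{{NS['rdf']}}}about"),
            "aggregated_by": desc.findtext("ore:isAggregatedBy", namespaces=NS),
            "title": desc.findtext("dcterms:title", namespaces=NS),
            "source": desc.findtext("dcterms:source", namespaces=NS),
        })
    return items


items = list_items(SAMPLE)
```

The `source` value is the path of the file inside the bag, which is how an entry in the map can be tied back to the bagged payload.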

Page 30: Preservation, Publishing, and People: A SEAD View

Demo / Hands On

[Data creator role]

Page 31: Preservation, Publishing, and People: A SEAD View

Download Test Research Objects

Or go to https://iu.box.com/sead‐va‐test‐bags

Page 32: Preservation, Publishing, and People: A SEAD View

Register / Sign In

• Go to http://seadva-test.d2i.indiana.edu:5672/sead-access/
• Click LOG IN and fill in your login information (or click SignUp below)

Page 33: Preservation, Publishing, and People: A SEAD View

Upload Research Object

• On the Upload Data tab, click “Choose File”
• Select a test dataset in the dialog window
• Click upload

Upload Data Tab

Page 34: Preservation, Publishing, and People: A SEAD View

Review Research Object

• Check that the object is correct

• Change project name and description

• Agree to the license terms

• Click “Submit Dataset for Review”

Page 35: Preservation, Publishing, and People: A SEAD View

Status and Success Messages

Page 36: Preservation, Publishing, and People: A SEAD View

Trace Activity

• Go to activity tab
• See all actions performed by you
• Click on the dataset name to see details

Activity tab

Page 37: Preservation, Publishing, and People: A SEAD View

View Research Object Details

Page 38: Preservation, Publishing, and People: A SEAD View

Receive Notification

• After the next part of the tutorial, check your inbox for email from SEAD VA

Page 39: Preservation, Publishing, and People: A SEAD View

Curator in SEAD VA

Kavitha Chandrasekar

Page 40: Preservation, Publishing, and People: A SEAD View

Overview

The Curator works on Research Objects created and submitted by Data Creators: reviews the submission, modifies metadata, and takes action to move the submission to their Institutional Repository.

In VA a curator can:
• Select an item for review from the curation queue
• Enhance metadata
• Deposit to Institutional Repository

Page 41: Preservation, Publishing, and People: A SEAD View

“Under the Hood”

IR Recommendation and IR Description

Page 42: Preservation, Publishing, and People: A SEAD View

Automatic IR Recommendation (SEAD VA Matchmaker)

• Matches ROs to compatible Institutional Repositories
• Recommends best Institutional Repository match for RO
• Facilitates transfer and deposit of heterogeneous ROs

Page 43: Preservation, Publishing, and People: A SEAD View

IR Recommendation Flow

[Figure: IR recommendation flow]

• Submit: user-initiated
• Deposit: RO received by SEAD VA
• Stage: for decision making
• Execute Rules: rules engine
• Send to Curator queue: workflow-initiated

IR Matchmaker: add to IR queue based on match found, e.g. IU Scholarworks or Ideals

Page 44: Preservation, Publishing, and People: A SEAD View

IR – SEAD VA “contract”: the Service Level Agreement

• Service Level Agreement (SLA) is a contract of sorts between SEAD VA and an Institutional Repository. It captures:
  • Repository requirements and privileges
  • Repository services
• The IR Recommendation system uses excerpts from an IR’s SLA to identify compatible pairs of datasets and repositories during RO deposit.

Page 45: Preservation, Publishing, and People: A SEAD View

Service Level Agreement: Requirements and Privileges (summary)

• RO properties: Requirements
  • Data contributor Institutional Affiliation
  • Scientific Domain
  • Data Organization (e.g. BagIt or SWORD)
  • Size
  • Versioning
  • Minimal Metadata
  • Licensing (e.g. open, embargoed)
• Repository privileges
  • Repository is free to re-distribute the RO received from SEAD VA, except in case of embargo.
  • Repository can migrate RO into other formats and re-distribute migrated ROs.
  • Repository curators can annotate data collections to comply with standards or upgrades in our policies.

Page 46: Preservation, Publishing, and People: A SEAD View

SLA – Repository Service Guarantees

• Long-term preservation
• Format Migration
• Archival support
• Embargo
• Access
• DOI generation
• Technical guarantees:
  • Limited Downtime
  • Data Ingest Time
  • Backup
  • Integrity checks

Page 47: Preservation, Publishing, and People: A SEAD View

Excerpt from SLA for IU Scholarworks

• Institutional Affiliation
  • At least one author, at the time of deposit, belongs to the same institution as our repository.
• RO Size
  • 150 MB for items uploaded directly to IUScholarWorks, 10 GB total
  • 5 TB for items hosted on the SDA
• Versioning
  • Only final PO is accepted; subsequent versions will substitute the version of record.
• Scientific Domain (curator review might be needed)
  • ROs are associated with research in the domains of ANY (identify specific domains or put “sustainability science” for a broader match)

Page 48: Preservation, Publishing, and People: A SEAD View

The IR Recommender's use of an SLA

• IR Recommender implements an IR’s SLA as a set of executable rules in the Matchmaker. The rules are executed with a rules engine called “Drools”.
• Rules can be added on the fly, meaning a new IR can be added just by specifying an SLA.
• Modifications to an SLA can be incorporated into the rules at runtime.
• Clean mapping of SLA terms to Drools rules

Page 49: Preservation, Publishing, and People: A SEAD View

Mapping SLA to Drools rules

rule "IU Scholarworks Affiliation rule"
    dialect "mvel"
    salience 20
when
    SeadDeliverableUnit( title != null ) // Per IU SLA, collection should have a title
    SeadDeliverableUnit( $contributors : dataContributors )
    eval( $contributors.size > 0 ) // Creators should not be empty per IU SLA
    $seadPerson : SeadPerson( idType == "vivo" && getEmail(id) == "Indiana University" ) from $contributors;
    $seadDu : SeadDeliverableUnit( sizeBytes < 10000000000 ) // Total collection size less than approximately 10 GB
    SeadDeliverableUnit( fileNo < 1000 ) // Total file count less than 1000
    SeadDeliverableUnit( "CC" in (rights) ) // Open access data
then
    addRepository("iu", 2); // Add IU repository to the queue of matched repositories with priority 2
end

Figure callouts: Rule declaration, Condition, Execution, Affiliated data contributor found
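For readers unfamiliar with Drools, the conditions in the rule above can be restated as an ordinary predicate. This Python sketch is only a paraphrase of the rule's logic; it is not how the Matchmaker is implemented. The `Person` and `DeliverableUnit` classes and their field names are simplified stand-ins for `SeadPerson` and `SeadDeliverableUnit`, and the affiliation check is reduced to a plain string comparison.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Person:
    affiliation: str  # simplified stand-in for the lookup done in the rule


@dataclass
class DeliverableUnit:
    title: Optional[str]
    contributors: List[Person]
    size_bytes: int
    file_count: int
    rights: List[str] = field(default_factory=list)


def iu_affiliation_rule(du: DeliverableUnit) -> bool:
    """Every condition must hold, as in the 'when' part of the rule."""
    return (
        du.title is not None                       # collection has a title
        and len(du.contributors) > 0               # creators not empty
        and any(p.affiliation == "Indiana University" for p in du.contributors)
        and du.size_bytes < 10_000_000_000         # under roughly 10 GB
        and du.file_count < 1000                   # under 1000 files
        and "CC" in du.rights                      # open access data
    )


def match_repositories(du: DeliverableUnit) -> List[Tuple[str, int]]:
    """Return (repository, priority) pairs, mirroring addRepository("iu", 2)."""
    matched = []
    if iu_affiliation_rule(du):
        matched.append(("iu", 2))
    return matched


# Hypothetical deposits: one that satisfies the rule and one that is too large.
ok = DeliverableUnit("Vortex mining", [Person("Indiana University")], 5_000_000, 12, ["CC"])
too_big = DeliverableUnit("Lidar scans", [Person("Indiana University")], 20_000_000_000, 12, ["CC"])
```

In a rules engine each repository contributes its own rule, so adding a repository means adding a rule rather than editing a function like `match_repositories`.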

Page 50: Preservation, Publishing, and People: A SEAD View

Demo / Hands On

[Curator role]

Page 51: Preservation, Publishing, and People: A SEAD View

Select Item from Curation Queue

Matched Institutional Repository

Click on Curate Tab

Assign RO to self for review by clicking “Assign to me”

Page 52: Preservation, Publishing, and People: A SEAD View

Download ReadMe file for Dataset under edit

Unzip Bag

Open data/_readme.txt

Page 53: Preservation, Publishing, and People: A SEAD View

Enhance Metadata

Click on ‘Edit’ button

Page 54: Preservation, Publishing, and People: A SEAD View

View Research Object in Edit mode

To edit, click on entities in the bottom pane

Page 55: Preservation, Publishing, and People: A SEAD View

Populate metadata from ReadMe file

To save changes, click on ‘Save Changes’ button

Page 56: Preservation, Publishing, and People: A SEAD View

Save Final Curation changes

Finally click on ‘Save Changes’ below

After changes are saved, click on ‘Back’ to go back to Curation queue

Page 57: Preservation, Publishing, and People: A SEAD View

Approve and Publish to Institutional Repository

Publish

Page 58: Preservation, Publishing, and People: A SEAD View

Trace Activity

• Go to activity tab
• See all actions performed by you
• Click on the Research Object name to see details

Activity tab

Page 59: Preservation, Publishing, and People: A SEAD View

View in Institutional Repository

Page 60: Preservation, Publishing, and People: A SEAD View

Data Reuse Scientist in SEAD VA

Isuru Suriarachchi

Page 61: Preservation, Publishing, and People: A SEAD View

Overview: The Data Scientist

The Data Scientist uses research objects that were created by someone else for his/her purposes and creates new research objects by modifying existing objects.

Super Simple Example: Putting images in given RO 3 into a single presentation and creating a new RO

Data scientist can:
• Search
• Download (bags)
• Modify
• Re-upload

Page 62: Preservation, Publishing, and People: A SEAD View

“Under the Hood”

Provenance, Component Interaction

Page 63: Preservation, Publishing, and People: A SEAD View

Provenance

• What is Provenance?
  • Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness
  • Also called “Lineage” or “Pedigree”
• Advantages of provenance for preservation
  • Derive ownership
  • Assess quality and trustworthiness
  • Reproducibility
  • Validation
  • Failure Tracing

Not used in Preservation Provenance

Page 64: Preservation, Publishing, and People: A SEAD View

Provenance in Repositories

• The provenance important here is provenance of a Research Object
• Why important?
  • For the data scientists in “Search”
    • To check ownership of RO
    • To assess quality and trustworthiness of RO
  • For the Curators
    • To check curation history
• Provenance role in “Publish-Reuse window”
  • Published Object (PO) Provenance
  • Curation Object (CO) Provenance

Page 65: Preservation, Publishing, and People: A SEAD View

Provenance Capture in SEAD VA

• Uses Komadu provenance system
  • Captures activity in real time, assembles new activity into internal representation as provenance graphs
  • W3C PROV spec compliant
• Terminology
  • Activity: Some Processing Event in SEAD VA
  • Entity: A Research Object (in CO or PO state)
  • Agent: Data Creator, Curator, Data Scientist

Page 66: Preservation, Publishing, and People: A SEAD View

Provenance among Published Objects

[Figure: activity sequence Create RO, Publish RO, Download RO, Upload RO’, Publish RO’, performed by the Data Creator, Curator, and Data Scientist]

Provenance is captured between the two published ROs (RO and RO’). The provenance relationship is:
• Derivation: if the Data Creator is not the same as the Data Scientist
• Revision: if the Data Creator is the same as the Data Scientist
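The derivation/revision distinction reduces to a comparison of two identities. A minimal sketch, assuming each person is represented by some stable identifier (the function name and the identifier strings are hypothetical):

```python
def provenance_relationship(creator_id: str, reuser_id: str) -> str:
    """Classify the link between RO and RO' by who produced each one."""
    # Same person re-uploading their own object: a revision.
    # A different person building on someone else's object: a derivation.
    return "revision" if creator_id == reuser_id else "derivation"


same_person = provenance_relationship("person-a", "person-a")
different_people = provenance_relationship("person-a", "person-b")
```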

Page 67: Preservation, Publishing, and People: A SEAD View

Maintaining Provenance among Published ROs

• Two identifiers maintained: DOI and Internal Identifier.
• Why two identifiers?
  • DOI: each RO has a unique DOI.
  • Internal Identifier: lineage is maintained through the internal identifier, which maintains the relationship between original object and derived object

Page 68: Preservation, Publishing, and People: A SEAD View

Provenance among Published Objects

• At first publish of RO, a DOI and Internal Identifier are added to oaiore.xml

• At Re‐upload

Page 69: Preservation, Publishing, and People: A SEAD View

Provenance among Published ROs

• Provenance relationships captured in Komadu
  • Entity-Entity (derivation): when the second publish is done
• This RO provenance capture continues for any number of publish, download, re-upload cycles
• At second publish (RO’), a “wasDerivedFrom” element is added to oaiore.xml, referring to the original Internal Identifier
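Because each re-published object points back to the internal identifier it was derived from, a lineage can be recovered by following those pointers. The sketch below assumes the “wasDerivedFrom” links have already been read into a plain dictionary; the `lineage` function and the `ro-1`, `ro-2`, `ro-3` identifiers are hypothetical and do not reflect Komadu's API.

```python
from typing import Dict, List


def lineage(internal_id: str, derived_from: Dict[str, str]) -> List[str]:
    """Follow wasDerivedFrom links from an object back to the original."""
    chain = [internal_id]
    seen = {internal_id}
    while chain[-1] in derived_from:
        parent = derived_from[chain[-1]]
        if parent in seen:
            # Guard against malformed data that would loop forever.
            raise ValueError(f"cycle detected at {parent}")
        chain.append(parent)
        seen.add(parent)
    return chain


# Two publish:download:re-upload cycles: ro-1 -> ro-2 -> ro-3.
links = {"ro-3": "ro-2", "ro-2": "ro-1"}
history = lineage("ro-3", links)
```

Here `history` lists the newest object first and the original last; an object with no recorded parent yields a one-element list.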

Page 70: Preservation, Publishing, and People: A SEAD View

Usage of Published Object Provenance• Data scientist can see lineage graph of her new RO’. This helps her assess the collection and is useful if original object changes (forward provenance). 

Page 71: Preservation, Publishing, and People: A SEAD View

Curation Time Provenance Capture

Create RO

Publish RO

Creator Curator

Provenance within Curation 

Page 72: Preservation, Publishing, and People: A SEAD View

Curation Time Provenance Capture

• Curation Activities
  • Curation-Edit-Event
  • Publish-Event
• Provenance relationships captured in Komadu
  • Agent-Activity: when some Agent triggers one of the above Activities
  • Activity-Entity: when an Activity generates (updates) a Research Object
• Example Scenario
  • Curator X edits metadata on research object Y
    • Agent-Activity relationship (association) between X and Curation-Edit-Event
    • Activity-Entity relationship (generation) between Curation-Edit-Event and Y
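The example scenario can be written down as two small records per event. This sketch keeps them in an in-memory log to show how the curator's view of "all actions on an RO" could be answered from those records. The `ProvenanceLog` class, its method names, and the `#n` activity identifiers are assumptions for illustration; Komadu's actual interface is not shown in the slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ProvenanceLog:
    # Agent-Activity relationships (association): (agent, activity id)
    associations: List[Tuple[str, str]] = field(default_factory=list)
    # Activity-Entity relationships (generation): (activity id, research object)
    generations: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, agent: str, activity_type: str, research_object: str) -> str:
        """Store both relationships for one event and return its activity id."""
        activity_id = f"{activity_type}#{len(self.associations) + 1}"
        self.associations.append((agent, activity_id))
        self.generations.append((activity_id, research_object))
        return activity_id

    def actions_on(self, research_object: str) -> List[Tuple[str, str]]:
        """List (agent, activity id) pairs for every event that touched the RO."""
        activity_ids = {a for a, ro in self.generations if ro == research_object}
        return [(agent, a) for agent, a in self.associations if a in activity_ids]


log = ProvenanceLog()
log.record("curator-X", "Curation-Edit-Event", "ro-Y")
log.record("curator-X", "Publish-Event", "ro-Y")
log.record("curator-Z", "Curation-Edit-Event", "ro-W")
actions = log.actions_on("ro-Y")
```

Giving each event its own activity id keeps two edits of different objects from being confused, even when they share the same activity type.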

Page 73: Preservation, Publishing, and People: A SEAD View

Usage of Provenance at Curation time

• Curator can see all actions he/she performed on a particular Research Object

Page 74: Preservation, Publishing, and People: A SEAD View

Component Interaction

[Figure: component interaction diagram. Labels: SEAD VA UI (Upload Bag/Publish RO, Curate/Provenance); SEAD VA Workflow (Local ID Generation, Persist RO, DOI Generation, Publish to IR); MatchMaker; RO Subsystem, with RO Subsystem API and Metadata/Provenance Processor; SEAD VA Registry; Komadu Provenance Server; REST API; WS API.]

Page 75: Preservation, Publishing, and People: A SEAD View

Demo / Hands On

[as a data scientist]

Page 76: Preservation, Publishing, and People: A SEAD View

Register / Sign In

• Go to http://seadva-test.d2i.indiana.edu:5672/sead-access/
• Click LOG IN and fill in your login information (or click SignUp below)

Page 77: Preservation, Publishing, and People: A SEAD View

Search for Data

Page 78: Preservation, Publishing, and People: A SEAD View

Find data

Filter

Page 79: Preservation, Publishing, and People: A SEAD View

Browse data collection

Page 80: Preservation, Publishing, and People: A SEAD View

Request Data Download

Page 81: Preservation, Publishing, and People: A SEAD View

Receive data download email

Page 82: Preservation, Publishing, and People: A SEAD View

Download Data

Modify Data

Page 83: Preservation, Publishing, and People: A SEAD View

Re‐Upload data

Page 84: Preservation, Publishing, and People: A SEAD View

Access Curation Queue

Page 85: Preservation, Publishing, and People: A SEAD View

Approve and Publish

Publish

Page 86: Preservation, Publishing, and People: A SEAD View

Check Activity

• Go to activity tab
• See activities performed (Curation time provenance)

• Click on the Research Object name to see details

Activity tab

Page 87: Preservation, Publishing, and People: A SEAD View

Check Provenance Graph

Provenance between 2 published objects (derivation)

Page 88: Preservation, Publishing, and People: A SEAD View

Thank You