Research Methods Group: accelerating impact by sharing data

Accelerating Impact by sharing data

Anja Gassner and Leroy Mwanzia

Why should we share our data?

Let's Move Beyond Open Data to Open Development?

In July this year, the Sunday Business section of the New York Times featured a story about the World Bank’s Open Data initiative and claimed that datasets and information will ultimately become more valuable than Bank lending. This is not about the World Bank as the central repository of knowledge sharing its knowledge and wisdom with clients from the South. It is about “democratizing development economics” in that it levels the playing field on knowledge creation and dissemination and opens the development paradigm to participation from researchers and practitioners, software developers and students, from north and south.

The CGIAR is unique in having the capacity to collect experimental, monitoring, and survey data on agricultural systems throughout the developing world.

Most data collected by CRPs, whether broad-scale data used to describe and monitor farming system changes, or focused data collected to examine specific processes and hypotheses, should be of such potential value that the cost of archiving and sharing is justified by the value added in terms of expanded research results from the use of that data by a wider research community.

SRF

“This is one clear and consistent message from the last CGIAR science forum: data archiving of the CG Centers is overall abysmally poor”

Robert Nasi, Director, Forests, Trees and Agroforestry (CRP6)

Program Participant Agreement

What data platform are we going to use?

Research Data Repository

What should be deposited

1. All research data belonging to publications

2. High value data sets of interest to ICRAF, other CG centers & Partners

Research Data Management Policy

The Policy

All of the Centre’s data needs to be:

a) derived from research relevant to our agenda, to the development challenges in our strategy, to the Strategy and Results Framework (SRF) of the CGIAR and to the CGIAR Research Programs (CRPs)

b) of high quality (well designed, well collected, well verified, well documented);

c) protected and archived;

d) made available (users know that it exists) and easily accessible (users can gain access to the data) to all;

e) adaptable so that it can be well utilized and transformed where possible into actionable knowledge.

Whose responsibility?

The Centre

• setting up clear protocols, conducting peer reviews, using robust and well-documented methods and appropriate statistical analyses, and producing meta-analysis and syntheses of results

• providing a stable, reliable data repository system that can handle both document-centric and data-centric objects.

• ensuring that all necessary raw data will be made public to reproduce or replicate every scientific publication that is based on research data

Whose responsibility?

The project/scientist

• comply with explicit quality standards

• submit the necessary raw, verified data for every scientific publication in standard file formats

• ensure that research data produced for the Centre is described by appropriate metadata throughout its lifecycle

How do we achieve highest scientific standards?

RMG: quality control throughout the data lifecycle (collection, verifying, managing, analyzing, storing)

Beyond RMG: to ensure that all staff follow the institutional standards and guidelines.

The ultimate benchmark for all scientists, however, is the consensus of peers.

Research Data Repository

Challenge:

Move data from scientist laptops to institutional server

and

Have the data described by sufficient metadata

without

Increasing transaction costs or creating an auditing issue

Dataverse?

• The Dataverse Network is an application to publish, share, reference, extract and analyse research data.

• It facilitates making data available to others, and enables replication of work.

• Researchers and data authors get credit, publishers and distributors get credit, affiliated institutions get credit.

Dataverse Network

• A Dataverse Network hosts multiple Dataverses.

• Each Dataverse contains studies or collections of studies.

• Each study contains cataloguing information that describes the data plus the actual data files and complementary files.
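The containment hierarchy described above can be sketched as a few nested data classes. This is only an illustration of the structure, not the Dataverse Network application's own data model: the class and field names (`Study`, `cataloguing_info`, `count_studies`, and so on) are hypothetical, and collections of studies are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Study:
    # Cataloguing information (metadata) that describes the data.
    cataloguing_info: Dict[str, str]
    # The actual data files, plus complementary files such as documentation.
    data_files: List[str] = field(default_factory=list)
    complementary_files: List[str] = field(default_factory=list)


@dataclass
class Dataverse:
    # A Dataverse contains studies (collections of studies are omitted here).
    name: str
    studies: List[Study] = field(default_factory=list)


@dataclass
class DataverseNetwork:
    # A Dataverse Network hosts multiple Dataverses.
    name: str
    dataverses: List[Dataverse] = field(default_factory=list)

    def count_studies(self) -> int:
        """Total number of studies across all hosted Dataverses."""
        return sum(len(dv.studies) for dv in self.dataverses)


# Hypothetical example content, for illustration only.
study = Study(
    cataloguing_info={"title": "Example study"},
    data_files=["survey.csv"],
    complementary_files=["codebook.pdf"],
)
network = DataverseNetwork(
    name="Example network",
    dataverses=[Dataverse(name="Example dataverse", studies=[study])],
)
print(network.count_studies())
```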

Dataverse Network

Data Backup & Preservation

• The IQSS Dataverse Network maintains a full backup of all data and directories on the Network for 6 months, in the Harvard Depository. This means that there always is a full, offsite copy of the Network that is less than 7 months old.

• IQSS will maintain on-line storage, backup, and media migration sufficient for all studies it accepts (in addition to storage provided for the IQSS DVN).

• The Henry A. Murray Archive, through its endowment, supports permanent bit-level preservation of all social science research studies directly deposited in the IQSS Dataverse Network.

http://thedata.org/book/data-backup-terms


Hosting

• There are two approaches:

1. You can download and install the Dataverse Network Application and effectively become a host; or

2. You can create a Dataverse on *IQSS Dataverse Network at Harvard University. This Network is open to all researchers, publishers and data distributors.

• Option 1 gives you more control but includes added responsibility & cost

*Institute for Quantitative Social Science

Hosting – IQSS Option

• Advantages

– Dataverse software is installed, hosted and managed for you by IQSS

– Dataverse is hosted in Harvard’s infrastructure, which is very good

– IQSS offers great support in helping you set up your dataverse and provides great help if you run into any problems

• Disadvantages

– Network-level administrative tasks cannot be done. These include:

    • Creating user groups based on IP address or IQSS network user names

    • Creating harvesting dataverses, which allow you to share metadata with other systems, e.g. Dspace. Sharing includes exporting and importing metadata.

    • Complete deletion of studies, not just deaccession

    • Accessing web statistics

– Cannot use alias URLs to point to your dataverse, e.g. we cannot have the URL http://data.worldagroforestry.org pointing to the ICRAF IQSS dataverse http://dvn.iq.harvard.edu/dvn/dv/icraf



Hosting – Self Hosting

• Advantages

– Full access to Network level Administrative tasks including:

    • Ability to import and export studies to and from other systems

    • Ability to create user groups based on IP address and your dataverse users

    • Ability to use software supplied utilities e.g. complete deletion of studies and locking of studies

    • Greater flexibility in user management and “Terms of use” management

    • Greater flexibility in dataverse branding

– Ability to use organization URLs to point to the dataverse e.g. http://data.worldagroforestry.org

• Disadvantages

– Need an IT expert to install and manage the dataverse software, including things like upgrading, applying security patches, backups etc.

– Need good server infrastructure for hosting the application especially server space.


Brand your Dataverse

Recognition through Citation

• Dataverse allows digital research data to be cited from published printed work

• A citation is automatically generated when a study is created.

• Data citation format:

Author, Date, “Title”, Persistent Identifier Universal Numerical Fingerprint (UNF) Distributor or other optional fields […]

Data Citation for each study

1. Persistent Identifiers

– Offer permanent and reliable links to digital objects

– Uses the handle system, e.g. hdl:1902.1/15673

2. Universal Numerical Fingerprint

– Applied on quantitative data

– Used to uniquely identify and verify data, e.g. 5:G22I+TtPQPAyFcRT6SrUfA==

Unique Citation Components

Frank Place; Patti Kristjanson; Steve Staal; Russ Kruska; Tineke deWolff; Robert Zomer; E C Njuguna, 2005, "Replication data for: Development pathways in medium-high potential Kenya: a meso-level analysis of agricultural patterns and determinants.", http://hdl.handle.net/1902.1/15673 UNF:5:G22I+TtPQPAyFcRT6SrUfA== World Agroforestry Centre [Distributor] V1 [Version]

Example of Citation
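As a rough illustration of how the citation components fit together, the sketch below assembles a citation string in the format shown on the slides (Author, Date, “Title”, Persistent Identifier, UNF, Distributor, Version). It is not Dataverse's own citation generator: the `DataCitation` class and its method names are hypothetical, and the handle-to-URL mapping simply follows the pairing of hdl:1902.1/15673 with http://hdl.handle.net/1902.1/15673 seen in the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataCitation:
    """Hypothetical container for the citation components listed on the slides."""
    authors: List[str]
    year: int
    title: str
    handle: str          # persistent identifier, e.g. "1902.1/15673"
    unf: str             # Universal Numerical Fingerprint, e.g. "5:G22I+..."
    distributor: str
    version: str = "V1"

    def persistent_url(self) -> str:
        # Assumption: the handle resolves through the hdl.handle.net resolver,
        # as in the example citation.
        return f"http://hdl.handle.net/{self.handle}"

    def format(self) -> str:
        author_list = "; ".join(self.authors)
        return (
            f'{author_list}, {self.year}, "{self.title}", '
            f"{self.persistent_url()} UNF:{self.unf} "
            f"{self.distributor} [Distributor] {self.version} [Version]"
        )


# Components taken from the example citation in the slides.
citation = DataCitation(
    authors=[
        "Frank Place", "Patti Kristjanson", "Steve Staal", "Russ Kruska",
        "Tineke deWolff", "Robert Zomer", "E C Njuguna",
    ],
    year=2005,
    title=(
        "Replication data for: Development pathways in medium-high potential "
        "Kenya: a meso-level analysis of agricultural patterns and determinants."
    ),
    handle="1902.1/15673",
    unf="5:G22I+TtPQPAyFcRT6SrUfA==",
    distributor="World Agroforestry Centre",
)
print(citation.format())
```

Keeping the handle and the UNF as separate fields mirrors the two unique components of the citation: the handle locates the study, while the UNF lets a reader verify that the data they hold matches what was cited.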

Designed for Research Data

Data-format aware

• Input formats: CSV, TAB, SPSS, STATA, GraphML

• Export: reformat, subset, analyze

• Preservation-reformatting

• Semantic fingerprints

Find distributed resources

• Can provide a portal to distributed resources (OAI-PMH harvesting client)

• Data can also include metadata for harvesting

Robust

• Supports any file type; the only restriction is a maximum size of 2GB per file

Flexible licensing

• Access control for research groups

• Layered usage terms

• Data request workflow

Research data workflows

• Researcher can enter deposit directly

• Multiple workflows: closed, review-and-release, wiki

• Versioned
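Before depositing, a data manager might want to check a file against the two constraints mentioned above: the 2GB per-file limit and the list of data-aware input formats. The sketch below is a hypothetical helper, not part of Dataverse. It assumes that “2GB” means 2 × 1024³ bytes, and the mapping from file extensions to format names is a guess made for illustration.

```python
import os
from typing import Dict, Optional

# Assumption: the slide's "2GB" limit is read as 2 * 1024**3 bytes.
MAX_FILE_BYTES = 2 * 1024**3

# Assumed mapping from file extension to the input formats named on the slide.
DATA_AWARE_EXTENSIONS: Dict[str, str] = {
    ".csv": "CSV",
    ".tab": "TAB",
    ".sav": "SPSS",
    ".dta": "STATA",
    ".graphml": "GraphML",
}


def check_deposit(filename: str, size_bytes: int) -> Dict[str, Optional[object]]:
    """Report whether a file fits the size limit and which data-aware format
    (if any) its extension suggests. Any file type may still be deposited;
    the format only indicates whether subsetting and analysis might apply."""
    extension = os.path.splitext(filename)[1].lower()
    return {
        "within_size_limit": size_bytes <= MAX_FILE_BYTES,
        "data_aware_format": DATA_AWARE_EXTENSIONS.get(extension),
    }


print(check_deposit("household_survey.csv", 5 * 1024**2))
print(check_deposit("field_photos.zip", 3 * 1024**3))
```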

Permissions Summary

What can you do with dataverse?

Publication and Data Submission Proposed Workflow

[Flowchart. Steps: Start; GRP/Region submit publication; Publication submitted into Dspace; Dspace Editors Approval; Request Changes; Publication published to the web; Data submitted to RMG Data manager; Request Data from scientists; Upload data to dataverse; Update Dspace (unreleased) publication with data link; Dspace Editors receive data link; Publication published in dspace. Decision points: Publication or data? (Publication / Data); Publication Approval (Yes / No); Publication has data? (Yes / No); Data Received? (Yes / No).]
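One possible reading of the proposed workflow is sketched below as a function that returns the sequence of steps for a given submission. The step labels come from the flowchart, but the ordering and branching are an interpretation, since the diagram's arrows are not available in this transcript. The function name and its parameters are hypothetical.

```python
from typing import List


def submission_steps(
    has_data: bool,
    approved: bool = True,
    data_received: bool = True,
) -> List[str]:
    """Return the assumed sequence of workflow steps for one publication.

    The branching below is an interpretation of the flowchart labels,
    not a confirmed description of the ICRAF process.
    """
    steps = [
        "GRP/Region submit publication",
        "Publication submitted into Dspace",
        "Dspace Editors Approval",
    ]
    if not approved:
        # Assumed: a rejected publication goes back to the submitter.
        steps.append("Request Changes")
        return steps
    if not has_data:
        # Assumed: publications without data are published straight away.
        steps.append("Publication published to the web")
        return steps
    steps.append("Data submitted to RMG Data manager")
    if not data_received:
        # Assumed: missing data triggers a request to the scientists.
        steps.append("Request Data from scientists")
        return steps
    steps.extend([
        "Upload data to dataverse",
        "Update Dspace (unreleased) publication with data link",
        "Dspace Editors receive data link",
        "Publication published in dspace",
    ])
    return steps


for step in submission_steps(has_data=True):
    print(step)
```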

Data Request Email

Thank You
