Research Methods Group: accelerating impact by sharing data
Accelerating Impact by sharing data
Anja Gassner and Leroy Mwanzia
Why should we share our data?
Let's Move Beyond Open Data to Open Development?
In July this year, the Sunday Business section of the New York Times featured a story about the World Bank’s Open Data initiative and claimed that datasets and information will ultimately become more valuable than Bank lending. This is not about the World Bank, as the central repository of knowledge, sharing its knowledge and wisdom with clients from the South. It is about “democratizing development economics”: it levels the playing field on knowledge creation and dissemination and opens the development paradigm to participation from researchers and practitioners, software developers and students, from North and South.
The CGIAR is unique in having the capacity to collect experimental, monitoring, and survey data on agricultural systems throughout the developing world.
Most data collected by CRPs, whether broad-scale data used to describe and monitor farming system changes, or focused data collected to examine specific processes and hypotheses, should be of such potential value that the cost of archiving and sharing is justified by the value added in terms of expanded research results from the use of that data by a wider research community.
SRF
“This is one clear and consistent message from the last CGIAR science forum: data archiving of the CG Centers is overall abysmally poor”
Robert Nasi, Director, Forests, Trees and Agroforestry (CRP6)
Program Participant Agreement
What data platform are we going to use?
Research Data Repository
What should be deposited
1. All research data belonging to publications
2. High-value data sets of interest to ICRAF, other CG centers & partners
Research Data Management Policy
The Policy
All of the Centre’s data needs to be:
a) derived from research relevant to our agenda, to the development challenges in our strategy, to the Strategy and Results Framework (SRF) of the CGIAR and to the CGIAR Research Programs (CRPs);
b) of high quality (well designed, well collected, well verified, well documented);
c) protected and archived;
d) available (people know that it exists) and easily accessible (people can gain access to the data) to all;
e) adaptable, so that it can be well utilized and transformed where possible into actionable knowledge.
Whose responsibility?
The Centre
• setting up clear protocols, conducting peer reviews, using robust and well-documented methods and appropriate statistical analyses, and producing meta-analysis and syntheses of results
• providing a stable, reliable data repository system that can handle both document-centric and data-centric objects.
• ensuring that all necessary raw data will be made public to reproduce or replicate every scientific publication that is based on research data
Whose responsibility?
The project/scientist
• comply with explicit quality standards
• submit the necessary raw, verified data for every scientific publication in standard file formats
• ensure that research data produced for the Centre is described by appropriate metadata throughout its lifecycle
How do we achieve highest scientific standards?
RMG: quality control throughout the data lifecycle (collection, verifying, managing, analyzing, storing)
Beyond RMG: to ensure that all staff follow the institutional standards and guidelines.
The ultimate benchmark for all scientists, however, is the consensus of peers.
Research Data Repository
Challenge:
Move data from scientists' laptops to an institutional server and have the data described by sufficient metadata, without increasing transaction costs or creating an auditing issue.
Dataverse?
• The Dataverse Network is an application to publish, share, reference, extract and analyse research data.
• It facilitates making data available to others, and enables replication of work.
• Researchers and data authors get credit, publishers and distributors get credit, affiliated institutions get credit.
Dataverse Network
• A Dataverse Network hosts multiple Dataverses.
• Each Dataverse contains studies or collections of studies.
• Each study contains cataloguing information that describes the data plus the actual data files and complementary files.
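The containment hierarchy described above can be sketched as plain data classes. This is only an illustrative model, not the Dataverse software's own schema: the class and field names are assumptions, and collections of studies are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Study:
    # Cataloguing information (metadata) that describes the data.
    cataloguing_info: Dict[str, str]
    # The actual data files, plus any complementary files (e.g. codebooks).
    data_files: List[str] = field(default_factory=list)
    complementary_files: List[str] = field(default_factory=list)


@dataclass
class Dataverse:
    name: str
    studies: List[Study] = field(default_factory=list)


@dataclass
class DataverseNetwork:
    name: str
    dataverses: List[Dataverse] = field(default_factory=list)

    def count_studies(self) -> int:
        # Total number of studies across every hosted Dataverse.
        return sum(len(dv.studies) for dv in self.dataverses)


# Hypothetical example values, for illustration only.
study = Study(
    cataloguing_info={"title": "Example study"},
    data_files=["survey.csv"],
    complementary_files=["codebook.pdf"],
)
network = DataverseNetwork(
    name="Example network",
    dataverses=[Dataverse(name="icraf", studies=[study])],
)
```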
Dataverse Network
Data Backup & Preservation
• The IQSS Dataverse Network maintains a full backup of all data and directories on the Network for 6 months, in the Harvard Depository. This means that there is always a full, offsite copy of the Network that is less than 7 months old.
• IQSS will maintain on-line storage, backup, and media migration sufficient for all studies it accepts (in addition to storage provided for the IQSS DVN).
• The Henry A. Murray Archive, through its endowment, supports permanent bit-level preservation of all social science research studies directly deposited in the IQSS Dataverse Network.
http://thedata.org/book/data-backup-terms
Hosting
• There are two approaches:
1. You can download and install the Dataverse Network Application and effectively become a host; or
2. You can create a Dataverse on *IQSS Dataverse Network at Harvard University. This Network is open to all researchers, publishers and data distributors.
• Option 1 gives you more control but includes added responsibility & cost
*Institute for Quantitative Social Science
Hosting – IQSS Option
• Advantages
  – Dataverse software is installed, hosted and managed for you by IQSS
  – Dataverse is hosted in Harvard’s infrastructure, which is very good
  – IQSS offers great support in helping you set up your dataverse and provides great help if you run into any problems
• Disadvantages
  – Network-level administrative tasks cannot be done. These include:
    • Creating user groups based on IP address or IQSS network user names
    • Creating harvesting dataverses, which allow you to share metadata with other systems, e.g. Dspace. Sharing includes exporting and importing metadata.
    • Complete deletion of studies, not just deaccession
    • Accessing web statistics
  – Cannot use alias URLs to point to your dataverse, e.g. we cannot have the URL http://data.worldagroforestry.org pointing to the ICRAF IQSS dataverse http://dvn.iq.harvard.edu/dvn/dv/icraf
Hosting – Self Hosting
• Advantages
  – Full access to Network-level administrative tasks, including:
    • Ability to import and export studies to and from other systems
    • Ability to create user groups based on IP address and your dataverse users
    • Ability to use software-supplied utilities, e.g. complete deletion of studies and locking of studies
    • Greater flexibility in user management and “Terms of use” management
    • Greater flexibility in dataverse branding
  – Ability to use organization URLs to point to the dataverse, e.g. http://data.worldagroforestry.org
• Disadvantages
  – Need an IT expert to install and manage the dataverse software, including things like upgrading, applying security patches, backups etc.
  – Need good server infrastructure for hosting the application, especially server space.
Brand your Dataverse
Recognition through Citation
• Dataverse allows digital research data to be cited from published printed work
• The citation is automatically generated when a study is created.
• Data citation format:
Author, Date, “Title”, Persistent Identifier Universal Numerical Fingerprint (UNF) Distributor or other optional fields [ …]
Data Citation for each study
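As a rough illustration of the citation format above, the sketch below assembles a citation string from its components. Dataverse generates citations itself; the function name `format_data_citation` and its parameters are hypothetical, and the separators simply mirror the format line in this deck.

```python
from typing import List, Optional


def format_data_citation(
    authors: List[str],
    year: int,
    title: str,
    persistent_id: str,
    unf: Optional[str] = None,
    distributor: Optional[str] = None,
    version: Optional[str] = None,
) -> str:
    # Required part: Author, Date, "Title", Persistent Identifier
    citation = ", ".join(
        ["; ".join(authors), str(year), f'"{title}"', persistent_id]
    )
    # Optional parts are appended, space separated, in the order shown in the deck.
    if unf:
        citation += f" UNF:{unf}"
    if distributor:
        citation += f" {distributor} [Distributor]"
    if version:
        citation += f" {version} [Version]"
    return citation


# Placeholder author names and title; the identifiers are the ones quoted in this deck.
print(
    format_data_citation(
        ["A Author", "B Author"],
        2005,
        "Replication data for: X",
        "hdl:1902.1/15673",
        unf="5:G22I+TtPQPAyFcRT6SrUfA==",
        distributor="World Agroforestry Centre",
        version="V1",
    )
)
```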
1. Persistent Identifiers – Offer permanent and reliable links to digital objects. Uses the handle system, e.g. hdl:1902.1/15673
2. Universal Numerical Fingerprint
  – Applied on quantitative data
  – Used to uniquely identify and verify data, e.g. 5:G22I+TtPQPAyFcRT6SrUfA==
Unique Citation Components
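The handle `hdl:1902.1/15673` and the link `http://hdl.handle.net/1902.1/15673` both appear in this deck, so the mapping between them can be sketched as below. The helper name `handle_to_url` is hypothetical, and the sketch only rewrites the string; it does not check that the handle actually resolves.

```python
HANDLE_RESOLVER = "http://hdl.handle.net/"


def handle_to_url(handle: str) -> str:
    """Turn a handle such as 'hdl:1902.1/15673' into a resolvable link."""
    prefix = "hdl:"
    if not handle.startswith(prefix):
        raise ValueError(f"not a handle identifier: {handle!r}")
    # Drop the 'hdl:' scheme and append the remainder to the resolver address.
    return HANDLE_RESOLVER + handle[len(prefix):]


print(handle_to_url("hdl:1902.1/15673"))
```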
Frank Place; Patti Kristjanson; Steve Staal; Russ Kruska; Tineke deWolff; Robert Zomer; E C Njuguna, 2005, "Replication data for: Development pathways in medium-high potential Kenya: a meso-level analysis of agricultural patterns and determinants.", http://hdl.handle.net/1902.1/15673 UNF:5:G22I+TtPQPAyFcRT6SrUfA== World Agroforestry Centre [Distributor] V1 [Version]
Example of Citation
Designed for Research Data

Data-format aware
• Input formats: CSV, TAB, SPSS, STATA, GraphML
• Export: reformat, subset, analyze
• Preservation reformatting
• Semantic fingerprints

Find distributed resources
• Can provide a portal to distributed resources (OAI-PMH harvesting client)
• Data can also include metadata for harvesting

Robust
• Supports any file type; the only restriction is a maximum size of 2GB per file

Flexible licensing
• Access control for research groups
• Layered usage terms
• Data request workflow

Research data workflows
• Researcher can enter deposit directly
• Multiple workflows: closed, review-and-release, wiki
• Versioned
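A data manager could run a simple pre-deposit check based on the two constraints listed above: the 2GB per-file limit and the list of formats Dataverse treats as data. The sketch below is hypothetical and not part of Dataverse. The extension mapping (`.sav` for SPSS, `.dta` for STATA) and the reading of 2GB as 2 × 1024³ bytes are assumptions.

```python
from pathlib import PurePath
from typing import Tuple

# Assumption: "2GB" is read as 2 * 1024**3 bytes.
MAX_FILE_BYTES = 2 * 1024**3

# Assumption: typical file extensions for the input formats named in the deck
# (CSV, TAB, SPSS, STATA, GraphML).
FORMAT_AWARE_EXTENSIONS = {".csv", ".tab", ".sav", ".dta", ".graphml"}


def check_deposit(filename: str, size_bytes: int) -> Tuple[bool, bool]:
    """Return (accepted, format_aware) for a file proposed for deposit."""
    # Any file type is accepted as long as it is within the size limit.
    accepted = size_bytes <= MAX_FILE_BYTES
    # Only the recognised formats get subsetting, reformatting and fingerprints.
    format_aware = accepted and PurePath(filename).suffix.lower() in FORMAT_AWARE_EXTENSIONS
    return accepted, format_aware


print(check_deposit("household_survey.dta", 150_000_000))
print(check_deposit("field_photos.zip", 3 * 1024**3))
```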
Permissions Summary
What can you do with dataverse?
[Flowchart: Publication and Data Submission Proposed Workflow]
Steps shown: Start; GRP/Region submit publication; Publication submitted into Dspace; Dspace Editors Approval; Request Changes; Publication published to the web; Data submitted to RMG Data manager; Request Data from scientists; Upload data to dataverse; Update Dspace (unreleased) publication with data link; Dspace Editors receive data link; Publication published in dspace.
Decision points shown: Publication or data (Publication / Data); Publication Approval (Yes / No); Publication has data? (Yes / No); Data Received? (Yes / No).
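The transcript of the proposed workflow diagram does not preserve the arrows between boxes, so the sketch below encodes only one plausible reading of it. The ordering of steps and the branch taken at each decision point are assumptions, and the function name `submission_steps` and its parameters are hypothetical.

```python
from typing import List


def submission_steps(
    kind: str,
    approved: bool = True,
    has_data: bool = False,
    data_received: bool = False,
) -> List[str]:
    """List the steps a submission passes through (assumed ordering)."""
    if kind == "data":
        # Data-only branch: goes to the RMG data manager, then into Dataverse.
        return ["Data submitted to RMG data manager", "Upload data to Dataverse"]

    steps = ["Publication submitted into DSpace", "DSpace editors' approval"]
    if not approved:
        return steps + ["Request changes"]
    if not has_data:
        return steps + ["Publication published to the web"]
    if not data_received:
        return steps + ["Request data from scientists"]
    return steps + [
        "Upload data to Dataverse",
        "Update DSpace (unreleased) publication with data link",
        "DSpace editors receive data link",
        "Publication published in DSpace",
    ]


for step in submission_steps("publication", has_data=True, data_received=True):
    print(step)
```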
Data Request Email
Thank You