Research Data Management and Librarians
-
Upload
johann-van-wyk -
Category
Education
-
view
327 -
download
1
description
Transcript of Research Data Management and Librarians
Research Data Management and LibrariansPresentation at Elsevier Library Connect Seminar, 6 October 2014, Johannesburg, 7 October 2014, Durban and 9 October 2014, Cape Town
By Johann van Wyk (University of Pretoria)
Introduction
Internationally research data is increasingly recognised as a vital
resource whose value needs to be preserved for future research. This
places a huge responsibility on Higher Education Institutions to
ensure that their research data is managed in such a manner that
they are protected from substantial reputational, financial and legal
risks in the future. Librarians have a unique skillset to help these
institutions navigate this complex environment. This presentation
will highlight a number of potential roles librarians could play.
Research Data Management: A (Brave) Complex New World
DOI
Visualisation
RDMData Prese
rvation
Big Data Link
ed
Dat
a
DMP
Data Policy
Data Repository
Data Journals
Open Data
Data Anonymisation
Copyright License
Data Formats
Messy Complex
Small Data
Various formats Various
devicesVarious Versions
Sensitive Data
What is meant by Research Data?
Research data, unlike other types of
information, is collected, observed, created or
generated, for purposes of analysis to produce
original research results
http://www.docs.is.ed.ac.uk/docs/data-library/EUDL_RDM_Handbook.pdf
What is research data management?
• “the process of controlling the information generated
during a research project”
• “Managing data is an integral part of the research
process. How data is managed depends on the types
of data involved, how data is collected and stored,
and how it is used - throughout the research
lifecycle”.
http://www.libraries.psu.edu/psul/pubcur/what_is_dm.html
Why Manage Research Data?
• Meet funding body grant requirements, e.g. NSF, NIH;• Meet publisher requirements • Ensure research integrity and replication; • Ensure research data and records are accurate, complete,
authentic and reliable; • Increase your research efficiency; • Save time and resources in the long run; • Enhance data security and minimise the risk of data loss; • Prevent duplication of effort by enabling others to use your data; • Comply with practices conducted in industry and commerce; and • Protect your institution from reputational, financial and legal risk.
By managing research data you will:
Designing Data Management Plans
A Data Management Plan is “a formal document that outlines what you will do with your data during and after you complete your research” (The University of Virginia Library, 2014).
Data Management Planning Tools:
• Data Management Planning Tool (DMPTool) https://dmptool.org/ (University of California Curation Center of the California Digital Library)• DMPonline tool https://dmponline.dcc.ac.uk/ (Digital Curation Centre, UK)
Creating Data
Librarians can play an advisory role
Data Capture/Collection
The action or process of “gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes” (Responsible conduct of research, n.d.; The Oxford Dictionary, 2014).
Examples of data collection methods:Observations, textual or visual analysis, interviews, focus group interviews, surveys, tracking, experiments, case studies, literature reviews, questionnaires, data from sensors, model outputs, scenarios, etc.
Creating Data
Librarians can play their traditional role of information searching, - training and - consultation
Data Storage and Backup
Data storage is the process of “preservation of data files in a secure location which can be accessed readily” (Research Data Services, University of Wisconsin-Madison, 2014)
Data Backup is the process of “preserving additional copies of your data in a separate physical location from data files in storage”.
Creating Data
Processing Data
Analysing Data
Librarians can advise researchers on File Naming Conventions
Metadata Creation
• Metadata is searchable, standardised and structured “information that describes a dataset” and explains “the aim, origin, time references, geographic location, creating author, access conditions and terms of use of a data set”
(Corti et al., 2014: 38; USGS Data Management Website, 2014)
• Examples: - Dublin Core Metadata Element Set; - ISO 19115: 2003(E) — Geographic Information Metadata; - PREMIS
Creating Data
Processing Data
Analysing Data
Preserving Data
Librarians, especially cataloguers have the skill-set to assist with metadata creation and to advise
Data Cleansing, Verification & Validation
• Data Cleansing
“refers to identifying incomplete, incorrect, inaccurate, irrelevant, etc. parts of the data and
then replacing, modifying, or deleting this dirty data’ (Wikipedia)
• Data Verification
“the process of evaluating the completeness, correctness, and compliance of a dataset with
required procedures to ensure that the data is what it purports to be. This can be done by persons
“who are less familiar with the data”, for example Librarians.
(Martin and Ballard, 2010: 8-9; US EPA, 2002:7)
• Data validation
process “to determine if data quality goals have been achieved and the reasons for any
deviations. Validation checks that the data makes sense”.
(Martin and Ballard, 2010: 8; US EPA 2002:15).
Processing Data
Analysing Data
Data anonymisation
Data anonymisation is “the process of de-identifying sensitive
data, while preserving its format and data type” (Raghunathan, 2013: 4).
Anonymisation Techniques - Examples: Generalisation, Suppression,
Permutation, Pertubation, Substitution, Shuffling, Number and Date
Variance, Nulling-out (Charles, 2012; Cormode and Srivastava, 2009;
Raghunathan ,2013: 172-182; Simpson, n.d.; Vinogradov and Pastsyak,2012: 163).
Processing Data
Analysing Data
Data Interpretation & Analysis
Data interpretation and analysis “is the process of assigning
meaning” to the gathered information and ascertaining “the
conclusions, significance, and implications of the findings”
(Analyzing and Interpreting Data, n.d.).
Analysing Data
Data Publishing
Data publishing
This is the process of making research data underpinning the findings published in peer-reviewed articles, available for readers and reviewers in an appropriate repository, or “as supplementary materials to a journal publication” (Corti et al 2014: 197; Marques, 2013)
Data Journals
A more recent development has been the appearance of data journals. These journals publish data papers that describe a dataset, and also give an indication in which repository the dataset is available (Corti et al. 2014: 7-8).
Analysing Data
Librarians can be involved in creating and managing a data repository, and can give training and advise
Examples of Data Repository Software
Registry of Research Data Repositories
• re3data.org is a global registry of research data repositories that
covers research data repositories from different academic
disciplines.
• It presents repositories for the permanent storage and access of
data sets to researchers, funding bodies, publishers and scholarly
institutions.
• It can be used a tool for the easy identification of appropriate data
repositories to store research data.
Data Journals• A list of Data Journals – available at http
://proj.badc.rl.ac.uk/preparde/blog/DataJournalsList
• Example of data journal at Elsevier: “Data in Brief”
Data Visualisation
Data Visualisation is the visual representation of data, and is used to enable people to both understand and communicate information through graphical and schematic avenues (Friendly, 2009: 2; Schnell and Shetterley, 2013: 3)
Analysing Data
From Xiaoru Yuan’s presentation at CODATA Workshop on 12 June 2014
Data Archiving
Data archiving can be described as the process of retention
and storage of valuable data (this is data that will be
essential for future reference) for long-term preservation, so
that the data will be protected from risk (i.e. loss, or
corruption), and will be accessible for future use (Rouse, 2010).
Preserving Data
Data Preservation
Data preservation is ”the process of providing enough
representation information, context, metadata, fixity, etc. to the
data so that anyone other than the original data creator can use
and interpret the data” (Ruth Duerr, National Snow and Ice Data Center as
cited by Choudhury, 2014)
Preserving Data
The Librarian can assist researchers in preparing data for long-term preservation, by advising on metadata standards
Linking Data to research outputs
This is the process of connecting the underlying data relating to a
specific research output, e.g. journal article, thesis, etc to the
research output itself. This can be done by adding a digital object
identifier (DOI) to the dataset and including this in the metadata of
the research output, or by citing the dataset (Callaghan et al., 2013).
Preserving Data
The Librarian can assist researchers, through training and consultation on DOIs and data citation methods
Data Sharing
• Sharing data is the process of opening up access to research
data and making it available to other researchers (Corti et al.,
2014: 2).
• Data sharing provides “opportunities for other researchers to
review, confirm or challenge research findings” (Data sharing
and implementation guide, n.d.).
Giving Access to Data
Data sharing Methods
The method for sharing data will depend on a variety of factors, including size and complexity of the dataset, sensitivity of the data collected, and anticipated number of requests for data sharing.
Researchers could
(1) Take responsibility for sharing data themselves, or
(2) Use a data archive, or
(3) Use a combination of these methods.
Data repurposing/reuse
• This is the process where secondary data (data that have been
captured and analysed by other researchers) can be re-analysed,
reworked or -used for new analyses, and compared with
contemporary data (Corti et al., 2014: 169)
• This process “also enables research where the required data may
be expensive, difficult or impossible to collect”, e.g. large scale
surveys, or historic data (Corti et al., 2014: 169).
Re-using Data
Data Citation
Data citation is the process of referencing (attributing and acknowledging) reused data in a similar fashion as traditional sources of information (Corti et al. 2014: 197).
Helpful Sources :
• Publication Manual of the American Psychological Association (APA, 2009)
• Oxford Manual of Style (OUP, 2012)
• Data Citation Awareness Guide (ANDS, 2011)
• Data Citation: What you Need to Know (ESRC, 2012)
Re-using Data
The Librarian can assist researchers, through training and consultation in data citation methods
Data Citation: DOI
DOI = Digital Object Identifier
To enable a unique and persistent identification of a digital object
A DOI is a unique alphanumeric string assigned by a registration agency (the International DOI Foundation) to identify a digital object, e.g. a data set. Metadata about the object is stored together with the DOI name. This may include a location, such as a URL, where the object can be found. (Wikipedia)
For example: http://dx.doi.org/10.1000/182
Re-using Data
The Librarian can assist researchers, through training and consultation on DOIs
Specific Object
Registrant
DOI Registry
Provenance of Data
• history of a data file or data set
• this includes information
o on the person(s) responsible for the data set
o context of the data set
o revision history, including additions of new data and error corrections (Strasser et al., 2012: 7, 11)
Management of Big Data
Big data can be described in terms of its characteristics:
• Relative characteristics: denotes those datasets which cannot be
acquired, managed or processed on common devices within an
acceptable time;
• Abolute chacteristics defines big data through Volume, Variety,
Veracity and Velocity (Huadong, 2014)
Big Data is part of a new science paradigm called Data Intensive
Science, where Scientists are overwhelmed with data sets from many different
sources, e.g. captured by instruments, generated by simulations, and generated
by sensor networks
Absolute Characteristics of Big Data
• Volume: The scale of data that systems must ingest, process and
disseminate;
• Variety: the complexity of the types of information handled (many
sources and types of data both structured and unstructured)
• Velocity: the pace at which data flows in and out from sources
like business processes, machines, networks and human
interaction with things like social media sites, mobile devices
• Veracity: refers to the biases, noise and abnormality in data http://inside-bigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/
Role of Librarian in Big Data
• Create awareness among researchers about Big Data
Initiatives internationally
• Create awareness among colleagues about the activities,
workgroups and task groups of CODATA (Committee on
Data for Science and Technology, of the International
Council for Science) and Research Data Alliance
• Become a member of a number of CODATA task groups
Examples of International InitiativesCenter for International Earth Science Information Network, EARTH INSTITUTE, COLUMBIA UNIVERSITY
Computer Network Information Center, CAS
World Data Center for Microorganisms
Institute of Remote Sensing and Digital Earth, CAS
Dept of Earth Sciences
Institute for environment and Human Security
Thetherless World Constellation
International Society for Digital Earth
Pilot Projects at University of Pretoria
• The UP Library Services implemented two data management pilot projects in 2013-2014:
• Institute for Cellular and Molecular Medicine (ICMM) and the Neuro-Physio-Group
• An Open Source Document Management System was customised for this purpose
• Why Alfresco?• Open Source
• Captured provenance of data
• Had a versioning function
• Good metadata function
• Easy to integrate with other software
• Workflow function gave supervisor overview of progress of students
• Sync function with dropbox and Google Drive
• Drag and Drop function
• File Sharing function
• Mobile App
Long-term Preservation
Archival Information Package (AIP)
• Bagit format (Bag-it and tag-it)
• Bagit “bag” contains:• Bag declaration file, manifest file, data files,
metadata file (XML)• METS wrapper• Dublin Core and MODS(Descriptive Metadata)• PREMIS (Preservation Metadata)
Next Phase
Various stakeholders in RDM
Executive Management
Deans & Dept Heads
IT Services
Research Office
Principal Investigator/Re
searcher
LibraryFunders
Publishers
External (disciplinary)
data repositories
(De Waard, Rotman and Lauruhn, 2014)
Funders Funders Requirements: USA
(Dietrich et al. 2012)
http://www.istl.org/12-summer/refereed1.html
Funders Funders’ Requirements: UK
UK Digital Curation Centre
http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies
Conclusion
This presentation showed that although the RDM environment looks
daunting the Library Professional can play an essential and much
needed role in determining the success of Research Data
Management initiatives at Higher Education Institutions.
This vast, untamed and complex environment is waiting for someone
to conquer it. Librarians have the necessary skillset to do that.
May this motto also become our victory cry:
“Veni, vidi, vici” – I came I saw I conquered
References
• Analyzing and interpreting data. Syracuse, NY: Office of Institutional Research and Assessment, Syracuse University, n.d. [Online] available at https://oira.syr.edu/assessment/assesspp/Analyze.htm (Accessed 18 September 2014).
• CALLAGHAN, S. et al. 2013. Connecting data repositories and publishers for data publication. Presentation delivered on 7 February 2013 at the OpenAIRE Interoperability workshop, University of Minho, held 7-8 February 2013. Braga, Portugal: University of Minho Gualtar Campus. [Online] available at http://openaccess.sdum.uminho.pt/wp-content/uploads/2013/02/7_SarahCallaghan_OpenAIREworkshopUMinho.pdf (Accessed 19 September 2014).
• CHARLES, K. 2012. Comparing enterprise data anonymization techniques. Newton, MA: TechTarget. [Online] available at http://searchsecurity.techtarget.com/tip/Comparing-enterprise-data-anonymization-techniques (Accessed 18 September 2014).
References• CHOUDHURY, S. 2014. Public Institution perspective (Research Library). Presented at
the Digital Media Analysis, Search and Management (DMASM), 2014. [Online] available at http://dataconservancy.org/wp-content/uploads/2014/03/DC_DMASM_2014.pdf (Accessed 24 September 2014).
• CORMODE, G. AND SRIVASTAVA, D. 2009. Anonymized data: generation, models, usage. Tutorial at SIGMOD, July 2009. [Online] available at http://dimacs.rutgers.edu/~graham/pubs/papers/anontut.pdf (Accessed 17 September 2014).
• CORTI, L. et al. 2014. Managing and sharing research data: a guide to good practice. Los Angeles: SAGE.
• Data Management Planning Tool (DMPTool). Oakland, CA: University of California Curation Center of the California Digital Library, 2014. [Online] available at https://dmptool.org/ (Accessed 24 September 2014).
• Data sharing and implementation guide. Washington, DC: Institute of Education Sciences, U.S. Department of Education, n.d. [Online] available at http://ies.ed.gov/funding/datasharing_implementation.asp (Accessed 19 September 2014).
References• DCC. 2014. Overview of funders data policies. Edinburgh, UK: Digital Curation Centre. [Online]
available at http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies (Accessed 24 September 2014).
• DCC. 2014. What are metadata standards? Edinburgh, UK: Digital Curation Centre. [Online] available at http://www.dcc.ac.uk/resources/briefing-papers/standards-watch-papers/what-are-metadata-standards (Accessed 24 September 2014).
• DE WAARD, A. AND ROTMAN, D. AND LAURUHN, M. 2014. Research data management at institutions: part 1: visions. Elsevier Library Connect, 6 February 2014. [Online] available at http://libraryconnect.elsevier.com/articles/2014-02/research-data-management-institutions-part-1-visions (Accessed 5 October 2014)
• DIETRICH, D. et al. 2012. De-mystifying the data management requirements of research funders. Issues in Science and Technology Librarianship, Summer, 2012, No. 70. [Online] available at http://www.istl.org/12-summer/refereed1.html (Accessed 22 September 2012).
References• DMPonline tool. Edinburgh, UK: Digital Curation Centre, 2014. [Online]
available at https://dmponline.dcc.ac.uk/ (Accessed 22 September 2014).
• Edinburgh University Data Library Research Data Management Handbook, v.10, Aug, 2011. [Online] available at http://www.docs.is.ed.ac.uk/docs/data-library/EUDL_RDM_Handbook.pdf (Accessed 25 September 2014).
• FRIENDLY, M. 2009. Milestones in the history of thematic cartography, statistical graphics, and data visualization. [Sl.: s.n.] [Online] available at http://www.math.yorku.ca/SCS/Gallery/milestone/milestone.pdf (Accessed 19 September 2014).
• HODSON, S. 2014. Global collaboration in data science: an introduction to CODATA. Presentation on 6 June 2014 at CODATA International Training Workshop in Big Data for Science for Researchers from Emerging and Developing Countries, Beijing, China, 4-20 June 2014.
References• HUADONG, G. 2014. Scientific Big Data for knowledge discovery.
Presentation on 8 June 2014 at the CODATA Workshop on Big Data for International Scientific Programmes: Challenges and Opportunities, Beijing, China, 8-9 June 2014.
• Library of Congress. 2014. PREMIS. Washington, DC: Library of Congress. [Online] available at http://www.loc.gov/standards/premis/ (Accessed 25 September 2014).
• A list of data journals. Trac Integrated SCM and Project Management. [Online] available at http://proj.badc.rl.ac.uk/preparde/blog/DataJournalsList (Accessed 25 September 2014).
• MARTIN, E. AND BALLARD, G. 2010. Data management best practices and standards for Biodiversity data applicable to Bird Monitoring Data. U.S. North American Bird Conservation Initiative Monitoring Subcommittee. [Online] available at http://www.nabci-us.org/ (Accessed 24 September 2014).
References• MARQUES, D. 2013. Research data driving new services. Elsevier Library
Connect, 25 February 2013. [Online] available at http://libraryconnect.elsevier.com/articles/best-practices/2013-02/research-data-driving-new-services (Accessed 5 October 2014)
• NORMANDIEU, K. 2013. Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity. Inside Big Data. [Online] available at http://inside-bigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/ (Accessed 25 September 2014).
• The Oxford Dictionary. [sl.]: Oxford University Press, 2014. [Online] available at http://www.oxforddictionaries.com/us/ (Accessed 16 September 2014).
• RAGHUNATHAN, B. 2013. The complete book of data anonymization: from planning to implementation. Broken Sound Parkway, NW: CRC Press, Taylor and Francis Group.
References
• Research Data Services, University of Wisconsin-Madison. Madison, WI: University of Wisconsin Madison, 201. [Online] available at http://researchdata.wisc.edu/manage-your-data/data-backup-and-integrity/ (Accessed 24 September 2014).
• Responsible conduct of research. DeKalb, Illinois: Northern Illinois University Faculty Development and Instructional Design Center, n.d. [Online] available at: http://ori.dhhs.gov/education/products/n_illinois_u/datamanagement/dctopic.html. (Accessed: 16 September 2014).
• ROUSE, M. 2010. Data archiving. Techtarget. [Online] available at http://searchdatabackup.techtarget.com/definition/data-archiving (Accessed 19 August 2014)
• SCHNELL, K. AND SHETTERLEY, N. 2013. Understanding data visualization. [Sl.]: Accenture. [Online] available at http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Tech-Labs-Data-Visualization-Full-Paper.pdf (Accessed 19 September 2014).
References
• SIMPSON, J. n.d. Data Masking and Encryption Are Different. IRI Blog Articles. [Online] available at http://www.iri.com/blog/data-protection/data-masking-and-data-encryption-are-not-the-same-things/ (Accessed 18 September 2014).
• STRASSER, C. et al. 2012. Primer on data management: what you always wanted to know. [Albuquerque,NM]: DataONE, [University of New Mexico], p1-11. [Online] available at http://www.dataone.org/sites/all/documents/DataONE_BP_Primer_020212.pdf (Accessed 28 August 2013)
• United States Environmental Protection Agency (US EPA). 2002. Guidance on Environmental Data Verification and Data Validation: EPA QA/G-8. Washington, DC: Environmental Protection Agency. [Online] available at http://www.epa.gov/QUALITY/qs-docs/g8-final.pdf (Accessed 24 September 2014).
• USGS Data Management. [Online] available at http://www.usgs.gov/datamanagement/describe/metadata.php (Accessed 19 August 2014).
References
• VINOGRADOV, S AND PASTSYAK, A. 2012. Evaluation of data anonymization tools. In: DBKDA 2012 : The Fourth International Conference on Advances in Databases, Knowledge, and Data Applications, held 29 February-5 March, 2012, Reunion Island. Wilmington, DE: International Academy, Research, and Industry Association (IARIA).
• What is data management? University Park, PA: Publishing and Curation Services, Penn State University Libraries, 2014. [Online] available at http://www.libraries.psu.edu/psul/pubcur/what_is_dm.html (Accessed 25 September 2014).
• YUAN, X. 2014. Visualization and visual analytics. Presentation 0n 12 June 2014 at CODATA International Training Workshop in Big Data for Science for Researchers from Emerging and Developing Countries, Beijing, China, 4-20 June 2014.