Data management issues
and how Societies can contribute
David Martinsen, Senior Scientist, Digital Strategy and Platform Development, Publications Division, American Chemical Society
AIP Assembly of Society Officers, March 27, 2014, College Park, MD
27 March 2014
Outline
1. Background/Publisher Perspective
2. NISO/NFAIS Recommendation
3. Publisher Initiatives
4. Data Citation
5. Government Initiatives
6. Institutional Initiatives
7. Data Repositories
8. Research Data Advocacy
9. What can Societies Do?
Resources for Data Management
Part 1: Background/Publisher Perspective
Avoiding a Digital Dark Age for Data: why data and publications belong together
Integration of Research Data and Publications
Eefke Smit, International Association of STM Publishers, Director, Standards and Technology
ICSTI workshop Delivering Data in Science, Paris, 5 March 2012
A famous paper in Nature: DNA structure - 1953
• 1 page • 2 authors • 1 figure • no data
Source: V. Kiermer, Nature Publishing Group, 2011
Nature in 2001: The human genome issue • 62 pages, 49 figures, 27 tables
Source: V. Kiermer, Nature Publishing Group, 2011
The Data Publication Pyramid
Layers, top to bottom: Publications with data; Processed Data and Data Representations; Data Collections and Structured Databases; Raw Data and Data Sets
(1) Data contained and explained within the article
(2) Further data explanations in any kind of supplementary files to articles
(3) Data referenced from the article and held in data centers and repositories
(4) Data publications, describing available datasets
(5) Data in drawers and on disks at the institute
The Pyramid’s likely short-term reality:
Layers, top to bottom: Pubs; Supps; Data Archives; Data on Disks and in Drawers
(1) Top of the pyramid is stable but small
(2) Risk that supplements to articles turn into Data Dumping places
(3) Too many disciplines lack a community-endorsed data archive
(4) Estimates are that at least 75% of research data is never made openly available
The Ideal Pyramid
Layers, top to bottom: Data In Publications; Article Supps; Data Archives; Data on Disks and in Drawers
(1) More integration of text and data, viewers and seamless links to interactive datasets
(2) Only if data cannot be integrated in article, and only relevant extra explanations
(3) Seamless links (bi-directional) between publications and data, interactive viewers within the articles
(4) More Data Journals that describe datasets, data management plans and data methods
Part 2: NISO/NFAIS Recommendations on
Supplemental Journal Article Material
The NISO/NFAIS Working Group
Business Working Group Co-chairs: Linda Beebe (APA), Marie McVeigh (Thomson-Reuters ISI)
• Recommended Practices: scope and general principles
• Definitions: supplemental material, article, data, metadata
• Roles and responsibilities of publishers, authors, editors, peer reviewers, libraries, A&I services, repositories
• Curation and life cycle: selection, peer review, editing, presentation, providing context, referencing, citing, managing/hosting, discovery, preservation
• Discoverability & Linking
• Intellectual property rights management
Technical Working Group Co-chairs: Dave Martinsen (ACS), Sasha Schwarzman (OSA)
• Metadata
• Persistent identifiers
• Preservation
• Packaging and exchange
• Supporting documentation: non-normative DTD, Tag Library, tagged samples
Part 3: Publisher Initiatives
Moving forward our shared data agenda: a view from the publishing industry
Connecting with Data Repositories, 1
Link to CCDC database (indicates that information for this article is available)
Screenshot of journal article on ScienceDirect (http://dx.doi.org/10.1016/j.jfluchem.2009.07.015)
Article Linking example: CCDC
Connecting with Data Repositories, 2
... clicking on the CCDC logo takes the reader to a page at the CCDC repository with data related to the article
Screenshot of information page at CCDC (Cambridge Crystallographic Data Centre)
Article Linking example: CCDC
Connecting with Data Repositories, 3
Tagged GenBank entry (genetic sequence)
Screenshot of journal article on ScienceDirect (http://dx.doi.org/10.1016/j.biortech.2010.03.063)
Entity Linking example: GenBank Accession Number
Connecting with Data Repositories, 4
... clicking on the linked GenBank accession code takes the reader to an information page on the NCBI data repository about that specific genetic sequence
Screenshot of information page at NCBI (National Center for Biotechnology Information)
Entity Linking example: GenBank Accession Number
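The entity-linking step shown on these two slides can be sketched in a few lines of Python. This is only an illustration of the idea, not the publisher's actual tagging pipeline: the regular expression is a deliberately simplified pattern for traditional nucleotide accession numbers (one letter plus five digits, or two letters plus six digits), and the helper name `link_accessions` and the choice of the NCBI nucleotide URL prefix are assumptions made for this example.

```python
import re

# Simplified, assumed pattern: one letter + five digits, or two letters + six digits.
# Real accession formats are more varied; a production tagger would need more rules.
ACCESSION = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")

# Assumed link target: NCBI nucleotide record pages.
NCBI_NUCCORE = "https://www.ncbi.nlm.nih.gov/nuccore/"


def link_accessions(text: str) -> str:
    """Wrap every accession-like token in an HTML link to its NCBI record page."""
    def to_link(match: re.Match) -> str:
        accession = match.group(0)
        return f'<a href="{NCBI_NUCCORE}{accession}">{accession}</a>'

    return ACCESSION.sub(to_link, text)


print(link_accessions("The sequence was deposited under accession U49845."))
```

A pattern this loose will also tag strings that merely look like accession numbers, which is one reason entity tagging in practice tends to be combined with author-supplied markup or validation against the repository.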
Connecting with Data Repositories, 5
Database | Subject | Type of Linking
CCDC | Crystallography | Article-level
PANGAEA | Earth Sciences | Article-level*
EMBL Molecular Interactions | Chemistry | Entity, tagging
Molecular INTeraction DB | Chemistry | Entity, tagging
GenBank | Nucleotides | Entity, tagging
UniProt | Proteins | Entity, tagging
Protein Data Bank | Proteins | Entity, tagging
ClinicalTrials | Medicine | Entity, tagging
TAIR (Arabidopsis) | Model organism | Entity, tagging
Mendelian Inheritance in Man | Genetics, inheritance | Entity, tagging
* with Application
Data accessibility: Why?
• Researchers should get credit – data should be citable
• Help authors comply with funding agency accessibility requirements
• Data should be not only accessible but usable: this requires industry standards and guidelines that suit community-dependent needs
• Reduce barriers to furthering research
US National Science Board’s (NSB) Task Force on Data Policies issued its report on Digital Research Data Sharing and Management (NSB 2011) recommending “require grantees to make both the data and the methods and techniques used in the creation and analysis of the data accessible for the purposes of building upon or verifying figures, tables, findings, and conclusions in peer reviewed publications.”
Data accessibility: How?
• NSF-funded project with AAS and AIP
• Survey and researcher workshops to gather community requirements
• Establish community-specific metadata requirements and format best practices
• Pilot project to link to data sets from published articles
• OSTI will register the datasets for the Physics of Plasmas articles involved in the project.
• Allow researchers to use recommended repositories that fit with archiving and metadata standards
• DOIs for datasets will allow data to be cited as its own entity
AAS/AIP data access project
Project goals:
1. Extend methods for providing access to data objects
2. Survey community attitudes about data sharing and re-use
3. Engage community in discussions about formats / metadata
Responds to calls for digital data curation and sharing
• EC: Riding the Wave (2010)
• UK/RIN: Collaborative Yet Independent (2011)
• US NSB: Digital Research Data Sharing and Management (2011)
AAS/AIP data sharing survey
Highlights regarding sharing data
1. 62% answered yes to the question: “In the past 2 years, have you shared the dataset(s) used to generate data elements (tables, figures, etc.) of an article that you had published?”
2. How data was provided (bar chart, %)
3. 58% answered definitely/probably and 28% possibly to the statement: “Within the next few years, I will provide the dataset(s) that generated the data elements (tables, figures, etc.) as a supplement to an article submission.”
AAS/AIP data sharing survey
Highlights regarding using data
1. 60% answered yes to the question: “In the past two years, have you requested, acquired, or worked with datasets that were made available by other researchers as a supplement to a published article?”
2. How data was obtained (bar chart, %): Directly from author; Large data repository; Publishing Journal; Affiliated institution; Other
3. How data was used (bar chart, %): Explore new questions; Integrate sources; Replicate work; Other
Copyright © 2011 American Chemical Society
Why is ACS concerned with data?
From The Journal of Organic Chemistry: For Notes, Brief Communications, and Articles, all experimental procedures and listings of compound characterization data must be included in the manuscript file’s experimental section, and not in the supporting information. The supporting information should contain only copies of spectra, chromatograms, graphs, tables, crystallographic data, and computational data.
Why is ACS concerned with data?
• Integrity of published research and reproducibility
The Journal upholds a high standard for compound characterization to ensure that compounds being added to the chemical literature have been correctly identified and can be synthesized in known yield and purity by the reported preparation, isolation, and purification methods. For all new compounds, evidence adequate to establish both identity and degree of purity (homogeneity) must be provided. Purity documentation must be provided for known compounds whose preparation by a new or improved method is reported. For combinatorial libraries containing more than 20 compounds, complete characterization data must be provided for at least 20 diverse members. Authors may be asked to provide copies of original spectra or analytical reports if an editor or reviewer raises a question about any of the reported results.
Why is ACS concerned with data?
• Integrity of published research and reproducibility
Crystal structures: Regardless of the level of detail of the discussion of the structure, a Crystallographic Information File (CIF) containing complete details of data collection, crystal and unit-cell parameters, structure solution and refinement, and tables of atomic coordinates and thermal parameters, bond lengths, bond angles, and torsion angles should be furnished as supporting information.
Why is ACS concerned with data?
• Integrity of published research and reproducibility
Purity: Evidence for documenting compound purity should include one or more of the following:
• A standard 1D proton NMR spectrum or proton-decoupled carbon NMR spectrum showing at most trace peaks not attributable to the assigned structure. A copy of a spectrum with a signal-to-noise ratio sufficient to permit seeing peaks with 5% of the intensity of the strongest peak should be included in the supporting information. The normal full range of chemical shifts should be displayed (usually 0–10 ppm for proton; 0–220 ppm for carbon). For new compounds, copies of both proton and carbon spectra are required (see ‘Identity’ above).
Why is ACS concerned with data?
• Integrity of published research and reproducibility
Spectral Data: Reproductions of spectra will be published in the results and discussion section only when concise numerical summaries are inadequate for the discussion. Papers dealing primarily with interpretation of spectra, and those in which band shape or fine structure needs to be illustrated, may qualify for this exception. When presentation of spectra is essential, only the pertinent sections, prepared as figures, should be included. Spectra used as adjuncts to the characterization of compounds should be included in the supporting information.
data submission for characterizing compounds
Some Data Journals
NEW TRADITIONAL
Recent Development at PLOS
• http://www.plos.org/plos-data-policy-faq/
• http://dx.doi.org/10.1371/journal.pbio.1001797
Data are any and all of the digital materials that are collected and analyzed in the pursuit of scientific advances. In line with its stance on providing Open Access to research articles themselves, PLOS strongly believes that, to best foster scientific progress, the underlying data should be made freely available for researchers to use, wherever this is legal and ethical. Data availability allows validation, replication, reanalysis, new analysis, reinterpretation, or inclusion into meta-analyses, and facilitates reproducibility of research [1]. Making data available for all these uses provides a better “bang for the buck” out of scientific research, much of which is funded from public or nonprofit sources.
Ultimately, our viewpoint is quite simple: Ensuring access to the underlying data should be an intrinsic part of the scientific publishing process.
Part 4: Data Citation
DataCite Facts
• DataCite is a not-for-profit organization whose aim is to:
  • establish easier access to research data on the Internet
  • increase acceptance of research data as legitimate, citable contributions to the scholarly record
  • support data archiving that will permit results to be verified and re-purposed for future study.
• DataCite US: California Digital Library, OSTI, Purdue University Libraries
DataCite DOI Services
• DataCite serves as a DOI registration agency, like CrossRef
• Acts as resolving “agent” for dataset DOIs. “All DataCite DOIs resolve to a public landing page that contains information about the associated dataset and a direct link to the dataset itself.” Maintains information in the DataCite Metadata Store.
• DataCite allows DOI resolution to multiple formats of the same data. Suppliers can specify multiple resource URLs
• CrossRef and DataCite collect bibliographic metadata about the works they link to and collaborate. Therefore, this metadata can be retrieved from the dx.doi.org DOI resolver too, using content negotiation to request a particular representation of the metadata.
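The content negotiation mentioned in the last bullet can be sketched in Python. This is a minimal illustration rather than an official client: it builds, but does not send, an HTTP request to the DOI resolver with an `Accept` header naming the metadata representation wanted. The helper name `build_metadata_request` and the small format table are assumptions made for this example; the media types listed are ones commonly used for DOI content negotiation, and the DOI is the PLOS editorial cited earlier in this deck.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumed, non-exhaustive set of media types used for DOI content negotiation.
MEDIA_TYPES = {
    "csl-json": "application/vnd.citationstyles.csl+json",
    "bibtex": "application/x-bibtex",
    "rdf-xml": "application/rdf+xml",
}

# doi.org is the current hostname of the resolver the slide refers to as dx.doi.org.
RESOLVER = "https://doi.org/"


def build_metadata_request(doi: str, fmt: str = "csl-json") -> Request:
    """Build (but do not send) a request asking the resolver for metadata."""
    if fmt not in MEDIA_TYPES:
        raise ValueError(f"unknown format: {fmt!r}")
    # Keep the slash between DOI prefix and suffix; escape anything unusual.
    url = RESOLVER + quote(doi, safe="/")
    # The Accept header is what turns a plain DOI lookup into content negotiation.
    return Request(url, headers={"Accept": MEDIA_TYPES[fmt]})


req = build_metadata_request("10.1371/journal.pbio.1001797", "bibtex")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would send it. Without the `Accept` header the same URL resolves to the publisher or repository landing page, which is why a single DOI can serve both readers and machines.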
Part 5: Government Initiatives
(a few examples)
Part 6: Institutional Initiatives
(a few examples)
A Few Examples of Library Initiatives
• Johns Hopkins University (http://dmp.data.jhu.edu/)
• California Digital Library (https://dmp.cdlib.org)
• Purdue University (https://purr.purdue.edu/dmp/)
• University of North Carolina (http://guides.lib.unc.edu/researchdatatoolkit)
• Stanford University (http://dataplan.stanford.edu)
• DPN – Digital Preservation Network (http://www.dpn.org)
DPN Members
Data Curation Programs
• U. of Illinois: Digital Curation Education Program (DCEP):
– http://cirss.lis.illinois.edu/CollMeta/dcep.html
• U. of North Carolina: Post-Masters Certificate: Data Curation:
– http://sils.unc.edu/programs/graduate/post-masters-certificates/data-curation
• U. of Tennessee: Data Curation Education in Research Centers (DCERC):
– http://www.sis.utk.edu/dcerc
• April 30, 2013: Introduction to Data Science MOOC
– http://escience.washington.edu/blog/data-science-mooc-engages-students-solving-problems-real-organizations
Part 7: Data Repositories
Part 8: Research Data Advocacy
Caution: Editorial, The Human Genome at Ten
Nature 464, 649-650 (1 April 2010) | doi:10.1038/464649a; Published online 31 March 2010
“But for all the intellectual ferment of the past decade, has human health truly benefited from the sequencing of the human genome? A startlingly honest response can be found on pages 674 and 676, where the leaders of the public and private efforts, Francis Collins and Craig Venter, both say ‘not much’.”
Part 9: What is the role for Societies?
What is the role for Societies?
• Ensure that supplemental materials are published with an emphasis on preservation and reusability
• Foster best practices to help address concerns over reproducibility (note: launch of Meta-Research Innovation Center at Stanford, http://med.stanford.edu/metrics/)
• Keep track of the evolving environment
• Listen to your members' concerns and help frame the discussion
• Help your members comply with data management plans
• Work with institutions, vendors, and researchers to gain experience in handling and publishing data
Conclusion: There is a lot happening in a diversity of environments. Societies and Publishers have something to offer, but this is going to be a collaborative effort.