![Page 1: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/1.jpg)
The Rise of Data Publishing in the Digital World
(and how Dataverse and DataTags help)
Mercè Crosas, Ph.D.
Chief Data Science and Technology Officer
Institute for Quantitative Social Science, Harvard University
@mercecrosas
NDSR 2016 Symposium
![Page 2: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/2.jpg)
From 1665 to the late 20th century:
A steady increase in the size and complexity of research output
![Page 3: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/3.jpg)
The number of journals has doubled every 20 years since the 1750s, alongside growth in the number of scientists
1700: 3 journals
1800: ~10 journals
1900: ~400 journals
2000: ~14,000 journals (peer-reviewed)
[Figure: number of journals by year, 1665 to 1965; vertical axis marked at 100 and 10,000]
Mabe, 2003
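The "doubles every 20 years" claim can be checked against the four counts listed on this slide. The sketch below assumes constant exponential growth between consecutive data points; the function name and the dictionary are illustrative and not part of the talk.

```python
import math

# Approximate journal counts as listed on the slide (Mabe, 2003).
JOURNALS = {1700: 3, 1800: 10, 1900: 400, 2000: 14_000}


def doubling_time(year_a, count_a, year_b, count_b):
    """Years needed to double, assuming constant exponential growth
    between the two data points."""
    return (year_b - year_a) * math.log(2) / math.log(count_b / count_a)


years = sorted(JOURNALS)
for a, b in zip(years, years[1:]):
    t = doubling_time(a, JOURNALS[a], b, JOURNALS[b])
    print(f"{a}-{b}: doubling roughly every {t:.1f} years")
```

With these rounded counts, the 1800-1900 and 1900-2000 intervals both come out a little under 20 years, while 1700-1800 is much slower, which is consistent with the slide dating the 20-year doubling from the 1750s.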
![Page 4: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/4.jpg)
Data Tables and Visuals Become Increasingly Common, and part of the Scientific Argument
a few tables & visuals, as part of the text
50% of articles have tables & figures
most articles have tables & figures, often standalone
50% cite previous work
100% with citations (1 per 100 words)
part of scholarly credit
method sections appear
First line graphs and bar charts (Playfair, 1786)
First scatterplots (Herschel, 1833; Galton, 1896)
[Figure: timeline axis, 1665 to 1965; vertical axis marked at 100 and 10,000]
![Page 5: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/5.jpg)
Scholarly Publishing Adapts to the Increase of Cognitive Complexity (Gross et al., 2001)
• 18th century:
  • formal components appear in articles (introduction, conclusions, tables, figures, citations)
• 19th century:
  • explain data instead of establishing observations of facts
  • wide use of visuals, high citation density, methods section
• 20th century:
  • structured quantitative data with increased use of statistics
  • wide range of data types with new technologies
• Number of scientists increases from hundreds to a few million
• Science becomes extremely specialized:
  • from 1 journal to 14,000 peer-reviewed journals
  • one new journal for each 150 authors, read by 500
![Page 6: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/6.jpg)
In recent decades, more and more publications and data
![Page 7: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/7.jpg)
A Steeper Growth of Scholarly Output
Since 1950, the total number of journals doubles every ~15 years
2010: 80,000 journals
2010: 33,000 peer-reviewed
![Page 8: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/8.jpg)
An Outburst of Research Data and Specialization Results in > 1,000 Community Repositories
1920 - 1950s: First Social Science Data Archives (ODUM, ICPSR, ...)
1970 - 1980s: First Biomedical Databases (PDB, GenBank, ...)
2016: A wide range of Research Data Repositories (1500 repositories listed in re3data.org)
![Page 9: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/9.jpg)
Data Publishing Emerges as the Union of Scholarly Publishing and Data Archiving
Scholarly publishing: Distribute research output
• Attribution and credit
• Dissemination
• Finding & Reuse
Data Archiving: Long-term access to data
• Accessibility
• Preservation
• Finding & Reuse
![Page 10: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/10.jpg)
Why Data Publishing now?
• Data (and software) have become common input and output of research
• A scholarly article cannot accurately hold or describe these vast amounts of data and software
• As input and output of research, data must be citable and accessible to enable validation and reuse, with attribution
Extending the thesis of Gross et al., data publishing accommodates the complexity of research input and output in the digital world.
![Page 11: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/11.jpg)
What is needed for FAIR Data Publishing
Data Citation
• Persistent id to reference data uniquely
• Support for versions and fixity
• Attribution to authors and repository
Metadata
• Catalog to discover and locate the data
• Sufficient information to understand and reuse the data
Repository
• Digital access to metadata and data
• Archive and preservation for long-term access
• Interoperability through standards and APIs
FAIR = Findable Accessible Interoperable Reusable
![Page 12: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/12.jpg)
![Page 13: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/13.jpg)
A data repository system that serves as a solution for publishing FAIR research data
![Page 14: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/14.jpg)
Around the World
Harvard Dataverse: Generic data repository open to researchers worldwide
Dataverse repositories serve a community, an institution, an archive, ...
![Page 15: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/15.jpg)
Dataverses contain datasets; datasets contain metadata and data files
![Page 16: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/16.jpg)
Data Citation in Dataverse
Published Year
Dataset Title
Global Persistent Identifier
Repository = Data Publisher
Version (or time range)
Authors
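The citation components labeled on this slide can be assembled into a single string. The sketch below is illustrative only: the field order, separators, class name, and every example value (including the identifier) are assumptions for demonstration, not the exact format Dataverse generates.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataCitation:
    """The six components labeled on the slide."""
    authors: Tuple[str, ...]
    year: int
    title: str
    persistent_id: str   # global persistent identifier, e.g. a DOI
    repository: str      # the repository acts as the data publisher
    version: str         # version (or time range)

    def format(self) -> str:
        # Assumed ordering: authors, year, title, identifier, repository, version.
        names = "; ".join(self.authors)
        return (f'{names}, {self.year}, "{self.title}", '
                f'{self.persistent_id}, {self.repository}, {self.version}')


# Hypothetical values; the identifier below does not resolve to a real dataset.
example = DataCitation(
    authors=("Doe, Jane", "Roe, Richard"),
    year=2016,
    title="Example Survey Data",
    persistent_id="doi:10.0000/EXAMPLE",
    repository="Example Dataverse",
    version="V1",
)
print(example.format())
```

Keeping the version as its own field means a citation can point at one specific published state of the dataset while the persistent identifier stays the same.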
![Page 17: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/17.jpg)
Data Citation Basics
Force11, Joint Declaration of Data Citation Principles; Starr et al, 2015
The dataset landing page is accessible and guaranteed by the repository (or data publisher), even when data are restricted or deaccessioned
![Page 18: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/18.jpg)
Metadata in Dataverse

| Metadata Level | Fields | Standards |
|---|---|---|
| Citation Metadata | author, title, repository, year published, version, etc. | Dublin Core; DataCite |
| Domain-specific Metadata | data collection info (methods, organism, observation, survey, experiment, etc.) | DDI (social sciences); ISA-Tab BioCaddie (biomed); Virtual Observatory (astro); + custom metadata blocks |
| File-level Metadata | metadata inside the data file (variables, instrument details, geospatial info, etc.) | DDI (for variables); + more to be determined |

Dataverse JSON Schema
![Page 19: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/19.jpg)
Information Extraction: Tabular Files
RData, Stata, SPSS, Excel, CSV

| | var 1 | var 2 | var 3 |
|---|---|---|---|
| obs 1 | 2 | a | 0 |
| obs 2 | 4 | c | 0 |
| obs 3 | 6 | b | 1 |
| obs 4 | 1 | e | 0 |
| obs 5 | 2 | a | 1 |
| obs 6 | 3 | b | 1 |

Variable Metadata: variable name, label, type, stats, geospatial coordinates
Data Values: independent of format (the values in the table, without observation or variable labels)
Universal Numerical Fingerprint (UNF): checksum on data values, from canonical format
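The idea of a checksum computed on data values rather than on file bytes can be illustrated with a short sketch. This is not the UNF specification: the canonical form chosen here (7 significant digits in exponent notation, null-terminated fields, SHA-256) and all function names are assumptions made for the example. It uses the six observations from the slide, read once from CSV text and once from JSON text.

```python
import csv
import hashlib
import io
import json


def canonical(value) -> bytes:
    """Assumed canonical form: numbers in exponent notation with
    7 significant digits, everything else as UTF-8 text."""
    if isinstance(value, (int, float)):
        return format(float(value), ".6e").encode("ascii")
    return str(value).encode("utf-8")


def fingerprint(rows) -> str:
    """Hash the canonical values row by row, ignoring the source format."""
    digest = hashlib.sha256()
    for row in rows:
        for value in row:
            digest.update(canonical(value) + b"\x00")
        digest.update(b"\n")
    return digest.hexdigest()


def coerce(text):
    """Turn CSV text cells into numbers where possible."""
    try:
        return float(text)
    except ValueError:
        return text


CSV_TEXT = "var1,var2,var3\n2,a,0\n4,c,0\n6,b,1\n1,e,0\n2,a,1\n3,b,1\n"
JSON_TEXT = json.dumps(
    [[2, "a", 0], [4, "c", 0], [6, "b", 1], [1, "e", 0], [2, "a", 1], [3, "b", 1]]
)

reader = csv.reader(io.StringIO(CSV_TEXT))
next(reader)  # skip the header row: variable names are metadata, not values
csv_rows = [[coerce(cell) for cell in row] for row in reader]
json_rows = json.loads(JSON_TEXT)

print(fingerprint(csv_rows) == fingerprint(json_rows))
```

Because both inputs are reduced to the same canonical values before hashing, the two fingerprints match even though the files differ byte for byte; changing any single value changes the fingerprint.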
![Page 20: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/20.jpg)
Information Extraction: FITS (astro) Files
Header Metadata: coordinates (R.A., declination), photometric info, ...
Data Objects:
• Image Files
• Spectra
• Data cubes
• Tables
• ...
![Page 21: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/21.jpg)
In addition to data citation and metadata features, Dataverse has a rich set of features that
facilitate data publishing
![Page 22: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/22.jpg)
Tiered Access
| Tier | Metadata | Files | How to Access |
|---|---|---|---|
| Open (default): CC0 | Open | Open | Click to download |
| GuestBook | Open | Open | Fill in guestbook before download |
| Terms of Use | Open | Open | Click through terms of use before download |
| Data Restricted | Open | Restricted | Request access via click through |
| Data Restricted | Open | Restricted | Request access via application |
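The tiers above can be written down as plain records, which makes the pattern explicit: metadata stays open in every tier, and only the files become restricted. The sketch below transcribes the slide; the type and function names are illustrative and do not correspond to any Dataverse API.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    metadata_open: bool
    files_open: bool
    how_to_access: str


# Transcribed from the slide; the names are labels, not API identifiers.
TIERS = [
    Tier("Open (default): CC0", True, True, "Click to download"),
    Tier("GuestBook", True, True, "Fill in guestbook before download"),
    Tier("Terms of Use", True, True, "Click through terms of use before download"),
    Tier("Data Restricted", True, False, "Request access via click through"),
    Tier("Data Restricted", True, False, "Request access via application"),
]


def needs_access_request(tier: Tier) -> bool:
    """Files in restricted tiers require an access request first."""
    return not tier.files_open


for tier in TIERS:
    status = "request required" if needs_access_request(tier) else "direct"
    print(f"{tier.name}: {status} ({tier.how_to_access})")
```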
![Page 23: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/23.jpg)
Data Publishing Workflows
Create Dataset (landing page restricted)
→ Review (collaborators or anonymous reviewers)
→ Publish v. 1
→ Minor change (metadata only): Publish v. 1.1
→ Major change (might include new data file): Publish v. 2
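The version numbering in this workflow follows a simple rule: a minor change increments the number after the point, and a major change increments the number before it. A minimal sketch of that rule follows; the tuple representation and function names are assumptions for illustration, not Dataverse code.

```python
from typing import Tuple

Version = Tuple[int, int]  # (major, minor)


def next_version(current: Version, major_change: bool) -> Version:
    """Return the version to publish after a change.

    major_change=True  -> e.g. a new data file: 1.1 becomes 2
    major_change=False -> metadata-only edit:   1   becomes 1.1
    """
    major, minor = current
    if major_change:
        return (major + 1, 0)
    return (major, minor + 1)


def label(version: Version) -> str:
    """Format the version the way the slide writes it."""
    major, minor = version
    return f"v. {major}" if minor == 0 else f"v. {major}.{minor}"


v1 = (1, 0)
v1_1 = next_version(v1, major_change=False)
v2 = next_version(v1_1, major_change=True)
print(label(v1), label(v1_1), label(v2))
```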
![Page 24: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/24.jpg)
And more at dataverse.org guides ...
![Page 25: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/25.jpg)
Biomedical Dataverse addresses data publication of large files: SBGridData
![Page 26: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/26.jpg)
The Biomedical Dataverse at Harvard Medical School, also tested as a persistent repository for LINCS data (NIH Library of Integrated Network-based Cellular Signatures)
Collaboration with Piotr Sliz and Caroline Shamu (HMS)
![Page 27: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/27.jpg)
An additional challenge for data publishing:
Sensitive Data
![Page 28: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/28.jpg)
“User Uploads must be void of all identifiable information, such that re-identification of any subjects from the amalgamation of the information available from all of the materials (across datasets and dataverses) uploaded under any one author and/or user should not be possible.”
![Page 29: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/29.jpg)
“Submitter represents and warrants that the Content does not contain any information (i) which identifies, or which can be used in conjunction with other publicly available information to personally identify, any individual;”
![Page 30: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/30.jpg)
“If you are submitting human sequences to GenBank, do not include any data that could reveal the personal identity of the source. It is our assumption that you have received any necessary informed consent authorizations that your organizations require prior to submitting your sequences.”
GenBank
![Page 31: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/31.jpg)
How can we maximize publishing sensitive data while
being mindful of privacy?
![Page 32: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/32.jpg)
Sweeney L, Crosas M, Bar-Sinai M. Sharing Sensitive Data with Confidence: The DataTags System. Technology Science. 2015101601. October 16, 2015. http://techscience.org/a/2015101601
The DataTags System
![Page 33: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/33.jpg)
A datatag is a set of security features and access requirements for file handling
A datatags repository is one that stores and shares data files in accordance with standardized and ordered levels of security and access requirements
![Page 34: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/34.jpg)
Datatags Levels

| Tag Type | Description | Security Features | Access Requirements |
|---|---|---|---|
| Blue | Public | Clear storage; clear transmission | Open |
| Green | Controlled public | Clear storage; clear transmission | Email, OAuth verified registration |
| Yellow | Accountable | Clear storage; encrypted transmit | Password, Registered, Approval, Click DUA |
| Orange | More accountable | Encrypted storage; encrypted transmit | Password, Registered, Approval, Signed DUA |
| Red | Fully accountable | Encrypted storage; encrypted transmit | Two-factor authentication, Approval, Signed DUA |
| Crimson | Maximally restricted | MultiEncrypt store; encrypted transmit | Two-factor authentication, Approval, Signed DUA |
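Because the tag levels are ordered, they map naturally onto an ordered enumeration. The sketch below transcribes the security features from the table; the `can_host` rule (a repository certified for one level may hold files tagged at that level or below) is an assumption added for illustration and is not taken from the DataTags system itself.

```python
from enum import IntEnum


class DataTag(IntEnum):
    """Tag levels from the table, ordered from least to most restrictive."""
    BLUE = 1
    GREEN = 2
    YELLOW = 3
    ORANGE = 4
    RED = 5
    CRIMSON = 6


# (storage, transmission) per tag, transcribed from the table.
SECURITY = {
    DataTag.BLUE: ("clear", "clear"),
    DataTag.GREEN: ("clear", "clear"),
    DataTag.YELLOW: ("clear", "encrypted"),
    DataTag.ORANGE: ("encrypted", "encrypted"),
    DataTag.RED: ("encrypted", "encrypted"),
    DataTag.CRIMSON: ("multi-encrypted", "encrypted"),
}


def requires_encrypted_storage(tag: DataTag) -> bool:
    return SECURITY[tag][0] != "clear"


def requires_encrypted_transmission(tag: DataTag) -> bool:
    return SECURITY[tag][1] != "clear"


def can_host(repository_level: DataTag, file_tag: DataTag) -> bool:
    """Assumed rule: a repository may hold files at or below its own level."""
    return file_tag <= repository_level


for tag in DataTag:
    print(tag.name, SECURITY[tag])
```

Using an ordered type means a comparison such as `file_tag <= repository_level` is enough to express "at least as secure as", without listing every pair of levels.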
![Page 35: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/35.jpg)
DataTags Workflow in a Dataverse Repository (under development)
Data File Ingestion
Sensitive Dataset
Direct Access
Privacy Preserving Access
Automatic Interview
Review Board Approval
Two-factor Authentication; Signed DUA
http://datatags.org
http://privacytools.seas.harvard.edu
![Page 36: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/36.jpg)
Example of DataTags Interview
![Page 37: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/37.jpg)
Example of DataTags Interview
![Page 38: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/38.jpg)
Thanks! And join us at this year’s Dataverse Community Meeting
![Page 39: The Rise of Data Publishing in the Digital World · The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help) Mercè Crosas, Ph.D. Chief Data Science](https://reader033.fdocuments.us/reader033/viewer/2022042223/5ec9eab675dc0534da69c33b/html5/thumbnails/39.jpg)
References
• http://dataverse.org
• http://dataverse.harvard.edu
• http://datatags.org
• Sweeney L, Crosas M, Bar-Sinai M, 2015, Sharing Sensitive Data with Confidence: The DataTags System. Technology Science, http://techscience.org/a/2015101601
• Gross, Harmon, Reidy, 2001, Communicating Science
• Mabe, 2003, The Growth and Number of Journals
• Friendly, 2006, A Brief History of Data Visualization