Mercè Crosas, Ph.D., Chief Data Science and ... -...
![Page 1: Mercè Crosas, Ph.D., Chief Data Science and ... - datatags.org · datatags.org/files/datatags/files/niso-largedatasets-crosas.pdf · Dataverse on the Massachusetts Open Cloud (MOC): Computing closer to data storage](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/1.jpg)
ADDRESSING THE NEXT CHALLENGES IN DATA SHARING: LARGE-SCALE DATA AND SENSITIVE DATA
Mercè Crosas, Ph.D., Chief Data Science and Technology Officer, Institute for Quantitative Social Science, Harvard University, @mercecrosas
![Page 2](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/2.jpg)
Data sharing: good for you and good for the world
Researchers: get credit for their data
Publishers and journals: verify published work
Federal funding agencies: make public assets accessible
Science: validate, reuse and extend previous work
![Page 3](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/3.jpg)
Data Sharing (or Publishing)
A formal data citation
• Reference
• Access (persistent identifier)
Information about the data (metadata)
• Discovery
• Use
A trusted data repository
• Access (long-term archival)
Data sharing needs to support data discovery, referencing, access, and reuse.
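As a rough illustration of how citation metadata and a persistent identifier fit together, the sketch below assembles a citation string from a few fields. The field names, the citation layout, and the placeholder DOI are assumptions made for this example; they are not taken from the slides or from any particular repository.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DatasetRecord:
    """Minimal citation metadata for a dataset (hypothetical field names)."""
    authors: Tuple[str, ...]
    year: int
    title: str
    persistent_id: str  # e.g. a DOI; the value used below is a placeholder
    repository: str
    version: str


def format_citation(record: DatasetRecord) -> str:
    """Build a readable citation whose link resolves through the identifier."""
    authors = "; ".join(record.authors)
    url = f"https://doi.org/{record.persistent_id}"
    return (f'{authors}, {record.year}, "{record.title}", '
            f"{url}, {record.repository}, {record.version}")


example = DatasetRecord(
    authors=("Doe, Jane", "Roe, Richard"),
    year=2015,
    title="Example Survey Data",
    persistent_id="10.5072/FK2/EXAMPLE",  # placeholder, not a real dataset
    repository="Example Repository",
    version="V1",
)
print(format_citation(example))
```

The identifier, rather than a storage location, is what the citation points to, so the reference can stay valid if the files move.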
![Page 4](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/4.jpg)
dataverse.org
Open-source software developed at Harvard’s IQSS since 2006
Used to share, publish, cite and archive research data
Installed in 12 sites worldwide
Serving 100s of universities and organizations
![Page 5](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/5.jpg)
Harvard Dataverse: dataverse.harvard.edu
Started as a community repository for social science
Now open to all research fields and all researchers
More than 1,300 dataverses
More than 59,000 datasets
More than 1,400,000 downloads
![Page 6](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/6.jpg)
Data Sharing with Dataverse
Now
• No sensitive data
• Seldom versioning
• Datasets up to ~GB
The Next 5 Years
• Highly sensitive data
• Streaming or frequently updated data
• Datasets > GBs, TBs, PBs
– Thousands of files per dataset
– Large dataset in a Big Data, NoSQL storage (MongoDB, Cassandra, Lucene)
![Page 7](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/7.jpg)
Large-scale data sharing needs to continue supporting discovery, referencing, access and reuse.
![Page 8](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/8.jpg)
Adhering to the same high standards for large-scale data
• Metadata for discovery:
– citation metadata
– domain-specific descriptive metadata
– file-level or variable metadata
• Data citation for reference and access:
– for entire dataset and for subsets of the dataset (based on time of retrieval or variables selected)
• Fast queries, data exploration and visualizations for reuse:
– might not be able to download entire dataset
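One way to picture a citation for a subset is sketched below: the reference records the dataset identifier, the variables selected, and the time of retrieval, and derives a short digest from that description. The function name, the field names, the placeholder identifier, and the use of a SHA-256 digest are all assumptions made for illustration, not a description of how any repository implements subset citation.

```python
import hashlib
import json
from datetime import datetime, timezone


def subset_reference(dataset_pid, variables, retrieved_at):
    """Describe a subset by the variables selected and the time of retrieval.

    All names here are hypothetical; the digest is only one possible way to
    make the subset description compact and comparable.
    """
    descriptor = {
        "dataset": dataset_pid,
        "variables": sorted(variables),  # selection order should not matter
        "retrieved_at": retrieved_at.astimezone(timezone.utc).isoformat(),
    }
    canonical = json.dumps(descriptor, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return {**descriptor, "subset_id": digest}


when = datetime(2015, 6, 1, 12, 0, tzinfo=timezone.utc)
ref_a = subset_reference("doi:10.5072/FK2/EXAMPLE", ["age", "income"], when)
ref_b = subset_reference("doi:10.5072/FK2/EXAMPLE", ["income", "age"], when)
print(ref_a["subset_id"] == ref_b["subset_id"])  # True: same selection, same time
```

Because the description is canonicalized before hashing, two requests for the same variables at the same time map to the same reference, while a later retrieval of streaming data yields a different one.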
![Page 9](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/9.jpg)
Data retrieval, exploration and visualization of large-scale datasets require that data repositories be closer to computing resources.
![Page 10](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/10.jpg)
Current collaborations to address the next challenges in data sharing
SBGrid Data Repository (HMS, IQSS)
Social Science Big Data (IQSS)
Data Provenance (SEAS, IQSS)
Privacy Tools to share sensitive data (SEAS, Berkman, Privacy Lab, IQSS, MIT)
![Page 11](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/11.jpg)
Sharing and Preserving Large Structural Biology Data
Funded by
https://data.sbgrid.org/
![Page 12](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/12.jpg)
Structural Biology Primary Data
1 dataset is 180-360 images of X-ray diffraction data, 3.5-7 GB; ~1 TB per dataset, with a total up to 100 PB
Integration with Dataverse:
● Long-term access
● Formal Data Citation
● Standard Metadata
● Data Exploration (OME)
● Preservation, with copies in multiple sites (following dataPASS approach)
![Page 13](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/13.jpg)
Dataverse on the Massachusetts Open Cloud (MOC): Computing closer to data storage
Current Architecture
• UI Layer (PrimeFaces, js)
• Application Logic (Java EE)
• API
• PostgreSQL (user data, metadata)
• Solr (index)
• RServe (R ingest, analysis)
• Network File System (data files)
On the MOC
• Dataverse: UI Layer (PrimeFaces, js), Application Logic (Java EE), API, PostgreSQL (user data, metadata), Solr (index)
• COMPUTE SERVICES (R, Python, Spark, Hadoop, …)
• CINDER block storage
• SWIFT object storage
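The move from a network file system to object storage can be pictured as swapping the backend behind a small storage interface. The sketch below is only an illustration under that assumption: `DataFileStore`, `FileSystemStore`, and `InMemoryObjectStore` are hypothetical names, and the object store is an in-memory stand-in that does not call any real Swift API.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class DataFileStore(Protocol):
    """What the application logic needs from a data-file backend (hypothetical)."""
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class FileSystemStore:
    """Data files kept under a directory, as with a network file system mount."""
    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemoryObjectStore:
    """Stand-in for an object store: containers holding named objects."""
    def __init__(self) -> None:
        self._containers = {}

    def put(self, key: str, data: bytes) -> None:
        container, _, name = key.partition("/")
        self._containers.setdefault(container, {})[name] = data

    def get(self, key: str) -> bytes:
        container, _, name = key.partition("/")
        return self._containers[container][name]


def round_trip(store: DataFileStore, key: str, data: bytes) -> bytes:
    """Application code is written against the interface, not a backend."""
    store.put(key, data)
    return store.get(key)


with tempfile.TemporaryDirectory() as tmp:
    print(round_trip(FileSystemStore(Path(tmp)), "dataset1/file.tab", b"a\tb\n"))
print(round_trip(InMemoryObjectStore(), "dataset1/file.tab", b"a\tb\n"))
```

With the backend hidden behind one interface, compute services placed next to the object store could read the same data files without the application logic changing.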
![Page 14](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/14.jpg)
Sharing Sensitive Data with Confidence: DataTags System
DataTag: A set of security features and access requirements for file handling
Sweeney, Crosas, Bar-Sinai, 2015, “Sharing Sensitive Data with Confidence: The DataTags System”, Technology Science
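Read as a data structure, a tag bundles security features with access requirements. The sketch below shows one possible shape; the tag names, the fields, and the `Access` values are invented for illustration and are not the tag levels defined in the cited paper.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    """Hypothetical access requirements."""
    OPEN = "open"
    REGISTERED = "registered users"
    APPROVED = "approved users with a signed agreement"


@dataclass(frozen=True)
class DataTag:
    """Illustrative tag: security features plus an access requirement."""
    name: str
    encrypt_in_transit: bool
    encrypt_at_rest: bool
    access: Access


def meets(tag: DataTag, *, transit_encrypted: bool, storage_encrypted: bool) -> bool:
    """Check whether a storage setup satisfies the tag's security features."""
    return ((transit_encrypted or not tag.encrypt_in_transit)
            and (storage_encrypted or not tag.encrypt_at_rest))


public = DataTag("example-public", False, False, Access.OPEN)
restricted = DataTag("example-restricted", True, True, Access.APPROVED)

print(meets(public, transit_encrypted=False, storage_encrypted=False))     # True
print(meets(restricted, transit_encrypted=True, storage_encrypted=False))  # False
```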
![Page 15](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/15.jpg)
Data Sharing Workflow for Sensitive Data
Sensitive Dataset
• Direct Access
• Privacy Preserving Access
http://datatags.org
http://privacytools.seas.harvard.edu
Authorized, signed DUA
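The two access paths in this workflow can be sketched as a simple routing rule. This is an assumption about how the diagram reads: that direct access is for authorized requesters with a signed DUA (data use agreement) and that everyone else is offered privacy-preserving access. The `Requester` type and `access_route` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requester:
    """Hypothetical description of someone requesting a sensitive dataset."""
    authorized: bool
    signed_dua: bool  # DUA: data use agreement


def access_route(requester: Requester) -> str:
    """Pick an access path for a sensitive dataset (illustrative rule only)."""
    if requester.authorized and requester.signed_dua:
        return "direct access"
    return "privacy-preserving access"


print(access_route(Requester(authorized=True, signed_dua=True)))
print(access_route(Requester(authorized=True, signed_dua=False)))
```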
![Page 16](https://reader036.fdocuments.us/reader036/viewer/2022071110/5fe54a4c16cb732fcd7da96d/html5/thumbnails/16.jpg)
THANKS
Piotrek Sliz (SBGrid, HMS), Latanya Sweeney (Data Privacy Lab, Harvard), Dataverse team (IQSS, Harvard) @mercecrosas