Databrary

Databrary David Millman, NYU • Rick Gilmore, PSU • Dylan Simon, NYU Coalition for Networked Information • CNI Fall 13 December 10, 2013 databrary.org


Transcript of Databrary

Page 1: Databrary

Databrary David Millman, NYU • Rick Gilmore, PSU • Dylan Simon, NYU

Coalition for Networked Information • CNI Fall 13

December 10, 2013

databrary.org

Page 2: Databrary

Key Aims of Databrary project
• Build a repository for sharing video
• Provide tools for scoring video
• Provide data management tools
• Create policies that enable sharing
• Transform the culture of developmental science!


Page 4: Databrary

Current Funding
• NIH
  • National Institute of Child Health and Human Development
• NSF
  • Development & Learning Sciences Program
  • Research and Evaluation on Education in Science and Engineering (REESE)

Page 5: Databrary

What Users Can Do with Databrary

Page 6: Databrary

Use cases: Education, teaching
• I need video clips for teaching
• I want to illustrate an idea
• Show the range of behaviors and exceptions
• Show an excerpt in a talk

Page 7: Databrary

Use cases: Pre-research
• I want to browse the work in my field
• I want to know whether a study is worth doing
• I need preliminary data for a grant proposal
• I need ideas and inspiration
• I want to replicate, expand on, or review previous work

Page 8: Databrary

Use cases: Research
• I want to repurpose videos for new uses
• Replicate existing work by recoding videos
• I want to grow my sample size
• I want to include participants from other contexts and populations
• I want to conduct integrative analyses

Page 9: Databrary

Opportunities / Challenges

Raw data re-use
○ The data is video of people participating in experiments.
○ Can be immediately re-used in different domains without mapping or data dictionaries

Page 10: Databrary

Opportunities / Challenges

Video contains identifiable data
○ Faces, voices, possibly names & locations
○ De-identified data linked to video becomes identifiable
○ Enabling sharing while protecting privacy

Page 11: Databrary

Opportunities / Challenges

Structural consistency
○ No two labs organize material in the same way
○ What data structure works for both contributors and “consumers”?

Page 12: Databrary

Opportunities / Challenges

How “open” is it?
○ Identifiable data
○ Inter-institutional permission clearance
○ Permissions structure / delegation
○ New IRB, sponsored programs standards?

Page 13: Databrary

Opportunities / Challenges

Using significant university infrastructure
○ IT
○ Library
○ IRB
○ OSP
○ Counsel

Page 14: Databrary

Enabling sharing of identifiable Data

Page 15: Databrary

Data-sharing model

How it works today

Page 16: Databrary

Data-sharing model

Enter Databrary

Page 17: Databrary

Data-sharing model

Sharing with Databrary

Page 18: Databrary

Data-sharing model

New Investigator wants access to Databrary

Page 19: Databrary

Data-sharing model

Browsing, non-research

Page 20: Databrary

Data-sharing model

Conduct Research

Page 21: Databrary

Innovations / Insights
● Seek permission to share from people depicted in recordings
○ Extends informed consent
● Restrict access to
○ Recordings “permissioned” for sharing
○ Authorized researchers with ethics training
○ Researchers who agree to maintain privacy

Page 22: Databrary

Databrary Release Template
● Sharing ≠ research participation
● Data privacy
● Who has access?
● How long?
● No compensation
● Minor assent
● Levels of sharing

Page 23: Databrary

Levels of sharing
• Private: No sharing
• Shared: Sharing only with authorized researchers
• Excerptable: Sharing + excerpts may be created and shown by authorized researchers to the public
• Open: Sharing with the public
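The four levels above form an ordered scale, which can be sketched as an enumeration. This is only an illustration, not Databrary's implementation: the names `ReleaseLevel`, `can_view_full_recording`, and `can_show_excerpt_publicly` are hypothetical, and the sketch assumes the viewer is someone other than the contributor.

```python
from enum import IntEnum


class ReleaseLevel(IntEnum):
    """Sharing levels from the slide, ordered from most to least restrictive."""
    PRIVATE = 0      # no sharing
    SHARED = 1       # authorized researchers only
    EXCERPTABLE = 2  # shared, plus excerpts may be shown to the public
    OPEN = 3         # shared with the public


def can_view_full_recording(level: ReleaseLevel, is_authorized: bool) -> bool:
    """Return True if a viewer (not the contributor) may see the full recording."""
    if level == ReleaseLevel.PRIVATE:
        return False          # private recordings are not shared at all
    if level == ReleaseLevel.OPEN:
        return True           # open recordings are visible to the public
    return is_authorized      # SHARED and EXCERPTABLE require authorization


def can_show_excerpt_publicly(level: ReleaseLevel) -> bool:
    """Return True if an excerpt of the recording may be shown to the public."""
    return level >= ReleaseLevel.EXCERPTABLE
```

Ordering the levels lets a single comparison answer questions such as "may an excerpt be shown in a talk?" without listing every level by hand.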

Page 24: Databrary

Recording sharing permission
• All depicted individuals
• Explicit yes/no boxes
• Adults and minors

Page 25: Databrary

Getting permissions right
• Electronically recorded permissions
• Linked to session- and participant-level metadata
• Avoid data entry errors
• Honor participants’ desired release level

• Spreadsheet template
• Web-based permission system
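One way to honor every participant's desired release level from a spreadsheet export is sketched below. The column names (`session_id`, `participant_id`, `release`) and the rule that the most restrictive choice among the people depicted in a session wins are assumptions made for illustration; the slides do not specify the template's layout or how conflicts are resolved.

```python
import csv
import io
from typing import Dict

# Release levels from most to least restrictive.
LEVELS = ["private", "shared", "excerptable", "open"]


def session_release_levels(csv_text: str) -> Dict[str, str]:
    """Map each session to the most restrictive level chosen by anyone in it."""
    effective: Dict[str, str] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        level = row["release"].strip().lower()
        if level not in LEVELS:
            # Reject unknown values instead of guessing, to avoid entry errors.
            raise ValueError(f"unknown release level: {row['release']!r}")
        session = row["session_id"].strip()
        current = effective.get(session)
        if current is None or LEVELS.index(level) < LEVELS.index(current):
            effective[session] = level
    return effective


example = (
    "session_id,participant_id,release\n"
    "s1,p1,open\n"
    "s1,p2,shared\n"
    "s2,p3,excerptable\n"
)
levels = session_release_levels(example)
```

Here session `s1` ends up as `shared` because one of its two participants chose a more restrictive level than the other.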

Page 26: Databrary

A better way...
● Why is the Databrary model better?
○ Clear and unambiguous
○ Consent to participate ≠ permission to share data
○ Easier for participants
○ More realistic conceptualization of risk
○ Standardization across contributors via templates

Page 27: Databrary

Building a user community
• Users must become Authorized Investigators
• Designing registration process
• Investigator Agreement
  • Covers data contributions, non-research, research use/re-use
• 1.0 will be a web form
• Institutional sign-off by Authorizing Official

Page 28: Databrary

Data-sharing model

Conduct Research

Page 29: Databrary

Who promises what

Access to Data
• Investigator: Applies for access; Ethics training; notifies if change Institutions; supervise affiliates; not anonymous
• Institution: Investigators are PIs, affiliates are associated with Institution; Certifies ethics training
• Databrary: Reviews, approves applications

Contributing data
• Investigator: Secures sharing permission from IRB, participants; transmits to Databrary; Removes PII from non-recordings
• Institution: Authorizes Databrary to share
• Databrary: Maintains sharing permission; collects and hosts data, metadata

Page 30: Databrary

Who promises what

Browsing, viewing data
• Investigator: Protect privacy; protect data; show excerpts only with permission
• Institution: May receive information about usage
• Databrary: Keeps track of who views, downloads; citations; metadata

Using data for research
• Investigator: Secure IRB approval; communicate IRB to Databrary
• Institution: Review and approve research protocols
• Databrary: Monitor research usage; store IRB protocol info

Monitoring and reporting
• Investigator: Report sharing or other violations
• Institution: Reports sharing or other violations; Treats violations as violations of scientific integrity
• Databrary: Reports sharing or other violations; may deny access, remove data

Page 32: Databrary

A data model for diverse data sets

Page 33: Databrary

A data model for Databrary
• Started by organizing around study
• Different meanings for study: paper, analysis, etc.
• Tremendous range in size of studies
• Meaning can change over time

• Raw data themselves are fixed, constant
• Begin by collecting raw, session data into datasets
• Layer analyses, research products on datasets

Page 34: Databrary

Organizational unit: Session
• Data collected at the same time, often single visit
• Defined by:
  • Date of test
  • Participant release level
• Contains raw data files (videos, etc)
• Associated with participant(s), other metadata

Page 35: Databrary

What’s in a Session?
• Like a folder
• A set of files
• Collected at a specific time
• Often a single visit or participant
• Datafiles, coding spreadsheets layered on later

Page 36: Databrary

Each file within a session
• Name/description
  • Home visit, interview, eye-tracking video, motion-tracking, EEG, ...
• File format
  • .pdf, .doc, .csv, .mp4, .opf, .mat, ...
• For video or other time series data
  • Start point in time and length
• Identifiable (video) or de-identified?
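The session and file attributes listed on the last few slides can be sketched as two small record types. This is a minimal illustration under assumptions: the class and field names (`SessionFile`, `Session`, `start_seconds`, and so on) are hypothetical and do not come from Databrary's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class SessionFile:
    name: str                  # e.g. "Home visit", "EEG"
    file_format: str           # e.g. ".mp4", ".csv"
    identifiable: bool         # video is identifiable; coding sheets may not be
    # Only meaningful for video or other time series data.
    start_seconds: Optional[float] = None
    length_seconds: Optional[float] = None

    def end_seconds(self) -> Optional[float]:
        """End point on the session timeline, or None for non-time-series files."""
        if self.start_seconds is None or self.length_seconds is None:
            return None
        return self.start_seconds + self.length_seconds


@dataclass
class Session:
    test_date: date            # date of test
    release_level: str         # participant release level
    files: List[SessionFile] = field(default_factory=list)

    def identifiable_files(self) -> List[SessionFile]:
        """Files that need the strictest handling."""
        return [f for f in self.files if f.identifiable]


session = Session(test_date=date(2013, 12, 10), release_level="shared")
session.files.append(SessionFile("Home visit", ".mp4", True, 0.0, 90.5))
session.files.append(SessionFile("Coding spreadsheet", ".csv", False))
```

Keeping the start point and length optional lets the same record describe a video clip and a spreadsheet that was layered on later.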

Page 37: Databrary

What’s in a dataset?

Page 38: Databrary

What’s in a dataset?
• Top-level, binding information (optional)
  • Title and short description
  • Data owners and other users with access
  • Excerpts
  • Procedures, stimuli, blank forms, IRB approvals, and other files
  • Funding information
• Set of sessions and metadata

Page 39: Databrary

How is a dataset organized?
• Many ways to organize a dataset
• User-defined groups (labels, tags, annotations)
  • By participants, conditions, visits, tasks, etc.
  • Associated with metadata “measures”
• Session assigned to arbitrarily many groups
• Groups specific to a single dataset
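The grouping idea above (any session can sit in arbitrarily many groups, and groups belong to one dataset) can be sketched as a mapping from group to session IDs. The `Dataset` class, its method names, and the `(kind, label)` key are hypothetical choices for this sketch, not Databrary's data model.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, List, Set, Tuple

GroupKey = Tuple[str, str]  # (kind, label), e.g. ("participant", "p01")


@dataclass
class Dataset:
    title: str
    # Groups live inside one dataset: group key -> IDs of member sessions.
    groups: DefaultDict[GroupKey, Set[str]] = field(
        default_factory=lambda: defaultdict(set)
    )

    def assign(self, session_id: str, kind: str, label: str) -> None:
        """Add a session to a group; a session may be in any number of groups."""
        self.groups[(kind, label)].add(session_id)

    def sessions_in(self, kind: str, label: str) -> List[str]:
        """Sorted session IDs in one group (empty if the group does not exist)."""
        return sorted(self.groups.get((kind, label), set()))

    def groups_of(self, session_id: str) -> List[GroupKey]:
        """Sorted list of every group that contains the session."""
        return sorted(k for k, members in self.groups.items() if session_id in members)


ds = Dataset("Example dataset")
ds.assign("s1", "participant", "p01")
ds.assign("s1", "condition", "walking")
ds.assign("s2", "participant", "p01")
```

Because membership is stored per group rather than per session, the same session can appear under a participant, a condition, and a task at once.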

Page 40: Databrary

Main grouping: Participants
• Each group represents a participant
• Includes any number of user-defined “measures”
  • Participant ID
  • Birthdate, gender, race/ethnicity
  • Geographic location, language, school grade, motor experience, disability, IQ, ...
  • Any other text, dates, numbers, ...
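User-defined measures that may hold text, dates, or numbers can be sketched as a free-form mapping attached to each participant. The `Participant` class, the `select` helper, and the sample values below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Union

# A measure value may be text, a number, or a date.
Measure = Union[str, int, float, date]


@dataclass
class Participant:
    participant_id: str
    measures: Dict[str, Measure] = field(default_factory=dict)


def select(participants: List[Participant], name: str, value: Measure) -> List[str]:
    """IDs of participants whose measure `name` equals `value`."""
    return [p.participant_id for p in participants if p.measures.get(name) == value]


people = [
    Participant("p01", {"birthdate": date(2012, 3, 1), "language": "English"}),
    Participant("p02", {"language": "Spanish", "school grade": 1}),
    Participant("p03", {"language": "English"}),
]
english = select(people, "language", "English")
```

A participant that lacks a measure is simply skipped, so contributors are not forced to fill in every field.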

Page 41: Databrary

Grouping sessions

Page 42: Databrary

Grouping sessions

Page 43: Databrary

Representing datasets as files
• People organize their own datasets in different ways
• By using groupings for this organization, can dynamically export/import in many forms
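Exporting one dataset in several folder layouts can be sketched as building paths from group labels in a chosen order. The function `export_paths` and the sample labels are hypothetical, and the sketch assumes each session has exactly one label per kind of group, which is a simplification of the model described earlier.

```python
from typing import Dict, List


def export_paths(sessions: Dict[str, Dict[str, str]], order: List[str]) -> List[str]:
    """Build one folder path per session, nesting groups in the given order."""
    paths = []
    for session_id, labels in sorted(sessions.items()):
        parts = [labels[kind] for kind in order]   # e.g. ["p01", "walking"]
        paths.append("/".join(parts + [session_id]))
    return paths


sessions = {
    "s1": {"participant": "p01", "condition": "walking"},
    "s2": {"participant": "p01", "condition": "crawling"},
}
by_participant = export_paths(sessions, ["participant", "condition"])
by_condition = export_paths(sessions, ["condition", "participant"])
```

The stored data never changes; only the `order` argument decides whether the export looks like a participant-first or condition-first lab folder.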

Page 44: Databrary

From datasets to studies
• Datasets provide organization for labs
  • Session storage for researchers, labs, and collaborators
  • Like a lab server, only better
• Studies present research data to others
  • Pull from datasets, organize sessions
  • Full control over how research is represented
  • Add additional analyses, coding manuals, spreadsheets, scripts, figures, research products, ...

Page 45: Databrary

From datasets to studies

Page 46: Databrary

Data ingest: contributor role
• Identify data to contribute
• Determine organizational structure
• Verify participant sharing permissions
• Provide additional top-level metadata and files
  • description/abstract
  • resulting publications, funding sources
  • images/figures, procedure documents, stimuli
• Set and maintain access restrictions

Page 47: Databrary

Data ingest
• Organization, upload, and import
  • Enumerate sessions, groupings (participants, etc.), files (in CSV)
• Collect original videos, best quality available
• Transcode to standard video formats
  • MPEG-4, H.264, AAC, ffmpeg
• Gradual transition from hand-curation to self-curation
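The transcoding step can be sketched as building an ffmpeg command that writes H.264 video and AAC audio into an MPEG-4 container. The helper name `transcode_command` and the paths are hypothetical, the command assumes an ffmpeg build that includes the `libx264` encoder, and the slides do not state which options Databrary actually uses.

```python
from pathlib import PurePosixPath
from typing import List


def transcode_command(source: str, out_dir: str) -> List[str]:
    """Return an ffmpeg argument list for H.264 + AAC in an .mp4 container."""
    target = PurePosixPath(out_dir) / (PurePosixPath(source).stem + ".mp4")
    return [
        "ffmpeg",
        "-i", source,        # original video, best quality available
        "-c:v", "libx264",   # H.264 video encoder
        "-c:a", "aac",       # AAC audio encoder
        str(target),         # .mp4 extension selects the MPEG-4 container
    ]


cmd = transcode_command("raw/visit1.avi", "transcoded")
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed; the sketch only builds the command so it can be inspected first.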

Page 48: Databrary

System Architecture

Page 49: Databrary

Looking to Databrary 1.0
• Features
  • Study views and data re-use
  • Search
  • Policy-driven form for user registration
  • Self curation features
  • Automatic upload and transcoding
• Timeline
  • Private beta early 2014, public release mid 2014

Page 50: Databrary

Building a Community

Creating a community of researchers who share and self-curate

More interesting data

More users

More contributors

Page 51: Databrary

Key Aims of Databrary project
• Build a repository for sharing video
• Provide tools for scoring video
• Provide data management tools
• Create policies that enable sharing
• Transform the culture of developmental science!