Databrary
Databrary David Millman, NYU • Rick Gilmore, PSU • Dylan Simon, NYU
Coalition for Networked Information • CNI Fall 13
December 10, 2013
databrary.org
Key Aims of Databrary project
• Build a repository for sharing video
• Provide tools for scoring video
• Provide data management tools
• Create policies that enable sharing
• Transform the culture of developmental science!
Current Funding
• NIH
  ○ National Institute of Child Health and Human Development
• NSF
  ○ Development & Learning Sciences Program
  ○ Research and Evaluation on Education in Science and Engineering (REESE)
What Users Can Do with Databrary
Use cases: Education, teaching
• I need video clips for teaching
• I want to illustrate an idea
• Show the range of behaviors and exceptions
• Show an excerpt in a talk
Use cases: Pre-research
• I want to browse the work in my field
• I want to know whether a study is worth doing
• I need preliminary data for grant proposal
• I need ideas and inspiration
• I want to replicate, expand on, or review previous work
Use cases: Research
• I want to repurpose videos for new uses
• Replicate existing work by recoding videos
• I want to grow my sample size
• I want to include participants from other contexts and populations
• I want to conduct integrative analyses
Opportunities / Challenges
Raw data re-use
○ The data is video of people participating in experiments.
○ Can be immediately re-used in different domains without mapping or data dictionaries
Opportunities / Challenges
Video contains identifiable data
○ Faces, voices, possibly names & locations
○ De-identified data linked to video becomes identifiable
○ Enabling sharing while protecting privacy
Opportunities / Challenges
Structural consistency
○ No two labs organize material in the same way
○ What data structure works for both contributors and “consumers”?
Opportunities / Challenges
How “open” is it?
○ Identifiable data
○ Inter-institutional permission clearance
○ Permissions structure / delegation
○ New IRB, sponsored programs standards?
Opportunities / Challenges
Using significant university infrastructure
○ IT
○ Library
○ IRB
○ OSP
○ Counsel
Enabling sharing of identifiable Data
Data-sharing model
How it works today
Data-sharing model
Enter Databrary
Data-sharing model
Sharing with Databrary
Data-sharing model
New Investigator wants access to Databrary
Data-sharing model
Browsing, non-research
Data-sharing model
Conduct Research
Innovations / Insights
● Seek permission to share from people depicted in recordings
  ○ Extends informed consent
● Restrict access to
  ○ Recordings “permissioned” for sharing
  ○ Authorized researchers with ethics training
  ○ Researchers who agree to maintain privacy
Databrary Release Template
● Sharing ≠ research participation
● Data privacy
● Who has access?
● How long?
● No compensation
● Minor assent
● Levels of sharing
Levels of sharing
• Private: No sharing
• Shared: Sharing only with authorized researchers
• Excerptable: Sharing + excerpts may be created and shown by authorized researchers to the public
• Open: Sharing with the public
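The four levels above form an ordered scale, which can be sketched as an enumeration. This is a minimal illustration, not Databrary's implementation: the names `ReleaseLevel`, `can_view`, and `can_show_excerpt_publicly`, and the exact access rules, are assumptions made only to show how the levels might drive an access check.

```python
from enum import IntEnum


class ReleaseLevel(IntEnum):
    """Sharing levels from the slide, ordered from most to least restrictive."""
    PRIVATE = 0      # no sharing
    SHARED = 1       # sharing only with authorized researchers
    EXCERPTABLE = 2  # shared, and excerpts may be shown to the public
    OPEN = 3         # sharing with the public


def can_view(level: ReleaseLevel, is_authorized: bool) -> bool:
    """Hypothetical rule: may this viewer see the full recording?"""
    if level == ReleaseLevel.OPEN:
        return True
    if level in (ReleaseLevel.SHARED, ReleaseLevel.EXCERPTABLE):
        return is_authorized
    return False  # PRIVATE recordings are not shared


def can_show_excerpt_publicly(level: ReleaseLevel, is_authorized: bool) -> bool:
    """Hypothetical rule: may this viewer show an excerpt to the public?"""
    if level == ReleaseLevel.OPEN:
        return True
    return is_authorized and level >= ReleaseLevel.EXCERPTABLE
```

Ordering the levels as integers lets "at least excerptable" be written as a simple comparison.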
Recording sharing permission
• All depicted individuals
• Explicit yes/no boxes
• Adults and minors
Getting permissions right
• Electronically recorded permissions
• Linked to session- and participant-level metadata
• Avoid data entry errors
• Honor participants’ desired release level
• Spreadsheet template
• Web-based permission system
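One way to read "avoid data entry errors" and "honor participants’ desired release level" is sketched below. It assumes, as an illustration only, that release levels arrive as text cells from the spreadsheet template and that a session depicting several people takes the most restrictive level any of them chose. The helper names `parse_level` and `effective_level` are hypothetical.

```python
from enum import IntEnum


class ReleaseLevel(IntEnum):
    PRIVATE = 0
    SHARED = 1
    EXCERPTABLE = 2
    OPEN = 3


def parse_level(text: str) -> ReleaseLevel:
    """Convert a spreadsheet cell to a level, rejecting unknown values."""
    try:
        return ReleaseLevel[text.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown release level: {text!r}") from None


def effective_level(levels) -> ReleaseLevel:
    """Assumed rule: the most restrictive choice among depicted people wins."""
    levels = list(levels)
    if not levels:
        return ReleaseLevel.PRIVATE  # no recorded permission means no sharing
    return min(levels)


# Example: a parent chose "open", the child's permission is "shared".
session_level = effective_level([parse_level("Open"), parse_level(" shared ")])
```

Rejecting unrecognized cell values at import time, rather than guessing, keeps a typo from silently widening access.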
A better way...
● Why is the Databrary model better?
  ○ Clear and unambiguous
  ○ Consent to participate ≠ permission to share data
  ○ Easier for participants
  ○ More realistic conceptualization of risk
  ○ Standardization across contributors via templates
Building a user community
• Users must become Authorized Investigators
• Designing registration process
• Investigator Agreement
• Covers data contributions, non-research, research use/re-use
• 1.0 will be a web form
• Institutional sign-off by Authorizing Official
Data-sharing model
Conduct Research
Who promises what (Investigator / Institution / Databrary)
Access to data
• Investigator: Applies for access; completes ethics training; notifies if changing institutions; supervises affiliates; is not anonymous
• Institution: Investigators are PIs; affiliates are associated with the Institution; certifies ethics training
• Databrary: Reviews and approves applications
Contributing data
• Investigator: Secures sharing permission from IRB and participants; transmits it to Databrary; removes PII from non-recordings
• Institution: Authorizes Databrary to share
• Databrary: Maintains sharing permission; collects and hosts data and metadata
Who promises what (Investigator / Institution / Databrary)
Browsing, viewing data
• Investigator: Protects privacy; protects data; shows excerpts only with permission
• Institution: May receive information about usage
• Databrary: Keeps track of who views and downloads; citations; metadata
Using data for research
• Investigator: Secures IRB approval; communicates IRB approval to Databrary
• Institution: Reviews and approves research protocols
• Databrary: Monitors research usage; stores IRB protocol info
Monitoring and reporting
• Investigator: Reports sharing or other violations
• Institution: Reports sharing or other violations; treats violations as violations of scientific integrity
• Databrary: Reports sharing or other violations; may deny access, remove data
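The division of responsibilities above suggests two tiers of use: browsing requires an authorized investigator, while research use additionally requires IRB approval communicated to Databrary. The sketch below encodes that reading. The class, its field names, and the exact conditions are assumptions for illustration, not the project's actual registration logic.

```python
from dataclasses import dataclass


@dataclass
class Investigator:
    name: str
    application_approved: bool  # Databrary reviewed and approved the application
    institution_signed: bool    # sign-off by the institution's Authorizing Official
    ethics_training: bool       # ethics training certified by the institution
    irb_protocol: str = ""      # IRB approval communicated to Databrary, if any

    def is_authorized(self) -> bool:
        return (self.application_approved
                and self.institution_signed
                and self.ethics_training)

    def may_browse(self) -> bool:
        """Non-research browsing and viewing of shared data."""
        return self.is_authorized()

    def may_use_for_research(self) -> bool:
        """Research use also needs an IRB protocol on file."""
        return self.is_authorized() and bool(self.irb_protocol)


# Hypothetical investigator who is authorized but has no IRB protocol yet.
pi = Investigator("Example PI", True, True, True)
```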
Policy documents
• Databrary Release Template
• Investigator Agreement
• Definitions of terms
• Data Sharing Manifesto
• Bill of Rights
• Best Practices in Data Security
• http://github.com/databrary/policies/
A data model for diverse data sets
• Started by organizing around study
• Different meanings for study: paper, analysis, etc.
• Tremendous range in size of studies
• Meaning can change over time
• Raw data themselves are fixed, constant
• Begin by collecting raw, session data into datasets
• Layer analyses, research products on datasets
A data model for Databrary
Organizational unit: Session
• Data collected at the same time, often single visit
• Defined by:
  ○ Date of test
  ○ Participant release level
• Contains raw data files (videos, etc)
• Associated with participant(s), other metadata
What’s in a Session?
• Like a folder
• A set of files
• Collected at a specific time
• Often a single visit or participant
• Datafiles, coding spreadsheets layered on later
Each file within a session
• Name/description
  ○ Home visit, interview, eye-tracking video, motion-tracking, EEG, ...
• File format
  ○ .pdf, .doc, .csv, .mp4, .opf, .mat, ...
• For video or other time series data
  ○ Start point in time and length
• Identifiable (video) or de-identified?
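The session and file attributes listed above map naturally onto two record types. The following is a sketch under assumptions: the class and field names (`Session`, `SessionFile`, `start_seconds`, and so on) are invented for illustration and do not reflect Databrary's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class SessionFile:
    name: str                               # e.g. "home visit", "interview"
    file_format: str                        # e.g. ".mp4", ".csv", ".opf"
    identifiable: bool                      # video is identifiable; coded data may not be
    start_seconds: Optional[float] = None   # time series only: start point
    length_seconds: Optional[float] = None  # time series only: length


@dataclass
class Session:
    test_date: date                         # a session is defined by its date of test
    release_level: str                      # and by the participant release level
    files: List[SessionFile] = field(default_factory=list)
    participant_ids: List[str] = field(default_factory=list)

    def identifiable_files(self) -> List[SessionFile]:
        """Files that need the strictest handling."""
        return [f for f in self.files if f.identifiable]


# Hypothetical session: one video plus a coding spreadsheet layered on later.
session = Session(
    test_date=date(2013, 12, 10),
    release_level="shared",
    participant_ids=["p01"],
    files=[
        SessionFile("home visit", ".mp4", True, 0.0, 600.0),
        SessionFile("coding spreadsheet", ".csv", False),
    ],
)
```

Keeping the time fields optional reflects that only video and other time series files have a start point and length.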
What’s in a dataset?
• Top-level, binding information (optional)
  ○ Title and short description
  ○ Data owners and other users with access
  ○ Excerpts
  ○ Procedures, stimuli, blank forms, IRB approvals, and other files
  ○ Funding information
• Set of sessions and metadata
How is a dataset organized?
• Many ways to organize a dataset
• User-defined groups (labels, tags, annotations)
  ○ By participants, conditions, visits, tasks, etc.
  ○ Associated with metadata “measures”
• Session assigned to arbitrarily many groups
• Groups specific to a single dataset
Main grouping: Participants
• Each group represents a participant
• Includes any number of user-defined “measures”
  ○ Participant ID
  ○ Birthdate, gender, race/ethnicity
  ○ Geographic location, language, school grade, motor experience, disability, IQ, ...
  ○ Any other text, dates, numbers, ...
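The grouping scheme described above (user-defined groups carrying "measures", with each session belonging to arbitrarily many groups within one dataset) can be sketched as a many-to-many mapping. All names here (`Group`, `Dataset`, `assign`, `sessions_in`, `groups_of`) are hypothetical and chosen only to illustrate the idea.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Group:
    kind: str                    # "participant", "condition", "visit", ...
    label: str                   # e.g. a participant ID
    measures: Dict[str, Any] = field(default_factory=dict)  # user-defined


class Dataset:
    """Groups are specific to one dataset; sessions may join many groups."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.groups: Dict[Tuple[str, str], Group] = {}
        self._members = defaultdict(set)  # (kind, label) -> session ids

    def add_group(self, group: Group) -> None:
        self.groups[(group.kind, group.label)] = group

    def assign(self, session_id: str, kind: str, label: str) -> None:
        key = (kind, label)
        if key not in self.groups:
            raise KeyError(f"no such group in this dataset: {key}")
        self._members[key].add(session_id)

    def sessions_in(self, kind: str, label: str) -> List[str]:
        return sorted(self._members.get((kind, label), set()))

    def groups_of(self, session_id: str) -> List[Tuple[str, str]]:
        return sorted(k for k, ids in self._members.items() if session_id in ids)


ds = Dataset("Example dataset")
ds.add_group(Group("participant", "p01", {"language": "English", "school grade": 2}))
ds.add_group(Group("condition", "walking"))
ds.assign("s01", "participant", "p01")
ds.assign("s01", "condition", "walking")
ds.assign("s02", "participant", "p01")
```

Because membership is stored separately from the sessions, the same sessions can be listed by participant, by condition, or by any other grouping without copying files.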
Grouping sessions
Representing datasets as files
• People organize their own datasets in different ways
• By using groupings for this organization, can dynamically export/import in many forms
From datasets to studies
• Datasets provide organization for labs
  ○ Session storage for researchers, labs, and collaborators
  ○ Like a lab server, only better
• Studies present research data to others
  ○ Pull from datasets, organize sessions
  ○ Full control over how research is represented
  ○ Add additional analyses, coding manuals, spreadsheets, scripts, figures, research products, ...
Data ingest: contributor role
• Identify data to contribute
• Determine organizational structure
• Verify participant sharing permissions
• Provide additional top-level metadata and files
  ○ description/abstract
  ○ resulting publications, funding sources
  ○ images/figures, procedure documents, stimuli
• Set and maintain access restrictions
Data ingest
• Organization, upload, and import
  ○ Enumerate sessions, groupings (participants, etc.), files (in CSV)
  ○ Collect original videos, best quality available
• Transcode to standard video formats
  ○ MPEG-4, H.264, AAC, ffmpeg
• Gradual transition from hand-curation to self-curation
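The ingest steps above (enumerate sessions and files in CSV, then transcode videos to H.264/AAC in an MPEG-4 container with ffmpeg) can be sketched as follows. The manifest columns, file paths, and output directory are assumptions made up for this example. The ffmpeg command is only built, not executed, and uses the standard `-i`, `-c:v libx264`, and `-c:a aac` options.

```python
import csv
import io
from pathlib import Path

# Hypothetical manifest: one row per file, with its session and participant.
MANIFEST = """session,participant,file
s01,p01,raw/s01_home_visit.mov
s01,p01,raw/s01_coding.csv
s02,p02,raw/s02_interview.avi
"""

VIDEO_SUFFIXES = {".mov", ".avi", ".mp4"}  # assumed set of video extensions


def read_manifest(text: str):
    """Parse the CSV manifest into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transcode_command(source: str, out_dir: str = "transcoded"):
    """Build an ffmpeg command producing H.264 video and AAC audio in .mp4."""
    target = Path(out_dir) / (Path(source).stem + ".mp4")
    return ["ffmpeg", "-i", source,
            "-c:v", "libx264", "-c:a", "aac",
            target.as_posix()]


rows = read_manifest(MANIFEST)
commands = [
    transcode_command(row["file"])
    for row in rows
    if Path(row["file"]).suffix.lower() in VIDEO_SUFFIXES
]
```

Each command list could be passed to `subprocess.run` once ffmpeg is installed; non-video files such as the coding spreadsheet are left untouched.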
System Architecture
Looking to Databrary 1.0
• Features
  ○ Study views and data re-use
  ○ Search
  ○ Policy-driven form for user registration
  ○ Self-curation features
  ○ Automatic upload and transcoding
• Timeline
  ○ Private beta early 2014, public release mid 2014
Building a Community
Creating a community of researchers who share and self-curate
More interesting data
More users
More contributors
Key Aims of Databrary project
• Build a repository for sharing video
• Provide tools for scoring video
• Provide data management tools
• Create policies that enable sharing
• Transform the culture of developmental science!