Educational Big Data and the Digital Learner · The hub becomes software, NOT DATA! This is a mooc...

Post on 24-Sep-2020


Educational Big Data and the Digital Learner

Privacy and Access Considerations

Una-May O’Reilly

ALFA Group, Computer Science & AI Lab, MIT

ALFA MOOCdb Project

MOOCDB Data Science Commons

organize interpret extract model visualize

MOOCPRIVACY

Stopout Prediction, FEATUREFACTORY, LABELME, FORUMS, MOOCDB, MOOCVIZ

Enabling technology
•  MOOCdb project targets a Data Science Commons concept
•  Includes multiple projects and open-source software that transform data and enable MOOC-related data science

Personalization of Learning

Teaching/Learning Research

Sharing MOOC data science software

The moocdb project

The hub becomes software, NOT DATA! This is a mooc data science commons! moocdb.csail.mit.edu

moocViz featureFactory labelMe Text

•  Mooc data will be spatially distributed
•  Mooc data access will be controlled locally
•  Chances of a big Data hub are low…
•  Maybe meta-data can be shared

The moocdb project

Another Strategy
•  MoocDB concept of software sharing shifts privacy into checking software outputs
•  Return to the goal of openly sharing data
   –  Must return to the issues of privacy protection
      »  Risk of re-identification through linking
•  In this context:
   –  Technology is integral
      »  But it’s only one element of the story…
   –  Start by bringing clarity to the complexity of the system by identifying the stakeholders
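The "risk of re-identification through linking" can be made concrete with a small sketch. Everything below is hypothetical: the records, the field names, and the choice of quasi-identifiers (`zip`, `birth_year`, `gender`) are assumptions for illustration, not data or schema from the MOOCdb project. The idea is only that a released table without names can still be joined to a public table on shared attributes.

```python
from collections import defaultdict

# Hypothetical "de-identified" course records: names removed, quasi-identifiers kept.
released = [
    {"anon_id": "u1", "zip": "02139", "birth_year": 1990, "gender": "F", "grade": "A"},
    {"anon_id": "u2", "zip": "02139", "birth_year": 1985, "gender": "M", "grade": "C"},
    {"anon_id": "u3", "zip": "94305", "birth_year": 1990, "gender": "F", "grade": "B"},
    {"anon_id": "u4", "zip": "94305", "birth_year": 1990, "gender": "F", "grade": "F"},
]

# Hypothetical public auxiliary data (for example, a public profile listing).
public = [
    {"name": "Alice", "zip": "02139", "birth_year": 1990, "gender": "F"},
    {"name": "Bob", "zip": "02139", "birth_year": 1985, "gender": "M"},
    {"name": "Carol", "zip": "94305", "birth_year": 1990, "gender": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")  # assumed linking attributes


def link(released_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Return {name: released_row} for public rows matching exactly one released row."""
    index = defaultdict(list)
    for row in released_rows:
        index[tuple(row[k] for k in keys)].append(row)
    matches = {}
    for person in public_rows:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means the record is re-identified
            matches[person["name"]] = candidates[0]
    return matches


reidentified = link(released, public)
```

In this toy data, two of the three public profiles match a single released record, so their grades are exposed; the third matches two records and stays ambiguous. Removing names alone therefore does not settle the privacy question.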

The Stakeholders

[Diagram, shown in two builds; labels only]
•  Stakeholders: Subject/provider, Controller? / Controllers, Platform provider, Instructor, Institution of instructor, Analysts
•  Data Model: Shared data model, Shared analytics, Privacy as a service
•  Questions: Who is likely to stopout? Community detection, Weekly topic analysis
•  Other labels: Open access, Privacy Protecting, Policy, Differential privacy, Crowd Sourcing, Machine Learning

Sharing Data and Protecting Privacy

The Present
Two extremes
1.  Rich, raw data – highly vetted
2.  Used, simple data - open

One Possible Future
•  Sharing via gradualism: a consistent protection level provided while
   –  Access increases
   –  Protection mechanism potentially changes:
      »  This depends on nature of data being shared
         §  We are interested in sharing transformed, advanced research-ready data

[Figure: consistent level of privacy protection. Labels: Terms of use; Like social security data; anonymized; protected (repeated four times)]

Data Privacy and Big Data
•  Study from policy, access control and technology vantages in global, broad contexts

Closer to home: USA
•  BigData@CSAIL initiative
   –  Workshops on data privacy -> Report
   –  Working group report, sub-group on online learning
•  Helped me to formulate a working context and its requirements for progress

A Working Context

Requirements
•  Identify the data we want to share
•  Assess: technology, control and policy practices, state of art
•  Integrative Roadmap!!

Sharing Research-Ready Transformed Data

[Figure: pipeline for course 6.002x; labels only]
•  Course data sources: CLICK STREAM, ASSESSMENT, STATE, WIKI, FORUMS, Course content
•  Stages: MOOCdb TRANSLATION, VARIABLE EXTRACTION, FEATURE ENGINEERING, MACHINE LEARNING
•  Variable tables: rows S1 … SM, columns V1, V2, … VK
•  Feature labels: PSETPERF, MIDTERM, EXAM
•  Outputs, each marked "Privacy Protect": Data Directly, Model, Model Generated Data

Sharing transformed data that enables collaboration and cross-institute research will enable a MOOCDB data science commons concept
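As a rough illustration of what "variable extraction" into a student-by-variable table might look like, here is a minimal sketch. The event tuples, the variable names, and the `extract_variables` helper are all hypothetical; they are not the MOOCdb schema or its software, only one plausible way to turn raw click-stream events into rows S1 … SM with columns V1 … VK.

```python
from collections import Counter

# Hypothetical click-stream events: (student_id, week, event_type).
events = [
    ("s1", 1, "video_play"),
    ("s1", 1, "problem_submit"),
    ("s1", 2, "video_play"),
    ("s2", 1, "forum_post"),
    ("s2", 1, "video_play"),
    ("s2", 1, "video_play"),
]

# Assumed variable names standing in for V1 .. VK.
VARIABLES = ("video_play", "problem_submit", "forum_post")


def extract_variables(events, variables=VARIABLES):
    """Build a {(student, week): [count per variable]} table from raw events."""
    counts = Counter()
    for student, week, event_type in events:
        counts[(student, week, event_type)] += 1
    keys = sorted({(student, week) for student, week, _ in events})
    # Counter returns 0 for combinations that never occurred.
    return {key: [counts[key + (v,)] for v in variables] for key in keys}


table = extract_variables(events)
```

A table of this shape, rather than the raw event log, is the kind of transformed, research-ready data the slide proposes to protect and share.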

Assessment
•  Technology
   –  Gap between practitioners’ needs and technology maturity
•  Policy
   –  Asilomar
•  Controllers
   –  MIT Registrar “challenge”

Roadmap
•  Design in progress!
•  Started a modest investigation thread
   –  Theory/Social Science Sharing/Privacy-protected data

Sanitizing the Data Directly

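Post-Randomization

[Figure: a table of truthful data (columns: Mid-Term Grade, Pass/Fail) passes through Perturb (γ,ε) to give a table of protected data with an ε-DP guarantee. In the extracted example the final two rows read (F, P) and (A, P) in the truthful data and (C, P) and (B, F) in the protected data; the other rows are not recoverable from the transcript.]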
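The slide does not define Perturb (γ,ε), so the sketch below does not claim to reproduce it. It shows k-ary randomized response instead, one standard post-randomization mechanism: each category is kept with probability p = e^ε / (e^ε + k − 1) and otherwise replaced by one of the other k − 1 categories chosen uniformly. The ratio between the probabilities of any output under any two inputs is then at most e^ε, which is the ε-DP condition for a single categorical value. The category set `GRADES` and the function names are assumptions.

```python
import math
import random

GRADES = ("A", "B", "C", "F")  # assumed category set for the mid-term grade


def keep_probability(epsilon, k):
    """Probability of reporting the true category under k-ary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def perturb(value, epsilon, categories=GRADES, rng=random):
    """Report `value` with probability p, otherwise a uniformly chosen other category."""
    p = keep_probability(epsilon, len(categories))
    if rng.random() < p:
        return value
    others = [c for c in categories if c != value]
    return rng.choice(others)


rng = random.Random(0)  # seeded only so the example is repeatable
truthful = ["A", "B", "F", "C", "F", "A"]
protected = [perturb(grade, epsilon=1.0, rng=rng) for grade in truthful]
```

A smaller ε flips more values and gives stronger protection at the cost of accuracy; a large ε leaves the column almost untouched. The same function could be applied to the Pass/Fail column by passing `categories=("P", "F")`.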