Sam Madden [email protected] With a cast of many….


Data Hub: A Collaborative Data Analytics and Visualization Platform


BIG Data

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Example: Medical Costs


MGH Cancer Center

“Super-Database”

Question: What are the factors driving costs for lung cancer patients?

Some results: No correlation of cost with

• Stage of presentation
• Survival

Strong correlation of cost with oncologist!

Largest cancer database in the world (173,301 patients)
Based on national tumor registry
Cross linked with death registry
Includes billing, reports, labs, imagery, genome SNPs

- Dr. James Michaelson, PhD, MGH, Harvard Medical School

Super Duper Indexes


Beyond scalable platforms

Challenge: Making Data Accessible

Main Memory DBs
Column Oriented DBs
Map Reduce

What does the data look like?

How do I correlate it with other data sets?

How do I present it to users/execs?

Where are these anomalies and outliers coming from?


Introducing Datahub

[Octocat, the GitHub mascot] + DB Technology = Datahub


Data Commons

Selective Sharing and Access Control

Easy to Find, Combine, Clean Data Sets

Secure, Hosted Data Storage (“Database Service”)

Ability to Browse, Visualize, and Query Data in situ

Lots of other places to find data!

For example:


Datahub: “five-star” integrated, browse-able, & query-able repository of linked data

Aka … Just a bunch of zip files

Versus open, linked data (Tim Berners-Lee taxonomy):

★ make your stuff available on the Web under an open license
★★ make it available as structured data
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context

Datahub Interface


Anant Bhardwaj

Datahub Interface


“Wrangling” Features


Wrangler: Interactive Visual Specification of Data Transformation Scripts
Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer

http://skandel.us/


http://db.cs.berkeley.edu/jmh/

http://vis.stanford.edu/jheer

Post-Wrangling


More Datahub Interface


Versions

Browsing and Visualization

MIT Living Lab

• Goal: allow MIT community to access, selectively share, and use data about itself, using DataHub.


A Dogfood Eating Exercise



MIT Data Hub: Organizational Data, Personal Data, Public Data

MIT data: ID card swipes, network packets, expense reports, medical data, payroll, parking garages, buses and cars, course catalogs, registrar, benefits, on-campus events/seminars, Infrastructure: energy, HVAC, maintenance, etc. Academic/Research: publications, presentations, research data…

Personal Data: location/GPS, calendar, video/pictures, exercise/physio data, application usage, meetings…

Relevant Linked Data: local transit / transport data, crime data, nearby restaurants, events etc.

What Will Data Hub Enable at MIT?

• Campus “Quantification”
– is going to class correlated with better grades?
– which dining facilities are most popular amongst different groups?

• Transportation planning:
– bus utilization and on demand routing
– parking lot utilization
– carpool finding, etc.

• Health + Medical:
– campus wide public health, e.g., flu tracking
– observing who is missing class, depressed
– health signals: exercise and eating habits; partners
– outpatient care

• Research:
– expert finding
– data sharing between groups


Challenges: It’s Not All Fuzzy Stuff

Platform Challenges:
• How to efficiently store thousands or millions of databases?
• How to anonymize data, control access, etc.?
• How to keep data private while allowing querying over it?

Challenges in Improving Interaction with Databases:
• Data Cleaning and Integration
• Interactive Data Presentation
• Understanding Why Results are the Way They Are
• How to Leverage Experts in an Organization


Monomi

MapD

Scorpion

We also don’t want our research to be like this guy

Confidential data leaks 2012: hackers extracted 6.5 million hashed passwords from the DB of LinkedIn

Private Data Problem

Threat: passive DB server attacks

[Figure: Users 1, 2, and 3 reach the Application, which sends SQL to the DB Server (Datahub); the system administrator and hackers can read the sensitive content stored there]

How to protect data confidentiality?

[Figure: Client sends [request] to the DB Server and receives [result]; sensitive content is held at the DB Server]

Encrypt data: server may not be able to process queries!

Compute on encrypted data, without giving the server the encryption key!

General approach has been proposed several times…

Monomi / CryptDB

1. Process SQL queries on encrypted data: hide DB from sys. admins, outsource DB to the cloud

2. Modest overhead

3. No changes to DBMS (e.g., Postgres, MySQL) and no changes to applications

Threat 1: passive DB server attacks

[Figure: Users 1, 2, and 3 reach the Application, which sends SQL to the DB Server holding sensitive content]

w/ Raluca Popa, Stephen Tu, Hari Balakrishnan, Frans Kaashoek, Nickolai Zeldovich

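SQL Queries on Encrypted Data Example

Application → Proxy: SELECT * FROM emp WHERE salary = 100

Proxy → DB Server: SELECT * FROM table1 WHERE col3 = x5a8c34

DB Server → Proxy: the rows whose col3 is x5a8c34

[Figure: table emp (rank, name, salary) is stored as table1 (col1, col2, col3). Plaintext salaries: 60, 100, 800, 100. Under randomized encryption col3 holds four unrelated ciphertexts (x4be219, x95c623, x2ea887, x17cea7); under deterministic encryption it holds x934bc1, x5a8c34, x84a21c, with both salary-100 rows mapping to x5a8c34.]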
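The equality example above can be illustrated with a minimal Python sketch. It is not the CryptDB or Monomi implementation: a keyed hash (HMAC) stands in for deterministic encryption, so the tokens cannot be decrypted, and the key, function names, and token format are assumptions made for illustration. What it shows is why a deterministic scheme lets the server evaluate `=` on ciphertexts while a randomized scheme does not.

```python
import hashlib
import hmac
import os

KEY = b"proxy-secret-key"  # hypothetical key, held only by the proxy


def det_token(value: int) -> str:
    """Deterministic: equal plaintexts always produce the same token."""
    return hmac.new(KEY, str(value).encode(), hashlib.sha256).hexdigest()


def rnd_token(value: int) -> str:
    """Randomized: a fresh nonce makes equal plaintexts look unrelated."""
    nonce = os.urandom(8)
    return hmac.new(KEY, nonce + str(value).encode(), hashlib.sha256).hexdigest()


salaries = [60, 100, 800, 100]

# What the untrusted server stores for col3 (never the plaintext).
server_col3 = [det_token(s) for s in salaries]

# The proxy rewrites `WHERE salary = 100` into `WHERE col3 = <token>`.
needle = det_token(100)
matching_rows = [i for i, c in enumerate(server_col3) if c == needle]
print(matching_rows)
```

The server only compares opaque strings, yet it returns exactly the rows with salary 100. The cost is that anyone reading the column can see which rows share a value, which is the leakage the randomized column avoids.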

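Application → Proxy: SELECT * FROM emp WHERE salary ≥ 100

Proxy → DB Server: SELECT * FROM table1 WHERE col3 ≥ x638e54

DB Server → Proxy: the rows with col3 values x638e54, x922eb4, x638e54

[Figure: table1 (emp) with plaintext salaries 60, 100, 800, 100. Alongside the deterministic ciphertexts (x934bc1, x5a8c34, x84a21c), col3 carries OPE (order) encryption ciphertexts x1eab81, x638e54, x922eb4, whose order matches the plaintext order; both salary-100 rows map to x638e54.]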
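The range query above relies on order-preserving encryption (OPE). The sketch below is a toy stand-in, not a secure OPE scheme and not the one used by CryptDB or Monomi: it builds a strictly increasing code table from keyed pseudo-random gaps. The key, the assumed plaintext domain (integers 0 to 1000), and all names are hypothetical.

```python
import hashlib
import hmac
from itertools import accumulate

KEY = b"proxy-secret-key"  # hypothetical key, held only by the proxy
DOMAIN = 1001              # assumed plaintext domain: integers 0..1000


def _gap(i: int) -> int:
    """Keyed pseudo-random positive gap between consecutive codes."""
    digest = hmac.new(KEY, str(i).encode(), hashlib.sha256).digest()
    return 1 + int.from_bytes(digest[:2], "big")


# A running sum of positive gaps is strictly increasing, so order is preserved.
CODES = list(accumulate(_gap(i) for i in range(DOMAIN)))


def ope_token(value: int) -> int:
    return CODES[value]


salaries = [60, 100, 800, 100]
server_col3 = [ope_token(s) for s in salaries]

# The proxy rewrites `WHERE salary >= 100` into `WHERE col3 >= ope_token(100)`.
threshold = ope_token(100)
matching_rows = [i for i, c in enumerate(server_col3) if c >= threshold]
print(matching_rows)
```

Because the mapping is monotone, the server can answer `≥` without learning the salaries themselves, but it does learn their relative order.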

Monomi: Protecting Data in Datahub

• Extensions to CryptDB to efficiently support OLAP queries

• Show how to run all of TPC-H, rather than just 4 of 22 queries
– Key insight: split queries, run as much as possible on untrusted DBMS, compute remainder on trusted client
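As a rough illustration of the split-execution idea, the sketch below pushes an equality filter to an untrusted "server" function and leaves decryption and the average to a trusted "client" function. The split point, the toy XOR masking, the key, and the sample rows are all assumptions for illustration; the slides do not describe how Monomi's planner actually divides a query.

```python
import hashlib
import hmac
import os

KEY = b"client-secret-key"  # hypothetical; never leaves the trusted client


def det_token(text: str) -> str:
    """Deterministic token so the server can test equality."""
    return hmac.new(KEY, text.encode(), hashlib.sha256).hexdigest()


def _pad(nonce: bytes) -> int:
    return int.from_bytes(hmac.new(KEY, nonce, hashlib.sha256).digest()[:8], "big")


def encrypt_int(value: int) -> tuple[bytes, int]:
    """Toy randomized masking of a small non-negative integer."""
    nonce = os.urandom(8)
    return nonce, value ^ _pad(nonce)


def decrypt_int(cell: tuple[bytes, int]) -> int:
    nonce, masked = cell
    return masked ^ _pad(nonce)


# Client-side load: the server only ever sees tokens and masked values.
plain_rows = [("eng", 60), ("eng", 100), ("ops", 800), ("eng", 100)]
server_rows = [(det_token(d), encrypt_int(s)) for d, s in plain_rows]


def server_filter(rows, dept_token):
    """Runs on the untrusted DBMS: equality filter over tokens only."""
    return [cell for token, cell in rows if token == dept_token]


def client_avg(cells):
    """Runs on the trusted client: decrypt, then aggregate."""
    values = [decrypt_int(c) for c in cells]
    return sum(values) / len(values)


avg_eng = client_avg(server_filter(server_rows, det_token("eng")))
print(avg_eng)
```

The server does the selective part of the work and ships back only the matching cells, so the client decrypts three values instead of the whole table.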

Monomi vs Plaintext: TPC-H SF10, Postgres

Takeaway: median overhead 1.24x,

See Stephen Explain How it Really Works Right after this Talk!

[Chart axis label: Monomi Runtime vs Plaintext]

Many Open Problems

Understanding performance more broadly

How to reason about security of non-randomized schemes?

Auditing, information flow, etc.


DataHub Research Challenges


Monomi

MapD

Scorpion

Interactive Large-Scale Visualization using a GPU Database

The Need for Interactive Analytics

• DataHub needs to support browsing massive data sets

• Browsing is best supported through visualization

ad-hoc analytics, with millisecond response times

MapD: GPU Accelerated SQL Database

• Key insight: GPUs have enough memory that a cluster of them can store substantial amounts of data

• Not an accelerator, but a full blown query processor!

• Massive parallelism enables interactive browsing interfaces
– 4x GPUs can provide > 1 TB/sec of bandwidth
– 12 Tflops compute
– Order of magnitude speedups over CPUs, when data is on GPU

• “Shared nothing” arrangement
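To see why the bandwidth figure matters for interactivity, here is a back-of-the-envelope calculation: the time to stream a data set once at 1 TB/sec (taking the slide's "> 1 TB/sec" as exactly 1). The data sizes are hypothetical, decimal units are assumed (1 TB = 1000 GB), and the estimate ignores compute, transfer to the GPU, and whether the data fits in GPU memory at all.

```python
def scan_time_ms(data_gb: float, bandwidth_tb_per_s: float = 1.0) -> float:
    """Milliseconds to stream `data_gb` gigabytes once at the given bandwidth."""
    seconds = (data_gb / 1000.0) / bandwidth_tb_per_s
    return seconds * 1000.0


# Hypothetical data set sizes, in GB.
for size_gb in (1, 10, 100):
    print(f"{size_gb:>4} GB -> {scan_time_ms(size_gb):.0f} ms")
```

Under these assumptions a full scan of 100 GB takes on the order of 100 ms, which is roughly the latency budget an interactive browsing interface can tolerate.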

Demo

http://geops.csail.mit.edu/MapD

Next Steps

• Scale out to many nodes, automate layout algorithms

• Add various advanced analytics (e.g., machine learning algorithms)

• Generalize visualization beyond maps

DataHub Research Challenges


Monomi

MapD

Scorpion

Visual Provenance: Scorpion

• Visualization of data is most common form of big data analysis

• Common problem: outliers
• Would be nice to have a tool that identifies why outliers exist

Eugene Wu

Definition of Why

Given an outlier group, find a predicate over the inputs that makes the output no longer an outlier.
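The definition can be made concrete with a brute-force sketch. This is not Scorpion's algorithm: it only tries single-attribute equality predicates and re-runs the aggregate for each one. The rows, company labels other than MSFT, values, and the choice of "distance from the other groups' average" as the outlier measure are all hypothetical.

```python
from statistics import mean

# Hypothetical input rows: (country, company, value).
rows = [
    ("Italy", "CoA", 1.0), ("Italy", "CoB", 1.2),
    ("France", "CoC", 1.1), ("France", "CoD", 0.9),
    ("Spain", "CoE", 1.0), ("Spain", "CoF", 1.0),
    ("US", "MSFT", 9.0), ("US", "MSFT", 8.0),
    ("US", "CoG", 1.0), ("US", "CoH", 1.2),
]
OUTLIER = "US"  # the group flagged as an outlier in the output


def group_avg(data, country):
    """AVG(value) for one output group."""
    return mean(v for c, _, v in data if c == country)


# Reference level: the average of the non-outlier groups' averages.
others = sorted({c for c, _, _ in rows} - {OUTLIER})
baseline = mean(group_avg(rows, c) for c in others)


def score(company):
    """Distance from baseline after deleting rows where Company = company."""
    kept = [r for r in rows if not (r[0] == OUTLIER and r[1] == company)]
    return abs(group_avg(kept, OUTLIER) - baseline)


# Try every candidate predicate; each one re-runs the aggregation.
candidates = sorted({comp for c, comp, _ in rows if c == OUTLIER})
best = min(candidates, key=score)
print(f"p: Company = {best}")
```

With this data, deleting the rows matching `Company = MSFT` brings the US average back in line with the other countries, so that predicate is the explanation returned.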


[Figure: input data i feeds an output visualization, a bar chart over Italy, France, Spain, and US with a y-axis from 0 to 5; US is marked as the outlier group, and p is a predicate over the input data]

Removing the predicate makes US no longer an outlier

What are common properties of those records?

{Bill Gates, Steve Ballmer}

p: Company = MSFT

Why is this hard?

Exponential search space over records, attributes

In general, each candidate predicate requires re-running aggregation
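A quick count shows how fast the search space grows. The sketch below counts non-empty subsets of records and non-empty conjunctions of `attribute = value` clauses; the attribute and value counts are hypothetical, and restricting predicates to equality conjunctions is a simplifying assumption.

```python
def record_subsets(n_rows: int) -> int:
    """Number of non-empty subsets of n_rows input records."""
    return 2 ** n_rows - 1


def conjunctive_predicates(values_per_attr: list[int]) -> int:
    """Number of non-empty conjunctions of `attr = value` clauses.

    Each attribute is either unconstrained or pinned to one of its values,
    giving a product of (k + 1) choices; subtract 1 for the empty predicate.
    """
    total = 1
    for k in values_per_attr:
        total *= k + 1
    return total - 1


print(record_subsets(7))                 # 7 records -> 127 subsets
print(conjunctive_predicates([10] * 5))  # 5 attributes, 10 values each -> 161050
```

Since each candidate generally requires re-running the aggregation, even these small inputs imply hundreds to hundreds of thousands of aggregate evaluations for an exhaustive search.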


[Figure: rows A B C D E F G; successive candidate selections give AVG(rows) = 2.7, 2.9, 2.2, 3.3, 3.1, …]

Desire for simple, understandable predicates and a general purpose visualization framework

See Eugene Explain How it Really Works this Afternoon!

Next Steps


• A general purpose visualization language for expressing visualizations with provenance support

References to underlying data set


Conclusion

Big Data is a cry for help from non-DB people

Lots of exciting work on scalable systems

DB community should be doing a much better job of helping users use data

We risk losing mindshare

Datahub aims to make data easy to find, visualize, and query, securely and efficiently

Many fascinating, hard problems! (Monomi, MapD, Scorpion)
