Sam Madden
With a cast of many…
Data Hub: A Collaborative Data Analytics and Visualization Platform
MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY
BIG Data
Example: Medical Costs
MGH Cancer Center “Super-Database”
Largest cancer database in the world (173,301 patients)
• Based on national tumor registry
• Cross-linked with death registry
• Includes billing, reports, labs, imagery, genome SNPs
- Dr. James Michaelson, PhD, MGH, Harvard Medical School
Question: What are the factors driving costs for lung cancer patients?
Some results:
No correlation of cost with
• Stage of presentation
• Survival
Strong correlation of cost with oncologist!
Super Duper Indexes
Beyond scalable platforms
Challenge: Making Data Accessible
Main Memory DBs
Column Oriented DBs
Map Reduce
What does the data look like?
How do I correlate it with other data sets?
How do I present it to users/execs?
Where are these anomalies and outliers coming from?
Introducing Datahub
DB Technology + Octocat, the GitHub mascot = Datahub
Data Commons
Selective Sharing and Access Control
Easy to Find, Combine, Clean Data Sets
Secure, Hosted Data Storage (“Database Service”)
Ability to Browse, Visualize, and Query Data in situ
Lots of other places to find data!
For example:
Aka … Just a bunch of zip files
Versus open, linked data (Tim Berners-Lee Taxonomy)
★ make your stuff available on the Web under an open license
★★ make it available as structured data
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context
Datahub: “five-star” integrated, browse-able, & query-able repository of linked data
Datahub Interface
Anant Bhardwaj
“Wrangling” Features
Wrangler: Interactive Visual Specification of Data Transformation Scripts
Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer
Post-Wrangling
More Datahub Interface
Versions
Browsing and Visualization
MIT Living Lab
A Dogfood Eating Exercise
• Goal: allow MIT community to access, selectively share, and use data about itself, using DataHub.
MIT Data Hub: Organizational Data, Personal Data, Public Data
MIT data: ID card swipes, network packets, expense reports, medical data, payroll, parking garages, buses and cars, course catalogs, registrar, benefits, on-campus events/seminars
Infrastructure: energy, HVAC, maintenance, etc.
Academic/Research: publications, presentations, research data…
Personal Data: location/GPS, calendar, video/pictures, exercise/physio data, application usage, meetings…
Relevant Linked Data: local transit / transport data, crime data, nearby restaurants, events etc.
What Will Data Hub Enable at MIT?
• Campus “Quantification”
– is going to class correlated with better grades?
– which dining facilities are most popular amongst different groups?
• Transportation planning:
– bus utilization and on-demand routing
– parking lot utilization
– carpool finding, etc.
• Health + Medical:
– campus-wide public health, e.g., flu tracking
– observing who is missing class, depressed
– health signals: exercise and eating habits; partners
– outpatient care
• Research:
– expert finding
– data sharing between groups
Challenges: It’s Not All Fuzzy Stuff
Platform Challenges:
• How to efficiently store thousands or millions of databases?
• How to anonymize data, control access, etc.?
• How to keep data private while still allowing querying over it?
Challenges in Improving Interaction with Databases:
• Data Cleaning and Integration
• Interactive Data Presentation
• Understanding Why Results are the Way They Are
• How to Leverage Experts in an Organization
Monomi, MapD, Scorpion
We also don’t want our research to be like this guy
Confidential data leaks
2012: hackers extracted 6.5 million hashed passwords from the DB of LinkedIn
Private Data Problem
[Diagram: User 1, User 2, User 3 → Application → SQL → DB Server (Datahub), which holds the sensitive content]
Threat: passive DB server attacks (hackers, system administrator)
How to protect data confidentiality?
[Diagram: Client sends [request] to the DB Server and receives [result]; sensitive content is stored on the DB Server]
Encrypt the data? Then the server may not be able to process queries!
Compute on encrypted data, without giving the server the encryption key!
Monomi / CryptDB
w/ Raluca Popa, Stephen Tu, Hari Balakrishnan, Frans Kaashoek, Nickolai Zeldovich
General approach has been proposed several times…
1. Process SQL queries on encrypted data
   Hide DB from sys. admins., outsource DB to the cloud
2. Modest overhead
3. No changes to DBMS (e.g., Postgres, MySQL) and no changes to applications
Threat 1: passive DB server attacks
[Diagram: User 1, User 2, User 3 → Application → SQL → DB Server holding sensitive content]
SQL Queries on Encrypted Data Example
Application: SELECT * FROM emp WHERE salary = 100
Proxy rewrites to: SELECT * FROM table1 WHERE col3 = x5a8c34
table1/emp (col1/rank, col2/name, col3/salary):
salary   Randomized encryption   Deterministic encryption
60       x4be219                 x934bc1
100      x95c623                 x5a8c34
800      x2ea887                 x84a21c
100      x17cea7                 x5a8c34
Result: the two rows whose col3 is x5a8c34
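The equality example can be sketched in a few lines of Python. This is a minimal illustration, not CryptDB's or Monomi's implementation: a keyed HMAC stands in for a deterministic cipher (unlike real encryption, it cannot be decrypted), and the names `KEY`, `det_token`, `rand_token`, and `rewrite_equality` are hypothetical. It shows why equal plaintexts must map to equal ciphertexts for the server to evaluate `=` without seeing the value.

```python
import hashlib
import hmac
import os

KEY = os.urandom(32)  # hypothetical secret held only by the proxy


def det_token(value: int) -> str:
    """Deterministic token: equal plaintexts always map to equal tokens."""
    mac = hmac.new(KEY, str(value).encode(), hashlib.sha256)
    return "x" + mac.hexdigest()[:16]


def rand_token(value: int) -> str:
    """Randomized token: a fresh nonce makes equal plaintexts look different."""
    nonce = os.urandom(16)
    mac = hmac.new(KEY, nonce + str(value).encode(), hashlib.sha256)
    return "x" + mac.hexdigest()[:16]


def rewrite_equality(column: str, value: int) -> str:
    """Rewrite `salary = value` into a predicate over the encrypted column."""
    return f"SELECT * FROM table1 WHERE {column} = '{det_token(value)}'"


salaries = [60, 100, 800, 100]
det_column = [det_token(s) for s in salaries]
rand_column = [rand_token(s) for s in salaries]

# The server compares opaque tokens only; it never sees the value 100.
target = det_token(100)
matching_rows = [i for i, tok in enumerate(det_column) if tok == target]
```

With the randomized column the same comparison finds nothing useful, because the two rows holding 100 carry different tokens. That is the trade-off between the two schemes on this slide.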
Application: SELECT * FROM emp WHERE salary ≥ 100
Proxy rewrites to: SELECT * FROM table1 WHERE col3 ≥ x638e54
table1 (emp) (col1/rank, col2/name, col3/salary):
salary   Deterministic encryption   OPE (order) encryption
60       x934bc1                    x1eab81
100      x5a8c34                    x638e54
800      x84a21c                    x922eb4
100      x5a8c34                    x638e54
Result: the rows whose col3 is x638e54, x922eb4, x638e54
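A toy order-preserving encoding makes the range example concrete. This sketch assumes a small integer domain and builds a strictly increasing lookup table from random positive gaps. The seed stands in for the secret key, `random.Random` is not cryptographically secure, and the function names are hypothetical. It is not the OPE scheme used by CryptDB or Monomi.

```python
import random


def build_ope_table(domain_max: int, seed: int) -> list[int]:
    """Map 0..domain_max to strictly increasing codes via random gaps."""
    rng = random.Random(seed)  # the seed plays the role of the secret key
    table, current = [], 0
    for _ in range(domain_max + 1):
        current += rng.randint(1, 1000)  # a positive gap preserves order
        table.append(current)
    return table


OPE = build_ope_table(domain_max=1000, seed=42)


def ope_encrypt(value: int) -> int:
    """Look up the order-preserving code for a plaintext in the domain."""
    return OPE[value]


salaries = [60, 100, 800, 100]
encrypted = [ope_encrypt(s) for s in salaries]

# The proxy rewrites `salary >= 100` to `col3 >= ope_encrypt(100)`;
# the server then compares codes without learning the plaintexts.
threshold = ope_encrypt(100)
matching_rows = [i for i, code in enumerate(encrypted) if code >= threshold]
```

Because the codes reveal the ordering of the plaintexts, this kind of scheme leaks more than deterministic or randomized encryption.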
Monomi: Protecting Data in Datahub
• Extensions to CryptDB to efficiently support OLAP queries
• Show how to run all of TPC-H, rather than just 4 of 22 queries
– Key insight: split queries, run as much as possible on untrusted DBMS, compute remainder on trusted client
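The split-execution idea can be illustrated with a small sketch, under several assumptions: the department column, the toy `seal`/`unseal` cipher (XOR with an HMAC-derived pad), and all function names are hypothetical and are not Monomi's actual design. The untrusted server runs the part it can (an equality filter over deterministic tokens), and the trusted client decrypts and finishes the aggregate.

```python
import hashlib
import hmac
import os

KEY = os.urandom(32)  # hypothetical key known only to the trusted client


def det_token(text: str) -> str:
    """Deterministic token so the server can evaluate equality."""
    return hmac.new(KEY, text.encode(), hashlib.sha256).hexdigest()


def seal(value: int) -> tuple[bytes, bytes]:
    """Toy randomized encryption: XOR with a keyed pad derived from a nonce."""
    nonce = os.urandom(16)
    pad = hmac.new(KEY, nonce, hashlib.sha256).digest()[:8]
    body = bytes(a ^ b for a, b in zip(value.to_bytes(8, "big"), pad))
    return nonce, body


def unseal(nonce: bytes, body: bytes) -> int:
    """Invert seal() by regenerating the same pad from the nonce."""
    pad = hmac.new(KEY, nonce, hashlib.sha256).digest()[:8]
    return int.from_bytes(bytes(a ^ b for a, b in zip(body, pad)), "big")


# Untrusted server state: (department token, sealed salary) per row.
plain_rows = [("eng", 60), ("ops", 100), ("eng", 800), ("ops", 100)]
server_rows = [(det_token(d), seal(s)) for d, s in plain_rows]


def server_filter(rows, dept_token):
    """Runs on the untrusted DBMS: equality filter over tokens only."""
    return [sealed for token, sealed in rows if token == dept_token]


def client_sum(sealed_values) -> int:
    """Runs on the trusted client: decrypt, then finish the aggregate."""
    return sum(unseal(nonce, body) for nonce, body in sealed_values)


# Equivalent of: SELECT SUM(salary) FROM emp WHERE dept = 'eng'
total = client_sum(server_filter(server_rows, det_token("eng")))
```

Only the rows that survive the server-side filter travel to the client, which is why pushing more work to the untrusted side tends to reduce overhead.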
Monomi vs Plaintext: TPC-H SF10, Postgres
[Chart: Monomi runtime vs plaintext]
Takeaway: median overhead 1.24x
See Stephen Explain How it Really Works Right after this Talk!
Many Open Problems
Understanding performance more broadly
How to reason about security of non-randomized schemes?
Auditing, information flow, etc.
Interactive Large-Scale Visualization using a GPU Database
The Need for Interactive Analytics
• DataHub needs to support browsing massive data sets
• Browsing is best supported through visualization
• Ad-hoc analytics, with millisecond response times
MapD: GPU Accelerated SQL Database
• Key insight: GPUs have enough memory that a cluster of them can store substantial amounts of data
• Not an accelerator, but a full blown query processor!
• Massive parallelism enables interactive browsing interfaces
– 4x GPUs can provide > 1 TB/sec of bandwidth
– 12 Tflops compute
– Order of magnitude speedups over CPUs, when data is on GPU
• “Shared nothing” arrangement
Next Steps
• Scale out to many nodes, automate layout algorithms
• Add various advanced analytics (e.g., machine learning algorithms)
• Generalize visualization beyond maps
Visual Provenance: Scorpion
• Visualization of data is most common form of big data analysis
• Common problem: outliers
• Would be nice to have a tool that identifies why outliers exist
Eugene Wu
Definition of Why
Given an outlier group, find a predicate over the inputs that makes the output no longer an outlier.
[Figure: input data i with predicate p selecting a subset of it; output visualization, a chart over Italy, France, Spain, US (y-axis 0 to 4.5), with US marked as the outlier group]
i = Input Data
p = predicate
Removing the records that match the predicate makes US no longer an outlier
What are common properties of those records?
{Bill Gates, Steve Ballmer}
p: Company = MSFT
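A naive version of this search can be sketched as follows. It is a brute-force illustration of the definition, not Scorpion's algorithm: the rows, the `explain_outlier` function, and the restriction to single-attribute equality predicates are all assumptions made for the example. Each candidate predicate is scored by removing its matching records from the outlier group and re-running the AVG aggregate.

```python
from statistics import mean

# Hypothetical input rows; the values are made up for illustration.
rows = [
    {"country": "Italy", "company": "Fiat", "income": 2.0},
    {"country": "Italy", "company": "Eni", "income": 2.2},
    {"country": "France", "company": "Renault", "income": 2.1},
    {"country": "France", "company": "Total", "income": 2.3},
    {"country": "Spain", "company": "Seat", "income": 1.9},
    {"country": "Spain", "company": "Repsol", "income": 2.1},
    {"country": "US", "company": "MSFT", "income": 8.0},
    {"country": "US", "company": "MSFT", "income": 7.0},
    {"country": "US", "company": "Ford", "income": 2.0},
    {"country": "US", "company": "GE", "income": 2.2},
]


def group_avg(data, country):
    """AVG(income) for one output group."""
    return mean(r["income"] for r in data if r["country"] == country)


def explain_outlier(data, outlier, attributes):
    """Return (gap, attribute, value) for the best equality predicate."""
    others = sorted({r["country"] for r in data} - {outlier})
    baseline = mean(group_avg(data, c) for c in others)
    candidates = {(a, r[a]) for r in data
                  if r["country"] == outlier for a in attributes}
    best = None
    for attr, value in sorted(candidates):
        kept = [r for r in data
                if not (r["country"] == outlier and r[attr] == value)]
        if not any(r["country"] == outlier for r in kept):
            continue  # predicate would delete the whole group
        gap = abs(group_avg(kept, outlier) - baseline)  # re-run aggregate
        if best is None or gap < best[0]:
            best = (gap, attr, value)
    return best


result = explain_outlier(rows, "US", ["company"])
```

Even this restricted search re-runs the aggregate once per candidate; allowing conjunctions over several attributes makes the number of candidates grow combinatorially.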
Why is this hard?
Exponential search space over records, attributes
In general, each candidate predicate requires re-running aggregation
[Animation: AVG(rows) recomputed over different subsets of the rows A B C D E F G: 2.7, 2.9, 2.2, 3.3, 3.1, …]
Desire for simple, understandable predicates and a general purpose visualization framework
See Eugene Explain How it Really Works this Afternoon!
Next Steps
• A general purpose visualization language for expressing visualizations with provenance support
References to underlying data set
Conclusion
Big Data is a cry for help from non-DB people
Lots of exciting work on scalable systems
DB community should be doing a much better job of helping users use data
We risk losing mindshare
Datahub aims to make data easy to find, visualize, and query, securely and efficiently
Many fascinating, hard problems! (Monomi, MapD, Scorpion)