Interactive Data Exploration using Constraints Alexander Kalinin Ugur Cetintemel, Stan Zdonik.

18
Interactive Data Exploration using Constraints Alexander Kalinin Ugur Cetintemel, Stan Zdonik

Transcript of Interactive Data Exploration using Constraints Alexander Kalinin Ugur Cetintemel, Stan Zdonik.

Interactive Data Exploration using

ConstraintsAlexander Kalinin

Ugur Cetintemel, Stan Zdonik

2

CP + DBMSfor Data Intensive Exploration

3

Interactive Data Exploration (IDE)

Searching for the “interesting” within big data

• Exploratory-analysis: ad-hoc & repetitive• Questions are not well defined• “Interesting” can be complex

• Human-in-the loop operation• Fast, online results• Query refinement

Where’s Waldo?Where’s Horrible Gelatinous Blob?

4

Exploratory Queries: Some examples• First-order

• “Celestial 3-5o by 5-7o regions with brightness > 0.8”

• Higher-order• “Pairs of 2o by 2o celestial regions with similarity > 0.5”

• Optimized• “Celestial 3o by 7o region with maximum brightness”

Sloan Digital Sky Survey (SDSS)

5

“Celestial 3-5o by 5-7o regions with average brightness > 0.8” in SQL

1. Divide the data into cells2. Enumerate all regions3. Final filtering (> 0.8)

6

DBMSs for IDE?

• No native support for exploratory constructs• No power set• No user-defined objective functions

• No support for interactivity• No online results• No notion of a “query session”

7

Data Exploration as a CP problem

Decision variables:

Constraints:

“Celestial 3-5o by 5-7o regions with average brightness > 0.8”

Left-most corner

Lengths

8

CP Solvers

• Large variety of methods for exploring a search space• Branch-and-Cut• Large Neighborhood Search (LNS)• Randomized search with Restarts

• Highly extensible – important for ad-hoc exploration!• New constraints/functions• New search heuristics

• But… comparing with DBMSs• In-memory data (CP) vs. efficient disk data handling (DBMS)• No I/O cost-awareness (CP) vs. cost-based query planning (DBMS)

9

SearchLight

• A fusion of CP solvers and DBMSs• The DBMS stores and maintains data• The CP solver explores the constrained

search space

• SearchLight is a mediator• Extends CP solvers• Provides buffering, prefetching• Distributes the search• Makes CP solvers cost-aware

CP Solver(OR-tools, Gecode)

Constraints/Functions

Search Heuristics

SearchLight

Metadata Buffering

DBMS(PostgreSQL, SciDB)

Data

, esti

mat

es, d

ecisi

ons

Requ

ests

, Sol

ution

s Data, schema info

Data requests, constraints

Exploration Query

10

Research Issues

• A cost model for data-intensive CP• Each search decision has an I/O cost

• Mediation of data access• Meta-data for guiding and optimizing search (annotated trees, samples, etc.)• Prefetching

• Distributed search• Multi-node parallel branch processing

• CP/DBMS integrated query planning• Propagating CP/Schema constraints

11

Semantic Windows (SW)

• First step towards constraint-based exploration

• Supports first-order queries• Exploration via multi-dimensonal “windows of interest”• Shape-based constraints (“a 3-5o by 5-7o region”)• Content-based constraints (“avg_br() > 0.8")

• Custom distributed cost-aware solver

12

SQL/CP Extensions for Data ExplorationSELECT lb(ra), rb(ra), lb(dec), rb(dec),

avg(brightness)FROM sdssGRID BY ra BETWEEN 100 AND 300 STEP 1 dec BETWEEN 5 AND 40 STEP 1HAVING avg(brightness) > 0.8 AND

size(ra) = 5 AND size(dec) >= 5 AND size(dec) <= 7

13

Cost-aware Solver

• Best-first search based on the utility

• Utility = f(benefit, cost)

• Benefit – how close a window is to satisfy the constraints• A distance between the constraint’s value and the estimated value

• Cost – how expensive it is to read a window from disk• Measured in cells we have to read• Adjustments are made for skewed data

14

Optimizations

• Cost and benefit are estimated by sampling

• Objective function values are cached in a cell cache• Dynamic utility updates• Avoiding same cells re-reads

• Constraint-based pruning during the search

• Distributed search• Multiple nodes work in parallel

15

Adaptive Prefetching

• Dispersed reads hit total performance

• Prefetching: read the neighborhood with every window

• Progress-driven prefetching: how much? • Finding new results? Prefetch a small amount• No new results? Increase the prefetch

exponentially

3

2

1

4

No prefetching

With prefetching

1

2

3

4

16

Online vs. Total Performance Results• 35GB data set (part of the SDSS)• 4GB total memory (1GB shared buffer)• First results in 10-20 seconds

20% 40% 60% 80% 100% total0

1000

2000

3000

4000

5000

6000

Static Adaptive PostgreSQL

% of results returned

Tim

e, s

17

Conclusions

• Integrate CP and DBMS technologies

• SearchLight: Data-Intensive CP Engine

• Initial implementation: Semantic Windows• Cost-aware solver• Mediating disk access (sampling, prefetching)• Distributed search

• Current work:• OR-Tools as the CP solver• SciDB as the DBMS

18

Questions?

Supported by: