Interactive Data Exploration using Constraints
Alexander Kalinin, Ugur Cetintemel, Stan Zdonik
2
CP + DBMS for Data-Intensive Exploration
3
Interactive Data Exploration (IDE)
Searching for the “interesting” within big data
• Exploratory analysis: ad hoc and repetitive
  • Questions are not well defined
  • “Interesting” can be complex
• Human-in-the-loop operation
  • Fast, online results
  • Query refinement
Where’s Waldo? Where’s Horrible Gelatinous Blob?
4
Exploratory Queries: Some examples
• First-order
  • “Celestial 3-5° by 5-7° regions with brightness > 0.8”
• Higher-order
  • “Pairs of 2° by 2° celestial regions with similarity > 0.5”
• Optimized
  • “Celestial 3° by 7° region with maximum brightness”
Sloan Digital Sky Survey (SDSS)
5
“Celestial 3-5° by 5-7° regions with average brightness > 0.8” in SQL
1. Divide the data into cells
2. Enumerate all regions
3. Final filtering (> 0.8)
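The three steps above can be sketched in a few lines of Python. This is only an illustration of the brute-force idea, not the SQL used in the talk: the function name `enumerate_regions`, the in-memory `grid` of per-cell brightness values, and the parameter names are all assumptions made for the example.

```python
from itertools import product

def enumerate_regions(grid, heights, widths, threshold):
    """Return (row, col, height, width, avg) for every window whose
    average cell value exceeds the threshold.

    Step 1 (dividing the data into cells) is assumed to be done already:
    grid[r][c] holds the brightness of one cell.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    results = []
    for h, w in product(heights, widths):
        # Step 2: enumerate every placement of an h-by-w window.
        for r in range(n_rows - h + 1):
            for c in range(n_cols - w + 1):
                total = sum(grid[i][j]
                            for i in range(r, r + h)
                            for j in range(c, c + w))
                avg = total / (h * w)
                # Step 3: final filtering on the aggregate.
                if avg > threshold:
                    results.append((r, c, h, w, avg))
    return results
```

Every window is materialized before it is filtered, which is why this approach scales poorly as the grid and the range of allowed shapes grow.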
6
DBMSs for IDE?
• No native support for exploratory constructs
  • No power set
  • No user-defined objective functions
• No support for interactivity
  • No online results
  • No notion of a “query session”
7
Data Exploration as a CP problem
Decision variables:
Constraints:
“Celestial 3-5° by 5-7° regions with average brightness > 0.8”
Left-most corner
Lengths
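The formulas from this slide did not survive in the transcript. One plausible way to write the example query as a CP model is sketched below; the symbols x, y, l_x, l_y and the function name avg_br (borrowed from the Semantic Windows slide) are assumptions, not the notation of the original slide.

```latex
\begin{aligned}
&\text{Decision variables: } x,\ y \ \text{(left-most corner)}, \quad l_x,\ l_y \ \text{(lengths)} \\
&\text{Constraints: } 3 \le l_x \le 5, \quad 5 \le l_y \le 7, \\
&\phantom{\text{Constraints: }} \operatorname{avg\_br}(x,\, y,\, l_x,\, l_y) > 0.8
\end{aligned}
```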
8
CP Solvers
• Large variety of methods for exploring a search space
  • Branch-and-Cut
  • Large Neighborhood Search (LNS)
  • Randomized search with restarts
• Highly extensible: important for ad hoc exploration!
  • New constraints/functions
  • New search heuristics
• But… compared with DBMSs
  • In-memory data (CP) vs. efficient disk data handling (DBMS)
  • No I/O cost-awareness (CP) vs. cost-based query planning (DBMS)
9
SearchLight
• A fusion of CP solvers and DBMSs
  • The DBMS stores and maintains data
  • The CP solver explores the constrained search space
• SearchLight is a mediator
  • Extends CP solvers
  • Provides buffering, prefetching
  • Distributes the search
  • Makes CP solvers cost-aware
[Architecture diagram. Components: Exploration Query; CP Solver (OR-tools, Gecode) with Constraints/Functions and Search Heuristics; SearchLight with Metadata and Buffering; DBMS (PostgreSQL, SciDB). Arrow labels: “Data, estimates, decisions”; “Requests, Solutions”; “Data, schema info”; “Data requests, constraints”.]
10
Research Issues
• A cost model for data-intensive CP
  • Each search decision has an I/O cost
• Mediation of data access
  • Metadata for guiding and optimizing search (annotated trees, samples, etc.)
  • Prefetching
• Distributed search
  • Multi-node parallel branch processing
• CP/DBMS integrated query planning
  • Propagating CP/schema constraints
11
Semantic Windows (SW)
• First step towards constraint-based exploration
• Supports first-order queries
  • Exploration via multi-dimensional “windows of interest”
  • Shape-based constraints (“a 3-5° by 5-7° region”)
  • Content-based constraints (“avg_br() > 0.8”)
• Custom distributed cost-aware solver
12
SQL/CP Extensions for Data Exploration
SELECT lb(ra), rb(ra), lb(dec), rb(dec), avg(brightness)
FROM sdss
GRID BY ra BETWEEN 100 AND 300 STEP 1
        dec BETWEEN 5 AND 40 STEP 1
HAVING avg(brightness) > 0.8 AND
       size(ra) = 5 AND size(dec) >= 5 AND size(dec) <= 7
13
Cost-aware Solver
• Best-first search based on utility
  • Utility = f(benefit, cost)
• Benefit: how close a window is to satisfying the constraints
  • A distance between the constraint’s value and the estimated value
• Cost: how expensive it is to read a window from disk
  • Measured in the number of cells that have to be read
  • Adjustments are made for skewed data
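To make the utility idea concrete, here is a minimal best-first ordering in Python. The slide does not define f, so the particular `benefit` and `utility` formulas below are hypothetical choices made for illustration, as are the function names and the `(name, estimated_avg, cells_to_read)` tuple layout.

```python
import heapq

def benefit(estimate, threshold):
    # Hypothetical benefit: 1.0 when the estimate already meets the
    # threshold, shrinking as the estimate falls further below it.
    return 1.0 / (1.0 + max(0.0, threshold - estimate))

def utility(estimate, threshold, cost, max_cost):
    # Hypothetical f(benefit, cost): discount the benefit by the
    # normalized number of cells that must be read from disk.
    return benefit(estimate, threshold) * (1.0 - cost / (max_cost + 1))

def best_first(windows, threshold):
    """windows: list of (name, estimated_avg, cells_to_read).
    Returns window names in the order a best-first search would read them."""
    max_cost = max(cost for _, _, cost in windows)
    # heapq is a min-heap, so utilities are negated to pop the best first.
    heap = [(-utility(est, threshold, cost, max_cost), name)
            for name, est, cost in windows]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order
```

With these assumed formulas, a cheap window that looks promising is read before an equally promising but expensive one, and a cheap unpromising window can still outrank an expensive promising one.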
14
Optimizations
• Cost and benefit are estimated by sampling
• Objective function values are cached in a cell cache
  • Dynamic utility updates
  • Avoiding re-reads of the same cells
• Constraint-based pruning during the search
• Distributed search
  • Multiple nodes work in parallel
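The cell-cache idea can be illustrated with a small Python class. This is a sketch under stated assumptions, not the Semantic Windows implementation: the class name `CellCache`, the `read_cell` callback standing in for a disk read, and the `disk_reads` counter are all invented for the example.

```python
class CellCache:
    """Caches per-cell values so overlapping windows do not re-read cells."""

    def __init__(self, read_cell):
        # read_cell is a hypothetical callback that fetches one cell
        # value from storage, given its (row, col) coordinates.
        self._read_cell = read_cell
        self._cache = {}
        self.disk_reads = 0

    def get(self, cell):
        # Only cells missing from the cache trigger a (simulated) disk read.
        if cell not in self._cache:
            self._cache[cell] = self._read_cell(cell)
            self.disk_reads += 1
        return self._cache[cell]

    def window_avg(self, r, c, h, w):
        # Average over an h-by-w window whose top-left cell is (r, c).
        values = [self.get((i, j))
                  for i in range(r, r + h)
                  for j in range(c, c + w)]
        return sum(values) / len(values)
```

Evaluating two overlapping windows through the same cache reads the shared cells only once, which is the effect the slide describes as avoiding re-reads.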
15
Adaptive Prefetching
• Dispersed reads hurt total performance
• Prefetching: read the neighborhood with every window
• Progress-driven prefetching: how much?
  • Finding new results? Prefetch a small amount
  • No new results? Increase the prefetch exponentially
[Figure: two panels, “No prefetching” and “With prefetching”, each showing windows numbered 1 to 4.]
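The progress-driven rule for adaptive prefetching can be written as a tiny controller. The slide gives only the qualitative behavior, so the function name `next_prefetch` and the `base`, `factor`, and `limit` parameters (including the upper bound itself) are assumptions for this sketch.

```python
def next_prefetch(current, found_new_results, base=1, factor=2, limit=64):
    """Return the prefetch size (in cells) to use for the next read.

    base, factor and limit are hypothetical tuning knobs.
    """
    if found_new_results:
        # The search is making progress: fall back to a small prefetch.
        return base
    # No new results: grow the prefetch exponentially, up to a cap.
    return min(current * factor, limit)
```

A usage example: starting from a size of 1, three reads without new results give sizes 2, 4, and 8, and the first read that produces a result resets the size to 1.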
16
Online vs. Total Performance Results
• 35GB data set (part of the SDSS)
• 4GB total memory (1GB shared buffer)
• First results in 10-20 seconds
[Chart: time in seconds (0 to 6000) vs. % of results returned (20%, 40%, 60%, 80%, 100%, total) for Static, Adaptive, and PostgreSQL.]
17
Conclusions
• Integrate CP and DBMS technologies
• SearchLight: Data-Intensive CP Engine
• Initial implementation: Semantic Windows
  • Cost-aware solver
  • Mediating disk access (sampling, prefetching)
  • Distributed search
• Current work:
  • OR-Tools as the CP solver
  • SciDB as the DBMS
18
Questions?
Supported by: