Online Aggregation
Joseph M. Hellerstein, Peter J. Haas, Helen J. Wang
Presented by
Archana Vijayalakshmanan
Contents
• Introduction
• Example
• Advantages
• Requirements
• Approaches to building a system
• System issues
• Conclusion
Online Aggregation: Motivation
SELECT AVG(grade) FROM ENROLL;
A “fancy” interface:
Query Results: AVG = 3.262574342
A Better Approach
Don’t process in batch! Online aggregation:
Example
SELECT AVG(grade) FROM ENROLL
GROUP BY major;
Advantages
• stopping condition set on the fly!
• statistical techniques are more sophisticated
• can handle GROUP BY w/o a priori knowledge
Requirements
Usability
• continuous output (non-blocking query plans)
• time/precision control
• fairness/partiality
Performance
• time to accuracy
• time to completion
• pacing
A Naive Approach
SELECT running_avg(final_grade),
running_confidence(final_grade),
running_interval(final_grade)
FROM grades;
No grouping
Can’t meet performance & usability needs:
• no guarantee of continuous output
• no guarantee of fairness (or control over partiality)
• no control over pacing
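To make the idea of a running aggregate concrete, here is a minimal Python sketch of what functions like running_avg and running_interval might compute. It is not the implementation behind the slides: it assumes tuples arrive in random order, and it uses a normal-approximation interval (z = 1.96 for roughly 95% confidence), which is only one of several possible interval constructions.

```python
import math


def running_avg(values, z=1.96):
    """Yield (n, running mean, interval half-width) after each value.

    Assumes the values are delivered in random order, so the prefix seen
    so far behaves like a random sample of the whole table.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
        mean = total / n
        if n > 1:
            # Sample variance from the running sums, clamped against
            # small negative values caused by floating-point rounding.
            var = max(0.0, (total_sq - n * mean * mean) / (n - 1))
            half_width = z * math.sqrt(var / n)
        else:
            half_width = math.inf  # no interval from a single value
        yield n, mean, half_width


if __name__ == "__main__":
    for n, mean, half in running_avg([2.0, 4.0, 3.0, 3.5]):
        print(n, round(mean, 3), half)
```

Each yielded triple is an estimate the interface could display immediately, which is what lets the user stop as soon as the interval is tight enough.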
Random Access to Data
Heap Scan
• OK if clustering uncorrelated to agg & grouping attrs
Index Scan
• can scan an index on attrs uncorrelated to agg or grouping
Sampling from indices
• could introduce new sampling access methods (e.g. Olken’s work)
Group By & Distinct
• Can’t sort!
  sorting blocks
  sorting is unfair
• Must use hash-based techniques: non-blocking approach, but they do not scale gracefully.
• Hybrid Hashing.
• “Hybrid Cache” even better.
Index Striding
For fair Group By: read tuples in round-robin fashion
(want random tuple from Group 1, random tuple from Group 2, ...)
• each group is updated at the appropriate rate
• gives info/speed match!
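The round-robin access pattern of index striding can be sketched in a few lines of Python. This is an illustrative toy, not the system's access method: the index is modelled as a dictionary from group value to that group's tuples (a hypothetical stand-in for an index on the grouping attribute), and every group gets one tuple per round.

```python
def index_stride(groups):
    """Yield (group, tuple) pairs, one tuple per group per round.

    `groups` maps a group value to an iterable of that group's tuples,
    standing in for an index on the grouping attribute.
    """
    iterators = {g: iter(rows) for g, rows in groups.items()}
    while iterators:
        # Copy the keys so exhausted groups can be dropped mid-round.
        for g in list(iterators):
            try:
                yield g, next(iterators[g])
            except StopIteration:
                del iterators[g]


if __name__ == "__main__":
    enroll = {"CS": [4.0, 3.0, 2.0], "Math": [3.5]}
    for major, grade in index_stride(enroll):
        print(major, grade)
```

Because small groups receive tuples as often as large ones, their estimates improve at the same pace instead of lagging behind.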
Join Algorithms
Non-Blocking Joins
• no sorting!
• merge join OK, but watch for the sorted output
• hybrid hash not great
• symmetric pipeline hash
• nested loops always good, can be too slow
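As an illustration of why a symmetric pipelined hash join is non-blocking, the following Python sketch alternates between the two inputs, inserts each tuple into its own hash table, and probes the other side's table right away. The function name, the key-extractor parameters, and the strict alternation policy are assumptions made for this example.

```python
from collections import defaultdict


def symmetric_hash_join(left, right, left_key, right_key):
    """Yield (l, r) matches as soon as both matching tuples have arrived."""
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    left_iter, right_iter = iter(left), iter(right)
    left_done = right_done = False
    while not (left_done and right_done):
        if not left_done:
            try:
                l = next(left_iter)
            except StopIteration:
                left_done = True
            else:
                k = left_key(l)
                left_table[k].append(l)           # build on the left side
                for r in right_table.get(k, []):  # probe the right side
                    yield l, r
        if not right_done:
            try:
                r = next(right_iter)
            except StopIteration:
                right_done = True
            else:
                k = right_key(r)
                right_table[k].append(r)          # build on the right side
                for l in left_table.get(k, []):   # probe the left side
                    yield l, r


if __name__ == "__main__":
    students = [("CS", "alice"), ("Math", "bob")]
    grades = [("Math", 3.5), ("CS", 4.0)]
    for pair in symmetric_hash_join(students, grades,
                                    lambda s: s[0], lambda g: g[0]):
        print(pair)
```

Neither input has to be consumed completely before the first result appears, at the price of keeping both hash tables in memory.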
Query Optimization
Avoid sorting
Blocking sub-operations: 2 components in cost function
• dead time (td): time spent doing “invisible” work; tax this at a high rate!
• output time (to): time spent producing output
Preference to plans that maximize user control, e.g., index striding
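One way to read the two-component cost function is as a weighted sum in which dead time is penalized more heavily than output time. The Python sketch below assumes a simple linear tax, which the slides do not specify (a steeper, non-linear penalty would fit the same description), and the plan names and timings are made-up numbers used only for illustration.

```python
def plan_cost(dead_time, output_time, dead_time_tax=10.0):
    """Cost of a plan; `dead_time_tax` is an assumed linear penalty factor."""
    return dead_time_tax * dead_time + output_time


# Hypothetical candidate plans with invented timings (seconds).
plans = {
    "sort-based group by": plan_cost(dead_time=8.0, output_time=2.0),
    "index striding": plan_cost(dead_time=0.5, output_time=6.0),
}
best = min(plans, key=plans.get)

if __name__ == "__main__":
    print(plans, best)
```

With these numbers the plan that finishes sooner overall still loses, because most of its time is spent producing nothing the user can see.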
Extended Aggregate Functions
Basically, aggregate functions must provide running estimates
• SUM, COUNT: straightforward
• VAR, STD DEV: algorithms return confidence intervals
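A running estimate for SUM, VAR and STD DEV could look like the Python sketch below. It is only a sketch under stated assumptions: the class name and methods are hypothetical, the total table size is assumed to be known in advance, tuples are assumed to arrive in random order, and the variance update uses Welford's method, a standard numerically stable technique that the slides do not name.

```python
import math


class RunningStats:
    """Running estimates from the tuples seen so far (hypothetical helper)."""

    def __init__(self, table_size):
        self.table_size = table_size  # total row count, assumed known
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        # Welford's update: numerically stable single-pass mean/variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def sum_estimate(self):
        # Scale the sample mean up to the full table.
        return self.table_size * self.mean

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def std_dev(self):
        return math.sqrt(self.variance())


if __name__ == "__main__":
    stats = RunningStats(table_size=100)
    for grade in [2.0, 4.0, 3.0]:
        stats.update(grade)
        print(stats.sum_estimate(), stats.std_dev())
```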
API
Current API uses built-in methods, e.g.:
• StopGroup(cursor, groupval)
• speedUpGroup(cursor, groupval)
• slowDownGroup(cursor, groupval)
• setSkipFactor(cursor name, integer)
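To show how such per-group controls might interact with round-robin delivery, here is a toy Python cursor. Everything in it is an assumption made for illustration: the class, its snake_case method names (which only mirror the slide's stop, speed-up and slow-down calls), and the choice to model speed as the number of tuples fetched per group per round. The skip factor is not modelled.

```python
class OnlineCursor:
    """Toy cursor with per-group speed control (hypothetical, not the real API)."""

    def __init__(self, groups):
        self._iters = {g: iter(rows) for g, rows in groups.items()}
        self._rate = {g: 1 for g in groups}  # tuples fetched per round

    def stop_group(self, group):
        # Stop delivering tuples for this group.
        self._iters.pop(group, None)

    def speed_up_group(self, group):
        if group in self._rate:
            self._rate[group] += 1

    def slow_down_group(self, group):
        if group in self._rate:
            self._rate[group] = max(1, self._rate[group] - 1)

    def next_round(self):
        """Fetch one round of tuples, honoring each group's current rate."""
        out = []
        for g in list(self._iters):
            for _ in range(self._rate[g]):
                try:
                    out.append((g, next(self._iters[g])))
                except StopIteration:
                    del self._iters[g]  # group exhausted
                    break
        return out


if __name__ == "__main__":
    cursor = OnlineCursor({"CS": [1, 2, 3, 4], "Math": [10, 20, 30]})
    cursor.speed_up_group("CS")
    print(cursor.next_round())
    cursor.stop_group("Math")
    print(cursor.next_round())
```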
Future Work
Better UI: online data visualization (Tioga DataSplash)
• data viz = “graphical” aggregate
• “drill down” and “roll up” facilities
Nested Queries
Control w/o Indices
Checkpointing/continuation
Tracking online queries
Extensions of statistical results
References
control.cs.berkeley.edu/online/olamd/olamd.PPT