Online Aggregation
Joseph M. Hellerstein, Peter J. Haas, Helen J. Wang

Presented by

Archana Vijayalakshmanan

Contents

• Introduction
• Example
• Advantages
• Requirements
• Approaches to building a system
• System issues
• Conclusion

Online Aggregation: Motivation

SELECT AVG(grade) FROM ENROLL;

A “fancy” interface:

Query Results
AVG: 3.262574342

A Better Approach

Don’t process in batch! Online aggregation:

Example

SELECT AVG(grade) FROM ENROLL
GROUP BY major;

Advantages

• stopping condition set on the fly!
• statistical techniques are more sophisticated
• can handle GROUP BY w/o a priori knowledge

Requirements

Usability
• continuous output (non-blocking query plans)
• time/precision control
• fairness/partiality

Performance
• time to accuracy
• time to completion
• pacing

A Naive Approach

SELECT running_avg(final_grade),
       running_confidence(final_grade),
       running_interval(final_grade)
FROM grades;

• No grouping
• Can’t meet performance & usability needs:
    - no guarantee of continuous output
    - no guarantee of fairness (or control over partiality)
    - no control over pacing
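The kind of running estimate that functions like running_avg and running_interval would report can be sketched in Python. This is a minimal sketch, not the authors' implementation: the names RunningAvg and online_avg are hypothetical, the interval is a plain large-sample (CLT-based) one, and it is only meaningful under the assumption that rows arrive in random order.

```python
import math
from statistics import NormalDist


class RunningAvg:
    """Running mean with a CLT-based confidence interval (Welford's update)."""

    def __init__(self, confidence=0.95):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean
        # Two-sided normal quantile, e.g. about 1.96 for 95% confidence.
        self.z = NormalDist().inv_cdf(0.5 + confidence / 2.0)

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def half_width(self):
        """Half-width of the interval; infinite until two values are seen."""
        if self.n < 2:
            return math.inf
        variance = self.m2 / (self.n - 1)  # sample variance
        return self.z * math.sqrt(variance / self.n)


def online_avg(rows, report_every=2):
    """Yield (rows_seen, running_avg, half_width) as rows stream in.

    Assumption: rows arrive in random order, otherwise the interval
    says nothing about the full table.
    """
    agg = RunningAvg()
    for x in rows:
        agg.update(x)
        if agg.n % report_every == 0:
            yield agg.n, agg.mean, agg.half_width()
```

A caller can iterate over online_avg(grades) and stop as soon as the half-width is small enough, which is the "stopping condition set on the fly" idea. Like the SQL above, the sketch has no grouping and no control over fairness or pacing.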

Random Access to Data

Heap Scan
• OK if clustering uncorrelated to agg & grouping attrs

Index Scan
• can scan an index on attrs uncorrelated to agg or grouping

Sampling from indices
• could introduce new sampling access methods (e.g. Olken’s work)

Group By & Distinct

• Can’t sort!
    - sorting blocks
    - sorting is unfair
• Must use hash-based techniques
    - non-blocking approach, but do not scale gracefully
• Hybrid Hashing.
• “Hybrid Cache” even better.

Index Striding

For fair Group By:
• read tuples in round-robin fashion (want random tuple from Group 1, random tuple from Group 2, ...)
• each group is updated at appropriate rate
• gives info/speed match!
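The round-robin idea can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name index_stride and the weights parameter are hypothetical, and in-memory buckets stand in for an index on the grouping column.

```python
import random
from collections import defaultdict


def index_stride(rows, group_key, weights=None, seed=0):
    """Yield rows round-robin across groups, each group in random order.

    `weights` (hypothetical) maps a group to the number of rows it gets
    per round; the default is 1, and a weight of 0 stops the group.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for row in rows:  # stands in for an index on the grouping column
        by_group[group_key(row)].append(row)
    for bucket in by_group.values():
        rng.shuffle(bucket)  # random tuple order within each group
    weights = weights or {}
    active = True
    while active:
        active = False
        for group, bucket in by_group.items():
            for _ in range(weights.get(group, 1)):
                if bucket:
                    active = True
                    yield bucket.pop()
```

With equal weights, a small group is refreshed as often as a large one, so no group's estimate lags only because it has few tuples. Changing a group's weight speeds up or slows down how fast it is updated.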

Join Algorithms

Non-Blocking Joins
• no sorting!
• merge join OK, but watch for the sorted output
• hybrid hash not great
• symmetric pipeline hash
• nested loops always good, can be too slow
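A symmetric pipelined hash join can be sketched in Python to show why it does not block. This is a minimal in-memory sketch: the function name and the strict alternation between inputs are assumptions for illustration, and it ignores memory limits.

```python
from collections import defaultdict
from itertools import zip_longest

_DONE = object()  # sentinel marking an exhausted input


def symmetric_hash_join(left, right, left_key, right_key):
    """Yield (l, r) pairs as soon as both matching tuples have arrived."""
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    # Alternate between the two inputs; a real system would take
    # whichever input has a tuple ready.
    for l, r in zip_longest(left, right, fillvalue=_DONE):
        if l is not _DONE:
            k = left_key(l)
            left_table[k].append(l)               # build on the left table
            for match in right_table.get(k, ()):  # probe the right table
                yield l, match
        if r is not _DONE:
            k = right_key(r)
            right_table[k].append(r)              # build on the right table
            for match in left_table.get(k, ()):   # probe the left table
                yield match, r
```

Each arriving tuple is inserted into its own hash table and immediately probes the other, so results appear before either input has been read in full. Neither input has to be consumed completely before the first output.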

Query Optimization

Avoid sorting and blocking sub-operations

2 components in cost function:
• dead time (td): time spent doing “invisible” work (tax this at a high rate!)
• output time (to): time spent producing output

Preference to plans that maximize user control, e.g., index striding
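The effect of taxing dead time can be illustrated with a toy cost function in Python. Everything here is a hypothetical illustration: the quadratic penalty, the plan names, and the timings are made-up assumptions, not figures from the paper.

```python
def plan_cost(output_time, dead_time, tax=2.0):
    """Linear in output time, super-linear in dead time.

    Assumption: the 'tax' is modelled as dead_time ** tax.
    """
    return output_time + dead_time ** tax


# Hypothetical plans with made-up timings (arbitrary units).
plans = {
    "sort, then merge join": {"output_time": 6.0, "dead_time": 8.0},
    "symmetric hash join": {"output_time": 14.0, "dead_time": 1.0},
}

# A batch optimizer only looks at total time ...
batch_best = min(plans, key=lambda p: sum(plans[p].values()))
# ... while the online cost function penalizes invisible work.
online_best = min(plans, key=lambda p: plan_cost(**plans[p]))
```

With these made-up numbers the batch criterion picks the sorting plan (total 14 vs. 15), while the taxed cost picks the non-blocking plan (15 vs. 70).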

Extended Aggregate Functions

Basically, aggregate functions must provide running estimates
• SUM, COUNT: straightforward
• VAR, STD DEV: algorithms return confidence intervals

API

Current API uses built-in methods, e.g.,
StopGroup(cursor, groupval)
speedUpGroup(cursor, groupval)
slowDownGroup(cursor, groupval)
setSkipFactor(cursor name, integer)

Future Work

• Better UI
    - online data visualization (Tioga DataSplash)
    - data viz = “graphical” aggregate
    - “drill down” and “roll up” facilities
• Nested Queries
• Control w/o Indices
• Checkpointing/continuation
• Tracking online queries
• Extensions of statistical results

References

control.cs.berkeley.edu/online/olamd/olamd.PPT
