Page 1: Mirek Riedewald Department of Computer Science Cornell University Efficient Processing of Massive Data Streams for Mining and Monitoring.

Mirek Riedewald
Department of Computer Science
Cornell University

Efficient Processing of Massive Data Streams for Mining and Monitoring

Page 2:

Acknowledgements

- Al Demers
- Abhinandan Das
- Alin Dobra
- Sasha Evfimievski
- Johannes Gehrke
- KD-D initiative (Art Becker et al.)

Page 3:

Introduction

- Data streams versus databases
  - Infinite stream, continuous queries
  - Limited resources
- Network monitoring
  - High arrival rates, approximation [CGJSS02]
- Stock trading
  - Complex computation [ZS02]
- Retail, E-business, Intelligence, Medical Surveillance
  - Identify relevant information on-the-fly, archive for data mining
  - Exact results, error guarantees

Page 4:

Information Spheres

- Local Information Sphere
  - Within each organization
  - Continuous processing of distributed data streams
  - Online evaluation of thousands of triggers
  - Storage/archival of important data
- Global Information Sphere
  - Between organizations
  - Share data in a privacy-preserving way

Page 5:

Local Information Sphere

- Distributed data stream event processing and online data mining
- Technical challenges
  - Blocking operators, unbounded state
  - Graceful degradation under increasing load
  - Integration with archive
  - Processing of physically distributed streams

Page 6:

Event Matching, Correlation

- Join of data streams

Brand  Mpix  Price
Canon  3.0   200

Mpix   Price
>2.0   <250

Page 7:

Event Matching, Correlation

- Join of data streams

Brand  Mpix  Price
Canon  3.0   200
Fuji   3.0   100

Mpix   Price
>2.0   <250
>4.0   <400

Page 8:

Event Matching, Correlation

- Join of data streams
  - Equi-join, text similarity, geographical proximity, …
- Problem: unbounded state, computation

Brand  Mpix  Price
Canon  3.0   180
Fuji   3.0   220
Kodak  4.0   340

Mpix   Price
>2.0   <250
>4.0   <400
=3.0   <200
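The matching on this slide can be sketched in a few lines. This is only an illustration: it assumes the second table holds conditions (here called subscriptions) that are tested against each product record, and the names `OPS`, `matches`, and `match_streams` are hypothetical, not from the talk. Both inputs are kept in full, which is exactly the unbounded state the slide warns about.

```python
import operator

# Comparison operators appearing in the slide's second table.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

def matches(product, conditions):
    """True if the product satisfies every (attribute, op, constant) condition."""
    return all(OPS[op](product[attr], const) for attr, op, const in conditions)

def match_streams(products, subscriptions):
    """Naive join: keep both inputs in full and test every pair.

    Returns (brand, subscription index) for every matching pair.
    """
    return [
        (p["Brand"], i)
        for p in products
        for i, conds in enumerate(subscriptions)
        if matches(p, conds)
    ]

# Values copied from the slide; the dict/tuple encoding is an assumption.
products = [
    {"Brand": "Canon", "Mpix": 3.0, "Price": 180},
    {"Brand": "Fuji", "Mpix": 3.0, "Price": 220},
    {"Brand": "Kodak", "Mpix": 4.0, "Price": 340},
]
subscriptions = [
    [("Mpix", ">", 2.0), ("Price", "<", 250)],
    [("Mpix", ">", 4.0), ("Price", "<", 400)],
    [("Mpix", "=", 3.0), ("Price", "<", 200)],
]

result = match_streams(products, subscriptions)
print(result)
```

Every stored record must be compared with every stored condition, so memory and work grow without bound as the streams continue.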

Page 9:

Window Joins

- Restrict join to window of most recent records (tuples)
  - Landmark window
  - Sliding window based on time or number of records
- Problem definition
  - Window based on time: size w
  - Synchronous record arrival
  - Equi-join

Page 10:

Abstract Model

- Data streams R(A,…), S(A,…)
- Compute equi-join on A
  - Match all r and s of streams R, S such that r.A=s.A
- Sliding window of size w

R: 1 1 1
S: 2 3 1

Output: (r0,s2), (r1,s2), (r2,s2)

Page 11:

Abstract Model (cont.)

- Data streams R(A,…), S(A,…)
- Compute equi-join on A
  - Match all r and s of streams R, S such that r.A=s.A
- Sliding window of size w

R: 1 1 1 3
S: 2 3 1 1

Output: (r0,s2), (r1,s2), (r2,s2)
        (r3,s1), (r1,s3), (r2,s3)

Page 12:

Abstract Model (cont.)

- Data streams R(A,…), S(A,…)
- Compute equi-join on A
  - Match all r and s of streams R, S such that r.A=s.A
- Sliding window of size w

R: 1 1 1 3 2
S: 2 3 1 1 4

Output: (r0,s2), (r1,s2), (r2,s2)
        (r3,s1), (r1,s3), (r2,s3)
        No new output
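The abstract model can be replayed with a small simulation. This is a sketch under stated assumptions: one record arrives on each stream per step, a record that arrived at step i is still in the window at step t when t - i < w, and w = 3 is taken from the flow-model example later in the talk. The function name `window_join` is hypothetical.

```python
from collections import deque

def window_join(r_stream, s_stream, w):
    """Sliding-window equi-join with synchronous arrivals and unlimited memory.

    Returns (r_index, s_index) pairs in the order they are produced.
    """
    r_win, s_win = deque(), deque()  # entries are (arrival step, value)
    out = []
    for t, (r, s) in enumerate(zip(r_stream, s_stream)):
        # Expire records that have left the window (assumed rule: t - i >= w).
        for win in (r_win, s_win):
            while win and t - win[0][0] >= w:
                win.popleft()
        # The new R record joins older S records still in the window.
        out.extend((t, i) for i, v in s_win if v == r)
        # The new S record joins older R records still in the window.
        out.extend((i, t) for i, v in r_win if v == s)
        # The two records arriving in the same step may join each other.
        if r == s:
            out.append((t, t))
        r_win.append((t, r))
        s_win.append((t, s))
    return out

pairs = window_join([1, 1, 1, 3, 2], [2, 3, 1, 1, 4], w=3)
print(pairs)
```

With these inputs the pairs come out in the same order as on the slides, and the fifth step adds nothing because r0 and s0 have already expired.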

Page 13:

Limited Resources

- Focus on limited memory M < 2w
- State of the art: random load shedding [KNV03]
  - Random sample of streams
- Desired approach: semantic load shedding
- Goal: graceful degradation
  - Approximation
  - Set-valued result: Error measure?

Page 14:

Set-Approximation Error

- What is a good error measure?
- Information Retrieval, Statistics, Data Mining
  - Matching coefficient: $|A \cap B|$
  - Dice coefficient: $2|A \cap B| / (|A| + |B|)$
  - Jaccard coefficient: $|A \cap B| / |A \cup B|$
  - Cosine coefficient: $|A \cap B| / \sqrt{|A| \cdot |B|}$
  - Overlap coefficient: $|A \cap B| / \min\{|A|, |B|\}$
- Earth Mover’s Distance (EMD) [RTG98]
- Match And Compare (MAC) [IP99]
- Join: subset of output result
  - EMD, Overlap coefficient trivially 0 or 1
  - Others (except MAC) reduce to MAX-subset error measure
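A short sketch shows why the set coefficients on this slide collapse for joins. It assumes the standard definitions of the five coefficients and treats the approximate result A as a subset of the exact result B; the function names and the choice to return 0 for an empty set in `overlap` are assumptions made for the example.

```python
import math

def matching(a, b):
    return len(a & b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cosine(a, b):
    return len(a & b) / math.sqrt(len(a) * len(b))

def overlap(a, b):
    # Assumption: define the coefficient as 0 when either set is empty.
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Exact join result from the abstract-model slides, and a subset of it.
exact = {(0, 2), (1, 2), (2, 2), (3, 1), (1, 3), (2, 3)}
approx = {(0, 2), (1, 2), (2, 2)}

for f in (matching, dice, jaccard, cosine, overlap):
    print(f.__name__, f(approx, exact))
```

Because A is a subset of B, the intersection is A itself: the overlap coefficient is 1 for any non-empty A, while the other four grow with |A| alone, so maximizing any of them means maximizing the size of the produced subset.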

Page 15:

Optimization Problem

- Select records to be kept in memory such that the result size is maximized subject to memory constraints
- Lightweight online technique
- Adaptivity in presence of memory fluctuations

Page 16:

Optimal Offline Algorithm

- What is the best result that can be achieved?
  - Optimal sampling strategy for MAX-subset
  - Baseline for the evaluation of any online algorithm
- Same optimization problem, but knows the future
  - Finite subsets of input streams
- Formulate as linear flow problem

Page 17:

Generation of Flow Model

- R=1,1,1,3
- S=2,3,1,1
- M=2, w=3
- Fixed memory allocation

[Figure: flow network for this example. Labels: "3" and "-3"; arc capacity 0..1 with linear cost; arc costs of -1; arc types "Keep in memory" and "Replace".]

Page 18:

Correspondence to Windows

R=1,1,1,3

S=2,3,1,1

Page 19:

Correspondence to Windows

R=1,1,1,3

S=2,3,1,1

Page 20:

Correspondence to Windows

R=1,1,1,3

S=2,3,1,1

[Figure: windows mapped onto the flow network; highlighted arcs carry cost -1.]

Page 21:

Correspondence to Windows

R=1,1,1,3

S=2,3,1,1

[Figure: windows mapped onto the flow network; further highlighted arcs carry cost -1.]

Page 22:

Complexity

- Integer solution exists
- Optimal solution found in $O(n^2 m \log n)$
  - N: input size of single stream
  - #nodes: n < 2wN + N + 2
  - #arcs: m < 2n + M + 1
- Reasonable costs for benchmarking
  - Approx. 1GB memory (w=800, M=800)
  - Approx. 1h computation time
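As a worked example of the size bounds on this slide, the helper below plugs in the parameters of the small running example (w = 3, M = 2, and N = 4 records per stream). The function name `flow_size_bounds` is hypothetical, and the arc bound is obtained by substituting the node bound for n, so it is slightly looser than the slide's own bound.

```python
def flow_size_bounds(w, N, M):
    """Upper bounds on nodes and arcs of the flow network, per the slide.

    n < 2wN + N + 2 and m < 2n + M + 1; the second bound is evaluated
    with the node bound in place of n.
    """
    n_bound = 2 * w * N + N + 2
    m_bound = 2 * n_bound + M + 1
    return n_bound, m_bound

n_bound, m_bound = flow_size_bounds(w=3, N=4, M=2)
print(n_bound, m_bound)
```

For the running example this gives fewer than 30 nodes and fewer than 63 arcs; the node count grows with the product of w and N, which is why larger windows quickly become expensive.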

Page 23:

Optimal Flow

- R=1,1,1,3
- S=2,3,1,1
- M=2, w=3
- Fixed memory allocation

[Figure: optimal flow in the network for this example. Labels: "3" and "-3"; arc capacity 0..1 with linear cost; arc costs of -1; arc types "Keep in memory" and "Replace".]

Page 24:

Easy to Extend

- R=1,1,1,3
- S=2,3,1,1
- M=2, w=3
- Variable memory allocation

[Figure: flow network extended to variable memory allocation. Labels: "3" and "-3"; arc capacity 0..1 with linear cost; arc costs of -1; arc types "Keep in memory" and "Replace".]
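For an input as small as this example, the offline optimum can also be found by exhaustive search, which is a handy cross-check. The sketch below is not the flow formulation from the talk; it simply tries every keep/drop decision. Its semantics are assumptions: a new record joins the stored records of the other stream, two records arriving in the same step may join each other, a record expires once t - i >= w, and fixed allocation means M // 2 slots per stream while variable allocation shares M slots. All names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations

def subsets(items):
    """All sub-tuples of a tuple, from empty to full."""
    return [c for k in range(len(items) + 1) for c in combinations(items, k)]

def offline_optimum(R, S, M, w, fixed=False):
    """Maximum join output size reachable with M memory slots (brute force)."""
    n = len(R)

    @lru_cache(maxsize=None)
    def best(t, mem_r, mem_s):
        # mem_r / mem_s hold the arrival steps of the records kept so far.
        if t == n:
            return 0
        mem_r = tuple(i for i in mem_r if t - i < w)  # drop expired records
        mem_s = tuple(i for i in mem_s if t - i < w)
        gain = sum(R[t] == S[i] for i in mem_s)       # new r joins stored s
        gain += sum(S[t] == R[i] for i in mem_r)      # new s joins stored r
        gain += R[t] == S[t]                          # simultaneous arrivals
        top = 0
        for keep_r in subsets(mem_r + (t,)):
            for keep_s in subsets(mem_s + (t,)):
                if fixed:
                    fits = len(keep_r) <= M // 2 and len(keep_s) <= M // 2
                else:
                    fits = len(keep_r) + len(keep_s) <= M
                if fits:
                    top = max(top, best(t + 1, keep_r, keep_s))
        return gain + top

    return best(0, (), ())

R, S = (1, 1, 1, 3), (2, 3, 1, 1)
fixed_opt = offline_optimum(R, S, M=2, w=3, fixed=True)
variable_opt = offline_optimum(R, S, M=2, w=3, fixed=False)
print(fixed_opt, variable_opt)
```

Under these assumptions the search yields 4 result pairs for fixed allocation and 5 for variable allocation, against 6 for the unrestricted window join. The enumeration is exponential and only suitable for toy inputs.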

Page 25:

Online Heuristics

- Maximize expected output
  - PROB: sort tuples by join partner arrival probability
  - LIFE: sort tuples by product of partner arrival probability and remaining lifetime
- Maintain stream statistics
  - Histograms (DGIM02, TGIK02), wavelets (GKMS01), quantiles (GKMS02, GK01)
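The two heuristics differ only in how a stored tuple is scored. The sketch below assumes that the partner arrival probability comes from a per-value frequency estimate of the other stream (here a plain dict standing in for a histogram) and that remaining lifetime is w minus the tuple's age; both choices, and all names, are illustrative assumptions.

```python
def prob_score(record, now, w, partner_prob):
    """PROB: estimated probability that a join partner arrives."""
    return partner_prob.get(record["value"], 0.0)

def life_score(record, now, w, partner_prob):
    """LIFE: partner arrival probability times remaining lifetime in the window."""
    remaining = max(0, w - (now - record["arrival"]))
    return partner_prob.get(record["value"], 0.0) * remaining

def keep_top(records, capacity, score, now, w, partner_prob):
    """Keep the `capacity` highest-scoring records; the rest are shed."""
    ranked = sorted(records, key=lambda r: score(r, now, w, partner_prob), reverse=True)
    return ranked[:capacity]

# Hypothetical statistics for the other stream and two stored tuples.
partner_prob = {1: 0.5, 2: 0.25, 3: 0.25}
memory = [{"value": 1, "arrival": 0}, {"value": 3, "arrival": 2}]

kept_prob = keep_top(memory, 1, prob_score, now=2, w=3, partner_prob=partner_prob)
kept_life = keep_top(memory, 1, life_score, now=2, w=3, partner_prob=partner_prob)
print(kept_prob, kept_life)
```

With one free slot, PROB keeps the older tuple with the more frequent value, whereas LIFE prefers the fresh tuple because it has three steps left in the window instead of one.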

Page 26:

Approximation Quality

Page 27:

Effect of Skew

Page 28:

Summary

- Information sphere architecture
- Optimal algorithm and fast, efficient heuristic for sliding window joins
- Open problems
  - Other set error measures, resource models
  - Other joins: compress records
  - Complex queries
  - Distributed processing
  - Integration with other techniques into local information sphere

Page 29:

Related Work

- Aurora (Brown, MIT), STREAM (Stanford), Telegraph (Berkeley), NiagaraCQ (Wisconsin, OGI)
- Memory requirements [ABBMW02, TM02]
- Aggregation
  - Alon, Bar-Yossef, Datar, Dobra, Garofalakis, Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis, Koudas, Matias, Motwani, Muthukrishnan, Rastogi, Srivastava, Strauss, Szegedy

Page 30:

Other Results

- [DGR03]
- Integration with archive
  - Load smoothing, not shedding
  - Novel “error” measure: archive access cost
- Static join for sensor networks
  - Maximize result size subject to constraints on energy consumption
  - Polynomial dynamic programming solution
  - Fast 2-approximation algorithms
  - NP-hardness proof for join of 3 or more streams

Page 31:

Other Results (cont.)

- [DGGR02]
- Computation of aggregates over streams for multiple joins
  - Small pseudo-random sketch synopses (randomized linear projections)
  - Explicit, tunable error guarantees
  - Sketch partitioning to boost accuracy (intelligently partition join attribute space)
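The idea of a sketch as a randomized linear projection can be illustrated for the simplest aggregate, the size of a two-way join. This is a minimal sketch in the style of sign-based (AMS-type) estimators, not the method of [DGGR02] itself: each counter sums a random +1/-1 sign per join-attribute value, and the product of the two streams' counters estimates the join size. The `SignFamily` table is a stand-in for the compact hash families real sketches use, and plain averaging replaces more robust combination schemes; all names are hypothetical.

```python
import random

class SignFamily:
    """Lazily drawn +1/-1 sign per value (illustrative; uses a lookup table)."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self._signs = {}

    def __call__(self, value):
        if value not in self._signs:
            self._signs[value] = self._rng.choice((-1, 1))
        return self._signs[value]

def sketch(stream, families):
    """One counter per family: a linear projection of the frequency vector."""
    return [sum(xi(v) for v in stream) for xi in families]

def estimate_join_size(sketch_r, sketch_s):
    """Average of per-counter products; both sketches must share the families."""
    products = [x * y for x, y in zip(sketch_r, sketch_s)]
    return sum(products) / len(products)

families = [SignFamily(seed) for seed in range(64)]
R = [1, 1, 1, 3]
S = [2, 3, 1, 1]
estimate = estimate_join_size(sketch(R, families), sketch(S, families))
print(estimate)  # a randomized estimate of the exact join size, which is 7 here
```

Each counter can be updated as records arrive, so the synopsis size is independent of stream length; using more counters reduces the variance of the estimate.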

Page 32:

Thanks!

Questions?
