Data Stream Processing - University of Edinburgh

59
Data Stream Processing Part II Stream Windows Lossy Counting Sticky Sampling 1

Transcript of Data Stream Processing - University of Edinburgh

Page 1: Data Stream Processing - University of Edinburgh

Data Stream ProcessingPart II

Stream Windows Lossy Counting Sticky Sampling1

Page 2: Data Stream Processing - University of Edinburgh

Data Streams (recap)

continuous, unbounded sequence of itemsunpredictable arrival timestoo large to store locallyone pass real time processing required

Stream Windows Lossy Counting Sticky Sampling2

Page 3: Data Stream Processing - University of Edinburgh

Reservoir Sampling (recap)

r/N r/N r/N r/N r/N

Reservoir

Stream

create representative sample of incoming data items Nuniformly sample into reservoir of size r

Stream Windows Lossy Counting Sticky Sampling3

Page 4: Data Stream Processing - University of Edinburgh

Today

Counting algorithms based on streamwindows

Lossy Counting

Sticky Sampling

Stream Windows Lossy Counting Sticky Sampling4

Page 5: Data Stream Processing - University of Edinburgh

Stream Windows

Mechanism for extracting a finite relation from aninfinite stream.

Stream Windows Lossy Counting Sticky Sampling5

Page 6: Data Stream Processing - University of Edinburgh

Window Example

a d j u w s w y u j g d e d l

stream

past future

a d j u w s w y u j g d e d l

stream

a d j u w s w y u j g d e d l

stream

Sliding Window

Stream Windows Lossy Counting Sticky Sampling6

Page 7: Data Stream Processing - University of Edinburgh

Window Example

a d j u w s w y u j g d e d l

stream

past future

a d j u w s w y u j g d e d l

stream

a d j u w s w y u j g d e d l

stream

Sliding Window

Stream Windows Lossy Counting Sticky Sampling7

Page 8: Data Stream Processing - University of Edinburgh

Window Example

a d j u w s w y u j g d e d l

stream

past future

a d j u w s w y u j g d e d l

stream

a d j u w s w y u j g d e d l

stream

Sliding Window

Stream Windows Lossy Counting Sticky Sampling8

Page 9: Data Stream Processing - University of Edinburgh

Window Example

a d j u w s w y u j g d e d l

stream

past future

a d j u w s w y u j g d e d l

stream

a d j u w s w y u j g d e d l

stream

Sliding WindowStream Windows Lossy Counting Sticky Sampling

9

Page 10: Data Stream Processing - University of Edinburgh

Window Types

assumes existences of some attribute that defines the order ofthe stream elements (e.g. time)w is the window length (size) expressed in units of theordering attribute (e.g. seconds)

t1 t2 t3 t4 t1' t2' t3' t4'Sliding Window

ti' - ti = w

t1 t2 t3Tumbling Window

ti+1 - ti = w

Stream Windows Lossy Counting Sticky Sampling10

Page 11: Data Stream Processing - University of Edinburgh

Window Types

assumes existences of some attribute that defines the order ofthe stream elements (e.g. time)w is the window length (size) expressed in units of theordering attribute (e.g. seconds)

t1 t2 t3 t4 t1' t2' t3' t4'Sliding Window

ti' - ti = w

t1 t2 t3Tumbling Window

ti+1 - ti = w

Stream Windows Lossy Counting Sticky Sampling11

Page 12: Data Stream Processing - University of Edinburgh

Count based Windows

Ordering attribute can cause problems for duplicates(e.g. same time stamps)

Use count based windows instead

t1 t2 t3t1' t2' t3'Count based Window

Count based windows are potentially unpredicatable with respect tofluctuation in input rates.

Stream Windows Lossy Counting Sticky Sampling12

Page 13: Data Stream Processing - University of Edinburgh

Count based Windows

Ordering attribute can cause problems for duplicates(e.g. same time stamps)

Use count based windows instead

t1 t2 t3t1' t2' t3'Count based Window

Count based windows are potentially unpredicatable with respect tofluctuation in input rates.

Stream Windows Lossy Counting Sticky Sampling13

Page 14: Data Stream Processing - University of Edinburgh

Punctuation based Windows

Split windows based on punctuations in the data

Punctuation based Window\n \n \n

Potentially problematic if windows grow too large or too small.

Stream Windows Lossy Counting Sticky Sampling14

Page 15: Data Stream Processing - University of Edinburgh

Punctuation based Windows

Split windows based on punctuations in the data

Punctuation based Window\n \n \n

Potentially problematic if windows grow too large or too small.

Stream Windows Lossy Counting Sticky Sampling15

Page 16: Data Stream Processing - University of Edinburgh

Window Standing Query Example

What is the average of the integers in the window?

Stream of integersWindow of size w = 4

Count based sliding windowfor the first w inputs, sum and countafterwards change average by adding (i − j)/w to the previouswindow average

Stream Windows Lossy Counting Sticky Sampling16

Page 17: Data Stream Processing - University of Edinburgh

Window Standing Query Example

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1+3+5+44

= 3.25

3.25 + i−jw

with i newest value, joldest value

1+3+5+44

+ 8−14

= 5

5 + 9−34

= 6.5

Datastructure?

Stream Windows Lossy Counting Sticky Sampling17

Page 18: Data Stream Processing - University of Edinburgh

Window Standing Query Example

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1+3+5+44

= 3.25

3.25 + i−jw

with i newest value, joldest value

1+3+5+44

+ 8−14

= 5

5 + 9−34

= 6.5

Datastructure?

Stream Windows Lossy Counting Sticky Sampling18

Page 19: Data Stream Processing - University of Edinburgh

Window Standing Query Example

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1+3+5+44

= 3.25

3.25 + i−jw

with i newest value, joldest value

1+3+5+44

+ 8−14

= 5

5 + 9−34

= 6.5

Datastructure?

Stream Windows Lossy Counting Sticky Sampling19

Page 20: Data Stream Processing - University of Edinburgh

Window Standing Query Example

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1+3+5+44

= 3.25

3.25 + i−jw

with i newest value, joldest value

1+3+5+44

+ 8−14

= 5

5 + 9−34

= 6.5

Datastructure?

Stream Windows Lossy Counting Sticky Sampling20

Page 21: Data Stream Processing - University of Edinburgh

Window Standing Query Example

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1+3+5+44

= 3.25

3.25 + i−jw

with i newest value, joldest value

1+3+5+44

+ 8−14

= 5

5 + 9−34

= 6.5

Datastructure?

Stream Windows Lossy Counting Sticky Sampling21

Page 22: Data Stream Processing - University of Edinburgh

Window Standing Query Example

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1 3 5 4 8 9 3 1 4 2 7 5 6 8 7

stream

1+3+5+44

= 3.25

3.25 + i−jw

with i newest value, joldest value

1+3+5+44

+ 8−14

= 5

5 + 9−34

= 6.5

Datastructure?

Stream Windows Lossy Counting Sticky Sampling22

Page 23: Data Stream Processing - University of Edinburgh

Window Average

#!/usr/bin/env python2import sysimport QueueWINDOW = 4elems = Queue.Queue()elem_sum = 0for i in range(WINDOW): # initial average

val = int(sys.stdin.readline().strip())elems.put(val)elem_sum += val

avg = float(elem_sum) / WINDOWfor line in sys.stdin:

print(avg)val = int(line.strip())avg = avg + (val - elems.get())/float(WINDOW)elems.put(val)

Allows calculation in a single pass of each element.

Stream Windows Lossy Counting Sticky Sampling23

Page 24: Data Stream Processing - University of Edinburgh

Window Average

#!/usr/bin/env python2import sysimport QueueWINDOW = 4elems = Queue.Queue()elem_sum = 0for i in range(WINDOW): # initial average

val = int(sys.stdin.readline().strip())elems.put(val)elem_sum += val

avg = float(elem_sum) / WINDOWfor line in sys.stdin:

print(avg)val = int(line.strip())avg = avg + (val - elems.get())/float(WINDOW)elems.put(val)

Allows calculation in a single pass of each element.

Stream Windows Lossy Counting Sticky Sampling24

Page 25: Data Stream Processing - University of Edinburgh

Window based Algorithm

Lossy Counting

Stream Windows Lossy Counting Sticky Sampling25

Page 26: Data Stream Processing - University of Edinburgh

Problem Description

Maintain a count of distinctelements seen so far

Examples:Google web crawler counting URL encounters.Detecting spam pages through content analysis.User login rankings to web services.

Straight forward solution: Hashtable

Too large for memory, too slow on disk

Stream Windows Lossy Counting Sticky Sampling26

Page 27: Data Stream Processing - University of Edinburgh

Problem Description

Maintain a count of distinctelements seen so far

Examples:Google web crawler counting URL encounters.Detecting spam pages through content analysis.User login rankings to web services.

Straight forward solution: Hashtable

Too large for memory, too slow on disk

Stream Windows Lossy Counting Sticky Sampling27

Page 28: Data Stream Processing - University of Edinburgh

Problem Description

Maintain a count of distinctelements seen so far

Examples:Google web crawler counting URL encounters.Detecting spam pages through content analysis.User login rankings to web services.

Straight forward solution: Hashtable

Too large for memory, too slow on disk

Stream Windows Lossy Counting Sticky Sampling28

Page 29: Data Stream Processing - University of Edinburgh

Problem Description

Maintain a count of distinctelements seen so far

Examples:Google web crawler counting URL encounters.Detecting spam pages through content analysis.User login rankings to web services.

Straight forward solution: Hashtable

Too large for memory, too slow on disk

Stream Windows Lossy Counting Sticky Sampling29

Page 30: Data Stream Processing - University of Edinburgh

Algorithm Parameters

Environment ParametersElements seen so far N

User-specified Parameterssupport threshold s ∈ (0, 1)

error parameter ε ∈ (0, 1)

Stream Windows Lossy Counting Sticky Sampling30

Page 31: Data Stream Processing - University of Edinburgh

Algorithm Guarantees

1 All items whose true frequency exceeds sN are output. Thereare no false negatives.

2 No items whose true frequency is less than (s − ε)N is output.3 Estimated frequencies are less than the true frequencies by at

most εN.

Stream Windows Lossy Counting Sticky Sampling31

Page 32: Data Stream Processing - University of Edinburgh

Example

With s = 10%, ε = 1%,N = 1000

1 All elements exceeding frequency sN = 100 will be output.2 No elements with frequencies below (s − ε)N = 90 are output.

False positives between 90 and 100 might or might not beoutput.

3 All estimated frequencies diverge from their true frequencies byat most εN = 10 instances.

Rule of thumb: ε = 0.1s

Stream Windows Lossy Counting Sticky Sampling32

Page 33: Data Stream Processing - University of Edinburgh

Example

With s = 10%, ε = 1%,N = 1000

1 All elements exceeding frequency sN = 100 will be output.

2 No elements with frequencies below (s − ε)N = 90 are output.False positives between 90 and 100 might or might not beoutput.

3 All estimated frequencies diverge from their true frequencies byat most εN = 10 instances.

Rule of thumb: ε = 0.1s

Stream Windows Lossy Counting Sticky Sampling33

Page 34: Data Stream Processing - University of Edinburgh

Example

With s = 10%, ε = 1%,N = 1000

1 All elements exceeding frequency sN = 100 will be output.2 No elements with frequencies below (s − ε)N = 90 are output.

False positives between 90 and 100 might or might not beoutput.

3 All estimated frequencies diverge from their true frequencies byat most εN = 10 instances.

Rule of thumb: ε = 0.1s

Stream Windows Lossy Counting Sticky Sampling34

Page 35: Data Stream Processing - University of Edinburgh

Example

With s = 10%, ε = 1%,N = 1000

1 All elements exceeding frequency sN = 100 will be output.2 No elements with frequencies below (s − ε)N = 90 are output.

False positives between 90 and 100 might or might not beoutput.

3 All estimated frequencies diverge from their true frequencies byat most εN = 10 instances.

Rule of thumb: ε = 0.1s

Stream Windows Lossy Counting Sticky Sampling35

Page 36: Data Stream Processing - University of Edinburgh

Example

With s = 10%, ε = 1%,N = 1000

1 All elements exceeding frequency sN = 100 will be output.2 No elements with frequencies below (s − ε)N = 90 are output.

False positives between 90 and 100 might or might not beoutput.

3 All estimated frequencies diverge from their true frequencies byat most εN = 10 instances.

Rule of thumb: ε = 0.1s

Stream Windows Lossy Counting Sticky Sampling36

Page 37: Data Stream Processing - University of Edinburgh

Expected Errors

1 high frequency false positives2 small errors in frequency estimations

Acceptable for high numbers of N

Stream Windows Lossy Counting Sticky Sampling37

Page 38: Data Stream Processing - University of Edinburgh

Expected Errors

1 high frequency false positives2 small errors in frequency estimations

Acceptable for high numbers of N

Stream Windows Lossy Counting Sticky Sampling38

Page 39: Data Stream Processing - University of Edinburgh

Lossy Counting in Action

Incoming Stream of Colours

Stream Windows Lossy Counting Sticky Sampling39

Page 40: Data Stream Processing - University of Edinburgh

Divide into Windows/Buckets

w w w

Window Size w =⌈1ε

⌉=

⌈1

0.01

⌉= 100

Stream Windows Lossy Counting Sticky Sampling40

Page 41: Data Stream Processing - University of Edinburgh

First Window Comes In

Empty Counts Frequency Counts

First Window

Go through elements. If counter exists, increase by one, if notcreate one and initialise it to one.

Stream Windows Lossy Counting Sticky Sampling41

Page 42: Data Stream Processing - University of Edinburgh

Adjust Counts at Window Boundaries

Frequency Counts Frequency Counts

Reduce all counts by one. If counter is zero for a specificelement, drop it.

Stream Windows Lossy Counting Sticky Sampling42

Page 43: Data Stream Processing - University of Edinburgh

Next Window Comes In

Next Window

Frequency CountsFrequency Counts

Count elements and adjust counts afterwards.

Stream Windows Lossy Counting Sticky Sampling43

Page 44: Data Stream Processing - University of Edinburgh

Lossy Counting Summary

Split Stream into WindowsFor each window: Count elements, if no counter exists, createone.At window boundaries: Reduce all frequencies by one. Iffrequency goes to zero, drop counter.Process next window ...

Data structure to save counters:Hashtable<Color,Integer>

Stream Windows Lossy Counting Sticky Sampling44

Page 45: Data Stream Processing - University of Edinburgh

Lossy Counting Summary

Split Stream into WindowsFor each window: Count elements, if no counter exists, createone.At window boundaries: Reduce all frequencies by one. Iffrequency goes to zero, drop counter.Process next window ...

Data structure to save counters:

Hashtable<Color,Integer>

Stream Windows Lossy Counting Sticky Sampling45

Page 46: Data Stream Processing - University of Edinburgh

Lossy Counting Summary

Split Stream into WindowsFor each window: Count elements, if no counter exists, createone.At window boundaries: Reduce all frequencies by one. Iffrequency goes to zero, drop counter.Process next window ...

Data structure to save counters:Hashtable<Color,Integer>

Stream Windows Lossy Counting Sticky Sampling46

Page 47: Data Stream Processing - University of Edinburgh

Output

With s = 10%, ε = 1%,N = 200

Frequency Counts

outputthreshold

Output

24

22

19

27

False Positive

To reduce false positives to acceptable amount, onlyoutput counters with frequency f ≥ (s − ε)N = 18.

Stream Windows Lossy Counting Sticky Sampling47

Page 48: Data Stream Processing - University of Edinburgh

Accuracy Improvement

Reduction step of counters follows the approach of reducing allcounters by one. An improved version maintains exact

frequencies and remembers for each counter at which windowid it was created. At window boundaries, counters are onlyremoved when their frequency falls below a certain level in

relation to their window id.

(Color,Integer,WindowID)

See paper for details.

G. S. Manku, R. Motwani. Approximate Frequency Counts overData Streams, VLDB, 2002.

Stream Windows Lossy Counting Sticky Sampling48

Page 49: Data Stream Processing - University of Edinburgh

Window based Algorithm

Sticky Sampling

Stream Windows Lossy Counting Sticky Sampling49

Page 50: Data Stream Processing - University of Edinburgh

Problem Description

Counting algorithm using asampling approach.

Probabilistic sampling decides if a counter for adistinct element is created.

If a counter exists for a certain element, every futureinstance of this element will be counted.

Stream Windows Lossy Counting Sticky Sampling50

Page 51: Data Stream Processing - University of Edinburgh

Algorithm Parameters

Environment ParametersElements seen so far N

User-specified Parameterssupport threshold s ∈ (0, 1)

error parameter ε ∈ (0, 1)

probability of failure δ ∈ (0, 1)

The algorithm is probabilistic and fails if any of thethree guarantees is not satisfied.

Stream Windows Lossy Counting Sticky Sampling51

Page 52: Data Stream Processing - University of Edinburgh

Algorithm Guarantees

1 All items whose true frequency exceeds sN are output. Thereare no false negatives.

2 No items whose true frequency is less than (s − ε)N is output.3 Estimated frequencies are less than the true frequencies by at

most εN.

Guarantees and thereby expected errors are the sameas for Lossy Counting. Except for the small

probability that it might fail to provide correctanswers.

Stream Windows Lossy Counting Sticky Sampling52

Page 53: Data Stream Processing - University of Edinburgh

Sticky Sampling in Action

Incoming Stream of Colours

Stream Windows Lossy Counting Sticky Sampling53

Page 54: Data Stream Processing - University of Edinburgh

Divide into Windows/Buckets

w = t w = 2t w = 4t

window 1 window 2 window 3

Dynamic window size with t = 1ε log

(1sδ

)With s = 10%, ε = 1%, δ = 0.1%

t ≈ 921

Stream Windows Lossy Counting Sticky Sampling54

Page 55: Data Stream Processing - University of Edinburgh

A Window Comes in

window 1 window 2 window 3

w = tr = 1

w = 2tr = 2

w = 4tr = 4

Go through elements. If counter exists, increase it. If not,create a counter with probability 1

rand initialise it to one.

Sampling rate r grows in proportion to window size.

Stream Windows Lossy Counting Sticky Sampling55

Page 56: Data Stream Processing - University of Edinburgh

Adjust Counts at Window Boundaries

Frequency Counts Frequency Counts

Go through elements ofeach counter. Toss coin, ifunsuccessful removeelement, otherwise moveon to next counter. Ifcounter becomes zero,drop it.

Ensures uniformsampling

Stream Windows Lossy Counting Sticky Sampling56

Page 57: Data Stream Processing - University of Edinburgh

Sticky Sampling Summary

Split stream into windows, doubling window size of each newwindowFor each window: Go through elements if counter exists,increase it. If not, create one with probability 1

r with r growingat the same rate as window size.At window boundaries: Reduce all frequencies by tossing anunbiased coin for each counted element. Remove element ifcoin toss unsuccessful, otherwise move on to next counter. Iffrequency goes to zero, drop counter.Process next window ...

Stream Windows Lossy Counting Sticky Sampling57

Page 58: Data Stream Processing - University of Edinburgh

Output

Same principle as Lossy Counting

To reduce false positives to acceptable amount, onlyoutput counters with frequency f ≥ (s − ε)N .

Stream Windows Lossy Counting Sticky Sampling58

Page 59: Data Stream Processing - University of Edinburgh

Lossy Counting vs. Sticky Sampling

Feature Lossy Counting Sticky SamplingResults deterministic probabilisticMemory grows with N static (independent of N)Theory performs worse performs betterPractice performs better performs worse

performance in terms of memory and accuracy

Stream Windows Lossy Counting Sticky Sampling59