Transcript of "Effective Change Detection Using Sampling" (31 pages)

Page 1:

Effective Change Detection Using Sampling
Junghoo "John" Cho
Alexandros Ntoulas
UCLA

Page 2:

Application
  Web search engines/crawlers
  Web archive
  Data warehouse
  ...

Problem
  Polling

[Diagram labels: Remote database, Local database, Query, Update]

Page 3:

Existing Approach
  Round robin
    Download pages in a round robin manner
  Change-frequency based [CLW98, CGM00, EMT01]
    Estimate the change frequency
    Adjust download frequency
    Proven to be optimal
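As a rough illustration of the "estimate the change frequency" step, the sketch below uses a naive estimator: the fraction of past visits on which a page was found changed. The function name and the boolean-history input are assumptions made for this example, and the cited work [CLW98, CGM00, EMT01] may rely on more refined estimators. Because at most one change can be detected per visit, this estimate is biased low for pages that change faster than they are visited.

```python
def naive_change_frequency(change_history):
    """Estimate changes per download cycle from a list of booleans,
    one per past visit (True = page had changed since the previous visit).

    Hypothetical helper for illustration only.
    """
    if not change_history:
        # This is the weakness of frequency-based policies: no history, no estimate.
        raise ValueError("no history: a frequency-based policy needs past visits")
    return sum(change_history) / len(change_history)


# A page found changed on 3 of its last 4 monthly visits.
print(naive_change_frequency([True, False, True, True]))  # 0.75
```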

Page 4:

Our Approach
  Sampling-based
    Sample k pages from each source
    Download more pages from the source with more changed samples

Page 5:

Comparison
  Frequency based
    Proven to be optimal
    Change history required
    Difficult to estimate change frequency
  Sampling based
    Can be worse than frequency based policy
    No history/frequency-estimation required
  Experimental comparison later

Page 6:

Questions
  Are we assuming correlation?
  How to use sampling results?
    Proportional vs Greedy
  How many samples?
    Dynamic sample size adjustment?
  What if we have very limited resources?

Page 7:

Is Correlation Necessary?
  Random sampling
  Correlation not necessary. Only random sampling
  More discussion later

[Figure: changed samples per site, 4/5 and 1/5]


Page 9:

Download Model (1)
  Fixed download cycle
    Say, once a month
  Fixed download resources in each cycle
    Say, 100,000 page downloads every month
  Goal
    Download as many changes as we can
    $\mathrm{ChangeRatio} = \frac{\text{no. of changed and downloaded pages}}{\text{no. of downloaded pages}}$
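The ChangeRatio metric can be computed directly from two sets of page identifiers. This is a minimal sketch; the function name and the use of page-id collections are assumptions for illustration.

```python
def change_ratio(downloaded, changed):
    """Fraction of downloaded pages that had actually changed.

    downloaded: iterable of page ids fetched in this cycle
    changed:    iterable of page ids that changed since the last cycle
    """
    downloaded = set(downloaded)
    if not downloaded:
        return 0.0  # nothing downloaded, so no changes were captured
    return len(downloaded & set(changed)) / len(downloaded)


# 4 pages downloaded, 2 of them ("a" and "c") had changed.
print(change_ratio(["a", "b", "c", "d"], ["a", "c", "x"]))  # 0.5
```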

Page 10:

Download Model (2)
  Two-stage sampling policy
    Sampling stage
    Download stage
  Sampling requires page download
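Because sampled pages are themselves downloads, the sampling stage consumes part of the cycle's budget. The sketch below shows one way to model that stage; the function name, the dictionary layout, and the fixed seed are assumptions made for this example, not details from the talk.

```python
import random


def sampling_stage(sites, changed, budget, k, seed=0):
    """Sample up to k pages per site and report what is left of the budget.

    sites:   {site name: list of page ids}
    changed: set of page ids that changed (known only after downloading them)
    Returns ({site: fraction of sampled pages that changed}, remaining budget).
    """
    rng = random.Random(seed)
    fractions = {}
    used = 0
    for site, pages in sites.items():
        sample = rng.sample(pages, min(k, len(pages)))
        used += len(sample)  # every sample costs one download
        if sample:
            fractions[site] = sum(p in changed for p in sample) / len(sample)
    return fractions, budget - used


sites = {
    "A": [f"A{i}" for i in range(20)],
    "B": [f"B{i}" for i in range(20)],
}
changed = set(sites["A"])  # toy data: every page of A changed, none of B
fractions, remaining = sampling_stage(sites, changed, budget=20, k=5)
print(fractions, remaining)  # {'A': 1.0, 'B': 0.0} 10
```

The remaining budget is what the download stage can spend according to the chosen policy.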

Page 11:

How to Use Sampling Result?
  Sites A and B, each with 20 pages
  20 total downloads, 5 samples from each site
  10 page downloads remaining

[Figure: changed samples, A: 4/5, B: 1/5]

Page 12:

Proportional Policy
  Download pages proportionally to the detected changes
  8 pages from A, 2 pages from B

[Figure: changed samples, A: 4/5, B: 1/5]
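The proportional split can be sketched as follows for the two-site example (4 changed samples in A, 1 in B, 10 downloads remaining). The function name and the largest-remainder rounding are assumptions for this illustration; the slides do not say how fractional shares are rounded, and the sketch ignores per-site page limits.

```python
def proportional_allocation(changed_samples, budget):
    """Split the remaining budget in proportion to detected changes.

    changed_samples: {site: number of changed pages among its samples}
    """
    total = sum(changed_samples.values())
    if total == 0:
        # No changes detected anywhere: this sketch leaves the budget unspent.
        return {site: 0 for site in changed_samples}
    shares = {s: budget * c / total for s, c in changed_samples.items()}
    alloc = {s: int(share) for s, share in shares.items()}
    leftover = budget - sum(alloc.values())
    # Hand out any leftover downloads to the largest fractional parts.
    by_fraction = sorted(shares, key=lambda s: shares[s] - alloc[s], reverse=True)
    for s in by_fraction[:leftover]:
        alloc[s] += 1
    return alloc


print(proportional_allocation({"A": 4, "B": 1}, budget=10))  # {'A': 8, 'B': 2}
```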

Page 13:

Greedy Policy
  Download pages from the sites with most changes
  10 pages from A

[Figure: changed samples, A: 4/5, B: 1/5]
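A minimal sketch of the greedy allocation for the same example, assuming every site received the same number of samples so that raw changed-sample counts are comparable. The function name and the `remaining_pages` argument (pages of each site not yet downloaded) are hypothetical.

```python
def greedy_allocation(changed_samples, remaining_pages, budget):
    """Spend the budget on the sites with the most changed samples first.

    changed_samples: {site: number of changed pages among its samples}
    remaining_pages: {site: number of pages not yet downloaded}
    """
    alloc = {site: 0 for site in changed_samples}
    # Visit sites from most to fewest changed samples.
    for site in sorted(changed_samples, key=changed_samples.get, reverse=True):
        if budget == 0:
            break
        take = min(remaining_pages[site], budget)
        alloc[site] = take
        budget -= take
    return alloc


# Each site has 20 pages, 5 already sampled, so 15 remain; 10 downloads left.
print(greedy_allocation({"A": 4, "B": 1}, {"A": 15, "B": 15}, budget=10))
# {'A': 10, 'B': 0}
```

If the top site runs out of pages, the leftover budget moves on to the next site in the ranking.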

Page 14:

Optimality of Greedy
  Theorem
    Greedy is optimal if we make download decisions purely based on sampling results
    Probabilistic optimality for their expected values


Page 16:

How Many Samples?
  Too few samples
    Inaccurate change estimates
  Too many samples
    "Waste" of resources for sampling
  How to determine optimal sample size?

Page 17:

Optimal Sample Size
  Factors to consider
    Total number of pages that we maintain
    Number of pages that we can download in the current cycle
    Number of pages in each Web site
    Change distribution
      Scenario 1 -- A: 90/100, B: 10/100
      Scenario 2 -- A: 60/100, B: 40/100
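To see why the change distribution matters, one can compute how often a sample of k pages per site ranks A strictly above B in each scenario. The sketch below assumes pages are sampled without replacement from 100-page sites, so the number of changed samples follows a hypergeometric distribution; the function names are hypothetical and the calculation is an illustration, not the analysis from the paper.

```python
from math import comb


def hypergeom_pmf(site_size, changed_pages, k, x):
    """P(exactly x of k pages sampled without replacement are changed)."""
    return comb(changed_pages, x) * comb(site_size - changed_pages, k - x) / comb(site_size, k)


def prob_a_ranks_higher(site_size, changed_a, changed_b, k):
    """P(A's sample shows strictly more changed pages than B's sample)."""
    return sum(
        hypergeom_pmf(site_size, changed_a, k, a) * hypergeom_pmf(site_size, changed_b, k, b)
        for a in range(k + 1)
        for b in range(a)
    )


for k in (5, 20):
    s1 = prob_a_ranks_higher(100, 90, 10, k)  # Scenario 1
    s2 = prob_a_ranks_higher(100, 60, 40, k)  # Scenario 2
    print(f"k={k}: scenario 1 -> {s1:.3f}, scenario 2 -> {s2:.3f}")
```

When the two sites differ sharply (Scenario 1), a handful of samples almost always ranks them correctly; when they are close (Scenario 2), more samples are needed for the same reliability.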

Page 18:

Change Fraction Distribution
  $\rho_i$: fraction of changed pages in site i
  $f(\rho)$: distribution of $\rho$ values

[Figure: the distribution $f(\rho)$; y-axis: fraction of sites; a point labeled t is marked on the plot]

Page 19:

Optimal Sample Size
  N: no. of pages in a site
  r: no. of pages to download / no. of pages we maintain
  Analysis is complex
    [Formula garbled in extraction; surviving terms: N, r, f(t), 6]
  $\sqrt{Nr}$ is a good rule of thumb

Page 20:

Dynamic Sample Size?
  Do we need the same sample size for every site?
  A: $\rho = 0$, B: $\rho = 0.45$, C: $\rho = 0.55$, D: $\rho = 1$ (where $\rho$ denotes the fraction of changed pages in a site)

Page 21:

Adaptive Sampling
  If the estimated $\rho_i$ (fraction of changed pages in site i) is high/low enough, make an early decision
  What does "high enough" mean?
    Confidence interval above threshold

[Figure: confidence intervals around the estimated $\rho_i$, compared against a threshold t]
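The early-decision rule can be sketched with a confidence interval on the sampled change fraction. The slides do not say which interval is used; the example below assumes a Wilson score interval at roughly 95% confidence (z = 1.96), and the function names and the "download" / "skip" / "continue" labels are hypothetical.

```python
from math import sqrt


def wilson_interval(changed, sampled, z=1.96):
    """Wilson score interval for the true fraction of changed pages."""
    p = changed / sampled
    denom = 1 + z * z / sampled
    center = (p + z * z / (2 * sampled)) / denom
    half = z * sqrt(p * (1 - p) / sampled + z * z / (4 * sampled * sampled)) / denom
    return center - half, center + half


def early_decision(changed, sampled, threshold, z=1.96):
    """Decide once the whole interval lies on one side of the threshold."""
    if sampled == 0:
        return "continue"
    low, high = wilson_interval(changed, sampled, z)
    if low > threshold:
        return "download"  # confidently above the threshold
    if high < threshold:
        return "skip"      # confidently below the threshold
    return "continue"      # interval straddles the threshold: keep sampling


print(early_decision(10, 10, threshold=0.5))  # download
print(early_decision(0, 10, threshold=0.5))   # skip
print(early_decision(5, 10, threshold=0.5))   # continue
```

Sites far from the threshold are settled after a few samples, while sites near it keep drawing samples until the interval clears the threshold or the sample budget runs out.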

Page 22:

In the Paper
  More details on
    Optimal sample size
    Adaptive policy
    The cases where resource is too limited for sampling

Page 23:

Experiments
  353,000 pages from 252 sites
    Mostly popular sites
      Yahoo, CNN, Microsoft, …
    ~ 1400 pages from each site
    Followed the links in the breadth-first manner
  Monthly change history for 6 months
    5 download cycles
  In experiments, 100,000 page downloads in each download cycle

Page 24:

Comparison of Policies

[Bar chart: ChangeRatio (y-axis, 0 to 0.8) for the policies RR (round robin), FRQ (frequency based), PRP (proportional), GRD (greedy), and ADP (adaptive)]

Page 25:

Optimal Sample Size
  Optimal sample size ~ 10 through 60
  $\sqrt{Nr}$ ~ 20

[Plot: ChangeRatio (y-axis, 0.4 to 0.8) vs. Sample Size (x-axis, 0 to 250)]
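The value of about 20 can be checked against the experimental setup, reading the rule of thumb as the square root of N times r (an interpretation of the garbled formula, not a quotation). The numbers below come from the Experiments slide: about 1400 pages per site, 100,000 downloads per cycle, and 353,000 pages maintained. The function name is hypothetical.

```python
from math import sqrt


def rule_of_thumb_sample_size(pages_per_site, downloads_per_cycle, pages_maintained):
    """sqrt(N * r), where r is the fraction of maintained pages we can download."""
    r = downloads_per_cycle / pages_maintained
    return sqrt(pages_per_site * r)


size = rule_of_thumb_sample_size(1400, 100_000, 353_000)
print(round(size))  # 20
```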

Page 26:

Comparison of Long-Term Performance
  Problem: We have only 5-download-cycle data
  Solution: Extrapolate the history

[Figure labels: "?" for the unobserved cycles, "Repeat"]
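One plausible reading of the "Repeat" label is that the observed cycles are replayed in order to fill the unobserved ones. The sketch below implements that reading only; the function name and the list-of-cycles representation are assumptions, and the actual extrapolation used in the experiments may differ.

```python
def extrapolate(observed_cycles, num_cycles):
    """Extend a short per-cycle history to num_cycles by repeating it in order."""
    if not observed_cycles:
        raise ValueError("need at least one observed cycle")
    return [observed_cycles[i % len(observed_cycles)] for i in range(num_cycles)]


history = ["c1", "c2", "c3", "c4", "c5"]  # 5 observed download cycles
print(extrapolate(history, 12))
# ['c1', 'c2', 'c3', 'c4', 'c5', 'c1', 'c2', 'c3', 'c4', 'c5', 'c1', 'c2']
```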

Page 27:

Frequency vs. Sampling

[Plot: ChangeRatio (y-axis, 0.5 to 0.9) vs. Download Cycle (x-axis, 0 to 400); curves: Frequency, Greedy]

Page 28:

Related Work
  Frequency-based policy
    Coffman et al., Journal of Scheduling 1998
    Cho et al., SIGMOD 2000
    Edwards et al., WWW 2001
  Source cooperation
    Olston et al., SIGMOD 2002

Page 29:

Conclusion
  Sampling-based policy
    Great short-term performance
    No change history required
  Frequency-based policy
    Potentially good long-term performance if the change frequency does not change
  Greedy is easy to implement and shows high performance

Page 30:

Future Work
  Combination of sampling and frequency based policies
    Switch to the frequency-based policy after a while
  Good partitioning for sampling?
    Site based? Directory based? Content based? Link-structure based?

Page 31:

Questions?