How to Crawl the Web Junghoo Cho Hector Garcia-Molina Stanford University.

How to Crawl the Web

Junghoo Cho, Hector Garcia-Molina

Stanford University

2

What is a Crawler?

[Figure: crawler loop over the web. Steps: init, get next url, get page, extract urls. Stores: initial urls, to visit urls, visited urls, web pages.]
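The loop in the diagram can be sketched roughly as follows. This is a minimal illustration, not the WebBase crawler: `FAKE_WEB`, `get_page`, `extract_urls`, and `crawl` are hypothetical names, and an in-memory dictionary stands in for the web so the sketch runs without network access.

```python
import re
from collections import deque

# Hypothetical in-memory "web" (assumption: stands in for real HTTP fetches).
FAKE_WEB = {
    "http://a.example/": '<a href="http://a.example/1">1</a> <a href="http://b.example/">b</a>',
    "http://a.example/1": '<a href="http://a.example/">home</a>',
    "http://b.example/": "no links here",
}

HREF = re.compile(r'href="([^"]+)"')


def get_page(url):
    """Fetch a page; returns None when the URL is unknown."""
    return FAKE_WEB.get(url)


def extract_urls(page):
    """Pull absolute link targets out of the page text."""
    return HREF.findall(page)


def crawl(initial_urls):
    to_visit = deque(initial_urls)        # "to visit urls", seeded with "initial urls"
    visited = set()                       # "visited urls"
    pages = {}                            # "web pages"
    while to_visit:
        url = to_visit.popleft()          # "get next url"
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)              # "get page"
        if page is None:
            continue
        pages[url] = page
        for link in extract_urls(page):   # "extract urls"
            if link not in visited:
                to_visit.append(link)
    return pages
```

Calling `crawl(["http://a.example/"])` collects all three pages of the toy web. Popping from the left of the queue gives breadth-first order; the ordering question is taken up later in the talk.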

3

Crawling at Stanford

WebBase Project

BackRub Search Engine, Page Rank

Google

New WebBase Crawler

4

Crawling Issues (1)

Load at visited web sites

– robots.txt files

– space out requests to a site

– limit number of requests to a site per day

– limit depth of crawl

– multicast
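A rough sketch of the first three points (robots.txt, spacing out requests, and a daily per-site limit), using Python's standard `urllib.robotparser`. The class `PolitenessPolicy`, the agent name `example-crawler`, and the sample robots.txt text are assumptions made for illustration; depth limits and multicast are not covered.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally instead of being downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_robots(text):
    """Build a robots.txt parser from text already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


class PolitenessPolicy:
    """Decides whether a site may be contacted right now."""

    def __init__(self, min_delay, max_per_day):
        self.min_delay = min_delay      # seconds between requests to one site
        self.max_per_day = max_per_day  # request cap per site per day
        self.last_request = {}          # host -> time of last request
        self.count_today = {}           # host -> requests made today

    def may_fetch(self, url, now, robots, agent="example-crawler"):
        host = urlsplit(url).netloc
        if not robots.can_fetch(agent, url):
            return False                # excluded by robots.txt
        if self.count_today.get(host, 0) >= self.max_per_day:
            return False                # daily limit reached
        last = self.last_request.get(host)
        if last is not None and now - last < self.min_delay:
            return False                # too soon after the previous request
        return True

    def record_fetch(self, url, now):
        host = urlsplit(url).netloc
        self.last_request[host] = now
        self.count_today[host] = self.count_today.get(host, 0) + 1

    def start_new_day(self):
        """Reset the per-day counters (the caller decides when a day starts)."""
        self.count_today.clear()
```

Time is passed in as `now` rather than read from the clock, which keeps the policy easy to test; a real crawler would pass `time.time()`.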

5

Crawling Issues (2)

Load at crawler

– parallelize

[Figure: the crawler loop (init, get next url, get page, extract urls) with its stores (initial urls, to visit urls, visited urls, web pages), next to a second copy of the loop marked "?".]
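The slide leaves open how parallel crawlers should share work. One common option, shown here only as an assumption and not as the approach taken in the talk, is to assign each URL to a crawler process by hashing its host name, so that a given site is always handled by the same process. `assign_worker` and `partition` are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def assign_worker(url, num_workers):
    """Map a URL to a worker index based on a stable hash of its host."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_workers


def partition(urls, num_workers):
    """Split a list of URLs into one to-visit queue per worker."""
    queues = [[] for _ in range(num_workers)]
    for url in urls:
        queues[assign_worker(url, num_workers)].append(url)
    return queues
```

Because all pages of a site land in one queue, per-site politeness limits can be enforced locally by that worker. A stable hash (rather than Python's built-in `hash`) keeps the assignment the same across runs.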

6

Crawling Issues (3)

Scope of crawl

– not enough space for “all” pages

– not enough time to visit “all” pages

Solution: Visit “important” pages

[Figure: visited pages; the label "microsoft" appears twice.]

7

Crawling Issues (4)

Incremental crawling

– how do we keep pages “fresh”?

– how do we avoid crawling from scratch?

8

Crawl-ordering Experiment

Domain: pages within Stanford

Goal: Crawl HOT pages!

– Experiment 1: the pages with highest PageRank (top 10%)

– Experiment 2: the pages related to “admission”

Experiment:

– How many HOT pages have we collected after crawling, say, 20% of the Stanford domain?
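The measurement can be illustrated on a toy link graph. The sketch below compares two of the orderings named in the talk, breadth-first and backlink count, and reports the share of HOT pages collected after a given share of the crawl. The graph `LINKS`, the HOT set, and all function names are made up for illustration; this is not the Stanford data, and PageRank ordering is omitted.

```python
from collections import deque

# Hypothetical toy web: page -> pages it links to.
LINKS = {
    "s": ["a", "b"],
    "a": ["c", "h"],
    "b": ["d", "h"],
    "c": ["e"],
    "d": ["e"],
}


def breadth_first(links, seed):
    """Visit pages in the order they are discovered."""
    order, seen, queue = [], {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def backlink_first(links, seed):
    """Visit next the frontier page with the most backlinks seen so far."""
    order, crawled = [], set()
    backlinks = {seed: 0}  # frontier page -> backlinks counted so far
    while backlinks:
        # max() keeps the first maximal key, so ties go to the page found first.
        url = max(backlinks, key=backlinks.get)
        del backlinks[url]
        crawled.add(url)
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in crawled:
                backlinks[nxt] = backlinks.get(nxt, 0) + 1
    return order


def hot_fraction(order, hot, crawled_fraction):
    """Share of HOT pages among the first crawled_fraction of the crawl."""
    n = round(len(order) * crawled_fraction)
    return len(set(order[:n]) & hot) / len(hot)
```

With `hot = {"h"}`, the backlink ordering reaches `h` as the fourth page while breadth-first reaches it fifth, so after 4 of 7 pages `hot_fraction` is 1.0 for the former and 0.0 for the latter. A toy graph like this only shows how the metric is computed; it says nothing about which ordering wins on real data.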

9

Experiment 1: Top 10% PageRank pages

[Figure: % crawled HOT pages (y-axis, 0% to 100%) vs. % crawled pages (x-axis, 0% to 100%). Ordering: PageRank, backlink, breadth, random.]

10

Experiment 2: Admission-related Pages

[Figure: % crawled HOT pages (y-axis, 0% to 100%) vs. % crawled pages (x-axis, 0% to 100%). Topic word: "admission". Ordering: PageRank, backlink, breadth, random; three curves are labeled "with topic bias".]

11

Web Evolution Experiment

How often does a web page change?

How long does it take for 50% of the web to change?

How do we model web changes?

12

Experimental Setup

February 17 to June 24, 1999

270 sites visited (with permission)

– identified 400 sites with highest “page rank”

– contacted administrators

720,000 pages collected

– 3,000 pages from each site daily

– start at root, visit breadth first (get new & old pages)

– ran only 9pm - 6am, 10 seconds between site requests

13

How Often Does a Page Change?

Example: 50 visits to page, 5 changes

average change interval = 50/5 = 10 days

Is this correct?

[Figure: timeline of page visits spaced 1 day apart, with page changes marked.]
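One reason the simple ratio can be off is that a daily visit only reveals whether the page changed since the previous visit, not how many times. The sketch below makes that concrete; the function names and the list of change times are invented for illustration.

```python
import math


def naive_change_interval(num_visits, num_detected_changes, visit_interval_days=1.0):
    """Observation time divided by the number of changes that were detected."""
    return num_visits * visit_interval_days / num_detected_changes


def detected_changes(change_times_days, num_visits):
    """Count changes as a daily visitor sees them.

    Several changes falling within the same day look like a single change.
    """
    days_with_change = {
        math.floor(t) for t in change_times_days if 0 <= t < num_visits
    }
    return len(days_with_change)


# Hypothetical true change times (in days) over a 50-day observation window.
TRUE_CHANGES = [0.1, 0.5, 0.9, 3.2, 3.3, 10.0, 20.5, 20.6, 30.1, 40.7]
```

Here the page really changed 10 times (true average interval 5 days), but a daily visitor detects only 6 changes and would estimate 50/6, about 8.3 days. The naive estimate therefore tends to overstate the interval for pages that change faster than they are visited.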

14

Average Change Interval

[Figure: y-axis: fraction of pages.]

15

Average Change Interval by Domain

[Figure: y-axis: fraction of pages.]

16

Time for a 50% Change

[Figure: fraction of unchanged pages (y-axis) vs. days (x-axis).]

17

Modeling Web Evolution

Poisson process with rate $\lambda$

$T$ is time to next event

$$f_T(t) = \lambda e^{-\lambda t} \quad (t > 0)$$
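Under this model the time to the next change is exponentially distributed with density λ·e^(−λt). A small sketch of the density, the probability that a page has changed within time t, and sampled change intervals; the function names and the seed are illustrative choices.

```python
import math
import random


def interval_density(t, lam):
    """Density of the time T to the next change for a Poisson process of rate lam."""
    return lam * math.exp(-lam * t)


def prob_change_within(t, lam):
    """P(T <= t): probability that the page has changed by time t."""
    return 1.0 - math.exp(-lam * t)


def sample_intervals(lam, n, seed=0):
    """Draw n change intervals from the exponential distribution."""
    rng = random.Random(seed)
    return [rng.expovariate(lam) for _ in range(n)]
```

For a page that changes every 10 days on average (λ = 0.1 per day), `prob_change_within(10, 0.1)` is 1 − e^(−1), roughly 0.63, and the mean of a large sample from `sample_intervals(0.1, n)` should be close to 10 days.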

18

Change Interval of Pages

[Figure: fraction of changes with given interval (y-axis) vs. interval in days (x-axis), for pages that change every 10 days on average, compared with the Poisson model.]

19

Change Metrics

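Freshness

– Freshness of element $e_i$ at time $t$ is

$$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$

[Figure: elements $e_i$ on the web and their copies $e_i$ in the database.]

– Freshness of the database $S$ at time $t$ is

$$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$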

20

Change Metrics

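Age

– Age of element $e_i$ at time $t$ is

$$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$

[Figure: elements $e_i$ on the web and their copies $e_i$ in the database.]

– Age of the database $S$ at time $t$ is

$$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$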
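The two metrics translate directly into code. In this sketch each element records when its source page was first modified after the local copy was taken; the `Element` class and its `modified_at` field are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    # Time at which the source page was first modified after our copy was
    # taken, or None if it has not been modified (hypothetical representation).
    modified_at: Optional[float]


def freshness(e, t):
    """1 if the local copy is up-to-date at time t, else 0."""
    return 1 if e.modified_at is None or e.modified_at > t else 0


def age(e, t):
    """0 if up-to-date at time t, else time elapsed since the modification."""
    return 0.0 if freshness(e, t) == 1 else t - e.modified_at


def db_freshness(db, t):
    """Average freshness over the N elements of the database."""
    return sum(freshness(e, t) for e in db) / len(db)


def db_age(db, t):
    """Average age over the N elements of the database."""
    return sum(age(e, t) for e in db) / len(db)
```

For a database of four elements with modification times `None`, 3, 8, and 12, at t = 10 two copies are stale, so freshness is 0.5 and age is (7 + 2) / 4 = 2.25.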

21

Change Metrics

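[Figure: $F(e_i)$ (between 0 and 1) and $A(e_i)$ (starting from 0) plotted against time, with update and refresh events marked.]

Time averages:

$$F(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; t)\, dt$$

$$F(S) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(S; t)\, dt$$

similar for age...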

22

Questions

How often should we revisit pages to keep 80% of pages up-to-date?

How should we revisit pages when they change at different frequencies?

23

Freshness vs. Revisit Frequency

$r = \lambda / f$ = average change frequency / average visit frequency
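The curve on this slide is not preserved in the transcript. As a sketch under two assumptions (pages change as a Poisson process, and every page is revisited at a fixed interval), the time-averaged freshness depends only on r, the ratio of change frequency to visit frequency: averaging e^(−λs) over one revisit interval gives (1 − e^(−r)) / r. The function names below are illustrative.

```python
import math


def avg_freshness(r):
    """Time-averaged freshness for ratio r = change frequency / visit frequency."""
    if r == 0:
        return 1.0  # pages never change, so copies are always fresh
    return -math.expm1(-r) / r  # (1 - e^-r) / r, computed stably for small r


def max_ratio_for_freshness(target, lo=1e-9, hi=100.0):
    """Largest r whose average freshness still meets the target (bisection).

    Relies on avg_freshness being decreasing in r.
    """
    for _ in range(100):
        mid = (lo + hi) / 2
        if avg_freshness(mid) > target:
            lo = mid
        else:
            hi = mid
    return lo
```

Under these assumptions, keeping 80% of pages up-to-date requires r of roughly 0.46, that is, visiting a page a little more than twice as often as it changes.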

24

Age vs. Revisit Frequency

$r = \lambda / f$ = average change frequency / average visit frequency

= Age / time to refresh all N elements
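A matching sketch for age, under the same assumptions as before (Poisson changes, fixed revisit interval). Averaging the expected age over one revisit interval and dividing by the length of that interval gives 1/2 − 1/r + (1 − e^(−r)) / r², where r is the ratio of change frequency to visit frequency. The function names are illustrative, and the slide's own curve is not available to compare against.

```python
import math


def normalized_age(r):
    """Time-averaged age divided by the revisit interval, for r = lam / f."""
    if r <= 0:
        return 0.0  # pages never change, so copies never age
    return 0.5 - 1.0 / r + (1.0 - math.exp(-r)) / (r * r)


def avg_age(lam, f):
    """Time-averaged age in time units, for change rate lam and visit rate f."""
    return normalized_age(lam / f) / f
```

For very large r (pages changing far more often than they are visited) the normalized age approaches 1/2: a copy goes stale almost immediately after each visit, so on average it is about half a revisit interval old.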

25

Trick Question

Two page database

e1 changes daily

e2 changes once a week

Can visit pages once a week

How should we visit pages?

– e1 e1 e1 e1 e1 e1 ...

– e2 e2 e2 e2 e2 e2 ...

– e1 e2 e1 e2 e1 e2 ... [uniform]

– e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]

– ?

[Figure: pages e1 and e2 on the web and their copies e1 and e2 in the database.]

26

Proportional Often Not Good!

Visit fast-changing e1: get 1/2 day of freshness

Visit slow-changing e2: get 1/2 week of freshness

Visiting e2 is a better deal!
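The same point can be checked numerically for the two-page database. The sketch assumes Poisson changes, fixed revisit intervals, and a total budget of one visit per week shared between e1 and e2; the budget reading and all names are assumptions, not taken from the slides.

```python
import math


def avg_freshness(lam, f):
    """Time-averaged freshness of a page with change rate lam, visited at rate f."""
    if f <= 0:
        return 0.0  # never visited: the copy is stale in the long run
    r = lam / f
    return (1.0 - math.exp(-r)) / r


def database_freshness(rates, freqs):
    """Average freshness over all pages."""
    return sum(avg_freshness(l, f) for l, f in zip(rates, freqs)) / len(rates)


RATES = [1.0, 1.0 / 7.0]   # e1 changes daily, e2 weekly (changes per day)
BUDGET = 1.0 / 7.0         # assumed total: one visit per week (visits per day)

uniform = [BUDGET / 2, BUDGET / 2]
proportional = [BUDGET * lam / sum(RATES) for lam in RATES]
only_e2 = [0.0, BUDGET]


def best_share_for_e1(steps=100):
    """Grid search over the share of the budget given to e1."""
    shares = [i / steps for i in range(steps + 1)]
    return max(
        shares,
        key=lambda s: database_freshness(RATES, [BUDGET * s, BUDGET * (1 - s)]),
    )
```

In this setting the proportional policy gives a freshness of about 0.125, the uniform policy about 0.25, and spending the whole budget on e2 about 0.32, so the grid search puts the entire budget on the slow-changing page.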

27

Selecting Optimal Refresh Frequency

• Analysis is complex

• Shape of curve is the same in all cases

• Holds for any distribution $g(\lambda)$

28

Optimal Refresh Frequency for Age

• Analysis is also complex

• Shape of curve is the same in all cases

• Holds for any distribution $g(\lambda)$

29

Comparing Policies

Policy        Freshness   Age
Proportional  0.12        400 days
Uniform       0.57        5.6 days
Optimal       0.62        4.3 days

Based on statistics from the experiment and a revisit frequency of once every month

30

Crawler Types

In-place vs. shadow

Steady vs. batch

[Figure: elements $e_i$ in the web database and in a shadow database; timeline with "crawler on" and "crawler off" periods.]
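The in-place versus shadow distinction can be sketched with two small collection classes. This assumes that a shadowing crawler writes into a separate copy that replaces the current collection wholesale when a crawl finishes; the class and method names are hypothetical.

```python
class InPlaceCollection:
    """Crawled pages become visible to users as soon as they are stored."""

    def __init__(self):
        self.current = {}

    def store(self, url, page):
        self.current[url] = page

    def end_of_crawl(self):
        pass  # nothing to do: updates were applied immediately


class ShadowCollection:
    """Crawled pages go to a shadow copy and become visible only after a swap."""

    def __init__(self):
        self.current = {}   # what users see
        self.shadow = {}    # what the crawler is building

    def store(self, url, page):
        self.shadow[url] = page

    def end_of_crawl(self):
        # Replace the current collection with the freshly crawled one.
        self.current = self.shadow
        self.shadow = {}
```

With `ShadowCollection`, a page fetched early in a crawl is not served until `end_of_crawl()` runs, so it may already be out of date when it first becomes visible.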

31

Comparison: Batch vs. Steady

[Figure: batch-mode in-place crawler vs. steady in-place crawler; periods with the crawler running are marked.]

32

Shadowing Steady Crawler

[Figure: crawler’s collection and current collection, compared with the case without shadowing.]

33

Shadowing Batch Crawler

[Figure: crawler’s collection and current collection, compared with the case without shadowing.]

34

Experimental Data Freshness

           Steady   Batch
In-Place   0.88     0.88
Shadowing  0.77     0.86

• Pages change on average every 4 months

• Batch crawler works one week out of 4

[Unlabeled annotations on the slide: 1, 2, 0.63, 0.50]
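As a sanity check on the steady in-place figure, the sketch below assumes Poisson changes with an average interval of 4 months and a refresh cycle of one month (treating the 4-week cycle as a month). The slides do not say how the table was computed, so the agreement with 0.88 is suggestive only. The function name is hypothetical.

```python
import math


def steady_inplace_freshness(change_interval, refresh_cycle):
    """Time-averaged freshness when each page is refreshed once per cycle.

    Both arguments use the same time unit (months here).
    """
    r = refresh_cycle / change_interval  # change frequency / visit frequency
    return (1.0 - math.exp(-r)) / r


estimate = steady_inplace_freshness(change_interval=4.0, refresh_cycle=1.0)
```

Here `estimate` is about 0.885, which rounds to the 0.88 reported for the steady in-place crawler. The shadowing entries depend on when the collections are swapped and are not reproduced here.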

35

Building an Incremental Crawler

Steady, in-place crawling with variable visit frequencies (but with care…) improves freshness!

Also need:

– Page change frequency estimator

– Page replacement policy

– Page ranking policy (what are “important” pages?)

See papers....
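As an illustration of what a page change frequency estimator might look like, the sketch below contrasts the naive ratio with a correction that follows from the Poisson model: if a visit every I days detects a change with probability 1 − e^(−λI), then λ can be estimated as −ln(1 − X/n) / I from X detections in n visits. This is a simple sketch under that assumption and not necessarily the estimator described in the authors' papers; the function names are hypothetical.

```python
import math


def naive_rate(num_visits, num_changes_detected, visit_interval):
    """Detected changes divided by total observation time."""
    return num_changes_detected / (num_visits * visit_interval)


def poisson_corrected_rate(num_visits, num_changes_detected, visit_interval):
    """Change rate implied by the fraction of visits that detected a change."""
    if num_changes_detected >= num_visits:
        raise ValueError("every visit detected a change; the rate is unbounded")
    p = num_changes_detected / num_visits
    return -math.log(1.0 - p) / visit_interval
```

For the earlier example of 50 daily visits with 5 detected changes, the naive rate is 0.1 changes per day (a 10-day interval), while the corrected rate is about 0.105 per day (roughly a 9.5-day interval). The gap widens as pages change closer to the visit frequency.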

36

Summary

Selecting “important” pages:

Intuitive policy performs well

Maintaining the collection fresh:

Intuitive policy does not perform well

37

The End

Thank you for your attention...

For more information, visit http://www-db.stanford.edu

Follow the “Searchable Publications” link, and search for author = “Cho”
