How to Crawl the Web Junghoo Cho Hector Garcia-Molina Stanford University.
-
date post
22-Dec-2015 -
Category
Documents
-
view
220 -
download
0
Transcript of How to Crawl the Web Junghoo Cho Hector Garcia-Molina Stanford University.
2
What is a Crawler?
web
init
get next url
get page
extract urls
initial urls
to visit urls
visited urls
web pages
4
Crawling Issues (1)
Load at visited web sites– robots.txt files
– space out requests to a site
– limit number of requests to a site per day
– limit depth of crawl
– multicast
5
Crawling Issues (2)
Load at crawler– parallelize
init
get next url
get page
extract urls
initial urls
to visit urls
visited urls
web pages
init
get next url
get page
extract urls
?
6
Crawling Issues (3)
Scope of crawl– not enough space for “all” pages
– not enough time to visit “all” pages
Solution: Visit “important” pages
visitedpagesmicr
osoft
microsoft
7
Crawling Issues (4)
Incremental crawling– how do we keep pages “fresh”?
– how do we avoid crawling from scratch?
8
Crawl-ordering Experiment
Domain: pages within Stanford Goal: Crawl HOT pages!
– Experiment 1: the pages with highest PageRank (top 10%)
– Experiment 2: the pages related to “admission”
Experiment:– How many HOT pages did we collect when we crawl, say,
20% of the Stanford domain?
9
Experiment 1: Top 10% PageRank pages
0%
20%
40%
60%
80%
100%
0% 20% 40% 60% 80% 100%
PageRank
backlink
breadth
random
Ordering:
% crawled pages
% crawled HOT pages
10
Experiment 2: Admission-related Pages
0%
20%
40%
60%
80%
100%
0% 20% 40% 60% 80% 100%
PageRank
backlink
breadth
random
Ordering :
Topic word : "admission"
% crawled pages
% crawled HOT pages
with topic bias
with topic bias
with topic bias
11
Web Evolution Experiment
How often does a web page change? How long does it take for 50% of the web to change? How do we model web changes?
12
Experimental Setup
February 17 to June 24, 1999 270 sites visited (with permission)
– identified 400 sites with highest “page rank”
– contacted administrators
720,000 pages collected– 3,000 pages from each site daily
– start at root, visit breadth first (get new & old pages)
– ran only 9pm - 6am, 10 seconds between site requests
13
How Often Does a Page Change?
Example: 50 visits to page, 5 changes average change interval = 50/5 = 10 days
Is this correct?
1 day
changes
page visited
18
Change Interval of Pages
for pages thatchange every
10 days on average
interval in days
frac
tion
of c
hang
esw
ith g
iven
inte
rval
Poisson model
19
Change Metrics
Freshness– Freshness of element ei at time t is
F( ei ; t ) = 1 if ei is up-to-date at time t 0 otherwise
eiei
......
web database
– Freshness of the database S at time t is
F( S ; t ) = F( ei ; t )N
1 N
i=1
20
Change Metrics
Age– Age of element ei at time t is
A( ei ; t ) = 0 if ei is up-to-date at time t t - (modification ei time) otherwise
eiei
......
web database
– Age of the database S at time t is
A( S ; t ) = A( ei ; t )N
1 N
i=1
21
Change Metrics
F(ei)
A(ei)
0
0
1
time
time
update refresh
F( S ) = lim F(S ; t ) dtt1 t
0t
F( ei ) = lim F(ei ; t ) dtt1 t
0t
Time averages:
similar for age...
22
Questions
How often should we revisit pages to 80% of pages up-to-date?
How should we revisit pages when they change at different frequencies?
24
Age vs. Revisit Frequency
r = / f = average change frequency / average visit frequency
= Age / time to refresh all N elements
25
Trick Question
Two page database
e1 changes daily
e2 changes once a week
Can visit pages once a week How should we visit pages?
– e1 e1 e1 e1 e1 e1 ...
– e2 e2 e2 e2 e2 e2 ...
– e1 e2 e1 e2 e1 e2 ... [uniform]
– e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]
– ?
e1
e2
e1
e2
webdatabase
26
Proportional Often Not Good!
Visit fast changing e1 get 1/2 day of freshness
Visit slow changing e2 get 1/2 week of freshness
Visiting e2 is a better deal!
27
Selecting Optimal Refresh Frequency
• Analysis is complex• Shape of curve is the same in all cases• Holds for any distribution g( )
28
Optimal Refresh Frequency for Age
• Analysis is also complex• Shape of curve is the same in all cases• Holds for any distribution g( )
29
Comparing Policies
Freshness AgeProportional 0.12 400 daysUniform 0.57 5.6 daysOptimal 0.62 4.3 days
Based on Statistics from experimentand revisit frequency of every month
30
Crawler Types
In-place vs. shadow
Steady vs. batch
eiei
......web database
ei
...
shadowdatabase
time
crawler on
crawler off
34
Experimental Data Freshness
Steady BatchIn-Place 0.88 0.88Shadowing 0.77 0.86
• Pages change on average every 4 months• Batch crawler works one week out of 4
1
2
0.63
0.50
35
Building an Incremental Crawler
Variable visit frequencies
(but with care…) Steady In-place
improves freshness!
Also need:– Page change frequency estimator
– Page replacement policy
– Page ranking policy (what are ‘important” pages?)
See papers....
36
Summary
Selecting “important pages:”
Intuitive policy performs well
Maintaining the collection fresh:
Intuitive policy does not perform well