![Page 1: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/1.jpg)
BUbiNG: Massive Crawling for the Masses
Paolo Boldi, Andrea Marino, Massimo Santini, Sebastiano Vigna
Dipartimento di Informatica, Università degli Studi di Milano, Italy
![Page 2: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/2.jpg)
Once upon a time: UbiCrawler
UbiCrawler was a scalable, fault-tolerant and fully distributed web crawler (Software: Practice & Experience, 34(8):711-726, 2004)
LAW (Laboratory for Web Algorithmics) used it many times in the mid-2000s to download portions of the web (.it, .uk, .eu, Arabic countries...)
Based on this experience, LAW decided to write a new crawler 10 years later: BUbiNG
![Page 3: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/3.jpg)
Why a new crawler?
OPEN SOURCE!
Not so many open-source crawlers:
Heritrix (Internet Archive; used for ClueWeb12)
Nutch (used for ClueWeb09)
Not suitable for collecting really big datasets
Not so easily extensible
Distributed? (Heritrix is not distributed; Nutch uses Hadoop)
![Page 4: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/4.jpg)
Challenges
Pushing hardware to the limit: using massive memory and multiple cores efficiently
Filling bandwidth in spite of politeness (both at host and IP level) => coherent time frame
Producing big datasets in spite of limited hardware
Making crawling and analysis consistent
Completely configurable
Extensible with little effort
Integrated with spam detection (Hungarian Academy of Sciences)
![Page 5: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/5.jpg)
High Parallelism
We use a massive number of fetching threads (around 5000)
Every thread handles a request and is I/O bound
Parallel threads parse and store pages
Slow data structures are sandwiched between lock-free queues
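A minimal sketch of the pattern above: many I/O-bound fetching threads feed a parsing thread through queues. It assumes Python's standard `queue.Queue` as a stand-in for the lock-free queues (it uses locks internally, so this illustrates the structure only, not the performance). The names `fetch` and `run_pipeline` are hypothetical, and no real network I/O is performed.

```python
import queue
import threading


def fetch(url):
    # Stand-in for a blocking HTTP request (hypothetical; no network access).
    return f"<html>content of {url}</html>"


def run_pipeline(urls, n_fetchers=4):
    todo = queue.Queue()      # URLs waiting to be fetched
    results = queue.Queue()   # fetched pages waiting to be parsed
    parsed = []               # written only by the single parsing thread

    def fetcher():
        while True:
            url = todo.get()
            if url is None:   # sentinel: no more work for this thread
                break
            results.put((url, fetch(url)))

    def parser():
        while True:
            item = results.get()
            if item is None:  # sentinel: all fetchers have finished
                break
            url, page = item
            parsed.append((url, len(page)))

    fetchers = [threading.Thread(target=fetcher) for _ in range(n_fetchers)]
    parser_thread = threading.Thread(target=parser)
    for t in fetchers:
        t.start()
    parser_thread.start()

    for url in urls:
        todo.put(url)
    for _ in fetchers:
        todo.put(None)        # one sentinel per fetching thread
    for t in fetchers:
        t.join()
    results.put(None)         # fetchers are done, so the parser can stop
    parser_thread.join()
    return parsed
```

The fetchers never touch the parser's data directly; the queues are the only shared state, which is the point of sandwiching slower components between them.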
![Page 6: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/6.jpg)
Fully Distributed
We use JGroups to set up a view on a set of agents
We use JAI4J, a thin layer over JGroups that handles job assignment
Hosts are assigned to agents using consistent hashing
URLs for which an agent is not responsible are quickly delivered to the right agent
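Consistent hashing of hosts to agents can be sketched as follows. This is a generic hash ring, not the JAI4J implementation; the class name `ConsistentHashRing`, the use of SHA-1, and the `replicas` parameter are assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key):
    # 64-bit position on the ring derived from SHA-1 (an arbitrary choice).
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, agents, replicas=100):
        self._replicas = replicas
        self._points = []   # sorted ring positions
        self._owners = []   # _owners[i] is the agent owning _points[i]
        for agent in agents:
            self.add(agent)

    def add(self, agent):
        # Each agent is placed at several pseudo-random positions on the ring.
        for i in range(self._replicas):
            point = _hash(f"{agent}#{i}")
            idx = bisect.bisect(self._points, point)
            self._points.insert(idx, point)
            self._owners.insert(idx, agent)

    def remove(self, agent):
        kept = [(p, a) for p, a in zip(self._points, self._owners) if a != agent]
        self._points = [p for p, _ in kept]
        self._owners = [a for _, a in kept]

    def agent_for(self, host):
        if not self._points:
            raise ValueError("no agents in the ring")
        # The first ring position after the host's hash owns the host.
        idx = bisect.bisect(self._points, _hash(host)) % len(self._points)
        return self._owners[idx]
```

When an agent leaves, only the hosts it owned are reassigned; every other host keeps its agent, so an agent that discovers a URL for a host it does not own can compute the responsible agent locally and forward the URL.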
![Page 7: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/7.jpg)
Near-Duplicates
We (presently) detect near-duplicates using a fingerprint of a stripped page (stored in a Bloom filter)
The stripping eliminates almost all tag attributes, as well as numbers in the text
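A rough sketch of the idea, under several assumptions: the stripping below removes all tag attributes (the slides say "almost all"), removes every digit, and also collapses whitespace (an addition of this sketch); the fingerprint is SHA-1; and `BloomFilter`, `strip_page`, and `is_near_duplicate` are hypothetical names, not BUbiNG's API.

```python
import hashlib
import re


def strip_page(html):
    # Keep tag names but drop attributes: <a href="x"> becomes <a>.
    html = re.sub(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>", r"<\1\2>", html)
    # Drop all digits (this also affects digits inside tag names such as h1).
    html = re.sub(r"\d+", "", html)
    # Collapse whitespace so the gaps left by removed numbers do not matter.
    return re.sub(r"\s+", " ", html).strip()


class BloomFilter:
    def __init__(self, n_bits=1 << 20, n_hashes=5):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, data):
        # Double hashing: derive n_hashes bit positions from one digest.
        digest = hashlib.sha256(data).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, data):
        """Insert data; return True if it was (possibly) already present."""
        seen = True
        for pos in self._positions(data):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def is_near_duplicate(html, seen):
    fingerprint = hashlib.sha1(strip_page(html).encode("utf-8")).digest()
    return seen.add(fingerprint)
```

Two pages that differ only in attribute values or in numbers (session IDs, counters, dates) strip to the same text and therefore share a fingerprint. A Bloom filter can report false positives, so a small fraction of distinct pages would be wrongly flagged as duplicates.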
![Page 8: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/8.jpg)
[Figure: BUbiNG architecture diagram. Labeled components: Frontier, Sieve, Distributor, DNSThread, Workbench (in memory), WorkbenchVirtualizer (disk queues), WorkbenchThread, TODOList, FetchingThread, Results, ParsingThread, Store, Guava cache, other agents. Edge labels: URL; URL in new host; URL in known host; host → visit state; IP → workbench entry; workbench entry; visit state (acquire); visit state (put back); page, headers, etc.; URLs found; parsed. Markers: (1), (2), (3).]
![Page 9: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/9.jpg)
Highlight: the workbench
A double priority queue of visit states (the state of visit of a host)
Organized by next-fetch time per host and per IP
Works like a delay queue: wait until a host is ready to be visited
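One way to picture a double priority queue with delay-queue behavior is the sketch below. It is an illustration under assumptions, not BUbiNG's data structure: hosts are grouped by IP, each IP keeps a heap of its hosts ordered by next-fetch time, and a top-level heap orders IPs by the time at which both the IP and its earliest host are ready. The class name, method names, and the lazy removal of stale heap items are all choices made for this sketch.

```python
import heapq


class Workbench:
    def __init__(self, host_delay, ip_delay):
        self.host_delay = host_delay
        self.ip_delay = ip_delay
        self._ip_next = {}   # ip -> earliest time the IP may be contacted again
        self._hosts = {}     # ip -> heap of (next_fetch, host)
        self._busy = set()   # IPs with a fetch currently in progress
        self._ips = []       # heap of (ready_time, ip); may contain stale items

    def _ready_time(self, ip):
        # An IP entry is ready when the IP and its earliest host both are.
        return max(self._ip_next.get(ip, 0.0), self._hosts[ip][0][0])

    def _schedule(self, ip):
        if ip not in self._busy and self._hosts.get(ip):
            heapq.heappush(self._ips, (self._ready_time(ip), ip))

    def add(self, host, ip, next_fetch=0.0):
        heapq.heappush(self._hosts.setdefault(ip, []), (next_fetch, host))
        self._schedule(ip)

    def acquire(self, now):
        """Return a (host, ip) pair that may be fetched at time now, or None."""
        while self._ips:
            ready, ip = self._ips[0]
            hosts = self._hosts.get(ip)
            if ip in self._busy or not hosts or ready != self._ready_time(ip):
                heapq.heappop(self._ips)   # stale item: discard and retry
                continue
            if ready > now:
                return None                # nothing is ready yet: caller waits
            heapq.heappop(self._ips)
            _, host = heapq.heappop(hosts)
            self._busy.add(ip)
            return host, ip
        return None

    def put_back(self, host, ip, now):
        """Return a host after a fetch and apply both politeness delays."""
        self._busy.discard(ip)
        self._ip_next[ip] = now + self.ip_delay
        heapq.heappush(self._hosts[ip], (now + self.host_delay, host))
        self._schedule(ip)
```

For example, with `host_delay=5.0` and `ip_delay=2.0`, two hosts on the same IP are never fetched at the same time, and the second one becomes available only 2 seconds after the first is put back.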
![Page 10: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/10.jpg)
Highlight: the workbench virtualizer
Visit states keep track of the URLs that are to be visited for a given host (those that have already been output by the sieve)
How to reconcile this with constant memory?
By keeping only the tip of each queue in memory and using on-disk refill queues for the rest...
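The "tip in memory, rest on disk" idea can be sketched for a single host queue as below. This is an assumption-laden illustration: `VirtualizedQueue`, `tip_size`, and the one-file-per-queue layout are hypothetical, and the refill here happens synchronously when the tip runs empty, whereas a real crawler would likely refill in the background.

```python
import collections
import os
import tempfile


class VirtualizedQueue:
    """FIFO queue of URLs that keeps at most tip_size entries in memory."""

    def __init__(self, path, tip_size=4):
        self.tip_size = tip_size
        self._tip = collections.deque()   # in-memory head of the queue
        self._path = path                 # append-only overflow file
        self._read_offset = 0             # where the next refill starts reading
        self._on_disk = 0                 # URLs written but not yet read back
        open(path, "wb").close()

    def enqueue(self, url):
        # Once anything is on disk, new URLs must follow it to preserve order.
        if self._on_disk == 0 and len(self._tip) < self.tip_size:
            self._tip.append(url)
        else:
            with open(self._path, "ab") as f:
                f.write(url.encode("utf-8") + b"\n")
            self._on_disk += 1

    def _refill(self):
        with open(self._path, "rb") as f:
            f.seek(self._read_offset)
            while self._on_disk and len(self._tip) < self.tip_size:
                self._tip.append(f.readline().rstrip(b"\n").decode("utf-8"))
                self._on_disk -= 1
            self._read_offset = f.tell()

    def dequeue(self):
        if not self._tip:
            self._refill()
        return self._tip.popleft() if self._tip else None

    def in_memory(self):
        return len(self._tip)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        q = VirtualizedQueue(os.path.join(d, "host.queue"), tip_size=2)
        for i in range(5):
            q.enqueue(f"http://example.com/{i}")
        print(q.in_memory(), [q.dequeue() for _ in range(5)])
```

Memory use per host is bounded by `tip_size` regardless of how many URLs the sieve emits for that host, which is what makes the total memory footprint independent of the crawl size.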
![Page 11: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/11.jpg)
Behavior on a slow connection
[Plot: pages/s (0 to 14000) versus number of threads (0 to 300).]
![Page 12: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/12.jpg)
Front size
[Plot: average front size in IPs (0 to 70000) versus IP delay in ms (0 to 4000), for 125, 500, and 2000 threads.]
![Page 13: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/13.jpg)
Average speed
[Plot: average speed in requests/s (0 to 14000) versus IP delay in ms (0 to 4000), for 125, 500, and 2000 threads.]
![Page 14: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/14.jpg)
Comparisons

| Crawler | Machines | Speed/agent (MB/s) |
|---|---|---|
| Nutch (ClueWeb09) | 100 | 0.1 |
| Heritrix (ClueWeb12) | 5 | 4 |
| Heritrix (in vitro) | 1 | 4.5 |
| IRLBot | 1 | 40 |
| BUbiNG (in vivo) | 1 | 154 |
| BUbiNG (in vitro) | 4 | 160 |
![Page 15: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/15.jpg)
Fast?
In vitro: >9000 pages/s average, peaks at 18000 pages/s
In vivo (@iStella): >3500 pages/s average (single crawler), steady download speed of 1.2 Gb/s
ClueWeb09 (Nutch): 4.3 pages/s
ClueWeb12 (Heritrix): 60 pages/s
IRLBot: 1790 pages/s (unverifiable)
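As a back-of-the-envelope check of the in vivo figures, the snippet below converts the download speed to bytes per page and compares throughputs. It assumes that "Gb/s" means decimal gigabits per second and treats ">3500 pages/s" as exactly 3500, so the per-page size is an upper estimate that also includes any protocol overhead counted in the bandwidth. The function names are hypothetical.

```python
def avg_bytes_per_page(bandwidth_gbit_per_s, pages_per_s):
    # Gigabits/s -> bytes/s, then divide by the page rate.
    bytes_per_s = bandwidth_gbit_per_s * 1e9 / 8
    return bytes_per_s / pages_per_s


def speedup(pages_per_s, baseline_pages_per_s):
    return pages_per_s / baseline_pages_per_s


if __name__ == "__main__":
    # 1.2 Gb/s is 150 MB/s; at 3500 pages/s that is about 43 kB per page.
    print(round(avg_bytes_per_page(1.2, 3500)))
    # Throughput ratios against the ClueWeb crawls quoted above.
    print(round(speedup(3500, 60), 1), round(speedup(3500, 4.3)))
```

Under these assumptions, 1.2 Gb/s corresponds to 150 MB/s, which is in line with the 154 MB/s per agent reported for BUbiNG in vivo.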
![Page 16: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/16.jpg)
We broke down almost everything!
Hardware broke down: €40,000 server replaced for no charge with a €60,000 server
OS broke down: Linux kernel bug 862758
JVM broke down: try opening 5000 random-access files
Dozens of bug reports and improvements to a number of open-source projects, including the Jericho HTML parser, Apache Software Foundation’s HTTP Client, etc.
Vital importance in open-source development
![Page 17: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/17.jpg)
Future work
Download at http://law.di.unimi.it/
Using other prioritizations for URLs
But first of all: making crawling technology more and more accessible to the masses
![Page 18: Massive Crawling for the Masses - Persone - Dipartimento ...pages.di.unipi.it/marino/bubing.pdf · BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini,](https://reader030.fdocuments.us/reader030/viewer/2022021809/5c6631ff09d3f2d0218bf364/html5/thumbnails/18.jpg)
Thanks!