CS 349: WebBase 1
What the WebBase can and can’t do
Summary
• What is in the WebBase
• Performance Considerations
• Reliability Considerations
• Other sources of data
WebBase Repository
• 25 million web pages
• 150 GB (50 GB compressed)
• spread across roughly 30 disks
What kind of web pages?
• Everything you can imagine
• Infinite Web Pages (truncated at 100K)
• 404 Errors
• Very little correct HTML.
Duplicate Web Pages
• Duplicate Sites
  – Crawl root pages first
  – Find duplicates and assume the same for the remainder of the crawl
• Duplicate hierarchies off of the main page
  – Mirror sites
• Duplicate pages
• Near-duplicate pages
Shiva’s Test Results
• 36% duplicates
• 48% near duplicates
• Largest sets of duplicates:
  – TUCOWS (100)
  – MS IE Server Manuals (90)
  – Unix Help Pages (75)
  – RedHat Linux Manual (55)
  – Java API Doc (50)
Order of Web Pages
• First half million are root pages
• After that, pages in PageRank order
• Roughly by importance
Structure of Data
• magic number (4 bytes), packet length (4 bytes), packet (~2K bytes)
• packet is compressed
• packet contains: docID, URL, HTTP Headers, HTML data
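A rough Python sketch of a reader for this layout is below. The 4-byte magic number, the 4-byte packet length, and the compressed packet holding docID, URL, HTTP headers, and HTML come from the slide; the big-endian byte order, the use of zlib, and the exact field order inside a packet are assumptions made for illustration.

    import struct
    import zlib

    def read_packets(path):
        """Iterate over (docID, URL, HTTP headers, HTML) records in one repository file."""
        with open(path, "rb") as f:
            while True:
                header = f.read(8)
                if len(header) < 8:
                    break                                          # end of file
                magic, length = struct.unpack(">II", header)       # magic could be validated here
                packet = zlib.decompress(f.read(length))
                # Assumed packet layout: docID line, URL line, HTTP headers,
                # a blank line, then the HTML data.
                head, _, html = packet.partition(b"\r\n\r\n")
                doc_id, url, *http_headers = head.split(b"\r\n")
                yield int(doc_id), url.decode(), http_headers, html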
Performance Issues: An Example
• One disk seek per document:
  – 10 ms seek latency + 10 ms rotational latency
  – x ms read latency + x ms OS overhead
  – y ms processing
• Realistically 50 ms per document = 20 docs per second
• 25 million / 20 docs per second = 1,250,000 seconds ≈ 2 weeks (too slow; see the sketch below)
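As a quick check of the arithmetic on this slide:

    DOCS = 25_000_000

    # One disk seek per document at roughly 50 ms each gives 20 docs per second.
    naive_seconds = DOCS / 20            # 1,250,000 seconds
    print(naive_seconds / 86_400)        # ~14.5 days, i.e. about two weeks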
How fast does it have to be?
• Answer: 4 ms per doc = 250 docs per second
• 25 million / 250 = 100,000 seconds ≈ 1.2 days
• Reading + uncompressing + parsing = ~3 to 4 ms per document
• So there is not much room left for processing
How can you do something complicated?
• Run really fast processing to generate smaller intermediate results.
• Run complex processing over the smaller results.
• Example: Duplicate Detection
  – Compute shingles from all documents
  – Find pairs of documents that share shingles (see the sketch below)
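A minimal sketch of that two-phase idea, assuming word-level shingles of four words and an MD5-based fingerprint (the window size and hash function are illustrative choices, not specified in the lecture):

    from collections import defaultdict
    from itertools import combinations
    import hashlib

    SHINGLE_WORDS = 4   # assumed window size

    def shingles(text, k=SHINGLE_WORDS):
        """Fast pass: hash every k-word window of a document to a small fingerprint."""
        words = text.lower().split()
        return {
            hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()[:16]
            for i in range(max(len(words) - k + 1, 1))
        }

    def candidate_pairs(docs):
        """Invert shingle -> documents, then emit pairs that share at least one shingle.
        The expensive near-duplicate comparison only runs on these candidate pairs."""
        by_shingle = defaultdict(set)
        for doc_id, text in docs.items():
            for s in shingles(text):
                by_shingle[s].add(doc_id)
        pairs = set()
        for ids in by_shingle.values():
            pairs.update(combinations(sorted(ids), 2))
        return pairs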
Bulk Processing of Large Result Sets
• Example: Resolving Anchors
• Resolve URLs and save from→to pairs in ASCII
• Compute a 64-bit checksum of each “To” URL
• Bulk merge against a checksum→docID table (sketched below)
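A sketch of that pipeline, assuming the checksum→docID table is already sorted by checksum; the use of a truncated SHA-1 as the 64-bit checksum and the in-memory representation are assumptions:

    import hashlib

    def checksum64(url):
        """64-bit checksum of a URL (truncated SHA-1 is an assumed choice)."""
        return int.from_bytes(hashlib.sha1(url.encode()).digest()[:8], "big")

    def resolve_anchors(from_to_pairs, checksum_to_docid):
        """Bulk merge: sort anchor targets by checksum, then sweep both sorted
        sequences once instead of doing one random lookup per anchor.

        from_to_pairs     : iterable of (from_docID, to_URL)
        checksum_to_docid : list of (checksum, docID) sorted by checksum
        """
        anchors = sorted((checksum64(url), src) for src, url in from_to_pairs)
        edges, i = [], 0
        for csum, src in anchors:
            while i < len(checksum_to_docid) and checksum_to_docid[i][0] < csum:
                i += 1
            if i < len(checksum_to_docid) and checksum_to_docid[i][0] == csum:
                edges.append((src, checksum_to_docid[i][1]))   # from docID -> to docID
        return edges

Sorting first is the point: millions of random per-anchor lookups, each costing a disk seek, become one sequential sweep over each table.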
Reliability - Potential Sources of Problems
• Source code bugs
• Hardware failure
• OS failure
• Out of resources
Software Engineering Guidelines
• Number of bugs seen ~ log(size of dataset)
• Not just your bugs
  – OS bugs
  – Disk OS bugs
• Generate incremental results
Other Available Data
• Link Graph of the Web
• List of PageRanks
• List of URLs
Link Graph of the Web
• From DocID : To DocID
• Try red bars on Google to find backlinks
• Interesting Information
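As an illustration of working with this edge list, the sketch below inverts it into a backlink index; the one-pair-per-line ASCII format is an assumption, and the actual WebBase encoding may differ.

    from collections import defaultdict

    def backlinks(edge_file):
        """Invert the link graph: for each docID, the docIDs that point to it."""
        incoming = defaultdict(list)
        with open(edge_file) as f:
            for line in f:
                src, dst = map(int, line.split())   # "from_docID to_docID" per line (assumed)
                incoming[dst].append(src)
        return incoming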
What is PageRank?
• Measure of “importance”
• You are important if important things point to you
• Random surfer model
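For concreteness, a small power-iteration sketch of the random-surfer model over (from docID, to docID) edges; the damping factor 0.85 and the fixed 20 iterations are conventional defaults, not values from the lecture.

    def pagerank(edges, n_docs, damping=0.85, iterations=20):
        """PageRank: a page's score is the chance a random surfer is on it, where the
        surfer follows a random outlink with probability `damping` and otherwise
        jumps to a random page."""
        out_degree = [0] * n_docs
        for src, _ in edges:
            out_degree[src] += 1
        rank = [1.0 / n_docs] * n_docs
        for _ in range(iterations):
            new = [(1.0 - damping) / n_docs] * n_docs
            for src, dst in edges:
                new[dst] += damping * rank[src] / out_degree[src]
            # Pages with no outlinks spread their rank uniformly.
            dangling = sum(rank[i] for i in range(n_docs) if out_degree[i] == 0)
            rank = [r + damping * dangling / n_docs for r in new]
        return rank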
Uncrawled URLs
• Image links
• MailTo links
• CGI links
• Plain uncrawled HTML links
Summary
• WebBase has lots of web pages
  – very heterogeneous and weird
• Performance Considerations
  – code should be very, very fast
  – use bulk processing
• Reliability Considerations
  – write out intermediate results
• Auxiliary data