CS 349: WebBase 1
What the WebBase can and can’t do
Summary
• What is in the WebBase
• Performance Considerations
• Reliability Considerations
• Other sources of data
WebBase Repository
• 25 million web pages
• 150 GB (50 GB compressed)
• spread across roughly 30 disks
What kind of web pages?
• Everything you can imagine
• Infinite Web Pages (truncated at 100K)
• 404 Errors
• Very little correct HTML.
Duplicate Web Pages
• Duplicate Sites
  – Crawl root pages first
  – Find duplicates and assume the same for the remainder of the crawl
• Duplicate hierarchies off of the main page
  – Mirror sites
• Duplicate pages
• Near-duplicate pages
Shiva’s Test Results
• 36% duplicates
• 48% near duplicates
• Largest sets of duplicates:
  – TUCOWS (100)
  – MS IE Server Manuals (90)
  – Unix Help Pages (75)
  – RedHat Linux Manual (55)
  – Java API Doc (50)
Order of Web Pages
• First half million are root pages
• After that, pages in PageRank order
• Roughly by importance
Structure of Data
• magic number (4 bytes), packet length (4 bytes), packet (~2K bytes)
• packet is compressed
• packet contains: docID, URL, HTTP Headers, HTML data
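A rough Python sketch of a reader for this layout is below. The 4-byte magic number, the 4-byte packet length, and the compressed packet holding docID, URL, HTTP headers, and HTML come from the slide; the big-endian byte order, the use of zlib, and the exact field order inside a packet are assumptions made for illustration.

    import struct
    import zlib

    def read_packets(path):
        """Iterate over (docID, URL, HTTP headers, HTML) records in one repository file."""
        with open(path, "rb") as f:
            while True:
                header = f.read(8)
                if len(header) < 8:
                    break                                          # end of file
                magic, length = struct.unpack(">II", header)       # magic could be validated here
                packet = zlib.decompress(f.read(length))
                # Assumed packet layout: docID line, URL line, HTTP headers,
                # a blank line, then the HTML data.
                head, _, html = packet.partition(b"\r\n\r\n")
                doc_id, url, *http_headers = head.split(b"\r\n")
                yield int(doc_id), url.decode(), http_headers, html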
Performance Issues: An Example
• One disk seek per document:
  – 10 ms seek latency + 10 ms rotational latency
  – x ms read latency + x ms OS overhead
  – y ms processing
• Realistically 50 ms per document = 20 docs per second
• 25 million / 20 docs per second = 1,250,000 seconds ≈ 2 weeks (too slow; see the sketch below)
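As a quick check of the arithmetic on this slide:

    DOCS = 25_000_000

    # One disk seek per document at roughly 50 ms each gives 20 docs per second.
    naive_seconds = DOCS / 20            # 1,250,000 seconds
    print(naive_seconds / 86_400)        # ~14.5 days, i.e. about two weeks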
How fast does it have to be?
• Answer: 4 ms per doc = 250 docs per second
• 25 million / 250 = 100,000 seconds ≈ 1.2 days
• Reading + uncompressing + parsing = ~3 to 4 ms per document
• So there is not much room left for processing
How can you do something complicated?
• Run really fast processing to generate smaller intermediate results.
• Run complex processing over the smaller results.
• Example: Duplicate Detection
  – Compute shingles from all documents
  – Find pairs of documents that share shingles (see the sketch below)
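A minimal sketch of that two-phase idea, assuming word-level shingles of four words and an MD5-based fingerprint (the window size and hash function are illustrative choices, not specified in the lecture):

    from collections import defaultdict
    from itertools import combinations
    import hashlib

    SHINGLE_WORDS = 4   # assumed window size

    def shingles(text, k=SHINGLE_WORDS):
        """Fast pass: hash every k-word window of a document to a small fingerprint."""
        words = text.lower().split()
        return {
            hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()[:16]
            for i in range(max(len(words) - k + 1, 1))
        }

    def candidate_pairs(docs):
        """Invert shingle -> documents, then emit pairs that share at least one shingle.
        The expensive near-duplicate comparison only runs on these candidate pairs."""
        by_shingle = defaultdict(set)
        for doc_id, text in docs.items():
            for s in shingles(text):
                by_shingle[s].add(doc_id)
        pairs = set()
        for ids in by_shingle.values():
            pairs.update(combinations(sorted(ids), 2))
        return pairs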
Bulk Processing of Large Result Sets
• Example: Resolving Anchors
• Resolve URLs and save from→to pairs in ASCII
• Compute a 64-bit checksum of each “To” URL
• Bulk merge against a checksum→docID table (sketched below)
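A sketch of that pipeline, assuming the checksum→docID table is already sorted by checksum; the use of a truncated SHA-1 as the 64-bit checksum and the in-memory representation are assumptions:

    import hashlib

    def checksum64(url):
        """64-bit checksum of a URL (truncated SHA-1 is an assumed choice)."""
        return int.from_bytes(hashlib.sha1(url.encode()).digest()[:8], "big")

    def resolve_anchors(from_to_pairs, checksum_to_docid):
        """Bulk merge: sort anchor targets by checksum, then sweep both sorted
        sequences once instead of doing one random lookup per anchor.

        from_to_pairs     : iterable of (from_docID, to_URL)
        checksum_to_docid : list of (checksum, docID) sorted by checksum
        """
        anchors = sorted((checksum64(url), src) for src, url in from_to_pairs)
        edges, i = [], 0
        for csum, src in anchors:
            while i < len(checksum_to_docid) and checksum_to_docid[i][0] < csum:
                i += 1
            if i < len(checksum_to_docid) and checksum_to_docid[i][0] == csum:
                edges.append((src, checksum_to_docid[i][1]))   # from docID -> to docID
        return edges

Sorting first is the point: millions of random per-anchor lookups, each costing a disk seek, become one sequential sweep over each table.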
Reliability - Potential Sources of Problems
• Source code bugs
• Hardware failure
• OS failure
• Out of resources
Software Engineering Guidelines
• Number of bugs seen ~ log(size of dataset)
• Not just your bugs
  – OS bugs
  – Disk OS bugs
• Generate incremental results
Other Available Data
• Link Graph of the Web
• List of PageRanks
• List of URLs
Link Graph of the Web
• From DocID : To DocID
• Try red bars on Google to find backlinks
• Interesting Information
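As an illustration of working with this edge list, the sketch below inverts it into a backlink index; the one-pair-per-line ASCII format is an assumption, and the actual WebBase encoding may differ.

    from collections import defaultdict

    def backlinks(edge_file):
        """Invert the link graph: for each docID, the docIDs that point to it."""
        incoming = defaultdict(list)
        with open(edge_file) as f:
            for line in f:
                src, dst = map(int, line.split())   # "from_docID to_docID" per line (assumed)
                incoming[dst].append(src)
        return incoming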
What is PageRank?
• Measure of “importance”
• You are important if important things point to you
• Random surfer model
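For concreteness, a small power-iteration sketch of the random-surfer model over (from docID, to docID) edges; the damping factor 0.85 and the fixed 20 iterations are conventional defaults, not values from the lecture.

    def pagerank(edges, n_docs, damping=0.85, iterations=20):
        """PageRank: a page's score is the chance a random surfer is on it, where the
        surfer follows a random outlink with probability `damping` and otherwise
        jumps to a random page."""
        out_degree = [0] * n_docs
        for src, _ in edges:
            out_degree[src] += 1
        rank = [1.0 / n_docs] * n_docs
        for _ in range(iterations):
            new = [(1.0 - damping) / n_docs] * n_docs
            for src, dst in edges:
                new[dst] += damping * rank[src] / out_degree[src]
            # Pages with no outlinks spread their rank uniformly.
            dangling = sum(rank[i] for i in range(n_docs) if out_degree[i] == 0)
            rank = [r + damping * dangling / n_docs for r in new]
        return rank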
Uncrawled URLs
• Image links
• MailTo links
• CGI links
• Plain uncrawled HTML links
Summary
• WebBase has lots of web pages
  – very heterogeneous and weird
• Performance Considerations
  – code should be very, very fast
  – use bulk processing
• Reliability Considerations
  – write out intermediate results
• Auxiliary data