CS 349: WebBase 1


Page 1: CS 349: WebBase 1

CS 349: WebBase 1

What the WebBase can and can’t do

Page 2

Summary

• What is in the WebBase

• Performance Considerations

• Reliability Considerations

• Other sources of data

Page 3

WebBase Repository

• 25 million web pages

• 150 GB (50 GB compressed)

• spread across roughly 30 disks

Page 4

What kind of web pages?

• Everything you can imagine

• Infinite Web Pages (truncated at 100K)

• 404 Errors

• Very little correct HTML.

Page 5

Duplicate Web Pages

• Duplicate Sites
  – Crawl root pages first
  – Find duplicates and assume the same for the remainder of the crawl

• Duplicate hierarchies off of the main page
  – Mirror sites

• Duplicate Pages

• Near-Duplicate Pages

Page 6

Shiva’s Test Results

• 36% Duplicates

• 48% Near Duplicates

• Largest Sets of Duplicates:
  – TUCOWS (100)
  – MS IE Server Manuals (90)
  – Unix Help Pages (75)
  – RedHat Linux Manual (55)
  – Java API Doc (50)

Page 7

Order of Web Pages

• First half million are root pages

• After that, pages in PageRank order

• Roughly by importance

Page 8

Structure of Data

• magic number (4 bytes), packet length (4 bytes), packet (~2K bytes)

• packet is compressed

• packet contains: docID, URL, HTTP Headers, HTML data
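The record layout above can be read with a simple sequential scan. A minimal sketch in Python, assuming big-endian 4-byte fields and zlib compression (the deck specifies neither, and the magic value here is a placeholder):

```python
import io
import struct
import zlib

# Placeholder: the real repository magic number is not given in the deck.
MAGIC = 0xCAFEBABE

def read_records(stream):
    """Yield one decompressed packet (docID, URL, HTTP headers, HTML data)
    per record: 4-byte magic, 4-byte packet length, compressed packet."""
    while True:
        header = stream.read(8)
        if len(header) < 8:          # clean end of file
            return
        magic, length = struct.unpack(">II", header)
        if magic != MAGIC:
            raise ValueError("corrupt record boundary")
        yield zlib.decompress(stream.read(length))
```

Because records are length-prefixed, the whole repository can be streamed with purely sequential reads, which matters for the performance numbers on the next slides.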

Page 9

Performance Issues: An Example

• One Disk Seek Per Document:
  – 10 ms seek latency + 10 ms rotational latency
  – x ms read latency + x ms OS overhead
  – y ms processing

• Realistically 50 ms per document = 20 docs per second

• 25 million / 20 docs per second = 1,250,000 seconds ≈ 2 weeks (too slow)
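The slide’s arithmetic checks out directly:

```python
# Sanity check of the one-seek-per-document estimate above.
ms_per_doc = 50                              # realistic total latency per doc
docs_per_sec = 1000 // ms_per_doc            # 20 documents per second
total_secs = 25_000_000 // docs_per_sec      # 1,250,000 seconds
weeks = total_secs / (7 * 24 * 3600)         # about 2 weeks -- too slow
```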

Page 10

How fast does it have to be?

• Answer: 4 ms per doc = 250 docs per second

• 25 million / 250 = 100,000 seconds ≈ 1.2 days

• Reading + uncompressing + parsing ≈ 3 to 4 ms per document

• So there is not much room left for processing

Page 11

How can you do something complicated?

• Really fast processing to generate smaller intermediate results.

• Run complex processing over smaller results.

• Example: Duplicate Detection
  – Compute shingles from all documents
  – Find pairs of documents that share shingles
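The two-phase idea above can be sketched in Python. This is an illustrative version, not the actual WebBase code: the window size, the use of Python’s built-in `hash`, and the Jaccard threshold are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

def shingles(text, k=4):
    """Fast first pass: hash every k-word window into a small fingerprint set."""
    words = text.split()
    return {hash(tuple(words[i:i + k])) for i in range(len(words) - k + 1)}

def near_duplicate_pairs(docs, threshold=0.5):
    """Slower second pass, run only over the small intermediate results:
    return doc-id pairs whose shingle sets overlap beyond a Jaccard threshold."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    index = defaultdict(set)                 # shingle -> doc ids containing it
    for doc_id, s in sets.items():
        for sh in s:
            index[sh].add(doc_id)
    candidates = set()                       # only pairs sharing >= 1 shingle
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))
    pairs = []
    for a, b in candidates:
        inter = len(sets[a] & sets[b])
        union = len(sets[a] | sets[b])
        if union and inter / union >= threshold:
            pairs.append((a, b))
    return pairs
```

The expensive pairwise comparison never touches raw pages, only the compact shingle sets, which is exactly the pattern the slide describes.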

Page 12

Bulk Processing of Large Result Sets

• Example: Resolving Anchors

• Resolve URLs and save from–to pairs in ASCII

• Compute a 64-bit checksum of each “To” URL

• Bulk merge against the checksum–docID table
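The bulk-merge step above can be sketched as a sort-merge join in Python. The choice of checksum (first 8 bytes of MD5) is an assumption for illustration; the deck does not say which 64-bit checksum was used.

```python
import hashlib

def checksum64(url):
    """64-bit fingerprint of a URL (first 8 bytes of MD5, as an assumption)."""
    return int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big")

def resolve_anchors(from_to_urls, url_to_docid):
    """Resolve (from_docid, to_url) pairs to (from_docid, to_docid) by
    sorting both sides on the checksum and doing one sequential merge pass."""
    table = sorted((checksum64(u), d) for u, d in url_to_docid.items())
    links = sorted((checksum64(to_url), frm) for frm, to_url in from_to_urls)
    out, i = [], 0
    for csum, frm in links:
        while i < len(table) and table[i][0] < csum:
            i += 1                           # advance the table pointer once
        if i < len(table) and table[i][0] == csum:
            out.append((frm, table[i][1]))   # anchor resolved to a docID
    return out                               # uncrawled targets are dropped
```

Sorting both sides first turns millions of random lookups into two sequential scans, which is the whole point of bulk processing on this hardware.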

Page 13

Reliability - Potential Sources of Problems

• Source code bugs

• Hardware failure

• OS failure

• Out of resources

Page 14

Software Engineering Guidelines

• Number of bugs seen ~ log(size of dataset)

• Not just your bugs
  – OS bugs
  – Disk OS bugs

• Generate incremental results

Page 15

Other Available Data

• Link Graph of the Web

• List of PageRanks

• List of URLs

Page 16

Link Graph of the Web

• From DocID : To DocID

• Try the red bars on Google to find backlinks

• Interesting Information

Page 17

What is PageRank

• Measure of “importance”

• You are important if important things point to you

• Random surfer model
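The random-surfer model can be sketched as a short power iteration: with probability d the surfer follows a random outlink, otherwise jumps to a random page. The damping factor, graph, and iteration count here are illustrative, not WebBase’s actual parameters.

```python
def pagerank(out_links, d=0.85, iters=50):
    """out_links: dict mapping each node to the list of nodes it points to."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}           # start uniform
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in nodes}  # random-jump share
        for v, outs in out_links.items():
            if outs:
                share = d * rank[v] / len(outs)  # split rank over outlinks
                for w in outs:
                    new[w] += share
            else:                                # dangling page: jump anywhere
                for w in nodes:
                    new[w] += d * rank[v] / n
        rank = new
    return rank
```

Pages pointed to by high-rank pages accumulate rank themselves, which is the “important if important things point to you” property in bullet two.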

Page 18

Uncrawled URLs

• Image links

• MailTo links

• CGI links

• Plain uncrawled HTML links

Page 19

Summary

• WebBase has lots of web pages
  – very heterogeneous and weird

• Performance Considerations
  – code should be very, very fast
  – use bulk processing

• Reliability Considerations
  – write out intermediate results

• Auxiliary data