The Anatomy of a Large-Scale Hypertextual Web Search Engine
By Sergey Brin and Lawrence Page
Presented by Joshua Haley, Zeyad Zainal, Michael Lopez, Michael Galletti, Britt Phillips, and Jeff Masson
Searching in the '90s
• Search engine technology had to keep up with enormous growth.
[Chart: Web pages indexed, 1994 vs. 1997 — roughly 110K in 1994 vs. 2M in 1997]
[Chart: Queries per day, 1994 vs. 1997 — roughly 1.5K in 1994 vs. 20M in 1997]
Google will Scale
• They wanted a search engine that:
– Crawls fast
– Uses storage space efficiently
– Processes indexes quickly
– Handles queries quickly
• They had to deal with scaling difficulties:
– Disk speeds and OS robustness are not scaling as well as hardware performance and cost
The Google Goals
• Improve search quality
– Remove junk results (prioritize results)
• Academic search engine research
– Create literature on the subject of databases and large-scale search
• Gather usage data
– Databases of usage data can support research
• Support novel research activities on web data
System Features
• Two important features help it produce high-precision results:
– PageRank
– Anchor text
PageRank
• The graph structure of hyperlinks had not been used by other search engines
• Google's graph contains 518 million hyperlinks
• Text matching using page titles alone performs well once pages are prioritized by PageRank
• Results are similar when matching against entire pages
PageRank Formula
• Not all pages linking to a page are counted equally
• PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
– A: the page being ranked
– T1…Tn: pages linking to A
– C(T): the number of links going out of page T
– d: a "damping factor", usually set around 0.85
• PageRank for 26 million pages can be calculated in a few hours (a small iteration sketch follows below)
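A minimal sketch of iterating the formula above to convergence; the toy graph, the damping value, and the fixed iteration count are illustrative assumptions, not the paper's actual implementation:

    def pagerank(links, d=0.85, iterations=50):
        """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A."""
        pages = set(links) | {t for targets in links.values() for t in targets}
        pr = {p: 1.0 for p in pages}                         # initial guess for every page
        out_degree = {p: len(links.get(p, ())) for p in pages}
        for _ in range(iterations):
            new_pr = {}
            for a in pages:
                incoming = (pr[t] / out_degree[t]
                            for t in pages if a in links.get(t, ()))
                new_pr[a] = (1 - d) + d * sum(incoming)
            pr = new_pr
        return pr

    # toy graph of three pages linking to each other (illustrative only)
    print(pagerank({"A": ["B"], "B": ["C"], "C": ["A", "B"]}))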
Intuitive Justification
• A page can have a high PageRank if many pages link to it
• Or if a page with high PageRank links to it (e.g., the Yahoo homepage)
– If the page were low quality or a broken link, such a page would be unlikely to link to it
• PageRank handles both cases by propagating weights through the link structure
Anchor Text
• Anchor text often describes a page more accurately than the page itself.
• Anchors exist for documents that aren't text-based (e.g., images, videos)
• Google indexed more than 259 million anchors from just 24 million pages (a small indexing sketch follows below)
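A tiny sketch of the idea behind anchor indexing: anchor words are credited to the page the link points to rather than the page the link appears on. The function and data shapes are illustrative assumptions, not Google's actual code:

    def index_anchor(anchor_index, anchor_text, target_docid):
        """Credit each anchor word to the *target* document, so even an image or
        video page can be found by the words people use when linking to it."""
        for word in anchor_text.lower().split():
            anchor_index.setdefault(word, set()).add(target_docid)

    anchor_index = {}
    index_anchor(anchor_index, "cute cat video", target_docid=42)
    print(anchor_index["cat"])   # -> {42}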
Other Features
• Words in larger or bolder fonts are weighted more heavily than other words
Related Work
Early Search Engines
• The World Wide Web Worm (WWWW)
– One of the first web search engines (developed in 1994)
– Had a database of 300,000 multimedia objects
• Some early search engines retrieved results by post-processing the results of other search engines.
Information Retrieval
• The science of searching for documents, for information within documents, and for metadata about documents.
• Most research is on small, well-controlled collections of scientific papers or news stories on a related topic.
• The Text Retrieval Conference (TREC) is the primary benchmark for information retrieval
– Its "Very Large Corpus" is a small, well-controlled collection used for benchmarking
– The Very Large Corpus benchmark is only 20 GB
Information Retrieval
• Techniques that do well at the Text Retrieval Conference often do not produce good results on the web
– Example: a search for "Bill Clinton" could return a page that only says "Bill Clinton Sucks" with a picture of him. Brin and Page believe that a search for "Bill Clinton" should return reasonable results, because there is so much information on the topic.
• Standard information retrieval work needs to be extended to deal effectively with the web
Differences Between the Web and Well Controlled Collections
• Documents differ internally in their language, vocabulary, type, and format, and may even be machine generated.
• External meta information is information that can be inferred about a document but is not contained within it.
– Examples: reputation of the source, update frequency, quality, popularity, etc.
• A page like the Yahoo homepage needs to be treated differently from an article that receives one view every ten years.
Differences Between the Web and Well Controlled Collections
• There is no control over what people can put on the web
• Some companies manipulate search engines to route traffic for profit
• Metadata efforts have largely failed with web search engines, because metadata that is never shown to the user can be abused to manipulate the engine into returning pages that have nothing to do with the query.
System Anatomy
• High-level discussion of the architecture
• Descriptions of data structures:
– Repository
– Lexicon
– HitLists
– Forward and Inverted Indices
• Major applications:
– Crawling
– Indexing
– Searching
Google Architecture Overview
• Implemented in C and C++
– Runs efficiently on Linux and Solaris
• Many distributed web crawlers
– Receive lists of URLs to crawl from the URL Server
• Crawlers send pages to the Store Server
– Compressed pages are sent to the Repository
– The Repository assigns each page a docID
• Indexer
– Documents from the Repository are converted into HitLists
– Sends HitLists to the Barrels
– Sends links to the anchors file
Google Architecture Overview (continued)
• URL Resolver
– Reads from the anchors file
– Converts URLs to docIDs and sends them to the Barrels
– Pairs of docIDs are stored in the Links database
• Sorter
– Barrels are presorted by docID (the Forward Index)
– Re-sorts them by wordID to create the Inverted Index
– Dumps a list of the associated wordIDs to the Lexicon
• Lexicon
– Keeps the list of words
• Searcher
– Uses the Lexicon, Inverted Index, and PageRank to answer queries
Repository
• BigFiles
– Virtual files spanning multiple file systems
– Existing operating systems did not provide enough for the system's needs
• Repository access
– No additional data structures are necessary to access it
– This reduces complexity
– All other data structures can be rebuilt from the Repository
Repository: 53.5 GB compressed = 147.8 GB uncompressed
[Figure: repository layout — a sequence of (sync, length, compressed packet) records; each uncompressed packet holds docID, ecode, urlLen, pageLen, URL, page]
• Repository
– Contains the full HTML of every web page
– Compression decision (a packing sketch follows below):
• bzip offers 4:1 compression
• zlib offers 3:1 but is faster
– Opted for speed over compression ratio
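A rough sketch of how one repository record could be packed and unpacked following the layout above; the field widths, sync word, and compression level are assumptions, since the paper only names the fields and the choice of zlib:

    import struct, zlib

    HEADER = struct.Struct("<IHHI")   # assumed widths: docID, ecode, urlLen, pageLen

    def pack_record(doc_id, ecode, url, page_html):
        """Build one uncompressed packet, then prefix it with a sync word and length."""
        url_b, page_b = url.encode(), page_html.encode()
        packet = HEADER.pack(doc_id, ecode, len(url_b), len(page_b)) + url_b + page_b
        compressed = zlib.compress(packet)            # zlib: chosen for speed over ratio
        return struct.pack("<II", 0xDEADBEEF, len(compressed)) + compressed

    def unpack_record(buf):
        sync, length = struct.unpack_from("<II", buf)
        packet = zlib.decompress(buf[8:8 + length])
        doc_id, ecode, url_len, page_len = HEADER.unpack_from(packet)
        url = packet[HEADER.size:HEADER.size + url_len].decode()
        page = packet[HEADER.size + url_len:].decode()
        return doc_id, ecode, url, page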
Document Index and Lexicon
• Document Index
– Stores information about each document
– A fixed-width ISAM (Index Sequential Access Mode) index, ordered by docID
– Information includes:
• Status
• Pointer into the Repository
• Checksum
• Various statistics
– Record fetching:
• If the document has been crawled, the record points to a docinfo file with its URL and title
• Otherwise it points to the URL in the URLlist
• docID allocation (a lookup sketch follows below)
– A file of all document checksums paired with docIDs, sorted by checksum
– To find a docID:
1. Compute the checksum of the URL
2. Binary search over the file
– Lookups may be done in batches
• Lexicon
– Small enough to fit in the main memory of one machine
– Holds 14 million words
• A linked list of the words themselves
• A hash table of pointers into that list
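A small sketch of the docID lookup described above, using a checksum-sorted list and binary search. The checksum function and the in-memory list are illustrative assumptions; Google kept these pairs in a file and resolved URLs in batch:

    import bisect, zlib

    # (url_checksum, docID) pairs kept sorted by checksum (illustrative contents)
    checksum_to_docid = sorted([(zlib.crc32(b"http://example.com/"), 1),
                                (zlib.crc32(b"http://example.org/a"), 2)])

    def find_docid(url):
        """1. Compute the URL's checksum.  2. Binary-search the sorted pairs."""
        key = zlib.crc32(url.encode())
        i = bisect.bisect_left(checksum_to_docid, (key, -1))
        if i < len(checksum_to_docid) and checksum_to_docid[i][0] == key:
            return checksum_to_docid[i][1]
        return None   # URL has not been assigned a docID yet

    print(find_docid("http://example.org/a"))   # -> 2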
HitLists and Encoding
• Hit
– An occurrence of a word in a document, stored in 2 bytes
– Two kinds: fancy and plain hits
– Records capitalization, font size relative to the document, and position
• HitList
– The list of hits for a word in a document
– Requires the most space of any data structure
– Many possible encoding schemes:
• Simple
• Hand-optimized
• Huffman
– A time vs. space compromise (a bit-packing sketch follows below)

Bit allocation for the different hits [2 bytes]:
Plain:  cap: 1 bit | size: 3 bits | position: 12 bits
Fancy:  cap: 1 bit | size = 7 | type: 4 bits | position: 8 bits
Anchor: cap: 1 bit | size = 7 | type: 4 bits | docID hash: 4 bits | position: 4 bits

• Anchor hits
– Carry a hash of the docID the anchor occurs in
• Storing
– HitLists are stored in the Barrels
– Space saving:
• The list length is combined with the wordID or docID, depending on whether it is the Forward or the Inverted Index
• If the list length does not fit in the remaining bits, an escape code is placed there and the next two bytes store the actual length
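As a concrete illustration of the 2-byte layout above, a plain hit could be packed like this. The exact bit positions are an assumption; the paper fixes only the field widths (1 capitalization bit, 3 bits of font size, 12 bits of position):

    def pack_plain_hit(cap, size, position):
        """cap: 1 bit, size: 3 bits, position: 12 bits -> one 16-bit plain hit."""
        assert size < 7, "size value 7 is reserved to mark fancy/anchor hits"
        return (cap & 0x1) << 15 | (size & 0x7) << 12 | (position & 0xFFF)

    def unpack_plain_hit(hit):
        return hit >> 15, (hit >> 12) & 0x7, hit & 0xFFF

    h = pack_plain_hit(cap=1, size=3, position=417)
    print(hex(h), unpack_plain_hit(h))   # capitalized word, mid-size font, position 417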
Forward and Inverted Indices
• Forward Index
– 64 Barrels, each covering a range of wordIDs
– The words in a document are split by range:
• The docID is recorded into each appropriate Barrel
• Followed by a list of wordIDs with their HitLists
• wordIDs are stored relative to the Barrel's starting wordID, so they fit in 24 bits, leaving 8 bits for the HitList length
– The system needs somewhat more storage because docIDs are duplicated across Barrels, but coding complexity is greatly reduced
• Inverted Index
– Created after the Barrels go through the Sorter
– For each valid wordID, the Lexicon contains a pointer into the corresponding Barrel
– That pointer leads to a docList of docIDs with their matching HitLists
• Represents every document in which a particular word appears
• docList ordering
– Sort by docID
• Quick merging for multi-word queries (a merge sketch follows below)
– Sort by a ranking of the word's occurrence
• One-word queries become trivial
• Answers to multi-word queries are likely near the start of the list
• But merging is difficult and development is harder
– Compromise: keep two sets of Barrels (one for title and anchor hits, one for all hits)
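To see why docID-sorted docLists make multi-word queries quick, here is a tiny sketch of intersecting two sorted docLists with a linear-time merge; the index contents are made up for illustration:

    def merge_doclists(a, b):
        """Intersect two docLists that are both sorted by docID."""
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    # docIDs containing "search" and "engine" respectively (illustrative)
    print(merge_doclists([3, 7, 19, 42, 88], [7, 9, 42, 90]))   # -> [7, 42]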
Crawling The Web
Crawling
• Accessing millions of web pages and logging data
• DNS caching for increased performance (a tiny sketch follows below)
• Email from web admins
• Unpredictable bugs
• Copyright problems
• Robots.txt
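Because DNS lookups are a major cost when fetching millions of pages, each crawler kept a local DNS cache; a minimal sketch of the idea (the cache layout and resolver call are assumptions, not Google's crawler code):

    import socket

    _dns_cache = {}

    def resolve(hostname):
        """Return a cached IP if this host has been seen before, else do one lookup."""
        if hostname not in _dns_cache:
            _dns_cache[hostname] = socket.gethostbyname(hostname)
        return _dns_cache[hostname]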
Indexing The Web
• Parsing HTML data
• Handling a wide variety of errors
• Encoding hits into the Barrels
• Turning words into wordIDs
• Hashing all of the data
• Sorting the data recursively (a bucket sort, sketched below)
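A minimal sketch of the bucket-sort step named above: forward-barrel entries keyed by docID are bucketed by wordID to produce inverted-index postings. The record shape is an assumption for illustration:

    from collections import defaultdict

    def invert_barrel(forward_entries):
        """forward_entries: iterable of (docID, wordID, hitlist) from one forward barrel.
        Bucket by wordID so each word maps to its docList with matching HitLists."""
        buckets = defaultdict(list)
        for doc_id, word_id, hitlist in forward_entries:
            buckets[word_id].append((doc_id, hitlist))
        # forward barrels are already ordered by docID, so each bucket stays docID-sorted
        return dict(buckets)

    inverted = invert_barrel([(1, 10, [0x9000]), (2, 10, [0x8001]), (2, 11, [0x0005])])
    print(inverted[10])   # -> [(1, [36864]), (2, [32769])]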
Searching
• Quality first
• Limited depth (40,000 hits)
• No single factor can have too much impact
• Uses titles, font size, proximity, and count
• Produces an IR (relevance) score
• Combines PageRank with the IR score to give the final rank (a hedged sketch follows below)
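The paper does not publish the exact combination, so the following is only a sketch of how a type-weighted IR score might be merged with PageRank; the weights and the logarithmic damping are assumptions, not Google's actual formula:

    import math

    # assumed per-hit-type weights: title and anchor hits count more than plain text
    TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "large_font": 3.0, "plain": 1.0}

    def ir_score(hit_counts):
        """Counts are damped so no single factor can dominate the score."""
        return sum(TYPE_WEIGHTS[t] * math.log1p(n) for t, n in hit_counts.items())

    def final_rank(hit_counts, pagerank):
        return ir_score(hit_counts) * math.log1p(pagerank)   # assumed combination

    print(final_rank({"title": 1, "plain": 12}, pagerank=4.2))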
User Feedback
• User input is vital to improving search results
• Verified users can evaluate results and send their ratings back
• The ratings are used to adjust the ranking system
• And to verify that old results are still valid
Results and Performance
• The most important measure of a search engine is the quality of its search results
• "Our own experience with Google has shown it to produce better results than the major commercial search engines for most searches."
• Results are generally high-quality pages with minimal broken links
Storage Requirements
• The total size of the repository is about 53 GB (a relatively cheap source of data)
• The total of all the data used by the engine, excluding the repository, is about 55 GB
• With better compression, a high-quality search engine could fit on about a 7 GB drive
System and Search Performance
• Google's major operations: crawling, indexing, and sorting
• The indexer runs faster than the crawler
• The indexer runs at 54 pages per second
• Using four machines, sorting takes 24 hours
• Most queries are answered within 10 seconds
• With no query caching or subindices on common terms yet
Conclusions
• Google is designed to be a scalable search engine that provides high-quality search results.
• Future work: query caching, smart disk allocation, and subindices
– Smart algorithms to decide which old web pages should be recrawled and which new ones should be crawled
– Using proxy caches to build search databases; adding Boolean operators, negation, and stemming
– Supporting user context and result summarization
• High-quality search: users want high-quality results without being frustrated and wasting time.
– Google returns higher-quality search results than current commercial search engines; link-structure analysis determines the quality of pages, and link (anchor) text determines relevance.
• Scalable architecture: Google is efficient in both space and time
– Google has overcome bottlenecks in CPU, memory access, memory capacity, and disk I/O during its various operations
– Crawling, indexing, and sorting are efficient enough to build an index of 24 million pages in less than a week