Google PageRank and Architecture

Presentation by Georgi Chulkov, Networks and Distributed Systems Seminar, Spring 2007, Jacobs University Bremen

The Google Challenge

● How to build a large-scale search engine that gives important results?

● Easy in principle, made difficult by scale

– 100 queries per second (q/s) in 1999, 1300 q/s in 2005

– 250 GB in 1999, 16000 TB (rumored) in 2005

– 150 million web pages in 1999, 11.5 billion in 2005

– 11 links per page (l/p) in 1999, 110 l/p in 2005

PageRank

● An importance rating of a page

– Importance = number of backlinks

● Random surfer model

The PageRank Formula

● Rank = probability of ending up on that page

● Simplified

– PR(A) = PR(T1) / C(T1) + ... + PR(Tn) / C(Tn)

– Problem: rank sinks

● Full

– PR(A) = (1 – d) / N + d * (PR(T1) / C(T1) + ... + PR(Tn) / C(Tn))

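– d is the damping factor: the probability that the random surfer does not get bored and keeps following links (0.85)

– T1 ... Tn are the pages linking to A, C(T) is the number of links going out of page T, and N is the total number of pages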

● Linear system in theory, iteration in practice
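To make "iteration in practice" concrete, here is a minimal sketch that applies the full formula repeatedly until the ranks stop changing. The function name, the `tol` and `max_iter` parameters, and the three-page `toy_web` are illustrative assumptions, not taken from the presentation, and the sketch assumes every page has at least one outgoing link.

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=100):
    """Iterate PR(A) = (1 - d) / N + d * sum(PR(T) / C(T)) until it settles.

    links maps each page to the list of pages it links to. Every page is
    assumed to have at least one outgoing link (no dangling pages).
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # uniform initial vector
    for _ in range(max_iter):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for source, targets in links.items():
            share = d * rank[source] / len(targets)  # d * PR(T) / C(T)
            for target in targets:
                new_rank[target] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:  # stop once an iteration barely moves the ranks
            break
    return rank


# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy web, C collects links from both A and B and ends up with the highest rank, while B, linked only from A, ends up with the lowest.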

More on PageRank

● Dangling links (links to pages with no outgoing links) are ignored

● Converges in roughly log(n) iterations

– 50 iterations in 1999, 100 iterations today

● Initial vector can be played with

● Applications

– Search

– Broken link detection

– Discovering the truth about porn
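The slide does not say how dangling links are ignored. One possible reading, sketched here under that assumption, is to drop pages without outgoing links before iterating, and to repeat the step because removing one page can leave another without outgoing links. The function name `prune_dangling` and the sample graph are hypothetical.

```python
def prune_dangling(links):
    """Repeatedly drop pages with no outgoing links, and the links to them."""
    pruned = {page: set(targets) for page, targets in links.items()}
    # Pages that appear only as link targets have no outgoing links either.
    for targets in list(pruned.values()):
        for target in targets:
            pruned.setdefault(target, set())
    while True:
        dangling = {page for page, targets in pruned.items() if not targets}
        if not dangling:
            return pruned
        pruned = {
            page: targets - dangling
            for page, targets in pruned.items()
            if page not in dangling
        }


# Hypothetical graph: D has no outgoing links, and C links only to D.
sample = {"A": ["B", "C"], "B": ["A"], "C": ["D"], "D": []}
core = prune_dangling(sample)
```

Here D is removed first, which leaves C dangling, so C is removed in the next pass and only A and B remain.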

On to Google...

● Designed for accuracy

– Uses PageRank, anchor text, proximity, font size...

● Huge and scalable

– 1 googol = 10^100

– Largest distributed cluster in the world (100000 x [image] = a lot of penguins!)

Architecture

● Distributed crawlers store the web

● Indexer parses HTML and creates hits, anchors, lexicon, document index

● Resolver creates link DB, stores anchor text, creates docIDs

Architecture

● Two sets of barrels, for the forward and the inverted index

● Sorter converts the forward index (sorted by docID) into the inverted index (sorted by wordID)

● Searcher does the rest
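The sorter's job can be illustrated with a small in-memory sketch. The dictionary layout (docID to wordID to word positions) and the name `invert` are assumptions made for illustration; the real system works on barrels stored on disk.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn {docID: {wordID: positions}} into {wordID: [(docID, positions)]}."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):  # keep each posting list in docID order
        for word_id, positions in forward_index[doc_id].items():
            inverted[word_id].append((doc_id, positions))
    return dict(inverted)


# Hypothetical forward index: two documents, three distinct wordIDs.
forward = {
    1: {10: [0, 7], 11: [3]},
    2: {11: [1], 12: [4, 5]},
}
inverted = invert(forward)
```

After the conversion, looking up a wordID gives every document containing that word, which is what the searcher needs.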

Implementation

● Custom code from the ground up

– Own file system, own data structures, etc.

– CPUs and disks get faster, but not seek times

● Magic everywhere

– Position of a word in a document, font size, and type of occurrence are encoded in 2 bytes

– Concurrent writing to a 14-million-word lexicon with no locking
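As a rough illustration of how a word occurrence can fit into 2 bytes, the sketch below packs a capitalization flag, a font size, and a position into 16 bits. The widths (1, 3, and 12 bits) follow the plain-hit layout described in the Brin and Page paper; the bit order, the clamping of large positions, and the function names are assumptions of this sketch.

```python
import struct

POS_BITS = 12
FONT_BITS = 3
MAX_POS = (1 << POS_BITS) - 1  # 4095, the largest position that fits in 12 bits
MAX_FONT = (1 << FONT_BITS) - 1  # 7, the largest value that fits in 3 bits


def pack_hit(capitalized, font_size, position):
    """Pack one occurrence into a 16-bit integer: [cap:1][font:3][pos:12]."""
    if not 0 <= font_size <= MAX_FONT:
        raise ValueError("font_size must fit in 3 bits")
    position = min(position, MAX_POS)  # assumption: clamp positions that overflow
    return (int(capitalized) << 15) | (font_size << POS_BITS) | position


def unpack_hit(hit):
    """Recover (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> POS_BITS) & MAX_FONT, hit & MAX_POS


hit = pack_hit(True, 3, 42)
on_disk = struct.pack("<H", hit)  # "<H" is an unsigned 16-bit little-endian integer
```

Packing the fields this way keeps each occurrence at exactly two bytes, at the cost of losing precision for positions beyond 4095.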

Implementation

● Crawling

– Web crawlers crawl 100 pages per second (p/s) at 600 KB/s with 300 open connections and have their own DNS cache... in 1999

● Parsing, indexing, sorting

– Very little is mentioned, other than how difficult it is :(

Let's search!

● Single word: get wordID, find docIDs for that word, sort them by score

● Multiple words: do the same, filter, and multiply final score by proximity score

● Score = PageRank + IR score calculated from anchor text, position, and font size
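The three bullets above can be sketched as a single function. The scoring helpers are hypothetical stand-ins: `ir_score` only counts occurrences (the real IR score also uses anchor text, position, and font size), `proximity_score` looks only at the gap between the first query word and the others, and the index and PageRank values are made-up toy data.

```python
def ir_score(positions):
    """Hypothetical IR score: more occurrences of the word score higher."""
    return 0.1 * len(positions)


def proximity_score(position_lists):
    """Hypothetical proximity score in (0, 1], based on the smallest gap
    between the first query word and any other query word."""
    first, *rest = position_lists
    gap = min(abs(a - b) for other in rest for a in first for b in other)
    return 1.0 / max(gap, 1)


def search(word_ids, inverted_index, pagerank):
    """Return docIDs ranked by score, best first."""
    postings = [inverted_index.get(w, {}) for w in word_ids]
    if not postings:
        return []
    # Filter: keep only documents that contain every query word.
    candidates = set(postings[0]).intersection(*postings[1:])
    scored = []
    for doc_id in candidates:
        position_lists = [p[doc_id] for p in postings]
        score = pagerank[doc_id] + sum(ir_score(pos) for pos in position_lists)
        if len(word_ids) > 1:
            score *= proximity_score(position_lists)
        scored.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(scored, reverse=True)]


# Toy data: wordID -> {docID: positions}, and a PageRank per docID.
toy_index = {10: {1: [0, 7], 2: [3]}, 11: {1: [1], 3: [2]}}
toy_pagerank = {1: 0.2, 2: 0.5, 3: 0.3}
single = search([10], toy_index, toy_pagerank)
multi = search([10, 11], toy_index, toy_pagerank)
```

For the single-word query, document 2 wins on PageRank even though document 1 contains the word twice; for the two-word query, only document 1 contains both words.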

The End

● References

– Papers

Page, L., Brin, S., Motwani, R., and Winograd, T. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Stanford University, SIDL-WP-1999-0120, November 1999.

Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh International Conference on World Wide Web 7 (Brisbane, Australia). P. H. Enslow and A. Ellis, Eds. Elsevier Science Publishers B. V., Amsterdam, The Netherlands, 107-117. DOI= http://dx.doi.org/10.1016/S0169-7552(98)00110-X

– Websites

http://www.1cog.com/search-engine-statistics.html

http://en.wikipedia.org/wiki/PageRank

http://www.google.com/