The Anatomy of a Large-Scale
Hypertextual Web Search Engine
A review by: Adam Chamberlain, Adrian Hudnott, Rob
Garrood & Ben Smith
November 2005
2
Agenda
• Introduction
• Overview of Google
• PageRank
  – Motivation & Description
  – Example
  – Issues & Comparison
  – Further Work
• Application
• Conclusions
3
Introduction
• About the paper
  – Brin & Page, 1998, Stanford University
  – Details a prototype search engine, Google
  – Covers both architecture and algorithms
  – Cited in web metrics with relation to significance
    • Also relevant to Web Graph Properties
• PageRank
  – Covered in a separate paper from Brin & Page
  – Is the primary metric used in the paper
4
Overview : What is Google?
• Web search engine
  – Tackles issues faced by previous crawlers of scalability and manipulation
• Academic
  – Built on strong understanding of web metrics
  – Use of hyperlink structures
• Transparent
  – Initially released into the public domain
  – Support for informatics research
5
Overview : Architecture
[Figure: Google architecture diagram. Components shown: URL Server, Crawler, Store Server, Repository, Indexer, URL Resolver, Anchors, Lexicon, Barrels, Links, Doc Index, Sorter, PageRank, Searcher, Checksums]
6
Overview: Google Architecture
(Explanation for handout only.)
• URL Server: Finds pages to surf.
• Crawler: Downloads pages and places them in the repository.
• Store Server: Document compression.
• Repository: Cached copies of most web pages.
• Indexer: Creates the forward index (documents → words) and extracts hyperlink tags into the Anchors file.
• URL Resolver: Converts relative URLs into absolute URLs and creates the Links file.
• Links file: Ordered pairs of document IDs where a hyperlink exists between them.
• Sorter: Re-sorts the forward index to create the inverted index (words → documents) and creates the Lexicon.
• Lexicon: Dictionary of all possible search keywords.
• Doc Index: Maps document identifier codes to URLs.
• PageRank: An influential web metric used to sort Google’s matches.
• Searcher: Performs searches!
7
Overview : Forward Index
• Indexer identifies key word ‘hits’ in a document
• Maps document (page) IDs to word IDs in Lexicon
• Word IDs partially sorted into barrels
  – 64 of these
  – Word IDs within a barrel are unsorted.
  – An individual document may be spread over several barrels.
• However, not useful for search!
8
Overview : Inverted Index
• Want to know in what documents a key word occurs
• Need the ‘Inverted Index’
• Sorts the forward index into its inverted form
• Function performed by the ‘Sorter’
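The inversion step can be sketched in a few lines of Python. This is only an illustration of the idea, not Google's Sorter: the document IDs, word IDs and the `invert` helper are made up for the example, and barrels are ignored.

```python
from collections import defaultdict

# Hypothetical forward index: document ID -> word IDs (hits) found in that document.
forward_index = {
    1: [10, 42, 10],
    2: [42, 7],
    3: [7, 10],
}

def invert(forward):
    """Return word ID -> sorted list of the document IDs that contain it."""
    inverted = defaultdict(set)
    for doc_id, word_ids in forward.items():
        for word_id in word_ids:
            inverted[word_id].add(doc_id)
    # Sort by word ID so that a lookup structure (the Lexicon) can point into it.
    return {word_id: sorted(docs) for word_id, docs in sorted(inverted.items())}

inverted_index = invert(forward_index)
print(inverted_index)
```

A search for word 10 then becomes a single lookup, `inverted_index[10]`, instead of a scan over every document.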
9
Overview : Ranking System
• Proximity of keyword ‘hits’
  – This is the sum of the distances between them
• Hits have ‘types’
  – Types: body text, heading text, anchor text, url, …
  – Relative font size factor used
• Count how many hits occur of each type and range of proximity values
  – Apply a function to each type-proximity count
• These form a type-proximity vector, C
10
Overview : Ranking System (2)
• V = C·W (dot product) is computed.
  – W is the importance associated with each type-proximity class.
• Combine V with the PageRank score
• Effect of increasing hits declines
  – Prevents large scale manipulation
[Figure: plot of f(x) against hit count, x]
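A rough Python sketch of the dot product V = C·W follows. The weights, the choice of `log1p` as the tapering function f(x), and the restriction to hit types only (no proximity classes) are all assumptions made for illustration; the slides do not give the real values, and the way V is combined with PageRank is left out for the same reason.

```python
import math

# Hypothetical importance weights W, one per hit type (proximity classes omitted).
TYPE_WEIGHTS = {"url": 4.0, "anchor": 3.0, "heading": 2.0, "body": 1.0}

def taper(count):
    """Assumed f(x): grows with the hit count but flattens out quickly."""
    return math.log1p(count)

def ir_score(hit_counts, weights=TYPE_WEIGHTS):
    """Dot product V = C . W, where C holds the tapered count for each hit type."""
    return sum(taper(hit_counts.get(hit_type, 0)) * w for hit_type, w in weights.items())

honest = ir_score({"anchor": 1, "body": 3})
stuffed = ir_score({"anchor": 1, "body": 300})  # same page with the keyword repeated 100x
print(round(honest, 3), round(stuffed, 3))
```

Because f(x) flattens, repeating a keyword a hundred times raises the score by far less than a hundredfold, which is the anti-manipulation effect the slide describes.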
11
PageRank : Motivation
• Academic Citation Analysis* attempted, but…
  – Web has no formal quality control or peer review
  – Possible to inflate citation counts artificially
  – Web pages vary more than academic papers
• Consider:
  – One link from the University’s main page, or one link from Yahoo’s main page…
  – Which citation should carry the higher weight?
*Also known as bibliometrics
12
PageRank : Description
• Informal Definition:
  – “A page has a high rank if the sum of the ranks of its backlinks is high”
  – Handles ‘Yahoo’ case on previous slide
• Intuitive Definition:
  – Corresponds to the Random Surfer Model
  – User keeps clicking on links ‘linearly’ then gets bored and restarts at a random location
• Now for the maths…
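The Random Surfer Model can be simulated directly. The sketch below is a toy: the four-page web, the 0.85 probability of following a link and the step count are assumptions chosen for the example, and every page is given at least one out-link so the surfer never gets stuck.

```python
import random

# Hypothetical four-page web; each page lists the pages it links to.
LINKS = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}

def random_surfer(links, steps=100_000, follow_prob=0.85, seed=0):
    """Return the fraction of steps the surfer spends on each page."""
    rng = random.Random(seed)
    pages = list(links)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < follow_prob:
            page = rng.choice(links[page])  # keep clicking on links
        else:
            page = rng.choice(pages)        # bored: restart at a random page
    return {p: n / steps for p, n in visits.items()}

shares = random_surfer(LINKS)
for page, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(page, round(share, 3))
```

The page that attracts the most links ("hub") ends up with the largest share of visits, while "c", which nobody links to, is reached only through random restarts.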
13
PageRank : Description (2)
• Formal Definition:

  $$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$

  – c is a ‘damping’ factor, set to 0.85
  – N_v is the number of out-links from page v
  – B_u is the set of backlinks of the current page u (the pages that link to u)
  – cE(u) corresponds to the surfer getting ‘bored’
14
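PageRank : Example
• Considering an example network
• Calculating A:

  $$R(A) = (1 - c) + c \left( \frac{R(B)}{N(B)} + \frac{R(C)}{N(C)} + \frac{R(E)}{N(E)} \right)$$

  c = damping factor
  N = out-degree
  R = PageRank

[Figure: example network of five pages A, B, C, D and E]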
15
PageRank : Example (2)
• Initially set all PageRank to 1
• First Iteration:

  In-link   Rank (R)   Out-links (N)   R/N
  B         1          1               1
  C         1          2               0.5
  E         1          2               0.5

  $$R(A) = (1 - 0.85) + 0.85\,(1 + 0.5 + 0.5) = 1.85$$

[Figure: the example network of pages A, B, C, D and E, repeated]
16
PageRank : Example (3)
• Repeat process for B, C, D and E
• Feed computed values into next iteration

  Iteration   1        2        3        4        5        6
  A           1.8500   1.2479   1.1967   1.5230   1.3412   1.2954
  B           0.4333   0.4333   0.6380   0.4930   0.4807   0.5593
  C           0.8583   0.7981   0.9772   0.9084   0.8668   0.9277
  D           1.0000   1.7225   1.2107   1.1672   1.4445   1.2900
  E           0.8583   0.7981   0.9772   0.9084   0.8668   0.9277
  Order       ADCEB    DACEB    ADCEB    ADCEB    DACEB    ADCEB
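The iteration can be replayed in Python. The link structure below is an assumption: it is inferred from the tabulated values (for instance, D's rank always equals 0.15 + 0.85 × A's previous rank, so A appears to link only to D) rather than read from the network diagram. The update rule is the one used for R(A) in the example, with c = 0.85 and every page starting at 1.

```python
# Assumed link structure, inferred from the table: page -> pages it links to.
LINKS = {
    "A": ["D"],
    "B": ["A"],
    "C": ["A", "E"],
    "D": ["B", "C", "E"],
    "E": ["A", "C"],
}

def pagerank_step(ranks, links, c=0.85):
    """One pass of R(u) = (1 - c) + c * sum(R(v) / N(v)) over the in-links v of u."""
    new_ranks = {}
    for page in links:
        incoming = sum(ranks[src] / len(outs)
                       for src, outs in links.items() if page in outs)
        new_ranks[page] = (1 - c) + c * incoming
    return new_ranks

def iterate(links, iterations, c=0.85):
    """Start every page at rank 1 and return the ranks after each iteration."""
    ranks = {page: 1.0 for page in links}
    history = []
    for _ in range(iterations):
        ranks = pagerank_step(ranks, links, c)  # all pages use the previous pass's values
        history.append(ranks)
    return history

history = iterate(LINKS, 6)
for number, ranks in enumerate(history, start=1):
    order = "".join(sorted(ranks, key=lambda p: -ranks[p]))
    print(number, {p: round(r, 4) for p, r in ranks.items()}, order)
```

Under this assumed graph the printed values agree with the table to within rounding, including the alternation between A and D at the top of the ordering.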
17
PageRank : Analysis
• Converges in roughly log n iterations
  – Overall cost is constrained more by the time to build a full-text index than by anything else
• Rank ‘Sinks’
  – Caused by two pages that point to each other but not to any other pages: rank accumulates
  – Solved by random surfer model
• Manipulation: ‘Google Bombing’
  – French Military ‘Victories’ links to ‘Defeats’
  – ‘Miserable Failure’ links to George Bush biography
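The rank sink problem, and the way the random jump relieves it, can be seen on a three-page toy graph. Everything here is an assumption for illustration: the graph, the normalisation (ranks sum to 1, so the jump term is (1 - c)/n) and the iteration count.

```python
# Hypothetical graph: "s" links into a two-page loop ("x" <-> "y") that never links out.
LINKS = {"s": ["x"], "x": ["y"], "y": ["x"]}

def iterate_ranks(links, c, iterations=50):
    """Iterate R(u) = (1 - c)/n + c * sum(R(v) / N(v)); c = 1.0 disables the random jump."""
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        ranks = {
            page: (1 - c) / n + c * sum(ranks[src] / len(outs)
                                        for src, outs in links.items() if page in outs)
            for page in links
        }
    return ranks

no_jump = iterate_ranks(LINKS, c=1.0)     # pure link following
with_jump = iterate_ranks(LINKS, c=0.85)  # surfer sometimes restarts at random
print({p: round(r, 3) for p, r in no_jump.items()})
print({p: round(r, 3) for p, r in with_jump.items()})
```

Without the jump, the loop absorbs all of the rank and "s" drops to zero; with it, every page keeps a small baseline share.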
18
19
PageRank : Comparison
• Web Graph Properties
  – Uses graph of the entire web: depends on full crawl
  – More sophisticated than simply summing in/out-degrees
• Web Page Significance
  – Uses Boolean Spread Activation: match all words
  – Enhanced citation analysis, building on work of Kleinberg, Egghe & Rousseau
  – Doesn’t suffer from Tightly Knit Communities effect of Kleinberg’s Hubs & Authorities
20
PageRank : Further Work
• Personalised PageRank, Haveliwala, 1999
  – In-memory, block-oriented algorithm
    • PageRank can be computed in an hour on a PIII 450 MHz using less than 100 MB of main memory
  – Compute PageRank on the client side
    • Use local information: bookmarks, searches, history
    • Provide the link structure of the web on a DVD
  – 11/11/05, “Personalized Search” released
21
PageRank : Further Work (2)
• Topic Sensitive PageRank, Haveliwala, 2002
  – Improve Google by giving weight to the informational relationship between sites
  – A) Uniform Results
    • Similar to ‘current’ Google but with topics
  – B) Personalised to a particular user
    • Based on previous searches and users’ surfing habits
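One way to picture personalisation is to bias where the bored surfer restarts, that is, to replace the uniform E(u) with one concentrated on pages the user cares about. The sketch below assumes that reading; the three-page graph, the bookmark set and the normalisation (ranks sum to 1) are made up for the example and are not taken from the cited papers.

```python
# Hypothetical three-page cycle: a -> b -> c -> a.
LINKS = {"a": ["b"], "b": ["c"], "c": ["a"]}

def personalised_pagerank(links, jump, c=0.85, iterations=100):
    """Iterate R(u) = (1 - c) * E(u) + c * sum(R(v) / N(v)), with E given by `jump`."""
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        ranks = {
            page: (1 - c) * jump.get(page, 0.0)
                  + c * sum(ranks[src] / len(outs)
                            for src, outs in links.items() if page in outs)
            for page in links
        }
    return ranks

uniform = personalised_pagerank(LINKS, {page: 1 / 3 for page in LINKS})
personal = personalised_pagerank(LINKS, {"a": 1.0})  # the user has bookmarked only "a"
print({p: round(r, 3) for p, r in uniform.items()})
print({p: round(r, 3) for p, r in personal.items()})
```

With a uniform jump the three pages tie; once every restart lands on the bookmarked page, "a" and the pages reached soonest from it gain rank.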
22
Applications : Google
• Google Inc.
  – Largest search engine
    • Technologies utilised by others (e.g. Yahoo!)
    • Biggest ever technology IPO, 2004
  – Redefining search
    • Set a trend for other search providers
    • Raised importance of quality web search results
    • Combining information retrieval methods
  – Business model based on advertising
    • Potential area for conflict
    • Over 100 factors now influence results
23
Applications : PageRank
• Back-link prediction
  – Desire for optimal web crawling strategy
  – Better indicator than citation counts!
• Improving user navigation
  – ‘The PageRank Proxy’
  – Providing PageRank information with links
• Establishing trust
  – Wealth of authors on the web, who to trust?
  – Use PageRank to rate trust
24
Applications : The Future
• Internal Development
  – Project no longer in academic realm
    • Lacks the transparency initially intended
    • Role of PageRank unclear
    • Likely focus on extensions and results tuning
• External Development
  – APIs
    • Allowing innovative use of Google technologies
  – Open Source Code
    • Focused on developing infrastructure
25
Conclusions
• Academic Background
  – Success from strong academic understanding
  – Raised profile of informatics and search
  – Good platform for future research
• Success as a failure
  – Intention for transparency and use in academia
  – Commercial success has removed transparency
  – Potentially bad for further research in this area
26
Summary
• We have seen:
  – The architecture used by Google
  – PageRank as a web metric
  – Strengths and potential manipulations
  – The commercial success of Google
  – Applications
  – Potential areas of future research
27
References
• Work by Brin & Page (now at Google)
  – Brin, S., Page, L. (1998), ‘The anatomy of a large-scale hypertextual Web search engine’, Computer Networks and ISDN Systems, 30(1-7):107-117.
  – Page, L., Brin, S., Motwani, R. and Winograd, T. (1998), ‘The PageRank Citation Ranking: Bringing Order to the Web’, Stanford Digital Library Technologies Project.
  – More papers at: http://www.google.com on many aspects of web metrics and search in general
• PageRank
  – http://www.iprcom.com/papers/pagerank/
  – Take a look at the example at: http://www.dcs.warwick.ac.uk/~csucbu
  – http://en.wikipedia.org/wiki/Google_bomb
28
References (2)
• Further Developments
  – Haveliwala, T. H. (1999), ‘Efficient computation of PageRank’, Technical report, Stanford University, Stanford, CA.
  – Haveliwala, T. H. (2002), ‘Topic-sensitive PageRank’, in Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, May 2002.
• Commercial Aspect
  – http://money.cnn.com/2004/04/29/technology/google/
  – http://www.google.com/corporate/history.html
• Web Metrics
  – Dhyani, D., Ng, W. K., and Bhowmick, S. (2002), ‘A survey of web metrics’, ACM Computing Surveys, 34(4):469-503.