Web Search – Summer Term 2006 VI. Web Search - Ranking

Web Search – Summer Term 2006

VI. Web Search -Ranking

(c) Wolfgang Hürst, Albert-Ludwigs-University

General Web Search Engine Architecture

CLIENT

QUERY ENGINE

RANKING

CRAWL CONTROL

CRAWLER(S)

USAGE FEEDBACK

RESULTSQUERIES

WWW

COLLECTION ANALYSIS MOD.

INDEXER MODULE

PAGE REPOSITORY

INDEXESSTRUCTUREUTILITY TEXT

(CF. [1] FIG. 1)

1st Generation Web Search EnginesUse only "on page" text data

- word frequency, language

1995-1997 (AltaVista, Excite, Lycos, etc.)

- Extended Boolean model - Matches: Exact, prefix, phrase, ... - Operators: AND, OR, AND NOT, NEAR, ... - Fields: TITLE, URL, HOST, ... - AND is somewhat easier to implement, maybe preferable as default for short queries- Ranking - TF like factors: TF, explicit keywords, words in the title, explicit emphasis (headers), etc. - IDF factors: IDF, total word count in corpus, frequency in query log, frequency in language

TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002

Classic IR vs. Web Search

INFORMATION

QUERY

DATA / DOCUMENTS

INFORMATION NEED

The web is huge.Very huge.

Big variety in dataBig variety in users

Doc. authors don't cooperate (spam,...)

Users don't cooperate (short queries, ...)

The no. of users ishuge. Very huge.

.. but basic conditions & characteristics differ significantly

Initial problem is similar to traditional IR ...

Classic IR vs. Web SearchClassic IR: Generally a 2-step process 1. Select subset of relevant documents (based on term appearance, etc.) 2. Rank / sort by estimated relevance (based on term frequency, etc.)

Problems with web search: - Size of the WWW - Heterogeneity (length, quality, format, ...) - Not always self descriptive (e.g. search engine) - Can be manipulated (spam)

Therefore: Pure content based ranking critical

The evolution of search engines1st generation: Use only "on page", text data- Word frequency, language

1995-1997 (AltaVista, Excite, Lycos, etc.)

2nd gen.: Use off-page, web-specific data- Link (or connectivity) analysis- Click-through data (what results people click on)- Anchor-text (how people refer to a page)

From 1998 (made popular by Google but everyone now)

TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002

Some 2nd generation web IR ranking criteria

- Extended Boolean model

- Content-based techniques (variants of term vector or probabilistic model)

- Ad-hoc factors (anti-porn heuristics, publication/location data, url depth, ...)

- Human annotations (in Yahoo, ODP, etc.)

- Popularity: Click-through feedback ("People vote with their clicks", DirectHit)

- Anchor text

- Connectivity-based techniquesTUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002

The Evolution of Search EnginesRough summary:

First web search engines: Like tradition IR, i.e. 1. Search: Textprocessing to find pages 2. Ranking: Analysis of term distributions (and some other things)

New / current web search engines: 1. Search: Textprocessing to find pages (incl. anchor text and some other things) 2. Ranking: Link analysis (and some other things)

The Web GraphWhat do we know about the link structure of the web?

SOURCE: [3]

Considering the link structureIdea: Use link structure to get a content-independent feature to estimate relevanceof a web page

Why? Because a link from page A to page B- Often represents a relationship of their contents- Can be interpreted as a reference or recommendation

Compare citations in scientific literature- Articles with lots of citations: Considered to have a high scientific value(Bibliometric laws)

Considering the link structureFirst approach: Use the number of backlinks as a measure for a page's relevance

Problem: web vs. scientific publications- Publications: rigorous review process, high quality, well-defined length and style, same goal, similar number of citations, etc.- Web pages: no quality control, no publication costs, can be manipulated (self-citations), large variance regarding quality, number of links, intended usage, length and style, etc.(Note: Compare differences to classic IR)

Idea: Consider reputation and credibility of a link, i.e. the quality and importance of a web page

Simple PageRank - DefinitionGoal: Measure for the importance and quality of a web page

Definition: Simple PageRank- Assume the web is a strongly connected graph- Define the Simple PageRank R of a page p as

R(p) = qBp

R(q)

Nq

Simple PageRank - Example

R(1) = 0,286

R(2) = 0,286

R(3) = 0,143

R(5) = 0,143 R(4) = 0,143

1

2

3

5 4

SOURCE: [3]

Calculation:- Set of linear equations, but direct calculation of a solution too expensive- But: R is eigenvector of A (with eigenvalue 1)- Hence: Efficient approaches for iterative calculation exist (e.g. power iteration)

Simple PageRank - CalculationMathematical representation:If R = mx1 vector [R(1), R(2), ..., R(m)]and A = matrix with aij

= 1 / N(j) if link from i to j exists = 0 otherwise

Then R = A R R(p) = qBp

R(q)

Nq

Simple PageRank - CalculationMathematical representation:If R = mx1 vector [R(1), R(2), ..., R(m)]and A = matrix with aij

= 1 / N(j) if link from i to j exists = 0 otherwise

Then R = A R

Convergence:- Yes! (see literature for a proof)- Number of iterations can be reduced by - using "good" starting values - interruption as soon as the ranking order will (most likely) not change anymore

But: Only true for strongly connected graphs

R(p) = qBp

R(q)

Nq

Problems if graph is not strongly connected

Page Sink = group of pages that link to each other but have no outside links

Page Leak = single page with no outside links (special case of a page sink)

What happens with Leaks?

1

2

3

5 4

R(p) = qBp

R(q)

Nq

Possible Solutions:- Remove all Leaks before the PageRank is calculated- Or: Insert back links for each forward link to a Leak

What happens with Sinks?

1

2

3

5 4

R(p) = qBp

R(q)

Nq

Possible Solutions:- Introduce a damping factor d

PageRank - Definition and Example

R(1) = 0,154

R(2) = 0,142

R(3) = 0,101

R(5) = 0,290 R(4) = 0,313

1

2

3

5 4SOURCE: [3]

Definition: PageRank R of page p

Example:

Intuitive Interpretation

Link: Recommendation and quality statement

- Important pages link to important pages

- Many links from important pages: Target is important as well

- Importance influences other pages and depends on them, hence: recursive calculation (relatively robust against spam)

Another Intuitive InterpretationThe Random Surfer Model: A random surfer performs a random walk on the web graph

- Surfer on page q -> probability to click on any of the links = 1/Nq

- Probability to get to page p: Sum of the probabilities of all links q in Bq times the probability of being on page q

- The damping factor d represents the probability of going to a random new page

R(p) = qBp

R(q)

Nq

The Random Surfer Model

Comments to the Random Surfer Model:

Easy, illustrative model to explain the involved mathematics, but many limitations:

- Does not model the back button

- Jumps and selection of links usually not random or uniformly distributed

- Random walk can hardly be seen as a good search strategy

PageRank - First conclusion

PageRank = Quality or importance measure for a web page

Harder to manipulate than pure backlinks

Content and query independent

Ranking:-> The higher the PageRank,-> the higher a page's quality or importance,-> the higher the page's relevance

Note: Other applications exist (e.g. crawling)

PageRank in the 1st Google engine

Good for "common cases", not necessarily when searching for a particular information

In practice: Combination with other features

Example: First Google search engine [2]

- Hits classified in different types (anchor, title, ...), represented by a vector of type-weights- Term distribution / no. of hits, represented by a vector of count-weights- Scalar product of both vectors = IR score- Final rank = combination of IR score and PageRank

PageRank today

PageRank and web search:

- Most likely used by most search engines (in some way)

Other interesting (research) problems:

- More efficient calculation- Storage and speed- Update- Fighting spam- Make better usage of the structure of the web- Consider the continuous change of the web's structure- Better study of its characteristics (practical + theor.): - Stability and sensitivity - Influence of the linkage - Influence of different parameters- Personalization and specialization

References - Indexing[1] A. ARASU, J. CHO, H. GARCIA-MOLINA, A. PAEPCKE, S. RAGHAVAN: "SEARCHING THE WEB", ACM TRANSACTIONS ON INTERNET TECHNOLOGY, VOL 1/1, AUG. 2001Chapter 5 (Ranking and Link Analysis)

[2] S. BRIN, L. PAGE: "THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL WEB SEARCH ENGINE", WWW 1998Chapter 2 and 4.5.1

[3] BORDER, KUMAR, MAGHOUL, RAGHAVAN, RAJAGOPALAN, STATA, TOMKINS, WIENER: "GRAPH STRUCTURE IN THE WEB", WWW 2000

[4] PAGE, BRIN, MOTWANI, WINOGRAD: "THE PAGERANK CITATION RANKING: BRINGING ORDER TO THE WEB", STANFORD TECHNICAL REPORT

[5] HAVELIWALA: "TOPIC-SENSITIVE PAGERANK", WWW 2002

Web Search – Summer Term 2006 VI. Web Search - Ranking

Documents

Transcript of Web Search – Summer Term 2006 VI. Web Search - Ranking