Link analysis in web IR
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani
Fall 2016
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Hypertext and links
We look beyond the content of documents
We begin to look at the hyperlinks between them
Address questions like
Do the links represent a conferral of authority to some pages? Is this
useful for ranking?
How likely is it that a page pointed to by the CERN home page is about
high energy physics?
Big application areas
The Web
Social networks: a rich source of grouping behavior
E.g., Shoppers’ affinity – Goel+Goldstein 2010: Consumers whose
friends spend a lot, spend a lot themselves
Our primary interest in this course
Analogs of most IR functionality based purely on text
Scoring and ranking
Powerful sources of authenticity and authority are achievable via links
Link-based clustering
topical structure from links
Links as features in classification
docs that link to one another are likely to be on the same subject
Crawling
Based on the links seen, where do we crawl next?
The Web as a Directed Graph
Assumption 1: A hyperlink between pages denotes a conferral of authority
(quality signal)
[author of the link thinks that the linked-to page is high-quality]
Assumption 2: The text in the anchor of the hyperlink describes the target
page (textual context)
[Figure: Page A contains a hyperlink, with anchor text, pointing to Page B.]
Sec. 21.1
Anchor text
We use "anchor text" somewhat loosely here to mean the text
surrounding the hyperlink.
Example:
“<a href=http://www.acm.org/jacm> Journal of the ACM </a>.”
(Extended) anchor text: "You can find cheap cars here"
Assumption 1: Reputed sites
Assumption 2: Annotation of target
Indexing anchor text
Indexing a doc d also includes (with some weight) anchor
text from links pointing to d.
Searching on [text of d] + [anchor text → d] is often more
effective than searching on [text of d] only.
[Figure: pages linking to www.ibm.com with anchor text. One page reads "Armonk, NY-based computer giant IBM announced today"; another, "Joe's computer hardware links", lists Sun, HP, and IBM; a third reads "Big Blue today announced record profits for the quarter".]
Sec. 21.1.1
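As a concrete sketch of this idea (the document names, link data, and the weight of 2 below are all hypothetical), terms from inbound anchors can be credited to the *target* document's term counts with extra weight:

```python
from collections import Counter

ANCHOR_WEIGHT = 2  # hypothetical extra weight for anchor-text terms

def build_index(docs, links):
    """docs: {doc_id: body text}; links: (source, target, anchor text) triples."""
    index = {d: Counter(text.lower().split()) for d, text in docs.items()}
    for _src, target, anchor in links:
        for term in anchor.lower().split():
            index[target][term] += ANCHOR_WEIGHT  # credit anchor terms to the target
    return index

docs = {"www.ibm.com": "graphics and products"}
links = [("joes-links", "www.ibm.com", "IBM"),
         ("news-page", "www.ibm.com", "Big Blue IBM")]
index = build_index(docs, links)
# "ibm" now matches www.ibm.com even though its body never mentions it
```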
Anchor text: example
Example: Query IBM
Matches IBM’s copyright page
Matches many spam pages
Matches IBM wikipedia article
May not match IBM home page!
. . . if IBM home page is mostly graphics
Searching on [anchor text → d2] is better for the query IBM.
In this representation, the page with the most occurrences of
IBM is www.ibm.com.
Indexing anchor text
Can score anchor text with weight depending on the
authority of the anchor page’s website
E.g., if we were to assume that content from cnn.com or
yahoo.com is authoritative, then trust this anchor text
Can sometimes have unexpected side effects
Sec. 21.1.1
Google bombs
A Google bomb is a search with “bad” results due to
maliciously manipulated anchor text.
Google introduced a new weighting function in January
2007 that fixed many Google bombs.
Defused Google bombs: [who is a failure?], [evil empire]
Origins of PageRank
We can apply citation analysis to hyperlinks on the web,
treating links like citations in the scientific literature
Appropriately weighted citation frequency is an excellent
measure of quality . . .
. . . both for web pages and for scientific publications.
Next: PageRank algorithm for computing weighted
citation frequency on the web
The web isn’t scholarly citation
Millions of participants, each with their own self-interest
Spamming is widespread
Once search engines began to use links for ranking
(roughly 1998), link spam grew
You can join a group of websites that heavily link to one
another
Random walk
Imagine a web surfer doing a random walk on the web
Start at a random page
At each step, go out of the current page along one of the links on that
page, equiprobably
In the steady state, each page has a long-term visit rate.
This long-term visit rate is the page’s PageRank.
PageRank = long-term visit rate
= steady state probability
Not quite enough
The web is full of dead-ends.
Random walk can get stuck in dead-ends.
Makes no sense to talk about long-term visit rates.
Sec. 21.2
Teleporting
At a dead end, jump to a random web page with
probability 1/N.
At any non-dead end, with probability 𝛼 (e.g. 10%), jump
to a random web page.
With remaining probability, go out on one of the links on that
page, equiprobably
Sec. 21.2
Result of teleporting
Now cannot get stuck locally.
There is a long-term rate at which any page is visited (not
obvious, will show this).
How do we compute this visit rate?
Sec. 21.2
Computing the probability matrix
The uniform vector 𝐯 = (1/𝑁, 1/𝑁, … , 1/𝑁)
Set 𝐏 to 𝐀 (the adjacency matrix of the graph), normalizing each nonzero row by its row sum
For each dead end: set the corresponding row of 𝐏 to 𝐯
Add teleportation: 𝐏 ← (1 − 𝛼)𝐏 + 𝛼𝟏𝐯, i.e., each row 𝐏𝑖 becomes (1 − 𝛼)𝐏𝑖 + 𝛼𝐯
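The construction above can be sketched as follows (the small adjacency matrix is hypothetical; 𝛼 = 0.1 as in the earlier teleporting slide):

```python
def transition_matrix(A, alpha=0.1):
    """Build the PageRank transition matrix from adjacency matrix A."""
    N = len(A)
    v = [1.0 / N] * N                       # uniform teleport vector
    P = []
    for row in A:
        s = sum(row)
        if s == 0:                          # dead end: replace row with v
            P.append(list(v))
        else:
            P.append([x / s for x in row])  # normalize row by its row sum
    # mix in teleportation: each row P_i becomes (1 - alpha) * P_i + alpha * v
    return [[(1 - alpha) * P[i][j] + alpha * v[j] for j in range(N)]
            for i in range(N)]

A = [[0, 1, 0],
     [1, 1, 1],
     [0, 0, 0]]   # page 2 is a dead end
P = transition_matrix(A)
# every row of P now sums to 1, so P is a valid transition matrix
```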
Markov chains
A Markov chain consists of 𝑛 states, plus an 𝑛×𝑛 transition
probability matrix 𝑷.
At each step, we are in exactly one of the states.
𝑃𝑖𝑗: probability of 𝑗 being the next state, given we are
currently in state 𝑖.
Clearly, for all 𝑖: Σ{𝑗=1..𝑛} 𝑃𝑖𝑗 = 1.
Sec. 21.2.1
Ergodic Markov chains
Theorem: For any ergodic Markov chain, there is a unique
long-term visit rate for each state.
Steady-state probability distribution.
Over a long time-period, we visit each state in proportion
to this rate.
It doesn’t matter where we start.
Sec. 21.2.1
Ergodic Markov chains
A Markov chain is ergodic iff it is irreducible and
aperiodic.
Irreducibility. Roughly: there is a path from any page to any
other page.
Aperiodicity. Roughly: The pages cannot be partitioned such
that the random walker visits the partitions only sequentially.
Probability vectors
A probability (row) vector 𝒙 = (𝑥1, … , 𝑥𝑛) tells us where
the walk is at any point.
E.g., (0 0 0 … 1 … 0 0 0) means we're in state 𝑖.
More generally, the vector 𝒙 = (𝑥1, … , 𝑥𝑛) means the walk is in
state 𝑖 with probability 𝑥𝑖, with Σ{𝑖=1..𝑛} 𝑥𝑖 = 1.
Sec. 21.2.1
Change in probability vector
If the probability vector is 𝒙 = (𝑥1, … 𝑥𝑛) at this step,
what is it at the next step?
Recall that row 𝑖 of the transition probability matrix 𝑷 tells us
where we go next from state 𝑖.
So from 𝒙, our next state is distributed as 𝒙𝑷
The one after that is 𝒙𝑷², then 𝒙𝑷³, etc.
(Where) does the sequence 𝒙, 𝒙𝑷, 𝒙𝑷², … converge?
Sec. 21.2.1
How do we compute this vector?
Let 𝒂 = (𝑎1, … 𝑎𝑛) denote the row vector of steady-state
probabilities.
If our current position is described by 𝒂, then the next step is
distributed as 𝒂𝑷.
But 𝒂 is the steady state, so 𝒂 = 𝒂𝑷.
Solving this matrix equation gives us 𝒂.
So 𝒂 is the (left) eigenvector for 𝑷.
Corresponds to the “principal” eigenvector of 𝑷 with the largest eigenvalue.
Transition probability matrices always have largest eigenvalue 1.
Sec. 21.2.2
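A minimal sketch of solving 𝒂 = 𝒂𝑷 by power iteration (repeatedly applying 𝒙 ← 𝒙𝑷); the 3-state transition matrix below is hypothetical:

```python
def power_iteration(P, tol=1e-10, max_iter=10000):
    """Iterate x <- xP from the uniform distribution until convergence."""
    N = len(P)
    x = [1.0 / N] * N
    for _ in range(max_iter):
        nxt = [sum(x[i] * P[i][j] for i in range(N)) for j in range(N)]
        if max(abs(u - w) for u, w in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x

P = [[0.1, 0.8, 0.1],
     [0.3, 0.4, 0.3],
     [0.1, 0.1, 0.8]]
a = power_iteration(P)
# a satisfies a = aP: it is the left eigenvector of P with eigenvalue 1
```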
Example
[This slide has been adapted from Schütze's slides; figure omitted]
𝒂 = (0.05, 0.04, 0.11, 0.25, 0.21, 0.04, 0.31)
Pagerank summary
Preprocessing:
Given graph of links, build matrix 𝑷.
From it compute 𝒂 – left eigenvector of 𝑷.
The entry 𝑎𝑖 is a number between 0 and 1: the pagerank of
page 𝑖.
Query processing:
Retrieve pages meeting query.
Rank them by their pagerank.
But this rank order is query-independent…
Sec. 21.2.2
The reality
Pagerank is used in Google and other engines, but is
hardly the full story of ranking
Many sophisticated features are used
Some address specific query classes
Machine learned ranking (Lecture 19) heavily used
Pagerank is still very useful for things like crawl policy
Personalized pagerank
In the basic pagerank computation, teleportation
probability vector v is uniform over all pages
But if the user has preferences on which pages to
teleport to, that preference can be represented in v
v could be uniform over user’s bookmarks
Or it could be non-zero on just pages on topics of interest to
the user
Pagerank would be personalized to user’s interests
[Haveliwala 2003] [Jeh and Widom 2003]
Personalized pagerank
[Figure: a user whose interests are 60% sports and 40% politics; with 𝛼 = 10%, the walk teleports 6% of the time to sports pages and 4% to politics pages.]
Linearity theorem
If 𝒂1 and 𝒂2 are the personalized pagerank vectors for teleport vectors 𝒗1 and 𝒗2 (e.g., 𝒗1 uniform over the pages of topic 1), then for any non-negative constants 𝜆1, 𝜆2 (𝜆1 + 𝜆2 = 1), the vector 𝜆1𝒂1 + 𝜆2𝒂2 is the personalized pagerank vector for the teleport vector 𝜆1𝒗1 + 𝜆2𝒗2.
Proof
Each 𝒂𝑖 satisfies 𝒂𝑖 = (1 − 𝛼)𝒂𝑖𝑷 + 𝛼𝒗𝑖. Therefore
𝜆1𝒂1 + 𝜆2𝒂2 = 𝜆1((1 − 𝛼)𝒂1𝑷 + 𝛼𝒗1) + 𝜆2((1 − 𝛼)𝒂2𝑷 + 𝛼𝒗2)
= (1 − 𝛼)(𝜆1𝒂1 + 𝜆2𝒂2)𝑷 + 𝛼(𝜆1𝒗1 + 𝜆2𝒗2),
so 𝜆1𝒂1 + 𝜆2𝒂2 is the steady-state vector for teleport vector 𝜆1𝒗1 + 𝜆2𝒗2, where 𝜆1 is the probability of teleporting to the pages of topic 1.
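The theorem can be checked numerically. The graph, teleport vectors, mixing weights, and 𝛼 below are hypothetical, and personalized pagerank is computed by simple iteration of 𝒂 ← (1 − 𝛼)𝒂𝑷 + 𝛼𝒗:

```python
def personalized_pagerank(P, v, alpha=0.15, iters=200):
    """Iterate a <- (1 - alpha) * aP + alpha * v, starting from a = v."""
    N = len(P)
    a = list(v)
    for _ in range(iters):
        a = [(1 - alpha) * sum(a[i] * P[i][j] for i in range(N)) + alpha * v[j]
             for j in range(N)]
    return a

P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [1.0, 0.0, 0.0]]
v1 = [1.0, 0.0, 0.0]          # teleport only to page 0 ("sports")
v2 = [0.0, 0.0, 1.0]          # teleport only to page 2 ("politics")
l1, l2 = 0.6, 0.4

a1 = personalized_pagerank(P, v1)
a2 = personalized_pagerank(P, v2)
mix = personalized_pagerank(P, [l1 * x + l2 * y for x, y in zip(v1, v2)])
# mix agrees componentwise with l1*a1 + l2*a2, as the theorem predicts
```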
Topic-sensitive pagerank
Compute personalized pagerank vector per topic
16 top-level topics from the Open Directory Project
Each ODP topic has a set of pages (hand-)classified into that
topic
Preference vector for the topic is uniform over pages in that
topic, and 0 elsewhere
Query-time processing
Construct a distribution over topics for the query
User profile can provide a distribution over topics
Query can be classified into the different topics
Any other context information can be used to inform topic
distributions
Use the topic preferences to compute a weighted linear
combination of topic pagerank vectors to use in place of
pagerank
Hyperlink-Induced Topic Search (HITS)
In response to a query, instead of an ordered list of pages each
meeting the query, find two sets of inter-related pages:
Hub pages are good lists of links on a subject.
e.g.,“Bob’s list of cancer-related links.”
Authority pages occur recurrently on good hubs for the subject.
Best suited for "broad topic" queries rather than for page-finding queries.
Gets at a broader slice of common opinion.
Sec. 21.3
Hubs and Authorities
Thus, a good hub page for a topic points to many
authoritative pages for that topic.
A good authority page for a topic is pointed to by many
good hubs for that topic.
Circular definition - will turn this into an iterative
computation.
Sec. 21.3
Hubs & Authorities
[Figure: hub pages (Alice's and Bob's link lists) pointing to authority pages for mobile telecom companies: AT&T, ITIM, O2.]
Sec. 21.3
High-level scheme
Extract from the web a base set of pages that could be
good hubs or authorities.
From these, identify a small set of top hub and authority
pages;
iterative algorithm.
Sec. 21.3
Base set
Given text query (say information), use a text index to
get all pages containing information.
Call this the root set of pages.
Add in any page that either
points to a page in the root set, or
is pointed to by a page in the root set.
Call this the base set.
Sec. 21.3
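The root-set/base-set construction above can be sketched as follows, assuming the link graph is available as (source, target) edges (all page names are hypothetical):

```python
def base_set(root, edges):
    """Expand a root set of pages into the HITS base set."""
    base = set(root)
    for src, dst in edges:
        if dst in root:      # pages pointing to the root set
            base.add(src)
        if src in root:      # pages pointed to by the root set
            base.add(dst)
    return base

edges = [("a", "b"), ("b", "c"), ("d", "a"), ("c", "e")]
root = {"a", "b"}
# the base set adds d (links into the root set) and c (linked from it),
# but not e, which is only reachable from outside the root set
```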
Visualization
[Figure: the base set consists of the root set plus all pages that link to it or are linked from it.]
Sec. 21.3
Iterative update
Repeat the following updates, for all 𝑥:
ℎ(𝑥) = Σ{𝑦: 𝑥↦𝑦} 𝑎(𝑦)
𝑎(𝑥) = Σ{𝑦: 𝑦↦𝑥} ℎ(𝑦)
Sec. 21.3
Distilling hubs and authorities
Compute, for each page x in the base set, a hub score
ℎ(𝑥) and an authority score 𝑎(𝑥):
Initialize: for all 𝑥, ℎ(𝑥) ← 1; 𝑎(𝑥) ← 1;
Iteratively update all ℎ(𝑥), 𝑎(𝑥);
After iterations
output pages with highest h scores as top hubs
output pages with highest a scores as top authorities.
Sec. 21.3
Iterative algorithm
a = (1, 1, . . . , 1)
h = (1, 1, . . . , 1)
repeat
  for i = 1 to n do
    ℎ𝑖 = Σ{𝑗: 𝐴𝑖𝑗=1} 𝑎𝑗
  for i = 1 to n do
    𝑎𝑖 = Σ{𝑗: 𝐴𝑗𝑖=1} ℎ𝑗
  Normalize(a)
  Normalize(h)
until convergence
To prevent the ℎ and 𝑎 values from getting too big,
can scale down after each iteration.
Scaling factor doesn’t really matter:
we only care about the relative values of the scores.
How many iterations?
Claim: relative values of scores will converge after a few
iterations:
Suitably scaled, ℎ and 𝑎 scores settle into a steady state!
proof of this comes later.
In practice, ~5 iterations get you close to stability.
Sec. 21.3
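The iterative algorithm above can be sketched in runnable Python (the 3-page adjacency matrix is a small hypothetical example; normalization uses the L2 norm, one reasonable choice since the scaling factor doesn't matter):

```python
import math

def hits(A, iters=50):
    """HITS on adjacency matrix A (A[i][j] = 1 if page i links to page j)."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        a = [sum(h[j] for j in range(n) if A[j][i]) for i in range(n)]  # a(x): sum of h over in-links
        h = [sum(a[j] for j in range(n) if A[i][j]) for i in range(n)]  # h(x): sum of a over out-links
        na = math.sqrt(sum(x * x for x in a)) or 1.0
        nh = math.sqrt(sum(x * x for x in h)) or 1.0
        a = [x / na for x in a]
        h = [x / nh for x in h]
    return h, a

A = [[0, 1, 0],
     [1, 1, 1],
     [1, 0, 0]]
h, a = hits(A)
# page 1 links to every page, so it ends up with the top hub score
```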
Hub and authority: example
Query: Japan Elementary Schools
Top hubs (mostly link-list pages): schools; LINK Page-13; 100 Schools Home Pages (English); K-12 from Japan 10/...rnet and Education ); http://www...iglobe.ne.jp/~IKESAN; Koulutus ja oppilaitokset (Finnish: "education and schools"); TOYODA HOMEPAGE; Education; Cay's Homepage (Japanese); UNIVERSITY; DRAGON97-TOP; and several pages with Japanese titles (garbled in this transcript).
Top authorities (individual school home pages): The American School in Japan; The Link Page; Kids' Space; KEIMEI GAKUEN Home Page (Japanese); Shiranuma Home Page; fuzoku-es.fukui-u.ac.jp; welcome to Miasa E&J school; http://www...p/~m_maru/index.html; fukui haruyama-es HomePage; Torisu primary school; goo; Yakumo Elementary, Hokkaido, Japan; FUZOKU Home Page; Kamishibun Elementary School; and several pages with Japanese titles (garbled in this transcript).
Sec. 21.3
Things to note
Pulled together good pages regardless of language of page
content.
Use only link analysis after the base set is assembled
iterative scoring is query-independent.
Iterative computation after text index retrieval -
significant overhead.
Sec. 21.3
Proof of convergence
𝑁×𝑁 adjacency matrix 𝑨:
each of the 𝑁 pages in the base set has a row and column
in the matrix.
Entry 𝐴𝑖𝑗 = 1 if page 𝑖 links to page 𝑗, else 𝐴𝑖𝑗 = 0.
Example (pages 1, 2, 3):
𝑨 =
0 1 0
1 1 1
1 0 0
Sec. 21.3
Rewrite in matrix form
𝒉 = 𝑨𝒂.
𝒂 = 𝑨𝑇𝒉.
Substituting, 𝒉 = 𝑨𝑨𝑇𝒉 and 𝒂 = 𝑨𝑇𝑨𝒂.
Thus, 𝒉 is an eigenvector of 𝑨𝑨𝑇and 𝒂 is an
eigenvector of 𝑨𝑇𝑨.
Further, our iterative algorithm is indeed a known algorithm for
computing eigenvectors: the power iteration method.
Guaranteed to converge.
Sec. 21.3
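To illustrate the claim, power iteration on 𝑨𝑨^𝑇 for the 3-page example matrix converges to a hub vector 𝒉 satisfying 𝑨𝑨^𝑇𝒉 = 𝜆𝒉 (a plain-Python sketch):

```python
import math

A = [[0, 1, 0],
     [1, 1, 1],
     [1, 0, 0]]
n = len(A)

def matmul_vec(M, x):
    """Matrix-vector product M x."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

# Form A * A^T explicitly
At = [[A[j][i] for j in range(n)] for i in range(n)]
AAt = [[sum(A[i][k] * At[k][j] for k in range(n)) for j in range(n)]
       for i in range(n)]

# Power iteration: h converges to the principal eigenvector of A A^T
h = [1.0] * n
for _ in range(100):
    h = matmul_vec(AAt, h)
    norm = math.sqrt(sum(x * x for x in h))
    h = [x / norm for x in h]
# h now satisfies AAt @ h ≈ lambda * h for the largest eigenvalue lambda
```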
Issues
Topic Drift
Off-topic pages can cause off-topic “authorities” to be
returned
E.g., the neighborhood graph can be about a “super topic”
Mutually Reinforcing Affiliates
Affiliated pages/sites can boost each other's scores
Linkage between affiliated pages is not a useful signal
Sec. 21.3
Resources
IIR Chap 21