Post on 19-Jan-2016
The Structure of Broad Topics on the Web
Soumen Chakrabarti, Mukul M. Joshi, Kunal Punera (IIT Bombay)
David M. Pennock (NEC Research Institute)
Graph structure of the Web
Over two billion nodes, two trillion links
Power-law degree distribution
• $\Pr(\text{degree} = k) \propto 1/k^{2.1}$
Looks like a “bow-tie” at large scale
[Figure: bow-tie diagram with IN, OUT, and a strongly connected core (SCC), labeled “This is the Web”]
The need for content-based models
Why does a radius-1 expansion help in topic distillation?
Why does topic-specific focused crawling work?
Why is a global PageRank useful for specific queries?
[Figure: Query → Search engine → Root set; Crawler with Classifier: check frontier topic, prune if irrelevant]
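[Figure: PageRank walk]
$$p(v) = \frac{d}{N} + (1 - d) \sum_{u \to v} \frac{p(u)}{\mathrm{OutDegree}(u)}$$
First term: uniform jump. Second term: walk to out-neighbor.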
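The PageRank walk on this slide takes a uniform jump with probability d and otherwise steps to a random out-neighbor. A minimal power-iteration sketch follows. The toy graph, the default d = 0.15, and the function name are illustrative assumptions, and the sketch assumes every node has at least one out-link.

```python
def pagerank(out_links, d=0.15, iterations=50):
    """Power iteration for p(v) = d/N + (1-d) * sum over u->v of p(u)/OutDegree(u).

    out_links maps each node to a list of its out-neighbors.
    Assumption: no dangling nodes (every node has at least one out-link).
    """
    nodes = list(out_links)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: d / n for v in nodes}            # uniform jump term
        for u, targets in out_links.items():
            share = (1.0 - d) * p[u] / len(targets)
            for v in targets:                       # walk to out-neighbor term
                nxt[v] += share
        p = nxt
    return p


# Hypothetical three-page graph, for illustration only.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

Because there are no dangling nodes, each iteration preserves total probability, so the scores stay a distribution over pages.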
The need for content-based models
How are different topics linked to each other?
Are topic directories representative of Web topic populations?
Are standard collections (e.g., TREC WT10g) representative of Web topics?
[Figure: “This is the Web with topics”]
How to characterize “topics”
Web directories: the most natural choice
Started with http://dmoz.org
Keep pruning until all leaf topics have enough (>300) samples
Approx 120k sample URLs
Flatten to approx 482 topics
Train text classifier (Rainbow)
Characterize new document d as a vector of probabilities $p_d = (\Pr(c \mid d)\ \forall c)$
[Figure: a test doc passed through the classifier yields a topic probability vector]
Topic       Prob
Arts        0.1
Computers   0.3
Science     0.6
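To make the topic vector concrete, here is a minimal sketch that turns a document into a vector of Pr(c|d) values. The slides name Rainbow as the classifier but do not say which model was used, so the multinomial naive Bayes with add-one smoothing below is an assumption, as are the toy training documents and function names.

```python
import math
from collections import Counter


def train(labeled_docs):
    """labeled_docs: list of (topic, list_of_words). Returns the counts the model needs."""
    word_counts, doc_counts, vocab = {}, Counter(), set()
    for topic, words in labeled_docs:
        doc_counts[topic] += 1
        word_counts.setdefault(topic, Counter()).update(words)
        vocab.update(words)
    return word_counts, doc_counts, vocab


def topic_vector(model, words):
    """Return {c: Pr(c|d)} under multinomial naive Bayes with add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for c in doc_counts:
        total_words = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / total_docs)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        log_scores[c] = score
    top = max(log_scores.values())  # subtract the max before exponentiating
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {c: s / z for c, s in exp_scores.items()}


# Hypothetical training data, for illustration only.
toy_training = [
    ("Arts", ["painting", "gallery", "music"]),
    ("Computers", ["software", "linux", "code"]),
    ("Science", ["physics", "experiment", "code"]),
]
p_d = topic_vector(train(toy_training), ["linux", "code", "software"])
```

The returned dictionary plays the role of the probability table on the slide: one entry per topic, summing to 1.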
Critique and defense
Cannot capture fine-grained or emerging topics
• Emerging topics most often specialize existing broad topics
• Broad topics rarely change
Classifier may be inaccurate
• Adequate if much better than random guessing of topic label
• Can compensate for errors using held-out validation data
Background topic distribution
What fraction of Web pages are about Health?
Sampling via random walk
• PageRank walk (Henzinger et al.)
• Undirected regular walk (Bar-Yossef et al.)
Make graph undirected
Add self-loops so that all nodes have the same degree
Sample with large stride
Collect topic histograms
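The undirected regular walk can be sketched as follows. Padding every node with self-loops up to a common degree is simulated by staying put with the leftover probability, which makes the walk's stationary distribution uniform on a connected graph. The toy edge list, the stride, and the function names are illustrative assumptions, not the settings used in the study.

```python
import random


def make_undirected(links):
    """links: iterable of directed (u, v) edges. Returns node -> set of neighbors."""
    nbrs = {}
    for u, v in links:
        if u == v:
            continue
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def regular_walk_samples(nbrs, start, stride, num_samples, rng):
    """Walk on the self-loop-padded graph and keep every stride-th node visited."""
    max_degree = max(len(n) for n in nbrs.values())
    sorted_nbrs = {u: sorted(n) for u, n in nbrs.items()}
    current, samples = start, []
    for step in range(1, stride * num_samples + 1):
        # Follow a real edge w.p. deg/max_degree; otherwise take a self-loop.
        if rng.random() < len(sorted_nbrs[current]) / max_degree:
            current = rng.choice(sorted_nbrs[current])
        if step % stride == 0:
            samples.append(current)
    return samples


# Hypothetical four-page graph, for illustration only.
toy_links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
samples = regular_walk_samples(
    make_undirected(toy_links), "a", stride=20, num_samples=2000, rng=random.Random(0)
)
```

Classifying each sampled page and accumulating the resulting topic vectors would then give the topic histogram.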
Convergence
Start from pairs of diverse topics
Two random walks, sample from each walk
Measure distance between topic distributions
• L1 distance $|p_1 - p_2| = \sum_c |p_1(c) - p_2(c)|$, in [0, 2]
• Below 0.05 to 0.2 within 300 to 400 physical pages
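The L1 distance between two topic distributions is simple to compute. A small sketch, with hypothetical distributions chosen only to show the two ends of the [0, 2] range:

```python
def l1_distance(p1, p2):
    """Sum over topics c of |p1(c) - p2(c)|; topics missing from a dict count as 0."""
    topics = set(p1) | set(p2)
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in topics)


# Hypothetical topic distributions, for illustration only.
walk_1 = {"Arts": 0.1, "Computers": 0.3, "Science": 0.6}
walk_2 = {"Arts": 0.6, "Computers": 0.3, "Science": 0.1}
partial_overlap = l1_distance(walk_1, walk_2)
disjoint = l1_distance({"Arts": 1.0}, {"Sports": 1.0})
```

Identical distributions give 0, and distributions with no topic in common give the maximum of 2.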
Background distribution
[Figure: bar chart of the background topic distribution (y-axis 0 to 0.4) over Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports]
[Figure: distribution difference (0 to 1) vs. #hops (0 to 1000), for Stride=30k and Stride=75k]
Biases in topic directories
Use Dmoz to train a classifier
Sample the Web
Classify samples
Diff Dmoz topic distribution from Web sample topic distribution
Report maximum deviation in fractions
NOTE: Not exactly Dmoz
Dmoz over-represents
• Games.Video_Games
• Society.People
• Arts.Celebrities
• ...Education.Colleges
• ...Travel.Reservations
Dmoz under-represents
• …WWW…Directories!
• Sports.Hockey
• Society.Philosophy
• Education…K12…
• Recreation…Camping
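The comparison between the directory and the Web sample can be sketched as a per-topic difference of fractions, sorted so that the most over-represented topic comes first and the most under-represented last. Reading "deviation" as a plain difference of fractions is an assumption, and the topic names and counts below are placeholders, not measurements.

```python
def deviations(directory_counts, web_counts):
    """Return (topic, directory fraction - web fraction) pairs, largest first."""
    def fractions(counts):
        total = sum(counts.values())
        return {c: n / total for c, n in counts.items()}

    dir_frac = fractions(directory_counts)
    web_frac = fractions(web_counts)
    topics = set(dir_frac) | set(web_frac)
    diffs = {c: dir_frac.get(c, 0.0) - web_frac.get(c, 0.0) for c in topics}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


# Placeholder topics and counts, for illustration only.
directory_counts = {"TopicA": 50, "TopicB": 30, "TopicC": 20}
web_counts = {"TopicA": 20, "TopicB": 30, "TopicC": 50}
ranked = deviations(directory_counts, web_counts)
```

Here `ranked[0]` is the topic the directory over-represents most and `ranked[-1]` the one it under-represents most.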
Topic-specific degree distribution
Preferential attachment: connect u to v w.p. proportional to the degree of v, regardless of topic
More realistic: u has a topic, and links to v with related topics
Unclear if power-law should be upheld
[Figure: intra-topic linkage vs. inter-topic linkage]
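For reference, the topic-blind preferential attachment baseline can be sketched as below: each new node u links to one existing node v chosen with probability proportional to v's degree. The one-link-per-new-node simplification, the two-node seed graph, and the function name are illustrative assumptions.

```python
import random


def preferential_attachment(num_nodes, rng):
    """Grow a graph where each new node links to an existing node w.p. proportional to its degree."""
    degree = {0: 1, 1: 1}
    edges = [(1, 0)]
    endpoints = [0, 1]  # every node appears once per unit of degree
    for u in range(2, num_nodes):
        v = rng.choice(endpoints)  # uniform over endpoints == proportional to degree
        edges.append((u, v))
        degree[u] = 1
        degree[v] += 1
        endpoints.extend([u, v])
    return edges, degree


edges, degree = preferential_attachment(1000, random.Random(1))
```

A topic-aware variant would additionally weight the choice of v by how related its topic is to u's, which is the "more realistic" model the slide contrasts with this one.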
Random forward walk without jumps
[Figure: two panels, /Arts/Music and /Sports/Soccer, plotting L_1 distance against wander hops (0 to 20), with one curve "From background" and one curve "From hop0"]
Sampling walk is designed to mix topics well
How about walking forward without jumping?
• Start from a page $u_0$ on a specific topic
• Forward random walk $(u_0, u_1, \ldots, u_i, \ldots)$
• Compare $(\Pr(c \mid u_i)\ \forall c)$ with $(\Pr(c \mid u_0)\ \forall c)$ and with the background distribution
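The forward-walk experiment can be sketched as follows: walk along out-links only, and at each hop record the L1 distance of the current page's topic vector from the starting page and from the background. The toy link graph, the precomputed topic vectors, and the background distribution are hypothetical stand-ins for classifier output and Web samples.

```python
import random


def l1(p, q):
    """L1 distance between two topic distributions given as dicts."""
    return sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))


def forward_walk_drift(out_links, topic_vectors, background, start, hops, rng):
    """Return one (distance from hop 0, distance from background) pair per hop taken."""
    p0 = topic_vectors[start]
    current, drift = start, []
    for _ in range(hops):
        targets = out_links.get(current, [])
        if not targets:  # dead end: jumps are not allowed, so the walk stops
            break
        current = rng.choice(targets)
        p_i = topic_vectors[current]
        drift.append((l1(p_i, p0), l1(p_i, background)))
    return drift


# Hypothetical pages and topic vectors, for illustration only.
out_links = {"u0": ["u1"], "u1": ["u2"], "u2": []}
topic_vectors = {
    "u0": {"Music": 0.9, "Soccer": 0.1},
    "u1": {"Music": 0.8, "Soccer": 0.2},
    "u2": {"Music": 0.6, "Soccer": 0.4},
}
background = {"Music": 0.5, "Soccer": 0.5}
drift = forward_walk_drift(out_links, topic_vectors, background, "u0", 5, random.Random(0))
```

Averaging these pairs over many walks from the same starting topic would give the two curves, "From hop0" and "From background".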
Observations and implications
Forward walks wander away from starting topic slowly
But do not converge to the background distribution
Global PageRank ok also for topic-specific queries
• Jump parameter d = 0.1 to 0.2
• Topic drift not too bad within path length of 5 to 10
• Prestige conferred mostly by same-topic neighbors
Also explains why focused crawling works
[Figure: PageRank walk. W.p. d jump to a random node; w.p. (1-d) jump to an out-neighbor u.a.r.; labels: high-prestige node, jump]
Citation matrix
Given a page is about topic i, how likely is it to link to topic j?
• Matrix C[i,j] = probability that page about topic i links to page about topic j
• Soft counting: C[i,j] += Pr(i|u)Pr(j|v)
Applications
• Classifying Web pages into topics
• Focused crawling for topic-specific pages
• Finding relations between topics in a directory
[Figure: link from page u to page v]
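The soft-counting rule can be sketched as below: for every link (u, v), each cell C[i][j] is incremented by Pr(i|u)·Pr(j|v). The final row normalization, which turns the counts into "given topic i, probability of linking to topic j", is an assumption about how the counts become probabilities. The pages and topic vectors are hypothetical.

```python
def citation_matrix(links, topic_vectors, topics):
    """Soft-count C[i][j] += Pr(i|u) * Pr(j|v) over links (u, v), then normalize each row."""
    C = {i: {j: 0.0 for j in topics} for i in topics}
    for u, v in links:
        for i in topics:
            for j in topics:
                C[i][j] += topic_vectors[u].get(i, 0.0) * topic_vectors[v].get(j, 0.0)
    for i in topics:
        row_sum = sum(C[i].values())
        if row_sum > 0:  # assumption: rows are normalized into probabilities
            for j in topics:
                C[i][j] /= row_sum
    return C


# Hypothetical pages and topic vectors, for illustration only.
topics = ["Arts", "Sports"]
topic_vectors = {
    "a": {"Arts": 1.0},
    "a2": {"Arts": 1.0},
    "b": {"Arts": 0.5, "Sports": 0.5},
    "s": {"Sports": 1.0},
}
links = [("a", "a2"), ("a", "b"), ("s", "b")]
C = citation_matrix(links, topic_vectors, topics)
```

In this toy case the Arts row comes out as 0.75 / 0.25, since the Arts page links once to a pure Arts page and once to a page split evenly between the two topics.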
Citation, confusion, correction
[Figure: three matrix plots over the topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports, with axes From topic / To topic, True topic / Guessed topic, and From topic / To topic]
Classifier’s confusion on held-out documents can be used to correct the citation matrix
Fine-grained views of citation
Clear block-structure derived from coarse-grain topics
Strong diagonals reflect tightly-knit topic communities
Prominent off-diagonal entries raise design issues for taxonomy editors and maintainers
Concluding remarks
A model for content-based communities
• New characterization and measurement of topical locality on the Web
• How to set the PageRank jump parameter?
• Topical stability of topic distillation
• Better crawling and classification
A tool for Web directory maintenance
• Fair sampling and representation of topics
• Block-structure and off-diagonals
• Taxonomy inversion