Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning
Frank Lin, Language Technologies Institute
School of Computer Science, Carnegie Mellon University
PhD Thesis Proposal, October 8, 2010, Pittsburgh, PA, USA
Thesis Committee: William W. Cohen (Chair), Christos Faloutsos, Tom Mitchell, Xiaojin Zhu
2
Motivation
• Graph data is everywhere, and some of it is very big
• We want to do machine learning on it
• For non-graph data, it often makes sense to represent it as a graph for unsupervised and semi-supervised learning: these methods often require a similarity measure between data points, which is naturally represented as edges between nodes
3
Motivation
• We want to do clustering and semi-supervised learning on large graphs
• We need scalable methods that provide quality results
4
Thesis Goal
• Make contributions toward fast, space-efficient, effective, and simple graph-based learning methods that scale up to large datasets.
5
Road Map
(Columns: Graph and Induced Graph under "Basic Learning Methods"; Applications and Support under "Applications to a Web-Scale Knowledge Base")

Unsupervised / Clustering
  Graph: Power Iteration Clustering, Avoiding Collision, Hierarchical Clustering
  Induced Graph: PIC with Path Folding
  Applications: Learning Synonym NP's, Learning Polyseme NP's, Finding New Types and Relations
  Support: Distributed Implementation

Semi-Supervised
  Graph: MultiRankWalk
  Induced Graph: MRW with Path Folding
  Applications: Noun Phrase Categorization
  Support: Distributed Implementation
6
Road Map
(The same road map, shown again with the items marked as Prior Work vs. Proposed Work.)
7
Talk Outline
• Prior Work
  – Power Iteration Clustering (PIC)
  – PIC with Path Folding
  – MultiRankWalk (MRW)
• Proposed Work
  – MRW with Path Folding
  – MRW for Populating a Web-Scale Knowledge Base
    • Noun Phrase Categorization
  – PIC for Extending a Web-Scale Knowledge Base
    • Learning Synonym NP's
    • Learning Polyseme NP's
    • Finding New Ontology Types and Relations
  – PIC Extensions
  – Distributed Implementations
8
Power Iteration Clustering
• Spectral clustering methods are nice, and a natural choice for graph data
• But they are rather expensive (slow)
Power iteration clustering (PIC) can provide a similar solution at a very low cost (fast)!
9
Background: Spectral Clustering
• Idea: instead of clustering data points in their original (Euclidean) space, cluster them in the space spanned by the “significant” eigenvectors of a (Laplacian) similarity matrix
A popular spectral clustering method:
normalized cuts (NCut)
10
Background: Spectral Clustering
[Figure: dataset and normalized cut results; plots of the 2nd and 3rd smallest eigenvectors (value vs. index) over clusters 1, 2, 3, and the resulting clustering space]
11
Background: Spectral Clustering
• Normalized Cut algorithm (Shi & Malik 2000), with a minimal sketch below:
  1. Choose k and a similarity function s
  2. Derive the affinity matrix A from s, and let W = I - D^{-1}A, where I is the identity matrix and D is a diagonal matrix with Dii = Σj Aij
  3. Find the eigenvectors and corresponding eigenvalues of W
  4. Pick the eigenvectors of W with the 2nd to kth smallest corresponding eigenvalues as the "significant" eigenvectors
  5. Project the data points onto the space spanned by these vectors
  6. Run k-means on the projected data points
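The sketch below is illustrative, not code from the proposal; it assumes numpy, scipy, and scikit-learn. For numerical convenience it works with the symmetric normalized Laplacian, whose rescaled eigenvectors are exactly the eigenvectors of W = I - D^{-1}A used in the recipe above. The function name `ncut` and the parameter choices are assumptions.

```python
# Minimal sketch of the NCut recipe above (illustrative, not the proposal's code).
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def ncut(A, k):
    """Cluster the n nodes of a dense, nonnegative affinity matrix A (n x n) into k clusters."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: L_sym = I - D^{-1/2} A D^{-1/2}
    L_sym = np.eye(A.shape[0]) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    vals, vecs = eigh(L_sym)                   # eigenvalues returned in ascending order
    # Take the 2nd..kth smallest and map back to eigenvectors of W = I - D^{-1}A
    U = d_inv_sqrt[:, None] * vecs[:, 1:k]
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```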
Finding eigenvectors and eigenvalues of a matrix is very slow in general. Can we find a similar low-dimensional embedding for clustering without eigenvectors?
12
The Power Iteration
• The power iteration, or the power method, is a simple iterative method for finding the dominant eigenvector of a matrix:

  v^{t+1} = cWv^t

  W: a square matrix
  v^t: the vector at iteration t; v^0 is typically a random vector
  c: a normalizing constant that keeps v^t from getting too large or too small

• Typically converges quickly; fairly efficient if W is a sparse matrix
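A minimal sketch of the power iteration as described above (illustrative, not the thesis code): the normalizing constant c is taken to be the inverse L1 norm, and the tolerance is an arbitrary choice.

```python
# Minimal sketch of the power iteration (illustrative; tolerance and normalization are assumptions).
import numpy as np

def power_iteration(W, num_iters=1000, tol=1e-9, seed=0):
    """Repeat v <- c W v to approximate the dominant eigenvector of W."""
    rng = np.random.default_rng(seed)
    v = rng.random(W.shape[0])
    v /= np.abs(v).sum()
    for _ in range(num_iters):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()          # c: keeps v from getting too large or too small
        if np.abs(v_new - v).sum() < tol:     # converged
            return v_new
        v = v_new
    return v
```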
13
The Power Iteration
• The same update: v^{t+1} = cWv^t
• What if we let W = D^{-1}A (like Normalized Cut)? That is the row-normalized similarity matrix.
14
The Power Iteration
15
Power Iteration Clustering
• The 2nd to kth eigenvectors of W = D^{-1}A are roughly piecewise constant with respect to the underlying clusters, each separating a cluster from the rest of the data (Meila & Shi 2001)
• A linear combination of piecewise constant vectors is also piecewise constant!
16
Spectral Clustering
[Figure (repeated): dataset and normalized cut results; 2nd and 3rd smallest eigenvectors (value vs. index) and the clustering space]
17
[Figure: a weighted combination a·(2nd eigenvector) + b·(3rd eigenvector), which is also piecewise constant]
18
Power Iteration Clustering
[Figure: dataset and PIC results; the one-dimensional embedding v^t]
Key idea: to do clustering, we may not need all the information in a full spectral embedding (e.g., distances between clusters in a k-dimensional eigenspace); we just need the clusters to be separated in some space.
When to Stop
Recall:

  v^t = c_1 λ_1^t e_1 + … + c_k λ_k^t e_k + c_{k+1} λ_{k+1}^t e_{k+1} + … + c_n λ_n^t e_n

Then:

  v^t / (c_1 λ_1^t) = e_1 + (c_2/c_1)(λ_2/λ_1)^t e_2 + … + (c_k/c_1)(λ_k/λ_1)^t e_k + … + (c_n/c_1)(λ_n/λ_1)^t e_n

Because they are raised to the power t, the eigenvalue ratios determine how fast v converges to e_1.
At the beginning, v changes fast ("accelerating") as it converges locally, due to the "noise terms" (k+1, …, n) with small λ.
When the "noise terms" have gone to zero, v changes slowly ("constant speed") because only the larger-λ terms (2, …, k) are left, where the eigenvalue ratios are close to 1.
20
Power Iteration Clustering
• A basic power iteration clustering (PIC) algorithm:
Input: a row-normalized affinity matrix W and the number of clusters k
Output: clusters C1, C2, …, Ck
1. Pick an initial vector v^0
2. Repeat:
   • Set v^{t+1} ← Wv^t
   • Set δ^{t+1} ← |v^{t+1} - v^t|
   • Increment t
   • Stop when |δ^t - δ^{t-1}| ≈ 0
3. Use k-means to cluster points on v^t and return clusters C1, C2, …, Ck
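Below is a hedged sketch of this basic PIC algorithm using numpy and scikit-learn's k-means. The dense matrix, stopping threshold, and other parameter values are assumptions made for brevity, not the thesis implementation.

```python
# A sketch of the basic PIC algorithm above (illustrative; dense matrices and eps are assumptions).
import numpy as np
from sklearn.cluster import KMeans

def pic(A, k, max_iters=1000, eps=1e-5, seed=0):
    """Cluster the nodes of a dense affinity matrix A into k clusters."""
    W = A / A.sum(axis=1, keepdims=True)       # row-normalize: W = D^{-1} A
    rng = np.random.default_rng(seed)
    v = rng.random(A.shape[0])
    v /= np.abs(v).sum()
    delta_prev = None
    for _ in range(max_iters):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()           # keep the scale of v bounded
        delta = np.abs(v_new - v)
        v = v_new
        # stop when the change in velocity ("acceleration") is near zero
        if delta_prev is not None and np.abs(delta - delta_prev).sum() < eps:
            break
        delta_prev = delta
    return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))
```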
21
PIC Runtime
[Table: runtimes of PIC vs. Normalized Cut and a faster Normalized Cut implementation; the latter ran out of memory (24GB) on some entries]
PIC Accuracy on Network Datasets
[Scatter plot: each point compares accuracy on a dataset; upper triangle: PIC does better; lower triangle: NCut or NJW does better]
24
Clustering Text Data
• Spectral clustering methods are nice
• We want to use them for clustering (a lot of) text data
25
The Problem with Text Data
• Documents are often represented as feature vectors of words:
The importance of a Web page is an inherently subjective matter, which depends on the readers…
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use…
You're not cool just because you have a lot of followers on twitter, get over yourself…
          cool  web  search  make  over  you
  doc 1:    0    4     8       2     5     3
  doc 2:    0    8     7       4     3     2
  doc 3:    1    0     0       0     1     2
26
The Problem with Text Data
• Feature vectors are often sparse, but the similarity matrix is not!

          cool  web  search  make  over  you
  doc 1:    0    4     8       2     5     3
  doc 2:    0    8     7       4     3     2
  doc 3:    1    0     0       0     1     2

  Mostly zeros: any document contains only a small fraction of the vocabulary.

          doc 1  doc 2  doc 3
  doc 1     -    125     27
  doc 2    125    -      23
  doc 3     27    23      -

  Mostly non-zero: any two documents are likely to have a word in common.
27
The Problem with Text Data
• A similarity matrix is the input to many clustering methods, including spectral clustering
• Spectral clustering requires the computation of the eigenvectors of a similarity matrix of the data
          doc 1  doc 2  doc 3
  doc 1     -    125     27
  doc 2    125    -      23
  doc 3     27    23      -

Computing eigenvectors is in general O(n^3); approximation methods are still not very fast.
The similarity matrix takes O(n^2) time to construct, O(n^2) space to store, and more than O(n^2) time to operate on.
Too expensive! Does not scale up to big datasets!
28
The Problem with Text Data
• We want to use the similarity matrix for clustering (like spectral clustering), but:
  – Without calculating eigenvectors
  – Without constructing or storing the similarity matrix
Power Iteration Clustering + Path Folding
29
Path Folding
• A basic power iteration clustering (PIC) algorithm:
Input: a row-normalized affinity matrix W and the number of clusters k
Output: clusters C1, C2, …, Ck
1. Pick an initial vector v^0
2. Repeat:
   • Set v^{t+1} ← Wv^t   (the key operation in PIC: a matrix-vector multiplication!)
   • Set δ^{t+1} ← |v^{t+1} - v^t|
   • Increment t
   • Stop when |δ^t - δ^{t-1}| ≈ 0
3. Use k-means to cluster points on v^t and return clusters C1, C2, …, Ck

Okay, we have a fast clustering method; but there's the W, which requires O(n^2) storage space and construction and operation time!
30
Path Folding
• What’s so good about matrix-vector multiplication?
• If we can decompose the matrix…
• Then we arrive at the same solution doing a series of matrix-vector multiplications!
  Wv^t = (ABC)v^t
  v^{t+1} = A(B(Cv^t))

Isn't this more expensive? Well, not if the original matrix W is dense… and A, B, C are sparse!
31
Path Folding
• As long as we can decompose the matrix into a series of sparse matrices, we can turn a dense matrix-vector multiplication into a series of sparse matrix-vector multiplications.
This means we can turn an operation that requires O(n^2) storage and runtime into one that requires O(n) storage and runtime!
This is exactly the case for text data, and for many other kinds of data as well!
32
Path Folding
• Example – inner product similarity:
  W = D^{-1} F F^T

  F: the original feature matrix (given; storage O(n))
  F^T: the feature matrix transposed (given; storage: just use F)
  D: diagonal matrix that normalizes W so its rows sum to 1 (storage O(n))
33
Path Folding
• Example – inner product similarity:
  W = D^{-1} F F^T

• Iteration update:

  v^{t+1} = D^{-1}(F(F^T v^t))

  Construction: O(n); storage: O(n); operation: O(n)

Okay… how about a similarity function we actually use for text data?
34
Path Folding
• Example – cosine similarity:
  W = D^{-1} N F F^T N

  N: diagonal cosine-normalizing matrix

• Iteration update:

  v^{t+1} = D^{-1}(N(F(F^T(N v^t))))

  Construction: O(n); storage: O(n); operation: O(n)
  Compact storage: we don't need a cosine-normalized version of the feature vectors
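A sketch of one path-folded PIC update for cosine similarity, assuming a scipy.sparse document-by-feature matrix F. For simplicity it keeps the self-similarity on the diagonal of the implicit W, which the actual method may handle differently; function names and the toy data are illustrative.

```python
# Sketch of one path-folded PIC update for cosine similarity (illustrative, not the thesis code).
import numpy as np
import scipy.sparse as sp

def folded_pic_update(F, v):
    """One PIC step v <- D^{-1} N F F^T N v, computed as a chain of sparse matrix-vector products."""
    # diag(N): inverse L2 norm of each document's feature vector (cosine normalization)
    n_inv = 1.0 / np.maximum(np.sqrt(F.multiply(F).sum(axis=1)).A.ravel(), 1e-12)

    def apply_similarity(x):                    # apply N F F^T N without forming the n x n matrix W
        return n_inv * (F @ (F.T @ (n_inv * x)))

    row_sums = apply_similarity(np.ones(F.shape[0]))   # diag(D), also obtained by folding
    v_new = apply_similarity(v) / row_sums              # D^{-1} (N F F^T N v)
    return v_new / np.abs(v_new).sum()                  # keep the scale of v bounded, as in PIC

# toy usage with a random sparse document-by-feature matrix (hypothetical data)
F = sp.random(1000, 5000, density=0.01, format="csr", random_state=0)
v = np.full(1000, 1.0 / 1000)
v = folded_pic_update(F, v)
```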
35
Path Folding
• We refer to this technique as path folding due to its connections to “folding” a bipartite graph into a unipartite graph.
36
Results
• An accuracy result:
[Scatter plot: each point is accuracy for a 2-cluster text dataset; upper triangle: we win; lower triangle: spectral clustering wins; diagonal: tied (most datasets)]
No statistically significant difference – same accuracy.
37
Results
• A scalability result:
[Plot: algorithm runtime (log scale) vs. data size (log scale); spectral clustering (red & blue) follows a quadratic curve, our method (green) follows a linear curve]
MultiRankWalk
• Classification labels are expensive to obtain
• Semi-supervised learning (SSL) learns from labeled and unlabeled data for classification
• When it comes to network data, what is a general, simple, and effective method that requires very few labels?
Our answer: MultiRankWalk (MRW)
Random Walk with Restart
• Imagine a network, and starting at a specific node, you follow the edges randomly.
• But with some probability, you “jump” back to the starting node (restart!).
If you recorded the number of times you land on each node,
what would that distribution look like?
Random Walk with Restart
[Figure: visit distribution of the walk, with the start node marked]
What if we start at a different node?
Random Walk with Restart
• The walk distribution r satisfies a simple equation:
  r = (1 - d)u + dWr

  u: start node(s)
  W: transition matrix of the network
  1 - d: restart probability
  d: "keep-going" probability (damping factor)
Random Walk with Restart
• Random walk with restart (RWR) can be solved simply and efficiently with an iterative procedure:
  r^{t+1} = (1 - d)u + dWr^t
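A minimal sketch of this iterative RWR procedure (illustrative): W is assumed to be a column-stochastic transition matrix so the update can be written exactly as above, and d = 0.85 is just a conventional damping value.

```python
# Minimal sketch of the iterative RWR solution above (d and tolerance are illustrative choices).
import numpy as np

def rwr(W, u, d=0.85, max_iters=1000, tol=1e-8):
    """Iterate r <- (1 - d) u + d W r.

    W: column-stochastic transition matrix of the network (so the update matches the equation above)
    u: restart distribution, e.g. uniform over the start node(s)
    d: "keep-going" probability (damping factor); 1 - d is the restart probability
    """
    r = u.copy()
    for _ in range(max_iters):
        r_new = (1 - d) * u + d * (W @ r)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r
```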
RWR for Classification
• Simple idea: use RWR for classification (a short sketch follows below)
  – Run one RWR with the start nodes being the labeled points in class A
  – Run another RWR with the start nodes being the labeled points in class B
  – Nodes frequented more by RWR(A) belong to class A; otherwise they belong to class B
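A sketch of this idea, assuming the RWR routine above: run one walk per class, restarting at that class's seeds, and label each node by whichever walk visits it more. Names and defaults are illustrative, not the MRW implementation.

```python
# Sketch of RWR-based classification (the MRW idea); names and defaults are illustrative.
import numpy as np

def rwr(W, u, d=0.85, iters=200):
    r = u.copy()
    for _ in range(iters):                     # same update as the RWR sketch above
        r = (1 - d) * u + d * (W @ r)
    return r

def mrw_classify(W, seeds_per_class, d=0.85):
    """Run one RWR per class, restarting at that class's labeled seeds; label nodes by the higher score."""
    n = W.shape[0]
    scores = []
    for seeds in seeds_per_class:              # e.g. [[ids of class A seeds], [ids of class B seeds]]
        u = np.zeros(n)
        u[seeds] = 1.0 / len(seeds)            # restart uniformly over this class's seed nodes
        scores.append(rwr(W, u, d=d))
    return np.argmax(np.vstack(scores), axis=0)
```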
45
Results
Accuracy: MRW vs. the harmonic functions method (HF)
[Scatter plot: upper triangle: MRW better; lower triangle: HF better; point color & style roughly indicate the number of labeled examples]
With a larger number of labels, both methods do well.
MRW does much better with only a few labels!
Results
• How much better is MRW using authoritative seed preference?
[Plot: y-axis: MRW F1 score minus wvRN F1 score; x-axis: number of seed labels per class]
The gap between MRW and wvRN narrows with authoritative seeds, but it is still prominent on some datasets with a small number of seed labels.
48
MRW with Path Folding
• MRW is based on random walk with restart (RWR):
  r^{t+1} = (1 - d)u + dWr^t

  Core operation: matrix-vector multiplication
• "Path folding" can be applied to MRW as well, to efficiently learn from induced graph data!
  – Most other methods preprocess the induced graph data to create a sparse similarity matrix
  – Path folding can also be applied to iterative implementations of the harmonic functions method
49
On to real, large, difficult applications…
• We now have a basic set of tools to do efficient unsupervised and semi-supervised learning on graph data
• We want to apply them to cool, challenging problems…
51
A Web-Scale Knowledge Base
• Read the Web (RtW) project:
Build a never-ending system that learns to extract information
from unstructured web pages, resulting in a knowledge base of
structured information.
52
Noun Phrase and Context Data
• As part of the RtW project, two kinds of noun phrase (NP) and context co-occurrence data were produced:
  – NP-context co-occurrence data
  – NP-NP co-occurrence data
• These datasets can be treated as graphs
53
Noun Phrase and Context Data
• NP-context data:
  "… know that drinking pomegranate juice may not be a bad …"
  NP: pomegranate juice; before-context: "know that drinking _"; after-context: "_ may not be a bad"
[Figure: NP-context graph; NPs (pomegranate juice, JELL-O, Jagermeister) connected to contexts ("know that drinking _", "_ may not be a bad", "_ is made from", "_ promotes responsible") with co-occurrence counts as edge weights]
54
Noun Phrase and Context Data
• NP-NP data:
  "… French Constitution Council validates burqa ban …"
  NP: French Constitution Council; context: validates; NP: burqa ban
[Figure: NP-NP graph; NPs (French Constitution Council, burqa ban, French Court, hot pants, veil, JELL-O, Jagermeister) connected by co-occurrence]
Context can be used for weighting edges or making a more complex graph.
56
Noun Phrase Categorization
• We propose using MRW (with path folding) on the NP-context data to categorize NPs, given a handful of seed NPs.
• Challenges:
  – Large, noisy dataset (10m NPs, 8.6m contexts from 500m web pages)
  – What's the right function for NP-NP categorical similarity?
  – Which learned category assignments should we "promote" to the knowledge base?
  – How to evaluate it?
58
Noun Phrase Categorization
• Preliminary experiment:
  – Small subset of the NP-context data: 88k NPs, 99k contexts
  – Find the category "city"; start with a handful of seeds
  – Ground truth set of 2,404 city NPs created using …
59
Noun Phrase Categorization
• Preliminary result:
[Figure: precision (y-axis, 0.0 to 1.0) vs. recall (x-axis, 10% to 100%) curves comparing HF, coEM, MRW, and MRWb]
61
Learning Synonym NP’s
• Currently the knowledge base sees NP's with different strings as different entities
• It would be useful to know when two different surface forms refer to the same semantic entity
Natural thing to try: use PIC to cluster NPs. But do we get synonyms?
62
Learning Synonym NP's
The NP-context graph yields categorical clusters, e.g.:
  {President Obama, Senator Obama, Barack Obama, Senator McCain, George W. Bush, Nicolas Sarkozy, Tony Blair}
The NP-NP graph yields semantically related clusters, e.g.:
  {President Obama, Senator Obama, Barack Obama, Senator McCain, African American, Democrat, Bailout}
But together they may help us to identify synonyms!
Other methods (e.g., string similarity) may help to further refine results such as:
  {President Obama, Senator Obama, Barack Obama, BarackObama.com, Obamamania, Anti-Obama, Michelle Obama}
64
Learning Polyseme NP’s
• Currently a noun phrase found on a web page can only be mapped to a single entity in the knowledge base
• It would be useful to know when two different entities share the same surface form.
• Example: the NP "Jordan" → which entity?
65
Learning Polyseme NP's
• Proposal: cluster the NP's neighbors:
  – Cluster contexts in the NP-context graph and NPs in the NP-NP graph
  – Identify salient clusters as references to different entities
• Results from clustering neighbors of "Jordan" in the NP-NP graph:
Tru, morgue, Harrison, Jackie, Rick, Davis, Jack, Chicago, episode, sights, six NBA championships, glass, United Center, brother, restaurant, mother, show, Garret, guest star, dead body, woman, Khouri, Loyola, autopsy, friend, season, release, corpse, Lantern, NBA championships, Maternal grandparents, Sox, Starr, medical examiner, Ivers, Hotch, maiden name, NBA titles, cousin John, Scottie Pippen, guest appearance, Peyton, Noor, night, name, Dr . Cox, third pick, Phoebe, side, third season, EastEnders, Tei, Chicago White Sox, trumpeter, day, Chicago Bulls, products, couple, Pippen, Illinois, All-NBA First Team, Dennis Rodman, first retirement
1948-1967, 500,000 Iraqis, first Arab country, Palestinian Arab state, Palestinian President, Arab-states, al-Qaeda, Arab World, countries, ethics, Faynan, guidelines, Malaysia, methodology, microfinance, militias, Olmert, peace plan, Strategic Studies, million Iraqi refugees, two Arab countries, two Arab states, Palestinian autonomy, Muslim holy sites, Jordan algebras, Jordanian citizenship, Arab Federation, Hashemite dynasty, Jordanian citizens, Gulf Cooperation, Dahabi, Judeh, Israeli-occupied West Bank, Mashal, 700,000 refugees, Yarmuk River, Palestine Liberation, Sunni countries, 2 million Iraqi, 2 million Iraqi refugees, Israeli borders, moderate Arab states
67
Finding New Types and Relations
• The proposed methods for finding synonyms and polysemes identify particular nodes and relationships between nodes based on characteristics of the graph structure, using PIC
• Can we generalize these methods to find any unary or binary relationships between nodes in the graph?
  – Different relationships may require different kinds of graphs and similarity functions
  – Can we learn effective similarity functions for clustering efficiently?
  – Admittedly the most open-ended part of this proposal…
  – Tom's suggestion: given a set of NPs, can you find k categories and l relationships that cover the most NPs?
69
PIC Extension: Avoiding Collisions
• One robustness question for vanilla PIC as data size and complexity grow:
• How many (noisy) clusters can you fit in one dimension without them "colliding"?
[Figure: two example PIC embeddings; in one the cluster signals are cleanly separated, in the other they are a little too close for comfort]
70
PIC Extension: Avoiding Collisions
• We propose two general solutions (see the sketch after this list):
  1. Precondition the starting vector; instead of random:
     • Degree
     • Skewed (1, 2, 4, 8, 16, etc.)
     • May depend on particular properties of the data
  2. Run PIC d times with different random starts and construct a d-dimensional embedding
     • Two clusters are unlikely to collide on all d dimensions
     • We can afford it because PIC is fast and space-efficient!
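A sketch of solution 2, the d-dimensional PIC embedding (illustrative): for brevity it runs a fixed number of PIC iterations per random start instead of the acceleration-based stopping rule, then stacks the d one-dimensional embeddings before k-means.

```python
# Sketch of solution 2: a d-dimensional PIC embedding from d random starts (illustrative).
import numpy as np
from sklearn.cluster import KMeans

def pic_embedding(W, num_iters=50, seed=0):
    """One PIC run (fixed number of iterations here, instead of the acceleration stopping rule)."""
    rng = np.random.default_rng(seed)
    v = rng.random(W.shape[0])
    v /= np.abs(v).sum()
    for _ in range(num_iters):
        v = W @ v
        v /= np.abs(v).sum()
    return v

def multi_pic(W, k, d=4):
    """Stack d independent PIC embeddings; two clusters are unlikely to collide in all d dimensions."""
    V = np.column_stack([pic_embedding(W, seed=s) for s in range(d)])
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)
```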
71
PIC Extension: Avoiding Collisions
• Preliminary results on network classification datasets:
[Figure: accuracy per dataset (k = # of clusters); RED: PIC embedding with a random start vector; GREEN: PIC using a degree start vector; BLUE: PIC using 4 random start vectors]
1-dimensional PIC embeddings lose on accuracy at higher k's compared to NCut and NJW, but using 4 random vectors instead helps!
Note that the # of vectors << k.
72
PIC Extension: Avoiding Collisions
• Preliminary results on name disambiguation datasets:
Again, using 4 random vectors seems to work! And again, note that the # of vectors << k.
73
PIC Extension: Hierarchical Clustering
• Real, large-scale data may not have a “flat” clustering structure
• A hierarchical view may be more useful
Good news: the dynamics of a PIC embedding display a hierarchically convergent behavior!
74
PIC Extension: Hierarchical Clustering
• Why? Recall the PIC embedding at time t:

  v^t / (c_1 λ_1^t) = e_1 + (c_2/c_1)(λ_2/λ_1)^t e_2 + (c_3/c_1)(λ_3/λ_1)^t e_3 + … + (c_n/c_1)(λ_n/λ_1)^t e_n

  (the e's are eigenvectors, i.e., structure, ordered from big to small)

Less significant eigenvectors / structures go away first, one by one; more salient structures stick around.
There may not be a clear eigengap, but rather a gradient of cluster saliency.
75
PIC Extension: Hierarchical Clustering
[Figure: the same dataset as before; PIC has already converged to 8 clusters, but if we keep on iterating (it might take a while), clusters keep merging hierarchically; e.g., "N" is still a part of the "2009" cluster]
Similar behavior has also been noted in matrix-matrix power methods (diffusion maps, mean-shift, multi-resolution spectral clustering).
77
Distributed / Parallel Implementations
• Distributed / parallel implementations of learning methods are necessary to support large-scale data given the direction of hardware development
• PIC, MRW, and their path folding variants have at their core sparse matrix-vector multiplications
• Sparse matrix-vector multiplication lends itself well to a distributed / parallel computing framework
• We propose to use: …
• Alternatives: …
• Existing graph analysis tool: …
Questions?
80
Additional Information
PIC: Some “Powering” Methods at a Glance
Method                   | W          | Iterate             | Stopping                 | Final
Tishby & Slonim 2000     | W = D^{-1}A | W^{t+1} = W^t W     | rate of information loss | information bottleneck method
Zhou & Woodruff 2004     | W = A       | W^{t+1} = W^t W     | a small t                | a threshold ε
Carreira-Perpinan 2006   | W = D^{-1}A | X^{t+1} = WX^t      | entropy                  | a threshold ε
PIC                      | W = D^{-1}A | v^{t+1} = Wv^t      | acceleration             | k-means

How far can we go with a one- or low-dimensional embedding?
82
PIC: Versus Popular Fast Sparse Eigencomputation Methods
For Symmetric Matrices                      | For General Matrices                       | Improvement
Successive Power Method                     | Successive Power Method                    | Basic; numerically unstable, can be slow
Lanczos Method                              | Arnoldi Method                             | More stable, but requires lots of memory
Implicitly Restarted Lanczos Method (IRLM)  | Implicitly Restarted Arnoldi Method (IRAM) | More memory-efficient

Method | Time                                                | Space
IRAM   | (O(m^3) + (O(nm) + O(e)) × O(m-k)) × (# restarts)   | O(e) + O(nm)
PIC    | O(e) × (# iterations)                               | O(e)

Randomized sampling methods are also popular.
83
PIC: Related Clustering Work
• Spectral Clustering
  – (Roxborough & Sen 1997, Shi & Malik 2000, Meila & Shi 2001, Ng et al. 2002)
• Kernel k-Means (Dhillon et al. 2007)
• Modularity Clustering (Newman 2006)
• Matrix Powering
  – Markovian relaxation & the information bottleneck method (Tishby & Slonim 2000)
  – matrix powering (Zhou & Woodruff 2004)
  – diffusion maps (Lafon & Lee 2006)
  – Gaussian blurring mean-shift (Carreira-Perpinan 2006)
• Mean-Shift Clustering
  – mean-shift (Fukunaga & Hostetler 1975, Cheng 1995, Comaniciu & Meer 2002)
  – Gaussian blurring mean-shift (Carreira-Perpinan 2006)
84
PICwPF: Results
[Scatter plot: each point represents the accuracy result from a dataset; lower triangle: k-means wins; upper triangle: PIC wins]
85
PICwPF: Results
The two methods have almost the same behavior; overall, neither method is statistically significantly better than the other.
86
PICwPF: Results
Not sure why NCUTiram did not work as well as NCUTevd.
Lesson: approximate eigencomputation methods may require expertise to work well.
87
PICwPF: Results
• PIC is O(n) per iteration and the runtime curve looks linear…
• But I don’t like eyeballing curves, and perhaps the number of iteration increases with size or difficulty of the dataset?
Correlation plotCorrelation statistic
(0=none, 1=correlated)
88
PICwPF: Results
• Linear run-time implies a constant number of iterations.
• The number of iterations to "acceleration-convergence" is hard to analyze:
  – It is faster than a single complete run of the power iteration to convergence
  – On our datasets, 10-20 iterations is typical; 30-35 is exceptional
89
PICwPF: Related Work
• Faster spectral clustering
  – Approximate eigendecomposition (Lanczos, IRAM)
  – Sampled eigendecomposition (Nyström)
  → these are not O(n) time methods
• Sparser matrix
  – Sparse construction: k-nearest-neighbor graph, k-matching
  – Graph sampling / reduction
  → these still require O(n^2) construction in general, and are not O(n) space methods
90
PICwPF: Results
MRW: RWR for Classification
We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks
MRW: Seed Preference
• Obtaining labels for data points is expensive
• We want to minimize the cost of obtaining labels
• Observations:
  – Some labels are inherently more useful than others
  – Some labels are easier to obtain than others
Question: "authoritative" or "popular" nodes in a network are typically easier to obtain labels for, but are these labels also more useful than others?
Seed Preference
• Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label
• The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data
• We consider 3 preferences:
  – Random
  – Link Count: nodes with the highest link counts make the list
  – PageRank: nodes with the highest PageRank scores make the list
MRW: The Question
• What really makes MRW and wvRN different?
• Network-based SSL often boils down to label propagation.
• MRW and wvRN represent two general propagation methods; note that they are called by many names:

  MRW: random walk with restart, regularized random walk, personalized PageRank, local & global consistency
  wvRN: reverse random walk, random walk with sink nodes, hitting time, harmonic functions on graphs, iterative averaging of neighbors

Great… but we still don't know why they differ in their behavior on these network datasets!
MRW: The Question
• It’s difficult to answer exactly why MRW does better with a smaller number of seeds.
• But we can gather probable factors from their propagation models:
1. MRW: centrality-sensitive | wvRN: centrality-insensitive
2. MRW: exponential drop-off / damping factor | wvRN: no drop-off / damping
3. MRW: propagation of different classes done independently | wvRN: propagation of different classes interact
MRW: The Question
• An example from a political blog dataset – MRW vs. wvRN scores for how much a blog is politically conservative:
List 1: 1.000 neoconservatives.blogspot.com; 1.000 strangedoctrines.typepad.com; 1.000 jmbzine.com; 0.593 presidentboxer.blogspot.com; 0.585 rooksrant.com; 0.568 purplestates.blogspot.com; 0.553 ikilledcheguevara.blogspot.com; 0.540 restoreamerica.blogspot.com; 0.539 billrice.org; 0.529 kalblog.com; 0.517 right-thinking.com; 0.517 tom-hanna.org; 0.514 crankylittleblog.blogspot.com; 0.510 hasidicgentile.org; 0.509 stealthebandwagon.blogspot.com; 0.509 carpetblogger.com; 0.497 politicalvicesquad.blogspot.com; 0.496 nerepublican.blogspot.com; 0.494 centinel.blogspot.com; 0.494 scrawlville.com; 0.493 allspinzone.blogspot.com; 0.492 littlegreenfootballs.com; 0.492 wehavesomeplanes.blogspot.com; 0.491 rittenhouse.blogspot.com; 0.490 secureliberty.org; 0.488 decision08.blogspot.com; 0.488 larsonreport.com
List 2: 0.020 firstdownpolitics.com; 0.019 neoconservatives.blogspot.com; 0.017 jmbzine.com; 0.017 strangedoctrines.typepad.com; 0.013 millers_time.typepad.com; 0.011 decision08.blogspot.com; 0.010 gopandcollege.blogspot.com; 0.010 charlineandjamie.com; 0.008 marksteyn.com; 0.007 blackmanforbush.blogspot.com; 0.007 reggiescorner.blogspot.com; 0.007 fearfulsymmetry.blogspot.com; 0.006 quibbles-n-bits.com; 0.006 undercaffeinated.com; 0.005 samizdata.net; 0.005 pennywit.com; 0.005 pajamahadin.com; 0.005 mixtersmix.blogspot.com; 0.005 stillfighting.blogspot.com; 0.005 shakespearessister.blogspot.com; 0.005 jadbury.com; 0.005 thefulcrum.blogspot.com; 0.005 watchandwait.blogspot.com; 0.005 gindy.blogspot.com; 0.005 cecile.squarespace.com; 0.005 usliberals.about.com; 0.005 twentyfirstcenturyrepublican.blogspot.com
Seed labels underlined.
1. Centrality-sensitive: seeds have different scores, and not necessarily the highest
2. Exponential drop-off: much less sure about nodes further away from seeds
3. Classes propagate independently: charlineandjamie.com is both very likely a conservative and a liberal blog (good or bad?)
We still don’t really understand it yet.
MRW: Related Work
• MRW is very much related to:
  – "Local and global consistency" (Zhou et al. 2004): similar formulation, different view
  – "Web content categorization using link information" (Gyongyi et al. 2006): RWR ranking as features to an SVM
  – "Graph-based semi-supervised learning as a generative model" (He et al. 2007): random walk without restart, heuristic stopping
• Seed preference is related to the field of active learning
  – Active learning chooses which data point to label next based on previous labels; the labeling is interactive
  – Seed preference is a batch labeling method
  – Authoritative seed preference is a good baseline for active learning on network data!