Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning


Transcript of Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning

Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning

Frank Lin
Language Technologies Institute, School of Computer Science, Carnegie Mellon University

PhD Thesis Proposal
October 8, 2010, Pittsburgh, PA, USA

Thesis Committee: William W. Cohen (Chair), Christos Faloutsos, Tom Mitchell, Xiaojin Zhu

2

Motivation

• Graph data is everywhere, and some of it is very big
• We want to do machine learning on it
• For non-graph data, it often makes sense to represent it as a graph for unsupervised and semi-supervised learning: these methods often require a similarity measure between data points, which is naturally represented as edges between nodes

3

Motivation

• We want to do clustering and semi-supervised learning on large graphs

• We need scalable methods that provide quality results

4

Thesis Goal

• Make contributions toward fast, space-efficient, effective, and simple graph-based learning methods that scale up to large datasets.

5

Road Map

[Road map diagram: basic learning methods, on graphs and on induced graphs, and their applications to and support for a web-scale knowledge base.
• Unsupervised / clustering: Power Iteration Clustering (PIC); PIC with Path Folding. Applications: learning synonym NP's, learning polyseme NP's, finding new types and relations. Supporting work: avoiding collision, hierarchical clustering, distributed implementation.
• Semi-supervised: MultiRankWalk (MRW); MRW with Path Folding. Application: noun phrase categorization. Supporting work: distributed implementation.]

6

Road Map

[The same road map diagram, with its components divided into prior work and proposed work.]

7

Talk Outline

• Prior Work
– Power Iteration Clustering (PIC)
– PIC with Path Folding
– MultiRankWalk (MRW)
• Proposed Work
– MRW with Path Folding
– MRW for Populating a Web-Scale Knowledge Base: Noun Phrase Categorization
– PIC for Extending a Web-Scale Knowledge Base: Learning Synonym NP's, Learning Polyseme NP's, Finding New Ontology Types and Relations
– PIC Extensions
– Distributed Implementations

8

Power Iteration Clustering

• Spectral clustering methods are nice, and a natural choice for graph data
• But they are rather expensive (slow)

Power iteration clustering (PIC) can provide a similar solution at a very low cost (fast)!

9

Background: Spectral Clustering

• Idea: instead of clustering data points in their original (Euclidean) space, cluster them in the space spanned by the “significant” eigenvectors of a (Laplacian) similarity matrix

A popular spectral clustering method: normalized cuts (NCut)

10

Background: Spectral Clustering

[Figure: a dataset and its normalized cut results; the 2nd and 3rd smallest eigenvectors are plotted (value vs. index) over the three clusters, showing the "clustering space" into which the points are projected.]

11

Background: Spectral Clustering

• Normalized Cut algorithm (Shi & Malik 2000):
1. Choose k and a similarity function s
2. Derive the affinity matrix A from s, and let W = I - D^{-1}A, where I is the identity matrix and D is the diagonal matrix with D_ii = Σ_j A_ij
3. Find the eigenvectors and corresponding eigenvalues of W
4. Pick the eigenvectors of W with the 2nd through kth smallest corresponding eigenvalues as the "significant" eigenvectors
5. Project the data points onto the space spanned by these vectors
6. Run k-means on the projected data points

Finding eigenvectors and eigenvalues of a matrix is very slow in general. Can we find a similar low-dimensional embedding for clustering without eigenvectors?
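To make the recipe above concrete, here is a minimal dense-matrix sketch of those steps (the function name, the use of numpy and scikit-learn, and the dense eigendecomposition are illustrative choices of mine, not part of the proposal):

```python
import numpy as np
from sklearn.cluster import KMeans

def ncut_cluster(A, k):
    """Sketch of the NCut steps above: eigenvectors of W = I - D^{-1}A, then k-means."""
    D_inv = np.diag(1.0 / A.sum(axis=1))
    W = np.eye(A.shape[0]) - D_inv @ A
    vals, vecs = np.linalg.eig(W)            # slow in general -- which is PIC's motivation
    order = np.argsort(vals.real)
    embedding = vecs[:, order[1:k]].real     # eigenvectors with the 2nd..kth smallest eigenvalues
    return KMeans(n_clusters=k).fit_predict(embedding)
```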

12

The Power Iteration

• The power iteration, or power method, is a simple iterative method for finding the dominant eigenvector of a matrix:

v^{t+1} = c W v^t

where W is a square matrix, v^t is the vector at iteration t (v^0 is typically a random vector), and c is a normalizing constant that keeps v^t from getting too large or too small.

It typically converges quickly, and is fairly efficient if W is a sparse matrix.
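As a quick illustration, a minimal sketch of this update in Python (the function name, tolerance, and the choice of normalization are illustrative assumptions, not from the proposal):

```python
import numpy as np

def power_iteration(W, num_iters=100, tol=1e-9, seed=0):
    """Approximate the dominant eigenvector of W with the update v <- c * W v."""
    rng = np.random.default_rng(seed)
    v = rng.random(W.shape[0])
    v /= np.abs(v).sum()                   # c: keep the entries from growing or shrinking
    for _ in range(num_iters):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()
        if np.abs(v_new - v).sum() < tol:  # stop once the vector has stabilized
            break
        v = v_new
    return v
```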

13

The Power Iteration

• The same update, v^{t+1} = c W v^t. What if we let W = D^{-1}A, the row-normalized similarity matrix (like Normalized Cut)?

14

The Power Iteration

15

Power Iteration Clustering

• The 2nd to kth eigenvectors of W = D^{-1}A are roughly piece-wise constant with respect to the underlying clusters, each separating a cluster from the rest of the data (Meila & Shi 2001)

• A linear combination of piece-wise constant vectors is also piece-wise constant!

16

[Figure: the same dataset and normalized cut results as before; the 2nd and 3rd smallest eigenvectors plotted (value vs. index) over the three clusters.]

17

[Figure: adding these piece-wise constant eigenvectors yields another piece-wise constant vector that still separates the clusters.]

18

Power Iteration Clustering

[Figure: the dataset and PIC results, with the PIC vector v^t plotted.]

Key idea: to do clustering, we may not need all the information in a full spectral embedding (e.g., the distances between clusters in a k-dimensional eigenspace); we just need the clusters to be separated in some space.

When to Stop

Recall:

v^t = c_1 λ_1^t e_1 + ... + c_k λ_k^t e_k + c_{k+1} λ_{k+1}^t e_{k+1} + ... + c_n λ_n^t e_n

Then:

v^t / (c_1 λ_1^t) = e_1 + (c_2/c_1)(λ_2/λ_1)^t e_2 + ... + (c_k/c_1)(λ_k/λ_1)^t e_k + ... + (c_n/c_1)(λ_n/λ_1)^t e_n

Because they are raised to the power t, the eigenvalue ratios determine how fast v converges to e_1.

At the beginning, v changes quickly ("accelerating") as it converges locally, because the "noise terms" (k+1 ... n) with small λ die off.

When the "noise terms" have gone to zero, v changes slowly ("constant speed") because only the larger-λ terms (2 ... k) are left, and their eigenvalue ratios are close to 1.

20

Power Iteration Clustering

• A basic power iteration clustering (PIC) algorithm:

Input: a row-normalized affinity matrix W and the number of clusters k
Output: clusters C_1, C_2, ..., C_k

1. Pick an initial vector v^0
2. Repeat:
   • Set v^{t+1} ← W v^t
   • Set δ^{t+1} ← |v^{t+1} - v^t|
   • Increment t
   • Stop when |δ^t - δ^{t-1}| ≈ 0
3. Use k-means to cluster the points on v^t and return clusters C_1, C_2, ..., C_k
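A minimal numpy/scikit-learn sketch of this loop (the initial vector, tolerance, and normalization are illustrative choices, not prescribed by the slide):

```python
import numpy as np
from sklearn.cluster import KMeans

def pic(A, k, max_iters=1000, eps=1e-5):
    """Sketch of the basic PIC loop above, starting from an affinity matrix A."""
    W = A / A.sum(axis=1, keepdims=True)       # row-normalize: W = D^{-1} A
    v = A.sum(axis=1)
    v = v / v.sum()                            # one common choice of initial vector
    delta_prev = None
    for _ in range(max_iters):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()           # keep the scale bounded
        delta = np.abs(v_new - v)
        if delta_prev is not None and np.abs(delta - delta_prev).max() < eps:
            break                              # |delta_t - delta_{t-1}| ~ 0
        v, delta_prev = v_new, delta
    return KMeans(n_clusters=k).fit_predict(v.reshape(-1, 1))
```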

21

PIC Runtime

[Figure: runtime comparison of PIC against Normalized Cut and a faster Normalized Cut implementation; the Normalized Cut runs eventually ran out of memory (24GB).]

PIC Accuracy on Network Datasets

[Figure: scatter plot of per-dataset accuracy; points in the upper triangle mean PIC does better, points in the lower triangle mean NCut or NJW does better.]


24

Clustering Text Data

• Spectral clustering methods are nice
• We want to use them for clustering (a lot of) text data

25

The Problem with Text Data

• Documents are often represented as feature vectors of words:

The importance of a Web page is an inherently subjective matter, which depends on the readers…

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use…

You're not cool just because you have a lot of followers on twitter, get over yourself…

        cool  web  search  make  over  you
doc 1:    0    4      8      2     5    3
doc 2:    0    8      7      4     3    2
doc 3:    1    0      0      0     1    2

26

The Problem with Text Data

• Feature vectors are often sparse
• But the similarity matrix is not!

        cool  web  search  make  over  you
doc 1:    0    4      8      2     5    3
doc 2:    0    8      7      4     3    2
doc 3:    1    0      0      0     1    2

Mostly zeros: any one document contains only a small fraction of the vocabulary.

        doc 1  doc 2  doc 3
doc 1:    -     125     27
doc 2:   125     -      23
doc 3:    27     23      -

Mostly non-zero: any two documents are likely to have a word in common.

27

The Problem with Text Data

• A similarity matrix is the input to many clustering methods, including spectral clustering

• Spectral clustering requires the computation of the eigenvectors of a similarity matrix of the data

[The same dense similarity matrix as before.]

Constructing the similarity matrix takes O(n^2) time, storing it takes O(n^2) space, and operating on it takes more than O(n^2) time; computing eigenvectors is O(n^3) in general, and approximation methods are still not very fast. Too expensive! It does not scale up to big datasets.

28

The Problem with Text Data

• We want to use the similarity matrix for clustering (like spectral clustering), but:
– without calculating eigenvectors
– without constructing or storing the similarity matrix

Power Iteration Clustering + Path Folding

29

Path Folding

• Recall the basic power iteration clustering (PIC) algorithm from before. The key operation in PIC is v^{t+1} ← W v^t; note that it is a matrix-vector multiplication!

Okay, we have a fast clustering method, but there's the W that requires O(n^2) storage space and O(n^2) construction and operation time!

30

Path Folding

• What’s so good about matrix-vector multiplication?

• If we can decompose the matrix…

• Then we arrive at the same solution doing a series of matrix-vector multiplications!

ttt ABCW vvv )(1

)))(((1 tt CBA vv

Isn’t this more expensive?

Well, not if this matrix is dense…

And these are sparse!

31

Path Folding

• As long as we can decompose the matrix into a series of sparse matrices, we can turn a dense matrix-vector multiplication into a series of sparse matrix-vector multiplications.

This means we can turn an operation that requires O(n^2) storage and runtime into one that requires O(n) storage and runtime! This is exactly the case for text data, and for many other kinds of data as well.

32

Path Folding

• Example, inner product similarity:

W = D^{-1} F F^T

where F is the original feature matrix (given; O(n) storage), F^T is its transpose (no extra storage needed; just use F), and D is the diagonal matrix that normalizes W so its rows sum to 1 (storage: n entries).

33

Path Folding

• Example, inner product similarity: W = D^{-1} F F^T

• Iteration update:

v^{t+1} = D^{-1} (F (F^T v^t))

Construction, storage, and operation are all O(n).

Okay... how about a similarity function we actually use for text data?

34

Path Folding

• Example, cosine similarity:

W = D^{-1} N F F^T N

where N is the diagonal cosine-normalizing matrix.

• Iteration update:

v^{t+1} = D^{-1} (N (F (F^T (N v^t))))

Construction, storage, and operation are all O(n). Storage is also compact: we don't need to keep a cosine-normalized version of the feature vectors.
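A rough sketch of this folded update for a sparse document-term matrix F (scipy-based; the helper name and normalization details are my own assumptions, and real code would precompute N and the row sums once rather than per call):

```python
import numpy as np
from scipy import sparse

def pic_cosine_update(F, v):
    """One folded PIC step v <- D^{-1} N F F^T N v, using only sparse matrix-vector products."""
    norms = np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
    N = sparse.diags(1.0 / norms)                           # diagonal cosine normalizer
    row_sums = N @ (F @ (F.T @ (N @ np.ones(F.shape[0]))))  # row sums of W, without forming W
    v_new = (1.0 / row_sums) * (N @ (F @ (F.T @ (N @ v))))  # apply D^{-1} elementwise
    return v_new / np.abs(v_new).sum()
```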

35

Path Folding

• We refer to this technique as path folding due to its connections to “folding” a bipartite graph into a unipartite graph.

36

Results

• An accuracy result:

[Figure: scatter plot where each point is the accuracy on a 2-cluster text dataset; the upper triangle means our method wins, the lower triangle means spectral clustering wins, and points on the diagonal (most datasets) are tied. There is no statistically significant difference: the accuracy is the same.]

37

Results

• A scalability result:

[Figure: algorithm runtime (log scale) vs. data size (log scale), with linear and quadratic reference curves; spectral clustering (red and blue) follows the quadratic curve, while our method (green) follows the linear curve.]


MultiRankWalk

• Classification labels are expensive to obtain
• Semi-supervised learning (SSL) learns from labeled and unlabeled data for classification
• When it comes to network data, what is a general, simple, and effective method that requires very few labels?

Our answer: MultiRankWalk (MRW)

Random Walk with Restart

• Imagine a network, and starting at a specific node, you follow the edges randomly.

• But with some probability, you “jump” back to the starting node (restart!).

If you recorded the number of times you land on each node, what would that distribution look like?

Random Walk with Restart

[Figure: the walk distribution over the network for a given start node, and how it changes if we start at a different node.]

Random Walk with Restart

• The walk distribution r satisfies a simple equation:

r = (1 - d) u + d W r

where u indicates the start node(s), W is the transition matrix of the network, (1 - d) is the restart probability, and d is the "keep-going" probability (damping factor).

Random Walk with Restart

• Random walk with restart (RWR) can be solved simply and efficiently with an iterative procedure:

r^{t+1} = (1 - d) u + d W r^t
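A minimal sketch of that iteration (the damping value, tolerance, and names are illustrative assumptions):

```python
import numpy as np

def random_walk_with_restart(W, u, d=0.85, max_iters=1000, tol=1e-8):
    """Iterate r <- (1 - d) u + d W r until the walk distribution stabilizes."""
    r = u.copy()
    for _ in range(max_iters):
        r_new = (1 - d) * u + d * (W @ r)
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r
```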

RWR for Classification

• Simple idea: use RWR for classification
• Run one RWR with the start nodes being the labeled points in class A, and another with the start nodes being the labeled points in class B
• Nodes frequented more by RWR(A) belong to class A; otherwise they belong to class B
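One way this idea could look in code, for any number of classes (a sketch under my own assumptions: `seeds_by_class` is a list of index lists, one per class, and a fixed iteration count stands in for a convergence test):

```python
import numpy as np

def multirankwalk(W, seeds_by_class, d=0.85, iters=100):
    """Score every node with one RWR per class, then assign each node to the class
    whose walk visits it most."""
    n = W.shape[0]
    scores = []
    for seeds in seeds_by_class:
        u = np.zeros(n)
        u[seeds] = 1.0 / len(seeds)        # restart mass spread over this class's labeled nodes
        r = u.copy()
        for _ in range(iters):
            r = (1 - d) * u + d * (W @ r)  # same RWR update as above
        scores.append(r)
    return np.argmax(np.vstack(scores), axis=0)
```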

45

Results

Accuracy: MRW vs. the harmonic functions method (HF)

[Figure: scatter plot of accuracy; points in the upper triangle mean MRW does better, points in the lower triangle mean HF does better. Point color and style roughly indicate the number of labeled examples. With a larger number of labels both methods do well; MRW does much better with only a few labels!]

Results

• How much better is MRW when using an authoritative seed preference?

[Figure: y-axis is the MRW F1 score minus the wvRN F1 score; x-axis is the number of seed labels per class. The gap between MRW and wvRN narrows with authoritative seeds, but it is still prominent on some datasets with a small number of seed labels.]


48

MRW with Path Folding

• MRW is based on random walk with restart (RWR):

r^{t+1} = (1 - d) u + d W r^t

• The core operation is a matrix-vector multiplication, so "path folding" can be applied to MRW as well, to learn efficiently from induced graph data!
• Most other methods instead preprocess the induced graph data to create a sparse similarity matrix
• Path folding can also be applied to iterative implementations of the harmonic functions method
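Combining the cosine-similarity folding from earlier with the RWR update gives something like the following sketch (names and the fixed iteration count are my own assumptions; as before, N and the row sums would be precomputed in practice):

```python
import numpy as np
from scipy import sparse

def folded_rwr(F, u, d=0.85, iters=100):
    """RWR on the induced cosine-similarity graph, never materializing W = D^{-1} N F F^T N."""
    norms = np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
    N = sparse.diags(1.0 / norms)                            # cosine normalizer
    row_sums = N @ (F @ (F.T @ (N @ np.ones(F.shape[0]))))   # row sums of the folded W
    r = u.copy()
    for _ in range(iters):
        Wr = (1.0 / row_sums) * (N @ (F @ (F.T @ (N @ r))))  # W r via sparse products only
        r = (1 - d) * u + d * Wr
    return r
```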

49

On to real, large, difficult applications…

• We now have a basic set of tools to do efficient unsupervised and semi-supervised learning on graph data

• We want to apply them to cool, challenging problems…


51

A Web-Scale Knowledge Base

• Read the Web (RtW) project:

Build a never-ending system that learns to extract information from unstructured web pages, resulting in a knowledge base of structured information.

52

Noun Phrase and Context Data

• As part of the RtW project, two kinds of noun phrase (NP) and context co-occurrence data were produced:
– NP-context co-occurrence data
– NP-NP co-occurrence data
• These datasets can be treated as graphs (see the sketch below for one way to load them as sparse matrices)
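For illustration, a small helper that builds the NP-by-context count matrix as a sparse matrix; the (NP, context, count) triple format, the function name, and the index dictionaries are hypothetical assumptions about how the data might be stored, not the RtW format itself:

```python
from scipy import sparse

def build_np_context_matrix(triples, np_index, ctx_index):
    """Build a sparse NP-by-context count matrix from (np, context, count) triples.
    The result can serve as the feature matrix F that path folding operates on."""
    rows, cols, vals = [], [], []
    for np_str, ctx, count in triples:
        rows.append(np_index[np_str])
        cols.append(ctx_index[ctx])
        vals.append(count)
    return sparse.csr_matrix((vals, (rows, cols)),
                             shape=(len(np_index), len(ctx_index)))
```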

53

Noun Phrase and Context Data

• NP-context data:

Example sentence: "… know that drinking pomegranate juice may not be a bad …"

NP: pomegranate juice; before context: "know that drinking _"; after context: "_ may not be a bad"

[Figure: an NP-context graph linking NPs (pomegranate juice, JELL-O, Jagermeister) to contexts ("know that drinking _", "_ may not be a bad", "_ is made from", "_ promotes responsible"), with co-occurrence counts (e.g. 32, 5821) on the edges.]

54

Noun Phrase and Context Data

• NP-NP data:

Example sentence: "… French Constitution Council validates burqa ban …"

[Figure: an NP-NP graph linking co-occurring NPs such as French Constitution Council, burqa ban, French Court, hot pants, veil, JELL-O, and Jagermeister.]

Context can be used for weighting edges or for making a more complex graph.


56

Noun Phrase Categorization

• We propose using MRW (with path folding) on the NP-context data to categorize NPs, given a handful of seed NPs.

• Challenges:
– Large, noisy dataset (10m NPs and 8.6m contexts from 500m web pages)
– What's the right function for NP-NP categorical similarity?
– Which learned category assignments should we "promote" to the knowledge base?
– How to evaluate it?


58

Noun Phrase Categorization

• Preliminary experiment:
– Small subset of the NP-context data: 88k NPs, 99k contexts
– Find the category "city", starting with a handful of seeds
– Ground truth set of 2,404 city NPs created using …

59

Noun Phrase Categorization

• Preliminary result:

[Figure: precision (y-axis, 0.0 to 1.0) vs. recall (x-axis, 10% to 100%) curves comparing HF, coEM, MRW, and MRWb.]


61

Learning Synonym NP’s

• Currently the knowledge base sees NP's with different strings as different entities
• It would be useful to know when two different surface forms refer to the same semantic entity

Natural thing to try: use PIC to cluster NP's. But do we get synonyms?

62

Learning Synonym NP's

The NP-context graph yields categorical clusters, e.g.: President Obama, Senator Obama, Barack Obama, Senator McCain, George W. Bush, Nicolas Sarkozy, Tony Blair

The NP-NP graph yields semantically related clusters, e.g.: President Obama, Senator Obama, Barack Obama, Senator McCain, African American, Democrat, Bailout

But together they may help us to identify synonyms!

Other methods, e.g. string similarity, may help to further refine the results: President Obama, Senator Obama, Barack Obama, BarackObama.com, Obamamania, Anti-Obama, Michelle Obama


64

Learning Polyseme NP’s

• Currently a noun phrase found on a web page can only be mapped to a single entity in the knowledge base

• It would be useful to know when two different entities share the same surface form.

• Example: the NP "Jordan" can refer to more than one entity. Which one?

65

Learning Polyseme NP's

• Proposal: cluster the NP's neighbors:
– Cluster contexts in the NP-context graph and NPs in the NP-NP graph
– Identify salient clusters as references to different entities

• Results from clustering the neighbors of "Jordan" in the NP-NP graph (two of the resulting clusters are listed below):

Tru, morgue, Harrison, Jackie, Rick, Davis, Jack, Chicago, episode, sights, six NBA championships, glass, United Center, brother, restaurant, mother, show, Garret, guest star, dead body, woman, Khouri, Loyola, autopsy, friend, season, release, corpse, Lantern, NBA championships, Maternal grandparents, Sox, Starr, medical examiner, Ivers, Hotch, maiden name, NBA titles, cousin John, Scottie Pippen, guest appearance, Peyton, Noor, night, name, Dr . Cox, third pick, Phoebe, side, third season, EastEnders, Tei, Chicago White Sox, trumpeter, day, Chicago Bulls, products, couple, Pippen, Illinois, All-NBA First Team, Dennis Rodman, first retirement

1948-1967, 500,000 Iraqis, first Arab country, Palestinian Arab state, Palestinian President, Arab-states, al-Qaeda, Arab World, countries, ethics, Faynan, guidelines, Malaysia, methodology, microfinance, militias, Olmert, peace plan, Strategic Studies, million Iraqi refugees, two Arab countries, two Arab states, Palestinian autonomy, Muslim holy sites, Jordan algebras, Jordanian citizenship, Arab Federation, Hashemite dynasty, Jordanian citizens, Gulf Cooperation, Dahabi, Judeh, Israeli-occupied West Bank, Mashal, 700,000 refugees, Yarmuk River, Palestine Liberation, Sunni countries, 2 million Iraqi, 2 million Iraqi refugees, Israeli borders, moderate Arab states


67

Finding New Types and Relations

• The proposed methods for finding synonyms and polysemes identify particular nodes and relationships between nodes, based on characteristics of the graph structure, using PIC
• Can we generalize these methods to find any unary or binary relationships between nodes in the graph?
– Different relationships may require different kinds of graphs and similarity functions
– Can we learn effective similarity functions for clustering efficiently?
• Admittedly the most open-ended part of this proposal …
• Tom's suggestion: given a set of NPs, can you find k categories and l relationships that cover the most NPs?


69

PIC Extension: Avoiding Collisions

• One robustness question for vanilla PIC as data size and complexity grow: how many (noisy) clusters can you fit in one dimension without them "colliding"?

[Figure: two example PIC embeddings, one where the cluster signals are cleanly separated and one where they are a little too close for comfort.]

70

PIC Extension: Avoiding Collisions

• We propose two general solutions (a sketch of the second follows this list):
1. Precondition the starting vector; instead of a random vector, use:
   • the degree vector
   • a skewed vector (1, 2, 4, 8, 16, etc.)
   • a choice that may depend on particular properties of the data
2. Run PIC d times with different random starts and construct a d-dimensional embedding
   • It is unlikely that two clusters collide on all d dimensions
   • We can afford it because PIC is fast and space-efficient!
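A minimal sketch of the second solution (illustrative assumptions: a fixed iteration count instead of the acceleration-based stop, and d = 4 random starts):

```python
import numpy as np
from sklearn.cluster import KMeans

def pic_multi(W, k, d=4, iters=50, seed=0):
    """Run PIC from d different random start vectors and cluster the d-dimensional embedding."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    columns = []
    for _ in range(d):
        v = rng.random(n)
        v /= v.sum()
        for _ in range(iters):
            v = W @ v
            v /= np.abs(v).sum()
        columns.append(v)                 # one embedding dimension per run
    return KMeans(n_clusters=k).fit_predict(np.column_stack(columns))
```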

71

PIC Extension: Avoiding Collisions

• Preliminary results on network classification datasets:

[Figure: accuracy per dataset and number of clusters k; red is the PIC embedding with a random start vector, green is PIC using a degree start vector, blue is PIC using 4 random start vectors. One-dimensional PIC embeddings lose on accuracy at higher k's compared to NCut and NJW, but using 4 random start vectors instead helps! Note that the number of vectors is much smaller than k.]

72

PIC Extension: Avoiding Collisions

• Preliminary results on name disambiguation datasets:

[Figure: again, using 4 random start vectors seems to work, and again the number of vectors is much smaller than k.]

73

PIC Extension: Hierarchical Clustering

• Real, large-scale data may not have a “flat” clustering structure

• A hierarchical view may be more useful

Good news: the dynamics of a PIC embedding display hierarchically convergent behavior!

74

PIC Extension: Hierarchical Clustering

• Why? Recall the PIC embedding at time t:

v^t / (c_1 λ_1^t) = e_1 + (c_2/c_1)(λ_2/λ_1)^t e_2 + (c_3/c_1)(λ_3/λ_1)^t e_3 + ... + (c_n/c_1)(λ_n/λ_1)^t e_n

The e's are eigenvectors (structure), ordered from big to small eigenvalues. The less significant eigenvectors (structures) go away first, one by one, while the more salient structures stick around. There may not be a clear eigengap, just a gradient of cluster saliency.
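One simple way to expose this hierarchically convergent behavior is to snapshot the PIC vector at increasing iteration counts; earlier snapshots retain finer (less salient) structure, later ones keep only the most salient clusters. A sketch (the checkpoint values and function name are arbitrary illustrative choices):

```python
import numpy as np

def pic_snapshots(W, v0, checkpoints=(10, 50, 200)):
    """Save the PIC vector at several iteration counts for a multi-resolution view."""
    snapshots, v, t = {}, v0.copy(), 0
    for target in sorted(checkpoints):
        while t < target:
            v = W @ v
            v /= np.abs(v).sum()
            t += 1
        snapshots[target] = v.copy()
    return snapshots
```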

75

PIC Extension: Hierarchical Clustering

[Figure: the same dataset shown earlier; PIC has already converged to 8 clusters, but we keep on iterating (yes, it might take a while), and less salient clusters merge into more salient ones; "N" is still a part of the "2009" cluster. Similar behavior has also been noted in matrix-matrix power methods (diffusion maps, mean-shift, multi-resolution spectral clustering).]


77

Distributed / Parallel Implementations

• Distributed / parallel implementations of learning methods are necessary to support large-scale data given the direction of hardware development

• PIC, MRW, and their path folding variants have at their core sparse matrix-vector multiplications

• Sparse matrix-vector multiplication lends itself well to a distributed / parallel computing framework (a map-reduce style sketch of one multiplication step follows below)
• We propose to use …
• Alternatives: …
• Existing graph analysis tool: …
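The following single-machine sketch shows why the core operation distributes so naturally: each nonzero entry of W contributes to the output independently, so the "map" work can be partitioned across machines and the "reduce" is a sum per row. The edge-triple format and the in-memory stand-in for a real framework are my own assumptions, not a specific tool from the proposal:

```python
from collections import defaultdict

def spmv_mapreduce(edges, v):
    """A map-reduce view of one sparse matrix-vector multiplication step:
    map:    for each nonzero W[i, j], emit (i, W[i, j] * v[j])
    reduce: for each row i, sum the emitted values to get the new v[i]."""
    acc = defaultdict(float)
    for i, j, w in edges:      # "map" over nonzero entries (i, j, w) of W
        acc[i] += w * v[j]     # grouped-by-key "reduce": sum contributions to row i
    return acc
```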


Questions?

80

Additional Information

PIC: Some “Powering” Methods at a Glance

• Tishby & Slonim 2000: W = D^{-1}A; iterate: matrix powers of W; stopping: rate of information loss; final step: information bottleneck method
• Zhou & Woodruff 2004: W = A; iterate: matrix powers of W; stopping: a small t; final step: a threshold ε
• Carreira-Perpinan 2006: W = D^{-1}A; iterate: X^{t+1} = WX; stopping: entropy; final step: a threshold ε
• PIC: W = D^{-1}A; iterate: v^{t+1} = Wv^t; stopping: acceleration; final step: k-means

How far can we go with a one- or low-dimensional embedding?

82

PIC: Versus Popular Fast Sparse Eigencomputation Methods

• For symmetric matrices vs. general matrices:
– Power method (both): basic; numerically unstable, can be slow
– Lanczos Method (symmetric) / Arnoldi Method (general): more stable, but requires lots of memory
– Implicitly Restarted Lanczos Method (IRLM, symmetric) / Implicitly Restarted Arnoldi Method (IRAM, general): more memory-efficient

• Time and space:
– IRAM: time (O(m^3) + (O(nm) + O(e)) × O(m - k)) × (# restarts); space O(e) + O(nm)
– PIC: time O(e) × (# iterations); space O(e)

Randomized sampling methods are also popular.

83

PIC: Related Clustering Work

• Spectral Clustering (Roxborough & Sen 1997, Shi & Malik 2000, Meila & Shi 2001, Ng et al. 2002)
• Kernel k-Means (Dhillon et al. 2007)
• Modularity Clustering (Newman 2006)
• Matrix Powering
– Markovian relaxation & the information bottleneck method (Tishby & Slonim 2000)
– matrix powering (Zhou & Woodruff 2004)
– diffusion maps (Lafon & Lee 2006)
– Gaussian blurring mean-shift (Carreira-Perpinan 2006)
• Mean-Shift Clustering
– mean-shift (Fukunaga & Hostetler 1975, Cheng 1995, Comaniciu & Meer 2002)
– Gaussian blurring mean-shift (Carreira-Perpinan 2006)

84

PICwPF: Results

[Figure: scatter plot where each point represents the accuracy result from a dataset; the upper triangle means PIC wins, the lower triangle means k-means wins.]

85

PICwPF: Results

• The two methods have almost the same behavior
• Overall, one method is not statistically significantly better than the other

86

PICwPF: Results

• Not sure why NCUTiram did not work as well as NCUTevd
• Lesson: approximate eigencomputation methods may require expertise to work well

87

PICwPF: Results

• PIC is O(n) per iteration and the runtime curve looks linear…
• But I don't like eyeballing curves, and perhaps the number of iterations increases with the size or difficulty of the dataset?

[Figure: correlation plot with a correlation statistic (0 = none, 1 = correlated).]

PICwPF: Results

• Linear run-time implies a constant number of iterations.
• The number of iterations to "acceleration-convergence" is hard to analyze:
– It is faster than a single complete run of power iteration to convergence.
– On our datasets, 10-20 iterations is typical and 30-35 is exceptional.

89

PICwPF: Related Work

• Faster spectral clustering:
– Approximate eigendecomposition (Lanczos, IRAM)
– Sampled eigendecomposition (Nyström)
These are not O(n)-time methods.
• Sparser matrix:
– Sparse construction: k-nearest-neighbor graph, k-matching
– Graph sampling / reduction
These still require O(n^2) construction in general, and are not O(n)-space methods.

90


MRW: RWR for Classification

We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks

MRW: Seed Preference

• Obtaining labels for data points is expensive
• We want to minimize the cost of obtaining labels
• Observations:
– Some labels are inherently more useful than others
– Some labels are easier to obtain than others

Question: "authoritative" or "popular" nodes in a network are typically easier to obtain labels for. But are these labels also more useful than others?

Seed Preference

• Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label

• The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data

• We consider 3 preferences:
– Random
– Link Count (the nodes with the highest counts make the list)
– PageRank (the nodes with the highest scores make the list)

MRW: The Question

• What really makes MRW and wvRN different?
• Network-based SSL methods often boil down to label propagation.
• MRW and wvRN represent two general propagation methods; note that they are called by many names:
– MRW: random walk with restart, regularized random walk, personalized PageRank, local & global consistency
– wvRN: reverse random walk, random walk with sink nodes, hitting time, harmonic functions on graphs, iterative averaging of neighbors

Great… but we still don't know why their behavior differs on these network datasets!

MRW: The Question

• It's difficult to answer exactly why MRW does better with a smaller number of seeds.
• But we can gather probable factors from their propagation models:
1. MRW is centrality-sensitive; wvRN is centrality-insensitive
2. MRW has an exponential drop-off (damping factor); wvRN has no drop-off / damping
3. In MRW, propagation of the different classes is done independently; in wvRN, the propagation of different classes interacts

MRW: The Question

• An example from a political blog dataset: MRW vs. wvRN scores for how much a blog is politically conservative (seed labels were underlined on the original slide).

First list: 1.000 neoconservatives.blogspot.com, 1.000 strangedoctrines.typepad.com, 1.000 jmbzine.com, 0.593 presidentboxer.blogspot.com, 0.585 rooksrant.com, 0.568 purplestates.blogspot.com, 0.553 ikilledcheguevara.blogspot.com, 0.540 restoreamerica.blogspot.com, 0.539 billrice.org, 0.529 kalblog.com, 0.517 right-thinking.com, 0.517 tom-hanna.org, 0.514 crankylittleblog.blogspot.com, 0.510 hasidicgentile.org, 0.509 stealthebandwagon.blogspot.com, 0.509 carpetblogger.com, 0.497 politicalvicesquad.blogspot.com, 0.496 nerepublican.blogspot.com, 0.494 centinel.blogspot.com, 0.494 scrawlville.com, 0.493 allspinzone.blogspot.com, 0.492 littlegreenfootballs.com, 0.492 wehavesomeplanes.blogspot.com, 0.491 rittenhouse.blogspot.com, 0.490 secureliberty.org, 0.488 decision08.blogspot.com, 0.488 larsonreport.com

Second list: 0.020 firstdownpolitics.com, 0.019 neoconservatives.blogspot.com, 0.017 jmbzine.com, 0.017 strangedoctrines.typepad.com, 0.013 millers_time.typepad.com, 0.011 decision08.blogspot.com, 0.010 gopandcollege.blogspot.com, 0.010 charlineandjamie.com, 0.008 marksteyn.com, 0.007 blackmanforbush.blogspot.com, 0.007 reggiescorner.blogspot.com, 0.007 fearfulsymmetry.blogspot.com, 0.006 quibbles-n-bits.com, 0.006 undercaffeinated.com, 0.005 samizdata.net, 0.005 pennywit.com, 0.005 pajamahadin.com, 0.005 mixtersmix.blogspot.com, 0.005 stillfighting.blogspot.com, 0.005 shakespearessister.blogspot.com, 0.005 jadbury.com, 0.005 thefulcrum.blogspot.com, 0.005 watchandwait.blogspot.com, 0.005 gindy.blogspot.com, 0.005 cecile.squarespace.com, 0.005 usliberals.about.com, 0.005 twentyfirstcenturyrepublican.blogspot.com

Observations from the slide:
1. Centrality-sensitive: seeds have different scores and are not necessarily the highest
2. Exponential drop-off: much less sure about nodes further away from the seeds
3. Classes propagate independently: charlineandjamie.com is both very likely a conservative and very likely a liberal blog (good or bad?)

We still don't really understand it yet.

MRW: Related Work

• MRW is very much related to:
– "Local and global consistency" (Zhou et al. 2004): similar formulation, different view
– "Web content categorization using link information" (Gyongyi et al. 2006): RWR ranking used as features to an SVM
– "Graph-based semi-supervised learning as a generative model" (He et al. 2007): random walk without restart, with heuristic stopping
• Seed preference is related to the field of active learning:
– Active learning chooses which data point to label next based on previous labels; the labeling is interactive
– Seed preference is a batch labeling method
– Authoritative seed preference is a good baseline for active learning on network data!