
Page 1:

Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning

Frank Lin
Language Technologies Institute
School of Computer Science, Carnegie Mellon University

PhD Thesis Proposal
October 8, 2010, Pittsburgh, PA, USA

Thesis Committee: William W. Cohen (Chair), Christos Faloutsos, Tom Mitchell, Xiaojin Zhu

Page 2: Motivation

• Graph data is everywhere, and some of it is very big
• We want to do machine learning on it
• For non-graph data, it often makes sense to represent it as a graph for unsupervised and semi-supervised learning: these methods often require a similarity measure between data points, which is naturally represented as edges between nodes

Page 3: Motivation

• We want to do clustering and semi-supervised learning on large graphs

• We need scalable methods that provide quality results

Page 4: Thesis Goal

• Make contributions toward fast, space-efficient, effective, and simple graph-based learning methods that scale up to large datasets.

Page 5: Road Map

Basic learning methods and their applications to a web-scale knowledge base (columns: graph, induced graph, applications, support):

• Unsupervised / Clustering
  – Graph: Power Iteration Clustering
  – Induced graph: PIC with Path Folding
  – Applications: Learning Synonym NP's; Learning Polyseme NP's; Finding New Types and Relations
  – Support: Avoiding Collision; Hierarchical Clustering; Distributed Implementation
• Semi-Supervised
  – Graph: MultiRankWalk
  – Induced graph: MRW with Path Folding
  – Applications: Noun Phrase Categorization
  – Support: Distributed Implementation

Page 6: Road Map

The same road map grid, now marking which parts are prior work (Power Iteration Clustering, PIC with Path Folding, MultiRankWalk) and which are proposed work (the rest).

Page 7: Talk Outline

• Prior Work
  – Power Iteration Clustering (PIC)
  – PIC with Path Folding
  – MultiRankWalk (MRW)
• Proposed Work
  – MRW with Path Folding
  – MRW for Populating a Web-Scale Knowledge Base: Noun Phrase Categorization
  – PIC for Extending a Web-Scale Knowledge Base: Learning Synonym NP's; Learning Polyseme NP's; Finding New Ontology Types and Relations
  – PIC Extensions
  – Distributed Implementations

Page 8: Power Iteration Clustering

• Spectral clustering methods are nice, and a natural choice for graph data
• But they are rather expensive (slow)
• Power iteration clustering (PIC) can provide a similar solution at very low cost (fast)!

Page 9: Background: Spectral Clustering

• Idea: instead of clustering data points in their original (Euclidean) space, cluster them in the space spanned by the "significant" eigenvectors of a (Laplacian) similarity matrix
• A popular spectral clustering method: normalized cuts (NCut)

Page 10: Background: Spectral Clustering

(Figure: a dataset and its normalized cut results. The 2nd and 3rd smallest eigenvectors are plotted as value vs. index; each of the clusters 1, 2, 3 occupies a distinct region of the clustering space.)

Page 11: Background: Spectral Clustering

• Normalized Cut algorithm (Shi & Malik 2000):
  1. Choose k and a similarity function s
  2. Derive the affinity matrix A from s; let W = I - D^{-1}A, where I is the identity matrix and D is a diagonal matrix with D_ii = Σ_j A_ij
  3. Find the eigenvectors and corresponding eigenvalues of W
  4. Pick the eigenvectors of W with the 2nd through kth smallest eigenvalues as the "significant" eigenvectors
  5. Project the data points onto the space spanned by these vectors
  6. Run k-means on the projected data points

Finding the eigenvectors and eigenvalues of a matrix is very slow in general. Can we find a similar low-dimensional embedding for clustering without computing eigenvectors?

Page 12: The Power Iteration

• The power iteration, or power method, is a simple iterative method for finding the dominant eigenvector of a matrix:

  v^{t+1} = c W v^t

where W is a square matrix, v^t is the vector at iteration t (v^0 is typically a random vector), and c is a normalizing constant that keeps v^t from getting too large or too small.

• It typically converges quickly, and it is fairly efficient if W is a sparse matrix.
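To make the update concrete, here is a minimal sketch in Python/NumPy (our own illustration, not code from the proposal; using the L1 norm for the constant c is an assumption):

```python
import numpy as np

def power_iteration(W, num_iters=100, tol=1e-9, seed=0):
    """Approximate the dominant eigenvector of a square matrix W."""
    rng = np.random.default_rng(seed)
    v = rng.random(W.shape[0])            # v^0: typically a random vector
    v /= np.abs(v).sum()
    for _ in range(num_iters):
        v_new = W @ v                     # v^{t+1} = W v^t ...
        v_new /= np.abs(v_new).sum()      # ... scaled by the normalizing constant c
        if np.abs(v_new - v).max() < tol: # converged
            return v_new
        v = v_new
    return v
```

With a sparse W (e.g., a scipy.sparse matrix), each iteration costs only on the order of the number of nonzero entries.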

Page 13: The Power Iteration

• The same update, v^{t+1} = c W v^t. What if we let W = D^{-1}A (like Normalized Cut)? Then W is the row-normalized similarity matrix.

Page 14: The Power Iteration

Page 15: Power Iteration Clustering

• The 2nd through kth eigenvectors of W = D^{-1}A are roughly piecewise constant with respect to the underlying clusters, each separating one cluster from the rest of the data (Meila & Shi 2001)
• A linear combination of piecewise constant vectors is also piecewise constant!

Page 16: Spectral Clustering

(Figure, repeated from page 10: the dataset and normalized cut results, with the 2nd and 3rd smallest eigenvectors plotted as value vs. index for clusters 1, 2, 3 in the clustering space.)

Page 17:

(Figure: the plots of two piecewise constant eigenvectors are added together, "+" and "=", illustrating that their sum is again piecewise constant.)

Page 18: Power Iteration Clustering

(Figure: the dataset and PIC results, with the one-dimensional PIC embedding v^t.)

Key idea: to do clustering, we may not need all the information in a full spectral embedding (e.g., the distances between clusters in a k-dimensional eigenspace); we just need the clusters to be separated in some space.

Page 19: When to Stop

Recall that, writing v^0 in the eigenbasis of W,

  v^t = c_1 λ_1^t e_1 + … + c_k λ_k^t e_k + … + c_n λ_n^t e_n

Then, dividing through by c_1 λ_1^t:

  v^t / (c_1 λ_1^t) = e_1 + (c_2/c_1)(λ_2/λ_1)^t e_2 + … + (c_k/c_1)(λ_k/λ_1)^t e_k + … + (c_n/c_1)(λ_n/λ_1)^t e_n

Because they are raised to the power t, the eigenvalue ratios determine how fast v converges to e_1. At the beginning, v changes fast ("accelerating") as it converges locally, because the "noise terms" (k+1 … n) have small λ. Once the noise terms have gone to zero, v changes slowly ("constant speed"), because only the larger-λ terms (2 … k) are left, and their eigenvalue ratios are close to 1.

Page 20: Power Iteration Clustering

• A basic power iteration clustering (PIC) algorithm:

Input: a row-normalized affinity matrix W and the number of clusters k
Output: clusters C1, C2, …, Ck

1. Pick an initial vector v^0
2. Repeat:
   • Set v^{t+1} ← W v^t
   • Set δ^{t+1} ← |v^{t+1} - v^t|
   • Increment t
   • Stop when |δ^t - δ^{t-1}| ≈ 0
3. Use k-means to cluster points on v^t and return clusters C1, C2, …, Ck
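A compact sketch of this algorithm (our own illustration; the stopping threshold eps and the use of scikit-learn's KMeans are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def pic(W, k, max_iters=1000, eps=1e-5, seed=0):
    """Basic PIC: W is a row-normalized affinity matrix, k the number of clusters."""
    rng = np.random.default_rng(seed)
    v = rng.random(W.shape[0])             # 1. pick an initial vector v^0
    v /= np.abs(v).sum()
    delta_prev = np.inf
    for _ in range(max_iters):             # 2. repeat
        v_new = W @ v                      #    v^{t+1} <- W v^t (kept normalized)
        v_new /= np.abs(v_new).sum()
        delta = np.abs(v_new - v).sum()    #    delta^{t+1} <- |v^{t+1} - v^t|
        v = v_new
        if abs(delta - delta_prev) < eps:  #    stop when |delta^t - delta^{t-1}| ~ 0
            break
        delta_prev = delta
    # 3. run k-means on the one-dimensional embedding v^t
    return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))
```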

Page 21: PIC Runtime

(Runtime table comparing PIC, Normalized Cut, and a faster Normalized Cut implementation; some entries read "ran out of memory (24GB)".)

Page 22: PIC Accuracy on Network Datasets

(Scatter plot: each point compares accuracy on one dataset. Upper triangle: PIC does better; lower triangle: NCut or NJW does better.)

Page 23: Talk Outline

Page 24: Clustering Text Data

• Spectral clustering methods are nice
• We want to use them for clustering (a lot of) text data

Page 25: The Problem with Text Data

• Documents are often represented as feature vectors of words:

  "The importance of a Web page is an inherently subjective matter, which depends on the readers…"
  "In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use…"
  "You're not cool just because you have a lot of followers on twitter, get over yourself…"

  cool  web  search  make  over  you
   0     4     8      2     5     3
   0     8     7      4     3     2
   1     0     0      0     1     2

Page 26: The Problem with Text Data

• Feature vectors are often sparse, but the similarity matrix is not!

  cool  web  search  make  over  you
   0     4     8      2     5     3
   0     8     7      4     3     2
   1     0     0      0     1     2

Mostly zeros: any document contains only a small fraction of the vocabulary.

    -    27   125
   27     -    23
  125    23     -

Mostly non-zero: any two documents are likely to have a word in common.

Page 27: The Problem with Text Data

• A similarity matrix is the input to many clustering methods, including spectral clustering
• Spectral clustering requires computing the eigenvectors of the similarity matrix: in general O(n^3), and approximation methods are still not very fast
• The similarity matrix itself takes O(n^2) time to construct, O(n^2) space to store, and more than O(n^2) time to operate on
• Too expensive! It does not scale up to big datasets!

Page 28: The Problem with Text Data

• We want to use the similarity matrix for clustering (like spectral clustering), but:
  – without calculating eigenvectors
  – without constructing or storing the similarity matrix

Power Iteration Clustering + Path Folding

Page 29: Path Folding

• Recall the basic PIC algorithm from page 20; its key operation, v^{t+1} ← W v^t, is a matrix-vector multiplication
• Okay, we have a fast clustering method, but there's the W that requires O(n^2) storage space and construction and operation time!

Page 30: Path Folding

• What's so good about matrix-vector multiplication?
• If we can decompose the matrix, say W = ABC, then

  v^{t+1} = W v^t = (ABC) v^t

• and we arrive at the same solution doing a series of matrix-vector multiplications:

  v^{t+1} = A(B(C v^t))

• Isn't this more expensive? Well, not if W is dense while A, B, and C are sparse!

Page 31: Path Folding

• As long as we can decompose the matrix into a series of sparse matrices, we can turn a dense matrix-vector multiplication into a series of sparse matrix-vector multiplications.
• This means we can turn an operation that requires O(n^2) storage and runtime into one that requires O(n) storage and runtime!
• This is exactly the case for text data, and for many other kinds of data as well!

Page 32: Path Folding

• Example, inner product similarity:

  W = D^{-1} F F^T

where F is the original feature matrix, F^T is the feature matrix transposed, and D is a diagonal matrix that normalizes W so its rows sum to 1.

• F is given; F^T needs no separate storage (just use F); D takes O(n) to construct and store.

Page 33: Path Folding

• Example, inner product similarity: W = D^{-1} F F^T
• Iteration update:

  v^{t+1} = D^{-1}(F(F^T v^t))

Construction: O(n). Storage: O(n). Operation: O(n).

Okay… how about a similarity function we actually use for text data?

Page 34: Path Folding

• Example, cosine similarity:

  W = D^{-1} N F F^T N

where N is a diagonal cosine-normalizing matrix.

• Iteration update:

  v^{t+1} = D^{-1}(N(F(F^T(N v^t))))

Construction: O(n). Storage: O(n). Operation: O(n).

Compact storage: we don't need a cosine-normalized version of the feature vectors.
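A sketch of this folded cosine update with a sparse feature matrix (our own illustration; scipy.sparse, the toy sizes, and the helper names are assumptions). Note that N and D are stored only as vectors of their diagonals, and F stays in its raw form:

```python
import numpy as np
import scipy.sparse as sp

def cosine_folded_step(F, v, n_diag, d_diag):
    """One folded step v <- D^{-1} N F F^T N v, without forming the dense W."""
    u = n_diag * (F @ (F.T @ (n_diag * v)))  # N F F^T N v: sparse mat-vecs only
    return u / d_diag                        # apply D^{-1} elementwise

n, m = 1000, 5000                            # documents, vocabulary size
F = sp.random(n, m, density=0.01, format="csr")          # sparse feature matrix
row_norms = np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
n_diag = 1.0 / row_norms                     # diagonal of the cosine-normalizing N
d_diag = n_diag * (F @ (F.T @ n_diag))       # row sums of N F F^T N = diagonal of D
v = np.random.rand(n)
v = cosine_folded_step(F, v, n_diag, d_diag) # one folded power-iteration step
```

Every step touches only the nonzero entries of F, so storage and per-iteration time stay linear in the size of the sparse input.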

Page 35: Path Folding

• We refer to this technique as path folding due to its connections to “folding” a bipartite graph into a unipartite graph.

Page 36: Results

• An accuracy result:

(Scatter plot: each point is the accuracy on a 2-cluster text dataset. Upper triangle: we win; lower triangle: spectral clustering wins; on the diagonal: tied, which is the case for most datasets. No statistically significant difference: same accuracy.)

Page 37: Results

• A scalability result:

(Plot: algorithm runtime on a log scale (y-axis) vs. data size on a log scale (x-axis). Spectral clustering (red and blue) tracks the quadratic curve; our method (green) tracks the linear curve.)

Page 38: Talk Outline

Page 39: MultiRankWalk

• Classification labels are expensive to obtain
• Semi-supervised learning (SSL) learns from both labeled and unlabeled data for classification
• When it comes to network data, what is a general, simple, and effective method that requires very few labels?

Our answer: MultiRankWalk (MRW)

Page 40: Random Walk with Restart

• Imagine a network; starting at a specific node, you follow the edges randomly.
• But with some probability, you "jump" back to the starting node (restart!).

If you recorded the number of times you land on each node, what would that distribution look like?

Page 41: Random Walk with Restart

(Figure: the walk distribution shown on the network, with the start node marked.) What if we start at a different node?

Page 42: Random Walk with Restart

• The walk distribution r satisfies a simple equation:

  r = (1 - d)u + dWr

where u indicates the start node(s), W is the transition matrix of the network, (1 - d) is the restart probability, and d is the "keep-going" probability (damping factor).

Page 43: Random Walk with Restart

• Random walk with restart (RWR) can be solved simply and efficiently with an iterative procedure:

  r^{t+1} = (1 - d)u + dW r^t
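A minimal sketch of this iteration (our own illustration; it assumes W is the transition matrix in column-stochastic form, so that r stays a distribution, and u is the restart distribution over the start nodes):

```python
import numpy as np

def rwr(W, u, d=0.85, tol=1e-8, max_iters=1000):
    """Random walk with restart: iterate r <- (1 - d) u + d W r to convergence."""
    r = u.copy()
    for _ in range(max_iters):
        r_new = (1 - d) * u + d * (W @ r)  # r^{t+1} = (1 - d) u + d W r^t
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r
```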

Page 44: RWR for Classification

• Simple idea: use RWR for classification
• Run one RWR with the start nodes being the labeled points in class A, and another with the start nodes being the labeled points in class B
• Nodes frequented more by RWR(A) belong to class A; otherwise they belong to class B (a toy sketch follows below)
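Continuing the sketch on a toy graph (the rwr helper is the hypothetical function defined after page 43; the adjacency matrix is made up for illustration):

```python
import numpy as np

# Toy network: adjacency matrix, column-normalized into a transition matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=0)

u_A = np.array([1.0, 0.0, 0.0, 0.0])    # labeled point(s) in class A: node 0
u_B = np.array([0.0, 0.0, 0.0, 1.0])    # labeled point(s) in class B: node 3

r_A = rwr(W, u_A)                       # walk distribution for class A
r_B = rwr(W, u_B)                       # walk distribution for class B
labels = np.where(r_A > r_B, "A", "B")  # more frequented by RWR(A) -> class A
```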

Page 45: Results

• Accuracy: MRW vs. the harmonic functions method (HF)

(Scatter plot: upper triangle, MRW does better; lower triangle, HF does better. Point color and style roughly indicate the number of labeled examples. With a larger number of labels both methods do well; MRW does much better with only a few labels!)

Page 46: Results

• How much better is MRW using an authoritative seed preference?

(Plot: the y-axis is MRW's F1 score minus wvRN's F1 score; the x-axis is the number of seed labels per class. The gap between MRW and wvRN narrows with authoritative seeds, but remains prominent on some datasets with a small number of seed labels.)

Page 47: Talk Outline

Page 48: MRW with Path Folding

• MRW is based on random walk with restart (RWR):

  r^{t+1} = (1 - d)u + dW r^t

• Its core operation is a matrix-vector multiplication, so "path folding" can be applied to MRW as well, to learn efficiently from induced graph data (most other methods preprocess the induced graph data to create a sparse similarity matrix)
• It can also be applied to iterative implementations of the harmonic functions method
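A sketch of how the same trick could look for RWR (our own illustration; it assumes the row-normalized induced-graph matrix W = D^{-1} F F^T from the path folding slides as the transition matrix):

```python
import numpy as np

def folded_rwr(F, u, d=0.85, tol=1e-8, max_iters=1000):
    """RWR on the induced graph W = D^{-1} F F^T, never materializing W."""
    row_sums = F @ (F.T @ np.ones(F.shape[0]))  # diagonal of D, computed once
    r = u.copy()
    for _ in range(max_iters):
        Wr = (F @ (F.T @ r)) / row_sums         # folded matrix-vector product
        r_new = (1 - d) * u + d * Wr            # r^{t+1} = (1 - d)u + d W r^t
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r
```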

Page 49: On to real, large, difficult applications…

• We now have a basic set of tools for efficient unsupervised and semi-supervised learning on graph data
• We want to apply them to cool, challenging problems…

Page 50: Talk Outline

Page 51: A Web-Scale Knowledge Base

• Read the Web (RtW) project: build a never-ending system that learns to extract information from unstructured web pages, resulting in a knowledge base of structured information.

Page 52: Noun Phrase and Context Data

• As part of the RtW project, two kinds of noun phrase (NP) and context co-occurrence data were produced:
  – NP-context co-occurrence data
  – NP-NP co-occurrence data
• These datasets can be treated as graphs

Page 53: Noun Phrase and Context Data

• NP-context data, e.g. from "… know that drinking pomegranate juice may not be a bad …":

  NP: pomegranate juice
  before context: "know that drinking _" (count 3)
  after context: "_ may not be a bad" (count 2)

(Figure: the NP-context graph, with NPs such as JELL-O and Jagermeister linked to contexts such as "_ is made from" and "_ promotes responsible"; edges carry co-occurrence counts, e.g. 5821.)

Page 54: Noun Phrase and Context Data

• NP-NP data, e.g. from "… French Constitution Council validates burqa ban …": the NPs "French Constitution Council" and "burqa ban" co-occur.

(Figure: the NP-NP graph, linking co-occurring NPs such as French Court, hot pants, veil, JELL-O, and Jagermeister. The context can be used for weighting edges or for making a more complex graph.)

Page 55: Talk Outline

Page 56: Noun Phrase Categorization

• We propose using MRW (with path folding) on the NP-context data to categorize NPs, given a handful of seed NPs.
• Challenges:
  – Large, noisy dataset (10m NPs and 8.6m contexts from 500m web pages)
  – What's the right function for NP-NP categorical similarity?
  – Which learned category assignments should we "promote" to the knowledge base?
  – How do we evaluate it?

Page 57: Noun Phrase Categorization

Page 58: Noun Phrase Categorization

• Preliminary experiment:
  – A small subset of the NP-context data: 88k NPs, 99k contexts
  – Find the category "city", starting with a handful of seeds
  – A ground truth set of 2,404 city NPs created using …

Page 59: Noun Phrase Categorization

• Preliminary result:

(Precision-recall plot comparing HF, coEM, MRW, and MRWb: precision from 0.0 to 1.0 on the y-axis, recall from 10% to 100% on the x-axis.)

Page 60: Talk Outline

Page 61: Learning Synonym NP's

• Currently the knowledge base sees NP's with different strings as different entities
• It would be useful to know when two different surface forms refer to the same semantic entity
• The natural thing to try: use PIC to cluster NPs. But do we get synonyms?

Page 62: Learning Synonym NP's

• The NP-context graph yields categorical clusters, e.g.: President Obama, Senator Obama, Barack Obama, Senator McCain, George W. Bush, Nicolas Sarkozy, Tony Blair
• The NP-NP graph yields semantically related clusters, e.g.: President Obama, Senator Obama, Barack Obama, Senator McCain, African American, Democrat, Bailout
• But together they may help us identify synonyms!
• Other methods (e.g., string similarity) may help to further refine results, e.g.: President Obama, Senator Obama, Barack Obama, BarackObama.com, Obamamania, Anti-Obama, Michelle Obama

Page 63: Talk Outline

Page 64: Learning Polyseme NP's

• Currently a noun phrase found on a web page can only be mapped to a single entity in the knowledge base
• It would be useful to know when two different entities share the same surface form
• Example: "Jordan" (which entity does it refer to?)

Page 65: Learning Polyseme NP's

• Proposal: cluster the NP's neighbors:
  – Cluster contexts in the NP-context graph and NPs in the NP-NP graph
  – Identify salient clusters as references to different entities
• Results from clustering the neighbors of "Jordan" in the NP-NP graph:

Tru, morgue, Harrison, Jackie, Rick, Davis, Jack, Chicago, episode, sights, six NBA championships, glass, United Center, brother, restaurant, mother, show, Garret, guest star, dead body, woman, Khouri, Loyola, autopsy, friend, season, release, corpse, Lantern, NBA championships, Maternal grandparents, Sox, Starr, medical examiner, Ivers, Hotch, maiden name, NBA titles, cousin John, Scottie Pippen, guest appearance, Peyton, Noor, night, name, Dr . Cox, third pick, Phoebe, side, third season, EastEnders, Tei, Chicago White Sox, trumpeter, day, Chicago Bulls, products, couple, Pippen, Illinois, All-NBA First Team, Dennis Rodman, first retirement

1948-1967, 500,000 Iraqis, first Arab country, Palestinian Arab state, Palestinian President, Arab-states, al-Qaeda, Arab World, countries, ethics, Faynan, guidelines, Malaysia, methodology, microfinance, militias, Olmert, peace plan, Strategic Studies, million Iraqi refugees, two Arab countries, two Arab states, Palestinian autonomy, Muslim holy sites, Jordan algebras, Jordanian citizenship, Arab Federation, Hashemite dynasty, Jordanian citizens, Gulf Cooperation, Dahabi, Judeh, Israeli-occupied West Bank, Mashal, 700,000 refugees, Yarmuk River, Palestine Liberation, Sunni countries, 2 million Iraqi, 2 million Iraqi refugees, Israeli borders, moderate Arab states

Page 66: Talk Outline

Page 67: Finding New Types and Relations

• The proposed methods for finding synonyms and polysemes use PIC to identify particular nodes and relationships between nodes based on characteristics of the graph structure
• Can we generalize these methods to find any unary or binary relationship between nodes in the graph?
• Different relationships may require different kinds of graphs and similarity functions; can we learn effective similarity functions for clustering efficiently?
• Admittedly the most open-ended part of this proposal…
• Tom's suggestion: given a set of NPs, can you find k categories and l relationships that cover the most NPs?

Page 68: Talk Outline

Page 69: PIC Extension: Avoiding Collisions

• One robustness question for vanilla PIC as data size and complexity grow: how many (noisy) clusters can you fit in one dimension without them "colliding"?

(Figure: two embedding plots, one with cluster signals cleanly separated, another where they are a little too close for comfort.)

Page 70: PIC Extension: Avoiding Collisions

• We propose two general solutions (a sketch of the second follows this list):
  1. Precondition the starting vector; instead of random, use:
     • degree
     • skewed values (1, 2, 4, 8, 16, etc.)
     • something that may depend on particular properties of the data
  2. Run PIC d times with different random starts and construct a d-dimensional embedding:
     • it is unlikely that two clusters collide on all d dimensions
     • we can afford it because PIC is fast and space-efficient!
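A sketch of the second solution (our own illustration, reusing the PIC iteration from the earlier sketch but returning the embedding instead of cluster labels):

```python
import numpy as np
from sklearn.cluster import KMeans

def pic_embedding(W, rng, max_iters=1000, eps=1e-5):
    """One PIC run to acceleration-convergence; returns the 1-D embedding."""
    v = rng.random(W.shape[0])
    v /= np.abs(v).sum()
    delta_prev = np.inf
    for _ in range(max_iters):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()
        delta = np.abs(v_new - v).sum()
        v = v_new
        if abs(delta - delta_prev) < eps:
            break
        delta_prev = delta
    return v

def multi_start_pic(W, k, dims=4, seed=0):
    """Run PIC `dims` times from different random starts; cluster in d dimensions."""
    rng = np.random.default_rng(seed)
    V = np.column_stack([pic_embedding(W, rng) for _ in range(dims)])
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)
```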

Page 71: PIC Extension: Avoiding Collisions

• Preliminary results on network classification datasets:

(Plot, by dataset and number of clusters k. Red: PIC embedding with a random start vector; green: PIC using a degree start vector; blue: PIC using 4 random start vectors. One-dimensional PIC embeddings lose on accuracy at higher k's compared to NCut and NJW, but using 4 random vectors instead helps! Note that the number of vectors << k.)

Page 72: PIC Extension: Avoiding Collisions

• Preliminary results on name disambiguation datasets: again, using 4 random start vectors seems to work, and again, note that the number of vectors << k.

Page 73: PIC Extension: Hierarchical Clustering

• Real, large-scale data may not have a "flat" clustering structure
• A hierarchical view may be more useful

Good news: the dynamics of a PIC embedding display hierarchically convergent behavior!

Page 74: PIC Extension: Hierarchical Clustering

• Why? Recall the PIC embedding at time t (dividing through by c_1 λ_1^t):

  v^t / (c_1 λ_1^t) = e_1 + (c_2/c_1)(λ_2/λ_1)^t e_2 + (c_3/c_1)(λ_3/λ_1)^t e_3 + … + (c_n/c_1)(λ_n/λ_1)^t e_n

• The e's are eigenvectors (structure), ordered from big to small: the less significant eigenvectors / structures go away first, one by one, while the more salient structures stick around
• There may not be a clear eigengap, but rather a gradient of cluster saliency

Page 75: PIC Extension: Hierarchical Clustering

(Figure: the same dataset you've seen before. PIC has already converged to 8 clusters, but let's keep on iterating: "N" is still a part of the "2009" cluster. Yes, it might take a while.)

• Similar behavior has also been noted in matrix-matrix power methods (diffusion maps, mean-shift, multi-resolution spectral clustering)

Page 76: Talk Outline

Page 77: Distributed / Parallel Implementations

• Distributed / parallel implementations of learning methods are necessary to support large-scale data, given the direction of hardware development
• PIC, MRW, and their path folding variants have sparse matrix-vector multiplication at their core
• Sparse matrix-vector multiplication lends itself well to a distributed / parallel computing framework
• We propose to use …; alternatives: …; existing graph analysis tools: …

Page 78: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie.

78

Talk Outline

• Prior Work• Power Iteration Clustering (PIC)• PIC with Path Folding• MultiRankWalk (MRW)

• Proposed Work• MRW with Path Folding• MRW for Populating a Web-Scale Knowledge Base

• Noun Phrase Categorization

• PIC for Extending a Web-Scale Knowledge Base• Learning Synonym NP’s• Learning Polyseme NP’s• Finding New Ontology Types and Relations

• PIC Extensions• Distributed Implementations

Page 79:

Questions?

Page 80: Additional Information

Page 81: PIC: Some "Powering" Methods at a Glance

Method                 | W       | Iterate          | Stopping                 | Final
Tishby & Slonim 2000   | D^{-1}A | W^{t+1} = W^t W  | rate of information loss | information bottleneck method
Zhou & Woodruff 2004   | A       | W^{t+1} = W^t W  | a small t                | a threshold ε
Carreira-Perpinan 2006 | D^{-1}A | X^{t+1} = W X^t  | entropy                  | a threshold ε
PIC                    | D^{-1}A | v^{t+1} = W v^t  | acceleration             | k-means

How far can we go with a one- or low-dimensional embedding?

Page 82: PIC: Versus Popular Fast Sparse Eigencomputation Methods

For symmetric matrices                     | For general matrices                       | Successive improvement
Power method                               | Power method                               | basic; numerically unstable, can be slow
Lanczos Method                             | Arnoldi Method                             | more stable, but requires lots of memory
Implicitly Restarted Lanczos Method (IRLM) | Implicitly Restarted Arnoldi Method (IRAM) | more memory-efficient

Method | Time                                                | Space
IRAM   | (O(m^3) + (O(nm) + O(e)) × O(m - k)) × (# restarts) | O(e) + O(nm)
PIC    | O(e) × (# iterations)                               | O(e)

Randomized sampling methods are also popular.

Page 83: PIC: Related Clustering Work

• Spectral Clustering (Roxborough & Sen 1997; Shi & Malik 2000; Meila & Shi 2001; Ng et al. 2002)
• Kernel k-Means (Dhillon et al. 2007)
• Modularity Clustering (Newman 2006)
• Matrix Powering
  – Markovian relaxation & the information bottleneck method (Tishby & Slonim 2000)
  – matrix powering (Zhou & Woodruff 2004)
  – diffusion maps (Lafon & Lee 2006)
  – Gaussian blurring mean-shift (Carreira-Perpinan 2006)
• Mean-Shift Clustering
  – mean-shift (Fukunaga & Hostetler 1975; Cheng 1995; Comaniciu & Meer 2002)
  – Gaussian blurring mean-shift (Carreira-Perpinan 2006)

Page 84: PICwPF: Results

(Scatter plot: each point represents the accuracy result from one dataset. Upper triangle: PIC wins; lower triangle: k-means wins.)

Page 85: PICwPF: Results

• The two methods have almost the same behavior; overall, neither is statistically significantly better than the other.

Page 86: PICwPF: Results

• Not sure why NCUTiram did not work as well as NCUTevd
• Lesson: approximate eigencomputation methods may require expertise to work well

Page 87: PICwPF: Results

• PIC is O(n) per iteration and the runtime curve looks linear…
• But I don't like eyeballing curves; perhaps the number of iterations increases with the size or difficulty of the dataset?

(Correlation plot, with a correlation statistic from 0 = none to 1 = correlated.)

Page 88: PICwPF: Results

• Linear run-time implies a constant number of iterations
• The number of iterations to "acceleration-convergence" is hard to analyze:
  – It is faster than a single complete run of power iteration to convergence
  – On our datasets, 10-20 iterations is typical; 30-35 is exceptional

Page 89: PICwPF: Related Work

• Faster spectral clustering (not O(n)-time methods):
  – approximate eigendecomposition (Lanczos, IRAM)
  – sampled eigendecomposition (Nyström)
• Sparser matrices (still require O(n^2) construction in general, and are not O(n)-space methods):
  – sparse construction: k-nearest-neighbor graph, k-matching
  – graph sampling / reduction

Page 90: PICwPF: Results

Page 91: MRW: RWR for Classification

We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks

Page 92: MRW: Seed Preference

• Obtaining labels for data points is expensive; we want to minimize the cost of obtaining labels
• Observations:
  – some labels are inherently more useful than others
  – some labels are easier to obtain than others
• Question: "authoritative" or "popular" nodes in a network are typically easier to obtain labels for. But are these labels also more useful than others?

Page 93: Seed Preference

• Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label
• The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data
• We consider 3 preferences (a sketch follows this list):
  – Random
  – Link count: the nodes with the highest link counts make the list
  – PageRank: the nodes with the highest PageRank scores make the list
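A sketch of generating such a list under these preferences (our own illustration; the budget parameter and the adjacency-matrix input are assumptions, and a PageRank variant would simply rank nodes by their PageRank scores instead):

```python
import numpy as np

def pick_seeds(A, budget, preference="link_count", seed=0):
    """Return the indices of `budget` nodes to hand to a human labeler."""
    n = A.shape[0]
    if preference == "random":
        return np.random.default_rng(seed).choice(n, size=budget, replace=False)
    if preference == "link_count":
        degree = np.asarray(A.sum(axis=1)).ravel()  # link count per node
        return np.argsort(-degree)[:budget]         # highest counts make the list
    raise ValueError(f"unknown preference: {preference}")
```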

Page 94: MRW: The Question

• What really makes MRW and wvRN different?
• Network-based SSL often boils down to label propagation
• MRW and wvRN represent two general propagation methods; note that they are called by many names:

MRW                        | wvRN
Random walk with restart   | Reverse random walk
Regularized random walk    | Random walk with sink nodes
Personalized PageRank      | Hitting time
Local & global consistency | Harmonic functions on graphs
                           | Iterative averaging of neighbors

Great… but we still don't know why their behavior differs on these network datasets!

Page 95: MRW: The Question

• It's difficult to answer exactly why MRW does better with a smaller number of seeds
• But we can gather probable factors from their propagation models:

  | MRW                                                 | wvRN
1 | Centrality-sensitive                                | Centrality-insensitive
2 | Exponential drop-off / damping factor               | No drop-off / damping
3 | Propagation of different classes done independently | Propagation of different classes interact

Page 96: MRW: The Question

• An example from a political blog dataset, MRW vs. wvRN scores for how much a blog is politically conservative (seed labels underlined in the original slide):

1.000 neoconservatives.blogspot.com; 1.000 strangedoctrines.typepad.com; 1.000 jmbzine.com; 0.593 presidentboxer.blogspot.com; 0.585 rooksrant.com; 0.568 purplestates.blogspot.com; 0.553 ikilledcheguevara.blogspot.com; 0.540 restoreamerica.blogspot.com; 0.539 billrice.org; 0.529 kalblog.com; 0.517 right-thinking.com; 0.517 tom-hanna.org; 0.514 crankylittleblog.blogspot.com; 0.510 hasidicgentile.org; 0.509 stealthebandwagon.blogspot.com; 0.509 carpetblogger.com; 0.497 politicalvicesquad.blogspot.com; 0.496 nerepublican.blogspot.com; 0.494 centinel.blogspot.com; 0.494 scrawlville.com; 0.493 allspinzone.blogspot.com; 0.492 littlegreenfootballs.com; 0.492 wehavesomeplanes.blogspot.com; 0.491 rittenhouse.blogspot.com; 0.490 secureliberty.org; 0.488 decision08.blogspot.com; 0.488 larsonreport.com

0.020 firstdownpolitics.com; 0.019 neoconservatives.blogspot.com; 0.017 jmbzine.com; 0.017 strangedoctrines.typepad.com; 0.013 millers_time.typepad.com; 0.011 decision08.blogspot.com; 0.010 gopandcollege.blogspot.com; 0.010 charlineandjamie.com; 0.008 marksteyn.com; 0.007 blackmanforbush.blogspot.com; 0.007 reggiescorner.blogspot.com; 0.007 fearfulsymmetry.blogspot.com; 0.006 quibbles-n-bits.com; 0.006 undercaffeinated.com; 0.005 samizdata.net; 0.005 pennywit.com; 0.005 pajamahadin.com; 0.005 mixtersmix.blogspot.com; 0.005 stillfighting.blogspot.com; 0.005 shakespearessister.blogspot.com; 0.005 jadbury.com; 0.005 thefulcrum.blogspot.com; 0.005 watchandwait.blogspot.com; 0.005 gindy.blogspot.com; 0.005 cecile.squarespace.com; 0.005 usliberals.about.com; 0.005 twentyfirstcenturyrepublican.blogspot.com

• 1. Centrality-sensitive: the seeds have different scores, and not necessarily the highest
• 2. Exponential drop-off: much less sure about nodes further away from the seeds
• 3. Classes propagate independently: charlineandjamie.com is very likely both a conservative and a liberal blog (good or bad?)
• We still don't really understand it yet.

Page 97: MRW: Related Work

• MRW is very much related to:
  – "Local and global consistency" (Zhou et al. 2004): a similar formulation, different view
  – "Web content categorization using link information" (Gyongyi et al. 2006): RWR ranking as features to an SVM
  – "Graph-based semi-supervised learning as a generative model" (He et al. 2007): random walk without restart, with heuristic stopping
• Seed preference is related to the field of active learning:
  – Active learning chooses which data point to label next based on previous labels; the labeling is interactive
  – Seed preference is a batch labeling method
  – An authoritative seed preference is a good baseline for active learning on network data!