Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Page 1: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

Frank Lin and William W. Cohen, School of Computer Science, Carnegie Mellon University

KDD-MLG 2011, 2011-08-21, San Diego, CA, USA

Page 2: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

Overview

• Graph-based SSL
• The Problem with Text Data
• Implicit Manifolds
  – Two GSSL Methods
  – Three Similarity Functions
• A Framework
• Results
• Related Work

Page 3: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

Graph-based SSL

• Graph-based semi-supervised learning methods can often be viewed as methods for propagating labels along edges of a graph.

• Naturally, they are a good fit for network data.

[Figure: a graph with one node labeled class A and another labeled class B. How do we label the rest of the nodes in the graph?]

Page 4: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

The Problem with Text Data

• Documents are often represented as feature vectors of words:

The importance of a Web page is an inherently subjective matter, which depends on the readers…

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use…

You're not cool just because you have a lot of followers on twitter, get over yourself…

        cool  web  search  make  over  you
doc 1     0    4      8     2     5    3
doc 2     0    8      7     4     3    2
doc 3     1    0      0     0     1    2

Page 5: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

The Problem with Text Data

• Feature vectors are often sparse
• But the similarity matrix is not!

        cool  web  search  make  over  you
doc 1     0    4      8     2     5    3
doc 2     0    8      7     4     3    2
doc 3     1    0      0     0     1    2

Mostly zeros: any document contains only a small fraction of the vocabulary.

   -    23    27
  23     -   125
  27   125     -

Mostly non-zero: any two documents are likely to have a word in common.

Page 6: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

The Problem with Text Data

• A similarity matrix is the input to many GSSL methods

• How can we apply GSSL methods efficiently?

   -    23    27
  23     -   125
  27   125     -

O(n^2) time to construct, O(n^2) space to store, and more than O(n^2) time to operate on. Too expensive! Does not scale up to big datasets!

Page 7: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

The Problem with Text Data

• Solutions:
  1. Make the matrix sparse
  2. Implicit manifolds

A lot of cool work has gone into how to make the matrix sparse. (If you do SSL on large-scale non-network data, there is a good chance you are using this already...) But implicit manifolds are what we'll talk about!

Page 8: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Implicit Manifolds

• A pair-wise similarity matrix is a manifold under which the data points “reside”.

• It is implicit because we never explicitly construct the manifold (pair-wise similarity), although the results we get are exactly the same as if we did.

What do you mean by...? Sounds too good. What's the catch?

Page 9: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Implicit Manifolds

• Two requirements for using implicit manifolds on text(-like) data:
  1. The GSSL method can be implemented with matrix-vector multiplications
  2. We can decompose the dense similarity matrix into a series of sparse matrix-matrix multiplications

As long as these are met, we can obtain the exact same results without ever constructing a similarity matrix!

Let's look at some specifics, starting with two GSSL methods.
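To make requirement 2 concrete, here is a minimal sketch (in Python with scipy; my illustration, not code from the slides) showing that multiplying by the dense similarity matrix S = FF^T can be replaced by two sparse matrix-vector products, never materializing S:

    import numpy as np
    from scipy import sparse

    # F: sparse n-by-m document-feature matrix (n documents, m vocabulary terms)
    F = sparse.random(1000, 5000, density=0.01, format="csr", random_state=42)
    v = np.random.default_rng(0).standard_normal(1000)

    # Dense route: build S = F F^T explicitly -- O(n^2) storage.
    S = (F @ F.T).toarray()
    dense_result = S @ v

    # Implicit route: two sparse matrix-vector products -- S is never built.
    implicit_result = F @ (F.T @ v)

    print(np.allclose(dense_result, implicit_result))  # True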

Page 10: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Two GSSL Methods

• Harmonic functions method (HF)
• MultiRankWalk (MRW)

If you've never heard of them, you might have heard of one of these:

HF's cousins (propagation via backward random walks): Gaussian fields and harmonic functions classifier (Zhu et al. 2003); weighted-vote relational network classifier (Macskassy & Provost 2007); weakly-supervised classification via random walks (Talukdar et al. 2008); adsorption (Baluja et al. 2008); learning on diffusion maps (Lafon & Lee 2006); and others.

MRW's cousins (propagation via forward random walks with restart): partially labeled classification using Markov random walks (Szummer & Jaakkola 2001); learning with local and global consistency (Zhou et al. 2004); graph-based SSL as a generative model (He et al. 2007); ghost edges for classification (Gallagher et al. 2008); and others.

Page 11: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Two GSSL Methods

• Harmonic functions method (HF)
• MultiRankWalk (MRW)

In both of these iterative implementations, the core computation is matrix-vector multiplication!

Let's look at their implicit manifold qualification...
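For reference, here are standard forms of the two iterative updates (a reconstruction based on Zhu et al. 2003 and the usual random-walk-with-restart formulation, not copied from the slides):

HF:   v^{t+1} = D^{-1} S v^t, with labeled entries clamped to their labels after each step
MRW:  v^{t+1} = (1-d) u + d S D^{-1} v^t

where S is the similarity matrix, D is the diagonal matrix with D(i,i) = sum_j S(i,j), u is the restart distribution over a class's seed nodes, and d is the damping factor.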

Page 12: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Three Similarity Functions

• Inner product similarity: simple, good for binary features

• Cosine similarity: often used for document categorization

• Bipartite graph walk: can be viewed as a walk on the bipartite document-feature graph, or as feature reweighting by relative frequency; related to TF-IDF term weighting

Side note: perhaps all the cool work on matrix factorization could be useful here too...

Page 13: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Putting Them Together

• Example: HF + inner product similarity:

S = FF^T

• The iteration update becomes:

v^{t+1} = D^{-1}(F(F^T v^t))

Parentheses are important! The diagonal matrix D can be calculated as D(i,i) = d(i), where d = FF^T 1 and 1 is the all-ones vector.

Construction: O(n). Storage: O(n). Operation: O(n).

How about a similarity function we actually use for text data?
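A minimal runnable sketch of this update (my Python illustration, assuming the standard hard-clamping variant of harmonic functions; not the authors' code):

    import numpy as np

    def hf_inner_product(F, y, labeled, iters=100):
        """Harmonic functions with implicit S = F F^T.

        F       : sparse n-by-m document-feature matrix (scipy.sparse)
        y       : length-n array of label scores for labeled nodes (e.g. 0/1)
        labeled : length-n boolean mask of labeled nodes
        Assumes every document has at least one feature, so d > 0.
        """
        d = F @ (F.T @ np.ones(F.shape[0]))  # row sums of S, without forming S
        v = np.where(labeled, y, 0.5)        # unlabeled nodes start neutral
        for _ in range(iters):
            v = (F @ (F.T @ v)) / d          # v <- D^{-1} (F (F^T v))
            v[labeled] = y[labeled]          # clamp labeled nodes
        return v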

Page 14: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Putting Them Together

• Example: HF + cosine similarity:

S = N_c F F^T N_c

where N_c is the diagonal cosine-normalizing matrix: N_c(i,i) = 1/||F(i,:)||.

• The iteration update becomes:

v^{t+1} = D^{-1}(N_c(F(F^T(N_c v^t))))

Compact storage: we don't need a cosine-normalized version of the feature vectors. The diagonal matrix D can be calculated as D(i,i) = d(i), where d = N_c F F^T N_c 1.

Construction: O(n). Storage: O(n). Operation: O(n).
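The cosine variant changes only the normalization; a sketch under the same assumptions as the inner-product version above (again my illustration):

    import numpy as np

    def hf_cosine(F, y, labeled, iters=100):
        """Harmonic functions with implicit S = Nc F F^T Nc (cosine similarity)."""
        # Diagonal of Nc stored as a vector: nc[i] = 1 / ||F(i,:)||
        nc = 1.0 / np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
        d = nc * (F @ (F.T @ nc))                # d = Nc F F^T Nc 1
        v = np.where(labeled, y, 0.5)
        for _ in range(iters):
            v = nc * (F @ (F.T @ (nc * v))) / d  # v <- D^{-1} Nc F F^T Nc v
            v[labeled] = y[labeled]              # clamp labeled nodes
        return v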

Page 15: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


A Framework

• So what's the point?
  1. Towards a principled way for researchers to apply GSSL methods to text data, and the conditions under which they can do this efficiently
  2. Towards a framework on which researchers can develop and discover new methods (and recognize old ones)
  3. Building an SSL tool set: pick one that works for you, according to all the great work that has been done

Page 16: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


A Framework

Choose your SSL Method…

Harmonic Functions

MultiRankWalk

"Hmm... I have a large dataset with very few training labels. What should I try?"

"How about MultiRankWalk with a low restart probability?"

Page 17: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


A Framework

... and pick your similarity function:

Inner Product
Cosine Similarity
Bipartite Graph Walk

"But the documents in my dataset are kinda long..."

"Can't go wrong with cosine similarity!"

Page 18: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Results

• On 2 commonly used document categorization datasets
• On 2 less common NP classification datasets
• Goal: to show these methods do work on large text datasets, with results consistent with what we know about these SSL methods and similarity functions

Page 19: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Results

• Document categorization

Page 20: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Results

• Noun phrase classification dataset:

Page 21: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

Questions?

Page 22: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Additional Information

Page 23: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

MRW: RWR for Classification

We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks
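As a concrete sketch of the idea (my Python illustration of the standard random-walk-with-restart formulation; not the authors' code): run one RWR per class, with the restart distribution concentrated on that class's seeds, then assign each node the class whose walk ranks it highest.

    import numpy as np
    from scipy import sparse

    def rwr_scores(W, seeds, d=0.85, iters=100):
        """Random walk with restart on adjacency matrix W, restarting at seeds.

        Assumes every node has at least one edge (positive degree).
        """
        n = W.shape[0]
        u = np.zeros(n)
        u[seeds] = 1.0 / len(seeds)          # restart distribution over seeds
        deg = np.asarray(W.sum(axis=0)).ravel()
        P = W @ sparse.diags(1.0 / deg)      # column-stochastic transition matrix
        r = u.copy()
        for _ in range(iters):
            r = (1 - d) * u + d * (P @ r)    # r <- (1-d) u + d P r
        return r

    def multirankwalk(W, seed_lists, d=0.85):
        """One ranking per class; label each node by its highest-ranking class."""
        scores = np.vstack([rwr_scores(W, s, d) for s in seed_lists])
        return scores.argmax(axis=0)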

Page 24: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

MRW: Seed Preference

• Obtaining labels for data points is expensive
• We want to minimize the cost of obtaining labels
• Observations:
  – Some labels are inherently more useful than others
  – Some labels are easier to obtain than others

Question: "authoritative" or "popular" nodes in a network are typically easier to obtain labels for. But are these labels also more useful than others?

Page 25: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

Seed Preference

• Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label

• The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data

• We consider 3 preferences:
  – Random
  – Link Count (nodes with the highest counts make the list)
  – PageRank (nodes with the highest scores make the list)
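A small sketch of batch seed selection under these preferences (my illustration; "link count" is taken to be node degree, and PageRank is computed with a basic damped power iteration):

    import numpy as np
    from scipy import sparse

    def pick_seeds(W, k, preference="random", rng=None):
        """Return k node indices to send to the human labeler."""
        n = W.shape[0]
        rng = rng or np.random.default_rng()
        if preference == "random":
            return rng.choice(n, size=k, replace=False)
        if preference == "link_count":
            degree = np.asarray(W.sum(axis=1)).ravel()
            return np.argsort(-degree)[:k]   # highest-degree nodes make the list
        if preference == "pagerank":
            deg = np.asarray(W.sum(axis=0)).ravel()  # assumes positive degrees
            P = W @ sparse.diags(1.0 / deg)
            r = np.full(n, 1.0 / n)
            for _ in range(100):             # damped power iteration
                r = 0.15 / n + 0.85 * (P @ r)
            return np.argsort(-r)[:k]        # highest-PageRank nodes make the list
        raise ValueError(preference)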

Page 26: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

MRW: The Question

• What really makes MRW and wvRN different?
• Network-based SSL often boils down to label propagation.
• MRW and wvRN represent two general propagation methods. Note that they are called by many names:

MRW                          wvRN
Random walk with restart     Reverse random walk
Regularized random walk      Random walk with sink nodes
Personalized PageRank        Hitting time
Local & global consistency   Harmonic functions on graphs
                             Iterative averaging of neighbors

Great... but we still don't know why they behave differently on these network datasets!

Page 27: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

MRW: The Question

• It's difficult to answer exactly why MRW does better with a smaller number of seeds.
• But we can gather probable factors from their propagation models:

    MRW                                                  wvRN
1   Centrality-sensitive                                 Centrality-insensitive
2   Exponential drop-off / damping factor                No drop-off / damping
3   Propagation of different classes done independently  Propagation of different classes interacts

Page 28: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

MRW: The Question

• An example from a political blog dataset: MRW vs. wvRN scores for how much a blog is politically conservative. Seed labels (underlined in the original slide) are marked here with *.

wvRN scores:
1.000 neoconservatives.blogspot.com *
1.000 strangedoctrines.typepad.com *
1.000 jmbzine.com *
0.593 presidentboxer.blogspot.com
0.585 rooksrant.com
0.568 purplestates.blogspot.com
0.553 ikilledcheguevara.blogspot.com
0.540 restoreamerica.blogspot.com
0.539 billrice.org
0.529 kalblog.com
0.517 right-thinking.com
0.517 tom-hanna.org
0.514 crankylittleblog.blogspot.com
0.510 hasidicgentile.org
0.509 stealthebandwagon.blogspot.com
0.509 carpetblogger.com
0.497 politicalvicesquad.blogspot.com
0.496 nerepublican.blogspot.com
0.494 centinel.blogspot.com
0.494 scrawlville.com
0.493 allspinzone.blogspot.com
0.492 littlegreenfootballs.com
0.492 wehavesomeplanes.blogspot.com
0.491 rittenhouse.blogspot.com
0.490 secureliberty.org
0.488 decision08.blogspot.com
0.488 larsonreport.com

MRW scores:
0.020 firstdownpolitics.com
0.019 neoconservatives.blogspot.com *
0.017 jmbzine.com *
0.017 strangedoctrines.typepad.com *
0.013 millers_time.typepad.com
0.011 decision08.blogspot.com
0.010 gopandcollege.blogspot.com
0.010 charlineandjamie.com
0.008 marksteyn.com
0.007 blackmanforbush.blogspot.com
0.007 reggiescorner.blogspot.com
0.007 fearfulsymmetry.blogspot.com
0.006 quibbles-n-bits.com
0.006 undercaffeinated.com
0.005 samizdata.net
0.005 pennywit.com
0.005 pajamahadin.com
0.005 mixtersmix.blogspot.com
0.005 stillfighting.blogspot.com
0.005 shakespearessister.blogspot.com
0.005 jadbury.com
0.005 thefulcrum.blogspot.com
0.005 watchandwait.blogspot.com
0.005 gindy.blogspot.com
0.005 cecile.squarespace.com
0.005 usliberals.about.com
0.005 twentyfirstcenturyrepublican.blogspot.com

1. Centrality-sensitive: seeds have different scores, and not necessarily the highest
2. Exponential drop-off: much less sure about nodes further away from seeds
3. Classes propagate independently: charlineandjamie.com is both very likely a conservative and a liberal blog (good or bad?)

We still don't really understand it yet.

Page 29: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

MRW: Related Work

• MRW is very much related to:
  – "Learning with local and global consistency" (Zhou et al. 2004): similar formulation, different view
  – "Web content categorization using link information" (Gyongyi et al. 2006): RWR ranking as features to SVM
  – "Graph-based semi-supervised learning as a generative model" (He et al. 2007): random walk without restart, heuristic stopping
• Seed preference is related to the field of active learning:
  – Active learning chooses which data point to label next based on previous labels; the labeling is interactive
  – Seed preference is a batch labeling method

Authoritative seed preference is a good baseline for active learning on network data!

Page 30: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

Results

• How much better is MRW using authoritative seed preference?

[Figure: y-axis: MRW F1 score minus wvRN F1 score; x-axis: number of seed labels per class]

The gap between MRW and wvRN narrows with authoritative seeds, but the differences are still prominent on some datasets with a small number of seed labels.

Page 31: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


A Web-Scale Knowledge Base

• Read the Web (RtW) project:

Build a never-ending system that learns to extract information from unstructured web pages, resulting in a knowledge base of structured information.

Page 32: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Noun Phrase and Context Data

• As part of the RtW project, two kinds of noun phrase (NP) and context co-occurrence data were produced:
  – NP-context co-occurrence data
  – NP-NP co-occurrence data

• These datasets can be treated as graphs

Page 33: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Noun Phrase and Context Data

• NP-context data:

Example: "... know that drinking pomegranate juice may not be a bad ..."

[Figure: NP-context graph. The NP "pomegranate juice" links to its before-context "know that drinking _" and its after-context "_ may not be a bad"; other NPs such as "JELL-O" and "Jagermeister" link to contexts like "_ is made from" and "_ promotes responsible". Edge weights (e.g., 32, 5821) are co-occurrence counts.]

Page 34: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Noun Phrase and Context Data

• NP-NP data:

Example: "... French Constitution Council validates burqa ban ..."

[Figure: NP-NP graph. NPs that co-occur are linked, such as "French Constitution Council" and "burqa ban"; other nodes include "French Court", "hot pants", "veil", "JELL-O", and "Jagermeister". Context can be used for weighting edges or making a more complex graph.]

Page 35: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Noun Phrase Categorization

• We propose using MRW (with path folding) on the NP-context data to categorize NPs, given a handful of seed NPs (see the sketch after this list).

• Challenges:
  – Large, noisy dataset (10m NPs, 8.6m contexts from 500m web pages)
  – What's the right function for NP-NP categorical similarity?
  – Which learned category assignments should we "promote" to the knowledge base?
  – How to evaluate it?
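A sketch of MRW with path folding (my illustration, assuming path folding means performing each walk step on the folded NP-NP graph S = FF^T implicitly through the sparse NP-context matrix, as in the implicit-manifold slides earlier; not the authors' code):

    import numpy as np

    def rwr_path_folding(F, seeds, d=0.85, iters=100):
        """RWR over the folded NP-NP graph, implicit via the NP-context matrix.

        F : sparse (num_NPs x num_contexts) co-occurrence matrix.
        The folded similarity S = F F^T is never materialized; each step
        multiplies by F^T and then F. Assumes every NP has some context.
        """
        n = F.shape[0]
        u = np.zeros(n)
        u[seeds] = 1.0 / len(seeds)       # restart distribution over seed NPs
        deg = F @ (F.T @ np.ones(n))      # row sums of S = F F^T
        r = u.copy()
        for _ in range(iters):
            w = F @ (F.T @ (r / deg))     # one step on the folded graph: S D^{-1} r
            r = (1 - d) * u + d * w
        return r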


Page 37: Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Noun Phrase Categorization

• Preliminary experiment:
  – Small subset of the NP-context data:
    • 88k NPs
    • 99k contexts
  – Find the category "city":
    • Start with a handful of seeds
  – Ground truth set of 2,404 city NPs created using