Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data
Frank Lin and William W. Cohen, School of Computer Science, Carnegie Mellon University
KDD-MLG 2011, 2011-08-21, San Diego, CA, USA
Overview
• Graph-based SSL
• The Problem with Text Data
• Implicit Manifolds
– Two GSSL Methods
– Three Similarity Functions
• A Framework
• Results
• Related Work
Graph-based SSL
• Graph-based semi-supervised learning methods can often be viewed as methods for propagating labels along edges of a graph.
• Naturally they are a good fit for network data

[Figure: a graph with one node labeled class A and one labeled class B. How to label the rest of the nodes in the graph?]
The Problem with Text Data
• Documents are often represented as feature vectors of words:
The importance of a Web page is an inherently subjective matter, which depends on the readers…
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use…
You're not cool just because you have a lot of followers on twitter, get over yourself…
        cool  web  search  make  over  you
Doc 1:     0    4       8     2     5    3
Doc 2:     0    8       7     4     3    2
Doc 3:     1    0       0     0     1    2
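To make this concrete, here is a minimal sketch (ours, not from the slides) of building such sparse count vectors; it assumes scikit-learn, and the document strings are abbreviated stand-ins for the snippets above.

```python
# A minimal sketch (assumes scikit-learn): turn raw documents into the
# sparse word-count feature vectors described above.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The importance of a Web page is an inherently subjective matter ...",
    "In this paper, we present Google, a prototype of a large-scale search engine ...",
    "You're not cool just because you have a lot of followers on twitter ...",
]

vectorizer = CountVectorizer()
F = vectorizer.fit_transform(docs)   # scipy.sparse CSR matrix, n_docs x n_words
print(F.shape, F.nnz)                # mostly zeros: only words that occur are stored
```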
The Problem with Text Data
• Feature vectors are often sparse
• But similarity matrix is not!

        cool  web  search  make  over  you
Doc 1:     0    4       8     2     5    3
Doc 2:     0    8       7     4     3    2
Doc 3:     1    0       0     0     1    2

Mostly zeros – any document contains only a small fraction of the vocabulary

         Doc 1  Doc 2  Doc 3
Doc 1:      -     23     27
Doc 2:     23      -    125
Doc 3:     27    125      -

Mostly non-zero – any two documents are likely to have a word in common
The Problem with Text Data
• A similarity matrix is the input to many GSSL methods
• How can we apply GSSL methods efficiently?

         Doc 1  Doc 2  Doc 3
Doc 1:      -     23     27
Doc 2:     23      -    125
Doc 3:     27    125      -

O(n²) time to construct
O(n²) space to store
> O(n²) time to operate on

Too expensive! Does not scale up to big datasets!
The Problem with Text Data
• Solutions:
1. Make the matrix sparse (a lot of cool work has gone into how to do this… if you do SSL on large-scale non-network data, there is a good chance you are using this already)
2. Implicit Manifold (but this is what we'll talk about!)
Implicit Manifolds
• A pair-wise similarity matrix is a manifold on which the data points "reside".
• It is implicit because we never explicitly construct the manifold (pair-wise similarity), although the results we get are exactly the same as if we did.

"What do you mean by…?"
"Sounds too good. What's the catch?"
Implicit Manifolds
• Two requirements for using implicit manifolds on text(-like) data:
1. The GSSL method can be implemented with matrix-vector multiplications
2. We can decompose the dense similarity matrix into a series of sparse matrix-matrix multiplications

As long as they are met, we can obtain the exact same results without ever constructing a similarity matrix!

Let's look at some specifics, starting with two GSSL methods.
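To make requirement 2 concrete, here is a minimal sketch (not from the slides) of the key trick for the simplest case, inner-product similarity S = FF^T: we can multiply a vector by S using only sparse operations, without ever forming S.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Sparse document-feature matrix F (n docs x m words).
F = sparse_random(1000, 5000, density=0.01, format="csr")
v = np.random.rand(1000)

# Dense way (what we avoid): S = F @ F.T is n x n and mostly non-zero.
# Implicit way: two sparse matrix-vector products, O(nnz(F)) each.
Sv = F @ (F.T @ v)   # parentheses matter: never materialize F @ F.T
```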
Two GSSL Methods
• Harmonic functions method (HF)
• MultiRankWalk (MRW)

If you've never heard of them… you might have heard of one of these:

HF's cousins (propagation via backward random walks): Gaussian fields and harmonic functions classifier (Zhu et al. 2003), weighted-vote relational neighbor classifier (Macskassy & Provost 2007), weakly-supervised classification via random walks (Talukdar et al. 2008), Adsorption (Baluja et al. 2008), learning on diffusion maps (Lafon & Lee 2006), and others…

MRW's cousins (propagation via forward random walks, with restart): partially labeled classification using Markov random walks (Szummer & Jaakkola 2001), learning with local and global consistency (Zhou et al. 2004), graph-based SSL as a generative model (He et al. 2007), ghost edges for classification (Gallagher et al. 2008), and others…
Two GSSL Methods
• Harmonic functions method (HF)
• MultiRankWalk (MRW)

In both of these iterative implementations, the core computation is a matrix-vector multiplication!

Let's look at their implicit manifold qualification…
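As a hedged illustration, the sketches below show the standard iterative forms of the two methods as they appear in the literature; details such as the normalization direction and the restart vector follow common conventions and may differ from the authors' exact implementation.

```python
import numpy as np

def harmonic_functions(W, labels, is_seed, iters=100):
    """Sketch of HF: repeatedly average neighbor scores, clamping seeds.
    W: (n x n) similarity matrix (sparse ok); labels: initial scores."""
    d = np.asarray(W.sum(axis=1)).ravel()   # node degrees (assumes none are zero)
    v = labels.astype(float)
    for _ in range(iters):
        v = (W @ v) / d                     # v <- D^-1 W v
        v[is_seed] = labels[is_seed]        # clamp labeled nodes
    return v

def multirankwalk(W, seeds, damping=0.85, iters=100):
    """Sketch of MRW: a random walk with restart, one per class;
    u is the restart distribution concentrated on that class's seeds."""
    d = np.asarray(W.sum(axis=1)).ravel()
    u = seeds / seeds.sum()
    r = u.copy()
    for _ in range(iters):
        r = (1 - damping) * u + damping * (W.T @ (r / d))
    return r
```

For classification, MRW is run once per class and each node is assigned to the class whose walk gives it the highest score.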
Three Similarity Functions
• Inner product similarity: simple, good for binary features
• Cosine similarity: often used for document categorization
• Bipartite graph walk: can be viewed as a bipartite graph walk, or as feature reweighting by relative frequency; related to TF-IDF term weighting…

Side note: perhaps all the cool work on matrix factorization could be useful here too…
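A sketch of how each similarity function factors into sparse pieces; the inner-product and cosine forms match the equations on the following slides, while the bipartite-walk factorization shown is one common row/column-normalized construction and is our paraphrase, not a formula from the slides.

```python
import numpy as np
from scipy.sparse import diags

def similarity_factors(F, kind):
    """Return the sparse factors whose product is the (never materialized)
    similarity matrix S; apply them right-to-left to a vector."""
    if kind == "inner":          # S = F F^T
        return [F, F.T]
    if kind == "cosine":         # S = Nc F F^T Nc, Nc = diag(1/||row||)
        norms = np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
        return [diags(1.0 / norms), F, F.T, diags(1.0 / norms)]
    if kind == "bipartite":      # two-step doc->word->doc walk (one common form)
        Dr = diags(1.0 / np.asarray(F.sum(axis=1)).ravel())   # doc out-degrees
        Dc = diags(1.0 / np.asarray(F.sum(axis=0)).ravel())   # word out-degrees
        return [Dr, F, Dc, F.T]
    raise ValueError(kind)

def apply_similarity(factors, v):
    for M in reversed(factors):  # all sparse matrix-vector products
        v = M @ v
    return v
```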
Putting Them Together
• Example: HF + inner product similarity:

S = FF^T

• Iteration update becomes:

v^(t+1) = D^(-1)(F(F^T v^t))

Parentheses are important!

Diagonal matrix D can be calculated as: D(i,i) = d(i), where d = FF^T 1

Construction: O(n)   Storage: O(n)   Operation: O(n)

How about a similarity function we actually use for text data?
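A minimal sketch of this update (ours, not the authors' code); note how d = FF^T 1 is computed with two sparse matrix-vector products, so the row sums of S are obtained without forming S.

```python
import numpy as np

def hf_inner_product_step(F, v, labels, is_seed):
    """One HF iteration under implicit S = F F^T."""
    ones = np.ones(F.shape[0])
    d = F @ (F.T @ ones)          # d = F F^T 1: row sums of S, O(nnz(F))
    v = (F @ (F.T @ v)) / d       # v <- D^-1 (F (F^T v)); parentheses matter!
    v[is_seed] = labels[is_seed]  # clamp seed labels (HF)
    return v
```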
Putting Them Together
• Example: HF + cosine similarity:

S = N_c F F^T N_c    (N_c is the diagonal cosine normalizing matrix)

• Iteration update:

v^(t+1) = D^(-1)(N_c(F(F^T(N_c v^t))))

Compact storage: we don't need a cosine-normalized version of the feature vectors.

Diagonal matrix D can be calculated as: D(i,i) = d(i), where d = N_c F F^T N_c 1

Construction: O(n)   Storage: O(n)   Operation: O(n)
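A sketch of the cosine variant; the only addition is the diagonal of N_c, a vector of inverse row norms computed once from the sparse F (which is why no cosine-normalized copy of F is needed).

```python
import numpy as np

def hf_cosine_step(F, v, labels, is_seed, nc=None):
    """One HF iteration under implicit S = Nc F F^T Nc."""
    if nc is None:  # diagonal of Nc: 1/||row||, computed once from sparse F
        nc = 1.0 / np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
    ones = np.ones(F.shape[0])
    d = nc * (F @ (F.T @ (nc * ones)))    # d = Nc F F^T Nc 1
    v = nc * (F @ (F.T @ (nc * v))) / d   # v <- D^-1 Nc F F^T Nc v
    v[is_seed] = labels[is_seed]
    return v
```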
A Framework
• So what's the point?
1. Towards a principled way for researchers to apply GSSL methods to text data, and the conditions under which they can do this efficiently
2. Towards a framework on which researchers can develop and discover new methods (and recognize old ones)
3. Building an SSL tool set: pick one that works for you, according to all the great work that has been done
A Framework
Choose your SSL method…
• Harmonic Functions
• MultiRankWalk

"Hmm… I have a large dataset with very few training labels. What should I try?"
"How about MultiRankWalk with a low restart probability?"
A Framework
… and pick your similarity function
• Inner Product
• Cosine Similarity
• Bipartite Graph Walk

"But the documents in my dataset are kinda long…"
"Can't go wrong with cosine similarity!"
Results
• On 2 commonly used document categorization datasets
• On 2 less common NP classification datasets
• Goal: to show these methods work on large text datasets, with results consistent with what we know about these SSL methods and similarity functions
Results
• Document categorization
Results
• Noun phrase classification dataset:
Questions?
Additional Information
MRW: RWR for Classification
We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks
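A sketch of our reading of "multiple rankings": one random walk with restart per class, assigning each node to the class whose walk ranks it highest. The function name and parameter defaults are ours.

```python
import numpy as np

def mrw_classify(W, seed_labels, damping=0.85, iters=100):
    """seed_labels: (n,) array with a class id for seeds, -1 for unlabeled.
    Runs one random walk with restart per class and assigns each node
    to the class whose walk ranks it highest."""
    d = np.asarray(W.sum(axis=1)).ravel()
    classes = np.unique(seed_labels[seed_labels >= 0])
    scores = np.zeros((len(classes), W.shape[0]))
    for k, c in enumerate(classes):
        u = (seed_labels == c).astype(float)
        u /= u.sum()                         # restart distribution on class seeds
        r = u.copy()
        for _ in range(iters):
            r = (1 - damping) * u + damping * (W.T @ (r / d))
        scores[k] = r
    return classes[np.argmax(scores, axis=0)]
```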
MRW: Seed Preference
• Obtaining labels for data points is expensive
• We want to minimize the cost of obtaining labels
• Observations:
– Some labels are inherently more useful than others
– Some labels are easier to obtain than others

Question: "authoritative" or "popular" nodes in a network are typically easier to obtain labels for. But are these labels also more useful than others?
Seed Preference
• Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label
• The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data
• We consider 3 preferences:
– Random
– Link Count (nodes with the highest link counts make the list)
– PageRank (nodes with the highest PageRank scores make the list)
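A sketch of the three seed-preference policies (the function name is ours; PageRank is approximated by a short power iteration):

```python
import numpy as np

def pick_seeds(W, k, preference="random", rng=None):
    """Return indices of k nodes to send to the labeler."""
    n = W.shape[0]
    rng = rng or np.random.default_rng()
    if preference == "random":
        return rng.choice(n, size=k, replace=False)
    if preference == "link_count":               # highest-degree nodes
        deg = np.asarray(W.sum(axis=1)).ravel()
        return np.argsort(-deg)[:k]
    if preference == "pagerank":                 # highest PageRank scores
        d = np.asarray(W.sum(axis=1)).ravel()
        r = np.full(n, 1.0 / n)
        for _ in range(50):                      # simple power iteration
            r = 0.15 / n + 0.85 * (W.T @ (r / d))
        return np.argsort(-r)[:k]
    raise ValueError(preference)
```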
MRW: The Question
• What really makes MRW and wvRN different?
• Network-based SSL often boils down to label propagation.
• MRW and wvRN represent two general propagation methods – note that they are called by many names:

MRW                          wvRN
Random walk with restart     Reverse random walk
Regularized random walk      Random walk with sink nodes
Personalized PageRank        Hitting time
Local & global consistency   Harmonic functions on graphs
                             Iterative averaging of neighbors

Great… but we still don't know why their behavior differs on these network datasets!
MRW: The Question
• It's difficult to answer exactly why MRW does better with a smaller number of seeds.
• But we can gather probable factors from their propagation models:
1. MRW is centrality-sensitive; wvRN is centrality-insensitive.
2. MRW has exponential drop-off (a damping factor); wvRN has no drop-off / damping.
3. In MRW, propagation of different classes is done independently; in wvRN, propagation of different classes interacts.
MRW: The Question
• An example from a political blog dataset – MRW vs. wvRN scores for how much a blog is politically conservative (seed labels were underlined on the original slide):

wvRN scores:
1.000 neoconservatives.blogspot.com
1.000 strangedoctrines.typepad.com
1.000 jmbzine.com
0.593 presidentboxer.blogspot.com
0.585 rooksrant.com
0.568 purplestates.blogspot.com
0.553 ikilledcheguevara.blogspot.com
0.540 restoreamerica.blogspot.com
0.539 billrice.org
0.529 kalblog.com
0.517 right-thinking.com
0.517 tom-hanna.org
0.514 crankylittleblog.blogspot.com
0.510 hasidicgentile.org
0.509 stealthebandwagon.blogspot.com
0.509 carpetblogger.com
0.497 politicalvicesquad.blogspot.com
0.496 nerepublican.blogspot.com
0.494 centinel.blogspot.com
0.494 scrawlville.com
0.493 allspinzone.blogspot.com
0.492 littlegreenfootballs.com
0.492 wehavesomeplanes.blogspot.com
0.491 rittenhouse.blogspot.com
0.490 secureliberty.org
0.488 decision08.blogspot.com
0.488 larsonreport.com

MRW scores:
0.020 firstdownpolitics.com
0.019 neoconservatives.blogspot.com
0.017 jmbzine.com
0.017 strangedoctrines.typepad.com
0.013 millers_time.typepad.com
0.011 decision08.blogspot.com
0.010 gopandcollege.blogspot.com
0.010 charlineandjamie.com
0.008 marksteyn.com
0.007 blackmanforbush.blogspot.com
0.007 reggiescorner.blogspot.com
0.007 fearfulsymmetry.blogspot.com
0.006 quibbles-n-bits.com
0.006 undercaffeinated.com
0.005 samizdata.net
0.005 pennywit.com
0.005 pajamahadin.com
0.005 mixtersmix.blogspot.com
0.005 stillfighting.blogspot.com
0.005 shakespearessister.blogspot.com
0.005 jadbury.com
0.005 thefulcrum.blogspot.com
0.005 watchandwait.blogspot.com
0.005 gindy.blogspot.com
0.005 cecile.squarespace.com
0.005 usliberals.about.com
0.005 twentyfirstcenturyrepublican.blogspot.com

1. Centrality-sensitive: seeds have different scores and are not necessarily the highest
2. Exponential drop-off: much less sure about nodes further away from seeds
3. Classes propagate independently: charlineandjamie.com is both very likely a conservative and a liberal blog (good or bad?)

We still don't fully understand it yet.
MRW: Related Work
• MRW is very much related to:
– "Local and global consistency" (Zhou et al. 2004) – similar formulation, different view
– "Web content categorization using link information" (Gyongyi et al. 2006) – RWR ranking as features to SVM
– "Graph-based semi-supervised learning as a generative model" (He et al. 2007) – random walk without restart, heuristic stopping
• Seed preference is related to the field of active learning:
– Active learning chooses which data point to label next based on previous labels; the labeling is interactive
– Seed preference is a batch labeling method

Authoritative seed preference is a good baseline for active learning on network data!
Results
• How much better is MRW using authoritative seed preference?

[Figure: y-axis: MRW F1 score minus wvRN F1 score; x-axis: number of seed labels per class.]

The gap between MRW and wvRN narrows with authoritative seeds, but differences remain prominent on some datasets with small numbers of seed labels.
A Web-Scale Knowledge Base
• Read the Web (RtW) project: build a never-ending system that learns to extract information from unstructured web pages, resulting in a knowledge base of structured information.
Noun Phrase and Context Data
• As part of the RtW project, two kinds of noun phrase (NP) and context co-occurrence data were produced:
– NP-context co-occurrence data
– NP-NP co-occurrence data
• These datasets can be treated as graphs
Noun Phrase and Context Data
• NP-context data:

Example: "… know that drinking pomegranate juice may not be a bad …"

NP: pomegranate juice
Before context: know that drinking _
After context: _ may not be a bad

[Figure: NP-context graph. NPs such as "pomegranate juice", "JELL-O", and "Jagermeister" link to contexts such as "know that drinking _", "_ may not be a bad", "_ is made from", and "_ promotes responsible"; edges are weighted by co-occurrence counts (e.g., 32, 5821).]
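A sketch of how such co-occurrence triples could be loaded into the sparse NP-context matrix used throughout; the triple format and index dictionaries are assumptions for illustration.

```python
from scipy.sparse import coo_matrix

def build_np_context_matrix(triples, np_index, ctx_index):
    """triples: iterable of (np_string, context_string, count),
    e.g. ("pomegranate juice", "know that drinking _", 32).
    Returns a sparse NP x context co-occurrence matrix."""
    rows, cols, vals = [], [], []
    for np_str, ctx, count in triples:
        rows.append(np_index[np_str])
        cols.append(ctx_index[ctx])
        vals.append(count)
    n, m = len(np_index), len(ctx_index)
    return coo_matrix((vals, (rows, cols)), shape=(n, m)).tocsr()
```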
Noun Phrase and Context Data
• NP-NP data:

Example: "… French Constitution Council validates burqa ban …"

NP: French Constitution Council
Context: validates
NP: burqa ban

[Figure: NP-NP graph. NPs such as "French Constitution Council", "burqa ban", "JELL-O", "Jagermeister", "French Court", "hot pants", and "veil" are linked when they co-occur. Context can be used for weighting edges or making a more complex graph.]
Noun Phrase Categorization
• We propose using MRW (with path folding) on the NP-context data to categorize NPs, given a handful of seed NPs (see the sketch after this list).
• Challenges:
– Large, noisy dataset (10m NPs, 8.6m contexts from 500m web pages)
– What's the right function for NP-NP categorical similarity?
– Which learned category assignment should we "promote" to the knowledge base?
– How to evaluate it?
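A sketch of our reading of "path folding": the two-step NP → context → NP walk is the implicit-manifold trick applied to the bipartite NP-context graph, so one MRW step on the folded NP-NP graph needs only sparse products with F.

```python
import numpy as np

def mrw_folded_step(F, r, u, damping=0.85):
    """One MRW step on the 'folded' NP-NP graph implied by the sparse
    NP-context matrix F, without materializing the NP-NP matrix.
    (Our sketch; u is the restart distribution over seed NPs.)"""
    d = F @ (F.T @ np.ones(F.shape[0]))   # row sums of the folded graph F F^T
    return (1 - damping) * u + damping * (F @ (F.T @ (r / d)))
```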
Noun Phrase Categorization
• Preliminary experiment:
– Small subset of the NP-context data
• 88k NPs
• 99k contexts
– Find category "city"
• Start with a handful of seeds
– Ground truth set of 2,404 city NPs created using