
Capturing Semantically Meaningful Word Dependencies with an Admixture of Poisson MRFs

David I. Inouye, Pradeep Ravikumar, Inderjit S. Dhillon

Admixture of Poisson MRFs (APM) [Inouye et al. 2014]

[Figure content]

Document Corpus:
- Doc. 1 - Nuclear power, or nuclear energy, is the use of exothermic nuclear processes...
- Doc. 2 - Theatre or theater is a collaborative form of fine art that uses live performers ...
- Doc. 3 - A temperature is a numerical measure of hot or cold. Its measurement is...
- Doc. 4 - Music is an art form whose medium is sound and silence...

Previous Independent Models:
- Single Topic (Multinomial) - Top Words: nuclear, theater, novels, heat, indian, emperor, sun, century, life, ...
- Multiple Topics (LDA, PLSA) - Top Words: "Temperature": nuclear, heat, sun, temperature, soviet; "Fine Arts": theater, music, plays, novels, life

Admixture of Poisson MRFs with dependencies - each topic is a graph over words (e.g., theater, music, plays, novels, life)

Corpus Examples: Encyclopedia Articles, Twitter Posts, News Articles, Research Papers
Applications: Topic Visualization, Information Retrieval, Document Classification, Word Sense Disambiguation

Figure: Previous topic models assume words are conditionally independent of each other given a topic, and thus, these previous models can only represent topics as a list of words ordered by frequency. However, an Admixture of Poisson MRFs (APM) can model dependencies between words and hence can represent each topic as a graph over words.

Open Problems in APM Model

1. High computational complexity of APM
   - No parallelism since optimizing jointly over all parameters
   - Slow convergence of proximal gradient descent
   - APM has O(kp^2) parameters versus O(kp) for LDA

2. Edge parameters of APM not directly evaluated
   - Previous metrics calculated word pair statistics for top words [Newman et al. 2010, Mimno et al. 2011, Aletras and Court 2013]
   - However, APM explicitly models dependencies between words
   - How can we semantically evaluate these dependencies?

Proposed Solutions

1. Parallel alternating Newton-like algorithm
   - Split into two convex problems
   - Demonstrate scaling at p = 10,000 and n = 100,000
   - Empirically, O(knp^2) complexity to estimate O(kn + kp^2) parameters
   - http://bigdata.ices.utexas.edu/software/apm/

2. Evocation metric that directly evaluates word pairs
   - Develop novel metric based on the notion of evocation (which words "bring to mind" other words)

[Inouye et al. 2014] Inouye, D. I., Ravikumar, P., and Dhillon, I. S. Admixture of Poisson MRFs: A Topic Model with Word Dependencies. In ICML, 2014.

Background: Admixtures

[Figure: two (x1, x2) scatter plots: (left) documents drawn from mixture components; (right) admixture configurations ranging from dense "topic" / sparse "document" to sparse "topic" / dense "document".]

Figure: (Left) In mixtures, documents are drawn from exactly one component distribution. (Right) In admixtures, documents are drawn from a distribution whose parameters are a convex combination of component parameters.

The conditional distribution given the admixture weights and component distributions is merely the base distribution with parameters that are instance-specific mixtures of the component parameters:

\[ \Pr\nolimits_{\mathrm{Admix}}(x \mid w, \Phi) = \Pr\nolimits_{\mathrm{Base}}\Big(x \,\Big|\, \bar{\phi} = \Psi^{-1}\Big[\sum_{j=1}^{k} w_j \Psi(\phi_j)\Big]\Big) \]

Admixtures / topic models / mixed-membership models:
- LDA [Blei et al. 2003] - an admixture of Multinomials
- Spherical Admixture Model (SAM) [Reisinger et al. 2010] - an admixture of von Mises-Fisher distributions
- Mixed Membership Stochastic Block Models [Airoldi et al. 2009] - an admixture for generative networks

Background: Poisson MRF

By assuming that the conditional distribution of a variable x_s given all other variables x_{\s} is a univariate Poisson, a joint Poisson distribution can be defined [Yang et al. 2012]:

\[ \Pr\nolimits_{\mathrm{PMRF}}(x \mid \theta, \Theta) \propto \exp\Big\{ \theta^T x + x^T \Theta x - \sum_{s=1}^{p} \ln(x_s!) \Big\}, \]

where \theta \in \mathbb{R}^p and \Theta \in \{\mathbb{R}^{p \times p} : \mathrm{diag}(\Theta) = 0\}.

Node conditionals (i.e., the distribution of one word given all other words) are 1-D Poissons:

\[ \Pr(x_s \mid x_{-s}, \theta_s, \Theta_s) \propto \exp\Big\{ \underbrace{(\theta_s + x^T \Theta_s)}_{\eta_s}\, x_s - \ln(x_s!) \Big\}. \]
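Because each node conditional is a 1-D Poisson with rate exp(θ_s + x^T Θ_s), Gibbs sampling is a natural way to draw from a PMRF. The sketch below is illustrative only (function names are our own, not from the released code); note that negative edge weights, as in the example, are the well-behaved case for normalizability.

```python
import numpy as np

def node_conditional_rate(x, s, theta, Theta):
    """Poisson rate for word s given the other counts:
    lambda_s = exp(eta_s) with eta_s = theta_s + x^T Theta_s.
    Assumes diag(Theta) = 0, so x[s] does not affect its own rate."""
    return np.exp(theta[s] + x @ Theta[:, s])

def gibbs_sweep(x, theta, Theta, rng):
    """One Gibbs sweep: resample each count from its 1-D Poisson conditional."""
    x = x.copy()
    for s in range(len(x)):
        x[s] = rng.poisson(node_conditional_rate(x, s, theta, Theta))
    return x

# Tiny example: 3 words, where words 0 and 1 suppress each other.
rng = np.random.default_rng(0)
theta = np.array([0.5, 0.5, 0.0])
Theta = np.zeros((3, 3))
Theta[0, 1] = Theta[1, 0] = -0.2
x = gibbs_sweep(np.array([1, 1, 1]), theta, Theta, rng)
```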

Background: APM Formal Definition

An Admixture of Poisson MRFs (APM) is an admixture with Poisson MRFs as the component distributions:

\[ \Pr\nolimits_{\mathrm{APM}}(x, w, \theta_{1...k}, \Theta_{1...k}) = \Pr\nolimits_{\mathrm{PMRF}}\Big(x \,\Big|\, \bar{\theta} = \sum_{j=1}^{k} w_j \theta_j,\ \bar{\Theta} = \sum_{j=1}^{k} w_j \Theta_j \Big) \Pr\nolimits_{\mathrm{Dir}}(w) \prod_{j=1}^{k} \Pr(\theta_j, \Theta_j) \]
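The document-specific parameters θ̄ and Θ̄ in this definition are just convex combinations of the component parameters; a minimal sketch of that mixing step (illustrative names, not the released code):

```python
import numpy as np

def admixed_parameters(w, thetas, Thetas):
    """Document-specific PMRF parameters as convex combinations:
    theta_bar = sum_j w_j theta_j,  Theta_bar = sum_j w_j Theta_j.

    w:      (k,) admixture weights on the simplex
    thetas: (k, p) node parameters, one row per topic
    Thetas: (k, p, p) edge parameter matrices, one per topic
    """
    theta_bar = w @ thetas                       # weighted sum of node params
    Theta_bar = np.tensordot(w, Thetas, axes=1)  # weighted sum of edge matrices
    return theta_bar, Theta_bar

# Example: k = 2 topics over p = 3 words.
w = np.array([0.7, 0.3])
thetas = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
Thetas = np.zeros((2, 3, 3))
theta_bar, Theta_bar = admixed_parameters(w, thetas, Thetas)
# theta_bar is [0.7, 0.3, 0.0]
```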

[Yang et al. 2012] Yang, E., Ravikumar, P., Allen, G. I., and Liu, Z. Graphical Models via Generalized Linear Models. In NIPS, 2012.

[Hsieh et al. 2014] Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. QUIC: Quadratic Approximation for Sparse Inverse Covariance Estimation. JMLR, 2014.

[Boyd-Graber et al. 2006] Boyd-Graber, J., Fellbaum, C., Osherson, D., and Schapire, R. Adding Dense, Weighted Connections to WordNet. In Global WordNet Conference, 2006.

Parallel Alternating Newton-like Algorithm (Code available*)

1. Split the algorithm into alternating steps
   - Posterior is convex in W or in (θ_{1...k}, Θ_{1...k}) but not both
   - Similar to EM for mixture models or ALS for NMF

\[ \operatorname*{arg\,min}_{\Phi_1, \Phi_2, \cdots, \Phi_p} \; -\frac{1}{n} \sum_{s=1}^{p} \Big[ \operatorname{tr}(\tilde{Z}_s \Phi_s) - \sum_{i=1}^{n} \exp(z_i^T \Phi_s w_i) \Big] + \sum_{s=1}^{p} \lambda \big\| \operatorname{vec}(\Phi_s)_{\backslash 1} \big\|_1 \]

\[ \operatorname*{arg\,min}_{w_1, w_2, \cdots, w_n \in \Delta^k} \; -\frac{1}{n} \sum_{i=1}^{n} \Big[ \psi_i^T w_i - \sum_{s=1}^{p} \exp(z_i^T \Phi_s w_i) \Big] \]

where z_i = [1 x_i^T]^T, \tilde{Z}_s = f(X, W), \phi_s^j = [\theta_s^j (\Theta_s^j)^T]^T, \psi_i = f(X, \Phi_{1...k}), \Phi_s = [\phi_s^1 \phi_s^2 \cdots \phi_s^k]

2. Use proximal Newton-like method [Hsieh et al. 2014]
3. Simplify Newton step computation [Hsieh et al. 2014]
   - Compute Hessian entries only for non-zero/free parameters

* Code available at: http://bigdata.ices.utexas.edu/software/apm/
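The alternating structure above — convex in each block when the other block is held fixed, like ALS for NMF — can be illustrated on a toy rank-1 least-squares problem. This is a stand-in for the pattern only, not the APM objective or the released solver:

```python
import numpy as np

def als_rank1(A, n_iter=50):
    """Alternating least squares for A ~ u v^T: with v fixed the problem
    is convex (in fact closed-form) in u, and vice versa -- the same
    block structure APM exploits, where the posterior is convex in W
    or in (theta, Theta) but not jointly."""
    m, n = A.shape
    v = np.ones(n)
    for _ in range(n_iter):
        u = A @ v / (v @ v)    # exact minimizer of ||A - u v^T||_F^2 for fixed v
        v = A.T @ u / (u @ u)  # exact minimizer for fixed u
    return u, v

# An exactly rank-1 matrix is recovered exactly.
A = np.outer([1.0, 2.0], [3.0, 4.0, 5.0])
u, v = als_rank1(A)
```

As in the APM scheme, each row/column update here is an independent subproblem, which is what makes the parallel speedup below possible.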

Timing Results on Wikipedia

[Figure: APM Training Time on Wikipedia Dataset (bar chart)]
- n = 20,000, p = 5,000 (50M words): 1st iter. 1.0 hrs; avg. of next 3 iters. 0.6 hrs
- n = 100,000, p = 5,000 (133M words): 1st iter. 3.1 hrs; avg. of next 3 iters. 2.2 hrs
- n = 20,000, p = 10,000 (57M words): 1st iter. 3.4 hrs; avg. of next 3 iters. 2.2 hrs

Figure: Timing results for different sizes of a Wikipedia dataset show that the algorithm scales approximately as O(np^2). (Fixing k = 5, λ = 0.5)

Parallel Speedup on BNC Corpus

[Figure: Parallel Speedup on BNC Dataset - speedup (0-20) vs. # of MATLAB workers (0-20), perfect vs. actual speedup]

Figure: Parallel speedup is approximately linear when using a simple parfor loop in MATLAB. Subproblems are all independent, so speedup could be O(min(n, p)) on a distributed system. Experiment on BNC corpus (p = 1646 and n = 4049) fixing k = 5, λ = 8 and running for 30 alternating iterations.

Evocation [Boyd-Graber et al. 2006]

- Evocation denotes the idea of which words "evoke" or "bring to mind" other words
- Distinct from word similarity or synonymy
- Types of evocation: Rose - Flower (example), Brave - Noble (kind), Yell - Talk (manner), Eggs - Bacon (co-occurrence), Snore - Sleep (setting), Wet - Desert (antonymy), Work - Lazy (exclusivity), Banana - Kiwi (likeness)

Evocation Metric

[Figure: tables of word pairs (w1 ↔ w2, etc.) with columns H and M; rank by model weights M, then sum the top-m human scores H]

m = number of top word pairs to evaluate
H = human-evaluated scores for a subset of word pairs
M = corresponding weights induced by the model
π_M(j) = ordering induced by M so that M_{π(1)} ≥ M_{π(2)} ≥ ··· ≥ M_{π(|H|)}

\[ \mathrm{Evoc}_m(M, H) = \sum_{j=1}^{m} H_{\pi_M(j)} \quad \text{(Evocation for Single Topic)} \]

\[ \text{Evoc-1} = \sum_{j=1}^{k} \frac{1}{k}\, \mathrm{Evoc}_m(M_j, H) \quad \text{(Avg. Evoc. of Topics)} \]

\[ \text{Evoc-2} = \mathrm{Evoc}_m\Big( \sum_{j=1}^{k} \frac{1}{k} M_j,\ H \Big) \quad \text{(Evoc. of Avg. Topic)} \]
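The metric itself is simple to compute; a minimal sketch of Evoc_m and the two aggregates, using dicts mapping word pairs to scores (an illustrative representation, not the evaluation code used on the poster):

```python
def evoc_m(model_weights, human_scores, m):
    """Evoc_m: sum the human scores of the m word pairs ranked highest
    by the model's edge weights (dicts mapping word pair -> score)."""
    ranked = sorted(model_weights, key=model_weights.get, reverse=True)
    return sum(human_scores[pair] for pair in ranked[:m])

def evoc_1(topic_weights, human_scores, m):
    """Evoc-1: average Evoc_m over the k per-topic weight dicts."""
    k = len(topic_weights)
    return sum(evoc_m(M_j, human_scores, m) for M_j in topic_weights) / k

def evoc_2(topic_weights, human_scores, m):
    """Evoc-2: Evoc_m of the average topic (1/k) sum_j M_j."""
    k = len(topic_weights)
    avg = {pair: sum(M_j[pair] for M_j in topic_weights) / k
           for pair in topic_weights[0]}
    return evoc_m(avg, human_scores, m)

# Toy example with three human-scored pairs and one topic.
H = {("sun", "earth"): 50, ("engine", "car"): 38, ("give", "church"): 38}
M = {("sun", "earth"): 0.9, ("engine", "car"): 0.1, ("give", "church"): 0.5}
# Top-2 pairs by M are (sun, earth) and (give, church): 50 + 38 = 88
```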

Evocation Metric Results

[Figure: bar charts of Evocation (m = 50) for Evoc-1 (Avg. Evoc. of Topics) and Evoc-2 (Evoc. of Avg. Topic) at k = 1, 3, 5, 10, 25, 50, comparing APM, APM-LowReg, APM-HeldOut, CTM, HDP, LDA, RSM, and RND; y-axis 0-1600; some model/k combinations are N/A.]

Qualitative Analysis of Evocation

- Word pairs for Evoc-2 (m = 50) ordered by human score

Best LDA Model (k = 50)                        Best APM Model (k = 5)
Human Score  Word Pair                         Human Score  Word Pair
100  run.v ↔ car.n                             100  telephone.n ↔ call.n
 82  teach.v ↔ school.n                         97  husband.n ↔ wife.n
 69  school.n ↔ class.n                         82  residential.a ↔ home.n
 63  van.n ↔ car.n                              76  politics.n ↔ political.a
 51  hour.n ↔ day.n                             75  steel.n ↔ iron.n
 50  teach.v ↔ student.n                        75  job.n ↔ employment.n
 44  house.n ↔ government.n                     75  room.n ↔ bedroom.n
 44  week.n ↔ day.n                             72  aunt.n ↔ uncle.n
 38  university.n ↔ institution.n               72  printer.n ↔ print.v
 38  state.n ↔ government.n                     60  love.v ↔ love.n
 38  woman.n ↔ man.n                            57  question.n ↔ answer.n
 38  give.v ↔ church.n                          57  prison.n ↔ cell.n
 38  wife.n ↔ man.n                             51  mother.n ↔ baby.n
 38  engine.n ↔ car.n                           50  sun.n ↔ earth.n
 35  publish.v ↔ book.n                         50  west.n ↔ east.n
 32  west.n ↔ state.n                           44  weekend.n ↔ sunday.n
 32  year.n ↔ day.n                             41  wine.n ↔ drink.v
 25  member.n ↔ give.v                          38  south.n ↔ north.n
 25  dog.n ↔ animal.n                           38  morning.n ↔ afternoon.n
 25  seat.n ↔ car.n                             38  engine.n ↔ car.n

- Red highlights pairs that seem semantically uninteresting
- Blue highlights pairs that seem semantically interesting

http://www.cs.utexas.edu/~dinouye/ {dinouye,pradeepr,inderjit}@cs.utexas.edu