Representing Documents via Latent Keyphrase Inference
April 15th, 2016
Document Representation in Vector Space
Critical for document retrieval and categorization
Traditional Methods
• Bag-of-Words or Phrases
  – Cons: Sparse on short texts
• Topic models [LDA]
  – Each topic is a distribution over words; each document is a mixture of corpus-wide topics
  – Cons: Difficult for humans to infer topic semantics
• Concept-based models [ESA]
  – Every Wikipedia article represents a concept
  – Article words are associated with the concept via TF-IDF weights, which helps infer concepts from a document (see the sketch below)
  – Example: the concept Panthera is associated with Cat [0.92], Leopard [0.84], Roar [0.77]
  – Cons: Low coverage of concepts in a human-curated knowledge base
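A minimal sketch of this scoring scheme (illustrative only; the tiny inverted index and the weights below are hypothetical, not ESA's actual data):

```python
# Toy ESA-style scoring (hypothetical index and weights, for illustration).
from collections import Counter, defaultdict

# Inverted index: word -> {concept: TF-IDF weight}, e.g. built from Wikipedia.
WORD_TO_CONCEPTS = {
    "cat":     {"Panthera": 0.92, "Pet": 0.60},
    "leopard": {"Panthera": 0.84},
    "roar":    {"Panthera": 0.77},
}

def esa_vector(doc_words):
    """Represent a document as a weighted mixture of concepts."""
    vec = defaultdict(float)
    for word, tf in Counter(doc_words).items():
        for concept, weight in WORD_TO_CONCEPTS.get(word, {}).items():
            vec[concept] += tf * weight
    return dict(vec)

print(esa_vector(["cat", "leopard", "roar"]))  # ~{'Panthera': 2.53, 'Pet': 0.6}
```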
• Word/Document embedding models [word2vec, paragraph2vec]
  – Cons: Difficult to explain what each dimension means
Document Representation Using Keyphrases
• Use domain keyphrases as the entries in the vector
• Identify document keyphrases (a subset of the domain keyphrases) by evaluating relatedness between (document, domain keyphrase) pairs (see the sketch below)
• Unsupervised model
[Figure: corpus documents represented over domain keyphrases <K1, K2, …, KM>]
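A toy sketch of this representation, with a placeholder relatedness function (LAKI actually infers these scores with the Bayesian model described on the following slides; the names and scorer below are made up):

```python
# Represent a document over domain keyphrases <K1, ..., KM> via relatedness
# scores. The string-match scorer is a stand-in; LAKI infers scores even for
# latent (unmentioned) keyphrases.
DOMAIN_KEYPHRASES = ["machine learning", "support vector machine", "database"]

def relatedness(doc, keyphrase):
    return 1.0 if keyphrase in doc.lower() else 0.0  # placeholder scorer

def represent(doc):
    return [relatedness(doc, k) for k in DOMAIN_KEYPHRASES]

print(represent("A tutorial on support vector machine training"))
# -> [0.0, 1.0, 0.0]
```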
Challenges
• Where to get domain keyphrases for a given corpus?
  – Mining Quality Phrases from Massive Text Corpora [SIGMOD15]
• How to identify document keyphrases?
  – Mentions can be latent (short texts)
  – Relatedness scores are needed
How to identify document keyphrases?
• Powered by Bayesian inference on the “Domain Keyphrase Silhouette”
  – Domain Keyphrase Silhouette: a topic centered on a domain keyphrase
  – A “reverse” topic model
  – Learned from the corpus
Framework for Latent Keyphrase Inference (LAKI)
Domain Keyphrase Silhouette
• Learning a hierarchical Bayesian network (a DAG) over binary variables
• Task 1 (Model Learning): learning link weights
• Task 2 (Structure Learning): learning network structure
Task 1: Model Learning given Structure
• Use Z to represent both K (domain keyphrases) and T (content units)
• Noisy-OR parameterization (see the equation below)
  – A parent node activates its children more easily when the link weight is larger
  – A child node is influenced by all of its parents
  – A noise/prior term aggregates over all other links connected with Z_i
[Figure: toy example of the network]
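For reference, the standard noisy-OR conditional captures both bullets (a sketch of the parameterization; here w_{ij} is the link weight from parent Z_j to child Z_i, and w_{i0} is the noise/prior link):

$$
P(Z_i = 1 \mid \mathrm{Pa}(Z_i)) = 1 - (1 - w_{i0}) \prod_{Z_j \in \mathrm{Pa}(Z_i)} (1 - w_{ij})^{Z_j}
$$

Each active parent independently fails to activate the child with probability 1 - w_{ij}, so larger weights make activation easier, while inactive parents (Z_j = 0) contribute nothing.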
Maximum Likelihood Estimation
• Training data: documents
  – Content units: fully observed
  – Document keyphrases: partially observed
• Expectation step: for each document, collect sufficient statistics
  – Link-firing probability (parent and child both activated)
  – Node-activation probability
• Maximization step: update link weights (see the sketch below)
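A hedged sketch of the resulting M-step (illustrative; LAKI's exact update rule may differ): each link weight becomes the ratio of expected link firings to expected parent activations, both accumulated during the E-step.

```python
# Illustrative M-step for noisy-OR EM (a sketch, not the paper's exact rule).
def m_step(fire_stats, parent_stats, eps=1e-9):
    """fire_stats[(j, i)]: expected #docs in which link j->i fired
       (parent j and child i both activated through this link);
       parent_stats[j]: expected #docs in which parent j was activated."""
    return {(j, i): n_fire / (parent_stats[j] + eps)
            for (j, i), n_fire in fire_stats.items()}

# Toy numbers: parent 0 active in ~40 documents in expectation,
# firing its link to child 1 in ~30 of them.
print(m_step({(0, 1): 30.0}, {0: 40.0}))  # {(0, 1): ~0.75}
```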
Task 2: Structure Learning
• Domain keyphrases are connected to content units
  – Helps infer document keyphrases from content units
• Domain keyphrases are interconnected
  – Helps infer document keyphrases from other keyphrases
A Heuristic Approach
• Data-driven, a DAG, similar to an ontology
• Heuristic: two nodes are connected only if they are
  – Closely related (word2vec similarity)
  – Frequently co-occurring
• Links always point to less frequent nodes (see the sketch below)
• Works well in practice
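A minimal sketch of this heuristic (the thresholds tau and min_cooc are hypothetical, and similarity would come from trained word2vec vectors):

```python
# Heuristic edge construction (illustrative sketch; all thresholds made up).
# Connect two nodes only if they are closely related and co-occur often;
# direct each edge from the more frequent node to the less frequent one,
# which keeps the resulting graph acyclic when frequencies are distinct.
import numpy as np

def build_edges(nodes, vecs, cooc, freq, tau=0.5, min_cooc=10):
    edges = []
    for a in nodes:
        for b in nodes:
            if freq[a] <= freq[b]:
                continue  # only the frequent -> less frequent direction
            sim = np.dot(vecs[a], vecs[b]) / (
                np.linalg.norm(vecs[a]) * np.linalg.norm(vecs[b]))
            if sim > tau and cooc.get((a, b), 0) + cooc.get((b, a), 0) >= min_cooc:
                edges.append((a, b))
    return edges

vecs = {"machine learning": np.array([1.0, 0.2]),
        "support vector machine": np.array([0.9, 0.3])}
print(build_edges(list(vecs), vecs,
                  cooc={("machine learning", "support vector machine"): 50},
                  freq={"machine learning": 1000, "support vector machine": 200}))
# -> [('machine learning', 'support vector machine')]
```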
Inference
• Exact inference is slow!
  – Computing posterior probabilities in noisy-OR networks is NP-hard
• Approximate inference instead
  – Prune irrelevant nodes using an efficient scoring function
  – Gibbs sampling (see the sketch below)
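A toy Gibbs sampler for a two-layer noisy-OR network (latent keyphrases Z above observed content units X), to illustrate the approximate-inference step; the real system additionally prunes irrelevant nodes first and works over a deeper DAG:

```python
# Toy Gibbs sampler for a two-layer noisy-OR network (illustrative sketch).
import random

def p_x_given_z(x, z, w_col, leak):
    """P(X_i = x | z) under noisy-OR; w_col[j] is the weight Z_j -> X_i."""
    p_off = 1.0 - leak
    for j, zj in enumerate(z):
        if zj:
            p_off *= 1.0 - w_col[j]
    return 1.0 - p_off if x else p_off

def gibbs(xs, W, prior, leak=0.01, n_iter=500, seed=0):
    """Estimate P(Z_j = 1 | xs). W[j][i]: weight of link Z_j -> X_i."""
    rng = random.Random(seed)
    n_z, n_x = len(W), len(xs)
    z, counts = [0] * n_z, [0] * n_z
    for _ in range(n_iter):
        for j in range(n_z):
            probs = []
            for val in (0, 1):  # score both states of Z_j given the rest
                z[j] = val
                p = prior[j] if val else 1.0 - prior[j]
                for i in range(n_x):
                    p *= p_x_given_z(xs[i], z, [W[k][i] for k in range(n_z)], leak)
                probs.append(p)
            z[j] = 1 if rng.random() < probs[1] / (probs[0] + probs[1]) else 0
            counts[j] += z[j]
    return [c / n_iter for c in counts]

# Toy network: keyphrase 0 explains units 0-1; keyphrase 1 explains unit 2.
W = [[0.9, 0.8, 0.0],
     [0.0, 0.1, 0.9]]
print(gibbs(xs=[1, 1, 0], W=W, prior=[0.1, 0.1]))
# Expect a high posterior for Z_0 and a low one for Z_1 (unit 2 is off).
```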
Experiments
• Two text-related tasks to evaluate document representation quality
  – Phrase relatedness
  – Document classification
• Two datasets (Academia and Yelp)
Methods
• ESA (Explicit Semantic Analysis)
• KBLink uses the link structure in Wikipedia
• BoW (bag-of-words)
• ESA-C extends ESA by replacing Wikipedia with the domain corpus
• LSA (Latent Semantic Analysis)
• LDA (Latent Dirichlet Allocation)
• Word2Vec is a neural network computing word embeddings
• EKM uses explicit keyphrase detection
Phrase Relatedness Correlation
Document Classification
Case Study
Time Complexity
[Plots: running time (ms) as a function of #samples (1000 to 9000), #quality phrases after pruning (10 to 500), and #words (0 to 800), on the Academia and Yelp datasets]
Breakdown of Processing Time
Conclusion
• We have introduced a novel document representation method using latent keyphrases
  – Each dimension is explainable
  – Works for short text
  – Works for closed-domain text
• We have developed an efficient inference method for real-time keyphrase identification
• Future work
  – Better structure-learning approaches
  – Combination with knowledge bases
  – Inference methods other than Gibbs sampling
• Code available at http://jialu.info