Representing Documents via Latent Keyphrase Inference
April 15th, 2016
Document Representation in Vector Space
Critical for document retrieval and categorization
Traditional Methods
• Bag-of-Words or Phrases
  – Cons: Sparse on short texts
• Topic models [LDA]
  – Each topic is a distribution over words; each document is a mixture of corpus-wide topics
  – Cons: Difficult for humans to infer topic semantics
• Concept-based models [ESA]
  – Every Wikipedia article represents a concept
  – Article words are associated with the concept via TF-IDF weights, which helps infer concepts from a document (see the sketch below)
  – Example: the concept Panthera is associated with Cat [0.92], Leopard [0.84], Roar [0.77]
  – Cons: Low coverage of concepts in a human-curated knowledge base
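A minimal sketch of this scoring scheme (illustrative only; the tiny inverted index and the weights below are hypothetical, not ESA's actual data):

```python
# Toy ESA-style scoring (hypothetical index and weights, for illustration).
from collections import Counter, defaultdict

# Inverted index: word -> {concept: TF-IDF weight}, e.g. built from Wikipedia.
WORD_TO_CONCEPTS = {
    "cat":     {"Panthera": 0.92, "Pet": 0.60},
    "leopard": {"Panthera": 0.84},
    "roar":    {"Panthera": 0.77},
}

def esa_vector(doc_words):
    """Represent a document as a weighted mixture of concepts."""
    vec = defaultdict(float)
    for word, tf in Counter(doc_words).items():
        for concept, weight in WORD_TO_CONCEPTS.get(word, {}).items():
            vec[concept] += tf * weight
    return dict(vec)

print(esa_vector(["cat", "leopard", "roar"]))  # ~{'Panthera': 2.53, 'Pet': 0.6}
```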
• Word/Document embedding models [word2vec, paragraph2vec]
  – Cons: Difficult to explain what each dimension means
Document Representation Using Keyphrases
• Use domain keyphrases as the entries in the vector
• Identify document keyphrases (a subset of the domain keyphrases) by evaluating relatedness between (document, domain keyphrase) pairs (see the sketch below)
• Unsupervised model
[Figure: corpus documents represented over domain keyphrases <K1, K2, …, KM>]
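A toy sketch of this representation, with a placeholder relatedness function (LAKI actually infers these scores with the Bayesian model described on the following slides; the names and scorer below are made up):

```python
# Represent a document over domain keyphrases <K1, ..., KM> via relatedness
# scores. The string-match scorer is a stand-in; LAKI infers scores even for
# latent (unmentioned) keyphrases.
DOMAIN_KEYPHRASES = ["machine learning", "support vector machine", "database"]

def relatedness(doc, keyphrase):
    return 1.0 if keyphrase in doc.lower() else 0.0  # placeholder scorer

def represent(doc):
    return [relatedness(doc, k) for k in DOMAIN_KEYPHRASES]

print(represent("A tutorial on support vector machine training"))
# -> [0.0, 1.0, 0.0]
```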
Challenges
• Where to get domain keyphrases for a given corpus?
  – Mining Quality Phrases from Massive Text Corpora [SIGMOD15]
• How to identify document keyphrases?
  – Mentions can be latent (short texts)
  – Relatedness scores are needed
How to identify document keyphrases?
• Powered by Bayesian inference on the “Domain Keyphrase Silhouette”
  – Domain Keyphrase Silhouette: a topic centered on a domain keyphrase
  – A “reverse” topic model
  – Learned from the corpus
Framework for Latent Keyphrase Inference (LAKI)
Domain Keyphrase Silhouette
• Learning a hierarchical Bayesian network (a DAG) over binary variables
• Task 1 (Model Learning): learning link weights
• Task 2 (Structure Learning): learning network structure
Task 1: Model Learning given Structure
• Use Z to represent both K (domain keyphrases) and T (content units)
• Noisy-OR parameterization (see the equation below)
  – A parent node activates its children more easily when the link weight is larger
  – A child node is influenced by all of its parents
  – A noise/prior term aggregates over all other links connected with Z_i
[Figure: toy example of the network]
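For reference, the standard noisy-OR conditional captures both bullets (a sketch of the parameterization; here w_{ij} is the link weight from parent Z_j to child Z_i, and w_{i0} is the noise/prior link):

$$
P(Z_i = 1 \mid \mathrm{Pa}(Z_i)) = 1 - (1 - w_{i0}) \prod_{Z_j \in \mathrm{Pa}(Z_i)} (1 - w_{ij})^{Z_j}
$$

Each active parent independently fails to activate the child with probability 1 - w_{ij}, so larger weights make activation easier, while inactive parents (Z_j = 0) contribute nothing.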
Maximum Likelihood Estimation
• Training data: documents
  – Content units: fully observed
  – Document keyphrases: partially observed
• Expectation step: for each document, collect sufficient statistics
  – Link-firing probability (parent and child both activated)
  – Node-activation probability
• Maximization step: update link weights (see the sketch below)
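A hedged sketch of the resulting M-step (illustrative; LAKI's exact update rule may differ): each link weight becomes the ratio of expected link firings to expected parent activations, both accumulated during the E-step.

```python
# Illustrative M-step for noisy-OR EM (a sketch, not the paper's exact rule).
def m_step(fire_stats, parent_stats, eps=1e-9):
    """fire_stats[(j, i)]: expected #docs in which link j->i fired
       (parent j and child i both activated through this link);
       parent_stats[j]: expected #docs in which parent j was activated."""
    return {(j, i): n_fire / (parent_stats[j] + eps)
            for (j, i), n_fire in fire_stats.items()}

# Toy numbers: parent 0 active in ~40 documents in expectation,
# firing its link to child 1 in ~30 of them.
print(m_step({(0, 1): 30.0}, {0: 40.0}))  # {(0, 1): ~0.75}
```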
Task 2: Structure Learning
• Domain keyphrases are connected to content units
  – Helps infer document keyphrases from content units
• Domain keyphrases are interconnected
  – Helps infer document keyphrases from other keyphrases
A Heuristic Approach
• Data-driven, a DAG, similar to an ontology
• Heuristic: two nodes are connected only if they are
  – Closely related (word2vec similarity)
  – Frequently co-occurring
• Links always point to less frequent nodes (see the sketch below)
• Works well in practice
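A minimal sketch of this heuristic (the thresholds tau and min_cooc are hypothetical, and similarity would come from trained word2vec vectors):

```python
# Heuristic edge construction (illustrative sketch; all thresholds made up).
# Connect two nodes only if they are closely related and co-occur often;
# direct each edge from the more frequent node to the less frequent one,
# which keeps the resulting graph acyclic when frequencies are distinct.
import numpy as np

def build_edges(nodes, vecs, cooc, freq, tau=0.5, min_cooc=10):
    edges = []
    for a in nodes:
        for b in nodes:
            if freq[a] <= freq[b]:
                continue  # only the frequent -> less frequent direction
            sim = np.dot(vecs[a], vecs[b]) / (
                np.linalg.norm(vecs[a]) * np.linalg.norm(vecs[b]))
            if sim > tau and cooc.get((a, b), 0) + cooc.get((b, a), 0) >= min_cooc:
                edges.append((a, b))
    return edges

vecs = {"machine learning": np.array([1.0, 0.2]),
        "support vector machine": np.array([0.9, 0.3])}
print(build_edges(list(vecs), vecs,
                  cooc={("machine learning", "support vector machine"): 50},
                  freq={"machine learning": 1000, "support vector machine": 200}))
# -> [('machine learning', 'support vector machine')]
```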
Inference
• Exact inference is slow!
  – Computing posterior probabilities in noisy-OR networks is NP-hard
• Approximate inference instead
  – Prune irrelevant nodes using an efficient scoring function
  – Gibbs sampling (see the sketch below)
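A toy Gibbs sampler for a two-layer noisy-OR network (latent keyphrases Z above observed content units X), to illustrate the approximate-inference step; the real system additionally prunes irrelevant nodes first and works over a deeper DAG:

```python
# Toy Gibbs sampler for a two-layer noisy-OR network (illustrative sketch).
import random

def p_x_given_z(x, z, w_col, leak):
    """P(X_i = x | z) under noisy-OR; w_col[j] is the weight Z_j -> X_i."""
    p_off = 1.0 - leak
    for j, zj in enumerate(z):
        if zj:
            p_off *= 1.0 - w_col[j]
    return 1.0 - p_off if x else p_off

def gibbs(xs, W, prior, leak=0.01, n_iter=500, seed=0):
    """Estimate P(Z_j = 1 | xs). W[j][i]: weight of link Z_j -> X_i."""
    rng = random.Random(seed)
    n_z, n_x = len(W), len(xs)
    z, counts = [0] * n_z, [0] * n_z
    for _ in range(n_iter):
        for j in range(n_z):
            probs = []
            for val in (0, 1):  # score both states of Z_j given the rest
                z[j] = val
                p = prior[j] if val else 1.0 - prior[j]
                for i in range(n_x):
                    p *= p_x_given_z(xs[i], z, [W[k][i] for k in range(n_z)], leak)
                probs.append(p)
            z[j] = 1 if rng.random() < probs[1] / (probs[0] + probs[1]) else 0
            counts[j] += z[j]
    return [c / n_iter for c in counts]

# Toy network: keyphrase 0 explains units 0-1; keyphrase 1 explains unit 2.
W = [[0.9, 0.8, 0.0],
     [0.0, 0.1, 0.9]]
print(gibbs(xs=[1, 1, 0], W=W, prior=[0.1, 0.1]))
# Expect a high posterior for Z_0 and a low one for Z_1 (unit 2 is off).
```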
Experiments
• Two text-related tasks to evaluate document representation quality
  – Phrase relatedness
  – Document classification
• Two datasets (Academia and Yelp)
Methods
• ESA (Explicit Semantic Analysis)
• KBLink uses the link structure in Wikipedia
• BoW (bag-of-words)
• ESA-C extends ESA by replacing Wikipedia with the domain corpus
• LSA (Latent Semantic Analysis)
• LDA (Latent Dirichlet Allocation)
• Word2Vec is a neural network computing word embeddings
• EKM uses explicit keyphrase detection
Phrase Relatedness Correlation
Document Classification
Case Study
Time Complexity
[Plots: running time (ms) as a function of #samples (1000 to 9000), #quality phrases after pruning (10 to 500), and #words (0 to 800), on the Academia and Yelp datasets]
Breakdown of Processing Time
Conclusion
• We have introduced a novel document representation method using latent keyphrases
  – Each dimension is explainable
  – Works for short text
  – Works for closed-domain text
• We have developed an efficient inference method for real-time keyphrase identification
• Future work
  – Better structure-learning approaches
  – Combination with knowledge bases
  – Inference methods other than Gibbs sampling
• Code available at http://jialu.info