Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011...
![Page 1: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/1.jpg)
1
Building Topic Models in a Federated Digital Library Through
Selective Document Exclusion
ASIST 2011
New Orleans, LA
October 10, 2011
Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science
University of Illinois, Urbana-Champaign
Supported by IMLS LG-06-07-0020.
![Page 2: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/2.jpg)
2
The Setting: IMLS DCC
[Diagram: data providers (IMLS NLG & LSTA), each with its collection(s), expose metadata over OAI-PMH to the service provider (DCC), which offers DCC services.]
![Page 3: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/3.jpg)
3
High-Level Research Interest
• Improve “access” to data harvested for federated digital libraries by enhancing:
– Representation of documents
– Representation of document aggregations
– Capitalizing on the relationship between aggregations and documents.
• PS: By “document” I mean a single metadata (usually DC) record.
![Page 4: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/4.jpg)
4
Motivation for our Work
• Most empirical approaches to this type of problem rely on some kind of analysis of term counts.
• Unreliable for our data:
– Vocabulary mismatch
– Poor probability estimates
![Page 5: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/5.jpg)
5
The Setting: IMLS DCC
![Page 6: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/6.jpg)
6
The Problem: Supporting End-User Experience
• Full-text search
• Browse by “subject”
• Desired:
– Improved browsing
– Support high-level aggregation understanding and resource discovery
• Approach: Empirically induced “topics” using established methods, e.g. latent Dirichlet allocation (LDA).
![Page 7: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/7.jpg)
7
![Page 8: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/8.jpg)
8
![Page 9: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/9.jpg)
9
![Page 10: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/10.jpg)
10
![Page 11: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/11.jpg)
11
Research Question
• Can we improve induced models by mitigating the influence of noisy data, common in federated digital library settings?
• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
• Approach: Identify and remove “weakly topical” documents during model training.
![Page 12: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/12.jpg)
12
Latent Dirichlet Allocation
• Given a corpus of documents, C, and an empirically chosen integer k
• Assume that a generative process involving k latent topics generated word occurrences in C.
• End result: for a given word w and a given document D:
– Pr(w | Ti)
– Pr(D | Ti)
– Pr(Ti)
for each topic T1 … Tk.
![Page 13: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/13.jpg)
13
Latent Dirichlet Allocation
Generative process for each document:
1. Choose doc length N ~ Poisson(mu).
2. Choose probability vector Theta ~ Dir(alpha).
3. For each word position n in 1:N:
a) Choose topic zn ~ Multinomial(Theta).
b) Choose word wn from P(wn | zn, Beta).
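The generative steps on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not code from the talk: the function names, the toy vocabulary, and the representation of Beta as one word-probability list per topic are all assumptions.

```python
import math
import random


def sample_poisson(mu, rng):
    """Draw N ~ Poisson(mu) with Knuth's method (adequate for small mu)."""
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_dirichlet(alpha, rng):
    """Draw Theta ~ Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(mu, alpha, beta, vocab, rng):
    """Follow steps 1-3: length, topic mixture, then one topic and word per position.

    beta[k] is assumed to hold Pr(w | topic k) aligned with `vocab`.
    """
    n_words = sample_poisson(mu, rng)            # step 1
    theta = sample_dirichlet(alpha, rng)         # step 2
    topic_ids = list(range(len(alpha)))
    words = []
    for _ in range(n_words):                     # step 3
        z = rng.choices(topic_ids, weights=theta)[0]        # 3a: zn ~ Multinomial(Theta)
        words.append(rng.choices(vocab, weights=beta[z])[0])  # 3b: wn ~ P(wn | zn, Beta)
    return theta, words


# Toy usage with two hypothetical topics over a four-word vocabulary.
vocab = ["map", "atlas", "letter", "diary"]
beta = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]
theta, words = generate_document(8, [0.5, 0.5], beta, vocab, random.Random(0))
```

Fitting LDA means running this story in reverse: the words are observed, and Theta, the topic assignments, and Beta are what must be estimated.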
![Page 14: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/14.jpg)
14
Latent Dirichlet Allocation
Calculate estimates via iterative methods: MCMC / Gibbs sampling.
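For readers who have not seen it, a collapsed Gibbs sampler for LDA fits in a short function. The sketch below is a textbook-style version and is not claimed to be the implementation used in this work. The symmetric priors `alpha` and `eta`, the iteration count, and the use of integer word ids are assumptions.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns (phi, theta): phi[k][w] estimates Pr(w | Tk), theta[d][k] estimates Pr(Tk | d).
    """
    rng = random.Random(seed)
    topic_ids = list(range(n_topics))
    doc_topic = [[0] * n_topics for _ in docs]                 # tokens in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # times word w is assigned to topic k
    topic_total = [0] * n_topics                               # tokens assigned to topic k
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][n]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional given all other tokens.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + eta)
                    / (topic_total[k] + vocab_size * eta)
                    for k in topic_ids
                ]
                z = rng.choices(topic_ids, weights=weights)[0]
                assign[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    phi = [
        [(topic_word[k][w] + eta) / (topic_total[k] + vocab_size * eta) for w in range(vocab_size)]
        for k in topic_ids
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in topic_ids]
        for d, doc in enumerate(docs)
    ]
    return phi, theta
```

The estimates are read off the final sample's counts; averaging over several samples is a common refinement that is left out here for brevity.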
![Page 15: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/15.jpg)
15
Full Corpus
![Page 16: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/16.jpg)
16
Full Corpus
Proposed algorithm
![Page 17: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/17.jpg)
17
Reduced Corpus
Pr(w | T) Pr(D | T) Pr(T)
Train the Model
![Page 18: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/18.jpg)
18
Full Corpus
Inference
Pr(w | T) Pr(D | T) Pr(T)
Pr(w | T) Pr(D | T) Pr(T)
![Page 19: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/19.jpg)
19
Sample Topics Induced from “Raw” Data
![Page 20: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/20.jpg)
20
Documents’ Topical Strength
• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
![Page 21: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/21.jpg)
21
Documents’ Topical Strength
• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
• Proposal: Improve the induced topic model by removing “weakly topical” documents during training.
• After training, use the inferential apparatus of LDA to assign topics to these “stop documents.”
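One simple way to picture the last step is a "fold-in": hold the trained Pr(w | T) fixed and estimate only the topic mixture of each excluded document. The sketch below uses a smoothed EM-style update as a stand-in. It is a simplification chosen for illustration, not LDA's full inference procedure or necessarily what was used in this work. The function name, the smoothing value `alpha`, and the toy `phi` matrix are assumptions.

```python
def infer_topic_mixture(doc, phi, alpha=0.1, n_iters=50):
    """Estimate Pr(T | doc) for a held-out document with topic-word probabilities fixed.

    doc: list of word ids; phi[k][w]: trained Pr(w | Tk), never updated here.
    """
    n_topics = len(phi)
    theta = [1.0 / n_topics] * n_topics
    for _ in range(n_iters):
        counts = [0.0] * n_topics
        for w in doc:
            # Responsibility of each topic for this token under the current mixture.
            resp = [theta[k] * phi[k][w] for k in range(n_topics)]
            total = sum(resp)
            if total == 0.0:
                continue  # word has zero probability under every topic; skip it
            for k in range(n_topics):
                counts[k] += resp[k] / total
        # Smoothed re-estimate of the mixture weights.
        norm = sum(counts) + n_topics * alpha
        theta = [(counts[k] + alpha) / norm for k in range(n_topics)]
    return theta


# Toy usage: a "stop document" made only of words that topic 0 favors.
phi = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]
stop_doc = [0, 1, 0, 1, 0]
mixture = infer_topic_mixture(stop_doc, phi)
```

Because `phi` is never updated, the excluded documents receive topic assignments without influencing what the topics are, which is the point of the proposal.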
![Page 22: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/22.jpg)
22
Identifying “Stop Documents”
• Time at which documents enter a repository is often informative (e.g. bulk uploads).
• Score each document by log Pr(di | MC), where MC is the collection language model and di is the sequence of words comprising the ith document.
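As a concrete reading of log Pr(di | MC), the sketch below builds a unigram collection model by maximum likelihood and sums log probabilities over a document's words. The unigram assumption, the lack of smoothing, and the function names are illustrative choices; the slides do not specify how MC is estimated.

```python
import math
from collections import Counter


def collection_model(docs):
    """Maximum-likelihood unigram model over every token in the collection."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def log_likelihood(doc, model):
    """log Pr(di | MC) under the unigram model; doc is a list of tokens."""
    return sum(math.log(model[w]) for w in doc)


# Toy usage: three short metadata records, already tokenized.
docs = [["civil", "war", "letter"], ["civil", "war", "map"], ["civil", "war", "letter"]]
mc = collection_model(docs)
scores = [log_likelihood(d, mc) for d in docs]
```

Near-identical records get near-identical scores, so listing scores in accession order makes bulk uploads show up as flat stretches.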
![Page 23: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/23.jpg)
23
Identifying “Stop Documents”
• Our paper outlines an algorithm for accomplishing this.
• Intuition:
– Given a document di, decide if it is part of a “run” of near-identical records.
– Remove all records that occur within a run.
– The required amount of homogeneity to identify a run is guided by a parameter tol, which is a cumulative normal probability: e.g. 95%, 99% confidence.
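The slides do not reproduce the paper's algorithm, so the following is only one possible reading of the intuition and should not be taken as the authors' method. It assumes documents are listed in accession order with a score such as log Pr(di | MC), models successive score differences as zero-mean normal, and treats a difference as "homogeneous" when it falls in the central (1 - tol) band of that distribution. The function name, the `min_run` parameter, and this use of `tol` are all assumptions.

```python
import math
from statistics import NormalDist


def find_runs(scores, tol=0.95, min_run=3):
    """Return indices of documents that sit inside a run of near-identical scores.

    scores: per-document scores in accession order. tol must lie strictly
    between 0 and 1; a higher tol demands tighter homogeneity.
    """
    if len(scores) < 2:
        return []
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    sigma = math.sqrt(sum(d * d for d in diffs) / len(diffs))  # RMS of the differences
    z = NormalDist().inv_cdf(0.5 + (1.0 - tol) / 2.0)
    threshold = sigma * z

    flagged = []
    start = None
    # A trailing infinite difference acts as a sentinel that closes any open run.
    for i, d in enumerate(diffs + [float("inf")]):
        if abs(d) <= threshold:
            if start is None:
                start = i
        elif start is not None:
            # Differences start..i-1 are homogeneous, so documents start..i form a run.
            if i - start + 1 >= min_run:
                flagged.extend(range(start, i + 1))
            start = None
    return flagged


# Toy usage: four identical records arrive back to back (indices 2 through 5).
scores = [-10.0, -25.0, -40.0, -40.0, -40.0, -40.0, -18.0, -33.0]
stop_docs = find_runs(scores, tol=0.95)
```

Documents flagged this way would be withheld from training and assigned topics afterwards by inference.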
![Page 24: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/24.jpg)
24
![Page 25: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/25.jpg)
25
![Page 26: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/26.jpg)
26
Sample Topics Induced from Groomed Data
![Page 27: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/27.jpg)
27
Experimental Assessment
• Question: Are topics built from “sampled” corpora more coherent than topics induced from raw corpora?
• Intrusion detection:
– Find the 10 most probable words for topic Ti.
– Replace one of these 10 with a word chosen from the corpus with uniform probability.
– Ask human assessors to identify the “intruder” word.
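The construction of a single intrusion item can be sketched as follows. Two details are assumptions not stated on the slide: the intruder is drawn uniformly from vocabulary types rather than tokens, and words already in the top 10 are excluded so the intruder is never a duplicate. The function and variable names are hypothetical.

```python
import random


def build_intrusion_item(topic_word_probs, vocabulary, rng, n_top=10):
    """Return (shown_words, intruder) for one topic.

    topic_word_probs: dict mapping word -> Pr(w | Ti).
    vocabulary: list of all word types in the corpus.
    """
    top = sorted(topic_word_probs, key=topic_word_probs.get, reverse=True)[:n_top]
    candidates = [w for w in vocabulary if w not in top]
    intruder = rng.choice(candidates)            # uniform over the remaining vocabulary
    shown = list(top)
    shown[rng.randrange(len(shown))] = intruder  # replace one of the top words
    rng.shuffle(shown)                           # hide the intruder's position
    return shown, intruder


# Toy usage with a hypothetical 15-word vocabulary and descending probabilities.
vocabulary = [f"word{i}" for i in range(15)]
probs = {w: 1.0 / (i + 1) for i, w in enumerate(vocabulary)}
shown, intruder = build_intrusion_item(probs, vocabulary, random.Random(0))
```

If a topic is coherent, the nine remaining top words hang together and assessors spot the intruder easily; for an incoherent topic they are reduced to guessing.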
![Page 28: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/28.jpg)
28
Experimental Assessment
• For each topic Ti have 20 assessors try to find an intruder (20 different intruders). Repeat for both “sampled” and “raw” models.
– i.e. 20 assessors * 2 models * 100 topics = 4,000 assessments
• Asi is the percent of workers who correctly found the intruder in the ith topic of the sampled model, and Ari is analogous for the raw model.
• Testing the one-sided alternative Asi > Ari against the null hypothesis of no improvement yields p < 0.001.
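The slide does not say which significance test was used. As an illustration only, the sketch below computes per-topic accuracies and runs a one-sided paired sign-flip permutation test, which is one common choice for this kind of paired comparison. The accuracy values are made up for the example and are not the study's data.

```python
import random


def paired_permutation_p(sampled_acc, raw_acc, n_perm=10000, seed=0):
    """One-sided p-value for 'sampled accuracy exceeds raw accuracy', paired by topic.

    Under the null hypothesis the sign of each per-topic difference is arbitrary,
    so signs are flipped at random and the summed difference is recomputed.
    """
    rng = random.Random(seed)
    diffs = [s - r for s, r in zip(sampled_acc, raw_acc)]
    observed = sum(diffs)
    hits = 0
    for _ in range(n_perm):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if total >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero


# Size of the assessment pool described on the slide.
n_assessments = 20 * 2 * 100  # assessors * models * topics

# Hypothetical per-topic accuracies (fraction of 20 workers who found the intruder).
sampled = [0.8] * 20
raw = [0.5] * 20
p_value = paired_permutation_p(sampled, raw)
```

A paired test fits here because each topic index i contributes one accuracy from each model, although topics from two separately trained models are matched only by index.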
![Page 29: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/29.jpg)
29
Experimental Assessment
• For each topic Ti have 20 workers subjectively assess the topic’s “coherence,” reporting on a 4-point Likert scale.
![Page 30: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/30.jpg)
30
Current & Future Work
• Testing breadth of coverage
• Assessing the value of induced topics
• Topic information for document priors in the language modeling IR framework [next slide]
• Massive document expansion for improved language model estimation [under review]
![Page 31: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/31.jpg)
31
Weak Topicality and Document Priors
![Page 32: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/32.jpg)
32
Weak Topicality and Document Priors
![Page 33: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter.](https://reader035.fdocuments.us/reader035/viewer/2022062321/56649f305503460f94c4a7d5/html5/thumbnails/33.jpg)
33
Thank You
ASIST 2011
New Orleans, LA
October 10, 2011
Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science
University of Illinois, Urbana-Champaign