Evaluating the Use of Clustering for Automatically Organising Digital Library Collections
Mark M. Hall, Mark Stevenson, Paul D. Clough
TPDL 2012, Cyprus, 24-27 September 2012
Opening Up Digital Cultural Heritage
http://www.flickr.com/photos/usnationalarchives/4069633668/
Carl Collins http://www.flickr.com/photos/carlcollins/199792939/
http://www.flickr.com/photos/brokenthoughts/122096903/
Exploring Collections
• Exploring / Browsing as an alternative to Search (where applicable)
• Requires some kind of structuring of the data
• Manual structuring ideal
  – Expensive to generate
  – Integration of collections problematic
• Alternative: Automatic structuring via clustering
Test Collection
• 28133 photographs provided by the University of St Andrews Library
  – 85% pre 1940
  – 89% black and white
  – Majority UK
  – Title and description tend to be short
Ottery St Mary Church
Tested Clustering Strategies
• Latent Dirichlet Allocation (LDA)
  – 300 & 900 topics
  – With and without Pointwise Mutual Information (PMI) filtering
• K-Means
  – 900 clusters
  – TFIDF vectors & LDA topic vectors
• OPTICS
  – 900 clusters
  – TFIDF vectors & LDA topic vectors
Processing Time
Model            Wall-clock Time (hh:mm:ss)
LDA 300          00:21:48
LDA 900          00:42:42
LDA + PMI 300    05:05:13
LDA + PMI 900    17:26:08
K-Means TFIDF    09:37:40
K-Means LDA      03:49:04
OPTICS TFIDF     12:42:13
OPTICS LDA       05:12:49
Evaluation Metrics
• Cluster cohesion
  – Items in a cluster should be similar to each other
  – Items in a cluster should be different from items in other clusters
• How to test this?
  – “Intruder” test
  – If you insert an intruder into a cluster, can people find it?
Intruder Test
1. Randomly select one topic
2. Randomly select four items from the topic
3. Randomly select a second topic – the “intruder” topic
4. Randomly select one item from the second topic – the “intruder” item
5. Scramble the five items and let the user choose which one is the “intruder”
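The five steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name, the dictionary layout, and the item identifiers are hypothetical, and requiring the first topic to hold at least four items is an added assumption.

```python
import random


def build_intruder_unit(clusters, rng, n_items=4):
    """Build one intruder-test unit from a mapping of topic id -> list of item ids."""
    # Step 1: randomly select one topic (assumption: it must hold >= n_items items).
    eligible = [topic for topic, items in clusters.items() if len(items) >= n_items]
    topic = rng.choice(eligible)
    # Step 2: randomly select four distinct items from that topic.
    shown = rng.sample(clusters[topic], n_items)
    # Step 3: randomly select a second, different topic as the intruder topic.
    others = [t for t, items in clusters.items() if t != topic and items]
    intruder_topic = rng.choice(others)
    # Step 4: randomly select one item from the intruder topic.
    intruder = rng.choice(clusters[intruder_topic])
    # Step 5: scramble the five items before showing them to the participant.
    unit_items = shown + [intruder]
    rng.shuffle(unit_items)
    return {
        "topic": topic,
        "intruder_topic": intruder_topic,
        "intruder": intruder,
        "items": unit_items,
    }


# Hypothetical clusters with made-up item identifiers.
example_clusters = {
    "churches": ["c1", "c2", "c3", "c4", "c5"],
    "harbours": ["h1", "h2", "h3", "h4"],
    "portraits": ["p1", "p2"],
}
unit = build_intruder_unit(example_clusters, random.Random(42))
print(unit["items"], "intruder:", unit["intruder"])
```

Passing a seeded random.Random instance keeps the selection reproducible, which helps when the same units have to be shown to many participants.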
Cluster Cohesion – Cohesive
Cluster Cohesion – Not Cohesive
Evaluation Metrics
• Cohesive
  – “Intruder” is chosen significantly more frequently than by chance
  – Choice distribution is significantly different from the uniform distribution
• Borderline cohesive
  – Two out of five items make up > 95% of the answers
  – “Intruder” is one of those two
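One way to turn these criteria into a decision rule is sketched below. The slides do not name the statistical tests, so everything specific here is an assumption: a one-sided exact binomial test for the intruder, a chi-square goodness-of-fit test against the uniform distribution, a 0.05 significance level, and checking the cohesive criteria before the borderline ones. The function names are hypothetical.

```python
from math import comb, exp


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def chi_square_uniform_p(counts):
    """p-value of a chi-square goodness-of-fit test against the uniform distribution.

    Uses the closed-form survival function for 4 degrees of freedom,
    so it is valid only for exactly five answer options.
    """
    if len(counts) != 5:
        raise ValueError("expected counts for exactly five items")
    expected = sum(counts) / 5
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return exp(-stat / 2) * (1 + stat / 2)


def classify_unit(counts, intruder_index, alpha=0.05):
    """Classify one unit from the number of votes each of its five items received."""
    n = sum(counts)
    if n == 0:
        raise ValueError("unit has no ratings")
    # Cohesive: intruder chosen more often than chance (1 in 5) AND
    # the overall choice distribution differs from uniform (assumed tests).
    intruder_p = binomial_tail(counts[intruder_index], n, 1 / 5)
    if intruder_p < alpha and chi_square_uniform_p(counts) < alpha:
        return "cohesive"
    # Borderline: the two most-chosen items hold > 95% of the answers
    # and the intruder is one of those two.
    ranked = sorted(counts, reverse=True)
    top_two_share = (ranked[0] + ranked[1]) / n
    intruder_votes = counts[intruder_index]
    if top_two_share > 0.95 and intruder_votes > 0 and intruder_votes >= ranked[1]:
        return "borderline"
    return "non-cohesive"


# Hypothetical vote counts for a unit with 27 ratings; the intruder is item 0.
print(classify_unit([25, 1, 1, 0, 0], intruder_index=0))
```

A unit where participants split between the intruder and one other item, without the intruder being picked significantly more often than chance, falls into the borderline class under this ordering.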
Evaluation Bounds
• Upper bound
  – Manual annotation
    • 936 topics
• Lower bound
  – 3 cohesive topics
  – <5% likelihood of seeing that number of cohesive topics by chance
• Control data
  – 10 “really, totally, completely obvious” intruders used to filter participants who randomly select answers
Experiment
• Crowd-sourced using staff & students at Sheffield University
  – 700 participants
• 9 clustering strategies
  – 30 units per strategy, total of 270 units
• Results
  – 8840 ratings
  – 21 to 30 ratings per unit (median 27 ratings)
Results
Model            Cohesive   Borderline   Non-Cohesive
Upper Bound      27         0            3
Lower Bound      3          0            27
LDA 300          15         6            9
LDA 900          20         4            6
LDA + PMI 300    16         4            10
LDA + PMI 900    21         2            7
K-Means TFIDF    24         3            3
K-Means LDA      20         0            10
OPTICS TFIDF     14         2            14
OPTICS LDA       16         0            14
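To compare the models as proportions of the 30 units each, the counts above can be converted as in this short sketch. The numbers are copied from the results table; the function name and the choice to report "cohesive" and "cohesive or borderline" shares are illustrative assumptions.

```python
# (cohesive, borderline, non-cohesive) counts per model, copied from the table.
RESULTS = {
    "Upper Bound": (27, 0, 3),
    "Lower Bound": (3, 0, 27),
    "LDA 300": (15, 6, 9),
    "LDA 900": (20, 4, 6),
    "LDA + PMI 300": (16, 4, 10),
    "LDA + PMI 900": (21, 2, 7),
    "K-Means TFIDF": (24, 3, 3),
    "K-Means LDA": (20, 0, 10),
    "OPTICS TFIDF": (14, 2, 14),
    "OPTICS LDA": (16, 0, 14),
}


def cohesion_shares(results):
    """Return {model: (share cohesive, share cohesive or borderline)}."""
    shares = {}
    for model, (cohesive, borderline, non_cohesive) in results.items():
        total = cohesive + borderline + non_cohesive
        shares[model] = (cohesive / total, (cohesive + borderline) / total)
    return shares


for model, (strict, lenient) in cohesion_shares(RESULTS).items():
    print(f"{model:<15} cohesive {strict:.0%}   cohesive or borderline {lenient:.0%}")
```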
Conclusions
• K-Means on TFIDF vectors is almost as good as the human classification (24 vs. 27 cohesive topics out of 30)
• LDA is very fast and approximately two thirds of the topics are acceptably cohesive
• Future work:
  – Make it hierarchical
  – Create hybrid algorithms
Thank you for listening
Find out more about the project: http://www.paths-project.eu
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270082. We acknowledge the contribution of all project partners involved in PATHS (see: http://www.paths-project.eu).