Comparing taxonomies for organising collections of documents
Samuel Fernando, Mark Hall, Eneko Agirre,
Aitor Soroa, Paul Clough, Mark Stevenson
COLING 2012, 14th December 2012, Mumbai, India
Introduction
● Large collections of diverse data are available
online. PATHS project aims to support user
exploration in digital library collections.
● Search box is useful but taxonomies are better
suited for exploration and browsing.
● We apply taxonomies to organise data from a large
digital library collection.
● Process is automatic – either map items to an
existing taxonomy, or induce a taxonomy from the
data.
Evaluation data
● We use items from Europeana, a large online collection
of cultural heritage.
● Use English subset, approx. 550,000 items.
● Item typically contains a picture, a title, description and
subject keywords.
● Very diverse data comprising artifacts, places, people.
Topics include fashion, archaeology, architecture and
many other subjects.
● Data from many providers, some of which use
taxonomies, some don’t – need unified approach
Example item
Title: Design Council Slide Collection
Subject: colour, exhibitions, industrial design
Description: Display on the theme of colour matching at the Design Centre, London, 1960
Manually created taxonomies
● We use four existing manually created taxonomies:
– LCSH (Library of Congress)
– WordNet domains
– Wikipedia Taxonomy
– DBpedia ontology
● The taxonomies already exist and are of good
quality - but problem is to map Europeana items
into the correct place in the taxonomy.
LCSH
● A controlled vocabulary maintained by the US
Library of Congress for bibliographic records.
● Used by libraries to organise collections and also by
curators of cultural heritage.
● Subject keywords are used to map Europeana
items into the appropriate LCSH category nodes.
Example path:
industrial design → design → creation (literary, artistic, etc.) → intellect
+30 more higher-level headings
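The mapping step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes exact (case-insensitive) matching between subject keywords and headings, and `BROADER` is a hypothetical toy fragment in which only the "industrial design" chain is taken from the slide.

```python
# Toy fragment of a heading hierarchy: heading -> list of broader headings.
# Only the "industrial design" chain comes from the slide; the dict layout
# and the exact-match rule are assumptions made for illustration.
BROADER = {
    "industrial design": ["design"],
    "design": ["creation (literary, artistic, etc.)"],
    "creation (literary, artistic, etc.)": ["intellect"],
    "intellect": [],
}

def map_item_to_headings(subject_keywords):
    """Return the headings that exactly match an item's subject keywords."""
    return [k.lower() for k in subject_keywords if k.lower() in BROADER]

def ancestors(heading):
    """Collect every broader heading reachable from `heading`."""
    seen, stack = [], list(BROADER.get(heading, []))
    while stack:
        h = stack.pop()
        if h not in seen:
            seen.append(h)
            stack.extend(BROADER.get(h, []))
    return seen

# Subject keywords of the example item shown earlier.
item_subjects = ["colour", "exhibitions", "industrial design"]
matched = map_item_to_headings(item_subjects)
print(matched)
print(ancestors(matched[0]))
```

In this toy fragment only "industrial design" matches, so the item would sit at that node and be reachable from each broader heading above it.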
WordNet domains
● WordNet domains (Bernardo Magnini, LREC 2000)
assigns labels from a small set of 164 domain labels to
each WordNet synset.
● Again use subject keywords to map Europeana items:
first to Yago2 (for proper nouns), then to a synset, and
finally to a WordNet domain label.
Example paths:
tourism → social
color → factotum
art → humanities
+5 more
Wikipedia Taxonomy
● Wikipedia category hierarchy preserving only is-a
relations - all others are discarded.
● Use Wikipedia Miner over each Europeana item to
identify Wikipedia articles in the subject keywords. Then
map item to all categories that contain these articles
Example paths:
design → visual_arts → criticism
image_processing → digital_signal_processing → signal_processing
museology → museums → educational_organizations → organizations
+35 more
DBpedia ontology
● A formalised, shallow ontology, manually created
based on Wikipedia (with inference capability).
● Again use Wikipedia Miner to find Wikipedia articles
in the subject keywords of each item, and map the item to
the categories to which these articles belong.
Example paths:
musical_work → work
work
album → musicalwork → work
Automatic data-derived taxonomies
● We use two approaches to derive taxonomies
automatically from the Europeana data.
– LDA (Latent Dirichlet Allocation) topic modelling
– WikiFreq (Wikipedia Frequency hierarchy)
● Taxonomies fit data - no unnecessary nodes to
prune.
● Mapping from items to concept nodes is implicit
during derivation.
LDA topic modelling
● Latent Dirichlet Allocation (LDA) maps each
item to one or more topics.
● Each item is a distribution over topics, and each topic is
a distribution over words.
● Item-topic and topic-word distributions are
learned using collapsed Gibbs sampling.
● Has been used for improving results from IR.
● Previous work has developed hierarchical LDA,
but this is infeasible over our large data set.
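For readers unfamiliar with collapsed Gibbs sampling, the following is a small pure-Python sketch of how the item-topic and topic-word distributions can be estimated. It is an illustration only: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count and the toy items are all assumptions, and the slides do not say which LDA implementation or settings were used.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenised items.

    Returns (doc_topic, topic_word): one topic distribution per item and
    one word distribution (dict word -> probability) per topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    wid = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]       # item-topic counts
    nkw = [[0] * V for _ in range(n_topics)]   # topic-word counts
    nk = [0] * n_topics                        # tokens assigned to each topic
    z = []                                     # topic assignment of every token
    for d, doc in enumerate(docs):             # random initialisation
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][wid[w]] += 1
            nk[k] += 1
        z.append(zd)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = wid[w], z[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                # p(z = t | rest) ∝ (ndk + alpha) * (nkw + beta) / (nk + V*beta)
                weights = [(ndk[d][t] + alpha) * (nkw[t][v] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1
    doc_topic = [[(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha)
                  for t in range(n_topics)] for d, doc in enumerate(docs)]
    topic_word = [{vocab[v]: (nkw[t][v] + beta) / (nk[t] + V * beta) for v in range(V)}
                  for t in range(n_topics)]
    return doc_topic, topic_word

# Toy items (hypothetical), tokenised from title/subject-style metadata.
docs = [["design", "colour", "exhibition"], ["design", "industrial", "colour"],
        ["church", "view", "tower"], ["church", "view", "street"]]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
print(doc_topic)
```

Each row of `doc_topic` sums to 1 and gives one item's topic probabilities; each entry of `topic_word` is one topic's distribution over the vocabulary.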
Hierarchical LDA topics
● Run LDA over corpus to determine item-topic probabilities.
● Identify set of items for each topic. Each item assigned to
highest probability topic. Topic labelled with highest
probability word.
● If a topic has fewer than 60 items, stop. Otherwise return
to the first step, using the set of items identified for that
topic as the corpus.
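The recursive procedure above can be sketched as below. The topic model is passed in as a function so the recursion can be shown on its own; `toy_fit_topics` is a hypothetical stand-in for LDA, not a topic model, and the guard that stops when no split occurs is an added assumption to guarantee termination.

```python
from collections import Counter

def build_topic_tree(items, fit_topics, min_items=60):
    """Recursively split `items` (token lists) into a topic tree.

    fit_topics(items) must return (item_topic, topic_word):
      item_topic[i][k] = probability of topic k for item i
      topic_word[k]    = dict mapping word -> probability in topic k
    """
    node = {"items": items, "children": []}
    if len(items) < min_items:              # stopping rule from the slide
        return node
    item_topic, topic_word = fit_topics(items)
    groups = {}
    for item, probs in zip(items, item_topic):
        best = max(range(len(probs)), key=probs.__getitem__)
        groups.setdefault(best, []).append(item)   # highest-probability topic
    if len(groups) < 2:                     # extra guard (assumption): no split
        return node
    for k, members in sorted(groups.items()):
        label = max(topic_word[k], key=topic_word[k].get)  # top word as label
        child = build_topic_tree(members, fit_topics, min_items)
        node["children"].append((label, child))
    return node

def toy_fit_topics(items):
    """Hypothetical stand-in for LDA: two 'topics' centred on the two most
    frequent words of the corpus."""
    counts = Counter(w for item in items for w in item)
    centres = [w for w, _ in counts.most_common(2)]
    item_topic = []
    for item in items:
        hits = [item.count(c) + 0.5 for c in centres]   # smoothed match counts
        total = sum(hits)
        item_topic.append([h / total for h in hits])
    topic_word = [{c: 1.0} for c in centres]
    return item_topic, topic_word

items = [["brooch", "design"], ["bangle", "design"], ["design", "collection"],
         ["church", "view"], ["church", "view", "tower"], ["view", "church"]]
tree = build_topic_tree(items, toy_fit_topics, min_items=4)
for label, child in tree["children"]:
    print(label, len(child["items"]))
```

Children are stored as a list of `(label, subtree)` pairs rather than a dict because sibling or nested topics can share the same top word.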
Hierarchical LDA topics (example)
[Figure: example branch of the hierarchical LDA topic tree. Surviving node labels: bangle, brooch, design (repeated), collection.]
Wikipedia link frequencies
● Novel approach.
● Run Wikipedia Miner to find links in all Europeana
items – use title, subject and description.
● Find frequency counts for each link.
● For each item take the set of links found.
● Create taxonomy branch (if not already present)
with links in order of frequency (most frequent first).
● Map item to least frequent link.
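A minimal sketch of the construction steps above, assuming the links for each item have already been found (for example by Wikipedia Miner). Counting a link once per item and breaking frequency ties alphabetically are assumptions; the item ids and the link "colour" are hypothetical.

```python
from collections import Counter

def build_wikifreq(item_links):
    """Build a frequency-ordered taxonomy from item -> set of link names."""
    # Frequency of a link = number of items it was found in (assumption).
    freq = Counter(link for links in item_links.values() for link in set(links))
    root = {"children": {}, "items": []}
    for item, links in item_links.items():
        # Most frequent link first; ties broken alphabetically (assumption).
        path = sorted(set(links), key=lambda link: (-freq[link], link))
        node = root
        for link in path:
            # Create the branch node only if it is not already present.
            node = node["children"].setdefault(link, {"children": {}, "items": []})
        if path:
            node["items"].append(item)   # item sits at its least frequent link
    return root

# Hypothetical items and links, for illustration only.
item_links = {
    "item1": {"industrial design", "design council"},
    "item2": {"industrial design"},
    "item3": {"industrial design", "design council", "colour"},
}
tree = build_wikifreq(item_links)
top = tree["children"]["industrial design"]
print(top["items"])
print(top["children"]["design council"]["items"])
```

Because every item's path starts with its most frequent link, frequent links end up near the top of the tree and rare links at the leaves.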
Wikipedia link frequencies (cont.)
● Large number of concept nodes - limit to 24
children for each node.
● Require at least 2 links for each item - filter out
items with little metadata.
● Filter out concepts with fewer than 20 items.
Example branch: industrial design → design council
Statistics
Type       Taxonomy     Items    Nodes    Avg. parents  Avg. depth  Top nodes
Manual     LCSH         99259    285238   1.8           1.97        28901
Manual     DBpedia      178312   273      4.2           2           30
Manual     WikiTax      275359   121359   11.7          1.13        10417
Manual     WN domains   308687   170      7.1           7.1         6
Automatic  LDA topics   545896   22494    1             7.3         9
Automatic  Wiki Freq    66558    502      1             3.39        24
Evaluation - cohesion
● Intruder detection was originally proposed in (Chang et al.,
2009). A cohesive unit is defined as one in which the
items are similar while at the same time different from
items in other clusters.
● Present 5 items to each annotator: 4 from one concept
node, and an intruder item chosen randomly from elsewhere in
the taxonomy. The more cohesive the unit, the more
obvious the intruder will be.
● Crowd-sourcing: 111 annotators, 30 units from each
taxonomy. 1255 answers – average 7 annotators for
each unit.
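Constructing one evaluation unit as described above might look like the following sketch. The function name, the data layout and the item ids are hypothetical, and excluding items that also belong to the chosen node from the intruder candidates is an assumption.

```python
import random

def make_intruder_unit(node_items, node, rng):
    """Build one 5-item unit: 4 items from `node` plus 1 intruder.

    node_items maps a concept node to the list of item ids mapped to it.
    Returns (shuffled unit, intruder id).
    """
    own = node_items[node]
    members = rng.sample(own, 4)
    # Candidate intruders: items mapped elsewhere and not also in `node`.
    outside = sorted({item
                      for other, items in node_items.items() if other != node
                      for item in items if item not in own})
    intruder = rng.choice(outside)
    unit = members + [intruder]
    rng.shuffle(unit)           # hide the intruder's position from annotators
    return unit, intruder

# Hypothetical item ids, for illustration only.
node_items = {
    "industrial design": ["i1", "i2", "i3", "i4", "i5"],
    "coin": ["c1", "c2", "c3", "c4"],
}
unit, intruder = make_intruder_unit(node_items, "industrial design", random.Random(0))
print(unit, intruder)
```

Passing in a seeded `random.Random` keeps the sampled units reproducible across runs.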
Example of a cohesive unit
Evaluation - cohesion results
Type       Taxonomy       Cohesive units  Percentage
Manual     LCSH           19              63.3
Manual     DBpedia        17              56.7
Manual     Wiki Taxonomy  18              60.0
Manual     WN domains     15              50.0
Automatic  LDA topics     17              56.7
Automatic  Wiki Freq      29              96.7
Number of cohesive units (out of a possible 30)
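As a quick arithmetic check, the percentages follow directly from the unit counts, and the reported average of about 7 annotators per unit follows from 1255 answers over 6 taxonomies × 30 units. The variable names below are illustrative only.

```python
UNITS_PER_TAXONOMY = 30

# Cohesive-unit counts as reported in the results table.
cohesive = {
    "LCSH": 19, "DBpedia": 17, "Wiki Taxonomy": 18,
    "WN domains": 15, "LDA topics": 17, "Wiki Freq": 29,
}

# Percentage of cohesive units per taxonomy, to one decimal place.
percent = {name: round(100 * n / UNITS_PER_TAXONOMY, 1) for name, n in cohesive.items()}

# Average number of answers per unit across all six taxonomies.
total_units = UNITS_PER_TAXONOMY * len(cohesive)
answers_per_unit = 1255 / total_units

print(percent)
print(round(answers_per_unit, 2))
```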
Evaluation - relation classification
● Previous work has typically used a simple boolean
question: “Is it true that ChildNode is-a ParentNode?”
● We ask two questions for each child-parent pair A and
B:
– Are the concepts A and B related?
– If they are, is A more specific than B, less specific
than B, or neither?
● Crowd-sourcing: 173 annotators, 40 pairs from each
taxonomy, each pair evaluated on average 16 times.
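One way to turn the two-question responses into percentages is sketched below. The responses are hypothetical, and computing the second question's percentages only over annotators who answered "yes" to the first is an assumption based on the wording "If they are".

```python
from collections import Counter

def percentages(answers, categories):
    """Share of each category among `answers`, in percent (1 decimal)."""
    counts = Counter(answers)
    total = len(answers)
    return {c: round(100 * counts[c] / total, 1) for c in categories}

# Hypothetical responses for one child-parent pair: (related, specificity).
responses = (
    [("yes", "A<B")] * 6
    + [("yes", "neither")] * 2
    + [("no", None)]
    + [("dont_know", None)]
)

# Question 1: are A and B related?
q1 = percentages([r for r, _ in responses], ["yes", "no", "dont_know"])
# Question 2 is only asked when the annotator answered "yes" to question 1.
q2 = percentages([s for r, s in responses if r == "yes"],
                 ["A<B", "A>B", "neither", "dont_know"])
print(q1)
print(q2)
```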
Evaluation - example pairs
Taxonomy       Child (A)            Parent (B)
LCSH           Work                 Human Behaviour
LCSH           Braid                Weaving
DBpedia        Mountain Range       Place
DBpedia        Fern                 Plant
Wiki Taxonomy  Mammals of Africa    Wildlife of Africa
Wiki Taxonomy  Schools in Wiltshire Schools in England
WN domains     vehicles             transport
WN domains     mechanics            engineering
LDA topics     earthenware          dish
LDA topics     view                 church
Wiki Freq      Corrosion            Coin
Wiki Freq      Interior Design      Industrial Design
Are A and B related?
Taxonomy       Yes    No     Don't know
LCSH           74.2   8.8    17.0
DBpedia        86.6   11.2   2.2
Wiki Taxonomy  96.1   1.7    2.3
WN domains     77.1   14.5   8.4
LDA topics     30.3   50.3   19.3
Wiki Freq      47.6   16.5   35.8
Which is more specific?
Taxonomy       A<B    A>B    Neither  Don't know
LCSH           65.4   8.7    23.4     2.5
DBpedia        76.2   4.9    18.1     0.7
WikiTaxonomy   78.3   4.7    16.0     0.9
WN domains     63.6   6.3    28.0     2.0
LDA topics     21.4   14.8   62.1     1.6
Wiki Freq      30.9   22.6   43.6     2.9
Conclusions
● Wikipedia Taxonomy is conceptually well organised,
even better than LCSH, which has been widely used
for organising library collections.
● WikiFreq gives very high cohesion for items,
although the conceptual relations are not well
defined.
● Future work continues with different intrinsic and
user evaluations. We also aim to combine Wikipedia
Taxonomy and WikiFreq to get the best of both.
The End
Supported by the PATHS project (http://paths-project.eu), funded by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 270082. This research was also partially funded by the Ministry of Economy under grant TIN2009-14715-C04-01 (KNOW2 project).
Questions?