Constructing Folksonomies from User-Specified Relations on Flickr
Anon Plangprasopchok and
Kristina Lerman
2
Motivation
[Figure: users produce, consume, annotate, organize, and discover Web content. Annotation/metadata and hierarchical classification support five tasks: organize, search, recommend, leverage, categorize.]
3
Motivation
Goal: to induce category knowledge from social annotation produced by many users
• Metadata from an individual user may be too inaccurate and incomplete…
• Metadata from different users may complement each other, becoming meaningful in combination.
4
Folksonomy
• Original definition: classification emerging from the use of tags by users (Thomas Vander Wal)
• In this work: hidden classification hierarchies from annotations created by many users
5
Hierarchical Relations in Social Web
• Appear implicitly
  – Tags: Insect, Grasshopper, Australian, Macro, Orthoptera
• Appear explicitly
  – Folder (collection) / Sub folder (set) relations
Goal: to induce deeper hierarchies from this metadata
7
Inducing Hierarchy from Tags
Existing approaches
• Graph-based (Mika05)
  – builds a network of associated tags (node = tag, edge = co-occurrence of tags)
  – suggests applying betweenness centrality and set theory to determine broader/narrower relations
• Hierarchical clustering (Brooks06; Heymann06+)
  – tags that appear more frequently have higher centrality and are thus treated as more abstract
• Probabilistic subsumption (Sanderson99+, Schmitz06)
  – x is broader than y if x subsumes y
  – x subsumes y if p(x|y) > t and p(y|x) < t
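The subsumption rule above can be sketched in a few lines. This is a minimal illustration, not the cited authors' implementation: the function name `subsumption_relations`, the toy photo list, and the threshold value `t=0.8` are assumptions, and p(x|y) is estimated simply as the fraction of items tagged y that are also tagged x.

```python
from collections import Counter
from itertools import combinations


def subsumption_relations(tagged_items, t=0.8):
    """Return sorted (x, y) pairs where x is judged broader than y.

    The threshold t is an assumed value; the slides do not state it.
    """
    tag_count = Counter()   # number of items carrying each tag
    pair_count = Counter()  # number of items carrying both tags
    for tags in tagged_items:
        unique = set(tags)
        tag_count.update(unique)
        for x, y in combinations(sorted(unique), 2):
            pair_count[(x, y)] += 1
            pair_count[(y, x)] += 1

    relations = []
    for (x, y), both in pair_count.items():
        p_x_given_y = both / tag_count[y]
        p_y_given_x = both / tag_count[x]
        # x subsumes y if p(x|y) > t and p(y|x) < t
        if p_x_given_y > t and p_y_given_x < t:
            relations.append((x, y))
    return sorted(relations)


# Hypothetical toy data: each inner list is the tag set of one photo.
photos = [
    ["insect", "grasshopper"],
    ["insect", "grasshopper"],
    ["insect", "moth"],
    ["insect", "moth"],
    ["insect"],
]
print(subsumption_relations(photos))
```

Because the rule relies only on co-occurrence frequencies, a frequent tag tends to come out as "broader" whether or not it is semantically more general.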
8
Inducing Hierarchy from Tags
• Some difficulties when using tags to induce hierarchy:
  – Washington → United States
  – Car → Automobile
  – Insect → Hongkong
  – Color → Brazilian
• Causes: Specificity / Rarity; tags are from different facets*
Above relations induced using subsumption approach on tags [Sanderson99+, Schmitz06]
Notation: A → B (A is broader than B; hypernym relation)
9
Inducing Hierarchy from user-specified relations
• User-specified relations, e.g.,
  – Flickr’s Collection-Set
  – Delicious’ Bundle-Tag
  – Bibsonomy’s Relation-Tag
• Key intuition: few people specify peculiar relations like
  – “automobile” → “car”, or
  – “Washington” → “United States”
10
Simple Strategy
[Figure: a collection named “The Netherlands - Holanda” containing a set named “Blijdorp - Rotterdam” is tokenized and stemmed into concept relations among the terms netherland, holanda, blijdorp, and rotterdam. Relations from other users (involving terms such as countri, netherland, holland, china, …) are then combined into a common hierarchy.]
0. Tokenize + stem collection and set names into concept relations
1. Remove “noisy” relations
  – Conflict resolution
  – Significance test
2. Link concepts & select path
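The tokenize-and-stem step can be sketched as follows. This is a rough illustration under stated assumptions: `toy_stem` is a crude stand-in for a real stemmer (it only strips a trailing "s"), the stopword list is hypothetical, and pairing every collection term with every set term is an assumed reading of how relations are formed.

```python
import re
from itertools import product

# Hypothetical stopword list; the slides do not specify one.
STOPWORDS = {"the", "a", "an", "of", "and"}


def toy_stem(token):
    """Crude stand-in for a real stemmer: strip one trailing 's'."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def to_terms(name):
    """Lowercase, split on non-letters, drop stopwords, stem."""
    tokens = re.findall(r"[^\W\d_]+", name.lower())
    return [toy_stem(tok) for tok in tokens if tok not in STOPWORDS]


def concept_relations(collection_name, set_name):
    """Assumed pairing: every collection term is a parent of every set term."""
    return sorted(product(to_terms(collection_name), to_terms(set_name)))


for parent, child in concept_relations("The Netherlands - Holanda",
                                       "Blijdorp - Rotterdam"):
    print(parent, "->", child)
```

A production version would likely swap `toy_stem` for an established stemmer, which is what produces stems such as "countri" in the slide.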
11
Remove noisy relations: 1st approach
• Conflict resolution (when both a → b and b → a appear)
  – Relation conflicts occur because of noise
  – Voting scheme: keep a → b (and discard b → a) if Nu(a → b) > 1 and Nu(a → b) > Nu(b → a)
  – Example: insect → butterfly (10) vs. butterfly → insect (2)
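The voting scheme can be written down directly. In this sketch the function name `resolve_conflicts` and the dictionary layout are assumptions; the counts play the role of Nu, and the third entry in the toy data is hypothetical.

```python
def resolve_conflicts(counts):
    """Keep a -> b only if Nu(a->b) > 1 and Nu(a->b) > Nu(b->a).

    counts maps (a, b) to the number of votes for the relation a -> b.
    """
    kept = {}
    for (a, b), n_ab in counts.items():
        n_ba = counts.get((b, a), 0)
        if n_ab > 1 and n_ab > n_ba:
            kept[(a, b)] = n_ab
    return kept


counts = {
    ("insect", "butterfly"): 10,
    ("butterfly", "insect"): 2,
    ("anim", "zoo"): 1,  # hypothetical relation seen only once
}
print(resolve_conflicts(counts))
```

Note that ties are dropped in both directions, since neither count strictly exceeds the other.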
12
Remove noisy relations: 2nd approach
• Significance test
  – Use a statistical significance test to decide whether a → b is significant
  – Null hypothesis: the observed relation a → b was generated by chance, via the random, independent generation of the individual concepts a and b (according to the binomial distribution)
[Figure: distribution of the number of observations of a → b, with accept and reject regions. Is “b” narrower than “a” by chance?]
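One way to make the test concrete is a one-sided binomial tail. The slides do not give the exact formulation, so everything here is an assumption: the chance probability is taken as p(a) · p(b), the p-value is P(X ≥ observed) over all observed relations, and the significance level `alpha=0.05` and function names are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of the lower tail."""
    lower = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return min(1.0, max(0.0, 1.0 - lower))


def is_significant(n_ab, n_a, n_b, n_total, alpha=0.05):
    """Reject the 'generated by chance' hypothesis for a -> b?

    n_ab:    times a -> b was observed
    n_a:     times a appeared as a parent
    n_b:     times b appeared as a child
    n_total: total number of observed relations
    """
    # Assumed null model: a and b are drawn independently.
    p_chance = (n_a / n_total) * (n_b / n_total)
    return binomial_tail(n_ab, n_total, p_chance) < alpha


print(is_significant(n_ab=10, n_a=30, n_b=30, n_total=1000))
print(is_significant(n_ab=1, n_a=30, n_b=30, n_total=1000))
```

With these toy counts about 0.9 co-occurrences are expected by chance, so ten observations are rejected as chance while a single one is not.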
13
Link concepts and select path
• Link concepts: assume that the same terms refer to the same concept.
  – e.g., anim → bug + anim → insect = anim → {bug, insect}
• Select path: linking relations from many users can cause a spaghetti graph
[Figure: graph over anim (a), bug (b), insect (i), moth (m) with edge weights a → b = 26, a → i = 72, b → i = 1, b → m = 4, i → m = 18, a → m = 10.]
4 possible paths from anim to moth:
  1) a → b → i → m [BN score = min(26, 1, 18) = 1]
  2) a → i → m [BN score = min(72, 18) = 18]
  3) a → m [BN score = min(10) = 10]
  4) a → b → m [BN score = min(26, 4) = 4]
Network bottleneck idea: “the flow bottleneck is a minimum flow capacity among all relations in the path”
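The bottleneck scores can be reproduced with a small path enumeration. This sketch assumes the path with the largest bottleneck score is the one kept, and it reads the edge weights off the BN scores listed above; the function names and the exhaustive search are illustrative choices, not the authors' implementation.

```python
def simple_paths(graph, src, dst, path=None):
    """Yield every cycle-free path from src to dst."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in graph.get(src, {}):
        if nxt not in path:
            yield from simple_paths(graph, nxt, dst, path)


def bottleneck(graph, path):
    """Smallest edge weight along the path."""
    return min(graph[a][b] for a, b in zip(path, path[1:]))


def best_path(graph, src, dst):
    """Path whose weakest relation is strongest (assumed selection rule)."""
    return max(simple_paths(graph, src, dst),
               key=lambda p: bottleneck(graph, p))


# Edge weights inferred from the BN scores in the slide.
graph = {
    "anim": {"bug": 26, "insect": 72, "moth": 10},
    "bug": {"insect": 1, "moth": 4},
    "insect": {"moth": 18},
}

for p in simple_paths(graph, "anim", "moth"):
    print(" -> ".join(p), "BN =", bottleneck(graph, p))
print("selected:", best_path(graph, "anim", "moth"))
```

Exhaustive enumeration is fine for a toy graph; on a large relation graph a widest-path variant of Dijkstra's algorithm would be the more practical choice.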
14
Evaluation & Data Set
Contribution #2: Learning Concept Hierarchies
• Hypothesis: an approach that takes explicit relations into account can induce better hierarchies.
• “Better” means more consistent with hand-built hierarchies (ODP ver. 10/08)
• The baseline is the subsumption approach [Schmitz06]. Collection and set terms are used instead of tags, to make the two approaches comparable.
Data Set: data from 17 user groups devoted to wildlife and naturalist photography
21,792 of 39,922 users specify at least one collection
110,543 unique terms (cf. 166,153 unique terms in ODP), 15,495 terms in common.
15
Evaluation methodology
ODP has many sub-hierarchies: comparing all of them to the induced ones is impractical!
It’s easier to compare when specifying a “root concept” and “leaf concepts”, i.e., specifying a certain subtree to compare.
[Figure: relations (right after tokenization) → induce (remove noise + link) → induced hierarchy, compared against the reference hierarchy (ODP).]
16
Metrics
• Taxonomic Overlap [adapted from Maedche02+]
  – measures structural similarity between two trees
  – for each node, determines how many ancestor and descendant nodes overlap with those in the reference tree
• Lexical Recall
  – measures how well an approach discovers concepts that exist in the reference hierarchy (coverage)
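Both metrics can be sketched on trees stored as child → parent dictionaries. The slides only say the overlap measure is adapted from Maedche02+, so the details here are assumptions: each node's ancestor-and-descendant set is compared with a Jaccard ratio, the scores are averaged over nodes present in both trees, and a node with no ancestors or descendants in either tree scores zero. The toy trees are hypothetical.

```python
def nodes(parent):
    """All concepts in a tree given as {child: parent}."""
    return set(parent) | set(parent.values())


def ancestors(parent, node):
    found = set()
    while node in parent:
        node = parent[node]
        found.add(node)
    return found


def cotopy(parent, node):
    """Ancestors and descendants of node (the node itself excluded)."""
    below = {n for n in parent if node in ancestors(parent, n)}
    return ancestors(parent, node) | below


def lexical_recall(induced, reference):
    """Share of reference concepts found in the induced tree."""
    return len(nodes(induced) & nodes(reference)) / len(nodes(reference))


def taxonomic_overlap(induced, reference):
    """Mean Jaccard overlap of cotopies over shared nodes (assumed form)."""
    shared = nodes(induced) & nodes(reference)
    scores = []
    for node in shared:
        a, b = cotopy(induced, node), cotopy(reference, node)
        scores.append(len(a & b) / len(a | b) if a | b else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical trees: the induced one skips the "mammal" level.
reference = {"mammal": "animal", "rodent": "mammal", "rat": "rodent"}
induced = {"rodent": "animal", "rat": "rodent"}

print("lexical recall:", lexical_recall(induced, reference))
print("taxonomic overlap:", taxonomic_overlap(induced, reference))
```

Under this definition an induced tree that contains only the root has no shared ancestors or descendants and therefore scores zero.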
18
Quantitative Results
• 32 root nodes were selected manually
• Taxonomic Overlap:
  – 27 of them are better than those by subsumption
  – 3 of them get a zero score in both approaches
• Lexical Recall:
  – 28 of them are better than those by subsumption
  – 2 of them get similar scores in both approaches
  – for the rest, subsumption induces only the root node
• The proposed approach can induce deeper trees
The proposed approach can induce hierarchies more consistent with ODP in almost all cases.
22
Discussion
• Simple strategy to aggregate a large number of shallow relations specified by different users into a common, deeper hierarchy
• Induced hierarchies are more consistent with ODP
• Future work includes:
  – Term ambiguity
  – Relation types
  – Global path
  – Apply to other datasets
23
Related Work
• Learning concept hierarchy from text data
  – Syntactic based [Hearst92, Caraballo99, Pasca04, Cimiano+05, Snow+06]
  – Word clustering [e.g. Segal+02, Blei+03]
• Induce concept hierarchy from tags
  – Graph-based & clustering based [Mika05, Brooks+06, Heymann+06, Zhou07+]
  – Probabilistic subsumption [Schmitz06]
• Ontology alignment [Udrea+07]
• Exploit user-specified hierarchy
  – GiveALink [Markines06+]
24
Questions?
• Is the metric used in evaluation meaningful?
• How scalable is the system?
• WordNet and ODP already exist. Why do we need this system?
• How is this work related to ontology enrichment?
• Is it ethical to collect users’ data?
– ….?
26
Open Problems
• Term ambiguity
  – The current approach assumes that the same term refers to the same concept … but
    “Victoria” can mean: Canada, Lotus, person name, Australia
  – It also has no explicit way to merge synonyms, e.g., Spain / España
    (There are also many acronyms & colloquial terms in the Social Web)
A possible solution: concept clustering
27
Open Problems
• Inducing “related-to” relations
  – “Flora” and “Fauna”, “Pet” and “Family”
  – Prepositions or some connectors may give some clues, e.g., “flora & fauna” and “Pets – Family”
  – Tag distributions may also help
[Figure: Nature with children Flora and Fauna, shown twice.]
28
Open Problems
• True parent selection
  – Tokenizing collection/set names can cause another problem
[Figure: “Flora & Fauna” → Insect is split into Fauna → Insect and Flora → Insect.]
A possible solution: conditional probability ratio
29
Conclusion
• Proposed statistical approaches for inducing concepts and for inducing concept hierarchies from social annotation
• These approaches perform better than existing approaches
• Ongoing work aims to improve the quality of the induced hierarchies:
  – Resolve term ambiguity
  – Induce “related to” relations
  – Select the right parent
  – Evaluate on more data sets
30
Social Web
Adapted from The Social Web: an Information Revolution (courtesy of Kristina Lerman)
[Figure: users produce, consume, annotate, organize, and discover content.]
spare
31
Social Web
Delicious: 5.3 million users; over 180 million unique URLs [blog.delicious.com, 2009]
Flickr: 2 billion photos [techcrunch.com, 2007]; 4000+ photos uploaded per minute (1/21/2009 morning)
3 Basic Entities Involved: (1) User (2) Content (3) Metadata
[Figure: users and content linked by three activities: produce; consume; annotate & organize.]
32
Motivation
Social annotation is potentially a good source of evidence for inducing category knowledge, which is useful in many applications, e.g.,
• Organizing: arranging/visualizing users’ content (e.g., semantic directory)
• Search/Discovery: especially for binary content like photos and videos, where social annotation functions as a semantic index
• Recommendation: learning users’ tastes/interests
• Leveraging knowledge bases: updating lexical systems and ontologies for semantic web applications
• Categorization: understanding how new content fits with existing content
spare
33
Motivation
Although metadata from an individual user may be too inaccurate and incomplete, those from different users may complement each other, making them meaningful for the tasks.
Goal: to induce category knowledge from social annotation produced by many users
35
[Figure: processing pipeline in three stages.]
• Data pre-processing: Flickr Collection/Set relations → user-specified relations
• Hierarchy construction: Conflict Resolution, Significance Test, Subsumption → relation weighting & linking
• Evaluation:
  – Find ODP root-leaf pairs that overlap w/ Flickr (Flickr-ODP root-leaf overlaps)
  – <root, leaf>, e.g., <anim, rat>
  – <root, leaf, odp path>, e.g., Animal/Mammal/Rodent/Rat
  – Compute Taxonomic Overlap, Lexical Recall