Web Science & TechnologiesUniversity of Koblenz ▪ Landau, Germany
From Web 2.0 to Web 3.0 using Data Mining
Steffen Staab & Rabeeh Abbasi
Steffen [email protected]
IRMLeS 20102 of 58
WeST
Folksonomies facilitate sharing resources:
• Pictures
• Bookmarks
• Publications
Users annotate resources with tags (keywords). Tags can be used for searching and browsing resources.
Web 2.0 Folksonomies
flickr.com/photos/digitalart/3543019869/
flickr.com/photos/fapeg/474807645/
Tags: hibiscus, flower
Tags: 1942, penny
Why do you want to have a Semantic Web at all?
4 billion photos! Many of them tagged.
Source: http://www.kullin.net/2009/10/flickr-reaches-4-billion-photos.html
Solid growth
Why do you want to have a Semantic Web at all?
Why I want to have a Semantic Web!
Intelligent queries require thesauri.
Many photos are unsearchable!
Requirement: Know that “40ies” = “forties”
Why I want to have a Semantic Web!
Faceted Search & Browsing:
Semaplorer (btc.uni-koblenz.de)
Requirement: Know what is a city, country, landmark
Why I want to have a Semantic Web!
Is there a quantifiable benefit in Ontology Learning?
Problems and Features of Folksonomies

[Diagram: a folksonomy connects resources with low-level features, geotags (e.g. 51°30'03"N 0°7'24"W), and tags. Today, users add tags, search resources, and browse resources; mining the folksonomy additionally allows us to recommend tags, improve search, and enable semantic browsing.]
Discovering and Exploiting Semantics
Search in Folksonomies
Billions of resources in folksonomies.
But still many photos are unsearchable, because they have been tagged with other (semantically related) tags!
Sparseness in Folksonomies
Most of the pictures in Flickr have only a few tags
• Leads to extremely sparse data!
[Chart: histogram of the number of resources (y-axis, up to ~7 million) over the number of tags per resource (x-axis: 1–10, >100); resources with very few tags dominate.]
Simple Vector Space Model of a Folksonomy
Consider the tagging system as a Vector Space Model (VSM): each resource is represented as a vector of tags.
Resources:
R1 = flickr.com/photos/ethankillian/364295181/
R2 = flickr.com/photos/fapeg/474807645/
R3 = flickr.com/photos/digitalart/3543019869/
R4 = flickr.com/photos/sondyaustin/3576959698/

          R1  R2  R3  R4
1942       1   1   0   0
cent       0   1   0   0
flower     0   0   1   0
forties    0   1   0   0
hibiscus   0   0   1   1
nature     0   0   0   1
penny      1   0   0   0
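The tag-resource matrix above can be built directly from raw tag assignments; a minimal Python sketch (resource keys abbreviate the Flickr URLs shown on the slide):

```python
# Build the tag-resource vector space model (VSM) from raw tag assignments.
# Resource keys abbreviate the Flickr URLs shown on the slide.
tag_assignments = {
    "ethankillian/364295181": {"1942", "penny"},
    "fapeg/474807645":        {"1942", "cent", "forties"},
    "digitalart/3543019869":  {"flower", "hibiscus"},
    "sondyaustin/3576959698": {"hibiscus", "nature"},
}

tags = sorted({t for ts in tag_assignments.values() for t in ts})
resources = list(tag_assignments)

# vsm[i][j] = 1 iff tag i is assigned to resource j
vsm = [[int(tag in tag_assignments[r]) for r in resources] for tag in tags]

for tag, row in zip(tags, vsm):
    print(f"{tag:9s} {row}")
```

Each column of `vsm` is the tag vector of one resource, matching the matrix on the slide.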
Discovering Semantically Related Tags

Enrich the existing vector space model using semantically related tags.
• Dimensions of semantically related tags:
– Context of tags (Resource / User)
– Distribution of tags (Similar / Generalized)

[Diagram: crossing tag distribution (Similar vs. Generalized) with tag context (Resource vs. User) yields the combinations used in the following: Resource Context & Similar Tags, Resource Context & Generalized Tags, and User Context & Similar Tags.]
Resource vs. User Context
Resource Context
• Narrow context, based on the resources associated to a tag

          R1  R2  R3  R4
1942       1   1   0   0
cent       0   1   0   0
flower     0   0   1   0
forties    0   1   0   0
hibiscus   0   0   1   1
nature     0   0   0   1
penny      1   0   0   0

User Context
• Broader context, based on the vocabulary of a user

          U1  U2
1942       2   0
cent       1   0
flower     0   1
forties    1   0
hibiscus   0   2
nature     0   1
penny      1   0

(R1 = ethankillian/364295181, R2 = fapeg/474807645, R3 = digitalart/3543019869, R4 = sondyaustin/3576959698)
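The user-context matrix can be derived from the resource-context matrix by summing each user's resource columns. A sketch, assuming the ownership implied by the counts on the slide (one user uploaded the two coin photos, the other the two plant photos):

```python
# Aggregate a tag-resource matrix into a tag-user matrix:
# a user's column counts how often that user applied each tag.
tags = ["1942", "cent", "flower", "forties", "hibiscus", "nature", "penny"]
resource_vectors = {             # columns of the resource-context VSM
    "R1": [1, 0, 0, 0, 0, 0, 1],    # 1942, penny
    "R2": [1, 1, 0, 1, 0, 0, 0],    # 1942, cent, forties
    "R3": [0, 0, 1, 0, 1, 0, 0],    # flower, hibiscus
    "R4": [0, 0, 0, 0, 1, 1, 0],    # hibiscus, nature
}
# Assumed ownership: which user uploaded which resource.
owner = {"R1": "user1", "R2": "user1", "R3": "user2", "R4": "user2"}

user_vectors = {}
for res, vec in resource_vectors.items():
    acc = user_vectors.setdefault(owner[res], [0] * len(tags))
    for i, v in enumerate(vec):
        acc[i] += v

print(user_vectors)
```

With this ownership, user1's column is (2, 1, 0, 1, 0, 0, 1) and user2's is (0, 0, 1, 0, 2, 1, 0), reproducing the user-context matrix on the slide.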
Semantically Related Tags – Tag Distribution
Tag Distribution• Similar Tags (Symmetric Similarity)
– Using Cosine Similarity:
          1942  cent  flower  forties  hibiscus  nature  penny
1942      1.00  0.71  0       0.71     0         0       0.71
cent      0.71  1.00  0       1.00     0         0       0
flower    0     0     1.00    0        0.71      0       0
forties   0.71  1.00  0       1.00     0         0       0
hibiscus  0     0     0.71    0        1.00      0.71    0
nature    0     0     0       0        0.71      1.00    0
penny     0.71  0     0       0        0         0       1.00
Semantically Related Tags – Tag Distribution
Tag Distribution• Generalized Tags (Asymmetric Similarity)
– Using Modified Overlap Coefficient
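The transcript names a Modified Overlap Coefficient but does not give the formula, so the sketch below uses a plain overlap-style asymmetric score as a stand-in; the paper's actual modification may differ. The toy tag-to-resource sets are hypothetical:

```python
def overlap(res_t1, res_t2):
    """Asymmetric overlap-style score: fraction of t1's resources also tagged t2.
    NOTE: stand-in for the slide's 'Modified Overlap Coefficient'; the exact
    modification is not given in the transcript.
    """
    return len(res_t1 & res_t2) / len(res_t1) if res_t1 else 0.0

resources_of = {                 # tag -> set of resources carrying it (toy data)
    "flower":   {"R3", "R5", "R6"},
    "hibiscus": {"R3", "R5"},
}
# 'flower' generalizes 'hibiscus': every hibiscus photo is also a flower photo,
# but not vice versa, hence the asymmetry of the measure.
print(overlap(resources_of["hibiscus"], resources_of["flower"]))  # 1.0
print(overlap(resources_of["flower"], resources_of["hibiscus"]))
```

The asymmetry is what lets the measure distinguish a general tag (flower) from a specific one (hibiscus).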
enRiched Vector Space Models for Folksonomies
Use semantically related tags to enrich the existing vector space model: the enriched vector space is obtained by multiplying the tag-tag similarity matrix with the original tag-resource matrix.

Original vector space (tags × resources R1–R4):

          R1  R2  R3  R4
1942       1   1   0   0
cent       0   1   0   0
flower     0   0   1   0
forties    0   1   0   0
hibiscus   0   0   1   1
nature     0   0   0   1
penny      1   0   0   0

Enriched vector space:

          R1    R2    R3    R4
1942      1.71  2.41  0     0
cent      0.71  2.71  0     0
flower    0     0     1.71  0.71
forties   0.71  2.71  0     0
hibiscus  0     0     1.71  1.71
nature    0     0     0.71  1.71
penny     1.71  0.71  0     0
Examples – Context and Distribution of Tags
Tag: brick
• Similar – Res.: brick (1.00), wall (0.11), building (0.13), red (0.11)
• Generalized – Res.: brick (1.00), wall (0.19)
• Similar – User: brick (1.00), wall (0.37), fence (0.36), window (0.36), rust (0.35)

Tag: pub
• Similar – Res.: pub (1.00), crawl (0.17)
• Generalized – Res.: pub (1.00), beer (0.11), bar (0.11)
• Similar – User: pub (1.00), beer (0.25), bar (0.24), sign (0.22), london (0.22)

Tag: style
• Similar – Res.: style (1.00), crave (0.14), beauty (0.11), fashion (0.11)
• Generalized – Res.: style (1.00), fashion (0.26)
• Similar – User: style (1.00), fashion (0.16), arian (0.13), hair (0.13), woman (0.12), man (0.12), persians (0.10)

Tag: bromelia
• Similar – Res.: bromelia (1.00), airplant (0.32), bromeliad (0.17), tillandsia (0.15), strelitzia (0.15)
• Generalized – Res.: bromelia (1.00), bromeliad (0.35), tillandsia (0.30), flower (0.27), nature (0.12)
• Similar – User: bromelia (1.00), lirio (0.18), tibouchina (0.15), soneca (0.15)
Evaluation
Dataset
• Images uploaded to Flickr between 2004 and 2005
• Tags used by at least 10 different users
• Dataset statistics:
– Photos: ~27 million
– Tags: ~92 thousand
– Users: ~0.3 million
– Tag assignments: ~94 million
• Queries: a subset of the AOL query log
Evaluation
User-based evaluation
• done by 18 users
• 150 queries, split into 3 sets of 50 queries each by number of exact matches:
– 1 to 10 (sparse results)
– 11 to 50 (normal)
– more than 50 (too many results)
Search results were presented to the users for the different vector space models.
Calculated precision @ 5, 10, 15, 20.
Results – Sparse Results

[Chart: precision at 20 for queries with 1 to 10 relevant resources (rare queries), comparing the Simple, Res./Similar, Res./Gen., User/Similar, Semi-Random, and Random vector space models; precision ranges roughly from 0.36 to 0.66.]

Significant improvement for sparse results.
Results
[Chart: precision at 20 for queries with 11 to 50 relevant resources, same models; precision ranges roughly from 0.6 to 0.7.]
Results – Large Result Set
[Chart: precision at 20 for queries with more than 50 relevant resources, same models; precision ranges roughly from 0.65 to 0.75.]
One prominent photo returned by Flickr when searching for "Koblenz":
» restroom
Is this what you expect when searching for a place?
Usage context: Faceted Browsing
• Semaplorer (btc.uni-koblenz.de)
– Handles billions of triples
– Lacks semantics for individual photos
– Cannot identify landmark photos
Finding Landmark Photos
Which photos are (non-)representative for a given place?
Building a Classifier
For a classifier we need
• Features of the photos
– Tags
• Positive and negative training examples
– Exploit Flickr groups
flickr.com/photos/swamibu/2223726960/, flickr.com/photos/gunner66/2219882643/, flickr.com/photos/mromega/2346732045/, flickr.com/photos/me_haridas/399049455/, flickr.com/photos/caribb/84655830/, flickr.com/photos/conner395/1018557535/, flickr.com/photos/66164549@N00/2508246015/, flickr.com/photos/kupkup/1356487670/, flickr.com/photos/asam/432194779/, flickr.com/photos/michaelfoleyphotography/392504158/
[Diagram: positive and negative Flickr groups yield positive (+VE) and negative (-VE) training examples; after normalization, an SVM is trained on them to produce the classification model.]
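The training step above can be sketched end to end. Since the slide does not fix an SVM implementation, the following uses a Pegasos-style sub-gradient linear SVM on toy bag-of-tags vectors; the feature names and examples are hypothetical:

```python
import numpy as np

# Toy bag-of-tags features: columns = [landmark, architecture, party, indoor].
# Positive examples come from (hypothetical) landmark Flickr groups,
# negative examples from non-landmark groups.
X = np.array([
    [1, 1, 0, 0],   # +VE: landmark photo
    [1, 0, 0, 0],   # +VE
    [0, 0, 1, 1],   # -VE: party/indoor photo
    [0, 0, 0, 1],   # -VE
], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)

# Pegasos-style sub-gradient training of a linear SVM (hinge loss, L2 reg.).
rng = np.random.default_rng(0)
w, lam = np.zeros(X.shape[1]), 0.01
for t in range(1, 2001):
    i = rng.integers(len(X))
    eta = 1.0 / (lam * t)
    if y[i] * (w @ X[i]) < 1:            # margin violated -> hinge gradient
        w = (1 - eta * lam) * w + eta * y[i] * X[i]
    else:                                # only the regularizer shrinks w
        w = (1 - eta * lam) * w

predictions = np.sign(X @ w)
print(predictions)  # should recover the training labels on separable data
```

In the slide's pipeline, `X` would hold normalized tag features of photos drawn from the positive and negative Flickr groups.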
User Evaluation
– Datasets: 430,282 training photos, 232,265 test photos
– Comparison with the state-of-the-art World Explorer
– 20 users evaluated both methods for 50 different cities
– Improvement of 12% in precision for our method
[Chart: precision per user (20 users) for World Explorer vs. our method; precision values range roughly from 0.3 to 0.9, with our method ahead on average.]
Social Tagging Systems – Browsing?
I want to "browse" vehicle images! How can I do it?
• Can I do it using a tag cloud?
Perhaps I need to structure the tags and resources! How can I do it?
• Put them into categories (like Vehicles, People, etc.)!
– Manually or with training?
» Might not be possible on a large scale!
– Automatically and without any training!
» Using T-ORG!
T-ORG – Classification

Organize resources by putting their tags into categories depending upon their context. Users can browse the categories to retrieve the required resources.

Example tags: President, Gerald, Ford, Nixon, Pardon
[Diagram: tags from different users and groups, e.g. "Eiffel, Eiffel tower, Big, Eyeful, Paris, France, Miniatures" (Group 1) and "Singen, Cars, Motors, Ford, 1955" (Group 2), are classified into the categories Person, Location, and Vehicle.]
T-ORG
Tag Organization using T-ORG
1. Select ontologies related to the categories (e.g. Vehicle, People, etc.)
2. Prune and refine these ontologies according to the desired categories (add missing concepts, filter existing concepts)
3. Apply the classification algorithm T-KNOW to classify the tags and resources
4. Browse the categories to explore the tags and resources
Classifying the tags using T-KNOW
1. Use well-known linguistic patterns to generate queries
2. Search these patterns on Google and download the search results
3. Compare each Google search result with the context of the tag and extract the concept
4. Select the concept which has the highest similarity with the context of the tag
T-KNOW – Computing Similarity
Compute similarity using the cosine measure between Bag of Words (BOW) representations of the tag context and the search result.

Tag context: singen, cars, motors, ford, 1955

BOW vectors (ĉ = tag context, â = search result):

word           ĉ  â
1942           1  0
as             0  1
cars           1  0
ford           1  1
foundation     0  2
international  0  1
motors         1  0
organizations  0  1
singen         1  0
such           0  1

cos(ĉ, â) = ĉ · â / (|ĉ| |â|) = 0.15

Only consider the results having a similarity above a certain threshold.
The result having the highest similarity is considered final.
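Plugging the two BOW vectors from the slide into the cosine formula reproduces the 0.15:

```python
import math

# Bag-of-words counts from the slide.
tag_context   = {"1942": 1, "cars": 1, "ford": 1, "motors": 1, "singen": 1}
search_result = {"as": 1, "ford": 1, "foundation": 2, "international": 1,
                 "organizations": 1, "such": 1}

# Align both bags over a shared vocabulary.
vocab = sorted(set(tag_context) | set(search_result))
c = [tag_context.get(w, 0) for w in vocab]
a = [search_result.get(w, 0) for w in vocab]

dot = sum(x * y for x, y in zip(c, a))   # only "ford" overlaps -> dot = 1
cos = dot / (math.sqrt(sum(x * x for x in c)) * math.sqrt(sum(x * x for x in a)))
print(round(cos, 2))  # 0.15
```

The single shared word ("ford") against norms of √5 and 3 gives 1 / (3·√5) ≈ 0.15.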
Experimental Setup
4+1 categories, 932 concepts:
• Person: Author, Singer, Human, …
• Location: Country, District, City, Village, …
• Vehicle: Vehicle, Car, Truck, Motorbike, Train, …
• Organization: Company, Organization, Firm, Foundation, …
• Other

189 random images from 9 Flickr groups, 1754 tags.
Experimental Setup – Classifiers
Two human classifiers: K (gold standard) and S; automatic classifier: T-KNOW
Experimental Setup – Evaluation
F-Measure
A = set of correct classifications by the test classifier (user S or T-KNOW)
B = set of all classifications by the gold standard (user K)
C = set of all classifications by the test classifier
Precision = |A| / |C|
Recall = |A| / |B|
F-Measure = 2 · Precision · Recall / (Precision + Recall)
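The three measures as a small Python function (the counts in the usage line are made up for illustration):

```python
def evaluate(correct, gold_total, test_total):
    """Precision, recall, and F-measure as defined on the slide.
    correct    = |A| : correct classifications by the test classifier
    gold_total = |B| : all classifications by the gold standard
    test_total = |C| : all classifications by the test classifier
    """
    precision = correct / test_total
    recall = correct / gold_total
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# e.g. 60 correct out of 80 test classifications; the gold standard made 100.
p, r, f = evaluate(60, 100, 80)
print(round(p, 2), round(r, 2), round(f, 3))  # 0.75 0.6 0.667
```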
Results – F-Measure
[Chart: F-measure (0.51–0.76) over the similarity threshold (0.00–0.30) for Tag, Resource, User, and Group context, compared against human classifier S; F-measures lie between 51% and 79%.]

Results are not far away from human classification.
Tag Recommendations in Folksonomies
Simplifying the annotation process
Tag Recommendation

http://www.flickr.com/photos/64255998@N00/283890351/

[Diagram: an image together with its geographical coordinates (47° 4′ 32″ N 15° 26′ 12″ E) is the input to the tag recommender; the output is a set of tags: Clocktower, Uhrturm, Graz, Austria, Europe, Herbst, Autumn, Night, Nacht.]

Possible inputs: tags, geographical coordinates, low-level image features.
Which image features to use?
Features Available in Large Datasets
The CoPhIR dataset contains
• 106 million images from Flickr with
– Title, location, tags, comments, etc.
– Five standard MPEG-7 features
Tag Recommender – System Overview

[Diagram: training data is split by location (Location 1 … Location N); the image features of each location are clustered (Location 1 clusters … Location N clusters) to build the model. At recommendation time, the image features and geographical coordinates (47° 4′ 32″ N 15° 26′ 12″ E) of a new image are the input; the output is the recommended tags: Clocktower, Graz, Austria, Europe, Autumn, Night, Uhrturm, Nacht, Herbst, HDR, Photomatix.]
Identifying Representative Tags from Clusters

Given a cluster containing homogeneous images:
• Find the most representative tags
– Tag rank ~ frequency of users

Cluster tags (number of users): Clocktower (4), Graz (4), Austria (3), Europe (3), Autumn (2), Night (2), Uhrturm (1), Nacht (1), Herbst (1), HDR (1), Photomatix (1)
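Ranking by the number of distinct users who applied a tag can be sketched with a Counter; the per-user tag lists below are chosen to reproduce the counts on the slide:

```python
from collections import Counter

# Tags applied by each of the four users to photos in one homogeneous cluster
# (hypothetical assignment consistent with the slide's counts).
user_tags = {
    "user1": ["Clocktower", "Graz", "Austria", "Europe", "Autumn", "Night"],
    "user2": ["Clocktower", "Graz", "Austria", "Uhrturm", "Nacht", "Herbst"],
    "user3": ["Clocktower", "Graz", "Austria", "Europe", "HDR"],
    "user4": ["Clocktower", "Graz", "Europe", "Autumn", "Night", "Photomatix"],
}

# Tag rank ~ number of distinct users who used the tag.
counts = Counter(tag for tags in user_tags.values() for tag in set(tags))
representative = [t for t, _ in counts.most_common(4)]
print(counts.most_common(4))
```

Counting distinct users rather than raw occurrences keeps one heavy tagger from dominating the ranking.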
Classification
Map a new image to one of the existing clusters:
• Find the nearest cluster centroid
• Assign the representative tags of the mapped cluster to the image

[Diagram: a new image is mapped to the nearest centroid and receives that cluster's representative tags.]
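A minimal nearest-centroid mapping, with hypothetical centroids and cluster tag lists:

```python
import numpy as np

# Cluster centroids in (toy, 2-D) feature space, one tag list per cluster.
centroids = np.array([
    [0.9, 0.1],    # cluster 0
    [0.2, 0.8],    # cluster 1
])
cluster_tags = [
    ["Clocktower", "Graz", "Austria"],   # representative tags of cluster 0
    ["Harbour", "Night"],                # representative tags of cluster 1
]

def recommend(features):
    """Map a new image to the nearest centroid and return that cluster's tags."""
    dists = np.linalg.norm(centroids - features, axis=1)
    return cluster_tags[int(np.argmin(dists))]

print(recommend(np.array([0.8, 0.2])))  # ['Clocktower', 'Graz', 'Austria']
```

In the full system the features would be the image's MPEG-7 descriptors or geo-coordinates, and the centroids would come from the per-location clustering step.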
Evaluation
Evaluated on the tags of ~400,000 Flickr images
• 75% training data
• 25% test data / gold standard
– Evaluation based on the correctly identified tags of the test data
• Data is randomly split based on users (not resources)
Ignored:
• the 10 most frequent tags for each location (city), and
• tags used by fewer than three users
Calculated: precision, recall, F-measure
Micro Precision
Micro precision by number of recommended tags:

k (tags)   1       2       3       4       5       10      15      20
Geo        0.1385  0.1008  0.0867  0.0783  0.0686  0.0478  0.0382  0.0331
EHD        0.0502  0.0441  0.0412  0.0388  0.0364  0.0302  0.0264  0.0236
Tags       0.0451  0.0382  0.0326  0.0295  0.0274  0.0203  0.0170  0.0150
Random     0.0338  0.0302  0.0284  0.0282  0.0275  0.0237  0.0211  0.0193

(Geo = geo coordinates, EHD = low-level features, Random = baseline)
Micro Recall
Micro recall by number of recommended tags:

k (tags)   1       2       3       4       5       10      15      20
Geo        0.0515  0.0749  0.0966  0.1163  0.1273  0.1749  0.2071  0.2357
EHD        0.0187  0.0328  0.0460  0.0577  0.0677  0.1120  0.1469  0.1748
Tags       0.0168  0.0279  0.0351  0.0417  0.0478  0.0668  0.0795  0.0882
Random     0.0126  0.0225  0.0317  0.0419  0.0511  0.0880  0.1174  0.1432

(Geo = geo coordinates, EHD = low-level features, Random = baseline)
Micro F-Measure
Micro F-measure by number of recommended tags:

k (tags)   1       2       3       4       5       10      15      20
Geo        0.0751  0.0859  0.0914  0.0936  0.0892  0.0750  0.0645  0.0581
EHD        0.0272  0.0376  0.0435  0.0464  0.0474  0.0475  0.0448  0.0417
Tags       0.0244  0.0323  0.0338  0.0346  0.0349  0.0311  0.0281  0.0256
Random     0.0183  0.0258  0.0300  0.0337  0.0357  0.0373  0.0358  0.0341

(Geo = geo coordinates, EHD = low-level features, Random = baseline)
Coverage
Number of different tags correctly suggested
Summary
We can discover semantics in Web 2.0 and exploit them for
• improved search
• semantic browsing
• tag recommendation
Lessons Learned
There is a quantifiable value in ontology learning techniques.
Improvement of the user experience is about semantics.
• A little bit of semantics goes a long way!
– Who? What? When? Where?
Representativity / prototypicality:
• Do not ignore it!
• Not represented in the Semantic Web!
Most interesting result for "Koblenz":
A castle half an hour away from Koblenz.
Much work left to be done…
More details in:
Abbasi, Staab: RichVSM: enRiched Vector Space Models for Folksonomies. (ACM Hypertext 2009)
Abbasi, Grzegorzek, Staab: Large Scale Tag Recommendation Using Different Image Representations. (SAMT 2009)
Abbasi, Chernov, Nejdl, Paiu, Staab: Exploiting Flickr Tags and Groups for Finding Landmark Photos. (ECIR 2009)
Abbasi, Staab, Cimiano: Organizing Resources on Tagging Systems using T-ORG. (ESWC 2007)