
Web Science & Technologies, University of Koblenz ▪ Landau, Germany

From Web 2.0 to Web 3.0 using Data Mining

Steffen Staab & Rabeeh Abbasi

IRMLeS 2010

Web 2.0 Folksonomies

Folksonomies facilitate sharing resources
• Pictures
• Bookmarks
• Publications

Users annotate resources with tags (keywords). Tags can be used for searching and browsing resources.

Example photos and their tags:
flickr.com/photos/digitalart/3543019869/ – hibiscus, flower
flickr.com/photos/fapeg/474807645/ – 1942, penny


Why do you want to have a Semantic Web at all?

4 billion photos! Many of them tagged.


Source: http://www.kullin.net/2009/10/flickr-reaches-4-billion-photos.html

Solid growth

Why do you want to have a Semantic Web at all?


Why I want to have a Semantic Web!


Why I want to have a Semantic Web!

Intelligent queries
Thesauri

Many photos are unsearchable!

Requirement: Know that “40ies” = “forties”


Why I want to have a Semantic Web!

Faceted Search & Browsing:

Semaplorer (btc.uni-koblenz.de)

Requirement: Know what a city, country, or landmark is


Why I want to have a Semantic Web!

Is there a quantifiable benefit in Ontology Learning?

Problems and Features of Folksonomies

[Diagram: a folksonomy together with low-level features and geotags (e.g. 51°30'03"N 0°7'24"W); users add tags, search resources, and browse resources.]

Discovering and Exploiting Semantics

[Diagram: the same folksonomy, low-level features, and geotags can be exploited to (1) improve search, (2) enable semantic browsing, and (3) recommend tags.]

1 – Improving Search in Folksonomies
(Can we improve search, particularly for sparse results?)


Search in Folksonomies

Billions of resources are stored in folksonomies.

But still many photos are unsearchable, because
• they have been tagged with other (semantically related) tags!


Sparseness in Folksonomies

Most of the pictures in Flickr have only a few tags
• This leads to extremely sparse data!

[Chart: Number of Resources (y-axis, 1,000,000 to 7,000,000) vs. Number of Tags per resource (x-axis: 1, 2, …, 10, >100).]


Simple Vector Space Model of a Folksonomy

Consider the tagging system as a Vector Space Model (VSM)
• Each resource is represented as a vector of tags

Resources:
R1 = flickr.com/photos/ethankillian/364295181/
R2 = flickr.com/photos/fapeg/474807645/
R3 = flickr.com/photos/digitalart/3543019869/
R4 = flickr.com/photos/sondyaustin/3576959698/

Tag        R1  R2  R3  R4
1942        1   1   0   0
cent        0   1   0   0
flower      0   0   1   0
forties     0   1   0   0
hibiscus    0   0   1   1
nature      0   0   0   1
penny       1   0   0   0
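A minimal sketch (assuming Python/numpy, not the authors' code; R1 to R4 abbreviate the four Flickr URLs above) of how such a tag × resource matrix can be built from the tag assignments:

```python
# Minimal sketch: build the tag x resource matrix of the simple VSM from
# (resource, tag) assignments of the four example photos above.
import numpy as np

assignments = {
    "R1": ["1942", "penny"],            # flickr.com/photos/ethankillian/364295181/
    "R2": ["1942", "cent", "forties"],  # flickr.com/photos/fapeg/474807645/
    "R3": ["hibiscus", "flower"],       # flickr.com/photos/digitalart/3543019869/
    "R4": ["hibiscus", "nature"],       # flickr.com/photos/sondyaustin/3576959698/
}

tags = sorted({t for ts in assignments.values() for t in ts})
resources = sorted(assignments)
V = np.zeros((len(tags), len(resources)))
for j, r in enumerate(resources):
    for t in assignments[r]:
        V[tags.index(t), j] = 1.0

print(tags)  # ['1942', 'cent', 'flower', 'forties', 'hibiscus', 'nature', 'penny']
print(V)     # matches the 7 x 4 matrix above
```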


Discovering Semantically Related Tags

Enrich the existing vector space model using semantically related tags
• Dimensions of semantically related tags
  – Context of tags (Resource / User)
  – Distribution of tags (Similar / Generalized)

[Diagram: the two dimensions of semantically related tags – Tag Context (Resource / User) and Tag Distribution (Similar / Generalized) – yield three combinations: SRC (Resource Context & Similar Tags), SRG (Resource Context & Generalized Tags), and SUC (User Context & Similar Tags).]


Resource vs. User Context

Resource Context
• Narrow context, based on the resources associated with a tag

User Context
• Broader context, based on the vocabulary of a user

Resource context (tag × resource matrix, R1–R4 as above):

Tag        R1  R2  R3  R4
1942        1   1   0   0
cent        0   1   0   0
flower      0   0   1   0
forties     0   1   0   0
hibiscus    0   0   1   1
nature      0   0   0   1
penny       1   0   0   0

User context (tag × user matrix; the column labels User 1 and User 2 are reconstructed, each aggregating the tags of one user's resources):

Tag        User 1  User 2
1942            2       0
cent            1       0
flower          0       1
forties         1       0
hibiscus        0       2
nature          0       1
penny           1       0


Semantically Related Tags – Tag Distribution

Tag Distribution
• Similar Tags (Symmetric Similarity)
  – Using Cosine Similarity:

            1942  cent  flower  forties  hibiscus  nature  penny
1942        1     0.71  0       0.71     0         0       0.71
cent        0.71  1     0       1        0         0       0
flower      0     0     1       0        0.71      0       0
forties     0.71  1     0       1        0         0       0
hibiscus    0     0     0.71    0        1         0.71    0
nature      0     0     0       0        0.71      1       0
penny       0.71  0     0       0        0         0       1
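A minimal sketch, reusing V and tags from the sketch above: the symmetric tag similarity is the cosine between the tag rows of the tag × resource matrix, which reproduces values such as 0.71 for (1942, cent).

```python
# Minimal sketch, continuing from the V and tags defined above: symmetric
# tag-tag similarity as the cosine between tag rows of the tag-resource matrix.
import numpy as np

norms = np.linalg.norm(V, axis=1, keepdims=True)
S = (V / norms) @ (V / norms).T   # 7 x 7 symmetric similarity matrix
print(round(S[tags.index("1942"), tags.index("cent")], 2))  # 0.71
```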


Semantically Related Tags – Tag Distribution

Tag Distribution
• Generalized Tags (Asymmetric Similarity)
  – Using a Modified Overlap Coefficient
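The exact modified overlap coefficient is not reproduced on this slide. As an illustrative (and therefore only assumed) asymmetric variant, one can divide the co-occurrence count by the frequency of only one of the two tags, so that a specific tag points to its more general tag but not vice versa:

```python
# Illustrative sketch only -- the exact "modified overlap coefficient" is not
# reproduced on this slide. One common asymmetric variant scores how strongly
# tag a points to a (more general) tag b by |R(a) & R(b)| / |R(a)|.
def generalization_score(resources_a, resources_b):
    """resources_a, resources_b: sets of resources tagged with a and b."""
    if not resources_a:
        return 0.0
    return len(resources_a & resources_b) / len(resources_a)

# Toy data: every "hibiscus" photo is also tagged "flower", but not vice versa.
hibiscus = {"r3", "r4"}
flower = {"r3", "r4", "r5", "r6"}
print(generalization_score(hibiscus, flower))  # 1.0  hibiscus -> flower (more general)
print(generalization_score(flower, hibiscus))  # 0.5  flower -> hibiscus
```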


enRiched Vector Space Models for Folksonomies

Use semantically related tags to enrich the existing vector space model: combine the original tag × resource matrix (Original Vector Space) with the tag–tag similarity matrix shown above to obtain the Enriched Vector Space.

Enriched vector space (tag × resource, R1–R4 as above):

Tag        R1    R2    R3    R4
1942       1.71  2.41  0     0
cent       0.71  2.71  0     0
flower     0     0     1.71  0.71
forties    0.71  2.71  0     0
hibiscus   0     0     1.71  1.71
nature     0     0     0.71  1.71
penny      1.71  0.71  0     0
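A minimal sketch, reusing S and V from the sketches above: the enriched matrix on this slide equals the product of the tag–tag similarity matrix with the original tag × resource matrix (e.g. 1 + 0.71 + 0.71 = 2.41 for tag 1942 on R2).

```python
# Minimal sketch, continuing from S and V above: enrich the vector space by
# propagating each resource's tag weights to semantically related tags.
enriched = S @ V
print(enriched.round(2))   # matches the enriched matrix above
```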


Examples – Context and Distribution of Tags

Examples of semantically related tags (scores in parentheses); each source tag is listed with its related tags under the three methods:

• brick – Similar/Res.: wall(0.11), building(0.13), red(0.11); Generalized/Res.: wall(0.19); Similar/User: wall(0.37), fence(0.36), window(0.36), rust(0.35), flower(0.27), nature(0.12)
• pub – Similar/Res.: crawl(0.17); Generalized/Res.: beer(0.11), bar(0.11); Similar/User: beer(0.25), bar(0.24), sign(0.22)
• style – Similar/Res.: crave(0.14), beauty(0.11), fashion(0.11); Generalized/Res.: fashion(0.26); Similar/User: fashion(0.16), hair(0.13), woman(0.12), man(0.12)
• bromelia – Similar/Res.: airplant(0.32), bromeliad(0.17), tillandsia(0.15), strelitzia(0.15); Generalized/Res.: bromeliad(0.35), tillandsia(0.30); Similar/User: lirio(0.18), tibouchina(0.15), soneca(0.15), london(0.22), arian(0.13), persians(0.10)


Evaluation

Dataset
• Images uploaded to Flickr in 2004/05
• Tags used by at least 10 different users
• Dataset statistics
  – Photos: ~27 million
  – Tags: ~92 thousand
  – Users: ~0.3 million
  – Tag assignments: ~94 million
• Queries
  – Subset of the AOL query log


Evaluation

User-based evaluation
• done by 18 users

150 queries split into 3 sets of 50 queries each, having exact matches:
– 1 to 10 (sparse results)
– 11 to 50 (normal)
– more than 50 (too many results)

Search results were presented to the users for the different vector space models. Precision @ 5, 10, 15, 20 was calculated.
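For reference, a small illustrative helper (not the authors' code) for precision at k, as used for Precision @ 5, 10, 15, 20:

```python
# Illustrative helper: precision at k for one ranked result list.
def precision_at_k(ranked_results, relevant, k):
    """ranked_results: list of resource ids; relevant: set of relevant ids."""
    return sum(1 for r in ranked_results[:k] if r in relevant) / k

print(precision_at_k(["r1", "r2", "r3", "r4", "r5"], {"r1", "r4"}, 5))  # 0.4
```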


Results – Sparse Results

[Chart: Precision at 20 (y-axis, 0.36–0.66) for queries with 1 to 10 relevant resources (rare queries), comparing the Simple, Res./Similar, Res./Gen., User/Similar, Semi Random, and Random vector space models.]

Significant improvement for sparse results.


Results

[Chart: Precision at 20 (y-axis, 0.60–0.70) for queries with 11 to 50 relevant resources, comparing the Simple, Res./Similar, Res./Gen., User/Similar, Semi Random, and Random vector space models.]


Results – Large Result Set

[Chart: Precision at 20 (y-axis, 0.65–0.75) for queries with more than 50 relevant resources, comparing the Simple, Res./Similar, Res./Gen., User/Similar, Semi Random, and Random vector space models.]

2a – Browsing (landmarks) in Folksonomies


One prominent photo returned by Flickr when searching for „Koblenz“ » restroom

Is this what you expect when searching for a place?


Usage context: Faceted Browsing

• Semaplorer (btc.uni-koblenz.de)
  – Handles billions of triples
  – Lacks semantics for individual photos
  – Cannot identify landmark photos


Finding Landmark Photos

Which photos are (non-)representative for a given place?


Building a Classifier

For a classifier we need
• Features of the photos
  – Tags
• Positive and negative training examples
  – Exploit Flickr groups

flickr.com/photos/swamibu/2223726960/, flickr.com/photos/gunner66/2219882643/, flickr.com/photos/mromega/2346732045/, flickr.com/photos/me_haridas/399049455/, flickr.com/photos/caribb/84655830/, flickr.com/photos/conner395/1018557535/, flickr.com/photos/66164549@N00/2508246015/, flickr.com/photos/kupkup/1356487670/, flickr.com/photos/asam/432194779/, flickr.com/photos/michaelfoleyphotography/392504158/

[Diagram: positive and negative Flickr groups (+VE / -VE) provide positive and negative training examples; photo features are normalized and fed to an SVM classifier, yielding the classification model.]
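A minimal sketch (assuming scikit-learn; the tag lists and group contents are illustrative, not the authors' data) of training such an SVM on bag-of-tags features, with positive and negative Flickr groups providing the labels:

```python
# Minimal sketch: landmark vs. non-landmark classifier on bag-of-tags features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

positive_photos = [["koblenz", "deutsches", "eck", "landmark"],
                   ["paris", "eiffel", "tower", "architecture"]]   # from positive groups
negative_photos = [["party", "friends", "beer"],
                   ["restroom", "sign", "funny"]]                  # from negative groups

docs = [" ".join(tags) for tags in positive_photos + negative_photos]
labels = [1] * len(positive_photos) + [0] * len(negative_photos)

model = make_pipeline(CountVectorizer(binary=True), LinearSVC())   # binary bag of tags + linear SVM
model.fit(docs, labels)

# Classify a new photo by its tags.
print(model.predict([" ".join(["koblenz", "castle", "rhine"])]))
```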


User Evaluation

– Datasets: training 430,282 photos, test 232,265 photos
– Comparison with the state-of-the-art World Explorer
– 20 users evaluated both methods for 50 different cities
– Improvement of 12% in precision for our method

[Chart: Precision per user (y-axis, 0.00–0.90) for each of the 20 evaluators (x-axis), comparing World Explorer and our method.]


2b – Semantic Browsing in Folksonomies


Social Tagging Systems – Browsing?

I want to "browse" vehicle images! How can I do it?
• Can I do it using a tag cloud?

Perhaps I need to structure the tags and resources! How can I do it?
• Put them into categories (like Vehicles, People, etc.)!
  – Do it manually or with training?
    » Might not be possible on a large scale!
  – Automatically and without any training!
    » Using T-ORG!


T-ORG – Classification

Organize resources by putting their tags into categories depending upon their context. Users can browse the categories to retrieve the required resources.

[Diagram: photos from different users (User A, User B) and Flickr groups (Group 1, Group 2), tagged e.g. "President, Gerald, Ford, Nixon, Pardon", "Eiffel, Eiffel tower, Big, Eyeful, Paris, France, Miniatures", or "Singen, Cars, Motors, Ford, 1955"; their tags are classified into the categories Person, Location, and Vehicle.]


T-ORG

Tag Organization using T-ORG
1. Select ontologies related to the categories (e.g. Vehicle, People, etc.)
2. Prune and refine these ontologies according to the desired categories (add missing concepts, filter existing concepts)
3. Apply the classification algorithm T-KNOW to classify the tags and resources
4. Browse the categories to explore the tags and resources


Classifying the tags using T-KNOW

1. Use well-known linguistic patterns to generate queries
2. Search these patterns on Google and download the search results
3. Compare each Google search result with the context of the tag and extract the concept
4. Select the concept which has the highest similarity with the context of the tag
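A small illustrative sketch of step 1 (the concrete patterns used by T-KNOW are not listed on this slide, so the patterns below are assumptions): building Hearst-style phrase queries that pair a tag with candidate concepts before sending them to a web search engine.

```python
# Illustrative sketch: generate pattern-based queries for a tag and candidate concepts.
def pattern_queries(tag, candidate_concepts):
    patterns = ['"{concept} such as {tag}"',
                '"{tag} is a {concept}"',
                '"{concept} like {tag}"']
    return [p.format(tag=tag, concept=c) for c in candidate_concepts for p in patterns]

print(pattern_queries("ford", ["vehicle", "person", "organization"]))
```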


T-KNOW – Computing Similarity

Compute similarity using cosine measure between Bag of Words (BOW) representation of “Tag Context” and “Search Result”

Tag context: singen, cars, motors, ford, 1955

BOW vector ĉ (tag context): 1955=1, as=0, cars=1, ford=1, foundation=0, international=0, motors=1, organizations=0, singen=1, such=0
BOW vector â (search result): 1955=0, as=1, cars=0, ford=1, foundation=2, international=1, motors=0, organizations=1, singen=0, such=1

cos(ĉ, â) = (ĉ · â) / (|ĉ| |â|) = 0.15

Only consider the results having a similarity above a certain threshold. The result with the highest similarity is taken as final.
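A minimal sketch (assuming numpy) of the cosine similarity between the two BOW vectors above; it reproduces the value 0.15:

```python
# Minimal sketch: cosine similarity between the tag-context and search-result BOW vectors.
import numpy as np

# vocabulary: 1955, as, cars, ford, foundation, international, motors, organizations, singen, such
c = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0], dtype=float)  # tag context
a = np.array([0, 1, 0, 1, 2, 1, 0, 1, 0, 1], dtype=float)  # search result

cos = (c @ a) / (np.linalg.norm(c) * np.linalg.norm(a))
print(round(cos, 2))  # 0.15
```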


Experimental Setup

4+1 categories, 932 concepts:
• Person: Author, Singer, Human, …
• Location: Country, District, City, Village, …
• Vehicle: Vehicle, Car, Truck, Motorbike, Train, …
• Organization: Company, Organization, Firm, Foundation, …
• Other

189 random images from 9 Flickr groups, 1754 tags


Experimental Setup – Classifiers

Two human classifiers, K (gold standard) and S, plus the automatic classifier T-KNOW


Experimental Setup – Evaluation

F-Measure
• A = set of correct classifications by the test classifier (user S or T-KNOW)
• B = set of all classifications by the gold standard (user K)
• C = set of all classifications by the test classifier

Precision = A / C
Recall = A / B
F-Measure = 2 · Precision · Recall / (Precision + Recall)


Results – F-Measure

[Chart: F-Measure (y-axis, 0.51–0.76) vs. similarity threshold (x-axis, 0.00–0.30) for Tag Context, Resource Context, User Context, and Group Context, compared against human classifier User S; annotated values: 51% and 79%.]

Results are not far away from human classification.


3 – Tag Recommendations in Folksonomies: Simplifying the annotation process


Tag Recommendation

Image: http://www.flickr.com/photos/64255998@N00/283890351/ (47° 4′ 32″ N 15° 26′ 12″ E)
Tags: Clocktower, Uhrturm, Graz, Austria, Europe, Herbst, Autumn, Night, Nacht


Tag Recommendation

[Diagram: the image http://www.flickr.com/photos/64255998@N00/283890351/ (47° 4′ 32″ N 15° 26′ 12″ E) is the input to the Tag Recommender; the output is the tag list Clocktower, Uhrturm, Graz, Austria, Europe, Herbst, Autumn, Night, Nacht.]


Tag Recommendation

[Diagram: possible inputs to the Tag Recommender – tags, geographical coordinates (47° 4′ 32″ N 15° 26′ 12″ E), and low-level features; the output is the recommended tag list as above.]


Tag Recommendation

Which image features to use?

[Diagram: as before – tags, geographical coordinates, and low-level features as input to the Tag Recommender.]


Features Available in Large Datasets

The CoPhIR dataset contains
• 106 million images from Flickr with
  – Title, location, tags, comments, etc.
  – Five standard MPEG-7 features


Tag Recommender – System Overview

[Diagram: the training data is partitioned by location (Location 1 … Location N); features are extracted and clustered per location (Location 1 clusters … Location N clusters), yielding the model.]


Tag Recommender – System Overview

[Diagram: at recommendation time, the input image's features and geographical coordinates (47° 4′ 32″ N 15° 26′ 12″ E) are matched against the per-location cluster model; the output is the recommended tag list Clocktower, Graz, Austria, Europe, Autumn, Night, Uhrturm, Nacht, Herbst, HDR, Photomatix.]


Clustering

Used Standard K-Means clustering
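A minimal sketch (assuming scikit-learn; the feature matrix is a random placeholder for the MPEG-7 descriptors of one location's photos) of the k-means step:

```python
# Minimal sketch: standard k-means clustering of one location's photo features.
import numpy as np
from sklearn.cluster import KMeans

features = np.random.rand(1000, 80)   # one row per photo, e.g. concatenated MPEG-7 features

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(features)
cluster_of_photo = kmeans.labels_     # cluster index per photo
centroids = kmeans.cluster_centers_   # reused below for nearest-centroid classification
```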


Identifying Representative Tags from Clusters

Given a cluster containing homogeneous images
• Find the most representative tags
  – Tag rank ~ frequency of users

Cluster tags (number of distinct users in parentheses): Clocktower (4), Graz (4), Austria (3), Europe (3), Autumn (2), Night (2), Uhrturm (1), Nacht (1), Herbst (1), HDR (1), Photomatix (1)

[Diagram: the cluster contains photos from User 1, User 2, User 3, and User 4.]
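A minimal sketch (with made-up cluster data) of ranking a cluster's tags by the number of distinct users that assigned them, as in the counts above:

```python
# Minimal sketch: rank a cluster's tags by the number of distinct users per tag.
from collections import defaultdict

cluster_tags_by_user = {                 # hypothetical: user -> tags used in this cluster
    "user1": {"clocktower", "graz", "austria", "night"},
    "user2": {"clocktower", "graz", "europe", "autumn", "hdr"},
    "user3": {"clocktower", "graz", "austria", "europe", "uhrturm"},
    "user4": {"clocktower", "graz", "austria", "europe", "autumn", "night"},
}

user_count = defaultdict(int)
for tags in cluster_tags_by_user.values():
    for tag in tags:
        user_count[tag] += 1

# Sort by user frequency (descending), breaking ties alphabetically.
representative = sorted(user_count, key=lambda t: (-user_count[t], t))
print(representative[:4])  # ['clocktower', 'graz', 'austria', 'europe']
```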


Classification

Map a new image to one of the existing clusters
• Find the nearest cluster centroid
• Assign the representative tags of the mapped cluster to the image

[Diagram: new image → nearest centroid → representative tags.]
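A minimal sketch (assuming numpy; centroids come from the k-means sketch above, and representative_tags_per_cluster is a hypothetical list of ranked tag lists, one per cluster) of the nearest-centroid tag assignment:

```python
# Minimal sketch: recommend tags for a new image via its nearest cluster centroid.
import numpy as np

def recommend_tags(image_features, centroids, representative_tags_per_cluster, k=5):
    distances = np.linalg.norm(centroids - image_features, axis=1)  # distance to each centroid
    nearest = int(np.argmin(distances))
    return representative_tags_per_cluster[nearest][:k]
```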


Evaluation

Evaluated on tags of ~400,000 Flickr images
• 75% training data
• 25% test data / gold standard
  – Evaluation based on the correctly identified tags of the test data
• Data is randomly split based on users (not resources)

Ignored
• the 10 most frequent tags for each location (city), and
• tags used by fewer than three users

Calculated
• Precision
• Recall
• F-Measure


Micro Precision

Micro precision by number of tags recommended:

Tags recommended           1       2       3       4       5       10      15      20
Geo coordinates (Geo)      0.1385  0.1008  0.0867  0.0783  0.0686  0.0478  0.0382  0.0331
Low-level features (EHD)   0.0502  0.0441  0.0412  0.0388  0.0364  0.0302  0.0264  0.0236
Tags                       0.0451  0.0382  0.0326  0.0295  0.0274  0.0203  0.0170  0.0150
Baseline (Random)          0.0338  0.0302  0.0284  0.0282  0.0275  0.0237  0.0211  0.0193


Micro Recall

Micro recall by number of tags recommended:

Tags recommended           1       2       3       4       5       10      15      20
Geo coordinates (Geo)      0.0515  0.0749  0.0966  0.1163  0.1273  0.1749  0.2071  0.2357
Low-level features (EHD)   0.0187  0.0328  0.0460  0.0577  0.0677  0.1120  0.1469  0.1748
Tags                       0.0168  0.0279  0.0351  0.0417  0.0478  0.0668  0.0795  0.0882
Baseline (Random)          0.0126  0.0225  0.0317  0.0419  0.0511  0.0880  0.1174  0.1432


Micro F-Measure

Micro F-measure by number of tags recommended:

Tags recommended           1       2       3       4       5       10      15      20
Geo coordinates (Geo)      0.0751  0.0859  0.0914  0.0936  0.0892  0.0750  0.0645  0.0581
Low-level features (EHD)   0.0272  0.0376  0.0435  0.0464  0.0474  0.0475  0.0448  0.0417
Tags                       0.0244  0.0323  0.0338  0.0346  0.0349  0.0311  0.0281  0.0256
Baseline (Random)          0.0183  0.0258  0.0300  0.0337  0.0357  0.0373  0.0358  0.0341


Coverage

Number of different tags correctly suggested


Summary

We can discover semantics in Web 2.0 and exploit them for
• Improved search
• Semantic browsing
• Tag recommendation


Lessons Learned

There is a quantifiable value in ontology learning techniques

Improvement of user experience is about semantics
• A little bit of semantics goes a long way!
  – Who? What? When? Where?

Representativity / Prototypicality
  ← Do not ignore!
  ← Not represented in the SemWeb!


Most interesting result for „Koblenz“:

A castle half an hour away from Koblenz.

Much work is left to be done…


More details in:

Abbasi, Staab: RichVSM: enRiched Vector Space Models for Folksonomies. (ACM Hypertext 2009)

Abbasi, Grzegorzek, Staab: Large Scale Tag Recommendation Using Different Image Representations. (SAMT 2009)

Abbasi, Chernov, Nejdl, Paiu, Staab: Exploiting Flickr Tags and Groups for Finding Landmark Photos. (ECIR 2009)

Abbasi, Staab, Cimiano: Organizing Resources on Tagging Systems using T-ORG. (ESWC 2007)


Thank you for your attention!