Towards Data-driven Estimation of Image Tag Relevance using Visually Similar and Dissimilar Folksonomy Images

Sihyoung Lee 1, Wesley De Neve 1,2, Yong Man Ro 1

1 Image and Video Systems Lab, Korea Advanced Institute of Science and Technology (KAIST), Yuseong-gu, Daejeon, South Korea
2 Multimedia Lab, Ghent University – IBBT, Ghent, Belgium

{ijiat, wesley.deneve, ymro}@kaist.ac.kr

ABSTRACT 

Given that the presence of non-relevant tags in an image folksonomy hampers the effective organization and retrieval of images, this paper discusses a novel technique for estimating the relevance of user-supplied tags with respect to the content of a seed image. Specifically, this paper proposes to compute the relevance of image tags by making use of both visually similar and dissimilar images. That way, compared to tag relevance estimation only using visually similar images, the difference in tag relevance between tags that are relevant and tags that are irrelevant with respect to the content of a seed image can be increased at a limited increase in computational cost, thus making it more straightforward to distinguish between them. The latter is confirmed through experimentation with subsets of MIRFLICKR-25000 and MIRFLICKR-1M, showing that tag relevance estimation using both visually similar and dissimilar images allows achieving more effective image tag refinement and tag-based image retrieval than tag relevance estimation only using visually similar images.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Information Retrieval – Information filtering

General Terms

Algorithms, Measurement, Experimentation

Keywords 

Image folksonomies, image retrieval, image tag refinement, socially-aware image understanding, tag relevance estimation

1. INTRODUCTION

Thanks to easy-to-use multimedia devices, the availability of cheap storage and bandwidth, and the growing number of people going online, the number of images shared online is increasing at a high rate. For example, as of August 2011, Flickr was known to host more than 6 billion images, with over 2,000 new images uploaded every minute [1]. Similarly, more than 250 million images are uploaded to Facebook each day [2]. These numbers make clear that a strong need exists for techniques that allow organizing and retrieving images in an effective way.

Present-day multimedia applications organize and retrieve images by means of user-defined tags. These freely chosen textual descriptors, which facilitate an intuitive understanding of the image content, allow reusing text-based search techniques. The result of user-driven tagging of images is known as an image folksonomy. Image folksonomies can be seen as unstructured collections of collective knowledge, taking the form of user-supplied images and tags.

As pointed out in [3], among other publications, an image folksonomy suffers from two major issues: 1) the presence of weakly annotated images (due to the time-consuming nature of tagging) and 2) the presence of tags that are not relevant with respect to the content of the images they were assigned to (for reasons ranging from subjective interpretation of the image content [4] to batch tagging [5]). The first issue can be addressed by techniques for image tag recommendation [6], while the second issue can be addressed by techniques for image tag relevance estimation. The design and evaluation of the latter is the focus of this paper. In what follows, we discuss a number of efforts that are representative of this active area of research.

The authors of [7] take advantage of WordNet [8] to measure the semantic correlation among image tags. Strongly correlated tags are considered to be relevant to the content of a given seed image, while weakly correlated tags are considered to be irrelevant. The authors of [9] find reliable textual descriptors by mining the tags assigned by photographers to images and by seeking inter-subject agreement for pairs of images that are judged to be highly similar, assuming that the expertise and reliability of photographers are higher than the expertise and reliability of random human annotators. The authors of [10] discuss a scheme that aims at automatically ranking the tags assigned to a given seed image. To that end, initial tag relevance scores are computed by means of probability density estimation. These relevance scores are subsequently refined by performing a random walk over a tag similarity graph. The authors of [11] propose the use of neighbor voting to estimate the relevance of tag assignments, calculating the relevance of a tag with respect to the content of a seed image by accumulating votes for the tag from the visual neighbors of the seed image. The authors of [12] compare the effectiveness of several variants of the neighbor voting algorithm outlined in [11], finding that weighted voting by means of visual similarity is most effective. Finally, the authors of [13] propose to estimate the relevance of image tags by means of information about both the image-image relation (visual similarity) and the tag-tag relation (tag co-occurrence statistics) in image folksonomies.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAM'12, October 29, 2012, Nara, Japan.
Copyright 2012 ACM 978-1-4503-1586-9/12/10 ... $15.00.


In this paper, we present a novel data-driven technique for estimating the relevance of user-defined tag assignments, making use of both images that are visually similar to the seed image and images that are visually dissimilar to the seed image. The latter sets us apart from previous work in the field, such as the representative research efforts discussed above. Our rationale behind the complementary use of visually dissimilar images is twofold: 1) compared to techniques that only make use of visually similar images, the complementary use of visually dissimilar images allows taking into account a larger amount of the collective knowledge present in an image folksonomy and 2) given a seed image, visual search can be more effectively used to find semantically dissimilar images than semantically similar images (cf. the observation of [14] that the probability of having semantically dissimilar images in a set of visually dissimilar images is higher than the probability of having semantically similar images in a set of visually similar images).

Compared to estimating tag relevance only using visually similar images, the proposed technique is able to increase the difference in tag relevance between tags that are relevant and tags that are irrelevant with respect to the content of a seed image. This is demonstrated by means of a simple yet elegant mathematical formalization on the one hand, and through experimentation with MIRFLICKR-25000 and MIRFLICKR-1M on the other hand. Note that [15] recently introduced a classification technique that replaces expert-labeled training images with training images collected from an image folksonomy, leveraging a bootstrapping approach that iteratively selects the most misclassified negative images in order to enhance the accuracy of visual concept classifiers. However, while [15] aims at using (misclassified) negative images for the purpose of improving the effectiveness of visual concept classifiers, we aim at using visually dissimilar images for the purpose of better estimating the relevance of user-defined tag assignments.

This paper is organized as follows. In Section 2, we introduce the proposed technique for data-driven tag relevance estimation. In Section 3, we discuss experimental results. Finally, in Section 4, we present our conclusions and directions for future work.

2. PROPOSED TECHNIQUE

We start this section with a high-level description of the problem of tag relevance estimation for the goal of image tag refinement. Next, we discuss our technique for tag relevance estimation.

2.1 Problem Description

Figure 1 visualizes how the proposed technique for tag relevance estimation is used to remove non-relevant tags from a seed image i (i.e., image tag refinement). Let us assume that T_i is the set of tags assigned to i. In general, T_i contains two types of tags: 1) tags relevant with respect to the content of i and 2) tags not relevant with respect to the content of i. During image tag refinement, if the relevance of a tag t ∈ T_i with respect to the content of i is lower than a particular threshold ξ_tag, then t is considered to be non-relevant and is subsequently removed from T_i. This can be formalized as follows:

T_i^{refined} = \{\, t \mid t \in T_i \wedge r(t, i) > \xi_{tag} \,\},   (1)

where T_i^{refined} is the refined set of tags and where r(t, i) denotes the relevance of t with respect to the content of i. The higher the value of r(t, i), the higher the relevance of t with respect to the content of i, and vice versa. Finally, ξ_tag determines whether t is relevant or not with respect to the content of i.

Figure 1. Tag relevance estimation for image tag refinement. (The figure shows a seed image i with tag set T_i containing connecticut, clouds, coast, flower, grass, happy, leaves, mountain, pretty, rain, sad, sky, sun, trees, water, waterbury, and 2009; a set of visually similar images and a set of visually dissimilar images are retrieved from the image folksonomy, tag relevance estimation is performed over both sets, and tag refinement yields T_i^{refined} = flower, grass, leaves.)
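As a minimal illustration, the refinement rule in (1) reduces to a one-line filter over the tag set. The sketch below assumes the relevance scores r(t, i) have already been computed; the function name and the example scores are purely illustrative.

```python
def refine_tags(tags, relevance, threshold):
    """Implements Eq. (1): keep only tags whose estimated relevance
    with respect to the seed image exceeds the threshold xi_tag."""
    return {t for t in tags if relevance[t] > threshold}

# Hypothetical usage with made-up relevance scores:
scores = {"flower": 3.2, "grass": 1.7, "happy": -0.8, "2009": -2.4}
print(refine_tags(scores, scores, threshold=0.0))  # -> {'flower', 'grass'}
```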

2.2 Data-driven Tag Relevance Estimation

The proposed technique for tag relevance estimation computes r(t, i) as the difference between the tag relevance obtained when making use of folksonomy images visually similar to i and the tag relevance obtained when making use of folksonomy images visually dissimilar to i:

r(t, i) := r_{similar}(t, i, k) - r_{dissimilar}(t, i, l),   (2)

where r_{similar}(t, i, k) denotes the relevance of t with respect to i when making use of k folksonomy images visually similar to i, and where r_{dissimilar}(t, i, l) denotes the relevance of t with respect to i when making use of l folksonomy images visually dissimilar to i. In what follows, we detail the computation of r_{similar}(t, i, k) and r_{dissimilar}(t, i, l), as well as our rationale.

2.2.1 Estimating r_{similar}(t, i, k)

To estimate the relevance of t with respect to the content of i by means of images visually similar to i, we make use of the neighbor voting technique of [11]. This state-of-the-art technique is straightforward to use and has recently attracted significant research attention [16][17].

Given an image folksonomy, neighbor voting estimates the relevance of t with respect to the content of i as the difference between 'the number of images annotated with t in a set of k neighbor images of i retrieved from the image folksonomy by means of visual similarity search' and 'the number of images annotated with t in a set of k neighbor images of i retrieved from the image folksonomy by means of random sampling'. This can be expressed as follows:

r_{similar}(t, i, k) := n_t[N_s(i, k)] - n_t[N_{rand}(k)] = \sum_{j \in N_s(i, k)} vote(j, t) - k \cdot \frac{\sum_{j \in I} vote(j, t)}{|I|},   (3)

where n_t[·] counts the number of images annotated with t, where N_s(i, k) denotes a set of k neighbors of i retrieved from an image folksonomy by means of a visual similarity function s, and where N_{rand}(k) denotes a set of k neighbors of i retrieved from an image folksonomy by means of random sampling. We assume that s sorts folksonomy images according to their visual similarity with i in descending order. Further, vote(j, t) denotes a voting function, returning one when j has been annotated with t, and returning zero otherwise. Finally, |I| represents the total number of images in the image folksonomy used. Note that the higher the value of r_{similar}(t, i, k), the more relevant t is to i.
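A minimal Python sketch of (3), assuming the k visual neighbors of i have already been retrieved and that each image is represented by its set of tags; the function and parameter names are our own, not those of the implementation in [11].

```python
def r_similar(tag, neighbor_tag_sets, tag_frequency, num_images, k):
    """Neighbor voting of Eq. (3): votes for the tag among the k visual
    neighbors, minus the number of votes expected from k images drawn at
    random from the folksonomy (k * folksonomy-wide tag frequency / |I|)."""
    votes = sum(1 for tags in neighbor_tag_sets if tag in tags)
    expected_random_votes = k * tag_frequency[tag] / num_images
    return votes - expected_random_votes
```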

2.2.2 Estimating r_{dissimilar}(t, i, l)

To estimate the relevance of a tag by means of visually dissimilar images, we introduce a variant of the neighbor voting technique of [11]. Specifically, given an image folksonomy, we estimate the relevance of t with respect to the content of i as the difference between 'the number of images annotated with t in a set of l images that are visually dissimilar to i, where these l images have been retrieved from the image folksonomy by means of visual dissimilarity search' and 'the number of images annotated with t in a set of l neighbors of i that have been retrieved from the image folksonomy by means of random sampling'. This can be expressed as follows:

r_{dissimilar}(t, i, l) := n_t[N_d(i, l)] - n_t[N_{rand}(l)] = \sum_{j \in N_d(i, l)} vote(j, t) - l \cdot \frac{\sum_{j \in I} vote(j, t)}{|I|},   (4)

where N_d(i, l) denotes a set of l neighbors of i retrieved from the image folksonomy used by means of a visual distance function d, sorting folksonomy images according to their visual distance to i in descending order. Note that the lower the value of r_{dissimilar}(t, i, l), the more relevant t is to i.
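Under the same assumptions, (4) mirrors (3) over the l most dissimilar images, and (2) is simply their difference; the sketch below reuses the r_similar function from the previous sketch.

```python
def r_dissimilar(tag, dissimilar_tag_sets, tag_frequency, num_images, l):
    """Variant of neighbor voting in Eq. (4): votes for the tag among the
    l most visually dissimilar images, minus the expected random votes."""
    votes = sum(1 for tags in dissimilar_tag_sets if tag in tags)
    return votes - l * tag_frequency[tag] / num_images

def relevance(tag, neighbor_tag_sets, dissimilar_tag_sets,
              tag_frequency, num_images, k, l):
    """Combined data-driven tag relevance r(t, i) of Eq. (2)."""
    return (r_similar(tag, neighbor_tag_sets, tag_frequency, num_images, k)
            - r_dissimilar(tag, dissimilar_tag_sets, tag_frequency, num_images, l))
```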

2.2.3 Rationale

Let us assume that an image i has been annotated with a tag t_1 ∈ T_i relevant to the content of i and a tag t_2 ∈ T_i irrelevant to the content of i. It should be obvious that the difference in tag relevance between t_1 and t_2 needs to be as high as possible in order to facilitate effective image tag refinement (among other applications), something we intend to achieve by making use of both visually similar and dissimilar folksonomy images.

For a tag t_1 relevant to i, the probability of observing an image in N_s(i, k) that has been annotated with t_1 is higher than the probability of observing an image in N_rand(k) that has been annotated with t_1. Indeed, N_s(i, k) mainly consists of images that are visually similar to i, and these images are thus supposed to be semantically related to i. Given (3), r_similar(t_1, i, k) thus returns a positive value. On the other hand, the probability of observing an image in N_d(i, l) that has been annotated with t_1 is lower than the probability of observing an image in N_rand(l) that has been annotated with t_1. Indeed, N_d(i, l) mainly consists of images that are visually dissimilar to i, and these images are thus supposed not to be semantically related to i. Given (4), r_dissimilar(t_1, i, l) thus returns a negative value.

For a tag t_2 not relevant to i, the probability of observing an image in N_s(i, k) that has been annotated with t_2 is lower than the probability of observing an image annotated with t_2 in N_rand(k). Indeed, most images in N_s(i, k) are supposed to be relevant to t_1 but not to t_2, while the probability of observing an image in N_rand(k) that has been annotated with t_2 is independent of the visual similarity function used (as N_rand(k) is constructed by means of random sampling). Given (3), r_similar(t_2, i, k) thus returns a negative value. On the other hand, the probability of observing an image in N_d(i, l) that has been annotated with t_2 is higher than the probability of observing an image in N_rand(l) that has been annotated with t_2. Indeed, most images in N_d(i, l) are supposed to be irrelevant to t_1. Given (4), r_dissimilar(t_2, i, l) thus returns a positive value.

As illustrated by (2), the proposed technique for tag relevance estimation calculates the difference between r_similar(t, i, k) and r_dissimilar(t, i, l). Compared to tag relevance estimation only using visually similar images, the proposed technique increases the relevance value of t_1 by subtracting r_dissimilar(t_1, i, l) from r_similar(t_1, i, k), where r_dissimilar(t_1, i, l) is negative. Similarly, the proposed technique decreases the relevance value of t_2 by subtracting r_dissimilar(t_2, i, l) from r_similar(t_2, i, k), where r_dissimilar(t_2, i, l) is positive. As a result, we can conclude that the proposed technique increases the difference in tag relevance between tags that are relevant and tags that are not relevant with respect to the content of i, thus making it easier to distinguish between the two types of tags. Table 1 summarizes the aforementioned relationship for t_1 and t_2.

Table 1. Relation between r_similar(t, i, k), r_dissimilar(t, i, l), and r(t, i)

       r_similar(t, i, k)    r_dissimilar(t, i, l)    r(t, i)
t_1    +                     -                         ++
t_2    -                     +                         --
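To make this sign pattern concrete, consider a toy computation with made-up numbers: k = l = 10 sampled images, |I| = 1,000 folksonomy images, and both tags occurring 100 times folksonomy-wide (so one vote is expected from 10 random images).

```python
tag_frequency = {"flower": 100, "2009": 100}   # folksonomy-wide counts
votes_similar = {"flower": 6, "2009": 0}       # votes among 10 visual neighbors
votes_dissimilar = {"flower": 0, "2009": 2}    # votes among 10 dissimilar images

for t in tag_frequency:
    expected = 10 * tag_frequency[t] / 1000     # one expected random vote
    r_sim = votes_similar[t] - expected         # Eq. (3)
    r_dis = votes_dissimilar[t] - expected      # Eq. (4)
    print(f"{t}: r_similar={r_sim:+.0f}, r_dissimilar={r_dis:+.0f}, "
          f"r={r_sim - r_dis:+.0f}")
# flower: r_similar=+5, r_dissimilar=-1, r=+6
# 2009:   r_similar=-1, r_dissimilar=+1, r=-2
```

With visually similar images alone, the gap between the relevant and the irrelevant tag is 5 - (-1) = 6; with the combined estimator of (2), it widens to 6 - (-2) = 8.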

2.2.4 Complexity Considerations

In this section, we briefly discuss the computational complexity of the proposed technique. Compared to the complexity of tag relevance estimation only using visually similar images [11], the proposed technique additionally needs 1) a component for constructing a set of images that are visually dissimilar to the seed image used and 2) a component for estimating the relevance of a tag by using the set of visually dissimilar images constructed.

To construct a set of visually similar images, we first compute the visual similarity between the folksonomy images and the seed image. Next, we rank the folksonomy images according to their visual similarity to the seed image. The resulting list of ranked images can thus be used to construct both a set of visually similar images and a set of visually dissimilar images, without any substantial additional computation. Given a set of visually similar images and a set of visually dissimilar images, we can then estimate the relevance of a tag with respect to the content of the seed image. The technique for estimating the relevance of a tag using visually dissimilar images is similar to the technique for estimating the relevance of a tag using visually similar images. Also, given that no dependency exists between the two aforementioned techniques, parallelization can be used to mitigate execution times.
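A minimal sketch of this observation, assuming images are represented as BoVW histograms with non-zero norms: a single cosine-similarity ranking is sliced at both ends to obtain N_s(i, k) and N_d(i, l).

```python
import numpy as np

def similar_and_dissimilar(seed_vec, folksonomy_vecs, k, l):
    """Rank all folksonomy images once by cosine similarity to the seed
    image; the head of the ranking gives the k most similar images, and
    the tail gives the l most dissimilar images (d = 1 - cosine similarity)."""
    sims = folksonomy_vecs @ seed_vec / (
        np.linalg.norm(folksonomy_vecs, axis=1) * np.linalg.norm(seed_vec))
    order = np.argsort(-sims)       # indices sorted by descending similarity
    return order[:k], order[-l:]    # N_s(i, k) and N_d(i, l)
```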

3. EXPERIMENTS

In this section, we first investigate the number of non-relevant tags in an image folksonomy before and after image tag refinement. Given 24 query tags and a ground truth, we then investigate the number of correctly retrieved images over the total number of retrieved images for each of the 24 query tags. Both experiments use the effectiveness of tag relevance estimation only using visually similar images as a baseline.


3.1 Experimental Setup

To test the effectiveness of the proposed technique for tag relevance estimation, we need two types of image sets: 1) an image folksonomy and 2) a test set, where the collective knowledge present in the image folksonomy is used to estimate the relevance of the tags assigned to the test images. In our experiments, in order to create a source of collective knowledge, we randomly selected 100,000 images from MIRFLICKR-1M [18], annotated by 13,343 users with a total of 1,130,342 tags.

Figure 2 shows the distribution of the tag frequency in the image folksonomy created. The x-axis represents the 159,300 unique tags present in the image folksonomy used, ordered by descending tag frequency, whereas the y-axis represents the frequency of these tags. We can observe that the distribution of the tag frequency follows a power law [19].

Figure 2. Tag frequency distribution in the image folksonomy used. 

To evaluate the effectiveness of image tag refinement and tag-based image retrieval, we adopted two different sets of test images. The effectiveness of image tag refinement was tested by means of 1,000 images randomly selected from the MIRFLICKR-25000 collection [19]. These images were annotated with a total of 24,474 tags, and each image was annotated with at least five tags. We manually classified the 24,474 tags as relevant or non-relevant, finding 6,534 tags to be relevant and 17,940 tags to be non-relevant. Further, the effectiveness of tag-based image retrieval was tested by using MIRFLICKR-25000. Note that no overlap exists between the image folksonomy created and our test sets (that is, the different sets are mutually exclusive).

We represented the visual content of each image by means of Bag-of-Visual-Words (BoVW), relying on a vocabulary of 500 visual words [20]. We adopted the cosine similarity to find visually similar images (i.e., s in (3) denotes cosine similarity). In addition, we made use of "1 - cosine similarity" to find visually dissimilar images (i.e., d in (4) denotes "1 - cosine similarity").

The values of k and l were determined offline using an empirical approach. Specifically, we first manually investigated the ratio of non-relevant tags in the image folksonomy by varying k (keeping l fixed to zero), selecting the value of k that minimized the aforementioned ratio. The value of l was subsequently determined in a similar way, reusing the value of k determined in the previous step. As a result, k and l were set to 500 and 5,000, respectively.
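This coordinate-wise search can be sketched as follows, assuming a hypothetical helper noise_level(k, l) that measures the ratio of non-relevant tags remaining after refinement with the given parameters:

```python
def tune_k_then_l(candidate_ks, candidate_ls, noise_level):
    """Empirical parameter selection as described above: first choose k
    with l fixed to zero, then choose l with the selected k held fixed."""
    best_k = min(candidate_ks, key=lambda k: noise_level(k, 0))
    best_l = min(candidate_ls, key=lambda l: noise_level(best_k, l))
    return best_k, best_l
```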

3.2 Evaluation Criteria

We used the Noise Level (NL) metric [3] in order to evaluate the effectiveness of image tag refinement. NL represents the proportion of non-relevant tags in the set of all user-supplied tags. If an image folksonomy contains a high number of non-relevant tags, then NL is close to one. Conversely, if an image folksonomy contains a low number of non-relevant tags, then NL is close to zero.

We used the precision at rank m (P@m) in order to evaluate the effectiveness of tag-based image retrieval, computing P@m for each of the 24 query tags used. To that end, we first estimated the relevance of each query tag with respect to the content of the MIRFLICKR-25000 images. We then ranked the MIRFLICKR-25000 images according to their relevance to each of the query tags used, relying on a ground truth manually created by the founders of MIRFLICKR-25000. Finally, we averaged P@m over the 24 query tags used. For the sake of completeness, the definition of P@m is given below:

P@m \text{ for } t = \frac{|I_t^{relevant} \cap I_{t, m}^{retrieved}|}{m},   (5)

where t represents a query tag used to retrieve images from the image folksonomy, where I_t^{relevant} is the set of all folksonomy images that are relevant to t (given the provided ground truth), and where I_{t, m}^{retrieved} is the set of the m topmost images that have been retrieved for t (given the estimated tag relevance values).
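Both evaluation criteria are simple set computations; the sketch below gives illustrative implementations of NL and of P@m as defined in (5), with all names our own.

```python
def noise_level(all_tags, relevant_tags):
    """NL metric [3]: proportion of non-relevant tags among all
    user-supplied tags (1 = all noise, 0 = no noise)."""
    return sum(1 for t in all_tags if t not in relevant_tags) / len(all_tags)

def precision_at_m(ranked_images, relevant_images, m):
    """P@m of Eq. (5): fraction of the m top-ranked images for a query
    tag that are relevant according to the ground truth."""
    return len(set(ranked_images[:m]) & set(relevant_images)) / m
```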

3.3 Experimental Results

3.3.1 Effectiveness of Image Tag Refinement

Table 2 compares the effectiveness of image tag refinement when only making use of visually similar images on the one hand, and when making use of both visually similar and dissimilar images on the other hand. We can observe that NL decreases by approximately 6% (from 0.733 to 0.690) when tag relevance estimation only makes use of visually similar images, whereas NL decreases by approximately 8% (from 0.733 to 0.673) when tag relevance estimation uses both visually similar and dissimilar images. Specifically, with both techniques removing about 653 relevant tags (cost), the proposed technique for tag relevance estimation is able to remove 5,846 irrelevant tags, while tag relevance estimation only using visually similar images is only able to remove 4,823 irrelevant tags (benefit).

Table 2. Effectiveness of image tag refinement

                            Before image      After image tag refinement
                            tag refinement    Using visually        Using the
                                              similar images [11]   proposed technique
Number of relevant tags     6,534             5,881                 5,881
Number of irrelevant tags   17,940            13,117                12,094
NL                          0.733             0.690                 0.673


Figure 3 contains four example images that illustrate the effectiveness of the different image tag refinement techniques. Relevant tags have been underlined. In addition, tags used as input for image tag refinement have been sorted alphabetically, whereas tags outputted by image tag refinement have been ranked according to decreasing relevance values. For the four images shown in Figure 3, we can observe that the proposed technique for tag relevance estimation is more effective than tag relevance estimation only making use of visually similar images.

Figure 3. Example images illustrating the effectiveness of image tag refinement. For each of the four example images, the tags before refinement, the tags after refinement using visually similar images with uniform weights [11], and the tags after refinement using the proposed technique are listed below.

Image 1
- Before tag refinement: baby, booties, bsj, elizabeth, ez, garter, gift, handknit, handknitting, jacket, knit, knitting, merino, rock, socks, stayon, stitch, str, surprise, that, wool, yarn, zimmermann
- Using visually similar images with uniform weights [11]: wool, knitting, yarn, knit, handknit, socks, baby, jacket, handknitting, surprise
- Using the proposed technique: wool, knitting, yarn, knit, socks, handknit, jacket, handknitting, stitch, baby

Image 2
- Before tag refinement: architecture, beginnersdigitalphotography, blackandwhite, canon1855mmf3556, city, culture, detail, exploration, explore, flickrfavoritecityandstreetphotographers, industrial, industry, juststreetphotography, longbeach, nature, night, nightphotography, nightshots, street, structure, urban, vintage
- Using visually similar images with uniform weights [11]: street, architecture, vintage, urban, explore, city, industrial, culture
- Using the proposed technique: street, architecture, urban, vintage, city, industry, explore, structure

Image 3
- Before tag refinement: animals, bugs, canon, canonrebelxti, delviscio, earth, fog, foliage, garden, home, insects, jeffdelviscio, jefferydelviscio, jeffreydelviscio, light, macro, microscopic, mist, mom, rebel, rebelxti, soil, vegetables, water, winter
- Using visually similar images with uniform weights [11]: garden, animals, fog, foliage, earth, rebel, insects, rebelxti
- Using the proposed technique: garden, animals, fog, foliage, earth, insects, light, mist

Image 4
- Before tag refinement: anawesomeshot, aroma, bee, biology, bloom, botanical, bottany, bud, closeup, colourful, colours, digital, digitalcamara, digitalphoto, environment, flickr, flickrsbest, flora, flowermacro, flower, garden, google, green, himachal, honey, india, indianphoto, leaves, macro, nature, naturesfinest, naturesgift, olympus, olympussp550uz, petals, picturesque, plants, pollen, red, stem
- Using visually similar images with uniform weights [11]: flower, macro, green, leaves, nature, colours, colourful, india, digital, garden, naturesfinest, closeup, flickr, anawesomeshot
- Using the proposed technique: flower, macro, green, nature, leaves, garden, colours, closeup, colourful, plants, flora, bee, botanical, bloom

3.3.2 Effectiveness of Tag-based Image Retrieval

Figure 4 shows the effectiveness of tag-based image retrieval when using different techniques for tag relevance estimation. The effectiveness of tag-based image retrieval is computed by means of Average P@5 and Average P@10. We can observe that tag-based image retrieval using the proposed technique for tag relevance estimation achieves a higher precision than tag-based image retrieval using tag relevance estimation that only makes use of visually similar images. Specifically, compared to tag relevance estimation only using visually similar images, the proposed technique improves the effectiveness of tag-based image retrieval by approximately 10% and 14% in terms of Average P@5 and Average P@10, respectively (from 0.528 to 0.583 for Average P@5, and from 0.471 to 0.535 for Average P@10).

Figure 4. Effectiveness of tag-based image retrieval. (The original figure is a bar chart comparing Average P@5 and Average P@10, on a y-axis ranging from 0 to 0.8, for tag relevance estimation using visually similar images [11] and for tag relevance estimation using the proposed technique.)

4. CONCLUSIONS AND FUTURE WORK

This paper introduced a data-driven approach for estimating the relevance of user-supplied tags with respect to the content of a seed image, computing the relevance of these tags by means of both visually similar and dissimilar folksonomy images. That way, compared to tag relevance estimation only making use of visually similar images, we are able to increase the difference in tag relevance between tags that are relevant and tags that are not relevant with respect to the content of a seed image at a limited increase in computational cost, thus making it more straightforward to distinguish between them. The latter is demonstrated by means of a simple yet elegant mathematical formalization, and through experimentation with subsets of MIRFLICKR-25000 and MIRFLICKR-1M, showing that the proposed technique allows increasing the effectiveness of both image tag refinement and tag-based image retrieval.

In future work, we plan to improve the estimation of image tag relevance by combining visual information and tag statistics. In addition, we plan to compare our data-driven approach with a classifier-based approach for detecting a number of predefined semantic concepts. Finally, we plan to evaluate the proposed approach in the context of semantic concept-based video copy detection [21].

5. ACKNOWLEDGMENTS

This research was supported by the Basic Science Research Program of the National Research Foundation (NRF) of Korea, funded by the Ministry of Education, Science and Technology (research grant: 2011-0011383).

6. REFERENCES

[1] Flickr Blog. August 2011. 6,000,000,000. Available on http://blog.flickr.net/en/2011/08/04/6000000000/.

[2] Facebook Statistics. November 2011. Available on http://www.facebook.com/press/info.php?statistics/.

[3] Chua, T., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y. 2009. NUS-WIDE: A Real-World Web Image Database from National University of Singapore. In Proceedings of ACM CIVR. 1-9. DOI=http://doi.acm.org/10.1145/1646396.1646452.

[4] Lindstaedt, S., Morzinger, R., Sorschag, R., Pammer, V., Thallinger, G. 2009. Automatic image annotation using visual content and folksonomies. Multimedia Tools and Applications. 41, 1, 97-113. DOI=http://dx.doi.org/10.1007/s11042-008-0247-7.

[5] Murdock, V. 2011. Your Mileage May Vary: On the Limits of Social Media. SIGSPATIAL Special. 3, 2, 62-66. DOI=http://doi.acm.org/10.1145/2047296.2047309.

[6] Lee, S., De Neve, W., Plataniotis, K. N., Ro, Y. M. 2010. MAP-based image tag recommendation using a visual folksonomy. Pattern Recognition Letters. 31, 9, 976-982. DOI=http://dx.doi.org/10.1016/j.patrec.2009.12.024.

[7] Jin, J., Khan, L., Wang, L., Awad, M. 2005. Image Annotation by Combining Multiple Evidence & WordNet. In Proceedings of ACM MM. 706-715. DOI=http://doi.acm.org/10.1145/1101149.1101305.

[8] Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. The MIT Press.

[9] Kennedy, L., Slaney, M., Weinberger, K. 2009. Reliable Tags Using Image Similarity: Mining Specificity and Expertise from Large-Scale Multimedia Databases. In Proceedings of ACM Multimedia: Workshop on Web-Scale Multimedia Corpus. 17-24. DOI=http://doi.acm.org/10.1145/1631135.1631139.

[10] Liu, D., Hua, X. S., Yang, L. J., Wang, M., Zhang, H. J. 2009. Tag Ranking. In Proceedings of the International World Wide Web Conference. 351-360. DOI=http://doi.acm.org/10.1145/1526709.1526757.

[11] Li, X., Snoek, C. G., Worring, M. 2009. Learning Social Tag Relevance by Neighbor Voting. IEEE Trans. Multimedia. 11, 7, 1310-1322. DOI=http://dx.doi.org/10.1109/TMM.2009.2030598.

[12] Truong, B. Q., Sun, A., Bhowmick, S. S. 2012. Content is Still King: The Effect of Neighbor Voting Schemes on Tag Relevance for Social Image Retrieval. In Proceedings of ACM ICMR. 1-8.

[13] Lee, S., De Neve, W., Ro, Y. M. 2010. Tag refinement in an image folksonomy using visual similarity and tag co-occurrence statistics. Signal Processing: Image Communication. 25, 10, 761-773. DOI=http://dx.doi.org/10.1016/j.image.2010.10.002.

[14] Deselaers, T., Ferrari, V. 2011. Visual and Semantic Similarity in ImageNet. In Proceedings of IEEE CVPR. 1777-1784. DOI=http://dx.doi.org/10.1109/CVPR.2011.5995474.

[15] Li, X., Snoek, C. G., Worring, M., Smeulders, A. W. M. 2011. Social Negative Bootstrapping for Visual Categorization. In Proceedings of ACM MIR. 1-8. DOI=http://doi.acm.org/10.1145/1991996.1992008.

[16] Liu, D., Hua, X.-S., Zhang, H.-J. 2011. Content-based tag processing for Internet social images. Multimedia Tools and Applications. 51, 2, 723-738. DOI=http://dx.doi.org/10.1007/s11042-010-0647-3.

[17] Sawant, N., Li, J., Wang, J. Z. 2011. Automatic image semantic interpretation using social action and tagging data. Multimedia Tools and Applications. 51, 2, 213-246. DOI=http://dx.doi.org/10.1007/s11042-010-0650-8.

[18] Huiskes, M. J., Thomee, B., Lew, M. S. 2010. New Trends and Ideas in Visual Concept Detection: The MIR Flickr Retrieval Evaluation Initiative. In Proceedings of ACM MIR. 527-536. DOI=http://doi.acm.org/10.1145/1743384.1743475.

[19] Sigurbjörnsson, B., van Zwol, R. 2008. Flickr Tag Recommendation based on Collective Knowledge. In Proceedings of the International World Wide Web Conference. 327-336. DOI=http://doi.acm.org/10.1145/1367497.1367542.

[20] van de Sande, K. E. A., Gevers, T., Snoek, C. G. M. 2010. Evaluating Color Descriptors for Object and Scene Recognition. IEEE Trans. Pattern Analysis and Machine Intelligence. 32, 9, 1582-1596. DOI=http://dx.doi.org/10.1109/TPAMI.2009.154.

[21] Min, H.-S., Choi, J., De Neve, W., Ro, Y. M. 2012. Near-Duplicate Video Clip Detection Using Model-Free Semantic Concept Detection and Adaptive Semantic Distance Measurement. IEEE Trans. on Circuits and Systems for Video Technology. 22, 8, 1174-1187. DOI=http://dx.doi.org/10.1109/TCSVT.2012.2197080.
