Automatic Discovery of Global and Local …...gesting tags that have a strong presence in the area...

11
Automatic Discovery of Global and Local Equivalence Relationships in Labeled Geo-Spatial Data Bart Thomee Yahoo Labs San Francisco, CA, USA [email protected] Gianmarco De Francisci Morales Yahoo Labs Barcelona, Spain [email protected] ABSTRACT We propose a novel algorithmic framework to automatically detect which labels refer to the same concept in labeled spa- tial data. People often use different words and synonyms when referring to the same concept or location. Furthermore these words and their usage vary across culture, language, and place. Our method analyzes the patterns in the spatial distribution of labels to discover equivalence relationships. We evaluate our proposed technique on a large collection of geo-referenced Flickr photos using a semi-automatically constructed ground truth from an existing ontology. Our approach is able to classify equivalent tags with a high ac- curacy (AUC of 0.85), as well as providing the geographic extent where the relationship holds. Categories and Subject Descriptors H.2.8 [Database Management]: Database Applications— Spatial databases and GIS ; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing Keywords Geotagged data; geo-spatial analysis; folksonomy; relation- ship discovery; Flickr 1. INTRODUCTION In recent years, social media and Web applications, such as del.icio.us, have amassed large quantities of user-generated content. Researchers often leverage the“wisdom of the crowd” in order to organize and search this content. In particular, user-supplied tags, textual labels assigned to such content, form a powerful and useful feature that has been successfully exploited for this purpose in many different domains. Tags are free-form, short keywords associated by a user to some content such as a photo, a link, or a blog entry. The categorization system that arises from applying tags is usually called a folksonomy [43]. Unlike ontologies and taxonomies, folksonomies result in unstructured knowledge, Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. HT’14, September 1–4, 2014, Santiago, Chile. Copyright 2014 ACM 978-1-4503-2954-5/14/09 ...$15.00. http://dx.doi.org/10.1145/2631775.2631794. i.e. they are flat and have no predefined semantics. How- ever, their unstructured nature is also their strength. For example, tagging has lower cognitive cost than picking cat- egories from an ontology, it allows for greater flexibility and self-expression, and the folksonomy may naturally evolve to reflect emergent properties of the data [40]. The availability of large quantities of data has prompted researchers to try to automatically extract knowledge from it. Formerly ontologies used to be built manually by experts, however with large amounts of data this process has proved to be expensive and unsustainable. Currently, most state-of- the-art methods to generate ontologies use large large text databases [25] as their imput. Often, they leverage crowd- sourced data and user-generated content. Many approaches mine Wikipedia or the Web to find interesting relationships, for instance DBPedia 1 and Google’s Knowledge Graph 2 . The expectation is that the individual interactions of a large number of agents would lead to global effects that can be observed as semantics. Ontologies would thus become an emergent property of the system as opposed to a fixed for- malization of knowledge. While geo-referenced tags are readily available in large quantities, the problem of referencing relationships geograph- ically has received little attention so far. The presence of the spatial dimension allows us to ask new questions about la- beled data, such as “where does this relationship hold?” , “is this relationship valid locally or globally?” , and “which tag among several equivalents is most prominent here?” . Broadly speaking, we are interested in identifying pat- terns in the distribution of labels over some domain. In this work we describe a general framework for the spatial domain and focus on geographic patterns for evaluation. Specifically, we look at tags from Flickr 3 , a popular photo-sharing web- site that supports tags and geo-referenced photos. Rather than looking at co-occurrences of tags in photos [39], we an- alyze the spatial patterns that emerge from the usage of tags, and extract structured information from these pat- terns in the form of equivalence relationships. The equiv- alence relationships extracted include synonyms, language variants, misspellings, abbreviations, and even nicknames. For example, our algorithm can determine that the Eiffel Tower may interchangeably be referred to by different tags, such as la tour eiffel, toureiffel, eiffelturm, and la 1 http://dbpedia.org 2 http://www.google.com/insidesearch/features/ search/knowledge.html 3 http://www.flickr.com

Transcript of Automatic Discovery of Global and Local …...gesting tags that have a strong presence in the area...

Page 1: Automatic Discovery of Global and Local …...gesting tags that have a strong presence in the area where the photo was taken, de-duplicating tags when they refer to the same canonical

Automatic Discovery of Global and LocalEquivalence Relationships in Labeled Geo-Spatial Data

Bart ThomeeYahoo Labs

San Francisco, CA, [email protected]

Gianmarco De Francisci MoralesYahoo Labs

Barcelona, [email protected]

ABSTRACTWe propose a novel algorithmic framework to automaticallydetect which labels refer to the same concept in labeled spa-tial data. People often use different words and synonymswhen referring to the same concept or location. Furthermorethese words and their usage vary across culture, language,and place. Our method analyzes the patterns in the spatialdistribution of labels to discover equivalence relationships.We evaluate our proposed technique on a large collectionof geo-referenced Flickr photos using a semi-automaticallyconstructed ground truth from an existing ontology. Ourapproach is able to classify equivalent tags with a high ac-curacy (AUC of 0.85), as well as providing the geographicextent where the relationship holds.

Categories and Subject DescriptorsH.2.8 [Database Management]: Database Applications—Spatial databases and GIS ; H.3.1 [Information Storageand Retrieval]: Content Analysis and Indexing

KeywordsGeotagged data; geo-spatial analysis; folksonomy; relation-ship discovery; Flickr

1. INTRODUCTIONIn recent years, social media and Web applications, such asdel.icio.us, have amassed large quantities of user-generatedcontent. Researchers often leverage the“wisdom of the crowd”in order to organize and search this content. In particular,user-supplied tags, textual labels assigned to such content,form a powerful and useful feature that has been successfullyexploited for this purpose in many different domains.

Tags are free-form, short keywords associated by a userto some content such as a photo, a link, or a blog entry.The categorization system that arises from applying tagsis usually called a folksonomy [43]. Unlike ontologies andtaxonomies, folksonomies result in unstructured knowledge,

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. Copyrights for components of this work owned by others than ACMmust be honored. Abstracting with credit is permitted. To copy otherwise, or republish,to post on servers or to redistribute to lists, requires prior specific permission and/or afee. Request permissions from [email protected]’14, September 1–4, 2014, Santiago, Chile.Copyright 2014 ACM 978-1-4503-2954-5/14/09 ...$15.00.http://dx.doi.org/10.1145/2631775.2631794.

i.e. they are flat and have no predefined semantics. How-ever, their unstructured nature is also their strength. Forexample, tagging has lower cognitive cost than picking cat-egories from an ontology, it allows for greater flexibility andself-expression, and the folksonomy may naturally evolve toreflect emergent properties of the data [40].

The availability of large quantities of data has promptedresearchers to try to automatically extract knowledge fromit. Formerly ontologies used to be built manually by experts,however with large amounts of data this process has provedto be expensive and unsustainable. Currently, most state-of-the-art methods to generate ontologies use large large textdatabases [25] as their imput. Often, they leverage crowd-sourced data and user-generated content. Many approachesmine Wikipedia or the Web to find interesting relationships,for instance DBPedia1 and Google’s Knowledge Graph2. Theexpectation is that the individual interactions of a largenumber of agents would lead to global effects that can beobserved as semantics. Ontologies would thus become anemergent property of the system as opposed to a fixed for-malization of knowledge.

While geo-referenced tags are readily available in largequantities, the problem of referencing relationships geograph-ically has received little attention so far. The presence of thespatial dimension allows us to ask new questions about la-beled data, such as “where does this relationship hold?”, “isthis relationship valid locally or globally?”, and “which tagamong several equivalents is most prominent here?”.

Broadly speaking, we are interested in identifying pat-terns in the distribution of labels over some domain. In thiswork we describe a general framework for the spatial domainand focus on geographic patterns for evaluation. Specifically,we look at tags from Flickr3, a popular photo-sharing web-site that supports tags and geo-referenced photos. Ratherthan looking at co-occurrences of tags in photos [39], we an-alyze the spatial patterns that emerge from the usage oftags, and extract structured information from these pat-terns in the form of equivalence relationships. The equiv-alence relationships extracted include synonyms, languagevariants, misspellings, abbreviations, and even nicknames.For example, our algorithm can determine that the EiffelTower may interchangeably be referred to by different tags,such as la tour eiffel, toureiffel, eiffelturm, and la

1http://dbpedia.org2http://www.google.com/insidesearch/features/search/knowledge.html3http://www.flickr.com

Page 2: Automatic Discovery of Global and Local …...gesting tags that have a strong presence in the area where the photo was taken, de-duplicating tags when they refer to the same canonical

dame de fer, even if these tags would never occur togetherin the same photo. We further determine whether the equiv-alence relationship holds globally or holds only locally. Forexample, in the city of Paris there is a park officially knownas ‘Le Jardin du Luxembourg’, which may be referred to bypeople when tagging their photos by using the tag luxem-

bourg or the tag park. In and around this particular park,these two labels can be considered equivalent, and one maybe substituted for the other; yet anywhere else in Paris thesetwo labels should not be considered the same, in particularbecause the label park will likely also occur around all otherparks in Paris. Conversely, while there are many parks in thecity of Luxembourg, the equivalence relationship betweenthese two tags does not hold there either.

There are several applications for a system that automati-cally uncovers geo-spatial equivalence relationships; next weprovide a few examples.

Tag canonicalization and entity linking. Mapping thetags to known entities, and thus implicitly knowing the re-lationships between entities, can enrich existing knowledgerepositories. Such mapping of tags to their canonical forms,results in (i) the compression of the tag vocabulary, sinceseveral terms can be mapped to a single representative en-try, and (ii) more emphasis on relevant terms when, forinstance, performing term counting for building languagemodels. In fact, any related term that would be consideredindividually (and thus would be competing for attention)will be considered as part of a larger group as a result ofcanonicalization. For recommendation and personalization,the vocabulary compression yields a reduced dimensional-ity of the tag space that needs to be considered, e.g., formatrix factorization. Furthermore, the spatial aspect can bebeneficial for contextualization in hyper-local search. For ex-ample, a user in a Spanish-speaking country may issue thequery “bamba” when searching for a popular Mexican folksong, while a user in Israel issuing this query is more likelyto refer to a particular peanut butter-flavored snack.

Tag suggestion, correction, de-duplication, disam-biguation and search. The canonicalization process yieldsrelationships between tags; these relationships can hold glob-ally true or be region or area specific. This knowledge canbenefit the tagging process by flagging misspelled tags, sug-gesting tags that have a strong presence in the area wherethe photo was taken, de-duplicating tags when they refer tothe same canonical tag, and disambiguating tags by usinggeo-spatial and contextual information to find the intendedcanonical terms (e.g., a user tagging a photo of an orangewith “orange” when it is taken inside Orange County in Cal-ifornia). In addition, search results can be improved whenthe user searches for a tag by expanding the search query toalso include all other tags that map to the same canonicalterm, as well as diversifying the results by highlighting allpossible meanings of the query if it is ambiguous.

Several challenges need to be faced to successfully mine re-lationships from geo-spatial labeled user-generated content.First, data is present at several different scales (street, city,region, country, etc.). Second, people do not always behaveand tag rationally. This fact translates in a high volume ofnoise in the data. Finally, the large amount of the data im-poses the use of efficient techniques.

The main contributions of this work are the following:

• We investigate the problem of automatically uncover-ing equivalence relationships from geo-spatial data;

• We propose an algorithm to find equivalence relation-ships and evaluate it on a collection of Flickr tags;

• We validate our approach on a ground truth generatedsemi-automatically from an existing ontology.

The remainder of this paper is organized as follows. Wefirst discuss related work in Section 2. Section 3 then in-troduces our framework while Section 4 describes its imple-mentation. Section 5 presents its evaluation and Section 6concludes with final remarks and future outlooks.

2. RELATED WORKAt its core, this work is an application of geo-spatial datamining to geo-referenced folksonomies.

Folksonomy is a portmanteau term from folk and taxon-omy [43]. It refers to a classification system that createsmeaning from unstructured and collaborative annotationsthat are used in a practical real-world system. While on-tologies are traditionally crafted by experts, and thereforerelatively expensive to create, folksonomies arise from thesteady-state tagging behavior of users, or so-called “wisdomof the crowd”, and are relatively inexpensive to create. Ad-ditionally, while ontologies usually have a hierarchical struc-ture, folksonomies are composed by a collection of tags whichare all equal, i.e., the structure of a folksonomy is flat.

Passant [29] proposes a method to improve the qualityof folksonomies by using ontologies for tag canonicalization,disambiguation and structuring. Knerr [21] proposes a for-malization of folksonomies in terms of linked data to enableinteroperability across different folksonomies. In the samespirit, MOAT [30] is a framework that enables machine-readable semantics for tags. Limpens et al. [22] provides asurvey of proposals that attempt to bridge ontologies andfolksonomies. Gruber [12] tries to clarify the relationshipbetween ontologies and folksonomies, and argues that con-trasting them is a “false dichotomy”.

Differently from these other works, the method proposedin this paper takes advantage of the data available fromfolksonomies in order to enrich ontologies by uncovering newrelationships. Similarly to this work, Mika [27] and Lux andDosinger [23] generate semantic relationship starting fromfolksonomies. However, they focus on the social part of thefolksonomies and do not take into account the geo-spatialaspect of the problem, which is our main contribution.

Closely related to our work, Schmitz [36] try to induce anontology from Flickr tags. The ontologies are representedas trees of tags that constitute a taxonomy, i.e., the edgesare subsumptions (“is-a” relationships). The method usedby the authors is a simple rule based approach based on co-occurences, which resembles the mining of generalized (hier-archical) association rules. However, the authors do not takeinto account the spatial nature of the data. Furthermore, ourframework uses a supervised learning approach.

Geo-spatial data mining refers to knowledge extractionfrom data with spatial coordinates. Examples include datafrom geographic information systems (GIS), GPS data andmobile trajectories.

Discovering geographic regions within spatial data hasbeen an active research topic for many years. A variety of

Page 3: Automatic Discovery of Global and Local …...gesting tags that have a strong presence in the area where the photo was taken, de-duplicating tags when they refer to the same canonical

methods can be applied to uncover the hidden spatial struc-tures, most notably techniques such as clustering [9], densityestimation [16, 18] and neural networks [15].

In recent years, georeferenced multimedia collections haveconsiderably grown in size, in particular due to the availabil-ity of digital cameras with built-in GPS receivers that canautomatically attach the geographic location to every photothat is taken. We can therefore find a considerable numberof methods in the research literature that try to represent orsummarize geographical regions in terms of either the tagsassociated with the photos in that region [1, 7] or of the pho-tos themselves [2, 5, 17, 19, 35, 44]. For example, languagemodeling approaches [14, 38, 42] characterize cells that par-tition the world into discrete units for automatic photo andvideo geo-localization.

In our work, we make use of geo-spatial areas that de-scribe tags in a folksonomy order to qualify relationshipsamong them. GeoFolk is a bayesian latent topic model thatcombines tags and geographic coordinates, modeled jointlyvia a generative process [41]. The authors show its useful-ness in the usual contexts of tag recommendation, contentclassification and clustering. More refined models that candescribe non-gaussian distributions have also been proposedmore recently [20]. Indeed, these models are one of the pos-sible building blocks that our framework can use to modelareas. However, our final goal of finding equivalence rela-tionships is different from what these works address.

Rattenbury et al. [32] extract place and event semanticsfrom Flickr tags. They use patterns in the spatiotemporaldistribution of tags to classify them in these two categories.In particular, places exploit spatial patterns, while eventsuse temporal patterns at a given scale. Conversely, we areinterested in semantic relationships between tags. Similarly,Zhang et al. [45] try to discover spatiotemporal relationshipsin collections of tagged photos. However, while their notionof relationship is vague and never explicitly defined, in thiswork we aim at discovering well-defined semantic relation-ships. Furthermore, the authors use a fixed coarse grainedquantized vector representation of the tag distribution, whilewe propose a direct estimation of tag distribution from thedata, coupled with a dynamic quantization of the distribu-tion based on its spatial extent.

3. ALGORITHMIC FRAMEWORKIn our approach we focus on labeled instances, e.g. tags inthe case of online photos, each of which is associated with alocation. Our geo-spatial relationship discovery frameworkemploys both unsupervised and supervised learning, andconsists of five main steps, namely:

1. Data representation – The data distribution of eachlabel is analyzed in order to produce a geo-spatial repre-sentation;

2. Overlap detection – The data representations of labelsare compared to find spatial overlaps that may indicatethat a relationship exists between the labels;

3. Feature generation – Descriptive features are gener-ated from the geo-spatial representations and their over-lapping area;

4. Relationship classification – The features are analyzedin order to determine which, if any, relationship holds be-

tween overlapping labels and the geo-spatial extent of thisrelationship;

5. Relationship aggregation – The relationships betweenlabels are optionally analyzed to perform relationship ag-gregation, e.g., detecting transitivity between three ormore relationships.

There are several possible implementations for each ofthese steps, and our framework does not enforce any par-ticular choice, as long as the various pieces form a consis-tent pipeline (e.g., the feature generation should work on thechosen data representation). We describe the specific instan-tiation of the framework we experiment with in Section 4.

The final output of our framework is a set of tuples of theform (λa, λb, Eab) that represent equivalence relationshipsbetween labels λa and λb. Each relationship is associatedwith a geo-spatial extent Eab that represents the spatial areawhere the relationship holds. In the remainder of this sectionwe detail each of the aforementioned steps in the framework.

3.1 Data representationGiven a set of labels Λ, we can define the data collection4 asDΛ =

⊎Dλ | λ ∈ Λ, where the

⊎operator performs a union

of the data instances associated with each label Dλ. In casea data item has multiple labels, it will be included in DΛ as

many times its labels. We further define Dλ , {d}, whered has a label λ and is represented by a tuple (x, y), whichcontains a location that is expressed by a coordinate x anda coordinate y. In the specific case of geo-referenced dataitems, a location refers to a geographic coordinate, where xrepresents the longitude and y the latitude of the coordinate.

Modeling. For each label λ ∈ Λ, we individually analyzethe geo-spatial distribution of its instances by using unsu-pervised learning. The algorithm detects where the label ex-hibits a strong presence, and yields one or more clusters thatcan be represented as closed surfaces. For example, in twodimensions the surface of each obtained cluster would beformed by the convex hull of all points in the cluster. Thechoice of which clustering algorithm to apply depends onvarious factors, such as (i) the complexity of the data, (ii)the scale of the data, (iii) the volume of noise in the data,and (iv) the desired degree of accuracy for the modeling.

For example, mean-shift [3], k-means [24] or DBSCAN [8]can be used to separate the data into disjoint clusters, whereaskernel density estimation [34] or Gaussian mixture decom-position [33] can be used to generate probabilistic models ofthe data. Some clustering methods require pre-setting thenumber of clusters to detect, while others implicitly per-form clustering without requiring to specify the number ofclusters [31]; nonetheless, for the former type of clusteringmethods there exist practical methods (e.g., Bayesian infor-mation criterion) to perform model selection, i.e., to choosethe best number of clusters.

The union of the detected clusters for a label forms itsgeo-spatial representation. Similarly to the definition of thedata collection, we define the geo-spatial representation ofa label as a cluster collection CΛ =

⊎Cλ | λ ∈ Λ, where

Cλ , {c} and where each cluster c has a label λ and isrepresented by its geo-spatial surface S.

4While we describe our data model using two spatial di-mensions for clarity, the extension of our model to higher-dimensional spaces is straightforward.

Page 4: Automatic Discovery of Global and Local …...gesting tags that have a strong presence in the area where the photo was taken, de-duplicating tags when they refer to the same canonical

3.2 Overlap detectionAfter the geo-spatial representations have been determinedfor each label, we perform pair-wise comparisons betweenthe representations to find overlaps that may indicate theexistence of a relationship between the labels. For each pairof labels λa and λb, we identify all geo-spatial overlaps be-tween their associated clusters Cλa and Cλb . Each overlap isrepresented by a tuple (λa, Ca, λb, Cb, Eab), where Ca ⊆ Cλa

and Cb ⊆ Cλb , and where Eab is the extent of the geo-spatialintersection between the clusters Ca and Cb. By definition,when there are multiple overlaps between a pair of labels,then no cluster of either label is able to participate in morethan one overlap. A further filtering step can be appliedto the obtained overlaps in order to discard those whose ex-tents are too small to allow for reliably detecting meaningfulrelationships.

3.3 Feature generationOnce the overlaps have been identified, we generate featuresfrom the union of involved clusters Ma =

⋃c | c ∈ Ca and

Mb =⋃c | c ∈ Cb, as well as from their geo-spatial in-

tersection Mab = Ma

⋂Mb, as exemplified in Figure 1. The

features aim at describing the inherent properties of the geo-spatial extents and densities of the overlapping clusters, thuscapturing the important aspects which allow a classifier todetermine whether a relationship holds between the overlap-ping clusters of the labels. We identify three kind of features:overlap, extent and density.

Overlap-based features indicate which kind of relationshiplikely holds between the overlapping clusters. We can dif-ferentiate between various types of overlap, e.g., by usingqualitative spatial reasoning to express the overlap betweenthe clusters through different region connection calculi [4].For instance, a non-tangential proper part relationship be-tween Ma and Mb, i.e., all clusters in Ma are completelycovered by the clusters in Mb, would suggest a subsumptionrelationship is likely.

Extent-based features indicate the strength of the relation-ship between the labels – the larger the extent of the overlap,the more likely an actual relationship to exist – as well asfacilitating the decision process when several relationshipshave been identified as candidates for describing the over-lap. For instance, the relative extent of the overlap may en-able an equivalence relationship to be distinguished from asubsumption one.

Density-based features capture how similar the distributionof the data within the extents of the overlapping clustersare, in particular the extent the clusters have in commonwith respect to their overall extents. These features allowto detect relationships that would be hard to find based onjust their overlap or size. For example, when a cluster with alarge extent overlaps one or more clusters with small extentan equivalence relationship may still be uncovered betweentheir labels when their normalized densities are similar.

After the feature generation step, we end up with featuresFa,b,ab = {Fa, Fb, Fab} describing each overlap.

3.4 Relationship classificationDifferent kinds of relationships between labels may be presentwithin the data, such as equivalence and subsumption, wherethe relationship may vary according to its location. In gen-

Figure 1: The intersection Mab of clusters Ma and Mb.

eral, each type of relationship can be treated as a differentclass in a multi-class classifier. Such a classifier takes as in-put the features Fa,b,ab of an overlap between a pair of labels{λa, λb}, and produces as output a decision r that indicateswhich, if any, relationship holds between the two labels. Aclassifier need not produce a binary decision, but may ratherassign a confidence score to each type of relationship, whichwould provide the relationship aggregation step with morerefined information to better resolve situations when severalrelationships are possible.

3.5 Relationship aggregationThe final step of our framework optionally performs an ag-gregation of the relationships detected between each pairsof labels, in order to more accurately deduce which relation-ships between labels actually hold. For example, if a rela-tionship r is transitive and the output of the classificationis as follows:

(λa, λb, r, Eab), (λb, λc, r, Ebc)

we can infer the additional relationship (λa, λc, r, Eac), whereEac = Eab ∩ Ebc, provided that Eac 6= ∅. Furthermore, if arelationship between two labels holds across n extents, e.g,

(λa, λb, r, E1ab), (λa, λb, r, E

2ab), . . . , (λa, λb, r, E

nab)

then the relationship also holds on their union (λa, λb, r, E1ab∪

E2ab ∪ . . . ∪ Enab).Aggregating the individual relationships for overlaps be-

tween pairs of labels provides an opportunity to correctthose that have been misclassified. Furthermore adjacentgeo-spatial extents may be connected to also cover the extentbetween them. For instance, the relationship between thelabels holidays and vacaciones may be initially classifiedas an equivalence relationship that holds in many places inSpanish-speaking countries; these equivalence relationshipscould be expanded to cover the entire world, even thoughthese labels may have co-occurred in just a few instances.

While our framework is generic in the sense that varioustypes of relationships could be detected with suitable fea-tures, classifiers and aggregation, in this paper we solely fo-cus on detecting equivalence relationships between tags, andleave the detection of other relationships to future work.

4. IMPLEMENTATIONIn this section we describe one particular instantiation ofour algorithmic framework. We discuss the implementationchoices of each of the steps of the framework presented inthe previous section.

Page 5: Automatic Discovery of Global and Local …...gesting tags that have a strong presence in the area where the photo was taken, de-duplicating tags when they refer to the same canonical

Figure 2: The clusters detected for the tag zionnational-park (ε = 0.0001). The densities within the clusters are notshown. The smallest cluster is centered on Zion NationalPark and has a large prior of 0.92. The largest cluster isalso centered on the park, but has a smaller prior of 0.07.The third cluster is centered on Dixie National Forest,albeit only with a prior of 0.01.

4.1 Data representationAs explained in Section 3.1, there are several ways to gen-erate an intermediate representation of the raw data thatis amenable for detecting overlaps and for generating fea-tures. In this work, we represent the data via a probabilisticgenerative model, which allows us to model the data as aprobability density function. We adopt the Gaussian mix-ture model (GMM) [33] to describe the data, in particularbecause it provides a sound statistical framework for theapproximation of unknown distributions that are not neces-sarily Gaussian themselves [26].

To obtain the spatial representation of a label we firstfit a Gaussian mixture model – this can be done for multi-ple labels in parallel – in order to obtain one or more mix-tures. Each mixture component represents a different sub-population within the distribution of the data for a label,and is described by a mean µ, a covariance matrix Σ anda prior α that can be considered the “height” of its under-lying Gaussian. To fit the model to the data, we employthe Expectation-Maximization (EM) [6] algorithm. Thereare various options to choose the hyper-parameter k of themodel (the number of mixture components). One approachis to use the Bayesian information criterion (BIC) [37]. Forsimplicity, we choose a practical iterative approach based oncross-validation as implemented in WEKA [13]. In each iter-ation, we increase the number of components by one and wecompute the log-likelihood of the model on a 10-fold cross-validation. If the log-likelihood increases, we repeat the pro-cedure, otherwise we stop.

In line with our earlier terminology we will refer to mix-tures as “clusters” henceforth. The spatial representation ofa label Cλ is then the set formed by all of its clusters, i.e.

Cλ , {c}, where each cluster c has label λ and is representedby the tuple (µ,Σ, α), see Figure 2 for an example.

4.2 Overlap detectionSince the values of the Gaussian function never become zero,any cluster represented as a mixture of Gaussians always

Figure 3: The clusters detected for the tags zionnation-alpark and zionnp (ε = 0.0001). The densities within theclusters are not shown.

overlaps with every other cluster. However, in practice, thevalue of the Gaussian function rapidly decreases to zero andbecomes negligible a few standard deviations away from themean. We therefore apply a threshold ε to limit the extentof each cluster c ∈ Cλ to the area bounded by the level curveLc = {(x, y) | N (µc,Σc) = ε/αc}. This thresholding makespairwise comparison of distributions feasible by pruning un-related clusters.

The cluster prior αc affects the level curve of a cluster,since the level curve will encompass a larger area given ahigher prior than the level curve given a lower prior, dueto its more influential Gaussian. The level curve of a clus-ter may be the empty set when all of its densities are be-low the threshold; such clusters are unlikely to significantlycontribute to any kind of relationship and can be filteredout. For each pair of labels λa and λb we can now deter-mine all geo-spatial overlaps between their associated clus-ters Cλa and Cλb by checking for intersections between thelevel curves of their clusters, yielding zero or more overlaps.As mentioned earlier, each overlap is represented by a tu-ple (λa, Ca, λb, Cb, Eab), where Ca ⊆ Cλa and Cb ⊆ Cλb ,and where Eab is the extent of the intersection between theclusters Ca and Cb, see Figure 3 for an example.

4.3 Feature generationFor each overlap we extract a number of overlap-based, extent-based and density-based features from the union of involvedclusters Ma =

⋃c | c ∈ Ca and Mb =

⋃c | c ∈ Cb, and from

their intersection Mab = Ma

⋂Mb.

Overlap-based features. Following the work in the areaof qualitative spatial reasoning, we express the overlaps us-ing region connection calculus. The RCC8 model [4] distin-guishes between eight different types of region connectionsthat express all possible ways regions can interact with eachother, such as partial overlapping or touching as is illus-trated in Figure 4; the model can also be applied when mul-tiple regions are involved. Per overlap we first compute thecombined level curves La from Ma and Lb from Mb beyondwhich the distributions have value zero, after which we canthen compare the areas delineated by the level curves to as-sign one of the RCC8 region connection types to the overlap.

Page 6: Automatic Discovery of Global and Local …...gesting tags that have a strong presence in the area where the photo was taken, de-duplicating tags when they refer to the same canonical

Figure 4: The RCC8 connections. DC = disconnected,EC = externally connected, PO = partially overlapping,EQ = equal, TPP(i) = tangential proper part (inverse),and NTTP(i) = non-tangential proper part (inverse).

Extent-based features. We compute several features fromthe area covered by Ma, Mb and Mab. In order to do so, wefirst represent the data for each Mi as a two-dimensional bi-nary histogram fi(x, y), where a histogram bin is activatedwhen its corresponding location falls inside the area en-closed within the combined level curve. All three histogramsfa(x, y), fb(x, y) and fab(x, y) span the same geographicarea. For computational reasons we resample the geographicarea to fit a histogram with a maximum dimension of 2000bins, either widthwise or lengthwise depending on the hor-izontal or vertical orientation of the geographic area. Fromeach of the three histograms we then first compute the sec-ond and third order complex translation, rotation and scal-ing invariant moments [10]. Translation, rotation and scalinginvariance allows for a robust description of the shapes of Ma

and Mb and their intersection Mab by ignoring their hori-zontal and vertical displacements, orientational differencesand their disparity in size. The second order moments cap-ture the distribution of the mass of each shape with respectto the mean, while the third order moments capture theirskewness. In addition to these moments we further calculatethe sizes of the areas and compute the Jaccard index, i.e.the area of the intersection Mab divided by the area of theunion Ma

⋃Mb.

Density-based features. We compute the same transla-tion, rotation and scaling invariant moments as above, butthis time we use the actual densities of the clusters ratherthan their shapes in the histograms fi(x, y). We furthermoreinclude the location and value of the maximum density andthe average density per histogram cell. Finally, we computethe Jensen-Shannon divergence between Ma and Mb, to cap-ture how similar their density distributions are.

In total, we generate 8 overlap-based features, 28 extent-based features and 40 density-based features, yielding a totalof 76 features per overlap, see also Table 1.

4.4 Relationship classificationTo find an equivalence relationship, we set up the classi-fication task as a binary classification problem, where thepositive class represents the existence of the equivalence re-lationship and the negative class its absence. While in prin-ciple any binary classifier can be used for the task, we chooseto use tree-based classifiers to gain insight on the relative im-portance of the features. In particular, we experiment stan-dard decision trees, with random forests and with gradientboosted decision trees. We use WEKA [13] for the decisiontrees and the random forests, and an in-house implementa-tion for the gradient boosted decision trees. We present theresults of the classification task in Section 5.

4.5 Relationship aggregationEquivalence is a transitive relationship, thus given the bi-nary relationships uncovered in the previous step, we cancompute their transitive closure and therefore discover morerelationships, as was also earlier described in Section 3.5. Re-call that the aggregated relationship will hold on the unionof the intersections of the original extents. Since we evaluateour algorithm on a selection of tags that were in part ran-domly selected, the number of transitive relationships dis-covered in this manner is therefore small in our dataset. It isnevertheless valuable to analyze the transitive aggregationfrom a qualitative point of view; we present examples fromour dataset in Section 5.3.

5. EXPERIMENTSWe experiment with our relationship discovery frameworkand validate it on the specific task of finding synonyms.Synonyms represent a very well-defined subset of equiva-lence relationships, which is easy to assess. However, manu-ally building a ground truth for such a task is a tedious job.For this reason, we devise a procedure to generate it semi-automatically, so that we only need to check the synonymsand non-synonyms identified by the evaluated algorithmsfor correctness. Before presenting the experimental results,in the next section we describe the dataset used in our ex-periments and the procedure used to generate the groundtruth.

5.1 Dataset

Flickr. Our Flickr dataset contains a sample of over 56 mil-lion geo-referenced images uploaded to Flickr before the endof 2010. Each photo is represented by a geographic loca-tion, indicated by longitude and latitude, and one or moretags assigned to the photo by the user. We consider eachinstance of a tag associated with a photo as a label and an-notate each label with the location information of its sourcephoto. A single photo may thus generate multiple instancesof annotated labels.

We performed sanitization on the data by removing allnon-latin characters, reducing all remaining latin charactersto their lowercase representation and removing all diacritics,so that tags like Espa~na and Gaudı become espana and gaudi

respectively. We removed tags that referred to years, such as2006 and 2007, as well as camera manufacturer names, suchas canon and nikon, because these at times get automat-ically added by capture devices or photo applications andthus are not representative of the tagging behavior of users.

We additionally removed infrequently used tags by ensur-ing we have at least 50 instances per tag occurring aroundthe world. While infrequently used tags could yield a clusterthat characterizes an area, Sigurbjornsson and van Zwol [39]show that the tag frequency distribution in Flickr follows apower law where the long tail contains words categorized asoccurring incidentally, making it unlikely for such tags togenerate a good cluster.

Ground truth. To generate the ground truth, we first ex-tracted all (inter-language and intra-language) synonymsfrom the BabelNet multilingual ontology [28] and appliedthe same sanitization described above to each synonym. Toavoid any ambiguity in interpretation we retained only thoseterms that were assigned to a single concept. We computed

Page 7: Automatic Discovery of Global and Local …...gesting tags that have a strong presence in the area where the photo was taken, de-duplicating tags when they refer to the same canonical

Table 1: The overlap-based (top), extent-based (middle) and density-based (bottom) features used by our algorithm.

DC Ma and Mb do not overlap.

EC Ma and Mb touch each other.

PO Ma and Mb partially overlap.

EQ Ma and Mb exactly overlap.

TPP Ma is a tangential proper part of Mb.

TPPi Mb is a tangential proper part of Ma.

NTPP Ma is a non-tangential proper part of Mb.

NTPPi Mb is a non-tangential proper part of Ma.

Jaccard Jaccard index of Ma and Mb.

Area{a,b,ab} Area sizes of the extents of Ma, Mb and of their intersection Mab, respectively.

Φe(1, 1){r,i}{a,b,ab} Second order invariant (real and complex) moments [10] of the extents of Ma, Mb and Mab.

Φe(2, 0){r,i}{a,b,ab} Second order invariant (real and complex) moments [10] of the extents of Ma, Mb and Mab.

Φe(2, 1){r,i}{a,b,ab} Third order invariant (real and complex) moments [10] of the extents of Ma, Mb and Mab.

Φe(3, 0){r,i}{a,b,ab} Third order invariant (real and complex) moments [10] of the extents of Ma, Mb and Mab.

JensenShannon Jensen-Shannon divergence of Ma and Mb.

Sum{a,b,ab} Cumulative densities of Ma, Mb and Mab.

Avg{a,b,ab} Average densities of Ma, Mb and Mab.

MaxX(0, 0){a,b,ab} Normalized longitude bin (0 − 1) of maximum density in Ma, Mb and Mab.

MaxY (0, 0){a,b,ab} Normalized latitude bin (0 − 1) of maximum density in Ma, Mb and Mab.

MaxV (0, 0){a,b,ab} Maximum density values of Ma, Mb and Mab.

Φd(1, 1){r,i}{a,b,ab} Second order invariant (real and complex) moments [10] of the densities of Ma, Mb and Mab.

Φd(2, 0){r,i}{a,b,ab} Second order invariant (real and complex) moments [10] of the densities of Ma, Mb and Mab.

Φd(2, 1){r,i}{a,b,ab} Third order invariant (real and complex) moments [10] of the densities of Ma, Mb and Mab.

Φd(3, 0){r,i}{a,b,ab} Third order invariant (real and complex) moments [10] of the densities of Ma, Mb and Mab.

the intersection between the selected terms and the tag vo-cabulary in Flickr, ensuring that for each concept at leasttwo terms had been used as photo tags. We then manuallyinspected all terms per concept to validate whether theytruly were synonyms and discarded all terms whose meaningwas (possibly) ambiguous. For example, kivu and lakekivu

are considered synonyms in BabelNet for the concept “LakeKivu”, even though in reality the former tag describes amuch larger region that encompasses the lake; we thus didnot consider these two terms as synonyms. Ultimately, wegenerated 1240 examples. We formed an equal-sized set ofnon-synonym tags by including a portion of the terms wediscarded earlier and by randomly sampling the remainderfrom the tag vocabulary that were not in the synonym set.

As described in Section 4, we proceeded to fit a Gaussianmixture model to the data instances of each of the synonymsand non-synonyms to obtain one or more clusters per tag,and then detected the overlapping clusters. Whenever theclusters of a pair of tags generated multiple overlaps we onlyretained the one where the cumulative priors were highestof all clusters that were part of the overlap. Finally, we com-puted the features for all clusters involved in the remainingoverlaps, and we assigned the class positive if both tags inthe pair were present in BabelNet as a synonym and neg-

ative otherwise. We determined the overlaps for varyingvalues of the threshold ε. In case the pair of tags did notgenerate overlapping clusters, the overlap-based feature wasset to DC and the extent features and density features ofthe intersection were set to zero.

5.2 Empirical ResultsWe evaluate three different classifiers on the ground truth forseveral values of the threshold ε: decision tree (DT), randomforests (RF) and gradient boosted decision trees (GBDT).

For each classifier and value of ε we run a 10-fold cross val-idation and average results across the folds. For DT we setthe minimum number of instances in each leaf to 10 and theconfidence factor for pruning to 0.10. For RF we use 50 trees,and for GBDT 20 trees. These settings were chosen empir-ically in a preliminary test in order to reduce over-fitting.We leave all other settings at their default value.

Table 2 summarizes the results. We report the positiverate (TPR), the false positive rate (FPR) and the area underthe ROC curve (AUC). The TPR and FPR are obtained atthe optimal point of operation on the ROC curve by givingequal weights to misclassifications. For all three classifiersthe AUC is high, where the most powerful classifiers in ourevaluation setting is RF, which is able to reach an AUC of0.85 for ε = 0.001. While the GBDT obtains the best TPR,at the same time the FPR is highest, yielding a lower AUCthan the RF and one that is comparable to the DT.

When inspecting the generated clusters and their overlapsfor different values of ε, we observe that higher values lead tofewer and more spatially constrained clusters that less fre-quently overlap, whereas the reverse is the case for smallervalues. Intuitively, the classification performance should in-crease for higher thresholds of ε, since any remaining clus-ters that do overlap have strong spatial support and similarspatial footprints, and thus are more likely to refer to syn-onyms than to non-synonyms. Yet, we notice that the per-formance of the classifiers is relatively stable across all eval-uated thresholds ε, indicating that the clusters formed forsynonymous tags do not always have strong spatial supportnor may they necessarily overlap. To this end we investigatewhich features are more important than others for a correctclassification.

Feature importance. By analyzing our model from theGBDT classifier, we can infer which features contribute the

Page 8: Automatic Discovery of Global and Local …...gesting tags that have a strong presence in the area where the photo was taken, de-duplicating tags when they refer to the same canonical

Rel

ativ

e im

porta

nce

0

20

40

60

80

Max

Yb

Max

Xb

Max

Xab

Max

Ya

ɸ(2

,1)r,

b

ɸ(2

,1)r,

a

Max

Va

Jens

enSh

anno

n

Max

Yab

ɸ(3

,0)r,

a

Figure 5: Estimated relative importance of the top-10features for GBDT.

most to the classification, as described by Friedman [11].Across all models, all of the features in the top-10 are den-sity based, see Figure 5. Six of these features refer to therelative location and height of the peaks of the clusters andtheir intersection, one feature refers to the Jensen-Shannondivergence and the remaining three features are second andthird order image density moments. Interestingly, when welook beyond the top-10, the overlap-based features from theRCC8 model seem to contribute little to the classification,which may be due to the mismatch between the GMM andthe model for which the region connection calculus was origi-nally devised, and deserves further study. Overall our resultsindicate that the Gaussian mixtures model is very suitablefor modeling geo-spatial labeled data and that the featuresextracted from the clusters and their intersection, primarilythe density-based features, are sufficiently informative forthe classification task.

Comparison. We implement a baseline based on spatialtag correspondences to place the results of our method intocontext. Here, we discretize the world by considering it as alarge histogram, where each cell has a size of ρ◦ longitudeby ρ◦ latitude. For example, ρ = 0.0001 yields cells of about10x10m each, while ρ = 0.01 yields cells of about 1x1km.We quantize the location of each tag instance to a cell in thehistogram, after which we count the number of instances percell. We then represent the global distribution of non-emptycells for a tag by a sparse feature vector, after which wecompare the vectors of pairs of tags using cosine similarity,producing a single feature representing their similarity. Wefinally evaluate the performance of the baseline, again usingthe Decision Tree, Random Forest and Gradient BoostedDecision Tree, and report the same metrics as before.

Table 3 summarizes the results. We can see that the base-line achieves worse results in comparison with our method;our method outperforms the baseline for all three classifiers.We notice that the baseline performs particularly badly forthe GBDT when ρ is small, although even when ρ is largeit still underperforms with respect to the DT and RF; whilethe GBDT classifier is effective in achieving a low FPR, itis not able to produce a high enough TPR to yield a com-petitive performance.

Table 2: Evaluation results for our algorithm for the De-cision Tree (DT), Random Forest (RF) and GradientBoosted Decision Tree (GBDT) classifiers. We report thetrue positive rate (TPR), false positive rate (FPR) andthe area under the curve (AUC) for several parametervalues ε.

ε DT RF GBDT

TPR FPR AUC TPR FPR AUC TPR FPR AUC

0.000001 0.762 0.274 0.759 0.777 0.223 0.832 0.773 0.257 0.7370.000005 0.746 0.254 0.767 0.763 0.237 0.822 0.808 0.287 0.7480.00001 0.729 0.271 0.759 0.776 0.224 0.819 0.818 0.318 0.7520.00005 0.737 0.263 0.773 0.778 0.222 0.823 0.729 0.241 0.7210.0001 0.733 0.267 0.753 0.770 0.230 0.819 0.829 0.409 0.6790.0005 0.733 0.267 0.761 0.771 0.229 0.844 0.869 0.313 0.7370.001 0.762 0.238 0.792 0.787 0.213 0.848 0.848 0.240 0.791

Table 3: Evaluation results for the spatial tag correspon-dence baseline for the Decision Tree (DT), Random For-est (RF) and Gradient Boosted Decision Tree (GBDT)classifiers. We report the true positive rate (TPR), falsepositive rate (FPR) and the area under the curve (AUC)for several parameter values ρ.

ρ DT RF GBDT

TPR FPR AUC TPR FPR AUC TPR FPR AUC

0.0001 0.545 0.455 0.537 0.531 0.469 0.554 0.101 0.014 0.1800.0005 0.592 0.408 0.602 0.587 0.413 0.611 0.205 0.012 0.2900.001 0.612 0.388 0.615 0.615 0.385 0.645 0.252 0.025 0.3540.005 0.663 0.337 0.670 0.658 0.342 0.701 0.350 0.035 0.4850.01 0.697 0.303 0.675 0.688 0.312 0.747 0.402 0.028 0.559

The baseline is an approach that only considers the occur-rences of tags at a global level, and as such it is incapableof detecting equivalence relationships that only hold locallyand nowhere else. In contrast, our method only classifies tagsas being equivalent in the area spanned by the union of theiroverlapping clusters, unless the aggregation step expandsthis area. Furthermore, the size of the cells can considerablyinfluence the accuracy of the classifier, where for very smallcells the number of instances per individual cell may be verysmall and produce many cells per tag, which may result indissimilar feature vectors for equivalent tags. Conversely, forvery large cells the number of instances per individual cellmay be very large and produce few cells per tag, which mayresult in similar feature vectors for non-equivalent tags. Incontrast, our method automatically detects the optimal scaleat which to represent the tag distribution. The advantage ofthe baseline over our method, however, is that it is not ascomplex and is able to detect equivalence relationships thatare globally valid.

5.3 Anecdotal ResultsIn this section we provide several examples of clusters andthe obtained classified overlaps to provide a visual illus-tration of the output of our algorithm and the classifiers.Correctly classified positive examples include the tag pairsbaleari – balearics (an archipelago of Spain) and occupy-

london – occupylsx (the Occupy London movement), whileincorrectly classified positive examples include arcdetriom-

phe – grandearche (two famous landmarks in Paris, France)and nijocastle – nijojo (a castle located in Kyoto, Japan).Incorrectly classified negative examples also occur, such asthe tag pair portknockie – bowfiddlerock (a coastal villagetown in Scotland and its characteristic landmark). We showa few examples in Figure 6. We observe that positive tagpair examples may be misclassified as a result of so-called

Page 9: Automatic Discovery of Global and Local …...gesting tags that have a strong presence in the area where the photo was taken, de-duplicating tags when they refer to the same canonical

Figure 6: Example overlaps between the tag clusters of portknockie – bowfiddlerock (top), baleari – balearics (bottom-left) and occupylondon – occupylsx (bottom-right). The densities within the clusters and overlaps are not shown, but inall three examples the peaks of the densities within the clusters are located in close proximity of each other. The densitypeaks of baleari and balearics are principally located on the islands, whereas those of occupylondon and occupylsx arecentered near St. Paul’s Cathedral where one of the protester camps of the Occupy London movement was based.

“bulk tagging” of photos, where people apply the same setof tags to all photos they took that day, while negative tagpair examples are typically misclassified due to subsumptionrelationships with both having similar tag distributions (e.g.the tags barcelona and spain).

In terms of detecting transitive relationships, the tagslouvre – museedulouvre (referring to the famous museumthe Louvre in Paris) are positively identified as an equiv-alence relationship, as were louvremuseum – museedulou-

vre, both spanning roughly the same geographic area. Thus,we are able to connect the tags louvre – louvremuseum to-gether, whereby their relationship holds in the intersectionof their extents.

Interestingly enough our classifiers pointed out mistakes inour ground truth labeling. For example, what we thought tobe a negative example, barajas – lemd, is classified as pos-itive; it turns out that the tags refer to the shortened nameof Madrid-Barajas Airport and its four-character ICAO air-port code.

6. CONCLUSIONSIn this paper we introduced a novel algorithmic frameworkto automatically discover equivalence relationships in geo-referenced folksonomies. Differently from previous work inthe literature, our method is based on the geo-spatial pat-terns of the tags in the folksonomy, rather than on their co-occurrences in the content. Furthermore, by using a combi-nation of unsupervised and supervised learning, we discoverwell-defined equivalence relationship rather than generic clus-ters of related tags as done previously in the literature. Our

method is generic and can be applied to any labeled datawith geo-spatial coordinates. We validated our frameworkon the task of discovering synonyms from their geo-spatialpatterns. We built a ground truth semi-automatically byleveraging an existing ontology. Our method is able to cor-rectly classify the synonyms with high accuracy, and the bestclassifier achieves an AUC of 0.85.

This work can be extended in several directions. On themodeling part, we plan to explore different data represen-tations, such as non-parametric DP-GMM. Furthermore, inthis paper we used elliptical Gaussians as the basic mix-ture. However, by using a diagonal or full covariance matrix,which allows the mixtures to be more freely placed, morecomplex shapes and more powerful models can be obtained.This characteristic might be useful in discovering relation-ships different from equivalence.

In this work we mainly focused on validating the approachon a known ground truth, however the capabilities of theframework still need to be evaluated further. We plan toperform a large scale evaluation of the output of the classi-fication. Given the size of the task, crowdsourcing the eval-uation of the discovered relationships seems a viable option.

Finally, while in this paper we focused on equivalence re-lationships, we will expand the relationships that can bediscovered to include more different kinds. For example, sub-sumption relationships are a prime candidate that would re-quire minimal modifications to the framework. Furthermore,by including temporal features in our analysis, we will ex-plore the possibility of uncovering periodic relationships inthe data.

Page 10: Automatic Discovery of Global and Local …...gesting tags that have a strong presence in the area where the photo was taken, de-duplicating tags when they refer to the same canonical

7. REFERENCES[1] S. Ahern, M. Naaman, R. Nair, and J. Yang. World

explorer: visualizing aggregate data from unstructuredtext in geo-referenced collections. In Proceedings of the7th ACM/IEEE-CS Joint Conference on DigitalLibraries, pp. 1–10, 2007.

[2] W. Chen, A. Battestini, N. Gelfand, and V. Setlur.Visual summaries of popular landmarks fromcommunity photo collections. In Proceedings of the43rd Asilomar Conference on Signals, Systems andComputers, pp. 782–789, 2009.

[3] Y. Cheng. Mean shift, mode seeking, and clustering.IEEE Transactions on Pattern Analysis and MachineIntelligence, 17(8):790–799, 1995.

[4] A. Cohn, B. Bennett, J. Gooday, and N. Gotts.Qualitative spatial representation and reasoning withthe Region Connection Calculus. GeoInformatica, 1(3):275–316, 1997.

[5] M. Cristani, A. Perina, U. Castellani, and V. Murino.Content visualization and management of geo-locatedimage databases. In Proceedings of the 26thInternational Conference on Human Factors inComputing Systems – Extended Abstracts, pp.2823–2828, 2008.

[6] A. Dempster, N. Laird, and D. Rubin. Maximumlikelihood from incomplete data via the EM algorithm.Journal of the Royal Statistical Society, 39(1):1–38,1977.

[7] D. Deng, T. Chuang, and R. Lemmens.Conceptualization of place via spatial clustering andco-occurrence analysis. In Proceedings of the 2009International Workshop on Location Based SocialNetworks, pp. 49–56, 2009.

[8] M. Ester, H. Kriegel, J. Sander, and X. Xu. Adensity-based algorithm for discovering clusters inlarge spatial databases with noise. In Proceedings ofthe 2nd AAAI International Conference on KnowledgeDiscovery and Data Mining, pp. 226–231, 1996.

[9] V. Estivill-Castro and I. Lee. AUTOCLUST:automatic clustering via boundary extraction formining massive point-data sets. In Proceedings of the5th International Conference on Geocomputation, pp.1–20, 2001.

[10] J. Flusser, T. Suk, and B. Zitova. Moments andmoment invariants in pattern recognition, chapter 2.John Wiley & Sons, Ltd., 2009.

[11] J. Friedman. Greedy function approximation: agradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.

[12] T. Gruber. Ontology of folksonomy: A mash-up ofapples and oranges. International Journal onSemantic Web and Information Systems, 3(1):1–11,2007.

[13] M. Hall, E. Frank, G. Holmes, B. Pfahringer,P. Reutemann, and I. Witten. The WEKA datamining software: an update. ACM SIGKDDExplorations Newsletter, 11(1):10–18, 2009.

[14] C. Hauff and G. Houben. Geo-location estimation ofFlickr images: social web based enrichment. InProceedings of the 34th European Conference onInformation Retrieval, pp. 85–96, 2012.

[15] H. Hotta and M. Hagiwara. A neural-network-basedgeographic tendency visualization. In Proceedings ofthe 2008 IEEE/WIC/ACM International Conferenceon Web Intelligence and Intelligent Agent Technology,volume 1, pp. 817–823, 2008.

[16] H. Hotta and M. Hagiwara. Online geovisualizationwith fast kernel density estimator. In Proceedings ofthe 2009 IEEE/WIC/ACM International Conferenceon Web Intelligence and Intelligent Agent Technology,volume 1, pp. 622–625, 2009.

[17] A. Jaffe, M. Naaman, T. Tassa, and M. Davis.Generating summaries and visualization for largecollections of geo-referenced photographs. InProceedings of the 8th ACM International Workshopon Multimedia Information Retrieval, pp. 89–98, 2006.

[18] C. Jones, R. Purves, P. Clough, and H. Joho.Modelling vague places with knowledge from the Web.International Journal of Geographical InformationScience, 22(10):1045–1065, 2008.

[19] L. Kennedy and M. Naaman. Generating diverse andrepresentative image search results for landmarks. InProceedings of the 17th International Conference onWorld Wide Web, pp. 297–306, 2008.

[20] C. Kling, J. Kunegis, S. Sizov, and S. Staab.Detecting non-gaussian geographical topics in taggedphoto collections. In Proceedings of the 7th ACMInternational Conference on Web Search and DataMining, pp. 603–612, 2014.

[21] T. Knerr. Tagging ontology - towards a commonontology for folksonomies. 2006.

[22] F. Limpens, F. Gandon, and M. Buffa. Bridgingontologies and folksonomies to leverage knowledgesharing on the social web: a brief survey. InProceedings of the 23rd IEEE/ACM InternationalConference on Automated Software Engineering –Workshops, pp. 13–18, 2008.

[23] M. Lux and G. Dosinger. From folksonomies toontologies: employing wisdom of the crowds to servelearning purposes. International Journal of Knowledgeand Learning, 3(4):515–528, 2007.

[24] J. MacQueen. Some methods for classification andanalysis of multivariate observations. In Proceedings ofthe 5th Berkeley Symposium on MathematicalStatistics and Probability, volume 1, 1967.

[25] A. Maedche and S. Staab. Mining ontologies fromtext. In Proceedings of the 12th European Workshopon Knowledge Acquisition, Modeling and Management,pp. 189–202, 2000.

Page 11: Automatic Discovery of Global and Local …...gesting tags that have a strong presence in the area where the photo was taken, de-duplicating tags when they refer to the same canonical

[26] J. S. Marron and M. P. Wand. Exact mean integratedsquared error. The Annals of Statistics, 20(2):712–736,1992.

[27] P. Mika. Ontologies are us: a unified model of socialnetworks and semantics. Web Semantics: Science,Services and Agents on the World Wide Web, 5(1):5–15, 2007.

[28] R. Navigli and S. Ponzetto. BabelNet: the automaticconstruction, evaluation and application of awide-coverage multilingual semantic network.Artificial Intelligence, 193:217–250, 2012.

[29] A. Passant. Using ontologies to strengthenfolksonomies and enrich information retrieval inweblogs. In Proceedings of the 1st AAAI InternationalConference on Weblogs and Social Media, 2007.

[30] A. Passant. Meaning of a tag: a collaborativeapproach to bridge the gap between tagging andlinked data. In Proceedings of the 2008 Linked Dataon the Web Workshop, 2008.

[31] C. Rasmussen. The infinite Gaussian mixture model.In Proceedings of the 12th International Conference onAdvances in Neural Information Processing Systems,pp. 554–560, 1999.

[32] T. Rattenbury, N. Good, and M. Naaman. Towardsautomatic extraction of event and place semanticsfrom Flickr tags. In Proceedings of the 30th ACMInternational Conference on Research andDevelopment in Information Retrieval, pp. 103–110,2007.

[33] R. Redner and H. Walker. Mixture densities,maximum likelihood and the EM algorithm. SIAMreview, 26(2):195–239, 1984.

[34] M. Rosenblatt. Remarks on some nonparametricestimates of a density function. The Annals ofMathematical Statistics, 27(3):832–837, 1956.

[35] S. Rudinac, A. Hanjalic, and M. Larson. Findingrepresentative and diverse community contributedimages to create visual summaries of geographic areas.In Proceedings of the 19th ACM InternationalConference on Multimedia, pp. 1109–1112, 2011.

[36] P. Schmitz. Inducing ontology from Flickr tags. InProceedings of the Collaborative Web TaggingWorkshop, 2006.

[37] G. Schwarz. Estimating the dimension of a model.The annals of statistics, 6(2):461–464, 1978.

[38] P. Serdyukov, V. Murdock, and R. van Zwol. PlacingFlickr photos on a map. In Proceedings of the 30thACM International Conference on Research andDevelopment in Information Retrieval, pp. 484–491,2009.

[39] B. Sigurbjornsson and R. van Zwol. Flickr tagrecommendation based on collective knowledge. InProceedings of the 17th International Conference onWorld Wide Web, pp. 327–336, 2008.

[40] R. Sinha. A cognitive analysis of tagging, 2005.

[41] S. Sizov. GeoFolk: latent spatial semantics in Web 2.0social media. In Proceedings of the 3rd ACMInternational Conference on Web Search and DataMining, pp. 281–290, 2010.

[42] O. van Laere, S. Schockaert, and B. Dhoedt. Findinglocations of Flickr resources using language modelsand similarity search. In Proceedings of the 1st ACMInternational Conference on Multimedia Retrieval,2011.

[43] T. Vander Wal. Folksonomy Coinage and Definition,2007.

[44] K. Yanai, H. Kawakubo, and B. Qiu. A visual analysisof the relationship between word concepts andgeographical locations. In Proceedings of the 8th ACMInternational Conference on Image and VideoRetrieval, 2009.

[45] H. Zhang, M. Korayem, E. You, and D. Crandall.Beyond co-occurrence: discovering and visualizing tagrelationships from geo-spatial and temporalsimilarities. In Proceedings of the 5th ACMInternational Conference on Web Search and DataMining, pp. 33–42, 2012.