Bottom-Up Gazetteers: Learning from the Implicit Semantics of Geotags

Carsten Keßler, Patrick Maué, Jan Torben Heuer and Thomas Bartoschek

Institute for Geoinformatics, University of Münster, Germany

carsten.kessler|patrick.maue|jan.heuer|bartoschek@uni-muenster.de

Abstract. As directories of named places, gazetteers link the names to geographic footprints and place types. Most existing gazetteers are managed strictly top-down: entries can only be added or changed by the responsible toponymic authority. The covered vocabulary is therefore often limited to an administrative view on places, using only official place names. In this paper, we propose a bottom-up approach for gazetteer building based on geotagged photos harvested from the web. We discuss the building blocks of a geotag and how they relate to each other to formally define the notion of a geotag. Based on this formalization, we introduce an extraction process for gazetteer entries that captures the emergent semantics of collections of geotagged photos and provides a group-cognitive perspective on named places. Using an experimental setup based on clustering and filtering algorithms, we demonstrate how to identify place names and assign adequate geographic footprints. The results for three different place names (Soho, Camino de Santiago and Kilimanjaro), representing different geographic feature types, are evaluated and compared to the results obtained from traditional gazetteers. Finally, we sketch how our approach can be combined with other (for example, linguistic) approaches and discuss how such a bottom-up gazetteer can complement existing gazetteers.

1 Introduction and Motivation

The amount of geotagged user-generated content on the Social Web has been soaring in recent years. Cheaper and smaller GPS chips as well as easy-to-use tools for manual geotagging have led to a sharp increase, particularly in the number of geotagged photos. The sheer amount of geotagged pictures – currently over 100 million on Yahoo’s Flickr service alone¹ – makes them a very attractive source for geographic information retrieval [1,2]. As such, geotagged photos can be regarded as an implicit kind of Volunteered Geographic Information (VGI) [3]. Merging professional data sources with such VGI is attractive for a number of reasons, such as rapid updates and enrichment with data typically not contained in professional data sets. Examples include the extraction of footprints [1] and grounding of vague geographic terms [4] such as downtown Mexico City or mapping of non-geographic terms [5] to determine the regional use of words like soda or pop [6].

1 According to http://blog.flickr.net/2009/02/05/.

One promising use of VGI – and geotagged photos in particular – is the enrichment of gazetteers with vernacular names and vague places [7]. Gazetteers have been developed as directories of named places with information on geographic footprints and place types to facilitate geographic information organization and retrieval. Most gazetteers follow a strict top-down approach, i.e., the gazetteer data is administered by the organization running the gazetteer. Only this toponymic authority can add places or place types to the gazetteer and correct erroneous entries, which slows down updates and hampers the inclusion of local and often tacit knowledge. Moreover, in most gazetteers information on geographic footprints is limited to a single coordinate pair, representing the centre of a city, administrative district or street. Extraction of footprints from geotagged information on the web is thus a promising way to automatically generate polygonal footprints for these gazetteer entries. Although a number of approaches have been developed for this task [5,8,9,10], they are hardly implemented in existing gazetteers. Apart from the GeoNames gazetteer², which complements its database with geotagged information from Wikipedia, strict top-down management of gazetteers is still prevalent.

In this paper, we present an approach to build gazetteers entirely from volunteered geographic information. We discuss the challenges posed by automatically establishing the foundations of such a gazetteer based on geotagged photos harvested from the web. The implemented algorithms for retrieving geotags and clustering the corresponding locations to generate footprints are well-established. However, the emergent semantics [11] of such a collection of geotagged photos is still largely unspecified. Hence, the main contribution of this paper will be the formal definition of geotags. We explain the relation between the attached label (tag) and the information objects like a photo, its label’s author, as well as creation time and coordinates. We discuss the implicit semantics hidden in this relation, and how gazetteer entries can emerge from collections of such geotags using the presented implementation.

Inferred knowledge about places from a source like geotagged photos – usually tagged with subjective keywords – can be seen as a social knowledge building process [12, chapter 9]. Ideally, this process leads to a representation of the group cognition [12] and can thus be regarded as a cognitive engineering [13] process which lets traditional GI applications benefit from the Wisdom of the Crowds [14]. Gazetteers exposing the collaborative perspective on place differ significantly from traditional gazetteers with administrative focus [15]. It is thus not the aim of this research to replace today’s gazetteers, which have already proven useful for countless applications building on geocoding, geoparsing and natural language processing. Instead, we argue for a separation of these different views into separate gazetteers, which can then be accessed through a gazetteer infrastructure as outlined in [7,16].

2 See http://www.geonames.org.

In order to demonstrate the feasibility of our approach, we have set up an application which retrieved and processed geotags associated with photos published on Flickr, Panoramio and Picasa³. While there is also other geotagged content online such as videos, blog posts or Wikipedia entries, we chose to limit this experiment to photos. Photos are inherently related to the real world, since every photo has been taken somewhere. Moreover, as mentioned above, there is already a substantial amount of geotagged photos available online. By analyzing the coordinate pairs attached to the pictures, the time they were taken as well as the tags added by their owners, we are able to compute geographic footprints representing specific keywords. The collection of these keywords, derived from all tags of all retrieved photos, is further analyzed to differentiate between toponyms and tags without spatial relation. We test a repository built up this way with queries for Soho, Camino de Santiago (Way of St. James) and Kilimanjaro. We compare the results to those obtained from the same query on GeoNames. This evaluation focuses on the question of whether our bottom-up gazetteer can already compete with established gazetteers in terms of completeness and accuracy of geographic footprint.

The next section points to relevant related work. Section 3 introduces a formal definition of geotags and establishes the relation between gazetteers and geotags. Section 4 describes the crawling and filtering approach implemented in the prototype. Section 5 analyzes the results obtained for the three exemplary queries, followed by conclusions and an outlook on potential applications and future work in Section 6.

2 Related Work

This section points to related work from gazetteer research, tagging and bottom-up generation of geographic information.

2.1 Gazetteer Building and Learning

Gazetteers are knowledge organization systems that consist of triples (N, F, T), where N corresponds to the place name, F to the geographic footprint and T to the place type [17]. Since neither N, F nor T are unique, all three components are required to fully represent and unambiguously identify a named place [17, p. 92]. In the context of gazetteers, a clear distinction is made between place as a social construct based on perceivable characteristics or convention [18], and the actual real-world feature it refers to [19]. Feature types are mostly organized in semi-formal thesauri with natural language descriptions. Recent research demonstrates how gazetteers could benefit from more rigorous, formal place type definitions [16] and develops methods for gazetteer conflation [20].
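To make the ambiguity argument concrete, the triple can be sketched as a small record type. This is only an illustration: the class name `GazetteerEntry`, the simplification of F to a single coordinate pair and the rough coordinates are assumptions made for this example, not taken from any existing gazetteer.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GazetteerEntry:
    """Hypothetical record for one gazetteer triple (N, F, T)."""
    name: str                        # N: place name
    footprint: Tuple[float, float]   # F: simplified to (latitude, longitude)
    place_type: str                  # T: place type


# Rough, illustrative coordinates only.
entries = [
    GazetteerEntry("Soho", (51.51, -0.13), "city district"),   # London
    GazetteerEntry("Soho", (40.72, -74.00), "city district"),  # New York
]

# Name and type coincide, so only the footprint tells the entries apart.
same_name_and_type = [e for e in entries
                      if (e.name, e.place_type) == ("Soho", "city district")]
distinct_footprints = {e.footprint for e in same_name_and_type}
```

Here a lookup by name and type alone returns both entries, which is why the full triple is needed to identify a named place.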

3 See http://flickr.com/, http://panoramio.com/ and http://picasaweb.com/.

Existing gazetteers have generally been developed based on databases provided by administrative authorities, or by merging existing gazetteers [17]. More recently, the ever-growing amount of information available on the web has been identified as a promising resource of knowledge about named places. Jones et al. [1] introduce a linguistic approach to enrich gazetteers with knowledge about vague places. They use documents harvested via web search and analyze them for co-occurrences of vague place names with more precise co-located places. In another linguistics-based approach presented by Uryupina [21], a bootstrapping algorithm is applied to automatically classify places into predefined categories (e.g. city, mountain). The machine learning techniques employed in this research achieved a high precision of about 85%, despite the comparatively small training data sets of only 100 samples per category. Henrich and Lüdecke [5] introduce a process based on the results retrieved from a web search engine to derive geographic representations for both geographic and non-geographic terms at query time. Goldberg et al. [22] developed an agent-based system that crawls structured online sources such as the USPS zip code database and online phone books. The authors demonstrate that this approach is capable of creating detailed regional, land-parcel level gazetteers with a high degree of completeness.

2.2 User-generated Geographic Information

Online mapping tools with open APIs such as Google Maps have enabled the creation of the huge amounts of user-generated geographic information – also dubbed collaborative [23] or volunteered GI (VGI) [3] – in the first place. While this mainly refers to projects like OpenStreetMap⁴, we argue that geotags, and more importantly the geographic footprints derived from them, can also be filed into this category. Similar approaches have already been sketched in previous research to derive landscape regions [24] or imprecise definitions of boundaries of urban neighborhoods [8] from such geotagged content. We build on this previous work and show how geographic information collected this way can be processed for the integration with existing gazetteers.

3 What is a Geotag?

We have introduced geotags as particular examples of volunteered geographic information. Before discussing the idea of inferring semantics from the geotag, we are going to formally define it.

3.1 Tagging Geographic Information Objects

4 See http://www.openstreetmap.org/.

Humans adding items like pictures to their collections use individual ordering schemes (besides time) to group similar items, keep different items apart and consequently simplify recovery. We order books in our (real) book shelf according to various criteria, including topic, age, thickness, or even color. Such individual preferences re-appear in virtual collections. Using tags – words or combinations of words people associate with virtual items – is a well accepted approach to sort items on the virtual shelf. Tags, however, can vary significantly from person to person. The formal definition of a tag therefore has to include both the user and the tagged information object. Gruber [25] suggests modeling the tag as the process Tagging = (L, U, I, S), which establishes an immediate relation between the Label L coming from the User’s (U) vocabulary and the information Item I it is associated with. This definition includes a Source S, which enables sharing across applications. In the following, we leave this source aside, since it has no direct impact on the presented approach. The following rule states that, if a label is associated with an item by some user, it is regarded as a tag. More importantly, it also states that a tag is always bound to its author and the item:

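$$\forall l \,\bigl( \mathit{Label}(l) \land \exists i \,( \mathit{Item}(i) \land \mathit{associatedTo}(l, i)) \land \exists u \,( \mathit{User}(u) \land \mathit{createdBy}(l, u)) \rightarrow \mathit{Tag}(l) \bigr) \tag{1}$$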

Any information object which is inherently hard to classify – basically all non-textual information – requires a solution for its categorization. Tagging is commonly accepted for such contents, such as photos or videos, but also for bookmarks, scientific articles, and many more. In the remainder of this paper, we focus on photos with an identifiable geographical context, e.g. a picture of La Catedral in Mexico City. The items in question are therefore related to objects in the geographic landscape [26]. Goodchild’s “geographic reality” [27] as formal definition of geographical information takes the spatio-temporal nature of the physical (field-based) reality into account. Humans, however, do not perceive reality as continuous fields. They identify individual objects, either directly or indirectly by looking at photos created by camera sensors.

In this World of Individual Objects [26] we only consider particulars (entities existing in space and time) with an observable spatial and temporal extension. Objects on the photo have per se no meaning; in Frank’s World of Socially Constructed Reality we eventually associate semantics to be able to reference the particulars [28] in spoken language. Such a reference can either be a proper name, which is used as unique identifier [29], e.g., Catedral Metropolitana de la Ciudad de México, or it links to a category⁵ which groups objects sharing common properties, e.g. cathedral. We finally identify individual particulars according to their spatial or temporal characteristics, by either referring to complex objects (e.g., downtown) or to the homogeneous spatial or temporal region the object is a proper part of, e.g., Mexico City. So far, this follows the definition of gazetteer entries from Section 2.1. The place type T and place name N in the discussed triple (N, F, T) both refer to the particular’s semantics; the geographic footprint F, on the other hand, is related to its spatial extension in physical reality.

5 The reference is then again the proper name of the object’s type.

The same applies to the labels used to tag a photo, which function as references to particulars in geographic space. The nature of this reference, however, cannot be explicitly described: although it appears to be obvious for the mentioned proper names or category names, most tags associated with photos do not have an objective relation to the geographic object. The label vacation09 makes perfect sense for the user, who might have sorted all pictures of his Mexico trip using this tag. Once the items are shared, however, such personal tags lose any usefulness. Other examples which have no immediate relation to the depicted particular are labels naming properties of the item itself (e.g. blue, high-resolution), the process of creating the item (e.g. nikon), its potential use (e.g. wallpaper), or simply the author’s opinion (interesting). Note that we assume that it is the user’s intention to improve the item’s findability; hence, we do not expect to encounter deliberate errors (which is obviously not true in real world settings; we propose an effective solution for this problem in Section 3.3). Once we have identified the references, we can use them to locate the referred-to object in space and time. The following rule makes this dependency between the tag and its role as reference to the depicted particular explicit:

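$$\forall l \,\exists i \,\bigl( \mathit{Tag}(l) \land \mathit{Item}(i) \land \mathit{associatedTo}(l, i) \land \exists p \,( \mathit{Particular}(p) \land \mathit{represents}(i, p)) \rightarrow \mathit{refersTo}(l, p) \bigr) \tag{2}$$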

The rule does not (and cannot) further specify the reference type. Taking our example of the cathedral, the label Catedral Metropolitana is immediately referencing – here as proper name – the particular. We can then further specify the tag as a proper name:

$$\forall l \,\exists p \,\bigl( \mathit{Tag}(l) \land \mathit{Particular}(p) \land \mathit{names}(l, p) \rightarrow \mathit{ProperName}(l) \bigr) \tag{3}$$

The open question here is obviously how to infer if the label is a proper name and, even more importantly, how to ensure that it is really the proper name of the depicted geographic object. The clustering and filtering approach introduced in the next sections provides answers to both questions.

Labels like Mexico or Summer 2009 are indirect references. They point to a region containing the particular (spatially and temporally, respectively). The following rule formalizes our assumption that, if the tag is a toponym referring to a certain geographic region, we can infer that our depicted object is spatially related to that region:

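$$\forall l \,\exists p \,\bigl( \mathit{Tag}(l) \land \mathit{Particular}(p) \land \mathit{refersTo}(l, p) \land \exists r \,( \mathit{GeographicRegion}(r) \land \mathit{names}(l, r)) \rightarrow \mathit{spatiallyRelated}(p, r) \bigr) \tag{4}$$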

We can only assume that there is a spatial relation between the depicted particular and the place name. By looking only at the labels we cannot infer what kind of spatial (or temporal, for that matter) relation exists, and hence what spatial character this specific label has. In the following section we introduce the concept of a geotag as an extension of the traditional tag. Geotags give us the opportunity to make use of geographic coordinates and points in time to identify the spatio-temporal character of the associated labels.

3.2 A Formal Definition of Geotag

The tagging process establishes the relation between the user, the information item, and the label. If the information item represents one or more geographic objects, the associated label may (but does not have to) refer to either dimension of the depicted object: either its semantics (including a proper name of the individual or category) or its spatio-temporal extension (naming, for example, the containing region). A geotag extends the notion of the tag by adding an explicit location in space and time to the information item. In the case of digital photos, a time stamp with the creation date is usually added by the camera automatically. Geographic coordinates are either provided by built-in GPS modules, or added manually by the user. Building on Gruber’s definition of tagging as a relation, we add the time stamp T and the coordinates C to the relation (and omit the source S): Geotagging = (L, U, C, I, T). By extending our rule-based definition of a tag (Eq. 1), the following rule reclassifies a label as a geotag:

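$$\forall l \,\exists i \,\bigl( \mathit{Label}(l) \land \mathit{Item}(i) \land \mathit{associatedTo}(l, i) \land \exists c \,( \mathit{Coordinate}(c) \land \mathit{associatedTo}(c, i)) \land \exists t \,( \mathit{Timestamp}(t) \land \mathit{associatedTo}(t, i)) \land \exists u \,( \mathit{User}(u) \land \mathit{createdBy}(l, u)) \rightarrow \mathit{Geotag}(l) \bigr) \tag{5}$$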

Note that we do not assume that a label reclassified as geotag is per se a place name. The tag blue is not necessarily related to the depicted object, nor does it have a spatial or temporal character. In our understanding, it is still a geotag, since it is the label used by one user on some occasion to tag an item with an associated location and date. In Section 3.3, we introduce an approach which reliably computes whether a label is spatially related to the particular.
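The difference between the tag rule (Eq. 1) and the geotag rule (Eq. 5) can be illustrated with a minimal sketch. The type `Tagging`, the helper names `is_tag` and `is_geotag`, and the sample values are hypothetical and only serve this example; the sketch treats a missing coordinate or time stamp as `None`.

```python
from datetime import datetime
from typing import NamedTuple, Optional, Tuple


class Tagging(NamedTuple):
    """Hypothetical record: a label put on an item by a user."""
    label: str                                         # L
    user: str                                          # U
    item: str                                          # I
    coordinates: Optional[Tuple[float, float]] = None  # C as (lat, lon)
    timestamp: Optional[datetime] = None               # T


def is_tag(t: Tagging) -> bool:
    # Eq. 1: a label bound to an item and to its author.
    return bool(t.label) and bool(t.item) and bool(t.user)


def is_geotag(t: Tagging) -> bool:
    # Eq. 5: additionally, the item carries a coordinate and a time stamp.
    return is_tag(t) and t.coordinates is not None and t.timestamp is not None


plain = Tagging("blue", "alice", "photo-1")
located = Tagging("blue", "alice", "photo-2",
                  (19.43, -99.13), datetime(2009, 3, 1, 12, 0))
```

Note that `located` counts as a geotag although blue is not a place name, which mirrors the remark above: being a geotag says nothing yet about the spatial character of the label.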

3.3 A Clustering Approach to Categorize Geotags

The definition of geotags introduced in the previous section has substantial implications on the conceptual level. An information item is linked to a coordinate and time stamp, and labelled by one or more individuals. If we want to extract one particular aspect, e.g. the spatial coverage of geotags, we have to consider the other four properties as well.

Using the definition of a geotag as the relation Geotagging = (L, U, C, I, T), we use the tuple relational calculus⁶ [30] in the remainder to specify the queries used to retrieve different kinds of clouds. For example, the query {g.C | g ∈ Geotagging ∧ g[L] = L_i} returns the coordinates of all tuples g where the label (the field L) has the value L_i. We call the result of this query a point cloud of a label. A folksonomy – the aggregation of all tags from all users into one (uncontrolled) vocabulary – is then simply formalized as {g.L | g ∈ Geotagging}. The resulting tag cloud can also be reduced to the vocabulary of one particular user U_i with the query {g.L | g ∈ Geotagging ∧ g[U] = U_i}. Her spatio-temporal activity – the user’s movement across space and time – is queried using the statement {g.C, g.T | g ∈ Geotagging ∧ g[U] = U_i}.

6 TRC is a concise declarative query language for the relational model; the presented examples can also be expressed in SQL.
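The four queries above translate almost literally into comprehensions over a list of tuples. The following is a minimal sketch under that assumption; the type `Geotag`, the function names and the sample data are illustrative and not part of the prototype.

```python
from typing import List, NamedTuple, Set, Tuple

Point = Tuple[float, float]


class Geotag(NamedTuple):
    """One tuple of the relation Geotagging = (L, U, C, I, T)."""
    L: str     # label
    U: str     # user
    C: Point   # coordinates as (lat, lon)
    I: str     # item identifier
    T: str     # time stamp, kept as an ISO 8601 string here


def point_cloud(geotagging: List[Geotag], label: str) -> List[Point]:
    # {g.C | g in Geotagging and g[L] = L_i}; a list keeps repeated points.
    return [g.C for g in geotagging if g.L == label]


def folksonomy(geotagging: List[Geotag]) -> Set[str]:
    # {g.L | g in Geotagging}
    return {g.L for g in geotagging}


def vocabulary(geotagging: List[Geotag], user: str) -> Set[str]:
    # {g.L | g in Geotagging and g[U] = U_i}
    return {g.L for g in geotagging if g.U == user}


def activity(geotagging: List[Geotag], user: str) -> List[Tuple[Point, str]]:
    # {g.C, g.T | g in Geotagging and g[U] = U_i}
    return [(g.C, g.T) for g in geotagging if g.U == user]


# Made-up sample data.
geotagging = [
    Geotag("soho", "alice", (51.51, -0.13), "p1", "2009-03-01T12:00:00"),
    Geotag("soho", "bob", (40.72, -74.00), "p2", "2009-04-02T09:30:00"),
    Geotag("blue", "alice", (51.51, -0.13), "p1", "2009-03-01T12:00:00"),
]
```

The point cloud is returned as a list rather than a set so that several photos taken at the same coordinates still count separately, which matters for density-based clustering.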

We suggest using the point cloud of one label to compute its spatial footprint. A gazetteer built on top of this approach could then return geometries and centroids for proper (potentially unofficial) names of geographical objects. The information we derive from geotags, however, is inherently noisy: many tags do not have an immediate relation to the particular represented by the geotagged item. Only significant occurrences of geotags should therefore be considered for this approach. We define one occurrence of a geotag g = (L_i, U_i, C_i, I_i, T_i) as significant if the following two conditions are fulfilled:

1. At least two tuples g_i and g_j exist where g_i[L] = g_j[L] and g_i[U] ≠ g_j[U]. Since names in geotags are subjective, this rule ensures that only names which are used by different persons are taken into account.

2. The spatial distribution {g.C | g ∈ Geotagging ∧ g[L] = L_i} can be clustered.
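The two conditions can be sketched as a simple filter. Condition 1 is implemented directly; condition 2 is left as a caller-supplied predicate `can_be_clustered`, because the text does not fix a clusterability criterion at this point. All names below (`Geotag`, `used_by_several_users`, `is_significant`) are assumptions made for this illustration.

```python
from typing import Callable, List, NamedTuple, Tuple

Point = Tuple[float, float]


class Geotag(NamedTuple):
    L: str     # label
    U: str     # user
    C: Point   # coordinates as (lat, lon)
    I: str     # item identifier
    T: str     # time stamp


def used_by_several_users(geotagging: List[Geotag], label: str) -> bool:
    # Condition 1: the same label occurs with at least two different users.
    users = {g.U for g in geotagging if g.L == label}
    return len(users) >= 2


def is_significant(geotagging: List[Geotag], label: str,
                   can_be_clustered: Callable[[List[Point]], bool]) -> bool:
    # Condition 2 is delegated to the supplied predicate (an assumption of
    # this sketch), which receives the point cloud of the label.
    cloud = [g.C for g in geotagging if g.L == label]
    return used_by_several_users(geotagging, label) and can_be_clustered(cloud)


# Made-up sample data.
geotagging = [
    Geotag("soho", "alice", (51.51, -0.13), "p1", "2009-03-01"),
    Geotag("soho", "bob", (51.52, -0.14), "p2", "2009-04-02"),
    Geotag("vacation09", "alice", (19.43, -99.13), "p3", "2009-05-03"),
]
```

With a placeholder predicate such as `lambda cloud: len(cloud) >= 2`, the label soho passes the filter, whereas the personal tag vacation09 fails condition 1 because only one user applied it.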

In the following section we describe the algorithm which applies filters checking for these conditions to extract the relevant candidates for toponyms from the large set of tags. The semantic analysis of the two preceding sections can be easily realized as executable rules, for example expressed in the Semantic Web Rule Language (SWRL) [31]. Since SWRL supports built-ins, the algorithm presented in the following pages can be integrated as geotag:significant and used to extend and clarify rule 2:

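$$\forall l \,\exists i \,\bigl( \mathit{Tag}(l) \land \mathit{Item}(i) \land \mathit{associatedTo}(l, i) \land \mathit{geotag{:}significant}(l) \land \exists p \,( \mathit{Particular}(p) \land \mathit{represents}(i, p)) \rightarrow \mathit{refersTo}(l, p) \bigr) \tag{6}$$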

A reasoning engine triggers the execution of the clustering algorithm once it processes the added built-in. The algorithm returns true if the given label occurs significantly (or false otherwise). Once we have applied the filtering and clustering, our gazetteer can provide the point clouds (and the regions covered by the point clouds) for given place names. For some place names, the clustering process results in multiple clusters (see the example of Soho in the following sections). This does not impair the efficacy of the presented approach as long as the clustering algorithm produces reasonable results (which depends mostly on the number of available geotags). For cases such as Soho, multiple gazetteer entries are generated.

Although we introduced time as a fundamental component of the geotag, we have not discussed the implications for the targeted gazetteer. With the presented approach, the tag GEOS 2007 would also be classified as place name. While we cannot discuss this issue here in detail for lack of space, distinguishing between toponyms and labels naming temporal events can be implemented by applying the clustering approach both to the spatial and temporal dimensions.

3.4 Extraction of Gazetteer Entries

Section 2.1 defines gazetteer entries as triples (N, F, T). This notion has to be further specified for a gazetteer based on geotags. Since, in our case, the underlying data consist of a large collection of photos, each geo-located with exactly one coordinate pair, the given place name N maps to a point cloud as geographic footprint: F = {g.C | g ∈ Geotagging ∧ g[L] = L_i}. Each point in the cloud represents one significant occurrence of the given place name as tag for a photo. Since the footprint is no longer a single coordinate pair, the gazetteer’s mapping from place name to footprint N → F should now result in three different mappings. N → F_r maps the place name to the raw footprint consisting of the corresponding point cloud. N → F_p maps to the polygon which approximates the region occupied by the point cloud. N → F_c finally maps a place name to the footprint’s centroid, i.e., to a single coordinate pair as returned by conventional gazetteers. The centroid is the mean of all coordinate pairs in the point cloud and is thus specifically (and intentionally) biased towards areas that contain high numbers of geotags. F_c can thus be regarded as the point of interest best representing a place name, based on the number and location of corresponding geotags.
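The bias of the centroid towards densely tagged areas follows directly from its definition as a mean. A minimal sketch, assuming the point cloud is given as a list of (lat, lon) pairs and that a plain arithmetic mean of the coordinates is acceptable (this is a planar simplification that breaks down for clouds spanning the antimeridian); the function name `centroid` and the sample cloud are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def centroid(cloud: List[Point]) -> Point:
    """Arithmetic mean of all coordinate pairs in the raw footprint."""
    if not cloud:
        raise ValueError("cannot compute the centroid of an empty point cloud")
    n = len(cloud)
    mean_lat = sum(p[0] for p in cloud) / n
    mean_lon = sum(p[1] for p in cloud) / n
    return (mean_lat, mean_lon)


# Three photos at one spot and a single photo further away: the centroid
# is pulled towards the spot with more geotags.
cloud = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (4.0, 4.0)]
```

For this made-up cloud, `centroid(cloud)` yields (1.0, 1.0), i.e., much closer to the three co-located photos than the midpoint (2.0, 2.0) of the two distinct locations.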

While the derivation of the gazetteer entries from geotags allows for enhanced functionality in the mapping from place name to footprint, the mapping to place type N → T remains unchanged. The experimental setup presented in Section 4 leaves the place type unspecified. Potential combinations with linguistic approaches [21] as sketched in Figure 1, however, would allow for a semi-automatic classification of the gazetteer entries based on a predefined typing scheme. This scheme could be adopted from existing gazetteers. Due to the limited reliability of any data coming from such collaborative platforms, such an approach would at least require quality control mechanisms. A fully automatic strong typing of place names with such a bottom-up approach is clearly not feasible here. While this is out of scope for this paper, the grouping of a resource’s tags into place names, place types and other tags does appear feasible. Moreover, it is worth asking whether such a tag-based typing is a more practical approach for a community-driven gazetteer [32].

4 Workflow and Algorithm

This section describes the crawling approach implemented in our prototype. The different aspects of the resources that play a role in the filtering process are discussed.

4.1 Crawling Approach

A reliable extraction of geographic footprints requires a sufficiently large number of geotagged resources. We have limited ourselves to photos as resources for various reasons. People sharing their creations on the web want others’ recognition. Community-based web sites take this aspect into account by ranking the photos by popularity, which relies on the findability of the photos. Photo-sharing web sites all provide various means to find a photo: one can use a keyword-based search engine, browse a map with overlaid pictures, browse pictures by date, and so on. Users spend a considerable amount of time annotating the pictures to cover all these aspects. Since every photo is implicitly located, assigning an explicit location by linking the photo to a point on a base map is a common annotation procedure. Accordingly, digital photos do not only carry detailed metadata in their Exif tags, they are also exceptionally well described by their creators. The last and most important reason to consider only photos as resource for extracting the spatial footprints of place names is their abundant availability. It is therefore reasonable to assume that the crawling yields a large enough sample of geotagged resources to achieve a significant result.

Fig. 1. Geotagged photos are crawled from the web (1) and fed into an RDF triple store. The tags are filtered based on occurrences to retrieve a subset of toponyms (2). For each place name, regions and centroids are calculated (3). Finally, every place name is categorized using linguistic classification (4). The part outlined in grey has been implemented for this paper (adapted from [7]).

The crawling algorithm is conceptually straightforward. Starting from a specific tag, the algorithm requests all geotagged resources which have been annotated with this tag. All three services used for our study provide this functionality through their APIs. For every tag attached to a retrieved photo, we store a separate complete geotag tuple (L, U, C, I, T) in our RDF triple store. In the next step, the conditions detailed in Section 3.3 are applied to filter out tags which we have identified as not important. The resulting set of geotag tuples is taken as input for the clustering method described in the following.
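The step of storing one tuple per tag can be sketched as follows. The shape of the `photo` dictionary (keys `id`, `owner`, `lat`, `lon`, `taken`, `tags`) is a hypothetical, service-neutral record invented for this example; it does not reproduce the response format of any of the three APIs, and the sketch returns plain tuples instead of writing to a triple store.

```python
from typing import Any, Dict, List, NamedTuple, Tuple


class Geotag(NamedTuple):
    L: str                   # label
    U: str                   # user
    C: Tuple[float, float]   # coordinates as (lat, lon)
    I: str                   # item identifier
    T: str                   # time stamp


def photo_to_geotags(photo: Dict[str, Any]) -> List[Geotag]:
    """Expand one retrieved photo into one (L, U, C, I, T) tuple per tag."""
    return [
        Geotag(L=tag,
               U=photo["owner"],
               C=(photo["lat"], photo["lon"]),
               I=photo["id"],
               T=photo["taken"])
        for tag in photo["tags"]
    ]


# Made-up photo record with three tags.
photo = {
    "id": "p1",
    "owner": "alice",
    "lat": 40.72,
    "lon": -74.00,
    "taken": "2009-03-01T12:00:00",
    "tags": ["soho", "newyork", "blue"],
}
tuples = photo_to_geotags(photo)
```

All three resulting tuples share user, coordinates, item and time stamp and differ only in the label, so tags such as blue enter the store as well and have to be removed by the subsequent filtering.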

4.2 Geotag Extraction Algorithm

A place name either refers to one unique place (e.g. Kilimanjaro) or to multiple regions (e.g. the districts Soho in London and New York). The geotag tuples resulting from the crawling algorithm are used to identify clusters of high point density. We consider the point cloud (explained in Section 3.3) as geographic footprint for the label L_i if many people used this keyword to annotate their photos taken nearby. Such clusters can have any shape: they are not necessarily convex and can contain holes. Point clouds derived from geotags are not evenly distributed over space, but have some tendency to follow structures like trails or streets. In [10] the Delaunay triangulation has been identified as candidate algorithm to find clusters within point clouds. This method is not restricted to places with certain geometries. It connects neighboring points into triangles such that no point lies inside the circumcircle of any triangle; as a consequence, each point is connected to its nearest neighbor by an edge. A Delaunay triangulation for the tag Soho in New York is depicted in Figure 2. In order to split the graph of points and edges into clusters of high density (short edges), we remove all edges longer than a given threshold. Remaining triangles that are adjacent are merged into one or more polygons. They represent F_p, the polygonal geographic footprint of the gazetteer’s place name N.
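The pruning and merging step can be sketched independently of how the triangulation is obtained. The sketch below assumes that the triangles are already given as triples of point indices (for example, the `simplices` attribute of `scipy.spatial.Delaunay` has this shape), that coordinates are planar so that Euclidean distance is meaningful, and that "adjacent" means sharing an edge. It is one possible reading of the described procedure, not the prototype's code; the function name `prune_and_cluster` is hypothetical, and the result is a list of point-index sets rather than polygon outlines.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Point = Tuple[float, float]
Triangle = Tuple[int, int, int]


def prune_and_cluster(points: Sequence[Point],
                      triangles: Sequence[Triangle],
                      max_edge: float) -> List[List[int]]:
    """Drop triangles with an edge longer than max_edge, then merge
    the remaining triangles that share an edge into clusters."""
    def is_short(a: int, b: int) -> bool:
        return math.dist(points[a], points[b]) <= max_edge

    kept = [t for t in triangles
            if is_short(t[0], t[1]) and is_short(t[1], t[2])
            and is_short(t[0], t[2])]

    # Union-find over the kept triangles.
    parent = list(range(len(kept)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edge_owner: Dict[Tuple[int, int], int] = {}
    for idx, (a, b, c) in enumerate(kept):
        for edge in ((a, b), (b, c), (a, c)):
            key = (min(edge), max(edge))
            if key in edge_owner:
                parent[find(idx)] = find(edge_owner[key])
            else:
                edge_owner[key] = idx

    clusters: Dict[int, Set[int]] = defaultdict(set)
    for idx, tri in enumerate(kept):
        clusters[find(idx)].update(tri)
    return [sorted(members) for members in clusters.values()]


# Two dense groups of points, linked only by long, thin triangles.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 0), (11, 0), (10, 1)]
triangles = [(0, 1, 2), (1, 3, 2), (1, 3, 4), (4, 5, 6), (3, 4, 6)]
clusters = prune_and_cluster(points, triangles, max_edge=2.0)
```

With a threshold of 2.0, the two triangles bridging the gap are discarded and the remaining ones fall into two clusters, one per dense group. Raising the threshold above the gap width joins them into a single cluster, which corresponds to the threshold dependence discussed for Figure 2.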

A more advanced way to extract polygonal footprints from single locations is the Alpha Shape [33,34], which has also been used to generate the Flickr shapefiles7. For reasons of simplicity, we stuck to a Delaunay triangulation for this experiment. The next section shows that even with such a comparably simple clustering approach one can already obtain usable results.

5 Experimental Results and Evaluation

This section presents the results obtained by our prototype implementation. The results are discussed and compared to those obtained from conventional gazetteers.

5.1 Soho, Camino de Santiago and Kilimanjaro

We retrieved geotagged photos annotated with Soho, Camino de Santiago and Kilimanjaro. These three place names were chosen because they represent different geometries: Soho as a city district represents polygonal real-world features up to a few kilometers in diameter. Moreover, we chose this example because there is not “the one” Soho, but both districts in London and New York can be regarded as equally well-known. Camino de Santiago refers to a number of pilgrimage routes leading to the Cathedral of Santiago de Compostela8 in north-western Spain. It usually refers to Camino Frances, the medieval route via Jaca, Pamplona, Estella, Burgos and Leon, but it is also used for a number of other ways to Santiago de Compostela across Europe and is thus a prime example of an ambiguous linear real-world feature. The third example, Kilimanjaro, is an example of a large-scale natural feature that can be seen (and hence shot) from far away, but is hard to reach. Using this example, we want to investigate how well our approach can derive useful results for such features.

Fig. 2. Cluster graph after the Delaunay triangulation for the place name Soho. The screen shot shows the clustering result depending on the edge length threshold: a small value results in several small clusters shown in blue. When the threshold increases, the fragments start to join into the large black cluster.

7 See http://code.flickr.com/blog/2008/10/30/.

Table 1. Figures on the RDF repository used for this study. The numbers include a negligible number of entries added during the testing phase.

Geotag Tuples   Filtered Geotag Tuples   Unique Names   Filtered Unique Names   Resources   Users
560,834         471,393                  9,917          2,035                   10,603      1,103

8 Tradition has it that the cathedral contains the grave site of the apostle Saint James the Great.

Table 1 gives an overview of the number of resources and tags obtained by crawling the three photo sharing websites for the three given examples. Only around 16 percent of the tuples were removed during the filtering process; the ratio of filtered to unfiltered tuples of ∼0.84 is surprisingly high. The ratio of filtered to unfiltered unique names, on the other hand, is ∼0.21; this shows that our filtering approach identified almost 80% of the names as irrelevant since they were used by only one user. The difference between the two ratios means that the remaining 20% of unique names appear in more than 80% of all geotag tuples. Our rather simple approach of not further considering tags that only occur once thus proves very effective. Most tags are noise, but those which remain are used and accepted by many users. Table 2 contains the specific numbers per place name.
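The two ratios follow directly from the figures in Table 1, as the following short calculation shows (the variable names are chosen for this example only):

```python
# Figures taken from Table 1.
geotag_tuples = 560_834
filtered_geotag_tuples = 471_393
unique_names = 9_917
filtered_unique_names = 2_035

# Share of tuples and of unique names that survive the filtering step.
tuple_ratio = filtered_geotag_tuples / geotag_tuples
name_ratio = filtered_unique_names / unique_names

print(f"{tuple_ratio:.2f} {name_ratio:.2f}")  # prints: 0.84 0.21
```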

For Soho, the two biggest clusters emerge as expected in central London and in New York (see Figure 3). Apart from these two main clusters, a number of smaller clusters appear at different locations around the world. An analysis of the corresponding resources showed that most of them correspond to smaller places called Soho, thus representing valid gazetteer entries. The small outlying clusters south of the main cluster in Figure 3, however, are clearly not meaningful results. Such outliers occur frequently when users tag whole photo sets with the name of the place where most of them were taken. This inevitably tags some photos with the wrong place name and will require an improved filtering approach.

Table 2. Figures on the three case studies. The last column indicates the distance from the cluster’s centroid to the corresponding footprint in GeoNames (a: London, b: New York).

Place name           Geotag Tuples   Resources   Users   Dates   Distance
Soho                 11916           3124        446     3087    0.26a / 0.16b km
Camino de Santiago   5132            1304        75      1255    285.3 km
Kilimanjaro          2536            825         72      808     3.7 km

For Camino de Santiago, the generated clusters give a good impression of the main trail to the Cathedral of Santiago de Compostela (see Figure 4). One apparent problem here is that the clustering algorithm splits up the route into distinct segments. Future research should focus on the development of “intelligent” clustering approaches that take the shape of the cluster into account, in order to enable a more reliable clustering.

For Kilimanjaro, the emerging clusters (see Figure 5) expose the main problem with an approach based on tagged and geolocated photos: users often do not tag the picture with the place name of the location where the picture was taken, but with the name of the real-world feature shown in the picture. This becomes especially apparent for very large real-world features, as in this example. Several smaller clusters expose the high number of pictures taken at these locations, which apparently offer a good view of Mount Kibo, the highest peak of the Kilimanjaro massif. Future work needs to investigate how clusters referring to such real-world features can be detected, for example, by identifying ring-shaped clusters such as the one in Figure 5.

Fig. 3. The clusters generated for Soho. The left screen shot shows the cluster in London, the right one shows the cluster in Manhattan, New York.

5.2 Geographic Footprints

The footprints extracted by our approach provide additional useful information to the point-based footprints provided by conventional gazetteers. For comparison with GeoNames, we also computed the corresponding centroid as the mean of all coordinates in every cluster (or cluster group, as for Kilimanjaro). This centroid points to what can be described as a named cluster’s group-cognitive centre. In contrast to the geometric centre point, it gives an estimate of the common point of interest of users providing the photos retrieved in the crawling step. In the following, we discuss the extracted footprints and how the group-cognitive centre and the geometric centre point differ for our three examples.
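The centroid computation and the comparison with a point footprint can be sketched as follows. The centroid is the arithmetic mean of the coordinates, as described above. How the distances in Table 2 were measured is not stated, so the haversine great-circle distance on a sphere of radius 6371 km is an assumption of this example, as are the function names and the sample coordinates.

```python
import math


def centroid(coords):
    """Arithmetic mean of (lat, lon) pairs given in decimal degrees."""
    lats = [lat for lat, _ in coords]
    lons = [lon for _, lon in coords]
    return sum(lats) / len(lats), sum(lons) / len(lons)


def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


# Made-up cluster coordinates and gazetteer footprint for illustration.
cluster = [(51.512, -0.135), (51.514, -0.131), (51.516, -0.133)]
gazetteer_point = (51.513, -0.132)
centre = centroid(cluster)
distance_km = haversine_km(centre, gazetteer_point)
```

Averaging latitudes and longitudes directly is adequate for compact clusters such as a city district, but it would give misleading results for clusters that span the antimeridian or lie close to the poles.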

For Soho and Kilimanjaro, the distance between the GeoNames footprint and the centroid of our cluster is comparably small, given the respective scale of the cluster (and the size of the corresponding real-world feature). The footprint for Soho, London, in GeoNames is about 260 m away from the centroid of our cluster. The cluster itself represents the common notion of Soho very well9, although it extends across Oxford Street in the north, which is usually taken as Soho’s northern border. The same applies to the eastern extension of the cluster; the southern and western extensions match the common notion of Soho very well. Similar observations can be made for Soho, New York: the area that is commonly referred to as Soho10 is completely covered, but the cluster exceeds the actual area in all four directions. This overshoot can probably be addressed by adjusting the cutoff length during triangulation and fetching more input data. The centroid of the cluster is only 160 m away from the footprint of the corresponding GeoNames entry. The clusters generated for Camino de Santiago stretch very well along the actual trail of the route, despite the gaps discussed above. The calculation of the centroid shows that it is in most cases meaningless to represent linear real-world features by points. While the centroid represents a mean value for all coordinates in the clusters, the footprint from GeoNames is located at one end of the route. Selecting the destination of the pilgrimage trail as footprint certainly makes sense in this case (the coordinate refers to Santiago de Compostela); however, this selection will be completely arbitrary for linear features that lack such a clear destination (such as most roads). For Kilimanjaro, the clusters represent the areas with a view of Kilimanjaro’s highest peak, rather than the mountain itself (due to the problems discussed above). This also causes a distance of almost 4 km between the clusters’ centroid and the GeoNames footprint, which is nevertheless still within an acceptable range given the size of the real-world feature.

Fig. 4. The clusters generated for Camino de Santiago give a good impression of the trail of the route.

9 See http://en.wikipedia.org/wiki/Soho#Streets for comparison.
10 See http://en.wikipedia.org/wiki/SoHo#Geography.

6 Conclusions

This section summarizes the paper and points to different applications of the approach presented in this paper, as well as directions for future work.

6.1 Discussion

In this paper, we have presented an experiment to test the feasibility of building a gazetteer completely from geotagged photos crawled from the web. We have introduced the theoretical foundations to capture the emergent semantics of geographic information extracted from geotagged resources on the web. A theoretically sound definition of a geotag has been introduced and related to the classical definition of a gazetteer. Using the implementation which clustered and filtered geotags of photos, we have demonstrated how the geographic footprint for a given place name can be derived.

Fig. 5. The clusters generated for Kilimanjaro are distributed over a large area and show the problem of photos tagged with the names of features shown in the pictures, although they were taken from far away.

The results of our queries for Soho, Camino de Santiago and Kilimanjaro showed that it is possible to derive meaningful geographic footprints from geotagged content, even with comparably simple clustering approaches. Both the footprints and their centroids shed a different light on named places than conventional gazetteers. As pointed out in [22], every gazetteer extracted from online information can only be as good as the information it builds on. However, our experiment has demonstrated that useful results can already be obtained with very straightforward means to extract a group-cognitive perspective [12] on place names. Hence, we do not propose to replace existing gazetteers by our approach, but to complement them within a gazetteer infrastructure [7,16]. Further improvements can be expected from implementing models of trust in the harvesting process, which would allow for an estimation of the quality of the geotags used for clustering [7,23].

From a visual inspection, the generated regions were judged to be plausible representations of the place names’ geographic footprints. Particularly, the algorithm showed the capability to recognize different places carrying the same name, as shown in the Soho example. Moreover, the filtering algorithm has successfully sorted the crawled tags into toponyms and other tags based on the notion of significant occurrences. The example of Kilimanjaro has shown that very large real-world features are problematic for our approach, since they often appear in the context of photos that show them, but that were taken far away from the actual feature. Evidently, the results could be improved by more sophisticated crawling, filtering and clustering approaches.

6.2 Applications

While the crawling approach presented in this paper has been developed with the recursive generation of a bottom-up gazetteer in mind, the underlying algorithms are also potentially useful in a number of other applications. The user component, for example, could be used to derive communities and their vocabulary by analyzing how groups of users tag certain real-world features. The temporal component has only been used to identify occurrences and to filter events that might corrupt the place name recognition. Instead of treating these filtered events as noise, however, one could also imagine an application that specifically looks for such events based on temporal clusters. This would enable an automatic calculation of geographic footprints for such events, which could eventually be merged into event gazetteers [35,36].

The fact that every resource carries a time stamp and a user’s name can also be used to extract individual space-time prisms [37,38]. This may provide insight into real-world social interactions between the users of photo sharing platforms, such as “who travelled together” or “who went to this party”. The implications for privacy, however, are obvious and would require a careful consideration of ethical issues. From this perspective, the photo sharing platforms used in this paper might require more fine-grained mechanisms to give their users control over what information they want to reveal to whom. One method to prevent automatic generation of such profiles would be to allow users to exclude specific metadata (or combinations of them) from access through the respective APIs.

6.3 Future Work

The next step in this research will be the combination of the filtering and clustering algorithm presented in this paper with linguistic web crawling approaches. This would make it possible to go beyond place names and their geographic footprints and also extract the corresponding place type, as demonstrated by Uryupina [21]. It is, however, unlikely that it will also be possible to extract a strong place typing from user tags. While straightforward types such as city, street or river may still be found frequently enough in the tags for a reliable extraction, it is unlikely that a user tags a picture taken in Soho with section of populated place, the associated feature class (i.e., place type) in GeoNames. However, as with footprints and centroids, such a bottom-up typing scheme would reflect place types used in common language, as opposed to the often somewhat artificial administrative place types used in current gazetteers. This bottom-up approach should also allow for a more flexible categorization that does not force every named place into exactly one category [32] in order to fully capture the emergent semantics of collections of geotagged content. We also plan to extend the existing implementation to take the temporal nature of geotags into account. This would eventually result in the identification not only of place names, but also of names of events and processes with a spatial character.

Acknowledgments

This research has been partly funded by the SimCat project (DFG Ra1062/2-1 and DFG Ja1709/2-2, see http://sim-dl.sourceforge.net) and the GDI-Grid project (BMBF 01IG07012, see http://www.gdi-grid.de). Figure 1 contains geotag icons under a Creative Commons license from http://geotagicons.com.

References

1. Jones, C.B., Purves, R.S., Clough, P.D., Joho, H.: Modelling vague places with knowledge from the web. International Journal of Geographical Information Science 22(10) (2008) 1045–1065

2. Larson, R.R.: Geographic information retrieval and spatial browsing. GIS and Libraries: Patrons, Maps and Spatial Information (April 1996) 81–124

3. Goodchild, M.F.: Citizens as voluntary sensors: Spatial data infrastructure in the world of web 2.0. International Journal of Spatial Data Infrastructures Research 2 (2007) 24–32

4. Bennett, B., Mallenby, D., Third, A.: An ontology for grounding vague geographic terms. In Eschenbach, C., Gruninger, M., eds.: Proceedings of the 5th International Conference on Formal Ontology in Information Systems (FOIS-08), IOS Press (2008)

5. Henrich, A., Ludecke, V.: Determining geographic representations for arbitrary concepts at query time. In: LOCWEB ’08: Proceedings of the first international workshop on Location and the web, New York, NY, USA, ACM (2008) 17–24

6. McConchie, A.: The great pop vs. soda controversy, available from http://popvssoda.com (last visited August 1st, 2009) (2002)

7. Keßler, C., Janowicz, K., Bishr, M.: An agenda for the next generation gazetteer: Geographic information contribution and retrieval. In: ACM GIS ’09, November 4–6, 2009. Seattle, WA, USA, ACM (2009)

8. Wilske, F.: Approximation of neighborhood boundaries using collaborative tagging systems. In Pebesma, E., Bishr, M., Bartoschek, T., eds.: GI-Days 2008. ifgiPrints 32 (2008) 179–187

9. Guo, Q., Liu, Y., Wieczorek, J.: Georeferencing locality descriptions and computing associated uncertainty using a probabilistic approach. International Journal of Geographical Information Science 22(10) (2008) 1067–1090

10. Heuer, J.T., Dupke, S.: Towards a spatial search engine using geotags. In Probst, F., Keßler, C., eds.: GI-Days 2007 – Young Researchers Conference. ifgiPrints 30 (2007) 199–204

11. Aberer, K., Mauroux, P.C., Ouksel, A.M., Catarci, T., Hacid, M.S., Illarramendi, A., Kashyap, V., Mecella, M., Mena, E., Neuhold, E.J., et al.: Emergent semantics principles and issues. In: Database Systems for Advanced Applications (DASFAA 2004), Proceedings, Springer (March 2004) 25–38

12. Stahl, G.: Group Cognition: Computer Support for Building Collaborative Knowledge (Acting with Technology). MIT Press (2006)

13. Raubal, M.: Cognitive engineering for geographic information science. Geography Compass 3(3) (2009) 1087–1104

14. Surowiecki, J.: The Wisdom of Crowds. Anchor (2005)

15. Schlieder, C.: Modeling collaborative semantics with a geographic recommender. In: Advances in Conceptual Modeling – Foundations and Applications. Volume 4802 of Lecture Notes in Computer Science. Springer (2007) 338–347

16. Janowicz, K., Keßler, C.: The role of ontology in improving gazetteer interaction. International Journal of Geographical Information Science 22(10) (2008) 1129–1157

17. Hill, L.L.: Georeferencing: The Geographic Associations of Information (Digital Libraries and Electronic Publishing). The MIT Press (2006)

18. Casati, R., Varzi, A.C.: Parts and Places. The Structures of Spatial Representation. MIT Press, Cambridge and London (1999)

19. Goodchild, M.F., Hill, L.L.: Introduction to digital gazetteer research. International Journal of Geographical Information Science 22(10) (2008) 1039–1044

20. Hastings, J.T.: Automated conflation of digital gazetteer data. International Journal of Geographical Information Science 22 (October 2008) 1109–1127

21. Uryupina, O.: Semi-supervised learning of geographical gazetteers from the internet. In: Proceedings of the HLT-NAACL 2003 workshop on Analysis of geographic references, Morristown, NJ, USA, Association for Computational Linguistics (2003) 18–25

22. Goldberg, D.W., Wilson, J.P., Knoblock, C.A.: Extracting geographic features from the internet to automatically build detailed regional gazetteers. International Journal of Geographical Information Science 23(1) (2009) 93–128

23. Bishr, M., Kuhn, W.: Geospatial information bottom-up: A matter of trust and semantics. In Fabrikant, S., Wachowicz, M., eds.: The European Information Society – Leading the Way with Geo-information (Proceedings of AGILE 2007, Aalborg, DK). Springer Lecture Notes in Geoinformation and Cartography (2007) 365–387

24. Guszlev, A., Lukacs, L.: Folksonomy & landscape regions. In Probst, F., Keßler, C., eds.: GI-Days 2007 – Young Researchers Conference. ifgiPrints 30 (2007) 193–197

25. Gruber, T.: Ontology of folksonomy: A mash-up of apples and oranges. International Journal on Semantic Web & Information Systems 3 (2007; originally published at http://tomgruber.org/writing/ontology-of-folksonomy.htm in Nov. 2005)

26. Frank, A.: Ontology for spatio-temporal databases. In Koubarakis, M., Sellis, T., Frank, A.U., Grumbach, S., Guting, R.H., Jensen, C.S., Lorentzos, N., Manolopoulos, Y., Nardelli, E., Pernici, B., Schek, H.J., Scholl, M., Theodoulidis, B., Tryfona, N., eds.: Spatio-Temporal Databases. Lecture Notes in Computer Science. Springer (2003) 9–77

27. Goodchild, M.F.: Geographical data modeling. Computational Geosciences 18(4) (1992) 401–408

28. Saeed, J.I.: Semantics (Introducing Linguistics). Wiley-Blackwell (2003)

29. Searle, J.R.: Proper names. Mind 67(266) (1958) 166–173

30. Codd, E.F.: A relational model of data for large shared data banks. Communications of the ACM 13(6) (1970) 377–387

31. O’Connor, M., Tu, S., Nyulas, C., Das, A., Musen, M.: Querying the semantic web with SWRL. (2007) 155–159

32. Shirky, C.: Ontology is overrated – categories, links, and tags. Essay available from http://shirky.com/writings/ontology_overrated.html (2005)

33. Edelsbrunner, H., Kirkpatrick, D., Seidel, R.: On the shape of a set of points in the plane. Information Theory, IEEE Transactions on 29(4) (1983) 551–559

34. Edelsbrunner, H., Mucke, E.: Three-dimensional alpha shapes. ACM Transactions on Graphics 13(1) (1994) 43–72

35. Allen, R.: A query interface for an event gazetteer. In: Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries. (2004) 72–73

36. Mostern, R., Johnson, I.: From named place to naming event: creating gazetteers for history. International Journal of Geographical Information Science 22(10) (2008) 1091–1108

37. Hagerstrand, T.: What about people in regional science? Papers in Regional Science 24(1) (1970) 6–21

38. Miller, H.J.: A measurement theory for time geography. Geographical Analysis 37 (2005) 17–45