Defining vague places with web knowledge


    IFA: DEFINING VAGUE PLACES
    WITH WEB KNOWLEDGE, A SEMANTIC APPROACH

    JORDI CASTELLS SALA
    August, 2012

    SUPERVISORS:
    Dr. Ir. R.A. de By
    Dr. J. Morales


    IFA: DEFINING VAGUE PLACES
    WITH WEB KNOWLEDGE, A SEMANTIC APPROACH

    JORDI CASTELLS SALA
    Enschede, The Netherlands, August, 2012

    Individual Assignment (IFA) submitted to the Faculty of Geo-information Science and Earth Observation of the University of Twente in partial fulfilment of the requirements for the degree of Master in Geo-information Science and Earth Observation.
    Specialization: GeoInformatics Master

    SUPERVISORS:
    Dr. Ir. R.A. de By
    Dr. J. Morales

    PROJECT ASSESSMENT BOARD:
    Dr. R. Zurita-Milla (chair)
    Dr. Ir. R.A. de By
    Dr. J. Morales


    DISCLAIMER

    This document describes work undertaken as part of a programme of study at the Faculty of Geo-information Science and Earth Observation of the University of Twente. All views and opinions expressed therein remain the sole responsibility of the author, and do not necessarily represent those of the Faculty.


    ACKNOWLEDGEMENTS

    I would like to express my gratitude to my project supervisor, Dr. Ir. Rolf de By, for his straightforward attitude towards me, for the time he spent reviewing my drafts even while on holiday, and of course for giving me the opportunity to work on such an interesting project. The DBpedia team also deserves a mention for their hard work and the help they offered on the mailing list.

    Finally, I want to mention Rory P. Nealon for being continuously interested in the status of the project, giving advice and listening over a coffee every morning.


    TABLE OF CONTENTS

    Acknowledgements

    1 Introduction
        1.1 Problem description
        1.2 Objectives

    2 Background
        2.1 Previous Work
            2.1.1 SPIRIT project
        2.2 Technical Background
            2.2.1 RDF
            2.2.2 SPARQL
            2.2.3 DBpedia

    3 Dataset analysis
        3.1 Dataset DBpedia 3.7
        3.2 Continents
        3.3 Countries
        3.4 Dataset Errors
        3.5 Dataset DBpedia Live

    4 Design
        4.1 General Overview
        4.2 Point cloud generation
        4.3 Point cloud process
            4.3.1 Convex hull
            4.3.2 Alpha shape
            4.3.3 Heat map

    5 Implementation
        5.1 Interface
        5.2 Project file structure
        5.3 Point cloud generation
            5.3.1 Convex hull
            5.3.2 Alpha shape
            5.3.3 Heat map
        5.4 Execution Times

    6 Results

    7 Project schedule

    8 Conclusion

    A Report example

    B Atlas

    C Proj4 definitions


    LIST OF FIGURES

    2.1 SPARQL Query diagram
    2.2 DBpedia components

    3.1 World map points
    3.2 Graph of total points per continent
    3.3 Classification of countries by number of points and ratio
    3.4 DBpedia 3.7 spotted errors

    4.1 VaguePlaces use cases
    4.2 Convex hull example
    4.3 Alpha shape example

    5.1 Vagueplaces software sequence diagram

    6.1 Comparison with real boundaries
    6.2 Alpha result comparison
    6.3 Twente results
    6.4 Midlands UK results
    6.5 Highlands UK results
    6.6 Results comparison for the highlands

    7.1 Project Gantt diagram

    B.1 Europe points map
    B.2 Africa points map
    B.3 Asia points map
    B.4 Oceania points map
    B.5 North America points map
    B.6 South America points map


    LISTINGS

    2.1 RDF example set
    2.2 RDF/XML serialization example
    2.3 Notation3 serialization example
    2.4 N-Triples serialization example
    2.5 SPARQL example query
    2.6 RDF dataset example 1
    2.7 RDF dataset example 2

    3.1 SPARQL to calculate the ratios of all the countries
    3.2 DBpedia N-triples with incorrect data

    5.1 Vagueplaces example execution
    5.2 Europe countries SPARQL query
    5.3 Points SPARQL query
    5.4 SPARQL FILTER condition example

    A.1 Example Vagueplaces report for the highlands


    LIST OF TABLES

    3.1 Approximate number of points on each continent
    3.2 Number of approximate points in a subset of most populated countries
    3.3 Number of approximate points in a subset of USA states
    3.4 Comparison between DBpedia 3.7 and DBpedia Live

    6.1 Approximate points for example results

    7.1 Table of tasks


    Chapter 1

    Introduction

    1.1 PROBLEM DESCRIPTION

    Online gazetteers provide services that help to geocode locations. Given a name, the gazetteer attempts to provide coordinates for the place, usually as point information. Point information is not always the best solution: urban areas, administrative units, or rivers are better described using polygons or lines. There is a clear need for region geocoding.

    For official administrative units, the situation is not critical: this kind of data is usually available from online spatial data repositories. Non-official boundaries, on the other hand, are difficult to find, and old boundaries or folklore delimitations are difficult to obtain. Examples are the Mittelland in Switzerland or the Scottish Highlands.

    Web-based solutions have already been proposed by Arampatzis et al. [AvKR+06] and Jones et al. [JPCJ08] within the research of the SPIRIT project [PRS+02]. Those approaches use Google search and classification to obtain the information; this project aims to obtain similar results using semantic web approaches.

    1.2 OBJECTIVES

    Previous work presented a solution to the problem. The main aim of this project is to provide another way to obtain a similar or better solution with the aid of semantic web formats, using DBpedia [BL09] as the base data repository.

    The basic objectives to accomplish are:

    (a) Analyze the RDF (Resource Description Framework) [MM04] and SPARQL (SPARQL Protocol and RDF Query Language) [PS98] specifications.

    (b) Review the DBpedia project as a suitable source of information.

    (c) Compare the amount of information stored by DBpedia for different continents.

    (d) Implement a first version of a software solution to obtain the boundaries using the different language abstracts on DBpedia.

    Secondary objectives regarding the usability of the final software may be achieved if there is enough time, or can be addressed in the future:

    Provide the software as a QuantumGIS (QGis) plugin.

    Provide the software as a web service.


    Chapter 2

    Background

    2.1 PREVIOUS WORK

    Most of the previous work on this topic is related to the SPIRIT (Spatially Aware Information Retrieval on the Internet) project.

    2.1.1 SPIRIT project

    The objective of the SPIRIT project is to retrieve spatially aware information from the Internet. Existing web technology is well suited to document or image search, but it performs poorly when it comes to finding specific geographical information. The idea of the SPIRIT project was to design and implement a spatial search engine to find documents related to the places or regions in a query.

    Several researchers and publications were grouped under this project from 2002 to 2005. One of the most interesting publications is that of Arampatzis et al. [AvKR+06], which describes, in very general terms, the steps used to delineate imprecise regions. Similarly, Jones et al. [JPCJ08] published an article in which the process is described more precisely.

    Both articles have a common approach:

    1. Obtain relevant websites using the Google search engine.

    2. Parse the HTML searching for place names.

    3. Geocode the places.

    4. Rank the results if necessary.

    5. Analyze the point cloud.

    6. Derive a polygon approximation.

    The first four steps are common to both works, the only difference being the optional ranking. The analysis of the point cloud is achieved in two different ways: the spatial density estimation used by Jones, and the alpha shape [EDGK83] in the Arampatzis paper.

    2.2 TECHNICAL BACKGROUND

    This project takes a similar approach to the existing articles; the basic difference lies in how the points are retrieved. While previous studies mainly used the Google search engine, this project uses an RDF (Resource Description Framework) serialization of the Wikipedia project. Wikipedia is regarded as a centralized knowledge site, and its RDF serialization as a way to access this information in a semantically linked way.

    2.2.1 RDF

    RDF is a metadata model specified by the W3C [MM04]. "Metadata" here means that it is a data model used to define other data models and to promote data interchange. RDF was developed as a key component of the semantic web, but that is not its only application. RDF can be used in several ways; one of the best known is the RSS specification (originally RDF Site Summary, now commonly expanded as Really Simple Syndication), used to follow the updates of a website. In this project, RDF is used as the base data storage format, in one of its serializations (N-Triples), and is accessed through a query language.

    Overview: RDF stores information and relationships. It represents facts about resources in the form of subject, predicate and object; each of these facts is called a triple.

    Subject: The resource referred to. For example, Johan.

    Predicate: The name of a property. For example, birthplace.

    Object: The value of the resource-predicate pair. For example, Amsterdam.

    A big set of triples defines a knowledge space on a topic. It is important to note that the object can be a reference to another subject, as in Listing 2.1.

        Amsterdam   capital of   The_Netherlands
        Amsterdam   population   790654
        Johan       birthplace   Amsterdam

    Listing 2.1: RDF example set

    Serialization: The previous paragraphs describe the basics of the RDF format, but there are multiple ways to serialize the datasets, that is, to convert the data structure into a storable format. There are two main formats for RDF and some variations. The following paragraphs present those formats using the same example: there is an article published by Wikipedia about Super Mario that was last modified on 12/06/2012.

    RDF/XML: This is the first and main format described by the W3C [BM04]. It uses XML (Extensible Markup Language) syntax and is considered the normative serialization. Listing 2.2 contains the XML code that represents the example.

        <rdf:RDF
          xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
          <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Mario">
            <dc:title>Mario</dc:title>
            <dc:publisher>Wikipedia</dc:publisher>
            <dc:modified>12/06/2012</dc:modified>
          </rdf:Description>
        </rdf:RDF>

    Listing 2.2: RDF/XML serialization example

    N3: Notation3 [BLC11] appears as an easier way to serialize RDF in a human-readable format, keeping most of the capabilities of RDF/XML.

        @prefix dc: <http://purl.org/dc/elements/1.1/> .

        <http://en.wikipedia.org/wiki/Mario>
            dc:title "Mario" ;
            dc:publisher "Wikipedia" ;
            dc:modified "12/06/2012" .

    Listing 2.3: Notation3 serialization example

    N-Triples: N-Triples [GB04] is a further simplification of the Notation3 format. Lacking some of the features of the bigger formats, such as prefixes, makes it easier to write software parsers, but it also implies more lines of code for the same amount of information.

        <http://en.wikipedia.org/wiki/Mario> <http://purl.org/dc/elements/1.1/title> "Super Mario" .
        <http://en.wikipedia.org/wiki/Mario> <http://purl.org/dc/elements/1.1/publisher> "Wikipedia" .
        <http://en.wikipedia.org/wiki/Mario> <http://purl.org/dc/elements/1.1/modified> "12/06/2012" .

    Listing 2.4: N-Triples serialization example
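
    As a minimal sketch of the triple model described in this section (it is not code from the Vagueplaces project), the facts of Listing 2.1 can be held as plain Python tuples, followed from object to subject, and written out one triple per line in the style of N-Triples. The base IRI `http://example.org/`, the predicate spelling `capital_of` and both helper names are assumptions made only for illustration.

```python
# Triples from Listing 2.1 as (subject, predicate, object) tuples.
# BASE and the identifier spellings are illustrative assumptions.
BASE = "http://example.org/"

triples = [
    ("Amsterdam", "capital_of", "The_Netherlands"),
    ("Amsterdam", "population", 790654),
    ("Johan", "birthplace", "Amsterdam"),
]


def objects_of(subject, predicate):
    """Return every object stored for a subject-predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def to_ntriples_line(s, p, o):
    """Write one triple as a single N-Triples-style line.

    Assumption: string objects name resources (written as IRIs) and
    any other value is written as a plain quoted literal.
    """
    obj = f"<{BASE}{o}>" if isinstance(o, str) else f'"{o}"'
    return f"<{BASE}{s}> <{BASE}{p}> {obj} ."


# The object of one triple can be the subject of another one.
birthplace = objects_of("Johan", "birthplace")[0]
print(birthplace, objects_of(birthplace, "population"))

for triple in triples:
    print(to_ntriples_line(*triple))
```

    Each output line repeats the full subject, which is the verbosity the N-Triples paragraph mentions.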

    2.2.2 SPARQL

    The previous section presents basic background knowledge on RDF. This section gives an overview of SPARQL (SPARQL Protocol and RDF Query Language) [PS98], the main RDF query language defined by the W3C.

    SPARQL is to RDF what SQL is to an RDBMS (Relational Database Management System) such as PostgreSQL or MySQL. The user does not have to know all the details of how RDF is stored to use it properly. Also, SPARQL is not limited to querying RDF; it can also update and delete statements.

    Overview: The interesting part of SPARQL for this project is its query capabilities. A typical statement has the following outline: "I want these parts of information, from this subset of data." This general statement syntax is illustrated in Figure 2.1.

    Figure 2.1: SPARQL query diagram. WHERE specifies the data to pull, SELECT decides which data to display. Image extracted from Learning SPARQL [Duc11]

    Listing 2.5 presents an example to clarify the concept; this example is tested with two slightly different datasets (Listings 2.6 and 2.7). The SPARQL WHERE conditions, in consonance with the RDF format, are described using triples. The example WHERE clause contains:

    The constant ab:john in the subject.

    The constant ab:email in the predicate.

    The variable ?johnMail in the object.

    This WHERE clause matches all the triples whose subject-predicate pair is ab:john-ab:email. It asks for all the ab:email entries associated with the ab:john resource; in plain words, all of John's email addresses.

        PREFIX ab: <http://exprefix/>

        SELECT ?johnMail
        WHERE
        {
            ab:john ab:email ?johnMail .
        }

    Listing 2.5: SPARQL example query

    This specific query obtains different results depending on the RDF dataset it runs against. The dataset in Listing 2.6 returns john@gmail.com, while the dataset in Listing 2.7 returns two results: john@gmail.com and john33@aol.com.

        <http://exprefix/john> <http://exprefix/email> "john@gmail.com" .
        <http://exprefix/mari> <http://exprefix/email> "marie@gmail.com" .

    Listing 2.6: RDF dataset example 1

        <http://exprefix/john> <http://exprefix/email> "john@gmail.com" .
        <http://exprefix/carl> <http://exprefix/email> "carl@ast.com" .
        <http://exprefix/mari> <http://exprefix/email> "marie@gmail.com" .
        <http://exprefix/john> <http://exprefix/email> "john33@aol.com" .

    Listing 2.7: RDF dataset example 2
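
    The matching performed by the WHERE clause can be imitated with a few lines of Python. This is only a sketch of a single triple pattern with a constant subject and predicate, not how a SPARQL engine is implemented; the function name `match` and the tuple representation are assumptions for illustration.

```python
# Datasets from Listings 2.6 and 2.7 as (subject, predicate, object) tuples.
EX = "http://exprefix/"

dataset_1 = [
    (EX + "john", EX + "email", "john@gmail.com"),
    (EX + "mari", EX + "email", "marie@gmail.com"),
]

dataset_2 = [
    (EX + "john", EX + "email", "john@gmail.com"),
    (EX + "carl", EX + "email", "carl@ast.com"),
    (EX + "mari", EX + "email", "marie@gmail.com"),
    (EX + "john", EX + "email", "john33@aol.com"),
]


def match(dataset, subject, predicate):
    """Bind the object variable for every triple whose subject and
    predicate equal the given constants (one triple pattern only)."""
    return [o for s, p, o in dataset if s == subject and p == predicate]


# Equivalent of: SELECT ?johnMail WHERE { ab:john ab:email ?johnMail . }
print(match(dataset_1, EX + "john", EX + "email"))
print(match(dataset_2, EX + "john", EX + "email"))
```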

    2.2.3 DBpedia

    The combination of the RDF format, SPARQL queries and Wikipedia articles is the DBpedia project. DBpedia started as a project of the Berlin and Leipzig universities [ABK+07] and aims to provide a copy of Wikipedia in a structured RDF format.

    Extraction: Wikipedia articles consist of text in a specific format. Most of this text is free text, but some constructions are used to provide some kind of structured information: infoboxes, coordinates and external links are examples of structured information inside Wikipedia.

    All the articles are periodically dumped and parsed for information. The results are stored in various databases that provide different kinds of access to the data. A Virtuoso Universal Server engine provides the SPARQL endpoint. Figure 2.2 is the general diagram of the DBpedia components.


    Figure 2.2: DBpedia components. Image extracted from [ABK+07]

    Versions: DBpedia is provided in two basic versions: a static version that is released every now and then, without a clear schedule, and a live version that is updated, almost every minute, whenever a Wikipedia article is modified.

    The current static version is DBpedia 3.7, which refers to dumps of July 2011. The static versions are useful for analyzing the data with a constant dataset. On DBpedia Live the resources are modified at almost the same pace as the Wikipedia articles. Not only the DBpedia resources change: the source code of the parsers and templates is also modified to better suit the articles or to correct bugs.

    Languages: DBpedia extracts the Wikipedia information in 111 different languages, linking the same article in different languages as the same resource. This is very useful because a single resource has different abstracts that can be parsed for information.


    Chapter 3

    Dataset analysis

    Two of the main objectives of this document are to review the DBpedia project as a suitable source of information and to compare the amount of information available for different countries and continents. DBpedia derives from Wikipedia, so the total information is directly related to the number of articles written by the community. Not all the articles are related to places; only a small subset of entries is of use to the project.

    The main objective is to obtain absolute point numbers for each continent and for some countries with DBpedia version 3.7, generated from Wikipedia dumps of July 2011. Additionally, a relative DBpedia ratio is computed.

    This chapter describes two of the possible input datasets, DBpedia 3.7 and DBpedia Live. For DBpedia 3.7, a count and ratio of total points for the continents and some countries is presented, together with a review of some errors and possible solutions. The fast pace of change of DBpedia Live prevents presenting reproducible information, which is why the main analysis is performed on DBpedia 3.7. Nevertheless, the general status of DBpedia Live is presented and compared.

    3.1 DATASET DBPEDIA 3.7

    The source dataset used for the analysis is obtained with SPARQL queries using the script alldbpediapoints.py. This piece of software retrieves all the points that are taken into account when trying to define a vague place. The script is available in the Vagueplaces git repository [Cas11].

    3.2 CONTINENTS

    This section deals with the number of points from the original dataset that fall in each of the seven continents: Europe, North America, South America, Oceania, Africa, Asia and Antarctica. To obtain the value, I use QGIS and its Spatial Query tool to find the points that intersect with each of the continents. Figure 3.1 presents a world map with the DBpedia points. Table 3.1 presents the information. Figures B.1 to B.6 in Appendix B are larger-scale maps of each continent.

    Table 3.1: Approximate number of points on each continent

    Continent        Total points    Ratio (points/km²)
    Europe                 250740    0.025
    North America          164120    0.006
    South America           12930    0.0007
    Oceania                 11390    0.02
    Asia                   111100    0.002
    Africa                  19290    0.0006
    Antarctica                700    0.00005
    Outside                 15110


    Figure 3.1: DBpedia 3.7 points on a world continents map, using the Mollweide world projection. Continents dataset from geocommons.com. Oceans dataset from naturalearthdata.com. Full map version in Appendix B.

    Figure 3.2: Graph of total points per continent

    A ratio is computed as the number of points divided by the area in square kilometers. It is interesting as a measure of how well a zone is covered, but not as an absolute quality indicator. For example, Europe and North America have a high ratio, but the points are not evenly spread over the continent. The area of each country is extracted from the Central Intelligence Agency World Factbook [cia12]. The area of the continents is generated from QGIS using the input dataset.
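
    The ratio is a plain division, sketched below for one country. The point count is the value that Table 3.2 reports for the Netherlands; the area of 41543 km² is an assumption used here for illustration and is not quoted in this document, and the function name `point_ratio` is hypothetical.

```python
def point_ratio(points, area_km2):
    """Number of points per square kilometer."""
    if area_km2 <= 0:
        raise ValueError("area must be positive")
    return points / area_km2


# 5738 points: Table 3.2. 41543 km2: assumed area, for illustration only.
netherlands = point_ratio(5738, 41543)
print(round(netherlands, 2))  # 0.14
```

    With the assumed area, the result rounds to the 0.14 listed for the Netherlands in Table 3.2.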


    3.3 COUNTRIES

    The continent overview is useful to determine where the Vagueplaces software can obtain proper results. A quick check of Table 3.1 and the various maps (Figures B.1 to B.6) shows that North America, Europe and Asia are the clear winners in number of points. In point ratio, Europe and Oceania lead, followed by North America.

    Table 3.2 presents the countries with the most points and their ratios.

    Table 3.2: Number of approximate points in a subset of most populated countries

    Country            Total points    Ratio (points/km²)
    Poland                    76071    0.25
    The Netherlands            5738    0.14
    UK                        35830    0.14
    France                    40368    0.071
    Japan                      9989    0.03
    Germany                    8974    0.025
    USA                      137655    0.02
    Spain                      7930    0.015

    A closer inspection shows that all those countries have enough points to query for a satisfactory result in some zones. In the bigger countries, even if there are a lot of points, some areas are almost deserted, and it would be difficult to generate a polygon from such a small subset of points. On the other hand, the most densely covered areas can be useful. For example, the USA has the largest number of points, but because of its size it also has a low ratio. Table 3.3 reveals that querying for areas like Massachusetts or Indiana can result in a very accurate polygon, while Wyoming, which has very few points, would not be a good basis for an accurate result.

    Table 3.3: Number of approximate points in a subset of USA states

    State            Total points    Ratio (points/km²)
    Massachusetts           13570    0.5
    Indiana                 13386    0.14
    Wyoming                   875    0.003

    It is clear that not all countries are in a position like the ones presented. Further development of Wikipedia is expected to bring more information to DBpedia. At the moment, only a small subset of the world's countries can be of use.

    I perform a classification of countries according to their number of points. For this classification, a SPARQL query grouping the places by country is performed. The query result has a lot of outliers: about 650 country values, in contrast to the 195 listed on the U.S. Department of State website [os12]. This is mainly because a lot of incorrect definitions lead to a large number of single-valued resources labeled as countries: old nonexistent countries, incorrectly typed names, countries not officially recognized, etc.

    Most of the countries have 20 points or more; adding this value as a threshold to the query drops 425 entries, leaving a result with 225 countries. Some of the dropped entries may be real countries, but the number of outliers is far higher than the possible loss of real countries. The pie chart in Figure 3.3 represents the distribution of the world's countries in terms of their number of DBpedia points:


    More than 10000 points or a ratio higher than 0.1 is considered good quality.

    Between 1000 and 9999 points or a ratio between 0.01 and 0.1 is considered medium quality.

    Between 100 and 999 points or a ratio between 0.001 and 0.01 is considered low quality.

    Fewer than 100 points or a ratio below 0.001 is considered unusable.

    Figure 3.3: Classication of countries by number of points and ratio.
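
    The four classes can be written down as a small function. The text does not say how the two criteria combine when they disagree, so this sketch assumes that the better of the two wins, and it assumes that exactly 10000 points counts as good; both choices, and all names, are illustrative assumptions rather than the rule used for Figure 3.3.

```python
LEVELS = ("unusable", "low", "medium", "good")


def level_from_points(points):
    """Index into LEVELS according to the point-count thresholds."""
    if points >= 10000:
        return 3
    if points >= 1000:
        return 2
    if points >= 100:
        return 1
    return 0


def level_from_ratio(ratio):
    """Index into LEVELS according to the ratio thresholds."""
    if ratio > 0.1:
        return 3
    if ratio >= 0.01:
        return 2
    if ratio >= 0.001:
        return 1
    return 0


def classify(points, ratio):
    """Assumed reading of "or": the better of the two criteria wins."""
    return LEVELS[max(level_from_points(points), level_from_ratio(ratio))]


# Values taken from Tables 3.2 and 3.3.
for name, points, ratio in [
    ("Poland", 76071, 0.25),
    ("The Netherlands", 5738, 0.14),
    ("Spain", 7930, 0.015),
    ("Wyoming", 875, 0.003),
]:
    print(name, classify(points, ratio))
```

    Under these assumptions the Netherlands is classed as good through its ratio alone, although its point count falls in the medium band.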

        SELECT ?c (COUNT(*) AS ?count) (MAX(?km) AS ?kmval)
               (xsd:float(COUNT(*)) / xsd:float(MAX(?km)) AS ?ratio)
        WHERE {
            ?p rdf:type dbpedia-owl:Place .
            ?p geo:lat ?lat .
            ?p geo:long ?lon .
            ?p dbpedia-owl:country ?c .
            ?c dbpprop:areaKm ?km .
            FILTER (?km > 0)
        }
        GROUP BY ?c
        ORDER BY DESC(?count)

    Listing 3.1: SPARQL to calculate the ratios of all the countries
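
    A query like Listing 3.1 is typically sent to a SPARQL endpoint over HTTP and answered in the standard SPARQL JSON results format. The sketch below only builds the request URL and flattens a response; it does not contact any server. The endpoint URL and the `format` request parameter are assumptions about the DBpedia Virtuoso service, the sample response is hand-written rather than real DBpedia output, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # assumed public endpoint


def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as an HTTP GET URL asking for JSON results.

    The "format" parameter is an assumption about the endpoint.
    """
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def rows_from_json(text):
    """Flatten a SPARQL JSON result document into a list of dicts
    mapping each variable name to its value string."""
    doc = json.loads(text)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


# Hand-written sample response, not real DBpedia output.
sample = json.dumps({
    "head": {"vars": ["c", "count"]},
    "results": {"bindings": [
        {"c": {"type": "uri", "value": "http://example.org/CountryA"},
         "count": {"type": "literal", "value": "42"}},
    ]},
})

print(build_request_url("SELECT ?c WHERE { ?p ?q ?c } LIMIT 1"))
print(rows_from_json(sample))
```

    All values arrive as strings, so counts and ratios need an explicit conversion before they are compared against thresholds.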


    3.4 DATASET ERRORS

    The original data extracted from DBpedia is not free of errors. There is a big group of outliers that do not correspond to any of the continents, as stated in Table 3.1. Those points are a mix of incorrect positions and duplicated places. A couple of errors can be spotted by simple inspection of Figure 3.1; a close-up of these singularities is shown in Figure 3.4.

    Figure 3.4: a) Mirrored Australia error, b) Graticule on India error.

    Figure 3.4-a shows a mirrored duplication of points. When a SPARQL query is performed on any RDF dataset, multiple values for the same subject-predicate pair are combined as a cartesian product. This is not a big problem as long as the values are different interpretations of the position of the same city, which results in a square shape near the original place. The problem with the points on the east coast of Australia is that the latitudes are repeated with positive and negative values. Listing 3.2 is an example for Melbourne: the query cannot decide which latitude is correct, and the cartesian product yields the coordinates:

    144.963 37.8136

    144.963 -37.8136

        <http://dbpedia.org/resource/Melbourne> <http://www.w3.org/2003/01/geo/wgs84_pos#lat> "37.8136"^^<http://www.w3.org/2001/XMLSchema#float> .
        <http://dbpedia.org/resource/Melbourne> <http://www.w3.org/2003/01/geo/wgs84_pos#lat> "-37.8136"^^<http://www.w3.org/2001/XMLSchema#float> .
        <http://dbpedia.org/resource/Melbourne> <http://www.w3.org/2003/01/geo/wgs84_pos#long> "144.963"^^<http://www.w3.org/2001/XMLSchema#float> .

    Listing 3.2: DBpedia N-triples with incorrect data

    This problem affects nearly 5000 points in Australia. A workaround is easy to imagine: after retrieving a point subset, selecting only the points inside Australia, or in the southern hemisphere, is enough. However, this adds complexity when working on zones in Australia.
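
    The effect and the workaround can be reproduced with the Melbourne values of Listing 3.2. This is a sketch under the assumption that the region of interest lies entirely in the southern hemisphere; the function name `keep_southern` is hypothetical and the snippet is not taken from the Vagueplaces code.

```python
from itertools import product

# Values stored for Melbourne in Listing 3.2.
lats = [37.8136, -37.8136]
lons = [144.963]

# The query pairs every latitude with every longitude (cartesian product).
candidates = [(lon, lat) for lat, lon in product(lats, lons)]


def keep_southern(points):
    """Keep only the (lon, lat) pairs with a negative latitude."""
    return [(lon, lat) for lon, lat in points if lat < 0]


print(candidates)
print(keep_southern(candidates))
```

    A filter on the sign of the latitude is cheaper than a point-in-polygon test against the outline of Australia, but it only works for zones that do not cross the equator.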


    The second problematic zone (Figure 3.4-b) is in India. Around India and Bangladesh a grid-like pattern can be observed. This is due to the values stored in DBpedia, which appear to have been rounded. No solution for this problem is proposed within the scope of this project.

    3.5 DATASET DBPEDIA LIVE

    Table 3.4 lists the differences between DBpedia 3.7 and DBpedia Live as of July 2012. Because DBpedia Live changes continuously, these values may not match the numbers in the current DBpedia Live.

    Table 3.4: Point comparison between DBpedia 3.7 and DBpedia Live (July 2012)

    Region           DBpedia 3.7    DBpedia Live
    Total Places          585802          749869
    Europe                250740          384217
    North America         164120          188846
    South America          12930           13417
    Oceania                11390           15401
    Asia                  111100          120191
    Africa                 19290           24311
    Antarctica               700             700

    Although the point comparison shows a difference of 164067 points, the spatial spread of the points has barely changed. Most of the new points are in Europe, North America and Asia, which raises the combined share of those continents.
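    The per-continent differences behind this statement can be recomputed from Table 3.4. The short script below only redoes the arithmetic on the table values; the variable names are chosen for the example.

```python
# Point counts copied from Table 3.4: (DBpedia 3.7, DBpedia Live).
counts = {
    "Europe": (250740, 384217),
    "North America": (164120, 188846),
    "South America": (12930, 13417),
    "Oceania": (11390, 15401),
    "Asia": (111100, 120191),
    "Africa": (19290, 24311),
    "Antarctica": (700, 700),
}
total_37, total_live = 585802, 749869

# New points per continent, largest increase first.
growth = {name: live - old for name, (old, live) in counts.items()}
for name, delta in sorted(growth.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:14s} +{delta}")

print("Total difference:", total_live - total_37)
```

    The continent rows do not add up to the totals because the totals also include the outliers that are not assigned to any continent.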

    The Australia and India errors are still present in the dataset. This has started a discussion on the DBpedia mailing list, in the hope that they will be resolved in future releases [dbp].


    Chapter 4

    Design

    4.1 GENERAL OVERVIEW

    The use case diagram in Figure 4.1 has four main use cases: retrieve points, generate a polygon, generate a heat map and retrieve all the points. These are the main use cases to be implemented in the Vagueplaces software. Generating a polygon and generating a heat map are not direct use cases but a second step after retrieving a set of points.

    Figure 4.1: VaguePlaces use cases

    4.2 POINT CLOUD GENERATION

    The retrieved point cloud has to approximate the vaguely defined zone. The approach is to retrieve DBpedia articles that refer to places and define at least four attributes: name, latitude, longitude and abstract. This process is split in two steps: selecting a first point set and filtering the results by keyword.

    The filtering can be done at two points of the process: in the SPARQL query itself, using the FILTER keyword, or in Python, by checking the result set for substrings. The Python approach allows a more fine-grained process, but the data transfer is higher because the results must contain all the abstracts. The first iteration of this project uses a FILTER clause in SPARQL, which returns the result set directly and decreases bandwidth usage.
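    For comparison, the Python-side alternative could look like the sketch below. It is illustrative only: the record layout (a dictionary with a title and a list of abstracts in several languages) and the function name matches_keywords are assumptions, not part of the project code.

```python
def matches_keywords(abstracts, keywords):
    """Return True if any keyword occurs, ignoring case, in any abstract."""
    lowered = [abstract.lower() for abstract in abstracts]
    return any(keyword.lower() in text
               for keyword in keywords
               for text in lowered)


# Hypothetical records; real abstracts would come from the SPARQL result set.
places = [
    {"title": "Dijon",
     "abstracts": ["Dijon is a city in Burgundy.",
                   "Dijon est une ville de Bourgogne."]},
    {"title": "Enschede",
     "abstracts": ["Enschede is a city in Twente."]},
]

selected = [place["title"] for place in places
            if matches_keywords(place["abstracts"], ["Burgundy", "Bourgogne"])]
print(selected)
```

    Doing the check in Python would make it possible to replace the plain substring test with something more selective later on, at the cost of downloading every abstract.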

    In this project not only the English entry is used: the abstracts in all available languages are searched for the keyword. This gives more information, but it also makes it more complex to define trigger phrases as in the study by Jones et al. [JPCJ08]. Given the short duration of the project, the approach taken is to check only for the keyword instead of trying to extract semantic meaning from more than 10 main languages. This design decision can cause problems if a zone is very famous and referenced in other articles. For example, I know beforehand that the Costa Brava is mentioned in the Barcelona article, even though Barcelona is not part of the Costa Brava.

    4.3 POINT CLOUD PROCESS

    The point cloud is not the final result; it needs to be reviewed and transformed into a polygon or a representative raster. Three options are proposed to deal with this problem.

    4.3.1 Convex hull

    This is the most straightforward way to obtain a polygon from the points. The convex hull of a set of points X is the minimal convex set containing X (Figure 4.2).

    Figure 4.2: Convex hull example. Image extracted from http://mathworld.wolfram.com

    Computing the convex hull is a classic problem that is already solved by various algorithms. The result is very generic and, because every point is enclosed, a single outlier is enough to distort it.
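    As an illustration of one such algorithm, the sketch below implements Andrew's monotone chain in plain Python. It is not the implementation used by the project (which relies on Shapely, see Chapter 5); the sample coordinates are invented, with (10, 1) playing the role of an outlier.

```python
def cross(o, a, b):
    """Z component of the cross product OA x OB (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would create a right turn or a straight line.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


cloud = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (10, 1)]
print(convex_hull(cloud))
```

    The interior point (1, 1) is discarded, but the outlier (10, 1) becomes a hull vertex and stretches the polygon far beyond the square formed by the other points.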

    4.3.2 Alpha shape

    An alpha shape is a generalization of the convex hull concept, first defined in 1983 by Edelsbrunner et al. [EDGK83], who later also proposed an implementation [AEF+95]. The idea of an alpha shape is to generate the shape formed by a group of points. This is a very vague description of a problem, and the alpha shape is one of its possible solutions.

    To obtain an alpha shape, a Delaunay triangulation has to be computed from the original point set; the final result will be a subset of this triangulation. To decide which segments are part of the result, an alpha value is defined. This alpha value is the radius of a circle used to erase the segments. If the eraser fits on a segment without including any of the points, that segment is erased from the result. This process is a straightforward way to generate an alpha shape, but other types can be obtained; for example, the weighted alpha shape gives more importance to some points than to others.

    Figure 4.3: Alpha shape example. Image extracted from CGAL documentation http://www.cgal.org

    Using an alpha shape is very suitable in this project for two reasons: first, not all areas are compact, and for an elongated area a convex hull gives an incorrect result; second, an alpha shape with a proper alpha value removes the outliers from the result.
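    The selection step can be illustrated with a simplified variant that works on triangles instead of segments: a Delaunay triangle is kept only when its circumradius does not exceed alpha, and the outline is formed by the edges that belong to exactly one kept triangle. The sketch below assumes that the triangulation is already available (it is hard-coded here, and the code does not verify that it is Delaunay). The function names are hypothetical, and this is not the CGAL algorithm used later in the project.

```python
import math
from collections import Counter


def circumradius(a, b, c):
    """Radius of the circle through the three points a, b and c."""
    la, lb, lc = math.dist(b, c), math.dist(a, c), math.dist(a, b)
    # Twice the triangle area, from the cross product of two edges.
    area2 = abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
    if area2 == 0:
        return math.inf  # degenerate (collinear) triangle
    return (la * lb * lc) / (2.0 * area2)


def alpha_filter(points, triangles, alpha):
    """Return (kept triangles, boundary edges) for the given alpha radius."""
    kept = [t for t in triangles
            if circumradius(*(points[i] for i in t)) <= alpha]
    edge_count = Counter()
    for i, j, k in kept:
        for edge in ((i, j), (j, k), (i, k)):
            edge_count[tuple(sorted(edge))] += 1
    # Edges used by a single kept triangle lie on the outline.
    boundary = sorted(e for e, n in edge_count.items() if n == 1)
    return kept, boundary


# Three nearby points and one outlier; triangles are index triples.
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.8), (4.0, 0.4)]
triangles = [(0, 1, 2), (1, 3, 2)]
kept, boundary = alpha_filter(points, triangles, alpha=1.0)
print(kept, boundary)
```

    With alpha set to 1.0 the long triangle that reaches the outlier is discarded, so the outline only follows the three nearby points. A much larger alpha would keep both triangles and reproduce the convex hull.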

    4.3.3 Heat Map

    A heat map provides a totally different view on the problem. Instead of generating a polygon, the result is a matrix of density values. Each density value is the total number of points inside a specific cell of the matrix.

    The cell size is critical for the result of the heat map.
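    The counting step is simple enough to sketch. The example below bins points into square cells of a given size and counts them; the function name density_grid and the sample coordinates are made up for illustration and are not part of the project code.

```python
import math
from collections import Counter


def density_grid(points, cell_size):
    """Count points per square cell; keys are (column, row) cell indices."""
    grid = Counter()
    for lon, lat in points:
        cell = (math.floor(lon / cell_size), math.floor(lat / cell_size))
        grid[cell] += 1
    return grid


# Invented (longitude, latitude) pairs.
sample = [(6.89, 52.22), (6.79, 52.26), (6.66, 52.35), (5.12, 52.09)]
coarse = density_grid(sample, cell_size=1.0)
fine = density_grid(sample, cell_size=0.1)
print(dict(coarse))
print(dict(fine))
```

    With one-degree cells three of the four points fall in the same cell, while with 0.1-degree cells every point ends up alone in its own cell, which shows how strongly the result depends on the cell size.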


    Chapter 5

    Implementation

    The full software is provided with a single entry point named vagueplaces.py. This script controls the whole flow and receives the options from the user. The first implementation only provides point retrieval for zones in Europe. Future versions can provide access to other continents as long as a suitable SPARQL query is defined. The general sequence diagram of the software is shown in Figure 5.1.

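    [Sequence diagram. Participants: vaguePlaces (Python), DBpedia, AlphaGenerator (C++), Shapely (Python), Reporter (Python). Messages: query and options are passed to vaguePlaces; getCountries returns Countries; for each country, getPoints(query) returns Points; generateAlpha(points) returns awkt; generateChull(points) returns cwkt; Reporter(countries, points, awkt, cwkt) is created and print_report is called.]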

    Figure 5.1: Vagueplaces software sequence diagram


    5.1 INTERFACE

    The implementation consists of a command line interface with a set of modifiers. All the retrieved points are saved in a CSV file with the name, country and point values. The software prints a report on the standard output consisting of:

    Dataset information
      - File
      - Date
      - DBpedia version used
      - Table with country point count

    Geometries
      - WKT for alpha shape polygons
      - WKT for convex hull polygon

    The software accepts various input parameters, some of which are optional:

    Compulsory parameters
      Output CSV filename.
      --query: followed by one or more keywords to search.

    Optional parameters
      --help: prints the help information.
      --live: activates the use of DBpedia Live instead of standard DBpedia.
      --alpha: gives the option to define a specific alpha value.
      --verbose: prints more information to the output when running.

    Code listing 5.1 includes two example executions, and Annex A contains the output report for the first execution. It is possible to define more than one keyword for the query; in that case, the keywords are assumed to represent a logical disjunction.

    python vagueplaces.py results/ml.csv --live --query "midlands"
    python vagueplaces.py results/ml.csv --alpha 0.5 --query "Burgundy" "Bourgogne"

    Listing 5.1: Vagueplaces example execution
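    An interface of this kind can be expressed with the argparse module of the Python standard library. The sketch below is a hypothetical reconstruction based only on the parameter descriptions in this section: the double-dash option spelling, the positional name output and the helper build_parser are assumptions, and the actual parsing code of vagueplaces.py may differ. The default alpha of 0.1 is the value stated in Chapter 6.

```python
import argparse


def build_parser():
    """Build a parser mirroring the parameters described in this section."""
    parser = argparse.ArgumentParser(
        prog="vagueplaces.py",
        description="Retrieve DBpedia points for a vague place.")
    parser.add_argument("output", help="output CSV filename")
    parser.add_argument("--query", nargs="+", required=True,
                        help="one or more keywords, combined with OR")
    parser.add_argument("--live", action="store_true",
                        help="use DBpedia Live instead of standard DBpedia")
    parser.add_argument("--alpha", type=float, default=0.1,
                        help="alpha value for the alpha shape")
    parser.add_argument("--verbose", action="store_true",
                        help="print more information while running")
    return parser  # argparse adds --help automatically


# Parse the arguments of the second example execution.
args = build_parser().parse_args(
    ["results/ml.csv", "--alpha", "0.5", "--query", "Burgundy", "Bourgogne"])
print(args.output, args.alpha, args.query, args.live)
```

    Using nargs="+" lets the query option collect every keyword that follows it, which matches the description of several keywords forming a logical disjunction.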


    5.2 PROJECT FILE STRUCTURE

    Different code files are used in the implementation:

      vagueplaces.py: Main executable. Controls the flow of the process.
      cPlace.py: A class representing a place.
      cReport.py: A class representing a report.
      cSpinner.py: A class to generate a spinner thread. This is used to give feedback to the user.
      geom_functions.py: File containing the convex hull and alpha shape functions called from the main program.
      alpha_shape (directory):
        main.cpp: Code of the alpha_shaper binary.
        alpha_shaper: Compiled C++ executable to generate alpha shapes.
      alldbpediapoints.py: Script to retrieve all DBpedia points into a CSV file.

    5.3 POINT CLOUD GENERATION

    The point cloud is retrieved directly from DBpedia using a SPARQL query. The query is easy to manage from Python with the SPARQL-wrapper library [HFT].

    The initial design decision was to query the whole database directly and retrieve the points. In a practical implementation this is not possible because of two limitations of the public endpoint:

      - Result size limit.
      - Execution time limit.

    To avoid these problems two queries are involved: getCountries and getPoints (Figure 5.1). Splitting the work into these two queries avoids obtaining large amounts of data in a single result and exceeding the execution time limit, which happens when there are too many results to filter.

    getCountries: Retrieves a list of country URIs (Uniform Resource Identifiers) to iterate over when retrieving points. Currently it is only implemented for Europe, because DBpedia has a clear classification from which the list of European countries can be retrieved (code listing 5.2). Each continent needs its own query, and North America cannot simply be split into countries, because the USA has so many points that the query would exceed the execution time limit. The retrieval is implemented as a function named european_countries(); more functions can be added in the future.

    SELECT DISTINCT ?place WHERE {
        ?place rdf:type yago:EuropeanCountries .
        ?place rdf:type dbpedia-owl:Country
    }

    Listing 5.2: Europe countries SPARQL query


    getPoints: Queries DBpedia to retrieve the specific set of points that contain the query string in one of their abstracts (code listing 5.3). The code listing contains placeholder keywords written in capitals; these represent Python variables that are inserted into the query in each iteration. COUNTRY_URI takes its values from the results of SPARQL query 5.2. RESULTS_QUERY limits the total number of results per request, to stay below the result size limit of the endpoint. OFFSETVAL is only used when there are more results than RESULTS_QUERY, and it iterates over the result pages. Finally, QUERY is a filter condition generated from the list of input keywords.

    SELECT DISTINCT ?title ?geolat ?geolong
    WHERE {
        ?place rdf:type dbpedia-owl:Place .
        ?place dbpedia-owl:country COUNTRY_URI .
        ?place foaf:name ?title .
        ?place geo:lat ?geolat .
        ?place geo:long ?geolong .
        ?place dbpedia-owl:abstract ?abstract .
        FILTER ( QUERY )
    }
    OFFSET OFFSETVAL
    LIMIT RESULTS_QUERY

    Listing 5.3: Points SPARQL query
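    The OFFSET and LIMIT iteration described above can be sketched as a small paging loop. The function fetch_all and the run_query callable are hypothetical: in the real software the request would go through the SPARQL wrapper, whereas here a fake in-memory endpoint stands in for DBpedia so that the loop can be run without network access.

```python
import re

# Template following Listing 5.3; PREFIX declarations are omitted as in the
# listing. Literal braces are doubled because str.format is used.
QUERY_TEMPLATE = """
SELECT DISTINCT ?title ?geolat ?geolong
WHERE {{
    ?place rdf:type dbpedia-owl:Place .
    ?place dbpedia-owl:country <{country_uri}> .
    ?place foaf:name ?title .
    ?place geo:lat ?geolat .
    ?place geo:long ?geolong .
    ?place dbpedia-owl:abstract ?abstract .
    FILTER ( {filter_condition} )
}}
OFFSET {offset}
LIMIT {limit}
"""


def fetch_all(run_query, country_uri, filter_condition, limit=1000):
    """Collect every result page for one country."""
    rows, offset = [], 0
    while True:
        query = QUERY_TEMPLATE.format(country_uri=country_uri,
                                      filter_condition=filter_condition,
                                      offset=offset, limit=limit)
        page = run_query(query)
        rows.extend(page)
        if len(page) < limit:
            # A short page means there is nothing left to retrieve.
            return rows
        offset += limit


def fake_endpoint(query):
    """Stand-in for DBpedia: serves slices of 25 invented rows."""
    match = re.search(r"OFFSET (\d+)\s+LIMIT (\d+)", query)
    offset, limit = int(match.group(1)), int(match.group(2))
    data = [("Place %d" % i, 52.0, 6.0) for i in range(25)]
    return data[offset:offset + limit]


rows = fetch_all(fake_endpoint, "http://dbpedia.org/resource/Netherlands",
                 'regex(?abstract, "Twente", "i")', limit=10)
print(len(rows))
```

    One caveat of this scheme is that SPARQL does not guarantee a stable result order without an ORDER BY clause, so pages retrieved in separate requests could in principle overlap or skip rows.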

    Code listing 5.4 is a filter example. For each keyword, the software generates a case-insensitive regular expression test. The tests are combined using the logical OR operator (||).

    FILTER (regex(?abstract, "Bourgogne", "i") || regex(?abstract, "Burgundy", "i"))

    Listing 5.4: SPARQL FILTER condition example
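    Generating such a condition from the keyword list takes only a few lines of Python. The function build_filter below is a hypothetical helper written for this example, not the function used in vagueplaces.py.

```python
def build_filter(keywords):
    """Build a FILTER clause that matches any keyword, ignoring case."""
    tests = []
    for keyword in keywords:
        # Escape backslashes and double quotes for the SPARQL string literal.
        # Regular expression metacharacters in the keyword are left as they are.
        literal = keyword.replace("\\", "\\\\").replace('"', '\\"')
        tests.append('regex(?abstract, "%s", "i")' % literal)
    return "FILTER (%s)" % " || ".join(tests)


print(build_filter(["Bourgogne", "Burgundy"]))
```

    For the two keywords of the example this reproduces the condition shown in Listing 5.4.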

    5.3.1 Convex hull

    The Shapely library contains a convex hull function whose result can be exported as a WKT polygon definition. This is the function used to generate the convex hull.

    5.3.2 Alpha shape

    The alpha shape implementation caused more problems. I could not find an alpha shape implementation in QGIS or ArcGIS, nor a Python library that could be used directly. The existing open source implementations rely on two backgrounds: CGAL (Computational Geometry Algorithms Library) and the R statistical package [aRC10].

    In this project I use CGAL to generate the alpha shape of the DBpedia point set. The CGAL project provides a C++ library with complex geometry algorithms. Python bindings exist using SWIG (Simplified Wrapper and Interface Generator), but those are still under development and do not provide the full capabilities of CGAL. Instead of trying to bind the library in Python I decided to build an executable of the alpha shape generator in C++. This is an important point because the provided binary is tied to the Linux operating system; to use it on a different system, it has to be recompiled.

    This is not the only problem. CGAL is not a GIS library but a mathematical library, and it lacks the capability to export the results in a typical GIS format. To export the results, a specific module is implemented. The original result is an unordered list of segments; the new module orders the segments to produce polygons with holes.


    The function segments_to_polygons(vector,vector) generates a list of polygons to print, possibly containing holes. The current version of this function is not perfect: it still has some bugs that output invalid WKT polygons along with the valid result. Another option is available in the alpha shape executable: obtaining the results as a set of Linestrings instead of trying to generate a polygon.
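    The idea of ordering the segments can be shown with a much reduced Python sketch. It is not a port of the C++ function: it assumes that every point is shared by exactly two segments and that each ring has at least three distinct points, so holes, touching rings and degenerate rings are not handled. The names segments_to_rings and ring_to_wkt are invented for the example.

```python
def segments_to_rings(segments):
    """Chain unordered segments into closed rings of points."""
    neighbours = {}
    for a, b in segments:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    rings, visited = [], set()
    for start in neighbours:
        if start in visited:
            continue
        ring, previous, current = [start], None, start
        visited.add(start)
        while True:
            # Step to the neighbour we did not just come from.
            options = [p for p in neighbours[current] if p != previous]
            following = options[0]
            if following == start:
                break
            ring.append(following)
            visited.add(following)
            previous, current = current, following
        ring.append(start)  # WKT rings repeat the first point at the end
        rings.append(ring)
    return rings


def ring_to_wkt(ring):
    """Format one closed ring as a WKT polygon without holes."""
    coords = ", ".join("%g %g" % (x, y) for x, y in ring)
    return "POLYGON((%s))" % coords


# The four sides of a unit square, given in arbitrary order and direction.
segments = [((0, 0), (1, 0)), ((1, 1), (0, 1)),
            ((1, 0), (1, 1)), ((0, 1), (0, 0))]
for ring in segments_to_rings(segments):
    print(ring_to_wkt(ring))
```

    A complete version would additionally have to decide which rings are holes of which outer rings, which is the part that the text above describes as still producing invalid polygons in some cases.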

    5.3.3 Heat map

    Due to the lack of time, the heat map module is not implemented in this project.

    5.4 EXECUTION TIMES

    The execution times proved difficult to assess. They depend on several factors, the three most important being the network speed, the current load of DBpedia, and the DBpedia cache. The first time a query is performed, the time usually ranges between 5 and 15 minutes on the ITC wireless network.


    Chapter 6

    Results

    A small set of tests on vague places is prepared for the software: Twente, Midlands, and Highlands. To assess the results, the Vagueplaces software is also tested on four regions with clearly defined boundaries: Lazio (Italy), Overijssel (The Netherlands), Bourgogne (France), and Catalunya (Spain). The countries selected are considered to have high or medium point quality.

    It is difficult to decide whether a result is correct for a vague place, because there is no real value to cross-check against. The following pages present different results with a short comment on each. The maps in Figure 6.1 present the results of four queries on real boundaries. The polygons are generated with the default alpha value, querying directly for the region name.

    The Catalunya and Lazio results are mostly sharp on the coast and less accurate in the interior. For Catalunya, some results include southern France and other territories to the west, which used to be considered part of the territory. Both results have an outlier on the coast, and in both cases it represents an incorrect georeference for a place.

    Overijssel is correctly located, but the result does not follow the real boundaries very sharply. The total number of points (Table 6.1) is 200, which is not enough to provide a precisely defined boundary without obtaining a bigger or smaller polygon than expected.

    Bourgogne is the set with the most points, and the result reflects this. The polygon is very close to the boundary, but the default alpha value generates an unnecessary simplification.

    The alpha value is crucial for the result. At the same time, it is very difficult to define at first glance a suitable alpha for all cases. For this project, a default alpha of 0.1 is defined, but that does not mean that it is the best value. Figure 6.2 shows different alpha values for the Bourgogne point set. It is clear that for this set 0.1 is not the best option: because the point density is very high, it is possible to use a smaller alpha and obtain a very sharp boundary. On the other hand, Overijssel, with only 200 points, benefits more from a large alpha.

    Figures 6.3 to 6.5 show the vague boundaries retrieved by the software for Twente, the UK Midlands and the UK Highlands.

    Table 6.1: Approximate points for example results

    Region        Total Points
    Catalunya              620
    Overijssel             200
    Bourgogne             2100
    Lazio                  700
    Twente                  50
    Highlands             2000

    CGAL provides an optimal alpha function. This function generates an alpha that encloses all the points. Enclosing all the points is not a good strategy for this project because it will include outlier points. It would be interesting to compute the mean distance from each point to its neighbours. The mean of that distribution could be used as a starting point to define the alpha for a specific dataset.
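    One possible reading of this suggestion is the mean nearest-neighbour distance, sketched below. Restricting the computation to the single closest neighbour is an assumption of this example, as are the function name and the sample point sets; the brute-force loop is quadratic and only meant for small sets.

```python
import math


def mean_nearest_neighbour_distance(points):
    """Average distance from each point to its closest other point."""
    if len(points) < 2:
        raise ValueError("at least two points are needed")
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


# Two invented point sets with the same layout but different density.
dense = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
sparse = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(mean_nearest_neighbour_distance(dense))
print(mean_nearest_neighbour_distance(sparse))
```

    The denser set yields a value ten times smaller, in line with the observation that dense sets such as Bourgogne tolerate a smaller alpha than sparse sets such as Overijssel. How to scale this value into an actual alpha remains an open question.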


    Figure 6.1: Default alpha shape value (0.1) compared with four real boundaries: Catalunya, Overijssel, Bourgogne, and Lazio.


    [Figure panels: point set; alpha 0.1; alpha 0.05; alpha 0.005. Legend: alpha shape, official boundary.]

    Figure 6.2: Comparison of the same point set with different alpha values.


    [Map: Twente, 08/08/2012. Legend: alpha shape, convex hull. Projection: Amersfoort / RD New.]

    Figure 6.3: Regio Twente results with default values


    [Map: Midlands, 08/08/2012. Legend: alpha shape. Projection: OSGB 1936 / British National Grid.]

    Figure 6.4: Midlands UK results with default values


    [Map: Highlands, 02/08/2012. Legend: alpha shape. Projection: OSGB 1936 / British National Grid.]

    Figure 6.5: Highlands UK results with default values


    [Figure panels: point set; alpha 0.1 (default); alpha 1; convex hull.]

    Figure 6.6: Results comparison for the Highlands


    Chapter 7

    Project schedule

    The original idea for this project was to follow a waterfall approach. Figure 7.1 and Table 7.1 represent the original project schedule. Fortunately, the project followed the schedule without big deviations. The only clear deviation is that one of the use cases, the heat map, was dropped.

    Table 7.1: Table of tasks

    Task                        Start          End
    Proposal Writing            Jun/18/2012    Jun/22/2012
    RDF and SPARQL              Jun/25/2012    Jun/29/2012
    Existing literature         Jun/25/2012    Jun/29/2012
    Query Implementation        Jul/02/2012    Jul/13/2012
    Point cloud processing      Jul/16/2012    Jul/27/2012
    Software Implementation     Jul/30/2012    Aug/10/2012
    Testing and adjusting       Aug/13/2012    Aug/17/2012
    Document Writing            Aug/13/2012    Aug/23/2012
    Presentation preparation    Aug/20/2012    Aug/24/2012
    M0 - Deliver Report         Aug/24/2012    Aug/24/2012
    M1 - Point Cloud            Jul/13/2012    Jul/13/2012
    M2 - Polygon results        Jul/27/2012    Jul/27/2012
    M3 - Software beta          Aug/10/2012    Aug/10/2012

    Figure 7.1: Project Gantt Diagram.


    Chapter 8

    Conclusion

    Three main conclusions are drawn from this project:

      - The process is not suitable for all countries.
      - If the point set is dense, the result can be very accurate.
      - Research on how to obtain a suitable alpha value would be beneficial.

    The aim was to get acquainted with SPARQL and RDF technologies, study the suitability of DBpedia as a dataset, and implement a first version of a vague place generation software, all in a short period of one and a half months.

    I am fully satisfied with the outcome of this IFA project. Of course it is not perfect, but I feel that it sets a good starting point for larger research to define different world areas.

    The three objectives are fulfilled: SPARQL and RDF have been learned, the DBpedia analysis is included in this document, and the software can be obtained from a public GitHub repository [Cas11].

    From the DBpedia point analysis the conclusion is clear. In some countries this method can be useful, but in most countries it is not. This is a poor starting point, but it can improve. It already improved from DBpedia 3.7 to DBpedia 3.8, and as Wikipedia keeps growing these numbers will increase. The lack of Wikipedia articles in some zones depends on the total population, the access to the internet, and the number of Wikipedia article writers. In the current situation, the approach presented by this project is suitable for imprecise regions of Europe, North America, and Japan. Africa, South America and North Asia do not have enough points to obtain a proper result. India and Australia could be queried if the current dataset errors were solved.

    The project has a small code footprint that is easily manageable. With 250 lines of Python and 500 lines of C++ (not including comments and blanks), it is not a problem to review and extend. For future usage, more country list generators need to be implemented for vague places outside Europe. Another interesting project could be to replace all the C++ code with a native Python implementation of the alpha shape, producing fully operating-system-independent software.

    The third conclusion is perhaps the most important. The same default alpha value is used for all the countries but, as stated above, different point sets obtain better results with different alpha values. Follow-up research on how to obtain an optimal alpha value for different point sets would be very beneficial.

    Overall, my feeling about the project is positive and I would love to see it grow.


    LIST OF REFERENCES

    [ABK+07] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, and Zachary Ives. DBpedia: A nucleus for a web of open data. In Proceedings of 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference (ISWC + ASWC 2007), volume 4825, pages 11-15, Berlin, Heidelberg, Nov 2007. Springer Berlin / Heidelberg.

    [AEF+95] N. Akkiraju, H. Edelsbrunner, M. Facello, P. Fu, E. P. Mücke, and C. Varela. Alpha shapes: Definition and software. In Proceedings of the 1st International Computational Geometry Software Workshop, pages 63-66, 1995.

    [aRC10] Beatriz Pateiro-López and Alberto Rodríguez-Casal. Generalizing the convex hull of a sample: The R package alphahull. Journal of Statistical Software, 34(5):28, 2010.

    [AvKR+06] Avi Arampatzis, Marc van Kreveld, Iris Reinbacher, Christopher B. Jones, Subodh Vaid, Paul Clough, Hideo Joho, and Mark Sanderson. Web-based delineation of imprecise regions. Computers, Environment and Urban Systems, 30(4):436-459, 2006.

    [BL09] Christian Bizer and Jens Lehmann. DBpedia - a crystallization point for the web of data. 2009.

    [BLC11] Tim Berners-Lee and Dan Connolly. Notation3 (N3): A readable RDF syntax. http://www.w3.org/TeamSubmission/n3/, 2011.

    [BM04] Dave Beckett and Brian McBride. RDF/XML syntax specification. http://www.w3.org/TR/rdf-syntax-grammar/, 2004.

    [Cas11] Jordi Castells. Git repository for the defining vague places with web knowledge, a semantic web approach project. https://github.com/kxtells/vague-places, Aug 2011.

    [cia12] Central Intelligence Agency world factbook. https://www.cia.gov/library/publications/the-world-factbook, 2012.

    [dbp] DBpedia mailing list error discussion. http://sourceforge.net/mailarchive/message.php?msg_id=29588905.

    [Duc11] Bob DuCharme. Learning SPARQL. O'Reilly, 2011.

    [EDGK83] Herbert Edelsbrunner, David G. Kirkpatrick, and Raimund Seidel. On the shape of a set of points in the plane. IEEE Transactions on Information Theory, IT-29(4), 1983.

    [GB04] Jan Grant and Dave Beckett. N-Triples. http://www.w3.org/TR/rdf-testcases/#ntriples, Feb 2004.

    [HFT] Ivan Herman, Sergio Fernández, and Carlos Tejo. SPARQL endpoint interface for Python. http://sparql-wrapper.sourceforge.net/.

    [JPCJ08] Christopher B. Jones, Ross S. Purves, Paul D. Clough, and Hideo Joho. Modelling vague places with knowledge from the web. International Journal of Geographical Information Science, 22(10):1045-1065, 2008.

    [MM04] Frank Manola and Eric Miller. RDF primer. http://www.w3.org/TR/2004/REC-rdf-primer-20040210/, Feb 2004.

    [os12] U.S. Department of State. Independent states of the world. http://www.state.gov/s/inr/rls/4250.htm, January 2012.

    [PRS+02] R. Purves, A. Ruas, M. Sanderson, M. Sester, M. van Kreveld, and R. Weibel. Spatial information retrieval and geographical ontologies: an overview of the SPIRIT project. 2002.

    [PS98] Eric Prud'hommeaux and Andy Seaborne. SPARQL query language for RDF. http://www.w3.org/TR/rdf-sparql-query/, Jan 2008.


    Appendix A

    Report example

    1 ###########################################2 # REPORT GENERATED BY v ag u ep l a c e s . py3 # 2012 08 06 15 : 54 :06 . 794 9 454 #5 ###########################################6 ###########################################7 #8 # DATASET9 #

    ###########################################

    DBpedia Last release version
    QUERY: Midlands
    Retrieved Points: 506
    Skipped Points: 0
    FILE: /home/jordi/REPOS/vague places/results/test.csv

    country | total_points
    United_Kingdom | 473
    Netherlands | 1
    Switzerland | 1
    Republic_of_Ireland | 31

    ###########################################
    #
    # GEOMETRIES
    #
    ###########################################

    Alpha Shape WKT
    POLYGON((1.075 53.417, 1.547 53.794, 1.83333 53.366, 1.86442 53.2873, 2.21758 53.113, 2.614 53.453, 2.94639 53.4025, 3.06776 52.9767, 2.904 52.4669, 2.9496 52.3598, 2.425 52.045, 2.33 52.11, 2.16087 52.1556, 1.835 52.101, 1.59472 52.1943, 1.28 51.75, 0.833333 51.8168, 0.708 51.704, 0.317 51.685, 0.193 51.5484, 0.1331 51.5284, 0.211 51.194, 0.291 51.636, 0.13429 52.0346, 0.479444 52.1364, 0.498 52.5152, 0.1569 52.7889, 0.11 52.73, 0.235 53.105, 0.166667 53.4167, 0.166667 53.4228, 0.029 53.562, 0.087 53.5635, 0.4099 53.575, 0.590724 53.6875, 0.6 53.6875, 1.075 53.417, 1.075 53.417))
    POLYGON((3.5 52.8, 4.0444 52.5444, 3.5 52.8))
    POLYGON((1.11765 52.5648, 1.69648 52.6732, 1.11765 52.5648))
    POLYGON((1.6451 51.3432, 1.28 51.75, 1.6451 51.3432))
    POLYGON((2.614 53.453, 2.2981 53.0928, 2.75 52.712, 3.06776 52.9767, 3.5 52.8, 2.614 53.453))
    POLYGON((7.153 53.0853, 7.3378 53.5224, 7.41667 53.9333, 7.7214 53.798, 7.9167 53.8333, 7.992 53.674, 7.96667 53.5, 7.95 53.4206, 7.98333 53.1833, 7.59 52.78, 7.4 53, 7.3008 53.0309, 7.153 53.017, 6.82 53.017, 6.82 53.0853, 7.153 53.0853, 7.153 53.0853))
    POLYGON((0.11 52.73, 0.26 52.4, 0.534 52.448, 0.11 52.73, 0.11 52.73))
    POLYGON((0.90808 52.4533, 1.11765 52.5648, 0.90808 52.4533))
    POLYGON((0.6243 52.4539, 0.534 52.448, 0.74401 52.4185, 0.7455 52.4185, 0.90808 52.4533, 0.6243 52.4539, 0.6243 52.4539))
    POLYGON((1.6451 51.3432, 1.65656 51.3016, 1.6498 51.284, 1.6451 51.3432, 1.6451 51.3432))
    POLYGON((3.17 51.4351, 3.17 51.43, 3.16651 51.43, 3.16651 51.4351, 3.17 51.4351, 3.17 51.4351))

    Alpha: 0.1


    Optimal Alpha: 35.8797

    Convex Hull Shape WKT
    POLYGON((8.1999999999999993 47.4832999999999998, 8.3166700000000002 52.9832999999999998, 8.8510000000000009 53.5150000000000006, 7.9166999999999996 53.8333000000000013, 7.4166699999999999 53.9333000000000027, 1.5980000000000001 55.0570000000000022, 6.2000000000000002 52.1332999999999984, 8.1999999999999993 47.4832999999999998))

    Listing A.1: Example Vagueplaces report for the query "Midlands"
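The WKT geometries in such a report can be read back for inspection with a few lines of Python. The following is a minimal sketch, not part of the Vagueplaces tool: the helper names `parse_wkt_polygon` and `is_closed` are hypothetical, only single-ring polygons are handled, and the sample ring is one of the small alpha-shape polygons from the listing.

```python
import re


def parse_wkt_polygon(wkt):
    """Return the ring of a single-ring WKT POLYGON as a list of (x, y) tuples.

    Hypothetical helper for illustration; polygons with holes are rejected.
    """
    match = re.fullmatch(r"\s*POLYGON\s*\(\s*\((.*)\)\s*\)\s*", wkt,
                         flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("not a WKT POLYGON")
    body = match.group(1)
    if "(" in body or ")" in body:
        raise ValueError("only single-ring polygons are supported")
    ring = []
    for pair in body.split(","):
        x, y = pair.split()          # WKT separates the two ordinates by whitespace
        ring.append((float(x), float(y)))
    return ring


def is_closed(ring):
    """A valid WKT ring repeats its first vertex as its last vertex."""
    return len(ring) >= 2 and ring[0] == ring[-1]


# One of the alpha-shape rings from the example report.
wkt = ("POLYGON((3.17 51.4351, 3.17 51.43, 3.16651 51.43, "
       "3.16651 51.4351, 3.17 51.4351, 3.17 51.4351))")
ring = parse_wkt_polygon(wkt)
print(len(ring), is_closed(ring))
```

A check of this kind makes degenerate output easy to spot: several rings in the report consist of only two distinct vertices plus the closing vertex, so they enclose no area.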


    Appendix B

    Atlas


    Figure B.1: DBpedia 3.7 points in Europe. Albers equal-area conic projection for Europe (Appendix C, definition 4).

    Figure B.2: DBpedia 3.7 points in Africa. Albers equal-area conic projection for Africa (Appendix C, definition 7).


    Figure B.3: DBpedia 3.7 points in Asia. Albers equal-area conic projection for Asia (Appendix C, definition 5).

    Figure B.4: DBpedia 3.7 points in Oceania. Albers equal-area conic projection for Oceania (Appendix C, definition 6).


    Figure B.5: DBpedia 3.7 points in North America. Albers equal-area conic projection for North America (Appendix C, definition 2).

    Figure B.6: DBpedia 3.7 points in South America. Albers equal-area conic projection for South America (Appendix C, definition 3).


    Appendix C

    Proj4 definitions

    1. Mollweide World
       +proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs

    2. Albers conic Equal Area North America
       +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs

    3. Albers conic Equal Area South America
       +proj=aea +lat_1=-5 +lat_2=-42 +lat_0=-32 +lon_0=-60 +x_0=0 +y_0=0 +ellps=aust_SA +units=m +no_defs

    4. Albers conic Equal Area Europe
       +proj=aea +lat_1=43 +lat_2=62 +lat_0=30 +lon_0=10 +x_0=0 +y_0=0 +ellps=intl +units=m +no_defs

    5. Albers conic Equal Area Asia
       +proj=aea +lat_1=15 +lat_2=65 +lat_0=30 +lon_0=95 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs

    6. Albers conic Equal Area Oceania
       +proj=aea +lat_1=-18 +lat_2=-36 +lat_0=0 +lon_0=134 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs

    7. Albers conic Equal Area Africa
       +proj=aea +lat_1=20 +lat_2=-23 +lat_0=0 +lon_0=25 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs

    8. British national grid
       +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs

    9. Amersfoort RD NEW
       +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
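Each definition above is a whitespace-separated list of `+key=value` parameters, with `+no_defs` as a flag that carries no value. As a minimal sketch of how such a string can be inspected, the hypothetical helper `parse_proj4` below splits a definition into a dictionary using only the Python standard library. It performs no validation of parameter names and does not project any coordinates; the sample string is definition 4 (Europe) written in compact form.

```python
def parse_proj4(definition):
    """Split a Proj4 string into a dict of parameters.

    Hypothetical helper for illustration: values stay as strings, and
    valueless flags such as +no_defs are stored as True.
    """
    params = {}
    for token in definition.split():
        if not token.startswith("+"):
            raise ValueError(f"unexpected token: {token!r}")
        key, sep, value = token[1:].partition("=")
        params[key] = value if sep else True
    return params


# Definition 4: Albers conic Equal Area Europe.
europe = ("+proj=aea +lat_1=43 +lat_2=62 +lat_0=30 +lon_0=10 "
          "+x_0=0 +y_0=0 +ellps=intl +units=m +no_defs")
params = parse_proj4(europe)

# The two standard parallels and the central meridian of the conic projection.
print(params["proj"], params["lat_1"], params["lat_2"], params["lon_0"])
```

For actual reprojection a projection library is needed; with pyproj, for instance, the same string can be passed to `pyproj.CRS.from_proj4`.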