%Template for producing VLSI Symposia proceedings€¦ · Web view2017-11-10 · The Cultural...

Location inference and verification techniques for cultural heritage attraction

Watchira Buranasing, Thepchai Supnithi, Monthika Boriboon, Marut Buranarach

National Electronics and Computer Technology Center

Thailand Science Park, Klong Luang, Pathumthani 12120, Thailand

{watchira.buranasing, thepchai.supnithi, monthika.boriboon, marut.buranarach}@nectec.or.th

Abstract

Cultural heritage tourism is a branch of tourism oriented towards the cultural heritage of the location where tourism occurs. One of its essential parts is the geographical location of places. An important issue in integrating data for cultural heritage tourism applications is how to identify the location of places or events. We use three factors to verify a location. The first is geolocation-based: we use the Haversine formula to find the distance between the inferred geocoordinates. The second is content-based: we use doc2vec, a modification of the word2vec model that uses context information to capture semantic similarity. The third is location extraction based on time: we use template matching together with time information to verify that two records describe the same place. We successfully apply our approach to the cultural knowledge center archive, the data center archive of the Fine Arts Department, and Wikipedia. The experiments show 84% accuracy when the geolocation-based factor is combined with the content-based factor.

Keywords: location inference, tourism, cultural heritage, location-based, doc2vec

I. Introduction

Cultural heritage [1][2] reflects a people's way of life, and it is becoming an important issue. Cultural heritage tourism is a branch of tourism oriented towards the cultural heritage of the location where tourism occurs. Heritage tourism can be defined as traveling to experience the places, artifacts, and activities that represent the stories and people of the past. It can include cultural, historical, and natural resources. Cultural heritage tourism may focus on historical attractions, museums, monuments, art galleries, festivals, performances, and cultural communities.

One of the most important parts of cultural heritage tourism is the geographical location of places or events. The accuracy and efficiency of location detection have a significant impact on several tasks, such as improving the relevance of search results, recommendation systems, and event extraction.

Locations can be explicit, such as latitude, longitude, and location fields in structured data, or implicit, such as city and country names that serve as geolocation features in plain text.

An important issue in integrating data for cultural heritage tourism applications is how to identify the location of places or events. There are two problems of geographic ambiguity. The first is that some places have the same name but different geolocations. The mismatch can be at the level of country, region, city, or latitude/longitude. For example, there are many places named “Wat Mahathat” in Thailand. Figure 1 shows two of them: one is in Bangkok and the other is in Ayutthaya.

Figure 1. An example of places with the same name but different geolocations: two temples named Wat Mahathat, one in Bangkok and the other in Ayutthaya.

The second problem is that some places have more than one name, such as a local name, an old name, a formal name, or an informal name. Table 1 shows examples of place names and their alias names.

Table 1. Examples of place names and their alias names.

Place name | Alias name
Wat prasrirattana Sasadharam | Wat Prakaew
Wat Phra Chetuphon Vimolmangklararm Rajwaramahaviharn | Wat Pho, Wat Photaram
Wat Arun Ratchawararam Ratchawaramahawihan | Wat Arun, Wat Makok, Wat Chang

The basic idea of location inference is to work on location information from textual content, location-specific elements, and location tags. Our work is related to work on social media and information inference techniques. Eisenstein et al. [3] describe a model and apply it to US-based users to geolocate them based on their content. Schulz et al. [4] present a location inference method that combines spatial indicators, such as tweet messages, profile location information, and time zones, using a polygon mapping technique that estimates the location within a 50 km radius. Ostermann et al. [5] present a method for extracting and comparing places using geo-social media, which extracts Flickr images from the same geographic areas and analyzes the tags that the authors used to describe them. Han et al. [6] present methods for applying feature selection to identify location indicative words (LIWs) for text-based geolocation. Rattenbury et al. [7] present a method to extract place and event semantics from tags based on the GPS metadata of images on Flickr. Lingad et al. [8] apply named entity recognizers to extract locations from microblogs at the level of both geolocation and point of interest. Mahmud et al. [9] present an algorithm for inferring the home location of users from the content of their tweets and their tweeting behavior. All of the above infer geolocation from basic information, but they cannot verify the location when different places share the same title or when the same location has different place names.

This paper introduces an approach for verifying the location of places from various data sources. The main challenges addressed by this work can be summarized as follows:

- We design a model for identifying places that have the same title but different locations.

- We design a model for identifying places that have the same location but different titles.

The remainder of the paper is organized as follows. Section II gives an overview of the model, data collection, data preparation, and location inference verification, which includes the geolocation-based, content-based, and time-based location extraction factors. Section III presents the experimental results, and Section IV concludes and discusses future directions.

II. System Overview

A. Method and components

Figure 2 shows the method and components of location inference. There are two processes: data preparation and location inference verification. Data preparation sets up the data for location inference verification. It includes data collection, which gathers data from various data sources, and data cleaning, which removes unnecessary data. The location inference verification process uses the collected data to verify the location of a place or event. Three factors are used to verify the location: geolocation-based, content-based, and location extraction based on time. We evaluate the model with these three factors.

Figure 2. The system overview of the method and components.

B. Data Collection and Data Cleaning

We collect data from various data sources in the cultural heritage domain. Each record can provide a title, a description, and geolocation information such as country, area, city, and latitude/longitude. However, data from different sources needs to be cleaned as follows:

- Convert the data to a standard encoding such as UTF-8.

- Replace tabs with a single space.

- Replace HTML entities with the corresponding characters.

- Remove all HTML tags.
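The cleaning steps listed above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration rather than the pipeline used in the paper; the function name `clean_record` and the regular expression for tags are assumptions made for this example.

```python
import html
import re

# Assumption: a simple pattern is enough to strip tags from short descriptions.
TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_record(raw: bytes, encoding: str = "utf-8") -> str:
    """Apply the four cleaning steps to one raw text field."""
    # Convert to a standard encoding; undecodable bytes are replaced, not dropped.
    text = raw.decode(encoding, errors="replace")
    # Replace each run of tabs with a single space.
    text = re.sub(r"\t+", " ", text)
    # Remove HTML tags before decoding entities, so that an escaped "&lt;"
    # in the text is not mistaken for the start of a tag.
    text = TAG_PATTERN.sub("", text)
    # Replace HTML entities with the corresponding characters.
    text = html.unescape(text)
    return text.strip()


cleaned = clean_record(b"<p>Wat Arun\t\tin &quot;Bangkok&quot; &amp; Thonburi</p>")
print(cleaned)
```

Tags are stripped before entities are decoded in this sketch; the list above does not fix an order, so this is a design choice of the example.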

C. Location Inference Verification

Location inference verification draws on three references: geolocation-based, content-based, and location extraction based on time. We describe each in the following subsections.

1. Geolocation-Based

2. Content-Based

3. Location Extraction based on time

Geolocation-Based: Geolocation [10] is the estimation or identification of the physical location of places, objects, or events. We integrate data from heterogeneous data sources. The accuracy of location verification depends on the correctness, amount, and variety of the location sources. A geolocation object contains an address (sub-district, district, city, country, and zip code) and a geolocation (latitude and longitude). The problem with verifying a place by geolocation is that the same place can be recorded with different geolocations. To address this, we use the Haversine formula [11] to find the distance between the inferred geocoordinates. Some archives store coordinates in decimal degrees, so we convert them to radians before use.

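The following formula converts decimal degrees to radians:

$$\text{radians} = \text{degrees} \times \frac{\pi}{180} \qquad (1)$$

The Haversine formula calculates the great-circle distance, the shortest distance between two points on a sphere, from their coordinates:

$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right) \qquad (2)$$

where $\varphi_1, \varphi_2$ are the latitudes and $\lambda_1, \lambda_2$ the longitudes of the two points in radians, and $r$ is the radius of the Earth, i.e., 6371 km or approximately 3959 miles.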

Using formula (2), we compute the distance between the inferred and actual locations of the sample places and evaluate the distance error. The best distance for verifying a location falls in the 0.21-0.30 km interval, as shown in Figure 3.
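The distance computation described above can be sketched as follows. The function names, the use of 0.30 km as an upper cut-off, and the sample coordinates are assumptions for illustration only; the coordinates are approximate and are not taken from the paper's dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # radius of the Earth as given in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    # Formula (1): convert decimal degrees to radians.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Guard against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def within_threshold(coord_a, coord_b, max_km=0.30):
    """Assumed decision rule: two records match if they are at most max_km apart."""
    return haversine_km(*coord_a, *coord_b) <= max_km


# Approximate, illustrative coordinates for two temples with the same name.
bangkok = (13.755, 100.491)
ayutthaya = (14.357, 100.568)
print(round(haversine_km(*bangkok, *ayutthaya), 1), within_threshold(bangkok, ayutthaya))
```

With a rule like this, two records that share a title but lie tens of kilometres apart are treated as different places.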

Figure 3. The results of the evaluation for checking the distance of places

Content-Based: The data comes from various data sources, and we focus on the description of each article to verify whether two records refer to the same place. The hypothesis is that descriptions of the same place usually share some words even when they were written by different writers. In this paper, we use doc2vec [12][13], which is modified from the word2vec [14] model. It is an unsupervised method for learning continuous distributed vector representations of blocks of text such as sentences, paragraphs, or documents.

A word embedding model is learned using a loss function defined on word pairs, whereas a sentence embedding model is learned using a loss function defined on sentence pairs. A sentence embedding model usually captures the relationships among the words in a sentence and takes context information into account. For our approach, a sentence embedding model is more suitable than a word embedding model because we focus on semantic similarity.

Figure 4 shows the architecture of the model, which is similar to the word2vec model. We use the distributed memory model of paragraph vectors (PV-DM). This model works similarly to CBOW, but its input includes a document token in addition to the context words. The model attempts to predict a word in an ordered sequence from the surrounding words in the context together with the paragraph or document ID. Every paragraph or document is mapped to a vector. The vectors are averaged, summed, or concatenated as input to the next step. In our approach, we combine the vectors by averaging.
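To make the averaging step concrete, the following sketch shows a single PV-DM style forward pass: the document vector and the context word vectors are averaged, and the result is scored against every candidate word. It is a toy illustration with made-up two-dimensional vectors and no training loop, not the implementation used in the paper; all names and values are hypothetical.

```python
import math


def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def predict_word(doc_vector, context_vectors, output_vectors):
    """Average the inputs, score each vocabulary word, and apply a softmax."""
    hidden = average_vectors([doc_vector] + context_vectors)
    scores = {
        word: sum(h * o for h, o in zip(hidden, vec))
        for word, vec in output_vectors.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {word: math.exp(s - top) for word, s in scores.items()}
    total = sum(exp_scores.values())
    return {word: e / total for word, e in exp_scores.items()}


# Hypothetical toy vectors: one document token and two context words.
doc_vector = [1.0, 0.0]
context_vectors = [[0.0, 1.0], [1.0, 1.0]]
output_vectors = {"temple": [1.0, 1.0], "river": [-1.0, -1.0]}
print(predict_word(doc_vector, context_vectors, output_vectors))
```

During training, the document vector and word vectors would be updated so that the correct word receives a high probability; the trained document vectors are what the similarity measure later compares.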

Figure 4. The architecture of the model.

We adopt the cosine similarity between the semantic vectors of two sentences as the measure of their similarity. Table 2 shows the document similarity results for verifying places using the content-based factor.
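For reference, cosine similarity between two vectors can be computed as in the sketch below. The function name and the handling of zero vectors are choices made for this example, and the sample vectors are arbitrary.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat a zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

A value close to 1 means the two document vectors point in nearly the same direction, while a value near 0 means they are unrelated.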

Table 2. Document similarity measured with doc2vec (size=300, min_count=0, alpha=0.025, min_alpha=0.025).

Document 1 | Document 2 | Similarity
วัดบวกครกหลวง (Buak khrok luang temple): MOC | วัดบวกครกหลวง (Buak khrok luang temple): Fine Art | 0.549
วัดอรุณราชวรารามราชวรมหาวิหาร (Wat Arun Ratchawararam Ratchawaramahawihan): Fine Art | วัดอรุณราชวรารามราชวรมหาวิหาร (Wat Arun Ratchawararam Ratchawaramahawihan): Wikipedia | 0.540

Location Extraction based on time: This process aims to extract relations between pairs of entities in sentences. We use template matching [15] to find records of the same place, focusing on the descriptions in each dataset. The approach aims to answer three question words: What, Where, and When. “WHAT” refers to another name of the place. “WHERE” refers to the place or location where the subject is located or that it is located near. “WHEN” refers to the time when the subject was created; the extracted value can be an era, a date in MM-DD-YY format, or a period. We use both location and time because some places changed their names over time. Table 3 shows the relation extraction patterns.

Table 3. Examples of information extraction patterns for location extraction based on time.

Question word | Relation | Surface
WHERE | Located at | ตั้งอยู่ที่, ที่ตั้ง, ตั้งอยู่, ที่อยู่
WHAT | Has alias name | ชื่อเดิม, เดิมชื่อ, เรียกอีกชื่อว่า, ชาวบ้านเรียกว่า
WHEN | Built in | สร้างเมื่อ, สร้างตั้งแต่, สร้างในสมัย

Applying the templates to extract information yields the set of data shown in Table 4.

Table 4. The set of extracted data.

Field | Document 1 | Document 2
Title | วัดบวกครกหลวง (Buak khrok luang temple) | วัดบวกครกหลวง (Buak khrok luang temple)
Alias | วัดม่วงคำ (Moung kam temple) |
Located at | ตำบลท่าศาลา (Thasala subdistrict), อำเภอเมือง (Muang district), จังหวัดเชียงใหม่ (Chiang Mai Province) | ตำบลท่าศาลา (Thasala subdistrict), อำเภอเมือง (Muang district), จังหวัดเชียงใหม่ (Chiang Mai Province)
Built in | รัตนโกสินทร์ (Rattanakosin era) |
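The extraction step that produces records like the one above can be sketched with regular expressions. This is a simplified illustration, not the authors' template matcher: the function names are hypothetical, the sample sentence is made up for this example, and the sketch assumes that the value to extract is the whitespace-delimited token following the surface pattern, whereas real Thai text would first need word segmentation.

```python
import re

# Surface patterns per question word, in the style of the templates in Table 3.
TEMPLATES = {
    "WHERE": ["ตั้งอยู่ที่", "ที่ตั้ง", "ตั้งอยู่", "ที่อยู่"],
    "WHAT": ["เรียกอีกชื่อว่า", "ชาวบ้านเรียกว่า", "ชื่อเดิม", "เดิมชื่อ"],
    "WHEN": ["สร้างตั้งแต่", "สร้างในสมัย", "สร้างเมื่อ"],
}


def compile_templates(templates):
    """Build one regex per question word, trying longer surfaces first."""
    compiled = {}
    for question, surfaces in templates.items():
        ordered = sorted(surfaces, key=len, reverse=True)
        alternation = "|".join(re.escape(s) for s in ordered)
        # Assumption: the slot value is the next whitespace-delimited token.
        compiled[question] = re.compile(r"(?:" + alternation + r")\s*(\S+)")
    return compiled


def extract_slots(text, compiled):
    """Return the first value found for each question word, if any."""
    slots = {}
    for question, pattern in compiled.items():
        match = pattern.search(text)
        if match:
            slots[question] = match.group(1)
    return slots


# Made-up sentence, assembled from the surfaces above so it stays in sync with them.
sentence = (
    f"วัดบวกครกหลวง {TEMPLATES['WHERE'][0]} ตำบลท่าศาลา "
    f"{TEMPLATES['WHAT'][3]} วัดม่วงคำ {TEMPLATES['WHEN'][1]} รัตนโกสินทร์"
)
slots = extract_slots(sentence, compile_templates(TEMPLATES))
print(slots)
```

Sorting the surfaces by length keeps a short pattern from matching the beginning of a longer one that shares its prefix.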

We then use the extracted results to measure the similarity of each pair of documents, using cosine similarity as a term-based similarity measure. The results are reported in Table 5.
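A term-based cosine similarity over extracted fields can be sketched as follows. The term lists are hypothetical and only loosely modelled on the extracted fields; treating each extracted value as a single term is an assumption of this example, so the score is not expected to reproduce the values reported in the paper.

```python
import math
from collections import Counter


def term_cosine(terms_a, terms_b):
    """Cosine similarity between two bags of terms."""
    counts_a, counts_b = Counter(terms_a), Counter(terms_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical extracted terms for two records of the same temple.
record_a = ["Buak khrok luang temple", "Moung kam temple", "Thasala",
            "Muang", "Chiang Mai", "Rattanakosin"]
record_b = ["Buak khrok luang temple", "Thasala", "Muang", "Chiang Mai"]
print(round(term_cosine(record_a, record_b), 3))
```

Because only extracted values are compared, two descriptions written in very different styles can still score highly when they agree on title, location, and era.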

Table 5. Examples of document similarity results using cosine similarity.

Document 1 | Document 2 | Similarity
วัดบวกครกหลวง (Buak khrok luang temple): MOC | วัดบวกครกหลวง (Buak khrok luang temple): Fine Art | 0.795
วัดอรุณราชวรารามราชวรมหาวิหาร (Wat Arun Ratchawararam Ratchawaramahawihan): Fine Art | วัดอรุณราชวรารามราชวรมหาวิหาร (Wat Arun Ratchawararam Ratchawaramahawihan): Wikipedia | 0.632

III. Experimental Results

The dataset for our approach is collected from three data sources. The first is cultural knowledge from the cultural knowledge center of the Ministry of Culture. The second is the cultural archive of the Fine Arts Department of Thailand. The third is Wikipedia, where we focus on cultural heritage information about Thailand. The data consists of the title, description, and geolocation from these data sources. We use 500 records to evaluate five models for verifying places. For the geolocation-based factor, we use 0.21-0.30 km as the best distance. We compare the following models: geolocation-based, content-based, location extraction based on time, geolocation-based with content-based, and geolocation-based with location extraction based on time. The results are reported in Table 6.

Table 6. The experimental results.

Model | Accuracy (%)
Geolocation-based | 60
Content-based | 75
Location extraction based on time | 79
Geolocation-based + content-based | 84
Geolocation-based + location extraction based on time | 82

As Table 6 shows, the highest accuracy, 84%, is achieved by combining the geolocation-based factor with the content-based factor.

IV. Conclusion and Future Work

This paper presents a methodology for verifying the location of places from various data sources. We use three factors to verify a location. The first is geolocation-based: we use the Haversine formula to find the distance between the geocoordinates. The second is content-based: we use doc2vec, which uses context information to capture semantic similarity. The third is location extraction based on time: we use template matching together with time information to verify that two records describe the same place. We successfully apply our approach to the cultural knowledge center archive, the data center archive of the Fine Arts Department, and Wikipedia. The experiments show 84% accuracy when the geolocation-based factor is combined with the content-based factor. In future work, we plan to verify places using social network information.

References

[1] Wikipedia: Culture, https://simple.wikipedia.org/wiki/Culture, 2015

[2] Culture, https://www.tamu.edu/faculty/choudhury/culture.html

[3] Eisenstein, J.; O’Connor, B.; Smith, N.A.; Xing, E.P. A latent variable model for geographic lexical variation. Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, 2010; pp. 1277–1287.

[4] Schulz, A.; Hadjakos, A.; Paulheim, H.; Nachtwey, J.; Mühlhäuser, M. A multi-indicator approach for geolocalization of tweets. International AAAI Conference on Weblogs and Social Media, Cambridge, MA, USA, July 2013

[5] F. O. Ostermann, H. Huang, G. Andrienko, N. Andrienko, C. Capineri, K. Farkas, Extracting and comparing places using geo-social media, ISPRS Geospatial Week 2015.

[6] Bo Han, Paul Cook, Timothy Baldwin, Geolocation Prediction in Social Media Data by Finding Location Indicative Words, COLING 2012.

[7] Tye Rattenbury, Nathaniel Good and Mor Naaman, Towards automatic extraction of event and place semantics from flickr tags, SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 2007, pages 103-110.

[8] John Lingad, Sarvnaz Karimi and Jie Yin, Location Extraction From Disaster-Related Microblogs, Proceedings of the 22nd International Conference on World Wide Web, 2013, pages 1017-1020

[9] Jalal Mahmud, Jeffrey Nichols, Clemens Drews, Where Is This Tweet From? Inferring Home Locations of Twitter Users, Proceedings of the Sixth International AAAI Conference on Weblogs and Social Media, 2012

[10] Geolocation, https://en.wikipedia.org/wiki/Geolocation

[11] Jovin J. Mwemezi, Youfang Huang, Haversine formula Great Circle Distances and Bearings Between Two Locations, New York Science Journal, 2011

[12] Quoc Le, Tomas Mikolov, Distributed representations of sentences and documents, Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pages 1188-1196

[13] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781, 2013

[14] Buranasing, W., Phoomvuthisarn, S., and Buranarach, M., Information Extraction and Integration for Enriching Cultural Heritage Collections, Proc. of the 11th International Conference on Knowledge, Information and Creativity Support Systems (KICSS 2016), November 2016.
