Location-based search: services, photos, web
Andrei Tabarcea, Mohammad Rezaei
4.12.2013
Introduction
The goal is to find services, photos and points of interest close to the user’s location
We call this “location-based search”
We search our local database of photos and services and look for location information in web pages
[Figure: the user location and a keyword are sent to MOPSI search, which combines the search results of the Mopsi Services Database, the Mopsi Photo Collection and Mopsi Web Search and shows the results on a map]
Mopsi search
Web interface
Input: (keyword, user location)
Output: array of results
[Figure: web interface with keyword field, search options, results list, user location and results on map]
Mobile interface
[Figure: mobile interface with user location and search results]
Mopsi search (server workflow)
Input: (keyword, flagS, flagP, flagW, user location)
Output: (g_markersData) array of results
[Figure: interface with keyword field, search options (flagS, flagP, flagW), results list, user location and results on map]
Location based search
Input: keyword, flagS, flagP, user location (lat, lon)
Output: list of results
Note: A service has a list of keywords and a title; a photo has just a description. Keyword search is done according to this information.
Notation:
• S: service, P: photo
• text(S): keywords and title of a service
• text(P): description of a photo
• flagS: search for services if true
• flagP: search for photos if true
Overall flow
1. Start
2. Update keywords statistics and keywords history
3. If flagS: local service search and display results in list
4. If flagP: photo search and add results to the list
5. If flagW: web search, add results to the list and on map
6. End
When a keyword is searched:
• statistics: the keyword's count in the database is incremented, and the keyword and city are stored
• history: keyword, location, userid and time are stored
Stage 1: Search Mopsi services
Display all results on map
Stage 2: Search Mopsi photos
Stage 3: Search web
Local service search
1. Start
2. Do search on server; nL = number of results
3. If nL > 0:
   • Cluster results with almost the same title and location
   • Take and display one of the similar results as representative
   • Sort the results (distance to user location)
   • Display the list of results
4. End
Photo search
1. Start
2. Do search on server; nP = number of results
3. If nP > 0:
   • Cluster results with almost the same title and location
   • Cluster the results and the local services with almost the same title and location
   • Sort the results (distance to user location)
   • Add results to the list
4. End
Web search
1. Start
2. Do search on server; nW = number of results
3. If nW > 0:
   • Cluster results with almost the same title and location
   • Cluster the results, the local services and the photos with almost the same title and location
   • Sort the results (distance to user location)
   • Add results to the list
   • Add results on the map
4. End
Filtering results: old solution
Fixed distance to user location: d
Find services where text(S) ≈ keyword AND dist(S,User) < d
Find photos where text(P) ≈ keyword AND dist(P,User) < d
Advantages:
• Simple
• Same time for any search
Disadvantages:
• Parameter d (the user can choose d, but it is still not automatic)
• There are many cases with “no results”
Current solution: Binary search for the K nearest services
Show all the results within 10 km
If the number of results is less than K, double the distance (up to the whole earth); once the number of results is bigger than K, bisect the distance
Example with k=5:
• Number of results n in distance d: n=1 < k
• Double the distance: in 2d, n=2 < k
• In 4d, n=8 > k
• Now bisect the distance in the colored area (between 2d and 4d): in 3d, n=4 < k
• In 3.5d, n=5 (=k)
So we have the 5 nearest results to the user location within distance x = 3.5d
[Figure: circles of radius d, 2d and 4d around the user location; each marker is a photo or service with the required keyword]
Algorithm
d = 10000: initial distance
K = 10: number of required results
delta_dist: minimum distance for dividing
ns: number of resulting services res_S
np: number of resulting photos res_P

res_S = services where text(S) ≈ keyword
res_P = photos where text(P) ≈ keyword
if ( ns+np > K )
    (res_S, res_P, dist) = extend_distance()
    (res_S, res_P, dist) = contract_distance(dist, K)
display (ns+np) services and photos

extend_distance()
    dist = d; ns = 0; np = 0
    while ( ns+np < K AND dist < earth_r*pi )
        res_S = services where text(S) ≈ keyword AND dist(S,User) < dist
        res_P = photos where text(P) ≈ keyword AND dist(P,User) < dist
        dist = dist*2
    dist = dist/2

[Figure: circles of radius d, 2d and 4d around the user location, with the interval Δ marked]

Algorithm (cont.)
contract_distance(dist, K)
    d1 = dist/2
    d2 = dist
    dist = (d1 + d2)/2
    delta = dist - d1
    ns = np = 0
    while ( ns+np != K AND delta > delta_dist AND dist > d )
        res_S = services where text(S) ≈ keyword AND dist(S,User) < dist
        res_P = photos where text(P) ≈ keyword AND dist(P,User) < dist
        if ( ns+np > K )
            d1 = d1; d2 = dist
        else
            d1 = dist; d2 = d2
        dist = (d1 + d2)/2
        delta = dist - d1
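The doubling and bisection steps can be sketched in Python as follows. This is a minimal illustration and not the Mopsi server code: the function name `k_nearest_radius`, the Earth radius constant and the idea of passing precomputed distances (instead of running database queries) are assumptions made for the example.

```python
import math

EARTH_R = 6_371_000.0  # assumed mean Earth radius in metres


def k_nearest_radius(distances, k=10, d=10_000.0, delta_dist=100.0):
    """Return a search radius (metres) that holds about k of the given results.

    distances: distance in metres from the user to every keyword match.
    """
    def count(radius):
        return sum(1 for x in distances if x < radius)

    dist = d
    if count(dist) >= k:
        return dist  # the initial radius already holds enough results

    # Extend: double the radius until k results are inside or the
    # radius covers half the Earth's circumference.
    limit = EARTH_R * math.pi
    while count(dist) < k and dist < limit:
        dist *= 2
    if count(dist) < k:
        return dist  # fewer than k matches exist anywhere

    # Contract: bisect between the last radius that was too small and
    # the first one that was large enough.
    lo, hi = dist / 2, dist
    while hi - lo > delta_dist:
        mid = (lo + hi) / 2
        n = count(mid)
        if n == k:
            return mid
        if n > k:
            hi = mid
        else:
            lo = mid
    return hi  # smallest tested radius with at least k results


# Worked example from the slide with d = 10 km and k = 5.
matches = [5_000, 15_000, 25_000, 28_000, 32_000, 36_000, 37_000, 39_000]
print(k_nearest_radius(matches, k=5))  # 35000.0
```

Returning the upper bound when the bisection stops early guarantees at least k results, at the cost of sometimes returning a few more than k.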
Simplifying distance calculation
Since there is no spatial dist function in MySQL:
• Exact: points with distance < d from the user location
• Simplified: |lat-lat1| < Δlat AND |lon-lon1| < Δlon
[Figure: circle of radius d around the user location (lat1, lon1), enclosed by the box through (lat1+Δlat, lon1) and (lat1, lon1+Δlon)]
Given lat1, lon1 and d (in meters), what are Δlat and Δlon?
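Haversine distance:
Distance d (in meters) between two points (lat1, lon1) and (lat2, lon2), where E is the Earth diameter (in meters), Δlat = lat2 - lat1, Δlon = lon2 - lon1, and angles are in radians:
$$d = E \cdot \arcsin\left(\sqrt{\sin^2(\Delta lat/2) + \cos(lat_1)\cos(lat_2)\sin^2(\Delta lon/2)}\right)$$
For (lat1, lon1) and (lat1, lon1+Δlon), Δlat = 0:
$$d = E \cdot \arcsin\left(\sqrt{\cos^2(lat_1)\,\sin^2(\Delta lon/2)}\right)$$
$$\sin(d/E) = \cos(lat_1)\,\sin(\Delta lon/2)$$
$$\Delta lon = 2\arcsin\left(\frac{\sin(d/E)}{\cos(lat_1)}\right)$$
For (lat1, lon1) and (lat1+Δlat, lon1), Δlon = 0:
$$d = E \cdot \arcsin\left(\sqrt{\sin^2(\Delta lat/2)}\right)$$
$$\Delta lat = \frac{2d}{E}$$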
Note: if $\sin(d/E)/\cos(lat_1) > 1$: $\Delta lon = \Delta lat$
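As a rough illustration of the simplified filter, the sketch below computes Δlat and Δlon in degrees from the formulas of this section and tests whether a point falls inside the resulting box. The function names and the Earth diameter value are assumptions; the special case from the note is not handled, and the sketch raises an error when the arcsine argument exceeds 1.

```python
import math

EARTH_DIAMETER = 2 * 6_371_000.0  # E in metres, assuming a mean radius of 6371 km


def bounding_deltas(lat1_deg, d):
    """Return (dlat, dlon) in degrees for a radius of d metres at latitude lat1."""
    lat1 = math.radians(lat1_deg)
    dlat = 2 * d / EARTH_DIAMETER                     # delta lat = 2d / E
    ratio = math.sin(d / EARTH_DIAMETER) / math.cos(lat1)
    if ratio > 1:
        raise ValueError("radius too large for this latitude")
    dlon = 2 * math.asin(ratio)                       # delta lon = 2 asin(sin(d/E) / cos(lat1))
    return math.degrees(dlat), math.degrees(dlon)


def in_box(lat, lon, lat1, lon1, d):
    """Simplified test: |lat - lat1| < dlat AND |lon - lon1| < dlon."""
    dlat, dlon = bounding_deltas(lat1, d)
    return abs(lat - lat1) < dlat and abs(lon - lon1) < dlon


# A 10 km box around a point near Joensuu (coordinates are illustrative).
print(bounding_deltas(62.6, 10_000))
print(in_box(62.65, 29.80, 62.6, 29.76, 10_000))  # True
```

The box is slightly larger than the circle, so an exact distance check can still be applied to the few rows that pass the filter.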
How to find location-information in web-pages?
Location-based web data mining
Mopsi web search
Web mining
Geo-referencing
Geo-referencing:
A geographic reference is an information entity that is discovered from the context and can be mapped to a geographic location
Strategies for geographic reference extraction:
– Gazetteer-based text matching
– Rule-based linguistic analysis
– Regular-expression based text matching
– Using host location
– Geographic meta-tags
Hu, Y. H., Lim, S., & Rizos, C. Georeferencing of Web Pages based on Context-Aware Conceptual Relationship Analysis. 2006
Ad-Hoc Georeferencing
The problem is how to extract and validate location data from semi-structured text
Postal address is the most common location data found
Our goal is to give geographical coordinates to services mentioned in web-pages
We call this method ad-hoc georeferencing
<HTML>
<HEAD profile="http://geotags.com/geo">
<META name="geo.position" content="62.35;29.44">
<META name="geo.region" content="FI">
<META name="geo.placename" content="Joensuu">
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="http://www.joensuu.fi/tkt/sivutyyli.css" type="text/css">
<TITLE>Pages of Pasi Fränti</TITLE>
</HEAD>
VS.
Location Information in Webpages
Site hosting information (owner address, server address etc.)
HTML tags (geo-tags, address-tags, vcards for Google Maps etc.)
Natural language descriptions
Addresses, postal codes, phone numbers
Site hosting information
domain: uef.fi
descr: ITÄ-SUOMEN YLIOPISTO (UNIV OF EASTERN FINLAND)
descr: 22857339
address: TIETOTEKNIIKKAKESKUS (IT-CENTRE)/Jarno Huuskonen
address: PL 1627
address: 70211
address: KUOPIO FINLAND
phone: +358 44 7162810
status: Granted
created: 26.5.2010
modified: 19.8.2011
expires: 26.5.2015
nserver: ns-secondary.funet.fi [Ok]
nserver: ns1.uef.fi [Ok]
nserver: ns2.uef.fi [Ok]
dnssec: no
HTML tags
geo-tags, address-tags, vCards for Google Maps etc.
<HTML>
<HEAD profile="http://geotags.com/geo">
<META name="geo.position" content="62.35;29.44">
<META name="geo.region" content="FI">
<META name="geo.placename" content="Joensuu">
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="http://www.joensuu.fi/tkt/sivutyyli.css" type="text/css">
<TITLE>Pages of Pasi Fränti</TITLE>
</HEAD>
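Reading such geo meta-tags takes only the Python standard library. The sketch below is an illustration, not part of Mopsi: the class `GeoMetaParser` and the function `extract_geo` are hypothetical names, and only the `geo.position`, `geo.placename` and `geo.region` tags from the example are handled.

```python
from html.parser import HTMLParser


class GeoMetaParser(HTMLParser):
    """Collects the content of <meta name="geo.*"> tags."""

    def __init__(self):
        super().__init__()
        self.geo = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name.startswith("geo.") and content:
            self.geo[name] = content


def extract_geo(html):
    """Return ((lat, lon) or None, placename or None, region or None)."""
    parser = GeoMetaParser()
    parser.feed(html)
    coords = None
    position = parser.geo.get("geo.position")
    if position:
        try:
            lat, lon = (float(part) for part in position.split(";"))
            coords = (lat, lon)
        except ValueError:
            coords = None  # malformed position value
    return coords, parser.geo.get("geo.placename"), parser.geo.get("geo.region")


page = ('<HTML><HEAD profile="http://geotags.com/geo">'
        '<META name="geo.position" content="62.35;29.44">'
        '<META name="geo.region" content="FI">'
        '<META name="geo.placename" content="Joensuu">'
        '<TITLE>Pages of Pasi Fränti</TITLE></HEAD></HTML>')
print(extract_geo(page))  # ((62.35, 29.44), 'Joensuu', 'FI')
```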
Natural language descriptions
Scouts' Youth Hostel (8.3 km from Joensuu Airport) Show map
Good, 7.4 Latest booking: January 23 Scouts’ Youth Hostel is located at the outfall of River Pielisjoki, 1.5 km from Joensuu city centre. It offers free Wi-Fi and rooms with shared bathroom and kitchen facilities. Olga Saint-Petersburg, Russia "Great price for the nice room. Friendly stuff, cozy atmosphere. But a bit loud."
from € 46
Postal addresses
Mopsi search
Input:
• user location (lat, lon)
• keywords
Output: list of services containing:
• name/title
• website
• address (street, number, city)
• location (lat, lon)
• image
• other info (opening hours, telephone etc.)
Main idea:
• preprocess the search results of an external search engine (Google, Yahoo, Bing etc.) by detecting postal addresses in order to find the location
Problems:
- How to evaluate relevance?
- Mixed keyword meanings
- No relation between keywords and addresses
Mopsi Web Search Workflow
[Figure: the mobile application and the web user interface send a keyword and coordinates to the geo-referencing module and receive search results; the module looks up addresses and coordinates in a geocoded street-name database]
Georeferencing module
[Figure: components of the georeferencing module: relevant municipalities detector, <keyword, municipality> query, page parser, address and description detector, address validator, geocoded database, word list, results list and sorted results list; the data passed between them are keyword, coordinates, municipalities list, result links and addresses]
Proposed steps
1. Convert user location (lat, lon) into user address = Geocoding step
2. Search with the query "keyword+city" using an external search engine API and download the first k results (web pages) = Web page retrieval step
3. Detect addresses and additional information from the downloaded web pages = Data mining step
4. Rank the results (distance, relevance etc.) = Ranking step
5. Display the search results to the user
[Figure: pipeline from the user (lat, lon, keywords) through 1. Geocoder, 2. Web page retrieval (web pages), 3. Data mining (result list) and 4. Result ranking to 5. the ranked result list]
1. Geocoding
Convert user location (lat, lon) into user address using:
2. Web page retrieval
Download k webpages from the query <keyword, city> using API of:
3. Data mining
Main idea: Find location information in HTML pages by detecting postal addresses
Steps:
1. Parse and segment the HTML page
2. Identify addresses and locations
3. Identify the services the addresses are pointing to (name/title) and retrieve extra information (photos, opening hours, telephone etc.)
3.1 Parsing HTML pages
- The current solution extracts an array of text from HTML pages
- It does not yet exploit the structure that web pages provide
- Proposed future solution:
  - Segmentation of web pages using DOM trees
  - Detection of the address block
  - Nearest-neighbor search considering text and visual characteristics
Joen Pizza Special Y-tunnus 2129577-6 Käyntiosoite Koskikatu 17 80100 JOENSUU Postiosoite Koskikatu 17 80100 JOENSUU Puhelin: 013-220246 Virallinen toimiala Kahvila-ravintolat
Web page example - Homepage
DOM tree
Node colors:
• blue: links (the A tag)
• red: tables (TABLE, TR and TD tags)
• green: dividers (DIV tag)
• violet: images (the IMG tag)
• yellow: forms (FORM, INPUT, TEXTAREA, SELECT and OPTION tags)
• orange: linebreaks and blockquotes (BR, P, and BLOCKQUOTE tags)
• black: HTML tag, the root node
• gray: all other tags
DOM subtree
[Figure: DOM subtree from <html> and <body> through nested <table>, <tr>, <td> and <div> nodes down to the text nodes of the address block: PizzaPojat Niinivaara, Niinivaarantie 19, 80200 Joensuu, 013 - 137 017]
<table align="center">
 <tr>
  <td>
   <div id="footerleft">
    <h3>PizzaPojat Niinivaara</h3>
    <p>Niinivaarantie 19</p>
    <p>80200 Joensuu</p>
    <br />
    <p>013 - 137 017</p>
   </div>
  </td>
 </tr>
</table>
Web page example - Catalog
[Figure: catalog page listing several services (Bosbor kebab, Fiesta, Miami), shown with the DOM subtree of the address block PizzaPojat Niinivaara, Niinivaarantie 19, 80200 Joensuu, 013 - 137 017]
Proposed implementation
1. Convert HTML pages to XHTML for using XQuery
2. Detect addresses and postal codes
3. Break the DOM tree into subtrees
4. Use heuristics and regular expressions to detect extra information from the subtree (service name, telephone, opening hours etc.)
3.2 Postal address detection
Rule-based pattern matching algorithm
Starting point: the detection of street-names
Prefix trees are used for fast text matching for street-names
An address-block candidate is constructed by detecting:
• street names and number
• postal codes
• municipal names
We will use the OpenStreetMap database for global detection
[Figure: sources used for detection: street names, street numbers, city names, telephone numbers]
AddressDetection(words)
i = 0
while i < count(words)
    set street, number, postcode, city as empty
    if words[i] is streetName
        street = words[i]
        j = i
        for k = i+1 to i+5
            if words[k] is number
                number = words[k]
                j = k
                break
        for k = j+1 to j+5
            if words[k] is postcode
                postcode = words[k]
                j = k
                break
        for k = j+1 to j+5
            if words[k] is city
                city = words[k]
                i = k
                break
    if street is not empty AND number is not empty AND city is not empty
        candidate = (street, number, postcode, city)
    i++
3.2 Postal address detection
Joen Pizza Special Y-tunnus: 2129577-6 Käyntiosoite: Koskikatu 17 80100 JOENSUU Puhelin: 013-220246 Virallinen toimiala: Kahvila-ravintolat
Koskikatu = streetName, 17 = number, 80100 = postcode, JOENSUU = city
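The window-based scan can be sketched in Python as follows. This is a simplified illustration under stated assumptions: street names and cities are single tokens looked up in plain sets (the slides use prefix trees and a gazetteer), a street number is 1 to 4 digits with an optional letter, and a postcode is exactly 5 digits as in Finland. The function name `detect_addresses` is hypothetical.

```python
import re


def detect_addresses(words, street_names, cities, window=5):
    """Return (street, number, postcode, city) candidates found in a word list."""

    def find(start, predicate):
        # Index of the first matching word in words[start:start+window], or None.
        for idx in range(start, min(start + window, len(words))):
            if predicate(words[idx]):
                return idx
        return None

    def is_number(w):
        return re.fullmatch(r"\d{1,4}[A-Za-z]?", w) is not None

    def is_postcode(w):
        return re.fullmatch(r"\d{5}", w) is not None

    def is_city(w):
        return w.lower() in cities

    candidates = []
    i = 0
    while i < len(words):
        if words[i].lower() not in street_names:
            i += 1
            continue
        street, pos = words[i], i + 1
        n = find(pos, is_number)
        if n is not None:
            pos = n + 1
        p = find(pos, is_postcode)
        if p is not None:
            pos = p + 1
        c = find(pos, is_city)
        if n is not None and c is not None:
            postcode = words[p] if p is not None else None  # postcode is optional
            candidates.append((street, words[n], postcode, words[c]))
            i = c + 1  # continue after the detected address
        else:
            i += 1
    return candidates


text = ("Joen Pizza Special Y-tunnus: 2129577-6 Käyntiosoite: "
        "Koskikatu 17 80100 JOENSUU Puhelin: 013-220246")
print(detect_addresses(text.split(), {"koskikatu"}, {"joensuu"}))
# [('Koskikatu', '17', '80100', 'JOENSUU')]
```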
Prefix Trees
Introduced by Fredkin (1960)
The prefix tree (or trie) is a fast ordered tree data structure used for retrieval
Root is associated with an empty string
All the descendants of a node have a common prefix of the string associated with that node
Some nodes can have associated values (usually they mark the end of a word)
Street-name prefix trees
Our solution is to detect street-names using prefix trees constructed from the gazetteer
A street-name prefix tree is built for each municipality used in the search
The user’s location and area of interest are known, therefore the prefix trees can be limited to the relevant municipalities
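A street-name prefix tree for one municipality could look like the sketch below. It is a minimal dictionary-based trie written for illustration only: the class name `StreetTrie`, the `"$end"` marker key and the sample street names are assumptions, and word boundaries are not checked.

```python
class StreetTrie:
    """Prefix tree over lower-cased street names of one municipality."""

    END = "$end"  # marker key; cannot collide with single-character keys

    def __init__(self, names=()):
        self.root = {}
        for name in names:
            self.insert(name)

    def insert(self, name):
        node = self.root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node[self.END] = name  # the value marks the end of a street name

    def longest_match(self, text, start=0):
        """Return the longest street name that text[start:] begins with, or None."""
        node, best = self.root, None
        for ch in text[start:].lower():
            node = node.get(ch)
            if node is None:
                break
            if self.END in node:
                best = node[self.END]
        return best


trie = StreetTrie(["Koskikatu", "Koskitie", "Kauppakatu"])
print(trie.longest_match("Koskikatu 17 80100 JOENSUU"))  # Koskikatu
print(trie.longest_match("Koskenkatu 1"))                # None
```

Matching walks the text once from the start position, so the cost depends on the length of the street name and not on the number of names stored.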
Prefix Tree Statistics
                                    Finland   Singapore
Maximum tree depth                  34        14
Average tree depth                  12.7      7.4
Average tree width                  105       167
Average number of nodes per tree    2338      2335
Total size (MB)                     74.4      0.18
3.3 Retrieve extra information
- Title detection (or company detection) is a Named Entity Recognition problem
- Usually, the text before the address holds relevant information
- There are other methods to investigate, such as using classifiers or using web page structure
Joen Pizza Special Y-tunnus: 2129577-6 Käyntiosoite: Koskikatu 17 80100 JOENSUU Postiosoite Koskikatu 17 80100 JOENSUU Puhelin: 013-220246 Virallinen toimiala: Kahvila-ravintolat
[Figure labels: "words before the address" marks the text preceding the address; "address" marks Koskikatu 17 80100 JOENSUU]
4. Ranking
Main criterion: distance from the user’s location
Future idea: relevance to user’s profile and history
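Ranking by the main criterion can be sketched as a sort on the haversine distance to the user. The result format (dictionaries with `title`, `lat` and `lon` keys), the function names and the sample coordinates are assumptions made for this example.

```python
import math

EARTH_DIAMETER = 12_742_000.0  # metres, assuming a mean radius of 6371 km


def haversine_m(lat1, lon1, lat2, lon2):
    """Haversine distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return EARTH_DIAMETER * math.asin(min(1.0, math.sqrt(a)))


def rank_by_distance(results, user_lat, user_lon):
    """Return the results sorted by distance from the user's location."""
    return sorted(
        results,
        key=lambda r: haversine_m(user_lat, user_lon, r["lat"], r["lon"]),
    )


results = [
    {"title": "Helsinki", "lat": 60.17, "lon": 24.94},
    {"title": "Joensuu centre", "lat": 62.601, "lon": 29.763},
    {"title": "Kuopio", "lat": 62.89, "lon": 27.68},
]
ranked = rank_by_distance(results, 62.60, 29.76)
print([r["title"] for r in ranked])  # ['Joensuu centre', 'Kuopio', 'Helsinki']
```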
Future ideas recap
– Use freely available geographical sources for extending the prototype to other regions
– Use geographical scope of a web page to improve address detection and disambiguation
– Use the structure of the HTML page and DOM tree semantic analysis for better data extraction
– Gather and tag a testing dataset for better evaluation of the algorithms