Mining User Similarity Based on Location History

Quannan Li 1,2, Yu Zheng 2, Xing Xie 2, Yukun Chen 2, Wenyu Liu 1, Wei-Ying Ma 2

1 Dept. of Electronics and Information Engineering, Huazhong University of Science and Technology

2 Microsoft Research Asia

The 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2008

Presented on 26th Nov.


1. Introduction

2. Related work (skipped)

3. Architecture

4. User Similarity Exploration

5. Experiments

6. Conclusion

Outline

The pervasiveness of location-acquisition technologies such as GPS and GSM networks enables the collection of large spatio-temporal datasets and the discovery of valuable knowledge about movement behavior.

Many applications use raw GPS data without much understanding.

Besides the GPS data itself, people want to know about user intention and user interests.

Projects [9][12][13][15] aiming to understand user-specific activity from individual GPS data have emerged, for example detecting a user's locations and predicting the user's movement.

1. Introduction-1

However, the correlation between users has not been explored: user similarity.

Applications:

Individuals: discovering potential friends who share similar interests in books, music and movies.

Merchants: improving their sales and marketing.

The first law of geography: everything is related to everything else, but near things are more related than distant things.

In this paper, we mine user similarity based on user-generated GPS data, with a novel approach to measuring user similarity geographically.

1. Introduction-2

GPS log: a sequence of GPS points P = {p1, p2, …, pn}. Each GPS point contains a latitude, a longitude and a timestamp.

GPS trajectory: the curve obtained by connecting these GPS points in time order.

Stay point (two cases):

Stay point 1, at P3: the user remains stationary for longer than a time threshold, e.g., entering a building and losing the satellite signal for a time interval until coming back outdoors.

Stay point 2, over several GPS points (P5, P6, P7 and P8): the user wanders around within a spatial region, e.g., people travelling outdoors who are attracted by the surrounding environment.

3. Architecture

3.1 Preliminary-1
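The two stay-point cases above can be sketched in a few lines of Python. This is a minimal illustration, not necessarily the authors' exact algorithm: the names `GpsPoint`, `StayPoint`, `haversine_m` and `detect_stay_points` are hypothetical, the rule "consecutive points stay within a distance threshold of the first point for at least a time threshold" is an assumption, and the sample log is invented. The default thresholds (200 meters, 30 minutes) are the ones quoted in the experimental settings.

```python
import math
from typing import List, NamedTuple


class GpsPoint(NamedTuple):
    lat: float
    lon: float
    t: float  # timestamp in seconds


class StayPoint(NamedTuple):
    lat: float
    lon: float
    arrive: float
    leave: float


def haversine_m(a: GpsPoint, b: GpsPoint) -> float:
    """Great-circle distance between two GPS points, in meters."""
    r = 6_371_000.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))


def detect_stay_points(points: List[GpsPoint],
                       dist_thresh_m: float = 200.0,
                       time_thresh_s: float = 1800.0) -> List[StayPoint]:
    """Assumed rule: a run of points within dist_thresh_m of its first point,
    spanning at least time_thresh_s, forms one stay point (mean coordinates)."""
    stays = []
    i, n = 0, len(points)
    while i < n:
        j = i + 1
        while j < n and haversine_m(points[i], points[j]) <= dist_thresh_m:
            j += 1
        group = points[i:j]
        if group[-1].t - group[0].t >= time_thresh_s:
            lat = sum(p.lat for p in group) / len(group)
            lon = sum(p.lon for p in group) / len(group)
            stays.append(StayPoint(lat, lon, group[0].t, group[-1].t))
            i = j          # continue after the stay
        else:
            i += 1         # no stay starts at point i
    return stays


# Invented log: a one-hour signal gap at one place, then a 35-minute wander.
log = [
    GpsPoint(39.9000, 116.4, 0), GpsPoint(39.9100, 116.4, 300),
    GpsPoint(39.9200, 116.4, 600), GpsPoint(39.9201, 116.4, 4200),
    GpsPoint(39.9300, 116.4, 4500),
    GpsPoint(39.9400, 116.4, 4800), GpsPoint(39.9405, 116.4, 5400),
    GpsPoint(39.9410, 116.4, 6000), GpsPoint(39.9403, 116.4, 6900),
    GpsPoint(39.9500, 116.4, 7200),
]
stays = detect_stay_points(log)
```

On this invented log the sketch reports two stay points, one for the signal-loss case and one for the wandering case.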

Location history: a record of locations that an entity visited in geographical spaces over an interval of time.

3.1 Preliminary-2

Hierarchical graph: put all users' stay points into one dataset and hierarchically cluster them into several spatial regions in a divisive manner.

Similar stay points from various users are thus assigned to the same clusters on the different layers.

Each user can then build a directed graph.
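As a rough illustration of the per-user directed graph, the sketch below assumes that each of a user's stay points has already been labelled with a cluster ID on every layer. The function name `build_user_graph`, the input layout and the cluster IDs are hypothetical; the slide does not specify how edges are defined, so connecting consecutively visited clusters is an assumption.

```python
from typing import Dict, List, Set, Tuple

Graph = Tuple[Set[str], Set[Tuple[str, str]]]


def build_user_graph(visits_by_layer: Dict[int, List[str]]) -> Dict[int, Graph]:
    """For each layer, nodes are the clusters the user visited and a directed
    edge (a, b) means the user moved from cluster a to cluster b (assumed)."""
    graph = {}
    for layer, visits in visits_by_layer.items():
        nodes = set(visits)
        # Consecutive visits to the same cluster do not create an edge.
        edges = {(a, b) for a, b in zip(visits, visits[1:]) if a != b}
        graph[layer] = (nodes, edges)
    return graph


# Hypothetical time-ordered cluster IDs of one user's stay points on two layers.
user_graph = build_user_graph({
    1: ["c1", "c1", "c1", "c1", "c1"],
    2: ["c11", "c12", "c12", "c13", "c11"],
})
```

A coarse layer collapses all visits into one node here, while the finer layer keeps three nodes and the transitions between them.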

Architecture figure: location history representation (individual hierarchical graph), user similarity exploration, friend and location recommendation.

3.2 Architecture of HGSM

HGSM: Hierarchical Graph-based Similarity Measurement

The hierarchical graph is an effective representation of a user's location history and captures the sequence property of user movement.

To measure the similarity between two users, on each layer we find the graph nodes the two users share, then formulate a sequence for each user based on these graph nodes.

Measuring the similarity between two users is thereby transformed into a sequence matching problem.

4. User Similarity Exploration

4.1 Location History Extraction

Figure 6 demonstrates how a sequence of places is extracted from each individual's location history. User 1 and user 2 share the same graph nodes A, B and C.

A green curve sequentially connects the blue nodes over these graph nodes in time order.

User 1: <C, A, B, B, C, C, B, C>, compressed to <C(1), A(1), B(2), C(2), B(1), C(1)>

User 2: <A, B, C, A, A, C, A>, compressed to <A(1), B(1), C(1), A(2), C(1), A(1)>

The number in parentheses counts consecutive visits to the same node. Each user's arrival time and leaving time on each cluster are also given.

Figure 6
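The compression of a visit sequence into nodes with visit counts is a run-length encoding. A minimal sketch (the function name `compress_visits` is hypothetical; arrival and leaving times are left out for brevity):

```python
from itertools import groupby
from typing import List, Tuple


def compress_visits(visits: List[str]) -> List[Tuple[str, int]]:
    """Merge consecutive visits to the same cluster into (cluster, count)."""
    return [(cluster, len(list(run))) for cluster, run in groupby(visits)]


user1 = compress_visits(list("CABBCCBC"))
user2 = compress_visits(list("ABCAACA"))
```

Applied to the two sequences of Figure 6, this yields the compressed forms listed for user 1 and user 2.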

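Definitions related to similar sequences

Similar sequences: two m-length sequences $\langle a_1, \dots, a_m \rangle$ and $\langle b_1, \dots, b_m \rangle$, with transition times $\Delta t_i$ and $\Delta t_i'$ between consecutive nodes, are similar if:

1. $\forall\, 1 \le i \le m,\ a_i = b_i$, i.e., the nodes at the same position of the two sequences share the same cluster ID;

2. $\forall\, 1 \le i < m,\ |\Delta t_i - \Delta t_i'| \le t_{th}$.

$t_{th}$ is a pre-defined time threshold, called the temporal constraint. Condition 2 means that the two users have similar transition times between the same regions.

4.2 Sequence Matching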

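Comment: only the visited points and the arrival and leaving times are considered here; the time of day (daytime vs. night) is not taken into account?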
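The two conditions of the similar-sequence definition translate directly into a small predicate. This is a sketch only: the function name `are_similar` and the sample transition times (in hours) are hypothetical.

```python
from typing import List


def are_similar(nodes1: List[str], dts1: List[float],
                nodes2: List[str], dts2: List[float],
                t_th: float) -> bool:
    """nodes*: cluster IDs in visit order; dts*: transition times between
    consecutive nodes, so len(dts) == len(nodes) - 1."""
    if len(nodes1) != len(nodes2) or len(dts1) != len(dts2):
        return False
    if len(dts1) != len(nodes1) - 1:
        return False
    # Condition 1: same cluster ID at every position.
    if nodes1 != nodes2:
        return False
    # Condition 2: every pair of transition times differs by at most t_th.
    return all(abs(a - b) <= t_th for a, b in zip(dts1, dts2))


loose = are_similar(["A", "B", "C"], [1.0, 2.5], ["A", "B", "C"], [2.0, 5.0], t_th=3.0)
strict = are_similar(["A", "B", "C"], [1.0, 2.5], ["A", "B", "C"], [2.0, 5.0], t_th=2.0)
```

With these invented times the pair passes under a 3-hour constraint but fails under a 2-hour one, because the second transition differs by 2.5 hours.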

m-length similar sequence: if the number of nodes in a similar sequence is m, we call it an m-length similar sequence.

Example: with the temporal constraint configured as 3 hours, a 3-length similar sequence <A(1)→B(2)→C(2)> is detected.

m-length similar sequence

Similar Sequence Matching-1

A) We first detect the 1-length similar sequences: <A12>, <B23>, <B25>, <C31>, <C34> and <A42>. Here <A12> denotes that the first node of sequence 1 shares the same node A with the second node of sequence 2.

B) The extension operation builds on the results of the first step. If we set the temporal constraint tth to 2 hours, four 2-length similar sequences can be retrieved: <A12, B23>, <A12, C34>, <B23, C34> and <C31, A42>.

C) Based on the 2-length sequences, one 3-length similar sequence <A12, B23, C34> can be detected.

Similar Sequence Matching-2
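Steps A to C can be sketched as a level-by-level extension. The code below is an illustration under stated assumptions rather than the authors' implementation: the names `Visit`, `similar_sequences` and `label` are hypothetical, the transition time is assumed to be the arrival at the next node minus the departure from the previous one, and the timestamps (in hours) are invented so that the example reproduces the sequences listed on the slide. Pruning of sequences contained in longer ones is not handled.

```python
from typing import Dict, List, NamedTuple, Tuple


class Visit(NamedTuple):
    cluster: str
    arrive: float  # hours
    leave: float   # hours


Match = Tuple[int, int]  # 0-based positions in sequence 1 and sequence 2


def similar_sequences(seq1: List[Visit], seq2: List[Visit], t_th: float,
                      max_length: int = 5) -> Dict[int, List[List[Match]]]:
    # Step A: 1-length similar sequences are pairs of nodes sharing a cluster.
    singles = [(i, j) for i, a in enumerate(seq1)
               for j, b in enumerate(seq2) if a.cluster == b.cluster]
    level = [[m] for m in singles]
    found = {1: level}
    length = 1
    # Steps B, C: extend each k-length sequence by one later matching pair
    # whose transition times satisfy the temporal constraint.
    while level and length < max_length:
        extended = []
        for chain in level:
            i, j = chain[-1]
            for i2, j2 in singles:
                if i2 > i and j2 > j:
                    dt1 = seq1[i2].arrive - seq1[i].leave
                    dt2 = seq2[j2].arrive - seq2[j].leave
                    if abs(dt1 - dt2) <= t_th:
                        extended.append(chain + [(i2, j2)])
        length += 1
        if extended:
            found[length] = extended
        level = extended
    return found


def label(chain: List[Match], seq1: List[Visit]) -> List[str]:
    """Render matches in the slide's notation, e.g. 'A12' (1-based positions)."""
    return [f"{seq1[i].cluster}{i + 1}{j + 1}" for i, j in chain]


# Invented timestamps; arrival equals departure to keep the example small.
s1 = [Visit("A", 0, 0), Visit("B", 1, 1), Visit("C", 2, 2), Visit("A", 3, 3)]
s2 = [Visit("C", 0, 0), Visit("A", 1, 1), Visit("B", 2, 2),
      Visit("C", 3, 3), Visit("B", 8, 8)]
found = similar_sequences(s1, s2, t_th=2.0)
labels = {m: [label(c, s1) for c in chains] for m, chains in found.items()}
```

With these timestamps the pair <A12, B25> is rejected because the two transition times differ by 6 hours, which is why only four 2-length sequences survive.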

When calculating the score, we take two factors into account: the length of the similar sequence and the layer on which the sequence is found.

Similarity measure of an m-length sequence (2): $\alpha(m) = 2^{m-1}$.

Similarity at a single layer (3): n is the number of similar sequences the two users share on the layer; $s_i$ is the score of the i-th similar sequence, computed by (2); $N_1$ and $N_2$ denote the numbers of stay points of the two users.

Similarity across multiple layers (4): H is the total number of layers of the hierarchical graph; $\beta_l$ is the support (weight) of the similarity of sequences on the l-th layer.

The lower the layer on which a sequence is detected, the higher the score it obtains. In our experiment, $\beta_l = 2^{l-1}$.

4.3 Similarity Measurement
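The scoring can be sketched as follows. Equations (3) and (4) did not survive in this transcript, so the forms used here are assumptions: the layer score is taken as the sum of the sequence scores divided by N1 × N2, and the overall score as the β-weighted sum over layers. Only α(m) = 2^(m−1) and β_l = 2^(l−1) come from the slides. All function names and the sample numbers are hypothetical.

```python
from typing import Dict, List


def alpha(m: int) -> int:
    """Score of an m-length similar sequence, alpha(m) = 2^(m-1)."""
    return 2 ** (m - 1)


def beta(layer: int) -> int:
    """Layer weight, beta_l = 2^(l-1); deeper (finer) layers weigh more."""
    return 2 ** (layer - 1)


def layer_similarity(sequence_lengths: List[int], n1: int, n2: int) -> float:
    """ASSUMED form of (3): sum of sequence scores normalised by N1 * N2."""
    return sum(alpha(m) for m in sequence_lengths) / (n1 * n2)


def overall_similarity(layer_scores: Dict[int, float]) -> float:
    """ASSUMED form of (4): beta-weighted sum of the per-layer scores."""
    return sum(beta(layer) * score for layer, score in layer_scores.items())


# Invented example: one 3-length and one 2-length sequence on layer 4,
# users with 4 and 5 stay points; layer 3 contributes a score of 0.1.
s4 = layer_similarity([3, 2], n1=4, n2=5)
total = overall_similarity({3: 0.1, 4: s4})
```

Normalising by the stay-point counts keeps users with very long histories from dominating purely because they have more data.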

Data: 65 volunteers with GPS traces collected over 6 months. The total distance of the logs exceeds 50,000 km.

Stay point detection: timeThreh is set to 30 minutes and distThreh to 200 meters.

Clustering: a density-based clustering algorithm called “OPTICS”. The splitting of a cluster stops when one of the following conditions holds:

1) the number of users is less than two;

2) the boundary rectangle is smaller than 500 meters.

We establish 4-layer hierarchical clusters; the top layer is layer 1 (higher layer) and the bottom layer is layer 4 (lower layer).

5. Experiments

5.1 Settings-1

Sequence matching: we set tth of layer l to (H − l + 1)∙T, where tth is the time threshold and H is the depth of the hierarchy.

With H = 4: for l = 4, tth = T; for l = 1, tth = 4T.

After trying a set of values for T, we find that the performance of HGSM no longer changes once T increases to a certain value.
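The per-layer threshold rule is a one-line formula. A small sketch (the function name `layer_threshold` is hypothetical, and T = 1 hour is an arbitrary illustrative value, not a setting reported here):

```python
def layer_threshold(layer: int, depth: int, base_t: float) -> float:
    """t_th for a layer: (H - l + 1) * T, so coarser layers tolerate more."""
    return (depth - layer + 1) * base_t


thresholds = [layer_threshold(l, depth=4, base_t=1.0) for l in range(1, 5)]
```

The top layer (l = 1) gets 4T and the bottom layer (l = 4) gets T, matching the two cases stated above.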

Similarity measurement: we set $\alpha(m) = 2^{m-1}$ and $\beta_l = 2^{l-1}$.

α(m) increases exponentially with the length of the sequence (m).

The significance of similar sequences found on layer l increases exponentially with l.

5.1 Settings-2

Ground truth: each volunteer is required to rate the other users based on his or her individual understanding.

The relevance rating between two users is asymmetric, i.e., even if user A rates user B as 2, user B may not rate user A as 2.

5.2 Evaluation Approach-1

For instance, using user Ui as a query, we retrieve the top ten similar users based on their similarity scores to Ui.

Then, a relevance vector G of the search results is formulated based on the relationship matrix.

We calculate MAP and nDCG for this retrieval.

After all the volunteers have been tested, we calculate the mean values of MAP and nDCG over the individual results.

5.2 Evaluation Approach-2

Evaluation framework: each of the 65 people is used in turn as a query to search for his or her top ten similar users.

Evaluation criteria: mean average precision (MAP) and normalized discounted cumulative gain (nDCG) are employed to evaluate the performance of our approach.

MAP: the mean of the precision scores obtained at the rank of each relevant user. A user is deemed relevant if his/her relevance level is greater than or equal to 3.

For example, the relevance vector G = <4, 0, 2, 3, 3, 1, 0, 2, 1, 1> has relevant users at ranks 1, 4 and 5, so its MAP is (1/1 + 2/4 + 3/5) / 3 = 0.7.

5.2 Evaluation Approach-3
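The precision-based score for a single relevance vector can be computed with the standard average-precision procedure, sketched below. The function name `average_precision` is hypothetical; the relevance cutoff of 3 and the vector G come from the slide.

```python
from typing import List


def average_precision(ratings: List[int], relevant_at: int = 3) -> float:
    """Mean of precision@rank taken at the rank of each relevant result."""
    hits = 0
    precisions = []
    for rank, rating in enumerate(ratings, start=1):
        if rating >= relevant_at:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


G = [4, 0, 2, 3, 3, 1, 0, 2, 1, 1]
ap = average_precision(G)
```

For G the relevant results sit at ranks 1, 4 and 5, giving precisions 1/1, 2/4 and 3/5 and an average of about 0.7. Averaging this value over all query users gives the reported MAP.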

nDCG: measures the relative-to-the-ideal performance of information retrieval techniques [8].

The discounted cumulative gain of G is computed as follows (in our experiments, b = 2):

$$DCG[i] = \begin{cases} G[1], & i = 1 \\ DCG[i-1] + G[i], & i < b \\ DCG[i-1] + G[i] / \log_b i, & i \ge b \end{cases}$$

Given the ideal discounted cumulative gain DCG', nDCG at the i-th position can be computed as $nDCG[i] = DCG[i] / DCG'[i]$.

5.2 Evaluation Approach-4

[8] Järvelin, K., Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, ACM Press (2002), 422-446.
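A minimal sketch of the cumulated-gain computation from [8] with discount base b follows. The function names `dcg` and `ndcg` are hypothetical, and taking the ideal vector to be the same ratings sorted in descending order is an assumption; the slides do not say how DCG' was built.

```python
import math
from typing import List


def dcg(gains: List[float], b: int = 2) -> List[float]:
    """Cumulated gain at every rank; ranks >= b are discounted by log_b(rank)."""
    out = []
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain if rank < b else gain / math.log(rank, b)
        out.append(total)
    return out


def ndcg(gains: List[float], b: int = 2) -> List[float]:
    """DCG divided by the DCG of the ideal ordering (assumed: sorted gains)."""
    actual = dcg(gains, b)
    ideal = dcg(sorted(gains, reverse=True), b)
    return [a / i if i > 0 else 0.0 for a, i in zip(actual, ideal)]


G = [4, 0, 2, 3, 3, 1, 0, 2, 1, 1]
scores = ndcg(G)
ndcg_at_5 = scores[4]
```

Indexing the returned list at position 4 or 9 gives nDCG@5 or nDCG@10 for one query.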

Baselines: if in cluster $c_i$ User1 has $k_i$ stay points and User2 has $l_i$ stay points, the location histories of User1 and User2 can be represented as $u_1 = \langle k_1, k_2, \dots, k_i, \dots, k_N \rangle$ and $u_2 = \langle l_1, l_2, \dots, l_i, \dots, l_N \rangle$.

The similarity of two users by count is computed as equation (6).

Cosine similarity and Pearson similarity are computed as equation (7) and equation (8), respectively.

5.2 Evaluation Approach-5
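The cosine and Pearson baselines on the count vectors can be sketched with their standard textbook definitions. Equations (6) to (8) are not reproduced in this transcript, so the similarity-by-count formula is left out rather than guessed. The function names and the sample vectors are hypothetical.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pearson_similarity(u: List[float], v: List[float]) -> float:
    """Pearson correlation, i.e. cosine similarity of the mean-centred vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u], [b - mean_v for b in v])


# Invented stay-point counts of two users over four clusters.
u1 = [3, 0, 2, 1]
u2 = [6, 0, 4, 2]
cos_12 = cosine_similarity(u1, u2)
pear_12 = pearson_similarity(u1, u2)
```

Both measures look only at how many stay points fall in each cluster, so they ignore the visiting order that the sequence matching captures.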

Seq: the similarity using the sequence feature.

Hier: the hierarchical property of geographic spaces.

Hier+Seq: HGSM, the similarity considering both the sequence and hierarchy properties.

Count: similarity-by-count on the bottom layer.

Hier+Count: similarity-by-count across multiple layers.

Cosine and Pearson: the cosine similarity and Pearson similarity on the bottom layer, respectively.

Hier+Cosine and Hier+Pearson: the cosine similarity and Pearson similarity across multiple layers, respectively.

5.3 Experimental Results

HGSM shows advantages over cosine similarity, Pearson similarity and similarity-by-count.

By considering the similarity across multiple layers, HGSM (Hier+Seq) leads the performance in both nDCG@5 and nDCG@10 among these methods.

The hierarchical property of geo-space further improves the performance of Seq.

MAP & nDCG

Figure: nDCG@5 over maxLength.

When maxLength exceeds 5, the performance of the ranking no longer varies.

maxLength

The figure shows the MAP and nDCG@5 of our approach changing over the time threshold tth.

The performance of our approach improves as tth increases.

When the time threshold increases to a certain value, the performance reaches its peak and no longer varies.

Time Threshold

Both MAP and nDCG increase as the layer level increases, i.e., layer 4 is more capable of discriminating similar users than layer 3.

MAP & nDCG changing on different layers

People’s location histories imply their interests and preferences.

A framework, HGSM, enables us to consistently model each individual’s location history and effectively measure the similarity among users.

It supports many applications, such as friend recommendation and location recommendation.

We explore users’ location histories on different scales of geographic spaces:

The layer with relatively fine granularity enhances our capability of precisely discriminating similar users.

The layer with relatively coarse granularity enables us to recognize high-level user behavior and further recall unobvious similar users.

6. Conclusion

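Only location similarity is considered. Are there other kinds of similarity? What kind of similarity should be discussed for crime prevention?

Applications of similarity in crime prevention: finding people with potential criminal tendencies; identifying and tracking targets; narrowing the scope of surveillance.

Directions for self-study: the similarity measures covered in Related work (cosine, Pearson, …).

Comments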