
Explore Spatial-temporal Patterns for Trending Venues Rui Zhu

University of Pittsburgh 135 North Bellefield Street

1-412-577-8888 [email protected]

ABSTRACT Taking advantage of social networks to find patterns in individuals' lives, communities' dynamic activities, and venues' popularity has received increasing attention from researchers. However, existing research concentrates either on people's static and dynamic activities or on venues' static popularity; there is little work investigating the dynamic activities of venues. In this report, an approach for analyzing the temporal and spatial patterns of trending venues is proposed. Two methods, temporal data mining models and spatial-temporal data mining models, are introduced to predict the venue activity of trending. Experimental results show that both methods have the potential to accomplish this goal, although both have limitations.

Categories and Subject Descriptors Term Project Report.

General Terms Design, Experimentation, Data mining

Keywords Temporal and spatial-temporal data mining, trending venues, prediction and classification models

1. INTRODUCTION Building a model to predict trending venues is a spatial-temporal problem. The model's goal is to accurately and promptly predict the location and time at which the event of trending happens for each venue. To accomplish this goal, sources from social networks can be fully exploited, since social networks reflect the relationships between individuals and venues, and between individuals themselves, and such relationships largely drive dynamic venue behavior like trending.

Most related research concentrates on the activities of individuals. Researchers have investigated the movement patterns of users to analyze dynamic neighborhoods or, more similar to our work, used users' historical check-in numbers to track individuals' life trajectories. To the best of my knowledge, there is no research focusing on the analysis of venues' activities.

The reason for this gap is that venues are usually considered static, and static agents, in general, do not have activities; it therefore seems more reasonable to focus on the activity of dynamic agents like individuals. However, once the time dimension is introduced, static agents can be treated as dynamic. Specifically, we can regard the event of a venue trending as a dynamic condition.

The work in this project could benefit many areas and individuals. For example, businesses can refer to this model to check the trend of their stores and make marketing and advertising decisions; governments could allocate resources more efficiently based on the changes of venues' trending in the temporal and spatial dimensions; and citizens could use this work to guide their daily plans, for example to avoid crowded venues.

Generally, two methods are investigated in this work to predict trending venues: temporal models and spatial-temporal models. In the temporal models, only time is considered to predict the percentage of venues being trending. In the spatial-temporal models, both spatial and temporal features extracted from various sources are utilized.

Section 2 gives a basic description of the data used in this project and how they are collected. Section 3 explores the temporal patterns in the historical data set. Based on this analysis, two kinds of models and their experimental evaluations are discussed in Section 4. Finally, a discussion of the experimental observations and expected future work is given in Section 5.

2. DATA SOURCE & DATA COLLECTION Since this is a spatial-temporal project, two dimensions are considered: space and time. Spatial data such as longitude and latitude are provided by Foursquare when retrieving venues through their API. For temporal data, Foursquare does not provide a publicly accessible historical data set, so the temporal data had to be collected manually over a period of time. In my data set, venues were collected from July 1st 2012 to November 3rd 2012; scripts grabbed venues roughly every half hour on almost every day during this period. The total size of the data set is approximately 20 GB, covering most big cities in the world. Due to the large number of venues in each city, I selected some venues as representatives, and to make the model as complete and accurate as possible, the selected venues cover all kinds of categories.
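As a sketch of this collection setup, the loop below polls a trending endpoint for a list of city coordinates roughly every half hour. The endpoint URL, parameter names, and API version date are assumptions modeled on Foursquare's public v2 API; they are not the exact script used in this project.

```python
import time
import urllib.parse

# Hypothetical endpoint and credentials; the v2 "venues/trending" endpoint
# and its parameter names are assumptions based on Foursquare's public API.
API_ROOT = "https://api.foursquare.com/v2/venues/trending"

def build_request_url(lat, lng, client_id, client_secret, limit=50, radius=2000):
    """Build the polling URL for one city-center coordinate."""
    params = {
        "ll": f"{lat},{lng}",          # latitude,longitude of the query point
        "limit": limit,                 # max venues per response
        "radius": radius,               # search radius in meters
        "client_id": client_id,
        "client_secret": client_secret,
        "v": "20121101",                # API version date (assumed)
    }
    return API_ROOT + "?" + urllib.parse.urlencode(params)

def poll_forever(cities, fetch, interval_s=1800):
    """Grab trending venues for every city roughly every half hour."""
    while True:
        for lat, lng in cities:
            fetch(build_request_url(lat, lng, "CLIENT_ID", "CLIENT_SECRET"))
        time.sleep(interval_s)
```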

The patterns of trending venues vary across cities (Robles & Benner, 2012), so a specific city should be selected as the experimental city. Since New York is among the largest and most diverse cities in the world, I selected it as my experimental city. Additionally, New York City covers almost all venue categories.

Table 1 shows basic information about the data collected in NYC. From the table, it can be clearly seen that although the number of instances is very large, the number of trending events is relatively low, and the corresponding number of trending venues is lower still, since a single venue can trend many times. Approximately 24% of all venues have a record of being trending during this period. This number indicates that the event of trending is limited to a small number of fixed venues; consequently, trending venues have a spatial pattern in the data set.

Table 1. Basic information about the NYC data set

Total number of instances in NYC:   4,123,085
Total number of venues in NYC:      1,053
Trending events:                    24,375
Trending venues:                    253
Number of venue categories:         176

Most of the data are collected from Foursquare, which provides a friendly API. The features and other attributes that can be obtained from this source are listed in Table 2. The features are used to build the models, and some of the other attributes are used to derive two additional features (competitiveness and modified_category).

Table 2. Features and other attributes from Foursquare

Features obtained from Foursquare:  Check-ins; Users count; Number of user reviews; Mayor's check-ins; Category; Time; Promotions; Trending (binary); Number of photos
Other attributes:                   Venue id; Category id; Latitude; Longitude

Since all the data were collected manually over a long period, missing data are inevitable. Statistics on the missing days for each month are shown in Table 3. For September and October, the number of missing days is relatively small compared with July and August; therefore, I select the data from September and October as the sample data for NYC. Truncating the data set reduces the missing data on the one hand, and on the other decreases the total size from around 20 GB to around 600 MB, an acceptable and computable level. Table 4 shows the number of instances for September and October and their total. Both numbers exceed 1 million, which is large.

Table 3. Statistics about missing data for each month

Month       Missing days   Missing hours
July        7              Many
August      6
September   2
October     3

Table 4. Number of instances for September & October

Month       Number of instances
September   1,318,751
October     1,014,859
Total       2,333,610

3. TEMPORAL PATTERNS Before establishing models to predict trending venues, we should check whether there are temporal patterns in the data set (spatial patterns were shown to exist in Section 2). To see the patterns clearly, I calculated the percentage of venues being trending for each day,

y = (number of trending venues) / (total number of venues),

and drew the time series plot over all 60 days, shown in Figure 1. In this plot, the fluctuations in the data are roughly constant in size over time, except for Day 18, which may be noisy data.

Figure 1. Plot of the time series data (60 days)
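The per-day ratio defined above can be computed with a short helper. The grouping of raw observations into distinct venues per day is my assumption about how the ratio was formed:

```python
from collections import defaultdict

def daily_trending_percentage(records):
    """records: iterable of (day_index, venue_id, is_trending) observations.

    Returns y_d = (#venues observed trending on day d) / (#venues observed
    on day d), matching the per-day ratio plotted in Figure 1.
    """
    seen = defaultdict(set)      # day -> all venues observed that day
    trending = defaultdict(set)  # day -> venues that trended at least once
    for day, venue, is_trending in records:
        seen[day].add(venue)
        if is_trending:
            trending[day].add(venue)
    return {day: len(trending[day]) / len(seen[day]) for day in sorted(seen)}
```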

This plot covers the whole two months. To see more detail, weekly and daily plots were also drawn, shown in Figure 2 and Figure 3 respectively. Four successive weeks were randomly selected for the weekly plot. The trends of the four plots in Figure 2 are basically the same: the number of trending venues increases from Wednesday night and reaches its weekly peak on Thursday night, and for these two days the percentage of trending venues at night is much higher than during the daytime. In contrast, on Friday and Saturday the whole day's percentage of trending venues stays at a relatively high level, especially on Friday. From Sunday to Tuesday morning, the percentage of trending venues decreases gradually. This general trend holds for all four selected weeks, so we can conclude that the data have weekly temporal patterns. In addition, this pattern conforms to individuals' activities in real life: from Thursday night to Saturday night, people are more likely to hang out and have fun, while from Sunday night to Wednesday, people are more likely to focus on work and seldom check in at public and entertainment venues. For the daily plots, four days were also randomly selected. In all four plots, the highest percentage of trending venues occurs at around 8 p.m. and stays at the bottom from 5 a.m. to 10 a.m. The only difference is that for September 4th and 17th the percentages are at a low level, while for September 11th and October 2nd they are relatively high. Checking the dates, both September 11th and October 2nd are Tuesdays. One might therefore assume that on Tuesdays people are more likely to hang out and check in at venues, but this assumption has not been verified against reality.

Figure 2. Plot of the time series data (weekly)

Figure 3. Plot of the time series data (daily)

Based on the monthly, weekly and daily analysis of temporal patterns, we can see that fluctuations are constant in size over time and do not seem to depend on the level of the time series in all three conditions. Thus, I applied an additive model to decompose the time series into trend, seasonal and random components, shown in Figure 4. These three components have relatively constant fluctuations, especially the seasonal component. Therefore, we can explicitly state that temporal patterns exist in this data set. The next section concentrates on using different models to investigate and subsequently predict these patterns.

Figure 4. Decomposition of additive time series
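As a sketch of the additive decomposition, the function below mirrors what R's decompose() does for y = trend + seasonal + random, assuming a weekly (period-7) cycle for the daily percentages:

```python
import numpy as np

def additive_decompose(y, period=7):
    """Classical additive decomposition y = trend + seasonal + random.

    A minimal sketch assuming a weekly period; edge effects of the moving
    average are not handled as carefully as in R's decompose().
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Trend: centered moving average of width `period`.
    kernel = np.ones(period) / period
    trend = np.convolve(y, kernel, mode="same")
    detrended = y - trend
    # Seasonal: average the detrended values at each position in the cycle,
    # then center them so the seasonal component sums to ~0.
    seasonal_means = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal_means -= seasonal_means.mean()
    seasonal = np.resize(seasonal_means, n)   # tile the cycle over the series
    random = y - trend - seasonal
    return trend, seasonal, random
```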

4. METHODS & EXPERIMENTS

4.1 Temporal models Since the temporal patterns analyzed in Section 3 are obvious, the first series of models focuses only on time. Three models are used in this work: simple exponential smoothing, Holt-Winters exponential smoothing, and the autoregressive integrated moving average model (ARIMA). Models are built on the whole historical data set and tested both on the data set itself and on 10 future days. Evaluations of the forecast errors are then discussed for each model.

• Model 1: Simple exponential smoothing

Using Model 1, the red lines in Figure 5 show the predicted time series, and the black lines the original time series. The sum of squared errors for this model is 0.0126.
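A minimal sketch of Model 1's recursion (level update s_t = alpha*y_t + (1-alpha)*s_{t-1}); note that R's HoltWinters with beta and gamma disabled additionally optimizes alpha, which is omitted here:

```python
def simple_exp_smoothing(y, alpha):
    """One-step-ahead in-sample forecasts from simple exponential smoothing."""
    forecasts = [y[0]]            # forecast for t=0 initialized to y[0]
    level = y[0]
    for obs in y[1:]:
        forecasts.append(level)   # forecast next point with the current level
        level = alpha * obs + (1 - alpha) * level
    return forecasts

def sum_squared_errors(y, forecasts):
    """SSE between observed values and one-step-ahead forecasts."""
    return sum((a - f) ** 2 for a, f in zip(y, forecasts))
```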

Figure 5. In-sample forecast result using simple exponential smoothing

In Figure 6, the model gives a forecast for the future 10 days, with two prediction intervals: an 80% prediction interval (light blue area) and a 95% prediction interval (grey area).


Figure 6. Forecast result for future 10 days using simple exponential smoothing

• Model 2: Holt-Winters exponential smoothing

Similar to Model 1, Figure 7 shows the predicted and original series as red and black lines respectively. For this model, the sum of squared errors is 0.01185.

Figure 7. In-sample forecast result using Holt-Winters exponential smoothing

Figure 8 shows the forecast results for the future 10 days. The light blue area is the 80% prediction interval and the grey area the 95% prediction interval.

Figure 8. Forecast result for future 10 days using Holt-Winters exponential smoothing

• Model 3: Autoregressive integrated moving average model (ARIMA)

To use an ARIMA model, we should first guarantee that the time series is stationary, or else use functions in R to "difference" the time series. From Figure 1 and Figure 2, we can see that the mean and variance are basically stationary: the level of the series stays roughly constant over time, and the variance appears roughly constant as well. Therefore, differencing is not necessary in this work.

The next step is to select the appropriate ARIMA model, which means finding the most appropriate values of p and q for an ARIMA(p, d, q) model, where d equals 0 since the order of differencing is 0. To do this, I examined the correlogram and partial correlogram of the time series, shown in Figure 9 and Figure 10.
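These diagnostics can be computed directly; a sketch with the usual ±1.96/sqrt(n) white-noise bound, where the PACF is obtained by solving Yule-Walker systems of increasing order:

```python
import numpy as np

def acf(y, max_lag=20):
    """Sample autocorrelations for lags 1..max_lag (the correlogram)."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    denom = np.dot(y, y)
    return [np.dot(y[:-k], y[k:]) / denom for k in range(1, max_lag + 1)]

def pacf(y, max_lag=10):
    """Partial autocorrelations via Yule-Walker fits of increasing order."""
    rho = [1.0] + acf(y, max_lag)
    out = []
    for k in range(1, max_lag + 1):
        R = np.array([[rho[abs(i - j)] for j in range(k)] for i in range(k)])
        r = np.array(rho[1:k + 1])
        out.append(np.linalg.solve(R, r)[-1])  # last AR(k) coefficient
    return out

def significance_bound(n):
    """Approximate 95% bound for white noise: +/- 1.96/sqrt(n)."""
    return 1.96 / np.sqrt(n)
```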

Figure 9. Correlogram for lags 1-20

We can see from Figure 9 that the autocorrelations at most lags exceed the significance bounds, especially the first ones at lags 1-5, but the correlogram falls to nearly 0 at lag 8.


Figure 10. Partial correlogram for lags 1-20

From the partial correlogram, we can see that the partial autocorrelations at lags 1, 2 and 3 all exceed the significance bounds and tail off to 0 at lag 4.

Consequently, there are potentially three models that can be applied for this data set.

a). An ARIMA (0, 0, 8), that is, a moving average model of order q=8.

b). An ARIMA (4, 0, 0), that is, an autoregressive model of order p=4

c). An ARIMA (p, 0, q), that is, a mixed model with p and q greater than 0

In this work, I used the principle of parsimony to decide which model is best: that is, I assume that the model with the fewest parameters is best. Therefore, the autoregressive model ARIMA(4, 0, 0) is selected. (Ideally, different p and q for mixed models should also be tried, but this was not done due to time constraints.)

Using the selected autoregressive model, I predicted the percentage of venues being trending for the future 10 days, shown in Figure 11.
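As a sketch of this step (the exact R call is not given in the report), an AR(p) model can be fit by least squares and iterated forward 10 days:

```python
import numpy as np

def fit_ar(y, p=4):
    """Least-squares fit of AR(p): y_t = c + sum(phi_i * y_{t-i}) + e_t."""
    y = np.asarray(y, dtype=float)
    # Design matrix columns: intercept, y_{t-1}, ..., y_{t-p}.
    X = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(len(y) - p), X])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [c, phi_1, ..., phi_p]

def forecast_ar(y, coef, steps=10):
    """Iterate the fitted AR recursion forward `steps` days."""
    history = list(y)
    p = len(coef) - 1
    out = []
    for _ in range(steps):
        lags = history[-1:-p - 1:-1]           # y_{t-1}, ..., y_{t-p}
        nxt = coef[0] + np.dot(coef[1:], lags)
        out.append(nxt)
        history.append(nxt)                    # feed forecast back in
    return out
```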

Figure 11. Forecast result for future 10 days using autoregressive model

After building the models, the next step is to evaluate them and check whether there is any potential for improvement. Three approaches are used to evaluate each model: the correlogram of the in-sample forecast errors, the histogram of the forecast errors, and the Ljung-Box test. Results are shown in Figure 12. From these evaluations, we can see that the autocorrelations at many lags for both simple exponential smoothing and Holt-Winters exponential smoothing exceed the significance bounds, and the p-values for these two models are very low. This indicates that non-zero autocorrelations exist in the in-sample forecast errors at lags 1-20 for both models. In contrast, for the ARIMA model, few lags' autocorrelations exceed the significance bounds, and the p-value is relatively large, although still not larger than 0.05. As for the histograms of the forecast errors, all three models have roughly constant variance and errors normally distributed with mean zero.

Figure 12. Evaluations on the three models

Consequently, it can be concluded that all three models could be improved, although the ARIMA model is better than the other two.
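For reference, the Ljung-Box statistic applied above is Q = n(n+2) * sum over k of r_k^2/(n-k); a sketch computing it from the in-sample forecast errors (comparing Q against a chi-squared quantile with h degrees of freedom then gives the p-value):

```python
import numpy as np

def ljung_box_statistic(residuals, max_lag=20):
    """Ljung-Box Q over the forecast errors.

    Under the null of no autocorrelation, Q ~ chi-squared with max_lag
    degrees of freedom; a large Q (small p-value) indicates remaining
    structure in the errors.
    """
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    n = len(e)
    denom = np.dot(e, e)
    q = 0.0
    for k in range(1, max_lag + 1):
        r_k = np.dot(e[:-k], e[k:]) / denom   # autocorrelation at lag k
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q
```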

In addition, although these models predict the general trend over the whole venues' data set, we assume that similar trends exist for a specific venue, so the models should remain applicable to predicting its activity.

4.2 Spatial-temporal models Rather than applying temporal models to predict the percentage of venues being trending, the second method incorporates the temporal factors into features and applies ordinary data mining models. I call these spatial-temporal models since both spatial and temporal features are considered.

In this part, feature selection, data preprocessing, models, and experimental evaluations are discussed respectively.

4.2.1 Feature selection As described in Section 2, most of the features are extracted from Foursquare using their API. I categorized these features into three groups based on their functions: spatial features, temporal features and general features.

• Temporal features

1. Check-ins: I assumed this to be the primary feature for predicting trending venues, because it directly reflects a venue's popularity. However, there is no obvious pattern between check-ins and trending events, as clearly demonstrated in Figure 13 (Benner & Robles, 2012).

Figure 13. Local view of trending and average check-ins for Erie PA.

Therefore, other features may be considered in determining the event of trending.

2. Here-now: This is the feature that directly determines the event of trending (i.e., if this count is larger than 6, the venue is trending). Consequently, the historical here-now values could affect the results.

3. Promotions: If a venue has promotions, it is more likely to attract more customers thus increasing the probability for this venue to be trending. So this feature is taken into account.

4. Modified_category: The traditional category data are in a nominal format, but in this project I converted category into a numeric predictor by considering the probability of venues of one category being trending at different periods of the day. By doing so, the temporal factor is introduced into the feature set. Note that I separated the day into four slices, but more detailed partitions could be investigated in the future.

The standard I used to slice the time is shown below:

T1: 0-5: midnight
T2: 6-11: morning
T3: 12-17: afternoon
T4: 18-23: night

And the function to calculate this feature is:

modified_category = n(i,t) / N(i,t)

where n(i,t) is the number of trending venues of category i in time slice t, and N(i,t) is the total number of trending and non-trending venues of category i in time slice t.
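A sketch of this computation, assuming the counts are taken over raw observations of (category, hour, trending-flag) rather than distinct venues:

```python
from collections import defaultdict

def time_slice(hour):
    """Map an hour of day to the four six-hour slices defined above."""
    return hour // 6   # 0: midnight (0-5), 1: morning, 2: afternoon, 3: night

def modified_category(records):
    """records: iterable of (category, hour, is_trending) observations.

    Returns modified_category[(category, slice)] = n(i,t) / N(i,t): the
    fraction of observations of category i in time slice t that are trending.
    """
    trending = defaultdict(int)
    total = defaultdict(int)
    for cat, hour, is_trending in records:
        key = (cat, time_slice(hour))
        total[key] += 1
        if is_trending:
            trending[key] += 1
    return {key: trending[key] / total[key] for key in total}
```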

• Spatial feature

I categorized competitiveness as a spatial feature. It is represented by the percentage of similar venues within a buffer, which is set to 150 meters in this work.

competitiveness = n(i,b) / N(b)

where n(i,b) is the number of venues with the same category i within the buffer range b, and N(b) is the total number of venues within the buffer range b.

This feature is important because a venue is not an isolated geographic object; it has relationships with people as well as with surrounding objects. The former features reflect the interaction between humans and geographic objects, while the relationship between objects is represented by competitiveness. This relationship could be positive or negative; my hypothesis is that the more similar venues nearby, the smaller the chance for a venue to be trending.

• General features

1. Users count: the number of users who have checked in at the venue before. The larger this feature, the more popular the venue could be. It is a cumulative feature, so I do not regard it as a temporal feature here.

2. Number of reviews: I assume that the more users write reviews for a venue, the larger the chance for this venue to be trending. However, this is only a rough assumption; more work could be done from this perspective. For example, if text mining were used to infer the mood of users writing reviews, the usefulness of this feature would be enhanced.

4.2.2 Data preprocessing Only 5 out of the 7 features can be retrieved directly from Foursquare using the API; the two others, competitiveness and modified_category, are calculated by the functions above. I used Postgres, which provides spatial queries, and Python for this preprocessing. Besides, the full data set is too large for building models, so I cut the data down to one week (Sep 17th to Sep 23rd, when the data are relatively consistent and complete). There are 307,156 instances in the truncated data set, which is still a large number. Before implementing the models, I also randomized the order of instances to make the models more general, and the values were normalized before building the models.
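A minimal sketch of the randomization and normalization step; the report does not state which normalization was used, so min-max scaling is an assumption:

```python
import numpy as np

def preprocess(X, seed=0):
    """Shuffle instances and min-max normalize each feature column to [0, 1]."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    X = X[rng.permutation(len(X))]                   # randomize instance order
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # avoid divide-by-zero
    return (X - mins) / span
```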

4.2.3 Evaluation I used a 70/30 percentage split to evaluate the models instead of 10-fold cross-validation, since the data set is too large for 10-fold cross-validation. The results are shown in Table 5.


Four measures are used for the evaluations: mean absolute error (MAE), root mean squared error (RMSE), relative absolute error (RAE), and root relative squared error (RRSE).

Three models are applied: linear regression, artificial neural network (ANN), and support vector machine (SVM). From the table, we can see that although the MAE and RMSE are low for all three models, the RAE and RRSE are very high. RAE and RRSE reflect how much better a model is than simply predicting the mean; a value larger than 100% means the model is worse than predicting the mean. The reason for this poor RAE and RRSE performance is that the data set is extremely imbalanced: instances with trending status occupy less than 10% of the whole data set, and most instances are non-trending.

Table 5. Modes’ performances

Model   MAE   RMSE   RAE   RRSE  

Linear  regression  

0.0148   0.0769   116.9913%   95.875%  

ANN   0.0105   0.0769   83.1311%   95.3184%  

SVM   0.0065   0.0805   51.277%   100.3251%  
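The four error measures in Table 5 can be computed as below; RAE and RRSE are taken relative to the baseline of always predicting the mean of the actual values:

```python
import numpy as np

def regression_errors(actual, predicted):
    """Return (MAE, RMSE, RAE, RRSE) for a set of regression predictions."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    baseline = np.full_like(a, a.mean())    # the "just predict the mean" model
    mae = np.abs(a - p).mean()
    rmse = np.sqrt(((a - p) ** 2).mean())
    rae = np.abs(a - p).sum() / np.abs(a - baseline).sum()
    rrse = np.sqrt(((a - p) ** 2).sum() / ((a - baseline) ** 2).sum())
    return mae, rmse, rae, rrse
```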

5. DISCUSSION & FUTURE WORK From these analyses, we can conclude that venues' trending events have spatial and temporal patterns. Temporal models can predict the general trend with respect to the whole data set. However, I cannot draw any conclusions about the spatial-temporal models, since the results are too biased by the imbalanced data set. More work, mostly concentrating on solving the imbalance problem, is needed before any meaningful conclusions can be made.

There are many challenges in this project. First, data collection is tedious and risky: it takes a very long time to collect such a large data set, and errors may occur during the daily collection process that are hard to detect.

Secondly, extracting features from the data set and other sources is also challenging and tedious. Among the 7 features, only 5 can be obtained directly from the raw data set; the remaining 2, modified_category and competitiveness, are either derived from the data set or obtained from other sources.

Thirdly, given the large amount of historical data, this is a big-data project, and missing and noisy data are common. These challenges complicate preprocessing and model building. On the other hand, the data are still not enough to solve this problem comprehensively: the current data cover only four months, one season, so only weekly and daily patterns are studied in this work. If more data could be collected, more detailed monthly and seasonal patterns could be investigated further; the only problem is that this would increase the big-data challenges in return.

6. REFERENCES

[1] Robles, C., & Benner, J. (2012, September). A Tale of Three Cities: Looking at the Trending Feature on Foursquare. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom) (pp. 566-571). IEEE.

[2] Zhang, K., Jin, Q., Pelechrinis, K., & Lappas, T. (2013, August). On the importance of temporal dynamics in modeling urban activity. In Proceedings of the 2nd ACM SIGKDD International Workshop on Urban Computing (p. 7). ACM.

[3] Benner, J., & Robles, C. (2012). Trending on Foursquare: Examining the Location and Categories of Venues that Trend in Three Cities. In Proc. Workshop on GIScience in the Big Data Age, 7th International Conference on Geographic Information Science (pp. 27-34).

[4] Karamshuk, D., Noulas, A., Scellato, S., Nicosia, V., & Mascolo, C. (2013). Geo-Spotting: Mining Online Location-based Services for Optimal Retail Store Placement. arXiv preprint arXiv:1306.1704.

[5] Mitsa, T. (2010). Temporal Data Mining. CRC Press.

[6] Antunes, C. M., & Oliveira, A. L. (2001, August). Temporal data mining: An overview. In KDD Workshop on Temporal Data Mining (pp. 1-13).

[7] Roddick, J. F., & Spiliopoulou, M. (1999). A bibliography of temporal, spatial and spatio-temporal data mining research. ACM SIGKDD Explorations Newsletter, 1(1), 34-38.

[8] Wei, L., & Keogh, E. (2006, August). Semi-supervised time series classification. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 748-753). ACM.

[9] Xi, X., Keogh, E., Shelton, C., Wei, L., & Ratanamahatana, C. A. (2006, June). Fast time series classification using numerosity reduction. In Proceedings of the 23rd International Conference on Machine Learning (pp. 1033-1040). ACM.