
A Supervised Modeling Approach to Determine the Elite Status of Yelp Members Using Decision Trees and Linear Regression

Chithroobeni Shankar, Carnegie Mellon University, [email protected]

Darshana Sivakumar, Carnegie Mellon University, [email protected]

Jennifer Li, Carnegie Mellon University, [email protected]

Julie Tram, Carnegie Mellon University, [email protected]

Moustafa Aly, Carnegie Mellon University, [email protected]

Neil Everette, Carnegie Mellon University, [email protected]

Ravindra Udipi, Carnegie Mellon University, [email protected]

Sahil Kumar, Carnegie Mellon University, [email protected]

Abstract

Yelp, which was founded in 2004 by two PayPal executives, is a crowd-sourced multinational company headquartered in San Francisco, CA. Yelp's goal is to connect people with great local businesses. Yelp has over 77 million cumulative reviews from yelpers around the world. Yelpers share their everyday local business experiences, giving voice to consumers and bringing word of mouth online. On a monthly average, approximately 142 million unique visitors used Yelp's website, and approximately 79 million unique visitors visited Yelp via their mobile devices [1].

Embedded among all these business reviews and yelpers is a classification between Elite and Non-Elite yelpers. Yelp Elite is a way for Yelp to recognize and reward users who are active on Yelp. Elite-worthiness is based on a number of things, including well-written reviews, high-quality tips, a detailed personal profile, an active voting and complimenting record, and a history of playing well with others [2]. Elite status is earned every year and is determined by a committee. Elite yelpers have profiles with special badges, and they are invited to private events and parties.

For the data analytics course project, our team will attempt to crack the code with a systematic algorithm that predicts users' Elite worthiness. We will use the Yelp academic set and the associated user attributes to determine the most accurate algorithm for predicting elite status. Our goal for the project is to predict with 95% accuracy whether a user obtains elite status for any particular year within the Yelp Academic set.

We should note that there are some inherent risks in using the Yelp academic data set. Our team has no insight into any additional or hidden indicators that may be used in determining Elite status beyond the data fields provided in the Yelp Academic set. The academic dataset only has 12% of the reviews from the 370K users. Our algorithm and modeling are based on the data that exists in the academic data set.

1. Introduction

The Yelp Academic Dataset has been provided by Yelp to be used for academic purposes. The dataset is a rich resource of interaction information between customers and businesses on the Yelp platform. Yelp's academic dataset includes information about businesses near 30 different premium schools, including Carnegie Mellon University in Pittsburgh, Pennsylvania. The academic dataset comes in the form of different JSON files for different objects, with nested JSON structures and arrays. It consists of five objects related to Businesses, Customers, Reviews, Customer Check-ins, and Customer Tips. Business objects contain basic information about local businesses. Review objects contain the review text, the star rating, and information on votes Yelp users have cast on the review. User objects contain aggregate information about a single user across all of Yelp. Table 1 shows the number of records for each of these categories and describes the Yelp objects [3].
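The JSON objects described above are distributed as line-delimited JSON, one object per line, which pandas can parse directly. A minimal sketch, using a toy two-record stand-in since the real user file is not reproduced here (file layout and field names are assumptions):

```python
import io

import pandas as pd

# Toy stand-in for the line-delimited user file: one JSON object per line.
raw = io.StringIO(
    '{"user_id": "u1", "review_count": 75, "fans": 3, "elite": [2014, 2015]}\n'
    '{"user_id": "u2", "review_count": 5, "fans": 0, "elite": []}\n'
)

# pandas reads the JSON-lines format directly into a flat frame.
users = pd.read_json(raw, lines=True)
print(users.shape)  # (2, 4)
```

The same call, pointed at the real files, yields one DataFrame per object type (Business, Review, User, Check-ins, Tips).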

2. Problem Statement

All Yelpers can nominate themselves or their friends to be an elite member on the Yelp website. According to Yelp, there isn't a specific benchmark for a member to be selected to


Object Type | Description | Num. of Records
Business | Business objects contain location information, number of reviews, average star ratings, and the URL of the local businesses. | 61,184
Review | Review objects contain the review text, the star rating, and information on votes Yelp users have cast on the review. [4] | 1,569,264
User | User objects contain aggregate information about a single user across all of Yelp. | 366,715
Check-ins | The Check-ins set provides data related to user check-in patterns for businesses. | 45,166
Customer tips | Similar to the reviews, the tips set also has a text column that provides quick tips related to the businesses. | 495,107

Table 1. Various Objects Belonging to the Yelp Academic Dataset

be an elite member or not. Also, to be considered elite, a member needs to reapply every year [5].

Yelp's Elite Council's process of selecting elite members is a black box to the rest of the world. What if, using Yelp's historic data, we could create an automated process for deciding whether a member is fit to be given elite status? This could potentially ease the selection task for the Elite Council by automatically filtering out nominations that are predictably unfit for elite status. This would result in savings for Yelp, as the overhead cost of preliminary filtering for the Elite Council would be removed.

2.1 Goal

Our goal is to create an algorithm to predict a user's elite status on Yelp. We want to predict a user's elite status with an accuracy of 95%.

3. Initial Data Investigation

There are 5 data objects provided in the Yelp academic dataset, comprising 1.6 million reviews and 500k tips by 366k users for 61k businesses in 10 cities and four countries [3]. Of the 366k Yelp members in the dataset, only 25k (6.8%) were determined to be elite members. For our initial investigation we analyzed the 20 attributes of the user data object to find correlations that could distinguish elite from non-elite members.
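The 6.8% elite share can be derived from the user object's elite field, which lists the years a user held the status. A hedged sketch on toy data (the field name is taken from the dataset; the records are invented):

```python
import pandas as pd

# Toy user table; the real dataset has 366,715 users, ~25k of them elite.
users = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "elite":   [[2015], [], [2013, 2014], []],
})

# A user counts as (ever-)elite if the "elite" list of years is non-empty.
users["is_elite"] = users["elite"].str.len() > 0
elite_share = users["is_elite"].mean()
print(f"{elite_share:.1%}")  # 50.0% on the toy data; ~6.8% on the full set
```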

3.1 Most Significant Attributes for 2015

The data set was initially reduced to user activity in the year 2015. The red outline in the box plot, developed using Tableau and seen in Figure 1, identifies the non-elite members. For elite members, the following attributes showed significant differences over non-elite members:

• Number of reviews written
• Number of user fans
• Votes counted as Useful
• Votes counted as Cool
• Votes counted as Funny

Figure 1. Most Significant Attributes for 2015

These five attributes were initially flagged for further analysis.

3.2 Review Count Past 10 Years, Elite vs Non-Elite

The box plot in Figure 2 depicts the most significant attribute, Review Count.

When the user data attributes were expanded over a 10-year span, the findings from the 2015 information were confirmed.

With small exceptions in 2005 (the first year of Elite qualification) and 2015 (an incomplete year), the attribute findings were consistent across the 10-year span.

According to our initial analysis of the user attribute data alone, the four attributes in Table 2 had a high correlation in identifying Elite vs Non-Elite members. Additional manipulation of the data (merging the user data with the review data set) was required to further test whether other conditions, such as Yelp Elite status in previous years, had any additional correlating effects.
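Quartile comparisons of the kind tabulated in Table 2 reduce to a grouped quantile computation. A sketch on toy numbers (not the real distribution):

```python
import pandas as pd

# Toy review counts for elite and non-elite users (column names assumed).
df = pd.DataFrame({
    "is_elite":     [True] * 4 + [False] * 4,
    "review_count": [51, 75, 106, 200, 2, 5, 11, 40],
})

# Quartiles per group, as in Table 2.
q = df.groupby("is_elite")["review_count"].quantile([0.25, 0.50, 0.75]).unstack()

# Elite-to-non-elite multiplier at each quartile.
ratio = q.loc[True] / q.loc[False]
print(ratio)
```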


Figure 2. Review Count Past 10 Years, Elite vs Non-Elite

Review Count      Elite        Non-Elite   Difference
75% Quartile      106 Reviews  11 Reviews  9.6x
Median            75 Reviews   5 Reviews   15x
25% Quartile      51 Reviews   2 Reviews   25x

Votes Useful      Elite        Non-Elite   Difference
75% Quartile      140 Votes    16 Votes    8.8x
Median            70 Votes     4 Votes     17.5x
25% Quartile      40 Votes     1 Vote      40x

Votes Cool        Elite        Non-Elite   Difference
75% Quartile      63 Votes     4 Votes     15.8x
Median            27 Votes     1 Vote      27x
25% Quartile      14 Votes     0 Votes     14x

Votes Funny       Elite        Non-Elite   Difference
75% Quartile      50 Votes     4 Votes     12.5x
Median            20 Votes     1 Vote      20x
25% Quartile      10 Votes     0 Votes     10x

Table 2. Initial Data Findings

Dataset      Num. of Attributes
Users        23
Businesses   105

Table 3. Number of Attributes in Different Datasets

4. Feature Selection

Feature selection is a popular technique in data mining that helps reduce input data to a more manageable size for processing and analysis. It implies not only cardinality reduction, i.e., reducing the number of features to be selected based on a cutoff count, but also actively selecting features or attributes of a dataset based on their usefulness for analysis [4]. Some datasets contain many attributes that are sparse in their information. This may lead to cumbersome fitting problems with a model and can even degrade the quality of the result by introducing noise into the analysis. For this reason we paid attention to feature selection and 'data massaging' early in our work.

As alluded to earlier, the raw Yelp datasets had a high number of attributes describing Users and Businesses, as seen in Table 3.

In our bid to create predictive models for determining Yelp Elite user selection, we found the models built on the raw dataset to have a high degree of inaccuracy. In order to determine the usefulness of the available attributes, we decided to evaluate the correlation between a user's elite status and the other attributes available to depict their behavior on the Yelp platform. Furthermore, we rendered correlation matrices for the available dataset. This helped us narrow down to the attribute groups of interest. Still cautious about discarding data, we decided to try clubbing related attributes together. We had 10 different types of compliments and 10 attributes to represent them. Since they could all essentially be clubbed into an aggregated field representing the overall compliments, we decided to experiment with that.

With this experiment, we noticed slightly higher accuracy in our predictive model. Encouraged by the change, we decided to apply the same approach to some more related attributes. There were 3 attributes representing the 3 different types of votes a user had received. Consolidating the data from these 3 columns into one was the next step. Our model improved with this step too.
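The clubbing of compliment and vote columns described above amounts to a row-wise sum over each attribute group. A sketch with assumed (flattened) column names:

```python
import pandas as pd

# Toy user frame with flattened compliment and vote columns
# (names assumed after JSON flattening).
users = pd.DataFrame({
    "compliments.hot":  [3, 0],
    "compliments.cool": [2, 1],
    "compliments.more": [1, 0],
    "votes.useful":     [40, 1],
    "votes.funny":      [10, 0],
    "votes.cool":       [14, 0],
})

# Club each related attribute group into a single aggregate column.
users["aggregated_compliments"] = users.filter(like="compliments.").sum(axis=1)
users["aggregated_votes"] = users.filter(like="votes.").sum(axis=1)
print(users[["aggregated_compliments", "aggregated_votes"]])
```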

Once we had a better-organized dataset to work with, we decided to trim it down further to highlight the patterns we were interested in and use the more prominent correlations. We built a new set of correlation matrices on the new dataset in order to filter down to the attributes with the highest impact in determining a user's elite status. Comparing the correlations of the newly generated aggregated attributes helped us find the areas we needed to concentrate on in order to build an effective model. We were able to improve our models by leveraging this information


Figure 3. Correlation Matrix

Figure 4. Correlation Matrix

by creating appropriate rules that better utilized the correlation information we had found.
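The correlation-based filtering just described can be sketched as ranking attributes by the absolute value of their correlation with the elite flag and keeping those above a threshold (the 0.5 cutoff below is illustrative, not the one we used):

```python
import pandas as pd

# Toy numeric frame; in our work this was the massaged user dataset.
df = pd.DataFrame({
    "is_elite":     [1, 1, 1, 0, 0, 0],
    "review_count": [106, 75, 51, 11, 5, 2],
    "fans":         [9, 6, 4, 0, 1, 0],
    "noise_attr":   [3, 1, 4, 1, 5, 9],
})

# Correlation of every attribute with the elite flag; keep the strongest.
corr = df.corr()["is_elite"].drop("is_elite").abs().sort_values(ascending=False)
selected = corr[corr > 0.5].index.tolist()
print(selected)
```

On this toy frame, the weakly correlated noise attribute drops out while review_count and fans survive.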

5. Algorithm Experiments

Our team decided to dive deeper into the Yelp user data set to gain better insights into it. As we focused on Yelp Elite member status, we began exploring different techniques to determine any correlations that would help establish our model. The team used supervised learning techniques to understand the criteria for Yelp Elite membership.

The criteria for our classifier algorithm selection were:

• Rule-based classification: We are interested in generating a rules engine for evaluating Yelp Elite users.

• Reasonable computational complexity: The academic dataset is over 2 GB, with the review dataset over 1.4 GB. We need the best of both worlds: a fast algorithm that allows multiple experiments and produces a high-quality model.

Based on that, the following initial set of algorithms was selected for experimentation; we evaluated their effectiveness during the project life cycle.

• Alternating decision tree
• kNN: k-nearest neighbor classification
• Bayesian algorithms
• Random forest
• CART: classification and regression tree
• Conjunctive rule classification

Below, we briefly discuss our results with each type of classifier:

• Bayesian algorithms: We ran the data set against a number of Bayesian algorithms, and the results showed a very weak true positive rate (62%). A quick look at the nature of our data and some visualizations led us to understand why the Bayesian algorithms performed poorly. Bayesian algorithms assume strong independence among the attributes, but in our data set we could see strong correlations between attributes, such as the number of star counts and the number of reviews. The statistical advantage of Bayesian algorithms was lost in our case.

• Regression models: We had a hunch that regression models were not our best option. The data is rich in its attributes, and the percentile distribution of the values in each attribute leads to multiple decision points. For example, if the number of compliments a user has is less than 2, he or she is definitely not an elite member. Nevertheless, we tried the regression models; the results were better than the Bayesian algorithms, but far from our target: a true positive rate of 72%.

• Alternating decision trees: Based on our observations, we needed an algorithm that works with independent attributes and is sensitive to different bands of data. Decision trees seemed a natural and logical progression. We had better results, with a true positive rate of 79%, but could not improve beyond this.

• Random forest: As it is a family of decision trees, the results were almost identical to the previous algorithm.

• kNN: k-nearest neighbor seemed to be a good choice, as it tends to do well as a binary classifier when k is selected to be an odd number. The results were very promising, with a true positive rate of 84%. However, we identified that we could not improve further, as the data is quite discrete, which degrades the performance of the algorithm.

• CART: Classification and regression trees seemed to combine the best of both worlds: rules that can take care of the percentile distribution, and regression that can easily identify the correlations between the attributes and weight them. Indeed, we achieved the best results with a J48 tree and linear regression; our true positive rate was 94.2%.

Figure 5. Dataset Relations
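We ran these classifiers in a workbench (J48 is the Weka implementation of C4.5); an analogous comparison loop with scikit-learn stand-ins for the same classifier families, on synthetic skewed data rather than the Yelp set, might look like:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic ~93:7 skewed binary data standing in for the Yelp user table.
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.93], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1 / 3, random_state=0, stratify=y)

models = {
    "NaiveBayes": GaussianNB(),
    "kNN (odd k)": KNeighborsClassifier(n_neighbors=5),
    "DecisionTree (CART)": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # True positive rate of the minority ("Elite") class.
    tp_rate = recall_score(y_te, model.predict(X_te))
    print(f"{name}: {tp_rate:.1%}")
```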

6. Results and Analysis

Once we finalized our goal to create a model for determining the elite worthiness of a Yelp user, we focused on the user attributes and the review attributes. During the experiment period we used this data in different combinations to arrive at our final model. In this section we describe the reasoning behind each of these combinations and the results of our experiments on each of these data sets.

6.1 Pennsylvania Data Set

Goal: We decided to focus on the data from one state, as it provides a balanced distribution of business types, users, and reviews, and helps us understand the behavior of and correlations among these attributes. With the smaller data set it is also easier to try different algorithms.

Data Manipulation: We chose Pennsylvania as it has the businesses around the CMU campus and ranked second on the restaurants-per-state metric in the academic dataset. The user data set did not have state information for users. Under the assumption that review data is local, we picked the businesses in Pennsylvania, selected the reviews for these businesses, and then got the users and the corresponding attributes for these reviews. The relation between the Business, User, and Review objects is shown in Figure 5.
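The business-to-review-to-user join described above can be sketched with pandas merges over toy frames (column names assumed):

```python
import pandas as pd

# Toy frames mirroring the Business -> Review -> User relations in Figure 5.
businesses = pd.DataFrame({"business_id": ["b1", "b2"],
                           "state": ["PA", "NV"]})
reviews = pd.DataFrame({"review_id": ["r1", "r2", "r3"],
                        "business_id": ["b1", "b1", "b2"],
                        "user_id": ["u1", "u2", "u3"]})
users = pd.DataFrame({"user_id": ["u1", "u2", "u3"],
                      "review_count": [75, 5, 40]})

# Businesses in Pennsylvania -> their reviews -> the reviewing users.
pa_biz = businesses[businesses["state"] == "PA"]
pa_reviews = reviews.merge(pa_biz, on="business_id")
pa_users = users[users["user_id"].isin(pa_reviews["user_id"])]
print(pa_users["user_id"].tolist())  # ['u1', 'u2']
```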

Datasets Used:

Data Size: 17,791

Elite:Non-Elite: 1:12

Attributes Used: review count, fans, votes.cool, votes.funny, votes.useful, average stars, compliments.hot, compliments.more, compliments.list

Results: The results obtained using the J48 Pruned Treeand Regression Classifier are shown in Table 4.

Discussion: The Pennsylvania data was initially selected because it is smaller and takes less time to try different algorithms. At the same time, it has a balanced distribution of business types, users, and reviews.

The data was divided into test data and training data at a ratio of 1:2. After running the J48 graft pruned tree classifier, 95.40% of users were correctly classified. The ROC area is 95.70%, which means the classification is quite accurate. However, the false positive rate on Non-Elite users is quite high, which means many users that could qualify as elite are falsely classified as Non-Elite. So the goal for the next step is to expand the algorithm to a larger scale, as well as to reduce the false positive rate.
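The weighted-average rates reported in the tables throughout this section are per-class true positive rates weighted by class size. A small worked example with toy confusion counts (these counts are invented for illustration, not the actual experiment's):

```python
# Toy confusion counts. Rows: actual class; columns: predicted class.
confusion = {"Elite":     {"Elite": 171, "Non-Elite": 29},
             "Non-Elite": {"Elite": 84,  "Non-Elite": 2316}}

def tp_rate(cls):
    # Fraction of actual members of cls that were predicted as cls.
    row = confusion[cls]
    return row[cls] / sum(row.values())

# Weight each class's TP rate by its share of the test set.
n = {cls: sum(row.values()) for cls, row in confusion.items()}
total = sum(n.values())
weighted_tp = sum(tp_rate(c) * n[c] / total for c in confusion)
print(f"{weighted_tp:.1%}")  # 95.7%
```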

6.2 Review Data Set

Goal: The academic data set has 1.6 million reviews, spread across multiple users and businesses. The intention of this experiment is to predict elite Yelpers based on the review data alone.

Data Manipulation: To be able to use the review data, we decided to aggregate it by userId for each year. Elite status is granted to users on a yearly basis; being elite in one year doesn't necessarily mean the status is kept the next year. User data doesn't reflect this time sensitivity. Most attributes in the user dataset are results aggregated across the years since the user joined Yelp. So we decided to explore the review dataset, which has timestamps of when the user posted reviews.

For a given userId we aggregated the star ratings (1, 2, 3, 4, and 5) provided by each user and also the votes (funny, cool, and useful) they got. For each of these user Ids we inserted an isElite flag based on the years they were elite. The years a user held elite status are available in the user data set. A sample record set aggregation is depicted in Figure 6.
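The per-user, per-year aggregation can be sketched as a pivot over the review timestamps, with the isElite flag looked up from the user object's list of elite years (toy records, assumed field names):

```python
import pandas as pd

# Toy review records (column names assumed from the review JSON).
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "date":    ["2014-03-01", "2014-07-09", "2015-01-15", "2014-05-05"],
    "stars":   [5, 4, 5, 2],
})
elite_years = {"u1": {2014}}  # from the user object's "elite" list

reviews["year"] = pd.to_datetime(reviews["date"]).dt.year

# One row per (user, year): count of reviews at each star rating.
agg = (reviews.pivot_table(index=["user_id", "year"], columns="stars",
                           aggfunc="size", fill_value=0)
              .add_prefix("starCount"))

# Flag the (user, year) rows where the user held elite status.
agg["isElite"] = [y in elite_years.get(u, set()) for u, y in agg.index]
print(agg)
```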

Datasets Used:

Data Size: 500,967

Elite:Non-Elite: 1:13

Attributes Used: NumberOf5StarReviews, NumberOf4StarReviews, NumberOf3StarReviews, NumberOf2StarReviews, NumberOf1StarReviews, funnyVoteCount, usefulVoteCount, coolVoteCount

Results: The results obtained using the J48 Pruned Treeand Regression Classifier are shown in Table 5.

Discussion: The data was divided into test data and training data at a ratio of 1:2. In the results, the weighted-average true positive rate is 93.40%. This is not a significant change from the last experiment. However, the true positive rate for Non-Elite is 99% and the false positive rate for Elite is 1%, while the true positive


TP Rate   FP Rate   Precision   Recall    F-Measure   ROC Area   Class
85.50%    3.50%     74.00%      85.50%    79.30%      95.70%     Elite
96.50%    14.50%    98.30%      96.50%    97.40%      95.70%     Non-Elite
95.40%    13.40%    95.80%      95.40%    95.50%      95.70%     Weighted Avg.

Table 4. Pennsylvania Data Set Results

TP Rate   FP Rate   Precision   Recall    F-Measure   ROC Area   Class
21.80%    1.00%     62.30%      21.80%    32.30%      69.30%     Elite
99.00%    78.20%    94.20%      99.00%    96.50%      69.40%     Non-Elite
93.40%    72.60%    91.90%      93.40%    91.90%      69.40%     Weighted Avg.

Table 5. Review Data Set Results

Figure 6. Data Transformation

rate for Elite is 21.80% and the false positive rate for Non-Elite is 78.20%. This means the classifier tends to classify any given user as Non-Elite rather than Elite: almost 80% of users that could be elite are falsely classified as non-elite. The ROC area also dropped significantly, to 69.40%, meaning the accuracy of this classification is not very good.

There are two main reasons behind this result:

• Review attributes are not as strongly associated with elite status as user attributes.

• Data is highly skewed towards non-elite users.

To achieve more accurate results, we need to stay with user attributes while also taking advantage of review attributes.

6.3 All User Data

Goal: The academic data set has data on 366K users, with 23 attributes. The goal of this experiment is to predict elite Yelpers based on the user data alone.

Data Manipulation: We massaged the user-level attributes a little to obtain parsable data elements. From the feature selection process and the attribute correlation matrix, we identified that the user data set has attributes such as review count, fans, votes.cool, and votes.useful that play a significant role in obtaining elite status. We further aggregated the friends list and the total numbers of votes and compliments into measurable numeric counts.
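Converting the nested friends list and vote map into numeric counts is a straightforward transformation; a sketch with assumed field shapes:

```python
import pandas as pd

# Toy user records (nested field shapes assumed from the user JSON).
users = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "friends": [["u2", "u7", "u9"], []],
    "votes":   [{"useful": 40, "funny": 10, "cool": 14},
                {"useful": 1, "funny": 0, "cool": 0}],
})

# Turn nested structures into measurable numeric counts.
users["friendsCount"] = users["friends"].str.len()
users["aggregatedVotes"] = users["votes"].apply(lambda v: sum(v.values()))
print(users[["friendsCount", "aggregatedVotes"]])
```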

Datasets Used:

Data Size: 366,715

Elite:Non-Elite: 1:15

Attributes Used: review count, friends, fans, average stars, yelping.since.months, aggregated compliments, aggregated votes

Results: The results obtained using the J48 Pruned Treeand Regression Classifier are shown in Table 6.

Discussion: After expanding the algorithm to all user data, the result is quite satisfying. The data was still divided into test data and training data at a ratio of 1:2.

The weighted average is 97%, with the true positive rate of Elite users as high as 98.70%. The ROC area is 94.70%, which means this classification is relatively accurate. The false positive rate for Elite, however, is as high as 24.90%, meaning that users who are not supposed to be elite are classified into the elite group. We would like to reduce the false positive rate while maintaining the accuracy of the prediction.

This result proved that user attributes are very strongly associated with elite status, compared to review attributes. However, since elite status is granted on a yearly basis, user attributes still cannot capture the impact of the time factor on elite status. So the next goal is to merge the two datasets, to leverage the strengths of the user attributes and the time attribute of the review data.


TP Rate   FP Rate   Precision   Recall    F-Measure   ROC Area   Class
98.70%    24.90%    98.20%      98.70%    98.40%      94.70%     Elite
75.10%    1.30%     80.50%      75.10%    77.70%      94.70%     Non-Elite
97.00%    23.20%    96.90%      97.00%    97.00%      94.70%     Weighted Avg.

Table 6. All User Data Results

TP Rate   FP Rate   Precision   Recall    F-Measure   ROC Area   Class
79.40%    1.80%     76.30%      79.40%    77.80%      98.60%     Elite
98.20%    20.60%    98.50%      98.20%    98.30%      98.60%     Non-Elite
96.90%    19.30%    96.90%      96.90%    96.90%      98.60%     Weighted Avg.

Table 7. Merged Review and User Data Results

TP Rate   FP Rate   Precision   Recall    F-Measure   ROC Area   Class
94.20%    4.10%     90.20%      94.70%    92.40%      98.80%     Elite
95.90%    5.80%     97.50%      95.10%    96.70%      98.80%     Non-Elite
95.30%    5.30%     95.90%      95.90%    94.90%      98.80%     Weighted Avg.

Table 8. Balanced Merged Review and User Data Results

6.4 Merged Review and User Data

Goal: After independent analysis of user-level attributes and review-level attributes, we wanted to measure the impact of these attributes together on elite status. So we decided to aggregate the review data and merge it with the user data to predict elite status.

Data Manipulation: We merged the user-level attributes into the review attributes to be able to experiment with the combined data set. We converted the yelping_since column into a measurable number-of-months field; it is a factor that reflects how long a user has been active on Yelp. Here the reviewCount belongs to the user attributes, while starCount1 through starCount5 belong to the review attributes. In the dataset, the review data only captures about 12% of the total reviews all users have given, so the sum of all users' reviewCount is not equal to the number of reviews.
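The yelping_since conversion is a month-difference computation against a snapshot date. A sketch, assuming the field is a "YYYY-MM" string and taking an illustrative snapshot month (the exact cut-off we used is not reproduced here):

```python
import pandas as pd

# Toy frame; "yelping_since" is assumed to be a "YYYY-MM" string.
users = pd.DataFrame({"yelping_since": ["2010-03", "2014-11"]})

# Months of activity up to an assumed snapshot of January 2015.
SNAP_YEAR, SNAP_MONTH = 2015, 1
parsed = pd.to_datetime(users["yelping_since"], format="%Y-%m")
users["yelpMonths"] = ((SNAP_YEAR - parsed.dt.year) * 12
                       + (SNAP_MONTH - parsed.dt.month))
print(users["yelpMonths"].tolist())  # [58, 2]
```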

Datasets Used:

Data Size: 366,715

Elite:Non-Elite: 1:15

Attributes Used: yelpMonths, starCount5, starCount4, starCount3, starCount2, starCount1, averageStars, coolComplimentsCount, funnyComplimentsCount, useFulComplimentsCount, friendsCount, fanCount, reviewCount

Results: The results obtained using the J48 Pruned Treeand Regression Classifier are shown in Table 7.

Discussion: The merged dataset yields better results than the review-only and user-only data. While the weighted-average true positive rate stays as high as 96.90%, the average false positive rate dropped to below 20%. The ROC area is as high as 98.60%, up from 94.70% on the user data. Comparing the false positive rates of Elite and Non-Elite, the classifier still tends to classify users as non-elite: 20.60% of elite users are falsely classified as non-elite. The false positive rates of the two classes are very unbalanced.

This experiment showed that the combined attributes work better in the classification. However, the skewed data problem hasn't been solved yet. The next step is to balance the dataset so that the false positive rate can be further reduced.

6.5 Balanced Merged Review and User Data

Goal: The goal here is to run our experiments on a balanced data set that is not skewed towards Non-Elite members.

Data Manipulation: In all of the above-mentioned data sets, the proportion of elite users was very low, so the results were inclined towards classifying users as non-elite. To get the right balance, we chose from the merged dataset a subset with a balanced mix (1:2) of elite vs non-elite data.
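The rebalancing can be sketched as undersampling the non-elite majority down to twice the elite count (toy skew, illustrative random seed):

```python
import pandas as pd

# Toy skewed frame (~1:15 elite to non-elite, as in the merged dataset).
df = pd.DataFrame({"is_elite": [True] * 10 + [False] * 150})

# Undersample the non-elite majority down to a 1:2 elite:non-elite mix.
elite = df[df["is_elite"]]
non_elite = df[~df["is_elite"]].sample(n=2 * len(elite), random_state=0)

# Recombine and shuffle before splitting into training and test data.
balanced = pd.concat([elite, non_elite]).sample(frac=1, random_state=0)
print(balanced["is_elite"].value_counts().to_dict())  # {False: 20, True: 10}
```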

Datasets Used:

Data Size: 82,000

Elite:Non-Elite: 1:2

Attributes Used: yelpMonths, starCount5, starCount4,starCount3, starCount2, starCount1, averageStars, cool-ComplimentsCount, funnyComplimentsCount, useFulCom-plimentsCount, friendsCount, fanCount, reviewCount

Results: The results obtained using the J48 Pruned Treeand Regression Classifier are shown in Table 8.


Discussion: After balancing the dataset with 33% elite users and 67% non-elite users, we got the best result among all the experiments. The data was divided into test data and training data at a ratio of 1:2. The weighted-average true positive rate is 95.3%, with both Elite and Non-Elite close to 95%. The average false positive rate is 5.3%, balanced between elite and non-elite. The ROC area is 98.80%, higher than all previous results.

Given this result, we can confidently conclude that our classifier classifies Yelp users with a weighted-average accuracy of 95%.

7. Conclusion and Future Work

Our final model, developed using a J48 tree and linear regression, determines elite users with over 94% accuracy. It also gives an ROC area of 98.80%, establishing its correctness. However, this model has been developed with the academic data set provided by Yelp, and is thus missing some attributes. With additional attributes, such as the device through which reviews were written, the time taken to write reviews after meals, the proximity from which the reviews were written, the user attributes divided by year, and so on, we believe we can develop the model to predict elite users with more accuracy. The model also does not use natural language processing to determine the content of reviews. Applying NLP to this data may yield more conditions for the determination of elite status. Furthermore, the Yelp Elite Council does not disclose the factors it considers in the determination of elite status. The developed model is based only on historic data.

In the future, we would like to try our models on Yelp's complete dataset and check whether they yield similar results. We may have to make some modifications to incorporate new attributes in order to achieve similar accuracy. We also plan to submit our results to the 'Yelp Dataset Challenge' to evaluate our findings. Additionally, we will work with other qualitative factors, such as the content of reviews, in an effort to completely eliminate the manual process that Yelp uses to determine elite members.

References

[1] "Yelp Investor Relations." Web. 7 May 2015. http://goo.gl/Iz4ZEo.

[2] "What Is Yelp's Elite Squad?" Web. 7 May 2015. http://goo.gl/DcbkCX.

[3] "Yelp's Academic Dataset." Yelp. Accessed April 5, 2015. https://goo.gl/dHgVmn.

[4] "Feature Selection (Data Mining)." MSDN, Microsoft. 2015. Web. 7 May 2015. https://msdn.microsoft.com/en-us/ms175382.aspx.

[5] Stone, Madeline. "Elite Yelpers Hold Immense Power, And They Get Treated Like Kings By Bars And Restaurants Trying To Curry Favor." Business Insider. August 22, 2014. Accessed April 27, 2015. http://goo.gl/cZyOMN.