YELP DATA CHALLENGE - WordPress.com · For the business problem discussed above, we decided to...

YELP DATA CHALLENGE Team Super 5 Madlen Ivanova Kartik Niyogi Saritha Ramkumar Sampa Sanyal Sugandha Mann


YELP DATA CHALLENGE

Team Super 5

Madlen Ivanova

Kartik Niyogi

Saritha Ramkumar

Sampa Sanyal

Sugandha Mann


CONTENTS

INTRODUCTION

RESEARCH PROBLEM

DATA

ANALYSIS

I. ATTRIBUTES IMPACTING CONSUMER PREFERENCE

II. IDENTIFYING RELEVANT ATTRIBUTES THROUGH VOICE OF THE CONSUMER

III. CUSTOMER SEGMENTATION BASED ON REVIEW TEXT

BUSINESS STRATEGY RECOMMENDATION

CHALLENGES

CONCLUSION

REFERENCES

APPENDIX


INTRODUCTION

According to SAS, Big data is a term that describes the large volume of data – both structured

and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of

data that’s important. It’s what organizations do with the data that matters. Big data can be

analyzed for insights that lead to better decisions and strategic business moves.

Big data is relevant in all industries in this day and age. The amount of data that is created and

stored today is incredible, and it just keeps increasing. That means there’s even more potential to

collect key insights from business information.

The impact of big data on the restaurant business has grown manifold, just as it has in sectors such as banking, retail and pharmaceuticals.

Food Innovations:

The trend these days is towards applying analytical practices when opening new restaurants. Several theme-based restaurants have been running successfully; for example, Amelie’s bakery is a theme-based French café that opened in downtown Charlotte after a successful franchise in NoDa.

Food innovations are also setting trends in the ultra-competitive restaurant industry. On one hand, there are innovations like the Cronut, a hybrid of a croissant and a donut, which is currently in vogue.


On the other hand, food is also getting cleaner, leaner and healthier; for example, Panera Bread recently introduced a 100% clean menu of organic and locally harvested food.

The farm-to-table sourcing trend has also been brewing for quite some time in Charlotte and nearby places. Restaurants have their own terrace gardens or farms from which they source their raw materials.

Meal-specific ingredient delivery by companies like Blue Apron and HelloFresh is also in vogue; these companies deliver all the ingredients needed to create an amazing meal.

The impact of food innovations on restaurants has been tremendous, resulting in new theme-based and multi-cuisine restaurants. Some of the fastest growing chains in the restaurant industry are the ones embracing innovation throughout their operations, and limited-service chains in particular are thriving by leveraging innovation in various forms.

Innovation is being tried everywhere from customer experience to restaurant ambience, kitchens and menus, and it is also being used as a catalyst for franchise expansion. The mobile apps and other services offered by Yelp, Groupon, ChowNow etc. are the lifeline of restaurants: they help restaurants take orders online, making the customer ordering experience hassle free, and they provide customer reviews that try to capture the whole dining experience, from parking to pets. Beyond these services, extracting useful information from review texts offers businesses an innovative avenue to understand customer feedback and opportunities.

Charlotte Restaurant Market:

Charlotte is among the “top large markets” with 3+ million residents.


Table 1: North Carolina Restaurant Sales

According to the Charlotte Chamber of Commerce, there are more than 1,500 restaurants and bars

in Mecklenburg County. In the combined Charlotte, Concord, and Gastonia region, according to

the Bureau of Labor Statistics, that number grows to 4,382.

According to the National Restaurant Association, the number of restaurant and food service jobs

in North Carolina is expected to grow another 15% during the next 10 years.

Figure 1: North Carolina Restaurant Industry


The United States Commerce Department reported in 2015 that Americans now spend more

money at restaurants than they do at grocery stores. The average American family eats out 4.5

times a week.

For our study, we focus on the dataset released by Yelp as part of the Yelp Dataset Challenge 2017.

Our work focuses on gaining an in-depth understanding of consumers’ heterogeneous preferences toward various restaurant attributes by analyzing consumers’ restaurant reviews on Yelp.com. We identify the key attributes for particular cuisines (e.g., Italian, Japanese) in a specified area (e.g., residential, uptown) to help restaurant managers understand the Charlotte restaurant business, and thereby guide investors when considering investment opportunities.

RESEARCH PROBLEM

Our goal is to create a reference guide for future investors, so they can make informed decisions when opening a new restaurant or trying to improve an existing one. To do this, we attempt to understand how cuisine preferences and locality influence the success of restaurants in and around the Charlotte area. Charlotte is the third-fastest growing major city in the United States, and its restaurants offer a plethora of options in terms of ambience, parking space, price range, drive-through, delivery, pet friendliness etc. Customers, on the other hand, are also keen on a different set of attributes, such as friendliness of staff and wait time, along with food taste and quality. The weight of these preferences varies with both locality and cuisine type. Hence, we found it important to analyze business characteristics and customer preferences separately to get the complete picture.

To achieve this, the analysis proceeded with three main objectives:


• Identifying important business characteristics and trends of the Charlotte restaurant industry

• Understanding customers’ overall preferences for restaurants

• Identifying multiple customer segments in Charlotte, and comparing and contrasting customers’ differential preferences across Charlotte areas and cuisine types

DATA

For the business problem discussed above, we decided to integrate the freely available Yelp dataset from the Yelp Dataset Challenge 2017 with the City of Charlotte zoning data.

The Yelp dataset consisted of details of all businesses from 11 cities across 4 countries. The primary task in the data preparation was to convert the data from JSON format into CSV files using Python. The data was then subset to include only Charlotte restaurant businesses. The final list comprised the details of 2,140 restaurant businesses with a total of 121K review texts. Data preparation and merging were carried out in MySQL and Excel. The table below summarizes the key aspects of the datasets.

Data Files   Attributes

Users        User Information: User Name, Yelping Since
             Reviews: Number of reviews, Average Star rating
             Votes: Helpful votes Provided and Received (Cool, cute, funny, hot)

Business     Business Information: Name, Address, Latitude, Longitude
             Attributes: Key features like Ambience, Parking, Alcohol, Outdoor seating etc.
             Categories: Cuisine type – American, Italian, European, Latin, Chinese, Indian

Reviews      Key Identifiers: Business ID and User ID
             Review Information: Review Text, Date
             Review Usefulness: Star Rating, Votes (Useful, Funny, Cool)
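The JSON-to-CSV conversion and the Charlotte subsetting described above can be sketched in Python as follows. This is only a minimal illustration, not the team's actual script: it assumes the business file holds one JSON object per line, and the field names (`business_id`, `city`, `categories`, `stars` and so on) and the sample records are assumptions made for the example.

```python
import csv
import io
import json

# Columns kept in the CSV; the field names are assumed, not taken from the report.
FIELDS = ["business_id", "name", "latitude", "longitude", "stars"]


def charlotte_restaurants(json_lines):
    """Yield business records that are restaurants located in Charlotte."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        categories = record.get("categories") or []
        # Categories may be a list or a single comma-separated string.
        if isinstance(categories, str):
            categories = [c.strip() for c in categories.split(",")]
        if record.get("city") == "Charlotte" and "Restaurants" in categories:
            yield record


def write_csv(records, out_file, fields):
    """Write the selected fields of each record as one CSV row."""
    writer = csv.DictWriter(out_file, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Hypothetical sample records standing in for lines of the business file.
sample_lines = [
    json.dumps({"business_id": "b1", "name": "Cafe A", "city": "Charlotte",
                "latitude": 35.22, "longitude": -80.84, "stars": 4.0,
                "categories": ["Restaurants", "French"]}),
    json.dumps({"business_id": "b2", "name": "Salon B", "city": "Charlotte",
                "latitude": 35.20, "longitude": -80.80, "stars": 3.5,
                "categories": ["Hair Salons"]}),
    json.dumps({"business_id": "b3", "name": "Diner C", "city": "Phoenix",
                "latitude": 33.45, "longitude": -112.07, "stars": 3.0,
                "categories": ["Restaurants"]}),
]

buffer = io.StringIO()
write_csv(charlotte_restaurants(sample_lines), buffer, FIELDS)
csv_text = buffer.getvalue()
print(csv_text)
```

In practice the generator would be fed an open file handle instead of the in-memory list, so the full dataset never has to be loaded at once.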


The second dataset is the City of Charlotte zoning data, which is coded in geospatial shapefiles (.shp) with each zone marked by area and shape. The latitude and longitude from the Yelp business file are matched against the zone area coordinates to mark the zone of each restaurant in the business file. This geospatial merging was carried out in R. Charlotte County had 23 unique zones. For ease of analysis, the 17 city zones were rolled up into 5 zones (Uptown/Office, Residential, Industry/Business, Institution/Research and Others) based on similarity, as shown in the table below.

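CHARLOTTE CITY ZONES      ROLLED UP ZONE

Business                  BUSINESS/INDUSTRY
Heavy Industrial          BUSINESS/INDUSTRY
Light Industrial          BUSINESS/INDUSTRY
Commercial Center         BUSINESS/INDUSTRY
Business-Distribution     BUSINESS/INDUSTRY

Multi-Family              RESIDENTIAL
Single Family             RESIDENTIAL
Manufactured Home         RESIDENTIAL
Urban Residential         RESIDENTIAL

Uptown Mixed Use          UPTOWN/OFFICE
Business Park             UPTOWN/OFFICE
Office                    UPTOWN/OFFICE
Mixed Use Residential     UPTOWN/OFFICE

Institutional             INSTITUTION/RESEARCH
Research                  INSTITUTION/RESEARCH

Mixed Use                 OTHERS
Transit Oriented          OTHERS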

Table 2: Consolidated Zones
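The zone assignment and roll-up described above can be sketched as a point-in-polygon test followed by a lookup. The report did this merge in R against the city shapefiles; the Python version below is only an illustration of the idea. The roll-up mapping follows Table 2, while the two square zone polygons and the coordinates are hypothetical stand-ins for real shapefile geometry.

```python
# Roll-up mapping taken from Table 2.
ROLLED_UP_ZONE = {
    "Business": "Business/Industry",
    "Heavy Industrial": "Business/Industry",
    "Light Industrial": "Business/Industry",
    "Commercial Center": "Business/Industry",
    "Business-Distribution": "Business/Industry",
    "Multi-Family": "Residential",
    "Single Family": "Residential",
    "Manufactured Home": "Residential",
    "Urban Residential": "Residential",
    "Uptown Mixed Use": "Uptown/Office",
    "Business Park": "Uptown/Office",
    "Office": "Uptown/Office",
    "Mixed Use Residential": "Uptown/Office",
    "Institutional": "Institution/Research",
    "Research": "Institution/Research",
    "Mixed Use": "Others",
    "Transit Oriented": "Others",
}


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_zone(lon, lat, zones):
    """Return the rolled-up zone containing the point, or None if no zone does."""
    for zone_name, polygon in zones:
        if point_in_polygon(lon, lat, polygon):
            return ROLLED_UP_ZONE[zone_name]
    return None


# Hypothetical square zones; real zones would come from the .shp files.
zones = [
    ("Uptown Mixed Use",
     [(-80.85, 35.22), (-80.83, 35.22), (-80.83, 35.24), (-80.85, 35.24)]),
    ("Single Family",
     [(-80.80, 35.15), (-80.75, 35.15), (-80.75, 35.20), (-80.80, 35.20)]),
]

uptown_zone = assign_zone(-80.84, 35.23, zones)
residential_zone = assign_zone(-80.78, 35.17, zones)
no_zone = assign_zone(0.0, 0.0, zones)
print(uptown_zone, residential_zone, no_zone)
```

Treating longitude and latitude as flat x/y coordinates is a simplification that is reasonable at city scale but ignores map projection.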


ANALYSIS

Exploratory Data Analysis:

Figure 2: (Left) Review count by Cuisine. (Right) Box-plot of Star Rating by cuisine

As can be seen from Fig. 2 (Left), only a few restaurants account for the bulk of the reviews, while most other restaurants have very few reviews. Fig. 2 (Right) shows that the median star rating is approximately the same across cuisines.

I. ATTRIBUTES IMPACTING CONSUMER PREFERENCE

The focus is on understanding the relative importance of certain key (primary) attributes, along with the utility value of their respective sub-attributes, based on consumer preference. Secondary attributes are analyzed to understand their impact on star ratings.

The Business dataset contained an “Attribute” column, which provided details of 29 attributes that described the restaurants. Five primary attributes were shortlisted for conjoint analysis. The remaining 24 secondary attributes were used in a regression model to analyze their impact on star rating.

Analyzing primary attributes:


The five primary attributes along with their respective sub-attributes are as below:

• Alcohol – {Wine & Beer, Full Bar, None}

• Ambience – {Casual, Classy, Divey, Hipster, Intimate, Romantic, Touristy, Trendy, Upscale, None}

• Parking – {Garage, Lots, Street, Valet, Validated, None}

• Good For Meals – {Breakfast, Brunch, Lunch, Dinner, Dessert, Latenight, None}

• Price – {1,2,3,4}
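The report does not state how the conjoint utilities were estimated, so the following is only a simplified sketch of the general idea. It takes the part-worth of a sub-attribute level to be the mean rating of profiles with that level minus the grand mean, and the relative importance of an attribute to be its part-worth range divided by the sum of all ranges. This matches a main-effects estimate only for a balanced design. The four profiles and their ratings are hypothetical.

```python
from collections import defaultdict


def part_worths(profiles, ratings):
    """Part-worth of each level = mean rating at that level - grand mean."""
    grand_mean = sum(ratings) / len(ratings)
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for profile, rating in zip(profiles, ratings):
        for attribute, level in profile.items():
            sums[attribute][level] += rating
            counts[attribute][level] += 1
    return {
        attribute: {
            level: sums[attribute][level] / counts[attribute][level] - grand_mean
            for level in sums[attribute]
        }
        for attribute in sums
    }


def relative_importance(worths):
    """Importance of an attribute = its part-worth range / sum of all ranges."""
    ranges = {a: max(lv.values()) - min(lv.values()) for a, lv in worths.items()}
    total = sum(ranges.values())
    return {a: r / total for a, r in ranges.items()}


# Hypothetical balanced 2x2 design with made-up star ratings.
profiles = [
    {"Alcohol": "Full Bar", "Price": "1"},
    {"Alcohol": "Full Bar", "Price": "4"},
    {"Alcohol": "None", "Price": "1"},
    {"Alcohol": "None", "Price": "4"},
]
ratings = [4.0, 4.5, 3.0, 3.5]

worths = part_worths(profiles, ratings)
importance = relative_importance(worths)
print(worths)
print(importance)
```

In this toy example Alcohol has twice the part-worth range of Price, so it receives two thirds of the total importance.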

The results of the Conjoint Analysis are as shown in Table 3 below:

Table 3: Conjoint Analysis: Key attributes vs Cuisine/ Zone

Cuisine/Zone             Alcohol     Ambience   Parking     GFM               Price (1 - 4)

Cuisine
  American               Wine_Beer   Hipster    Validated   Dinner            4
  Asian                  Wine_Beer   Classy     Validated   Lunch/Dinner      3/4
  European               Wine_Beer   Divey      Validated   Brunch            3
  Mexican/Latin          Wine_Beer   Hipster    Validated   Dinner            4
  Others#                Wine_Beer   Classy     Street      Dessert           2/3

Zone
  Business/Industry      Wine_Beer   Classy     Validated   Dessert/Lunch     3/4
  Institution/Research   Full_Bar    Classy     Street      Lunch/Latenight   1/2
  Others*                Full_Bar    Romantic   Lots        Dessert/Lunch     1
  Residential            Wine_Beer   Hipster    Validated   Brunch            3/4
  Uptown/Office          Wine_Beer   Hipster    Street      Dessert           4

# - Sandwich & Juice bars, Pizza, Burger joints
* - Transit Oriented
Legend: blue = Primary Significant, orange = Secondary Significant, uncolored = All others

The table maps attributes against Cuisine/Zone based on importance. Blue represents the attribute (and sub-attribute) most prominently associated with a particular Cuisine/Zone, while orange represents the second most significant attribute (sub-attribute). E.g., customers visiting American cuisine restaurants value “Hipster” ambience and “Validated” parking as the primary significant attributes, while European cuisine customers value “Divey” ambience as the primary attribute, followed by alcohol availability (specifically “Wine & Beer”) as the secondary significant attribute. Similarly, for Zone, customers dining in the Uptown/Office area prefer “Hipster” ambience as the primary attribute and an expensively priced restaurant as the secondary attribute.

The conjoint table helps us map consumer preferences across cuisine and zone. Based on this, we can derive the opportunity zones for each cuisine, as shown in Fig. 3 below:

Figure 3: Opportunity Matrix between Cuisine and Zone

A visual representation on the Charlotte map is shown in Fig. 4 below:

Figure 4: Charlotte Map: Zone vs Cuisine

Analyzing secondary attributes:

The impact of the remaining 24 secondary attributes on star rating is shown in Table 4 below:


Table 4: Regression: Star Rating vs secondary variables

Regression Output

Dependent Variable     Stars
Obs                    2140
R-Square               0.16
Root MSE               0.71

Significant Variable   Estimate
Bike Parking           0.31
Caters                 0.14
Drive Thru             -0.77
Delivery               -0.09
Wheel Chair            0.13
Happy Hours            -0.31

Non-Significant Variables: Credit Card, Take Out, Good For Kids, WiFi, Has TV, BitCoin, Noise Level, BYOB, Outdoor Seating, Dogs Allowed, Attire, Music, Good For Groups, Best Night, Reservation, Diet Restrictions, Table Service

The regression output is characterized by the following:

• The low R2 of 0.16 indicates that the model lacks explanatory power

• Only 6 of 24 variables are significant

• Parameter estimates are not in line with general expectations:

o Drive Thru, Delivery and Happy Hours are negative

o We would expect a higher Star rating for improvement in the above factors

A possible explanation for this observation is that high-end restaurants generally garner higher star ratings than daily eat-out restaurants. While most daily eat-out restaurants provide a drive-thru facility, high-end restaurants do not. This can be seen from the table below:

Drive Thru?   Avg. Star Rating   Restaurants
True          2.6                McDonald’s, Wendy’s, Burger King, Chick-fil-A etc.
False         3.4                Amelie’s French Bakery, Mojo's Famous, Noodles & Company etc.

For daily eat-outs, customers generally write reviews when they are unhappy, e.g. about cold pizza, a mixed-up order or a long wait time. So the low average rating for daily eat-outs is possibly due not to facilities like Drive Thru or Delivery service but to other, larger service issues.
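The negative Drive Thru estimate is easier to read once one notes that, in a regression of star rating on a single yes/no attribute, the slope is simply the difference between the two group means. The sketch below shows this with a one-variable least-squares fit. It is not the 24-variable model from Table 4, and the ratings are hypothetical values chosen only for illustration.

```python
def ols_single_dummy(flags, stars):
    """Least-squares fit of stars on one True/False attribute.

    Returns (intercept, slope, r_squared).
    """
    n = len(stars)
    x = [1.0 if flag else 0.0 for flag in flags]
    mean_x = sum(x) / n
    mean_y = sum(stars) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, stars))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, stars))
    ss_tot = sum((yi - mean_y) ** 2 for yi in stars)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Hypothetical ratings: four drive-thru restaurants, five without drive-thru.
drive_thru = [True, True, True, True, False, False, False, False, False]
stars = [2.5, 3.0, 2.5, 2.0, 3.5, 4.0, 3.0, 3.5, 4.0]

intercept, slope, r_squared = ols_single_dummy(drive_thru, stars)
mean_true = sum(s for f, s in zip(drive_thru, stars) if f) / 4
mean_false = sum(s for f, s in zip(drive_thru, stars) if not f) / 5
print(intercept, slope, r_squared)
```

Here the intercept equals the mean rating of restaurants without a drive-thru and the slope equals the gap between the two groups. A negative slope therefore says only that drive-thru restaurants are rated lower on average, not that adding a drive-thru lowers a rating. In the report's own figures, the gap between the 2.6 and 3.4 averages is about 0.8 stars, close to the reported estimate of -0.77.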


II. IDENTIFYING RELEVANT ATTRIBUTES THROUGH VOICE OF THE CONSUMER

The objective of this exercise was to use text categorization as a tool to identify key user topics and attributes within the restaurant category. The approach taken was to identify key attributes that were not already part of the analysis in Part 1 and Part 2 but are extremely critical to the industry. The key user topics that were finalized are the following:

• Customer Experience (service, food, ambience, décor, wait, order, staff and extra amenities like bike parking, BYOB, etc. were the key attributes included in this topic)

• Taste

• Wait Time

• Go with

• Entertainment (Music, T.V etc.)

Approach Taken for Text Categorization

• Restaurant Ontology was generated after going through the entire review corpus. A sample

list is attached in the Appendix Table 18

• Standard English Stop words were used in addition to customized stop words

• Text Categorization was done across reviews corpus and across specific cuisines

• This was done in an iterative fashion to remove terms of non-interest and highlight those

which would drive insights
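The categorization steps listed above can be sketched as follows: remove standard and customized stop words, then count how many reviews mention at least one term from each user topic. This is only an illustration of the approach, not the tool the team used. The stop word sets, the topic term lists and the sample reviews are hypothetical, and the "Go with" topic is omitted for brevity.

```python
import re
from collections import Counter

# Hypothetical stop word sets; the report combined a standard English list
# with customized stop words.
STOP_WORDS = {"the", "a", "and", "was", "is", "to", "we", "for", "of", "but"}
CUSTOM_STOP_WORDS = {"charlotte", "restaurant"}

# Hypothetical ontology: terms that map a review to a user topic.
TOPIC_TERMS = {
    "Customer Experience": {"service", "staff", "ambience", "decor", "order"},
    "Taste": {"taste", "tasty", "delicious", "flavor"},
    "Wait Time": {"wait", "slow", "minutes"},
    "Entertainment": {"music", "tv"},
}


def tokenize(text):
    """Lower-case the text, keep alphabetic words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words
            if w not in STOP_WORDS and w not in CUSTOM_STOP_WORDS]


def topic_counts(reviews):
    """Count the number of reviews that mention each topic at least once."""
    counts = Counter()
    for review in reviews:
        tokens = set(tokenize(review))
        for topic, terms in TOPIC_TERMS.items():
            if tokens & terms:
                counts[topic] += 1
    return counts


REVIEWS = [
    "The service was slow but the food was delicious.",
    "Great music and friendly staff.",
    "We had to wait 40 minutes for a table.",
]

counts = topic_counts(REVIEWS)
print(counts)
```

Refining the term lists after inspecting the counts, and rerunning, mirrors the iterative approach described in the last bullet.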


Table 5: No of occurrences across the entire corpus

Table 6: No of occurrences across the American Cuisine subcategory

As can be seen from Table 5, Customer Experience and Taste remain the top two key attributes across the entire corpus and across specific cuisine types. Another view of the leading term frequencies is as follows:

Figure 5: Term Frequency (leading term frequencies across the entire corpus: Food 31%, Place 26%, Service 17%, Time 16%, Menu 10%)


The above chart indeed validates that, based on term frequencies, Food and Service together drive customer reviews and preferences within the industry. Analyzing the key terms that form the leading user topics gives the following output.

Table 7: User Topics for various attributes (panels: Customer Experience, Entertainment, Go With, Taste)


As we can see from Table 7, customer reviews were classified across 4 user topics, with Customer Experience and Taste predominating as the leading “user topics”.

The highest-frequency terms in each topic are also displayed, with their frequency counts across cuisines and the leading terms that drive each of them. For example, under Customer Experience we can see terms like place, service and time, which help us draw insights on the features, attributes or parameters that are important to the “Customer Experience” category.

In a similar way, key topics under the category “Entertainment” can be seen in Table 7. Terms like “noise”, “television” and “music” lead this category and are indicative of the consumer’s “Entertainment” preferences.

Text categorization was also used effectively to mine for key terms across “Highly Rated” and “Lower Rated” restaurants. For this, the corpus was categorized by taking restaurants with 4 Star and 5 Star ratings as “High” and those with less than 4 Stars as “Low”.

The output is as follows:

Table 8: Topics for highly rated restaurants


From the above we can summarize the following:

• Machine-generated topics are more prominent than user-provided topics

• Drivers for good ratings can be clearly seen across the categories of Food, Service and Price. These have been highlighted in the above table

• Term frequencies also seem to indicate the same result

Figure 6: Concept Link

From the above we can see a strong term association among “food”, “good food”, “service” and “great service”. These terms are indicative of key attributes that drive higher ratings across the restaurant category.

Table 9: Topics Driving Negative Restaurant Reviews


From Table 9 above, we can see that negative ratings are driven mainly by service and amenities within the restaurant, and not specifically by food. Exploring this further via concept links, we can see a strong association between negative terms like “horrible”, “bad” and “slow”. The topics have been highlighted in the above table for clarity.

We also find that the concept link of “Food” does not have any negative terms strongly associated with it. This highlights that “Service” remains one of the key attributes driving negative reviews within this industry.
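A concept link can be approximated by counting how often a target term appears in the same review as each candidate term. The sketch below uses plain co-occurrence counts as a stand-in for the association strength a text mining tool would compute, which is an assumption of this example rather than the report's method. The reviews are hypothetical.

```python
import re
from collections import Counter


def linked_terms(reviews, target, candidates):
    """Count reviews in which `target` co-occurs with each candidate term."""
    counts = Counter()
    for review in reviews:
        tokens = set(re.findall(r"[a-z]+", review.lower()))
        if target in tokens:
            for term in candidates & tokens:
                counts[term] += 1
    return counts


NEGATIVE_TERMS = {"horrible", "bad", "slow"}

# Hypothetical low-rated and high-rated review snippets.
reviews = [
    "Horrible service and a slow kitchen.",
    "Service was bad and slow.",
    "Good food, great service.",
    "Food was good.",
]

service_links = linked_terms(reviews, "service", NEGATIVE_TERMS)
food_links = linked_terms(reviews, "food", NEGATIVE_TERMS)
print(service_links, food_links)
```

In this toy corpus "service" co-occurs with all three negative terms while "food" co-occurs with none, which is the pattern the concept links above describe.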

Comparison of Leading Key Terms Across Different Cuisine Categories:

The objective here was to understand, through text categorization, how key terms differ across cuisine types, and so gather insights into the differences, similarities and characteristics within the cuisine category.

Table 10: Key Terms across Cuisine

Food
  American: Specials, order, server
  Asian: Chinese, Japanese, Thai, delicious food, expansive menu, reasonable price
  European: Service, Dress, Décor
  Mexican: Authentic, Service, people, price, atmosphere
  Others: terrible, server, atmosphere, price, service

Place
  American: Vibes, music, specials
  Asian: Nice, clean, atmosphere, park, family friendly
  European: pack, people, understand, atmosphere, date
  Mexican: patio, table, silverware
  Others: vibe, cool, sit, server, music

Service
  American: Return, Sit, customer service, server
  Asian: table, staff, fast, efficient, server
  European: bad, excellent, slow, staff
  Mexican: Server/staff, experience, location, slow
  Others: order, server, arrive, people, TV

Time
  American: server, menu, seat, hard time
  Asian: wait, server, whole time, hard, order
  European: arrive, order, wait time
  Mexican: server, order, arrive
  Others: explain, combo, serve, attitude

Menu
  American: special, tasty
  Asian: variety, desserts
  European: Specials, flavor, crispy, perfect texture, tasty
  Mexican: explain, combo, serve, attitude
  Others: special, tasty, restaurant

Great/Good
  American: food, drink specials, variety
  Asian: food, drink, offer, town, spot, customer service
  European: location, food, experience, selection
  Mexican: vibe, outdoor, music, food, specials, variety
  Others: food, staff, atmosphere

From the above table, we can gather the key terms that characterize a cuisine type and hence drive consumer preferences within that segment. For example, the American cuisine highlights “Specials” in terms of Food, indicating the customer preference within this category for “specials” (special offers, special combos, etc.). The Place term is highlighted with sub-terms like “vibes”, “music” and “specials”, throwing light on customer preferences regarding the ambience and place of the restaurant.

Hence, with this one-stop view, text categorization helps us assess the key characteristics that drive each of these cuisine categories across key terms.

III. CUSTOMER SEGMENTATION BASED ON REVIEW TEXT

The objective of the analysis is to segment users into different groups based on the key aspects they consider in reviewing a restaurant. Different users give preference to different facets while dining at a restaurant. Beyond the taste and quality of food, people may consider the hospitality of staff, wait time, etc. The analysis looks for these features by clustering the reviews and then cross-referencing the clusters to each of the zones and cuisine types separately. This helps us conclude how customer preferences vary across zones as well as cuisine types. We also cross-reference the results to the review rating score provided by the user to assess sentiment, so that we can gauge user satisfaction with the food, service and wait time. Lastly, tabulating the number of open and closed restaurants against the customer segments helps in understanding how these preferences affect the overall success of the restaurant.

A comprehensive stopword list to eliminate other attributes and keywords was pivotal in the data processing. The list of stopwords used is given in Table 11 below:

burrito   finally    Rice        lunch      fries      sides      restaurant   Steak
crust     mexican    Beer        Meal       salad      hotdog     pasta        noodle
fish      pizzas     Bar         Meat       wing       hot        soup         Roll
pizza     tacos      bread       menu       wings      dog        sushi        Thai
salsa     toppings   burger      Pork       fried      chilli     potato       Broth
taco      sauce      chicken     sandwich   Queso      drink      mac          chinese
beans     cheese     dinner      Wine       egg        eat        chip         indian
chips     shrimp     Dish        Food       coffee     price      $            vietnamese
ramen     pho        Charlotte   Bean       Tortilla   Chipotle   breakfast

Table 11: List of Stopwords

The analysis was carried out in the SAS E-Miner Text Mining interface. SAS E-Miner offered two

types of text clustering mechanisms - Expectation Maximization and Hierarchical. As it goes with

any clustering mechanism, the primary challenge was to identify the ideal number of clusters;

various combinations were tried to see the most meaningful and logical representation of text.

After a couple of iterations, it was decided to go with three clusters; the key attributes of which

are summarized in the Table 12 below:

Table 12: Summarization of Key Attributes

As is evident from the above table, the three clusters were cohesive (indicated by the

low RMSTD) and equidistant in the vector space. Also, the proportion of review texts was almost

uniform across the clusters. The cluster labels based on the descriptive terms in the cluster

definition were - Time Bound, Foodies, and Service/Atmosphere Bound. These three clusters were

indicative of the three common customer segments in the restaurant market.
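Cluster cohesion of the kind reported above can be quantified with a root-mean-square standard deviation. The sketch below assumes the usual pooled definition, namely the square root of the summed squared deviations from the cluster mean divided by p(n - 1), for n points in p dimensions; the report does not spell out the formula. The two small clusters are hypothetical two-dimensional points, whereas real review vectors would have many more dimensions.

```python
import math


def rmsstd(points):
    """Root-mean-square standard deviation of one cluster.

    points: list of equal-length numeric tuples (n points, p dimensions).
    Assumed definition: sqrt(sum of squared deviations / (p * (n - 1))).
    """
    n = len(points)
    p = len(points[0])
    means = [sum(pt[j] for pt in points) / n for j in range(p)]
    squared_dev = sum((pt[j] - means[j]) ** 2 for pt in points for j in range(p))
    return math.sqrt(squared_dev / (p * (n - 1)))


# Hypothetical clusters: the second is the first stretched by a factor of 3.
tight_cluster = [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)]
loose_cluster = [(0.0, 0.0), (0.0, 6.0), (6.0, 0.0), (6.0, 6.0)]

tight_value = rmsstd(tight_cluster)
loose_value = rmsstd(loose_cluster)
print(tight_value, loose_value)
```

A lower value means the members sit closer to their cluster centre, which is the sense in which a low RMSTD indicates a cohesive cluster.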

Analyzing the customer segments:

The next task in the analysis was to analyze the distribution of customer segments across

zones and cuisines. The pie chart distribution summarizes the same:


Figure 7: Distribution by Zone

The bigger pie chart shows the number of reviews for the restaurants by zone, and the smaller pie charts show the customer segment distribution in each zone, identified by the matching color. It is evident that customers in the uptown area are more inclined towards service and atmosphere, whereas customers in the business and industry zones are more concerned with the taste and quality of food.

Figure 8: Distribution by Cuisine


This chart is analogous to Fig. 7 above, but shows the cuisine-wise distribution instead of the zones. It is clear that customers talk a lot about the service and atmosphere of European and Asian restaurants.

Once the zonal and cuisine-wise distribution was understood, the analysis proceeded to examine how customer likings vary by zone. The average rating given by the customers who provided the reviews was used to understand customer sentiment. Table 13 below summarizes the mean customer rating by zone and cuisine for each of the customer segments identified above:

Table 13: Mean Customer Rating by Zone and Cuisine
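A summary of this kind amounts to grouping reviews by zone (or cuisine) and customer segment and averaging the star rating in each group. The sketch below shows one way to do that. The field names (`zone`, `segment`, `stars`) and the four sample reviews are hypothetical, while the segment labels come from the clustering described earlier.

```python
from collections import defaultdict


def mean_rating_by(reviews, group_key):
    """Mean star rating for each (group value, customer segment) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of stars, count]
    for review in reviews:
        key = (review[group_key], review["segment"])
        totals[key][0] += review["stars"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical reviews already labelled with a zone and a customer segment.
reviews = [
    {"zone": "Uptown/Office", "segment": "Foodies", "stars": 4},
    {"zone": "Uptown/Office", "segment": "Foodies", "stars": 5},
    {"zone": "Uptown/Office", "segment": "Time Bound", "stars": 2},
    {"zone": "Residential", "segment": "Foodies", "stars": 3},
]

means = mean_rating_by(reviews, "zone")
print(means)
```

Passing a different `group_key`, such as a cuisine field, would give the cuisine-wise version of the same table.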

Lastly, the above segments by zone and cuisine were compared based on the percentage of closed restaurants, as shown in Fig. 9 below, which helped validate Table 13 with respect to the overall success of the restaurants.


Figure 9: Percentage of Closed Restaurants by Zone/ Cuisine

BUSINESS STRATEGY RECOMMENDATION:

Case Study:

The case study was performed to apply the conclusions from our analysis. We decided to apply the discovered customer preferences to 2 instances from our data (a successful and an unsuccessful restaurant) to confirm our research. We chose the restaurants with the highest number of reviews, since more reviews mean a more objective public opinion, which better helps us determine the reasons that contributed to the unsuccessful case (the restaurant shutting down).

Table 14: “Pinky’s Westside Grill” vs “Nan and Byron’s”

Successful case: “Pinky’s Westside Grill”

The successful case is an American cuisine restaurant located in a residential zone. It has been in business for about 7 years and has a rating of 4 stars, as shown in Table 14 above.

Unsuccessful case: “Nan and Byron’s”

The unsuccessful case was an American cuisine restaurant located in the uptown area. It closed down after 3.5 years and had a rating of 3.5 stars at the time of closing.

Comparison:

Pinky’s Westside Grill Nan and Byron’s


First, to compare the two restaurants fairly, we considered their average popularity in terms of how often people reviewed them. We divided each restaurant's number of reviews by the length of time it had been in business. Both restaurants have a similar number of reviews per month and a similar rating. The successful case still leads, but the difference between the two is not significant. This made our case study even more intriguing: given that the restaurants had similar popularity and ratings, what might have caused one to close while the other stayed in business?
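The normalization described above can be sketched as follows. The years in business (about 7 and 3.5) come from the case descriptions; the review counts are placeholders chosen only for illustration, because the actual counts appear in Table 14 and are not reproduced in the text. The function name is likewise hypothetical.

```python
def reviews_per_month(review_count, years_in_business):
    """Average number of reviews per month over the restaurant's lifetime."""
    months = years_in_business * 12
    if months <= 0:
        raise ValueError("years_in_business must be positive")
    return review_count / months


# Years are from the case study; review counts are PLACEHOLDERS, not real data.
pinkys_rate = reviews_per_month(review_count=840, years_in_business=7)
nan_byrons_rate = reviews_per_month(review_count=378, years_in_business=3.5)
```

Dividing by time in business keeps a long-running restaurant from looking more popular simply because it has had more years to accumulate reviews.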

The following table describes the profile of a successful American cuisine restaurant that opens in a residential or uptown area:

*GFM: Good for Meals

Table 15: Profile of successful restaurant by cuisine and zone

From the conjoint analysis, we concluded that a successful restaurant that opens in a residential area needs to offer at least wine and beer, have hipster ambience, provide validated parking, and be good for dinner and brunch. Additionally, from our customer segmentation analysis, we concluded that in the residential zone, for an American cuisine restaurant, people value food and waiting time the most.

On the other hand, we determined that a successful American cuisine restaurant that opens in the uptown area should provide at least wine and beer, have hipster ambience, and offer validated or


street parking. It should be good primarily for dinner, with dessert also an important dining aspect. From our customer segmentation analysis, we concluded that in the uptown area, for an American cuisine restaurant, people value service and waiting time the most.

Next, we compared the characteristics of the two restaurants with the conclusions from our research. We used the reviews of the two restaurants to find out what might have caused “Nan and Byron’s” to close its doors. After comparing the characteristics of a successful restaurant with the profiles of the two restaurants, we observed the following:

*GFM: Good for Meals

Table 16: Comparing Successful vs Unsuccessful Restaurants

Successful case:

“Pinky’s Westside Grill” satisfies the customer needs for choice of drinks and preferred ambience (highlighted in green). It is good for lunch and dinner and, according to the reviews, has very good food and a short waiting time. It does not have “validated parking”, but it offers a parking lot, which is acceptable.

Unsuccessful case:

“Nan and Byron’s” satisfied the customer needs for choice of drinks. According to our analysis, customers preferred validated or street parking, so we found that having a parking lot is


acceptable. However, the restaurant did not match the customer preferences for ambience. As per the conjoint analysis, “Nan and Byron’s” was preferred more for brunch, whereas customers in the uptown location valued dinner and dessert. Moreover, people in this zone valued service the most, but according to the reviews, this restaurant did not have the best service. It had good food, but that was not enough to keep its customers satisfied, which may have contributed to its closure. In essence, had this restaurant better understood its customer segment and taken remedial action to match customer preferences, it might have stayed in business longer.

Key Recommendations:

Irrespective of the features a restaurant offers, customers look for three major factors: timely service, hospitality of staff, and quality of food. However, the relative importance of these attributes can vary considerably with locality and cuisine type. Hence, to cater to these demands, even different franchises of the same chain may have to tailor these attributes to the zone and/or cuisine type. For example, customers of uptown restaurants are more concerned about wait time than about food or staff hospitality, whereas customers who reviewed restaurants in business/industrial areas care more about food taste and quality than about waiting time or the services offered. Similarly, if an investor is looking to invest in an American cuisine restaurant, the ideal value proposition to customers would be “Hipster” ambience along with “validated” parking. These consumers would value specials along with variety in menu options, and timely service of the order is also critical. On the other hand, if an investor is choosing an Institutional/Research location, the ideal value proposition would be a relatively less expensive restaurant with “Classy” ambience. Improving food quality while maintaining the price would help gain market share.


CHALLENGES

The biggest challenge during the data collection phase of the project was handling the large dataset of more than 4 GB, with more than one million rows in JSON format. The data conversion was done using Python, and the data was exported to a MySQL database for merging and filtering. Later, cleaning the text data, such as removing non-English and special characters, was a tedious task; however, the inbuilt features in SAS Miner simplified the process.
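A file of this size is easier to handle one line at a time than loaded whole. The sketch below is an assumption about how such a conversion step could look, not the team's actual script. It assumes the input is newline-delimited JSON with `city`, `categories`, `business_id`, `name`, `stars`, `review_count` and `is_open` fields, and the function name and sample records are hypothetical.

```python
import json


def charlotte_restaurants(lines, city="Charlotte"):
    """Yield selected fields for restaurants in `city` from JSON-lines input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        categories = record.get("categories") or []
        # Assumption: categories may be a list or a comma-separated string.
        if isinstance(categories, str):
            categories = [c.strip() for c in categories.split(",")]
        if record.get("city") == city and "Restaurants" in categories:
            yield {
                "business_id": record.get("business_id"),
                "name": record.get("name"),
                "stars": record.get("stars"),
                "review_count": record.get("review_count"),
                "is_open": record.get("is_open"),
            }


# Invented sample records, for illustration only.
sample_lines = [
    '{"business_id": "b1", "name": "Diner A", "city": "Charlotte", '
    '"categories": ["Restaurants", "American"], "stars": 4.0, '
    '"review_count": 120, "is_open": 1}',
    '{"business_id": "b2", "name": "Salon B", "city": "Charlotte", '
    '"categories": ["Beauty"], "stars": 3.5, "review_count": 10, "is_open": 1}',
    '{"business_id": "b3", "name": "Diner C", "city": "Phoenix", '
    '"categories": "Restaurants, Mexican", "stars": 4.5, '
    '"review_count": 80, "is_open": 0}',
]
rows = list(charlotte_restaurants(sample_lines))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so only one record is held in memory at a time. The filtered rows could then be written to a database table for merging.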

As we proceeded with the detailed text analysis, dealing with sarcasm in the reviews and differentiating genuine from made-up reviews posed an even greater challenge. Preparing a restaurant-specific taxonomy and stopword list eased this process. Lastly, in customer segmentation, text clustering had to be executed in multiple rounds to arrive at meaningful and logical clusters.
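The cleaning and stopword steps described above were carried out with SAS tooling. As a rough stand-in, the same idea can be sketched in Python. The stopword sets and the function name below are hypothetical, and the regular expression simply drops everything except the letters a-z, which is a crude approximation of removing non-English and special characters.

```python
import re

# Hypothetical stopword lists; the report's actual lists are not given.
BASIC_STOPWORDS = {"the", "a", "an", "and", "at", "this", "was", "but", "is"}
DOMAIN_STOPWORDS = {"restaurant", "place", "charlotte"}


def clean_review(text, stopwords=BASIC_STOPWORDS | DOMAIN_STOPWORDS):
    """Lowercase, strip non a-z characters, and drop stopwords."""
    text = text.lower()
    # Replace digits, punctuation and non-ASCII characters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [token for token in text.split() if token not in stopwords]


tokens = clean_review(
    "The food at this place was GREAT!!! Service: 10/10, but the wait was 45 min."
)
```

Adding domain-specific stopwords such as "restaurant" or "place" removes terms that occur in almost every review and would otherwise dominate the clusters.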

Our recommendations are primarily for the Charlotte area. As such, they might not be fully applicable to other regions within the US.

We also have not taken into account other factors that affect the restaurant business, such as the restaurant model (franchise or single-owner), operational cost, profit margin, employee skills, and managerial acumen, as these were beyond the scope of our project.

CONCLUSION

Charlotte is among the fastest-growing cities, with rapid expansion in both business and population. We found it important to understand how cuisine preferences and locality influence the success of restaurants in Charlotte, and we developed a reference guide for future investors so they can make an educated decision when executing their projects.


To address our first objective, we focused on identifying the key attributes of the Charlotte restaurant industry. The primary attributes that we identified are specific to the Charlotte region. The importance of these attributes might vary for other regions.

With regard to our second objective of understanding customers’ overall preferences for restaurants, we identified a term matrix detailing key terms across different cuisines.

Our third objective was to identify multiple customer segments in Charlotte and contrast customers’ differing preferences across the Charlotte area. Based on our customer segmentation analysis, we found three primary customer segments: Time Bound, Service/Atmosphere Bound, and Foodie customers. We concluded that higher customer satisfaction is driven by Food and Service, while lower satisfaction levels are primarily accounted for by poor Service/Ambience.

REFERENCES:

Primary dataset: https://www.yelp.com/dataset_challenge

SAS, Big Data – What is it and Why it Matters!

https://www.sas.com/en_us/insights/big-data/what-is-big-data.html#

City of Charlotte Zoning Data:

http://cltcharlotte.opendata.arcgis.com/datasets/17a4cbd948934fae8a63139a8e371000_8

National Restaurant Association, “Big Data and Restaurants: Something to Chew On”

Matt Wolff, March 2011, “The Best 10”, Restaurant Growth Index


APPENDIX:

Table 17: Working for Conjoint Analysis - Cuisine

Table 18: Sample Ontology for Topic Modeling
