Information School
Dissertation COVER SHEET (TURNITIN)
Module Code: INF6000
Registration Number 140136568
Family Name Rodgers First Name Lauren
Assessment Word Count: 10,283. Coursework submitted after the maximum period will receive zero marks. Your assignment
has a word count limit. A deduction of 3 marks will be applied for coursework that is 5% or more above or below the word count as specified above or that does not state the word count.
Ethics documentation is included in the Appendix if your dissertation has been judged to be Low
Risk or High Risk. X (Please tick the box if you have included the documentation)
A deduction of 3 marks will be applied for a dissertation if the required ethics documentation is not included in the appendix. The deduction procedures are detailed in the INF6000 Module Outline and Dissertation Handbook (for postgraduates) or the INF315 Module Outline and Dissertation Handbook (for undergraduates)
140136568
Explore the Impact of Social Media for Improving Customer Satisfaction with
User Generated Content: A case study
A study submitted in partial fulfillment of the requirements for the degree of
MSc Data Science
at
THE UNIVERSITY OF SHEFFIELD
by
Lauren Rodgers
September 2015
Abstract
Background
The literature reveals how sentiment analysis and opinion mining have sparked great interest
from researchers and businesses alike. There are many beneficial applications for companies
utilising such techniques, but few studies have actually compared the techniques on a
business’s ‘gold standard’ data, thus revealing a gap in the literature.
Aims
The aim of the research is to investigate user generated content in social media, and evaluate
the results against business ‘gold standard’ protocols using the company’s benchmarked data.
Methods
The methodology consists of a mixture of qualitative and quantitative data collection and
analysis, as well as carrying out a case study on Marston’s Plc. Reviews posted about Marston’s
Plc were scraped from TripAdvisor and compared to the reviews within the Benchmark dataset.
The reviews then underwent sentiment analysis and opinion mining, to extract the features and
opinions that customers write about. Particular methods such as error analysis were
performed to evaluate the techniques’ precision and accuracy.
Results
Results indicated that TripAdvisor reviews held more value than the current data used by
Marston’s Plc. The confusion matrices indicated that the highest accuracy, 45.45%, was
obtained using TripAdvisor reviews split into sentences. In addition, the results suggested that the
polarity scoring struggled to rate reviews negatively compared to the manually rated reviews.
Conclusions
The study concludes that the precision and accuracy scores obtained were significantly lower
than those of similar studies in the literature review. The research, however, highlights
the capability the techniques have for driving the decision making process in a business, and that
the methods proposed can be generalised to other industry sector types.
Acknowledgements
I would like to thank Professor Paul Clough for his ongoing support and guidance through this
dissertation.
I would also like to recognise the Information School for putting together a very
interesting new course; it was very informative, and I have thoroughly enjoyed my time at the
University.
I would also like to thank my family for their inspirational talks which kept me going, and a big
special thanks to my step-mum, Sarah, who proofread my dissertation and gave very useful
advice. I couldn’t have done this without you.
I would also like to thank Marston’s Plc for contributing data for my dissertation. It has been
interesting to understand a part of the business functionality, and this research would not have
been possible without your participation.
Table of Contents
Abstract
Acknowledgements
List of Tables
Introduction
  The Research
  The Research Questions, Aims, and Objectives
  The Literature Review
Literature Review
  Current Business Methods Implementations
  Sentiment and Opinions
  Social Media
  Research Limitations
  Related Work
Methodology
  Research Design and Method
  Analytical Techniques
  Company Background
  TripAdvisor
  Ethics
  Data Collection
Results
  Dataset Comparison
  Manual Vs Automatic
  Feature Extraction
  Summary
Discussion
  The Research Aim
  Research Questions
    Can businesses utilise “free” data from social media sites to improve business protocols?
    Are there any significant differences in incorporating user generated content to standard business methods?
    What current approaches are businesses taking to monitor customer satisfaction?
  Related Work
Conclusion
  Limitations to the Research
  Future Work
  Recommendations
  A Final Word
Bibliography
Appendix A
  Ethical Application Form
  Research Approval Letter
  Information sheet provided to Marston’s Plc
  Signed consent form
Appendix B
  RStudio Script
    Basic Statistics
    Correlation
    Benchmark Dataset
    TripAdvisor Dataset
    Plotting Grouped Plots
    Scraping the Web
    Plot for Feature Summarisation
    Word Cloud Plots
List of Figures
Figure 1 shows an overview of creating feature summaries.
Figure 2 displays the average word count for reviews in each dataset
Figure 3 illustrates the vocabulary range within each dataset
Figure 4 illustrates polarity scores and TripAdvisor ratings
Figure 5 displays polarity scores and manual ratings for the Benchmark reviews
Figure 6 illustrates the Benchmark data and most frequent words
Figure 7 shows a word cloud from TripAdvisor reviews
Figure 8 illustrates Marston’s Benchmark data
Figure 9 displays the average polarity scores for top features in sentences in TripAdvisor
Figure 10 illustrates the Benchmark data and top features polarity
Figure 11 shows both dataset features and average polarity
Figure 12 shows positive and negative opinions expressed on the 3 features
Figure 13 shows positive and negative opinions on the 3 features for TripAdvisor
Figure 14 illustrates the feature polarity summary for all features
Figure 15 shows a map of pubs using traffic colour system
List of Tables
Table 1 illustrates the first five reviews scraped from TripAdvisor
Table 1 shows the ranges for transforming polarity scores into TripAdvisor ratings
Table 2 displays the confusion matrix for classifying TripAdvisor reviews
Table 3 shows the ranges used to transform polarity scores into the Likert scale
Table 4 is the confusion matrix for Benchmark reviews and manual ratings.
Table 5 displays the POS tagset
Table 6 shows the confusion matrix for the 3 features in the Benchmark dataset
Table 7 shows the confusion matrix for the 3 features in the TripAdvisor dataset
Table 8 summarises the main findings
Introduction
The explosion of Web 2.0 applications has meant that users can not only access content, but
also upload, create and share their own, and thus the term ‘user generated content’ became
popularised. The idea that any individual can create and disperse information so readily on
the web opened up the potential for an abundant amount of research to be conducted.
Businesses then recognised the potential to harness this generated content. In this era, users
are engaging with the web via social media and technologies such as mobile (the rise of the
smartphone) and internet platforms, making it easier to disseminate information at the touch
of a finger. Kietzmann, Hermkens, McCarthy and Silvestre (2011) recognised that the growth of
online content intimidated businesses and could directly impact their survival; their
research therefore proposed a framework for businesses to develop strategies to enhance themselves
and engage with social media appropriately. O’Reilly (2007) suggests that the key principle for
businesses to succeed is for them to embrace the web and collect intelligence, supporting the
concept that businesses can thrive utilising such knowledge from the web. One method which
allows businesses to gain knowledge is to apply data mining tools; Rygielski, Wang and Yen
(2002) suggest that data mining is the process of discovering patterns and
relationships hidden in the data, which leads to knowledge discovery.
Word of mouth on social media communities has the ability to grow exponentially compared
to face to face word of mouth, because anyone can access the user generated content, even if
the author of the content is a stranger. Businesses have realised that online reviews can
influence their reputation, revenue and success, due to widespread social media communities.
For example, Ye, Law, Gu and Chen (2010) suggest that online user generated content had a
significant impact on business performance within tourism. In another study, the popularity of
restaurants was positively associated with online consumer reviews, and the authors suggest that these
findings will help guide other researchers on the impact that electronic word of mouth can
have on consumer decisions (Zhang, Ye, Law & Li, 2010).
The Research
The purpose of this research is to apply sentiment analysis and opinion mining techniques to
user generated content on social media sites, such as TripAdvisor. The research conducted in
this field appears to lack a logical approach to evaluating the practicalities of user generated
content and its integration with businesses’ ‘gold standard’ protocols in the hospitality industry
sector. To address this gap in the field, a case study on Marston’s Plc will allow a practical
methodology, because the data which is to be compared with user generated content from
social media is actual information Marston’s Plc use to drive the business decision making
process. The area manager of Marston’s Plc has given the opportunity to gain insight into their
business protocols, and to draw comparisons with online social media. A note to the
reader is that the remainder of this study will use data from a single pub within the area
manager’s region. Because the pub wishes to remain anonymous, it will be referred to as
Marston’s Plc. It is therefore important to understand that the results obtained in this research
do not represent Marston’s Plc as a whole.
The Research Questions, Aims, and Objectives
Aim: The aim of this research is to investigate user generated content in social media, and to evaluate results against ‘gold standard’ business protocols using the company’s benchmarked data.
Objectives: To achieve the aim of this research, the objectives defined below will guide this study.
To collect necessary data from social media sites, such as TripAdvisor, by utilising data mining techniques.
To pre-process and analyse the data, and to compare it with data obtained from Marston’s Plc by constructing visualisations.
To contrast findings and introduce how sentiment analysis can be valuable to businesses.
To explore and highlight further research possibilities with user generated content, for beneficial business use.
Research Questions:
(1) Can businesses utilise “free” data from social media sites to improve business protocols?
(2) Are there any significant differences in incorporating user generated content to standard business methods?
(3) What current approaches are businesses taking to monitor customer satisfaction?
The Literature Review
The next chapter in this research is the literature review. The literature review will define
keywords expressed so far, and also discuss the research already conducted in this field. This
will reveal how the formulation of this study’s aims, objectives and research questions were
developed, by highlighting the current literature gap. In addition, the review will discuss
limitations which researchers have encountered whilst conducting research on user generated
content in social media.
Literature Review
Current Business Methods Implementations
This research will evaluate user generated content in social media against the business ‘gold
standard’ data, but first, the current methods adopted by businesses to collect
intelligence from the web will be explored. A concept which has been at the forefront of
businesses over the last two decades is Business Intelligence and Analytics (BI&A), with one
specific area within BI&A focussing on text analytics and opinion mining on user generated
content (Chen, Chiang & Storey, 2012). Reinschmidt and Francoise (2000) propose a generic
definition of Business Intelligence (BI) as organising data for better business decisions, efficiently and
effectively, to lead to a competitive advantage. On examination, this definition implies that BI
includes various stages, and requires a business to have the ability to transform raw data into
valuable information that gives the business the knowledge to support its decision making
processes. Several researchers define the main principles of BI as providing the right
information, to the right individual, at the right time enabling them to make informed decisions
(Rud, 2009; Yeoh & Koronios, 2009; Reinschmidt & Francoise, 2000).
Many companies are applying Business Intelligence and Analytics (BI&A) to support their business
models for strategic purposes, and the existing methods and applications are explored in
this literature review. Research has been carried out by Wang and Wang (2008) to explore the
ability to use data mining as a tool within BI, in the knowledge discovery route for businesses.
Cody, Kruelen, Krishna and Spangler (2002) discuss BI technologies such as on-line analytical
processing (OLAP) within text analytics, and their proposed framework, eClassifier, which
provides a deeper analysis of text. There are numerous data mining tools within BI; for
example, Cui, Mittal, and Datar (2006) suggest that machine learning algorithms can be
optimised for sentiment classification, a method to learn customer behaviour. It is the
customer behaviour element that this research will examine, to obtain knowledge for business
strategic decision making and compare it to ‘gold standard’ protocols. The literature has
revealed how BI&A techniques such as data mining, OLAP, and machine learning algorithms
integrate to collect insightful knowledge.
Sentiment and Opinions
To assess customer behaviour, a form of data mining will be utilised, known as sentiment
analysis and opinion mining. Sentiment analysis and opinion mining are terms which are used
interchangeably, and these terms can be defined under a single field of study which involves
computational algorithms, natural language processing, data mining and other components
(Cambria, Schuller, Xia, & Havasi, 2013; Liu, 2012; Pang & Lee, 2008). This area of research has
sparked vast interest over the last decade due to various beneficial applications. Firstly,
businesses desire consumers’ opinions and thoughts on services received and products
purchased (Feldman, 2013; Liu & Zhang, 2012). Secondly, the techniques have been applied to predicting election results, gaining
insight on thoughts and attitudes towards particular campaigns, and modelling the public’s
mood on decisions made by politicians (O’Connor, Balasubramanyan, Routledge, & Smith,
2010; Tumasjan, Sprenger, Sandner, & Welpe, 2010). Thirdly, opinion mining and sentiment
analysis can go to the extremes of affect analysis, which is useful for measuring the presence of
hate and violence amongst extremist groups and hate groups (Abbasi, Chen, & Salem, 2008;
Abbasi, 2007; Gerstenfield, Grant, & Chiang, 2003). The current literature suggests a wide
range of applications for opinion mining and sentiment analysis, making it an interesting topic
for further research.
This research will predominantly focus on the use of online reviews on social media sites.
Online reviews, since the arrival of Web 2.0, are a great resource for generating rich information
(Dellarocas, Zhang, & Awad, 2007). In a study carried out by Vermeulen and Seegers (2009) on
hotel reviews, it was shown that on average the exposure to online reviews enhanced
consumers’ consideration of hotels. Dar and Chang (2009) propose that record labels should
seriously consider user generated content in relation to music sales. This therefore indicates that online
reviews are a valuable source of user generated content from which businesses can gain
insight.
Social Media
In recent years, social media has grown exponentially in terms of the amount of user
generated content created. Kitchin (2014) points out that the shift towards Web 2.0 made it
possible for anybody to create content, and Web 2.0 sites and services allow users to actively
share preferences, values, opinions and many other personal details, specifically through social
networking sites. The volume of content available through social media has had many
applications, such as predicting box-office movie revenues (Asur, & Huberman, 2010),
performing competitor analysis (He, Zha, & Li, 2013; Dey, Haque, Khurdiya, & Shroff, 2011;
Jansen, Zhang, Sobel, & Chowdury, 2009), and the Government recognising the potential
applications of social media (Kavanaugh et al., 2012). There have also been several studies into
the field of using customer reviews within social media (Ye, Zhang, & Law, 2009; Lee, Jeong, &
Lee, 2008; Liu, Hu, & Cheng, 2005). This research will focus on mining customer reviews on
social media, and the methods practised in this research will acknowledge the existing studies.
User generated content in social media presents a variety of forms, including unstructured
data; for example, images, audio, videos, blogs and web forums (Agichtein, Castillo, Donato,
Gionis, & Mishne, 2008). McCallum (2005) suggests that natural language text, which is an
example of unstructured data, requires fine-grained processing to be transformed into
structured data for data mining applications. The methodology adopted by this research will
take into account this concept of transforming the data into a structured format.
Research Limitations
Sentiment analysis and opinion mining is a popular research topic, and so the
issues researchers face when investigating it in
social media will be explored. User generated content via social media is often complex and of
various forms, hence the wide variety of methodologies involved. For example, in a study by
Agichtein et al. (2008), their methodology involved analysing the textual content of Yahoo!
Answers using semantic features such as punctuation and typos, syntactic and semantic
complexity, and grammaticality. In research where Twitter has been utilised for collecting data,
the methodology has involved considering the different tenses and the use of verbs, adjectives and
nouns within subjective and objective text (Pak, & Paroubek, 2010). In another study involving
Twitter, the data goes through stages such as tokenisation, normalisation, and part-of-speech
tagging (Kouloumpis, Wilson, & Moore, 2011). In the aforementioned literature, it is
evident that there are many techniques which can be applied to text to process the sentiment
and opinion. This brings the first dispute amongst researchers: which method provides the
highest accuracy. Pang, Lee and Vaithyanathan (2002) discuss how their chosen techniques
struggled to precisely outline the sentiment of particular movie reviews in which the authors
contrasted their thoughts, and this presented difficulty for machine learning algorithms to
understand. In addition, a study by Turney (2002) suggests that movie reviews were difficult to
classify because the whole review is not represented by parts of the review, hence the low
accuracies obtained in the results. In other words, sentences within reviews may have
presented sarcasm or conflicting statements, and this hindered the algorithms’ ability to correctly
identify the polarity of the whole review. The theme emerging from the literature indicates that
processing natural language from social media is a multifaceted task, due to the large variance
in language used, from abbreviations and slang, to sarcasm and jokes.
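The tokenisation and normalisation stages mentioned above can be illustrated with a minimal Python sketch. It is not the pipeline used by Kouloumpis, Wilson and Moore (2011) or by this research; the slang map, the function names and the example review are all assumptions invented for illustration. Part-of-speech tagging is omitted because it normally relies on a trained tagger from an external library.

```python
import re

# Hypothetical slang/abbreviation map, invented for illustration only.
SLANG = {"gr8": "great", "thx": "thanks", "u": "you"}


def tokenise(text):
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"[A-Za-z0-9']+|[.,!?;]", text)


def normalise(tokens):
    """Lower-case every token, then expand any slang found in the toy map."""
    lowered = [token.lower() for token in tokens]
    return [SLANG.get(token, token) for token in lowered]


# Invented example review, not taken from either dataset.
review = "Gr8 food, thx! Service was slow."
tokens = normalise(tokenise(review))
print(tokens)
```

Even this small example shows why informal language is awkward to process: without the slang map, "Gr8" and "thx" would not match any entry in a standard opinion lexicon.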
An important aspect to consider amongst the literature is that automated approaches to
sentiment analysis and opinion mining do require human input, particularly in recognising
ambiguous language in social media (Prabowo, & Thelwall, 2009; Pang, Lee, & Vaithyanathan,
2002). A second problem common in sentiment analysis and opinion mining is the domain
used for building models. Aue and Gamon (2005) propose that classifiers that are trained in a
particular domain do not achieve the same results in another domain, and the researchers
suggest this is explained by the lack of labelled data across domains. In addition, Thelwall,
Buckley, and Paltoglou (2012) suggest that SentiStrength, a lexicon-based classifier approach,
although used across six social web datasets, may not work in other datasets due to unusual
language used, such as jokes and sarcasm. In a study by Asur and Huberman (2010), on the
other hand, a model was proposed which could be extended from movie revenue prediction to
other product reviews, which may be of consumer interest. The literature above has conflicting
results, and therefore a potential area of research is presented, which aims to resolve the
differences in the findings highlighted. Research into a model
which is successful in domain adaptation for sentiment analysis is still ongoing, and it appears that models
have only been proposed for a limited number of different domains (Glorot, Bordes, & Bengio,
2011; Pan, Ni, Sun, Yang, & Chen, 2010; Blitzer, Dredze, & Pereira, 2007).
Related Work
Sentiment analysis and opinion mining tasks can be approached at several levels. The first is the
document-level, where a whole document (i.e. a product review) is assumed to represent a
single entity and to express the opinion of a single opinion holder (Liu, 2012). The research tends
to focus on detecting the polarity of the whole document as positive, neutral, or negative
(Yessenalina, Yue, & Cardie, 2010; Das, & Chen, 2007). The second is sentiment analysis at
sentence-level, where the research uses algorithms to identify subjective and opinionated
sentences, along with detecting their polarity (Benamara, Chardon, Mathieu, & Popescu, 2011;
Riloff, Wiebe, & Phillips, 2005; Wiebe, & Riloff, 2005). The third, and a more in-depth approach, is
the feature-level, which provides a useful outcome in the literature and involves grouping
features of a single product and summarising the opinions about each feature (Somprasertsri,
& Lalitrojwong, 2010; Ding, Liu, & Yu, 2008; Titov, & McDonald, 2008a). For the purpose of this
research, document-level and sentence-level analysis, as suggested by Liu (2012), do not supply
sufficient detail for users to gain insight from their consumers about particular aspects of
products. This is supported by Liu and Zhang (2012), who also discuss
that a positive opinionated document or sentence does not directly imply that the author (opinion
holder) finds all the features positive.
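The difference between the three levels can be sketched with a toy lexicon-based scorer in Python. This is a simplified illustration only and not the method of any study cited above: the lexicon, the feature list, the naive sentence splitter and the example review are all assumptions made for the sketch.

```python
import re

# Toy opinion lexicon, invented for illustration; real lexicons are far larger.
LEXICON = {"great": 1, "friendly": 1, "lovely": 1, "cold": -1, "slow": -1}
# Hypothetical, hand-picked feature list used only for this sketch.
FEATURES = ["food", "staff", "service"]


def polarity(text):
    """Sum lexicon scores over the words in text (above 0 positive, below 0 negative)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def split_sentences(text):
    """Naive sentence splitter on full stops, exclamation and question marks."""
    return [part.strip() for part in re.split(r"[.!?]+", text) if part.strip()]


def feature_polarity(text):
    """Attribute each sentence's polarity to every listed feature it mentions."""
    scores = {}
    for sentence in split_sentences(text):
        words = sentence.lower().split()
        for feature in FEATURES:
            if feature in words:
                scores[feature] = scores.get(feature, 0) + polarity(sentence)
    return scores


# Invented example review, not taken from either dataset.
review = "The staff were friendly and the pub was lovely. The food was cold."
document_score = polarity(review)                                # document-level
sentence_scores = [polarity(s) for s in split_sentences(review)]  # sentence-level
feature_scores = feature_polarity(review)                        # feature-level
print(document_score, sentence_scores, feature_scores)
```

In this example the document-level score is positive overall, yet the feature-level summary shows a negative opinion about the food, which is the kind of detail that document-level and sentence-level analysis can hide.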
Studies that have explored feature-level summarisation techniques are briefly outlined here,
to allow comparisons to be made in a later section of the research. For example, a study by
Somprasertsri and Lalitrojwong (2010) suggests that utilising a dependency and semantic based
approach allows an effective and flexible model in summarising product features and opinions
(F-score of 75.45%), regardless of distance between feature and opinion, whilst Blair-
Goldensohn et al. (2008) propose a structured architecture for summarising sentiment. In
addition, Hu and Liu (2004) present a feature and opinion summarisation, with a precision of
64.2%. In another study by Popescu and Etzioni (2005), a system called OPINE is introduced,
which has the ability to extract product features and evaluate (according to reviewers) quality
with high precision. In addition, a study by Liu, Hu and Cheng (2005) utilised methods for
extracting relevant features and creating visual summaries that performed with high precision;
88.9% in positive reviews and 79.1% in negative reviews.
The literature discussed in this section has allowed a thorough understanding of the
beneficial applications of sentiment analysis and opinion mining in real world scenarios.
Rygielski, Wang and Yen (2002) highlight that applying data mining tools and techniques does not
mean the patterns and knowledge obtained from data are trustworthy; they may require further
verification. It is important that researchers acknowledge this for their own studies, and it will
be considered within this research. Moreover, a number of complications that researchers
have encountered in this field have been presented, and this has provided an insight for this
research to think carefully about. In addition, the literature has introduced various approaches
to sentiment analysis and opinion mining techniques. The literature has also revealed how
similar studies have proposed methods to improve precision and accuracy; however, there
appears to be very little research applying such models to current business data and evaluating
the relative usefulness of the models. Acknowledging the results of similar studies to this research,
alongside the culmination of this reading has contributed to the formulation of not only the
research question, but the aims and objectives this research intends to answer.
Methodology
Research Design and Method
The overall aim is to investigate user generated content on social media sites and evaluate
against business benchmarked data. To achieve this aim, the methodology adopted in this
research will utilise an inductive approach. Bryman (2012) defines the inductive approach as
forming a theory (or theories) from conducting data analysis, and the theory is the outcome of
the research. In this project, the data is collected and analysed for trends and patterns, and
from the results, a theory is proposed, hence an inductive approach. Walter (2010) advises that
the theoretical direction the project takes is essential to the overall design, aims, objectives,
and the research questions constructed.
Research design can take one of several approaches: a qualitative, a quantitative, or a mixed
methods design. First, qualitative research can be summarised as supplying an explanation or
understanding of social phenomena and their context through information rich data (Ritchie &
Lewis, 2003). Corbin and Strauss (2014) suggest that qualitative research is an interpretive,
dynamic, and free-flowing process; if researchers forget the basics of their study, then the
research can become superficial and fail to provide novel insights. The literature seems to
suggest that validity is difficult to establish in qualitative research (Qazi, 2011; Sandelowski, &
Barroso, 2008; Whittemore, Chase, & Mandle, 2001).
Second, quantitative research is defined as performing statistical tests on numeric data to
verify predetermined theories (Punch, 2013). A general theme emerges from the literature
that quantitative design lacks explanation of causal relationships due to the structured and
controlled procedures involved (Gelo, Braakmann, & Benetka, 2008; Maxwell, 2004).
Third, mixed methods research involves a combination of the two designs described
above. A definition is provided by Johnson, Onwuegbuzie and Turner (2007), who propose that
mixed methods research is an intellectual and practical combination of qualitative and
quantitative research, and is likely to produce insightful findings and outcomes. The design of
this project requires a mixed methods strategy, as both quantitative and qualitative features
are present during the research.
Malterud (2001) proposes that when qualitative studies are added to quantitative ones, then
a better understanding of the results and implications can be gained. In this project, the data
collected is partly qualitative, as the reviews contain in-depth thoughts and opinions on
experiences at Marston’s Plc (i.e. qualitative responses). However, the data also includes
quantitative features, such as ratings out of five, which are numerical. The analysis involves
identifying patterns within the data (qualitative), and then moves on to quantitative data
analysis, as statistical tests are performed to answer the research questions at hand and to
assess the significance of the patterns found through qualitative analysis. Creswell (2003)
suggests that the biases inherent in individual research designs could be neutralised (or
cancelled) by integrating both strategies.
The research method utilised in this project is a case study of Marston’s Plc, comparing
Marston’s benchmark data to user generated content on social media. Burns (2000) defines a
case study as a bounded system, an entity in itself, and involves the collection of extensive data
to produce an understanding of the entity being studied. Hodkinson and Hodkinson (2001)
propose six strengths of utilising a case study approach, including, the ability to understand
complex inter-relationships. The approach tends to be grounded in lived reality, and case
studies can illustrate the processes involved in causal relationships. Yin (2014), on the other
hand, indicates that researchers employing a case study are subject to bias and may sway
towards supportive evidence and discard contrary evidence. In addition, researchers seem to
consider that case studies lack generalisability to other areas (Gomm,
Hammersley, & Foster, 2000; Johnson, 1994; Eisenhardt, 1989). Wellington and Szczerbinski
(2007) make an interesting observation in that the ability to relate to the case study and learn
from it, is perhaps more important than generalising from it.
Yin (2011) proposes that case studies derive an in-depth understanding of the case/cases in a
real world context, and this can lead to learning new behaviours in the real world, and their
meanings. To answer the research questions formulated in this study, a case study will allow
the transfer of results from a single unit to be generalised across larger units (VanWynsberghe
& Khan, 2008; Flyvbjerg, 2006; Gerring, 2004). This research has considered the strengths and
limitations of utilising the case study method, however despite the limitations, the nature of
this topic seems best suited to this particular method.
Analytical Techniques
To extract the relevant features in order to produce the feature summaries, the methodology
will be similar to that of Hu and Liu (2004), and an overview of the stages involved is illustrated
below.
Figure 1 shows an overview of creating feature summaries.
The outcomes of the tasks presented in figure one will be explained in-depth in the results
section. To evaluate the effectiveness of these visual summarisations against the benchmarked
data of Marston’s Plc, the same strategies will be applied to their Benchmark data, allowing a
clear comparison of results. This will be similar to visualisation summaries and non-visualisation
summaries from pre-existing research in the field (Lu, Zhai, & Sundaresan, 2009; Zhuang, Jing,
& Zhu, 2006; Liu, Hu, & Cheng, 2005; Hu, & Liu, 2004).
To evaluate the results between the two datasets, the research will compare the confusion
matrices created, which will allow precision, accuracy and F-Score to be calculated. The
results can then be evaluated against the results from similar studies mentioned in the
literature, in a later chapter of the study. Visa, Ramsay, Ralescu and Knaap (2011) define an
n × n confusion matrix as a form of error analysis, in which the numbers of predicted and actual
classifications are illustrated using n classes. This study will use the ratings of reviews as the
classes, where the actual ratings are assigned manually, and the sentiment score calculated
will be the predicted values.
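The evaluation measures named above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the script used in this study (which was written in R and is presented in Appendix B). The function names, the example ratings, and the choice to average precision and F-Score over the classes (scoring a class with no predictions as zero) are all assumptions made for the example.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Return matrix[a][p]: how many items of actual class a were predicted as p."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in classes} for a in classes}


def evaluate(actual, predicted, classes):
    """Accuracy, plus precision and F-score averaged over the classes."""
    matrix = confusion_matrix(actual, predicted, classes)
    correct = sum(matrix[c][c] for c in classes)  # diagonal of the matrix
    precisions, f_scores = [], []
    for c in classes:
        true_pos = matrix[c][c]
        predicted_as_c = sum(matrix[a][c] for a in classes)  # column total
        actually_c = sum(matrix[c][p] for p in classes)      # row total
        # Assumption: a class with no predictions (or no members) scores 0.
        precision = true_pos / predicted_as_c if predicted_as_c else 0.0
        recall = true_pos / actually_c if actually_c else 0.0
        denom = precision + recall
        precisions.append(precision)
        f_scores.append(2 * precision * recall / denom if denom else 0.0)
    return {
        "accuracy": correct / len(actual),
        "precision": sum(precisions) / len(classes),
        "f_score": sum(f_scores) / len(classes),
    }


# Made-up ratings for illustration only; these are not the study's data.
actual = [5, 4, 4, 3, 2, 5, 1, 3]
predicted = [4, 4, 3, 3, 3, 5, 2, 3]
result = evaluate(actual, predicted, classes=[1, 2, 3, 4, 5])
print(result)
```

In this sketch the rows of the matrix hold the actual (manual) ratings and the columns hold the predicted ratings, so accuracy is the diagonal total divided by the number of items classified.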
Company Background
Marston’s Plc (2015) is a UK brewery and pub chain, with 5 independent breweries, and
around 1700 pubs and bars located across the United Kingdom. The pubs vary in style, such as,
the traditional carvery, rotisserie restaurant, tenanted, leased partners, and two-for-one pubs.
Marston’s Plc is listed as a FTSE 250 company; classified in the restaurants and pubs sub-sector,
within the leisure entertainment and hotels sector. The data currently utilised by Marston’s Plc
is through a cloud-based customer optimisation platform, as part of a paid service. For ethical
reasons, and for simplicity, this platform will be named Benchmark, ensuring its true identity
remains anonymous. The platform allows Marston’s to engage with customers to improve
business results.
TripAdvisor
TripAdvisor (2015a) is one of the largest travel sites, allowing users to plan and book their
perfect trip; covering opinions and reviews on accommodation, restaurants, hotels, flights, and
attractions, worldwide. The site offers external links to booking tools which aid users in
finding the best deals. In addition, according to TripAdvisor (2015a) log files, a total of 375
million users visit this site on a monthly basis, and 250 million reviews are posted online, making
it a popular and current social media site, where opinions are exchanged publicly.
Ethics
The research, as detailed in an earlier section, is utilising reviews from social media sites and
data retrieved from Marston’s Plc (the case study). After submitting the ethical application to
the University, this study was evaluated as “low risk”, due to the topic of interest not breaching
politically or culturally sensitive areas, and not including the involvement of Marston’s Plc as a
participant.
To adhere to the University’s ethical guidelines, a consent form with an attached information
sheet was provided to Marston’s Plc. The information sheet contained details of what the
research entailed and how Marston’s data would be handled, including the storage of sensitive
data. The ethical forms are attached to Appendix A, where further details are provided.
Marston’s Plc was made aware that withdrawal from the research was an available option
at any time during the research, and if requested, all data obtained from Marston’s Plc would
be destroyed instantly. In addition, all employee details are anonymised in the data to protect
confidentiality. Data could only be collected once the University approved the ethical
application.
The terms of use for collecting data from TripAdvisor were considered. TripAdvisor
(2015b) specifies that certain activities, such as accessing, copying or monitoring
content using robots, scrapers, spiders or any other automated means, are prohibited without
written permission. However, following a discussion with the supervisor about these policies,
there is believed to be no infringement of ethical guidelines, as TripAdvisor
content is used solely for academic research.
Data Collection
After ethical approval, data could be gathered to process sentiment values, run opinion
mining tools, and make comparisons. Data was retrieved from TripAdvisor using
scraping methods within RStudio, version 3.2.1. The reviews span six months, from
January to June, and include a total of 55 reviews. The full programming script is presented in
Appendix B.
The top five reviews are presented in the table below to illustrate the type of data this project
is dealing with.
Table 1 illustrates the first five reviews scraped from TripAdvisor
The data collected is now ready to be pre-processed and analysed, using the methods
highlighted in figure 1 to gain insight on Marston’s Plc, against their benchmarked data. The
output, once the data has been cleansed and noisy data removed, is achieved by the steps
clarified in figure one. Each step is vital to creating accurate results, and the transformation
includes the removal of stopwords, converting all text to lowercase, and POS tagging words
(amongst other processes). The stages mentioned previously are precursors to achieving the
aim and finding solutions to the research questions. Further details regarding transformation
of the data collected are explained in the results section.
Results
The results of the sentiment scores obtained for each review dataset, and correlation
coefficients between polarity scores and manual ratings, are presented in this section. In
addition, the features extracted from reviews and their associated polarity scores are also
obtained, and feature summarisations are explored for each dataset. The structure of this
chapter begins by comparing both datasets, extracting key features for comparisons. Secondly,
the chapter goes on to present sentiment for both review datasets and the concept of
manual ratings versus automatic ratings. Thirdly, the features are extracted and summarised
between datasets. At the end of this section, the main findings are tabulated and analysed.
Dataset Comparison
The datasets in this research were explored to analyse any differences displayed within the
TripAdvisor reviews and Benchmark reviews, which could affect the outcomes presented later
in the chapter. A large proportion of the reviews are from the Benchmark dataset (N=161), and
a small proportion from the TripAdvisor dataset (N=55). The difference in the number of reviews is
not treated as a concern here; the average number of words per review, illustrated below, reveals
some interesting observations.
Figure 2 displays the average word count for reviews in each dataset
Despite the Benchmark dataset containing more reviews, the average word count for a
Benchmark review was 16 words, whereas, the TripAdvisor reviews on average contain 95
words. This suggests that users of TripAdvisor tend to write longer reviews than the users who
write reviews within the Benchmark dataset, and perhaps, the TripAdvisor reviews contain
more information which could be valuable to Marston’s Plc. Further details were explored
within each dataset, such as the range of vocabulary used in the reviews. This was achieved by
eliminating all punctuation, and converting the words to lower case. A list of all the words
utilised was created, removing all duplicates, and the results are shown below.
Figure 3 illustrates the vocabulary range within each dataset
Two vocabulary ranges were computed; the first included stopwords, and the second
removed all stopwords. In both cases, the TripAdvisor reviews included a wider range of
vocabulary compared to the Benchmark reviews. The vocabulary range of TripAdvisor reviews
(without stopwords) was calculated at n=1113, and the vocabulary range of the Benchmark
data at n=613, which yields a 35.22% difference, further supporting the idea that TripAdvisor
reviews seem to include more insightful information than the Benchmark data.
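The word-count and vocabulary-range calculations described above can be sketched as follows. This is an illustrative Python sketch rather than the R script used in the study; the function names, the two example reviews, and the deliberately tiny stopword list are assumptions made for the example (a real analysis would use a full stopword list).

```python
import string

# Hypothetical, deliberately small stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "was", "is", "it", "to", "of", "we"}


def vocabulary(reviews, remove_stopwords=False):
    """Return the set of distinct lower-cased words across all reviews."""
    strip_punct = str.maketrans("", "", string.punctuation)
    words = set()
    for review in reviews:
        # Eliminate punctuation, convert to lower case, split on whitespace.
        words.update(review.translate(strip_punct).lower().split())
    if remove_stopwords:
        words -= STOPWORDS
    return words


def average_word_count(reviews):
    """Mean number of whitespace-separated words per review."""
    return sum(len(review.split()) for review in reviews) / len(reviews)


# Made-up reviews, not taken from either dataset.
reviews = ["The food was great!", "Great staff, and the carvery was good."]
print(len(vocabulary(reviews)))                         # with stopwords
print(len(vocabulary(reviews, remove_stopwords=True)))  # without stopwords
print(average_word_count(reviews))
```

Using a set removes duplicates automatically, so the vocabulary range is simply the size of the set.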
Manual Vs Automatic
This section of the chapter investigates the ratings given to the reviews against polarity scores
(i.e. sentiment). The polarity algorithm utilised in RStudio follows techniques similar to those
applied in the study by Hu and Liu (2004). RStudio explicitly states that the same sentiment dictionary is
utilised to tag words as in the study of Hu and Liu (2004). A scatterplot was constructed
to compare the polarity scores assigned by the algorithm to individual reviews from
TripAdvisor with the ratings given to the reviews by the review authors themselves.
Figure 4 illustrates polarity scores and TripAdvisor ratings
The scatterplot in figure 4 indicates that as TripAdvisor ratings increase, the polarity scores
for reviews also increase. The Pearson correlation coefficient was 0.7109, suggesting a strong
positive correlation between the TripAdvisor ratings and polarity scores. The polarity scores
given to the 55 TripAdvisor reviews can be explored further by utilising a confusion matrix, which
provides an indicator of the algorithm’s performance. First, the polarity scores were transformed
into the Likert scale adopted by TripAdvisor, in order to investigate the accuracy and precision
of the sentiment classifier applied to the reviews. The table below indicates the ranges applied
to the polarity scores, to calculate
the Likert scale numbers.
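A range-based transformation of this kind can be sketched in Python as follows. The cut-points used here are hypothetical, equal-width values chosen only for illustration; they are not the ranges applied in this study, and the function name is likewise an assumption.

```python
from bisect import bisect_right

# Hypothetical cut-points (NOT the study's ranges):
#   [-1.0, -0.6) -> 1, [-0.6, -0.2) -> 2, [-0.2, 0.2) -> 3,
#   [ 0.2,  0.6) -> 4, [ 0.6,  1.0] -> 5
CUT_POINTS = [-0.6, -0.2, 0.2, 0.6]


def polarity_to_rating(score, cut_points=CUT_POINTS):
    """Map a polarity score in [-1, 1] onto a 1-5 rating using the cut-points."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("polarity score must lie between -1 and 1")
    # bisect_right counts how many cut-points are <= score.
    return bisect_right(cut_points, score) + 1


for score in (-0.8, 0.0, 0.35, 0.9):
    print(score, polarity_to_rating(score))
```

Keeping the cut-points in a single list makes it straightforward to apply slightly different ranges to a different dataset by passing another list.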
Table 2 shows the ranges for transforming polarity scores into TripAdvisor ratings
This step allows comparisons to be made, because the scores and ratings then share the same format. Secondly, a
confusion matrix was constructed to analyse if reviews were correctly rated using the algorithm
in RStudio. The results are presented in the table below.
Table 3 displays the confusion matrix for classifying TripAdvisor reviews
The boxes highlighted in yellow indicate the number of true positives; the reviews which were
correctly rated according to the actual TripAdvisor review ratings. Out of the 55 reviews, 18
were correctly classified, which gives an overall accuracy of 32.73% and over the 5 classes, an
average precision of 34.17%, indicating a poor rating system.
This technique was also applied to the Benchmark dataset reviews but, because these reviews do not
include ratings, the ratings were assigned manually. The scatterplot of these
results is displayed below.
Figure 5 displays polarity scores and manual ratings for the Benchmark reviews
The scatterplot in figure 5 suggests that as the manual ratings increase, the polarity score
increases. The correlation coefficient between manual ratings and polarity scores was
calculated at 0.6453. This indicates that there is a positive correlation between manual ratings
and the Benchmark sentiment scores, however, this coefficient is not as strong when compared
to the coefficient for the TripAdvisor reviews. The polarity scores obtained from the Benchmark
reviews were also transformed into the Likert scale using slightly different ranges.
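For reference, the Pearson correlation coefficient reported for each dataset can be computed as in the following minimal Python sketch. The function name and the example values are assumptions made for illustration; they are not the study's data, and the study's own calculation was carried out in R.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(var_x * var_y)


# Made-up ratings and polarity scores, for illustration only.
ratings = [1, 2, 3, 4, 5]
polarity = [-0.5, -0.1, 0.0, 0.3, 0.6]
print(pearson(ratings, polarity))
```

A coefficient close to 1 indicates that polarity scores rise almost linearly with ratings, whereas a value near zero would indicate no linear relationship.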
Table 4 shows the ranges used to transform polarity scores into the Likert scale
A confusion matrix was constructed to evaluate the performance of the algorithm on
classifying ratings for Benchmark reviews, against the manual ratings. The matrix is illustrated
below.
Table 5 is the confusion matrix for Benchmark reviews and manual ratings.
The overall review count reduced to 142 (from N=161). This is because a number of the
reviews only expressed a particular staff name with no other comments, thus for the research
to comply with ethical guidelines, these reviews were discarded. The overall accuracy was
computed at 44.37%, meaning that a total of 63 reviews were correctly classified according to
the manual ratings. Over the 9 classes, the average precision was calculated to be 35.44%, again
indicating a poor classifier.
Feature Extraction
One feature of the Benchmark platform is that it presents the user with a summary of the most
popular words, which Marston’s Plc treats as common features. A summary was produced over the same
six-month range as the TripAdvisor reviews, and the results are displayed below.
Figure 6 illustrates the Benchmark data and most frequent words
The yellow stars in figure 6 are present to mask the names of members of staff that were
mentioned in the reviews submitted by customers. This ensures that staff identities are
protected, and this research complies with ethical guidelines. The biggest words appear to be
“friendly”, “food”, “good”, “carvery” and “helpful”. The figure also illustrates the less frequent
words, indicated by size, and this figure is how the summary appears on the interface to users.
Including the two words removed for ethical reasons, the summary displays the top 25 most
popular words. Figure 7 will demonstrate the most frequent words obtained from the
TripAdvisor reviews.
Figure 7 shows a word cloud from TripAdvisor reviews
The word cloud in figure 7 displays the top 25 most frequent words in TripAdvisor, and it is
reassuring that there are some similarities amongst the words, such as, “food”, “good”, and
“carvery”. To ensure the techniques applied within RStudio were effective, the same
programming was applied to the reviews gained through the Benchmark data, and the results
are displayed below.
Figure 8 illustrates Marston’s Benchmark data
Comparing figures 6 and 8, it appears that the majority of words match, and from the 23
words visible in figure 6, 69.57% of words match. The words illustrated so far appear to be the
most frequent words (with the removal of stopwords), but the data requires further
transformation to extract useful features.
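The frequent-word summaries and the match percentage can be sketched in Python as follows. The function names, the small stopword list, and the example reviews are assumptions for illustration and do not reproduce the study's R script. A match of 69.57% is consistent with 16 of the 23 visible words matching (16/23 ≈ 69.57%), although the underlying count is not stated in the text.

```python
import string
from collections import Counter

# Hypothetical, deliberately small stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "was", "is", "it", "of", "to"}


def top_words(reviews, k=25):
    """Return the k most frequent non-stopword words across the reviews."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter(
        word
        for review in reviews
        for word in review.translate(strip_punct).lower().split()
        if word not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(k)]


def overlap_percentage(reference, candidate):
    """Percentage of the reference words that also appear in the candidate list."""
    reference = set(reference)
    return 100 * len(reference & set(candidate)) / len(reference)


# Made-up reviews, not taken from either dataset.
sample_reviews = ["The food was good", "Good food, friendly staff", "Food"]
print(top_words(sample_reviews, k=2))
print(overlap_percentage(["food", "good", "carvery", "friendly"],
                         ["food", "good", "carvery", "staff"]))
```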
To pre-process the data further to extract relevant features, numerous stages were involved.
The reviews were compiled together as one piece of text for each dataset. Punctuation, such as
“?”, “!”, “.” and “,” indicated the end of a sentence, thus the text was split accordingly. All
characters were transformed to lower case, so that variants such as “Great” and “great” were
treated as the same word. Speech marks around words were also discarded. The text was then
labelled using POS tagging and the tagset is shown below.
POS Taggers and Definitions
Tag   Definition                                Tag   Definition
CC    Coordinating conjunction                  PRP$  Possessive pronoun
CD    Cardinal number                           RB    Adverb
DT    Determiner                                RBR   Adverb, comparative
EX    Existential there                         RBS   Adverb, superlative
FW    Foreign word                              RP    Particle
IN    Preposition or subordinating conjunction  SYM   Symbol
JJ    Adjective                                 TO    to
JJR   Adjective, comparative                    UH    Interjection
JJS   Adjective, superlative                    VB    Verb, base form
LS    List item marker                          VBD   Verb, past tense
MD    Modal                                     VBG   Verb, gerund or present participle
NN    Noun, singular or mass                    VBN   Verb, past participle
NNS   Noun, plural                              VBP   Verb, non-3rd person singular present
NNP   Proper noun, singular                     VBZ   Verb, 3rd person singular present
NNPS  Proper noun, plural                       WDT   Wh-determiner
PDT   Predeterminer                             WP    Wh-pronoun
POS   Possessive ending                         WP$   Possessive wh-pronoun
PRP   Personal pronoun                          WRB   Wh-adverb
Table 6 displays the POS tagset
The POS tagset utilised in this research allows the detection of nouns which appear most
frequently in the reviews. The most frequently occurring nouns become the features for each
dataset.
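The selection of frequent nouns as features can be sketched as follows, assuming the text has already been POS tagged into (word, tag) pairs. This Python sketch is illustrative only: the function name and example tokens are hypothetical, and the decision to count only common nouns (NN, NNS) and to leave out proper nouns is an assumption rather than a detail confirmed by the study.

```python
from collections import Counter

# Assumption: only common nouns are counted; proper nouns (NNP, NNPS) are excluded.
NOUN_TAGS = {"NN", "NNS"}


def top_noun_features(tagged_tokens, k=10):
    """Return the k most frequent nouns as (word, count) pairs."""
    counts = Counter(
        word.lower() for word, tag in tagged_tokens if tag in NOUN_TAGS
    )
    return counts.most_common(k)


# Hand-tagged tokens for illustration; a POS tagger would normally produce these.
tagged = [
    ("the", "DT"), ("food", "NN"), ("was", "VBD"), ("great", "JJ"),
    ("friendly", "JJ"), ("staff", "NN"), ("and", "CC"),
    ("good", "JJ"), ("food", "NN"),
]
print(top_noun_features(tagged, k=2))
```

Note that this sketch does not merge singular and plural forms, so "meal" and "meals" would be counted as separate features unless they were normalised beforehand.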
The words featured in the figures so far indicate popular topics mentioned in the reviews.
However, in order to bring value to businesses it would be insightful to see if a customer’s
opinions and thoughts on the individual features were positive or negative. The top 10 features
extracted from the TripAdvisor dataset, and the average polarity score of all sentences in which
the feature is contained, are displayed below.
Figure 9 displays the average polarity scores for top features in sentences in TripAdvisor
Polarity scores are between -1 and 1, where a score of -1 indicates complete negative
opinions and thoughts, zero polarity suggests neutral opinions and thoughts, and a score of 1
implies complete positivity. All of the top 10 features scored above zero on average,
indicating that, overall, the opinions expressed on these features are not negative.
It appears, however, that some features are very close to a score of zero, and
businesses do not aim to obtain neutral polarity about their services and products. It is
therefore assumed that they would want to achieve outstanding sentiment. The two lowest
scoring features appear to be the carvery and meal, on average. According to the average
polarity score and TripAdvisor reviews, it appears that customers have positive opinions and
thoughts on the staff, pub and service, whilst displaying neutral opinions and thoughts on the
carvery, menu, and meals.
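Averaging sentence polarity per feature can be sketched in Python as follows. This is not the polarity algorithm used in the study: the miniature opinion lexicon, the scoring rule (positive minus negative opinion words, divided by the number of opinion words, which keeps scores between -1 and 1), and all names are assumptions made for illustration. Sentences are split on "?", "!", "." and ",", as described in the pre-processing steps.

```python
import re

# Hypothetical miniature opinion lexicon, for illustration only.
POSITIVE = {"good", "great", "friendly", "excellent", "nice", "helpful"}
NEGATIVE = {"poor", "cold", "disappointing", "slow", "rude"}


def split_sentences(text):
    """Lower-case the text and split it on ? ! . and , into sentences."""
    parts = re.split(r"[?!.,]", text.lower())
    return [part.strip() for part in parts if part.strip()]


def sentence_polarity(sentence):
    """Score in [-1, 1]; 0.0 when the sentence has no opinion words."""
    words = re.findall(r"[a-z']+", sentence)
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0


def feature_polarity(text, features):
    """Average polarity of all sentences that contain each feature word."""
    sentences = split_sentences(text)
    averages = {}
    for feature in features:
        scores = [sentence_polarity(s) for s in sentences if feature in s.split()]
        if scores:
            averages[feature] = sum(scores) / len(scores)
    return averages


# Made-up review text, not taken from either dataset.
text = ("The staff were friendly and helpful. The carvery was cold! "
        "Great carvery, poor service.")
print(feature_polarity(text, ["staff", "carvery", "service"]))
```

In this example the carvery feature averages to zero because one positive and one negative sentence cancel out, which illustrates how a neutral average can conceal mixed opinions.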
The same procedures performed on the TripAdvisor reviews were then executed on the
Marston’s benchmark data for comparison. The top 10 features and the average polarity are
illustrated below.
Figure 10 illustrates the Benchmark data and top features polarity
Figure 10 suggests that on average the two lowest scoring features were table and menu. This
is somewhat different to the two lowest features of the TripAdvisor dataset. The service
feature on average scored the highest, but it appears that the average polarity scores in figure
10 range from 0.15 to 0.25 (with the exception of the service feature). The Benchmark
reviews and average polarity scores suggest that customers have positive opinions and
thoughts on the service they receive at Marston’s Plc. This feature has appeared positively for
both datasets and implies that Marston’s are providing a service to customers who are happy
to express this.
The identical features common to both datasets and their average polarity are plotted below,
so comparisons can be analysed.
Figure 11 shows both dataset features and average polarity
It can be suggested that over the 6 months of reviews mined, TripAdvisor scored low
averages for the meal, carvery and menu features. A recommendation for a manager of
Marston’s Plc is that the 3 features identified as receiving neutral sentiment would require
attention, and amendments would need to be made to improve customer opinions on these
features, as observed from figure 11. On another note, the figure implies that the Benchmark
dataset reveals a good polarity score for the service feature, and therefore it is recommended
to Marston’s Plc that this is maintained.
The biggest differences in average polarity scores seem to be for the carvery, meal, and service
features. The sentences which contain these 3 features were explored deeper between the
datasets, and sentiment scores compared to manual ratings. The number of sentences for each
feature was low, so the analysis was conducted by combining all sentences with the stated
features for each dataset. Firstly, the Benchmark dataset and the 3 features defined were
explored, and the confusion matrix is illustrated below.
Table 7 shows the confusion matrix for the 3 features in the Benchmark dataset
The rating classes in table 7 do not include 1 and 1.5 because no sentences were
manually rated at those scores; it was therefore not necessary to include empty classes. The
overall accuracy for sentiment scores was exactly 30%, and the average precision over the 7
classes was 21.32%, suggesting an insignificant rating system. The opinions and thoughts
surrounding the features (carvery, meal, and service) in the Benchmark data can be visualised
in the following word cloud, displaying the most frequent words.
Figure 12 shows positive and negative opinions expressed on the 3 features
The negative opinions in figure 12 highlight that the key opinion across the 3 features
is disappointment, whereas the most frequent positive words are great and good. The same
method was applied to the 3 features in the TripAdvisor dataset. The confusion matrix was
constructed, again, by combining all the sentences containing the 3 features.
Table 8 shows the confusion matrix for the 3 features in the TripAdvisor dataset
The class rating of 5 was not included because no manual ratings of the sentences scored this
highly. The overall accuracy of the sentiment was calculated at 45.45%, whilst the average
precision across the 8 classes was 23.62%, which suggests an insignificant rating system. The
positive and negative opinions expressed in the TripAdvisor sentences were examined, and the
results are shown below:
Figure 13 shows positive and negative opinions on the 3 features for TripAdvisor
The most frequent opinions on the carvery, meal and service features appear to be “poor”,
“cold”, “penalise”, “nice”, and “excellent”. The positive and negative opinions expressed on
the 3 features in the TripAdvisor sentences support the idea that the range of vocabulary is
greater than the vocabulary range in the Benchmark sentences. The range of sentiment can be
summarised for all features in the following figure:
Figure 14 illustrates the feature polarity summary for all features
The feature summarisation of features common to both datasets and the range of polarity
scores allow comparisons to be analysed. The yellow bars indicate the Benchmark polarity,
whilst the purple bars display the TripAdvisor polarity. The overlaps are illustrated by the dark
yellow parts, and figure 14 suggests that the features “table” and “staff” have very similar
agreement on sentiment between the datasets. In addition, the feature service appears to only
have positive sentiment in the Benchmark dataset; however, the TripAdvisor dataset has a
larger range of sentiment scores. The TripAdvisor dataset in general appears to show a wider
sentiment range for the features extracted, indicating that these reviews may hold more
valuable information than the Benchmark data, despite having significantly fewer reviews
overall.
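The comparison of sentiment ranges between the datasets can be sketched in Python as follows. The function names and the example scores are assumptions made for illustration; they do not reproduce the values plotted in figure 14.

```python
def polarity_range(scores):
    """Return the (lowest, highest) polarity score observed for a feature."""
    return (min(scores), max(scores))


def range_overlap(first, second):
    """Return the interval shared by two ranges, or None if they do not overlap."""
    low = max(first[0], second[0])
    high = min(first[1], second[1])
    return (low, high) if low <= high else None


# Made-up sentence polarity scores for one feature in each dataset.
benchmark_scores = [0.1, 0.3, 0.5]
tripadvisor_scores = [-0.4, 0.0, 0.3]

benchmark_range = polarity_range(benchmark_scores)
tripadvisor_range = polarity_range(tripadvisor_scores)
print(benchmark_range, tripadvisor_range)
print(range_overlap(benchmark_range, tripadvisor_range))
```

A wide overlap between the two ranges suggests that the datasets agree on the sentiment for a feature, whereas a range that extends further in one dataset indicates more varied opinions in that dataset.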
Summary
The findings suggest that TripAdvisor reviews tend to contain more words per review, and a
significantly larger vocabulary range than that of the Benchmark reviews. This suggests that the
TripAdvisor reviews may be more valuable and insightful to Marston’s Plc for analysing
customer thoughts and opinions on their experiences at Marston’s Plc. The scatterplots in this
chapter indicate that TripAdvisor review ratings and the sentiment scores were positively
correlated, as were the manual ratings of the Benchmark reviews and their sentiment scores. The features
extracted, and opinions associated with the sentences containing the features, further
supported the idea that TripAdvisor reviews hold greater vocabulary, and the sentiment scores
varied more. The main findings, computing the performance of the sentiment techniques
applied to the reviews, using the measures of precision, accuracy and F-Score, were tabulated
and the results shown below.
Table 9 summarises the main findings
The table highlights that the sentiment classifier achieved the best accuracy when performed on the
TripAdvisor feature sentences, along with the highest average F-Score out of the four cases. An
interesting observation from this research is that the sentiment scores for reviews and
sentences in both datasets scored 0%, which explains the significantly lower performance
measures. The table suggests that the sentiment classifier can handle reviews and sentences
which express neutral opinions. The F-Score computed for both datasets is relatively low,
indicating that the sentiment techniques may not be adequate for these datasets; however, in
the next chapter, the limitations of the study, which directly affect the results
obtained in this section, are discussed.
Discussion
This chapter will analyse the results outlined, and discuss the research questions and whether
the research satisfies what the study intended to achieve. The section will also highlight
comparisons of the findings to similar research conducted in the literature.
The Research Aim
The aim of this study was to investigate user generated content in social media and compare
this to ‘gold standard’ data for driving business protocols. The steps taken to achieve this were
to explore Marston’s Plc gold standard data, and the interactive platform which provides
managers (and those in more senior positions) with a visual overview of the reviews collected.
The data gathered from social media was then aggregated and analysed for patterns and trends
similar to those of the Benchmark data, allowing an evaluation of the two separate datasets. From
the submitted responses in the ‘gold standard’ data, the interface presents the users with the most
frequently mentioned terms. The interface also provides Marston’s Plc with targets for
different sections of the company, such as customer feedback response rate, food, service, and
revenue, and specifies deadlines by which the targets are to be accomplished. It appears that the
platform Marston’s Plc utilises lacks the ability to assist the business in gaining valuable insight
from consumer experiences and the opinions customers hold. In addition, there appears to be little
benefit for Marston’s Plc in generating the required response rate for the reviews, especially
when customers are freely expressing opinions and thoughts on social media sites.
The findings imply that the ‘gold standard’ data and the user generated content from the
social media extract have very similar features, complementing one another in the topics which
arise from customer experiences. In addition, for the 3 features that appeared to have the
greatest range difference in sentiment scores (carvery, meal and service), the word clouds
illustrate very similar positive words, which suggests that customers from both datasets
agree on the opinions expressed. This is further supported by figure 14, which shows that for the
3 features, the range of sentiment scores is focused on the positive sentiment. An interesting
observation arises when the word clouds displaying negative opinions are considered: in
figure 14, the TripAdvisor reviews tend to express a significant amount of negative
sentiment for the three features compared to the Benchmark data. The confusion matrices
illustrated throughout the study all had one similarity: the sentiment classifier
appeared to cope with classifying reviews and sentences at the class ratings of 3, 3.5, and 4.
The error appears to lie in automatically scoring reviews and sentences with class ratings of 2
and below, as well as those with the positive class rating of five. A possible
explanation for this relates back to a point made earlier in the literature review, that natural
language from social media uses a wide variety of expression, such as sarcasm, jokes and slang,
which the machine algorithm struggles to interpret.
Although it is assumed Marston’s Plc would like to receive 100% positive feedback on all
aspects of the business, the negative opinions do allow Marston’s to acknowledge the areas of
the business which require improvement. The negative opinions expressed for the 3 features
in the word cloud from the TripAdvisor data suggest that “cold” was a frequently
occurring term, along with the word “disappointment”. It therefore appears that Marston’s Plc
needs to focus on ensuring customers receive hot meals and carveries (it is assumed the
service feature cannot be cold and that this opinion refers to the carvery and meal features),
that food is delivered at the correct temperature, and that the customers’ perception of
their visit as one of “disappointment” is addressed. The points highlighted here were all
outcomes of applying sentiment analysis and opinion mining to the two datasets. These
techniques ensure that popular features mentioned are extracted; once sentiment scores
are assigned and features with high or low polarity are identified, the opinions are then
extracted to explain the polarity scores. The current ‘gold standard’ data utilised by
Marston’s Plc does not allow this in-depth analysis of customers’ perceptions, and therefore
lacks the ability to inform Marston’s Plc on how the business can move forward.
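The pipeline described above (extract features, score the sentences that mention them, then
collect the opinion words behind each score) can be sketched in a few lines. This is an
illustrative Python sketch only, not the R code used in the study: the tiny lexicons, the
example sentences and the unnormalised word-count polarity are all assumptions made for the
example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical mini-lexicons; a real analysis would use a full sentiment lexicon.
POSITIVE = {"lovely", "friendly", "hot", "great"}
NEGATIVE = {"cold", "disappointment", "slow"}
FEATURES = {"carvery", "meal", "service"}

def summarise(sentences):
    """Map each feature to (mean sentence polarity, counts of opinion words)."""
    polarities = defaultdict(list)
    opinions = defaultdict(Counter)
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Polarity here is simply positive words minus negative words.
        polarity = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
        for feature in FEATURES.intersection(words):
            polarities[feature].append(polarity)
            opinions[feature].update(w for w in words if w in POSITIVE | NEGATIVE)
    return {f: (sum(p) / len(p), opinions[f]) for f, p in polarities.items()}

# Invented sentences, not taken from either dataset.
summary = summarise([
    "The carvery was cold.",
    "Friendly service and a lovely meal.",
    "My meal was cold, a real disappointment.",
])
```

For the invented sentences, the “carvery” feature receives a negative mean polarity and the
opinion word “cold” is recorded against it, which is the kind of explanation for a score that
the summarisation is intended to surface.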
Research Questions
This sub-section will discuss each research question in turn, and discuss the insight gained
from the answers; this section will also discuss the implications this has for business.
Can businesses utilise “free” data from social media sites to improve business
protocols?
The research conducted in this study has shown how user generated content from social
media can guide businesses to improve current procedures by obtaining sentiment scores for
popular topics, and then extracting opinions related to the topics. These practices can be
related to other industry sectors which rely upon customer satisfaction, thus not limiting the
study to just Marston’s Plc. The term “free” data refers to the user generated content on social
media, which can indeed be marketed as free for businesses to adopt and generate meaningful
insight. An important fact to consider is that this research only utilised content from
TripAdvisor, but there are many sources available for businesses to mine data (freely)
from. For example, previous studies have examined Amazon product reviews
(Dave, Lawrence, & Pennock, 2003), Twitter (Pak & Paroubek, 2010) and even YouTube
(Morency, Mihalcea, & Doshi, 2011). This highlights the wealth and range of data which is
available within social media for businesses to integrate with their decision making processes.
On the other hand, although the data is available for any individual to access and
transform into informed actions, there are debates in the research as to
the trustworthiness of such sources. Ayeh, Au and Law (2013) investigated
credibility perceptions and travellers’ attitudes towards user generated content. There
appear to be several studies which explore online reviews and the perceptions users have of
the trustworthiness of online content (Sparks & Browning, 2011; Zhang, Ye, Law, & Li, 2010;
Gretzel & Yoo, 2008). The idea of unlimited access to a wide range of content available on
social media suggests great opportunities for business, but the main practical limitation is the
need to employ individuals with the responsibility of aggregating data and drawing valuable
knowledge from it. This would require continuous work, as new reviews are uploaded in real
time and businesses need the latest information to drive the decision making process.
Are there any significant differences in incorporating user generated content to
standard business methods?
The industry sector explored in this research was classified as the leisure entertainment and
hotels sector. These sectors rely heavily on customers to drive revenue, and therefore the
content from users on social media does need to reflect the business in a way that attracts new
customers and brings back existing customers. It is important for the reader to understand that
although Marston’s Plc standard business methods were evaluated against user generated
content, not all businesses will follow their current methods, as the majority of businesses have
different infrastructures. In the literature however, it is recognised that businesses consider
user generated content in social media as an important source of information. In a survey by
Social Media Examiner (2014), it was claimed that 92% of marketers indicate that social media
is an important aspect for their business. In a report by Harvard Business Review Analytic
Services (2010), only 7% of the 2100 businesses surveyed said they could integrate social media
into their business strategy, including business intelligence and Customer Relationship
Management (CRM). This report may seem outdated as it was written over 5 years ago, but it
highlights the fact that businesses are struggling to incorporate content from social media into
their current business methods. In addition, a survey by Accenture (2014) claimed that 90% of
the Consumer Packaged Goods (CPG) company respondents saw the importance of social
media analytics rising over the next 2 years. As previously mentioned in the literature review,
there have already been several studies investigating the use of customer reviews in social
media for business purposes (Ye, Zhang, & Law, 2009; Lee, Jeong, & Lee, 2008; Liu, Hu, &
Cheng, 2005). The literature indicates that businesses currently recognise the value of
content from social media, and that their future business plans aim to integrate such methods,
but whether this will have a significant positive impact on the business remains an open question.
What current approaches are businesses taking to monitor customer
satisfaction?
The literature review in an earlier chapter of this study identified how businesses utilise
methods in business intelligence and analytics for monitoring customer behaviour. A popular
approach in businesses for monitoring customer satisfaction is Customer
Relationship Management (CRM). Parvatiyar and Sheth (2001) propose that CRM has the
potential to provide knowledge about customer behaviour through developing programs and
strategies which entice customers to continually strengthen their relationship with the business.
Examples of tools which businesses can utilise within CRM are relationship marketing
instruments (RMIs), which can take the form of loyalty programs and direct mailings. There are
numerous businesses which are well known for their loyalty programs, for example Sainsbury’s for
its Nectar card, Tesco for its Clubcard and Boots for its loyalty card. These programs enable
companies to monitor transactions made through the company, and encourage further
purchases by rewarding customers with deals and offers. A drawback of such schemes is
that they monitor regular customers, but fail to capture data from new customers or
customers who have left the scheme. In addition, such programs are less popular in particular
industry types such as hotels and restaurants, where the number of visits per week is lower than
that of supermarkets. The use of user generated content in social media, together with sentiment
analysis and opinion mining, begins to bridge this gap for many industry sectors.
Related Work
The results of this research and results from similar studies will be compared to evaluate the
practicalities of utilising user generated content in social media. Studies that investigated
sentiment analysis and opinion mining, in particular feature summarisation, tended to achieve
precision scores ranging between 60% and 90% (Titov & McDonald, 2008b; Liu, Hu, & Cheng,
2005; Hu & Liu, 2004). It is clear that the precision achieved in this research is much lower than
in the studies mentioned above, which could raise questions as to the validity of the results found
in this research. However, there are numerous observations to be made about the differences
between studies in the literature and the techniques implemented within this research.
Firstly, a major difference between the studies in the literature and this research is the number
of reviews to which the sentiment analysis methods were applied. Because the particular
Marston’s Plc pub involved in this research is a relatively new establishment, the number of
reviews available to mine was limited. The TripAdvisor reviews for the 6 month range
investigated totalled 55, and the Benchmark data totalled 161 reviews. Several
reviews were removed for ethical reasons (protecting staff identities), so the final
number of reviews utilised was 197. It is important for the reader to
remember, however, that the reviews were not used together; the research explored differences
between the two datasets. The studies in the literature review which applied the same
techniques used datasets containing a minimum of 500 reviews (Hu & Liu, 2004), whilst another
dataset contained 10,000 reviews (Titov & McDonald, 2008b).
Secondly, the reviews discussed in the literature have different characteristics. For example,
Zhuang, Jing and Zhu (2006) argue that movie reviews contain more objective sentences than the
product reviews used in other studies. This can affect the precision, resulting in lower precision
scores. In addition, Turney (2002) argues that movie reviews are
complicated to assign sentiment scores to because the whole of a review is not necessarily the
sum of its parts. Pang and Lee (2008) further suggest that it is difficult for sentiment and
opinion mining tasks to handle negation words directly, especially where a single word can
separate two sentences into opposite sentiment classes (negative vs positive). Although this
research has low precision and accuracy scores, the literature has displayed the limitations
studies have encountered in this field, and this has caused researchers to verify and strive to
perfect tools and methods when dealing with natural language online.
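The negation problem can be made concrete with a small sketch. The code compares a naive word
count with a variant that flips the polarity of the next sentiment word after a negator. The
lexicons and the flipping rule are hypothetical simplifications chosen for illustration; they are
not the method used in this research or in the studies cited.

```python
# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"good", "hot"}
NEGATIVE = {"bad", "cold"}
NEGATORS = {"not", "never", "no"}

def naive_score(sentence):
    """Count positive words minus negative words, ignoring negation."""
    words = sentence.lower().split()
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def negation_aware_score(sentence):
    """As above, but a negator flips the sign of the next sentiment word."""
    score, flip = 0, 1
    for word in sentence.lower().split():
        if word in NEGATORS:
            flip = -1
            continue
        value = (word in POSITIVE) - (word in NEGATIVE)
        if value:
            score += flip * value
            flip = 1  # the negation has been consumed
    return score

plain = "the food was good"
negated = "the food was not good"
```

The two example sentences differ by a single word, yet only the negation-aware variant places
them in opposite sentiment classes; the naive count scores both as positive.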
Thirdly, and most significantly, the methods used in the literature
apply feature summarisation to the reviews of products, movies, hotels and
various other review types, but they do not directly compare the results against a business’s
current methods for driving business decision making; they only propose frameworks for a
business to engage with.
The aim of this study was to evaluate current business methods against the frameworks which
have been rigorously assessed. The literature has provided this research with the necessary
background information to perform sentiment analysis and opinion mining techniques;
however, it is not the study’s priority to surpass or improve on existing methods for sentiment
and opinion mining, but to evaluate these proposed methods in a practical sense against a
business’s gold standard data.
Conclusion
This chapter will summarise the main findings of the research and limitations encountered
during the methodology, but also introduce ideas for future work and finish with
recommendations to Marston’s Plc and other industry sectors.
Limitations to the Research
Firstly, the programming used in RStudio encountered numerous complications when dealing
with the Benchmark data reviews, due to the presence of noisy data within the
reviews. On analysing the Benchmark reviews before any processes were applied, it became clear
that the structure of the reviews varies significantly compared to that of the TripAdvisor reviews.
For example, some reviews were missing punctuation at the end to indicate where the
comment ends. Thus, when the program ran over these reviews, some were
combined together, causing confusion when polarity scores were computed. This may
explain the low precision and accuracy scores obtained when the polarity scores were compared
with the manual scores assigned to the reviews.
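One plausible mechanism for this merging is sketched here; it is an assumption for illustration
and not a reconstruction of the RStudio code. If reviews are joined into one text and then
split on sentence-final punctuation, a review lacking a closing full stop runs into the next
one. Splitting each review separately avoids this. The reviews and the splitting rule are
invented for the example.

```python
import re

def split_sentences(text):
    """Split on sentence-final punctuation that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Invented reviews; the first is missing its closing full stop.
reviews = ["Lovely carvery", "The service was slow."]

# Joining first merges the unpunctuated review into the one that follows it.
merged = split_sentences(" ".join(reviews))

# Splitting review by review keeps the boundary between the two comments.
separate = [s for review in reviews for s in split_sentences(review)]
```

In the merged case a positive comment and a negative comment become one sentence with a mixed
polarity, which would then be compared against two separate manual scores.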
Secondly, although the data underwent pre-processing steps before identifying the relevant
features, a further step, stemming the data, could have further reduced the noise within the
data. Stemming reduces all variations of a word to a common “base” form;
for example, play, plays, played, and playing would all be output as “play”.
An example of where this occurred in the research is illustrated by figure 12 where the words
“disappointed” and “disappointment” are displayed. Through the process of stemming, these
words would have been combined, and perhaps would have moved up in ranking of most
frequently occurring terms. This procedure may have altered the features extracted from the
reviews as well.
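The effect of stemming on term frequencies can be sketched with a deliberately crude suffix
stripper. This is not a full stemming algorithm such as Porter's, and the suffix list, the
minimum stem length and the term list are assumptions chosen only to reproduce the examples in
the text.

```python
from collections import Counter

# Suffixes are checked in this order; a crude stand-in for a real stemmer.
SUFFIXES = ("ment", "ing", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [crude_stem(w) for w in ["play", "plays", "played", "playing"]]

# Invented term list: the two "disappoint" variants are counted together.
terms = ["disappointed", "disappointment", "cold", "cold", "cold"]
raw_counts = Counter(terms)
stemmed_counts = Counter(crude_stem(t) for t in terms)
```

Without stemming, the two variants each appear once; after stemming they form a single term with
a count of two, which is how such a term could move up a frequency ranking.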
Thirdly, the sentences containing the features identified then underwent polarity analysis to
determine scores for the features. The sentences which remain, although they contain the
extracted features, may be objective or subjective. For the aims of this research, only subjective
sentences, which express opinion, would be required. A part of the methodology did
compensate for this: for example, reviews that were manually tagged were given a rating of
3 if it was believed no sentiment was being expressed. An example of this is a review which
contained the sentences “I like to eat carvery” and “It was a 70th birthday celebration”. Both
received a score of 3, because the former is an opinion not relating to Marston’s Plc (just the
review author stating a fact about themselves) and the latter is a fact. The drawback of the
automated process of calculating polarity scores is that the machine may have extracted the
feature “carvery” and then assigned a positive score due to the presence of the word “like”.
Future Work
Research into sentiment analysis and opinion mining for business purposes has a wide range of
potential. Although this research focuses primarily on feature extraction summarisations, this
technique can be integrated with many other techniques to give insightful information. For
example, the summarisations combined with map features would allow an interactive
dashboard for managers (and those in more senior positions) to observe the sentiment and opinion
surrounding their businesses at different unit locations, utilising a traffic colour system. For
illustrative purposes, a map is displayed below:
Figure 15 shows a map of pubs using a traffic colour system
The map in figure 15 could illustrate to area managers which of their pubs
require further assistance to achieve positive sentiment and opinions from user generated
content. The dashboard could allow users to interact with areas of concern by displaying the
feature summarisations allocated to a specific pub upon request. The work could be
extended beyond Marston’s Plc to other industry sectors, such as retail, hotels, leisure and
financial services.
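The traffic colour idea amounts to mapping each pub's mean sentiment score to one of three
bands. A minimal sketch follows; the thresholds, the score scale and the pub names are all
hypothetical and would need to be set by the business.

```python
def traffic_colour(mean_sentiment, red_below=-0.25, green_from=0.25):
    """Map a mean sentiment score to a traffic colour (thresholds are hypothetical)."""
    if mean_sentiment < red_below:
        return "red"
    if mean_sentiment < green_from:
        return "amber"
    return "green"

# Hypothetical pubs and mean sentiment scores, not taken from the study.
pub_sentiment = {"Pub A": 0.6, "Pub B": 0.1, "Pub C": -0.4}

colours = {pub: traffic_colour(score) for pub, score in pub_sentiment.items()}
needs_attention = [pub for pub, colour in colours.items() if colour == "red"]
```

A dashboard could then plot each pub at its location in the assigned colour and show the
feature summarisation for any pub in the red band on request.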
In addition, this work has the potential to lead on to other exploratory analysis. For example,
a relationship could be investigated between sales revenue and social media
ratings at particular time intervals. If the pub receives several weeks of very poor reviews,
does this directly impact the sales of the pub, or have a knock-on effect? The range of choices
available for future work suggests that managers and those in more senior positions, across
industry sectors, should consider the potential user generated content has for their business.
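One simple way to begin exploring such a relationship is a lagged correlation between weekly
ratings and weekly sales. The sketch below is illustrative only: the figures are invented, with
sales constructed to follow ratings by exactly one week, and the function names are assumptions.
A correlation of this kind would indicate association rather than establish that reviews
directly cause changes in sales.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def lagged_correlation(ratings, sales, lag):
    """Correlate ratings in week t with sales in week t + lag."""
    if lag == 0:
        return pearson(ratings, sales)
    return pearson(ratings[:-lag], sales[lag:])

# Invented weekly figures: each week's sales echo the previous week's rating.
ratings = [4.5, 2.0, 4.0, 1.5, 4.5, 3.0]
sales = [3000, 4500, 2000, 4000, 1500, 4500]

same_week = lagged_correlation(ratings, sales, 0)
one_week_later = lagged_correlation(ratings, sales, 1)
```

In this constructed example the same-week correlation is negative while the one-week lag is
perfectly correlated, which is the pattern a knock-on effect would be expected to produce.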
A further possibility for future work is that this research could be used as a method for
Competitive Intelligence (CI). Although this study has focused on utilising methods for
improving standard business protocols, they could also be used to explore the features and
opinions associated with rival businesses, in order to observe what customers are expressing
about them.
Recommendations
From the findings of this research, a recommendation put forward to Marston’s Plc is to
re-assess the results they wish to obtain from paid platform services, and focus more of their
energy on the information which could be gained from
user generated content and social media. Sentiment analysis and opinion mining in
social media can turn valuable data into knowledge, which can inform business
decisions.
In addition, the suggestions put forward for future analysis envisaged creating an interactive
dashboard which would allow area managers to oversee their individual businesses around the UK,
and monitor them by utilising a traffic colouring system. The dashboard could focus solely on
any business that appears to be falling behind, and extract the key features along with opinions
and thoughts from consumers, to alert managers and those in more senior positions to aspects
that need improvement. This research recommends two programs for businesses to consider
for realising this idea. The first is a free software package called Tableau Public, which allows
users to import data, perform analytics and then create advanced dashboards. The
second is utilising BIRST products; however, these products are not free, and therefore this
research leans towards the Tableau Public platform.
A Final Word
The main finding to take away from this study is that, by utilising a case study approach,
the current measures by which a business drives its procedures could be directly compared to
the techniques applied to user generated content from social media. The reviews obtained
from TripAdvisor tended to contain a wider vocabulary range and more words per review,
meaning that extracting opinions about the features produced the insight needed for
Marston’s Plc to improve on the aspects about which customers are expressing negative
opinions. The research highlighted that reviews manually scored as 1 or 1.5 were over-rated
by the sentiment classifier as being more neutral when, in fact, the reviews displayed negative
sentiment. Although the sentiment precision and accuracy scores were significantly lower than
those of the studies in the literature, the limitations highlighted would allow the study to be
repeated with an improved method. The research does, however, begin to fill a gap in the
literature by directly applying the feature summarisation methods and comparing them to a
business’s current approach to driving decision making.
The research conducted in this paper highlights the potential for other businesses to adopt
similar techniques with user generated content, to gain insight into their business models and
consider the suggestions put forward for future work. Although this study utilised data from
Marston’s Plc, the methods can be extended to other industry sectors, indicating the
generalisability and scalability of this study.
Words: 10,283
Bibliography
Abbasi, A. (2007). Affect intensity analysis of dark web forums. In Intelligence and Security
Informatics 2007 IEEE, 282-288.
Abbasi, A., Chen, H., & Salem, A. (2008). Sentiment analysis in multiple languages: feature
selection for opinion classification in web forums. ACM Transactions on Information Systems
(TOIS), 26(3), 12.
Accenture. (2014). What’s trending in analytics for the consumer packaged goods industry.
Retrieved August 27, 2015 from https://www.accenture.com/t20150523T033001__w__/mz-
en/_acnmedia/Accenture/Conversion-
Assets/DotCom/Documents/Global/PDF/Dualpub_7/Accenture-CPG-Analytics-European-
Survey.pdf
Agichtein, E., Castillo, C., Donato, D., Gionis, A., & Mishne, G. (2008). Finding high-quality
content in social media. In Proceedings of the 2008 International Conference on Web Searching
and Data Mining, 183-194. Doi:10.1145/1341531.1341557
Asur, S., & Huberman, B. (2010). Predicting the future with social media. In Web Intelligence
and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference, 1,
492-499.
Aue, A., & Gamon, M. (2005). Customizing sentiment classifiers to new domains: a case
study. In Proceedings of recent advances in natural language processing, 1(3).
Ayeh, J. K., Au, N., & Law, R. (2013). “Do we believe in tripadvisor?” examining credibility
perceptions and online travelers’ attitude towards using user-generated content. Journal of
Travel Research, 52(4), 437-452. Doi:10.1177/0047287512475217
Benamara, F., Chardon, B., Mathieu, Y. Y., & Popescu, V. (2011). Towards context-based
subjectivity analysis. In Proceedings of the 5th International Joint Conference on Natural
Language Processing, 1180-1188.
Blair-Goldensohn, S., Hannan, K., McDonald, R., Neylon, T., Reis, G. A., & Reynar, J. (2008).
Building a sentiment summarizer for local service reviews. In WWW Workshop on NLP in the
Information Explosion Era, 14.
Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, bollywood, boom-boxes and
blenders: domain adaptation for sentiment classification. In Proceedings of the 45th Annual
Meeting of the Association of Computational Linguistics, 440–447.
Bryman, A. (2012). Social research methods. (4th ed.). Oxford, United Kingdom: Oxford
university press.
Burns, R. B. (2000). Introduction to research methods. (4th ed.). London, United Kingdom:
Sage publications.
Cambria, E., Schuller, B., Xia, Y., & Havasi, C. (2013). New avenues in opinion mining and
sentiment analysis. IEEE Intelligent Systems, (2), 15-21.
Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business intelligence and analytics: from big
data to big impact. MIS quarterly, 36(4), 1165-1188.
Cody, W. F., Kreulen, J. T., Krishna, V., & Spangler, W. S. (2002). The integration of business
intelligence and knowledge management. IBM systems journal, 41(4), 697-713.
Corbin, J., & Strauss, A. (2014). Basics of qualitative research: techniques and procedures for
developing grounded theory. London, United Kingdom: Sage publications.
Creswell, J. W. (2013). Research design: qualitative, quantitative, and mixed methods
approaches. London, United Kingdom: Sage publications.
Cui, H., Mittal, V., & Datar, M. (2006). Comparative experiments on sentiment classification
for online product reviews. In American Association for Artificial Intelligence, 6, 1265-1270.
Das, S. R., & Chen, M. Y. (2007). Yahoo! for amazon: sentiment extraction from small talk on
the web. Management Science, 53(9), 1375-1388.
Dave, K., Lawrence, S., & Pennock, D. M. (2003). Mining the peanut gallery: opinion
extraction and semantic classification of product reviews. In Proceedings of the 12th
international conference on World Wide Web, 519-528.
Dhar, V., & Chang, E. A. (2009). Does chatter matter? the impact of user-generated content
on music sales. Journal of Interactive Marketing, 23(4), 300-307.
Dellarocas, C., Zhang, X. M., & Awad, N. F. (2007). Exploring the value of online product
reviews in forecasting sales: the case of motion pictures. Journal of Interactive
marketing, 21(4), 23-45.
Dey, L., Haque, S. M., Khurdiya, A., & Shroff, G. (2011). Acquiring competitive intelligence
from social media. In Proceedings of the 2011 joint workshop on multilingual OCR and analytics
for noisy unstructured text data. Doi:10.1145/2034617.2034621
Ding, X., Liu, B., & Yu, P. S. (2008). A holistic lexicon-based approach to opinion mining. In
Proceedings of the 2008 International Conference on Web Search and Data Mining, 231-240.
Eisenhardt, K. M. (1989). Building theories from case study research. Academy of
management review, 14(4), 532-550.
Feldman, R. (2013). Techniques and applications for sentiment analysis. Communications of
the ACM, 56(4), 82-89. Doi:10.1145/2436256.2436274
Flyvbjerg, B. (2006). Five misunderstandings about case-study research. Qualitative
inquiry, 12(2), 219-245.
Gelo, O., Braakmann, D., & Benetka, G. (2008). Quantitative and qualitative research: beyond
the debate. Integrative psychological and behavioral science, 42(3), 266-290.
Gerring, J. (2004). What is a case study and what is it good for?. American political science
review, 98(2), 341-354.
Gerstenfeld, P. B., Grant, D. R., & Chiang, C. P. (2003). Hate online: a content analysis of
extremist internet sites. Analyses of social issues and public policy, 3(1), 29-44.
Glorot, X., Bordes, A., & Bengio, Y. (2011). Domain adaptation for large-scale sentiment
classification: a deep learning approach. In Proceedings of the 28th International Conference
on Machine Learning, 11, 513-520.
Gomm, R., Hammersley, M., & Foster, P. (2000). Case study and generalization. Case study
method, 98-115.
Gretzel, U., & Yoo, K. H. (2008). Use and impact of online travel reviews. Information and
communication technologies in tourism 2008, 35-46.
Harvard Business Review Analytic Services. (2010). The new conversation: taking social
media from talk into action. Retrieved August 26, 2015 from
https://hbr.org/resources/pdfs/tools/16203_HBR_SAS%20Report_webview.pdf
He, W., Zha, S., & Li, L. (2013). Social media competitive analysis and text mining: a case
study in the pizza industry. International Journal of Information Management, 33(3), 464-472.
Hodkinson, P. & Hodkinson, H. (2001). The strengths and limitations of case study research.
In Learning and Skills Development Agency Conference, 1(1), 5-7.
Hu, M., & Liu, B. (2004). Mining opinion features in customer reviews. In American
Association for Artificial Intelligence, 4(4), 755-760.
Jansen, B. J., Zhang, M., Sobel, K., & Chowdury, A. (2009). Micro-blogging as online word of
mouth branding. In CHI'09 Extended Abstracts on Human Factors in Computing Systems, 3859-
3864.
Johnson, D. (1994). Research methods in educational management. London, United
Kingdom: Longman Publishing Group.
Kavanaugh, A. L., Fox, E. A., Sheetz, S. D., Yang, S., Li, L. T., Shoemaker, D. J., & Xie, L. (2012).
Social media use by government: from the routine to the critical. Government Information
Quarterly, 29(4), 480-491.
Kietzmann, J. H., Hermkens, K., McCarthy, I. P., & Silvestre, B. S. (2011). Social media? get
serious! understanding the functional building blocks of social media. Business horizons, 54(3),
241-251.
Kitchin, R. (2014). The data revolution. London, United Kingdom: Sage.
Kouloumpis, E., Wilson, T., & Moore, J. (2011). Twitter sentiment analysis: the good the bad
and the omg!. Proceedings of the Fifth International AAAI Conference on Weblogs and Social
Media, 11, 538-541.
Lee, D., Jeong, O., & Lee, S. (2008). Opinion mining of customer feedback data on the web. In
Proceedings of the 2nd international conference on Ubiquitous information management and
communication, 230-235. Doi:10.1145/1352793.1352842
Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on Human
Language Technologies, 5(1), 1-167.
Liu, B., & Zhang, L. (2012). A survey of opinion mining and sentiment analysis. In C. C.
Aggarwal, & C. X. Zhai (Eds.), Mining text data. (pp. 415-463). New York, NY: Springer.
Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: analyzing and comparing opinions on
the web. In WWW ’05 Proceedings of the 14th international conference on World Wide Web,
342-351. Doi:10.1145/1060745.1060797
Lu, Y., Zhai, C., & Sundaresan, N. (2009). Rated aspect summarization of short comments. In
Proceedings of the 18th international conference on World wide web, 131-140.
Marston’s Plc. (2015). Corporate matters: it’s our business. Retrieved August 13, 2015 from
http://www.marstons.co.uk/corporate/
Maxwell, J. A. (2004). Causal explanation, qualitative research, and scientific inquiry in
education. Educational Researcher, 33(2), 3-11. Doi:10.3102/0013189X033002003
McCallum, A. (2005). Information extraction: distilling structured data from unstructured
text. Queue, 3(9), 48-57.
Morency, L. P., Mihalcea, R., & Doshi, P. (2011). Towards multimodal sentiment analysis:
harvesting opinions from the web. In Proceedings of the 13th international conference on
multimodal interfaces, 169-176.
O'Connor, B., Balasubramanyan, R., Routledge, B. R., & Smith, N. A. (2010). From tweets to
polls: linking text sentiment to public opinion time series. ICWSM, 11, 1-2.
O'Reilly, T. (2007). What is Web 2.0: Design patterns and business models for the next
generation of software. Communications & strategies, (1), 17-37.
Pak, A., & Paroubek, P. (2010). Twitter as a corpus for sentiment analysis and opinion
mining. In LREC, 10, 1320-1326.
Pan, S. J., Ni, X., Sun, J. T., Yang, Q., & Chen, Z. (2010). Cross-domain sentiment classification
via spectral feature alignment. In Proceedings of the 19th international conference on World
wide web, 751-760. Doi:10.1145/1772690.1772767
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and trends in
information retrieval, 2, 1-135.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using
machine learning techniques. In Proceedings of the ACL-02 conference Empirical methods in
natural language processing, 10, 79-86. Doi:10.3115/1118693.1118704
Parvatiyar, A., & Sheth, J. N. (2001). Customer relationship management: emerging practice,
process, and discipline. Journal of Economic and Social research, 3(2), 1-34.
Popescu, A., & Etzioni, O. (2005). Extracting product features and opinions from reviews. In
Proceedings of Human Language Technology Conference and Conference on Empirical Methods
in Natural Language Processing, 339-346.
Prabowo, R., & Thelwall, M. (2009). Sentiment analysis: a combined approach. Journal of
Informetrics, 3(2), 143-157.
Punch, K. (2014). Introduction to social research: quantitative and qualitative approaches. (3rd
ed.). London, United Kingdom: Sage publications.
Qazi, H. A. (2011). Evaluating goodness in qualitative researches. Bangladesh Journal of
Medical Science, 10(1), 11-20.
Reinschmidt, J., & Francoise, A. (2000). Business intelligence certification guide. IBM
International Technical Support Organisation.
Riloff, E., Wiebe, J., & Phillips, W. (2005). Exploiting subjectivity classification to improve
information extraction. In Proceedings of the National Conference On Artificial Intelligence,
20(3), 1106-1111.
Ritchie, J. & Lewis, J. (Eds.). (2003). Qualitative research practice. London, United Kingdom:
Sage publications.
Rud, O. P. (2009). Business intelligence success factors: tools for aligning your business in the
global economy. Hoboken, NJ: John Wiley & Sons.
Rygielski, C., Wang, J., & Yen, D. (2002). Data mining techniques for customer relationship
management. Technology in Society, 24, 483-502.
Sandelowski, M., & Barroso, J. (2008). Reading qualitative studies. International journal of
qualitative methods, 1(1), 74-108.
Social Media Examiner. (2014). 2014 Social media marketing industry report: how marketers
are using social media to grow their business. Retrieved August 26, 2015 from
http://www.socialmediaexaminer.com/SocialMediaMarketingIndustryReport2014.pdf
Somprasertsri, G., & Lalitrojwong, P. (2010). Mining feature-opinion in online customer
reviews for opinion summarization. Journal of Universal Computer Science, 16(6), 938-955.
Sparks, B. A., & Browning, V. (2011). The impact of online reviews on hotel booking
intentions and perception of trust. Tourism Management, 32(6), 1310-1323.
Thelwall, M., Buckley, K., & Paltoglou, G. (2012). Sentiment strength detection for the social
web. Journal of the American Society for Information Science and Technology, 63(1), 163-173.
Titov, I., & McDonald, R. (2008a). Modeling online reviews with multi-grain topic models. In
Proceedings of the 17th international conference on World Wide Web, 111-120.
Titov, I., & McDonald, R. T. (2008b). A joint model of text and aspect ratings for sentiment
summarization. In ACL, 8, 308-316.
TripAdvisor. (2015a). About tripadvisor. Retrieved August 6, 2015 from
http://www.tripadvisor.co.uk/PressCenter-c6-About_Us.html
TripAdvisor. (2015b). Tripadvisor website terms, conditions and notices. Retrieved August 6,
2015 from http://www.tripadvisor.co.uk/pages/terms.html
Tumasjan, A., Sprenger, T. O., Sandner, P. G., & Welpe, I. M. (2010). Predicting elections with
twitter: what 140 characters reveal about political sentiment. ICWSM, 10, 178-185.
Turney, P. D. (2002). Thumbs up or thumbs down?: semantic orientation applied to
unsupervised classification of reviews. In Proceedings of the 40th annual meeting on
association for computational linguistics, 417-424.
VanWynsberghe, R., & Khan, S. (2008). Redefining case study. International Journal of
Qualitative Methods, 6(2), 80-94.
Viso, S., Ramsay, B., Ralescu, A., & Knaap, E. (2011). Confusion matrix-based feature
selection. In Proceedings of the 22nd Midwest Artificial Intelligence and Cognitive Science
Conference, 120-127.
Walter, M. (Ed.). (2010). Social research methods. (2nd Ed.). Oxford, United Kingdom: Oxford
university press.
Wang, H., & Wang, S. (2008). A knowledge management approach to data mining process for
business intelligence. Industrial Management & Data Systems, 108(5), 622-634.
Wellington, J. & Szczerbinski, M. (2007). Research methods for the social sciences. London,
United Kingdom: Continuum International Publishing Group.
Whittemore, R., Chase, S. K., & Mandle, C. L. (2001). Validity in qualitative research.
Qualitative Health Research, 11(4), 522-537. Doi:10.1177/104973201129119299
Wiebe, J., & Riloff, E. (2005). Creating subjective and objective sentence classifiers from
unannotated texts. In Computational Linguistics and Intelligent Text Processing, 486-497.
Ye, Q., Law, R., Gu, B., & Chen, W. (2011). The influence of user-generated content on traveler
behavior: an empirical investigation on the effects of e-word-of-mouth to hotel online
bookings. Computers in Human Behavior, 27(2), 634-639.
Ye, Q., Zhang, Z., & Law, R. (2009). Sentiment classification of online reviews to travel
destinations by supervised machine learning approaches. Expert Systems with Applications,
36(3), 6527-6535.
Yeoh, W., & Koronios, A. (2010). Critical success factors for business intelligence
systems. Journal of computer information systems, 50(3), 23-32.
Yessenalina, A., Yue, Y., & Cardie, C. (2010). Multi-level structured models for document-level
sentiment classification. In Proceedings of the 2010 Conference on Empirical Methods in
Natural Language Processing, 1046-1056.
Yin, R. K. (2013). Case study research: Design and methods (5th Ed.). London, United
Kingdom: Sage publications.
Yin, R. K. (2011). Applications of case study research (3rd Ed.). London, United Kingdom: Sage
publications.
Zhang, Z., Ye, Q., Law, R., & Li, Y. (2010). The impact of e-word-of-mouth on the online
popularity of restaurants: a comparison of consumer reviews and editor reviews. International
Journal of Hospitality Management, 29(4), 694-700.
Zhuang, L., Jing, F., & Zhu, X. Y. (2006). Movie review mining and summarization. In
Proceedings of the 15th ACM international conference on Information and knowledge
management, 43-50.
Appendix A
Ethical Application Form
Research Approval Letter
Information sheet provided to Marston’s Plc
Signed consent form
Appendix B
RStudio Script
Basic Statistics
#Installing relevant packages
install.packages("plotrix")
library(plotrix)
#tm and NLP provide Corpus(), tm_map() and as.String() used below
library(tm)
library(NLP)
#Plotting the average number of words per review for each dataset
#First, the TripAdvisor dataset
WGReviews <- WinterGreen$Review
WGReviews <- as.vector(WGReviews)
element <- strsplit(WGReviews, " ")
sapply(element, length)
sum(sapply(element,length))/length(element)
#Second, the benchmark dataset
Benchmark <- as.vector(BenchmarkReviews)
bench <- strsplit(Benchmark, " ")
sapply(bench, length)
sum(sapply(bench, length))/length(bench)
#Average words per review on graph
AWPR <- c(95,16)
barplot(AWPR,main="Average Number of Words per Review",ylim=c(0,100),
        col=c("purple","yellow"),legend.text=c("TripAdvisor","Benchmark"),
        ylab="Number of words",cex.names=c("95","16"))
#Unique Words for Benchmark for vocab range
#staff and mystopwords are defined in the Benchmark Dataset and TripAdvisor Dataset sections
Allreviews <- paste(BenchmarkReviews, collapse=" ")
review_source <- VectorSource(Allreviews)
corpusben <- Corpus(review_source)
corpusben <- tm_map(corpusben, content_transformer(tolower))
corpusben <- tm_map(corpusben, stripWhitespace)
Corpusben <- tm_map(corpusben,removePunctuation)
Corpusben <- tm_map(Corpusben, removeWords, staff)
CoRpusben <- tm_map(Corpusben, removeWords, mystopwords)
Corpusben <- as.String(Corpusben[[1]])
CoRpusemp <- as.String(CoRpusben[[1]])
#Vocabulary with stopwords
BenchWords1 <- unlist(strsplit(Corpusben, " "))
BW1 <- unique(BenchWords1)
#Vocabulary without stopwords
BenchWords2 <- unlist(strsplit(CoRpusemp, " "))
BW2 <- unique(BenchWords2)
sum(sapply(BW2, length))
sum(sapply(BW1, length))
#Unique words for TripAdvisor for vocab range
WinterGreen$Review
CorpusTA <- Corpus(VectorSource(paste(WinterGreen$Review,collapse=" ")))
CorpusTA <- tm_map(CorpusTA, removePunctuation)
CorpusTA <- tm_map(CorpusTA,content_transformer(tolower))
CorpusTA <- tm_map(CorpusTA, stripWhitespace)
Corpusta <- tm_map(CorpusTA, removeWords, mystopwords)
CorpusTA <- as.String(CorpusTA[[1]])
Corpusta <- as.String(Corpusta[[1]])
TripWords <- unlist(strsplit(CorpusTA, " "))
TAW <- unique(TripWords)
sum(sapply(TAW, length))
TripAWords <- unlist(strsplit(Corpusta, " "))
TAW2 <- unique(TripAWords)
sum(sapply(TAW2, length))
vocabwithsw <- c(712,1229)
vocabwosw <- c(613,1113)
vocabdf <- data.frame(row.names=c("Benchmark Vocab","TripAdvisor Vocab"),"With Stopwords"=vocabwithsw,"Without Stopwords"=vocabwosw)
vocabdf <- as.matrix(vocabdf)
barplot(vocabdf,main="Vocabulary Range in Each Dataset",beside=TRUE,ylim=c(0,1300),
col=c("yellow","purple"),legend.text=c("Benchmark Reviews","TripAdvisor Reviews"),args.legend = list(x=5,y=1350),ylab="Number of Words")
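The figures plotted above (95 and 16 words per review; vocabularies of 712/1229 and 613/1113 words) are typed in by hand from console output. As a minimal sketch only, the same two statistics could be wrapped in small helper functions. The names `avg_words` and `vocab_size` and the toy reviews are hypothetical, introduced here for illustration, and the tokenisation is a plain split on spaces rather than the tm pipeline used in the script, so counts may differ slightly from those reported.

```r
# Hypothetical helpers (not part of the original script).

# Average number of space-separated tokens per review.
avg_words <- function(reviews) {
  tokens <- strsplit(as.character(reviews), " +")
  mean(lengths(tokens))
}

# Number of distinct lower-cased tokens after stripping punctuation,
# optionally excluding a vector of stopwords.
vocab_size <- function(reviews, stopwords = character(0)) {
  text <- tolower(paste(reviews, collapse = " "))
  text <- gsub("[[:punct:]]", "", text)
  tokens <- unlist(strsplit(text, " +"))
  tokens <- tokens[nzchar(tokens)]
  length(setdiff(unique(tokens), stopwords))
}

# Toy data for illustration only.
toy <- c("Great food and great service", "The food was cold")
avg_words(toy)
vocab_size(toy)
vocab_size(toy, stopwords = c("the", "and", "was"))
```

Calling the helpers on `WinterGreen$Review` and `BenchmarkReviews` would give values that could be passed to `barplot()` directly instead of being retyped.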
Correlation
#polarity() comes from qdap and removeFeatures() from quanteda
library(qdap)
library(quanteda)
#Correlation of TripAdvisor Ratings and Polarity Scores
tawg <- WinterGreen$Rating
tawg
WGR <- WinterGreen$Review
WGP <- polarity(WGR,constrain=TRUE,n.before=2,n.after=2,
                amplifiers = qdapDictionaries::amplification.words,
                negators = qdapDictionaries::negation.words,
                deamplifiers = qdapDictionaries::deamplification.words)
wgscores <- WGP$all$polarity
plot(tawg,wgscores,lwd=1.2,pch=16,main="Scatter Plot of Polarity Scores and TripAdvisor Ratings",
     xlab="TripAdvisor Ratings",ylab="Polarity Score")
#Add the fitted regression line once the plot exists
abline(lm(wgscores~tawg),col="red", lwd=2)
wgscores <- as.vector(wgscores)
cor(tawg,wgscores)
#Transform Polarity Scores into likert scale
likertta <-c(4,4,4,3,5,3,3,4,4,3,4,4,4,4,4,4,3,4,4,4,4,4,3,3,5,3,4,3,4,4,4,4,4,2,3,3,2,4,4,5,4,3,4,3,3,3,4,4,4,3,4,3,3,4,4)
table(tawg)
#Correlation for benchmark reviews
# 1=Very Bad, 2=Bad, 3=Neutral, 4=Good, 5=Very Good
#Using benchmark reviews, only the names of staff were removed and this
#reduced the review count to 142.
BMR <- BenchmarkReviews
BMR <- tolower(BMR)
BMR <- removeFeatures(BMR,staff)
BMR <- removePunctuation(BMR)
#The ratings assigned to reviews (manually).
ManualRatingsb <- c(4,4,3.5,3.5,3.5,4,4,4,3.5,3.5,3.5,2.5,3,5,4,2,3.5,5,3,3,4,4,3,4,3.5,4,3,4.5,3.5,3,4,4,3.5,1.5,2,2,3,4,4,4,5,5,4,3,4,5,5,3.5,3,4,4,3.5,3.5,3,4,3.5,4,3,4,3,3,3,5,3,4,4,1.5,4,4,3.5,5,4.5,4,4,4,3.5,2.5,4,3.5,5,4.5,5,4,3,4,3.5,3,3.5,3.5,3,4,3,3.5,3,5,4,5,4.5,4.5,5,2,4,2,2,2.5,1.5,2,2,3,2,3,1.5,1,2.5,2.5,3.5,4,4,3,4,4,4,3,4.5,2.5,3,3.5,3,1,2.5,2.5,3,3,4,4,3,2,3.5,3,4,4,4)
#Assign Polarity Scores to benchmark reviews
benchpolarity <- polarity(BMR,n.before=2,n.after=2,
                          amplifiers = qdapDictionaries::amplification.words,
                          negators = qdapDictionaries::negation.words,
                          deamplifiers = qdapDictionaries::deamplification.words,constrain=TRUE)
benchpolarityscores <- benchpolarity$all$polarity
#Correlation Plot for benchmark reviews and polarity scores
plot(ManualRatingsb,benchpolarityscores,pch=16,xlab="Manual Ratings",ylab="Polarity Scores",
     main="Scatter Plot of Manual Ratings and Polarity Scores")
#Add the fitted regression line once the plot exists
abline(lm(benchpolarityscores~ManualRatingsb),col="red",lwd=2)
cor(ManualRatingsb,benchpolarityscores,use="complete.obs")
#Convert polarity scores into likert scale
benchpolarityscores[141:147]
BENCHT <- c(4,4,3.5,4,3.5,4,4,4,3.5,3.5,4,2.5,3,4,4,3.5,3.5,4,3,3,3.5,3.5,3,4.5,4,3.5,3.5,4,4,3,3.5,4,3.5,3.5,3.5,3,3.5,4,4,3.5,4.5,4,3.5,3,4.5,4,3,4,3,4,2.5,3,4,4,3.5,4,4,3,4,3.5,3,3,4,3,4,4,3,3,4,3.5,4.5,4,4,4,4,4,3,3.5,4,4,4,4,4.5,3,3.5,3.5,3,3,3.5,3,3.5,3,4,3,5,4,3.5,3,4,3.5,2.5,4,3,3,3,3,2.5,3,3.5,3.5,3,2.5,2,4,3,3.5,4,3,3,4,4.5,4.5,3,4,2.5,3.5,3.5,3,2.5,3.5,3,3,3,3.5,4,3,2,4,3,4.5,3.5,4)
table(ManualRatingsb)
ManualRatingsb[134:142]
BENCHT[134:142]
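The Likert-style vectors `likertta` and `BENCHT` are entered by hand, and the script does not record the rule used to derive them from the polarity scores. Purely as an illustration, the sketch below shows one way such a conversion could be automated. It assumes a linear rescaling of the constrained polarity range [-1, 1] onto the 1 to 5 scale, rounded to the nearest half point. Both the function name `polarity_to_likert` and the mapping itself are assumptions, and they will not necessarily reproduce the hand-entered values.

```r
# Hypothetical conversion (an assumed mapping, not the author's recorded rule):
# -1 maps to 1 (Very Bad), 0 maps to 3 (Neutral), +1 maps to 5 (Very Good).
polarity_to_likert <- function(p, step = 0.5) {
  stopifnot(is.numeric(p))
  p <- pmin(pmax(p, -1), 1)      # clamp to the constrained polarity range
  scaled <- 3 + 2 * p            # linear rescale from [-1, 1] to [1, 5]
  round(scaled / step) * step    # snap to the nearest half point
}

# Toy polarity scores for illustration only.
polarity_to_likert(c(-1, -0.4, 0, 0.3, 1))
```

Missing polarity scores (`NA`) pass through unchanged, which matches the script's use of `use="complete.obs"` in `cor()`.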
Benchmark Dataset
#Aggregating all Benchmark Reviews
Benchmark1 <- c("Arun was polite and attentive throughout our visit! Top customer service skills-will return again!","They were attentive and were very friendly.", "No I can't think of anything you could do better.","Di, friendly conversation.","Arun - first time I'd done contactless payment but it was painless :)","A young lady called Lauren was very helpful and took care of us all and provided an excellent servic","All the staff who served us very very friendly and efficient , cannot fault any of them","We find the pub comfortable and clean, the staff excellent so I cannot recommend anything to change","all staff were nice. a guy in a blue shirt came over and chatted with us - think he was the host (at the door). blonde girl","can't fault it. it does what its tryin to do pretty well.", "Elisha very helpful on picnicking our food","No I think not Menu look a little fussy","Let us know when food was being served so that the person who was having carvery could get theirs and we all could eat at the same time.","no every thing was perfect will definitely go again.","no we all had a good meal and night thank you very much","Elema","DI","Good experience. Changes unnecessary at this time. Just reached the end of the survey and wasn't happy to be asked for personal details. This has put me off Marstons.","Elerma","Di good with information about events coming up","Di","Abigail","I think the management have got everything spot on the food, staff who are always very pleasant, helpful and nothing is too much trouble for them. Well done Emma and our team. Our experience visiting the Wintergreen is always relaxing atmosphere, never feel rushed to move from table and everything is 10/10. 
Well done Emma and team again.","sousages","Adam and chris","Zena went above and beyond and rob on the carvery is a lovely young man","We enjoyed our meal thanks","Chloe s and di","Lauren was very efficient and Nathan was very polite on the carvery","Tracie made our visit very pleasant","OUR SERVER DI WAS VERY HELPFUL AND ADVISED ON OUR CHOICE OF MENU AS IT WAS OUR FIRST VISIT.","A GOOD LOCAL PLACE TO EAT. WE COMMENTED THAT THE VEGETABLES WERE COLD AND THEY WERE IMMEDIATELY REPLACED WITH APOLOGIES. GOOD VALUE FOR MONEY.","Di, went above and beyond making our birthday meal celebration a good one","Her name was Di she told us about all the offers available was very friendly and made us feel really","The guy on carvery think is name was rob","Becki great service once again","Bekki","Adam great service alround","Use this pub 3 times a week always enjoy","We were very disappointed, my daughter was told she wasn't allowed paper to draw on my the supervisor called di !!! We also had to ask for the table to be wiped which was done under duress !!!! Person behind the bar was just going through the motions !!! The lady called Beckie was the only little ray of sun shine !!!! Food wasn't great either overall we won't be here for a while... But great customer service Beckie same about the other.","Weren't great at all tikka should of been with rice","As explained earlier they were so dis interested","ash","Beckie","Happy to talk about the experience if some one contacted me","Very welcoming and very friendly","Great veneu for lunch abig thank you to Adam lovely man and caring also to the lady on the carvery","The experiences we have had over the past months at the Pub have all been very good. The staff were tolerant as they accepted my change of numbers for a party booking. The carvery and 'mains' are both good value for money. The Pub is a lovely new building and this only aids the eating experience as it is the food and company that make it. 
I hope that the answers to the questions earlier speak for themselves.","They were all exceptionally helpful and friendly and had a smile :D","For my first visit i was pleasantly surprised at the level of service that was in place, friendly, inviting, helpful, smiling and most of all welcoming.","It is very close to home and has good food, good atmosphere and value for money","fish pie","Di - very helpful and friendly")
Benchmark2 <- c("James B was extremely friendly and went out of his way to make us feel welcome! It really made my ni","First visit-won't be my last-First impressions were great,very welcoming,great food and good prices-Give it a go-you won't be disappointed.","Ash was very good and Lisa on the carvery what a lovely couple","Becky & carver John.","Chris the carver was very efficient","Zena was lovely. Graham was welcoming. Chefs good, never had a bad meal.","'Di' took great pains with ensuring that the 'Help for Heroes' beer was pulled correctly even though it was the first pint of the day","The more
mature lady. Approx 5'5 tall medium length hair. Very friendly and polite when we thanked her for our lovely meal, even though the restaurant was full when we arrived she very quickly found us a table. She wasn't the same person who cleared our food.","I like to eat carvery, and the vegetables are always the same so I would like to see more variety instead of always peas, carrots, leeks and swede. Also a little more side salad with the burgers would be nice. The food overall is very good value for money..","Have been here now a few times, good food reasonable prices..","Di","beccy, ensured all our food was ok and served us quickly and took food away promptly, she was very friendly","It was my first visit and I dealt with three separate peole all of whom were helpful and polite. Having recently had a disappointing vist to a fairly similar local pub I was very happy with the way today's vist went","Portobello mushroom tagliatelli","Server was Di, chefs were Tom and Chris. I was served drinks by a very pleasant, efficient blond young lady.","Would be good if the '2 for' offers were extended to include a third person, ie 3 for £15.","Chris and di","Di/ john the cheff","We find ALL the staff (and management) at this restaurant to be really friendly and helpful - It would be very unfair to single anyone out! I would however like to say a big thank-you for the organisers of the kids club. Our daughter loves coming along. It?s great to have such a family friendly place on our doorstep that puts so much effort into what they do. Please keep up the good work guys, it's a pleasure to come in and eat/drink with you.","pasta","An excellent evening to celebrate a 70th birthday. Will come again.","Di.Really good service.","Disappointed by the lay out of the pub. Since it is on the estate where me and my partner live we would appreciate a dedicated drinking area as the whole pub feels like a restaurant. 
My partner and his dad were upset that they were unable to come in for just a drink on xmas day.","James B, gave us a deal we were unaware of when he could have over charged us full price thanks!:)","Name---Di Very helpful as she always remembers us from previous visits.","nothing to improve on.","The girl who took my food and drinks order was extremely helpful efficient and friendly. Her name was Chloe S. She made our visit most enjoyable","We really enjoyed our visit and will visit again and hopefully be served by Chloe S. The food was really enjoyable","Chloe s took the order from my wife she was most helpful and pleasant","Nothing could be done better the staff especially Chloe S. We're brilliant and food really good.","chloe s was very polite and friendly nothing was to much trouble for her.","The winter green is great We hand a great carvery","being able to have a tab for the table, and having drinks orders being taken at the table when eating.","emma manager hanah wairress emma and her team are the msst polite and made youwelcme from the moment we walked in the door this was our first visit but it will not be our last they are crdit to your training","i would like to be able to book atable for any amount not to have to take 10 in to reserve aTABLE OTHER WISE EVERYTHING IS PERFECT","Hannah's service was exceptional.","Zena made our experience enjoyable and chris on the carvery was very polite","Beccy","After seeing your pictures on facebook we came to try. Absolutely amazing would recommend to anyone. Thank you it was brilliant.")
Benchmark3 <- c("Adam very polite","Mushroom Tagliatelle","Graham and Chloe both extremely accomodating and made us feel welcome at the bar and at our table.","Waited at the Carvery for the chef for a little while. Otherwise enjoyed our visit.","Di","Di","Di","Stella Artois on draught","Bez","Hannah","Chloe S","Staff that served Di & Chris worked so hard at sorting our food and drinks quickly. 
The","Diana","Di just very helpful informed us of Xmas activities and forthcoming events.sad we missed the Xmas jollies. Will be looking forward to our next visit, big family so plenty of times we will be needing a nice place to eat, drink and be merry.","Hannah","Don't know","Her name was Alex and she was happy and obliging just as she had been on previous visits.","Had a meal voucher","Friendly","we were given full size menus,I had 10 oz gammon my friend had steak & ale pie,we were charged just over £19. As we were leaving I picked up a 2 for £10 menu and found that both our meals were on this menu but were not advised by staff of this fact so I feel we did not receive the value we should have under this offer.This will not stop us visiting again as the meals were excellent,but in future will be quick to point out to staff if our choices are on the 2 for £10 menu.","chloe","Graham - customer service outstanding & very friendly. Di was friendly & attentive.","Lovely afternoon out","Diana told us about the 2 ten offer she explained the meal where the medium size which save us on the meal we choose the carvey also the children's portions where very good size we didn't finish our desserts . Diana took them into the kitchen and boxed them 4 us. Too take home . we willingly tell our family & friend to vist the Winter Green","Graham went above and beyond to help is with our needs","The toilets were super clean","We thoroughly enjoyed our meal. The place was warm & welcoming, the desert was
beyond expectation - delicious. I can't think of anything that was lacking & we certainly look forward to our next visit.","I was disappointed that we had to wait for a table - but it was a busy Sunday time.","Nothing we all enjoyed the relaxed atmosphere and food","There could be more choice of vegetables and the food could be hotter.","Not really what I wanted in the first place","As mentioned earlier I wanted to order the Mushroom Tagliatelle for main and New York Style Cheesecake for dessert","As mentioned previously the portions of the main meal I had did not justify the price.","I wanted to order the Mushroom Tagliatelle for my main course but was told that it was no longer available as the menu is being updated. I chose the Chicken Melt and was very very disappointed. Just one small chicken breast with bacon and melted cheese on it, a small baked potato and a small portion of salad, all for just over £9 which wasn't good value for money, even though it tasted nice. I then ordered the New York Style Cheesecake, again to be told that his wasn't available either. Chose another dessert, the Chocolate Flake Cheesecake, again it was nice but not really what I wanted. Overall, I was disappointed with my visit as I didn't get what I really wanted to eat.","Wider carvery range on veg","Maybe more vegetables at the carvery, lovely otherwise!","I feel the menu could be better. There isn't a lot of choice of 'pub type' meals.","The server asDi","As I previsouly explained, puddings were poor value for money","More staff on bar and restaurant staff, more refills on carvery, mashed potato was disgusting","the cook wasn't happy with the veg at the carvery so brought fresh.","Listening to our request and give us a little longer to choose deserts ! 
BEZ","I recommend Winter Green pub in Rotherham.","Di she was very good at her job and made my party feel very much welcome so much that we will be coming again","just keep up the same standard of service","She told us all about the new menu. All the food we could have.")
Benchmark4 <- c("Bez/zena. Both were very helpful.","Lauren, very friendly and helpful","Bekki. Always smiling and very friendly.","I prefer Stones or John Smiths Bitter","The server today was Di. She was very efficient, meals were all delivered together and we didn't have to wait too long for our desserts. The service was excellent.","The background music could have been quieter, more subtle, it was a bit distracting for conversation. Mobile phone users should be asked to step out into the lobby - I don't go out to listen to other people conducting their business or having a loud conversation.","At time of visit only one person serving at the bar which resulted in a long wait as she was serving drinks and taking food orders on her own. This was the only complaint - our party particularly liked the internal decor","His name was John, and he explained about the area of Orgreave and the land was Orgreave Colliery, very interesting young man.","My carvery meal was delicious, but my husband ordered fish and chips which was a very small plate of food and the fish inside the batter was almost nonexistent, he said he would have the carvery next time.","The selection of vegetables were not great, not many to choose from. Also I was disgusted at the portion size of the cake away. £4.65 for a piece of cheesecake that was a normal size to all on our party. We also ordered the Jaffa Cake and when we got home we had got 2 pieces of cheesecake! Not happy!","The only small criticism I have is that the dessert (carrot cake) was a little dry, and didn't like the cream much. 
Still ate it though, but I have tasted better.","Just that staff inform customers of all offers","Di - Told us when child meal was being served so that we could get our carvery meal and then all eat together.","Her name was Di. She pointed out that as we'd come in about 5.45 that there were specials on offer before 5 pm","Server was DI and they was very friendly towards me and my family pay rise is needed","Very friendly and explained everything. She was called Di.","Took orders for coffee before desserts arrived.","Just very expensive on drinks.","cannot think of anything to change really enjoyed it.","Have a wider choice of the blonde/paler real ales","Ash - Great night, Great Quiz","nothing I was happy with everything","Chloe s was really helpful thanks")
BenchmarkReviews <- c(Benchmark1,Benchmark2,Benchmark3,Benchmark4)
#161 Reviews
#Cleaning the text
#Convert into a corpus to perform cleansing tasks
Allreviews <- paste(BenchmarkReviews, collapse=" ")
review_source <- VectorSource(Allreviews)
corpusben <- Corpus(review_source)
#List of staff names to remove from reviews
staff <- c("chloe","di","lisa","hanah","alex","becki","beckie","bekki","becky","tracie","rob","nathan","lauren","tom","chris","beccy","diana","john","arun","s","bez","hannah","ash","elerma","elisha","graham","zena","james","bez/zena","adam","abigail","emma","elema")
corpusben <- tm_map(corpusben, content_transformer(tolower))
corpusben <- tm_map(corpusben, stripWhitespace)
corpusben <- tm_map(corpusben, removeWords, staff)
corpusben <- tm_map(corpusben, removeWords, mystopwords)
#Reviews are now ready to find most frequent terms
dtmben <-DocumentTermMatrix(corpusben)
dtm2ben <- as.matrix(dtmben)
frequencyben <- colSums(dtm2ben)
frequencyben <- sort(frequencyben, decreasing=TRUE)
#Creating a word cloud to visualise most frequent words
#wordcloud() comes from wordcloud and brewer.pal() from RColorBrewer
library(wordcloud)
library(RColorBrewer)
wordsben <-names(frequencyben)
#words and frequencies must be the same length
wordcloud(wordsben[1:25],frequencyben[1:25], color = brewer.pal(8,"Dark2"), min.freq=5)
#Begin the steps to tag POS to all words in benchmark reviews
dataframeben<-data.frame(text=unlist(sapply(corpusben, `[`, "content")), stringsAsFactors=FALSE)
dataframeben <- unlist(dataframeben)
dataframeben <- as.String(dataframeben)
#POS tagging
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
TesT <- annotate(dataframeben, list(sent_token_annotator, word_token_annotator))
TestC <- annotate(dataframeben, pos_tag_annotator, TesT)
TestCC <- subset(TestC, type =="word")
tagS <- sapply(TestCC$features, '[[', "POS")
postaG <- sprintf("%s/%s", dataframeben[TestCC],tagS)
Allreviewsben <- as.character(Allreviews)
Doc <- sent_detect(text.var = Allreviewsben, endmarks = c(".","?","!",","),incomplete.sub = TRUE)
Doc <- tolower(Doc)
Doc <- removePunctuation(Doc)
Doc <- removeFeatures(Doc,mystopwords)
Doc <- removeFeatures(Doc,staff)
Doc <- removeNumbers(Doc)
Doc <- stripWhitespace(Doc)
#Keep only the tokens tagged as nouns
Nouns <- grep("/NN",postaG,value = TRUE)
Nouns
freq_terms(Nouns)
Topfeatures <- c("food","meal","staff","visit","carvery","service","menu","pub","table","value")
#Find all sentences corresponding to the top features found within the benchmark
#reviews
Extraction <- lapply(Topfeatures, grep, Doc, value = TRUE)
#Polarity of the sentences for each feature, in the order given in Topfeatures
foodben <- polarity(Extraction[[1]],constrain=TRUE,n.before=2,n.after=2,
                    amplifiers = qdapDictionaries::amplification.words,
                    negators = qdapDictionaries::negation.words,
                    deamplifiers = qdapDictionaries::deamplification.words)
mealben <- polarity(Extraction[[2]],constrain=TRUE,n.before=2,n.after=2,
                    amplifiers = qdapDictionaries::amplification.words,
                    negators = qdapDictionaries::negation.words,
                    deamplifiers = qdapDictionaries::deamplification.words)
staffben <- polarity(Extraction[[3]],constrain=TRUE,n.before=2,n.after=2,
                     amplifiers = qdapDictionaries::amplification.words,
                     negators = qdapDictionaries::negation.words,
                     deamplifiers = qdapDictionaries::deamplification.words)
visitben <- polarity(Extraction[[4]],constrain=TRUE,n.before=2,n.after=2,
                     amplifiers = qdapDictionaries::amplification.words,
                     negators = qdapDictionaries::negation.words,
                     deamplifiers = qdapDictionaries::deamplification.words)
carveryben <- polarity(Extraction[[5]],constrain=TRUE,n.before=2,n.after=2,
                       amplifiers = qdapDictionaries::amplification.words,
                       negators = qdapDictionaries::negation.words,
                       deamplifiers = qdapDictionaries::deamplification.words)
serviceben <- polarity(Extraction[[6]],constrain=TRUE,n.before=2,n.after=2,
                       amplifiers = qdapDictionaries::amplification.words,
                       negators = qdapDictionaries::negation.words,
                       deamplifiers = qdapDictionaries::deamplification.words)
menuben <- polarity(Extraction[[7]],constrain=TRUE,n.before=2,n.after=2,
                    amplifiers = qdapDictionaries::amplification.words,
                    negators = qdapDictionaries::negation.words,
                    deamplifiers = qdapDictionaries::deamplification.words)
pubben <- polarity(Extraction[[8]],constrain=TRUE,n.before=2,n.after=2,
                   amplifiers = qdapDictionaries::amplification.words,
                   negators = qdapDictionaries::negation.words,
                   deamplifiers = qdapDictionaries::deamplification.words)
tableben <- polarity(Extraction[[9]],constrain=TRUE,n.before=2,n.after=2,
                     amplifiers = qdapDictionaries::amplification.words,
                     negators = qdapDictionaries::negation.words,
                     deamplifiers = qdapDictionaries::deamplification.words)
valueben <- polarity(Extraction[[10]],constrain=TRUE,n.before=2,n.after=2,
                     amplifiers = qdapDictionaries::amplification.words,
                     negators = qdapDictionaries::negation.words,
                     deamplifiers = qdapDictionaries::deamplification.words)
#Calculate average polarity scores
Averagepolarity <- c(foodben$group$ave.polarity,mealben$group$ave.polarity,staffben$group$ave.polarity,visitben$group$ave.polarity,carveryben$group$ave.polarity,serviceben$group$ave.polarity,menuben$group$ave.polarity,pubben$group$ave.polarity,tableben$group$ave.polarity,valueben$group$ave.polarity)
#Number of sentences with features present
Nosents <- c(foodben$group$total.sentences,mealben$group$total.sentences,staffben$group$total.sentences,visitben$group$total.sentences,carveryben$group$total.sentences,serviceben$group$total.sentences,menuben$group$total.sentences,pubben$group$total.sentences,tableben$group$total.sentences,valueben$group$total.sentences)
FinalScoreS <- data.frame(Topfeatures,Nosents,Averagepolarity,POsitive,NEgative)
#Plot feature average polarity scores
barplot(FinalScoreS$Averagepolarity,ylab="Average Polarity Score",names.arg=FinalScoreS$Topfeatures,
        main = "Benchmark Data Top Features and Average Polarity Score",
        col=c("green","blue","red","white","orange","purple","lightgreen","yellow","pink","darkblue"),ylim = c(0,0.5))
#Carvery Feature analysis
#Recall sentences which contain the carvery feature
Extraction[[5]]
#Rate sentences manually
carverybenman <- c(3,4,4,4,4,4,3,4,4,2.5,2.5,3,2,4.5,3,3)
#Recall the polarity scores for carvery feature and compare
carveryben$all$polarity
bencarvery <- c(3.5,4.5,4,4.5,3.5,4,4,4.5,4,3,3,3,3,4,3,3)
#Same procedure for meal feature
#Sentences containing meal
Extraction[[2]]
benmealman <- c(4,3.5,4,3.5,4,3,3,4,4.5,4.5,2,2.5,3.5,4.5,3)
mealben$all$polarity
mealbenpol <- c(4.5,4.5,4,4,4.5,3.5,3,3.5,3.5,4,3,3,3,4,3)
#Same procedure for service feature
#Sentences containing service
Extraction[[6]]
serviceman <- c(4,4,4,5,4,5,5,4,4)
serviceben$all$polarity
servicebenpol <- c(3.5,4,4,4.5,4,4,4,3.5,4)
#Wordcloud of negative words for features
negfeaturesb <- c(carveryben$all$neg.words,serviceben$all$neg.words,mealben$all$neg.words)
negbcorpus <- Corpus(VectorSource(negfeaturesb))
negbdtm <- DocumentTermMatrix(negbcorpus)
negbdtm2 <- as.matrix(negbdtm)
negbfrequency <- colSums(negbdtm2)
negbfrequency <- sort(negbfrequency, decreasing=TRUE)
#Creating a word cloud to visualise most negative words
negbwords <- names(negbfrequency)
wordcloud(negbwords[1:5],negbfrequency[1:5], color = brewer.pal(8,"Dark2"), min.freq=1)
#Wordcloud of positive words for features
posfeaturesb <- c(carveryben$all$pos.words,serviceben$all$pos.words,mealben$all$pos.words)
posbcorpus <- Corpus(VectorSource(posfeaturesb))
posbdtm <- DocumentTermMatrix(posbcorpus)
posbdtm2 <- as.matrix(posbdtm)
posbfrequency <- colSums(posbdtm2)
posbfrequency <- sort(posbfrequency, decreasing=TRUE)
#Creating a word cloud to visualise most positive words
posbwords <- names(posbfrequency)
wordcloud(posbwords[1:10],posbfrequency[1:10], color = brewer.pal(8,"Dark2"), min.freq=2)
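The benchmark listing records manual ratings and polarity-derived ratings for each feature but does not show how the two are compared. The sketch below is one possible comparison for the carvery feature, using the two vectors exactly as they appear in the listing. The metrics (mean absolute difference and agreement rates) and the variable names `abs_diff`, `mean_abs_diff`, `exact_agreement` and `within_half` are assumptions for illustration, not the method used in the dissertation.

```r
# Ratings copied from the benchmark carvery analysis in this appendix
carverybenman <- c(3,4,4,4,4,4,3,4,4,2.5,2.5,3,2,4.5,3,3)        # manual ratings
bencarvery    <- c(3.5,4.5,4,4.5,3.5,4,4,4.5,4,3,3,3,3,4,3,3)    # ratings read off the polarity scores

# Both vectors must describe the same sentences in the same order
stopifnot(length(carverybenman) == length(bencarvery))

# Assumed comparison metrics (not taken from the dissertation)
abs_diff        <- abs(carverybenman - bencarvery)
mean_abs_diff   <- mean(abs_diff)          # average size of the disagreement
exact_agreement <- mean(abs_diff == 0)     # share of sentences rated identically
within_half     <- mean(abs_diff <= 0.5)   # share within half a rating point

print(c(mean_abs_diff = mean_abs_diff,
        exact_agreement = exact_agreement,
        within_half = within_half))
```

The same three lines could be applied to the meal and service vectors, provided each manual vector is paired with its polarity counterpart.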
TripAdvisor Dataset
#Installing relevant packages
install.packages("devtools", dependencies = TRUE)
install.packages("tm", dependencies = TRUE)
install.packages("qdap", dependencies = TRUE)
install.packages("openNLP")
install.packages("quanteda")
install.packages("DBI", dependencies = TRUE)
install.packages("assertthat")
#Loading the relevant packages
library(devtools)
library(NLP)
library(tm)
library(qdap)
library(assertthat)
library(openNLP)
library(quanteda)
library(stringr)
#Steps taken to cleanse the reviews from TripAdvisor
#List of stopwords to be removed from reviews
mystopwords <- c(stopwords("english"),"us","go","went","saw","list","didnt","ive","ways","character","wonder","place")
all <- paste(WinterGreen$Review, collapse=" ")
Source <- VectorSource(all)
corpusta <- Corpus(Source)
corpusta <- tm_map(corpusta, content_transformer(tolower))
corpusta <- tm_map(corpusta, removePunctuation)
corpusta <- tm_map(corpusta, stripWhitespace)
corpusta <- tm_map(corpusta, removeWords, mystopwords)
corpusta <- tm_map(corpusta, removeNumbers)
#Create a data frame so features can have POS tags attached
dataframe <- data.frame(text=unlist(sapply(corpusta, `[`, "content")), stringsAsFactors=F)
dataframe <- unlist(dataframe)
dataframe <- as.String(dataframe)
#Begin the steps to POS tag TripAdvisor reviews for feature extraction
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
Test <- annotate(dataframe, list(sent_token_annotator, word_token_annotator))
Testc <- annotate(dataframe, pos_tag_annotator, Test)
Testcc <- subset(Testc, type =="word")
tags <- sapply(Testcc$features, '[[', "POS")
postag <- sprintf("%s/%s", dataframe[Testcc],tags)
#All words have been tagged
#Split TripAdvisor reviews into sentences
allta <- as.character(all)
doc <- sent_detect(text.var = allta, endmarks = c(".","?","!",","),incomplete.sub = TRUE)
#TripAdvisor reviews converted into a document matrix to find most frequent terms
dataframeta <- as.vector(dataframe)
mydfm <- dfm(dataframeta)
topfeatures(mydfm,30)
mydfm
#Found 1077 features
#Limit the POS tag type to only display nouns, this equates to 1072
#features specified by the dfm.
nouns <- grep("/NN",postag,value = TRUE)
nouns
#Function to count the number of times each noun is mentioned
freq_terms(nouns)
#Picking the top 10 most frequent features
topfeatures <- c("food","staff","meal","pub","table","carvery","service","drinks","meat","menu")
#Returning to reviews which have been split into sentences already
#The sentences are now cleansed
doc <- tolower(doc)
doc <- removeFeatures(doc,mystopwords)
doc <- removeNumbers(doc)
doc <- removePunctuation(doc)
doc <- stripWhitespace(doc)
#Find all sentences which contain the features from topfeatures vector.
extraction <- lapply(topfeatures, grep, doc, value = TRUE)
#Run polarity over the sentences which contain the feature to determine sentiment.
#First, calculate the polarity of sentences with the feature food
foodpol <- polarity(extraction[[1]], constrain=TRUE, n.before=2, n.after=2,
                    amplifiers = qdapDictionaries::amplification.words,
                    negators = qdapDictionaries::negation.words,
                    deamplifiers = qdapDictionaries::deamplification.words)
foodpol$all$polarity
#Now compute polarity scores for the remaining top features
staffpol <- polarity(extraction[[2]], constrain=TRUE, n.before=2, n.after=2,
                     amplifiers = qdapDictionaries::amplification.words,
                     negators = qdapDictionaries::negation.words,
                     deamplifiers = qdapDictionaries::deamplification.words)
mealpol <- polarity(extraction[[3]], constrain=TRUE, n.before=2, n.after=2,
                    amplifiers = qdapDictionaries::amplification.words,
                    negators = qdapDictionaries::negation.words,
                    deamplifiers = qdapDictionaries::deamplification.words)
pubpol <- polarity(extraction[[4]], n.before=2, constrain=TRUE, n.after=2,
                   amplifiers = qdapDictionaries::amplification.words,
                   negators = qdapDictionaries::negation.words,
                   deamplifiers = qdapDictionaries::deamplification.words)
tablepol <- polarity(extraction[[5]], n.before=2, n.after=2, constrain=TRUE,
                     amplifiers = qdapDictionaries::amplification.words,
                     negators = qdapDictionaries::negation.words,
                     deamplifiers = qdapDictionaries::deamplification.words)
carverypol <- polarity(extraction[[6]], n.before=2, n.after=2, constrain=TRUE,
                       amplifiers = qdapDictionaries::amplification.words,
                       negators = qdapDictionaries::negation.words,
                       deamplifiers = qdapDictionaries::deamplification.words)
servicepol <- polarity(extraction[[7]], n.before=2, n.after=2, constrain=TRUE,
                       amplifiers = qdapDictionaries::amplification.words,
                       negators = qdapDictionaries::negation.words,
                       deamplifiers = qdapDictionaries::deamplification.words)
drinkspol <- polarity(extraction[[8]], n.before=2, n.after=2, constrain=TRUE,
                      amplifiers = qdapDictionaries::amplification.words,
                      negators = qdapDictionaries::negation.words,
                      deamplifiers = qdapDictionaries::deamplification.words)
meatpol <- polarity(extraction[[9]], n.before=2, n.after=2, constrain=TRUE,
                    amplifiers = qdapDictionaries::amplification.words,
                    negators = qdapDictionaries::negation.words,
                    deamplifiers = qdapDictionaries::deamplification.words)
menupol <- polarity(extraction[[10]], n.before=2, n.after=2, constrain=TRUE,
                    amplifiers = qdapDictionaries::amplification.words,
                    negators = qdapDictionaries::negation.words,
                    deamplifiers = qdapDictionaries::deamplification.words)
#Compute the average polarity scores for all features
averagepolarity <- c(foodpol$group$ave.polarity, staffpol$group$ave.polarity,
                     mealpol$group$ave.polarity, pubpol$group$ave.polarity,
                     tablepol$group$ave.polarity, carverypol$group$ave.polarity,
                     servicepol$group$ave.polarity, drinkspol$group$ave.polarity,
                     meatpol$group$ave.polarity, menupol$group$ave.polarity)
#The number of sentences which contain the features
nosents <- c(foodpol$group$total.sentences, staffpol$group$total.sentences,
             mealpol$group$total.sentences, pubpol$group$total.sentences,
             tablepol$group$total.sentences, carverypol$group$total.sentences,
             servicepol$group$total.sentences, drinkspol$group$total.sentences,
             meatpol$group$total.sentences, menupol$group$total.sentences)
#Combine all aspects into a dataframe
FinalScore <- data.frame(topfeatures,nosents,averagepolarity,pOsitive,nEgative)
#Plot average polarity scores for features
barplot(FinalScore$averagepolarity, names.arg=FinalScore$topfeatures,
        ylab="Average Polarity Score",
        main = "Top Features and Average Polarity Score",
        col=c("green","red","blue","yellow","pink","orange","purple","white","darkblue","lightgreen"),
        ylim = c(0,0.25))
#Manually Rate sentences which contain features
extraction[[6]]
mancarv <- c(2,3,2,3.5,2,4,4,3,2.5,3,4,3,3,3,3,2.5,3.5,3.5,4,3,3,2.5,2,2,1.5,3.5,2,3,3,3.5,3)
carvpolt <- carverypol$all$polarity
carvpolt <- c(3.5,3,2.5,2.5,2,4,4,3,3,3.5,3,3,3,3,3.5,3,3.5,3.5,4,3,3,3,3.5,2,2,3.5,3,3,3,3.5,3)
mancarv[29:35]
carvpolt[29:35]
extraction[[3]]
manmeal <- c(3,3,2.5,3,3,2,4,3,2,2,3,3,3,4,2.5,1.5,2.5,3,3.5,4,4,3.5,3,1,2,3,3.5,3.5)
mealpol$all$polarity
mealpolt <- c(3.5,2.5,3,3,3.5,3,4,3,2.5,2.5,3.5,3,2,4,3,2,3.5,3.5,4.5,3.5,4,3.5,3,2,4,3.5,3.5,4)
manmeal[22:28]
mealpolt[22:28]
extraction[[7]]
manservice <- c(3.5,4,4,3,4,4,4.5,2.5,3.5,4,3.5,3.5,3.5,3.5,3,3,3.5,3)
servicepol$all$polarity
manservicet <- c(4,3.5,3.5,3,4,4,3.5,2,4,4,3,3.5,4,3.5,2.5,4,4.5,3)
manservice[15:21]
manservicet[15:21]
#TripAdvisor positive words as a word cloud.
trippos <- c(carverypol$all$pos.words,mealpol$all$pos.words,servicepol$all$pos.words)
trippos2 <- Corpus(VectorSource(trippos))
tripposdtm <- DocumentTermMatrix(trippos2)
tripposdtm2 <- as.matrix(tripposdtm)
tripposfreq <- colSums(tripposdtm2)
tripposfreq <- sort(tripposfreq, decreasing = TRUE)
tposwords <- names(tripposfreq)
wordcloud(tposwords[1:10],tripposfreq[1:10], color = brewer.pal(8,"Dark2"), min.freq=2)
#Same procedure but for negative words
tripneg <- c(carverypol$all$neg.words,mealpol$all$neg.words,servicepol$all$neg.words)
tripneg2 <- Corpus(VectorSource(tripneg))
tripnegdtm <- DocumentTermMatrix(tripneg2)
tripnegdtm2 <- as.matrix(tripnegdtm)
tripnegfreq <- colSums(tripnegdtm2)
tripnegfreq <- sort(tripnegfreq, decreasing = TRUE)
tnegwords <- names(tripnegfreq)
wordcloud(tnegwords[1:10],tripnegfreq[1:10], color = brewer.pal(8,"Dark2"), min.freq=1)
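The word cloud steps above build a corpus and a document-term matrix only to count how often each sentiment word occurs. For a short vector of words, a base-R count is a possible shortcut. The sketch below assumes that the `pos.words` and `neg.words` columns are lists of character vectors in which `"-"` marks a sentence with no sentiment words (the `gsub("-", "", ...)` call later in this appendix suggests this, but it is an assumption). The function name `count_words` and the sample words are invented for illustration.

```r
# Hypothetical stand-in for c(carverypol$all$pos.words, mealpol$all$pos.words, ...):
# one character vector per sentence; the values are invented
pos_words <- list(c("good", "lovely"), "good", c("friendly", "good"), "-")

# Count word occurrences without building a corpus or document-term matrix
count_words <- function(word_list) {
  words <- tolower(unlist(word_list))
  # Assumed convention: "-" is a placeholder for "no sentiment words found"
  words <- words[!is.na(words) & words != "-"]
  sort(table(words), decreasing = TRUE)
}

freq <- count_words(pos_words)
print(freq)
```

The names and counts in `freq` could then be passed to `wordcloud()` in place of the frequency vector derived from the document-term matrix, although the two routes tokenise text differently and may not give identical counts.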
Plotting Grouped Plots
#The features common to both datasets
groupS <- c("food","staff","meal","pub","carvery","service","menu","table")
#Vector containing feature polarity for each dataset
Trip <- c(foodpol$group$ave.polarity, staffpol$group$ave.polarity,
          mealpol$group$ave.polarity, pubpol$group$ave.polarity,
          carverypol$group$ave.polarity, servicepol$group$ave.polarity,
          menupol$group$ave.polarity, tablepol$group$ave.polarity)
Bench <- c(foodben$group$ave.polarity, staffben$group$ave.polarity,
           mealben$group$ave.polarity, pubben$group$ave.polarity,
           carveryben$group$ave.polarity, serviceben$group$ave.polarity,
           menuben$group$ave.polarity, tableben$group$ave.polarity)
Frame <- data.frame(row.names=c("food","staff","meal","pub","carvery","service","menu","table"),
                    benchmark=Bench, tripadvisor= Trip)
Frame <- as.matrix(Frame)
Frame <- t(Frame)
Frame
barplot(Frame, main="Feature Polarity Comparison Across Datasets", beside=TRUE,
        ylim=c(0,0.45), col=c("yellow","purple"),
        legend.text=c("Benchmark","TripAdvisor"),
        args.legend=list(x="topleft"), ylab="Average Polarity Score")
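With `beside=TRUE`, `barplot()` draws one group of bars per matrix column and one bar per matrix row, which is why the listing above transposes the data frame. The sketch below shows an equivalent way of building that two-row matrix directly with `rbind()`. The scores are invented placeholders, not results from the study, and the names `bench`, `trip`, `scores`, `gap` and `largest_gap` are illustrative.

```r
features <- c("food","staff","meal","pub","carvery","service","menu","table")

# Invented placeholder scores; the real values come from the polarity objects
bench <- c(0.30, 0.35, 0.28, 0.22, 0.25, 0.33, 0.20, 0.18)
trip  <- c(0.20, 0.24, 0.15, 0.12, 0.10, 0.21, 0.14, 0.09)

# Rows are datasets (bars within a group), columns are features (groups)
scores <- rbind(benchmark = bench, tripadvisor = trip)
colnames(scores) <- features

# Difference between datasets per feature, and the feature with the widest gap
gap <- scores["benchmark", ] - scores["tripadvisor", ]
largest_gap <- names(which.max(gap))

print(scores)
print(largest_gap)
```

A matrix in this layout can be passed straight to `barplot(scores, beside=TRUE, ...)` with the same remaining arguments as in the listing.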
Scraping the Web
#Install relevant packages
install.packages("RCurl",dependencies = TRUE)
install.packages("XML",dependencies = TRUE)
install.packages("rvest",dependencies = TRUE)
install.packages("xml2",dependencies = TRUE)
install.packages("magrittr",dependencies = TRUE)
library(bitops)
library(RCurl)
library(XML)
library(rvest)
library(xml2)
library(magrittr)
#Url used to scrape reviews
urlone <- "http://www.tripadvisor.co.uk/ShowUserReviews-g190734-d6875011-r284223606-Winter_Green-Rotherham_South_Yorkshire_England.html#CHECK_RATES_CONT"
#Programming to extract reviews and other information
reviews <- urlone %>%
  html() %>%
  html_nodes("#REVIEWS .innerBubble")
Quote <- reviews %>%
  html_node(".quote") %>%
  html_text() %>%
  as.character()
Rating <- reviews %>%
  html_node(".rating .rating_s_fill") %>%
  html_attr("alt") %>%
  gsub(" of 5 stars", "", .) %>%
  as.integer()
Value <- reviews %>%
  html_node(".recommend-answer:nth-child(1) .rating_ss_fill") %>%
  html_attr("alt") %>%
  gsub(" of 5 stars", "", .) %>%
  as.character()
Food <- reviews %>%
  html_node(".recommend-answer:nth-child(2) .rating_ss_fill") %>%
  html_attr("alt") %>%
  gsub(" of 5 stars", "", .) %>%
  as.character()
Service <- reviews %>%
  html_node(":nth-child(3) .recommend-answer:nth-child(1) .rating_ss_fill") %>%
  html_attr("alt") %>%
  gsub(" of 5 stars", "", .) %>%
  as.character()
Date <- reviews %>%
  html_node("span.ratingDate") %>%
  html_text(".ratingDate") %>%
  gsub("Reviewed ","", .) %>%
  as.Date(format="%d %B %Y")
Review <- reviews %>%
  html_node(".entry") %>%
html_text() #Dataframe showing key features extracted from url WG1<-data.frame(Quote, Rating, Value, Service, Food, Date, Review, stringsAsFactors = FALSE) #Same method repeated to other url pages urltwo <- "http://www.tripadvisor.co.uk/ShowUserReviews-g190734-d6875011-r281585638-Winter_Green-Rotherham_South_Yorkshire_England.html#REVIEWS" reviews <- urltwo %>% html() %>% html_nodes("#REVIEWS .innerBubble") Quote <- reviews %>% html_node(".quote") %>% html_text() %>% as.character() Rating <- reviews %>% html_node(".rating .rating_s_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.integer() Value <- reviews %>% html_node(".recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Food <- reviews %>% html_node(".recommend-answer:nth-child(2) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Service <- reviews %>% html_node(":nth-child(3) .recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Date <- reviews %>% html_node("span.ratingDate") %>% html_text(".ratingDate") %>% gsub("Reviewed ","", .)%>% as.Date(format="%d %B %Y") Review <- reviews %>% html_node(".entry") %>% html_text() WG2<-data.frame(Quote, Rating, Value, Service, Food, Date, Review, stringsAsFactors = FALSE) urlthree <- "http://www.tripadvisor.co.uk/ShowUserReviews-g190734-d6875011-r274065996-Winter_Green-Rotherham_South_Yorkshire_England.html#REVIEWS" reviews <- urlthree %>% html() %>% html_nodes("#REVIEWS .innerBubble") Quote <- reviews %>% html_node(".quote") %>% html_text() %>% as.character()
Rating <- reviews %>% html_node(".rating .rating_s_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.integer() Value <- reviews %>% html_node(".recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Food <- reviews %>% html_node(".recommend-answer:nth-child(2) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Service <- reviews %>% html_node(":nth-child(3) .recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Date <- reviews %>% html_node("span.ratingDate") %>% html_text(".ratingDate") %>% gsub("Reviewed ","", .)%>% as.Date(format="%d %B %Y") Review <- reviews %>% html_node(".entry") %>% html_text() WG3<-data.frame(Quote, Rating, Value, Service, Food, Date, Review, stringsAsFactors = FALSE) urlfour <- "http://www.tripadvisor.co.uk/ShowUserReviews-g190734-d6875011-r271638603-Winter_Green-Rotherham_South_Yorkshire_England.html#REVIEWS" reviews <- urlfour %>% html() %>% html_nodes("#REVIEWS .innerBubble") Quote <- reviews %>% html_node(".quote") %>% html_text() %>% as.character() Rating <- reviews %>% html_node(".rating .rating_s_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.integer() Value <- reviews %>% html_node(".recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Food <- reviews %>% html_node(".recommend-answer:nth-child(2) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>%
as.character() Service <- reviews %>% html_node(":nth-child(3) .recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Date <- reviews %>% html_node("span.ratingDate") %>% html_text(".ratingDate") %>% gsub("Reviewed ","", .)%>% as.Date(format="%d %B %Y") Review <- reviews %>% html_node(".entry") %>% html_text() WG4<-data.frame(Quote, Rating, Value, Service, Food, Date, Review, stringsAsFactors = FALSE) urlfive <- "http://www.tripadvisor.co.uk/ShowUserReviews-g190734-d6875011-r268696801-Winter_Green-Rotherham_South_Yorkshire_England.html#REVIEWS" reviews <- urlfive %>% html() %>% html_nodes("#REVIEWS .innerBubble") Quote <- reviews %>% html_node(".quote") %>% html_text() %>% as.character() Rating <- reviews %>% html_node(".rating .rating_s_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.integer() Value <- reviews %>% html_node(".recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Food <- reviews %>% html_node(".recommend-answer:nth-child(2) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Service <- reviews %>% html_node(":nth-child(3) .recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Date <- reviews %>% html_node("span.ratingDate") %>% html_text(".ratingDate") %>% gsub("Reviewed ","", .)%>% as.Date(format="%d %B %Y") Review <- reviews %>% html_node(".entry") %>% html_text()
WG5<-data.frame(Quote, Rating, Value, Service, Food, Date, Review, stringsAsFactors = FALSE) urlsix <- "http://www.tripadvisor.co.uk/ShowUserReviews-g190734-d6875011-r267370175-Winter_Green-Rotherham_South_Yorkshire_England.html#REVIEWS" reviews <- urlsix %>% html() %>% html_nodes("#REVIEWS .innerBubble") Quote <- reviews %>% html_node(".quote") %>% html_text() %>% as.character() Rating <- reviews %>% html_node(".rating .rating_s_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.integer() Value <- reviews %>% html_node(".recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Food <- reviews %>% html_node(".recommend-answer:nth-child(2) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Service <- reviews %>% html_node(":nth-child(3) .recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Date <- reviews %>% html_node("span.ratingDate") %>% html_text(".ratingDate") %>% gsub("Reviewed ","", .)%>% as.Date(format="%d %B %Y") Review <- reviews %>% html_node(".entry") %>% html_text() WG6<-data.frame(Quote, Rating, Value, Service, Food, Date, Review, stringsAsFactors = FALSE) urlseven <- "http://www.tripadvisor.co.uk/ShowUserReviews-g190734-d6875011-r264465043-Winter_Green-Rotherham_South_Yorkshire_England.html#REVIEWS" reviews <- urlseven %>% html() %>% html_nodes("#REVIEWS .innerBubble") Quote <- reviews %>% html_node(".quote") %>% html_text() %>% as.character() Rating <- reviews %>% html_node(".rating .rating_s_fill") %>% html_attr("alt") %>%
gsub(" of 5 stars", "", .) %>% as.integer() Value <- reviews %>% html_node(".recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Food <- reviews %>% html_node(".recommend-answer:nth-child(2) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Service <- reviews %>% html_node(":nth-child(3) .recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Date <- reviews %>% html_node("span.ratingDate") %>% html_text(".ratingDate") %>% gsub("Reviewed ","", .)%>% as.Date(format="%d %B %Y") Review <- reviews %>% html_node(".entry") %>% html_text() WG7<-data.frame(Quote, Rating, Value, Service, Food, Date, Review, stringsAsFactors = FALSE) urleight <- "http://www.tripadvisor.co.uk/ShowUserReviews-g190734-d6875011-r261328206-Winter_Green-Rotherham_South_Yorkshire_England.html#REVIEWS" reviews <- urleight %>% html() %>% html_nodes("#REVIEWS .innerBubble") Quote <- reviews %>% html_node(".quote") %>% html_text() %>% as.character() Rating <- reviews %>% html_node(".rating .rating_s_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.integer() Value <- reviews %>% html_node(".recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Food <- reviews %>% html_node(".recommend-answer:nth-child(2) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Service <- reviews %>% html_node(":nth-child(3) .recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>%
gsub(" of 5 stars", "", .) %>% as.character() Date <- reviews %>% html_node("span.ratingDate") %>% html_text(".ratingDate") %>% gsub("Reviewed ","", .)%>% as.Date(format="%d %B %Y") Review <- reviews %>% html_node(".entry") %>% html_text() WG8<-data.frame(Quote, Rating, Value, Service, Food, Date, Review, stringsAsFactors = FALSE) urlnine <- "http://www.tripadvisor.co.uk/ShowUserReviews-g190734-d6875011-r258413230-Winter_Green-Rotherham_South_Yorkshire_England.html#REVIEWS" reviews <- urlnine %>% html() %>% html_nodes("#REVIEWS .innerBubble") Quote <- reviews %>% html_node(".quote") %>% html_text() %>% as.character() Rating <- reviews %>% html_node(".rating .rating_s_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.integer() Value <- reviews %>% html_node(".recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Food <- reviews %>% html_node(".recommend-answer:nth-child(2) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Service <- reviews %>% html_node(":nth-child(3) .recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Date <- reviews %>% html_node("span.ratingDate") %>% html_text(".ratingDate") %>% gsub("Reviewed ","", .)%>% as.Date(format="%d %B %Y") Review <- reviews %>% html_node(".entry") %>% html_text() WG9<-data.frame(Quote, Rating, Value, Service, Food, Date, Review, stringsAsFactors = FALSE) urlten <- "http://www.tripadvisor.co.uk/ShowUserReviews-g190734-d6875011-r255413295-Winter_Green-Rotherham_South_Yorkshire_England.html#REVIEWS" reviews <- urlten %>%
html() %>% html_nodes("#REVIEWS .innerBubble") Quote <- reviews %>% html_node(".quote") %>% html_text() %>% as.character() Rating <- reviews %>% html_node(".rating .rating_s_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.integer() Value <- reviews %>% html_node(".recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Food <- reviews %>% html_node(".recommend-answer:nth-child(2) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Service <- reviews %>% html_node(":nth-child(3) .recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.character() Date <- reviews %>% html_node("span.ratingDate") %>% html_text(".ratingDate") %>% gsub("Reviewed ","", .)%>% as.Date(format="%d %B %Y") Review <- reviews %>% html_node(".entry") %>% html_text() WG10<-data.frame(Quote, Rating, Value, Service, Food, Date, Review, stringsAsFactors = FALSE) urleleven <- "http://www.tripadvisor.co.uk/ShowUserReviews-g190734-d6875011-r251874477-Winter_Green-Rotherham_South_Yorkshire_England.html#REVIEWS" reviews <- urleleven %>% html() %>% html_nodes("#REVIEWS .innerBubble") Quote <- reviews %>% html_node(".quote") %>% html_text() %>% as.character() Rating <- reviews %>% html_node(".rating .rating_s_fill") %>% html_attr("alt") %>% gsub(" of 5 stars", "", .) %>% as.integer() Value <- reviews %>% html_node(".recommend-answer:nth-child(1) .rating_ss_fill") %>% html_attr("alt") %>%
  gsub(" of 5 stars", "", .) %>%
  as.character()
Food <- reviews %>%
  html_node(".recommend-answer:nth-child(2) .rating_ss_fill") %>%
  html_attr("alt") %>%
  gsub(" of 5 stars", "", .) %>%
  as.character()
Service <- reviews %>%
  html_node(":nth-child(3) .recommend-answer:nth-child(1) .rating_ss_fill") %>%
  html_attr("alt") %>%
  gsub(" of 5 stars", "", .) %>%
  as.character()
Date <- reviews %>%
  html_node("span.ratingDate") %>%
  html_text(".ratingDate") %>%
  gsub("Reviewed ","", .) %>%
  as.Date(format="%d %B %Y")
Review <- reviews %>%
  html_node(".entry") %>%
  html_text()
WG11 <- data.frame(Quote, Rating, Value, Service, Food, Date, Review, stringsAsFactors = FALSE)
View(WG11)
#Now combine all reviews into one dataframe
WinterGreen <- rbind(WG1,WG2,WG3,WG4,WG5,WG6,WG7,WG8,WG9,WG10,WG11)
View(WinterGreen)
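Two steps recur in every page block of the scraping listing: stripping " of 5 stars" from the rating's alt text, and collecting the per-page data frames into one. The sketch below isolates both in base R so they can be checked without a network connection. The alt-text strings, review quotes and the names `parse_stars`, `pages` and `all_reviews` are invented for illustration, and the sketch does not reproduce the scraping itself.

```r
# Hypothetical alt-text values of the kind html_attr("alt") might return;
# NA stands for a review where the node was not found
alt_text <- c("4 of 5 stars", "5 of 5 stars", NA, "2 of 5 stars")

# Strip the fixed suffix and convert to integer; NA inputs stay NA
parse_stars <- function(x) {
  as.integer(gsub(" of 5 stars", "", x, fixed = TRUE))
}
ratings <- parse_stars(alt_text)

# Per-page data frames held in a list rather than as WG1, WG2, ...
# (contents invented for illustration)
pages <- list(
  data.frame(Quote = "Great carvery", Rating = parse_stars("5 of 5 stars"),
             stringsAsFactors = FALSE),
  data.frame(Quote = "Slow service", Rating = parse_stars("2 of 5 stars"),
             stringsAsFactors = FALSE)
)

# Bind every page in one call, whatever the number of pages
all_reviews <- do.call(rbind, pages)
print(ratings)
print(all_reviews)
```

Holding the pages in a list means an extra page only requires one more list element, rather than a new variable and a longer `rbind()` call.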
Plot for Feature Summarisation
#Create a summarisation plot comparing features
library(ggplot2)
#Recall the lowest and highest scoring sentence for each feature
groupS <- c("food","staff","meal","pub","carvery","service","menu","table")
Min2t <- c(min(foodpol$all$polarity), min(staffpol$all$polarity),
           min(mealpol$all$polarity), min(pubpol$all$polarity),
           min(carverypol$all$polarity), min(servicepol$all$polarity),
           min(menupol$all$polarity), min(tablepol$all$polarity))
Max2t <- c(max(foodpol$all$polarity), max(staffpol$all$polarity),
           max(mealpol$all$polarity), max(pubpol$all$polarity),
           max(carverypol$all$polarity), max(servicepol$all$polarity),
           max(menupol$all$polarity), max(tablepol$all$polarity))
Minb <- c(min(foodben$all$polarity), min(staffben$all$polarity),
          min(mealben$all$polarity), min(pubben$all$polarity),
          min(carveryben$all$polarity), min(serviceben$all$polarity),
          min(menuben$all$polarity), min(tableben$all$polarity))
Maxb <- c(max(foodben$all$polarity), max(staffben$all$polarity),
          max(mealben$all$polarity), max(pubben$all$polarity),
          max(carveryben$all$polarity), max(serviceben$all$polarity),
          max(menuben$all$polarity), max(tableben$all$polarity))
CFS <- data.frame(groupS,Min2t,Minb,Max2t,Maxb)
ggplot(CFS) +
  geom_crossbar(aes(x=groupS,ymin=Min2t,ymax=Max2t,y=Min2t),fill="purple") +
  geom_crossbar(aes(x=groupS,y=Minb,ymin=Minb,ymax=Maxb),fill="yellow",alpha=0.6) +
  theme(panel.background=element_blank()) +
  ylab("Polarity Scores") +
  xlab("Features Popular to both Datasets") +
  ggtitle("Comparing Feature Polarity Scores for both Datasets in a Summary")
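The minimum and maximum vectors for the summary plot are assembled by writing out `min()` and `max()` once per feature. If the sentence-level scores are first gathered into a named list, `range()` can be applied to every feature in one step. The sketch below uses invented scores, and the names `polarity_by_feature`, `ranges` and `summary_df` are illustrative only.

```r
# Invented sentence-level polarity scores, one numeric vector per feature;
# in the analysis these would be foodpol$all$polarity, staffpol$all$polarity, ...
polarity_by_feature <- list(
  food  = c(-0.5, 0.2, 1),
  staff = c(0, 0.4),
  meal  = c(-1, -0.25, 0.75)
)

# range() returns c(min, max); sapply gives a 2 x n matrix, so transpose it
ranges <- t(sapply(polarity_by_feature, range))
colnames(ranges) <- c("min", "max")

# One row per feature, ready for a crossbar-style plot
summary_df <- data.frame(feature = rownames(ranges), ranges, row.names = NULL)
print(summary_df)
```

Running the same call on the benchmark list and on the TripAdvisor list would yield the two pairs of columns that the data frame for the plot needs.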
Word Cloud Plots
#Installing the relevant packages
install.packages("ctv",dependencies = TRUE)
install.packages("wordcloud",dependencies = TRUE)
library(tm)
library(ctv)
library(wordcloud)
#Recalling the reviews from TripAdvisor
AllReviews <- paste(WinterGreen$Review, collapse=" ")
#Converting reviews into a single corpus
Review_source <- VectorSource(AllReviews)
corpus <- Corpus(Review_source)
#Cleaning the text
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, mystopwords)
#Creating a document-term matrix to locate most popular words
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
#Creating a word cloud for top 25 words in TripAdvisor
#(words and frequencies must cover the same 25 terms)
words <- names(frequency)
wordcloud(words[1:25],frequency[1:25], color = brewer.pal(8,"Dark2"), min.freq=5)
negs <- paste((unlist(as$all$neg.words)),collapse = ' ')
negs2 <- paste((unlist(as2$all$neg.words)),collapse = ' ')
negs3 <- paste((unlist(as3$all$neg.words)),collapse = ' ')
negs4 <- paste((unlist(as4$all$neg.words)),collapse = ' ')
negs5 <- paste((unlist(as5$all$neg.words)),collapse = ' ')
negs6 <- paste((unlist(as6$all$neg.words)),collapse = ' ')
negs7 <- paste((unlist(as7$all$neg.words)),collapse = ' ')
negs8 <- paste((unlist(as8$all$neg.words)),collapse = ' ')
negs9 <- paste((unlist(as9$all$neg.words)),collapse = ' ')
negs10 <- paste((unlist(as10$all$neg.words)),collapse = ' ')
nEgAtive <- c(negs,negs2,negs3,negs4,negs5,negs6,negs7,negs8,negs9,negs10)
nEgAtive <- gsub("-", "",nEgAtive)
NEG <- Corpus(VectorSource(nEgAtive))
ggg <- DocumentTermMatrix(NEG)
ggg2 <- as.matrix(ggg)
frequencies <- colSums(ggg2)
frequencies <- sort(frequencies, decreasing=TRUE)
wordS <- names(frequencies)
wordcloud(wordS[1:10],frequencies[1:10], color = brewer.pal(8,"Dark2"),min.freq=2)
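The ten `negs` assignments in the word cloud listing apply the same `paste(unlist(...), collapse = ' ')` step to ten polarity objects. If the `neg.words` columns are held in a list, the step can be written once and applied with `vapply()`. The sketch below uses invented words in place of the real columns, and the names `neg_lists`, `collapse_words` and `negative` are illustrative rather than taken from the dissertation.

```r
# Hypothetical stand-ins for as$all$neg.words, as2$all$neg.words, ...:
# each element is a list of character vectors, one per sentence
neg_lists <- list(
  list("bad", c("cold", "slow")),
  list("-", "rude")
)

# Flatten one neg.words column into a single space-separated string
collapse_words <- function(x) paste(unlist(x), collapse = " ")

# Apply to every feature; character(1) guarantees one string per feature
negative <- vapply(neg_lists, collapse_words, character(1))

# Remove the "-" placeholders, as the original listing does with gsub()
negative <- gsub("-", "", negative, fixed = TRUE)
print(negative)
```

The resulting character vector has one element per feature and can be passed to `Corpus(VectorSource(...))` in the same way as `nEgAtive`.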
Access to Dissertation
A Dissertation submitted to the University may be held by the Department (or School) within which the Dissertation was undertaken and made available for borrowing or consultation in accordance with University Regulations. Requests for the loan of dissertations may be received from libraries in the UK and overseas. The Department may also receive requests from other organisations, as well as individuals. The conservation of the original dissertation is better assured if the Department and/or Library can fulfill such requests by sending a copy. The Department may also make your dissertation available via its web pages. In certain cases where confidentiality of information is concerned, if either the author or the supervisor so requests, the Department will withhold the dissertation from loan or consultation for the period specified below. Where no such restriction is in force, the Department may also deposit the Dissertation in the University of Sheffield Library.
To be completed by the Author – Select (a) or (b) by placing a tick in the appropriate box If you are willing to give permission for the Information School to make your dissertation available in these ways, please complete the following:
(a) Subject to the General Regulation on Intellectual Property, I, the author, agree to this dissertation being made immediately available through the Department and/or University Library for consultation, and for the Department and/or Library to reproduce this dissertation in whole or part in order to supply single copies for the purpose of research or private study
(b) Subject to the General Regulation on Intellectual Property, I, the author, request that this dissertation be withheld from loan, consultation or reproduction for a period of [ ] years from the date of its submission. Subsequent to this period, I agree to this dissertation being made available through the Department and/or University Library for consultation, and for the Department and/or Library to reproduce this dissertation in whole or part in order to supply single copies for the purpose of research or private study
Name Lauren Rodgers
Department The Information School
Signed
Date 01/09/2015
To be completed by the Supervisor – Select (a) or (b) by placing a tick in the appropriate box
(a) I, the supervisor, agree to this dissertation being made immediately available through the Department and/or University Library for loan or consultation, subject to any special restrictions (*) agreed with external organisations as part of a collaborative project.
*Special restrictions
(b) I, the supervisor, request that this dissertation be withheld from loan, consultation or reproduction for a period of [ ] years from the date of its submission. Subsequent to this period, I, agree to this dissertation being made available through the Department and/or University Library for loan or consultation, subject to any special restrictions (*) agreed with external organisations as part of a collaborative project
Name
Department
Signed Date
THIS SHEET MUST BE SUBMITTED WITH DISSERTATIONS BY DEPARTMENTAL REQUIREMENTS.