Fine-Grained Sentiment Analysis of Restaurant Customer Reviews in Chinese Language

Suofei Feng 1 [email protected]   Eziz Durdyev [email protected]

Abstract—Chinese language processing is a challenging topic in the well-developed area of sentiment analysis. In this project we implement 3 types of 4-class classification models (SVM, XGBoost, LSTM) for fine-grained, or aspect-level, sentiment analysis of restaurant customer reviews in the Chinese language. There are 20 aspects for classification, each representing one type of target information in the reviews. We train one model separately for each element. The overall results of the models on the 20 aspects show that XGBoost has the best performance, based on average accuracy, weighted F1 score, and efficiency.

Index Terms—Chinese language, NLP, LSTM, SVM, XGBoost

I. INTRODUCTION

The era of information explosion brings an increasing demand for the ability to extract core messages from billions of records of data. Sentiment analysis, or opinion mining, is widely applied to extracting and studying subjective information in texts. By quantifying the opinions or attitudes in a large body of texts within a few minutes, sentiment analysis has gained popularity in various business scenarios for retrieving customer responses. In recent decades, considerable progress has been achieved in sentiment analysis of the English language. At the same time, a development comparable to the growth of the market has not been seen for the Chinese language [1]. In our project, we propose to implement a fine-grained (aspect-level) sentiment analysis of restaurant customer reviews in the Chinese language. The topic and data come from the 2018 AI Challenger competition.

The inputs are restaurant reviews in Chinese. The task is to classify each piece of review text into 4 classes ("not mentioned" [-2], "negative" [-1], "neutral" [0], "positive" [1]) under 20 aspects. Each aspect (or element) represents one type of information about the business. We develop and train 3 types of models, LSTM (Long Short-Term Memory), SVM (Support Vector Machine), and XGBoost (eXtreme Gradient Boosting), for each of the aspects.

1 Department of East Asian Languages and Cultures, Stanford University.

II. RELATED WORK

A. Machine Learning in Sentiment Analysis

Pang and Lee [2] briefly summarize the history of sentiment analysis and describe the related work as the computational treatment of "opinion, sentiment, and subjectivity in text" (p. 8). Early machine learning approaches to sentiment classification use unigrams (single words) and n-grams as features for subjectivity detection, and SVM or MNB (Multinomial Naïve Bayes) for classification [3] [4].

SVM has a longer history in sentiment analysis compared to gradient boosting and LSTM. It is one of the most used classification methods in sentiment analysis due to its ability to "generalize well in high dimensional feature spaces" [5]. Pang et al. [6] applied Naïve Bayes, maximum entropy classification, and support vector machines to classifying IMDb reviews by sentiment, determining whether a review is positive or negative. Although the differences were not large, SVMs achieved higher three-fold cross-validation accuracy (0.829) than Naïve Bayes (0.804) and maximum entropy classification (0.810). As this movie review sentiment analysis is comparable to our restaurant review sentiment analysis, we decide to build SVMs to classify sentiments for the 20 elements in our project.

On the other hand, although gradient boosting is one of the most applied "off the shelf" methods in general classification tasks, applications of boosting ensemble methods are surprisingly uncommon in text classification. Ehrentraut et al. [7] applied SVM and gradient tree boosting to the classification of Swedish patient records (texts) to detect hospital-acquired infections. Gradient tree boosting achieved higher performance than SVM in precision, recall, and F1 score. Chen and Guestrin [8] proposed XGBoost, a scalable and regularized version of gradient tree boosting.


Encouraged by these results and improvements, we decide to apply XGBoost in our project, as we need a scalable method to classify the 20 elements in the Chinese restaurant review data.

Recent years have seen substantial progress in NLP tasks with neural network approaches. LSTM is popular in sequence modeling for sentiment classification because of its robustness against gradient vanishing and exploding issues in long texts. Wang et al. [9] proposed an attention-based LSTM model with aspect embedding for aspect-level sentiment classification. They experimented with restaurant customer reviews in English and implemented a 3-class (positive, negative, neutral) classification on 5 aspects: food, price, service, ambience, and anecdotes/miscellaneous. The accuracy of this model improved by 2% compared with the standard LSTM model. However, considering the different nature of the Chinese language and the large number of aspects for classification (20), we decide to start with a standard LSTM model for the neural network approach in this project.

B. Chinese NLP Research

A review of sentiment analysis in the Chinese language was given by Peng et al. [10]. Apart from conducting sentiment analysis directly on Chinese, there is another approach: transform the task into sentiment analysis on English via machine translation. In our project, we conduct a mono-lingual experiment by directly extracting features from the original Chinese text. This is mainly because the style of our input texts is highly colloquial and context-specific, so information might be lost in the process of machine translation. According to this article, good results were obtained from a combination of a neural network (word2vec) for word representation and SVM for classification.

Peng et al. also mentioned different techniques for segmentation. As Chinese does not have spaces between words, it is necessary to use segmentation tools to extract words as the basic units of semantic meaning. They summarized that Jieba has high speed and good adaptation to different programming languages. For these reasons, we decide to use Jieba as our segmentation tool.

III. DATASET AND FEATURES

A. Dataset

We use the datasets provided by the AI Challenger official [11]. The training and validation data are manually labelled; a test dataset without labels is also provided. The training dataset contains 105,000 records of reviews, with labels from 4 classes {"positive" [1], "neutral" [0], "negative" [-1], "not mentioned" [-2]} on 20 aspects/elements under 6 categories. The validation set has 14,998 records of reviews. To be able to evaluate our models ourselves, we split the validation set into a smaller validation set (the first 7,500 records of the original validation set) and a test set (the remaining 7,498 records) with true labels. We make 20 plots of the class distributions for each of the three datasets, which show that the class distributions are very similar across all three. Table I shows all the elements and their corresponding categories. Here is an example input text. Because of limited space, we only include part of the review and its translation:

”吼吼吼,萌死人的棒棒糖,中了大众点评的霸王餐,太可爱了。一直就好奇这个棒棒糖是怎么个东西,大众点评给了我这个土老冒一个见识的机会。看介绍棒棒糖是用德国糖做的,不会很甜,中间的照片是糯米的,能食用,真是太高端大气上档次了...”

Translation: "Ha ha ha, the lollipop is soooo cute. I won the 'free meal prize' on Dazhongdianping [author's comment: similar to Yelp], this is so cute. I have always been curious about what the lollipop is like. Dazhongdianping gave me, the bumpkin, this opportunity to open my eyes. The introduction said it was made using German candy, not too sweet. The photo in the middle is made of glutinous rice, edible. It is really high-end..."

A glance at the Google translation: "Hey, the lollipop of the dead man, the overlord meal of the public comment, so cute..."

B. Feature Extraction

The main challenge in our project is preprocessing the data. Chinese is difficult to segment accurately because of the absence of spaces, the variable length of words, and the high flexibility of word formation. We apply the same preprocessing approach to the training, validation, and test datasets. For the reasons presented in the related work section, we use Jieba cut for segmentation. After segmentation, we obtain three lists of word lists, produced by segmenting the lists of sentences. We then train a Word2Vec model following the instructions in [12], and produce a vocabulary and an embedding matrix with the same word order as the vocabulary. A pad token and an unknown-word token are added to the vocabulary as well as the embedding matrix.


TABLE I: Elements

Location:    traffic, distance from business district, easy to find
Service:     wait time, waiter's attitude, parking convenience, serving speed
Price:       price level, cost-effective, discount
Environment: decoration, noise, space, cleanness
Dish:        portion, taste, look, recommendation
Others:      overall experience, willing to consume again

The next step is to digitize the input texts. We replace the words in each (segmented) sentence with their indices in the vocabulary. All the sentences are padded or cut to a length of 350 for SVM and XGBoost, and 400 for LSTM. The outputs are three matrices, for the training (105,000 × 350 or 400), validation (7,500 × 350 or 400), and test data (7,498 × 350 or 400). Fig. 1 shows the preprocessing procedure for the training data.
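For illustration only, a minimal sketch of this pipeline could look like the following. It assumes the raw reviews are already loaded as a list of strings and uses the gensim 4 Word2Vec API; the helper name build_inputs and the choice of indices for the pad and unknown tokens are ours, not necessarily those of the actual project code:

import jieba
import numpy as np
from gensim.models import Word2Vec

def build_inputs(sentences, dim=300, min_count=3, window=5, max_len=350):
    # Segment each review into words with Jieba.
    tokenized = [list(jieba.cut(s)) for s in sentences]

    # Train a 300-d Word2Vec model (min count 3, window 5), as in Fig. 1.
    w2v = Word2Vec(tokenized, vector_size=dim, min_count=min_count, window=window)

    # Vocabulary and embedding matrix in the same word order, plus pad/unknown tokens.
    vocab = {"<pad>": 0, "<unk>": 1}
    rows = [np.zeros(dim), np.zeros(dim)]
    for word in w2v.wv.index_to_key:
        vocab[word] = len(vocab)
        rows.append(w2v.wv[word])
    embedding_matrix = np.vstack(rows)

    # Digitize: replace words by vocabulary indices, then pad or cut to max_len.
    indices = np.zeros((len(tokenized), max_len), dtype=np.int64)  # 0 is <pad>
    for i, words in enumerate(tokenized):
        ids = [vocab.get(w, vocab["<unk>"]) for w in words[:max_len]]
        indices[i, :len(ids)] = ids
    return vocab, embedding_matrix, indices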

IV. CLASSIFICATION MODELS

A. Baseline

The baseline model is provided by the AI Challenger official [13]. Features are extracted with the TF-IDF framework, whose representation values are based on the frequency of a word in a given document. The classification model is an RBF SVC. The average F1 score across the 20 elements is around 0.2.
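The TF-IDF plus RBF SVC setup can be sketched as below; this sketch assumes the segmented reviews are re-joined into space-separated strings so that scikit-learn can tokenize them, and it is not the official baseline script:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_baseline(texts_train, y_train):
    # texts_train: segmented reviews joined with spaces; y_train: labels in {-2, -1, 0, 1}.
    # TF-IDF weights each word by its in-document frequency, discounted by how
    # common the word is across the corpus; an RBF-kernel SVC does the classification.
    model = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf"))
    model.fit(texts_train, y_train)
    return model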

B. LSTM

LSTM, or Long Short-Term Memory network, is a type of recurrent neural network (RNN) that processes sequences of information by feeding the output of preceding time steps into subsequent ones. Unlike traditional RNNs, LSTM networks are not vulnerable to gradient explosion or vanishing when dealing with long sequences of data.

Fig. 1: Preprocessing Flowchart. (List of 105,000 sentences → Jieba segmentation → list of 105,000 word lists → Gensim Word2Vec, 300-d, min count 3, window 5 → vocabulary of size 6,573 and 300-d word vectors → embedding matrix → indices matrix of 105,000 × 350 or 400.)

This is achieved by the forget gate, input gate, and output gate in each hidden unit. These gates decide how much information to let through, and can therefore connect information across a wide gap [14].

We build a many-to-one LSTM model for each of the 20 elements. Fig. 2 shows the structure of the model. The inputs are the index matrices output by the preprocessing step. The labels are transformed to one-hot vectors, each a (1, 4) row vector. The embedding matrix is used as the weights of the embedding layer, which is not trainable. We add arbitrary class weights to address the class imbalance problem. The loss function is categorical cross-entropy:

L(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{4} y_{ij} \log(p_{ij})    (1)

with n the number of examples, j the class ID, y_{ij} the true label, and p_{ij} the predicted probability. Accuracy and weighted F1 score are the evaluation metrics.
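A minimal Keras sketch of one such per-element model, under the architecture and hyperparameters described in this report (frozen Word2Vec embedding, two 128-unit LSTM layers with dropout 0.5, a 4-way softmax, Adam with learning rate 0.001, batch size 128, 20 epochs), could look as follows; the label-to-index mapping and the example class weights are our own assumptions:

from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

def build_lstm(embedding_matrix, max_len=400, units=128, dropout=0.5):
    vocab_size, dim = embedding_matrix.shape
    model = Sequential([
        # Frozen embedding layer initialized with the Word2Vec matrix (6573 x 300).
        Embedding(vocab_size, dim, weights=[embedding_matrix],
                  input_length=max_len, trainable=False),
        # Two LSTM layers, 128 units each, dropout and recurrent dropout of 0.5.
        LSTM(units, dropout=dropout, recurrent_dropout=dropout, return_sequences=True),
        LSTM(units, dropout=dropout, recurrent_dropout=dropout),
        # Softmax over the 4 classes.
        Dense(4, activation="softmax"),
    ])
    model.compile(optimizer=Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Example training call for one element; labels {-2, -1, 0, 1} are shifted to
# {0, 1, 2, 3} before one-hot encoding, and the class weights are arbitrary.
# model = build_lstm(embedding_matrix)
# model.fit(X_train, to_categorical(y_train + 2, num_classes=4),
#           validation_data=(X_val, to_categorical(y_val + 2, num_classes=4)),
#           batch_size=128, epochs=20, class_weight={0: 1.0, 1: 3.0, 2: 3.0, 3: 1.0})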

C. Support Vector Machine

The SVM classifier is one of the most preferred classification methods among classic machine learning methods, since it generalizes well in high-dimensional feature spaces.


Fig. 2: Two-Layer LSTM Model. (Input: 400 × 1 word indices → embedding layer, 6573 × 300, trainable = False → LSTM layer (× 2), units: 128, dropout: 0.5 → dense (4), activation: softmax → class label output.)

We first extract the embedding vector of each word in a sentence through its index, forming a sentence matrix. We then create sentence feature vectors for the training, validation, and test sets by averaging the sentence matrix along the vertical axis, so each observation is a vector of size (1, 300). Using 10-fold cross-validation error, we find the best kernel to be linear.
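A hedged sketch of this sentence-averaging plus linear-kernel SVM step, reusing the index and embedding matrices from preprocessing (in this simplified version, padding positions contribute zero vectors to the average, and the helper names are ours):

from sklearn.svm import SVC

def sentence_vectors(index_matrix, embedding_matrix):
    # Look up the 300-d vector of every word index and average along the
    # sentence axis, giving one (1, 300) feature vector per review.
    return embedding_matrix[index_matrix].mean(axis=1)

def train_svm(train_indices, y_train, embedding_matrix):
    X_train = sentence_vectors(train_indices, embedding_matrix)
    clf = SVC(kernel="linear")  # linear kernel selected by 10-fold cross-validation
    clf.fit(X_train, y_train)
    return clf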

D. XGBoost

We feed our XGBoost models with the same input as the SVM models. We use the GridSearchCV function from the sklearn package in Python to tune the parameters 'learning rate': [0.01, 0.05, 0.1] and 'max depth': [3, 5] with 5- and 10-fold cross-validation. With both 5- and 10-fold cross-validation, we get the best XGBoost parameters as learning rate = 0.1 and max depth = 5. Although XGBoost is not a highly preferred method in text classification, it yields the best results based on test set accuracy and F1 score.
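The grid search can be sketched as below; we assume the four labels have been remapped to 0–3 for the xgboost sklearn wrapper, and the weighted-F1 scoring choice is illustrative:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

def tune_xgboost(X_train, y_train, folds=5):
    # Same sentence feature vectors as the SVM models; y_train remapped to {0, 1, 2, 3}.
    param_grid = {"learning_rate": [0.01, 0.05, 0.1], "max_depth": [3, 5]}
    search = GridSearchCV(XGBClassifier(), param_grid, cv=folds, scoring="f1_weighted")
    search.fit(X_train, y_train)
    # Reported best setting: learning_rate = 0.1, max_depth = 5.
    return search.best_estimator_, search.best_params_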

V. RESULTS AND DISCUSSION

A. Results

After tuning on a subset of 500 training records and 100 validation records, we choose 0.5 for LSTM layer dropout and recurrent dropout, 128 hidden units, a batch size of 128, and the Adam optimizer with a learning rate of 0.001, β1 of 0.9, and β2 of 0.999. At the beginning we tried 50 epochs for all the elements and found that most of them converge after around 14 to 15 epochs; we therefore decide to train for 20 epochs. As we have 105k records in the training dataset, a batch size of 128 makes the training process comparatively fast. Adding different arbitrary class weights to different elements does not bring clear improvement. Fig. 3 plots the training and validation accuracies and losses for the first element.

Fig. 3: Training and validation losses & accuracies of the 1st element

To keep this report concise, we do not report all the training details and statistics for the 20 elements here.

Generally, the LSTM model performs well when sentiment features are apparent:

”一直经过这条路第一次进去拔草还是通过看美食节目首先说说环境还是很不错的感觉很适合小情侣来很温馨的感觉喝喝下午茶感觉特别好服务也很好哦都很勤快可能不是周末中午人不多很安静非常喜欢这样的气氛再说说美食点了一个新款黄桃冰激凌披萨薄薄的披萨真的蛮好吃也以前一直吃厚的现在发觉薄的也不错下次试试榴莲披萨日式猪排饭真的量好多比图片看起来还多就是有点偏咸了意式千层面咬起来都是芝士的味道厚厚的感觉好吃小食还行量挺大玫瑰巧克力和榛果海盐拿铁真的都好好喝噢下次再去必点目前大众点评买单还能享受95折真的挺划算以后还会经常光顾的”

(Rough) translation: "Always walk through this road. This is my first time walking into the restaurant. Learned about it through a TV show. Environment is pretty good, suitable for couples, warm feeling. It's nice to have afternoon tea. Service is good, too... Maybe because it's not the weekend, not so many people at noon. It's quiet. Really like the atmosphere. About the food, ... pizza really good, wanna try another type next time... The pork combo has a large portion, just a little too salty... Drinks are tasty. Must try next time. Using Dazhongdianping will give you a 5% discount, a good bargain. Will come often in the future."

The prediction results are shown in Table II. Our linear-kernel SVM models with the new sentence feature vectors show great improvement over the baseline model.


TABLE II: Predictions of the Example Text

 1: -2    2: -2    3: -2    4: -2    5: 1
 6: -2    7: -2    8: -2    9: -2   10: 1
11: -2   12: 1    13: -2   14: -2   15: 1
16: 1    17: -2   18: -2   19: 1    20: 1

SVM prediction F1 scores and test accuracies for each of the 20 elements range from 0.44 to 0.94 (Fig. 4). XGBoost yields better results in terms of test accuracy and F1 score. Fig. 5 shows the test accuracies across the 20 elements with LSTM, SVM, and XGBoost. According to the graph, LSTM has the most fluctuating performance over the 20 elements, while XGBoost is relatively more stable with higher accuracies. However, since we observed a serious class imbalance problem in the original datasets, we cannot rely on accuracies alone to evaluate our models. Weighted F1 scores are investigated for a better understanding of the performance.
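For reference, both metrics can be computed with scikit-learn as sketched below (variable names are placeholders):

from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    # Weighted F1 averages the per-class F1 scores, weighting each class by its
    # number of true examples in the test set.
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred, average="weighted")

# Example with the four labels used in this project:
# evaluate([-2, -2, 1, 0, 1], [-2, 1, 1, 0, -2])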

TABLE III: Weighted F1 Scores on the Test Dataset for 3 Topics

Model     | Dish Recomm. | Wait Time | Traffic Convenience
LSTM      | 0.6524       | 0.8776    | 0.8389
SVM       | 0.7561       | 0.8381    | 0.8563
XGBoost   | 0.7582       | 0.8326    | 0.8382

We also check the confusion matrices from XGBoost, which are presented below. The models are biased towards the dominating classes (such as -2 in E13 and 1 in E16). In both matrices, columns represent the true labels -2, -1, 0, and 1 from left to right, and rows represent the predicted labels -2, -1, 0, and 1 from top to bottom.

E_{13} = \begin{pmatrix} 4444 & 1 & 8 & 265 \\ 331 & 3 & 3 & 35 \\ 565 & 2 & 7 & 89 \\ 1331 & 2 & 4 & 408 \end{pmatrix}

E_{16} = \begin{pmatrix} 46 & 14 & 91 & 206 \\ 8 & 26 & 159 & 87 \\ 16 & 21 & 1456 & 1399 \\ 15 & 3 & 785 & 3166 \end{pmatrix}
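Such matrices can be produced with scikit-learn; note that confusion_matrix places true labels on rows and predictions on columns, so it is transposed here to match the orientation described above (the helper name is ours):

from sklearn.metrics import confusion_matrix

def paper_style_confusion(y_true, y_pred, labels=(-2, -1, 0, 1)):
    # scikit-learn convention: rows = true labels, columns = predictions.
    # Transposing gives columns = true labels, rows = predictions, as used above.
    return confusion_matrix(y_true, y_pred, labels=list(labels)).T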

B. General Discussion

We use the weighted F1 score along with accuracy because of the imbalance in class distributions, which is prevalent across the elements. The confusion matrix of element 13 ("space") reveals this problem in the XGBoost model, and it also exists across all models. Class weights might be a good strategy, though in LSTM they do not show a clear advantage.

Fig. 4: SVM Weighted F1 Scores and Test Accuracies across 20 elements

Fig. 5: Test set accuracy across 20 elements with LSTM, SVM, and XGBoost

On the one hand, the situation might be improved through more precise assignment of class weights, e.g., 1/(number of class-j examples in the training data). On the other hand, data augmentation such as bootstrapping the minority classes might help. Because judging the sentiment requires the integrity of the context, cropping is not suitable in this case. Changing the key feature words might also be worth trying.
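A sketch of the inverse-frequency class-weight assignment proposed above (this is a suggestion we have not verified, and the normalization choice is ours):

import numpy as np

def inverse_frequency_weights(y_train, classes=(-2, -1, 0, 1)):
    # Weight each class by 1 / (number of class-j examples in the training data),
    # then rescale so the weights average to 1.
    y_train = np.asarray(y_train)
    counts = np.array([(y_train == c).sum() for c in classes], dtype=float)
    weights = 1.0 / counts
    weights *= len(classes) / weights.sum()
    return dict(zip(classes, weights))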

VI. CONCLUSIONS AND FUTURE WORK

In general, XGBoost yields better results in terms of F1 scores and test accuracies. Our models improve on the baseline model partially due to better preprocessing, and partially due to better-tuned hyperparameters. Apart from the aspects mentioned in the Discussion section, this task can be improved in the following ways: 1) collect data with higher label quality (some examples are difficult to classify even for human beings); 2) improve the quality of the language models with contextual representations, e.g., BERT; 3) apply an attention mechanism for long input texts.


VII. CONTRIBUTIONS

The team members contributed equally to the project. Suofei was responsible for training the baseline model, data preprocessing, and the construction and training of the LSTM models. Eziz contributed the construction and training of the SVM and XGBoost models. Both members worked on the poster and the report.

VIII. CODE

The code can be found at:

https://github.com/suofeif/CS229-Project.

REFERENCES

[1] H. Peng, E. Cambria, and A. Hussain, "A review of sentiment analysis research in Chinese language," Cognitive Computation, vol. 9, no. 4, pp. 423–435, 2017.

[2] B. Pang and L. Lee, "Opinion mining and sentiment analysis," Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, pp. 1–135, 2008.

[3] E. Boiy and M. Moens, "A machine learning approach to sentiment analysis in multilingual web texts," Information Retrieval, vol. 12, no. 5, pp. 526–558, 2009.

[4] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008, ch. 13.

[5] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in European Conference on Machine Learning, vol. 1398, pp. 137–142, 1998.

[6] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment classification using machine learning techniques," in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, vol. 10, pp. 79–86, 2002.

[7] C. Ehrentraut, M. Ekholm, H. Tanushi, J. Tiedemann, and H. Dalianis, "Detecting hospital-acquired infections: A document classification approach using support vector machines and gradient tree boosting," Health Informatics Journal, vol. 24, no. 1, pp. 24–42, 2018.

[8] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," 2016. [Online]. Available: http://dmlc.cs.washington.edu/data/pdf/XGBoostArxiv.pdf

[9] Y. Wang, M. Huang, and L. Zhao, "Attention-based LSTM for aspect-level sentiment classification," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 606–615, 2016.

[10] H. Peng, E. Cambria, and A. Hussain, "A review of sentiment analysis research in Chinese language," Cognitive Computation, vol. 9, no. 4, pp. 423–435, 2017.

[11] AI Challenger 2018. [Online]. Available: https://challenger.ai/competition/fsauor2018

[12] W. Gong. (2017) Chinese word vectors. [Online]. Available: https://primer.ai/blog/Chinese-Word-Vectors/

[13] AI Challenger 2018 baseline. [Online]. Available: https://github.com/AIChallenger/AI_Challenger_2018

[14] C. Olah. (2015) Understanding LSTM networks. [Online]. Available: http://colah.github.io/posts/2015-08-Understanding-LSTMs/