
2019 International Conference on Electrical, Computer and Communication Engineering (ECCE), 7-9 February, 2019

Detection of fake online reviews using semi-supervised and supervised learning

Rakibul Hassan
Dept. of Computer Science & Engineering

Rajshahi University of Engineering & Technology
Rajshahi, Bangladesh

[email protected]

Md. Rabiul Islam
Dept. of Computer Science & Engineering

Rajshahi University of Engineering & Technology
Rajshahi, Bangladesh
rabiul [email protected]

Abstract—Online reviews have a great impact on today’s business and commerce. Decision making for the purchase of online products largely depends on reviews given by users. Hence, opportunistic individuals or groups try to manipulate product reviews for their own interests. This paper introduces several semi-supervised and supervised text mining models to detect fake online reviews and compares the efficiency of both techniques on a dataset of hotel reviews.

Index Terms—Fake reviews, semi-supervised learning, supervised learning, Naive Bayes classifier, Support Vector Machine classifier, Expectation-Maximization algorithm.

I. INTRODUCTION

Technologies are changing rapidly. Old technologies are continuously being replaced by new and more sophisticated ones that enable people to get their work done efficiently. One such evolution is the online marketplace: we can shop and make reservations through online websites, and almost everyone checks reviews before purchasing a product or service. Online reviews have therefore become a major source of reputation for companies, and they have a large impact on the advertisement and promotion of products and services. With the spread of the online marketplace, fake online reviews are becoming a serious concern. People can post false reviews to promote their own products, which harms genuine users, and competing companies can try to damage each other’s reputation with fake negative reviews.

Researchers have studied many approaches for detecting these fake online reviews. Some approaches are based on review content, and some on the behavior of the user posting the reviews. Content-based studies focus on what is written in the review, that is, its text, whereas user-behavior-based methods focus on the reviewer’s country, IP address, number of posts, and so on. Most of the proposed approaches are supervised classification models, though a few researchers have also worked with semi-supervised models, which have been introduced because of the lack of reliable labeling of reviews.

In this paper, we build several classification approaches for detecting fake online reviews, some semi-supervised and others supervised. For semi-supervised learning, we use the Expectation-Maximization algorithm. The statistical Naive Bayes classifier and Support Vector Machines (SVM) are used as classifiers to improve classification performance. We focus mainly on content-based approaches, using word frequency counts, sentiment polarity, and review length as features.

In the following Section II, we discuss related work. Section III describes our proposed approaches and experimental setup. Results and findings of our research are discussed in Section IV. Section V includes conclusions and future work.

II. RELATED WORK

Many approaches and techniques have been proposed in the field of fake review detection. The following methods have been able to detect fake online reviews with high accuracy. Sun et al. [1] divided these approaches into two categories.

a) Content-Based Methods: Content-based methods focus on the content of the review, that is, its text and what it says. Heydari et al. [2] attempted to detect spam reviews by analyzing the linguistic features of the review. Ott et al. [3] used three techniques to perform classification: genre identification, detection of psycholinguistic deception, and text categorization [1]–[3].

1) Genre Identification: The parts-of-speech (POS) distribution of a review was explored by Ott et al. [3]. They used frequency counts of POS tags as the features representing the review for classification.

2) Detection of Psycholinguistic Deception: The psycholinguistic method assigns psycholinguistic meanings to the important features of a review. The Linguistic Inquiry and Word Count (LIWC) software of Pennebaker et al. [4] was used to build features for the reviews.

3) Text Categorization: Ott et al. experimented with n-grams, which are now popularly used as an important feature in fake review detection.
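As an illustration of the text categorization idea (not the authors' code), n-gram features can be obtained by counting contiguous token sequences; a minimal sketch in Python:

```python
# Illustrative sketch: extracting unigram and bigram frequency features
# of the kind used for text categorization. The example review text is
# a made-up placeholder, not data from the paper's corpus.
from collections import Counter

def ngram_counts(text, n):
    """Count the n-grams (tuples of n consecutive tokens) in a text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

review = "the room was clean and the staff was friendly"
unigrams = ngram_counts(review, 1)
bigrams = ngram_counts(review, 2)
print(unigrams[("the",)])         # 2
print(bigrams[("was", "clean")])  # 1
```

In practice these counts form one column per distinct n-gram in a document-term matrix fed to the classifier.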

Other linguistic features have also been explored. For example, Feng et al. [5] took lexicalized and unlexicalized syntactic features by constructing sentence parse trees for fake review detection. They showed experimentally that deep syntactic features improve prediction accuracy. Li et al. [6] explored a variety of generic deceptive signals that contribute to fake review detection. They also concluded that combining general features such as LIWC or POS with bag of words is more robust than bag of words alone. Metadata about reviews, such as review length, date, time, and rating, has also been used as features by some researchers.

978-1-5386-9111-3/19/$31.00 ©2019 IEEE

b) Behavior-Feature-Based Methods: Behavior-feature-based studies focus on the reviewer, that is, the characteristics of the person giving the review. Lim et al. [7] addressed the problem of review spammer detection: finding users who are the source of spam reviews. People who post intentional fake reviews behave significantly differently from normal users. They identified the following deceptive rating and review behaviors.

• Giving unfair ratings too often: Professional spammers generally post more fake reviews than real ones. Suppose a product has an average rating of 9.0 out of 10, but a reviewer has given it a 4.0. If, analyzing the reviewer’s other reviews, we find that he often gives this type of unfair rating, then we can flag him as a spammer.

• Giving good ratings to one’s own country’s products: Sometimes people post fake reviews to promote products of their own region. This type of spamming is mostly seen with movie reviews. Suppose that on an international movie website an Indian movie has a rating of 9.0 out of 10.0, and most of the reviewers are Indian. This kind of spamming can be detected using the addresses of the reviewers.

• Giving reviews on a vast variety of products: Each person has specific interests and is generally not interested in all types of products; a person who loves gaming may not be interested in classic literature. If we find people giving reviews on so many types of products that it exceeds general behavior, we can suspect that their reviews are intentional fakes.

Deceptive online review detection is generally considered a classification problem, and one popular approach is to use supervised text classification techniques [5]. These techniques are robust if the training is performed on large datasets of labeled instances from both classes: deceptive opinions (positive instances) and truthful opinions (negative instances) [8]. Some researchers have also used semi-supervised classification techniques.

For the supervised classification process, ground truth is determined by helpfulness votes, rating-based behaviors, seed words, human observation, etc. Sun et al. [1] proposed a method that obtains classification results through a bagging model combining three classifiers: a product word composition classifier (PWCC), a TRIGRAMSSVM classifier, and a BIGRAMSSVM classifier. They introduced the product word composition classifier to predict the polarity of the review. The model maps the words of a review into a continuous representation while concurrently integrating the product-review relations. To build the document model, they took the product word composition vectors as input and used a convolutional neural network (CNN) to build the representation model. After bagging the result with the TRIGRAMSSVM and BIGRAMSSVM classifications, they obtained an F-score of 0.77.

However, the supervised method has some challenges to overcome. The following problems occur with supervised techniques.

• Assuring the quality of the reviews is difficult.
• Labeled data points to train the classifier are difficult to obtain.
• Humans are poor at labeling reviews as fake or genuine.

Hence, Rout et al. [8] proposed a semi-supervised method in which labeled and unlabeled data are trained together. They proposed using semi-supervised methods in the following situations:

1) when reliable labeled data are not available;
2) given the dynamic nature of online reviews;
3) when designing heuristic rules is difficult.

They proposed several semi-supervised learning techniques, including co-training, Expectation-Maximization, label propagation and spreading, and positive-unlabeled learning [8]. They used several classifiers, including k-nearest neighbors, random forest, logistic regression, and stochastic gradient descent. Using semi-supervised techniques, they achieved a highest accuracy of 84%.

III. PROPOSED WORK

A. Dataset Description

In this paper, the ‘gold standard’ dataset developed by Ott et al. [3], [8] is used in our evaluations. The dataset contains 1,600 text reviews of 20 hotels in the Chicago area, USA: 800 fake reviews and 800 true reviews. For the evaluations, a tag of ‘0’ denotes deceptive reviews, whereas ‘1’ denotes genuine reviews. Of the genuine reviews, 400 are written with negative sentiment polarity and 400 with positive sentiment polarity; similarly, among the fake reviews, 400 are positive and the remaining 400 negative. These reviews were collected from various sources: the deceptive reviews were generated using Amazon Mechanical Turk (AMT), and the rest were obtained from online review websites such as Yelp, TripAdvisor, Expedia, and Hotels.com.

For the evaluations, the dataset is partitioned in a fixed way. From the 1,600 examples in the corpus, two sets are created: a training set and a test set, partitioning the corpus in ratios of 75:25 and 80:20 respectively. The examples in each set are chosen by random sampling.
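The random-sampling partition described above can be sketched as follows (an illustrative sketch, not the authors' code; the placeholder corpus and fixed seed are assumptions):

```python
# Illustrative sketch: partitioning the 1,600-review corpus by random
# sampling into the two fixed splits used in the evaluations (75:25 and
# 80:20). The (text, label) pairs here are hypothetical placeholders.
import random

def partition(examples, train_ratio, seed=0):
    """Randomly shuffle the examples and split at the given train ratio."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

corpus = [(f"review_{i}", i % 2) for i in range(1600)]  # label: 0=fake, 1=genuine
train_75, test_25 = partition(corpus, 0.75)
train_80, test_20 = partition(corpus, 0.80)
print(len(train_75), len(test_25))  # 1200 400
print(len(train_80), len(test_20))  # 1280 320
```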

B. Proposed methodology

For detection of fake online reviews, we start with raw text data, using a dataset that was already labeled by previous researchers. We remove unnecessary words such as articles and prepositions from the data. The text data are then converted into numeric form to make them suitable for the classifier. Important and necessary features are extracted, and then the classification process takes place.

Fig. 1. Stages of proposed feature extraction process

As we used the ‘gold standard’ dataset prepared by Ott et al. [3], we did not require steps such as handling missing values, removing inconsistency, or removing redundancy. Instead, the preprocessing tasks were to merge the texts, create a dictionary, and map the texts to numeric values.

We have used word frequency count, sentiment polarity, and length of the review as our features. We have taken 2,000 words as features; hence the size of our feature matrix is 1600×2002. We have not taken n-grams or parts of speech as features, because these are features derived from the bag of words and may cause over-fitting. The feature extraction process is summarized in Fig. 1.

As Fig. 1 shows, when we work with the i-th review, its corresponding features are generated by the following procedure.

1) Each review first goes through tokenization. Then unnecessary words are removed and candidate feature words are generated.

2) Each candidate feature word is checked against the dictionary; if an entry is present, its frequency is counted and added to the column of the feature vector that corresponds to the numeric map of the word.

3) Alongside the frequency counting, the length of the review is measured and added to the feature vector.

4) Finally, the sentiment score, which is available in the dataset, is added to the feature vector. We encode negative sentiment as zero and positive sentiment as a positive value in the feature vector.
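The four steps above can be sketched as a single feature-extraction function. This is a minimal illustration, not the authors' implementation: the stop-word list, the four-word dictionary, and the sentiment encoding are assumed for the example.

```python
# Sketch of the per-review feature extraction steps described above.
# The stop-word list, dictionary, and sentiment encoding are
# illustrative assumptions.
from collections import Counter

STOP_WORDS = {"a", "an", "the", "in", "on", "at", "of"}  # articles/prepositions

def extract_features(review_text, sentiment, dictionary):
    """Return [word frequencies..., review length, sentiment score]."""
    # 1) tokenize and drop unnecessary words to get candidate feature words
    tokens = [w for w in review_text.lower().split() if w not in STOP_WORDS]
    counts = Counter(tokens)
    # 2) frequency count for each dictionary word, in dictionary order
    freq = [counts[word] for word in dictionary]
    # 3) review length (token count before stop-word removal)
    length = len(review_text.split())
    # 4) sentiment: 0 for negative, a positive value for positive
    score = 1 if sentiment == "positive" else 0
    return freq + [length, score]

dictionary = ["clean", "staff", "location", "dirty"]
vec = extract_features("The staff was friendly and the room was clean",
                       "positive", dictionary)
print(vec)  # [1, 1, 0, 0, 9, 1]
```

With the paper's 2,000-word dictionary, each review would yield a 2,002-dimensional vector in the same way.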

We have implemented both semi-supervised and supervised classification. For semi-supervised classification of the dataset, we used the Expectation-Maximization (EM) algorithm, applied as in Karimpour et al. [9] to label unlabeled data for use in training. The algorithm operates as follows. A classifier is first derived from the labeled dataset. This classifier is then used to label the unlabeled dataset; let this predicted set of labels be PU. Next, another classifier is derived from the combination of the labeled and unlabeled datasets and is used to classify the unlabeled dataset again. This process is repeated until the set PU stabilizes. After a stable PU set is produced, we train the classification algorithm on the combined labeled and unlabeled training sets and deploy it to predict the test dataset [8]. The algorithm is given below.

Fig. 2. Expectation-Maximization Algorithm
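The iterative self-labeling loop described above can be sketched as follows. This is a toy illustration, not the paper's implementation: it uses a simple nearest-centroid classifier as a stand-in for Naive Bayes or SVM, and made-up two-dimensional points instead of review features.

```python
# Minimal runnable sketch of the EM-style self-labeling loop: label the
# unlabeled set, retrain on the union of labeled and pseudo-labeled data,
# and repeat until the predicted label set PU stabilizes.

def train_centroids(X, y):
    """'Train' a nearest-centroid model: mean point per class label."""
    cents = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        cents[label] = [sum(c) / len(pts) for c in zip(*pts)]
    return cents

def predict(cents, X):
    """Assign each point the label of its nearest centroid."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(cents, key=lambda l: dist(cents[l], x)) for x in X]

X_lab, y_lab = [[0.0, 0.0], [1.0, 1.0]], [0, 1]           # labeled set
X_unlab = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]]            # unlabeled set

pu = predict(train_centroids(X_lab, y_lab), X_unlab)       # initial labels
while True:
    model = train_centroids(X_lab + X_unlab, y_lab + pu)   # retrain on union
    new_pu = predict(model, X_unlab)
    if new_pu == pu:                                       # PU has stabilized
        break
    pu = new_pu
print(pu)  # [0, 1, 0]
```

Once PU is stable, the final model trained on the combined set is the one deployed on the test data.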

As classifiers, we used Support Vector Machines (SVM) and the Naive Bayes (NB) classifier with the EM algorithm. The scikit-learn package for the Python programming language provides a sophisticated library of these classifiers; hence, for our research work, we used Python with the scikit-learn and numpy packages. We tuned the parameters of the SVM for better results. For supervised classification, we used the Naive Bayes and SVM classifiers. The Naive Bayes classifier can be applied where the conditional independence property holds; since text comes spontaneously from a user’s mind, we cannot know what the next word or line will be, so Naive Bayes is popularly used in text mining. It is a probabilistic method and is also very fast to compute.
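A minimal sketch of this scikit-learn setup is shown below. The toy feature vectors stand in for the real 2,002-dimensional review features, and the gamma values swept are illustrative; the paper does not publish its exact tuning code.

```python
# Sketch of the supervised setup: Naive Bayes and SVM via scikit-learn,
# with gamma tuned while C is held constant. Data are tiny placeholders.
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

X_train = [[2, 0, 1, 9, 1], [0, 3, 0, 7, 0], [1, 0, 2, 8, 1], [0, 2, 0, 6, 0]]
y_train = [1, 0, 1, 0]  # 1 = genuine, 0 = deceptive
X_test = [[2, 0, 1, 8, 1], [0, 3, 1, 7, 0]]

# Naive Bayes: a natural fit for non-negative word-frequency features.
nb = MultinomialNB().fit(X_train, y_train)

# SVM: sweep gamma with C fixed and keep the best-fitting model.
best_svm = max((SVC(C=1.0, gamma=g).fit(X_train, y_train)
                for g in (0.1, 0.3, 0.6, 0.8)),
               key=lambda m: m.score(X_train, y_train))

print(list(nb.predict(X_test)))
```

In the real experiments, model selection would of course be done on held-out data rather than the training score used in this toy sweep.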

IV. RESULTS AND PERFORMANCE ANALYSIS

A. Experimental Environment and Tools

We ran our experiments on a machine with an Intel(R) Core(TM) i5-4200U CPU at 1.6 GHz, 6 GB of RAM, a 64-bit OS on an x64-based processor, and a 1 TB hard disk. We used Linux (Ubuntu 16.04) as our operating system, and the Python programming language with the scikit-learn and numpy packages.

B. Results

We used the Expectation-Maximization (EM) algorithm for semi-supervised classification, with Support Vector Machines (SVM) and the Naive Bayes classifier as classifiers. We divided our dataset into train:test ratios of 75:25 and 80:20 for each classification process.

Fig. 3. Graph showing Gamma parameter vs Accuracy for EM with SVM

For semi-supervised classification with SVM, we tuned different gamma parameters while keeping the C parameter constant. The accuracy graph is shown in Fig. 3. From the graph, for semi-supervised classification with the SVM classifier, we found an accuracy of 81.34% for the 80:20 split ratio and 80.47% for the 75:25 split ratio, with gamma equal to 0.3 and 0.6 respectively. For semi-supervised classification with the Naive Bayes classifier, we obtained accuracies of 85.21% and 84.87% for split ratios of 80:20 and 75:25 respectively.

Rout et al. [8], using semi-supervised classification with EM and positive-unlabeled learning respectively, obtained highest accuracies of 83.00% and 83.75% for a train:test ratio of 80:20. They tried logistic regression, k-nearest neighbors, stochastic gradient descent, and random forest as classifiers.

We also tried supervised classification techniques to evaluate their performance on our dataset, using the Naive Bayes and SVM classifiers. For the SVM classifier, we tuned the gamma parameter while keeping the C parameter constant to obtain a better fit of the model. The results are shown in Fig. 4.

For supervised classification with the SVM classifier, we found an accuracy of 82.28% for the 80:20 split ratio and 82.04% for the 75:25 split ratio, with gamma equal to 0.1 and 0.8 respectively. For supervised classification with the Naive Bayes classifier, we obtained highest accuracies of 86.32% and 86.21% for split ratios of 80:20 and 75:25 respectively.

C. Performance Analysis

A histogram showing the performance of our implemented techniques and of previous work on the dataset is given in Fig. 5.

Fig. 4. Graph showing Gamma parameter vs Accuracy for supervised SVM classifier

Fig. 5. Histogram showing performances of implemented techniques.

In our research work, we chose our features carefully to reduce over-fitting, avoiding derived features such as bigrams or trigrams. We took review length as a feature because of its clear significance, and we chose Naive Bayes as our classifier considering the properties of our dataset. By doing so, we were able to increase the accuracy of semi-supervised classification to 85.21%, where Rout et al. [8] obtained a highest accuracy of 83.75%. We also found a highest accuracy of 86.32% using supervised classification with the Naive Bayes classifier. Our findings are summarized in Table I.

V. CONCLUSIONS AND FUTURE WORK

We have shown several semi-supervised and supervised text mining techniques for detecting fake online reviews in this research. We combined features from several research works to create a better feature set, and we tried some classifiers that were not used in previous work. Thus, we were able to improve on the accuracy of the previous semi-supervised techniques of Rout et al. [8]. We also found that the supervised Naive Bayes classifier gives the highest accuracy. This suggests that our dataset is well labeled, since, as we know, semi-supervised models work well when reliable labeling is not available.

TABLE I
Comparative summary of semi-supervised and supervised learning techniques

In our research work, we worked only on user reviews. In future, user behaviors can be combined with review texts to construct a better classification model. Advanced preprocessing tools for tokenization can be used to make the dataset more precise, and the effectiveness of the proposed methodology can be evaluated on a larger dataset. This research was done only for English reviews; it can be extended to Bangla and several other languages.

REFERENCES

[1] C. Sun, Q. Du, and G. Tian, “Exploiting product related review features for fake review detection,” Mathematical Problems in Engineering, 2016.

[2] A. Heydari, M. A. Tavakoli, N. Salim, and Z. Heydari, “Detection of review spam: a survey,” Expert Systems with Applications, vol. 42, no. 7, pp. 3634–3642, 2015.

[3] M. Ott, Y. Choi, C. Cardie, and J. T. Hancock, “Finding deceptive opinion spam by any stretch of the imagination,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), vol. 1, pp. 309–319, Portland, OR, USA, June 2011.

[4] J. W. Pennebaker, M. E. Francis, and R. J. Booth, “Linguistic Inquiry and Word Count: LIWC,” vol. 71, 2001.

[5] S. Feng, R. Banerjee, and Y. Choi, “Syntactic stylometry for deception detection,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, vol. 2, 2012.

[6] J. Li, M. Ott, C. Cardie, and E. Hovy, “Towards a general rule for identifying deceptive opinion spam,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), 2014.

[7] E.-P. Lim, V.-A. Nguyen, N. Jindal, B. Liu, and H. W. Lauw, “Detecting product review spammers using rating behaviors,” in Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM), 2010.

[8] J. K. Rout, A. Dalmia, and K.-K. R. Choo, “Revisiting semi-supervised learning for online deceptive review detection,” IEEE Access, vol. 5, pp. 1319–1327, 2017.

[9] J. Karimpour, A. A. Noroozi, and S. Alizadeh, “Web spam detection by learning from small labeled samples,” International Journal of Computer Applications, vol. 50, no. 21, pp. 1–5, July 2012.