Opinion Fraud Detection in Online Reviews by Network Effects

Leman Akoglu, Rishi Chandy, Christos Faloutsos. ICWSM'13

Transcript of Opinion Fraud Detection in Online Reviews by Network Effects

Opinion Fraud Detection in Online Reviews using Network Effects

Leman Akoglu, Rishi Chandy, Christos Faloutsos. ICWSM'13

Problem

Which reviews do/should you trust?

Bogus reviews:
◦ Defaming spam
◦ Hype spam

Previous work:
◦ Manual labeling is hard
◦ Side information is used (review text, timestamps, behavioral analysis, ...)
◦ Findings are tied to the characteristics of a particular dataset

This work:
◦ Unsupervised, requiring no labeled data
◦ No side information
◦ Can be applied to all types of review networks

Problem

Given:
◦ user-product review network
◦ review sign (+: thumbs up / -: thumbs down)

Classify objects into type-specific classes:
◦ users: `honest’ / `fraudster’
◦ products: `good’ / `bad’
◦ reviews: `genuine’ / `fake’

Automatically label the users, products, and reviews in the network.

No side data! (e.g., timestamp, review text)

[Figure: signed bipartite review network, with users labeled `honest’ / `fraud’ and products labeled `good quality’ / `bad quality’]

Formulation

◦ Model node labels as random variables.
◦ Each node carries a prior belief potential; each observed signed edge carries a compatibility potential between its neighbors.
◦ The objective function is a pairwise Markov Random Field (Kindermann & Snell, 1980).

Finding the best assignment to the unobserved variables in this objective is an inference problem!
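For reference, the objective alluded to here is the textbook pairwise MRF form, with the node priors φ and signed-edge compatibilities ψ^s plugged in (a sketch of the generic form, not copied verbatim from the paper):

```latex
% Pairwise MRF objective over a label assignment y:
%   \phi_i      = prior belief potential of node i
%   \psi^s_{ij} = compatibility potential of edge (i,j) with sign s
%   Z           = normalizing partition function
P(\mathbf{y}) = \frac{1}{Z}
    \prod_{Y_i \in \mathcal{V}} \phi_i(y_i)
    \prod_{(Y_i, Y_j) \in \mathcal{E}} \psi^{s}_{ij}(y_i, y_j)
```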

Formulation

The new algorithm, the signed Inference Algorithm (sIA), extends LBP (Loopy Belief Propagation) to signed networks.

I) Repeat for each node: alternately communicate messages between users and products.

II) At convergence: read off the belief of user i having label y_i.
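Steps I and II correspond to the standard LBP update and belief equations, specialized to signed edges through ψ^s; the following is a sketch of the usual textbook form rather than the paper's exact notation:

```latex
% I) Message from node i to neighbor j, over an edge with sign s:
m_{i \to j}(y_j) \propto \sum_{y_i} \phi_i(y_i)\, \psi^{s}_{ij}(y_i, y_j)
    \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(y_i)

% II) Belief of node i having label y_i, read off at convergence:
b_i(y_i) \propto \phi_i(y_i) \prod_{j \in N(i)} m_{j \to i}(y_i)
```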

Formulation

[Figure: node beliefs before vs. after sIA converges]

FRAUDEAGLE

Step 1. Scoring (signed Inference Algorithm)
◦ Repeat message propagation (see the sketch after this list):
– update messages from users to products
– update messages from products to users
◦ until all messages stop changing

Step 2. Grouping
◦ Rank users by their score from Step 1.
◦ Cluster the subgraphs induced on the top-k users and their products.
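To make Step 1 concrete, here is a minimal Python sketch of the alternating signed message passing, under stated assumptions: two labels per node type (honest/fraud for users, good/bad for products), unbiased priors, one review per (user, product) pair, and an illustrative eps-parameterized compatibility table. It is not the authors' implementation, and the paper's exact potentials and scheduling may differ.

```python
# A minimal sketch of Step 1 (signed loopy belief propagation) on a
# bipartite user-product graph. The eps compatibility values and
# unbiased priors are illustrative assumptions.
from collections import defaultdict

def signed_lbp(edges, eps=0.1, max_iter=100, tol=1e-6):
    """edges: list of (user, product, sign) with sign in {+1, -1}.
    Returns (user_beliefs, product_beliefs); each belief is a pair
    [P(honest-or-good), P(fraud-or-bad)]."""
    # Compatibility psi[sign][user_label][product_label]: on a + edge,
    # (honest, good) and (fraud, bad) are the compatible pairings
    # (hype spam boosts bad products); a - edge flips the product side
    # (defaming spam attacks good products).
    psi = {+1: [[1 - eps, eps], [eps, 1 - eps]],
           -1: [[eps, 1 - eps], [1 - eps, eps]]}
    prior = [0.5, 0.5]  # unbiased prior belief phi for every node
    nbrs_u, nbrs_p = defaultdict(list), defaultdict(list)
    for u, p, s in edges:
        nbrs_u[u].append((p, s))
        nbrs_p[p].append((u, s))
    # One message per edge per direction, initialized uniform.
    m_up = {(u, p): [0.5, 0.5] for u, p, _ in edges}
    m_pu = {(u, p): [0.5, 0.5] for u, p, _ in edges}

    for _ in range(max_iter):
        delta = 0.0
        # Update messages from users to products...
        for u, nbs in nbrs_u.items():
            for p, s in nbs:
                new = [0.0, 0.0]
                for yu in (0, 1):
                    inc = prior[yu]
                    for q, _ in nbs:
                        if q != p:  # exclude the recipient's own message
                            inc *= m_pu[(u, q)][yu]
                    for yp in (0, 1):
                        new[yp] += psi[s][yu][yp] * inc
                z = sum(new)
                new = [v / z for v in new]
                delta = max(delta, abs(new[0] - m_up[(u, p)][0]))
                m_up[(u, p)] = new
        # ...then from products to users.
        for p, nbs in nbrs_p.items():
            for u, s in nbs:
                new = [0.0, 0.0]
                for yp in (0, 1):
                    inc = prior[yp]
                    for v, _ in nbs:
                        if v != u:
                            inc *= m_up[(v, p)][yp]
                    for yu in (0, 1):
                        new[yu] += psi[s][yu][yp] * inc
                z = sum(new)
                new = [v / z for v in new]
                delta = max(delta, abs(new[0] - m_pu[(u, p)][0]))
                m_pu[(u, p)] = new
        if delta < tol:  # all messages stopped changing
            break

    def belief(msgs):
        # Combine the prior with all incoming messages, then normalize.
        b = list(prior)
        for msg in msgs:
            b = [b[0] * msg[0], b[1] * msg[1]]
        z = sum(b)
        return [v / z for v in b]

    user_beliefs = {u: belief([m_pu[(u, p)] for p, _ in nbs])
                    for u, nbs in nbrs_u.items()}
    prod_beliefs = {p: belief([m_up[(u, p)] for u, _ in nbs])
                    for p, nbs in nbrs_p.items()}
    return user_beliefs, prod_beliefs
```

At the scale of the dataset on the next slide one would stream edges and work in log space to avoid underflow, but the structure of the updates stays the same.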

Dataset

SWM: all app reviews in the entertainment category (games, news, sports, etc.) from an anonymous online app store database.

As of June 2012:
◦ 1,132,373 reviews
◦ 966,842 users
◦ 15,094 software products (apps)

Ratings: 1 (worst) to 5 (best)

A large number of reviews come from users who have very few reviews

Skewed towards positive ratings
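The 1-5 star ratings become the signed edges that sIA operates on; judging by the figure legends later in the deck (4-5 stars marked positive, 1-2 marked negative), a plausible conversion is sketched below. Treating 3-star reviews as neutral and dropping them is an assumption here, not something the slides state.

```python
def rating_to_sign(rating):
    # Per the deck's figure legends: 4-5 stars count as positive,
    # 1-2 as negative. Dropping 3-star reviews is an assumption.
    if rating >= 4:
        return +1
    if rating <= 2:
        return -1
    return None  # neutral; excluded from the signed network

reviews = [("u1", "appA", 5), ("u2", "appA", 1), ("u3", "appB", 3)]
edges = []
for u, p, r in reviews:
    s = rating_to_sign(r)
    if s is not None:
        edges.append((u, p, s))  # ready for signed_lbp(edges)
```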

Competitive algorithms

Compared against two iterative classifiers (modified to handle signed edges):

I) Weighted-vote Relational Classifier (wvRC) (Macskassy & Provost, 2003)

II) HITS, with honesty and goodness in mutual recursion (Kleinberg, 1999)
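The deck only names the HITS baseline; the hypothetical sketch below shows one way honesty and goodness could be placed in mutual recursion over signed edges (the authors' actual modification is not spelled out here, so these update rules are an assumption):

```python
# Hypothetical sketch of a HITS-style honesty/goodness mutual recursion
# on signed edges; the authors' actual modification may differ.
from collections import defaultdict

def signed_hits(edges, n_iter=30):
    """edges: list of (user, product, sign) with sign in {+1, -1}.
    A product is good to the extent honest users rate it positively;
    a user is honest to the extent their signs agree with goodness."""
    honesty = {u: 1.0 for u, _, _ in edges}
    for _ in range(n_iter):
        good = defaultdict(float)
        for u, p, s in edges:
            good[p] += s * honesty[u]
        scale = max(abs(v) for v in good.values()) or 1.0
        goodness = {p: v / scale for p, v in good.items()}
        hon = defaultdict(float)
        for u, p, s in edges:
            hon[u] += s * goodness[p]  # reward agreement with consensus
        scale = max(abs(v) for v in hon.values()) or 1.0
        honesty = {u: v / scale for u, v in hon.items()}
    return honesty, goodness
```

Both scores land in [-1, 1], which is at least consistent with the `honesty score < 0’ threshold quoted later in the deck.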

Top scorers

[Figure: review submatrix of the top-scoring users and products; + marks a positive (4-5 star) rating, o marks a negative (1-2 star) rating]


`Fraud-bot’ member reviews: 31 users, all with 5-star ratings to the same 5 products.

Same developer! Duplicated text! Same-day activity!

After removing fake reviews: top-scorers matter!

[Figure: average product ratings before vs. after removing reviews by flagged users (fraud score > 0.5, honesty score < 0), shown for a high-rating and a low-rating product]

Computational complexity

Scalable to large data: each sIA iteration passes messages over every edge a constant number of times, so the computational complexity is linear in the network size.

Running time grows linearly with increasing network size.

Conclusion

The FRAUDEAGLE framework:
◦ exploits the network effects among reviewers and products
◦ scores users and reviews for fraud detection, and groups top scorers for visualization and sensemaking
◦ is unsupervised; needs no labeled data
◦ scales to large datasets