PhD Thesis presentation
Transcript of PhD Thesis presentation
DETECTION OF DISHONEST BEHAVIORS IN ON-LINE NETWORKS USING GRAPH-BASED RANKING TECHNIQUES
Francisco Javier Ortega Rodríguez
Supervised by
Prof. Dr. José Antonio Troyano Jiménez
2
Motivation
3
Motivation
WWW: Web Search, a new business model
Advertisements on web pages: more web traffic means more visits to (or views of) the ads
Search Engine Optimization (SEO) is born
White Hat SEO
Black Hat SEO: Web Spam!
4
Social Networks: the reputation of users is similar to the relevance of web pages
Higher reputation can imply some benefits
Malicious users manipulate the TRS’s:
On-line marketplaces: money
Social news sites: slant the contents of the web site
Simply for “trolling” (for pleasure)
Motivation
5
6
Motivation
Hypothesis
The detection of dishonest behaviors in on-line networks can be carried out with graph-based techniques, flexible enough to include in their schemes specific information (in the form of features of the elements in a graph) about the network to be processed and the concrete task to be solved.
7
Roadmap: Detection of Dishonest Behaviors
Web Spam Detection: State-of-the-Art, PolaritySpam
Trust & Reputation in Social Networks: State-of-the-Art, PolarityTrust
Conclusions
8
Web Spam Detection
Web spam mechanisms try to increase the web traffic to specific web sites by reaching the top positions of a web search engine
Relatedness: similarity to the user query, by changing the content of the web page
Visibility: relevance in the collection, by getting a high number of references
9
Web Spam Detection
Content-based methods (self-promotion): hidden HTML code, keyword stuffing
10
Web Spam Detection
Link-based methods (mutual promotion): link farms, PR-sculpting
11
Roadmap: Web Spam Detection, State-of-the-Art
Link-based methods
Content-based methods
Hybrid methods
12
Web Spam Detection
Relevant web spam detection methods: Link-based approaches
PageRank-based
Adaptations: Truncated PageRank [Castillo et al., 2007], TrustRank [Gyöngyi et al., 2004]
13
Web Spam Detection
Relevant web spam detection methods: Link-based approaches
Pros: tackle the link-based spam methods; the ranking can be directly used as the result of a user query
Cons: do not take into account the content of the web pages; need human intervention in some specific parts
14
Web Spam Detection
Relevant web spam detection methods: Content-based approaches
[Diagram: a database of web pages with content features (size, compressibility, avg. word length, …) feeds a classifier that labels each page as Spam / Not Spam]
15
Web Spam Detection
Relevant web spam detection methods: Content-based approaches
Pros: deal with content-based spam methods; binary classification methods
Cons: very slow in comparison to the link-based methods; based on user-specified features; do not take into account the topology of the web graph
16
Web Spam Detection
Relevant web spam detection methods: Hybrid approaches
[Diagram: a database of web pages with both content features (size, compressibility, avg. word length, …) and link-based metrics (% in-links, out-links / in-links, …) feeds the classifier]
17
Web Spam Detection
Relevant web spam detection methods: Hybrid approaches
Pros: Combine the pros of link and content-based
methods. Really effective in the classification of web pages
Cons: Need user-specified features for both the content
and the link-based heuristics.
Opportunity: Do not take advantage of the global topology of
the web graph
18
Roadmap: Web Spam Detection, PolaritySpam
Content Evaluation
Selection of sources
Propagation algorithm
Evaluation
19
PolaritySpam
Intuition: include content-based knowledge in a link-based system
[Diagram: Database → Content Evaluation → Selection of sources → Propagation algorithm → Ranking]
20
PolaritySpam
Content Evaluation
21
PolaritySpam
Content Evaluation: acquire useful knowledge from the textual content
Content-based heuristics: adequate for spam detection, easy to compute, highest discriminative ability
A-priori spam likelihood of a web page
22
PolaritySpam
Content Evaluation Small set of heuristics [Ntoulas et al., 2006]
Compressibility
Average length of words
A high value of the metrics implies an a-priori high spam likelihood of a web page
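The two heuristics above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation; it assumes gzip as the compressor and defines compressibility as raw size over compressed size, so that repetitive (spam-like) text scores higher:

```python
import gzip

def content_heuristics(text: str) -> dict:
    """Two content-based heuristics from [Ntoulas et al., 2006]:
    compressibility and average word length. Higher values suggest spam."""
    raw = text.encode("utf-8")
    words = text.split()
    return {
        # repetitive (spam-like) text compresses well, so this ratio grows
        "compressibility": len(raw) / len(gzip.compress(raw)),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }

spam_like = "cheap pills buy now " * 40      # keyword stuffing
normal = "The quick brown fox jumps over the lazy dog near the river bank."
print(content_heuristics(spam_like)["compressibility"] >
      content_heuristics(normal)["compressibility"])  # repetitive text scores higher
```
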
23
PolaritySpam
Selection of Sources
24
PolaritySpam
Selection of Sources: automatically pick a set of a-priori spam and not-spam web pages, Sources⁻ and Sources⁺, respectively
Take into account the content-based heuristics. Given a web page wp_i with metrics M_i = {m_i1, m_i2, …, m_ij}
25
PolaritySpam
Selection of Sources: three strategies
Most Spammy / Not-Spammy sources (S-NS)
Content-based S-NS (CS-NS)
Content-based Graph Sources (C-GS)
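A minimal sketch of an S-NS-style selection, assuming each page already has a single combined spaminess score (a hypothetical simplification; the actual strategies combine several metrics):

```python
def select_sources(spaminess, fraction=0.05):
    """Pick a-priori source sets from a per-page spaminess score:
    the least spammy fraction becomes Sources+, the most spammy Sources-."""
    ranked = sorted(spaminess, key=spaminess.get)   # ascending spaminess
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]                  # (Sources+, Sources-)

scores = {"wp1": 0.05, "wp2": 0.92, "wp3": 0.40, "wp4": 0.77}
print(select_sources(scores, fraction=0.25))        # (['wp1'], ['wp2'])
```
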
26
PolaritySpam
Propagation algorithm
27
PolaritySpam
Propagation algorithm: PageRank-based algorithm
Idea: propagate a-priori information from a specific set of web pages, Sources
A-priori scores for the Sources
28
PolaritySpam
Propagation algorithm: two scores for each web page v_i
A non-spam score, propagated from the set of a-priori non-spam web pages (Sources⁺)
A spam score, propagated from the set of a-priori spam web pages (Sources⁻)
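The propagation can be sketched as a personalized-PageRank-style iteration, run once per polarity. This is a toy sketch (uniform a-priori mass on the sources, no dangling-node handling), not the thesis formulation:

```python
def propagate(out_links, sources, d=0.85, iters=50):
    """PageRank-style propagation: a-priori mass sits only on `sources`
    and flows along out-links. Run with Sources+ for the non-spam score
    and with Sources- for the spam score."""
    nodes = list(out_links)
    e = {v: 1.0 / len(sources) if v in sources else 0.0 for v in nodes}
    score = dict(e)
    for _ in range(iters):
        score = {
            v: (1 - d) * e[v]
               + d * sum(score[u] / len(out_links[u])
                         for u in nodes if v in out_links[u])
            for v in nodes
        }
    return score

# toy web graph: "s" is a trusted source; "x", "y" are unreachable from it
web = {"s": ["a"], "a": ["s"], "x": ["y"], "y": ["x"]}
trust = propagate(web, sources={"s"})
print(trust["a"] > trust["x"])   # True: mass only reaches pages linked from s
```
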
29
PolaritySpam
30
PolaritySpam
Evaluation: Dataset
Baseline
Evaluation Methods
Results
31
PolaritySpam
Evaluation: Dataset
WEBSPAM-UK2006 (Università degli Studi di Milano)
Metrics: 98 million pages; 11,400 hosts manually labeled
7,423 hosts are labeled as spam; about 10 million web pages are labeled as spam
Processed with the Terrier IR Platform, http://terrier.org
32
PolaritySpam
Evaluation: Baseline: TrustRank [Gyöngyi et al., 2004]
Link-based web spam detection method
Personalized PageRank equation
Propagation from a set of hand-picked web pages
33
PolaritySpam
Evaluation: evaluation methods, PR-Buckets
Evaluation metric: the number of spam web pages in each bucket
[Figure: the ranking is split into Bucket 1, Bucket 2, …, Bucket N]
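The bucket counting above can be sketched directly. A minimal illustration, assuming equal-size buckets over the ranked list:

```python
def pr_buckets(ranking, spam, n_buckets):
    """PR-Buckets evaluation: split the ranked list into equal-size buckets
    and count spam pages per bucket; a good ranking pushes spam into the
    last buckets."""
    size = max(1, len(ranking) // n_buckets)
    return [sum(1 for page in ranking[i:i + size] if page in spam)
            for i in range(0, len(ranking), size)]

ranking = ["wp1", "wp2", "wp3", "wp4", "wp5", "wp6"]
print(pr_buckets(ranking, spam={"wp5", "wp6"}, n_buckets=3))  # [0, 0, 2]
```
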
36
PolaritySpam
Evaluation: Normalized Discounted Cumulative Gain (nDCG)
Global metric: measures the demotion of spam web pages
Sums the “relevance” scores of not-spam web pages
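A minimal sketch of this demotion-oriented nDCG, assuming relevance 1 for every not-spam page and 0 for spam (a simplification of the metric as described here):

```python
import math

def ndcg_demotion(ranking, spam):
    """nDCG-style demotion metric: not-spam pages have relevance 1, spam
    pages 0, so the score reaches 1.0 only when every spam page is ranked
    below every not-spam page."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, page in enumerate(ranking) if page not in spam)
    n_good = sum(1 for page in ranking if page not in spam)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(n_good))
    return dcg / idcg if idcg else 0.0

print(ndcg_demotion(["good1", "good2", "bad"], {"bad"}))       # 1.0
print(ndcg_demotion(["bad", "good1", "good2"], {"bad"}) < 1)   # True
```
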
39
PolaritySpam
Evaluation: PR-Buckets evaluation
S-NS and CS-NS: using the 5% of web pages with highest and lowest spaminess
40
PolaritySpam
Evaluation: nDCG evaluation

Method      nDCG
TrustRank   0.7381
S-NS        0.4230
CS-NS       0.8621
C-GS        0.8648

Our proposals CS-NS and C-GS outperform TrustRank in the demotion of spam web pages
41
[Figure: number of spam web pages per bucket (log scale) for AverageLength, Compressibility, AllMetrics, TrustRank, and PolaritySpam]
PolaritySpam
PolaritySpam
Evaluation: Content-based heuristics
The content-based metrics do not achieve good results by themselves
42
Roadmap: Detection of Dishonest Behaviors
Web Spam Detection: State-of-the-Art, PolaritySpam
Trust & Reputation in Social Networks: State-of-the-Art, PolarityTrust
Conclusions
43
Trust & Reputation in Social Networks
Trust and reputation are key concepts in social networks, similar to the relevance of web pages in the WWW
Reputation: assessment of the trustworthiness of a user in a social network, according to his behavior and the opinions of the other users.
Example: on-line marketplaces
Trustworthiness is as decisive as the price; higher reputation implies more sales; positive and negative opinions
Trust & Reputation in Social Networks
44
Main goal: gain a high reputation by obtaining positive feedback from the customers
Sell some bargains, special offers
Give negative opinions to sellers that can be competitors
Obtain false positive opinions from other accounts (not necessarily other users)
Dishonest behaviors!
Trust & Reputation in Social Networks
45
46
Roadmap: Trust & Reputation in Social Networks, State-of-the-Art
TRS’s in the real world
Threats for TRS’s
Transitivity of Distrust
47
TRS’s in the real world: moderators
A special set of users with specific responsibilities
Example: Slashdot.org has a hierarchy of moderators; a special user, No_More_Trolls, maintains a list of known trolls
Drawbacks: scalability, subjectivity
Trust & Reputation in Social Networks
48
TRS’s in the real world: unsupervised TRS’s
Users rate the contents of the system (and also other users)
Scalability problem: solved by relying on the users. Subjectivity problem: solved by decentralization
Examples: Digg.com, eBay.com
Drawback: unsupervised!
Trust & Reputation in Social Networks
49
Transitivity of Trust and Distrust [Guha et al., 2004]
Multiplicative distrust: the enemy of my enemy is my friend
Additive distrust: don’t trust someone not trusted by someone you don’t trust
Neutral distrust: don’t take into account your enemies’ opinions
Trust & Reputation in Social Networks
50
Threats for TRS’s:
Orchestrated attacks
Camouflage behind good behavior
Malicious spies
Camouflage behind judgments
Trust & Reputation in Social Networks
51
Threats for TRS’s. Orchestrated attacks: obtaining positive opinions from other accounts (not necessarily other users)
Trust & Reputation in Social Networks
52
Threats for TRS’s. Camouflage behind good behavior: feigning good behavior in order to obtain positive feedback from others
Trust & Reputation in Social Networks
53
Threats for TRS’s. Malicious spies: using an “honest” account to provide positive opinions to malicious users
Trust & Reputation in Social Networks
54
Threats for TRS’s. Camouflage behind judgments: giving negative feedback to users who can be competitors
Trust & Reputation in Social Networks
55
Roadmap: Trust & Reputation in Social Networks, PolarityTrust
Algorithm
Non-Negative Propagation
Action-Reaction Propagation
Evaluation
56
Intuition: compute a ranking of the users in a social network according to their trustworthiness
Take into account both positive and negative feedback
Graph-based ranking algorithm to obtain two scores for each node: PT⁺(v_i), the positive reputation of user i, and PT⁻(v_i), the negative reputation of user i
PolarityTrust
57
Intuition: a propagation algorithm for the opinions of the users
Given a set of trustworthy users, their PT⁺ and PT⁻ scores are propagated to their neighbors, and so on
PolarityTrust
58
Algorithm: propagation schema of the opinions of the users
Different behavior depending on the type of relation between the users: positive or negative
PolarityTrust
[Diagram: a trusted user a raises PT⁺(b) through a positive vote and PT⁻(c) through a negative one; a distrusted user d raises PT⁻(e) through a positive vote and PT⁺(f) through a negative one]
59
Algorithm: the scores of the nodes influence the scores of their neighbors
PolarityTrust
PT⁺(v_i) = (1 − d)·e_i⁺ + d·[ Σ_{v_j ∈ In⁺(v_i)} (p_ji / Σ_{v_k ∈ Out(v_j)} |p_jk|) · PT⁺(v_j) + Σ_{v_j ∈ In⁻(v_i)} (|p_ji| / Σ_{v_k ∈ Out(v_j)} |p_jk|) · PT⁻(v_j) ]
PT⁻(v_i) = (1 − d)·e_i⁻ + d·[ Σ_{v_j ∈ In⁺(v_i)} (p_ji / Σ_{v_k ∈ Out(v_j)} |p_jk|) · PT⁻(v_j) + Σ_{v_j ∈ In⁻(v_i)} (|p_ji| / Σ_{v_k ∈ Out(v_j)} |p_jk|) · PT⁺(v_j) ]
The e_i vector is non-zero only for the set of sources. The first sum is the direct relation with the PT⁺ of positively voting users; the second is the inverse relation with the PT⁻ of negatively voting users.
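A minimal runnable sketch of this pair of update equations, under the simplifying assumption of unit vote weights (p_ji = ±1) and a toy votes dict; the names and data are hypothetical, not the thesis implementation:

```python
def polarity_trust(votes, sources_plus, d=0.85, iters=50):
    """PT+/PT- propagation with unit vote weights.
    votes: voter -> list of (target, sign) pairs, sign in {+1, -1}."""
    nodes = set(votes) | {t for vs in votes.values() for t, _ in vs}
    e = {v: 1.0 / len(sources_plus) if v in sources_plus else 0.0
         for v in nodes}
    pt_pos, pt_neg = dict(e), {v: 0.0 for v in nodes}
    for _ in range(iters):
        new_pos = {v: (1 - d) * e[v] for v in nodes}
        new_neg = {v: 0.0 for v in nodes}
        for j, out in votes.items():
            w = d / len(out)            # sum of |p_jk| = out-degree here
            for i, sign in out:
                if sign > 0:            # positive vote: same polarity flows
                    new_pos[i] += w * pt_pos[j]
                    new_neg[i] += w * pt_neg[j]
                else:                   # negative vote: polarities cross over
                    new_neg[i] += w * pt_pos[j]
                    new_pos[i] += w * pt_neg[j]
        pt_pos, pt_neg = new_pos, new_neg
    return pt_pos, pt_neg

# trusted source "s" endorses "good" and votes against "troll"
pos, neg = polarity_trust({"s": [("good", +1), ("troll", -1)]}, {"s"})
print(pos["good"] > neg["good"], neg["troll"] > pos["troll"])  # True True
```
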
64
Non-Negative Propagation: problems caused by negative opinions from malicious users
Solution: dynamically avoid the propagation of these opinions from malicious users
PolarityTrust
[Diagram: without the fix, a user a with high PT⁻(a) would still raise PT⁻(b) and PT⁺(c) through its votes]
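One way to realize this gate, as a sketch: before propagating, zero out the contribution of a negative vote whenever the voter's own negative score dominates (function name and threshold are illustrative assumptions):

```python
def outgoing_weight(pt_pos, pt_neg, voter, sign):
    """Non-Negative Propagation gate: a user whose negative score dominates
    contributes nothing through his negative votes, so malicious users
    cannot demote others."""
    if sign < 0 and pt_neg[voter] > pt_pos[voter]:
        return 0.0      # drop the negative opinion of a distrusted user
    return 1.0

print(outgoing_weight({"a": 0.1}, {"a": 0.7}, "a", -1))  # 0.0
```
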
65
Action-Reaction Propagation: problems caused by dishonest voting attacks
Positive votes to malicious users: orchestrated attacks, malicious spies…
Negative votes to good users: camouflage behind judgments
React against bad actions (dishonest voting): penalize users who perform these actions, proportionally to the trustworthiness of the nodes being affected
PolarityTrust
66
Action-Reaction Propagation: computation
The relation between the number of dishonest votes and the total number of votes
Applied after each iteration of the ranking algorithm
PolarityTrust
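A sketch of such a penalty, under the assumption that "dishonest mass" is the score of the affected node on the wrong polarity (endorsing a distrusted node, or attacking a trusted one); the exact weighting in the thesis may differ:

```python
def action_reaction_penalty(votes, pt_pos, pt_neg):
    """Fraction of a user's voting mass that is dishonest: positive votes
    to distrusted nodes and negative votes to trusted ones, weighted by
    the scores of the affected nodes."""
    dishonest = sum(pt_neg[t] if sign > 0 else pt_pos[t]
                    for t, sign in votes)
    total = sum(pt_pos[t] + pt_neg[t] for t, sign in votes)
    return dishonest / total if total else 0.0

pt_pos = {"troll": 0.0, "honest": 0.9}
pt_neg = {"troll": 0.8, "honest": 0.0}
# endorsing a troll and attacking an honest user: fully dishonest
print(action_reaction_penalty([("troll", +1), ("honest", -1)],
                              pt_pos, pt_neg))  # 1.0
```

The resulting factor can then be used after each iteration to scale down the scores of the offending user.
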
67
Complete Formulation
PolarityTrust
68
Evaluation Datasets
Baselines
Results
PolarityTrust
69
Evaluation: datasets
Barabási-Albert graphs: preferential attachment property
Randomly generated attacks
Metrics of the dataset: 10⁴ nodes per graph, 10³ malicious users, 100 malicious spies
PolarityTrust
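Graphs with the preferential attachment property can be generated with the classic Barabási-Albert scheme. A stdlib-only sketch (the thesis generator and its attack injection are not shown):

```python
import random

def barabasi_albert_edges(n, m, seed=0):
    """Barabási-Albert preferential attachment: each new node attaches to
    m existing nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    repeated = list(range(m))      # a node appears once per unit of degree
    edges = []
    for v in range(m, n):
        chosen = set()
        while len(chosen) < m:     # m distinct, degree-biased targets
            chosen.add(rng.choice(repeated))
        for t in chosen:
            edges.append((v, t))
            repeated += [v, t]     # both endpoints gain one degree
    return edges

edges = barabasi_albert_edges(10_000, 2)
print(len(edges))                  # (10_000 - 2) * 2 = 19_996 edges
```
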
70
Evaluation: datasets
Slashdot Zoo: the graph of users in Slashdot.org, with Friend and Foe relationships. Gold standard: the list of Foes of the special user No_More_Trolls
Metrics of the dataset: 71,500 users in total, 24% negative edges, 96 known trolls. Source set: CmdrTaco and his friends, 6 users in total
PolarityTrust
71
Evaluation: baselines
EigenTrust [Kamvar et al., 2003]: it does not take into account negative opinions
Fans Minus Freaks (number of friends minus number of foes)
Signed Spectral Ranking [Kunegis et al., 2009]
Negative Ranking [Kunegis et al., 2009]
PolarityTrust
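The Fans Minus Freaks baseline is simple enough to state in code, as a minimal sketch over toy data (+1 for an incoming friend link, -1 for an incoming foe link):

```python
def fans_minus_freaks(incoming):
    """Fans Minus Freaks baseline: a user's reputation is the number of
    incoming friend links minus the number of incoming foe links."""
    return {user: sum(signs) for user, signs in incoming.items()}

links = {"alice": [1, 1, 1, -1], "troll": [-1, -1, 1]}
print(fans_minus_freaks(links))    # {'alice': 2, 'troll': -1}
```
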
72
Evaluation results: randomly generated datasets (nDCG)
PolarityTrust

Threats  ET     FmF    SR     NR     PTNN   PTAR   PT
A        0.833  0.843  0.599  0.749  0.876  0.906  0.987
AB       0.833  0.844  0.811  0.920  0.876  0.906  0.987
ABC      0.842  0.719  0.816  0.920  0.877  0.903  0.984
ABCD     0.823  0.723  0.818  0.937  0.879  0.903  0.984
ABCDE    0.753  0.777  0.877  0.933  0.966  0.862  0.982

A: no strategies; B: orchestrated attack; C: camouflage behind good behavior; D: malicious spies; E: camouflage behind judgments
ET: EigenTrust; FmF: Fans Minus Freaks; SR: Spectral Ranking; NR: Negative Ranking
PTNN: Non-Negative Propagation; PTAR: Action-Reaction Propagation; PT: PolarityTrust

PolarityTrust outperforms the baselines in terms of demotion of malicious users
73
Evaluation results: Slashdot Zoo dataset
PolarityTrust

        ET     FmF    SR     NR     PTNN   PTAR   PT
nDCG    0.310  0.460  0.479  0.477  0.593  0.570  0.588

ET: EigenTrust; FmF: Fans Minus Freaks; SR: Spectral Ranking; NR: Negative Ranking
PTNN: Non-Negative Propagation; PTAR: Action-Reaction Propagation; PT: PolarityTrust

Similar performance with a real-world dataset
74
Evaluation results: trolling Slashdot (nDCG)
PolarityTrust

Threats  ET     FmF    SR     NR     PTNN   PTAR   PT
A        0.310  0.460  0.479  0.477  0.593  0.570  0.588
AB       0.308  0.460  0.478  0.477  0.593  0.570  0.588
ABC      0.311  0.460  0.474  0.484  0.593  0.570  0.588
ABCD     0.370  0.476  0.501  0.501  0.580  0.570  0.586
ABCDE    0.370  0.475  0.501  0.496  0.580  0.574  0.588

A: no strategies; B: orchestrated attack; C: camouflage behind good behavior; D: malicious spies; E: camouflage behind judgments
ET: EigenTrust; FmF: Fans Minus Freaks; SR: Spectral Ranking; NR: Negative Ranking
PTNN: Non-Negative Propagation; PTAR: Action-Reaction Propagation; PT: PolarityTrust

Significant improvement in the demotion of malicious users
75
Evaluation: include a set of sources of distrust
In Slashdot Zoo Dataset:
Sources of trust: CmdrTaco and friends
Sources of distrust: 5 random foes of No_More_Trolls
Many possible methods to choose the sources of distrust
PolarityTrust
76
Evaluation results: sources of trust and distrust (nDCG)
PolarityTrust

              Sources of Trust        Sources of Trust & Distrust
Threats  PTNN   PTAR   PT        PTNN   PTAR   PT
A        0.593  0.570  0.588     0.846  0.790  0.846
AB       0.593  0.570  0.588     0.846  0.790  0.846
ABC      0.593  0.570  0.588     0.846  0.790  0.846
ABCD     0.580  0.570  0.586     0.775  0.739  0.782
ABCDE    0.580  0.574  0.588     0.774  0.741  0.781

A: no strategies; B: orchestrated attack; C: camouflage behind good behavior; D: malicious spies; E: camouflage behind judgments
PTNN: Non-Negative Propagation; PTAR: Action-Reaction Propagation; PT: PolarityTrust

Improvement in the demotion of malicious users with the sources of distrust
77
Roadmap: Detection of Dishonest Behaviors
Web Spam Detection: State-of-the-Art, PolaritySpam
Trust & Reputation in Social Networks: State-of-the-Art, PolarityTrust
Conclusions: Final Remarks, Future Work
78
Conclusions
Final Remarks
Development of two systems for the detection of dishonest behaviors in on-line networks
Web Spam Detection: PolaritySpam. Trust and Reputation: PolarityTrust
Propagation of some a-priori information
Web Spam: the textual content of the web pages. Trust and Reputation: the trust and distrust source sets
79
Conclusions
Final Remarks
Web Spam Detection
Unlike existing approaches, it includes content-based knowledge in a link-based technique
Unsupervised methods for the selection of sources
Propagate information of the sources through the network
Two simple metrics improve state-of-the-art methods
80
Conclusions
Final Remarks
Trust and Reputation in social networks
Negative links improve the discriminative ability of TRS’s
Propagation strategies to deal with different attacks against a TRS
Non-Negative propagation Action-Reaction propagation
Interrelated scores modeling the transitivity of trust and distrust
Flexible to be adapted to different situations and threats
81
Conclusions
Future Work
PolaritySpam: applicability of more content-based metrics
Additional methods for the selection of sources: propagation ability of each source
Infer negative relations between web pages according to their textual content; apply similar propagation schemas as in PolarityTrust
82
Conclusions
Future Work: PolarityTrust
Study other possible attacks: playbook sequences (omniscience of the attackers); analyze the casuistry of the different social networks
Selection of sources of trust and distrust: link-based methods
Study other contexts with positive and negative relations: trending topics, authorities in the blogosphere
83
Conclusions
Future Work: both techniques
Study of the parallelization of both algorithms: many works on the parallelization of PageRank; saving time and memory
Detection of spam on social networks: spam messages and spam user accounts
Recommender systems: NLP and Opinion Mining techniques in a link-based system; use the positive and negative information
Curriculum Vitae
Academic and research milestones
2006: Degree in Computer Science
2006: Funded student in the Itálica research group
2008: Master of Advanced Studies: “STR: A graph-based tagger generator”
2010: Research stay at the University of Glasgow IR Group (Dr. Iadh Ounis and Dr. Craig Macdonald)
84
Curriculum Vitae
26 contributions to conferences and journals: 5 JCR, 10 international conferences (2 CORE B, 4 CORE C), 4 ISI Proceedings, 3 Lecture Notes in Computer Science, 3 CiteSeer Venue Impact Ratings
Research projects
85
Curriculum Vitae
[Diagram: contributions related to the thesis, by venue type (national conf., international conf., JCR): System Combination Methods, STR, TextRank for Tagging, PolarityRank, PolaritySpam, PolarityTrust, Web Spam Detection, Improving a Tagger Generator in IE]
Curriculum Vitae
Contributions related to the thesis: System Combination Methods, TextRank for Tagging, Improving a Tagger Generator in IE
Bootstrapping Applied to a Corpus Generation Task, EUROCAST 2007
TextRank como motor de aprendizaje en tareas de etiquetado (TextRank as a learning engine in tagging tasks), SEPLN 2006
Improving the Performance of a Tagger Generator in an Information Extraction Application, Journal of Universal Computer Science (2007)
Curriculum Vitae
Contributions related to the thesis: STR
STR: A Graph-based Tagging Technique, International Journal on Artificial Intelligence Tools (2011)
Curriculum Vitae
Contributions related to the thesis: PolarityRank, Web Spam Detection
A Knowledge-Rich Approach to Feature-based Opinion Extraction from Product Reviews, SMUC 2010 (CIKM 2010)
Combining Textual Content and Hyperlinks in Web Spam Detection, NLDB 2010
Curriculum Vitae
Contributions related to the thesis: PolaritySpam, PolarityTrust
PolarityTrust: Measuring Trust and Reputation in Social Networks, ITA 2011
PolaritySpam: Propagating Content-based Information Through a Web Graph to Detect Web Spam, International Journal of Innovative Computing, Information and Control (2012)