PhD Thesis presentation
Transcript of PhD Thesis presentation
DETECTION OF DISHONEST BEHAVIORS IN ON-LINE NETWORKS USING GRAPH-BASED RANKING TECHNIQUES
Francisco Javier Ortega Rodríguez
Supervised by
Prof. Dr. José Antonio Troyano Jiménez
2
Motivation
3
Motivation
WWW: Web Search, a new business model
Advertisements on web pages: more web traffic means more visits to (or views of) the ads
Search Engine Optimization (SEO) is born
White Hat SEO
Black Hat SEO: Web Spam!
4
Social Networks: the reputation of users is similar to the relevance of web pages
Higher reputation can imply some benefits
Malicious users manipulate the TRS’s:
On-line marketplaces: money
Social news sites: slant the contents of the web site
Simply for “trolling” (for pleasure)
Motivation
5
6
Motivation
Hypothesis
The detection of dishonest behaviors in on-line networks can be carried out with graph-based techniques, flexible enough to include in their schemes specific information (in the form of features of the elements in a graph) about the network to be processed and the concrete task to be solved.
7
Roadmap: Detection of Dishonest Behaviors
Web Spam Detection: State-of-the-Art, PolaritySpam
Trust & Reputation in Social Networks: State-of-the-Art, PolarityTrust
Conclusions
8
Web Spam Detection
Web spam mechanisms try to increase the web traffic to specific web sites by reaching the top positions of a web search engine
Relatedness: similarity to the user query, by changing the content of the web page
Visibility: relevance in the collection, by getting a high number of references
9
Web Spam Detection
Content-based methods (self-promotion): hidden HTML code, keyword stuffing
10
Web Spam Detection
Link-based methods (mutual promotion): link farms, PR-sculpting
11
Roadmap: Web Spam Detection, State-of-the-Art
Link-based methods
Content-based methods
Hybrid methods
12
Web Spam Detection
Relevant web spam detection methods: Link-based approaches
PageRank-based
Adaptations: Truncated PageRank [Castillo et al., 2007], TrustRank [Gyöngyi et al., 2004]
13
Web Spam Detection
Relevant web spam detection methods: Link-based approaches
Pros: tackle the link-based spam methods; the ranking can be directly used as the result of a user query
Cons: do not take into account the content of the web pages; need human intervention in some specific parts
14
Web Spam Detection
Relevant web spam detection methods: Content-based approaches
[Diagram: a database of web pages with content features (size, compressibility, avg. word length, …) feeds a classifier that labels each page as Spam / Not Spam]
15
Web Spam Detection
Relevant web spam detection methods: Content-based approaches
Pros: deal with content-based spam methods; binary classification methods
Cons: very slow in comparison to the link-based methods; based on user-specified features; do not take into account the topology of the web graph
16
Web Spam Detection
Relevant web spam detection methods: Hybrid approaches
[Diagram: a database of web pages with both content features (size, compressibility, avg. word length, …) and link-based metrics (% in-links, out-links / in-links, …) feeds the classifier]
17
Web Spam Detection
Relevant web spam detection methods: Hybrid approaches
Pros: Combine the pros of link and content-based
methods. Really effective in the classification of web pages
Cons: Need user-specified features for both the content
and the link-based heuristics.
Opportunity: Do not take advantage of the global topology of
the web graph
18
Roadmap: Web Spam Detection, PolaritySpam
Content Evaluation
Selection of sources
Propagation algorithm
Evaluation
19
PolaritySpam
Intuition: include content-based knowledge in a link-based system
[Diagram: Database → Content Evaluation → Selection of sources → Propagation algorithm → Ranking]
20
PolaritySpam
Content Evaluation
21
PolaritySpam
Content Evaluation: acquire useful knowledge from the textual content
Content-based heuristics: adequate for spam detection, easy to compute, highest discriminative ability
A-priori spam likelihood of a web page
22
PolaritySpam
Content Evaluation Small set of heuristics [Ntoulas et al., 2006]
Compressibility
Average length of words
A high value of the metrics implies an a-priori high spam likelihood of a web page
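The two heuristics above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation; it assumes gzip as the compressor and defines compressibility as raw size over compressed size, so that repetitive (spam-like) text scores higher:

```python
import gzip

def content_heuristics(text: str) -> dict:
    """Two content-based heuristics from [Ntoulas et al., 2006]:
    compressibility and average word length. Higher values suggest spam."""
    raw = text.encode("utf-8")
    words = text.split()
    return {
        # repetitive (spam-like) text compresses well, so this ratio grows
        "compressibility": len(raw) / len(gzip.compress(raw)),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }

spam_like = "cheap pills buy now " * 40      # keyword stuffing
normal = "The quick brown fox jumps over the lazy dog near the river bank."
print(content_heuristics(spam_like)["compressibility"] >
      content_heuristics(normal)["compressibility"])  # repetitive text scores higher
```
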
23
PolaritySpam
Selection of Sources
24
PolaritySpam
Selection of Sources: automatically pick a set of a-priori spam and not-spam web pages, Sources⁻ and Sources⁺, respectively
Take into account the content-based heuristics. Given a web page wp_i with metrics M_i = {m_i1, m_i2, …, m_ij}
25
PolaritySpam
Selection of Sources: three strategies
Most Spammy / Not-Spammy sources (S-NS)
Content-based S-NS (CS-NS)
Content-based Graph Sources (C-GS)
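A minimal sketch of an S-NS-style selection, assuming each page already has a single combined spaminess score (a hypothetical simplification; the actual strategies combine several metrics):

```python
def select_sources(spaminess, fraction=0.05):
    """Pick a-priori source sets from a per-page spaminess score:
    the least spammy fraction becomes Sources+, the most spammy Sources-."""
    ranked = sorted(spaminess, key=spaminess.get)   # ascending spaminess
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]                  # (Sources+, Sources-)

scores = {"wp1": 0.05, "wp2": 0.92, "wp3": 0.40, "wp4": 0.77}
print(select_sources(scores, fraction=0.25))        # (['wp1'], ['wp2'])
```
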
26
PolaritySpam
Propagation algorithm
27
PolaritySpam
Propagation algorithm: PageRank-based algorithm
Idea: propagate a-priori information from a specific set of web pages, Sources
A-priori scores for the Sources
28
PolaritySpam
Propagation algorithm: two scores for each web page v_i
A non-spam score, propagated from the set of a-priori non-spam web pages (Sources⁺)
A spam score, propagated from the set of a-priori spam web pages (Sources⁻)
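The propagation can be sketched as a personalized-PageRank-style iteration, run once per polarity. This is a toy sketch (uniform a-priori mass on the sources, no dangling-node handling), not the thesis formulation:

```python
def propagate(out_links, sources, d=0.85, iters=50):
    """PageRank-style propagation: a-priori mass sits only on `sources`
    and flows along out-links. Run with Sources+ for the non-spam score
    and with Sources- for the spam score."""
    nodes = list(out_links)
    e = {v: 1.0 / len(sources) if v in sources else 0.0 for v in nodes}
    score = dict(e)
    for _ in range(iters):
        score = {
            v: (1 - d) * e[v]
               + d * sum(score[u] / len(out_links[u])
                         for u in nodes if v in out_links[u])
            for v in nodes
        }
    return score

# toy web graph: "s" is a trusted source; "x", "y" are unreachable from it
web = {"s": ["a"], "a": ["s"], "x": ["y"], "y": ["x"]}
trust = propagate(web, sources={"s"})
print(trust["a"] > trust["x"])   # True: mass only reaches pages linked from s
```
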
29
PolaritySpam
30
PolaritySpam
Evaluation: Dataset
Baseline
Evaluation Methods
Results
31
PolaritySpam
Evaluation: Dataset
WEBSPAM-UK2006 (Università degli Studi di Milano)
Metrics: 98 million pages; 11,400 hosts manually labeled
7,423 hosts are labeled as spam; about 10 million web pages are labeled as spam
Processed with the Terrier IR Platform, http://terrier.org
32
PolaritySpam
Evaluation: Baseline: TrustRank [Gyöngyi et al., 2004]
Link-based web spam detection method
Personalized PageRank equation
Propagation from a set of hand-picked web pages
33
PolaritySpam
Evaluation: evaluation methods, PR-Buckets
Evaluation metric: the number of spam web pages in each bucket
[Figure: the ranking is split into Bucket 1, Bucket 2, …, Bucket N]
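The bucket counting above can be sketched directly. A minimal illustration, assuming equal-size buckets over the ranked list:

```python
def pr_buckets(ranking, spam, n_buckets):
    """PR-Buckets evaluation: split the ranked list into equal-size buckets
    and count spam pages per bucket; a good ranking pushes spam into the
    last buckets."""
    size = max(1, len(ranking) // n_buckets)
    return [sum(1 for page in ranking[i:i + size] if page in spam)
            for i in range(0, len(ranking), size)]

ranking = ["wp1", "wp2", "wp3", "wp4", "wp5", "wp6"]
print(pr_buckets(ranking, spam={"wp5", "wp6"}, n_buckets=3))  # [0, 0, 2]
```
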
36
PolaritySpam
Evaluation: Normalized Discounted Cumulative Gain (nDCG)
Global metric: measures the demotion of spam web pages
Sums the “relevance” scores of not-spam web pages
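A minimal sketch of this demotion-oriented nDCG, assuming relevance 1 for every not-spam page and 0 for spam (a simplification of the metric as described here):

```python
import math

def ndcg_demotion(ranking, spam):
    """nDCG-style demotion metric: not-spam pages have relevance 1, spam
    pages 0, so the score reaches 1.0 only when every spam page is ranked
    below every not-spam page."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, page in enumerate(ranking) if page not in spam)
    n_good = sum(1 for page in ranking if page not in spam)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(n_good))
    return dcg / idcg if idcg else 0.0

print(ndcg_demotion(["good1", "good2", "bad"], {"bad"}))       # 1.0
print(ndcg_demotion(["bad", "good1", "good2"], {"bad"}) < 1)   # True
```
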
39
PolaritySpam
Evaluation: PR-Buckets evaluation
S-NS and CS-NS: using the 5% of web pages with highest and lowest spaminess
40
PolaritySpam
Evaluation: nDCG evaluation

Method      nDCG
TrustRank   0.7381
S-NS        0.4230
CS-NS       0.8621
C-GS        0.8648

Our proposals CS-NS and C-GS outperform TrustRank in the demotion of spam web pages
41
[Figure: number of spam web pages per bucket (log scale) for AverageLength, Compressibility, AllMetrics, TrustRank, and PolaritySpam]
PolaritySpam
PolaritySpam
Evaluation: Content-based heuristics
The content-based metrics do not achieve good results by themselves
42
Roadmap: Detection of Dishonest Behaviors
Web Spam Detection: State-of-the-Art, PolaritySpam
Trust & Reputation in Social Networks: State-of-the-Art, PolarityTrust
Conclusions
43
Trust & Reputation in Social Networks
Trust and reputation are key concepts in social networks, similar to the relevance of web pages in the WWW
Reputation: assessment of the trustworthiness of a user in a social network, according to his behavior and the opinions of the other users.
Example: on-line marketplaces
Trustworthiness is as decisive as the price; higher reputation implies more sales; positive and negative opinions
Trust & Reputation in Social Networks
44
Main goal: gain a high reputation by obtaining positive feedback from the customers
Sell some bargains, special offers
Give negative opinions to sellers that can be competitors
Obtain false positive opinions from other accounts (not necessarily other users)
Dishonest behaviors!
Trust & Reputation in Social Networks
45
46
Roadmap: Trust & Reputation in Social Networks, State-of-the-Art
TRS’s in the real world
Threats for TRS’s
Transitivity of Distrust
47
TRS’s in the real world: moderators
A special set of users with specific responsibilities
Example: Slashdot.org has a hierarchy of moderators; a special user, No_More_Trolls, maintains a list of known trolls
Drawbacks: scalability, subjectivity
Trust & Reputation in Social Networks
48
TRS’s in the real world: unsupervised TRS’s
Users rate the contents of the system (and also other users)
Scalability problem: solved by relying on the users. Subjectivity problem: solved by decentralization
Examples: Digg.com, eBay.com
Drawback: unsupervised!
Trust & Reputation in Social Networks
49
Transitivity of Trust and Distrust [Guha et al., 2004]
Multiplicative distrust: the enemy of my enemy is my friend
Additive distrust: don’t trust someone not trusted by someone you don’t trust
Neutral distrust: don’t take into account your enemies’ opinions
Trust & Reputation in Social Networks
50
Threats for TRS’s:
Orchestrated attacks
Camouflage behind good behavior
Malicious spies
Camouflage behind judgments
Trust & Reputation in Social Networks
51
Threats for TRS’s. Orchestrated attacks: obtaining positive opinions from other accounts (not necessarily other users)
Trust & Reputation in Social Networks
52
Threats for TRS’s. Camouflage behind good behavior: feigning good behavior in order to obtain positive feedback from others
Trust & Reputation in Social Networks
53
Threats for TRS’s. Malicious spies: using an “honest” account to provide positive opinions to malicious users
Trust & Reputation in Social Networks
54
Threats for TRS’s. Camouflage behind judgments: giving negative feedback to users who can be competitors
Trust & Reputation in Social Networks
55
Roadmap: Trust & Reputation in Social Networks, PolarityTrust
Algorithm
Non-Negative Propagation
Action-Reaction Propagation
Evaluation
56
Intuition: compute a ranking of the users in a social network according to their trustworthiness
Take into account both positive and negative feedback
Graph-based ranking algorithm to obtain two scores for each node: PT⁺(v_i), the positive reputation of user i, and PT⁻(v_i), the negative reputation of user i
PolarityTrust
57
Intuition: a propagation algorithm for the opinions of the users
Given a set of trustworthy users, their PT⁺ and PT⁻ scores are propagated to their neighbors, and so on
PolarityTrust
58
Algorithm: propagation schema of the opinions of the users
Different behavior depending on the type of relation between the users: positive or negative
PolarityTrust
[Diagram: a trusted user a raises PT⁺(b) through a positive vote and PT⁻(c) through a negative one; a distrusted user d raises PT⁻(e) through a positive vote and PT⁺(f) through a negative one]
59
Algorithm: the scores of the nodes influence the scores of their neighbors
PolarityTrust
PT⁺(v_i) = (1 − d)·e_i⁺ + d·[ Σ_{v_j ∈ In⁺(v_i)} (p_ji / Σ_{v_k ∈ Out(v_j)} |p_jk|) · PT⁺(v_j) + Σ_{v_j ∈ In⁻(v_i)} (|p_ji| / Σ_{v_k ∈ Out(v_j)} |p_jk|) · PT⁻(v_j) ]
PT⁻(v_i) = (1 − d)·e_i⁻ + d·[ Σ_{v_j ∈ In⁺(v_i)} (p_ji / Σ_{v_k ∈ Out(v_j)} |p_jk|) · PT⁻(v_j) + Σ_{v_j ∈ In⁻(v_i)} (|p_ji| / Σ_{v_k ∈ Out(v_j)} |p_jk|) · PT⁺(v_j) ]
The e_i vector is non-zero only for the set of sources. The first sum is the direct relation with the PT⁺ of positively voting users; the second is the inverse relation with the PT⁻ of negatively voting users.
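A minimal runnable sketch of this pair of update equations, under the simplifying assumption of unit vote weights (p_ji = ±1) and a toy votes dict; the names and data are hypothetical, not the thesis implementation:

```python
def polarity_trust(votes, sources_plus, d=0.85, iters=50):
    """PT+/PT- propagation with unit vote weights.
    votes: voter -> list of (target, sign) pairs, sign in {+1, -1}."""
    nodes = set(votes) | {t for vs in votes.values() for t, _ in vs}
    e = {v: 1.0 / len(sources_plus) if v in sources_plus else 0.0
         for v in nodes}
    pt_pos, pt_neg = dict(e), {v: 0.0 for v in nodes}
    for _ in range(iters):
        new_pos = {v: (1 - d) * e[v] for v in nodes}
        new_neg = {v: 0.0 for v in nodes}
        for j, out in votes.items():
            w = d / len(out)            # sum of |p_jk| = out-degree here
            for i, sign in out:
                if sign > 0:            # positive vote: same polarity flows
                    new_pos[i] += w * pt_pos[j]
                    new_neg[i] += w * pt_neg[j]
                else:                   # negative vote: polarities cross over
                    new_neg[i] += w * pt_pos[j]
                    new_pos[i] += w * pt_neg[j]
        pt_pos, pt_neg = new_pos, new_neg
    return pt_pos, pt_neg

# trusted source "s" endorses "good" and votes against "troll"
pos, neg = polarity_trust({"s": [("good", +1), ("troll", -1)]}, {"s"})
print(pos["good"] > neg["good"], neg["troll"] > pos["troll"])  # True True
```
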
64
Non-Negative Propagation: problems caused by negative opinions from malicious users
Solution: dynamically avoid the propagation of these opinions from malicious users
PolarityTrust
[Diagram: without the fix, a user a with high PT⁻(a) would still raise PT⁻(b) and PT⁺(c) through its votes]
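One way to realize this gate, as a sketch: before propagating, zero out the contribution of a negative vote whenever the voter's own negative score dominates (function name and threshold are illustrative assumptions):

```python
def outgoing_weight(pt_pos, pt_neg, voter, sign):
    """Non-Negative Propagation gate: a user whose negative score dominates
    contributes nothing through his negative votes, so malicious users
    cannot demote others."""
    if sign < 0 and pt_neg[voter] > pt_pos[voter]:
        return 0.0      # drop the negative opinion of a distrusted user
    return 1.0

print(outgoing_weight({"a": 0.1}, {"a": 0.7}, "a", -1))  # 0.0
```
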
65
Action-Reaction Propagation: problems caused by dishonest voting attacks
Positive votes to malicious users: orchestrated attacks, malicious spies…
Negative votes to good users: camouflage behind judgments
React against bad actions (dishonest voting): penalize users who perform these actions, proportionally to the trustworthiness of the nodes being affected
PolarityTrust
66
Action-Reaction Propagation: computation
The relation between the number of dishonest votes and the total number of votes
Applied after each iteration of the ranking algorithm
PolarityTrust
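A sketch of such a penalty, under the assumption that "dishonest mass" is the score of the affected node on the wrong polarity (endorsing a distrusted node, or attacking a trusted one); the exact weighting in the thesis may differ:

```python
def action_reaction_penalty(votes, pt_pos, pt_neg):
    """Fraction of a user's voting mass that is dishonest: positive votes
    to distrusted nodes and negative votes to trusted ones, weighted by
    the scores of the affected nodes."""
    dishonest = sum(pt_neg[t] if sign > 0 else pt_pos[t]
                    for t, sign in votes)
    total = sum(pt_pos[t] + pt_neg[t] for t, sign in votes)
    return dishonest / total if total else 0.0

pt_pos = {"troll": 0.0, "honest": 0.9}
pt_neg = {"troll": 0.8, "honest": 0.0}
# endorsing a troll and attacking an honest user: fully dishonest
print(action_reaction_penalty([("troll", +1), ("honest", -1)],
                              pt_pos, pt_neg))  # 1.0
```

The resulting factor can then be used after each iteration to scale down the scores of the offending user.
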
67
Complete Formulation
PolarityTrust
68
Evaluation Datasets
Baselines
Results
PolarityTrust
69
Evaluation: datasets
Barabási-Albert graphs: preferential attachment property
Randomly generated attacks
Metrics of the dataset: 10⁴ nodes per graph, 10³ malicious users, 100 malicious spies
PolarityTrust
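Graphs with the preferential attachment property can be generated with the classic Barabási-Albert scheme. A stdlib-only sketch (the thesis generator and its attack injection are not shown):

```python
import random

def barabasi_albert_edges(n, m, seed=0):
    """Barabási-Albert preferential attachment: each new node attaches to
    m existing nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    repeated = list(range(m))      # a node appears once per unit of degree
    edges = []
    for v in range(m, n):
        chosen = set()
        while len(chosen) < m:     # m distinct, degree-biased targets
            chosen.add(rng.choice(repeated))
        for t in chosen:
            edges.append((v, t))
            repeated += [v, t]     # both endpoints gain one degree
    return edges

edges = barabasi_albert_edges(10_000, 2)
print(len(edges))                  # (10_000 - 2) * 2 = 19_996 edges
```
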
70
Evaluation: datasets
Slashdot Zoo: the graph of users in Slashdot.org, with Friend and Foe relationships. Gold standard: the list of Foes of the special user No_More_Trolls
Metrics of the dataset: 71,500 users in total, 24% negative edges, 96 known trolls. Source set: CmdrTaco and his friends, 6 users in total
PolarityTrust
71
Evaluation: baselines
EigenTrust [Kamvar et al., 2003]: it does not take into account negative opinions
Fans Minus Freaks (number of friends minus number of foes)
Signed Spectral Ranking [Kunegis et al., 2009]
Negative Ranking [Kunegis et al., 2009]
PolarityTrust
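The Fans Minus Freaks baseline is simple enough to state in code, as a minimal sketch over toy data (+1 for an incoming friend link, -1 for an incoming foe link):

```python
def fans_minus_freaks(incoming):
    """Fans Minus Freaks baseline: a user's reputation is the number of
    incoming friend links minus the number of incoming foe links."""
    return {user: sum(signs) for user, signs in incoming.items()}

links = {"alice": [1, 1, 1, -1], "troll": [-1, -1, 1]}
print(fans_minus_freaks(links))    # {'alice': 2, 'troll': -1}
```
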
72
Evaluation results: randomly generated datasets (nDCG)
PolarityTrust

Threats  ET     FmF    SR     NR     PTNN   PTAR   PT
A        0.833  0.843  0.599  0.749  0.876  0.906  0.987
AB       0.833  0.844  0.811  0.920  0.876  0.906  0.987
ABC      0.842  0.719  0.816  0.920  0.877  0.903  0.984
ABCD     0.823  0.723  0.818  0.937  0.879  0.903  0.984
ABCDE    0.753  0.777  0.877  0.933  0.966  0.862  0.982

A: no strategies; B: orchestrated attack; C: camouflage behind good behavior; D: malicious spies; E: camouflage behind judgments
ET: EigenTrust; FmF: Fans Minus Freaks; SR: Spectral Ranking; NR: Negative Ranking
PTNN: Non-Negative Propagation; PTAR: Action-Reaction Propagation; PT: PolarityTrust

PolarityTrust outperforms the baselines in terms of demotion of malicious users
73
Evaluation results: Slashdot Zoo dataset
PolarityTrust

        ET     FmF    SR     NR     PTNN   PTAR   PT
nDCG    0.310  0.460  0.479  0.477  0.593  0.570  0.588

ET: EigenTrust; FmF: Fans Minus Freaks; SR: Spectral Ranking; NR: Negative Ranking
PTNN: Non-Negative Propagation; PTAR: Action-Reaction Propagation; PT: PolarityTrust

Similar performance with a real-world dataset
74
Evaluation results: trolling Slashdot (nDCG)
PolarityTrust

Threats  ET     FmF    SR     NR     PTNN   PTAR   PT
A        0.310  0.460  0.479  0.477  0.593  0.570  0.588
AB       0.308  0.460  0.478  0.477  0.593  0.570  0.588
ABC      0.311  0.460  0.474  0.484  0.593  0.570  0.588
ABCD     0.370  0.476  0.501  0.501  0.580  0.570  0.586
ABCDE    0.370  0.475  0.501  0.496  0.580  0.574  0.588

A: no strategies; B: orchestrated attack; C: camouflage behind good behavior; D: malicious spies; E: camouflage behind judgments
ET: EigenTrust; FmF: Fans Minus Freaks; SR: Spectral Ranking; NR: Negative Ranking
PTNN: Non-Negative Propagation; PTAR: Action-Reaction Propagation; PT: PolarityTrust

Significant improvement in the demotion of malicious users
75
Evaluation: include a set of sources of distrust
In Slashdot Zoo Dataset:
Sources of trust: CmdrTaco and friends
Sources of distrust: 5 random foes of No_More_Trolls
Many possible methods to choose the sources of distrust
PolarityTrust
76
Evaluation results: sources of trust and distrust (nDCG)
PolarityTrust

              Sources of Trust        Sources of Trust & Distrust
Threats  PTNN   PTAR   PT        PTNN   PTAR   PT
A        0.593  0.570  0.588     0.846  0.790  0.846
AB       0.593  0.570  0.588     0.846  0.790  0.846
ABC      0.593  0.570  0.588     0.846  0.790  0.846
ABCD     0.580  0.570  0.586     0.775  0.739  0.782
ABCDE    0.580  0.574  0.588     0.774  0.741  0.781

A: no strategies; B: orchestrated attack; C: camouflage behind good behavior; D: malicious spies; E: camouflage behind judgments
PTNN: Non-Negative Propagation; PTAR: Action-Reaction Propagation; PT: PolarityTrust

Improvement in the demotion of malicious users with the sources of distrust
77
Roadmap: Detection of Dishonest Behaviors
Web Spam Detection: State-of-the-Art, PolaritySpam
Trust & Reputation in Social Networks: State-of-the-Art, PolarityTrust
Conclusions: Final Remarks, Future Work
78
Conclusions
Final Remarks
Development of two systems for the detection of dishonest behaviors in on-line networks
Web Spam Detection: PolaritySpam. Trust and Reputation: PolarityTrust
Propagation of some a-priori information
Web Spam: the textual content of the web pages. Trust and Reputation: the trust and distrust source sets
79
Conclusions
Final Remarks
Web Spam Detection
Unlike existing approaches, it includes content-based knowledge in a link-based technique
Unsupervised methods for the selection of sources
Propagate information of the sources through the network
Two simple metrics improve state-of-the-art methods
80
Conclusions
Final Remarks
Trust and Reputation in social networks
Negative links improve the discriminative ability of TRS’s
Propagation strategies to deal with different attacks against a TRS
Non-Negative propagation Action-Reaction propagation
Interrelated scores modeling the transitivity of trust and distrust
Flexible to be adapted to different situations and threats
81
Conclusions
Future Work
PolaritySpam: applicability of more content-based metrics
Additional methods for the selection of sources: propagation ability of each source
Infer negative relations between web pages according to their textual content; apply similar propagation schemas as in PolarityTrust
82
Conclusions
Future Work: PolarityTrust
Study other possible attacks: playbook sequences (omniscience of the attackers); analyze the casuistry of the different social networks
Selection of sources of trust and distrust: link-based methods
Study other contexts with positive and negative relations: trending topics, authorities in the blogosphere
83
Conclusions
Future Work: both techniques
Study of the parallelization of both algorithms: many works on the parallelization of PageRank; saving time and memory
Detection of spam on social networks: spam messages and spam user accounts
Recommender systems: NLP and Opinion Mining techniques in a link-based system; use the positive and negative information
Curriculum Vitae
Academic and research milestones
2006: Degree in Computer Science
2006: Funded student in the Itálica research group
2008: Master of Advanced Studies: “STR: A graph-based tagger generator”
2010: Research stay at the University of Glasgow IR Group (Dr. Iadh Ounis and Dr. Craig Macdonald)
84
Curriculum Vitae
26 contributions to conferences and journals: 5 JCR, 10 international conferences (2 CORE B, 4 CORE C), 4 ISI Proceedings, 3 Lecture Notes in Computer Science, 3 CiteSeer Venue Impact Ratings
Research projects
85
Curriculum Vitae
[Diagram: contributions related to the thesis, by venue type (national conf., international conf., JCR): System Combination Methods, STR, TextRank for Tagging, PolarityRank, PolaritySpam, PolarityTrust, Web Spam Detection, Improving a Tagger Generator in IE]
Curriculum Vitae
Contributions related to the thesis: System Combination Methods, TextRank for Tagging, Improving a Tagger Generator in IE
Bootstrapping Applied to a Corpus Generation Task, EUROCAST 2007
TextRank como motor de aprendizaje en tareas de etiquetado (TextRank as a learning engine in tagging tasks), SEPLN 2006
Improving the Performance of a Tagger Generator in an Information Extraction Application, Journal of Universal Computer Science (2007)
Curriculum Vitae
Contributions related to the thesis: STR
STR: A Graph-based Tagging Technique, International Journal on Artificial Intelligence Tools (2011)
Curriculum Vitae
Contributions related to the thesis: PolarityRank, Web Spam Detection
A Knowledge-Rich Approach to Feature-based Opinion Extraction from Product Reviews, SMUC 2010 (CIKM 2010)
Combining Textual Content and Hyperlinks in Web Spam Detection, NLDB 2010
Curriculum Vitae
Contributions related to the thesis: PolaritySpam, PolarityTrust
PolarityTrust: Measuring Trust and Reputation in Social Networks, ITA 2011
PolaritySpam: Propagating Content-based Information Through a Web Graph to Detect Web Spam, International Journal of Innovative Computing, Information and Control (2012)