CMU SCS
Identifying on-line Fraudsters: Anomaly Detection Using Network Effects
Christos Faloutsos
CMU
Thanks
• Saman Haqqi
IBM-PBGH June 2013 C. Faloutsos (CMU) 2
Roadmap
• Graph problems:
  – G1: Fraud detection – BP
  – G2: Botnet detection – spectral
  – G3: Beyond graphs: tensors and “NELL”
• Influence propagation and spike modeling
  – C1: spikeM model
• Conclusions
E-bay Fraud detection
w/ Polo Chau & Shashank Pandit, CMU [WWW’07]
E-bay Fraud detection - NetProbe
Compatibility matrix (heterophily), over the node states F (fraudster), A (accomplice), H (honest):
  row F: 99%
  row A: 99%
  row H: 49%, 49%
Background 1: Belief Propagation Equations
\[ m_{ij}(x_j) = \sum_{x_i} \phi_i(x_i) \cdot \psi_{ij}(x_i, x_j) \cdot \prod_{n \in N(i) \setminus j} m_{ni}(x_i) \]
\[ b_i(x_i) = \eta \cdot \phi_i(x_i) \cdot \prod_{j \in N(i)} m_{ji}(x_i) \]
[Pearl ‘82][Yedidia+ ‘02]…[Pandit+ ‘07][Gonzalez+ ‘09][Chechetka+ ‘10]
Slide annotation: ≈ b_i(x_i)
Popular press
And less desirable attention:
• E-mail from ‘Belgium police’ (‘copy of your code?’)
Roadmap
• Graph problems:
  – G1: Fraud detection – BP
    • Ebay
    • Symantec
    • Unification
  – G2: Botnet detection – spectral
  – G3: Beyond graphs: tensors and “NELL”
• Influence propagation and spike modeling
• Conclusions
Polo Chau, Machine Learning Dept
Carey Nachenberg, Vice President & Fellow
Jeffrey Wilhelm, Principal Software Engineer
Adam Wright, Software Engineer
Prof. Christos Faloutsos, Computer Science Dept
Polonium: Tera-Scale Graph Mining and Inference for Malware Detection
PATENT PENDING
SDM 2011, Mesa, Arizona
Polonium: The Data
60+ terabytes of data anonymously contributed by participants of worldwide Norton Community Watch program
• 50+ million machines
• 900+ million executable files
Constructed a machine-file bipartite graph (0.2 TB+)
• 1 billion nodes (machines and files)
• 37 billion edges
Polonium: Key Ideas
• Use “guilt-by-association” (i.e., homophily)
  – E.g., files that appear on machines with many bad files are more likely to be bad
• Scalability: handles 37 billion-edge graph
Polonium: One-Iteration Results
84.9% True Positive Rate, 1% False Positive Rate
• True Positive Rate: % of malware correctly identified
• False Positive Rate: % of non-malware wrongly labeled as malware
[Figure: True Positive Rate vs. False Positive Rate curve, with the ‘Ideal’ corner marked]
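The two rates defined above can be computed directly from labels and predictions. The sketch below is only an illustration of the definitions; the function name `tpr_fpr` and the ten toy labels are made up and have nothing to do with the Polonium data.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate); 1 = malware, 0 = benign."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical labels: 4 malware files, 6 benign files.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")
```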
Unifying Guilt-by-Association Approaches:
Theorems and Fast Algorithms
Danai Koutra
U Kang
Hsing-Kuo Kenneth Pao
Tai-You Ke
Duen Horng (Polo) Chau
Christos Faloutsos
ECML PKDD, 5-9 September 2011, Athens, Greece
Problem Definition: GBA techniques
Given: graph, and a few labeled nodes
Find: labels of the rest (assuming network effects)
Homophily and Heterophily
[Figure: propagation at Step 1 and Step 2, under homophily vs. heterophily]
All methods handle homophily.
NOT all methods handle heterophily, BUT the proposed method does!
Are they related?
• RWR (Random Walk with Restarts)
  – Google’s PageRank (‘if my friends are important, I’m important, too’)
• SSL (Semi-supervised learning)
  – minimize the differences among neighbors
• BP (Belief propagation)
  – send messages to neighbors, on what you believe about them
YES!
Correspondence of Methods
Method   Matrix               Unknown   Known
RWR      [I – c A D^-1]       × x       = (1-c) y
SSL      [I + a (D - A)]      × x       = y
FaBP     [I + a D - c’A]      × b_h     = φ_h

x, b_h: final labels/beliefs
y, φ_h: prior labels/beliefs
A: adjacency matrix, e.g. [0 1 0; 1 0 1; 0 1 0]
D: diagonal degree matrix, diag(d1, d2, d3)
Example vectors on the slide: unknown = ?, known = (0, 1, 1)
We know when it converges!
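Since all three methods reduce to one linear system each, they can be compared side by side on the 3-node example graph. The sketch below is an assumption-laden illustration: the parameter values (`c`, `a`, `h_h`) are arbitrary, and the FaBP coefficients a = 4h²/(1-4h²) and c’ = 2h/(1-4h²) are written as recalled from the FaBP paper, so treat them as an assumption rather than as the authors' code.

```python
import numpy as np

# 3-node path graph from the slide
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=0))
I = np.eye(3)

def rwr(y, c=0.85):
    """Random walk with restarts: [I - c A D^-1] x = (1 - c) y."""
    return np.linalg.solve(I - c * A @ np.linalg.inv(D), (1 - c) * y)

def ssl(y, a=0.5):
    """Semi-supervised learning: [I + a (D - A)] x = y."""
    return np.linalg.solve(I + a * (D - A), y)

def fabp(phi_h, h_h=0.1):
    """FaBP: [I + a D - c' A] b_h = phi_h (coefficients assumed, see lead-in)."""
    a = 4 * h_h**2 / (1 - 4 * h_h**2)
    c_prime = 2 * h_h / (1 - 4 * h_h**2)
    return np.linalg.solve(I + a * D - c_prime * A, phi_h)

y = np.array([1.0, 0.0, 0.0])          # node 0 is the only labeled node
phi_h = np.array([0.1, 0.0, -0.1])     # priors centered around zero
x_rwr, x_ssl, b_h = rwr(y), ssl(y), fabp(phi_h)
print(x_rwr.round(3), x_ssl.round(3), b_h.round(3))
```

The point of the comparison is that the three matrices differ only in how they weight I, D and A, which is what lets convergence of the iterative versions be studied with the same tools.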
Results: Scalability
FaBP is linear in the number of edges.
[Figure: run time (min) vs. # of edges (Kronecker graphs)]
Results: Parallelism
FaBP is ~2x faster and wins or ties on accuracy.
[Figure: % accuracy vs. runtime (min)]
Conclusions for BP
• ‘NetProbe’, ‘Polonium’, and belief propagation: exploit network effects.
• FaBP: fast and accurate (and comes with convergence conditions)
EigenSpokes
B. Aditya Prakash, Mukund Seshadri, Ashwin Sridharan, Sridhar Machiraju and Christos Faloutsos: EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs, PAKDD 2010, Hyderabad, India, 21-24 June 2010.
EigenSpokes
• Eigenvectors of adjacency matrix
  – equivalent to singular vectors (symmetric, undirected graph)
[Figure: N × N adjacency matrix and its decomposition]
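The equivalence claimed above is easy to check numerically: for a symmetric adjacency matrix, the singular values are the absolute values of the eigenvalues. The small graph below is a made-up example.

```python
import numpy as np

# Hypothetical undirected graph on 5 nodes (symmetric 0/1 adjacency matrix).
A_sym = np.array([[0, 1, 1, 0, 0],
                  [1, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)

eigenvalues = np.linalg.eigvalsh(A_sym)                 # real, ascending
singular_values = np.linalg.svd(A_sym, compute_uv=False)  # descending

# Sort |eigenvalues| in descending order to line them up with the SVD output.
abs_eigs = np.sort(np.abs(eigenvalues))[::-1]
print(abs_eigs.round(4))
print(singular_values.round(4))
```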
EigenSpokes
• EE plot:
  – Scatter plot of scores of u1 vs u2
• One would expect
  – Many points @ origin
  – A few scattered ~randomly
[Figure: EE plot, u1 (1st principal component) vs. u2 (2nd principal component); annotation: spokes at 90°]
EigenSpokes - pervasiveness
• Present in mobile social graph across time and space
• Patent citation graph
EigenSpokes - explanation
Near-cliques, or near-bipartite-cores, loosely connected
So what? Extract nodes with high scores -> high connectivity -> good “communities”
[Figure: spy plot of top 20 nodes]
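A toy graph makes the explanation concrete. The sketch below assumes two cliques (sizes 8 and 5, chosen arbitrarily) joined by a single edge, and checks that each of the top two eigenvectors concentrates on one clique, so that taking the highest-scoring nodes recovers a tightly connected group. It is an illustration of the idea, not the EigenSpokes extraction algorithm.

```python
import numpy as np

def clique(n):
    """Adjacency matrix of a clique on n nodes (no self-loops)."""
    return np.ones((n, n)) - np.eye(n)

n1, n2 = 8, 5
G = np.zeros((n1 + n2, n1 + n2))
G[:n1, :n1] = clique(n1)
G[n1:, n1:] = clique(n2)
G[0, n1] = G[n1, 0] = 1.0        # one bridge edge: "loosely connected"

vals, vecs = np.linalg.eigh(G)   # eigenvalues in ascending order
u1, u2 = vecs[:, -1], vecs[:, -2]

# Nodes with the highest |score| on each eigenvector
top_u1 = set(np.argsort(-np.abs(u1))[:n1].tolist())
top_u2 = set(np.argsort(-np.abs(u2))[:n2].tolist())
print(sorted(top_u1), sorted(top_u2))
```

In an EE plot of this graph, the first clique's nodes would sit along the u1 axis and the second clique's along the u2 axis, with almost nothing in between.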
Bipartite Communities!
[Figure: magnified bipartite community]
• patents from same inventor(s)
• ‘cut-and-paste’ bibliography!
(maybe, botnets?)
Victim IPs?
Botnet members?
Exploring it with Dr. Eric Mao (III-Taiwan)
GigaTensor: Scaling Tensor Analysis Up By 100 Times –
Algorithms and Discoveries
U Kang
Evangelos Papalexakis
Abhay Harpale
Christos Faloutsos
KDD’12
Background: Tensors
• Tensors (= multi-dimensional arrays) are everywhere
  – Hyperlinks & anchor text [Kolda+,05]
[Figure: 3-mode tensor of URL 1 × URL 2 × anchor text (Java, C++, C#), with 1s marking the nonzero entries]
Background: Tensors
• Tensors (= multi-dimensional arrays) are everywhere
  – Sensor stream (time, location, type)
  – Predicates (subject, verb, object) in knowledge base
    “Barack Obama is president of U.S.”
    “Eric Clapton plays guitar”
NELL (Never Ending Language Learner) data: modes of size 26M, 26M and 48M; nonzeros = 144M
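A knowledge base of (subject, verb, object) triples maps naturally onto a sparse 3-mode tensor stored in coordinate form. The sketch below uses the two example sentences from the slide plus one repeated triple added for illustration; the function name `build_tensor` and the dictionary-based layout are assumptions, not the GigaTensor storage format.

```python
from collections import defaultdict

def build_tensor(triples):
    """Map each mode's strings to integer ids and count each (s, v, o) cell."""
    index = ({}, {}, {})                     # one id dictionary per mode
    counts = defaultdict(int)                # sparse cells: coordinate -> count
    for triple in triples:
        coord = tuple(mode.setdefault(item, len(mode))
                      for mode, item in zip(index, triple))
        counts[coord] += 1
    return index, dict(counts)

triples = [
    ("Barack Obama", "is president of", "U.S."),
    ("Eric Clapton", "plays", "guitar"),
    ("Eric Clapton", "plays", "guitar"),     # repeated mention (illustrative)
]
index, cells = build_tensor(triples)
shape = tuple(len(mode) for mode in index)
print(shape, cells)
```

Only the nonzero cells are stored, which is what makes a tensor with tens of millions of entries per mode and 144M nonzeros representable at all.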
• Tensors for anomaly detection in computer networks: (IP-source, IP-destination, time-stamp)
Problem Definition
• How to decompose a billion-scale tensor?
  – Corresponds to SVD in 2D case
[Figure: tensor decomposed into a sum of components, e.g. ‘Politicians’ and ‘Artists’]
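The SVD analogy can be made concrete: a matrix is a sum of rank-1 outer products of two vectors, and a 3-mode tensor decomposition writes the tensor as a sum of outer products of three vectors. The sketch below only shows the two forms on small random data (sizes and rank are arbitrary); it does not fit a decomposition and is not the GigaTensor algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2D case: SVD rebuilds a matrix as a sum of rank-1 terms s_r * u_r v_r^T.
M = rng.random((4, 3))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_rebuilt = sum(s[r] * np.outer(U[:, r], Vt[r]) for r in range(len(s)))

# 3-mode case: a rank-R tensor is a sum of R outer products a_r o b_r o c_r.
R = 2
Fa, Fb, Fc = rng.random((4, R)), rng.random((3, R)), rng.random((5, R))
X = np.einsum("ir,jr,kr->ijk", Fa, Fb, Fc)
X_terms = sum(
    np.multiply.outer(np.multiply.outer(Fa[:, r], Fb[:, r]), Fc[:, r])
    for r in range(R)
)
print(X.shape)
```

Each of the R terms plays the role of one "concept" (for instance a group of subjects, verbs and objects that co-occur), in the same way a singular-vector pair does for a matrix.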
Q1: Dominant concepts/topics?
Q2: Find synonyms to a given noun phrase?
(and how to scale up: |data| > RAM)
Experiments
• GigaTensor solves 100x larger problem
[Figure: tensor dimensions (I, J, K) handled by GigaTensor vs. Tensor Toolbox, with number of nonzeros = I / 50; Tensor Toolbox runs out of memory, GigaTensor handles 100x larger sizes]
A1: Concept Discovery
• Concept Discovery in Knowledge Base
A2: Synonym Discovery
Rise and Fall Patterns of Information Diffusion: Model and Implications
Yasuko Matsubara (Kyoto University),
Yasushi Sakurai (NTT), B. Aditya Prakash (CMU),
Lei Li (UCB), Christos Faloutsos (CMU)
KDD’12, Beijing China
Rise and fall patterns in social media
• Meme (# of mentions in blogs)
  – short phrases sourced from U.S. politics in 2008
    “you can put lipstick on a pig”
    “yes we can”
• four classes on YouTube [Crane et al. ’08]
• six classes on Meme [Yang et al. ’11]
• Can we find a unifying model, which includes these patterns?
• Answer: YES!
• We can represent all patterns by a single model
Main idea - SpikeM
1. Un-informed bloggers (uninformed about rumor), at time n=0
2. External shock at time nb (e.g., breaking news), at time n=nb
3. Infection (word-of-mouth), at time n=nb+1
Infectiveness of a blog-post at age n:
- Strength of infection (quality of news): β
- Decay function
-1.5 slope
J. G. Oliveira & A.-L. Barabási, Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005).
[Figure: Prob(RT > x) (log) vs. response time (log); slope -1.5]
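Putting the three ingredients together (un-informed bloggers, an external shock at time nb, and word-of-mouth infection whose strength decays with slope -1.5) gives a simple simulation. The sketch below is a simplified reading of the model without periodicity; the update rule is written as recalled from the KDD’12 paper, and the names `N`, `beta`, `n_b`, `S_b`, `eps` and their values are assumptions chosen for illustration.

```python
import numpy as np

def spikem(N=1000.0, beta=0.001, n_b=5, S_b=10.0, eps=0.0, T=60):
    """Simulate newly informed bloggers dB(n) and un-informed bloggers U(n)."""
    dB = np.zeros(T)                    # newly informed at each time tick
    U = np.zeros(T)
    U[0] = N                            # everybody starts un-informed
    S = np.zeros(T)
    S[n_b] = S_b                        # external shock at time n_b
    for n in range(T - 1):
        t = np.arange(n_b, n + 1)       # empty before the shock
        age = (n + 1 - t).astype(float)
        f = beta * age ** -1.5          # infectiveness decays as a power law
        new = U[n] * np.sum((dB[t] + S[t]) * f) + eps
        dB[n + 1] = min(new, U[n])      # cannot inform more than remain
        U[n + 1] = U[n] - dB[n + 1]
    return dB, U

dB, U = spikem()
peak = int(np.argmax(dB))
print(peak, dB[peak].round(2), U[-1].round(2))
```

Nothing happens before the shock; right after it, activity builds through word-of-mouth and then decays as the un-informed population is used up and old posts lose infectiveness.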
SpikeM - with periodicity
• Full equation of SpikeM
• Periodicity: bloggers change their activity over time (e.g., daily, weekly, yearly)
[Figure: activity vs. time n, with a peak at noon and a dip at 3am]
Analysis – exponential rise and power-law fall
• Rise-part (lin-log and log-log plots): SI -> exponential; SpikeM -> exponential
• Fall-part (lin-log and log-log plots): SI -> exponential; SpikeM -> power law
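The reason for looking at both lin-log and log-log plots is that an exponential is a straight line in lin-log scale, while a power law is a straight line in log-log scale. The sketch below checks this on two synthetic decay curves; the constants 0.5, 0.1 and the exponent -1.5 are illustrative values, not fitted results.

```python
import numpy as np

age = np.arange(1, 101, dtype=float)
power_law = 0.5 * age ** -1.5            # SpikeM-style fall
exponential = 0.5 * np.exp(-0.1 * age)   # SI-style fall

# Straight-line fits: log-log for the power law, lin-log for the exponential.
slope_loglog = np.polyfit(np.log(age), np.log(power_law), 1)[0]
slope_linlog = np.polyfit(age, np.log(exponential), 1)[0]
print(round(slope_loglog, 3), round(slope_linlog, 3))
```

The log-log slope recovers the power-law exponent and the lin-log slope recovers the exponential rate, so the plot in which a fall looks linear tells the two behaviors apart.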
Tail-part forecasts
• SpikeM can capture the tail part
“What-if” forecasting
e.g., given
(1) first spike,
(2) release date of two sequel movies,
(3) access volume before the release date (two weeks before release)
SpikeM can forecast upcoming spikes
Conclusions for spikes
• Exponential rise; power-law decay
• ‘spikeM’ captures all patterns, with a few parameters
  – And can do extrapolation
  – And forecasting
Roadmap
• Graph problems:
  – G1: Fraud detection – BP
  – G2: Botnet detection – spectral
  – G3: Beyond graphs: tensors and “NELL”
• Influence propagation and spike modeling
• Future research
• Conclusions
Challenge #1: Time evolving networks / tensors
• Periodicities? Burstiness?
• What is ‘typical’ behavior of a node, over time
• Heterogeneous graphs (= nodes w/ attributes)
Challenge #2: ‘Connectome’ – brain wiring
• Which neurons get activated by ‘bee’
• How wiring evolves
• Modeling epilepsy
N. Sidiropoulos
George Karypis
V. Papalexakis
Tom Mitchell
Thanks
Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, Google, INTEL, HP, iLab
Project info: PEGASUS
www.cs.cmu.edu/~pegasus
Results on large graphs: with Pegasus + hadoop + M45
Apache license
Code, papers, manual, video
Prof. U Kang, Prof. Polo Chau
Cast
Akoglu, Leman
Chau, Polo
Kang, U
McGlohon, Mary
Tong, Hanghang
Prakash, Aditya
Koutra, Danai
Beutel, Alex
Papalexakis, Vagelis
References
• Deepayan Chakrabarti, Christos Faloutsos: Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38(1) (2006)
• Christos Faloutsos, Tamara G. Kolda, Jimeng Sun: Mining large graphs and streams using matrix and tensor tools. Tutorial, SIGMOD Conference 2007: 1174
• Yasuko Matsubara, Yasushi Sakurai, B. Aditya Prakash, Lei Li, Christos Faloutsos: Rise and Fall Patterns of Information Diffusion: Model and Implications. KDD’12, pp. 6-14, Beijing, China, August 2012
• Jimeng Sun, Dacheng Tao, Christos Faloutsos: Beyond streams and graphs: dynamic tensor analysis. KDD 2006: 374-383
Overall Conclusions
• G1: fraud detection
  – BP: powerful method
  – FaBP: faster; equally accurate; known convergence
• G2: botnets -> Eigenspokes
• G3: Subject-Verb-Object -> Tensors/GigaTensor
• Spikes: ‘spikeM’ (exp rise; PL drop)