CMU SCS
Graph and stream mining
Christos Faloutsos, CMU
CALD IC 2004 © C. Faloutsos (2004) 3
CMU SCS
Outline
• Problem definition / Motivation
• Graphs and power laws
• Streams and forecasting
• Conclusions
Motivation
• Data mining: ~ find patterns
• What do real graphs look like?
• What do (numerical) streams look like?
Graphs - why should we care?
Internet Map [lumeta.com]
Food Web [Martinez ’91]
Protein Interactions [genomebiology.com]
Friendship Network [Moody ’01]
Problem #1 - network and graph mining
• What does the Internet look like?
• What does the web look like?
• What constitutes a ‘normal’ social network?
• What is the ‘network value’ of a customer?
• Which gene/species affects the others the most?
Problem #1
Given a graph:
• Which node to market to / defend / immunize first?
• Are there unnatural subgraphs (criminals’ rings or terrorist cells)?
• How do P2P networks evolve?
Solution #1:
• A1: Power law in the degree distribution [SIGCOMM99]
[Figure: log(degree) vs. log(rank) for internet domains; slope -0.82; labeled points: att.com, ibm.com]
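The rank plot on this slide can be reproduced in a few lines. The sketch below is only an illustration, not the procedure from [SIGCOMM99]: it sorts node degrees in decreasing order and fits a least-squares line to log(degree) vs. log(rank). The function name `rank_exponent` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def rank_exponent(degrees):
    """Least-squares slope of log10(degree) vs. log10(rank).

    Nodes are ranked by decreasing degree; zero-degree nodes are dropped
    because their logarithm is undefined.
    """
    d = np.sort(np.asarray(degrees, dtype=float))[::-1]
    d = d[d > 0]
    ranks = np.arange(1, len(d) + 1, dtype=float)
    slope, _intercept = np.polyfit(np.log10(ranks), np.log10(d), 1)
    return slope

# Synthetic (made-up) degrees that follow degree = 1000 * rank^(-0.82) exactly.
ranks = np.arange(1, 501, dtype=float)
synthetic_degrees = 1000.0 * ranks ** -0.82
print(round(rank_exponent(synthetic_degrees), 2))
```

On real data the points only approximately follow a line, so the fitted slope should be read together with the quality of the fit.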
Solution #1’: Eigen exponent E
• Power law in the eigenvalues of the adjacency matrix
[Figure: eigenvalue vs. rank of decreasing eigenvalue (log-log), May 2001; exponent = slope, E = -0.48]
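A rough way to measure an eigen exponent on a small undirected graph is sketched below. It assumes a dense symmetric 0/1 adjacency matrix and uses NumPy's `eigvalsh`; the helper names (`star_forest`, `top_eigenvalues`, `eigen_exponent`) and the toy graph are illustrative assumptions, not the tooling behind the slide.

```python
import numpy as np

def star_forest(leaf_counts):
    """Adjacency matrix of disjoint stars; a star with m leaves has top eigenvalue sqrt(m)."""
    n = sum(m + 1 for m in leaf_counts)
    a = np.zeros((n, n))
    start = 0
    for m in leaf_counts:
        hub = start
        for leaf in range(start + 1, start + m + 1):
            a[hub, leaf] = a[leaf, hub] = 1.0
        start += m + 1
    return a

def top_eigenvalues(adjacency, k):
    """The k largest eigenvalues of a symmetric matrix, in decreasing order."""
    eigvals = np.linalg.eigvalsh(np.asarray(adjacency, dtype=float))  # ascending order
    return eigvals[::-1][:k]

def eigen_exponent(adjacency, k=10):
    """Least-squares slope of log10(eigenvalue) vs. log10(rank) over the top-k positive eigenvalues."""
    top = top_eigenvalues(adjacency, k)
    top = top[top > 0]
    ranks = np.arange(1, len(top) + 1, dtype=float)
    slope, _intercept = np.polyfit(np.log10(ranks), np.log10(top), 1)
    return slope

toy = star_forest([16, 9, 4])      # top eigenvalues are 4, 3 and 2
print(top_eigenvalues(toy, 3))
print(eigen_exponent(toy, k=3))
```

For graphs of Internet scale a sparse eigensolver would be the natural replacement for the dense call used here.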
But:
• Q1: How about graphs from other domains?
• Q2: How about temporal evolution?
More power laws:
• citation counts (citeseer.nj.nec.com, 6/2001)
[Figure: log(count) vs. log(#citations); labeled point: Ullman]
More power laws:
• web hit counts [w/ A. Montgomery]
[Figure: Web Site Traffic; log(count) vs. log(freq); labels: Zipf, users, sites]
The Peer-to-Peer Topology
• Frequency versus degree
• Number of adjacent peers follows a power law [Jovanovic+]
epinions.com
• who-trusts-whom [Richardson + Domingos, KDD 2001]
[Figure: count vs. (out) degree]
More Power laws
• Power laws also hold for other web graphs [Barabasi+], [Tomkins+], with additional ‘rules’ (bipartite cores follow power laws)
A famous power law: Zipf’s law
• Bible - rank vs frequency (log-log)
• similarly, in many other languages; for customers and sales volume; city populations, etc.
[Figure: log(freq) vs. log(rank)]
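A rank-frequency plot of this kind is easy to build for any text. The sketch below is a minimal illustration, assuming whitespace tokenization and case folding (real corpora need more careful tokenization); `rank_frequency` and the sample string are made up for the example.

```python
from collections import Counter
import math

def rank_frequency(text):
    """Return (log10(rank), log10(frequency)) pairs, most frequent word first."""
    counts = Counter(text.lower().split())
    freqs = sorted(counts.values(), reverse=True)
    return [(math.log10(rank), math.log10(freq))
            for rank, freq in enumerate(freqs, start=1)]

# Tiny made-up sample; a Zipf-like text gives points close to a straight line.
for log_rank, log_freq in rank_frequency("the cat and the dog and the bird"):
    print(f"{log_rank:.3f}  {log_freq:.3f}")
```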
Olympic medals (Sydney):
[Figure: log(#medals) vs. rank, with linear fit y = -0.9676x + 2.3054, R² = 0.9458]
More power laws: areas – Korcak’s law
Scandinavian lakes area vs complementary cumulative count (log-log axes)
[Figure: log(count(>= area)) vs. log(area)]
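The complementary cumulative count used on this slide can be computed as follows. This is a sketch under the assumption that all areas are positive (so their logarithms exist); the function name `ccdf_points` and the sample values are illustrative, not lake data.

```python
import numpy as np

def ccdf_points(areas):
    """Return (log10(area), log10(count of items with area >= that area))."""
    a = np.sort(np.asarray(areas, dtype=float))
    # searchsorted(side="left") gives the number of items strictly smaller than a[i],
    # so len(a) minus that is the number of items >= a[i]; ties are handled correctly.
    counts = len(a) - np.searchsorted(a, a, side="left")
    return np.log10(a), np.log10(counts)

log_area, log_count = ccdf_points([1000.0, 1.0, 100.0, 10.0])
print(log_area)
print(log_count)
```

Plotting `log_count` against `log_area` gives the log-log view described above; a roughly straight line suggests a power law.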
Time Evolution: rank R
The rank exponent has not changed!
[Figure: rank exponent (y-axis, -1 to -0.5) vs. instances in time, Nov'97 - now (x-axis, 0 to 800); domain level]
[Figure: log(degree) vs. log(rank); slope -0.82; labeled points: att.com, ibm.com]
Outline
• Problem definition / Motivation
• Graphs and power laws
• Streams and forecasting
• Conclusions
Why care about streams?
• Sensor devices
  – Temperature, weather measurements
  – Road traffic data
  – Geological observations
  – Patient physiological data
• Embedded devices
  – Network routers
  – Intelligent (active) disks
Modeling bursty traffic
Given a signal (e.g., bytes over time)
• model bursty traffic
• generate realistic traces
• (Poisson does not work)
[Figure: # bytes vs. time; label: Poisson]
Modeling bursty traffic
Given a signal (e.g., bytes over time)
• give guarantees:
[Figure: # bytes vs. time; label: Poisson]
Approach
• Q1: How to generate a sequence that is
  – bursty
  – self-similar
  – and has ‘realistic’ queue length distributions
Binary multifractals
[Figure: 20 / 80 split]
[Mengzhi Wang, best student paper award, PEVA’02]
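One common way to build a binary multifractal is a recursive cascade: split the total volume of an interval between its two halves with a fixed bias, pick at random which half gets the larger share, and repeat on every sub-interval. The sketch below reads the slide's 20 / 80 label as a bias of 0.8, which is an assumption; the function name `b_model`, its parameters and the chosen values are illustrative and not taken from the cited paper.

```python
import numpy as np

def b_model(total, levels, bias=0.8, rng=None):
    """Generate 2**levels values summing to `total` via a biased binary cascade."""
    rng = np.random.default_rng() if rng is None else rng
    series = np.array([float(total)])
    for _ in range(levels):
        # For every current interval, choose at random which half gets the `bias` share.
        heavy_left = rng.random(len(series)) < 0.5
        left_share = np.where(heavy_left, bias, 1.0 - bias)
        left = series * left_share
        right = series - left
        # Interleave so each parent interval is replaced by its (left, right) children.
        series = np.column_stack((left, right)).ravel()
    return series

trace = b_model(total=1000.0, levels=10, bias=0.8, rng=np.random.default_rng(0))
print(len(trace), trace.sum(), trace.max())
```

With a bias of 0.5 the cascade produces a flat trace; the closer the bias is to 1, the burstier the result.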
Forecasting:
• AWSOM: using wavelets and AutoRegression [Papadimitriou+, 2003]
Results
Real data – Sunspot
• Sunspot intensity – Slightly time-varying “period”
• AR captures wrong trend (average)
• Seasonal ARIMA
  – Captures immediate wrong downward trend
  – Requires human to determine seasonal component period (fixed)
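For reference, the plain AR baseline mentioned above can be sketched as an ordinary least-squares fit. This is not AWSOM and not seasonal ARIMA; it is a minimal autoregression without an intercept (so it assumes a roughly zero-mean signal), and the names `fit_ar` and `forecast_ar` and the test signal are assumptions made for the example.

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares AR(p) coefficients: x[t] ~ c[0]*x[t-1] + ... + c[p-1]*x[t-p]."""
    x = np.asarray(x, dtype=float)
    # Each row holds the p values preceding x[t], most recent first.
    design = np.array([x[t - p:t][::-1] for t in range(p, len(x))])
    target = x[p:]
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs

def forecast_ar(x, coeffs, steps):
    """Iterated multi-step forecast: each prediction is fed back as input."""
    history = list(np.asarray(x, dtype=float))
    p = len(coeffs)
    out = []
    for _ in range(steps):
        recent = history[-p:][::-1]
        nxt = float(np.dot(coeffs, recent))
        history.append(nxt)
        out.append(nxt)
    return np.array(out)

# A noiseless sinusoid with a fixed period satisfies an exact AR(2) recurrence.
t = np.arange(200)
signal = np.sin(0.3 * t)
coeffs = fit_ar(signal, 2)
print(coeffs)
print(forecast_ar(signal, coeffs, 5))
```

A fixed-coefficient model like this works when the period is constant; the sunspot series has a slightly time-varying period, which is the difficulty the slide points out.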
Results
Real data – Sunspot
• Sunspot intensity – Slightly time-varying “period”
Conclusions
• Graphs & streams pose fascinating problems
• Self-similarity, fractals and power laws work when textbook methods fail!
Other on-going projects
• Video data mining [Pan + Yang]
• Disk traffic modeling [Ailamaki; Ganger]
Books
• Manfred Schroeder: Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise, W.H. Freeman and Company, 1991 (Probably the BEST book on fractals!)
Contact info
• [email protected]
• www.cs.cmu.edu/~christos
• Wean Hall 7107
• Ph#: x8.1457