Anomaly Detection over Graphical KB using HPCC
Detecting abnormalities in the typical behavior of time-evolving
network graphs finds application in social networks, mobile
computing, and internet banking. Unlike previous work on anomaly
detection in information networks, which worked with a static network
graph, our methods consider the network as it evolves and
monitor properties of the network for changes, using the High
Performance Computing Cluster (HPCC) system for parallel
processing.
Introduction
Approach 1: Wiki-watchdog
This approach employs several efficiently computable distributional metrics that are
collectively more powerful for anomaly detection. Distributional analysis
looks at Wikipedia as a whole and provides a multidimensional view of
the revision history from several different perspectives. By monitoring
the distributions of revisions/page and revisions/contributor as they change, we
wish to identify three kinds of anomalies: the activity of bots, events, and
outages.
It has been seen that distribution-based anomaly detection
• speeds up anomaly detection
• identifies events that are hard to detect otherwise
• yields fewer false positives
Metrics:
• Entropy (H): H(σ) = −Σ_{1≤i≤n} p_i log p_i, where p_i = f_i/m
• Zeroth frequency moment (F0): the number of elements i in the distribution with frequency f_i > 0
• Second frequency moment (F2): F2(σ) = Σ_{1≤i≤n} f_i²
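The three metrics can be computed directly from a window's frequency distribution. A minimal Python sketch (the logarithm base is an assumption; the definitions above do not fix it):

```python
from collections import Counter
from math import log2

def distribution_metrics(stream):
    """Compute H, F0, and F2 for a stream of items, e.g. the page IDs
    or contributor IDs seen in one window of revision records."""
    freqs = Counter(stream)                  # f_i: frequency of each distinct item
    m = sum(freqs.values())                  # total number of revisions in the window
    # Entropy: H = -sum p_i log p_i, with p_i = f_i / m
    h = -sum((f / m) * log2(f / m) for f in freqs.values())
    f0 = len(freqs)                          # elements with f_i > 0
    f2 = sum(f * f for f in freqs.values())  # second frequency moment
    return h, f0, f2

# Example: a window where one page dominates (low H, high F2)
h, f0, f2 = distribution_metrics(["p1"] * 8 + ["p2", "p3"])
```

A flash event concentrates edits on few pages, which drives H down and F2 up, exactly the signature the metrics are designed to expose.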
Approach 2: GraphScope
MDL-based, parameter free algorithm. Discover node partitions in streaming,
directed, bipartite graphs, monitoring their evolution over time to detect
events or changes. Unlike many existing techniques, it requires no user
defined parameters, and it operates completely automatically. Furthermore, it
adapts to the dynamic environment by automatically finding the communities
and determining good change-points in time. It iteratively searches for best
source and destination partitions in each graph partitions until further
partitioning does not lead to additional decrease of the encoding cost
Approach
Project Results
HPCC vs. Hadoop
References
Pros:
• HPCC ECL: unlike Java, high-level primitives such as JOIN,
TRANSFORM, PROJECT, SORT, DISTRIBUTE, and MAP are
available.
• Implicitly parallel: parallelism is built into the underlying
platform. The programmer need not be concerned with it.
• Open data model: unlike Hadoop, the data model is defined by
the user and is not constrained by the limitations of a strict
key-value paradigm.
• Integrated toolset (as opposed to a hodgepodge of third-party
tools), including an IDE, job monitoring, a scheduler, a
configuration manager, etc.
Cons:
• ECL is not object oriented.
• There are no unit testing or integration testing frameworks.
• ECL does not retain state without writing to a file or embedding C
code, and a program cannot be compiled without recompiling every
dependency each time.
By Akash Agarwal and Roukna Sengupta
Department of Computer Science, University of Florida
HPCC System
Figure 3: Wiki-Watchdog steps
Figure 4: Cost objective
Figure 5: GraphScope
We calculated the metrics for the Wikipedia revisions of the year 2002, which consisted of just over 1 million
revision entries, and tried to identify three kinds of events:
• Flash events: generally result in a dip in H or a spike in F2 of the page distribution
• Wikipedia bots: reflected by a spike in H and F0 of the page distribution and a dip in H and a spike in F2 of the
contributor distribution
• Outages: dips in the volume metric
Prior work also shows that the anomalies flagged by one metric are not consistently flagged by any other metric, hence
each of the metrics is significant. We observe the results obtained in the following figures and tables:
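The three signatures above can be turned into a simple rule-based labeler over the signed per-window changes of each metric. This is a hypothetical illustration of the mapping, not the poster's actual detection code; the rule form and argument names are assumptions:

```python
def classify_anomaly(dH_page, dF0_page, dF2_page, dH_contrib, dF2_contrib, d_volume):
    """Map signed metric changes (positive = spike, negative = dip) to the
    three event categories. Purely illustrative thresholds at zero."""
    if d_volume < 0:
        return "outage"          # outages show up as dips in the volume metric
    if dH_page > 0 and dF0_page > 0 and dH_contrib < 0 and dF2_contrib > 0:
        return "bot"             # spike in H and F0 of pages, dip in H / spike in F2 of contributors
    if dH_page < 0 and dF2_page > 0:
        return "flash event"     # edits concentrate on few pages
    return "normal"

label = classify_anomaly(-0.2, 0.0, 0.3, 0.0, 0.0, 0.1)
```

In practice each change would be compared against a tuned threshold rather than zero, but the branching mirrors the indicator columns in Tables 1 and 2.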
[1] Jimeng Sun, Christos Faloutsos, Spiros Papadimitriou, and Philip S. Yu.
GraphScope: parameter-free mining of large time-evolving graphs. In
Proceedings of the 13th ACM International Conference on Knowledge
Discovery and Data Mining (SIGKDD), San Jose, CA, pages 687–696.
ACM, 2007.
[2] Chrisil Arackaparambil and Guanhua Yan. Wiki-Watchdog: Anomaly
Detection in Wikipedia Through a Distributional Lens. In 2011
IEEE/WIC/ACM International Conferences on Web Intelligence and
Intelligent Agent Technology, pages 257–264.
[3] Charu C. Aggarwal, Yuchen Zhao, and Philip S. Yu. Outlier detection in
graph streams. In Proceedings of the 27th International Conference on Data
Engineering (ICDE), Hannover, Germany, pages 399–409, 2011.
[4] Stephen Ranshous, Shitian Shen, Danai Koutra, Steve Harenberg, Christos
Faloutsos, and Nagiza F. Samatova. Anomaly detection in dynamic networks: a
survey.
Conclusion
Our experience with HPCC systems is that they shine when it
comes to the manipulation, transformation, and querying of big data.
Our only limitation was the ECL language, in which complex
algorithms are challenging to implement because it never allows
updating or changing anything in place, producing a new dataset on every
operation.
Wiki-watchdog is an efficient, online, distribution-based anomaly
detection methodology. Using it, it is possible to detect several
kinds of anomalies with a detection rate that is higher than
traditional methods, and with a low false-positive rate. The results of
Wiki-watchdog are consistent with prior observations of metric behavior.
The idea of the GraphScope algorithm is appealing because it solves
both the community-identification and anomaly-detection problems,
but it has difficulty scaling to a large number of nodes. It is
more suitable for graphs with fewer nodes and frequent updates.
• Runs on commodity hardware
• Built-in distributed file system
• Scales out to thousands of nodes
• Fault resilient, redundancy and availability
• Powerful development IDE
• Extension modules for specific big data tasks such as machine
learning, web log analytics, natural language parsing, data
encryption, etc.
Dataset
The data contains the complete edit history of Wikipedia from its
inception until January 2008. The decompressed size of the data is
more than 3 TB of text.
HPCC (High-Performance Computing Cluster) is an open source,
data-intensive computing system platform developed by LexisNexis
Risk Solutions. The HPCC platform incorporates a software
architecture implemented on commodity computing clusters to
provide high-performance, data-parallel processing for applications
utilizing big data.
• Spray the parsed data to the HPCC Thor cluster.
• Filter the revision update records.
• Divide the update stream into consecutive windows W1, W2, …, WN, each of a fixed time interval of 24 hours.
• Compute our metrics on the sub-stream Si in the interval Wi, yielding a time series of metric values v1, v2, …, vN over the time intervals.
• Normalize the time-series metric values to v̄1, v̄2, …, v̄N.
• Flag an anomaly when |v̄i − E| > τ.
Parameters:
• E (exponentially weighted moving average): E := E + α(v̄i − E)
• α := 0.3
• τ := 0.014
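The flagging step can be sketched in a few lines of Python. Seeding E with the first window's value is an assumption (the initialization is not specified), as is checking the threshold before updating the average:

```python
def flag_anomalies(values, alpha=0.3, tau=0.014):
    """Flag window i when |v_i - E| > tau, where E is an exponentially
    weighted moving average updated as E := E + alpha * (v_i - E)."""
    e = values[0]             # seed the average with the first window (assumption)
    flagged = []
    for i, v in enumerate(values):
        if abs(v - e) > tau:
            flagged.append(i)
        e += alpha * (v - e)  # update the moving average after the check
    return flagged

spikes = flag_anomalies([0.5, 0.5, 0.9, 0.5])
```

Note that because E lags behind a spike, the window immediately after a large jump may also be flagged until the average catches up; in practice τ and α would be tuned together to control this.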
Segment encoding cost C(s):
  C(s) = log*(t_{s+1} − t_s) + C(s)_p + C(s)_g
Partition encoding cost C(s)_p:
  C(s)_p = log* m + log* n + log* k_s + log* l_s + m·H(P) + n·H(Q)
Graph encoding cost C(s)_g:
  C(s)_g = Σ_{p=1}^{k_s} Σ_{q=1}^{l_s} [ log* |E_{p,q}(s)| + |𝒢_{p,q}(s)| · H(𝒢_{p,q}(s)) ]
Total encoding cost:
  C = Σ_s C(s)
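The partition term of the cost can be made concrete with a small Python sketch. It assumes log* is the usual universal integer code length (iterated log2, summed while positive, constant omitted), and that partitions are given as per-node assignment labels; both are assumptions about notation, not the paper's code:

```python
from collections import Counter
from math import log2

def log_star(x):
    """Universal code length for an integer x >= 1:
    log2(x) + log2(log2(x)) + ... summed while the terms are positive."""
    total, v = 0.0, log2(x)
    while v > 0:
        total += v
        v = log2(v)
    return total

def partition_cost(assign_rows, assign_cols):
    """C(s)_p = log* m + log* n + log* k_s + log* l_s + m*H(P) + n*H(Q),
    where P and Q are the source/destination partition-size distributions."""
    def ent(assign):
        n = len(assign)
        return -sum((c / n) * log2(c / n) for c in Counter(assign).values())
    m, n = len(assign_rows), len(assign_cols)
    k, l = len(set(assign_rows)), len(set(assign_cols))
    return (log_star(m) + log_star(n) + log_star(k) + log_star(l)
            + m * ent(assign_rows) + n * ent(assign_cols))

# 4 source nodes in 2 partitions, 2 destination nodes in 2 partitions
cost = partition_cost([0, 0, 1, 1], [0, 1])
```

GraphScope compares the total cost of keeping a new graph snapshot in the current segment against the cost of starting a new segment, and flags a change-point whenever starting fresh encodes more cheaply.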
Date | Event | Category | Indicator of anomaly
1/1/2002 | Twelve countries introduce the Euro as official tender | Flash event | Dip in H and spike in F2
2/27/2002 | Godhra riots, Grammy | Flash event | Dip in H and spike in F2
10/12/2002 | The deadliest act of terrorism in the history of Indonesia | Flash event | Dip in H and spike in F2

Date | Category | Indicator of anomaly
12/03/2002 | Wikipedia bots | Spike in both H and F0 of page distribution; dip in H and spike in F2 of contributor distribution
05/14/2002 | Wikipedia bots | Spike in both H and F0 of page distribution; dip in H and spike in F2 of contributor distribution
Figure 2: HPCC Architecture
Figure 1: Dataset format
Table 1: Flash Events
Table 2: Bot Events
Figure 6: Entropy vs. Time
Figure 7: F0 vs. Time