Anomaly Detection over Graphical KB using HPCC

Akash Agarwal and Roukna Sengupta
Department of Computer Science, University of Florida

Introduction

Detecting abnormalities in the typical behavior of time-evolving network graphs finds application in social networks, mobile computing, and internet banking. Unlike previous work on anomaly detection in information networks, which worked with a static network graph, our methods consider the network as it evolves and monitor its properties for changes, using the High Performance Computing Cluster (HPCC) system for parallel processing.

Approach 1: Wiki-watchdog

This approach employs several efficient distribution-based metrics that are collectively more powerful for anomaly detection. Distributional analysis looks at Wikipedia as a whole and provides a multidimensional view of revision history from several different perspectives. By monitoring distributions of revisions/page and revisions/contributor as they change, we aim to identify three kinds of anomalies: the activity of bots, flash events, and outages.

It has been seen that distribution-based anomaly detection:

• speeds up anomaly detection

• identifies events that are hard to detect

• yields fewer false positives

Metrics:

• Entropy $H$: $H(\sigma) = -\sum_{1 \le i \le n} p_i \log p_i$, where $p_i = f_i/m$

• Zeroth frequency moment $F_0$: the number of elements $i$ in the distribution with frequency $f_i > 0$

• Second frequency moment $F_2$: $F_2(\sigma) = \sum_{1 \le i \le n} f_i^2$
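
A minimal Python sketch of these three metrics over one window's frequency distribution (the function and variable names are ours, not from the poster):

```python
import math
from collections import Counter

def distribution_metrics(items):
    """Compute H, F0 and F2 for one window's stream of items
    (e.g. the page id of every revision in a 24-hour window)."""
    freq = Counter(items)                   # f_i for each element i
    m = sum(freq.values())                  # total stream length
    # Entropy: H = -sum(p_i log p_i), p_i = f_i / m (natural log here;
    # any fixed base works as long as it is used consistently)
    h = -sum((f / m) * math.log(f / m) for f in freq.values())
    f0 = len(freq)                          # elements with f_i > 0
    f2 = sum(f * f for f in freq.values())  # second frequency moment
    return h, f0, f2

# e.g. pages touched by five revisions in one window
h, f0, f2 = distribution_metrics(["p1", "p2", "p1", "p3", "p1"])
```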

Approach 2: GraphScope

GraphScope is an MDL-based, parameter-free algorithm. It discovers node partitions in streaming, directed, bipartite graphs and monitors their evolution over time to detect events or changes. Unlike many existing techniques, it requires no user-defined parameters and operates completely automatically. Furthermore, it adapts to the dynamic environment by automatically finding the communities and determining good change-points in time. It iteratively searches for the best source and destination partitions in each graph segment until further partitioning does not lead to an additional decrease in the encoding cost.
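
As a rough illustration only (not the authors' code), the core change-point decision compares the MDL cost of absorbing a new snapshot into the current segment against the cost of starting a fresh segment; `encoding_cost` below is a stand-in for the cost objective given later under Figure 4:

```python
def segment_stream(graphs, encoding_cost):
    """Sketch of GraphScope-style segmentation. `graphs` is a list of
    graph snapshots G_1..G_T; `encoding_cost` returns the MDL cost of
    encoding one segment (a stand-in for C^(s) in Figure 4)."""
    segments, current = [], []
    for g in graphs:
        if not current:
            current = [g]
            continue
        merged = encoding_cost(current + [g])                # extend the segment
        split = encoding_cost(current) + encoding_cost([g])  # start a new one
        if merged <= split:
            current.append(g)
        else:
            segments.append(current)   # change-point: close the segment here
            current = [g]
    segments.append(current)
    return segments
```

The real algorithm also re-searches the source and destination partitions inside each candidate segment; this sketch keeps only the cost comparison that marks change-points.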

HPCC v/s Hadoop

Pros:

• HPCC ECL: unlike Java, high-level primitives such as JOIN, TRANSFORM, PROJECT, SORT, DISTRIBUTE, MAP, etc. are available.

• Implicitly parallel: parallelism is built into the underlying platform; the programmer need not be concerned with it.

• Open data model: unlike Hadoop, the data model is defined by the user and is not constrained by the limitations of a strict key-value paradigm.

• Integrated toolset (as opposed to a hodgepodge of third-party tools): an IDE, job monitoring, a scheduler, a configuration manager, etc.

Cons:

• ECL is not object oriented.

• There are no unit-testing or integration-testing frameworks.

• ECL does not retain state without writing to a file or embedding C code, and a query cannot be compiled without recompiling every dependency every time.



Project Results

We calculated the metrics for the Wikipedia revisions of the year 2002, which consisted of just over 1 million revision entries, and tried to identify three kinds of events:

• Flash events: generally result in a dip in H or a spike in F2 of the page distribution

• Wikipedia bots: reflected by a spike in H and F0 of the page distribution and a dip in H and a spike in F2 of the contributor distribution

• Outages: dips in the volume metric

Prior work also shows that the anomalies flagged by one metric are not consistently flagged by any other metric; hence each of the metrics is significant. We observe the results obtained in the following figures and tables:

References

[1] Jimeng Sun, Christos Faloutsos, Spiros Papadimitriou, and Philip S. Yu. GraphScope: parameter-free mining of large time-evolving graphs. In Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Jose, CA, pages 687–696. ACM, 2007.

[2] Chrisil Arackaparambil and Guanhua Yan. Wiki-Watchdog: Anomaly detection in Wikipedia through a distributional lens. In 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, pages 257–264.

[3] Charu C. Aggarwal, Yuchen Zhao, and Philip S. Yu. Outlier detection in graph streams. In Proceedings of the 27th International Conference on Data Engineering (ICDE), Hannover, Germany, pages 399–409, 2011.

[4] Stephen Ranshous, Shitian Shen, Danai Koutra, Steve Harenberg, Christos Faloutsos, and Nagiza F. Samatova. Anomaly detection in dynamic networks: a survey.

Conclusion

Our experience with HPCC Systems is that it shines when it comes to the manipulation, transformation, and querying of big data. Our only limitation was the ECL language, in which complex algorithms are challenging to implement, since it never allows anything to be updated or changed in place and produces a new dataset on every operation.

Wiki-watchdog is an efficient, online, distribution-based anomaly detection methodology. Using it, it is possible to detect several kinds of anomalies with a detection rate higher than that of traditional methods and a low false-positive rate. The results of Wiki-watchdog agree with prior observations of metric behavior.

The idea behind the GraphScope algorithm is good because it solves both the community-identification and the anomaly-detection problem, but it has difficulty scaling to a large number of nodes. It is more suitable for fewer nodes and frequent updates.

HPCC System

HPCC (High-Performance Computing Cluster) is an open-source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data (Figure 2: HPCC Architecture).

• Runs on commodity hardware

• Built-in distributed file system

• Scales out to thousands of nodes

• Fault resilient: redundancy and availability

• Powerful development IDE

• Extension modules for specific big data tasks such as machine learning, web log analytics, natural language parsing, and data encryption

Dataset

The data contains the complete edit history of Wikipedia from its inception until January 2008 (Figure 1: Dataset format). The decompressed size of the data is more than 3 TB of text.

Figure 3. Wiki-Watchdog steps:

1. Spray the parsed data to the HPCC Thor cluster.

2. Filter the revision update records.

3. Divide the update stream into consecutive windows $W_1, W_2, \dots, W_N$, each of a fixed time interval of 24 hours.

4. Compute our metrics on the sub-stream $S_i$ in the interval $W_i$, yielding a time series of metric values $v_1, v_2, \dots, v_N$ over the time intervals.

5. Normalize the time-series metric values to $\bar{v}_1, \bar{v}_2, \dots, \bar{v}_N$.

6. Flag an anomaly when $|\bar{v}_i - E| > \tau$.

Parameters:

• $E$ (exponentially weighted moving average): $E := E + \alpha(\bar{v}_i - E)$

• $\alpha := 0.3$

• $\tau := 0.014$
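
A minimal Python sketch of the flagging step using the poster's parameters; seeding E with the first window and checking before updating are our assumptions, not stated on the poster:

```python
def flag_anomalies(values, alpha=0.3, tau=0.014):
    """Flag window indices whose normalized metric value deviates from
    the exponentially weighted moving average E by more than tau."""
    e = values[0]                 # assumption: seed E with the first window
    flagged = []
    for i, v in enumerate(values[1:], start=1):
        if abs(v - e) > tau:      # flag anomaly when |v_i - E| > tau
            flagged.append(i)
        e += alpha * (v - e)      # E := E + alpha * (v_i - E)
    return flagged

# e.g. normalized entropy values, one per 24-hour window
print(flag_anomalies([0.50, 0.51, 0.50, 0.30, 0.45]))  # -> [3]
```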

Figure 4. Cost objective:

• Segment encoding cost: $C^{(s)} = \log^*(t_{s+1} - t_s) + C_p^{(s)} + C_g^{(s)}$

• Partition encoding cost: $C_p^{(s)} = \log^* m + \log^* n + \log^* k_s + \log^* l_s + m\,H(P) + n\,H(Q)$

• Graph encoding cost: $C_g^{(s)} = \sum_{p=1}^{k_s} \sum_{q=1}^{l_s} \left( \log^* |E_{p,q}^{(s)}| + |\mathcal{G}_{p,q}^{(s)}| \cdot H(\mathcal{G}_{p,q}^{(s)}) \right)$

• Total encoding cost: $C = \sum_s C^{(s)}$
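
For concreteness, a rough Python sketch of the graph-encoding term, usable as the `encoding_cost` stand-in in the segmentation sketch above; the time and partition terms are omitted as our simplification, and the +1 guard on $\log^*$ for empty blocks is our choice:

```python
import math

def log_star(x):
    """Approximate universal code length log*(x) for x >= 1:
    the sum of the positive terms of the iterated logarithm."""
    total, v = 1.0, float(x)
    while v > 1.0:
        v = math.log2(v)
        if v > 0:
            total += v
    return total

def graph_encoding_cost(blocks):
    """Graph encoding term C_g^(s). `blocks` maps a partition pair (p, q)
    to (edge_count, block_size), where block_size = |rows| * |cols|."""
    cost = 0.0
    for edges, size in blocks.values():
        density = edges / size if size else 0.0
        # H(G_pq): binary entropy of the block's edge density
        if density in (0.0, 1.0):
            h = 0.0
        else:
            h = -(density * math.log2(density)
                  + (1 - density) * math.log2(1 - density))
        cost += log_star(edges + 1) + size * h  # log*|E_pq| + |G_pq| H(G_pq)
    return cost
```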

Figure 6. Entropy v/s Time (normalized entropy over Jan-02 to Dec-02)

Figure 7. F0 v/s Time (normalized F0 over Jan-02 to Dec-02)

Table 1: Flash Events

Date        Event                                                        Category      Indicator of anomaly
1/1/2002    Twelve countries introduce the Euro as official tender      Flash event   Dip in H and spike in F2
2/27/2002   Godhra riots, Grammy Awards                                 Flash event   Dip in H and spike in F2
10/12/2002  The deadliest act of terrorism in the history of Indonesia  Flash event   Dip in H and spike in F2

Table 2: Bot Events

Date        Category        Indicator of anomaly
12/03/2002  Wikipedia bots  Spike in both H and F0 of the page distribution; dip in H and spike in F2 of the contributor distribution
05/14/2002  Wikipedia bots  Spike in both H and F0 of the page distribution; dip in H and spike in F2 of the contributor distribution
