Anomaly Detection over Graphical KB using HPCC

By Akash Agarwal and Roukna Sengupta
Department of Computer Science, University of Florida


Introduction

Detecting abnormalities in the typical behavior of time-evolving network graphs finds application in social networks, mobile computing, and internet banking. Unlike previous work on anomaly detection in information networks, which operated on a static network graph, our methods consider the network as it evolves and monitor its properties for changes, using the High Performance Computing Cluster (HPCC) system for parallel processing.

Approach

Approach 1: Wiki-watchdog

This approach employs several efficient distribution-based metrics that are collectively more powerful for anomaly detection than any one alone. Distributional analysis looks at Wikipedia as a whole and provides a multidimensional view of the revision history from several different perspectives. By monitoring the distributions of revisions per page and revisions per contributor as they change, we aim to identify three kinds of anomalies: the activity of bots, events, and outages.

It has been observed that distribution-based anomaly detection

• speeds up anomaly detection

• identifies events that are hard to detect otherwise

• yields fewer false positives

Metrics:

• Entropy (H): $H(\sigma) = -\sum_{1 \le i \le n} p_i \log p_i$, where $p_i = f_i/m$

• Zeroth frequency moment $F_0$: the number of elements $i$ in the distribution with frequency $f_i > 0$

• Second frequency moment $F_2$: $F_2(\sigma) = \sum_{1 \le i \le n} f_i^2$
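The project itself was implemented in ECL on HPCC; the following is a minimal Python sketch of these three metrics computed over the frequency distribution of one window. Function and variable names are illustrative, and the use of base-2 logarithms is an assumption.

```python
from collections import Counter
from math import log2

def distribution_metrics(keys):
    """H, F0 and F2 for the frequency distribution of `keys`
    (e.g. the page or contributor of each revision in one window)."""
    freq = Counter(keys)                     # f_i for each distinct element i
    m = sum(freq.values())                   # total number of revisions in the window
    h = -sum((f / m) * log2(f / m) for f in freq.values())  # H = -sum p_i log p_i
    f0 = len(freq)                           # number of elements with f_i > 0
    f2 = sum(f * f for f in freq.values())   # second frequency moment
    return h, f0, f2

# Example: revisions/page distribution for a single 24-hour window
print(distribution_metrics(["Euro", "Euro", "Grammy", "Euro", "Bali"]))
```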

Approach 2: GraphScope

GraphScope is an MDL-based, parameter-free algorithm. It discovers node partitions in streaming, directed, bipartite graphs and monitors their evolution over time to detect events or changes. Unlike many existing techniques, it requires no user-defined parameters and operates completely automatically. Furthermore, it adapts to the dynamic environment by automatically finding communities and determining good change-points in time. Within each graph segment, it iteratively searches for the best source and destination partitions until further partitioning no longer decreases the encoding cost.
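A minimal sketch of the change-point logic described above, assuming an abstract `encoding_cost` function that returns the MDL cost of a candidate segment of graph snapshots; the partition search inside each segment is omitted, and the names are illustrative rather than taken from the GraphScope paper.

```python
def graphscope_stream(snapshots, encoding_cost):
    """Greedy MDL segmentation: extend the current segment while that is
    cheaper than starting a new one; otherwise flag a change-point."""
    segment, change_points = [], []
    for t, g in enumerate(snapshots):
        if not segment:
            segment.append(g)
            continue
        cost_extend = encoding_cost(segment + [g])                 # keep one segment
        cost_split = encoding_cost(segment) + encoding_cost([g])   # start a new one
        if cost_extend <= cost_split:
            segment.append(g)           # community structure is still a good fit
        else:
            change_points.append(t)     # change-point: candidate event/anomaly
            segment = [g]
    return change_points
```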

HPCC v/s Hadoop

Pros:

• HPCC ECL: unlike Java, high-level primitives such as JOIN, TRANSFORM, PROJECT, SORT, DISTRIBUTE, and MAP are available.

• Implicitly parallel: parallelism is built into the underlying platform; the programmer need not be concerned with it.

• Open data model: unlike Hadoop, the data model is defined by the user and is not constrained by the limitations of a strict key-value paradigm.

• Integrated toolset (as opposed to a hodgepodge of third-party tools), including an IDE, job monitoring, a scheduler, and a configuration manager.

Cons:

• ECL is not object-oriented.

• There are no unit-testing or integration-testing frameworks.

• ECL does not retain state without writing to a file or embedding C code, and it cannot be compiled without recompiling every dependency every time.


Figure 3. Wiki Watchdog steps

Figure 4: Cost objective

Figure 5: GraphScope

Project Results

We calculated the metrics for Wikipedia revisions from the year 2002, which comprised just over 1 million revision entries, and tried to identify three kinds of events:

• Flash events: generally result in a dip in H or a spike in F2 of the page distribution

• Wikipedia bots: reflected by a spike in H and F0 of the page distribution and a dip in H and a spike in F2 of the contributor distribution

• Outages: dips in the volume metric

Prior work also shows that the anomalies flagged by one metric are not consistently flagged by the others, hence each of the metrics is significant. The results are shown in the following figures and tables.

References

[1] Jimeng Sun, Christos Faloutsos, Spiros Papadimitriou, and Philip S. Yu. GraphScope: parameter-free mining of large time-evolving graphs. In Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Jose, CA, pages 687–696. ACM, 2007.

[2] Chrisil Arackaparambil and Guanhua Yan. Wiki-Watchdog: Anomaly detection in Wikipedia through a distributional lens. In 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, pages 257–264.

[3] Charu C. Aggarwal, Yuchen Zhao, and Philip S. Yu. Outlier detection in graph streams. In Proceedings of the 27th International Conference on Data Engineering (ICDE), Hannover, Germany, pages 399–409, 2011.

[4] Stephen Ranshous, Shitian Shen, Danai Koutra, Steve Harenberg, Christos Faloutsos, and Nagiza F. Samatova. Anomaly detection in dynamic networks: a survey.

Conclusion

Our experience with HPCC Systems is that it shines when it comes to manipulating, transforming, and querying big data. Our main limitation was the ECL language, in which complex algorithms are challenging to implement because nothing can be updated in place: every operation produces a new dataset.

Wiki-watchdog is an efficient, online, distribution-based anomaly detection methodology. Using it, several kinds of anomalies can be detected with a detection rate higher than traditional methods and a low false-positive rate. The results of Wiki-watchdog agree with prior observations of metric behavior.

The idea behind the GraphScope algorithm is appealing because it addresses both community identification and anomaly detection, but it has difficulty scaling to a large number of nodes. It is more suitable for graphs with fewer nodes and frequent updates.

HPCC System

HPCC (High-Performance Computing Cluster) is an open-source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data.

• Runs on commodity hardware

• Built-in distributed file system

• Scales out to thousands of nodes

• Fault resilient, with redundancy and availability

• Powerful development IDE

• Extension modules for specific big-data tasks such as machine learning, web-log analytics, natural language parsing, and data encryption

Dataset

The data contains the complete edit history of Wikipedia from its inception until January 2008. The decompressed size of the data is more than 3 TB of text.


Wiki-Watchdog pipeline (Figure 3):

1. Spray the parsed data to the HPCC Thor cluster.
2. Filter the revision update records.
3. Divide the update stream into consecutive windows $W_1, W_2, \ldots, W_N$, each covering a fixed time interval of 24 hours.
4. Compute the metrics on the sub-stream $S_i$ in each interval $W_i$, giving a time series of metric values $v_1, v_2, \ldots, v_N$.
5. Normalize the time-series metric values to $\bar{v}_1, \bar{v}_2, \ldots, \bar{v}_N$.
6. Flag an anomaly when $|\bar{v}_i - E| > \tau$.

Parameters:

• $E$ (exponentially weighted moving average): $E := E + \alpha(\bar{v}_i - E)$

• $\alpha := 0.3$

• $\tau := 0.014$
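A minimal Python sketch of steps 5–6 and the EWMA update above. The initialization of E and the choice to check the flag before updating E are assumptions, and the names are illustrative rather than from the project's ECL code.

```python
def flag_anomalies(norm_values, alpha=0.3, tau=0.014):
    """Flag window i when |v̄_i - E| > tau, where E is the exponentially
    weighted moving average updated as E := E + alpha * (v̄_i - E)."""
    flagged = []
    e = norm_values[0]                  # assumed initialization of the EWMA
    for i, v in enumerate(norm_values[1:], start=1):
        if abs(v - e) > tau:
            flagged.append(i)           # candidate anomaly in window W_i
        e += alpha * (v - e)            # EWMA update
    return flagged

# Example: a spike in a normalized entropy series; both the spike and the
# drop back toward the moving average are flagged.
print(flag_anomalies([0.50, 0.51, 0.50, 0.58, 0.50]))  # -> [3, 4]
```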

GraphScope cost objective (Figure 4), where $m$ and $n$ are the numbers of source and destination nodes, $k_s$ and $\ell_s$ the numbers of source and destination partitions in segment $s$, and $P$, $Q$ the distributions of source and destination partition sizes:

• Segment encoding cost: $C^{(s)} = \log^*(t_{s+1} - t_s) + C_p^{(s)} + C_g^{(s)}$

• Partition encoding cost: $C_p^{(s)} = \log^* m + \log^* n + \log^* k_s + \log^* \ell_s + m H(P) + n H(Q)$

• Graph encoding cost: $C_g^{(s)} = \sum_{p=1}^{k_s} \sum_{q=1}^{\ell_s} \left( \log^* |E^{(s)}_{p,q}| + |\mathcal{G}^{(s)}_{p,q}| \, H(\mathcal{G}^{(s)}_{p,q}) \right)$

• Total encoding cost: $C = \sum_s C^{(s)}$
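A hedged Python sketch of the $\log^*$ universal code length and the partition cost $C_p^{(s)}$ above; the graph cost and segment total follow the same pattern. Helper names and the choice of base-2 logarithms are assumptions, not part of the original formulation.

```python
from math import log2

def log_star(n):
    """Rissanen's universal code length for a positive integer n:
    log2(n) + log2(log2(n)) + ... (positive terms) + log2(2.865064)."""
    total, x = log2(2.865064), float(n)
    while True:
        x = log2(x)
        if x <= 0:
            break
        total += x
    return total

def partition_cost(m, n, src_sizes, dst_sizes):
    """C_p^(s) = log*m + log*n + log*k_s + log*l_s + m*H(P) + n*H(Q),
    where P and Q are the source/destination partition-size distributions."""
    def size_entropy(sizes, total):
        return -sum((s / total) * log2(s / total) for s in sizes if s > 0)
    k, l = len(src_sizes), len(dst_sizes)
    return (log_star(m) + log_star(n) + log_star(k) + log_star(l)
            + m * size_entropy(src_sizes, m) + n * size_entropy(dst_sizes, n))

# Example: 100 source nodes in 3 partitions, 80 destination nodes in 2
print(partition_cost(100, 80, [50, 30, 20], [60, 20]))
```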

[Plots: Entropy and F0 of the page distribution over time, January–December 2002; see Figures 6 and 7.]

Table 1: Flash Events

Date | Event | Category | Indicator of anomaly
1/1/2002 | Twelve countries introduce the Euro as official tender | Flash event | Dip in H and spike in F2
2/27/2002 | Godhra riots, Grammy Awards | Flash event | Dip in H and spike in F2
10/12/2002 | The deadliest act of terrorism in the history of Indonesia | Flash event | Dip in H and spike in F2

Table 2: Bot Events

Date | Category | Indicator of anomaly
12/03/2002 | Wikipedia bots | Spike in both H and F0 of the page distribution; dip in H and spike in F2 of the contributor distribution
05/14/2002 | Wikipedia bots | Spike in both H and F0 of the page distribution; dip in H and spike in F2 of the contributor distribution

Figure 1: Dataset format

Figure 2: HPCC Architecture

Figure 6: Entropy v/s Time

Figure 7: F0 v/s Time