Download - Anomaly Detection with Extreme Value Theory

Anomaly Detection with Extreme Value Theory

A. Siffer, P-A Fouque, A. Termier and C. LargouetApril 26, 2017

Contents

Context

Providing better thresholds

Finding anomalies in streams

Application to intrusion detection

A more general framework

1

Context

General motivations

⊸ Massive usage of the Internet

• More and more vulnerabilities• More and more threats

⊸ Awareness of the sensitive data and infrastructures

2

General motivations

⊸ Massive usage of the Internet• More and more vulnerabilities

• More and more threats


2

General motivations

⊸ Massive usage of the Internet• More and more vulnerabilities• More and more threats


2

General motivations

⊸ Massive usage of the Internet• More and more vulnerabilities• More and more threats


2

⇒ Network security :a major concern

A Solution

⊸ IDS (Intrusion Detection System)• Monitor traffic• Detect attacks

⊸ Current methods : rule-based

• Work fine on common and well-known attacks• Cannot detect new attacks

⊸ Emerging methods : anomaly-based

• Use the network data to estimate a normal behavior• Apply algorithms to detect abnormal events (→ attacks)

3

A Solution


⊸ Current methods : rule-based• Work fine on common and well-known attacks• Cannot detect new attacks

⊸ Emerging methods : anomaly-based

• Use the network data to estimate a normal behavior• Apply algorithms to detect abnormal events (→ attacks)

3

A Solution


⊸ Current methods : rule-based• Work fine on common and well-known attacks• Cannot detect new attacks

⊸ Emerging methods : anomaly-based• Use the network data to estimate a normal behavior• Apply algorithms to detect abnormal events (→ attacks)

3

Overview

⊸ Basic scheme

Algorithmdata alerts

⊸ Many ”standard” algorithms have been tested⊸ Complex pipelines are emerging (ensemble/hybrid techniques)

dataalerts

4

Overview

⊸ Basic scheme


⊸ Many ”standard” algorithms have been tested

⊸ Complex pipelines are emerging (ensemble/hybrid techniques)

dataalerts

4

Overview

⊸ Basic scheme


⊸ Many ”standard” algorithms have been tested⊸ Complex pipelines are emerging (ensemble/hybrid techniques)

dataalerts

4

Inherent problem

⊸ Algorithms are not magic• They give some information about data (scores)

• But the decision often rely on a human choice

if score>threshold then trigger alert

⊸ The thresholds are often hard-set

• Expertise• Fine-tuning• Distribution assumption

⊸ Our idea: provide dynamic threshold with a probabilisticmeaning

5

Inherent problem

⊸ Algorithms are not magic• They give some information about data (scores)• But the decision often rely on a human choice


⊸ The thresholds are often hard-set

• Expertise• Fine-tuning• Distribution assumption


5

Inherent problem

⊸ Algorithms are not magic• They give some information about data (scores)• But the decision often rely on a human choice


⊸ The thresholds are often hard-set• Expertise• Fine-tuning• Distribution assumption


5

Providing better thresholds

My problem

0 25 50 75 100 125 150 175 200Daily payment by credit card ( )

0.000

0.005

0.010

0.015

0.020

0.025

Freq

uenc

y

⊸ How to set zq such that P(X€ > zq) < q ?

6

My problem


0.000

0.005

0.010

0.015

0.020

0.025

Freq

uenc

y

⊸ How to set zq such that P(X€ > zq) < q ?6

Solution 1: empirical approach


0.000

0.005

0.010

0.015

0.020

0.025

Freq

uenc

y

⊸ Drawbacks: stuck in the interval, poor resolution

7



0.000

0.005

0.010

0.015

0.020

0.025

Freq

uenc

y

1− q q

zq

⊸ Drawbacks: stuck in the interval, poor resolution

7



0.000

0.005

0.010

0.015

0.020

0.025

Freq

uenc

y

1− q q

zq

⊸ Drawbacks: stuck in the interval, poor resolution7

Solution 2: Standard Model


0.000

0.005

0.010

0.015

0.020

0.025

Freq

uenc

y

⊸ Drawbacks: manual step, distribution assumption

8



0.000

0.005

0.010

0.015

0.020

0.025

Freq

uenc

y

zq

⊸ Drawbacks: manual step, distribution assumption

8



0.000

0.005

0.010

0.015

0.020

0.025

Freq

uenc

y

zq

⊸ Drawbacks: manual step, distribution assumption8

Realities

0 100 200 3000.000

0.005

0.010

0.015

16 18 20 22 240.00

0.05

0.10

0.15

0.20

20 40 60 80 1000.000

0.005

0.010

0.015

0.020

0 25 50 750.00

0.01

0.02

0.03

⊸ Different clients and/or temporal drift

9

Results

Properties Empirical quantile Standard modelstatistical guarantees Yes Yes

easy to adapt Yes Nohigh resolution No Yes

10

Inspection of extreme events


0.000

0.005

0.010

0.015

0.020

0.025

Freq

uenc

y

Probability estimation ?

11

Extreme Value Theory

⊸ Main result (Fisher-Tippett-Gnedenko, 1928)

The extreme values of any distribution have nearly the samedistribution (called Extreme Value Distribution)

0 1 2 3 4 5 6x

0.0

0.1

0.2

0.3

0.4

0.5

0.6

(X>

x)

heavy tailexponential tailbounded tail

12

An impressive analogy

⊸ Let X1, X2, . . . Xn a sequence of i.i.d. random variables with

Sn =n∑i=1

Xi Mn = max1≤i≤n

(Xi)

⊸ Central Limit TheoremSn − nµ√n

d−→ N (0, σ2)

⊸ FTG Theorem Mn − anbn

d−→ EVD(γ)

13

A more practical result

⊸ Second theorem of EVT (Pickands-Balkema-de Haan, 1974)

The excesses over a high threshold follow a Generalized ParetoDistribution (with parameters γ, σ)

⊸ What does it imply ?

• we have a model for extreme events• we can compute zq for q as small as desired

14

A more practical result

⊸ Second theorem of EVT (Pickands-Balkema-de Haan, 1974)

The excesses over a high threshold follow a Generalized ParetoDistribution (with parameters γ, σ)

⊸ What does it imply ?• we have a model for extreme events• we can compute zq for q as small as desired

14

How to use EVT

⊸ Get some data X1, X2 . . . Xn⊸ Set a high threshold t and retrieve the excesses Yj = Xkj − t

when Xkj > t

⊸ Fit a GPD to the Yj (→ find parameters γ, σ)⊸ Compute zq such as P(X > zq) < q

EVT

q

X1, X2 . . . Xn zq

15

How to use EVT


when Xkj > t⊸ Fit a GPD to the Yj (→ find parameters γ, σ)

⊸ Compute zq such as P(X > zq) < q

EVT

q

X1, X2 . . . Xn zq

15

How to use EVT


when Xkj > t⊸ Fit a GPD to the Yj (→ find parameters γ, σ)⊸ Compute zq such as P(X > zq) < q

EVT

q

X1, X2 . . . Xn zq

15

Finding anomalies in streams

Streaming Peaks-Over-Threshold (SPOT) algorithm

(initial batch)

X1, X2 . . . Xn Calibration

q

t zq

(stream)

Xi>n Xi > zq

trigger alarmyes

no Xi > t

yesupdate model

no drop

16

Streaming Peaks-Over-Threshold (SPOT) algorithm

(initial batch)

X1, X2 . . . Xn Calibration

q

0

0.10

0.20

0.30

20 40 60 80 100 120

t zq

(stream)

Xi>n Xi > zq

trigger alarmyes

no Xi > t

yesupdate model

no drop

16

Can we trust that threshold zq ?

⊸ An example with ground truth : a Gaussian White Noise• 40 streams with 200 000 iid variables drawn from N (0, 1)• q = 10−3 ⇒ theoretical threshold zth ≃ 3.09

⊸ Averaged relative error

0 50000 100000 150000 200000Number of observations

0.01

0.02

0.03

0.04

Rela

tive

erro

r

17

Application to intrusion detection

About the data

⊸ Lack of relevant public datasets to test the algorithms ...

⊸ KDD99 ? See [McHugh 2000] and [Mahoney & Chan 2003]⊸ We rather use MAWI

• 15 min a day of real traffic (.pcap file)• Anomaly patterns given by the MAWILab [Fontugne et al. 2010]with taxonomy [Mazel et al. 2014]

⊸ Preprocessing step : raw .pcap → NetFlow format (onlymetadata)

18

About the data

⊸ Lack of relevant public datasets to test the algorithms ...⊸ KDD99 ? See [McHugh 2000] and [Mahoney & Chan 2003]

⊸ We rather use MAWI



18

About the data

⊸ Lack of relevant public datasets to test the algorithms ...⊸ KDD99 ? See [McHugh 2000] and [Mahoney & Chan 2003]⊸ We rather use MAWI



18

An example to detect network syn scan

⊸ The ratio of SYN packets : relevant feature to detect networkscan [Fernandes & Owezarski 2009]

0.0 100.0 200.0 300.0 400.0 500.0 600.0 700.0 800.0Time (s)

0.0

0.2

0.4

0.6

0.8

ratio

of S

YN p

acke

ts in

50m

s tim

e wi

ndow

⊸ Goal: find peaks

19



0.0 100.0 200.0 300.0 400.0 500.0 600.0 700.0 800.0Time (s)

0.0

0.2

0.4

0.6

0.8ra

tio o

f SYN

pac

kets

in 5

0ms t

ime

wind

ow

⊸ Goal: find peaks

19



0.0 100.0 200.0 300.0 400.0 500.0 600.0 700.0 800.0Time (s)

0.0

0.2

0.4

0.6

0.8ra

tio o

f SYN

pac

kets

in 5

0ms t

ime

wind

ow

⊸ Goal: find peaks19

SPOT results

⊸ Parameters : q = 10−4,n = 2000 (from the previous day record)

0.0 100.0 200.0 300.0 400.0 500.0 600.0 700.0 800.0Time (s)

0.0

0.2

0.4

0.6

0.8

ratio

of S

YN p

acke

ts in

50m

s tim

e wi

ndow

20

SPOT results

⊸ Parameters : q = 10−4,n = 2000 (from the previous day record)

0.0 100.0 200.0 300.0 400.0 500.0 600.0 700.0 800.0Time (s)

0.0

0.2

0.4

0.6

0.8ra

tio o

f SYN

pac

kets

in 5

0ms t

ime

wind

ow

20

Do we really flag scan attacks ?

⊸ The main parameter q: a False Positive regulator

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5

FPr (%)

0.0

20.0

40.0

60.0

80.0

100.0

TPr

(%)

9. 10−6

1. 10−5 5. 10−5

8. 10−5

8. 10−6

6. 10−3

⊸ 86% of scan flows detected with less than 4% of FP

21



0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5

FPr (%)

0.0

20.0

40.0

60.0

80.0

100.0TPr

(%)

9. 10−6

1. 10−5 5. 10−5

8. 10−5

8. 10−6

6. 10−3

⊸ 86% of scan flows detected with less than 4% of FP

21



0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5

FPr (%)

0.0

20.0

40.0

60.0

80.0

100.0TPr

(%)

9. 10−6

1. 10−5 5. 10−5

8. 10−5

8. 10−6

6. 10−3

⊸ 86% of scan flows detected with less than 4% of FP21

A more general framework

SPOT Specifications

⊸ A single main parameter q• With a probabilistic meaning → P(X > zq) < q• False Positive regulator

⊸ Stream capable

• Incremental learning• Fast (∼ 1000 values/s)• Low memory usage (only the excesses)

22

SPOT Specifications

⊸ A single main parameter q• With a probabilistic meaning → P(X > zq) < q• False Positive regulator

⊸ Stream capable• Incremental learning• Fast (∼ 1000 values/s)• Low memory usage (only the excesses)

22

Other things ?

⊸ SPOT• performs dynamic thresholding without distribution assumption• uses it to detect network anomalies

⊸ But it could be adapted to

• compute upper and lower thresholds• other fields• drifting contexts (with an additional parameter) → DSPOT

23

Other things ?


⊸ But it could be adapted to• compute upper and lower thresholds

• other fields• drifting contexts (with an additional parameter) → DSPOT

23

Other things ?


⊸ But it could be adapted to• compute upper and lower thresholds• other fields

• drifting contexts (with an additional parameter) → DSPOT

23

Other things ?


⊸ But it could be adapted to• compute upper and lower thresholds• other fields• drifting contexts (with an additional parameter) → DSPOT

23

A recent example

⊸ Thursday the 9th of February 2017

• 9h : explosion at Flamanville nuclear plant• 11h : official declaration of the incident by EDF

⊸ What about the EDF stock prices ?

24

A recent example

⊸ Thursday the 9th of February 2017• 9h : explosion at Flamanville nuclear plant

• 11h : official declaration of the incident by EDF


24

A recent example

⊸ Thursday the 9th of February 2017• 9h : explosion at Flamanville nuclear plant• 11h : official declaration of the incident by EDF


24

EDF stock prices

09:02

09:42

10:34

11:32

12:19

13:30

14:43

15:34

16:21

17:14

Time

9.05

9.10

9.15

9.20

9.25

9.30

9.35

9.40ED

F st

ock

price

()

25

Conclusion

⊸ Context: A great deal of work has been done to developanomaly detection algorithms

⊸ Problem: Decision thresholds rely on either distributionassumption or expertise

⊸ Our solution: Building dynamic threshold with a probabilisticmeaning

• Application to detect network anomalies• But a general tool to monitor online time series in a blind way

⊸ Future: Adapt the method to higher dimensions

26

Conclusion




• Application to detect network anomalies

• But a general tool to monitor online time series in a blind way


26

Conclusion




• Application to detect network anomalies• But a general tool to monitor online time series in a blind way


26