Statistical Decision Theory
Lionel Fillatre
ENST Bretagne, Computer Science Department

Outline
Part I: Anomaly detection in networks: state-of-the-art
Part II: Statistical testing: fundamentals
Part III: Statistical testing: sequential approaches
Part IV: Statistical tests: a case study
Part I: Anomaly detection in networks
1 Motivation
2 Network anomalies
3 Sources of network data
4 Anomaly detection methods
Part II: Statistical testing
5 Motivation
6 Test between two simple hypotheses
7 Test between two composite hypotheses
Part III: Sequential approaches
8 Motivation
9 Sequential probability ratio test
10 Change detection: known change
11 Change detection: unknown change
Part IV: A case study
12 DOS attack detection
13 Multichannel parametric CUSUM
14 Multichannel non-parametric CUSUM
15 Practical example
Part I
Anomaly detection in networks
Outline of Part I
1 Motivation
2 Network anomalies
3 Sources of network data
4 Anomaly detection methods
Motivation
Networks are complex systems: vast amounts of information need to be collected and processed.
It is desirable to detect network anomalies and performance bottlenecks to improve network management.
To detect anomalies, it is necessary:
to give a definition of network anomalies,
to choose the sources of network data relevant to detect anomalies,
to choose a method to detect anomalies.
Network anomalies
Definition: network anomalies typically refer to circumstances when network operations deviate from normal network behavior.
Classification: there are two kinds of anomalies:
Network failures: server failures, broadcast storms, transient congestions, ...
Security-related problems: denial of service (DOS), network intrusions, ...
For the purpose of anomaly detection, we must characterize normal traffic behavior.
Data from network probes
Network probes are specialized tools such as “ping” and “traceroute”.
These methods do not require the cooperation of the network service provider.
Performance metrics derived from such tools can provide only a coarse-grained view of the network.
Hence, the data obtained from probing mechanisms may be of limited value for anomaly detection.
Data from packet filtering
Packet flows are sampled by capturing the IP headers of a select set of packets at different points in the network.
For flow-based monitoring, a flow is identified by source-destination addresses and source-destination port numbers.
Data obtained from this method can be used to detect anomalous network flows.
However, the hardware requirements of this measurement method make it difficult to use in practice.
Data from routing protocols
The data collected can be used to build the network topology and provides link status updates.
Since routing updates occur at frequent intervals, any change in link utilization will be updated in near real time.
However, since routing updates must be kept small, only limited information pertaining to link statistics can be propagated through routing updates.
Network management protocols
Network management protocols provide information about network traffic statistics.
The information obtained can be used to characterize network behavior.
This source of data is obtained by using the Simple Network Management Protocol (SNMP):
This protocol provides a mechanism to communicate between the manager and hundreds of SNMP agents.
The SNMP server maintains a database of management variables called the Management Information Base (MIB) variables.
It is a widely deployed protocol and has been standardized for all different network devices.
Due to the fine-grained data available from SNMP, it is a good data source for network anomaly detection.
Hierarchical scheme of methods
Anomaly detection:
Rule-based approaches (deterministic or stochastic)
Finite state machines
Pattern matching
Signal processing approaches
Statistical testing approaches (non-sequential or sequential)
Rule-based approaches (1/2)
Early work in the area of fault or anomaly detection was based on expert systems.
An exhaustive database containing the rules of behavior of the faulty system is used to determine if a fault occurred.
Two kinds of rule selection are possible: deterministic or stochastic (belief networks for example).
Rule-based approaches (2/2)
These rule-based systems rely heavily on the expertise of the network manager and do not adapt well to the evolving network environment.
It is possible to improve such a system by adding a picture of previous fault scenarios, which leads to case-based reasoning systems.
These systems have a heavy dependence on past information, and the number of functions to be learned also increases with the number of faults studied.
Finite state machines
Anomaly or fault detection using finite state machines models alarm sequences that occur during and prior to fault events.
An alarm is modeled as a state of the finite state machine.
Finite state machines are built for a known network fault using history data.
Not all faults can be captured by a finite sequence of alarms of reasonable length.
Pattern matching
Online learning is used to build a traffic profile for a given network.
Traffic profiles are built using symptom-specific feature vectors such as link utilization.
When acquired data fail to fit the developed profiles within some confidence interval, an anomaly is declared.
The efficiency depends on the accuracy of the traffic profile generated. It is necessary to spend a considerable amount of time building traffic profiles (this method does not scale gracefully).
Signal processing techniques
Signal processing techniques have been used to model data flows.
The normal behavior of data flows is modeled by using several approaches: spectral analysis, time series analysis, wavelet decompositions, ...
Anomalies correspond to deviations in the normal behavior of the data flows.
Statistical testing (1/2)
Statistical testing has been used to detect both anomalies corresponding to network failures and network intrusions.
The statistical nature of the available information is used to define the normal behavior of the network (distribution of packet sizes, ...).
Non-sequential and sequential approaches can be used according to the network manager’s requirements.
Statistical testing (2/2)
Non-sequential approaches allow us to define optimal algorithms: minimization of false alarms and maximization of the probability of anomaly detection.
Sequential approaches are used to minimize the number of observations needed to detect an anomaly.
When data flows are modeled by using parametric models, the design of optimal algorithms is possible.
Non-parametric approaches are particularly studied because of the lack of parametric models. These approaches are often suboptimal.
Part II
Statistical testing: fundamentals
Outline of Part II
5 Motivation
6 Test between two simple hypotheses
7 Test between two composite hypotheses
Main objectives
Given some observations, the aim is to diagnose a system: detection and identification of an anomaly.
Observations are often noisy due to model errors and/or measurement errors.
For our purpose, the final aim consists of designing automatic systems that monitor a network and launch alarms when an anomaly appears.
Practical examples
To detect Denial of Service (DOS) attacks on a server.
To detect an abrupt change in the link utilizations on a network.
To identify the protocol associated with a flow of packets: HTTP, FTP, ...
Basic notations
Assume that we have two probability distributions P1, P2.
Let y1, ..., yn be an n-size sample of independent and identically distributed (i.i.d.) random variables generated by one of these distributions.
It is assumed that yi ∈ Ω for all i (for example Ω = R^m) and that Ω^n is the observation space.
Let us denote by Ei[yk] the expectation of yk when yk follows the distribution Pi, which is denoted yk ∼ Pi.
Assume that each distribution Pi has a probability density function (pdf) fi(y). All results can be applied to discrete random variables.
Basic definitions
Definition (simple hypothesis)
We call a simple hypothesis Hk any assumption concerning the distribution Pk that can be reduced to a single value in the space of probability distributions, which is denoted:
Hk = {y1, ..., yn ∼ Pk}, k = 1, 2.
Definition (statistical test)
We call a statistical test for testing between hypotheses H1 and H2 any measurable mapping g : Ω^n → {H1, H2}.
Basic definitions: an illustration
[Diagram: given the distributions P1, P2 and a criterion of optimality, a test g(·) is designed that maps each observation y1, ..., yn of the observation space Ω^n to a decision H1 or H2.]
Basic definitions
Definition (quality of a test)
The quality of a test is defined with the aid of a set of error probabilities:
αi = Pr(g(y1, ..., yn) ≠ Hi | Hi true) = Pri(g(y1, ..., yn) ≠ Hi),
where αi is the probability of rejecting hypothesis Hi when it is true.
Remark
α1 is called the probability of false alarm;
α2 is called the probability of miss.
Bayes test (1/2)
Assume that each hypothesis Hi has a known a priori probability qi such that q1 + q2 = 1.
Definition (weighted error probability)
For a test g, we define the weighted error probability α(g) by
α(g) = q1 α1 + q2 α2.
Definition (Bayes test)
The test g is said to be a Bayes test if it minimizes α(g) for given a priori probabilities qi.
Bayes test (2/2)
Definition (likelihood ratio)
The Likelihood Ratio (LR) between two pdfs f1 and f2 for the independent sequence of observations y1, ..., yn is
Λ(y1, ..., yn) = ∏_{i=1}^{n} f2(yi) / f1(yi).
Theorem (Bayes test)
The test g which minimizes α(g) is defined by
g(y1, ..., yn) = H1 if Λ(y1, ..., yn) < q1/q2,
g(y1, ..., yn) = H2 if Λ(y1, ..., yn) ≥ q1/q2.
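As a small numerical sketch of this decision rule (Python with NumPy; the function names and the Gaussian example are illustrative additions, not part of the course material):

```python
import numpy as np

def likelihood_ratio(y, f1, f2):
    """Likelihood ratio of the i.i.d. sample y between the pdfs f2 and f1."""
    return float(np.prod(f2(y) / f1(y)))

def bayes_test(y, f1, f2, q1, q2):
    """Bayes test: accept H2 iff the likelihood ratio reaches q1/q2."""
    return "H2" if likelihood_ratio(y, f1, f2) >= q1 / q2 else "H1"

# Illustrative example: N(0, 1) under H1 against N(2, 1) under H2.
phi = lambda x, theta: np.exp(-(x - theta) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
f1 = lambda y: phi(y, 0.0)
f2 = lambda y: phi(y, 2.0)

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=20)       # data actually generated under H2
print(bayes_test(y, f1, f2, 0.5, 0.5))  # accepts H2 with overwhelming probability
```

For long samples the product of ratios underflows; in practice one compares the log-likelihood ratio with log(q1/q2) instead.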
Most Powerful Test (1/2)
Definition
Let Kα be the class of tests with a bounded probability of false alarm:
Kα = {g : α1(g) ≤ α}.
Definition (most powerful test)
We say that a test g* ∈ Kα is the Most Powerful (MP) test in the class Kα if, for all g ∈ Kα,
α2(g*) ≤ α2(g),
or, equivalently,
β(g*) ≥ β(g),
where β(g) = 1 − α2(g) is the power of the test g.
Most Powerful Test (2/2)
Theorem (Neyman-Pearson lemma)
The MP test g* in Kα is given by
g*(y1, ..., yn) = H1 if Λ(y1, ..., yn) < λα,
g*(y1, ..., yn) = H2 if Λ(y1, ..., yn) ≥ λα,
where λα is chosen such that α1(g*) = α.
Remark
This lemma is fundamental from the theoretical point of view, but its interest is often limited from the practical point of view.
Location testing with Gaussian errors
Assume yi ∼ N(θ, 1).
The two hypotheses are H1 : θ = θ1 and H2 : θ = θ2 with 0 < θ1 < θ2.
The pdf of a Gaussian variable N(θ, 1) is φθ(x) = φ(x − θ) with
φ(x) = (1/√(2π)) exp(−x²/2).
Question
Find the Neyman-Pearson test.
Solution (1/2)
By subtracting θ1 from yi, we can suppose that θ1 = 0.
log Λ(y1, ..., yn) = θ2 (∑_{i=1}^{n} yi − n θ2/2).
The Neyman-Pearson test is given by
g*(y1, ..., yn) = H1 if (1/√n) ∑_{i=1}^{n} yi < λ′α,
g*(y1, ..., yn) = H2 if (1/√n) ∑_{i=1}^{n} yi ≥ λ′α,
with λ′α = λα/(θ2 √n) + θ2 √n/2, where λα is the threshold on log Λ.
Under H1, Λn = (1/√n) ∑_{i=1}^{n} yi ∼ N(0, 1) and λ′α = Φ⁻¹(1 − α), i.e. α1(g*) = Pr(Λn ≥ λ′α) = α, where Φ is the cumulative distribution function of the standardized Gaussian variable N(0, 1).
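This threshold rule is short to code; a sketch in Python (the function name is illustrative; the standard library's NormalDist supplies Φ⁻¹):

```python
import numpy as np
from statistics import NormalDist

def neyman_pearson_gauss(y, alpha):
    """Neyman-Pearson test for yi ~ N(theta, 1), H1: theta = 0 against a
    positive shift.  Accepts H2 when (1/sqrt(n)) * sum(yi) reaches the
    threshold Phi^{-1}(1 - alpha), so the false-alarm probability is alpha."""
    stat = float(np.sum(y)) / np.sqrt(len(y))
    lam = NormalDist().inv_cdf(1.0 - alpha)
    return ("H2" if stat >= lam else "H1"), stat, lam

decision, stat, lam = neyman_pearson_gauss(np.full(25, 1.0), 0.05)
print(decision)  # stat = 5.0 exceeds lam ~ 1.645, so "H2"
```

Note that the decision only needs the normalized sum and Φ⁻¹(1 − α), exactly as in the solution above.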
Solution (2/2): graphical illustration
[Figure: the pdfs φ(x) under H1 (θ1 = 0) and φ(x − 5) under H2 (θ2 = 5), with the threshold λ0.01 separating the two acceptance regions; the shaded areas represent the false-alarm probability α1(g*) and the miss probability α2(g*).]
Basic notations
Let y1, ..., yn be an n-size sample of i.i.d. random variables generated by a distribution Pθ parameterized by a vector θ ∈ Θ.
It is assumed that yi ∈ Ω for all i (for example Ω = R^m) and that Ω^n is the observation space.
Let us denote by Eθ[yk] the expectation of yk when yk follows the distribution Pθ, which is denoted yk ∼ Pθ.
Assume that each distribution Pθ has a probability density function (pdf) fθ(y). All results can be applied to discrete random variables.
Basic definitions
Definition (composite hypothesis)
Any nonsimple hypothesis is called a composite hypothesis.
Definition
Let us denote H1 : θ ∈ Θ1 and H2 : θ ∈ Θ2 with Θ1 ∩ Θ2 = ∅ and Θ1, Θ2 two specified subsets of Θ.
Definition (size of a test)
Let α1(g) be the size of a test, defined by:
α1(g) = sup_{θ∈Θ1} Pr(g(y1, ..., yn) ≠ H1 | H1 true) = sup_{θ∈Θ1} Prθ(g(y1, ..., yn) ≠ H1),
and let Kα be the class of tests with fixed size:
Kα = {g : α1(g) ≤ α}.
Uniformly most powerful test
Definition (power function of a test)
The power function of a test g is defined by:
βg(θ) = Prθ(g(y1, ..., yn) = H2), θ ∈ Θ2.
Definition (uniformly most powerful test)
A test g* ∈ Kα is said to be Uniformly Most Powerful (UMP) in the class Kα of tests with fixed size α1(g) = α if, for all other tests g ∈ Kα, we have:
∀θ ∈ Θ2, βg(θ) ≤ βg*(θ).
Graphical interpretation
[Figure: power functions β(θ) plotted against θ, with Θ1 to the left of a boundary value of θ and Θ2 to its right; over Θ2, the power curve of the UMP test lies above the curves of all other tests of size α.]
Location testing with Gaussian errors
Assume yi ∼ N(θ, 1).
The two hypotheses are H1 : θ = 0 and H2 : θ ≥ θ2 with θ2 > 0.
The pdf of a Gaussian variable N(θ, 1) is φθ(x) = φ(x − θ) with
φ(x) = (1/√(2π)) exp(−x²/2).
Question
Find the UMP test.
Solution
The Neyman-Pearson test between H1 : θ = 0 and H2(θ2) : θ = θ2 is given by
g*(y1, ..., yn) = H1 if (1/√n) ∑_{i=1}^{n} yi < λα,
g*(y1, ..., yn) = H2(θ2) if (1/√n) ∑_{i=1}^{n} yi ≥ λα.
Under H1, (1/√n) ∑_{i=1}^{n} yi ∼ N(0, 1) and λα = Φ⁻¹(1 − α), where Φ is the cumulative distribution function of the standardized Gaussian variable N(0, 1).
Since the decision function (1/√n) ∑_{i=1}^{n} yi and the threshold λα do not depend on θ2, the test g* is MP for all θ2 > 0 and, hence, it is a UMP test.
Generalized Likelihood Ratio test
Definition
We say that a test gGLR is a Generalized Likelihood Ratio (GLR) test for testing between H1 = {θ : θ ∈ Θ1} and H2 = {θ : θ ∈ Θ2} when
gGLR(y1, ..., yn) = H1 if ΛGLR(y1, ..., yn) < λα,
gGLR(y1, ..., yn) = H2 if ΛGLR(y1, ..., yn) ≥ λα,
with ΛGLR(y1, ..., yn) = sup_{θ2∈Θ2} ∏_{i=1}^{n} fθ2(yi) / sup_{θ1∈Θ1} ∏_{i=1}^{n} fθ1(yi).
Remark
The optimality of the GLR test is established in certain cases (exponential families when n → +∞, for example), but it is not necessarily optimal in all cases.
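The definition translates directly into code when the suprema are approximated numerically; a Python sketch (function names and grid-based maximization are illustrative assumptions, not the course's method):

```python
import numpy as np

def log_glr(y, logpdf, theta1_grid, theta2_grid):
    """Log generalized likelihood ratio:
    log Lambda_GLR = sup_{theta in Theta2} sum_i log f_theta(yi)
                   - sup_{theta in Theta1} sum_i log f_theta(yi),
    with the suprema approximated by maximizing over finite grids."""
    best = lambda grid: max(float(np.sum(logpdf(y, t))) for t in grid)
    return best(theta2_grid) - best(theta1_grid)

def glr_test(y, logpdf, theta1_grid, theta2_grid, log_lambda_alpha):
    """Accept H2 iff the log-GLR statistic reaches log(lambda_alpha)."""
    stat = log_glr(y, logpdf, theta1_grid, theta2_grid)
    return "H2" if stat >= log_lambda_alpha else "H1"

# Gaussian N(theta, 1) log-density, as in the location-testing examples.
logpdf = lambda y, t: -(y - t) ** 2 / 2.0 - 0.5 * np.log(2.0 * np.pi)
```

When the maximum-likelihood estimates over Θ1 and Θ2 are available in closed form (as in the Gaussian example that follows), the grids are unnecessary.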
Location testing with Gaussian errors
Assume yi ∼ N(θ, 1).
The two hypotheses are H1 : |θ| ≤ a and H2 : |θ| ≥ b with 0 < a < b.
The pdf of a Gaussian variable N(θ, 1) is φθ(x) = φ(x − θ) with
φ(x) = (1/√(2π)) exp(−x²/2).
Question
Find the GLR test.
Solution
(2/n) log ΛGLR(y1, ..., yn) = (2/n) log [ sup_{|θ2|≥b} ∏_{i=1}^{n} fθ2(yi) / sup_{|θ1|≤a} ∏_{i=1}^{n} fθ1(yi) ],
which leads to
(2/n) log ΛGLR(y1, ..., yn) = −(ȳ − b)² if |ȳ| ≤ a,
(2/n) log ΛGLR(y1, ..., yn) = −(ȳ − b)² + (ȳ − a)² if a ≤ |ȳ| ≤ b,
(2/n) log ΛGLR(y1, ..., yn) = (ȳ − a)² if |ȳ| ≥ b,
with ȳ = (1/n) ∑_{i=1}^{n} yi.
Since (2/n) log ΛGLR(y1, ..., yn) is an increasing function of |ȳ|, it follows that:
gGLR(y1, ..., yn) = H1 if ȳ² < λα,
gGLR(y1, ..., yn) = H2 if ȳ² ≥ λα.
When y1, ..., yn ∼ N(θ, 1), ȳ² ∼ χ²_n(‖θ‖²₂), which leads to λα = Ψ⁻¹_{n,a²}(1 − α), where Ψ_{n,a²} is the cumulative distribution function of a χ² variable with n degrees of freedom and non-centrality parameter a².
Part III
Sequential approaches
Outline of Part III
8 Motivation
9 Sequential probability ratio test
10 Change detection: known change
11 Change detection: unknown change
Motivation
In the previous part, we have shown that it is possible to minimize the error probabilities for a given sample size n.
New problem: for given error probabilities, try to minimize the sample size or, equivalently, to make the decision with as few observations as possible.
Sequential analysis is the theory of solving hypothesis testing problems when the sample size is not fixed a priori.
Basic definitions (1/2)
Definition (stopping time)
A random variable T is called a stopping time with respect to a process y1, ..., yn, ... if T takes only integer values and if, for every n ≥ 1, the event {T = n} is determined by (y1, ..., yn).
Example
The first time at which the process y1, ..., yn, ... visits a set A is a stopping time.
The last time at which the process y1, ..., yn, ... visits a set A is NOT a stopping time.
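The first-visit example is easy to make concrete; a small Python sketch (names illustrative): the decision to stop at time n only looks at y1, ..., yn, which is exactly the defining property of a stopping time.

```python
def first_visit_time(process, in_A):
    """First time the process visits the set A: a stopping time, because
    the event {T = n} depends only on y1, ..., yn (here even only on yn)."""
    for n, y in enumerate(process, start=1):
        if in_A(y):
            return n
    return None  # the process never visits A on this finite realization

# Example: first time a sequence exceeds 3.
print(first_visit_time([0.5, 1.2, 3.7, 0.1], lambda y: y > 3))  # 3
```

By contrast, the *last* visit cannot be computed without seeing the whole future of the process, which is why it is not a stopping time.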
Basic definitions (2/2)
Definition (sequential test)
A sequential test for testing between simple hypotheses H1 = {y1, ..., yn ∼ f1} and H2 = {y1, ..., yn ∼ f2} is defined to be a pair (g, T) where T is a stopping time and g(y1, ..., yn) is a decision function.
Definition (closed test)
We say that a sequential test (g, T) is closed if
P(T < +∞) = 1.
Remark
For a closed test (g, T), the mean number of observations necessary to decide between the two hypotheses is always finite: E1(T) < +∞ and E2(T) < +∞.
Sequential Probability Ratio Test (SPRT)
Definition (SPRT)
The test (g, T) is a Sequential Probability Ratio Test (SPRT) for testing between simple hypotheses H1 and H2 if we sequentially observe data y1, ..., yn and if, at time n, we make one of the following decisions:
accept H1 when Sn ≤ −a;
accept H2 when Sn ≥ b;
continue to observe and to test when −a < Sn < b,
where
Sn = ∑_{i=1}^{n} log( f2(yi) / f1(yi) )
and a, b are thresholds such that −∞ < −a < b < +∞.
Remark
The SPRT is closed.
Sequential location testing
Assume yi ∼ N(θ, 1).
The two hypotheses are H1 : θ = 0 and H2 : θ = 2.
The pdf of a Gaussian variable N(θ, 1) is φθ(x) = φ(x − θ) with
φ(x) = (1/√(2π)) exp(−x²/2).
Question
Find the SPRT.
Solution
Sn = ∑_{i=1}^{n} log( φ(yi − 2) / φ(yi) ) = 2 ∑_{i=1}^{n} (yi − 1).
[Figure: simulated data yi plotted against i.]
Solution
Sn = ∑_{i=1}^{n} log( φ(yi − 2) / φ(yi) ) = 2 ∑_{i=1}^{n} (yi − 1).
[Figure: simulated SPRT statistic Sn plotted against n, with the acceptance zone of H1 below −a and the acceptance zone of H2 above b.]
Optimality of the SPRT
Definition
Denote by Kα1,α2 the class of all (sequential and nonsequential) tests (g, T) such that
α1(g) ≤ α1, α2(g) ≤ α2,
E1(T) < +∞, E2(T) < +∞,
where Ei(T) is the mean number of observations under Hi.
Let (g*, T*) ∈ Kα1,α2 be an SPRT for testing between hypotheses H1 and H2.
Theorem
For every test (g, T) ∈ Kα1,α2, we have:
E1(T*) ≤ E1(T), E2(T*) ≤ E2(T).
Threshold selection: Wald’s identity
Theorem
The error probabilities of (g, T) verify:
log( α2(g) / (1 − α1(g)) ) ≤ min{0, −a},
log( (1 − α2(g)) / α1(g) ) ≥ max{0, b}.
Remark
The equalities hold for the SPRT when the excesses over the boundaries are small:
Pr1(ST = −a | H1 is accepted) ≃ Pr2(ST = b | H2 is accepted) ≃ 1.
The thresholds may be chosen by using the following approximations: a ≃ log((1 − α1)/α2), b ≃ log((1 − α2)/α1).
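A sketch of the SPRT with these Wald threshold approximations (Python; the names are illustrative, and `log_lr` is the per-observation log-likelihood ratio log f2(yi)/f1(yi)):

```python
import numpy as np

def sprt(stream, log_lr, alpha1, alpha2):
    """Wald's SPRT between H1 and H2 on an i.i.d. stream of observations.

    The thresholds use Wald's approximations, which neglect the excess over
    the boundaries: a ~ log((1 - alpha1)/alpha2), b ~ log((1 - alpha2)/alpha1).
    Returns the decision and the number of observations used."""
    a = np.log((1.0 - alpha1) / alpha2)
    b = np.log((1.0 - alpha2) / alpha1)
    s, n = 0.0, 0
    for n, y in enumerate(stream, start=1):
        s += log_lr(y)          # cumulated log-likelihood ratio S_n
        if s <= -a:
            return "H1", n      # accept H1
        if s >= b:
            return "H2", n      # accept H2
    return "continue", n        # stream exhausted before a decision

# Slide example, H1: theta = 0 vs H2: theta = 2 with unit variance:
# log f2(y)/f1(y) = 2*(y - 1).
print(sprt([2.0] * 50, lambda y: 2.0 * (y - 1.0), 0.01, 0.01))  # ('H2', 3)
```

With α1 = α2 = 0.01 the thresholds are a = b = log 99 ≈ 4.6, so three observations at y = 2 already cross the upper boundary.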
Motivation
The aim is to detect the occurrence of a change as soon as possible, with a fixed rate of false alarms before the unknown change time t0.
Let y1, y2, ... be a random sequence with pdf fθ(yk). Until the unknown time t0, the parameter is θ = θ1 and from t0 it becomes θ = θ2.
Let ta be the alarm time (stopping time) at which a detection occurs.
For estimating the efficiency of the detection, it is convenient to use the mean time between false alarms and the mean delay for detection.
Basic definitions (1/2)
It is assumed that the change time $t_0$ is non-random.
Definition (Mean time between false alarms)
The mean time between false alarms is defined as the expectation
$T = E_{\theta_1}(t_a),$
where $t_a$ is the alarm time.
Definition
Let $K_\gamma = \{t_a : T = E_{\theta_1}(t_a) \ge \gamma\}$ be the class of all sequential algorithms with a bounded mean time between false alarms.
Basic definitions (2/2)
Definition (Essential supremum)
Let $(y_i)_{i \in I}$ be a family of real-valued random variables bounded by another variable. We say that $y$ is an essential supremum for $(y_i)_{i \in I}$, denoted $y = \operatorname{ess\,sup}_I y_i$, if for every $z$:
$\forall i \in I,\ \Pr(y_i \le z) = 1 \iff \Pr(y \le z) = 1.$
Definition (Conditional mean delay)
The conditional mean delay for detection is defined as:
$E_{\theta_2}(t_a - t_0 + 1 \mid t_a \ge t_0, y_1, \ldots, y_{t_0-1}).$
Definition (Worst mean delay)
The worst mean delay for detection is defined as:
$\tau^*(t_a) = \sup_{t_0 \ge 1} \operatorname{ess\,sup}\, E_{\theta_2}(t_a - t_0 + 1 \mid t_a \ge t_0, y_1, \ldots, y_{t_0-1}).$
CUmulated SUM (CUSUM)
Definition (CUSUM)
The CUSUM algorithm $t_a$ is defined by:
$t_a = \min\{k \ge 1 : g_k \ge h\}$
where $g_k = S_k - m_k$,
$S_k = \sum_{i=1}^{k} s_i = \sum_{i=1}^{k} \log\frac{f_{\theta_2}(y_i)}{f_{\theta_1}(y_i)}, \quad m_k = \min_{1 \le j \le k} S_j,$
and h is the threshold.
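The direct form above can be sketched in a few lines of Python. The toy setting is an assumption for illustration: a unit-variance Gaussian mean change from 0 to 1, so the log-likelihood ratio is $s_i = y_i - 0.5$.

```python
def cusum_alarm(ys, llr, h):
    """Direct-form CUSUM: alarm at the first k with g_k = S_k - m_k >= h,
    where S_k is the cumulative log-likelihood ratio and m is the running
    minimum of S_0, ..., S_{k-1} (with S_0 = 0)."""
    S, m = 0.0, 0.0
    for k, y in enumerate(ys, start=1):
        S += llr(y)                  # S_k = S_{k-1} + s_k
        if S - m >= h:               # g_k = S_k - m_k
            return k
        m = min(m, S)
    return None

# Toy example: the mean jumps from 0 to 1 at t0 = 51, so s_i = y_i - 0.5
# drifts down before the change and up (by 0.5 per step) after it.
ys = [0.0] * 50 + [1.0] * 50
print(cusum_alarm(ys, lambda y: y - 0.5, h=5.0))  # → 60
```

With a deterministic sequence the detection delay is exactly $h / 0.5 = 10$ samples after the change.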
Intuitive derivation of the CUSUM
[Figure: left panel, the observations $y_k$ versus $k$; right panel, the cumulative sum $S_k$, its running minimum $m_k$, the threshold $h$, and the alarm time. Before the change $E_{\theta_1}(s_i) < 0$, so $S_k$ drifts downward; after the change $E_{\theta_2}(s_i) > 0$, so $S_k$ drifts upward until $S_k - m_k$ crosses $h$.]
CUmulated SUM (CUSUM)
Definition (CUSUM, recursive form)
The CUSUM algorithm $t_a$ can be rewritten:
$t_a = \min\{k \ge 1 : G_k \ge h\}$
where
$G_0 = 0, \quad G_k = \left[G_{k-1} + \log\frac{f_{\theta_2}(y_k)}{f_{\theta_1}(y_k)}\right]^+, \quad x^+ = \max(0, x).$
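A sketch of the recursive form, with the same assumed toy Gaussian log-likelihood ratio as before ($s_k = y_k - 0.5$); it produces the same alarm time as the direct form.

```python
def cusum_recursive(ys, llr, h):
    """Recursive CUSUM: G_0 = 0, G_k = max(0, G_{k-1} + s_k);
    alarm at the first k with G_k >= h."""
    G = 0.0
    for k, y in enumerate(ys, start=1):
        G = max(0.0, G + llr(y))
        if G >= h:
            return k
    return None

ys = [0.0] * 50 + [1.0] * 50                          # mean jumps at t0 = 51
print(cusum_recursive(ys, lambda y: y - 0.5, h=5.0))  # → 60
```

The recursive form needs O(1) memory per step, which is why it is the one used in practice.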
CUmulated SUM (CUSUM)
Definition (Kullback-Leibler distance)
The Kullback-Leibler distance between two probability densities $f_{\theta_1}$ and $f_{\theta_2}$ is defined as:
$I_{1,2} = \int \log\frac{f_{\theta_1}(y)}{f_{\theta_2}(y)}\, f_{\theta_1}(y)\, dy.$
This distance is always non-negative and is zero only when the two densities are equal.
Theorem (Lorden)
Let $n(\gamma) = \inf_{t_a \in K_\gamma} \tau^*(t_a)$. Then
$n(\gamma) = \frac{\log\gamma}{I_{2,1}}(1 + o(1))$
as $\gamma \to +\infty$, where $o(1)$ stands for a negligible term such that $o(1) \to 0$ as $\gamma \to +\infty$.
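To make $I_{1,2}$ concrete: for two unit-variance Gaussians $N(\theta_1, 1)$ and $N(\theta_2, 1)$ the distance has the closed form $(\theta_1 - \theta_2)^2 / 2$, which a numerical integration reproduces (a sketch, not from the slides; the integration range and step count are assumptions).

```python
import math

def kl_gauss_numeric(t1, t2, lo=-20.0, hi=20.0, n=200_000):
    """I_{1,2} = integral of log(f_t1(y)/f_t2(y)) f_t1(y) dy for
    N(t1, 1) versus N(t2, 1), approximated by a midpoint Riemann sum."""
    dy = (hi - lo) / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * dy
        f1 = math.exp(-0.5 * (y - t1) ** 2) / math.sqrt(2 * math.pi)
        log_ratio = 0.5 * ((y - t2) ** 2 - (y - t1) ** 2)
        total += log_ratio * f1 * dy
    return total

# Closed form: (0 - 1)^2 / 2 = 0.5
print(round(kl_gauss_numeric(0.0, 1.0), 4))  # → 0.5
```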
CUmulated SUM (CUSUM)
Theorem (Lorden)
Let $t_a$ be a CUSUM algorithm designed so that $T = E_{\theta_1}(t_a) = \gamma$ with $\gamma > 0$. Then we have the following equality:
$\tau^*(t_a) = \frac{\log\gamma}{I_{2,1}}(1 + o(1))$
as $\gamma \to +\infty$.
Theorem (Optimality of the CUSUM)The CUSUM algorithm is asymptotically optimal in the classKγ .
Motivation
In practice, the distribution after the change is rarely known.
Let $y_1, y_2, \ldots$ be a random sequence with pdf $f_\theta(y_k)$. Until the unknown time $t_0$, the parameter is $\theta_1$; from $t_0$ on it becomes $\theta_2 \in \Theta_2$, where the set $\Theta_2$ is known.
Three main solutions:
Weighted likelihood ratio;
Invariant likelihood ratio;
Generalized likelihood ratio.
Method of weighting functions
It is assumed that $\theta_2$ follows an a priori distribution: $\theta_2 \sim p(\theta_2)$.
After the change, the observations $y_{t_0}, y_{t_0+1}, \ldots$ follow the mixture distribution:
$\bar f(y_k) = \int_{\Theta_2} f_{\theta_2}(y_k)\, p(\theta_2)\, d\theta_2, \quad k \ge t_0$
$\Rightarrow$ the hypothesis after the change becomes simple.
We can then apply the CUSUM algorithm.
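A minimal sketch of the weighting-function method under assumed values (a discrete two-point prior over post-change means, unit-variance Gaussians, pre-change mean 0; none of these numbers come from the slides): the mixture density $\bar f$ replaces $f_{\theta_2}$ in the CUSUM log-likelihood ratio.

```python
import math

def gauss_pdf(y, theta):
    """N(theta, 1) density."""
    return math.exp(-0.5 * (y - theta) ** 2) / math.sqrt(2 * math.pi)

def mixture_llr(y, prior, theta1=0.0):
    """s_k = log( f_bar(y) / f_theta1(y) ), with
    f_bar(y) = sum over theta2 of p(theta2) * f_theta2(y)."""
    f_bar = sum(p * gauss_pdf(y, t) for t, p in prior.items())
    return math.log(f_bar / gauss_pdf(y, theta1))

prior = {1.0: 0.5, 2.0: 0.5}   # assumed a priori distribution p(theta2)
print(round(mixture_llr(1.0, prior), 3))  # → 0.281
```

This increment can be fed directly into the recursive CUSUM of the previous section.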
Invariant principle
Certain problems are typically invariant with respect to a group of transformations.
The complexity of the hypotheses is then reduced by considering only the maximal invariant statistic.
An invariant statistic is a function of the observations such that:
the function is invariant with respect to the group of transformations;
all other invariant functions depend on this maximal invariant.
The simplified problem is then solved using classical tools.
Example
Notation: $N_p(\theta, I_p)$ denotes the $p$-dimensional Gaussian distribution with unit covariance matrix and mean $\theta \in \mathbb{R}^p$.
Problem: the observations $y_1, y_2, \ldots$ follow the distribution $N_p(0, I_p)$ before the change and the distribution $N_p(\theta, I_p)$ after the change, with $\|\theta\|_2^2 = \sum_{i=1}^{p} \theta_i^2 = c^2$, $c > 0$ known.
This problem is invariant with respect to the group of $p$-dimensional rotations. The invariant statistics are $\|y_1\|_2^2, \|y_2\|_2^2, \ldots$.
These "simplified" observations follow a central $\chi^2$ distribution with $p$ degrees of freedom before the change, and a non-central $\chi^2$ distribution with $p$ degrees of freedom and non-centrality parameter $c^2$ after the change.
GLR algorithm
It is based on the principle of the GLR test:
$t_a = \min\{k \ge 1 : g_k \ge h\}$ with
$g_k = \max_{1 \le j \le k}\, \sup_{\theta \in \Theta_2} \sum_{i=j}^{k} \log\frac{f_{\theta}(y_i)}{f_{\theta_1}(y_i)}.$
The optimality properties of this algorithm are not known, except in certain cases (exponential families, ...).
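For the Gaussian case with an unknown post-change mean (pre-change $N(0,1)$, post-change $N(\theta,1)$ with $\theta$ unknown; an assumed illustration, not a case treated on the slides), the inner supremum has the closed form $(\sum_{i=j}^{k} y_i)^2 / (2(k - j + 1))$, and the GLR rule can be sketched as:

```python
def glr_gaussian(ys, h):
    """GLR change detection for pre-change N(0, 1) and post-change
    N(theta, 1), theta unknown: for each window [j, k],
    sup over theta of the LLR sum = (y_j + ... + y_k)^2 / (2 (k - j + 1))."""
    prefix = [0.0]                       # prefix[j] = y_1 + ... + y_j
    for k, y in enumerate(ys, start=1):
        prefix.append(prefix[-1] + y)
        g = max((prefix[k] - prefix[j - 1]) ** 2 / (2.0 * (k - j + 1))
                for j in range(1, k + 1))
        if g >= h:
            return k
    return None

ys = [0.0] * 30 + [1.0] * 30             # mean jumps from 0 to 1 at t0 = 31
print(glr_gaussian(ys, h=5.0))           # → 40
```

Note the cost grows with $k$ (the maximization is over all candidate change points $j$), which is the usual practical drawback of the GLR compared with the recursive CUSUM.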
[Sidebar — Statistical Decision Theory, L. Fillatre: DOS attack detection (attack scheme, detection scheme, problem statement) · LR-CUSUM (definition, optimality) · NP-CUSUM (principle, definition) · Example (a Poisson example, comparison)]
Part IV
A case study
Outlines of Part IV
12 DOS attack detection
13 Multichannel parametric CUSUM
14 Multichannel non-parametric CUSUM
15 Practical example
Typical “SYN flooding” attack scheme
SYN flooding attacks exploit TCP's three-way handshake mechanism and its limitation in maintaining half-open connections.
[Figure: the client sends a SYN; the server answers with a SYN/ACK and keeps a half-open connection while waiting for the final ACK; in the attack the ACK never arrives, and the server holds the half-open connection until a timeout.]
Typical detection scheme
The aim is to detect Denial of Service (DOS) attacks: SYN flooding attacks, UDP packet storms, ...
A DOS attack is generally characterized by an increase in the number of packets of a particular size.
Principle of monitoring:
split packet sizes into a set of bins (or channels);
monitor these channels simultaneously;
detect a change in one of these channels.
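The binning step can be sketched as follows (the bin edges and packet sizes are purely illustrative assumptions, not values from the slides):

```python
def bin_counts(packet_sizes, edges):
    """Split packet sizes into N = len(edges) - 1 channels and count the
    packets falling in each channel; one such count vector y_k is
    produced per monitoring time slot."""
    counts = [0] * (len(edges) - 1)
    for s in packet_sizes:
        for i in range(len(edges) - 1):
            if edges[i] <= s < edges[i + 1]:
                counts[i] += 1
                break
    return counts

edges = [0, 64, 512, 1024, 1519]      # assumed channel boundaries (bytes)
sizes = [40, 40, 60, 600, 1500, 700]  # packet sizes seen in one time slot
print(bin_counts(sizes, edges))       # → [3, 0, 2, 1]
```

Each channel's count sequence is then monitored by its own change-detection statistic, as formalized on the next slides.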
Problem statement
Denote by $N$ the number of channels and by $y_k(i)$, $k \ge 1$, the number of packets measured in the $i$-th channel at time $k$.
Until the unknown time $t_0$, each random value $y_k(i)$ follows a distribution $P_{\theta_{0,i}}$; from $t_0$ on, the distribution of only one of the random variables, say the $i$-th channel, changes to $P_{\theta_i}$.
It is assumed that each distribution $P_{\theta_{0,i}}$ and $P_{\theta_i}$ admits a pdf, denoted $f_{\theta_{0,i}}$ and $f_{\theta_i}$ respectively.
LR-CUSUM
Definition (LR-CUSUM)
The multichannel parametric CUSUM algorithm $t_a$, simply called LR-CUSUM, is defined by:
$t_a = \min_{1 \le i \le N} t_a(i)$
where $t_a(i) = \min\{k \ge 1 : U_k(i) \ge h_i\}$,
$U_k(i) = \max_{1 \le j \le k} S_j^k(i), \quad S_j^k(i) = \sum_{\ell=j}^{k} s_\ell(i) = \sum_{\ell=j}^{k} \log\frac{f_{\theta_i}(y_\ell(i))}{f_{\theta_{0,i}}(y_\ell(i))},$
and $h_i$ is the threshold adapted to the $i$-th channel.
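A sketch of the LR-CUSUM, running one CUSUM statistic per channel in its equivalent recursive form. The per-channel log-likelihood ratio used in the example is an assumed toy Gaussian one (mean change from 0 to 1, unit variance), not taken from the slides.

```python
def lr_cusum_multichannel(observations, llrs, hs):
    """Multichannel LR-CUSUM: t_a = min over i of t_a(i), with each
    channel's statistic U_k(i) computed recursively as
    G_k(i) = [G_{k-1}(i) + s_k(i)]^+.
    Returns (alarm time, channel index) or None."""
    N = len(llrs)
    G = [0.0] * N
    for k, y in enumerate(observations, start=1):
        for i in range(N):
            G[i] = max(0.0, G[i] + llrs[i](y[i]))
            if G[i] >= hs[i]:
                return k, i
    return None

# Two channels; the mean in channel 1 jumps from 0 to 1 at k = 21.
obs = [(0.0, 0.0)] * 20 + [(0.0, 1.0)] * 20
llr = lambda y: y - 0.5                 # assumed unit-variance Gaussian LLR
print(lr_cusum_multichannel(obs, [llr, llr], [4.0, 4.0]))  # → (28, 1)
```

The returned channel index identifies where the change was detected, which matters for diagnosing which packet-size bin carries the attack.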
Criterion of optimality
Definition (False alarm rate)
The False Alarm Rate (FAR) is defined by:
$\mathrm{FAR}(t_a) = \frac{1}{E_{\theta_0}[t_a]}.$
Definition (Average detection delay)
When the hypothesis $H_{t_0,i}$ = "a change occurs at time $t_0$ in the $i$-th channel" is true, the speed of detection is measured by the conditional Average Detection Delay (ADD):
$\mathrm{ADD}_{t_0,i}(t_a) = E_{t_0,i}[t_a - t_0 + 1 \mid t_a \ge t_0], \quad t_0 \ge 1,\ i = 1, \ldots, N.$
Optimality of the LR-CUSUM
Assume $h_i = h$ for all $i = 1, \ldots, N$.
Denote $I_i = \int \log\frac{f_{\theta_i}(y)}{f_{\theta_{0,i}}(y)}\, f_{\theta_i}(y)\, dy.$
Theorem
Suppose $E_{\theta_i}\!\left[\log\frac{f_{\theta_i}(y_\ell(i))}{f_{\theta_{0,i}}(y_\ell(i))}\right]^2 < +\infty$ for all $i$. Then:
For all $t_0 \ge 1$ and $i = 1, \ldots, N$:
$\mathrm{ADD}_{t_0,i}(t_a) \sim \frac{h}{I_i}$ as $h \to +\infty$.
If $h = \log(N\gamma)$, then $\mathrm{FAR}(t_a) \le 1/\gamma$ and
$\inf_{\tau : \mathrm{FAR}(\tau) \le 1/\gamma}\, \sup_{t_0 \ge 1} \mathrm{ADD}_{t_0,i}(\tau) \sim \frac{\log\gamma}{I_i}$
as $\gamma \to +\infty$.
Non-parametric change detection
When the distributions $P_{\theta_i}$ are unknown, the likelihood ratios are also unknown.
The quantities $S_j^k(i)$ should then be replaced by appropriate score functions $V_j^k(i)$ such that $E_{\theta_0}[V_j^k(i)] < 0$ and $E_{\theta_i}[V_j^k(i)] > 0$.
Typical DOS attacks lead to abrupt changes in the mean value of the number of packets. Therefore, the decision function should be sensitive to changes in mean values.
Notations and definitions
Let $\mu_i = E_{\theta_0}[y_k(i)]$ and $\theta_i = E_{\theta_i}[y_k(i)]$ denote the pre-change and post-change mean values in the $i$-th channel, assuming $\mu_i < \theta_i$.
Definition (Score function)
The score functions $V_j^k(i)$ are defined by
$V_j^k(i) = \sum_{\ell=j}^{k} w_i \left(y_\ell(i) - \mu_i - c_{i,\ell}\right), \quad i = 1, \ldots, N,$
where $w_i > 0$, $c_{i,\ell} > 0$ are tuning parameters.
It is assumed that $c_{i,\ell} = c_i$ for all $\ell$. Denote $V_i(y_\ell(i)) = w_i (y_\ell(i) - \mu_i - c_i)$. We have:
$E_{\theta_0}[V_i(y_\ell(i))] = -w_i c_i < 0 \quad \text{and} \quad E_{\theta_i}[V_i(y_\ell(i))] = w_i (\theta_i - \mu_i - c_i) > 0$
for $c_i$ judiciously chosen ($0 < c_i < \theta_i - \mu_i$).
Definition of NP-CUSUM
Definition (NP-CUSUM)
The NP-CUSUM algorithm $t'_a$ is defined by:
$t'_a = \min\{k \ge 1 : \max_{1 \le i \le N} W_k(i) \ge h\}$
where
$W_k(i) = \max_{1 \le j \le k} V_j^k(i)$
and h is a threshold.
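A sketch of the NP-CUSUM with assumed toy parameters (none of the numbers come from the slides). Each $W_k(i)$ is run in the recursive form $[W_{k-1}(i) + V_i(y_k(i))]^+$, which agrees with $\max_{1 \le j \le k} V_j^k(i)$ whenever that maximum is positive, the only case that matters for crossing a threshold $h > 0$.

```python
def np_cusum(observations, mus, ws, cs, h):
    """Non-parametric multichannel CUSUM: per-channel score
    V_i(y) = w_i * (y - mu_i - c_i), with 0 < c_i < theta_i - mu_i so the
    score mean is negative pre-change and positive post-change."""
    N = len(mus)
    W = [0.0] * N
    for k, y in enumerate(observations, start=1):
        for i in range(N):
            W[i] = max(0.0, W[i] + ws[i] * (y[i] - mus[i] - cs[i]))
        if max(W) >= h:
            return k
    return None

# One channel: pre-change mean mu = 2, post-change mean theta = 5,
# tuning c = 1 (so 0 < c < theta - mu) and w = 1, all assumed values.
obs = [(2.0,)] * 20 + [(5.0,)] * 20
print(np_cusum(obs, mus=[2.0], ws=[1.0], cs=[1.0], h=8.0))  # → 24
```

Note that only the pre-change means $\mu_i$ and the tuning constants are needed; no likelihood is ever evaluated, which is the point of the non-parametric variant.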
A poisson example
Assume the number of packets in the $i$-th channel follows the Poisson distribution $P(\mu_i)$ in the pre-change mode and $P(\theta_i)$ after the change occurs in the $i$-th channel:
$\Pr(y_k(i) = m) = \frac{\mu_i^m}{m!} e^{-\mu_i}, \quad k < t_0,$
$\Pr(y_k(i) = m) = \frac{\theta_i^m}{m!} e^{-\theta_i}, \quad k \ge t_0.$
It is assumed that $\theta_i, \mu_i$ are known and $\theta_i > \mu_i$.
Question
Find the LR-CUSUM;
Show that the NP-CUSUM is asymptotically optimal when $c_i = \varepsilon_i \theta_i$, where the variables $\varepsilon_i$ need to be specified.
Comparison between the algorithms
The LR-CUSUM is based on the statistic
$s_\ell(i) = y_\ell(i) \log(\theta_i/\mu_i) - (\theta_i - \mu_i).$
The NP-CUSUM is based on the statistic
$V_i(y_\ell(i)) = w_i \left(y_\ell(i) - \mu_i - \varepsilon_i \theta_i\right).$
It is straightforward to verify that the NP-CUSUM coincides with the LR-CUSUM test if
$\varepsilon_i = \frac{Q_i - \log Q_i - 1}{Q_i \log Q_i}, \quad w_i = \log Q_i,$
with $Q_i = \theta_i/\mu_i$, which proves that the NP-CUSUM is asymptotically optimal.
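This identity can be checked numerically; the values $\theta_i = 6$, $\mu_i = 2$ below are arbitrary (any $\theta_i > \mu_i > 0$ works).

```python
import math

theta, mu = 6.0, 2.0                    # arbitrary post/pre-change means
Q = theta / mu
eps = (Q - math.log(Q) - 1.0) / (Q * math.log(Q))
w = math.log(Q)

# With c = eps * theta, the NP-CUSUM score w * (y - mu - eps * theta)
# equals the Poisson LR-CUSUM statistic y * log(theta/mu) - (theta - mu)
# for every observation y.
for y in range(0, 15):
    s_lr = y * math.log(theta / mu) - (theta - mu)
    v_np = w * (y - mu - eps * theta)
    assert abs(s_lr - v_np) < 1e-9
print("NP-CUSUM score coincides with the LR-CUSUM statistic")
```

The algebra behind the check: $w(\mu + \varepsilon\theta) = \mu \log Q + \mu(Q - \log Q - 1) = \mu Q - \mu = \theta - \mu$, so the two affine functions of $y$ have the same slope and intercept.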