Transcript of Chapter 1: Foundation
Source: my.fit.edu/~gmarin/CSE5636/CharacterizeActivitySection6.pdf
Characterizing Activity 6-1
Characterizing Activity
Dr. G. A. Marin
Characterizing Activity 6-2
Activity Profiling
Collecting statistics that summarize the kinds of activities that occur regularly on the network.
- Get a description that can be used to identify deviations from normal behavior.
- Develop profiles of machine behavior.
- Group machines into activity clusters.
Characterizing Activity 6-3
Characterize Machine Activity
- Count the number of SYNs to each port (by machine): services sought.
- Count SYN/ACKs from each port: services provided.
- Count SYNs by port (to destination machines): requests for service.
- Keep a list of all IP addresses interacting with each machine in our network? External machines only? Others?
Characterizing Activity 6-4
Activity Profile
An activity profile for a machine is a vector of counts or probabilities. Each count is associated with a specific activity.
E.g., TCP SYN packets sent to a specific port. Note that counts are generally time sensitive, so vectors should be collected by hour of day. Thus, consider the activity vector to be a vector of counts or probabilities relative to a given time period on a given day of the week.
Characterizing Activity 6-5
Cluster Analysis
Developed largely in biological and physical sciences to classify items or individuals into groups. “Clusters” are those with similar characteristics that seem to belong to a single group. Cluster analysis is the commonly used term for using procedures to identify groups in data.
Characterizing Activity 6-6
Our Goal
- Divide machines into clusters based on their activity vectors (either network side or system side).
- Characterize the activity of machines that are similar (belong to one cluster).
- Compare new data (or process behavior) with behavior in existing clusters.
- Alarm if deviation exceeds a threshold.
Characterizing Activity 6-7
What is a Cluster?
Characterizing Activity 6-8
Multivariate Data Matrix (n x p)

    x_11  x_12  ...  x_1p
    x_21  x_22  ...  x_2p
    ...   ...   ...  ...
    x_n1  x_n2  ...  x_np

Each row represents the elements of a particular machine, such as the number of SYNs received by port and the number of SYN/ACKs sent by port.
Characterizing Activity 6-9
Create Proximity Matrix
Consider scaling the vector (or row) elements, and perhaps weighting them if some are deemed more important than others.
- Divide by the range so that each element is between zero and one.
- Divide by the standard deviation if greater variance implies less significance.
For discrete or continuous numerical values, compute the distance between each pair of vectors using, for example, Euclidean distance. This results in an n x n proximity matrix.
Characterizing Activity 6-10
Create the proximity matrix:

    d_11  d_12  ...  d_1n
    d_21  d_22  ...  d_2n
    ...   ...   ...  ...
    d_n1  d_n2  ...  d_nn

where each element d_ij represents the distance between the activity vector for process i and the activity vector for process j. One might use, for example, the usual Euclidean distance:

    d_ij = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2 )
Characterizing Activity 6-11
Example: SYNs to 4 particular ports and SYN/ACKs from the same ports.

    M1:  514  127  934  729   84    0  205  403
    M2:  648   47  864    0   31   92  610    0
    M3:  950   54  988  721   49   52  584  693
    M4:  283  102    0    0   53   34    0  576
    M5:    3  119  764  492   88   35  665  225
Characterizing Activity 6-12
Proximity Matrix Using Euclidean Distance

          M1    M2    M3    M4    M5
    M1     0   948   656  1289   626
    M2   948     0  1053  1122   398
    M3   656  1053     0   608   614
    M4  1289  1122   608     0  1244
    M5   626   398   614  1244     0
Characterizing Activity 6-13
Ordered Similarity List
1. M2 and M5: 398
2. M3 and M4: 608
3. M3 and M5: 614
4. M1 and M5: 626
5. M1 and M3: 656
6. M1 and M2: 948
7. M2 and M3: 1053
8. M2 and M4: 1122
9. M4 and M5: 1244
10. M1 and M4: 1289
What are the clusters?
We’ll return to this. Next, we look at classifying machines by system-call activity.
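One possible way to answer "What are the clusters?" is to run a single-linkage (nearest-neighbor) agglomeration directly on the proximity matrix above. This is an illustrative sketch only, not the official answer the lecture defers; the matrix values are copied from this slide.

```python
# Single-linkage hierarchical clustering applied to the 5x5 proximity
# matrix of machines M1..M5 (slide 6-12).
D = [
    [0,    948,  656, 1289,  626],
    [948,    0, 1053, 1122,  398],
    [656, 1053,    0,  608,  614],
    [1289, 1122,  608,   0, 1244],
    [626,  398,  614, 1244,    0],
]

def single_linkage(dist):
    """Repeatedly merge the two closest clusters; cluster distance is the
    minimum pairwise distance between their members (nearest neighbor)."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] | clusters[b]
        merges.append((d, sorted(x + 1 for x in merged)))  # 1-based labels
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

for d, members in single_linkage(D):
    print(f"merge at distance {d}: {members}")
```

The first two merges are (M2, M5) at 398 and (M3, M4) at 608, matching the top of the ordered similarity list; where to cut the resulting tree is the judgment call the later slides return to.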
Characterizing Activity 6-14
Benign System Call Traces
[Bar chart: system-call frequency (0 to 25,000) for benign traces, one bar per call, from accept, CloseHandle, closesocket, CloseWindowStation, ... through WSACleanup, WSARecv, WSAStartup. Y-axis: System Call Frequency.]
Characterizing Activity 6-15
Viral System Call Traces
[Bar chart: system-call frequency (0 to 600) for viral traces, over the same set of system calls as the benign chart. Y-axis: System Call Frequency.]
Characterizing Activity 6-16
Process Activity Matrix
    x_11  x_12  ...  x_1r
    x_21  x_22  ...  x_2r
    ...   ...   ...  ...
    x_n1  x_n2  ...  x_nr

where each element of the matrix is x_ij, with i representing the ith process (i >= 1) and j representing the jth system call, numbered in any convenient order.
Characterizing Activity 6-17
AGAIN... Create the proximity matrix:

    d_11  d_12  ...  d_1n
    d_21  d_22  ...  d_2n
    ...   ...   ...  ...
    d_n1  d_n2  ...  d_nn

where each element d_ij represents the distance between the activity vector for process i and the activity vector for process j. One might use, for example, the usual Euclidean distance:

    d_ij = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ir - x_jr)^2 )
Characterizing Activity 6-18
Measures of Distance (numerical data)
    Euclidean distance:   d_ij = ( Sum_{k=1..r} (x_ik - x_jk)^2 )^(1/2)

    City block distance:  d_ij = Sum_{k=1..r} |x_ik - x_jk|

    Minkowski distance:   d_ij = ( Sum_{k=1..r} |x_ik - x_jk|^m )^(1/m),  m >= 1

    Angular distance:     d_ij = (1 - phi_ij) / 2,  with
                          phi_ij = Sum_{k=1..r} x_ik x_jk / ( Sum_k x_ik^2 * Sum_k x_jk^2 )^(1/2)
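The four measures above can be written out directly. A minimal sketch for two count vectors of equal length; the example vectors are the first two process rows used later in this chapter.

```python
from math import sqrt

# The four distance measures from this slide, for count vectors of length r.
def euclidean(xi, xj):
    return sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def city_block(xi, xj):
    return sum(abs(a - b) for a, b in zip(xi, xj))

def minkowski(xi, xj, m):
    assert m >= 1
    return sum(abs(a - b) ** m for a, b in zip(xi, xj)) ** (1 / m)

def angular(xi, xj):
    # (1 - cosine similarity) / 2; lies in [0, 1] for non-negative counts
    phi = sum(a * b for a, b in zip(xi, xj)) / sqrt(
        sum(a * a for a in xi) * sum(b * b for b in xj))
    return (1 - phi) / 2

x1, x2 = [179, 11, 6, 226], [160, 163, 70, 67]   # two rows from slide 6-19
print(round(euclidean(x1, x2)))                  # 230
print(abs(minkowski(x1, x2, 2) - euclidean(x1, x2)) < 1e-9)   # True
```

As the check shows, Minkowski with m = 2 reduces to the Euclidean distance, and angular distance is zero for parallel vectors regardless of their magnitudes.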
Characterizing Activity 6-19
To illustrate the use of the proximity matrix we take system call data from five of the processes represented in Figures 1 and 2. We select the following 4 (of the 58 total) system call counts for illustrative purposes only:
1. close handle
2. create file
3. find first file
4. register query

The example process activity matrix (greatly abbreviated) and proximity matrix:

    Process 1:  179   11    6  226
    Process 2:  160  163   70   67
    Process 3:   30    0    1    2
    Process 4:   70   30    0  101
    Process 5:  407    0    0    4

      0  230  270  167  318
    230    0  229  178  310
    270  229    0  111  377
    167  178  111    0  352
    318  310  377  352    0
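As a sketch, the proximity matrix on this slide can be recomputed directly from the abbreviated process-activity matrix; the results agree with the slide's entries to within a unit of rounding.

```python
from math import sqrt

# Recompute the 5x5 proximity matrix from the abbreviated
# process-activity matrix (four system-call counts per process).
X = [
    [179,  11,  6, 226],   # Process 1
    [160, 163, 70,  67],   # Process 2
    [ 30,   0,  1,   2],   # Process 3
    [ 70,  30,  0, 101],   # Process 4
    [407,   0,  0,   4],   # Process 5
]

def euclid(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Round to whole numbers, as the slide does.
P = [[round(euclid(u, v)) for v in X] for u in X]
for row in P:
    print(row)
```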
Characterizing Activity 6-20
Importance of Scaling
- The proximity matrix leads to this ordering of distances:
  d_34, d_14, d_24, d_23, d_12, d_13, d_25, d_15, d_45, d_35.
- The close distance between process 3 and process 4 may simply be due to the small total number of system calls in these two cases.
- Another approach would be to normalize by the percentage of calls of each type:

    Process 3 = [ 90.9    0    3.0   6.1 ]
    Process 4 = [ 34.8  14.9    0   50.2 ]

This would result in a different ordering, affected not by the total number of system calls but only by the percentages of the various types. What if we want some of both?
Characterizing Activity 6-21
A New Activity Representation
The original activity vectors:

    Process 1:  179   11    6  226
    Process 2:  160  163   70   67
    Process 3:   30    0    1    2
    Process 4:   70   30    0  101
    Process 5:  407    0    0    4

Add a first element, which is total activity scaled between 0 and 100, and represent the other elements as percentages of the total for that vector:

    M :=  27.6  42.4   2.6   1.4  53.6
          30.1  34.8  35.4  15.2  14.6
           2.2  90.9    0    3.0   6.1
          13.2  34.8  14.9    0   50.2
          26.9  99.0    0     0    1.0
Characterizing Activity 6-22
New Proximity Matrix

    D =    0      53.397  72.546  20.735  77.327
          53.397    0     73.484  46.95   76.165
          72.546  73.484    0     73.784  26.659
          20.735  46.95   73.784    0     83.379
          77.327  76.165  26.659  83.379    0

Now the closest two activity vectors are 1 and 4 instead of 3 and 4. This is simply due to what we have determined is important (and how we measure the distance between two vectors)!
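The new representation and the matrix D can be rebuilt in a few lines. A sketch; note the slide rounds M to one decimal place before computing D, so values computed from the raw counts land close to, but not exactly on, the slide's entries.

```python
from math import sqrt

# Build the "new activity representation" (slide 6-21) and its
# proximity matrix D (slide 6-22) from the raw counts.
X = [
    [179,  11,  6, 226],
    [160, 163, 70,  67],
    [ 30,   0,  1,   2],
    [ 70,  30,  0, 101],
    [407,   0,  0,   4],
]

grand_total = sum(map(sum, X))   # 1527 calls across all five processes

# First element: the row's share of all activity, scaled 0..100.
# Remaining elements: each count as a percent of the row total.
M = [[100 * sum(row) / grand_total] + [100 * c / sum(row) for c in row]
     for row in X]

def euclid(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

D = [[euclid(u, v) for v in M] for u in M]
print(round(D[0][3], 2))    # distance between processes 1 and 4,
                            # close to the slide's 20.735
```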
Characterizing Activity 6-23
Measures for Categorical Data
Suppose each vector contains fields like the following:
- Layer 4 protocol (TCP, UDP, ICMP)
- Layer 3 protocol (IP, IPX, OSI, APPN)
- Layer 2 protocol (Ethernet, Token Ring, ATM)

We can compute similarity measures (similar to the proximity idea):

    s_ij = (1/r) Sum_{k=1..r} s_ijk,

where each s_ijk = 1 if x_i agrees with x_j in the kth element (and 0 otherwise).
= =∑
Characterizing Activity 6-24
Clustering Example: Hierarchical Agglomerative Method

We return to the vectors that led to matrix D:

    D =    0      53.397  72.546  20.735  77.327
          53.397    0     73.484  46.95   76.165
          72.546  73.484    0     73.784  26.659
          20.735  46.95   73.784    0     83.379
          77.327  76.165  26.659  83.379    0

The smallest positive value is 20.735; thus, form the two-member cluster (1,4) = (14). Compute the nearest-neighbor distances:

    d_(14)2 = min[ d_12, d_42 ] = d_42 = 46.95
    d_(14)3 = min[ d_13, d_43 ] = d_13 = 72.546
    d_(14)5 = min[ d_15, d_45 ] = d_15 = 77.327
Characterizing Activity 6-25
Clustering Step 2
We compute the new proximity matrix:

            (14)     2      3      5
    (14)      0
    2       47.0     0
    3       72.5   73.5     0
    5       77.3   76.2   83.4     0

The smallest value is between (14) and 2, so we add 2 to this cluster and form (142).
Characterizing Activity 6-26
Clustering Step 3

    d_(142)3 = min[ d_(14)3, d_23 ] = 72.5
    d_(142)5 = min[ d_(14)5, d_25 ] = 76.2

            (142)    3      5
    (142)     0
    3       72.5     0
    5       76.2   83.4     0

Form the clusters (1423) and (5).
Characterizing Activity 6-27
Clustering Step 4
    d_(1423)5 = min[ d_(142)5, d_35 ] = 76.2

Next cluster will be (14235).
Characterizing Activity 6-28
Choosing Clusters
Dendrogram (heights at which clusters merge):

    1 and 4 merge at 20.7
    2 joins (14) at 47.0
    3 joins (142) at 72.5
    5 joins (1423) at 76.2

Cutting the tree between 47.0 and 72.5 gives CLUSTERS: {1,2,4}, {3}, {5}.
Characterizing Activity 6-29
Using Principal Components Analysis to Reduce the Number of Variables
(Reduce dimension of activity vectors.)
Characterizing Activity 6-30
Correlation Coefficients
Recall that for two random variables X and Y their covariance is
Cov(X,Y) = E(XY)-E(X)E(Y).
Their correlation coefficient is
    rho(X,Y) = Cov(X,Y) / ( sigma_X sigma_Y ).
Characterizing Activity 6-31
Estimated Correlation Matrix
An original n x r matrix of observations

    x_11  x_12  ...  x_1r
    x_21  x_22  ...  x_2r
    ...   ...   ...  ...
    x_n1  x_n2  ...  x_nr

represents values from random variables X_{i,j}, where i implies, for example, the ith machine, or process, and j implies, for example, counts for a particular system call or port number access. Each column contains n values of a RV S_j giving, say, counts of the jth system call per process.

We form the r x r estimated covariance matrix C = [ c_ij ] as

    c_ij = (1/n) Sum_{k=1..n} x_ki x_kj - ( (1/n) Sum_k x_ki ) ( (1/n) Sum_k x_kj ).
Characterizing Activity 6-32
Example correlation matrix
For the matrix:

    Process 1:  179   11    6  226
    Process 2:  160  163   70   67
    Process 3:   30    0    1    2
    Process 4:   70   30    0  101
    Process 5:  407    0    0    4

    C =   1.719x10^4   -873.56      -144.88     -1.55x10^3
          -873.56      3.853x10^3   1.667x10^3    23.4
          -144.88      1.667x10^3    750.24      -22.4
          -1.55x10^3     23.4        -22.4      6.757x10^3
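The matrix C above can be reproduced with the divide-by-n estimator the slides use. A sketch in plain Python:

```python
# Recompute the estimated covariance matrix C of slide 6-32 using the
# slides' "divide by n" estimator: c_ij = E[S_i S_j] - E[S_i] E[S_j].
X = [
    [179,  11,  6, 226],
    [160, 163, 70,  67],
    [ 30,   0,  1,   2],
    [ 70,  30,  0, 101],
    [407,   0,  0,   4],
]
n, r = len(X), len(X[0])
col = lambda j: [row[j] for row in X]
mean = lambda v: sum(v) / len(v)

C = [[mean([a * b for a, b in zip(col(i), col(j))])
      - mean(col(i)) * mean(col(j))
      for j in range(r)] for i in range(r)]

print(round(C[0][0], 2), round(C[0][1], 2))   # 17189.36 -873.56
```

These match the slide's 1.719x10^4 and -873.56.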
Characterizing Activity 6-33
The mathcad computation...

    X :=  179   11    6  226
          160  163   70   67
           30    0    1    2
           70   30    0  101
          407    0    0    4

    i := 1, 2 .. 4        j := 1, 2 .. 4
    MX_i := mean( X<i> )      SX_i := stdev( X<i> )

    C_i,j := ( X<i> . X<j> ) / 5 - MX_i . MX_j
Characterizing Activity 6-34
Principal Components Analysis
It is possible to find new random variables

    Y_1 = a_11 S_1 + a_12 S_2 + ... + a_1r S_r
    Y_2 = a_21 S_1 + a_22 S_2 + ... + a_2r S_r
    ...
    Y_r = a_r1 S_1 + a_r2 S_2 + ... + a_rr S_r

such that each Y_i accounts for a decreasing amount of the variance in the random variables S_i, and each pair Y_i, Y_j is uncorrelated. Once we account for "most" of the original variance, we can drop the remaining Y_i from consideration (reduce the problem dimension).
Characterizing Activity 6-35
Finding the A Matrix
The coefficients a_ij are the "eigenvectors" of the covariance matrix C. That is, each column [a_ij]^T of the matrix A^T satisfies:

    C [A^T]_i = lambda_i [A^T]_i,   where the lambda_i are the "eigenvalues."

The eigenvalues are found by solving the equations

    det( lambda_i I - C ) = 0.

The eigenvectors are found by solving the equations ( lambda_i I - C ) V_i = 0. However, many tools exist to find these.
Characterizing Activity 6-36
“Simple” eigenvalue example

    Let C = [ 1  2 ].  Then det( lambda I - C ) = det [ lambda-1     -2    ]
            [ 1  4 ]                                  [   -1     lambda-4  ]

    = (lambda - 1)(lambda - 4) - 2 = lambda^2 - 5 lambda + 2.

If we set this equal to zero we find lambda = (5 +/- sqrt(17)) / 2, so lambda ~= 4.562 and lambda ~= 0.438. These are the eigenvalues, and we need the corresponding eigenvectors. If v = [v_1; v_2] is an eigenvector, then (lambda I - C) v = 0. For lambda = 4.562 we get the simultaneous equations

    3.562 v_1 - 2 v_2 = 0
    -v_1 + 0.562 v_2 = 0.

These are homogeneous equations which do not have unique solutions. The second equation yields v_1 = 0.562 v_2. Simply set v_2 = 1 to get v_1 = 0.562. Then we normalize by dividing each by sqrt( v_1^2 + v_2^2 ) = 1.147. This gives us v_1 = 0.49 and v_2 = 0.872.
Characterizing Activity 6-37
Eigenvalues continued
Thus, corresponding to the eigenvalue 4.562 we have the eigenvector [0.490; 0.872]. Similarly, we find that corresponding to 0.438 we have the eigenvector [0.963; -0.270]. If C really had been a covariance matrix, we would write the matrix A's rows using the transpose of these two vectors. Thus,

    A = [ 0.490   0.872 ]
        [ 0.963  -0.270 ]

Notice that the first row corresponds to the largest eigenvalue, and we continue in order of decreasing eigenvalue size.
Characterizing Activity 6-38
Good news: Mathcad demo

    C := [ 1  2 ]     e := eigenvals(C)     e = [ 0.438 ]
         [ 1  4 ]                               [ 4.562 ]

    v := eigenvecs(C)     v = [ -0.963   -0.49  ]
                              [  0.27    -0.872 ]

The columns of v are the negatives of what we found, which does not matter. The first column corresponds to the first eigenvalue found by eigenvals, etc.
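The same demo can be run with numpy instead of Mathcad; a sketch, sorting so the largest eigenvalue comes first as the slides require for A:

```python
import numpy as np

# numpy equivalent of the Mathcad demo: eigen-decomposition of the
# 2x2 example matrix C = [[1, 2], [1, 4]].
C = np.array([[1.0, 2.0],
              [1.0, 4.0]])

evals, evecs = np.linalg.eig(C)     # columns of evecs are eigenvectors
order = np.argsort(evals)[::-1]     # sort largest eigenvalue first
evals, evecs = evals[order], evecs[:, order]

print(np.round(evals, 3))           # [4.562 0.438]
# Eigenvectors are unique only up to sign/scale, as the slide notes.
print(np.round(np.abs(evecs[:, 0]), 3))
```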
Characterizing Activity 6-39
Understanding Y Values
Recall that we're looking for new random variables Y_1, Y_2, ..., Y_r that are orthogonal and may capture most of the variance of the original S_1, S_2, ..., S_r with a much-reduced number of Y's. The defining equations are given on slide number 34. Each row of the original X matrix represents one "realization" of the original random variables S_i; thus, one row of the X matrix results in one estimate of the Y's. In practice, to estimate the Y's, we first normalize the x-values by subtracting the column mean from each (as we shall see). Then we transpose each row of this normalized X matrix, NX, so that the rows become column vectors prior to multiplying by the A matrix.

Ultimately the matrix equation Y(r x n) = A(r x r) x (NX^T)(r x n) results in the columns of Y. Each column is an estimate of the random variables Y_i arising from the corresponding normalized column of (NX^T)(r x n).
Characterizing Activity 6-40
Estimating Y values
In our original system call example: each row of the original X matrix contains a set of estimates of the system call random variables S_i, i = 1, 2, .. 4, for a single process. Each x_ij is the ith sample of system call j.

From each entry we subtract the column mean (the estimated mean for S_j)

    MX_j = (1/5) Sum_{i=1..5} x_ij

to obtain the normalized matrix NX = [ x_ij - MX_j ].

We estimate the Y matrix as

    Y = [ y_ij ](4x5) = A x (NX)^T.
Characterizing Activity 6-41
Find Eigen-vectors/values

    E := eigenvecs(C)

    E =    0.013       -0.073   -0.139   -0.987
           0.401       -0.912    0.05     0.065
          -0.916       -0.401    0.022    0.015
          -1.553x10^-3 -0.045   -0.989    0.143

    eval := eigenvals(C)

    eval = [ 21.803, 4.517x10^3, 6.538x10^3, 1.747x10^4 ]   (smallest ... largest)

    A :=  -0.987    0.065    0.015    0.143
          -0.139    0.05     0.022   -0.989
          -0.073   -0.912   -0.401   -0.045
           0.013    0.401   -0.916   -1.553x10^-3

(The rows of A are the eigenvectors of C, ordered from largest eigenvalue to smallest.)
Characterizing Activity 6-42
Checking A for Orthogonality

    A^T . A =   0.999         6.84x10^-4   -4.98x10^-4   -4.052x10^-4
                6.84x10^-4    0.999         4.71x10^-4    2.622x10^-4
               -4.98x10^-4    4.71x10^-4    1.001        -1.455x10^-4
               -4.052x10^-4   2.622x10^-4  -1.455x10^-4   1.001

This is approximately as expected for an orthogonal matrix (ones on the diagonal and zeros elsewhere).
Characterizing Activity 6-43
Finding Y
Original observations of S1, S2, S3, S4:

    X =   179   11    6  226
          160  163   70   67
           30    0    1    2
           70   30    0  101
          407    0    0    4

    i := 1, 2 .. 5      j := 1, 2 .. 4
    NX_i,j := X_i,j - MX_j          MX = [ 169.2  40.8  15.4  80 ]

    NX =    9.8   -29.8   -9.4   146
           -9.2   122.2   54.6   -13
         -139.2   -40.8  -14.4   -78
          -99.2   -10.8  -15.4    21
          237.8   -40.8  -15.4   -76

    Y := A . NX^T

    Y =    9.127    15.983   123.368   99.98   -248.46
        -147.453    21.447    94.134   -7.859    39.731
          23.662  -132.084    56.656   22.322    29.446
          -3.439    -1.111    -4.859    8.453     0.955
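The whole pipeline of slides 6-40 through 6-44 can be sketched end to end: build C, take its eigenvectors as the rows of A (largest eigenvalue first), form NX, compute Y = A NX^T, and verify the round trip back to X.

```python
import numpy as np

# PCA pipeline sketch for the 5x4 system-call example.
X = np.array([[179,  11,  6, 226],
              [160, 163, 70,  67],
              [ 30,   0,  1,   2],
              [ 70,  30,  0, 101],
              [407,   0,  0,   4]], dtype=float)

means = X.mean(axis=0)                 # [169.2  40.8  15.4  80.]
NX = X - means
C = (X.T @ X) / len(X) - np.outer(means, means)   # divide-by-n covariance

evals, evecs = np.linalg.eigh(C)       # ascending order for symmetric C
A = evecs[:, ::-1].T                   # rows = eigenvectors, largest first

Y = A @ NX.T                           # rows of Y estimate Y1..Y4
print(np.allclose(Y.T @ A + means, X))             # True: full reconstruction

# Drop Y4 (smallest eigenvalue) and reconstruct with 3 variables:
Y3 = Y.copy()
Y3[3, :] = 0
X3 = Y3.T @ A + means
print(np.round(evals[::-1][:3].sum() / evals.sum(), 3))   # 0.999
print(np.round(np.abs(X3 - X).max(), 2))   # worst-case 3-variable error
```

The signs of individual rows of A may differ from the slides (eigenvectors are unique only up to sign), but the reconstruction and the 0.999 variance fraction are unaffected.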
Characterizing Activity 6-44
Reducing Variables:
We write

    MEANS =  169.2  40.8  15.4  80
             169.2  40.8  15.4  80
             169.2  40.8  15.4  80
             169.2  40.8  15.4  80
             169.2  40.8  15.4  80

and it follows that

    Y^T . A + MEANS =  178.915   11.062    5.955  226.077
                       160.071  162.927   70.095   67.02
                        30.152   -0.093    1.053    2.001
                        70.092   29.938    0.033  101.052
                       406.77     0.165   -0.135    3.85

This is the reconstruction of the X matrix. Recall that

    Y =    9.127    15.983   123.368   99.98   -248.46
        -147.453    21.447    94.134   -7.859    39.731
          23.662  -132.084    56.656   22.322    29.446
          -3.439    -1.111    -4.859    8.453     0.955

(5 estimates of Y1 in the 1st row, ..., 5 estimates of Y4 in the 4th row) and

    X =   179   11    6  226
          160  163   70   67
           30    0    1    2
           70   30    0  101
          407    0    0    4

But Y was created so that Y4 has the least influence on the original data matrix X.
Characterizing Activity 6-45
Checking reduction

    Y =    9.127    15.983   123.368   99.98   -248.46
        -147.453    21.447    94.134   -7.859    39.731
          23.662  -132.084    56.656   22.322    29.446
          -3.439    -1.111    -4.859    8.453     0.955

Replace Y with YRows3 (the 4th row zeroed out):

    YRows3 =    9.127    15.983   123.368   99.98   -248.46
             -147.453    21.447    94.134   -7.859    39.731
               23.662  -132.084    56.656   22.322    29.446
                0         0         0        0         0

    YRows3^T . A + MEANS =  178.96    12.441    2.805  226.071
                            160.085  163.372   69.077   67.018
                             30.215    1.856   -3.397    1.994
                             69.982   26.548    7.776  101.065
                            406.757   -0.218    0.74     3.851

This is a reconstruction of the X matrix using only 3 variables, which account for

    ( eval_1 + eval_2 + eval_3 ) / Sum_{i} eval_i = 0.999

of the variance or "energy" (the three largest eigenvalues over the total).
Characterizing Activity 6-46
Estimate of X with 2 variables
Originally:

    X :=  179   11    6  226
          160  163   70   67
           30    0    1    2
           70   30    0  101
          407    0    0    4

    YRows2^T . A + MEANS =  180.687   34.021   12.293  227.136
                            150.443   42.911   16.112   61.075
                             34.351   53.526   19.321    4.543
                             71.612   46.906   16.727  102.07
                            408.907   26.637   12.547    5.176
Characterizing Activity 6-47
Reduction to 3 Variables
Notice that the estimates of the three random variables Y1, Y2, Y3 can be used to recover the original rows of the X matrix to a considerable degree. The reduction to only two random variables leads to much greater errors.
On the next slide we plot the 5 original points using the new Y-variables and use theseto make a visual clustering decision. This reduced number of variables could also beused for other classifications, such as determining which of the original rows (processes)seem to be malicious.
Characterizing Activity 6-48
Plot Using Three Dimensions: Perhaps Cluster P1, P3, P4?
[3-D scatter plot of the five processes P1..P5 in the new Y-coordinates.]
Characterizing Activity 6-49
Within and Between
Suppose n outcomes of random variables X_1 and X_2 are written d_1, d_2, ..., d_n. We want to divide the data into two groups representing outcomes from each of the RVs. Using whatever means, we select a subset of the points and relabel them as x_11, x_12, ..., x_1n1 and x_21, x_22, ..., x_2n2, with n_1 + n_2 = n.

The total variance (dispersion) estimated from the data is

    V = (1/n) Sum_{i=1..n} ( d_i - d̄ )^2 = (1/n) Sum_i d_i^2 - ( (1/n) Sum_i d_i )^2.

The total sum-of-squares is defined as T = nV. It can be shown that

    T = Sum_{k=1,2} Sum_{m=1..n_k} ( x_km - x̄_k )^2 + Sum_{k=1,2} n_k ( x̄_k - d̄ )^2 = W + B,

where W is called the "within" sum of squares and B is called the "between" sum of squares.
Characterizing Activity 6-50
Within/Between Example
Data: 1,5,2,4,3,6, and we choose groups 1,2,3 and 4,5,6.

    X1 := [1; 2; 3]          X2 := [4; 5; 6]
    MX1 := mean(X1) = 2      MX2 := mean(X2) = 5
    MT := ( Sum X1 + Sum X2 ) / 6 = 3.5

    SST := (X1 - MT).(X1 - MT) + (X2 - MT).(X2 - MT) = 17.5
    W := (X1 - MX1).(X1 - MX1) + (X2 - MX2).(X2 - MX2) = 4
    B := 3 (MX1 - MT)^2 + 3 (MX2 - MT)^2 = 13.5
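The same decomposition can be checked in a few lines of Python:

```python
# Within/between decomposition for the data 1,5,2,4,3,6 split into
# groups {1,2,3} and {4,5,6} (slide 6-50).
data = [1, 5, 2, 4, 3, 6]
g1, g2 = [1, 2, 3], [4, 5, 6]

mean = lambda v: sum(v) / len(v)
mt = mean(data)                               # overall mean 3.5

T = sum((d - mt) ** 2 for d in data)          # total sum of squares
W = (sum((x - mean(g1)) ** 2 for x in g1) +
     sum((x - mean(g2)) ** 2 for x in g2))    # within groups
B = (len(g1) * (mean(g1) - mt) ** 2 +
     len(g2) * (mean(g2) - mt) ** 2)          # between groups

print(T, W, B)    # 17.5 4.0 13.5, and T == W + B
```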
Characterizing Activity 6-51
Optimization Criterion
A commonly used criterion for determining appropriate groups is to divide the data in such a way as to minimize the "within-group" sum of squares, W.

This is equivalent to maximizing the "between-group" sum of squares, B.

This generally requires knowing the correct number of groups.
Characterizing Activity 6-52
Data Vectors
Our problem is more challenging because the data that we collect are vectors x_i with dimension r > 1. If we divide these into g groups, then the members of the kth group are the vectors x_k1, x_k2, ..., x_kn_k, and each vector has dimension r. (Think of these as our system-call vectors, each with r variables or counts.) The equation for the total sum of squares becomes

    T = Sum_{k=1..g} Sum_{m=1..n_k} ( x_km - x̄ )( x_km - x̄ )^T

and, similarly,

    W = Sum_{k=1..g} Sum_{m=1..n_k} ( x_km - x̄_k )( x_km - x̄_k )^T
    B = Sum_{k=1..g} n_k ( x̄_k - x̄ )( x̄_k - x̄ )^T.

Each of these matrices is r x r because the original vectors are represented as r x 1 column vectors. Again we have T = W + B. Groups are formed to minimize Trace(W).
Characterizing Activity 6-53
Notation
The vector

    x_11 = [ c_11; c_12; ...; c_1r ]

contains the counts of "calls" of type 1 through r made by process 1 of group 1. x_12 represents the vector of calls made by process 2 of group 1. Similarly, x_mn_m represents the vector of calls made by the final process, n_m, of group m. x̄ is now a vector of means:

    x̄ = [ c̄_1; c̄_2; ...; c̄_r ],

where each c̄_i is the mean count for each system call type.
Characterizing Activity 6-54
Notation Continued…
[ ]
211 11 12 11 111
2T 12 11 12 12 12 2
11 11 11 12 1
21 11 1 12 1 1
It follows that
...
... ... . The diagonal
...contains the square of the counts for each system call made
r
rr
r r r r
c c c c ccc c c c c cx x c c c
c c c c c c
⎡ ⎤⎢ ⎥⎢ ⎥= =⎢ ⎥⎢ ⎥⎢ ⎥⎣ ⎦
i
( )( )
( ) ( ) ( )
T
x1 1
2 2 2
1 1 2 21 1 1
by process 1.
Thus, , in this case is a matrix with
diag( ) , ,..., . Where
is the total number of processes (total across all grou
kng
r r km kmk m
n n n
i i ir ri i i
T x x x x
T c c c c c c n
= =
= = =
= − −
⎡ ⎤= − − −⎢ ⎥⎣ ⎦
∑∑
∑ ∑ ∑ps).
Characterizing Activity 6-55
Instructive Example:
We begin with the data matrix

    X =  1 2 3
         2 3 1
         3 1 2
         4 5 6
         5 6 4
         6 4 5
         7 8 9
         8 9 7
         9 7 8

We divide naturally into 3 groups:

    1 2 3      4 5 6      7 8 9
    2 3 1  ,   5 6 4  ,   8 9 7
    3 1 2      6 4 5      9 7 8
Characterizing Activity 6-56
Notation
The vectors x_11, x_12, x_13 represent the row-data of the data matrix, but are written as column vectors:

    x_11 = [1; 2; 3]    x_12 = [2; 3; 1]    x_13 = [3; 1; 2].

Similarly, x_21 = [4; 5; 6], x_22 = [5; 6; 4], ..., x_33 = [9; 7; 8]. Thus,

    X = [ x_11^T; x_12^T; ...; x_33^T ].

The overall mean is x̄ = [5; 5; 5].
Characterizing Activity 6-57
Total Sum of Squares
Recall

    T = Sum_{k=1..g} Sum_{m=1..n_k} ( x_km - x̄ )( x_km - x̄ )^T = Sum_{k=1..3} Sum_{m=1..3} ( x_km - x̄ )( x_km - x̄ )^T

Mathcad computation:

    XI :=  1 2 3        XI^T =  1 2 3 4 5 6 7 8 9
           2 3 1                2 3 1 5 6 4 8 9 7
           3 1 2                3 1 2 6 4 5 9 7 8
           4 5 6
           5 6 4        xmean_i := mean( XI<i> )      xmean = [5; 5; 5]
           6 4 5
           7 8 9        j := 1, 2 .. 3
           8 9 7        XV_1,j := XI^T<j>
           9 7 8        XV_2,j := XI^T<j+3>
                        XV_3,j := XI^T<j+6>

    Example:  XV_2,1 = [4; 5; 6]

    T := Sum_{k=1..3} Sum_{m=1..3} ( XV_k,m - xmean )( XV_k,m - xmean )^T
Characterizing Activity 6-58
Total Matrix Result
    T := Sum_{k=1..3} Sum_{m=1..3} ( XV_k,m - xmean )( XV_k,m - xmean )^T

    T =  60 51 51
         51 60 51
         51 51 60
Characterizing Activity 6-59
Within Matrix Result
    W := Sum_{k=1..3} Sum_{m=1..3} ( XV_k,m - gmean_k )( XV_k,m - gmean_k )^T

    gmean_1 := [ mean(XI^T<1>); mean(XI^T<2>); mean(XI^T<3>) ]
    gmean_2 := [ mean(XI^T<4>); mean(XI^T<5>); mean(XI^T<6>) ]
    gmean_3 := [ mean(XI^T<7>); mean(XI^T<8>); mean(XI^T<9>) ]

    Example:  XV_1,1 = [1; 2; 3]

    W =   6 -3 -3
         -3  6 -3        Trace W = 18.
         -3 -3  6
Characterizing Activity 6-60
Between Matrix Result

    B := Sum_{m=1..3} 3 ( gmean_m - xmean )( gmean_m - xmean )^T

    B =  54 54 54       T =  60 51 51       W =   6 -3 -3
         54 54 54            51 60 51            -3  6 -3
         54 54 54            51 51 60            -3 -3  6
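The matrix-valued T, W, B of the last three slides can be reproduced in plain Python; a sketch:

```python
# T, W, B for the 9x3 instructive example, split into three groups
# of three rows (slides 6-58 .. 6-60).
X = [[1, 2, 3], [2, 3, 1], [3, 1, 2],
     [4, 5, 6], [5, 6, 4], [6, 4, 5],
     [7, 8, 9], [8, 9, 7], [9, 7, 8]]
groups = [X[0:3], X[3:6], X[6:9]]

def vmean(rows):
    return [sum(c) / len(rows) for c in zip(*rows)]

def outer(u, v):
    return [[a * b for b in v] for a in u]

def madd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

xbar = vmean(X)                                   # overall mean [5, 5, 5]
T = [[0.0] * 3 for _ in range(3)]
W = [[0.0] * 3 for _ in range(3)]
B = [[0.0] * 3 for _ in range(3)]

for g in groups:
    gbar = vmean(g)
    dg = [a - b for a, b in zip(gbar, xbar)]
    B = madd(B, [[len(g) * e for e in row] for row in outer(dg, dg)])
    for x in g:
        dt = [a - b for a, b in zip(x, xbar)]
        dw = [a - b for a, b in zip(x, gbar)]
        T = madd(T, outer(dt, dt))                # total scatter
        W = madd(W, outer(dw, dw))                # within-group scatter

print(T)    # diagonal 60, off-diagonal 51
print(W)    # diagonal 6, off-diagonal -3; Trace W = 18
print(B)    # all entries 54
```

Elementwise, T = W + B, exactly as the slides claim.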
Characterizing Activity 6-61
Optimization Algorithms
So... we want to partition n machines (or ports or whatever) into g groups in a way that minimizes Trace(W). How?
* In theory we could compute Trace(W) for each possible partition.
* BUT the number of partitions of n objects into g groups is:

      N(n, g) = (1/g!) Sum_{m=1..g} (-1)^(g-m) C(g, m) m^n

* N(2,5) = 15,  N(10,3) = 9330,  N(50,4) = 5.3x10^28 ...
* Algorithms have been developed to search for the optimum value of clustering criteria by picking an initial partition, rearranging it in some way, and keeping the new arrangement only if the criteria are improved.
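The partition-count formula (a Stirling number of the second kind) is easy to evaluate exactly, which confirms the slide's figures:

```python
from math import comb, factorial

# Number of ways to partition n objects into g non-empty groups
# (Stirling number of the second kind).
def npartitions(n, g):
    return sum((-1) ** (g - m) * comb(g, m) * m ** n
               for m in range(g + 1)) // factorial(g)

print(npartitions(5, 2))     # 15
print(npartitions(10, 3))    # 9330
print(npartitions(50, 4))    # about 5.3x10^28
```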
Characterizing Activity 6-62
Hill Climbing Algorithms
• Initial partition of n objects into g groups• Move each object into a different group and recompute criterion• Keep the change that most improves the criterion.• Repeat until no improvement from moving a single object.
Characterizing Activity 6-63
K-Means Algorithm
A hill-climbing algorithm in which the change is made by relocating objects into the group whose mean is closest to the object. Under common conditions it results in minimizing Trace W. We return to the process analysis matrix for 5 processes as an example:

    X :=  179   11    6  226
          160  163   70   67
           30    0    1    2
           70   30    0  101
          407    0    0    4
Suppose that we use the 10 points on the next slide as an example. Note that their coordinates are labeled (Y2,Y4) from real data.
Characterizing Activity 6-64
From Transformed System Call Data
                  Y2       Y4
    Process 1     4.5     37.7
    Process 2  -147.7    -73.7
    Process 3     2.5    -19.3
    Process 4   -14.9     23.1
    Process 5    43.3   -278.2
    Process 6  -100.2    -98.5
    Process 7    27.5   -210.3
    Process 8  -150.8   -150.1
    Process 9    51.2   -225.4
    Process 10   11.2      1.7
Characterizing Activity 6-65
Kmeans Example with 3 Groups
    Y :=    4.5    37.7
         -147.7   -73.7
            2.5   -19.3
          -14.9    23.1
           43.3  -278.2
         -100.2   -98.5
           27.5  -210.3
         -150.8  -150.1
           51.2  -225.4
           11.2     1.7

    Y^T =   4.5  -147.7   2.5  -14.9   43.3  -100.2   27.5  -150.8   51.2  11.2
           37.7   -73.7 -19.3   23.1 -278.2   -98.5 -210.3  -150.1 -225.4   1.7

    (Y^T)<1> = [4.5; 37.7]    (Y^T)<5> = [43.3; -278.2]    (Y^T)<8> = [-150.8; -150.1]

Pick 3 points (value-pairs) that are furthest apart.
Characterizing Activity 6-66
Place Process 2 Begin step 1 to place process 2..
    g1mean := YT<1>      g2mean := YT<5>      g3mean := YT<8>

    dist1 := | YT<2> - g1mean | = 188.613
    dist2 := | YT<2> - g2mean | = 279.824
    dist3 := | YT<2> - g3mean | = 76.463    (minimum)

Process 2 belongs in group 3.
Characterizing Activity 6-67
Define Group Member Vectors
    group1 := [1 0 0 0 0 0 0 0 0 0]^T      (process 1)
    group2 := [0 0 0 0 1 0 0 0 0 0]^T      (process 5)
    group3 := [0 1 0 0 0 0 0 1 0 0]^T      (processes 2 and 8)

Recalculate the group 3 mean:

    g3mean := (1/2) ( YT . group3 ) = [ -149.25; -111.9 ]
Characterizing Activity 6-68
Add next process
Begin step 3 to assign process 3 to a group. Result: group 1.

    dist1 := | YT<3> - g1mean | = 57.035    (minimum)
    dist2 := | YT<3> - g2mean | = 262.095
    dist3 := | YT<3> - g3mean | = 177.772

    group1 := group1 + [0 0 1 0 0 0 0 0 0 0]^T
    g1mean := (1/2) ( YT . group1 ) = [ 3.5; 9.2 ]
Characterizing Activity 6-69
Assign process 4

    dist1 := | YT<4> - g1mean | = 23.06     (minimum)
    dist2 := | YT<4> - g2mean | = 306.87
    dist3 := | YT<4> - g3mean | = 190.46

    group1 := group1 + [0 0 0 1 0 0 0 0 0 0]^T
    ng := Sum group1 = 3
    g1mean := (1/ng) ( YT . group1 ) = [ -2.633; 13.833 ]
    g2mean = [ 43.3; -278.2 ]       g3mean = [ -149.25; -111.9 ]
Continue in this manner assigning processes 6,7,9,10.
Characterizing Activity 6-70
After all 10 points assigned
    group1 = [1 0 1 1 0 0 0 0 0 1]^T      (processes 1, 3, 4, 10)
    group2 = [0 0 0 0 1 0 1 0 1 0]^T      (processes 5, 7, 9)
    group3 = [0 1 0 0 0 1 0 1 0 0]^T      (processes 2, 6, 8)

    g1mean = [ 0.825; 10.8 ]
    g2mean = [ 40.667; -237.967 ]
    g3mean = [ -132.9; -107.433 ]
Characterizing Activity 6-71
    i := 1, 2 .. 10
    dist1_i := | YT<i> - g1mean |
    dist2_i := | YT<i> - g2mean |
    dist3_i := | YT<i> - g3mean |
Now check to see if each process remains closer to its current group mean than to either of the other group means.
Characterizing Activity 6-72
More iterations?
    Process    dist1      dist2      dist3     group
       1       27.15     278.029    199.856      1
       2      170.88     249.931     36.837      3
       3       30.147    221.973    161.557      1
       4       19.964    266.915    175.963      1
       5      292.105     40.319    245.373      2
       6      148.837    198.228     33.898      3
       7      222.703     30.64     190.551      2
       8      221.086    210.666     46.269      3
       9      241.512     16.397    218.653      2
      10       13.8      241.471    180.762      1

If any process is closer to a different group mean, it gets moved into that group and the group mean is recomputed. Continue until no change is needed, as in this case.
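The procedure above can be sketched as a standard Lloyd-style k-means loop, seeded with points 1, 5 and 8 as on slide 6-65. (The slide assigns points one at a time and updates means as it goes; batch iteration reaches the same grouping here.)

```python
from math import dist   # Euclidean distance (Python 3.8+)

# k-means on the ten (Y2, Y4) points of slide 6-64.
pts = [(4.5, 37.7), (-147.7, -73.7), (2.5, -19.3), (-14.9, 23.1),
       (43.3, -278.2), (-100.2, -98.5), (27.5, -210.3),
       (-150.8, -150.1), (51.2, -225.4), (11.2, 1.7)]
means = [pts[0], pts[4], pts[7]]        # seeds: processes 1, 5, 8

while True:
    groups = [[], [], []]
    for p in pts:
        g = min(range(3), key=lambda k: dist(p, means[k]))
        groups[g].append(p)             # assign to nearest group mean
    new = [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]
    if new == means:                    # no mean moved: converged
        break
    means = new

print([len(g) for g in groups])         # [4, 3, 3]
print([round(c, 3) for c in means[0]])  # [0.825, 10.8]
```

The final groups are processes {1, 3, 4, 10}, {5, 7, 9} and {2, 6, 8}, with the same group means the slide reports.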
Characterizing Activity 6-73
Graph of Original PVA Values
[Scatter plot of the ten (Y2, Y4) points, Y2 from -200 to 100 and Y4 from -300 to 100. Group 1 = P1, P3, P4, P10 near the origin; Group 2 = P5, P7, P9 at lower right; Group 3 = P2, P6, P8 at left.]
Characterizing Activity 6-74
Example Use of Process Clusters
- First collect data and determine process clusters.
- Determine the vector of means for each cluster.
- Compute the distance of each process from the mean vector for its cluster.
- Let Di be the RV for the distance of a process in cluster i from its mean. Estimate the distribution empirically and choose a threshold di such that P[Di > di] = 0.05 (or your choice). Or use Chebyshev's Inequality (next slide).
- For each new process, first determine the closest cluster mean (suppose group i). Compute the distance of the process to the group i mean. Flag it as suspicious if the distance is greater than di.
Characterizing Activity 6-75
Chebyshev’s Inequality
If X is a random variable with mean mu and standard deviation sigma, then

    P[ |X - mu| >= k sigma ] <= 1/k^2,  for k > 0.

Note that the distribution of X need not be known. The estimate is conservative; that is, often the probability is much less than 1/k^2.
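Applied to the thresholding scheme of the previous slide: to guarantee P[D > d] <= 0.05 without knowing D's distribution, pick k with 1/k^2 = 0.05 and set d = mu + k*sigma. A sketch; the sample distances below are hypothetical, for illustration only.

```python
from math import sqrt

# Hypothetical within-cluster distances for one cluster (illustration).
dists = [12.0, 9.5, 14.2, 11.1, 8.7, 13.4, 10.9, 9.8]
mu = sum(dists) / len(dists)
sigma = sqrt(sum((x - mu) ** 2 for x in dists) / len(dists))

k = sqrt(1 / 0.05)             # 1/k^2 = 0.05  =>  k ~ 4.472
threshold = mu + k * sigma     # Chebyshev guarantees P[D > threshold] <= 0.05
print(round(k, 3))             # 4.472

flagged = [x for x in dists if x > threshold]
print(flagged)                 # [] -- nothing in-sample is this far out
```

Because Chebyshev holds for any distribution, the resulting threshold is deliberately loose; an empirically estimated quantile would usually flag more.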
Characterizing Activity 6-76
Choosing Number of Groups
References:
- B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis, 4th Ed., Oxford University Press, NY, 2001.
- R. B. Calinski and J. Harabasz, "A dendrite method for cluster analysis," Communications in Statistics, 3, 1-27, 1974.
- R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley, NY, 1973.
Characterizing Activity 6-77
A Statistical Method for Profiling Network Traffic*
* Paper: David Marchette, published in Proceedings of the Workshop on Intrusion Detection and Network Monitoring, Santa Clara, CA, April 1999.
* “Two clustering methods described and applied to NETWORK data. These allow the clustering of machines into ‘activity groups’, which consist of machines which tend to have similar activity profiles. In addition these methods allow the user to determine whether current activity matches these profiles and hence to determine when there is ‘abnormal’ activity on the network. A method for visualizing the clusters is described, and the approaches are applied to a data set consisting of a months worth of data from 993 machines.”
Characterizing Activity 6-78
Example: Counts of Incoming Telnets (1999)
Characterizing Activity 6-79
Possible Approach
- Tabulate incoming telnet sessions for the current day and compare with activity for the previous two months. Counts can be normalized as probabilities.
- Examine abnormal activity closely.
- This can only be done for major services and a limited number of machines.
- Marchette suggests using clustering to group a large number of machines in order to find activity abnormal for the cluster.
Characterizing Activity 6-80
Example
- Counts kept for the first 1024 ports in both TCP and UDP.
- Separate counts for ports > 1023.
- Normalized by total counts to produce probability (activity) vectors of dimension 2050 (1024 + 1 + 1024 + 1) from data for 993 machines.
- Eliminating ports with prob < 0.2 leaves vectors of length 61.
- Data plotted with pixel values ~ probabilities.
- K-means algorithm used for clustering.
Characterizing Activity 6-81
Clusters from port counts of 993 machines created with k-means algorithm.
Characterizing Activity 6-82
Idea for “flagging” inbound packets as abnormal.
Use the destination address to determine the appropriate cluster profile. Look at the activity probability vector (dim 2050) for that cluster. Pick a threshold, and if P(dest_port) <= threshold, flag this packet as abnormal. Record the source address as a possible attacker.
Characterizing Activity 6-83
Marchette Results
“There were [actually] 27 source IPs that were determined to be attackers against one or more of the 993 machines in the data set.”Total number of records analyzed: 1,757,206.
Characterizing Activity 6-84
Assignment: Read this paper.
Intrusion detection and response: An empirical analysis of NATE: Network Analysis of Anomalous Traffic Events
September 2002
Proceedings of the 2002 workshop on New security paradigms
This paper presents results of an empirical analysis of NATE (Network Analysis of Anomalous Traffic Events), a lightweight, anomaly based intrusion detection tool. Previous work was based on the simulated Lincoln Labs data set. Here, we show that NATE can operate under the constraints of real data inconsistencies. In addition, new TCP sampling and distance methods are presented. Differences between real and simulated data are discussed in the course of the analysis.
Carol Taylor and Jim Alves-Foss
Note: Available in ACM Digital Library