Krzysztof Fabjański Common string pattern searching.

Page 1: Krzysztof Fabjański Common string pattern searching.

Krzysztof Fabjański

Common string pattern searching

Page 2: Krzysztof Fabjański Common string pattern searching.

Presentation layout:

➢ methods of network traffic collection and its representation

➢ process of signature generation

➢ summary and conclusions

Page 3: Krzysztof Fabjański Common string pattern searching.

Methods of network traffic collection and its representation

collecting network traffic:

➢ PC + tcpdump

➢ PC + snort

➢ honeynet

➢ nepenthes (malware collection)

➢ arakis

representation of network traffic:

➢ tcpdump format (with payload)

Page 4: Krzysztof Fabjański Common string pattern searching.

Sample tcpdump with payload

Page 5: Krzysztof Fabjański Common string pattern searching.

Process of signature generation

➢ identification of the attack

➢ classification of the threat

➢ classification of the vulnerability

➢ network traffic representation

➢ proposal of the signature

➢ normalization and validation

➢ introduction of the new signature to the rule set

Area of interest.

Page 6: Krzysztof Fabjański Common string pattern searching.

Problem of a huge amount of information

The web site sport.onet.pl loaded in 3 seconds. During that time tcpdump captured 195 packets. The capture file consisted of 5666 lines and was 415155 bytes in size.
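To put these figures in perspective, the short calculation below derives per-packet and per-second averages from the numbers on the slide. It assumes the reported size is that of the textual dump file, not the number of bytes on the wire; the variable names are illustrative only.

```python
# Figures quoted on the slide for a single page load of sport.onet.pl.
packets = 195
dump_lines = 5666
dump_bytes = 415155
load_seconds = 3

# Averages over the textual tcpdump output (not the raw bytes on the wire).
bytes_per_packet = dump_bytes / packets
lines_per_packet = dump_lines / packets
bytes_per_second = dump_bytes / load_seconds

print(f"{bytes_per_packet:.0f} bytes of dump text per packet")
print(f"{lines_per_packet:.1f} dump lines per packet")
print(f"{bytes_per_second:.0f} bytes of dump text per second")
```

Even one page load yields roughly two kilobytes of dump text per packet, which is why manual inspection does not scale.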

Page 7: Krzysztof Fabjański Common string pattern searching.

Problem of similarities

Should we have:

3 larger signatures: AA|C|HH, KK|WW|II, DD|LL|DD

or

1 common signature: ABC|C

[Figure: three groups of sample payloads, one group per larger signature; every payload also contains the common fragments ABC and C.]

Page 8: Krzysztof Fabjański Common string pattern searching.

Different types of analysis(for and against)

Offline (DBSCAN):

➢ good precision

➢ low efficiency

➢ time-consuming

Online (suffix tree algorithm):

➢ good precision

➢ good efficiency

➢ very fast

Page 9: Krzysztof Fabjański Common string pattern searching.

Suffix Trees

Suffix trees are universal data structures useful in a variety of string processing problems.

Bioinformatics:

➢ Sequence homology

➢ Detect repeats in DNA

➢ Align entire genomes

Traditional text applications:

➢ Exact and approximate substring matching

➢ Finding the longest common substring in a set

➢ Finding the longest palindrome

Page 10: Krzysztof Fabjański Common string pattern searching.

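Building the Suffix Tree with the naive algorithm

Input string: abcabd$

[Figure: the suffixes abcabd$, bcabd$, cabd$, abd$, bd$, d$ and $ are inserted one by one. In the final tree the root has the edges ab (with children cabd$ and d$), b (with children cabd$ and d$), cabd$, d$ and $.]

Running time: O(n²)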
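The naive construction can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the presenter's code: each node is a plain dictionary that maps the first character of an edge to a `[label, child]` pair, and the function name `build_naive_suffix_tree` is invented for this example. Every suffix is inserted from the root, which gives the quadratic running time.

```python
def build_naive_suffix_tree(text):
    """Insert every suffix of `text` from the root; O(n^2) character comparisons."""
    root = {}  # first character of edge -> [edge label, child node]
    for i in range(len(text)):
        node, suffix = root, text[i:]
        while suffix:
            edge = node.get(suffix[0])
            if edge is None:
                # No edge starts with this character: hang a new leaf here.
                node[suffix[0]] = [suffix, {}]
                break
            label, child = edge
            # Length of the common prefix of the edge label and the suffix.
            k = 0
            while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split it at position k.
                middle = {label[k]: [label[k:], child]}
                node[suffix[0]] = [label[:k], middle]
                child = middle
            node, suffix = child, suffix[k:]
    return root


def count_leaves(node):
    """Number of leaves below `node` (one per suffix when the text ends in a unique '$')."""
    return sum(1 if not child else count_leaves(child) for _, child in node.values())


tree = build_naive_suffix_tree("abcabd$")
print(sorted(tree))          # first characters of the edges leaving the root
print(tree["a"][0])          # label of the edge shared by abcabd$ and abd$
print(count_leaves(tree))    # one leaf per suffix
```

The unique terminator `$` guarantees that no suffix is a prefix of another, so every suffix ends in its own leaf.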

Page 11: Krzysztof Fabjański Common string pattern searching.

Building the Suffix Tree with the Ukkonen algorithm, O(n)

➢ Online algorithm

➢ Uses suffix links, which link the node for xα to the node for α

[Figure: suffix tree of abcabd$ with a suffix link drawn from node xα to node α.]

Page 12: Krzysztof Fabjański Common string pattern searching.

Building the Suffix Tree with the Ukkonen algorithm - pseudocode

    create a root
    add a branch and leaf with S[1] label
    LastExtension = 1
    for Phase = 2 to length[S]
    do
        for Extension = LastExtension to Phase
        do
            find the end of the path with S[Extension .. Phase - 1] label
            extend the path
            if rule for extension == 3 then end the loop
        done
        LastExtension = Extension
    done
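For readers who want something executable, the following Python sketch follows the widely used "active point" formulation of Ukkonen's algorithm. It is an illustration, not the presenter's implementation: the names `Node`, `build_suffix_tree` and `contains` are assumptions made for this example, and the phase/extension loops of the pseudocode appear here as the outer `for` and inner `while`.

```python
class Node:
    def __init__(self, start, end):
        self.start = start   # index of the first character on the incoming edge
        self.end = end       # exclusive end index; None marks a leaf that grows with the text
        self.link = None     # suffix link: the node for x+alpha points to the node for alpha
        self.children = {}   # first character of an edge -> child node


def build_suffix_tree(text):
    root = Node(-1, -1)
    active_node, active_edge, active_length = root, 0, 0
    remainder = 0                           # suffixes not yet inserted explicitly
    for pos, ch in enumerate(text):         # one phase per character
        remainder += 1
        need_link = None                    # node waiting for its suffix link
        while remainder > 0:                # one extension per pending suffix
            if active_length == 0:
                active_edge = pos
            first = text[active_edge]
            child = active_node.children.get(first)
            if child is None:
                # Extension rule 2: add a new leaf below the active node.
                active_node.children[first] = Node(pos, None)
                if need_link is not None:
                    need_link.link = active_node
                need_link = active_node
            else:
                edge_end = pos + 1 if child.end is None else child.end
                edge_len = edge_end - child.start
                if active_length >= edge_len:
                    # The active point lies beyond this edge: walk down and retry.
                    active_edge += edge_len
                    active_length -= edge_len
                    active_node = child
                    continue
                if text[child.start + active_length] == ch:
                    # Extension rule 3: the character is already there, end the phase.
                    active_length += 1
                    if need_link is not None:
                        need_link.link = active_node
                    break
                # Extension rule 2 with an edge split.
                split = Node(child.start, child.start + active_length)
                active_node.children[first] = split
                split.children[ch] = Node(pos, None)
                child.start += active_length
                split.children[text[child.start]] = child
                if need_link is not None:
                    need_link.link = split
                need_link = split
            remainder -= 1
            if active_node is root and active_length > 0:
                active_length -= 1
                active_edge = pos - remainder + 1
            else:
                # Follow the suffix link, falling back to the root if there is none.
                active_node = active_node.link if active_node.link is not None else root
    return root


def contains(root, text, pattern):
    """Return True if `pattern` is a substring of `text`, using the tree built from `text`."""
    node, i = root, 0
    while i < len(pattern):
        child = node.children.get(pattern[i])
        if child is None:
            return False
        end = len(text) if child.end is None else child.end
        label = text[child.start:end]
        chunk = pattern[i:i + len(label)]
        if not label.startswith(chunk):
            return False
        i += len(chunk)
        node = child
    return True


text = "abcabd$"
tree = build_suffix_tree(text)
print(contains(tree, text, "cab"), contains(tree, text, "cba"))
```

Edges store index pairs into the text instead of copied labels, which keeps the memory use linear in the length of the text.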

Page 13: Krzysztof Fabjański Common string pattern searching.

Comparison of two strings s1 and s2 in steps

➢ build a suffix tree for s1

➢ find the longest match of each suffix of s2 on the suffix tree of s1

➢ return the longest match found for any suffix of s2 on the suffix tree of s1
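The three steps can be sketched as follows. For brevity the example uses an uncompressed suffix trie (nested dictionaries) in place of a suffix tree, so it needs quadratic space and is meant only to illustrate the matching step; the function names are assumptions made for this sketch.

```python
def build_suffix_trie(s):
    """Uncompressed stand-in for the suffix tree of s: one dictionary level per character."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def longest_common_substring(s1, s2):
    """Match every suffix of s2 against the suffix trie of s1 and keep the longest match."""
    trie = build_suffix_trie(s1)
    best = ""
    for i in range(len(s2)):
        node, length = trie, 0
        for ch in s2[i:]:
            if ch not in node:
                break
            node = node[ch]
            length += 1
        if length > len(best):
            best = s2[i:i + length]
    return best


print(longest_common_substring("abcabd", "xcabdy"))  # prints "cabd"
```

With a real suffix tree and suffix links the matching step can reuse the previous match instead of restarting at the root for every suffix of s2.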

Page 14: Krzysztof Fabjański Common string pattern searching.

Comparison of more than two strings using a generalized suffix tree

➢ concatenate the strings {s1,s2,...,sn}

➢ build a suffix tree for the concatenated string s using Ukkonen's approach

➢ return the substring that is most common across {s1,s2,...,sn}
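A minimal sketch of the idea for n strings is shown below. It makes two simplifying assumptions that are not on the slide: the suffixes of each string are inserted into one shared, uncompressed trie instead of building a tree over the concatenation, and "most common" is interpreted as the longest substring that occurs in every input string. Each node carries a bit mask recording which strings pass through it.

```python
def longest_common_substring(strings):
    """Longest substring present in every string, found with a generalized suffix trie."""
    all_ids = (1 << len(strings)) - 1          # bit i is set when string i reaches a node
    root = {"mask": 0, "next": {}}
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["next"].setdefault(ch, {"mask": 0, "next": {}})
                node["mask"] |= 1 << idx
    # Depth-first search restricted to nodes that every string reaches.
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["next"].items():
            if child["mask"] == all_ids:
                stack.append((child, path + ch))
    return best


print(longest_common_substring(["abcabd", "xcabdy", "zzcabz"]))  # prints "cab"
```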

Page 15: Krzysztof Fabjański Common string pattern searching.

Common string pattern searching – main assumptions

➢ online string comparison requires O(n) running time

➢ should find all possible common substrings

➢ should cluster the results into sets of common strings

Page 16: Krzysztof Fabjański Common string pattern searching.

Common string pattern searching proposition

➢ generalized suffix tree as the main structure (strings are added in online mode, with no concatenation)

➢ additional variables describing the weight of a particular node (number of matches)

➢ additional structure: a list of strings with the numbers denoting the starting position of the suffix in those strings (possible use of hash tables)

Page 17: Krzysztof Fabjański Common string pattern searching.

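An example:

Input strings: abc$, aba$, cab$

[Figure: generalized suffix tree of the three strings. Each node is annotated with its weight followed by {start position, string} entries.]

Annotations of the internal nodes:

3 {1 abc$} {1 aba$} {2 cab$}

3 {2 abc$} {2 aba$} {3 cab$}

2 {3 abc$} {1 cab$}

3 {4 abc$} {4 aba$} {4 cab$}

Annotations of the leaves (weight 1 each):

1 {3 abc$}, 1 {3 aba$}, 1 {4 cab$}, 1 {4 abc$}, 1 {2 cab$}

Expected result: ab | $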
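The example can be reproduced with a small Python sketch. It is written under assumptions and is not the presenter's implementation: an uncompressed trie stands in for the generalized suffix tree, the weight of a node is taken to be the number of distinct strings that reach it, each occurrence is stored as a (string index, 1-based start position) pair, and the function names are invented for this illustration.

```python
def build_annotated_trie(strings):
    """Insert all suffixes of all strings; every node records where its path occurs."""
    root = {"next": {}, "occ": []}
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["next"].setdefault(ch, {"next": {}, "occ": []})
                node["occ"].append((idx, start + 1))   # 1-based, as on the slide
    return root


def weight(node):
    """Number of distinct input strings that contain the path leading to this node."""
    return len({idx for idx, _ in node["occ"]})


def find(root, pattern):
    """Return the node reached by spelling `pattern` from the root, or None."""
    node = root
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return None
    return node


def maximal_common_substrings(strings):
    """Substrings found in every string that are not part of a longer common substring."""
    root = build_annotated_trie(strings)
    common = []
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["next"].items():
            if weight(child) == len(strings):
                common.append(path + ch)
                stack.append((child, path + ch))
    return sorted(c for c in common if not any(c != d and c in d for d in common))


strings = ["abc$", "aba$", "cab$"]
node = find(build_annotated_trie(strings), "ab")
print(weight(node), node["occ"])           # 3 [(0, 1), (1, 1), (2, 2)]
print(maximal_common_substrings(strings))  # ['$', 'ab']
```

The occurrence list for "ab" corresponds to the annotation 3 {1 abc$} {1 aba$} {2 cab$}, and the two maximal common substrings agree with the expected result ab | $.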

Page 18: Krzysztof Fabjański Common string pattern searching.

Thank you for your attention