File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf ·...

24
INFORMATION NETWORKS DIVISION COMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au 1 DFRWS20003 File Classification Using Sub- Sequence Kernels Olivier de Vel Computer Forensics Group DSTO, Australia

Transcript of File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf ·...

Page 1: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

1

DFRWS20003

File Classification Using Sub-Sequence Kernels

Olivier de VelComputer Forensics Group

DSTO, Australia

Page 2: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

2

DFRWS20003

Presentation:

Document mining and computer forensics

Semi-structured documents

Broadband k-spectrum kernel

Coloured generalized suffix tree representation

Experimental methodology

Results and Conclusion

Page 3: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

3

DFRWS20003

Computer Forensics: Focus on Document Mining

Eg, disk surface contents

structured data

unstructured datasemi-structured

data

Page 4: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

4

DFRWS20003

Presentation:

Document mining and computer forensics

Semi-structured documents

Broadband k-spectrum kernel

Coloured generalized suffix tree representation

Experimental methodology

Results and Conclusion

Page 5: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

5

DFRWS20003

Semi-Structured Documents

May or may not include natural language text

May be compound documents

Include low-level, short-range structures. For example:

byte block identifiers

mark-up tokens

Page 6: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

6

DFRWS20003

Semi-Structured Documents

⇒ Traditional text mining techniques generally not appropriate.

Page 7: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

7

DFRWS20003

Presentation:

Document mining and computer forensics

Semi-structured documents

Broadband k-spectrum kernel

Coloured generalized suffix tree representation

Experimental methodology

Results and Conclusion

Page 8: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

8

DFRWS20003

Broadband k-Spectrum Kernel:

Use an extension to the string kernel.

String (k-Spectrum) Kernel:Is the set of all k-length contiguous subsequences that a sequence xi (of alphabet T, size |T|=l) containsFeature space dimension is up to |T|k (eg, if k=4 and |T| = 256 ⇒number of dimensions ≈ 109)

BOOKOOPERPPO…

BOOKOOKO

OKOOKOOP

(k=4)

Xi =

OOPEetc….

Vector representation: (φκ1(xi), φκ1(xi), …, φκ1(xi)) where κ1, κ2,…, κl ∈ Tk

Page 9: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

9

DFRWS20003

Broadband k-Spectrum Kernel:

Broadband k-Spectrum Kernel:

Is the set of all (0, 1, ..k)-length contiguous subsequences that a sequence (of alphabet T) containsFeature space dimension is even larger!

BOOKOOPERPPO…

B, BO, BOO, BOOKO, OO, OOK, OOKO

O, OK, OKO, OKOOK, KO, KOO, KOOP

(k=4)

O, OO, OOP, OOPEetc….

Page 10: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

10

DFRWS20003

Broadband k-Spectrum Kernel:

Matrix representation, φ(BB)(xi) :

(φκ1(1)(xi) φκ2(1)(xi) … φκl(1)(xi), φκ1(2)(xi) φκ2(2)(xi) … φκl(2)(xi),

:::

φκ1(2)(xi) φκ2(2)(xi) … φκl(2)(xi))

where κi(j) ∈ Tj for i = 1, 2,..l

Page 11: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

11

DFRWS20003

Broadband k-Spectrum Kernel: What’s the Big Deal?

This allows us to define the inner product between two input sequences, xr and xs as:

GramMatrix(xr, xs) ≡ φ(BB)(xr) • φ(BB)(xs)

and use the “kernel trick” when maximising the margin of separation between two classes. The decision function for the SVM learning algorithm is:

f(x) = sign(Σ αI yi[φ(BB)(xr) • φ(BB)(xs)] + b)

in the transformed feature space rather than the input feature space.

Page 12: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

12

DFRWS20003

Presentation:

Document mining and computer forensics

Semi-structured documents

Broadband k-spectrum kernel

Coloured generalized suffix tree representation

Experimental methodology

Results and Conclusion

Page 13: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

13

DFRWS20003

Coloured Generalized Suffix Tree (CGST):

BOOKOOPERPPO PPEKOOBOKKOO(k=4)

The CGST is a labelling of the GST representation of a sequence of symbols:

It stores a count vector to be stored at each node in the GST – each vector element value equals the frequency of the subsequence in the sequence.

B O K K(1,1) (1,1) (0,1)

(1,0)

(0,1)

(1,0)

(0,1)

O K

(1,2)

(0,1)

(1,0)

OE K O(1,1) (0,1)

R P(0,1)

P

(1,3)

(1,0) (1,0)

OK K O(0,1)

O O B(0,1)

(0,1)

P(1,2)

(1,0)

Page 14: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

14

DFRWS20003

CGST: Computing the Kernel (Gram) Matrix

For N input sequences, the NxN Gram matrix G is computed as follows:

initialize G r,s = 0 ∀ xr and xs

traverse the CGST

at each node, compute the product of each vector element pair and sum to G r,s

Page 15: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

15

DFRWS20003

Presentation:

Document mining and computer forensics

Semi-structured documents

Broadband k-spectrum kernel

Coloured generalized suffix tree representation

Experimental methodology

Results and Conclusion

Page 16: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

16

DFRWS20003

Experimental Methodology: Corpus

Document corpus for a controlled experiment :

5 document classes (MSWord, JPEG, MSExcel,

Java source, Adobe PDF)

465 documents (0.11KB min. to 523.3KB max.

size)

data are scaled to unit standard deviation and

zero mean

Page 17: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

17

DFRWS20003

Experimental Methodology: Classifier

SVMlight as the (two-way) classifier,

Obtain two-way categorisation matrix for each

document type category, using 10-fold cross-validation

sampling,

Calculate per-document-type category classification

performance statistics – precision, recall and F1

statistics.

Page 18: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

18

DFRWS20003

Presentation:

Document mining and computer forensics

Semi-structured documents

Broadband k-spectrum kernel

Coloured generalized suffix tree representation

Experimental methodology

Results and Conclusion

Page 19: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

19

DFRWS20003

Experiments: Performance Results

Classification performance was evaluated using the

following parameters:

Minimum and maximum sub-sequence window

offset lengths,

Page 20: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

20

DFRWS20003

Experiments: Performance Results

Minimum and Maximum Sub-sequence Window Offset Length:

Definition: For example;

sub-sequence window(length = 5)

MinSeqLength(=2)

MaxSeqLength(=4)

Page 21: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

21

DFRWS20003

Experiments: Performance Results

Minimum Sub-sequence Offset Length, MinSeqLength:

12

34

5

MSWord

JPEG

MSExcel

Java

PDF

0

10

20

30

40

50

60

70

80

90

100

F1 Performance (%)

Minimum Sub-sequence Offset Length

Document Type

Optimal MinSeqLength=1

Use broadband spectrum

Faster drop-off for non-text

documents (except JPEG)

[Parameters: MaxStringLength=256 MaxSeqLength=5 MinSeqCount=10]

Page 22: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

22

DFRWS20003

Experiments: Performance Results

Maximum Sub-sequence Offset Length, MaxSeqLength:

12

34

58

1015

MSWord

JPEG

MSExcel

Java

PDF

0

10

20

30

40

50

60

70

80

90

100

F1 Performance (%)

Maximum Sub-sequence Offset Length

Document Type

Optimal value is

4<MaxSeqLength<7

No significant improvements

for large sub-sequence offset

values

Page 23: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

23

DFRWS20003

Conclusions:

Promising semi-structured document categorization results using the broadband k-spectrum kernel.Experiments suggest:

use broadband kernels rather than narrow band kernels (iefixed k-length sequence), document length can be small,use a minimum suffix count to achieve improved classification

performance and tree compression.Extensions:

Increase the number of document typesInvestigate techniques for capturing longer-range data structures in documents

Page 24: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix

INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au

24

DFRWS20003

Questions ?Advertisement:

“Computer and Intrusion Forensics” by G. Mohay, A. Anderson, B.Collie, O. de Vel and R. McKemmishArtech House ISBN 1-58053-369-8 (2003)