Relevant characteristics extraction from semantically unstructured data

PhD title: Data mining in unstructured data. Daniel I. MORARIU, MSc. PhD Supervisor: Lucian N. VINŢAN. Sibiu, 2006.


Transcript of Relevant characteristics extraction from semantically unstructured data

Page 1: Relevant characteristics extraction from semantically unstructured data

Relevant characteristics extraction from semantically unstructured data

PhD title: Data mining in unstructured data

Daniel I. MORARIU, MSc
PhD Supervisor: Lucian N. VINŢAN
Sibiu, 2006

Page 2:

Contents

- Prerequisites
- Correlation of the SVM kernel's parameters
  - Polynomial kernel
  - Gaussian kernel
- Feature selection using Genetic Algorithms
  - Chromosome encoding
  - Genetic operators
- Meta-classifier with SVM
  - Non-adaptive method: Majority Vote
  - Adaptive methods
    - Selection based on Euclidean distance
    - Selection based on cosine
- Initial data set scalability
- Choosing training and testing data sets
- Conclusions and further work

Page 3:

Prerequisites

Reuters database processing:
- 806,791 total documents; 126 topics, 366 regions, 870 industry codes
- Industry category selection "system software": 7,083 documents (4,722 training / 2,361 testing)
- 19,038 attributes (features), 24 classes (topics)

Data representation: Binary, Nominal, Cornell SMART

Classifier using Support Vector Machine techniques (kernels)
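The three data representations listed above assign each term a weight computed from its frequency in the document. A minimal sketch; the exact formulas, in particular the Cornell SMART variant, are assumptions based on common usage rather than taken from the thesis:

```python
import math

def binary_weight(tf: int) -> float:
    """Binary representation: 1 if the term occurs in the document, else 0."""
    return 1.0 if tf > 0 else 0.0

def nominal_weight(tf: int, max_tf: int) -> float:
    """Nominal representation: term frequency normalized by the largest
    term frequency in the document (assumed form)."""
    return tf / max_tf if max_tf > 0 else 0.0

def cornell_smart_weight(tf: int) -> float:
    """Cornell SMART representation: a damped logarithmic term frequency,
    1 + log(1 + log(tf)) for tf > 0 (assumed form)."""
    return 0.0 if tf == 0 else 1.0 + math.log(1.0 + math.log(tf))
```

For a document where a term occurs twice, with the most frequent term occurring four times, the three weights would be 1.0, 0.5 and roughly 1.53.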

Page 4:

Correlation of the SVM kernel's parameters

Polynomial kernel: k(x, x') = (2·d + x·x')^d

Gaussian kernel: k(x, x') = exp(−‖x − x'‖² / (n·C))

Page 5:

Polynomial kernel

Commonly used kernel: k(x, x') = (x·x' + b)^d
- d: the degree of the kernel
- b: the offset (bias)

Our suggestion is to correlate the parameters as b = 2·d, which gives k(x, x') = (2·d + x·x')^d
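The kernel and the proposed correlation can be sketched in plain Python; the function name is ours, not from the thesis:

```python
def polynomial_kernel(x, x_prime, d, b=None):
    """Polynomial kernel k(x, x') = (x.x' + b)^d.
    If b is not supplied, the suggested correlation b = 2*d is used."""
    if b is None:
        b = 2 * d  # the proposed parameter correlation
    dot = sum(xi * xj for xi, xj in zip(x, x_prime))
    return (dot + b) ** d
```

For example, `polynomial_kernel([1, 2], [3, 4], d=2)` evaluates (11 + 4)**2 = 225.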

Page 6:

Bias – Polynomial kernel

[Chart: influence of the bias, nominal representation of the input data. Accuracy (%) vs values of the bias b ∈ {0, 1, …, 10, 50, 100, 500, 1000, 1309}, one curve per degree d = 1…4; the proposed choice b = 2·d, i.e. k(x, x') = (2·d + x·x')^d, is marked as "Our choice".]

Page 7:

Gaussian kernel parameters' correlation

Commonly used kernel: k(x, x') = exp(−‖x − x'‖² / C)
- C: usually represents the dimension of the input set

Our suggestion is to divide also by n, the number of distinct features greater than 0, which gives k(x, x') = exp(−‖x − x'‖² / (n·C))

Page 8:

n – Gaussian kernel

[Chart: influence of n, Cornell SMART data representation. Accuracy (%) vs values of the parameter n ∈ {1, 10, 50, 100, 500, 654, 1000, 1309, auto}, one curve per C ∈ {1.0, 1.3, 1.8, 2.1}; kernel k(x, x') = exp(−‖x − x'‖² / (n·C)).]

Page 9:

Feature selection using Genetic Algorithms

Chromosome: c = (w₀, w₁, …, w₁₉₀₃₈, b)

Fitness: fitness(cᵢ) = SVM(cᵢ), the classification performance of an SVM whose decision function f(x) = ⟨w, x⟩ + b uses the chromosome's weights wⱼ and bias b

Methods of selecting parents: Roulette Wheel, Gaussian selection

Genetic operators: Selection, Mutation, Crossover

Page 10:

Methods of selecting the parents

- Roulette Wheel: each individual is represented by a slice of the wheel whose size is proportional to its fitness
- Gaussian selection, with maximum value m = 1 and dispersion σ = 0.4:

  P(cᵢ) = exp(−(1/2)·((fitness(cᵢ) − m) / σ)²)
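Both parent-selection schemes can be sketched as follows; fitness values are assumed to be non-negative (e.g. SVM accuracies), and the function names are ours:

```python
import math
import random

def roulette_wheel_select(population, fitnesses):
    """Roulette Wheel: each individual occupies a slice of the wheel
    proportional to its fitness; spin once and return the winner."""
    total = sum(fitnesses)
    pick = random.uniform(0.0, total)
    acc = 0.0
    for individual, fit in zip(population, fitnesses):
        acc += fit
        if acc >= pick:
            return individual
    return population[-1]

def gaussian_selection_probability(fitness, m=1.0, sigma=0.4):
    """Gaussian selection: P(c) = exp(-(1/2) * ((fitness(c) - m) / sigma)^2),
    with maximum value m = 1 and dispersion sigma = 0.4 as on the slide."""
    return math.exp(-0.5 * ((fitness - m) / sigma) ** 2)
```

Individuals whose fitness is close to m = 1 get selection probability near 1; the probability falls off symmetrically on both sides.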

Page 11:

The process of obtaining the next generation

Starting from the current generation (Selection, Crossover, Mutation, then the new generation):
1. The best chromosome is copied from the old population into the new population.
2. Select two parents.
3. Create two children from the selected parents, using crossover with a parent split.
4. If more chromosomes are needed in the set, randomly eliminate one of the parents and repeat from step 2; otherwise continue.
5. Mutation: randomly change the sign of a random number of elements.
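One generation step can be sketched compactly. The single random split point for crossover and the per-gene sign-flip mutation rate are simplifying assumptions, and the helper names are hypothetical:

```python
import random

def next_generation(population, fitness, pop_size, mutation_rate=0.05):
    """One GA step: elitism, single-point crossover of two sampled
    parents, then sign-flip mutation on everything but the elite copy.
    `fitness` scores a chromosome (a list of numeric weights)."""
    scored = sorted(population, key=fitness, reverse=True)
    new_pop = [list(scored[0])]               # best chromosome is copied over
    while len(new_pop) < pop_size:
        p1, p2 = random.sample(population, 2)  # select two parents
        cut = random.randrange(1, len(p1))     # split point for crossover
        c1 = p1[:cut] + p2[cut:]
        c2 = p2[:cut] + p1[cut:]
        new_pop.extend([c1, c2][:pop_size - len(new_pop)])
    for chrom in new_pop[1:]:                  # mutation skips the elite copy
        for i in range(len(chrom)):
            if random.random() < mutation_rate:
                chrom[i] = -chrom[i]           # randomly change the sign
    return new_pop
```

With `mutation_rate=0.0` the step is pure selection plus crossover, which makes the elitism easy to verify.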

Page 12:

GA_FS versus SVM_FS for 1309 features

[Chart: accuracy (%) vs polynomial kernel degree (D1.0 to D5.0) for GA-BIN, GA-NOM, GA-CS, SVM-BIN, SVM-NOM and SVM-CS (binary, nominal and Cornell SMART representations).]

Page 13:

Training time, polynomial kernel, d = 2, NOM

[Chart: training time (minutes) vs number of features (475, 1309, 2488, 8000) for GA_FS, SVM_FS and IG_FS.]

Page 14:

GA_FS versus SVM_FS for 1309 features

[Chart: accuracy (%) vs Gaussian kernel parameter C (1.0, 1.3, 1.8, 2.1, 2.8, 3.1) for GA-BIN, GA-CS, SVM-BIN and SVM-CS.]

Page 15:

Training time, Gaussian kernel, C = 1.3, BIN

[Chart: training time (minutes) vs number of features (475, 1309, 2488, 8000) for GA_FS, SVM_FS and IG_FS.]

Page 16:

Meta-classifier with SVM

Set of SVMs:
- Polynomial degree 1, Nominal
- Polynomial degree 2, Binary
- Polynomial degree 2, Cornell SMART
- Polynomial degree 3, Cornell SMART
- Gaussian C = 1.3, Binary
- Gaussian C = 1.8, Cornell SMART
- Gaussian C = 2.1, Cornell SMART
- Gaussian C = 2.8, Cornell SMART

Upper limit: 94.21%

Page 17:

Meta-classifier methods

- Non-adaptive method, Majority Vote: each classifier votes a specific class for the current document
- Adaptive methods: compute the similarity between the current sample and the error samples from the classifier's self queue
  - Selection based on Euclidean distance: first good classifier; the best classifier
  - Selection based on cosine: first good classifier; the best classifier; using the average
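The non-adaptive method is a plain majority vote over the classifier set; a minimal sketch:

```python
from collections import Counter

def majority_vote(predictions):
    """Non-adaptive meta-classification: each SVM in the set votes a
    class for the current document; the most voted class wins."""
    votes = Counter(predictions)
    return votes.most_common(1)[0][0]
```

For example, if five of the eight SVMs vote the same topic, that topic is the meta-classifier's answer regardless of the other three votes.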

cos(x, x') = ⟨x, x'⟩ / (‖x‖·‖x'‖) = Σᵢ₌₁ⁿ x[i]·x'[i] / (√(Σᵢ₌₁ⁿ x[i]²) · √(Σᵢ₌₁ⁿ x'[i]²))

Eucl(x, x') = √(Σᵢ₌₁ⁿ (x[i] − x'[i])²)
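The two similarity measures used by the adaptive methods, written out directly from the formulas above:

```python
import math

def euclidean_distance(x, x_prime):
    """Eucl(x, x') = sqrt(sum_i (x[i] - x'[i])^2)"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_prime)))

def cosine_similarity(x, x_prime):
    """cos(x, x') = <x, x'> / (||x|| * ||x'||); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, x_prime))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_xp = math.sqrt(sum(b * b for b in x_prime))
    return dot / (norm_x * norm_xp)
```

Euclidean distance is small for similar samples, while cosine similarity is large (close to 1), so the two selection schemes rank classifiers in opposite directions.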

Page 18:

Selection based on Euclidean distance

[Chart: classification accuracy (%) vs steps 1 to 13 for the Upper Limit, FC-SBED and BC-SBED.]

Page 19:

Selection based on cosine

[Chart: classification accuracy (%) vs steps 1 to 13 for the Upper Limit, FC-SBCOS, BC-SBCOS and BC-SBCOS with average.]

Page 20:

Comparison between SBED and SBCOS

[Chart: classification accuracy (%) vs steps 1 to 13 for Majority Vote, SBED, SBCOS and the Upper Limit.]

Page 21:

Comparison between SBED and SBCOS

[Chart: processing time (minutes) vs steps 1 to 13 for Majority Vote, SBED and SBCOS.]

Page 22:

Initial data set scalability

1. Normalize each sample (7,053 samples).
2. Group the initial set based on distance (4,474 groups).
3. Take the relevant vector of each group (4,474) and use the relevant vectors in the classification process.
4. Select only the support vectors (847).
5. Take the samples grouped in the selected support vectors (4,256).
6. Make the classification with the 4,256 samples.
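Steps 1 and 2 of the reduction can be sketched as below. The greedy, threshold-based grouping and the choice of each group's first member as its "relevant vector" are assumptions made for illustration; the thesis may group the samples differently:

```python
import math

def normalize(sample):
    """Step 1: scale a sample to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in sample))
    return [v / norm for v in sample] if norm > 0 else list(sample)

def group_by_distance(samples, threshold):
    """Step 2: greedy grouping of the normalized set based on distance.
    Each group is represented by its first member, acting as the
    'relevant vector' used later in the classification process."""
    relevant, groups = [], []
    for s in samples:
        for idx, r in enumerate(relevant):
            if math.dist(s, r) <= threshold:
                groups[idx].append(s)  # close enough: join this group
                break
        else:
            relevant.append(s)         # start a new group around s
            groups.append([s])
    return relevant, groups
```

Classification is then run on the relevant vectors only; afterwards, the samples belonging to the groups of the selected support vectors are recovered for the final classification.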

Page 23:

Polynomial kernel – 1309 features, NOM

[Chart: kernel degree influence. Accuracy (%) vs degree of the kernel (D1.0 to D5.0) for SVM-7053 and SVM-4256.]

Page 24:

Gaussian kernel – 1309 features, CS

[Chart: accuracy (%) vs parameter C (1, 1.3, 1.8, 2.1, 2.8) for SVM-7053 and SVM-4256.]

Page 25:

Training time

[Chart: training time (minutes) vs parameter C (C1.0 to C2.8) for 7053-Bin, 7053-CS, 4256-Bin and 4256-CS.]

Page 26:

Choosing training and testing data set

[Chart: 1309 features, polynomial kernel. Accuracy (%) vs kernel degree (D1.0 to D5.0) and the average, comparing the average over the old set with the average over the new set.]

Page 27:

Choosing training and testing data set

[Chart: 1309 features, Gaussian kernel. Accuracy (%) vs parameter C (C1.0 to C2.8) and the average, comparing the average over the old set with the average over the new set.]

Page 28:

Conclusions – other results

- Using our parameter correlation: 3% better for the polynomial kernel, 15% better for the Gaussian kernel
- Reduced number of features to between 2.5% (475 features) and 6% (1309 features) of the total
- GA_FS is faster than SVM_FS
- Best results: polynomial kernel with nominal representation and small degree; Gaussian kernel with Cornell SMART representation
- The Reuters database is linearly separable
- SBED is better and faster than SBCOS
- Classification accuracy decreases by only 1.2% when the data set is reduced

Page 29:

Further work

- Feature extraction and selection: association rules between words (Mutual Information); the synonymy and polysemy problem; using families of words (WordNet)
- Web mining application
- Classifying larger text data sets
- A better method of grouping data
- Using classification and clustering together