Relevant characteristics extraction from semantically unstructured data

PhD title: Data mining in unstructured data. Daniel I. MORARIU, MSc. PhD Supervisor: Lucian N. VINŢAN. Sibiu, 2006.


Transcript of Relevant characteristics extraction from semantically unstructured data

Page 1: Relevant characteristics extraction from semantically unstructured data

Relevant characteristics extraction from semantically unstructured data

PhD title: Data mining in unstructured data

Daniel I. MORARIU, MSc
PhD Supervisor: Lucian N. VINŢAN
Sibiu, 2006

Page 2:

Contents

- Prerequisites
- Correlation of the SVM kernel's parameters
  - Polynomial kernel
  - Gaussian kernel
- Feature selection using Genetic Algorithms
  - Chromosome encoding
  - Genetic operators
- Meta-classifier with SVM
  - Non-adaptive method: Majority Vote
  - Adaptive methods
    - Selection based on Euclidean distance
    - Selection based on cosine
- Initial data set scalability
- Choosing training and testing data sets
- Conclusions and further work

Page 3:

Prerequisites

Reuters database processing:
- 806,791 total documents; 126 topics, 366 regions, 870 industry codes
- Industry category selection "system software": 7,083 documents (4,722 training / 2,361 testing)
- 19,038 attributes (features), 24 classes (topics)

Data representation: Binary, Nominal, Cornell SMART

Classifier using Support Vector Machine techniques (kernels)
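The three data representations listed above assign each term a weight computed from its frequency in the document. A minimal sketch; the exact formulas, in particular the Cornell SMART variant, are assumptions based on common usage rather than taken from the thesis:

```python
import math

def binary_weight(tf: int) -> float:
    """Binary representation: 1 if the term occurs in the document, else 0."""
    return 1.0 if tf > 0 else 0.0

def nominal_weight(tf: int, max_tf: int) -> float:
    """Nominal representation: term frequency normalized by the largest
    term frequency in the document (assumed form)."""
    return tf / max_tf if max_tf > 0 else 0.0

def cornell_smart_weight(tf: int) -> float:
    """Cornell SMART representation: a damped logarithmic term frequency,
    1 + log(1 + log(tf)) for tf > 0 (assumed form)."""
    return 0.0 if tf == 0 else 1.0 + math.log(1.0 + math.log(tf))
```

For a document where a term occurs twice, with the most frequent term occurring four times, the three weights would be 1.0, 0.5 and roughly 1.53.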

Page 4:

Correlation of the SVM kernel's parameters

Polynomial kernel: k(x, x') = (2·d + x·x')^d

Gaussian kernel: k(x, x') = exp(−‖x − x'‖² / (n·C))

Page 5:

Polynomial kernel

Commonly used kernel: k(x, x') = (x·x' + b)^d
- d: the degree of the kernel
- b: the offset (bias)

Our suggestion is to correlate the parameters as b = 2·d, which gives k(x, x') = (2·d + x·x')^d
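The kernel and the proposed correlation can be sketched in plain Python; the function name is ours, not from the thesis:

```python
def polynomial_kernel(x, x_prime, d, b=None):
    """Polynomial kernel k(x, x') = (x.x' + b)^d.
    If b is not supplied, the suggested correlation b = 2*d is used."""
    if b is None:
        b = 2 * d  # the proposed parameter correlation
    dot = sum(xi * xj for xi, xj in zip(x, x_prime))
    return (dot + b) ** d
```

For example, `polynomial_kernel([1, 2], [3, 4], d=2)` evaluates (11 + 4)**2 = 225.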

Page 6:

Bias – Polynomial kernel

[Chart: influence of the bias, nominal representation of the input data. Accuracy (%) vs values of the bias b ∈ {0, 1, …, 10, 50, 100, 500, 1000, 1309}, one curve per degree d = 1…4; the proposed choice b = 2·d, i.e. k(x, x') = (2·d + x·x')^d, is marked as "Our choice".]

Page 7:

Gaussian kernel parameters' correlation

Commonly used kernel: k(x, x') = exp(−‖x − x'‖² / C)
- C: usually represents the dimension of the input set

Our suggestion is to divide also by n, the number of distinct features greater than 0, which gives k(x, x') = exp(−‖x − x'‖² / (n·C))

Page 8:

n – Gaussian kernel

[Chart: influence of n, Cornell SMART data representation. Accuracy (%) vs values of the parameter n ∈ {1, 10, 50, 100, 500, 654, 1000, 1309, auto}, one curve per C ∈ {1.0, 1.3, 1.8, 2.1}; kernel k(x, x') = exp(−‖x − x'‖² / (n·C)).]

Page 9:

Feature selection using Genetic Algorithms

Chromosome: c = (w₀, w₁, …, w₁₉₀₃₈, b)

Fitness: fitness(cᵢ) = SVM(cᵢ), the classification performance of an SVM whose decision function f(x) = ⟨w, x⟩ + b uses the chromosome's weights wⱼ and bias b

Methods of selecting parents: Roulette Wheel, Gaussian selection

Genetic operators: Selection, Mutation, Crossover

Page 10:

Methods of selecting the parents

- Roulette Wheel: each individual is represented by a slice of the wheel whose size is proportional to its fitness
- Gaussian selection, with maximum value m = 1 and dispersion σ = 0.4:

  P(cᵢ) = exp(−(1/2)·((fitness(cᵢ) − m) / σ)²)
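Both parent-selection schemes can be sketched as follows; fitness values are assumed to be non-negative (e.g. SVM accuracies), and the function names are ours:

```python
import math
import random

def roulette_wheel_select(population, fitnesses):
    """Roulette Wheel: each individual occupies a slice of the wheel
    proportional to its fitness; spin once and return the winner."""
    total = sum(fitnesses)
    pick = random.uniform(0.0, total)
    acc = 0.0
    for individual, fit in zip(population, fitnesses):
        acc += fit
        if acc >= pick:
            return individual
    return population[-1]

def gaussian_selection_probability(fitness, m=1.0, sigma=0.4):
    """Gaussian selection: P(c) = exp(-(1/2) * ((fitness(c) - m) / sigma)^2),
    with maximum value m = 1 and dispersion sigma = 0.4 as on the slide."""
    return math.exp(-0.5 * ((fitness - m) / sigma) ** 2)
```

Individuals whose fitness is close to m = 1 get selection probability near 1; the probability falls off symmetrically on both sides.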

Page 11:

The process of obtaining the next generation

Starting from the current generation (Selection, Crossover, Mutation, then the new generation):
1. The best chromosome is copied from the old population into the new population.
2. Select two parents.
3. Create two children from the selected parents, using crossover with a parent split.
4. If more chromosomes are needed in the set, randomly eliminate one of the parents and repeat from step 2; otherwise continue.
5. Mutation: randomly change the sign of a random number of elements.
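One generation step can be sketched compactly. The single random split point for crossover and the per-gene sign-flip mutation rate are simplifying assumptions, and the helper names are hypothetical:

```python
import random

def next_generation(population, fitness, pop_size, mutation_rate=0.05):
    """One GA step: elitism, single-point crossover of two sampled
    parents, then sign-flip mutation on everything but the elite copy.
    `fitness` scores a chromosome (a list of numeric weights)."""
    scored = sorted(population, key=fitness, reverse=True)
    new_pop = [list(scored[0])]               # best chromosome is copied over
    while len(new_pop) < pop_size:
        p1, p2 = random.sample(population, 2)  # select two parents
        cut = random.randrange(1, len(p1))     # split point for crossover
        c1 = p1[:cut] + p2[cut:]
        c2 = p2[:cut] + p1[cut:]
        new_pop.extend([c1, c2][:pop_size - len(new_pop)])
    for chrom in new_pop[1:]:                  # mutation skips the elite copy
        for i in range(len(chrom)):
            if random.random() < mutation_rate:
                chrom[i] = -chrom[i]           # randomly change the sign
    return new_pop
```

With `mutation_rate=0.0` the step is pure selection plus crossover, which makes the elitism easy to verify.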

Page 12:

GA_FS versus SVM_FS for 1309 features

[Chart: accuracy (%) vs polynomial kernel degree (D1.0 to D5.0) for GA-BIN, GA-NOM, GA-CS, SVM-BIN, SVM-NOM and SVM-CS (binary, nominal and Cornell SMART representations).]

Page 13:

Training time, polynomial kernel, d = 2, NOM

[Chart: training time (minutes) vs number of features (475, 1309, 2488, 8000) for GA_FS, SVM_FS and IG_FS.]

Page 14:

GA_FS versus SVM_FS for 1309 features

[Chart: accuracy (%) vs Gaussian kernel parameter C (1.0, 1.3, 1.8, 2.1, 2.8, 3.1) for GA-BIN, GA-CS, SVM-BIN and SVM-CS.]

Page 15:

Training time, Gaussian kernel, C = 1.3, BIN

[Chart: training time (minutes) vs number of features (475, 1309, 2488, 8000) for GA_FS, SVM_FS and IG_FS.]

Page 16:

Meta-classifier with SVM

Set of SVMs:
- Polynomial degree 1, Nominal
- Polynomial degree 2, Binary
- Polynomial degree 2, Cornell SMART
- Polynomial degree 3, Cornell SMART
- Gaussian C = 1.3, Binary
- Gaussian C = 1.8, Cornell SMART
- Gaussian C = 2.1, Cornell SMART
- Gaussian C = 2.8, Cornell SMART

Upper limit: 94.21%

Page 17:

Meta-classifier methods

- Non-adaptive method, Majority Vote: each classifier votes a specific class for the current document
- Adaptive methods: compute the similarity between the current sample and the error samples from the classifier's self queue
  - Selection based on Euclidean distance: first good classifier; the best classifier
  - Selection based on cosine: first good classifier; the best classifier; using the average
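The non-adaptive method is a plain majority vote over the classifier set; a minimal sketch:

```python
from collections import Counter

def majority_vote(predictions):
    """Non-adaptive meta-classification: each SVM in the set votes a
    class for the current document; the most voted class wins."""
    votes = Counter(predictions)
    return votes.most_common(1)[0][0]
```

For example, if five of the eight SVMs vote the same topic, that topic is the meta-classifier's answer regardless of the other three votes.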

cos(x, x') = ⟨x, x'⟩ / (‖x‖·‖x'‖) = Σᵢ₌₁ⁿ x[i]·x'[i] / (√(Σᵢ₌₁ⁿ x[i]²) · √(Σᵢ₌₁ⁿ x'[i]²))

Eucl(x, x') = √(Σᵢ₌₁ⁿ (x[i] − x'[i])²)
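The two similarity measures used by the adaptive methods, written out directly from the formulas above:

```python
import math

def euclidean_distance(x, x_prime):
    """Eucl(x, x') = sqrt(sum_i (x[i] - x'[i])^2)"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_prime)))

def cosine_similarity(x, x_prime):
    """cos(x, x') = <x, x'> / (||x|| * ||x'||); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, x_prime))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_xp = math.sqrt(sum(b * b for b in x_prime))
    return dot / (norm_x * norm_xp)
```

Euclidean distance is small for similar samples, while cosine similarity is large (close to 1), so the two selection schemes rank classifiers in opposite directions.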

Page 18:

Selection based on Euclidean distance

[Chart: classification accuracy (%) vs steps 1 to 13 for the Upper Limit, FC-SBED and BC-SBED.]

Page 19:

Selection based on cosine

[Chart: classification accuracy (%) vs steps 1 to 13 for the Upper Limit, FC-SBCOS, BC-SBCOS and BC-SBCOS with average.]

Page 20:

Comparison between SBED and SBCOS

[Chart: classification accuracy (%) vs steps 1 to 13 for Majority Vote, SBED, SBCOS and the Upper Limit.]

Page 21:

Comparison between SBED and SBCOS

[Chart: processing time (minutes) vs steps 1 to 13 for Majority Vote, SBED and SBCOS.]

Page 22:

Initial data set scalability

1. Normalize each sample (7,053 samples).
2. Group the initial set based on distance (4,474 groups).
3. Take the relevant vector of each group (4,474) and use the relevant vectors in the classification process.
4. Select only the support vectors (847).
5. Take the samples grouped in the selected support vectors (4,256).
6. Make the classification with the 4,256 samples.
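Steps 1 and 2 of the reduction can be sketched as below. The greedy, threshold-based grouping and the choice of each group's first member as its "relevant vector" are assumptions made for illustration; the thesis may group the samples differently:

```python
import math

def normalize(sample):
    """Step 1: scale a sample to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in sample))
    return [v / norm for v in sample] if norm > 0 else list(sample)

def group_by_distance(samples, threshold):
    """Step 2: greedy grouping of the normalized set based on distance.
    Each group is represented by its first member, acting as the
    'relevant vector' used later in the classification process."""
    relevant, groups = [], []
    for s in samples:
        for idx, r in enumerate(relevant):
            if math.dist(s, r) <= threshold:
                groups[idx].append(s)  # close enough: join this group
                break
        else:
            relevant.append(s)         # start a new group around s
            groups.append([s])
    return relevant, groups
```

Classification is then run on the relevant vectors only; afterwards, the samples belonging to the groups of the selected support vectors are recovered for the final classification.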

Page 23:

Polynomial kernel – 1309 features, NOM

[Chart: kernel degree influence. Accuracy (%) vs degree of the kernel (D1.0 to D5.0) for SVM-7053 and SVM-4256.]

Page 24:

Gaussian kernel – 1309 features, CS

[Chart: accuracy (%) vs parameter C (1, 1.3, 1.8, 2.1, 2.8) for SVM-7053 and SVM-4256.]

Page 25:

Training time

[Chart: training time (minutes) vs parameter C (C1.0 to C2.8) for 7053-Bin, 7053-CS, 4256-Bin and 4256-CS.]

Page 26:

Choosing training and testing data set

[Chart: 1309 features, polynomial kernel. Accuracy (%) vs kernel degree (D1.0 to D5.0) and the average, comparing the average over the old set with the average over the new set.]

Page 27:

Choosing training and testing data set

[Chart: 1309 features, Gaussian kernel. Accuracy (%) vs parameter C (C1.0 to C2.8) and the average, comparing the average over the old set with the average over the new set.]

Page 28:

Conclusions – other results

- Using our parameter correlation: 3% better for the polynomial kernel, 15% better for the Gaussian kernel
- Reduced number of features to between 2.5% (475 features) and 6% (1309 features) of the total
- GA_FS is faster than SVM_FS
- Best results: polynomial kernel with nominal representation and small degree; Gaussian kernel with Cornell SMART representation
- The Reuters database is linearly separable
- SBED is better and faster than SBCOS
- Classification accuracy decreases by only 1.2% when the data set is reduced

Page 29:

Further work

- Feature extraction and selection: association rules between words (Mutual Information); the synonymy and polysemy problem; using families of words (WordNet)
- Web mining application
- Classifying larger text data sets
- A better method of grouping data
- Using classification and clustering together