Relevant characteristics extraction from semantically unstructured data



PhD title: Data mining in unstructured data

Daniel I. MORARIU, MSc

PhD Supervisor: Lucian N. VINŢAN

Sibiu, 2006

Contents

- Prerequisites
- Correlation of the SVM kernel's parameters
  - Polynomial kernel
  - Gaussian kernel
- Feature selection using Genetic Algorithms
  - Chromosome encoding
  - Genetic operators
- Meta-classifier with SVM
  - Non-adaptive method: Majority Vote
  - Adaptive methods
    - Selection based on Euclidean distance
    - Selection based on cosine
- Initial data set scalability
- Choosing training and testing data sets
- Conclusions and further work

Prerequisites: Reuters Database Processing

- 806,791 total documents; 126 topics, 366 regions, 870 industry codes
- Industry category selection: "system software"
- 7,083 documents (4,722 for training / 2,361 for testing)
- 19,038 attributes (features)
- 24 classes (topics)
- Data representation: Binary, Nominal, Cornell SMART (see the sketch below)
- Classifier: Support Vector Machine techniques with polynomial and Gaussian kernels
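The three representations are term-weighting schemes applied to the raw term frequencies. A minimal sketch of one plausible reading: binary presence, frequency normalized by the document's maximum frequency for nominal, and a doubly logarithmic weight for Cornell SMART. The exact Cornell SMART formula is an assumption here, and the function names are illustrative.

```python
import math

def binary(tf: int) -> float:
    # Binary: 1 if the term occurs in the document, 0 otherwise.
    return 1.0 if tf > 0 else 0.0

def nominal(tf: int, max_tf: int) -> float:
    # Nominal: term frequency normalized by the document's maximum frequency.
    return tf / max_tf if max_tf > 0 else 0.0

def cornell_smart(tf: int) -> float:
    # Cornell SMART: doubly logarithmic damping of the raw frequency
    # (assumed form: 0 for absent terms, else 1 + ln(1 + ln(tf))).
    return 0.0 if tf == 0 else 1.0 + math.log(1.0 + math.log(tf))
```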

Correlation of the SVM kernel's parameters

Polynomial kernel:
$k(\mathbf{x}, \mathbf{x}') = (2d + \mathbf{x} \cdot \mathbf{x}')^d$

Gaussian kernel:
$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{n \cdot C}\right)$

Polynomial kernel

Commonly used kernel: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x} \cdot \mathbf{x}' + b)^d$

- d: degree of the kernel
- b: the offset (bias)

Our suggestion is to correlate the two parameters as b = 2d, which gives:

$k(\mathbf{x}, \mathbf{x}') = (2d + \mathbf{x} \cdot \mathbf{x}')^d$
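Read as code, the suggestion simply fixes the offset from the degree. A minimal NumPy sketch of the slide's formula (the function name is illustrative):

```python
import numpy as np

def polynomial_kernel(x: np.ndarray, y: np.ndarray, d: float) -> float:
    # Polynomial kernel with the suggested correlation b = 2*d,
    # i.e. k(x, x') = (2*d + <x, x'>)**d.
    return (2.0 * d + np.dot(x, y)) ** d
```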

Bias – Polynomial kernel

$k(\mathbf{x}, \mathbf{x}') = (\mathbf{x} \cdot \mathbf{x}' + b)^d$

[Figure: Influence of the bias, nominal representation of the input data. X-axis: values of the bias b (0 to 10, then 50, 100, 500, 1000, 1309); y-axis: accuracy (65–90%); one curve per degree d = 1, 2, 3, 4, plus our choice $k(\mathbf{x}, \mathbf{x}') = (2d + \mathbf{x} \cdot \mathbf{x}')^d$.]

Gaussian kernel parameter's correlation

Commonly used kernel: $k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{C}\right)$

- C: usually chosen relative to the dimensionality of the data set

Our suggestion is to scale C by n, the number of distinct features with values greater than 0:

$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{n \cdot C}\right)$
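The point of the n·C correlation is that sparse text vectors occupy far fewer effective dimensions than the full 19,038-feature space. A minimal sketch; the slide does not specify whether n is computed per pair, per sample, or over the whole set, so the per-pair choice below is an assumption:

```python
import numpy as np

def gaussian_kernel(x: np.ndarray, y: np.ndarray, C: float) -> float:
    # Gaussian kernel with the suggested correlation: C is scaled by n,
    # the number of features with values greater than 0.
    n = np.count_nonzero((x != 0) | (y != 0))  # assumption: n taken per pair
    n = max(n, 1)  # guard against two all-zero vectors
    return float(np.exp(-np.linalg.norm(x - y) ** 2 / (n * C)))
```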

n – Gaussian kernel

$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{n \cdot C}\right)$

[Figure: Influence of n, Cornell SMART data representation. X-axis: values of the parameter n (1, 10, 50, 100, 500, 654, 1000, 1309, auto); y-axis: accuracy (50–90%); one curve per C ∈ {1.0, 1.3, 1.8, 2.1}.]

Feature selection using Genetic Algorithms

Chromosome: $c = (w_0, w_1, \ldots, w_{19038}, b)$, a weight per feature plus the bias b, defining

$f(c) = f((w_1, w_2, \ldots, w_n, b)) = \langle \mathbf{w}, \mathbf{x}_i \rangle + b, \quad i = 1, \ldots, m$

Fitness: $Fitness(c_i) = SVM(c_i)$

- Methods of selecting parents: Roulette Wheel, Gaussian selection
- Genetic operators: Selection, Mutation, Crossover
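Fitness(c_i) = SVM(c_i) means each chromosome is scored by the accuracy an SVM reaches with the features that chromosome retains. A minimal sketch, using scikit-learn's SVC as a stand-in for the author's SVM implementation; the masking convention and all names are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def fitness(chromosome: np.ndarray,
            X_train: np.ndarray, y_train: np.ndarray,
            X_test: np.ndarray, y_test: np.ndarray) -> float:
    # Keep only the features the chromosome marks with non-zero weights;
    # the last gene is the bias b and is not a feature.
    mask = chromosome[:-1] != 0
    # coef0 = 2*d mirrors the b = 2d correlation (note that sklearn also
    # applies a gamma scaling to the dot product).
    clf = SVC(kernel="poly", degree=2, coef0=4.0)
    clf.fit(X_train[:, mask], y_train)
    # Fitness(c_i) = SVM(c_i): classification accuracy with this feature subset.
    return clf.score(X_test[:, mask], y_test)
```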

Methods of selecting the parents

- Roulette Wheel: each individual is assigned a slice of the wheel proportional to its fitness
- Gaussian selection, with maximum value m = 1 and dispersion σ = 0.4:

$P(c_i) = \exp\left(-\frac{1}{2}\left(\frac{fitness(c_i) - m}{\sigma}\right)^2\right)$
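A minimal sketch of both selection rules, normalizing the weights into sampling probabilities (the normalization step is an assumption; the slide only gives the unnormalized forms):

```python
import numpy as np

rng = np.random.default_rng()

def roulette_wheel(population, fitnesses):
    # Each individual gets a wheel slice proportional to its fitness.
    p = np.asarray(fitnesses, dtype=float)
    p = p / p.sum()
    return population[rng.choice(len(population), p=p)]

def gaussian_selection(population, fitnesses, m=1.0, sigma=0.4):
    # P(c_i) = exp(-0.5 * ((fitness(c_i) - m) / sigma)**2),
    # with m = 1 and sigma = 0.4 as on the slide.
    w = np.exp(-0.5 * ((np.asarray(fitnesses, dtype=float) - m) / sigma) ** 2)
    w = w / w.sum()
    return population[rng.choice(len(population), p=w)]
```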

The process of obtaining the next generation

Starting from the current generation:

1. The best chromosome is copied from the old population into the new population.
2. Selection: two parents are selected.
3. Crossover: two children are created from the selected parents by splitting the parents at a crossover point.
4. Need more chromosomes in the set? If yes, randomly eliminate one of the parents and repeat from step 2.
5. Mutation: randomly change the sign of a random number of elements.
6. The result is the new generation.
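A condensed sketch of one generation step, under stated assumptions: single-point crossover, elitism for the best chromosome, and a mutation that flips the sign of up to 10% of the genes. The parent-elimination policy and mutation rate are not specified on the slide, so those details are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def next_generation(population, fitnesses, select):
    # `select` is roulette_wheel or gaussian_selection from the sketch above.
    new_pop = [population[int(np.argmax(fitnesses))]]  # elitism: keep the best
    while len(new_pop) < len(population):
        p1 = select(population, fitnesses)
        p2 = select(population, fitnesses)
        cut = rng.integers(1, len(p1))              # single split point
        c1 = np.concatenate([p1[:cut], p2[cut:]])   # crossover
        c2 = np.concatenate([p2[:cut], p1[cut:]])
        for child in (c1, c2):
            # Mutation: flip the sign of a random number of random elements.
            k = rng.integers(0, len(child) // 10 + 1)
            idx = rng.choice(len(child), k, replace=False)
            child[idx] = -child[idx]
            if len(new_pop) < len(population):
                new_pop.append(child)
    return new_pop
```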

GA_FS versus SVM_FS for 1309 features

[Figure: accuracy (%) vs. polynomial kernel degree (D1.0–D5.0); curves: GA-BIN, GA-NOM, GA-CS, SVM-BIN, SVM-NOM, SVM-CS.]

Training time, polynomial kernel, d = 2, NOM

[Figure: training time in minutes vs. number of features (475, 1309, 2488, 8000); curves: GA_FS, SVM_FS, IG_FS.]

GA_FS versus SVM_FS for 1309 features

[Figure: accuracy (81.5–84%) vs. Gaussian kernel parameter C (1.0, 1.3, 1.8, 2.1, 2.8, 3.1); curves: GA-BIN, GA-CS, SVM-BIN, SVM-CS.]

Training time, Gaussian kernel, C = 1.3, BIN

[Figure: training time in minutes (0–120) vs. number of features (475, 1309, 2488, 8000); curves: GA_FS, SVM_FS, IG_FS.]

Meta-classifier with SVM

Set of SVMs:

- Polynomial, degree 1, Nominal
- Polynomial, degree 2, Binary
- Polynomial, degree 2, Cornell SMART
- Polynomial, degree 3, Cornell SMART
- Gaussian, C = 1.3, Binary
- Gaussian, C = 1.8, Cornell SMART
- Gaussian, C = 2.1, Cornell SMART
- Gaussian, C = 2.8, Cornell SMART

Upper limit: 94.21%

Meta-classifier methods

- Non-adaptive method: Majority Vote. Each classifier votes a specific class for the current document (see the sketch below).
- Adaptive methods: compute the similarity between the current sample and the error samples stored in each classifier's self-queue.
  - Selection based on Euclidean distance: first good classifier, the best classifier
  - Selection based on cosine: first good classifier, the best classifier, using the average
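A minimal sketch of the non-adaptive rule, assuming each of the eight SVMs exposes a scikit-learn-style predict; the function name is illustrative:

```python
from collections import Counter

def majority_vote(classifiers, document):
    # Each classifier votes one class for the current document;
    # the class with the most votes is the meta-classifier's answer.
    votes = [clf.predict([document])[0] for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]
```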

$\cos(\mathbf{x}, \mathbf{x}') = \frac{\langle \mathbf{x}, \mathbf{x}' \rangle}{\|\mathbf{x}\| \cdot \|\mathbf{x}'\|} = \frac{\sum_{i=1}^{n} x[i] \cdot x'[i]}{\sqrt{\sum_{i=1}^{n} x[i]^2} \cdot \sqrt{\sum_{i=1}^{n} x'[i]^2}}$

$Eucl(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{i=1}^{n} (x[i] - x'[i])^2}$
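Both similarity measures translate directly into NumPy; a minimal sketch (function names illustrative):

```python
import numpy as np

def euclidean(x: np.ndarray, y: np.ndarray) -> float:
    # Eucl(x, x') = sqrt(sum_i (x[i] - x'[i])**2)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def cosine(x: np.ndarray, y: np.ndarray) -> float:
    # cos(x, x') = <x, x'> / (||x|| * ||x'||)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```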

Selection based on Euclidean distance

[Figure: classification accuracy (78–96%) over steps 1–13; curves: Upper Limit, FC-SBED, BC-SBED.]

Selection based on cosine

[Figure: classification accuracy (80–96%) over steps 1–13; curves: Upper Limit, FC-SBCOS, BC-SBCOS, BC-SBCOS with average.]

Comparison between SBED and SBCOS

[Figure: classification accuracy (80–96%) over steps 1–13; curves: Majority Vote, SBED, SBCOS, Upper Limit.]

Comparison between SBED and SBCOS

[Figure: processing time in minutes (0–80) over steps 1–13; curves: Majority Vote, SBED, SBCOS.]

Initial data set scalability

1. Normalize each sample (7053 samples).
2. Group the initial set based on distance (4474 groups).
3. Take the relevant vector of each group (4474 vectors).
4. Use the relevant vectors in the classification process.
5. Select only the support vectors (847).
6. Take the samples grouped under the selected support vectors (4256).
7. Make the classification with the 4256 samples.
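A condensed sketch of this reduction pipeline, with the sample counts from the slide (7053 normalized samples, 4474 groups, 847 support vectors, 4256 retained samples) as the intended outcome. The greedy distance-threshold grouping, the per-class grouping, the threshold value, and the choice of each group's first sample as its relevant vector are all assumptions; SVC stands in for the author's SVM.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

def reduce_data_set(X, y, threshold=0.2):
    # Step 1: normalize each sample.
    X = normalize(X)
    # Steps 2-3: greedily group samples by distance (same class only,
    # an assumption); the first member of a group is its relevant vector.
    reps, rep_labels, members = [], [], []
    for i, x in enumerate(X):
        for g, r in enumerate(reps):
            if y[i] == rep_labels[g] and np.linalg.norm(x - r) < threshold:
                members[g].append(i)
                break
        else:
            reps.append(x); rep_labels.append(y[i]); members.append([i])
    # Steps 4-5: classify the relevant vectors; keep only those that
    # become support vectors.
    clf = SVC(kernel="poly", degree=2, coef0=4.0).fit(reps, rep_labels)
    # Steps 6-7: retain every sample grouped under a selected support
    # vector; the final classifier is trained on this reduced set.
    keep = sorted(i for g in clf.support_ for i in members[g])
    return X[keep], np.asarray(y)[keep]
```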

Polynomial kernel – 1309 features, NOM

[Figure: the kernel degree's influence on accuracy (74–88%) for degrees D1.0–D5.0; curves: SVM-7053 (full set) and SVM-4256 (reduced set).]

Gaussian kernel – 1309 features, CS

[Figure: accuracy (70–90%) vs. parameter C (1, 1.3, 1.8, 2.1, 2.8); curves: SVM-7053 and SVM-4256.]

Training time

[Figure: training time in minutes (0–50) vs. parameter C (1.0, 1.3, 1.8, 2.1, 2.8); curves: 7053-Bin, 7053-CS, 4256-Bin, 4256-CS.]

Choosing training and testing data sets

[Figure: 1309 features, polynomial kernel; accuracy (74–88%) vs. kernel degree (D1.0–D5.0) and the average; curves: average over the old set, average over the new set.]

Choosing training and testing data sets

[Figure: 1309 features, Gaussian kernel; accuracy (70–90%) vs. parameter C (C1.0–C2.8) and the average; curves: average over the old set, average over the new set.]

Conclusions – other results

- Using our parameter correlation: 3% better for the polynomial kernel, 15% better for the Gaussian kernel
- The number of features was reduced to between 2.5% (475 features) and 6% (1309 features) of the original set
- GA_FS is faster than SVM_FS
- The polynomial kernel works best with the nominal representation and a small degree; the Gaussian kernel works best with the Cornell SMART representation
- The Reuters database is linearly separable
- SBED is better and faster than SBCOS
- Classification accuracy decreases by only 1.2% when the data set is reduced

Further work

- Feature extraction and selection
  - Association rules between words (Mutual Information)
  - Synonymy and polysemy problems: using families of words (WordNet)
- Web mining application
- Classifying larger text data sets
- A better method of grouping data
- Using classification and clustering together