Application of Text Mining in Developing Standardized...

Application of Text Mining in Developing Standardized Descriptions of Taxa in Paleontology: A Framework

Francisca E. Oboh-Ikuenobe, Geological Science & Engineering, [email protected]
Wen-Bin (Vincent) Yu, Information Science & Technology, [email protected]
Bih-Ru Lea, Business Administration, [email protected]

University of Missouri – Rolla
Rolla, MO 65401, USA

Transcript of Application of Text Mining in Developing Standardized...

Page 1: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint

Problems in Palynomorph Classifications and Interpretations

Description #1: Proximate, acavate cyst, large (75-95 µm) with irregular perforated parasutural crests, well developed paratabulation (including parasulcal paraplates), and precingular archaeopyle.

Description #2: Large proximate, cyst with irregular perforated crests, well developed paratabulation.

[Image scale bar: 40 µm]

Text mining or Information Retrieval?

Google Search / online article database
– Keyword search (existing knowledge)

Web Mining / Semantic Web
– Structured ontology for information extraction

Text mining
– A special case of data mining
– Discovers new knowledge by finding interesting and non-trivial patterns in textual documents
– Usually starts with clustering similar documents and extracting key terms associated with each cluster

Framework

[Framework diagram. Section labels: Data Sources, Text Mining, Data Mining. Components: Sample Database; CHRONOS; DinoSys; Text Preprocessing & Pattern Identification Module; Variable Reduction Module; Clustering Module; Model Creation Module; Model Evaluation Module; Model Selection Module; Descriptive Model; Predictive Model; User Input; Standardized Taxon Recommendation Modeling Module; Data Mining/Software Agent]

Framework – Text Mining Process

Text mining finds hidden patterns in textual documents by grouping similar descriptions.

Example document: "… portable and mobile video services could be ported to other locations, other devices or other media…." [IPTV, WiMAX get star billing at Supercomm 2005]

Data (documents) cleansing

Text processing
– Parsing
– Stop words and start words
– Parts of speech
– Stemming
– Synonyms, jargon, abbreviations

Term frequency matrix

Weighting scheme
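
The text processing steps listed on this slide can be illustrated with a small Python sketch. It is only a rough stand-in for whatever tool the authors used: the stop list, the synonym map, and the plural-stripping "stemmer" are all hypothetical, and part-of-speech tagging is left out because it needs a tagger beyond the standard library. The sample text is Description #2 from the opening slide.

```python
import re

# Hypothetical stop list and synonym map; a real project would curate
# these with domain experts.
STOP_WORDS = {"a", "an", "and", "the", "with", "of", "including"}
SYNONYMS = {"big": "large"}


def crude_stem(token):
    """Strip a plural 's' from longer tokens (a stand-in for a real stemmer)."""
    if len(token) > 4 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text, start_words=None):
    """Parse text into normalized terms.

    If start_words is given, only terms in that set are kept.
    """
    tokens = re.findall(r"[a-z]+", text.lower())          # parsing
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop words
    terms = [crude_stem(t) for t in tokens]               # stemming
    terms = [SYNONYMS.get(t, t) for t in terms]           # synonyms
    if start_words is not None:
        terms = [t for t in terms if t in start_words]    # start words
    return terms


description = ("Large proximate, cyst with irregular perforated crests, "
               "well developed paratabulation.")
terms = preprocess(description)
```

A start list works the opposite way from a stop list: passing `start_words={"cyst", "crest"}` keeps only those two terms and discards everything else.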

Text mining process II

Term Frequency Matrix
– Represent each document with selected terms
– A note on discriminating power: terms with the most discriminating power are those that occur frequently, but only in a few documents
– Determine term weights and frequency weights

         Term 1: process   Term 2: web   Term 3: voice   Term 4: id
Doc 1    1                 1             0               1
Doc 2    0                 0             1               1
Doc 3    1                 1             1               0
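
As a rough illustration of how a term frequency matrix might be built, the sketch below counts selected terms per document, with one row per document and one column per term. The four term names come from the slide; the token lists are hypothetical, already-preprocessed documents, and the function names are invented for this example. The document-frequency helper relates to the note on discriminating power: a term found in few documents separates them better than one found everywhere.

```python
from collections import Counter


def term_frequency_matrix(docs, terms):
    """Return one row per document and one column per selected term."""
    counts = [Counter(tokens) for tokens in docs]
    return [[c[t] for t in terms] for c in counts]


def document_frequency(matrix):
    """Count, for each term column, the documents that contain the term."""
    return [sum(1 for row in matrix if row[k] > 0)
            for k in range(len(matrix[0]))]


terms = ["process", "web", "voice", "id"]
# Hypothetical, already-preprocessed documents.
docs = [
    ["process", "web", "id"],
    ["voice", "id"],
    ["process", "web", "voice"],
]
matrix = term_frequency_matrix(docs, terms)
doc_freq = document_frequency(matrix)
```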

Term Weights: Entropy Method

Term weight for term i:

\[ G_i = 1 + \sum_{j} \frac{p_{ij} \log_2(p_{ij})}{\log_2(n)} \]

where
– p_{ij} = a_{ij} / g_i
– a_{ij}: frequency with which term i appears in document j
– g_i: frequency with which term i appears in the document collection
– n: number of documents in the collection

Weighted frequency:

\[ L_{ij} = \log_2(a_{ij} + 1) \]
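
A minimal Python sketch of the entropy term weight and the log-weighted frequency follows, using the slide's symbols a_ij, g_i, and n. Two points are assumptions rather than anything stated on the slide: a zero count contributes nothing to the sum (the usual 0 · log 0 = 0 convention), and the collection holds more than one document so that log2(n) is nonzero. Note that the matrix here has one row per term, so `freq[i][j]` is a_ij.

```python
import math


def entropy_weights(freq):
    """Return the entropy weight of each term.

    freq[i][j] is a_ij, the count of term i in document j.
    Requires more than one document so that log2(n) is nonzero.
    """
    n = len(freq[0])
    weights = []
    for row in freq:
        g_i = sum(row)                      # count of term i in the collection
        total = 0.0
        for a_ij in row:
            if a_ij > 0:                    # treat 0 * log(0) as 0
                p_ij = a_ij / g_i
                total += p_ij * math.log2(p_ij)
        weights.append(1.0 + total / math.log2(n))
    return weights


def log_frequency(freq):
    """Return the weighted frequencies log2(a_ij + 1)."""
    return [[math.log2(a_ij + 1) for a_ij in row] for row in freq]


# Hypothetical counts: term 0 sits in one document, term 1 is spread evenly.
freq = [[3, 0, 0, 0],
        [1, 1, 1, 1]]
weights = entropy_weights(freq)
```

A term concentrated in a single document gets weight 1, while a term spread evenly over all documents gets weight 0, which matches the earlier note that the most discriminating terms occur in only a few documents.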

Text mining process III

Variable reduction (transformations)
– SVD (Singular Value Decomposition)
  • Principal components decomposition
– Roll-up terms
  • Use the n terms with the largest term weights

Application in data analysis
– Cluster analysis
– Classification
– Predictions
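
Both reduction options can be sketched in plain Python. The roll-up helper simply keeps the highest-weighted terms. For the SVD, the sketch swaps in power iteration with deflation on the term-by-term product A·Aᵀ as a stand-in for a library SVD routine; it is only suitable for tiny matrices, and all function names are hypothetical. The matrix is assumed to have one row per term and one column per document.

```python
import math


def roll_up_terms(weights, top_n):
    """Keep the top_n terms with the largest term weights."""
    return sorted(weights, key=weights.get, reverse=True)[:top_n]


def left_singular_vectors(a, k, iters=200):
    """Approximate the top-k (singular value, left singular vector) pairs.

    a is a term-by-document matrix (rows = terms). Uses power iteration
    with deflation on a * a^T; fine for tiny examples only.
    """
    m = len(a)
    c = [[sum(x * y for x, y in zip(r1, r2)) for r2 in a] for r1 in a]
    pairs = []
    for _ in range(k):
        u = [1.0 / math.sqrt(m)] * m
        for _ in range(iters):
            w = [sum(cij * uj for cij, uj in zip(row, u)) for row in c]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            u = [x / norm for x in w]
        cu = [sum(cij * uj for cij, uj in zip(row, u)) for row in c]
        lam = sum(ui * x for ui, x in zip(u, cu))      # eigenvalue estimate
        pairs.append((math.sqrt(max(lam, 0.0)), u))
        # Deflate: remove the component just found before the next pass.
        c = [[c[i][j] - lam * u[i] * u[j] for j in range(m)]
             for i in range(m)]
    return pairs


def project_documents(a, pairs):
    """Coordinates of each document along the retained directions."""
    n_docs = len(a[0])
    return [[sum(u[i] * a[i][j] for i in range(len(a))) for _, u in pairs]
            for j in range(n_docs)]


# Hypothetical 2-term, 2-document matrix.
a = [[3.0, 0.0],
     [0.0, 1.0]]
pairs = left_singular_vectors(a, 2)
coords = project_documents(a, pairs[:1])
```

The projected coordinates play the role of the SVD variables used as inputs in the later clustering and modeling slides.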

Text Mining Process – Example 1

Clusters with SVD 1

Four different descriptions for the same dinocyst (image scale bar: 40 µm)

* COL1, COL2 & COL3 represent the principal components (SVD variables) of the descriptive terms
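
To show how descriptions might be grouped once each one is reduced to three SVD variables, here is a small k-means sketch. The slides do not name the clustering algorithm, so k-means is an assumed, generic choice; seeding the centroids with the first k points is a simplification, and the coordinates are made up.

```python
import math


def kmeans(points, k, iters=50):
    """Cluster points with plain k-means, seeding with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical (COL1, COL2, COL3) values for four descriptions.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0),
          (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
labels, centroids = kmeans(points, 2)
```

Descriptions that land in the same cluster are candidates for being different wordings of the same taxon.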

Clusters with SVD 2

Text Mining Process – Example 2

[Figure: taxa labeled Ifecysta sp. 1, Ifecysta sp. 2, Achomosphaera sp., and Diphyes sp.]

Framework – Data Mining Process

Input (independent variable):
– the SVD variables

Output (dependent variable):
– clusters

Descriptive models
– Regression
– Neural network
– Decision tree

Model evaluation

Model selection
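
The modeling step can be sketched with the simplest possible decision tree, a one-split "stump" that predicts the cluster from the SVD variables. This illustrates the input/output setup only; it is not the authors' model, and the data and function names are hypothetical. The accuracy helper hints at how candidate models could be compared during evaluation and selection.

```python
def fit_stump(rows, labels):
    """Find the single (feature, threshold) split with the fewest errors."""
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [lab for r, lab in zip(rows, labels) if r[f] <= thr]
            right = [lab for r, lab in zip(rows, labels) if r[f] > thr]
            left_label = max(set(left), key=left.count)     # majority vote
            right_label = max(set(right), key=right.count)
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, thr, left_label, right_label)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]


def predict(stump, row):
    """Return the cluster predicted for one row of SVD variables."""
    f, thr, left_label, right_label = stump
    return left_label if row[f] <= thr else right_label


def accuracy(stump, rows, labels):
    """Fraction of rows whose predicted cluster matches the label."""
    hits = sum(predict(stump, r) == lab for r, lab in zip(rows, labels))
    return hits / len(rows)


# Hypothetical SVD variables (inputs) and cluster ids (outputs).
rows = [(0.05, 0.0, 0.0), (0.1, 0.2, 0.0),
        (5.0, 5.0, 5.0), (5.1, 4.8, 5.0)]
clusters = [0, 0, 1, 1]
stump = fit_stump(rows, clusters)
```

In practice accuracy would be measured on descriptions held out from training, and the same comparison would be repeated for the regression and neural network candidates.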

Modelling: ANN

Modelling: Regression

Modelling: Decision Tree

Dinocyst Description Modeling

Model Evaluation

Framework – Descriptive Model

Scoring Dinocyst Description

Assume the regression model is chosen in the model selection process.

A description is entered, analyzed by the text mining process, and fed into the selected model (i.e., regression in this example).

A classification is proposed.

Standardized Taxon Recommendation Modeling Module

A standardized description is proposed.

User/Human Interaction

– the number of clusters
– the validity of key terms
– possible start/stop words list
– …

Adaptive Model

Dinocyst descriptions are collected and mined periodically to update the model.

The End