Application of Text Mining in Developing Standardized...
Transcript of Application of Text Mining in Developing Standardized...
![Page 1: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19a53338a4941743416063/html5/thumbnails/1.jpg)
1
Standardized Descriptions of Taxa in
Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Application of Text Mining in Developing Standardized Descriptions of Taxa in
Paleontology: A Framework
Francisca E. Oboh-Ikuenobe, Geological Science & Engineering, [email protected]
Wen-Bin (Vincent) Yu, Information Science & Technology, [email protected]
Bih-Ru Lea, Business Administration, [email protected]
University of Missouri – Rolla
Rolla, MO 65401, USA
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Problems In PalynomorphClassifications And Interpretations
Description #2: Large proximate, cyst with irregular perforated crests, well developed paratabulation.
Description #1: Proximate, acavate cyst, large (75-95 µm) with irregular perforated parasuturalcrests, well developed paratabulation (including parasulcal paraplates), and precingular archaeopyle.
40 µm
![Page 2: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19a53338a4941743416063/html5/thumbnails/2.jpg)
2
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Text mining or Information Retrieval?Google Search/ online article database– Key word search (existing knowledge)
Web Mining/ Semantic Web – Structured ontology for Information Extraction
Text mining– A special case of data mining – Discover new knowledge by finding interesting and
non-trivial patterns in textural documents – Usually starts with clustering similar documents
and extracting key terms associated with each cluster
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Framework
Sample Database
CHRONOS
Text preprocessing
& Pattern Identification
Module
Clustering Module
Data Sources Text Mining
Model Creation Module
Model Selection Module
Data Mining
Model Evaluation
ModuleDescriptive
Model
Predictive Model
Variable Reduction
ModuleUser Input
…
Standardized Taxon
Recommendation Modeling Module
DinoSys
Data Mining/ Software Agent
![Page 3: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19a53338a4941743416063/html5/thumbnails/3.jpg)
3
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Framework- Text Mining Process
Sample Database
CHRONOS
Text preprocessing
& Pattern Identification
Module
Clustering Module
Data Sources Text Mining
Model Creation Module
Model Selection Module
Data Mining
Model Evaluation
ModuleDescriptive
Model
Predictive Model
Variable Reduction
ModuleUser Input
…
Standardized Taxon
Recommendation Modeling Module
DinoSys
Data Mining/ Software Agent
Text Mining is to find hidden patterns for textual documents by grouping similar descriptions
… portable and mobile video services could be ported to other locations, other devices or other media…. [IPTV, WiMAX get star billing at Supercomm 2005]
Data (Documents) Cleansing Text processing
– Parsing.– Stop words and start words.– Parts of speech– Stemming – Synonyms, Jargons, Abbreviations
Term Frequency MatrixWeighting Scheme
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Text mining process IITerm Frequency Matrix – Represent each document with selected terms– A note on Discriminating power: Terms with
the most discriminating power are those that occur frequently, but only in a few documents.
– Determine Term weights and Frequency weights
0111Doc 3
1100Doc 2
1011Doc 1
Term 4:id
Term 3:voice
Term 2:web
Term 1:process
![Page 4: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19a53338a4941743416063/html5/thumbnails/4.jpg)
4
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Framework- Text Mining Process
Sample Database
CHRONOS
Text preprocessing
& Pattern Identification
Module
Clustering Module
Data Sources Text Mining
Model Creation Module
Model Selection Module
Data Mining
Model Evaluation
ModuleDescriptive
Model
Predictive Model
Variable Reduction
ModuleUser Input
…
Standardized Taxon
Recommendation Modeling Module
DinoSys
Data Mining/ Software Agent
Text Mining is to find hidden patterns for textual documents by grouping similar descriptions
… portable and mobile video services could be ported to other locations, other devices or other media…. [IPTV, WiMAX get star billing at Supercomm 2005]
Data (Documents) Cleansing Text processing
– Parsing.– Stop words and start words.– Parts of speech– Stemming – Synonyms, Jargons, Abbreviations
Term Frequency MatrixWeighting Scheme
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Term Weights: Entropy MethodTerm weight (Gij)=
where – pij = aij / gi– aij: frequency that term i appears in document j. – gi: frequency that term i appears in document
collection.,– n: number of documents in the collection
Weighted Frequency Lij = log2(aij+1)
∑+=j
ijijj n
ppG
)(log)(log
12
2
![Page 5: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19a53338a4941743416063/html5/thumbnails/5.jpg)
5
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Framework- Text Mining Process
Sample Database
CHRONOS
Text preprocessing
& Pattern Identification
Module
Clustering Module
Data Sources Text Mining
Model Creation Module
Model Selection Module
Data Mining
Model Evaluation
ModuleDescriptive
Model
Predictive Model
Variable Reduction
ModuleUser Input
…
Standardized Taxon
Recommendation Modeling Module
DinoSys
Data Mining/ Software Agent
Text Mining is to find hidden patterns for textual documents by grouping similar descriptions
… portable and mobile video services could be ported to other locations, other devices or other media…. [IPTV, WiMAX get star billing at Supercomm 2005]
Data (Documents) Cleansing Text processing
– Parsing.– Stop words and start words.– Parts of speech– Stemming – Synonyms, Jargons, Abbreviations
Term Frequency MatrixWeighting Scheme
Variable Reduction & Transformation
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Text mining process IIIVariable reduction (Transformations)– SVD (Singular Value Decomposition)
• Principal components decomposition.
– Roll Up Term• use the n terms with the largest term weights.
Application in Data analysis – Cluster analysis– Classification– Predictions
![Page 6: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19a53338a4941743416063/html5/thumbnails/6.jpg)
6
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Text Mining Process – Example 1
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Framework- Text Mining Process
Sample Database
CHRONOS
Text preprocessing
& Pattern Identification
Module
Clustering Module
Data Sources Text Mining
Model Creation Module
Model Selection Module
Data Mining
Model Evaluation
ModuleDescriptive
Model
Predictive Model
Variable Reduction
ModuleUser Input
…
Standardized Taxon
Recommendation Modeling Module
DinoSys
Data Mining/ Software Agent
Clustering
Text Mining is to find hidden patterns for textual documents by grouping similar descriptions
… portable and mobile video services could be ported to other locations, other devices or other media…. [IPTV, WiMAX get star billing at Supercomm 2005]
Data (Documents) Cleansing Text processing
– Parsing.– Stop words and start words.– Parts of speech– Stemming – Synonyms, Jargons, Abbreviations
Term Frequency MatrixWeighting Scheme
Variable Reduction & Transformation
![Page 7: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19a53338a4941743416063/html5/thumbnails/7.jpg)
7
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Clusters with SVD 1
40 µm
Four different descriptions for the same dinocyst
* COL1, COL2 & COL3 represent the principle components (SVD variables) of the descriptive terms
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Clusters with SVD 2
![Page 8: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19a53338a4941743416063/html5/thumbnails/8.jpg)
8
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Text Mining Process – Example 2
Ifecysta sp. 2 Ifecysta sp. 1
Achomosphaera sp.Diphyes sp.
Ifecysta sp. 1
Ifecysta sp. 2
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Framework- Data Mining Process
Sample Database
CHRONOS
Text preprocessing
& Pattern Identification
Module
Clustering Module
Data Sources Text Mining
Model Creation Module
Model Selection Module
Data Mining
Model Evaluation
ModuleDescriptive
Model
Predictive Model
Variable Reduction
ModuleUser Input
…
Standardized Taxon
Recommendation Modeling Module
DinoSys
Data Mining/ Software Agent
Input (independent variable): – the SVD variables
output (dependent variable): – clusters
Descriptive models – Regression– Neural network– decision tree
Model EvaluationModel Selection
![Page 9: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19a53338a4941743416063/html5/thumbnails/9.jpg)
9
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Modelling: ANN
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Modelling: Regression
![Page 10: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19a53338a4941743416063/html5/thumbnails/10.jpg)
10
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Modelling: Decision Tree
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Dinocyst Description Modeling
![Page 11: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19a53338a4941743416063/html5/thumbnails/11.jpg)
11
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Model Evaluation
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Framework – Descriptive Model
Sample Database
CHRONOS
Text preprocessing
& Pattern Identification
Module
Clustering Module
Data Sources Text Mining
Model Creation Module
Model Selection Module
Data Mining
Model Evaluation
ModuleDescriptive
Model
Predictive Model
Variable Reduction
ModuleUser Input
…
Standardized Taxon
Recommendation Modeling Module
DinoSys
Data Mining/ Software Agent
![Page 12: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19a53338a4941743416063/html5/thumbnails/12.jpg)
12
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Scoring Dinocyst DescriptionAssume regression model is selected in the model selection process
Classification is proposed
A description is entered, analyzed by text mining process, and feed into the selected model (i.e., regression in this example)
Standardized Taxon
Recommendation Modeling Module
Standardized description is proposed
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
User/Human Interaction
Sample Database
CHRONOS
Text preprocessing
& Pattern Identification
Module
Clustering Module
Data Sources Text Mining
Model Creation Module
Model Selection Module
Data Mining
Model Evaluation
ModuleDescriptive
Model
Predictive Model
Variable Reduction
ModuleUser Input
…
Standardized Taxon
Recommendation Modeling Module
DinoSys
Data Mining/ Software Agent
the number of clustersthe validity of key termspossible start/stop words list…
![Page 13: Application of Text Mining in Developing Standardized ...web.mst.edu/~leabi/Lea_goinformatic06_show.pdf · Francisca Oboh-Ikuenobe Bih-Ru Lea The End. Title: Microsoft PowerPoint](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19a53338a4941743416063/html5/thumbnails/13.jpg)
13
By
Bih-Ru Lea
Standardized Descriptions of
Taxa in Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
Adoptive Model
Sample Database
CHRONOS
Text preprocessing
& Pattern Identification
Module
Clustering Module
Data Sources Text Mining
Model Creation Module
Model Selection Module
Data Mining
Model Evaluation
ModuleDescriptive
Model
Predictive Model
Variable Reduction
ModuleUser Input
…
Standardized Taxon
Recommendation Modeling Module
DinoSys
Data Mining/ Software Agent
Dinocyst descriptions collected and mined periodically to update model
Standardized Descriptions of Taxa in
Paleontology
ByVincent YuFrancisca
Oboh-Ikuenobe
Bih-Ru Lea
The End