Post on 19-Dec-2015
Copyright 2003-4, SPSS Inc.
An Introduction to Text Mining
Tim Daciuk
SPSS, Inc.
Services Manager, Canada
Agenda
Introductions
An Overview of Document Warehousing
Understanding Unstructured Text
Concept Extraction
Text Mining
Data Mining
Demonstration
Tim Daciuk
Background
  Social research
  Survey research
SPSS
  25 years working with the product
  12 years working with the company
  5 years working with text analysis
Prior history
  Consulting
  Education
Predictive Analytics: Defined
Predictive analysis helps connect data to effective action by drawing reliable conclusions about current conditions and future events.
- Gareth Herschel, Research Director, Gartner Group
SPSS At A Glance
Leadership
  Market leader in Predictive Analytics
  Focus on online & offline customer data acquisition and analysis
Stability
  Founded in 1968
  30+ year heritage in analytic technologies
Proven track record
  250,000+ customers worldwide
  NASDAQ: SPSS
Analytics standard
  80% of Fortune 500 are SPSS customers
  80% plus market share in Survey & Market Research sector
  Ranked #1 Data Mining solution by KD Nuggets
Unstructured Data Management
Text Mining is a subset of Unstructured Data Management (UDM).
UDM can be broken down into:
  Content and Document Management
  Search and Retrieval
  XML database and tools
  Categorization, Classification, and Visualization
80% of Data is Unstructured
Database notes:
  Call center transcripts
  Other CRM
Open-ended survey responses
Web pages
NewsGroups
Documents themselves
Competitive information
Applications for Text Analysis
Surveys
‘Reading’ email
Call centre data
Comment data
Abstracts
Document management
Corporate history
Thematic understanding of website
Data Warehouse vs. Document Warehouse
Data warehouse
  Who, what, when, where, how much
  Internally focused
  Operational information
  Rarely includes external information
Document warehouse
  Why
  May not be internally focused
  May contain a range of information
  Often integrates external information
Document Warehouse Features
There is no single document structure or document type
Documents are drawn from multiple sources
Essential features of documents are automatically extracted and explicitly stored in the document warehouse
Document warehouses are designed to integrate semantically related documents
Building the Document Warehouse
Identify Sources
Retrieve Document
Text Analysis
Pre-process Document
Compile Metadata
Predict, Impact, Deploy
[Diagram. Labels, in extraction order: Customer; Data; Attitudes; Actions; Attributes; Business User; Grow; Retain; Fraud; Outcomes; Attract; Data Collection; Text; Surveys; Web Channel; Operational Systems; Text; Business UI; Expert UI; Concepts; Concept Maps; Clustering; Categorization; Trending; Information Extraction; Prediction; NLP]
The Building Blocks of Language
Morphology
Syntax
Semantics
Phonology
Pragmatics
Morphology
Understanding words
  Stems
  Affixes
    Prefix
    Suffix
  Inflectional elements
Reduces complexity of analysis
Reduces complexity of representation
Supports text mining
[Diagram: Noun = Prefix (in-) + Noun Stem (dispute) + Suffix (-able)]
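The prefix / stem / suffix split on this slide can be sketched in a few lines. This is only an illustration of the idea, not the extractor's actual method: the affix lists and the `split_affixes` helper are hypothetical, and a real morphological analyzer would rely on much larger lexicons and spelling rules.

```python
# Hypothetical affix lists, for illustration only.
PREFIXES = ("in", "un", "re")
SUFFIXES = ("able", "ing", "s")

def split_affixes(word):
    """Split a word into (prefix, stem, suffix); a missing part is ''."""
    word = word.lower()
    # Take the first listed prefix that leaves at least three letters behind.
    prefix = next((p for p in PREFIXES
                   if word.startswith(p) and len(word) > len(p) + 2), "")
    rest = word[len(prefix):]
    # Same rule for suffixes, applied to what remains.
    suffix = next((s for s in SUFFIXES
                   if rest.endswith(s) and len(rest) > len(s) + 2), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

print(split_affixes("indisputable"))  # ('in', 'disput', 'able')
```

Note that the surface stem comes out as "disput"; mapping it back to the dictionary form "dispute" needs a spelling rule that this sketch leaves out.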
Syntax
The Bank of Canada will curb inflation with higher interest rates
[Parse tree of the sentence. Node labels: Sentence, Noun phrase, Verb phrase, Prepositional phrase, Noun, Verb, Aux, Adjective. Leaves: The Bank of Canada | will | curb | inflation | with | higher | interest rates]
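A parse tree like the one on this slide can be held as nested tuples. The exact shape of the slide's tree did not survive extraction, so the structure below is an assumed reconstruction (in particular, attaching the prepositional phrase to the verb phrase is a guess); only the node labels and the words come from the slide.

```python
# Assumed tree shape: (label, child, child, ...); leaves are plain strings.
tree = (
    "Sentence",
    ("Noun phrase", ("Noun", "The Bank of Canada")),
    ("Verb phrase",
     ("Aux", "will"),
     ("Verb", "curb"),
     ("Noun", "inflation"),
     ("Prepositional phrase",
      "with",
      ("Noun phrase", ("Adjective", "higher"), ("Noun", "interest rates")))),
)

def leaves(node):
    """Return the words of the tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the label
        words.extend(leaves(child))
    return words

print(" ".join(leaves(tree)))
# The Bank of Canada will curb inflation with higher interest rates
```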
Semantics
The meaning of it all
Approaches to meaning
  Semantic networks
  Deductive logic
  Rule-based systems
Useful for classification
Problems with NLP
Limitations of Natural Language Processing
  Correctly identifying the role of noun phrases
  Representing abstract concepts
  Classifying synonyms
  Representing the number of concepts
Problems with NLP
Limitations of technology
  Language-specific designs are required
  Classification speed
  Classifying hybrid words and sentences
Underlying Technology is Based on Linguistics
The Linguistic Approach:
Does not treat a document as a bag of words
Removes ambiguity by extracting structured concepts
Concepts are the DNA of text.
Text is unstructured, ambiguous, and language dependent.
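A small sketch of why a bag of words loses information. It is not the linguistic approach itself: adjacent word pairs are used here only as a simple stand-in for structure, and both sentences are made up for the illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Word counts only; word order is discarded."""
    return Counter(text.lower().split())

def word_pairs(text):
    """Adjacent word pairs, which preserve local word order."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

a = "the bank will curb inflation"
b = "inflation will curb the bank"

print(bag_of_words(a) == bag_of_words(b))  # True: the bags are identical
print(word_pairs(a) == word_pairs(b))      # False: the structure differs
```

The two sentences mean different things, yet a pure word count cannot tell them apart.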
From Text to Concepts
Morphology
Syntax
Semantics
Statistics
Linguistic Terminology Extractor
Scalable
Accurate
Customizable
Discovery-Oriented
•Compound words
•Proper nouns
•Figures
•Named entities
•Domain specifics
•Speed
•Multiple formats
•Multiple languages
•SPSS dictionaries
•User dictionaries
•Extraction rules
•Extraction patterns
•Known terms
•Unknown terms
•New terms
•1GB/hour
•PDF, MS Office, text…
•English, French, German, Spanish, Italian, Dutch, Japanese
•Inserm; merck & co…
•tnp-470; glut-4…
•factor receptor; Inhibitory effect
•D. John Paganoni, ..
•Positive/Negative opinion…
•London, Paris…
•Names, Orgs…
•MeSH, genes...
•Predicates
•Synonyms, stop words..
•Trends
From Concepts to Predictive Analytics Components
Linguistic Terminology Extractor
LexiQuest Mine
  Discover concepts, relationships and trends
LexiQuest Categorize
  Understand documents and assign them to pre-defined categories
Text Mining for Clementine
  Add text fields to data mining for better prediction
Concept Extraction Engine
The extractor turns unstructured text into concepts:
LexiQuest Extractor Engine
Linguistic Processor
Visualization
Probabilities
LexiQuest Mine
Clementine
LexiQuest Categorize
Part-of-Speech Tagging
a: adjective b: adverb c: preposition
d: determiner n: noun v: verb
o: coordination p: participle s: stop word
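The tag key above maps naturally onto a lookup table. The sketch below only decodes a string of tag letters into their names; the `decode_tags` helper and the sample tag string are illustrative and not part of any SPSS product.

```python
# Tag letters and names as listed on the slide.
TAG_KEY = {
    "a": "adjective", "b": "adverb", "c": "preposition",
    "d": "determiner", "n": "noun", "v": "verb",
    "o": "coordination", "p": "participle", "s": "stop word",
}

def decode_tags(tag_string):
    """Turn a space-separated string of tag letters into tag names."""
    return [TAG_KEY[tag.lower()] for tag in tag_string.split()]

print(decode_tags("d a n"))  # ['determiner', 'adjective', 'noun']
```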
How is a Concept Extracted?
Step 1: Part-of-Speech Tagging
Using a tool like LexiQuest Mine is a great
V P N A N N V P A
idea for any organization that is interested in maintaining
N P A N P V V P V
information on competitive intelligence.
N P N N
How is a Concept Extracted?
Step 2: Matching to Known Patterns
This:
V P N A N N V P A N P A N P V V P V N P N N
Looks Most Like:
N C D N N
(32 Known patterns for English)
How is the Concept Extracted?
The extractor looks at this sentence:
Using a tool like LexiQuest Mine is a great idea for any organization that is interested in maintaining information on competitive intelligence.
And extracts the concept: Competitive Intelligence
Concepts are:
  Noun based
  Can be longer than one word
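The two steps can be sketched roughly as follows, using the sentence and the tag sequence from the slides. The rule used here (keep runs of two or more consecutive N tags) is a simplification chosen for the illustration; the actual known patterns are not listed in the slides, and `noun_runs` is a hypothetical helper.

```python
# Sentence and tags as shown on the Step 1 slide.
words = ("Using a tool like LexiQuest Mine is a great idea for any "
         "organization that is interested in maintaining information "
         "on competitive intelligence").split()
tags = "V P N A N N V P A N P A N P V V P V N P N N".split()

def noun_runs(words, tags, min_len=2):
    """Collect runs of consecutive N-tagged words of at least min_len words."""
    concepts, run = [], []
    for word, tag in zip(words, tags):
        if tag == "N":
            run.append(word)
            continue
        if len(run) >= min_len:
            concepts.append(" ".join(run))
        run = []
    if len(run) >= min_len:  # a run may end at the last word
        concepts.append(" ".join(run))
    return concepts

print(noun_runs(words, tags))
# ['LexiQuest Mine', 'competitive intelligence']
```

This simplified rule also picks up the product name "LexiQuest Mine", whereas the slide reports only "Competitive Intelligence", which suggests the real patterns are more selective than this sketch.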
Example: Categorization
The Issue of Language
NLP requires separate language understanding
Clementine text mining
  French
  English
  English/French
  German
  Spanish
  Dutch
  Japanese
  Italian
  MeSH (Medical Subject Headings)
http://www.nlm.nih.gov/mesh/meshhome.html
Data Mining Defined
“The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.”
- The Gartner Group
Why data mining?
Data mining software generally employs modeling algorithms designed to handle non-linearities and unusual patterns in data
  As opposed to classical linear models (e.g., linear regression) that aren’t as capable
A related issue is ‘noise’ in the data: where, for example, two seemingly similar sets of inputs yield different outputs
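A tiny illustration of the non-linearity point, assuming made-up data where the output is exactly the square of the input. An ordinary least-squares line (written out by hand here so no library is needed) finds no relationship at all, although the pattern is perfectly regular.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Illustrative data: a symmetric, strictly non-linear relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 0.0 4.0
```

The fitted line is flat at the mean of y, so it predicts 4 for every input; a model that can represent curvature would be needed to capture this pattern.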
A Data Mining Methodology
Use the Cross-Industry Standard Process for Data Mining (CRISP-DM)
Based on real-world lessons:
  Focus on business issues
  User-centric & interactive
  Full process
  Results are used
Data Mining is not…
Keep in mind that data mining is not…
  “Blind” application of analysis/modeling algorithms
  Brute-force crunching of bulk data
  Black box technology
  Magic
Back to the Process
Text Mining
Understanding
Business Understanding
  Determine objective
  Assess situation
  Determine data mining goals
  Produce project plan
Data Understanding
  Collect initial data
  Describe data
  Explore data
  Verify data quality
Data Preparation
Data
  Data set
  Data set description
  Select data
  Clean data
  Construct data set / Integrate data
  Format data
Text
  Concept extraction
  Concept combination
  Concept assessment
Modeling
Select modeling technique
  Universe of techniques
  Appropriate techniques
    Data
    Text
  Requirements
  Constraints
  Selected tools
Generate test design
Run model(s)
Assess model(s)
Evaluation
Results = Models + Findings
Evaluate results
Review process
Determine next steps
Deployment
Plan deployment
Plan monitoring and maintenance
Final report
Project review
Data Mining Approaches
Unsupervised methods:
  Group patients by drugs and demographic information and try to find unusual patients
Supervised methods:
  Attempt to predict amount due and find sets of cases where the amount due is very different from the predicted amount
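The supervised approach can be sketched as a comparison of predicted and actual amounts. The case records, the 50% tolerance, and the `flag_unusual` helper are all invented for the illustration; in practice the predicted amounts would come from a fitted model, which is not shown here.

```python
# Hypothetical cases: (case id, predicted amount due, actual amount due).
cases = [
    ("c1", 100.0, 104.0),
    ("c2", 250.0, 245.0),
    ("c3", 80.0, 310.0),
    ("c4", 120.0, 118.0),
]

def flag_unusual(cases, tolerance=0.5):
    """Return ids whose actual amount is off by more than tolerance * predicted."""
    return [case_id for case_id, predicted, actual in cases
            if abs(actual - predicted) > tolerance * predicted]

print(flag_unusual(cases))  # ['c3']
```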
What Does Data Mining Do?
Data mining uses existing data to:
Predict
  Category membership
  Numeric value
  e.g. credit risk
Group
  Cluster (group) things together based on their characteristics
  e.g. different types of TV viewers
Associate
  Find events that occur together, or in a sequence
  e.g. beer and diapers
Find outliers
  Identify cases that don’t follow expected behavior
  e.g. fraudulent behaviour
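The "Associate" task is often summarized with two numbers, support and confidence. The sketch below computes both for the beer and diapers example on a handful of invented shopping baskets; the data and helper names are assumptions made for the illustration.

```python
# Hypothetical shopping baskets.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the share that also hold the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

print(support({"beer", "diapers"}, baskets))       # 2 of 5 baskets
print(confidence({"beer"}, {"diapers"}, baskets))  # 2 of the 3 beer baskets
```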
Benefits of Document Warehousing
Richer operational business intelligence
Knowing your customers
Macroenvironmental monitoring
Technology assessment
Conclusions
Text mining is
  More than word counts
  Linguistically based
  Concept extraction
Data mining is
  Advanced analytics applied to datasets
  A family of techniques
  Supervised or unsupervised
Conclusions
Text and data mining
  Add dimensionality to the data
  Allow for automation of the text analysis event
  Create 360 degree view
Applications
  Websites
  Surveys
  Email
  Call centre
  Documentation
So How Do I Get Started?
Document Warehousing and Text Mining
  Dan Sullivan, Wiley, 2001
Survey of Text Mining: Clustering, Classification and Retrieval
  Michael W. Berry (ed.), Springer, 2003
Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization
  P. Jackson and I. Moulinier, John Benjamins, 2002