Copyright © 2001, SAS Institute Inc. All rights reserved.
The SAS Text Mining Solution
[email protected]
Senior Data Mining Consultant
SAS International
[email protected]
Manager, Enterprise Miner
SAS World Wide Marketing
SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
Overview
This presentation provides an overview of the SAS text mining solution, including example applications.
Text Mining Timelines
• Experimental
  • Enterprise Miner 4.1, SAS 8.2, currently available
  • Early Adopter Text Mining Program
• Production
  • Add-on to Enterprise Miner 4.2 and 5.0, SAS 9.0, 1st quarter of 2002.
  • Separate API to potentially include text mining into other SAS products and solutions.
Early Adopter Program
Please note:
Enterprise Miner for Text is experimental and has a separate setinit. It will be made available on a restricted basis under an Early Adopter program. Please turn to your local SAS office for details.
Why Text Mining?
What is Text Mining?
• Discovering and using knowledge that exists in the document collection as a whole
• Uncovering patterns within the collection
• Establishing connections between documents and the terms in the collection as a whole
• Combining free-form text and quantitative variables to derive information
Knowledge
Text Mining is not
• Natural Language Processing/Understanding
• Information Retrieval
• Knowledge extraction from a single document
Text Mining Applications
• Automatic Classification of Documents
  • Email filtering
  • Organize documents by topic into categories
  • News item routing
  • Generate classification codes from purchase descriptions
• Clustering
  • Survey data analysis
  • Customer complaints/comments
Text Mining Applications (cont.)
• Other forms of Prediction
  • Stock market prices from business news announcements
  • Customer satisfaction from customer comments
  • Cost based on call center log
How Does It Work?
• Document pre-processing
  • Data cleansing
  • Data reduction
• Document analysis
  • classification
  • clustering
  • prediction
  • etc.
Leveraging Enterprise Miner
Data Input
• As a variable contained in a SAS data set.
• As documents residing on the file system.
Text Parsing (cont.)
• Stop lists
  • Filter out low information words
    • articles (e.g. the, a, this)
    • prepositions (e.g. of, from, by)
    • conjunctions (e.g. and, but, or)
    • etc.
  • Consider document subject matter
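The stop-list step can be illustrated with a minimal Python sketch. It is not the Enterprise Miner implementation: the function name `parse_terms`, the tokenizing pattern, and the short `STOP_LIST` (built from the examples on this slide) are all assumptions made for illustration.

```python
import re

# Hypothetical stop list assembled from the slide's examples; a real list
# would be much longer and tuned to the document collection.
STOP_LIST = {"the", "a", "this", "of", "from", "by", "and", "but", "or"}

def parse_terms(text, stop_list=STOP_LIST):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in stop_list]

terms = parse_terms("The printer from this office jams and smears the ink")
print(terms)
```

Considering the subject matter means the list is collection-specific: in a set of printer support tickets, for example, "printer" appears almost everywhere and could reasonably be added to the stop list as well.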
Text Parsing
• Create Term-Document Frequency Matrix
  • Record number of times term i appears in document j
  • Uses compressed representation in order to handle large collections
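As a rough illustration of a compressed term-document frequency matrix, the Python sketch below stores only the nonzero cells in a dictionary keyed by (term, document index). The slides do not say which compressed format the product uses, so this layout and the name `term_document_counts` are assumptions.

```python
from collections import Counter

def term_document_counts(documents):
    """Return {(term, doc_index): count}, storing only nonzero entries."""
    counts = Counter()
    for j, terms in enumerate(documents):
        for term in terms:
            counts[(term, j)] += 1
    return counts

# Each document is a list of terms that have already been parsed.
docs = [
    ["printer", "jams", "printer"],
    ["ink", "smears"],
    ["printer", "ink"],
]
matrix = term_document_counts(docs)
print(matrix[("printer", 0)])  # count of "printer" in document 0
print(len(matrix))             # number of stored (nonzero) cells
```

With 4 distinct terms and 3 documents the full matrix would have 12 cells, but only the nonzero ones are stored. The saving grows with collection size, because most terms are absent from most documents.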
Term Frequencies
Which terms should be stopped?
Weighting Words
• Weight words differently (this is actually done in the SVD node)
  • focus on words that help to distinguish between texts
  • de-emphasize words that
    • are common to most documents
    • only occur rarely
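One way to picture this kind of weighting is the Python sketch below. The slides do not name the scheme used in the SVD node, so the formula here (term count times log of inverse document frequency, plus a minimum document-frequency cutoff) is an assumed stand-in, and `weight_terms` and `min_doc_freq` are hypothetical names.

```python
import math
from collections import Counter

def weight_terms(documents, min_doc_freq=2):
    """Return one {term: weight} dict per document.

    Assumed scheme (not taken from the slides): weight = count * log(N / df),
    where df is the number of documents containing the term. Terms found in
    every document get weight 0; terms with df < min_doc_freq are dropped.
    """
    n_docs = len(documents)
    doc_freq = Counter(term for terms in documents for term in set(terms))
    weighted = []
    for terms in documents:
        counts = Counter(terms)
        weighted.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
            if doc_freq[term] >= min_doc_freq
        })
    return weighted

docs = [
    ["printer", "jams", "paper"],
    ["printer", "ink", "smears"],
    ["printer", "ink", "paper"],
]
for weights in weight_terms(docs):
    print(weights)
```

In this toy collection "printer" occurs in every document and receives weight zero, while "jams" and "smears" occur in only one document each and are dropped by the cutoff. The terms that remain are those that help tell the documents apart.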
Singular Value Decomposition (SVD)
• Dimension reduction
  • From thousands to hundreds or fewer
• Terms and documents projected into same space
• Documents with similar meaning (but dissimilar word patterns) can still be near one another
• Other variables can be combined with the results of the SVD
  • Quantitative variables
  • Categorical variables
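In standard linear-algebra notation, the reduction can be written as follows. The symbols are assumptions introduced here and do not appear in the slides: $A$ is the weighted $m \times n$ term-document matrix ($m$ terms, $n$ documents), $a_j$ is its $j$-th column, and $k$ is the number of dimensions kept.

```latex
% Full decomposition and its rank-k truncation
A = U \Sigma V^{\mathsf T} \;\approx\; A_k = U_k \Sigma_k V_k^{\mathsf T},
\qquad k \ll \min(m, n)

% Document j in the reduced k-dimensional space
\hat{d}_j = \Sigma_k^{-1} U_k^{\mathsf T} a_j
\quad \text{(equal to column } j \text{ of } V_k^{\mathsf T}\text{)}

% Term i in the same k-dimensional space
\hat{t}_i = \text{row } i \text{ of } U_k
```

Because $\hat{d}_j$ and $\hat{t}_i$ both have $k$ coordinates, terms and documents can be compared directly. Two documents that use different words for the same concept can still end up close together when their terms load on the same singular directions.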
Benefits of SVD
Assumes that there is something conceptual underlying words (obscured by word choice)
these concepts can be expressed as collections or combinations of words
SVD is used to estimate the structure of these concepts and find their most relevant words
Text Clustering (E-M Cluster)
Generates groups of similar documents
from output of SVD
fast clustering of many documents
each document may belong to several classes (“fuzzy clustering”)
Mining the Text
• Utilize Enterprise Miner’s tools
  • Memory-Based Reasoning
  • Neural Network
  • Regression
  • etc.
• Combine other variables with the results of the SVD
  • Quantitative variables
  • Categorical variables
  • Target variables
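To show how SVD output and other variables can feed a predictive tool, the Python sketch below uses a plain k-nearest-neighbor vote, which is one common way memory-based reasoning is implemented. It is an assumption that this resembles the Enterprise Miner node. The data, the names `knn_predict` and `euclidean`, and the choice of Euclidean distance are all hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_rows, train_targets, query, k=3):
    """Predict the majority target among the k training rows nearest to query."""
    nearest = sorted(range(len(train_rows)),
                     key=lambda i: euclidean(train_rows[i], query))[:k]
    votes = Counter(train_targets[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical rows: two SVD coordinates per document plus one
# quantitative variable (for example, a scaled call duration).
rows = [
    [0.9, 0.1, 0.8],
    [0.8, 0.2, 0.7],
    [0.1, 0.9, 0.2],
    [0.2, 0.8, 0.1],
    [0.85, 0.15, 0.9],
]
targets = ["complaint", "complaint", "inquiry", "inquiry", "complaint"]

print(knn_predict(rows, targets, [0.7, 0.3, 0.6]))
```

In practice the extra variables would need to be put on a scale comparable to the SVD coordinates. Otherwise a single variable can dominate the distance.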
Review of the Text Mining Process
• Text parsing extracts terms and creates a term-document frequency matrix
• Each document is characterized by a vector in multi-dimensional space
• The vectors are then reduced down to 100 or so dimensions by the SVD
• Results can be classified, clustered or used for prediction with existing Enterprise Miner tools