The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS...

22
Copyright © 2001, SAS Institute Inc. All rights reserved. The SAS Text Mining Solution [email protected] Senior Data Mining Consultant SAS International [email protected] Product Manager, Enterprise Miner SAS World Wide Marketing

Transcript of The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS...

Page 1: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

The SAS Text Mining [email protected]

Senior Data Mining ConsultantSAS International

[email protected] Manager, Enterprise Miner

SAS World Wide Marketing

Page 2: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or Trademarks of their respective companies.

OverviewThis presentation provides an overview of the SAS text mining

solution including example applications.

Page 3: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

3

Text Mining TimelinesExperimental

Enterprise Miner 4.1, SAS 8.2, currently availableEarly Adopter Text Mining Program

ProductionAdd-on to Enterprise Miner 4.2 and 5.0, SAS 9.0, 1st quarter

of 2002. Separate API to potentially include text mining into other SAS

products and solutions.

Page 4: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

4

Early Adopter ProgramPlease note:

Enterprise Miner for Text is experimental and has a separate setinit. It will be made available on a restricted basis under an Early Adopter program. Please turn to your local SAS office for details."

Page 5: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

5

Why Text Mining ?

Page 6: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

6

What is Text Mining?Discovering and using knowledge that

exists in the document collection as a whole

Uncovering patterns within the collectionEstablish connections between documents and the

terms in the collection as a wholeCombining free-form text and quantitative variables

to derive information

Knowledge

Page 7: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

7

Text Mining is notNatural Language Processing/Understanding

Information Retrieval

Knowledge extraction from a single document

Page 8: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

8

Text Mining ApplicationsAutomatic Classification of Documents

Email filteringOrganize documents by topic into categoriesNews item routingGenerate classification codes from purchase descriptions

ClusteringSurvey data analysisCustomer complaints/comments

Page 9: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

9

Text Mining Applications (cont.)Other forms of Prediction

Stock market prices from business news announcementsCustomer satisfaction from customer commentsCost based on call center log

Page 10: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

10

How Does It Work?Document pre-processing

Data cleansingData reduction

Document analysisclassificationclusteringpredictionetc.

Page 11: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

11

Leveraging Enterprise Miner

Page 12: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

12

Data InputAs a variable

contained in a SAS data set.

As documents residing on the file system.

Page 13: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

13

Text Parsing (cont.)Stop lists

Filter out low information words• articles (e.g. the, a, this)• prepositions (e.g. of, from, by)• conjunctions (e.g. and, but, or)• etc.

Consider document subject matter

Page 14: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

14

Text Parsing

Create Term-Document Frequency MatrixRecord number of times term i appears in document jUses compressed representation in order to handle

large collections

Page 15: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

15

Term FrequenciesWhich terms should be

stopped?

Page 16: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

16

Weighing wordsWeigh words differently. Actually done in the SVD node

focus on words that help to distinguish between textsde-emphasize words that

• are common to most documents• only occur rarely

Page 17: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

17

Singular Value Decomposition (SVD)Dimension reductionFrom thousands to hundreds or less

Terms and documents projected into same spaceDocuments with similar meaning (but dissimilar

word patterns) can still be near one anotherOther variables can be combined with the results

of the SVDQuantitative variablesCategorical variables

Page 18: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

18

Benefits of SVD

Assumes that there is something conceptual underlying words (obscured by word choice)

these concepts can be expressed as collections or combination of words

SVD is used to estimate the structure of these concepts and find their most relevant words

Page 19: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

19

Text Clustering (E-M Cluster)

Generates groups of similar documents

from output of SVD

fast clustering of many documents

each document may belong to several classes (“fuzzy clustering”)

Page 20: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

20

Mining the TextUtilize Enterprise Miner’s tools

Memory-Based ReasoningNeural NetworkRegression

etc.Combine other variables with the results of the SVD

Quantitative variablesCategorical variablesTarget variables

Page 21: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

21

Review of the Text Mining ProcessText parsing extracts terms and creates a term-

document frequency matrixEach document is characterized by a vector in

multi-dimensional spaceThe vectors are then reduced down to100 or so

dimensions by the SVDResults can be classified, clustered or used for

prediction with existing Enterprise Miner tools

Page 22: The SAS Text Mining Solution - · PDF fileQThis presentation provides an overview of the SAS text mining solution including example applications. ... QTerms and documents projected

Copyright © 2001, SAS Institute Inc. All rights reserved.

22