©2003 Paula Matuszek CSC 9010: Text Mining Applications Dr. Paula Matuszek Paula_A_Matuszek@glaxosmithkline.com (610) 270-6851

©2003 Paula Matuszek

CSC 9010: Text Mining Applications

Dr. Paula Matuszek

Paula_A_Matuszek@glaxosmithkline.com

(610) 270-6851

So What Next?

Evaluating systems
Systems available
Some good resources

Evaluating Text Mining Systems

There are dozens of text mining tools and systems available:
– commercial
– open source
– research

How do you decide which to use?

Determine Information Need

First step: what are you trying to find out?
– Locate a specific piece of information?
– Locate and capture a large amount of specific information
– Locate a specific document?
– Get the gist of one or more documents?
– Organize documents into groups?
– Find out something about the overall domain which is reflected in a set of documents?
– ???

Determine Environment

What operating system?
What document formats? ASCII or something richer?
What level of software maturity?
– COTS, with support available, maybe already tuned for your specific problem
– Open source or other fairly stable
– Research tool
What is the cost justification?

Thinking About Information Needs

How specific is your need?
How much do you know already?
How big a corpus? How well-defined?
One-time question or continuing?
Incremental or episodic?

Information Extraction Tools

Extract specific information, probably from a large number of documents.
What's the typical precision and recall?
KB info:
– What entities are already defined?
– How easy is it to add enumerated lists?
– How easy is it to add patterns?
– What document formats does it accept?
Performance?
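
To make the precision/recall question concrete, here is a minimal sketch in Python of how an extraction tool might combine an enumerated list with a pattern, and how its output could be scored against hand-labeled answers. Everything in it is an assumption for illustration: the `COMPANIES` list, the `COMPANY_PATTERN` regular expression, the sample text, and the gold set are invented and do not come from any of the tools discussed here.

```python
import re

# Hypothetical enumerated list of known entities (illustrative only).
COMPANIES = ["IBM", "Oracle", "SPSS"]
# Hypothetical pattern: a capitalized word followed by "Inc." or "Corp."
COMPANY_PATTERN = re.compile(r"\b([A-Z][a-z]+) (?:Inc|Corp)\.")


def extract_companies(text):
    """Return the set of names found via the enumerated list or the pattern."""
    found = set()
    for name in COMPANIES:
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.add(name)
    found.update(m.group(1) for m in COMPANY_PATTERN.finditer(text))
    return found


def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / gold."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


text = ("IBM and Acme Corp. signed a deal. Verity declined to comment. "
        "The Oracle at Delphi was not consulted.")
gold = {"IBM", "Acme", "Verity"}  # hand-labeled answers for this text

extracted = extract_companies(text)
precision, recall = precision_recall(extracted, gold)
print(sorted(extracted), round(precision, 2), round(recall, 2))
```

In this toy case the list produces one false hit ("Oracle") and neither the list nor the pattern finds "Verity", so both precision and recall fall below 1. That is the kind of trade-off to ask a vendor about.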

Document Retrieval

Need a specific document or some information
For spidering:
– Coverage, including kinds of documents
– Performance, which affects refresh speed
– Flexibility/configuration of spiders
– Special needs? (focused crawling)
For retrieval:
– Relevance ranking
– Performance
– Richness of query engine
– Precision and recall
– Query broadening and narrowing
For both: ease of use
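
As a rough illustration of relevance ranking and of query broadening and narrowing, the sketch below builds a tiny inverted index in Python and runs the same terms as an AND query (narrow) and an OR query (broad). The documents, the `search` function, and the count-based score are assumptions made for the example; real retrieval engines use far richer ranking.

```python
from collections import defaultdict

# Hypothetical mini-corpus, for illustration only.
docs = {
    "d1": "text mining tools for patent retrieval",
    "d2": "patent law overview",
    "d3": "mining text and mining data",
}

# Inverted index: term -> {doc_id: number of occurrences}
index = defaultdict(dict)
for doc_id, body in docs.items():
    for term in body.lower().split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1


def search(terms, mode="and"):
    """Return doc ids ranked by total occurrences of the query terms.

    mode="and" narrows the query (every term required);
    mode="or" broadens it (any term is enough).
    """
    postings = [set(index.get(t, {})) for t in terms]
    if not postings:
        return []
    if mode == "and":
        hits = set.intersection(*postings)
    else:
        hits = set.union(*postings)

    def score(doc_id):
        return sum(index.get(t, {}).get(doc_id, 0) for t in terms)

    # Highest score first; ties broken by doc id for a stable order.
    return sorted(hits, key=lambda d: (-score(d), d))


narrow = search(["patent", "mining"], mode="and")
broad = search(["patent", "mining"], mode="or")
print(narrow, broad)
```

The AND query returns only the one document that contains both terms, while the OR query returns all three, with documents matching more term occurrences ranked first.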

Document Categorization

You need to sort your documents
Does the system perform in real time?
How many categories total can it handle?
How many categories/document?
Flat or hierarchical?
Categories defined automatically or by hand?
– Automatically:
    – Assumes significant vocabulary differences among different groups.
    – Requires training examples
– By hand assumes:
    – Time to do it!
    – Readily identifiable characteristics to distinguish groups
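
To show why automatically defined categories need training examples and vocabulary differences between groups, here is a minimal sketch of one common approach, a multinomial naive Bayes classifier with add-one smoothing, written in plain Python. It is not the method of any particular product mentioned in these slides; the training examples and category names are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, category) pairs.

    Returns per-category word counts and per-category document counts.
    """
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in examples:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Pick the category with the highest log-probability for the text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in sorted(doc_counts):
        counts = word_counts[category]
        total_words = sum(counts.values())
        # Prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        for w in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best


# Hypothetical training examples (illustrative only).
examples = [
    ("stock prices fell sharply", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins final match", "sports"),
    ("striker scores late goal", "sports"),
]
word_counts, doc_counts = train(examples)
print(classify("interest rates and stock prices", word_counts, doc_counts))
print(classify("late goal wins match", word_counts, doc_counts))
```

The classifier works here only because the two groups use different words; if the categories shared most of their vocabulary, the scores would be close and the assignments unreliable.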

Document Clustering

What is going on in this domain?
What features of the document are used to cluster? Linguistic? Semantic? TF*IDF?
What methods are used for clustering? (How do we define "similar"?)
Any capability for incorporating domain knowledge?
Performance
Incremental? Or do you have to start over again to add new documents?
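
One common answer to "how do we define similar?" is to weight terms by TF*IDF and compare documents by cosine similarity. The Python sketch below assumes whitespace tokenization and the simple weighting tf * log(N / df); both are illustrative choices, and the three sample documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per document, with idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents (illustrative only).
docs = [
    "gene protein expression",
    "protein gene sequence",
    "patent claim search",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])
sim_02 = cosine(vectors[0], vectors[2])
print(round(sim_01, 3), round(sim_02, 3))
```

A clustering method would group the first two documents, which share weighted terms, and keep the third apart. Note that the IDF weights depend on the whole corpus, which is one reason adding new documents can force a system to recompute rather than cluster incrementally.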

Document Summarization

What do I have?
Sentence extraction or capture and generate?
How much can it be shortened?
How many documents at once?
Sentence extraction methods are heavily dependent on the method used to identify "important" words.
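
The dependence on "important" words is easy to see in a small sketch. The Python example below scores each sentence by the average corpus frequency of its non-stop words and keeps the top-scoring ones in their original order. The frequency-based notion of importance, the tiny `STOP_WORDS` set, and the sample text are all assumptions for illustration; a different importance measure would select different sentences.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "are"}


def content_words(sentence):
    """Lowercased words of a sentence, minus stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    """Keep the sentences whose words are, on average, most frequent."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))
    keep = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)


text = ("Text mining finds patterns in text. The weather is nice. "
        "Mining is useful.")
summary = summarize(text, max_sentences=1)
print(summary)
```

This is sentence extraction only; a "capture and generate" system would instead build a representation of the content and produce new text from it.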

Grab Bag of Systems Available: Entity or Information Extraction

– AeroText: Lockheed Martin
– GATE: U of Sheffield
– Sophia: CELI
– iMiner: IBM
– ClearTag: ClearForest
– Thing Finder: Inxight
– LexiQuest: SPSS
– Faustus/TextPRO: SRI

Categorization/Clustering

Semio: Entrieva
Oracle Text: Oracle
Inxight Categorizer: Inxight
Verity K2: Verity
Autonomy
ClearForest
LexiMine: SPSS
iMiner, Lotus Discovery Server: IBM

Summarizing

All over the place!
Every search engine
Mac OS 10.2 and later
Many others

What's Happening

Some specific domains are very hot or interesting or intriguing
– Expertise finder
– Patent retrieval, visualization
– Reputation Minder
– Biological text mining
– Semantic web
– In fact, anything web-related
– ??

What's Happening

Some technologies are also gaining speed:
– Taxonomy identification/extraction
– Question answering
– Automatic markup: for the semantic web, for instance
– Integrated domain-based and statistical approaches
– Machine learning of KBs

Some Useful Resources: Links

Portal text mining links, kept reasonably up to date:
– filebox.vt.edu/users/wfan/text_mining.html
– www.cs.utexas.edu/users/pebronia/text-mining
A really excellent overview paper, still useful although from 2001:
– www.mitre.org/work/tech_papers/tech_papers_01/maybury_unstructured/maybury_unstructured.pdf
Best site to start with for software, conferences, etc.:
– www.kdnuggets.com/index.html

Useful Resources: Conferences

AAAI and IJCAI: Basic NL research; some good workshops and tutorials on text mining. Some of everything.

KDD: Text Mining often included as a form of data mining, especially more statistical approaches. KDD cup sometimes text based.

SIGIR: Lots of information retrieval

ACL: Lots of linguistic-based info, especially things like entity recognition and tagging.

Data mining conferences: often include text mining component. ICDM, for example.

Domain-specific conferences: often include a text mining component too.

So Where Now?

You now all have a good background in the techniques and applications of text mining, and some ideas of how it's been applied.

Where do you think it will be in 10 years, and what will we be doing with it?