Post on 13-Dec-2015
©2003 Paula Matuszek
CSC 9010: Text Mining Applications
Dr. Paula Matuszek
Paula_A_Matuszek@glaxosmithkline.com
(610) 270-6851
So What Next?
Evaluating systems
Systems available
Some good resources
Evaluating Text Mining Systems
There are dozens of text mining tools and systems available:
– commercial
– open source
– research
How do you decide which to use?
Determine Information Need
First step: what are you trying to find out?
– Locate a specific piece of information?
– Locate and capture a large amount of specific information?
– Locate a specific document?
– Get the gist of one or more documents?
– Organize documents into groups?
– Find out something about the overall domain which is reflected in a set of documents?
– ???
Determine Environment
What operating system?
What document formats? ASCII or something richer?
What level of software maturity?
– COTS, with support available, maybe already tuned for your specific problem
– Open source or other fairly stable
– Research tool
What is the cost justification?
Thinking About Information Needs
How specific is your need?
How much do you know already?
How big a corpus?
How well-defined?
One-time question or continuing?
Incremental or episodic?
Information Extraction Tools
Extract specific information, probably from a large number of documents.
What's the typical precision and recall?
KB info:
– What entities are already defined?
– How easy is it to add enumerated lists?
– How easy is it to add patterns?
– What document formats does it accept?
Performance?
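When comparing extraction tools on precision and recall, it can help to measure both yourself on a small hand-marked sample. The following is a minimal sketch of that calculation, not any vendor's evaluation method; the function name and the sample entity sets are made up for illustration.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted items against a hand-marked gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: what fraction of what the tool returned is correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: what fraction of the correct items the tool actually found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical sample: entities a tool found vs. entities a person marked.
found = {"GlaxoSmithKline", "Sheffield", "IBM", "Monday"}
marked = {"GlaxoSmithKline", "Sheffield", "IBM", "Lockheed Martin", "SRI"}

p, r = precision_recall(found, marked)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

Here three of the four items found are correct (precision 0.75), and three of the five marked items were found (recall 0.60).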
Document Retrieval
Need a specific document or some information
For spidering:
– Coverage, including kinds of documents
– Performance, which affects refresh speed
– Flexibility/configuration of spiders
– Special needs? (focused crawling)
For retrieval:
– Relevance ranking
– Performance
– Richness of query engine
– Precision and recall
– Query broadening and narrowing
For both: ease of use
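One simple way to picture query broadening and narrowing is Boolean search over an inverted index: requiring all terms (AND) narrows the result set, accepting any term (OR) broadens it. The sketch below assumes whitespace tokenization and a tiny invented document set; real retrieval engines offer far richer query languages and ranking.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, terms, mode="and"):
    """Return sorted ids matching all terms (mode="and") or any term (mode="or")."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    combine = set.intersection if mode == "and" else set.union
    return sorted(combine(*postings))


# Hypothetical three-document corpus.
docs = {
    1: "patent retrieval and visualization",
    2: "biological text mining",
    3: "text retrieval for patent search",
}
index = build_index(docs)

print(search(index, ["text", "patent"], mode="and"))  # narrowed: [3]
print(search(index, ["text", "patent"], mode="or"))   # broadened: [1, 2, 3]
```

Narrowing tends to raise precision at the cost of recall, and broadening does the reverse, which is why a query engine that supports both is worth checking for.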
Document Categorization
You need to sort your documents
Does system perform in real time?
How many categories total can it handle?
How many categories/document?
Flat or hierarchical?
Categories defined automatically or by hand?
– Automatically:
    – Assumes significant vocabulary differences among different groups.
    – Requires training examples
– By hand assumes:
    – Time to do it!
    – Readily identifiable characteristics to distinguish groups
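To make the "automatically" case concrete, here is a minimal sketch of a categorizer learned from training examples. It is a deliberately crude word-frequency profile per category, not the method of any product listed in this deck; the function names and the training data are assumptions for illustration. It also shows why vocabulary differences matter: a document is assigned to whichever category's training vocabulary it overlaps most.

```python
from collections import Counter


def train(examples):
    """Build a word-frequency profile per category from (text, category) pairs."""
    profiles = {}
    for text, category in examples:
        profiles.setdefault(category, Counter()).update(text.lower().split())
    return profiles


def categorize(profiles, text):
    """Return the category whose profile gives the document's words the highest relative frequency."""
    words = text.lower().split()

    def score(category):
        profile = profiles[category]
        total = sum(profile.values())
        # Counter returns 0 for words never seen in this category.
        return sum(profile[word] for word in words) / total

    return max(profiles, key=score)


# Hypothetical training examples with clearly different vocabularies.
examples = [
    ("gene protein expression pathway", "biology"),
    ("protein binding gene sequence", "biology"),
    ("patent claim filing inventor", "patents"),
    ("inventor patent license claim", "patents"),
]
profiles = train(examples)

print(categorize(profiles, "new gene sequence found"))    # biology
print(categorize(profiles, "patent filing by inventor"))  # patents
```

If the two groups shared most of their vocabulary, the scores would be close and the assignment unreliable, which is the assumption the slide warns about.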
Document Clustering
What is going on in this domain?
What features of document are used to cluster? Linguistic? Semantic? TF*IDF?
What methods are used for clustering? (How do we define "similar"?)
Any capability for incorporating domain knowledge?
Performance
Incremental? Or do you have to start over again to add new documents?
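As one possible answer to "how do we define similar?", the sketch below weights words by TF*IDF and compares documents by cosine similarity. The specific weighting (raw term count times log(N/df)), the whitespace tokenization, and the sample documents are assumptions chosen for brevity; clustering tools differ in all of these.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: weight} dict per document, weight = count * log(N / doc_freq)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    return [
        {word: count * math.log(n_docs / doc_freq[word])
         for word, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical corpus: two biology-like documents and one patent-like document.
docs = [
    "protein gene expression",
    "gene protein binding",
    "patent claim filing",
]
vecs = tfidf_vectors(docs)

print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

Note that the IDF part depends on the whole corpus, so adding documents changes every weight. That is one reason the incremental question above matters when evaluating a clustering tool.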
Document Summarization
What do I have?
Sentence extraction or capture and generate?
How much can it be shortened?
How many documents at once?
Sentence extraction methods are heavily dependent on the method used to identify "important" words.
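The dependence on how "important" words are identified is easy to see in a small sentence-extraction sketch. Here importance is simply word frequency after dropping a short stopword list; both choices, along with the function name and sample text, are assumptions for illustration. Swapping in a different importance measure would change which sentences are selected.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "for", "on", "are"}


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def content_words(sentence):
        return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]

    # "Important" words are the frequent non-stopwords across the whole text.
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]


# Hypothetical input text.
text = "Text mining finds patterns in text. Lunch was good. Mining needs tools."

print(summarize(text, 1))  # ['Text mining finds patterns in text.']
```

Asking for two sentences returns the first and third, since the middle sentence shares no frequent words with the rest of the text.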
Grab Bag of Systems Available: Entity or Information Extraction
– AeroText: Lockheed Martin
– GATE: U of Sheffield
– Sophia: CELI
– iMiner: IBM
– ClearTag: ClearForest
– Thing Finder: Inxight
– LexiQuest: SPSS
– Faustus/TextPRO: SRI
Categorization/Clustering
– Semio: Entrieva
– Oracle Text: Oracle
– Inxight Categorizer: Inxight
– Verity K2: Verity
– Autonomy
– ClearForest
– LexiMine: SPSS
– iMiner, Lotus Discovery Server: IBM
Summarizing
All over the place!
– Every search engine
– Mac OS 10.2 and later
– Many others
What's Happening
Some specific domains are very hot or interesting or intriguing:
– Expertise finder
– Patent retrieval, visualization
– Reputation Minder
– Biological text mining
– Semantic web
– In fact, anything web-related
– ??
What's Happening
Some technologies are also gaining speed:
– Taxonomy identification/extraction
– Question answering
– Automatic markup: for the semantic web, for instance
– Integrated domain-based and statistical approaches
– Machine learning of KBs
Some Useful Resources: Links
Portal text mining links, kept reasonably up to date:
– filebox.vt.edu/users/wfan/text_mining.html
– www.cs.utexas.edu/users/pebronia/text-mining
A really excellent overview paper, still useful although 2001:
– www.mitre.org/work/tech_papers/tech_papers_01/maybury_unstructured/maybury_unstructured.pdf
Best site to start with for software, conferences, etc:
– www.kdnuggets.com/index.html
Useful Resources: Conferences
AAAI and IJCAI: Basic NL research; some good workshops and tutorials on text mining. Some of everything.
KDD: Text Mining often included as a form of data mining, especially more statistical approaches. KDD cup sometimes text based.
SIGIR: Lots of information retrieval
ACL: Lots of linguistic-based info, especially things like entity recognition and tagging.
Data mining conferences: often include text mining component. ICDM, for example.
Domain-specific conferences: often include a text mining component too.
So Where Now?
You now all have a good background in the techniques and applications of text mining, and some ideas of how it's been applied.
Where do you think it will be in 10 years, and what will we be doing with it?