©2003 Paula Matuszek CSC 9010: Text Mining Applications Dr. Paula Matuszek...

CSC 9010: Text Mining Applications

Dr. Paula Matuszek

Paula_A_Matuszek@glaxosmithkline.com

(610) 270-6851

So What Next?

Evaluating systems
Systems available
Some good resources

Evaluating Text Mining Systems

There are dozens of text mining tools and systems available:
– commercial
– open source
– research

How do you decide which to use?

Determine Information Need

First step: what are you trying to find out?
– Locate a specific piece of information?
– Locate and capture a large amount of specific information?
– Locate a specific document?
– Get the gist of one or more documents?
– Organize documents into groups?
– Find out something about the overall domain which is reflected in a set of documents?
– ???

Determine Environment

What operating system?
What document formats? ASCII or something richer?
What level of software maturity?
– COTS, with support available, maybe already tuned for your specific problem
– Open source or other fairly stable
– Research tool
What is the cost justification?

Thinking About Information Needs

How specific is your need?
How much do you know already?
How big a corpus? How well-defined?
One-time question or continuing?
Incremental or episodic?

Information Extraction Tools

Extract specific information, probably from a large number of documents.
What's the typical precision and recall?
KB info:
– What entities are already defined?
– How easy is it to add enumerated lists?
– How easy is it to add patterns?
– What document formats does it accept?
Performance?
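To make the precision and recall question concrete, here is a minimal sketch of how the two measures could be computed for an extraction tool. The function name `precision_recall` and the sample entity sets are hypothetical, chosen only for illustration; real evaluations usually also have to decide how to count partial matches.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted items against a hand-built gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: what fraction of the extracted items are correct?
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: what fraction of the correct items were extracted?
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: entities a person marked up vs. entities a tool found.
gold = {"GlaxoSmithKline", "IBM", "Lockheed Martin", "SPSS"}
extracted = {"IBM", "SPSS", "Sheffield"}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A tool that extracts very little can score high precision with low recall, and the reverse, so it is worth asking a vendor for both numbers on documents similar to your own.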

Document Retrieval

Need a specific document or some information
For spidering:
– Coverage, including kinds of documents
– Performance, which affects refresh speed
– Flexibility/configuration of spiders
– Special needs? (focused crawling)
For retrieval:
– Relevance ranking
– Performance
– Richness of query engine
– Precision and recall
– Query broadening and narrowing
For both: ease of use

Document Categorization

You need to sort your documents
Does system perform in real time?
How many categories total can it handle?
How many categories/document?
Flat or hierarchical?
Categories defined automatically or by hand?
– Automatically:
  – Assumes significant vocabulary differences among different groups.
  – Requires training examples
– By hand assumes:
  – Time to do it!
  – Readily identifiable characteristics to distinguish groups

Document Clustering

What is going on in this domain?
What features of document are used to cluster? Linguistic? Semantic? TF*IDF?
What methods are used for clustering? (How do we define "similar"?)
Any capability for incorporating domain knowledge?
Performance
Incremental? Or do you have to start over again to add new documents?
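As one possible answer to "how do we define similar?", the sketch below weights terms by TF*IDF and compares documents with cosine similarity. It is an illustration under simple assumptions (whitespace tokenization, raw term counts, IDF as log(N/df)), not the method of any particular product; the helper names and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF*IDF vector (a dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical mini-corpus: two related documents and one unrelated.
docs = [
    "protein binding assay results",
    "protein assay binding kinetics",
    "patent claims retrieval system",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # share three terms
sim_02 = cosine(vecs[0], vecs[2])  # share no terms
print(sim_01 > sim_02)
```

Note that the IDF weights depend on the whole corpus, which is one reason some clustering tools must start over when new documents are added.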

Document Summarization

What do I have?
Sentence extraction or capture and generate?
How much can it be shortened?
How many documents at once?
Sentence extraction methods are heavily dependent on the method used to identify "important" words.
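The dependence on "important" words can be seen in a small sentence-extraction sketch. Here importance is assumed to be plain word frequency after removing a short, hand-picked stopword list; both choices are illustrative assumptions, and swapping in a different importance measure would change which sentences are selected. The function name `summarize` and the sample text are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real systems use much longer ones.
STOPWORDS = {"a", "and", "from", "in", "is", "of", "some", "the", "to", "was"}


def content_words(text):
    """Lowercase the text and return its words, minus stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=1):
    """Pick the n sentences whose content words are, on average, most frequent."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


text = (
    "Text mining tools extract information from documents. "
    "Some tools cluster documents. "
    "The weather was pleasant yesterday."
)
print(summarize(text, n_sentences=1))
```

Averaging by sentence length keeps long sentences from winning automatically; summing instead of averaging is an equally common choice and would favor the first sentence here.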

Grab Bag of Systems Available: Entity or Information Extraction

– AeroText: Lockheed Martin
– GATE: U of Sheffield
– Sophia: CELI
– iMiner: IBM
– ClearTag: ClearForest
– Thing Finder: Inxight
– LexiQuest: SPSS
– Faustus/TextPRO: SRI

Categorization/Clustering

Semio: Entrieva
Oracle Text: Oracle
Inxight Categorizer: Inxight
Verity K2: Verity
Autonomy
ClearForest
LexiMine: SPSS
iMiner, Lotus Discovery Server: IBM

Summarizing

All over the place!
Every search engine
Mac OS 10.2 and later
Many others

What's Happening

Some specific domains are very hot or interesting or intriguing:
– Expertise finder
– Patent retrieval, visualization
– Reputation Minder
– Biological text mining
– Semantic web
– In fact, anything web-related
– ??

What's Happening

Some technologies are also gaining speed:
– Taxonomy identification/extraction
– Question answering
– Automatic markup: for the semantic web, for instance
– Integrated domain-based and statistical approaches
– Machine learning of KBs

Some Useful Resources: Links

Portal text mining links, kept reasonably up to date:
– filebox.vt.edu/users/wfan/text_mining.html
– www.cs.utexas.edu/users/pebronia/text-mining
A really excellent overview paper, still useful although 2001:
– www.mitre.org/work/tech_papers/tech_papers_01/maybury_unstructured/maybury_unstructured.pdf
Best site to start with for software, conferences, etc:
– www.kdnuggets.com/index.html

Useful Resources: Conferences

AAAI and IJCAI: Basic NL research; some good workshops and tutorials on text mining. Some of everything.

KDD: Text Mining often included as a form of data mining, especially more statistical approaches. KDD cup sometimes text based.

SIGIR: Lots of information retrieval.

ACL: Lots of linguistic-based info, especially things like entity recognition and tagging.

Data mining conferences: often include a text mining component. ICDM, for example.

Domain-specific conferences: often include a text mining component too.

So Where Now?

You now all have a good background in the techniques and applications of text mining, and some ideas of how it's been applied.

Where do you think it will be in 10 years, and what will we be doing with it?
