Text Mining In InQuery Vasant Kumar, Peter Richards August 25th, 1999.

Page 1

Text Mining In InQuery

Vasant Kumar, Peter Richards

August 25th, 1999.

Page 2

History

• InQuery was originally a research product from the Center for Intelligent Information Retrieval at the University of Massachusetts, Amherst

• A commercial-strength InQuery API from Sovereign Hill Software

• InQuery 5.0 with LCA and Graphical User Interface

Page 3

Outline

• Text mining

• Text mining using “local context analysis” (LCA)

• Text mining using “top concepts”

• Concept recognizers

• Demonstration of LCA and “top concepts”

• Q & A

Page 4

Text Mining

• Helps find the needle in the haystack

• Query expansion

• Discovers interesting relationships between concepts

• Discovers characteristics about the database

Page 5

Concepts

• Words

• Noun phrases

• People names

• Company names

• User-defined concepts

Page 6

Local Context Analysis (LCA)

• Associates a query with a ranked list of concepts for each of several concept types (noun phrases, people names, ...)

• Concept association is done on the fly

– no complex databases to be created

– changes to the database are immediately taken into account

Page 7

Background

• Unit of retrieval is a passage (local context), in contrast to a document in regular search.

• A passage is a window of n consecutive words

• Overlapping passages are used.
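
A minimal sketch of overlapping word windows, to make the passage idea concrete. The slides give only the window length n; the function name `split_into_passages` and the half-window step are assumptions, not InQuery's actual settings.

```python
def split_into_passages(words, n, step=None):
    """Split a list of words into overlapping windows of at most n words.

    The step between window starts defaults to n // 2 (an assumption;
    the slides do not say how much the passages overlap).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not words:
        return []
    step = step or max(1, n // 2)
    passages = []
    start = 0
    while True:
        passages.append(words[start:start + n])
        if start + n >= len(words):
            break  # the last window reached the end of the document
        start += step
    return passages


words = "the quick brown fox jumps over the lazy dog".split()
passages = split_into_passages(words, n=4)
for p in passages:
    print(" ".join(p))
```

With a step of half the window, every word except those at the document edges falls into two passages, so a relevant phrase is less likely to be cut in half by a window boundary.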

Page 8

LCA Process

• Generate candidate passages (sub-documents)

• Extract concepts and their statistics

• Apply LCA algorithm to rank the concepts for each concept type

Page 9

Step 1: Generate Candidate Passages

• The documents are split into passages (virtual sub-documents)

• Evaluate the query on these passages to generate a weight for each passage

• Rank the passages

• Select the top m passages
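
The steps above can be sketched as follows. The weighting here is a toy count of query-term occurrences, used only for illustration; InQuery's real passage scoring is not described in the slides. The names `score_passage` and `top_passages` are hypothetical.

```python
import heapq


def score_passage(passage, query_terms):
    """Toy passage weight: how many words in the passage are query terms."""
    terms = set(query_terms)
    return sum(1 for word in passage if word in terms)


def top_passages(passages, query_terms, m):
    """Weight every passage, rank by weight, and keep the m best.

    Ties are broken in favor of the earlier passage.
    """
    scored = [(score_passage(p, query_terms), i) for i, p in enumerate(passages)]
    best = heapq.nlargest(m, scored, key=lambda s: (s[0], -s[1]))
    return [(passages[i], score) for score, i in best]


passages = [
    ["text", "mining", "finds", "patterns"],
    ["query", "expansion", "helps", "search"],
    ["text", "retrieval", "and", "text", "mining"],
]
ranked = top_passages(passages, ["text", "mining"], m=2)
for passage, weight in ranked:
    print(weight, " ".join(passage))
```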

Page 10

Step 2: Extract Concepts

• For every passage in the candidate passage list, extract its text from the source document.

• Each candidate passage is passed through a set of “concept recognizers”, each of which produces a list of concepts of its own type.

• Generate passage level statistics for all concepts and query terms
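
A rough sketch of this step, assuming each recognizer is simply a function from a passage's words to a list of concepts. The two recognizers below are toy stand-ins (they are not InQuery's recognizers), and the shape of the statistics record is an assumption.

```python
from collections import Counter


def word_recognizer(words):
    """Toy 'words' recognizer: every alphabetic token, lowercased."""
    return [w.lower() for w in words if w.isalpha()]


def capitalized_pair_recognizer(words):
    """Toy stand-in for a name recognizer: two adjacent capitalized words."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])
            if a.istitle() and b.istitle()]


# Hypothetical registry: concept type -> recognizer function.
RECOGNIZERS = {"word": word_recognizer, "name": capitalized_pair_recognizer}


def passage_statistics(passages, query_terms, recognizers=RECOGNIZERS):
    """Per passage: counts of query terms and of concepts for each concept type."""
    query_terms = set(query_terms)
    stats = []
    for words in passages:
        lowered = [w.lower() for w in words]
        stats.append({
            "query": Counter(w for w in lowered if w in query_terms),
            "concepts": {ctype: Counter(rec(words))
                         for ctype, rec in recognizers.items()},
        })
    return stats


passages = [
    ["Peter", "Richards", "demonstrated", "text", "mining"],
    ["text", "mining", "uses", "text", "passages"],
]
stats = passage_statistics(passages, ["text", "mining"])
print(stats[0]["concepts"]["name"])
print(stats[1]["query"])
```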

Page 11

Step 3: Apply LCA Algorithm

• Generate local context statistics for concepts and query terms (specific to the set of candidate passages)

• Use the LCA algorithm to generate weights for concepts, using both the passage-level and the local-context-level statistics.

• Rank the concepts and select top n

• The above steps are repeated for all concept types.
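
The slides do not give the weighting formula, so the sketch below uses a simplified co-occurrence weighting, loosely modeled on published descriptions of LCA; it should not be read as InQuery's actual algorithm. The function name `lca_rank`, the smoothing constant `delta`, and the idf-style statistic are all assumptions. Passage-level statistics are the per-passage counts; local-context statistics are the passage frequencies computed over the candidate set only.

```python
import math
from collections import Counter


def lca_rank(passage_stats, query_terms, top_n, delta=0.1):
    """Rank the concepts of one concept type against a query.

    passage_stats: list of (concept_counts, query_term_counts) Counter pairs,
    one pair per candidate passage.
    """
    n = len(passage_stats)
    if n == 0:
        return []

    # Local context statistics: in how many candidate passages
    # does each concept / query term occur?
    concept_df, term_df = Counter(), Counter()
    for concepts, terms in passage_stats:
        concept_df.update(concepts.keys())
        term_df.update(t for t in query_terms if terms[t] > 0)

    def idf(df):
        # Rarer items within the candidate set get a larger value.
        return math.log(1 + n / df)

    weights = {}
    for c in concept_df:
        w = 1.0
        for t in query_terms:
            # Passage-level co-occurrence of concept c with query term t.
            co = sum(terms[t] * concepts[c] for concepts, terms in passage_stats)
            base = delta + math.log(1 + co) * idf(concept_df[c]) / math.log(1 + n)
            w *= base ** idf(max(term_df[t], 1))
        weights[c] = w
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


passage_stats = [
    (Counter({"query expansion": 1, "stemming": 1}), Counter({"text": 1, "mining": 1})),
    (Counter({"query expansion": 2}), Counter({"text": 2})),
    (Counter({"stock prices": 1}), Counter({"mining": 1})),
]
ranked = lca_rank(passage_stats, ["text", "mining"], top_n=2)
for concept, weight in ranked:
    print(f"{concept}: {weight:.3f}")
```

Multiplying the per-term factors rewards concepts that co-occur with all query terms; `delta` keeps a concept that misses one term from dropping to zero. Running the function once per concept type gives one ranked list for each type.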

Page 12

Text Mining Using Top Concepts

• Retrieve documents

• Extract concepts from each document using “concept recognizers”

• Generate the most frequently occurring concepts for each concept type.

• Persist the most frequently occurring concepts.
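
As an illustration of the steps above, the sketch below counts concepts across retrieved documents and serializes the top k per concept type. The recognizer is a toy placeholder, and JSON is only one assumed way to persist the result; the slides do not say how InQuery stores it.

```python
import json
from collections import Counter


def long_word_recognizer(text):
    """Toy recognizer: lowercased words longer than three characters."""
    return [w.lower() for w in text.split() if len(w) > 3]


def top_concepts(documents, recognizers, k):
    """Return the k most frequent concepts for each concept type."""
    totals = {ctype: Counter() for ctype in recognizers}
    for text in documents:
        for ctype, recognize in recognizers.items():
            totals[ctype].update(recognize(text))
    return {ctype: counts.most_common(k) for ctype, counts in totals.items()}


documents = [
    "InQuery supports text mining",
    "Text mining expands a query",
    "Mining text for concepts",
]
result = top_concepts(documents, {"word": long_word_recognizer}, k=2)

# Persist the result; here it is just serialized to a JSON string.
serialized = json.dumps(result)
print(serialized)
```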

Page 13

Noun Phrase Recognizer

• Tokenization to generate words

• Part-of-speech tagging (noun, verb, etc.)

• Select noun phrases
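
The three stages can be sketched with a toy pipeline. The lexicon-based tagger and the "run of adjectives and nouns ending in a noun" rule are assumptions made for illustration; the slides do not describe the tagger or the selection rule InQuery uses.

```python
import re

# Hypothetical toy lexicon; a real tagger would be statistical or rule based.
LEXICON = {
    "the": "DET", "a": "DET",
    "fast": "ADJ", "local": "ADJ",
    "search": "NOUN", "engine": "NOUN", "context": "NOUN", "passages": "NOUN",
    "ranks": "VERB", "uses": "VERB",
}


def tokenize(text):
    """Stage 1: split text into lowercase alphabetic words."""
    return re.findall(r"[A-Za-z]+", text.lower())


def tag(tokens):
    """Stage 2: look up each word; unknown words default to NOUN (assumption)."""
    return [(t, LEXICON.get(t, "NOUN")) for t in tokens]


def noun_phrases(tagged):
    """Stage 3: keep maximal runs of ADJ/NOUN tokens, trimmed to end in a NOUN."""
    phrases, run = [], []
    for word, pos in tagged + [("", "END")]:  # sentinel flushes the last run
        if pos in ("ADJ", "NOUN"):
            run.append((word, pos))
            continue
        while run and run[-1][1] != "NOUN":
            run.pop()  # drop trailing adjectives
        if run:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases


text = "The fast search engine ranks local context passages."
phrases = noun_phrases(tag(tokenize(text)))
print(phrases)
```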

Page 14

Other Recognizers

• Company and people name recognizers

– based on pattern matching rules

– use external lists of names for normalization and additional evidence

• User-defined recognizer

– uses a user-provided list of concepts (single/multiword)

– generates a state machine
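
One plausible reading of "generates a state machine" is a word-level trie built from the user's concept list, where each state is a dictionary of transitions. The sketch below follows that reading; the function names and the longest-match policy are assumptions, not details from the slides.

```python
ACCEPT = object()  # sentinel key marking a state that completes a concept


def build_state_machine(concepts):
    """Build a trie: each state maps the next word to the next state."""
    root = {}
    for concept in concepts:
        state = root
        for word in concept.lower().split():
            state = state.setdefault(word, {})
        state[ACCEPT] = concept
    return root


def recognize(words, machine):
    """Scan left to right, emitting the longest concept starting at each position."""
    lowered = [w.lower() for w in words]
    found, i = [], 0
    while i < len(lowered):
        state, match, end = machine, None, i
        for j in range(i, len(lowered)):
            state = state.get(lowered[j])
            if state is None:
                break  # no transition on this word
            if ACCEPT in state:
                match, end = state[ACCEPT], j + 1
        if match:
            found.append(match)
            i = end  # continue after the matched concept
        else:
            i += 1
    return found


machine = build_state_machine(["text mining", "local context analysis", "query"])
words = "InQuery applies local context analysis to text mining of a query".split()
found = recognize(words, machine)
print(found)
```

Building the machine once and reusing it for every passage keeps recognition linear in the passage length for typical concept lists, which matters when the same list is applied to many documents.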

Page 15

Demonstration of LCA and Top Concepts in InQuery 5.1