Text Mining In InQuery Vasant Kumar, Peter Richards August 25th, 1999.
Text Mining In InQuery
Vasant Kumar, Peter Richards
August 25th, 1999.
History
• InQuery was originally a research product from the Center for Intelligent Information Retrieval at the University of Massachusetts, Amherst
• A commercial-strength InQuery API from Sovereign Hill Software
• InQuery 5.0 with LCA and Graphical User Interface
Outline
• Text mining
• Text mining using “local context analysis” (LCA)
• Text mining using “top concepts”
• Concept recognizers
• Demonstration of LCA and “top concepts”
• Q & A
Text Mining
• Helps find the needle in the haystack
• Query expansion
• Discovers interesting relationships between concepts
• Discovers characteristics about the database
Concepts
• Words
• Noun phrases
• People names
• Company names
• User-defined concepts
Local Context Analysis (LCA)
• Associates a query with a ranked list of concepts for each of several concept types (noun phrases, people names, ...)
• Concept association is done on the fly
  – no complex databases to be created
  – changes to the database are immediately taken into account
Background
• Unit of retrieval is a passage (local context), in contrast to a document in regular search.
• A passage is a window of n consecutive words
• Overlapping passages are used.
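A minimal sketch of overlapping passages, assuming a half-window overlap. The slides do not state the overlap InQuery uses, so the `step` parameter and the function name are hypothetical.

```python
def split_into_passages(words, n, step=None):
    """Split a list of words into overlapping windows of at most n words.

    `step` is how far each window starts after the previous one.
    Defaulting to n // 2 (half-window overlap) is an assumption of this sketch.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not words:
        return []
    step = step or max(1, n // 2)
    passages = []
    start = 0
    while True:
        passages.append(words[start:start + n])
        if start + n >= len(words):  # this window reached the end of the text
            break
        start += step
    return passages


words = "the quick brown fox jumps over the lazy dog".split()
passages = split_into_passages(words, n=4)
```

With n = 4 and a step of 2, every word except those at the edges of the text falls into two passages, so a concept that straddles one window boundary still appears whole in the neighbouring window.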
LCA Process
• Generate candidate passages (sub-documents)
• Extract concepts and their statistics
• Apply LCA algorithm to rank the concepts for each concept type
Step 1: Generate Candidate Passages
• The documents are split into passages (virtual sub-documents)
• Evaluate the query on these passages to generate a weight for each passage
• Rank the passages
• Select the top m best passages
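Step 1 can be sketched as follows. The scoring function here is a plain count of query-term occurrences, a stand-in chosen for illustration; InQuery's own passage weighting is not described in these slides. All function names are hypothetical.

```python
import heapq


def score_passage(query_terms, passage):
    """Stand-in weight: how many words of the passage are query terms."""
    terms = set(query_terms)
    return sum(1 for word in passage if word in terms)


def top_passages(query_terms, passages, m):
    """Return (passage_index, weight) for the m best passages, best first.

    Ties are broken in favour of the earlier passage.
    """
    scored = [(score_passage(query_terms, p), i) for i, p in enumerate(passages)]
    best = heapq.nlargest(m, scored, key=lambda s: (s[0], -s[1]))
    return [(i, weight) for weight, i in best]


passages = [
    ["text", "mining", "finds", "patterns"],
    ["the", "weather", "is", "mild"],
    ["mining", "text", "with", "text", "tools"],
]
best = top_passages(["text", "mining"], passages, m=2)
```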
Step 2: Extract Concepts
• For every passage in the candidate passage list, extract the passage text from its source document.
• Each candidate passage is passed through a set of “concept recognizers”, each of which extracts the list of concepts of its type.
• Generate passage-level statistics for all concepts and query terms
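A rough sketch of Step 2, assuming a recognizer is any function that maps passage text to a list of concepts and that the passage-level statistics are simple occurrence counts. The toy recognizer and the layout of the result are assumptions of this sketch, not InQuery's data structures.

```python
import re
from collections import Counter


def capitalized_phrases(text):
    """Toy recognizer (hypothetical): runs of two or more capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", text)


def passage_statistics(passages, recognizers, query_terms):
    """For each passage, count query-term and concept occurrences.

    `recognizers` maps a concept-type name to a recognizer function.
    """
    stats = []
    for text in passages:
        words = Counter(re.findall(r"\w+", text.lower()))
        stats.append({
            "query_terms": {t: words[t] for t in query_terms},
            "concepts": {name: Counter(rec(text))
                         for name, rec in recognizers.items()},
        })
    return stats


passages = [
    "Sovereign Hill Software ships the InQuery search engine.",
    "The search engine was built at the University of Massachusetts.",
]
stats = passage_statistics(
    passages, {"capitalized": capitalized_phrases}, ["search", "engine"]
)
```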
Step 3: Apply LCA Algorithm
• Generate local context statistics for concepts and query terms (specific to the set of candidate passages)
• Use the LCA algorithm to generate weights for concepts, using both the passage-level and the local-context-level statistics.
• Rank the concepts and select top n
• The above steps are repeated for all concept types.
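The slides do not give the weighting formula. The sketch below only illustrates the general idea of co-occurrence-based ranking: a concept scores well when it co-occurs with every query term in the top passages, with rarer terms and concepts counting more. Its shape is loosely modeled on the published description of LCA by Xu and Croft. The smoothing constant `delta`, the idf cap, the guard for a single passage, and the input layout are all assumptions, not InQuery's implementation.

```python
import math


def lca_rank(passage_stats, query_terms, collection_size, doc_freq, top_n, delta=0.1):
    """Rank the concepts of one concept type against a query.

    passage_stats: one dict per top passage, {"terms": {term: count},
                   "concepts": {concept: count}}  (hypothetical layout)
    doc_freq:      number of passages in the whole collection containing each
                   term or concept (unknown items default to 1)
    """
    n = max(len(passage_stats), 2)  # guard against log10(1) == 0

    def idf(item):
        return min(1.0, math.log10(collection_size / doc_freq.get(item, 1)) / 5.0)

    concepts = {c for p in passage_stats for c in p["concepts"]}
    scores = {}
    for c in concepts:
        score = 1.0
        for t in query_terms:
            # Co-occurrence of concept c and query term t within the top passages.
            co = sum(p["concepts"].get(c, 0) * p["terms"].get(t, 0)
                     for p in passage_stats)
            co_degree = math.log10(co + 1) * idf(c) / math.log10(n)
            score *= (delta + co_degree) ** idf(t)
        scores[c] = score
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


passage_stats = [
    {"terms": {"text": 2, "mining": 1}, "concepts": {"query expansion": 1}},
    {"terms": {"text": 1, "mining": 1}, "concepts": {"query expansion": 2, "stock market": 1}},
    {"terms": {"text": 1}, "concepts": {"stock market": 1}},
]
doc_freq = {"text": 1000, "mining": 100, "query expansion": 10, "stock market": 500}
ranked = lca_rank(passage_stats, ["text", "mining"], 100000, doc_freq, top_n=2)
```

Multiplying the per-term factors rewards concepts that co-occur with all query terms, and `delta` keeps one missing term from driving the whole product to zero.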
Text Mining Using Top Concepts
• Retrieve documents
• Extract concepts from each document using “concept recognizers”
• Find the most frequently occurring concepts for each concept type.
• Persist the most frequently occurring concepts.
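The top-concepts steps can be sketched with a frequency counter per concept type. The input layout, the sample names, and the choice of JSON for persistence are assumptions; the slides do not say how InQuery stores the result.

```python
import json
from collections import Counter


def top_concepts(doc_concepts, k):
    """Return the k most frequent concepts per concept type.

    doc_concepts: one dict per retrieved document, mapping a concept type
                  to the list of concepts the recognizers found in it.
    """
    counts = {}
    for doc in doc_concepts:
        for concept_type, concepts in doc.items():
            counts.setdefault(concept_type, Counter()).update(concepts)
    return {ctype: counter.most_common(k) for ctype, counter in counts.items()}


docs = [
    {"people": ["Ada Lovelace", "Alan Turing"], "companies": ["Acme Corp"]},
    {"people": ["Ada Lovelace"], "companies": ["Acme Corp"]},
]
top = top_concepts(docs, k=1)

# Persist: this string could be written to a file or a database column.
serialized = json.dumps(top, sort_keys=True)
```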
Noun Phrase Recognizer
• Tokenization to generate words
• Part-of-speech tagging (noun, verb, etc.)
• Select noun phrases
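A toy version of the three stages above. The tagger is a tiny hand-made lexicon and the phrase rule is "a run of adjectives and nouns ending in a noun". Both are simplifying assumptions, since the slides do not describe the tagger or the phrase grammar InQuery uses.

```python
import re

# Hypothetical mini-lexicon; a real tagger covers a full vocabulary.
LEXICON = {
    "the": "DET", "a": "DET", "fast": "ADJ", "local": "ADJ",
    "context": "NOUN", "analysis": "NOUN", "query": "NOUN",
    "expansion": "NOUN", "improves": "VERB", "uses": "VERB",
}


def tokenize(text):
    """Stage 1: split text into lowercase word tokens."""
    return re.findall(r"[A-Za-z]+", text.lower())


def tag(tokens):
    """Stage 2: attach a part of speech; unknown words default to NOUN."""
    return [(t, LEXICON.get(t, "NOUN")) for t in tokens]


def noun_phrases(tagged):
    """Stage 3: keep maximal adjective/noun runs that end in a noun."""
    phrases, current = [], []
    for word, pos in tagged + [("", "END")]:  # sentinel flushes the last run
        if pos in ("ADJ", "NOUN"):
            current.append((word, pos))
            continue
        while current and current[-1][1] != "NOUN":  # drop trailing adjectives
            current.pop()
        if current:
            phrases.append(" ".join(w for w, _ in current))
        current = []
    return phrases


phrases = noun_phrases(tag(tokenize("Local context analysis improves query expansion.")))
```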
Other Recognizers
• Company and people name recognizers
  – based on pattern-matching rules
  – use external lists of names for normalization and additional evidence
• User-defined recognizer
  – uses a user-provided list of concepts (single/multiword)
  – generates a state machine
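One way a concept list can be compiled into a state machine is a word-level trie whose accepting states carry the concept. This is a sketch under that assumption; the slides do not say which kind of state machine InQuery generates. Whitespace tokenization and longest-match scanning are also simplifications.

```python
def build_state_machine(concepts):
    """Compile concepts into a trie: each state is a dict keyed by the next word.

    The key None marks an accepting state and stores the matched concept.
    """
    root = {}
    for concept in concepts:
        state = root
        for word in concept.lower().split():
            state = state.setdefault(word, {})
        state[None] = concept
    return root


def recognize(text, machine):
    """Scan text left to right, emitting the longest concept at each position."""
    words = text.lower().split()
    found, i = [], 0
    while i < len(words):
        state, match, end = machine, None, i
        for j in range(i, len(words)):
            state = state.get(words[j])
            if state is None:      # no transition on this word
                break
            if None in state:      # accepting state: remember the longest match
                match, end = state[None], j + 1
        if match is not None:
            found.append(match)
            i = end                # resume after the matched concept
        else:
            i += 1
    return found


machine = build_state_machine(["text mining", "text", "local context analysis"])
found = recognize("text mining uses local context analysis on text", machine)
```

Because single-word and multiword concepts share states, "text" and "text mining" are recognized in one pass without backtracking over the input.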
Demonstration of LCA and Top Concepts in InQuery 5.1