Faceting with Lucene Block Join Query - Lucene/Solr Revolution 2014
Lucene: an open source set of Java classes ◦ Search Engine/Document Classifier/Indexer
Lucene
  An open source set of Java classes
    ◦ Search Engine/Document Classifier/Indexer
        http://lucene.sourceforge.net/talks/pisa/
    ◦ Developed by Doug Cutting 1996
        Contributed to Apache project
        Wrote several papers in IR
Modules for IR
    ◦ Analysis
        Tokenization
        Where tokens are indexed
    ◦ Document
        Where the Document ID is created
        Date of Document is extracted
        Title of document is extracted
    ◦ Index
        Provides access to indexes
        Maintains indexes
    ◦ Query Parser
        Where the magic of query happens
    ◦ Search
        Searches across indexes
Modules for IR
    ◦ Search Spans
        Spans K+/- words
        Example: Find me a document that has Rachael Ray and Alton Brown within 100 words of each other that also has the term cooking
    ◦ Store/Util
        Store the indexes and other housekeeping
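The span example above can be sketched in plain Python. This is only an illustration of the "within K words" idea, not Lucene's span query classes; the function names, the single-word terms, and the sample sentence are all assumptions made for the sketch.

```python
def positions(tokens, term):
    """Return every position at which `term` occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def within_k(tokens, first, second, k):
    """True if some occurrence of `first` lies within k positions of `second`."""
    return any(
        abs(i - j) <= k
        for i in positions(tokens, first)
        for j in positions(tokens, second)
    )


def matches(text, first, second, k, required):
    """Proximity test plus one extra required term, as in the slide's example."""
    tokens = text.lower().split()
    return required in tokens and within_k(tokens, first, second, k)


# Hypothetical document; single-word names keep the tokenizer trivial.
doc = "Ray and Brown both host cooking shows"
hit = matches(doc, "ray", "brown", 100, "cooking")   # names are 2 positions apart
miss = matches(doc, "ray", "brown", 1, "cooking")    # window of 1 is too narrow
```

A real index would store term positions at indexing time instead of rescanning the text for every query.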
Theory
  Space Optimizations for Total Ranking
    ◦ Cutting et al 1996
    ◦ RIAO (Computer Assisted IR) 1997
    ◦ http://lucene.sf.net/papers/riao97.ps
  Lucene lecture at Pisa
    ◦ Doug Cutting
    ◦ Slides from Lecture at University of Pisa 2004
Vector
  Vectors give a mathematical measure of distance between terms
    ◦ Uses a cosine measure to determine how close terms/documents are (a higher value means closer)
    ◦ This distance can then be used for WSD/Clustering/IR
    ◦ Example:
        Bass,fishing: .6506
        Bass,guitar: .000423
        This tells us the document is about fishing not about guitars
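A minimal sketch of the cosine measure, assuming each term is represented by a vector of co-occurrence counts. The context words and the counts below are invented for illustration and do not reproduce the .6506 and .000423 figures on the slide.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical co-occurrence counts over the context words (lake, boat, amp, string).
bass = [4, 3, 0, 1]
fishing = [5, 4, 0, 0]
guitar = [0, 0, 6, 5]

bass_fishing = cosine(bass, fishing)
bass_guitar = cosine(bass, guitar)
```

With these made-up counts `bass_fishing` comes out much larger than `bass_guitar`, which is the same kind of evidence the slide uses to decide that the document is about fishing.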
Vectors-IR
  “Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.”
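The term space described in the quote can be sketched as follows. The three tiny documents are hypothetical, and the choice of length-normalized count vectors with Euclidean distance is an assumption made for this sketch, not something the quote specifies.

```python
import math
from collections import Counter

# Hypothetical three-document collection (every document is non-empty).
docs = {
    "d1": "bass fishing lake boat",
    "d2": "fishing lake trout boat",
    "d3": "bass guitar amp string",
}

# One dimension per unique word in the whole collection.
vocabulary = sorted({word for text in docs.values() for word in text.split()})


def to_vector(text):
    """Map a document to a unit-length vector of term counts over the vocabulary."""
    counts = Counter(text.split())
    raw = [counts[term] for term in vocabulary]
    length = math.sqrt(sum(x * x for x in raw))
    return [x / length for x in raw]


def distance(u, v):
    """Euclidean distance between two document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


vectors = {name: to_vector(text) for name, text in docs.items()}
near = distance(vectors["d1"], vectors["d2"])   # three shared words
far = distance(vectors["d1"], vectors["d3"])    # one shared word
```

Here `d1` and `d2` share three words and land closer together than `d1` and `d3`, which share only one.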
Inverted Index
  Term/Doc Id/Weight
    ◦ Term
        “A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied.”
        http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene-p2.html
Inverted Index
  Doc Id
    ◦ A unique “key” that identifies each document
  Weight
    ◦ Binary
    ◦ Freq Count
    ◦ Weighting Algorithm
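A small sketch of the Term/Doc Id/Weight layout, with one function per weighting option listed above. The documents are hypothetical, and the TF-IDF formula is only a stand-in for "Weighting Algorithm"; it is an assumption for illustration, not Lucene's scoring formula.

```python
import math
from collections import Counter, defaultdict

# Hypothetical collection keyed by document id.
docs = {1: "bass fishing bass lake", 2: "bass guitar", 3: "lake fishing"}

# Number of documents each term appears in, needed by the TF-IDF stand-in.
doc_freq = Counter(term for text in docs.values() for term in set(text.split()))


def binary(term, freq):
    """Weight 1 whenever the term is present."""
    return 1


def frequency(term, freq):
    """Weight equal to the raw count of the term in the document."""
    return freq


def tf_idf(term, freq):
    """Count scaled by log(N / document frequency); assumed stand-in weighting."""
    return freq * math.log(len(docs) / doc_freq[term])


def build_index(docs, weight):
    """Return term -> list of (doc_id, weight) postings, ordered by doc id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term, freq in Counter(docs[doc_id].split()).items():
            index[term].append((doc_id, weight(term, freq)))
    return dict(index)


binary_index = build_index(docs, binary)
freq_index = build_index(docs, frequency)
tfidf_index = build_index(docs, tf_idf)
```

Looking up a term returns its postings directly, for example `freq_index["bass"]`, without scanning any document text.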
Index Merge
  Basic/Basket/Basketball
    ◦ Only keeps track of the differences between words
    ◦ Periodically merges indexes
        Allows new documents to be added easily
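One way to read "only keeps track of the differences between words" is front coding: each sorted term is stored as the length of the prefix it shares with the previous term plus the remaining suffix. The sketch below assumes that reading and also shows a simple merge of two sorted term lists. It illustrates the idea only and says nothing about Lucene's actual file format; all function names are hypothetical.

```python
import heapq


def shared_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_terms):
    """Encode each term as (shared prefix length with previous term, new suffix)."""
    encoded, previous = [], ""
    for term in sorted_terms:
        keep = shared_prefix_len(previous, term)
        encoded.append((keep, term[keep:]))
        previous = term
    return encoded


def front_decode(encoded):
    """Rebuild the original sorted term list from its front-coded form."""
    terms, previous = [], ""
    for keep, suffix in encoded:
        previous = previous[:keep] + suffix
        terms.append(previous)
    return terms


def merge_segments(*segments):
    """Merge already-sorted term lists into one sorted list without duplicates."""
    merged = []
    for term in heapq.merge(*segments):
        if not merged or merged[-1] != term:
            merged.append(term)
    return merged


encoded = front_encode(["basic", "basket", "basketball"])
merged = merge_segments(["basic", "basketball"], ["basket", "basketball"])
```

For the slide's three words, `encoded` is `[(0, "basic"), (3, "ket"), (6, "ball")]`. Because each segment is already sorted, merging is a single linear pass, which is why adding a small new segment and merging later is cheap.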
Query
  Boolean Search
    ◦ Only searches documents with at least 1 term in query
    ◦ “Boolean Search Engine”
  Parallel Search
    ◦ Each term in query is searched in parallel
    ◦ Partial scores added to queue of docs
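A rough sketch of the two ideas above: only documents that appear in some query term's postings are ever scored, and each term contributes a partial score to a per-document accumulator. The postings and weights are invented, and the sequential loop merely stands in for the parallel per-term search the slide describes.

```python
from collections import defaultdict

# Hypothetical postings: term -> list of (doc_id, weight).
postings = {
    "bass": [(1, 2.0), (2, 1.0)],
    "fishing": [(1, 1.0), (3, 1.0)],
    "guitar": [(2, 1.5)],
}


def search(query_terms, postings):
    """Accumulate partial scores per document, then rank by total score."""
    partial = defaultdict(float)
    for term in query_terms:                      # one pass per query term
        for doc_id, weight in postings.get(term, []):
            partial[doc_id] += weight             # partial score for this doc
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(partial.items(), key=lambda item: (-item[1], item[0]))


results = search(["bass", "fishing"], postings)
```

Documents that contain none of the query terms never enter `partial`, so they cost nothing at query time.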
Query
  Threshold
    ◦ If partial score is too low and will not be part of N-best then the document is ignored even before search is complete
        Example
          Potential New Doc [0,0,0,0,0,0,i]
          Document ranked 14 [233,202,109,100,i]
          Potential New Doc is ignored
    ◦ Small loss of recall greatly increases speed of search
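The threshold idea can be sketched with a min-heap holding the current N best documents. The version below is an assumption-laden illustration: it abandons a candidate only when its partial score plus the largest possible remaining contribution cannot beat the current Nth-best score, so it never loses a true top-N document. The scheme on the slide is more aggressive and accepts a small loss of recall. The weights, bounds, and names are all hypothetical.

```python
import heapq


def top_n(candidates, max_term_weights, n):
    """Return (ranked N-best list, ids of documents abandoned early).

    candidates: doc_id -> per-term weights, one entry per query term.
    max_term_weights: upper bound on the weight each query term can add.
    """
    best = []        # min-heap of (score, doc_id); best[0] is the Nth best so far
    skipped = []
    for doc_id, weights in candidates.items():
        score, pruned = 0.0, False
        for i, weight in enumerate(weights):
            score += weight
            remaining = sum(max_term_weights[i + 1:])
            # Stop as soon as this document can no longer enter the N best.
            if len(best) == n and score + remaining <= best[0][0]:
                pruned = True
                break
        if pruned:
            skipped.append(doc_id)
        elif len(best) < n:
            heapq.heappush(best, (score, doc_id))
        else:
            heapq.heapreplace(best, (score, doc_id))
    return sorted(best, reverse=True), skipped


# Hypothetical per-term weights for a three-term query.
max_term_weights = [3.0, 2.0, 1.0]
candidates = {
    "a": [3.0, 2.0, 1.0],
    "b": [2.0, 2.0, 0.0],
    "c": [0.0, 0.0, 1.0],
    "d": [3.0, 1.0, 1.0],
}
ranked, skipped = top_n(candidates, max_term_weights, 2)
```

In this run document `c` is dropped after its first term, because even the maximum remaining weight could not lift it past the current second-best score.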
Evaluation of Lucene
  Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering
    ◦ Tellex et al, MIT AI Lab 2003
  Compared Prise to Lucene for question and answer tasks
    ◦ Question & Answer
        <Who is the president?> <George W. Bush .76>
Evaluation of Lucene
  Prise
    ◦ An IR system developed by NIST that according to the paper uses “modern” search engine techniques
  Findings
    ◦ Found Prise was better than Lucene since “Boolean” query engines are considered old school and its answers to questions were better
Evaluation of Lucene
  Lucene
    ◦ Found that although Prise returned more correct answers, Lucene found more documents containing relevant information
  MIT used Lucene, not Prise, in their 2005 TREC submission
Users
  Lucene is used widely
    ◦ TREC
    ◦ Document Retrieval Enterprise Systems
    ◦ Part of Database/Web engine
    ◦ Part of Nutch
    ◦ Used by academics for large projects
Conclusions
  Lucene is a good set of classes
    ◦ Designed to allow customization without having to “reinvent the wheel”
    ◦ Robust
    ◦ Fast
    ◦ Large development groups
    ◦ Used widely in academia and industry