Lucene Performance Grant Ingersoll November 16, 2007 Atlanta, GA.
Lucene Performance
Grant Ingersoll
November 16, 2007
Atlanta, GA
Overview
• Defining Performance
• Basics
• Indexing
– Parameters
– Threading
• Search
• Document Retrieval
• Search Quality
Defining Performance
• Many factors in assessing Lucene (and search) performance
• Speed
• Quality of results (subjective)
– Precision: # relevant retrieved out of # retrieved
– Recall: # relevant retrieved out of total # relevant
• Size of index
– Compression rate
• Other Factors:
– Local vs. distributed
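
The precision and recall definitions on this slide can be made concrete with a small calculation. The following is a minimal sketch in plain Java, not part of Lucene; the class name `PrecisionRecall` and the document ids are made up for illustration, and it assumes relevance judgments are available as a set of ids.

```java
import java.util.HashSet;
import java.util.Set;

class PrecisionRecall {
    /** Precision: # relevant retrieved out of # retrieved. */
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0; // nothing retrieved: report 0 rather than divide by zero
        }
        return (double) relevantRetrieved(retrieved, relevant) / retrieved.size();
    }

    /** Recall: # relevant retrieved out of total # relevant. */
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // no relevant documents are known for this query
        }
        return (double) relevantRetrieved(retrieved, relevant) / relevant.size();
    }

    /** Size of the intersection of the two sets. */
    private static int relevantRetrieved(Set<String> retrieved, Set<String> relevant) {
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return hits.size();
    }
}
```

For example, if a query returns 4 documents of which 2 are relevant, and 5 relevant documents exist in total, precision is 2/4 and recall is 2/5.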
Basics
• Consider latest version of Lucene
– Lucene 2.3/Trunk has many performance improvements over prior versions
• Consider Solr
– Solr employs many Lucene best practices
• contrib/benchmark can help assess many aspects of performance, including speed, precision and recall
– Task based approach makes for easy extension
• Sanity check your needs
• Profile to identify bottlenecks
Indexing Factors
• Lucene indexes Documents into memory
• On certain occasions, memory is flushed to the index representation (called a segment)
• Segments are periodically merged
• Internal Lucene models are changing and (drastically) improving performance
IndexWriter factors
• setMaxBufferedDocs controls minimum # of docs before merge occurs
– Larger == faster
– Larger == more RAM
• setMergeFactor controls how often segments are merged
– Smaller == less RAM, better for large # of updates
– Larger == faster, better for batch
• setMaxFieldLength controls the # of terms indexed from a document
• setUseCompoundFile controls the file format Lucene uses. Turning off compound file format is faster, but you could run out of file descriptors
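
To get a feel for how the buffer size and merge factor interact, here is a toy simulation in plain Java. It is a deliberately simplified model, not Lucene's actual merge code: it assumes a segment is flushed every `maxBufferedDocs` documents and that `mergeFactor` segments at the same level are merged into one segment at the next level. The class name `MergeSimulation` is hypothetical.

```java
class MergeSimulation {
    /**
     * Simulates indexing totalDocs documents under the simplified model.
     * Returns {flushes, merges, segmentsLeft}.
     */
    static int[] simulate(int totalDocs, int maxBufferedDocs, int mergeFactor) {
        if (maxBufferedDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need maxBufferedDocs >= 1 and mergeFactor >= 2");
        }
        // segmentsAtLevel[i] = number of segments currently sitting at level i
        int[] segmentsAtLevel = new int[33];
        int flushes = 0;
        int merges = 0;
        int buffered = 0;
        for (int doc = 0; doc < totalDocs; doc++) {
            buffered++;
            if (buffered < maxBufferedDocs) {
                continue; // keep buffering in memory
            }
            // Buffer is full: flush it as a new level-0 segment.
            buffered = 0;
            flushes++;
            int level = 0;
            segmentsAtLevel[level]++;
            // Cascade: mergeFactor segments at one level become one segment at the next.
            while (segmentsAtLevel[level] == mergeFactor) {
                segmentsAtLevel[level] = 0;
                merges++;
                level++;
                segmentsAtLevel[level]++;
            }
        }
        // Any documents still buffered are flushed as one final segment.
        int segmentsLeft = 0;
        if (buffered > 0) {
            flushes++;
            segmentsLeft++;
        }
        for (int count : segmentsAtLevel) {
            segmentsLeft += count;
        }
        return new int[] {flushes, merges, segmentsLeft};
    }
}
```

Under this model, indexing 1,000 documents with a buffer of 10 and a merge factor of 10 takes 100 flushes and 11 merges, while a buffer of 100 takes 10 flushes and 1 merge. That matches the direction of the "larger == faster" advice above, at the cost of holding more documents in RAM.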
Lucene 2.3 IndexWriter Changes
• setRAMBufferSizeMB
– New model for automagically controlling indexing factors based on the amount of memory in use
– Obsoletes setMaxBufferedDocs and setMergeFactor
• Takes storage and term vectors out of the merge process
• Turn off auto-commit if there are stored fields and term vectors
• Provides significant performance increase
Analysis
• An Analyzer is a Tokenizer and zero or more TokenFilters
• More complicated analysis, slower indexing
– Many applications could use simpler Analyzers than the StandardAnalyzer
– StandardTokenizer is now faster in 2.3 (thus making StandardAnalyzer faster)
• Reuse in 2.3:
– Re-use Token, Document and Field instances
– Use the char[] API with Token instead of String API
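
The tokenizer-plus-filters structure of an Analyzer can be sketched without Lucene at all. The code below is an illustration in plain Java, not the Lucene Analyzer API; the class `ToyAnalyzer`, its method names, and the stopword list are assumptions. Each added stage is another pass over every token, which is why more complicated analysis means slower indexing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyAnalyzer {
    /** Tokenizer stage: split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Filter stage 1: lower-case every token. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Filter stage 2: drop tokens found in the stopword set. */
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!stopwords.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    /** The "analyzer": one tokenizer followed by a chain of filters. */
    static List<String> analyze(String text, Set<String> stopwords) {
        return removeStopwords(lowerCase(tokenize(text)), stopwords);
    }
}
```

With the stopwords "the" and "and", the text "The Quick brown fox, and THE dog" analyzes to quick, brown, fox, dog.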
Thread Safety
• Use a single IndexWriter for the duration of indexing
• Share IndexWriter between threads
• Parallel Indexing
– Index to separate Directory instances
– Merge when done with IndexWriter.addIndexes()
– Distribute and collect
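
The "single writer shared between threads" pattern can be sketched as follows. This is plain Java with a hypothetical `ToyWriter` standing in for a thread-safe index writer; it does not use Lucene, and the thread count and timeout are arbitrary assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SharedWriterDemo {
    /** Hypothetical stand-in for a writer that synchronizes internally. */
    static final class ToyWriter {
        private final List<String> docs = new ArrayList<>();

        synchronized void addDocument(String doc) {
            docs.add(doc);
        }

        synchronized int docCount() {
            return docs.size();
        }
    }

    /** Indexes all docs through ONE writer shared by a pool of worker threads. */
    static int indexAll(List<String> docs, int threads) {
        ToyWriter writer = new ToyWriter(); // opened once for the whole run
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : docs) {
            pool.execute(() -> writer.addDocument(doc));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return writer.docCount();
    }
}
```

The point of the sketch is the ownership: worker threads never open their own writer on the same index, they only call into the shared one.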
Other Indexing Factors
• NFS
– Have been some improvements lately, but…
– “proceed with caution”
– Not as good as local filesystem
• Replication
– Index locally and then use rsync to replicate copies of index to other servers
– Have I mentioned Solr?
Benchmarking Indexing
• contrib/benchmark
• Try out different algorithms between Lucene 2.2 and trunk (2.3)
– contrib/benchmark/conf: indexing.alg, indexing-multithreaded.alg
• Info:
– Mac Pro 2 x 2GHz Dual-Core Xeon
– 4 GB RAM
– ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M
Benchmarking Results

Version        Records/Sec   Avg. T Mem
2.2                    421          39M
Trunk                2,122          52M
Trunk-mt (4)         3,680          57M
Search Performance
• Many factors influence search speed
– Query Type, size, analysis, # of occurrences, index size, index optimization, index type
– Known Enemies
• Search Quality also has many factors
– Query formulation, synonyms, analysis, etc.
– How to judge quality?
Query Types
• Some queries in Lucene get rewritten into simpler queries:
– WildcardQuery rewrites to a BooleanQuery of all the terms that satisfy the wildcards, e.g. a* -> abe, apple, an, and, array…
– Likewise with RangeQuery, especially with date ranges
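
Why a wildcard can be expensive becomes clearer when the expansion is written out. The sketch below is plain Java over a sorted set that stands in for the index's term dictionary; `PrefixExpansion` and its method are hypothetical names, not Lucene classes. It covers only the trailing-wildcard case (a*).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

class PrefixExpansion {
    /**
     * Returns every term that starts with the given prefix. Each returned term
     * would become one clause of the rewritten boolean query.
     */
    static List<String> expandPrefix(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet jumps to the first term >= prefix; matching terms are contiguous.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the last term sharing the prefix
            }
            matches.add(term);
        }
        return matches;
    }
}
```

A short prefix on a large index can match thousands of terms, and each one adds a clause to the rewritten query. Lucene's BooleanQuery enforces a maximum clause count (1024 by default) for this reason.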
Query Size
• Stopword removal can help reduce size
• Choose expansions carefully
• Consider using fewer fields to search over
• When doing relevance feedback, don’t use whole document, instead focus on most important terms
Index Factors for Search
• Size:
– More unique terms, more to search
– Stopword removal and stemming can help reduce size
– Not a linear factor due to index compression
• Type
– RAMDirectory if the index is small enough to fit in memory
– MMapDirectory may perform better
Search Speed Tips
• IndexSearcher
– Thread-safe, so share
– Open once and use as long as possible
• Cache Filters when appropriate
• Optimize if you have the time
• Warm up your Searcher first by sending it some preliminary queries before making it live
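
Filter caching boils down to computing the set of matching documents once and reusing it across searches. Lucene provides CachingWrapperFilter for this; the sketch below only illustrates the idea in plain Java, with a hypothetical `FilterCache` class and string keys chosen for the example.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class FilterCache {
    private final Map<String, BitSet> cache = new ConcurrentHashMap<>();
    private final AtomicInteger computations = new AtomicInteger();

    /**
     * Returns the cached bit set for the key, running the (expensive)
     * computation only the first time the key is seen.
     * Callers must treat the returned BitSet as read-only, since it is shared.
     */
    BitSet get(String filterKey, Supplier<BitSet> compute) {
        return cache.computeIfAbsent(filterKey, key -> {
            computations.incrementAndGet();
            return compute.get();
        });
    }

    /** How many times a filter actually had to be computed. */
    int computations() {
        return computations.get();
    }
}
```

A cache like this is only valid for as long as the underlying index view stays the same, which is one more reason to open the searcher once and keep it.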
Known Enemies
• CPU, Memory, I/O are all known enemies of performance
– Can’t live without them, either!
• Profile, run benchmarks, look at garbage collection policies, etc.
• Check your needs
– Do you need wildcards?
– Do you need so many Fields?
Document Retrieval
• Common Search Scenario:
– Many small Fields containing info about the Document
– One or two big Fields storing content
– Run search, display small Fields to user
– User picks one result to view content
FieldSelector
• Gives developer greater control over how the Document is loaded
– Load, Lazy, No Load, Load and Break, Size, etc.
• In previous scenario, lazy load the large Fields
• Easier to store original content without performance penalty
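
The lazy option is what makes the earlier scenario cheap: the large content Field is only read if the user actually opens the result. The following plain-Java sketch shows the idea with a hypothetical `LazyField` class; it is not the Lucene FieldSelector API, and it is not thread-safe.

```java
import java.util.function.Supplier;

class LazyField {
    private final Supplier<String> loader; // e.g. reads the stored value from disk
    private String value;
    private boolean loaded;

    LazyField(Supplier<String> loader) {
        this.loader = loader;
    }

    /** Loads the value on first access and remembers it afterwards. */
    String stringValue() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    boolean isLoaded() {
        return loaded;
    }
}
```

A results page that only displays the small Fields never triggers the loader for the big one, so storing the original content adds no cost to the common case.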
Quality Queries
• Evaluating search quality is difficult and subjective
• Lucene provides good out of the box quality by most accounts
• Can evaluate using TREC or other experiments, but these risk overtuning
• Unfortunately, judging quality is a labor-intensive task
Quality Experiments
• Needs:
– Standard collection of docs (easy)
– Set of queries: query logs, developed in-house, or from TREC and other conferences
– Set of judgments: labor intensive; can use log analysis to estimate which results are relevant to a query based on clicks, etc.
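
One way to turn click logs into rough judgments is to treat a result as relevant when it is clicked often enough relative to how often it is shown. The sketch below is plain Java under that assumption; the `ClickJudgments` class, the `Impression` record shape, and the click-rate threshold are all hypothetical, and clicks are only a noisy proxy for relevance.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class ClickJudgments {
    /** One log entry: a document shown for a query, and whether it was clicked. */
    static final class Impression {
        final String query;
        final String docId;
        final boolean clicked;

        Impression(String query, String docId, boolean clicked) {
            this.query = query;
            this.docId = docId;
            this.clicked = clicked;
        }
    }

    /** Maps each query to the documents whose click rate is at least minClickRate. */
    static Map<String, Set<String>> estimate(List<Impression> log, double minClickRate) {
        // query -> docId -> {times shown, times clicked}
        Map<String, Map<String, int[]>> counts = new TreeMap<>();
        for (Impression impression : log) {
            int[] c = counts
                    .computeIfAbsent(impression.query, q -> new TreeMap<String, int[]>())
                    .computeIfAbsent(impression.docId, d -> new int[2]);
            c[0]++;
            if (impression.clicked) {
                c[1]++;
            }
        }
        Map<String, Set<String>> judged = new TreeMap<>();
        for (Map.Entry<String, Map<String, int[]>> perQuery : counts.entrySet()) {
            Set<String> relevant = new TreeSet<>();
            for (Map.Entry<String, int[]> perDoc : perQuery.getValue().entrySet()) {
                int[] c = perDoc.getValue();
                if ((double) c[1] / c[0] >= minClickRate) {
                    relevant.add(perDoc.getKey());
                }
            }
            judged.put(perQuery.getKey(), relevant);
        }
        return judged;
    }
}
```

The resulting sets can stand in for hand-made judgments when computing precision and recall, keeping in mind that position bias and low traffic make them estimates only.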
Query Formulation
• Invest the time in determining the proper analysis of the fields you are searching
– Case sensitive search
– Punctuation analysis
– Strict matching
• Stopword policy
– Stopwords can be useful
• Operator choice
• Synonym choices
Effective Scoring
• Similarity class provides callback mechanism for controlling how some Lucene scoring factors count towards the score
– tf(), idf(), coord()
• Experiment with different length normalization factors
– You may find Lucene is overemphasizing shorter or longer documents
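
To see what these factors do numerically, the sketch below reimplements them in plain Java. The formulas are what Lucene 2.x's DefaultSimilarity computes to the best of my understanding, so check them against the source of the version you run. `ToySimilarity` and `flatLengthNorm` are hypothetical names, and the flattened norm is just one example of an alternative to experiment with.

```java
class ToySimilarity {
    /** Term frequency factor: grows with the square root of the raw count. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: rarer terms score higher. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Coordination factor: fraction of the query terms the document matched. */
    static double coord(int overlap, int maxOverlap) {
        return overlap / (double) maxOverlap;
    }

    /** Default-style length normalization: shorter fields score higher. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /**
     * Alternative to experiment with (an assumption, not a Lucene default):
     * treat every field shorter than minTerms as if it had minTerms terms,
     * so very short documents stop receiving an outsized boost.
     */
    static double flatLengthNorm(int numTerms, int minTerms) {
        return 1.0 / Math.sqrt(Math.max(numTerms, minTerms));
    }
}
```

With the default-style norm, a 4-term field gets a factor of 0.5 and a 100-term field gets 0.1, a fivefold advantage for the short field. With `flatLengthNorm(4, 100)` both get 0.1.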
Effective Scoring
• Can also implement your own Query class
– Ask if anyone else has done it first on java-user mailing list
• Go beyond the obvious:
– org.apache.lucene.search.function package provides means for using values of Fields to change the scores (geographic scoring, user ratings, others)
• Payloads (stay tuned for next presentation)
Resources
• Talk available at: http://lucene.grantingersoll.com/apachecon07/LucenePerformance.ppt
• http://lucene.apache.org
• Mailing List
– java-user
• Lucene In Action
– http://www.lucenebook.com