Lucene Performance Grant Ingersoll November 16, 2007 Atlanta, GA.


Lucene Performance

Grant Ingersoll

November 16, 2007

Atlanta, GA


Overview

• Defining Performance

• Basics

• Indexing

  – Parameters

  – Threading

• Search

• Document Retrieval

• Search Quality


Defining Performance

• Many factors in assessing Lucene (and search) performance

• Speed

• Quality of results (subjective)

  – Precision

    • # relevant retrieved out of # retrieved

  – Recall

    • # relevant retrieved out of total # relevant

• Size of index

  – Compression rate

• Other Factors:

  – Local vs. distributed
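
The precision and recall definitions above can be worked through in a few lines of plain Java. This is a minimal sketch, not part of Lucene or contrib/benchmark: the class name, method names and document ids are hypothetical, and the relevance judgments are assumed to be given as a set of ids.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PrecisionRecall {
    /** Counts distinct retrieved ids that also appear in the relevant set. */
    static int relevantRetrieved(List<String> retrieved, Set<String> relevant) {
        int hits = 0;
        for (String id : new HashSet<>(retrieved)) { // de-duplicate the result list
            if (relevant.contains(id)) {
                hits++;
            }
        }
        return hits;
    }

    /** # relevant retrieved out of # retrieved (0.0 when nothing was retrieved). */
    static double precision(List<String> retrieved, Set<String> relevant) {
        int distinct = new HashSet<>(retrieved).size();
        if (distinct == 0) {
            return 0.0;
        }
        return (double) relevantRetrieved(retrieved, relevant) / distinct;
    }

    /** # relevant retrieved out of total # relevant (0.0 when nothing is relevant). */
    static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) relevantRetrieved(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        // Hypothetical ids: 4 results returned, 3 documents judged relevant, 2 in common.
        List<String> retrieved = Arrays.asList("d1", "d2", "d3", "d4");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d7"));
        System.out.println("precision = " + precision(retrieved, relevant));
        System.out.println("recall    = " + recall(retrieved, relevant));
    }
}
```

In the sample data, two of the four retrieved documents are relevant (precision 2/4) and two of the three relevant documents were found (recall 2/3).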


Basics

• Consider latest version of Lucene

  – Lucene 2.3/Trunk has many performance improvements over prior versions

• Consider Solr

  – Solr employs many Lucene best practices

• contrib/benchmark can help assess many aspects of performance, including speed, precision and recall

  – Task based approach makes for easy extension

• Sanity check your needs

• Profile to identify bottlenecks


Indexing Factors

• Lucene indexes Documents into memory

• On certain occasions, memory is flushed to the index representation (called a segment)

• Segments are periodically merged

• Internal Lucene models are changing and (drastically) improving performance


IndexWriter factors

• setMaxBufferedDocs controls minimum # of docs before merge occurs

  – Larger == faster

  – Uses more RAM

• setMergeFactor controls how often segments are merged

  – Smaller == less RAM, better for large # of updates

  – Larger == faster, better for batch indexing

• setMaxFieldLength controls the # of terms indexed from a single field of a document

• setUseCompoundFile controls the file format Lucene uses. Turning off compound file format is faster, but you could run out of file descriptors
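
To get a feel for how setMaxBufferedDocs and setMergeFactor interact, the sketch below simulates a deliberately simplified merge scheme in plain Java. It does not call Lucene. The model is an assumption made for illustration: every maxBufferedDocs documents become one segment, and whenever mergeFactor segments of equal size exist they are merged into one larger segment. Real merge policies are more involved, and the class and field names are hypothetical.

```java
class MergeModel {
    /** Counters collected from one simulated indexing run. */
    static final class Stats {
        final int flushes;       // segments written from the in-memory buffer
        final int merges;        // merge operations performed
        final long docsCopied;   // documents rewritten by those merges
        final int finalSegments; // segments left when indexing stops

        Stats(int flushes, int merges, long docsCopied, int finalSegments) {
            this.flushes = flushes;
            this.merges = merges;
            this.docsCopied = docsCopied;
            this.finalSegments = finalSegments;
        }
    }

    static Stats simulate(int totalDocs, int maxBufferedDocs, int mergeFactor) {
        if (maxBufferedDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need maxBufferedDocs >= 1 and mergeFactor >= 2");
        }
        int[] segmentsAtLevel = new int[32];       // level k holds segments of size maxBufferedDocs * mergeFactor^k
        int flushes = totalDocs / maxBufferedDocs; // a partial final buffer is ignored in this model
        int merges = 0;
        long docsCopied = 0;
        for (int f = 0; f < flushes; f++) {
            int level = 0;
            long segmentSize = maxBufferedDocs;
            segmentsAtLevel[0]++;
            // Cascade: mergeFactor equal-sized segments collapse into one segment a level up.
            while (segmentsAtLevel[level] == mergeFactor) {
                segmentsAtLevel[level] = 0;
                segmentSize *= mergeFactor;
                docsCopied += segmentSize;
                merges++;
                level++;
                segmentsAtLevel[level]++;
            }
        }
        int finalSegments = 0;
        for (int count : segmentsAtLevel) {
            finalSegments += count;
        }
        return new Stats(flushes, merges, docsCopied, finalSegments);
    }

    public static void main(String[] args) {
        for (int mergeFactor : new int[] {2, 10, 100}) {
            Stats s = simulate(1000, 10, mergeFactor);
            System.out.println("mergeFactor=" + mergeFactor + " merges=" + s.merges
                    + " docsCopied=" + s.docsCopied + " finalSegments=" + s.finalSegments);
        }
    }
}
```

Under this model a larger merge factor means fewer merges and less copying during a batch build, and a larger buffer means fewer flushes, which matches the "larger == faster" guidance on the slide. The model ignores the RAM and file-descriptor costs that pull in the other direction.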


Lucene 2.3 IndexWriter Changes

• setRAMBufferSizeMB

  – New model for automagically controlling indexing factors based on the amount of memory in use

  – Obsoletes setMaxBufferedDocs and setMergeFactor

• Takes storage and term vectors out of the merge process

• Turn off auto-commit if there are stored fields and term vectors

• Provides significant performance increase


Analysis

• An Analyzer is a Tokenizer followed by zero or more TokenFilters

• More complicated analysis, slower indexing

  – Many applications could use simpler Analyzers than the StandardAnalyzer

  – StandardTokenizer is now faster in 2.3 (thus making StandardAnalyzer faster)

• Reuse in 2.3:

  – Re-use Token, Document and Field instances

  – Use the char[] API with Token instead of String API
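
The "Tokenizer plus TokenFilters" structure can be pictured as a small pipeline. The sketch below is plain Java and does not use the Lucene analysis classes; the method names and the stopword set are hypothetical stand-ins. It only shows why each extra filter adds work per token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalysisChain {
    /** "Tokenizer" stage: split the text on runs of whitespace. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** "TokenFilter" stage 1: lowercase every token. */
    static List<String> lowerCase(List<String> in) {
        List<String> out = new ArrayList<>(in.size());
        for (String t : in) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** "TokenFilter" stage 2: drop tokens found in the stopword set. */
    static List<String> removeStopwords(List<String> in, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String t : in) {
            if (!stopwords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    /** The "Analyzer": the tokenizer followed by the filters, in order. */
    static List<String> analyze(String text, Set<String> stopwords) {
        return removeStopwords(lowerCase(tokenize(text)), stopwords);
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "and"));
        System.out.println(analyze("The Quick brown fox and THE dog", stopwords));
    }
}
```

Each stage here allocates a fresh list and fresh strings for clarity. The reuse advice on the slide is about avoiding exactly that kind of per-token allocation.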


Thread Safety

• Use a single IndexWriter for the duration of indexing

• Share IndexWriter between threads

• Parallel Indexing

  – Index to separate Directory instances

  – Merge when done with IndexWriter.addIndexes()

  – Distribute and collect
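
The "distribute and collect" pattern can be sketched without Lucene: each worker builds its own private structure, and a single merge step runs at the end. In the plain-Java illustration below, a term-count map stands in for a per-thread index and the final loop stands in for the addIndexes() step. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class DistributeAndCollect {
    /** Builds a private term -> count map for one slice of the documents. */
    static Map<String, Integer> indexSlice(List<String> docs) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String doc : docs) {
            for (String term : doc.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!term.isEmpty()) {
                    partial.merge(term, 1, Integer::sum);
                }
            }
        }
        return partial;
    }

    /** Distributes slices to worker threads, then collects and merges the partial results. */
    static Map<String, Integer> indexInParallel(List<String> docs, int workers) {
        if (workers < 1) {
            throw new IllegalArgumentException("workers must be >= 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            int sliceSize = (docs.size() + workers - 1) / workers; // ceiling division
            for (int start = 0; start < docs.size(); start += sliceSize) {
                final List<String> slice =
                        docs.subList(start, Math.min(start + sliceSize, docs.size()));
                futures.add(pool.submit(() -> indexSlice(slice)));
            }
            Map<String, Integer> merged = new TreeMap<>();
            for (Future<Map<String, Integer>> future : futures) {
                for (Map.Entry<String, Integer> e : future.get().entrySet()) {
                    merged.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("indexing worker failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because no worker touches another worker's map, the indexing phase needs no locking; only the collect step is serial.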


Other Indexing Factors

• NFS

  – Have been some improvements lately, but…

  – “proceed with caution”

  – Not as good as local filesystem

• Replication

  – Index locally and then use rsync to replicate copies of index to other servers

  – Have I mentioned Solr?


Benchmarking Indexing

• contrib/benchmark

• Try out different algorithms between Lucene 2.2 and trunk (2.3)

  – contrib/benchmark/conf:

    • indexing.alg

    • indexing-multithreaded.alg

• Info:

  – Mac Pro 2 x 2GHz Dual-Core Xeon

  – 4 GB RAM

  – ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M


Benchmarking Results

                Records/Sec   Avg. T Mem
2.2                     421          39M
Trunk                 2,122          52M
Trunk-mt (4)          3,680          57M
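
The ratios behind the table are easy to check. The small Java sketch below only divides the Records/Sec figures reported above; it adds no new measurements, and the class and method names are hypothetical.

```java
import java.util.Locale;

class IndexingSpeedup {
    /** Ratio of a candidate throughput to a baseline throughput (records/sec). */
    static double speedup(double baseline, double candidate) {
        return candidate / baseline;
    }

    public static void main(String[] args) {
        double lucene22 = 421;  // records/sec for 2.2, from the table
        double trunk = 2122;    // records/sec for trunk, single-threaded
        double trunkMt = 3680;  // records/sec for trunk with 4 threads
        System.out.println(String.format(Locale.ROOT,
                "trunk vs 2.2: %.2fx", speedup(lucene22, trunk)));
        System.out.println(String.format(Locale.ROOT,
                "trunk-mt vs 2.2: %.2fx", speedup(lucene22, trunkMt)));
        System.out.println(String.format(Locale.ROOT,
                "trunk-mt vs trunk: %.2fx", speedup(trunk, trunkMt)));
    }
}
```

On these figures trunk is roughly 5x faster than 2.2 single-threaded and close to 9x with four threads, while the reported memory rises from 39M to 57M.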


Search Performance

• Many factors influence search speed

  – Query Type, size, analysis, # of occurrences, index size, index optimization, index type

  – Known Enemies

• Search Quality also has many factors

  – Query formulation, synonyms, analysis, etc.

  – How to judge quality?


Query Types

• Some queries in Lucene get rewritten into simpler queries:

  – WildcardQuery rewrites to a BooleanQuery of all the terms that satisfy the wildcards

    • a* -> abe, apple, an, and, array…

  – Likewise with RangeQuery, especially with date ranges
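
The cost of a wildcard comes from the expansion step: every matching term in the index becomes its own clause. The plain-Java sketch below mimics that for the trailing-wildcard case only (a*), with a sorted set standing in for the term dictionary. It is an illustration of the idea, not Lucene's rewrite code, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class PrefixExpansion {
    /** Returns every dictionary term that starts with the prefix, in sorted order. */
    static List<String> expand(TreeSet<String> termDictionary, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet starts at the first term >= prefix; stop at the first term that no longer matches.
        for (String term : termDictionary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    /** Renders the expansion as one OR clause per matching term. */
    static String rewrite(TreeSet<String> termDictionary, String prefix) {
        return String.join(" OR ", expand(termDictionary, prefix));
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(
                Arrays.asList("abe", "apple", "an", "and", "array", "bee", "cat"));
        System.out.println("a*  -> " + rewrite(terms, "a"));
        System.out.println("an* -> " + rewrite(terms, "an"));
    }
}
```

A short prefix over a large index can expand to a very large number of clauses, which is why the same concern applies to wide ranges.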


Query Size

• Stopword removal can help reduce size

• Choose expansions carefully

• Consider using fewer fields to search over

• When doing relevance feedback, don’t use the whole document; focus on the most important terms instead


Index Factors for Search

• Size:

  – More unique terms, more to search

  – Stopword removal and stemming can help reduce size

  – Not a linear factor due to index compression

• Type

  – RAMDirectory if the index is small

  – MMapDirectory may perform better


Search Speed Tips

• IndexSearcher

  – Thread-safe, so share

  – Open once and use as long as possible

• Cache Filters when appropriate

• Optimize if you have the time

• Warm up your Searcher first by sending it some preliminary queries before making it live


Known Enemies

• CPU, Memory, I/O are all known enemies of performance

  – Can’t live without them, either!

• Profile, run benchmarks, look at garbage collection policies, etc.

• Check your needs

  – Do you need wildcards?

  – Do you need so many Fields?


Document Retrieval

• Common Search Scenario:

  – Many small Fields containing info about the Document

  – One or two big Fields storing content

  – Run search, display small Fields to user

  – User picks one result to view content


FieldSelector

• Gives developer greater control over how the Document is loaded

  – Load, Lazy, No Load, Load and Break, Size, etc.

• In previous scenario, lazy load the large Fields

• Easier to store original content without performance penalty
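
The idea behind lazy loading is that a large field costs nothing until someone asks for its value. The plain-Java sketch below models only that idea. It is not the Lucene FieldSelector API: the LazyField class is hypothetical, and the Supplier stands in for the read from the stored-fields file.

```java
import java.util.function.Supplier;

class LazyField {
    private final String name;
    private final Supplier<String> loader; // stands in for the read from disk
    private String value;                  // cached after the first access
    private boolean loaded;

    LazyField(String name, Supplier<String> loader) {
        this.name = name;
        this.loader = loader;
    }

    String name() {
        return name;
    }

    boolean isLoaded() {
        return loaded;
    }

    /** Loads the value on the first call, then returns the cached copy. */
    String stringValue() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    public static void main(String[] args) {
        LazyField content = new LazyField("content", () -> {
            System.out.println("(reading large field)");
            return "full text of the document";
        });
        // Building a result list touches only small fields, so nothing is read yet.
        System.out.println("loaded before access: " + content.isLoaded());
        // The user picks this result: the large field is read once, on demand.
        System.out.println(content.stringValue().length() + " chars");
        System.out.println("loaded after access: " + content.isLoaded());
    }
}
```

This sketch is not thread-safe; it is meant for a document object that is used by a single request.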


Quality Queries

• Evaluating search quality is difficult and subjective

• Lucene provides good out of the box quality by most accounts

• Can evaluate using TREC or other experiments, but these risk overtuning

• Unfortunately, judging quality is a labor-intensive task


Quality Experiments

• Needs:

  – Standard collection of docs - easy

  – Set of queries

    • Query logs

    • Develop in-house

    • TREC, other conferences

  – Set of judgments

    • Labor intensive

    • Can use log analysis to determine estimates of which queries are relevant based on clicks, etc.


Query Formulation

• Invest the time in determining the proper analysis of the fields you are searching

  – Case sensitive search

  – Punctuation analysis

  – Strict matching

• Stopword policy

  – Stopwords can be useful

• Operator choice

• Synonym choices


Effective Scoring

• Similarity class provides callback mechanism for controlling how some Lucene scoring factors count towards the score

  – tf(), idf(), coord()

• Experiment with different length normalization factors

  – You may find Lucene is overemphasizing shorter or longer documents
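
To see what these factors look like numerically, the sketch below reimplements them as plain Java functions. The formulas are the ones I understand the default Similarity of the 2.x line to use (square-root tf, log-based idf, 1/sqrt(numTerms) length norm); treat that as an assumption and check the javadoc for your version. The flattenedLengthNorm variant and its pivot parameter are hypothetical, included only to show the kind of experiment the slide suggests.

```java
class ScoringFactors {
    /** Term frequency factor: grows with the square root of the raw count. */
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: rarer terms score higher. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Coordination factor: fraction of the query terms that matched. */
    static double coord(int overlap, int maxOverlap) {
        return overlap / (double) maxOverlap;
    }

    /** Default-style length normalization: shorter fields get a larger boost. */
    static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /**
     * Hypothetical alternative: fields shorter than the pivot all get the pivot's norm,
     * so very short fields are no longer favored over moderately short ones.
     */
    static double flattenedLengthNorm(int numTerms, int pivot) {
        return 1.0 / Math.sqrt(Math.max(numTerms, pivot));
    }

    public static void main(String[] args) {
        for (int numTerms : new int[] {4, 100, 400}) {
            System.out.println(numTerms + " terms: default=" + defaultLengthNorm(numTerms)
                    + " flattened=" + flattenedLengthNorm(numTerms, 100));
        }
    }
}
```

With the default-style norm, a 4-term field is weighted five times as heavily as a 100-term field; the flattened variant treats them the same and only penalizes fields longer than the pivot.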


Effective Scoring

• Can also implement your own Query class

  – Ask if anyone else has done it first on java-user mailing list

• Go beyond the obvious:

  – org.apache.lucene.search.function package provides means for using values of Fields to change the scores

    • Geographic scoring, user ratings, others

• Payloads (stay tuned for next presentation)