Query Latency Optimization with Lucene

Stefan Pohl, Ph.D., Sr. Research Engineer

[email protected]

7 Nov 2013

Who Am I

● Search user, developer, researcher

● Many years in industry & academia

● Ph.D. in Information Retrieval

● Interests: Search, Big Data, Machine Learning

● Currently working on the Geocoding offer of HERE, Nokia's Location Platform

● Spare time: Lucene contributor

http://developer.here.com/geocoder


Agenda

● Motivation

● Latency Optimization

● Query Processing / Scoring

● Recent Developments in Lucene


Motivation: Query Latency

● Human Reaction Time: 200 ms*

→ Backend latency: << 200 ms

● Faster queries mean a higher manageable load

● Costs

* Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in Software, Addison-Wesley Professional, 2008.


Motivation: Query Latency Distribution


Latency Optimization


First: Do Your Homework

● Keep enough RAM for OS (disk buffer cache)

● Reduce HDD “pressure” (e.g. throttle indexing)

● SSDs

● Warming

● Ideally: your index fits in memory

See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed


Mining Hypothesis

● Check if query latencies are reproducible

● If not, try to find correlations with system events:
  – Many new incoming docs to index?
  – Other daemons spike in disk or CPU activity?
  – Garbage Collections?
  – Other sar statistics (e.g. paging)

● If yes, profile
  – First, your code
  – Don't instrument Lucene internal low-level classes


Hypothesis Testing

● You really think you understand the problem and have a potential solution?

● Try it out (if it's cheap)!

● Otherwise, think of (cheap) experiments that
  – Give confidence
  – Tell you (and others) what the gains are (ROI)


Example: In-memory

● Buy more memory / bigger machine !?

● Simulate¹
  – Consecutively execute the same query multiple times
  – Much lower memory requirement (i.e. the size of the involved postings)
  – Repeat for sample of queries of interest

● Gives lower bound on query latency

¹ S. Pohl, A. Moffat. Measurement Techniques and Caching Effects. In Proceedings of the 31st European Conference on Information Retrieval, Toulouse, France, April 2009. Springer.

http://dx.doi.org/10.1007/978-3-642-00958-7_68
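The simulation described on this slide could be scripted roughly as follows. This is a minimal sketch, not code from the talk: the function name warm_latency, the run_query callback and the repeats parameter are assumptions made for illustration. The first execution is treated as the cold run, and the fastest of the remaining runs approximates the latency once the involved postings sit in the OS buffer cache.

```python
import time

def warm_latency(run_query, query, repeats=5):
    """Run the same query several times and return the best warm timing in seconds.

    run_query is a hypothetical callback that executes one query against
    your search backend; it is not part of any Lucene API.
    """
    if repeats < 2:
        raise ValueError("need at least one cold and one warm run")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(query)
        timings.append(time.perf_counter() - start)
    # Skip the first (cold) run; later runs hit the caches warmed by it.
    return min(timings[1:])
```

Calling this for a sample of queries of interest gives a per-query lower bound that can be compared with the latencies observed in production.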


Query Processing


Conjunctions (i.e. AND / Occur.MUST)

● Sort Boolean clauses by increasing DocFreq f_t

● Next() on sparsest posting list (“lead”)

● Advance(18) on next sparsest posting list → fail

● Start all over again with “lead”, but advance(22)

● Try to advance(31) on all other posting lists

● Match found → R = {31, …

● Next() on “lead” → R = {31}
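The walk-through above can be reproduced with a small simulation. This is a sketch of the general leapfrog idea, not Lucene's scorer code. The slide figure with the actual posting lists did not survive extraction, so the three lists in the usage example are hypothetical values chosen only to be consistent with the steps shown (lead starts at 18, the next list answers advance(18) with 22, everything meets at 31).

```python
from bisect import bisect_left

NO_MORE_DOCS = float("inf")

class PostingList:
    """Sorted doc IDs with next()/advance() in the spirit of a doc ID iterator."""

    def __init__(self, doc_ids):
        self.docs = sorted(doc_ids)
        self.pos = -1  # not positioned on any document yet

    def doc_freq(self):
        return len(self.docs)

    def _current(self):
        return self.docs[self.pos] if self.pos < len(self.docs) else NO_MORE_DOCS

    def next(self):
        self.pos += 1
        return self._current()

    def advance(self, target):
        # Move forward to the first doc ID >= target (never backwards).
        self.pos = bisect_left(self.docs, target, max(self.pos, 0))
        return self._current()

def conjunction(posting_lists):
    """Return doc IDs contained in every posting list."""
    ordered = sorted(posting_lists, key=lambda p: p.doc_freq())
    lead, others = ordered[0], ordered[1:]
    matches = []
    doc = lead.next()
    while doc != NO_MORE_DOCS:
        for other in others:
            candidate = other.advance(doc)
            if candidate != doc:
                # Fail: start over with the lead at the larger candidate.
                doc = lead.advance(candidate)
                break
        else:
            matches.append(doc)  # every list is positioned on doc
            doc = lead.next()
    return matches

# Hypothetical posting lists, consistent with the steps on the slide.
lead = PostingList([18, 31])
second = PostingList([5, 9, 16, 22, 27, 31, 37])
third = PostingList([2, 4, 7, 11, 12, 20, 26, 29, 31, 32])
print(conjunction([third, second, lead]))  # [31]
```

Because the sparsest list drives the iteration, the number of advance() calls is bounded by its length, which is why sorting clauses by DocFreq pays off.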


Disjunctions (i.e. OR / Occur.SHOULD)



● Next() on all clauses

● Track clauses in min-heap → R = {2, …

● Next() on all previously matched clauses → R = {2,4, …

● Next() on all previously matched clauses → R = {2,4,5, …

● Repeat Next() until all posting lists are exhausted → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37}
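The heap-based merge above can be sketched in a few lines. This is an illustration of the technique, not Lucene's implementation, and the three posting lists in the usage example are hypothetical: the slide figure is not part of this transcript, so they were chosen only so that their union equals the final result set R.

```python
import heapq

def disjunction(posting_lists):
    """Return the sorted union of posting lists (each sorted, without duplicates)."""
    heap = []
    for clause, docs in enumerate(posting_lists):
        it = iter(docs)
        first = next(it, None)  # Next() on all clauses
        if first is not None:
            heap.append((first, clause, it))
    heapq.heapify(heap)  # track clauses in a min-heap keyed by current doc ID
    result = []
    while heap:
        doc = heap[0][0]
        result.append(doc)
        # Next() on every clause that matched this doc.
        while heap and heap[0][0] == doc:
            _, clause, it = heap[0]
            following = next(it, None)
            if following is None:
                heapq.heappop(heap)
            else:
                heapq.heapreplace(heap, (following, clause, it))
    return result

print(disjunction([
    [18, 31],
    [5, 9, 16, 22, 27, 31, 37],
    [2, 4, 7, 11, 12, 20, 26, 29, 31, 32],
]))
# [2, 4, 5, 7, 9, 11, 12, 16, 18, 20, 22, 26, 27, 29, 31, 32, 37]
```

Every posting of every clause is touched once and each touch costs a heap operation, which is where the O(n log |C|) cost of disjunctive processing comes from.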


Why Can Query Processing Be Slow?

● Disjunctive Processing: O(n log |C|)
  – High DF terms (large n)
  – Many terms (large |C|), e.g. query expansion
  – No / too little use of advance()

● Filter (over-use)


Filter

● Aims:
  – (Pre-)computation of common sub-queries
  – Cache result
  – Don't influence scoring

● Limitation:
  – Additional cost for 1st query
  – Currently, no skip information generated

→ Adding filter as a conjunct to queries can sometimes be faster, e.g. http://java.dzone.com/news/fast-lucene-search-filters


Stopword Removal

● Removal of High-DocFreq terms from
  – Index: 10-30% space saving
  – Query: no very expensive terms

● Limitation:
  – “To be or not to be”

● In general, don't do it

http://www.flickr.com/photos/lincolnian/304477845


Minor, But Easy Improvements

● Reduce information, increase locality:
  – Don't store TF if it's almost always 1 (and you don't need positions): fieldType.setIndexOptions(IndexOptions.DOCS_ONLY);
  – Use BlockPostingsFormat (default in Lucene ≥ 4.1)

● Tune Space/Time/Quality tradeoffs:
  – DirectDocValues
  – Less complex scoring function


Recent Developments within Lucene


MinShouldMatch (Lucene-4571)

● Don't want matches on only one (stop-)word?

● Enforce at least mm>1 terms to be present!

● Synthetic example query used during dev:

Terms:    ref   restored  struck  wings  dublin
DocFreq:  3.8M  32k       32k     32k    32k

E.g. mm=2:

Disjunctive Processing: next()

Conjunctive Processing: advance()

http://issues.apache.org/jira/browse/LUCENE-4571
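One way to see why mm>1 allows mixing next() and advance() is a pigeonhole argument: a document that matches at least mm of n clauses must match at least one of the n-mm+1 sparsest clauses, so only those need to be enumerated disjunctively, while the mm-1 densest clauses are merely probed. The following is a simplified sketch of that idea under this assumption, not the scorer from Lucene-4571; the function names and the list-based representation are illustrative.

```python
from bisect import bisect_left
from heapq import merge
from itertools import groupby

def contains(sorted_docs, doc):
    """Probe a sorted posting list for doc (stands in for advance(doc))."""
    i = bisect_left(sorted_docs, doc)
    return i < len(sorted_docs) and sorted_docs[i] == doc

def min_should_match(posting_lists, mm):
    """Return docs matching at least mm lists (each sorted, without duplicates)."""
    if not 1 <= mm <= len(posting_lists):
        raise ValueError("mm must be between 1 and the number of clauses")
    ordered = sorted(posting_lists, key=len)
    split = len(ordered) - mm + 1
    sparse, dense = ordered[:split], ordered[split:]
    matches = []
    # Disjunctive processing over the sparse clauses only.
    for doc, group in groupby(merge(*sparse)):
        hits = sum(1 for _ in group)
        # Conjunctive-style probing of the mm-1 densest clauses.
        hits += sum(contains(docs, doc) for docs in dense)
        if hits >= mm:
            matches.append(doc)
    return matches
```

With the synthetic query from the slide and mm=2, such a scheme would enumerate only the four 32k-document terms and probe the 3.8M-document term on demand.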


MinShouldMatch (Lucene-4571)




DocFreq:     3.8M  32k       32k     32k    32k
HighDF 1/5:  ref   restored  struck  wings  dublin
HighDF 2/5:  ref   http      struck  wings  dublin
HighDF 3/5:  ref   http      from    wings  dublin
HighDF 4/5:  ref   http      from    name   dublin
HighDF 5/5:  ref   http      from    name   title
DocFreq:     3.8M  3.5M      3.2M    2.8M   2.4M



MinShouldMatch – Results (Lucene-4571)



MinShouldMatch – Open Questions (Lucene-4571)

● How bad is it to exclude docs that only match one, but an important term?

● Why is it enough to match any mm terms?

● Why not provide a list of stop-words to a 'StopwordExcludingScorer'? (But be careful: “To Be Or Not To Be”)




ReqOptSumScorer

● Benefit:
  – Conjunctive processing on required clauses
  – Calls advance() on optional clauses

● How do you determine which clauses are required?
  – Lookup term statistics (i.e. DocFreq)
  – 2nd lookup unnecessary, if you hand over stats to query
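The benefit listed above can be illustrated with a small sketch: only documents of the required clause are visited, and the optional clause is merely advanced to each of them to pick up extra score. This is an illustration of the principle rather than Lucene's ReqOptSumScorer; the function name and the (doc ID, score) list representation are assumptions.

```python
from bisect import bisect_left

def req_opt_sum(required, optional):
    """Score docs of a required clause, adding the optional clause's score on overlap.

    Both arguments are lists of (doc_id, score) pairs sorted by doc_id.
    """
    optional_docs = [doc for doc, _ in optional]
    pos = 0
    results = []
    for doc, score in required:
        # advance(doc) on the optional clause: skip everything before doc.
        pos = bisect_left(optional_docs, doc, pos)
        if pos < len(optional_docs) and optional_docs[pos] == doc:
            score += optional[pos][1]
        results.append((doc, score))
    return results

print(req_opt_sum([(3, 1.0), (8, 2.0)],
                  [(1, 0.5), (3, 0.25), (7, 0.5), (9, 1.0)]))
# [(3, 1.25), (8, 2.0)]
```

The optional postings for docs 1, 7 and 9 are skipped without being scored, which is the saving compared with a pure disjunction.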


CommonTermsQuery (≥ 4.1)

● Looks up term infos (docfreq, posting list offset)

● Categorizes query terms as
  – Low-freq: At least one low-freq term MUST occur in result doc
  – High-freq: SHOULD occur in doc → their presence adds to score

● Executes query, but hands over term statistics

→ no 2nd round of term lookups necessary!

● Also supports MinShouldMatch

(Lucene-4628)
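The categorization step can be pictured as a simple cutoff on relative document frequency. The sketch below only illustrates that idea and is not the CommonTermsQuery implementation: the function name, the cutoff parameter and its value, and the collection size of 10M documents are assumptions; only the term document frequencies are taken from the synthetic example query shown earlier.

```python
def categorize_terms(doc_freqs, num_docs, max_term_frequency):
    """Split query terms into (low_freq, high_freq) by relative document frequency.

    doc_freqs maps each query term to its document frequency; a term counts
    as high-freq when it occurs in more than max_term_frequency of all docs.
    """
    low_freq, high_freq = [], []
    for term, df in doc_freqs.items():
        if df / num_docs > max_term_frequency:
            high_freq.append(term)
        else:
            low_freq.append(term)
    return low_freq, high_freq

low, high = categorize_terms(
    {"ref": 3_800_000, "restored": 32_000, "struck": 32_000,
     "wings": 32_000, "dublin": 32_000},
    num_docs=10_000_000,       # hypothetical collection size
    max_term_frequency=0.1,    # hypothetical cutoff
)
print(low)   # ['restored', 'struck', 'wings', 'dublin']
print(high)  # ['ref']
```

Since the document frequencies are already in hand after this step, they can be passed along to the executed query instead of being looked up a second time.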



Cost-Model (≥ 4.3)

● What about structured queries? E.g. +(a b) +c

● Currently: worst-case estimate of returned #docs (docfreq)
  – Disjunctions: Σ_{c ∈ C} df_c
  – Conjunctions: min_{c ∈ C} df_c

● Limitations:
  – Effort to generate returned docs?
  – Only one cost (next() vs. advance())

● Open Question:
  – Can we do better with more detailed cost models?

(Lucene-4607)
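The two rules compose recursively over a structured query. As a minimal sketch (the tuple-based query representation and the document frequencies are made up for illustration and are not Lucene data structures), the estimate for +(a b) +c could be computed like this:

```python
def cost(node, doc_freq):
    """Worst-case estimate of the number of docs a query node can return."""
    kind = node[0]
    if kind == "term":
        return doc_freq[node[1]]
    child_costs = [cost(child, doc_freq) for child in node[1]]
    if kind == "or":    # disjunction: sum of the clause costs
        return sum(child_costs)
    if kind == "and":   # conjunction: cost of the cheapest clause
        return min(child_costs)
    raise ValueError("unknown node type: %r" % (kind,))

# +(a b) +c with hypothetical document frequencies
query = ("and", [("or", [("term", "a"), ("term", "b")]), ("term", "c")])
print(cost(query, {"a": 100, "b": 50, "c": 120}))  # 120
```

Here the disjunction (a b) is estimated at 150 docs, so the single term c with 120 docs is the cheaper clause to lead the conjunction.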



Maxscore Top-k Scoring Algorithm¹

● Experimental prototype code attached to Lucene-4100

● Limitation:
  – Requires final run over whole index (i.e. only for static indexes)

(Lucene-4100)

¹ H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.



Index Sorting (≥ 4.3)

● Advantages (if appropriate sort order chosen)
  – Better compression → more locality → faster processing
  – Early termination

● Use together with EarlyTerminatingSortingCollector
  – Can terminate scoring within sorted segments
  – Fully scores as-yet unsorted segments

→ see 2nd half of Shai & Adrien's talk yesterday for details

(Lucene-4752)
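The interplay of sorted and as-yet unsorted segments can be sketched as follows. This illustrates the early-termination idea only and does not use or reproduce EarlyTerminatingSortingCollector; the function signature and the (docs, is_sorted) segment representation are assumptions.

```python
import heapq
from itertools import islice

def top_k(segments, matches, sort_key, k):
    """Collect the k best matching docs across segments.

    segments is a list of (docs, is_sorted) pairs; in a sorted segment the
    docs already appear in ascending sort_key order.
    """
    candidates = []
    for docs, is_sorted in segments:
        hits = (doc for doc in docs if matches(doc))
        if is_sorted:
            # Early termination: the first k hits are the best k of this segment.
            candidates.extend(islice(hits, k))
        else:
            # Unsorted segment: every hit has to be looked at.
            candidates.extend(heapq.nsmallest(k, hits, key=sort_key))
    return heapq.nsmallest(k, candidates, key=sort_key)

sorted_segment = [(1, "a x"), (3, "b x"), (5, "x"), (7, "x y")]
unsorted_segment = [(6, "x"), (2, "x"), (4, "z")]
best = top_k([(sorted_segment, True), (unsorted_segment, False)],
             matches=lambda doc: "x" in doc[1].split(),
             sort_key=lambda doc: doc[0],
             k=2)
print(best)  # [(1, 'a x'), (2, 'x')]
```

In the sorted segment the generator is abandoned after two hits, so the docs with keys 5 and 7 are never examined.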



Parallelization

● In general, sharding is better:
  – Shared-nothing
  – Better use cores for handling load

● Multi-threaded query execution:
  – Static indexes: For slow queries, almost perfect speedups (if docs are uniformly distributed over shards)
  – Dynamic indexes: Lucene-2840, Lucene-5299




Summary

● Understand your problem

● Scoring can become an issue with many millions of docs

● Many recent efficiency improvements

● More to come... patches welcome


We're Hiring @HERE

Frankfurt, Berlin, Boston, Chicago.

Come work with us. Get in touch!

developer.here.com/geocoder



Thank You!

Contact

Email: [email protected]

LinkedIn: http://linkedin.com/in/stefanpohl

Twitter: @pohlstefan


