Query Latency Optimization with Lucene
Query Latency Optimization
Stefan Pohl
Sr. Research Engineer, Ph.D.
[email protected]
7 Nov 2013 Query Latency Optimization with Lucene 2
Who Am I
● Search user, developer, researcher
● Many years in industry & academia
● Ph.D. in Information Retrieval
● Interests: Search, Big Data, Machine Learning
● Currently working on the Geocoding offer of HERE, Nokia's Location Platform
● Spare time: Lucene contributor

Agenda
● Motivation
● Latency Optimization
● Query Processing / Scoring
● Recent Developments in Lucene

Motivation: Query Latency
● Human reaction time: 200 ms*
→ Backend latency must be << 200 ms
● Faster queries mean a higher manageable load
● Costs
* Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in Software, Addison-Wesley Professional, 2008.

Motivation: Query Latency Distribution

Latency Optimization

First: Do Your Homework
● Keep enough RAM for the OS (disk buffer cache)
● Reduce HDD “pressure” (e.g. throttle indexing)
● SSDs
● Warming
● Ideally: your index fits in memory
See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

Mining Hypotheses
● Check if query latencies are reproducible
● If not, try to find correlations with system events:
– Many new incoming docs to index?
– Other daemons spiking in disk or CPU activity?
– Garbage collections?
– Other sar statistics (e.g. paging)
● If yes, profile
– First, your own code
– Don't instrument Lucene's internal low-level classes

Hypothesis Testing
● You really think you understand the problem and have a potential solution?
● Try it out (if it's cheap)!
● Otherwise, think of (cheap) experiments that
– Give confidence
– Tell you (and others) what the gains are (ROI)

Example: In-memory
● Buy more memory / a bigger machine!?
● Simulate¹
– Consecutively execute the same query multiple times
– Much lower memory requirement (i.e. the size of the involved postings)
– Repeat for a sample of queries of interest
● Gives a lower bound on query latency
¹ S. Pohl, A. Moffat. Measurement Techniques and Caching Effects. In Proceedings of the 31st European Conference on Information Retrieval, Toulouse, France, April 2009. Springer.
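The repeated-execution idea above could be sketched as a tiny timing harness. This is only a sketch under assumptions, not code from the talk: `WarmLatencySketch`, `measure` and `warmLowerBound` are hypothetical names, and the `Runnable` stands in for whatever executes one query against your index.

```java
final class WarmLatencySketch {

    /** Runs the same query `repetitions` times in a row and records each latency in nanoseconds. */
    static long[] measure(Runnable query, int repetitions) {
        long[] nanos = new long[repetitions];
        for (int i = 0; i < repetitions; i++) {
            long start = System.nanoTime();
            query.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /**
     * Fastest of the repeat runs. The first run is skipped because it may still
     * read the postings from disk. Returns Long.MAX_VALUE if there is no repeat run.
     */
    static long warmLowerBound(long[] nanos) {
        long best = Long.MAX_VALUE;
        for (int i = 1; i < nanos.length; i++) {
            best = Math.min(best, nanos[i]);
        }
        return best;
    }
}
```

Comparing the first entry of `measure(...)` with `warmLowerBound(...)` for a sample of queries gives a rough idea of how much an in-memory index could gain, before any hardware is bought.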

Query Processing

Conjunctions (i.e. AND / Occur.MUST)
● Sort Boolean clauses by increasing DocFreq $f_t$
● next() on the sparsest posting list (“lead”)
● advance(18) on the next sparsest posting list → fail
● Start all over again with “lead”, but advance(22)
● Try to advance(31) on all other posting lists
● Match found → 31 is added to the result R
● next() on “lead” → final result R = {31}
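The walkthrough above can be sketched over toy posting lists. This is a minimal illustration, not Lucene's scorer code: `ConjunctionSketch` and `Postings` are hypothetical names, `advance()` is a linear scan where real posting lists would use skip data, and the doc ids in the usage note are made up to mirror the steps on the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ConjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Toy posting list: a sorted array of doc ids plus a cursor. */
    static final class Postings {
        final int[] docs;
        int pos = -1;

        Postings(int... docs) { this.docs = docs; }

        int docFreq() { return docs.length; }

        int docId() { return pos < docs.length ? docs[pos] : NO_MORE_DOCS; }

        /** Moves to the next doc id. */
        int next() { pos++; return docId(); }

        /** Moves to the first doc id >= target (stays put if already there). */
        int advance(int target) {
            if (pos < 0) pos = 0;
            while (pos < docs.length && docs[pos] < target) pos++;
            return docId();
        }
    }

    static List<Integer> intersect(Postings... clauses) {
        Postings[] sorted = clauses.clone();
        // Sparsest list first: it becomes the "lead" that proposes candidates.
        Arrays.sort(sorted, Comparator.comparingInt(Postings::docFreq));
        Postings lead = sorted[0];
        List<Integer> result = new ArrayList<>();
        int candidate = lead.next();
        while (candidate != NO_MORE_DOCS) {
            boolean match = true;
            for (int i = 1; i < sorted.length; i++) {
                int doc = sorted[i].advance(candidate);
                if (doc != candidate) {
                    // Fail: start over with the lead, skipping ahead to the doc we landed on.
                    candidate = (doc == NO_MORE_DOCS) ? NO_MORE_DOCS : lead.advance(doc);
                    match = false;
                    break;
                }
            }
            if (match) {
                result.add(candidate);
                candidate = lead.next();
            }
        }
        return result;
    }
}
```

With the made-up lists `{18, 31}`, `{5, 22, 31, 40}` and `{3, 18, 22, 31, 35, 50}`, the lead proposes 18, the second list answers 22, the lead then advances to 31, and all lists agree on 31, the only match.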

Disjunctions (i.e. OR / Occur.SHOULD)
● next() on all clauses
● Track clauses in a min-heap → first result: R = {2, …
● next() on all previously matched clauses → R = {2,4, …
● Repeat next() on the previously matched clauses until all posting lists are exhausted:
→ R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37}
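The heap-based merge above can be sketched as follows. Again this is a toy illustration rather than Lucene's implementation: `DisjunctionSketch` and `Cursor` are hypothetical names, and the posting lists in the usage note are invented so that their union equals the result set R from the slides.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class DisjunctionSketch {

    /** Cursor over one sorted, non-empty posting list. */
    static final class Cursor {
        final int[] docs;
        int pos = 0;

        Cursor(int... docs) { this.docs = docs; }

        int docId() { return docs[pos]; }

        /** Moves to the next posting; returns false once the list is exhausted. */
        boolean next() { return ++pos < docs.length; }
    }

    static List<Integer> union(int[]... postingLists) {
        // Min-heap ordered by the doc id each clause currently points at.
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparingInt(Cursor::docId));
        for (int[] docs : postingLists) {
            if (docs.length > 0) heap.add(new Cursor(docs));
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek().docId();
            result.add(doc);
            // next() on every clause that matched this doc, then re-insert it into the heap.
            while (!heap.isEmpty() && heap.peek().docId() == doc) {
                Cursor matched = heap.poll();
                if (matched.next()) heap.add(matched);
            }
        }
        return result;
    }
}
```

Calling `union` with `{2,9,18,27,31}`, `{4,5,11,16,22,29,31,37}` and `{2,7,12,18,20,26,31,32}` visits every posting exactly once and pays a heap operation for each, which is where the O(n log |C|) cost of disjunctions comes from.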

Why Can Query Processing Be Slow?
● Disjunctive processing: O(n log |C|), for n postings and |C| clauses
– High-DF terms (large n)
– Many terms (large |C|), e.g. query expansion
– No / too little use of advance()
● Filter (over-use)

Filter
● Aims:
– (Pre-)computation of common sub-queries
– Cache result
– Don't influence scoring
● Limitations:
– Additional cost for the 1st query
– Currently, no skip information is generated
→ Adding the filter as a conjunct to queries can sometimes be faster, e.g. http://java.dzone.com/news/fast-lucene-search-filters

Stopword Removal
● Removal of high-DocFreq terms from
– Index: 10-30% space saving
– Query: no very expensive terms
● Limitation:
– “To be or not to be”
● In general, don't do it

Minor, But Easy Improvements
● Reduce information, increase locality:
– Don't store TF if it's almost always 1 (and you don't need positions): fieldType.setIndexOptions(IndexOptions.DOCS_ONLY);
– Use BlockPostingsFormat (default in Lucene ≥ 4.1)
● Tune space/time/quality tradeoffs:
– DirectDocValues
– Less complex scoring function

Recent Developments within Lucene

MinShouldMatch (Lucene-4571)
● Don't want matches on only one (stop-)word?
● Enforce that at least mm > 1 terms are present!
● Synthetic example query used during development:
Terms:    ref   restored  struck  wings  dublin
DocFreq:  3.8M  32k       32k     32k    32k
● E.g. mm=2:
– Disjunctive processing: next()
– Conjunctive processing: advance()
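One way to read the mm=2 example: any doc matching at least mm of the |C| clauses must match at least one of the |C| − mm + 1 sparsest clauses, so only those need disjunctive next() processing, while the mm − 1 densest clauses (here the 3.8M-posting term) are only probed with advance(). The sketch below illustrates that idea on toy arrays. It is not the Lucene-4571 patch itself: `MinShouldMatchSketch`, `Cursor` and `match` are hypothetical names, and scoring is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class MinShouldMatchSketch {

    static final class Cursor {
        final int[] docs;
        int pos = 0;

        Cursor(int[] docs) { this.docs = docs; }

        boolean exhausted() { return pos >= docs.length; }

        int docId() { return docs[pos]; }

        void next() { pos++; }

        /** Moves to the first doc id >= target; true if it equals target. */
        boolean advanceTo(int target) {
            while (pos < docs.length && docs[pos] < target) pos++;
            return pos < docs.length && docs[pos] == target;
        }
    }

    /** Returns the doc ids contained in at least mm of the given posting lists. */
    static List<Integer> match(int mm, int[]... postingLists) {
        int[][] sorted = postingLists.clone();
        Arrays.sort(sorted, Comparator.comparingInt((int[] docs) -> docs.length));
        int headSize = sorted.length - mm + 1;   // sparsest clauses: next()
        PriorityQueue<Cursor> head =
                new PriorityQueue<>(Comparator.comparingInt(Cursor::docId));
        List<Cursor> tail = new ArrayList<>();   // mm - 1 densest clauses: advance() only
        for (int i = 0; i < sorted.length; i++) {
            Cursor c = new Cursor(sorted[i]);
            if (i >= headSize) tail.add(c);
            else if (!c.exhausted()) head.add(c);
        }
        List<Integer> result = new ArrayList<>();
        while (!head.isEmpty()) {
            int doc = head.peek().docId();
            int matches = 0;
            while (!head.isEmpty() && head.peek().docId() == doc) {
                Cursor c = head.poll();
                matches++;
                c.next();
                if (!c.exhausted()) head.add(c);
            }
            for (Cursor c : tail) {
                if (c.advanceTo(doc)) matches++;
            }
            if (matches >= mm) result.add(doc);
        }
        return result;
    }
}
```

Docs that occur only in the dense list are never proposed as candidates, so most of its postings are skipped rather than visited.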

MinShouldMatch (Lucene-4571)
● Test queries with an increasing number of high-DocFreq terms:
DocFreq:     3.8M  32k       32k     32k    32k
HighDF 1/5:  ref   restored  struck  wings  dublin
HighDF 2/5:  ref   http      struck  wings  dublin
HighDF 3/5:  ref   http      from    wings  dublin
HighDF 4/5:  ref   http      from    name   dublin
HighDF 5/5:  ref   http      from    name   title
DocFreq:     3.8M  3.5M      3.2M    2.8M   2.4M

MinShouldMatch – Results (Lucene-4571)

MinShouldMatch – Open Questions (Lucene-4571)
● How bad is it to exclude docs that match only one, but important, term?
● Why is it enough to match any mm terms?
● Why not provide a list of stop-words to a 'StopwordExcludingScorer'? (But be careful: “To Be Or Not To Be”)

ReqOptSumScorer
● Benefit:
– Conjunctive processing on required clauses
– Calls advance() on optional clauses
● How do you determine which clauses are required?
– Look up term statistics (i.e. DocFreq)
– 2nd lookup unnecessary if you hand over the stats to the query
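The benefit described above can be illustrated with a toy version of required-plus-optional scoring. This is a sketch under assumptions, not Lucene's ReqOptSumScorer: `ReqOptSketch` and `score` are hypothetical names, each clause is a single sorted array of doc ids, and every clause contributes a constant score.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ReqOptSketch {

    /**
     * Scores every doc of the required clause. The optional clause never
     * produces candidates; it only adds to the score where it also matches.
     */
    static Map<Integer, Double> score(int[] required, double reqScore,
                                      int[] optional, double optScore) {
        Map<Integer, Double> scores = new LinkedHashMap<>();
        int opt = 0; // cursor into the optional posting list
        for (int doc : required) {
            // advance(doc) on the optional clause: skip everything below the candidate
            while (opt < optional.length && optional[opt] < doc) opt++;
            boolean optionalMatches = opt < optional.length && optional[opt] == doc;
            scores.put(doc, optionalMatches ? reqScore + optScore : reqScore);
        }
        return scores;
    }
}
```

For `required = {3, 8, 15}` and `optional = {1, 3, 4, 9, 15, 20}`, only docs 3, 8 and 15 are scored; docs 3 and 15 receive the extra optional score.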

CommonTermsQuery (≥ 4.1) (Lucene-4628)
● Looks up term infos (docfreq, posting list offset)
● Categorizes query terms as
– Low-freq: at least one low-freq term MUST occur in a result doc
– High-freq: SHOULD occur in the doc → their presence adds to the score
● Executes the query, but hands over the term statistics
→ no 2nd round of term lookups necessary!
● Also supports MinShouldMatch
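The categorization step can be sketched as a simple split on document frequency. This is an illustration only, not the CommonTermsQuery source: `CommonTermsSketch` is a hypothetical name, and both the cutoff fraction and the index size used in the usage note are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CommonTermsSketch {
    final List<String> lowFreq = new ArrayList<>();   // drive matching
    final List<String> highFreq = new ArrayList<>();  // only add to the score

    /**
     * Splits query terms by document frequency.
     *
     * @param docFreqs       term → number of docs containing it (looked up once)
     * @param maxDoc         number of docs in the index
     * @param cutoffFraction terms in more than this fraction of docs count as high-freq
     */
    CommonTermsSketch(Map<String, Long> docFreqs, long maxDoc, double cutoffFraction) {
        for (Map.Entry<String, Long> e : docFreqs.entrySet()) {
            if (e.getValue() > cutoffFraction * maxDoc) {
                highFreq.add(e.getKey());
            } else {
                lowFreq.add(e.getKey());
            }
        }
    }
}
```

With the DocFreq values of the synthetic query from the MinShouldMatch slides, an assumed index of 10M docs and a cutoff of 0.1, `ref` (3.8M) lands in `highFreq` and the four 32k terms land in `lowFreq`.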

Cost-Model (≥ 4.3) (Lucene-4607)
● What about structured queries? E.g. +(a b) +c
● Currently: worst-case estimate of the number of returned docs (docfreq)
– Disjunctions: $\sum_{c \in C} df_c$
– Conjunctions: $\min_{c \in C} df_c$
● Limitations:
– Effort to generate the returned docs?
– Only one cost (next() vs. advance())
● Open question:
– Can we do better with more detailed cost models?
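The two estimates above compose recursively over a query tree. The following sketch shows this for the example +(a b) +c. The class names (`CostModelSketch`, `Term`, `Or`, `And`) are hypothetical and the document frequencies are made up.

```java
import java.util.Arrays;
import java.util.List;

final class CostModelSketch {

    interface Node {
        /** Worst-case estimate of the number of docs this clause can return. */
        long cost();
    }

    static final class Term implements Node {
        final long docFreq;
        Term(long docFreq) { this.docFreq = docFreq; }
        public long cost() { return docFreq; }
    }

    /** Disjunction: at worst, no two clauses share a doc, so the costs add up. */
    static final class Or implements Node {
        final List<Node> clauses;
        Or(Node... clauses) { this.clauses = Arrays.asList(clauses); }
        public long cost() {
            long sum = 0;
            for (Node c : clauses) sum += c.cost();
            return sum;
        }
    }

    /** Conjunction: it cannot return more docs than its sparsest clause. */
    static final class And implements Node {
        final List<Node> clauses;
        And(Node... clauses) { this.clauses = Arrays.asList(clauses); }
        public long cost() {
            long min = Long.MAX_VALUE;
            for (Node c : clauses) min = Math.min(min, c.cost());
            return min;
        }
    }
}
```

With assumed frequencies df_a = 1000, df_b = 500 and df_c = 200, the disjunction (a b) costs 1500 and the whole query costs min(1500, 200) = 200, so c would be chosen as the lead clause.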

Maxscore Top-k Scoring Algorithm¹ (Lucene-4100)
● Experimental prototype code attached to Lucene-4100
● Limitation:
– Requires final run over whole index (i.e. only for static indexes)
¹ H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.

Index Sorting (≥ 4.3) (Lucene-4752)
● Advantages (if an appropriate sort order is chosen)
– Better compression → more locality → faster processing
– Early termination
● Use together with EarlyTerminatingSortingCollector
– Can terminate scoring within sorted segments
– Fully scores as-yet unsorted segments
→ see 2nd half of Shai & Adrian's talk yesterday for details
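Early termination within one sorted segment can be sketched as follows. This is a toy model, not the EarlyTerminatingSortingCollector API: `EarlyTerminationSketch` and `collectTopK` are hypothetical names, and the sketch assumes the segment is sorted so that a lower doc id always ranks better for the requested sort.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class EarlyTerminationSketch {
    /** Number of docs inspected so far, to make the saving visible. */
    int visited = 0;

    /**
     * Collects the first k matching docs of a segment whose doc id order
     * equals the requested sort order, then stops.
     */
    List<Integer> collectTopK(int maxDoc, IntPredicate matches, int k) {
        List<Integer> top = new ArrayList<>();
        for (int doc = 0; doc < maxDoc && top.size() < k; doc++) {
            visited++;
            if (matches.test(doc)) top.add(doc);
        }
        return top;
    }
}
```

On a segment of one million docs where every third doc matches, asking for the top 3 inspects only 7 docs. An as-yet unsorted segment gives no such guarantee and has to be scored in full, as noted above.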

Parallelization
● In general, sharding is better:
– Shared-nothing
– Better use cores for handling load
● Multi-threaded query execution:
– Static indexes: for slow queries, almost perfect speedups (if docs are uniformly distributed over shards)
– Dynamic indexes: Lucene-2840, Lucene-5299

Summary
● Understand your problem
● Scoring can become an issue with many million docs
● Many recent efficiency improvements
● More to come... patches welcome

We're Hiring @HERE
Frankfurt, Berlin, Boston, Chicago.
Come work with us. Get in touch!
developer.here.com/geocoder

Thank You!
Contact
Email: [email protected]
LinkedIn: http://linkedin.com/in/stefanpohl
Twitter: @pohlstefan
developer.here.com/geocoder