Query Latency Optimization with Lucene


Presented by Stefan Pohl, Senior Research Engineer, HERE, a Nokia Business

Besides the quality of results, the time that it takes from the submission of a query to the display of results is of utmost importance to user satisfaction. Within search engine implementations such as Apache Lucene, significant development efforts are hence directed towards reducing query latency. In this session, I will explain reasons for high query latencies and describe general approaches and recent developments within Lucene to counter them. To make the presented material relevant to a wider audience, I will focus on the actual query processing, as this is at the core of every query and search use-case.

Transcript of Query Latency Optimization with Lucene

Page 1: Query Latency Optimization with Lucene

Query Latency Optimization

Stefan Pohl, Ph.D., Sr. Research Engineer

[email protected]

Page 2: Query Latency Optimization with Lucene

7 Nov 2013 Query Latency Optimization with Lucene 2

Who Am I

● Search user, developer, researcher

● Many years in industry & academia

● Ph.D. in Information Retrieval

● Interests: Search, Big Data, Machine Learning

● Currently working on the Geocoding offer of HERE, Nokia's Location Platform

● Spare time: Lucene contributor

Page 3: Query Latency Optimization with Lucene

Agenda

● Motivation

● Latency Optimization

● Query Processing / Scoring

● Recent Developments in Lucene

Page 4: Query Latency Optimization with Lucene

Motivation: Query Latency

● Human reaction time: 200 ms*

→ Backend latency: << 200 ms

● Faster queries mean higher manageable load

● Costs

* Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in Software, Addison-Wesley Professional, 2008.

Page 5: Query Latency Optimization with Lucene


Motivation: Query Latency Distribution

Page 6: Query Latency Optimization with Lucene


Latency Optimization

Page 7: Query Latency Optimization with Lucene

First: Do Your Homework

● Keep enough RAM for OS (disk buffer cache)

● Reduce HDD “pressure” (e.g. throttle indexing)

● SSDs

● Warming

● Ideally: your index fits in memory

See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

Page 8: Query Latency Optimization with Lucene

Mining Hypothesis

● Check if query latencies are reproducible

● If not, try to find correlations with system events:
  – Many new incoming docs to index?
  – Other daemons spike in disk or CPU activity?
  – Garbage Collections?
  – Other sar statistics (e.g. paging)

● If yes, profile
  – First, your code
  – Don't instrument Lucene internal low-level classes

Page 9: Query Latency Optimization with Lucene

Hypothesis Testing

● You really think you understand the problem and have a potential solution?

● Try it out (if it's cheap)!

● Otherwise, think of (cheap) experiments that
  – Give confidence
  – Tell you (and others) what the gains are (ROI)

Page 10: Query Latency Optimization with Lucene

Example: In-memory

● Buy more memory / bigger machine !?

● Simulate¹
  – Consecutively execute the same query multiple times
  – Much lower memory requirement (i.e. the size of the involved postings)
  – Repeat for sample of queries of interest

● Gives lower bound on query latency

¹ S. Pohl, A. Moffat. Measurement Techniques and Caching Effects. In Proceedings of the 31st European Conference on Information Retrieval, Toulouse, France, April 2009. Springer.
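
The simulation above can be sketched as a small timing harness. This is a minimal Java sketch, not code from the talk: the class and method names are hypothetical, the query is passed in as an arbitrary `Supplier`, and the first (cold) execution is assumed to pull the involved postings into the cache, so only the repetitions are timed.

```java
import java.util.function.Supplier;

final class WarmLatencySketch {

    /**
     * Runs the same query several times in a row and returns the fastest
     * warm run in nanoseconds, i.e. an estimate of the latency the query
     * would have if its postings were already in memory.
     */
    static <T> long warmLatencyNanos(Supplier<T> query, int repetitions) {
        if (repetitions < 2) {
            throw new IllegalArgumentException("need at least 2 repetitions");
        }
        query.get(); // cold run: warms the caches, deliberately not timed
        long best = Long.MAX_VALUE;
        for (int i = 1; i < repetitions; i++) {
            long start = System.nanoTime();
            query.get();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }
}
```

Calling this for each query in a sample of interest yields per-query lower bounds that can be compared against the latencies seen in production.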

Page 11: Query Latency Optimization with Lucene


Query Processing

Pages 12-20: Query Latency Optimization with Lucene

Conjunctions (i.e. AND / Occur.MUST)

● Sort Boolean clauses by increasing DocFreq f_t

● next() on sparsest posting list (“lead”)

● advance(18) on next sparsest posting list → fail

● Start all over again with “lead”, but advance(22)

● Try to advance(31) on all other posting lists

● Match found → 31 is added to R

● next() on “lead” → R = {31} (final result)
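
The conjunction steps above can be sketched in plain Java. This is an illustrative sketch rather than Lucene's scorer: `Postings` is a hypothetical toy cursor over a sorted array, `advance()` scans linearly where a real index would use skip data, and the posting lists in the usage note are made up so that they reproduce the 18 → 22 → 31 sequence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ConjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Toy posting-list cursor over a sorted array of doc IDs. */
    static final class Postings {
        final int[] docs;
        int pos = -1;

        Postings(int... docs) { this.docs = docs; }

        int docFreq() { return docs.length; }

        int docID() {
            if (pos < 0) return -1;
            return pos >= docs.length ? NO_MORE_DOCS : docs[pos];
        }

        int next() { pos++; return docID(); }

        /** Moves forward to the first doc ID >= target. */
        int advance(int target) {
            do { pos++; } while (pos < docs.length && docs[pos] < target);
            return docID();
        }
    }

    static List<Integer> intersect(Postings... clauses) {
        List<Integer> result = new ArrayList<>();
        if (clauses.length == 0) return result;
        // Sort clauses by increasing DocFreq; the sparsest list leads.
        Postings[] sorted = clauses.clone();
        Arrays.sort(sorted, Comparator.comparingInt(Postings::docFreq));
        Postings lead = sorted[0];
        int doc = lead.next();
        while (doc != NO_MORE_DOCS) {
            boolean match = true;
            for (int i = 1; i < sorted.length; i++) {
                Postings other = sorted[i];
                if (other.docID() < doc) {
                    int next = other.advance(doc);
                    if (next > doc) {
                        // Candidate failed: restart with the lead list.
                        doc = lead.advance(next);
                        match = false;
                        break;
                    }
                }
            }
            if (match) {
                result.add(doc);
                doc = lead.next();
            }
        }
        return result;
    }
}
```

With the made-up lists `(18, 31)`, `(2, 22, 31, 40)` and `(1, 5, 9, 22, 25, 31, 50)`, the lead starts at 18, the second list jumps to 22, the lead is advanced to 31, and both other lists confirm 31. Most postings of the longer lists are never inspected as candidates, which is why conjunctions tend to be cheap.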

Pages 21-40: Query Latency Optimization with Lucene

Disjunctions (i.e. OR / Occur.SHOULD)

● next() on all clauses

● Track clauses in min-heap → R = {2

● next() on all previously matched clauses → R = {2,4

● next() on all previously matched clauses → R = {2,4,5

● Repeat next() until all posting lists are exhausted → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37}
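
The heap-based disjunction above can be sketched in Java as follows. This is a simplified illustration, not Lucene's implementation: it only produces the matching doc IDs (no scoring), the class and method names are hypothetical, and the example lists in the usage note are made up so that their union equals the result set R shown above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class DisjunctionSketch {

    /** Cursor over one sorted posting list. */
    static final class Cursor {
        final int[] docs;
        int pos = 0;

        Cursor(int[] docs) { this.docs = docs; }

        int doc() { return docs[pos]; }

        /** Moves to the next posting; returns false when exhausted. */
        boolean next() { return ++pos < docs.length; }
    }

    static List<Integer> union(int[]... postings) {
        // Min-heap ordered by each clause's current doc ID.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>(Comparator.comparingInt(Cursor::doc));
        for (int[] p : postings) {
            if (p.length > 0) heap.add(new Cursor(p));
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek().doc();
            result.add(doc);
            // next() on every clause that matched this doc, then re-insert.
            while (!heap.isEmpty() && heap.peek().doc() == doc) {
                Cursor c = heap.poll();
                if (c.next()) heap.add(c);
            }
        }
        return result;
    }
}
```

For example, `union` over `{2,5,9,12,18,22,27,31,37}`, `{4,7,11,16,20,26,29,31,32}` and `{18,31}` returns the 17 doc IDs of R. Every posting of every clause passes through the heap once, which is where the O(n log |C|) cost of disjunctive processing comes from.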

Page 41: Query Latency Optimization with Lucene

Why Can Query Processing Be Slow?

● Disjunctive Processing: O(n log |C|)
  – High DF terms (large n)
  – Many terms (large |C|), e.g. query expansion
  – No / too little use of advance()

● Filter (over-use)

Page 42: Query Latency Optimization with Lucene

Filter

● Aims:
  – (Pre-)computation of common sub-queries
  – Cache result
  – Don't influence scoring

● Limitation
  – Additional cost for 1st query
  – Currently, no skip information generated

→ Adding filter as a conjunct to queries can sometimes be faster, e.g. http://java.dzone.com/news/fast-lucene-search-filters

Page 43: Query Latency Optimization with Lucene

Stopword Removal

● Removal of High-DocFreq terms from
  – Index: 10-30% space saving
  – Query: no very expensive terms

● Limitation:
  – “To be or not to be”

● In general, don't do it

Page 44: Query Latency Optimization with Lucene

Minor, But Easy Improvements

● Reduce information, increase locality:
  – Don't store TF, if it's almost always 1 (and you don't need positions):
    fieldType.setIndexOptions(IndexOptions.DOCS_ONLY);
  – Use BlockPostingsFormat (default in Lucene ≥ 4.1)

● Tune Space/Time/Quality tradeoffs:
  – DirectDocValues
  – Less complex scoring function

Page 45: Query Latency Optimization with Lucene

Recent Developments within Lucene

Page 46: Query Latency Optimization with Lucene

MinShouldMatch (Lucene-4571)

● Don't want matches on only one (stop-)word?

● Enforce at least mm>1 terms to be present!

● Synthetic example query used during dev:

Terms:    ref    restored  struck  wings  dublin
DocFreq:  3.8M   32k       32k     32k    32k

E.g. mm=2:

Disjunctive Processing: next()

Conjunctive Processing: advance()
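
One way to read the next() / advance() split: a document that matches at least mm of n clauses must match at least one of the n − mm + 1 lowest-DocFreq clauses, so only those need disjunctive processing, and the remaining mm − 1 high-DocFreq clauses only have to be probed for candidates that still lack matches. The following is a minimal Java sketch under that assumption. The class and method names are hypothetical, binary search stands in for advance(), each posting list is assumed sorted and duplicate-free, and this is not Lucene's actual scorer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MinShouldMatchSketch {

    /** Returns doc IDs that occur in at least mm of the posting lists. */
    static List<Integer> match(int mm, int[]... postings) {
        int n = postings.length;
        if (mm < 1 || mm > n) {
            throw new IllegalArgumentException("mm must be in [1, n]");
        }
        int[][] sorted = postings.clone();
        Arrays.sort(sorted, Comparator.comparingInt((int[] p) -> p.length));

        // Disjunctive part: count matches over the n - mm + 1 sparsest lists.
        int headSize = n - mm + 1;
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (int i = 0; i < headSize; i++) {
            for (int doc : sorted[i]) {
                counts.merge(doc, 1, Integer::sum);
            }
        }

        // Conjunctive part: probe the mm - 1 densest lists only for
        // candidates that have not reached mm matches yet.
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            int hits = e.getValue();
            for (int i = headSize; i < n && hits < mm; i++) {
                if (Arrays.binarySearch(sorted[i], e.getKey()) >= 0) {
                    hits++;
                }
            }
            if (hits >= mm) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With mm=2 and one very long list next to four short ones, as in the example query, the long list is never iterated with next(); it is only probed for documents that already matched a rare term.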

Page 47: Query Latency Optimization with Lucene


MinShouldMatch (Lucene-4571)

Page 48: Query Latency Optimization with Lucene


MinShouldMatch (Lucene-4571)

Page 49: Query Latency Optimization with Lucene

MinShouldMatch (Lucene-4571)

DocFreq:     3.8M  32k       32k     32k    32k
HighDF 1/5:  ref   restored  struck  wings  dublin
HighDF 2/5:  ref   http      struck  wings  dublin
HighDF 3/5:  ref   http      from    wings  dublin
HighDF 4/5:  ref   http      from    name   dublin
HighDF 5/5:  ref   http      from    name   title
DocFreq:     3.8M  3.5M      3.2M    2.8M   2.4M

Page 50: Query Latency Optimization with Lucene


MinShouldMatch – Results (Lucene-4571)

Page 51: Query Latency Optimization with Lucene

MinShouldMatch – Open Questions (Lucene-4571)

● How bad is it to exclude docs that only match one, but an important term?

● Why is it enough to match any mm terms?

● Why not provide a list of stop-words to a 'StopwordExcludingScorer'? (But be careful: “To Be Or Not To Be”)

Page 52: Query Latency Optimization with Lucene

ReqOptSumScorer

● Benefit:
  – Conjunctive processing on required clauses
  – Calls advance() on optional clauses

● How do you determine which clauses are required?
  – Lookup term statistics (i.e. DocFreq)
  – 2nd lookup unnecessary, if you hand over stats to query

Page 53: Query Latency Optimization with Lucene

CommonTermsQuery (≥ 4.1) (Lucene-4628)

● Looks up term infos (docfreq, posting list offset)

● Categorizes query terms as
  – Low-freq: At least one low-freq term MUST occur in result doc
  – High-freq: SHOULD occur in doc → their presence adds to score

● Executes query, but hands over term statistics

→ no 2nd round of term lookups necessary!

● Also supports MinShouldMatch
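
The categorization step can be illustrated with a small, self-contained Java sketch. It does not use Lucene's classes: the class name, the `categorize` method and the cutoff parameter (expressed here as a fraction of all documents) are hypothetical, and the index size in the usage note is an assumed value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CommonTermsSketch {

    /** Query terms split by document frequency. */
    static final class Split {
        final List<String> lowFreq = new ArrayList<>();   // at least one MUST match
        final List<String> highFreq = new ArrayList<>();  // SHOULD match, only add to score
    }

    /**
     * Terms occurring in more than maxTermFraction of all documents are
     * treated as high-frequency, all others as low-frequency.
     */
    static Split categorize(Map<String, Integer> docFreqs, int maxDoc,
                            double maxTermFraction) {
        Split split = new Split();
        double cutoff = maxTermFraction * maxDoc;
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (e.getValue() > cutoff) {
                split.highFreq.add(e.getKey());
            } else {
                split.lowFreq.add(e.getKey());
            }
        }
        return split;
    }
}
```

With the DocFreq values of the synthetic example query (ref: 3.8M, all other terms: 32k), an assumed index of 10M documents and a cutoff of 10%, only "ref" lands in the high-frequency group, so it never has to drive iteration.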

Page 54: Query Latency Optimization with Lucene

Cost-Model (≥ 4.3) (Lucene-4607)

● What about structured queries? E.g. +(a b) +c

● Currently: worst-case estimate of returned #docs (docfreq)
  – Disjunctions: $\sum_{c \in C} df_c$
  – Conjunctions: $\min_{c \in C} df_c$

● Limitations:
  – Effort to generate returned docs?
  – Only one cost (next() vs. advance())

● Open Question:
  – Can we do better with more detailed cost models?
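
The two estimates compose recursively over a structured query. A minimal Java sketch of that idea follows; the `Node` interface and the factory methods are hypothetical and only mirror the sum/min rule stated above, not Lucene's scorer classes.

```java
import java.util.Arrays;

final class CostModelSketch {

    /** A query node that reports a worst-case number of returned docs. */
    interface Node {
        long cost();
    }

    /** Leaf: the cost of a term is its document frequency. */
    static Node term(long docFreq) {
        return () -> docFreq;
    }

    /** Disjunction: in the worst case no clause overlaps, so costs add up. */
    static Node or(Node... clauses) {
        return () -> Arrays.stream(clauses).mapToLong(Node::cost).sum();
    }

    /** Conjunction: cannot return more docs than its cheapest clause. */
    static Node and(Node... clauses) {
        return () -> Arrays.stream(clauses).mapToLong(Node::cost).min().orElse(0L);
    }
}
```

For +(a b) +c with assumed document frequencies a=100, b=50 and c=30, the inner disjunction costs 150 and the whole query costs min(150, 30) = 30, so c would be chosen as the lead clause.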

Page 55: Query Latency Optimization with Lucene

Maxscore Top-k Scoring Algorithm¹ (Lucene-4100)

● Experimental prototype code attached to Lucene-4100

● Limitation:
  – Requires final run over whole index (i.e. only for static indexes)

¹ H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.

Page 56: Query Latency Optimization with Lucene

Index Sorting (≥ 4.3) (Lucene-4752)

● Advantages (if appropriate sort order chosen)
  – Better compression → more locality → faster processing
  – Early termination

● Use together with EarlyTerminatingSortingCollector
  – Can terminate scoring within sorted segments
  – Fully scores as-yet unsorted segments

→ see 2nd half of Shai & Adrian's talk yesterday for details
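
The early-termination behaviour can be sketched without Lucene. In the following Java sketch, which is an illustration under stated assumptions rather than the collector's implementation, a hypothetical `Segment` holds the sort keys of its matching documents in doc-ID order; sorted segments are abandoned after k hits, while unsorted segments are collected in full.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class EarlyTerminationSketch {

    /** Sort keys of the matching docs of one segment, in doc-ID order. */
    static final class Segment {
        final boolean sorted;
        final long[] hitSortKeys;

        Segment(boolean sorted, long... hitSortKeys) {
            this.sorted = sorted;
            this.hitSortKeys = hitSortKeys;
        }
    }

    static final class Result {
        final List<Long> topK;
        final int collected; // number of hits actually visited

        Result(List<Long> topK, int collected) {
            this.topK = topK;
            this.collected = collected;
        }
    }

    /** Returns the k smallest sort keys over all segments. */
    static Result collect(int k, Segment... segments) {
        List<Long> candidates = new ArrayList<>();
        int collected = 0;
        for (Segment s : segments) {
            int inSegment = 0;
            for (long key : s.hitSortKeys) {
                // In a sorted segment, the first k hits are its best k.
                if (s.sorted && inSegment >= k) break;
                candidates.add(key);
                inSegment++;
                collected++;
            }
        }
        Collections.sort(candidates);
        int size = Math.min(k, candidates.size());
        return new Result(new ArrayList<>(candidates.subList(0, size)), collected);
    }
}
```

The saving grows with the number of hits per sorted segment: each such segment contributes at most k visited documents, regardless of how many documents match.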

Page 57: Query Latency Optimization with Lucene

Parallelization

● In general, sharding is better:
  – Shared-nothing
  – Better use cores for handling load

● Multi-threaded query execution:
  – Static indexes: for slow queries, almost perfect speedups (if docs are uniformly distributed over shards)
  – Dynamic indexes: Lucene-2840, Lucene-5299

Page 58: Query Latency Optimization with Lucene

Summary

● Understand your problem

● Scoring can become an issue with many million docs

● Many recent efficiency improvements

● More to come... patches welcome

Page 59: Query Latency Optimization with Lucene

We're Hiring @HERE

Frankfurt, Berlin, Boston, Chicago.

Come work with us. Get in touch!

developer.here.com/geocoder

Page 60: Query Latency Optimization with Lucene

Thank You!

Contact

Email : [email protected]
LinkedIn : http://linkedin.com/in/stefanpohl
Twitter : @pohlstefan

developer.here.com/geocoder