Dawid Weiss- Finite state automata in lucene

Post on 25-May-2015

3.148 views 2 download

Tags:

description

Finite state automata and transducers made it into Lucene fairly recently, but already show a very promising impact on search performance. This data structure is rarely exploited because it is commonly (and unfairly) associated with high complexity. During the talk, I will try to show that automata and transducers are in fact very simple, their construction can be very efficient (memory and time-wise) and their field of applications very broad.

Transcript of Dawid Weiss- Finite state automata in lucene

Finite State Automatain

DawidWEISS

DawidWeiss

20+ years of coding10 years assembly only

Academia & ResearchPhD in Information Retrieval, PUT

Open sourceCarrot2, HPPC, Lucene,…

Industry & BusinessCarrot Search s.c.

.

.

.

.

.

.

. .

Talk outline

State machines (automata)FSAs, DFAs, FSTs and other XXXs.

Use cases in Lucene and SolrSuggester. FuzzySearch. Index.

No API detailsStill @experimental.

(Non)? Deterministic FiniteState (Automata|Machines)

HashSet

hash → slot → value0x29384d34 → lucene0xde3e3354 → lucid0x00000666 → lucifer

FSA (deterministic)

l u c e n e

id

fe

rexists(sequence)oor(pre x)

ceil(pre x)

HashSet

hash → slot → value0x29384d34 → lucene0xde3e3354 → lucid0x00000666 → lucifer

FSA (deterministic)

l u c e n e

id

fe

r

exists(sequence)oor(pre x)

ceil(pre x)

HashSet

hash → slot → value0x29384d34 → lucene0xde3e3354 → lucid0x00000666 → lucifer

FSA (deterministic)

l u c e n e

id

fe

rexists(sequence)oor(pre x)

ceil(pre x)

k i l l

bl

li

deterministic, non-minimal

i l l

b

k

deterministic, minimal

i

l

l

b

k

i

lnon-deterministic,non-minimal

k i l l

bl

li

deterministic, non-minimal

i l l

b

k

deterministic, minimal

i

l

l

b

k

i

lnon-deterministic,non-minimal

k i l l

bl

li

deterministic, non-minimal

i l l

b

k

deterministic, minimal

i

l

l

b

k

i

lnon-deterministic,non-minimal

(Sorted)Map

lucene → 1lucid → 2lucifer → 666

FST (transducer)

l|1 u c e n e

i|1d

f|664e

r

(Sorted)Map

lucene → 1lucid → 2lucifer → 666

FST (transducer)

l u c e n e|1

id|2

fe

r|666

l|1 u c e n e

i|1d

f|664e

r

(Sorted)Map

lucene → 1lucid → 2lucifer → 666

FST (transducer)

l|1 u c e n e

i|1d

f|664e

r

NFSAs and

Regular expressions

Determinizationstates explosion, not always possible

Backtrackingrecursion explosion

aa

e1e2 e1 e1

e+e

e*e

e?e

a?nan

n=3 → a?a?a?aaa

Source: Russ Cox, Regular ExpressionMatching Can Be Simple And Fast (re2).

a?nann=3 → a?a?a?aaa

Source: Russ Cox, Regular ExpressionMatching Can Be Simple And Fast (re2).

a?nann=3 → a?a?a?aaa

Source: Russ Cox, Regular ExpressionMatching Can Be Simple And Fast (re2).

0

5000

10000

15000

20000

25000

30000

35000

0 5 10 15 20 25 30

Tim

e [

ms]

Time of matching an for pattern a?nan , depending on n. Java 1.6, modern hardware.

Linear-time, minimal, deterministic

FSA construction

Linear algorithm from sorted inputby Daciuk, Mihov, et al.

Active pathstates that still can change

States dictionarynodes that will never change

1) common AP pre x2) freeze the rest of AP3) add suffix → new AP

lucene

1) common AP pre x2) freeze the rest of AP3) add suffix → new AP

l u c e n e

lucid

1) common AP pre x2) freeze the rest of AP3) add suffix → new AP

l u c e n e

i

d

1) common AP pre x2) freeze the rest of AP3) add suffix → new AP

l u c e n e

i

d

lucifer

1) common AP pre x2) freeze the rest of AP3) add suffix → new AP

l u c e n e

id

fe r

1) common AP pre x2) freeze the rest of AP3) add suffix → new AP

l u c e n e

id

fe

r

FS(A|T)s in (Lucene|Solr)

Automata in

Lucene|Solr

org.apache.lucene.util.automaton.*partial port of brics, FuzzyQuery, AutomatonTermsEnum

org.apache.lucene.util.automaton.fst.FSTFSA and FSTs from sorted data, suggester, indexes

org.apache.lucene.util.automaton.fst.*

FSA representation

Arc-based, not state-basedMoore vs. Mealy. Compact vs. intuitive

Next-state chainingrequires unusual tricks during construction

Everything in a byte[]traversals-ready, memory-efficient

Dual transition storage formatlookup: bsearch or linear scan

Input: abc, bd, bde.a b c

b

d

d e

a b c

bd e

org.apache.lucene.util.automaton.fst.*

FSA representation

Arc-based, not state-basedMoore vs. Mealy. Compact vs. intuitive

Next-state chainingrequires unusual tricks during construction

Everything in a byte[]traversals-ready, memory-efficient

Dual transition storage formatlookup: bsearch or linear scan

s2 s1s3a b c

s4

bs5

d e

s1

cFL bL eFL dL a bL

s1s1s2

s2 s4 s3s5

s1

cFL bL eFL dL abLN

s2 s4 s3s5

org.apache.lucene.util.automaton.fst.*

FSA representation

Arc-based, not state-basedMoore vs. Mealy. Compact vs. intuitive

Next-state chainingrequires unusual tricks during construction

Everything in a byte[]traversals-ready, memory-efficient

Dual transition storage formatlookup: bsearch or linear scan

s2 s1s3a b c

s4

bs5

d e

s1

cFL bL eFL dL a bL

s1s1s2

s2 s4 s3s5

s1

cFL bL eFL dL abLN

s2 s4 s3s5

org.apache.lucene.util.automaton.fst.*

FSA representation

Arc-based, not state-basedMoore vs. Mealy. Compact vs. intuitive

Next-state chainingrequires unusual tricks during construction

Everything in a byte[]traversals-ready, memory-efficient

Dual transition storage formatlookup: bsearch or linear scan

s2 s1s3a b c

s4

bs5

d e

s1

cFL bL eFL dL a bL

s1s1s2

s2 s4 s3s5

s1

cFL bL eFL dL abLN

s2 s4 s3s5

Input size Compressed size (MB)

Input MB Terms Lucene morf. gzip

Wikipedia t.index 481 38092 045 258 164 149Polish in . 162 3 672 200 3.1 1.7 15.4

.

Use Cases:Solr's Autocomplete

Solr's

Suggesters

Design choicessort order (alpha, score), pre x vs. spelling, boost exact matches?

Weightsterm→weight, lookup(term, onlyMorePopular)

org.apache.solr.spelling.suggest.LookupJaspellLookup, TSTLookup, FSTLookup

flour|3four|4fourier|3furious|2

f

l

o

u

o

u

r i

r

u

ri

|

o u

e

4

|3

s | 2

Find pre x.Depth-in traversal for completions.PQ on score|alpha

. ...Take 1

flour|3four|4fourier|3furious|2

→fou*

f

l

o

u

o

u

r i

r

u

ri

|

o u

e

4

|3

s | 2

Find pre x.Depth-in traversal for completions.PQ on score|alpha

. ...Take 1

2furious3flour3fourier4four

2

3

4

f

f

f o

lo

ur

u

u

rr

i o

i e

us

From score roots, until N collected.Find pre x.Depth-in traversal for completions, stop if N collected.Find/boost exact match.

. ...Take 2

2furious3flour3fourier4four

→fou*

2

3

4

f

f

f o

lo

ur

u

u

rr

i o

i e

us

From score roots, until N collected.Find pre x.Depth-in traversal for completions, stop if N collected.Find/boost exact match.

. ...Take 2

2furious5urious|furious5rious|furious5ious|furious5ous|furious5us|furious5s|furious3flour…

. ...Take 3 (in xes)

.

.

2

3

4

5

6

7

f

f

f

i

o

r

s

u

e

il

o

r

u

o

ru

u

|r

r

eo

u

|

i

r

o ui

|r

s

ol

o

u

r

u

u

s

i

|

u

|

f

r

f

r

r

i o

il

| o

e

us

Constant time lookups!Regardless of the terms dictionary size.

Regardless of pre x length.

Exact matches only.Static snapshot (not incremental).

Discretized weights.

Constant time lookups!Regardless of the terms dictionary size.

Regardless of pre x length.

Exact matches only.Static snapshot (not incremental).

Discretized weights.

Top50KWiki.utf8, 676 KB, 50 000 terms

Jaspell TST FST

..RAM [B] ..7 869 415 ..7 914 524 ..300 175

queries per second,. . . tpq

..PREFIX [100-200] ..458 ..966 ..742

..PREFIX [6-9] ..330 ..228 ..659

..PREFIX [2-4] ..126 ...29 ..501

Summary

Summary and Conclusions

Automatacompact, powerful, efficient data structure

Lucene/Solr bene tsbehind the scenes, but spreading: index, queries, suggesters

API in Lucene…is shaped right now, still @experimental

Acknowledgement

Michael McCandless

Robert Muir

committer: .+

dawid.weiss@carrotsearch.com