Dawid Weiss- Finite state automata in lucene

47
Finite State Automata in Dawid WEISS

description

Finite state automata and transducers made it into Lucene fairly recently, but already show a very promising impact on search performance. This data structure is rarely exploited because it is commonly (and unfairly) associated with high complexity. During the talk, I will try to show that automata and transducers are in fact very simple, their construction can be very efficient (memory and time-wise) and their field of applications very broad.

Transcript of Dawid Weiss- Finite state automata in lucene

Page 1: Dawid Weiss- Finite state automata in lucene

Finite State Automatain

DawidWEISS

Page 2: Dawid Weiss- Finite state automata in lucene

DawidWeiss

20+ years of coding10 years assembly only

Academia & ResearchPhD in Information Retrieval, PUT

Open sourceCarrot2, HPPC, Lucene,…

Industry & BusinessCarrot Search s.c.

.

.

.

.

.

.

. .

Page 3: Dawid Weiss- Finite state automata in lucene

Talk outline

State machines (automata)FSAs, DFAs, FSTs and other XXXs.

Use cases in Lucene and SolrSuggester. FuzzySearch. Index.

No API detailsStill @experimental.

Page 4: Dawid Weiss- Finite state automata in lucene

(Non)? Deterministic FiniteState (Automata|Machines)

Page 5: Dawid Weiss- Finite state automata in lucene

HashSet

hash → slot → value0x29384d34 → lucene0xde3e3354 → lucid0x00000666 → lucifer

FSA (deterministic)

l u c e n e

id

fe

rexists(sequence)oor(pre x)

ceil(pre x)

Page 6: Dawid Weiss- Finite state automata in lucene

HashSet

hash → slot → value0x29384d34 → lucene0xde3e3354 → lucid0x00000666 → lucifer

FSA (deterministic)

l u c e n e

id

fe

r

exists(sequence)oor(pre x)

ceil(pre x)

Page 7: Dawid Weiss- Finite state automata in lucene

HashSet

hash → slot → value0x29384d34 → lucene0xde3e3354 → lucid0x00000666 → lucifer

FSA (deterministic)

l u c e n e

id

fe

rexists(sequence)oor(pre x)

ceil(pre x)

Page 8: Dawid Weiss- Finite state automata in lucene

k i l l

bl

li

deterministic, non-minimal

i l l

b

k

deterministic, minimal

i

l

l

b

k

i

lnon-deterministic,non-minimal

Page 9: Dawid Weiss- Finite state automata in lucene

k i l l

bl

li

deterministic, non-minimal

i l l

b

k

deterministic, minimal

i

l

l

b

k

i

lnon-deterministic,non-minimal

Page 10: Dawid Weiss- Finite state automata in lucene

k i l l

bl

li

deterministic, non-minimal

i l l

b

k

deterministic, minimal

i

l

l

b

k

i

lnon-deterministic,non-minimal

Page 11: Dawid Weiss- Finite state automata in lucene

(Sorted)Map

lucene → 1lucid → 2lucifer → 666

FST (transducer)

l|1 u c e n e

i|1d

f|664e

r

Page 12: Dawid Weiss- Finite state automata in lucene

(Sorted)Map

lucene → 1lucid → 2lucifer → 666

FST (transducer)

l u c e n e|1

id|2

fe

r|666

l|1 u c e n e

i|1d

f|664e

r

Page 13: Dawid Weiss- Finite state automata in lucene

(Sorted)Map

lucene → 1lucid → 2lucifer → 666

FST (transducer)

l|1 u c e n e

i|1d

f|664e

r

Page 14: Dawid Weiss- Finite state automata in lucene

NFSAs and

Regular expressions

Determinizationstates explosion, not always possible

Backtrackingrecursion explosion

aa

e1e2 e1 e1

e+e

e*e

e?e

Page 15: Dawid Weiss- Finite state automata in lucene

a?nan

n=3 → a?a?a?aaa

Source: Russ Cox, Regular ExpressionMatching Can Be Simple And Fast (re2).

Page 16: Dawid Weiss- Finite state automata in lucene

a?nann=3 → a?a?a?aaa

Source: Russ Cox, Regular ExpressionMatching Can Be Simple And Fast (re2).

Page 17: Dawid Weiss- Finite state automata in lucene

a?nann=3 → a?a?a?aaa

Source: Russ Cox, Regular ExpressionMatching Can Be Simple And Fast (re2).

Page 18: Dawid Weiss- Finite state automata in lucene

0

5000

10000

15000

20000

25000

30000

35000

0 5 10 15 20 25 30

Tim

e [

ms]

Time of matching an for pattern a?nan , depending on n. Java 1.6, modern hardware.

Page 19: Dawid Weiss- Finite state automata in lucene

Linear-time, minimal, deterministic

FSA construction

Linear algorithm from sorted inputby Daciuk, Mihov, et al.

Active pathstates that still can change

States dictionarynodes that will never change

Page 20: Dawid Weiss- Finite state automata in lucene

1) common AP pre x2) freeze the rest of AP3) add suffix → new AP

lucene

Page 21: Dawid Weiss- Finite state automata in lucene

1) common AP pre x2) freeze the rest of AP3) add suffix → new AP

l u c e n e

lucid

Page 22: Dawid Weiss- Finite state automata in lucene

1) common AP pre x2) freeze the rest of AP3) add suffix → new AP

l u c e n e

i

d

Page 23: Dawid Weiss- Finite state automata in lucene

1) common AP pre x2) freeze the rest of AP3) add suffix → new AP

l u c e n e

i

d

lucifer

Page 24: Dawid Weiss- Finite state automata in lucene

1) common AP pre x2) freeze the rest of AP3) add suffix → new AP

l u c e n e

id

fe r

Page 25: Dawid Weiss- Finite state automata in lucene

1) common AP pre x2) freeze the rest of AP3) add suffix → new AP

l u c e n e

id

fe

r

Page 26: Dawid Weiss- Finite state automata in lucene

FS(A|T)s in (Lucene|Solr)

Page 27: Dawid Weiss- Finite state automata in lucene

Automata in

Lucene|Solr

org.apache.lucene.util.automaton.*partial port of brics, FuzzyQuery, AutomatonTermsEnum

org.apache.lucene.util.automaton.fst.FSTFSA and FSTs from sorted data, suggester, indexes

Page 28: Dawid Weiss- Finite state automata in lucene

org.apache.lucene.util.automaton.fst.*

FSA representation

Arc-based, not state-basedMoore vs. Mealy. Compact vs. intuitive

Next-state chainingrequires unusual tricks during construction

Everything in a byte[]traversals-ready, memory-efficient

Dual transition storage formatlookup: bsearch or linear scan

Input: abc, bd, bde.a b c

b

d

d e

a b c

bd e

Page 29: Dawid Weiss- Finite state automata in lucene

org.apache.lucene.util.automaton.fst.*

FSA representation

Arc-based, not state-basedMoore vs. Mealy. Compact vs. intuitive

Next-state chainingrequires unusual tricks during construction

Everything in a byte[]traversals-ready, memory-efficient

Dual transition storage formatlookup: bsearch or linear scan

s2 s1s3a b c

s4

bs5

d e

s1

cFL bL eFL dL a bL

s1s1s2

s2 s4 s3s5

s1

cFL bL eFL dL abLN

s2 s4 s3s5

Page 30: Dawid Weiss- Finite state automata in lucene

org.apache.lucene.util.automaton.fst.*

FSA representation

Arc-based, not state-basedMoore vs. Mealy. Compact vs. intuitive

Next-state chainingrequires unusual tricks during construction

Everything in a byte[]traversals-ready, memory-efficient

Dual transition storage formatlookup: bsearch or linear scan

s2 s1s3a b c

s4

bs5

d e

s1

cFL bL eFL dL a bL

s1s1s2

s2 s4 s3s5

s1

cFL bL eFL dL abLN

s2 s4 s3s5

Page 31: Dawid Weiss- Finite state automata in lucene

org.apache.lucene.util.automaton.fst.*

FSA representation

Arc-based, not state-basedMoore vs. Mealy. Compact vs. intuitive

Next-state chainingrequires unusual tricks during construction

Everything in a byte[]traversals-ready, memory-efficient

Dual transition storage formatlookup: bsearch or linear scan

s2 s1s3a b c

s4

bs5

d e

s1

cFL bL eFL dL a bL

s1s1s2

s2 s4 s3s5

s1

cFL bL eFL dL abLN

s2 s4 s3s5

Page 32: Dawid Weiss- Finite state automata in lucene

Input size Compressed size (MB)

Input MB Terms Lucene morf. gzip

Wikipedia t.index 481 38092 045 258 164 149Polish in . 162 3 672 200 3.1 1.7 15.4

.

Page 33: Dawid Weiss- Finite state automata in lucene

Use Cases:Solr's Autocomplete

Page 34: Dawid Weiss- Finite state automata in lucene

Solr's

Suggesters

Design choicessort order (alpha, score), pre x vs. spelling, boost exact matches?

Weightsterm→weight, lookup(term, onlyMorePopular)

org.apache.solr.spelling.suggest.LookupJaspellLookup, TSTLookup, FSTLookup

Page 35: Dawid Weiss- Finite state automata in lucene

flour|3four|4fourier|3furious|2

f

l

o

u

o

u

r i

r

u

ri

|

o u

e

4

|3

s | 2

Find pre x.Depth-in traversal for completions.PQ on score|alpha

. ...Take 1

Page 36: Dawid Weiss- Finite state automata in lucene

flour|3four|4fourier|3furious|2

→fou*

f

l

o

u

o

u

r i

r

u

ri

|

o u

e

4

|3

s | 2

Find pre x.Depth-in traversal for completions.PQ on score|alpha

. ...Take 1

Page 37: Dawid Weiss- Finite state automata in lucene

2furious3flour3fourier4four

2

3

4

f

f

f o

lo

ur

u

u

rr

i o

i e

us

From score roots, until N collected.Find pre x.Depth-in traversal for completions, stop if N collected.Find/boost exact match.

. ...Take 2

Page 38: Dawid Weiss- Finite state automata in lucene

2furious3flour3fourier4four

→fou*

2

3

4

f

f

f o

lo

ur

u

u

rr

i o

i e

us

From score roots, until N collected.Find pre x.Depth-in traversal for completions, stop if N collected.Find/boost exact match.

. ...Take 2

Page 39: Dawid Weiss- Finite state automata in lucene

2furious5urious|furious5rious|furious5ious|furious5ous|furious5us|furious5s|furious3flour…

. ...Take 3 (in xes)

Page 40: Dawid Weiss- Finite state automata in lucene

.

.

2

3

4

5

6

7

f

f

f

i

o

r

s

u

e

il

o

r

u

o

ru

u

|r

r

eo

u

|

i

r

o ui

|r

s

ol

o

u

r

u

u

s

i

|

u

|

f

r

f

r

r

i o

il

| o

e

us

Page 41: Dawid Weiss- Finite state automata in lucene

Constant time lookups!Regardless of the terms dictionary size.

Regardless of pre x length.

Exact matches only.Static snapshot (not incremental).

Discretized weights.

Page 42: Dawid Weiss- Finite state automata in lucene

Constant time lookups!Regardless of the terms dictionary size.

Regardless of pre x length.

Exact matches only.Static snapshot (not incremental).

Discretized weights.

Page 43: Dawid Weiss- Finite state automata in lucene

Top50KWiki.utf8, 676 KB, 50 000 terms

Jaspell TST FST

..RAM [B] ..7 869 415 ..7 914 524 ..300 175

queries per second,. . . tpq

..PREFIX [100-200] ..458 ..966 ..742

..PREFIX [6-9] ..330 ..228 ..659

..PREFIX [2-4] ..126 ...29 ..501

Page 44: Dawid Weiss- Finite state automata in lucene

Summary

Page 45: Dawid Weiss- Finite state automata in lucene

Summary and Conclusions

Automatacompact, powerful, efficient data structure

Lucene/Solr bene tsbehind the scenes, but spreading: index, queries, suggesters

API in Lucene…is shaped right now, still @experimental

Page 46: Dawid Weiss- Finite state automata in lucene

Acknowledgement

Michael McCandless

Robert Muir

committer: .+