Apache Lucene 7 - Berlin Buzzwords · My Background • Committer and PMC member of Apache Lucene...

Apache Lucene 7 What’s coming next? Uwe Schindler Apache Software Foundation | SD DataSolutions GmbH | PANGAEA @thetaph1 [email protected]


Apache Lucene 7: What’s coming next?

Uwe Schindler

Apache Software Foundation | SD DataSolutions GmbH | PANGAEA

@thetaph1 ∙ [email protected]


My Background

• Committer and PMC member of Apache Lucene and Solr - main focus is on development of Lucene Core.

• Implemented fast numerical search and maintains the new attribute-based text analysis API. Well known as Generics and Sophisticated Backwards Compatibility 👮.

• Elasticsearch lover.

• Working as consultant and software architect at SD DataSolutions GmbH in Bremen, Germany.

• Maintaining PANGAEA (Data Publisher for Earth & Environmental Science) where I implemented the portal's geo-spatial retrieval functions with Apache Lucene Core and Elasticsearch.


ON THE WAY TO 7…


Lucene 7: When?

• Expected release date:

As always: no comment, but early summer is likely!

• Release branch (branch_7x) will be cut in the next few days


DOC VALUES


Lucene 4: DocValues

• FieldCache was already replaced by DocValues in Lucene 4!

• What are DocValues?


Image credits: Simon Willnauer / Trifork


Lucene 4: DocValues


Lucene 7: DocValues changes

• Complete refactoring

• “Random access” removed: solely iterators!
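To make the change from random access to iterators concrete, here is a minimal plain-Java sketch. It does not use Lucene itself: the class `SparseLongValues` is a hypothetical stand-in that only mirrors the shape of iterator-style access (move forward with `nextDoc()`, then read the current value), so all names here are assumptions for illustration.

```java
public class DocValuesIteratorSketch {
    /** Hypothetical iterator over (docId, value) pairs; only documents that have a value are stored. */
    static final class SparseLongValues {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        private final int[] docIds;   // sorted ascending
        private final long[] values;  // values[i] belongs to docIds[i]
        private int pos = -1;

        SparseLongValues(int[] docIds, long[] values) {
            this.docIds = docIds;
            this.values = values;
        }

        /** Moves to the next document that has a value, or returns NO_MORE_DOCS. */
        int nextDoc() {
            pos++;
            return pos < docIds.length ? docIds[pos] : NO_MORE_DOCS;
        }

        /** Value of the current document; only valid after nextDoc() returned a real doc id. */
        long longValue() {
            return values[pos];
        }
    }

    /** Sums all values in one forward pass; there is no get(docId) style lookup. */
    static long sum(SparseLongValues dv) {
        long total = 0;
        while (dv.nextDoc() != SparseLongValues.NO_MORE_DOCS) {
            total += dv.longValue();
        }
        return total;
    }

    public static void main(String[] args) {
        // Documents 1, 4 and 7 carry a value; all other documents have none.
        SparseLongValues dv =
            new SparseLongValues(new int[] {1, 4, 7}, new long[] {10L, 20L, 12L});
        System.out.println(sum(dv)); // prints 42
    }
}
```

Because the consumer can only move forward, documents without a value never need a slot, which is what makes a sparse encoding possible.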


Diagram: results from a DocIdSetIterator, (A, 4) (C, 12) (H, 5) (K, 19) (L, 1) (N, 2), are enqueued into the TopDocs PriorityQueue, which is bound to 4 entries: (K, 19) (C, 12) (H, 5) (A, 4).
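The bounded queue in this diagram can be sketched with `java.util.PriorityQueue`. This is an illustration only, not Lucene's collector code; the `Hit` class and `topN` method are hypothetical names, and the input is the data shown on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopDocsSketch {
    /** Hypothetical (document, score) pair standing in for a scored hit. */
    static final class Hit {
        final String doc;
        final int score;

        Hit(String doc, int score) {
            this.doc = doc;
            this.score = score;
        }

        @Override
        public String toString() {
            return "(" + doc + ", " + score + ")";
        }
    }

    /** Keeps only the n best hits while consuming the results in iteration order. */
    static List<Hit> topN(Iterable<Hit> results, int n) {
        // Min-heap on score: the head is always the weakest hit kept so far.
        PriorityQueue<Hit> queue =
            new PriorityQueue<>(Comparator.comparingInt((Hit h) -> h.score));
        for (Hit hit : results) {
            if (queue.size() < n) {
                queue.add(hit);
            } else if (hit.score > queue.peek().score) {
                queue.poll();   // drop the weakest hit
                queue.add(hit); // enqueue the better one
            }
        }
        List<Hit> top = new ArrayList<>(queue);
        top.sort(Comparator.comparingInt((Hit h) -> h.score).reversed());
        return top;
    }

    public static void main(String[] args) {
        List<Hit> results = Arrays.asList(
            new Hit("A", 4), new Hit("C", 12), new Hit("H", 5),
            new Hit("K", 19), new Hit("L", 1), new Hit("N", 2));
        System.out.println(topN(results, 4)); // prints [(K, 19), (C, 12), (H, 5), (A, 4)]
    }
}
```

The results are consumed strictly in iteration order, so any per-document value needed for ranking can be read through a forward-only iterator.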


Lucene 7: DocValues changes

• Allows better index compression

• Document norms already optimized: sparse!


Lucene 7: DocValues changes

FieldCache and UninvertingReader are finally gone!


NUMERICS


Lucene 6: Point Values (also known as dimensional values)

• Successor of NumericField (Solr: TrieField)

• Multidimensional (e.g. geographic coordinates): up to 8 dimensions

• Up to 128 bits / 16 bytes per value (IPv6 range queries are now possible)

See: https://www.elastic.co/blog/lucene-points-6.0
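To illustrate why 16 bytes per value is enough for IPv6 range queries, the following plain-Java sketch encodes addresses as 16-byte arrays and compares them as unsigned byte strings. It does not call Lucene; the class and method names are hypothetical, and the unsigned byte-wise ordering is an assumption about how such fixed-width values can be compared.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class Ipv6RangeSketch {
    /** Encodes an IPv6 literal as its 16 raw bytes (128 bits). */
    static byte[] encode(String literal) {
        try {
            byte[] bytes = InetAddress.getByName(literal).getAddress();
            if (bytes.length != 16) {
                throw new IllegalArgumentException("not an IPv6 address: " + literal);
            }
            return bytes;
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("cannot parse: " + literal, e);
        }
    }

    /** Compares two equal-length byte arrays as unsigned big-endian numbers. */
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int diff = Byte.toUnsignedInt(a[i]) - Byte.toUnsignedInt(b[i]);
            if (diff != 0) {
                return diff;
            }
        }
        return 0;
    }

    /** True if lower <= value <= upper in unsigned byte order. */
    static boolean inRange(byte[] value, byte[] lower, byte[] upper) {
        return compareUnsigned(lower, value) <= 0 && compareUnsigned(value, upper) <= 0;
    }

    public static void main(String[] args) {
        byte[] lower = encode("2001:db8::");
        byte[] upper = encode("2001:db8::ffff");
        System.out.println(inRange(encode("2001:db8::42"), lower, upper)); // prints true
        System.out.println(inRange(encode("2001:db9::1"), lower, upper));  // prints false
    }
}
```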


Legacy Numeric Range Queries

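Diagram: trie of indexed prefix terms. Top level: 4, 5, 6. Middle level: 42, 44, 52, 63, 64. Full-precision terms: 421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644.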
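The trie diagram on this slide indexes each number under several prefixes, so that a range can be matched by a handful of prefix terms instead of every single value. A rough sketch of that decomposition follows. It uses decimal digits for readability, whereas the legacy implementation worked on bits with a configurable precision step; the class and method names are hypothetical, and the inputs are assumed to satisfy 0 <= lo <= hi < 10^digits.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixRangeSketch {
    /**
     * Covers the inclusive range [lo, hi] with aligned decimal blocks.
     * Each block is returned as a prefix: for three-digit numbers "5" stands for
     * 500..599, "44" for 440..449, and "423" for the single value 423.
     */
    static List<String> split(int lo, int hi, int digits) {
        List<String> prefixes = new ArrayList<>();
        while (lo <= hi) {
            int size = 1;   // number of values covered by the current block
            int level = 0;  // number of trailing digits the block ignores
            // Grow the block while it stays aligned at lo and inside the range.
            while (level < digits - 1 && lo % (size * 10) == 0 && lo + size * 10 - 1 <= hi) {
                size *= 10;
                level++;
            }
            String padded = String.format("%0" + digits + "d", lo);
            prefixes.add(padded.substring(0, digits - level));
            lo += size;
        }
        return prefixes;
    }

    public static void main(String[] args) {
        List<String> terms = split(423, 642, 3);
        // 220 values are covered by 22 prefix terms, among them "44", "5" and "63".
        System.out.println(terms.size() + " terms: " + terms);
    }
}
```

Only prefixes that actually occur in the index need to be visited, which is why the diagram shows far fewer nodes than this exhaustive list.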


Block k-d-Trees

Very similar approach to NumericField!

– Just more dynamic

– Adapts dynamically depending on number of unique values!
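A heavily simplified, one-dimensional sketch of the block idea is shown below. It is not Lucene's implementation: the names are hypothetical, and a real tree would navigate inner nodes instead of scanning the leaf list. The point it illustrates is that blocks are formed from the data itself (equal counts per block), so densely populated value ranges get narrower blocks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockTreeSketch {
    /** Recursively halves the sorted values until each leaf block holds at most maxLeafSize values. */
    static void build(long[] sorted, int from, int to, int maxLeafSize, List<long[]> leaves) {
        if (from == to) {
            return; // nothing to store
        }
        if (to - from <= maxLeafSize) {
            leaves.add(Arrays.copyOfRange(sorted, from, to));
            return;
        }
        int mid = (from + to) >>> 1; // split at the median position
        build(sorted, from, mid, maxLeafSize, leaves);
        build(sorted, mid, to, maxLeafSize, leaves);
    }

    /** Counts values in [lo, hi], touching individual values only in boundary blocks. */
    static int countInRange(List<long[]> leaves, long lo, long hi) {
        int count = 0;
        for (long[] leaf : leaves) {
            long min = leaf[0];
            long max = leaf[leaf.length - 1];
            if (max < lo || min > hi) {
                continue;                 // block lies outside the range
            }
            if (lo <= min && max <= hi) {
                count += leaf.length;     // block lies fully inside the range
                continue;
            }
            for (long v : leaf) {         // boundary block: check each value
                if (lo <= v && v <= hi) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long[] values = {1, 2, 3, 4, 100, 200, 300, 400}; // must be sorted
        List<long[]> leaves = new ArrayList<>();
        build(values, 0, values.length, 2, leaves);
        System.out.println(leaves.size());                 // prints 4
        System.out.println(countInRange(leaves, 2, 150));  // prints 4
    }
}
```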


Comparison (1D)

Diagram: Mike McCandless


Comparison (2D)

Diagram: Mike McCandless


Lucene 7: Legacy Numerics

• LegacyNumericField removed

• Terms in the index are no longer usable!

Requires reindexing!


SCORING


Lucene 6: Okapi BM25

It’s now the default!


BTW: What’s Okapi BM25?

• BM25 is a bag-of-words retrieval function

• Probabilistic model instead of a vector space model

• Function of TF and IDF


BTW: What’s Okapi BM25?

• TF is not unbounded: saturation!

– Documents with high term frequency don’t increase score too much

• Average document length

– Used with tuning factor to change significance of repeated terms

• IDF similar to standard TF-IDF
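The saturation and length-normalization points above can be checked numerically with the textbook BM25 formula. The sketch below is plain Java, not a call into Lucene; the parameter values k1 = 1.2 and b = 0.75 are the commonly used defaults and are assumed here, and the document statistics in `main` are made up for illustration.

```java
public class Bm25Sketch {
    /** IDF in its common BM25 form: rare terms weigh more. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * BM25 contribution of one term.
     * k1 controls how quickly term frequency saturates,
     * b controls how strongly document length is normalized.
     */
    static double score(double tf, double docLen, double avgDocLen,
                        double idf, double k1, double b) {
        double norm = k1 * (1 - b + b * docLen / avgDocLen);
        return idf * tf * (k1 + 1) / (tf + norm);
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        double b = 0.75;
        double idf = idf(1000, 10); // hypothetical corpus: 1000 docs, term occurs in 10
        for (int tf : new int[] {1, 2, 5, 10, 100}) {
            // The score keeps growing with tf but never exceeds idf * (k1 + 1).
            System.out.printf("tf=%d score=%.4f (upper bound %.4f)%n",
                tf, score(tf, 100, 100, idf, k1, b), idf * (k1 + 1));
        }
        // Same tf in a document twice as long as average scores lower.
        System.out.println(score(5, 200, 100, idf, k1, b) < score(5, 100, 100, idf, k1, b));
    }
}
```

With classic TF-IDF the term-frequency factor keeps growing with tf, whereas here every extra occurrence adds less than the one before.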


Lucene 7: BM25 followup

• Query normalization removed

• BooleanQuery coordination factors removed

• Classical similarity aka “TF-IDF” is no longer supported, but is still available!

• Index time boosting is gone


OTHER CHANGES


Lucene 7: Query Parser

• Default query parser no longer splits on whitespace

– Splits only on operators

– Better support for eastern languages

• Old behavior can be re-enabled


Lucene 7: Custom term frequency

• New TokenStream attribute (integer): TermFrequencyAttribute

• 1 by default; allows emulating “stacked tokens”

• Only works when no positions/offsets are indexed!


JAVA VERSION ?


Lucene 7: Java Version

• Java 8 is still minimum requirement!

• lucene-core.jar only uses compact1 profile

• All other (Lucene) parts use compact2 profile


Lucene 7: Java 9 ?

• Compatibility with Java 9 module system restrictions

• Unicode 8: 💩👮🐸

– with ICU or Java 9

– ICU is now already on Unicode 9!

• Nightly tests with early access builds

– currently Java 9 build 173


HOW TO MIGRATE ?


Lucene 7: "Anti-Feature"

Removal of Lucene 5 index support!

• Get rid of old index segments: IndexUpgrader in latest Lucene 6 release helps!

• Elasticsearch already has an automatic index upgrader; Solr users have to do this manually


Lucene 7: Index Version

• Lucene stores the version that created the index

– Preserved during merges or index upgrades

• Better detection of no longer supported features

– Broken offset detection enabled by default for new indexes


THANK YOU! Questions?


SD DataSolutions GmbH
Wätjenstr. 49
28213 Bremen, Germany
+49 421 40889785-0

http://www.sd-datasolutions.de