Introduction to Solr

Introduction to Solr Radu Gheorghe Sematext Group, Inc.


Page 1: Introduction to solr

Introduction to Solr

Radu Gheorghe, Sematext Group, Inc.

Page 2: Introduction to solr

About me

Logsene, SPM

ES API

metrics

...

Products Services

+ https://sematext.com/blog/author/radu7gheorghe/

+ https://www.manning.com/books/elasticsearch-in-action

Page 3: Introduction to solr

Agenda

What is Solr

When to use it

When not to use it

How it works

Demo

Pleeeeease ask questions. Otherwise it will be boring :(

Page 4: Introduction to solr

What is Solr

Open source

Search engine

Based on Apache Lucene*

Distributed (SolrCloud) or not (master-slave)

* Actually, the two projects merged in 2010

Page 5: Introduction to solr

More on search: the term dictionary and its friends

Term        Docs   Positions   counts, stored, etc.
big         1,3    [0],[2]     ...
bucharest   3      [0]
data        1      [1]
fun         1      ...
is          1,3
other       2
text        2

1) Big data is fun

2) Other text

3) Bucharest is big

analysis

big AND data

“big data”
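The term dictionary above can be sketched in a few lines of Python. This is a toy model, not how Lucene stores its index: the `analyze` function (lowercase plus whitespace split) and all helper names are assumptions made for illustration. It shows why positions matter: `big AND data` only needs the doc lists, while the phrase "big data" also needs consecutive positions.

```python
from collections import defaultdict

# The three documents from the slide.
docs = {1: "Big data is fun", 2: "Other text", 3: "Bucharest is big"}

def analyze(text):
    """Toy analysis chain (an assumption): lowercase, then split on whitespace."""
    return text.lower().split()

def build_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search_and(index, *terms):
    """Docs that contain every term, e.g. big AND data."""
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

def search_phrase(index, phrase):
    """Docs where the analyzed terms sit at consecutive positions."""
    terms = analyze(phrase)
    hits = []
    for doc_id in search_and(index, *terms):
        starts = index[terms[0]][doc_id]
        if any(all(s + i in index[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return hits

index = build_index(docs)
print(dict(index["big"]))                 # postings with positions for "big"
print(search_and(index, "big", "data"))   # AND query
print(search_phrase(index, "big data"))   # phrase query
```

Reversing the phrase to "data big" still passes the AND check for document 1 but fails the position check, which is the difference between the two query types.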

Page 6: Introduction to solr

Segments and merging

Page 7: Introduction to solr

The [relevancy] score

BM25: bag-of-words, based on TF-IDF

q=big AND data

Example documents: "big big big big big big" and "I have big big big data"

Term Frequency: more occurrences in the document, more weight

Inverse Document Frequency: fewer occurrences in the index, more weight
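To make the two weights concrete, here is a minimal sketch of the textbook BM25 formula. It is not Lucene's exact implementation (recent Lucene versions differ in constant factors); the function names are hypothetical, and `k1=1.2`, `b=0.75` are the commonly used defaults. The two sample documents are the ones from the slide.

```python
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms in the index weigh more."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Score contribution of one term: TF saturates, long docs are penalized."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf(num_docs, doc_freq) * tf * (k1 + 1) / (tf + norm)

def score(query_terms, doc, corpus):
    """Sum the per-term contributions for one tokenized document."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    total = 0.0
    for term in query_terms:
        doc_freq = sum(1 for d in corpus if term in d)
        if doc_freq == 0:
            continue  # term is not in the index at all
        total += bm25_term(doc.count(term), len(doc), avg_len,
                           len(corpus), doc_freq)
    return total

doc_a = "big big big big big big".split()
doc_b = "I have big big big data".lower().split()
corpus = [doc_a, doc_b]

print(score(["big"], doc_a, corpus), score(["big"], doc_b, corpus))
print(score(["big", "data"], doc_b, corpus))
```

For the single term `big`, the first document scores higher because it has more occurrences. Within the second document, `data` contributes more than `big` even with one occurrence, because it is rarer in the index.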

Page 8: Introduction to solr

Relevancy tuning

title: Big Data

description: this is a book about big data

published: 2016

title: Spark Rulz

description: big data big data big data big data

published: 2015

q=big AND data

boost fields

boost values
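As a sketch of what "boost fields" and "boost values" can look like in a request, the snippet below builds a Solr query URL using the eDisMax parser's `qf` (per-field weights) and `bq` (boost query) parameters. The host, the `books` collection name and the boost factors are assumptions for illustration; the field names come from the slide.

```python
from urllib.parse import urlencode

def boosted_query_url(base_url, collection, user_query):
    """Build a /select URL that boosts title matches and 2016 publications."""
    params = {
        "defType": "edismax",          # query parser that understands qf/bq
        "q": user_query,
        "qf": "title^3 description",   # boost fields: title counts 3x (assumed weight)
        "bq": "published:2016^2",      # boost values: favor 2016 (assumed weight)
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"

# Hypothetical local Solr and collection name.
url = boosted_query_url("http://localhost:8983/solr", "books", "big AND data")
print(url)
```

With weights like these, the "Big Data" book would tend to outrank "Spark Rulz" despite the latter repeating the terms in its description, though the actual ranking depends on the whole index.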

Page 9: Introduction to solr

Back to sorting: where the inverted index fails

Term (rating, stars)   Docs
1                      1,2,8,5,128
2                      7,84,129
3                      3,29,345
4                      11,123,455
5                      12,14,16,17

Search returned docs 84, 455, 12 and 8

Now sort them by rating. ¯\_(ツ)_/¯

Page 10: Introduction to solr

Enter doc values

Doc Terms

8 1

12 3

84 5

129 4

455 2

Search returned docs 84, 455, 12 and 8

Now sort them by rating.

Similar, but not quite like stored fields*

* Faster retrieval for doc values. For analyzed text, you’re stuck with stored fields and in-memory field cache
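A small sketch of the difference, using the doc-to-rating values from the table above. The dictionaries are toy stand-ins for the on-disk structures and the helper names are hypothetical. With only an inverted index, finding one document's rating means scanning postings; with doc values it is a direct lookup, which makes sorting cheap.

```python
# Doc values: doc ID -> rating, as in the table on this slide.
doc_values = {8: 1, 12: 3, 84: 5, 129: 4, 455: 2}

# The same data inverted (rating -> doc IDs), built here only for comparison.
inverted = {}
for doc_id, rating in doc_values.items():
    inverted.setdefault(rating, []).append(doc_id)

def rating_from_inverted(inverted, doc_id):
    """Scan each term's postings until the doc shows up: slow for sorting."""
    for rating, postings in inverted.items():
        if doc_id in postings:
            return rating
    raise KeyError(doc_id)

def sort_hits(hits, doc_values, descending=True):
    """Sort search hits by a per-document value with one lookup per hit."""
    return sorted(hits, key=doc_values.__getitem__, reverse=descending)

hits = [84, 455, 12, 8]
print(sort_hits(hits, doc_values))                    # best rated first
print(sort_hits(hits, doc_values, descending=False))  # worst rated first
```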

Page 11: Introduction to solr

Facets

search returns doc IDs

facet=true

facet.field=host

doc1: host=server01

doc2: host=server02

doc3: host=server01

doc4: host=server01

server01: 3

server02: 1

doc values, usually*

* can be filter cache on low cardinality fields (depends on facet.method)
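Conceptually, a field facet is a count of field values over the matching doc IDs. The sketch below reproduces the counts from the slide with a plain dictionary standing in for doc values; it is an illustration of the idea, not Solr's implementation, and `facet_counts` is a hypothetical helper.

```python
from collections import Counter

# Per-document values of the "host" field, as on the slide.
host_doc_values = {
    "doc1": "server01",
    "doc2": "server02",
    "doc3": "server01",
    "doc4": "server01",
}

def facet_counts(matching_doc_ids, field_values, limit=None):
    """Count field values over the docs a search returned, most frequent first."""
    counts = Counter(field_values[d] for d in matching_doc_ids if d in field_values)
    return counts.most_common(limit)

# A search that matched every document.
print(facet_counts(host_doc_values.keys(), host_doc_values))
```

Passing a subset of doc IDs gives the counts for a narrower search, which is how facets stay in sync with the query.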

Page 12: Introduction to solr

Facets can be hierarchical

Request:

top_genres:{
  terms:{
    field: genre,
    limit: 5,
    facet:{
      top_authors:{
        terms:{
          field: author,
          limit: 2
        }
      }
    }
  }
}

Response:

"top_genres":{
  "buckets":[
    {
      "val":"Fantasy",
      "count":5432,
      "top_authors":{ // top authors in the "Fantasy" genre
        "buckets":[
          { "val":"Mercedes Lackey", "count":121 },
          { "val":"Piers Anthony", "count":98 }
        ]
      }
    },
    {
      "val":"Mystery",
      "count":4322,
      "top_authors":{ // top authors in the "Mystery" genre
        "buckets":[
          { "val":"James Patterson", "count":146 },
          ...

Can also be numeric/date ranges or functions like avg, sum, unique or percentile
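The nested facet can also be written as a JSON request body, here in the `"type": "terms"` form of Solr's JSON Facet API. This is a sketch: the `avg_price` metric and its `price` field are assumptions added to illustrate a function facet, and `flatten_buckets` is a hypothetical helper for reading the response. The sample response is the "Fantasy" bucket from the slide.

```python
import json

# Request body for the JSON Request API; "limit": 0 skips returning documents.
request_body = {
    "query": "*:*",
    "limit": 0,
    "facet": {
        "top_genres": {
            "type": "terms",
            "field": "genre",
            "limit": 5,
            "facet": {
                "top_authors": {"type": "terms", "field": "author", "limit": 2},
                "avg_price": "avg(price)",  # assumed numeric field
            },
        }
    },
}

def flatten_buckets(facets, parent="top_genres", child="top_authors"):
    """Turn nested buckets into (parent value, child value, count) rows."""
    rows = []
    for bucket in facets[parent]["buckets"]:
        for sub in bucket.get(child, {}).get("buckets", []):
            rows.append((bucket["val"], sub["val"], sub["count"]))
    return rows

# The first bucket of the response shown on the slide.
sample_facets = {
    "top_genres": {
        "buckets": [
            {
                "val": "Fantasy",
                "count": 5432,
                "top_authors": {
                    "buckets": [
                        {"val": "Mercedes Lackey", "count": 121},
                        {"val": "Piers Anthony", "count": 98},
                    ]
                },
            }
        ]
    }
}

print(json.dumps(request_body, indent=2))
print(flatten_buckets(sample_facets))
```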

Page 13: Introduction to solr

Beyond the shards: streaming aggregations

Sources

search, facet, jdbc, ...

Decorators

rollup, unique, innerJoin, parallel, ...

shard1 shard2

worker1 worker2

Solr endpoint

client app

Page 14: Introduction to solr

Beyond the shards: streaming aggregations

Sources

search, facet, jdbc, ...

Decorators

rollup, unique, innerJoin, parallel, ...

⇒ Parallel SQL, Text Classification, Graph Traversal

shard1 shard2

worker1 worker2

Solr endpoint

client app
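The source/decorator idea can be imitated in plain Python: each shard emits a sorted stream of tuples, the streams are merged, and a `rollup`-style decorator aggregates over the merged stream without holding everything in memory. This is a single-process toy, not Solr's streaming API; it leaves out workers and the `parallel` decorator, and the shard contents are made up for illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def search(shard_docs, sort_field):
    """Stream source: emit one shard's tuples sorted by sort_field."""
    return iter(sorted(shard_docs, key=itemgetter(sort_field)))

def merge_streams(streams, sort_field):
    """Lazily merge already-sorted shard streams into one sorted stream."""
    return heapq.merge(*streams, key=itemgetter(sort_field))

def rollup(stream, over):
    """Stream decorator: count tuples per value of `over` (input must be sorted)."""
    for value, group in groupby(stream, key=itemgetter(over)):
        yield {over: value, "count(*)": sum(1 for _ in group)}

# Hypothetical shard contents.
shard1 = [{"host": "server02"}, {"host": "server01"}]
shard2 = [{"host": "server01"}, {"host": "server01"}]

streams = [search(shard, "host") for shard in (shard1, shard2)]
result = list(rollup(merge_streams(streams, "host"), over="host"))
print(result)
```

Because every stage consumes sorted input one tuple at a time, the aggregation is not limited to a top-N per shard, which is the point of going "beyond the shards".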

Page 15: Introduction to solr

Master-slave

indexer master

slave1

slave2

slave3

searcher

docs

queries

replicates segments

Page 16: Introduction to solr

Master-slave: high-QPS on static data

indexer master

slave1

slave2

slave3

searcher

replicates segments

docs

queries

Simple

Battle-tested

Index data only once

Slaves can cache like crazy

Separate roles ⇒ separate (read: optimized) hardware and configs

Page 17: Introduction to solr

SolrCloud

leader2

leader1

replica2

replica1

Zookeeper

Solr nodes

indexer searcher

Page 18: Introduction to solr

SolrCloud

leader2

leader1

replica2

replica1

Zookeeper

Solr nodes

indexer searcher

Near realtime search

Durability

Scales both reads and writes

No SPOF

Central config, nicer APIs

Page 19: Introduction to solr

In a nutshell

Typical use-cases

Product search (books, movies, bikes, weapons… anything that requires relevancy)

Time-series data (logs, metrics, social media...)

Search on top of (or as a source of) other Big Data tools (Spark, HDFS…)

Search on top of (or alongside) relational DBs

Typical challenges

Updates (though there’s WiP for numeric doc values in SOLR-5944)

Not really schema-less (schema can only be appended)

Doesn’t like sparse data (again, there’s ongoing work to make it better, see LUCENE-7253)

Some relational, stream and batch processing capabilities, but not the tool for those jobs

Page 20: Introduction to solr

Demo

Commands available at https://github.com/sematext/meetups/blob/master/introduction_to_solr_demo_commands.sh

Page 21: Introduction to solr

Thank you!

Radu Gheorghe [email protected] @radu0gheorghe

[email protected] http://sematext.com @sematext

Join Us! We are hiring!

http://sematext.com/jobs

Backend, UI, Sales, Consulting, Trainers