Introduction to Solr

Radu Gheorghe

Sematext Group, Inc.

About me

Logsene, SPM

ES API

metrics

...

Products Services

+ https://sematext.com/blog/author/radu7gheorghe/

+ https://www.manning.com/books/elasticsearch-in-action

Agenda

What is Solr

When to use it

When not to use it

How it works

Demo

Pleeeeease ask questions. Otherwise it will be boring :(

What is Solr

Open source

Search engine

Based on Apache Lucene *

Distributed (SolrCloud) or not (master-slave)

* Actually, the two projects (Lucene and Solr) merged in 2010

More on search: the term dictionary and its friends

Term       Docs  Positions  counts, stored, etc.
big        1,3   [0],[2]    ...
bucharest  3     [0]
data       1     [1]
fun        1     ...
is         1,3
other      2
text       2

1) Big data is fun

2) Other text

3) Bucharest is big

analysis

big AND data

“big data”
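The term dictionary above can be sketched in a few lines of Python. This is only a toy model, not how Lucene stores things on disk: the `analyze` function (lowercase, split on non-letters) and all helper names are assumptions made for illustration. It shows why `big AND data` needs only the doc lists, while the phrase `“big data”` also needs positions.

```python
import re
from collections import defaultdict

# The three documents from the slide.
DOCS = {
    1: "Big data is fun",
    2: "Other text",
    3: "Bucharest is big",
}

def analyze(text):
    """Toy analysis chain (an assumption): lowercase, split on non-letters."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index

def search_and(index, query):
    """Docs that contain every query term, e.g. big AND data."""
    postings = [set(index.get(term, {})) for term in analyze(query)]
    return sorted(set.intersection(*postings)) if postings else []

def search_phrase(index, phrase):
    """Docs where the terms sit at consecutive positions, e.g. "big data"."""
    terms = analyze(phrase)
    hits = []
    for doc_id in search_and(index, phrase):
        starts = index[terms[0]][doc_id]
        if any(
            all(start + i in index[term][doc_id] for i, term in enumerate(terms))
            for start in starts
        ):
            hits.append(doc_id)
    return hits

index = build_index(DOCS)
print(dict(index["big"]))                 # {1: [0], 3: [2]}
print(search_and(index, "big data"))      # [1]
print(search_phrase(index, "big data"))   # [1]
print(search_phrase(index, "data big"))   # []
```

Both terms of `data big` are in doc 1, so the AND query would match it, but the positions are in the wrong order, so the phrase query does not.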

Segments and merging

The [relevancy] score

BM25: bag-of-words based on TF-IDF

q=big AND data

TermFrequency: more occurrences in the document, more weight

big big big big big big

I have big big big data

InverseDocumentFrequency: fewer occurrences in the index, more weight

data

data
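To make the two rules concrete, here is a minimal sketch of the textbook BM25 formula in Python. It assumes `k1=1.2` and `b=0.75` (Lucene's documented defaults) and a tiny three-document corpus, where the third document is made up for the example. Lucene's own implementation differs in details such as how document length is encoded, so treat the numbers as illustrative only.

```python
import math

def idf(doc_count, docs_with_term):
    """InverseDocumentFrequency: fewer occurrences in the index, more weight."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))

def tf_weight(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """TermFrequency: more occurrences, more weight, but with saturation
    and a penalty for documents longer than average."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + norm)

def bm25(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Sum idf * tf_weight over the query terms present in the document."""
    avg_len = sum(len(doc) for doc in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        freq = doc_terms.count(term)
        if freq == 0:
            continue
        docs_with_term = sum(1 for doc in corpus if term in doc)
        score += idf(len(corpus), docs_with_term) * tf_weight(
            freq, len(doc_terms), avg_len, k1, b
        )
    return score

corpus = [
    "big big big big big big".split(),
    "i have big big big data".split(),
    "other text".split(),  # hypothetical filler document
]
for doc in corpus:
    print(" ".join(doc), "->", round(bm25(["big", "data"], doc, corpus), 3))
```

The second document outscores the first even though it has fewer occurrences of `big`: term frequency saturates, and `data` is the rarer term, so it carries more weight.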

Relevancy tuning

title: Big Data

description: this is a book about big data

published: 2016

title: Spark Rulz

description: big data big data big data big data

published: 2015

q=big AND data

boost fields

boost values
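One way to express both kinds of boost is through the eDisMax query parser. The sketch below only builds the request URL. The collection name `books`, the host, and the boost factors are assumptions chosen for illustration; `defType`, `qf` and `bq` are standard (e)DisMax parameters.

```python
from urllib.parse import urlencode

def boosted_query_params(user_query):
    """Build eDisMax parameters that boost a field and a value."""
    return {
        "defType": "edismax",
        "q": user_query,
        # boost fields: a match in title counts 3x a match in description
        "qf": "title^3 description",
        # boost values: documents published in 2016 get an extra push
        "bq": "published:2016^2",
    }

params = boosted_query_params("big AND data")
# Hypothetical local Solr with a collection named "books".
url = "http://localhost:8983/solr/books/select?" + urlencode(params)
print(url)
```

With boosts like these, the "Big Data" book from 2016 should rank above "Spark Rulz", even though the latter repeats the query terms more often in its description.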

Back to sorting: where the inverted index fails

Term      Docs
1 [star]  1,2,8,5,128
2         7,84,129
3         3,29,345
4         11,123,455
5         12,14,16,17

Search returned docs 84, 455, 12 and 8

Now sort them by rating. ¯\_(ツ)_/¯

Enter doc values

Doc  Terms
8    1
12   3
84   5
129  4
455  2

Search returned docs 84, 455, 12 and 8

Now sort them by rating.
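The difference between the two structures can be sketched in Python using the doc values table above. The function names are made up for the example, and the inverted index here is derived from the doc values so that both views hold the same data.

```python
# Doc values from the slide: doc ID -> rating.
DOC_VALUES = {8: 1, 12: 3, 84: 5, 129: 4, 455: 2}

def invert(doc_values):
    """Build the inverted view (rating -> doc IDs) of the same data."""
    inverted = {}
    for doc_id, rating in doc_values.items():
        inverted.setdefault(rating, []).append(doc_id)
    return inverted

def rating_via_inverted_index(inverted, doc_id):
    """With only term -> docs, finding one doc's rating means scanning
    the postings of every term until the doc shows up."""
    for rating, doc_ids in inverted.items():
        if doc_id in doc_ids:
            return rating
    raise KeyError(doc_id)

def sort_by_rating(hits, doc_values, descending=True):
    """With doc -> value, sorting is one direct lookup per hit."""
    return sorted(hits, key=doc_values.__getitem__, reverse=descending)

hits = [84, 455, 12, 8]
print(sort_by_rating(hits, DOC_VALUES))                      # [84, 12, 455, 8]
print(rating_via_inverted_index(invert(DOC_VALUES), 455))    # 2
```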

Similar, but not quite like stored fields*

* Faster retrieval for doc values. For analyzed text, you’re stuck with stored fields and in-memory field cache

Facets

search returns doc IDs

facet=true

facet.field=host

doc1: host=server01

doc2: host=server02

doc3: host=server01

doc4: host=server01

server01: 3

server02: 1

doc values, usually*

* can be filter cache on low cardinality fields (depends on facet.method)
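Conceptually, a field facet is a count per value over the matching doc IDs. The Python sketch below mimics that with a plain dictionary standing in for doc values; it is a model of the idea, not Solr's implementation, and `facet_counts` is a made-up helper name.

```python
from collections import Counter

# Stand-in for doc values on the "host" field: doc ID -> value.
HOST_DOC_VALUES = {
    "doc1": "server01",
    "doc2": "server02",
    "doc3": "server01",
    "doc4": "server01",
}

def facet_counts(matching_doc_ids, doc_values):
    """Look up each matching doc's value and count the occurrences."""
    return Counter(
        doc_values[doc_id] for doc_id in matching_doc_ids if doc_id in doc_values
    )

counts = facet_counts(["doc1", "doc2", "doc3", "doc4"], HOST_DOC_VALUES)
print(counts.most_common())  # [('server01', 3), ('server02', 1)]
```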

Facets can be hierarchical

top_genres:{
  terms:{
    field: genre,
    limit: 5,
    facet:{
      top_authors:{
        terms:{
          field: author,
          limit: 2
        }
      }
    }
  }
}

"top_genres":{
  "buckets":[
    {
      "val":"Fantasy",
      "count":5432,
      "top_authors":{ // top authors in the "Fantasy" genre
        "buckets":[
          { "val":"Mercedes Lackey", "count":121},
          { "val":"Piers Anthony", "count":98}
        ]
      }
    },
    {
      "val":"Mystery",
      "count":4322,
      "top_authors":{ // top authors in the "Mystery" genre
        "buckets":[
          { "val":"James Patterson", "count":146},
          ...

Can also be numeric/date ranges or functions like avg, sum, unique or percentile
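As a sketch, the same nested request can be built as a Python dictionary and serialized for the `json.facet` parameter. It uses the `type`-keyed form of the JSON Facet API, which to my understanding is equivalent to the shorthand on the slide. The `flatten` helper is made up for the example, and the sample response contains only the complete "Fantasy" bucket from the slide.

```python
import json

# Nested terms facets: top 5 genres, and the top 2 authors inside each genre.
facet_request = {
    "top_genres": {
        "type": "terms",
        "field": "genre",
        "limit": 5,
        "facet": {
            "top_authors": {"type": "terms", "field": "author", "limit": 2}
        },
    }
}
json_facet_param = json.dumps(facet_request)  # value for the json.facet parameter

# The part of the slide's response that is complete.
sample_facets = {
    "top_genres": {
        "buckets": [
            {
                "val": "Fantasy",
                "count": 5432,
                "top_authors": {
                    "buckets": [
                        {"val": "Mercedes Lackey", "count": 121},
                        {"val": "Piers Anthony", "count": 98},
                    ]
                },
            }
        ]
    }
}

def flatten(facets, outer, inner):
    """Turn nested buckets into (outer value, inner value, count) rows."""
    rows = []
    for outer_bucket in facets[outer]["buckets"]:
        for inner_bucket in outer_bucket.get(inner, {}).get("buckets", []):
            rows.append((outer_bucket["val"], inner_bucket["val"], inner_bucket["count"]))
    return rows

for row in flatten(sample_facets, "top_genres", "top_authors"):
    print(row)
```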

Beyond the shards: streaming aggregations

Sources

search, facet, jdbc, ...

Decorators

rollup, unique, innerJoin, parallel, ...

⇒ Parallel SQL, Text Classification, Graph Traversal

shard1 shard2

worker1 worker2

Solr endpoint

client app
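The source/decorator idea can be imitated with Python generators. This is an analogy only, not Solr's streaming expression syntax: the function names are borrowed from the slide, and the shard contents are made up. Each shard emits tuples sorted by a field, the sorted streams are merged, and a decorator aggregates them as they flow by.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def search(shard_docs, sort_field):
    """Source: emit one shard's tuples sorted by sort_field."""
    yield from sorted(shard_docs, key=itemgetter(sort_field))

def merge_sorted(streams, sort_field):
    """Merge the already-sorted shard streams into one sorted stream."""
    return heapq.merge(*streams, key=itemgetter(sort_field))

def rollup(stream, over):
    """Decorator: count tuples per value of `over`.
    Relies on the input being sorted by `over`."""
    for value, group in groupby(stream, key=itemgetter(over)):
        yield {over: value, "count": sum(1 for _ in group)}

# Hypothetical shard contents.
shard1 = [{"host": "server02"}, {"host": "server01"}]
shard2 = [{"host": "server01"}, {"host": "server01"}]

streams = [search(shard, "host") for shard in (shard1, shard2)]
result = list(rollup(merge_sorted(streams, "host"), over="host"))
print(result)  # [{'host': 'server01', 'count': 3}, {'host': 'server02', 'count': 1}]
```

Because every stage consumes and produces a sorted stream, the aggregation never needs the full result set in memory at once, which is what lets this kind of processing go beyond per-shard facets.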

Master-slave: high-QPS on static data

indexer → docs → master

master → replicates segments → slave1, slave2, slave3

searcher → queries → slaves

Simple

Battle-tested

Index data only once

Slaves can cache like crazy

Separate roles ⇒ separate (read: optimized) hardware and configs

SolrCloud

leader2

leader1

replica2

replica1

Zookeeper

Solr nodes

indexer searcher

Near realtime search

Durability

Scales both reads and writes

No SPOF

Central config, nicer APIs

In a nutshell

Typical use-cases

Product search (books, movies, bikes, weapons… anything that requires relevancy)

Time-series data (logs, metrics, social media...)

Search on top of (or as a source of) other Big Data tools (Spark, HDFS…)

Search on top of (or alongside) relational DBs

Typical challenges

Updates (though there’s WiP for numeric doc values in SOLR-5944)

Not really schema-less (schema can only be appended)

Doesn’t like sparse data (again, there’s ongoing work to make it better, see LUCENE-7253)

Some relational, stream and batch processing capabilities, but not the tool for those jobs

Demo

Commands available at https://github.com/sematext/meetups/blob/master/introduction_to_solr_demo_commands.sh

Thank you!

Radu Gheorghe

@radu0gheorghe

sematext.com

@sematext

Join Us! We are hiring!

http://sematext.com/jobs

Backend, UI, Sales, Consulting, Trainers
