Introduction to Solr
Radu Gheorghe, Sematext Group, Inc.
About me
Logsene, SPM
ES API
metrics
...
Products, Services
+ https://sematext.com/blog/author/radu7gheorghe/
+ https://www.manning.com/books/elasticsearch-in-action
Agenda
What is Solr
When to use it
When not to use it
How it works
Demo
Pleeeeease ask questions. Otherwise it will be boring :(
What is Solr
Open source
Search engine
Based on Apache Lucene *
Distributed (SolrCloud) or not (master-slave)
* Actually, the two projects merged in 2010
More on search: the term dictionary and its friends
Term        Docs   Positions   (counts, stored, etc.)
big         1,3    [0],[2]     ...
bucharest   3      [0]
data        1      [1]
fun         1      ...
is          1,3
other       2
text        2
Documents (analysis turns them into the terms above):
1) Big data is fun
2) Other text
3) Bucharest is big
Example queries:
big AND data
“big data”
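To make the term dictionary concrete, here is a minimal Python sketch of an inverted index with positions, built from the three example documents. The `analyze` function (lowercase, split on whitespace) and all helper names are assumptions for illustration; Lucene's real analysis chain and on-disk structures are far more involved.

```python
from collections import defaultdict

DOCS = {
    1: "Big data is fun",
    2: "Other text",
    3: "Bucharest is big",
}

def analyze(text):
    """Toy analysis chain (an assumption): lowercase, split on whitespace."""
    return text.lower().split()

def build_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index

def search_and(index, *terms):
    """Doc IDs that contain every term, e.g. big AND data."""
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

def search_phrase(index, phrase):
    """Doc IDs where the terms sit at consecutive positions, e.g. "big data"."""
    terms = analyze(phrase)
    hits = []
    for doc_id in search_and(index, *terms):
        starts = index[terms[0]][doc_id]
        if any(all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms))
               for start in starts):
            hits.append(doc_id)
    return hits

INDEX = build_index(DOCS)
```

The AND query only needs the doc lists, while the phrase query also needs positions, which is why the dictionary keeps both.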
Segments and merging
The [relevancy] score
BM25: bag-of-words, based on TF-IDF
q=big AND data
Example documents: "big big big big big big", "I have big big big data", "data"
Term frequency: more occurrences in the document, more weight
Inverse document frequency: fewer occurrences in the index, more weight
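As a rough illustration of how term frequency and inverse document frequency combine, the sketch below implements the textbook BM25 formula in Python. The parameter values `k1 = 1.2` and `b = 0.75` are the commonly cited defaults, the tiny corpus is made up from the example snippets, and Lucene's actual implementation differs in details, so treat this as an approximation rather than Solr's exact scoring. It only scores documents; it does not enforce the AND in the query.

```python
import math

K1 = 1.2   # controls how quickly term frequency saturates
B = 0.75   # controls how strongly long documents are penalized

def idf(num_docs, doc_freq):
    """Rarer terms (low doc_freq) get a higher weight."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term(tf, doc_len, avg_doc_len, num_docs, doc_freq):
    """Score contribution of one term in one document."""
    norm = K1 * (1 - B + B * doc_len / avg_doc_len)
    return idf(num_docs, doc_freq) * tf * (K1 + 1) / (tf + norm)

def score(query_terms, doc_tokens, corpus):
    """Sum the per-term contributions for the terms the document contains."""
    avg_len = sum(len(doc) for doc in corpus) / len(corpus)
    total = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)
        df = sum(1 for doc in corpus if term in doc)
        if tf:
            total += bm25_term(tf, len(doc_tokens), avg_len, len(corpus), df)
    return total

# Hypothetical corpus, loosely based on the example snippets.
CORPUS = [
    "big big big big big big".split(),
    "i have big big big data".split(),
    "data".split(),
]
```

Because the term-frequency part saturates, six occurrences of "big" are worth less than twice as much as three, so the document that also contains "data" ends up with the higher score for `big data`.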
Relevancy tuning
title: Big Data
description: this is a book about big data
published: 2016
title: Spark Rulz
description: big data big data big data big data
published: 2015
q=big AND data
boost fields
boost values
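One way to express both kinds of boost is through the eDisMax query parser: `qf` weights fields and `bq` adds a boost query for specific values. The Python sketch below only builds the request parameters; the boost factors (`^3`, `^2`) and the function name are assumptions chosen for illustration, not tuned or recommended values.

```python
from urllib.parse import urlencode

def boosted_query_params(user_query):
    """Request parameters for an eDisMax query with field and value boosts."""
    return {
        "q": user_query,
        "defType": "edismax",
        # Boost fields: a match in the title counts more than one in the description.
        "qf": "title^3 description",
        # Boost values: documents published in 2016 get an extra push.
        "bq": "published:2016^2",
    }

params = boosted_query_params("big AND data")
query_string = urlencode(params)  # ready to append to a /select URL
```

With these settings, the "Big Data" book can outrank the document that merely repeats "big data" in its description, which plain term frequency would otherwise favor.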
Back to sorting: where the inverted index fails
Term (rating, in stars)   Docs
1                         1,2,8,5,128
2                         7,84,129
3                         3,29,345
4                         11,123,455
5                         12,14,16,17
Search returned docs 84, 455, 12 and 8
Now sort them by rating. ¯\_(ツ)_/¯
Enter doc values
Doc Terms
8 1
12 3
84 5
129 4
455 2
Search returned docs 84, 455, 12 and 8
Now sort them by rating.
Similar, but not quite like stored fields*
* Faster retrieval for doc values. For analyzed text, you’re stuck with stored fields and in-memory field cache
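The difference between the two layouts can be sketched in Python: with a doc-to-value mapping, sorting is one lookup per hit, while with a term-to-docs mapping every hit means scanning posting lists. The dictionaries below reuse the numbers from the doc values table; the function names are hypothetical, and real doc values are a columnar on-disk structure rather than a Python dict.

```python
# Doc -> rating, as in the doc values table.
RATING_DOC_VALUES = {8: 1, 12: 3, 84: 5, 129: 4, 455: 2}

def sort_by_rating(doc_ids, doc_values, descending=True):
    """Sort hits with one direct lookup per document."""
    return sorted(doc_ids, key=lambda doc_id: doc_values[doc_id], reverse=descending)

def invert(doc_values):
    """Build the term -> docs layout from the same data, for comparison."""
    inverted = {}
    for doc_id, value in doc_values.items():
        inverted.setdefault(value, []).append(doc_id)
    return inverted

def rating_via_inverted_index(doc_id, inverted):
    """The slow way: scan every term's doc list until the document shows up."""
    for term, doc_ids in inverted.items():
        if doc_id in doc_ids:
            return term
    raise KeyError(doc_id)

hits = [84, 455, 12, 8]
ranked = sort_by_rating(hits, RATING_DOC_VALUES)
```

Both functions return the same ratings; the point is only how much work each layout needs per hit.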
Facets
search returns doc IDs
facet=true
facet.field=host
doc1: host=server01
doc2: host=server02
doc3: host=server01
doc4: host=server01
server01: 3
server02: 1
doc values, usually*
* can be filter cache on low cardinality fields (depends on facet.method)
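Conceptually, a field facet walks the matching doc IDs, looks up each document's value and counts. A minimal Python sketch of that idea, using the four example documents (the dict standing in for doc values and the function name are assumptions):

```python
from collections import Counter

# Doc -> host, standing in for the doc values of the "host" field.
HOST_DOC_VALUES = {1: "server01", 2: "server02", 3: "server01", 4: "server01"}

def facet_counts(matching_doc_ids, doc_values):
    """Count field values across the documents that matched the search."""
    return Counter(doc_values[doc_id]
                   for doc_id in matching_doc_ids
                   if doc_id in doc_values)

counts = facet_counts([1, 2, 3, 4], HOST_DOC_VALUES)
# counts.most_common() -> [("server01", 3), ("server02", 1)]
```

Narrowing the search narrows the counts too: faceting over docs 2 and 3 only would give one hit per host.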
Facets can be hierarchical
top_genres:{
  terms:{
    field: genre,
    limit: 5,
    facet:{
      top_authors:{
        terms:{
          field: author,
          limit: 2
        }
      }
    }
  }
}
"top_genres":{
  "buckets":[
    { "val":"Fantasy", "count":5432,
      "top_authors":{ // top authors in the "Fantasy" genre
        "buckets":[
          { "val":"Mercedes Lackey", "count":121},
          { "val":"Piers Anthony", "count":98}
        ]
      }
    },
    { "val":"Mystery", "count":4322,
      "top_authors":{ // top authors in the "Mystery" genre
        "buckets":[
          { "val":"James Patterson", "count":146},
          ...
Can also be numeric/date ranges or functions like avg, sum, unique or percentile
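For reference, a nested terms facet like the one above can be assembled programmatically. The Python sketch below builds a JSON request body using the `type: terms` form of the JSON Facet API (the slide uses the older `terms:{...}` shorthand) and walks a response shaped like the example. The helper names are hypothetical, the match-all query is an assumption, and the sample response contains only the complete "Fantasy" bucket from the slide.

```python
import json

def terms_facet(field, limit, sub_facets=None):
    """One terms facet, optionally with nested facets computed per bucket."""
    facet = {"type": "terms", "field": field, "limit": limit}
    if sub_facets:
        facet["facet"] = sub_facets
    return facet

request_body = {
    "query": "*:*",   # assumption: facet over all documents
    "limit": 0,       # only the facets are of interest, not the documents
    "facet": {
        "top_genres": terms_facet(
            "genre", 5, {"top_authors": terms_facet("author", 2)}),
    },
}
payload = json.dumps(request_body, indent=2)

def authors_by_genre(facets):
    """Flatten nested buckets into {genre: [top authors]}."""
    return {
        genre["val"]: [author["val"]
                       for author in genre.get("top_authors", {}).get("buckets", [])]
        for genre in facets["top_genres"]["buckets"]
    }

SAMPLE_FACETS = {"top_genres": {"buckets": [
    {"val": "Fantasy", "count": 5432, "top_authors": {"buckets": [
        {"val": "Mercedes Lackey", "count": 121},
        {"val": "Piers Anthony", "count": 98}]}},
]}}
```

Keeping the facet definition in a small helper makes it easy to add another level of nesting without hand-editing braces.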
Beyond the shards: streaming aggregations
Sources: search, facet, jdbc...
Decorators: rollup, unique, innerJoin, parallel...
⇒ Parallel SQL, Text Classification, Graph Traversal
Diagram: shard1, shard2; worker1, worker2; Solr endpoint; client app
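The source/decorator idea can be imitated with Python generators: each shard emits sorted tuples, the streams are merged, and a decorator aggregates them as they flow by. This is only a conceptual sketch with hypothetical function names and made-up shard contents; it is not Solr's streaming API, where the same pipeline would be written as a streaming expression and sent to a Solr endpoint.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def search(shard_docs, sort_field):
    """Source: emit one shard's tuples, sorted on sort_field."""
    # A real shard would read from its index in sorted order instead of sorting here.
    yield from sorted(shard_docs, key=itemgetter(sort_field))

def merge_shards(streams, sort_field):
    """Merge already-sorted shard streams lazily, keeping the sort order."""
    return heapq.merge(*streams, key=itemgetter(sort_field))

def rollup(stream, over):
    """Decorator: count tuples per value of `over`; input must be sorted on it."""
    for value, group in groupby(stream, key=itemgetter(over)):
        yield {over: value, "count": sum(1 for _ in group)}

# Hypothetical shard contents.
SHARD1 = [{"host": "server02"}, {"host": "server01"}]
SHARD2 = [{"host": "server01"}, {"host": "server01"}]

streams = [search(SHARD1, "host"), search(SHARD2, "host")]
result = list(rollup(merge_shards(streams, "host"), over="host"))
```

Because every stage consumes sorted input one tuple at a time, the aggregation never has to hold a full result set in memory, which is what lets it scale past a single shard.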
Master-slave: high-QPS on static data
indexer sends docs to the master
master replicates segments to slave1, slave2, slave3
searcher sends queries to the slaves
Simple
Battle-tested
Index data only once
Slaves can cache like crazy
Separate roles ⇒ separate (read: optimized) hardware and configs
SolrCloud
Solr nodes: leader1, replica1, leader2, replica2
Zookeeper
indexer, searcher
Near realtime search
Durability
Scales both reads and writes
No SPOF
Central config, nicer APIs
In a nutshell
Typical use-cases
Product search (books, movies, bikes, weapons… anything that requires relevancy)
Time-series data (logs, metrics, social media...)
Search on top of (or as a source of) other Big Data tools (Spark, HDFS…)
Search on top of (or alongside) relational DBs
Typical challenges
Updates (though there’s WiP for numeric doc values in SOLR-5944)
Not really schema-less (schema can only be appended)
Doesn’t like sparse data (again, there’s ongoing work to make it better, see LUCENE-7253)
Some relational, stream and batch processing capabilities, but not the tool for those jobs
Demo
Commands available at https://github.com/sematext/meetups/blob/master/introduction_to_solr_demo_commands.sh
Thank you!
Radu Gheorghe, @radu0gheorghe
Sematext: http://sematext.com, @sematext
Join Us! We are hiring!
http://sematext.com/jobs
Backend, UI, Sales, Consulting, Trainers