Scaling Solr with SolrCloud

Rafał Kuć – Sematext Group, Inc.

@kucrafal @sematext sematext.com

Ta me…

Sematext consultant & engineer

Solr.pl co-founder

Father and husband

Solr History

Y. Seeley creates Solr

Solr donated to ASF

Incubator graduation

Solr 1.3 released

Solr 1.4 released

Lucene / Solr merge

Solr 4.0 released

Solr 4.1 and counting

The Past

Master – Slave Deployment

[Diagram: the application, one Solr master, and four Solr slaves]

Master as SPOF

[Diagram: the application and four Solr slaves, all depending on a single Solr master]

Replication Time

[Diagram: an indexing app, the Solr master, a querying app, and the Solr slaves]

Too Much for a Single Shard

[Diagram: the application in front of several Solr masters, each with Solr slaves]

Querying in Multi Master Deployment

[Diagram: the application and the Solr slaves for Shard 1, Shard 2, and Shard 3; labels: Doc, Response, Response]

SolrCloud Comes Into Play

Basic Glossary

https://cwiki.apache.org/confluence/display/solr/SolrCloud+Glossary

Cluster

Collection

Leader & Replica

Overseer

Apache ZooKeeper

Quorum is required

Sample configuration

clientPort=2181
dataDir=/usr/share/zookeeper/data
tickTime=2000
initLimit=10
syncLimit=5
server.1=192.168.1.1:2888:3888
server.2=192.168.1.2:2888:3888
server.3=192.168.1.3:2888:3888

[Diagram: three ZooKeeper nodes]
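Why three servers? ZooKeeper keeps working only while a strict majority of the ensemble is up. A minimal sketch of that arithmetic follows; the function names are made up for illustration and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a quorum still exists."""
    return ensemble_size - quorum_size(ensemble_size)


# The three-server sample configuration above survives one failure.
# A fourth server would not help: it still tolerates only one.
for size in (3, 4, 5):
    print(size, quorum_size(size), tolerated_failures(size))
```

This is why ensembles are usually sized with an odd number of servers.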

Solr Instances

[Diagram: Solr servers, each started with a -DzkHost parameter listing the ZooKeeper ensemble]

-DzkHost=192.168.1.2:2181,192.168.1.1:2181,192.168.1.3:2181

-DzkHost=192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181

-DzkHost=192.168.1.3:2181,192.168.1.1:2181,192.168.1.2:2181

Collection Creation

[Diagram: three Solr servers]

$ cloud-scripts/zkcli.sh -cmd upconfig -zkhost 192.168.1.2:2181 -confdir /usr/share/config/revolution/conf -confname revolution

$ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=revolution&numShards=2&replicationFactor=1'
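When collections are created from scripts rather than by hand, the same Collections API call can be assembled programmatically. The sketch below only builds the URL and does not send it; `collections_api_url` is a hypothetical helper, and the host `solr1:8983` is taken from the slide.

```python
from urllib.parse import urlencode


def collections_api_url(base_url: str, action: str, **params) -> str:
    """Build a Collections API URL; parameters keep the order they are given in."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


# Same request as the curl command above: 2 shards, no extra replicas.
url = collections_api_url(
    "http://solr1:8983/solr",
    "CREATE",
    name="revolution",
    numShards=2,
    replicationFactor=1,
)
print(url)
```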

Single Collection Deployment

[Diagram: the application and Solr servers hosting Shard1 and Shard2]

Collection with Replica

[Diagram: three Solr servers]

$ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=revolution&numShards=2&replicationFactor=2'

Collection with Replicas

[Diagram: the application and Solr servers hosting Shard1, Shard2, a Shard1 replica, and a Shard2 replica]

Querying

[Diagram, phase 1: the application sends the query to a Solr server; Shard1 and Shard2 are each asked with fl=id,score and each returns id,score pairs]

[Diagram, phase 2: the Solr server gets the docs from Shard1 and Shard2 and returns the results to the application]

Shard and Replica Number

How your data looks

Expected data growth

Target performance

Target node number

Max number of nodes = number of shards * (number of replicas + 1)
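The formula is easy to check with a few numbers. In this sketch "replicas" means extra copies per shard on top of the leader, as on the slide; the function name is invented for illustration. Note that `replicationFactor` in the CREATE call counts every copy including the leader, so `replicationFactor=2` corresponds to one replica here.

```python
def max_nodes(num_shards: int, replicas_per_shard: int) -> int:
    """Upper bound on useful nodes: one node per shard copy (leader + replicas)."""
    return num_shards * (replicas_per_shard + 1)


# 2 shards with 1 replica each (replicationFactor=2) can spread over 4 nodes.
print(max_nodes(2, 1))
# 4 shards with 2 replicas each can spread over 12 nodes.
print(max_nodes(4, 2))
```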

What should I go for?

More data? Shard

More queries? Replica

Custom Routing

Default (numShards present, pre 4.5)

Implicit (numShards not present, pre 4.5)

Custom Routing Example

[Diagram: two Solr servers hosting Shard1 and Shard2; documents id=userA!1, id=userA!2, and id=userB!3 are routed by the prefix before the "!"]
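The idea behind the "!" prefix can be shown with a toy router: every document whose id shares a prefix ends up on the same shard. This is a simplified sketch only. It uses CRC32 and a modulo, which is an assumption made for illustration; Solr's real router hashes differently and assigns hash ranges to shards. Both function names are hypothetical.

```python
import zlib


def route_key(doc_id: str) -> str:
    """Return the part before '!' if present, otherwise the whole id."""
    prefix, separator, _ = doc_id.partition("!")
    return prefix if separator else doc_id


def pick_shard(doc_id: str, num_shards: int) -> int:
    """Toy routing: hash only the route key, so equal prefixes co-locate."""
    return zlib.crc32(route_key(doc_id).encode("utf-8")) % num_shards


for doc_id in ("userA!1", "userA!2", "userB!3"):
    print(doc_id, "-> shard index", pick_shard(doc_id, 2))
```

Because only the prefix is hashed, `userA!1` and `userA!2` always land together, which is what lets a query carrying a route key visit fewer shards.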

Querying Solr – Default Routing

[Diagram: the application queries the Solr collection; the query goes to Shard 1, Shard 2, Shard 3, and Shard 4]

Querying Solr – Custom Routing

[Diagram: the application queries the Solr collection with a route key]

q=revolution&_route_=userA!

Collection Manipulation Commands

Create

Delete

Reload

Create Alias

Delete Alias

Shard Creation/Deletion

http://wiki.apache.org/solr/SolrCloud

Collection Creation

numShards

replicationFactor

maxShardsPerNode

createNodeSet

collection.configName

Collection Split Example

$ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=2&replicationFactor=1'

$ curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1'

Collection Aliasing

$ curl 'http://solr1:8983/solr/admin/collections?action=CREATEALIAS&name=weekly&collections=20131107,20131108,20131109,20131110,20131111,20131112,20131113'

$ curl 'http://solr1:8983/solr/admin/collections?action=DELETEALIAS&name=weekly'

$ curl 'http://solr1:8983/solr/weekly/select?q=revolution'
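With one collection per day, the alias has to be re-pointed every day, so the collection list is worth generating. A minimal sketch, assuming daily collections named `YYYYMMDD` as in the example; `daily_collections` and `create_alias_url` are hypothetical helpers, and the URL is only built, not sent.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def daily_collections(last_day: date, days: int = 7) -> list:
    """Names of the daily collections ending at last_day, oldest first."""
    first_day = last_day - timedelta(days=days - 1)
    return [(first_day + timedelta(days=i)).strftime("%Y%m%d") for i in range(days)]


def create_alias_url(base_url: str, alias: str, collections: list) -> str:
    """Build the CREATEALIAS request; commas are left unescaped for readability."""
    query = urlencode(
        {"action": "CREATEALIAS", "name": alias, "collections": ",".join(collections)},
        safe=",",
    )
    return f"{base_url}/admin/collections?{query}"


week = daily_collections(date(2013, 11, 13))
print(create_alias_url("http://solr1:8983/solr", "weekly", week))
```

The application keeps querying `/solr/weekly/select` while the alias moves underneath it.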

Caches

q=lucene+revolution

fq=city:Dublin

Solr Cache

Refreshed with IndexSearcher

Configurable

Different purposes

Different implementations

Filter Cache

q=*:*&fq={!cache=false}city:Dublin

q=*:*&fq={!frange l=0 u=10 cache=false cost=200}sum(price,pro)

q=lucene+revolution&fq=city:Dublin

q=lucene+revolution+city:Dublin
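Why move city:Dublin out of q and into fq? A filter is cached under its own key, so any later query with the same fq can reuse it whatever its q is. The toy model below illustrates just that keying idea; it is not how Solr implements its filter cache, and the class, the function, and the second query are invented for the example.

```python
class ToyFilterCache:
    """Toy stand-in for a filter cache: one cached document set per fq string."""

    def __init__(self):
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def doc_set(self, fq, compute):
        """Return the cached document set for fq, computing it on first use."""
        if fq in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[fq] = compute(fq)
        return self.entries[fq]


def fake_matching_docs(fq):
    """Placeholder for the expensive work of evaluating a filter."""
    return frozenset({1, 4, 7})


cache = ToyFilterCache()
cache.doc_set("city:Dublin", fake_matching_docs)  # q=lucene+revolution
cache.doc_set("city:Dublin", fake_matching_docs)  # a different q, same fq
print(cache.hits, cache.misses)
```

With q=lucene+revolution+city:Dublin there is no separate fq entry to reuse, so the city clause is evaluated as part of every main query.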

Document Cache

Query Result Cache

q=lucene+revolution&fq=city:Dublin&sort=date+desc&start=0&rows=10

q=lucene+revolution+city:Dublin&sort=date+desc&start=0&rows=10

Warming

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">date desc</str></lst>
    <lst><str name="q">keywords:* OR tags:*</str></lst>
    <lst><str name="q">*:*</str><str name="fq">active:*</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">date desc</str></lst>
    <lst><str name="q">keywords:* OR tags:*</str></lst>
    <lst><str name="q">*:*</str><str name="fq">active:*</str></lst>
  </arr>
</listener>
<useColdSearcher>false</useColdSearcher>

The Right Directory

_0.fdt _0.fdx _0.fnm _0.nvd

_1.fdt _1.fdx _1.fnm _1.nvd

StandardDirectory

SimpleFSDirectory

NIOFSDirectory

MMapDirectory

NRTCachingDirectory

RAMDirectory

<directoryFactory name="DirectoryFactory" class="solr.NRTCachingDirectoryFactory" />

Column oriented fields - DocValues

NRT compatible

Better compression than field cache

Can store data outside of JVM heap

Can improve things for dynamic indices

Segment Merge

[Diagram: segments a, b, c, d, and e across Level 0 and Level 1]

Segment Merge Under Control

Merge policy

Merge scheduler

Merge factor

Merge policy configuration

Configuring Segment Merge

Indexing Throughput Tuning

Maximum indexing threads

RAM buffer size

Maximum buffered documents

Bulk, bulks and bulks

CloudSolrServer

Autocommit

Cutting off unnecessary stuff
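"Bulks" means sending documents in batches instead of one request per document. A minimal batching sketch is shown below; the helper name and the batch size are illustrative, and the actual sending (for example through CloudSolrServer) is left out.

```python
from itertools import islice


def bulks(docs, size):
    """Yield consecutive lists of at most `size` documents from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Five documents in bulks of two: the last bulk is smaller.
for batch in bulks(range(5), 2):
    print(batch)
```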

TransactionLog

Updates durability

Recovering peer replay

Performant Realtime Get

Autocommit or Not?

<autoCommit>
  <maxTime>15000</maxTime>
  <maxDocs>1000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>

Automatic data flush

Automatic index view refresh
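With both maxTime and maxDocs set, the commit fires on whichever limit is reached first. The sketch below models only that decision, using the values from the configuration above as defaults; it is a simplified illustration with a hypothetical function name, not Solr's commit tracker.

```python
def should_autocommit(pending_docs, ms_since_first_pending,
                      max_docs=1000, max_time_ms=15000):
    """True when either limit is reached and there is something to commit."""
    if pending_docs == 0:
        return False
    return pending_docs >= max_docs or ms_since_first_pending >= max_time_ms


print(should_autocommit(10, 2000))     # neither limit reached
print(should_autocommit(1000, 2000))   # document limit reached first
print(should_autocommit(10, 15000))    # time limit reached first
```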

Autocommit & openSearcher=true

<autoCommit>
  <maxDocs>10</maxDocs>
  <openSearcher>true</openSearcher>
</autoCommit>

AutoSoftCommit & openSearcher=false

<autoCommit>
  <maxDocs>1000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>

Postings Formats to the Rescue

Lucene >= 4.0: Flexible Indexing

Postings == docs, positions, payloads

Different postings formats available

Bloom

Pulsing

Simple text

Direct

Memory

Monitoring

Cluster state

Nodes utilization

Memory usage

Cache utilization

Query response time

Warmup times

Garbage collector work

JMX and Solr

Administration Panel

Monitoring with SPM

Other Monitoring Tools

Ganglia http://ganglia.sourceforge.net/

New Relic http://www.newrelic.com/

Opsview http://www.opsview.com

We Are Hiring!

Dig Search?

Dig Analytics?

Dig Big Data?

Dig Performance?

Dig working with and in open-source?

We're hiring worldwide!

http://sematext.com/about/jobs.html

Rafał Kuć @kucrafal rafal.kuc@sematext.com

Sematext @sematext http://sematext.com http://blog.sematext.com

SPM discount code: LR2013SPM20

Thank You !

@ Sematext booth ;)
