Scaling search with SolrCloud

Scaling with Solr Cloud

Saumitra Srivastav saumitra.srivastav@glassbeam.com

Bangalore Apache Solr Group September-2014 Meetup

What is Solr Cloud?

- set of features that add distributed capabilities to Solr

- fault tolerance and high availability

- distributed indexing and search

- enables and simplifies horizontal scaling of a search index using sharding and replication

Non-Cloud Single Node Deployment

[Diagram: Machine (server) 1 runs one Solr node (Jetty on port 8983) hosting Core 1, Core 2, ... Core N, each with its own conf and data.]

Use Solr Cloud for ...

- performance

- scalability

- high-availability

- simplicity

- elasticity

Solr Cloud Glossary

- Cluster

- Node

- Shard

- Leader & Replica

- Overseer

- Collection

- Zookeeper

High Level View

Glossary

- Cluster - set of Solr nodes

- Node - a JVM instance running Solr; also known as a Solr server.

- Core

- an individual Solr instance (represents a logical index).

- multiple cores can run on a single node.

Glossary

- Collection - one or more documents grouped together in a single logical index; can be spread across multiple cores.

- Shard - a logical section of a single collection; implemented as a core.

- Replica - a copy of a shard or single logical index; used for failover or load balancing.

Glossary

- Leader - the main node for each shard, which routes document adds, updates, or deletes to the other replicas

- if the leader goes down, a new node will be elected to take its place

- Overseer

- A single node in SolrCloud that is responsible for processing actions involving the entire cluster

- if the overseer goes down, a new node will be elected to take its place

Zookeeper

- distributed coordination - maintaining configuration information

[Diagram: Solr Node 1 (10.0.0.1:8983), Solr Node 2 (10.0.0.2:8983), Solr Node 3 (10.0.0.3:8983), and Solr Node 4 (10.0.0.4:8983) around ZooKeeper.]

Zookeeper

[Diagram: a client, the same four Solr nodes, and a ZooKeeper quorum of three servers: zk-1:2181, zk-2:2182, zk-3:2183.]

Zookeeper - Central Configuration

Zookeeper - distributed coordination

- Keep track of /live_nodes

- Collection metadata and replica state in /clusterstate.json

- Alias list in /aliases.json

- Leader election

Collections

- Collection is a distributed index defined by:

- named configuration - stored in ZooKeeper

- number of shards

- replication factor

- Number of copies of each document in the collection

- document routing strategy:

- how documents get assigned to shards

Collections API

localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=4&replicationFactor=2&maxShardsPerNode=1&createNodeSet=localhost:8933&collection.configName=collection1Config
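A minimal Python sketch of how a CREATE request like the one above could be assembled, and of the capacity arithmetic behind numShards, replicationFactor, and maxShardsPerNode. The helper names (`create_collection_url`, `cores_needed`, `min_nodes_needed`) are hypothetical and not part of Solr; createNodeSet is left out, and the base URL is assumed to be a local node.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor,
                          config_name, max_shards_per_node=1):
    """Build a Collections API CREATE URL (illustrative helper, not a Solr API)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "maxShardsPerNode": max_shards_per_node,
        "collection.configName": config_name,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


def cores_needed(num_shards, replication_factor):
    """Each shard is stored replication_factor times, one core per copy."""
    return num_shards * replication_factor


def min_nodes_needed(num_shards, replication_factor, max_shards_per_node):
    """Smallest node count that can host all cores under the per-node limit."""
    total = cores_needed(num_shards, replication_factor)
    return -(-total // max_shards_per_node)  # ceiling division


url = create_collection_url("http://localhost:8983/solr", "collection1",
                            num_shards=4, replication_factor=2,
                            config_name="collection1Config")
print(url)
print(min_nodes_needed(4, 2, 1))
```

With 4 shards, a replication factor of 2, and at most one shard per node, the request needs 8 cores spread over at least 8 nodes.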

Sharding

- Collection has a fixed number of shards

- existing shards can be split

- When to shard?

- Large number of docs

- Large document sizes

- Parallelization during indexing and queries

- Data partitioning (custom hashing)

Replication

- Why replicate?

- High-availability

- Load balancing

- How does it work in SolrCloud?

- Near-real-time, NOT master-slave

- Leader forwards to replicas in parallel, waits for response

- Error handling during indexing is tricky

Indexing

1. Get cluster state from ZK

2. Route document directly to leader (hash on doc ID)

3. Persist document on durable storage (tlog)

4. Forward to healthy replicas

5. Acknowledge the successful write to the client

Querying

- Query client can be ZK aware or just query via a load balancer

- Client can send query to any node in the cluster

- Controller node distributes the query to one replica of each shard to identify documents matching the query

- Controller node sorts the results from the previous step and issues a second query for all fields for a page of results
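The two-phase flow above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Solr code: the shard responses are hard-coded, and `merge_shard_results`, `fetch_page`, and the sample documents are hypothetical names invented for illustration.

```python
import heapq


def merge_shard_results(shard_results, rows):
    """Phase 1: merge per-shard (doc_id, score) hits into the global top `rows`."""
    all_hits = (hit for hits in shard_results.values() for hit in hits)
    return heapq.nlargest(rows, all_hits, key=lambda hit: hit[1])


def fetch_page(top_hits, stored_docs):
    """Phase 2: fetch all fields, but only for the documents on the result page."""
    return [stored_docs[doc_id] for doc_id, _score in top_hits]


# Hypothetical phase-1 responses: each shard returns only IDs and scores.
shard_results = {
    "shard1": [("doc3", 0.91), ("doc7", 0.40)],
    "shard2": [("doc5", 0.77), ("doc1", 0.65)],
}
# Hypothetical stored documents, looked up in phase 2.
stored_docs = {
    "doc1": {"id": "doc1", "title": "Soccer rules"},
    "doc3": {"id": "doc3", "title": "Soccer world cup"},
    "doc5": {"id": "doc5", "title": "Soccer clubs"},
    "doc7": {"id": "doc7", "title": "Indoor soccer"},
}

top_hits = merge_shard_results(shard_results, rows=3)
page = fetch_page(top_hits, stored_docs)
print([doc["id"] for doc in page])
```

Fetching only IDs and scores first keeps the first round cheap; full documents are retrieved just for the rows that make it onto the page.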

Transaction Log (tlog)

- file where the raw documents are written for recovery purposes

- each node has its own tlog

- replayed on server restart in case of a non-graceful shutdown

- “rolled over” automatically on hard commit

- old one is closed and a new one is opened

Commits

- Hard Commit & Soft Commit

- Hard commits are about durability, soft commits are about visibility

- Further reading: https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

What happens on hard Commit?

- The tlog is truncated.

- A new tlog is started.

- Old tlogs will be deleted if there are more than 100 documents in newer tlogs.

- The current index segment is closed and flushed.

- Background segment merges may be initiated.

What happens on soft commit?

- The tlog has NOT been truncated. It will continue to grow.

- New documents WILL be visible.

- some caches will have to be reloaded

- top-level caches will be invalidated.
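The difference between the two commit types can be illustrated with a toy Python model. `ToyCore` is a hypothetical class written for this sketch and does not mirror Solr internals; it assumes a hard commit that does not open a new searcher, so that only soft commits change visibility, and it ignores the deletion of old tlogs.

```python
class ToyCore:
    """Toy model of tlog rollover and visibility (not Solr's implementation)."""

    def __init__(self):
        self.tlogs = [[]]     # list of tlog files; the last one is the open tlog
        self.indexed = []     # every document accepted so far
        self.flushed = 0      # number of documents in flushed index segments
        self.visible = 0      # number of documents searchers can see

    def add(self, doc):
        # Each accepted document is appended to the currently open tlog.
        self.tlogs[-1].append(doc)
        self.indexed.append(doc)

    def soft_commit(self):
        # Visibility only: new documents become searchable, tlog keeps growing.
        self.visible = len(self.indexed)

    def hard_commit(self):
        # Durability: flush the current segment and roll over to a new tlog.
        self.flushed = len(self.indexed)
        self.tlogs.append([])


core = ToyCore()
core.add({"id": "doc1"})
core.add({"id": "doc2"})

core.soft_commit()
after_soft = (core.visible, core.flushed, len(core.tlogs), len(core.tlogs[-1]))

core.hard_commit()
after_hard = (core.visible, core.flushed, len(core.tlogs), len(core.tlogs[-1]))
print(after_soft, after_hard)
```

After the soft commit both documents are visible but still sit in the single open tlog; after the hard commit they are flushed and a fresh, empty tlog is open.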

Shard Splitting

- Can split shards into two sub-shards

- Live splitting. No downtime needed.

- Requests start being forwarded to sub-shards automatically

- Expensive operation: Use as required during low traffic
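A short Python sketch of what a split means for a shard's hash range, together with a helper that builds a Collections API SPLITSHARD request. The function names are hypothetical, the even halving is an assumption about the default behaviour, and the collection and shard names are placeholders.

```python
from urllib.parse import urlencode


def split_range(lo, hi):
    """Split an inclusive hash range into two contiguous halves."""
    mid = lo + (hi - lo) // 2
    return (lo, mid), (mid + 1, hi)


def splitshard_url(base_url, collection, shard):
    """Build a SPLITSHARD request URL (illustrative helper)."""
    params = {"action": "SPLITSHARD", "collection": collection, "shard": shard}
    return f"{base_url}/admin/collections?{urlencode(params)}"


left, right = split_range(0x00000000, 0x7FFFFFFF)
print(f"{left[0]:x}-{left[1]:x}", f"{right[0]:x}-{right[1]:x}")
print(splitshard_url("http://localhost:8983/solr", "collection1", "shard1"))
```

The two sub-ranges cover exactly the parent range, which is why requests can be forwarded to the sub-shards without re-routing any other shard.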

Overseer

- Persists collection state change events to ZooKeeper

- Controller for Collection API commands

- One per cluster (for all collections); elected using leader election

- Asynchronous (pub/sub messaging)

- Automated failover to a healthy node

- Can be assigned to a dedicated node

Controlling data partitioning

- Shard vs Replicas

- Custom Routing

- Collection Aliasing

Shard vs Replica

- More data? Add shards.

- More queries? Add replicas.

Document Routing

- How to assign documents to shards

- Default Routing

- Custom routing

- Routers

- CompositeID

- Implicit

Default Routing

- Each shard covers a hash-range

- Hash doc-ID into 32-bit integer, map to range

- Leads to balanced (roughly) shards
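A minimal Python sketch of hash-range routing for a two-shard collection. Solr itself hashes IDs with MurmurHash3; this sketch swaps in CRC32 from the standard library purely as a stand-in, so the shard it picks for a given ID will not match a real cluster. The range table and function names are hypothetical, and ranges are treated as unsigned 32-bit values.

```python
import zlib

# Assumed layout: two shards splitting the unsigned 32-bit hash space in half.
SHARD_RANGES = {
    "shard1": (0x00000000, 0x7FFFFFFF),
    "shard2": (0x80000000, 0xFFFFFFFF),
}


def hash32(doc_id):
    """Stand-in 32-bit hash (CRC32); Solr uses MurmurHash3 instead."""
    return zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF


def shard_for_hash(hash_value):
    """Return the shard whose inclusive range contains the hash value."""
    for name, (lo, hi) in SHARD_RANGES.items():
        if lo <= hash_value <= hi:
            return name
    raise ValueError(f"hash {hash_value:#x} is outside every shard range")


def shard_for_doc(doc_id):
    """Route a document by hashing its ID and mapping the hash to a range."""
    return shard_for_hash(hash32(doc_id))


for doc_id in ("bookdoc1", "magazinedoc1", "bookdoc2"):
    print(doc_id, shard_for_doc(doc_id))
```

Because the hash spreads IDs roughly uniformly over the 32-bit space, equal-sized ranges receive roughly equal numbers of documents.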

Default Routing

[Diagram: the 32-bit hash of each document ID selects a shard in the collection. Shard 1 covers 0 - 7fffffff and Shard 2 covers 80000000 - ffffffff. Example document IDs: bookdoc1, magazinedoc1, bookdoc2. Example hash values: 858919514, 2516704228, 413288864.]

Default Routing - Querying

[Diagram: the application sends q=soccer to the collection, which consists of Shard 1, Shard 2, Shard 3, and Shard 4.]

Custom Routing

- Route documents to specific shards

- based on a shard key component in the document ID

Custom Routing

- send documents with a prefix in the document ID

- prefix in ID will be used to calculate the hash to determine the shard

- Prefix must be separated by an exclamation mark (!)

- Example:

1. Book!doc1

2. Magazine!doc1

3. Book!author!doc2
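A rough Python sketch of how a prefix can steer the hash. It assumes the composite scheme takes the upper 16 bits of the hash from the prefix and the lower 16 bits from the rest of the ID, so that all documents sharing a prefix land in one contiguous slice of the hash space. CRC32 again stands in for MurmurHash3, the function names are hypothetical, and three-part IDs such as Book!author!doc2 are not modelled (everything after the first "!" is treated as the document part).

```python
import zlib


def hash32(text):
    """Stand-in 32-bit hash (CRC32); Solr uses MurmurHash3 instead."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def composite_hash(doc_id):
    """Upper 16 bits from the prefix, lower 16 bits from the rest of the ID."""
    if "!" not in doc_id:
        return hash32(doc_id)
    prefix, rest = doc_id.split("!", 1)
    return (hash32(prefix) & 0xFFFF0000) | (hash32(rest) & 0x0000FFFF)


def prefix_range(prefix):
    """Inclusive hash range that every ID with this prefix falls into."""
    top = hash32(prefix) & 0xFFFF0000
    return top, top | 0x0000FFFF


lo, hi = prefix_range("book")
for doc_id in ("book!doc1", "book!doc2", "magazine!doc1"):
    h = composite_hash(doc_id)
    print(doc_id, f"{h:08x}", lo <= h <= hi)
```

Since the prefix fixes the upper bits, a query that names the prefix only has to visit the shard or shards whose ranges overlap that slice.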

Custom Routing - Indexing

[Diagram: documents with IDs book!doc1, magazine!doc1, and book!doc2 are routed by prefix into a collection with Shard 1 (0 - 7fffffff) and Shard 2 (80000000 - ffffffff).]

Custom Routing - Querying

http://10.0.0.7:8983/solr/collection1/select?q=soccer&_route_=books!

http://10.0.0.7:8983/solr/collection1/select?q=soccer&_route_=books!,magazines!

Custom Routing - Querying

[Diagram: the application sends q=soccer&_route_=books! to the collection.]

Implicit Router

- A field can be defined while creating collection to be used for routing

http://localhost:8983/solr/admin/collections?action=CREATE&name=articles&router.name=implicit&router.field=article-type

Collection Aliasing

- allows you to setup a virtual collection that actually points to one or more real collections

- Virtual collection == alias

localhost:8983/solr/admin/collections?action=CREATEALIAS&name=alias-name&collections=collection-list

Collection Aliasing

- Time-series data

[Diagram: the aliases last3months and latest sit on top of the real collections July, Aug, Sep, and Oct.]

localhost:8983/solr/admin/collections?action=CREATEALIAS&name=last3months&collections=aug,sep,oct

localhost:8983/solr/admin/collections?action=CREATEALIAS&name=latest&collections=oct

Collection Aliasing

- When a new monthly collection (nov) is added, re-create the aliases to move the window forward:

localhost:8983/solr/admin/collections?action=CREATEALIAS&name=last3months&collections=sep,oct,nov

localhost:8983/solr/admin/collections?action=CREATEALIAS&name=latest&collections=nov
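The monthly rotation can be scripted. Below is a small Python sketch that, given the list of monthly collections in chronological order, produces the two CREATEALIAS requests for the rolling window. The function names and the base URL are assumptions made for illustration; the alias names follow the slides.

```python
from urllib.parse import urlencode

# Assumed base URL of any node in the cluster.
BASE = "http://localhost:8983/solr/admin/collections"


def alias_url(name, collections):
    """Build a CREATEALIAS request pointing `name` at the given collections."""
    params = {
        "action": "CREATEALIAS",
        "name": name,
        "collections": ",".join(collections),
    }
    # Keep commas literal so the collection list stays readable.
    return f"{BASE}?{urlencode(params, safe=',')}"


def rolling_alias_urls(monthly_collections, window=3):
    """Requests that re-point last<window>months and latest after a new month."""
    return [
        alias_url(f"last{window}months", monthly_collections[-window:]),
        alias_url("latest", monthly_collections[-1:]),
    ]


for url in rolling_alias_urls(["jul", "aug", "sep", "oct", "nov"]):
    print(url)
```

Re-issuing CREATEALIAS with an existing alias name re-points it, so clients that query last3months or latest need no change when a month rolls over.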

Collection Aliasing

- Aliases can be:

• updated on the fly

• queried just like a normal collection

• used for indexing as long as it is pointing to a single collection

Other Features

- Near-Real-Time Search

- Atomic Updates

- Optimistic Locking

- HTTPS

- Use HDFS for storing indexes

- Use MapReduce for building index

Thanks

- Attributions:

• Shalin Mangar’s slides on “SolrCloud: Searching Big Data”

• Rafał Kuć’s slides on “Scaling Solr with SolrCloud”

- Connect

• saumitra.srivastav@glassbeam.com

• saumitra.srivastav7@gmail.com

• https://www.linkedin.com/in/saumitras

• @_saumitra_

- Join:

• http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/
