SolrCloud Failover and Testing
Mark Miller (Cloudera)
Mark Miller
Lucene Committer, Solr Committer.
Works for Cloudera.
A lot of work on SolrCloud.
At Cloudera…
We are building an Enterprise Data Hub
Search is a part of that.
Solr is our search engine.
Solr On HDFS
Performance is good.
It can be even better.
A shared filesystem has advantages.
SolrCloud Reminder
Limitation
We replicate via both Solr and HDFS.
Replicating with just one has huge tradeoffs.
We are working on better tradeoffs.
autoAddReplicas
A new per collection option.
When a replica goes down, it is replaced on a node that is still up.
With a shared filesystem as well, all replicas can go down and you can still fail over automatically.
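As a rough sketch, the option can be requested when a collection is created through the Collections API. The snippet below only builds the request URL and sends nothing. The host, collection name, and shard and replica counts are placeholder assumptions; autoAddReplicas is the per collection option named above.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor):
    """Build a Collections API CREATE request with autoAddReplicas enabled."""
    params = {
        "action": "CREATE",
        "name": name,                      # placeholder collection name
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "autoAddReplicas": "true",         # the per collection option
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Placeholder host and sizes, for illustration only.
url = create_collection_url("http://localhost:8983/solr", "demo", 2, 2)
print(url)
```

Because the option is set per collection, it can be enabled only for the collections that need automatic replacement.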
How Does it Work?
SolrCloud elects a fault tolerant, single node to be an Overseer.
The Overseer monitors the cluster state in ZooKeeper.
When necessary, it creates a new SolrCore on a machine that is still up to replace ‘downed’ replicas.
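The replacement step can be pictured with a small planning function. This is a simplified sketch, not the Overseer's actual code. The cluster state shape, the node names, and the "least loaded node" placement rule are all assumptions made for illustration.

```python
def plan_replacements(cluster_state, live_nodes):
    """Return (shard, dead_node, target_node) for each replica on a dead node.

    cluster_state: {shard_name: [node_name, ...]}  (hypothetical shape)
    live_nodes: set of node names that are currently up
    """
    # Count replicas per live node so new cores go to the least loaded node.
    load = {node: 0 for node in live_nodes}
    for nodes in cluster_state.values():
        for node in nodes:
            if node in load:
                load[node] += 1

    plan = []
    for shard, nodes in sorted(cluster_state.items()):
        hosting = set(nodes)  # nodes that hold, or will hold, this shard
        for node in nodes:
            if node in live_nodes:
                continue
            # Only consider live nodes that do not already host this shard.
            candidates = [n for n in live_nodes if n not in hosting]
            if not candidates:
                continue  # nowhere to place a replacement
            target = min(candidates, key=lambda n: (load[n], n))
            load[target] += 1
            hosting.add(target)
            plan.append((shard, node, target))
    return plan


# node2 is down; each of its replicas is planned onto a node that is up.
state = {"shard1": ["node1", "node2"], "shard2": ["node2", "node3"]}
plan = plan_replacements(state, live_nodes={"node1", "node3"})
print(plan)
```

With a shared filesystem, the new core can point at the existing index data, which is why this kind of plan still works when every replica of a shard is down.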
Let’s Do A Demo!
SolrCloud Testing
Let’s talk about tests.
SolrCloud Tests
We did a straw man implementation of SolrCloud first.
We did the same for tests.
We favored integration tests over unit tests.
We did not make enough tests.
Distributed Tests
Are hard.
For a variety of reasons.
The Lucene / Solr testing framework hurts in order to help.
The Lucene / Solr Test Framework
Randomized Testing.
Rule Enforcement.
The Jenkins Cluster.
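Randomized testing in a nutshell: each run draws its inputs from a random seed, and a failure reports that seed so the run can be replayed. The sketch below shows the idea in plain Python. It is not the Lucene / Solr test framework API, and the function under test is a made-up stand-in.

```python
import random


def dedupe_keep_order(items):
    """Toy function under test: drop duplicates, keep first occurrences."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def run_randomized_check(seed, iterations=100):
    """Check random inputs derived from `seed`; return the seed on success."""
    rng = random.Random(seed)  # all randomness flows from this one seed
    for _ in range(iterations):
        data = [rng.randint(0, 20) for _ in range(rng.randint(0, 50))]
        result = dedupe_keep_order(data)
        # Put the seed in the failure message so the run can be reproduced.
        assert len(result) == len(set(data)), f"failed with seed={seed}"
        assert set(result) == set(data), f"failed with seed={seed}"
    return seed


seed = random.randrange(2**32)  # a fresh seed on every run
run_randomized_check(seed)
print(f"passed with seed={seed}")
```

Every run covers different inputs, so a test that passes locally can still fail later on a build machine. That is part of how the framework hurts in order to help.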
Mocks
We avoided doing them early - too much churn.
They can be dangerous to future contributors / refactoring.
Some of the early mocking that did get in is a little painful.
We need them for good unit tests.
Testing Culture
Lucene has an A+ testing culture. In many cases, it’s easier for Lucene.
Solr has a C testing culture.
Solr needs to get better.
Prescription?
More focus on backfilling tests when adding features or changing code.
More focus on fixing frequently failing tests.
More focus on unit tests.
The End
@heismark
Thank You.