SolrCloud Failover and Testing

SolrCloud Failover and Testing Mark Miller (Cloudera)

Transcript of SolrCloud Failover and Testing

Page 1: SolrCloud Failover and Testing

SolrCloud Failover and Testing

Mark Miller (Cloudera)

Page 2: SolrCloud Failover and Testing

Mark Miller

Lucene Committer, Solr Committer.

Works for Cloudera.

A lot of work on SolrCloud.

Page 3: SolrCloud Failover and Testing

At Cloudera…

We are building an Enterprise Data Hub

Search is a part of that.

Solr is our search engine.

Page 4: SolrCloud Failover and Testing
Page 5: SolrCloud Failover and Testing

Solr On HDFS

Performance is good.

It can be even better.

A shared filesystem has advantages.

Page 6: SolrCloud Failover and Testing

SolrCloud Reminder

Page 7: SolrCloud Failover and Testing

Limitation

We replicate via both Solr and HDFS.

Replicating with just one has huge tradeoffs.

We are working on better tradeoffs.

Page 8: SolrCloud Failover and Testing

autoAddReplicas

A new per collection option.

When a replica goes down, it is replaced on a node that is still up.

With a shared filesystem as well, all replicas can go down and you can still fail over automatically.
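As a rough illustration of the per-collection option, the sketch below builds a Collections API CREATE request with autoAddReplicas set. The host, port, collection name, and shard counts are placeholder assumptions, and the helper name is hypothetical; check the Collections API reference for your Solr version before relying on the parameter.

```python
from urllib.parse import urlencode


def build_create_collection_url(base_url, name, num_shards, replication_factor,
                                auto_add_replicas=True):
    """Build a Collections API CREATE URL (hypothetical helper, not part of Solr)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        # The per-collection option discussed on this slide.
        "autoAddReplicas": "true" if auto_add_replicas else "false",
    }
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Assumed local Solr address and collection name, for illustration only.
print(build_create_collection_url("http://localhost:8983/solr", "demo", 2, 1))
```

The URL is only constructed here, not sent, so the sketch runs without a live cluster.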

Page 9: SolrCloud Failover and Testing

How Does it Work?

SolrCloud elects a single, fault-tolerant node to be the Overseer.

The Overseer monitors the cluster state in ZooKeeper.

When necessary, it creates a new SolrCore on a machine that is up to replace ‘downed’ replicas.
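The idea above can be sketched as a toy model: given the replicas recorded in the cluster state and the set of live nodes, decide where each 'downed' replica should be recreated. This is not Solr's actual Overseer code; the Replica type, the plan_replacements function, and the placement rule (prefer the least-loaded live node that does not already host the same shard) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    shard: str
    node: str  # node the replica was last hosted on


def plan_replacements(replicas, live_nodes):
    """Return {replica name: target node} for replicas whose node is not live."""
    live = sorted(live_nodes)
    if not live:
        return {}  # nothing is up, so nothing can be replaced

    # Count the replicas currently hosted on each live node.
    load = {node: 0 for node in live}
    for r in replicas:
        if r.node in load:
            load[r.node] += 1

    plan = {}
    for r in replicas:
        if r.node in load:
            continue  # replica is on a live node, nothing to do
        # Live nodes that already hold (or are planned to hold) this shard.
        same_shard = {x.node for x in replicas
                      if x.shard == r.shard and x.node in load}
        same_shard |= {plan[x.name] for x in replicas
                       if x.shard == r.shard and x.name in plan}
        candidates = [n for n in live if n not in same_shard] or live
        target = min(candidates, key=lambda n: (load[n], n))
        plan[r.name] = target
        load[target] += 1
    return plan


cluster = [
    Replica("r1", "shard1", "node1"),
    Replica("r2", "shard1", "node2"),
    Replica("r3", "shard2", "node2"),
    Replica("r4", "shard2", "node3"),
]
# node2 has gone down; node1 and node3 are still up.
print(plan_replacements(cluster, {"node1", "node3"}))
```

In the real system the cluster state and the list of live nodes come from ZooKeeper; here they are plain Python values so the placement logic can be read in isolation.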

Page 10: SolrCloud Failover and Testing

Let’s Do A Demo!

Page 11: SolrCloud Failover and Testing

SolrCloud Testing

Let’s talk about tests.

Page 12: SolrCloud Failover and Testing

SolrCloud Tests

We did a straw man implementation of SolrCloud first.

We did the same for tests.

We favored integration tests over unit tests.

We did not make enough tests.

Page 13: SolrCloud Failover and Testing

Distributed Tests

Are hard.

For a variety of reasons.

The Lucene / Solr testing framework hurts in order to help.

Page 14: SolrCloud Failover and Testing

The Lucene / Solr Test Framework

Randomized Testing.

Rule Enforcement.

The Jenkins Cluster.
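To make the randomized testing idea concrete, here is a minimal sketch in plain Python rather than the Lucene / Solr framework itself. It assumes the core pattern is: draw test inputs from a seeded random source, and report the seed on failure so the exact run can be reproduced. The names run_randomized and check_sort_is_idempotent are hypothetical.

```python
import random


def run_randomized(check, iterations=100, seed=None):
    """Run check(rng) repeatedly; on failure, report the seed for reproduction."""
    if seed is None:
        # Pick a fresh seed per run so different runs explore different inputs.
        seed = random.SystemRandom().getrandbits(32)
    rng = random.Random(seed)
    for i in range(iterations):
        try:
            check(rng)
        except AssertionError as e:
            raise AssertionError(
                f"failed on iteration {i} with seed={seed}: {e}") from e
    return seed


def check_sort_is_idempotent(rng):
    # Random length and random contents on every iteration.
    data = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]
    once = sorted(data)
    assert sorted(once) == once


used_seed = run_randomized(check_sort_is_idempotent)
print(f"passed with seed={used_seed}")
```

Passing the reported seed back in (for example run_randomized(check_sort_is_idempotent, seed=used_seed)) replays the same sequence of inputs, which is what makes a failure from a shared build machine debuggable locally.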

Page 15: SolrCloud Failover and Testing

Mocks

We avoided doing them early - too much churn.

They can be dangerous to future contributors / refactoring.

Some of the early mocking that did get in is a little painful.

We need them for good unit tests.

Page 16: SolrCloud Failover and Testing

Testing Culture

Lucene has an A+ testing culture. In many cases, it’s easier for Lucene.

Solr has a C testing culture.

Solr needs to get better.

Page 17: SolrCloud Failover and Testing

Prescription?

More focus on backfilling tests when adding features or changing code.

More focus on fixing frequently failing tests.

More focus on unit tests.

Page 18: SolrCloud Failover and Testing

The End

@heismark

Thank You.