The State of HBase Replication

Post on 27-Aug-2014

Speaker: Jean-Daniel Cryans (Cloudera)

HBase Replication has come a long way since its inception in HBase 0.89 almost four years ago. Today, master-master and cyclic replication setups are supported; many bug fixes and new features, such as log compression, per-family peer configuration, and throttling, have been added; and a major refactoring has been done. This presentation will recap the work done during the past four years, present a few use cases that are currently in production, and take a look at the roadmap.

Transcript of The State of HBase Replication

The State of HBase Replication
Jean-Daniel Cryans
May 5th, 2014

©2014 Cloudera, Inc. All rights reserved.

About me

• Software Engineer at Cloudera, Storage team
• Apache HBase committer since 2008, PMC member

Motivation for HBase Replication

• Even though HBase is:
  • distributed;
  • fault-tolerant;
  • highly available; and
  • almost magic.

The Current State

• It’s production-ready.
• It’s used to replicate data between thousands of nodes across continents.
• It’s used for Disaster Recovery, geo-distributed serving, and more.

Agenda

• Four Years of Replication
• Use Cases in Production
• Roadmap

Design

• Clusters are distinct
• Pull vs. push
• Sync vs. async

Clusters are Distinct

• An HBase cluster doesn’t span data centers or HDFS instances
• .META. operations aren’t replicated
• Regions can be different
• Security has to be configured for each cluster

[Diagram: a master cluster with 20 region servers (RS) and a slave cluster with 15 RS]

Push instead of Pull

[Diagram: MySQL Replication uses Pull. The MySQL slave in Cluster B gets the binlog from the MySQL master in Cluster A and applies it locally.]

[Diagram: HBase Replication uses Push. A region server (RS) in Cluster A replicates entries to an RS in Cluster B, which applies them to its cluster.]

Async instead of Sync

[Diagram: Synchronous Replication. A Put reaches the RS in Cluster A (1) and is written to its HLog and MemStore (2, 3). The RS then sends the Put to the RS in Cluster B (4), which writes it to its own HLog and MemStore (5, 6). Only then do the Acks travel back (7, 8).]

[Diagram: Asynchronous Replication. In Cluster A, a Put reaches the RS (1), is written to the HLog and MemStore (2, 3), and is acknowledged right away (4). Separately, an HLogTailingThread reads the HLog (1) and sends a Put to the RS in Cluster B (2), which writes it to its HLog and MemStore (3, 4) and returns an Ack (5).]

First Release - 0.90.0

• Simple master-slave (only one)
• Disabled by default
• Uses ZK as a metadata store

Original Implementation

[Diagram: in a region server on the master cluster, a ReplicationSource (with a ZooKeeperWatcher) calls replicateLogEntries() on the ReplicationSink of a region server on the slave cluster. The sink applies the entries through HTable as Put and Delete operations.]

First Lesson Learned

• HDFS doesn’t support tailing files being written to. It requires:
  • open()
  • seek() // go where we stopped last time
  • while (not EOF && not enoughData)
    • read()
  • close()
  • repeat
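The open/seek/read/close cycle above can be sketched as follows. This is a minimal illustration, not the HBase implementation: a local file stands in for an HDFS log, and `tail_once`, `demo_tail`, `max_bytes` and `chunk` are hypothetical names chosen for the example.

```python
import os
import tempfile


def tail_once(path, offset, max_bytes=64 * 1024, chunk=4096):
    """Run one open/seek/read/close cycle and return (data, new_offset)."""
    collected = bytearray()
    with open(path, "rb") as f:              # open()
        f.seek(offset)                       # seek(): go where we stopped last time
        while len(collected) < max_bytes:    # stop once we have enough data...
            block = f.read(min(chunk, max_bytes - len(collected)))
            if not block:                    # ...or once we reach EOF
                break
            collected += block
    # Leaving the with-block is the close(); the caller repeats with new_offset.
    return bytes(collected), offset + len(collected)


def demo_tail():
    """Append to a file between two cycles and read only the new bytes each time."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "ab") as writer:
            writer.write(b"edit-1\n")
        first, pos = tail_once(path, 0)
        with open(path, "ab") as writer:
            writer.write(b"edit-2\n")
        second, pos = tail_once(path, pos)
        return first, second, pos
    finally:
        os.remove(path)
```

Keeping the offset outside the reader is what lets each cycle resume where the previous one stopped, at the cost of reopening the file every time.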

Second Lesson Learned

• Single threaded, non-batched ZK is slow
• ZK didn’t have an atomic move operation
  • Doubles # ops needed, race conditions

[Diagram: znodes while an hlog entry is moved from RS1's queue to RS2]
/hbase/replication/RS1/1/hlog1, hlog2, ...
/hbase/replication/RS2/1-RS1/hlog1
1. create new hlog2
2. delete old hlog2
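A toy model of the two-step move shows why it doubles the operation count and leaves a window for races. `ToyZnodeStore` is a hypothetical in-memory stand-in, not the ZooKeeper client API, and the simulated crash is an assumption made for illustration.

```python
class ToyZnodeStore:
    """Hypothetical in-memory stand-in for a znode tree (not the ZooKeeper API)."""

    def __init__(self):
        self.nodes = {}
        self.ops = 0

    def create(self, path, data=b""):
        if path in self.nodes:
            raise KeyError("node exists: " + path)
        self.nodes[path] = data
        self.ops += 1

    def delete(self, path):
        del self.nodes[path]
        self.ops += 1


def move_znode(store, src, dst, crash_after_create=False):
    """Emulate a move as create + delete; the two steps are not atomic."""
    store.create(dst, store.nodes[src])      # 1. create new hlog2
    if crash_after_create:
        return False                         # simulated crash: both copies exist
    store.delete(src)                        # 2. delete old hlog2
    return True


def demo_move(crash):
    store = ToyZnodeStore()
    src = "/hbase/replication/RS1/1/hlog2"
    dst = "/hbase/replication/RS2/1-RS1/hlog2"
    store.create(src)
    move_znode(store, src, dst, crash_after_create=crash)
    return src in store.nodes, dst in store.nodes, store.ops
```

With `crash=True` the entry is visible under both region servers, which is the kind of intermediate state another observer could act on.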

Second Release - 0.92.0

• Cyclic replication
• Multi-slave (scope LOCAL or GLOBAL)
• Enable / disable peer
• Special configurations

Cyclic Replication

[Diagram: Cluster1, Cluster2 and Cluster3 replicate in a cycle. A Put of Row X on Cluster1 is replicated to Cluster2, then to Cluster3. Cluster3 sees that Row X is from Cluster1: don’t replicate!]
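The loop-avoidance rule in the diagram can be sketched as a small simulation. It assumes each edit carries the id of the cluster where it was first written; `replicate_in_ring` and the dictionary layout are hypothetical and only model the rule, not HBase's code.

```python
from collections import deque


def replicate_in_ring(ring, origin, row):
    """Ship one edit around a ring; ring maps a cluster id to the peer it pushes to."""
    tables = {cluster_id: {} for cluster_id in ring}
    shipments = []
    edit = {"row": row, "origin": origin}    # edit tagged with its originating cluster
    tables[origin][row] = edit
    pending = deque([origin])
    while pending:
        current = pending.popleft()
        peer = ring[current]
        if peer == edit["origin"]:
            continue                         # "Row X is from 1: don't replicate!"
        tables[peer][row] = edit
        shipments.append((current, peer))
        pending.append(peer)
    return tables, shipments
```

For a ring `{1: 2, 2: 3, 3: 1}` and a Put on cluster 1, the edit is shipped from 1 to 2 and from 2 to 3, and cluster 3 drops it instead of sending it back to 1.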

Multi-Slave

[Diagram: Cluster1, Cluster2 and Cluster3. A Put of Row X on one cluster is replicated to the other two.]

Enable / Disable Peers

> disable_peer '2'

[Diagram: on the Cluster 1 RS, the HLogTailingThread keeps asking "Is the peer enabled?" While the peer is disabled, HLogs pile up on Cluster 1 and nothing is shipped to the Cluster 2 RS.]
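What the slide sequence appears to show is that a disabled peer does not drop edits: logs queue up until the peer is enabled again. A minimal sketch of that behaviour, with `ToyReplicationSource` and its methods as hypothetical names:

```python
class ToyReplicationSource:
    """Hypothetical model of a source that checks the peer state before shipping."""

    def __init__(self):
        self.peer_enabled = True
        self.queue = []      # HLogs waiting to be replicated
        self.shipped = []    # HLogs already sent to the peer

    def roll_log(self, name):
        self.queue.append(name)

    def run_once(self):
        """One pass of the tailing thread; returns how many logs were shipped."""
        if not self.peer_enabled:            # "Is the peer enabled?"
            return 0
        count = len(self.queue)
        self.shipped.extend(self.queue)
        self.queue.clear()
        return count


def demo_disabled_peer():
    source = ToyReplicationSource()
    source.peer_enabled = False              # models: disable_peer '2'
    for name in ("hlog1", "hlog2", "hlog3"):
        source.roll_log(name)
    while_disabled = source.run_once()
    backlog = len(source.queue)
    source.peer_enabled = True               # peer enabled again
    after_enable = source.run_once()
    return while_disabled, backlog, after_enable
```

The practical consequence is that a peer left disabled for a long time makes the source cluster retain a growing backlog of logs.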

Special Configurations

• KEEP_DELETED_CELLS
  • Must be used on slaves with replication when deleting data.
• MIN_VERSIONS
  • With TTL, makes it easy to configure a slave that contains only the last few days of data.

Third Lesson Learned

• It’s easy to DDoS yourself.
• Replication was using the normal handlers...
• ... and using them to write back!

[Diagram: Handler1: Put, Handler2: Delete, Handler3: Replicate, Handler4: Get, Handler5: Put. The replicated Put goes in the queue.]

Fourth Lesson Learned

• Instinctively, what would something called stop_replication do?
• Good intentions, bad outcomes, HBASE-8861

[Slide graphic: start/stop_replication, crossed out]

Third Release - 0.96.0 / 0.98.0

• Replication enabled by default!
• Completely refactored for readability/extensibility (Chris Trezzo)
• ReplicationSyncUp tool (HBASE-9047)
• Throttling (HBASE-9501)
• Finer grained replication controls (HBASE-8751)

ReplicationSyncUp Tool

• Works on an offline cluster
• Can finish replicating the queues in ZK
• Useful to finish draining a master cluster

[Diagram: ReplicationSyncUp shown with two cluster stacks, each made of HBase, HDFS and ZooKeeper]

Finer Grained Replication Controls

> set_peer_tableCFs '2', "table1; table2:cf1,cf2; table3:cfA,cfB"

• Meaning: enable replication to peer #2 for:
  • All of table1
  • cf1 and cf2 from table2
  • cfA and cfB from table3
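To make the string format concrete, here is a small parser for the table/column-family specification passed to set_peer_tableCFs. `parse_table_cfs` is a hypothetical helper written for this example, and using `None` to mean "all column families" is an assumption of the sketch, not how HBase represents it.

```python
def parse_table_cfs(spec):
    """Parse 'table1; table2:cf1,cf2' into {table: [cfs] or None for all families}."""
    result = {}
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue                         # tolerate a trailing ';'
        table, sep, cfs = entry.partition(":")
        table = table.strip()
        if sep:
            result[table] = [cf.strip() for cf in cfs.split(",") if cf.strip()]
        else:
            result[table] = None             # no ':' means the whole table
    return result
```

Applied to the string on the slide, this yields all of table1, cf1 and cf2 for table2, and cfA and cfB for table3.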

Agenda

• Four Years of Replication
• Use Cases in Production
• Roadmap

Flurry

• Two data centers, coast to coast
• Three clusters, in master-master pairs
  • 1200 nodes
  • 800 nodes
  • 30 nodes
• Replication traffic: 2 Gbps
• Latency between DCs: 85 ms

Opower

• Two clusters, same data center
  • Master: tens of nodes
  • Slave: tens of nodes
• Replication traffic: 1 GB/day
• Bulk load replication traffic: 180 GB/day
• Recent use case

Lily HBase Indexer

• Collaboration between NGData & Cloudera.
  • NGData are the creators of the Lily data management platform.
• Lily HBase Indexer
  • Service which acts as an HBase replication listener.
  • Custom sink writes to SolrCloud.
  • Integrates Cloudera Morphlines library for ETL of rows.

Agenda

• Four Years of Replication
• Use Cases in Production
• Roadmap

Stop Relying on Permanent Znodes

• Current rule is to never rely on znodes to survive cluster restarts, upgrades, etc.
• State data should be kept in an HBase table.
• Notification is done through a new mechanism.
• See: https://issues.apache.org/jira/browse/HBASE-10295

Define a Replication Interface

• Replication is somewhat extensible but it lacks stable interfaces.
• The HBase Indexer is such an extension and it required surgery every time a committer sneezed.
• See: https://issues.apache.org/jira/browse/HBASE-10504

Distributed Counters

• Incrementing consists of:
  1. Taking a lock;
  2. Get’ing the current value; and
  3. Put’ing the newly incremented value.
• This breaks in Master-Master because the Puts are overwriting each other.
• See https://issues.apache.org/jira/browse/HBASE-2804
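A toy model of two clusters makes the lost update visible. It assumes that what gets replicated is the Put of the absolute value and that the newest timestamp wins on arrival; the row lock is local to each cluster and is left out. All names here are hypothetical.

```python
def increment(cluster, row, amount, clock):
    """Get the current value, then Put the incremented value; returns that Put."""
    current = cluster.get(row, (0, 0))[0]    # Get'ing the current value
    put = (current + amount, clock)          # (value, timestamp)
    cluster[row] = put                       # Put'ing the newly incremented value
    return put


def apply_replicated_put(cluster, row, put):
    """Apply a Put shipped by the other cluster; the newest timestamp wins."""
    if put[1] >= cluster.get(row, (0, -1))[1]:
        cluster[row] = put


def demo_counters():
    cluster_a, cluster_b = {}, {}
    put_a = increment(cluster_a, "hits", 1, clock=100)
    put_b = increment(cluster_b, "hits", 1, clock=101)
    apply_replicated_put(cluster_b, "hits", put_a)   # older Put, ignored
    apply_replicated_put(cluster_a, "hits", put_b)   # newer Put, overwrites
    return cluster_a["hits"][0], cluster_b["hits"][0]
```

Two increments were issued, yet both clusters end up with a counter of 1, because each side replicated a value rather than a delta.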

More Tooling

• Replication management console, one shell to rule all the clusters!
• Replication bootstrapping tool.
• Tool that can move queues between region servers.
• Tool that can throttle replication on a live cluster.

Questions?

• Or ping me async:
  • @jdcryans
  • jdcryans@cloudera.com
  • jdcryans on #hbase irc.freenode.net