C* Summit 2013: When Bad Things Happen to Good Data: A Deep Dive Into How Cassandra Resolves Inconsistent Data by Jason Brown

Transcript
Page 1

When Bad Things Happen to Good Data:

Understanding Anti-Entropy in Cassandra

Jason Brown

@jasobrown [email protected]

Page 2

About me

•  Senior Software Engineer @ Netflix
•  Apache Cassandra committer

•  E-Commerce Architect, Major League Baseball Advanced Media

•  Wireless developer (J2ME and BREW)

Page 3

Maintaining consistent state is hard in a distributed system

CAP theorem works against you

Page 4

Inconsistencies creep in

•  Node is down
•  Network partition
•  Dropped mutations
•  Process crash before commit log flush
•  File corruption

Cassandra trades C for AP

Page 5

Anti-Entropy Overview

•  write time
   o  tunable consistency
   o  atomic batches
   o  hinted handoff

•  read time
   o  consistent reads
   o  read repair

•  maintenance time
   o  node repair

Page 6

Write Time

Page 7

Cassandra Writes Basics

•  determine all replica nodes in all DCs
•  send to replicas in the local DC
•  send to one replica node in each remote DC
   o  it will forward to its peers
•  all respond back to the original coordinator
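
To make the fan-out concrete, here is a toy sketch in plain Java (not Cassandra's internal classes): every replica in the coordinator's own DC gets the mutation directly, and each remote DC gets a single copy plus a forward-to list for the other replicas there.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Conceptual sketch only -- not Cassandra's internal classes. */
public class WriteFanoutSketch {

    static List<String> planWrite(Map<String, List<String>> replicasByDc, String localDc) {
        List<String> messages = new ArrayList<>();
        for (Map.Entry<String, List<String>> dc : replicasByDc.entrySet()) {
            List<String> replicas = dc.getValue();
            if (dc.getKey().equals(localDc)) {
                // Local DC: every replica gets the mutation directly.
                for (String replica : replicas) {
                    messages.add("send mutation to " + replica);
                }
            } else {
                // Remote DC: one inter-DC message; the receiving replica forwards to its peers.
                messages.add("send mutation to " + replicas.get(0)
                        + " (forward to " + replicas.subList(1, replicas.size()) + ")");
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        Map<String, List<String>> replicas = new LinkedHashMap<>();
        replicas.put("us-east", List.of("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        replicas.put("eu-west", List.of("10.1.0.1", "10.1.0.2", "10.1.0.3"));
        planWrite(replicas, "us-east").forEach(System.out::println);
    }
}
```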

Page 8

Writes - request path

Page 9

Writes - response path

Page 10

Writes - Tunable consistency

Coordinator blocks for specified count of replicas to respond

•  consistency level
   o  ALL
   o  EACH_QUORUM
   o  LOCAL_QUORUM
   o  ONE / TWO / THREE
   o  ANY
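
As a client-side illustration, here is a minimal write at LOCAL_QUORUM using the DataStax Java driver (3.x API assumed; the demo_ks keyspace and users table are hypothetical).

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class TunableConsistencyWrite {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo_ks")) {          // hypothetical keyspace
            // Per-request consistency: the coordinator blocks until a quorum of replicas
            // in the local DC acknowledge the write.
            SimpleStatement insert = new SimpleStatement(
                    "INSERT INTO users (user_id, name) VALUES (?, ?)", 42L, "jason");
            insert.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            session.execute(insert);
        }
    }
}
```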

Page 11

Hinted handoff

Save a copy of the write for down nodes, and replay later

hint = target replica + mutation data

Page 12

Hinted handoff - storing

•  on coordinator, store a hint for any nodes not currently 'up'

•  if a replica doesn't respond within write_request_timeout_in_ms, store a hint

•  max_hint_window_in_ms - the maximum amount of time hints will be generated for a dead host
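
A toy sketch of the storing rule (plain Java, not Cassandra's internal code): store a hint only while the target has been down for less than the hint window, which defaults to three hours.

```java
import java.util.concurrent.TimeUnit;

/** Conceptual sketch only -- not Cassandra's internal classes. A hint is the target
 *  replica plus the mutation data; it is stored when the replica is down or unresponsive,
 *  but only within max_hint_window_in_ms of the replica going down. */
public class HintDecisionSketch {

    static final long MAX_HINT_WINDOW_MS = TimeUnit.HOURS.toMillis(3);  // default is 3 hours

    static boolean shouldStoreHint(boolean replicaResponded, long replicaDownSinceMs, long nowMs) {
        if (replicaResponded) {
            return false;                          // write succeeded, no hint needed
        }
        long downtime = nowMs - replicaDownSinceMs;
        return downtime < MAX_HINT_WINDOW_MS;      // past the window: drop it, rely on repair
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(shouldStoreHint(false, now - TimeUnit.MINUTES.toMillis(30), now)); // true
        System.out.println(shouldStoreHint(false, now - TimeUnit.HOURS.toMillis(5), now));    // false
    }
}
```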

Page 13

Hinted handoff - replay

•  try to send hints to nodes
•  runs every ten minutes
•  multithreaded (as of 1.2)
•  throttleable (KB per second)
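
A rough sketch of the replay loop as a periodic, throttled task (plain Java, purely illustrative; the hint sizes and helper names are made up).

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Conceptual sketch only -- not Cassandra's internal classes. Hint replay is a periodic,
 *  throttled background task: every ten minutes, stream pending hints to nodes that are
 *  back up, pausing between sends to stay under a configured KB/s cap. */
public class HintReplaySketch {

    static final int THROTTLE_KB_PER_SEC = 1024;

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2); // delivery is multithreaded
        scheduler.scheduleWithFixedDelay(
                () -> replayHints(List.of(64, 128, 256)), 0, 10, TimeUnit.MINUTES);
    }

    /** Replays hints whose sizes (in KB) are given, honoring the throttle. */
    static void replayHints(List<Integer> hintSizesKb) {
        for (int sizeKb : hintSizesKb) {
            System.out.println("sending hint of " + sizeKb + " KB to recovered replica");
            try {
                // Spread sends out so the average rate stays under THROTTLE_KB_PER_SEC.
                Thread.sleep(1000L * sizeKb / THROTTLE_KB_PER_SEC);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```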

Page 14

Hinted Handoff - R2 down

R2 down, coordinator (R1) stores hint

Page 15

Hinted handoff - replay

R2 comes back up, R1 plays hints for it

Page 16

What if coordinator dies?

Page 17

Atomic Batches

•  coordinator stores the incoming mutation on two peers in the same DC
   o  deletes it from the peers on successful completion
•  peers will replay the batch if it is not deleted
   o  runs every 60 seconds
•  with 1.2, all mutations use atomic batches
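
From the client side, a logged batch is the atomic-batch path. A minimal sketch with the DataStax Java driver (3.x API assumed; the keyspace and tables are hypothetical).

```java
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class AtomicBatchExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo_ks")) {   // hypothetical keyspace
            // A LOGGED batch is written to the batchlog on peer nodes before being applied,
            // so the batch is replayed if the coordinator dies mid-flight.
            BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
            batch.add(new SimpleStatement(
                    "INSERT INTO user_by_id (user_id, name) VALUES (?, ?)", 42L, "jason"));
            batch.add(new SimpleStatement(
                    "INSERT INTO user_by_name (name, user_id) VALUES (?, ?)", "jason", 42L));
            session.execute(batch);
        }
    }
}
```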

Page 18

Read Time

Page 19

Cassandra Reads - setup

•  determine the endpoints to invoke
   o  consistency level vs. read repair
•  the first data node sends back the full data set; the other nodes return only a digest
•  wait for the CL number of nodes to respond
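
A toy sketch of the read plan (plain Java, not Cassandra's internal classes): the closest replica is asked for the full row, and the remaining replicas the coordinator must block for are asked only for digests.

```java
import java.util.ArrayList;
import java.util.List;

/** Conceptual sketch only -- not Cassandra's internal classes. */
public class ReadFanoutSketch {

    static List<String> planRead(List<String> replicasByProximity, int blockFor) {
        List<String> plan = new ArrayList<>();
        for (int i = 0; i < blockFor && i < replicasByProximity.size(); i++) {
            String type = (i == 0) ? "DATA" : "DIGEST";   // first (closest) node returns full data
            plan.add(type + " -> " + replicasByProximity.get(i));
        }
        return plan;
    }

    public static void main(String[] args) {
        // LOCAL_QUORUM with RF=3 blocks for 2 replicas: one data read, one digest read.
        System.out.println(planRead(List.of("10.0.0.1", "10.0.0.2", "10.0.0.3"), 2));
    }
}
```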

Page 20

LOCAL_QUORUM read

Pink nodes contain requested row key

Page 21

Consistent reads

•  compare the digests of the returned data sets
•  if there are any mismatches, send the request again to the same CL data nodes
   o  this time no digests, full data sets
•  compare the full data sets, and send updates to out-of-date replicas
•  block until those fixes are acknowledged
•  return the data to the caller
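
A toy sketch of the resolution step after a digest mismatch (plain Java records, Java 16+ assumed; not Cassandra's internals): keep the value with the newest timestamp and push it to any replica that returned stale data before answering the client.

```java
import java.util.Map;
import java.util.Objects;

/** Conceptual sketch only -- not Cassandra's internal classes. */
public class ConsistentReadSketch {

    record Value(String data, long timestampMicros) {}

    static Value resolve(Map<String, Value> fullDataByReplica) {
        // The winning value is simply the one with the newest write timestamp.
        Value newest = null;
        for (Value v : fullDataByReplica.values()) {
            if (newest == null || v.timestampMicros() > newest.timestampMicros()) {
                newest = v;
            }
        }
        // "Repair" any replica whose copy differs from the winning value.
        for (Map.Entry<String, Value> e : fullDataByReplica.entrySet()) {
            if (!Objects.equals(e.getValue(), newest)) {
                System.out.println("sending repair mutation to " + e.getKey());
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        Map<String, Value> replies = Map.of(
                "10.0.0.1", new Value("alice", 100L),
                "10.0.0.2", new Value("alicia", 250L));   // newer write wins
        System.out.println("returning to client: " + resolve(replies));
    }
}
```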

Page 22

Read Repair

•  synchronizes the client-requested data amongst all replicas

•  piggy-backs on normal reads, but waits for all replicas to respond asynchronously

•  then, just like consistent reads, compares the digests and fixes replicas if needed

Page 23

Read Repair

green lines = LOCAL_QUORUM nodes
blue lines = nodes for read repair

Page 24

Read Repair - configuration

•  setting per column family
•  percentage of all calls to the CF
•  local DC vs. global chance
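
These knobs are table (column family) options. A minimal sketch of setting them with the DataStax Java driver (3.x API assumed; demo_ks.users is a hypothetical table): dclocal_read_repair_chance covers the local-DC case, read_repair_chance the global one.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ReadRepairChanceConfig {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // 10% of reads trigger read repair within the local DC,
            // 1% trigger it across all DCs.
            session.execute("ALTER TABLE demo_ks.users " +          // hypothetical table
                    "WITH dclocal_read_repair_chance = 0.10 " +
                    "AND read_repair_chance = 0.01");
        }
    }
}
```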

Page 25

Read repair fixes data that is actually requested,

... but what about data that isn't requested?

Page 26

Node Repair - introduction

•  repairs inconsistencies across all replicas for a given range
•  nodetool repair
   o  repairs the ranges the node contains
   o  one or more column families (within the same keyspace)
   o  can choose the local datacenter only (c* 1.2)

Page 27

Node Repair - cautions

•  should be part of standard operations maintenance for c*, especially if you delete data
   o  ensures tombstones are propagated and avoids resurrected data

•  repair is IO and CPU intensive

Page 28

Node Repair - details 1

•  determine peer nodes with matching ranges
•  trigger a major (validation) compaction on the peer nodes
   o  read and generate a hash for every row in the CF
   o  add the result to a Merkle tree
   o  return the tree to the initiator

Page 29

Node Repair - details 2

•  initiator awaits trees from all nodes
•  compares each tree to every other tree
•  if any differences exist, the two nodes exchange the conflicting ranges
   o  these ranges get written out as new, local sstables
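
A toy Merkle-style sketch of both steps (plain Java, not Cassandra's actual MerkleTree): each node hashes its rows into token-range buckets, and the initiator exchanges only the buckets whose hashes disagree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Conceptual sketch only -- a toy Merkle-style comparison, not Cassandra's MerkleTree. */
public class RepairSketch {

    static final int BUCKETS = 8;   // real trees have far more leaves

    /** Validation step: hash every row into the bucket that owns its token. */
    static long[] buildLeafHashes(Map<Integer, String> rowsByToken) {
        long[] leaves = new long[BUCKETS];
        for (Map.Entry<Integer, String> row : rowsByToken.entrySet()) {
            int bucket = Math.floorMod(row.getKey(), BUCKETS);
            leaves[bucket] ^= (row.getKey() * 31L) ^ row.getValue().hashCode();  // toy row hash
        }
        return leaves;
    }

    /** Comparison step: any bucket whose hashes differ marks a range the two nodes must exchange. */
    static List<Integer> mismatchedRanges(long[] mine, long[] theirs) {
        List<Integer> ranges = new ArrayList<>();
        for (int i = 0; i < BUCKETS; i++) {
            if (mine[i] != theirs[i]) {
                ranges.add(i);
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        long[] a = buildLeafHashes(Map.of(1, "alice", 2, "bob", 10, "carol"));
        long[] b = buildLeafHashes(Map.of(1, "alice", 2, "bob", 10, "carol-updated"));
        // Only the bucket owning the differing row is streamed between the two nodes.
        System.out.println("ranges to exchange: " + mismatchedRanges(a, b));
    }
}
```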

Page 30

'ABC' node is repair initiator

Page 31

Nodes sharing range A

Page 32

Nodes sharing range B

Page 33

Nodes sharing range C

Page 34

Five nodes participating in repair

Page 35

Anti-Entropy wrap-up

•  CAP Theorem lives, tradeoffs must be made

•  C* contains processes to make diverging data sets consistent

•  Tunable controls exist at write and read time, as well as on demand

Page 36

Thank you!

Q & A time

@jasobrown

Page 37

Notes from Netflix

•  carefully tune RR_chance
•  schedule repair operations
•  tickler
•  store more hints vs. running repair