Scaling out and the CAP Theorem


Description

Friday, 4th June 1976, the Sex Pistols kicked off their first gig, a gig considered to have changed Western music culture forever by pioneering punk rock. Wednesday, 19th July 2000 had a similar impact on internet-scale companies as the Sex Pistols had on music, with Eric Brewer's keynote speech at the ACM symposium on the [Principles of Distributed Computing](http://www.podc.org/podc2000/) (PODC). Eric Brewer claimed that as applications become more web-based, we should stop worrying about data consistency, because if we want high availability in those new distributed applications, then we cannot have data consistency. Two years later, in 2002, Seth Gilbert and Nancy Lynch [formally proved](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf) Brewer's claim as what is known today as Brewer's Theorem, or CAP. The CAP theorem states that a distributed system cannot simultaneously satisfy all three of Consistency, Availability and Partition tolerance. In the database ecosystem, many tools claim to solve our data persistence problems while scaling out, offering different capabilities (document stores, key/value stores, SQL, graph, etc.). In this talk we will explore the CAP theorem: we will define Consistency, Availability and Partition Tolerance; we will explore what CAP means for our applications (ACID vs BASE); and we will explore practical applications on MySQL with a read slave, MongoDB and Riak, based on the work by [Aphyr - Kyle Kingsbury](http://aphyr.com/posts).

Transcript of Scaling out and the CAP Theorem

Page 1: Scaling out and the CAP Theorem

CAP Theorem

Reversim Summit 2014

Page 2: Scaling out and the CAP Theorem

The CAP theorem, or Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

• Consistency – All nodes see the same data at the same time

• Availability – A guarantee that every request receives a response about whether it was successful or failed

• Partition Tolerance – The system continues to operate despite arbitrary message loss or failure of part of the system

Page 3: Scaling out and the CAP Theorem

It means that for internet-scale companies we should stop worrying about data consistency.

If we want high availability in such distributed systems, then guaranteed consistency of data is something we cannot have.

Page 4: Scaling out and the CAP Theorem

An Example

Consider an online bookstore

• You want to buy the book "The tales of the CAP theorem" – The store has only one copy in stock – You add it to your cart and continue browsing, looking for another book ("ACID vs BASE, a love story?")

• As you browse the shop, someone else buys "The tales of the CAP theorem" – adds the book to their cart and completes the checkout process

Page 5: Scaling out and the CAP Theorem

Consistency  

Page 6: Scaling out and the CAP Theorem

Consistency  

A service that is consistent operates fully or not at all. In our bookstore example:

• There is only one copy in stock and only one person will get it

• If both customers can continue through the order process (payment), the lack of consistency will become a business issue

• Scale this inconsistency and you have a major business issue

• You can solve this issue using a database to manage inventory – the first checkout operates fully, the second not at all (see the sketch below)
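For illustration, a minimal sketch of that idea against a relational store, assuming the psycopg2 driver and a hypothetical `books` table with a `stock` column: the decrement is conditional, so only one of the two concurrent checkouts can succeed.

```python
import psycopg2

def checkout(dsn, title):
    # Conditional decrement: the row is only updated while stock remains.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE books SET stock = stock - 1 WHERE title = %s AND stock > 0",
            (title,),
        )
        return cur.rowcount == 1  # True for the first checkout, False for the second
```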

Page 7: Scaling out and the CAP Theorem

Consistency  

Note: CAP Consistency is the Atomicity in ACID

• CAP consistency is a constraint that multiple values of the same data are not allowed

• ACID Atomicity requires that each transaction is "all or nothing" – which implies that multiple values of the same data are not allowed

• ACID Consistency means that any transaction brings the database from one consistent state to another – global consistency of the whole database

Page 8: Scaling out and the CAP Theorem

Availability  

Page 9: Scaling out and the CAP Theorem

Availability  

Availability means just that – the service is available

• When you purchase a book you want to get a response – not some Schrödinger message about the site being uncommunicative

• Availability most often deserts you when you need it the most – services tend to go down at busy periods

• A service that's available but cannot be reached is of no benefit to anyone

Page 10: Scaling out and the CAP Theorem

Partition Tolerance

Page 11: Scaling out and the CAP Theorem

Partition Tolerance

A partition happens when a node in your system cannot communicate with another node

• Say, because a network cable gets chopped

• Partitions are equivalent to a server crash – if nothing can connect to it, it may as well not be there

• If your application and database run on one box, then your server acts as a kind of atomic processor – it either works or it doesn't – how far can you scale on one host?

• Once you scale to multiple hosts, you need partition tolerance

Page 12: Scaling out and the CAP Theorem

Partitions

But wait, are partitions real? Our infrastructure is reliable, right?

Formally, in any asynchronous network a message can be dropped, delayed, duplicated or reordered (see the diagram below)

• IP networks do all four

• TCP removes duplicates and reordering – unless you retry!

• Delays are indistinguishable from drops (after a timeout) – there is no perfect failure detector in an async network

 

[Diagram: a message from node A to node B may be dropped, delayed, duplicated or reordered over time]

Page 13: Scaling out and the CAP Theorem

Partitions are real! Some causes:

• GC pause – is actually a delay
• Network maintenance
• Segfaults & crashes
• Faulty NICs
• Bridge loops
• VLAN problems
• Hosted networks
• The cloud
• WAN links & backhoes

Published examples:

• Netflix
• Twilio
• Fog Creek
• AWS
• GitHub
• Wix
• Microsoft datacenter study
  – Average failure rate of 5.2 devices per day and 40.8 links per day
  – Median packet loss of 59,000 packets
  – Network redundancy improves median traffic by 43%

More examples at http://aphyr.com/posts/288-the-network-is-reliable

Page 14: Scaling out and the CAP Theorem

The  CAP  Theorem  proof  

Page 15: Scaling out and the CAP Theorem

Proof in Pictures

• Consider a system with two nodes, N1 and N2

• They both share the same data V

• On N1 runs program A; on N2 runs program B – we consider both A and B to be ideal: safe, bug-free, predictable and reliable

• In this example, A writes a new value of V and B reads the value of V

Page 16: Scaling out and the CAP Theorem

Proof in Pictures

Sunny-day scenario

1. A writes a new value of V, denoted as V1

2. A message M is passed from N1 to N2, which updates the copy of V there

3. Any read by B of V will return V1

Page 17: Scaling out and the CAP Theorem

Proof in Pictures

In the case of a network partition

• Messages from N1 to N2 are not delivered – even if we use guaranteed delivery of M, N1 has no way of knowing whether a message is delayed by the partitioning event or by a failure on N2 – so N2 contains an inconsistent (stale) value of V when step 3 occurs

• We have lost consistency!

Page 18: Scaling out and the CAP Theorem

Proof in Pictures

In the case of a network partition

• We can make M synchronous – which means the write by A on N1 and the update from N1 to N2 form one atomic operation – a write will fail in case of a partition

• We have lost availability!
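A toy rendering of the two cases above (plain Python, purely illustrative): with asynchronous replication a partition leaves N2 serving a stale value (consistency lost); making write-plus-replication atomic makes the same write fail (availability lost).

```python
class Node:
    def __init__(self):
        self.v = "V0"

def write_async(n1, n2, value, partitioned):
    # Write locally, replicate only if the message M gets through.
    n1.v = value
    if not partitioned:
        n2.v = value
    return True  # the write always "succeeds"

def write_sync(n1, n2, value, partitioned):
    # Write and replication form one atomic operation.
    if partitioned:
        return False  # refuse the write rather than diverge
    n1.v = n2.v = value
    return True

n1, n2 = Node(), Node()
write_async(n1, n2, "V1", partitioned=True)
print(n2.v)  # 'V0' -> B reads a stale value: consistency lost

n1, n2 = Node(), Node()
print(write_sync(n1, n2, "V1", partitioned=True))  # False -> availability lost
```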

Page 19: Scaling out and the CAP Theorem

What  does  it  all  mean?  

Page 20: Scaling out and the CAP Theorem

In practical terms

For a distributed system to not require partition tolerance, it would have to run on a network which is guaranteed to never drop messages (or even deliver them late) and whose nodes are guaranteed to never die. Such systems do not exist.

Make your choice:

• Choose consistency over availability
• Choose availability over consistency
• Choose neither

Page 21: Scaling out and the CAP Theorem

CAP Locality

• CAP holds per operation, independently – a system can be both CP and AP, for different operations – different operations can be modeled with different CAP properties

• An operation can be – CP: consistent and partition tolerant – AP: available and partition tolerant – P with mixed A & C: trading off between A and C

• Eventual consistency, for example

Example: "Add item to cart" favors Availability, while "Checkout" favors Consistency (see the sketch below).
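For illustration only, a sketch of how this per-operation choice might look with MongoDB write concerns from Python, assuming the PyMongo driver; the collection names and connection string are hypothetical.

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

db = MongoClient("mongodb://n1,n2,n3/?replicaSet=rs0").bookstore

# AP-leaning: acknowledge the cart update as soon as the primary has it.
carts = db.get_collection("carts", write_concern=WriteConcern(w=1))
# CP-leaning: the checkout waits for a majority of the replica set.
orders = db.get_collection("orders", write_concern=WriteConcern(w="majority"))

carts.update_one({"user": "u1"}, {"$push": {"items": "cap-book"}}, upsert=True)
orders.insert_one({"user": "u1", "items": ["cap-book"]})
```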

Page 22: Scaling out and the CAP Theorem

Let's look at some examples

Page 23: Scaling out and the CAP Theorem

Using the findings of Kyle Kingsbury, aphyr.com

Page 24: Scaling out and the CAP Theorem
Page 25: Scaling out and the CAP Theorem

Postgres  

• A classic open source database
• We think of it as a CP system – it accepts writes only on a single primary node – ensuring the write reaches the slaves as well

• If a partition occurs – we cannot talk to the server and the system is unavailable – because transactions are ACID, we're always consistent

However:
• The distributed system composed of the server and the client together may not be consistent – they may not agree on whether a transaction took place

Page 26: Scaling out and the CAP Theorem

Postgres  

• Postgres' commit protocol is a two-phase commit – 2PC:
  1. The client votes to commit and sends a message to the server
  2. The server checks for consistency and votes to commit (or reject) the transaction
  3. It writes the transaction to storage
  4. The server informs the client that a commit took place

• What happens if the acknowledgment message is dropped? – The client doesn't know whether the commit succeeded or not! – The 2PC protocol requires the client to wait for an ack – The client will eventually get a timeout (or deadlock)
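The ambiguity is easiest to see from the client's side. A minimal sketch in Python, assuming the psycopg2 driver and a hypothetical `accounts` table: if the connection fails while waiting for the commit acknowledgment, the exception tells the client nothing about whether the server actually committed.

```python
import psycopg2

def debit(dsn, amount):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE accounts SET balance = balance - %s WHERE id = 1",
                (amount,),
            )
        conn.commit()  # the server may commit, yet the ack may never arrive
        return "committed"
    except psycopg2.OperationalError:
        # Network error while waiting for the ack: the transaction may or may
        # not have been applied on the server. We cannot tell from here.
        return "unknown"
    finally:
        conn.close()
```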

 

Page 27: Scaling out and the CAP Theorem

Postgres  

The experiment
• Install and run Postgres on one host
• Run 5 clients that write to Postgres within a transaction
• During the experiment, drop the network for one of the nodes

The findings
• Out of 1000 write operations
• 950 were successfully acknowledged, and all of them are in the database
• 2 writes succeeded, but the client got an exception claiming an error occurred! – Note that the client has no way to know whether the write succeeded or failed

Page 28: Scaling out and the CAP Theorem

Postgres  

2PC strategies

• Accept false negatives – just ignore the exception on the client; those errors happen only for in-flight writes at the time the partition began

• Use idempotent operations – on a network error, just retry

• Use transaction IDs – when a partition is resolved, the client checks whether a transaction was committed using its transaction ID (see the sketch below)

Note: these strategies apply to most SQL engines
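One way to picture the transaction-ID strategy: store a client-generated ID inside the same transaction, so the client can later ask the database whether that transaction landed. A sketch assuming psycopg2 and a hypothetical `writes` table with a unique `txn_id` column.

```python
import uuid
import psycopg2

def write_with_txn_id(dsn, payload):
    txn_id = str(uuid.uuid4())  # client-generated, stored with the data
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO writes (txn_id, payload) VALUES (%s, %s)",
                (txn_id, payload),
            )
        conn.commit()
        return "committed", txn_id
    except psycopg2.OperationalError:
        # Outcome unknown; keep the ID and check once the partition heals.
        return "unknown", txn_id
    finally:
        conn.close()

def was_committed(dsn, txn_id):
    # After the partition is resolved, ask whether the transaction took place.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT 1 FROM writes WHERE txn_id = %s", (txn_id,))
        return cur.fetchone() is not None
```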

Page 29: Scaling out and the CAP Theorem
Page 30: Scaling out and the CAP Theorem

MongoDB  

• MongoDB is a document-oriented database

• Replicated using a replica set – a single writable primary node – asynchronously replicates writes as an oplog to N secondaries

• MongoDB supports different levels of guarantees – asynchronous replication – confirm a successful write to its disk log – confirm successful replication of a write to secondary nodes

• Is MongoDB consistent? – MongoDB is promoted as a CP system – however, it may "revert operations" on a network partition in some cases

Page 31: Scaling out and the CAP Theorem

MongoDB  

What happens when the primary becomes unavailable?

• The remaining secondaries will detect the failed connection – they will try to reach consensus on a new leader – if they have a majority, they'll elect the node with the highest optime

• The minority nodes will detect they no longer have a quorum – and will demote the primary to a secondary

• If our primary is on n1 and we cut n1 & n2 off from the rest, we expect n3, n4 or n5 to become the new primary

Page 32: Scaling out and the CAP Theorem

MongoDB  

The experiment
• Install and run MongoDB on 5 hosts
• With 5 clients writing some data to the cluster
• During the experiment, partition the network so the primary ends up on the minority side
• Then restore the network
• Check what happened – which writes survived

 

Page 33: Scaling out and the CAP Theorem

MongoDB  

Write concern: unacknowledged
• The default at the time Kyle ran the experiment

The findings
• 6000 total writes
• 5700 acknowledged
• 3319 survivors
• 2381 acknowledged writes lost (42% write loss)

Not surprising – we have data loss.

Page 34: Scaling out and the CAP Theorem

MongoDB  

42% data loss? What happened?

• When the partition started – the original primary (N1) continued to accept writes – but those writes never made it to the new primary (N5)

• When the partition ended – the original primary (N1) and the new primary (N5) compare notes – they figure out that N5's optime is higher – N1 finds the last point in the oplog the two agreed on and rolls back to that point

• During a rollback, all writes the old primary accepted after the common point in the oplog are removed from the database

Page 35: Scaling out and the CAP Theorem

MongoDB  

Write concern: safe or acknowledged
• The current default
• Allows clients to catch network, duplicate key and other errors

The findings
• 6000 total writes
• 5900 acknowledged
• 3692 survivors
• 2208 acknowledged writes lost (37% write loss)

Write concern acknowledged only verifies that the write was accepted on the master. We need to ensure replicas also see the write.

Page 36: Scaling out and the CAP Theorem

MongoDB  

Write concern: replicas_safe or replica_acknowledged
• Waits for at least 2 servers to confirm the write operation

The findings
• 6000 total writes
• 5695 acknowledged
• 3768 survivors
• 1927 acknowledged writes lost (33% write loss)

Mongo only verifies that the write took place on two nodes. A new primary can be elected without having seen those writes; in that case, Mongo will roll back those writes.

Page 37: Scaling out and the CAP Theorem

MongoDB  

Write concern: majority
• Waits for a majority of servers to confirm the write operation

The findings
• 6000 total writes
• 5700 acknowledged
• 5701 survivors
• 2 acknowledged writes lost
• 3 unacknowledged writes found

The 2 lost writes are due to a bug in Mongo that caused it to treat network failures as successful writes; this bug was fixed in 2.4.3 (or 2.4.4). The 3 unacknowledged writes that were found are not a problem – similar arguments as for Postgres.


Page 38: Scaling out and the CAP Theorem

MongoDB  

Takeaways for MongoDB. You can either:
• Accept data loss – at most WriteConcern levels, Mongo can reach a point where it rolls back data
• Use WriteConcern.Majority – with a performance impact (see the sketch below)
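For illustration, a minimal sketch of requesting a majority write concern from Python, assuming the PyMongo driver; the database and collection names are hypothetical.

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://n1,n2,n3,n4,n5/?replicaSet=rs0")

# Writes through this handle are acknowledged only after a majority of the
# replica set has them, so a newly elected primary cannot roll them back.
orders = client.bookstore.get_collection(
    "orders", write_concern=WriteConcern(w="majority", wtimeout=5000)
)
orders.insert_one({"book": "The tales of the CAP theorem", "qty": 1})
```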

Page 39: Scaling out and the CAP Theorem

Other distributed systems Kyle tested – all have different caveats; worth a read at aphyr.com

• ZooKeeper

• Kafka

Page 40: Scaling out and the CAP Theorem

Strategies for distributed data & systems

Page 41: Scaling out and the CAP Theorem

Immutable Data

• Immutable data means – no updates – no deletes – no need for data merges – easier to replicate

• Immutable data solves the problems that cause distributed systems to delete data (MongoDB, Riak, Cassandra, etc.) – however, even if your data is immutable, existing tools assume it is mutable and may still delete your data

• Can you model all your data to be immutable?
• How do you model inventory using immutable data? (see the sketch below)
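One answer to the inventory question: keep an append-only log of stock events and derive the current count by folding over it. A minimal sketch in plain Python; the event shape is illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # events are never updated or deleted
class StockEvent:
    book: str
    delta: int      # +1 restock, -1 sale
    event_id: str   # unique id makes replays and de-duplication possible

def stock_level(events, book):
    # Current inventory is a pure function of the (deduplicated) event log.
    seen, total = set(), 0
    for e in events:
        if e.book == book and e.event_id not in seen:
            seen.add(e.event_id)
            total += e.delta
    return total

log = [
    StockEvent("The tales of the CAP theorem", +1, "restock-1"),
    StockEvent("The tales of the CAP theorem", -1, "order-17"),
]
print(stock_level(log, "The tales of the CAP theorem"))  # 0
```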

 

Page 42: Scaling out and the CAP Theorem

Idempotent Operations

An operation is idempotent if applying it twice gives the same result as applying it once

• It enables recovering from availability problems – a way to introduce fault tolerance – the Postgres client–server ack issue, for example

• In case of any failure, just retry – undetermined response – failure to write

• However, it does not remove the CAP constraints

• Can you model all your operations to be idempotent? (see the sketch below)
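A toy illustration of the difference in plain Python (names are illustrative): a blind retry is safe for the idempotent "set", but double-counts for the non-idempotent "increment".

```python
def set_balance(store, account, value):
    # Idempotent: applying it twice leaves the same state as applying it once.
    store[account] = value

def add_to_balance(store, account, amount):
    # Not idempotent: a retry after an unacknowledged success double-counts.
    store[account] = store.get(account, 0) + amount

store = {}
set_balance(store, "alice", 100)
set_balance(store, "alice", 100)   # retry is harmless -> 100
add_to_balance(store, "bob", 100)
add_to_balance(store, "bob", 100)  # retry is not -> 200
print(store)                       # {'alice': 100, 'bob': 200}
```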

Page 43: Scaling out and the CAP Theorem

BASE  

• Defined by Eric Brewer

• Basically Available – the system guarantees availability, in terms of the CAP theorem

• Soft state – the system state is statistically consistent; it may change over time even without external input

• Eventual consistency – the system will converge to a consistent state over time

• Considered the contrast to ACID (Atomicity, Consistency, Isolation, Durability) – not really :) – both are actually contrived

Page 44: Scaling out and the CAP Theorem

Eventual  Consistency  

For AP systems, we can make the system regain consistency

• We all know such a system – Git – available on each node, fully partition tolerant – regains consistency using git push & pull – with a human merge of the data

• Can we take those ideas to other distributed systems?

• How can we track history? – and identify conflicts?

• Can we make the merge automatic?

Page 45: Scaling out and the CAP Theorem

Vector Clocks

• A way to track the ordering of events in a distributed system
• Enables detecting conflicting writes – and the shared point in history where the divergence started

• Each write includes a logical clock – a clock per node – each time a node writes data, it increments its clock

• Nodes sync with each other using a gossip protocol

• Multiple implementations – node based – operation based (see the sketch below)
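A minimal node-based vector clock sketch in Python (illustrative, not from the talk): each node keeps a counter per node, bumps its own entry on a write, merges on sync, and two clocks conflict when neither dominates the other.

```python
def increment(clock, node):
    # A node bumps its own entry whenever it writes.
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    # On sync, take the element-wise maximum.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def descends(a, b):
    # True if clock a has seen everything clock b has.
    return all(a.get(n, 0) >= c for n, c in b.items())

def conflict(a, b):
    # Concurrent writes: neither clock descends from the other.
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "n1")   # write on n1 -> {'n1': 1}
v2 = increment({}, "n2")   # concurrent write on n2 -> {'n2': 1}
print(conflict(v1, v2))    # True: the values must be merged
print(merge(v1, v2))       # {'n1': 1, 'n2': 1}
```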

Page 46: Scaling out and the CAP Theorem

Eventual  Consistency  

• A system that expects data to diverge – for small intervals of time – for as long as a partition exists

• Built to regain consistency – using some sync protocol (gossip) – using vector clocks or timestamps to compare values

• Needs to handle merging of values – minimize merges using vector clocks (merge only if values actually diverged) – using a timestamp to select the newer value – using business-specific merge functions – using CRDTs (see the sketch below)
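Two of the merge strategies above, sketched in plain Python (illustrative): last-write-wins keeps only the newer timestamped value and silently drops the other, while a business-specific merge for a shopping cart keeps the union of both divergent carts.

```python
def lww_merge(a, b):
    # a and b are (timestamp, value); the older value is silently discarded.
    return a if a[0] >= b[0] else b

def cart_merge(cart_a, cart_b):
    # Business-specific merge: keep everything either replica saw.
    return set(cart_a) | set(cart_b)

print(lww_merge((10, ["cap-book"]), (12, ["base-book"])))  # (12, ['base-book'])
print(cart_merge({"cap-book"}, {"base-book"}))             # {'cap-book', 'base-book'}
```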

Page 47: Scaling out and the CAP Theorem

CRDTs  

Commutative Replicated Data Types (also known as Conflict-free Replicated Data Types)

• Not a lot of data types available to select from – G-Counter, PN-Counter, G-Set, 2P-Set, OR-Set, U-Set, graphs

• OR-Set – for social graphs – can be used for a shopping cart (with some modifications); a simpler G-Counter is sketched below
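As a taste of how these types work, a minimal G-Counter sketch in Python (illustrative): each replica increments only its own slot, merge is an element-wise max, and the value is the sum, so merges commute and never conflict.

```python
class GCounter:
    def __init__(self):
        self.counts = {}  # replica id -> local count

    def increment(self, replica, n=1):
        self.counts[replica] = self.counts.get(replica, 0) + n

    def merge(self, other):
        # Element-wise max: applying merges in any order gives the same result.
        for r, c in other.counts.items():
            self.counts[r] = max(self.counts.get(r, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter(), GCounter()
a.increment("n1", 3)  # 3 increments seen on n1
b.increment("n2", 2)  # 2 increments seen on n2
a.merge(b)
print(a.value())      # 5
```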

Page 48: Scaling out and the CAP Theorem

Questions? Anyone?