Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen Joel Knighton @joelknighton DataStax #CassandraSummit


Page 1: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Joel Knighton

@joelknighton

DataStax

#CassandraSummit

Page 2: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Who I am

Mathematician

Software hobbyist

Logic enthusiast

Former DataStax Intern

DataStax Cassandra Developer

Page 3: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

What I Do

Deconstruct

Formalize

Communicate

Prove

Automate

Page 4: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

How We Test #1

Unit Tests

ant test

in-tree

Page 5: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

How We Test #2

Distributed Tests

nosetests

On GitHub – available at riptano/cassandra-dtest

Page 6: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Why You’re Here

Jepsen

Kyle Kingsbury (aphyr)

https://aphyr.com/tags/jepsen

Page 7: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

What Jepsen Is

A blog series about distributed systems behavior

A talk series about distributed systems behavior

A Clojure library to test the behavior of distributed systems

A collection of tests written using that library

Page 8: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

What We Hope

Jepsen 💘 Cassandra

Page 9: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

What I Did

Jepsen Tests

lein test

On GitHub – available at riptano/jepsen

Page 10: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

A Test Incarnate

{:name …                   ; names the results
 :os …                     ; prepares the os
 :db …                     ; configures/starts/stops the db
 :client …                 ; interacts with the db
 :generator …              ; instructions on how to interact
 :conductors {:nemesis …}  ; interacts with the environment
 :checker …}               ; looks at and assesses test run
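To make the division of labor concrete, here is a minimal Python sketch of how the parts of such a test map could be driven in order. It is not Jepsen's Clojure implementation: `run_test`, the dictionary keys, and the single-threaded loop are all assumptions made for illustration, and conductors such as the nemesis are left out.

```python
from typing import Any, Dict, List

Op = Dict[str, Any]


def run_test(test: Dict[str, Any]) -> Dict[str, Any]:
    """Drive a toy test: OS setup, DB setup, client operations, then the checker.

    Hypothetical sketch only; a real run uses concurrent clients and a nemesis.
    """
    history: List[Op] = []
    test["os"]()                 # prepare the OS
    test["db"]["setup"]()        # configure and start the database
    try:
        for op in test["generator"]():              # instructions on how to interact
            history.append({**op, "type": "invoke"})
            history.append(test["client"](op))      # interact with the database
    finally:
        test["db"]["teardown"]()  # stop the database even if a client raised
    return {
        "name": test["name"],                # names the results
        "valid": test["checker"](history),   # assess the recorded run
        "history": history,
    }
```

The key design point the sketch tries to capture is that the checker never talks to the database; it only sees the recorded history.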

Page 11: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

What You Need

One machine to run the tests

+

n machines to run Cassandra

Page 12: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

How A Test Runs

lein test

os

n1

n2

n3

n4

n5

Page 13: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

How A Test Runs

lein test

db

n1

n2

n3

n4

n5

Page 14: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

How A Test Runs

lein test

client 1

client 2

client 3

client 4

client 5

nemesis

n1

n2

n3

n4

n5

read

write 3

start nemesis

write 4

read

stop nemesis

write 1

cas 2 -> 3

Page 15: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

How A Test Runs

lein test

checker

1 – read

2 – write 3

1 – read 0

n – start nemesis

2 – write timed-out

3 – write 4

n – started nemesis

3 – wrote 4

4 – read

4 – read 4

n – stop nemesis

0 – write 1

1 – cas 2 -> 3

n – stopped nemesis

…

valid?

Latency
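A recorded history like the one above can be turned into latency figures by pairing each invocation with the next completion from the same process. The Python sketch below shows one way to do that. The field names (`process`, `type`, `f`, `time`) and the `latencies` helper are assumptions for illustration, and the timed-out write is modeled as an indeterminate `info` completion.

```python
from typing import Any, Dict, List

Op = Dict[str, Any]


def latencies(history: List[Op]) -> List[Op]:
    """Pair each invocation with its completion and report elapsed time."""
    pending: Dict[Any, Op] = {}   # process -> outstanding invocation
    out: List[Op] = []
    for op in history:
        process = op["process"]
        if op["type"] == "invoke":
            pending[process] = op
        elif process in pending:
            start = pending.pop(process)
            out.append({
                "process": process,
                "f": start["f"],
                "outcome": op["type"],             # ok, fail, or info (unknown)
                "latency": op["time"] - start["time"],
            })
    return out


history = [
    {"process": 1, "type": "invoke", "f": "read", "time": 0},
    {"process": 2, "type": "invoke", "f": "write", "time": 1},
    {"process": 1, "type": "ok", "f": "read", "time": 5},
    {"process": 2, "type": "info", "f": "write", "time": 11},  # timed out
]
for row in latencies(history):
    print(row)
```

Because each process has at most one operation in flight, a dictionary keyed by process is enough to match completions to invocations.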

Page 16: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Single Test Deep-Dive

lein test :only

cassandra.collections.set-test/

cql-set-isolate-node-decommission

Page 17: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Single Test Name

Test name used to label folder where test results, logs, and history will be stored with timestamp

cassandra cql set isolate node decommission

Page 18: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Single Test Nodes

[:n1 :n2 :n3 :n4 :n5]

Page 19: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Single Test Net

net/iptables

(drop! ;use iptables to drop packets)

(heal! ;flush iptables)
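As an illustration of what dropping and healing can amount to on a node, the Python sketch below builds the iptables command lines for both steps. The helper names and the exact rules are assumptions; the rules the Jepsen net implementation actually issues may differ.

```python
from typing import Iterable, List


def drop_commands(sources: Iterable[str]) -> List[List[str]]:
    """Commands that make this node drop inbound packets from each source address."""
    return [
        ["iptables", "-A", "INPUT", "-s", src, "-j", "DROP"]
        for src in sorted(sources)
    ]


def heal_commands() -> List[List[str]]:
    """Commands that flush all rules, restoring normal traffic."""
    return [["iptables", "-F"]]


for cmd in drop_commands(["10.0.0.2", "10.0.0.3"]) + heal_commands():
    print(" ".join(cmd))
```

Returning argument lists rather than running anything keeps the sketch safe to execute; a real harness would send these commands to the node over SSH with root privileges.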

Page 20: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Single Test OS

debian/os

(setup! ;adjust hostfile
        ;update package manager
        ;install base packages like curl, iptables, etc.
        ;make sure network is healed)

(teardown!)

Page 21: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Single Test DB

cassandra.core/db

(setup! ;shutdown and wipe Cassandra if running
        ;install, configure, and start Cassandra)

(teardown! ;shutdown and wipe Cassandra)

(log-files ;return path to log files)

Page 22: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Single Test Client

cql-set-client

(setup! ;driver connect to all nodes
        ;create schema)

(invoke! ;add? Run CQL to add to set, handle errors
         ;read? Read value of CQL set, handle errors)

(teardown! ;disconnect driver)

Page 23: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Single Test Generator

(gen/phases

(->> (adds)

(gen/stagger 1/10)

(gen/delay 1/2)

std-gen)

(read-once))
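The shape of this generator, a staggered stream of adds followed by a single final read, can be mimicked with plain Python generators. This is a sketch under assumptions rather than Jepsen's generator semantics: it assumes staggering means a random pause averaging the given interval, it omits `gen/delay` and `std-gen` entirely, and every function name here is hypothetical.

```python
import random
from typing import Any, Dict, Iterable, Iterator

Op = Dict[str, Any]


def adds(count: int) -> Iterator[Op]:
    """Yield add operations for the distinct values 0..count-1."""
    for i in range(count):
        yield {"f": "add", "value": i}


def stagger(mean_gap: float, ops: Iterable[Op], rng: random.Random) -> Iterator[Op]:
    """Attach a schedule time to each op; gaps are uniform on [0, 2 * mean_gap]."""
    now = 0.0
    for op in ops:
        now += rng.uniform(0, 2 * mean_gap)
        yield {**op, "at": now}


def read_once() -> Iterator[Op]:
    """Yield the single final read."""
    yield {"f": "read", "value": None}


def phases(*generators: Iterable[Op]) -> Iterator[Op]:
    """Exhaust each generator fully before starting the next one."""
    for generator in generators:
        yield from generator


schedule = list(phases(stagger(0.1, adds(5), random.Random(1)), read_once()))
for op in schedule:
    print(op)
```

Running the phases strictly in sequence is what guarantees the read happens only after every add has been issued.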

Page 24: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Single Test Conductors

{:nemesis (nemesis/partition-random-node)

:decommissioner (c/decommissioner)}
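Partitioning a random node means picking one victim and cutting traffic between it and everyone else. A minimal Python sketch of that selection follows. The `isolate_random_node` helper and the shape of the returned map (each node mapped to the set of peers it should drop traffic from) are assumptions for illustration, not the nemesis library's actual code.

```python
import random
from typing import Dict, List, Set, Tuple


def isolate_random_node(
    nodes: List[str], rng: random.Random
) -> Tuple[str, Dict[str, Set[str]]]:
    """Pick one node and compute which peers each node should stop hearing from."""
    victim = rng.choice(nodes)
    others = [n for n in nodes if n != victim]
    drops: Dict[str, Set[str]] = {victim: set(others)}  # victim ignores everyone
    for node in others:
        drops[node] = {victim}                          # everyone ignores the victim
    return victim, drops


victim, drops = isolate_random_node(["n1", "n2", "n3", "n4", "n5"], random.Random(42))
print(victim, drops)
```

Dropping traffic in both directions keeps the partition symmetric, so the isolated node and the majority agree on who is unreachable.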

Page 25: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

What a Conductor Is

It’s just a client

Page 26: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Single Test Checker

checker/set

(check ;look at history of run
       ;find ok or uncertain adds
       ;compare these to final read
       ;return map with validity and
       ;ok, lost, unexpected, recovered)
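The comparison the checker performs can be written down as a few set operations. The Python sketch below follows the steps listed above. The op field names and the `check_set` helper are assumptions for illustration, and treating a run with no successful read as invalid is a simplification.

```python
from typing import Any, Dict, List

Op = Dict[str, Any]


def check_set(history: List[Op]) -> Dict[str, Any]:
    """Compare attempted and acknowledged adds against the final read."""
    attempts = {op["value"] for op in history
                if op["f"] == "add" and op["type"] == "invoke"}
    acked = {op["value"] for op in history
             if op["f"] == "add" and op["type"] == "ok"}
    reads = [op for op in history if op["f"] == "read" and op["type"] == "ok"]
    if not reads:
        return {"valid": False, "error": "no successful final read"}
    final = set(reads[-1]["value"])

    ok = final & attempts          # attempted and present
    unexpected = final - attempts  # present but never attempted
    lost = acked - final           # acknowledged but missing
    recovered = ok - acked         # uncertain adds that turned out to be applied
    return {
        "valid": not lost and not unexpected,
        "ok": ok, "lost": lost,
        "unexpected": unexpected, "recovered": recovered,
    }


history = [
    {"f": "add", "type": "invoke", "value": 1}, {"f": "add", "type": "ok", "value": 1},
    {"f": "add", "type": "invoke", "value": 2}, {"f": "add", "type": "info", "value": 2},
    {"f": "add", "type": "invoke", "value": 3}, {"f": "add", "type": "ok", "value": 3},
    {"f": "read", "type": "invoke", "value": None},
    {"f": "read", "type": "ok", "value": [1, 2]},
]
print(check_set(history))
```

In this example the timed-out add of 2 shows up in the final read, so it counts as recovered, while the acknowledged add of 3 is missing and makes the run invalid.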

Page 27: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Invariants We Test

Do CQL collections (maps, sets) merge cleanly when add-only?

Do counters merge to accurately reflect increments/decrements?

Does LWT in a single datacenter give us linearizability?

Do materialized views converge to match the base table?

Do batch writes eventually get applied atomically?
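As an example of stating one of these invariants precisely, consider the counter question restricted to positive increments. The final read must be at least the sum of acknowledged increments and at most the sum of all attempted ones. The Python sketch below checks that bound. The `check_counter` helper and the field names are assumptions, and decrements, which the real tests also exercise, would need wider bounds.

```python
from typing import Any, Dict, List

Op = Dict[str, Any]


def check_counter(history: List[Op]) -> Dict[str, Any]:
    """Check that the last successful read lies between acked and attempted sums."""
    attempted = sum(op["value"] for op in history
                    if op["f"] == "add" and op["type"] == "invoke")
    acked = sum(op["value"] for op in history
                if op["f"] == "add" and op["type"] == "ok")
    reads = [op["value"] for op in history
             if op["f"] == "read" and op["type"] == "ok"]
    if not reads:
        return {"valid": False, "error": "no successful read"}
    final = reads[-1]
    return {
        "valid": acked <= final <= attempted,  # below is undercount, above is overcount
        "lower": acked, "upper": attempted, "final": final,
    }


history = [
    {"f": "add", "type": "invoke", "value": 1}, {"f": "add", "type": "ok", "value": 1},
    {"f": "add", "type": "invoke", "value": 1}, {"f": "add", "type": "info", "value": 1},
    {"f": "add", "type": "invoke", "value": 1}, {"f": "add", "type": "ok", "value": 1},
    {"f": "read", "type": "ok", "value": 2},
]
print(check_counter(history))
```

The gap between the two bounds exists because a timed-out increment may or may not have been applied, so either outcome is acceptable.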

Page 28: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Failures We Consider

How does this work under a variety of network partitions?

What about with node crashes?

Even if nodes are flushing and compacting?

And when nodes are being bootstrapped?

Or decommissioned?

While clocks drift?

Page 29: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

How We Run

Start the Docker container

Install Java driver, Cassaforte, clj-ssh, and Jepsen

Use environment variables to point to build under test

Run lein test with any desired selectors and profiles

Page 30: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Tunable Options

Should we make a best-effort attempt to scale test length?

Should we enable commitlog compression, the coordinator batchlog on materialized views, or hinted handoff?

Is a different compaction strategy or phi value in the failure detector appropriate for this test?

Should we install from a tagged release, a URL pointing to a tarball, or a local tarball?

Should we leave Cassandra running after the test?

Page 31: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

What We’ve Found

Issues with counter undercounting/overcounting (#10143)

Decommission race conditions causing gossip problems (#10231)

Write durability violations when recovering commitlog (#9851)

Problems with merging of collections (#10001)

Batchlog replay failures after decommission/crash (#10068)

Incorrect asserts in counter write-path when timestamps collide

A variety of materialized view issues during development

Page 32: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Work We Shared

Minor Jepsen fixes/features (Jepsen PRs #58, #59, #62)

Docker images to run Jepsen tests (Docker Hub: tjake/jepsen)

Multibox Vagrant configurations to run Jepsen tests (on GitHub)

Upstream library fixes (clj-ssh PR #36)

Cassandra Jepsen tests (on GitHub)

Available on CassCI (on cassci.datastax.com)

Page 33: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Jepsen on CassCI

Page 34: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Lessons I Learned

Tests verifying invariants under failures are valuable and practical

These tests can and should be a part of regular development

Testing complex systems is hard, but there is low-hanging fruit

Jepsen provides one readily available way to accomplish this goal

Considering invariants against a recorded test run is effective

Invariants should be explicit and carefully considered in design

Page 35: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Thanks

Jake Luciani

DataStax

The Cassandra community

Kyle Kingsbury

Page 36: Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

QUESTIONS?

TLA+ • TLC • TLAPS • Clojure • Formal Methods • Jepsen • CRDTs • Cassandra • Gossip • Consistency Models • Alloy • Model Checking • Testing

@joelknighton

#CassandraSummit