Distributed Systems Theory for Mere Mortals


Ensar Basri Kahveci, Distributed Systems Engineer, Hazelcast


Disclaimer Notice

In this presentation, I talk about distributed systems theory based on my own understanding.

First of all, distributed systems theory is hard. It also covers a wide range of topics.

So, my statements might be wrong or incomplete!

Please speak up about any point that confuses you or that you think I got wrong.


Agenda

- Defining distributed systems

- System Models

- Time and Order

- Consensus, FLP Result, Failure Detectors

- Consensus Algorithms: 2PC, 3PC, Paxos and others...


“A DISTRIBUTED SYSTEM IS ONE IN WHICH THE FAILURE OF A COMPUTER YOU DID NOT EVEN KNOW EXISTED CAN RENDER YOUR OWN COMPUTER UNUSABLE”

Leslie Lamport


What is a distributed system?

- Collection of entities (machines, nodes, processes...)

- trying to solve a common problem,

- linked by a network and communicating by passing messages,

- having uncertain and partial knowledge of the system.


About being distributed…

- Independent failures

- Some servers might fail while others work correctly.

- Non-negligible message transmission delays

- The interconnection between servers has lower bandwidth and higher latency than that available within a single server.

- Unreliable communication

- The connections between servers are unreliable compared to the connections within a single server.


System Models


Interaction Models

- Synchronous

- Asynchronous

- Partially-synchronous


Failure Modes

- Fail-stop

- Fail-recover

- Omission failures

- Arbitrary failures (Byzantine)


Time and Order


Time and Order

- We use time to:
  - order events,
  - measure the duration between events.

- In the asynchronous model, nodes have local clocks, which can drift apart without bound.

- Components of a distributed system behave in an unpredictable manner:
  - failures, clock rates of advance, delays of network packets, etc.

- We cannot assume synchronized clocks when designing algorithms for the asynchronous model.
  - Clock synchronization methods help a lot, but they don’t fix the problem completely.


The Idea: Ordering Events

- We don’t have the notion of “now” in distributed systems.

- To what extent do we need it?

- We don’t need absolute clock synchronization.

- If machines don’t interact with each other, why bother synchronizing their clocks?

- For a lot of problems, processes need to agree on the order in which events occur, rather than the time at which they occur.


Ordering Events: Logical Clocks

- We can use logical clocks (also known as Lamport clocks) [1] to order events in a distributed system.

- Logical clocks rely on counters and the communication between nodes.
  - Each node maintains a local counter value.

- The happened-before relationship (“→”):
  - If a and b are events in the same process, and a comes before b, then a → b.
  - If a is the sending of a message and b is its receipt, then a → b.
  - If a → b and b → c, then a → c.
  - If neither a → b nor b → a holds, a and b are concurrent.

- Logical clocks give a partial ordering of the events, which can be extended to a total ordering. A minimal sketch follows below.
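
To make the counter rules above concrete, here is a minimal Lamport clock sketch in Java (my own illustrative code, not taken from the slides; class and method names are assumptions). The counter is incremented on local events and sends, and merged with the sender's timestamp on receives, which is exactly what guarantees that a → b implies C(a) < C(b).

// Minimal Lamport clock sketch (illustrative; not from the talk).
public final class LamportClock {
    private long counter = 0;

    // Local event, or just before sending a message: bump the counter.
    public synchronized long tick() {
        return ++counter;
    }

    // On receiving a message stamped with the sender's counter,
    // jump past both our own value and the sender's value.
    public synchronized long onReceive(long senderTimestamp) {
        counter = Math.max(counter, senderTimestamp) + 1;
        return counter;
    }

    public synchronized long current() {
        return counter;
    }
}

Ties between concurrent events can then be broken deterministically (e.g., by process ID) to turn the partial order into a total order.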

Clock Condition

- For any events a, b: if a → b, then C(a) < C(b).

- Can we also infer the reverse?

- If p1 → q2 and q2 → q3, then C(q3) > C(p1).
  - Causality: p1 causes q2 and q2 causes q3, so p1 causes q3.

- p3 and q3 are concurrent under the happened-before relationship.
  - Can we infer any causality by comparing C(p3) and C(q3)? No: with plain Lamport clocks, comparing the clock values of concurrent events tells us nothing about causality.

Image taken from [1]

Vector Clocks and Causality

- We use vector clocks to infer causality by comparing clock values.

- If V(a) < V(b), then a causally precedes b. A comparison sketch follows below.

Image taken from [2]
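
A minimal sketch of the vector clock comparison in Java (illustrative code, not from the slides): V(a) < V(b) means every component of V(a) is less than or equal to the corresponding component of V(b) and at least one is strictly smaller; if neither V(a) < V(b) nor V(b) < V(a) holds, the two events are concurrent.

// Illustrative vector clock comparison; assumes both clocks have one entry per node.
public final class VectorClocks {

    // Returns true if a < b component-wise with at least one strict inequality,
    // i.e. the event stamped with a causally precedes the event stamped with b.
    public static boolean happenedBefore(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("clocks must have the same dimension");
        }
        boolean strictlySmallerSomewhere = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) {
                return false;
            }
            if (a[i] < b[i]) {
                strictlySmallerSomewhere = true;
            }
        }
        return strictlySmallerSomewhere;
    }

    // Two events are concurrent if neither causally precedes the other.
    public static boolean concurrent(int[] a, int[] b) {
        return !happenedBefore(a, b) && !happenedBefore(b, a);
    }

    public static void main(String[] args) {
        int[] a = {2, 1, 0};
        int[] b = {2, 2, 1};
        int[] c = {1, 3, 0};
        System.out.println(happenedBefore(a, b)); // true: a causally precedes b
        System.out.println(concurrent(a, c));     // true: neither precedes the other
    }
}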

Are Logical Clocks Our Only Option?

- Google Spanner [3] uses NTP, GPS, and atomic clocks to synchronize the local clocks of its machines as much as possible.

- It doesn’t pretend that clocks are perfectly synchronized.

- It exposes the uncertainty of clocks through its TrueTime API.

- CockroachDB [4] uses Hybrid Logical Clocks [5], which combine physical clocks with logical clocks to infer causality. A rough sketch follows below.
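
To show how the combination works, here is a compact sketch of the HLC update rules described in [5], written in Java (my own illustrative code, not CockroachDB's implementation; names are assumptions). A timestamp is a pair (l, c): l tracks the largest physical clock value seen so far, and c is a logical counter that breaks ties while physical time does not advance, so timestamps compared lexicographically respect happened-before while staying close to physical time.

import java.util.function.LongSupplier;

// Compact Hybrid Logical Clock sketch following the update rules in [5] (illustrative).
public final class HybridLogicalClock {
    private long l = 0; // largest physical clock value observed so far
    private long c = 0; // logical counter within the same l

    private final LongSupplier physicalClock;

    public HybridLogicalClock(LongSupplier physicalClock) {
        this.physicalClock = physicalClock; // e.g. System::currentTimeMillis
    }

    // Called on a local event or just before sending a message.
    public synchronized long[] tick() {
        long prev = l;
        l = Math.max(prev, physicalClock.getAsLong());
        c = (l == prev) ? c + 1 : 0;
        return new long[] {l, c};
    }

    // Called when a message stamped with (ml, mc) is received.
    public synchronized long[] onReceive(long ml, long mc) {
        long prev = l;
        l = Math.max(Math.max(prev, ml), physicalClock.getAsLong());
        if (l == prev && l == ml) {
            c = Math.max(c, mc) + 1;
        } else if (l == prev) {
            c = c + 1;
        } else if (l == ml) {
            c = mc + 1;
        } else {
            c = 0;
        }
        return new long[] {l, c};
    }
}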


Consensus


Consensus

- The problem of having a set of processes agree on a value.
  - e.g., leader election, state machine replication, deciding whether to commit a transaction.

- Validity: the value agreed upon must have been proposed by some process

- Termination: at least one non-faulty process eventually decides

- Agreement: all deciding processes agree on the same value


Liveness and Safety Properties

- Liveness: something “good” eventually happens during the execution of an algorithm.

- Safety: something “bad” never happens during the execution of an algorithm.


FLP Result (Fischer, Lynch and Paterson) [6]

- Distributed consensus is not always possible...
  - even with reliable message delivery,
  - even with at most a single crash-stop failure,

- ... in the asynchronous model, because we cannot differentiate between a crashed process and a slow one.

- No deterministic algorithm can guarantee termination in every execution in the presence of crashes.
  - This impossibility concerns the liveness property, not the safety property.


Detecting failures: Why aren’t you “talking to me”?


Unreliable Failure Detectors by Chandra and Toueg [7]

- Distributed failure detectors that are allowed to make mistakes.
  - Each process keeps a local list of the processes it suspects to have failed.

- A local failure detector can make two types of mistakes:
  - suspecting processes that haven’t actually crashed ⇒ ACCURACY property,
  - not suspecting processes that have actually crashed ⇒ COMPLETENESS property.

- Degrees of completeness: strong completeness, weak completeness.

- Degrees of accuracy: strong accuracy, weak accuracy, eventually strong accuracy, eventually weak accuracy.


Classes of Failure Detectors

- Perfect Failure Detector (P)
  - Strongly Complete: every faulty process is eventually permanently suspected by every non-faulty process.
  - Strongly Accurate: no process is suspected (by anybody) before it crashes.

- Eventually Strong Failure Detector (⋄S)
  - Strongly Complete
  - Eventually Weakly Accurate: after some initial period of confusion, some non-faulty process is never suspected.

- The consensus problem can be solved with an Eventually Strong Failure Detector (⋄S) and f < n / 2 failures in the asynchronous model [7], [8].
  - As long as you hear from a majority, you can solve consensus ⇒ SAFETY
  - Every correct process eventually decides; no blocking forever ⇒ LIVENESS
  - A heartbeat-based sketch of such a failure detector follows below.
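
In practice, ⋄S-style detectors are usually approximated with heartbeats and adaptive timeouts: missed heartbeats lead to suspicion, and a late heartbeat from a suspected process revokes the suspicion and grows that process's timeout, so false suspicions eventually stop once the system behaves synchronously again. Below is a minimal Java sketch of that idea (illustrative code with assumed names, not from the talk, and not a complete ⋄S implementation).

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Minimal heartbeat-based failure detector sketch (illustrative).
public final class HeartbeatFailureDetector {
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final Map<String, Long> timeoutMillis = new ConcurrentHashMap<>();
    private final Set<String> suspected = ConcurrentHashMap.newKeySet();
    private final long initialTimeoutMillis;

    public HeartbeatFailureDetector(long initialTimeoutMillis) {
        this.initialTimeoutMillis = initialTimeoutMillis;
    }

    // Called whenever a heartbeat message from `member` arrives.
    public void onHeartbeat(String member, long nowMillis) {
        lastHeartbeat.put(member, nowMillis);
        if (suspected.remove(member)) {
            // We suspected a process that was merely slow: increase its timeout
            // so it is not falsely suspected again right away.
            timeoutMillis.merge(member, initialTimeoutMillis * 2, (old, ignored) -> old * 2);
        }
    }

    // Called periodically to refresh the suspect list.
    public void checkMembers(long nowMillis) {
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            long timeout = timeoutMillis.getOrDefault(e.getKey(), initialTimeoutMillis);
            if (nowMillis - e.getValue() > timeout) {
                suspected.add(e.getKey());
            }
        }
    }

    public boolean isSuspected(String member) {
        return suspected.contains(member);
    }
}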


Consensus Algorithms: 2PC, 3PC, Paxos, Raft, and others


Two-Phase Commit (2PC) [9]

- With no failures, it satisfies Validity, Termination, and Agreement.

- C crashes before Phase 1: no problem. (Here C is the coordinator; A and B are participants.)

- C crashes before Phase 2: A can ask B what it has voted for.

- C and A crash before Phase 2: the protocol blocks!

- So the protocol can block even with fail-stop failures (the simplest failure model). A sketch of the coordinator’s side follows below.
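
For illustration, here is a simplified sketch of the coordinator side of 2PC in Java (my own code; the Participant interface and all names are assumptions). Phase 1 collects votes from every participant; Phase 2 commits only if all votes are YES, otherwise it aborts. Error handling is deliberately naive.

import java.util.List;

// Simplified 2PC coordinator sketch (illustrative).
public final class TwoPhaseCommitCoordinator {

    public interface Participant {
        boolean prepare(); // true = YES vote, false = NO vote (or timeout)
        void commit();
        void abort();
    }

    public boolean execute(List<Participant> participants) {
        // Phase 1: voting.
        boolean allYes = true;
        for (Participant p : participants) {
            try {
                if (!p.prepare()) {
                    allYes = false;
                    break;
                }
            } catch (RuntimeException e) {
                // A participant we cannot reach counts as a NO vote.
                allYes = false;
                break;
            }
        }

        // Phase 2: decision.
        for (Participant p : participants) {
            if (allYes) {
                p.commit();
            } else {
                p.abort();
            }
        }
        return allYes;
    }
}

Note that the blocking problem described in the slide sits on the participant side, which this sketch does not show: a participant that has voted YES and then loses contact with the coordinator cannot decide on its own and has to wait.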


Three-Phase Commit (3PC) [10]

- The main problem of 2PC is that the participants don’t know the outcome of the voting before they actually take action (commit / abort).

- We add a new phase for this ⇒ 3PC

- 3PC is non-blocking and it handles fail-stop failures.

- What about fail-recover, network partitions, the asynchronous model?


Paxos [11], [12]

- It chooses to sacrifice liveness to maintain safety.
  - It may not terminate while the network behaves asynchronously, and it terminates once synchrony returns.
  - It doesn’t block as long as a majority is available.

- The normal (failure-free) run is similar to 2PC.

- Two new mechanisms:
  - an ordering on proposals, so we can tell which proposal should be accepted: sequence numbers,
  - requiring only a majority, instead of all participants.
  A sketch of the acceptor’s side follows below.

Image taken from [23]
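
To make the sequence-number and majority ideas concrete, here is a rough sketch of the acceptor side of single-decree Paxos in Java (my own illustrative code; see [11], [12] for the actual protocol). An acceptor promises never to accept proposals numbered lower than what it has already promised, and it reports any value it has already accepted so that proposers are forced to carry that value forward, which is what preserves safety.

import java.util.Optional;

// Illustrative single-decree Paxos acceptor sketch (names and message shapes are assumptions).
public final class PaxosAcceptor {

    public record Promise(long acceptedProposal, Optional<Object> acceptedValue) {}

    private long highestPromised = -1;  // highest proposal number promised so far
    private long acceptedProposal = -1; // proposal number of the accepted value, if any
    private Object acceptedValue = null;

    // Phase 1b: handle Prepare(n). Returns a promise, or empty if n is too old.
    public synchronized Optional<Promise> onPrepare(long n) {
        if (n > highestPromised) {
            highestPromised = n;
            return Optional.of(new Promise(acceptedProposal, Optional.ofNullable(acceptedValue)));
        }
        return Optional.empty(); // reject: already promised a higher-numbered proposal
    }

    // Phase 2b: handle Accept(n, value). Returns true if the value was accepted.
    public synchronized boolean onAccept(long n, Object value) {
        if (n >= highestPromised) {
            highestPromised = n;
            acceptedProposal = n;
            acceptedValue = value;
            return true;
        }
        return false; // reject: promised not to accept proposals this old
    }
}

The proposer side (not shown) picks a fresh, higher sequence number, gathers promises from a majority of acceptors, adopts the highest-numbered accepted value reported in those promises (or its own value if there is none), and then asks the same majority to accept it.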

Paxos

- The original paper, “The Part-Time Parliament” [11], is difficult to read, as it explains the algorithm using an analogy with Greek democracy.
  - Submitted in 1990, published in 1998, after being explained in another paper [17] in 1996.

- “The Paxos algorithm, when presented in plain English, is very simple” Paxos Made Simple [12]

- Cheap Paxos [13], Fast Paxos [14] and many other variations…

- Paxos Made Live [15]: There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.

- Paxos Made Moderately Complex [16]: For anybody who has ever tried to implement it, Paxos is by no means a simple protocol, even though it is based on relatively simple invariants. This paper provides imperative pseudo-code for the full Paxos (or Multi-Paxos) protocol without shying away from discussing various implementation details.

Raft: In search of an understandable consensus algorithm [18]

- A new consensus algorithm with understandability as one of its design goals.

- It divides the problem into parts:

- leader election, log replication, safety and membership changes

- Also discusses implementation details

- More than 80 implementations on its website [19]


Other Consensus Algorithms

- Viewstamped Replication [20], [21]

- Another consensus algorithm. It is less popular than Paxos.

- Raft has a lot of similarities to it.

- Zab [22]

- Implemented in ZooKeeper

- Many variants of Paxos...


References

[1] Lamport, Leslie. "Time, clocks, and the ordering of events in a distributed system." Communications of the ACM 21.7 (1978): 558-565.
[2] Raynal, Michel, and Mukesh Singhal. "Logical time: Capturing causality in distributed systems." Computer 29.2 (1996): 49-56.
[3] Corbett, James C., et al. "Spanner: Google’s globally distributed database." ACM Transactions on Computer Systems (TOCS) 31.3 (2013): 8.
[4] https://github.com/cockroachdb/cockroach
[5] Kulkarni, Sandeep S., et al. "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases." (2014).
[6] Fischer, Michael J., Nancy A. Lynch, and Michael S. Paterson. "Impossibility of distributed consensus with one faulty process." Journal of the ACM (JACM) 32.2 (1985): 374-382.
[7] Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.
[8] Chandra, Tushar Deepak, Vassos Hadzilacos, and Sam Toueg. "The weakest failure detector for solving consensus." Journal of the ACM (JACM) 43.4 (1996): 685-722.
[9] Gray, James N. "Notes on database operating systems." Operating Systems. Springer Berlin Heidelberg, 1978. 393-481.
[10] Skeen, Dale. "Nonblocking commit protocols." Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data. ACM, 1981.
[11] Lamport, Leslie. "The part-time parliament." ACM Transactions on Computer Systems (TOCS) 16.2 (1998): 133-169.
[12] Lamport, Leslie. "Paxos made simple." ACM SIGACT News 32.4 (2001): 18-25.
[13] Lamport, Leslie, and Mike Massa. "Cheap Paxos." International Conference on Dependable Systems and Networks, 2004. IEEE, 2004.
[14] Lamport, Leslie. "Fast Paxos." Distributed Computing 19.2 (2006): 79-103.
[15] Chandra, Tushar D., Robert Griesemer, and Joshua Redstone. "Paxos made live: an engineering perspective." Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing. ACM, 2007.
[16] Van Renesse, Robbert, and Deniz Altinbuken. "Paxos made moderately complex." ACM Computing Surveys (CSUR) 47.3 (2015): 42.
[17] Lampson, Butler. "How to build a highly available system using consensus." Distributed Algorithms (1996): 1-17.
[18] Ongaro, Diego, and John Ousterhout. "In search of an understandable consensus algorithm." 2014 USENIX Annual Technical Conference (USENIX ATC 14). 2014.
[19] https://raft.github.io/
[20] Oki, Brian M., and Barbara H. Liskov. "Viewstamped replication: A new primary copy method to support highly-available distributed systems." Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing. ACM, 1988.
[21] Liskov, Barbara, and James Cowling. "Viewstamped replication revisited." (2012).
[22] Junqueira, Flavio P., Benjamin C. Reed, and Marco Serafini. "Zab: High-performance broadcast for primary-backup systems." 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN). IEEE, 2011.
[23] http://the-paper-trail.org/blog/consensus-protocols-paxos/

Thank you! Stay tuned for the next episode...
