Distributed Systems Theory for Mere Mortals

Ensar Basri Kahveci, Distributed Systems Engineer, Hazelcast

Page 1: Distributed Systems Theory for Mere Mortals

Distributed Systems Theory for Mere Mortals

Ensar Basri Kahveci, Distributed Systems Engineer, Hazelcast

Page 2: Distributed Systems Theory for Mere Mortals

Disclaimer Notice

In this presentation, I talk about distributed systems theory based on my own understanding.

First of all, distributed systems theory is hard. It also covers a wide range of topics.

So, my statements might be wrong or incomplete!

Please speak up about any point that confuses you or that you think I got wrong.

Page 3: Distributed Systems Theory for Mere Mortals

Agenda

- Defining distributed systems

- System Models

- Time and Order

- Consensus, FLP Result, Failure Detectors

- Consensus Algorithms: 2PC, 3PC, Paxos and others...

Page 4: Distributed Systems Theory for Mere Mortals

“A DISTRIBUTED SYSTEM IS ONE IN WHICH THE FAILURE OF A COMPUTER YOU DID NOT EVEN KNOW EXISTED CAN RENDER YOUR OWN COMPUTER UNUSABLE”

Leslie Lamport

Page 5: Distributed Systems Theory for Mere Mortals

What is a distributed system?

- Collection of entities (machines, nodes, processes...)

- trying to solve a common problem,

- linked by a network and communicating via passing messages,

- having uncertain and partial knowledge of the system.

Page 6: Distributed Systems Theory for Mere Mortals

About being distributed…

- Independent failures
  - Some servers might fail while others work correctly.

- Non-negligible message transmission delays
  - The interconnection between servers has lower bandwidth and higher latency than that available within a single server.

- Unreliable communication
  - The connections between servers are unreliable compared to the connections within a server.

Page 7: Distributed Systems Theory for Mere Mortals

System Models


Page 8: Distributed Systems Theory for Mere Mortals

Interaction Models

- Synchronous

- Asynchronous

- Partially-synchronous

Page 9: Distributed Systems Theory for Mere Mortals

Failure Modes

- Fail-stop

- Fail-recover

- Omission failures

- Arbitrary failures (Byzantine)

Page 10: Distributed Systems Theory for Mere Mortals

Time and Order


Page 11: Distributed Systems Theory for Mere Mortals

Time and Order

- We use time to:
  - order events
  - measure the duration between events

- In the asynchronous model, nodes have local clocks, which can drift unboundedly.

- Components of a distributed system behave in an unpredictable manner.
  - Failures, rates of advance, delays in network packets etc.

- We cannot assume synchronized clocks while designing our algorithms in the asynchronous model.
  - Clock synchronization methods help us a lot but don’t fix the problem completely.

Page 12: Distributed Systems Theory for Mere Mortals

The Idea: Ordering Events

- We don’t have the notion of “now” in distributed systems.

- To what extent do we need it?

- We don’t need absolute clock synchronization.

- If machines don’t interact with each other, why bother synchronizing their clocks?

- For a lot of problems, processes need to agree on the order in which events occur, rather than the time at which they occur.

Page 13: Distributed Systems Theory for Mere Mortals

Ordering Events: Logical Clocks

- We can use Logical Clocks (= Lamport Clocks) [1] to order events in a distributed system.

- Logical clocks rely on counters and the communication between nodes.
  - Each node maintains a local counter value.

- happened-before relationship (“→”)
  - If a and b are events in the same process, and a comes before b, then a → b
  - If a is the sending of a message and b is the receipt of that message, then a → b
  - If a → b and b → c, then a → c
  - If neither a → b nor b → a holds, a and b are concurrent.

- Partial ordering and total ordering of the events
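
The counter rules behind logical clocks can be sketched in a few lines of Python. This is a minimal illustration rather than code from [1]: the class name `LamportClock` and its method names are made up for this example, and message passing is simulated by handing a timestamp from one object to another.

```python
class LamportClock:
    """Logical clock of a single process (illustrative sketch)."""

    def __init__(self):
        self.counter = 0

    def local_event(self):
        # Tick before every local event, so later events get larger values.
        self.counter += 1
        return self.counter

    def send(self):
        # Sending is an event too; the returned timestamp travels with the message.
        return self.local_event()

    def receive(self, message_timestamp):
        # Jump past the sender's timestamp so that send -> receive is respected.
        self.counter = max(self.counter, message_timestamp) + 1
        return self.counter


p, q = LamportClock(), LamportClock()
a = p.local_event()   # event a on process p
sent = p.send()       # p sends a message to q
b = q.receive(sent)   # q receives the message
c = q.local_event()   # a later event on q
print(a, sent, b, c)  # 1 2 3 4
```

Every pair of events linked by happened-before in this run (a → send → receive → c) gets increasing counter values, even though q's own counter started behind p's.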

Page 14: Distributed Systems Theory for Mere Mortals

Clock Condition

- For any events a, b: if a → b, then C(a) < C(b).

- Can we also infer the reverse?

- If p1 → q2 and q2 → q3, then C(q3) > C(p1)
  - Causality: if p1 causes q2 and q2 causes q3, then p1 causes q3.

- p3 and q3 are concurrent events under the happened-before relationship.
  - Can we infer whether there is any causality by comparing C(p3) and C(q3)?

Image taken from [1]

Page 15: Distributed Systems Theory for Mere Mortals

Vector Clocks and Causality

- We use vector clocks to infer causalities by comparing clock values.

- If V(a) < V(b) then a causally precedes b

Image taken from [2]
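
A small sketch of how the comparison V(a) < V(b) can be computed. The function names (`vc_increment`, `vc_merge`, `vc_less`, `vc_concurrent`) and the dict-based representation are assumptions made for this example, not taken from [2]; a missing entry is treated as 0.

```python
def vc_increment(clock, node):
    """Return a copy of `clock` with `node`'s own entry advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def vc_merge(local, received, node):
    """Receive rule: element-wise maximum, then tick the receiver's own entry."""
    merged = {n: max(local.get(n, 0), received.get(n, 0))
              for n in set(local) | set(received)}
    return vc_increment(merged, node)


def vc_less(a, b):
    """V(a) < V(b): no entry of a is larger, and at least one is strictly smaller."""
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))


def vc_concurrent(a, b):
    """Neither clock dominates the other (a and b belong to distinct events)."""
    return not vc_less(a, b) and not vc_less(b, a)


p1 = vc_increment({}, "p")    # first event on p
q1 = vc_increment({}, "q")    # first event on q
p2 = vc_increment(p1, "p")    # p sends a message carrying p2
q2 = vc_merge(q1, p2, "q")    # q receives it
p3 = vc_increment(p2, "p")    # p continues on its own

print(vc_less(p1, q2))        # True: p1 causally precedes q2
print(vc_concurrent(p3, q2))  # True: no causal path in either direction
```

Unlike a single Lamport counter, the comparison works in both directions: if neither vector is smaller than the other, the two events are concurrent.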

Page 16: Distributed Systems Theory for Mere Mortals

Are Logical Clocks our only chance?

- Google Spanner [3] uses NTP, GPS, and atomic clocks to synchronize the local clocks of the machines as much as possible.
  - It doesn’t pretend that clocks are perfectly synchronized.
  - It introduces the uncertainty of clocks into its TrueTime API.

- CockroachDB [4] uses Hybrid Logical Clocks [5], which combine logical clocks and physical clocks to infer causalities.
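
To make the idea of combining physical and logical clocks more concrete, here is a sketch of the update rules of a hybrid logical clock as I understand them from [5]. It is not CockroachDB's code: the class name, the method names, and the use of an injected `physical_time` callable returning integer ticks are all assumptions made for this example.

```python
class HybridLogicalClock:
    """Timestamps are (l, c): l tracks physical time, c breaks ties."""

    def __init__(self, physical_time):
        self.physical_time = physical_time  # callable returning this node's wall clock
        self.l = 0  # largest physical time seen so far
        self.c = 0  # logical counter within the same l

    def now(self):
        """Timestamp a local event or a message send."""
        previous_l = self.l
        self.l = max(previous_l, self.physical_time())
        self.c = self.c + 1 if self.l == previous_l else 0
        return (self.l, self.c)

    def update(self, msg_l, msg_c):
        """Timestamp the receipt of a message stamped (msg_l, msg_c)."""
        previous_l = self.l
        self.l = max(previous_l, msg_l, self.physical_time())
        if self.l == previous_l and self.l == msg_l:
            self.c = max(self.c, msg_c) + 1
        elif self.l == previous_l:
            self.c += 1
        elif self.l == msg_l:
            self.c = msg_c + 1
        else:
            self.c = 0
        return (self.l, self.c)


fast = HybridLogicalClock(lambda: 100)  # wall clock runs ahead
slow = HybridLogicalClock(lambda: 90)   # wall clock lags behind

sent = fast.now()               # (100, 0)
received = slow.update(*sent)   # (100, 1)
later = slow.now()              # (100, 2)
print(sent < received < later)  # True
```

Even though the receiver's wall clock is behind the sender's, the receive event and everything after it are stamped later than the send, so comparing timestamps never contradicts causality in this run.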

Page 17: Distributed Systems Theory for Mere Mortals

Consensus


Page 18: Distributed Systems Theory for Mere Mortals

Consensus

- The problem of having a set of processes agree on a value.
  - leader election, state machine replication, deciding to commit a transaction etc.

- Validity: the value agreed upon must have been proposed by some process

- Termination: every non-faulty process eventually decides

- Agreement: all deciding processes agree on the same value
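
The three properties can be written down as checks over the outcome of a single finished run. This is only a sketch: the function name `check_consensus` and its arguments are invented for this example, and termination is checked in its usual textbook form (every non-faulty process has decided). Since termination is a statement about what eventually happens, checking it on a finished run is a simplification.

```python
def check_consensus(proposals, decisions, correct):
    """Check one finished run.

    proposals: process -> value it proposed
    decisions: process -> value it decided (absent if it never decided)
    correct:   set of processes that did not fail
    """
    decided_values = set(decisions.values())
    return {
        # Only proposed values may be decided.
        "validity": decided_values <= set(proposals.values()),
        # Nobody decides differently from anybody else.
        "agreement": len(decided_values) <= 1,
        # Every non-faulty process has decided.
        "termination": all(p in decisions for p in correct),
    }


proposals = {"p1": "a", "p2": "b", "p3": "b"}

# p3 crashed; p1 and p2 both decided "b".
good = check_consensus(proposals, {"p1": "b", "p2": "b"}, {"p1", "p2"})
print(good)   # all three properties hold

# p1 and p2 decided different values.
split = check_consensus(proposals, {"p1": "a", "p2": "b"}, {"p1", "p2"})
print(split)  # agreement is violated
```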

Page 19: Distributed Systems Theory for Mere Mortals

Liveness and Safety Properties

- Liveness: Some “good” thing eventually happens during the execution of an algorithm

- Safety: Some “bad” thing never happens during the execution of an algorithm

Page 20: Distributed Systems Theory for Mere Mortals

FLP Result (Fischer, Lynch and Paterson) [6]

- Distributed consensus is not always possible ...
  - even with reliable message delivery
  - even with a single crash-stop failure

- … in the asynchronous model, because we cannot differentiate between a crashed process and a slow process.

- No deterministic algorithm can always guarantee termination in the presence of crashes.
  - It is related to the liveness property, not the safety property.

Page 21: Distributed Systems Theory for Mere Mortals

Detecting failures: Why aren’t you “talking to me”?

Page 22: Distributed Systems Theory for Mere Mortals

Unreliable Failure Detectors by Chandra and Toueg [7]

- Distributed failure detectors which are allowed to make mistakes
  - Each process has a local state to keep the list of processes that it suspects have failed

- A local failure detector can make 2 types of mistakes
  - suspecting processes that haven’t actually crashed ⇒ ACCURACY property
  - not suspecting processes that have actually crashed ⇒ COMPLETENESS property

- Degrees of completeness
  - strong completeness, weak completeness

- Degrees of accuracy
  - strong accuracy, weak accuracy, eventual strong accuracy, eventual weak accuracy
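
One common way to build such a local detector is with heartbeats and timeouts. The sketch below is an illustration only, not the construction from [7]: the class name, the doubling of the timeout after a wrong suspicion, and the explicit `now` argument (used instead of a real clock so the example is deterministic) are all assumptions.

```python
class HeartbeatFailureDetector:
    """Keeps the local list of suspected peers, and is allowed to be wrong."""

    def __init__(self, peers, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in peers}
        self.last_heard = {p: 0.0 for p in peers}
        self.suspected = set()

    def heartbeat(self, peer, now):
        """Record a heartbeat received from `peer` at time `now`."""
        self.last_heard[peer] = now
        if peer in self.suspected:
            # Accuracy mistake: the peer was slow, not crashed.
            # Stop suspecting it and be more patient from now on.
            self.suspected.discard(peer)
            self.timeout[peer] *= 2

    def check(self, now):
        """Suspect every peer that has been silent for longer than its timeout."""
        for peer, heard in self.last_heard.items():
            if now - heard > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)


detector = HeartbeatFailureDetector(["a", "b"])
detector.heartbeat("a", now=0.5)
detector.heartbeat("b", now=0.5)

first = detector.check(now=2.0)   # both silent for 1.5 > 1.0
detector.heartbeat("a", now=2.1)  # "a" was only slow
second = detector.check(now=3.5)  # "b" is still silent

print(sorted(first), sorted(second))  # ['a', 'b'] ['b']
```

A crashed peer stays silent and therefore stays suspected (completeness), while a slow peer is suspected by mistake and later cleared (a temporary accuracy violation).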

Page 23: Distributed Systems Theory for Mere Mortals

Classes of Failure Detectors

- Perfect Failure Detector (P)
  - Strongly Complete: Every faulty process is eventually permanently suspected by every non-faulty process.
  - Strongly Accurate: No process is suspected (by anybody) before it crashes.

- Eventually Strong Failure Detector (⋄S)
  - Strongly Complete
  - Eventually Weakly Accurate: After some initial period of confusion, some non-faulty process is never suspected.

- The consensus problem can be solved with an Eventually Strong Failure Detector (⋄S) with f < n / 2 failures in the asynchronous model. [7], [8]
  - As long as you hear from the majority, you can solve consensus. ⇒ SAFETY
  - Every correct process eventually decides. No blocking forever. ⇒ LIVENESS
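
The bound f < n / 2 is easy to turn into numbers. The helper names below are made up for this example; the last lines check by brute force, for a group of five, the reason hearing from a majority is enough: any two majorities share at least one process.

```python
from itertools import combinations


def max_faults(n):
    """Largest f that still satisfies f < n / 2."""
    return (n - 1) // 2


def majority(n):
    """Smallest group size that is more than half of n."""
    return n // 2 + 1


for n in (3, 4, 5):
    print(n, max_faults(n), majority(n))
# 3 1 2
# 4 1 3
# 5 2 3

# Any two majorities of the same 5 processes overlap in at least one process.
quorums = [set(q) for q in combinations(range(5), majority(5))]
all_overlap = all(a & b for a in quorums for b in quorums)
print(all_overlap)  # True
```

Note that going from 3 to 4 processes does not raise the number of tolerated failures, which is why odd group sizes are the usual choice.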

Page 24: Distributed Systems Theory for Mere Mortals

Consensus Algorithms

2PC, 3PC, Paxos, Raft and the others

Page 25: Distributed Systems Theory for Mere Mortals

Two-Phase Commit (2PC) [9]

- With no failures, it satisfies Validity, Termination, and Agreement.

- C crashes before Phase 1: No problem

- C crashes before Phase 2: A can ask B what it has voted for.

- C and A crash before Phase 2: The protocol blocks!

- The protocol can block even with fail-stop failures (the simplest failure model).
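
The two phases and the blocking case can be sketched as follows. This is a toy, single-process simulation under my own assumptions, not the protocol text from [9]: the `Participant` class, the state names, and the `coordinator_crashes_after_voting` flag are invented for illustration, and no real messages are exchanged.

```python
class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "init"

    def prepare(self):
        # Phase 1: vote. After a yes vote the participant must wait for the outcome.
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def finish(self, decision):
        # Phase 2: apply the coordinator's decision.
        self.state = decision


def run_two_phase_commit(participants, coordinator_crashes_after_voting=False):
    """Play the coordinator's role; return its decision, or None if it crashed."""
    votes = [p.prepare() for p in participants]
    if coordinator_crashes_after_voting:
        return None  # the decision never reaches anybody
    decision = "committed" if all(votes) else "aborted"
    for p in participants:
        if p.state == "prepared":
            p.finish(decision)
    return decision


a, b = Participant("A"), Participant("B")
print(run_two_phase_commit([a, b]), a.state, b.state)
# committed committed committed

a, b = Participant("A"), Participant("B", will_vote_yes=False)
print(run_two_phase_commit([a, b]), a.state, b.state)
# aborted aborted aborted

a, b = Participant("A"), Participant("B")
print(run_two_phase_commit([a, b], coordinator_crashes_after_voting=True),
      a.state, b.state)
# None prepared prepared
```

In the last run both participants have voted yes and are left in the "prepared" state: in this sketch nothing ever tells them the outcome, so they can neither commit nor abort on their own.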

Page 26: Distributed Systems Theory for Mere Mortals

Three-Phase Commit (3PC) [10]

- The main problem of 2PC is that the participants don’t know the outcome of the voting before they actually take action (commit / abort).

- We add a new step for this ⇒ 3PC

- 3PC is non-blocking and it handles fail-stop failures.

- What about fail-recover, network partitions, the asynchronous model?
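
The benefit of the extra step shows up when the coordinator crashes: the surviving participants can work out a safe outcome from their own states. The function below is a sketch of that decision rule as I understand it, valid only under fail-stop failures with no network partitions; the function name and the state names ("uncertain", "pre-committed", "committed", "aborted") are my own labels, not taken from [10].

```python
def decide_after_coordinator_crash(surviving_states):
    """Pick the outcome from the states of the participants that are still alive."""
    states = set(surviving_states)
    if "aborted" in states:
        return "abort"   # somebody already aborted
    if "committed" in states:
        return "commit"  # somebody already committed
    if "pre-committed" in states:
        # A pre-commit is only sent after every participant voted yes,
        # so finishing the commit is safe.
        return "commit"
    # Everybody is still uncertain. Nobody commits before all participants
    # have been pre-committed, so no process can have committed yet.
    return "abort"


print(decide_after_coordinator_crash(["pre-committed", "uncertain"]))  # commit
print(decide_after_coordinator_crash(["uncertain", "uncertain"]))      # abort
print(decide_after_coordinator_crash(["aborted", "uncertain"]))        # abort
```

In 2PC a participant that voted yes has no such rule to fall back on, because a crashed peer might already have committed or aborted. With fail-recover failures or network partitions the reasoning above no longer holds, which is the question the last bullet raises.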

Page 27: Distributed Systems Theory for Mere Mortals

Paxos [11], [12]

- It chooses to sacrifice liveness to maintain safety
  - It may not terminate while the network behaves asynchronously; it terminates only once synchrony returns.
  - It doesn’t block as long as a majority is available.

- The correct run is similar to 2PC.

- 2 new mechanisms:
  - Ordering of proposals, so that we can find out which proposal should be accepted: sequence numbers
  - Prefer a majority, instead of all participants

Image taken from [23]
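
The two mechanisms, sequence numbers and majorities, can be seen in a compact single-value Paxos sketch. It is a simplified, in-process illustration under my own assumptions (no networking, no failures, proposal numbers picked by hand); the names `Acceptor` and `propose` are invented for this example and it is not a production-ready implementation of [11] or [12].

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest proposal number promised so far
        self.accepted = None  # (number, value) of the last accepted proposal

    def prepare(self, number):
        """Phase 1: promise to ignore proposals numbered below `number`."""
        if number > self.promised:
            self.promised = number
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, number, value):
        """Phase 2: accept unless a higher-numbered proposal was promised."""
        if number >= self.promised:
            self.promised = number
            self.accepted = (number, value)
            return True
        return False


def propose(acceptors, number, value):
    """Run one proposer round; return the chosen value, or None if the round fails."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(number) for a in acceptors]
    promises = [previous for kind, previous in replies if kind == "promise"]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted something, adopt the value of the
    # highest-numbered accepted proposal instead of our own value.
    already_accepted = [p for p in promises if p is not None]
    if already_accepted:
        value = max(already_accepted, key=lambda p: p[0])[1]
    acks = sum(a.accept(number, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "x"))  # x
print(propose(acceptors, 2, "y"))  # x (the earlier choice is preserved)
print(propose(acceptors, 1, "z"))  # None (stale proposal number)
```

The second proposer arrives with a higher number and a different value, but it learns about "x" from the promises and re-proposes it, so the decision never changes once made.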

Page 28: Distributed Systems Theory for Mere Mortals

Paxos

- The original paper “The Part-time Parliament” [11] is difficult to read as it explains the algorithm using an analogy with a parliament on a Greek island.
  - Submitted in 1990, published in 1998, after being explained in another paper [17] in 1996.

- “The Paxos algorithm, when presented in plain English, is very simple” Paxos Made Simple [12]

- Cheap Paxos [13], Fast Paxos [14] and many other variations…

- Paxos Made Live [15]: “There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.”

- Paxos Made Moderately Complex [16]: “For anybody who has ever tried to implement it, Paxos is by no means a simple protocol, even though it is based on relatively simple invariants. This paper provides imperative pseudo-code for the full Paxos (or Multi-Paxos) protocol without shying away from discussing various implementation details.”

Page 29: Distributed Systems Theory for Mere Mortals

Raft: In search of an understandable consensus algorithm [18]

- A new consensus algorithm with understandability being one of its design goals.

- It divides the problem into parts:
  - leader election, log replication, safety and membership changes

- It also discusses implementation details.

- More than 80 implementations are listed on its website [19].

Page 30: Distributed Systems Theory for Mere Mortals

Other Consensus Algorithms

- Viewstamped Replication [20], [21]
  - Another consensus algorithm. It is less popular than Paxos.
  - Raft has a lot of similarities to it.

- Zab [22]
  - Implemented in ZooKeeper

- Many variants of Paxos...

Page 31: Distributed Systems Theory for Mere Mortals

References

[1] Lamport, Leslie. "Time, clocks, and the ordering of events in a distributed system." Communications of the ACM 21.7 (1978): 558-565.
[2] Raynal, Michel, and Mukesh Singhal. "Logical time: Capturing causality in distributed systems." Computer 29.2 (1996): 49-56.
[3] Corbett, James C., et al. "Spanner: Google’s globally distributed database." ACM Transactions on Computer Systems (TOCS) 31.3 (2013): 8.
[4] https://github.com/cockroachdb/cockroach
[5] Kulkarni, Sandeep S., et al. "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases." (2014).
[6] Fischer, Michael J., Nancy A. Lynch, and Michael S. Paterson. "Impossibility of distributed consensus with one faulty process." Journal of the ACM (JACM) 32.2 (1985): 374-382.
[7] Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.
[8] Chandra, Tushar Deepak, Vassos Hadzilacos, and Sam Toueg. "The weakest failure detector for solving consensus." Journal of the ACM (JACM) 43.4 (1996): 685-722.
[9] Gray, James N. "Notes on database operating systems." Operating Systems. Springer Berlin Heidelberg, 1978. 393-481.
[10] Skeen, Dale. "Nonblocking commit protocols." Proceedings of the 1981 ACM SIGMOD international conference on Management of data. ACM, 1981.
[11] Lamport, Leslie. "The part-time parliament." ACM Transactions on Computer Systems (TOCS) 16.2 (1998): 133-169.
[12] Lamport, Leslie. "Paxos made simple." ACM Sigact News 32.4 (2001): 18-25.
[13] Lamport, Leslie, and Mike Massa. "Cheap paxos." Dependable Systems and Networks, 2004 International Conference on. IEEE, 2004.
[14] Lamport, Leslie. "Fast paxos." Distributed Computing 19.2 (2006): 79-103.
[15] Chandra, Tushar D., Robert Griesemer, and Joshua Redstone. "Paxos made live: an engineering perspective." Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing. ACM, 2007.
[16] Van Renesse, Robbert, and Deniz Altinbuken. "Paxos made moderately complex." ACM Computing Surveys (CSUR) 47.3 (2015): 42.
[17] Lampson, Butler. "How to build a highly available system using consensus." Distributed Algorithms (1996): 1-17.
[18] Ongaro, Diego, and John Ousterhout. "In search of an understandable consensus algorithm." 2014 USENIX Annual Technical Conference (USENIX ATC 14). 2014.
[19] https://raft.github.io/
[20] Oki, Brian M., and Barbara H. Liskov. "Viewstamped replication: A new primary copy method to support highly-available distributed systems." Proceedings of the seventh annual ACM Symposium on Principles of distributed computing. ACM, 1988.
[21] Liskov, Barbara, and James Cowling. "Viewstamped replication revisited." (2012).
[22] Junqueira, Flavio P., Benjamin C. Reed, and Marco Serafini. "Zab: High-performance broadcast for primary-backup systems." 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN). IEEE, 2011.
[23] http://the-paper-trail.org/blog/consensus-protocols-paxos/

Page 32: Distributed Systems Theory for Mere Mortals

Thank you!

Stay tuned for the next episode...