Ordering and Durability in Isis2
1
ORDERING AND DURABILITY IN ISIS2
Ken Birman, Cornell University
2
Isis2 System
Core functionality: groups of objects … fault-tolerance, speed (parallelism), coordination
Intended for use in very large-scale settings
The local object instance functions as a gateway
Read-only operations are performed on local state
Update operations update all the replicas
[Figure labels: myGroup; “join myGroup”; state transfer; update, update]
3
Terminology we’ve used
Process group: A term for a collection of programs that are all running (perhaps on different machines, perhaps on the same machine) and that use Isis2
Each process group has a name (you pick it)
You can have multiple groups in one application
Message: Data encoded to be sent between programs
State transfer: Data to initialize a new group member
Update: Any action that changes the shared data
Lookup: Any action that only queries the data
Multicast: A message sent to every group member
4
Example: Cloud-Hosted Service
[Figure: a standard Web-Services method invocation arrives at some service with members A, B, C, D; a distributed request that updates group “state” is multicast with SafeSend, and then the response is returned]
5
Multicast properties
In the figure, “SafeSend” is a “multicast”: a message that can be sent to a whole group
What properties do these multicasts need to keep the group members consistent?
In Isis2 we focus on
Ordering properties: relative to group membership changes, and relative to other multicasts
Durability guarantees: what happens if a crash occurs?
6
In Isis2 new View upcalls are synchronized relative to message delivery
Key idea: View ordering
7
Membership changes
When a group gains or loses a member, the Isis2 Oracle sequences the new view relative to other multicasts. Thus any multicast is delivered in the same view, from the perspective of all recipients.
Also, if a multicast is sent to the group in some view, it reaches all members of the group (of course, if some crash, they might not process the message)
State transfers occur after every multicast has been delivered in the prior view and before any are delivered in the new view
[Figure: timeline for processes p, q, r, s, t (time 0 to 70); the group view is synchronized relative to multicasts]
8
Message Ordering
The basic idea of Isis2 is to deliver all multicasts in the same order at all group members receiving them
This keeps the data consistent and allows you to implement “state machine” algorithms: group members perform any desired actions in the same state and in the same order
But we offer various implementations of multicast, and some are faster than others. The caveat is that the fast versions can only be used in certain situations, which we’ll discuss.
[Figure: timeline for processes p, q, r, s, t (time 0 to 70)]
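The “state machine” point can be checked with a toy model. This is plain Python, not Isis2 code, and the names (apply_all, agreed_order) are made up for illustration. It only shows why same-order delivery keeps replicas identical, and what goes wrong when one member sees a different order.

```python
# Toy model only: plain Python, not the Isis2 API.
def apply_all(updates, state=0):
    """Apply (op, value) updates in the given order and return the final state."""
    for op, value in updates:
        if op == "add":
            state += value
        elif op == "mul":
            state *= value
    return state

# The order every member agrees on (what a totally ordered multicast provides).
agreed_order = [("add", 5), ("mul", 3), ("add", 1)]
members = ["p", "q", "r"]
final_states = {m: apply_all(agreed_order) for m in members}

# A member that saw the first two updates swapped would end up elsewhere.
swapped = [agreed_order[1], agreed_order[0], agreed_order[2]]
diverged_state = apply_all(swapped)
```

Here every member ends with the same value, while the swapped order gives a different one because add and multiply do not commute.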
9
A multicast arrives in a group…
What information is “the same” for all recipients?
If they call g.GetView(), or remember properties of the most recently delivered view, all see the same view
Also, everyone got the message
And the requested ordering was enforced by Isis2
What aspects might differ for different receivers?
Each has its own “rank” in the membership list, obtained by calling v.GetMyRank() or v.GetRankOf(who)
10
What if a failure happens just as a multicast is being sent?
What about failures?
11
Delayed delivery
In Isis2, a multicast send will often delay (in the platform) for a little while before delivery occurs
As a result, the sender does not know that the group view will be the same when the message is delivered
[Figure: timeline for processes p, q, r, s, t (time 0 to 70). This multicast might have been “sent” in the prior view, when r, s and t weren’t yet members!]
12
How can we know for sure?
Suppose the sender of a Query needs to know how many members processed the query, e.g. to notice that some reply is missing due to a failure. What can it do to know?
One option is to have the receivers include View information (such as how many members were in the View, what rank each replying member had) in the Reply()
The sender is also a receiver, so another approach is for the sender to wait for its own multicast or Query to be delivered and then make note of the View
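A minimal sketch of the first option, in plain Python rather than Isis2 code. It assumes each reply has already been unpacked into a (rank, view_size, value) tuple; that tuple layout and the name missing_ranks are hypothetical, not part of Isis2.

```python
# Toy model only: the reply layout (rank, view_size, value) is an assumption.
def missing_ranks(replies):
    """Return the sorted ranks that did not reply, or None if there are no replies."""
    if not replies:
        return None  # no reply carries the view size, so nothing can be inferred
    view_sizes = {size for _, size, _ in replies}
    if len(view_sizes) != 1:
        raise ValueError("replies disagree about the view size")
    size = view_sizes.pop()
    seen = {rank for rank, _, _ in replies}
    return sorted(set(range(size)) - seen)

# Three replies from a four-member view: the member with rank 2 is missing.
replies = [(0, 4, 20.75), (1, 4, 20.75), (3, 4, 20.75)]
missing = missing_ranks(replies)
```

Because every recipient sees the multicast in the same view, the replies should agree on the view size; the check above treats disagreement as an error.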
13
How do we know who sent a message?
You can just include the sender’s Address in the arguments to the message
Cool Isis2 fact: After you see a View notifying you that some member has failed or voluntarily left the group, you will never receive additional multicasts from that sender!
If a process leaves a group but then tries to send in it, Isis2 throws an exception in that sender.
14
No messages from the dead
In the Isis2 system, you never receive messages from the deceased
Isis2 watches for “late” messages that came from a process which is already considered to have died
It actively blocks such messages and won’t deliver them
Thus if you reconfigure after a failure, and reassign roles, you can’t get a kind of split-brain effect due to late delivery of a message
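The blocking rule can be pictured with a toy receiver. This is plain Python with hypothetical names (Member, install_view, on_arrival); Isis2 does this inside the platform, and its real mechanism is not shown in these slides.

```python
# Toy model only: illustrates "drop anything from a sender not in the current view".
class Member:
    def __init__(self, view):
        self.view = set(view)
        self.delivered = []

    def install_view(self, new_view):
        """A new view was delivered; departed members are no longer in it."""
        self.view = set(new_view)

    def on_arrival(self, sender, payload):
        """Deliver only if the sender is still in the current view."""
        if sender not in self.view:
            return False  # late message from a departed sender: blocked
        self.delivered.append((sender, payload))
        return True

m = Member(view=["p", "q", "r"])
m.on_arrival("p", "update-1")            # p is in the view: delivered
m.install_view(["q", "r"])               # new view reports that p is gone
late_ok = m.on_arrival("p", "update-2")  # arrives late: blocked
```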
15
Ordering Properties
The most important form of message ordering is “total order”
Obtained by using g.OrderedSend or g.SafeSend
They both provide the same ordering guarantee, but they have different durability properties
Everyone receives these in the same order.
[Figure: timeline for processes p, q, r, s, t (time 0 to 70) with multicasts A and B; everyone receives A first and B second]
16
Weaker ordering
Some applications want the lowest possible message latency
OrderedSend will usually achieve this best delay, but not always. (Slower case: when multiple group members are calling OrderedSend concurrently)
SafeSend uses a much slower approach.
For the very best speed, protocols guaranteed to be faster are available: Send and RawSend
17
A FIFO Ordering situation
Suppose one process sends all the multicasts that update some variable in a group. What ordering is really needed?
In this group, only the oldest living member sends multicasts
FIFO suffices!
[Figure: timeline for processes p, q, r, s, t (time 0 to 70). We say that p is the leader; it has rank 0. After p and q fail, r is the leader; it has rank 0 in the new view]
18
A FIFO Ordering Situation
In this group we really only need to deliver messages in the order the leader sent them
For this purpose, the Send primitive is ideal
Send respects the FIFO order its sender used
Guaranteed to be extremely fast
RawSend: like Send, but with no effort to guarantee reliability. Respects FIFO order… unless a message is lost
19
What if two senders use Send?
When different senders use Send, the ordering will depend on when the messages showed up!
Different members might see different orderings
Example: r sees A B … but p sees B A
[Figure: timeline for processes p, q, r, s, t (time 0 to 70) with multicasts A and B from different senders]
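To make the FIFO guarantee concrete, here is a toy check in plain Python (not Isis2 code; respects_fifo and the message names are invented for illustration). Two delivery orders can both respect each sender's FIFO order and still differ from each other.

```python
# Toy model only: checks per-sender FIFO order in a delivery sequence.
def respects_fifo(delivery, sent_by):
    """True if each sender's messages appear in `delivery` in the order sent."""
    for sent in sent_by.values():
        seen = [msg for msg in delivery if msg in sent]
        if seen != sent:
            return False
    return True

sent_by = {"p": ["A1", "A2"], "q": ["B1", "B2"]}

at_r = ["A1", "B1", "A2", "B2"]   # one legal interleaving
at_s = ["B1", "B2", "A1", "A2"]   # another legal interleaving
bad = ["A2", "A1", "B1", "B2"]    # p's messages out of order: not FIFO
```

Both at_r and at_s pass the check even though they are different sequences, which is exactly the situation the slide warns about.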
20
When is FIFO good enough?
Suppose our group manages a collection of data items
Each item has its own leader and only the leader sends updates for that item
Consistency: It suffices to apply updates in the order they were sent. g.Send() will do this!
But beware… Multicasts from different senders can interleave in unpredictable ways
21
When would you use RawSend?
This primitive doesn’t guarantee reliability
We use it when reporting data from real-time sensors
We want the data delivered in order (new data replaces older data). RawSend is still FIFO ordered
But if data is lost, there is no point “wasting time” in the platform retransmitting it.
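A receiver for this kind of sensor feed can be very simple. The sketch below is plain Python, not Isis2 code, and it assumes the sender stamps each reading with a sequence number; that stamp and the LatestReading class are hypothetical additions for illustration.

```python
# Toy model only: keep the newest reading, tolerate gaps from lost messages.
class LatestReading:
    def __init__(self):
        self.seq = -1
        self.value = None

    def on_reading(self, seq, value):
        """Keep the reading only if it is newer than the one we hold."""
        if seq <= self.seq:
            return False  # stale or duplicate: ignore
        self.seq, self.value = seq, value
        return True

sensor = LatestReading()
sensor.on_reading(0, 20.1)
sensor.on_reading(1, 20.4)
sensor.on_reading(3, 21.0)               # reading 2 was lost; nobody retransmits it
stale_kept = sensor.on_reading(2, 20.7)  # would only matter if order were violated
```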
22
What about Query ordering?
Each kind of multicast has an associated Query

Multicast      Matching Query
RawSend        RawQuery
Send           Query
OrderedSend    OrderedQuery
SafeSend       SafeQuery
23
CausalSend
Included mostly for academic reasons, but not used very often in Isis2
Intended for situations in which the leader role moves around for each data item
First p is in charge, then q is the leader for a while, then r, then back to p…
CausalSend will respect the FIFO order “with moving leaders”. But we don’t recommend using it.
24
CausalSend picture: B is “after” A
[Figure: timeline for processes p, q, r, s, t (time 0 to 70) with multicasts A and B]
25
Causality idea
If B “might have been caused by A”, then B is causally ordered after A (we write A → B)
CausalSend tracks these causality dependencies and makes sure that if A → B, then B will be delivered after A
But the Isis2 implementation of CausalSend is slow and this is why it isn’t used very often
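One textbook way to enforce this rule is with vector timestamps. The sketch below is plain Python and shows that general technique only; it is not a description of how Isis2 implements CausalSend, and the class and method names are invented.

```python
# Toy model only: causal delivery using vector timestamps (textbook technique).
class CausalReceiver:
    def __init__(self, n):
        self.clock = [0] * n   # number of messages delivered so far, per sender
        self.pending = []      # messages waiting for their causal predecessors
        self.delivered = []

    def _ready(self, sender, stamp):
        # Next message from `sender`, and nothing it depends on is still missing.
        return all(
            stamp[k] == self.clock[k] + 1 if k == sender else stamp[k] <= self.clock[k]
            for k in range(len(stamp))
        )

    def receive(self, sender, stamp, payload):
        self.pending.append((sender, stamp, payload))
        progress = True
        while progress:        # keep delivering until nothing more is ready
            progress = False
            for msg in list(self.pending):
                s, st, p = msg
                if self._ready(s, st):
                    self.pending.remove(msg)
                    self.clock[s] += 1
                    self.delivered.append(p)
                    progress = True

# Senders are indexed 0 = p, 1 = q, 2 = r. p sends A; q sees A and then sends B.
r = CausalReceiver(n=3)
r.receive(1, [1, 1, 0], "B")    # B arrives first but depends on A: held back
held_back = list(r.delivered)   # nothing delivered yet
r.receive(0, [1, 0, 0], "A")    # A arrives: A is delivered, then B
```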
26
Exactly what happens in the event of a failure?
Durability
27
Durability
A durability guarantee is the property that information will survive a failure
There are several cases to think about
What if the sender of a multicast fails but someone received the multicast?
What if the sender and every receiver (so far) fails?
What if a whole group fails, but later restarts?
What if the group is managing a replicated database or files that aren’t even on the same computers?
[Figure: timeline for processes p, q, r, s, t (time 0 to 70)]
28
Soft State in the Cloud
Many Isis2 applications run in cloud settings, and the cloud favors “soft state”
After a node crashes, the entire VM is reloaded
Thus any local state (even local files) is restored to its original state! All local data vanishes
We say that a group manages “hard state” if the group members can fail and yet their state lives on
In the cloud a hard-state node costs more $$$
29
Two cases thus arise
Durability for soft-state scenarios
Here the entire state “lives in the group members”
They might have files, but the files won’t be preserved if those members crash and later restart, even on the same nodes.
Very common in today’s cloud
Durability for hard-state cases
Here the state really is outside the group
30
Multicast durability
Isis2 offers all-or-nothing delivery guarantees
Either every group member receives your multicast, or no group member receives it, even if the sender fails. As we saw, if a sender fails, its messages will be delivered before Isis2 reports the failure
But this statement didn’t explain what happens when a receiver crashes “instantly”
31
Two options: Optimistic/Pessimistic
Optimistic case (Send, CausalSend, OrderedSend):
Messages are delivered instantly on arrival (low delay)
But if the sender and all receivers with copies fail, an optimistic message is lost forever even though it might have been delivered to some processes right before they crashed
An optimistic protocol always looks like it was all-or-nothing, but if you could see the details, you might see that in fact a message was delivered, but then “forgotten”
32
Optimistic delivery
Consider messages B and C
B was delivered to r, s and t. But it didn’t reach p and q because of a network failure.
C was delivered by p and q but never reached r, s, t
But notice that p and q both crashed
In a soft-state case, no evidence survived (unless they talked to someone outside the group, an external client for example)
In effect, the surviving portion of the system is consistent
[Figure: timeline for processes p, q, r, s, t (time 0 to 70) with multicasts A, B and C]
33
Optimistic delivery is fastest
We deliver messages as soon as they arrive
But the price of this speed (which is a big benefit) is that these two “bad cases” can arise.
Nobody can tell when these things happen, unless p or q talked to an external client
… which leads to the idea of g.Flush(k)
34
How does Flush(k) work?
g.Flush(k) pauses until k group members definitely have all the prior optimistic multicasts.
g.Flush() waits for all members, but this is slow
Normally k=2 or k=3 is fine…
By calling g.Flush(2) or g.Flush(3) before talking to an external client, we can be sure these bad cases will not be visible to that client!
35
With g.Flush(k)…
… those stray delivery events can still occur, but we know that no external observer notices them!
If g.Flush(3) is called prior to talking to the observer, then until there are 3 or more copies of the message, the Flush waits.
In our example the crash would have occurred while we were waiting for g.Flush() to finish
If a tree falls in a forest… If a message is delivered but every process that saw it crashes, the effect is the same as if the message wasn’t delivered!
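The waiting condition can be modelled in a few lines. This is a toy in plain Python, not Isis2 code; ToyGroup, ack and flush_would_return are hypothetical names, and the model only tracks which members hold a copy of each message.

```python
# Toy model only: when would a Flush(k)-style wait be allowed to return?
class ToyGroup:
    def __init__(self, members):
        self.has_copy = {m: set() for m in members}  # member -> message ids held
        self.sent = []

    def send(self, msg_id, reached):
        """Optimistic send: so far the message has reached only `reached`."""
        self.sent.append(msg_id)
        for m in reached:
            self.has_copy[m].add(msg_id)

    def ack(self, member, msg_id):
        """Another member now holds a copy."""
        self.has_copy[member].add(msg_id)

    def copies(self, msg_id):
        return sum(1 for held in self.has_copy.values() if msg_id in held)

    def flush_would_return(self, k):
        """True once every prior message is held by at least k members."""
        return all(self.copies(m) >= k for m in self.sent)

g = ToyGroup(["p", "q", "r", "s", "t"])
g.send("C", reached=["p", "q"])       # only p and q have C so far
before = g.flush_would_return(3)      # two copies: the flush keeps waiting
g.ack("r", "C")                       # a third copy appears
after = g.flush_would_return(3)       # now it may return
```

If p and q crash while the flush is still waiting, the caller never gets to talk to the external client, so the client cannot have seen an update that the group later forgets.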
37
When to call g.Flush(k)
Use this primitive when working with optimistic multicast protocols like Send or OrderedSend
Call it prior to interacting with something outside of the group, like an external client who issued a request
With g.Flush after g.OrderedSend, we get the guarantee that the group won’t forget the update. Without g.Flush, an unlikely failure sequence could cause a problem (sender+first recipients all die).
38
Pessimistic Delivery
SafeSend is much more pessimistic
This protocol is a kind of 2-phase commit
Gives the message to recipients, and they hold it (Two cases: In-memory logging, or on-disk logging)
When all have confirmed receipt, then delivery is authorized
No g.Flush(): it wouldn’t ever need to wait
39
Where’s the durable state?
SafeSend raises a question of where the state lives
For our optimistic protocols, state lives in the group
But Isis2 can also support two more cases
State lives in a checkpoint that will be reloaded if the whole group shuts down and restarts
State lives in a database or in files external to the group
SafeSend with disk logging aims at this second case
40
Should I always use SafeSend?
The SafeSend protocol is very costly and scales poorly, so it isn’t a great choice in the cloud
Also, using it correctly is a bit tricky
Better rule of thumb: use g.OrderedSend+g.Flush
41
Sidebar: Paxos family of protocols
Experts in this area will know about Leslie Lamport’s famous Paxos protocol (Wikipedia has a nice writeup)
It provides ordered, durable “actions”
These are often updates to a replicated database
SafeSend is the Isis2 name for Paxos
You don’t really need to learn about Paxos to understand how SafeSend works, but I’ll include some comments aimed at people who do know about Paxos in this lecture, simply because that work is so famous.
42
How Paxos works
Paxos is basically a kind of 2-phase commit
In the first phase a leader proposes some action (for us, a multicast)
A quorum of group members (the acceptors) need to vote in favor of the proposed ordering for the message, and they need to first save it in a durable place (usually a log that lives on the disk)
In the second phase, delivery occurs (in Paxos: the learners are informed about the new event)
43
Paxos has a notion similar to Flush(k)
In Paxos you can specify the number of “acceptors” that must have a copy of a message before it can be delivered.
In Isis2 this same parameter is available by means of a parameter you can set (g.SetSafeSendThreshold(k))
SafeSend is a true implementation of Paxos if this number is more than half the group members.
With k smaller, like k=2 or k=3, but in a big group, SafeSend starts to act exactly like g.OrderedSend()+g.Flush(k)
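The “more than half” condition is easy to get off by one, so here it is spelled out. This is plain Python arithmetic with hypothetical helper names; it does not call Isis2.

```python
# Toy arithmetic only: when is a threshold k "more than half" of the group?
def is_true_paxos_threshold(k, group_size):
    """True if k acceptors are strictly more than half of the group."""
    return 2 * k > group_size

def smallest_majority(group_size):
    """Smallest k that is strictly more than half of the group."""
    return group_size // 2 + 1

# In a 32-member group, k=3 is far below a majority; 17 is the smallest majority.
small_k_is_majority = is_true_paxos_threshold(3, 32)
majority_for_32 = smallest_majority(32)
```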
44
Isis2: Send vs. SafeSend
Send scales best, but SafeSend with modern disks (RAM-like performance) and small numbers of acceptors isn’t terrible.
45
Jitter: how “steady” are latencies?
[Figure: variance from mean, 32-member case]
The “spread” of latencies is much better (tighter) with Send: the 2-phase SafeSend protocol is sensitive to scheduling delays
46
[Figure: Flush delay as a function of shard size]
Flush is fairly fast if we only wait for acks from 3-5 members, but is slow if we wait for acks from all members. After we saw this graph, we changed Isis2 to let users set the threshold.
47
Putting our insights to work…
Several ways to make data durable
48
Checkpointing
Any group can be made durable using a checkpointing file
Call g.Persistent(filename)
Checkpoint will periodically be saved, or you can force the creation of checkpoints at times convenient to you
Entire group shares a single checkpoint file and it would normally live in the global file system. It should not live in any sort of soft-state file system!
On restart from a total shutdown, checkpoint is reloaded and the group recovers to its old state
49
External databases
If a group is being used to replicate something like a set of external MySQL databases, recovering the group state just isn’t good enough
We also need to make sure the MySQL replicas are in identical states after a recovery
This is the case where we use SafeSend with the disk logging option enabled
50
What is the disklogger?
The disklogger is a special form of logged checkpoint, similar to the one used for g.Persistent()
But whereas normally there is just one durability log, this log is replicated with one copy per acceptor
Messages delivered by SafeSend are appended to this log during phase one
When an acceptor restarts, its log is scanned and “replayed”. Isis2 will garbage collect a message once all the learners have seen it
51
Example: Cloud-Hosted Service
[Figure: a standard Web-Services method invocation arrives at some service with members A, B, C, D, each with its own DB; a distributed request that updates group “state” is multicast with SafeSend, and then the response is returned]
Use the Isis2 version of Paxos to replicate an external database
52
Example: Cloud-Hosted Service
[Figure: a standard Web-Services method invocation arrives at some service with members A, B, C, D, each holding an in-memory collection; a distributed request that updates group “state” is multicast with Send, followed by g.Flush(), and then the response is returned]
Cheaper multicast+Flush suffices with in-memory replicas or other situations with soft state, like files local to the replicas on VMs that will be reloaded if a crash occurs
53
Check your understanding
Suppose we use SafeSend as shown in the figure, with 4 group members, and all are acceptors
You send 1 message. How many disk writes occur?
At least 4 (one per log) and perhaps 8 (the database may have a log too). Also, the database needs to be updated!
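The count can be written out as arithmetic. The sketch is plain Python with a made-up helper name, and it assumes one database replica per acceptor, as the figure with four DBs suggests; the database update itself is not counted.

```python
# Toy arithmetic only: disk writes caused by one SafeSend with disk logging.
def disk_writes(acceptors, db_has_log):
    """One SafeSend log append per acceptor, plus one DB log write each if present."""
    safesend_log_writes = acceptors
    db_log_writes = acceptors if db_has_log else 0  # assumes one DB per acceptor
    return safesend_log_writes + db_log_writes

at_least = disk_writes(4, db_has_log=False)
perhaps = disk_writes(4, db_has_log=True)
```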
54
Recovery with an external database is a pain!
g.SetDurabilityMethod
55
SetDurabilityMethod
You must tell SafeSend to use the DiskLogger durability method
When you do this, SafeSend has an extremely strong guarantee: it won’t ever forget messages until it is explicitly told to do so by your code
This yields a version suitable for use when replicating a database
56
Recovering a database replica
After restarting a failed database replica, SafeSend with the DiskLogger durability method will replay all messages that it knows about
Your job is to make sure all of these updates have been applied to the database, exactly once
After that you tell SafeSend it can safely garbage collect these messages, and it does so when every group member has told it that the message is safe to garbage collect (at that point, it truncates the disk log)
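“Exactly once” is the tricky part: a replayed message may or may not have reached the database before the crash. One common way to handle this is to remember which message ids were already applied. The sketch is plain Python with invented names (ReplicaDatabase, apply_once, replay); it is not the Isis2 DiskLogger interface.

```python
# Toy model only: make replay idempotent by remembering applied message ids.
class ReplicaDatabase:
    def __init__(self):
        self.balances = {}
        self.applied_ids = set()

    def apply_once(self, msg_id, account, amount):
        """Apply the update unless this message was already applied."""
        if msg_id in self.applied_ids:
            return False
        self.balances[account] = self.balances.get(account, 0) + amount
        self.applied_ids.add(msg_id)
        return True

def replay(db, log):
    """Replay logged updates; return the ids that are now safe to garbage collect."""
    for msg_id, account, amount in log:
        db.apply_once(msg_id, account, amount)
    return [msg_id for msg_id, _, _ in log]

db = ReplicaDatabase()
db.apply_once(1, "Harry", 10)               # reached the database before the crash
log = [(1, "Harry", 10), (2, "Harry", 5)]   # what the log replays after restart
safe_to_collect = replay(db, log)
```

In a real database the applied-id record and the update would need to be written in the same transaction, otherwise a crash between the two brings the problem back.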
57
Why not always use SafeSend?
SafeSend is harder to use
Must write code to handle replay of the log after recovery.
And SafeSend is also slower
Many people who assume Paxos is lightweight are surprised that all Paxos systems have high costs
Paxos is really a kind of durable database: a database of messages!
58
Durability Summary
To recap:
If your application maintains data purely inside the members of the group, or purely in memory, you can use the standard “optimistic” methods
Call g.Flush(k) if worried about the tree-in-the-forest case
Use checkpointing to a log (g.Persistent()) to make the group state survive complete shutdowns
But switch to SafeSend for the strongest durability requirements. You’ll need to enable the DiskLogger durability method, and to write code to handle restarts and to tell SafeSend when it can garbage collect the log.
59
How does one make a checkpoint?
Making Checkpoints
60
State transfer
In general, group members manage data (state)
When s and t join in this example, they don’t have the current state for the group. They obtain it via a state transfer: the white arrow.
In this example, p “writes down” its state (a checkpoint)
Then s and t “load” the state (they read the checkpoint)
[Figure: timeline for processes p, q, r, s, t (time 0 to 70); the white arrow is a state transfer]
61
Making a checkpoint
You can save any state you wish
You can call SendChkpt as many times as needed

// Assumed declarations (not shown on the slide): delegate types matching each loader's signature
delegate void loadichkpt(int what);
delegate void loaddchkpt(double what);

int istuff;
double dstuff;

g.MakeChkpt += (Isis.ChkptMaker)delegate(View nv) {
    g.SendChkpt(istuff);   // Checkpoint a single integer
    g.SendChkpt(dstuff);   // Checkpoint a single floating point value
    g.EndOfChkpt();        // Finished making the checkpoint
};
g.LoadChkpt += (loadichkpt)delegate(int what) {
    IsisSystem.WriteLine(name + ": Got integer checkpoint: istuff=" + what);
    istuff = what;
};
g.LoadChkpt += (loaddchkpt)delegate(double what) {
    IsisSystem.WriteLine(name + ": Got double checkpoint: dstuff=" + what);
    dstuff = what;
};
62
Steps
The MakeChkpt method is called from time to time in your program.
You can control exactly when this will happen
That updates the log files
Later, after restart, the LoadChkpt method(s) will be called to reload the saved state
63
To make a group persistent, store it in a global file system
It will be loaded into the NEXT instance that runs

int istuff;
double dstuff;

g.MakeChkpt += (Isis.ChkptMaker)delegate(View nv) {
    g.SendChkpt(istuff);   // Checkpoint a single integer
    g.SendChkpt(dstuff);   // Checkpoint a single floating point value
    g.EndOfChkpt();        // Finished making the checkpoint
};
g.LoadChkpt += (loadichkpt)delegate(int what) {
    IsisSystem.WriteLine(name + ": Got integer checkpoint: istuff=" + what);
    istuff = what;
};
g.LoadChkpt += (loaddchkpt)delegate(double what) {
    IsisSystem.WriteLine(name + ": Got floating point checkpoint: dstuff=" + what);
    dstuff = what;
};

Note: You must also call myGroup.Persistent(gname);
This tells Isis2 to keep checkpoints in a file (in this case with the same name as the group).
There are also ways to control when the checkpoint will be made
64
Why did we register two loaders?
Isis2 is polymorphic
Each method can be defined many times with different type signatures
As events occur, upcalls are done to the ones that match
In our examples we had just one argument to SendChkpt(), but we could have given many:
g.SendChkpt(x, y, z, ....);
Any data type is allowed but you must register user-defined types with Isis2 first
65
State transfer uses checkpoints!
If the checkpoint methods are defined, Isis2 will ask for a checkpoint just as a new member joins
The old member makes the checkpoint
The new member loads it
This initializes the joining member
[Figure labels: myGroup; state transfer; update, update]
66
Can we tell what a checkpoint will be used for? Can we do “per use” checkpoints?
Persistent or just State Transfer?
67
What are checkpoints used for?
When you define a checkpoint create/load method, that automatically enables state transfer for joining members
With g.Persistent(), a checkpoint plays two roles: it is also logged into a recovery log file that will be reread after recovery from a total shutdown
68
State transfer could be s..l..o..w..
And while it happens, the group freezes up!
What if the group state is large?
[Figure: timeline for processes p, q, r, s, t (time 0 to 70) with multicasts A and B]
69
What if the state is very large?
Really large states can be slow to transfer. While they are being sent, the group itself might hiccup
Best solution? Pre-transfer that huge state, perhaps using the highly efficient “Isis OOB” tool
Out of band transfer is minimally disruptive and faster too because the Isis2 system optimizes heavily for this
But perhaps a few updates might occur after the pre-transfer and before the member is added.
So you can include an argument to Join that tells how big the pre-transfer was, or what “time” it was made.
Then the checkpoint only needs to include the delta!
70
Pretransfer
In this picture we send data to r, s and t “out of band”
Isis2 has a tool for that, the OOB file transfer tool. Ideal for big copying
[Figure: timeline for processes p, q, r, s, t (time 0 to 70). When they join, we send just the residual delta…]
71
Enabling this feature
Instead of calling g.Join(), call g.Join(offset)
Offset tells the group how much of the state you have.
It shows up in the View argument to the make checkpoint method
Offset 0 means “send the whole state”
Example: pretransfer included updates 0… 12345. So you call g.Join(12345). The state transfer contains just updates 12346-12348…
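On the checkpoint-making side, the offset simply selects which updates still need to be sent. The sketch below is plain Python, not Isis2 code; it assumes the state is a numbered list of updates and that the offset is the number of the last update the joiner already has. The function name delta_for_joiner is hypothetical.

```python
# Toy model only: choose what to put in the checkpoint for a joining member.
def delta_for_joiner(updates, offset):
    """updates: list of (number, payload). Offset 0 means send the whole state."""
    if offset == 0:
        return list(updates)
    return [(n, payload) for n, payload in updates if n > offset]

# The group has applied updates 0..12348; the pre-transfer covered 0..12345.
updates = [(n, "u%d" % n) for n in range(12349)]
delta = delta_for_joiner(updates, 12345)
whole = delta_for_joiner(updates, 0)
```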
72
What happens in an application that experiences many “events” all at the same time?
When does State Transfer occur?
Isis2 has a strong consistency model: a new form of virtual synchrony.
73
Virtual synchrony is a “consistency” model:
Membership epochs: begin when a new configuration is installed and reported by delivery of a new “view” and associated state
Protocols run “during” a single epoch: rather than overcome failure, we reconfigure when a failure occurs
[Figure: three panels. Non-replicated reference execution (A=3, B=7, B = B-A, A=A+1); synchronous execution and virtually synchronous execution, each a timeline for processes p, q, r, s, t (time 0 to 70)]
74
What Isis2 ensures is that...
State transfer “seems” to occur at the instant when a new view is delivered (all prior multicasts have already been performed)
This means that the member preparing the state has the correct values for state variables needed by the joining member!
It is “safe” to send this state
If desired, there is a way for you to specify which member will send state to each joining process
75
How do Queries handle failure?
Queries when failures occur…

Group g = new Group("myGroup");
Dictionary<string,double> Values = new Dictionary<string,double>();
g.ViewHandlers += delegate(View v) {
    Console.Title = "myGroup members: " + v.members;
};
g.Handlers[UPDATE] += delegate(string s, double v) {
    Values[s] = v;
};
g.Handlers[LOOKUP] += delegate(string s) {
    g.Reply(Values[s]);
};
g.Join();

g.Send(UPDATE, "Harry", 20.75);

List<double> resultlist = new List<double>();
nr = g.Query(ALL, LOOKUP, "Harry", EOL, resultlist);

First sets up group
Join makes this entity a member. State transfer isn’t shown
Then can multicast, query. Runtime callbacks to the “delegates” as events arrive
Easy to request security (g.SetSecure), persistence
“Consistency” model dictates the ordering seen for event upcalls and the assumptions user can make
76
77
This example used g.Reply
Also available:
g.AbortReply(): throws exception in the Query caller
g.NullReply(): Member doesn’t contribute any value but the caller won’t wait for it (useful with ALL)
g.NoReply(): A risky option, like NullReply but no message of any kind is sent to the caller
Query can also specify an Isis “Timeout”
new Timeout(delay_ms, action)
Action is: TO_NULLREPLY, TO_FAILURE, TO_ABORT
78
How can a caller sense missing replies?
The caller is told how many replies it got
If you expected 3 but got 2, either someone failed, or they used g.NullReply() to “opt out”
But when you issue the Query you won’t know who is going to be in the group at the time of delivery!
This is why it often makes sense for replies to specify that “this is reply R of N” (R=rank, N=size of view)
79
Lecture Summary
Isis2 gives you control over
How durable multicasts and group data will be
How strongly ordered they will be
Whether to wait until a multicast has reached k of the destinations before you talk to external observers
Using these forms of control, you can program exactly the behavior you need in a given setting