A Protocol for Reversible Distributed Computation

Geoffrey Brown∗ Amr Sabry∗ William J. Bowman†

(∗) Indiana University (†) Northeastern University

Contact Information. Geoffrey Brown, School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA. Email: [email protected]. Tel: 812-855-4207.

Abstract. We present a new protocol for realizing reversible distributed programs with CSP style communication. While this work was motivated by the desire to implement reversible process algebras, the protocol may have other interesting uses in distributed systems including checkpointing, speculative real-time simulation, and transaction synchronization. The protocol may be efficiently implemented in hardware for tightly coupled (e.g. multi-core) systems: it is insensitive to communication delays and has minimal communication infrastructure requirements, only assuming in-order message delivery. The core of our presentation is the protocol and its verification. We additionally present a small reversible language, embedded in Ruby, to illustrate the potential for our protocol in implementing solutions to various distributed synchronization problems.

Regular paper. Not eligible for best student paper award.

1 Introduction

Speculative execution either by intent or through misfortune (in response to error conditions) is pervasive in system design and yet it remains difficult to handle at the program level. The state of the art in “speculative distributed computing” was summarized by Guerraoui in an invited talk at the 2010 International Symposium on Distributed Computing in three main points [14]:

• speculation is the norm in practical distributed algorithms;

• speculative algorithms are intricate to write and hard to reason about because speculative computations usually intermingle the various execution paths with concurrency and communication in an unprincipled way;

• a theory of speculation should tease apart these various aspects. Specifically, (a) the speculation, (b) the speculative algorithm, and (c) its backup should all be first-class, separate citizens of a distributed computation.

Indeed, we find that despite the importance of speculative computation, there is very little programmatic support for it in distributed languages at the foundational level it deserves. We note that, from a programming language perspective, speculative execution requires a backtracking mechanism and that, even in the sequential case, backtracking in the presence of various computational effects (e.g. assignments, exceptions, etc.) has significant subtleties. The introduction of concurrency additionally requires a “distributed backtracking” algorithm that must “undo” the effects of any communication events that occurred in the scope over which we wish to backtrack. While this has been successfully accomplished at the algorithmic level (e.g. in virtual time based simulation [20, 19]), in models of concurrent languages (e.g. RCCS [6, 7]), and in some restricted parallel shared-memory environments (e.g. [16]), it does not appear that any concurrent languages based upon message passing have directly supported backtracking with no restrictions.

In this paper, we introduce a programming model for a distributed language with message-passing concurrency that treats “speculation” as a first class construct. The focus on message-passing concurrency is justified by the emergence of massively parallel multicore processors with high-bandwidth message passing. In this domain, speculative execution can be used to eliminate sequential bottlenecks by performing work ahead of synchronization constraints and discarding this speculative work if it violates consistency requirements. To facilitate the task of programming speculative algorithms in this context, we isolate the backtracking mechanism using two high-level programming primitives inspired by the theory of stabilizers [34]. The two constructs are stable, which creates a scoped region, and backtrack, which reverts execution to the dynamically closest stable region, automatically undoing all communication effects that are invalidated by backtracking. Significantly, and unlike the original algorithm for stabilizers, the last step is realized using a distributed algorithm requiring no global synchronization.

The key contribution of this paper is a protocol for handling point-to-point synchronous communication that enables a clean implementation of scoped rollback in a distributed program. As motivation, we introduce a high-level reversible language in Sec. 2. We present our protocol and its formal model with automated verification in Sec. 3 and Sec. 4, respectively. We conclude with a discussion of related work in Sec. 5 and discuss applications for our protocol and language as well as future research directions in Sec. 6.

2 A Reversible Language

To motivate our verified distributed backtracking protocol, we start with a prototype language with “first-class speculation,” i.e., with programming constructs that enable direct control over speculative execution. The language is embedded in Ruby via a library that implements the protocol using first-class continuations. Although the implementation is not verified and is not distributed (or even multi-threaded), it is faithful to the protocol described in Sec. 3 and can, in principle, be verified and implemented natively. The main purpose of the language is to provide a convenient mechanism to illustrate some of the subtleties of the protocol in action; the language will also provide the testbed for some of the larger applications suggested in Sec. 6.

Like in CSP [17], a program in our language consists of a set of cooperating processes that communicate over point-to-point synchronous channels. Communication occurs when two processes, a sender and a receiver, agree to communicate; when communication occurs, a value is transferred and both processes complete the communication simultaneously. An important aspect of CSP is the concept of choice — a process may wait to communicate on a set of channels, accepting the first option that is viable. In our implementation, as with most other distributed implementations of CSP languages (e.g. Occam [4]), choice is restricted to input. While some shared-memory CSP implementations support generalized choice (e.g. CML [31, 30]), it is inefficient to implement in a distributed environment.
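In the Ruby embedding, input choice is written with Csp.alt, the construct used by the memory process in Appendix B. The following sketch is illustrative only (the process and channel names are made up for this example); it waits on two input channels and runs the guard of whichever communication happens first:

def logger(c1, c2)
  loop do
    Csp.alt [
      c1.g { |v| puts "got #{v} on c1" },  # guard runs if c1 communicates first
      c2.g { |v| puts "got #{v} on c2" }   # guard runs if c2 communicates first
    ]
  end
end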

Due to space limitations, we restrict our language presentation to the single example illustrated in Listing 1. Lines 1–17 and 19–26 define the code to be executed by two processes named p1 and p2. Each process is parametrized by the channels it can use for communication. The entry point of the program (not shown) creates an additional “root” process that creates three named channels, forks the two processes, and blocks until both terminate. The main novelty of the example is in the stable region for p1 which is initiated at Line 4 and continues to Line 15. The backtracking request by p1 (Line 13) causes p1 to jump to Line 4 invalidating the snd and rcv actions at Lines 8 and 9. This, in turn, invalidates the corresponding rcv and snd actions in p2 at Lines 21 and 24.

In more detail, the output corresponding to a sample run of the program is in Listing 2. The output shows that process p1 offers a sequence of values (Line 2) to process p2 and p2 responds to each offered value with either “ok” or “not ok.” In the former case, p1 sends a termination message (Line 16) and both processes terminate. In the latter case (“not ok”), p1 backtracks which causes p2 to backtrack to the beginning of its (implicit) stable region.

Operationally, a scoped checkpoint pushes a continuation on a checkpoint stack along with information about the communication history (to be discussed along with our protocol). A process may decide to backtrack which has the effect of restoring the process (control) state and unwinding any communication events that occurred after the checkpoint. In general, backtracking can lead to a cascade of backtracking. Our goal is to enable correct programs that utilize checkpoints — by design, we leave the decision of when to checkpoint up to the programmer. Notice that in this example language, checkpoints, which are implemented using continuations, do not restore data state or IO actions. The array vals, which is modified within the stable region, is not restored, and the print statements are not revoked. Thus, it is up to the programmer to decide what data and what actions other than communication are preserved by checkpoints. This important, but orthogonal, aspect of backtracking is abstracted away in our presentation to focus on communication effects.
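As a rough illustration of this operational reading, the following sketch layers stable and backtrack over Ruby’s first-class continuations. It is not the library’s actual code: the checkpoint stack, channel_clocks, and unwind_channels are hypothetical names standing in for the machinery described with the protocol in Sec. 3:

require 'continuation' # provides callcc

CONTEXTS = [] # per-process checkpoint stack (illustrative)

def stable(&body)
  # Capture a continuation at the start of the region and save it
  # together with the current channel timestamps (hypothetical helper).
  callcc { |k| CONTEXTS.push([k, channel_clocks]) }
  body.call      # (re)execute the body of the stable region
  CONTEXTS.pop   # completed normally; discard the checkpoint
end

def backtrack
  k, saved = CONTEXTS.last # dynamically closest stable region
  unwind_channels(saved)   # reverse transactions on channels "ahead" of it
  k.call                   # jump back to the start of the region
end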


 1 def p1(c12,c21,cdone)
 2   vals = [3,2,1]
 3   puts " #{Csp.name} at start"
 4   Csp.stable {
 5     puts " #{Csp.name} at stable; vals = #{vals}"
 6     v = vals.shift
 7     puts " #{Csp.name} sending #{v}"
 8     c12.snd v
 9     res = c21.rcv
10     if res == "not ok"
11       flag = false
12       puts " #{Csp.name} received \"#{res}\"; backtracking"
13       Csp.backtrack
14     end
15   }
16   cdone.snd 0 # to tell partner to terminate
17 end
18 #---------------
19 def p2(c12,c21,cdone)
20   puts " #{Csp.name} at start"
21   v = c12.rcv
22   res = (v == 1) ? "ok" : "not ok"
23   puts " #{Csp.name} received #{v}; sending \"#{res}\""
24   c21.snd res
25   cdone.rcv
26 end

Listing 1: Example 1

p1 at start
p1 at stable; vals = [3, 2, 1]
p1 sending 3
p2 at start
p2 received 3; sending "not ok"
p1 received "not ok"; backtracking
p1 at stable; vals = [2, 1]
p1 sending 2
p2 at start
p2 received 2; sending "not ok"
p1 received "not ok"; backtracking
p1 at stable; vals = [1]
p1 sending 1
p2 at start
p2 received 1; sending "ok"
Listing 2: Example 1 Output

The design of our language is inspired by the stabilizers of CML [34], but it is important to note some crucial differences. At a linguistic level, the original stabilizers work does not support CSP style communication and hence has no notion of communication choice. More substantially, the stabilizer implementation tracks communication events in a centralized graph and backtracking requires a centralized computation. In contrast, the protocol we present in the next section allows backtracking to be performed in a completely distributed manner. The necessary mechanism is built into the channel implementation.

3 Reversible Protocol

Our fundamental approach to reversibility depends upon logical clocks [21]. Each process maintains a private clock which it increments at key synchronization events. When a process creates a checkpoint, it increments its clock and associates that time with the checkpoint. When two processes synchronize, they increment their clocks to a common value which is greater than their current times; this new value is associated with the channel which records the time of the last synchronization event. Whenever a process creates a checkpoint, it saves the clock values of all its channels. Thus, the space requirements for handling reversibility depend upon programmer defined granularity. Reverse computation is handled in a dual fashion. A process that wishes to backtrack restores its most recent checkpoint and communicates, in parallel, along all channels which are “ahead” of the restored time, forcing its partners to backtrack. Backtracking is an iterative process: a process may be forced, by its partners, to backtrack multiple times.
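The following sketch makes the clock discipline concrete. It is illustrative only — the classes and field names are stand-ins, not our implementation:

Channel = Struct.new(:clock)  # a channel records the time of its last synchronization

class Node
  attr_accessor :clock
  def initialize(channels)
    @clock    = 0
    @channels = channels
  end

  # Creating a checkpoint bumps the private clock and records the
  # current timestamp of every incident channel.
  def checkpoint
    @clock += 1
    { time: @clock, chans: @channels.map { |c| [c, c.clock] }.to_h }
  end

  # Synchronizing moves both processes to a common time strictly
  # greater than either current clock; the channel records it.
  def synchronize(partner, chan)
    t = [@clock, partner.clock].max + 1
    @clock = partner.clock = chan.clock = t
  end
end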

The correctness of the forward computation should be relatively clear as it satisfies Lamport’s logical clock model [21]. Correctness of the reverse computation is less obvious because it depends on the state restored with the context. If no state is restored, then the reverse computation can be viewed as erasing events; however, if the processes utilize knowledge gained from the abandoned communication events then it is possible to reach states that could not be reached in a purely forward computation. Such semantic issues are beyond the scope of this paper. Instead, we focus on a correct channel implementation that supports both forward and reverse communication while maintaining sensible channel state.

[Figure 1 depicts the handshake: the sender drives req and data, the receiver drives ack, and curved arrows indicate the four transitions of one transaction.]

Figure 1: Hardware Handshake

Our channel protocol is derived from the widely used handshake protocol illustrated in Fig. 1 in which a sender generates request (req) and data, and a receiver generates the signal (ack). In general, this protocol supports bidirectional data transfer. This particular implementation utilizes 4-phase or return-to-zero (RZ) signaling requiring four events (two round-trips) for each transaction. The transitions are indicated by the curved arrows. The sender initiates a transaction by setting the data signals and raising its req signal. The receiver acknowledges receipt of the data by raising its ack signal. The remaining transitions on req and ack restore the initial state. While data transfer can be more efficiently implemented with two control transitions (where a new transition begins with both req and ack high), this 2-phase or non-return-to-zero (NRZ) signaling approach does not fundamentally change the protocol, but it does complicate the presentation. The widely used alternating-bit protocol for distributed data transmission is an example of an NRZ implementation of this protocol. In a distributed environment, the signals can be realized by messages: the sender sends tuples (req,data) and receives (ack). As long as these are sent along lossless paths and delivered in-order, the same protocol will work. With timeouts and re-transmission, the protocol will work in the face of message loss.
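As a small illustration of this message-passing reading of Fig. 1, the sketch below runs one 4-phase transaction over two in-order queues, one per direction. It is a toy under stated assumptions (plain Ruby threads and core queues; the symbols :req, :ack, etc. are made up for this example):

to_rx = Queue.new # sender -> receiver (in-order, lossless)
to_tx = Queue.new # receiver -> sender

sender = Thread.new do
  to_rx << [:req, 42]       # raise req, carrying the data
  to_tx.pop                 # wait for ack to rise
  to_rx << [:req_down, nil] # return-to-zero: lower req
  to_tx.pop                 # wait for ack to fall
end

receiver = Thread.new do
  _, data = to_rx.pop       # req is high; capture the data
  to_tx << :ack             # raise ack
  to_rx.pop                 # req has fallen
  to_tx << :ack_down        # lower ack: back to the initial state
end

[sender, receiver].each(&:join)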

We extend this basic handshake protocol with timestamps, reverse transactions, and a way for each partner to signal to the other that it wishes to engage in a reverse transaction. In the extended protocol, the sender maintains a request req and a version of the channel clock txclock. Similarly the receiver maintains ack and rxclock. During a communication transaction, the two clock values are exchanged and updated so that the required logical clock properties are maintained. For example, for forward transactions, the sender suggests a new clock value (greater than its local clock) and the receiver responds with a new clock value (also later than its local clock and at least as large as the value suggested by the sender). Upon return to the idle state, these two clocks agree.
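A worked instance of this exchange, with values chosen purely for illustration; both clocks start equal in the idle state:

txclock = rxclock = 4                # idle: clocks agree
txclock = 7                          # (0) sender proposes any value > 4
rxclock = [rxclock + 1, txclock].max # (1) receiver picks a value >= txclock; here 7
txclock = rxclock                    # (2) sender adopts the receiver's value
# (3) receiver returns to idle; both clocks now equal 7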

Rather than using the two control values of the handshake protocol of Fig. 1, our extension uses three control values: “I” (idle), “F” (forward), and “R” (reverse). The reachable states of our protocol are illustrated in Fig. 2. The states are labeled with three symbols indicating the control values of the sender and receiver and the relative order of txclock and rxclock. For example, a state (F,I) > indicates that the sender is in the forward state, the receiver is in the idle state, and that the sender’s logical clock txclock is greater than the receiver’s rxclock. For the moment, ignore the “dashed” portion of the diagram as these relate to reverse transactions. The solid portions of the state diagram correspond to the 4-phase handshake, with the exception that sender clock values change twice through the handshake. The edge labels correspond to transitions of the formal model described in Sec. 4 and given in full in Appendix A.¹ In more detail, the two processes start in the idle states with their clocks equal (state (I,I) =). Transition (0) results in a state in which the sender has selected a larger value for its clock and has set its control state to F. In the next state (following transition (1)), the receiver responds with a new clock value and sets its control state to F. After two more transitions both processes return to the idle state with their clocks synchronized.

[Figure 2 shows the reachable states, labeled (sender control, receiver control) with the clock order: (I,I)=, (F,I)>, (F,F)≤, (I,F)=, (R,I)<, (R,R)≥, (I,R)=, and (F,R)>, connected by edges labeled (0)–(6); the reverse-transaction states and edges are dashed.]

Figure 2: Protocol States

The “dashed” portion of Fig. 2 corresponds to reverse transactions. In a reverse transaction, the sender and receiver agree upon a logical time earlier than the current channel time. It is important to note that, because of cascading backtracking events, it may be necessary to perform multiple reverse transactions before a stable state is reached. This cascade is guaranteed to terminate by the fact that there is always at least one stable state (at time 0).

One deficiency of the protocol as described is that a sender that is blocked waiting for a receiver cannot initiate a reverse computation. Similarly, as described, the receiver can be blocked from initiating a reverse transaction. These are potential sources of deadlock. We resolve this deficiency by adding two boolean state variables: rxint for the sender (“interrupt receiver”) and txint for the receiver (“interrupt sender”) that can be used by a blocked process to signal to its partner that it would like to execute a reverse transaction. The details are omitted due to space constraints, but are included in the SAL model.

¹Note to referees: the final version will provide a URL to the model rather than the appendix in this draft.

Returning to our Ruby-based language, we implemented communication using this protocol. Each channel was implemented by the state variables described. Each process has two states: forward and backward. A process enters the backward state through an explicit backtrack event or when it detects (through channel variables) that a partner wishes to backtrack. In the implementation, whenever a task enters the scheduler (through a communication event or an explicit yield), it polls the state of its channels to detect such requests. A process in the backward state executes any required reverse communication transactions or backtrack events in a poll/yield cycle. While this is not the most efficient implementation, it sufficed to validate the basic approach. One limitation of our implementation is that it uses cooperative coroutines – an uncooperative process can block all others. In a distributed implementation, polling of the channels should be performed in a preemptive kernel.

As mentioned previously, stable regions are implemented by saving a continuation on a context stack. In addition, a saved context consists of the current process timestamp, and the timestamps of all channels in a process’s input and output set. By comparing these saved channel timestamps with the channel state, it is easy for a reversing process to determine whether a stable state has been reached that is consistent with the current context, or whether it needs to backtrack further (by popping its context stack). Notice that the amount of saved communication state is modest: one timestamp for each channel incident on a process.
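The comparison can be sketched as follows. This is a deliberate simplification that glosses over the asynchrony of the actual reverse transactions; contexts and the helper reverse_transaction are hypothetical names in the style of the earlier sketches:

def backtrack_to_consistent(contexts, channels)
  loop do
    ctx = contexts.last
    # Channels still "ahead" of the checkpoint must be reversed.
    channels.each do |c|
      reverse_transaction(c, ctx[:chans][c]) if c.clock > ctx[:chans][c]
    end
    if channels.any? { |c| c.clock < ctx[:chans][c] }
      # A partner reversed a channel to before our saved time, so this
      # checkpoint is itself invalidated: backtrack further.
      contexts.pop
    elsif channels.all? { |c| c.clock == ctx[:chans][c] }
      return ctx # stable state consistent with the current context
    end
  end
end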

4 Verification

In this section we describe the formal model of our protocol and its automated verification. Throughout the discussion, we use the syntax of SAL [9] to describe our protocol. To simplify the discussion, some details are elided. The SAL model (see Appendix A) should be considered the definitive version of the protocol. The protocol is defined with two types, logical time represented using natural numbers, and control states modeled using the constants I, F, or R:

TIME  : TYPE = NATURAL;
STATE : TYPE = { I, F, R };

The sender state is defined by three state variables: req which represents the control state, txclock which represents the local logical clock, and rxint which is used to signal to the receiver that the (blocked) sender would like to perform a reverse transaction:

OUTPUT req     : STATE   % initially I
OUTPUT txclock : TIME    % initially 0
OUTPUT rxint   : BOOLEAN % initially false

The receiver state is similarly defined by three state variables:

OUTPUT ack     : STATE   % initially I
OUTPUT rxclock : TIME    % initially 0
OUTPUT txint   : BOOLEAN % initially false

To initiate a forward transaction, the sender sets req to F and selects a new (greater) value for txclock. This transition is permissible only if the ack is I. (Note the numbers at the end of each transition are not part of the protocol; rather they correspond to the edges in Fig. 2):


(ack = I) AND (req = I) --> req' = F;                             % (0)
                            txclock' IN { x : TIME | x > txclock}

In the SAL language, transition rules are simply predicates defining pre- and post-conditions and the “next state” of req is req'. Symbols --> and ; are syntactic sugar and are logically equivalent to AND. Thus the above transition formalizes the following situation: if both the sender and receiver are currently idle, then the sender could move to the state F with a new clock value that is greater than its current clock. If the receiver wishes to complete the request, it responds by setting ack to F and selecting a time (rxclock) at least as great as that of the txclock:

(req = F) AND (ack = I) --> ack' = F;                              % (1)
                            rxclock' IN { x : TIME | x >= txclock};

Completing the forward transaction requires two more transitions on the sender and receiver:

(ack = F) AND (req = F) --> req' = I;                              % (2)
                            txclock' = rxclock;

(req = I) AND (ack /= I) --> ack' = I                              % (3)

Thus a forward transaction involves four events (messages) and results in incrementing the channel time stamp. We note that while this model uses state variables, it is more appropriate to imagine that the sender and receiver send copies of their channel states to the other partner. The complete model includes buffers in each direction (the txbuf and rxbuf modules of Appendix A) to simulate the corresponding delay.

Reverse communication, as illustrated by the “dashed” portion of Fig. 2, is somewhat analogous to forward communication. In the common case, the sender initiates a reverse communication, selecting a new time for the channel that is earlier than the current time:

(ack = I) AND (req = I) --> req' = R;                              % (4)
                            txclock' IN { x : TIME | x < txclock}

The receiver acknowledges the reverse transaction:

(ack = I) AND (req /= I) --> ack' = R;                             % (5)
                             rxclock' IN {x : TIME | x <= txclock AND x < rxclock};

Notice that the same action can be used by the receiver to respond to a forward transaction request by the sender and hence initiate a reverse transaction. The sender returns to idle from either system state (R,R) or (F,R):

(ack = R) AND (req /= I) --> req' = I;                             % (6)
                             txclock' = rxclock;

This concludes the overview of the simplified protocol. The transitions of the full protocol also manage the variables txint and rxint. The two most relevant transitions are:

(req = F) AND (ack = I) --> rxint' = true;

(ack = I) AND (req = I) --> txint' = true

In the first transition, a (blocked) sender, i.e. a sender that has initiated a forward transaction and that is currently waiting for the receiver’s acknowledgment, may instead ask the receiver to engage in a reverse transaction. In the second transition, the receiver asks the sender to start a reverse transaction. Elided from the transitions presented earlier are the assignments that reset txint and rxint to false.

The protocol, as described, has been verified using the SAL model checker [8, 11]. This allowed us to verify the system for arbitrary times. Notice that the states of Fig. 2 correspond to the system invariant illustrated in Listing 3 which has been verified by the infinite-state bounded model checker sal-inf-bmc. Our verified system consists of four processes: a sender, a receiver, and two buffers. The buffers, which operate asynchronously from the transmitter and receiver, copy their inputs to their outputs: their purpose is to simulate transmission delays. Verification, while automated, required developing a rather large invariant that could be validated by the model checker. The complete executable model is available for download from: http://www.cs.indiana.edu/~sabry/papers/reversible-protocol/.

th1 : THEOREM system |- G (
  ((req = I) AND (ack = I) AND (txclock = rxclock)) OR
  ((req = F) AND (ack = I) AND (txclock > rxclock)) OR
  ((req = F) AND (ack = F) AND (txclock <= rxclock)) OR
  ((req = I) AND (ack = F) AND (txclock = rxclock)) OR
  ((req = R) AND (ack = I) AND (txclock < rxclock)) OR
  ((req = R) AND (ack = R) AND (txclock >= rxclock)) OR
  ((req = I) AND (ack = R) AND (txclock = rxclock)) OR
  ((req = F) AND (ack = R) AND (txclock > rxclock))
);

Listing 3: Protocol Invariant

The only physical requirement to implement this protocol is in-order message delivery. If message loss is possible, then an implementation would need to provide re-transmission based upon timeouts. It is also feasible to implement this protocol on a system providing only per-process mailboxes by extending the messages with channel identifiers. Finally, while we have presented a four-phase or return-to-zero protocol, it would not be difficult to develop a more efficient NRZ or two-phase version.
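For instance, the mailbox variant amounts to tagging each protocol message with its channel identifier and demultiplexing on arrival. A minimal sketch, with illustrative names only (channels maps identifiers to per-channel state machines, and handle is a hypothetical step function):

MAILBOX = Queue.new # single in-order, per-process mailbox

def dispatch(channels)
  chan_id, kind, clock = MAILBOX.pop    # e.g. [:c12, :req, 7]
  channels[chan_id].handle(kind, clock) # route to that channel's state machine
end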

5 Related work

We confine our discussion to systems that support speculative computation in message-passing environments and omit systems that deal exclusively with shared-memory.

Distributed Systems. The work presented in this paper builds upon a rich history in distributed systems research on rollback-recovery [12], logical clocks, and distributed snapshots. The protocol we present in Sec. 3 utilizes “standard” logical clocks [21, 1], i.e. during forward computation, each communication event appropriately updates the logical clocks of the communicating processes. It is well known that logical clocks have limitations which can be overcome through the use of vector clocks [24, 13] at high cost, or at lower cost by sacrificing some accuracy [33]. In our approach, the trade-off between accuracy and efficiency is passed on as a responsibility to the programmer. The programmer determines when and over what scope state is preserved to enable reverse computation. Our work is also closely related to distributed snapshots [3, 25] in that reverse computation is enabled by processes saving local snapshots; however, at no point is it necessary to create a single global snapshot. Instead, state is saved and restored by programmer defined actions: we simply guarantee that when state is restored, it is done in a causally consistent manner.

Reversible Process Calculi. Danos and Krivine [5] extend CCS [26] to enable reversing computations. Their extensions consist of extending individual processes with memory stacks to track past communications; during forward computation any information required to perform rollback is pushed on these memory stacks. Backtracking is then accomplished by unwinding these events. Their model requires global synchronization among simultaneously created processes when performing reverse steps. Their rules for synchronization and anti-synchronization (communication and the equivalent reversal) are complicated by the need to pass uniquely identifying information in synchronization that may later be used to select complementary events. As a model, this appears to capture the key properties for backtracking; however, it is neither a practical programming language nor generally implementable in a distributed setting. This difficulty is shared among many CCS-based models (e.g. [29, 2, 7, 6, 23, 28, 22]) because these models generally do not place any limitation on the patterns of communication.

Programming Abstractions. Our motivating programming abstraction is the one developed in the theory of stabilizers [34]. This abstraction gives the programmer the familiar ability of programming with explicit backtracking events. Another possibility explored in the work on transactional events [10] is to give the programmer an interface based on transactions as traditionally provided by databases and more recently by software transactional memory models (STM). The latter approach encapsulates — in a single construct — a computation which succeeds only when an entire collection of choice alternatives and communication partners may successfully complete. The price is that a thread attempting such an operation blocks until a collection of choice alternatives and communication partners that may successfully complete is found. Despite the high-level differences, it may however be possible to adapt our protocol to implement transactional events.

6 Discussion

We consider the design of a correct protocol for reversible distributed computation to be a significant result. It is, however, only a first step in our research program whose main goal is to provide new programming abstractions that simplify the construction of efficient speculative programs. We outline our effort in extending the Ruby-based language and suggest some application domains of immediate relevance.

Language Design and Implementation. The Ruby embedding provides a convenient testbed for testing the main ideas and exploring various small applications. Even within that model, there are several interesting unresolved questions. Most notably, the programming model needs additional constructs or perhaps an expressive type system to track the lifetime of continuations to avoid the accumulation of unreachable contexts. Another significant goal is to design a high-level semantics for the language that abstracts from the low-level protocol state transitions and that is expressed at the level of user-level constructs. Finally, an effort is underway to implement the main ideas natively in Scala [27] using the Akka library for actors [32].

Suggested Applications. A natural application for this work is distributed simulation as described by Jefferson [19, 20]. Indeed, our approach is related to the time warp mechanism [19]. At a basic level, rolling back a computation requires some form of “anti-message.” We embed the anti-message mechanism in the protocol described in Sec. 3. A potential benefit of using our protocol is that it requires considerably less communication state: when saving a local snapshot (checkpoint), a process needs only store a single timestamp for each of its communication channels. In contrast, the time warp approach requires saving state for every message sent. Unwinding time warp systems requires sending one anti-message for every message sent while our approach requires at most one message per channel involved in each unwound context.


While in our language, every message is synchronous, one might consider a model in which there are synchronization messages interspersed with in-order data messages. For example, Hunt et al. recently developed a system using logical (vector) clocks to manage deterministic record/replay in distributed systems [18]. Their system uses multiple asynchronous channels with regular messages interspersed with timestamped “end-of-quantum” messages. Replay is enabled by storing copies of messages within a quantum. This algorithm could reasonably be implemented using the protocol we describe in this paper.

Transactional memory has been widely discussed as a programming paradigm for multicore systems [15]. The implementation of transactions requires saving/restoring state and managing rollback efficiently. In more complex realizations of transactional memory, transactions can be nested. Our language could be used to effectively implement the synchronization required for transactional memory. We include (in Appendix B) a small program that illustrates STM-like capabilities. The program consists of a process M representing a “memory” holding a shared resource (which is just a number), and two processes p1 and p2 that attempt to read and write the shared resource. An “atomic transaction” in the context of the example consists of a sequence of “increment” messages concluding with a “read” message, all from the same process. An intervening communication from the other process interferes with the atomicity of the transaction, which is aborted and re-tried. In the example, process M waits for requests from either process to either increment or read the current value of the shared resource. Each of p1 and p2 sends a number of (randomly-delayed) increment messages followed by a read message. If all the messages until the read request come from the same process, process M serves them as expected. If however, an intervening message comes from the other process, process M backtracks. By simply backtracking, process M invalidates the transaction, and both clients are forced to backtrack and re-start their transactions. Naturally, in a more realistic STM implementation, the shared resources, transactions, and atomicity would be represented more directly by higher-level abstractions rather than being encoded. The example does however illustrate that the protocol can transparently provide the synchronization requirements of higher-level abstractions that are expressive and convenient for programmers.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 1116725 and supported by (and while serving at) the National Science Foundation.


References

[1] Ozalp Babaoglu and Keith Marzullo. Distributed systems (2nd ed.), chapter Consistent global states of distributed systems: fundamental concepts and mechanisms, pages 55–96. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 1993.

[2] Luca Cardelli and Cosimo Laneve. Reversible structures. In Proceedings of the 9th International Conference on Computational Methods in Systems Biology, CMSB '11, pages 131–140, New York, NY, USA, 2011. ACM.

[3] K. Mani Chandy and Leslie Lamport. Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst., 3(1):63–75, February 1985.

[4] Inmos Corp. Occam Programming Manual. Prentice Hall Trade, 1984.

[5] V. Danos and J. Krivine. Reversible communicating systems. In Concurrency Theory, pages 292–307, 2004.

[6] V. Danos and J. Krivine. Transactions in RCCS, pages 398–412. Springer-Verlag, London, UK, 2005.

[7] V. Danos, J. Krivine, and F. Tarissan. Self-assembling trees. Electron. Notes Theor. Comput. Sci., 175:19–32, May 2007.

[8] Leonardo de Moura. SAL: tutorial. Technical report, SRI International, 2004.

[9] Leonardo de Moura, Sam Owre, and N. Shankar. The SAL language manual. Technical report, SRI International, 2002.

[10] Kevin Donnelly and Matthew Fluet. Transactional events. In Proceedings of the Eleventh ACM SIGPLAN International Conference on Functional Programming, ICFP '06, pages 124–135, New York, NY, USA, 2006. ACM.

[11] Bruno Dutertre and Maria Sorea. Timed systems in SAL. Technical Report SRI-SDL-04-03, SRI International, 2004.

[12] E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3):375–408, September 2002.

[13] Colin J. Fidge. Timestamps in message-passing systems that preserve the partial ordering. In Proc. of the 11th Australian Computer Science Conference (ACSC'88), pages 56–66, February 1988.

[14] Rachid Guerraoui. Foundations of speculative distributed computing. In Nancy Lynch and Alexander Shvartsman, editors, Distributed Computing, volume 6343 of Lecture Notes in Computer Science, pages 204–205. Springer Berlin / Heidelberg, 2010.

[15] Tim Harris, Adrian Cristal, Osman S. Unsal, Eduard Ayguade, Fabrizio Gagliardi, Burton Smith, and Mateo Valero. Transactional memory: An overview. IEEE Micro, 27(3):8–29, May 2007.

[16] M. Hermenegildo and K. Greene. The &-prolog system: Exploiting independent and-parallelism. New Generation Computing, 9:233–256, 1991.

[17] C. A. R. Hoare. Communicating sequential processes. Commun. ACM, 21(8):666–677, August 1978.

[18] Nicolas Hunt, Louis Ceze, and Steven D. Gribble. DDOS: taming non-determinism in distributed systems. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 499–508. ACM, 2013.

[19] D. Jefferson, B. Beckman, F. Wieland, L. Blume, and M. Diloreto. Time warp operating system. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, SOSP '87, pages 77–93, New York, NY, USA, 1987. ACM.

[20] David R. Jefferson. Virtual time. ACM Transactions on Programming Languages and Systems, 7:404–425, 1985.

[21] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, July 1978.

[22] Ivan Lanese, Michael Lienhardt, Claudio Antares Mezzina, Alan Schmitt, and Jean-Bernard Stefani. Concurrent flexible reversibility. In Proceedings of the 22nd European Conference on Programming Languages and Systems, ESOP'13, pages 370–390, Berlin, Heidelberg, 2013. Springer-Verlag.

[23] Ivan Lanese, Claudio Antares Mezzina, and Jean-Bernard Stefani. Reversing higher-order Pi. Electron. Notes Theor. Comput. Sci., 6269:478–493, 2010.

[24] Friedemann Mattern. Virtual time and global states of distributed systems. In Parallel and Distributed Algorithms, pages 215–226. North-Holland, 1988.

[25] Friedemann Mattern. Efficient algorithms for distributed snapshots and global virtual time approximation. Journal of Parallel and Distributed Computing, 18:423–434, 1993.

[26] Robin Milner. Communication and Concurrency. Prentice Hall International, 1989.

[27] Martin Odersky, Lex Spoon, and Bill Venners. Programming in Scala: A Comprehensive Step-by-Step Guide. Artima Press, 2nd edition, 2010.

[28] Iain Phillips and Irek Ulidowski. Reversibility and models for concurrency. Electron. Notes Theor. Comput. Sci., 192:93–108, October 2007.

[29] Iain Phillips and Irek Ulidowski. Reversing algebraic process calculi. Journal of Logic and Algebraic Programming, 73(1-2):70–96, 2007. Foundations of Software Science and Computation Structures 2006 (FOSSACS 2006).

[30] John Reppy, Claudio V. Russo, and Yingqi Xiao. Parallel concurrent ML. In Proceedings of the 14th ACM SIGPLAN International Conference on Functional Programming, ICFP '09, pages 257–268, New York, NY, USA, 2009. ACM.

[31] John H. Reppy. Concurrent Programming in ML. Cambridge University Press, New York, NY, USA, 1999.

[32] The Akka Team. Akka. http://akka.io. Accessed May 2013.

[33] Francisco J. Torres-Rojas and Mustaque Ahamad. Plausible clocks: Constant size logical clocks for distributed systems. In Ozalp Babaoglu and Keith Marzullo, editors, Distributed Algorithms, volume 1151 of Lecture Notes in Computer Science, pages 71–88. Springer Berlin Heidelberg, 1996.

[34] Lukasz Ziarek and Suresh Jagannathan. Lightweight checkpointing for concurrent ML. J. Funct. Program., 20(2):137–173, March 2010.

A SAL Model

We include the complete SAL model for the benefit of the reviewers.

DISC: CONTEXT =
BEGIN

%---------------
TIME  : TYPE = NATURAL;
STATE : TYPE = { I, F, R };
%---------------
tx : MODULE =
BEGIN
  INPUT  ack     : STATE
  INPUT  rxclock : TIME
  OUTPUT req     : STATE   % initially I
  OUTPUT txclock : TIME    % initially 0
  OUTPUT rxint   : BOOLEAN % initially false
  INPUT  txint   : BOOLEAN

  INITIALIZATION
    req = I;
    txclock = 0;
    rxint = false;

  TRANSITION
  [
    %% F transaction
    (ack = I) AND (req = I) --> req' = F;                             % (0)
                                txclock' IN { x : TIME | x > txclock}
    []
    (ack = F) AND (req = F) --> req' = I;                             % (2)
                                txclock' = rxclock;
                                rxint' = false

    %% Waiting for receiver -- ask receiver to reverse!
    []
    (req = F) AND (ack = I) --> rxint' = true;

    %% R transaction
    []
    (ack = I) AND (req = I) --> req' = R;                             % (4)
                                txclock' IN { x : TIME | x < txclock}
    []
    (ack = R) AND (req /= I) --> req' = I;                            % (6)
                                 txclock' = rxclock;
                                 rxint' = false;
  ]
END;
%---------------
rx : MODULE =
BEGIN
  INPUT  req     : STATE
  INPUT  txclock : TIME
  OUTPUT ack     : STATE   % initially I
  OUTPUT rxclock : TIME    % initially 0
  OUTPUT txint   : BOOLEAN % initially false
  INPUT  rxint   : BOOLEAN

  INITIALIZATION
    ack = I;
    rxclock = 0;
    txint = false;

  TRANSITION
  [
    %% Rx acknowledge F transaction -- capture data
    (req = F) AND (ack = I) --> ack' = F;                              % (1)
                                rxclock' IN { x : TIME | x >= txclock};
                                txint' = false
    []
    (req = I) AND (ack /= I) --> ack' = I                              % (3)

    %% Ask transmitter to reverse!
    []
    (ack = I) AND (req = I) --> txint' = true

    %% Rx acknowledge reverse transaction
    []
    (ack = I) AND (req /= I) --> ack' = R;                             % (5)
                                 rxclock' IN {x : TIME | x <= txclock AND x < rxclock};
                                 txint' = false
  ]
END;
%---------------
txbuf : MODULE =
BEGIN
  INPUT  req      : STATE
  INPUT  txclock  : TIME
  INPUT  rxint    : BOOLEAN
  OUTPUT breq     : STATE
  OUTPUT btxclock : TIME
  OUTPUT brxint   : BOOLEAN

  INITIALIZATION
    breq = I;
    btxclock = 0;
    brxint = false;

  TRANSITION
  [
    true --> breq' = req; btxclock' = txclock; brxint' = rxint
  ]
END;
%---------------
rxbuf : MODULE =
BEGIN
  INPUT  ack      : STATE
  INPUT  rxclock  : TIME
  INPUT  txint    : BOOLEAN
  OUTPUT back     : STATE
  OUTPUT brxclock : TIME
  OUTPUT btxint   : BOOLEAN

  INITIALIZATION
    back = I;
    brxclock = 0;
    btxint = false;

  TRANSITION
  [
    true --> back' = ack; brxclock' = rxclock; btxint' = txint
  ]
END;
%---------------
system : MODULE = ((RENAME ack to back, rxclock to brxclock, txint to btxint IN tx) []
                   (RENAME req to breq, txclock to btxclock, rxint to brxint IN rx) []
                   rxbuf [] txbuf);
%---------------
% sal-inf-bmc -d 1 -i DISC.sal lemma1

lemma1 : LEMMA system |- G ((

  % stable states

  ((((req = breq) AND (ack = back)
     AND (txclock = btxclock) AND (rxclock = brxclock)) AND
    (((req = I) AND (ack = I) AND (txclock = rxclock)) OR
     ((req = F) AND (ack = I) AND (txclock > rxclock)) OR
     ((req = F) AND (ack = F) AND (txclock <= rxclock)) OR
     ((req = F) AND (ack = R) AND (rxclock < txclock)) OR
     ((req = R) AND (ack = I) AND (txclock < rxclock)) OR
     ((req = R) AND (ack = R) AND (rxclock <= txclock)) OR
     ((req = I) AND (ack /= I) AND (txclock = rxclock))))) OR

  % Transitions between stable states

  % II -> FI
  ((req = F) AND (breq = I) AND
   (ack = I) AND (back = I) AND
   (brxclock = rxclock) AND (txclock > btxclock) AND (btxclock = rxclock)) OR
  % II -> RI
  ((req = R) AND (breq = I) AND
   (ack = I) AND (back = I) AND
   (brxclock = rxclock) AND (txclock < btxclock) AND (btxclock = rxclock)) OR
  % FI -> FF
  ((req = F) AND (breq = F) AND
   (ack = F) AND (back = I) AND
   (btxclock = txclock) AND (brxclock < rxclock) AND (rxclock >= txclock)) OR
  % FI -> FR
  ((req = F) AND (breq = F) AND
   (ack = R) AND (back = I) AND
   (btxclock = txclock) AND (brxclock < txclock) AND (rxclock < brxclock)) OR
  % RI -> RR
  ((req = R) AND (breq = R) AND
   (ack = R) AND (back = I) AND
   (btxclock = txclock) AND (rxclock <= txclock) AND (txclock < brxclock)) OR
  % RR -> IR
  ((req = I) AND (breq = R) AND
   (ack = R) AND (back = R) AND
   (rxclock <= txclock) AND
   (txclock = brxclock) AND (rxclock = brxclock) AND (btxclock >= txclock)) OR
  % IR -> II
  ((req = I) AND (breq = I) AND
   (ack = I) AND (back = R) AND
   (txclock = btxclock) AND (rxclock = brxclock) AND (txclock = rxclock)) OR
  % FR -> IR
  ((req = I) AND (breq = F) AND
   (ack = R) AND (back = R) AND
   (txclock < btxclock) AND (txclock = rxclock) AND (rxclock = brxclock)) OR
  % FF -> IF
  ((req = I) AND (breq = F) AND
   (ack = F) AND (back = F) AND
   (txclock = rxclock) AND (rxclock = brxclock) AND (rxclock >= btxclock)) OR
  % IF -> II
  ((req = I) AND (breq = I) AND
   (ack = I) AND (back = F) AND
   (txclock = btxclock) AND (rxclock = brxclock) AND (txclock = rxclock))) AND

  % interrupt states
  (rxint  => req = F) AND
  (brxint => breq = F) AND
  (txint  => ack = I) AND
  (btxint => back = I)
);
%---------------
% sal-inf-bmc -d 1 -i -l lemma1 DISC.sal th1

th1 : THEOREM system |- G (
  ((req = I) AND (ack = I) AND (txclock = rxclock)) OR
  ((req = F) AND (ack = I) AND (txclock > rxclock)) OR
  ((req = F) AND (ack = F) AND (txclock <= rxclock)) OR
  ((req = I) AND (ack = F) AND (txclock = rxclock)) OR
  ((req = R) AND (ack = I) AND (txclock < rxclock)) OR
  ((req = R) AND (ack = R) AND (txclock >= rxclock)) OR
  ((req = I) AND (ack = R) AND (txclock = rxclock)) OR
  ((req = F) AND (ack = R) AND (txclock > rxclock))
);

END

B STM Example

We include the complete implementation of an example demonstrating STM-like capabilities.

require_relative 'csp'

EM.synchrony do
  def processReq(reqType, cin, cout, id, lastwrite, x)
    if lastwrite[0] != id and lastwrite[0] != nil
      puts "M: Got request from p#{id} --- ABORTING"
      Csp.backtrack
    end

    if reqType == :read
      puts "M: p#{id} reading #{x[0]}"
      cout.snd x[0]
      lastwrite[0] = nil
      x[0] = 0
    elsif reqType == :inc
      lastwrite[0] = id
      puts "M: Incrementing value to #{x[0]+1} (requested by p#{id})"
      x[0] += 1
    end
  end

  def M(t1m, t2m, cdone1, cdone2, tm1, tm2)
    Csp.stable {
      puts "M: at virtual time #{Csp.time}"
      # Put x in box
      x = [0]
      lastwrite = [nil]
      flag = 0
      puts "M: value is #{x[0]}"

      while flag < 2 do
        puts "M: Ready"
        Csp.alt [
          t1m.g { |req| processReq(req, t1m, tm1, 1, lastwrite, x) },
          t2m.g { |req| processReq(req, t2m, tm2, 2, lastwrite, x) },
          cdone1.g { |t| flag += 1 },
          cdone2.g { |t| flag += 1 }
        ]
      end
      puts "M: Done"
    }
  end

  def send(id, cnt, tm1, t1m, cdone)
    Csp.stable {
      cnt.times {
        rand(30).times { Csp.yield }
        t1m.snd :inc
      }
      t1m.snd :read
      puts "p#{id}: received #{tm1.rcv}"
      cdone.snd 1
    }
  end

  #-----------------------------

  Csp.proc("root", [], []) {
    t1m    = Csp.channel("T1->M")
    t2m    = Csp.channel("T2->M")
    tm1    = Csp.channel("M->T1")
    tm2    = Csp.channel("M->T2")
    cdone1 = Csp.channel("T1 done")
    cdone2 = Csp.channel("T2 done")

    Csp.proc("M", [t1m, t2m, cdone1, cdone2], [tm1, tm2]) {
      M(t1m, t2m, cdone1, cdone2, tm1, tm2)
    }
    Csp.proc("T1", [tm1], [t1m, cdone1]) { send(1, 2, tm1, t1m, cdone1) }
    Csp.proc("T2", [tm2], [t2m, cdone2]) { send(2, 2, tm2, t2m, cdone2) }
    Csp.yield while (Csp::CspProc.processes > 1)
    EM.stop
  }
end

=begin Sample runs
---------------------------------- RUN 1
M: at virtual time 2
M: value is 0
M: Ready
M: Incrementing value to 1 (requested by p2)
M: Ready
M: Got request from p1 --- ABORTING
M: at virtual time 2
M: value is 0
M: Ready
M: Incrementing value to 1 (requested by p1)
M: Ready
M: Incrementing value to 2 (requested by p1)
M: Ready
M: p1 reading 2
p1: received 2
M: Ready
M: Ready
M: Incrementing value to 1 (requested by p2)
M: Ready
M: Incrementing value to 2 (requested by p2)
M: Ready
M: p2 reading 2
p2: received 2
M: Ready
M: Done

---------------------------------- RUN 2
M: at virtual time 2
M: value is 0
M: Ready
M: Incrementing value to 1 (requested by p1)
M: Ready
M: Incrementing value to 2 (requested by p1)
M: Ready
M: p1 reading 2
p1: received 2
M: Ready
M: Incrementing value to 1 (requested by p2)
M: Ready
M: Ready
M: Incrementing value to 2 (requested by p2)
M: Ready
M: p2 reading 2
p2: received 2
M: Ready
M: Done

---------------------------------- RUN 3
M: at virtual time 2
M: value is 0
M: Ready
M: Incrementing value to 1 (requested by p1)
M: Ready
M: Got request from p2 --- ABORTING
M: at virtual time 2
M: value is 0
M: Ready
M: Incrementing value to 1 (requested by p1)
M: Ready
M: Incrementing value to 2 (requested by p1)
M: Ready
M: Got request from p2 --- ABORTING
M: at virtual time 2
M: value is 0
M: Ready
M: Incrementing value to 1 (requested by p2)
M: Ready
M: Got request from p1 --- ABORTING
M: at virtual time 2
M: value is 0
M: Ready
M: Incrementing value to 1 (requested by p1)
M: Ready
M: Incrementing value to 2 (requested by p1)
M: Ready
M: p1 reading 2
p1: received 2
M: Ready
M: Ready
M: Incrementing value to 1 (requested by p2)
M: Ready
M: Incrementing value to 2 (requested by p2)
M: Ready
M: p2 reading 2
p2: received 2
M: Ready
M: Done
=end