Replicating Files and other big objects “Out of Band” With Isis2

Ken Birman, Cornell University


Page 2: Core Challenge

Many cloud computing systems work with very large files or other big objects.

Frequently these take the form of massive byte arrays, and it isn’t at all uncommon to “map” them into memory.

On Linux and Windows the memory-mapped file API makes this easy to do: it takes a file name and returns a pointer to a memory region where you can directly access the bytes of that file.
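As a minimal sketch of what "mapping" a file looks like in C#, the helper below writes some bytes to a scratch file, maps that file, and reads one byte back through the mapped view. The class and method names are illustrative only and are not part of Isis2; the file is a temporary one created just for the demonstration, and the data array is assumed to be non-empty.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedFileDemo
{
    // Writes `data` (assumed non-empty) to a temporary disk file, maps that
    // file into memory, and returns the byte found at `offset` in the view.
    public static byte ReadMappedByte(byte[] data, long offset)
    {
        string path = Path.GetTempFileName();   // scratch file for this demo only
        try
        {
            File.WriteAllBytes(path, data);
            // Map the existing file; capacity 0 means "use the file's length".
            using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open, null, 0))
            using (var view = mmf.CreateViewAccessor())
            {
                return view.ReadByte(offset);
            }
        }
        finally
        {
            File.Delete(path);   // the mapping is already disposed here
        }
    }
}
```

Calling `MappedFileDemo.ReadMappedByte(new byte[] { 10, 20, 30 }, 1)` reads the second byte of the file through the mapping rather than through a stream read.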

Page 3: Not long ago, Isis2 wasn’t a good choice for applications with big objects…

We created the OOB layer because moving big objects inside Isis2 was simply too costly.

- You can put big things into messages, and Isis2 carves them into smaller chunks
- But they can seriously disrupt the steady flow in the system

The issue is that:

- Isis2 needs to maintain FIFO ordering for lower-level communication between group members
- Hence a big object needs to be fully transferred before small things sent after it can be delivered, even if they were sent by some other thread for some other reason
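The cost of FIFO ordering can be illustrated with a small calculation. The sketch below is a simplified model, not Isis2 code: it assumes a single FIFO channel with a fixed bandwidth and no other overheads, and reports when each message finishes arriving. The names are hypothetical.

```csharp
using System;

public static class FifoChannelDemo
{
    // Returns, for each message in send order, the time in seconds at which
    // it has fully arrived over one FIFO channel of the given bandwidth.
    // A message cannot complete before everything queued ahead of it.
    public static double[] DeliveryTimes(long[] sizesInBytes, double bytesPerSecond)
    {
        var times = new double[sizesInBytes.Length];
        double clock = 0.0;
        for (int i = 0; i < sizesInBytes.Length; i++)
        {
            clock += sizesInBytes[i] / bytesPerSecond;
            times[i] = clock;
        }
        return times;
    }
}
```

With a 1 GB object followed by a 100-byte message on a 1 GB/s channel, `DeliveryTimes(new long[] { 1000000000, 100 }, 1e9)` shows the small message completing only after the full second spent on the big object, although on its own it would take a tiny fraction of that.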

Page 4: Out of Band (OOB) Concept

We added a way to move very big byte[] objects “outside” of the normal Isis2 communication path.

We start by assuming the objects are memory-mapped files (they don’t have to exist on disk at all, but they do have file names that look like the names of disk files).

You can:

- Create these from a file
- Create a big mapped memory region and put data in it

These mapped files can be shared easily within a single computer and are ultra-efficient because no copying occurs. Much faster than ANY form of copying!

Page 5: Out of Band (OOB) Concept

[Figure: before and after. Machine A has a big memory object. We want copies on B and C… but not on D.]

Keep in mind that the object is really big. And there may be many such transfers to do, all at the same time.

Page 6: Out of Band (OOB) Concept

So… you’ve created a memory-mapped region … and put data into it, somehow … it might be huge (hundreds of megabytes? Or even gigabytes? No problem! But > 6Gb needs a 64-bit O/S).

Our goal: use Isis2 to efficiently move these from computer to computer in a cloud computing data center or a cluster.

Ideally: a single DMA transfer, or a super-efficient series of Ethernet multicasts.

Page 7: Out of Band (OOB) Concept

[Figure: Machine A has a big memory object X. We want copies on machines B and C… but not on D.]

(1) Tell Isis2 about X using OOBRegister.
(2) Tell your application on B and C to fetch X.
(3) OOBReReplicate tells Isis2 to modify the replication pattern; applications on B and C call OOBFetch.

Page 8: Steps

First you need to tell the Isis2 subsystem that the file exists. There are three cases:

1. Isis2 could be linked directly to your application code.
2. Isis2 could run in a server that you talk to via RPC, perhaps from native C++.
3. We also have a command-line program that can talk to our server for you, so you can access OOB by issuing commands if the server is running.

Isis2 wants to know the file name. In RPC mode the data lives in the mapped memory and isn’t copied to Isis2.

Page 9: Steps

So… you

1. Register the memory-mapped file

Now you can

1. Form a process group
2. Replicate data within/among the group members. We call this “rereplicate” because you can do it again and again, changing the replication pattern
3. On the receiving “side”, fetch a pointer into the memory-mapped file region (this will wait until the data arrives)

Page 10: Why do we call it “out of band”?

Often you’ll mix Isis2 RPC and multicast with out-of-band data transfer:

- Register a file, and start transferring it
- In parallel, tell some group member(s) about it, by name

In such cases:

- Isis2 carries out the OOB transfer as efficiently as it can
- The OOBFetch operation in the receiver blocks until the bytes have been correctly received and are available
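The "fetch blocks until the bytes are available" behavior can be modeled locally with a completion source per file name. This is a hypothetical stand-in written only with standard .NET types; it is not how Isis2 implements OOBFetch, and `PendingTransfers`, `Complete`, and `Fetch` are invented names.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical local model of "wait until the named transfer has finished".
public sealed class PendingTransfers
{
    private readonly ConcurrentDictionary<string, TaskCompletionSource<byte[]>> pending =
        new ConcurrentDictionary<string, TaskCompletionSource<byte[]>>();

    // One completion source per name, created by whichever side arrives first.
    private TaskCompletionSource<byte[]> Slot(string name)
    {
        return pending.GetOrAdd(name, _ =>
            new TaskCompletionSource<byte[]>(TaskCreationOptions.RunContinuationsAsynchronously));
    }

    // Called by the (simulated) transfer layer once all bytes have arrived.
    public void Complete(string name, byte[] data)
    {
        Slot(name).TrySetResult(data);
    }

    // Blocks the caller until Complete has been called for this name.
    public byte[] Fetch(string name)
    {
        return Slot(name).Task.GetAwaiter().GetResult();
    }
}
```

Because the receiver may learn the name before or after the data lands, the model accepts either order: a `Fetch` that arrives early waits, and one that arrives late returns immediately.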

Page 11: Other options

You can also register an upcall handler.

- The OOB layer will call it each time an incoming OOB file has been fully transferred

And you can access the replication map.

- It tells you which group members have which files

The idea is to be able to rereplicate in a flash, in parallel for multiple files if desired, and as close as possible to the raw hardware speed of the network.

Page 12: OOB interface

Example: creating a new mapped file

    MemoryMappedFile mmf = MemoryMappedFile.CreateNew(fname, CAPACITY);   // (1)
    MemoryMappedViewAccessor mva = mmf.CreateViewAccessor();              // (2)
    for (int n = 0; n < CAPACITY; n++)                                    // (3)
    {
        byte b = (byte)(n & 0xFF);
        mva.Write<byte>(n, ref b);
    }

(1) Creates a completely new memory-mapped object
(2) An “Accessor” allows you to access the bytes in the object
(3) An example of byte-by-byte access

You can also open an existing mapped file, if some other program on the same computer created it.

Then call g.OOBRegister(string fname, MemoryMappedFile mmf)

Page 13: Now Isis2 knows about the file

Next we can call ReReplicate:

    g.OOBReReplicate(fname, where);

fname is the file name. But what goes in “where”?

Page 14: The “where” argument to ReReplicate

This should be an object of type List<Address>.

For example, given a view v for a group,

    List<Address> everywhere = v.members.ToList();

creates a list with every group member in it.

It must list ALL the places where you want replicas.

- Isis2 will create new replicas and also delete unwanted ones
- To create the new replicas before the old ones are deleted, do it in two steps
- OOBDelete(fname) is short for OOBReReplicate with an empty replica location list
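Since the list names every desired replica location, the work implied by a rereplication is the difference between the current and desired sets. The sketch below shows that bookkeeping with plain strings standing in for Isis2 Address objects; `ReplicaPlan.Diff` is a hypothetical helper, not an Isis2 call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReplicaPlan
{
    // Given who holds a copy now and who should hold one afterwards, return
    // the members that need a new copy and the members whose copy is unwanted.
    public static (List<string> ToCreate, List<string> ToDelete) Diff(
        IEnumerable<string> current, IEnumerable<string> desired)
    {
        var cur = new HashSet<string>(current);
        var want = new HashSet<string>(desired);
        var toCreate = want.Except(cur).OrderBy(x => x, StringComparer.Ordinal).ToList();
        var toDelete = cur.Except(want).OrderBy(x => x, StringComparer.Ordinal).ToList();
        return (toCreate, toDelete);
    }
}
```

Going from a copy on A to copies on A, B, and C yields two creations and no deletions, while an empty desired list yields only deletions, which matches the description of OOBDelete as rereplication to an empty list.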

Page 15: Now Isis2 knows about the file

ReReplicate also has a second overload:

    g.OOBReReplicate(fname, where,
        (Action<string, MemoryMappedFile>) delegate(string oobfname, MemoryMappedFile m)
        {
            IsisSystem.WriteLine("ReReplicate finished for " + oobfname);
        });

The delegate method will be called by Isis2 when the transfer finishes. The transfer itself runs asynchronously: out of band!

Page 16: How to access your replica

You call

    MemoryMappedFile xmmf = g.OOBFetch(fname);

This call will wait until the ReReplication action finishes (so it is a mistake to do it if you haven’t started one!). That could take a while if the file is big: a 5GB file (40 gigabits) on a 10Gb/s network will need at least 4 seconds to transfer even at 100% of the line rate.
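A quick way to estimate how long a fetch might block is to compute the line-rate lower bound. The helper below is a back-of-the-envelope sketch with a hypothetical name; it uses decimal units (1 GB = 8 gigabits) and ignores protocol overhead, so real transfers will take longer than the figure it returns.

```csharp
public static class TransferTime
{
    // Lower bound, in seconds, for moving `gigabytes` of data over a link of
    // `linkGigabitsPerSecond`, assuming the link is used at 100% of its rate.
    public static double MinSeconds(double gigabytes, double linkGigabitsPerSecond)
    {
        return gigabytes * 8.0 / linkGigabitsPerSecond;
    }
}
```

For a 5 GB object on a 10 Gb/s link, `TransferTime.MinSeconds(5, 10)` gives 4 seconds as the raw bound, before any overhead is counted.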

Page 17: How our server works

We built a very simple server that accepts RPC requests in Web Services style.

Then we created a simple “thin” library to talk to it.

- You can pass a file name to it, and it will do an OOB operation using that file name as the argument
- Remember: memory-mapped files are accessible from any program on the same machine!
- So Isis2 can access your memory-mapped files even from this server, even if you aren’t “linked” to it!

The command-line API works the same way.

Page 18: Recap: A very fast way to move objects around

[Figure: Machine A has a big memory object X. We want copies on machines B and C.]

(1) Tell Isis2 about X using OOBRegister.
(2) Tell your application on B and C to fetch X.
(3) OOBReReplicate tells Isis2 to modify the replication pattern; applications on B and C call OOBFetch.

Page 19: How we use OOB inside Isis2

One situation where Isis2 has to copy identical data to lots of group members involves a master/worker startup with many new members joining.

- All the new members need the new group view!
- … and because they don’t have the prior group view, we can’t just send the delta, which is how new view events normally work

So, if the group is large, Isis2 creates a memory-mapped object containing the view, then uses OOB to transfer it to the joining processes!

Page 20: You might use it for state transfer too!

The initialization case is a form of state transfer.

Suppose you are building a group but the state is very large, like a file service.

If you try to transfer the state “in band” it could take ages and disrupt the group for a long time!

Page 21: OOB to the rescue!

Better: pre-transfer as much state as you can using the OOB tool.

You’ll need a way to contact the group before even trying to join. A good option: the Client API.

- It allows you to bind to a randomly chosen “representative”
- It load-balances these roles… The representative must “allow client requests” to handlers you can call as a client

So, you create a state pre-fetch API for clients. The joining member shows up, perhaps authenticates itself, and you use OOB to pre-send all that state.

Page 22: But if updates are active… a race condition forms!

Suppose the state is A… W, but during the time between when you finish being a client and when you join, updates X and Y occur in the group.

Your state is “stale”. Should it be discarded?

We recommend: associate a counter or timestamp with the state.

The version you pre-transferred had, perhaps, T=23.

Now we can use this to “finalize” the state.

Page 23: Implementation

g.Join() has an overload where you can pass in a long integer. Pass this timestamp.

The process that initiates your state transfer will find the timestamp value in the view, in a field called v.offset.

It can compute a state for you that includes the updates done after you pre-transferred the state!

Page 24: OOB pre-transfer idea

[Figure: timeline for processes P, Q, and R in group G. Labels, in the order they appear on the slide: “Pre-transfer please?” (… as Client of G); “look in /tmp/xxx, T=12345”; OOBFetch(); “/tmp/xxx” @ T=123; OOBReReplicate; OOBDelete; Memory Mapped Byte[] Representation; Create mapped file; g.Join(12345); Updates since T=12345.]

Page 25: Group obligation?

If the state of the group is an append-style log, this concept is easily implemented.

Otherwise, the group needs to keep a log of “recent” updates and implement some form of periodic snapshot in which the stored state has an associated time (how many updates it reflects), and the log has the remaining updates.
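One possible shape for the snapshot-plus-log bookkeeping is sketched below. It is a hypothetical in-memory model, not Isis2 code: the "time" is simply the number of updates applied so far, updates are plain strings, and `VersionedLog` and its members are invented names. A joiner whose pre-transferred state carries time T asks for the updates since T.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model: a snapshot taken at `snapshotTime` plus a log of the
// updates applied after it. Time counts how many updates the state reflects.
public sealed class VersionedLog
{
    private readonly List<string> log = new List<string>();
    private long snapshotTime = 0;

    // Current time of the group state: updates in the snapshot plus the log.
    public long Time { get { return snapshotTime + log.Count; } }

    public void Append(string update) { log.Add(update); }

    // Updates a joiner still needs if its pre-transferred state has time `t`.
    public List<string> UpdatesSince(long t)
    {
        if (t < snapshotTime || t > Time)
            throw new ArgumentOutOfRangeException("t", "state is too old or from the future");
        return log.GetRange((int)(t - snapshotTime), (int)(Time - t));
    }

    // Periodic snapshot: drop log entries that the stored state now covers.
    public void SnapshotTaken(long newSnapshotTime)
    {
        if (newSnapshotTime < snapshotTime || newSnapshotTime > Time)
            throw new ArgumentOutOfRangeException("newSnapshotTime");
        log.RemoveRange(0, (int)(newSnapshotTime - snapshotTime));
        snapshotTime = newSnapshotTime;
    }
}
```

The out-of-range case corresponds to a pre-transferred state older than the latest snapshot: the log no longer holds what the joiner is missing, so it would need a fresh pre-transfer.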

Page 26: Serialization

We have several ways to create the byte[] representation of these view objects:

- Msg.ToBArray(objs…)
- C# serialization
- Your favorite way of generating a byte[] object

But keep in mind that because an mva isn’t a byte[] object, copying does occur at the last step of transforming data into a C# managed object.
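The copy mentioned above is visible in the accessor API itself: bytes move between a managed byte[] and the mapped view through explicit array reads and writes. The sketch below uses only standard .NET calls on an anonymous (unnamed) mapping created for the demonstration; `MappedCopyDemo` is a hypothetical name and the payload is assumed to be non-empty.

```csharp
using System;
using System.IO.MemoryMappedFiles;

public static class MappedCopyDemo
{
    // Copies `payload` (assumed non-empty) into a fresh mapped region, then
    // copies it back out into a new managed array and returns that array.
    public static byte[] RoundTrip(byte[] payload)
    {
        // A null name creates a mapping that is not shared with other processes.
        using (var mmf = MemoryMappedFile.CreateNew(null, payload.Length))
        using (var view = mmf.CreateViewAccessor(0, payload.Length))
        {
            view.WriteArray(0, payload, 0, payload.Length);   // managed -> mapped copy
            var copy = new byte[payload.Length];
            view.ReadArray(0, copy, 0, copy.Length);          // mapped -> managed copy
            return copy;
        }
    }
}
```

Each direction is a real memory copy, so for very large objects it can be worth working through the accessor in place rather than materializing a full byte[].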

Page 27: Performance considerations

In theory, the very best way to move the bytes is with Ethernet multicast or InfiniBand. Isis2 supports both… but they behave differently.

- Ethernet multicast is highly efficient for 1:n transfers, but the data is still copied from kernel to user address space
- InfiniBand multicast doesn’t work well, hence we use InfiniBand “verbs” to send the data via multiple 1:1 streams. But these avoid any kernel/user copying

Worst performance: the ISIS_UNICAST_ONLY case.