Replicating Files and Other Big Objects “Out of Band” with Isis2
Ken Birman, Cornell University


Post on 17-Jan-2016


A Brief Introduction To High Assurance Cloud Computing With Isis2


Core Challenge

Many cloud computing systems work with very large files or other big objects.
Frequently they take the form of massive byte arrays, and it isn't at all uncommon to map them into memory.
On Linux and Windows the memory-mapped file API makes this easy to do.
  It takes a file name and returns a pointer to a memory region where you can directly access the bytes of that file.

Not long ago, Isis2 wasn't a good choice for applications with big objects

We created the OOB layer because moving big objects inside Isis2 was simply too costly.
You can put big things into messages, and Isis2 carves them into smaller chunks.
But they can seriously disrupt the steady flow in the system.
The issue is that Isis2 needs to maintain FIFO ordering for lower-level communication between group members.
Hence a big object needs to be fully transferred before small things sent after it can be delivered, even if they were sent by some other thread for some other reason.

Out of Band (OOB) Concept

We added a way to move very big byte[] objects outside of the normal Isis2 communication path.
We start by assuming the objects are memory-mapped files (they don't have to exist on disk at all, but they do have file names that look like the names of disk files).
You can:
  Create these from a file
  Create a big mapped memory region and put data in it
These mapped files can be shared easily within a single computer and are ultra efficient because no copying occurs. Much faster than ANY form of copying!

Out of Band (OOB) Concept

[Figure: "Before" and "After" panels. Machine A has a big memory object. We want copies on B and C, but not on D.]
Keep in mind that the object is really big, and there may be many such transfers to do, all at the same time.

Out of Band (OOB) Concept

So: you've created a memory-mapped region and put data into it, somehow. It might be huge (hundreds of megabytes? Or even gigabytes? No problem! But > 6Gb needs a 64-bit O/S).
Our goal: use Isis2 to efficiently move these from computer to computer in a cloud computing data center or a cluster.
Ideally: a single DMA transfer, or a super-efficient series of Ethernet multicasts.
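As a small illustration of the memory-mapped file API mentioned above, the following C# sketch maps an ordinary disk file and reads one byte through the mapping. It uses only the standard .NET System.IO.MemoryMappedFiles classes and involves no Isis2 calls; the class name MappedFileDemo and the helper ReadMappedByte are made up for this example.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedFileDemo
{
    // Hypothetical helper: maps an existing disk file and returns the byte at `offset`.
    public static byte ReadMappedByte(string path, long offset)
    {
        // Capacity 0 means "use the current size of the file on disk".
        // A null map name creates a mapping that is not shared by name.
        using (MemoryMappedFile mmf =
            MemoryMappedFile.CreateFromFile(path, FileMode.Open, null, 0))
        using (MemoryMappedViewAccessor view = mmf.CreateViewAccessor())
        {
            // The read goes straight to the mapped region; no explicit file I/O call.
            return view.ReadByte(offset);
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, new byte[] { 10, 20, 30, 40 });
            Console.WriteLine(ReadMappedByte(path, 2)); // prints 30
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The same accessor pattern applies whether the region was created from a file or created fresh in memory; only the factory call differs.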

Out of Band (OOB) Concept

[Figure] Machine A has a big memory object X. We want copies on machines B and C, but not on D.
(1) Tell Isis2 about X using OOBRegister.
(2) Tell your application on B and C to fetch X.
(3) OOBReReplicate tells Isis2 to modify the replication pattern.
(3) Applications on B and C call OOBFetch.

Steps

First you need to tell the Isis2 subsystem that the file exists. There are three cases:
  Isis2 could be linked directly to your application code.
  Isis2 could run in a server that you talk to via RPC, perhaps from native C++.
  We also have a command-line program that can talk to our server for you, so you can access OOB by issuing commands if the server is running.
Isis2 wants to know the file name. In RPC mode the data lives in the mapped memory and isn't copied to Isis2.

Steps

So, you:
  Register the memory-mapped file
Now you can:
  Form a process group
  Replicate data within/among the group members. We call this "rereplicate" because you can do it again and again, changing the replication pattern.
  On the receiving side, fetch a pointer into the memory-mapped file region (this will wait until the data arrives).

Why do we call it "out of band"?

Often you'll mix Isis2 RPC and multicast with out-of-band data transfer:
  Register a file, and start transferring it
  In parallel, tell some group member(s) about it, by name
In such cases:
  Isis2 carries out the OOB transfer as efficiently as it can
  The OOBFetch operation in the receiver blocks until the bytes have been correctly received and are available

Other options

You can also register an upcall handler.
  The OOB layer will tell it each time an incoming OOB file has been fully transferred.
And you can access the replication map.
  It tells you which group members have which files.
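One way to picture the replication map, and what a change of replication pattern implies, is as a table from file name to the set of members holding a replica. The C# sketch below is only a toy model under that assumption: it does not use or reproduce any Isis2 data structure, members are plain strings rather than Isis2 addresses, and the names ReplicationMapSketch, ToCreate and ToDelete are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReplicationMapSketch
{
    // Members that need a new copy: wanted, but not holding a replica yet.
    public static List<string> ToCreate(ISet<string> current, IEnumerable<string> desired)
    {
        return desired.Where(m => !current.Contains(m))
                      .Distinct()
                      .OrderBy(m => m, StringComparer.Ordinal)
                      .ToList();
    }

    // Members whose copy is no longer wanted: holding a replica, but not in the desired set.
    public static List<string> ToDelete(ISet<string> current, IEnumerable<string> desired)
    {
        var wanted = new HashSet<string>(desired);
        return current.Where(m => !wanted.Contains(m))
                      .OrderBy(m => m, StringComparer.Ordinal)
                      .ToList();
    }

    public static void Main()
    {
        // Toy replication map: file name -> members that currently hold a replica.
        var map = new Dictionary<string, HashSet<string>>
        {
            { "/tmp/xxx", new HashSet<string> { "A", "D" } }
        };
        var desired = new List<string> { "A", "B", "C" };

        Console.WriteLine("create: " + string.Join(",", ToCreate(map["/tmp/xxx"], desired))); // create: B,C
        Console.WriteLine("delete: " + string.Join(",", ToDelete(map["/tmp/xxx"], desired))); // delete: D
    }
}
```

Separating the two sets makes it easy to perform the additions first and the removals second, so a file never drops below its intended number of copies during the change.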

The idea is to be able to rereplicate in a flash, in parallel for multiple files if desired, and as close as possible to the raw hardware speed of the network.

OOB interface

Example: creating a new mapped file

    MemoryMappedFile mmf = MemoryMappedFile.CreateNew(fname, CAPACITY);   // (1)
    MemoryMappedViewAccessor mva = mmf.CreateViewAccessor();              // (2)
    for (int n = 0; n < CAPACITY; n++)
    {
        byte b = (byte)(n & 0xFF);
        mva.Write(n, ref b);                                              // (3)
    }

(1) Creates a completely new memory-mapped object.
(2) An accessor allows you to access the bytes in the object.
(3) An example of byte-by-byte access.

You can also open an existing mapped file, if some other program on the same computer created it.
Then call g.OOBRegister(string fname, MemoryMappedFile mmf).

Now Isis2 knows about the file

Next we can call ReReplicate:

    g.OOBReReplicate(fname, where);

fname is the file name. But what goes in where?

The where argument to ReReplicate

This should be an object of type List<Address>.

For example, given a view v for a group,

    List<Address> everywhere = v.members.ToList();

creates a list with every group member in it.
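A narrower list, such as "every member except D", can be built by filtering the membership. The sketch below shows only the list manipulation, using strings as stand-ins for group members; with Isis2 the elements would be the entries of v.members instead. The names WhereListSketch and AllExcept are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WhereListSketch
{
    // Returns every member except those named in `excluded`, preserving view order.
    public static List<string> AllExcept(IEnumerable<string> members, params string[] excluded)
    {
        return members.Where(m => !excluded.Contains(m)).ToList();
    }

    public static void Main()
    {
        // Stand-in for v.members; real group members are not strings.
        string[] members = { "A", "B", "C", "D" };
        List<string> targets = AllExcept(members, "D");
        Console.WriteLine(string.Join(",", targets)); // prints A,B,C
    }
}
```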

It must list ALL the places where you want replicas. Isis2 will create new replicas and also delete unwanted ones.
  New replicas are created before old ones are deleted: two steps.
  OOBDelete(fname) is short for OOBReReplicate with an empty replica location list.

Now Isis2 knows about the file

ReReplicate also has a second overload:

    g.OOBReReplicate(fname, where, (Action<string, MemoryMappedFile>)delegate(string oobfname, MemoryMappedFile m)
    {
        IsisSystem.WriteLine("ReReplicate finished for " + oobfname);
    });

The delegate method will be called by Isis2 when the transfer finishes. The transfer itself runs asynchronously, out of band!

How to access your replica

You call

    MemoryMappedFile xmmf = g.OOBFetch(fname);

This call will wait until the ReReplicate action finishes (so it is a mistake to call it if you haven't started one!). That could take a while if the file is big: a 5 GB file is 40 Gb, so on a 10 Gb/s network it needs at least 4 seconds to transfer even at 100% of the line rate.

How our server works

We built a very simple server that accepts RPC requests in Web Services style.
Then we created a simple thin library to talk to it.
  You can pass a file name to it, and it will do an OOB operation using that file name as the argument.
  Remember: memory-mapped files are accessible from any program on the same machine!
  So Isis2 can access your memory-mapped files even from this server, even if you aren't linked to it!
The command-line API works the same way.

Recap: A very fast way to move objects around

[Figure] Machine A has a big memory object X. We want copies on machines B and C.
(1) Tell Isis2 about X using OOBRegister.
(2) Tell your application on B and C to fetch X.
(3) OOBReReplicate tells Isis2 to modify the replication pattern.
(3) Applications on B and C call OOBFetch.

How we use OOB inside Isis2

One situation where Isis2 has to copy identical data to lots of group members involves a master/worker startup with many new members joining.
All the new members need the new group view, and because they don't have the prior group view, we can't just send the delta, which is how new view events normally work.
So, if the group is large, Isis2 creates a memory-mapped object containing the view, then uses OOB to transfer it to the joining processes!

You might use it for state transfer too!

The initialization case is a form of state transfer.
Suppose you are building a group but the state is very large, like a file service.
If you try to transfer the state in band it could take ages and disrupt the group for a long time!

OOB to the rescue!

Better: pre-transfer as much state as you can using the OOB tool.
You'll need a way to contact the group before even trying to join. A good option: the Client API.
  It allows you to bind to a randomly chosen representative.
  It load balances these roles.
  The representative must allow client requests to handlers you can call as a client.
So, you create a state pre-fetch API for clients.
  A joining member shows up, perhaps authenticates itself, and you use OOB to pre-send all that state.

But if updates are active, a race condition forms!

Suppose the state is A...W, but during the time between when you finish being a client and when you join, updates X and Y occur in the group.
Your state is stale. Should it be discarded?

We recommend:
  Associate a counter or timestamp with the state. The version you pre-transferred had, perhaps, T=23.
  Now we can use this to finalize the state.

Implementation

g.Join() has an overload where you can pass in a long integer. Pass this timestamp.
The process that initiates your state transfer will find the timestamp value in the view, in a field called v.offset.
It can compute a state for you that includes the updates done subsequent to when you pre-transferred state!

OOB pre-transfer idea

[Figure: a joining process asks "Pre-transfer please?" and is told "look in /tmp/xxx, T=12345". Labels in the diagram: P, Q, R; Create mapped file; Memory Mapped Byte[] Representation; /tmp/xxx @ T=123; OOBReReplicate; OOBFetch() as Client of G; OOBDelete; g.Join(12345); Updates since T=12345.]

Group obligation?

If the state of the group is an append-style log, this concept is easily implemented.
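For the append-style log case, a minimal sketch of the catch-up step might look as follows. It assumes the timestamp T is simply the number of log entries the joiner already holds, which is one possible convention and not necessarily what Isis2 or any given application uses. The log is modelled as a list of strings, and the names LogCatchUpSketch and UpdatesSince are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LogCatchUpSketch
{
    // With an append-only log, the updates a joiner is missing are
    // exactly the entries from index t onward.
    public static List<string> UpdatesSince(IReadOnlyList<string> log, long t)
    {
        if (t < 0 || t > log.Count)
            throw new ArgumentOutOfRangeException(nameof(t));
        return log.Skip((int)t).ToList();
    }

    public static void Main()
    {
        // u1..u3 were pre-transferred; X and Y arrived before the join completed.
        var log = new List<string> { "u1", "u2", "u3", "X", "Y" };
        long preTransferred = 3;
        Console.WriteLine(string.Join(",", UpdatesSince(log, preTransferred))); // prints X,Y
    }
}
```

The joiner keeps its pre-transferred prefix and only the short suffix has to be sent at join time.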

Otherwise, the group needs to keep a log of recent updates and implement some form of periodic snapshot, in which the stored state has an associated time (how many updates it reflects) and the log has the remaining updates.

Serialization

We have several ways to create the byte[] representation of these view objects:
  Msg.ToBArray(objs)
  C# serialization
  Your favorite way of generating a byte[] object
But keep in mind that because an mva isn't a byte[] object, copying does occur at the last step of transforming data into a C# managed object.

Performance considerations

In theory, the very best way to move the bytes is with Ethernet multicast or Infiniband.
Isis2 supports both, but they behave differently:
  Ethernet multicast is highly efficient for 1:n, but the data is still copied from kernel to user address space.
  Infiniband multicast doesn't work well, hence we use Infiniband verbs to send the data via multiple 1:1 streams. But these avoid any kernel/user copying.
Worst performance: the ISIS_UNICAST_ONLY case.
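Returning to the serialization note: the final copy from a mapped view into a managed byte[] can be sketched with the standard accessor API as below. This uses only .NET classes, not Isis2; the names AccessorCopySketch and CopyOut are hypothetical, and the mapping is left unnamed so the sketch does not depend on named shared memory.

```csharp
using System;
using System.IO.MemoryMappedFiles;

public static class AccessorCopySketch
{
    // Copies `count` bytes starting at `offset` out of the mapped view into a managed array.
    public static byte[] CopyOut(MemoryMappedViewAccessor view, long offset, int count)
    {
        byte[] buffer = new byte[count];
        int read = view.ReadArray(offset, buffer, 0, count);
        if (read != count)
            throw new InvalidOperationException("short read from mapped view");
        return buffer;
    }

    public static void Main()
    {
        // A null map name gives an unnamed mapping private to this process.
        using (MemoryMappedFile mmf = MemoryMappedFile.CreateNew(null, 16))
        using (MemoryMappedViewAccessor view = mmf.CreateViewAccessor())
        {
            view.WriteArray(0, new byte[] { 1, 2, 3, 4 }, 0, 4);
            byte[] copy = CopyOut(view, 0, 4);
            Console.WriteLine(BitConverter.ToString(copy)); // prints 01-02-03-04
        }
    }
}
```

For very large objects it may be preferable to copy out only the slice that is about to be deserialized, rather than the whole region.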