Writing Fault-Tolerant Applications Using Resilient...

© 2014 IBM Corporation

Writing Fault-Tolerant Applications Using Resilient X10 2014/06/12 Kiyokuni Kawachiya (presentation by Mikio Takeuchi) IBM Research - Tokyo

X10 Workshop 2014 at Edinburgh, UK

This work was funded in part by the Air Force Office of Scientific Research under Contract No. FA8750-13-C-0052.


IBM Research - Tokyo

2 Writing Fault-Tolerant Applications Using Resilient X10 / Kiyokuni Kawachiya2014/06/12

Writing Fault-Tolerant Applications Using Resilient X10Contents

X10 overview

– Extension for fault tolerance ... Resilient X10

X10 applications with fault tolerance

– Computing π with the Monte Carlo method ... MontePi– Clustering points by K-Means ... Kmeans– Computing heat diffusion ... HeatTransfer

Evaluation– Modification size, performance, and fault tolerance

Summary




Resilient X10 – Extension for Fault Tolerance

Stores activities’ critical information in a “Resilient Storage”

– The most stable implementation uses Place 0 for this purpose

Throws a new exception, DeadPlaceException (DPE), for a place death

– If an activity is being moved to the place, corresponding at throws the DPE– If an activity is asynchronously executed in the dead place by async,

governing finish will throw a MultipleExceptions which contains the DPE

A simple fault-tolerant program which just reports node failuresclass ResilientExample {

public static def main(Rail[String]) {finish for (pl in Place.places()) async {

try {at (pl) do_something(); // parallel distributed execution

} catch (e:DeadPlaceException) {Console.OUT.println(e.place + " died"); // report failure

}} // end of finish, wait for the execution in all places

} }

If the target place (pl) dies, at statement throws a DPE, which is caught here

Fig 3

→next slide




Utilization of the Rooted Exception Model

In X10, exceptions thrown from asynchronous activities can be caught

The finish governing the activity (async) receives the exception(s), and throws a MultipleExceptions ... Rooted Exception Model By enclosing a finish with try～catch, async exceptions can be caught

• DeadPlaceException can be caught with the same mechanism

class HelloWorld {public static def main(args:Rail[String]) {

finish for (pl in Place.places()) {at (pl) async { // parallel distributed exec in each place

Console.OUT.println("Hello from " + here);do_something();


}}



Console.OUT.println("Hello from " + here);do_something();


}}

try {

} catch (es:MultipleExceptions) { for (e in es.exceptions()) ... }

Fig 2

How to catch exceptions thrown asynchronously?




Writing Fault-Tolerant Applications

The DeadPlaceException notification (and some support methods) are sufficient to add fault tolerance to existing distributed X10

However, it is necessary to understand the structure of each application– How the application is doing the distributed processing?– How the execution can be continued after a node failure?

Introduce three methods to add fault tolerance(a) Ignore failures and use the results from the remaining nodes(b) Reassign the failed node’s work to the remaining nodes(c) Restore the computation from a periodic snapshot. (b)+checkpointing




(a) MontePi – Computing π with the Monte Carlo Method

Overview

Try ITERS times at each place, and update the result at Place 0

Place death is simply ignored

– The result may become less accurate, but it is still valid

class ResilientMontePi {public static def main (args:Rail[String]) {

:finish for (p in Place.places()) async {

try {at (p) {

val rnd = new x10.util.Random(System.nanoTime());var c:Long = 0;for (iter in 1..ITERS) { // ITERS trials per place

val x = rnd.nextDouble(), y = rnd.nextDouble();if (x*x + y*y <= 1.0) c++; // if inside the circle

}val count = c;at (result) atomic { // update the global result

val r = result(); r() = Pair(r().first+count, r().second+ITERS);} }

} catch (e:DeadPlaceException) { /* just ignore place death */ }} // end of finish, wait for the execution in all places/* calculate the value of π and print it */

} }

In case of node failure

Fig 4




(b) KMeans – Clustering Points by K-MeansOverview

Each place processes assigned points, and iterates until convergence

Don’t assign the work to dead place(s)

– The work is reassigned to remaining places

Place death is ignored– Partial results are

still utilized

class ResilientKMeans {public static def main(args:Rail[String]) {

:for (iter in 1..ITERATIONS) { // iterate until convergence

/* deliver current cluster values to other places */val numAvail = Place.MAX_PLACES - Place.numDead();val div = POINTS / numAvail; // share for each placeval rem = POINTS % numAvail; // extra share for Place 0var start:Long = 0; // next point to be processedtry {

finish for (pl in Place.places()) {if (pl.isDead()) continue; // skip dead place(s)var end:Long = start+div; if (pl==place0) end+=rem;at (pl) async { /* process [start,end), and return the data */ }start = end;

} // end of finish, wait for the execution in all places} catch (es:MultipleExceptions) { /* just ignore place death */ }/* compute new cluster values, and exit if converged */

} // end of for (iter)/* print the result */

} }Assign the work only to live nodes

Fig 5




(c) HeatTransfer – Computing Heat DiffusionOverview

A 2D DistArray holds the heat values of grid points

Each place computes heat diffusion for its local elements

Upon place death, the DistArray is restored from the snapshot

Create a snapshot of the DistArray at every 10th iteration

class ResilientHeatTransfer {static val livePlaces = new ArrayList[Place]();static val restore_needed = new Cell[Boolean](false);public static def main(args:Rail[String]) {

: val A = ResilientDistArray.make[Double](BigD, ...); // create a DistArrayA.snapshot(); // create the initial snapshotfor (iter in 1..ITERATIONS) { // iterate until convergence

try {if (restore_needed()) { // if some places died

val livePG = new SparsePlaceGroup(livePlaces.toRail()));BigD = Dist.makeBlock(BigR, 0, livePG); // recreate Dist, andA.restore(BigD); // restore elements from the snapshotrestore_needed() = false;

}finish ateach (z in D_Base) { // distributed processing

/* compute new heat values for A's local elements */Temp = ((at (A.dist(x-1,y)) A(x-1,y)) + (at (A.dist(x+1,y)) A(x+1,y))

+ (at (A.dist(x,y-1)) A(x,y-1)) + (at (A.dist(x,y+1)) A(x,y+1)))/4;}/* if converged, exit the for loop */if (iter % 10 == 0) A.snapshot(); // create a snapshot at every 10th iter.

} catch (e:Exception) { processException(e); }} // end of for (iter)/* print the result */

} }

Remove the dead place from the livePlaces list and set the restore_needed flag

Fig 7




Resilient DistArray – Fault Tolerant DistArray

An extended DistArray which supports snapshot and reconfiguration

public class ResilientDistArray[T] ... {public static def make[T](dist:Dist, init:(Point)=>T) : ResilientDistArray[T];public static def make[T](dist:Dist){T haszero} : ResilientDistArray[T];public final operator this(pt:Point) : T; // read elementpublic final operator this(pt:Point)=(v:T) : T; // set elementpublic final def map[S,U](dst:ResilientDistArray[S], src:ResilientDistArray[U],

filter:Region, op:(T,U)=>S) : ResilientDistArray[S];public final def reduce(op:(T,T)=>T, unit:T) : T;

:// Create a snapshotpublic def snapshot() { snapshot_try();snapshot_commit(); }public def snapshot_try() : void;public def snapshot_commit() : void;// Reconstruct the DistArray with new Distpublic def restore(newDist:Dist) : void;public def remake(newDist:Dist, init:(Point)=>T) : void;public def remake(newDist:Dist){T haszero} : void;

}

Interface overview

Normal DistArray interfaces, plus the followings:

snapshot()

Dump the element values into the Resilient Storage

restore(newDist)

Reconstruct the DistArray over live places, and restore the snapshot

Added I/F

Fig 6




Evaluation – Modification Amount

Modifications necessary to add fault tolerance to the base codes

Consideration

Fault tolerance could be added with very small modifications

– The modified code can still run on standard X10, as long as node failure does not occur

Modification ratio of HeatTransfer is relatively large, but– Larger DistArray applications can be made fault tolerant using the

same approach, and the modification ratio will be smaller– Half of the modifications are in the exception-handling code of

processException, which can be provided as a support library

Program Code size Mod. size

MontePi 30 lines 4 lines

KMeans 93 lines 12 lines

HeatTransfer 68 lines 24 lines

(ResilientDistArray) ~200 lines －




Evaluation – Performance

Relative execution times of base code and fault- tolerant code, on Standard and Resilient X10

Consideration

On Standard X10

– For MontePi and KMeans, almost no overheads were observed– For HeatTransfer, FT version took 9% more time, because of the cost of the

periodic snapshots

On Resilient X10

– For MontePi and KMeans, there were 2.2~9.0% overheads– For HeatTransfer, 6x slowdown mainly because at is invoked too frequently

5.0

5.2

5.4

5.6

5.8

6.0

6.2

6.4

6.6

1

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

MontePi KMeans HeatTransfer

Base on X10 FT on X10 Base on ResX10 FT on ResX10

Environment: 2.7GHz Xeon blades (total 16 cores),40Gbps InfiniBand, Red Hat Enterprise Linux Server 6.3Native X10 2.4.2, 8 places, Place-0 resilient storage

X10 ResX10

Low

er is

bet

ter

Bas

e co

deFT

cod

e

Fig 8




Evaluation – Fault Tolerance

Behavior when places are killed

Fault tolerance was achieved by the combination of Resilient X10 and fault-tolerant applications

Effects caused by place deaths

– When 4 places among the 8 places were killed• Deviation of MontePi result increased from 0.0008% to 0.002%• But the execution time did not increase

– When Place 2 was killed during the execution of the 17th iteration• Execution time increased by 11% in KMeans and 14% in HeatTransfer• But the executions still ended with correct results

Base code Fault-tolerant code

On Standard X10 Aborted Aborted

On Resilient X10 End with DPE Survived




ConclusionsSummary Introduced three methods of adding fault tolerance to existing applications

(a) Ignore failures and use the results from the remaining nodes ... MontePi(b) Reassign the failed node’s work to the remaining nodes ... KMeans(c) Restore the computation from a periodic snapshot ... HeatTransfer

with ResilientDistArray

Evaluated the fault-tolerant applications in a real distributed environment– Very small modifications to add fault tolerance– 2.2~9.0% execution overhead (but 6x slowdown in case of too frequent at)– 11~14% additional overhead when 1 among 8 places was lost

Future work– Reduce the overhead – both in Resilient X10 and fault-tolerant applications– Make fault-tolerant versions of larger X10 applications [12,19]




Additional Information about Resilient X10

The Resilient X10 function is included as a technology preview in X10 2.4.3 (released in May 2014)

– Can be enabled by specifying “X10_RESILIENT_MODE=1”– Can run with either of Native X10 and Managed X10

• The communication layer is limited to sockets– Sample codes exist under “samples/resiliency/”

• Refer to README.txt in the directory for details

Related papers[3] Resilient X10: Efficient Failure-Aware Programming

Cunningham, D., Grove, D., Herta, B., Iyengar, A., Kawachiya, K., Murata, H., Saraswat, V., Takeuchi, M. and Tardieu, O. Proceedings of PPoPP ’14, pp. 67–80 (2014).

[2] Semantics of (Resilient) X10 Crafa, S., Cunningham, D., Saraswat, V., Shinnar, A. and Tardieu, O. Proceedings of ECOOP ’14, to appear (2014).




BACKUP




Execution Example – HeatTransfer （10x10）$ cd X10/242/samples/resiliency$ x10c++ -O -NO_CHECKS ResilientHeatTransfer.x10 -o ResilientHeatTransfer$ X10_NPLACES=8 X10_RESILIENT_MODE=1 runx10 ResilientHeatTransfer 10HeatTransfer for 10x10, epsilon=1.0E-50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 X10_RESILIENT_STORE_MODE=0X10_RESILIENT_STORE_VERBOSE=0---- Iteration: 1delta=0.25---- Iteration: 2delta=0.125

:---- Iteration: 10delta=0.023305892944336Create a snapshot at iteration 10---- Iteration: 11delta=0.020451545715332

:

---- Iteration: 86Create new Dist over available 6 places0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 Restore from a snapshot at iteration 80delta=8.5E-4---- Iteration: 87delta=8.1E-4

:---- Iteration: 194delta=9.7E-6Result converged---- Result0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000 0.000 0.491 0.679 0.761 0.799 0.815 0.815 0.799 0.761 0.679 0.491 0.000 0.000 0.285 0.463 0.566 0.622 0.646 0.646 0.622 0.566 0.463 0.285 0.000 0.000 0.184 0.324 0.419 0.475 0.501 0.501 0.475 0.419 0.324 0.184 0.000 0.000 0.126 0.232 0.309 0.358 0.382 0.382 0.358 0.309 0.232 0.126 0.000 0.000 0.089 0.167 0.227 0.267 0.287 0.287 0.267 0.227 0.167 0.089 0.000 0.000 0.064 0.121 0.166 0.197 0.212 0.212 0.197 0.166 0.121 0.064 0.000 0.000 0.045 0.086 0.118 0.141 0.153 0.153 0.141 0.118 0.086 0.045 0.000 0.000 0.031 0.058 0.081 0.097 0.105 0.105 0.097 0.081 0.058 0.031 0.000 0.000 0.019 0.036 0.051 0.061 0.066 0.066 0.061 0.051 0.036 0.019 0.000 0.000 0.009 0.017 0.024 0.029 0.032 0.032 0.029 0.024 0.017 0.009 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

---- Iteration: 38delta=0.003633990233121---- Iteration: 39 <---- Place 2 was killedPlace 2 exited unexpectedly with signal: TerminatedMultipleExceptions size=2DeadPlaceException thrown from Place(2)DeadPlaceException thrown from Place(2)

---- Iteration: 40Create new Dist over available 7 places0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 Restore from a snapshot at iteration 30delta=0.005177850168275Create a snapshot at iteration 40---- Iteration: 41delta=0.004930303185298

:---- Iteration: 85 <---- Place 7 was killedPlace 7 exited unexpectedly with signal: TerminatedMultipleExceptions size=1DeadPlaceException thrown from Place(7)




ResilientKMeans




ResilientHeatTransfer




Programming Language X10

X10 [25] is a programming language that supports parallel/distributed computing internally

Parallel and distributed Hello World in X10

– Compilation and execution example using Native X10



Console.OUT.println("Hello from " + here);}

} // end of finish, wait for the execution in all places}

} $ x10c++ HelloWorld.x10 -o HelloWorld # compile$ X10_NPLACES=4 runx10 HelloWorld # executionHello from Place(3)Hello from Place(0)Hello from Place(2)Hello from Place(1)

Fig 2

Executed at each node (place) in a parallel and distributed manner




Execution Model of X10 – Asynchronous PGASAsynchronous Partitioned Global Address Space

A global address space is divided into multiple places (≒computing nodes)– Each place can contain activities and objects

An activity (≒thread) is created by async, and can move to another place by at

An object belongs to a specific place, but can be remotely referenced from other places– To access a remote reference, activities must move to its home place

DistArray is a data structure whose elements are scattered over multiple places

Fig 1

DistArray

DistArray

Immutable data, class, struct, function

async at at at

asyncasyncActivity

Object

Address space

Local ref

Place 0 Place MAX_PLACES-1...Place 1

Remote ref

Creation

Place shift




If a Computing Node Failed ...Consider the case Place 1 (’s node) dies

Activities, objects, and part of DistArrays in the dead place are lost– This causes the abort of the entire X10 processing in standard X10

However in PGAS model, it is relatively easy to localize the impact of place death– Objects in other places are still alive, although remote references become inaccessible– Can continue the execution using the remaining nodes (places) Resilient X10 [2,3]

DistArray

DistArray

Immutable data, class, struct, function

async at at at

asyncasyncActivity

Object

Address space

Local ref

Place 0 Place MAX_PLACES-1...Place 1

Remote ref

Creation

Place shift

Fig 1




DistArray – A Distributed Array Library

DistArray – An array whose elements are scattered over multiple places

– Suitable for SPMD-style processingExample – Compute 12 + 22 + ・・・

+ 10002 using multiple places

Upon a node failure, the DistArray elements in the dead place are lost

public class DistArrayExample {public static def main(Rail[String]) {

val R = Region.make(1..1000);val D = Dist.makeBlock(R, 0, PlaceGroup.WORLD);val A = DistArray.make[Long](D, ([i]:Point)=>i);val tmp = new Array[Long](Place.MAX_PLACES);finish for (pl in Place.places()) async {

tmp(pl.id) = at (pl) { var s:Long = 0;for (p:Point in D|here) s += A(p)*A(p); s };

} // end of finish, wait for the execution in all placesval result = tmp.reduce((a:Long,b:Long)=>a+b, 0);Console.OUT.println(result); // -> 333833500

} }

Overview

Create a DistArray D with initial values 1~1000

Each place processes its local elements (x), and calculates the sum of x2

Sum up the all results

Local part of the DistArray is processed here

Writing Fault-Tolerant Applications Using Resilient...

Documents

Transcript of Writing Fault-Tolerant Applications Using Resilient...