Error Scope on a Computational Grid: Theory and Practice Douglas Thain Computer Sciences Department University of Wisconsin USC Reliability Workshop July 2002

Page 1:

Error Scope on a Computational Grid:
Theory and Practice

Douglas Thain
Computer Sciences Department
University of Wisconsin

USC Reliability Workshop
July 2002

Page 2:

Outline

An Exercise: Condor + Java
Bad News: Error Explosion
A Theory of Error Propagation
Down with Generic Errors!
Condor Revisited
Parting Thoughts

Page 3:

An Exercise: Coupling Condor and Java

The Condor Project, est. 1985.
  – Production high-throughput computing facility.
  – Provides a stable execution environment on a Grid of unstable, autonomous resources.

The Java Language, est. 1991.
  – Production language, compiler, and interpreter.
  – Provides a standard instruction set and libraries on any processor and system.

The Grid, est. ????
  – Execute any code anywhere at any time.
  – Dependable, consistent, pervasive, inexpensive...
  – Are we there yet?

Page 4:

The Condor High Throughput Computing System

HTC != HPC
  – Measured in sims/week, frames/month, cycles/year.

All participants are autonomous.
  – Users give constraints on usable machines.
  – Machines give constraints on jobs and users.
  – ClassAds: a language for matchmaking.

If you are willing to re-link jobs...
  – Remote system calls for transparent mobility.
  – Binary checkpointing for migration and fault-tolerance.
  – Can’t relink? All other features available.

Special “universes” support software environments.
  – PVM, MPI, Master-Worker, Vanilla, Globus, Java

Page 5:

[Figure: Condor architecture. At the submission site, the user agent (schedd) forks a job agent (shadow) and has access to the home file system. At the execution site, the machine agent (startd) forks a job agent (starter), which forks the job. Each agent is subject to its own policy control. The schedd (“I want...”) and the startd (“I have...”) advertise to the matchmaker, which notifies both. The agents then use the claiming protocol, and the shadow and starter use the execution protocol.]

Page 6:

Java Universe

Execution:
  – User specifies .class and .jar files.
  – Machine provides the JVM details.

Input and Output:
  – Know all of your files? Condor transfers whole files for you.
  – Need online I/O? Link program with Chirp I/O Library. Execution site provides proxy to home site.

Page 7:

[Figure: Java Universe architecture. At the execution site, the job agent (starter) forks the JVM, which runs a wrapper around the job. The job’s I/O library makes local RPC (Chirp) calls to an I/O proxy. The proxy performs secure remote I/O with an I/O server at the submission site, where the job agent (shadow) makes local system calls on the home file system.]

Page 8:

Initial Experience

Bad news! Any kind of error sent the job back to the user with an exception message:
  – NullPointerException - Program is faulty.
  – OutOfMemory - Program outgrew machine.
  – ClassNotFoundError - Machine incorrectly installed.
  – ConnectionRefused - Network temporarily unavailable.

Users were frustrated because they had to evaluate whether the job failed or the system failed.

These messages were correct in the sense that they were true. They were not bugs: we deliberately trapped all possible errors and passed them up the chain.

Page 9:

What’s the Problem?

To reason about this problem, we began to construct a theory of error propagation.

This theory offers some common definitions and four principles that outline a design discipline.

We re-examined the Java Universe according to this theory.

Our most serious mistake: We failed to propagate errors according to their scope.

Page 10:

We are NOT Talking About:

Fault Tolerance
  – What algorithms are fault-resistant?
  – How many disks can I lose without losing data?
  – How many copies should I make for five nines?

Language Structures
  – Should I use Objects or Strings to represent errors?
  – Should I use Exceptions or Signals to communicate errors?

These are important and valuable questions, but we are asking something different!

Page 11:

We ARE Talking About:

Where is the problem?
How should a program respond to an error?
Who should receive an error message?
What information should an error carry?
How can we even reason about this stuff?

Page 12:

Engineering Perspective

Fault
  – A physical disruption of the machine.

Error
  – An information state that reflects a fault.

Failure
  – A violation of documented/guaranteed behavior.

Fault
  – (A failure in one’s underlying components.)

Page 13:

Interface Perspective

Implicit Error
  – A result presented as valid, but found to be false.
  – Example: sqrt(3) -> 2.

Explicit Error
  – A result describing an inability to carry out the request.
  – Example: open(“file”) -> ENOENT.

Escaping Error
  – A return to a higher level of abstraction.
  – Example: read -> virt mem failure -> process abort.
  – Example: server out of memory -> shutdown socket
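The three categories above can be sketched in Java. This is only an illustration: the class and method names (ErrorKinds, brokenSqrt, open, load, FileNotFound) and the stand-in return values are assumptions made for the example, not code from Condor or from the talk.

```java
// Hypothetical names throughout; this only illustrates the three categories.
final class FileNotFound extends Exception {
    FileNotFound(String name) { super("no such file: " + name); }
}

final class ErrorKinds {
    // Implicit error: the result is presented as valid, but it is false.
    // The caller cannot tell that 2 is not the square root of 3.
    static int brokenSqrt(int x) {
        return 2;
    }

    // Explicit error: the declared exception states that the request could
    // not be carried out, much as open("file") reports ENOENT.
    static int open(String name) throws FileNotFound {
        if (!name.equals("present.txt")) {
            throw new FileNotFound(name);
        }
        return 3; // stand-in for a file descriptor
    }

    // Escaping error: the failure is not part of the interface, so control
    // leaves this level of abstraction through an unchecked Error.
    static int load(int address, boolean backingStoreOnline) {
        if (!backingStoreOnline) {
            throw new Error("virtual memory failure at address " + address);
        }
        return 0; // stand-in for the loaded word
    }
}
```

A caller of open is forced by the compiler to consider FileNotFound, while nothing in the signature of load mentions the virtual memory failure; that failure can only escape.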

Page 14:

[Figure: A program issues a load to the virtual memory system, which is backed by physical memory and a backing store. The program ends with a normal exit or an abnormal exit, reported to its parent process.]

Could return a default value, but that creates an implicit error.

Would like to return an explicit error, but a load instruction has no exit code.

Escaping error: Tell the parent that the program could not complete.

Page 15:

Interface Contracts

int load( int address );

The implementor must either compute a result that conforms to the contract or cause an escaping error.

Page 16:

Exceptions

int open( String filename ) throws FileNotFound, AccessDenied;

A language with exceptions provides more structure to the contract.
A declared exception is an explicit error.
Yet, escaping errors are still possible.

Page 17:

[Figure: A program calls open on the virtual file system, which is implemented on memory and disk. INTERFACE: open returns Success, FileNotFound, or AccessDenied to the program. IMPLEMENTATION: MemoryCorrupt, DiskOffline, and PigeonLost are not part of the interface. The program reports a normal exit or an abnormal exit to its parent process.]

Page 18:

Error Scope

In order to be accepted by end users, a distributed system must be able to distinguish between errors computed by the program and errors forced upon it by the environment.

We use the term scope to draw the distinction.

Page 19:

Error Scope

The scope of an error is the portion of the system that it invalidates.

An error must be delivered to the process responsible for managing that scope.

Error                     Scope         Handler
FileNotFound              File          Calling Function
RPC Disconnect            Process       Parent Process
Cache Coherency Problem   Machine       Hypervisor or Operator
PVM Node Crash            PVM Cluster   Parent Process

Page 20:

Error Detail

The detail of an error describes in phenomenological terms the cause of the error.

In the right hands, the detail is useful. In the wrong hands, the detail can be misleading.

Suppose open returns AccessDenied...
  – File is not accessible - Ok.
  – Library containing ‘open’ is not accessible - Problem!

Page 21:

What To Do With An Error?

A program cannot possibly know what to do with an error outside its scope.
  – Should sin(x) deal with “math library not available?”

Propagate an error to the manager of the scope as directly as possible.

Sometimes, a direct mechanism:
  – Signal, exception, dropped connection, message.

Sometimes, an indirect mechanism:
  – Touch a file, then exit by any means available.

Page 22:

Principles for Error Design

Principle 1:
  – A routine must not generate an implicit error as a result of receiving an explicit error.

Principle 2:
  – An escaping error converts a potential implicit error into an explicit error at a higher level.

Principle 3:
  – An escaping error must be propagated to the program that manages the error’s scope.

Principle 4:
  – Error interfaces must be concise and finite.
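Principle 1 is easy to violate by accident. The following minimal sketch contrasts a routine that breaks it with one that obeys it; LineSource, SourceUnavailable, and PrincipleOne are hypothetical names invented for this example.

```java
// Hypothetical explicit error reported by a lower layer.
final class SourceUnavailable extends Exception {
    SourceUnavailable(String why) { super(why); }
}

// Hypothetical lower-level interface with one declared (explicit) error.
interface LineSource {
    int readLineCount() throws SourceUnavailable;
}

final class PrincipleOne {
    // Violates Principle 1: the explicit error is swallowed and 0 is
    // returned as if it were a valid count, which is an implicit error.
    static int countOrZero(LineSource source) {
        try {
            return source.readLineCount();
        } catch (SourceUnavailable e) {
            return 0;
        }
    }

    // Obeys Principle 1: the explicit error stays explicit for the caller.
    static int count(LineSource source) throws SourceUnavailable {
        return source.readLineCount();
    }
}
```

With countOrZero, a caller cannot distinguish an empty source from an unavailable one; with count, the distinction survives.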

Page 23:

Return to Condor

What did we do wrong?
  – We failed to carefully consider the scope of an error.
  – We fell prey to the deadly generic error.

What’s the solution?
  – Identify error scopes in Condor.
  – Find more direct mechanisms to send escaping errors to the managing process.

Page 24:

[Figure: Error scopes in Condor, drawn around the chain of processes schedd, shadow, starter, JVM, and program (code and data). Scopes shown: Program Scope, Virtual Machine Scope, Remote Resource Scope, Local Resource Scope, Job Scope. Resources labeled in the figure: input data, program arguments, program image, output space, I/O server, user policy, owner policy, Java package, memory and CPU.]

Page 25:

Scope in Condor

Detail                      Scope             Handler
Program exited normally.    Program           User
Null pointer exception.     Program           User
Out of memory.              Virtual Machine   JVM
Java misconfigured.         Remote Resource   Starter
Home file system offline.   Local Resource    Shadow
Program image corrupt.      Job               Schedd
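The scope-to-handler column of this table can be written down as data. The sketch below is one possible encoding, assuming a hypothetical enum named CondorScope; it is not taken from the Condor source.

```java
// Hypothetical encoding of the scope and handler columns of the table.
enum CondorScope {
    PROGRAM("user"),
    VIRTUAL_MACHINE("JVM"),
    REMOTE_RESOURCE("starter"),
    LOCAL_RESOURCE("shadow"),
    JOB("schedd");

    private final String handler;

    CondorScope(String handler) {
        this.handler = handler;
    }

    // The party that manages this scope and should receive its errors.
    String handler() {
        return handler;
    }

    // Only errors of program scope belong in front of the user.
    boolean isUsersProblem() {
        return this == PROGRAM;
    }
}
```

Keeping the mapping in one place means a component that detects an error only has to name the scope; where the error is delivered follows from the table.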

Page 26:

Scope in Condor: JVM Exit Code

Detail                      Scope             Handler   Exit Code
Program exited normally.    Program           User      (x)
Null pointer exception.     Program           User      1
Out of memory.              Virtual Machine   JVM       1
Java misconfigured.         Remote Resource   Starter   1
Home file system offline.   Local Resource    Shadow    1
Program image corrupt.      Job               Schedd    1

Page 27:

[Figure: Revised Java Universe. The JVM runs a wrapper around the job and its I/O library. The wrapper writes a result file containing either the program result or an error and its scope. The job agent (starter) receives the JVM result and the result file, and reports the starter result plus the program result to the job agent (shadow), which has access to the home file system.]
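A minimal sketch of the wrapper idea follows. The classification rules, the scope labels, and the one-line result format are all assumptions made for illustration; the talk does not specify them. A real wrapper would write the line to the result file and then exit, whereas this sketch only returns it.

```java
final class WrapperSketch {
    // Hypothetical stand-in for the user's program.
    interface Job {
        void run() throws Throwable;
    }

    // Assumed classification rules: an exhausted heap invalidates the
    // virtual machine, a linkage problem suggests a bad Java installation
    // on the remote resource, and anything else is the program's own error.
    static String scopeOf(Throwable t) {
        if (t instanceof OutOfMemoryError) {
            return "virtual-machine";
        }
        if (t instanceof LinkageError) {
            return "remote-resource";
        }
        return "program";
    }

    // Runs the job and describes the outcome as one line of text that a
    // starter could read back instead of relying on the JVM exit code.
    static String runAndDescribe(Job job) {
        try {
            job.run();
            return "result=normal-exit scope=program";
        } catch (Throwable t) {
            return "result=" + t.getClass().getSimpleName()
                    + " scope=" + scopeOf(t);
        }
    }
}
```

Because the scope travels with the result, the starter no longer has to guess what an exit code of 1 means.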

Page 28:

[Figure: The same components (JVM, wrapper, the job, I/O library, I/O proxy, result file, JVM result, starter, shadow, home file system), annotated to distinguish errors inside program scope from errors of larger scope.]

Page 29:

Half-Way Conclusion

Small but powerful changes drastically improved the Java Universe.

Our mistake was to represent all possible errors explicitly in the closest interface.

Error scope is an analytic tool that helps the designer decide how to propagate an error.

But, we were initially confused by the presence of the deadly generic error.

Page 30:

The Deadly Generic Error

Whereas, a program may fail in more ways than we can possibly imagine...

And whereas, generality and flexibility are virtues of programming...

Be it therefore resolved that interfaces should return general, flexible, arbitrary values:
  – int open( String name ) throws IOException;

Page 31:

What’s Wrong with Generality?

The structure and types of errors are as essential to an interface as the arguments and return values.

Every error requires a different recovery mechanism, according to its scope:
  – EINTR - try again right away
  – ETIMEDOUT - will be available again in the future
  – EPERM - you can’t at all without talking to a person
  – ESTALE - must kill process

A program must know the *specific* details of an error in order to take the right action. Guesses don’t work.
  – Exit on unknown errors? Program is brittle.
  – Retry on unknown errors? Program waits endlessly.
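The four examples above amount to a lookup from a specific error to a specific recovery action. A small sketch, using hypothetical enums Errno and Action named after the slide’s examples:

```java
// Hypothetical enums; the constants mirror the four examples on the slide.
enum Errno { EINTR, ETIMEDOUT, EPERM, ESTALE }

enum Action { RETRY_NOW, RETRY_LATER, ASK_A_PERSON, KILL_PROCESS }

final class Recovery {
    // Each specific error maps to exactly one recovery action. Nothing
    // here guesses: an error outside the finite set is rejected outright.
    static Action actionFor(Errno error) {
        switch (error) {
            case EINTR:
                return Action.RETRY_NOW;
            case ETIMEDOUT:
                return Action.RETRY_LATER;
            case EPERM:
                return Action.ASK_A_PERSON;
            case ESTALE:
                return Action.KILL_PROCESS;
            default:
                throw new IllegalArgumentException("unexpected error: " + error);
        }
    }
}
```

A table like this can only be written when the interface promises a finite set of errors. With a generic error type, every branch collapses into a guess.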

Page 32:

An Example of Generality

int open( String name ) throws IOException;

int write( int data ) throws IOException;

Page 33:

An Example of Generality

Java defines several types of IOException:
  – AccessDenied, FileNotFound, EndOfFile...

Can open throw...?
  – FileNotFound
  – EndOfFile
  – DiskFull

Can write throw...?
  – AccessDenied
  – FileNotFound
  – DiskFull

Trick Question!

Page 34:

My Disk Runneth Over!

What can a program expect for a full disk?
  – DiskFullException
  – OutOfSpaceException
  – It’s really neither! (How would we know?)

What should an implementor do when the disk fills up?
  – There is no appropriate exception to throw.
  – Making up an exception is not useful.
  – Only solution: an escaping error. (Example later.)

Page 35:

Advice for Constructing Error Interfaces

Export a small set of expected error types.
  – Bad Arguments
  – Lost Connection
  – No Such File

Choose an internal error management strategy. You know the cost of retry vs the cost of failure.
  – Retry internally
  – Abort process
  – Drop connection

Page 36:

A Better Interface

int open( String name ) throws AccessDenied, FileNotFound;

int write( int data ) throws DiskFull;

Page 37:

Conclusion

Small but powerful changes drastically improved the Java Universe.

Our mistake was to represent all possible errors explicitly in the closest interface.

Error scope is an analytic tool that helps the designer decide how to propagate an error.

An error discipline saves precious resources: time and aggravation!

Page 38:

For more information...

“Error Scope” Paper
  – http://www.cs.wisc.edu/~thain

Douglas Thain
  – [email protected]

Miron Livny
  – [email protected]

Condor Software, Manuals, Papers, and More
  – http://www.cs.wisc.edu/condor

Questions now?