Asynchronous Remote Execution
PhD Preliminary Examination
Douglas Thain
University of Wisconsin
19 March 2002
Thesis
Asynchronous operations improve the throughput, resiliency, and scalability of remote execution systems.
However, asynchrony introduces new failure modes that must be carefully understood in order to preserve the illusion of synchronous operation.
Proposal
I propose to explore the coupling between asynchrony, failures, and performance in remote execution.
To accomplish this, I will modify an existing system and increase the available asynchrony in degrees.
Contributions
- A measurement of the performance benefits of asynchrony.
- A system design that accommodates asynchrony while tolerating a significant set of expected failure modes.
- An exploration of the balance between performance, risk, and knowledge in a distributed system.
Science Needs Large-Scale Computing
Theory, Experiment, Computation
Nearly every field of scientific study has a "grand challenge" problem: meteorology, genetics, astronomy, physics.
The Grid Vision
[Diagram: many CPU/RAM/disk nodes, disk archives, tape archives, and security services joined into one shared computing fabric.]
The Grid Reality
Systems for managing CPUs: Condor, LSF, PBS
Programming Interfaces: POSIX, Java, C#, MPI, PVM....
Systems for managing data: SRB, HRM, SAM, ReqEx (DapMan)
Systems for storing data: NeST, IBP
Systems for moving data: GridFTP, HTTP, Kangaroo
Systems for remote authentication: SSL, GSI, Kerberos, NTSSPI
The Grid Reality
Host uptime: Median: 15.92 days; Mean: 5.53 days; Local Maximum: 1 day.
(Long et al., "A Longitudinal Study of Internet Host Reliability," Symposium on Reliable Distributed Systems, 1995.)
Wide-area connectivity: approx 1% chance of a 30-second interruption; approx 0.1% chance of a persistent outage.
(Chandra et al., "End-to-end WAN Service Availability," Proceedings of the 3rd Usenix Symposium on Internet Technologies and Systems, 2001.)
[Diagram: a job placed into the grid of CPU/RAM/disk nodes, disk archives, tape archives, and security services.]
Usual Approach: Hold and Wait
- Request CPU, wait for success.
- Stage data to CPU, wait.
- Move executable to CPU, wait.
- Execute program, wait.
- Missing file? Stage data, wait.
- Stage output data, wait.
- Failure? Start over...
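The hold-and-wait sequence above can be sketched as a strictly serial loop: every stage blocks, and any failure restarts the whole chain from the beginning. This is a minimal illustrative sketch; the stage names are hypothetical placeholders, not Condor APIs.

```python
def run_job_hold_and_wait(steps, max_attempts=3):
    """Run each stage in order, blocking on each; any failure restarts
    the entire sequence from scratch (the hold-and-wait pattern)."""
    for attempt in range(max_attempts):
        try:
            for step in steps:   # request CPU, stage data, execute, stage output...
                step()           # block until this stage succeeds
            return True          # all stages done
        except RuntimeError:
            continue             # failure? start over from the beginning
    return False

# Example: the execute stage fails once, forcing a full restart of all stages.
calls = []
state = {"failed": False}

def make_step(name, fail_once=False):
    def step():
        calls.append(name)
        if fail_once and not state["failed"]:
            state["failed"] = True
            raise RuntimeError(name + " failed")
    return step

ok = run_job_hold_and_wait([
    make_step("request_cpu"),
    make_step("stage_input"),
    make_step("execute", fail_once=True),
    make_step("stage_output"),
])
```

Note that the earlier, successful stages are repeated after the failure: this re-consumption of resources is exactly the inflexibility the thesis targets.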
Synchronous Systems are Inflexible
Poor utilization and throughput due to hold-and-wait:
- CPU idle while disk busy.
- Disk idle while CPU busy.
- Disk full? CPU stops.
System sensitive to failures of both performance and correctness:
- Network down? Everything stops.
- Network slow? Everything slows.
- Credentials lost? Everything aborts.
Resiliency Requires Flexibility
Most jobs have weak couplings between all of their components.
Asynchrony: Time Decoupling
- Can't have network now? OK, use disk.
- Can't have CPU now? OK, checkpoint.
- Can't store data now? OK, recompute later.
Time Decoupling -> Space Decoupling
Computing’s Central Challenge:
“How not to make a mess of it.”
- Edsger Dijkstra, CACM March 2001.
How can we harness the advantages of asynchrony while maintaining a coherent and reliable user experience?
Outline
- Introduction
- A Case Study: The Distributed Buffer Cache
- Remote Execution
- Related Work
- Progress Made
- Research Agenda
- Conclusion
Case Study: The Distributed Buffer Cache
The Kangaroo distributed buffer cache introduces asynchronous I/O for remote execution.
It offers improved job throughput and failure resiliency at the price of increased latency in I/O arrival.
A small mess: jobs and I/O are not recoupled at completion time.
Kangaroo Prototype
[Diagram: an application writes through a chain of Kangaroo (K) nodes, each backed by a disk buffer.]
An application may contact any node in the system and perform partial-file reads and writes.
The node may then execute or buffer operations as conditions warrant.
A consistency protocol ensures no loss of data due to crash/disconnect.
Distributed Buffer Cache
[Diagram: an application's I/O flows from its local disk through a network of Kangaroo (K) nodes, the distributed buffer cache, out to several file systems.]
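The behavior described above can be sketched as a buffer node that accepts partial-file writes immediately and delivers them to the destination when conditions permit, keeping buffered data until it is acknowledged. This is a minimal single-process sketch; the class and method names are hypothetical, not the Kangaroo API.

```python
import collections

class BufferNode:
    """Accept writes immediately into a local buffer; deliver them to the
    destination when conditions warrant. Buffered data is kept until it is
    delivered, so a link failure loses nothing already accepted."""
    def __init__(self, destination):
        self.destination = destination          # stand-in for the home file system
        self.buffer = collections.deque()       # spooled (path, offset, data)
        self.connected = True

    def write(self, path, offset, data):
        self.buffer.append((path, offset, data))  # returns immediately

    def flush(self):
        """Push buffered operations; stop (keeping data) on disconnect."""
        while self.buffer:
            if not self.connected:
                return False                     # keep data for later retry
            path, offset, data = self.buffer[0]
            self.destination.setdefault(path, {})[offset] = data
            self.buffer.popleft()                # discard only after delivery
        return True

fs = {}
node = BufferNode(fs)
node.write("/out/img1", 0, b"pixels-a")          # job continues immediately
node.write("/out/img2", 0, b"pixels-b")

node.connected = False
node.flush()                 # network down: data stays safely buffered
node.connected = True
node.flush()                 # reconnect: everything is delivered
```

The key design point is that the writer never blocks on the network, while the pop-after-delivery discipline preserves the no-data-loss guarantee across disconnections.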
Macrobenchmark: Image Processing
Post-processing of satellite image data: need to compute various enhancements and produce output for each.
  Read input image
  For I = 1 to N:
      Compute transformation of image
      Write output image
Example: image size about 5 MB; compute time about 6 sec; I/O-CPU ratio 0.91 MB/s.
I/O Models for Image Processing
[Diagram: timelines for three I/O models. Online streaming I/O interleaves INPUT and OUTPUT with each CPU phase; offline staging I/O moves all INPUT first, then runs the CPU phases, then moves all OUTPUT; Kangaroo overlaps buffered OUTPUT with the following CPU phases.]
A Small Mess
The output will make it back eventually, barring the removal of a disk. But what if...
- ...we need to know when it arrives?
- ...the data should be cancelled?
- ...it never arrives?
There is a hold-and-wait operation (push), but this defeats much of the purpose.
The job result needs to be a function of both the compute and data results.
Lesson
We may decouple CPU and I/O consumption for improved throughput. But CPU and I/O must be semantically coupled at both dispatch and completion in order to provide useful semantics.
This is not necessary in a monolithic system:
- All components fail at once.
- Integration of CPU and I/O management (fsync).
Outline
- Introduction
- A Case Study
- Remote Execution
  - Synchronous Execution
  - Asynchronous Execution
  - Failures, Transparency, and Performance
- Related Work
- Progress Made
- Research Agenda
- Conclusion
Remote Execution
Remote execution is the problem of running a job in a distributed system.
A job is a request to consume a set of resources in a coordinated way:
- Abstract: programs, files, licenses, users
- Concrete: CPUs, storage, servers, terminals
A distributed system is a computing environment that is:
- Composed of autonomous units.
- Subject to uncoordinated failure.
- Subject to high performance variability.
About Jobs
Job policy dictates what resources are acceptable to consume:
- CPU must be a SPARC.
- Must have > 128 MB memory.
- Must be within 100 ms of a disk server.
- CPU must be owned by a trusted authority.
- Input data set X may come from any trusted replication site.
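A job policy like the one above can be sketched as a set of predicates evaluated against candidate resource descriptions, in the spirit of ClassAd matchmaking. This is a minimal illustrative sketch; the attribute names and values are hypothetical, not the actual Condor ClassAd language.

```python
def matches(job_policy, machine_ad):
    """Return True if the machine ad satisfies every requirement."""
    return all(req(machine_ad) for req in job_policy)

# Requirements corresponding to the example policy above (names illustrative).
job_policy = [
    lambda m: m["arch"] == "SPARC",             # CPU must be a SPARC
    lambda m: m["memory_mb"] > 128,             # must have > 128 MB memory
    lambda m: m["disk_latency_ms"] <= 100,      # within 100 ms of a disk server
    lambda m: m["owner"] == "trusted",          # owned by a trusted authority
]

machines = [
    {"arch": "SPARC", "memory_mb": 256, "disk_latency_ms": 40, "owner": "trusted"},
    {"arch": "x86",   "memory_mb": 512, "disk_latency_ms": 10, "owner": "trusted"},
    {"arch": "SPARC", "memory_mb": 64,  "disk_latency_ms": 40, "owner": "trusted"},
]

acceptable = [m for m in machines if matches(job_policy, m)]
```

Only the first machine satisfies every predicate; the others fail on architecture and memory respectively.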
About Jobs
The components of a job have flexible temporal and spatial requirements.
[Diagram: job components (program image, CPU, input device, output device, input data, output data, license credentials) annotated with requirements such as "read," "write," "present throughout," "present at startup," and "interactive preferred."]
Expected Jobs
In this work, I will concentrate on a limited class of jobs:
- Executable image.
- Single CPU request.
- May checkpoint/restart to manage CPU.
- Input data (online/offline).
- Output data (multiple targets).
Expected Systems
High latency:
- I/O operations are ms -> sec.
- Process dispatch is seconds -> minutes.
Performance variation:
- TCP hiccups cause outages of seconds -> minutes.
- By day the network is congested; by night it is free.
Uncoordinated failure:
- File system fails, CPU continues to run.
- Network fails, but endpoints continue.
Autonomy:
- Users reclaim workstation CPUs.
- Best-effort storage is reclaimed.
Expected Users
A wide variety of users will have varying degrees of policy aggression.
Scientific computation: maximize long-term utilization/throughput.
Scientific instrument: Minimize use of one device.
Disaster response: Compute this ASAP at any expense!
Graduate student: Finish job before mid-life crisis.
The Synchronous Approach
Grab one resource at a time as resources become necessary and available.
Assume any other resources are immediately available online.
Start with the resource with the most contention.
Examples: the Condor distributed batch system; the Fermi Sequential Access Manager (SAM).
Problems
- What if one resource is not obviously (or consistently) the constraint?
- What is the expense of holding one resource idle while waiting for another?
- What if no single resource is under your absolute control?
- What if all your resource requirements cannot be stated offline?
- How can we deal with failures without starting everything again from scratch?
Asynchronous Execution
- Recognize when a job has loose synchronization requirements.
- Seek parallelism where available.
- Synchronize parallel activities at necessary joining points.
- Allow idle resources to be released and re-allocated for use by others.
- Consider failures in execution as allocation problems.
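The steps above can be sketched as independent allocation requests that proceed in parallel and join only where the job actually needs both. This is a minimal sketch using Python threads; a real system would negotiate with separate CPU and storage managers.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def allocate(resource, delay):
    """Stand-in for negotiating one allocation (CPU, input data, ...)."""
    time.sleep(delay)
    return resource + "-granted"

start = time.monotonic()
with ThreadPoolExecutor() as pool:
    # Seek CPU and input data in parallel rather than one after the other.
    cpu = pool.submit(allocate, "cpu", 0.2)
    data = pool.submit(allocate, "input-data", 0.2)
    # Joining point: execution needs both allocations before dispatch.
    results = (cpu.result(), data.result())
elapsed = time.monotonic() - start

# Parallel acquisition takes roughly max(delays), not their sum.
```

Serial hold-and-wait would take the sum of the two negotiation times; the joining point is the only place synchronization is actually required.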
The Benefits of Asynchrony
- Better utilization of disjoint resources -> higher throughput.
- More resilient to performance variations.
- Less expensive recovery from partial system failures.
The Price of Asynchrony
Complexity:
- Many new boundary cases to cover.
- Is the complexity worth the trouble?
Risk: without appropriate policies, we may:
- Oversubscribe (internal fragmentation)
- Undersubscribe (external fragmentation)
The Problem of Closing the Loop
[Diagram: between job submission and job completion, the job's CPU request (exec, exit, exit code) and its reads of input data and writes of output data must all be joined into a single result.]
Asynchronous Open-Loop I/O
[Diagram: the CPU runs and yields a program result while I/O is dispatched asynchronously; the I/O result and its validation arrive after the CPU has gone idle, and nothing joins them back to the program result.]
Asynchronous Closed-Loop I/O
[Diagram: as above, but the program result and the validated I/O result are combined into a single job result.]
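Closing the loop means the job result is computed from both the program result and the validated I/O result, rather than from the exit code alone. A minimal sketch, with illustrative state names:

```python
def job_result(program_result, io_result):
    """A job succeeds only when the CPU side exited cleanly AND the
    asynchronously dispatched output was delivered and validated.
    An open-loop system reports program_result alone and loses the I/O half."""
    if program_result != "exit-0":
        return "failed"               # compute side failed
    if io_result == "validated":
        return "complete"             # both halves succeeded
    if io_result == "pending":
        return "waiting-for-output"   # CPU may be released in the meantime
    return "failed"                   # output lost or rejected

assert job_result("exit-0", "validated") == "complete"
assert job_result("exit-0", "pending") == "waiting-for-output"
assert job_result("exit-1", "validated") == "failed"
assert job_result("exit-0", "lost") == "failed"
```

The "waiting-for-output" case is the crucial addition: it lets the CPU be released while still guaranteeing that the job is not declared complete until the I/O half closes.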
Related Work
Many components of grid computing: CPUs, storage, networks...
Many traditional research areas: scheduling, file systems, virtual memory...
- What systems seek parallelism in operations that would appear to be atomic?
- What systems exchange one resource for another?
- How do they deal with failures?
Computer Architecture
These two are remarkably similar:
  sort -n < infile > outfile
  ADD [r1+8], [r2+12], [r3]
Each has multiple parts with a loose coupling in time and space:
  idle -> working -> done -> committed
A failure or an unsuccessful speculation must roll back dependent parts.
Trading Storage for Communication
- Immediate closure: synchronous I/O
- Bounded closure: GASS, AFS, transactions
- Indeterminate closure: UNIX buffer cache, imprecise exceptions
- Human closure: Coda -> failures invoke email
Trading Computation for Communication
Time Warp simulation model:
- All nodes checkpoint frequently.
- All messages are one-way, without synchronization.
- Missed a message? Roll back and send out anti-messages to undo earlier work.
- Problems: When can I throw out a checkpoint? Can't tolerate message failure.
Virtual Data Grid:
- Data sets have a functional specification.
- Transfer here, or recompute? Decide at run-time using cost/benefit.
Progress Made
We have already laid much of the research foundation necessary to explore asynchronous remote execution.
Deployable software: Bypass, Kangaroo, NeST
Organizing concepts:
- Distributed buffer cache
- I/O communities
- Error management theory
References in ClassAds
[Diagram: a JobAd that refers to NearestStorage is matched against a MachineAd; the matched machine, paired with a NeST storage server through a StorageAd, knows where NearestStorage is.]
Distributed Buffer Cache
[Diagram: as in the case study, an application's I/O flows from its local disk through a network of Kangaroo (K) nodes to several file systems.]
Error Management
In preparation: "Error Scope on a Computational Grid: Theory and Practice."
An environment for Java in Condor.
How do we understand the significance of the many things that may go wrong?
Every scope must have a handler.
Publications
- Douglas Thain and Miron Livny, "Error Scope on a Computational Grid," in preparation.
- Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny, "Gathering at the Well: Creating Communities for Grid I/O," Proceedings of Supercomputing 2001, Denver, Colorado, November 2001.
- Douglas Thain, Jim Basney, Se-Chang Son, and Miron Livny, "The Kangaroo Approach to Data Movement on the Grid," Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), San Francisco, California, August 7-9, 2001, pp. 325-333.
- Douglas Thain and Miron Livny, "Multiple Bypass: Interposition Agents for Distributed Computing," Journal of Cluster Computing, Volume 4, pp. 39-47, 2001.
- Douglas Thain and Miron Livny, "Bypass: A Tool for Building Split Execution Systems," Proceedings of the Ninth IEEE Symposium on High Performance Distributed Computing (HPDC9), Pittsburgh, Pennsylvania, August 1-4, 2000, pp. 79-85.
Research Agenda
I propose to create an end-to-end structure for asynchronous remote execution.
To accomplish this, I will take an existing remote execution system and increase the asynchrony by degrees.
The focus will be mechanisms, not policies:
- Suggest points where policies must be attached.
- Use simple policies to demonstrate use.
- Mechanisms must be correct regardless of policy choices.
Research Environment
The Condor distributed batch system.
Local resources:
- Test Pool: approx 20 workstations.
- Main Pool: approx 1000 machines.
- Possible to deploy significant changes to all participating software.
Remote resources:
- INFN Bologna: approx 300 workstations.
- Other pools as necessary.
- Can only deploy changes within the context of an interposition agent.
Stage One: Asynchronous Output
Timeline: March-May 2002
Goal: Decouple CPU allocation from output data movement.
Method: Couple Kangaroo with Condor and close the loop. The job requires a new state: "waiting for output."
Policy: How long should a job remain in "waiting for output" before it is re-run?
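The new state and its policy question can be sketched as a small transition function: after the program exits, the job sits in "waiting for output" and is re-run only if the output fails to arrive within a policy-chosen deadline. A minimal sketch; the state names and timeout value are illustrative.

```python
def next_state(state, event, waited_s, timeout_s=3600):
    """Job state transitions for asynchronous output (times in seconds)."""
    if state == "running" and event == "program-exited":
        return "waiting-for-output"        # the CPU allocation is released here
    if state == "waiting-for-output":
        if event == "output-arrived":
            return "complete"
        if event == "tick" and waited_s > timeout_s:
            return "rerun"                 # policy decision: give up and re-run
        return "waiting-for-output"
    return state

s = next_state("running", "program-exited", 0)
still_waiting = next_state(s, "tick", 120)        # within the deadline
done = next_state(s, "output-arrived", 120)       # output made it back
gave_up = next_state(s, "tick", 7200)             # deadline exceeded
```

The timeout is deliberately a parameter: it is exactly the point where a policy must be attached to the mechanism.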
Stage Two: Asynchronous Input
Timeline: June-August 2002
Goal: Decouple CPU allocation from input data movement.
Method: Modify the scheduler to be aware of I/O communities and to seek CPU and I/O allocations independently. Unexpected I/O needs may use checkpointing to release CPU allocations.
Policy: How long to hold idle before timeout? How to estimate queueing time for each resource?
Stage Three: Disconnected Operation
Timeline: September-December 2002
Goal: Permit the application to execute without any run-time dependence on the submitter.
Method: Release the umbilical once policy is set. The job requires a new state: "presumed alive." Unexpected policy needs may require reconnection.
Policy: How much autonomy may be delegated to the interposition agent? (performance/control)
Stage Four: Dissertation
Timeline: January-May 2003
Design: What algorithms and data structures are necessary?
Performance: What are the quantitative costs/benefits?
Discussion: What are the tradeoffs between performance, risk, and knowledge? What are the implications of designing for fault-tolerance?
Evaluation Criteria
Correctness: The system must meet its interface obligations.
Reliability: Satisfy the user with high probability.
Throughput: Improve by avoiding hold-and-wait.
Latency: A modest increase is acceptable for batch workloads.
Knowledge: Has my job finished? How much has it consumed?
Complexity
Contributions
Short-term: Improvement of the Condor software. Not the goal, but a necessary validation.
Medium-term: Serve as a design resource for grid computing, through key concepts such as "closing the loop" and "matching interfaces."
Long-term: Serve as a basis for further research.
Further Work
Should jobs move to data, or vice versa? Let's try both!
Many opportunities for speculation: potentially stale data in the file cache? Keep going.
Partial program completion is useful: DAGMan can dispatch a process based on exit code, and dispatch another based on output data.
What if we change the API? A drastic step, but... MPI, PVM, MW, Linda. Can we admit subprogram failure and maintain a usable interface?
Conclusion
Large-grained asynchrony has yet to be explored in the context of remote program execution.
Asynchrony has benefits, but requires careful management of failure modes.
This dissertation will contribute a system design and an exploration of performance, risk, and knowledge in a distributed system.
[Diagram: split execution. At the execution site, a starter forks the job in a JVM; the job's I/O library and an I/O proxy translate local system calls into local I/O (Chirp). At the submission site, a shadow and an I/O server provide secure remote I/O against the home file system.]
Remote Resource Interfaces vs. the Application Interface
[Diagram: a running program speaks an application interface (POSIX, MPI, PVM, MW) with explicit descriptions of ordering, reliability, performance, and availability; an interposition agent maps it onto remote resources (CPU, RAM, disk, network) that offer few guarantees on performance, availability, and reliability.]
In the event of a failure:
- Retry: Hold the CPU allocation and try again.
- Checkpoint: Release the CPU, with some restart condition.
- Abort: Expose the failure to a supervisor process.
The Cost of Hiding Failures
Each technique is valid from the standpoint of API conformity. Which to use depends on cost:
- Retry: Holds the CPU idle while retrying the device.
- Checkpoint: Consumes disk and network, but releases the CPU.
- Abort: No expense up front, but must re-consume resources when ready to roll forward.
User policy is vital in determining what costs are acceptable for hiding failures.
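The cost comparison above can be sketched as a policy function that picks retry, checkpoint, or abort from simple cost estimates. This is a minimal sketch; the cost model and all numbers are illustrative, not from the talk.

```python
def choose_recovery(expected_outage_s, cpu_cost_per_s, checkpoint_cost, restart_cost):
    """Pick the cheapest way to hide a failure.
    retry:      holds the CPU idle for the expected duration of the outage.
    checkpoint: pays a fixed cost to save/restore state, but frees the CPU.
    abort:      pays nothing now, but must re-consume resources to roll forward."""
    costs = {
        "retry": expected_outage_s * cpu_cost_per_s,
        "checkpoint": checkpoint_cost,
        "abort": restart_cost,
    }
    return min(costs, key=costs.get)

# A brief hiccup is cheapest to ride out; a long outage justifies checkpointing.
short = choose_recovery(expected_outage_s=5,   cpu_cost_per_s=1.0,
                        checkpoint_cost=60, restart_cost=500)
long_ = choose_recovery(expected_outage_s=600, cpu_cost_per_s=1.0,
                        checkpoint_cost=60, restart_cost=500)
```

The coefficients are where user policy enters: a disaster-response user and a throughput-maximizing user would weight idle CPU time very differently.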
Policy Director
[Diagram: the interposition agent around a running program consults a policy director. Questions: What file should I open? How long should I try? May I checkpoint now? Where should I store the checkpoint image? Should I stage or stream the output? Reports: "FYI, I've used 395 service units here." "FYI, I'm about to be evicted from this site."]
Disconnected Operation
The policy manager is also an execution resource that is occasionally slow or unavailable.
Holding a resource idle while waiting indefinitely for policy direction is still wait-while-idle.
A higher degree of asynchrony can be achieved through disconnected operation.
This requires that each autonomous unit be given an "allowance."
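The "allowance" idea can be sketched as a budget delegated to the agent: while disconnected, it approves actions locally until the allowance is exhausted, and only then must it reconnect for direction. A minimal sketch; the budget unit (service units) and numbers are illustrative.

```python
class DisconnectedAgent:
    """An interposition agent that acts autonomously within an allowance
    of service units delegated by the policy manager."""
    def __init__(self, allowance):
        self.allowance = allowance
        self.used = 0

    def request(self, cost):
        """Approve a local action if it fits in the remaining allowance;
        otherwise the agent must reconnect to its policy manager."""
        if self.used + cost <= self.allowance:
            self.used += cost
            return "approved-locally"
        return "must-reconnect"

agent = DisconnectedAgent(allowance=100)
a = agent.request(40)   # proceeds without consulting the submitter
b = agent.request(50)
c = agent.request(20)   # would exceed the budget: reconnect for policy
```

The allowance bounds the risk of delegation: the manager can never lose more than the budget it handed out, which is the performance/control tradeoff named in Stage Three.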