
Lessons Learned Implementing User-Level Failure Mitigation in MPICH
Wesley Bland, Huiwei Lu, Sangmin Seo, Pavan Balaji

Argonne National Laboratory

User-level failure mitigation (ULFM) is becoming the front-running solution for process fault tolerance in MPI. While not yet adopted into the MPI standard, it is already used by applications and libraries and is being considered by the MPI Forum for future inclusion in MPI itself. In this paper, we introduce an implementation of ULFM in MPICH, a high-performance and widely portable implementation of the MPI standard. We demonstrate that, although ours is still a reference implementation, the runtime cost of the newly introduced API calls is relatively low.

What is ULFM?

[Figure: the ULFM model in three panels. Failure Notification: rank 1 fails and rank 0's pending MPI_Recv returns MPI_ERR_PROC_FAILED. Failure Propagation: MPI_COMM_REVOKE spreads knowledge of the failure. Failure Recovery: MPI_COMM_SHRINK() builds a communicator of survivors.]

ULFM introduces semantics to define failure notification, propagation, and recovery within MPI. For failure notification, ULFM uses traditional error codes and returns a new error, MPI_ERR_PROC_FAILED. For failure propagation, MPI_COMM_REVOKE notifies all processes of a failure. For failure recovery, MPI_COMM_SHRINK can be used to create a new communicator containing only the surviving processes. To assist with this primary functionality, there are also functions to acknowledge and query detected failures (MPI_COMM_FAILURE_ACK, MPI_COMM_FAILURE_GET_ACKED) and to perform distributed, fault-tolerant agreement (MPI_COMM_AGREE).
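As a concrete illustration of these semantics, the following is a minimal sketch of the notify/propagate/recover pattern. It assumes MPICH's convention of exposing the proposed calls under the MPIX_ prefix (MPIX_Comm_revoke, MPIX_Comm_shrink, MPIX_ERR_PROC_FAILED); it is not code from the poster.

    #include <mpi.h>   /* MPICH declares the MPIX_ ULFM extensions here */

    /* Sketch only: recover from a failure by revoking the communicator,
     * then shrinking it to the surviving processes. */
    static MPI_Comm recover(MPI_Comm comm)
    {
        MPI_Comm newcomm;
        MPIX_Comm_revoke(comm);            /* failure propagation */
        MPIX_Comm_shrink(comm, &newcomm);  /* failure recovery */
        MPI_Comm_free(&comm);
        return newcomm;
    }

    static void receive_with_recovery(MPI_Comm *comm)
    {
        int buf, rc, eclass;
        MPI_Comm_set_errhandler(*comm, MPI_ERRORS_RETURN);
        rc = MPI_Recv(&buf, 1, MPI_INT, 0, 0, *comm, MPI_STATUS_IGNORE);
        if (rc != MPI_SUCCESS) {
            MPI_Error_class(rc, &eclass);  /* failure notification */
            if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED)
                *comm = recover(*comm);    /* continue with fewer processes */
        }
    }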

ULFM Implementation

Failure Detection
• Local failures are detected by Hydra and the netmods
• Error codes are returned to the user from the API calls

Agreement
• Uses two group-based allreduce operations
• If either fails, an error is returned to the user

Shrinking
• All processes construct a consistent group of failed processes via an allreduce
• Uses MPI_COMM_CREATE_GROUP internally to create the communicator
• If a failure is detected, the shrink is started over (a user-level sketch follows this list)

Revocation
• Non-optimized implementation done with a message flood
• Internally marked as done when a message has been received from all other processes
• An internal refcount is kept until done
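The shrink bullets above can be approximated at user level with standard MPI-3 group operations. The sketch below assumes the MPIX_ prefix; shrink_sketch is a hypothetical helper, and it glosses over the internal fault-tolerant allreduce that makes the failed group consistent across processes, which the real implementation performs before creating the communicator.

    #include <mpi.h>

    /* Illustrative user-level approximation of shrink: subtract the
     * acknowledged failed group from the communicator's group, build a
     * new communicator from the survivors, and restart on new failure. */
    static int shrink_sketch(MPI_Comm comm, MPI_Comm *newcomm)
    {
        int rc;
        do {
            MPI_Group world, failed, alive;
            MPIX_Comm_failure_ack(comm);          /* snapshot local knowledge */
            MPIX_Comm_failure_get_acked(comm, &failed);
            MPI_Comm_group(comm, &world);
            MPI_Group_difference(world, failed, &alive);
            /* All survivors must pass the same group; the real code first
             * agrees on 'failed' with a fault-tolerant allreduce. */
            rc = MPI_Comm_create_group(comm, alive, /*tag=*/0, newcomm);
            MPI_Group_free(&alive);
            MPI_Group_free(&failed);
            MPI_Group_free(&world);
        } while (rc != MPI_SUCCESS);  /* new failure detected: start over */
        return MPI_SUCCESS;
    }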

ULFM Evaluation

Agreement

Performance with unacknowledged and acknowledged failures is similar because the algorithm has no leave-early semantics; failure-free performance is much better. Acknowledging failures does not significantly improve the runtime either, because all of the failure-detection code must still be executed.
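For reference, here is a minimal usage sketch of the call being measured, again assuming MPICH's MPIX_ prefix; the retry-after-acknowledge pattern is an assumption about typical usage, not the benchmark code.

    #include <mpi.h>

    /* Sketch: agree on one integer flag (bitwise AND across survivors). */
    static int agree_with_ack(MPI_Comm comm, int *flag)
    {
        int rc = MPIX_Comm_agree(comm, flag);
        if (rc != MPI_SUCCESS) {
            /* An unacknowledged failure was reported: acknowledge all
             * currently known failures and retry; the agreement itself
             * still completed consistently on every surviving process. */
            MPIX_Comm_failure_ack(comm);
            rc = MPIX_Comm_agree(comm, flag);
        }
        return rc;
    }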

Shrinking

Shrinking is slower than MPI_COMM_DUP because of the failure-detection work. As expected, introducing failures in the middle of the algorithm causes a large runtime increase, since the algorithm must restart.
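The comparison against MPI_COMM_DUP could be reproduced with a simple failure-free micro-benchmark along the following lines; the timing harness below is an assumption, not the poster's benchmark.

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch: time MPI_Comm_dup vs. MPIX_Comm_shrink with no failures. */
    int main(int argc, char **argv)
    {
        int rank;
        double t;
        MPI_Comm dup, shrunk;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t = MPI_Wtime();
        MPI_Comm_dup(MPI_COMM_WORLD, &dup);
        if (rank == 0)
            printf("dup:    %.3f ms\n", (MPI_Wtime() - t) * 1e3);

        MPI_Barrier(MPI_COMM_WORLD);
        t = MPI_Wtime();
        MPIX_Comm_shrink(MPI_COMM_WORLD, &shrunk); /* no failures: same membership */
        if (rank == 0)
            printf("shrink: %.3f ms\n", (MPI_Wtime() - t) * 1e3);

        MPI_Comm_free(&dup);
        MPI_Comm_free(&shrunk);
        MPI_Finalize();
        return 0;
    }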


[Figure: MPI_COMM_AGREE runtime in milliseconds vs. process count (16 to 512), y-axis 0 to 90 ms, for No Failures, Unacked Failure, and Acked Failure.]

[Figure: MPI_COMM_SHRINK runtime in milliseconds vs. process count (16 to 512), y-axis 0 to 70 ms, for Dup, No Failure, and One Failure.]

Why is this better than Checkpoint/Restart?

Traditional fault-tolerance solutions require the entire application to synchronize and save its state to disk. In the event of a failure, all processes must abort and roll back to the previously saved state. ULFM instead allows the application either to replace the failed process or to continue with fewer processes.

Argonne National Laboratory is a U.S. Department of Energy laboratory managed by UChicago Argonne, LLC.

This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, under contract number DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided on Fusion, a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory.