More crazy MPI ideas: Fault detection and recovery

I had a good conversation with an ISV yesterday who makes a popular MPI-based simulation application. One of the things I like to do in these kinds of conversations is ask the ISV engineers two questions:

What new features do you want from the MPI implementations that you use?
What new features or changes do you want from the MPI API itself?

You know - talk to the actual users of MPI and see what they want, both from an implementation perspective and from a standards perspective. Shocking!

One of the big items that came out of our discussion was a desire for better fault tolerance and/or resilience in MPI applications.

To be fair: fault tolerance is a big topic, and full of both difficult and contentious issues. But the big point that they wanted was actually surprisingly simple in concept:

When an MPI process fails (for whatever reason), guarantee that all other MPI processes that are stuck in blocking MPI API calls involving the dead process return with some kind of reasonable error code.

They didn't care too much about continuing MPI after that - they just wanted to know that an error occurred so that they could save some state to stable storage, perhaps print a helpful error message for the end user, or otherwise clean up after the run. This is a considerably smaller goal than other fault tolerance efforts (e.g., to be able to continue an MPI job after a failure).
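To make that request concrete, here is a minimal sketch of the behavior the ISV described: a blocked receive returns an error instead of hanging, and the application saves its state and reports the problem. This is a toy simulation in plain Python, not MPI; `Channel`, `PeerFailedError`, `mark_peer_dead`, and `run_step` are all hypothetical names invented for illustration.

```python
import queue
import threading


class PeerFailedError(Exception):
    """Hypothetical error raised when the peer of a blocking call has died."""


class Channel:
    """Toy stand-in for a point-to-point link; this is not a real MPI API."""

    def __init__(self):
        self._messages = queue.Queue()
        self._peer_dead = threading.Event()

    def send(self, msg):
        self._messages.put(msg)

    def mark_peer_dead(self):
        # Called by whatever detects the failure (network layer, runtime, ...).
        self._peer_dead.set()

    def blocking_recv(self, poll_interval=0.01):
        # Block until a message arrives or the peer is reported dead.
        while True:
            try:
                return self._messages.get(timeout=poll_interval)
            except queue.Empty:
                if self._peer_dead.is_set():
                    raise PeerFailedError("peer died during blocking receive")


def run_step(channel, state, checkpoints):
    """Receive one message; on peer failure, save state instead of hanging."""
    try:
        state["last"] = channel.blocking_recv()
        return "ok"
    except PeerFailedError as err:
        checkpoints.append(dict(state))  # stand-in for writing to stable storage
        return f"aborted cleanly: {err}"


channel = Channel()
threading.Timer(0.05, channel.mark_peer_dead).start()  # simulate the peer dying
checkpoints = []
print(run_step(channel, {"iteration": 7}, checkpoints))
print(checkpoints)
```

The sketch assumes something can reliably call `mark_peer_dead`. That assumption is exactly the hard part.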

So let's talk about fault detection.

It's usually easy enough for an MPI implementation to figure out when the remote peer in a blocking send/receive operation has failed, especially when the MPI is using some form of reliable network communication, because the networking layer will tell the MPI implementation when it can no longer reach a peer.

...but not always. Consider:

Perhaps the network has totally failed between the two peers, such that not even negative acknowledgements (NAKs) can flow between them (i.e., one process can't tell the other that it has failed). Put differently: silence between peers may not mean process failure.
Perhaps the MPI implementation is using unreliable data transports (e.g., UDP or other unreliable datagrams). Losses are then both common and expected - meaning that NAKs can get corrupted or lost.
The remote peer may not be in the MPI library, or otherwise may not be actively sending traffic to the local peer (e.g., the remote peer may not have posted the matching send or receive yet). Again: silence may not mean failure.
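The "silence may not mean failure" problem is easy to see with a naive timeout-based detector. The sketch below is a deterministic toy with explicit timestamps rather than a real clock. The `TimeoutDetector` class, the rank names, and the 5-second timeout are illustrative assumptions, not part of any MPI implementation.

```python
class TimeoutDetector:
    """Suspects a peer once nothing has been heard from it for `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}  # peer name -> time of last message

    def heard_from(self, peer, now):
        self.last_heard[peer] = now

    def suspects(self, now):
        return sorted(p for p, t in self.last_heard.items()
                      if now - t > self.timeout)


detector = TimeoutDetector(timeout=5.0)
detector.heard_from("rank1", now=0.0)  # rank1 crashes shortly after this
detector.heard_from("rank2", now=0.0)  # rank2 is alive, but busy computing

print(detector.suspects(now=3.0))   # [] : too early to suspect anyone
print(detector.suspects(now=10.0))  # ['rank1', 'rank2'] : rank2 is a false positive

detector.heard_from("rank2", now=20.0)  # rank2 finally posts its send
print(detector.suspects(now=21.0))  # ['rank1']
```

A longer timeout reduces false positives but delays detection of real failures, and no timeout value removes the ambiguity entirely.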

Many of these kinds of issues can be resolved in an "out of band" control network - e.g., the run-time system can monitor the individual processes in an MPI job, and can signal the surviving peers when one of them dies unexpectedly. ...but there are scalability issues with this kind of approach, too. Let's not forget prior blog entries where I have discussed scalability challenges in MPI/HPC runtime systems.
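As a rough illustration of the out-of-band idea, here is a toy monitor that tells every surviving process when one dies. The `RuntimeMonitor` class and its flat "notify everyone" design are assumptions made for this sketch. Real run-time systems are structured differently, but the flat version shows one place where cost grows with job size: each death triggers one notification per survivor.

```python
class RuntimeMonitor:
    """Toy out-of-band monitor; names and design are illustrative assumptions."""

    def __init__(self):
        self._callbacks = {}  # rank -> function called with the dead rank

    def register(self, rank, on_peer_death):
        self._callbacks[rank] = on_peer_death

    def report_death(self, dead_rank):
        """Drop the dead rank, notify every survivor, return the notification count."""
        self._callbacks.pop(dead_rank, None)
        for callback in list(self._callbacks.values()):
            callback(dead_rank)
        return len(self._callbacks)


monitor = RuntimeMonitor()
seen = {rank: [] for rank in range(4)}  # what each rank has been told
for rank in range(4):
    monitor.register(rank, seen[rank].append)

sent = monitor.report_death(2)
print(sent)  # 3 notifications for a 4-process job
print(seen)  # {0: [2], 1: [2], 2: [], 3: [2]}
```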

The situation gets even more complex if there are non-blocking communications ongoing involving many peers, some of whom may have failed.

And it gets further complexified (I just made up that word; deal with it) when your processes fail partway through collective, dynamic process, or one-sided operations. Hardware support (potentially from the network) may be required to handle such failure detection efficiently. Or, put differently: we do not want to penalize the performance of the far-more-common case of success by adding a lot of invasive and potentially performance-costing infrastructure to check for failure during MPI operations.

I should note that a flavor of this kind of failure detection is currently included in the MPI Forum Fault Tolerance Working Group's (FTWG) proposal for MPI-4 (in addition to other FT-related provisions). This is quite promising.

But there's still much discussion that must occur; other users want more than "simple" failure detection, for example - they want some kind of recovery (different models of which are under hot debate).

What kinds of failure detection and/or recovery would you find useful in your application?
