I've mentioned it before: the run-time systems of MPI implementations are frequently unsung heroes.
A lot of blood, sweat, tears, and innovation goes into parallel run-time systems, particularly those that can scale to very large systems. But they're not discussed often, mainly because they're not as sexy as ultra-low latency numbers or other popular MPI benchmarks.
Here's one cool thing that we added to the runtime in Open MPI a few years ago, and have continued to improve on over the years (including pretty pictures!).
In the 1990s, when clusters of Linux servers were a new concept, the only way to launch MPI processes on remote servers was via ssh (rsh was used for a while, but it eventually mostly died out).
While job schedulers and cluster resource managers tend to offer fast MPI/parallel job startup these days, there are a surprising number of users who still use ssh-backed job startup mechanisms. There are a number of valid and good reasons for this, but we'll explore that another time.
Let's take a step back and look at what a job launcher does.
Conceptually, parallel job launchers are simple: loop over the target processes, starting each one on its target machine. Keeping with the ssh theme here, the figure below shows this model using individual ssh connections:
An obvious optimization - one that Open MPI has done since its inception - is to connect to each target machine only once, and then launch the desired target processes from that single connection:
(NOTE: the above figure is a bit simplified: mpirun actually launches a proxy daemon on each node; the daemon then forks each of the target MPI processes).
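To make the difference between those two models concrete, here's a minimal Python sketch. It doesn't run ssh at all; it only builds the list of connections each model would open. The function names and the `placement` list (one hostname per MPI rank) are made up for illustration and are not anything from Open MPI itself.

```python
from collections import OrderedDict


def per_process_launch(placement):
    """Naive model: one ssh connection for every target process."""
    return [("ssh", host, rank) for rank, host in enumerate(placement)]


def per_node_launch(placement):
    """Optimized model: one ssh connection per host.

    All ranks destined for a host are handed to that single connection
    (in real life, to the proxy daemon started on that host).
    """
    by_host = OrderedDict()
    for rank, host in enumerate(placement):
        by_host.setdefault(host, []).append(rank)
    return [("ssh", host, ranks) for host, ranks in by_host.items()]


# Hypothetical job: 6 processes spread over 3 servers.
placement = ["node0", "node0", "node1", "node1", "node2", "node2"]

for connection in per_process_launch(placement):
    print("per-process:", connection)
for connection in per_node_launch(placement):
    print("per-node:   ", connection)
```

With this placement, the first model opens six connections and the second opens three, one per server.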
As your parallel application grows in terms of number of servers, such a serial launch mechanism becomes an obvious bottleneck.
It therefore makes sense to parallelize the launcher: use a tree-based launch structure. Have the job initiator (shown as "mpirun" in each figure) be the root of a tree. Each server that is launched upon can also launch on further servers. The inherent parallelization speeds up the overall launch from O(N) to O(log N):
Schweet!
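If you want to see where the O(log N) comes from, here's a rough sketch under a deliberately idealized model: every ssh launch takes one "round", and each launcher (mpirun or an already-started server) can start exactly one new server per round. Real launch times depend on far more than this, and the function names are mine, not Open MPI's.

```python
def linear_rounds(num_servers):
    """mpirun starts every server itself, one after another."""
    return num_servers


def tree_rounds(num_servers):
    """Every server that has been started helps start further servers."""
    launched = 0    # servers started so far
    launchers = 1   # mpirun is the only launcher at the beginning
    rounds = 0
    while launched < num_servers:
        # Each current launcher starts at most one new server this round.
        new = min(launchers, num_servers - launched)
        launched += new
        launchers += new
        rounds += 1
    return rounds


for n in (1, 7, 64, 1023, 1024):
    print(n, "servers:", linear_rounds(n), "linear rounds,",
          tree_rounds(n), "tree rounds")
```

In this model the number of launchers doubles every round, so 1023 servers need only 10 rounds instead of 1023.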
Open MPI debuted a tree-based ssh launcher back in the v1.3 series (circa 2009). The first-generation tree-based launcher used a binomial tree. This shape effectively amortized the high cost of creating ssh connections.
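For the curious, here's one common way to number a binomial tree, sketched in Python. I'm assuming index 0 is mpirun and indices 1 through size-1 are the servers; this is the textbook construction, not necessarily the exact numbering that Open MPI uses internally, and `binomial_children` is a made-up name.

```python
def binomial_children(rank, size):
    """Return the children of `rank` in a binomial tree of `size` nodes.

    Node 0 is the root. A node's children are rank + 2**k for every
    power of two that is larger than the rank itself, as long as the
    result is still a valid node index.
    """
    mask = 1
    # Skip powers of two that are not larger than this rank.
    while mask <= rank:
        mask <<= 1
    children = []
    while rank + mask < size:
        children.append(rank + mask)
        mask <<= 1
    return children


# Hypothetical job: mpirun (index 0) plus 7 servers.
size = 8
for rank in range(size):
    print(rank, "launches", binomial_children(rank, size))
```

With 8 nodes, the root launches on 1, 2, and 4; node 1 launches on 3 and 5; node 2 on 6; node 3 on 7. Every server has exactly one parent, and the nodes launched earliest end up with the most children, which is what keeps everyone busy.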
Note that the tree-based ssh structure necessitates setting up password-less/passphrase-less ssh logins between each pair of servers in the HPC cluster. If you use the same ssh keys on every server, this is trivial to set up. If you use different ssh keys on each server, it's a little more work.
That being said, Open MPI allows users to disable the tree-based launch and use the linear ssh launcher, if desired.
This blog entry is getting a bit long, so stay tuned: I'll describe a few more fun things about the Open MPI ssh tree-based launching system in the next entry...