HTC MPI-Aware Tasks
This module is the fifth in a sequence that forms the overall capabilities of the HTC library (see HTC Multi-node Tasks for the most relevant previous module, where support for forked MPI workloads was added). This module deals with enabling tasks that are themselves MPI-aware (specifically MPI/OpenMP tasks) to be run over a set of nodes.
Purpose of Module
In HTC Multi-node Tasks we added support for the HTC library to control tasks that are executed via the MPI launcher command. In that case, the task tracked by Dask is actually the process created by the launcher. For fully MPI-aware tasks, Dask itself is part of the MPI environment, running on the root process. The other processes wait for the root process to send them the code to execute. This is possible because Python code is compiled and interpreted at run time rather than ahead of time, so we can serialise a task and send it to the other processes (hiding the complexity behind additional function calls).
The implementation is intended to be generic, but the specific example implementation provided is for the srun launcher that is used on the JURECA system.
Background Information
This module builds upon the work described in HTC Multi-node Tasks.
There is significant complexity in this use case, since the task is only sent to the root process and must be packaged and sent to the other processes before they can execute anything. The other processes must then return to a waiting state until the next task is sent from the root process, and when the workers are supposed to shut down, they should all exit cleanly.
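The flow described above (package the task on the root process, hand it to the waiting processes, then shut them down cleanly) can be sketched without MPI at all. The following is a minimal stand-in, not the library's implementation: threads and `queue.Queue` objects play the role of MPI ranks and messages, and every name in it (`pack_task`, `run_packed`, `worker_loop`, `run_everywhere`, `SHUTDOWN`) is hypothetical. It assumes the task is a plain function with no closures or default arguments, because `marshal` only serialises the code object.

```python
import marshal
import pickle
import queue
import threading
import types

SHUTDOWN = None  # hypothetical sentinel telling a waiting process to exit


def pack_task(func, *args):
    """Serialise a plain function (no closures, no defaults) and its arguments."""
    return pickle.dumps((marshal.dumps(func.__code__), args))


def run_packed(payload, rank):
    """Rebuild the function from its serialised code and call it for this rank."""
    code_bytes, args = pickle.loads(payload)
    func = types.FunctionType(marshal.loads(code_bytes), globals())
    return func(rank, *args)


def worker_loop(rank, inbox, results):
    """Non-root process: wait for tasks until the shutdown sentinel arrives."""
    while True:
        payload = inbox.get()
        if payload is SHUTDOWN:
            break
        results.put((rank, run_packed(payload, rank)))


def run_everywhere(func, args, n_workers=3):
    """Root process: send one task to every worker, run it locally, gather results."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    workers = [
        threading.Thread(target=worker_loop, args=(rank, inboxes[rank - 1], results))
        for rank in range(1, n_workers + 1)
    ]
    for worker in workers:
        worker.start()

    # Only the root holds the task; it packages it and sends it to the others.
    payload = pack_task(func, *args)
    for inbox in inboxes:
        inbox.put(payload)
    gathered = {0: run_packed(payload, 0)}
    for _ in workers:
        rank, value = results.get()
        gathered[rank] = value

    # Clean shutdown: every waiting process receives the sentinel and exits.
    for inbox in inboxes:
        inbox.put(SHUTDOWN)
    for worker in workers:
        worker.join()
    clean_exit = not any(worker.is_alive() for worker in workers)
    return gathered, clean_exit


def square_plus_rank(rank, x):
    return x * x + rank


if __name__ == "__main__":
    print(run_everywhere(square_plus_rank, (4,)))
```

`run_everywhere` returns a dictionary mapping each rank to its result, plus a flag confirming that all workers exited. In a real MPI setting the queues would presumably be replaced by collective calls such as a broadcast from the root, and error handling would be needed so that a failing task does not leave the root waiting forever.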
Building and Testing
The library is a Python module and can be installed with
python setup.py install
More details about how to install a Python package can be found in, for example, Install Python packages on the research computing systems at IU.
To run the tests for the decorators within the library, you need the pytest Python package. You can run all the relevant tests from the jobqueue_features directory with
pytest tests/test_mpi_wrapper.py
Specific examples of usage for the JURECA system are available in the examples subdirectory.
Source Code
The latest version of the library is available on the jobqueue_features GitHub repository.
The code that was originally created specifically for this module can be seen in the MPI-capable tasks Merge Request. This includes a specific example of the use case.