Dask-traj

For analysis of MD simulations MDTraj is a fast and commonly used analysis. However MDTraj has limitations, such as the requirement that the whole trajectory and result of the computation fits into memory. This module rewrites part of MDTraj to work with Dask in order to achieve out-of-memory computations, and combined with dask-distributed results in possible out-of-machine parallelization, essential for HPCs and a (surprising) speed-up even on a single machine.

Purpose of Module

Using MDTraj is a fast and easy way to analyze MD trajectories. However, MDTraj has a couple limitations:

  • The whole trajectory needs to fit into memory, or gathering results becomes inconvenient
  • The result of the computation also needs to fit into memory
  • All processes need access to all the memory, preventing out-of-machine parallelization, and HPC scaling

Dask-traj solves all 3 limitations by rewriting the MDTraj functions to work with dask.arrays. This is done for both the trajectory and the computation functions. As dask.arrays know how to spill to disk, this lifts the requirement to fit into memory on both.

Together with dask-distributed it also allows the computation to be executed in a distributed way, which allows scaling out of a single machine. In preliminary tests this approach even leads to a speedup on a single machine, which is surprising as MDTraj is already a parallel code.

The splitting of everything in Dask-traj is done in the time-axis of the MD trajectory and as a lot of analysis is embarrassingly parallel, this leads to nice non-communicating compute graphs as shown here.

Graph figure of a trajectory with 1251 split in chunks of 100 frames

Current Limitations

One very important point of dask-traj is that we seek in the trajectory file. So if your files are stored in a format that does not have an efficient seek method, the loading of Trajectories will not get a speed-up, and might even be slower than MDTraj.

Also, due to the way the code is written in MDTraj, only a subset of functions are available at the moment, but this will be expanded further in the future. If you have a use-case that requires the conversion of a MDTraj functionality, not yet present in dask-traj, please make an issue and I will focus on that.

Building and Testing

This code can be installed with conda using conda install -c dask_traj. To install the specific version associated with this module, use conda install -c conda-forge dask_traj==0.2.2

This code can also be installed with pip by running pip install dask-traj

Finally, this code can also be installed by downloading the source code (see the Source Code section below), and running python setup.py install from the root directory.

Tests for this module can be run with pytest. Install pytest with pip install pytest and then run the command py.test from within the directory with the source code, or py.test --pyargs dask_traj from anywhere after installation.

Examples

The examples require some extra dependencies to be installed, namely: * jupyter * distributed * python-graphviz

Which are all installable through conda and pip.

These examples can also be found in the examples directory in the source code. They can be run by using jupyter notebook from that directory (see Jupyter notebook documentation at http://jupyter.org/ for more details)

Source Code

The source code for this module, and modules that build on it, is hosted at https://github.com/sroet/dask-traj. This module specifically includes everything up to and including release 0.2.2