SimStore: Storable Functions

Authors: David W.H. Swenson

This module adds “storable functions” to SimStore, the new storage subsystem for OpenPathSampling. Storable functions cache the results of previous calculations to disk. This new implementation will support future parallelization approaches.

Purpose of Module

Trajectory-based methods to study rare events, such as transition path sampling (TPS), frequently require calculation of some collective variables during the simulation. In some cases, these collective variables can be relatively expensive to calculate, and may be calculated hundreds of thousands of times during a simulation.

For some types of simulations, such as the one-way shooting variant of TPS, parts of trajectories can be reused, making it advantageous to store the results of collective variables in memory. Furthermore, those same collective variables are frequently used in analysis, making it advantageous to store the results to disk.

This module introduces the parts of SimStore that manage that storage. This includes the StorableFunction class itself, which wraps a user-defined function and handles both caching results in memory and looking up results cached to disk. The user-defined function must take a data object (such as a snapshot or a trajectory) that has a universally unique identifier (UUID), and must return the same value every time it operates on the same input (i.e., it must be a “pure” function).
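The wrapping idea described above can be illustrated with a minimal sketch. This is not the SimStore StorableFunction class: the names Snapshot, CachedFunction, and distance_from_origin are hypothetical, and the sketch assumes only that each input carries a uuid attribute and that the wrapped function is pure.

```python
import uuid


class Snapshot:
    """Hypothetical data object; only the UUID matters for caching."""
    def __init__(self, x):
        self.x = x
        self.uuid = uuid.uuid4().int


class CachedFunction:
    """Toy stand-in for a storable function (not the SimStore class)."""
    def __init__(self, func):
        self.func = func
        self.cache = {}    # maps input UUID -> previously computed result
        self.n_evals = 0   # counts how often the wrapped function really ran

    def __call__(self, obj):
        key = obj.uuid
        if key not in self.cache:
            # Cache miss: evaluate the (pure) function once and remember it.
            self.n_evals += 1
            self.cache[key] = self.func(obj)
        return self.cache[key]


def distance_from_origin(snapshot):
    """Example collective variable; assumed to be a pure function."""
    return abs(snapshot.x)


cv = CachedFunction(distance_from_origin)
snap = Snapshot(-2.5)
first = cv(snap)    # evaluates the wrapped function
second = cv(snap)   # served from the in-memory cache
```

Because the cache is keyed by UUID rather than by the object's contents, the purity requirement matters: a function that returned different values for the same input would silently be replaced by its first result.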

A StorableFunction can be used in different modes: in 'analysis' mode, it first searches the memory cache, then the disk storage, then finally evaluates the internal function. In 'production' mode, it first searches the memory cache, then evaluates the function. Finally, in 'no-caching' mode, it always evaluates the internal function.
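The lookup order for the three modes can be sketched as follows. This is an illustration under stated assumptions, not the SimStore implementation: ModalFunction and Item are hypothetical names, a plain dict stands in for disk storage, and copying a disk hit into the memory cache is an assumption of this sketch.

```python
from collections import namedtuple

# Hypothetical data object with a UUID and some payload.
Item = namedtuple("Item", ["uuid", "value"])


class ModalFunction:
    """Toy illustration of mode-dependent lookup (not the SimStore class)."""
    MODES = ("analysis", "production", "no-caching")

    def __init__(self, func, mode="analysis"):
        if mode not in self.MODES:
            raise ValueError("unknown mode: " + repr(mode))
        self.func = func
        self.mode = mode
        self.memory = {}         # in-memory cache
        self.disk = {}           # plain dict standing in for disk storage
        self.last_source = None  # where the most recent result came from

    def __call__(self, item):
        # 'analysis' and 'production' both start with the memory cache.
        if self.mode != "no-caching" and item.uuid in self.memory:
            self.last_source = "memory"
            return self.memory[item.uuid]
        # Only 'analysis' falls back to disk before evaluating.
        if self.mode == "analysis" and item.uuid in self.disk:
            self.last_source = "disk"
            result = self.disk[item.uuid]
        else:
            self.last_source = "evaluated"
            result = self.func(item)
        if self.mode != "no-caching":
            # Assumption of this sketch: results are kept in memory afterwards.
            self.memory[item.uuid] = result
        return result


f = ModalFunction(lambda item: item.value ** 2, mode="analysis")
item = Item(uuid=1, value=3)
f.disk[item.uuid] = 9  # pretend this result was stored in an earlier run

sources = []
f(item)
sources.append(f.last_source)  # found on "disk"
f(item)
sources.append(f.last_source)  # now found in "memory"
f.mode = "no-caching"
f(item)
sources.append(f.last_source)  # always "evaluated" in this mode
```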

One of the challenges in designing the new storable function infrastructure was to ensure that it would be compatible with parallelization. This module includes functionality so that the memory caches from different remote workers can be returned along with the other results, and then combined into the master memory cache of the process that also stores results to disk.
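A minimal sketch of that pattern follows, assuming each worker returns its local cache together with its result. The function names, the string keys standing in for UUIDs, and the sequential loop standing in for remote execution are all hypothetical; this is not the SimStore API.

```python
def expensive_cv(value):
    """Stand-in for a costly, pure collective variable."""
    return value * value


def worker_task(inputs):
    """Simulated worker: inputs maps a (hypothetical) UUID to a value.

    Returns the task result together with the worker's local cache.
    """
    local_cache = {}
    for uid, value in inputs.items():
        if uid not in local_cache:
            local_cache[uid] = expensive_cv(value)
    result = sum(local_cache.values())
    return result, local_cache


def merge_caches(main_cache, worker_cache):
    """Fold a worker's cache into the main process's cache.

    Because the function is pure, duplicate UUIDs carry identical values,
    so keeping the existing entry is safe.
    """
    for uid, value in worker_cache.items():
        main_cache.setdefault(uid, value)


main_cache = {}  # cache of the process that would also write to disk
task_inputs = [{"a": 1, "b": 2}, {"b": 2, "c": 3}]

results = []
for inputs in task_inputs:  # in a real setup these would run remotely
    result, cache = worker_task(inputs)
    results.append(result)
    merge_caches(main_cache, cache)
```

Only the main process holds the merged cache, so in this arrangement a single process would be responsible for writing to disk, and workers never need direct access to the storage backend.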

Background Information

This module builds on OpenPathSampling, a Python package for path sampling simulations. To learn more about OpenPathSampling, you might be interested in reading:

Installation and Testing

This module was included in the version 1.4 release of OpenPathSampling. It can be installed via the conda package manager with:

conda install -c conda-forge openpathsampling

In addition to previous OPS requirements, this module requires SQLAlchemy, and other parts of the new storage require Dill. These can be installed with, e.g., conda install -c conda-forge sqlalchemy dill.

The tests for this module are split between unit tests included in the OpenPathSampling repository and integration tests in a separate repository. The easiest way to run both sets of tests is to download or clone the integration test repository at https://github.com/dwhswenson/ops-storage-notebooks. Install the required testing software, e.g., with:

conda install -c conda-forge pytest pytest-cov nbval

Then just run the test-storage.sh script in that repository. Note: although the module will work with Python 3.6+, some of the notebook tests are not compatible with more recent versions of Python, so the tests should be run with Python 3.7.

Source Code

This module has been merged into OpenPathSampling. It is composed of the following pull requests: