Appendix F: expert-level options
This section catalogs very advanced functionality that is stable and well-tested, but switched off by default.
NVidia Tesla coprocessor support
Several functions in Spinach can make use of CUDA GPUs. If your computer has a recent NVidia graphics card, enabling that functionality may be beneficial. This is done by adding 'gpu' to the enable array.
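A minimal sketch of that setting, assuming the enable array is the sys.enable cell array of the spin system specification structure that is passed to create.m (this location is an assumption, it is not stated in this section):

```matlab
% Assumption: sys is the spin system specification structure
% that is later passed to create.m
sys.enable={'gpu'};
```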
Spinach kernel modules that can make use of GPU arithmetic are:
- Time evolution in the evolution.m function.
- Matrix exponential calculation in the propagator.m function.
- Matrix inverse-times-vector operation during slow-passage detection in the slowpass.m function.
- Krylov propagation in the krylov.m and step.m functions.
Very significant acceleration is observed (factor of 10 or more relative to the CPU) for systems that have state space dimensions in excess of 50,000.
For the typical 128x128x128 point grids used in paramagnetic centre probability density reconstructions from PCS, a Tesla K40 card yields up to an order of magnitude acceleration relative to 32 CPU cores. Note that commodity NVidia graphics cards (e.g. GeForce) have artificially capped 64-bit floating-point performance - the Tesla range is more expensive, but strongly recommended.
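Before enabling the flag, it may be worth checking that Matlab can actually see a CUDA device. The sketch below uses the gpuDeviceCount and gpuDevice functions from Parallel Computing Toolbox (it will fail if the toolbox is absent); the use of sys.enable as the enable array is an assumption.

```matlab
% Enable GPU arithmetic only if a CUDA device is visible to Matlab.
% Assumption: sys.enable is the enable array read by Spinach.
if gpuDeviceCount>0
    dev=gpuDevice;   % select and query the default device
    fprintf('GPU: %s, %.1f GB, compute capability %s\n',...
            dev.Name,dev.TotalMemory/2^30,dev.ComputeCapability);
    sys.enable={'gpu'};
else
    disp('No CUDA GPU detected, running on CPU.');
    sys.enable={};
end
```

The device name printed here indicates whether the card belongs to the Tesla range or to a commodity range with capped 64-bit performance.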
Intel Xeon Phi coprocessor support
Matlab 2016b and later use the version of Intel MKL that supports automatic offload of low-level MKL functions to Xeon Phi coprocessors. Set the following environment variables in Linux or Windows to enable automatic offload:
BLAS_VERSION=mkl_rt.dll
LAPACK_VERSION=mkl_rt.dll
MKL_MIC_MAX_MEMORY=16G
MKL_MIC_ENABLE=1
Change the amount of memory to match your version of the Xeon Phi card. We are using the following bash script at our supercomputing facility at Southampton:
module load matlab/2016b
source /local/software/intel/2017/mkl/bin/mklvars.sh intel64
export BLAS_VERSION=/local/software/intel/2017/mkl/lib/intel64/libmkl_rt.so
export LAPACK_VERSION=/local/software/intel/2017/mkl/lib/intel64/libmkl_rt.so
export MIC_OMP_NUM_THREADS=240
export MKL_MIC_ENABLE=1
matlab
Modify paths and settings as appropriate for your case.
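To confirm that the variables have reached the Matlab session, something along the following lines can be run at the Matlab prompt. This is only a sketch: it reports what Matlab has loaded, and does not show whether any work is actually offloaded to the coprocessor.

```matlab
% Report the environment variables seen by this Matlab session
vars={'BLAS_VERSION','LAPACK_VERSION','MKL_MIC_ENABLE'};
for n=1:numel(vars)
    fprintf('%s = %s\n',vars{n},getenv(vars{n}));
end

% Report the BLAS and LAPACK libraries that Matlab has loaded
disp(version('-blas'));
disp(version('-lapack'));
```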
Note that we have so far not seen any advantage whatsoever in Spinach calculations from enabling these options. This is likely because neither sparse matrix algebra nor Fourier transforms are currently offloaded automatically. Xeon Phi accelerators are therefore NOT RECOMMENDED - go buy a Tesla card instead.
Propagator caching
Spinach may be instructed to keep a disk record of the matrices that the propagator.m function has previously seen, so that propagators are not recomputed, but instead fetched from disk the next time the same matrix is encountered. This can save large amounts of time in simulations of very repetitive pulse sequences. To turn this functionality on, add 'caching' to the enable array.
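A minimal sketch of that setting, again assuming that the enable array is the sys.enable cell array of the spin system specification structure passed to create.m:

```matlab
% Assumption: sys is the spin system specification structure
% that is later passed to create.m
sys.enable={'caching'};
```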
The cached propagators are placed into the /scratch directory in the Spinach root folder.
Spinach uses Matlab's built-in Java interface for the hashing operation that generates matrix identifiers. For very large matrices it may be necessary to increase the Java heap size (Preferences/MATLAB/General/Java Heap Memory). It is not a good idea to set the default Matlab file save format to v7.3 because the files in that format are not compressed. If at all possible, leave the default value of v7.0 unchanged.
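The current Java heap limit can be inspected from the Matlab prompt before deciding whether to raise it. The sketch below calls the standard Java Runtime class through Matlab's Java interface and assumes that Matlab was started with the JVM enabled:

```matlab
% Query the maximum Java heap size available to this Matlab session
heap_bytes=java.lang.Runtime.getRuntime.maxMemory;
fprintf('Java heap limit: %.0f MB\n',double(heap_bytes)/2^20);
```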
Greedy parallelisation
The default setting in Matlab is that every worker process runs in a single thread. Adding the 'greedy' flag to the enable array overrides that default and allows the worker processes to use as much CPU as they see fit. This is useful when state spaces are dominated by a single large subspace. Note that, once a job with this setting has run, the setting persists in the parallel pool until the pool is restarted or a job with different settings is run.
This setting is usually counterproductive when an efficient parallelisation avenue (such as powder averaging) is present - use it carefully and make sure it does not actually slow your calculation down. We typically see a significant advantage for multi-dimensional liquid state NMR, but not elsewhere.
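A minimal sketch of setting the 'greedy' flag and then clearing it from the pool afterwards. The sys.enable location of the enable array is an assumption; gcp and delete are standard Parallel Computing Toolbox calls.

```matlab
% Assumption: sys is the spin system specification structure
% that is later passed to create.m
sys.enable={'greedy'};

% ... run the simulation here ...

% Shut down the current parallel pool, if there is one, so that the
% thread setting does not carry over into subsequent jobs
delete(gcp('nocreate'));
```

Timing the same calculation with and without the flag is the simplest way to check that it is not slowing things down.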
Disk-based Hamiltonian assembly
The weakest part of Matlab's Parallel Computing Toolbox is inter-process communication. Retrieving a large sparse matrix from a worker node uses far more memory than the matrix itself takes up. When the dimension of the Hamiltonian is extreme (over 10M in our experience), this can cause memory problems even on systems with very large amounts of RAM. Setting the following option:
will cause Spinach nodes to write their chunks of the Hamiltonian to disk, from where they are then retrieved by the head node. This option is slower than retrieving the chunks over a network, but it is less memory-intensive.
Version 1.10, authors: Ilya Kuprov