Protein NMR simulations

From Spinach Documentation Wiki
Revision as of 03:15, 8 August 2016 by Admin (talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

This document describes the stages of setting up a simple protein NMR experiment. It should be read together with the rest of the Spinach manual. To set up a simple protein simulation, you would need the following:

  1. A PDB file containing Cartesian coordinates of every atom in the protein, including protons. If the PDB file contains multiple geometries, the first one is read by Spinach.
  2. A BMRB file containing chemical shifts for those atoms that have been assigned. Unassigned atoms would not appear in the simulation.

Spinach cross-checks the amino acid sequence between the PDB and the BMRB file – any mismatch would produce an error message. Use the following command to import data and create Spinach input structures:


Detailed descriptions of the sub-fields of sys and inter data structures are available in the manual. The protein import function above fills and returns the following fields:


Field names are self-explanatory: isotope names are placed into sys.isotopes, PDB labels of each atom are placed into sys.labels, chemical shifts are placed into inter.zeeman.scalar, rough guesses for nitrogen CSAs are placed into inter.zeeman.matrix (if you have accurate CSA tensors, you need to place them manually into the corresponding cells of inter.zeeman.matrix array after the import is complete), rough guesses of J-couplings are placed into inter.coupling.scalar (if you have accurate J-couplings, you would need to overwrite the values in inter.coupling.scalar after the import is complete), PDB atom coordinates are placed into inter.coordinates; nothing else is imported or guessed.

The following takes place under the bonnet when protein data is imported:

  1. PDB file is parsed, coordinates and labels are read in.
  2. BMRB file is parsed, chemical shifts and labels are read in.
  3. The following J-coupling guessing procedure is run (see our 2014 protein simulation JMR paper for further details):
    1. The molecular bonding graph is partitioned into connected subgraphs of size two, and one-bond J-couplings are assigned from a complete database of atom pairs. Our experience with proteins indicates that there are fewer than 100 unique connected atom pairs in regular proteins, and that most one-bond J-couplings within those pairs can be either found in the literature, or measured in individual amino acids, or estimated with sufficient accuracy using electronic structure theory software. Spinach simply has a list – see guessj.m function, modify it if you have more accurate values at hand.
    2. The molecular bonding graph is partitioned into connected subgraphs of size three, and two-bond J-couplings assigned from a complete database of connected atom triples. The number of unique connected atom triples in proteins is also reasonable – we saw fewer than 150 in regular proteins, a small enough number for an exhaustive list to be compiled from experiments, literature and electronic structure theory estimates. Here Spinach also has a list – see guessj.m function, modify it if you have more accurate values at hand.
    3. The molecular bonding graph is partitioned into sequentially connected subgraphs of size four and dihedral angles are computed from atomic coordinates, allowing three-bond J-couplings to be assigned from a complete database of Karplus curves, with angle averaging for sites designated as mobile. Karplus curves are a well-researched topic, with specific data available for the backbone and less accurate generic curves available for the rest of the structure. The number of unique sequentially connected atom quartets found in proteins (fewer than 300, many belonging to similar structural types) was sufficiently small for a complete database of Karplus curves to be compiled from literature data, experiments, and electronic structure theory estimates.
    4. J-couplings across more than three bonds are ignored. The effect of the electrostatic environment is also ignored – for the accuracy required for protein simulations its effect on J-coupling is small. More accurate J-coupling estimation methods are undoubtedly possible – modify guessj.m function or drop us a note if you have a better idea.
  4. The following CSA guessing procedure is run for the amide group nitrogens (and nothing else): local directions are determined using the coordinates of N, H, C and CA atoms; ZZ axis of the CSA tensor is assumed to be collinear to the N-CO bond, XX axis is assumed to be perpendicular to the peptide plane and the YY axis is assumed to be perpendicular to the other two. The following eigenvalues are used: -125 ppm, 45 ppm, 80 ppm in XX, YY and ZZ direction respectively. If you have your own CSAs or would like to add CSAs for other atoms, put them into inter.zeeman.matrix cell array, overwriting the guess if necessary.
  5. Unassigned atoms are dropped from the atom list, the associated interactions and coordinates are removed. The list of unassigned atoms that appear in PDB, but not BMRB, is printed to the user. Please study that list carefully and add the missing information to the BMRB file if at all possible. If you do not do it, a significant number of signals would be missing from the computed spectra and some relaxation rates could end up being underestimated.

After the import is finished, the resulting sys and inter structures may be used elsewhere in Spinach. There is no need to import dipolar coupling tensors, they are computed automatically by Spinach from atomic coordinates. Any additional information (quadrupolar coupling, unpaired electrons and associated interactions, etc.) can be added to sys and inter structures at this point. Please refer to the manual for the detailed description of sys and inter arrays. At the next stage in the input preparation, you need to specify the magnet field and the cut-off tolerances for the various interactions (which distances are “too large” for the dipolar coupling and which J-couplings are “too small” to be consequential). The top of your Matlab file should look similar to the following:  

    % Protein data

    % Magnet field

    % Tolerances

Cut-off tolerance for proximity is specified in Angstrom and cut-off for J-coupling is specified in Hz. In the example above, dipolar couplings would not be taken into account between spins that are further than 4.0 Angstrom apart and any J-coupling smaller than 2.0 Hz would be ignored. The last line instructs Spinach to avoid the Krylov propagation algorithm – this is a technical setting that essentially tells the program that using huge amounts of memory is allowed.

The next thing to be specified is the relaxation theory. Redfield relaxation theory is a very expensive option from the computational point of view – NOESY simulation for a 70-residue protein requires about 256 GB of RAM. If you do not require accurate relaxation theory treatment, use something similar to the following:

    % Relaxation theory

This requests a non-selective damping at 5.0 Hz for all states (the relaxation superoperator would be a diagonal matrix with -5.0 on the diagonal). Alternatively, Spinach supports simple T1/T2 and Lindblad relaxation models – those are often sufficient, see the manual. However, if you do require accurate relaxation treatment (it is strictly necessary for NOESY spectra), the following should be supplied:

    % Relaxation theory

This requests full Redfield theory: DD, CSA, NQI and all cross-correlations thereof. Dipolar relaxation rates are computed from atomic coordinates, CSAs and NQIs must be provided – see the manual on how to add them to sys and inter data structures. The middle line in the specification above requests the “Redfield kite” – cross-relaxation is included between longitudinal states only. If you require the treatment of all cross-relaxation processes, specify ‘secular’ instead of ‘kite’ – note that the simulation time would increase considerably. The last line specifies the rotational correlation time in seconds, it is important that you get this number right because all relaxation rates depend on it. Spinach relaxation module supports anisotropic rotational diffusion (in which case multiple correlation times should be supplied) as well as many other options – see the manual for further information.

The next step is to choose a basis set. This is a very complicated topic (see the manual), but the minimal basis set that produces quantitatively accurate results for proteins in liquid state is the following:

    % Basis set
    bas.level=4; bas.space_level=3;

This requests IK-1(4,3) connectivity-adaptive basis set that includes local correlations of up to four spins on the J-coupling graph and local correlations of up to three spins on the spatial proximity graph. Very accurate quantitative calculations (e.g. NOESY simulations where peak volumes are analysed) should use IK-1(5,4) basis set.

The next stage is to call Spinach constructor functions and generate the spin_system data structure – the fundamental structure that contains all information about the spin system and is required by most high-level Spinach functions as the first argument:

    % Create the spin system structure

    % Kill carbons and nitrogens

    % Build the basis

The two lines in the middle are optional – in this case they request the removal of all carbon and nitrogen spins from the spin system. This is necessary for the 2D NOESY simulation we chose to run, but should not be done for HSQC, HNCO and other sequences that require the presence of 15N and 13C spins. You should determine which spins (if any) you would like to drop and modify (or delete) those lines accordingly. Both create and the basis functions would produce copious output, informing you of the properties of the spin system and of the simulation settings. You are advised to read the output carefully as it contains important information about what Spinach thinks your spin system is.

The next stage is to specify experiment parameters. Spinach supports many standard NMR pulse sequences, their parameters are listed in the manual. In the case of a 2D NOESY calculations of proteins, the following is a reasonable set:

    % Sequence parameters
    parameters.npoints=[512 512];
    parameters.zerofill=[2048 2048];

As the names of the parameters suggest, this requests a mixing time of 65 ms, frequency offset of 4250 Hz, sweep width of 10750 Hz, 512 points to be acquired in both dimensions and zero-filled to 2048 points in both dimensions, the sequence is operating on 1H nuclei, axis units should be ppm and the initial condition is Lz on protons. The ‘cheap’ option in the initial condition call specification requests Spinach to not worry about the overall scaling factor on the Z axis of the resulting spectrum (relative peak intensities and integrals are unaffected) – this option makes state vector generation much faster at runtime. Other pulse sequences can have different parameters, please refer to the manual.

The next stage is the actual simulation. For our chosen example case of 2D NOESY the syntax is:

    % Simulation

The choice of the outer function (liquid) reflects the fact that we are running a liquid state simulation (Spinach also supports solid state NMR, see the manual), spin_system is the data structure containing the information about the system, noesy is the name of the pulse sequence we are running (@ symbol is a Matlab technicality – it denotes a function handle) and the various fields of the parameters argument have all been set above. The simulation produces detailed output, informing the user about simulation stages and printing matrix dimension statistics at each stage. You are advised to study that output carefully as it contains important information about simulation progress. The result is a 2D free induction decay that is ready for standard NMR data processing. Depending on the pulse sequence, it may be a simple array of complex numbers, or it might contain subfields, such as fid.cos and fid.sin, that are used in States quadrature processing of phase-sensitive experiments. Protein simulations can take anywhere between minutes and days, depending on the size of the protein, the choice of the basis set and the complexity of the internal mathematics required.

The next stage is apodization, which may be accomplished using any of the window functions available in Spinach (see the manual). In this particular case we will use Gaussian apodization:

    % Apodization

where the last argument is the decay rate (per dataset point) of the Gaussian function – this parameter should be increased until the sinc wiggles disappear from the spectrum. Please note that simulation of experiments that require quantitative analysis of peak intensities and widths should skip the apodization step – you should instead select such a number of points in the sequence parameters as would ensure that the FID has decayed naturally.

The next stage is Fourier transform and quadrature processing. For our chosen example case of 2D NOESY simulation, the following commands are required:

    % F2 Fourier transform

    % States signal

    % F1 Fourier transform

This is standard Matlab Fourier transform syntax – please refer to Matlab documentation for further information. Finally, the plotting function produces a contour plot:

    % Plotting
           [0.01 0.05 0.01 0.05],2,256,6,'positive');

2D and 3D plotting functions in Spinach have a significant number of adjustable parameters – see the manual. The last argument tells the plotter to ignore negative peaks. If those are expected in the spectrum, the argument should be ‘both’.

Other protein NMR simulations proceed in a similar fashion – atom selection at the import stage could be different (here we dropped all non-proton nuclei, but they need to be kept in HNCO simulations), some extra parameters (such as CSAs and NQIs) might be added after PDB and BMRB import, different proximity cut-off might be needed (in deuterated proteins the dipolar interactions are relevant at distances longer than 4.0 Angstrom), and so on.

If you do not find your chosen sequence in the Spinach experiment set, a good starting point for writing it are HSQC, HNCO and NOESY-HSQC sequence files – Spinach pulse sequence programming language is largely self-explanatory and well documented, it requires the same level of programming skill as the one needed to implement a new sequence on an Agilent NMR spectrometer.

Data export to third-party processing software is a complicated matter – Spinach stores 2D and 3D datasets as complex-valued matrices or complex-valued cubes of data. This is simple in Matlab, but may not play well with legacy NMR processing software that expects time domain data in very convoluted backwards-compatible formats. To aid in data export, the following function is provided:


It attempts to write an ASCII file that NMRPipe can read in using txt2pipe.tcl script. At the time of writing, however, it is not entirely clear if this export functionality works as intended. If you find a better way of exporting 2D and 3D FID data from Matlab into NMR processing software, please let us know. Alternatively, the processing and visualization functionality provided with Spinach, along with Matlab’s built-in graphics export functionality, is quite sufficient in most cases.