H-REMD supporting info



Supporting Information for J. Chem. Theory Comp.: Multi-dimensional Replica Exchange Molecular Dynamics Yields a Converged Ensemble of an RNA Tetranucleotide

Christina Bergonzo, Niel M. Henriksen, Daniel R. Roe, Jason M. Swails, Adrian E. Roitberg, and Thomas E. Cheatham III

This supporting information includes:

• Figure S1: Cluster populations or occupancies as a function of time for the T-REMD simulation.

• Figure S2. RMSD population histograms of 1st half vs. 2nd half from r(GACC) T-REMD simulation.

• Figure S3: Population histograms of the dihedral energy for scaled dihedral force constant replicas.

• Table S1: Percent population of defined structures seen in DBscan cluster analysis.

• Figure S4: Representative structures of clusters found in H-REMD and M-REMD simulations, and not found in T-REMD simulations.

• Figure S5: RMSD population histograms from the r(GACC) M-REMD simulations.

• Figure S6: RMSD population histograms of 1st half vs. 2nd half from r(GACC) M-REMD simulations.

• Figure S7: Cluster populations vs. time for H-REMD and M-REMD simulations.

• Figure S8. Correlation of cluster populations between independent H-REMD runs and independent M-REMD runs.

• Figure S9: Representative structures from cluster analysis for T-REMD and H-REMD.

• Figure S10. RMSD population histograms at 277K in different solution environments.

• A summary of 100 mM NaCl and 100 mM KCl system setup.

• Sample scripts for post-processing with CPPTRAJ.


Figure S1: Cluster populations or occupancies as a function of time. The plot shows the percent occupancy for each of the clusters from the r(GACC) T-REMD simulation with 24 replicas as a function of time (extended to ~3.8 μs per replica) at 300 K. DBscan clustering, as described in the previous section, was performed on the RNA structures in the 300 K (temperature-sorted) trajectory. Each color represents a different cluster/conformation. Note that the DBscan clustering (using an epsilon of 0.9 Å and 25 minimum points) placed both the A-form-minor and A-form-major conformations into the first cluster (black).


Figure S2: Estimating convergence of GACC T-REMD simulations from RMSD population histograms. Shown are the normalized populations (y-axis) of mass-weighted RMSD values (x-axis) of all atoms in residues 1-4 relative to a reference structure (A-form RNA). The black line is the average of the first-half and second-half histograms, and the error bars are the standard deviations. The first-half histogram is shown in purple, and the second-half histogram is shown in green.
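The half-versus-half comparison in this caption can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the RMSD series is synthetic, the bin range and count are arbitrary, and the use of the sample standard deviation over the two halves is an assumption (the caption does not say which estimator was used).

```python
import random
import statistics

def normalized_histogram(values, lo, hi, nbins):
    """Return per-bin populations that sum to 1 over [lo, hi)."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        if lo <= v < hi:
            # min() guards against floating-point round-up at the top edge.
            counts[min(nbins - 1, int((v - lo) / width))] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * nbins

def half_vs_half(values, lo, hi, nbins):
    """Histogram each half; return (first, second, mean, stdev) per bin."""
    mid = len(values) // 2
    first = normalized_histogram(values[:mid], lo, hi, nbins)
    second = normalized_histogram(values[mid:], lo, hi, nbins)
    mean = [(a + b) / 2 for a, b in zip(first, second)]
    # Assumption: sample standard deviation of the two half-populations.
    stdev = [statistics.stdev([a, b]) for a, b in zip(first, second)]
    return first, second, mean, stdev

# Synthetic stand-in for an RMSD time series (in Å); not data from the paper.
rng = random.Random(0)
rmsd = [abs(rng.gauss(2.0, 1.0)) for _ in range(10000)]
first, second, mean, stdev = half_vs_half(rmsd, 0.0, 8.0, 40)
```

Small per-bin standard deviations across the whole RMSD range indicate that the two halves sample the same distribution.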


Figure S3: Population histograms of the dihedral energy for scaled dihedral force constant replicas. To highlight the overlap in the dihedral energy distributions when sampling with different Hamiltonians, shown are histograms of the dihedral energy (kcal/mol) from a 5 ns replica-exchange MD (REMD) simulation with 12 replicas in which exchanges were disallowed. The system is the same TIP3P-solvated r(GACC) tetranucleotide described in the referencing paper, and the biased Hamiltonians are obtained by multiplying the dihedral force constants by scaling factors ranging from 1.0 (right) to 0.0 (black, left) at 0.1 intervals, with an extra replica at 0.05 scaling (red). Note that as the force constants are scaled down further (i.e., as the dihedral energy approaches zero), the overlap between neighboring distributions decreases. The overlap of these histograms directly relates to the exchange acceptance rates between neighboring replicas: if there is no overlap, no exchanges will occur. Because the energy overlap becomes quite poor below 0.3x scaling (yellow curve), the 8 values from 1.0x to 0.3x were chosen for the Hamiltonian replicas.
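One way to put a number on the overlap discussed in this caption is the shared area of two normalized histograms. The sketch below is illustrative only: the energy samples are synthetic Gaussians, the function name `overlap_coefficient` is hypothetical, and the overlap coefficient is a stand-in metric rather than the quantity the authors computed.

```python
import random

def overlap_coefficient(a, b, lo, hi, nbins):
    """Shared area of two normalized histograms (0 = disjoint, 1 = identical)."""
    def hist(values):
        counts = [0] * nbins
        width = (hi - lo) / nbins
        for v in values:
            if lo <= v < hi:
                counts[min(nbins - 1, int((v - lo) / width))] += 1
        total = sum(counts)
        return [c / total for c in counts]
    return sum(min(p, q) for p, q in zip(hist(a), hist(b)))

rng = random.Random(1)
# Hypothetical dihedral-energy samples (kcal/mol) for three replicas.
rep_a = [rng.gauss(100.0, 5.0) for _ in range(5000)]
rep_b = [rng.gauss(104.0, 5.0) for _ in range(5000)]  # neighbor, close in energy
rep_c = [rng.gauss(160.0, 5.0) for _ in range(5000)]  # far away in energy

near = overlap_coefficient(rep_a, rep_b, 50.0, 210.0, 80)
far = overlap_coefficient(rep_a, rep_c, 50.0, 210.0, 80)
```

A pair of replicas like `rep_a` and `rep_c`, whose distributions share essentially no area, would be expected to exchange almost never.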

[Figure S3 axes: x, dihedral energy (kcal/mol); y, population.]


Table S1: Populations of r(GACC) conformations from DBscan cluster analysis for the various replica exchange simulations at 300 K. Each column has the units "% of structures in the ensemble." The T-REMD results are for the original simulation out to 2 μs per replica (1). Molecular graphics of the representative structures are shown in the next two figures.


Figure S4: Representative structures from DBscan cluster analysis at 300K. Shown are structures that are populated in the H-REMD and M-REMD simulations but are not seen in the T-REMD simulations. For the r(GACC) sequence the coloring is G1 (red), A2 (green), C3 (cyan), and C4 (magenta).


Figure S5: RMSD population histograms from the two independent GACC M-REMD simulations. Shown are the normalized populations (y-axis) of structures at particular mass-weighted RMSD values (x-axis) of all atoms in residues 1-4 to a reference structure (A-form RNA) for two independent M-REMD simulations. The black line is the average of the last 250 ns of two independent simulations with different starting structure sets, and the error bars are the standard deviations of the populations. The run 1 histogram is shown in red and the run 2 histogram is shown in blue.


Figure S6: Estimating convergence of GACC M-REMD simulations from RMSD population histograms. Shown are the normalized populations (y-axis) of mass-weighted RMSD values (x-axis) of all atoms in residues 1-4 relative to a reference structure (A-form RNA). The black line is the average of the first-half and second-half histograms of run 1, and the red line is the corresponding average for run 2. In both cases the error bars are the standard deviations.


Figure S7: Cluster populations versus time. Shown are cluster populations as a function of simulation frame number (units of picoseconds) for each cluster, comparing H-REMD (top) to M-REMD (bottom). The M-REMD cluster populations converge more rapidly. The plots are: a) H-REMD run 1, b) H-REMD run 2, c) M-REMD run 1, and d) M-REMD run 2.


Figure S8. Correlation of cluster populations between independent H-REMD runs and independent M-REMD runs. The most populated cluster is left out. Since the most populated cluster can contain a majority of the structures, removing it gives more weight to the less-populated clusters. If the simulations are converged, the regression should not change when this cluster is removed. The shaded area represents a 95% confidence interval, where the slope and y-intercept bounds are denoted by alpha and beta, respectively.
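The regression test described in this caption can be sketched with an ordinary least-squares fit, computed with and without the most populated cluster. The populations below are invented for illustration, the helper `linear_fit` is hypothetical, and the 95% confidence band shown in the figure is not computed here.

```python
def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical cluster populations (% of ensemble) from two independent runs.
run1 = [55.0, 18.0, 12.0, 8.0, 4.0, 3.0]
run2 = [54.0, 19.0, 11.5, 8.5, 4.2, 2.8]

# Drop the most populated cluster (ranked by run 1).
top = max(range(len(run1)), key=lambda i: run1[i])
x = [p for i, p in enumerate(run1) if i != top]
y = [p for i, p in enumerate(run2) if i != top]

slope_all, intercept_all = linear_fit(run1, run2)
slope_cut, intercept_cut = linear_fit(x, y)
```

For converged runs both fits should give a slope near 1 and an intercept near 0; a large shift after dropping the top cluster would suggest that agreement was carried by that single cluster.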


Figure S9. Representative structures from cluster analysis for T-REMD 300K replica (Left, G1 in Green) and H-REMD unbiased replica (Right, G1 in Blue) show the flip in chi dihedral for residue G1.


Figure S10. RMSD population histograms at 277 K in different solution environments. Shown are the normalized populations (y-axis) of structures at particular mass-weighted RMSD values (x-axis) of all atoms in residues 1-4 relative to a reference structure (A-form RNA) for the REMD simulations with different solvent environments. The cyan and orange lines are the 277 K temperature replicas from M-REMD runs 1 and 2. The maroon line represents the 277 K temperature replica from an M-REMD run with 100 mM NaCl present, and the green line represents the 277 K temperature replica from an M-REMD run with 100 mM KCl.


System setup for monovalent salt concentrations: The initial r(GACC) structure was built in an A-form major conformation using tleap. For the 100 mM NaCl or KCl solvent conditions, Joung-Cheatham monovalent ion parameters were used (2). The system was neutralized with 3 Na+ or K+ counterions, after which 5 NaCl or KCl ion pairs and 2513 TIP3P water molecules were added (3). The total volume for each system was 89118.338 Å³, and a truncated octahedron box shape was used. The ions were randomized for each system, and the solvent was equilibrated as described in the main text. These equilibrated systems were used as the starting structures for the 100 mM NaCl and 100 mM KCl M-REMD simulations, where the M-REMD simulation conditions were identical to those described in the main text (i.e., 192 replicas, same temperature and Hamiltonian spacing, same input parameters, and an equivalent 300 ns of sampling per replica).
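As a quick consistency check on the numbers in this paragraph, the nominal salt concentration can be estimated from the 5 added ion pairs and the stated box volume. This is a back-of-the-envelope sketch: it counts only the added salt pairs (not the 3 neutralizing counterions) and uses the full box volume.

```python
AVOGADRO = 6.02214076e23        # particles per mole
ANGSTROM3_TO_LITER = 1.0e-27    # 1 Å^3 = 1e-30 m^3 = 1e-27 L

def molarity(n_particles, volume_A3):
    """Molar concentration of n_particles in a box of the given volume (Å^3)."""
    return n_particles / (AVOGADRO * volume_A3 * ANGSTROM3_TO_LITER)

salt_pairs = 5                  # added NaCl or KCl pairs, from the text
box_volume = 89118.338          # Å^3, from the text
conc_mM = 1000.0 * molarity(salt_pairs, box_volume)
print(f"{conc_mM:.1f} mM")
```

The result comes out at roughly 93 mM, consistent with the nominal 100 mM given that only whole ion pairs can be added.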

Sample scripts for post processing with AmberTools:

Ensemble post-processing using CPPTRAJ: The ‘ensemble’ command can be used to read in, sort, and process ensembles of trajectories at the same time. The following script reads in ensembles of trajectories from several REMD runs, strips water and sodium ions (mask expression “:WAT,:Na+”), and writes the results to trajectories named nowat.nc.X in Amber NetCDF trajectory format (the ‘.X’ corresponds to the sorted trajectory’s position in the overall ensemble and is automatically appended by the ‘ensemble’ command). Only the lowest replica filename is specified to the ‘ensemble’ command; all other replicas are automatically searched for based on the file’s numeric extension.

parm full.topo
ensemble ../run.000/TRAJ/rem.crd.000
ensemble ../run.001/TRAJ/rem.crd.000
ensemble ../run.002/TRAJ/rem.crd.000
ensemble ../run.003/TRAJ/rem.crd.000
ensemble ../run.004/TRAJ/rem.crd.000
ensemble ../run.005/TRAJ/rem.crd.000
ensemble ../run.006/TRAJ/rem.crd.000
ensemble ../run.007/TRAJ/rem.crd.000
ensemble ../run.008/TRAJ/rem.crd.000
ensemble ../run.009/TRAJ/rem.crd.000
strip :WAT,:Na+
trajout nowat.nc netcdf
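Because the ensemble input follows a regular pattern (one ‘ensemble’ line per run directory), it can be generated rather than typed by hand. The helper below is a hypothetical convenience, not part of AmberTools; it only reproduces the commands and paths shown in the script above.

```python
def ensemble_input(n_runs, topology="full.topo", out_name="nowat.nc"):
    """Build CPPTRAJ input that reads n_runs ensembles and strips solvent."""
    lines = [f"parm {topology}"]
    for i in range(n_runs):
        # Only the lowest replica file (.000) of each run is listed.
        lines.append(f"ensemble ../run.{i:03d}/TRAJ/rem.crd.000")
    lines.append("strip :WAT,:Na+")
    lines.append(f"trajout {out_name} netcdf")
    return "\n".join(lines) + "\n"

script = ensemble_input(10)
```

Writing `script` to a file and passing that file to cpptraj as its input should be equivalent to running the hand-written version.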

Average linkage agglomerative clustering using PTRAJ:

cluster out c average none all none representative pdb \
    averagelinkage epsilon 2.3 sieve 200 random \
    rms mass :1-4@C*,N*,O*,P*

DBscan clustering using CPPTRAJ:

cluster dbscan minpoints 25 epsilon 0.9 sievetoframe rms \
    :1@N2,O6,C1',P,:2@H2,N6,C1',P,:3@O2,H5,C1',P,:4@O2,H5,C1',P \
    sieve 100 out cvt.dat summary summary.dat repout rep \
    repfmt pdb cpopvtime cpop.agr normframe
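To make the roles of ‘minpoints’ and ‘epsilon’ in the command above concrete, here is a minimal, textbook-style DBSCAN in plain Python. It is a sketch of the general algorithm, not CPPTRAJ's implementation: it has no sieving, and it uses a one-dimensional absolute difference on made-up values where CPPTRAJ uses the RMSD between frames.

```python
def dbscan(points, epsilon, min_points, dist):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within epsilon of point i, including i itself.
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= epsilon]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: joins but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_points:
                queue.extend(nbrs)       # j is also a core point: keep expanding
    return labels

# Made-up 1D values: two tight groups and one outlier.
points = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3, 20.0]
labels = dbscan(points, epsilon=0.5, min_points=3, dist=lambda a, b: abs(a - b))
```

Points with at least `min_points` neighbors within `epsilon` seed or extend a cluster, and points that belong to no dense region are labeled as noise (-1).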


Principal Component Analysis, including histogramming of projections, using CPPTRAJ. Note that in the following script the variable $STOP1 corresponds to the last frame of traj.run1.nc and $START2 corresponds to the first frame of traj.run2.nc when both trajectories are combined (i.e., it is $STOP1 + 1).

# Read in both trajectories
trajin traj.run1.nc
trajin traj.run2.nc
# RMS-fit to first frame
rms first :1-4&!@H=
# Create an average structure
average gaccAvg.rst7 ncrestart
# Save coordinates as ‘crd1’
createcrd crd1
run
# Fit to average structure
reference gaccAvg.rst7.1 [avg]
# RMS-fit to average structure
crdaction crd1 rms ref [avg] :1-4&!@H=
# Calculate coordinate covariance matrix
crdaction crd1 matrix covar :1-4&!@H= name gaccCovar
# Diagonalize coordinate covariance matrix, first 15 E.vecs
runanalysis diagmatrix gaccCovar out evecs.dat vecs 15
# Now create separate projections for each trajectory
crdaction crd1 projection P1 modes evecs.dat \
    beg 1 end 15 :1-4&!@H= crdframes 1,$STOP1
crdaction crd1 projection P2 modes evecs.dat \
    beg 1 end 15 :1-4&!@H= crdframes $START2,last
# Now histogram first 5 projections for each
hist P1:1,*,*,*,100 out pca.hist.agr norm name P1-1
hist P1:2,*,*,*,100 out pca.hist.agr norm name P1-2
hist P1:3,*,*,*,100 out pca.hist.agr norm name P1-3
hist P1:4,*,*,*,100 out pca.hist.agr norm name P1-4
hist P1:5,*,*,*,100 out pca.hist.agr norm name P1-5
hist P2:1,*,*,*,100 out pca.hist.agr norm name P2-1
hist P2:2,*,*,*,100 out pca.hist.agr norm name P2-2
hist P2:3,*,*,*,100 out pca.hist.agr norm name P2-3
hist P2:4,*,*,*,100 out pca.hist.agr norm name P2-4
hist P2:5,*,*,*,100 out pca.hist.agr norm name P2-5
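The frame bookkeeping for $STOP1 and $START2 in the PCA script can be sketched as a small substitution step. The helper name `projection_commands` and the frame count of 5000 are hypothetical; the command text is taken from the script, with the line continuations joined.

```python
def projection_commands(n_frames_run1, mask=":1-4&!@H="):
    """Fill in the crdframes ranges for the two projection commands."""
    stop1 = n_frames_run1      # last frame of traj.run1.nc (frames are 1-based)
    start2 = stop1 + 1         # first frame of traj.run2.nc in the combined set
    return [
        f"crdaction crd1 projection P1 modes evecs.dat "
        f"beg 1 end 15 {mask} crdframes 1,{stop1}",
        f"crdaction crd1 projection P2 modes evecs.dat "
        f"beg 1 end 15 {mask} crdframes {start2},last",
    ]

cmds = projection_commands(5000)   # 5000 frames is a made-up count
```

Keeping the two ranges adjacent and non-overlapping ensures that every frame of the combined coordinate set is projected exactly once, under either P1 or P2.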


References:

(1) Henriksen, N. M.; Roe, D. R.; Cheatham, T. E., III. J. Phys. Chem. B 2013, 117, 4014–4027.

(2) Joung, I. S.; Cheatham, T. E., III. J. Phys. Chem. B 2008, 112, 9020–9041.

(3) Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L. J. Chem. Phys. 1983, 79, 926–935.