ensemble_md.analysis

ensemble_md.analysis.analyze_traj

The analyze_traj module provides methods for analyzing trajectories of a REXEE simulation.

ensemble_md.analysis.analyze_traj.extract_state_traj(dhdl)[source]

Extracts the state-space trajectory from a GROMACS DHDL file. Note that the state indices here are local indices.

Parameters

dhdl (str) – The file path of the GROMACS DHDL file to be parsed.

Returns

  • traj (list) – A list that represents the state-space trajectory

  • t (list) – A list that represents the time frames of the trajectory

ensemble_md.analysis.analyze_traj.stitch_time_series(files, rep_trajs, shifts=None, dhdl=True, col_idx=-1, save_npy=True)[source]

Stitches the state-space/CV-space trajectories for each starting configuration from DHDL files or PLUMED output files generated at different iterations.

Parameters
  • files (list) – A list of lists of file paths to GROMACS DHDL files or general GROMACS XVG files or PLUMED ouptput files. Specifically, files[i] should be a list containing the files of interest from all iterations in replica i. The files should be sorted naturally.

  • rep_trajs (list) – A list of lists that represents the replica-space trajectories for each starting configuration. For example, rep_trajs[0] = [0, 2, 3, 0, 1, ...] means that starting configuration 0 transitioned to replica 2, then 3, 0, 1, in iterations 1, 2, 3, 4, …, respectively. This can be read from the rep_trajs.npy file generated by the REXEE simulation.

  • shifts (list, Optional) – A list of values for shifting the local state indices to global indices for each replica. The length of the list should be equal to the number of replicas. This is only needed when dhdl=True.

  • dhdl (bool, Optional) – Whether the input files are GROMACS dhdl files. If dhdl=False, the input files must be readable by numpy.loadtxt() assuming that the start of a comment is indicated by either the # or @ characters. Such files include any GROMACS XVG files or PLUMED output files (output by plumed driver, for instance). In this case, trajectories of the configurational collective variable of interest are generated. The default is True.

  • col_idx (int, Optional) – The index of the column to be extracted from the input files. This is only needed when dhdl=False, By default, we extract the last column.

  • save_npy (bool, Optional) – Whether to save the output trajectories as an NPY file. The default is True.

Returns

trajs – A list that contains lists of state-space/CV-space trajectory (in global indices) for each starting configuration. For example, trajs[i] is the state-space/CV-space trajectory of starting configuration i.

Return type

list

Example

>>> import glob
>>> import natsort
>>> import numpy as np
>>> from ensemble_md.analysis import analyze_traj
>>> n_sim = 4  # Assuming 4 replicas sampling states sets 0-3, 2-5, 4-7, and 6-9, respectively.
>>> files = [natsort.natsorted(glob.glob(f'sim_{i}/iteration_*/*dhdl*xvg')) for i in range(n_sim)]
>>> shifts = [0, 2, 4, 6]
>>> rep_trajs = np.load('rep_trajs.npy')  # rep_trajs.npy is generated by the REXEE simulation
>>> state_trajs = analyze_traj.stitch_time_series(files, rep_trajs, shifts, dhdl=True, save_npy=True)
ensemble_md.analysis.analyze_traj.stitch_time_series_for_sim(files, shifts=None, dhdl=True, col_idx=-1, save_npy=True)[source]

Stitches the state-space/CV-space time series in the same replica/simulation folder. That is, the output time series is contributed by multiple different trajectories (initiated by different starting configurations) to a certain alchemical range.

Parameters
  • files (list) – A list of lists of file paths to GROMACS DHDL files or general GROMACS XVG files or PLUMED output files. Specifically, files[i] should be a list containing the files of interest from all iterations in replica i. The files should be sorted naturally.

  • shifts (list, Optional) – A list of values for shifting the local state indices to global indices for each replica. The length of the list should be equal to the number of replicas. This is only needed when dhdl=True.

  • dhdl (bool, Optional) – Whether the input files are GROMACS dhdl files. If dhdl=False, the input files must be readable by numpy.loadtxt() assuming that the start of a comment is indicated by either the # or @ characters. Such files include any GROMACS XVG files or PLUMED output files (output by plumed driver, for instance). In this case, trajectories of the configurational collective variable of interest are generated. The default is True.

  • col_idx (int, Optional) – The index of the column to be extracted from the input files. This is only needed when dhdl=False, By default, we extract the last column.

  • save_npy (bool, Optional) – Whether to save the output trajectories as an NPY file. The default is True.

Returns

trajs – A list that contains lists of state-space/CV-space trajectory (in global indices) for each replica. For example, trajs[i] is the state-space/CV-space trajectory of replica i.

Return type

list

Example

>>> import glob
>>> import natsort
>>> from ensemble_md.analysis import analyze_traj
>>> n_sim = 4  # Assuming 4 replicas sampling states sets 0-3, 2-5, 4-7, and 6-9, respectively.
>>> files = [natsort.natsorted(glob.glob(f'sim_{i}/iteration_*/*dhdl*xvg')) for i in range(n_sim)]
>>> shifts = [0, 2, 4, 6]
>>> state_trajs = analyze_traj.stitch_time_series(files, shifts, dhdl=True, save_npy=True)
ensemble_md.analysis.analyze_traj.stitch_xtc_trajs(gmx_executable, files, rep_trajs)[source]

Demuxes GROMACS trajectories from different replicas into individual continuous trajectories.

Parameters
  • gmx_executable (str) – The path of the GROMACS executable.

  • files (list) – A list of lists of file paths to GROMACS XTC files. Specifically, files[i] should be a list containing the paths to the files of interest from all iterations in replica i. The files should be sorted naturally.

  • rep_trajs (list) – A list of lists that represents the replica space trajectories for each starting configuration. For example, rep_trajs[0] = [0, 2, 3, 0, 1, ...] means that starting configuration 0 transitioned to replica 2, then 3, 0, 1, in iterations 1, 2, 3, 4, …, respectively. This can be read from the rep_trajs.npy file generated by the REXEE simulation.

ensemble_md.analysis.analyze_traj.convert_npy2xvg(trajs, dt, subsampling=1)[source]

Convert a state_trajs.npy or cv_trajs.npy file to \(R\) XVG files that have two columns: time (ps) and state index/CV value. (\(R\) is the number of replicas.)

Parameters
  • trajs (numpy.ndarray) – The state-space or CV-space trajectories read from state_trajs.npy or cv_trajs.npy.

  • dt (float) – The time interval (in ps) between consecutive frames of the trajectories.

  • subsampling (int, Optional) – The stride for subsampling the time series. The default is 1.

ensemble_md.analysis.analyze_traj.traj2transmtx(traj, N, normalize=True)[source]

Computes the transition matrix given a trajectory. For example, if a state-space trajectory from a EXE or HREX simulation is given, a state-space transition matrix is returned. If a trajectory showing transitions between replicas in a REXEE simulation is given, a replica-space transition matrix is returned.

Parameters
  • traj (list) – A list of state indices showing the trajectory in the state space. The index is assumed to start from 0.

  • N (int) – The size (N) of the expcted transition matrix (N by N).

  • normalize (bool) – Whether to normalize the matrix so that each row sum to 1. If normalize=False, then the entries will be the counts of transitions.

Returns

transmtx – The transition matrix computed from the trajectory

Return type

numpy.ndarray

ensemble_md.analysis.analyze_traj.plot_rep_trajs(trajs, fig_name, dt=None, stride=None)[source]

Plots the replica-space trajectories for a REXEE simulation.

Parameters
  • trajs (list) – A list of lists that represent the all replica-space trajectories.

  • fig_name (str) – The file path to save the figure.

  • dt (float or None, Optional) – One trajectory timestep in ps. If dt=None, the function assumes there are no time frames but MC steps. The default is None.

  • stride (int, Optional) – The stride for plotting the time series. The default is 100 if the length of any trajectory has more than one million frames. Otherwise, it will be 1. Typically plotting more than 10 million frames can take a lot of memory.

ensemble_md.analysis.analyze_traj.plot_state_trajs(trajs, state_ranges, fig_name, dt=None, stride=None, title_prefix='Trajectory')[source]

Plots the state-space trajectories for a REXEE simulation.

Parameters
  • trajs (list) – A list of lists of state indices generated either from different continuous trajectories or from different alchemical ranges (i.e. from different simulation folders). This can be generated by either stitch_time_series() or stitch_time_series_for_sim().

  • state_ranges (list) – A list of lists of showing the state indices sampled by each replica.

  • fig_name (str) – The file path to save the figure.

  • dt (float or None, Optional) – One trajectory timestep in ps. If dt=None, the function assumes there are no time frames but MC steps. The default is None.

  • stride (int, Optional) – The stride for plotting the time series. The default is 10 if the length of any trajectory has more than 100,000 frames. Otherwise, it will be 1. Typically plotting more than 10 million frames can take a lot of memory.

  • title_prefix (str, Optional) – The prefix shared by the titles of the subplots. For example, if title_prefix is set to “Trajectory”, then the titles of the subplots will be “Trajectory 0”, “Trajectory 1”, …, etc. The default is 'Trajectory'.

ensemble_md.analysis.analyze_traj.plot_state_hist(trajs, state_ranges, fig_name, stack=True, figsize=None, prefix='Trajectory', subplots=False, save_hist=True)[source]

Plots the histograms of state visitation for all replicas in a REXEE simulation.

Parameters
  • trajs (list) – A list of lists of state indices generated either from different continuous trajectories or from different alchemical ranges (i.e. from different simulation folders). This can be generated by either stitch_time_series() or stitch_time_series_for_sim().

  • state_ranges (list) – A list of lists of showing the state indices sampled by each replica.

  • fig_name (str) – The file path to save the figure.

  • stack (bool, Optional) – Whether to stack the histograms. This parameter is only relevant when subplots is False. The default is True.

  • figsize (tuple, Optional) – A tuple specifying the length and width of the output figure. The default is (6.4, 4.8) for cases having less than 30 states and (10, 4.8) otherwise.

  • prefix (str, Optional) – The prefix shared by the titles of the subplots, or the labels shown in the same plot. For example, if prefix is set to “Trajectory”, then the titles/labels of the will be “Trajectory 0”, “Trajectory 1”, …, etc. The default is 'Trajectory'.

  • subplots (bool, Optional) – Whether to plot the histograms in multiple subplots, with the title of each based on the value of prefix. The default is False.

  • save_hist (bool, Optional) – Whether to save the histogram data. The default is True.

Returns

hist_data – The histogram data of the each state index time series.

Return type

list

ensemble_md.analysis.analyze_traj.calc_hist_rmse(hist_data, state_ranges)[source]

Calculates the RMSE of accumulated histogram counts of the state index. The reference is determined by assuming all alchemical states have equal chances to be visited, i.e. the alchemical weights are perfect.

Parameters
  • hist_data (list) – The histogram data of the state index for each trajectory.

  • state_ranges (list) – A list of lists of showing the state indices sampled by each replica.

Returns

rmse – The RMSE value of accumulated histogram counts of the state index, with respect to the case where equal sampling is reached for all states.

Return type

float

ensemble_md.analysis.analyze_traj.plot_transit_time(trajs, N, fig_prefix=None, dt=None, folder='.')[source]

Calculates and plots the average transit times for each trajectory, including the time it takes from states 0 to k, from k to 0 and from 0 to k back to 0 (i.e., round-trip time). If there are more than 100 round-trips, 3 histograms corresponding to t_0k, t_k0 and t_roundtrip will be generated.

Parameters
  • trajs (list) – A list of lists that represent the state-space trajectories of all continuous trajectories.

  • N (int) – The total number of states in the whole alchemical range.

  • fig_prefix (str, Optional) – A prefix to use for all generated figures. The default is None, which means no prefix.

  • dt (float or None, Optional) – One trajectory timestep in ps. If dt=None, the function assumes there are no time frames but MC steps. The default is None.

  • folder (str, Optional) – The directory where the figures will be saved. The default is the current directory.

Returns

  • t_0k_list (list) – A list of transit times from states 0 to k for each trajectory.

  • t_k0_list (list) – A list of transit times from states k to 0 for each trajectory.

  • t_roundtrip_list (list) – A list of round-trip times for each trajectory.

  • units (str) – The units of the time.

ensemble_md.analysis.analyze_traj.plot_g_vecs(g_vecs, refs=None, refs_err=None, plot_rmse=True)[source]

For each alchemical intermediate state, plots the alchemical weight as a function of the iteration index. Note that the alchemical weight of the first state (which is always 0) is skipped. If the reference values are given, they will be plotted in the figure (as horizontoal lines) and a final RMSE will be calculated. Note that this function is only meaningful for weight-updating REXEE simulations.

Parameters
  • g_vecs (numpy.ndarray) – The alchemical weights of all states as a function of the iteration index. The shape should be (n_iterations, n_states). Such an array can be directly read from g_vecs.npy generated by a REXEE simulation.

  • refs (numpy.ndarray) – The reference values of the alchemical weights. The default is None.

  • refs_err (list or numpy.ndarray, Optional) – The errors of the reference values. The default is None.

  • plot_rmse (bool, Optional) – Whether to plot RMSE as a function of the iteration index. The default is True.

ensemble_md.analysis.analyze_traj.get_swaps(REXEE_log='run_REXEE_log.txt')[source]

For each replica, identifies the states involved in proposed and accepted swaps.

Parameters

REXEE_log (str, Optional) – The output log file of the REXEE simulation. The default is 'run_REXEE_log.txt'.

Returns

  • proposed_swaps (list) – A list of dictionaries showing where the swaps were proposed in each replica. Each dictionary (corresponding to one replica) have keys being the global state indices and values being the number of proposed swaps that involved the state indicated by the key.

  • accepted_swaps (list) – A list of dictionaries showing where the swaps were accepted in each replica. Each dictionary (corresponding to one replica) have keys being the global state indices and values being the number of accepted swaps that involved the state indicated by the key.

Example

Below is an example based on a REXEE simulations having four replicas sampling states 0-4, 1-5, 2-6 and 3-7, respectively.

>>> from ensemble_md.analysis import analyze_traj
>>> proposed_swaps, accepted_swaps = analyze_traj.get_swaps('run_REXEE_log.txt')
>>> for i in range(len(proposed_swaps)):
>>>     print(proposed_swaps[i])
{0: 0, 1: 3, 2: 1, 3: 0, 4: 0}
{1: 2, 2: 2, 3: 0, 4: 1, 5: 1}
{2: 3, 3: 3, 4: 2, 5: 0, 6: 0}
{3: 0, 4: 1, 5: 0, 6: 3, 7: 0}

Todo

We should be able to only use rep_trajs.npy and state_trajs.npy instead of parsing the REXEE log file to reach the same goal.

See also

plot_swaps()

ensemble_md.analysis.analyze_traj.plot_swaps(swaps, swap_type='', stack=True, figsize=None)[source]

Plots the histogram of the proposed swaps or accepted swaps for each replica.

Parameters
  • swaps (list) – A list of dictionaries showing showing the number of swaps for each state for each replica. This list could be either of the outputs from get_swaps.

  • swap_type (str, Optional) – The type of swaps to be plotted. Common options include 'accepted' and 'proposed'. This value will only influence the name of y-axis and the output file name. The default is an empty string.

  • stack (bool, Optional) – Whether to stack the histograms. The default is True.

  • figsize (tuple, Optional) – A tuple specifying the length and width of the output figure. The default is (6.4, 4.8) for cases having less than 30 states and (10, 4.8) otherwise.

See also

get_swaps()

ensemble_md.analysis.analyze_traj.get_g_evolution(log_files, start_state, end_state, avg_frac=0, avg_from_last_update=False)[source]

For a weight-updating simulation, gets the time series of the alchemical weights of all states. Note that this funciton is only suitable for analyzing either a single expanded ensemble simulation or a replica in a REXEE simulation. For the latter case, all the log files for the replica should be provided.

Parameters
  • log_files (list) – The list of file paths to the log file(s). If multiple log files are provided (for a REXEE simulation), please make sure the files are in the correct order such that the time series of the alchemical weights are continuous.

  • start_state (int) – The index of the first state of interest. The index starts from 0.

  • end_state (int) – The index of the last state of interest. The index start from 0. For example, if start_state is set to 1 and end_state is set to 3, then the weight evolution for states 1, 2 and 3 will be extracted.

  • avg_frac (float, Optional) – The fraction of the last part of the simulation to be averaged. The default is 0, which means no averaging. Note that this parameter is ignored if avg_from_last_update is True.

  • avg_from_last_update (bool, Optional) – Whether to average from the last update of the Wang-Landau incrementor. If this option is set to False, or the option is set to True but the Wang-Landau incrementor was not updated in the provided log file(s), the all weights will be used for averging.

Returns

  • g_vecs_all (list) – The alchemical weights of all states as a function of time. It should be a list of lists. For example, g_vecs_all[i] should be the alchemical weights of all states at time frame with index i. Weights after equilibration are not included.

  • g_vecs_avg (list) – The alchemical weights of all states averaged over the last part of the simulation. If avg_frac is 0, None will be returned. Note that weights after equilibration are not considered.

  • g_vecs_err (list) – The errors of the alchemical weights of all states averaged over the last part of the simulation. If avg_frac is 0 and avg_from_last_update is False, None will be returned. Note that weights after equilibration are not considered.

Example

>>> from ensemble_md import analyze_traj
>>> log_files = ['EXE.log']  # For analyzing a single expanded ensemble simulation
>>> results = analyze_traj.get_g_evolution(log_files, start_state=0, end_state=6)
ensemble_md.analysis.analyze_traj.get_dg_evolution(log_files, start_state, end_state)[source]

For a weight-updating simulation, gets the time series of the weight difference (\(Δg = g_2-g_1\)) between the states of interest.

Parameters
  • log_files (list) – The list of file paths to the log file(s). If multiple log files are provided (for a REXEE simulation), please make sure the files are in the correct order such that the time series of the alchemical weights are continuous.

  • start_state (int) – The index of the state (starting from 0) whose weight is \(g_1\).

  • end_state (int) – The index of the state (starting from 0) whose weight is \(g_2\).

Returns

dg – The time series of \(Δg\).

Return type

list

ensemble_md.analysis.analyze_traj.plot_dg_evolution(log_files, start_state, end_state, start_idx=None, end_idx=None, dt_log=2)[source]

For a weight-updating simulation, plots the time series of the weight difference (\(Δg = g_2-g_1\)) between the states of interest.

Parameters
  • log_files (list) – The list of file paths to the log file(s). If multiple log files are provided (for a REXEE simulation), please make sure the files are in the correct order such that the time series of the alchemical weights are continuous.

  • start_state (int) – The index of the state (starting from 0) whose weight is \(g_1\).

  • end_state (int) – The index of the state (starting from 0) whose weight is \(g_2\).

  • start_idx (int, Optional) – The index of the first frame to be plotted. The default is None, which means the first frame.

  • end_idx (int, Optional) – The index of the last frame to be plotted. The default is None, which means the last frame.

  • dt_log (float, Optional) – The time interval (in ps) between two consecutive frames in the log file. The default is 2 ps.

ensemble_md.analysis.analyze_traj.get_delta_w_updates(log_file, plot=False)[source]

Parses the log file of a weight-updating simulation and identifies the time frames when the Wang-Landau incrementor was updated.

Parameters
  • log_file (str) – The file path of the LOG file.

  • plot (bool, Optional) – Whether to plot the Wang-Landau incrementor as a function of time. The default is False.

Returns

  • t_updates (list) – A list of time frames (in ns) when the Wang-Landau incrementor was updated.

  • delta_w_updates (list) – A list of the updated Wang-Landau incrementors. Should be the same length as t_updates.

  • equil (bool) – Whether the weights got equilibrated during the simulation.

ensemble_md.analysis.analyze_matrix

The analyze_matrix module provides methods for analyzing matrices obtained from a REXEE simulation.

ensemble_md.analysis.analyze_matrix.calc_transmtx(log_file, simulation_type='EE')[source]

Parses the log file to get the transition matrix of an expanded ensemble or replica exchange simulation. Notably, a theoretical transition matrix is only available in expanded ensemble.

Parameters
  • log_file (str) – The file path of the log file to be parsed.

  • simulation_type (str, Optional) – The type of simulation. It can be either a EE (expanded ensemble) or HREX (Hamiltonian replica exchange) simulation. The default is EE.

Returns

  • empirical (numpy.ndarray) – The final empirical state transition matrix.

  • theoretical (None or numpy.ndarray) – The final theoretical state transition matrix.

  • diff_matrix (None or numpy.ndarray) – The difference calculated by subtracting the theoretical matrix from the empirical matrix.

ensemble_md.analysis.analyze_matrix.calc_equil_prob(trans_mtx)[source]

Calculates the equilibrium probability of each state from a transition matrix. The input transition matrix can be either left or right stochastic, although the left stochastic ones are not common in GROMACS. Generally, transition matrices in GROMACS are either doubly stochastic (replica exchange), or right stochastic (expanded ensemble). For the latter case, the stationary distribution vector is the left eigenvector corresponding to the eigenvalue 1 of the transition matrix. (For the former case, it’s either left or right eigenvector corresponding to the eigenvalue 1 - as the left and right eigenvectors are the same for a doubly stochasti matrix.) Note that the input transition matrix can be either state-space or replica-space.

Parameters

trans_mtx (numpy.ndarray) – The input transition matrix.

Returns

equil_prob

Return type

numpy.ndarray

ensemble_md.analysis.analyze_matrix.calc_spectral_gap(trans_mtx, atol=1e-08, n_bootstrap=50, seed=None)[source]

Calculates the spectral gap of an input transition matrix and estimates its uncertainty using the bootstrap method.

Parameters
  • trans_mtx (numpy.ndarray) – The input transition matrix

  • atol (float, Optional) – The absolute tolerance for checking the sum of columns and rows. The default is 1e-8.

  • n_bootstrap (int, Optional) – The number of bootstrap iterations for uncertainty estimation. The default is 50.

  • seed (int, Optional) – The seed for the random number generator used in the bootstrap method. The default is None, which means no seed will be used.

Returns

  • spectral_gap (float) – The spectral gap of the input transition matrix.

  • spectral_gap_err (float) – The estimated uncertainty of the spectral gap.

  • eig_vals (list) – The list of eigenvalues. The maximum eigenvalue should always be 1.

ensemble_md.analysis.analyze_matrix.calc_t_relax(spectral_gap, exchange_period, spectral_gap_err=None)[source]

Calculates the relaxation time given the spectral gap of a transition matrix of interest. By defintion, the relaxation time is equal to the exchange period divided by the spectral gap.

Parameters
  • spectral_gap (float) – The input spectral gap.

  • exchange_period (float) – The exchange period of the simulation in ps.

  • spectral_gap_err (float, Optional) – The uncertainty of the spectral gap, which is used to calculate the uncertainty of the relaxation time by error propagation. The default is None, in which case the uncertainty of the relaxation time will be None.

Returns

  • t_relax (float) – The relaxation time in ps.

  • t_relax_err (float) – The uncertainty of the relaxation time in ps.

ensemble_md.analysis.analyze_matrix.split_transmtx(trans_mtx, n_sim, n_sub)[source]

Splits the input state-space transition matrix into blocks of smaller matrices corresponding to different state states sampled by different replicas. Notably, the function assumes homogeneous shifts and number of states across replicas. Also, the blocks of the transition matrix is generally not doubly stochastic but right stochastic even if the input is doubly stochastic.

Parameters
  • trans_mtx (numpy.ndarray) – The input state-space transition matrix to be split.

  • n_sim (int) – The number of replicas in the REXEE simulation.

  • n_sub (int) – The number of states in each replica.

Returns

sub_mtx – Blocks of transition matrices split from the input transition matrix.

Return type

list

ensemble_md.analysis.analyze_matrix.plot_matrix(matrix, fig_name, title=None, start_idx=0)[source]

Visualizes a matrix in a heatmap.

Parameters
  • matrix (numpy.ndarray) – The matrix to be visualized.

  • fig_name (str) – The file path to save the figure.

  • title (str, Optional) – The title of the plot. The default is None, which means no title will be added.

  • start_idx (int, Optional) – The starting value of the state index. The default is 0.

ensemble_md.analysis.analyze_free_energy

The analyze_free_energy module provides functions for performing free energy calculations for REXEE simulations.

ensemble_md.analysis.analyze_free_energy.preprocess_data(files_list, temp, data_type, spacing=1, t=None, g=None)[source]

This function preprocesses \(u_{nk}\) or \(dH/dλ\) data for all replicas in an REXEE simulation. For each replica, it reads in \(u_{nk}\) or \(dH/dλ\) data from all iterations, concatenate them, remove the equilibrium region and and decorrelate the concatenated data.

Parameters
  • files_list (list) – A list of lists of naturally sorted DHDL file names from all iterations for different replicas. Specifically, files[i] should be the list of DHDL file names from all iterations of replica i.

  • temp (float) – The simulation temperature in Kelvin. We assume all replicas were performed at the same temperature.

  • data_type (str) – The type of energy data to be procssed. Should be either 'u_nk' or 'dhdl'.

  • spacing (int, Optional) – The spacing (in the number of data points) to consider when subsampling the data, which is assumed to be the same for all replicas. The default is 1.

  • t (int, Optional) – The user-specified index that indicates the start of equilibrated data. If this parameter is not specified, the function will estimate it using pymbar.timeseries.detect_equilibration(). The default is None.

  • g (int, Optional) – The user-specified statistical inefficiency. If this parameter is not specified, the function will estimate it using pymbar.timeseries.detect_equilibration(). The default is None.

Returns

  • preprocessed_data_all (pandas.Dataframe) – A list of preprocessed \(u_{nk}\) or \(dH/dλ\) data for all replicas that can serve as the input to free energy estimators.

  • t_list (list) – A list of indices indicating the start of equilibrated data for different replicas. This list will be empty if the parameter t is specified.

  • g_list (list) – A list of statistical inefficiencies of the equilibrated data for different replicas. This list will be empty if the parameter g is specified.

ensemble_md.analysis.analyze_free_energy.calculate_free_energy(data, state_ranges, df_method='MBAR', err_method='propagate', n_bootstrap=None, seed=None)[source]

Caculates the averaged free energy profile with the chosen method given \(u_{nk}\) or \(dH/dλ\) data obtained from all replicas of the REXEE simulation. Available methods include TI, BAR, and MBAR. TI requires \(dH/dλ\) data while the other two require \(u_{nk}\) data.

Parameters
  • data (pandas.Dataframe) – A list of \(u_{nk}\) or \(dH/dλ\) dataframes obtained from all replicas of the REXEE simulation. Preferrably, the \(u_{nk}\) or \(dH/dλ\) data should be preprocessed by the function proprocess_data().

  • state_ranges (list) – A list of lists of showing the state indices sampled by each replica.

  • df_method (str, Optional) – The method used to calculate the free energy profile. Available choices include "TI", "BAR", and "MBAR". The default is "MBAR".

  • err_method (str, Optional) – The method used to estimate the uncertainty of the free energy combined across multiple replicas. Available options include "propagate" and "bootstrap". The bootstrapping method is more accurate but much more computationally expensive than simple error propagation.

  • n_bootstrap (int, Optional) – The number of bootstrap iterations. This parameter is used only when the boostrapping method is chosen to estimate the uncertainties of the free energies. The default is None. In the CLI analyze_REXEE, this number is set by the YAML parameter n_bootstrap.

  • seed (int, Optional) – The random seed for bootstrapping. Only relevant when err_method is "bootstrap". The default is None.

Returns

  • f (list) – The full-range free energy profile.

  • f_err (list) – The uncertainties corresponding to the values in f.

  • estimators (list) – A list of estimators fitting the input data for all replicas. With this, the user can access all the free energies and their associated uncertainties for all states and replicas.

Example

In the CLI analyze_REXEE, lines like below are used:

>>> import glob
>>> import natsort
>>> from ensemble_md.analysis import analyze_free_energy
>>> state_ranges = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]
>>> file_list = [natsort.natsorted(glob.glob(f'sim_{i}/iteration_*/dhdl*xvg')) for i in range(4)]
>>> data_list, _, _ = analyze_free_energy.preprocess_data(file_list, temp=300, data_type='u_nk')
>>> f, _, _ = analyze_free_energy.calculate_free_energy(data_list, state_ranges, "MBAR", "propagate")
ensemble_md.analysis.analyze_free_energy.calculate_df_rmse(estimators, df_ref, state_ranges)[source]

Calculates the RMSE values of the free energy profiles of different state ranges given the reference free energy profile for the whole range of states.

Parameters
  • estimators (list) – A list of estimators fitting the input data for all replicas. With this, the user can access all the free energies and their associated uncertainties for all states and replicas. The estimators should be generated by the function calculate_free_energy().

  • df_ref (list) – A list of values corresponding to the free energies of the whole range of states. The length of the list should be equal to the number of states in total.

  • state_ranges (list) – A list of lists of showing the state indices sampled by each replica.

Returns

rmse_list – A list of RMSE values of the free energy profiles of different state ranges.

Return type

list

ensemble_md.analysis.analyze_free_energy.plot_free_energy(f, f_err, fig_name)[source]

Plots the free energy profile with error bars.

Parameters
  • f (list) – The full-range free energy profile.

  • f_err (list) – The uncertainties corresponding to the values in f.

  • fig_name (str) – The file path to save the figure.

ensemble_md.analysis.analyze_free_energy.average_weights(g_vecs, frac)[source]

Given the time series of the whole range of alchemical weights, averages the weight differences between the the coupled and decoupled states. This can be an estimate of the free energy difference between two end states. This function is only relevant for weight-updating REXEE simulations.

Parameters
  • g_vecs (numpy.ndarray) – An array of alchemical weights of the whole range of states as a function of simulation time, which is typically generated by combine_weights.

  • frac (float) – The fraction of g_vecs to average over. frac=0.2 means average the last 20% of the weight vectors will be averaged.

Returns

  • dg_avg (float) – The averaged weight difference between the coupled and decoupled states.

  • dg_avg_err (float) – The errors corresponding to the value of dg_avg.

ensemble_md.analysis.synthesize_data

The synthesize_data module provides methods for synthesizing REXEE data, specifically trajectories and transition matrices. This is mainly useful for carrying out the bootstrap analysis for some analysis, such as the spectral gap calculation in analyze_matrix.

ensemble_md.analysis.synthesize_data.synthesize_traj(trans_mtx, n_frames=100000, method='transmtx', start=0, seed=None)[source]

Synthesizes a trajectory based on the input transition matrix.

Parameters
  • trans_mtx (numpy.ndarray) – The input transition matrix.

  • n_frames (int, Optional) – The number of frames to be generated. The default value is 100000.

  • method (str, Optional) – The method to be used for trajectory synthesis. It can be either 'transmtx' or 'equil_prob'. The former refers to generating the trajectory by simulating the moves between states based on the input transition matrix, with the trajectory starting from the state specified by the start parameter. If the method is equil_prob, the trajectory will be generated by simply sampling from the equilibrium probability distribution calculated from the input transition matrix. The method 'transmtx' should generate a trajectory characterized by a transition matrix similar to the input one, while the method 'equil_prob' may generate a trajectory that has a significantly different transition matrix. Still, a trajectory generated by either method should have a similar underlying equilibrium probability distribution (hence the spectral gap as well) as the input transition matrix. The default value is 'transmtx'.

  • start (int, Optional) – The starting state of the synthesized trajectory if the method is transmtx. The default value is 0, i.e., the first state. This parameter is ignored if the method is equil_prob.

  • seed (int, Optional) – The seed for the random number generator. The default value is None, i.e., the seed will not be set.

Returns

syn_traj – The synthesized trajectory.

Return type

numpy.ndarray

ensemble_md.analysis.synthesize_data.synthesize_transmtx(trans_mtx, n_frames=100000, seed=None)[source]

Synthesizes a normalized transition matrix similar to the input transition matrix by first generating a trajectory using synthesize_traj with method='transmtx' and then calculating the transition matrix from the synthesized trajectory. This function can be useful in performing boostrapping analysis when estimating the uncertainty of the spectral gap associated with the input transition matrix.

Parameters
  • trans_mtx (numpy.ndarray) – The input transition matrix.

  • n_frames (int, Optional) – The number of frames of the synthesized trajectory from which the mock transition matrix is calculated. The default value is 100000.

  • seed (int, Optional) – The seed for the random number generator. The default value is None, i.e., the seed will not be set.

Returns

  • syn_mtx (numpy.ndarray) – The synthesized transition matrix.

  • syn_traj (numpy.ndarray) – The synthesized trajectory from which the transition matrix is calculated.

  • diff_mtx (numpy.ndarray) – The input transition matrix subtracted by the synthesized transition matrix.

ensemble_md.analysis.clustering

ensemble_md.analysis.clustering.cluster_traj(gmx_executable, inputs, grps, coupled_only=True, method='linkage', cutoff=0.1, suffix=None)[source]

Performs clustering analysis on a trajectory using the GROMACS command gmx cluster. Note that this function encompasses the use of all the other functions in the module, including get_cluster_info(), get_cluster_members(), and analyze_transitions().

Parameters
  • gmx_executable (str) – The path of the GROMACS executable.

  • inputs (dict) – A dictionary that contains the different input files required for the clustering analysis. The dictionary must have the following four keys: traj (input trajectory file in XTC or TRR format), config (the configuration file in TPR or GRO format), xvg (a GROMACS XVG file), and index (an index/NDX file), with the values being the paths to the files. Note that the value of the key index can be None,in which case the function will use a default index file generated by gmx make_ndx. If the parameter coupled_only is set to True, an XVG file that contains the time series of the state index (e.g., dhdl.xvg) must be provided with the key xvg. Otherwise, the key xvg can be set to None.

  • grps (dict) – A dictionary that contains the names of the groups in the index file (NDX) for centering the system, calculating the RMSD, and outputting. The corresponding keys are center, rmsd, and output.

  • coupled_only (bool, Optional) – Whether to only consider the fully coupled configurations. The default is True.

  • method (str, Optional) – The method for clustering available for the GROMACS command gmx cluster. The default is 'linkage'. Check the GROMACS documentation for other available options.

  • cutoff (float, Optional) – The RMSD cutoff for clustering in nm. The default is 0.1.

  • suffix (str, Optional) – The suffix for the output files. The default is None, which means no suffix will be added.

Example

Below is an example of performing a cluster analysis for all 4 replicas that compose of a REXEE simulation of a host-guest binding complex.

>>> import glob
>>> import natsort
>>> from ensemble_md.analysis.clustering import cluster_traj
>>> from ensemble_md.analysis.analyze_traj import stitch_trajs, convert_npy2xvg
>>> rep_trajs = np.load('rep_trajs.npy')  # Usually genrated by the REXEE simulation
>>> state_trajs = np.load('state_trajs.npy')  # Usually generated by analyze_traj.stitch_time_series
>>> files = [natsort.natsorted(glob.glob(f'sim_{i}/iteration_*/*xtc')) for i in range(4)]
>>> stitch_trajs('gmx', files, rep_trajs)
>>> convert_npy2xvg(state_trajs, 0.2, subsampling=10)
>>> for i in range(4):
>>>     print()
>>>     print(f'Performing clustering analysis for traj_{i}.xtc ...')
>>>     inputs = {
>>>         'traj': f'traj_{i}.xtc',
>>>         'config': 'complex.gro',
>>>         'xvg': f'traj_{i}.xvg',
>>>         'index': 'complex.ndx'
>>>     }
>>>     grps = {
>>>         'center': 'HOS_MOL',
>>>         'rmsd': 'complex_heavy',
>>>         'output': 'HOS_MOL'
>>>     }
>>>     cluster_traj('gmx', inputs, grps, coupled_only=False, cutoff=0.13, suffix=f'{i}')
ensemble_md.analysis.clustering.get_cluster_info(cluster_log)[source]

Extracts basic results from the clustering analysis by parsing the LOG file generated by the GROMACS gmx cluster command.

Parameters

cluster_log (str) – The LOG file generated by the GROMACS gmx cluster command.

Returns

  • rmsd_range (list) – The range of RMSD values

  • rmsd_avg (float) – The average RMSD value.

  • n_clusters (int) – The number of clusters.

ensemble_md.analysis.clustering.get_cluster_members(cluster_log)[source]

Gets the members of each cluster from the LOG file generated by the GROMACS gmx cluster command.

Parameters

cluster_log (str) – The LOG file generated by the GROMACS gmx cluster command.

Returns

  • clusters (dict) – A dictionary that contains the cluster indices (starting from 1) as the keys and the lists of members (represented by time frames) as the values.

  • sizes (dict) – A dictionary that contains the cluster indices (starting from 1) as the keys and the sizes of the cluster (in fraction) as the values.

ensemble_md.analysis.clustering.analyze_transitions(clusters, normalize=True, plot_type=None)[source]

Analyzes transitions between clusters, including estimating the transition matrix, generating/plotting a trajectory showing which cluster each configuration belongs to, and/or plotting the distribution of the clusters.

Parameters
  • clusters (dict) – A dictionary that contains the cluster indices (starting from 1) as the keys and the lists of members (represented by time frames) as the values.

  • normalize (bool, Optional) – Whether to normalize the output transition matrix. The default is True.

  • plot_type (str, Optional) – The type of the figure to be plotted. The default is None, which means no figure will be plotted. The other options are 'bar' and 'xy'. The former plots the distribution of the clusters, while the latter plots the trajectory showing which cluster each configuration belongs to.

Returns

  • transmtx (numpy.ndarray) – The transition matrix.

  • traj (numpy.ndarray) – The trajectory showing which cluster each configuration belongs to.

  • t_transitions (dict) – A dictionary with keys being pairs of cluster indices and values being the time frames of transitions between the two clusters. If there was no transition, an empty dictionary will be returned.

ensemble_md.analysis.msm_analysis

The msm_analysis module provides analysis methods based on Markov state models built from REXEE simulations.

ensemble_md.analysis.msm_analysis.plot_acf(models, n_tot, fig_name)[source]

Plots the state index autocorrelation times for all configurations in a single plot.

Parameters
  • models (list) – A list of MSM models (built by PyEMMA) that have the correlation method.

  • n_tot (int) – The total number of states across all replicas.

  • fig_name (str) – The file path to save the figure.

ensemble_md.analysis.msm_analysis.plot_its(trajs, lags, fig_name, dt=1, units='step')[source]

Plots the implied timescales (ITS) as a function of lag time for all configurations in a subplot.

Parameters
  • trajs (list) – A list of state-space trajectories.

  • lags (list) – A list of lag times to examine.

  • fig_name (str) – The file path to save the figure.

  • dt (float) – Physical time between frames. The default is 1.

  • units (str) – The units of dt. The default is 'step'.

Returns

ts_list – A list of instances of the ImpliedTimescales class in PyEMMA.

Return type

list

ensemble_md.analysis.msm_analysis.decide_lagtimes(ts_list)[source]

This function automatically estimates a lagtime for building an MSM for each configuration. Specifically, the lag time will be estimated using change-point detection enabled by ruptures for each (\(n-1\)) timescales (where \(n\) is the number of states). A good lag time should be long enough such that the timescale is roughly constant but short enough to be smaller than all timescales. If no lag time is smaller than all timescales, then a warning will be printed and a lag time of 1 will be returned.

Parameters

ts_list (list) – A list of instances of the ImpliedTimescales class in PyEMMA.

Returns

chosen_lags – A list of lag times automatically determined for each configuration.

Return type

list