ensemble_md.analysis
ensemble_md.analysis.analyze_traj
The analyze_traj
module provides methods for analyzing trajectories of a REXEE simulation.
- ensemble_md.analysis.analyze_traj.extract_state_traj(dhdl)[source]
Extracts the state-space trajectory from a GROMACS DHDL file. Note that the state indices here are local indices.
- Parameters
dhdl (str) – The file path of the GROMACS DHDL file to be parsed.
- Returns
traj (list) – A list that represents the state-space trajectory
t (list) – A list that represents the time frames of the trajectory
- ensemble_md.analysis.analyze_traj.stitch_time_series(files, rep_trajs, shifts=None, dhdl=True, col_idx=-1, save_npy=True)[source]
Stitches the state-space/CV-space trajectories for each starting configuration from DHDL files or PLUMED output files generated at different iterations.
- Parameters
files (list) – A list of lists of file paths to GROMACS DHDL files or general GROMACS XVG files or PLUMED ouptput files. Specifically,
files[i]
should be a list containing the files of interest from all iterations in replicai
. The files should be sorted naturally.rep_trajs (list) – A list of lists that represents the replica-space trajectories for each starting configuration. For example,
rep_trajs[0] = [0, 2, 3, 0, 1, ...]
means that starting configuration 0 transitioned to replica 2, then 3, 0, 1, in iterations 1, 2, 3, 4, …, respectively. This can be read from therep_trajs.npy
file generated by the REXEE simulation.shifts (list, Optional) – A list of values for shifting the local state indices to global indices for each replica. The length of the list should be equal to the number of replicas. This is only needed when
dhdl=True
.dhdl (bool, Optional) – Whether the input files are GROMACS dhdl files. If
dhdl=False
, the input files must be readable bynumpy.loadtxt()
assuming that the start of a comment is indicated by either the#
or@
characters. Such files include any GROMACS XVG files or PLUMED output files (output by plumed driver, for instance). In this case, trajectories of the configurational collective variable of interest are generated. The default isTrue
.col_idx (int, Optional) – The index of the column to be extracted from the input files. This is only needed when
dhdl=False
, By default, we extract the last column.save_npy (bool, Optional) – Whether to save the output trajectories as an NPY file. The default is
True
.
- Returns
trajs – A list that contains lists of state-space/CV-space trajectory (in global indices) for each starting configuration. For example,
trajs[i]
is the state-space/CV-space trajectory of starting configurationi
.- Return type
Example
>>> import glob >>> import natsort >>> import numpy as np >>> from ensemble_md.analysis import analyze_traj >>> n_sim = 4 # Assuming 4 replicas sampling states sets 0-3, 2-5, 4-7, and 6-9, respectively. >>> files = [natsort.natsorted(glob.glob(f'sim_{i}/iteration_*/*dhdl*xvg')) for i in range(n_sim)] >>> shifts = [0, 2, 4, 6] >>> rep_trajs = np.load('rep_trajs.npy') # rep_trajs.npy is generated by the REXEE simulation >>> state_trajs = analyze_traj.stitch_time_series(files, rep_trajs, shifts, dhdl=True, save_npy=True)
- ensemble_md.analysis.analyze_traj.stitch_time_series_for_sim(files, shifts=None, dhdl=True, col_idx=-1, save_npy=True)[source]
Stitches the state-space/CV-space time series in the same replica/simulation folder. That is, the output time series is contributed by multiple different trajectories (initiated by different starting configurations) to a certain alchemical range.
- Parameters
files (list) – A list of lists of file paths to GROMACS DHDL files or general GROMACS XVG files or PLUMED output files. Specifically,
files[i]
should be a list containing the files of interest from all iterations in replicai
. The files should be sorted naturally.shifts (list, Optional) – A list of values for shifting the local state indices to global indices for each replica. The length of the list should be equal to the number of replicas. This is only needed when
dhdl=True
.dhdl (bool, Optional) – Whether the input files are GROMACS dhdl files. If
dhdl=False
, the input files must be readable bynumpy.loadtxt()
assuming that the start of a comment is indicated by either the#
or@
characters. Such files include any GROMACS XVG files or PLUMED output files (output by plumed driver, for instance). In this case, trajectories of the configurational collective variable of interest are generated. The default isTrue
.col_idx (int, Optional) – The index of the column to be extracted from the input files. This is only needed when
dhdl=False
, By default, we extract the last column.save_npy (bool, Optional) – Whether to save the output trajectories as an NPY file. The default is
True
.
- Returns
trajs – A list that contains lists of state-space/CV-space trajectory (in global indices) for each replica. For example,
trajs[i]
is the state-space/CV-space trajectory of replicai
.- Return type
Example
>>> import glob >>> import natsort >>> from ensemble_md.analysis import analyze_traj >>> n_sim = 4 # Assuming 4 replicas sampling states sets 0-3, 2-5, 4-7, and 6-9, respectively. >>> files = [natsort.natsorted(glob.glob(f'sim_{i}/iteration_*/*dhdl*xvg')) for i in range(n_sim)] >>> shifts = [0, 2, 4, 6] >>> state_trajs = analyze_traj.stitch_time_series(files, shifts, dhdl=True, save_npy=True)
See also
- ensemble_md.analysis.analyze_traj.stitch_xtc_trajs(gmx_executable, files, rep_trajs)[source]
Demuxes GROMACS trajectories from different replicas into individual continuous trajectories.
- Parameters
gmx_executable (str) – The path of the GROMACS executable.
files (list) – A list of lists of file paths to GROMACS XTC files. Specifically,
files[i]
should be a list containing the paths to the files of interest from all iterations in replicai
. The files should be sorted naturally.rep_trajs (list) – A list of lists that represents the replica space trajectories for each starting configuration. For example,
rep_trajs[0] = [0, 2, 3, 0, 1, ...]
means that starting configuration 0 transitioned to replica 2, then 3, 0, 1, in iterations 1, 2, 3, 4, …, respectively. This can be read from therep_trajs.npy
file generated by the REXEE simulation.
- ensemble_md.analysis.analyze_traj.convert_npy2xvg(trajs, dt, subsampling=1)[source]
Convert a
state_trajs.npy
orcv_trajs.npy
file to \(R\) XVG files that have two columns: time (ps) and state index/CV value. (\(R\) is the number of replicas.)- Parameters
trajs (numpy.ndarray) – The state-space or CV-space trajectories read from
state_trajs.npy
orcv_trajs.npy
.dt (float) – The time interval (in ps) between consecutive frames of the trajectories.
subsampling (int, Optional) – The stride for subsampling the time series. The default is 1.
- ensemble_md.analysis.analyze_traj.traj2transmtx(traj, N, normalize=True)[source]
Computes the transition matrix given a trajectory. For example, if a state-space trajectory from a EXE or HREX simulation is given, a state-space transition matrix is returned. If a trajectory showing transitions between replicas in a REXEE simulation is given, a replica-space transition matrix is returned.
- Parameters
traj (list) – A list of state indices showing the trajectory in the state space. The index is assumed to start from 0.
N (int) – The size (N) of the expcted transition matrix (N by N).
normalize (bool) – Whether to normalize the matrix so that each row sum to 1. If
normalize=False
, then the entries will be the counts of transitions.
- Returns
transmtx – The transition matrix computed from the trajectory
- Return type
- ensemble_md.analysis.analyze_traj.plot_rep_trajs(trajs, fig_name, dt=None, stride=None)[source]
Plots the replica-space trajectories for a REXEE simulation.
- Parameters
trajs (list) – A list of lists that represent the all replica-space trajectories.
fig_name (str) – The file path to save the figure.
dt (float or None, Optional) – One trajectory timestep in ps. If
dt=None
, the function assumes there are no time frames but MC steps. The default isNone
.stride (int, Optional) – The stride for plotting the time series. The default is 100 if the length of any trajectory has more than one million frames. Otherwise, it will be 1. Typically plotting more than 10 million frames can take a lot of memory.
See also
- ensemble_md.analysis.analyze_traj.plot_state_trajs(trajs, state_ranges, fig_name, dt=None, stride=None, title_prefix='Trajectory')[source]
Plots the state-space trajectories for a REXEE simulation.
- Parameters
trajs (list) – A list of lists of state indices generated either from different continuous trajectories or from different alchemical ranges (i.e. from different simulation folders). This can be generated by either
stitch_time_series()
orstitch_time_series_for_sim()
.state_ranges (list) – A list of lists of showing the state indices sampled by each replica.
fig_name (str) – The file path to save the figure.
dt (float or None, Optional) – One trajectory timestep in ps. If
dt=None
, the function assumes there are no time frames but MC steps. The default isNone
.stride (int, Optional) – The stride for plotting the time series. The default is 10 if the length of any trajectory has more than 100,000 frames. Otherwise, it will be 1. Typically plotting more than 10 million frames can take a lot of memory.
title_prefix (str, Optional) – The prefix shared by the titles of the subplots. For example, if
title_prefix
is set to “Trajectory”, then the titles of the subplots will be “Trajectory 0”, “Trajectory 1”, …, etc. The default is'Trajectory'
.
See also
- ensemble_md.analysis.analyze_traj.plot_state_hist(trajs, state_ranges, fig_name, stack=True, figsize=None, prefix='Trajectory', subplots=False, save_hist=True)[source]
Plots the histograms of state visitation for all replicas in a REXEE simulation.
- Parameters
trajs (list) – A list of lists of state indices generated either from different continuous trajectories or from different alchemical ranges (i.e. from different simulation folders). This can be generated by either
stitch_time_series()
orstitch_time_series_for_sim()
.state_ranges (list) – A list of lists of showing the state indices sampled by each replica.
fig_name (str) – The file path to save the figure.
stack (bool, Optional) – Whether to stack the histograms. This parameter is only relevant when
subplots
isFalse
. The default isTrue
.figsize (tuple, Optional) – A tuple specifying the length and width of the output figure. The default is
(6.4, 4.8)
for cases having less than 30 states and(10, 4.8)
otherwise.prefix (str, Optional) – The prefix shared by the titles of the subplots, or the labels shown in the same plot. For example, if
prefix
is set to “Trajectory”, then the titles/labels of the will be “Trajectory 0”, “Trajectory 1”, …, etc. The default is'Trajectory'
.subplots (bool, Optional) – Whether to plot the histograms in multiple subplots, with the title of each based on the value of
prefix
. The default isFalse
.save_hist (bool, Optional) – Whether to save the histogram data. The default is
True
.
- Returns
hist_data – The histogram data of the each state index time series.
- Return type
See also
- ensemble_md.analysis.analyze_traj.calc_hist_rmse(hist_data, state_ranges)[source]
Calculates the RMSE of accumulated histogram counts of the state index. The reference is determined by assuming all alchemical states have equal chances to be visited, i.e. the alchemical weights are perfect.
- Parameters
hist_data (list) – The histogram data of the state index for each trajectory.
state_ranges (list) – A list of lists of showing the state indices sampled by each replica.
- Returns
rmse – The RMSE value of accumulated histogram counts of the state index, with respect to the case where equal sampling is reached for all states.
- Return type
- ensemble_md.analysis.analyze_traj.plot_transit_time(trajs, N, fig_prefix=None, dt=None, folder='.')[source]
Calculates and plots the average transit times for each trajectory, including the time it takes from states 0 to k, from k to 0 and from 0 to k back to 0 (i.e., round-trip time). If there are more than 100 round-trips, 3 histograms corresponding to t_0k, t_k0 and t_roundtrip will be generated.
- Parameters
trajs (list) – A list of lists that represent the state-space trajectories of all continuous trajectories.
N (int) – The total number of states in the whole alchemical range.
fig_prefix (str, Optional) – A prefix to use for all generated figures. The default is
None
, which means no prefix.dt (float or None, Optional) – One trajectory timestep in ps. If
dt=None
, the function assumes there are no time frames but MC steps. The default isNone
.folder (str, Optional) – The directory where the figures will be saved. The default is the current directory.
- Returns
t_0k_list (list) – A list of transit times from states 0 to k for each trajectory.
t_k0_list (list) – A list of transit times from states k to 0 for each trajectory.
t_roundtrip_list (list) – A list of round-trip times for each trajectory.
units (str) – The units of the time.
- ensemble_md.analysis.analyze_traj.plot_g_vecs(g_vecs, refs=None, refs_err=None, plot_rmse=True)[source]
For each alchemical intermediate state, plots the alchemical weight as a function of the iteration index. Note that the alchemical weight of the first state (which is always 0) is skipped. If the reference values are given, they will be plotted in the figure (as horizontoal lines) and a final RMSE will be calculated. Note that this function is only meaningful for weight-updating REXEE simulations.
- Parameters
g_vecs (numpy.ndarray) – The alchemical weights of all states as a function of the iteration index. The shape should be (n_iterations, n_states). Such an array can be directly read from
g_vecs.npy
generated by a REXEE simulation.refs (numpy.ndarray) – The reference values of the alchemical weights. The default is
None
.refs_err (list or numpy.ndarray, Optional) – The errors of the reference values. The default is
None
.plot_rmse (bool, Optional) – Whether to plot RMSE as a function of the iteration index. The default is
True
.
- ensemble_md.analysis.analyze_traj.get_swaps(REXEE_log='run_REXEE_log.txt')[source]
For each replica, identifies the states involved in proposed and accepted swaps.
- Parameters
REXEE_log (str, Optional) – The output log file of the REXEE simulation. The default is
'run_REXEE_log.txt'
.- Returns
proposed_swaps (list) – A list of dictionaries showing where the swaps were proposed in each replica. Each dictionary (corresponding to one replica) have keys being the global state indices and values being the number of proposed swaps that involved the state indicated by the key.
accepted_swaps (list) – A list of dictionaries showing where the swaps were accepted in each replica. Each dictionary (corresponding to one replica) have keys being the global state indices and values being the number of accepted swaps that involved the state indicated by the key.
Example
Below is an example based on a REXEE simulations having four replicas sampling states 0-4, 1-5, 2-6 and 3-7, respectively.
>>> from ensemble_md.analysis import analyze_traj >>> proposed_swaps, accepted_swaps = analyze_traj.get_swaps('run_REXEE_log.txt') >>> for i in range(len(proposed_swaps)): >>> print(proposed_swaps[i]) {0: 0, 1: 3, 2: 1, 3: 0, 4: 0} {1: 2, 2: 2, 3: 0, 4: 1, 5: 1} {2: 3, 3: 3, 4: 2, 5: 0, 6: 0} {3: 0, 4: 1, 5: 0, 6: 3, 7: 0}
Todo
We should be able to only use
rep_trajs.npy
andstate_trajs.npy
instead of parsing the REXEE log file to reach the same goal.See also
- ensemble_md.analysis.analyze_traj.plot_swaps(swaps, swap_type='', stack=True, figsize=None)[source]
Plots the histogram of the proposed swaps or accepted swaps for each replica.
- Parameters
swaps (list) – A list of dictionaries showing showing the number of swaps for each state for each replica. This list could be either of the outputs from
get_swaps
.swap_type (str, Optional) – The type of swaps to be plotted. Common options include
'accepted'
and'proposed'
. This value will only influence the name of y-axis and the output file name. The default is an empty string.stack (bool, Optional) – Whether to stack the histograms. The default is
True
.figsize (tuple, Optional) – A tuple specifying the length and width of the output figure. The default is
(6.4, 4.8)
for cases having less than 30 states and(10, 4.8)
otherwise.
See also
- ensemble_md.analysis.analyze_traj.get_g_evolution(log_files, start_state, end_state, avg_frac=0, avg_from_last_update=False)[source]
For a weight-updating simulation, gets the time series of the alchemical weights of all states. Note that this funciton is only suitable for analyzing either a single expanded ensemble simulation or a replica in a REXEE simulation. For the latter case, all the log files for the replica should be provided.
- Parameters
log_files (list) – The list of file paths to the log file(s). If multiple log files are provided (for a REXEE simulation), please make sure the files are in the correct order such that the time series of the alchemical weights are continuous.
start_state (int) – The index of the first state of interest. The index starts from 0.
end_state (int) – The index of the last state of interest. The index start from 0. For example, if
start_state
is set to 1 andend_state
is set to 3, then the weight evolution for states 1, 2 and 3 will be extracted.avg_frac (float, Optional) – The fraction of the last part of the simulation to be averaged. The default is 0, which means no averaging. Note that this parameter is ignored if
avg_from_last_update
isTrue
.avg_from_last_update (bool, Optional) – Whether to average from the last update of the Wang-Landau incrementor. If this option is set to
False
, or the option is set toTrue
but the Wang-Landau incrementor was not updated in the provided log file(s), the all weights will be used for averging.
- Returns
g_vecs_all (list) – The alchemical weights of all states as a function of time. It should be a list of lists. For example,
g_vecs_all[i]
should be the alchemical weights of all states at time frame with indexi
. Weights after equilibration are not included.g_vecs_avg (list) – The alchemical weights of all states averaged over the last part of the simulation. If
avg_frac
is 0,None
will be returned. Note that weights after equilibration are not considered.g_vecs_err (list) – The errors of the alchemical weights of all states averaged over the last part of the simulation. If
avg_frac
is 0 andavg_from_last_update
isFalse
,None
will be returned. Note that weights after equilibration are not considered.
Example
>>> from ensemble_md import analyze_traj >>> log_files = ['EXE.log'] # For analyzing a single expanded ensemble simulation >>> results = analyze_traj.get_g_evolution(log_files, start_state=0, end_state=6)
- ensemble_md.analysis.analyze_traj.get_dg_evolution(log_files, start_state, end_state)[source]
For a weight-updating simulation, gets the time series of the weight difference (\(Δg = g_2-g_1\)) between the states of interest.
- Parameters
log_files (list) – The list of file paths to the log file(s). If multiple log files are provided (for a REXEE simulation), please make sure the files are in the correct order such that the time series of the alchemical weights are continuous.
start_state (int) – The index of the state (starting from 0) whose weight is \(g_1\).
end_state (int) – The index of the state (starting from 0) whose weight is \(g_2\).
- Returns
dg – The time series of \(Δg\).
- Return type
- ensemble_md.analysis.analyze_traj.plot_dg_evolution(log_files, start_state, end_state, start_idx=None, end_idx=None, dt_log=2)[source]
For a weight-updating simulation, plots the time series of the weight difference (\(Δg = g_2-g_1\)) between the states of interest.
- Parameters
log_files (list) – The list of file paths to the log file(s). If multiple log files are provided (for a REXEE simulation), please make sure the files are in the correct order such that the time series of the alchemical weights are continuous.
start_state (int) – The index of the state (starting from 0) whose weight is \(g_1\).
end_state (int) – The index of the state (starting from 0) whose weight is \(g_2\).
start_idx (int, Optional) – The index of the first frame to be plotted. The default is
None
, which means the first frame.end_idx (int, Optional) – The index of the last frame to be plotted. The default is
None
, which means the last frame.dt_log (float, Optional) – The time interval (in ps) between two consecutive frames in the log file. The default is 2 ps.
- ensemble_md.analysis.analyze_traj.get_delta_w_updates(log_file, plot=False)[source]
Parses the log file of a weight-updating simulation and identifies the time frames when the Wang-Landau incrementor was updated.
- Parameters
log_file (str) – The file path of the LOG file.
plot (bool, Optional) – Whether to plot the Wang-Landau incrementor as a function of time. The default is
False
.
- Returns
t_updates (list) – A list of time frames (in ns) when the Wang-Landau incrementor was updated.
delta_w_updates (list) – A list of the updated Wang-Landau incrementors. Should be the same length as
t_updates
.equil (bool) – Whether the weights got equilibrated during the simulation.
ensemble_md.analysis.analyze_matrix
The analyze_matrix
module provides methods for analyzing matrices obtained from a REXEE simulation.
- ensemble_md.analysis.analyze_matrix.calc_transmtx(log_file, simulation_type='EE')[source]
Parses the log file to get the transition matrix of an expanded ensemble or replica exchange simulation. Notably, a theoretical transition matrix is only available in expanded ensemble.
- Parameters
log_file (str) – The file path of the log file to be parsed.
simulation_type (str, Optional) – The type of simulation. It can be either a
EE
(expanded ensemble) orHREX
(Hamiltonian replica exchange) simulation. The default isEE
.
- Returns
empirical (numpy.ndarray) – The final empirical state transition matrix.
theoretical (None or numpy.ndarray) – The final theoretical state transition matrix.
diff_matrix (None or numpy.ndarray) – The difference calculated by subtracting the theoretical matrix from the empirical matrix.
- ensemble_md.analysis.analyze_matrix.calc_equil_prob(trans_mtx)[source]
Calculates the equilibrium probability of each state from a transition matrix. The input transition matrix can be either left or right stochastic, although the left stochastic ones are not common in GROMACS. Generally, transition matrices in GROMACS are either doubly stochastic (replica exchange), or right stochastic (expanded ensemble). For the latter case, the stationary distribution vector is the left eigenvector corresponding to the eigenvalue 1 of the transition matrix. (For the former case, it’s either left or right eigenvector corresponding to the eigenvalue 1 - as the left and right eigenvectors are the same for a doubly stochasti matrix.) Note that the input transition matrix can be either state-space or replica-space.
- Parameters
trans_mtx (numpy.ndarray) – The input transition matrix.
- Returns
equil_prob
- Return type
- ensemble_md.analysis.analyze_matrix.calc_spectral_gap(trans_mtx, atol=1e-08, n_bootstrap=50, seed=None)[source]
Calculates the spectral gap of an input transition matrix and estimates its uncertainty using the bootstrap method.
- Parameters
trans_mtx (numpy.ndarray) – The input transition matrix
atol (float, Optional) – The absolute tolerance for checking the sum of columns and rows. The default is 1e-8.
n_bootstrap (int, Optional) – The number of bootstrap iterations for uncertainty estimation. The default is 50.
seed (int, Optional) – The seed for the random number generator used in the bootstrap method. The default is
None
, which means no seed will be used.
- Returns
spectral_gap (float) – The spectral gap of the input transition matrix.
spectral_gap_err (float) – The estimated uncertainty of the spectral gap.
eig_vals (list) – The list of eigenvalues. The maximum eigenvalue should always be 1.
- ensemble_md.analysis.analyze_matrix.calc_t_relax(spectral_gap, exchange_period, spectral_gap_err=None)[source]
Calculates the relaxation time given the spectral gap of a transition matrix of interest. By defintion, the relaxation time is equal to the exchange period divided by the spectral gap.
- Parameters
spectral_gap (float) – The input spectral gap.
exchange_period (float) – The exchange period of the simulation in ps.
spectral_gap_err (float, Optional) – The uncertainty of the spectral gap, which is used to calculate the uncertainty of the relaxation time by error propagation. The default is
None
, in which case the uncertainty of the relaxation time will beNone
.
- Returns
t_relax (float) – The relaxation time in ps.
t_relax_err (float) – The uncertainty of the relaxation time in ps.
- ensemble_md.analysis.analyze_matrix.split_transmtx(trans_mtx, n_sim, n_sub)[source]
Splits the input state-space transition matrix into blocks of smaller matrices corresponding to different state states sampled by different replicas. Notably, the function assumes homogeneous shifts and number of states across replicas. Also, the blocks of the transition matrix is generally not doubly stochastic but right stochastic even if the input is doubly stochastic.
- Parameters
trans_mtx (numpy.ndarray) – The input state-space transition matrix to be split.
n_sim (int) – The number of replicas in the REXEE simulation.
n_sub (int) – The number of states in each replica.
- Returns
sub_mtx – Blocks of transition matrices split from the input transition matrix.
- Return type
- ensemble_md.analysis.analyze_matrix.plot_matrix(matrix, fig_name, title=None, start_idx=0)[source]
Visualizes a matrix in a heatmap.
- Parameters
matrix (numpy.ndarray) – The matrix to be visualized.
fig_name (str) – The file path to save the figure.
title (str, Optional) – The title of the plot. The default is
None
, which means no title will be added.start_idx (int, Optional) – The starting value of the state index. The default is 0.
ensemble_md.analysis.analyze_free_energy
The analyze_free_energy
module provides functions for performing free energy
calculations for REXEE simulations.
- ensemble_md.analysis.analyze_free_energy.preprocess_data(files_list, temp, data_type, spacing=1, t=None, g=None)[source]
This function preprocesses \(u_{nk}\) or \(dH/dλ\) data for all replicas in an REXEE simulation. For each replica, it reads in \(u_{nk}\) or \(dH/dλ\) data from all iterations, concatenate them, remove the equilibrium region and and decorrelate the concatenated data.
- Parameters
files_list (list) – A list of lists of naturally sorted DHDL file names from all iterations for different replicas. Specifically,
files[i]
should be the list of DHDL file names from all iterations of replicai
.temp (float) – The simulation temperature in Kelvin. We assume all replicas were performed at the same temperature.
data_type (str) – The type of energy data to be procssed. Should be either
'u_nk'
or'dhdl'
.spacing (int, Optional) – The spacing (in the number of data points) to consider when subsampling the data, which is assumed to be the same for all replicas. The default is 1.
t (int, Optional) – The user-specified index that indicates the start of equilibrated data. If this parameter is not specified, the function will estimate it using
pymbar.timeseries.detect_equilibration()
. The default isNone
.g (int, Optional) – The user-specified statistical inefficiency. If this parameter is not specified, the function will estimate it using
pymbar.timeseries.detect_equilibration()
. The default isNone
.
- Returns
preprocessed_data_all (pandas.Dataframe) – A list of preprocessed \(u_{nk}\) or \(dH/dλ\) data for all replicas that can serve as the input to free energy estimators.
t_list (list) – A list of indices indicating the start of equilibrated data for different replicas. This list will be empty if the parameter
t
is specified.g_list (list) – A list of statistical inefficiencies of the equilibrated data for different replicas. This list will be empty if the parameter
g
is specified.
- ensemble_md.analysis.analyze_free_energy.calculate_free_energy(data, state_ranges, df_method='MBAR', err_method='propagate', n_bootstrap=None, seed=None)[source]
Caculates the averaged free energy profile with the chosen method given \(u_{nk}\) or \(dH/dλ\) data obtained from all replicas of the REXEE simulation. Available methods include TI, BAR, and MBAR. TI requires \(dH/dλ\) data while the other two require \(u_{nk}\) data.
- Parameters
data (pandas.Dataframe) – A list of \(u_{nk}\) or \(dH/dλ\) dataframes obtained from all replicas of the REXEE simulation. Preferrably, the \(u_{nk}\) or \(dH/dλ\) data should be preprocessed by the function
proprocess_data()
.state_ranges (list) – A list of lists of showing the state indices sampled by each replica.
df_method (str, Optional) – The method used to calculate the free energy profile. Available choices include
"TI"
,"BAR"
, and"MBAR"
. The default is"MBAR"
.err_method (str, Optional) – The method used to estimate the uncertainty of the free energy combined across multiple replicas. Available options include
"propagate"
and"bootstrap"
. The bootstrapping method is more accurate but much more computationally expensive than simple error propagation.n_bootstrap (int, Optional) – The number of bootstrap iterations. This parameter is used only when the boostrapping method is chosen to estimate the uncertainties of the free energies. The default is
None
. In the CLIanalyze_REXEE
, this number is set by the YAML parametern_bootstrap
.seed (int, Optional) – The random seed for bootstrapping. Only relevant when
err_method
is"bootstrap"
. The default isNone
.
- Returns
f (list) – The full-range free energy profile.
f_err (list) – The uncertainties corresponding to the values in
f
.estimators (list) – A list of estimators fitting the input data for all replicas. With this, the user can access all the free energies and their associated uncertainties for all states and replicas.
Example
In the CLI
analyze_REXEE
, lines like below are used:>>> import glob >>> import natsort >>> from ensemble_md.analysis import analyze_free_energy >>> state_ranges = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]] >>> file_list = [natsort.natsorted(glob.glob(f'sim_{i}/iteration_*/dhdl*xvg')) for i in range(4)] >>> data_list, _, _ = analyze_free_energy.preprocess_data(file_list, temp=300, data_type='u_nk') >>> f, _, _ = analyze_free_energy.calculate_free_energy(data_list, state_ranges, "MBAR", "propagate")
- ensemble_md.analysis.analyze_free_energy.calculate_df_rmse(estimators, df_ref, state_ranges)[source]
Calculates the RMSE values of the free energy profiles of different state ranges given the reference free energy profile for the whole range of states.
- Parameters
estimators (list) – A list of estimators fitting the input data for all replicas. With this, the user can access all the free energies and their associated uncertainties for all states and replicas. The estimators should be generated by the function
calculate_free_energy()
.df_ref (list) – A list of values corresponding to the free energies of the whole range of states. The length of the list should be equal to the number of states in total.
state_ranges (list) – A list of lists of showing the state indices sampled by each replica.
- Returns
rmse_list – A list of RMSE values of the free energy profiles of different state ranges.
- Return type
See also
- ensemble_md.analysis.analyze_free_energy.plot_free_energy(f, f_err, fig_name)[source]
Plots the free energy profile with error bars.
- Parameters
f (list) – The full-range free energy profile.
f_err (list) – The uncertainties corresponding to the values in
f
.fig_name (str) – The file path to save the figure.
- ensemble_md.analysis.analyze_free_energy.average_weights(g_vecs, frac)[source]
Given the time series of the whole range of alchemical weights, averages the weight differences between the the coupled and decoupled states. This can be an estimate of the free energy difference between two end states. This function is only relevant for weight-updating REXEE simulations.
- Parameters
g_vecs (numpy.ndarray) – An array of alchemical weights of the whole range of states as a function of simulation time, which is typically generated by
combine_weights
.frac (float) – The fraction of
g_vecs
to average over.frac=0.2
means average the last 20% of the weight vectors will be averaged.
- Returns
dg_avg (float) – The averaged weight difference between the coupled and decoupled states.
dg_avg_err (float) – The errors corresponding to the value of
dg_avg
.
ensemble_md.analysis.synthesize_data
The synthesize_data
module provides methods for synthesizing REXEE data,
specifically trajectories and transition matrices. This is mainly useful
for carrying out the bootstrap analysis for some analysis, such as the spectral gap calculation in
analyze_matrix
.
- ensemble_md.analysis.synthesize_data.synthesize_traj(trans_mtx, n_frames=100000, method='transmtx', start=0, seed=None)[source]
Synthesizes a trajectory based on the input transition matrix.
- Parameters
trans_mtx (numpy.ndarray) – The input transition matrix.
n_frames (int, Optional) – The number of frames to be generated. The default value is 100000.
method (str, Optional) – The method to be used for trajectory synthesis. It can be either
'transmtx'
or'equil_prob'
. The former refers to generating the trajectory by simulating the moves between states based on the input transition matrix, with the trajectory starting from the state specified by thestart
parameter. If the method isequil_prob
, the trajectory will be generated by simply sampling from the equilibrium probability distribution calculated from the input transition matrix. The method'transmtx'
should generate a trajectory characterized by a transition matrix similar to the input one, while the method'equil_prob'
may generate a trajectory that has a significantly different transition matrix. Still, a trajectory generated by either method should have a similar underlying equilibrium probability distribution (hence the spectral gap as well) as the input transition matrix. The default value is'transmtx'
.start (int, Optional) – The starting state of the synthesized trajectory if the method is
transmtx
. The default value is 0, i.e., the first state. This parameter is ignored if the method isequil_prob
.seed (int, Optional) – The seed for the random number generator. The default value is
None
, i.e., the seed will not be set.
- Returns
syn_traj – The synthesized trajectory.
- Return type
- ensemble_md.analysis.synthesize_data.synthesize_transmtx(trans_mtx, n_frames=100000, seed=None)[source]
Synthesizes a normalized transition matrix similar to the input transition matrix by first generating a trajectory using
synthesize_traj
withmethod='transmtx'
and then calculating the transition matrix from the synthesized trajectory. This function can be useful in performing boostrapping analysis when estimating the uncertainty of the spectral gap associated with the input transition matrix.- Parameters
trans_mtx (numpy.ndarray) – The input transition matrix.
n_frames (int, Optional) – The number of frames of the synthesized trajectory from which the mock transition matrix is calculated. The default value is 100000.
seed (int, Optional) – The seed for the random number generator. The default value is
None
, i.e., the seed will not be set.
- Returns
syn_mtx (numpy.ndarray) – The synthesized transition matrix.
syn_traj (numpy.ndarray) – The synthesized trajectory from which the transition matrix is calculated.
diff_mtx (numpy.ndarray) – The input transition matrix subtracted by the synthesized transition matrix.
ensemble_md.analysis.clustering
- ensemble_md.analysis.clustering.cluster_traj(gmx_executable, inputs, grps, coupled_only=True, method='linkage', cutoff=0.1, suffix=None)[source]
Performs clustering analysis on a trajectory using the GROMACS command
gmx cluster
. Note that this function encompasses the use of all the other functions in the module, includingget_cluster_info()
,get_cluster_members()
, andanalyze_transitions()
.- Parameters
gmx_executable (str) – The path of the GROMACS executable.
inputs (dict) – A dictionary that contains the different input files required for the clustering analysis. The dictionary must have the following four keys:
traj
(input trajectory file in XTC or TRR format),config
(the configuration file in TPR or GRO format),xvg
(a GROMACS XVG file), andindex
(an index/NDX file), with the values being the paths to the files. Note that the value of the keyindex
can beNone
,in which case the function will use a default index file generated bygmx make_ndx
. If the parametercoupled_only
is set toTrue
, an XVG file that contains the time series of the state index (e.g.,dhdl.xvg
) must be provided with the keyxvg
. Otherwise, the keyxvg
can be set toNone
.grps (dict) – A dictionary that contains the names of the groups in the index file (NDX) for centering the system, calculating the RMSD, and outputting. The corresponding keys are
center
,rmsd
, andoutput
.coupled_only (bool, Optional) – Whether to only consider the fully coupled configurations. The default is
True
.method (str, Optional) – The method for clustering available for the GROMACS command
gmx cluster
. The default is'linkage'
. Check the GROMACS documentation for other available options.cutoff (float, Optional) – The RMSD cutoff for clustering in nm. The default is 0.1.
suffix (str, Optional) – The suffix for the output files. The default is
None
, which means no suffix will be added.
Example
Below is an example of performing a cluster analysis for all 4 replicas that compose of a REXEE simulation of a host-guest binding complex.
>>> import glob >>> import natsort >>> from ensemble_md.analysis.clustering import cluster_traj >>> from ensemble_md.analysis.analyze_traj import stitch_trajs, convert_npy2xvg >>> rep_trajs = np.load('rep_trajs.npy') # Usually genrated by the REXEE simulation >>> state_trajs = np.load('state_trajs.npy') # Usually generated by analyze_traj.stitch_time_series >>> files = [natsort.natsorted(glob.glob(f'sim_{i}/iteration_*/*xtc')) for i in range(4)] >>> stitch_trajs('gmx', files, rep_trajs) >>> convert_npy2xvg(state_trajs, 0.2, subsampling=10) >>> for i in range(4): >>> print() >>> print(f'Performing clustering analysis for traj_{i}.xtc ...') >>> inputs = { >>> 'traj': f'traj_{i}.xtc', >>> 'config': 'complex.gro', >>> 'xvg': f'traj_{i}.xvg', >>> 'index': 'complex.ndx' >>> } >>> grps = { >>> 'center': 'HOS_MOL', >>> 'rmsd': 'complex_heavy', >>> 'output': 'HOS_MOL' >>> } >>> cluster_traj('gmx', inputs, grps, coupled_only=False, cutoff=0.13, suffix=f'{i}')
- ensemble_md.analysis.clustering.get_cluster_info(cluster_log)[source]
Extracts basic results from the clustering analysis by parsing the LOG file generated by the GROMACS
gmx cluster
command.- Parameters
cluster_log (str) – The LOG file generated by the GROMACS
gmx cluster
command.- Returns
rmsd_range (list) – The range of RMSD values
rmsd_avg (float) – The average RMSD value.
n_clusters (int) – The number of clusters.
- ensemble_md.analysis.clustering.get_cluster_members(cluster_log)[source]
Gets the members of each cluster from the LOG file generated by the GROMACS
gmx cluster
command.- Parameters
cluster_log (str) – The LOG file generated by the GROMACS
gmx cluster
command.- Returns
clusters (dict) – A dictionary that contains the cluster indices (starting from 1) as the keys and the lists of members (represented by time frames) as the values.
sizes (dict) – A dictionary that contains the cluster indices (starting from 1) as the keys and the sizes of the cluster (in fraction) as the values.
- ensemble_md.analysis.clustering.analyze_transitions(clusters, normalize=True, plot_type=None)[source]
Analyzes transitions between clusters, including estimating the transition matrix, generating/plotting a trajectory showing which cluster each configuration belongs to, and/or plotting the distribution of the clusters.
- Parameters
clusters (dict) – A dictionary that contains the cluster indices (starting from 1) as the keys and the lists of members (represented by time frames) as the values.
normalize (bool, Optional) – Whether to normalize the output transition matrix. The default is
True
.plot_type (str, Optional) – The type of the figure to be plotted. The default is
None
, which means no figure will be plotted. The other options are'bar'
and'xy'
. The former plots the distribution of the clusters, while the latter plots the trajectory showing which cluster each configuration belongs to.
- Returns
transmtx (numpy.ndarray) – The transition matrix.
traj (numpy.ndarray) – The trajectory showing which cluster each configuration belongs to.
t_transitions (dict) – A dictionary with keys being pairs of cluster indices and values being the time frames of transitions between the two clusters. If there was no transition, an empty dictionary will be returned.
ensemble_md.analysis.msm_analysis
The msm_analysis
module provides analysis methods based on Markov state models
built from REXEE simulations.
- ensemble_md.analysis.msm_analysis.plot_acf(models, n_tot, fig_name)[source]
Plots the state index autocorrelation times for all configurations in a single plot.
- Parameters
models (list) – A list of MSM models (built by PyEMMA) that have the
correlation
method.n_tot (int) – The total number of states across all replicas.
fig_name (str) – The file path to save the figure.
- ensemble_md.analysis.msm_analysis.plot_its(trajs, lags, fig_name, dt=1, units='step')[source]
Plots the implied timescales (ITS) as a function of lag time for all configurations in a subplot.
- Parameters
trajs (list) – A list of state-space trajectories.
lags (list) – A list of lag times to examine.
fig_name (str) – The file path to save the figure.
dt (float) – Physical time between frames. The default is 1.
units (str) – The units of
dt
. The default is'step'
.
- Returns
ts_list – A list of instances of the
ImpliedTimescales
class in PyEMMA.- Return type
- ensemble_md.analysis.msm_analysis.decide_lagtimes(ts_list)[source]
This function automatically estimates a lagtime for building an MSM for each configuration. Specifically, the lag time will be estimated using change-point detection enabled by
ruptures
for each (\(n-1\)) timescales (where \(n\) is the number of states). A good lag time should be long enough such that the timescale is roughly constant but short enough to be smaller than all timescales. If no lag time is smaller than all timescales, then a warning will be printed and a lag time of 1 will be returned.- Parameters
ts_list (list) – A list of instances of the
ImpliedTimescales
class in PyEMMA.- Returns
chosen_lags – A list of lag times automatically determined for each configuration.
- Return type