Sampling-Uncertainty/specific-observables.tex at master · livecomsjournal/Sampling-Uncertainty · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
\section{Computing uncertainty in specific observables}
\label{sec:specific}

\subsection{Basics}

Here we address the simple but critical question, ``What error bar should I report?''
In general, there is no one-best practice for choosing error bars \cite{Nicholls2014}. However, in the context of simulations, we can nonetheless identify common goals when reporting such estimates: 1) to help authors and readers better understand uncertainty in data; and 2) to provide readers with realistic information about the reproducibility of a given result.

With this in mind, we recommend the following: (a) in fields where there is a definitive standard for reporting uncertainty, the authors should follow existing conventions; (b) otherwise, such as for biomolecular simulations, \emph{authors should report (and graph) their best estimates of 95~\% \hyperref[def:conf_int]{confidence intervals}.}
(c) when feasible and especially for a small number of independent measurements ($n < 10$), authors should consider plotting all of the points instead of an average with error bars.

We emphasize that as opposed to standard uncertainties [$\stdevmean{\mean{x}}$], confidence intervals have several practical benefits that justify their usage. In particular, they directly quantify the range in which the average value of an observed quantity is expected to fall, which is more relatable to everyday experience than, say, the moments of a probability distribution. As such, confidence intervals can help authors and readers better understand the implications of an uncertainty analysis. Moreover, downstream consumers of a given paper may include less statistically oriented readers for whom confidence intervals are a more meaningful measure of variation.

In a related vein, error bars expressed in integer multiples of $\stdevmean{\mean{x}}$ can be misinterpreted as unrealistically under or overestimating uncertainty if taken at face value. For example, reporting $3\stdevmean{\mean{x}}$ uncertainties for a normal random variable amounts to a 99.7~\% level of confidence, which is likely to be a significant overestimate for many applications. On the other hand, $1\stdevmean{\mean{x}}$ uncertainties only correspond to a 68~\% level of confidence, which may be too low. Given that many readers may not take the time to make such conversions in their heads, we feel that it is safest for modelers to explicitly state the confidence level of their error bar or reported confidence interval.

In recommending 95~\% confidence intervals, we are admittedly attempting to address a social issue that nevertheless has important implications for science as a whole. In particular, the authors of a study and the reputation of their field do not benefit in the long run by under-representing uncertainty, since this may lead to incorrect conclusions. Just as importantly, many of the same problems can arise if uncertainties are reported in a technically correct but obscure and difficult-to-interpret manner. For example, $1\stdevmean{\mean{x}}$ error bars may not overlap and thereby mask the inability to statistically distinguish two quantities, since the corresponding confidence intervals are only 68~\%. With this in mind, we therefore wish to emphasize that visual impressions conveyed by figures in a paper are of primary importance. Regardless of what a research paper may explain carefully in text, error bars on graphs create a lasting impression and must be as informative and accurate as possible. If 95~\% confidence intervals are reported, the expert reader can easily estimate the smaller standard uncertainty (especially if it is noted in the text), but showing a graph with overly small error bars is bound to mislead most readers, even experts who do not search out the fine print.

As a final note, we remind readers that only significant figures should be reported.
Additional digits beyond the precision implicit in the uncertainty are unhelpful at best, and potentially misleading to readers who may not be aware of the limitations of simulations or statistical analyses generally.
For example, if the mean of a quantity is calculated to be $1.23456$ with uncertainty $\pm 0.1$ based on a 95~\% confidence interval, then only two significant figures should be reported for the mean ($1.2$).


\subsection{Overview of procedures for computing a confidence interval}
We remind readers that they should perform the semiquantitative sampling checks (Sec.\ \ref{sec:quick}) before attempting to quantify uncertainty.  If the observable of interest is not fluctuating about a mean value but largely increasing or decreasing during the course of a simulation, a reliable quantitative estimate for the observable or its associated uncertainty cannot be obtained.

For observables passing the qualitative tests noted above in Sec.\ \ref{sec:quick}, we advocate obtaining confidence intervals in one of two ways:
\begin{itemize}
\item For observables that are Gaussian-distributed (or assumed to be, as an approximation or due to lack of information), an appropriately chosen \emph{coverage factor} $k$ (typically in the range of 2 to 3; see Sec.~\ref{sec:conf_int} for further details) is multiplied by the standard uncertainty $\stdevmean{\mean{x}}$ to yield the expanded uncertainty, which estimates the 95~\% confidence interval.
\item For non-Gaussian observables, a \emph{bootstrapping} approach (Sec.~\ref{sec:bootstrap}) should be used.
  An example of a potentially non-Gaussian observable is a rate-constant, which must be positive but could exhibit significant variance.  As such, a confidence interval estimated with a coverage factor may lead to an unphysical negative lower limit.
  In contrast, bootstrapping does not assume an underlying distribution but instead constructs a confidence interval based on the recorded data values, and the limits cannot fall outside the extreme data values;
  nevertheless, bootstrapped confidence intervals can have shortcomings \cite{Schenker1985,Chernick2009}.
  Bootstrapping is also sometimes useful for estimating uncertainties associated with \hyperref[def:deriv_obs]{derived observables}.
\end{itemize}

Below we describe approaches for estimating the standard uncertainty $\stdevmean{\mean{x}}$ from a single trajectory with a coverage factor $k$ as well as the bootstrapping approach for direct confidence-interval estimation. Whether using a coverage factor and standard uncertainty or bootstrapping, one requires an estimate for the independent number of observations in a given simulation.  This requires care, but may be accomplished based on the effective sample size described in Sec.~\ref{sec:global}, via block averaging, or by analysis of a time-correlation function. However, these methods have their limitations and must be used with caution.  In particular, both block averaging and autocorrelation analyses will produce effective sample sizes that depend on the quantity of interest.  To produce reliable answers, one must therefore identify and track the slowest relevant degree of freedom in the system, which can be a non-trivial task.  Even apparently fast-varying properties may have significant statistical error if they are coupled to slower varying ones, and this error in uncertainty estimation may not be readily identifiable by solely examining the fast-varying time series.

In the absence of a reliable estimate for the number of independent observations, one can perform $n$ independent simulations and calculate the standard deviation $\stdev{x}$ for quantity $x$ (which could be the ensemble average of a raw data output or a derived observable) among the $n$ simulations, yielding a standard uncertainty of $\stdevmean{\mean{x}} = \stdev{x} / \sqrt{n}$.  When computing the uncertainty with this approach, it is important to ensure that each starting configuration is also independent or else to recognize and report that the uncertainty refers to simulations started from a particular configuration.  The means to obtain independent starting configurations is system-dependent, but might involve repeating the protocol used to construct a configuration (solvating a protein, inserting liquid molecules in a box, etc.), and/or using different seeds to generate random configurations, velocities, etc.  However, readers are cautioned that \emph{for complex systems, it may be effectively impossible to generate truly independent starting configurations pertinent to the ensemble of interest.}  For example, a simulation of a protein in water will nearly always start from the experimental structure, which introduces some correlation in the resulting simulations even when the remaining simulation components (water, salt, etc.) are regenerated {\it de novo}.

\subsection{Dealing with correlated time-series data}

When samples of a simulated observable are independent, the experimental standard deviation of the mean (i.e.,  Eq.~\hyperref[def:exp_st_dev]{\ref{def:exp_st_dev_mean}}) can be used as an estimate of the corresponding standard uncertainty.  Due to correlations, however, the number of independent samples in a simulation is neither equal to the number of observations nor known {\it a priori}; thus Eq.~\ref{def:exp_st_dev_mean} is not directly useful.  To overcome this problem, a variety of techniques have been developed to estimate the effective number of independent samples in a dataset.  Two methods in particular have gained considerable traction in recent years:  (i) autocorrelation analyses, which directly estimate the number of independent samples in a time series; and (ii) block averaging, which projects a time series onto a smaller dataset of (approximately) independent samples.  We now discuss these methods in more detail.


\subsubsection{Autocorrelation method for estimating the standard uncertainty}\label{sec:autocorrelation}

Conceptually, autocorrelation analyses directly compute the effective number of independent samples $N_{\rm ind}$ in a time series, taking into account ``redundant'' (or even possibly new) information arising from correlations.\footnote{It is worth pointing out that correlations do not always provide redundant information.  Consider, for example, the time series $1,-1,1,-1,1,-1,...$.  In the limit that the number of elements goes to infinity, the arithmetic mean also converges to zero.  However, a block of $2n$ entries also has a mean of zero, so that (anti)correlations effectively increase the amount of information.  See also Ref.~\cite{PatroneAIAA}. }  In particular, this approach invokes the fact that the statistical properties of steady-state simulations (e.g., those in equilibrium or non-equilibrium steady state) are, by definition, time-invariant.  As such, correlations between an observable computed at two different times depends only on the lag (i.e., difference) between those times, not their absolute values.

This observation motivates one to compute an autocorrelation function.  Specifically, one computes the stationary autocorrelation function $C_j$ as given in Eq.~\ref{def:obs_ACF} for a set of lags $j$.
Then, the number of independent samples is estimated by\footnote{The reader should note that both the autocorrelation function (Eq.~\ref{def:obs_ACF}) and the number of independent samples (Eq.~\ref{def:Nind}) may be written in different forms\cite{Grossfield2009,Chodera-2016}. Our convention here presents the observations as a list $\left\{x_j\right\}$ in which the time interval (Molecular Dynamics) or trial spacing (Monte Carlo) of adjacent $x_j$ is implicitly fixed. For time-series data, one could alternately write both the observations and autocorrelation function as continuous functions of time, e.g., $x\left(t\right)$ and $C\left(\tau\right)$ where $\tau$ is the lag time. In that case, $N_{\rm ind}$ is written as a division of the total simulation time by the time integral of $C\left(\tau\right)$\cite{Grossfield2009}.}
%
\begin{equation}
  N_{\rm ind} = \dfrac{N}{1+2 \sum_{j=1}^{N_{\rm max}} C_{j}}
  \label{def:Nind}
\end{equation}
%
where $N_{\rm max}$ is an appropriately chosen maximum number of lags (see below).  Note that $N_{\rm ind}$ need not be an integer.  Finally, the standard uncertainty is estimated via
%
\begin{equation}
  \stdevmean{\mean{x}} = \dfrac{\stdev{x}}{\sqrt{N_{\rm ind}}}
  \label{def:ACF_std_unc}
\end{equation}
%
We note that the \hyperref[def:exp_st_dev]{experimental standard deviation} of the observable $x$ is used in Eq.~\ref{def:ACF_std_unc} to estimate the uncertainty. Strictly speaking, the standard uncertainty should be estimated using the true standard deviation of $x$ (e.g., $\sigma_x$); given that the true standard deviation is unknown, the experimental standard deviation is used in its place as an estimate of $\sigma_x$ \cite{PatroneAIAA}.

In evaluating Eq.~\eqref{def:Nind}, the value of $N_{max}$ must be chosen with some care.
Roughly speaking, $N_{max}$ should be large enough so that the sum is converged and insensitive to the choice of upper bound.  Although a very large value of $N_{max}$ might seem necessary for slowly decaying autocorrelation functions, appropriate truncations of the sum will introduce negligible error, even if the correlation time is infinite.  We refer readers to discussions elsewhere on this topic, for example, Refs.~\cite{Janke_Statistical_2002,Shirts_Statistically_2008,Grossfield2009,Gowers_Automated_2018}.  In typical situations, $N_{max}$ can be set to any value greater than $\tau$, since in principle $C_{j}=0$ for all $j > \tau$.  However, care must be exercised to avoid integrating pure noise over too large of an interval, since this can generate Brownian motion; see, for example, Ref.~\cite{acfnoise} and references contained therein.

%In evaluating Eq.~\eqref{def:Nind}, we observe that $\tau$ may in fact be infinite, e.g. if the autocorrelation time decays exponentially or according to some power law.  In principle, this does not significantly alter the sum in Eq.~\eqref{def:Nind} (except that $N_{\max} \to \infty$), since the sum is usually convergent.  However, in such cases there may be ambiguity in how to pick a value of $N_{\max}$ that is useful for practical computation, since appropriate truncations of the sum will introduce negligible error.  Unfortunately a deeper analysis of this issue and recommendations on how to proceed are largely problem specific, and we direct the reader to discussions elsewhere.  See, for example, Refs.~\cite{Janke_Statistical_2002,Shirts_Statistically_2008,Grossfield2009,Gowers_Automated_2018}.

%In the situations in which $N_{\max}$ is finite (or a finite approximation has been chosen), we also observe that this quantity can be set to any value greater than $\tau$, since $C_j = 0$ (or $C_j \approx 0$ in the case of finite approximation) for all $j>\tau$.  In practice, however, setting $N_{\rm max}$ too large can sometimes cause the sum to be dominated by noise appearing in the estimate of $C_j$.  See, for example, Ref.~\cite{acfnoise} and references contained therein.  Thus, care must be taken when choosing an appropriate value of $N_{\rm max}$, making sure that the estimate is sufficiently large so that the sum is converged but otherwise independent of this upper limit.

\subsubsection{Block averaging method for estimating the standard uncertainty}\label{sec:blockavg}

The main idea behind block averaging is to permit the direct usage of Eq.~\ref{def:exp_st_dev} by projecting the original dataset onto one comprised of only independent samples, so that there is no need to compute $N_{\rm ind}$.  Acknowledging that typical MD time series have a finite-correlation time $\tau$, we recognize that a continuous block of $M$ data-points will only be correlated with its adjacent blocks through its first and last $\tau$ points, provided $\tau$ is small compared to the block size $M$.   That is, correlations will be on the order of $\tau / M$, which goes to zero in the limit of large blocks.

This observation motivates a technique known as block averaging \cite{Friedberg1970,Flyvbjerg-1989,FrenkelSmit2002,Grossfield2009}.
Briefly, the set of $N$ observations $\left\{x_1, ..., x_N\right\}$ are converted to a set of $M$ ``block averages'' $\left\{x^b_1, ..., x^b_{M}\right\}$, where a block average $x^b_j$ is the \hyperref[def:arith_mean]{arithmetic mean} of $n$ (the block size) sequential measurements of $x$:
%
\begin{equation}
  x^b_j = \dfrac{\sum\limits_{k=1+(j-1)n}^{jn} x_k}{n}
\end{equation}
%
From this set of block averages, one may then compute the arithmetic mean of the block averages, $\mean{x}^b$, which is an estimator for $\expval{x}$.\footnote{Note that in general $\mean{x} \ne \mean{x}^b$ if the block averages discard some observations of $x_l$, for example, when the total number of observations is not a multiple of the block size.}
Following, one computes the \hyperref[def:exp_st_dev]{experimental standard deviation} of the block averages, $\stdev{\mean{x}^b}$, using Eq.~\ref{def:exp_st_dev}.
Lastly, the standard uncertainty of $\mean{x}^b$ is just the \hyperref[def:exp_st_dev_mean]{experimental standard deviation of the mean} given the set of $M$ block averages:
%
\begin{equation}
  \stdevmean{\mean{x}^b} = \dfrac{\stdev{x^b}}{\sqrt{M}}
\end{equation}
%
This standard uncertainty may then be used to calculate a confidence interval on $\mean{x}^b$.

It is important to note that for statistical purposes, the blocks must all be of the same size in order to be identically distributed, and thereby satisfy the requirements of Eq.~\ref{def:exp_st_dev_mean}.  It is also important to  systematically assess the impact of block size on the corresponding estimates.  In particular, as the blocks get longer, the block averages should decorrelate and $\stdevmean{\mean{x}^b}$ should plateau \cite{Flyvbjerg-1989,Grossfield2009}.
Another approach is to measure the block correlation and to use it to improve the selection of the block size and, hence, uncertainty estimate \cite{Kolafa1986}.
We stress that this final step of adjusting the block size and recomputing the block standard uncertainty is absolutely necessary. Otherwise, the blocks may be correlated, yielding an uncertainty that is not meaningful.


\subsection{Propagation of uncertainty}\label{subsec:linear}

Oftentimes we run simulations for the purposes of computing derived quantities, i.e., those that arise from some analysis applied to raw data.  In such cases, it is necessary to propagate uncertainties in the raw data through the corresponding analysis to arrive at the uncertainties associated with the derived quantity.  Frequently, this can be accomplished through a linear propagation analysis using Taylor series, which yields simple and useful formulas.

The foundation for this approach lies in rigorous results for the propagation of error through linear functions of random variables.
For a derived observable that is a linear function of $M$ uncorrelated raw data measurements, e.g.,
%
\begin{equation}
  F\left( \left\{x_i\right\} \right) = c + \sum_{i=1}^M a_i x_i,
  \label{def:linear_derived_obs}
\end{equation}
%
where $c$ is a constant, the experimental variance of $F$ may be rigorously expressed as \cite{NIST_Sematech_eHandbook}
%
\begin{equation}
  \var{F} = \sum_{i=1}^M a_i^2 \var{x_i}
  \label{def:var_linear_derived}
\end{equation}
%
A key assumption in Eq.~\ref{def:var_linear_derived} is that the raw data, $\left\{x_i\right\}$, are \hyperref[def:unc_obs]{linearly uncorrelated} (see Eq.~\ref{def:unc_obs}). If any observed quantities are correlated, the uncertainty in $F$ must include ``covariance'' terms. The reader may consult Sec.\ 2.5.5, ``Propagation of error considerations'' in Ref.~\cite{NIST_Sematech_eHandbook} for further discussion. For reasons of tractability, we restrict the discussion here to linearly uncorrelated observables or the assumption thereof.

The situation for a nonlinear derived quantity is much more complicated and, as a result, rigorous expressions for the uncertainty of such functions are rarely used in practice. As a simplification, however, one approximates the nonlinear derived quantity as a Taylor-series expansion about a reference point, i.e.,
%
\begin{equation}
  F\left( \left\{x_i\right\} \right)
  \approx
  F\left( \left\{\mean{x_i}\right\} \right)
  + \sum_{j=1}^M \left.
    \left( \dfrac{\partial F}{\partial x_j}\right)
  \right|_{\left\{\mean{x_i}\right\}}
  \left(x_j - \mean{x_j}\right) + \cdots
  \label{def:nonlinear_derived_obs_taylor}
\end{equation}
%
The deviation of a particular measurement from its mean, $\epsilon_i = x_i - \mean{x_i}$, is itself a random quantity, and the uncertainty in those measurements is propagated into uncertainty in $F$.
Note that the ratio $\epsilon_i/\mean{x_i}$ is the so-called ``noise-to-signal'' ratio, which vanishes in the limit of a precise measurement.
With this {\it linear} approximation of $F$, which is analogous to Eq.~\ref{def:linear_derived_obs} with $a_i = \left( \partial F/\partial x_i\right)$, and the assumption that the raw data are uncorrelated, the variance in $F$ may be approximated by
%
\begin{eqnarray}
  \var{F} & \approx & \sum_{j=1}^M
                \left.
                \left( \dfrac{\partial F}{\partial x_j} \right)^2
                \right|_{\left\{\mean{x_i}\right\}}
                \var{x_j}
  \label{def:var_nonlinear_derived}
\end{eqnarray}
%

A simple example illustrates this procedure.
Consider, in particular, the task of estimating the uncertainty in a measurement of density, $\rho=m/V$, from a time series of volumes output by a constant pressure simulation, where $m$ is the (constant) system mass and $V$ is the (fluctuating) system volume.
Application of Eq.~\ref{def:var_nonlinear_derived} to the definition of $\rho$ yields
%
\begin{eqnarray}
  \var{\rho} & \approx & \left( \dfrac{N}{\mean{V}^2} \right)^2 \var{V} \\
  \stdev{\rho} & \approx & \mean{\rho} \dfrac{\stdev{V}}{\mean{V}}
  \label{ex:density_linearized}
\end{eqnarray}
%
with $\mean{\rho} = N/\mean{V}$.
This approximation of the experimental standard deviation may be used to estimate a confidence interval on $\mean{\rho}$ or for other purposes.

In general, approximations in the spirit of Eq.~\ref{ex:density_linearized} are useful and easy to generalize to higher-dimensional settings in which the derived observable is a nonlinear combination of many data-points or sets.
However, the method does have limits.
In particular, it rests on the assumption of a small noise-to-signal ratio, which may not be valid for all simulated data.
If there is doubt as to the quality of an estimate, the uncertainty should therefore be estimated with alternative approaches such as bootstrapping in order to validate the linear approximation.
See also the pooling analysis of Ref.~\cite{patrone1} for a method of assessing the validity of linear approximations.

\subsection{From standard uncertainty to confidence interval for Gaussian variables}\label{sec:conf_int}

Once a standard uncertainty value is obtained for a Gaussian-distributed random variable with mean $\left< x \right>$, and the number of independent samples $n$ has been estimated, the 95~\%-confidence interval $\left[ \mean{x} -k \, \stdevmean{\mean{x}} , \mean{x} +k \, \stdevmean{\mean{x}} \right]$ can be constructed on the basis of an established look-up table (or a statistics software model) for the coverage factor $k$ based on $n$.  The theoretical basis for the table is the ``Student'' or ``$t$'' distribution, which is \emph{not} Gaussian, but governs the behavior of an \emph{average} derived from $n$ independent Gaussian variables  \cite{JCGM:GUM2008}. Table \ref{tab:coveragefactors} lists $k$ for two-sided 95~\% confidence intervals for select values of $n$.

\begin{table}
  \centering
    \begin{tabular}{S S}
      \toprule
       {$n$ (independent samples)} & {$k$ (coverage factor)} \\
       \hline
      6 & 2.57 \\
      11 & 2.23 \\
      16 & 2.13 \\
      21 &  2.09\\
      26 & 2.06 \\
      51 & 2.01 \\
      101 & 1.98 \\
      \bottomrule
    \end{tabular}
  \caption{Coverage factors $k$ required for a two-sided 95~\% confidence interval for a Gaussian variable \cite{JCGM:GUM2008}. Note that $k$ increases with decreasing sample size. This in turn implies that smaller samples yield higher uncertainty for a given estimation of the experimental standard deviation.}
  \label{tab:coveragefactors}
\end{table}

As a reminder, multi-modally distributed variables with multiple peaks in their distributions cannot be considered Gaussian random variables.  Variables with a strict upper or lower limit (such as a non-negative quantity) and long-tailed distributions are also not Gaussian.  These cases should be treated with bootstrapping.

\subsection{Bootstrapping}\label{sec:bootstrap}

Bootstrapping is an approach to uncertainty estimation that does not assume a particular distribution for the observable of interest or a particular kind of relationship between the observable and variables directly obtained from simulation \cite{Tibshirani1998}.  A full discussion of bootstrapping and resampling methods is outside the scope of this article; we will cover the broad strokes here, and suggest interested readers consult excellent discussions elsewhere for more details (e.g. Refs.~\cite{Schenker1985}, \cite{Chernick2009}, and, particularly, \cite{Tibshirani1998}).

In nonparametric bootstrapping, new, ``synthetic'' data sets (corresponding to hypothetical simulation runs) are created by drawing $n$ samples (configurations) from the original collection that was generated during the actual run.  The same sample may be selected twice, while others may not be selected at all in a process called ``sampling with replacement.''  In doing so, these synthetic sets will be different even though they all have the same number of samples and draw from the same pool of data.  Having created a new set, the data are analyzed to determine the derived quantity of interest, and this process is repeated to produce multiple estimates of the quantity.  The distribution of ``synthetic'' observables can be directly used to construct a 95~\% confidence interval from the 2.5 percentile to the 97.5 percentile value.
Readers are cautioned that bootstrapping confidence intervals are not quantitatively reliable in certain cases such as with small-sample sizes or distributions that are skewed or heavy-tailed \cite{Schenker1985,Chernick2009}.

The process described above assumes that the original simulation data are uncorrelated.  If this is not the case, then the resampling method can be reformulated in one of two ways.  The first option is to estimate the number of independent samples in the original set (e.g., using an autocorrelation method \cite{Chodera-2016,Lyman2007a}) and to pull only that many samples to create the new data sets.  The second option is to group the samples into blocks that are uncorrelated based on analyzing varying block sizes (see Sec.~\ref{sec:blockavg}) and to then use the block averages as the samples for bootstrapping.

Alternatively, one could use the difference between errors estimated via block averaging and bootstrapping as a measure of the correlation; if one tracks the bootstrapped and block-averaged estimates of a quantity's uncertainty as a function of block size, the only difference between the two modes of calculation is whether the data are correlated.  The decay in the ratio of the two quantities as a function of time is a measure of the correlation time in the sample \cite{Romo2011}.

\subsubsection{Bootstrapping variants}

An alternate approach that can directly account for correlations is called parametric bootstrapping.  The main idea behind this method is to model the original data as a deterministic function (which can be zero, constant, or have free parameters) plus additive noise.  The parameters of this model, including the structure of the noise (i.e., its covariance), can be determined through a statistical inference procedure.  Having calibrated the model, random number generators can be used to sample the noise, which is then added back to the trial function to generate a synthetic data set.   As with the nonparametric bootstrap, the generated data can be used to compute the derived quantity of interest, and the uncertainty can be obtained from the statistics of the values compute with different generated sets.

To further clarify the procedure of parametric bootstrapping, consider the simplest case in which the data are a collection of uncorrelated random variables fluctuating about a constant mean.  In this situation, one could estimate (I) the deterministic part of a parametric model using the sample mean $\mean{x}$ of the data, and (II) the stochastic part as a Gaussian random variable whose variance equals the sample variance.  If instead the data are correlated (e.g., as in a time series of simulated observables),  one can postulate a covariance function to describe the structure of this randomness.  Often these covariance functions are formulated with free parameters (often called ``hyperparameters'') that characterize properties such as the noise-scale and characteristic length of correlations \cite{Rasmussen}.  In such cases, determining the hyperparameters may require more sophisticated techniques such as maximum likelihood analyses or Bayesian approaches; see, for example, Ref.~\cite{Rasmussen}.  See also Refs.~\cite{patrone1,patrone2,patrone3} for examples and practical implementations applied to cases in which the deterministic component of the data is not constant.

It is important to note that various bootstrapping approaches can and often are used as uncertainty propagation tools.  Nonetheless, care should be exercised when using such methods with either nonlinear functions or naturally constrained functions.  For example, consider application of the parametric bootstrap technique to a free energy calculation,
%
\begin{equation}
  \beta \Delta A = - \ln \left[ \overline{\exp\left(-\beta \Delta U\right)} \right]
\end{equation}
%
If synthetic samples are drawn for $\exp(-\beta \Delta U)$ from, say, a Gaussian centered at 1 with a standard deviation of 0.5, new estimates will eventually output negative numbers. This is, of course, mathematically nonsensical for the exponential function.  Thus, one should be aware of any distributional assumptions imposed either by the physics of the problem or the mathematical analyses of synthetic data.

Lastly, an alternative to bootstrapping is the ``jackknife'' method \cite{Quenouille_Approximate_1949,Quenouille_Notes_1956,Tukey_Bias_1958} (also outlined in Ref.~\cite{Tibshirani1998}). It operates similar to the bootstrap as a resampling technique, but it uses synthetic data sets created by subtraction of some number of samples rather than replacement; as such it is often categorized as a variant of the bootstrap (even though it predates the bootstrap). Since it operates by sample deletion, it may be better suited to smaller data sets for which a few samples may be overrepresented in the synthetic data set created by the bootstrap replacement technique. Ultimately, though, the results are similar in that the jackknife technique creates a distribution of derived observables that can be used to compute an arithmetic mean, an estimate of the standard uncertainty, and confidence intervals.

\subsection{Dark uncertainty analyses}

In some cases, multiple simulations of the same physical observable $\tau$ may yield predictions whose error bars do not overlap.  This situation can arise, for example, in simulations of the glass transition temperature when undersampling the crosslinked network structure of certain polymers.  In such cases, it is reasonable to postulate an unaccounted for source of uncertainty, which we colorfully refer to as ``dark uncertainty'' \cite{patrone1}.  In the context of a statistical model, we postulate that the probability of a simulation output depends on the unobserved or ``true'' mean value $\bar \tau$, an uncertainty $\sigma_i^2$ whose value is specific to the simulation (estimated, e.g., according to uncertainty propagation), and the unaccounted-for dark uncertainty $y^2$.  (For simplicity, the $\sigma_i^2$ and $y^2$ should be treated as variances.)

While details are beyond this scope of this document, such a model motivates an estimate of $\bar \tau$ of the form
\begin{align}
\bar \tau \approx \mathcal T \propto \sum_i \frac{T_i}{\sigma_i^2 + y^2}, \label{eq:darkmean}
\end{align}
where $\mathcal T$ is a ``consensus'' or weighted-mean estimate of the true mean $\bar \tau$, $T_i$ is the prediction from the $i$th simulation, $\sigma_i^2$ is its associated ``within-simulation'' uncertainty, and $y^2$ is the ``dark'' or ``between-simulation'' uncertainty; note that the latter does not depend on $i$.  The variable $y^2$ can be estimated from a maximum-likelihood analysis of the data and amounts to numerically solving a relatively simple nonlinear equation (see Ref.~\cite{patrone1}).  Equation~\ref{eq:darkmean} is useful insofar as it weights simulated results according to their certainty while reducing the impact of overconfident predictions (e.g., having small $\sigma_i^2$).  Additional details on this method are provided in Ref.~\cite{patrone1} and the references contained therein.

\subsection{Propagation across multiple steps}
In some instances, it is useful and/or necessary to propagate uncertainty across different simulation steps.  This occurs, for example, when the property of a constant pressure system can only be computed using constant volume simulations (e.g., for certain viscosity calculations).  In such cases, uncertainty in the system volume must be accounted for in the final property prediction, which carries its own uncertainties associated with the simulation protocol, input parameters, etc.  Related issues arise when attempting to account for uncertainty in force-field parameters \cite{Rizzi2,Rizzi3,Rizzi4}.

Addressing these uncertainty propagation tasks may be as simple as performing a linear or bootstrap analysis as previously described, but at each step of the simulation protocol.  In other cases, especially when propagation must be performed between different simulations, such approaches are computationally expensive and, thus, infeasible.  A variety of methods (surrogate modeling, polynomial chaos, Gaussian-process regression, etc.) are being actively explored by the community, but in many instances few, if any approaches, have emerged as widely accepted strategies.  We encourage the reader to stay informed of current literature, some of which is referenced herein \cite{smith2013uncertainty}.