Chapter 10 Background on Bayesian statistics

The models in eDNAjoint are formulated in a Bayesian framework, so users must have some knowledge of how to form a prior distribution and understand its influence on inference (Lemoine, 2019), as well as understand how to diagnose and mitigate MCMC problems such as lack of convergence.

Here is a brief background on some of the key concepts in Bayesian modeling and estimation, as well as some helpful references for further background.

10.1 Bayes' theorem and Bayesian statistics

The Bayesian paradigm differs from the Frequentist paradigm (i.e., maximum likelihood estimation) in a few key ways:

  • All parameters are treated as random variables, meaning they are encoded as probability distributions. This differs from Frequentist statistics, which assumes parameters are fixed and attributes all uncertainty to our estimation of them.

  • Relatedly, in the Bayesian paradigm we characterize the distribution of the model parameters, given the observed data. In Frequentist statistics, by contrast, we characterize the parameter values most consistent with the observed data.

  • The Bayesian framework encodes the idea of a prior belief that can be updated based on data.

Bayesian statistics is based on Bayes' theorem:

\[ Pr[parameters \mid data] = \frac{Pr[data \mid parameters] \times Pr[parameters]}{Pr[data]} \]

Put another way:

\[ posterior = \frac{data\space likelihood \times prior}{marginal \space likelihood} \]

The posterior distribution is the probability of parameter values, given the data. The posterior distribution is the Bayesian analog of the maximum likelihood estimate (MLE) in Frequentist statistics, but instead of a single MLE point estimate, the “best estimate” of the parameter will be a probability distribution.

Here you can see that a prior distribution goes into the calculation of the posterior. This is distinctive of the Bayesian paradigm: we encode what we know about model parameters before the analysis through prior distributions. Priors can be informed by previous studies, informed judgement, or expert opinion. Often, it will be necessary to conduct a prior sensitivity analysis to determine how the choice of prior affects posterior inference (see example of a prior sensitivity analysis).
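As a concrete illustration of this prior-to-posterior updating, consider the conjugate Beta-Binomial case, where the posterior is available in closed form. This is a generic Python sketch with made-up numbers (a hypothetical detection probability), not eDNAjoint code:

```python
# Beta(2, 2) prior on a detection probability p;
# observe 7 detections in 10 samples (made-up numbers).
a_prior, b_prior = 2, 2
detections, trials = 7, 10

# Conjugacy: Beta prior x Binomial likelihood -> Beta posterior,
# so the marginal likelihood never has to be computed explicitly.
a_post = a_prior + detections              # 2 + 7 = 9
b_post = b_prior + (trials - detections)   # 2 + 3 = 5

prior_mean = a_prior / (a_prior + b_prior)  # 0.5
mle = detections / trials                   # 0.7, the Frequentist point estimate
post_mean = a_post / (a_post + b_post)      # 9/14 ~ 0.643

# The posterior mean sits between the prior mean and the MLE,
# pulled further toward the data as the sample size grows.
print(round(post_mean, 3))
```

Notice that the "best estimate" here is a full Beta(9, 5) distribution, not a single number; the posterior mean is just one summary of it.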

The goal is to calculate the posterior distribution. However, it often cannot be calculated directly: the marginal likelihood is usually analytically intractable because it requires integrating over the entire parameter space. We therefore approximate the posterior distribution by sampling from it using Markov chain Monte Carlo (MCMC) algorithms.

10.2 Bayesian sampling algorithm: MCMC

There are many Markov chain Monte Carlo (MCMC) algorithms; Hamiltonian Monte Carlo is the one used in the eDNAjoint package, but they share some common principles. In general, MCMC methods draw samples that, together, approximate the posterior distribution. The algorithm is analogous to taking a walking tour with your next location chosen at random, where the probability of choosing the next location is proportional to how interesting it is relative to where you are now. The “Markov” aspect of the chain means that the next sample value depends on the current value. MCMC algorithms tend to work well, but they require attention to performance and can be computationally intensive.
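The walking-tour analogy is, in essence, the Metropolis algorithm, a simpler relative of Hamiltonian Monte Carlo. A minimal Python sketch targeting a standard normal "posterior" (purely illustrative, not eDNAjoint code):

```python
import math
import random

random.seed(1)

def log_post(x):
    # Log of an unnormalized standard normal "posterior";
    # the marginal likelihood constant cancels in the ratio below.
    return -0.5 * x * x

def metropolis(n_iter=20000, step=1.0, x0=0.0):
    x, samples = x0, []
    for _ in range(n_iter):
        proposal = x + random.gauss(0.0, step)  # propose the next stop on the tour
        # Accept with probability proportional to how "interesting"
        # (probable) the proposal is relative to the current location.
        if math.log(random.random()) < log_post(proposal) - log_post(x):
            x = proposal
        samples.append(x)  # on rejection, stay at the current location
    return samples

samples = metropolis()
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# mean ~ 0 and variance ~ 1, matching the standard normal target
```

Each iteration depends only on the current value (the Markov property), and the collected samples approximate the target distribution.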

10.2.1 Glossary

  • Posterior samples: Since the posterior distribution is often complex and cannot be computed analytically, MCMC methods generate a large number of posterior samples, which are used to approximate the posterior distribution and summarize parameter estimates (e.g., mean, median, credible intervals).

  • Chains: A chain is a sequence of random draws from the posterior and represents our collection of posterior samples. Because the MCMC algorithm is stochastic, a single chain may not represent the target posterior distribution well, so we run multiple chains (usually 4) with different initial values; the samples from all chains are then combined to form the posterior. Running chains in parallel on multi-core computers will greatly speed up computation, and this can be specified with the multicore argument of joint_model(). The chains are also used to diagnose algorithm convergence (more on this below).

  • Initial values: the MCMC algorithm has to have values for each parameter to start (i.e., if you are on a walking tour and want to find the most interesting location, you have to start somewhere). The initial values can be chosen at random or pre-defined in joint_model().

  • Warmup or burn-in: MCMC does not start off yielding samples from the posterior distribution, since the algorithm takes time to “find” it. If the initial values are far from the posterior, the warmup, or burn-in, period can be long. We discard a certain number of warmup samples to give the algorithm time to start generating draws from the posterior distribution of interest.

  • Thinning: There may be some dependence, or correlation, between successive samples within a chain, and we often describe chains with highly correlated samples as “slow mixing”. If correlation is high, we can reduce it by thinning the post-warmup samples, keeping only every kth iteration. Stan’s Hamiltonian Monte Carlo is very efficient, however, so the chains often do not need to be thinned.
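Mechanically, discarding warmup and thinning are just indexing operations on a stored chain. A toy Python sketch (the chain values are placeholders, not real MCMC draws):

```python
# A toy "chain": 1000 iterations of one parameter.
chain = [0.1 * i for i in range(1000)]  # stand-in for MCMC draws

n_warmup = 500   # discard the first 500 iterations as warmup/burn-in
thin = 5         # then keep every 5th draw to reduce autocorrelation

kept = chain[n_warmup::thin]
print(len(kept))  # 100 post-warmup, thinned samples remain
```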

See the customize MCMC section for how to customize the MCMC algorithm using arguments to joint_model().

10.3 Understanding algorithm convergence

How do you know if the MCMC algorithm worked? There are a few diagnostics that can be used to determine if the algorithm has converged, meaning that your MCMC samples are a good approximation of the posterior distribution.

10.3.1 Trace plots

Trace plots show how the MCMC algorithm is sampling the posterior distribution. You can find tips for creating trace plots in the visualization tips section. On the x-axis is the MCMC iteration, and on the y-axis is the value of the parameter at that MCMC iteration. The different chains are often shown in different colors and ideally will be on top of each other.

You know the MCMC algorithm has converged when it reaches a “stationary distribution”. This means that the chains are moving around the same mean with the same variance. If the algorithm has converged, the trace plot should look like a fuzzy caterpillar, where the chains are on top of each other and, on average, are not increasing or decreasing across iterations. The chains should also not be autocorrelated, meaning that iteration 10 should not be more similar to iteration 11 than iteration 1000.

Dennis Valle’s blog on MCMC convergence is also a helpful resource here.

10.3.2 Posterior density

You can also plot the posterior distribution, also known as the posterior density, as a histogram or density plot of the posterior samples. Some examples can be found in the visualization tips section for posterior samples. Here, it is mostly important to ensure that the posterior density is uni-modal.

10.3.3 Rhat

The most commonly reported diagnostic of convergence is \(\hat{R}\). This statistic compares the between- and within-chain estimates for model parameters. If the chains have not mixed well (i.e., the between- and within-chain estimates don’t agree), \(\hat{R}\) is larger than 1. It is recommended to run at least four chains and to use the samples only if \(\hat{R}\) is less than 1.05. You can use the function joint_summarize() to calculate \(\hat{R}\) for the parameters in your model.
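To make the between- vs. within-chain comparison concrete, here is a Python sketch of the classic Gelman–Rubin \(\hat{R}\) (Stan's actual implementation uses a refined split, rank-normalized version):

```python
import random

def rhat(chains):
    """Classic Gelman-Rubin R-hat for m chains of length n (one parameter)."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # B: between-chain variance; W: mean within-chain variance
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * W + B / n  # pooled posterior variance estimate
    return (var_plus / W) ** 0.5

random.seed(2)
# Four well-mixed chains sampling the same distribution...
mixed = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# ...versus four chains where one is stuck in a different region.
stuck = [[random.gauss(mu, 1) for _ in range(1000)] for mu in (0, 0, 0, 5)]

print(rhat(mixed) < 1.05)  # chains agree
print(rhat(stuck) > 1.05)  # chains disagree: a convergence failure
```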

10.3.4 Effective sample size (ESS)

The trace plot section above mentioned the idea of autocorrelation. The effective sample size (ESS) is a measure of how much information is in the posterior samples, adjusted for the amount of autocorrelation among them. The ESS is also a measure of the efficiency of the algorithm, so the higher, the better. You can use the function joint_summarize() to calculate the effective sample size for the parameters in your model.

To quote Andy Jones’s really helpful blog on effective sample size:

“[W]e crucially assume that the N samples are independent of one another. However, in most MCMC sampling approaches, samples drawn sequentially will be positively correlated (or autocorrelated, in the parlance of time-series) with one another. When samples exhibit autocorrelation, the sequence contains less information content about the underlying distribution in some sense.”

“We can use the idea of autocorrelation to adjust the original observed sample size to account for redundant information in the sample. Intuitively, we should expect that our effective sample size will be smaller than the observed sample size when the autocorrelation is large.”
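The adjustment Jones describes is roughly \(ESS \approx N / (1 + 2\sum_k \rho_k)\), where \(\rho_k\) is the lag-\(k\) autocorrelation. A simplified Python illustration, truncating the sum at the first non-positive autocorrelation (real implementations such as Stan's use more careful truncation rules):

```python
import random

def ess(x):
    """Effective sample size: N adjusted for autocorrelation (simplified)."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n
    tau = 1.0
    for lag in range(1, n // 2):
        c = sum((x[i] - mean) * (x[i + lag] - mean)
                for i in range(n - lag)) / n
        rho = c / c0
        if rho <= 0:        # stop once autocorrelation dies out
            break
        tau += 2 * rho
    return n / tau

random.seed(3)
iid = [random.gauss(0, 1) for _ in range(2000)]  # independent draws
ar = [0.0]
for _ in range(1999):                            # AR(1): strongly autocorrelated chain
    ar.append(0.9 * ar[-1] + random.gauss(0, 1))

print(ess(iid) > 1200)  # True: near the nominal sample size of 2000
print(ess(ar) < 500)    # True: autocorrelation sharply shrinks the ESS
```

Both sequences contain 2000 draws, but the autocorrelated chain carries far less independent information about its distribution.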

10.3.5 Initial values

You can imagine that the MCMC algorithm is traversing a landscape with mountains and valleys. In a simple model with two parameters, you can think of one parameter as longitude, another parameter as latitude, and elevation as the negative log likelihood. The valleys are therefore the parameter space with the lowest negative log likelihood of the data, given the parameters. Ideally, there is one valley that is the lowest, and the parameter space around this valley is your posterior distribution.

But depending on where the algorithm starts, you might end up in a valley that is low, but not the lowest. This is the idea of “local minima” vs “global minimum”. If the algorithm starts near a local minimum, it could get stuck and never find the global minimum.

One way to check that you have found the global minimum is to start the algorithm in different places and see whether it ends up with the same posterior. You can check the initial values – the values of the model parameters that the algorithm uses to start – that were used in your model with model_fit$initial_values. When you run joint_model(), initial values are chosen for you, but you can also set them via the initial_values argument of joint_model(); see more in the tips section.
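This failure mode can be demonstrated with a toy bimodal posterior: two Metropolis chains (a simple stand-in for the actual sampler) started in different valleys each get stuck near their starting mode and yield conflicting posterior summaries. A purely illustrative Python sketch:

```python
import math
import random

def log_post(x):
    # A bimodal "posterior": mixture of two well-separated normals
    # centered at -10 and +10 (two valleys in negative-log-likelihood terms).
    return math.log(math.exp(-0.5 * (x + 10) ** 2)
                    + math.exp(-0.5 * (x - 10) ** 2))

def metropolis(x0, n_iter=5000, step=0.5, seed=0):
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_iter):
        prop = x + rng.gauss(0.0, step)
        if math.log(rng.random()) < log_post(prop) - log_post(x):
            x = prop
        samples.append(x)
    return samples

# Two chains with different initial values, one per mode.
chain_a = metropolis(x0=-10.0, seed=1)
chain_b = metropolis(x0=10.0, seed=2)
mean_a = sum(chain_a) / len(chain_a)
mean_b = sum(chain_b) / len(chain_b)

# Each chain stays stuck in the mode it started in, so their posterior
# summaries disagree badly -- exactly what R-hat is designed to flag.
print(mean_a < 0 < mean_b)  # True
```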

10.4 Useful references

Hobbs, N. T., & Hooten, M. B. (2015). Bayesian models: a statistical primer for ecologists. Princeton University Press.

Andy Jones’s blog on effective sample size

Dennis Valle’s blog on MCMC convergence

Stan user manual and help forum

Credibility intervals