Chapter 6 Use case 4: joint_model() with semi-paired data

This final use case shows how to fit and interpret the joint model with semi-paired eDNA and traditional survey data.

In many cases, you may have some sites where only eDNA samples were collected, while other sites have paired eDNA and traditional samples (i.e., semi-paired data). We can fit the joint model to this semi-paired data and use the sites with paired data to make predictions about the sites with only un-paired eDNA data.



Let’s use the same tidewater goby data from uses cases 1 and 2, except let’s pretend we did not have traditional seine sampling at two of the 39 sites.

6.1 Simulate semi-paired data

Both eDNA and traditional survey data should have a hierarchical structure:

  • Sites (primary sample units) within a study area
  • eDNA and traditional samples (secondary sample units) collected from each site
  • eDNA subsamples (replicate observations) taken from each eDNA sample

First replace the traditional count data at sites 4 and 34 with NA to simulate a scenario where we did not have traditional samples at these sites.

library(eDNAjoint)
data(goby_data)

# create new dataset of semi-paired goby data
goby_data_semipaired <- goby_data
goby_data_semipaired$count[4, ] <- NA
goby_data_semipaired$count[34, ] <- NA

We can now see that we have no data at site 4:

goby_data_semipaired$count[4, ]
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 
## NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

And site 34:

goby_data_semipaired$count[34, ]
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 
## NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

It’s important to ensure that we replace the data with NA so that we have an empty row indicating missing data.

6.2 Prepare the data

Now let’s look at the format of the data. You can see that the data is a list of four matrices all with the same number of rows that represent each site (n = 39). Across all matrices, rows in the data should correspond to the same sites (i.e., row one in all matrices corresponds to site one, and row 31 in all matrices corresponds to site 31).

pcr_k, pcr_n, and count are required for all implementations of joint_model(), and site_cov is optional and is used in use case 2.

Let’s look first at pcr_k. These are the total number of positive PCR detections for each site (row) and eDNA sample (column). The number of columns should equal the maximum number of eDNA samples collected at any site. Blank spaces are filled with NA at sites where fewer eDNA samples were collected than the maximum.

For example, at site one, 11 eDNA samples were collected, and at site five, 20 eDNA samples were collected.

head(goby_data_semipaired$pcr_k)
##      1 2 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
## [1,] 0 0 0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [2,] 0 0 0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [3,] 0 0 0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [4,] 0 6 6  4  6  5  4  6  5  3 NA NA NA NA NA NA NA NA NA NA NA NA
## [5,] 6 6 4  6  6  6  5  4  2  2  0  6  5  5  6  6  6  5  5  4 NA NA
## [6,] 0 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Now let’s look at pcr_n. These are the total number of PCR eDNA subsamples (replicate observations) collected for each site (row) and eDNA sample (column). In this data, six PCR replicate observations were collected for each eDNA sample. Notice that the locations of the NAs in the matrix match pcr_k.

head(goby_data_semipaired$pcr_n)
##      1 2 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
## [1,] 6 6 6  6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA
## [2,] 6 6 6  6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA
## [3,] 6 6 6  6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA
## [4,] 6 6 6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA NA
## [5,] 6 6 6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6 NA NA
## [6,] 6 6 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Next, let’s look at count. This data is from seine sampling for tidewater gobies. Each integer refers to the catch of each seine sample (i.e., catch per unit effort, when effort = 1). Again, the rows correspond to sites, and the columns refer to replicated seine samples (secondary sample units) at each site, with a maximum of 22 samples. Blank spaces are filled with NA at sites where fewer seine samples were collected than the maximum, as well as at sites with no seine samples. Notice that the number of rows (i.e., number of sites) in the count data still equals the number of rows in the PCR data, even though some of these sites have no count data.

dim(goby_data_semipaired$count)[1] == dim(goby_data_semipaired$pcr_k)[1]
## [1] TRUE

For more data formatting guidance, see section 2.1.1.

6.3 Fit the model

Now that we understand our data, let’s fit the joint model. The key arguments of this function include:

  1. data: list of pcr_k, pcr_n, and count matrices
  2. family: probability distribution used to model the seine count data. A poisson distribution is chosen here.
  3. p10_priors: Beta distribution parameters for the prior on the probability of false positive eDNA detection, \(p_{10}\). c(1,20) is the default specification. More on this later.
  4. q: logical value indicating the presence of multiple traditional gear types. Here, we’re only using data from one traditional method.

More parameters exist to further customize the MCMC sampling, but we’ll stick with the defaults.

# run the joint model
goby_fit_semi <- joint_model(data = goby_data_semipaired, family = "poisson",
                             p10_priors = c(1, 20), q = FALSE)

goby_fit_semi is a list containing:

  1. model fit (goby_fit_semi$model) of the class ‘stanfit’ and can be accessed and interpreted using all functions in the rstan package.
  2. initial values used for each chain in MCMC (goby_fit_semi$inits)

6.4 Interpret the output

6.4.1 Summarize posterior distributions

Let’s interpret goby_fit_semi. Use joint_summarize() to see the posterior summaries of the model parameters. Let’s look at the estimates of the expected catch rate at each site, \(\mu\).

joint_summarize(goby_fit_semi$model, par = "mu")
##           mean se_mean    sd    2.5%   97.5%    n_eff Rhat
## mu[1]    0.021   0.000 0.021   0.001   0.077 16240.36    1
## mu[2]    0.021   0.000 0.021   0.001   0.078 14779.29    1
## mu[3]    0.021   0.000 0.021   0.001   0.078 15721.84    1
## mu[4]    5.439   0.015 1.776   2.861   9.673 13541.80    1
## mu[5]    2.430   0.003 0.298   1.889   3.055 13439.14    1
## mu[6]    0.082   0.001 0.085   0.002   0.311 14104.38    1
## mu[7]    0.339   0.001 0.103   0.173   0.574 14989.91    1
## mu[8]    0.275   0.003 0.296   0.007   1.101 13743.81    1
## mu[9]    0.047   0.000 0.049   0.001   0.181 14677.25    1
## mu[10]   1.516   0.002 0.208   1.143   1.954 13106.55    1
## mu[11]  26.965   0.008 1.079  24.879  29.135 16318.47    1
## mu[12]   0.021   0.000 0.021   0.001   0.078 15670.82    1
## mu[13]   0.021   0.000 0.021   0.001   0.078 13691.18    1
## mu[14]   0.034   0.000 0.034   0.001   0.123 14769.96    1
## mu[15]   0.100   0.001 0.061   0.017   0.255 14255.74    1
## mu[16]   0.066   0.000 0.049   0.006   0.188 13746.66    1
## mu[17]   0.026   0.000 0.027   0.001   0.099 15061.34    1
## mu[18]   0.020   0.000 0.020   0.000   0.074 14501.20    1
## mu[19]   0.047   0.000 0.048   0.001   0.177 15781.81    1
## mu[20]   0.224   0.001 0.082   0.094   0.416 15353.69    1
## mu[21]   0.081   0.001 0.084   0.002   0.308 13958.44    1
## mu[22]   0.254   0.001 0.112   0.087   0.520 17458.17    1
## mu[23]  90.537   0.024 3.197  84.350  96.947 17384.66    1
## mu[24]   1.049   0.002 0.213   0.683   1.523 14888.40    1
## mu[25]  21.046   0.012 1.609  17.987  24.345 18189.39    1
## mu[26]   1.284   0.002 0.208   0.920   1.718 14554.37    1
## mu[27]   2.719   0.003 0.457   1.915   3.700 17366.79    1
## mu[28]   8.153   0.005 0.757   6.748   9.695 19301.35    1
## mu[29]   3.230   0.004 0.464   2.402   4.225 13722.26    1
## mu[30]  13.711   0.016 2.087   9.974  18.062 16584.27    1
## mu[31]  94.521   0.036 4.414  86.121 103.434 14771.12    1
## mu[32]   0.023   0.000 0.024   0.001   0.088 14113.75    1
## mu[33]   0.083   0.001 0.087   0.002   0.314 13683.54    1
## mu[34]   0.260   0.001 0.147   0.064   0.616 14292.78    1
## mu[35]   0.126   0.001 0.132   0.003   0.475 14539.43    1
## mu[36]   0.021   0.000 0.021   0.001   0.080 14860.14    1
## mu[37]   0.023   0.000 0.024   0.001   0.087 13940.14    1
## mu[38] 168.974   0.046 5.869 157.578 180.656 16597.79    1
## mu[39]   0.023   0.000 0.024   0.001   0.087 14321.37    1

This summarizes the mean, standard deviation (sd), and quantiles of the posterior estimates of \(\mu\), as well as the effective sample size (n_eff) and \(\hat{R}\) (Rhat) for the parameters. More informative about effective sample size and \(\hat{R}\) can be found in the algorithm convergence section, but briefly, the \(\hat{R}\) value is the frequently used statistic for assessing model convergence. We typically want \(\hat{R}\) to be less than 1.05, so it looks like our model converged. The 2.5% and 97.5% quantiles show the bounds of the 95% credibility interval (the equal tailed credibility interval, to be specific.)

The model estimates \(\beta_i\) using data from sites with paired samples and uses this estimate to make predictions for \(\mu_{i=4}\) and \(\mu_{i=34}\) where no traditional seine samples were collected.

We can also use functions from the bayesplot package to examine the posterior distributions.

First let’s look at the posterior distributions for \(\mu_{i=4,k=1}\) and \(\mu_{i=29,k=1}\). Traditional seine samples were collected at site 29 but not at site 4.

library(bayesplot)
# plot posterior distribution, highlighting median and 80% credibility interval
mcmc_areas(as.matrix(goby_fit_semi$model),
           pars = c("mu[4,1]", "mu[29,1]"), prob = 0.8)

As you could expect, the credibility interval for the expected catch rate at site four is much wider than the credibility interval at site 29, since no traditional samples were collected at site four.

Let’s do the same for \(\mu_{i=7,k=1}\) and \(\mu_{i=34,k=1}\). Traditional seine samples were collected at site 7 but not at site 34.

# plot posterior distribution, highlighting median and 80% credibility interval
mcmc_areas(as.matrix(goby_fit_semi$model),
           pars = c("mu[7,1]", "mu[34,1]"), prob = 0.8)

Again, the credibility interval for the expected catch rate at site 34 is wider with a longer tail than the credibility interval at site 7, since no traditional samples were collected at site 34.

Note: It’s important to consider that fitting the joint model with semi-paired data will be more successful if there is paired data at most sites. Since the data at paired sites is leveraged to make predictions at the un-paired sites, you want to maximize the amount of data to leverage.

6.5 Initial values

By default, eDNAjoint will provide initial values for parameters estimated by the model, but you can provide your own initial values if you prefer. Here is an example of providing initial values for parameters, mu,p10, alpha, and q, as an input in joint_model().

# set number of chains
n_chain <- 4

# number of sites
nsites <- dim(goby_data_semipaired$count)[1]

# initial values should be a list of named lists
inits <- list()
for (i in 1:n_chain) {
  inits[[i]] <- list(
    # length should equal the number of sites for each chain
    mu = stats::runif(nsites, 0.01, 5),
    # length should equal 1 for each chain
    p10 = stats::runif(1, 0.0001, 0.08),
    # length should equal 1 for each chain,
    # since there are no site-level covariates
    alpha = stats::runif(1, 0.05, 0.2)
  )
}
# now fit the model
fit_inits <- joint_model(data = goby_data_semipaired, initial_values = inits,
                         multicore = TRUE)
# check to see the initial values that were used
fit_inits$inits
## $chain1
## $chain1$mu_trad
##  [1] 3.88678789 0.34533277 3.13640836 4.24117051 0.35231837 0.23901479
##  [7] 0.75990010 1.58351348 4.48857616 2.10694284 4.11066900 1.37281898
## [13] 3.08997347 0.34698448 4.30453909 3.83063555 4.95716807 4.25880569
## [19] 4.79524076 4.43564275 1.96653120 4.81550095 3.02869554 3.25103013
## [25] 4.49420473 2.40487119 1.50305217 1.58262719 0.07031397 2.84243534
## [31] 4.06926917 2.22493580 3.95055330 1.13123025 3.46364201 1.14429093
## [37] 3.76447394
## 
## $chain1$mu
##  [1] 3.88678789 0.34533277 3.13640836 3.52619619 4.24117051 0.35231837
##  [7] 0.23901479 0.75990010 1.58351348 4.48857616 2.10694284 4.11066900
## [13] 1.37281898 3.08997347 0.34698448 4.30453909 3.83063555 4.95716807
## [19] 4.25880569 4.79524076 4.43564275 1.96653120 4.81550095 3.02869554
## [25] 3.25103013 4.49420473 2.40487119 1.50305217 1.58262719 0.07031397
## [31] 2.84243534 4.06926917 2.22493580 0.92746557 3.95055330 1.13123025
## [37] 3.46364201 1.14429093 3.76447394
## 
## $chain1$log_p10
## [1] -4.444975
## 
## $chain1$alpha
## [1] 0.1660135
## 
## $chain1$p_dna
## [1] 0.4 0.4
## 
## $chain1$p11_dna
## [1] 0.39 0.39
## 
## 
## $chain2
## $chain2$mu_trad
##  [1] 4.4957616 2.6535848 1.7346870 4.9186964 1.6316161 3.8460134 0.8730725
##  [8] 0.2999930 0.4805635 3.3968591 3.3819603 2.5069042 4.7366170 3.3562699
## [15] 3.5220432 1.9705725 0.3583773 0.0870234 1.5254890 1.2351271 1.3357053
## [22] 3.0499521 4.4417865 2.3594905 1.0161858 2.1065233 1.9090739 4.4640191
## [29] 3.9503450 2.4318771 3.2818762 3.7377104 0.4383893 1.5064659 4.9366694
## [36] 0.3012942 1.3159130
## 
## $chain2$mu
##  [1] 4.4957616 2.6535848 1.7346870 4.4368024 4.9186964 1.6316161 3.8460134
##  [8] 0.8730725 0.2999930 0.4805635 3.3968591 3.3819603 2.5069042 4.7366170
## [15] 3.3562699 3.5220432 1.9705725 0.3583773 0.0870234 1.5254890 1.2351271
## [22] 1.3357053 3.0499521 4.4417865 2.3594905 1.0161858 2.1065233 1.9090739
## [29] 4.4640191 3.9503450 2.4318771 3.2818762 3.7377104 4.8089227 0.4383893
## [36] 1.5064659 4.9366694 0.3012942 1.3159130
## 
## $chain2$log_p10
## [1] -2.56283
## 
## $chain2$alpha
## [1] 0.1721229
## 
## $chain2$p_dna
## [1] 0.4 0.4
## 
## $chain2$p11_dna
## [1] 0.39 0.39
## 
## 
## $chain3
## $chain3$mu_trad
##  [1] 3.8082303 4.6726811 3.3132446 0.1845146 1.9215927 1.3199387 4.9758713
##  [8] 1.8697917 3.7552545 3.6358025 4.3191917 0.2840704 3.6419785 1.3932532
## [15] 4.3429646 4.6664003 4.7449979 4.3781018 2.8623737 1.8657047 4.1168264
## [22] 1.5778176 1.2458169 1.1965604 4.2759154 3.4374041 3.9889890 3.7224000
## [29] 1.0350033 0.1641932 3.4139549 4.6668994 0.9946618 2.9303321 1.5850506
## [36] 0.5169849 2.9020398
## 
## $chain3$mu
##  [1] 3.8082303 4.6726811 3.3132446 1.1529801 0.1845146 1.9215927 1.3199387
##  [8] 4.9758713 1.8697917 3.7552545 3.6358025 4.3191917 0.2840704 3.6419785
## [15] 1.3932532 4.3429646 4.6664003 4.7449979 4.3781018 2.8623737 1.8657047
## [22] 4.1168264 1.5778176 1.2458169 1.1965604 4.2759154 3.4374041 3.9889890
## [29] 3.7224000 1.0350033 0.1641932 3.4139549 4.6668994 1.5363241 0.9946618
## [36] 2.9303321 1.5850506 0.5169849 2.9020398
## 
## $chain3$log_p10
## [1] -2.782121
## 
## $chain3$alpha
## [1] 0.1381386
## 
## $chain3$p_dna
## [1] 0.4 0.4
## 
## $chain3$p11_dna
## [1] 0.39 0.39
## 
## 
## $chain4
## $chain4$mu_trad
##  [1] 3.86932250 3.54228087 1.99222306 3.18539079 1.38929667 0.03670313
##  [7] 2.53574986 1.31176774 4.49483262 3.55833152 1.16607987 4.30204251
## [13] 0.04209219 2.81424150 4.59469543 2.61103743 0.39552150 0.86938500
## [19] 1.44779708 4.17041337 2.26866899 4.98467453 3.35567240 1.41307275
## [25] 0.24953291 3.08147437 2.97722879 3.69457710 3.79701967 4.65758948
## [31] 2.37760689 0.47790302 2.19544282 4.29782368 3.59081859 0.50235035
## [37] 3.63585373
## 
## $chain4$mu
##  [1] 3.86932250 3.54228087 1.99222306 3.32387716 3.18539079 1.38929667
##  [7] 0.03670313 2.53574986 1.31176774 4.49483262 3.55833152 1.16607987
## [13] 4.30204251 0.04209219 2.81424150 4.59469543 2.61103743 0.39552150
## [19] 0.86938500 1.44779708 4.17041337 2.26866899 4.98467453 3.35567240
## [25] 1.41307275 0.24953291 3.08147437 2.97722879 3.69457710 3.79701967
## [31] 4.65758948 2.37760689 0.47790302 4.49332113 2.19544282 4.29782368
## [37] 3.59081859 0.50235035 3.63585373
## 
## $chain4$log_p10
## [1] -2.576142
## 
## $chain4$alpha
## [1] 0.1036908
## 
## $chain4$p_dna
## [1] 0.4 0.4
## 
## $chain4$p11_dna
## [1] 0.39 0.39