Chapter 6 Use case 4: `joint_model()` with semi-paired data

This final use case shows how to fit and interpret the joint model with semi-paired eDNA and traditional survey data.

In many cases, you may have some sites where only eDNA samples were collected, while other sites have paired eDNA and traditional samples (i.e., semi-paired data). We can fit the joint model to this semi-paired data and use the sites with paired data to make predictions about the sites with only un-paired eDNA data.

Let’s use the same tidewater goby data from uses cases 1 and 2, except let’s pretend we did not have traditional seine sampling at two of the 39 sites.

6.1 Simulate semi-paired data

Both eDNA and traditional survey data should have a hierarchical structure:

Sites (primary sample units) within a study area
eDNA and traditional samples (secondary sample units) collected from each site
eDNA subsamples (replicate observations) taken from each eDNA sample

First replace the traditional count data at sites 4 and 34 with NA to simulate a scenario where we did not have traditional samples at these sites.

library(eDNAjoint)
data(goby_data)

# create new dataset of semi-paired goby data
goby_data_semipaired <- goby_data
goby_data_semipaired$count[4, ] <- NA
goby_data_semipaired$count[34, ] <- NA

We can now see that we have no data at site 4:

goby_data_semipaired$count[4, ]

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 
## NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

And site 34:

goby_data_semipaired$count[34, ]

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 
## NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

It’s important to ensure that we replace the data with NA so that we have an empty row indicating missing data.

6.2 Prepare the data

Now let’s look at the format of the data. You can see that the data is a list of four matrices all with the same number of rows that represent each site (n = 39). Across all matrices, rows in the data should correspond to the same sites (i.e., row one in all matrices corresponds to site one, and row 31 in all matrices corresponds to site 31).

pcr_k, pcr_n, and count are required for all implementations of joint_model(), and site_cov is optional and is used in use case 2.

Let’s look first at pcr_k. These are the total number of positive PCR detections for each site (row) and eDNA sample (column). The number of columns should equal the maximum number of eDNA samples collected at any site. Blank spaces are filled with NA at sites where fewer eDNA samples were collected than the maximum.

For example, at site one, 11 eDNA samples were collected, and at site five, 20 eDNA samples were collected.

head(goby_data_semipaired$pcr_k)

##      1 2 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
## [1,] 0 0 0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [2,] 0 0 0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [3,] 0 0 0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [4,] 0 6 6  4  6  5  4  6  5  3 NA NA NA NA NA NA NA NA NA NA NA NA
## [5,] 6 6 4  6  6  6  5  4  2  2  0  6  5  5  6  6  6  5  5  4 NA NA
## [6,] 0 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Now let’s look at pcr_n. These are the total number of PCR eDNA subsamples (replicate observations) collected for each site (row) and eDNA sample (column). In this data, six PCR replicate observations were collected for each eDNA sample. Notice that the locations of the NAs in the matrix match pcr_k.

head(goby_data_semipaired$pcr_n)

##      1 2 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
## [1,] 6 6 6  6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA
## [2,] 6 6 6  6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA
## [3,] 6 6 6  6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA
## [4,] 6 6 6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA NA
## [5,] 6 6 6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6 NA NA
## [6,] 6 6 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Next, let’s look at count. This data is from seine sampling for tidewater gobies. Each integer refers to the catch of each seine sample (i.e., catch per unit effort, when effort = 1). Again, the rows correspond to sites, and the columns refer to replicated seine samples (secondary sample units) at each site, with a maximum of 22 samples. Blank spaces are filled with NA at sites where fewer seine samples were collected than the maximum, as well as at sites with no seine samples. Notice that the number of rows (i.e., number of sites) in the count data still equals the number of rows in the PCR data, even though some of these sites have no count data.

dim(goby_data_semipaired$count)[1] == dim(goby_data_semipaired$pcr_k)[1]

## [1] TRUE

For more data formatting guidance, see section 2.1.1.

6.3 Fit the model

Now that we understand our data, let’s fit the joint model. The key arguments of this function include:

data: list of pcr_k, pcr_n, and count matrices
family: probability distribution used to model the seine count data. A poisson distribution is chosen here.
p10_priors: Beta distribution parameters for the prior on the probability of false positive eDNA detection, $p_{10}$ . c(1,20) is the default specification. More on this later.
q: logical value indicating the presence of multiple traditional gear types. Here, we’re only using data from one traditional method.

More parameters exist to further customize the MCMC sampling, but we’ll stick with the defaults.

# run the joint model
goby_fit_semi <- joint_model(data = goby_data_semipaired, family = "poisson",
                             p10_priors = c(1, 20), q = FALSE)

goby_fit_semi is a list containing:

model fit (goby_fit_semi$model) of the class ‘stanfit’ and can be accessed and interpreted using all functions in the rstan package.
initial values used for each chain in MCMC (goby_fit_semi$inits)

6.4 Interpret the output

6.4.1 Summarize posterior distributions

Let’s interpret goby_fit_semi. Use joint_summarize() to see the posterior summaries of the model parameters. Let’s look at the estimates of the expected catch rate at each site, $μ$ .

joint_summarize(goby_fit_semi$model, par = "mu")

##           mean se_mean    sd    2.5%   97.5%    n_eff Rhat
## mu[1]    0.021   0.000 0.022   0.001   0.081 16233.67    1
## mu[2]    0.021   0.000 0.022   0.001   0.081 14638.91    1
## mu[3]    0.021   0.000 0.022   0.000   0.080 16064.10    1
## mu[4]    5.403   0.014 1.739   2.862   9.556 15012.67    1
## mu[5]    2.428   0.002 0.301   1.880   3.049 15766.31    1
## mu[6]    0.080   0.001 0.085   0.002   0.308 14411.17    1
## mu[7]    0.338   0.001 0.105   0.169   0.578 16992.53    1
## mu[8]    0.278   0.003 0.315   0.006   1.160 13196.72    1
## mu[9]    0.048   0.000 0.050   0.001   0.184 14569.94    1
## mu[10]   1.514   0.002 0.204   1.146   1.943 16123.32    1
## mu[11]  26.949   0.008 1.125  24.824  29.162 18268.83    1
## mu[12]   0.021   0.000 0.022   0.001   0.078 18016.42    1
## mu[13]   0.021   0.000 0.022   0.001   0.080 14519.11    1
## mu[14]   0.033   0.000 0.034   0.001   0.126 14498.80    1
## mu[15]   0.100   0.000 0.062   0.018   0.255 16450.20    1
## mu[16]   0.066   0.000 0.049   0.006   0.193 16306.26    1
## mu[17]   0.025   0.000 0.026   0.001   0.096 14562.80    1
## mu[18]   0.020   0.000 0.020   0.001   0.074 16289.80    1
## mu[19]   0.048   0.000 0.049   0.001   0.181 14891.72    1
## mu[20]   0.224   0.001 0.081   0.096   0.413 19443.31    1
## mu[21]   0.082   0.001 0.087   0.002   0.314 14317.11    1
## mu[22]   0.253   0.001 0.114   0.086   0.524 17443.87    1
## mu[23]  90.576   0.022 3.176  84.437  96.976 20996.59    1
## mu[24]   1.051   0.002 0.212   0.690   1.515 16221.45    1
## mu[25]  21.055   0.011 1.576  18.119  24.229 20137.02    1
## mu[26]   1.282   0.002 0.208   0.916   1.728 17762.38    1
## mu[27]   2.713   0.003 0.463   1.901   3.697 19183.83    1
## mu[28]   8.149   0.005 0.749   6.768   9.676 18857.96    1
## mu[29]   3.221   0.003 0.457   2.402   4.186 18108.32    1
## mu[30]  13.699   0.016 2.114   9.857  18.152 18571.74    1
## mu[31]  94.489   0.031 4.350  86.080 103.210 20147.49    1
## mu[32]   0.023   0.000 0.023   0.001   0.086 15930.90    1
## mu[33]   0.081   0.001 0.085   0.002   0.306 16427.12    1
## mu[34]   0.260   0.001 0.149   0.062   0.632 17461.88    1
## mu[35]   0.125   0.001 0.132   0.003   0.486 14135.71    1
## mu[36]   0.022   0.000 0.022   0.001   0.081 13603.72    1
## mu[37]   0.023   0.000 0.023   0.001   0.086 15932.27    1
## mu[38] 168.897   0.041 5.775 157.717 180.399 19592.86    1
## mu[39]   0.023   0.000 0.023   0.001   0.085 15852.18    1

This summarizes the mean, standard deviation (sd), and quantiles of the posterior estimates of $μ$ , as well as the effective sample size (n_eff) and $\hat{R}$ (Rhat) for the parameters. More informative about effective sample size and $\hat{R}$ can be found in the algorithm convergence section, but briefly, the $\hat{R}$ value is the frequently used statistic for assessing model convergence. We typically want $\hat{R}$ to be less than 1.05, so it looks like our model converged. The 2.5% and 97.5% quantiles show the bounds of the 95% credibility interval (the equal tailed credibility interval, to be specific.)

The model estimates $β_{i}$ using data from sites with paired samples and uses this estimate to make predictions for $μ_{i = 4}$ and $μ_{i = 34}$ where no traditional seine samples were collected.

We can also use functions from the bayesplot package to examine the posterior distributions.

First let’s look at the posterior distributions for $μ_{i = 4, k = 1}$ and $μ_{i = 29, k = 1}$ . Traditional seine samples were collected at site 29 but not at site 4.

library(bayesplot)
# plot posterior distribution, highlighting median and 80% credibility interval
mcmc_areas(as.matrix(goby_fit_semi$model),
           pars = c("mu[4,1]", "mu[29,1]"), prob = 0.8)

As you could expect, the credibility interval for the expected catch rate at site four is much wider than the credibility interval at site 29, since no traditional samples were collected at site four.

Let’s do the same for $μ_{i = 7, k = 1}$ and $μ_{i = 34, k = 1}$ . Traditional seine samples were collected at site 7 but not at site 34.

# plot posterior distribution, highlighting median and 80% credibility interval
mcmc_areas(as.matrix(goby_fit_semi$model),
           pars = c("mu[7,1]", "mu[34,1]"), prob = 0.8)

Again, the credibility interval for the expected catch rate at site 34 is wider with a longer tail than the credibility interval at site 7, since no traditional samples were collected at site 34.

Note: It’s important to consider that fitting the joint model with semi-paired data will be more successful if there is paired data at most sites. Since the data at paired sites is leveraged to make predictions at the un-paired sites, you want to maximize the amount of data to leverage.

6.5 Initial values

By default, eDNAjoint will provide initial values for parameters estimated by the model, but you can provide your own initial values if you prefer. Here is an example of providing initial values for parameters, mu,p10, alpha, and q, as an input in joint_model().

# set number of chains
n_chain <- 4

# number of sites
nsites <- dim(goby_data_semipaired$count)[1]

# initial values should be a list of named lists
inits <- list()
for (i in 1:n_chain) {
  inits[[i]] <- list(
    # length should equal the number of sites for each chain
    mu = stats::runif(nsites, 0.01, 5),
    # length should equal 1 for each chain
    p10 = stats::runif(1, 0.0001, 0.08),
    # length should equal 1 for each chain,
    # since there are no site-level covariates
    alpha = stats::runif(1, 0.05, 0.2)
  )
}

# now fit the model
fit_inits <- joint_model(data = goby_data_semipaired, initial_values = inits,
                         multicore = TRUE)

# check to see the initial values that were used
fit_inits$inits

## $chain1
## $chain1$mu_trad
##  [1] 1.3227859 3.6916150 0.9837692 2.1254940 1.5012412 3.0808618 3.5284120 0.9454951 0.6628305 4.6267079
## [11] 4.8932631 2.8575870 2.7445088 3.5415297 0.7315802 0.7533260 3.6916973 4.5407505 1.2669971 2.6991378
## [21] 0.1613091 0.4041204 3.0748685 4.8767813 1.6037583 1.7304875 1.8434497 2.6225432 3.6951972 0.8573006
## [31] 3.0994026 1.4291258 0.4673632 0.6480119 2.3073407 0.3504527 1.3208269
## 
## $chain1$mu
##  [1] 1.3227859 3.6916150 0.9837692 0.8172015 2.1254940 1.5012412 3.0808618 3.5284120 0.9454951 0.6628305
## [11] 4.6267079 4.8932631 2.8575870 2.7445088 3.5415297 0.7315802 0.7533260 3.6916973 4.5407505 1.2669971
## [21] 2.6991378 0.1613091 0.4041204 3.0748685 4.8767813 1.6037583 1.7304875 1.8434497 2.6225432 3.6951972
## [31] 0.8573006 3.0994026 1.4291258 1.6436970 0.4673632 0.6480119 2.3073407 0.3504527 1.3208269
## 
## $chain1$log_p10
## [1] -2.557069
## 
## $chain1$alpha
## [1] 0.1176338
## 
## $chain1$p_dna
## [1] 0.4 0.4
## 
## $chain1$p11_dna
## [1] 0.39 0.39
## 
## 
## $chain2
## $chain2$mu_trad
##  [1] 3.59063542 3.95674347 3.73253714 3.98400552 1.36366029 3.01130594 1.70766046 0.34274060 1.23072903
## [10] 1.04769198 3.70842729 0.46283153 2.38976079 3.08340641 2.08961872 0.10108430 2.12729192 2.75370300
## [19] 4.71879771 3.37064120 3.93212937 1.50605127 1.23569982 2.38590971 1.70705212 4.24436419 4.18185253
## [28] 2.01597770 2.29422006 4.14366729 2.67939124 4.38642098 0.09918503 0.77051777 3.54803402 4.93210771
## [37] 2.06295781
## 
## $chain2$mu
##  [1] 3.59063542 3.95674347 3.73253714 1.53784657 3.98400552 1.36366029 3.01130594 1.70766046 0.34274060
## [10] 1.23072903 1.04769198 3.70842729 0.46283153 2.38976079 3.08340641 2.08961872 0.10108430 2.12729192
## [19] 2.75370300 4.71879771 3.37064120 3.93212937 1.50605127 1.23569982 2.38590971 1.70705212 4.24436419
## [28] 4.18185253 2.01597770 2.29422006 4.14366729 2.67939124 4.38642098 4.35725737 0.09918503 0.77051777
## [37] 3.54803402 4.93210771 2.06295781
## 
## $chain2$log_p10
## [1] -4.871264
## 
## $chain2$alpha
## [1] 0.1198508
## 
## $chain2$p_dna
## [1] 0.4 0.4
## 
## $chain2$p11_dna
## [1] 0.39 0.39
## 
## 
## $chain3
## $chain3$mu_trad
##  [1] 3.7111455 0.4816230 2.6491037 4.4276805 4.3439870 3.7831909 4.6072097 2.9371963 1.1353008 1.7972700
## [11] 1.4050419 4.7069761 4.2580606 1.8573563 1.9683906 2.9729225 4.7016567 0.3879045 0.8037295 3.4195511
## [21] 3.8738332 0.1232631 2.6472935 3.5998153 2.8826638 1.1397462 1.6749636 4.8244455 1.6738675 3.2549020
## [31] 1.9008674 4.4658675 0.9892450 0.2140093 0.4813074 3.2953900 3.0670911
## 
## $chain3$mu
##  [1] 3.7111455 0.4816230 2.6491037 3.5001317 4.4276805 4.3439870 3.7831909 4.6072097 2.9371963 1.1353008
## [11] 1.7972700 1.4050419 4.7069761 4.2580606 1.8573563 1.9683906 2.9729225 4.7016567 0.3879045 0.8037295
## [21] 3.4195511 3.8738332 0.1232631 2.6472935 3.5998153 2.8826638 1.1397462 1.6749636 4.8244455 1.6738675
## [31] 3.2549020 1.9008674 4.4658675 3.4498749 0.9892450 0.2140093 0.4813074 3.2953900 3.0670911
## 
## $chain3$log_p10
## [1] -2.705478
## 
## $chain3$alpha
## [1] 0.1159524
## 
## $chain3$p_dna
## [1] 0.4 0.4
## 
## $chain3$p11_dna
## [1] 0.39 0.39
## 
## 
## $chain4
## $chain4$mu_trad
##  [1] 1.4074775 2.8061541 4.6674241 3.8346694 4.5148164 3.6983434 4.9417946 2.3841883 3.6464028 2.6929954
## [11] 0.4064378 3.4508663 1.2077832 3.0299879 3.0175900 3.9658047 3.1371346 2.7449234 3.5630390 1.6210005
## [21] 0.8554885 4.7037371 0.5838176 2.3608219 4.0082376 2.9118945 3.1706043 4.6953957 4.7373760 2.6386403
## [31] 3.1042481 1.5034714 3.7469587 1.5915333 3.5993356 0.1478814 2.2945713
## 
## $chain4$mu
##  [1] 1.4074775 2.8061541 4.6674241 0.9664594 3.8346694 4.5148164 3.6983434 4.9417946 2.3841883 3.6464028
## [11] 2.6929954 0.4064378 3.4508663 1.2077832 3.0299879 3.0175900 3.9658047 3.1371346 2.7449234 3.5630390
## [21] 1.6210005 0.8554885 4.7037371 0.5838176 2.3608219 4.0082376 2.9118945 3.1706043 4.6953957 4.7373760
## [31] 2.6386403 3.1042481 1.5034714 2.9969914 3.7469587 1.5915333 3.5993356 0.1478814 2.2945713
## 
## $chain4$log_p10
## [1] -2.65019
## 
## $chain4$alpha
## [1] 0.0698874
## 
## $chain4$p_dna
## [1] 0.4 0.4
## 
## $chain4$p11_dna
## [1] 0.39 0.39

Chapter 6 Use case 4: joint_model() with semi-paired data