Using DSsim to investigate truncation distances with individual level covariates.

Distance sampling is a process in which a study area is surveyed to estimate the size of the population within it. It can be thought of as an extension to plot sampling. However, while plot sampling assumes that all objects within the plots are detected, distance sampling relaxes this assumption. To do this, distance sampling makes assumptions about the distribution of objects with respect to the transects, and to satisfy these assumptions the transects (the points or lines) must be randomly located within the study region. Note that for the purposes of distance sampling an object can be either an individual or a cluster of individuals.

The next step in distance sampling is then to record the distances from each detected object to the transect it was detected from and fit a detection function. From this function we can estimate how many objects were missed and hence the total number in the covered area. For example, Figure 1 shows histograms of distances that might be collected on a line transect survey, with a fitted detection function. If the lines have been placed at random within the study region then we would expect, on average, the same number of objects to occur at any given distance from the transect. Therefore the drop in the number of detections with increasing distance from the line can be attributed to a failure to detect all objects. We can therefore estimate from this detection function that the probability of seeing an object within the covered region out to a chosen truncation distance is the area under the curve (shaded grey) divided by the area of the rectangle.
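The area ratio described above can be worked through numerically. The following is a minimal base R sketch, assuming a half-normal detection function; the scale parameter and truncation distance are hypothetical values chosen only for illustration.

```r
# Half-normal detection function: g(x) = exp(-x^2 / (2 * sigma^2)), with g(0) = 1
g <- function(x, sigma) exp(-x^2 / (2 * sigma^2))

sigma <- 200   # hypothetical scale parameter (m)
w     <- 600   # hypothetical truncation distance (m)

# Area under the detection function between the line and the truncation distance
area.curve <- integrate(g, lower = 0, upper = w, sigma = sigma)$value

# Area of the rectangle of height g(0) = 1 and width w
area.rect <- 1 * w

# Average probability of detecting an object within the covered region
p.detect <- area.curve / area.rect
round(p.detect, 3)
```

With these assumed values roughly 42% of the objects within 600 m of the line would be detected on average.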

The R package DSsim (Marshall, 2019) allows users to simulate both point and line transect surveys, and test out a range of design and analysis decisions specific to their population of interest. To simulate surveys the user must make some assumptions about the population of interest and the detection process giving rise to the observed distances. Simulations can be repeated over a range of assumptions so that the user can be confident that their chosen design will perform well despite any uncertainty.

While these simulations focus on the issue of data truncation at the analysis stage, an example of using DSsim to compare survey designs can be found at https://synergy.st-andrews.ac.uk/ds-manda/#survey-design-simulation-case-study. Other example simulations can be found in the DSsim wiki at https://github.com/DistanceDevelopment/DSsim/wiki.

DSsim takes information from the user on the study region, population and detection process and uses it to generate distance sampling data. DSsim can then be asked to fit detection functions to this data and produce estimates of density, abundance and the associated uncertainty. DSsim splits this process into three stages. Firstly, it generates an instance of a population and a set of survey transects. Secondly, it simulates the distance sampling survey using the assumed detection function(s) provided by the user. Lastly, DSsim analyses the data from the survey. Figure 2 illustrates the simulation process and highlights the information which must be provided by the user.

DSsim is written using the S4 object-oriented system in R. The S4 system is a more formal and rigorous style of object-oriented programming than the more commonly used S3. The process of defining a simulation involves the specification of many variables relating to the survey region, population, survey design and finally the analysis. The design of DSsim is based around each of these descriptions being contained in its own class, and the formal S4 class definition procedure ensures that the objects created are of the correct format for the simulation. As the objects created by DSsim are instances of S4 classes, the symbol used to access information within them is slightly different. To access named parts of S3 objects the “$” symbol would be used, while for the slots of S4 objects the “@” symbol must be used. The following code demonstrates this.

```
# load DSsim
library(DSsim)
# Make a default region object
eg.region <- make.region()
# Let's check the structure of the object we have created
str(eg.region)
```

```
Formal class 'Region' [package "DSsim"] with 7 slots
..@ region.name: chr "region"
..@ strata.name: chr(0)
..@ units : chr "m"
..@ area : num 1e+06
..@ box : Named num [1:4] 0 0 2000 500
.. ..- attr(*, "names")= chr [1:4] "xmin" "ymin" "xmax" "ymax"
..@ coords :List of 1
.. ..$ :List of 1
.. .. ..$ :'data.frame': 5 obs. of 2 variables:
.. .. .. ..$ x: num [1:5] 0 0 2000 2000 0
.. .. .. ..$ y: num [1:5] 0 500 500 0 0
..@ gaps :List of 1
.. ..$ : list()
```

```
# If we wanted to extract the area of the region we would use
eg.region@area
```

```
[1] 1e+06
```

It is usual in distance sampling studies to truncate the data at some distance from the transect. This is because the observations far away from the transect are of lesser importance when fitting the detection function and also these sparse observations at large distances could have high influence on model selection and possibly increase variability in estimated abundance / density.

Buckland et al. (2001) suggest truncating the data where the probability of detection is around 0.15 as a general rule of thumb. However, distance sampling data are often costly to obtain and discarding some of the data points can feel counterintuitive. In this vignette we investigate truncation distance in distance sampling analyses. We will do this through a series of three simulations outlined below.

Firstly, this vignette will investigate data generated assuming a simple half normal detection function where every object has the same probability of detection at a specific distance from the transect. Figure 3 shows a simple half normal detection function with three possible truncation distances at \(1*\sigma\), \(2*\sigma\) and \(3*\sigma\) where \(\sigma\) is the scale parameter of the half normal detection function. The truncation distance at \(2*\sigma\) gives a probability of detection of 0.135 so close to the 0.15 rule of thumb.
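The detection probabilities at the three candidate truncation distances follow directly from the half-normal form \(g(x) = \exp(-x^2 / (2\sigma^2))\). A short base R check, in which the value of sigma is arbitrary because only the ratio of distance to sigma matters:

```r
# Half-normal detection function
g <- function(x, sigma) exp(-x^2 / (2 * sigma^2))

sigma <- 1   # arbitrary: only x / sigma matters for these values
g.at <- g(c(1, 2, 3) * sigma, sigma)
names(g.at) <- c("1*sigma", "2*sigma", "3*sigma")

round(g.at, 3)
```

This gives approximately 0.607, 0.135 and 0.011, so truncating at \(2*\sigma\) is the candidate closest to the 0.15 rule of thumb.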

While the first set of simulations assumes a simple half normal detection function, in reality individual objects or clusters of objects will likely have varying probabilities of being detected based on certain characteristics. Perhaps the behaviour of males will make them easier to detect. It is also easy to see that larger clusters of individuals will be easier to spot at large distances than small clusters. We will also investigate the effects of truncation distance when individual level covariates affect the probability of detection. Figure 4 shows how covariates may affect detectability. The most reliable way to predict covariate effects is based on previous surveys. We will use this type of data to investigate the effects of truncation both when we assume that we were not able to measure the covariate affecting detectability and when we assume that we could, and can therefore include the relevant covariate in the detection function model.

When we simulate data, we have to provide the detection function to generate detections, and we therefore know the underlying true detection function. When collecting data in the field, we will not have this information, and so we will have to rely on some form of model selection. One method of model selection is to compare information criteria. DSsim allows the user to select AIC, AICc or BIC as the model selection criterion. For these simulations we will use AIC and allow DSsim to select between a half-normal and a hazard rate model.
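To make the idea of AIC-based selection between a half-normal and a hazard rate model concrete, the following base R sketch fits both key functions by maximum likelihood to simulated line transect distances and picks the model with the lower AIC. It is an illustration only and does not show how DSsim or mrds fit models internally; the seed, sample size, starting values and parameter bounds are all assumptions.

```r
set.seed(1)
w <- 600            # assumed truncation distance (m)
sigma.true <- 200   # assumed scale of the generating half-normal (m)

# Simulate distances: uniform candidates thinned by the half-normal g(x)
cand <- runif(2000, 0, w)
x <- cand[runif(2000) < exp(-cand^2 / (2 * sigma.true^2))]

# Key functions, with parameters on the log scale so they stay positive
g.hn <- function(x, p) exp(-x^2 / (2 * exp(p[1])^2))
g.hr <- function(x, p) -expm1(-(x / exp(p[1]))^(-exp(p[2])))

# Negative log-likelihood for the density f(x) = g(x) / integral of g over [0, w]
nll <- function(p, key) {
  mu <- tryCatch(integrate(key, 0, w, p = p)$value,
                 error = function(e) NA_real_)
  if (!is.finite(mu) || mu <= 0) return(Inf)
  -sum(log(key(x, p))) + length(x) * log(mu)
}

# One parameter for the half-normal, two for the hazard rate
fit.hn <- optimize(function(p) nll(p, g.hn), interval = log(c(10, 5000)))
fit.hr <- optim(c(log(200), log(2)), nll, key = g.hr)

# AIC = 2 * (number of parameters) + 2 * (minimised negative log-likelihood)
aic <- c(hn = 2 * 1 + 2 * fit.hn$objective,
         hr = 2 * 2 + 2 * fit.hr$value)
best <- names(which.min(aic))
aic
best
```

Because the hazard rate model carries an extra parameter, it is only preferred when its improvement in likelihood outweighs the penalty of 2 AIC units.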

In addition, if the probability of detection is affected by covariates then we may not only have a single underlying detection function but a combination of detection functions giving rise to our observed data. In this situation we can either model detectability as a function of these covariates or rely on a concept called pooling robustness. Pooling robustness refers to the fact that distance sampling techniques are robust to the pooling of multiple detection functions into one. This means that we do not necessarily need to include all the covariates which affect detectability in the detection function to estimate density / abundance. This vignette will examine the concept of pooling robustness to see if it is affected by truncation distance.

This vignette will guide you through the steps to create and run a series of simulations to investigate the effects of varying truncation distance on both data generated from a simple half-normal model and from a model where detectability is affected by a covariate.

First we load the DSsim library.

As detailed in Introduction to DSsim, a simulation comprises a number of components. DSsim is designed so that each of these components is defined individually before they are grouped together into a simulation. This helps keep the process clear and also allows reuse of simulation components between different simulations. Each of the functions that create a simulation component or a simulation has a name of the form *make.<component>*.

These simulations will use a rectangular study region of 5 km by 20 km. Survey regions can be defined in either km or m, but all units must be the same throughout the components of the simulation. Here we will define the coordinates in m. As this is a simple study region (Figure 5) with few vertices we can simply provide the coordinates. The structure of the coordinates is a list of data.frames for each stratum, themselves grouped together in a list. In this example we only have one polygon (so one data.frame) and one stratum (one element in the main list).
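The nested list structure described above can be sketched in base R as follows. The vertex order and the choice of x as the 20 km axis are assumptions, made to mirror the default region printed earlier; the area check uses the shoelace formula and is not part of DSsim.

```r
# Outer list: one element per stratum
# Inner list: one data.frame of vertices per polygon within that stratum
coords <- list(
  list(
    data.frame(x = c(0, 0, 20000, 20000, 0),
               y = c(0, 5000, 5000, 0, 0))
  )
)

# Sanity check: area of the single polygon via the shoelace formula
poly <- coords[[1]][[1]]
n <- nrow(poly)
area <- abs(sum(poly$x[-n] * poly$y[-1] - poly$x[-1] * poly$y[-n])) / 2
area   # in square metres, since the coordinates are in metres
```

The computed area of 1e+08 square metres corresponds to the 5 km by 20 km (100 square km) region.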

We will now define our population within our study region. Firstly, we must describe the distribution of the population by defining a density surface. For these simulations we will assume a uniform distribution of animals throughout the study region. DSsim will generate a grid describing the density surface for us if we provide the x and y spacing and a constant density value for the surface. In this example the value of the constant is not important as we will generate animals based on a fixed population size rather than using the exact values in the density grid.

As an aside, if we wished to add areas of higher or lower density to our density surface we could do this using the *add.hotspot* function in DSsim. This function adds these hot or low spots based on a Gaussian decay function. We have to provide the central coordinates and a sigma value to tell DSsim about the location and shape of the hot/low spot. The amplitude argument gives the value of the hot or low spot at its centre and is combined with the existing density surface through addition.
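The effect of adding a hotspot to a constant surface can be illustrated in base R. This is a sketch of the idea only: it assumes the Gaussian decay takes the form amplitude * exp(-d^2 / (2 * sigma^2)), where d is the distance from the centre, and the grid spacing, centre, sigma and amplitude are hypothetical values. It does not call *add.hotspot*.

```r
# Constant density surface on a regular grid (hypothetical 500 m spacing)
grid <- expand.grid(x = seq(0, 20000, by = 500),
                    y = seq(0, 5000, by = 500))
grid$density <- 1

# Hypothetical hotspot settings
centre    <- c(10000, 2500)
sigma     <- 2000
amplitude <- 3

# Squared distance of every grid point from the hotspot centre
d2 <- (grid$x - centre[1])^2 + (grid$y - centre[2])^2

# Combine with the existing surface through addition
grid$density <- grid$density + amplitude * exp(-d2 / (2 * sigma^2))

range(grid$density)
```

A negative amplitude would produce a low spot instead, provided the resulting density stays non-negative.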

We can now define other aspects of the population. For the simple case (with no covariates) we only need to define a fixed population size and provide the region and density grid we created above. This fixed population size of 200 was selected as a value sufficient to give around 100 detections per simulated survey while not being so large as to cause the simulations to run very slowly. The minimum recommended number of detections for fitting a detection function is 60 (Buckland et al., 2001).

For our simulations involving covariates we need to define how individuals will be allocated these covariate values. DSsim allows the user either to define their own discrete distribution or to provide a distribution (Normal, Poisson, Zero-truncated Poisson or Lognormal) with associated parameters. For these simulations we will use sex as a covariate and assume that 50% of the population are female and 50% are male.
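Allocating a discrete covariate to each individual amounts to sampling levels with the stated probabilities. A minimal base R sketch, assuming a two-column table of levels and probabilities (the column names and the seed are illustrative, not a DSsim requirement):

```r
set.seed(123)
N <- 200   # fixed population size used in these simulations

# User-defined discrete distribution for the covariate
sex.dist <- data.frame(level = c("female", "male"),
                       prob  = c(0.5, 0.5))

# Allocate a level to each of the N individuals
sex <- sample(sex.dist$level, size = N, replace = TRUE, prob = sex.dist$prob)
table(sex)
```

Each simulated population will contain close to, but not exactly, 100 individuals of each sex.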

Detectability refers to the detection function or functions we feed into the simulation to generate the observations. In the simple case we can set all animals to have the same probability of detection given their distance from the transect. Here we define a half-normal detection function with scale parameter \(\sigma = 200\) and data generation truncation distance of 1000. The truncation distance defined here is to aid simulation efficiency and means that no detections can occur beyond this value. We can then plot this function to check we have defined it correctly. As we defined our survey region in m the scale parameter and truncation distance will also be assumed to be in metres.

The scale parameter of 200 was selected as on average it gives around 100 detections out to a truncation distance of 1000m with our chosen population size of 200.
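The figure of around 100 detections can be checked with a back-of-envelope calculation. The sketch below assumes 20 parallel lines, each 5000 m long, crossing the short axis of the region, and ignores edge effects and any overlap between the strips of neighbouring lines.

```r
N      <- 200     # population size
sigma  <- 200     # half-normal scale parameter (m)
w      <- 1000    # data generation truncation distance (m)
region.area <- 20000 * 5000   # study region area (square metres)

# Assumed design: 20 lines, each spanning the 5000 m width of the region
n.lines     <- 20
line.length <- 5000

# Effective strip half-width: integral of g(x) from 0 to w
mu <- integrate(function(x) exp(-x^2 / (2 * sigma^2)), 0, w)$value

# Expected detections = N * (effective area searched) / (region area)
expected.n <- N * (2 * mu * n.lines * line.length) / region.area
round(expected.n)
```

Under these assumptions the expected number of detections is very close to 100, which agrees with the value quoted above.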

When we have covariates in the population we may choose to vary the scale parameter of the detection function based on the covariate values. DSsim assumes that the scale parameter is a function of the covariates as follows:

\[ \sigma = \exp\left(\beta_0+\sum_{j=1}^{q}\beta_{j}z_{ij}\right) \]

where \(\beta_0\) is the log of the scale parameter supplied to *make.detectability*, the \(\beta_j\)’s are the covariate parameters supplied on the log scale and \(z_{ij}\) is the ith value of the jth covariate. This formula was taken from Buckland et al. (2004).

The covariate values were selected so that males had a higher probability of detection than females. The values selected in this example give a sample size of around 150 observations out to the 1000m truncation value for our population of 200.

```
[1] 537.8027
```
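The formula for the scale parameter can be evaluated directly. In the sketch below the baseline scale of 120 and the male coefficient of 1.5 are assumptions: they are not stated in the text, but were chosen because they are consistent with the value printed above. Female is treated as the reference level.

```r
# Assumed values (not stated in the text)
beta0     <- log(120)   # log of the baseline scale parameter
beta.male <- 1.5        # covariate parameter for males, on the log scale

# Indicator covariate: 0 for females (reference level), 1 for males
z <- c(female = 0, male = 1)

# sigma = exp(beta0 + beta * z)
sigma <- exp(beta0 + beta.male * z)
round(sigma, 4)
```

Under these assumptions females have a scale parameter of 120 and males one of about 537.8, which gives two very distinct detection functions.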

DSsim-1.1.4 implements two basic designs, systematic parallel lines and a systematic grid of points, to generate transects. Other more complex designs can be used with the aid of the Distance for Windows software (Thomas et al., 2010). This software allows more complex designs to be defined and can generate transects and store them in the form of shapefiles which DSsim can then read in and use.

For the purposes of this example we will use the parallel systematic line transect design built into DSsim. As the recommended minimum number of transects is between 10 and 20 (Buckland et al., 2001) we have set the spacing between the lines to be 1000 m to give 20 transects per survey.

```
transects <- generate.transects(design, region = region)
plot(region, plot.units = "km")
plot(transects, col = 4, lwd = 2)
```
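The relationship between line spacing and the number of transects can be sketched in base R. This is an illustration of a systematic parallel design with a random start, not the algorithm DSsim uses; the orientation of the lines (running across the 5 km axis) and the seed are assumptions.

```r
set.seed(42)
spacing <- 1000                 # spacing between lines (m)
x.min <- 0; x.max <- 20000      # long axis of the region (m)
y.min <- 0; y.max <- 5000       # short axis of the region (m)

# Random start within the first spacing interval, then a regular grid of lines
start <- runif(1, 0, spacing)
x.pos <- seq(x.min + start, x.max, by = spacing)

# Each line runs the full width of the region at a fixed x position
lines.df <- data.frame(x = x.pos, y.start = y.min, y.end = y.max)
nrow(lines.df)
```

A spacing of 1000 m across the 20 km axis yields 20 lines, whatever the random start.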

The final stage of the simulation is to analyse the distance sampling data that has been generated. As discussed above, when collecting data in the field we would not know the true underlying detection function and will therefore incorporate model uncertainty. We can ask the simulation to fit two models, a half-normal and a hazard rate, to the data and select the best model based on the minimum AIC as follows:

In this code we have set the truncation distance to 600 but later we will vary this value to investigate the effects of truncation distance on our simulation results. Note that while the truncation distance can be set to any value, it should not exceed the truncation value defined in the detectability as no observations will be made beyond this value.

In addition, in the field it may be possible to identify the covariates that affect detectability so we may wish to fit a detection function that incorporates this. In this example the following model would be appropriate:

The simulation is created by grouping all these components together. It is then a good idea to check that everything is as you intended. The function *check.sim.setup* provides a number of plots to help you do this (Figures 11 and 12).

We will now create a second simulation object for the simulations with covariates. We can re-use the region and design components and then add in the new population description and detectability to include the sex covariate. Here we include the same non-covariate analyses but for the final set of simulations we will change this to fit the covariate detection function model.

To investigate the effects of varying the truncation distance during analyses we need to run not just one simulation, but one for each truncation distance. The following code shows how we iterated over a number of different truncation distances and stored the results and summaries of all these simulations as lists.

As the simulations including covariates mean that we have a mixture of detection functions, the easiest way to select candidate truncation distances for the analyses is to plot some example data. Figure 12 shows data generated from a population size of 2500. This increase in population size increases the number of detections and makes the shape of the resulting data less variable. From this histogram five candidate truncation distances were selected and are shown by the red vertical lines. These were selected so that the truncation distances represented a range of values for the probability of detection, starting at about 0.6 for the shortest truncation distance.

We can now feed these candidate truncation distances into our covariate simulations in the same way as we did for the simple half normal simulation and again store the results and summaries as lists.
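The pattern of looping over candidate truncation distances and storing each result in a named list can be sketched in base R. This is a toy stand-in for a full DSsim simulation: it simulates one set of half-normal distances, refits a half-normal model at each truncation distance and stores the output under names of the form t200, t400 and so on. The seed, the number of candidate objects and the optimiser bounds are assumptions.

```r
set.seed(2019)
sigma.true <- 200    # generating scale parameter (m)
w.max      <- 1000   # data generation truncation distance (m)

# Candidates uniform on [0, w.max], detected with half-normal probability
cand     <- runif(400, 0, w.max)
detected <- cand[runif(400) < exp(-cand^2 / (2 * sigma.true^2))]

# Fit a half-normal by maximum likelihood to the data truncated at 'trunc'
fit.at <- function(trunc) {
  x <- detected[detected <= trunc]
  nll <- function(sigma) {
    mu <- sigma * sqrt(2 * pi) * (pnorm(trunc / sigma) - 0.5)
    sum(x^2) / (2 * sigma^2) + length(x) * log(mu)
  }
  sigma.hat <- optimize(nll, interval = c(10, 5000))$minimum
  mu.hat <- sigma.hat * sqrt(2 * pi) * (pnorm(trunc / sigma.hat) - 0.5)
  list(truncation = trunc,
       n = length(x),
       sigma.hat = sigma.hat,
       # Estimated number of candidates out to w.max (the true value is 400)
       N.hat = length(x) * w.max / mu.hat)
}

truncs <- c(200, 400, 600)
results.list <- lapply(truncs, fit.at)
names(results.list) <- paste0("t", truncs)

sapply(results.list, function(r) round(r$N.hat))
```

Individual results can then be retrieved by name, for example `results.list$t200`.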

Finally, we may also wish to fit the covariate model we used to generate the data rather than the non-covariate half-normal and hazard rate models. This will allow us to investigate the effects of truncation if we were in fact aware of, and could “measure”, the covariate that we knew to be affecting detectability.

The above simulations concentrate on the question of how accurately and precisely we can estimate the abundance and density of a population. However, we may also be interested in learning how individual level covariates affect detectability. To do this we require a different and slightly more advanced setup. DSsim does not currently store the detection function parameter estimates, so we need to do this manually; however, DSsim does provide functionality that makes this fairly straightforward. As before we create our simulation, but then we need to get DSsim to give us the survey data so that we can run the analyses and obtain the parameter estimates. Please note that the extraction of the parameter estimates from the ddf model is specific to this model. If you are adapting this code you will need to check the ddf documentation in mrds to understand the parameters for different models.

As these simulations take a substantial amount of time to run we have saved the results summaries into this package as data. Running one of these simulations with 999 repetitions for one truncation distance takes about 11 minutes on an i7-2600K 3.40GHz processor when running in parallel across 7 threads. When running in parallel the maximum number of cores (or threads) permitted is one less than the number on the machine. This is the default number used unless max.cores specifies a lower number.

```
# Running simulations in parallel
run(sim.cov, run.parallel = TRUE, max.cores = 7)
```
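The rule for the number of cores described above can be expressed in a few lines of base R using the parallel package that ships with R. This sketch reproduces the stated rule only and is not DSsim's internal code; the fallback of 2 cores when the count cannot be determined is an assumption.

```r
# Number of cores (or threads) reported for this machine
n.cores <- parallel::detectCores()
if (is.na(n.cores)) n.cores <- 2   # assumed fallback if the count is unknown

# Maximum permitted: one less than the number on the machine (at least 1)
permitted <- max(1, n.cores - 1)

# The value requested via max.cores only takes effect if it is lower
requested <- 7
cores.used <- min(requested, permitted)
cores.used
```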

The results summaries can be loaded as follows:

```
# Simulations using a simple half normal detection function
data(trunc_summary)
# Covariate simulations
data(trunc_cov_summary)
# Covariate simulations with covariate model
data(covmod_summary)
data(cov_param)
```

The objects this has loaded into the workspace include *summary.list*, *cov.summary.list*, *covmod.summary.list* and *param.list*:

- *summary.list* is a list of 3 simulation summaries for the simple half normal simulations with truncation distances of 200, 400 and 600.
- *cov.summary.list* is a list of 5 simulation summaries for the covariate simulations where detectability is affected by sex but sex is not included as a covariate in the detection function models. These simulations relate to truncation distances of 200, 400, 600, 800 and 1000.
- *covmod.summary.list* is a list of 5 simulation summaries for the covariate simulations where detectability is affected by sex and the analyses include the covariate sex in the detection function model. These simulations relate to truncation distances of 200, 400, 600, 800 and 1000.
- *param.list* contains the parameter estimates from the same simulation setup as *covmod.summary.list*. It is a list of two 2D arrays, the first containing parameter estimates for sigma for the five truncation distances and the second containing parameter estimates for the male sex parameter for each truncation distance.

```
# To view the full summary for the simple half normal simulation with a truncation distance of 200:
summary.list$t200
# To view the full summary for the covariate simulation with a truncation distance of 600:
cov.summary.list$t600
```

To keep the size of the DSsim package down these objects only store the simulation summaries not the full simulation objects with the complete results from each iteration. Copies of the simulation objects with the full results can be downloaded at https://github.com/DistanceDevelopment/DSsim/tree/master/Vignette%20Results/Truncation_Covariate_Results. These follow the same naming and structure as the summary list objects with the word summary replaced with results. These can be used to obtain histograms of the abundance / density estimates, etc., if desired.

To investigate how truncation distance affects the results we need to produce tables for comparison. This section details how this can be done using knitr. It is provided for those interested, but users can skip to the next section where the results tables are presented. This code is only applicable to study regions which have a single stratum; it would need to be modified to deal with multiple strata.

```
library(knitr)

# Extract one statistic per truncation distance from each simulation summary
N    <- unlist(lapply(summary.list, function(x){x@individuals$N$mean.Estimate}))
n    <- unlist(lapply(summary.list, function(x){x@individuals$summary$mean.n}))
se   <- unlist(lapply(summary.list, function(x){x@individuals$N$mean.se}))
sd.N <- unlist(lapply(summary.list, function(x){x@individuals$N$sd.of.means}))
bias <- unlist(lapply(summary.list, function(x){x@individuals$N$percent.bias}))
RMSE <- unlist(lapply(summary.list, function(x){x@individuals$N$RMSE}))
cov  <- unlist(lapply(summary.list, function(x){x@individuals$N$CI.coverage.prob}))

# Combine into one row per truncation distance, rounding for display
sim.data <- data.frame(trunc = c(200,400,600),
                       n = round(n),
                       N = round(N),
                       se = round(se,2),
                       sd.N = round(sd.N,2),
                       bias = round(bias,2),
                       RMSE = round(RMSE,2),
                       cov = round(cov*100,1))

# Render the table as HTML
kable(sim.data,
      col.names = c("$Truncation$", "$mean\\ n$", "$mean\\ \\hat{N}$", "$mean\\ se$", "$SD(\\hat{N})$", "$\\% Bias$", "$RMSE$", "$\\%\\ CI\\ Coverage$"),
      row.names = FALSE,
      align = c('c', 'c', 'c', 'c', 'c', 'c', 'c', 'c'),
      caption = "Simulation Results for the simple half normal detection probability: The truncation distance, mean number of detections, mean estimated population size (N), mean standard error of $\\hat{N}$, the standard deviation of $\\hat{N}$, percentage bias, root mean squared error, percentage of times the true value of N was captured in the confidence intervals.",
      table.placement="!h",
      format = "html")
```

For the simulations where the data were generated based on a single half-normal detection function the truncation distance used at the analysis stage made little difference to the estimates of abundance. There was perhaps some trend of decreasing coverage of the 95% confidence intervals as truncation distance was increased. A truncation distance of 600 only captured the true population abundance 93.4% of the time in what should have been 95% confidence intervals, Table 1. The root mean squared error (RMSE) values suggested that the further away from the transect the distances were truncated the closer the abundance estimates were to truth.

Table 1: Simulation results for the simple half normal detection function.

\(Truncation\) | \(mean\ n\) | \(mean\ \hat{N}\) | \(mean\ se\) | \(SD(\hat{N})\) | \(\% Bias\) | \(RMSE\) | \(\%\ CI\ Coverage\) |
---|---|---|---|---|---|---|---|
200 | 68 | 203 | 34.45 | 32.60 | 1.51 | 32.72 | 97.1 |
400 | 95 | 201 | 28.62 | 29.65 | 0.35 | 29.64 | 94.7 |
600 | 99 | 196 | 24.77 | 25.07 | -1.84 | 25.33 | 93.4 |
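The RMSE column combines the bias and the spread of the estimates. As a rough check, the sketch below recomputes it from the percentage bias and the standard deviation reported for the simple half normal simulations, using the approximation RMSE^2 = bias^2 + SD^2. Small discrepancies are expected because the tabulated values are rounded and the standard deviation may use a slightly different divisor.

```r
true.N <- 200   # fixed population size used in the simulations

# Values copied from the results for the simple half normal simulations
tab1 <- data.frame(trunc    = c(200, 400, 600),
                   pct.bias = c(1.51, 0.35, -1.84),
                   sd.N     = c(32.60, 29.65, 25.07),
                   RMSE     = c(32.72, 29.64, 25.33))

# Convert percentage bias to absolute bias in numbers of animals
bias.abs <- tab1$pct.bias / 100 * true.N

# Approximate RMSE from its two components
rmse.approx <- sqrt(bias.abs^2 + tab1$sd.N^2)

cbind(tab1, rmse.approx = round(rmse.approx, 2))
```

The recomputed values agree with the reported RMSE to within a few hundredths, and show that here the RMSE is dominated by the variability of the estimates rather than by bias.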

These simulations test whether or not we can rely on our assumption of pooling robustness in this situation. We have deliberately not provided the model used to generate the data as a candidate model in the analysis stage. We can see that for this setup, when we have pooled two quite distinct detection functions, there is some bias in the abundance estimates when the truncation distance is larger, Table 2. These results also show that our 95% confidence intervals capture the true abundance much less than 95% of the time when we use large truncation distances. This could be down to an underestimation of the variability: Table 2 shows that for large truncation values the mean se (mean of the estimated standard errors) is lower than the standard deviation of the estimates of abundance. If the analyses were correctly estimating the variability we would expect these values to be similar. In addition, the RMSE suggests that the larger the truncation distance the further away from truth the abundance estimates become, with the largest jump between 800 and 1000 m.

Table 2: Simulation results for the covariate simulations where sex is not included in the detection function models.

\(Truncation\) | \(mean\ n\) | \(mean\ \hat{N}\) | \(mean\ se\) | \(SD(\hat{N})\) | \(\% Bias\) | \(RMSE\) | \(\%\ CI\ Coverage\) |
---|---|---|---|---|---|---|---|
200 | 66 | 197 | 34.27 | 34.05 | -1.32 | 34.13 | 97.5 |
400 | 102 | 190 | 31.06 | 34.79 | -5.13 | 36.25 | 87.9 |
600 | 128 | 190 | 34.04 | 35.27 | -5.24 | 36.77 | 81.9 |
800 | 144 | 190 | 34.31 | 36.61 | -5.10 | 37.99 | 77.1 |
1000 | 154 | 184 | 30.93 | 39.49 | -7.76 | 42.42 | 68.1 |

Finally we ran simulations and fitted the model we used to generate the data. In these simulations truncation distance had little influence on the accuracy of the estimates of abundance, although there is a small amount of bias for the smallest truncation distance, Table 3. The 95% confidence intervals captured the true abundance at least 95% of the time for all truncation distances. It is interesting to note that for these simulations the mean of the estimated standard errors is always higher than the standard deviation of the estimates, which could explain the high coverage rates for the confidence intervals.

While the estimates of abundance are not greatly affected by truncation distance for these simulations, the same cannot be said for the parameter estimates. Figure 14 suggests that parameter estimation is most accurate and reliable at the maximum truncation distance. The unstable parameter estimates for the smallest truncation distance, leading to sometimes very large estimates of sigma and a bimodal distribution for sex.male, could explain the slight bias in abundance estimates for this truncation distance seen in Table 3. It is hoped that in practice this strange behaviour, if it was indeed associated with a poor fit to the data, would be identified and these estimates rejected based on model selection criteria.

Table 3: Simulation results for the covariate simulations where sex is included in the detection function model.

\(Truncation\) | \(mean\ n\) | \(mean\ \hat{N}\) | \(mean\ se\) | \(SD(\hat{N})\) | \(\% Bias\) | \(RMSE\) | \(\%\ CI\ Coverage\) |
---|---|---|---|---|---|---|---|
200 | 66 | 207 | 37.09 | 28.84 | 3.28 | 29.56 | 98.1 |
400 | 102 | 205 | 29.73 | 24.91 | 2.44 | 25.37 | 97.7 |
600 | 128 | 204 | 28.07 | 23.40 | 1.81 | 23.67 | 97.9 |
800 | 144 | 202 | 26.99 | 20.62 | 0.94 | 20.69 | 98.7 |
1000 | 154 | 202 | 26.77 | 20.64 | 1.24 | 20.78 | 98.9 |

In these simulations we have pushed the concept of pooling robustness quite far in that our two detection functions for males and females were very distinct from one another. This would have increased the potential for spiked data in our simulations, that is, data where the number of detections falls away quickly at small distances. The recommendation when performing distance sampling surveys is to review your data frequently in the field as it is being collected. If you detect spiked data then field methods should be adapted to achieve a wider shoulder in the detection function. This practice will likely help ensure that pooling robustness holds. In addition, pooling robustness may be more tolerant of high levels of variability in the underlying detection functions for larger sample sizes.

The model selection (if any) applied in these simulations was done purely on the basis of AIC. In practice the AIC value is one of a number of diagnostic techniques researchers rely on to select an appropriate detection function model. It is likely, especially due to the potential for spiked data, that some of the models in these simulations were not good fits to the data and would not have been selected by a researcher. Had model selection been manual, a researcher might have chosen to include adjustment terms in the half-normal or hazard rate models, which may have improved the model fit and the associated estimate of abundance.

These simulations do suggest that there is little cost to the researcher in truncating the data, even at what seem like quite harsh levels. In fact, truncation may be beneficial if there are large differences in the underlying detection functions due to a covariate which have not been included in the detection function models.

Conversely, if the researcher hopes to identify which covariates affect detectability and obtain reliable parameter estimates then minimal (if any) truncation appears to be preferable.

The effects of truncation distance on the precision of estimated abundance are interesting, especially the comparison between our estimated and observed variability. When we only allow the simulations to fit the half-normal and hazard rate models, the estimated variability stays roughly the same as truncation distance increases, while the observed variability increases. So at larger truncation distances the variability in our estimated abundance is underestimated. However, when fitting the covariate model our estimated variance is higher than our observed variance, suggesting that for this model we are overestimating variability for all truncation distances.
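The comparison between estimated and observed variability can be summarised as the ratio of the mean estimated standard error to the standard deviation of the abundance estimates. The base R sketch below computes this ratio from the values reported in the two sets of covariate simulation results; a ratio below 1 indicates that variability is underestimated and a ratio above 1 that it is overestimated.

```r
truncs <- c(200, 400, 600, 800, 1000)

# Covariate simulations analysed without the sex covariate
no.cov <- data.frame(mean.se = c(34.27, 31.06, 34.04, 34.31, 30.93),
                     sd.N    = c(34.05, 34.79, 35.27, 36.61, 39.49))

# Covariate simulations analysed with the sex covariate in the model
with.cov <- data.frame(mean.se = c(37.09, 29.73, 28.07, 26.99, 26.77),
                       sd.N    = c(28.84, 24.91, 23.40, 20.62, 20.64))

# Ratio of estimated to observed variability at each truncation distance
ratio <- data.frame(trunc    = truncs,
                    no.cov   = round(no.cov$mean.se / no.cov$sd.N, 2),
                    with.cov = round(with.cov$mean.se / with.cov$sd.N, 2))
ratio
```

Without the covariate the ratio falls to below 0.8 at a truncation distance of 1000, whereas with the covariate in the model it is above 1 at every truncation distance.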

- Truncation can help ensure the concept of pooling robustness holds when there are large differences in the detection functions of the individuals in the population and the covariates affecting detectability are not modelled.
- The estimates of abundance are more accurate and precise when the covariate affecting detectability is included in the detection function model.
- Larger truncation distances, or no truncation, are preferable when trying to obtain the parameters for the covariates that affect detectability.

Buckland, S. T., Anderson, D. R., Burnham, K. P., Laake, J. L., Borchers, D. L., & Thomas, L. (2001). *Introduction to distance sampling*. Oxford University Press.

Buckland, S. T., Anderson, D. R., Burnham, K. P., Laake, J. L., Borchers, D. L., & Thomas, L. (2004). *Advanced distance sampling*. Oxford University Press.

Marshall, L. (2019). *DSsim: Distance sampling simulations*. Retrieved from https://CRAN.R-project.org/package=DSsim

Thomas, L., Buckland, S. T., Rexstad, E. A., Laake, J. L., Strindberg, S., Hedley, S. L., … Burnham, K. P. (2010). Distance software: Design and analysis of distance sampling surveys for estimating population size. *Journal of Applied Ecology*, *47*(1), 5–14.