```{r echo=FALSE}
library(knitr)
```
```{r include=FALSE}
knitr::opts_chunk$set(eval=TRUE, echo=TRUE, message=FALSE, warnings=FALSE, progress=FALSE)
```
# Introduction
The `dsm` package expects data to be in a specific format. In this example I'll show how to get data into that format and give some explanation as to *why* each of these requirements exist.
I recommend getting to grips with fitting models in `dsm` before reading this tutorial so you know what the end point will look like before trying to grapple with the data. An example for [Gulf of Mexico pantropical spotted dolphins is available here](https://examples.distancesampling.org/dsm-line-dolphins/mexico-analysis.html).
# Prerequisites
To run the code below, you will need to have the `ggplot2` and `sf` libraries installed. You can do this with the following code:
```{r eval=FALSE}
install.packages(c("ggplot2", "sf"))
```
The Rhode Island data required downloaded below:
- [loon observations](data/loons.csv)
- [coastline dbf](data/ri_coast.dbf)
- [coastline projection](data/ri_coast.prj)
- [coastline shapefile](data/ri_coast.shp)
- [coastline shx](data/ri_coast.shx)
- [transects dbf](data/ri_transects.dbf)
- [transects projection](data/ri_transects.prj)
- [transects shapefile](data/ri_transects.shp)
- [transects shx](data/ri_transects.shx)
These should be placed in a folder called `data` for the code below to work.
# Overview
When we collect distance sampling data, we need to record (1) the distance to an observation (along with other detectability covariates like weather), (2) the transect we are currently on, and (3) how much effort we expended on that transect (its length or number of visits).
If we want to fit a DSM that information needs to be expanded to include where on the transect we are too, so we can build the spatial model. For point transects, this is simple as we are always at the point, but for line transects we need to know the position of the detection too (or equivalently, how far along the transect we are).
More abstractly, we need information about both detections and effort. We also need to link these two. In `dsm` we use three objects to hold this information:
- the fitted detection function (argument `ddf.obj`) holds all the detections but ignores spatial information
- the "segments" (argument `segment.data`) holds the chunks of effort (parts of the transect or point visits) and their locations. It might also include environmental covariate information.
- a table linking the above two (argument `observation.data`), making sure each detection has a corresponding segment.
![Correspondence between `dsm` tables](images/dsm_tables.png)
The above image gives an overview of how the different `data.frames` interact, along with how that corresponds to the real life situation of collecting data on a transect.
The rest of this example goes through each table in turn, explaining their construction.
# Table construction
## Detection data
The data requirements are very similar to those for packages `mrds` and `Distance`. In both cases, each detection has a corresponding row. For an analysis using `dsm` we have additional requirements in the data (which can be auto-generated by `Distance` but we recommend against this to ensure that you know that records link up correctly).
For `Distance` we need:
- `distance` : the distance to the detection
- `object` : a unique detection identifier (which will link to the observation table)
- `size` : the number of individuals in the detection (1 if objects occur alone)
`mrds` requires the following extra columns (to allow for double observer surveys):
- `detected` : 1 if detected by the observer and 0 if missed (always 1 for single observer)
- `observer` : observer number (1 or 2) (always 1 for single observer)
If other covariates that affect detectability are recorded (e.g., sex or weather) then these can be included in this data for analysis in `mrds` or `Distance`.
Once the model is fitted, we deal only with that fitted model object in the `dsm` analysis.
## Segment data
I will use the term "segment" here to refer to both points for point transects and small chunks of transect for the line transect case. I'll assume that you've already segmented your lines while first describing the data format, then go on to an example of how to segment line transect data.
The `data.frame` required for the segments needs the following columns:
- `Effort` : the effort expended on this segment (either the length of the segment for lines, or the number of visits for points)
- `Sample.Label` : a unique identifier for the segment (if you are engaged in a complicated survey, this might take a format like "`YEAR`-`AREA`-`LINE`-`SEGMENT`", so you have labels like "`2020-North-A-15`", which can be useful to keep track of where data came from later).
In addition, environmental covariates like location and relevant covariates (sea surface temperature, foliage type, altitude, bathymetry, etc) can be included in this `data.frame` if they are to be used in the spatial model.
### Segmenting
If you have data as lines and you want to chop them up into segments, this section will give some sample R code to do this. How things work is extremely dependent on the input data, but hopefully this gives a template for what you want to do at least.
*Before you jump in*: If you are fluent in ArcGIS or QGIS I recommend doing this task in the software you are familiar with at first. A simple guide to doing this task in [QGIS is here](https://www.northrivergeographic.com/qgis-split-a-line-at-an-interval) and for [ArcGIS I recommend the MGET toolbox](https://mgel.env.duke.edu/mget/).
As an example, we will look at an aerial survey of seabirds off the coast of Rhode Island, USA. Data from these surveys were collected by the University of Rhode Island and analysed in @winiarski_spatially_2013, @winiarski_integrating_2014 and @winiarski_spatial_2014. I have the transects saved as a shapefile that I will then "chop-up" into segments. To start with let's plot the data with the coastline:
```{r}
library(ggplot2)
library(sf)
# coast data, just for reference
coastline <- read_sf("data/ri_coast.shp")
# our transect lines
transects <- read_sf("data/ri_transects.shp")
# now make the plot
p <- ggplot() +
# geom_sf knows what to do with spatial data
geom_sf(data=coastline) +
geom_sf(data=transects, aes(colour=Transect)) +
# chop to be on a better scale
coord_sf(xlim=c(-72, -70.8), ylim=c(40.8, 41.5), expand=FALSE) +
# make the plot look simpler
theme_minimal()
print(p)
```
In the above code we have loaded the [`sf` package](https://r-spatial.github.io/sf/). This is the package we will use to do most of our spatial work in segmenting the data. It has most of the tools expected from a GIS, though the package is still being developed so its functionality is always increasing.
Investigating the `transects` data we loaded, we can see it consists of a `data.frame` with some extra bits (the spatial information, including locations (`geometry`) and projection (`CRS`)):
```{r inspect-transects}
transects
```
We have one row per transect (for a total of 24 transects), each of which consists of a line joining the start and end points that make up the transects. If we want we can plot individual rows via `plot(transects[1,])`; this can be useful to check that each row is a single line, or if further processing is needed (this can be tricky and is not covered in this tutorial but see "More information" below).
This data looks in good shape, so we can use the `st_segmentize` function to take each transect and make segments from them. To make sure everything works correctly, we need to project the data first. Here I'm using an appropriate projected coordinate system (EPSG code 6348), which is the Rhode Island State Plane.
```{r}
# project transects
transects <- st_transform(transects, 6348)
# do the segmenting
segs <- st_segmentize(transects, dfMaxLength=units::set_units(2000, "metres"))
# transform back to lat/long
segs <- st_transform(segs, 4326)
transects <- st_transform(transects, 4326)
```
Here we know the truncation used for the detection function was 1000m (the distances were collected in bins and this was the further bin), and since we're trying to make our segments approximately square, we set the length to be twice that (since the truncation applies to either side of the transect), so 2000m.
Looking at what that did:
```{r inspect-segs}
segs
```
See now that each row has many coordinates attached to it, just looking at the first row (transect 1) and comparing the coordinates between `transects` and `segs`
```{r transects-rowone}
st_coordinates(transects[1,])
```
`transects` has just two coordinates in it (the start and end points of the line). Whereas:
```{r segs-rowone}
st_coordinates(segs[1,])
```
`segs` has 5, corresponding to our segment cut points.
We now need to break up the rows of `segs` into multiple rows, one per segment. This is a bit fiddly. We use a function provided by [`dieghernan`](https://dieghernan.github.io) on [their site](https://dieghernan.github.io/201905_Cast-to-subsegments/) to do this. We first load the function (don't worry about understanding the code!):
```{r cast-to-subsegs-function}
stdh_cast_substring <- function(x, to = "MULTILINESTRING") {
ggg <- st_geometry(x)
if (!unique(st_geometry_type(ggg)) %in% c("POLYGON", "LINESTRING")) {
stop("Input should be LINESTRING or POLYGON")
}
for (k in 1:length(st_geometry(ggg))) {
sub <- ggg[k]
geom <- lapply(
1:(length(st_coordinates(sub)[, 1]) - 1),
function(i)
rbind(
as.numeric(st_coordinates(sub)[i, 1:2]),
as.numeric(st_coordinates(sub)[i + 1, 1:2])
)
) %>%
st_multilinestring() %>%
st_sfc()
if (k == 1) {
endgeom <- geom
}
else {
endgeom <- rbind(endgeom, geom)
}
}
endgeom <- endgeom %>% st_sfc(crs = st_crs(x))
if (class(x)[1] == "sf") {
endgeom <- st_set_geometry(x, endgeom)
}
if (to == "LINESTRING") {
endgeom <- endgeom %>% st_cast("LINESTRING")
}
return(endgeom)
}
```
We can then use it to separate the segments and look at the results:
```{r run-subseg}
segs <- stdh_cast_substring(segs, to="LINESTRING")
segs
```
We now have `r nrow(segs)` segments between 1km and 2km long. We can check the lengths using the `st_length` function and plot a histogram of the segment lengths:
```{r length-segs}
hist(st_length(segs), xlab="Segment lengths (m)", main="")
```
Note that in setting the `dfMaxLength` argument to `st_segmentize` we are giving a rough guide to the segment length and the algorithm inside `st_segmentize` tries to get segments to be as equal in length as possible (note the $x$ axis in the histogram).
Note also that the `Transect` column in `segs` now has duplicate entries as each transect has multiple segments in it. We can now create our required columns for `dsm`.
First, the `Effort` column can be generated using `st_length`:
```{r effort-segs}
segs$Effort <- st_length(segs)
```
Then we can generate the `Sample.Labels`. Here I'll use a more complicated naming scheme to show how that's done, but one could simply write `segs$Sample.Label <- 1:nrow(segs)` and that would be sufficient. Instead, I would like to have my `Sample.Label` be the "`YEAR`-`AREA`-`LINE`-`SEGMENT`" scheme I suggested above.
There are fancier ways to do this, but for clarity, let's use a `for` loop:
```{r samplelabels-segs}
# create a dummy column that we can fill in as we go
segs$Sample.Label <- NA
# set the year and area once for this data
year <- 2010
area <- "RI"
# loop over the transect IDs
for(this_transect in unique(segs$Transect)){
# how many segments in this transect?
n_segs <- nrow(subset(segs, Transect==this_transect))
# generate the n_segs labels that we need
segs$Sample.Label[segs$Transect==this_transect] <- paste(year, area, this_transect,
1:n_segs, sep="-")
}
```
Now we can check that looks right:
```{r check-samplelabels}
segs
```
With the required columns taken care of, we can also calculate the centroids of our segments to get the locations of each segment to use in the spatial model. Handily we can use `st_centroid` to do this in one step and keep all our data at the same time.
We get a warning about how centroids might not be calculated correctly when latitude/longitude are used. So again we need to project the data first (to Rhode Island State Plane) for this step using `st_transform`.
```{r centroid-locs}
# save the line version of segs for plotting later
segs_lines <- segs
# project segs
segs <- st_transform(segs, 6348)
# find centroids
segs <- st_centroid(segs)
# project back to lat/long
segs <- st_transform(segs, 4326)
# how does that look?
segs
```
Notice that our `geometry type` is now `POINT`. We can plot these centroids on our previous map:
```{r}
p <- ggplot() +
# geom_sf knows what to do with spatial data
geom_sf(data=coastline) +
geom_sf(data=transects) +
geom_sf(data=segs) +
# chop to be on a better scale
coord_sf(xlim=c(-72, -70.8), ylim=c(40.8, 41.5), expand=FALSE) +
# make the plot look simpler
theme_minimal()
print(p)
```
At present, the `dsm` package doesn't know how to talk to `sf` objects, so as a final step we need to simplify `segs` to be a regular `data.frame`, which `dsm` can interpret. We can extract the coordinates of the centroids using `st_coordinates` and then append them onto a `data.frame` version of the object with the geometry dropped, like so:
```{r dataframeify}
segs_df <- cbind(as.data.frame(st_drop_geometry(segs)),
st_coordinates(segs))
```
We can double check that looks right:
```{r checkfinal}
head(segs_df)
```
and we are done. See the next section for how to link the segments and the detections.
If we want to include spatial covariate information in the segment table, we could then use (for example) [`rerrdap`](https://CRAN.R-project.org/package=rerddap) to obtain remotely-sensed data. For *in situ* data (i.e., weather conditions recorded while on effort), this can be more complicated and will require the use of `summarize`, see [this thread for more information](https://github.com/r-spatial/sf/issues/321) on how to deal with this situation.
## Linking observations and segments
To link together the data in the detection function and the spatial data giving the effort, we use the observation `data.frame`. This is really just a cross-reference between those two tables, so each detection lives in one segment.
The observation `data.frame` must have (at least) the following columns:
- `object` : unique object identifier (corresponding to those identifiers in the detection function)
- `Sample.Label` : the identifier for the segment that the observation occurred in
- `size` : the size of each observed group (e.g., 1 if all animals occurred individually)
- `distance` : distance to observation
The observation data should have as many rows as there are in the detection function.
### Relating observations and segments in practice
There are a few methods for building the observation table. Here I'll illustrate one assigning detections to segments based on their distance (i.e., detections are associated with their closest segment centroid). This is appropriate when you don't see forwards too far, so it's unlikely that detections will be miss assigned. This is true, for instance, in an aerial survey where observers look out windows located on the sides of the plane but might be inappropriate for a shipboard survey where observers on the flying bridge can see way ahead of the ship.
First loading the detection data for this survey (this would be what we use to fit the detection function, post-processing):
```{r detect-data}
obs <- read.csv("data/loons.csv")
head(obs)
```
We can see there are columns for latitude and longitude (`Lat` and `Long`). We can make this `data.frame` be an `sf` object by telling `sf` that these columns contain spatial coordinates (in `sf` lingo, "geometry"):
```{r geomit}
obs <- st_as_sf(obs, coords=c("Long", "Lat"))
# set the coordinate system
st_crs(obs) <- 4326
obs
```
We can overlay this on our previous map of transects:
```{r}
p <- ggplot() +
# geom_sf knows what to do with spatial data
geom_sf(data=coastline) +
geom_sf(data=transects) +
geom_sf(data=obs, size=0.5) +
# chop to be on a better scale
coord_sf(xlim=c(-72, -70.8), ylim=c(40.8, 41.5), expand=FALSE) +
# make the plot look simpler
theme_minimal()
print(p)
```
Going back to our `segs` object above, we can use the `st_join` function, combined with the `st_nearest_feature` function to control how the tables are linked. Again, we need to project the data to avoid issues.
```{r}
# project segs
segs <- st_transform(segs, 6348)
obs <- st_transform(obs, 6348)
# do the join
obs <- st_join(obs, segs, join=st_nearest_feature)
# project back to lat/long
segs <- st_transform(segs, 4326)
obs <- st_transform(obs, 4326)
# how does that look?
obs
```
Now `obs` has acquired additional columns from `segs`, including the `Sample.Label` (which we want) and `Effort` (which we don't). To check this worked okay, we can plot the `obs` and `segs_lines` (the line version of the segments which we saved earlier) with colour coding for the `Sample.Label`. I randomized the colour order so it would be easier to tell if observations were misallocated.
```{r plot-check}
# make a random colour palette to avoid similar colours being near each other
pal <- rainbow(nrow(segs), s=.6, v=.9)[sample(1:nrow(segs),nrow(segs))]
p <- ggplot() +
# geom_sf knows what to do with spatial data
geom_sf(data=coastline) +
geom_sf(data=segs_lines, aes(colour=Sample.Label), pch=21) +
geom_sf(data=obs, size=0.5, aes(colour=Sample.Label)) +
# chop to be on a better scale
coord_sf(xlim=c(-72, -70.8), ylim=c(40.8, 41.5), expand=FALSE) +
scale_colour_manual(values=pal) +
# make the plot look simpler
theme_minimal() +
theme(legend.position = "none")
print(p)
```
As with `segs` we can remove unnecessary columns and geometry to get the required columns only:
```{r obscols}
# get rid of "spatialness"
obs <- st_drop_geometry(obs)
# select the columns we need
obs <- obs[, c("object", "Sample.Label", "size", "Bin")]
head(obs)
```
(Note here since we have binned data, we just keep the `Bin` column and need to process this further later.)
A different way to approach this would be if there were timestamps on waypoints from the GPS, which could be related to the segments. One could then look at whether an observation was made between the start and end times of a segment.
# More information
We have provided some information about segmenting on [this wiki page](https://osf.io/wgc4f/wiki/Splitting%20transects%20into%20segments/), as part of the US Navy-funded project [DenMod](https://synergy.st-andrews.ac.uk/denmod/).
This information is summarized at `?"dsm-data"` in the `dsm` package. It may also be useful to look at the data in the example dataset `mexdolphins` which can be loaded with `data(mexdolphins)`. The vignette for an analysis of those data [is available here](https://examples.distancesampling.org/dsm-line-dolphins/mexico-analysis.html).
It is hard to give very general information on how to segment lines, as to some extent it depends on how the data were originally formatted (going all the way back to the make and model of the GPS unit used). More information on dealing with spatial data in R using the `sf` package is available at [the R spatial website](https://r-spatial.github.io/sf/) (see the "Articles" drop down in the header).
Another source of help is the [distance sampling mailing list](https://groups.google.com/forum/#!topic/distance-sampling/). Be sure to search the archives prior to posting as there have been several threads on segmenting previously that might be helpful.
# Acknowledgements
Thanks to Phil Bouchet for providing helpful comments to an early version of this document and to Iúri Correia for finding an important bug.