Supervised machine learning (ML) was employed to assess whether contact call structure differed between ranges, and whether acoustic structure changed over time in the invasive range. We chose supervised ML with range as the class label after visually assessing calls during pre-processing. Stochastic gradient boosting and random forests (RF) models were trained and validated on a large set of acoustic measurements and features for the majority of native and invasive range calls across social scales. Both models were trained to classify calls back to range, and we proceeded with the RF model for prediction.

Classification accuracy throughout predictive modeling was used as an indicator of structural differentiation between ranges. We also used the RF proximity matrix to visualize overlap between ranges in low dimensional acoustic space. Finally, we used misclassifications per range to more closely evaluate structural change between ranges, as well as structural change over time in the invasive range.
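As a rough sketch of this workflow (not the analysis code below, which uses caret and ranger on the full feature set): train a classifier with range as the label, use classification accuracy as the differentiation indicator, and ordinate 1 - proximity to visualize overlap in low dimensional space. The randomForest package is used here only because it exposes a proximity matrix directly, and the data are toy values.

```r
# Minimal sketch of the supervised ML workflow with toy data
library(randomForest)

set.seed(401)
toy <- data.frame(
  range = factor(rep(c("Native", "Invasive"), each = 50)),
  feat1 = c(rnorm(50, mean = 0), rnorm(50, mean = 1)),
  feat2 = c(rnorm(50, mean = 0), rnorm(50, mean = 0.5))
)

# Train a random forest with range as the class label, retaining proximities
rf <- randomForest(range ~ ., data = toy, ntree = 500, proximity = TRUE)

# Out-of-bag classification accuracy as an indicator of structural differentiation
mean(predict(rf) == toy$range)

# MDS on 1 - proximity to visualize overlap between ranges in 2D acoustic space
mds <- stats::cmdscale(1 - rf$proximity, k = 2)
plot(mds, col = toy$range, pch = 19, xlab = "Dimension 1", ylab = "Dimension 2")
```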

The prefix “AM” stands for “Additional Material” and is used to indicate figures included here that are not present in the main manuscript or appendix, but are still valuable for reproducibility. Some of the functions and code in this script may need to be updated in order to reproduce results with more recent versions of the packages used.
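For instance, one lightweight way to record the package versions a given run used, so that results can later be matched to those versions (standard base R calls, not part of the original analysis):

```r
# Record the R version and attached package versions for reproducibility
sessionInfo()

# Or check a single package of interest
packageVersion("warbleR")
```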

rm(list = ls())

X <- c("warbleR", "ggplot2", "tidyverse", "caret", "MASS", "pbapply", "dplyr", "data.table", "parallel", "ranger", "corrplot", "MLmetrics", "e1071", "gbm", "pdp", "scales", "egg", "ggplotify", "grid", "edarf", "pracma")

invisible(lapply(X, library, character.only = TRUE))

path <- "/media/gsvidaurre/MYIOPSITTA/R/VocalLearning_PostInvasion/Data"
gpath <- "/home/gsvidaurre/Desktop/MANUSCRIPTS/SimplerSignatures_PostInvasion/FIGURES"
seed <- 401
cores <- parallel::detectCores() - 2

Read in the extended selection table (EST). This contains metadata and wave objects for pre-processed native and invasive range calls. Throughout this script, you can reproduce analyses by reading in the file “nat_inv_indiv_site_seltbl.csv”, which is a selection table made available on figshare that contains metadata needed here. Measurements needed to reproduce analyses here have also been provided on figshare.

nat_inv_est <- readRDS(file.path(path, "nat_inv_indiv_site_EST.RDS"))
# glimpse(nat_inv_est)

ML training plan

In previous work[1], and an independent analysis of hierarchical mapping, we used known repeatedly sampled individuals for training supervised machine learning models. Here, we did not use these individuals for model training, but we did use these birds to visually validate the different measurements used as predictors for stochastic gradient boosting and random forests models. See the exploratory visuals generated below (AM2.1 - AM2.7).

nat_inv_est %>%
  filter(social_scale == "Individual") %>%
  group_by(site_year, Bird_ID) %>%
  summarise(num_calls = length(sound.files))
## `summarise()` has grouped output by 'site_year'. You can override using the `.groups` argument.
## # A tibble: 17 x 3
## # Groups:   site_year [10]
##    site_year Bird_ID  num_calls
##    <chr>     <chr>        <int>
##  1 1145_2017 NAT-AAT         12
##  2 1145_2017 NAT-UM1         25
##  3 1145_2017 NAT-UM2         23
##  4 1145_2017 NAT-UM3          5
##  5 1145_2017 NAT-UM4         13
##  6 ASCA_2019 INV-UM6         25
##  7 BART_2011 INV-UM1         23
##  8 CAME_2004 INV-UM19        12
##  9 CHAC_2017 NAT-UM5          7
## 10 ELEM_2019 INV-UM7         28
## 11 ELEM_2019 INV-UM9          5
## 12 EMBR_2017 NAT-RAW          4
## 13 EMBR_2017 NAT-ZW8          8
## 14 INTR_2019 INV-UM10         6
## 15 ROBE_2011 INV-UM5         20
## 16 SOCC_2019 INV-UM16         8
## 17 SOCC_2019 INV-UM17         5
# Native range: NAT-AAT, NAT-UM1, NAT-UM2, NAT-UM4 (all from site 1145)

# Invasive range: INV-UM1 (BART_2011), INV-UM7 (ELEM_2019), INV-UM5 (ROBE_2011), INV-UM19 (CAME_2004)

train_indivs <- c("NAT-AAT", "NAT-UM1", "NAT-UM2", "NAT-UM4", "INV-UM19", "INV-UM1", "INV-UM5", "INV-UM7")
# train_indivs

Acoustic measurements

Acoustic measurements and features were obtained for all calls in the EST. However, I removed sound files with the "_site_scale" suffix prior to running ML below. These calls were included at the site scale for an independent analysis of hierarchical mapping patterns, but were not necessary for the supervised machine learning analysis here.

Spectrographic cross-correlation (SPCC)

# SPCC on spectrograms
xc_mat <- warbleR::xcorr(nat_inv_est, wl = 378, ovlp = 90, wn = "hanning", cor.method = "pearson", parallel = cores, na.rm = FALSE, bp = c(0.5, 9), cor.mat = TRUE, path = path, type = "spectrogram")
str(xc_mat)

saveRDS(xc_mat, file.path(path, "xc_mat_NAT_INV.RDS"))

# SPCC on cepstral coefficients
xc_mat_mfcc <- warbleR::xcorr(nat_inv_est, wl = 378, ovlp = 90, wn = "hanning", cor.method = "pearson", parallel = cores, na.rm = FALSE, bp = c(0.5, 9), cor.mat = TRUE, path = path, type = "mfcc")
str(xc_mat_mfcc)

saveRDS(xc_mat_mfcc, file.path(path, "xc_mat_NAT_INV_mfcc.RDS"))

Dynamic time warping similarity (DTW)

We measured spectral entropy and dominant frequency as time series, which we converted to acoustic similarity measurements using dynamic time warping (DTW_domf and DTW_spen). We extracted 100 time points per time series. We also performed multivariate DTW using both spectral entropy and dominant frequency time series (DTW_mult).

# Extract spectral entropy across signals as a time series
# Set img to TRUE to see time series plotted over spectrograms
sp_ts <- sp.en.ts(nat_inv_est, wl = 378, length.out = 100, wn = "hanning", ovlp = 90, bp = c(0.5, 9), threshold = 15, img = FALSE, parallel = cores, img.suffix = "spec_ent", pb = TRUE, clip.edges = TRUE, leglab = "spec_ent", sp.en.range = c(0.5, 9), path = path, flim = c(0, 10))
str(sp_ts)

# Ensure that the selec column isn't included in dtw
sp_ts$selec <- as.character(sp_ts$selec)
sapply(sp_ts, is.numeric)

# Perform dynamic time warping (DTW) on spectral entropy time series
# Returns distances by default
sp_ts_DTW <- dtw::dtwDist(sp_ts[, sapply(sp_ts, is.numeric)], sp_ts[, sapply(sp_ts, is.numeric)], window.type = "none", open.end = FALSE, path = path)
str(sp_ts_DTW)
saveRDS(sp_ts_DTW, file.path(path, "DTW_spec_ent.RDS"))

# Extract dominant frequency as a time series 
# Set img to TRUE to see time series plotted over spectrograms
df_ts <- dfts(nat_inv_est, wl = 378, wl.freq = 378, length.out = 100, wn = "hanning", ovlp = 90, bp = c(0.5, 9), threshold = 15, img = FALSE, parallel = cores, img.suffix = "dfts", pb = TRUE, clip.edges = TRUE, path = path, flim = c(0, 10))
str(df_ts)

# Are there any NAs?
any(apply(df_ts, c(1, 2), is.na))

# Ensure that the selec column isn't included in dtw
df_ts$selec <- as.character(df_ts$selec)
sapply(df_ts, is.numeric)

# Perform DTW on dominant frequency time series
# Returns distances by default
df_ts_DTW <- dtw::dtwDist(df_ts[, sapply(df_ts, is.numeric)], df_ts[, sapply(df_ts, is.numeric)], window.type = "none", open.end = FALSE, path = path)
str(df_ts_DTW)
saveRDS(df_ts_DTW, file.path(path, "DTW_domf.RDS"))

# Perform multivariate DTW on spectral entropy and dominant frequency time series
dim(sp_ts[, sapply(sp_ts, is.numeric)])
dim(df_ts[, sapply(df_ts, is.numeric)])

# Returns distances by default
# Note that there are some zeroes off the diagonal that do not correspond to duplicated calls
mDTW <- warbleR::multi_DTW(ts.df1 = sp_ts, ts.df2 = df_ts, parallel = cores, window.type = "none", open.end = FALSE, scale = TRUE, dist.mat = TRUE, path = path)
str(mDTW)
class(mDTW)

saveRDS(mDTW, file.path(path, "DTW_multi.RDS"))

The zeroes off the diagonal of the multivariate DTW matrix were replaced with the minimum non-zero distance found in the matrix. This replacement reflects the fact that although the algorithm found these calls to be highly similar, they were not comparisons of a call against itself, so a distance of exactly zero would be misleading.

mDTW <- readRDS(file.path(path, "DTW_multi.RDS"))
str(mDTW)

# Replace all zeroes with the minimum distance value
# Then replace the diagonal with zeroes
mDTW[mDTW == 0] <- min(mDTW[mDTW != 0])

diag(mDTW) <- 0

# Diagonal is zero 
unique(diag(mDTW))

# No zeroes off the diagonal 
length(which(mDTW[lower.tri(mDTW, diag = FALSE)] == 0))

# Looks good
# str(mDTW)

saveRDS(mDTW, file.path(path, "DTW_multi.RDS"))

Acoustic structure measurements

We collected standard spectral measurements of contact call acoustic structure, as well as Mel-frequency cepstral coefficients.

# Calculate spectral acoustic parameters across calls
# Exclude measurements related to fundamental frequency (harmonicity = FALSE)
acs_params <- warbleR::specan(nat_inv_est, bp = c(0.5, 9), wl = 378, wl.freq = 378, ovlp = 90, fast = FALSE, parallel = cores, wn = "hanning", harmonicity = FALSE, path = path)
glimpse(acs_params)

# Make sure that the selec column is non-numeric and non-integer
acs_params$selec <- as.character(acs_params$selec)
glimpse(acs_params)

# Write out acoustic parameters for later
write.csv(acs_params, file.path(path, "acoustic_parameters.csv"), row.names = FALSE)

# Also collected acoustic measurements on Mel-frequency cepstral coefficients
# Used default values of numcep and nbands because these accurately represented patterns of individual overlap and distinctiveness in exploratory visualizations in a previous analysis (Smith-Vidaurre et al. 2020, Behav. Ecol.)
cep_coeff <- warbleR::mfcc_stats(nat_inv_est, ovlp = 90, wl = 378, bp = c(0.5, 9), numcep = 12, nbands = 40, parallel = cores, path = path)
glimpse(cep_coeff)

cep_coeff$selec <- as.character(cep_coeff$selec)
glimpse(cep_coeff)

# Write out cepstral acoustic parameters for later
write.csv(cep_coeff, file.path(path, "Mel_freq_cepstral_coefficients.csv"), row.names = FALSE)

Visual validation

We visualized patterns of acoustic variation within and among repeatedly sampled individuals for each acoustic similarity or structure measurement. We used MDS and PCA to visualize calls in two-dimensional acoustic space by similarity measurement (pairwise matrices) or structural measurement type (non-pairwise matrices), respectively. Here, we visually assessed whether a given measurement type recapitulated patterns of overdispersion among individuals in acoustic space that were identified in previous native range work[1], and an independent analysis of hierarchical mapping patterns between ranges.

Aesthetics for plots with calls by individual.

n <- 12
# pie(rep(1, n), col = heat.colors(n))

# Colors and shapes by individual, ordered by native and then invasive
cols <- c("navy", "royalblue2", "turquoise", "dodgerblue", "gold4", "darkorange3", "goldenrod2", "gold2")

shps <- c(21, 22, 24, 23, 21, 25, 17, 15) # one shape per individual

Visuals of similarity measurements

Read in the SPCC and DTW similarity matrices.

xc_mat_spec <- readRDS(file.path(path, "xc_mat_NAT_INV.RDS"))
# dim(xc_mat_spec) == nrow(nat_inv_est)

xc_mat_ceps <- readRDS(file.path(path, "xc_mat_NAT_INV_mfcc.RDS"))
# dim(xc_mat_ceps) == nrow(nat_inv_est)

dtw_domf <- readRDS(file.path(path, "DTW_domf.RDS"))
# dim(dtw_domf) == nrow(nat_inv_est)

dtw_spen <- readRDS(file.path(path, "DTW_spec_ent.RDS"))
# dim(dtw_spen) == nrow(nat_inv_est)

dtw_mult <- readRDS(file.path(path, "DTW_multi.RDS"))
# dim(dtw_mult) == nrow(nat_inv_est)

Change dimension names of similarity matrices.

# Change dimension names of each matrix to facilitate pattern searching
# head(dimnames(xc_mat_spec)[[1]])
# head(nat_inv_est$sound.files)

# Make a vector of sound file names for substitution
wavs <- dimnames(xc_mat_spec)[[1]]
wavs <- gsub(".WAV-1", ".WAV", wavs, fixed = TRUE)
# head(wavs)

# Substitute the new dimension names across all similarity matrices
dimnames(xc_mat_spec) <- list(wavs, wavs)
dimnames(xc_mat_ceps) <- list(wavs, wavs)
dimnames(dtw_domf) <- list(wavs, wavs)
dimnames(dtw_spen) <- list(wavs, wavs)
dimnames(dtw_mult) <- list(wavs, wavs)

# Checking, looks good
# head(dimnames(xc_mat_spec)[[1]])
# head(dimnames(dtw_mult)[[1]])

Filter similarity matrices by the individuals selected for visual validation.

# Find calls per each individual to be used for model training

# Checking, looks good
nat_inv_est %>%
  filter(social_scale == "Individual") %>%
  filter(Bird_ID %in% train_indivs) %>%
  group_by(Bird_ID) %>%
  summarise(num_calls = length(sound.files))
## # A tibble: 8 x 2
##   Bird_ID  num_calls
##   <chr>        <int>
## 1 INV-UM1         23
## 2 INV-UM19        12
## 3 INV-UM5         20
## 4 INV-UM7         28
## 5 NAT-AAT         12
## 6 NAT-UM1         25
## 7 NAT-UM2         23
## 8 NAT-UM4         13
train_indivs
## [1] "NAT-AAT"  "NAT-UM1"  "NAT-UM2"  "NAT-UM4"  "INV-UM19" "INV-UM1"  "INV-UM5" 
## [8] "INV-UM7"
calls <- nat_inv_est %>%
  filter(social_scale == "Individual") %>%
  filter(Bird_ID %in% train_indivs) %>%
  pull(sound.files)
# head(calls)

# Subset matrices by these calls
# Grepping on SPCC matrix dimension names and using these same indices across similarity measurements because calls were provided to similarity functions in the same order
inds <- grep(paste(paste("^", calls, "$", sep = ""), collapse = "|"), dimnames(xc_mat_spec)[[1]])
# length(inds) == length(calls)

xc_mat_spec_tmp <- xc_mat_spec[inds, inds]
str(xc_mat_spec_tmp)
##  num [1:156, 1:156] 1 0.435 0.373 0.398 0.428 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
##   ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
xc_mat_ceps_tmp <- xc_mat_ceps[inds, inds]
str(xc_mat_ceps_tmp)
##  num [1:156, 1:156] 1 0.848 0.611 0.733 0.73 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
##   ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
dtw_domf_tmp <- dtw_domf[inds, inds]
str(dtw_domf_tmp)
##  num [1:156, 1:156] 0 35.4 60.6 45.2 50.7 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
##   ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
dtw_spen_tmp <- dtw_spen[inds, inds]
str(dtw_spen_tmp)
##  num [1:156, 1:156] 0 2.42 2.48 1.89 2.12 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
##   ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
dtw_mult_tmp <- dtw_mult[inds, inds]
str(dtw_mult_tmp)
##  num [1:156, 1:156] 0 111 135 123 140 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
##   ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...

AM2.1 - AM2.5

We performed MDS on each filtered matrix (converting the SPCC similarity matrices to distance matrices first; the DTW matrices already contain distances) and generated a visual per method.

method <- c("SPCC - spectrogram", "SPCC - cepstral coefficients", "DTW - dominant frequency", "DTW - spectral entropy", "DTW - multivariate")

# Convert only the SPCC matrices to distance matrices
# DTW matrices already contain distances
mat_list <- list(1 - xc_mat_spec_tmp, 1 - xc_mat_ceps_tmp, dtw_domf_tmp, dtw_spen_tmp, dtw_mult_tmp)
# str(mat_list)

# x <- 3
invisible(pblapply(1:length(mat_list), function(x){
  
  # Convert to a distance object for isoMDS
  dist_mat <- stats::as.dist(mat_list[[x]], diag = TRUE, upper = TRUE)
  # str(dist_mat)

  # Reduce dimensionality of the SPCC matrix to visualize calls in 2D acoustic space
  iso <- invisible(MASS::isoMDS(dist_mat, k = 2, maxit = 1000, trace = FALSE))
  # str(iso)
  
  bird_ids <- nat_inv_est$Bird_ID[grep(paste(paste("^", calls, "$", sep = ""), collapse = "|"), nat_inv_est$sound.files)]
  # unique(bird_ids)
  
  mds_df <- data.frame(X = iso$points[, 1], Y = iso$points[, 2], indiv = bird_ids)
  
  # Order individuals by native and then invasive range
  mds_df$indiv <- factor(mds_df$indiv, levels = train_indivs)

  # Convex hull polygons per individual
  hulls <- plyr::ddply(mds_df, "indiv", function(x){
    x[chull(x$X, x$Y), ]
  })

  gg <- ggplot(mds_df, aes(x = X, y = Y)) + 
  
    geom_point(aes(color = indiv, fill = indiv, shape = indiv), size = 4) + 
  
    geom_polygon(data = hulls, aes(x = X, y = Y, fill = indiv, color = indiv), alpha = 0.2, size = 0.2) +
  
    scale_colour_manual(values = cols) + scale_fill_manual(values = cols) +
    scale_shape_manual(values = shps) +
    guides(color = guide_legend(title = "Individual", nrow = 2, byrow = TRUE), fill = guide_legend(title = "Individual"), shape = guide_legend(title = "Individual")) +
    xlab("Dimension 1") + ylab("Dimension 2") + 
    theme_bw() +
    theme(legend.position = "top") +
    ggtitle(method[x]) + 
    theme(
      plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
      axis.title = element_text(size = 14),
      axis.text = element_text(size = 12),
      legend.title = element_text(size = 14),
      legend.text = element_text(size = 14)
      )
  
  print(gg)
  
}))

Visuals of structural measurements

Read in acoustic measurements, including cepstral coefficients. Sound file names do not need to be changed.

acs_params <- read.csv(file.path(path, "acoustic_parameters.csv")) %>%
  dplyr::mutate(
    selec = as.character(selec)
  )
# nrow(acs_params) == nrow(nat_inv_est)
# glimpse(acs_params)

cep_coeff <- read.csv(file.path(path, "Mel_freq_cepstral_coefficients.csv")) %>%
  dplyr::mutate(
    selec = as.character(selec)
  )
# nrow(cep_coeff) == nrow(nat_inv_est)
# glimpse(cep_coeff)

Filter acoustic measurements by the given individuals.

# Find calls per each individual

# Checking, looks good
nat_inv_est %>%
  filter(social_scale == "Individual") %>%
  filter(Bird_ID %in% train_indivs) %>%
  group_by(Bird_ID) %>%
  summarise(num_calls = length(sound.files))
## # A tibble: 8 x 2
##   Bird_ID  num_calls
##   <chr>        <int>
## 1 INV-UM1         23
## 2 INV-UM19        12
## 3 INV-UM5         20
## 4 INV-UM7         28
## 5 NAT-AAT         12
## 6 NAT-UM1         25
## 7 NAT-UM2         23
## 8 NAT-UM4         13
train_indivs
## [1] "NAT-AAT"  "NAT-UM1"  "NAT-UM2"  "NAT-UM4"  "INV-UM19" "INV-UM1"  "INV-UM5" 
## [8] "INV-UM7"
calls <- nat_inv_est %>%
  filter(social_scale == "Individual") %>%
  filter(Bird_ID %in% train_indivs) %>%
  pull(sound.files)
head(calls)
## [1] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV"
## [2] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV"
## [3] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV"
## [4] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV"
## [5] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_6.WAV"
## [6] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_7.WAV"
# Subset data frame by these calls
acs_params_tmp <- acs_params[grep(paste(paste("^", calls, "$", sep = ""), collapse = "|"), acs_params$sound.files), ]
# nrow(acs_params_tmp)
# all(acs_params_tmp$sound.files %in% calls)
# glimpse(acs_params_tmp)

cep_coeff_tmp <- cep_coeff[grep(paste(paste("^", calls, "$", sep = ""), collapse = "|"), cep_coeff$sound.files), ]
# nrow(cep_coeff_tmp)
# all(cep_coeff_tmp$sound.files %in% calls)
# glimpse(cep_coeff_tmp)

AM2.6, AM2.7

Perform PCA and make a plot per measurement type.

method <- c("Acoustic parameters", "Mel-frequency cepstral coefficients")

mat_list <- list(acs_params_tmp[, sapply(acs_params_tmp, is.numeric)], cep_coeff_tmp[, sapply(cep_coeff_tmp, is.numeric)])

# x <- 2
invisible(pblapply(1:length(mat_list), function(x){

  # Pre-process using the caret package
  pp_list <- caret::preProcess(mat_list[[x]], method = c("YeoJohnson", "zv", "nzv", "center", "scale"), thresh = 0.98, freqCut = 98/2, uniqueCut = 2)

  pp <- predict(pp_list, mat_list[[x]])

  # Perform PCA with base package for easier access of summary and loadings
  pp_pca <- stats::prcomp(pp, center = FALSE)
  # str(pp_pca)
  
  bird_ids <- nat_inv_est$Bird_ID[grep(paste(paste("^", calls, "$", sep = ""), collapse = "|"), nat_inv_est$sound.files)]
  # unique(bird_ids)

  pca_df <- data.frame(X = pp_pca$x[, 1], Y = pp_pca$x[, 2], indiv = bird_ids)
  
  # Order individuals by native and then invasive range
  pca_df$indiv <- factor(pca_df$indiv, levels = train_indivs)

  # Convex hull polygons per individual
  hulls <- plyr::ddply(pca_df, "indiv", function(x){
    x[chull(x$X, x$Y), ]
  })

  gg <- ggplot(pca_df, aes(x = X, y = Y)) + 
  
    geom_point(aes(color = indiv, fill = indiv, shape = indiv), size = 4) + 
  
    geom_polygon(data = hulls, aes(x = X, y = Y, fill = indiv, color = indiv), alpha = 0.2, size = 0.2) +
  
    scale_colour_manual(values = cols) + scale_fill_manual(values = cols) +
    scale_shape_manual(values = shps) +
    guides(color = guide_legend(title = "Individual", nrow = 2, byrow = TRUE), fill = guide_legend(title = "Individual"), shape = guide_legend(title = "Individual")) +
    xlab("Dimension 1") + ylab("Dimension 2") + 
    theme_bw() +
    theme(legend.position = "top") +
    ggtitle(method[x]) + 
    theme(
      plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
      axis.title = element_text(size = 14),
      axis.text = element_text(size = 12),
      legend.title = element_text(size = 14),
      legend.text = element_text(size = 14)
      )
  
  print(gg)
  
}))

Overall, these exploratory plots indicated that MDS and PCA features contained information about individual identity (confirmed by checking patterns of overdispersion in acoustic space by individual). Different feature sets also contained different information (patterns of dispersion among individuals differed across feature types). Ready to proceed with feature extraction.

Extract features by MDS and PCA

Here, we extracted MDS features from the similarity matrices. We did not search for an “optimal” number of dimensions as in previous work, but instead kept the number of MDS dimensions the same across measurement types and datasets. We extracted MDS features separately per dataset because the larger dataset representing greater geographic spread contained some of the same calls as the dataset of repeatedly sampled individuals (suffix "_site_scale" in sound file names), and MDS cannot handle duplicated calls (pairwise similarity = 1 or pairwise distance = 0 off the matrix diagonal). This routine kept feature extraction identical between this analysis and the independent analysis of hierarchical mapping, but for this analysis of acoustic structure, I removed the "_site_scale" suffix calls after feature extraction and prior to training models.

Get calls for both social scales.

indiv_calls <- nat_inv_est %>%
  filter(social_scale == "Individual") %>%
  dplyr::mutate(
    sound.files = as.character(sound.files)
  ) %>%
  pull(sound.files)
length(indiv_calls)

site_calls <- nat_inv_est %>%
  filter(social_scale == "Site") %>%
  dplyr::mutate(
    sound.files = as.character(sound.files)
  ) %>%
  pull(sound.files)
length(site_calls)

# Checking
# length(indiv_calls) + length(site_calls) == nrow(nat_inv_est)

ss <- c("Individual", "Site")
call_pats <- c(paste(paste("^", indiv_calls, "$", sep = ""), collapse = "|"), paste(paste("^", site_calls, "$", sep = ""), collapse = "|"))

Split matrices by social scale. We generated features this way because the independent analysis of hierarchical mapping patterns (manuscript in prep) used some repeatedly sampled individual calls at the site scale, and MDS cannot accept duplicate samples.

mthd <- c("SPCC_spec", "SPCC_ceps", "DTW_domf", "DTW_spen", "DTW_mult")

# Convert only the SPCC matrices to distance matrices
# DTW matrices already contain distances
mat_list <- list(1 - xc_mat_spec, 1 - xc_mat_ceps, dtw_domf, dtw_spen, dtw_mult)

# Remove the original matrices from the global environment to free up memory
rm(list = c('xc_mat_spec', 'xc_mat_ceps', 'dtw_domf', 'dtw_spen', 'dtw_mult'))

# Loop over social scales and methods to subset the full matrices
ss_mat_list <- invisible(pblapply(1:length(ss), function(i){
  
  tmp_mat_list <- lapply(1:length(mthd), function(x){
    
    tmp_mat <- mat_list[[x]][grep(call_pats[i], dimnames(mat_list[[x]])[[1]]), grep(call_pats[i], dimnames(mat_list[[x]])[[2]])]
    return(tmp_mat)
    
  })
  
  names(tmp_mat_list) <- mthd
  return(tmp_mat_list)
  
}))

names(ss_mat_list) <- ss
str(ss_mat_list)

# Remove the original matrix list to free up memory
rm(list = "mat_list")

Perform MDS with 15 dimensions per social scale and similarity measurement type.

dims <- 15

invisible(pblapply(1:length(ss), function(i){
  
  mds_res <- lapply(1:length(mthd), function(x){

    # A memory error occurred when we tried 1000 iterations with 25 dimensions for the site scale
    iso <- invisible(MASS::isoMDS(ss_mat_list[[i]][[x]], k = dims, maxit = 1000, trace = FALSE))
  
    saveRDS(iso, file.path(path, paste("MDS_results_", ss[i], "_", mthd[x], ".RDS", sep = ""))) 
  
  })

}))

rm(list = "ss_mat_list")

Split acoustic structure datasets by social scale, to match how MDS was run.

mthd <- c("acs_params", "cep_coeff")

df_list <- list(acs_params, cep_coeff)

# Loop over social scales and methods to subset the full data frames
ss_df_list <- invisible(pblapply(1:length(ss), function(i){
  
  tmp_df_list <- lapply(1:length(mthd), function(x){
    
    tmp_df <- df_list[[x]][grep(call_pats[i], df_list[[x]]$sound.files), ]
    return(tmp_df)
    
  })
  
  names(tmp_df_list) <- mthd
  return(tmp_df_list)
  
}))

names(ss_df_list) <- ss
# str(ss_df_list)

# Remove the original data frame list to free up memory
rm(list = "df_list")

Get PCA features for acoustic and image parameters.

mthd <- c("acs_params", "cep_coeff")

invisible(pblapply(1:length(ss), function(i){
  
  res_df <- lapply(1:length(mthd), function(x){

    # Get numeric columns only
    df_num <- ss_df_list[[i]][[x]][, sapply(ss_df_list[[i]][[x]], is.numeric)]
    
    # Pre-process using the caret package
    pp_list <- caret::preProcess(df_num, method = c("YeoJohnson", "zv", "nzv", "center", "scale"), thresh = 0.98, freqCut = 98/2, uniqueCut = 2)

    pp <- predict(pp_list, df_num)

    # Perform PCA with base package for easier access of summary and loadings
    pp_pca <- stats::prcomp(pp, center = FALSE)
  
    saveRDS(pp_pca, file.path(path, paste("PCA_results_", ss[i], "_", mthd[x], ".RDS", sep = ""))) 
  
  })

}))

How many principal components were extracted per parameter type?

ss <- c("Individual", "Site")
mthd <- c("acs_params", "cep_coeff")

dims <- invisible(pblapply(1:length(ss), function(i){
  
  dim_list <- lapply(1:length(mthd), function(x){
    
    pca <- readRDS(file.path(path, paste("PCA_results_", ss[i], "_", mthd[x], ".RDS", sep = ""))) 

    dim(pca$x)[2]
    
  })
  
  names(dim_list) <- mthd
  return(dim_list)
  
}))

names(dims) <- ss
dims
## $Individual
## $Individual$acs_params
## [1] 27
## 
## $Individual$cep_coeff
## [1] 88
## 
## 
## $Site
## $Site$acs_params
## [1] 27
## 
## $Site$cep_coeff
## [1] 88

Combine features across datasets

Combine MDS and PCA features across call datasets into a single data frame. For each feature type, make a data frame with a sound.files column, and merge data frames with the full selection table afterwards using this column.

dim_red <- c("MDS", "PCA")

mthd <- list(c("SPCC_spec", "SPCC_ceps", "DTW_domf", "DTW_spen", "DTW_mult"), c("acs_params", "cep_coeff"))

mthd_nms <- list(c("SPCC - spectrogram", "SPCC - cepstral coefficients", "DTW - dominant frequency", "DTW - spectral entropy", "DTW - multivariate"), c("Acoustic parameters", "Mel-frequency cepstral coefficients"))

# length(ss_df_list)
# length(ss_df_list[[1]])

# i <- 2
# x <- 2
# z <- 3
feat_df_list <- invisible(pblapply(1:length(ss), function(i){
  
  df_list2 <- lapply(1:length(dim_red), function(x){
  
    df_list <- lapply(1:length(mthd[[x]]), function(z){
    
      if(dim_red[x] == "MDS"){
      
        iso <- readRDS(file.path(path, paste("MDS_results_", ss[i], "_", mthd[[x]][z], ".RDS", sep = "")))
      
        sound.files <- dimnames(iso$points)[[1]]
        feats <- data.frame(iso$points)
        names(feats) <- paste(mthd[[x]][z], "MDS", seq(1, ncol(feats), 1), sep = "_")
        tmp_df <- data.frame(sound.files = sound.files, feats) %>%
            dplyr::mutate(
              sound.files = as.character(sound.files)
            )
        # str(tmp_df)
        
        return(tmp_df)
    
      } else if(dim_red[x] == "PCA"){
      
        # Take all principal components
        pca <- readRDS(file.path(path, paste("PCA_results_", ss[i], "_", mthd[[x]][z], ".RDS", sep = "")))
        # str(pca)
        
        sound.files <- ss_df_list[[i]][[z]]$sound.files
        feats <- data.frame(pca$x)
          
        names(feats) <- paste(mthd[[x]][z], "PCA", seq(1, ncol(feats), 1), sep = "_")
        tmp_df <- data.frame(sound.files = sound.files, feats) %>%
          dplyr::mutate(
            sound.files = as.character(sound.files)
          )
        # str(tmp_df)
        
        return(tmp_df)
      
      }
    
    })
    
    names(df_list) <- mthd[[x]]
    return(df_list)
    
  })
  
  names(df_list2) <- dim_red
  return(df_list2)
  
}))

names(feat_df_list) <- ss
str(feat_df_list[["Individual"]])
str(feat_df_list[["Site"]])

Combine each set of features across measurement types with metadata in the extended selection table per call dataset, then combine features across datasets. Note that a column “rf_set” was added to each social scale for model training, validation and prediction for the hierarchical mapping analysis. This will be ignored for the structural analysis here, as I split calls into training, validation and prediction datasets differently for this manuscript.

Repeatedly sampled individuals.

# Features that will be used for random forests modeling
names(feat_df_list[["Individual"]][["MDS"]])
names(feat_df_list[["Individual"]][["PCA"]])

indiv_df <- nat_inv_est %>%
  filter(social_scale == "Individual") %>%
  dplyr::select(
    c(sound.files, date, range, social_scale, site, Bird_ID, region, country, site_year, year, dept_state, invasive_city)
  ) %>%
  # Add a new column to split calls into datasets for predictive modeling
  # If an individual is not one of those designated for training, then assign to validation
  dplyr::mutate(
    rf_set = ifelse(Bird_ID %in% train_indivs, "training", "validation"),
  ) %>%
  # Merge the MDS features after merging this list into a single data frame
  left_join(
    feat_df_list[["Individual"]][["MDS"]] %>%
      purrr::reduce(
      left_join, by = "sound.files"
    ),
    by = "sound.files"
  ) %>%
  # Merge the PCA features after merging this list into a single data frame
  left_join(
    feat_df_list[["Individual"]][["PCA"]] %>%
      purrr::reduce(
      left_join, by = "sound.files"
    ),
    by = "sound.files"
  ) %>%
  droplevels()  %>%
  dplyr::mutate(
    year = as.character(year)
  )

glimpse(indiv_df)

Dataset representing broader geographic sampling.

# Features that will be used for random forests modeling
names(feat_df_list[["Site"]][["MDS"]])
names(feat_df_list[["Site"]][["PCA"]])

site_df <- nat_inv_est %>%
  filter(social_scale == "Site") %>%
  dplyr::select(
    c(sound.files, date, range, social_scale, site, Bird_ID, region, country, site_year, year, dept_state, invasive_city)
  ) %>%
  # Add a new column to split calls into datasets for predictive modeling
  # Calls used for random forests prediction
  dplyr::mutate(
    rf_set = "prediction",
  ) %>%
  # Merge the MDS features after merging this list into a single data frame
  left_join(
    feat_df_list[["Site"]][["MDS"]] %>%
      purrr::reduce(
      left_join, by = "sound.files"
    ),
    by = "sound.files"
  ) %>%
  # Merge the PCA features after merging this list into a single data frame
  left_join(
    feat_df_list[["Site"]][["PCA"]] %>%
      purrr::reduce(
      left_join, by = "sound.files"
    ),
    by = "sound.files"
  ) %>%
  droplevels() %>%
  dplyr::mutate(
    year = as.character(year)
  )

glimpse(site_df)

Merged the individual and site scale features and metadata into a single data frame for random forests modelling.

sup_rf_df <- bind_rows(indiv_df, site_df)
glimpse(sup_rf_df)

Wrote out all features across the 1596 calls for predictive modeling.

write.csv(sup_rf_df, file.path(path, "supervised_RF_datasets_acousticSimilarity.csv"), row.names = FALSE)

Splitting data

How to split calls into training, validation and prediction sets to address the main question? Here, we wanted to train supervised machine learning models to classify calls back to range as a way to infer whether calls were structurally different between ranges. First we needed to determine which invasive range areas were sampled over time, in order to make spatial comparisons between ranges while also accounting for potential temporal change in the invasive range.

Drop calls with the suffix "_site_scale", as these represent a single call per repeatedly sampled individual that was included at the site scale for the hierarchical mapping analysis. Also drop the 4 NAT-UM4 calls that were inadvertently retained at the site scale in a previous analysis (retaining them did not change the reported results)[1]. We used the remaining 1561 calls for supervised machine learning analyses.

dropUM4 <- nat_inv_est %>%
  filter(social_scale == "Site") %>%
  filter(Bird_ID == "NAT-UM4" & !grepl("site_scale", sound.files)) %>%
  pull(sound.files)
dropUM4
## [1] "2017_07_26_1145-02_MZ000321_1.WAV" "2017_07_26_1145-02_MZ000321_2.WAV"
## [3] "2017_07_26_1145-02_MZ000321_5.WAV" "2017_07_26_1145-02_MZ000321_6.WAV"
nat_inv_est <- nat_inv_est %>%
  as_tibble() %>%
  filter(!grepl("site_scale", sound.files)) %>%
  filter(!sound.files %in% dropUM4) %>%
  droplevels()
dim(nat_inv_est)
## [1] 1561   26

Spatial sampling

What is our spatial resolution in the invasive range? Number of calls by city in the U.S.

nat_inv_est %>%
  filter(range == "Invasive") %>%
  group_by(invasive_city) %>%
  dplyr::summarise(
    n_calls = length(sound.files)
  )
## # A tibble: 8 x 2
##   invasive_city n_calls
##   <chr>           <int>
## 1 Austin            512
## 2 Dallas              9
## 3 El Paso            25
## 4 Gilbert            16
## 5 Miami              40
## 6 Milford           105
## 7 New Orleans       109
## 8 Stratford          50

We should include calls from cities sampled over time in the spatial comparison, but avoid depleting the dataset needed for temporal questions. Which years had the most calls? 2019 for Austin, 2004 for New Orleans.

nat_inv_est %>%
  filter(invasive_city %in% c("Austin", "New Orleans")) %>%
  group_by(invasive_city, year) %>%
  dplyr::summarise(
    n_calls = length(sound.files)
  )
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## # A tibble: 5 x 3
## # Groups:   invasive_city [2]
##   invasive_city year  n_calls
##   <chr>         <chr>   <int>
## 1 Austin        2004       93
## 2 Austin        2011       84
## 3 Austin        2019      335
## 4 New Orleans   2004       61
## 5 New Orleans   2011       48

Deciding which Austin and New Orleans sites to use for spatial questions (e.g. comparing call structure between ranges). Austin has several sites that were sampled over years, while New Orleans was sampled at different sites in different years (which still allows us to assess change over time at the city scale).

For Austin, take calls from site-years not sampled over time for the spatial question. This includes the Austin site sampled at only the individual scale (BART). Aside from BART, 4 Austin site-years were not sampled over time: two 2019 sites, one 2011 site, one 2004 site.

spatial_austin <- nat_inv_est %>%
  as_tibble() %>%
  # filter(social_scale == "Site") %>%
  filter(invasive_city %in% c("Austin")) %>%
  group_by(invasive_city, site) %>%
  dplyr::summarise(
    n_years = n_distinct(year)
  ) %>%
  filter(n_years == 1) %>%
  ungroup() %>%
  inner_join(
    nat_inv_est %>%
      as_tibble() %>%
      # filter(social_scale == "Site") %>%
      dplyr::select(site, sound.files, year),
    by = "site"
  ) %>%
  group_by(invasive_city, site, year) %>%
  dplyr::summarise(
    n_calls = length(sound.files)
  )
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'invasive_city', 'site'. You can override using the `.groups` argument.
spatial_austin
## # A tibble: 5 x 4
## # Groups:   invasive_city, site [5]
##   invasive_city site  year  n_calls
##   <chr>         <chr> <chr>   <int>
## 1 Austin        AIRP  2019        7
## 2 Austin        BART  2011       23
## 3 Austin        COMM  2004       11
## 4 Austin        MANO  2019        5
## 5 Austin        SOFT  2011       14

How many calls would we use for the spatial question from Austin?

# Note that BART 2011 is from the individual scale
spatial_austin %>%
  pull(n_calls) %>%
  sum()
## [1] 60

Looks good, initialize these Austin site-years for spatial questions.

sp_aust_sites <- spatial_austin %>%
  pull(site)

sp_aust_sites
## [1] "AIRP" "BART" "COMM" "MANO" "SOFT"

Now choosing New Orleans calls for spatial questions. Note that CAME is from the individual scale only as well.

nat_inv_est %>%
  filter(invasive_city %in% c("New Orleans")) %>%
  group_by(invasive_city, site, year) %>%
  dplyr::summarise(
    n_calls = length(sound.files)
  )
## `summarise()` has grouped output by 'invasive_city', 'site'. You can override using the `.groups` argument.
## # A tibble: 6 x 4
## # Groups:   invasive_city, site [6]
##   invasive_city site  year  n_calls
##   <chr>         <chr> <chr>   <int>
## 1 New Orleans   BALL  2004       26
## 2 New Orleans   CAME  2004       12
## 3 New Orleans   CANA  2004       13
## 4 New Orleans   FOLS  2004       10
## 5 New Orleans   LAKE  2011        5
## 6 New Orleans   ROBE  2011       43

Only two 2011 sites. Take calls from two 2004 sites, which leaves two sites in each year for temporal questions. Initialized FOLS (low number of calls) and CAME (low number of calls, individual scale only) for spatial questions.

sp_norl_sites <- c("FOLS", "CAME")

Temporal sampling

Which invasive cities were sampled over time? Austin, New Orleans.

nat_inv_est %>%
  filter(!is.na(invasive_city)) %>%
  group_by(invasive_city) %>%
  summarise(n_years = n_distinct(year))
## # A tibble: 8 x 2
##   invasive_city n_years
##   <chr>           <int>
## 1 Austin              3
## 2 Dallas              1
## 3 El Paso             1
## 4 Gilbert             1
## 5 Miami               1
## 6 Milford             1
## 7 New Orleans         2
## 8 Stratford           1

A minimum of 5 calls per site-year for these cities.

nat_inv_est %>%
  filter(!is.na(invasive_city)) %>%
  filter(invasive_city %in% c("Austin", "New Orleans")) %>%
  group_by(invasive_city, site_year) %>%
  dplyr::summarise(
    n_calls = length(sound.files)
      ) %>%
  pull(n_calls) %>%
  min()
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## [1] 5

Which sites per city were sampled over time? Here, dropping site-years already set aside for spatial questions.

nat_inv_est %>%
  filter(social_scale == "Site") %>%
  filter(range == "Invasive") %>%
  filter(invasive_city %in% c("Austin", "New Orleans")) %>%
  filter(!site %in% c(sp_aust_sites, sp_norl_sites)) %>%
  group_by(invasive_city, year) %>%
  dplyr::summarise(
    uniq_sites = paste(unique(site), collapse = "; ")
  )
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## # A tibble: 5 x 3
## # Groups:   invasive_city [2]
##   invasive_city year  uniq_sites            
##   <chr>         <chr> <chr>                 
## 1 Austin        2004  SOCC; VALL            
## 2 Austin        2011  ELEM; INTR; MART; VALL
## 3 Austin        2019  ELEM; INTR; SOCC; MART
## 4 New Orleans   2004  CANA; BALL            
## 5 New Orleans   2011  LAKE; ROBE

Pretty decent number of sites sampled over time. We can assess change over time at two scales, either for the whole city or sites within cities. New Orleans can be assessed only at the city scale, but Austin can be assessed at specific sites over time.

Sampling for predictive modelling

Split the invasive range calls for the spatial (between ranges) and temporal (over time in the invasive range) comparisons. Here, the Austin site-years sampled over time, plus 2 New Orleans sites per sampling year, are set aside for the temporal question; the rest are used for the spatial comparison between ranges. The set of sites per comparison will be sampled differently for training, validation and prediction.

For native range sites and invasive range sites used for the comparison between ranges, I took 1/2 of the total calls per site for training. Among the remaining calls, I took 1/3 for validation and 2/3 for prediction. For invasive range sites used for temporal comparisons (change over time in the invasive range), I set 20 calls aside for prediction (or all calls if less than 20 were present), and split the remaining calls by half into validation and training (or none for either, if all calls taken for prediction).
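As a quick sanity check, the allocation rules above can be sketched in base R for two hypothetical site totals (30 calls for a spatial-comparison site, 15 calls for a temporal-comparison site-year); these totals are illustrative, not values from the dataset.

```r
# Illustrative only: allocation for a hypothetical spatial-comparison site
spatial_total <- 30
n_train <- round(spatial_total / 2)                      # 1/2 for training
n_validation <- round((spatial_total - n_train) / 3)     # 1/3 of the remainder
n_prediction <- spatial_total - (n_train + n_validation) # the rest for prediction
c(train = n_train, validation = n_validation, prediction = n_prediction)
# -> train 15, validation 5, prediction 10

# And for a hypothetical temporal site-year with fewer than 20 calls,
# all calls go to prediction and none to training or validation
temporal_total <- 15
n_prediction_t <- ifelse(temporal_total > 20, 20, temporal_total)
n_train_t <- ifelse(n_prediction_t >= 20,
                    round((temporal_total - n_prediction_t) / 2), 0)
n_validation_t <- temporal_total - (n_train_t + n_prediction_t)
c(train = n_train_t, validation = n_validation_t, prediction = n_prediction_t)
# -> train 0, validation 0, prediction 15
```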

Split the invasive range calls into spatial and temporal datasets, as random sampling will be done differently depending on whether a given site/city was sampled over time.

inv_spatial <- nat_inv_est %>%
  filter(range == "Invasive") %>%
  # Get all sites from cities not sampled over time
  filter(!invasive_city %in% c("Austin", "New Orleans")) %>%
  # Add back the Austin and New Orleans sites designated for the spatial question
  bind_rows(
    nat_inv_est %>%
    filter(range == "Invasive") %>%
    filter(site %in% c(sp_aust_sites, sp_norl_sites))
  ) 

# unique(inv_spatial$site)
# glimpse(inv_spatial)

inv_temporal <- nat_inv_est %>%
  filter(range == "Invasive") %>%
  # Filter to Austin and New Orleans sites
  filter(invasive_city %in% c("Austin", "New Orleans")) %>%
  # Exclude sites already designated for the spatial question
  filter(!site %in% c(sp_aust_sites, sp_norl_sites)) 

# unique(inv_temporal$site)
# glimpse(inv_temporal)

# Checking, looks good
# (nrow(inv_spatial) + nrow(inv_temporal)) == nrow(nat_inv_est %>% filter(range == "Invasive"))

Check out sites and number of calls for the invasive spatial dataset: 15 sites across 8 cities. BART and CAME were sampled at the individual scale only; they are included because we used calls from both social scales in this analysis.

inv_spatial %>%
  group_by(range, invasive_city, site, year) %>%
  dplyr::summarise(n_calls = length(sound.files))
## `summarise()` has grouped output by 'range', 'invasive_city', 'site'. You can override using the `.groups` argument.
## # A tibble: 15 x 5
## # Groups:   range, invasive_city, site [15]
##    range    invasive_city site  year  n_calls
##    <fct>    <chr>         <chr> <chr>   <int>
##  1 Invasive Austin        AIRP  2019        7
##  2 Invasive Austin        BART  2011       23
##  3 Invasive Austin        COMM  2004       11
##  4 Invasive Austin        MANO  2019        5
##  5 Invasive Austin        SOFT  2011       14
##  6 Invasive Dallas        LAWT  2011        9
##  7 Invasive El Paso       ASCA  2019       25
##  8 Invasive Gilbert       GILB  2018       16
##  9 Invasive Miami         BAPT  2004       40
## 10 Invasive Milford       AUDU  2004       17
## 11 Invasive Milford       BUCK  2004       60
## 12 Invasive Milford       MEAD  2004       28
## 13 Invasive New Orleans   CAME  2004       12
## 14 Invasive New Orleans   FOLS  2004       10
## 15 Invasive Stratford     SHAK  2004       50

Check out sites and number of calls for the invasive temporal dataset. 14 site-years, 2 invasive cities. Five sites sampled over 2 years in Austin, 2 years of sampling at the city scale in New Orleans.

inv_temporal %>%
  group_by(range, invasive_city, site, year) %>%
  dplyr::summarise(n_calls = length(sound.files))
## `summarise()` has grouped output by 'range', 'invasive_city', 'site'. You can override using the `.groups` argument.
## # A tibble: 14 x 5
## # Groups:   range, invasive_city, site [9]
##    range    invasive_city site  year  n_calls
##    <fct>    <chr>         <chr> <chr>   <int>
##  1 Invasive Austin        ELEM  2011        8
##  2 Invasive Austin        ELEM  2019       88
##  3 Invasive Austin        INTR  2011       15
##  4 Invasive Austin        INTR  2019       84
##  5 Invasive Austin        MART  2011       14
##  6 Invasive Austin        MART  2019       49
##  7 Invasive Austin        SOCC  2004       77
##  8 Invasive Austin        SOCC  2019      102
##  9 Invasive Austin        VALL  2004        5
## 10 Invasive Austin        VALL  2011       10
## 11 Invasive New Orleans   BALL  2004       26
## 12 Invasive New Orleans   CANA  2004       13
## 13 Invasive New Orleans   LAKE  2011        5
## 14 Invasive New Orleans   ROBE  2011       43

General predictive modelling approach

Did contact call structure change after invasion? Supervised ML models were trained to discriminate between native and invasive range calls (ranges were class labels in this two-class prediction problem). After validating models, we chose one model to predict classes (native or invasive range) of calls in the prediction dataset that were set aside for spatial and temporal comparisons. We used classification accuracy throughout the modeling process as an indicator of structural differences between ranges, and visualized overlap between ranges in low-dimensional acoustic space. We assessed the number of misclassified calls from different cities in the U.S. for a finer geographic assessment of structural change, and used misclassified calls from Austin and New Orleans to assess the possibility of structural change over time in the invasive range.
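A minimal sketch of this workflow with the ranger package, using toy data rather than the call measurements (the shifted mean in the simulated predictor f1 simply mimics a structural difference between ranges, and the set proportions loosely mirror the 43/22/35 split reported below):

```r
library(ranger)

set.seed(401)
# Toy two-class data: f1 differs between ranges, f2 is pure noise
toy <- data.frame(
  range = factor(rep(c("Native", "Invasive"), each = 100)),
  f1 = c(rnorm(100, mean = 0), rnorm(100, mean = 1)),
  f2 = rnorm(200)
)

# Randomly assign calls to training, validation and prediction sets
idx <- sample(c("training", "validation", "prediction"), size = 200,
              replace = TRUE, prob = c(0.43, 0.22, 0.35))

# Train a random forest to classify calls back to range
rf <- ranger(range ~ ., data = toy[idx == "training", ],
             num.trees = 500, importance = "permutation")

# Validation accuracy as an indicator of structural differentiation
val_pred <- predict(rf, data = toy[idx == "validation", ])
val_acc <- mean(val_pred$predictions == toy$range[idx == "validation"])
val_acc

# Predict classes for the held-out prediction set and tally misclassifications
pred <- predict(rf, data = toy[idx == "prediction", ])
table(observed = toy$range[idx == "prediction"], predicted = pred$predictions)
```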

Split data for predictive modelling

Sampling scheme across native and invasive ranges for the spatial and temporal comparisons, as described above.

# Make a data frame that contains the number of randomly sampled calls that should be chosen for training, validation and prediction at each site-year
ML_scheme <- nat_inv_est %>%
  filter(range == "Native") %>%
  # Make sampling scheme for the native and invasive sites used for spatial questions
  bind_rows(inv_spatial) %>%
  # Per site-year, find the number of calls to sample for training, validation and prediction
  # The number of calls per site-year will vary depending on the total calls
  # Start by calculating the total number of calls per site-year
  group_by(site_year) %>%
  dplyr::summarise(
    total_calls = length(sound.files)
  ) %>%
  # Calculate 1/2 of total for training
  # 1/3 of the remaining for validation
  # 2/3 of the remaining for prediction
  dplyr::mutate(
    n_train = round(total_calls/2),
    n_validation = round((total_calls - n_train)/3),
    n_prediction = total_calls - (n_train + n_validation)
  ) %>% 
  ungroup() %>%
  # Make sampling scheme for the invasive sites used for temporal questions
  # Set aside 20 calls or the minimum for prediction
  # Evenly split the rest between training and validation
  bind_rows(
    inv_temporal %>%
    group_by(site_year) %>%
    dplyr::summarise(
      total_calls = length(sound.files)
    ) %>%
    dplyr::mutate(
      n_prediction = ifelse(total_calls > 20, 20, total_calls),
      n_train = ifelse(n_prediction >= 20, round((total_calls - n_prediction)/2), 0),
      n_validation = total_calls - (n_train + n_prediction)
    ) %>% 
    ungroup()
  )
  
head(ML_scheme)
## # A tibble: 6 x 5
##   site_year total_calls n_train n_validation n_prediction
##   <chr>           <int>   <dbl>        <dbl>        <dbl>
## 1 1145_2017          86      43           14           29
## 2 AIRP_2019           7       4            1            2
## 3 ARAP_2017          12       6            2            4
## 4 ARAZ_2017          15       8            2            5
## 5 ASCA_2019          25      12            4            9
## 6 AUDU_2004          17       8            3            6

Randomly sample calls per site for training.

set.seed(seed)

train_df <- nat_inv_est %>%
  filter(range == "Native") %>%
  bind_rows(inv_spatial) %>%
  bind_rows(inv_temporal) %>%
  # Random sampling for both questions
  group_by(site_year) %>%
  nest() %>%
  ungroup() %>%
  inner_join(
    ML_scheme %>%
      dplyr::select(site_year, n_train),
    by = "site_year"
  ) %>%
  dplyr::mutate(
    rsamp_call = purrr::map2(data, n_train, sample_n, replace = FALSE)
  ) %>%
  dplyr::select(-data) %>%
  unnest(rsamp_call) %>%
  dplyr::select(-c(n_train))

# glimpse(train_df)

Randomly sample calls per site for validation.

set.seed(seed)

validation_df <- nat_inv_est %>%
  filter(range == "Native") %>%
  bind_rows(inv_spatial) %>%
  bind_rows(inv_temporal) %>%
  # Filter out calls already allocated to training
  filter(!sound.files %in% train_df$sound.files) %>%
  # Random sampling for both questions
  group_by(site_year) %>%
  nest() %>%
  ungroup() %>%
  inner_join(
    ML_scheme %>%
      dplyr::select(site_year, n_validation),
    by = "site_year"
  ) %>%
  dplyr::mutate(
    rsamp_call = purrr::map2(data, n_validation, sample_n, replace = FALSE)
  ) %>%
  dplyr::select(-data) %>%
  unnest(rsamp_call) %>%
  dplyr::select(-c(n_validation))

# glimpse(validation_df)

Set aside the remaining calls per site for prediction.

set.seed(seed)

prediction_df <- nat_inv_est %>%
  filter(range == "Native") %>%
  bind_rows(inv_spatial) %>%
  bind_rows(inv_temporal) %>%
  # Remove calls already allocated to training and validation
  filter(!sound.files %in% train_df$sound.files & !sound.files %in% validation_df$sound.files) %>%
  droplevels()

# glimpse(prediction_df)

Check out sample sizes across the different datasets.

676 calls allocated to training: 43% of calls used for supervised machine learning.

nrow(train_df)
## [1] 676
round(nrow(train_df)/nrow(nat_inv_est), 2)*100
## [1] 43

337 calls allocated to validation: 22% of calls.

nrow(validation_df)
## [1] 337
round(nrow(validation_df)/nrow(nat_inv_est), 2)*100
## [1] 22

548 calls allocated to prediction: 35% of calls.

nrow(prediction_df)
## [1] 548
round(nrow(prediction_df)/nrow(nat_inv_est), 2)*100
## [1] 35

How many native and invasive range calls used for training? About 20 fewer invasive than native calls (327 versus 349).

train_df %>%
  group_by(range) %>%
  dplyr::summarise(
    n_calls = length(sound.files)
  )
## # A tibble: 2 x 2
##   range    n_calls
##   <fct>      <int>
## 1 Native       349
## 2 Invasive     327

How many native and invasive range calls used for validation? 116 and 221, respectively.

validation_df %>%
  group_by(range) %>%
  dplyr::summarise(
    n_calls = length(sound.files)
  )
## # A tibble: 2 x 2
##   range    n_calls
##   <fct>      <int>
## 1 Native       116
## 2 Invasive     221

How many native and invasive range calls used for prediction? 230 and 318, respectively.

prediction_df %>%
  group_by(range) %>%
  dplyr::summarise(
    n_calls = length(sound.files)
  )
## # A tibble: 2 x 2
##   range    n_calls
##   <fct>      <int>
## 1 Native       230
## 2 Invasive     318

The balance between the native and invasive classes is pretty even, although most balanced for training. The percentage of calls used across these datasets for predictive modeling is pretty good as well (almost 1/2 of calls used for training, the rest split between validation and prediction).

Evaluate sample sizes

Full prediction dataset for invasive range, by site-year.

prediction_df %>%
  filter(range == "Invasive") %>%
  group_by(range, invasive_city) %>%
  summarise(
    n_calls = length(sound.files)
  ) 
## `summarise()` has grouped output by 'range'. You can override using the `.groups` argument.
## # A tibble: 8 x 3
## # Groups:   range [1]
##   range    invasive_city n_calls
##   <fct>    <chr>           <int>
## 1 Invasive Austin            171
## 2 Invasive Dallas              3
## 3 Invasive El Paso             9
## 4 Invasive Gilbert             5
## 5 Invasive Miami              13
## 6 Invasive Milford            35
## 7 Invasive New Orleans        65
## 8 Invasive Stratford          17

How many sites per range in the prediction dataset for the comparison between ranges? 37 native and 15 invasive sites (two invasive sites are individual scale only).

spat_pred_sites <- inv_spatial %>%
  pull(site_year) %>%
  unique()

# spat_pred_sites

prediction_df %>%
  filter(site_year %in% spat_pred_sites | range == "Native") %>%
  group_by(range) %>%
  summarise(
    n_sites = n_distinct(site)
  )
## # A tibble: 2 x 2
##   range    n_sites
##   <fct>      <int>
## 1 Native        37
## 2 Invasive      15

How good is our spatial resolution of the invasive range for prediction? Not bad, although resolution for Dallas and New Orleans is poor. The number of calls for Dallas was limited to begin with (9 calls total), and a native-invasive comparison was built into the temporal question because native range calls were also set aside for prediction (another opportunity to evaluate New Orleans calls with respect to native range calls).

# Invasive range only, spatial comparison calls 
prediction_df %>%
  filter(site_year %in% spat_pred_sites) %>%
  group_by(invasive_city) %>%
  summarise(
    n_calls = length(sound.files)
  )
## # A tibble: 8 x 2
##   invasive_city n_calls
##   <chr>           <int>
## 1 Austin             19
## 2 Dallas              3
## 3 El Paso             9
## 4 Gilbert             5
## 5 Miami              13
## 6 Milford            35
## 7 New Orleans         7
## 8 Stratford          17

The number of calls per site-year ranges from 2 to 29.

# Native and invasive range
prediction_df %>%
  filter(site_year %in% spat_pred_sites | range == "Native") %>%
  group_by(range, site_year) %>%
  summarise(
    n_calls = length(sound.files)
  ) %>%
  pull(n_calls) %>%
  range()
## `summarise()` has grouped output by 'range'. You can override using the `.groups` argument.
## [1]  2 29

Sample sizes for the temporal question in the prediction dataset look good, for both city-scale and site-year-scale temporal comparisons.

By city: 152 Austin calls, 58 New Orleans calls over time.

temp_pred_sites <- inv_temporal %>%
  pull(site_year) %>%
  unique()

temp_pred_sites
##  [1] "ROBE_2011" "ELEM_2019" "INTR_2019" "SOCC_2019" "ELEM_2011" "INTR_2011"
##  [7] "MART_2011" "VALL_2011" "LAKE_2011" "MART_2019" "SOCC_2004" "CANA_2004"
## [13] "BALL_2004" "VALL_2004"
prediction_df %>%
  filter(site_year %in% temp_pred_sites) %>%
  group_by(invasive_city) %>%
  summarise(
    n_calls = length(sound.files)
  )
## # A tibble: 2 x 2
##   invasive_city n_calls
##   <chr>           <int>
## 1 Austin            152
## 2 New Orleans        58

By city and year.

prediction_df %>%
  filter(site_year %in% temp_pred_sites) %>%
  group_by(invasive_city, year) %>%
  summarise(
    n_calls = length(sound.files)
  )
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## # A tibble: 5 x 3
## # Groups:   invasive_city [2]
##   invasive_city year  n_calls
##   <chr>         <chr>   <int>
## 1 Austin        2004       25
## 2 Austin        2011       47
## 3 Austin        2019       80
## 4 New Orleans   2004       33
## 5 New Orleans   2011       25

By site-year.

prediction_df %>%
  filter(site_year %in% temp_pred_sites) %>%
  group_by(invasive_city, site_year) %>%
  summarise(
    n_calls = length(sound.files)
  )
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## # A tibble: 14 x 3
## # Groups:   invasive_city [2]
##    invasive_city site_year n_calls
##    <chr>         <chr>       <int>
##  1 Austin        ELEM_2011       8
##  2 Austin        ELEM_2019      20
##  3 Austin        INTR_2011      15
##  4 Austin        INTR_2019      20
##  5 Austin        MART_2011      14
##  6 Austin        MART_2019      20
##  7 Austin        SOCC_2004      20
##  8 Austin        SOCC_2019      20
##  9 Austin        VALL_2004       5
## 10 Austin        VALL_2011      10
## 11 New Orleans   BALL_2004      20
## 12 New Orleans   CANA_2004      13
## 13 New Orleans   LAKE_2011       5
## 14 New Orleans   ROBE_2011      20

Final predictive modelling dataset

Combined the training, validation and prediction datasets into a single object.

sup_ML_o <- bind_rows(
  train_df %>%
    dplyr::mutate(
      set = "training"
    ),
  validation_df %>%
    dplyr::mutate(
      set = "validation"
    ),
  prediction_df %>%
    dplyr::mutate(
      set = "prediction"
    )
)
glimpse(sup_ML_o)
# nrow(sup_ML_o) == nrow(nat_inv_est) # Looks good

# Check for duplicated sound files. None, looks good
# length(which(duplicated(sup_ML_o$sound.files)))

Merge this data frame with the features measured above. We also decided to add the standard set of spectral acoustic measurements, as it is easier to map structural differentiation back to raw measurements than to features (like principal components), which are less interpretable combinations of the original measurements. Retained metadata useful for subsequent analyses.

# Features generated for supervised random forests that can be used for other classification questions by supervised ML (full EST across social scales)
ML_feats_df <- read.csv(file.path(path, "supervised_RF_datasets_acousticSimilarity.csv")) %>%
  dplyr::mutate(
    year = as.character(year)
  ) %>%
  # Remove image features that were generated for a separate analysis of hierarchical mapping
  dplyr::select(-c(names(.)[grep("^img_", names(.))]))
glimpse(ML_feats_df)

# 27 original spectral acoustic measurements across calls (full EST across social scales)
acs_params <- read.csv(file.path(path, "acoustic_parameters.csv")) %>%
  dplyr::mutate(
    selec = as.character(selec)
  )
glimpse(acs_params)

# Merge data frames
sup_ML <- sup_ML_o %>%
  dplyr::mutate(
    selec = as.character(selec)
  ) %>%
  inner_join(
    acs_params,
    by = c("sound.files", "selec")
  ) %>%
  inner_join(
    ML_feats_df %>%
      dplyr::select(-c(date, range, social_scale, site, Bird_ID, region, country, site_year, year, dept_state, invasive_city, rf_set)),
    by = "sound.files"
  ) %>% 
  dplyr::select(-c(selec, start, end, date, social_scale, site, Bird_ID, lat, lon, Overlapping_signal, Truncated, tailored, prev_SNR, SNR_margin_works, SNR_before_only, old.sound.file.name, SNR))

# Get names of predictors
pnms <- names(sup_ML)[which(sapply(sup_ML, is.numeric))]
head(pnms)
length(pnms)

# Remove additional metadata columns to retain only those useful for ML or for joining metadata back later (this line removes the "rf_set" column used for the supervised machine learning analysis in the hierarchical mapping project)
sup_ML <- sup_ML %>%
  dplyr::select(c(sound.files, range, set, all_of(pnms)))

# Looks good, 1561 calls, 217 predictors, 3 metadata columns
glimpse(sup_ML)
dim(sup_ML)

Check collinearity of all predictors prior to building models.

# Perform Pearson's correlation among all numeric values
cor <- stats::cor(sup_ML[, sapply(sup_ML, is.numeric)], method = "pearson")
str(cor)
range(cor[cor < 1])

# Use the correlation matrix to find names of variables that have > 0.75 or < -0.75 collinearity
corr_nms <- caret::findCorrelation(cor, cutoff = 0.75, verbose = TRUE, names = TRUE, exact = TRUE)

# 14 variables with Pearson's |r| > 0.75, these should be removed prior to building models
corr_nms

# [1] "meanfreq"         "acs_params_PCA_1" "acs_params_PCA_2" "freq.median"     
# [5] "freq.Q25"         "entropy"          "sd"               "meandom"         
# [9] "sfm"              "time.Q75"         "time.IQR"         "time.median"     
# [13] "time.ent"         "skew"  

Check collinearity with signal-to-noise ratio (SNR) too, using Pearson's correlation.

# Are sound files still in the same order? If so, then can do the column addition below
all(sup_ML$sound.files == sup_ML_o$sound.files)

sup_ML_SNR <- sup_ML %>%
  dplyr::mutate(
    SNR = sup_ML_o$SNR
  ) %>%
  filter(!is.na(SNR)) %>%
  droplevels()
dim(sup_ML_SNR)

# Check for high correlations among features and SNR
cor_SNR <- stats::cor(sup_ML_SNR[, sapply(sup_ML_SNR, is.numeric) & names(sup_ML_SNR) != "SNR"], sup_ML_SNR$SNR, method = "pearson")

range(cor_SNR[, 1])

# No features are highly correlated with SNR
which(abs(cor_SNR[, 1]) > 0.75) 

Remove collinear predictors prior to proceeding with model-building. 203 numeric predictors remain.

sup_ML_fin <- sup_ML %>%
  dplyr::select(-c(all_of(corr_nms)))

dim(sup_ML_fin)
length(which(sapply(sup_ML_fin, is.numeric)))

# 14 columns dropped, looks good
ncol(sup_ML) - ncol(sup_ML_fin)

# Check for duplicated sound files one more time? None, looks good
# length(which(duplicated(sup_ML_fin$sound.files)))

Write out the final dataset for supervised machine learning analyses as an RDS file, and also as a .csv for sharing data.

saveRDS(sup_ML_fin, file.path(path, "sup_ML_fin.RDS"))
write.csv(sup_ML_fin, file.path(path, "sup_ML_fin.csv"), row.names = FALSE)

Supervised machine learning

Start training/tuning models for each question. Read the data back in to begin analysis.

# sup_ML_df <- read.csv(file.path(path, "sup_ML_fin.csv"))
sup_ML_df <- readRDS(file.path(path, "sup_ML_fin.RDS"))
dim(sup_ML_df)
## [1] 1561  206
# glimpse(sup_ML_df)

203 numeric predictors.

length(which(sapply(sup_ML_df, is.numeric)))
## [1] 203

15 of these predictors were original acoustic measurements, and 188 were MDS or PCA features. I decided to use these predictors later to determine whether any of the original spectral measurements explained the probability of assignment of calls back to either range by the final model (an indication of structural differences between ranges, see partial dependence plots below).

pnms <- names(sup_ML_df)[which(sapply(sup_ML_df, is.numeric))]
length(pnms[-grep("MDS|PCA", pnms)])
## [1] 15
pnms[-grep("MDS|PCA", pnms)]
##  [1] "duration"  "freq.Q75"  "freq.IQR"  "time.Q25"  "kurt"      "sp.ent"   
##  [7] "mindom"    "maxdom"    "dfrange"   "modindx"   "startdom"  "enddom"   
## [13] "dfslope"   "meanpeakf" "peakf"
length(pnms[grep("SPCC_spec", pnms)])
## [1] 15
length(pnms[grep("SPCC_ceps", pnms)])
## [1] 15
length(pnms[grep("DTW_domf", pnms)])
## [1] 15
length(pnms[grep("DTW_spen", pnms)])
## [1] 15
length(pnms[grep("DTW_mult", pnms)])
## [1] 15
length(pnms[grep("acs_params", pnms)])
## [1] 25
length(pnms[grep("cep_coeff", pnms)])
## [1] 88

AM2.8

Check out the distributions of the raw acoustic measurements per range.

raw_params_df <- sup_ML_df %>%
  dplyr::select(range, pnms[-grep("MDS|PCA", pnms)]) %>%
  # Convert to long format
  pivot_longer(
    cols = names(.)[-grep("^range$", names(.))],
    names_to = "parameter",
    values_to = "values"
  ) %>%
  dplyr::mutate(
    range = as.character(range),
    range = factor(range, levels = c("Native", "Invasive"))
  )
# glimpse(raw_params_df)

# names(raw_params_df)

# levels(raw_params_df$range)
cols <- scales::alpha(c("navy", "orange"), 0.85)

raw_params_df %>%
  ggplot(aes(x = range, y = values, fill = range, color = range)) +
  geom_boxplot() +
  scale_fill_manual(values = cols) +
  scale_color_manual(values = cols) +
  facet_wrap(~ parameter, scales = "free_y") +
  guides(fill = guide_legend(title = "Range"), color = guide_legend(title = "Range")) +
  xlab("") +
  ylab("Parameter values") +
  theme_bw() + 
  theme(
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 12),
    strip.text = element_text(size = 14),
    legend.position = "top",
    panel.grid.major = element_line(size = 0.15),
    panel.grid.minor = element_line(size = 0.15),
    axis.ticks = element_line(size = 0.25),
    legend.margin = margin(0, 0, 0, 0),
    legend.box.margin = margin(-8, -8, -8, -8)
  )

Some of these spectral acoustic measurements show visible differences between ranges.
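To put numbers on these visual differences, per-range medians of each raw measurement can be tabulated directly. This is a minimal sketch, assuming the raw_params_df object from the chunk above; it was not part of the original analysis.

```r
# Sketch: per-range medians of each raw acoustic measurement, to
# complement the boxplots above
raw_params_df %>%
  group_by(parameter, range) %>%
  dplyr::summarise(med = median(values), .groups = "drop") %>%
  pivot_wider(names_from = range, values_from = med)
```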

Training and tuning models

I trained and tuned models with built-in variable importance metrics for classification back to the native or invasive range: stochastic gradient boosting and random forests. I wanted to be able to assess the importance of the raw acoustic measurements and acoustic features used as predictors.

Initialize training data, retain only columns needed for machine learning.

table(sup_ML_df$set)
## 
## prediction   training validation 
##        548        676        337
train_df <- sup_ML_df %>%
  filter(set == "training") %>%
  dplyr::mutate(
    range = factor(range) # Must be a factor for caret::train
    ) %>%
  dplyr::select(-c(sound.files, set))
dim(train_df)
## [1] 676 204
head(names(train_df))
## [1] "range"    "duration" "freq.Q75" "freq.IQR" "time.Q25" "kurt"

Set caret::train control parameters, to be used across models for training and/or tuning.

# Must provide a tunegrid object to caret::train if search is set to grid in caret::trainControl. Alternative is to set tuneLength in caret::train instead of tuneGrid, and set search to "random" in trainControl
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))

fitControl <- trainControl(
    method = 'repeatedcv',             # repeated k-fold cross validation
    number = 5,                        # number of folds
    repeats = 5,                       # number of repeats
    savePredictions = 'final',         # save predictions for optimal tuning parameter
    classProbs = TRUE,                 # return class probabilities
    summaryFunction = fiveStats,       # summary function for metrics (2-class problem)
    allowParallel = TRUE,              # use parallel backend if available
    search = "grid",                   # use grid search routine for tuning
    returnData = TRUE,                 # save data
    returnResamp = "final",            # return the final resampled summary metrics
    verboseIter = TRUE                 # display training log, otherwise no progress bar
) 

Stochastic gradient boosting

Trained and tuned a stochastic gradient boosting model (method = "gbm", package gbm).

n.trees <- seq(100, 1600, 500) # default is 100 trees
interaction.depth <- seq(1, 3, 1) # default is 1 (additive model), 2 = model with up to 2-way interactions, etc
shrinkage <- c(seq(0.001, 0.1, 0.05), 0.1) # controls expansion of trees or learning rate, default is 0.1, values of 0.001 to 0.1 usually work 
n.minobsinnode <- 1 # min observations in trees' terminal nodes

tunegrid <- expand.grid(.n.trees = n.trees, .interaction.depth = interaction.depth, .shrinkage = shrinkage, .n.minobsinnode = n.minobsinnode)
tunegrid

# Gradient boosting (GBM)
set.seed(seed)

model_GBM <- train(range ~ ., data = train_df, method = 'gbm', tuneGrid = tunegrid, trControl = fitControl)

saveRDS(model_GBM, file.path(path, "model_GBM.RDS"))

Check out final model: n.trees = 1600, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 1.

model_GBM <- readRDS(file.path(path, "model_GBM.RDS"))
model_GBM
## Stochastic Gradient Boosting 
## 
## 676 samples
## 203 predictors
##   2 classes: 'Invasive', 'Native' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 541, 540, 541, 540, 542, 541, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  ROC        Sens       Spec     
##   0.001      1                   100     0.8666084  0.5331282  0.9041491
##   0.001      1                   600     0.8834416  0.7168858  0.8303354
##   0.001      1                  1100     0.8923738  0.7285501  0.8497805
##   0.001      1                  1600     0.9009991  0.7413986  0.8561242
##   0.001      2                   100     0.8877427  0.6124196  0.9053913
##   0.001      2                   600     0.9038213  0.7756084  0.8366542
##   0.001      2                  1100     0.9151044  0.7963730  0.8527288
##   0.001      2                  1600     0.9245664  0.8110676  0.8573085
##   0.001      3                   100     0.9019266  0.6350303  0.9185424
##   0.001      3                   600     0.9173590  0.8037576  0.8567288
##   0.001      3                  1100     0.9291327  0.8233100  0.8607288
##   0.001      3                  1600     0.9369714  0.8428904  0.8641822
##   0.051      1                   100     0.9386602  0.8287646  0.8721822
##   0.051      1                   600     0.9699025  0.8898275  0.9226087
##   0.051      1                  1100     0.9719870  0.8898462  0.9249110
##   0.051      1                  1600     0.9727460  0.8923077  0.9271801
##   0.051      2                   100     0.9546729  0.8666667  0.8859710
##   0.051      2                   600     0.9719701  0.8990396  0.9186335
##   0.051      2                  1100     0.9746530  0.9051189  0.9260455
##   0.051      2                  1600     0.9757203  0.9045221  0.9294907
##   0.051      3                   100     0.9575512  0.8740233  0.8893996
##   0.051      3                   600     0.9735224  0.8978089  0.9260704
##   0.051      3                  1100     0.9759176  0.9051282  0.9295155
##   0.051      3                  1600     0.9767323  0.9075991  0.9317930
##   0.100      1                   100     0.9544239  0.8562424  0.8945590
##   0.100      1                   600     0.9698300  0.8911049  0.9226087
##   0.100      1                  1100     0.9717820  0.8947599  0.9243147
##   0.100      1                  1600     0.9722513  0.9002238  0.9214410
##   0.100      2                   100     0.9612362  0.8764382  0.9019876
##   0.100      2                   600     0.9731488  0.8984242  0.9254576
##   0.100      2                  1100     0.9748158  0.9002611  0.9277598
##   0.100      2                  1600     0.9760072  0.9027319  0.9288861
##   0.100      3                   100     0.9656127  0.8868718  0.9048861
##   0.100      3                   600     0.9754100  0.9100326  0.9289193
##   0.100      3                  1100     0.9766214  0.9106946  0.9294990
##   0.100      3                  1600     0.9772890  0.9113100  0.9334990
##   Accuracy   Kappa    
##   0.7246132  0.4420302
##   0.7754282  0.5490385
##   0.7910928  0.5803041
##   0.8005788  0.5994044
##   0.7636113  0.5224180
##   0.8070735  0.6131827
##   0.8254334  0.6499850
##   0.8349129  0.6690797
##   0.7813366  0.5583391
##   0.8310609  0.6613093
##   0.8425991  0.6845822
##   0.8538432  0.7072869
##   0.8511722  0.7016860
##   0.9067785  0.8131682
##   0.9079549  0.8155188
##   0.9103319  0.8202886
##   0.8766300  0.7528817
##   0.9091620  0.8180253
##   0.9159419  0.8315946
##   0.9174300  0.8345518
##   0.8819351  0.7635594
##   0.9123995  0.8244601
##   0.9177242  0.8351454
##   0.9201012  0.8399086
##   0.8760155  0.7514967
##   0.9073754  0.8143815
##   0.9100313  0.8197110
##   0.9111902  0.8220793
##   0.8896302  0.7788466
##   0.9123885  0.8244451
##   0.9144736  0.8286270
##   0.9162470  0.8321974
##   0.8961510  0.7920013
##   0.9197873  0.8393140
##   0.9203952  0.8405386
##   0.9227547  0.8452523
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 1600, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 1.

Check out the confusion matrix: 92.28% accuracy by repeated k-fold cross-validation. 95% CI: (91.33, 93.16). Kappa = 0.8453.

confusionMatrix(model_GBM$pred$obs, model_GBM$pred$pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Invasive Native
##   Invasive     1490    145
##   Native        116   1629
##                                           
##                Accuracy : 0.9228          
##                  95% CI : (0.9133, 0.9316)
##     No Information Rate : 0.5249          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8453          
##                                           
##  Mcnemar's Test P-Value : 0.08307         
##                                           
##             Sensitivity : 0.9278          
##             Specificity : 0.9183          
##          Pos Pred Value : 0.9113          
##          Neg Pred Value : 0.9335          
##              Prevalence : 0.4751          
##          Detection Rate : 0.4408          
##    Detection Prevalence : 0.4837          
##       Balanced Accuracy : 0.9230          
##                                           
##        'Positive' Class : Invasive        
## 

Random forests

Random forests (RF) training and tuning. Setting the seed here is not strictly necessary, as every forest will differ due to the random sampling at every node.

# 10 mtry values for tuning
mtry <- round(seq(2, ncol(train_df[, sapply(train_df, is.numeric)]), ncol(train_df[, sapply(train_df, is.numeric)])/10))
mtry
length(mtry)

# Other parameters for ranger random forests
splitrule <- "gini"
min.node.size <- 1

# Make a data frame of all combinations of the parameters initialized for ranger random forests that can be placed into a tuning grid
# Note that the number of trees cannot be placed in a tuning grid for caret, here using 2000 trees
tunegrid <- expand.grid(.mtry = mtry, .splitrule = splitrule, .min.node.size = min.node.size)
tunegrid
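Since num.trees cannot go in the caret tuning grid for method = "ranger", different forest sizes can be compared by fitting separate models. A minimal (and slow) sketch, assuming train_df, fitControl, tunegrid, seed, and cores from above; the forest sizes are hypothetical:

```r
# Sketch: compare resampled accuracy across forest sizes at a single
# mtry value (first row of the tuning grid)
nt_acc <- sapply(c(500, 1000, 2000), function(nt) {
  set.seed(seed)
  m <- caret::train(
    range ~ ., data = train_df, method = "ranger", num.trees = nt,
    trControl = fitControl, tuneGrid = tunegrid[1, , drop = FALSE],
    num.threads = cores
  )
  max(m$results$Accuracy)
})
nt_acc
```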

set.seed(seed)
model_RF <- train(range ~ ., data = train_df, num.trees = 2000, method = "ranger", trControl = fitControl, tuneGrid = tunegrid, importance = "permutation", replace = TRUE, scale.permutation.importance = TRUE, num.threads = cores, maximize = TRUE)

saveRDS(model_RF, file.path(path, "model_RF.RDS"))

mtry of 2 in the final model.

model_RF <- readRDS(file.path(path, "model_RF.RDS"))
model_RF
## Random Forest 
## 
## 676 samples
## 203 predictors
##   2 classes: 'Invasive', 'Native' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 541, 540, 541, 540, 542, 541, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec       Accuracy   Kappa    
##     2   0.9713853  0.8911329  0.9295072  0.9109312  0.8214807
##    22   0.9687683  0.8935291  0.9105590  0.9023492  0.8043769
##    43   0.9636139  0.8923450  0.9019876  0.8973253  0.7943952
##    63   0.9597669  0.8911142  0.8888033  0.8899156  0.7796407
##    83   0.9564917  0.8843823  0.8807702  0.8825123  0.7648629
##   104   0.9529082  0.8795058  0.8784845  0.8789656  0.7577544
##   124   0.9508079  0.8783030  0.8784513  0.8783619  0.7565414
##   144   0.9485359  0.8721772  0.8727288  0.8724534  0.7447047
##   164   0.9459910  0.8703590  0.8744513  0.8724579  0.7446895
##   185   0.9440756  0.8648205  0.8710145  0.8680067  0.7357485
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
##  and min.node.size = 1.

Check out the confusion matrix: 91.09% accuracy with 95% CI: (90.08, 92.03). Kappa = 0.8215.

confusionMatrix(model_RF$pred$obs, model_RF$pred$pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Invasive Native
##   Invasive     1457    178
##   Native        123   1622
##                                           
##                Accuracy : 0.9109          
##                  95% CI : (0.9008, 0.9203)
##     No Information Rate : 0.5325          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8215          
##                                           
##  Mcnemar's Test P-Value : 0.001855        
##                                           
##             Sensitivity : 0.9222          
##             Specificity : 0.9011          
##          Pos Pred Value : 0.8911          
##          Neg Pred Value : 0.9295          
##              Prevalence : 0.4675          
##          Detection Rate : 0.4311          
##    Detection Prevalence : 0.4837          
##       Balanced Accuracy : 0.9116          
##                                           
##        'Positive' Class : Invasive        
## 

Compare training performance

AM2.9

mlist <- list(model_RF, model_GBM)
names(mlist) <- c("RF", "GBM")
results <- resamples(mlist)
# summary(results)

dotplot(results, metric = c("Kappa", "Accuracy", "ROC"))

Both models performed well, with > 90% training classification accuracy.

Variable importance

A good number of the original spectral acoustic measurements had high importance; the most important among them were spectral entropy (sp.ent) and frequency interquartile range (freq.IQR).

n <- 12

# Feature types
# "SPCC_spec"  "SPCC_ceps"  "DTW_domf"   "DTW_spen"   "DTW_mult"   "acs_params"
# "cep_coeff", raw parameters
cols <- c(
  scales::alpha("purple", 0.6), 
  topo.colors(n)[2], 
  terrain.colors(n)[1], 
  heat.colors(n)[1], 
  heat.colors(n, alpha = 0.7)[5], 
  heat.colors(n)[8], 
  gray.colors(n)[5], 
  "black"
  )

types <- c("SPCC_spec", "SPCC_ceps", "DTW_domf", "DTW_spen", "DTW_mult", "acs_params", "cep_coeff")

models <- c("Random Forests", "Stochastic \n Gradient Boosting")

gg_list <- list()

# i <- 1
invisible(pblapply(1:length(mlist), function(i){
  
  # Make a data frame of variable importance with feature type
  if(grepl("Boosting", models[i])){
    tmp <- data.frame(
      var_nms = rownames(varImp(mlist[[i]])$importance), 
      imp = varImp(mlist[[i]])$importance[["Overall"]]
    ) 
  } else {
    tmp <- data.frame(
      var_nms = names(mlist[[i]]$finalModel$variable.importance), 
      imp = mlist[[i]]$finalModel$variable.importance
    )
  }
  
  tmp <- tmp %>%
    # Add in a feature type column
    dplyr::mutate(
      var_type = sapply(1:nrow(tmp), function(y){
        paste(
          strsplit(as.character(tmp$var_nms[y]), split = "_")[[1]][1],
          strsplit(as.character(tmp$var_nms[y]), split = "_")[[1]][2],
          sep = "_"
        )}),
      var_type = ifelse(!var_type %in% types, "raw_params", var_type)
    ) %>%
    dplyr::mutate(
      var_type = factor(var_type, levels = c("SPCC_spec", "SPCC_ceps", "DTW_domf", "DTW_spen", "DTW_mult", "acs_params", "cep_coeff", "img_params", "raw_params")),
      var_nms = as.character(var_nms)
    ) %>%
    arrange(desc(imp)) %>%
    slice(1:30) %>%
    dplyr::mutate(
      var_nms = factor(var_nms, levels = var_nms),
      model = models[i]
    ) %>%
    droplevels()
  
  tmp_cols <- cols[grep(paste(levels(tmp$var_type), collapse = "|"), c(types, "raw_params"))]
  
  # Add y-axis text to RF plot only
  if(!grepl("Boosting", models[i])){
    y_txt <- "Top 30 Most Important Features"
  } else {
    y_txt <- ""
  }
  
  gg_list[[i]] <<- ggplot(data = tmp) + 
    facet_grid(~ model) +
    geom_segment(aes(x = 0, xend = imp, y = forcats::fct_rev(var_nms), yend = forcats::fct_rev(var_nms), color = var_type), size = 2) +
    geom_point(aes(y = forcats::fct_rev(var_nms), x = imp), pch = 21, fill = gray.colors(n)[2], color = "black", size = 3.5, stroke = 0.25) +
    scale_color_manual(values = tmp_cols) +
    guides(color = FALSE) +
    xlab("Variable Importance") + 
    ylab(y_txt) +
    theme_bw() + 
    theme(
      axis.title = element_text(size = 12),
      axis.text = element_text(size = 10),
      strip.text = element_text(size = 12),
      panel.grid.major.y = element_line(size = 0.15),
      panel.grid.major.x = element_line(color = "black", size = 0.15, linetype = "dotted"),
      panel.grid.minor = element_line(size = 0.15),
      axis.ticks = element_line(size = 0.15)
    ) 
  
}))

# Extract legend
tmp_df <- data.frame(
      var_nms = names(mlist[[1]]$finalModel$variable.importance), 
      imp = mlist[[1]]$finalModel$variable.importance
    ) %>%
  # Add in a feature type column
  dplyr::mutate(
    var_type = sapply(1:nrow(.), function(y){
      paste(
        strsplit(as.character(.$var_nms[y]), split = "_")[[1]][1],
        strsplit(as.character(.$var_nms[y]), split = "_")[[1]][2],
        sep = "_"
      )}),
    var_type = ifelse(!var_type %in% types, "raw_params", var_type)
  ) %>%
  dplyr::mutate(
    var_type = factor(var_type, levels = c("SPCC_spec", "SPCC_ceps", "DTW_domf", "DTW_spen", "DTW_mult", "acs_params", "cep_coeff", "raw_params")),
    var_nms = as.character(var_nms)
  ) 

unique(tmp_df$var_type)

tmp_cols <- cols[grep(paste(levels(tmp_df$var_type), collapse = "|"), c(types, "raw_params"))]

gg_leg <- gtable::gtable_filter(ggplot_gtable(ggplot_build(
  ggplot(data = tmp_df) +
  geom_segment(aes(x = 0, xend = imp, y = forcats::fct_rev(var_nms), yend = forcats::fct_rev(var_nms), color = var_type), size = 2.5) +
  scale_color_manual(values = cols) +
  guides(color = guide_legend(title = "", nrow = 2)) +
  theme(
    legend.direction = "horizontal",
      legend.title = element_text(size = 12),
      legend.text = element_text(size = 12),
      legend.position = "top",
      legend.key = element_rect(fill = scales::alpha("white", 0)),
      legend.margin = margin(0, 0, 0, 0),
      legend.box.margin = margin(-10, -10, -10, -10)
    )
)), "guide-box")

# grid::grid.draw(gg_leg)
dev.off()

jpeg(file.path(gpath, "AM10_VarImp.jpeg"), units = "in", width = 8, height = 6, res = 300)

ggarrange(
  as.ggplot(gg_leg),
  as.ggplot(
    ggarrange(
      as.ggplot(gg_list[[1]]),
      as.ggplot(gg_list[[2]]),
      nrow = 1,
      widths = c(5, 5)
    )
  ),
  nrow = 2,
  heights = c(1, 5)
)

dev.off()

AM2.10

Variable importance per model

The models shared several of the most important variables, but the RF model had more original spectral acoustic measurements than the GBM model among the top 30 most important features. Although GBM slightly outperformed RF in classification accuracy, I proceeded with RF for validation and prediction because it ranked more original spectral measurements among its top variables, and I wanted to assess partial dependence on these measurements to take a closer look at structural differences in calls between ranges.
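Partial dependence on a single raw measurement can be sketched with the pdp package (loaded above). A minimal example, assuming the model_RF and train_df objects from above; sp.ent is used purely as an illustrative predictor, and which.class may need to be an integer index in older pdp versions:

```r
# Sketch: partial dependence of the predicted probability of the
# invasive range class on spectral entropy (sp.ent)
pd_spent <- pdp::partial(
  model_RF, pred.var = "sp.ent", train = train_df,
  which.class = "Invasive", prob = TRUE
)
pdp::plotPartial(pd_spent)
```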

Proceeding with RF model validation. Here, I looked for warning signs of overfitting to the training data, specifically a validation accuracy substantially lower than the training accuracy.

# Initialize training data, retain only columns needed for ML
validation_df <- sup_ML_df %>%
  filter(set == "validation") %>%
  dplyr::mutate(
    range = factor(range) # Must be a factor for caret::train
    ) %>%
  dplyr::select(-c(sound.files, set))
dim(validation_df)
## [1] 337 204
head(names(validation_df))
## [1] "range"    "duration" "freq.Q75" "freq.IQR" "time.Q25" "kurt"

The RF model had 91.99% validation accuracy, comparable to the cross-validated training accuracy. 20 invasive range and 7 native range calls were misclassified.

# Consider using probability of labels
valid_RF <- predict(model_RF, newdata = validation_df, type = "raw")
str(valid_RF)
##  Factor w/ 2 levels "Invasive","Native": 2 2 2 2 2 1 1 2 2 2 ...
# View(valid_RF)

caret::confusionMatrix(valid_RF, validation_df$range)$table
##           Reference
## Prediction Invasive Native
##   Invasive      201      7
##   Native         20    109
caret::confusionMatrix(valid_RF, validation_df$range)$overall[["Accuracy"]]
## [1] 0.9198813

Validation accuracy was high, so we moved on to prediction with the validated model.

Prediction of native/invasive classes

Initialize prediction data.

sup_ML_df <- readRDS(file.path(path, "sup_ML_fin.RDS"))
dim(sup_ML_df)
## [1] 1561  206
# glimpse(sup_ML_df)

prediction_df <- sup_ML_df %>%
  filter(set == "prediction") %>%
  dplyr::mutate(
    range = factor(range) # Must be a factor for caret::train
    ) %>%
  dplyr::select(-c(sound.files, set))
dim(prediction_df)
## [1] 548 204
head(names(prediction_df))
## [1] "range"    "duration" "freq.Q75" "freq.IQR" "time.Q25" "kurt"

Carried out prediction with the final RF model. Overall prediction accuracy was 87.59%. 46 invasive range calls and 22 native range calls were misclassified.

predict_RF <- predict(model_RF, newdata = prediction_df, type = "raw")
# str(predict_RF)

caret::confusionMatrix(predict_RF, prediction_df$range)$table
##           Reference
## Prediction Invasive Native
##   Invasive      272     22
##   Native         46    208
caret::confusionMatrix(predict_RF, prediction_df$range)$overall[["Accuracy"]]
## [1] 0.8759124
# Also get the probability of each class from this prediction
predict_RF_prob <- predict(model_RF, newdata = prediction_df, type = "prob")
# str(predict_RF_prob)

# Order of classes predicted here
levels(predict_RF)
## [1] "Invasive" "Native"
# "Invasive" "Native"

saveRDS(predict_RF, file.path(path, "predict_RF.RDS"))
saveRDS(predict_RF_prob, file.path(path, "predict_RF_prob.RDS"))

Added prediction results to the prediction dataset.

predict_RF <- readRDS(file.path(path, "predict_RF.RDS"))
# glimpse(predict_RF)

predict_RF_prob <- readRDS(file.path(path, "predict_RF_prob.RDS"))
# glimpse(predict_RF_prob)

predict_res <- sup_ML_df %>%
  filter(set == "prediction") %>%
  dplyr::select(c(sound.files, range)) %>%
  inner_join(
    nat_inv_est %>%
      as_tibble() %>%
      dplyr::select(c(sound.files, range, year, site, site_year, invasive_city)),
    by = c("sound.files", "range")
  ) %>% 
  dplyr::mutate(
    predicted_class = as.character(predict_RF),
    prob_native = predict_RF_prob$Native,
    prob_invasive = predict_RF_prob$Invasive
  ) %>%
  droplevels()
# glimpse(predict_res)

RF acoustic space

Get the RF proximity matrix for the prediction dataset.

rf_proxm <- edarf::extract_proximity(model_RF$finalModel, prediction_df)
str(rf_proxm)

# save output
saveRDS(rf_proxm, file.path(path, "rf_proxm_call_structure.RDS"))

Perform MDS in 2 dimensions.

rf_proxm <- readRDS(file.path(path, "rf_proxm_call_structure.RDS"))
# head(rf_proxm)
# dim(rf_proxm)

iso <- invisible(MASS::isoMDS(stats::as.dist(1 - rf_proxm, diag = FALSE, upper = FALSE), k = 2, maxit = 1000, trace = FALSE))

# Join with metadata
mds_rf <- data.frame(X = iso$points[, 1]) %>%
  dplyr::mutate(
    Y = iso$points[, 2],
    sound.files = predict_res$sound.files
  ) %>%
  inner_join(
    predict_res,
    by = "sound.files"
  ) %>%
  dplyr::mutate(
    range = as.character(range),
    range = factor(range, levels = c("Native", "Invasive")),
    # Flip X coordinates so native range appears on the left side
    X = -X,
    Y = Y
  )
# glimpse(mds_rf)

Figure 1B

All sites used for RF prediction, shown in RF acoustic space. This is a main manuscript figure; a Gaussian kernel density estimator was used to generate the contours in acoustic space.

I relied on geom_density2d to visualize density contours. This function uses MASS::kde2d to apply a two-dimensional Gaussian density kernel to the MDS coordinates. For each kernel density estimate (KDE) per range, I set the bandwidth (argument h in geom_density2d), which controls the kernel width in both dimensions, to 0.5; in other words, the KDE had a width of 0.5 in each direction (x and y). After geom_density2d estimated the density, contours were drawn across bins of the density values. I set the binwidth to 1/10th of the maximum density value per range, so that density values for each range were split into 10 bins of even width. Note that I flipped the x-axis coordinates so that the native range contours appear on the left side of the graphic.

# levels(mds_rf$range)
cols <- scales::alpha(c("navy", "orange"), 0.85)

h <- 0.5 # bandwidth (KDE width)

# Get density values and bin them to check out bins used by geom_density2d
nat_dens <- kde2d(
  x = mds_rf %>%
      filter(range == "Native") %>%
      pull(X),
  y = mds_rf %>%
      filter(range == "Native") %>%
      pull(Y),
  h = 0.5
)

# str(nat_dens)

inv_dens <- kde2d(
  x = mds_rf %>%
      filter(range == "Invasive") %>%
      pull(X),
  y = mds_rf %>%
      filter(range == "Invasive") %>%
      pull(Y),
  h = 0.5
)

# str(inv_dens)

range(nat_dens)
## [1] -0.5670525  2.9509838
range(inv_dens)
## [1] -0.5687283  2.1119321
bw_nat <- max(nat_dens$z)/10
bw_inv <- max(inv_dens$z)/10

bw_nat
## [1] 0.2950984
bw_inv
## [1] 0.2111932
brks_nat <- seq(min(nat_dens$z), max(nat_dens$z), bw_nat)
brks_inv <- seq(min(inv_dens$z), max(inv_dens$z), bw_inv)

length(brks_nat) # 10 bins of width bw_nat for native range
## [1] 10
length(brks_inv) # 10 bins of width bw_inv for invasive range
## [1] 10
ggplot(data = mds_rf, aes(x = X, y = Y)) +
  geom_density2d(
    data = mds_rf %>%
      filter(range == "Native") %>%
      droplevels(),
    size = 0.65,
    binwidth = bw_nat,
    h = h,
    color = cols[1]
    ) +
  geom_density2d(
    data = mds_rf %>%
      filter(range == "Invasive") %>%
      droplevels(),
    size = 0.65,
    binwidth = bw_inv,
    h = h,
    color = cols[2]
    ) +
  scale_color_manual(values = cols) +
  xlab("Dimension 1") + ylab("Dimension 2") + 
  theme_bw() +
  theme(
    legend.position = "top",
    axis.title = element_text(size = 16),
    axis.text = element_text(size = 14),
    legend.title = element_text(size = 16),
    legend.text = element_text(size = 16),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.ticks = element_line(size = 0.35),
    legend.margin = margin(0, 0, 0, 0),
    legend.box.margin = margin(-5, -5, -5, -5),
    legend.key.width = unit(3, "lines")
  )

ggsave(file.path(gpath, "KernelDensity_RFAcousticSpace.tiff"), units = "in", width = 11, height = 3.06, dpi = 300) # Main manuscript

Finer-scale spatial and temporal comparisons

After visualizing invasive and native range calls in RF acoustic space, we took a closer look at call structure differentiation in different invasive range areas (finer spatial comparison between ranges), as well as over time. For these analyses, we used calls in the prediction dataset from sites that did not represent sampling over time, and Austin/New Orleans sites that represented invasive range sampling over time, respectively.

Spatial comparisons

Which invasive range cities and sites had calls misclassified during prediction? The fact that we could build a model with high training and validation accuracy was a good indicator that there were indeed structural differences between ranges, but it was possible that some regions of the invasive range were more native range-like in call structure. If so, such calls should have been misclassified to the native range.

Checking out misclassification in the invasive range by city. While more calls were misclassified for Milford, CT, this city also had greater representation in the prediction dataset (35 calls). Overall, we saw no clear pattern of misclassification back to the native range (i.e., structural change) by invasive city or region.

inv_spat_pred <- predict_res %>%
  filter(range != predicted_class & range == "Invasive") %>%
  # Filter by sites for spatial question
  filter(site_year %in% unique(inv_spatial$site_year)) %>%
  group_by(range, invasive_city) %>%
  dplyr::summarise(
    n_misclass = length(sound.files)
  ) %>%
  # Add number of calls used for prediction
  inner_join(
    sup_ML_df %>%
      filter(set == "prediction" & range == "Invasive") %>%
      dplyr::select(c(sound.files, range)) %>%
      inner_join(
        nat_inv_est %>%
          as_tibble() %>%
          filter(range == "Invasive") %>%
          # Filter by sites for spatial question
          filter(site_year %in% unique(inv_spatial$site_year)) %>%
          dplyr::select(c(sound.files, range, region, invasive_city)),
        by = c("sound.files", "range")
      ) %>%
      group_by(range, region, invasive_city) %>%
      dplyr::summarise(
        n_total = length(sound.files)
      ),
    by = c("range", "invasive_city")
  )

inv_spat_pred %>%
  dplyr::select(c(range, region, invasive_city, n_misclass, n_total)) %>%
  kable()
range     region     invasive_city  n_misclass  n_total
Invasive  Southwest  Austin                  3       19
Invasive  Southwest  Dallas                  1        3
Invasive  Southeast  Miami                   2       13
Invasive  Northeast  Milford                 5       35
Invasive  Southeast  New Orleans             1        7
Invasive  Northeast  Stratford               2       17

Fewer native range calls were misclassified. The western region had greater representation among the misclassified calls, but we also sampled more intensively in the west.

predict_res %>%
  filter(range != predicted_class & range == "Native") %>%
  group_by(range, site) %>%
  dplyr::summarise(
    n_misclass = length(sound.files)
  ) %>%
  # Add number of calls used for prediction
  inner_join(
    sup_ML_df %>%
      filter(set == "prediction" & range == "Native") %>%
      dplyr::select(c(sound.files, range)) %>%
      inner_join(
        nat_inv_est %>%
          as_tibble() %>%
          filter(range == "Native") %>%
          dplyr::select(c(sound.files, range, site, region)),
        by = c("sound.files", "range")
      ) %>%
      group_by(range, region, site) %>%
      dplyr::summarise(
        n_total = length(sound.files)
      ),
    by = c("range", "site")
  ) %>% 
  dplyr::select(range, site, region, n_misclass, n_total) %>%
  arrange(region) %>%
  kable()
range   site  region   n_misclass  n_total
Native  BCAR  Central           1        5
Native  FAGR  Central           1        2
Native  PEIX  Central           1        6
Native  PROO  Central           1        4
Native  OJOS  East              2        7
Native  PLVE  East              1        3
Native  VALI  East              1        7
Native  1145  West              6       29
Native  ARAZ  West              2        5
Native  CHAC  West              1        6
Native  EMBR  West              3       11
Native  ROSA  West              2        5
Note that BCAR and FAGR were native range urban sites in Montevideo. Which sites had no misclassifications during prediction?

predict_res %>%
  filter(range == predicted_class & range == "Native") %>%
  group_by(range, site) %>%
  dplyr::summarise(
    n_corrclass = length(sound.files)
  ) %>%
  # Add number of calls used for prediction
  inner_join(
    sup_ML_df %>%
      filter(set == "prediction" & range == "Native") %>%
      dplyr::select(c(sound.files, range)) %>%
      inner_join(
        nat_inv_est %>%
          as_tibble() %>%
          filter(range == "Native") %>%
          dplyr::select(c(sound.files, range, site, region)),
        by = c("sound.files", "range")
      ) %>%
      group_by(range, region, site) %>%
      dplyr::summarise(
        n_total = length(sound.files)
      ),
    by = c("range", "site")
  ) %>% 
  dplyr::select(range, site, region, n_corrclass, n_total) %>%
  arrange(region) %>%
  kable()
range site region n_corrclass n_total
Native BAGU Central 7 7
Native BCAR Central 4 5
Native CEME Central 2 2
Native FAGR Central 1 2
Native GOLF Central 7 7
Native INBR Central 6 6
Native PEIX Central 5 6
Native PROO Central 3 4
Native CISN East 9 9
Native ELTE East 7 7
Native HIPE East 2 2
Native OJOS East 5 7
Native PAVO East 9 9
Native PLVE East 2 3
Native QUEB East 5 5
Native SAUC East 2 2
Native VALI East 6 7
Native ARAP Northwest 4 4
Native 1145 West 23 29
Native ARAZ West 3 5
Native CHAC West 5 6
Native ECIL West 6 6
Native EMBR West 8 11
Native INES-01 West 4 4
Native INES-03 West 5 5
Native INES-04 West 3 3
Native INES-05 West 2 2
Native INES-06 West 2 2
Native INES-07 West 3 3
Native INES-08 West 9 9
Native KIYU West 3 3
Native LENA West 6 6
Native PFER West 18 18
Native PIED West 7 7
Native RIAC West 9 9
Native ROSA West 3 5
Native SEMI West 3 3

The other two native range urban sites, CEME and GOLF, had 100% correct classification back to the native range (2 and 7 calls, respectively).
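The sampling-intensity point can be made concrete by pooling the two tables above by region. The counts below were aggregated by hand from those tables; this is a quick base R sketch for context, not part of the original pipeline.

```r
# Region-level counts aggregated by hand from the two native range tables above
region <- c("Central", "East", "Northwest", "West")
n_misclass <- c(4, 4, 0, 14)
n_total <- c(39, 51, 4, 136)

# Rates put the raw counts in context: the West contributes the most
# misclassified calls, but also by far the most calls overall
misclass_rate <- round(n_misclass / n_total, 3)
data.frame(region, n_misclass, n_total, misclass_rate)
```

Once totals are accounted for, the West's misclassification rate (roughly 10%) is comparable to the Central region's, so its larger share of misclassified calls reflects sampling effort rather than poorer classification.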

Temporal comparisons

Which invasive range cities and sites sampled over time had misclassified calls? Temporal change in acoustic structure could have influenced call misclassification. If invasive populations grew over time, birds would need to recognize more individuals, which could lead to invasive range calls becoming more native range-like.

inv_temp_pred <- predict_res %>%
  filter(range == "Invasive") %>%
  # Filter by sites used for the temporal questions
  filter(site_year %in% unique(inv_temporal$site_year)) %>%
  dplyr::mutate(
    misclass = ifelse(range != predicted_class, "Y", "N")
  ) %>%
   # Summarise number of calls correctly and incorrectly classified by site-year
  group_by(invasive_city, site_year) %>%
  dplyr::summarise(
    n_misclass = length(which(misclass == "Y"))
  ) %>%
  # Add number of calls used for prediction
  inner_join(
    sup_ML_df %>%
      filter(set == "prediction" & range == "Invasive") %>%
      dplyr::select(c(sound.files)) %>%
      inner_join(
        nat_inv_est %>%
          filter(range == "Invasive") %>%
          # Filter by sites used for the temporal questions
          filter(site_year %in% unique(inv_temporal$site_year)) %>%
          dplyr::select(c(sound.files, year, invasive_city, site, site_year)),
        by = c("sound.files")
      ) %>%
      group_by(year, invasive_city, site, site_year) %>%
      dplyr::summarise(
        n_total = length(sound.files)
      ),
    by = c("invasive_city", "site_year")
  ) %>%
  ungroup() %>%
  dplyr::select(invasive_city, site, year, n_misclass, n_total)
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'year', 'invasive_city', 'site'. You can override using the `.groups` argument.
inv_temp_pred %>%
  kable()
invasive_city site year n_misclass n_total
Austin ELEM 2011 4 8
Austin ELEM 2019 2 20
Austin INTR 2011 7 15
Austin INTR 2019 1 20
Austin MART 2011 0 14
Austin MART 2019 5 20
Austin SOCC 2004 0 20
Austin SOCC 2019 1 20
Austin VALL 2004 1 5
Austin VALL 2011 3 10
New Orleans BALL 2004 2 20
New Orleans CANA 2004 1 13
New Orleans LAKE 2011 1 5
New Orleans ROBE 2011 4 20

AM2.11

Visualize misclassifications for Austin and New Orleans at the city scale. 32 calls from the sites representing temporal sampling were misclassified, and misclassification was highest for Austin in 2011.

cols <- scales::alpha("orange", 0.85)

# City scale: Austin and New Orleans
inv_temp_pred %>%
  group_by(invasive_city, year) %>%
  dplyr::summarise(n_calls = sum(n_misclass)) %>%
  ggplot(aes(x = year, y = n_calls)) +
  geom_col(color = scales::alpha("white", 0), fill = cols) +
  xlab("Year") +
  ylab("Number of misclassified calls") +
  facet_wrap(~ invasive_city) +
  scale_y_continuous(limits = c(0, 20)) +
  theme_bw() +
  ggtitle(paste(inv_temp_pred %>%
            pull(n_misclass) %>%
            sum(), "calls misclassified")) +
  theme(
    plot.title = element_text(size = 12, hjust = 0.5),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 12),
    strip.text = element_text(size = 12),
    panel.grid.major = element_line(size = 0.15),
    panel.grid.minor = element_line(size = 0.15),
    axis.ticks = element_line(size = 0.1)
  )
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.

Note that these differences in misclassification among years rely on few calls, so evidence for change in misclassification over time is weak. In addition, these results do not match what we would have predicted for change over time: if populations grew or shrank, we would expect a directional increase or decrease in misclassification, representing calls becoming more or less native range-like over time.
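A rough check on directionality is to pool the Austin counts from the site-year table above by year; a genuinely directional change would produce monotonically increasing or decreasing misclassification rates. This is an illustrative base R sketch using counts copied from that table, not part of the original pipeline.

```r
# Austin misclassification counts pooled by year from the site-year table above
year <- c(2004, 2011, 2019)
n_misclass <- c(1, 14, 9)
n_total <- c(25, 47, 80)

rate <- n_misclass / n_total
names(rate) <- year

# A directional change over time would make these rates monotonic;
# instead they peak in 2011
monotonic <- all(diff(rate) > 0) || all(diff(rate) < 0)
rate
monotonic
```

The pooled rates (about 4% in 2004, 30% in 2011, 11% in 2019) are non-monotonic, consistent with the lack of a clear temporal trend described above.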

AM2.12

We took a closer look at Austin to evaluate misclassification per site sampled over time, and again saw no clear patterns of temporal change.

# Austin site-year scale, drop BART because this was sampled at individual scale and included in prediction dataset only for spatial question

# Get sites
sites <- inv_temp_pred %>%
  ungroup() %>%
  filter(invasive_city == "Austin" & site != "BART") %>%
  pull(site) %>%
  unique()
sites
## [1] "ELEM" "INTR" "MART" "SOCC" "VALL"
# Make a new data frame with a row per site and year, even if no calls were misclassified that year, so the x-axis shows all 3 years per panel
inv_temp_pred %>%
  filter(site %in% sites) %>%
  # Add a column with years in which each site was sampled
  dplyr::mutate(
    year_samp = sapply(1:nrow(.), function(x){
      yrs <- unique(nat_inv_est$year[grep(.$site[x], nat_inv_est$site)])
      ord <- order(yrs, decreasing = FALSE)
      paste(site[x], paste(yrs[ord], collapse = ", "), sep = ":\n ")
    }),
    # Change factor levels of year so all 3 years are on x-axis
    year = as.character(year),
    year = factor(year, levels = c("2004", "2011", "2019"))
  ) %>%
  ggplot(aes(x = year, y = n_misclass)) +
  geom_col(color = scales::alpha("white", 0), fill = cols) +
  xlab("Year") +
  ylab("Number of misclassified calls") +
  facet_grid(~ year_samp, drop = FALSE) +
  scale_y_continuous(limits = c(0, 13)) +
  theme_bw() +
  theme(
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 12),
    strip.text = element_text(size = 12),
    panel.grid.major = element_line(size = 0.15),
    panel.grid.minor = element_line(size = 0.15),
    axis.ticks = element_line(size = 0.1)
  )

AM2.13

Was misclassification of invasive range calls in the prediction dataset driven by signal-to-noise ratio (SNR)? If so, we would expect calls with lower SNR to be misclassified more often. This was not the case: misclassified calls actually had higher mean SNR.

SNR_df <- predict_res %>%
  filter(range == "Invasive") %>%
  dplyr::mutate(
    misclass = ifelse(range != predicted_class, "Yes", "No")
  ) %>%
  # Join back to the EST to get SNR
  inner_join(
    nat_inv_est %>%
      dplyr::select(sound.files, SNR), 
    by = "sound.files"
  ) %>%
  # Remove calls that had missing SNR measurements
  filter(!is.na(SNR)) %>%
  droplevels() %>%
  # Calculate mean and SE of SNR for calls that were misclassified or not
  group_by(misclass) %>%
  dplyr::summarise(
    mean_SNR = mean(SNR),
    se_SNR = std_err(SNR)
  )

# glimpse(SNR_df)

SNR_df %>%
  ggplot(aes(x = misclass, y = mean_SNR)) +
  geom_errorbar(aes(ymin = mean_SNR - se_SNR, ymax = mean_SNR + se_SNR), size = 1, width = 0.1) +
  geom_point(fill = "gray60", shape = 21, size = 4) +
  xlab("Misclassified?") +
  ylab("Mean +/- SE SNR") +
  theme_bw()
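A rank-based test could complement the comparison of means above; the real input would be the per-call data frame built before the summarise step. In the sketch below, the `snr_calls` data frame is simulated purely for illustration, with group labels matching the `misclass` column constructed above.

```r
# Illustrative only: simulated per-call SNR values standing in for the
# pre-summarise data frame (one row per call, misclass label plus SNR)
set.seed(401)
snr_calls <- data.frame(
  misclass = rep(c("No", "Yes"), times = c(80, 20)),
  SNR = c(rnorm(80, mean = 25, sd = 5), rnorm(20, mean = 28, sd = 5))
)

# Two-sided Wilcoxon rank sum test of SNR by misclassification status
wt <- wilcox.test(SNR ~ misclass, data = snr_calls)
wt$p.value
```

With the real per-call data, this would test whether the SNR distributions differ between misclassified and correctly classified calls without assuming normality.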

Partial dependence plots

We examined random forests partial dependence plots for the original spectral acoustic measurements. Did any of these measurements show a change in assignment probability for certain values? For instance, were shorter-duration calls more likely to be assigned to the invasive range? Such patterns would point to structural differences in calls between ranges. The original acoustic measurements used here are easier to interpret than the derived features used for random forests: assignment probability back to the invasive range can be assessed across the values of a single measurement, rather than across the values of a feature (which combines multiple original measurements).

model_RF <- readRDS(file.path(path, "model_RF.RDS"))
# model_RF

sup_ML_df <- readRDS(file.path(path, "sup_ML_fin.RDS"))
dim(sup_ML_df)
# glimpse(sup_ML_df)

prediction_df <- sup_ML_df %>%
  filter(set == "prediction") %>%
  dplyr::mutate(
    range = factor(range) # Must be a factor for caret::train
    ) %>%
  dplyr::select(-c(sound.files, set))
dim(prediction_df)
head(names(prediction_df))

preds <- names(prediction_df)[-grep("^range$|MDS|PCA", names(prediction_df))]
preds

# y-hat is classification probability back to the invasive range
levels(prediction_df$range)
# "Invasive" "Native"  

pdp_objs <- list()
for(i in preds){
  pdp_objs[[i]] <- partial(model_RF, pred.var = i, progress = "text", plot = FALSE, prob = TRUE)
}

pdp_plots <- list()
for(i in preds){
  pdp_plots[[i]] <- plotPartial(pdp_objs[[i]], alpha = 0.5, smooth = TRUE)
}

if (dev.cur() > 1) dev.off()

jpeg(file.path(gpath, "PDP_RF_01.jpeg"), units = "in", width = 6, height = 6, res = 150)
grid.draw(grid.arrange(grobs = pdp_plots[1:8], ncol = 2))
dev.off()

jpeg(file.path(gpath, "PDP_RF_02.jpeg"), units = "in", width = 6, height = 6, res = 150)
grid.draw(grid.arrange(grobs = pdp_plots[9:length(pdp_plots)], ncol = 2))
dev.off()

AM2.14

Partial dependence plots, page 1

AM2.15

Partial dependence plots, page 2

These plots show some interesting trends across predictors (e.g. modulation index), but note the small range of the y-axis in most panels, which indicates small overall changes in classification probability. Here, y-hat on the y-axis is the probability of classification back to the invasive range (lower values indicate stronger classification back to the native range). The black line in each panel shows the unsmoothed relationship between classification to the invasive range and the values of that measurement; the blue line is the loess-smoothed fit.
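The y-hat interpretation can be made concrete by computing partial dependence by hand: fix one measurement at each grid value across all rows, predict, and average the predicted probabilities. The sketch below uses a logistic regression on the built-in iris data as a stand-in for the random forests model and the call measurements; `pd_manual` is a hypothetical helper, not part of the original pipeline.

```r
# Stand-in data and model: logistic regression for virginica vs the rest
iris2 <- transform(iris, is_virginica = as.numeric(Species == "virginica"))
fit <- glm(is_virginica ~ Sepal.Length + Sepal.Width, data = iris2,
           family = binomial)

# Manual partial dependence: hold one predictor at each grid value,
# predict for every row, then average the predicted probabilities
pd_manual <- function(model, data, pred_var, grid_len = 10) {
  grid <- seq(min(data[[pred_var]]), max(data[[pred_var]]),
              length.out = grid_len)
  yhat <- sapply(grid, function(v) {
    tmp <- data
    tmp[[pred_var]] <- v
    mean(predict(model, newdata = tmp, type = "response"))
  })
  data.frame(grid, yhat)
}

pd <- pd_manual(fit, iris2, "Sepal.Length")
pd
```

This is the averaging that pdp::partial performs internally; for the random forests model above, y-hat is the average predicted probability of the invasive range class at each value of the chosen measurement.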

References

1. Smith-Vidaurre, G., Araya-Salas, M., and T.F. Wright. 2020. Individual signatures outweigh social group identity in contact calls of a communally nesting parrot. Behavioral Ecology 31(2), 448-458.

The session info printed here represents the environment used for the final RMarkdown knitting. The software and package versions employed for main results are reported in the appendix of the associated article.

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] pracma_2.3.3       edarf_1.1.1        ggplotify_0.0.7    egg_0.4.5         
##  [5] gridExtra_2.3      scales_1.1.1       pdp_0.7.0          gbm_2.1.8         
##  [9] e1071_1.7-7        MLmetrics_1.1.1    corrplot_0.89      ranger_0.12.1     
## [13] data.table_1.14.0  pbapply_1.4-3      MASS_7.3-54        caret_6.0-88      
## [17] lattice_0.20-44    forcats_0.5.1      stringr_1.4.0      dplyr_1.0.6       
## [21] purrr_0.3.4        readr_1.4.0        tidyr_1.1.3        tibble_3.1.2      
## [25] tidyverse_1.3.1    ggplot2_3.3.3      warbleR_1.1.27     NatureSounds_1.0.4
## [29] knitr_1.33         seewave_2.1.6      tuneR_1.3.3       
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-1     rjson_0.2.20         ellipsis_0.3.2      
##  [4] class_7.3-19         fs_1.5.0             rstudioapi_0.13     
##  [7] proxy_0.4-25         farver_2.1.0         prodlim_2019.11.13  
## [10] fansi_0.5.0          lubridate_1.7.10     xml2_1.3.2          
## [13] codetools_0.2-18     splines_4.1.0        jsonlite_1.7.2      
## [16] pROC_1.17.0.1        broom_0.7.6          dbplyr_2.1.1        
## [19] BiocManager_1.30.15  compiler_4.1.0       httr_1.4.2          
## [22] rvcheck_0.1.8        backports_1.2.1      assertthat_0.2.1    
## [25] Matrix_1.3-4         cli_2.5.0            htmltools_0.5.1.1   
## [28] tools_4.1.0          gtable_0.3.0         glue_1.4.2          
## [31] reshape2_1.4.4       Rcpp_1.0.6           cellranger_1.1.0    
## [34] vctrs_0.3.8          nlme_3.1-152         iterators_1.0.13    
## [37] timeDate_3043.102    gower_0.2.2          xfun_0.23           
## [40] rvest_1.0.0          lifecycle_1.0.0      ipred_0.9-11        
## [43] hms_1.1.0            yaml_2.2.1           rpart_4.1-15        
## [46] stringi_1.6.2        highr_0.9            foreach_1.5.1       
## [49] checkmate_2.0.0      lava_1.6.9           mmpf_0.0.5          
## [52] rlang_0.4.11         pkgconfig_2.0.3      dtw_1.22-3          
## [55] bitops_1.0-7         evaluate_0.14        labeling_0.4.2      
## [58] recipes_0.1.16       tidyselect_1.1.1     plyr_1.8.6          
## [61] magrittr_2.0.1       R6_2.5.0             fftw_1.0-6          
## [64] generics_0.1.0       DBI_1.1.1            pillar_1.6.1        
## [67] haven_2.4.1          withr_2.4.2          survival_3.2-11     
## [70] RCurl_1.98-1.3       nnet_7.3-16          modelr_0.1.8        
## [73] crayon_1.4.1         utf8_1.2.1           rmarkdown_2.8       
## [76] isoband_0.2.4        readxl_1.3.1         ModelMetrics_1.2.2.2
## [79] reprex_2.0.0         digest_0.6.27        gridGraphics_0.5-1  
## [82] signal_0.7-7         stats4_4.1.0         munsell_0.5.0