Supervised machine learning (ML) was employed to assess whether contact call structure differed between ranges, and whether acoustic structure changed over time in the invasive range. We decided to use supervised machine learning with ranges as class labels after visually assessing calls during pre-processing. Stochastic gradient boosting and random forests (RF) models were trained and validated with a large set of acoustic measurements and features for the majority of native and invasive range calls across social scales. Both models were trained to classify calls back to range, and we proceeded with the RF model for prediction.
Classification accuracy throughout predictive modeling was used as an indicator of structural differentiation between ranges. We also used the RF proximity matrix to visualize overlap between ranges in low-dimensional acoustic space. Finally, we used misclassifications per range to more closely evaluate structural differentiation between ranges, as well as structural change over time in the invasive range.
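The proximity-based visualization described above can be sketched in base R: given a symmetric proximity matrix (here a small invented matrix standing in for an RF proximity matrix), classical MDS on 1 - proximity yields a low-dimensional embedding for plotting calls by range. This is an illustrative sketch only, not the code used for the manuscript figures; the toy matrix and call labels are invented, and the embedding method used for the figures may differ.

```r
# Toy proximity matrix standing in for an RF proximity matrix:
# entries near 1 mean two calls often land in the same terminal nodes
prox <- matrix(c(
  1.0, 0.9, 0.2, 0.1,
  0.9, 1.0, 0.3, 0.2,
  0.2, 0.3, 1.0, 0.8,
  0.1, 0.2, 0.8, 1.0
), nrow = 4, byrow = TRUE)
dimnames(prox) <- list(paste0("call", 1:4), paste0("call", 1:4))

# Convert proximity (similarity) to distance, then embed with classical MDS
mds <- stats::cmdscale(as.dist(1 - prox), k = 2)

# Each call now has low-dimensional coordinates for plotting by range label
mds
```

Calls with high proximity (call1 and call2 above) land close together in the embedding, so overlap between ranges appears as intermingled point clouds.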
The prefix “AM” stands for “Additional Material” and is used to indicate figures included here that are not present in the main manuscript or appendix, but are still valuable for reproducibility. Some of the functions and code in this script may need to be updated in order to reproduce results with more recent versions of the packages used.
rm(list = ls())
X <- c("warbleR", "ggplot2", "tidyverse", "caret", "MASS", "pbapply", "dplyr", "data.table", "parallel", "ranger", "corrplot", "MLmetrics", "e1071", "gbm", "pdp", "scales", "egg", "ggplotify", "grid", "edarf", "pracma")
invisible(lapply(X, library, character.only = TRUE))
path <- "/media/gsvidaurre/MYIOPSITTA/R/VocalLearning_PostInvasion/Data"
gpath <- "/home/gsvidaurre/Desktop/MANUSCRIPTS/SimplerSignatures_PostInvasion/FIGURES"
seed <- 401
cores <- parallel::detectCores() - 2
Read in the extended selection table (EST). This contains metadata and wave objects for pre-processed native and invasive range calls. Throughout this script, you can reproduce analyses by reading in the file “nat_inv_indiv_site_seltbl.csv”, which is a selection table made available on figshare that contains metadata needed here. Measurements needed to reproduce analyses here have also been provided on figshare.
nat_inv_est <- readRDS(file.path(path, "nat_inv_indiv_site_EST.RDS"))
# glimpse(nat_inv_est)
In previous work[1], and an independent analysis of hierarchical mapping, we used known repeatedly sampled individuals for training supervised machine learning models. Here, we did not use these individuals for model training, but we did use these birds to visually validate the different measurements used as predictors for stochastic gradient boosting and random forests models. See the exploratory visuals generated below (AM1 - 7).
nat_inv_est %>%
filter(social_scale == "Individual") %>%
group_by(site_year, Bird_ID) %>%
summarise(num_calls = length(sound.files))
## `summarise()` has grouped output by 'site_year'. You can override using the `.groups` argument.
## # A tibble: 17 x 3
## # Groups: site_year [10]
## site_year Bird_ID num_calls
## <chr> <chr> <int>
## 1 1145_2017 NAT-AAT 12
## 2 1145_2017 NAT-UM1 25
## 3 1145_2017 NAT-UM2 23
## 4 1145_2017 NAT-UM3 5
## 5 1145_2017 NAT-UM4 13
## 6 ASCA_2019 INV-UM6 25
## 7 BART_2011 INV-UM1 23
## 8 CAME_2004 INV-UM19 12
## 9 CHAC_2017 NAT-UM5 7
## 10 ELEM_2019 INV-UM7 28
## 11 ELEM_2019 INV-UM9 5
## 12 EMBR_2017 NAT-RAW 4
## 13 EMBR_2017 NAT-ZW8 8
## 14 INTR_2019 INV-UM10 6
## 15 ROBE_2011 INV-UM5 20
## 16 SOCC_2019 INV-UM16 8
## 17 SOCC_2019 INV-UM17 5
# Native range: NAT-AAT, NAT-UM1, NAT-UM2, NAT-UM4 (all from site 1145)
# Invasive range: INV-UM1 (BART_2011), INV-UM7 (ELEM_2019), INV-UM5 (ROBE_2011), INV-UM19 (CAME_2004)
train_indivs <- c("NAT-AAT", "NAT-UM1", "NAT-UM2", "NAT-UM4", "INV-UM19", "INV-UM1", "INV-UM5", "INV-UM7")
# train_indivs
Acoustic measurements and features were obtained for all calls in the EST. However, I removed sound files with the "_site_scale" suffix prior to running ML below. These calls were included at the site scale for an independent analysis of hierarchical mapping patterns, but were not necessary for the supervised machine learning analysis here.
# SPCC on spectrograms
xc_mat <- warbleR::xcorr(nat_inv_est, wl = 378, ovlp = 90, wn = "hanning", cor.method = "pearson", parallel = cores, na.rm = FALSE, bp = c(0.5, 9), cor.mat = TRUE, path = path, type = "spectrogram")
str(xc_mat)
saveRDS(xc_mat, file.path(path, "xc_mat_NAT_INV.RDS"))
# SPCC on cepstral coefficients
xc_mat_mfcc <- warbleR::xcorr(nat_inv_est, wl = 378, ovlp = 90, wn = "hanning", cor.method = "pearson", parallel = cores, na.rm = FALSE, bp = c(0.5, 9), cor.mat = TRUE, path = path, type = "mfcc")
str(xc_mat_mfcc)
saveRDS(xc_mat_mfcc, file.path(path, "xc_mat_NAT_INV_mfcc.RDS"))
We measured spectral entropy and dominant frequency as time series, which we converted to acoustic similarity measurements using dynamic time warping (DTW_domf and DTW_spen). We extracted 100 time points per time series. We also performed multivariate DTW using both spectral entropy and dominant frequency time series (DTW_mult).
# Extract spectral entropy across signals as a time series
# Set img to TRUE to see time series plotted over spectrograms
sp_ts <- sp.en.ts(nat_inv_est, wl = 378, length.out = 100, wn = "hanning", ovlp = 90, bp = c(0.5, 9), threshold = 15, img = FALSE, parallel = cores, img.suffix = "spec_ent", pb = TRUE, clip.edges = TRUE, leglab = "spec_ent", sp.en.range = c(0.5, 9), path = path, flim = c(0, 10))
str(sp_ts)
# Ensure that the selec column isn't included in dtw
sp_ts$selec <- as.character(sp_ts$selec)
sapply(sp_ts, is.numeric)
# Perform dynamic time warping (DTW) on spectral entropy time series
# Returns distances by default
sp_ts_DTW <- dtw::dtwDist(sp_ts[, sapply(sp_ts, is.numeric)], sp_ts[, sapply(sp_ts, is.numeric)], window.type = "none", open.end = FALSE, path = path)
str(sp_ts_DTW)
saveRDS(sp_ts_DTW, file.path(path, "DTW_spec_ent.RDS"))
# Extract dominant frequency as a time series
# Set img to TRUE to see time series plotted over spectrograms
df_ts <- dfts(nat_inv_est, wl = 378, wl.freq = 378, length.out = 100, wn = "hanning", ovlp = 90, bp = c(0.5, 9), threshold = 15, img = FALSE, parallel = cores, img.suffix = "dfts", pb = TRUE, clip.edges = TRUE, path = path, flim = c(0, 10))
str(df_ts)
# Are there any NAs?
any(apply(df_ts, c(1, 2), is.na))
# Ensure that the selec column isn't included in dtw
df_ts$selec <- as.character(df_ts$selec)
sapply(df_ts, is.numeric)
# Perform DTW on dominant frequency time series
# Returns distances by default
df_ts_DTW <- dtw::dtwDist(df_ts[, sapply(df_ts, is.numeric)], df_ts[, sapply(df_ts, is.numeric)], window.type = "none", open.end = FALSE, path = path)
str(df_ts_DTW)
saveRDS(df_ts_DTW, file.path(path, "DTW_domf.RDS"))
# Perform multivariate DTW on spectral entropy and dominant frequency time series
dim(sp_ts[, sapply(sp_ts, is.numeric)])
dim(df_ts[, sapply(df_ts, is.numeric)])
# Returns distances by default
# Note that there are some zeroes off the diagonal that do not correspond to duplicated calls
mDTW <- warbleR::multi_DTW(ts.df1 = sp_ts, ts.df2 = df_ts, parallel = cores, window.type = "none", open.end = FALSE, scale = TRUE, dist.mat = TRUE, path = path)
str(mDTW)
class(mDTW)
saveRDS(mDTW, file.path(path, "DTW_multi.RDS"))
The zeroes off the diagonal of the multivariate DTW matrix were replaced with the minimum non-zero distance found in this matrix. Although the algorithm found these call pairs to be highly similar, they were distinct calls, so their pairwise distances should not be zero.
mDTW <- readRDS(file.path(path, "DTW_multi.RDS"))
str(mDTW)
# Replace all zeroes with the minimum distance value
# Then replace the diagonal with zeroes
mDTW[mDTW == 0] <- min(mDTW[mDTW != 0])
diag(mDTW) <- 0
# Diagonal is zero
unique(diag(mDTW))
# No zeroes off the diagonal
length(which(mDTW[lower.tri(mDTW, diag = FALSE)] == 0))
# Looks good
# str(mDTW)
saveRDS(mDTW, file.path(path, "DTW_multi.RDS"))
We collected standard spectral measurements of contact call acoustic structure, as well as Mel-frequency cepstral coefficients.
# Calculate spectral acoustic parameters across calls
# Exclude measurements related to fundamental frequency (harmonicity = FALSE)
acs_params <- warbleR::specan(nat_inv_est, bp = c(0.5, 9), wl = 378, wl.freq = 378, ovlp = 90, fast = FALSE, parallel = cores, wn = "hanning", harmonicity = FALSE, path = path)
glimpse(acs_params)
# Make sure that the selec column is non-numeric and non-integer
acs_params$selec <- as.character(acs_params$selec)
glimpse(acs_params)
# Write out acoustic parameters for later
write.csv(acs_params, file.path(path, "acoustic_parameters.csv"), row.names = FALSE)
# Also collected acoustic measurements on Mel-frequency cepstral coefficients
# Used default values of numcep and nbands because these accurately represented patterns of individual overlap and distinctiveness in exploratory visualizations in a previous analysis (Smith-Vidaurre et al. 2020, Behav. Ecol.)
cep_coeff <- warbleR::mfcc_stats(nat_inv_est, ovlp = 90, wl = 378, bp = c(0.5, 9), numcep = 12, nbands = 40, parallel = cores, path = path)
glimpse(cep_coeff)
cep_coeff$selec <- as.character(cep_coeff$selec)
glimpse(cep_coeff)
# Write out cepstral acoustic parameters for later
write.csv(cep_coeff, file.path(path, "Mel_freq_cepstral_coefficients.csv"), row.names = FALSE)
We visualized patterns of acoustic variation within and among repeatedly sampled individuals by each acoustic similarity or structure measurement. We used MDS and PCA to visualize calls in two-dimensional acoustic space by similarity measurement (pairwise matrices) or structural measurement type (non-pairwise matrices), respectively. Here, we visually assessed whether a given measurement type recapitulated patterns of overdispersion among individuals in acoustic space that were identified in previous native range work[1], and an independent analysis of hierarchical mapping patterns between ranges.
Aesthetics for plots with calls by individual.
n <- 12
# pie(rep(1, n), col = heat.colors(n))
# Colors and shapes by individual, ordered by native and then invasive
cols <- c("navy", "royalblue2", "turquoise", "dodgerblue", "gold4", "darkorange3", "goldenrod2", "gold2")
shps <- c(21, 22, 24, 23, 21, 25, 17, 15, 17)
Read in the SPCC and DTW similarity matrices.
xc_mat_spec <- readRDS(file.path(path, "xc_mat_NAT_INV.RDS"))
# dim(xc_mat_spec) == nrow(nat_inv_est)
xc_mat_ceps <- readRDS(file.path(path, "xc_mat_NAT_INV_mfcc.RDS"))
# dim(xc_mat_ceps) == nrow(nat_inv_est)
dtw_domf <- readRDS(file.path(path, "DTW_domf.RDS"))
# dim(dtw_domf) == nrow(nat_inv_est)
dtw_spen <- readRDS(file.path(path, "DTW_spec_ent.RDS"))
# dim(dtw_spen) == nrow(nat_inv_est)
dtw_mult <- readRDS(file.path(path, "DTW_multi.RDS"))
# dim(dtw_mult) == nrow(nat_inv_est)
Change dimension names of similarity matrices.
# Change dimension names of each matrix to facilitate pattern searching
# head(dimnames(xc_mat_spec)[[1]])
# head(nat_inv_est$sound.files)
# Make a vector of sound file names for substitution
wavs <- dimnames(xc_mat_spec)[[1]]
wavs <- gsub(".WAV-1", ".WAV", wavs, fixed = TRUE)
# head(wavs)
# Substitute the SPCC matrix dimension names
dimnames(xc_mat_spec) <- list(wavs, wavs)
dimnames(xc_mat_ceps) <- list(wavs, wavs)
dimnames(dtw_domf) <- list(wavs, wavs)
dimnames(dtw_spen) <- list(wavs, wavs)
dimnames(dtw_mult) <- list(wavs, wavs)
# Checking, looks good
# head(dimnames(xc_mat_spec)[[1]])
# head(dimnames(dtw_mult)[[1]])
Filter similarity matrices by the individuals selected for visual validation.
# Find calls per each individual to be used for model training
# Checking, looks good
nat_inv_est %>%
filter(social_scale == "Individual") %>%
filter(Bird_ID %in% train_indivs) %>%
group_by(Bird_ID) %>%
summarise(num_calls = length(sound.files))
## # A tibble: 8 x 2
## Bird_ID num_calls
## <chr> <int>
## 1 INV-UM1 23
## 2 INV-UM19 12
## 3 INV-UM5 20
## 4 INV-UM7 28
## 5 NAT-AAT 12
## 6 NAT-UM1 25
## 7 NAT-UM2 23
## 8 NAT-UM4 13
train_indivs
## [1] "NAT-AAT" "NAT-UM1" "NAT-UM2" "NAT-UM4" "INV-UM19" "INV-UM1" "INV-UM5"
## [8] "INV-UM7"
calls <- nat_inv_est %>%
filter(social_scale == "Individual") %>%
filter(Bird_ID %in% train_indivs) %>%
pull(sound.files)
# head(calls)
# Subset matrices by these calls
# Grepping on SPCC matrix dimension names and using these same indices across similarity measurements because calls were provided to similarity functions in the same order
inds <- grep(paste(paste("^", calls, "$", sep = ""), collapse = "|"), dimnames(xc_mat_spec)[[1]])
# length(inds) == length(calls)
xc_mat_spec_tmp <- xc_mat_spec[inds, inds]
str(xc_mat_spec_tmp)
## num [1:156, 1:156] 1 0.435 0.373 0.398 0.428 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
## ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
xc_mat_ceps_tmp <- xc_mat_ceps[inds, inds]
str(xc_mat_ceps_tmp)
## num [1:156, 1:156] 1 0.848 0.611 0.733 0.73 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
## ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
dtw_domf_tmp <- dtw_domf[inds, inds]
str(dtw_domf_tmp)
## num [1:156, 1:156] 0 35.4 60.6 45.2 50.7 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
## ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
dtw_spen_tmp <- dtw_spen[inds, inds]
str(dtw_spen_tmp)
## num [1:156, 1:156] 0 2.42 2.48 1.89 2.12 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
## ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
dtw_mult_tmp <- dtw_mult[inds, inds]
str(dtw_mult_tmp)
## num [1:156, 1:156] 0 111 135 123 140 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
## ..$ : chr [1:156] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV" "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV" ...
Perform MDS per filtered matrix (converting similarity matrices to distance matrices as needed), and generate a visual per method.
method <- c("SPCC - spectrogram", "SPCC - cepstral coefficients", "DTW - dominant frequency", "DTW - spectral entropy", "DTW - multivariate")
# Convert only the SPCC matrices to distance matrices
# DTW matrices already contain distances
mat_list <- list(1 - xc_mat_spec_tmp, 1 - xc_mat_ceps_tmp, dtw_domf_tmp, dtw_spen_tmp, dtw_mult_tmp)
# str(mat_list)
# x <- 3
invisible(pblapply(1:length(mat_list), function(x){
# Convert to a distance object for isoMDS
dist_mat <- stats::as.dist(mat_list[[x]], diag = TRUE, upper = TRUE)
# str(dist_mat)
# Reduce dimensionality of the SPCC matrix to visualize calls in 2D acoustic space
iso <- invisible(MASS::isoMDS(dist_mat, k = 2, maxit = 1000, trace = FALSE))
# str(iso)
bird_ids <- nat_inv_est$Bird_ID[grep(paste(paste("^", calls, "$", sep = ""), collapse = "|"), nat_inv_est$sound.files)]
# unique(bird_ids)
mds_df <- data.frame(X = iso$points[, 1], Y = iso$points[, 2], indiv = bird_ids)
# Order individuals by native and then invasive range
mds_df$indiv <- factor(mds_df$indiv, levels = train_indivs)
# Convex hull polygons per individual
hulls <- plyr::ddply(mds_df, "indiv", function(x){
x[chull(x$X, x$Y), ]
})
gg <- ggplot(mds_df, aes(x = X, y = Y)) +
geom_point(aes(color = indiv, fill = indiv, shape = indiv), size = 4) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = indiv, color = indiv), alpha = 0.2, size = 0.2) +
scale_colour_manual(values = cols) + scale_fill_manual(values = cols) +
scale_shape_manual(values = shps) +
guides(color = guide_legend(title = "Individual", nrow = 2, byrow = TRUE), fill = guide_legend(title = "Individual"), shape = guide_legend(title = "Individual")) +
xlab("Dimension 1") + ylab("Dimension 2") +
theme_bw() +
theme(legend.position = "top") +
ggtitle(method[x]) +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.title = element_text(size = 14),
legend.text = element_text(size = 14)
)
print(gg)
}))
Read in acoustic measurements, including cepstral coefficients. Sound file names do not need to be changed.
acs_params <- read.csv(file.path(path, "acoustic_parameters.csv")) %>%
dplyr::mutate(
selec = as.character(selec)
)
# nrow(acs_params) == nrow(nat_inv_est)
# glimpse(acs_params)
cep_coeff <- read.csv(file.path(path, "Mel_freq_cepstral_coefficients.csv")) %>%
dplyr::mutate(
selec = as.character(selec)
)
# nrow(cep_coeff) == nrow(nat_inv_est)
# glimpse(cep_coeff)
Filter acoustic measurements by the given individuals.
# Find calls per each individual
# Checking, looks good
nat_inv_est %>%
filter(social_scale == "Individual") %>%
filter(Bird_ID %in% train_indivs) %>%
group_by(Bird_ID) %>%
summarise(num_calls = length(sound.files))
## # A tibble: 8 x 2
## Bird_ID num_calls
## <chr> <int>
## 1 INV-UM1 23
## 2 INV-UM19 12
## 3 INV-UM5 20
## 4 INV-UM7 28
## 5 NAT-AAT 12
## 6 NAT-UM1 25
## 7 NAT-UM2 23
## 8 NAT-UM4 13
train_indivs
## [1] "NAT-AAT" "NAT-UM1" "NAT-UM2" "NAT-UM4" "INV-UM19" "INV-UM1" "INV-UM5"
## [8] "INV-UM7"
calls <- nat_inv_est %>%
filter(social_scale == "Individual") %>%
filter(Bird_ID %in% train_indivs) %>%
pull(sound.files)
head(calls)
## [1] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_1.WAV"
## [2] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_3.WAV"
## [3] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_4.WAV"
## [4] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_5.WAV"
## [5] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_6.WAV"
## [6] "2017_07_24_1145-01_INDIV-UNMARKED-BIRD2_MZ000284_7.WAV"
# Subset data frame by these calls
acs_params_tmp <- acs_params[grep(paste(paste("^", calls, "$", sep = ""), collapse = "|"), acs_params$sound.files), ]
# nrow(acs_params_tmp)
# all(acs_params_tmp$sound.files %in% calls)
# glimpse(acs_params_tmp)
cep_coeff_tmp <- cep_coeff[grep(paste(paste("^", calls, "$", sep = ""), collapse = "|"), cep_coeff$sound.files), ]
# nrow(cep_coeff_tmp)
# all(cep_coeff_tmp$sound.files %in% calls)
# glimpse(cep_coeff_tmp)
Perform PCA and make a plot per measurement type.
method <- c("Acoustic parameters", "Mel-frequency cepstral coefficients")
mat_list <- list(acs_params_tmp[, sapply(acs_params_tmp, is.numeric)], cep_coeff_tmp[, sapply(cep_coeff_tmp, is.numeric)])
# x <- 3
invisible(pblapply(1:length(mat_list), function(x){
# Pre-process using the caret package
pp_list <- caret::preProcess(mat_list[[x]], method = c("YeoJohnson", "zv", "nzv", "center", "scale"), thresh = 0.98, freqCut = 98/2, uniqueCut = 2)
pp <- predict(pp_list, mat_list[[x]])
# Perform PCA with base package for easier access of summary and loadings
pp_pca <- stats::prcomp(pp, center = FALSE)
# str(pp_pca)
bird_ids <- nat_inv_est$Bird_ID[grep(paste(paste("^", calls, "$", sep = ""), collapse = "|"), nat_inv_est$sound.files)]
# unique(bird_ids)
pca_df <- data.frame(X = pp_pca$x[, 1], Y = pp_pca$x[, 2], indiv = bird_ids)
# Order individuals by native and then invasive range
pca_df$indiv <- factor(pca_df$indiv, levels = train_indivs)
# Convex hull polygons per individual
hulls <- plyr::ddply(pca_df, "indiv", function(x){
x[chull(x$X, x$Y), ]
})
gg <- ggplot(pca_df, aes(x = X, y = Y)) +
geom_point(aes(color = indiv, fill = indiv, shape = indiv), size = 4) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = indiv, color = indiv), alpha = 0.2, size = 0.2) +
scale_colour_manual(values = cols) + scale_fill_manual(values = cols) +
scale_shape_manual(values = shps) +
guides(color = guide_legend(title = "Individual", nrow = 2, byrow = TRUE), fill = guide_legend(title = "Individual"), shape = guide_legend(title = "Individual")) +
xlab("Dimension 1") + ylab("Dimension 2") +
theme_bw() +
theme(legend.position = "top") +
ggtitle(method[x]) +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.title = element_text(size = 14),
legend.text = element_text(size = 14)
)
print(gg)
}))
Overall, these exploratory plots indicated that MDS and PCA features contained information about individual identity (confirmed by checking patterns of overdispersion in acoustic space by individual). Different feature sets also contained different information (patterns of dispersion among individuals differed across feature types). We were then ready to proceed with feature extraction.
Here, we extracted MDS features from similarity matrices. We did not find an “optimal” number of dimensions as in previous work, but instead kept the number of MDS dimensions the same across measurement types and datasets. We extracted MDS features separately per dataset because the larger dataset representing greater geographic spread contained some of the same calls as the dataset of repeatedly sampled individuals (suffix "_site_scale" in sound file names), and MDS cannot handle duplicated calls (pairwise similarity = 1 or pairwise distance = 0 off of the matrix diagonal). I proceeded with this routine so that feature extraction was performed in the same way for this analysis and the independent analysis of hierarchical mapping, but for this analysis of acoustic structure, I removed the "_site_scale" calls after feature extraction and prior to training models.
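The constraint on duplicated calls can be demonstrated directly: MASS::isoMDS() stops when it encounters a zero distance between two distinct objects, which is exactly what a duplicated call produces off the diagonal. A minimal sketch with an invented 4-object distance matrix:

```r
# Distance matrix with a duplicated object: d(a, b) = 0 off the diagonal,
# as would occur if the same call appeared twice in the dataset
d <- matrix(c(
  0, 0, 2, 3,
  0, 0, 2, 3,
  2, 2, 0, 1,
  3, 3, 1, 0
), nrow = 4, dimnames = list(letters[1:4], letters[1:4]))

# isoMDS() refuses to run on zero off-diagonal distances
res <- tryCatch(
  MASS::isoMDS(as.dist(d), k = 2, trace = FALSE),
  error = function(e) conditionMessage(e)
)
res
```

This is why the "_site_scale" duplicates had to be confined to a separate matrix before MDS feature extraction.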
Get calls for both social scales.
indiv_calls <- nat_inv_est %>%
filter(social_scale == "Individual") %>%
dplyr::mutate(
sound.files = as.character(sound.files)
) %>%
pull(sound.files)
length(indiv_calls)
site_calls <- nat_inv_est %>%
filter(social_scale == "Site") %>%
dplyr::mutate(
sound.files = as.character(sound.files)
) %>%
pull(sound.files)
length(site_calls)
# Checking
# length(indiv_calls) + length(site_calls) == nrow(nat_inv_est)
ss <- c("Individual", "Site")
call_pats <- c(paste(paste("^", indiv_calls, "$", sep = ""), collapse = "|"), paste(paste("^", site_calls, "$", sep = ""), collapse = "|"))
Split matrices by social scale. We generated features in this way because this analysis was performed for a manuscript in prep on hierarchical mapping patterns that used some repeatedly sampled individual calls at the site scale, and it isn’t possible to have duplicate samples in MDS.
mthd <- c("SPCC_spec", "SPCC_ceps", "DTW_domf", "DTW_spen", "DTW_mult")
# Convert only the SPCC matrices to distance matrices
# DTW matrices already contain distances
mat_list <- list(1 - xc_mat_spec, 1 - xc_mat_ceps, dtw_domf, dtw_spen, dtw_mult)
# Remove the original matrices from the global environment to free up memory
rm(list = c('xc_mat_spec', 'xc_mat_ceps', 'dtw_domf', 'dtw_spen', 'dtw_mult'))
# Loop over social scales and methods to subset the full matrices
ss_mat_list <- invisible(pblapply(1:length(ss), function(i){
tmp_mat_list <- lapply(1:length(mthd), function(x){
tmp_mat <- mat_list[[x]][grep(call_pats[i], dimnames(mat_list[[x]])[[1]]), grep(call_pats[i], dimnames(mat_list[[x]])[[2]])]
return(tmp_mat)
})
names(tmp_mat_list) <- mthd
return(tmp_mat_list)
}))
names(ss_mat_list) <- ss
str(ss_mat_list)
# Remove the original matrix list to free up memory
rm(list = "mat_list")
Perform MDS with 15 dimensions per social scale and similarity measurement type.
dims <- 15
invisible(pblapply(1:length(ss), function(i){
mds_res <- lapply(1:length(mthd), function(x){
# A memory error occurred when trying 1000 iterations with 25 dimensions at the site scale
iso <- invisible(MASS::isoMDS(ss_mat_list[[i]][[x]], k = dims, maxit = 1000, trace = FALSE))
saveRDS(iso, file.path(path, paste("MDS_results_", ss[i], "_", mthd[x], ".RDS", sep = "")))
})
}))
rm(list = "ss_mat_list")
Split acoustic structure datasets by social scale, to match how MDS was run.
mthd <- c("acs_params", "cep_coeff")
df_list <- list(acs_params, cep_coeff)
# Loop over social scales and methods to subset the full data frames
ss_df_list <- invisible(pblapply(1:length(ss), function(i){
tmp_df_list <- lapply(1:length(mthd), function(x){
tmp_df <- df_list[[x]][grep(call_pats[i], df_list[[x]]$sound.files), ]
return(tmp_df)
})
names(tmp_df_list) <- mthd
return(tmp_df_list)
}))
names(ss_df_list) <- ss
# str(ss_df_list)
# Remove the original data frame list to free up memory
rm(list = "df_list")
Get PCA features for acoustic parameters and cepstral coefficients.
mthd <- c("acs_params", "cep_coeff")
invisible(pblapply(1:length(ss), function(i){
res_df <- lapply(1:length(mthd), function(x){
# Get numeric columns only
df_num <- ss_df_list[[i]][[x]][, sapply(ss_df_list[[i]][[x]], is.numeric)]
# Pre-process using the caret package
pp_list <- caret::preProcess(df_num, method = c("YeoJohnson", "zv", "nzv", "center", "scale"), thresh = 0.98, freqCut = 98/2, uniqueCut = 2)
pp <- predict(pp_list, df_num)
# Perform PCA with base package for easier access of summary and loadings
pp_pca <- stats::prcomp(pp, center = FALSE)
saveRDS(pp_pca, file.path(path, paste("PCA_results_", ss[i], "_", mthd[x], ".RDS", sep = "")))
})
}))
How many principal components were extracted per parameter type?
ss <- c("Individual", "Site")
mthd <- c("acs_params", "cep_coeff")
dims <- invisible(pblapply(1:length(ss), function(i){
dim_list <- lapply(1:length(mthd), function(x){
pca <- readRDS(file.path(path, paste("PCA_results_", ss[i], "_", mthd[x], ".RDS", sep = "")))
dim(pca$x)[2]
})
names(dim_list) <- mthd
return(dim_list)
}))
names(dims) <- ss
dims
## $Individual
## $Individual$acs_params
## [1] 27
##
## $Individual$cep_coeff
## [1] 88
##
##
## $Site
## $Site$acs_params
## [1] 27
##
## $Site$cep_coeff
## [1] 88
Combine MDS and PCA features across call datasets into a single data frame. For each feature type, make a data frame with a sound.files column, and merge data frames with the full selection table afterwards using this column.
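The merge-by-key pattern described here can be sketched in base R: each feature data frame carries a sound.files key column, and Reduce() folds a list of them into one wide data frame. The toy feature columns below are invented for illustration; the actual code uses purrr::reduce() with dplyr::left_join().

```r
# Two toy feature data frames sharing a sound.files key column
mds_feats <- data.frame(sound.files = c("a.WAV", "b.WAV"),
                        SPCC_spec_MDS_1 = c(0.12, -0.34))
pca_feats <- data.frame(sound.files = c("a.WAV", "b.WAV"),
                        acs_params_PCA_1 = c(1.5, -0.7))

# Fold the list of feature data frames into one wide data frame by key
feats <- Reduce(function(a, b) merge(a, b, by = "sound.files"),
                list(mds_feats, pca_feats))
feats
```

Joining on sound.files rather than relying on row order protects the merge against any reordering introduced during feature extraction.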
dim_red <- c("MDS", "PCA")
mthd <- list(c("SPCC_spec", "SPCC_ceps", "DTW_domf", "DTW_spen", "DTW_mult"), c("acs_params", "cep_coeff"))
mthd_nms <- list(c("SPCC - spectrogram", "SPCC - cepstral coefficients", "DTW - dominant frequency", "DTW - spectral entropy", "DTW - multivariate"), c("Acoustic parameters", "Mel-frequency cepstral coefficients"))
# length(ss_df_list)
# length(ss_df_list[[1]])
# i <- 2
# x <- 2
# z <- 3
feat_df_list <- invisible(pblapply(1:length(ss), function(i){
df_list2 <- lapply(1:length(dim_red), function(x){
df_list <- lapply(1:length(mthd[[x]]), function(z){
if(dim_red[x] == "MDS"){
iso <- readRDS(file.path(path, paste("MDS_results_", ss[i], "_", mthd[[x]][z], ".RDS", sep = "")))
sound.files <- dimnames(iso$points)[[1]]
feats <- data.frame(iso$points)
names(feats) <- paste(mthd[[x]][z], "MDS", seq(1, ncol(feats), 1), sep = "_")
tmp_df <- data.frame(sound.files = sound.files, feats) %>%
dplyr::mutate(
sound.files = as.character(sound.files)
)
# str(tmp_df)
return(tmp_df)
} else if(dim_red[x] == "PCA"){
# Take all principal components
pca <- readRDS(file.path(path, paste("PCA_results_", ss[i], "_", mthd[[x]][z], ".RDS", sep = "")))
# str(pca)
sound.files <- ss_df_list[[i]][[z]]$sound.files
feats <- data.frame(pca$x)
names(feats) <- paste(mthd[[x]][z], "PCA", seq(1, ncol(feats), 1), sep = "_")
tmp_df <- data.frame(sound.files = sound.files, feats) %>%
dplyr::mutate(
sound.files = as.character(sound.files)
)
# str(tmp_df)
return(tmp_df)
}
})
names(df_list) <- mthd[[x]]
return(df_list)
})
names(df_list2) <- dim_red
return(df_list2)
}))
names(feat_df_list) <- ss
str(feat_df_list[["Individual"]])
str(feat_df_list[["Site"]])
Combine each set of features across measurement types with metadata in the extended selection table per call dataset, then combine features across datasets. Note that a column “rf_set” was added to each social scale for model training, validation and prediction for the hierarchical mapping analysis. This will be ignored for the structural analysis here, as I split calls into training, validation and prediction datasets differently for this manuscript.
Repeatedly sampled individuals.
# Features that will be used for random forests modelling
names(feat_df_list[["Individual"]][["MDS"]])
names(feat_df_list[["Individual"]][["PCA"]])
indiv_df <- nat_inv_est %>%
filter(social_scale == "Individual") %>%
dplyr::select(
c(sound.files, date, range, social_scale, site, Bird_ID, region, country, site_year, year, dept_state, invasive_city)
) %>%
# Add a new column to split calls into datasets for predictive modeling
# If an individual is not one of those designated for training, then assign to validation
dplyr::mutate(
rf_set = ifelse(Bird_ID %in% train_indivs, "training", "validation"),
) %>%
# Merge the MDS features after merging this list into a single data frame
left_join(
feat_df_list[["Individual"]][["MDS"]] %>%
purrr::reduce(
left_join, by = "sound.files"
),
by = "sound.files"
) %>%
# Merge the PCA features after merging this list into a single data frame
left_join(
feat_df_list[["Individual"]][["PCA"]] %>%
purrr::reduce(
left_join, by = "sound.files"
),
by = "sound.files"
) %>%
droplevels() %>%
dplyr::mutate(
year = as.character(year)
)
glimpse(indiv_df)
Dataset representing broader geographic sampling.
# Features that will be used for random forests modeling
names(feat_df_list[["Site"]][["MDS"]])
names(feat_df_list[["Site"]][["PCA"]])
site_df <- nat_inv_est %>%
filter(social_scale == "Site") %>%
dplyr::select(
c(sound.files, date, range, social_scale, site, Bird_ID, region, country, site_year, year, dept_state, invasive_city)
) %>%
# Add a new column to split calls into datasets for predictive modeling
# Calls used for random forests prediction
dplyr::mutate(
rf_set = "prediction",
) %>%
# Merge the MDS features after merging this list into a single data frame
left_join(
feat_df_list[["Site"]][["MDS"]] %>%
purrr::reduce(
left_join, by = "sound.files"
),
by = "sound.files"
) %>%
# Merge the PCA features after merging this list into a single data frame
left_join(
feat_df_list[["Site"]][["PCA"]] %>%
purrr::reduce(
left_join, by = "sound.files"
),
by = "sound.files"
) %>%
droplevels() %>%
dplyr::mutate(
year = as.character(year)
)
glimpse(site_df)
Merged the individual and site scale features and metadata into a single data frame for random forests modelling.
sup_rf_df <- bind_rows(indiv_df, site_df)
glimpse(sup_rf_df)
Wrote out all features across the 1596 calls for predictive modeling.
write.csv(sup_rf_df, file.path(path, "supervised_RF_datasets_acousticSimilarity.csv"), row.names = FALSE)
How to split calls into training, validation, and prediction sets to address the main question? Here, we wanted to train supervised machine learning models to classify calls back to ranges as a way to infer whether calls were structurally different between ranges. First, we needed to identify which invasive range areas were sampled over time, in order to both compare ranges spatially and account for potential temporal change in the invasive range.
Drop calls with the suffix "_site_scale", as these were a single call per repeatedly sampled individual included at the site scale for the hierarchical mapping analysis. Also drop the 4 NAT-UM4 calls that were inadvertently retained at the site scale in a previous analysis (these did not change the reported results). We used the remaining 1561 calls for supervised machine learning analyses.
dropUM4 <- nat_inv_est %>%
filter(social_scale == "Site") %>%
filter(Bird_ID == "NAT-UM4" & !grepl("site_scale", sound.files)) %>%
pull(sound.files)
dropUM4
## [1] "2017_07_26_1145-02_MZ000321_1.WAV" "2017_07_26_1145-02_MZ000321_2.WAV"
## [3] "2017_07_26_1145-02_MZ000321_5.WAV" "2017_07_26_1145-02_MZ000321_6.WAV"
nat_inv_est <- nat_inv_est %>%
as_tibble() %>%
filter(!grepl("site_scale", sound.files)) %>%
filter(!sound.files %in% dropUM4) %>%
droplevels()
dim(nat_inv_est)
## [1] 1561 26
What is our spatial resolution in the invasive range? Number of calls by city in the U.S.
nat_inv_est %>%
filter(range == "Invasive") %>%
group_by(invasive_city) %>%
dplyr::summarise(
n_calls = length(sound.files)
)
## # A tibble: 8 x 2
## invasive_city n_calls
## <chr> <int>
## 1 Austin 512
## 2 Dallas 9
## 3 El Paso 25
## 4 Gilbert 16
## 5 Miami 40
## 6 Milford 105
## 7 New Orleans 109
## 8 Stratford 50
We should include calls from cities sampled over time, but avoid depleting the dataset needed for temporal questions. Which years had the most calls? 2019 for Austin, 2004 for New Orleans.
nat_inv_est %>%
filter(invasive_city %in% c("Austin", "New Orleans")) %>%
group_by(invasive_city, year) %>%
dplyr::summarise(
n_calls = length(sound.files)
)
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## # A tibble: 5 x 3
## # Groups: invasive_city [2]
## invasive_city year n_calls
## <chr> <chr> <int>
## 1 Austin 2004 93
## 2 Austin 2011 84
## 3 Austin 2019 335
## 4 New Orleans 2004 61
## 5 New Orleans 2011 48
Deciding which Austin and New Orleans sites to use for spatial questions (e.g. comparing call structure between ranges). Austin has several sites that were sampled over years, while New Orleans was sampled in different years at different sites (still allows us to assess change over time at the city-scale).
For Austin, take calls from site-years not sampled over time for the spatial question. This includes the Austin site sampled at only the individual scale (BART). Aside from BART, 4 Austin site-years were not sampled over time: two 2019 sites, one 2011 site, one 2004 site.
spatial_austin <- nat_inv_est %>%
as_tibble() %>%
# filter(social_scale == "Site") %>%
filter(invasive_city %in% c("Austin")) %>%
group_by(invasive_city, site) %>%
dplyr::summarise(
n_years = n_distinct(year)
) %>%
filter(n_years == 1) %>%
ungroup() %>%
inner_join(
nat_inv_est %>%
as_tibble() %>%
# filter(social_scale == "Site") %>%
dplyr::select(site, sound.files, year),
by = "site"
) %>%
group_by(invasive_city, site, year) %>%
dplyr::summarise(
n_calls = length(sound.files)
)
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'invasive_city', 'site'. You can override using the `.groups` argument.
spatial_austin
## # A tibble: 5 x 4
## # Groups: invasive_city, site [5]
## invasive_city site year n_calls
## <chr> <chr> <chr> <int>
## 1 Austin AIRP 2019 7
## 2 Austin BART 2011 23
## 3 Austin COMM 2004 11
## 4 Austin MANO 2019 5
## 5 Austin SOFT 2011 14
How many calls would we use for the spatial question from Austin?
# Note that BART 2011 is from the individual scale
spatial_austin %>%
pull(n_calls) %>%
sum()
## [1] 60
Looks good, initialize these Austin site-years for spatial questions.
sp_aust_sites <- spatial_austin %>%
pull(site)
sp_aust_sites
## [1] "AIRP" "BART" "COMM" "MANO" "SOFT"
Now choosing New Orleans calls for spatial questions. Note that CAME is from the individual scale only as well.
nat_inv_est %>%
filter(invasive_city %in% c("New Orleans")) %>%
group_by(invasive_city, site, year) %>%
dplyr::summarise(
n_calls = length(sound.files)
)
## `summarise()` has grouped output by 'invasive_city', 'site'. You can override using the `.groups` argument.
## # A tibble: 6 x 4
## # Groups: invasive_city, site [6]
## invasive_city site year n_calls
## <chr> <chr> <chr> <int>
## 1 New Orleans BALL 2004 26
## 2 New Orleans CAME 2004 12
## 3 New Orleans CANA 2004 13
## 4 New Orleans FOLS 2004 10
## 5 New Orleans LAKE 2011 5
## 6 New Orleans ROBE 2011 43
Only two 2011 sites. Take calls from two 2004 sites, which leaves two sites in each year for temporal questions. Initialized FOLS (low number of calls) and CAME (low number of calls, individual scale only) for spatial questions.
sp_norl_sites <- c("FOLS", "CAME")
Which invasive cities were sampled over time? Austin, New Orleans.
nat_inv_est %>%
filter(!is.na(invasive_city)) %>%
group_by(invasive_city) %>%
summarise(n_years = n_distinct(year))
## # A tibble: 8 x 2
## invasive_city n_years
## <chr> <int>
## 1 Austin 3
## 2 Dallas 1
## 3 El Paso 1
## 4 Gilbert 1
## 5 Miami 1
## 6 Milford 1
## 7 New Orleans 2
## 8 Stratford 1
Minimum of 5 calls in site-years for these cities.
nat_inv_est %>%
filter(!is.na(invasive_city)) %>%
filter(invasive_city %in% c("Austin", "New Orleans")) %>%
group_by(invasive_city, site_year) %>%
dplyr::summarise(
n_calls = length(sound.files)
) %>%
pull(n_calls) %>%
min()
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## [1] 5
Which sites per city were sampled over time? Here, dropping site-years already set aside used for spatial questions.
nat_inv_est %>%
filter(social_scale == "Site") %>%
filter(range == "Invasive") %>%
filter(invasive_city %in% c("Austin", "New Orleans")) %>%
filter(!site %in% c(sp_aust_sites, sp_norl_sites)) %>%
group_by(invasive_city, year) %>%
dplyr::summarise(
uniq_sites = paste(unique(site), collapse = "; ")
)
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## # A tibble: 5 x 3
## # Groups: invasive_city [2]
## invasive_city year uniq_sites
## <chr> <chr> <chr>
## 1 Austin 2004 SOCC; VALL
## 2 Austin 2011 ELEM; INTR; MART; VALL
## 3 Austin 2019 ELEM; INTR; SOCC; MART
## 4 New Orleans 2004 CANA; BALL
## 5 New Orleans 2011 LAKE; ROBE
Pretty decent number of sites sampled over time. We can assess change over time at two scales, either for the whole city or sites within cities. New Orleans can be assessed only at the city scale, but Austin can be assessed at specific sites over time.
Split the invasive range calls for the spatial (between ranges) and temporal (over time in the invasive range) comparisons. Here, we set aside the Austin site-years sampled over time and 2 New Orleans sites per sampling year for the temporal question, and used the rest for the spatial comparison between ranges. The set of sites per comparison will be sampled differently for training, validation, and prediction.
For native range sites and invasive range sites used for the comparison between ranges, I took 1/2 of the total calls per site for training. Among the remaining calls, I took 1/3 for validation and 2/3 for prediction. For invasive range sites used for temporal comparisons (change over time in the invasive range), I set 20 calls aside for prediction (or all calls if less than 20 were present), and split the remaining calls by half into validation and training (or none for either, if all calls taken for prediction).
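The arithmetic behind the spatial split can be checked with a quick sketch. As a hypothetical worked example, take a site with 17 total calls (matching AUDU_2004 in the sampling scheme output below); note that base R's round() rounds half to even, so round(17/2) yields 8, not 9.

```r
# Worked example of the spatial splitting rule for a site with 17 calls
# Note: base R round() uses round-half-to-even, so round(8.5) is 8
total_calls <- 17
n_train <- round(total_calls / 2)                      # 8 calls for training
n_validation <- round((total_calls - n_train) / 3)     # 3 calls for validation
n_prediction <- total_calls - (n_train + n_validation) # 6 calls for prediction
c(train = n_train, validation = n_validation, prediction = n_prediction)
```

These per-site counts (8, 3, 6) match the AUDU_2004 row of the ML_scheme object created below.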
Split the invasive range calls into spatial and temporal datasets, as random sampling will be done differently depending on whether a given site/city was sampled over time.
inv_spatial <- nat_inv_est %>%
filter(range == "Invasive") %>%
# Get all sites from cities not sampled over time
filter(!invasive_city %in% c("Austin", "New Orleans")) %>%
# Add back the Austin and New Orleans sites designated for the spatial question
bind_rows(
nat_inv_est %>%
filter(range == "Invasive") %>%
filter(site %in% c(sp_aust_sites, sp_norl_sites))
)
# unique(inv_spatial$site)
# glimpse(inv_spatial)
inv_temporal <- nat_inv_est %>%
filter(range == "Invasive") %>%
# Filter to Austin and New Orleans sites
filter(invasive_city %in% c("Austin", "New Orleans")) %>%
# Exclude sites already designated for the spatial question
filter(!site %in% c(sp_aust_sites, sp_norl_sites))
# unique(inv_temporal$site)
# glimpse(inv_temporal)
# Checking, looks good
# (nrow(inv_spatial) + nrow(inv_temporal)) == nrow(nat_inv_est %>% filter(range == "Invasive"))
Check out sites and number of calls for the invasive spatial dataset. 15 sites (BART and CAME are individual scale only, because we used calls from both social scales for this analysis) across 8 cities, with sampling spanning multiple years.
inv_spatial %>%
group_by(range, invasive_city, site, year) %>%
dplyr::summarise(n_calls = length(sound.files))
## `summarise()` has grouped output by 'range', 'invasive_city', 'site'. You can override using the `.groups` argument.
## # A tibble: 15 x 5
## # Groups: range, invasive_city, site [15]
## range invasive_city site year n_calls
## <fct> <chr> <chr> <chr> <int>
## 1 Invasive Austin AIRP 2019 7
## 2 Invasive Austin BART 2011 23
## 3 Invasive Austin COMM 2004 11
## 4 Invasive Austin MANO 2019 5
## 5 Invasive Austin SOFT 2011 14
## 6 Invasive Dallas LAWT 2011 9
## 7 Invasive El Paso ASCA 2019 25
## 8 Invasive Gilbert GILB 2018 16
## 9 Invasive Miami BAPT 2004 40
## 10 Invasive Milford AUDU 2004 17
## 11 Invasive Milford BUCK 2004 60
## 12 Invasive Milford MEAD 2004 28
## 13 Invasive New Orleans CAME 2004 12
## 14 Invasive New Orleans FOLS 2004 10
## 15 Invasive Stratford SHAK 2004 50
Check out sites and number of calls for the invasive temporal dataset. 14 site-years, 2 invasive cities. Five sites sampled over 2 years in Austin, 2 years of sampling at the city scale in New Orleans.
inv_temporal %>%
group_by(range, invasive_city, site, year) %>%
dplyr::summarise(n_calls = length(sound.files))
## `summarise()` has grouped output by 'range', 'invasive_city', 'site'. You can override using the `.groups` argument.
## # A tibble: 14 x 5
## # Groups: range, invasive_city, site [9]
## range invasive_city site year n_calls
## <fct> <chr> <chr> <chr> <int>
## 1 Invasive Austin ELEM 2011 8
## 2 Invasive Austin ELEM 2019 88
## 3 Invasive Austin INTR 2011 15
## 4 Invasive Austin INTR 2019 84
## 5 Invasive Austin MART 2011 14
## 6 Invasive Austin MART 2019 49
## 7 Invasive Austin SOCC 2004 77
## 8 Invasive Austin SOCC 2019 102
## 9 Invasive Austin VALL 2004 5
## 10 Invasive Austin VALL 2011 10
## 11 Invasive New Orleans BALL 2004 26
## 12 Invasive New Orleans CANA 2004 13
## 13 Invasive New Orleans LAKE 2011 5
## 14 Invasive New Orleans ROBE 2011 43
Did contact call structure change after invasion? Supervised ML models were trained to discriminate between native and invasive range calls (ranges were class labels in this two-class prediction problem). After validating models, we chose one model to predict classes (native or invasive range) of calls in the prediction dataset that were set aside for spatial and temporal comparisons. We used classification accuracy throughout the modeling process as an indicator of structural differences between ranges, and visualized overlap between ranges in low-dimensional acoustic space. We assessed the number of misclassified calls from different cities in the U.S. for a finer geographic assessment of structural change, and used misclassified calls from Austin and New Orleans to assess the possibility of structural change over time in the invasive range.
Sampling scheme across native and invasive ranges for the spatial and temporal comparisons, as described above.
# Make a data frame that contains the number of randomly sampled calls that should be chosen for training, validation and prediction at each site-year
ML_scheme <- nat_inv_est %>%
filter(range == "Native") %>%
# Make sampling scheme for the native and invasive sites used for spatial questions
bind_rows(inv_spatial) %>%
# Per site-year, find the number of calls to sample for training, validation and prediction
# The number of calls per site-year will vary depending on the total calls
# Start by calculating the total number of calls per site-year
group_by(site_year) %>%
dplyr::summarise(
total_calls = length(sound.files)
) %>%
# Calculate 1/2 of total for training
# 1/3 of the remaining for validation
# 2/3 of the remaining for prediction
dplyr::mutate(
n_train = round(total_calls/2),
n_validation = round((total_calls - n_train)/3),
n_prediction = total_calls - (n_train + n_validation)
) %>%
ungroup() %>%
# Make sampling scheme for the invasive sites used for temporal questions
# Set aside 20 calls for prediction (or all calls if fewer than 20)
# Split the rest evenly between training and validation
bind_rows(
inv_temporal %>%
group_by(site_year) %>%
dplyr::summarise(
total_calls = length(sound.files)
) %>%
dplyr::mutate(
n_prediction = ifelse(total_calls > 20, 20, total_calls),
n_train = ifelse(n_prediction >= 20, round((total_calls - n_prediction)/2), 0),
n_validation = total_calls - (n_train + n_prediction)
) %>%
ungroup()
)
head(ML_scheme)
## # A tibble: 6 x 5
## site_year total_calls n_train n_validation n_prediction
## <chr> <int> <dbl> <dbl> <dbl>
## 1 1145_2017 86 43 14 29
## 2 AIRP_2019 7 4 1 2
## 3 ARAP_2017 12 6 2 4
## 4 ARAZ_2017 15 8 2 5
## 5 ASCA_2019 25 12 4 9
## 6 AUDU_2004 17 8 3 6
Randomly sample calls per site for training.
set.seed(seed)
train_df <- nat_inv_est %>%
filter(range == "Native") %>%
bind_rows(inv_spatial) %>%
bind_rows(inv_temporal) %>%
# Random sampling for both questions
group_by(site_year) %>%
nest() %>%
ungroup() %>%
inner_join(
ML_scheme %>%
dplyr::select(site_year, n_train),
by = "site_year"
) %>%
dplyr::mutate(
rsamp_call = purrr::map2(data, n_train, sample_n, replace = FALSE)
) %>%
dplyr::select(-data) %>%
unnest(rsamp_call) %>%
dplyr::select(-c(n_train))
# glimpse(train_df)
Randomly sample calls per site for validation.
set.seed(seed)
validation_df <- nat_inv_est %>%
filter(range == "Native") %>%
bind_rows(inv_spatial) %>%
bind_rows(inv_temporal) %>%
# Filter out calls already allocated to training
filter(!sound.files %in% train_df$sound.files) %>%
# Random sampling for both questions
group_by(site_year) %>%
nest() %>%
ungroup() %>%
inner_join(
ML_scheme %>%
dplyr::select(site_year, n_validation),
by = "site_year"
) %>%
dplyr::mutate(
rsamp_call = purrr::map2(data, n_validation, sample_n, replace = FALSE)
) %>%
dplyr::select(-data) %>%
unnest(rsamp_call) %>%
dplyr::select(-c(n_validation))
# glimpse(validation_df)
Set aside the remaining calls per site for prediction.
set.seed(seed)
prediction_df <- nat_inv_est %>%
filter(range == "Native") %>%
bind_rows(inv_spatial) %>%
bind_rows(inv_temporal) %>%
# Remove calls already allocated to training and validation
filter(!sound.files %in% train_df$sound.files & !sound.files %in% validation_df$sound.files) %>%
droplevels()
# glimpse(prediction_df)
Check out sample sizes across the different datasets.
676 calls allocated to training: 43% of calls used for supervised machine learning.
nrow(train_df)
## [1] 676
round(nrow(train_df)/nrow(nat_inv_est), 2)*100
## [1] 43
337 calls allocated to validation: 22% of calls.
nrow(validation_df)
## [1] 337
round(nrow(validation_df)/nrow(nat_inv_est), 2)*100
## [1] 22
548 calls allocated to prediction: 35% of calls.
nrow(prediction_df)
## [1] 548
round(nrow(prediction_df)/nrow(nat_inv_est), 2)*100
## [1] 35
How many native and invasive range calls used for training? About 20 fewer invasive than native calls (327 versus 349).
train_df %>%
group_by(range) %>%
dplyr::summarise(
n_calls = length(sound.files)
)
## # A tibble: 2 x 2
## range n_calls
## <fct> <int>
## 1 Native 349
## 2 Invasive 327
How many native and invasive range calls used for validation? 116 and 221, respectively.
validation_df %>%
group_by(range) %>%
dplyr::summarise(
n_calls = length(sound.files)
)
## # A tibble: 2 x 2
## range n_calls
## <fct> <int>
## 1 Native 116
## 2 Invasive 221
How many native and invasive range calls used for prediction? 230 and 318, respectively.
prediction_df %>%
group_by(range) %>%
dplyr::summarise(
n_calls = length(sound.files)
)
## # A tibble: 2 x 2
## range n_calls
## <fct> <int>
## 1 Native 230
## 2 Invasive 318
The balance between the native and invasive classes is pretty even, although most balanced for training. The percentage of calls used across these datasets for predictive modeling is pretty good as well (almost 1/2 of calls used for training, the rest split between validation and prediction).
Full prediction dataset for the invasive range, by city.
prediction_df %>%
filter(range == "Invasive") %>%
group_by(range, invasive_city) %>%
summarise(
n_calls = length(sound.files)
)
## `summarise()` has grouped output by 'range'. You can override using the `.groups` argument.
## # A tibble: 8 x 3
## # Groups: range [1]
## range invasive_city n_calls
## <fct> <chr> <int>
## 1 Invasive Austin 171
## 2 Invasive Dallas 3
## 3 Invasive El Paso 9
## 4 Invasive Gilbert 5
## 5 Invasive Miami 13
## 6 Invasive Milford 35
## 7 Invasive New Orleans 65
## 8 Invasive Stratford 17
How many sites and calls per range and site-year in the prediction dataset for the comparison between ranges? 37 native, 15 invasive site-years (two invasive sites are individual scale only).
spat_pred_sites <- inv_spatial %>%
pull(site_year) %>%
unique()
# spat_pred_sites
prediction_df %>%
filter(site_year %in% spat_pred_sites | range == "Native") %>%
group_by(range) %>%
summarise(
n_sites = n_distinct(site)
)
## # A tibble: 2 x 2
## range n_sites
## <fct> <int>
## 1 Native 37
## 2 Invasive 15
How good is our spatial resolution of the invasive range for prediction? Not bad, although resolution for Dallas and New Orleans is poor. But the number of Dallas calls was limited to begin with (9 calls total), and a native-invasive comparison was built into the temporal question because native range calls were set aside for prediction as well (another opportunity to evaluate New Orleans calls against native range calls).
# Invasive range only, spatial comparison calls
prediction_df %>%
filter(site_year %in% spat_pred_sites) %>%
group_by(invasive_city) %>%
summarise(
n_calls = length(sound.files)
)
## # A tibble: 8 x 2
## invasive_city n_calls
## <chr> <int>
## 1 Austin 19
## 2 Dallas 3
## 3 El Paso 9
## 4 Gilbert 5
## 5 Miami 13
## 6 Milford 35
## 7 New Orleans 7
## 8 Stratford 17
The number of calls per site-year ranges from 2 to 29.
# Native and invasive range
prediction_df %>%
filter(site_year %in% spat_pred_sites | range == "Native") %>%
group_by(range, site_year) %>%
summarise(
n_calls = length(sound.files)
) %>%
pull(n_calls) %>%
range()
## `summarise()` has grouped output by 'range'. You can override using the `.groups` argument.
## [1] 2 29
Sample sizes for the temporal question in the prediction dataset. These look good for both city-scale and site-year-scale temporal comparisons.
By city: 152 Austin calls, 58 New Orleans calls over time.
temp_pred_sites <- inv_temporal %>%
pull(site_year) %>%
unique()
temp_pred_sites
## [1] "ROBE_2011" "ELEM_2019" "INTR_2019" "SOCC_2019" "ELEM_2011" "INTR_2011"
## [7] "MART_2011" "VALL_2011" "LAKE_2011" "MART_2019" "SOCC_2004" "CANA_2004"
## [13] "BALL_2004" "VALL_2004"
prediction_df %>%
filter(site_year %in% temp_pred_sites) %>%
group_by(invasive_city) %>%
summarise(
n_calls = length(sound.files)
)
## # A tibble: 2 x 2
## invasive_city n_calls
## <chr> <int>
## 1 Austin 152
## 2 New Orleans 58
By city and year.
prediction_df %>%
filter(site_year %in% temp_pred_sites) %>%
group_by(invasive_city, year) %>%
summarise(
n_calls = length(sound.files)
)
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## # A tibble: 5 x 3
## # Groups: invasive_city [2]
## invasive_city year n_calls
## <chr> <chr> <int>
## 1 Austin 2004 25
## 2 Austin 2011 47
## 3 Austin 2019 80
## 4 New Orleans 2004 33
## 5 New Orleans 2011 25
By site-year.
prediction_df %>%
filter(site_year %in% temp_pred_sites) %>%
group_by(invasive_city, site_year) %>%
summarise(
n_calls = length(sound.files)
)
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## # A tibble: 14 x 3
## # Groups: invasive_city [2]
## invasive_city site_year n_calls
## <chr> <chr> <int>
## 1 Austin ELEM_2011 8
## 2 Austin ELEM_2019 20
## 3 Austin INTR_2011 15
## 4 Austin INTR_2019 20
## 5 Austin MART_2011 14
## 6 Austin MART_2019 20
## 7 Austin SOCC_2004 20
## 8 Austin SOCC_2019 20
## 9 Austin VALL_2004 5
## 10 Austin VALL_2011 10
## 11 New Orleans BALL_2004 20
## 12 New Orleans CANA_2004 13
## 13 New Orleans LAKE_2011 5
## 14 New Orleans ROBE_2011 20
Combined the training, validation and prediction datasets into a single object.
sup_ML_o <- bind_rows(
train_df %>%
dplyr::mutate(
set = "training"
),
validation_df %>%
dplyr::mutate(
set = "validation"
),
prediction_df %>%
dplyr::mutate(
set = "prediction"
)
)
glimpse(sup_ML_o)
# nrow(sup_ML_o) == nrow(nat_inv_est) # Looks good
# Check for duplicated sound files. None, looks good
# length(which(duplicated(sup_ML_o$sound.files)))
Merge this data frame with the features measured above. We also decided to add the standard set of spectral acoustic measurements, as it is easier to map structural differentiation back to raw measurements than to features (like principal components), which are less interpretable combinations of the original measurements. Retained metadata useful for subsequent analyses.
# Features generated for supervised random forests that can be used for other classification questions by supervised ML (full EST across social scales)
ML_feats_df <- read.csv(file.path(path, "supervised_RF_datasets_acousticSimilarity.csv")) %>%
dplyr::mutate(
year = as.character(year)
) %>%
# Remove image features that were generated for a separate analysis of hierarchical mapping
dplyr::select(-c(names(.)[grep("^img_", names(.))]))
glimpse(ML_feats_df)
# 27 original spectral acoustic measurements across calls (full EST across social scales)
acs_params <- read.csv(file.path(path, "acoustic_parameters.csv")) %>%
dplyr::mutate(
selec = as.character(selec)
)
glimpse(acs_params)
# Merge data frames
sup_ML <- sup_ML_o %>%
dplyr::mutate(
selec = as.character(selec)
) %>%
inner_join(
acs_params,
by = c("sound.files", "selec")
) %>%
inner_join(
ML_feats_df %>%
dplyr::select(-c(date, range, social_scale, site, Bird_ID, region, country, site_year, year, dept_state, invasive_city, rf_set)),
by = "sound.files"
) %>%
dplyr::select(-c(selec, start, end, date, social_scale, site, Bird_ID, lat, lon, Overlapping_signal, Truncated, tailored, prev_SNR, SNR_margin_works, SNR_before_only, old.sound.file.name, SNR))
# Get names of predictors
pnms <- names(sup_ML)[which(sapply(sup_ML, is.numeric))]
head(pnms)
length(pnms)
# Remove additional metadata columns to retain only those useful for either ML or for joining metadata back later (this line removes the "rf_set" column used for supervised machine learning analysis in the hierarchical mapping project)
sup_ML <- sup_ML %>%
dplyr::select(c(sound.files, range, set, all_of(pnms)))
# Looks good, 1561 calls, 217 predictors, 3 metadata columns
glimpse(sup_ML)
dim(sup_ML)
Check collinearity of all predictors prior to building models.
# Perform Pearson's correlation among all numeric values
cor <- stats::cor(sup_ML[, sapply(sup_ML, is.numeric)], method = "pearson")
str(cor)
range(cor[cor < 1])
# Use the correlation matrix to find names of variables that have > 0.75 or < -0.75 collinearity
corr_nms <- caret::findCorrelation(cor, cutoff = 0.75, verbose = TRUE, names = TRUE, exact = TRUE)
# 14 variables with Pearson's |r| > 0.75, these should be removed prior to building models
corr_nms
# [1] "meanfreq" "acs_params_PCA_1" "acs_params_PCA_2" "freq.median"
# [5] "freq.Q25" "entropy" "sd" "meandom"
# [9] "sfm" "time.Q75" "time.IQR" "time.median"
# [13] "time.ent" "skew"
Check collinearity with signal-to-noise ratio (SNR) too, again using Pearson’s correlation.
# Are sound files still in the same order? If so, then can do the column addition below
all(sup_ML$sound.files == sup_ML_o$sound.files)
sup_ML_SNR <- sup_ML %>%
dplyr::mutate(
SNR = sup_ML_o$SNR
) %>%
filter(!is.na(SNR)) %>%
droplevels()
dim(sup_ML_SNR)
# Check for high correlations among features and SNR
cor_SNR <- stats::cor(sup_ML_SNR[, sapply(sup_ML_SNR, is.numeric) & names(sup_ML_SNR) != "SNR"], sup_ML_SNR$SNR, method = "pearson")
range(cor_SNR[, 1])
# No features are highly correlated with SNR
which(abs(cor_SNR[, 1]) > 0.75)
Remove collinear predictors prior to proceeding with model-building. 203 numeric predictors remain.
sup_ML_fin <- sup_ML %>%
dplyr::select(-c(all_of(corr_nms)))
dim(sup_ML_fin)
length(which(sapply(sup_ML_fin, is.numeric)))
# 14 columns dropped, looks good
ncol(sup_ML) - ncol(sup_ML_fin)
# Check for duplicated sound files one more time? None, looks good
# length(which(duplicated(sup_ML_fin$sound.files)))
Write out the final dataset for supervised machine learning analyses, both as an .RDS object and as a .csv for data sharing.
saveRDS(sup_ML_fin, file.path(path, "sup_ML_fin.RDS"))
write.csv(sup_ML_fin, file.path(path, "sup_ML_fin.csv"), row.names = FALSE)
Start training/tuning models for each question. Read the data back in to begin analysis.
# sup_ML_df <- read.csv(file.path(path, "sup_ML_fin.csv"))
sup_ML_df <- readRDS(file.path(path, "sup_ML_fin.RDS"))
dim(sup_ML_df)
## [1] 1561 206
# glimpse(sup_ML_df)
203 numeric predictors.
length(which(sapply(sup_ML_df, is.numeric)))
## [1] 203
15 of these predictors were original acoustic measurements, and 188 were MDS or PCA features. I decided to use these predictors later to determine whether any of the original spectral measurements explained the probability of assignment of calls back to either range by the final model (an indication of structural differences between ranges, see partial dependence plots below).
pnms <- names(sup_ML_df)[which(sapply(sup_ML_df, is.numeric))]
length(pnms[-grep("MDS|PCA", pnms)])
## [1] 15
pnms[-grep("MDS|PCA", pnms)]
## [1] "duration" "freq.Q75" "freq.IQR" "time.Q25" "kurt" "sp.ent"
## [7] "mindom" "maxdom" "dfrange" "modindx" "startdom" "enddom"
## [13] "dfslope" "meanpeakf" "peakf"
length(pnms[grep("SPCC_spec", pnms)])
## [1] 15
length(pnms[grep("SPCC_ceps", pnms)])
## [1] 15
length(pnms[grep("DTW_domf", pnms)])
## [1] 15
length(pnms[grep("DTW_spen", pnms)])
## [1] 15
length(pnms[grep("DTW_mult", pnms)])
## [1] 15
length(pnms[grep("acs_params", pnms)])
## [1] 25
length(pnms[grep("cep_coeff", pnms)])
## [1] 88
Check out the distributions of the raw acoustic measurements per range.
raw_params_df <- sup_ML_df %>%
dplyr::select(range, all_of(pnms[-grep("MDS|PCA", pnms)])) %>%
# Convert to long format
pivot_longer(
cols = names(.)[-grep("^range$", names(.))],
names_to = "parameter",
values_to = "values"
) %>%
dplyr::mutate(
range = as.character(range),
range = factor(range, levels = c("Native", "Invasive"))
)
# glimpse(raw_params_df)
# names(raw_params_df)
# levels(raw_params_df$range)
cols <- scales::alpha(c("navy", "orange"), 0.85)
raw_params_df %>%
ggplot(aes(x = range, y = values, fill = range, color = range)) +
geom_boxplot() +
scale_fill_manual(values = cols) +
scale_color_manual(values = cols) +
facet_wrap(~ parameter, scales = "free_y") +
guides(fill = guide_legend(title = "Range"), color = guide_legend(title = "Range")) +
xlab("") +
ylab("Parameter values") +
theme_bw() +
theme(
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.title = element_text(size = 12),
legend.text = element_text(size = 12),
strip.text = element_text(size = 14),
legend.position = "top",
panel.grid.major = element_line(size = 0.15),
panel.grid.minor = element_line(size = 0.15),
axis.ticks = element_line(size = 0.25),
legend.margin = margin(0, 0, 0, 0),
legend.box.margin = margin(-8, -8, -8, -8)
)
Some of these spectral acoustic measurements show visible differences between ranges.
Trained and tuned two model families with built-in variable importance metrics for classification back to the native or invasive range: stochastic gradient boosting and random forests. I wanted to assess the importance of the raw acoustic measurements and acoustic features used as predictors.
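As a toy illustration of the built-in importance metrics these model families expose, a small ranger random forest fit to iris (standing in for the call data) returns a named importance score per predictor.

```r
library(ranger)

# Small random forest on iris as a stand-in for the call dataset
fit <- ranger(
  Species ~ ., data = iris,
  num.trees = 100,
  importance = "permutation", # ranger also supports "impurity"
  seed = 401
)

# Named vector of permutation importance, one entry per predictor
sort(ranger::importance(fit), decreasing = TRUE)
```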
Initialize training data, retain only columns needed for machine learning.
table(sup_ML_df$set)
##
## prediction training validation
## 548 676 337
train_df <- sup_ML_df %>%
filter(set == "training") %>%
dplyr::mutate(
range = factor(range) # Must be a factor for caret::train
) %>%
dplyr::select(-c(sound.files, set))
dim(train_df)
## [1] 676 204
head(names(train_df))
## [1] "range" "duration" "freq.Q75" "freq.IQR" "time.Q25" "kurt"
Set caret::train control parameters, to be used across models for training and/or tuning.
# Must provide a tunegrid object to caret::train if search is set to grid in caret::trainControl. Alternative is to set tuneLength in caret::train instead of tuneGrid, and set search to "random" in trainControl
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))
fitControl <- trainControl(
method = 'repeatedcv', # repeated k-fold cross validation
number = 5, # number of folds
repeats = 5, # number of repeats
savePredictions = 'final', # save predictions for optimal tuning parameter
classProbs = TRUE, # return class probabilities
summaryFunction = fiveStats, # summary function for metrics (2-class problem)
allowParallel = TRUE, # use parallel backend if available
search = "grid", # use grid search routine for tuning
returnData = TRUE, # save data
returnResamp = "final", # return the final resampled summary metrics
verboseIter = TRUE # display training log, otherwise no progress bar
)
Trained and tuned a stochastic gradient boosting model (method = "gbm", package gbm).
n.trees <- seq(100, 1600, 500) # default is 100 trees
interaction.depth <- seq(1, 3, 1) # default is 1 (additive model), 2 = model with up to 2-way interactions, etc
shrinkage <- c(seq(0.001, 0.1, 0.05), 0.1) # learning rate (contribution of each tree); default is 0.1, values from 0.001 to 0.1 usually work
n.minobsinnode <- 1 # min observations in trees' terminal nodes
tunegrid <- expand.grid(.n.trees = n.trees, .interaction.depth = interaction.depth, .shrinkage = shrinkage, .n.minobsinnode = n.minobsinnode)
tunegrid
# Gradient boosting (GBM)
set.seed(seed)
model_GBM <- train(range ~ ., data = train_df, method = 'gbm', tuneGrid = tunegrid, trControl = fitControl)
saveRDS(model_GBM, file.path(path, "model_GBM.RDS"))
Check out final model: n.trees = 1600, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 1.
model_GBM <- readRDS(file.path(path, "model_GBM.RDS"))
model_GBM
## Stochastic Gradient Boosting
##
## 676 samples
## 203 predictors
## 2 classes: 'Invasive', 'Native'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 541, 540, 541, 540, 542, 541, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees ROC Sens Spec
## 0.001 1 100 0.8666084 0.5331282 0.9041491
## 0.001 1 600 0.8834416 0.7168858 0.8303354
## 0.001 1 1100 0.8923738 0.7285501 0.8497805
## 0.001 1 1600 0.9009991 0.7413986 0.8561242
## 0.001 2 100 0.8877427 0.6124196 0.9053913
## 0.001 2 600 0.9038213 0.7756084 0.8366542
## 0.001 2 1100 0.9151044 0.7963730 0.8527288
## 0.001 2 1600 0.9245664 0.8110676 0.8573085
## 0.001 3 100 0.9019266 0.6350303 0.9185424
## 0.001 3 600 0.9173590 0.8037576 0.8567288
## 0.001 3 1100 0.9291327 0.8233100 0.8607288
## 0.001 3 1600 0.9369714 0.8428904 0.8641822
## 0.051 1 100 0.9386602 0.8287646 0.8721822
## 0.051 1 600 0.9699025 0.8898275 0.9226087
## 0.051 1 1100 0.9719870 0.8898462 0.9249110
## 0.051 1 1600 0.9727460 0.8923077 0.9271801
## 0.051 2 100 0.9546729 0.8666667 0.8859710
## 0.051 2 600 0.9719701 0.8990396 0.9186335
## 0.051 2 1100 0.9746530 0.9051189 0.9260455
## 0.051 2 1600 0.9757203 0.9045221 0.9294907
## 0.051 3 100 0.9575512 0.8740233 0.8893996
## 0.051 3 600 0.9735224 0.8978089 0.9260704
## 0.051 3 1100 0.9759176 0.9051282 0.9295155
## 0.051 3 1600 0.9767323 0.9075991 0.9317930
## 0.100 1 100 0.9544239 0.8562424 0.8945590
## 0.100 1 600 0.9698300 0.8911049 0.9226087
## 0.100 1 1100 0.9717820 0.8947599 0.9243147
## 0.100 1 1600 0.9722513 0.9002238 0.9214410
## 0.100 2 100 0.9612362 0.8764382 0.9019876
## 0.100 2 600 0.9731488 0.8984242 0.9254576
## 0.100 2 1100 0.9748158 0.9002611 0.9277598
## 0.100 2 1600 0.9760072 0.9027319 0.9288861
## 0.100 3 100 0.9656127 0.8868718 0.9048861
## 0.100 3 600 0.9754100 0.9100326 0.9289193
## 0.100 3 1100 0.9766214 0.9106946 0.9294990
## 0.100 3 1600 0.9772890 0.9113100 0.9334990
## Accuracy Kappa
## 0.7246132 0.4420302
## 0.7754282 0.5490385
## 0.7910928 0.5803041
## 0.8005788 0.5994044
## 0.7636113 0.5224180
## 0.8070735 0.6131827
## 0.8254334 0.6499850
## 0.8349129 0.6690797
## 0.7813366 0.5583391
## 0.8310609 0.6613093
## 0.8425991 0.6845822
## 0.8538432 0.7072869
## 0.8511722 0.7016860
## 0.9067785 0.8131682
## 0.9079549 0.8155188
## 0.9103319 0.8202886
## 0.8766300 0.7528817
## 0.9091620 0.8180253
## 0.9159419 0.8315946
## 0.9174300 0.8345518
## 0.8819351 0.7635594
## 0.9123995 0.8244601
## 0.9177242 0.8351454
## 0.9201012 0.8399086
## 0.8760155 0.7514967
## 0.9073754 0.8143815
## 0.9100313 0.8197110
## 0.9111902 0.8220793
## 0.8896302 0.7788466
## 0.9123885 0.8244451
## 0.9144736 0.8286270
## 0.9162470 0.8321974
## 0.8961510 0.7920013
## 0.9197873 0.8393140
## 0.9203952 0.8405386
## 0.9227547 0.8452523
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 1600, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 1.
Check out the confusion matrix: 92.28% accuracy by repeated k-fold cross-validation. 95% CI: (91.33%, 93.16%). Kappa = 0.8453.
confusionMatrix(model_GBM$pred$obs, model_GBM$pred$pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Invasive Native
## Invasive 1490 145
## Native 116 1629
##
## Accuracy : 0.9228
## 95% CI : (0.9133, 0.9316)
## No Information Rate : 0.5249
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8453
##
## Mcnemar's Test P-Value : 0.08307
##
## Sensitivity : 0.9278
## Specificity : 0.9183
## Pos Pred Value : 0.9113
## Neg Pred Value : 0.9335
## Prevalence : 0.4751
## Detection Rate : 0.4408
## Detection Prevalence : 0.4837
## Balanced Accuracy : 0.9230
##
## 'Positive' Class : Invasive
##
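As a quick sanity check on the metrics reported by caret, overall accuracy and Cohen's kappa can be recomputed directly from the pooled confusion matrix counts above (a minimal base R sketch):

```r
# Pooled cross-validation counts from the GBM confusion matrix above
tab <- matrix(c(1490, 116, 145, 1629), nrow = 2,
              dimnames = list(Prediction = c("Invasive", "Native"),
                              Reference = c("Invasive", "Native")))
n <- sum(tab)
accuracy <- sum(diag(tab)) / n
# Chance agreement expected from the marginal totals
p_chance <- sum(rowSums(tab) * colSums(tab)) / n^2
kappa <- (accuracy - p_chance) / (1 - p_chance)
round(accuracy, 4) # 0.9228
round(kappa, 4)    # 0.8453
```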
Random forests (RF) training and tuning. Setting the seed here is not strictly necessary, as every forest will differ due to the random sampling at every node.
# 10 mtry values for tuning
mtry <- round(seq(2, ncol(train_df[, sapply(train_df, is.numeric)]), ncol(train_df[, sapply(train_df, is.numeric)])/10))
mtry
length(mtry)
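For reference, with the 203 numeric predictors in train_df, the expression above evaluates to the 10 mtry values that appear in the resampling results below (a minimal sketch with the predictor count hard-coded):

```r
# With 203 numeric predictors, the grid rounds to the 10 mtry values
# shown in the resampling results
p <- 203
mtry_check <- round(seq(2, p, p/10))
mtry_check # 2, 22, 43, 63, 83, 104, 124, 144, 164, 185
```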
# Other parameters for ranger random forests
splitrule <- "gini"
min.node.size <- 1
# Make a data frame of all combinations of the parameters initialized for ranger random forests that can be placed into a tuning grid
# Note that the number of trees cannot be placed in a tuning grid for caret, here using 2000 trees
tunegrid <- expand.grid(.mtry = mtry, .splitrule = splitrule, .min.node.size = min.node.size)
tunegrid
set.seed(seed)
model_RF <- train(range ~ ., data = train_df, num.trees = 2000, method = "ranger", trControl = fitControl, tuneGrid = tunegrid, importance = "permutation", replace = TRUE, scale.permutation.importance = TRUE, num.threads = cores, maximize = TRUE)
saveRDS(model_RF, file.path(path, "model_RF.RDS"))
mtry of 2 in the final model.
model_RF <- readRDS(file.path(path, "model_RF.RDS"))
model_RF
## Random Forest
##
## 676 samples
## 203 predictors
## 2 classes: 'Invasive', 'Native'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 541, 540, 541, 540, 542, 541, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec Accuracy Kappa
## 2 0.9713853 0.8911329 0.9295072 0.9109312 0.8214807
## 22 0.9687683 0.8935291 0.9105590 0.9023492 0.8043769
## 43 0.9636139 0.8923450 0.9019876 0.8973253 0.7943952
## 63 0.9597669 0.8911142 0.8888033 0.8899156 0.7796407
## 83 0.9564917 0.8843823 0.8807702 0.8825123 0.7648629
## 104 0.9529082 0.8795058 0.8784845 0.8789656 0.7577544
## 124 0.9508079 0.8783030 0.8784513 0.8783619 0.7565414
## 144 0.9485359 0.8721772 0.8727288 0.8724534 0.7447047
## 164 0.9459910 0.8703590 0.8744513 0.8724579 0.7446895
## 185 0.9440756 0.8648205 0.8710145 0.8680067 0.7357485
##
## Tuning parameter 'splitrule' was held constant at a value of gini
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
## and min.node.size = 1.
Check out the confusion matrix: 91.09% accuracy, with 95% CI: (90.08%, 92.03%). Kappa = 0.8215.
confusionMatrix(model_RF$pred$obs, model_RF$pred$pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Invasive Native
## Invasive 1457 178
## Native 123 1622
##
## Accuracy : 0.9109
## 95% CI : (0.9008, 0.9203)
## No Information Rate : 0.5325
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8215
##
## Mcnemar's Test P-Value : 0.001855
##
## Sensitivity : 0.9222
## Specificity : 0.9011
## Pos Pred Value : 0.8911
## Neg Pred Value : 0.9295
## Prevalence : 0.4675
## Detection Rate : 0.4311
## Detection Prevalence : 0.4837
## Balanced Accuracy : 0.9116
##
## 'Positive' Class : Invasive
##
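The accuracy confidence interval reported by caret is the exact (Clopper–Pearson) binomial interval, which can be recomputed from the pooled counts above with base R's binom.test; this should reproduce the (0.9008, 0.9203) interval shown. A minimal sketch:

```r
# RF pooled resamples: 1457 + 1622 correct classifications of 3380 calls
correct <- 1457 + 1622
total <- 1457 + 178 + 123 + 1622
correct / total # 0.9109467, the 91.09% accuracy reported above
# Exact binomial 95% CI on accuracy
binom.test(correct, total)$conf.int
</imports>
```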
mlist <- list(model_RF, model_GBM)
names(mlist) <- c("RF", "GBM")
results <- resamples(mlist)
# summary(results)
dotplot(results, metric = c("Kappa", "Accuracy", "ROC"))
Both models performed well, with > 90% training classification accuracy.
Several of the original spectral acoustic measurements had high importance. The most important among the spectral acoustic measurements were spectral entropy and frequency interquartile range (freq.IQR).
n <- 12
# Feature types
# "SPCC_spec" "SPCC_ceps" "DTW_domf" "DTW_spen" "DTW_mult" "acs_params"
# "cep_coeff", raw parameters
cols <- c(
scales::alpha("purple", 0.6),
topo.colors(n)[2],
terrain.colors(n)[1],
heat.colors(n)[1],
heat.colors(n, alpha = 0.7)[5],
heat.colors(n)[8],
gray.colors(n)[5],
"black"
)
types <- c("SPCC_spec", "SPCC_ceps", "DTW_domf", "DTW_spen", "DTW_mult", "acs_params", "cep_coeff")
models <- c("Random Forests", "Stochastic \n Gradient Boosting")
gg_list <- list()
# i <- 1
invisible(pblapply(1:length(mlist), function(i){
# Make a data frame of variable importance with feature type
if(grepl("Boosting", models[i])){
tmp <- data.frame(
var_nms = rownames(varImp(mlist[[i]])$importance),
imp = varImp(mlist[[i]])$importance[["Overall"]]
)
} else {
tmp <- data.frame(
var_nms = names(mlist[[i]]$finalModel$variable.importance),
imp = mlist[[i]]$finalModel$variable.importance
)
}
tmp <- tmp %>%
# Add in a feature type column
dplyr::mutate(
var_type = sapply(1:nrow(tmp), function(y){
paste(
strsplit(as.character(tmp$var_nms[y]), split = "_")[[1]][1],
strsplit(as.character(tmp$var_nms[y]), split = "_")[[1]][2],
sep = "_"
)}),
var_type = ifelse(!var_type %in% types, "raw_params", var_type)
) %>%
dplyr::mutate(
var_type = factor(var_type, levels = c("SPCC_spec", "SPCC_ceps", "DTW_domf", "DTW_spen", "DTW_mult", "acs_params", "cep_coeff", "img_params", "raw_params")),
var_nms = as.character(var_nms)
) %>%
arrange(desc(imp)) %>%
slice(1:30) %>%
dplyr::mutate(
var_nms = factor(var_nms, levels = var_nms),
model = models[i]
) %>%
droplevels()
tmp_cols <- cols[grep(paste(levels(tmp$var_type), collapse = "|"), c(types, "raw_params"))]
# Add y-axis text to RF plot only
if(!grepl("Boosting", models[i])){
y_txt <- "Top 30 Most Important Features"
} else {
y_txt <- ""
}
gg_list[[i]] <<- ggplot(data = tmp) +
facet_grid(~ model) +
geom_segment(aes(x = 0, xend = imp, y = forcats::fct_rev(var_nms), yend = forcats::fct_rev(var_nms), color = var_type), size = 2) +
geom_point(aes(y = forcats::fct_rev(var_nms), x = imp), pch = 21, fill = gray.colors(n)[2], color = "black", size = 3.5, stroke = 0.25) +
scale_color_manual(values = tmp_cols) +
guides(color = FALSE) +
xlab("Variable Importance") +
ylab(y_txt) +
theme_bw() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
strip.text = element_text(size = 12),
panel.grid.major.y = element_line(size = 0.15),
panel.grid.major.x = element_line(color = "black", size = 0.15, linetype = "dotted"),
panel.grid.minor = element_line(size = 0.15),
axis.ticks = element_line(size = 0.15)
)
}))
# Extract legend
tmp_df <- data.frame(
var_nms = names(mlist[[1]]$finalModel$variable.importance),
imp = mlist[[1]]$finalModel$variable.importance
) %>%
# Add in a feature type column
dplyr::mutate(
var_type = sapply(1:nrow(.), function(y){
paste(
strsplit(as.character(.$var_nms[y]), split = "_")[[1]][1],
strsplit(as.character(.$var_nms[y]), split = "_")[[1]][2],
sep = "_"
)}),
var_type = ifelse(!var_type %in% types, "raw_params", var_type)
) %>%
dplyr::mutate(
var_type = factor(var_type, levels = c("SPCC_spec", "SPCC_ceps", "DTW_domf", "DTW_spen", "DTW_mult", "acs_params", "cep_coeff", "raw_params")),
var_nms = as.character(var_nms)
)
unique(tmp_df$var_type)
tmp_cols <- cols[grep(paste(levels(tmp_df$var_type), collapse = "|"), c(types, "raw_params"))]
gg_leg <- gtable::gtable_filter(ggplot_gtable(ggplot_build(
ggplot(data = tmp_df) +
geom_segment(aes(x = 0, xend = imp, y = forcats::fct_rev(var_nms), yend = forcats::fct_rev(var_nms), color = var_type), size = 2.5) +
scale_color_manual(values = cols) +
guides(color = guide_legend(title = "", nrow = 2)) +
theme(
legend.direction = "horizontal",
legend.title = element_text(size = 12),
legend.text = element_text(size = 12),
legend.position = "top",
legend.key = element_rect(fill = scales::alpha("white", 0)),
legend.margin = margin(0, 0, 0, 0),
legend.box.margin = margin(-10, -10, -10, -10)
)
)), "guide-box")
# grid::grid.draw(gg_leg)
dev.off()
jpeg(file.path(gpath, "AM10_VarImp.jpeg"), units = "in", width = 8, height = 6, res = 300)
ggarrange(
as.ggplot(gg_leg),
as.ggplot(
ggarrange(
as.ggplot(gg_list[[1]]),
as.ggplot(gg_list[[2]]),
nrow = 1,
widths = c(5, 5)
)
),
nrow = 2,
heights = c(1, 5)
)
dev.off()
Variable importance per model
The models shared some of the most important variables, but the RF model had more of the original spectral acoustic measurements among the top 30 most important features than the GBM model. Although GBM slightly outperformed RF in classification accuracy, I decided to proceed with RF for validation and prediction: this model retained more of the original spectral acoustic measurements among its top important variables, and I wanted to assess partial dependence on those measurements to take a closer look at structural differences in calls between ranges.
Proceeding with RF model validation. Here, we looked for warning signs of overfitting to the training data, specifically substantially lower validation accuracy compared to training accuracy.
# Initialize validation data, retain only columns needed for ML
validation_df <- sup_ML_df %>%
filter(set == "validation") %>%
dplyr::mutate(
range = factor(range) # Must be a factor for caret::train
) %>%
dplyr::select(-c(sound.files, set))
dim(validation_df)
## [1] 337 204
head(names(validation_df))
## [1] "range" "duration" "freq.Q75" "freq.IQR" "time.Q25" "kurt"
The RF model had 91.99% validation accuracy, similar to the cross-validated training accuracy. 20 invasive range and 7 native range calls were misclassified.
# Consider using probability of labels
valid_RF <- predict(model_RF, newdata = validation_df, type = "raw")
str(valid_RF)
## Factor w/ 2 levels "Invasive","Native": 2 2 2 2 2 1 1 2 2 2 ...
# View(valid_RF)
caret::confusionMatrix(valid_RF, validation_df$range)$table
## Reference
## Prediction Invasive Native
## Invasive 201 7
## Native 20 109
caret::confusionMatrix(valid_RF, validation_df$range)$overall[["Accuracy"]]
## [1] 0.9198813
Validation accuracy was high, so we moved on to prediction with the validated model.
Initialize prediction data.
sup_ML_df <- readRDS(file.path(path, "sup_ML_fin.RDS"))
dim(sup_ML_df)
## [1] 1561 206
# glimpse(sup_ML_df)
prediction_df <- sup_ML_df %>%
filter(set == "prediction") %>%
dplyr::mutate(
range = factor(range) # Must be a factor for caret::train
) %>%
dplyr::select(-c(sound.files, set))
dim(prediction_df)
## [1] 548 204
head(names(prediction_df))
## [1] "range" "duration" "freq.Q75" "freq.IQR" "time.Q25" "kurt"
Carried out prediction with the final RF model. Overall prediction accuracy was 87.59%. 46 invasive range calls and 22 native range calls were misclassified.
predict_RF <- predict(model_RF, newdata = prediction_df, type = "raw")
# str(predict_RF)
caret::confusionMatrix(predict_RF, prediction_df$range)$table
## Reference
## Prediction Invasive Native
## Invasive 272 22
## Native 46 208
caret::confusionMatrix(predict_RF, prediction_df$range)$overall[["Accuracy"]]
## [1] 0.8759124
# Also get the probability of each class from this prediction
predict_RF_prob <- predict(model_RF, newdata = prediction_df, type = "prob")
# str(predict_RF_prob)
# Order of classes predicted here
levels(predict_RF)
## [1] "Invasive" "Native"
# "Invasive" "Native"
saveRDS(predict_RF, file.path(path, "predict_RF.RDS"))
saveRDS(predict_RF_prob, file.path(path, "predict_RF_prob.RDS"))
Added prediction results to the prediction dataset.
predict_RF <- readRDS(file.path(path, "predict_RF.RDS"))
# glimpse(predict_RF)
predict_RF_prob <- readRDS(file.path(path, "predict_RF_prob.RDS"))
# glimpse(predict_RF_prob)
predict_res <- sup_ML_df %>%
filter(set == "prediction") %>%
dplyr::select(c(sound.files, range)) %>%
inner_join(
nat_inv_est %>%
as_tibble() %>%
dplyr::select(c(sound.files, range, year, site, site_year, invasive_city)),
by = c("sound.files", "range")
) %>%
dplyr::mutate(
predicted_class = as.character(predict_RF),
prob_native = predict_RF_prob$Native,
prob_invasive = predict_RF_prob$Invasive
) %>%
droplevels()
# glimpse(predict_res)
Get the RF proximity matrix for the prediction dataset.
rf_proxm <- edarf::extract_proximity(model_RF$finalModel, prediction_df)
str(rf_proxm)
# save output
saveRDS(rf_proxm, file.path(path, "rf_proxm_call_structure.RDS"))
Perform non-metric MDS (isoMDS) in 2 dimensions.
rf_proxm <- readRDS(file.path(path, "rf_proxm_call_structure.RDS"))
# head(rf_proxm)
# dim(rf_proxm)
iso <- invisible(MASS::isoMDS(stats::as.dist(1 - rf_proxm, diag = FALSE, upper = FALSE), k = 2, maxit = 1000, trace = FALSE))
# Join with metadata
mds_rf <- data.frame(X = iso$points[, 1]) %>%
dplyr::mutate(
Y = iso$points[, 2],
sound.files = predict_res$sound.files
) %>%
inner_join(
predict_res,
by = "sound.files"
) %>%
dplyr::mutate(
range = as.character(range),
range = factor(range, levels = c("Native", "Invasive")),
# Flip X coordinates so native range appears on the left side
X = -X,
Y = Y
)
# glimpse(mds_rf)
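The proximity-to-distance conversion and 2-dimensional embedding above can be illustrated with a toy proximity matrix (a hedged sketch; the real proximity matrix is 548 × 548, one row and column per prediction call):

```r
library(MASS)
set.seed(401)
# Toy symmetric proximity matrix: 1 on the diagonal, values in (0, 1) elsewhere
p <- matrix(runif(36, 0.1, 0.9), nrow = 6)
p <- (p + t(p)) / 2
diag(p) <- 1
# Convert proximities to dissimilarities and embed with non-metric MDS, as above
iso <- isoMDS(as.dist(1 - p), k = 2, maxit = 1000, trace = FALSE)
dim(iso$points) # one row per call, two MDS dimensions
```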
All sites used for RF prediction in RF acoustic space. This is a main figure, with a Gaussian kernel density estimator used to generate contours in acoustic space.
I relied on geom_density2d to visualize density contours. This function uses MASS::kde2d to apply a two-dimensional Gaussian density kernel to the MDS coordinates. For each kernel density estimator (KDE) per range, I set the bandwidth (argument h in geom_density2d), which controls the width of the KDE in both dimensions (x and y), to 0.5. After geom_density2d estimated the density, contours were drawn across bins of the density values. Here, I set the binwidth to 1/10th of the maximum density value per range, so that density values per range were split into 10 bins of even width. Note that I flipped the x-axis coordinates so that the native range contours appear on the left side of the graphic.
# levels(mds_rf$range)
cols <- scales::alpha(c("navy", "orange"), 0.85)
h <- 0.5 # bandwidth (KDE width)
# Get density values and bin them to check out bins used by geom_density2d
nat_dens <- kde2d(
x = mds_rf %>%
filter(range == "Native") %>%
pull(X),
y = mds_rf %>%
filter(range == "Native") %>%
pull(Y),
h = 0.5
)
# str(nat_dens)
inv_dens <- kde2d(
x = mds_rf %>%
filter(range == "Invasive") %>%
pull(X),
y = mds_rf %>%
filter(range == "Invasive") %>%
pull(Y),
h = 0.5
)
# str(inv_dens)
range(nat_dens)
## [1] -0.5670525 2.9509838
range(inv_dens)
## [1] -0.5687283 2.1119321
bw_nat <- max(nat_dens$z)/10
bw_inv <- max(inv_dens$z)/10
bw_nat
## [1] 0.2950984
bw_inv
## [1] 0.2111932
brks_nat <- seq(min(nat_dens$z), max(nat_dens$z), bw_nat)
brks_inv <- seq(min(inv_dens$z), max(inv_dens$z), bw_inv)
length(brks_nat) # 10 bins of width bw_nat for native range
## [1] 10
length(brks_inv) # 10 bins of width bw_inv for invasive range
## [1] 10
ggplot(data = mds_rf, aes(x = X, y = Y)) +
geom_density2d(
data = mds_rf %>%
filter(range == "Native") %>%
droplevels(),
size = 0.65,
binwidth = bw_nat,
h = h,
color = cols[1]
) +
geom_density2d(
data = mds_rf %>%
filter(range == "Invasive") %>%
droplevels(),
size = 0.65,
binwidth = bw_inv,
h = h,
color = cols[2]
) +
scale_color_manual(values = cols) +
xlab("Dimension 1") + ylab("Dimension 2") +
theme_bw() +
theme(
legend.position = "top",
axis.title = element_text(size = 16),
axis.text = element_text(size = 14),
legend.title = element_text(size = 16),
legend.text = element_text(size = 16),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_line(size = 0.35),
legend.margin = margin(0, 0, 0, 0),
legend.box.margin = margin(-5, -5, -5, -5),
legend.key.width = unit(3, "lines")
)
ggsave(file.path(gpath, "KernelDensity_RFAcousticSpace.tiff"), units = "in", width = 11, height = 3.06, dpi = 300) # Main manuscript
After visualizing invasive and native range calls in RF acoustic space, we took a closer look at call structure differentiation in different invasive range areas (finer spatial comparison between ranges), as well as over time. For these analyses, we used calls in the prediction dataset from sites that did not represent sampling over time, and Austin/New Orleans sites that represented invasive range sampling over time, respectively.
Which invasive range cities and sites had calls misclassified during prediction? The fact that we could build a model with high training and validation accuracy was a good indicator that there were indeed structural differences between ranges, but it was possible that some regions of the invasive range were more native range-like in call structure. If so, such calls should have been misclassified to the native range.
Checking out misclassification in the invasive range by city. While more calls were misclassified for Milford, CT, this city also had more representation in the prediction dataset (35 calls). Overall, we saw no clear pattern of misclassification back to the native range (e.g. structural change) by invasive city or region.
inv_spat_pred <- predict_res %>%
filter(range != predicted_class & range == "Invasive") %>%
# Filter by sites for spatial question
filter(site_year %in% unique(inv_spatial$site_year)) %>%
group_by(range, invasive_city) %>%
dplyr::summarise(
n_misclass = length(sound.files)
) %>%
# Add number of calls used for prediction
inner_join(
sup_ML_df %>%
filter(set == "prediction" & range == "Invasive") %>%
dplyr::select(c(sound.files, range)) %>%
inner_join(
nat_inv_est %>%
as_tibble() %>%
filter(range == "Invasive") %>%
# Filter by sites for spatial question
filter(site_year %in% unique(inv_spatial$site_year)) %>%
dplyr::select(c(sound.files, range, region, invasive_city)),
by = c("sound.files", "range")
) %>%
group_by(range, region, invasive_city) %>%
dplyr::summarise(
n_total = length(sound.files)
),
by = c("range", "invasive_city")
)
inv_spat_pred %>%
dplyr::select(c(range, region, invasive_city, n_misclass, n_total)) %>%
kable()
| range | region | invasive_city | n_misclass | n_total |
|---|---|---|---|---|
| Invasive | Southwest | Austin | 3 | 19 |
| Invasive | Southwest | Dallas | 1 | 3 |
| Invasive | Southeast | Miami | 2 | 13 |
| Invasive | Northeast | Milford | 5 | 35 |
| Invasive | Southeast | New Orleans | 1 | 7 |
| Invasive | Northeast | Stratford | 2 | 17 |
Fewer native range calls were misclassified. The western region had greater representation among the misclassified calls, but we also sampled more intensively in the west.
predict_res %>%
filter(range != predicted_class & range == "Native") %>%
group_by(range, site) %>%
dplyr::summarise(
n_misclass = length(sound.files)
) %>%
# Add number of calls used for prediction
inner_join(
sup_ML_df %>%
filter(set == "prediction" & range == "Native") %>%
dplyr::select(c(sound.files, range)) %>%
inner_join(
nat_inv_est %>%
as_tibble() %>%
filter(range == "Native") %>%
dplyr::select(c(sound.files, range, site, region)),
by = c("sound.files", "range")
) %>%
group_by(range, region, site) %>%
dplyr::summarise(
n_total = length(sound.files)
),
by = c("range", "site")
) %>%
dplyr::select(range, site, region, n_misclass, n_total) %>%
arrange(region) %>%
kable()
| range | site | region | n_misclass | n_total |
|---|---|---|---|---|
| Native | BCAR | Central | 1 | 5 |
| Native | FAGR | Central | 1 | 2 |
| Native | PEIX | Central | 1 | 6 |
| Native | PROO | Central | 1 | 4 |
| Native | OJOS | East | 2 | 7 |
| Native | PLVE | East | 1 | 3 |
| Native | VALI | East | 1 | 7 |
| Native | 1145 | West | 6 | 29 |
| Native | ARAZ | West | 2 | 5 |
| Native | CHAC | West | 1 | 6 |
| Native | EMBR | West | 3 | 11 |
| Native | ROSA | West | 2 | 5 |
Note that BCAR and FAGR were native range urban sites in Montevideo. Which sites had no misclassifications during prediction?
predict_res %>%
filter(range == predicted_class & range == "Native") %>%
group_by(range, site) %>%
dplyr::summarise(
n_corrclass = length(sound.files)
) %>%
# Add number of calls used for prediction
inner_join(
sup_ML_df %>%
filter(set == "prediction" & range == "Native") %>%
dplyr::select(c(sound.files, range)) %>%
inner_join(
nat_inv_est %>%
as_tibble() %>%
filter(range == "Native") %>%
dplyr::select(c(sound.files, range, site, region)),
by = c("sound.files", "range")
) %>%
group_by(range, region, site) %>%
dplyr::summarise(
n_total = length(sound.files)
),
by = c("range", "site")
) %>%
dplyr::select(range, site, region, n_corrclass, n_total) %>%
arrange(region) %>%
kable()
| range | site | region | n_corrclass | n_total |
|---|---|---|---|---|
| Native | BAGU | Central | 7 | 7 |
| Native | BCAR | Central | 4 | 5 |
| Native | CEME | Central | 2 | 2 |
| Native | FAGR | Central | 1 | 2 |
| Native | GOLF | Central | 7 | 7 |
| Native | INBR | Central | 6 | 6 |
| Native | PEIX | Central | 5 | 6 |
| Native | PROO | Central | 3 | 4 |
| Native | CISN | East | 9 | 9 |
| Native | ELTE | East | 7 | 7 |
| Native | HIPE | East | 2 | 2 |
| Native | OJOS | East | 5 | 7 |
| Native | PAVO | East | 9 | 9 |
| Native | PLVE | East | 2 | 3 |
| Native | QUEB | East | 5 | 5 |
| Native | SAUC | East | 2 | 2 |
| Native | VALI | East | 6 | 7 |
| Native | ARAP | Northwest | 4 | 4 |
| Native | 1145 | West | 23 | 29 |
| Native | ARAZ | West | 3 | 5 |
| Native | CHAC | West | 5 | 6 |
| Native | ECIL | West | 6 | 6 |
| Native | EMBR | West | 8 | 11 |
| Native | INES-01 | West | 4 | 4 |
| Native | INES-03 | West | 5 | 5 |
| Native | INES-04 | West | 3 | 3 |
| Native | INES-05 | West | 2 | 2 |
| Native | INES-06 | West | 2 | 2 |
| Native | INES-07 | West | 3 | 3 |
| Native | INES-08 | West | 9 | 9 |
| Native | KIYU | West | 3 | 3 |
| Native | LENA | West | 6 | 6 |
| Native | PFER | West | 18 | 18 |
| Native | PIED | West | 7 | 7 |
| Native | RIAC | West | 9 | 9 |
| Native | ROSA | West | 3 | 5 |
| Native | SEMI | West | 3 | 3 |
The other two native range urban sites, CEME and GOLF, had 100% correct classification back to the native range (2 and 7 calls, respectively).
Which invasive range cities and sites sampled over time had misclassified calls? Temporal change in acoustic structure could have influenced misclassification of calls. If invasive populations grew over time, birds would have to recognize more individuals, which could lead to invasive range calls becoming more native range-like.
inv_temp_pred <- predict_res %>%
filter(range == "Invasive") %>%
# Filter by sites used for the temporal questions
filter(site_year %in% unique(inv_temporal$site_year)) %>%
dplyr::mutate(
misclass = ifelse(range != predicted_class, "Y", "N")
) %>%
# Summarise number of calls correctly and incorrectly classified by site-year
group_by(invasive_city, site_year) %>%
dplyr::summarise(
n_misclass = length(which(misclass == "Y"))
) %>%
# Add number of calls used for prediction
inner_join(
sup_ML_df %>%
filter(set == "prediction" & range == "Invasive") %>%
dplyr::select(c(sound.files)) %>%
inner_join(
nat_inv_est %>%
filter(range == "Invasive") %>%
# Filter by sites used for the temporal questions
filter(site_year %in% unique(inv_temporal$site_year)) %>%
dplyr::select(c(sound.files, year, invasive_city, site, site_year)),
by = c("sound.files")
) %>%
group_by(year, invasive_city, site, site_year) %>%
dplyr::summarise(
n_total = length(sound.files)
),
by = c("invasive_city", "site_year")
) %>%
ungroup() %>%
dplyr::select(invasive_city, site, year, n_misclass, n_total) %>%
ungroup()
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'year', 'invasive_city', 'site'. You can override using the `.groups` argument.
inv_temp_pred %>%
kable()
| invasive_city | site | year | n_misclass | n_total |
|---|---|---|---|---|
| Austin | ELEM | 2011 | 4 | 8 |
| Austin | ELEM | 2019 | 2 | 20 |
| Austin | INTR | 2011 | 7 | 15 |
| Austin | INTR | 2019 | 1 | 20 |
| Austin | MART | 2011 | 0 | 14 |
| Austin | MART | 2019 | 5 | 20 |
| Austin | SOCC | 2004 | 0 | 20 |
| Austin | SOCC | 2019 | 1 | 20 |
| Austin | VALL | 2004 | 1 | 5 |
| Austin | VALL | 2011 | 3 | 10 |
| New Orleans | BALL | 2004 | 2 | 20 |
| New Orleans | CANA | 2004 | 1 | 13 |
| New Orleans | LAKE | 2011 | 1 | 5 |
| New Orleans | ROBE | 2011 | 4 | 20 |
Make a visual of Austin and New Orleans at the city-scale. 32 calls from the sites representing temporal sampling were misclassified, and the highest misclassification occurred for Austin 2011.
cols <- scales::alpha("orange", 0.85)
# Austin city scale
inv_temp_pred %>%
group_by(invasive_city, year) %>%
dplyr::summarise(n_calls = sum(n_misclass)) %>%
ggplot(aes(x = year, y = n_calls)) +
geom_col(color = scales::alpha("white", 0), fill = cols) +
xlab("Year") +
ylab("Number of misclassified calls") +
facet_wrap(~ invasive_city) +
scale_y_continuous(limits = c(0, 20)) +
theme_bw() +
ggtitle(paste(inv_temp_pred %>%
pull(n_misclass) %>%
sum(), "calls misclassified")) +
theme(
plot.title = element_text(size = 12, hjust = 0.5),
axis.title = element_text(size = 12),
axis.text = element_text(size = 12),
strip.text = element_text(size = 12),
panel.grid.major = element_line(size = 0.15),
panel.grid.minor = element_line(size = 0.15),
axis.ticks = element_line(size = 0.1)
)
## `summarise()` has grouped output by 'invasive_city'. You can override using the `.groups` argument.
Note that these differences in misclassification among years rely on few calls, so the evidence for change in misclassification over time is weak. In addition, these results do not match what we would have predicted for change over time. If populations grew or shrank, we would expect a directional increase or decrease in misclassification, which would have represented more or less structural change (calls becoming more or less native range-like over time).
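To make the small-sample caveat concrete, consider the site-year with the most misclassifications in 2019, MART (5 of 20 calls): the exact binomial confidence interval on its misclassification rate is very wide (a minimal base R sketch):

```r
# Exact 95% CI on the misclassification rate for 5 of 20 calls:
# the interval spans roughly 9% to 49%, so differences of a few
# calls between years are weak evidence of directional change
binom.test(5, 20)$conf.int
```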
We took a closer look at Austin to evaluate misclassification per site sampled over time, and again saw no clear patterns of temporal change.
# Austin site-year scale, drop BART because this was sampled at individual scale and included in prediction dataset only for spatial question
# Get sites
sites <- inv_temp_pred %>%
ungroup() %>%
filter(invasive_city == "Austin" & site != "BART") %>%
pull(site) %>%
unique()
sites
## [1] "ELEM" "INTR" "MART" "SOCC" "VALL"
# Make a new data frame that contains a row per site and year, even if calls were not misclassified for that year. So the x-axis should have all 3 years per panel
inv_temp_pred %>%
filter(site %in% sites) %>%
# Add a column with years in which each site was sampled
dplyr::mutate(
year_samp = sapply(1:nrow(.), function(x){
yrs <- unique(nat_inv_est$year[grep(.$site[x], nat_inv_est$site)])
ord <- order(yrs, decreasing = FALSE)
paste(site[x], paste(yrs[ord], collapse = ", "), sep = ":\n ")
}),
# Change factor levels of year so all 3 years are on x-axis
year = as.character(year),
year = factor(year, levels = c("2004", "2011", "2019"))
) %>%
ggplot(aes(x = year, y = n_misclass)) +
geom_col(color = scales::alpha("white", 0), fill = cols) +
xlab("Year") +
ylab("Number of misclassified calls") +
facet_grid(~ year_samp, drop = FALSE) +
scale_y_continuous(limits = c(0, 13)) +
theme_bw() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 12),
strip.text = element_text(size = 12),
panel.grid.major = element_line(size = 0.15),
panel.grid.minor = element_line(size = 0.15),
axis.ticks = element_line(size = 0.1)
)
Was misclassification of invasive range calls in the prediction dataset driven by SNR? If so, we would expect lower SNR calls to be misclassified. This was not the case. Instead, calls that were misclassified had higher mean SNR.
SNR_df <- predict_res %>%
filter(range == "Invasive") %>%
dplyr::mutate(
misclass = ifelse(range != predicted_class, "Yes", "No")
) %>%
# Join back to the EST to get SNR
inner_join(
nat_inv_est %>%
dplyr::select(sound.files, SNR),
by = "sound.files"
) %>%
# Remove calls that had missing SNR measurements
filter(!is.na(SNR)) %>%
droplevels() %>%
# Calculate mean and SE of SNR for calls that were misclassified or not
group_by(misclass) %>%
dplyr::summarise(
mean_SNR = mean(SNR),
se_SNR = std_err(SNR)
)
# glimpse(SNR_df)
SNR_df %>%
ggplot(aes(x = misclass, y = mean_SNR)) +
geom_errorbar(aes(ymin = mean_SNR - se_SNR, ymax = mean_SNR + se_SNR), size = 1, width = 0.1) +
geom_point(fill = "gray60", shape = 21, size = 4) +
scale_fill_manual(values = cols) +
xlab("Misclassified?") +
ylab("Mean +/- SE SNR") +
theme_bw()
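The std_err helper used above is not defined in this chunk; assuming it computes the standard error of the mean (as the "Mean +/- SE SNR" axis label suggests), a minimal definition would be:

```r
# Assumed definition of the std_err() helper: standard error of the mean
std_err <- function(x) sd(x) / sqrt(length(x))
std_err(c(2, 4, 6, 8)) # ~1.291
```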
Checked out random forests partial dependence plots for the original spectral acoustic measurements. Did any of these measurements show a change in assignment probability back to the native range for certain values? For instance, were shorter duration calls more likely to be assigned to the invasive range? Such patterns would point to structural differences in calls between ranges. The original acoustic measurements are easier to interpret than the derived features used for random forests: assignment probability back to the invasive range can be assessed across the values of a single measurement, rather than across values of a derived feature (which combines multiple original measurements).
model_RF <- readRDS(file.path(path, "model_RF.RDS"))
# model_RF
sup_ML_df <- readRDS(file.path(path, "sup_ML_fin.RDS"))
dim(sup_ML_df)
# glimpse(sup_ML_df)
prediction_df <- sup_ML_df %>%
filter(set == "prediction") %>%
dplyr::mutate(
range = factor(range) # Must be a factor for caret::train
) %>%
dplyr::select(-c(sound.files, set))
dim(prediction_df)
head(names(prediction_df))
preds <- names(prediction_df)[-grep("^range$|MDS|PCA", names(prediction_df))]
preds
# y-hat is the classification probability back to the invasive range
levels(prediction_df$range)
# "Invasive" "Native"
pdp_objs <- list()
for(i in preds){
pdp_objs[[i]] <- partial(model_RF, pred.var = i, progress = "text", plot = FALSE, prob = TRUE)
}
pdp_plots <- list()
for(i in preds){
pdp_plots[[i]] <- plotPartial(pdp_objs[[i]], alpha = 0.5, smooth = TRUE)
}
dev.off()
jpeg(file.path(gpath, "PDP_RF_01.jpeg"), units = "in", width = 6, height = 6, res = 150)
grid.draw(grid.arrange(grobs = list(pdp_plots[[1]], pdp_plots[[2]], pdp_plots[[3]], pdp_plots[[4]], pdp_plots[[5]], pdp_plots[[6]], pdp_plots[[7]], pdp_plots[[8]]), ncol = 2))
dev.off()
jpeg(file.path(gpath, "PDP_RF_02.jpeg"), units = "in", width = 6, height = 6, res = 150)
grid.draw(grid.arrange(grobs = list(pdp_plots[[9]], pdp_plots[[10]], pdp_plots[[11]], pdp_plots[[12]], pdp_plots[[13]], pdp_plots[[14]], pdp_plots[[15]]), ncol = 2)) # the original code repeated pdp_plots[[14]] here
dev.off()
Partial dependence plots, page 1
Partial dependence plots, page 2
These plots show some interesting trends across different predictors (e.g. modulation index), but note the narrow range of the y-axis in most panels, which indicates small overall changes in classification probability. In these plots, y-hat on the y-axis is the probability of classification back to the invasive range (i.e. lower values indicate a higher probability of classification back to the native range). The black line in each panel shows the unsmoothed relationship between classification to the invasive range and the distribution of the given parameter; the blue line is the Loess-smoothed fit.
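For intuition about what these y-hat values represent: partial dependence for a predictor is the model's predicted probability averaged over the data, with that predictor fixed at each grid value in turn. A minimal sketch of this averaging using a base R logistic model (the data and model here are illustrative stand-ins, not the random forests model or acoustic measurements used in the analysis):

```r
set.seed(401)

# Illustrative two-class data: one focal predictor (x1) plus a covariate (x2)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rbinom(200, 1, plogis(1.5 * d$x1 - 0.5 * d$x2))

fit <- glm(y ~ x1 + x2, data = d, family = binomial)

# Manual partial dependence: for each grid value, set the focal predictor
# to that value in every row, predict, then average the probabilities
pd_manual <- function(model, data, var, grid) {
  sapply(grid, function(g) {
    tmp <- data
    tmp[[var]] <- g
    mean(predict(model, newdata = tmp, type = "response"))
  })
}

grid <- seq(min(d$x1), max(d$x1), length.out = 20)
yhat <- pd_manual(fit, d, "x1", grid)

# yhat rises with x1 here, since x1 was simulated with a positive effect
head(cbind(grid, yhat))
```

This is conceptually what `pdp::partial()` computes above, with `prob = TRUE` returning probabilities for the first factor level ("Invasive").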
1. Smith-Vidaurre, G., Araya-Salas, M., and T.F. Wright. 2020. Individual signatures outweigh social group identity in contact calls of a communally nesting parrot. Behavioral Ecology 31(2), 448-458.
The session info printed here reflects the environment used to knit the final RMarkdown document. The software and package versions used for the main results are reported in the appendix of the associated article.
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] pracma_2.3.3 edarf_1.1.1 ggplotify_0.0.7 egg_0.4.5
## [5] gridExtra_2.3 scales_1.1.1 pdp_0.7.0 gbm_2.1.8
## [9] e1071_1.7-7 MLmetrics_1.1.1 corrplot_0.89 ranger_0.12.1
## [13] data.table_1.14.0 pbapply_1.4-3 MASS_7.3-54 caret_6.0-88
## [17] lattice_0.20-44 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.6
## [21] purrr_0.3.4 readr_1.4.0 tidyr_1.1.3 tibble_3.1.2
## [25] tidyverse_1.3.1 ggplot2_3.3.3 warbleR_1.1.27 NatureSounds_1.0.4
## [29] knitr_1.33 seewave_2.1.6 tuneR_1.3.3
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-1 rjson_0.2.20 ellipsis_0.3.2
## [4] class_7.3-19 fs_1.5.0 rstudioapi_0.13
## [7] proxy_0.4-25 farver_2.1.0 prodlim_2019.11.13
## [10] fansi_0.5.0 lubridate_1.7.10 xml2_1.3.2
## [13] codetools_0.2-18 splines_4.1.0 jsonlite_1.7.2
## [16] pROC_1.17.0.1 broom_0.7.6 dbplyr_2.1.1
## [19] BiocManager_1.30.15 compiler_4.1.0 httr_1.4.2
## [22] rvcheck_0.1.8 backports_1.2.1 assertthat_0.2.1
## [25] Matrix_1.3-4 cli_2.5.0 htmltools_0.5.1.1
## [28] tools_4.1.0 gtable_0.3.0 glue_1.4.2
## [31] reshape2_1.4.4 Rcpp_1.0.6 cellranger_1.1.0
## [34] vctrs_0.3.8 nlme_3.1-152 iterators_1.0.13
## [37] timeDate_3043.102 gower_0.2.2 xfun_0.23
## [40] rvest_1.0.0 lifecycle_1.0.0 ipred_0.9-11
## [43] hms_1.1.0 yaml_2.2.1 rpart_4.1-15
## [46] stringi_1.6.2 highr_0.9 foreach_1.5.1
## [49] checkmate_2.0.0 lava_1.6.9 mmpf_0.0.5
## [52] rlang_0.4.11 pkgconfig_2.0.3 dtw_1.22-3
## [55] bitops_1.0-7 evaluate_0.14 labeling_0.4.2
## [58] recipes_0.1.16 tidyselect_1.1.1 plyr_1.8.6
## [61] magrittr_2.0.1 R6_2.5.0 fftw_1.0-6
## [64] generics_0.1.0 DBI_1.1.1 pillar_1.6.1
## [67] haven_2.4.1 withr_2.4.2 survival_3.2-11
## [70] RCurl_1.98-1.3 nnet_7.3-16 modelr_0.1.8
## [73] crayon_1.4.1 utf8_1.2.1 rmarkdown_2.8
## [76] isoband_0.2.4 readxl_1.3.1 ModelMetrics_1.2.2.2
## [79] reprex_2.0.0 digest_0.6.27 gridGraphics_0.5-1
## [82] signal_0.7-7 stats4_4.1.0 munsell_0.5.0