We compared acoustic similarity of contact calls, as measured by SPCC and random forests, at the site scale between two species: monk parakeets (Myiopsitta monachus) and yellow-naped amazons (Amazona auropalliata)[1].
Our overall results can be reproduced using the data provided with this publication (extended selection tables for the individual and higher social scales, and selection table spreadsheets). Some changes to this code may be necessary to use signals saved in the extended selection tables. Users may observe slight differences in parameter calculations between signals taken from the extended selection tables and the values we report here (we performed analyses using signals within the original recordings), and should expect slight differences in results from some unsupervised or supervised machine learning methods, owing to the stochasticity associated with these methods.
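For example, the extended selection tables written out later in this workflow (the versions provided with this publication) can be read back in and passed directly to warbleR functions; a minimal sketch, assuming the .RDS files are in the current working directory:
# read the published extended selection tables (file names match those written out below)
yna_est <- readRDS("YNA_extended_selection_table.RDS")
mnk_indiv_est <- readRDS("MNK_indiv_extended_selection_table.RDS")
mnk_site_est <- readRDS("MNK_site_extended_selection_table.RDS")
# extended selection tables carry the clipped audio with them, so warbleR
# functions can be run without access to the original recordings, e.g.:
# warbleR::is_extended_selection_table(yna_est)
# warbleR::specreator(yna_est, wl = 378, flim = c(0, 4), ovlp = 90)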
We began by moving published yellow-naped amazon (YNA) calls into the same working directory. We then converted .aif files to .wav files with the following UNIX bash shell script, which calls sox to perform the file format conversion in the working directory:
# convert each .aif file in the working directory to .wav with SoX
aifs=`ls *.aif`
for aif in $aifs
do
# strip the .aif extension to build the output file name
nm=${aif%.aif}
oldf="$nm.aif"
newf="$nm.wav"
# convert .aif to .wav
sox $oldf $newf
done
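The same conversion can also be run from R by calling SoX through system(); a minimal sketch, assuming SoX is installed and on the system PATH, and that the .aif files are in the R working directory:
# convert .aif files to .wav by calling SoX from R
aifs <- list.files(pattern = "\\.aif$")
invisible(lapply(aifs, function(aif){
wav <- sub("\\.aif$", ".wav", aif)
system(paste("sox", shQuote(aif), shQuote(wav)))
}))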
rm(list = ls())
X <- c("warbleR", "Rraven", "ggplot2", "pbapply", "parallel", "data.table", "bibtex", "randomForest", "caret", "e1071", "ranger", "edarf", "mclust", "rgeos", "rgdal", "sp", "tidyverse")
invisible(lapply(X, library, character.only = TRUE))
# ggplot theme for plots in knitted output
theme_AcousticSpace <- function(){
theme(panel.background = element_rect(fill = "white"), plot.background = element_rect(fill = "white"),
panel.grid.major = element_line(size = 1, colour = "white"),
panel.grid.minor = element_line(size = 0.75, colour = "white"),
axis.line = element_line(size = 0.45, colour = "black"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12))
}
# theme for some manuscript figures
theme_AcousticSpaceSm <- function(){
theme(panel.background = element_rect(fill = "white"), plot.background = element_rect(fill = "white"),
panel.grid.major = element_line(size = 1, colour = "white"),
panel.grid.minor = element_line(size = 0.75, colour = "white"),
axis.line = element_line(size = 0.25, colour = "black"),
axis.title = element_text(size = 8),
axis.text = element_text(size = 6),
axis.ticks = element_line(size = 0.25))
}
Set path for validation with another parrot species.
path <- "/media/owner/MYIOPSITTA/R/Uruguay2017_MonkParakeet_CallAnalysis/Data/YNA_published_data"
Total YNA calls.
wavs <- list.files(path = path, pattern = ".wav$")
length(wavs[grep("YNA", wavs)]) # these include calls from multiple years
The YNA calls were cuts of original recordings and did not have associated selection tables, which are necessary to use warbleR for sound analysis. We created selection tables for the YNA calls and checked and standardized sound file parameters (sampling rate and bit depth).
sels <- pblapply(1:length(wavs), function(w){
tmp <- tuneR::readWave(file.path(path, wavs[w]))
species <- strsplit(wavs[w], split = "_")[[1]][1]
samp_rate <- tmp@samp.rate
bit_rate <- tmp@bit
start <- 0
end <- duration(tmp)
return(data.frame(species = species, sound.files = wavs[w], selec = 1, start = start, end = end, samp_rate = samp_rate, bit_rate = bit_rate))
})
sels <- rbindlist(sels)
# str(sels)
YNA calls were recorded at a 22.05 kHz sampling rate and 16-bit depth. We recorded monk parakeet (MNK) calls at a 44.1 kHz sampling rate and 16-bit depth, so we downsampled the MNK calls for this comparison.
table(sels$samp_rate, sels$bit_rate)
##
## 16
## 22050 2248
We performed visual quality control filtering for the YNA calls, as per the quality control filtering for MNK calls in Supplementary Methods 2. We did not filter YNA calls by SNR, since these clipped calls did not have sufficient space around each signal for measuring background noise.
We used the same spectrogram parameters here for YNA as for MNK calls (see Supplementary Methods 2). After generating spectrograms we deleted image files for calls that were visibly poor quality. We also determined a frequency bandpass of 0 - 4kHz for YNA to use in SPCC later.
warbleR::specreator(sels, wn = "hanning", wl = 378, collev = seq(-53, 0, 1), flim = c(0, 10), ovlp = 90, line = FALSE, mar = 0.01, res = 300, path = path)
Filter the poor quality calls from the selection table.
img_files <- list.files(path = path, pattern = ".jpeg$")
img_files
# fix image file names for the image files that remain (e.g. the visibly better quality calls)
img_files <- gsub("-1.jpeg", "", img_files)
img_files <- gsub(".wav.wav", ".wav", img_files)
# make a metadata column in selection table related to filtering
sels$Use_Status <- "Y"
# add in N (No) for the 57 calls with deleted image files
sels$Use_Status[grep(paste(wavs[!wavs %in% img_files], collapse = "|"), sels$sound.files)] <- "N"
# write out the poor quality sound file names for future reference
saveRDS(wavs[!wavs %in% img_files], file.path(path, "poor_quality_calls.RDS"))
# filter the YNA selection table
sels <- sels[sels$Use_Status == "Y"]
# how many calls remain?
nrow(sels)
We added blank space before and after each call to facilitate SPCC measurements (see the warbleR::xcorr documentation). Because many calls are separate .wav files clipped to the very start and end of each signal, xcorr can return NAs when it encounters non-existent space around a signal.
We added metadata to the YNA selection table to use for sound analysis and visualizations, and wrote out the selection table as both a .csv and an extended selection table.
sels$sound.files <- as.character(sels$sound.files)
# signal duration
max(wavdur(files = sels$sound.files[grep("YNA", sels$sound.files)], path = path)$duration)
# given these summary statistics, add 0.5 seconds of silence before and after each call
sil <- 0.5
# x <- 1 # testing
sels_list <- invisible(pblapply(1:nrow(sels), function(x){
# add silence before and after signal
tmp <- seewave::addsilw(readWave(file.path(path, sels$sound.files[x])), at = "start", d = sil, output = "Wave")
new_wav <- seewave::addsilw(tmp, at = "end", d = sil, output = "Wave")
# write out new .wav file
fn <- sels$sound.files[x]
fn <- paste(gsub(".wav", "", fn), "buffered.wav", sep = "_")
tuneR::writeWave(new_wav, file.path(path, fn), extensible = TRUE)
# change the sound file name, start and end coordinates per selection
sels$sound.files[x] <- fn
sels$start[x] <- sels$start[x] + sil
sels$end[x] <- sels$end[x] + sil
return(sels[x, ])
}))
sels_mod <- data.table::rbindlist(sels_list)
str(sels_mod)
# make metadata variables
# make a site variable
# x <- 1
site <- sapply(1:nrow(sels_mod), function(x){
spp <- strsplit(sels_mod$sound.files[x], split = "_")[[1]][1]
tmp <- strsplit(sels_mod$sound.files[x], split = "_")[[1]]
return(tmp[5])
})
unique(site)
# combine site and narrated bird ID to make a unique individual ID
# this ID will be used to run SPCC per individual per species
bird_ID <- sapply(1:nrow(sels_mod), function(x){
spp <- strsplit(sels_mod$sound.files[x], split = "_")[[1]][1]
tmp <- strsplit(sels_mod$sound.files[x], split = "_")[[1]]
return(paste(tmp[5], strsplit(tmp[6], split = "\\.")[[1]][1], sep = "-"))
})
unique(bird_ID)
# generate a unique call ID variable
call_ID <- sapply(1:nrow(sels_mod), function(x){
spp <- strsplit(sels_mod$sound.files[x], split = "_")[[1]][1]
tmp <- strsplit(sels_mod$sound.files[x], split = "_")[[1]]
return(strsplit(tmp[6], split = "\\.")[[1]][2])
})
unique(call_ID)
sels_mod2 <- data.frame(sels_mod, site = site, bird_ID = bird_ID, call_ID = call_ID)
# str(sels_mod2)
write.csv(sels_mod2, file.path(path, "Selected_YNA_calls_28Oct18.csv"), row.names = FALSE)
sels_table <- selection_table(sels_mod2, extended = TRUE, mar = 0.5, path = path)
saveRDS(sels_table, file.path(path, "YNA_extended_selection_table.RDS"))
Set path for MNK calls.
path <- "/media/owner/MYIOPSITTA/R/Uruguay2017_MonkParakeet_CallAnalysis/Data/Focal_Individuals"
Read in the repeatedly sampled individual data set.
mnk <- read.csv(file.path(path, "indiv_scale_calls_preprocessed.csv"), header = TRUE)
Downsample MNK repeatedly sampled individual calls. See Supplementary Methods 1 for summary statistics of these calls.
# clip the selected calls from sound files, move to a folder called 22050
cut_sels(mnk, mar = 0.01, parallel = 1, pb = TRUE, labels = c("sound.files", "selec"), overwrite = FALSE, path = path)
# downsample MNK call cuts to 22.05kHz to match YNA sampling rate
path <- "/media/owner/MYIOPSITTA/R/Uruguay2017_MonkParakeet_CallAnalysis/Data/YNA_published_data"
wavs <- list.files(path = file.path(path, "22050"), pattern = "wav$")
# if rerunning code after including site-level calls
wavs <- wavs[grep("INDIV", wavs)]
wavs <- wavs[-grep("buffered", wavs)]
wavs
# w <- 1
invisible(lapply(1:length(wavs), function(w){
tuneR::writeWave(tuneR::downsample(object = tuneR::readWave(file.path(path, wavs[w])), samp.rate = 22050), filename = file.path(path, wavs[w]))
}))
# do a quick test, sampling rate has been lowered to 22050, good to go
# readWave(file.path(path, wavs[1]))
# moved these files manually back to the main directory, as for the site-level calls below
path <- "/media/owner/MYIOPSITTA/R/Uruguay2017_MonkParakeet_CallAnalysis/Data/YNA_published_data"
# change sound file names accordingly
mnk$sound.files <- wavs
# make new temporal coordinates for selection table of clipped, downsampled MNK calls
sels <- pblapply(1:length(wavs), function(w){
tmp <- readWave(file.path(path, wavs[w]))
samp_rate <- tmp@samp.rate
start <- 0
end <- duration(tmp)
return(data.frame(sound.files = wavs[w], selec = 1, start = start, end = end, samp_rate = samp_rate))
})
sels <- rbindlist(sels)
# str(sels)
# replace original temporal coordinates with those for the downsampled calls
mnk$start <- sels$start
mnk$end <- sels$end
# add 0.5 seconds of silence before and after each call, to standardize these calls to the YNA workflow above
mnk$sound.files <- as.character(mnk$sound.files)
sil <- 0.5
# x <- 1 # testing
mnk_list <- invisible(pblapply(1:nrow(mnk), function(x){
# add silence before and after signal
tmp <- seewave::addsilw(readWave(file.path(path, mnk$sound.files[x])), at = "start", d = sil, output = "Wave")
new_wav <- seewave::addsilw(tmp, at = "end", d = 0.5, output = "Wave")
# write out new .wav file
fn <- mnk$sound.files[x]
fn <- paste(gsub(".wav", "", fn), "buffered.wav", sep = "_")
tuneR::writeWave(new_wav, file.path(path, fn), extensible = TRUE)
# change the sound file name, start and end coordinates per selection
mnk$sound.files[x] <- fn
mnk$start[x] <- mnk$start[x] + sil
mnk$end[x] <- mnk$end[x] + sil
return(mnk[x, ])
}))
mnk_mod <- data.table::rbindlist(mnk_list)
str(mnk_mod)
# write out revised selection table
write.csv(mnk_mod, file.path(path, "MNK_indiv_selection_table_22050_28Oct18.csv"), row.names = FALSE)
sels_table <- selection_table(mnk_mod, extended = TRUE, mar = 0.5, path = path)
saveRDS(sels_table, file.path(path, "MNK_indiv_extended_selection_table.RDS"))
Read in the higher social scales data set.
path <- "/media/owner/MYIOPSITTA/R/Uruguay2017_MonkParakeet_CallAnalysis/Data/Site_Recordings"
mnk_sites <- read.csv(file.path(path, "higher_social_scales_calls_preprocessed_final.csv"), header = TRUE)
Downsample MNK site calls. See Supplementary Methods 1 for summary statistics of these calls.
# exclude the 7 calls for which a margin of 0.01 seconds on either side of the call overlapped with a neighboring call
mnk_sites <- mnk_sites[-grep("^N$", mnk_sites$SNR_mar_works), ]
nrow(mnk_sites)
# clip the selected calls from sound files, move to the folder called 22050
cores <- parallel::detectCores() - 2 # set cores for parallel processing
cut_sels(mnk_sites, mar = 0.01, parallel = cores, pb = TRUE, labels = c("sound.files", "selec"), overwrite = FALSE, path = path)
# downsample MNK call cuts to 22.05kHz to match YNA sampling rate
path <- "/media/owner/MYIOPSITTA/R/Uruguay2017_MonkParakeet_CallAnalysis/Data/YNA_published_data/22050"
wavs <- list.files(path = path, pattern = "wav$")
wavs <- wavs[grep("^2017_", wavs)]
wavs <- wavs[-grep("INDIV", wavs)]
wavs <- wavs[-grep("buffered", wavs)]
length(wavs)
# w <- 1
invisible(lapply(1:length(wavs), function(w){
tuneR::writeWave(tuneR::downsample(object = tuneR::readWave(file.path(path, wavs[w])), samp.rate = 22050), filename = file.path(path, wavs[w]))
}))
# moved these files manually back to the main directory
path <- "/media/owner/MYIOPSITTA/R/Uruguay2017_MonkParakeet_CallAnalysis/Data/YNA_published_data"
# do a quick test, sampling rate has been lowered to 22050, good to go
readWave(file.path(path, wavs[1]))
# change sound file names accordingly
mnk_sites$sound.files <- wavs
# make new temporal coordinates for selection table of clipped, downsampled MNK calls
sels <- pblapply(1:length(wavs), function(w){
tmp <- readWave(file.path(path, wavs[w]))
samp_rate <- tmp@samp.rate
start <- 0
end <- duration(tmp)
return(data.frame(sound.files = wavs[w], selec = 1, start = start, end = end, samp_rate = samp_rate))
})
sels <- rbindlist(sels)
# unique(sels$samp_rate)
# str(sels)
# replace original temporal coordinates with those for the downsampled calls
mnk_sites$start <- sels$start
mnk_sites$end <- sels$end
str(mnk_sites)
# add 0.5 seconds of silence before and after each call, to standardize these calls to the YNA workflow above
mnk_sites$sound.files <- as.character(mnk_sites$sound.files)
sil <- 0.5
# x <- 1 # testing
mnk_sites_list <- invisible(pblapply(1:nrow(mnk_sites), function(x){
# add silence before and after signal
tmp <- seewave::addsilw(readWave(file.path(path, mnk_sites$sound.files[x])), at = "start", d = sil, output = "Wave")
new_wav <- seewave::addsilw(tmp, at = "end", d = 0.5, output = "Wave")
# write out new .wav file
fn <- mnk_sites$sound.files[x]
fn <- paste(gsub(".wav", "", fn), "buffered.wav", sep = "_")
tuneR::writeWave(new_wav, file.path(path, fn), extensible = TRUE)
# change the sound file name, start and end coordinates per selection
mnk_sites$sound.files[x] <- fn
mnk_sites$start[x] <- mnk_sites$start[x] + sil
mnk_sites$end[x] <- mnk_sites$end[x] + sil
return(mnk_sites[x, ])
}))
mnk_sites_mod <- data.table::rbindlist(mnk_sites_list)
str(mnk_sites_mod)
# write out revised selection table
write.csv(mnk_sites_mod, file.path(path, "MNK_sites_selection_table_22050_28Oct18.csv"), row.names = FALSE)
sels_table <- selection_table(mnk_sites_mod, extended = TRUE, mar = 0.5, path = path)
saveRDS(sels_table, file.path(path, "MNK_site_extended_selection_table.RDS"))
Read in selection tables for SPCC and random forests sound analysis.
path <- "/media/owner/MYIOPSITTA/R/Uruguay2017_MonkParakeet_CallAnalysis/Data/YNA_published_data"
# MNK individual scale
mnk_indiv <- read.csv(file.path(path, "MNK_indiv_selection_table_22050_28Oct18.csv"), header = TRUE)
# dim(mnk_indiv)
# MNK site scale
mnk_site <- read.csv(file.path(path, "MNK_sites_selection_table_22050_28Oct18.csv"), header = TRUE)
# dim(mnk_site)
# YNA
yna_sels <- read.csv(file.path(path, "Selected_YNA_calls_28Oct18.csv"), header = TRUE)
# subset the selection table by usable calls
yna_sels <- yna_sels[yna_sels$Use_Status == "Y", ]
yna <- droplevels(yna_sels[grep("YNA", yna_sels$sound.files), ])
# dim(yna_sels)
# nrow(yna)
# str(yna)
How many calls per individual for YNA?
# these were recorded during a single year, 1994
yna_indiv_tbl <- table(yna$bird_ID)
yna_indiv_tbl
##
## baju-1 baju-2 baju-3 baju-4 belen-1 belen-2 belen-3 cabu-1 cabu-2
## 10 10 10 10 10 8 10 10 3
## curu-1 curu-2 grand-1 grand-2 grand-3 horiz-1 horiz-2 horiz-3 horiz-4
## 10 6 10 9 6 10 9 7 7
## inoc-1 inoc-2 inoc-3 inoc-4 junq-1 junq-2 junq-3 junq-4 pelal-1
## 10 10 9 8 9 10 10 10 10
## pelal-2 pelal-3 pelal-4 penas-1 penas-2 penas-3 penas-4 pnar-1 pnar-2
## 10 10 9 8 9 9 10 10 9
## pnar-3 pnar-4 psapa-1 psapa-2 tarc-1 tarc-2 tarc-3 zapol-1 zapol-2
## 7 6 9 9 7 9 7 10 10
## zapol-3 zapol-4
## 9 9
range(yna_indiv_tbl)
## [1] 3 10
These calls were recorded over 14 sites.
# how many individuals?
length(yna_indiv_tbl)
## [1] 47
# mean calls per individual?
mean(yna_indiv_tbl)
## [1] 8.87234
# range of calls per individual
range(yna_indiv_tbl)
## [1] 3 10
# how many sites?
yna_sites <- sapply(1:length(yna_indiv_tbl), function(x){
strsplit(names(yna_indiv_tbl)[x], split = "-")[[1]][1]
})
length(unique(yna_sites))
## [1] 14
# how many individuals per site?
table(yna_sites)
## yna_sites
## baju belen cabu curu grand horiz inoc junq pelal penas pnar psapa
## 4 3 2 2 3 4 4 4 4 4 4 2
## tarc zapol
## 3 4
# total calls
sum(yna_indiv_tbl)
## [1] 417
We optimized spectrogram parameters per species prior to running SPCC. We visually evaluated the resulting spectrogram mosaic to choose the Fourier transform window type, window length and the partitioning of signal amplitude.
cores <- parallel::detectCores() - 2
# pick a random call to generate a spectrogram mosaic with varying parameters
# MNK
set.seed(401)
r <- sample(x = nrow(mnk_site), size = 1)
spec_param(mnk_site[r, ], length.out = 10, ovlp = 90, wl = c(200, 600), wn = c("hanning", "bartlett", "hamming"), collev.min = c(-40, -100), pal = "reverse.gray.colors.2", rm.axes = TRUE, nrow = 3, ncol = 3, img.suffix = "spec_param_mnk", flim = c(0.5, 9), mar = 0.03, cex = 0.45, path = path)
catalog2pdf(keep.img = FALSE, parallel = cores, by.img.suffix = TRUE, path = path)
# YNA
set.seed(401)
r <- sample(x = nrow(yna), size = 1)
spec_param(yna[r, ], length.out = 10, ovlp = 90, wl = c(200, 600), wn = c("hanning", "bartlett", "hamming"), collev.min = c(-40, -100), pal = "reverse.gray.colors.2", rm.axes = TRUE, nrow = 3, ncol = 3, img.suffix = "spec_param_yna", flim = c(0, 4), mar = 0.03, cex = 0.45, path = path)
catalog2pdf(keep.img = FALSE, parallel = cores, by.img.suffix = TRUE, path = path)
We decided to proceed with the following spectrogram parameters after assessing spectrogram mosaics:
MNK: window length = 289 (rounded down to 288 where an even window length was required), window type = hanning, collev.min = -40
YNA: window length = 378, window type = hanning, collev.min = -40
Set common arguments for acoustic and image measurements.
# Set cores for SPCC parallel processing if you have > 2 cores, otherwise set to 1 to avoid processing in parallel
cores <- parallel::detectCores() - 2
spp <- c("MNK_indiv", "MNK_site", "YNA")
sels_list <- list(mnk_indiv, mnk_site, yna)
wl <- c(288, 288, 378) # wl must be an even number
bp <- list(c(0.5, 9), c(0.5, 9), c(0, 4))
prefix <- c("xc_mat", "sp_ts_DTW", "df_ts_DTW", "mDTW")
type <- c("SPCC", "spDTW", "dfDTW", "multiDTW")
as_nms <- c("SPCC", "Spectral_entropy_DTW", "Dominant_freq_DTW", "Multivariate_DTW")
# set amplitude threshold for dfts; a threshold of 15 fails for YNA calls
th <- c(rep(15, 2), 10)
path <- "/media/owner/MYIOPSITTA/R/Uruguay2017_MonkParakeet_CallAnalysis/Data/YNA_published_data"
# x <- 1
invisible(pblapply(1:length(spp), function(x){
xc_mat_tmp <- warbleR::xcorr(sels_list[[x]], wl = wl[x], ovlp = 90, bp = bp[[x]], wn = "hanning", cor.method = "pearson", parallel = cores, na.rm = FALSE, frange = bp[[x]], cor.mat = TRUE, path = path)
# SPCC matrices will be used in random forests analysis, as well as a separate measurement of acoustic similarity
saveRDS(xc_mat_tmp, file.path(path, paste(paste("xc_mat", spp[x], sep = "_"), ".RDS", sep = "")))
}))
We used random forests to predict acoustic similarity of YNA calls. We carried out a similar workflow as in Supplementary Methods 2. We collected a broad set of acoustic similarity and acoustic structure measurements, including image parameters. We then extracted MDS/PCA and t-Distributed Stochastic Neighbor Embedding (t-SNE)[3;4] features from these measurements, and used these features to build random forests models per species. After model training, we used the best model per species to perform model validation and finally, to predict acoustic similarity of calls set aside for testing.
We had already collected SPCC acoustic similarity measurements above. We collected three additional acoustic similarity measurements based on dynamic time-warping (DTW): spectral entropy, dominant frequency and multivariate DTW, as per methods used in Supplementary Methods 2. We used the frequency bandpasses and the window lengths determined above per species.
invisible(pblapply(1:length(spp), function(x){
cat(paste("Species = ", spp[x], "\n", sep = ""))
####### Spectral entropy DTW #######
# extract spectral entropy across signals as a time series
sp_ts <- sp.en.ts(sels_list[[x]], wl = wl[x], length.out = 100, wn = "hanning", ovlp = 90, bp = bp[[x]], threshold = 15, img = TRUE, parallel = cores, img.suffix = "sp.en.ts", pb = TRUE, clip.edges = TRUE, leglab = "sp.en.ts", sp.en.range = bp[[x]], path = path)
# ensure that the selec column isn't included in dtw
sp_ts$selec <- as.character(sp_ts$selec)
# perform dynamic time warping (DTW) on spectral entropy time series
sp_ts_DTW <- dtw::dtwDist(sp_ts[, sapply(sp_ts, is.numeric)], sp_ts[, sapply(sp_ts, is.numeric)], window.type = "none", open.end = FALSE)
saveRDS(sp_ts_DTW, file.path(path, paste(paste("sp_ts_DTW", spp[x], sep = "_"), ".RDS", sep = "")))
####### Dominant frequency DTW #######
# extract dominant frequency as a time series
# some calls still have NA values at the end even with clip.edges, so generate a few extra df values
df_ts <- dfts(sels_list[[x]], wl = wl[x], wl.freq = wl[x], length.out = 115, wn = "hanning", ovlp = 90, bp = bp[[x]], threshold = th[x], img = TRUE, parallel = cores, img.suffix = "dfts", pb = TRUE, clip.edges = TRUE, path = path)
# clip down to 102 columns (2 metadata plus 100 time series) to remove NAs not removed by clip.edges
df_ts <- df_ts[, 1:102]
# checking for columns with NAs
# unlist(sapply(df_ts, function(x){
# which(is.na(x))
# }))
# ensure that the selec column isn't included in dtw
df_ts$selec <- as.character(df_ts$selec)
# perform DTW on dominant frequency time series
df_ts_DTW <- dtw::dtwDist(df_ts[, sapply(df_ts, is.numeric)], df_ts[, sapply(df_ts, is.numeric)], window.type = "none", open.end = FALSE)
saveRDS(df_ts_DTW, file.path(path, paste(paste("df_ts_DTW", spp[x], sep = "_"), ".RDS", sep = "")))
####### Multivariate DTW #######
# perform multivariate DTW on spectral entropy and dominant frequency time series
# check input dimensions for time series data
dim(sp_ts[, sapply(sp_ts, is.numeric)])
dim(df_ts[, sapply(df_ts, is.numeric)])
mDTW <- warbleR::multi_DTW(ts.df1 = sp_ts, ts.df2 = df_ts, parallel = cores, window.type = "none", open.end = FALSE, scale = TRUE, dist.mat = TRUE, path = path)
saveRDS(mDTW, file.path(path, paste(paste("mDTW", spp[x], sep = "_"), ".RDS", sep = "")))
}))
We collected other quantitative measurements that characterized contact call structure (measurements across the time, frequency and amplitude domains, including measurements on Mel-frequency cepstral coefficients), as per Supplementary Methods 2. We removed all estimates of fundamental frequency for YNA calls for consistency with MNK sound analysis in Supplementary Methods 2.
invisible(pblapply(1:length(spp), function(x){
# calculate acoustic parameters across calls, excluding measurements related to harmonicity, e.g. fundamental frequency
# estimates of the fundamental frequency are not accurate, given previous exploratory analysis of acoustic parameters
sp_tmp <- warbleR::specan(sels_list[[x]], bp = bp[[x]], wl = wl[x], wl.freq = wl[x], ovlp = 90, fast = FALSE, parallel = cores, wn = "hanning", harmonicity = FALSE, path = path)
# make sure that the selections column is non-numeric and non-integer
sp_tmp$selec <- as.character(sp_tmp$selec)
# write out acoustic parameters
write.csv(sp_tmp, file.path(path, paste(paste("acoustic_parameters_specan", spp[x], sep = "_"), ".csv", sep = "")), row.names = FALSE)
# we also collected acoustic measurements on Mel-frequency cepstral coefficients
# we used the same numcep and nbands used for monk parakeet calls
cep_tmp <- warbleR::mfcc_stats(sels_list[[x]], ovlp = 90, wl = wl[x], bp = bp[[x]], numcep = 12, nbands = 40, parallel = cores, path = path)
cep_tmp$selec <- as.character(cep_tmp$selec)
# write out cepstral acoustic parameters
write.csv(cep_tmp, file.path(path, paste(paste("acoustic_parameters_cepstral", spp[x], sep = "_"), ".csv", sep = "")), row.names = FALSE)
}))
We also used image processing software to derive image parameters that characterized acoustic structure. We ran the image-processing software WND-CHRM [5] outside of R to measure several thousand image parameters across spectrograms. First, we made spectrograms for WND-CHRM to process.
# make spectrograms without labels (including axes) for image feature extraction by WND-CHRM, which takes only tiff or ppm files
# manually moved image files into ~/img_processing subfolders per species
invisible(pblapply(1:length(spp), function(x){
warbleR::specreator(sels_list[[x]], wn = "hanning", wl = wl[x], collev = seq(-40, 0, 1), flim = bp[[x]], ovlp = 90, line = FALSE, mar = 0.01, title = FALSE, res = 200, axisX = FALSE, axisY = FALSE, inner.mar = c(0,0,0,0), it = "tiff", parallel = cores, path = path)
}))
We ran WND-CHRM in a working directory that contained the spectrograms generated above, by executing the following code in a UNIX Bash shell: wndchrm train -ml ./ feature_large_file.fit. The resulting data set included several thousand image parameters: WND-CHARM originally included 1025 total image measurements[5], but the software has since been updated to include more parameters, and we generated 2919 total variables using the -ml flag.
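For reference, this call can also be issued from R for each species subfolder; a sketch, assuming wndchrm is installed and on the system PATH and that the spectrograms were moved into the per-species img_processing subfolders used below (the output .fit names here match those read in the next chunk):
# run wndchrm per species subfolder from R (assumes wndchrm is on the system PATH)
invisible(lapply(spp, function(s){
img_dir <- file.path(path, paste("img_processing", tolower(s), sep = "_"))
fit_nm <- paste("feature_large_file_", s, ".fit", sep = "")
system(paste("cd", shQuote(img_dir), "&& wndchrm train -ml ./", fit_nm))
}))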
We read in the resulting .fit file for preprocessing prior to feature extraction.
invisible(pblapply(1:length(spp), function(x){
path <- file.path(path, paste("img_processing", tolower(spp[x]), sep = "_"))
# read in the resulting .fit file
# the first 3 lines are a header
fitf_tmp <- readLines(file.path(path, paste(paste("feature_large_file", spp[x], sep = "_"), ".fit", sep = "")))[-c(1:3)]
# str(fitf_tmp)
# column names
length(grep("\\[", fitf_tmp)) # 2919 or 1059 if large or small parameter run
cnms <- fitf_tmp[grep("\\[", fitf_tmp)]
# str(cnms)
# fix column names
cnms <- gsub(" ", "", cnms, fixed = TRUE)
cnms <- gsub(")", "", cnms, fixed = TRUE)
cnms <- gsub("(", "", cnms, fixed = TRUE)
cnms <- gsub("[", "_", cnms, fixed = TRUE)
cnms <- gsub("]", "", cnms, fixed = TRUE)
# cnms[1:10]
# generate row names (spectrogram image file names)
# should have the same number of rows as calls
# length(grep(".tiff$", readLines(file.path(path, paste(paste("feature_large_file", tolower(spp[x]), sep = "_"), ".fit", sep = "")))))
rnms <- fitf_tmp[grep(".tiff$", fitf_tmp)]
# x <- 1
rnms <- sapply(1:length(rnms), function(x){
gsub(".tiff", "", strsplit(rnms[x], split = "./")[[1]][2])
})
# str(rnms)
# extract data, has one extra line
# length(grep("\\[|.tiff$", fitf_tmp, invert = TRUE))
dats <- fitf_tmp[grep("\\[|.tiff$", fitf_tmp, invert = TRUE)]
# remove any lines that are just a space
dats <- dats[!dats == ""]
# str(dats)
# split the values by space
# x <- 1
dats <- sapply(1:length(dats), function(x){
as.numeric(strsplit(dats[x], split = " ")[[1]])
})
# str(dats)
# there are 2920 data points but only 2919 variable names
# recreate the fit file information as a data frame
fit_df <- data.frame(call_ID = rnms, t(dats))
names(fit_df) <- c("call_ID", cnms)
# str(fit_df)
# the very last value (which is a 0) does not exist in the original .sig file, so remove this column from fit_df to obtain the original number of parameters
fit_df <- fit_df[, -2921]
# str(fit_df)
# dim(fit_df)
# names(fit_df)[1:10]
# write out image parameters for later
path <- "/media/owner/MYIOPSITTA/R/Uruguay2017_MonkParakeet_CallAnalysis/Data/YNA_published_data"
write.csv(fit_df, file.path(path, paste(paste("img_parameters_large", spp[x], sep = "_"), ".csv", sep = "")), row.names = FALSE)
}))
We extracted acoustic and image features from the acoustic similarity and structure measurements generated above, as per the monk parakeet calls in Supplementary Methods 2: Sound Analysis. We used Multidimensional Scaling (MDS) as the feature extraction method for matrices of pairwise acoustic similarity, and Principal Components Analysis (PCA) for all other acoustic and image parameters.
Read back in acoustic similarity measurements.
# read in data sets as distance objects in a loop
asim_list <- invisible(pblapply(1:length(spp), function(x){
lapply(1:length(type), function(y){
# convert SPCC correlation (e.g. proximity) values to distance values
if(type[y] == "SPCC"){
tmp <- stats::as.dist(readRDS(file.path(path, paste(paste(prefix[y], spp[x], sep = "_"), ".RDS", sep = ""))))
return(1 - tmp)
# DTW matrices already contain distance values
} else if(type[y] != "SPCC"){
stats::as.dist(readRDS(file.path(path, paste(paste(prefix[y], spp[x], sep = "_"), ".RDS", sep = ""))))
}
})
}))
# str(asim_list)
# add names to list elements for easier parsing
names(asim_list) <- spp
asim_list <- lapply(asim_list, function(el){
names(el) <- type
el
})
We began with MDS feature extraction. We worked out a routine to choose an adequate MDS method and number of dimensions per parameter set by optimizing MDS stress.
dims <- seq(2, 25, 1)
stress <- invisible(pblapply(1:length(spp), function(x){
stress <- lapply(1:length(type), function(y){
stress <- lapply(1:length(dims), function(z){
# these MDS methods calculate stress differently
# sammon MDS documentation states it is nonmetric but other sources indicate it is in fact metric
iso <- invisible(MASS::isoMDS(asim_list[[x]][[y]], k = dims[z], maxit = 1000, trace = FALSE))
sam <- invisible(MASS::sammon(asim_list[[x]][[y]], k = dims[z], niter = 1000, trace = FALSE))
# isoMDS returns stress as a percentage
return(data.frame(k = dims[z], stress = c(iso$stress/100, sam$stress), mds_type = c("isoMDS", "sammon"), species = spp[x], type = type[[y]]))
})
return(rbindlist(stress))
})
return(rbindlist(stress))
}))
stress <- rbindlist(stress)
stress <- data.frame(stress)
str(stress)
saveRDS(stress, file.path(path, "MDS_stress_2spp.RDS"))
stress <- readRDS(file.path(path, "MDS_stress_2spp.RDS"))
We chose an MDS procedure by comparing MDS stress. Lower stress is preferable, and we used a stress value of 0.05 as a general rule of thumb for low stress. isoMDS minimized stress better than sammon MDS (stress decreased more evenly over dimensions). We chose the number of dimensions that best minimized isoMDS stress per species.
# isoMDS is a better MDS method across feature data
# jpeg("MDS_stress.jpeg", units = "mm", width = 180, height = 120, res = 300)
ggplot(stress, aes(x = k, y = stress)) +
geom_point(aes(fill = mds_type, color = mds_type), shape = 21, size = 1) +
facet_grid(rows = vars(type), cols = vars(species)) +
geom_line(aes(group = mds_type, color = mds_type)) +
geom_hline(aes(yintercept = 0.05), linetype = "dashed", color = "black", size = 0.25) +
xlab("MDS Dimensions") + ylab("Stress") +
scale_x_continuous(breaks = seq(min(dims), max(dims), 5), labels = seq(min(dims), max(dims), 5)) +
scale_y_continuous(breaks = seq(0, 0.4, 0.05), labels = seq(0, 0.4, 0.05)) +
theme(panel.background = element_rect(fill = "white"), plot.background = element_rect(fill = "white"),
panel.grid.major = element_line(size = 1, colour = "white"),
panel.grid.minor = element_line(size = 0.75, colour = "white"),
axis.line = element_line(size = 0.45, colour = "black"),
axis.title = element_text(size = 10),
axis.text = element_text(size = 8), strip.text.x = element_text(size = 8), legend.position = "top")
isoMDS performed better than Sammon's method. We identified the optimal number of isoMDS dimensions per species and acoustic similarity method.
stress <- stress[grep("isoMDS", stress$mds_type), ]
tapply(stress$stress, list(stress$species, stress$type), function(x){
dims[which.min(abs(x - 0.05))]
})
## SPCC spDTW dfDTW multiDTW
## MNK_indiv 13 13 17 23
## MNK_site 25 22 25 25
## YNA 24 17 25 25
We ground-truthed the isoMDS optimization by plotting the optimal isoMDS features per acoustic similarity method. We prepared visualizations for feature extraction using primarily individual identity, and then site identity, to examine how well each method grouped calls by individual and/or site. We used the focal individual data set only for MNK. For MNK, we did not expect to see much grouping by site, but rather overdispersion within sites, and better clustering by individual identity. For YNA, we expected to see calls cluster by individual, site and dialect identity (we picked sites representing different dialects).
# randomly select 8 individuals over 2 sites per species for visualization
# ensure that each individual chosen has at least 3 calls
######## MNK individuals ########
# we used all 8 focal individuals for MNK
unique(mnk_indiv$site)
## [1] EMBR 1145 CHAC
## Levels: 1145 CHAC EMBR
table(mnk_indiv$indiv)
##
## AAT RAW UM1 UM2 UM3 UM4 UM5 ZW8
## 12 4 25 23 5 13 7 8
# add bird_ID to mnk
mnk_indiv$bird_ID <- paste(mnk_indiv$site, mnk_indiv$indiv, sep = "-")
mnk_indivs <- unique(as.character(mnk_indiv$bird_ID))
######## MNK sites ########
unique(mnk_site$General_site)
## [1] ARAP PLVE PFER-01 PFER-03 RIAC-01 RIAC-02 INES-01 INES-03
## [9] INES-04 INES-06 INES-07 INES-08 INES-05 EMBR SEMI 1145
## [17] ROSA ECIL CHAC INBR FAGR HIPE SAUC CISN
## [25] ELTE QUEB PROO PEIX BAGU PAVO CEME BCAR
## [33] LENA PIED ARAZ KIYU OJOS VALI GOLF
## 39 Levels: 1145 ARAP ARAZ BAGU BCAR CEME CHAC CISN ECIL ELTE EMBR ... VALI
# add a site column and bird_ID column to the full mnk_site metadata
names(mnk_site)[grep("General_site", names(mnk_site))] <- "site"
# x <- 1
mnk_site$bird_ID <- unlist(lapply(1:length(unique(mnk_site$site)), function(x){
tmp_site <- mnk_site$site[grep(unique(mnk_site$site)[x], mnk_site$site)]
paste(tmp_site, seq(1, length(tmp_site), 1), sep = "-")
}))
# filter by 4 individuals in each of 2 sites used for MNK site visualizations in Supplementary Methods 2
tmp <- mnk_site[grep("PIED|OJOS", mnk_site$site), ]
# checking
table(tmp$bird_ID)
##
## OJOS-1 OJOS-10 OJOS-11 OJOS-12 OJOS-13 OJOS-14 OJOS-15 OJOS-16 OJOS-17
## 1 1 1 1 1 1 1 1 1
## OJOS-18 OJOS-19 OJOS-2 OJOS-20 OJOS-21 OJOS-22 OJOS-23 OJOS-3 OJOS-4
## 1 1 1 1 1 1 1 1 1
## OJOS-5 OJOS-6 OJOS-7 OJOS-8 OJOS-9 PIED-1 PIED-10 PIED-11 PIED-12
## 1 1 1 1 1 1 1 1 1
## PIED-13 PIED-14 PIED-15 PIED-16 PIED-17 PIED-18 PIED-19 PIED-2 PIED-20
## 1 1 1 1 1 1 1 1 1
## PIED-3 PIED-4 PIED-5 PIED-6 PIED-7 PIED-8 PIED-9
## 1 1 1 1 1 1 1
set.seed(401)
r <- sample(1:length(unique(tmp$bird_ID)), size = 8, replace = FALSE)
mnk_site_indivs <- unique(as.character(tmp$bird_ID))[r]
######## YNA ########
# ensure the sites picked for YNA represent different dialects
unique(yna$site)
## [1] penas belen cabu horiz inoc junq pelal pnar baju curu grand
## [12] psapa tarc zapol
## 14 Levels: baju belen cabu curu grand horiz inoc junq pelal penas ... zapol
tmp <- yna[grep("inoc|baju", yna$site), ]
tmp <- droplevels(tmp[grep(paste(paste("^", names(which(table(tmp$bird_ID) > 2)), "$", sep = ""), collapse = "|"), tmp$bird_ID), ])
# checking
table(tmp$bird_ID)
##
## baju-1 baju-2 baju-3 baju-4 inoc-1 inoc-2 inoc-3 inoc-4
## 10 10 10 10 10 10 9 8
set.seed(401)
r <- sample(1:length(unique(tmp$bird_ID)), size = 8, replace = FALSE)
yna_indivs <- unique(as.character(tmp$bird_ID))[r]
# collate individuals
indivs <- list(mnk_indivs, mnk_site_indivs, yna_indivs)
names(indivs) <- c("MNK_indiv", "MNK_site", "YNA")
saveRDS(indivs, file.path(path, "indivs_plotting.RDS"))
# isoMDS dimensions by species and acoustic similarity method
dims <- list(c(13, 13, 17, 23), c(25, 22, 25, 25), c(24, 17, 25, 25))
# metadata
dfs <- list(mnk_indiv, mnk_site, yna)
names(dfs) <- spp
# read in individuals to be used for plotting
indivs <- readRDS(file.path(path, "indivs_plotting.RDS"))
# x <- 1 # testing
# y <- 1
isoMDS_RFfeats <- invisible(pblapply(1:length(spp), function(x){
tmp <- lapply(1:length(as_nms), function(y){
dats <- readRDS(file.path(path, paste(paste(prefix[y], spp[x], sep = "_"), ".RDS", sep = "")))
if(grepl("SPCC", as_nms[y])){
dats <- stats::as.dist(1 - dats, diag = TRUE, upper = FALSE)
} else if(!grepl("SPCC", as_nms[y])){
dats <- stats::as.dist(dats, diag = TRUE, upper = FALSE)
}
iso <- invisible(MASS::isoMDS(dats, k = dims[[x]][y], maxit = 1000, trace = TRUE))
return(iso$points)
})
names(tmp) <- type
return(tmp)
}))
names(isoMDS_RFfeats) <- spp
str(isoMDS_RFfeats)
# save the list as an RDS object
saveRDS(isoMDS_RFfeats, file.path(path, "RFfeats_isoMDS_2spp.RDS"))
isoMDS_RFfeats <- readRDS(file.path(path, "RFfeats_isoMDS_2spp.RDS"))
dims <- list(c(13, 13, 17, 23), c(25, 22, 25, 25), c(24, 17, 25, 25))
# metadata
dfs <- list(mnk_indiv, mnk_site, yna)
names(dfs) <- spp
# read in individuals to be used for plotting
indivs <- readRDS(file.path(path, "indivs_plotting.RDS"))
invisible(pblapply(1:length(spp), function(x){
lapply(1:length(as_nms), function(y){
iso <- isoMDS_RFfeats[[x]][[y]]
# filter metadata for plotting
metadats <- dfs[[grep(spp[x], names(dfs), ignore.case = TRUE)]]
tmp_indivs <- indivs[[grep(spp[x], names(indivs), ignore.case = TRUE)]]
# create a data frame with MDS embeddings for plotting
df <- data.frame(X = iso[, 1], Y = iso[, 2], indiv = as.character(metadats$bird_ID), site = as.character(metadats$site))
# str(df)
# filter by individuals
df <- droplevels(df[grep(paste(paste("^", tmp_indivs, "$", sep = ""), collapse = "|"), df$indiv), ])
# aesthetics by individual
sts <- unique(as.character(df$site))
ins <- unique(as.character(df$indiv))
cols <- c("#F8766D", "#00BFC4", grey.colors(12)[3])
plot_cols <- unlist(lapply(1:length(sts), function(s){
rep(cols[s], length(ins[grep(sts[s], ins)]))
}))
# if MNK_site
if(grepl("site", spp[x])){
# convex hull polygons by site
hulls <- plyr::ddply(df, "site", function(z){
z[chull(z$X, z$Y), ]
})
} else if(!grepl("site", spp[x])){
# convex hull polygons by individual
hulls <- plyr::ddply(df, "indiv", function(z){
z[chull(z$X, z$Y), ]
})
}
# make graphic
gg <- ggplot(df, aes(x = X, y = Y))
if(grepl("site", spp[x])){
gg <- gg + geom_point(aes(color = site, fill = site, shape = site), size = 4) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = site, color = site), alpha = 0.2)
} else if(!grepl("site", spp[x])){
gg <- gg + geom_point(aes(color = indiv, fill = indiv, shape = indiv), size = 4) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = indiv, color = indiv), alpha = 0.2) +
scale_colour_manual(values = plot_cols) + scale_fill_manual(values = plot_cols) +
scale_shape_manual(values = c(0, 1, 2, 4, 21, 22, 8, 24))
}
gg <- gg + xlab("MDS Dimension 1") + ylab("MDS Dimension 2") +
theme_AcousticSpace() +
ggtitle(paste(paste(spp[x], as_nms[y], sep = " - "), "; ", dims[[x]][y], " dims, 1000 reps", sep = ""))
print(gg)
})
}))
The isoMDS feature optimization looks good.
We used PCA to extract features from the acoustic and image measurement data sets that characterized acoustic structure.
# new parameters
prefix <- c("acoustic_parameters_specan", "acoustic_parameters_cepstral", "img_parameters_large")
pca_nms <- c("Specan_parameters", "Mel-frequency_cepstral_parameters", "Image_parameters")
type <- c("asp_params", "cep_params", "img_params")
# metadata
dfs <- list(mnk_indiv, mnk_site, yna)
names(dfs) <- spp
# read in individuals to be used for plotting
indivs <- readRDS(file.path(path, "indivs_plotting.RDS"))
# x <- 2 # testing
# y <- 1
PCA_RFfeats <- invisible(pblapply(1:length(spp), function(x){
tmp <- lapply(1:length(pca_nms), function(y){
dats <- read.csv(file.path(path, paste(paste(prefix[y], spp[x], sep = "_"), ".csv", sep = "")), header = TRUE)
# make sure selection IDs will not be confused for numeric variables
if(!grepl("Image", pca_nms[y])) dats$selec <- as.character(dats$selec)
# some image parameters have NAs that cause prcomp to fail, find and remove these
if(grepl("Image", pca_nms[y])){
nas <- sapply(1:nrow(dats), function(x){
which(is.na(dats[x, ]))
})
nas <- unique(unlist(nas)) # 3 variables
dats <- dats[, -nas]
dim(dats)
}
pp_list <- caret::preProcess(dats, method = c("YeoJohnson", "zv", "nzv", "center", "scale"), thresh = 0.98, freqCut = 98/2, uniqueCut = 2)
pp_sp <- predict(pp_list, dats)
# perform PCA with base package for easier access of summary and loadings
pp_sp_pca <- stats::prcomp(pp_sp[, sapply(pp_sp, is.numeric)], center = FALSE)
# str(pp_sp_pca)
# importance of PCs, the first two explain about 47% of variance
# summary(pp_sp_pca)
# PCA loadings across original variables
# pp_sp_pca$rotation
# return the whole pca object to examine summary and loadings later as needed
return(pp_sp_pca)
})
names(tmp) <- type
return(tmp)
}))
names(PCA_RFfeats) <- spp
str(PCA_RFfeats)
# save the list as an RDS object
saveRDS(PCA_RFfeats, file.path(path, "PCA_RFfeats_2spp.RDS"))
PCA_RFfeats <- readRDS(file.path(path, "PCA_RFfeats_2spp.RDS"))
invisible(pblapply(1:length(spp), function(x){
lapply(1:length(pca_nms), function(y){
pp_pca <- PCA_RFfeats[[x]][[y]]
# filter metadata for plotting
metadats <- dfs[[grep(spp[x], names(dfs), ignore.case = TRUE)]]
tmp_indivs <- indivs[[grep(spp[x], names(indivs), ignore.case = TRUE)]]
# create a data frame with PCA embeddings for plotting
df <- data.frame(X = pp_pca$x[, 1], Y = pp_pca$x[, 2], indiv = as.character(metadats$bird_ID), site = as.character(metadats$site))
# str(df)
# filter by individuals
df <- droplevels(df[grep(paste(paste("^", tmp_indivs, "$", sep = ""), collapse = "|"), df$indiv), ])
# aesthetics by individual
sts <- unique(as.character(df$site))
ins <- unique(as.character(df$indiv))
cols <- c("#F8766D", "#00BFC4", grey.colors(12)[3])
plot_cols <- unlist(lapply(1:length(sts), function(s){
rep(cols[s], length(ins[grep(sts[s], ins)]))
}))
# if MNK_site
if(grepl("site", spp[x])){
# convex hull polygons by site
hulls <- plyr::ddply(df, "site", function(z){
z[chull(z$X, z$Y), ]
})
} else if(!grepl("site", spp[x])){
# convex hull polygons by individual
hulls <- plyr::ddply(df, "indiv", function(z){
z[chull(z$X, z$Y), ]
})
}
# make graphic
gg <- ggplot(df, aes(x = X, y = Y))
if(grepl("site", spp[x])){
gg <- gg + geom_point(aes(color = site, fill = site, shape = site), size = 4) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = site, color = site), alpha = 0.2)
} else if(!grepl("site", spp[x])){
gg <- gg + geom_point(aes(color = indiv, fill = indiv, shape = indiv), size = 4) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = indiv, color = indiv), alpha = 0.2) +
scale_colour_manual(values = plot_cols) + scale_fill_manual(values = plot_cols) +
scale_shape_manual(values = c(0, 1, 2, 4, 21, 22, 8, 24))
}
gg <- gg + xlab("PCA Dimension 1") + ylab("PCA Dimension 2") +
theme_AcousticSpace() +
ggtitle(paste(spp[x], pca_nms[y], sep = " - "))
print(gg)
})
}))
The PCA features also look good. MNK individuals show more separation by Mel-frequency cepstral coefficients. There is considerable separation among PIED and OJOS calls for some parameters, although these plots used only a subset of calls per site. YNA calls grouped by sites (representative of dialects).
How many Principal Components were extracted per parameter type across species?
lapply(1:length(PCA_RFfeats), function(x){
unlist(lapply(1:length(type), function(y){
ncol(PCA_RFfeats[[x]][[y]]$x)
}))
})
## [[1]]
## [1] 27 88 97
##
## [[2]]
## [1] 27 88 598
##
## [[3]]
## [1] 27 88 417
We assessed t-SNE as an additional feature extraction method for all acoustic similarity and structure measurements. t-SNE is a machine learning approach used to reduce high-dimensional data and, under some conditions, can be more effective at pulling out clustering patterns than PCA (also an unsupervised machine learning approach)[3;4]. t-SNE involves optimizing the perplexity parameter (related to the number of nearest neighbors per point), which van der Maaten suggests should range from 5 to 50 (see http://lvdmaaten.github.io/tsne/). We optimized t-SNE for input into subsequent machine learning analysis, to determine whether this method would provide improved learning of acoustic similarity compared to MDS- and PCA-derived features. t-SNE can effectively embed data in 2 or 3 dimensions.
We optimized t-SNE perplexity across species/social scales using SPCC.
# x <- 2 # testing
# p <- 1
tsne_opt <- invisible(pblapply(1:length(spp), function(x){
spcc <- readRDS(file.path(path, paste(paste("xc_mat", spp[x], sep = "_"), ".RDS", sep = "")))
# make a distance matrix
xcdist <- stats::as.dist(1 - spcc, diag = TRUE, upper = FALSE)
# initialize perplexity values for optimization
if(grepl("MNK_indiv", spp[x])){
n <- (nrow(spcc) - 1)/3 # perplexity must be less than (number of calls - 1)/3
plx <- floor(seq(15, n, n/5))
plx
} else if(!grepl("MNK_indiv", spp[x])){
plx <- floor(seq(15, 50, 10))
plx
}
tmp_tsne <- lapply(1:length(plx), function(p){
# run tsne 10 times for the given perplexity and select the iteration with the lowest final value of Kullback-Leibler divergence (last value of $ itercosts in the result, as the result is ordered)
tmp <- list()
tmp <- lapply(1:10, function(n){
Rtsne::Rtsne(xcdist, dims = 3, pca = FALSE, max_iter = 5000, perplexity = plx[p], is_distance = TRUE, theta = 0.0)
})
# str(tmp)
KL <- sapply(1:length(tmp), function(i){
tmp[[i]]$itercosts[100]
}, simplify = TRUE)
# overwrite tmp with the solution that minimized KL
tmp <- tmp[[which.min(KL)]]
return(tmp)
})
names(tmp_tsne) <- plx
return(tmp_tsne)
}))
names(tsne_opt) <- spp
saveRDS(tsne_opt, file.path(path, "tsne_opt_2spp.RDS"))
Read back in the t-SNE optimization results and plot.
tsne_opt <- readRDS(file.path(path, "tsne_opt_2spp.RDS"))
invisible(pblapply(1:length(spp), function(x){
tsne <- tsne_opt[[x]]
spcc <- readRDS(file.path(path, paste(paste("xc_mat", spp[x], sep = "_"), ".RDS", sep = "")))
# make a distance matrix
xcdist <- stats::as.dist(1 - spcc, diag = TRUE, upper = FALSE)
# initialize perplexity values used for optimization
if(grepl("MNK_indiv", spp[x])){
n <- (nrow(spcc) - 1)/3 # perplexity must be less than (number of calls - 1)/3
plx <- floor(seq(15, n, n/5))
} else if(!grepl("MNK_indiv", spp[x])){
plx <- floor(seq(15, 50, 10))
}
# filter metadata for plotting
metadats <- dfs[[grep(spp[x], names(dfs), ignore.case = TRUE)]]
tmp_indivs <- indivs[[grep(spp[x], names(indivs), ignore.case = TRUE)]]
lapply(1:length(plx), function(p){
# create a data frame with t-SNE embeddings for plotting
df <- data.frame(X = tsne[[p]]$Y[, 1], Y = tsne[[p]]$Y[, 2], indiv = as.character(metadats$bird_ID), site = as.character(metadats$site))
# str(df)
# filter by individuals
df <- droplevels(df[grep(paste(paste("^", tmp_indivs, "$", sep = ""), collapse = "|"), df$indiv), ])
# aesthetics by individual
sts <- unique(as.character(df$site))
ins <- unique(as.character(df$indiv))
cols <- c("#F8766D", "#00BFC4", grey.colors(12)[3])
plot_cols <- unlist(lapply(1:length(sts), function(s){
rep(cols[s], length(ins[grep(sts[s], ins)]))
}))
# if MNK_site
if(grepl("site", spp[x])){
# convex hull polygons by site
hulls <- plyr::ddply(df, "site", function(z){
z[chull(z$X, z$Y), ]
})
} else if(!grepl("site", spp[x])){
# convex hull polygons by individual
hulls <- plyr::ddply(df, "indiv", function(z){
z[chull(z$X, z$Y), ]
})
}
# make graphic
gg <- ggplot(df, aes(x = X, y = Y))
if(grepl("site", spp[x])){
gg <- gg + geom_point(aes(color = site, fill = site, shape = site), size = 4) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = site, color = site), alpha = 0.2)
} else if(!grepl("site", spp[x])){
gg <- gg + geom_point(aes(color = indiv, fill = indiv, shape = indiv), size = 4) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = indiv, color = indiv), alpha = 0.2) +
scale_colour_manual(values = plot_cols) + scale_fill_manual(values = plot_cols) +
scale_shape_manual(values = c(0, 1, 2, 4, 21, 22, 8, 24))
}
gg <- gg + xlab("t-SNE Dimension 1") + ylab("t-SNE Dimension 2") +
theme_AcousticSpace() +
ggtitle(paste(spp[x], " SPCC", "t-SNE perplexity = ", plx[p], sep = " "))
print(gg)
})
}))
Perplexities of 21, 45 and 45 worked well for MNK_indiv, MNK_site and YNA, respectively. We extracted t-SNE features for all acoustic similarity and structure measurements, using the optimized perplexity values and the number of iterations determined with the SPCC data set.
# file prefixes, measurement names and list element names spanning all acoustic similarity and structure measurements (as defined above)
prefix <- c("xc_mat", "sp_ts_DTW", "df_ts_DTW", "mDTW", "acoustic_parameters_specan", "acoustic_parameters_cepstral", "img_parameters_large")
nms <- c("SPCC", "Spectral_entropy_DTW", "Dominant_freq_DTW", "Multivariate_DTW", "Specan_parameters", "Mel-frequency_cepstral_parameters", "Image_parameters")
type <- c("SPCC", "spDTW", "dfDTW", "multiDTW", "asp_params", "cep_params", "img_params")
# optimized t-SNE perplexity per species/social scale (see above)
plx <- c(21, 45, 45)
RFtsne_feats <- invisible(pblapply(1:length(spp), function(x){
tmp_tsne <- lapply(1:length(nms), function(y){
if(grepl("SPCC|DTW", nms[y])){
dats <- readRDS(file.path(path, paste(paste(prefix[y], spp[x], sep = "_"), ".RDS", sep = "")))
} else if(!grepl("SPCC|DTW", nms[y])){
dats <- read.csv(file.path(path, paste(paste(prefix[y], spp[x], sep = "_"), ".csv", sep = "")), header = TRUE)
}
# make distance matrices as appropriate
if(grepl("SPCC", nms[y])){
X <- stats::as.dist(1 - dats, diag = TRUE, upper = FALSE)
is_distance <- TRUE
} else if(grepl("DTW", nms[y])){
X <- stats::as.dist(dats, diag = TRUE, upper = FALSE)
is_distance <- TRUE
# For tabular acoustic and image parameters, make sure selection IDs will not be confused for numeric variables, and select only numeric variables
} else if(grepl("Specan|Mel-f", nms[y])){
dats$selec <- as.character(dats$selec)
dats <- dats[, sapply(dats, is.numeric)]
is_distance <- FALSE
}
# some image parameters have NAs that cause preprocessing to fail, find and remove these
if(grepl("Image", nms[y])){
nas <- sapply(1:nrow(dats), function(x){
which(is.na(dats[x, ]))
})
nas <- unique(unlist(nas)) # 3 variables
dats <- dats[, -nas]
# retain only numeric image parameters (drops the call_ID column)
dats <- dats[, sapply(dats, is.numeric)]
is_distance <- FALSE
# dim(dats)
}
# Pre-process tabular acoustic and image parameters
if(grepl("Specan|Mel-f|Image", nms[y])){
pp_list <- caret::preProcess(dats, method = c("YeoJohnson", "zv", "nzv", "center", "scale"), thresh = 0.98, freqCut = 98/2, uniqueCut = 2)
X <- predict(pp_list, dats)
}
# run tsne 10 times for the given perplexity and select the iteration with the lowest final value of Kullback-Leibler divergence (last value of $ itercosts in the result, as the result is ordered)
tmp <- list()
tmp <- lapply(1:10, function(n){
Rtsne::Rtsne(X, dims = 3, pca = FALSE, max_iter = 5000, perplexity = plx[x], is_distance = is_distance, theta = 0.0)
})
# str(tmp)
cat(paste("spp = ", spp[x], " param = ", type[y], "\n", sep = ""))
KL <- sapply(1:length(tmp), function(i){
tmp[[i]]$itercosts[100]
}, simplify = TRUE)
# overwrite tmp with the solution that minimized KL
tmp <- tmp[[which.min(KL)]]
return(tmp)
})
names(tmp_tsne) <- type
return(tmp_tsne)
}))
names(RFtsne_feats) <- spp
saveRDS(RFtsne_feats, file.path(path, "RFtsne_feats_2spp.RDS"))
We generated visuals of t-SNE features across acoustic similarity and structure measurements to ground-truth the t-SNE features.
tsne_feats <- readRDS(file.path(path, "RFtsne_feats_2spp.RDS"))
invisible(pblapply(1:length(spp), function(x){
lapply(1:length(nms), function(y){
tsne <- tsne_feats[[x]][[y]]
# filter metadata for plotting
metadats <- dfs[[grep(spp[x], names(dfs), ignore.case = TRUE)]]
tmp_indivs <- indivs[[grep(spp[x], names(indivs), ignore.case = TRUE)]]
# create a data frame with t-SNE embeddings for plotting
df <- data.frame(X = tsne$Y[, 1], Y = tsne$Y[, 2], indiv = as.character(metadats$bird_ID), site = as.character(metadats$site))
# str(df)
# filter by individuals
df <- droplevels(df[grep(paste(paste("^", tmp_indivs, "$", sep = ""), collapse = "|"), df$indiv), ])
# aesthetics by individual
sts <- unique(as.character(df$site))
ins <- unique(as.character(df$indiv))
cols <- c("#F8766D", "#00BFC4", grey.colors(12)[3])
plot_cols <- unlist(lapply(1:length(sts), function(s){
rep(cols[s], length(ins[grep(sts[s], ins)]))
}))
# if MNK_site
if(grepl("site", spp[x])){
# convex hull polygons by site
hulls <- plyr::ddply(df, "site", function(z){
z[chull(z$X, z$Y), ]
})
} else if(!grepl("site", spp[x])){
# convex hull polygons by individual
hulls <- plyr::ddply(df, "indiv", function(z){
z[chull(z$X, z$Y), ]
})
}
# make graphic
gg <- ggplot(df, aes(x = X, y = Y))
if(grepl("site", spp[x])){
gg <- gg + geom_point(aes(color = site, fill = site, shape = site), size = 4) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = site, color = site), alpha = 0.2)
} else if(!grepl("site", spp[x])){
gg <- gg + geom_point(aes(color = indiv, fill = indiv, shape = indiv), size = 4) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = indiv, color = indiv), alpha = 0.2) +
scale_colour_manual(values = plot_cols) + scale_fill_manual(values = plot_cols) +
scale_shape_manual(values = c(0, 1, 2, 4, 21, 22, 8, 24))
}
gg <- gg + xlab("t-SNE Dimension 1") + ylab("t-SNE Dimension 2") +
theme_AcousticSpace() +
ggtitle(paste(spp[x], "-", nms[y], "t-SNE perplexity = ", plx[x], sep = " "))
print(gg)
})
}))
The t-SNE features look great. Overall, MNK calls overlap quite a lot in acoustic space but are overdispersed relative to YNA calls, which cluster by site (representative of dialects). The patterns visible across the t-SNE extracted features are similar to those we saw in MDS and PCA features prior to optimizing t-SNE.
We trained supervised random forests models using individuals for MNK and dialects for YNA. We split the remaining calls into two datasets: a validation data set, which we used to ground-truth how well random forests learned patterns of acoustic similarity, and a final test data set of all remaining calls for predicting random forests acoustic similarity at the site social scale. After model training and validation, we used the best model per species to predict acoustic similarity of the calls set aside for final testing.
We named features across parameter types. We also generated 4 random variables as built-in controls to better assess random forests variable importance (any acoustic or image features with importance <= random variable importance are noise). We combined all MDS, PCA, t-SNE features and random variables, as well as metadata important for random forests response variables into a single predictor data set per species.
# add call_ID to MNK_indiv
dfs$MNK_indiv$call_ID <- sapply(1:nrow(dfs$MNK_indiv), function(x){
strsplit(as.character(dfs$MNK_indiv$sound.files)[x], split = "_")[[1]][6]
})
isoMDS_feats <- readRDS(file.path(path, "RFfeats_isoMDS_2spp.RDS"))
PCA_feats <- readRDS(file.path(path, "PCA_RFfeats_2spp.RDS"))
tsne_feats <- readRDS(file.path(path, "RFtsne_feats_2spp.RDS"))
invisible(pblapply(1:length(spp), function(x){
# make unique call ID variables to name feature dimensions.
uniq_ID <- paste(dfs[[x]]$bird_ID, dfs[[x]]$call_ID, sep = "-")
# read in MDS and PCA features and change dimension names
SPCC_MDS <- isoMDS_feats[[x]]$SPCC
dimnames(SPCC_MDS)[[1]] <- uniq_ID
dimnames(SPCC_MDS)[[2]] <- c(paste("SPCC_MDS", seq(1, dim(SPCC_MDS)[2], 1), sep = "_D"))
spent_DTW_MDS <- isoMDS_feats[[x]]$spDTW
dimnames(spent_DTW_MDS)[[1]] <- uniq_ID
dimnames(spent_DTW_MDS)[[2]] <- c(paste("spentDTW_MDS", seq(1, dim(spent_DTW_MDS)[2], 1), sep = "_D"))
dfDTW_MDS <- isoMDS_feats[[x]]$dfDTW
dimnames(dfDTW_MDS)[[1]] <- uniq_ID
dimnames(dfDTW_MDS)[[2]] <- c(paste("dfDTW_MDS", seq(1, dim(dfDTW_MDS)[2], 1), sep = "_D"))
multiDTW_MDS <- isoMDS_feats[[x]]$multiDTW
dimnames(multiDTW_MDS)[[1]] <- uniq_ID
dimnames(multiDTW_MDS)[[2]] <- c(paste("multiDTW_MDS", seq(1, dim(multiDTW_MDS)[2], 1), sep = "_D"))
spf_PCA <- PCA_feats[[x]]$asp_params$x
dimnames(spf_PCA)[[1]] <- uniq_ID
dimnames(spf_PCA)[[2]] <- paste("sp_acoustic_PCA", seq(1, ncol(spf_PCA), 1), sep = "_D")
cepf_PCA <- PCA_feats[[x]]$cep_params$x
dimnames(cepf_PCA)[[1]] <- uniq_ID
dimnames(cepf_PCA)[[2]] <- paste("cep_acoustic_PCA", seq(1, ncol(cepf_PCA), 1), sep = "_D")
imgf_PCA <- PCA_feats[[x]]$img_params$x
dimnames(imgf_PCA)[[1]] <- uniq_ID
dimnames(imgf_PCA)[[2]] <- paste("img_acoustic_PCA", seq(1, ncol(imgf_PCA), 1), sep = "_D")
# read in t-SNE features and change dimension names
SPCC_tsne <- tsne_feats[[x]]$SPCC$Y
dimnames(SPCC_tsne)[[1]] <- uniq_ID
dimnames(SPCC_tsne)[[2]] <- c(paste("SPCC_tSNE", seq(1, dim(SPCC_tsne)[2], 1), sep = "_D"))
spent_DTW_tsne <- tsne_feats[[x]]$spDTW$Y
dimnames(spent_DTW_tsne)[[1]] <- uniq_ID
dimnames(spent_DTW_tsne)[[2]] <- c(paste("spentDTW_tSNE", seq(1, dim(spent_DTW_tsne)[2], 1), sep = "_D"))
dfDTW_tsne <- tsne_feats[[x]]$dfDTW$Y
dimnames(dfDTW_tsne)[[1]] <- uniq_ID
dimnames(dfDTW_tsne)[[2]] <- c(paste("dfDTW_tSNE", seq(1, dim(dfDTW_tsne)[2], 1), sep = "_D"))
multiDTW_tsne <- tsne_feats[[x]]$multiDTW$Y
dimnames(multiDTW_tsne)[[1]] <- uniq_ID
dimnames(multiDTW_tsne)[[2]] <- c(paste("multiDTW_tSNE", seq(1, dim(multiDTW_tsne)[2], 1), sep = "_D"))
spf_tsne <- tsne_feats[[x]]$asp_params$Y
dimnames(spf_tsne)[[1]] <- uniq_ID
dimnames(spf_tsne)[[2]] <- c(paste("sp_tSNE", seq(1, dim(spf_tsne)[2], 1), sep = "_D"))
cepf_tsne <- tsne_feats[[x]]$cep_params$Y
dimnames(cepf_tsne)[[1]] <- uniq_ID
dimnames(cepf_tsne)[[2]] <- c(paste("cep_tSNE", seq(1, dim(cepf_tsne)[2], 1), sep = "_D"))
imgf_tsne <- tsne_feats[[x]]$img_params$Y
dimnames(imgf_tsne)[[1]] <- uniq_ID
dimnames(imgf_tsne)[[2]] <- c(paste("img_tSNE", seq(1, dim(imgf_tsne)[2], 1), sep = "_D"))
# collect features in a data frame
sup_rf <- data.frame(site = dfs[[x]]$site, indiv = dfs[[x]]$bird_ID, uniq_call_ID = uniq_ID, SPCC_MDS, spent_DTW_MDS, dfDTW_MDS, multiDTW_MDS, spf_PCA, cepf_PCA, imgf_PCA, SPCC_tsne, spent_DTW_tsne, dfDTW_tsne, multiDTW_tsne, spf_tsne, cepf_tsne, imgf_tsne)
# make 4 random variables as built-in controls
set.seed(401)
RND_vars <- data.frame(RND1 = rnorm(n = nrow(sup_rf), mean = 3, sd = 15), RND2 = rpois(n = nrow(sup_rf), lambda = 100), RND3 = runif(n = nrow(sup_rf), min = 0.5, max = 1000), RND4 = rlogis(n = nrow(sup_rf), location = 1, scale = 10))
# preprocess random variables by transforming, centering and scaling
pp_list <- caret::preProcess(RND_vars, method = c("YeoJohnson", "center", "scale"), thresh = 0.98, freqCut = 98/2, uniqueCut = 2)
RND_vars <- predict(pp_list, RND_vars)
# add the random variables to the set of features
sup_rf <- data.frame(sup_rf, RND_vars)
saveRDS(sup_rf, file.path(path, paste(paste("sup_rf", spp[x], sep = "_"), ".RDS", sep = "")))
}))
spp <- c("MNK_indiv", "MNK_site", "YNA")
sup_rf_list <- pblapply(1:length(spp), function(x){
readRDS(file.path(path, paste(paste("sup_rf", spp[x], sep = "_"), ".RDS", sep = "")))
})
The MNK_indiv predictor data set contained 97 calls and 303 quantitative features, and the MNK_site data set contained 598 calls and 835 features. The YNA predictor data set contained 417 calls and 648 features. All data sets included the 4 built-in random control variables.
sapply(1:length(spp), function(x){
dim(sup_rf_list[[x]][, sapply(sup_rf_list[[x]], is.numeric)])
})
We preprocessed the predictor set by removing highly collinear variables (Pearson’s r > 0.75). We assessed collinearity among features, but not collinearity with signal-to-noise ratio (SNR): calls were extracted from original recordings as cuts, and this signal selection method did not leave sufficient background before and after each call to calculate SNR.
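For context, SNR screening in warbleR needs a background-noise margin around each selection; a minimal sketch of how it could be run (not evaluated here), assuming a hypothetical extended selection table est whose selections retain at least 0.1 s of background before and after each call:
# hypothetical object: est, an extended selection table with >= 0.1 s margins around each call
# est_snr <- warbleR::sig2noise(est, mar = 0.1)
# summary(est_snr$SNR) # sig2noise appends an SNR column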
sup_rf_pp <- invisible(pblapply(1:length(spp), function(x){
cor <- stats::cor(sup_rf_list[[x]][, sapply(sup_rf_list[[x]], is.numeric)], method = "pearson")
# find names of variables that have > 0.75 collinearity
corr_nms <- caret::findCorrelation(cor, cutoff = 0.75, verbose = TRUE, names = TRUE, exact = TRUE)
# for MNK_indiv, additionally remove features that were removed for MNK_site
if(grepl("MNK_indiv", spp[x])){
corr_nms <- c(corr_nms, "dfDTW_MDS_D2", "spentDTW_tSNE_D1", "cep_tSNE_D2")
}
# remove any collinear variables from the data frame
sup_rf_tmp <- sup_rf_list[[x]][, -grep(paste(paste("^", corr_nms, "$", sep = ""), collapse = "|"), names(sup_rf_list[[x]]))]
return(sup_rf_tmp)
}))
names(sup_rf_pp) <- spp
# check that all features match between the MNK_indiv and MNK_site data sets
# model training and validation will use MNK_indiv and MNK_site for model testing, so the same features need to be included per data set
names(sup_rf_pp[["MNK_indiv"]])[-which(names(sup_rf_pp[["MNK_indiv"]]) %in% names(sup_rf_pp[["MNK_site"]]))]
How many quantitative features remain after preprocessing? 285 for MNK_indiv, 825 for MNK_site and 634 for YNA.
sapply(1:length(spp), function(x){
dim(sup_rf_pp[[x]][, sapply(sup_rf_pp[[x]], is.numeric)])
})
saveRDS(sup_rf_pp, file.path(path, "sup_rf_pp.RDS"))
sup_rf_pp <- readRDS(file.path(path, "sup_rf_pp.RDS"))
We trained models using a subset of calls per species: for MNK, the individuals with the most calls per bird, and for YNA, a subset of sites from each dialect. The MNK training data set consisted of the 4 individuals from the same site with the most calls (73 calls total, the same training set as in Supplementary Methods 2). The YNA training data set consisted of 8 sites representing 2 dialects (274 calls total; 4 sites each from the Northern and Southern dialects), with 148 calls from the Northern dialect and 126 from the Southern dialect.
We used our remaining MNK repeatedly sampled (sometimes referred to as “focal”) individual calls as a validation data set and used all higher social scale calls as our final test data set. For YNA validation, we chose 1 individual each from sites Pnar and Cabu (Northern dialect[2]), and 2 individuals from the site Curu (Southern dialect[2]), to serve as validation data sets. We assigned all remaining YNA calls to the test data set.
We used two random forests implementations, from the ranger and randomForest packages. In general, we observed higher classification performance for models trained with ranger (see below).
# regional dialect labels are encoded in original sound file names
yna$regional_dialect <- sapply(1:nrow(yna), function(x){
strsplit(as.character(yna$sound.files[x]), split = "_")[[1]][4]
})
sup_rf_pp[["YNA"]]$regional_dialect <- sapply(1:nrow(sup_rf_pp[["YNA"]]), function(x){
strsplit(as.character(yna$sound.files[x]), split = "_")[[1]][4]
})
Tables of calls by individuals and sites for YNA, which we used to partition calls into training, validation and test data sets.
table(yna$site)
##
## baju belen cabu curu grand horiz inoc junq pelal penas pnar psapa
## 40 28 13 16 25 33 37 39 39 36 32 18
## tarc zapol
## 23 38
table(yna$bird_ID)
##
## baju-1 baju-2 baju-3 baju-4 belen-1 belen-2 belen-3 cabu-1 cabu-2
## 10 10 10 10 10 8 10 10 3
## curu-1 curu-2 grand-1 grand-2 grand-3 horiz-1 horiz-2 horiz-3 horiz-4
## 10 6 10 9 6 10 9 7 7
## inoc-1 inoc-2 inoc-3 inoc-4 junq-1 junq-2 junq-3 junq-4 pelal-1
## 10 10 9 8 9 10 10 10 10
## pelal-2 pelal-3 pelal-4 penas-1 penas-2 penas-3 penas-4 pnar-1 pnar-2
## 10 10 9 8 9 9 10 10 9
## pnar-3 pnar-4 psapa-1 psapa-2 tarc-1 tarc-2 tarc-3 zapol-1 zapol-2
## 7 6 9 9 7 9 7 10 10
## zapol-3 zapol-4
## 9 9
The YNA data set included 1 Nicaraguan dialect site, 7 Northern dialect sites, and 6 Southern dialect sites.
table(yna$regional_dialect)
##
## Nic Nor Sou
## 36 221 160
table(yna$regional_dialect, yna$site)
##
## baju belen cabu curu grand horiz inoc junq pelal penas pnar psapa
## Nic 0 0 0 0 0 0 0 0 0 36 0 0
## Nor 0 28 13 0 0 33 37 39 39 0 32 0
## Sou 40 0 0 16 25 0 0 0 0 0 0 18
##
## tarc zapol
## Nic 0 0
## Nor 0 0
## Sou 23 38
We added a variable to denote training status of each call per species.
train_labels <- list(c("1145-AAT", "1145-UM1", "1145-UM2", "1145-UM4"), c("horiz", "inoc", "junq", "pelal", "baju", "grand", "tarc", "zapol"))
spp_train <- c("MNK_indiv", "YNA")
col_nm <- list("indiv", "site")
# retain data frames to use for training only
rf_df <- list(sup_rf_pp[["MNK_indiv"]], sup_rf_pp[["YNA"]])
# x <- 2
sup_rf_set <- invisible(pblapply(1:length(spp_train), function(x){
rf_df[[x]]$set <- rf_df[[x]][[col_nm[[x]]]]
rf_df[[x]]$set <- gsub(paste(train_labels[[x]], collapse = "|"), "train", rf_df[[x]]$set)
rf_df[[x]]$set[rf_df[[x]]$set != "train"] <- "test"
# change training class labels to cooperate with caret training
# if MNK_indiv, strip the numeric site prefix so individual class labels do not start with a number
if(grepl("MNK_indiv", spp_train[x])) rf_df[[x]]$indiv <- gsub("1145-", "", rf_df[[x]]$indiv)
# if YNA, replace "-" in indiv IDs with "_"
if(grepl("YNA", spp_train[x])) rf_df[[x]]$indiv <- gsub("-", "_", rf_df[[x]]$indiv)
return(rf_df[[x]])
}))
names(sup_rf_set) <- spp_train
saveRDS(sup_rf_set, file.path(path, "sup_rf_set.RDS"))
How many calls remain for training?
spp_train <- c("MNK_indiv", "YNA")
sup_rf_set <- readRDS(file.path(path, "sup_rf_set.RDS"))
sapply(1:length(spp_train), function(x){
length(sup_rf_set[[x]]$set[grep("train", sup_rf_set[[x]]$set)])
})
## [1] 73 274
table(sup_rf_set[["YNA"]]$set, sup_rf_set[["YNA"]]$regional_dialect)
##
## Nic Nor Sou
## test 36 73 34
## train 0 148 126
We trained models using repeated 10-fold cross-validation with 5 repeats. The models we built below were multiclass models per species (e.g. 4 training MNK individuals = 4 class labels; for YNA, the 2 regional dialects served as class labels). Here we trained the first model per species, which included all features (no additional feature selection beyond the previous pre-processing), using both the ranger and randomForest implementations.
# set seed to obtain the same results
seed <- 401
# set training labels by species
train_col <- list("indiv", "regional_dialect")
trees <- seq(500, 2500, 500)
implem <- c("ranger", "rf")
cores <- parallel::detectCores() - 2
# x <- 1
# i <- 1
model1_list <- invisible(pblapply(1:length(spp_train), function(x){
tmp <- sup_rf_set[[spp_train[x]]]
# set seed to obtain the same results
seed <- 401
# initialize training data, mtry and number of trees
train <- droplevels(tmp[grep("train", tmp$set), ])
# remove categorical variables that will not be used for classification
train <- train[, -grep("site|uniq_call_ID|set", names(train))]
# dim(train)
# names(train)
train[[train_col[[x]]]] <- as.factor(train[[train_col[[x]]]])
mtry <- round(seq(2, ncol(train[, sapply(train, is.numeric)]), ncol(train[, sapply(train, is.numeric)])/10))
# mtry
# perform model training over two random forests implementations
# i <- 2
model1_list <- lapply(1:length(implem), function(i){
if(implem[i] == "rf"){
# initialize train control parameters for caret
fitControl <- caret::trainControl(method = c("repeatedcv"), number = 10, repeats = 5, search = "grid", returnData = TRUE, returnResamp = "final", savePredictions = "final", classProbs = TRUE, summaryFunction = multiClassSummary)
tunegrid <- expand.grid(.mtry = mtry)
} else if(implem[i] == "ranger"){
splitrule <- "gini"
min.node.size <- 1
tunegrid <- expand.grid(.mtry = mtry, .splitrule = splitrule, .min.node.size = min.node.size)
fitControl <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 5, search = "grid", returnData = TRUE, returnResamp = "final", savePredictions = "final", classProbs = TRUE, summaryFunction = multiClassSummary, verboseIter = TRUE)
}
mlist <- list()
mlist <- pblapply(1:length(trees), function(n){
if(implem[i] == "rf"){
set.seed(seed)
rf_model <- caret::train(x = train[, sapply(train, is.numeric)], y = train[[train_col[[x]]]], method = "rf", ntree = trees[n], replace = TRUE, sampsize = 10, importance = TRUE, proximity = TRUE, norm.votes = TRUE, do.trace = FALSE, metric = c("Kappa"), maximize = TRUE, trControl = fitControl, tuneGrid = tunegrid, tuneLength = 10)
} else if(implem[i] == "ranger"){
rf_model <- caret::train(x = train[, sapply(train, is.numeric)], y = train[[train_col[[x]]]], num.trees = trees[n], method = "ranger", trControl = fitControl, tuneGrid = tunegrid, importance = "permutation", replace = TRUE, scale.permutation.importance = TRUE, num.threads = cores, metric = c("Kappa"), maximize = TRUE, tuneLength = 10, seed = seed)
}
return(rf_model)
})
# yields warnings
# Warning message:
# In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
# There were missing values in resampled performance measures.
names(mlist) <- trees
return(mlist)
})
names(model1_list) <- implem
return(model1_list)
}))
length(model1_list)
names(model1_list) <- spp_train
saveRDS(model1_list, file.path(path, "model1_list_2spp.RDS"))
m1list <- readRDS(file.path(path, "model1_list_2spp.RDS"))
We visualized performance metrics of the first model.
trees <- seq(500, 2500, 500)
implem <- c("ranger", "rf")
invisible(lapply(1:length(spp_train), function(x){
lapply(1:length(implem), function(i){
results <- resamples(m1list[[spp_train[x]]][[implem[i]]])
print(dotplot(results, metric = c("Kappa", "Accuracy"), main = paste(spp_train[x], implem[i], sep = " : ")))
})
}))
Confusion matrix for the MNK_indiv ranger model.
confusionMatrix(m1list$MNK_indiv$ranger[["2500"]])
## Cross-Validated (10 fold, repeated 5 times) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction AAT UM1 UM2 UM4
## AAT 14.0 0.0 0.0 0.0
## UM1 1.6 33.7 0.0 0.0
## UM2 0.8 0.0 31.5 0.3
## UM4 0.0 0.5 0.0 17.5
##
## Accuracy (average) : 0.9671
# print(m1list$MNK_indiv$ranger[["2500"]]) # to see mtry used
Confusion matrix for the MNK_indiv randomForest model.
confusionMatrix(m1list$MNK_indiv$rf[["2000"]])
## Cross-Validated (10 fold, repeated 5 times) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction AAT UM1 UM2 UM4
## AAT 0.0 0.0 0.0 0.0
## UM1 12.3 34.2 0.3 9.9
## UM2 4.1 0.0 31.2 2.7
## UM4 0.0 0.0 0.0 5.2
##
## Accuracy (average) : 0.7068
# print(m1list$MNK_indiv$rf[["2000"]])
Confusion matrix for the YNA ranger model.
confusionMatrix(m1list$YNA$ranger[["2000"]])
## Cross-Validated (10 fold, repeated 5 times) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction Nor Sou
## Nor 53.2 0.7
## Sou 0.8 45.3
##
## Accuracy (average) : 0.9847
# print(m1list$YNA$ranger[["2000"]])
Confusion matrix for the YNA randomForest model.
confusionMatrix(m1list$YNA$rf[["2500"]])
## Cross-Validated (10 fold, repeated 5 times) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction Nor Sou
## Nor 53.1 6.2
## Sou 0.9 39.8
##
## Accuracy (average) : 0.9292
# print(m1list$YNA$rf[["2500"]])
The ranger random forests implementation yielded better performance across species. The highest performing model for MNK (trained on individuals) had 96.71% classification accuracy. For YNA, training on regional dialects yielded 98.47% accuracy.
We evaluated random forests variable importance across species and implementations.
names(m1list) <- spp_train
train_list <- list(list(m1list$MNK_indiv$ranger[["2500"]]$finalModel, m1list$MNK_indiv$rf[["2000"]]$finalModel), list(m1list$YNA$ranger[["2000"]]$finalModel, m1list$YNA$rf[["2500"]]$finalModel))
spp_train <- c("MNK_indiv", "YNA")
implem <- c("ranger", "rf")
opt_trees <- list(c("2500", "2000"), c("2000", "2500"))
n <- 12
# pie(rep(1, n), col = heat.colors(n))
# cep, SPCC, dfDTW, sp, multiDTW, img, spentDTW, RND
orig_cols <- c(alpha("purple", 0.6), topo.colors(12)[2], terrain.colors(12)[1], heat.colors(12)[1], heat.colors(12, alpha = 0.7)[5], heat.colors(12)[8], gray.colors(12)[5], "black")
invisible(lapply(1:length(spp_train), function(x){
lapply(1:length(implem), function(i){
# make a data frame to visualize variable importance
if(implem[i] == "ranger"){
var_imp_df <- data.frame(var_nms = names(train_list[[x]][[i]]$variable.importance), imp = train_list[[x]][[i]]$variable.importance)
} else if(implem[i] == "rf"){
var_imp_df <- data.frame(var_nms = dimnames(train_list[[x]][[i]]$importance)[[1]], imp = train_list[[x]][[i]]$importance[, grep("MeanDecreaseAccuracy", dimnames(train_list[[x]][[i]]$importance)[[2]])])
}
# make a variable of parameter type
# x <- 1
var_imp_df$var_type <- sapply(1:nrow(var_imp_df), function(y){
strsplit(as.character(var_imp_df$var_nms[y]), split = "_")[[1]][1]
})
# standardize variable names if necessary (changed parameter type names slightly in sup_rf_pp)
var_imp_df$var_type <- gsub("RND[0-9]+", "RND", var_imp_df$var_type)
var_imp_df$var_type <- gsub("cepf", "cep", var_imp_df$var_type)
var_imp_df$var_type <- gsub("spf", "sp", var_imp_df$var_type)
var_imp_df$var_type <- gsub("imgf", "img", var_imp_df$var_type)
var_imp_df$var_type <- gsub("spent$", "spentDTW", var_imp_df$var_type)
# order by importance
ord <- order(var_imp_df$imp, decreasing = TRUE)
var_imp_df <- var_imp_df[ord, ]
var_imp_df$var_nms <- factor(var_imp_df$var_nms, levels = unique(var_imp_df$var_nms))
var_imp_df$var_type <- factor(var_imp_df$var_type, levels = c("cep", "SPCC", "dfDTW", "sp", "multiDTW", "img", "spentDTW", "RND"))
# levels(var_imp_df$var_type)
# plot only the top 20 variables by importance
plot_df <- droplevels(var_imp_df[1:20, ])
plot_df$var_type <- factor(plot_df$var_type, levels = unique(plot_df$var_type))
levels(plot_df$var_type)
# subset original colors by the current variable type
cols <- unlist(sapply(1:length(levels(plot_df$var_type)), function(v){
orig_cols[grep(paste("^", levels(plot_df$var_type)[v], "$", sep = ""), levels(var_imp_df$var_type))]
}))
gg <- ggplot(data = plot_df) +
geom_segment(aes(x = 0, xend = imp, y = forcats::fct_rev(var_nms), yend = forcats::fct_rev(var_nms), color = var_type), size = 1.5) +
geom_point(aes(y = forcats::fct_rev(var_nms), x = imp), pch = 21, fill = gray.colors(n)[2], color = "black", size = 4) +
scale_color_manual(values = cols) +
xlab("Variable Importance") + ylab("Top 20 Most Important Features") +
theme_AcousticSpace() + theme(panel.grid.major.x = element_line(color = "black", size = 0.25, linetype = "dashed")) +
ggtitle(paste(spp_train[x], implem[i], paste(opt_trees[[x]][[i]], "trees", sep = " "), sep = " - "))
print(gg)
})
}))
Variable importance was quite different between ranger and randomForest, most notably for the YNA model. Across models, we saw higher variable importance for earlier dimensions of the MDS, PCA and t-SNE features, which typically contain more variation than later dimensions and are therefore expected to have higher importance.
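As a quick check on this expectation for the PCA features, the proportion of variance captured by successive dimensions can be read from the stored PCA output; a minimal sketch, assuming PCA_feats (read in above) is still in the workspace, that its first element corresponds to MNK_indiv, and that asp_params holds prcomp output (consistent with the $x scores used above):
# proportion of variance explained by the first 5 spectral PCA dimensions
asp_sdev <- PCA_feats[[1]]$asp_params$sdev
round((asp_sdev^2 / sum(asp_sdev^2))[1:5], digits = 3)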
ranger vs. randomForest Variable Importance
We also compared variable importance across random forests implementations by species, for the variables shared between the top 40 most important features of each implementation.
invisible(lapply(1:length(spp_train), function(x){
var_imp_df_ranger <- data.frame(var_nms = names(train_list[[x]][[1]]$variable.importance), imp = train_list[[x]][[1]]$variable.importance)
var_imp_df_rf <- data.frame(var_nms = dimnames(train_list[[x]][[2]]$importance)[[1]], imp = train_list[[x]][[2]]$importance[, grep("MeanDecreaseAccuracy", dimnames(train_list[[x]][[2]]$importance)[[2]])])
# order by importance
ord <- order(var_imp_df_ranger$imp, decreasing = TRUE)
var_imp_df_ranger <- var_imp_df_ranger[ord, ]
ord <- order(var_imp_df_rf$imp, decreasing = TRUE)
var_imp_df_rf <- var_imp_df_rf[ord, ]
# among the top 40 most important variables of each implementation, find the common important variables across ranger and randomForest
vars <- as.character(var_imp_df_ranger$var_nms[1:40][which(var_imp_df_ranger$var_nms[1:40] %in% var_imp_df_rf$var_nms[1:40])])
X <- var_imp_df_ranger$imp[grep(paste(paste("^", vars, "$", sep = ""), collapse = "|"), var_imp_df_ranger$var_nms)]
Y <- var_imp_df_rf$imp[grep(paste(paste("^", vars, "$", sep = ""), collapse = "|"), var_imp_df_rf$var_nms)]
plot_df <- data.frame(X, Y)
gg <- ggplot(data = plot_df) +
geom_point(aes(x = X, y = Y)) +
xlab("ranger variable importance") + ylab("randomForest variable importance") + ggtitle(spp_train[x])
print(gg)
}))
Variable importance did not agree between ranger and randomForest, similar to results found when comparing randomForest to Python and SAS random forests implementations [6]. A caveat here is that we used different importance metrics per implementation, so future work on a direct comparison should take care to use the same variable importance metric.
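One way to make such a comparison more direct would be to compute unscaled permutation importance (mean decrease in accuracy) in both packages; a minimal sketch (not evaluated here), assuming a hypothetical data frame train_num that holds the numeric predictors plus a factor response indiv:
# hypothetical object: train_num, numeric predictors plus a factor response indiv
rngr_mod <- ranger::ranger(indiv ~ ., data = train_num, num.trees = 2000, importance = "permutation", scale.permutation.importance = FALSE, seed = 401)
rf_mod <- randomForest::randomForest(indiv ~ ., data = train_num, ntree = 2000, importance = TRUE)
# unscaled permutation importance from each implementation
imp_rngr <- rngr_mod$variable.importance
imp_rf <- randomForest::importance(rf_mod, type = 1, scale = FALSE)[, 1]
# rank agreement between implementations on a common importance metric
cor(imp_rngr, imp_rf[names(imp_rngr)], method = "spearman")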
We decided to proceed with the ranger random forests implementation in subsequent model training, as it consistently yielded higher classification performance. We then built the second random forests model per species, using manual feature selection.
sup_rf_set <- readRDS(file.path(path, "sup_rf_set.RDS"))
We manually selected variables by importance, retaining only features whose importance in the first model exceeded the maximum importance of the built-in random control variables.
names(m1list) <- spp_train
train_list <- list(m1list$MNK_indiv$ranger[["2500"]]$finalModel, m1list$YNA$ranger[["2000"]]$finalModel)
# x <- 1
model2_train <- invisible(pblapply(1:length(spp_train), function(x){
var_imp_df <- data.frame(var_nms = names(train_list[[x]]$variable.importance), imp = train_list[[x]]$variable.importance)
ord <- order(var_imp_df$imp, decreasing = TRUE)
var_imp_df <- var_imp_df[ord, ]
# make variable names match the random forests training object if necessary
# see note above, had changed parameter type names between runs
var_imp_df$var_nms <- gsub("RND[0-9]+", "RND", var_imp_df$var_nms)
var_imp_df$var_nms <- gsub("cepf", "cep", var_imp_df$var_nms)
var_imp_df$var_nms <- gsub("spf", "sp", var_imp_df$var_nms)
var_imp_df$var_nms <- gsub("imgf", "img", var_imp_df$var_nms)
var_imp_df$var_nms <- gsub("spent_DTW", "spentDTW", var_imp_df$var_nms)
# find the maximum importance of random variables in the previous model
RND_var_imp <- max(var_imp_df$imp[grep("RND", var_imp_df$var_nms)])
RND_var_imp
# use the maximum importance of random variables as a threshold to exclude acoustic features
# find which variables have mean importance greater than this, to retain in the final model
var_retain <- as.character(var_imp_df$var_nms[var_imp_df$imp > RND_var_imp])
# grep("^RND", var_retain) # checking
# initialize training data
tmp <- sup_rf_set[[x]]
train <- droplevels(tmp[grep("train", tmp$set), ])
# dim(train)
# build the training data frame of features retained after manual feature selection (model training follows below)
# if MNK_indiv, strip the numeric site prefix so individual class labels do not start with a number
if(grepl("MNK_indiv", spp_train[x])) train$indiv <- gsub("1145-", "", train$indiv)
if(!grepl("YNA", spp_train[x])){
train_ms <- train[, grep(paste(paste("^", c("indiv", var_retain), "$", sep = ""), collapse = "|"), names(train))]
train_ms$indiv <- as.factor(train_ms$indiv)
} else {
train_ms <- train[, grep(paste(paste("^", c("regional_dialect", var_retain), "$", sep = ""), collapse = "|"), names(train))]
train_ms$regional_dialect <- as.factor(train_ms$regional_dialect)
}
return(train_ms)
}))
names(model2_train) <- spp_train
How many features remain per species after manual feature selection?
sapply(1:length(spp_train), function(x){
tmp <- model2_train[[x]]
ncol(tmp[, sapply(tmp, is.numeric)])
})
## [1] 187 85
Check that no random variables remain.
unique(sapply(1:length(spp_train), function(x){
tmp <- model2_train[[x]]
names(tmp)[grep("^RND", names(tmp))]
}))
## [[1]]
## character(0)
trees <- seq(500, 2500, 500)
cores <- parallel::detectCores() - 2
spp_train <- c("MNK_indiv", "YNA")
# set seed to obtain the same results
seed <- 401
# x <- 2
model2_list <- invisible(pblapply(1:length(spp_train), function(x){
train <- model2_train[[spp_train[x]]]
# str(train)
names(train)
mtry <- round(seq(2, ncol(train[, sapply(train, is.numeric)]), ncol(train[, sapply(train, is.numeric)])/10))
# mtry
if(!grepl("YNA", spp_train[x])){
train_col <- "indiv"
} else {
train_col <- "regional_dialect"
}
# perform model training
splitrule <- "gini"
min.node.size <- 1
tunegrid <- expand.grid(.mtry = mtry, .splitrule = splitrule, .min.node.size = min.node.size)
fitControl <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 5, search = "grid", returnData = TRUE, returnResamp = "final", savePredictions = "final", classProbs = TRUE, summaryFunction = multiClassSummary, verboseIter = TRUE)
mlist <- list()
mlist <- pblapply(1:length(trees), function(n){
rf_model <- caret::train(x = train[, sapply(train, is.numeric)], y = train[[train_col]], num.trees = trees[n], method = "ranger", trControl = fitControl, tuneGrid = tunegrid, importance = "permutation", replace = TRUE, scale.permutation.importance = TRUE, num.threads = cores, metric = c("Kappa"), maximize = TRUE, tuneLength = 10, seed = seed)
return(rf_model)
})
# warning about missing values in resampled performance measures
names(mlist) <- trees
return(mlist)
}))
names(model2_list) <- spp_train
saveRDS(model2_list, file.path(path, "model2_list_2spp.RDS"))
m2list <- readRDS(file.path(path, "model2_list_2spp.RDS"))
spp_train <- c("MNK_indiv", "YNA")
# x <- 1
invisible(lapply(1:length(spp_train), function(x){
results <- resamples(m2list[[x]])
print(dotplot(results, metric = c("Kappa", "Accuracy"), main = paste(spp_train[x], "ranger", sep = " : ")))
}))
Confusion matrix for MNK.
confusionMatrix(m2list$MNK_indiv[["2500"]])
## Cross-Validated (10 fold, repeated 5 times) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction AAT UM1 UM2 UM4
## AAT 15.1 0.0 0.5 0.0
## UM1 1.4 34.0 0.0 0.0
## UM2 0.0 0.0 31.0 0.3
## UM4 0.0 0.3 0.0 17.5
##
## Accuracy (average) : 0.9753
# print(m2list$MNK_indiv[["2500"]])
# length(names(m2list$MNK_indiv[["2500"]]$trainingData)[-grep("outcome", names(m2list$MNK_indiv[["2500"]]$trainingData))]) # number of features used
Confusion matrix for YNA, trained on dialects.
confusionMatrix(m2list$YNA[["2000"]])
## Cross-Validated (10 fold, repeated 5 times) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction Nor Sou
## Nor 53.1 0.1
## Sou 0.9 45.8
##
## Accuracy (average) : 0.9898
# print(m2list$YNA[["2000"]])
# length(names(m2list$YNA[["2000"]]$trainingData)[-grep("outcome", names(m2list$YNA[["2000"]]$trainingData))]) # number of features used
We evaluated random forests variable importance for model 2 across species.
train_list <- list(m2list$MNK_indiv[["2500"]]$finalModel, m2list$YNA[["2000"]]$finalModel)
spp_train <- c("MNK_indiv", "YNA")
opt_trees <- list(c("2500"), c("2000"))
n <- 12
# pie(rep(1, n), col = heat.colors(n))
# cep, SPCC, dfDTW, sp, multiDTW, img, spentDTW, RND
orig_cols <- c(alpha("purple", 0.6), topo.colors(12)[2], terrain.colors(12)[1], heat.colors(12)[1], heat.colors(12, alpha = 0.7)[5], heat.colors(12)[8], gray.colors(12)[5], "black")
# x <- 2
invisible(lapply(1:length(spp_train), function(x){
# make a data frame to visualize variable importance
var_imp_df <- data.frame(var_nms = names(train_list[[x]]$variable.importance), imp = train_list[[x]]$variable.importance)
# make a variable of parameter type
# x <- 1
var_imp_df$var_type <- sapply(1:nrow(var_imp_df), function(y){
strsplit(as.character(var_imp_df$var_nms[y]), split = "_")[[1]][1]
})
var_imp_df$var_type <- gsub("RND[0-9]+", "RND", var_imp_df$var_type)
var_imp_df$var_type <- gsub("cepf", "cep", var_imp_df$var_type)
var_imp_df$var_type <- gsub("spf", "sp", var_imp_df$var_type)
var_imp_df$var_type <- gsub("imgf", "img", var_imp_df$var_type)
var_imp_df$var_type <- gsub("spent$", "spentDTW", var_imp_df$var_type)
# order by importance
ord <- order(var_imp_df$imp, decreasing = TRUE)
var_imp_df <- var_imp_df[ord, ]
var_imp_df$var_nms <- factor(var_imp_df$var_nms, levels = unique(var_imp_df$var_nms))
var_imp_df$var_type <- factor(var_imp_df$var_type, levels = c("cep", "SPCC", "dfDTW", "sp", "multiDTW", "img", "spentDTW", "RND"))
# levels(var_imp_df$var_type)
# plot only the top 20 variables by importance
plot_df <- droplevels(var_imp_df[1:20, ])
plot_df$var_type <- factor(plot_df$var_type, levels = unique(plot_df$var_type))
levels(plot_df$var_type)
# subset original colors by the current variable type
cols <- unlist(sapply(1:length(levels(plot_df$var_type)), function(v){
orig_cols[grep(paste("^", levels(plot_df$var_type)[v], "$", sep = ""), levels(var_imp_df$var_type))]
}))
gg <- ggplot(data = plot_df) +
geom_segment(aes(x = 0, xend = imp, y = forcats::fct_rev(var_nms), yend = forcats::fct_rev(var_nms), color = var_type), size = 1.5) +
geom_point(aes(y = forcats::fct_rev(var_nms), x = imp), pch = 21, fill = gray.colors(n)[2], color = "black", size = 4) +
scale_color_manual(values = cols) +
xlab("Variable Importance") + ylab("Top 20 Most Important Features") +
theme_AcousticSpace() + theme(panel.grid.major.x = element_line(color = "black", size = 0.25, linetype = "dashed")) +
ggtitle(paste(spp_train[x], paste(opt_trees[[x]], "trees", sep = " "), sep = " - "))
print(gg)
}))
As expected after manual feature selection, the random control variables no longer appear among the most important features in these plots.
We made bar graphs of variable importance across models (Supplementary Figure 5 in Supplementary Materials).
train_list <- list(m2list$MNK_indiv[["2500"]]$finalModel, m2list$YNA[["2000"]]$finalModel)
spp <- c("MNK", "YNA")
opt_trees <- list(c("2500"), c("2000"))
# make a data frame of the variable types most important per species
# x <- 1
var_type_df <- invisible(lapply(1:length(spp), function(x){
# make a data frame to visualize variable importance
var_imp_df <- data.frame(var_nms = names(train_list[[x]]$variable.importance), imp = train_list[[x]]$variable.importance)
# order by importance
ord <- order(var_imp_df$imp, decreasing = TRUE)
var_imp_df <- var_imp_df[ord, ]
# take the 40 most important variables
var_imp_df <- var_imp_df[1:40, ]
# make a variable of parameter type
# x <- 1
var_imp_df$var_type <- sapply(1:nrow(var_imp_df), function(y){
strsplit(as.character(var_imp_df$var_nms[y]), split = "_")[[1]][1]
})
# for manuscript figure
var_imp_df$var_type <- as.character(var_imp_df$var_type)
var_imp_df$var_type <- gsub("RND[0-9]+", "RND", var_imp_df$var_type)
var_imp_df$var_type <- gsub("cepf|cep", "Cepstral Coefficients", var_imp_df$var_type)
var_imp_df$var_type <- gsub("spf$|sp$", "Acoustic Measurements", var_imp_df$var_type)
var_imp_df$var_type <- gsub("imgf|img", "Image Processing", var_imp_df$var_type)
var_imp_df$var_type <- gsub("spent$|spentDTW$", "Spectral Entropy DTW", var_imp_df$var_type)
var_imp_df$var_type <- gsub("dfDTW", "Dominant Frequency DTW", var_imp_df$var_type)
var_imp_df$var_type <- gsub("multiDTW", "Multivariate DTW", var_imp_df$var_type)
df <- data.frame(var_type = var_imp_df$var_type, spp = spp[x])
# remove RND variables
if(any(grepl("^RND$", var_imp_df$var_type))){
df <- droplevels(df[-grep("^RND$", df$var_type), ])
}
return(df)
}))
var_type_df <- data.table::rbindlist(var_type_df)
# str(var_type_df)
unique(var_type_df$var_type)
## [1] Cepstral Coefficients SPCC Dominant Frequency DTW
## [4] Acoustic Measurements Multivariate DTW Image Processing
## [7] Spectral Entropy DTW
## 7 Levels: Acoustic Measurements ... Spectral Entropy DTW
n <- 12
# pie(rep(1, n), col = heat.colors(n))
# colors by variable type
# cep, SPCC, dfDTW, sp, multiDTW, img, spentDTW
cols <- alpha(c("purple", topo.colors(12)[2], terrain.colors(12)[1], heat.colors(12)[1], heat.colors(12, alpha = 0.7)[5], heat.colors(12)[8], gray.colors(12)[5]), 0.8)
var_type_df$var_type <- factor(var_type_df$var_type, levels = c("Cepstral Coefficients", "SPCC", "Dominant Frequency DTW", "Spectral Entropy DTW", "Multivariate DTW", "Acoustic Measurements", "Image Processing"))
max_num <- max(table(var_type_df$var_type, var_type_df$spp))
ggplot(data = var_type_df, aes(fct_rev(var_type))) +
geom_bar(aes(fill = var_type), color = gray.colors(n)[4]) +
facet_wrap(~ spp) +
scale_color_manual(values = cols) +
scale_fill_manual(values = cols) +
scale_y_discrete(breaks = seq(1, max_num, 1), limits = seq(1, max_num, 1)) +
coord_flip() +
xlab("Feature Type") + ylab("Number of Features Among the Top 40 \n Most Important Variables") +
guides(fill = FALSE) +
theme_bw() + theme(legend.position = "top", legend.text = element_text(size = 10), axis.text = element_text(size = 8), axis.title = element_text(size = 10), strip.text = element_text(size = 10, face = "bold"))
# ggsave(file.path(path, "SupplementaryFigure5_RFVarImp_2spp.tiff"), units = "in", width = 5, height = 4.5, dpi = 325)
We saved the best ranger model per species to use for model validation. For MNK, the second model of manually selected features had slightly better performance than the first model with all features. For YNA, the second model also performed slightly better.
final_model_list <- list(m2list$MNK_indiv[["2500"]]$finalModel, m2list$YNA[["2000"]]$finalModel)
names(final_model_list) <- spp_train
saveRDS(final_model_list, file.path(path, "final_ranger_models_2spp.RDS"))
We then split the testing calls into a validation set and a final test set to yield acoustic similarity at the site scale.
sup_rf_set <- readRDS(file.path(path, "sup_rf_set.RDS"))
How many calls remain for validation and/or final testing? For MNK, we used all remaining repeatedly sampled individual calls for validation and the calls from the higher social scales for final testing.
col_nm <- list("indiv", "regional_dialect")
sapply(1:length(spp_train), function(x){
table(sup_rf_set[[spp_train[x]]][[col_nm[[x]]]][grep("test", sup_rf_set[[x]]$set)])
})
## [[1]]
##
## CHAC-UM5 EMBR-RAW EMBR-ZW8 UM3
## 7 4 8 5
##
## [[2]]
##
## Nic Nor Sou
## 36 73 34
We partitioned the remaining calls into validation and final site-level testing.
# initialize individuals to be designated as validation (ground-truthing) test data sets
# using the same individuals for the YNA models
validatn_indivs <- list(c("UM3", "CHAC-UM5", "EMBR-RAW", "EMBR-ZW8"), c("pnar_1", "cabu_1", "curu_1", "curu_2"))
rf_df <- sup_rf_set
# x <- 2
sup_rf_test <- invisible(pblapply(1:length(spp_train), function(x){
rf_df[[x]]$set[grep(paste(paste("^", validatn_indivs[[x]], "$", sep = ""), collapse = "|"), as.character(rf_df[[x]]$indiv))] <- "validatn"
# if MNK, site testing data set is separate
# if YNA, check which sites have more than 2 individuals remaining and assign these as site-level test sets
if(grepl("YNA", spp_train[x])){
test_labels <- as.character(rf_df[[x]]$indiv[grep("test$", rf_df[[x]]$set)])
tmp <- droplevels(rf_df[[x]][grep("test$", rf_df[[x]]$set), ])
# sites with > 2 birds remaining after assigning the ground-truthing data set
sts <- unlist(lapply(1:length(unique(tmp$site)), function(y){
len <- length(unique(tmp$indiv[grep(unique(tmp$site)[y], tmp$indiv)]))
if(len > 2) return(as.character(unique(tmp$site)[y]))
}))
# sts
# filter test labels to remove birds from sites with 2 or fewer individuals remaining
test_labels <- unique(test_labels[grep(paste(sts, collapse = "|"), test_labels)])
rf_df[[x]]$set[grep(paste(paste("^", test_labels, "$", sep = ""), collapse = "|"), rf_df[[x]]$indiv)] <- "test_site"
# if any calls remain that were not assigned to validation or final test data sets, give these calls a different label
if(length(rf_df[[x]]$set[grep("^test$", rf_df[[x]]$set)]) > 0){
rf_df[[x]]$set[grep("^test$", rf_df[[x]]$set)] <- "not_used4testing"
}
}
return(rf_df[[x]])
}))
names(sup_rf_test) <- spp_train
saveRDS(sup_rf_test, file.path(path, "sup_rf_test.RDS"))
sup_rf_test <- readRDS(file.path(path, "sup_rf_test.RDS"))
Check how many calls were assigned to each data set per species; the counts are as expected.
sapply(1:length(spp_train), function(x){
table(sup_rf_test[[x]]$set)
})
## [[1]]
##
## train validatn
## 73 24
##
## [[2]]
##
## not_used4testing test_site train validatn
## 21 86 274 36
We performed random forests validation.
final_model_list <- readRDS(file.path(path, "final_ranger_models_2spp.RDS"))
seed <- 401
spp <- c("MNK", "YNA")
cores <- parallel::detectCores() - 2
# x <- 1
rf_validatn_proxm <- invisible(pblapply(1:length(spp), function(x){
if(grepl("MNK", spp[x])){
response_col <- "indiv"
} else if(grepl("YNA", spp[x])) {
response_col <- "regional_dialect"
}
# filter by ground-truthing test set and variables retained in the final rf model
# need to change some parameter names to match between the predictor data set and the final model object, as we changed names slightly after running model 1
final_model_list[[x]]$xNames <- gsub("cepf", "cep", final_model_list[[x]]$xNames)
final_model_list[[x]]$xNames <- gsub("spentDTW", "spent_DTW", final_model_list[[x]]$xNames)
names(sup_rf_test[[x]]) <- gsub("spentDTW", "spent_DTW", names(sup_rf_test[[x]]))
final_model_list[[x]]$forest$independent.variable.names <- gsub("spentDTW", "spent_DTW", final_model_list[[x]]$forest$independent.variable.names)
final_model_list[[x]]$forest$independent.variable.names <- gsub("cepf", "cep", final_model_list[[x]]$forest$independent.variable.names)
test_df <- sup_rf_test[[x]][grep("validatn", sup_rf_test[[x]]$set), grep(paste(paste("^", c(response_col, final_model_list[[x]]$xNames), "$", sep = ""), collapse = "|"), names(sup_rf_test[[x]]))]
# extract proximity matrix
set.seed(seed)
proxm <- edarf::extract_proximity(final_model_list[[x]], test_df)
return(proxm)
}))
names(rf_validatn_proxm) <- spp
saveRDS(rf_validatn_proxm, file.path(path, "rf_validatn_proxm.RDS"))
Perform t-SNE on random forests proximity matrices for visualization.
rf_validatn_proxm <- readRDS(file.path(path, "rf_validatn_proxm.RDS"))
tsne_rf_validatn <- invisible(pblapply(1:length(spp), function(x){
X <- stats::as.dist(1 - rf_validatn_proxm[[x]], diag = TRUE, upper = TRUE)
tmp <- list()
# embed in 2 dimensions for visualization with intermediate perplexity
tmp <- lapply(1:10, function(n){
Rtsne::Rtsne(X, dims = 2, pca = FALSE, max_iter = 5000, perplexity = round((nrow(sup_rf_test[[x]][grep("validatn", sup_rf_test[[x]]$set), ]) - 1)/4), is_distance = TRUE, theta = 0.0)
})
# str(tmp)
cat(paste("spp = ", spp[x], "\n", sep = ""))
KL <- sapply(1:length(tmp), function(i){
tmp[[i]]$itercosts[100]
}, simplify = TRUE)
# overwrite tmp with the solution that minimized KL
tmp <- tmp[[which(KL == min(KL))]]
return(tmp)
}))
names(tsne_rf_validatn) <- spp
saveRDS(tsne_rf_validatn, file.path(path, "tsne_rf_validatn_2spp.RDS"))
We continued model validation via model-based clustering on random forests acoustic similarity for the MNK and YNA models. This was a useful approach to obtain classification accuracy for the MNK model trained on individuals, since the training class labels did not correspond to the validation test data set (new individuals), and as such, we could not use labels predicted by random forests to assess classification accuracy. However, the class labels for the validation data set of the YNA model (trained on dialects) corresponded to the class labels used in training, so we assessed model performance by evaluating labels predicted by random forests as well as model-based clustering.
We performed model-based clustering by restricting the number of clusters to the true number of groups (e.g. individuals or dialects) in the ground-truthing test data sets.
rf_validatn_proxm <- readRDS(file.path(path, "rf_validatn_proxm.RDS"))
seed <- 401
# maximum clusters to explore
G <- list(1:4, 1:2)
# x <- 1
mBIC_res <- invisible(pblapply(1:length(spp), function(x){
set.seed(seed)
mBIC <- mclust::Mclust(data = stats::as.dist(1 - rf_validatn_proxm[[spp[x]]], diag = TRUE, upper = TRUE), G = G[[x]])
}))
names(mBIC_res) <- spp
saveRDS(mBIC_res, file.path(path, "mBIC_res_2spp_restricted.RDS"))
Check out model-based clustering results. Here the optimal number of clusters matches the true number of clusters in the data, but we needed to also visualize results to see how well clustering assignments reflected individual or dialect identity.
tsne_rf_validatn <- readRDS(file.path(path, "tsne_rf_validatn_2spp.RDS"))
mBIC_res <- readRDS(file.path(path, "mBIC_res_2spp_restricted.RDS"))
invisible(sapply(1:length(spp), function(x){
cat(paste(spp[x], "\n", sep = ""))
print(summary(mBIC_res[[x]]))
print(mBIC_res[[x]]$modelName)
print(mBIC_res[[x]]$G)
cat("\n")
}))
## MNK
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVI (diagonal, varying volume and shape) model with 4 components:
##
## log.likelihood n df BIC ICL
## 687.2608 24 195 754.801 754.801
##
## Clustering table:
## 1 2 3 4
## 3 8 9 4
## [1] "VVI"
## [1] 4
##
## YNA
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EVI (diagonal, equal volume, varying shape) model with 2
## components:
##
## log.likelihood n df BIC ICL
## 1756.951 36 144 2997.875 2997.875
##
## Clustering table:
## 1 2
## 20 16
## [1] "EVI"
## [1] 2
Visuals of ground-truthing with model-based clustering on random forests proximity matrices reduced by t-SNE.
n <- 12
# colors for clustering
clust_cols <- c("#F8766D", "#00BFC4", terrain.colors(n)[1], topo.colors(n)[2], heat.colors(n)[1], heat.colors(n)[8])
# x <- 3
invisible(pblapply(1:length(spp), function(x){
tsne <- tsne_rf_validatn[[x]]
# subset predictor data with metadata by the ground-truthing test data set
tmp_df <- droplevels(sup_rf_test[[x]][grep("validatn", sup_rf_test[[x]]$set), ])
# create a data frame with t-SNE embeddings for plotting
if(grepl("MNK", spp[x])){
df <- data.frame(X = tsne$Y[, 1], Y = tsne$Y[, 2], indiv = as.character(tmp_df$indiv), site = as.character(tmp_df$site), cluster = as.factor(mBIC_res[[x]]$classification))
} else if(grepl("YNA", spp[x])){
df <- data.frame(X = tsne$Y[, 1], Y = tsne$Y[, 2], indiv = as.character(tmp_df$indiv), site = as.character(tmp_df$site), regional_dialect = as.character(tmp_df$regional_dialect), cluster = as.factor(mBIC_res[[x]]$classification))
}
# initialize aesthetics
if(grepl("MNK", spp[x])){
# shapes by individual
shps <- c(0, 1, 2, 6)
} else if(grepl("YNA", spp[x])){
# shapes by dialect
shps <- c(0, 1, 2)
}
if(grepl("MNK", spp[x])){
# convex hull polygons by individual
hulls <- plyr::ddply(df, "indiv", function(z){
z[chull(z$X, z$Y), ]
})
} else if(grepl("YNA", spp[x])){
# convex hull polygons by dialect
hulls <- plyr::ddply(df, "regional_dialect", function(z){
z[chull(z$X, z$Y), ]
})
}
# make graphic
if(grepl("MNK", spp[x])){
gg <- ggplot(df, aes(x = X, y = Y)) +
geom_point(aes(fill = indiv, shape = indiv), size = 4) +
geom_point(aes(color = cluster), shape = 19, size = 2) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = indiv), alpha = 0.2) +
scale_color_manual(values = clust_cols) +
scale_fill_manual(values = rep(gray.colors(n)[1], length(unique(df$indiv)))) +
scale_shape_manual(values = shps)
} else if(grepl("YNA", spp[x])){
gg <- ggplot(df, aes(x = X, y = Y)) +
geom_point(aes(fill = regional_dialect, shape = site), size = 4) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = regional_dialect), alpha = 0.2) +
geom_point(aes(color = cluster), shape = 19, size = 2) +
# geom_text(aes(x = X, y = Y - 2.7, label = indiv)) +
scale_color_manual(values = clust_cols) +
scale_fill_manual(values = rep(gray.colors(n)[1], length(unique(df$regional_dialect)))) +
scale_shape_manual(values = shps) + guides(fill = FALSE)
}
gg <- gg + xlab("t-SNE Dimension 1") + ylab("t-SNE Dimension 2") +
theme_AcousticSpace() +
ggtitle(paste(spp[x], sep = " "))
print(gg)
}))
For MNK, 22 of 24 validation calls (91.7%) were assigned to clusters consistent with individual identity, demonstrating that the random forests model learned patterns of individual distinctiveness and consistency. The YNA_dialect model performed even better, with 100% classification accuracy.
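The agreement behind these percentages can be tabulated directly by cross-classifying cluster assignments against the known labels; a minimal sketch for the MNK validation calls, assuming the mBIC_res and sup_rf_test objects loaded above and that the proximity matrix rows follow the order of the validation subset (as constructed above):
# cross-tabulate model-based cluster assignments against individual identity
mnk_val <- droplevels(sup_rf_test[["MNK_indiv"]][grep("validatn", sup_rf_test[["MNK_indiv"]]$set), ])
table(cluster = mBIC_res[["MNK"]]$classification, indiv = as.character(mnk_val$indiv))
# cluster labels are arbitrary, so also summarize agreement with the adjusted Rand index (1 = perfect correspondence)
mclust::adjustedRandIndex(mBIC_res[["MNK"]]$classification, as.character(mnk_val$indiv))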
We also checked out classification performance of the YNA_dialect model, which was possible because the training and test data shared the same classes (Northern or Southern dialect).
final_model_list <- readRDS(file.path(path, "final_ranger_models_2spp.RDS"))
sup_rf_test <- readRDS(file.path(path, "sup_rf_test.RDS"))
seed <- 401
response_col <- "regional_dialect"
# make parameter names match between random forests object and predictor data set
final_model_list[["YNA"]]$xNames <- gsub("cepf", "cep", final_model_list[["YNA"]]$xNames)
final_model_list[["YNA"]]$xNames <- gsub("spentDTW", "spent_DTW", final_model_list[["YNA"]]$xNames)
names(sup_rf_test[["YNA"]]) <- gsub("spentDTW", "spent_DTW", names(sup_rf_test[["YNA"]]))
final_model_list[["YNA"]]$forest$independent.variable.names <- gsub("spentDTW", "spent_DTW", final_model_list[["YNA"]]$forest$independent.variable.names)
final_model_list[["YNA"]]$forest$independent.variable.names <- gsub("cepf", "cep", final_model_list[["YNA"]]$forest$independent.variable.names)
# subset the predictor data set by the validation calls
test_df <- sup_rf_test[["YNA"]][grep("validatn", sup_rf_test[["YNA"]]$set), grep(paste(paste("^", c(response_col, final_model_list[["YNA"]]$xNames), "$", sep = ""), collapse = "|"), names(sup_rf_test[["YNA"]]))]
# perform random forests prediction on the validation calls
predictn <- predict(final_model_list[["YNA"]], test_df, seed = seed)
tmp <- as.data.frame(predictions(predictn))
classes <- sapply(1:nrow(test_df), function(x){
ifelse(tmp[x, grep("Nor", names(tmp))] > 0.5, "Nor", "Sou")
})
# visualize the confusion matrix
confusionMatrix(as.factor(classes), as.factor(test_df$regional_dialect))$table
## Reference
## Prediction Nor Sou
## Nor 20 0
## Sou 0 16
The YNA_dialect random forests model also classified the validation calls by regional dialect with 100% accuracy.
In sum, the model-based clustering results allowed us to evaluate the biological relevance of random forests models. The MNK model trained on individuals learned biologically relevant acoustic similarity patterns, and the YNA model trained on dialects validated well, classifying all validation calls correctly by dialect.
We predicted acoustic similarity for random forests test data sets across species.
final_model_list <- readRDS(file.path(path, "final_ranger_models_2spp.RDS"))
seed <- 401
spp <- c("MNK", "YNA")
names(final_model_list) <- spp
cores <- parallel::detectCores() - 2
Prepare final test data sets and check total number of calls used for this testing step.
# add a set variable to MNK_site calls
dim(sup_rf_pp[[2]])
sup_rf_pp[[2]]$set <- "test_site"
# combine final predictor data sets
sup_rf_test_site <- list(sup_rf_pp[[2]], sup_rf_test[["YNA"]])
saveRDS(sup_rf_test_site , file.path(path, "sup_rf_test_site.RDS"))
sup_rf_test_site <- readRDS(file.path(path, "sup_rf_test_site.RDS"))
invisible(sapply(1:length(spp), function(x){
cat(paste(spp[x], " - ", length(sup_rf_test_site[[x]]$set[grep("test_site", sup_rf_test_site[[x]]$set)]), "\n", sep = ""))
}))
## MNK - 598
## YNA - 86
We ran test calls down the final random forests model per species and retained the proximity matrix (the proportion of trees in which two calls were assigned to the same terminal node) as the predicted acoustic similarity.
rf_final_test <- invisible(pblapply(1:length(spp), function(x){
if(grepl("MNK", spp[x])){
response_col <- "indiv"
} else if(grepl("YNA", spp[x])){
response_col <- "regional_dialect"
}
# filter by ground-truthing test set and variables retained in the final rf model
final_model_list[[x]]$xNames <- gsub("cepf", "cep", final_model_list[[x]]$xNames)
final_model_list[[x]]$xNames <- gsub("spentDTW", "spent_DTW", final_model_list[[x]]$xNames)
names(sup_rf_test_site[[x]]) <- gsub("spentDTW", "spent_DTW", names(sup_rf_test_site[[x]]))
final_model_list[[x]]$forest$independent.variable.names <- gsub("spentDTW", "spent_DTW", final_model_list[[x]]$forest$independent.variable.names)
final_model_list[[x]]$forest$independent.variable.names <- gsub("cepf", "cep", final_model_list[[x]]$forest$independent.variable.names)
test_df <- sup_rf_test_site[[x]][grep("test_site", sup_rf_test_site[[x]]$set), grep(paste(paste("^", c(response_col, final_model_list[[x]]$xNames), "$", sep = ""), collapse = "|"), names(sup_rf_test_site[[x]]))]
# extract proximity matrix
proxm <- edarf::extract_proximity(final_model_list[[x]], test_df)
return(proxm)
}))
names(rf_final_test) <- spp
saveRDS(rf_final_test, file.path(path, "rf_final_test.RDS"))
Perform t-SNE on random forests proximity matrices for visualization.
rf_final_test <- readRDS(file.path(path, "rf_final_test.RDS"))
tsne_rf_final_test <- invisible(pblapply(1:length(spp), function(x){
X <- stats::as.dist(1 - rf_final_test[[x]], diag = TRUE, upper = TRUE)
tmp <- list()
# embed in 2 dimensions for visualization with intermediate perplexity
tmp <- lapply(1:10, function(n){
Rtsne::Rtsne(X, dims = 2, pca = FALSE, max_iter = 5000, perplexity = round((nrow(sup_rf_test_site[[x]][grep("test_site", sup_rf_test_site[[x]]$set), ]) - 1)/4), is_distance = TRUE, theta = 0.0)
})
# str(tmp)
cat(paste("spp = ", spp[x], "\n", sep = ""))
KL <- sapply(1:length(tmp), function(i){
tmp[[i]]$itercosts[100]
}, simplify = TRUE)
# overwrite tmp with the solution that minimized KL
tmp <- tmp[[which(KL == min(KL))]]
return(tmp)
}))
names(tsne_rf_final_test) <- spp
saveRDS(tsne_rf_final_test, file.path(path, "tsne_rf_final_test_2spp.RDS"))
We performed model-based clustering as a means of determining how clusters derived from random forests acoustic similarity reflected site identity. We used only 3 sites (PIED, PEIX, OJOS) from the MNK data set, representing the westernmost, middle, and easternmost sites along the sampling transect.
tsne_rf_final_test_2spp <- readRDS(file.path(path, "tsne_rf_final_test_2spp.RDS"))
rf_final_test <- readRDS(file.path(path, "rf_final_test.RDS"))
spp <- c("MNK", "YNA")
# make dimension names for MNK to facilitate parsing for a few sites
sites <- unique(as.character(mnk_site$General_site))
mnk_site$site_uniq_ID <- unlist(lapply(1:length(sites), function(x){
tmp <- mnk_site[grep(sites[x], mnk_site$General_site), ]
return(paste(sites[x], seq(1, nrow(tmp)), sep = "_"))
}))
seed <- 401
# restrict clusters per species to the number of groups we know are in the data, to see how well clusters match group identity
G <- list(1:3, 1:3)
mBIC_res <- invisible(pblapply(1:length(spp), function(x){
set.seed(seed)
if(grepl("MNK", spp[x])){
dimnames(rf_final_test[[spp[x]]]) <- list(mnk_site$site_uniq_ID, mnk_site$site_uniq_ID)
tmp <- rf_final_test[[spp[x]]][grep("PIED|PEIX|OJOS", dimnames(rf_final_test[[spp[x]]])[[1]]), grep("PIED|PEIX|OJOS", dimnames(rf_final_test[[spp[x]]])[[2]])]
# parse for final test data sets
} else if(!grepl("MNK", spp[x])){
nms <- as.character(sup_rf_test_site[[x]]$indiv[grep("test_site", sup_rf_test_site[[x]]$set)])
dimnames(rf_final_test[[spp[x]]]) <- list(nms, nms)
ids <- as.character(sup_rf_test_site[[x]]$indiv[grep("test_site", sup_rf_test_site[[x]]$set)])
indices <- grep(paste(paste("^", ids, "$", sep = ""), collapse = "|"), dimnames(rf_final_test[[spp[x]]])[[1]])
tmp <- rf_final_test[[spp[x]]][indices, indices]
}
mBIC <- mclust::Mclust(data = stats::as.dist(1 - tmp, diag = TRUE, upper = TRUE), G = G[[x]])
}))
names(mBIC_res) <- spp
saveRDS(mBIC_res, file.path(path, "mBIC_res_2spp_rf_testing.RDS"))
Check out model-based clustering results. Here the optimal number of clusters matches the true number of clusters in the data, but we need to visualize results to see if clusters reflect site identity.
mBIC_res <- readRDS(file.path(path, "mBIC_res_2spp_rf_testing.RDS"))
invisible(sapply(1:length(spp), function(x){
cat(paste(spp[x], "\n", sep = ""))
print(summary(mBIC_res[[x]]))
print(mBIC_res[[x]]$modelName)
print(mBIC_res[[x]]$G)
cat("\n")
}))
## MNK
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVI (diagonal, varying volume and shape) model with 3 components:
##
## log.likelihood n df BIC ICL
## 4088.455 62 374 6633.361 6633.36
##
## Clustering table:
## 1 2 3
## 30 22 10
## [1] "VVI"
## [1] 3
##
## YNA
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVI (diagonal, varying volume and shape) model with 3 components:
##
## log.likelihood n df BIC ICL
## 11269.77 86 518 20232.19 20232.19
##
## Clustering table:
## 1 2 3
## 36 27 23
## [1] "VVI"
## [1] 3
Visuals of random forests acoustic similarity matrices reduced by t-SNE. This was part of Supplementary Figure 3 in supplementary materials.
n <- 12
# fill colors and shapes by site
site_cols <- c(gray.colors(n)[1], gray.colors(n)[6], gray.colors(n)[9])
shps <- c(21, 22, 24)
# colors for clustering
clust_cols <- c("#F8766D", "#00BFC4", terrain.colors(n)[1], topo.colors(n)[2], heat.colors(n)[1], heat.colors(n)[8])
# x <- 2
invisible(pblapply(1:length(spp), function(x){
tsne <- as.data.frame(tsne_rf_final_test_2spp[[x]]$Y)
tsne$set <- sup_rf_test_site[[x]]$set[grep("test_site", sup_rf_test_site[[x]]$set)]
tsne <- tsne[grep("test_site", tsne$set), ]
# subset predictor data with metadata by the test data set
tmp_df <- droplevels(sup_rf_test_site[[x]][grep("test_site", sup_rf_test_site[[x]]$set), ])
# str(tmp_df)
# create a data frame with t-SNE embeddings for plotting
if(grepl("MNK", spp[x])){
df <- data.frame(X = tsne[, 1], Y = tsne[, 2], indiv = as.character(tmp_df$indiv), site = as.character(tmp_df$site))
# subset by the 3 sites used for visualization in MNK 44100 Hz analyses
df <- droplevels(df[grep("PIED|PEIX|OJOS", df$site), ])
# add cluster info
df <- data.frame(df, cluster = as.factor(mBIC_res[[x]]$classification))
} else if(grepl("YNA", spp[x])){
df <- data.frame(X = tsne[, 1], Y = tsne[, 2], indiv = as.character(tmp_df$indiv), site = as.character(tmp_df$site), regional_dialect = as.character(tmp_df$regional_dialect), cluster = as.factor(mBIC_res[[x]]$classification))
}
# convex hull polygons by site
hulls <- plyr::ddply(df, "site", function(z){
z[chull(z$X, z$Y), ]
})
gg <- ggplot(df, aes(x = X, y = Y)) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = site), alpha = 0.6) +
geom_point(aes(fill = site, shape = site), size = 6) +
geom_point(aes(color = cluster), shape = 19, size = 3) +
scale_color_manual(values = clust_cols) +
scale_fill_manual(values = site_cols) +
scale_shape_manual(values = shps)
gg <- gg + guides(fill = guide_legend(title = "Site"), shape = guide_legend(title = "Site"), color = guide_legend(title = "Cluster", override.aes = list(size = 2)))
gg <- gg + xlab("t-SNE Dimension 1") + ylab("t-SNE Dimension 2")
gg <- gg + theme_AcousticSpace() +
ggtitle(paste(spp[x], sep = " "))
print(gg)
}))
These visuals confirm that acoustic similarity is very weak at the site scale for MNK (model-based clustering on random forests acoustic similarity cannot reconstruct patterns of site identity), and stronger for YNA. The most distant site in acoustic space for YNA is the Nicaraguan dialect site (pink cluster). This analysis also recovers patterns of site identity within the Northern dialect calls reasonably well (blue and green clusters).
We repeated the clustering analysis for SPCC.
spp <- c("MNK_site", "YNA")
dfs <- list(mnk_site, yna)
xc_mat_list <- sapply(1:length(spp), function(x){
if(grepl("MNK", spp[x])){
tmp <- readRDS(paste(paste("xc_mat", spp[x], sep = "_"), ".RDS", sep = ""))
dimnames(tmp) <- list(as.character(dfs[[x]]$site), as.character(dfs[[x]]$site))
} else {
tmp <- readRDS(paste(paste("xc_mat", spp[x], sep = "_"), ".RDS", sep = ""))
dimnames(tmp) <- list(as.character(dfs[[x]]$bird_ID), as.character(dfs[[x]]$bird_ID))
}
return(tmp)
})
Subset each SPCC matrix by the sites in the random forests test data sets.
spp <- c("MNK", "YNA")
xc_mat_list2 <- sapply(1:length(spp), function(x){
if(grepl("MNK", spp[x])){
return(xc_mat_list[[x]])
} else if(grepl("YNA", spp[x])){
ids <- as.character(sup_rf_test_site[[x]]$indiv[grep("test_site", sup_rf_test_site[[x]]$set)])
# ids
indices <- grep(paste(paste("^", ids, "$", sep = ""), collapse = "|"), dimnames(xc_mat_list[[x]])[[1]])
return(xc_mat_list[[x]][indices, indices])
}
})
names(xc_mat_list2) <- spp
saveRDS(xc_mat_list2, file.path(path, "xc_mat_2spp_filt.RDS"))
We performed t-SNE on SPCC proximity matrices for visualization.
xc_mat_list <- readRDS(file.path(path, "xc_mat_2spp_filt.RDS"))
tsne_spcc <- invisible(pblapply(1:length(spp), function(x){
X <- stats::as.dist(1 - xc_mat_list[[x]], diag = TRUE, upper = TRUE)
# dim(X)
tmp <- list()
# embed in 2 dimensions for visualization
# nrow(sup_rf_test_site[[x]][grep("test_site", sup_rf_test_site[[x]]$set), ])
n <- round((dim(X)[1] - 1)/4)
plx <- ifelse(n > 50, 45, n)
tmp <- lapply(1:10, function(n){
Rtsne::Rtsne(X, dims = 2, pca = FALSE, max_iter = 5000, perplexity = plx, is_distance = TRUE, theta = 0.0)
})
# str(tmp)
cat(paste("spp = ", spp[x], "\n", sep = ""))
KL <- sapply(1:length(tmp), function(i){
tmp[[i]]$itercosts[100]
}, simplify = TRUE)
# overwrite tmp with the solution that minimized KL
tmp <- tmp[[which(KL == min(KL))]]
return(tmp)
}))
names(tsne_spcc) <- spp
saveRDS(tsne_spcc, file.path(path, "tsne_spcc_2spp.RDS"))
We performed model-based clustering as a means of determining how clusters derived from SPCC acoustic similarity reflected site identity. We used the same 3 sites from the MNK data set as above.
xc_mat_list <- readRDS(file.path(path, "xc_mat_2spp_filt.RDS"))
spp <- c("MNK", "YNA")
seed <- 401
# restrict clusters per species to the number of groups we know are in the data, to see how well clusters match group identity
G <- list(1:3, 1:3)
# x <- 2
mBIC_res <- invisible(pblapply(1:length(spp), function(x){
set.seed(seed)
mBIC <- mclust::Mclust(data = stats::as.dist(1 - xc_mat_list[[x]], diag = TRUE, upper = TRUE), G = G[[x]])
}))
names(mBIC_res) <- spp
saveRDS(mBIC_res, file.path(path, "mBIC_res_SPCC_2spp_restricted.RDS"))
Check out model-based clustering results. Here the optimal number of clusters matches the true number of clusters in the data, but we need to visualize results to see if clusters reflect site identity.
mBIC_res <- readRDS(file.path(path, "mBIC_res_SPCC_2spp_restricted.RDS"))
tsne_spcc <- readRDS(file.path(path, "tsne_spcc_2spp.RDS"))
invisible(sapply(1:length(spp), function(x){
cat(paste(spp[x], "\n", sep = ""))
print(summary(mBIC_res[[x]]))
print(mBIC_res[[x]]$modelName)
print(mBIC_res[[x]]$G)
cat("\n")
}))
## MNK
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVI (diagonal, varying volume and shape) model with 3 components:
##
## log.likelihood n df BIC ICL
## 567513.3 598 3590 1112074 1112074
##
## Clustering table:
## 1 2 3
## 260 172 166
## [1] "VVI"
## [1] 3
##
## YNA
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVI (diagonal, varying volume and shape) model with 3 components:
##
## log.likelihood n df BIC ICL
## 8805.694 86 518 15304.04 15304.04
##
## Clustering table:
## 1 2 3
## 36 26 24
## [1] "VVI"
## [1] 3
We made visuals of SPCC acoustic similarity matrices reduced by t-SNE with the model-based clustering results. This was also part of Supplementary Figure 4 in supplementary materials.
n <- 12
# colors for clustering
clust_cols <- c("#F8766D", "#00BFC4", terrain.colors(n)[1], topo.colors(n)[2], heat.colors(n)[1], heat.colors(n)[8])
invisible(pblapply(1:length(spp), function(x){
tsne <- as.data.frame(tsne_spcc[[x]]$Y)
tsne$set <- sup_rf_test_site[[x]]$set[grep("test_site", sup_rf_test_site[[x]]$set)]
tsne <- tsne[grep("test_site", tsne$set), ]
# subset predictor data with metadata by the test data set
tmp_df <- droplevels(sup_rf_test_site[[x]][grep("test_site", sup_rf_test_site[[x]]$set), ])
# create a data frame with t-SNE embeddings for plotting
if(grepl("MNK", spp[x])){
df <- data.frame(X = tsne[, 1], Y = tsne[, 2], indiv = as.character(tmp_df$indiv), site = as.character(tmp_df$site))
# add cluster info
df <- data.frame(df, cluster = as.factor(mBIC_res[[x]]$classification))
# subset by the 3 sites used for visualization in MNK 44100 Hz analyses
df <- droplevels(df[grep("PIED|PEIX|OJOS", df$site), ])
} else if(grepl("YNA", spp[x])){
df <- data.frame(X = tsne[, 1], Y = tsne[, 2], indiv = as.character(tmp_df$indiv), site = as.character(tmp_df$site), regional_dialect = as.character(tmp_df$regional_dialect), cluster = as.factor(mBIC_res[[x]]$classification))
}
# initialize aesthetics
# fill colors by site
# site_cols <- c("#F8766D", "#00BFC4", terrain.colors(n)[1])
site_cols <- c(gray.colors(n)[1], gray.colors(n)[6], gray.colors(n)[9])
# shapes and colors by site
shps <- c(21, 22, 24)
# convex hull polygons by site
hulls <- plyr::ddply(df, "site", function(z){
z[chull(z$X, z$Y), ]
})
gg <- ggplot(df, aes(x = X, y = Y)) +
geom_polygon(data = hulls, aes(x = X, y = Y, fill = site), alpha = 0.6) +
geom_point(aes(fill = site, shape = site), size = 6) +
geom_point(aes(color = cluster), shape = 19, size = 3) +
scale_color_manual(values = clust_cols) +
scale_fill_manual(values = site_cols) +
scale_shape_manual(values = shps)
gg <- gg + guides(fill = guide_legend(title = "Site"), shape = guide_legend(title = "Site"), color = guide_legend(title = "Cluster", override.aes = list(size = 2)))
gg <- gg + xlab("t-SNE Dimension 1") + ylab("t-SNE Dimension 2")
gg <- gg + theme_AcousticSpace() +
ggtitle(paste(spp[x], sep = " "))
print(gg)
}))
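A cross-tabulation of cluster assignments against site labels complements these plots by showing, per species, how many calls from each site fall in each cluster. A minimal sketch, not part of the original analysis, assuming mBIC_res and sup_rf_test_site are still loaded from the chunks above:
# sketch: cross-tabulate Mclust cluster assignments against site identity
invisible(lapply(1:length(spp), function(x){
metdf <- sup_rf_test_site[[x]][grep("test_site", sup_rf_test_site[[x]]$set), ]
cat(paste(spp[x], "\n", sep = ""))
print(table(cluster = mBIC_res[[x]]$classification, site = as.character(metdf$site)))
cat("\n")
}))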
rf_final_test <- readRDS(file.path(path, "rf_final_test.RDS"))
sup_rf_test_site <- readRDS(file.path(path, "sup_rf_test_site.RDS"))
We parsed the random forests proximity matrices to calculate mean acoustic similarity within and among sites (the site social scale).
spp <- c("MNK", "YNA")
# x <- 1
rf_site_stats <- pblapply(1:length(spp), function(x){
metdf <- sup_rf_test_site[[x]][grep("test_site", sup_rf_test_site[[x]]$set), ]
dimnames(rf_final_test[[x]]) <- list(as.character(metdf$site), as.character(metdf$site))
grp <- as.character(unique(metdf$site))
# s <- 2
tmp_res <- pblapply(1:length(grp), function(s){
indices <- grep(paste("^", grp[s], "$", sep = ""), dimnames(rf_final_test[[x]])[[1]])
# within group values
tmp <- rf_final_test[[x]][indices, indices]
tmp[tmp == 1] <- NA # ensure diagonals will not be included in means
# outside of group values
tmp2 <- rf_final_test[[x]][-indices, indices]
tmp2[tmp2 == 1] <- NA
# mean(tmp2[lower.tri(tmp2)])
return(data.frame(spp = spp[x], id = grp[s], means = c(round(mean(tmp[lower.tri(tmp)], na.rm = TRUE), digits = 2), round(mean(tmp2[lower.tri(tmp2)], na.rm = TRUE), digits = 2)), type = c("within_group", "among_group"), asim_type = "Random Forests"))
})
return(rbindlist(tmp_res))
})
rf_site_stats <- rbindlist(rf_site_stats)
str(rf_site_stats)
saveRDS(rf_site_stats, file.path(path, "rf_site_stats_2spp.RDS"))
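For a quick tabular check that within-site similarity exceeds among-site similarity before plotting, the long-format results can be cast to wide format. A minimal sketch, not part of the original analysis (rf_wide is our name); the same approach applies to the SPCC results calculated next.
# sketch: cast site-level means to wide format to compare within- vs. among-site similarity at a glance
rf_site_stats <- readRDS(file.path(path, "rf_site_stats_2spp.RDS"))
rf_wide <- data.table::dcast(rf_site_stats, spp + id + asim_type ~ type, value.var = "means")
# ratio of within- to among-site similarity, as used for Supplementary Figure 3 below
rf_wide$ratio <- round(rf_wide$within_group/rf_wide$among_group, digits = 2)
rf_wide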
xc_mat_list <- readRDS(file.path(path, "xc_mat_2spp_filt.RDS"))
We repeated these calculations with the SPCC similarity matrices for both species at the site scale.
spcc_site_stats <- pblapply(1:length(spp), function(x){
metdf <- sup_rf_test_site[[x]][grep("test_site", sup_rf_test_site[[x]]$set), ]
dimnames(xc_mat_list[[x]]) <- list(as.character(metdf$site), as.character(metdf$site))
grp <- as.character(unique(metdf$site))
# s <- 2
tmp_res <- pblapply(1:length(grp), function(s){
indices <- grep(paste("^", grp[s], "$", sep = ""), dimnames(xc_mat_list[[x]])[[1]])
# within group values
tmp <- xc_mat_list[[x]][indices, indices]
tmp[tmp == 1] <- NA # ensure diagonal is not included in calculation below
# outside of group values
tmp2 <- xc_mat_list[[x]][-indices, indices]
tmp2[tmp2 == 1] <- NA
return(data.frame(spp = spp[x], id = grp[s], means = c(round(mean(tmp[lower.tri(tmp)], na.rm = TRUE), digits = 2), round(mean(tmp2[lower.tri(tmp2)], na.rm = TRUE), digits = 2)), type = c("within_group", "among_group"), asim_type = "SPCC"))
})
return(rbindlist(tmp_res))
})
spcc_site_stats <- rbindlist(spcc_site_stats)
str(spcc_site_stats)
saveRDS(spcc_site_stats, file.path(path, "spcc_site_stats_2spp.RDS"))
We made a graphic to compare patterns of within- versus among-site acoustic similarity across species and acoustic similarity methods.
rf_site_stats <- readRDS(file.path(path, "rf_site_stats_2spp.RDS"))
spcc_site_stats <- readRDS(file.path(path, "spcc_site_stats_2spp.RDS"))
The code below calculates the ratio of mean within-site to mean among-site acoustic similarity for each site, species, and method, and generates Supplementary Figure 3 in supplementary materials; ratios greater than 1 indicate that calls were more similar within than among sites.
df <- rbind(rf_site_stats, spcc_site_stats)
spp <- as.character(unique(df$spp))
type_diff <- lapply(1:length(spp), function(s){
id <- unique(as.character(df$id[grep(paste("^", spp[s], sep = ""), as.character(df$spp))]))
asim_type <- as.character(unique(df$asim_type))
res <- lapply(1:length(id), function(i){
res_a <- lapply(1:length(asim_type), function(a){
inds <- which(grepl(paste(spp[s], "$", sep = ""), df$spp) & grepl(id[i], df$id) & grepl(asim_type[a], df$asim_type))
tmp <- df[inds, ]
# ratio of mean within-site to mean among-site acoustic similarity
diff <- tmp$means[grep("within", tmp$type)]/tmp$means[grep("among", tmp$type)]
if(length(diff) != 1){
diff <- NA
}
return(data.frame(spp = spp[s], id = id[i], asim_type = asim_type[a], diff = diff))
})
return(rbindlist(res_a))
})
return(rbindlist(res))
})
type_diff <- rbindlist(type_diff)
# str(type_diff)
# organize levels for boxplot as desired
type_diff$asim_type <- factor(type_diff$asim_type, levels = c("SPCC", "Random Forests"))
saveRDS(type_diff, file.path(path, "compare_SPCC_RF_stats.RDS"))
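Before plotting, the ratios can also be summarized numerically per species and method. A minimal sketch, not part of the original analysis:
# sketch: median within/among similarity ratio per species and acoustic similarity method
type_diff <- readRDS(file.path(path, "compare_SPCC_RF_stats.RDS"))
aggregate(diff ~ spp + asim_type, data = type_diff, FUN = median)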
type_diff <- readRDS(file.path(path, "compare_SPCC_RF_stats.RDS"))
ggplot(type_diff) +
geom_boxplot(aes(x = spp, y = diff, fill = asim_type), color = "black", size = 0.5, outlier.size = 0.45) +
guides(shape = FALSE, fill = guide_legend(title = "Acoustic Similarity Method")) +
geom_hline(yintercept = 1, size = 0.5, linetype = "dashed", color = "black") +
theme(panel.background = element_rect(fill = "white"), plot.background = element_rect(fill = "white"), plot.title = element_text(size = 25, hjust = 0.5),
panel.grid.major.x = element_line(size = 0.5, colour = "white"),
panel.grid.major.y = element_line(size = 0.01, colour = "black", linetype = "dashed"),
panel.grid.minor = element_line(size = 0.35, colour = "white"),
axis.line = element_line(size = 0.45, colour = "black"),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10), legend.position = "top", legend.title = element_text(size = 12), legend.text = element_text(size = 10), axis.ticks = element_line(size = 0.45)) +
xlab("Species") + ylab("Ratio of mean acoustic similarity within vs. among sites")
1. Guerra, J.E., Cruz-Nieto, J., Ortiz-Maciel, S.G. & Wright, T.F. 2008. Limited geographic variation in the vocalizations of the endangered thick-billed parrot: Implications for conservation strategies. The Condor 110(4), 639.
2. Wright, T.F. 1996. Regional dialects in the contact call of a parrot. Proc. R. Soc. Lond. B 263(1372), 867–872.
3. van der Maaten, L. & Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605.
4. van der Maaten, L. 2009. Learning a parametric embedding by preserving local structure. Journal of Machine Learning Research Proceedings Vol. 5 (AISTATS), 384–391.
5. Shamir, L., Orlov, N., Eckley, D.M., Macura, T., Johnston, J. & Goldberg, I.G. 2008. Wndchrm - an open source utility for biological image analysis. Source Code for Biology and Medicine 3, 1–13.
6. Soifua, B. 2018. A Comparison of R, SAS, and Python Implementations of Random Forests (Master's thesis). Logan, Utah: Utah State University. Retrieved from: https://digitalcommons.usu.edu/gradreports/1268.