Submitted 21 December 2016. Accepted 5 April 2017. Published 24 April 2017.
Corresponding author: Carlos J. Corrada Bravo, carlos.corrada2@upr.edu
Academic editor: Elena Marchiori
DOI 10.7717/peerj-cs.113
Copyright 2017 Corrada Bravo et al. Distributed under Creative Commons CC-BY 4.0.

Species-specific audio detection: a comparison of three template-based detection algorithms using random forests

Carlos J. Corrada Bravo (1,2), Rafael Álvarez Berríos (2) and T. Mitchell Aide (2,3)
1 Department of Computer Science, University of Puerto Rico—Río Piedras, San Juan, Puerto Rico
2 Sieve Analytics, Inc., San Juan, Puerto Rico
3 Department of Biology, University of Puerto Rico—Río Piedras, San Juan, Puerto Rico

ABSTRACT
We developed a web-based, cloud-hosted system that allows users to archive, listen to, visualize, and annotate recordings. The system also provides tools to convert these annotations into datasets that can be used to train a computer to detect the presence or absence of a species. The algorithm used by the system was selected after comparing the accuracy and efficiency of three variants of a template-based detection algorithm. The algorithm computes a similarity vector by comparing a template of a species call with time increments across the spectrogram. Statistical features are extracted from this vector and used as input to a Random Forest classifier that predicts the presence or absence of the species in the recording. The fastest algorithm variant also had the highest average accuracy and specificity; therefore, it was implemented in the ARBIMON web-based system.

Subjects: Bioinformatics, Computational Biology, Data Mining and Machine Learning
Keywords: Acoustic monitoring, Machine learning, Animal vocalizations, Recording visualization, Recording annotation, Generic species algorithm, Web-based cloud-hosted system, Random Forest classifier, Species prediction, Species-specific audio detection

INTRODUCTION
Monitoring fauna is an important task for ecologists, natural resource managers, and conservationists. Historically, most data were collected manually by scientists who went to the field and annotated their observations (Terborgh et al., 1990). This generally limited the spatial and temporal extent of the data. Furthermore, given that the data were based on an individual's observations, the information was difficult to verify, reducing its utility for understanding long-term ecological processes (Acevedo & Villanueva-Rivera, 2006).

To understand the impacts of climate change and deforestation on the fauna, the scientific community needs long-term, widespread and frequent data (Walther et al., 2002). Passive acoustic monitoring (PAM) can contribute to this need because it facilitates the collection of large amounts of data from many sites simultaneously, with virtually no impact on the fauna and environment (Brandes, 2008; Lammers et al., 2008; Tricas & Boyle, 2009; Celis-Murillo, Deppe & Ward, 2012). In general, PAM systems include a microphone or a hydrophone connected to a self-powered system with enough memory to store several weeks or months of recordings, but there are also permanent systems that
use solar panels and an Internet connection to upload recordings in real time to a cloud-based analytical platform (Aide et al., 2013).

Passive recorders can easily create a very large data set (e.g., hundreds of thousands of recordings) that is overwhelming to manage and analyze. Although researchers often collect recordings twenty-four hours a day for weeks or months (Acevedo & Villanueva-Rivera, 2006; Brandes, 2008; Lammers et al., 2008; Sueur et al., 2008; Marques et al., 2013; Blumstein et al., 2011), in practice most studies have analyzed only a small percentage of the total number of recordings. Web-based applications have been developed to facilitate the management of these increasingly large datasets (Aide et al., 2013; Villanueva-Rivera & Pijanowski, 2012), but the biggest challenge is to develop efficient and accurate algorithms for detecting the presence or absence of a species in many recordings.

Algorithms for species identification have been developed using spectrogram matched filtering (Clark, Marler & Beeman, 1987; Chabot, 1988), statistical feature extraction (Taylor, 1995; Grigg et al., 1996), the k-nearest neighbor algorithm (Han, Muniandy & Dayou, 2011; Gunasekaran & Revathy, 2010), support vector machines (Fagerlund, 2007; Acevedo et al., 2009), tree-based classifiers (Adams, Law & Gibson, 2010; Henderson, Hildebrand & Smith, 2011) and template-based detection (Anderson, Dave & Margoliash, 1996; Mellinger & Clark, 2000). Most of these algorithms, however, were built for a specific species, and no infrastructure was provided for the user to create models for other species.

In this study, we developed a method that detects the presence or absence of a species' specific call type in recordings, with a response time that allows researchers to create, run, tune and re-run models interactively, and to process hundreds of thousands of recordings in a reasonable time. The main objective of the study was to compare the performance (e.g., efficiency and accuracy) of three variants of a template-based detection algorithm and incorporate the best one into the ARBIMON II bioacoustics platform. The first variant is the Structural Similarity Index described in Wang et al. (2004), a widely used method for measuring how similar two images are (in our case, the template and the tested recording). The second variant filters the recordings with the dynamic thresholding method described in Van der Walt et al. (2014) and then uses the Frobenius norm to find similarities with the template. The final variant also uses the Structural Similarity Index, but applies it only to regions with a high match probability as determined by OpenCV's matchTemplate procedure (Bradski, 2000).

MATERIALS AND METHODS
Passive acoustic data acquisition
We gathered recordings from five locations: four in Puerto Rico and one in Peru. Some of the recordings were acquired using the Automated Remote Biodiversity Monitoring Network (ARBIMON) data acquisition system described in Aide et al. (2013), while others were acquired using the newest version of the ARBIMON permanent recording station, which uses an Android cell phone and transmits the recorded data through a cellular network.
All recordings have a sampling rate of 44.1 kHz, a sampling depth of 16 bits and an approximate duration of 60 s (±0.5 s).

Figure 1: Recording locations in Puerto Rico. Map data: Google, Image—Landsat/Copernicus and Data—SIO, NOAA, US Navy, NGA and GEBCO.

Figure 2: Recording location in Peru. Map data: Google, US Dept. of State Geographer, Image—Landsat/Copernicus and Data—SIO, NOAA, US Navy, NGA and GEBCO.

The locations in Puerto Rico were the Sabana Seca permanent station in Toa Baja, the Casa la Selva station in the Carite Mountains (Patillas), El Yunque National Forest in Río Grande and Mona Island (see Fig. 1). The location in Peru was the Amarakaeri Communal Reserve in the Madre de Dios Region (see Fig. 2). In all the locations, the recorders were programmed to record one minute of audio every 10 min.

The complete dataset has more than 100,000 1-min recordings. We randomly chose 362 recordings from Puerto Rico and 547 recordings from Peru to compare the three algorithm variants. We used the ARBIMON II web application interface to annotate the presence or absence of 21 species in all the recordings. Regions of the recordings where a species emits a sound were also marked using the web interface. Each region of interest (ROI) is a rectangle delimited by a starting time, an ending time, a lowest frequency and a highest frequency, along with a species and sound type. The species included in the analysis are listed in Table 1, along with the total number of recordings and the number of recordings where the species is present or absent.

Table 1: Species, class, location and count of recordings with validated data.

| Species | Group | Total | Presence | Absence | Location |
| Eleutherodactylus cooki | Amphibian | 38 | 19 | 19 | Carite |
| Eleutherodactylus brittoni | Amphibian | 38 | 17 | 21 | Sabana Seca |
| Eleutherodactylus cochranae | Amphibian | 54 | 30 | 24 | Sabana Seca |
| Eleutherodactylus coqui | Amphibian | 53 | 41 | 12 | Sabana Seca |
| Eleutherodactylus juanariveroi | Amphibian | 35 | 14 | 21 | Sabana Seca |
| Unknown Insect | Insect | 48 | 22 | 26 | Sabana Seca |
| Epinephelus guttatus | Fish | 152 | 76 | 76 | Mona Island |
| Megascops nudipes | Bird | 100 | 50 | 50 | El Yunque |
| Microcerculus marginatus | Bird | 80 | 40 | 40 | Peru |
| Basileuterus chrysogaster | Bird | 60 | 30 | 30 | Peru |
| Myrmoborus leucophrys | Bird | 160 | 80 | 80 | Peru |
| Basileuterus bivittatus | Bird | 100 | 50 | 50 | Peru |
| Liosceles thoracicus | Bird | 76 | 38 | 38 | Peru |
| Chlorothraupis carmioli | Bird | 112 | 56 | 56 | Peru |
| Megascops guatemalae | Bird | 28 | 8 | 20 | Peru |
| Saltator grossus | Bird | 68 | 34 | 34 | Peru |
| Myrmeciza hemimelaena | Bird | 180 | 90 | 90 | Peru |
| Thamnophilus schistaceus | Bird | 60 | 30 | 30 | Peru |
| Hypocnemis subflava | Bird | 140 | 70 | 70 | Peru |
| Percnostola lophotes | Bird | 100 | 50 | 50 | Peru |
| Formicarius analis | Bird | 80 | 40 | 40 | Peru |

Algorithm
The recognition process is divided into three phases: (1) Template Computation, (2) Model Training and (3) Detection (see Fig. 3). In Template Computation, all ROIs submitted by the user in the training set are aggregated into a template. In Model Training, the template is used to compute recognition functions from validated audio recordings, and features are extracted from the resulting vector V. These features are used to train a random forest model.
In the Detection phase, the template is again used to compute the features, but this time the features are fed to the trained random forest model to produce a prediction of presence or absence. In the following sections, the Template Computation process is explained first, then the process of using the template to extract features from a recording is presented, and finally the procedures for training the model and analyzing recordings are discussed.

Figure 3: The three phases of the algorithm to create the species-specific models. In the Model Training phase, Rec_i is a recording and V_i is the vector generated by the recognition function on Rec_i; in the Detection phase, V is the vector generated by the recognition function on the incoming recording.

Template computation
The template is the combination of all ROIs in the training data. To create a template, we start with the examples of the specific call of interest (i.e., the ROIs) that were annotated from a set of recordings for a given species and a specific call type (e.g., common, alarm). Each ROI encompasses an example of the call and is an instance of time between times t1 and t2 of a given recording, with low and high boundary frequencies f1 and f2, where t1 < t2 and f1 < f2. In a general sense, we combine these examples to produce a template of a specific song type of a single species.

Specifically, for each recording that has an annotated ROI, a spectrogram matrix (SM) is computed using the Short-Time Fourier Transform with a frame size of 1024 samples, 512 samples of overlap and a Hann analysis window; thus the matrices have 512 rows. For a recording with a sampling rate of 44,100 Hz, the frequency bin bandwidth of the matrix is approximately 43.06 Hz. The SM is arranged so that the row with index 0 represents the lowest frequency and the row with index 511 represents the highest frequency of the spectrum. Properly stated, the columns c1 to c2 and the rows r1 to r2 of SM are extracted, where c1 = ⌊t1 × 44,100⌋, c2 = ⌊t2 × 44,100⌋, r1 = ⌊f1 / 43.06⌋ and r2 = ⌊f2 / 43.06⌋. That is, the rows and columns that represent the ROI in the recording (between frequencies f1 and f2 and between times t1 and t2) are extracted. The submatrix of SM that contains only the area bounded by the ROI is defined as SM_ROI and is referred to in the manuscript as the ROI matrix.

Since the ROI matrices can vary in size, to aggregate them we have to take into account the differences in the numbers of rows and columns of the matrices. All recordings have the same sampling rate, 44,100 Hz. Thus, the rows of different SMs computed with the same parameters represent the same frequencies, i.e., rows with the same index represent the same frequency. After the ROI matrix SM_ROI has been extracted from SM, the rows of SM_ROI will also represent specific frequencies.
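As an illustration (not the authors' released code, which is archived on figshare), the spectrogram matrix and ROI matrix described above can be computed with SciPy; the function and variable names are ours, and the mapping of times to STFT frame columns here is an approximation of the published index formulas:

```python
# Minimal sketch of the spectrogram and ROI matrix extraction.
import numpy as np
from scipy.signal import stft

def roi_matrix(samples, t1, t2, f1, f2, fs=44100):
    # 1024-sample Hann frames with 512 samples of overlap, as in the text.
    freqs, times, Z = stft(samples, fs=fs, window='hann',
                           nperseg=1024, noverlap=512)
    SM = np.abs(Z)                      # spectrogram matrix; row 0 = lowest frequency
    r1, r2 = int(f1 // 43.06), int(f2 // 43.06)        # ~43.06 Hz per frequency bin
    c1, c2 = np.searchsorted(times, t1), np.searchsorted(times, t2)
    return SM[r1:r2 + 1, c1:c2 + 1]     # SM_ROI, the ROI matrix
```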
Figure 4: Flowchart of the algorithm to generate the template of each species.

Thus, if we were to perform an element-wise matrix sum between two ROI matrices with potentially different numbers of rows, we should only sum rows that represent the same frequency. To take into account the differences in the numbers of columns of the ROI matrices, we use the Frobenius norm to find the optimal alignment of the smaller ROI matrices and perform element-wise sums between rows that represent the same frequency. We present the algorithm below and a flowchart of the process in Fig. 4.

Template computation algorithm:
1. Generate the set of SM_ROI matrices by computing the Short-Time Fourier Transform of all the user-generated ROIs.
2. Create matrix SM_max, a duplicate of the first created matrix among the matrices with the largest number of columns.
3. Set c_max as the number of columns in SM_max.
4. Create matrix T_temp with the same dimensions as SM_max and all entries equal to 0. This matrix will contain the element-wise addition of all the extracted SM_ROI matrices.
5. Create matrix W with the same dimensions as SM_max and all entries equal to 0. This matrix will hold the count of the SM_ROI matrices that participate in the calculation of each element of T_temp.
6. For each of the matrices SM_i in the set of SM_ROI matrices:
   (a) If SM_i has the same number of columns as T_temp:
       i. Align the rows of SM_i and T_temp so they represent equivalent frequencies, perform an element-wise addition of the matrices and put the result in T_temp.
       ii. Add one to every element of the W matrix that participated in the previous addition.
   (b) If the number of columns differs between SM_i and T_temp, then find the optimal alignment with SM_max as follows:
       i. Set c_i as the number of columns in SM_i.
       ii. Define (SM_max)_I as the set of all submatrices of SM_max with the same dimensions as SM_i. Note that the cardinality of (SM_max)_I is c_max − c_i + 1.
       iii. For each Sub_k ∈ (SM_max)_I, compute d_k = NORM(Sub_k − SM_i), where NORM is the Frobenius norm, defined as NORM(A) = √( Σ_{(i,j)} |a_{i,j}|² ), with a_{i,j} the elements of matrix A.
       iv. Define Sub_min{d_k} as the Sub_k matrix with the minimum d_k. This is the optimal alignment of SM_i with SM_max.
       v. Align the rows of SM_i, placed at the position of Sub_min{d_k}, with the rows of T_temp so they represent equivalent frequencies, perform an element-wise addition of the matrices and put the result in T_temp.
       vi. Add one to every element of the W matrix that participated in the previous addition.
7. Define the matrix T_template as the element-wise division of the T_temp matrix by the W matrix.

The resulting T_template matrix summarizes the information available in the ROI matrices submitted by the user, and it is used to extract information from the audio recordings that are to be analyzed. In this article, each species' T_template was created using five ROIs. Figure 5A presents a training set for Eleutherodactylus coqui, and Fig. 5B presents the resulting template. This tool is very useful because the user can immediately see the effect of adding or removing a specific sample from the training set.
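To make the aggregation concrete, here is a minimal sketch of the template computation algorithm, interpreting step 6(b)v as adding SM_i at its best-aligned offset and assuming all SM_ROI matrices already share the same frequency rows; it is illustrative, not the authors' released implementation:

```python
import numpy as np

def compute_template(rois):
    """rois: list of SM_ROI arrays with equal row counts (same frequencies)."""
    sm_max = max(rois, key=lambda m: m.shape[1])      # widest ROI matrix (SM_max)
    T = np.zeros_like(sm_max, dtype=float)            # running sum (T_temp)
    W = np.zeros_like(sm_max, dtype=float)            # per-entry participation counts
    for sm_i in rois:
        ci = sm_i.shape[1]
        # Optimal alignment: the offset k minimizing the Frobenius norm
        # ||Sub_k - SM_i||_F over all c_max - c_i + 1 submatrices of SM_max.
        k = min(range(sm_max.shape[1] - ci + 1),
                key=lambda k: np.linalg.norm(sm_max[:, k:k + ci] - sm_i))
        T[:, k:k + ci] += sm_i                        # element-wise addition
        W[:, k:k + ci] += 1                           # count the participation
    return T / np.maximum(W, 1)                       # element-wise T_temp / W
```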
Model training
The goal of this phase is to train a random forest model. The input used to train the random forest is a series of statistical features extracted from vectors V_i, which are created by computing a recognition function (a similarity measure) between the computed T_template and submatrices of the spectrogram matrices of a series of recordings. In the following section we present the details of the algorithm that processes a recording to create the recognition function vector; Fig. 6 presents a flowchart of the process.

Figure 5: (A) A training set with 16 examples of the call of E. coqui. (B) The resulting template from the training set.

Figure 6: Flowchart of the algorithm to generate the similarity vector of each recording.

Algorithm to create the similarity vector:
1. Compute matrix SPEC, the submatrix of the spectrogram matrix that contains the frequencies in T_template. Note that we are dealing with recordings that have the same sample rate as the recordings used to compute T_template.
2. Define c_SPEC, the number of columns of SPEC.
3. Define c_template, the number of columns of T_template. Note that c_SPEC ≫ c_template, since the SPEC matrix has the same number of columns as the whole spectrogram, and that the T_template matrix fits c = c_SPEC − c_template + 1 times inside the SPEC matrix. There are c submatrices of SPEC with the same dimensions as T_template.
4. Define step, the step factor by which T_template is progressed over the SPEC matrix.
5. Define n = ⌊(c_SPEC − c_template) / step⌋ + 1. Note that if step = 1 then n = c. In this work, however, this parameter was set to step = 16 as a trade-off for speed. (For recordings with a sample rate of 44,100 Hz, when we calculate the STFT with a window of size 512 and a 50% overlap, one step is equivalent to 5.8 ms; therefore, 16 steps is less than 100 ms. Although this procedure may miss the strongest match, the calls are much longer than the step interval; therefore, there is a high probability of detecting the species-specific call.)
6. Define SPEC_i as the submatrix of SPEC that spans the columns from i × step to i × step + c_template.
7. Set i = 1.
8. While i ≤ n:
   (a) Compute the similarity measure meas_i for SPEC_i (the definition of meas_i for each of the three variants is provided in the following section).
   (b) Increase i by 1. Note that this is equivalent to progressing step columns in the SPEC matrix.
9. Define the vector V as the vector containing the n similarity measures resulting from the previous steps; that is, V = [meas_1, meas_2, meas_3, ..., meas_n].

Recognition function
We used three variations of a pattern-matching procedure to define the similarity measure vector V. The first is the Structural Similarity Index described in Wang et al. (2004) and implemented in Van der Walt et al. (2014) as compare_ssim, used with the default window size of seven unless the generated pattern is smaller; it is referred to in the rest of the manuscript as the SSIM variant. For the SSIM variant we define

meas_i = SSI(T_template, SPEC_i),

where SPEC_i is the submatrix of SPEC that spans the columns from i × step to i × step + c_template and has the same number of rows as T_template, and V = [meas_1, meas_2, meas_3, ..., meas_n] with n = ⌊(c_SPEC − c_template) / step⌋ + 1.
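A minimal sketch of the SSIM variant follows, using scikit-image's structural similarity function (named compare_ssim in the release the paper cites, structural_similarity in current releases); the names and the data-range handling are our assumptions, not the authors' code:

```python
import numpy as np
from skimage.metrics import structural_similarity  # compare_ssim in older releases

def similarity_vector_ssim(spec, template, step=16):
    c_spec, c_t = spec.shape[1], template.shape[1]
    n = (c_spec - c_t) // step + 1               # n = floor((c_SPEC - c_t)/step) + 1
    win = min(7, template.shape[0], c_t)         # default window of 7 unless smaller
    win -= (win + 1) % 2                         # the SSIM window size must be odd
    rng = float(spec.max() - spec.min()) or 1.0  # data range for float spectrograms
    V = np.empty(n)
    for i in range(n):
        window = spec[:, i * step:i * step + c_t]          # SPEC_i
        V[i] = structural_similarity(template, window,
                                     win_size=win, data_range=rng)
    return V                                     # V = [meas_1, ..., meas_n]
```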
The second variant applies the dynamic thresholding method (threshold_adaptive) described in Van der Walt et al. (2014), with a block size of 127 and an arithmetic mean filter, to both T_template and SPEC_i before multiplying them element-wise and applying the Frobenius norm; the result is normalized by the norm of a matrix with the same dimensions as T_template and all elements equal to one. Therefore, meas_i for the NORM variant is defined as

meas_i = FN( DTM(T_template) .* DTM(SPEC_i) ) / FN(U),

where again SPEC_i is the submatrix of SPEC that spans the columns from i × step to i × step + c_template, FN is the Frobenius norm, DTM is the dynamic thresholding method, U is a matrix with the same dimensions as T_template and all elements equal to one, and .* performs an element-wise multiplication of the matrices. Again, V = [meas_1, meas_2, meas_3, ..., meas_n] with n = ⌊(c_SPEC − c_template) / step⌋ + 1.

Finally, for the CORR variant we first apply OpenCV's matchTemplate procedure (Bradski, 2000) with the Normalized Correlation Coefficient option to SPEC_i, the submatrix of SPEC that spans the columns from i × step to i × step + c_template. For this variant, however, SPEC_i includes two additional rows above and below, so it is slightly larger than T_template. With these we can define

meas_{j,i} = CORR(T_template, SPEC_{j,i}),

where SPEC_{j,i} is the submatrix of SPEC_i that starts at row j (note that there are five such SPEC_{j,i} matrices). We then select five points at random from all the points above the 98.5 percentile of meas_{j,i} and apply the Structural Similarity Index to the five strongly-matching regions. The size of these regions is eight thirds (8/3) of the length of T_template: 4/3 before and 4/3 after the selected strongly-matched point. Define FilterSPEC as the matrix that contains these five strongly-matching regions and FilterSPEC_i as the submatrix of FilterSPEC that spans the columns from i to i + c_template; the similarity measure for this variant is then defined as

meas_i = SSI(T_template, FilterSPEC_i),

and the resulting vector is V = [meas_1, meas_2, meas_3, ..., meas_n], but this time with n = 5 × (⌊8/3 × c_template⌋ + 1).

It is important to note that no matter which variant is used to calculate the similarity measures, the result is always a vector of measurements V. The idea is that the statistical properties of these computed recognition functions carry enough information to distinguish between a recording in which the target species is present and one in which it is not. Notice, however, that since c_SPEC, the number of columns of SPEC, is much larger than c_template, the vector V for the CORR variant is much shorter than for the other two variants.

Random forest model creation
After calculating V for many recordings, we can train a random forest model. First, we need a set of validated recordings, with the specific species vocalization present in some recordings and absent in others. Then, for each recording, we compute a vector V_i as described in the previous section and extract the statistical features presented in Table 2. These statistical features constitute the dataset used to train the random forest model, which in turn is used to analyze recordings for the presence or absence of a species call event. The 12 features, along with the species presence information, are used as input to a random forest classifier with 1,000 trees.

Recording detection
To analyze a recording with the trained model, we compute the statistical features from the similarity vector V of the selected recording, in the same way as described in the previous section. These features are then used as input to the previously trained random forest classifier, which outputs a label indicating the presence or absence of the species in the recording.
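The following is a minimal sketch of these training and detection steps, assuming scikit-learn's random forest and abbreviating the Table 2 feature list (the four-bin histogram is an illustrative simplification, and the hyper-skewness/hyper-kurtosis features are omitted); it is not the authors' released code:

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.ensemble import RandomForestClassifier

def features(V):
    # A subset of the Table 2 statistics plus a coarse histogram of V.
    return np.concatenate((
        [V.mean(), np.median(V), V.min(), V.max(), V.std(),
         V.max() - V.min(), skew(V), kurtosis(V)],
        np.histogram(V, bins=4)[0]))

def train_model(vectors, present):        # present: 1 = species present, 0 = absent
    X = np.array([features(V) for V in vectors])
    return RandomForestClassifier(n_estimators=1000).fit(X, present)

def detect(model, V):
    # Predict presence (1) or absence (0) from a new similarity vector.
    return int(model.predict(features(V).reshape(1, -1))[0])
```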
Table 2: The statistical features extracted from vector V.
1. Mean
2. Median
3. Minimum
4. Maximum
5. Standard deviation
6. Maximum − minimum
7. Skewness
8. Kurtosis
9. Hyper-skewness
10. Hyper-kurtosis
11. Histogram
12. Cumulative frequency histogram

The experiment
To decide which of the three variants should be incorporated into the ARBIMON web-based system, we ran the algorithm explained in the previous section with each of the similarity measures. We computed 10-fold validations for each of the variants to obtain measurements of the performance of the algorithm. In each validation, 90% of the data is used for training and 10% is used as validation data. Each algorithm variant used the same 10-fold validation partition for each species.

The measures calculated were the area under the receiver operating characteristic (ROC) curve (AUC), accuracy or correct detection rate (Ac), negative predictive value (Npv), precision or positive predictive value (Pr), sensitivity, recall or true positive rate (Se), and specificity or true negative rate (Sp). To calculate the AUC, the ROC curve is created by plotting the false positive rate (which can be calculated as 1 − specificity) against the true positive rate (sensitivity); the AUC is then the area under that curve. Note that the further the AUC is from 0.5, the better. The rest of the measures are defined as follows:

Ac = (tp + tn) / (tp + tn + fp + fn),
Npv = tn / (tn + fn),
Pr = tp / (tp + fp),
Se = tp / (tp + fn),
Sp = tn / (tn + fp),

with tp the number of true positives (the number of times both the expert and the algorithm agree that the species is present), tn the number of true negatives (the number of times both agree that the species is not present), fp the number of false positives (the number of times the algorithm states that the species is present while the expert states it is absent) and fn the number of false negatives (the number of times the algorithm states that the species is not present while the expert states it is present). Note that accuracy is a weighted average of the sensitivity and the specificity. Although we present and discuss all measures, we gave accuracy and the AUC more importance because they include information on both the true positive and true negative rates. In particular, the AUC is important when the number of positives differs from the number of negatives, as is the case for some of the species.

The experiment was performed on a computer with an Intel i7-4790K 4-core processor at 4.00 GHz and 32 GB of RAM, running Ubuntu Linux. The execution time needed to analyze each recording was registered, and the mean and standard deviation of the execution times were calculated for each variant of the algorithm. We also computed the number of pixels in each of the T_template matrices and correlated it with the execution time of each of the variants.
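For concreteness, the measures above can be computed directly from the validation counts; a minimal sketch:

```python
def evaluation_measures(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return {
        'Ac':  (tp + tn) / total,   # accuracy / correct detection rate
        'Npv': tn / (tn + fn),      # negative predictive value
        'Pr':  tp / (tp + fp),      # precision / positive predictive value
        'Se':  tp / (tp + fn),      # sensitivity / recall / true positive rate
        'Sp':  tn / (tn + fp),      # specificity / true negative rate
    }
```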
A global one-way analysis of variance (ANOVA) was performed on the five calculated measures across all of the 10-fold validations to identify whether there was a significant difference between the variants of the algorithm. A post-hoc Tukey HSD comparison test was then performed to identify which of the variants was significantly different at the 95% confidence level. Additionally, an ANOVA was performed locally between the 10-fold validations of each species, and on the mean execution time for each species across the algorithm variants, to identify any significant execution time difference at the 95% confidence level. Similarly, a post-hoc Tukey HSD comparison test was performed on the execution times.

RESULTS
The six measurements (area under the ROC curve (AUC), accuracy, negative predictive value, precision, sensitivity and specificity) computed to compare the models across the three variants varied greatly among the 21 species. The lowest scores were among bird species, while most of the highest scores came from amphibian species. Table 3 presents a summary of the results of the measurements comparing the three variants of the algorithm (for a detailed presentation, see Appendix 1). The NORM variant did not have the highest value for any of the measures summarized in Table 3, while the CORR variant had a greater number of species at 80% or greater for all the measures and an overall median accuracy of 81%. We considered these two facts fundamental for a general-purpose species detection system.

The local per-species ANOVA suggested that there are significant accuracy differences at the 95% significance level for six of the 21 species studied, as well as four species in terms of precision and three in terms of specificity (see https://doi.org/10.6084/m9.figshare.4488149.v1). The CORR variant had the higher mean and median AUC, at 78% and 81% respectively, but the SSIM variant seems to be more stable, with a standard deviation of 20%. In terms of accuracy, both the SSIM and CORR variants have higher mean accuracy than the NORM variant. Nevertheless, the CORR variant had the highest median accuracy, at 81%, slightly higher than the median accuracy of the SSIM variant at 76%. In addition, the CORR variant had more species with an accuracy of 80% or greater. In terms of median precision, the three variants had similar values, although in terms of mean precision the SSIM and CORR variants have greater values than the NORM variant. Moreover, the median and mean precision of the SSIM variant were only 1% higher than those of the CORR variant. In terms of sensitivity, the SSIM and CORR variants had greater values than the NORM variant.
Table 3: Summary of the measures of the three variants of the algorithm. Best values are in bold.

| Summary of measures | SSIM | NORM | CORR |
| Number of species with an area under the curve of 80% or greater | 8 | 7 | 12 |
| Number of species with statistically significant area under the curve | 0 | 0 | 0 |
| Mean area under the curve | 0.76 | 0.71 | 0.78 |
| Median area under the curve | 0.75 | 0.72 | 0.81 |
| Standard deviation of area under the curve | 0.20 | 0.21 | 0.21 |
| Number of species with an accuracy of 80% or greater | 8 | 7 | 12 |
| Number of species with statistically significant accuracy | 3 | 0 | 3 |
| Mean accuracy | 0.77 | 0.73 | 0.77 |
| Median accuracy | 0.76 | 0.75 | 0.81 |
| Standard deviation of accuracy | 0.12 | 0.14 | 0.14 |
| Number of species with a negative predictive value of 80% or greater | 7 | 5 | 10 |
| Number of species with statistically significant negative predictive value | 0 | 0 | 0 |
| Mean negative predictive value | 0.73 | 0.71 | 0.74 |
| Median negative predictive value | 0.71 | 0.75 | 0.79 |
| Standard deviation of negative predictive value | 0.08 | 0.12 | 0.13 |
| Number of species with a precision of 80% or greater | 5 | 5 | 9 |
| Number of species with statistically significant precision | 2 | 0 | 2 |
| Mean precision | 0.73 | 0.68 | 0.72 |
| Median precision | 0.75 | 0.73 | 0.74 |
| Standard deviation of precision | 0.12 | 0.13 | 0.16 |
| Number of species with a sensitivity of 80% or greater | 8 | 6 | 11 |
| Number of species with statistically significant sensitivity | 0 | 0 | 0 |
| Mean sensitivity | 0.77 | 0.70 | 0.74 |
| Median sensitivity | 0.79 | 0.73 | 0.80 |
| Standard deviation of sensitivity | 0.12 | 0.16 | 0.17 |
| Number of species with a specificity of 80% or greater | 4 | 6 | 7 |
| Number of species with statistically significant specificity | 3 | 0 | 0 |
| Mean specificity | 0.69 | 0.68 | 0.72 |
| Median specificity | 0.67 | 0.70 | 0.75 |
| Standard deviation of specificity | 0.13 | 0.15 | 0.16 |
| Ratio of false positives to true positives | 0.37 | 0.47 | 0.39 |
| Ratio of false negatives to true positives | 0.45 | 0.47 | 0.39 |
| Ratio of false positives to true negatives | 0.30 | 0.43 | 0.35 |
| Ratio of false negatives to true negatives | 0.37 | 0.43 | 0.35 |

It is only in terms of specificity that the CORR variant has greater values than both other variants. Figures 7 and 8 present a summary of these results with whisker plots.

In terms of execution times, an ANOVA on the mean execution time suggests a difference between the variants (F = 9.9341e+30, df = 3, p < 2.2e−16). The CORR variant had the lowest mean execution time at 0.255 s, followed closely by the NORM variant at 0.271 s, while the SSIM variant had the slowest mean execution time at 2.269 s (Fig. 9).

Figure 7: Whisker boxes of the 10-fold validations for the three variants of the presented algorithm for: (A) area under the ROC curve and (B) accuracy.

Table 4: Summary of the execution times of the three variants of the algorithm. Best values are in bold. PPMCC is the Pearson product-moment correlation coefficient.

| Summary of execution times | SSIM | NORM | CORR |
| Mean execution time (s) | 2.27 | 0.27 | 0.26 |
| Standard deviation of execution time (s) | 3.04 | 0.06 | 0.07 |
| PPMCC between execution time and size of template | 0.96 | 0.33 | 0.11 |

The Tukey HSD test suggests that there was no statistically significant difference between the mean execution times of the NORM and CORR variants (p = 0.999). However, there was a statistically significant difference at the 95% confidence level between the mean execution times of all other pairs of variants, in particular between the SSIM and CORR variants (p < 2.2e−16). Moreover, the mean execution time of the SSIM variant increased as the number of pixels in the T_template matrix increased (Fig. 9B). There was no statistically significant relationship between the T_template pixel size and the execution time for the other two variants (Table 4).
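A minimal sketch of this significance testing, assuming SciPy's one-way ANOVA and statsmodels' Tukey HSD (the paper does not state which software was used); `ssim`, `norm` and `corr` hold per-recording execution times:

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_execution_times(ssim, norm, corr):
    F, p = f_oneway(ssim, norm, corr)        # global one-way ANOVA across variants
    values = np.concatenate([ssim, norm, corr])
    groups = np.repeat(['SSIM', 'NORM', 'CORR'],
                       [len(ssim), len(norm), len(corr)])
    tukey = pairwise_tukeyhsd(values, groups, alpha=0.05)  # 95% confidence level
    return F, p, tukey
```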
In summary, the SSIM and CORR variants outperform the NORM variant in most of the statistical measures computed, each having statistically significantly higher accuracy for three species. In terms of execution time, the CORR variant was faster than the SSIM variant (Table 3), and the mean execution time of the CORR variant did not increase with increasing T_template size (Table 4).

Figure 8: Whisker boxes of the 10-fold validations for the three variants of the presented algorithm for: (A) negative predictive value, (B) precision, (C) sensitivity and (D) specificity.

DISCUSSION
The algorithm used by the ARBIMON system was selected by comparing three variants of a template-based method for detecting the presence or absence of a species vocalization in recordings. The most important criteria for selecting the algorithm were that it work well for many types of species calls and that it be able to process hundreds of thousands of recordings in a reasonable amount of time. The CORR variant was selected because of its speed and because its detection performance was comparable to that of the SSIM variant. It achieved an AUC and accuracy of 0.80 or better for 12 of the 21 species and a sensitivity of 0.80 or more for 11 of the 21 species, and its average execution time of 0.26 s per minute of recording means that it can process around 14,000 min of recordings per hour.

Figure 9: (A) Whisker boxes of the execution times of the three algorithms. (B) Execution times as a function of the size of the template in number of pixels.

The difference in execution time between the SSIM variant and the other two was due to a memory management issue in the SSIM algorithm. An analysis revealed that all the algorithms have time complexity O((c_SPEC − c_template) × c_template × r_template), where c_SPEC and c_template are the numbers of columns in SPEC and T_template, respectively, and r_template is the number of rows in T_template. The only explanation we can offer is that the SSIM function uses a uniform filter (uniform_filter) that has a limit on the size of its memory buffer (4,000 64-bit doubles divided by the number of elements in the dimension being processed). Therefore, as the size of T_template increases, the repeated cycle of allocating and freeing the buffer can become a burden, since it has a smaller locality of reference, even when the machine has enough memory and cache to handle the process. Further investigation is required to confirm this.

An interesting comparison is with the method described by Fodor (2013) and adapted and tested by Lasseck (2013). This method was designed for the Neural Information Processing Scaled for Bioacoustics (NIPS4B) competition, and although its results are very good, execution times are not reported. As we have mentioned, it is very important to us to have a method that provides good response times, and the execution time of Lasseck's method is likely greater than ours given the extensive pre-processing that the method performs.
Table 5: Summary of the usage of the ARBIMON II system and its model creation feature.

| Number of users in the system | 453 |
| Number of recordings in the system | 1,749,551 |
| Number of models created by users | 659 |
| Total number of recordings analyzed by models | 3,780,552 |
| Number of distinct recordings analyzed | 723,054 |
| Average number of times a recording has been analyzed | 5.22 |
| Standard deviation of the number of times a recording has been analyzed | 7.78 |
| Maximum number of times a recording has been analyzed | 58 |

CONCLUSIONS AND FUTURE WORK
Now that passive autonomous acoustic recorders are readily available, the amount of data is growing exponentially. For example, one permanent station recording one minute out of every 10 min, every day of the year, generates 52,560 one-minute recordings (6 recordings/h × 24 h × 365 days). If this is multiplied by the need to monitor thousands of locations across the planet, one can understand the magnitude of the task at hand.

We have shown how the algorithm used in the ARBIMON II web-based, cloud-hosted system was selected. We compared the detection performance and the execution time of three variants of a template-based detection algorithm. The result is a method that uses the power of a widely used measure of similarity between two images, the Structural Similarity Index (Wang et al., 2004), but, to accelerate the detection process, applies it only in regions where there was a strong match as determined by OpenCV's matchTemplate procedure (Bradski, 2000). The results show that this method performed better both in terms of detection ability and execution time.

A fast and accurate general-purpose algorithm for detecting the presence or absence of a species complements the other tools of the ARBIMON system, such as options for creating playlists based on many different parameters, including user-created tags (see Table 5). For example, the system currently has 1,749,551 1-min recordings uploaded by 453 users, and 659 species-specific models have been created and run over 3,780,552 min of recordings, of which 723,054 are distinct recordings. While this research was a proof of concept, we provide the tools and encourage users to increase the size of the training data set, as this should improve the performance of the algorithm. In addition, we will pursue other approaches, such as multi-label learning (Xie et al., 2016; Zhang et al., 2016; Briggs et al., 2012).

As a society, it is fundamental that we study the effects of climate change and deforestation on the fauna, and we have to do it with the best possible tools. We are collecting a lot of data, but until recently there was no intuitive and user-friendly system that allowed scientists to manage and analyze large numbers of recordings. Here we presented a web-based, cloud-hosted system that provides a simple way to manage large quantities of recordings, together with a general-purpose method to detect the presence of species in those recordings.

ACKNOWLEDGEMENTS
The authors want to thank Marconi Campos-Cerqueira for his helpful comments on the manuscript.
APPENDIX 1
Table 6 provides a detailed presentation of the performance of each variant of the algorithm: the area under the ROC curve, mean accuracy, mean precision, mean sensitivity and mean specificity values for each species, over the 10-fold validations for the three variants of the presented algorithm (SSIM, NORM and CORR). The mean, median and standard deviation values across all species are presented at the bottom of the table.

APPENDIX 2
The templates created by the training sets of each species are presented, classified by the algorithm variant that gave the better accuracy for that species. Figure 10 presents the templates of the species for which the SSIM variant presented better accuracy, Fig. 11 presents those for which the NORM variant presented better accuracy, and Fig. 12 presents those for which the CORR variant presented better accuracy.

Figure 10: Sample of species for which the SSIM variant presented better accuracy. (A) F. analis, (B) M. hemimelaena, (C) E. cooki, (D) P. lophotes, (E) E. coqui, (F) T. schistaceus and (G) H. subflava. Species (A–C) are statistically significant.

Table 6: Area under the ROC curve (AUC), accuracy (Ac), negative predictive value (Npv), precision (Pr), sensitivity (Se) and specificity (Sp) of the 21 species and three variants of the algorithm. Best values are shaded and the cases where the ANOVA suggested a significant difference between the algorithm variants at the 95% confidence level are in bold.

| Species | SSIM: AUC Ac Npv Pr Se Sp | NORM: AUC Ac Npv Pr Se Sp | CORR: AUC Ac Npv Pr Se Sp |
| E. brittoni | 1.00 0.92 0.81 0.77 0.72 0.95 | 0.42 0.89 0.83 0.80 0.77 0.92 | 1.00 0.98 0.84 0.80 0.77 1.00 |
| E. cochranae | 1.00 0.87 0.84 0.94 0.88 0.85 | 0.88 0.72 0.70 0.81 0.77 0.68 | 1.00 0.98 0.96 1.00 0.97 1.00 |
| M. guatemalae | 0.50 0.93 0.81 0.50 0.45 0.97 | 1.00 0.97 0.82 0.50 0.45 1.00 | 0.50 0.90 0.80 0.47 0.45 0.87 |
| E. cooki | 1.00 0.96 0.85 0.77 0.77 0.97 | 0.72 0.82 0.78 0.73 0.67 0.87 | 0.88 0.89 0.82 0.72 0.73 0.92 |
| Unknown Insect | 1.00 0.90 0.79 0.84 0.75 0.82 | 1.00 0.92 0.84 0.83 0.82 0.83 | 1.00 0.90 0.79 0.84 0.75 0.82 |
| E. coqui | 0.88 0.90 0.75 0.96 0.93 0.70 | 0.92 0.86 0.75 0.88 0.96 0.47 | 1.00 0.88 0.85 0.89 0.98 0.47 |
| M. leucophrys | 0.98 0.87 0.88 0.87 0.89 0.87 | 0.77 0.76 0.79 0.74 0.81 0.72 | 0.98 0.88 0.87 0.89 0.87 0.90 |
| E. juanariveroi | 0.20 0.78 0.69 0.60 0.48 0.79 | 0.50 0.88 0.70 0.55 0.48 0.83 | 0.63 0.81 0.69 0.47 0.45 0.80 |
| M. nudipes | 0.90 0.74 0.76 0.75 0.77 0.74 | 0.84 0.81 0.84 0.80 0.85 0.79 | 0.90 0.85 0.83 0.88 0.82 0.86 |
| B. bivittatus | 0.77 0.59 0.65 0.65 0.64 0.65 | 0.90 0.74 0.78 0.73 0.80 0.73 | 0.95 0.85 0.84 0.88 0.83 0.87 |
| C. carmioli | 0.78 0.77 0.75 0.83 0.73 0.83 | 0.78 0.73 0.75 0.73 0.76 0.72 | 0.83 0.81 0.80 0.86 0.80 0.84 |
| L. thoracicus | 0.70 0.73 0.71 0.76 0.67 0.79 | 0.90 0.76 0.80 0.73 0.80 0.77 | 0.97 0.81 0.83 0.82 0.84 0.80 |
| F. analis | 0.82 0.81 0.81 0.79 0.82 0.79 | 0.68 0.63 0.65 0.63 0.69 0.57 | 0.57 0.58 0.59 0.58 0.62 0.55 |
| E. guttatus | 0.74 0.69 0.70 0.69 0.70 0.69 | 0.72 0.75 0.76 0.77 0.77 0.75 | 0.78 0.77 0.77 0.78 0.77 0.77 |
| M. hemimelaena | 0.75 0.76 0.71 0.77 0.67 0.82 | 0.61 0.59 0.59 0.58 0.60 0.57 | 0.61 0.63 0.62 0.63 0.65 0.59 |
| B. chrysogaster | 0.56 0.68 0.66 0.67 0.62 0.74 | 0.69 0.75 0.70 0.72 0.65 0.83 | 0.80 0.73 0.69 0.64 0.66 0.78 |
| S. grossus | 0.70 0.66 0.66 0.68 0.66 0.67 | 0.78 0.74 0.72 0.75 0.70 0.76 | 0.81 0.71 0.73 0.74 0.78 0.62 |
| P. lophotes | 0.73 0.71 0.68 0.73 0.63 0.78 | 0.58 0.58 0.60 0.59 0.62 0.57 | 0.65 0.61 0.63 0.62 0.64 0.61 |
| H. subflava | 0.74 0.64 0.64 0.64 0.66 0.61 | 0.51 0.51 0.51 0.52 0.53 0.49 | 0.51 0.51 0.52 0.51 0.56 0.48 |
| M. marginatus | 0.58 0.59 0.55 0.60 0.59 0.51 | 0.32 0.49 0.43 0.47 0.39 0.47 | 0.69 0.61 0.62 0.61 0.66 0.56 |
| T. schistaceus | 0.62 0.58 0.58 0.61 0.51 0.67 | 0.30 0.50 0.46 0.45 0.49 0.43 | 0.28 0.52 0.48 0.49 0.44 0.52 |
| Mean values | 0.76 0.77 0.73 0.73 0.69 0.77 | 0.71 0.73 0.71 0.68 0.68 0.70 | 0.78 0.77 0.74 0.72 0.72 0.74 |
| Median values | 0.75 0.76 0.71 0.75 0.67 0.79 | 0.72 0.75 0.75 0.73 0.70 0.73 | 0.81 0.81 0.79 0.74 0.75 0.80 |
| Standard dev. | 0.20 0.12 0.09 0.12 0.13 0.12 | 0.21 0.14 0.12 0.13 0.15 0.16 | 0.21 0.14 0.13 0.16 0.16 0.17 |

Figure 11: Sample of species for which the NORM variant presented better accuracy. (A) Unknown Insect, (B) B. chrysogaster, (C) S. grossus, (D) M. guatemalae and (E) E. juanariveroi. None is statistically significant.

Figure 12: Sample of species for which the CORR variant presented better accuracy. (A) E. cochranae, (B) M. leucophrys, (C) B. bivittatus, (D) C. carmioli, (E) M. marginatus, (F) M. nudipes, (G) E. brittoni, (H) E. guttatus and (I) L. thoracicus. Species (A–C) are statistically significant.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
The authors received no funding for this work.

Competing Interests
Carlos J. Corrada Bravo and T. Mitchell Aide are the founders of Sieve Analytics, Inc. Rafael Álvarez Berríos was an employee of Sieve Analytics, Inc.

Author Contributions
• Carlos J. Corrada Bravo conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, and reviewed drafts of the paper.
• Rafael Álvarez Berríos conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, and performed the computation work.
• T. Mitchell Aide conceived and designed the experiments, analyzed the data, wrote the paper, and reviewed drafts of the paper.

Data Availability
The following information was supplied regarding data availability:
Corrada Bravo, Carlos J (2016): Code, recordings, validations and regions of interest. figshare. https://doi.org/10.6084/m9.figshare.4488149.v1.

REFERENCES
Acevedo MA, Corrada-Bravo CJ, Corrada-Bravo H, Villanueva-Rivera LJ, Aide TM. 2009. Automated classification of bird and amphibian calls using machine learning: a comparison of methods. Ecological Informatics 4:206–214 DOI 10.1016/j.ecoinf.2009.06.005.
Acevedo M, Villanueva-Rivera L. 2006. Using automated digital recording systems as effective tools for the monitoring of birds and amphibians. Wildlife Society Bulletin 34:211–214 DOI 10.2193/0091-7648(2006)34[211:UADRSA]2.0.CO;2.
Adams MD, Law BS, Gibson MS. 2010. Reliable automation of bat call identification for Eastern New South Wales, Australia, using classification trees and AnaScheme software. Acta Chiropterologica 12(1):231–245.
Aide TM, Corrada-Bravo C, Campos-Cerqueira M, Milan C, Vega G, Alvarez R. 2013. Real-time bioacoustics monitoring and automated species identification. PeerJ 1:e103 DOI 10.7717/peerj.103.
Anderson SE, Dave AS, Margoliash D. 1996. Template-based automatic recognition of birdsong syllables from continuous recordings. The Journal of the Acoustical Society of America 100:1209–1219 DOI 10.1121/1.415968.
Blumstein DT, Mennill DJ, Clemins P, Girod L, Yao K, Patricelli G, Deppe JL, Krakauer AH, Clark C, Cortopassi KA, Hanser SF, McCowan B, Ali AM, Kirschel ANG. 2011. Acoustic monitoring in terrestrial environments using microphone arrays: applications, technological considerations and prospectus. Journal of Applied Ecology 48(3):758–767 DOI 10.1111/j.1365-2664.2011.01993.x.
Bradski G. 2000. The OpenCV Library. Dr. Dobb's Journal 25(11):120–126.
Brandes ST. 2008. Automated sound recording and analysis techniques for bird surveys and conservation. Bird Conservation International 18:163–173 DOI 10.1017/S0959270908000415.
Briggs F, Lakshminarayanan B, Neal L, Fern XZ, Raich R, Hadley SJK, Hadley AS, Betts MG. 2012. Acoustic classification of multiple simultaneous bird species: a multi-instance multi-label approach. The Journal of the Acoustical Society of America 131(6):4640–4650 DOI 10.1121/1.4707424.
Celis-Murillo A, Deppe JL, Ward MP. 2012. Effectiveness and utility of acoustic recordings for surveying tropical birds. Journal of Field Ornithology 83(2):166–179 DOI 10.1111/j.1557-9263.2012.00366.x.
Chabot D. 1988. A quantitative technique to compare and classify humpback whale (Megaptera novaeangliae) sounds. Ethology 77(2):89–102 DOI 10.1111/j.1439-0310.1988.tb00195.x.
Clark CW, Marler P, Beeman K. 1987. Quantitative analysis of animal vocal phonology: an application to swamp sparrow song. Ethology 76(2):101–115 DOI 10.1111/j.1439-0310.1987.tb00676.x.
Fagerlund S. 2007. Bird species recognition using support vector machines. EURASIP Journal on Applied Signal Processing 2007(1):1–8.
Fodor G. 2013. The ninth annual MLSP competition: first place. In: Machine Learning for Signal Processing (MLSP), 2013 IEEE International Workshop on. IEEE, 1–2.
Grigg G, Taylor A, McCallum H, Watson G. 1996. Monitoring frog communities: an application of machine learning. In: Proceedings of the Eighth Innovative Applications of Artificial Intelligence Conference, Portland, Oregon, 1564–1569.
Gunasekaran S, Revathy K. 2010. Content-based classification and retrieval of wild animal sounds using feature selection algorithm. In: Second International Conference on Machine Learning and Computing.
Han NC, Muniandy SV, Dayou J. 2011. Acoustic classification of Australian anurans based on hybrid spectral-entropy approach. Applied Acoustics 72(9):639–645.
Henderson E, Hildebrand JA, Smith MH. 2011. Classification of behavior using vocalizations of Pacific white-sided dolphins (Lagenorhynchus obliquidens). The Journal of the Acoustical Society of America 130(1):557–567.
Lammers MO, Brainard RE, Au WWL, Mooney TA, Wong KB. 2008. An ecological acoustic recorder (EAR) for long-term monitoring of biological and anthropogenic sounds on coral reefs and other marine habitats. The Journal of the Acoustical Society of America 123:1720–1728 DOI 10.1121/1.2836780.
Lasseck M. 2013. Bird song classification in field recordings: winning solution for NIPS4B 2013 competition. In: Proceedings of the International Symposium Neural Information Scaled for Bioacoustics, sabiod.org/nips4b, joint to NIPS, Nevada, 176–181.
Marques TA, Thomas L, Martin SW, Mellinger DK, Ward JA, Moretti DJ, Harris D, Tyack PL. 2013. Estimating animal population density using passive acoustics. Biological Reviews 88(2):287–309 DOI 10.1111/brv.12001.
Mellinger DK, Clark CW. 2000. Recognizing transient low-frequency whale sounds by spectrogram correlation. The Journal of the Acoustical Society of America 107:3518–3529 DOI 10.1121/1.429434.
Sueur J, Pavoine S, Hamerlynck O, Duvail S. 2008. Rapid acoustic survey for biodiversity appraisal. PLOS ONE 3:e4065 DOI 10.1371/journal.pone.0004065.
Taylor A. 1995. Bird flight call discrimination using machine learning. The Journal of the Acoustical Society of America 97(5):3370–3370.
Terborgh J, Robinson SK, Parker III TA, Munn CA, Pierpont N. 1990. Structure and organization of an Amazonian forest bird community. Ecological Monographs 60:213–238 DOI 10.2307/1943045.
Tricas TC, Boyle K. 2009. Validated reef fish sound scans of passive acoustic monitors on Hawaiian coral reefs [Abstract 2589]. The Journal of the Acoustical Society of America 125(4) DOI 10.1121/1.4783839.
Van der Walt S, Schönberger JL, Nunez-Iglesias J, Boulogne F, Warner JD, Yager N, Gouillart E, Yu T, The scikit-image Contributors. 2014. scikit-image: image processing in Python. PeerJ 2:e453 DOI 10.7717/peerj.453.
Villanueva-Rivera LJ, Pijanowski BC. 2012. Pumilio: a web-based management system for ecological recordings. Emerging Technologies 93:71–81 DOI 10.1890/0012-9623-93.1.71.
Walther G-R, Post E, Convey P, Menzel A, Parmesan C, Beebee TJC, Fromentin J-M, Hoegh-Guldberg O, Bairlein F. 2002. Ecological responses to recent climate change. Nature 416(6879):389–395 DOI 10.1038/416389a.
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4):600–612 DOI 10.1109/TIP.2003.819861.
Xie J, Michael T, Zhang J, Roe P. 2016. Detecting frog calling activity based on acoustic event detection and multi-label learning. Procedia Computer Science 80:627–638 DOI 10.1016/j.procs.2016.05.352.
Zhang L, Towsey M, Xie J, Zhang J, Roe P. 2016. Using multi-label classification for acoustic pattern detection and assisting bird species surveys. Applied Acoustics 110:91–98 DOI 10.1016/j.apacoust.2016.03.027.