

Page 1: Bioacoustics Data Analysis A Taxonomy, Survey and Open

Received December 28, 2019, accepted January 29, 2020, date of publication March 5, 2020, date of current version April 1, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.2978547

Bioacoustics Data Analysis – A Taxonomy, Survey and Open Challenges
RAMA RAO KVSN 1, JAMES MONTGOMERY 1, (Member, IEEE), SAURABH GARG 1, (Member, IEEE), AND MICHAEL CHARLESTON 2
1 School of Technology, Environments and Design, University of Tasmania, Hobart, TAS 7005, Australia
2 School of Natural Sciences, University of Tasmania, Hobart, TAS 7005, Australia

Corresponding author: Rama Rao KVSN ([email protected])

This work was supported by an Australian Government Research Training Program (RTP) Scholarship.

ABSTRACT Biodiversity monitoring has become a critical task for governments and ecological research agencies seeking to reduce the significant loss of animal species. Existing monitoring methods are time-intensive, and techniques such as tagging are also invasive and may adversely affect animals. Bioacoustics-based monitoring is becoming an increasingly prominent non-invasive method, involving the passive recording of animal sounds. Bioacoustics analysis can provide deep insights into key environmental integrity issues such as biodiversity, density of individuals and presence or absence of species. However, analysing environmental recordings is not a trivial task. In the last decade several researchers have tried to apply machine learning methods to automatically extract insights from these recordings. To help current researchers and identify research gaps, this paper aims to summarise and classify these works in the form of a taxonomy of the various bioacoustics applications and analysis approaches. We also present a comprehensive survey of bioacoustics data analysis approaches with an emphasis on bird species identification. The survey first identifies common processing steps to analyse bioacoustics data. As bioacoustics monitoring has grown, so has the volume of raw acoustic data that must be processed. Accordingly, this survey examines how bioacoustics analysis techniques can be scaled to work with big data. We conclude with a review of open challenges in the bioacoustics domain, such as multiple species recognition, call interference and automatic selection of detectors.

INDEX TERMS Bioacoustics, biodiversity, density estimation, species identification, features, syllables.

I. INTRODUCTION
Throughout human history people have developed both mutually supporting and conflicting relationships with the natural world. Significant animal species loss has been observed in recent decades due to habitat destruction, which puts environmental integrity and biodiversity at risk [1], [2]. Imbalanced biodiversity may result in undesirable effects such as changes in climatic conditions and pollution. Hence, assessing biodiversity and monitoring individual species is increasingly important to ecologists and zoologists. However, zoologists' species estimates can vary significantly, with a high degree of variation between their assessments [3]. Moreover, monitoring methods such as marking and tagging of animals, which primarily depend on visual characteristics, are invasive and may harm animals. Hence, to overcome the difficulties of invasive methods and manual data

The associate editor coordinating the review of this manuscript and approving it for publication was Prakasam Periasamy.

collection complications leading to unproductive analysis, there is a need for a sophisticated non-invasive approach to monitoring animals. As animals indicate and communicate their presence through sounds, sound monitoring is a suitable non-invasive method. The sounds generated by animals can be used to understand their distribution and behaviour. The science of sound is known as acoustics, and the subfield concerned with biologically-produced sound is bioacoustics. Animal sounds can be recorded with ease, played, synthesized and analysed to recognize animal communication [4]. Acoustic monitoring reduces the need for invasive survey techniques [5] and offers a way to monitor remote locations in a cost-effective manner. Moreover, acoustic signals provide significant information on environmental, seasonal and climatic effects on species [6].

Bioacoustics is used in a number of ecological applications including biodiversity assessment, density estimation and species identification, as well as learning about tempo-spatial behaviour, ecology and communication. In order to obtain

57684 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ VOLUME 8, 2020


R. R. KVSN et al.: Bioacoustics Data Analysis – A Taxonomy, Survey and Open Challenges

significant information about these application areas using bioacoustics, a common series of processing steps is followed, and considerable research has been conducted on various aspects of this process flow. To the best of our knowledge, there is no broad survey in this domain discussing bioacoustic process flow activities and their application to particular environmental monitoring areas. Au and Hastings [7] and Zimmer [8] have presented detailed discussions with a focus on marine bioacoustics. This paper reviews the current body of work in the field of bioacoustics, with an emphasis on bird acoustics and the problem of species identification. Based on this survey, open challenges and research opportunities are identified. The key contributions of the paper are summarized as follows:
• Presents a holistic review of state-of-the-art research works in the bioacoustics domain.

• Summarizes bioacoustics workflow steps and comprehensively presents relevant detailed studies about each step, as many previous researchers have focused on only a single step or a single animal.

• Presents a taxonomy of bioacoustics applications such as density estimation and biodiversity assessment, with a special focus on species identification.

• Reviews big data analytics for bioacoustics, to handle the massive volumes of data generated from different sources.

• Describes different bioacoustics-specific software, identifying open issues and research targets for future bioacoustics research.

The review should benefit both ecologists looking to use bioacoustics analysis and researchers investigating analysis and processing techniques for this domain.

This paper is organized as follows. Section II formally introduces bioacoustics terms and important applications of bioacoustics analyses. Section III discusses collection technologies and pre-processing methods, while Section IV discusses different audio features that are widely used in bioacoustics analysis studies. We present a taxonomy of bioacoustics analysis techniques in Section V, summarising work related to density estimation, biodiversity and species identification. Section VI presents a discussion of several techniques that have been used to identify bird sounds automatically, while Section VII discusses identification methods for other animal species in order to provide a comprehensive view. Section VIII discusses work designed to deal with big data and the various bioacoustics software that is available. Section IX discusses open research challenges in the bioacoustics field.

II. BIOACOUSTICS: OVERVIEW
Bioacoustics is the formal study of the production, transmission and reception of sound [9], [10] to gain insights into animals' relationships with their environment. Before discussing bioacoustics analysis in detail, this section provides background on bioacoustics, such as its origins, applications, call categories and activities.

A. BACKGROUND
Bioacoustics has its origins in the Second World War [11], when it was used to monitor fish sounds. During the war, acoustic devices such as sonar were developed to detect sounds in the ocean. The detected fish acoustics were primarily used for sensing, discovering and catching fish. At night, using these acoustics, fishermen were able to detect the type of fish and their abundance. Dang and Andrews [11] also described work carried out in England and Russia on fish species identification. Initially this was done by correlating the response with swim bladder harmonics. Knowledge of fish school shape was used at a later stage to identify species. Based on this work, Dang and Andrews [11] concluded that:
• The sound of one species would be qualitatively different from another.

• Sounds may vary depending on the season and behavioural context.

Riede et al. [12] also supported these findings by experimenting on dog barking in two different behavioural contexts: healthy and unhealthy dogs. Barks of dogs in these two contexts were recorded, and Riede et al. [12] established that there is considerable variation in sound between the two contexts. Hence, obtaining sounds of animals in different contexts, seasons and species may be useful for several bioacoustics analysis applications.

B. BIOACOUSTICS APPLICATIONS
Acoustics can be used to study and analyse several key phenomena:

Biodiversity: This concerns the abundance and evenness of animal species living in particular surroundings or on the earth. Assessing and maintaining biodiversity helps us maintain ecosystems, which are as essential as air and water. Bioacoustics serves as an important tool for assessing biodiversity.

Species Identification: A species can be described as a cluster of animals with similar characteristics, and each species can be classified based on biological characteristics specific to it. Bioacoustics can assist classification at the genus, species and individual levels.

Density Estimation (abundance): In a defined location, some species may be commonly found while others may be rare. Using bioacoustics, we can find relative abundance, such as the percentage of organisms found at a particular location.

Migrations: Many species relocate from one place to another due to climatic conditions, foraging or breeding. Using bioacoustics, we can monitor such relocation patterns.

Cryptic Species Detection: Species that are not native to the surroundings, that are difficult to observe, or that are rare can also be detected using bioacoustics.

Animal Communication: This can be defined as the exchange of information between a sender and a receiver animal that affects and triggers change in the emotions and behaviours of the receiver.


C. TYPES OF ACOUSTICS
In the bioacoustics domain, commonly used terms are active/passive acoustics and calls/songs. This section introduces this bioacoustic jargon.

1) ACTIVE VERSUS PASSIVE
Acoustic detection can be categorized into active and passive [13]. Active acoustics involves the production of sound by a device that converts variations into electrical signals. A transducer is typically used, with several measures taken in active acoustics to understand or identify the species population. Species identification with this method requires more information. Passive acoustics, on the other hand, encompasses obtaining data, listening to sounds and analysing them.

2) CALLS VERSUS SONGS
Bird vocalization activity is key to the exchange of information. Each bird sound is different and varies in terms of pitch, rhythm and pattern. For vocalization, birds have evolved a unique primary sound-producing organ named the syrinx [14]. Functionally, the syrinx is very similar to the vocal cords in humans. In addition to the syrinx, bird sound is coordinated by several groups of muscles in the respiratory apparatus and upper vocal tract [15]. More detailed discussion of the syrinx and call structure can be found in Bolhuis and Everaert [16] and Kershenbaum et al. [17]. The sounds made by birds can be broadly divided into two categories, calls and songs [18].
1) Bird Calls: Calls are very short and simple sounds. They fall into several categories such as mating calls, reproduction calls, feeding calls, distress calls and excitement calls. These calls can be used to perform analytic activities such as species identification. For instance, Glaw and Vences [19] considered mating calls to identify species.
2) Bird Songs: Songs are longer in duration than calls. Male and female birds produce songs of approximately the same length. Bird song is best heard during spring, and birds are most likely to sing at dawn, as it is a favourable time for several activities [20]. These songs serve to attract other birds or for defence.

In order to utilize these calls or songs for any bioacoustics application, several steps are involved, constituting a process flow.

D. BIOACOUSTICS ANALYSIS PROCESS FLOW
Many bioacoustics analyses comprise the steps illustrated in Figure 1:

FIGURE 1. Common bioacoustics process flow.

• Data Collection: This activity involves recording animal sounds in the field and obtaining the recorded audio data in digital format.

• Pre-processing: The collected raw audio data may contain noise. This step may be applied to remove noise and prepare the collected audio data for analysis.

• Syllable Extraction: Syllables are like fingerprints of the audio signal, analogous to phonemes in human speech; this step captures important segments of the signal.

• Feature Extraction: This activity captures the significant, informative properties of the signal for better understanding.

• Analysis: Various kinds of analysis can be performed using the above features.

A detailed discussion of these steps is presented in the subsequent sections.

III. DATA COLLECTION AND PREPARATION
Data collection and preparation is the first bioacoustics analysis activity. It involves gathering the data and making it ready for analysis through several pre-processing steps. This section primarily focuses on two aspects: recording and pre-processing.

A. FIELD RECORDINGS AND VISUALIZATION
For recording animal sounds, two devices are used: a microphone and a sound recorder. A detailed discussion of these two devices is presented below.

1) MICROPHONES
Microphones convert sound into electrical signals with the help of a transducer. In addition to the transducer, several other parameters such as efficiency, self-noise, polar pattern and frequency response also play a major role in recording quality. There are a variety of microphones, including dynamic microphones, piezoelectric transducers, condenser microphones, solid-dielectric microphones, electret (capacitor-based) microphones, hydrophones and directional microphones [21].

1) Dynamic Microphones: These are robust, reliable and particularly suitable for loud environments. A mechanical element in the device produces power by the process of electromagnetic induction.

2) Piezo-electric Transducers: These are particularly useful for detecting high frequencies (e.g., bats) and as microphones for musical instruments. Whenever acoustic waves are created, these transducers produce power, which makes them well suited to detecting ultrasonic sound.

3) Condenser Microphones: These microphones have a diaphragm whose movement changes the capacitance in a condenser. This change in capacitance is transformed into electrical energy.

4) Solid-dielectric Microphones: These have to be charged by an external supply or internal batteries. They are best suited for bat detection, but their mechanical membranes are subject to humidity changes, giving scope for noise introduction.

5) Electret Microphones: These have low power requirements, being pre-charged. The recently introduced Micro-Electrical-Mechanical System (MEMS) microphones come under this category. These devices are small and very sensitive.

6) Hydrophones: These are particularly useful for detecting underwater sounds. The sensitivity issue in electret microphones is addressed by utilizing a piezoelectric element, which produces voltage when compressed by sound waves. These are omni-directional single-transducer microphones, covering a frequency range up to 100 kHz. To monitor larger regions, an array of hydrophones can be used. A pack of hydrophones coupled with amplifiers and a transmitter in a pressure-resistant container is capable of recording even underwater sounds.

7) Directional Microphones: These microphones are configured to focus on sound originating from one direction or a single source. Directional microphone recording can serve two purposes. One is ambiance recording, used to understand the detailed audio characteristics of a specific environment; all sounds in that environment are recorded. The other is species recording, where sound is selectively recorded from a single source, avoiding noise from other sources. For this purpose devices such as parabolic receivers are preferred, because they reduce unwanted noise coming from other directions and focus only on sound coming from the desired object.

2) SOUND RECORDERS
As discussed earlier, microphones convert sounds to electrical signals, and sound recorders subsequently record these generated electrical signals. Sound recorders can be of two types: analog or digital. Analog recorders such as cassettes suffer from greater signal degradation. Digital recorders have replaced analog ones, overcoming their major disadvantages. Digital recorders usually store the audio files in uncompressed wave (.wav) format. Such .wav files are stored digitally, and the volume stored depends on available storage capacity. Hence, long-term acoustic monitoring programs have to be planned properly, with sufficient automated recording systems deployed and backed up with enough power and storage facilities. More detailed discussion of bioacoustics prerequisites, precautions while recording, hardware, power requirements and other information has been provided by the Bioacoustic Diversity group [22]. The recorded sounds can be archived, like any other digital data, on a CD, DVD or Blu-ray disc. The recorded sound files should be documented with metadata such as locality, temperature and frequency.

Once audio data is obtained in digital format, it can bevisualized using different tools.

3) SOUND VISUALIZATION
Visualizing acoustic properties such as temporal and spectral characteristics is important for analysis. Many tools are available to visualize, play and edit audio recordings. Two commonly used visualizations are oscillograms and spectrograms.

1) Oscillogram: This is a basic graphical display of voltage and amplitude revealing temporal changes in sound. It exposes the signal frequency composition data at a particular moment.

2) Spectrogram: This is the most widely used visualization tool in bioacoustics, as it displays three dimensions: frequency, time and amplitude. The visualization is constructed by Fourier decomposition.
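As a concrete illustration, the Fourier-decomposition construction of a spectrogram can be sketched with SciPy. The 2 kHz tone, 16 kHz sample rate and window sizes below are illustrative choices, not values from any surveyed study:

```python
import numpy as np
from scipy.signal import spectrogram

# Synthetic stand-in for a recorded call: a 2 kHz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs                     # 1 s of audio
signal = np.sin(2 * np.pi * 2000 * t)

# Short-time Fourier decomposition into the three spectrogram dimensions:
# frequency (rows), time (columns) and amplitude (cell values).
freqs, times, Sxx = spectrogram(signal, fs=fs, nperseg=512, noverlap=256)

# The dominant frequency across frames should sit at the tone's 2 kHz.
peak_freq = freqs[Sxx.mean(axis=1).argmax()]
```

For comparison, the oscillogram of the same clip is simply the raw waveform plotted against time (e.g. `plt.plot(t, signal)`), while `plt.pcolormesh(times, freqs, Sxx)` renders the spectrogram.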

Once data are collected and optionally visualized, the nextprocess is often to remove noise from the audio.

B. NOISE REMOVAL
In the context of bioacoustics, noise can be interpreted as unwanted sound added to the desired sound involuntarily. These unwanted sounds may originate from several sources including airplanes, wind and rain [23]. Such noise should be processed and removed. Several standard methods can be applied to noisy audio signals; this section discusses such noise processing methods.

1) WIENER FILTER METHODS
Lim et al. [24] pointed out that, relative to phase, short-time spectral amplitude is important for speech quality. Grounded in the concept of short-time spectral amplitude, noise removal techniques can be broadly classified into two categories [24]. The first category uses spectral subtraction for explicit estimation of the short-time spectral magnitude, which removes noise and enhances the signal. The second category of speech signal enhancement is based on Wiener filtering. In this method, several filters are obtained from the degraded speech, and an optimal filter is found among them. To estimate the noise, this optimal filter is applied in the time or frequency domain. Since a zero-phase frequency response is computed using the Wiener filter, the spectral amplitude is enhanced while keeping the filtered speech phase similar to that of the degraded speech.
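The first category can be sketched in a few lines of NumPy/SciPy. This is a minimal spectral-subtraction sketch, not Lim et al.'s exact formulation: the synthetic "call", noise level and number of noise-only frames are all illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
fs = 8000
# Synthetic recording: 0.25 s of silence, then a 1 kHz "call", plus white noise.
clean = np.concatenate([np.zeros(fs // 4),
                        np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)])
noisy = clean + 0.3 * rng.standard_normal(clean.size)

f, seg_times, Z = stft(noisy, fs=fs, nperseg=256)

# Estimate the noise magnitude spectrum from the leading noise-only frames.
noise_mag = np.abs(Z[:, :10]).mean(axis=1, keepdims=True)

# Spectral subtraction: subtract the estimate from each frame's magnitude,
# floor at zero, and resynthesize using the noisy phase.
mag = np.maximum(np.abs(Z) - noise_mag, 0.0)
_, enhanced = istft(mag * np.exp(1j * np.angle(Z)), fs=fs, nperseg=256)

err_before = float(np.mean((noisy - clean) ** 2))
err_after = float(np.mean((enhanced[:clean.size] - clean) ** 2))
```

Note that this sketch assumes the opening frames contain only noise, which is exactly the assumption that can fail for field recordings with continuous calls.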

2) SIGNAL SUBSPACE METHODS
Ephraim and Van Trees proposed the subspace method [25]. The basic principle lies in decomposing the noisy signal vector space into two subspaces, a noise subspace and a signal-plus-noise subspace, using the Karhunen-Loeve transform (KLT). On these subspaces, linear estimation is applied frame by frame using two criteria. The first estimator minimizes signal distortion power for a given average residual noise power; this is called the time domain constrained estimator. The second, called the spectral domain constrained estimator, again minimizes signal distortion power, but with the condition that the residual noise must be less than a threshold. KLT is applied in both estimators. The KLT decomposition is obtained by computing an eigendecomposition of the Toeplitz covariance estimate of the noisy vector. The signal subspace KLT components are then modified by either of the estimators, while the KLT components of the noise subspace are nullified. Whereas the spectral subtraction method uses the DFT, the subspace method uses the KLT to decompose the noisy signal vector space. Ephraim and Van Trees [25] reported that the subspace method performed better than spectral subtraction, with less distortion and no residual noise, under the assumption that the noise is additive and white. The authors noted that coloured noise could also be whitened. Lev-Ari and Ephraim [26] extended the signal subspace approach to coloured noise processes by proposing explicit forms of the time and spectral domain estimators.
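The subspace idea can be illustrated with NumPy. This is a deliberate simplification of Ephraim and Van Trees's method: it uses an empirical (approximately Toeplitz) covariance estimate, takes its eigendecomposition as the KLT, and simply nullifies the noise-subspace components rather than applying the constrained estimators; the sinusoid-plus-white-noise signal and the eigenvalue threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim = 4000, 32                           # frame count and frame length
t = np.arange(n + dim)
clean = np.sin(2 * np.pi * 0.05 * t)        # low-rank "signal" component
x = clean + 0.4 * rng.standard_normal(t.size)

# Overlapping frames of the noisy signal and their covariance estimate.
frames = np.stack([x[i:i + dim] for i in range(n)])
clean_frames = np.stack([clean[i:i + dim] for i in range(n)])
cov = frames.T @ frames / n

# KLT = eigendecomposition of the covariance. Eigenvalues near the noise
# variance belong to the noise subspace; larger ones to signal-plus-noise.
eigvals, eigvecs = np.linalg.eigh(cov)
noise_floor = np.median(eigvals)            # crude white-noise variance estimate
keep = eigvals > 2 * noise_floor            # signal-plus-noise subspace

# Nullify the noise-subspace KLT components, then transform back.
coeffs = frames @ eigvecs
coeffs[:, ~keep] = 0.0
enhanced = coeffs @ eigvecs.T
```

A pure sinusoid spans only a two-dimensional subspace, so almost all of the white-noise energy lives in the discarded components, which is why the reconstruction is cleaner than the input.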

3) STATISTICAL METHODS
1) Minimum Mean Square Error (MMSE) method: Ephraim and Malah [27] stated that neither spectral subtraction nor the Wiener filter is an optimal spectral amplitude estimator. These authors proposed deriving an optimal short-time spectral amplitude (STSA) estimator from noisy observations using the MMSE criterion. To derive an MMSE STSA estimator, the probability distributions of the Fourier expansion coefficients of speech and noise must be known. They can be obtained using either a statistical model or direct measurement of the probability distributions. Since the probability distributions of the samples are time-varying, measuring them would be complicated and impractical. Hence a statistical model utilizing asymptotic properties of the Fourier expansion coefficients was used by the authors. These Fourier coefficients are assumed to be independent Gaussian random variables. This estimator also considers the presence of speech signals in the noisy observations. Since the complex exponential of the MMSE does not affect STSA estimation, the derivation proceeds from the noisy signal.

2) Unified Approach: Model-based methods have become popular in speech enhancement. Ephraim [28] proposed a unified statistical framework to address speech enhancement issues such as quality, intelligibility, and the robustness of speech coders and recognition systems to noise. Vector quantization and hidden Markov models (HMMs) are two powerful statistical techniques that have found good application in the speech enhancement domain for signal estimation and classification of noisy signals. To design speech enhancement systems with HMMs, the type of HMM used for the signal and noise, and the distortion measures, play a key role. Ephraim [28] further discusses how HMMs can be extended in several ways to resolve noise problems.

However, all these methods (Wiener, signal subspace, statistical) assume that noise can be estimated from the first few frames determined by voice activity detectors (VADs) [23].

Cai et al. [23] argued that it is not practical to assume that the first few frames contain only noise: in bioacoustics recordings, bird calls without any gap in the first few frames are common, and in such situations noise estimation algorithms based on this assumption will suffer. In addition, VADs do not work at low signal-to-noise ratios (SNR). To address this, Cai et al. proposed a novel noise reduction algorithm which can estimate noise from every frame [23]. This algorithm also eliminates the need for a VAD.

After noise removal, the next step is to extract features. Different audio features used by various studies in bioacoustic analyses are presented in the next section.

IV. AUDIO FEATURES USED FOR BIOACOUSTICS ANALYSIS
This section discusses three different types of features that have been used in bioacoustics studies during the process of analysis: acoustic features, syllables and other features.

A. ACOUSTIC FEATURES
1) SPECTRAL AND TEMPORAL
These are features extracted from frequency and time plots of the signal using oscillograms or spectrograms. Commonly used features are start frequency, end frequency, maximum frequency, minimum frequency, middle frequency, maximum-intensity frequency and duration.
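Such features can be read off a thresholded spectrogram. The sketch below, with an entirely synthetic 3 kHz "syllable" and an arbitrary 10% power threshold, shows one way to extract duration and minimum, maximum and peak frequency; it is illustrative rather than a method from any surveyed study:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 22050
t = np.arange(int(0.5 * fs)) / fs
# Synthetic syllable: a 3 kHz tone lasting 0.2 s within a 0.5 s clip.
call = np.zeros(t.size)
on = (t >= 0.1) & (t < 0.3)
call[on] = np.sin(2 * np.pi * 3000 * t[on])

freqs, times, Sxx = spectrogram(call, fs=fs, nperseg=256)

# Keep time-frequency cells above 10% of the peak power, then read off
# simple temporal (duration) and spectral (min/max/peak frequency) features.
mask = Sxx > 0.1 * Sxx.max()
active_t = times[mask.any(axis=0)]
active_f = freqs[mask.any(axis=1)]
features = {
    "duration_s": float(active_t[-1] - active_t[0]),
    "min_freq_hz": float(active_f.min()),
    "max_freq_hz": float(active_f.max()),
    "peak_freq_hz": float(freqs[Sxx.max(axis=1).argmax()]),
}
```

The threshold controls how far spectral leakage widens the reported frequency band, which is one reason published studies report these features alongside their measurement settings.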

2) MEL FREQUENCY CEPSTRAL COEFFICIENTS (MFCC)
MFCCs are extensively applied in human speech identification. Human and certain animal auditory systems possess similarities, and the physical characteristics of their basilar membranes yield similar frequency characteristics [29]. Cai et al. [23] also discussed several similarities between humans and birds with regard to hearing, vocal tract and auditory processing. These similarities make MFCCs useful for species identification across a diverse set of animals such as frogs, crickets [30]–[32] and birds [31], [33].

Neither the human nor the animal perception scale is linear [30], [31], with critical band frequencies strongly influenced by energies. To resolve such non-linearity, MFCCs streamline the frequencies across the continuum. Further, MFCCs offer several advantages: they are simple, robust and computationally efficient. They have good accuracy, require no performance tuning, and achieve exceptional recognition rates irrespective of call type [30], [32]. MFCC computation typically involves these steps:

1) Audio signal is split into short frames typically rangingbetween 20 ms to 30 ms

2) Since spectral content does not exist in every frame,segmentation irregularities exist which are rectified bymultiplying each frame with a Hamming window. Thiswill minimize signal discontinuities and create frameoverlapping with good resolution.

57688 VOLUME 8, 2020

Page 6: Bioacoustics Data Analysis A Taxonomy, Survey and Open

R. R. KVSN et al.: Bioacoustics Data Analysis – A Taxonomy, Survey and Open Challenges

3) A Fourier transform is applied to each Hamming-windowed frame to calculate the power spectrum and identify the frames of interest.

4) The resulting spectrum is mapped to the mel-frequency scale.

5) A filter bank is applied to the mel-scaled spectrum to obtain the energy in each frequency region.

6) A logarithm is applied to obtain the log energy of each filter.
7) These log energies are transformed to the cepstral domain by applying a discrete cosine transform (DCT).
8) The DCT results in a 64-dimensional cepstral feature vector referred to as the MFCC.

Classically, only the lower 12 of the 64 DCT values are retained, to reduce complexity and increase accuracy.
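The steps above can be sketched in Python using only NumPy and SciPy. This is a simplified, illustrative implementation, not the exact pipeline of any surveyed work; the function name, parameter defaults and filter-bank construction are our own choices (64 filters and truncation to the lower 12 coefficients follow the text):

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    # Standard mel-scale mapping
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_ms=25, step_ms=10, n_filters=64,
         n_ceps=12, n_fft=512):
    # Steps 1-2: split into short overlapping frames and apply a Hamming window
    # (assumes the signal is at least one frame long)
    frame_len = int(sr * frame_ms / 1000)
    step = int(sr * step_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // step)
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * step:i * step + frame_len] * window
                       for i in range(n_frames)])
    # Step 3: power spectrum via the FFT
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Steps 4-5: mel-spaced triangular filter bank energies
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Step 6: log filter-bank energies (small offset avoids log of zero)
    energies = np.log(power @ fbank.T + 1e-10)
    # Steps 7-8: DCT to the cepstral domain; keep the lower coefficients
    return dct(energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

With 25 ms frames and a 10 ms step, one second of 16 kHz audio yields a 98 × 12 feature matrix.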

3) ENTROPY BASED
To identify frog species, Han et al. [34] applied a hybrid spectral entropy method extracting several features such as spectral centroid, Shannon entropy and Renyi entropy. To enhance species identification, this study specifically combined two types of entropies into a hybrid feature system.
• Spectral centroid: This is very useful in pattern recognition. It represents the centre point of the spectrum, where the sound is brightest.

• Shannon entropy: This quantifies the richness of the sound signal. After the species richness index, it is the most preferred index, as it captures the evenness of relative abundances in a habitat. This measure characterizes acoustic diversity even at time units with a low-probability amplitude mass function. Shannon entropy thus serves as an information-content measure that quantifies diversity.

• Renyi entropy: This has found extensive application in signal processing, data mining, classification and segmentation. In information theory, Renyi entropy is an estimate of the noise introduced during signal transfer. In this study, the authors used Renyi entropy to identify the noise content and its complexity.

Dayou et al. [35] also used Shannon and Renyi entropy, together with Tsallis entropy, as features. Tsallis entropy, also referred to as q-entropy, is another generalization of Shannon entropy. Their study utilized Tsallis entropy to measure signal complexity.
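These entropy features can be sketched as follows. This is an illustrative computation over the normalised power spectrum treated as a probability mass function; the function and its parameters are our own, not the exact formulations of [34] or [35]:

```python
import numpy as np

def entropy_features(signal, q=2.0, n_fft=1024, sr=44100):
    # Normalised power spectrum as a probability mass function
    power = np.abs(np.fft.rfft(signal, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    p = power / power.sum()
    centroid = np.sum(freqs * p)                  # centre of mass of the spectrum
    p = p[p > 0]
    shannon = -np.sum(p * np.log2(p))
    renyi = np.log2(np.sum(p ** q)) / (1.0 - q)   # order-q Renyi entropy
    tsallis = (1.0 - np.sum(p ** q)) / (q - 1.0)  # q-entropy (Tsallis)
    return centroid, shannon, renyi, tsallis
```

A pure tone concentrates the spectrum in one bin, giving a centroid at the tone frequency and near-zero entropies; broadband noise gives much larger entropy values.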

B. SYLLABLES
The process of dividing an audio signal into fragments is called segmentation. Bird vocalization hierarchies can be divided into notes, syllables, phrases and calls. Among these, syllables are ideal for bird species recognition, as phrases and calls vary more by region and individual [36]. Syllables are brief sounds which a species produces with the lungs in a single blow of air, and hence can be considered similar to phonetic sounds in humans [29].

Somervuo et al. [37] described syllables as an organized sequence of brief sounds from a species-specific vocabulary.
The process of syllable segmentation primarily involves computing a threshold on the signal. Using this threshold and the signal energy level, syllables can be identified: the start of a syllable is the point where signal energy first exceeds the threshold, and the end point is where the energy drops back below it. The signal between the start and end points is the syllable [30]. The number of syllables varies from 12 to 96 depending on the species [34].
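A minimal sketch of this threshold-based segmentation follows; the frame length and the threshold ratio (a fraction of the maximum frame energy) are illustrative assumptions, not values from the surveyed works:

```python
import numpy as np

def segment_syllables(signal, sr, frame_ms=10, threshold_ratio=0.1):
    # Short-time energy per non-overlapping frame
    flen = int(sr * frame_ms / 1000)
    n = len(signal) // flen
    energy = np.array([np.sum(signal[i * flen:(i + 1) * flen] ** 2)
                       for i in range(n)])
    threshold = threshold_ratio * energy.max()
    active = energy > threshold
    # A syllable starts where energy first exceeds the threshold
    # and ends where it drops back below it
    syllables, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            syllables.append((start * flen, i * flen))
            start = None
    if start is not None:
        syllables.append((start * flen, n * flen))
    return syllables  # list of (start_sample, end_sample) pairs
```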

The predominant way to perform syllable identification is through spectrogram analysis and manual labeling. However, this manual process is time-intensive and subjective. Towards automating it, Harma [38] used a sinusoidal parameterization model to extract syllables; each syllable is parameterized by representing its frequency and amplitude trajectories. To improve recognition accuracy and characterize syllable harmonics, Somervuo and Harma [39] introduced four supplementary parameters. Kogan and Margoliash [33] experimented with syllable extraction using HMMs.

As discussed earlier, the bird acoustics hierarchy contains notes at the lowest level and syllables at the next higher level. Several works have been based on syllables and their feature extraction. However, the notes below syllables may also be used for classification. Graciarena et al. [40] created bird species models from note sequences. Note sequence models were produced using unsupervised clustering techniques; the note models are trained by performing vector quantization of acoustic features with two-pass index alignment training. To create bird species models, a note n-gram model with a support vector machine (SVM) is used to capture note sequence statistics. Compared with Gaussian mixture models (GMM), the note model is superior in several aspects such as multi-scale modeling, syllable modeling and modeling longer time dependencies. The features are obtained by computing note loop lattices, extracting n-gram statistics and normalizing them; the SVM is used for training.

Any speech recognition system depends on the vocabulary and grammar of the language, and bird acoustics also have a vocabulary and grammar whose structured brief sounds are syllables. Lakshminarayanan et al. [41] developed probabilistic models that process these syllables for species identification. Inspired by document classification, syllables are treated as words and species as topics, and a probabilistic model is built accordingly. The authors introduced two models for syllable frames: the Independent Frame Independent Syllable (IFIS) model and the Markov Chain Frame Independent Syllable (MCFIS) model. A maximum a posteriori rule is then derived for each model. Through numerical evaluation, the authors concluded that classifier performance is competitive.


From the above, segmenting audio into pulses or syllables has become essential for most ecological machine learning problems. However, audio from field recorders contains considerable noise, which compromises classification accuracy. Hence, for syllable extraction from noisy signals, Neal et al. [42] proposed a supervised audio segmentation model which transforms the input signal into a spectrogram representation and assigns a binary mask label to each time-frequency slot as either bird sound or noise. This extracts individual song syllables, even when syllables overlap. The binary mask is evaluated against manually-labelled ground truth using two metrics: true positive rate (TPR) versus false positive rate (FPR), and energy-weighted TPR versus FPR. Experiments indicate that the method achieves 90.5% TPR on the first metric and 93.6% TPR on the second, suggesting that it outperforms segmentation by energy thresholding.

Most research on syllable extraction is call dependent. Cheng et al. [31] developed call-independent identification of 10 passerine (perching bird) species using four machine learning methods: radial basis function networks (RBFN, a special kind of artificial neural network (ANN)), SVM, HMM and GMM. The models are trained on any type of syllable and used on the same or different types.

In the Multi-Instance Multi-Label (MIML) framework developed by Briggs et al. [43], the classified objects constitute bags (audio recordings) of instances (syllables) with class labels (sets of species). A bag generator algorithm converts recordings into bags of instances.

Colonna et al. [44] presented a technique to treat syllables in real time by storing time series statistics instead of using sliding windows. The beginning and end of each syllable are detected and the noise part is eliminated. Incremental versions of the signal energy (E) and zero crossing rate (ZCR) equations are computed. The power of these incremental equations is that they retain past decisions even as new samples arrive; they also use few parameters, consuming little memory and processing. However, they suffer from a high false positive rate. To handle this, a mode filter is utilized which takes the most frequently observed value as correct. A k-NN classifier is then used for classification. Experiments demonstrated a 37% improvement, and the approach is well suited for wireless sensor networks.
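The incremental idea can be sketched as a per-sample update of energy and zero crossings that needs no sliding-window buffer. The class and its interface below are our own illustration, not the formulation of [44]:

```python
class IncrementalEZCR:
    """Running signal energy and zero-crossing count, updated one sample
    at a time, with O(1) memory (an illustrative sketch)."""

    def __init__(self):
        self.energy = 0.0
        self.crossings = 0
        self.count = 0
        self._prev = None

    def update(self, x):
        # Energy accumulates the squared sample; a crossing is a sign change
        self.energy += x * x
        if self._prev is not None and (x >= 0) != (self._prev >= 0):
            self.crossings += 1
        self._prev = x
        self.count += 1

    def zcr(self):
        # Zero crossing rate over the samples seen so far
        return self.crossings / max(self.count - 1, 1)
```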

Following syllable identification, the next step is to extract various features from these syllables or directly from the signal. This is discussed in the next section.

C. OTHER FEATURES
1) LINEAR PREDICTIVE CODING (LPC)
LPC provides a robust approximation of the vocal tract's spectral envelope based on past samples. LPC captures this relationship as a set of coefficients, called LPC coefficients, which can be computed using autocorrelation analysis and Durbin's method [45].

2) CODEBOOK OF FRAME-LEVEL FEATURES
Any bird classification algorithm begins by extracting audio signal features. Such features are typically extracted from single frames, which are very short segments. However, Briggs et al. [46] pointed out that because single-frame features are insufficient, it is advantageous to extract features from multiple frames to represent a sound before applying standard classification algorithms. Features extracted from multiple frames form a fixed-length feature vector, and a common approach is to average over the interesting frames. Instead, Briggs et al. [46] proposed a codebook concept, inspired by similar work in computer vision and music genre classification, in which the features from all frames are aggregated into a bag of code words through clustering.
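A toy version of the codebook idea, with k-means code words followed by a histogram of nearest code words per recording, might look like the following. The clustering and initialisation details here are our own simplifications, not those of [46]:

```python
import numpy as np

def build_codebook(frame_features, k=8, iters=20):
    """Toy k-means codebook over frame-level feature vectors.
    Centres are initialised with evenly spaced frames for determinism."""
    X = np.asarray(frame_features, float)
    idx = np.linspace(0, len(X) - 1, k).astype(int)
    centers = X[idx].copy()
    for _ in range(iters):
        # Assign each frame to its nearest code word, then re-estimate centres
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def bag_of_codewords(frame_features, centers):
    """Normalised histogram of nearest code words: a fixed-length
    descriptor for a whole recording."""
    X = np.asarray(frame_features, float)
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centers))
    return hist / hist.sum()
```

The resulting histogram is a fixed-length vector regardless of recording length, so it can be fed directly to standard classifiers.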

3) MARSYAS FRAMEWORK
The MARSYAS (Music Analysis, Retrieval and Synthesis for Audio Signals) framework was developed by Tzanetakis [47] for music information retrieval applications, and its feature set is widely applied in music genre classification research. Lopes et al. [48] first made use of this feature set in the domain of bird species recognition. More details can be found at the MARSYAS portal [47].

Once the features are extracted, the next step is to perform analysis for the various bioacoustics applications.

V. A TAXONOMY OF BIOACOUSTICS ANALYSES
Bioacoustics analysis provides deep insights into key environmental integrity issues such as biodiversity, density estimation and species identification. Each bioacoustics application has diverse analysis techniques and methods. For a comprehensive outlook of the techniques related to each application, we present the taxonomy illustrated in Figure 2. This section presents the details of the taxonomy and work on density estimation, biodiversity and species identification.

A. DENSITY ESTIMATION
Density or abundance estimation is an approach to assess bird population size. This helps in bird conservation, as it can assist in evaluating the impact of pollution and habitat loss on birds. After the 2002 World Summit on Sustainable Development, it was globally agreed to reduce the extinction of animal species, which spurred the investigation of approaches for monitoring bird population trends. Two classes of population model exist: open and closed. Closed population models assume that births, deaths and immigration do not occur, while open population models relax these assumptions. Seber [49] discussed several estimation methods for closed and open populations. For closed populations, methods such as absolute


FIGURE 2. Analyses Taxonomy.

density and relative density were proposed, while the Jolly-Seber method, regression and several other methods are discussed for open populations. Williams et al. [50] extensively discussed a framework for modelling, estimation and management of animal populations. Some of the most frequently used density estimation methods are discussed below.

1) PLOT SAMPLING
Although a typical objective is to estimate the total bird population, it is more practical to estimate the population in small areas and use these samples to estimate the total. Plot sampling is one such method, estimating population from samples of random plots. Reynolds et al. [51] developed a circular plot method that estimates bird density in a given area. In this method, stations are established at scattered locations to monitor birds; whenever a bird sound is heard, it is counted, and the distance from the station to where the bird was first detected is estimated. For each species, a plot is drawn for the areas where the observed bird species begin to decline. The density count is the sum of individual bird counts found within a specified radius.

Dawson and Efford [52] discussed several issues with plot methods and pointed out that counting one or another bird species and estimating their density may not be feasible until counts on sample plots are available and complete. This introduces reliability and detectability challenges, which are addressed by distance-based approaches.

2) DISTANCE SAMPLING
Detecting an animal relies primarily on the distance between the animal and the sensor. Distance sampling methods use distance as a key parameter, estimating the probability of detection as a function of distance. To achieve broad area coverage, a systematic design with a large number of recorded distances is used. A detection function g(y) is then defined to calculate the animal detection probability from the distance values. Rosenstock et al. [53] used distance sampling to count land birds; this method overcomes many drawbacks of index counts. Burnham et al. [54] utilized a Fourier series method to estimate bird abundance.

Hiby and Ward [55] discussed a variation on distance sampling called cue counting, in which cues produced by animals are counted rather than direct animal observations. Cue counting was initially used to estimate whale density using whale blows as cues. Since it is difficult to attribute cues to individual whales, cue density is estimated as the number of whale blows per unit area per unit time; dividing cue density by an estimated average cue rate gives whale density. Buckland [56] extended this technique to birds, where a call or song is the cue: cue density is the number of bird calls per unit area per unit time, and cue rate is the average number of calls per unit time. Using these two quantities, bird density can be estimated.
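The cue-counting arithmetic reduces to two divisions; the function below is a direct transcription of the relations above (the name and signature are ours):

```python
def cue_density_estimate(n_cues, area, time, cue_rate):
    """Animal density from cue counts: cue density (cues per unit area
    per unit time) divided by the average cue rate (cues per animal
    per unit time)."""
    cue_density = n_cues / (area * time)
    return cue_density / cue_rate
```

For example, 600 calls heard over a 10 km² survey area in 2 hours, with an average rate of 5 calls per bird per hour, gives a cue density of 30 calls/km²/h and hence a density of 6 birds per km².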

3) COUNT BASED METHODS
Though spatially replicated counting schemes are generally used to count animals, it is difficult to estimate population size from them as they generate sparse count data. Royle [57] developed an N-mixture model to overcome this and estimate population from such sparse data: site-specific population sizes N are viewed as independent Poisson-distributed random variables. Carroll and Lombard developed related estimators, but these have limitations that the N-mixture model attempts to address.

4) VARIANCE ESTIMATION
To draw meaningful inferences, precision is an important parameter for density estimation, and the precision measure itself must be reliable. Variance estimation is used to obtain a reliable precision measure. There are two approaches: analytic variance estimation and bootstrap variance estimation [58]. In the first approach, analytic variance is estimated in


terms of random (e.g. detection probability) and constant (e.g. number of sensors, recording time) components. The second approach, bootstrap variance estimation, resamples units such as sensors to generate a bootstrap dataset, from which a density estimate is obtained. The resampling is repeated many times, yielding a set of density estimates; the variance of this set approximates the variance of the original estimator.
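The bootstrap procedure can be sketched for the simple case where the estimator is the mean of per-sensor densities (an illustrative assumption; the resampled unit and estimator would depend on the survey design):

```python
import numpy as np

def bootstrap_variance(sensor_densities, n_boot=1000, seed=0):
    """Bootstrap variance of a mean-density estimator: resample sensors
    with replacement, recompute the estimate each time, then take the
    variance of the resulting set of estimates."""
    x = np.asarray(sensor_densities, float)
    rng = np.random.default_rng(seed)
    estimates = np.array([
        rng.choice(x, size=len(x), replace=True).mean()
        for _ in range(n_boot)
    ])
    return estimates.var(ddof=1)
```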

B. BIODIVERSITY
Conservation of biodiversity, in a general sense, aims at maintaining a balance among a large variety of interdependent biological creatures. Because of these dependencies, losing that balance threatens environmental health as a whole. One of the main threats to global biodiversity is tropical habitat loss due to human activities. Laiolo [4] elaborately discussed the effects of human activities on animals and on environmental acoustics through noise pollution, habitat fragmentation, chemical pollution, direct human disturbance, hunting, introduced diseases and food supplementation. These threats to biodiversity create a pressing need for surveys, and acoustics is an important input for analysing biodiversity. Below we discuss biodiversity analysis research.

1) TRADITIONAL MEASURES
The Shannon-Wiener statistic (H) is a species diversity index used to estimate species richness. Riede [3] used this statistic to estimate cricket diversity in the Amazon rainforest. Cricket sounds were recorded for two weeks, twice daily at 10 different points, and individual songs were recorded at close range with condenser microphones. Spectrograms showed frequencies ranging from 0 to 10 kHz: frequencies below 3 kHz were mostly produced by frogs, mammals, birds etc., while cricket acoustics fell within 4 kHz to 9 kHz. Wing movement causes resonance and generates sounds at a certain pulse rate and a narrow carrier frequency; these two parameters differ from species to species and hence can be used to identify species. The frequencies are densest between 5 kHz and 8 kHz, with pulse rates of up to 80. An understanding of species and their abundance can be obtained by applying the Shannon-Wiener statistic to the number of recordings of the ith species and the total recordings. In the authors' experiments, the Shannon-Wiener statistic was H = 2.789. This value is low (typical values approach 5.0), indicating low cricket diversity.
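Computed from per-species recording counts, the Shannon-Wiener statistic can be sketched as follows (using the natural logarithm, consistent with H values such as 2.789):

```python
import numpy as np

def shannon_wiener(counts):
    """Shannon-Wiener diversity H = -sum(p_i * ln p_i),
    where p_i is the proportion of recordings of species i."""
    p = np.asarray(counts, float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))
```

With k equally common species H equals ln(k), so four equally common species give H ≈ 1.386, and a single dominant species gives H = 0.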

2) ACOUSTIC INDICES
In earlier bioacoustic analyses, biodiversity estimation methods relied on species richness estimation. Sueur et al. [59] instead addressed biodiversity estimation at the community level, deriving α and β indices to analyse animal sounds. The α index indicates the species count in an area, whereas the β index designates the different species in that community. The α index is derived from the Acoustic Entropy Index (H), obtained as the product of temporal and spectral entropies: temporal entropy uses the Shannon index, while spectral entropy uses the short-time Fourier transform. The Acoustic Entropy Index (H) is zero for a solo tone and tends towards 1 as the number of tones increases; using the value of H, species richness can be estimated. To derive the β index, an Acoustic Dissimilarity Index (D) is computed as the product of temporal and spectral dissimilarities; the value of D increases when different species are present in the community. Their results demonstrated that the acoustic entropy index typically ranged between 0.3 and 0.9.
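A simplified sketch of the Acoustic Entropy Index as the product of a temporal and a spectral entropy follows. The envelope computation and normalisation details here are our own assumptions, not the exact method of [59]:

```python
import numpy as np

def norm_entropy(x):
    # Shannon entropy of a distribution, normalised to [0, 1]
    p = np.asarray(x, float)
    p = p / p.sum()
    p = p[p > 0]
    if len(p) < 2:
        return 0.0
    return float(-np.sum(p * np.log2(p)) / np.log2(len(p)))

def acoustic_entropy_index(signal, frame=512):
    """H = Ht * Hf: temporal entropy of a coarse amplitude envelope
    times spectral entropy of the mean magnitude spectrum."""
    n = len(signal) // frame
    frames = np.reshape(signal[:n * frame], (n, frame))
    envelope = np.sqrt((frames ** 2).mean(axis=1))   # per-frame RMS envelope
    spectrum = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    return norm_entropy(envelope) * norm_entropy(spectrum)
```

As the text describes, a single pure tone drives H towards zero (its spectrum is concentrated in one bin), while a richer signal such as broadband noise pushes H towards 1.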

Particularly during spring, certain habitats' acoustics substantially affect these H index values. To address this, Depraetere et al. [60] developed an alternative to the α index referred to as Acoustic Richness (AR), built primarily on the temporal entropy and amplitude of the signal. AR enhanced acoustic clarity by addressing the noise component, and tests demonstrated that it is as good as human observation. Acoustic Richness revealed good acoustic activity in small forests rather than in large forests.

Sueur et al. [10] gave an overview of the α and β indices developed over several years. Their survey indicates that 21 α indices and seven β indices have been proposed to date. The authors recommended the simultaneous use of these indices to obtain complementary information and to build sophisticated mathematical tools for biodiversity monitoring.

However, technological advances such as sensors and other sophisticated devices have made it possible to collect huge amounts of acoustic recordings from remote locations, resulting in a surge of audio data. Processing such enormous quantities of data is a challenge. One approach is to reduce the size of the audio recordings without losing important information. In line with this, Towsey [61] stated that acoustic indices offer a better solution in such scenarios, substantially reducing acoustic recording information while still giving a comprehensive understanding of the whole recording. The reduction is possible because acoustic indices summarize the energy aspects of a recording into a single value. Sankupellay et al. [62], Towsey et al. [61] and Towsey [63] have used several indices in their works and discussed the calculation and application of acoustic indices for assessing biodiversity. The acoustic indices used in their work are average signal amplitude, background noise, signal-to-noise ratio (SNR), acoustic activity, count of acoustic events, average duration of acoustic events, temporal entropy, spectral cover, entropy of spectral maxima, spectral entropy, entropy of the variance spectrum, spectral diversity and spectral persistence. Among these, the authors concluded that spectral entropy, spectral diversity and spectral persistence are the most useful indices for obtaining optimized samples.

3) SOUNDSCAPE
Although biodiversity can be estimated by species richness, Celis-Murillo et al. [64] developed a soundscape recording system (SRS) to overcome the limitations of point


count methods. The SRS design consists of four microphones positioned above ground level, facing each other at 90 degrees. On the collected acoustic data, a closed population capture-recapture approach is used for species richness estimation, and a jackknife estimator is then used to interpolate the estimates. The average detection probability is calculated, which in turn helps to estimate species richness. To understand dissimilarities in species composition, a Jaccard similarity index is used. Results demonstrated that SRS gave better average detection probability, with the Jaccard index tending to 1.0, showing good similarity in species composition.
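The Jaccard similarity used here for species composition is simply the size of the intersection of two species sets over the size of their union:

```python
def jaccard_similarity(species_a, species_b):
    """Jaccard index between two species sets: |A ∩ B| / |A ∪ B|.
    Values near 1.0 indicate similar species composition."""
    a, b = set(species_a), set(species_b)
    return len(a & b) / len(a | b)
```

Two sites sharing two of four observed species score 0.5, for instance.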

4) SAMPLING
Most bioacoustic analysis ideally depends on the quality of the recorded acoustic data. However, some recordings may be made in less than ideal circumstances, such as bad weather, and this should not affect the analysis. Bardeli et al. [65] described algorithms which work even under bad recording conditions, tested on sounds of the Eurasian bittern. These calls are short segments, and short, higher-frequency segments are not well suited to pattern recognition; hence input signals are down-sampled and analysed with a sliding window. The down-sampled signal yields an energy-weighted novelty measure whose peaks indicate calls. This may still contain noise, which can be removed from a bittern call by subtracting a low-pass filtered version, and an autocorrelation method is applied to reduce the remaining false positive detections. This method is adequate for bittern calls, but if very quiet calls are to be detected, false positives remain a challenge. Also, since acoustic sensors record huge volumes of data over long periods, analysing such volumes is complex and prone to false positives and negatives.

Hence, Wimmer et al. [66] made use of sampling methods. Data were collected at four locations over five days by placing sensors at the centre of each site. Surveyors critically analysed the calls manually and annotated them, marking a selection in the spectrogram and assigning it a tag. To simplify analysis, the recordings were split into one-minute segments, and five different samples (full day-random, dawn, dusk, dawn+dusk, systematic) were drawn from these segments. Samples drawn during the dawn period showed the highest detection rate, followed by dawn+dusk samples. This sampling method can reduce the manual analysis burden, and removing noise and cryptic vocalizations would ease the analysis further. The authors concluded that combining the strengths of manual and automated analysis makes biodiversity assessment more feasible.

Thus far, we have discussed two important bioacoustics applications: density estimation and biodiversity. Another significant application area of bioacoustics is species identification, which is discussed in the next section.

C. SPECIES IDENTIFICATION
There are three broad categories of approaches for species identification: manual, automatic and semi-automatic.

1) MANUAL
Until a few decades ago, only ornithologists and ecologists identified birds, using their expertise. Based on several factors such as bird origin, colour patterns, environmental context (time, weather, habitat etc.) and behaviour, experts could identify the species of a bird. However, this method has several disadvantages, such as the availability of few experts who can identify birds confidently, and observer bias due to variations in their abilities. Moreover, manual observation is limited to small volumes of data.

2) AUTOMATIC
Overcoming the drawbacks of the manual method, automatic species identification has taken identification to the next level using machine learning techniques. We present automatic species identification methods used by various studies of birds and other animals.

3) SEMI-AUTOMATIC
Semi-automatic methods combine manual and automatic notions and are of immense use in real-time scenarios, where acoustic sensors that are effective in monitoring and classification capture huge amounts of data. To analyse such big data, semi-automatic methods combining manual and automatic analysis were developed, because manual methods are not scalable while automatic methods suffer high false positive and false negative rates. To balance the benefits of both, Truskinger et al. [67] designed a semi-automatic (also known as crowd-based) analysis method for big data. Its central notion is to source the complex classification task to a set of participants (the crowd) and use tools at certain stages of the analysis. The authors implemented this method for rapid scanning of spectrograms, ascertaining that humans can distinguish regions of interest in spectrograms intelligently and effectively; beyond human capability, tools were used to analyse further. Experimental results show that analysis is sped up twelve-fold. Several other semi-automatic methods exist in addition to crowd-based analysis.

VI. BIRD IDENTIFICATION METHODS
As discussed earlier, bird sounds can be of call or song type, and both vary from species to species. Both types of sound are useful for species identification. To automatically identify species based on these calls or songs, a wide variety of sophisticated approaches have been proposed in the literature.


These are discussed below, based on the type of algorithm used.

A. NEURAL NETWORKS
Bird vocalization hierarchies can be divided into notes, syllables, phrases and calls. Among these, syllables (sets of notes) are considered the best for bird species recognition, as phrases and calls vary more by region and individual. To work on such syllables, Selouani et al. [36] developed a model combining time delay neural networks with an autoregressive (AR) version of the back propagation algorithm. This combination increases context memorization capability and pattern matching capacity. During training, a multilayer perceptron is trained on sounds using standard back propagation, iterating until the error falls below a user-specified threshold. When used for prediction, an output of zero indicates no match while 1 indicates a strong match against the training data. Though recurrent networks perform quite well, a delay component is introduced to the network to overcome difficulties when events are time shifted, which enhances its pattern recognition capacity even under misalignment. Results demonstrated 83% accuracy, a 16% improvement over a base ANN system. However, this method had trouble processing large datasets. Seeking to address scalability, Deecke et al. [68] developed a classification model applying dynamic time warping and an adaptive resonance theory (ART) neural network to categorize dolphins. This method classified species as well as humans could and was found effective for large dataset analysis.

Juang and Chen [69] noted that bird sounds follow temporal patterns: the output depends not only on the present input but also on past and future inputs. In such cases, rather than applying a back propagation feed-forward neural network, which increases the input delay and creates a large network, it is advantageous to use a recurrent network. In this direction, Juang and Chen proposed a prediction-based singleton-type recurrent fuzzy neural network model to address scaling and complexity issues. To apply the model, data are gathered and syllables extracted, and LPC is applied to calculate 12 coefficients. These 12 features are used to train the singleton-type recurrent fuzzy neural network, which achieved a high recognition rate.

Cai et al. [23] investigated bird species recognition with different sets of features and pre-processing methods using neural networks. The authors established sensor networks to record sounds and images. They employed voice activity detectors (VAD) to estimate the noise level from noise-only frames, although these VADs may not work well with low-SNR signals; hence, they developed a noise reduction algorithm. Using the output of this algorithm, MFCC features are extracted. To model temporal changes, delta MFCC and delta-delta MFCC features were used in combination with time delay neural networks, which combine information from both past and current frames. Thirteen-dimensional MFCCs and 13 dynamic features are input to the time delay neural networks. Their results demonstrated 98.7% accuracy.

McIlraith and Card [70] tested classification of birds using back propagation and multivariate statistics. Initially, a windowed Fourier analysis was applied to acquire a time-frequency representation of the signal. As part of pre-processing, statistical estimates of the spectrum were obtained using power spectral densities, and linear predictive coding (a source-filter model) was applied. Data dimensionality was reduced using a discriminant analysis technique. To extract spectral features, a fast Fourier transform was applied to the LPC waveforms. Next, a back propagation classifier with 10 inputs, 12 hidden nodes and six outputs was applied. As a second method, ANOVA was applied after parsing and normalization. Overall, the authors found that this combination of pre-processing and statistical methods helped in feature refinement, and that the back propagation method can trade off accuracy for computational efficiency.

In summary, machine learning techniques applied to bioacoustic data have been used extensively by ecologists. The reader is also referred to the 2008 review of machine learning applied to ecology in general by Olden et al. [71].

B. DEEP LEARNING METHODS
A relatively recent extension to ANN techniques is deep learning, which can be superior at identifying latent features in a dataset and has been proposed as suitable for large bioacoustics datasets [72].

Fazekas et al. [73] developed a deep learning based model for bird song identification. The input to this model is two-fold: acoustic data collected from habitats as well as its metadata. Several pre-processing steps are applied to clean the acoustic data and separate noise. The cleaned data is fed into a convolutional neural network with four layers. The authors concluded that frequency features are more easily distinguishable than time features.

Sankupellay and Konovalov [74] used ResNet-50, a 50-layer deep learning network based on residual learning. The authors concluded that the model was able to achieve good accuracy for shorter input calls. Xia et al. [75] presented a survey outlining various deep learning based acoustic event detection studies.

C. PROBABILISTIC MODELS
Trawicki et al. [29] developed a model to assist in identifying species decline using HMMs and MFCCs. The model is trained on data representing 115 species. Syllables are extracted and joined together to generate multiple song types, then divided into frames (of 25 ms size and 10 ms step size). A fast Fourier transform (FFT) is computed over each Hamming-windowed frame, and the FFT output is fed to a filter bank to obtain log amplitudes. These are used to compute 12 MFCCs via a discrete cosine transform (DCT). Delta coefficients and log energy are added to these MFCCs as further features.
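The final stage of this pipeline, a DCT over log filter-bank amplitudes yielding 12 MFCCs, can be sketched as follows (the mel filter bank itself is omitted and its 26-channel output is simulated with toy values):

```python
import math

def dct_ii(x, n_out):
    """DCT-II: the step that turns log filter-bank amplitudes into cepstra."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(1, n_out + 1)]   # drop the 0th (overall energy) term

def mfcc_from_filterbank(energies, n_coeffs=12):
    """12 MFCCs from one frame's filter-bank output, as in the pipeline above."""
    log_amp = [math.log(e + 1e-12) for e in energies]  # log amplitudes
    return dct_ii(log_amp, n_coeffs)

# toy output of a 26-channel mel filter bank for one 25 ms frame
energies = [abs(math.sin(0.4 * i)) + 0.1 for i in range(26)]
mfccs = mfcc_from_filterbank(energies)
assert len(mfccs) == 12
```

Delta features would then be computed as frame-to-frame differences of these coefficients.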

57694 VOLUME 8, 2020


R. R. KVSN et al.: Bioacoustics Data Analysis – A Taxonomy, Survey and Open Challenges

With this feature vector, an HMM comprising states, transitions and output likelihoods (modelled by a GMM) was used for speaker identification. Results indicated 83% to 95% classification accuracy.

Many of the methods consider only a particular type of song. Similar to text-independent recognition in humans, Fox et al. [32] developed a call-independent identification method for birds. Recordings are divided into frames, then multiplied by a Hamming window. A Fourier transform is applied to these windowed frames and multiplied by a mel-scale filter bank. To convert the filter bank energies to the cepstral domain, a DCT is applied, generating the feature set. Once the classifier is trained with known signals, it is used to analyze the features from unknown signals to obtain a similarity score for identifying classes. Results indicate 67% to 97% accuracy. Cheng et al. [31] have also worked on a call-independent recognition system using MFCCs and Gaussian mixture models (GMMs) for classification of passerine bird species. The GMM enables representation of bird-dependent spectral shapes and supports modeling arbitrary densities.

Somervuo et al. [37] define a syllable as ‘‘an organized sequence of brief sounds from a species-specific vocabulary’’. Feature extraction from these syllables is critical and forms the basis for classification. The authors compared three parametric models, sinusoidal modelling, a mel-cepstrum model and a descriptive parameters model, to determine which provides superior features for classification. Initially, the recording data is divided into syllables using an iterative time-domain algorithm. Using a threshold, syllables within 15 ms of each other are grouped together, forming segmented regions. The segmented regions are parameterized using the three models. In the sinusoidal model, the analysis-by-synthesis method is used for parameter estimation to find the most significant frequency per frame. A frequency-domain algorithm is also used to obtain the phase and magnitude of the sinusoidal pulse. Each leading sinusoid is analysed for harmonic structure by division into four classes. Sinusoidal modelling thus produces a sequence of triplets and the four harmonic classes. However, one issue observed with this model is that it produced a relatively large number of parameters, so a DCT was applied to reduce their number. The second model, the mel-cepstrum model, involves applying a DCT to logarithmic mel-spectrum features to reduce dimensionality; however, the computed MFCC feature vector may miss pitch information. This lost information can be recovered by the third method, the descriptive parameters model, which identifies seven spectral and five temporal features. Dynamic time warping is applied to calculate the distances between syllables. To model probability density functions, GMMs were used, trained with the standard expectation maximization algorithm. Results indicated that the best model is MFCC combined with dynamic time warping, and that the optimal feature set can be obtained by combining MFCC with the descriptive parameters.
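Dynamic time warping, used above to compare syllables of unequal length, can be sketched as follows (scalar features and the toy contours are illustrative):

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between two feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# a time-stretched copy of a frequency contour stays close under DTW
contour = [0, 1, 2, 3, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]
assert dtw(contour, stretched) == 0
assert dtw(contour, [5] * 7) > dtw(contour, stretched)
```

This elasticity in the time axis is why DTW pairs well with per-syllable MFCC sequences of varying duration.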

Lee et al. [76] proposed a two-dimensional MFCC model which collects both static and dynamic features. The instantaneous cepstrum represents static features, while its temporal variations represent dynamic features. Static features are acquired by applying a 2D-DCT to the logarithmic energies of mel-scale bandpass filters to obtain a matrix. The first 15 rows and the first five columns of the matrix are chosen as the preliminary sound features of a syllable, making 74 coefficients in total. Dynamic features are obtained in a similar manner. These two types of features are combined into a large vector, which is normalized, and PCA is then applied for dimensionality reduction. To increase the accuracy of the feature vector, prototype vectors are generated using vector quantization (VQ) and a Gaussian mixture model. To further improve the discriminability between bird species, linear discriminant analysis (LDA) was employed. A nearest neighbour classifier is used for classification, achieving 84% classification accuracy.
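The 2D-DCT step can be sketched as two passes of a 1D DCT, first over rows and then over columns, with the low-order block retained (the matrix sizes here are illustrative, not from [76]):

```python
import math

def dct_1d(x):
    """Plain DCT-II of a sequence."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(n)]

def dct_2d(matrix):
    """2D-DCT as a 1D DCT over rows followed by one over columns."""
    rows = [dct_1d(r) for r in matrix]
    cols = list(zip(*rows))
    out_cols = [dct_1d(list(c)) for c in cols]
    return [list(r) for r in zip(*out_cols)]

def static_features(log_energies, n_rows=15, n_cols=5):
    """Keep the low-order block of the 2D-DCT as the static feature matrix."""
    d = dct_2d(log_energies)
    return [row[:n_cols] for row in d[:n_rows]]

# toy matrix: 20 mel channels x 8 frames of log filter-bank energies
mat = [[math.log(1 + abs(math.sin(0.3 * i + 0.1 * j))) for j in range(8)]
       for i in range(20)]
block = static_features(mat, n_rows=15, n_cols=5)
assert len(block) == 15 and len(block[0]) == 5
```

Keeping only the low-order block concentrates the slowly varying spectral shape while discarding fine, noise-prone detail.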

D. NAIVE BAYES AND DECISION TREES
Bayesian techniques have been used to investigate a variety of ecological phenomena [77]. Vaca-Castano and Rodriguez [78] developed a database model of features that can save information as attributes and entities from various sources, which can be useful for later analyses. However, several analysis tools produce a huge number of attributes from the acoustic signal. In resource-constrained applications such as bird species identification using sensor networks, ascertaining the most important attributes plays a major role. Vilches et al. [79] explored the use of data mining techniques for attribute dimensionality reduction in bird species recognition. Three algorithms, the decision tree learners ID3 and J4.8 and the probabilistic naive Bayes classifier, were applied to the dataset together with vector quantization. In the first step, vector quantization converts each numeric vector to quantized values, calculating a partition and a codebook as intermediate results. Next, the decision tree algorithm ID3 is applied to the quantized data, which calculates the attributes' entropy. In the resultant decision tree, the most significant attributes are used for bird classification. The same procedure is applied with the J4.8 algorithm to gain the advantages of reduced-error pruning and numeric data handling. The extracted attributes are then fed into a naive Bayes classifier to overcome some drawbacks of decision trees (they may be unstable or complex). Results indicated that the J4.8 algorithm is accurate up to 98.39%, while naive Bayes gave slightly better performance on reduced datasets (when the number of attributes is reduced from 71 to 47).
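The attribute-entropy computation at the heart of ID3 can be sketched as follows (the quantized attributes and species labels are toy values, not from [79]):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label distribution, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute):
    """Entropy reduction from splitting on one attribute, as ID3 uses."""
    base = entropy(labels)
    by_value = {}
    for row, lab in zip(rows, labels):
        by_value.setdefault(row[attribute], []).append(lab)
    remainder = sum(len(sub) / len(labels) * entropy(sub)
                    for sub in by_value.values())
    return base - remainder

# toy quantized acoustic attributes -> species label
rows = [{"pitch": "hi", "dur": "s"}, {"pitch": "hi", "dur": "l"},
        {"pitch": "lo", "dur": "s"}, {"pitch": "lo", "dur": "l"}]
labels = ["A", "A", "B", "B"]
assert information_gain(rows, labels, "pitch") == 1.0   # perfectly separating
assert information_gain(rows, labels, "dur") == 0.0     # uninformative
```

ID3 greedily picks the attribute with the highest gain at each node, which is what ranks attributes by significance.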

E. HIERARCHICAL CLASSIFICATION
Silla and Kaestner [80] discussed the relatively new method of hierarchical classification of bird species from acoustic data. A dataset from the xeno-canto library was used. For feature extraction, they used the MARSYAS framework,
producing 64 features. Hierarchical classification can be achieved using three approaches: a flat classification approach; a local model hierarchical classification approach (each non-leaf node is trained, with top-down classification); and a global model hierarchical classification approach (which predicts classes across hierarchy levels). In this work, the global model hierarchical classification approach was based on naive Bayes. Hierarchical precision, hierarchical recall and the hierarchical F-measure were used as evaluation metrics. The hierarchical F-measure obtained from the experiments indicated that this approach outperforms the flat and local models and is useful when class counts are high and a class hierarchy exists.
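Hierarchical precision, recall and F-measure are commonly computed over ancestor-augmented label sets; a sketch under that assumption (the toy taxonomy is illustrative):

```python
def with_ancestors(node, parent):
    """Expand a class label into the set of itself plus all its ancestors."""
    out = set()
    while node is not None:
        out.add(node)
        node = parent.get(node)
    return out

def hierarchical_prf(true, pred, parent):
    """Hierarchical precision, recall and F-measure over ancestor sets."""
    t, p = with_ancestors(true, parent), with_ancestors(pred, parent)
    inter = len(t & p)
    hp, hr = inter / len(p), inter / len(t)
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf

# toy taxonomy: order -> family -> species
parent = {"sp_a": "fam_1", "sp_b": "fam_1", "fam_1": "ord_x",
          "sp_c": "fam_2", "fam_2": "ord_x", "ord_x": None}
# predicting a sibling species still earns partial hierarchical credit
hp, hr, hf = hierarchical_prf("sp_a", "sp_b", parent)
assert (hp, hr) == (2 / 3, 2 / 3)
```

This partial credit for near misses in the taxonomy is what distinguishes these metrics from flat precision and recall.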

Signal detection (feature extraction) and signal characterization (classification) are the two major activities of species identification. Acevedo et al. [81] focused on signal characterization, for which they considered three supervised machine learning algorithms: linear discriminant analysis (LDA), decision trees and SVM. A dataset of 10,061 calls was used. LDA assumes the feature vectors of a class follow a Gaussian distribution. Decision trees iteratively partition the data until a condition is met, with each partition representing a class; pruning is then applied to avoid overfitting. SVM balances training error against function complexity. Results indicate that SVM achieved 94.5% accuracy, while decision trees and LDA achieved 89% and 71% respectively.

Many classification approaches implicitly make the simplifying assumption that only a single species is present in a recording (requiring multiple passes with different classifiers to identify all species present). Briggs et al. [43] developed a multi-instance multi-label (MIML) supervised classification framework, which predicts the multiple species present in a recording in a single analysis pass. In MIML, the classified objects constitute bags (audio recordings) of instances (syllables) with sets of class labels (species). A bag generator algorithm converts each recording into a bag of instances, with a random forest used for segmentation. The MIML classifier is applied to identify the species in the recording using three underlying classification algorithms: MIMLSVM, MIMLRBF and MIMLkNN. Experiments demonstrated 96% accuracy. Further, in [82], Briggs et al. discussed more sophisticated methods for detection and classification of multiple species in noisy environments.

Stowell and Plumbley [83] experimented with unsupervised feature learning as a surrogate for MFCCs. Usually, MFCC features are obtained by applying a short-time Fourier transform to audio, producing a mel spectrum that is transformed using cepstral analysis, preserving the first 13 coefficients. Rather than this fixed transformation, Stowell and Plumbley proposed automatic feature learning. Initially, a high-pass filter and normalization are applied to spectrograms, followed by spectral median noise reduction. Features are then derived from the dataset using PCA, and spherical k-means is used for feature learning. To reduce the feature set further, the mean and standard deviation, or the maximum, of each feature is tested. The authors used a random forest classifier. They observed that feature learning boosted performance, especially for single-label classification tasks, but for datasets with few annotations the performance of this model is not promising.

F. METHOD COMPARISON WORKS
A common approach in audio classification is to use the average value of features calculated across several frames of the source audio; this was used in the SVM approach of Fagerlund [84]. In contrast, Briggs et al. [46] represented audio features using histograms, as in the codebook approach, instead of averaged frame-level features. A codebook is a collection of ‘‘words’’, where each word is a feature vector. The experiment was conducted by segmenting the signal into frames and applying noise reduction. Frames of interest were identified, and only the 10% with the highest magnitude were retained. MFCCs are then calculated and histograms constructed, resulting in a 5000-dimensional feature vector. The codebook is constructed using the k-means clustering algorithm. Several frame-level features are aggregated and used with nearest neighbour classifiers (with L1, L2, KL (Kullback-Leibler divergence) and Hellinger distances), an interval-IID MAP classifier, and SVMs. The interval-IID model was proposed by those authors to model the feature distribution, yielding a MAP classifier that aggregates histograms of features and uses the KL-nearest neighbour algorithm for classification. The study indicated that frame histograms are better than averaged frame-level features, that nearest neighbour classifiers using KL and Hellinger distances outperformed SVM, and that the KL and Hellinger distances, rather than Euclidean distance, are the appropriate metrics for histograms.
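The codebook histogram representation can be sketched as follows; the codebook would normally be learned by k-means over MFCC frames, but a tiny hand-written one is used here for illustration:

```python
def nearest_word(vec, codebook):
    """Index of the closest codebook word (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])))

def histogram_of_words(frames, codebook):
    """Represent a recording as a normalized histogram of codeword counts."""
    hist = [0.0] * len(codebook)
    for f in frames:
        hist[nearest_word(f, codebook)] += 1
    total = sum(hist)
    return [h / total for h in hist]

# toy 2-D codebook standing in for a k-means result over MFCC frames
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (5.1, 4.8), (4.9, 5.2)]
assert histogram_of_words(frames, codebook) == [0.25, 0.25, 0.5]
```

Because the histogram is a probability distribution, distances such as KL divergence and Hellinger compare recordings more naturally than Euclidean distance.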

Kampichler et al. [85] experimented with several classification algorithms, including decision trees, ANN, SVM, random forest and fuzzy classifiers. Ocellated turkey acoustics were chosen as the dataset for these techniques. They concluded that the performance of neural networks, LDA and SVM was poor, while the joint use of decision trees and random forests is highly recommended to achieve a high level of accuracy, transparency and comprehensibility.

Lopes et al. [48] also emphasised that the choice of machine learning algorithm and features plays a major role in classification performance. The authors used the MARSYAS framework to produce feature sets, which are mostly used in automated music genre classification problems. The feature set has 64 features, comprising 12 MFCCs and the means and variances of timbral features. The authors claim that the MARSYAS feature set performs better than the Inset-Onset Interval Histogram Coefficients (IOIHC) and sound ruler feature sets [86]. Several classifiers were trained and tested using these features: naive Bayes; k-NN with three clusters; the decision tree classifier J4.8; an MLP neural network trained with the back-propagation momentum algorithm; and the SVM classifier using Platt's Sequential Minimal Optimization (SMO) implementation. Results indicated that, rather than using full bird


TABLE 1. Summary of articles addressing bird species identification.

song recordings, the use of pulses (sounds with high amplitudes) significantly increases classification performance. Using these pulses, the best results are obtained with the multi-layer perceptron and SMO classifiers, with up to 95% accuracy.

Most of the research in species identification is call dependent. Cheng et al. [87] developed call-independent identification of 10 passerine species using four machine learning methods: radial basis function networks (RBFN, a special kind of ANN), SVM, HMM and GMM. MFCCs and linear predictive coefficients (LPCs) were chosen as feature sets for these models. One feature vector was extracted per frame, containing 13 LPCs and 24 MFCCs. Using these features, training and test data were prepared separately for species identification. They concluded that the LPC feature set with the HMM classifier and the MFCC feature set with SVM were the best combinations.

Table 1 summarizes the discussion on bird species identification.

VII. OTHER ANIMAL IDENTIFICATION METHODS
While a major target for bioacoustics analysis is birds, many other animals emit identifiable sounds and have been the subject of bioacoustics-based ecological studies. This section summarizes such research on bats, cetaceans, other terrestrial animals, and insects.

A. BATS
1) MULTIVARIATE ANALYSIS TECHNIQUE
Vaughan et al. [88] used multivariate analysis of echolocation call parameters to identify bat species. Echolocation calls of around 536 bats from 15 known species were recorded in Great Britain. The recorded sounds were analysed using sonographs. From each sequence of calls, either the second or, in the case of noisy data, the penultimate call was used. Calls were characterised by six features: duration, interpulse interval, and peak, start, end and centre frequencies. Multivariate analysis was used, demonstrating good results. The authors
concluded that, of the six features, the most informative are the duration and the start and end frequencies. In related work, O'Farrell et al. [89] devised a qualitative identification approach using the critical call parameters of maximum and minimum frequencies, linearity, and slope. Within a given time period, each individual bat may make a series of calls, referred to as a sequence. The approach begins by establishing a library of recordings of known species. New, unidentified calls are compared visually with these archived calls using the min and max frequencies to qualitatively classify bats. The quality of the method depends on the library of identified recordings and the visualization expertise of a human operator.

2) PATTERN RECOGNITION APPROACH USING SYNERGETIC CLASSIFIER
Within any species there will be call variation between individuals [90]. To address this, Obrist et al. [91] developed a pattern recognition approach using a synergetic classifier. Variation between individuals may mean that individuals of one species are misclassified as another when spectrograms of their calls are inspected visually, because spectrograms show characteristics in the form of a frequency curve. This shape may be better analysed by pattern recognition software to classify the signals. Using standard equipment, 643 sequences of calls were recorded and a high-pass filter applied to cut out single echolocation calls, with additional processing to identify the cleanest calls, producing 14,354 calls suitable for analysis. Calls were characterized by duration and high and low frequencies with the help of a discriminant function. The discriminant function is repeated for various percentages and these values are transformed for comparisons. Next, the synergetic algorithm is applied, which combines a class of training patterns into a feature vector, allowing it to handle high dimensions. A training base was established by selecting 20 calls from each of 26 species and testing on the database. The database was successively refined, then within-species variability was tested using coefficients of variation. Discriminant function analysis (DFA) was used to explore the classification strength. The authors concluded that this method has the strengths of processing huge datasets in addition to classification accuracy and trustworthiness.

3) DISCRIMINANT FUNCTION ANALYSIS (DFA)
Hughes et al. [92] tested the robustness of DFA for identifying Thai bat species. Recordings were made of free-flying bats and spectrograms created to extract call structure information. Calls were divided into four categories based on their structure: broadband FM calls, narrowband FM calls, long multiharmonic calls and short multiharmonic calls. One call from each individual was considered for analysis, with 10 features calculated: duration; frequency of maximum energy; start, end, max, and min frequencies; frequency range; number of harmonics; interharmonic distance; and pulse interval. The DFA technique with cross-validation was applied to classify the species. For calls in the broadband FM category, DFA achieved an accuracy of up to 85.9%, while it achieved 70.4% for those in the narrowband FM category. For long multiharmonic calls DFA showed 84.4% accuracy, and 96.7% classification accuracy for short multiharmonic calls. Hence, for all call types, DFA achieved greater than 70% classification accuracy.

4) ARTIFICIAL NEURAL NETWORKS (ANN)
Although DFA has become a highly trusted approach for classification of bat species, other machine learning algorithms have also been used. Preatoni et al. [93] evaluated four different classification methods (DFA, cluster analysis, CART, and ANN) by collecting 3-second call sequences from 126 bats. Noise was eliminated by digital filtering and the cleanest ultrasonic clicks extracted. Each click was characterised by its pulse duration, maximum intensity frequency, and start, end, middle, min and max frequencies. Due to multicollinearity, only four features (duration, min and max frequencies, and frequency of maximum intensity) were considered for the DFA and cluster analysis methods, whereas all seven parameters were used for ANN and CART. After comparing the four methods, the authors concluded that DFA and ANN performed well, while cluster analysis was poor for classification. CART had only moderate success separating the classes.

Jennings et al. [94] conducted an experiment to compare the classification of bats by humans and by an ANN. Recordings of 3–4 calls from each of 16 species were made. The recordings were given to 26 human participants working on bats (academics, researchers and ecologists), who were asked to classify the calls using any method except statistical modelling. If a species was identified correctly, a score of 1 was assigned. From the same recorded calls, several frequency parameters were measured and fed into an ANN. Classification performance was assessed using sensitivity (the percentage of known calls classified) and positive predictive power (the percentage of unknown calls classified). Results indicated that the ANN performed 75% better than the humans. The authors concluded that this could be improved further with a better ANN and training data.
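One common reading of these metrics treats sensitivity as per-species recall and positive predictive power as per-species precision; a sketch under that assumption (the species codes and data are illustrative):

```python
def sensitivity_and_ppv(true, pred, species):
    """Per-species sensitivity (recall) and positive predictive
    power (precision) over paired true/predicted labels."""
    tp = sum(1 for t, p in zip(true, pred) if t == species and p == species)
    actual = sum(1 for t in true if t == species)        # known calls
    predicted = sum(1 for p in pred if p == species)     # claimed calls
    sens = tp / actual if actual else 0.0
    ppv = tp / predicted if predicted else 0.0
    return sens, ppv

# toy labels: "pip" and "noc" stand in for two bat species
true = ["pip", "pip", "noc", "noc", "pip"]
pred = ["pip", "noc", "noc", "noc", "pip"]
sens, ppv = sensitivity_and_ppv(true, pred, "pip")
assert (sens, ppv) == (2 / 3, 1.0)
```

Reporting both values matters because a classifier can trade one against the other by labelling more or fewer calls as a given species.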

Apart from DFA and ANN methods, support vector machines (SVMs) have also been used for the classification of bat species, by Redgwell et al. [95]. The authors also tested ensemble neural networks (ENNs), keeping DFA as a canonical approach against which to measure performance. The experiment commenced by procuring a library of 713 calls from 14 species. A Butterworth high-pass filter was used to remove noise and improve the signal. A MATLAB algorithm was used for call extraction, running iteratively until it finds a frequency of highest energy within 8 kHz of the previous repetition or the average signal-to-noise ratio (SNR) exceeds 0.01. Twelve parameters were extracted from each call. Five are considered ‘‘base parameters’’: duration, start frequency, end frequency, middle frequency and frequency of maximum energy. The sixth and seventh are rate-of-change and bandwidth measurements, respectively.


TABLE 2. Summary of articles addressing bat species identification.

To obtain the next four parameters, each call was subjected to a Hilbert transform and multiplied by its conjugate to obtain the call's energy distribution. The 12th parameter is a discrete variable that divides calls into categories such as constant frequency, frequency modulated, quasi-constant frequency and kinked (calls with an unexpected weakening of frequency in the last quarter). Untransformed data was used for DFA, since it is robust to deviations from normality. DFA showed 73% accuracy in species classification. Twenty SVMs were trained using a radial function over a step-varying range. All the classifiers were combined to identify calls, with the combined SVMs achieving 87% classification accuracy. The ENN approach achieved 98% accuracy. The authors concluded that SVM and ENN performed better than DFA, and that the five base parameters are the most critical for classification.

Some research-based analysis techniques have been transformed into end-user tools. For instance, Walters et al. [96] developed iBatsID, an ANN-based continental-scale tool that assists in classifying European bat calls, achieving 93% accuracy for group classification and 83% for species classification. This tool was made publicly available for monitoring bat acoustics in Europe.

In conclusion, for acoustic-based bat species classification, methods such as DFA, SVM, and ANN have been used extensively. A summary is presented in Table 2.

B. OTHER TERRESTRIAL ANIMALS
1) FROGS
Bedoya et al. [30] pointed out that acoustic-based identification of species has the advantage of being non-invasive, in contrast to marking procedures, which may harm animals, particularly sensitive species. They used an unsupervised classification method for anurans based on a fuzzy classifier and MFCCs. Datasets from two different sources were collected, with 916 calls from 13 anuran species (from STRI) and 813 calls from 6 anuran species (from Antioquia) considered for analysis. The authors developed a four-stage classification methodology:

1) Noise reduction: Threshold estimation is performed to identify noisy segments. A fast Fourier transform (FFT) is applied to the windows of noisy segments, which gives a threshold estimate. During noise removal, the gain control is set within the limits of this threshold value. These gain controls are applied to the FFT, followed by an inverse FFT and finally a Hamming window, which optimizes the signals.

2) Segmentation: The noise-reduced signal is divided into segments for easier analysis. Syllables of fixed length are formed, each the product of an individual vocalization. The start and end points of the calls are identified by comparison with the threshold value.


TABLE 3. Summary of articles addressing frog species identification.

3) Feature extraction: The syllables are divided into frames. A Fourier transform is calculated for each frame to identify interesting frequency bands and to calculate the power spectrum, which is mapped to the mel frequency scale, from which MFCCs are calculated. The mean values of the MFCCs are used as the classification inputs.

4) Classification: The authors' Learning Algorithm for Multivariate Data Analysis (LAMDA) is used, initializing cluster 0 as a non-information cluster (NIC). The first element goes unrecognized and is placed into this NIC. A new cluster is created by calculating the mean and Global Adequacy Degree (GAD) values of the call. This process is repeated until all calls are analysed. To classify an unknown call, its GAD is calculated and the call is assigned to the species that exhibits the maximum GAD.

Results demonstrated 99% classification accuracy. Further, this method is able to detect species even when they were not present in the training stage.
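The threshold-based start/end detection in the segmentation step can be sketched over an energy envelope (the threshold, minimum length and toy envelope are illustrative):

```python
def segment_syllables(envelope, threshold, min_len=3):
    """Return (start, end) index pairs where the energy envelope stays
    above a threshold: a simple sketch of threshold-based segmentation."""
    segments, start = [], None
    for i, e in enumerate(envelope):
        if e >= threshold and start is None:
            start = i                        # call onset
        elif e < threshold and start is not None:
            if i - start >= min_len:
                segments.append((start, i))  # call offset
            start = None
    if start is not None and len(envelope) - start >= min_len:
        segments.append((start, len(envelope)))
    return segments

# toy energy envelope with two calls above the noise floor
env = [0.1, 0.1, 0.9, 1.0, 0.8, 0.1, 0.1, 0.7, 0.9, 0.8, 0.9, 0.1]
assert segment_syllables(env, threshold=0.5) == [(2, 5), (7, 11)]
```

The `min_len` guard discards brief noise spikes that would otherwise become spurious syllables.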

Han et al. [34] introduced a spectral entropy approach for frog species identification. The method consists of the following three steps:

1) Syllable segmentation: Using the Raven software package (see Section VIII-A.2), syllables are extracted and digitized. From 12 to 96 syllables can be extracted from a call of 3 seconds duration.

2) Feature extraction: Three different features are extracted from the syllables:

• Spectral centroid: A highly informative feature for machine learning, representing the center point of the spectrum where the sound is ‘‘bright’’.

• Shannon entropy: Quantifies the richness of the sound.

• Renyi entropy: Used to identify the complexity of the sound, i.e., its noise content.

3) Classification: The above entropies are given as input to a k-nearest neighbour classifier.

Results demonstrated that the use of entropy features improved the classifier's accuracy.
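The Shannon and Renyi entropy features can be sketched over a normalized power spectrum (the flat-versus-tone spectra are illustrative):

```python
import math

def normalize(spectrum):
    """Turn a power spectrum into a probability distribution."""
    total = sum(spectrum)
    return [s / total for s in spectrum]

def shannon_entropy(spectrum):
    """Shannon entropy of a normalized power spectrum, in bits."""
    p = normalize(spectrum)
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def renyi_entropy(spectrum, alpha=2.0):
    """Renyi entropy of order alpha (alpha != 1)."""
    p = normalize(spectrum)
    return math.log2(sum(pi ** alpha for pi in p)) / (1 - alpha)

# a flat spectrum maximizes both entropies; a pure tone minimizes them
flat = [1.0] * 8
tone = [0.0] * 7 + [1.0]
assert abs(shannon_entropy(flat) - 3.0) < 1e-9   # log2(8) bits
assert shannon_entropy(tone) == 0.0
assert renyi_entropy(flat) > renyi_entropy(tone)
```

Tonal, harmonically clean calls thus score low while noisy, broadband calls score high, which is what makes these values discriminative features.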

Dayou et al. [35] also studied the use of an entropy-based approach for frog species identification. The work applied a k-NN classifier using Shannon, Renyi and Tsallis entropies as the features. The approach achieved 100% classification accuracy for seven of the nine species present in their dataset.

Xie et al. [97] developed models to recognise different calls within four frog species. Their work sought to select suitable acoustic features for identifying the likely species, and a potentially different set of features for identifying the calls within a species. The technique achieved an average accuracy of 84% at the species level, and just below that when identifying calls. By combining the two approaches, accuracy of up to 93% could be achieved.

Table 3 summarizes frog species identification works.

2) DOGS
Bioacoustics research relating to dogs has focused on identifying the state of the animal or the situation it is in.

Yin and McCowan [98] analysed dog barks to examine classification in different contexts. Ten dogs of both sexes, ranging from 3 to 13 years of age, were considered for analysis. Recordings were made in disturbance, isolation and play situations. Only 5% of the collected data was identified as noise, and the remaining 95% was used for analysis. Sixty frequency and 60 amplitude measures were taken, with a number of features derived from these. Spectrograms were generated using FFTs with a Hamming window. DFA was used to discriminate between barks in the various contexts, producing a set of features useful for classification. Results indicated good recognition and classification accuracy.

Riede et al. [12] also studied dog barking. They considered two dog categories, with a dataset comprising 10 dogs in good health and 10 dogs in unhealthy condition (e.g., undergoing treatment in a veterinary clinic). Barks of the dogs in these two different contexts were recorded. To identify the differences in the acoustics, the harmonic-to-noise ratio (HNR)


TABLE 4. Summary of articles addressing dog species identification.

TABLE 5. Summary of articles addressing insect species identification.

was calculated. The harmonics of healthy dogs' barks show them to be regular, while the barks of unhealthy dogs are irregular; the noise introduced by illness also influences the acoustics. Recordings of all dogs over a 6-month period were used. The authors concluded that HNR-based classification is a good measure for quantifying noise in bioacoustics in the context of dogs. Table 4 summarizes dog species identification works.

3) ELEPHANTS
Clemins et al. [99] studied one male and six female African elephants. Data were collected by fitting each animal with a microphone and radio-frequency transmitter. Acoustic features are extracted from spectrograms using a Hamming window: a 60 ms window is used for call classification and a 300 ms window for speaker identification. A hidden Markov model is used to model transitions. For classifying vocalization type, the hidden Markov model performed as well as humans. Speaker identification was also performed in different contexts, one where the male was isolated from the females and another where the male was with four other elephants. The accuracy of the experiment was 82.5%.

4) INSECTS
Acoustics have also been used to study insects. Chesmore and Nellenbach [100] developed an automatic method for identification of grasshoppers and crickets. Using time-domain signal processing and an ANN, they demonstrated 100% classification accuracy, and claimed that the approach can be used for other species as well.

Potamitis et al. [101] highlighted that insects can be categorized based on their appearance and sound production. Trapping and identifying insects through appearance is difficult. However, insects produce sounds in many situations, such as eating, flying or locomotion, and also use sound as a means of communication, including mating calls and warnings of danger. Species recognition is done in a two-step process:

1) Signal parameterization: the energy of fixed-length frames is estimated, and then a discrete Fourier transform is applied. A linearly spaced filter bank is applied to the DFT and linear frequency cepstral coefficients are calculated, the first 24 of which are used for recognition. Dynamic normalization is applied to all feature vectors.

2) Classification: the normalized features are used with probabilistic neural network (PNN) and Gaussian mixture model (GMM) classifiers. A model is built for each target species, and Bayes' rule is applied to decide the predicted class.

The models' accuracy was 90–99%. Table 5 summarizes works on insect species identification.
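Step 1 above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the filter count, frame length and triangular filter shape are assumptions, while the linear spacing, the DFT-to-DCT pipeline and the 24 retained coefficients follow the description:

```python
import numpy as np
from scipy.fft import dct

def lfcc(frame, n_filters=40, n_coeffs=24):
    """Linear-frequency cepstral coefficients for one audio frame:
    power spectrum -> linearly spaced triangular filter bank -> log -> DCT,
    keeping the first n_coeffs coefficients."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    n_bins = len(spectrum)
    centers = np.linspace(0, n_bins - 1, n_filters + 2)   # linear spacing
    fbank = np.zeros((n_filters, n_bins))
    bins = np.arange(n_bins)
    for i in range(n_filters):
        l, c, r = centers[i], centers[i + 1], centers[i + 2]
        # Triangular filter rising from l to c and falling from c to r
        fbank[i] = np.clip(np.minimum((bins - l) / (c - l),
                                      (r - bins) / (r - c)), 0, None)
    energies = np.log(fbank @ spectrum + 1e-10)
    return dct(energies, norm="ortho")[:n_coeffs]

sr = 16000
frame = np.sin(2 * np.pi * 3000 * np.arange(512) / sr)    # synthetic tone frame
coeffs = lfcc(frame)
print(coeffs.shape)
```

Per-frame coefficient vectors like this would then be normalized and passed to the PNN/GMM classifiers of step 2.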

C. CETACEANS
Cetaceans include whales, dolphins and related species. Soon after the Second World War it was recognized that acoustics could be used with these species for different kinds of analysis. By the 1970s the potential for species recognition was known, but work could not proceed due to technology limitations. Advances in digital technologies have since paved the way for cetacean species identification using acoustics [102].

1) STATISTICAL METHODS
DFA has been applied by Oswald et al. [103] to identify dolphin species. Whistles from four species were collected, and only loud and clear whistles at least nine decibels above background noise were chosen randomly for analysis. Four different spectrograms with variable frequencies were created and eight measures extracted: start, end, minimum and maximum frequencies, duration, inflection points, number of steps, and a harmonics indicator. Taking these measures as input, a DFA based on orthogonal linear functions classified whistles into the specified groups. Results indicated that, although the DFA classified accurately, recording quality and analysis bandwidth greatly influence the effectiveness of species classification. The authors concluded that a bandwidth of not less than 24 kHz is needed for accurate analysis and classification.
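A DFA-style classification over such whistle measures can be sketched with scikit-learn's linear discriminant analysis. The data here are synthetic stand-ins for the eight measures; all distributions and numbers are invented for illustration:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

def fake_whistles(mean_freq_khz, n=50):
    """Synthetic stand-ins for the eight whistle measures: start/end/min/max
    frequency, then duration, inflection points, steps, harmonics flag."""
    freqs = rng.normal(mean_freq_khz, 0.5, (n, 4))
    other = rng.normal([1.0, 2.0, 1.0, 0.5], 0.3, (n, 4))
    return np.hstack([freqs, other])

X = np.vstack([fake_whistles(8.0), fake_whistles(12.0)])  # two "species"
y = np.array([0] * 50 + [1] * 50)

dfa = LinearDiscriminantAnalysis().fit(X, y)
print(dfa.score(X, y))
```

With well-separated frequency measures the discriminant functions classify nearly perfectly; in practice, as the authors note, recording quality and bandwidth dominate real performance.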

TABLE 6. Summary of articles addressing cetacean species identification.

2) REAL-TIME ODONTOCETE CALL CLASSIFICATION ALGORITHM (ROCCA)
Several large species like whales produce whistles that can be easily identified, but recognizing the whistles of species like delphinids is more challenging. To address this, Oswald et al. [104] developed a tool named ROCCA (Real-time Odontocete Call Classification Algorithm) which classifies species in real time. From the recordings, 50% (35 whistles per session) of the whistles that were loud and clear were randomly selected. As in their previous work, start, end, minimum and maximum frequencies, duration, inflection points, number of steps, and a harmonics indicator were used as features, supplemented by the slopes of the beginning and end sweeps. After normality tests and transformation, DFA was applied to these measures to classify them, with the Mahalanobis distance calculated to identify the nearest group centroid. Another classification method, CART, which creates a binary tree, was also applied. The jackknife method was used to obtain classification scores. DFA achieved up to 63.5% accuracy while CART gave 57%.
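The CART-plus-jackknife evaluation can be sketched as follows, with leave-one-out cross-validation playing the role of the jackknife. The synthetic whistle features and the tree depth are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
# Synthetic stand-ins for ROCCA-style whistle measures of two species
X = np.vstack([rng.normal(0, 1, (30, 10)), rng.normal(2, 1, (30, 10))])
y = np.array([0] * 30 + [1] * 30)

cart = DecisionTreeClassifier(max_depth=3, random_state=0)  # binary tree
# Jackknife-style (leave-one-out) classification score
scores = cross_val_score(cart, X, y, cv=LeaveOneOut())
print(scores.mean())
```

Each whistle is held out once and classified by a tree trained on the rest, so the mean score estimates generalization in the same spirit as the jackknife scores reported for ROCCA.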

3) GAUSSIAN MIXTURE MODEL (GMM)
Roch et al. [105] used a Gaussian mixture model for species classification, following the common analysis pattern of call data collection, call detection, feature extraction and, finally, classification. Recordings were obtained from the California coastline, and only single-species calls were kept for further analysis. Using spectrograms, the calls' start and end points were identified, and good-quality calls above 18 dB were considered for analysis. The identified calls were divided into frames of 21 ms and a Hamming window applied. A filter was applied to identify the poor calls below 5 kHz. To derive the classification features, cepstral features were obtained by applying the discrete cosine transform to the filter outputs. Since it is difficult to assess the initial values of a GMM, an initial Gaussian classifier was initialized using mean and covariance values. An iterative algorithm was applied which splits each single mixture into two mixtures with an offset value. The algorithm is iterated until the anticipated mixtures

are formed. Once the expectation is known with this algorithm, the model is trained. The posterior probability of each species is then computed using Bayes' rule, and the posterior values from each observation are summed to identify the species. The model achieved up to 75% classification accuracy. Table 6 summarizes the work on cetaceans.
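The final classification stage, one GMM per target species with per-frame log-likelihoods summed and combined under Bayes' rule, can be sketched as follows. scikit-learn's standard EM training stands in for the paper's mixture-splitting initialization, and the cepstral data are synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Cepstral-feature stand-ins for calls of two cetacean "species"
train = {"sp_a": rng.normal(0, 1, (200, 12)),
         "sp_b": rng.normal(3, 1, (200, 12))}

# One GMM per species (EM here, rather than the paper's splitting scheme)
models = {sp: GaussianMixture(n_components=2, random_state=0).fit(X)
          for sp, X in train.items()}

def classify(frames):
    """Sum per-frame log-likelihoods per species; with a uniform prior,
    Bayes' rule reduces to picking the maximum total log-likelihood."""
    logp = {sp: m.score_samples(frames).sum() for sp, m in models.items()}
    return max(logp, key=logp.get)

test_call = rng.normal(3, 1, (50, 12))    # frames from an unknown call
print(classify(test_call))                 # expected: sp_b
```

Summing log-likelihoods over all frames of a call corresponds to the posterior accumulation over observations described above.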

From the studies summarised above (for birds, bats, other terrestrial animals, and cetaceans) one can see that various techniques have been used for different animal types, but with some commonalities. These are summarised in Table 7.

VIII. BIOACOUSTICS SOFTWARE AND BIG DATA
A. SYSTEMS AND SOFTWARE
As discussed earlier, bioacoustics analysis involves several activities such as data collection, visualization, pre-processing, feature extraction and analysis. Due to their high complexity, each activity requires support from sophisticated software tools to enable automatic processing. Several organizations have developed software to support some or all of these activities. We first discuss general audio processing software and, in subsequent sections, bioacoustics-specific software and big data handling systems.

1) GENERAL AUDIO PROCESSING SOFTWARE
Software such as SoundRuler, Audacity and seewave are designed for general audio processing purposes, but can also be used in bioacoustics work.

1) SoundRuler is a free and open-source interactive visual tool for performing analysis tasks on acoustic data [106]. Bee [107] provides a comprehensive overview of the software's features and operation. Its features include handling audio files of unlimited size, adjustable filters, support for both manual and automatic analysis, and generation of more than 50 acoustic measures. It can produce graphical summaries of acoustic data, such as spectrograms and oscillograms.

2) Audacity is sound processing software developed atCarnegie Mellon University by Dominic Mazzoni andRoger Dannenberg in 2000. It facilitates multiple

TABLE 7. Similar techniques applied to the study of different animal groups.

source recording and several post-processing activities. It offers features such as recording, editing, importing and exporting, as well as support for multiple channel modes and spectrum analysis.

3) Seewave (typically rendered in all lower case) is an R package (available through CRAN) for performing sound analysis and synthesis. According to the seewave website [108], it offers a variety of functions for analysis, display, manipulation, editing and synthesis of audio. Further, the tool produces both 2D and 3D spectrograms, and computes entropy, correlation and several other measures.

2) BIOACOUSTICS SPECIFIC SOFTWARE
The tools discussed below were developed specifically for bioacoustics applications.

1) AviSoft-SASLab Pro is a tool for performing various bioacoustics activities [109]. The software displays real-time spectrograms with high-quality output. It offers automated syllable classification with advanced metadata management capabilities. It is a good platform for managing large numbers of audio files and is suitable for batch and real-time processing. It is available in both freeware and proprietary versions.

2) Raven [110] was designed by the Cornell Lab of Ornithology and supports data uploading, visualization and acoustic analysis. Raven Lite is a freeware version that can be used for research on birdsong recognition. It possesses several features such as annotations, detection, correlation, support for various data acquisition software, multiple simultaneous windows and views, support for multiple audio formats, and editing, filtering and amplifying facilities.

3) Kaleidoscope Pro [111] is produced by Wildlife Acoustics, a manufacturer of field recorders, and offers a variety of features, including detecting similar sounds and identifying them as clusters, sound visualization, tools for editing and labelling, and intelligent classifiers that can perform species recognition. It also has support for batch processing and noise analysis.

4) Bioacoustic Workbench: Wimmer et al. [112] proposed a web-based workbench to address issues associated with the large volume of bioacoustics data being collected. They developed a framework which facilitates uploading, organisation and structuring, visualization, recording and analysis through annotation facilities. Data upload is provided via a web service, which has the advantages of ease of access and support for data backup. Role-based access control mechanisms enable small to large project groups to be managed. The playback and visualization components facilitate splitting audio into segments. Users can perform manual analysis by marking annotations and can search the acoustic database for matches.

5) Automated Remote Biodiversity Monitoring Network (ARBIMON), developed by Aide et al. [113], is a combination of a field research project and a web service supporting data hosting and species identification. The software component of the system generates spectrograms using short-time Fourier transforms (STFT) and Hann windows. Signals are further processed to generate regions of interest. The interface for species identification has four components: a visualizer for inspection, listening and manual analysis; a species validation component that allows the user to specify the presence or absence of a species' vocalization in a

recording; a model building component to train models on the data using ROIs (with HMMs); and a model application component, which allows an entire dataset to be processed with a previously created model.

6) Multi-layer framework model: Zhang et al. [114] describe a framework for big acoustic data management, analysis and visualization. Data are collected from sensors, recorders and handheld devices. This large-scale data collection is exposed to issues of high volume, variety and velocity, all of which are expected to grow further as the monitored area increases. To address this, the authors presented a multi-layer framework model for acoustic data. The framework includes a data collection layer and a data management layer for the upload and management of audio, respectively. The third layer offers event processing, which performs pre-processing, rapid scanning of spectrograms as discussed in Truskinger et al. [115], and identification of events of interest. The fourth layer performs knowledge discovery (data mining). The system offers visualization using spectrograms, tag linking and 3D display. The authors concluded that their future work will focus on mining and extracting knowledge through tags, correlation analysis of bird acoustics, and bird behaviour with respect to environment, time and location.

7) High Performance Computer Acoustic Data Accelerator (HPC-ADA): Also with a view to addressing big data issues in bioacoustics, Dugan et al. [116] proposed HPC-ADA, which operates on cloud infrastructure. The authors designed software named Detection and Localization using Machine learning Algorithms (DeLMA), which can run a variety of machine learning tools across multiple cloud-based machines. Scalability is achieved by distributing jobs, while the use of a generic data format supports interoperability. DeLMA uses MATLAB libraries to run algorithms efficiently across distributed computer architectures.

8) Finally, Reyes and Camargo [117] developed a prototype system based on visual analytics to assist in tasks such as species identification. The experiment began by collecting bird sounds and extracting MFCC features. A cosine distance function was then used to identify similarity between audio records through their vector representations. To obtain a 2D representation and reduce dimensionality, PCA was used. Based on the 2D visualization and distance matrix, users can visually analyse bird species' sounds.
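The similarity-plus-projection pipeline of this prototype can be sketched as follows. The MFCC extraction is replaced by synthetic stand-in feature vectors; the cosine similarity and PCA steps follow the description:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(3)
# Stand-ins for per-recording MFCC vectors of two bird "species"
features = np.vstack([rng.normal(0, 1, (20, 13)),
                      rng.normal(2, 1, (20, 13))])

sim = cosine_similarity(features)                      # recording-to-recording similarity
coords = PCA(n_components=2).fit_transform(features)   # 2D layout for visual analysis
print(sim.shape, coords.shape)
```

Plotting `coords` and colouring points by their nearest neighbours in `sim` gives the kind of interactive 2D view the prototype uses for visual species analysis.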

B. BIG DATA HANDLING SOFTWARE
Bioacoustics data is growing: data is received from different sources at different rates, resulting in the generation of massive volumes of data. As a branch of ecology, bioacoustics has the task of identifying species, performing density estimation, tracking environmental changes, and so on, which

requires data to be gathered over long time periods and at appropriate, potentially huge, spatial scales. This qualifies bioacoustics as a big data challenge. Al-Jarrah et al. [118] pointed out that current machine learning algorithms need to be scalable to deal with big data challenges. They reviewed the current state of research in sustainable data modelling, such as ensemble models, Bayesian nonparametric learning models, local learning strategies, semi-parametric approximation and deep learning. The authors also discussed several batch and stream processing big data computing methods. Their opinion was that these sustainable data models are capable of handling huge volumes of data with ease, in any field of science.

A number of research projects are working to address the big data challenges in this field. Kelling et al. [119] discussed the phases of a biodiversity data science workflow. Initially, observation data collected from various sources should be validated and organized. As an example of observational data, the authors developed an avian knowledge network containing 60 bird occurrences with descriptive metadata. Since the observation data come from a diverse and dispersed set of sources, a common data warehouse, the bird monitoring data exchange (BMDE), was developed. Analysis can then be performed on the common data using bagged decision trees and several non-parametric tools.

Technology advances have enabled automated collection of larger volumes of acoustic data, which presents challenges for storage and organisation. Kasten et al. [120] developed an ensemble extraction model from sensor data streams. To support online and incremental learning from the sensor streams to detect bird species, the Multi-Element Self-Organizing tree (MESO) memory system was used. After filtering and transformations, spectrograms are plotted to represent each segment. Piecewise aggregate approximation (PAA) was used to decrease the dimensionality of the time series. Symbolic aggregate approximation (SAX) then converts the PAA representation to symbols, which helps in computing an anomaly score for the signal. On this basis, a distributed stream processing pipeline named Dynamic River was created. Following this, MESO is used for classification and detection; a trained MESO instance returns either an exact matching pattern or a similar pattern. The authors stated that the classification accuracies are promising. For these experiments, ensembles were extracted from data streams comprising a single signal; ensemble extraction from multiple correlated data streams is left as future work.
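PAA and SAX, as used in this pipeline, can be sketched as follows. The segment count and alphabet size are illustrative choices; the Gaussian breakpoints are the standard SAX discretization:

```python
import numpy as np
from scipy.stats import norm

def paa(series, n_segments):
    """Piecewise aggregate approximation: the mean of equal-length segments."""
    usable = len(series) // n_segments * n_segments
    return series[:usable].reshape(n_segments, -1).mean(axis=1)

def sax(series, n_segments, alphabet="abcd"):
    """Symbolic aggregate approximation: z-normalize, apply PAA, then
    discretize against Gaussian breakpoints into alphabet symbols."""
    z = (series - series.mean()) / series.std()
    approx = paa(z, n_segments)
    # Breakpoints splitting N(0,1) into len(alphabet) equiprobable regions
    cuts = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])
    return "".join(alphabet[i] for i in np.digitize(approx, cuts))

t = np.linspace(0, 2 * np.pi, 400)
word = sax(np.sin(t), n_segments=8)
print(word)   # high symbols over the crest, low symbols over the trough
```

The resulting short symbolic word is what enables cheap anomaly scoring over long acoustic time series, as in the Dynamic River pipeline.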

Truskinger et al. [67] have presented applied research on managing and analysing huge raw audio datasets. Their core contribution lies in the presentation of analysis methodologies for practical large-scale acoustic data. The authors categorized automated analysis activities into two broad groups. The first group comprises event detectors, which work on spectral components to identify regions of interest. The second group comprises acoustic indices, which calculate summary statistics of an audio stream. Initial exploration of results was performed in R, which was then transcoded to C#. The project has since been migrated to the cloud, and future work aims at developing a scalable architecture and enhancing analysis functions and spectrogram generation.

Dugan et al. [121] have developed an efficient and scalable algorithm for data processing. The acoustic data-mining accelerator (ADA) algorithm uses distributed computing concepts: sound is partitioned into blocks and each block is assigned to one worker. All results from the workers are collected and merged to provide the analytical results. Results demonstrated a two-fold improvement in execution speed. The authors' future work aims at multi-channel and multi-rate data.
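The partition-fan-out-merge pattern of ADA can be sketched with a simple thread pool. The block analysis function here is a trivial stand-in (block energy) for real detection work:

```python
from concurrent.futures import ThreadPoolExecutor

def analyse_block(block):
    """Worker: trivial stand-in for a detection pass over one audio block."""
    return sum(x * x for x in block)   # block energy

def ada_style_run(samples, block_size=4, workers=4):
    """Partition the sound into blocks, fan the blocks out to workers,
    then collect and merge the per-block results."""
    blocks = [samples[i:i + block_size]
              for i in range(0, len(samples), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(analyse_block, blocks))
    return sum(results)                # merge step

print(ada_style_run(list(range(16))))  # same answer as the serial sum of squares
```

Because block analyses are independent, the merged result is identical to serial processing while the work is spread across workers, the property that gives ADA its speedup.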

IX. OPEN CHALLENGES
Even though there has been a lot of work addressing issues in analyzing bioacoustics data, several open issues remain to be tackled before a fully automated analytical system can be developed.

1) Adverse environmental conditions: Animal recordings are obtained directly from habitats in the field and, as such, are subject to environmental conditions. For instance, it is very common that wind, rain or the sounds of aircraft will also be recorded. Several works point out that this can hamper classification accuracy. Consequently, there is a need for the development of effective pre-processing, feature extraction and classification methods that are robust to such interference.

2) Automatic selection of detectors: In general, bioacoustics algorithms need to work in diverse and changing natural environments. Future bioacoustics analysis systems should be capable of automatically tuning themselves based on these changes. This will also require data mining algorithms that are robust enough to detect changes and adapt their models accordingly. For example, deep learning and meta-algorithms need to be developed to automate detector selection for different weather conditions [122].

3) Acoustic index optimization: The space of alternative acoustic indices should be more widely examined, including their use as features in machine learning algorithms. When indices are calculated independently, biased sampling is another option to be explored [61]. Moreover, most current indices, although widely applicable, still need to be adapted to a given context. In other words, new indices are needed that integrate the characteristics of the environment where the audio was captured, or prior knowledge of the species found in that environment.

4) Call interference: Most classification methods applied to recognizing bird species depend on how precisely they can differentiate between different acoustic signals. Interference between different bird calls can make this difficult and can severely affect the accuracy of automatic recognition methods. Hence it remains a major challenge in the bioacoustics domain.

5) Multiple species recognition: In a recording there may be several species present, apart from noise, but classification methods typically assume that only one species is present in a segment of a recording [43]. Multiple species detection should be explored by developing scalable models that can detect the set of all species present in a recording simultaneously.

6) Scalable, streaming data analysis: There is much scope for improving algorithms for efficient species detection across large-scale, multiple datasets. Current techniques are mostly tested with small datasets at a time, yet given the large-scale deployment of recorders, processing algorithms should be scalable. Techniques for analysing multiple correlated data streams (from spatially related recordings) are also an open area of research.

X. CONCLUSION
Modern bioacoustics has a nearly 70-year history, and has grown considerably in recent decades with advances in processing techniques and hardware. Most current bioacoustics research focuses on the application areas of biodiversity assessment, using statistical techniques and acoustic indices; density estimation, using statistical analysis of detected calls; and species identification, using a variety of machine learning techniques. Growth in the use of acoustic monitoring has led to commensurate growth in the amount of raw audio needing to be analysed, necessitating the use of distributed computing tools, although the application of these tools in the bioacoustics domain is somewhat in its infancy. This review has identified a number of targets for future bioacoustics research.

REFERENCES
[1] K. R. Crooks and M. Sanjayan, Connectivity Conservation, vol. 14. Cambridge, U.K.: Cambridge Univ. Press, 2006.
[2] D. S. Wilcove, D. Rothstein, J. Dubow, A. Phillips, and E. Losos, "Quantifying threats to imperiled species in the United States: Assessing the relative importance of habitat destruction, alien species, pollution, overexploitation, and disease," BioScience, vol. 48, no. 8, pp. 607–615, 1998.
[3] K. Riede, "Monitoring biodiversity: Analysis of Amazonian rainforest sounds," Ambio, pp. 546–548, Dec. 1993.
[4] P. Laiolo, "The emerging significance of bioacoustics in animal species conservation," Biol. Conservation, vol. 143, no. 7, pp. 1635–1645, Jul. 2010.
[5] C. Huetz and T. Aubin, "Bioacoustics approaches to locate and identify animals in terrestrial environments," in Sensors for Ecology, Towards Integrated Knowledge of Ecosystems, J. J. Le Galliard, J. M. Guarini, and F. Gaill, Eds. Paris, France: CNRS, 2002, pp. 83–96.
[6] H. Klinck, S. L. Nieukirk, D. K. Mellinger, K. Klinck, H. Matsumoto, and R. P. Dziak, "Seasonal presence of cetaceans and ambient noise levels in polar waters of the North Atlantic," J. Acoust. Soc. Amer., vol. 132, no. 3, pp. EL176–EL181, Sep. 2012.
[7] W. W. Au and M. C. Hastings, Principles of Marine Bioacoustics. New York, NY, USA: Springer, 2008.
[8] W. M. Zimmer, Passive Acoustic Monitoring of Cetaceans. New York, NY, USA: Cambridge Univ. Press, 2011.
[9] D. K. Mellinger, "Introduction to animal bioacoustics," J. Acoust. Soc. Amer., vol. 129, no. 4, p. 2406, 2011.
[10] J. Sueur, A. Farina, A. Gasc, N. Pieretti, and S. Pavoine, "Acoustic indices for biodiversity assessment and landscape investigation," Acta Acustica United Acustica, vol. 100, no. 4, pp. 772–781, Jul. 2014.


[11] B. S. Dang and F. A. Andrews, A Literature Survey on the Subject of the Use of Acoustics in Fish Catching and Fish Study. Washington, DC, USA: Institute of Ocean Science and Engineering, Catholic Univ. of America, 1971.
[12] T. Riede, H. Herzel, K. Hammerschmidt, L. Brunnberg, and G. Tembrock, "The harmonic-to-noise ratio applied to dog barks," J. Acoust. Soc. Amer., vol. 110, no. 4, pp. 2191–2197, Oct. 2001.
[13] D. A. Mann, A. D. Hawkins, and J. M. Jech, "Active and passive acoustics to locate and study fish," in Fish Bioacoustics. New York, NY, USA: Springer, 2008, pp. 279–309.
[14] A. King, "Functional anatomy of the syrinx," in Form and Function in Birds, vol. 4. London, U.K.: Academic, 1989, pp. 105–192.
[15] J. M. Wild, "Neural pathways for the control of birdsong production," J. Neurobiol., vol. 33, no. 5, pp. 653–670, Nov. 1997.
[16] J. J. Bolhuis and M. Everaert, Birdsong, Speech, and Language: Exploring the Evolution of Mind and Brain. Cambridge, MA, USA: MIT Press, 2013, pp. 1–26.
[17] A. Kershenbaum et al., "Acoustic sequences in non-human animals: A tutorial review and prospectus," Biol. Rev., vol. 91, no. 1, pp. 13–52, Feb. 2016.
[18] J. R. Krebs and D. E. Kroodsma, "Repertoires and geographical variation in bird song," Adv. Study Behav., vol. 11, pp. 143–177, Jan. 1980.
[19] F. Glaw and M. Vences, "Bioacoustic differentiation in painted frogs (Discoglossus)," Amphibia-Reptilia, vol. 12, no. 4, pp. 385–394, 1991.
[20] R. Mace, "The dawn chorus in the great tit Parus major is directly related to female fertility," Nature, vol. 330, no. 6150, pp. 745–746, Dec. 1987.
[21] M. K. Obrist, G. Pavan, J. Sueur, K. Riede, D. Llusia, and R. Márquez, "Bioacoustics approaches in biodiversity inventories," Abc Taxa, vol. 8, pp. 68–99, Jan. 2010.
[22] K. Riede. (2010). Bioacoustic Diversity. [Online]. Available: http://www.bioacoustics.myspecies.info/en
[23] J. Cai, D. Ee, B. Pham, P. Roe, and J. Zhang, "Sensor network for the monitoring of ecosystem: Bird species recognition," in Proc. 3rd Int. Conf. Intell. Sensors, Sensor Netw. Inf., 2007, pp. 293–298.
[24] J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proc. IEEE, vol. 67, no. 12, pp. 1586–1604, Dec. 1979.
[25] Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Trans. Speech Audio Process., vol. 3, no. 4, pp. 251–266, Jul. 1995.
[26] H. Lev-Ari and Y. Ephraim, "Extension of the signal subspace speech enhancement approach to colored noise," IEEE Signal Process. Lett., vol. 10, no. 4, pp. 104–106, Apr. 2003.
[27] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1109–1121, Dec. 1984.
[28] Y. Ephraim, "Statistical-model-based speech enhancement systems," Proc. IEEE, vol. 80, no. 10, pp. 1526–1555, Oct. 1992.
[29] M. B. Trawicki, M. T. Johnson, and T. S. Osiejuk, "Automatic song-type classification and speaker identification of Norwegian Ortolan Bunting (Emberiza hortulana) vocalizations," in Proc. IEEE Workshop Mach. Learn. Signal Process., Sep. 2005, pp. 277–282.
[30] C. Bedoya, C. Isaza, J. M. Daza, and J. D. López, "Automatic recognition of anuran species based on syllable identification," Ecol. Inform., vol. 24, pp. 200–209, Nov. 2014.
[31] J. Cheng, Y. Sun, and L. Ji, "A call-independent and automatic acoustic system for the individual recognition of animals: A novel model using four passerines," Pattern Recognit., vol. 43, no. 11, pp. 3846–3852, Nov. 2010.
[32] E. J. S. Fox, J. D. Roberts, and M. Bennamoun, "Call-independent individual identification in birds," Bioacoustics, vol. 18, no. 1, pp. 51–67, Jan. 2008.
[33] J. A. Kogan and D. Margoliash, "Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: A comparative study," J. Acoust. Soc. Amer., vol. 103, no. 4, pp. 2185–2196, Apr. 1998.
[34] N. C. Han, S. V. Muniandy, and J. Dayou, "Acoustic classification of Australian anurans based on hybrid spectral-entropy approach," Appl. Acoust., vol. 72, no. 9, pp. 639–645, Sep. 2011.
[35] J. Dayou, N. C. Han, H. C. Mun, A. H. Ahmad, S. V. Muniandy, and M. N. Dalimin, "Classification and identification of frog sound based on entropy approach," in Proc. Int. Conf. Life Sci. Technol., vol. 3, 2011, pp. 184–187.
[36] S.-A. Selouani, M. Kardouchi, E. Hervet, and D. Roy, "Automatic birdsong recognition based on autoregressive time-delay neural networks," in Proc. ICSC Congr. Comput. Intell. Methods Appl., Dec. 2005, pp. 1–6.
[37] P. Somervuo, A. Harma, and S. Fagerlund, "Parametric representations of bird sounds for automatic species recognition," IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 6, pp. 2252–2263, Nov. 2006.
[38] A. Harma, "Automatic identification of bird species based on sinusoidal modeling of syllables," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), vol. 5, Apr. 2003, p. V-545.
[39] P. Somervuo and A. Harma, "Bird song recognition based on syllable pair histograms," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 5, May 2004, p. V-825.
[40] M. Graciarena, M. Delplanche, E. Shriberg, and A. Stolcke, "Bird species recognition combining acoustic and sequence modeling," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2011, pp. 341–344.
[41] B. Lakshminarayanan, R. Raich, and X. Fern, "A syllable-level probabilistic framework for bird species identification," in Proc. Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2009, pp. 53–59.
[42] L. Neal, F. Briggs, R. Raich, and X. Z. Fern, "Time-frequency segmentation of bird song in noisy acoustic environments," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2011, pp. 2012–2015.
[43] F. Briggs, B. Lakshminarayanan, L. Neal, X. Z. Fern, R. Raich, S. J. K. Hadley, A. S. Hadley, and M. G. Betts, "Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach," J. Acoust. Soc. Amer., vol. 131, no. 6, pp. 4640–4650, Jun. 2012.
[44] J. G. Colonna, M. Cristo, M. Salvatierra, and E. F. Nakamura, "An incremental technique for real-time bioacoustic signal segmentation," Expert Syst. Appl., vol. 42, no. 21, pp. 7367–7374, Nov. 2015.
[45] L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Upper Saddle River, NJ, USA: Prentice-Hall, 1993.
[46] F. Briggs, R. Raich, and X. Z. Fern, "Audio classification of bird species: A statistical manifold approach," in Proc. 9th IEEE Int. Conf. Data Mining, Dec. 2009, pp. 51–60.
[47] J. Leben. (2020). MARSYAS: Music Analysis, Retrieval and Synthesis for Audio Signals. [Online]. Available: http://marsyas.info
[48] M. T. Lopes, L. L. Gioppo, T. T. Higushi, C. A. A. Kaestner, C. N. Silla, Jr., and A. L. Koerich, "Automatic bird species identification for large number of species," in Proc. IEEE Int. Symp. Multimedia, Dec. 2011, pp. 117–122.
[49] G. A. F. Seber, The Estimation of Animal Abundance. Caldwell, NJ, USA: Blackburn Press, 1982.
[50] B. K. Williams, J. D. Nichols, and M. J. Conroy, Analysis and Management of Animal Populations. New York, NY, USA: Academic, 2002.
[51] R. T. Reynolds, J. M. Scott, and R. A. Nussbaum, "A variable circular-plot method for estimating bird numbers," Condor, vol. 82, no. 3, pp. 309–313, Aug. 1980.
[52] D. K. Dawson and M. G. Efford, "Bird population density estimated from acoustic signals," J. Appl. Ecol., vol. 46, no. 6, pp. 1201–1209, 2009.
[53] S. S. Rosenstock, D. R. Anderson, K. M. Giesen, T. Leukering, and M. F. Carter, "Landbird counting techniques: Current practices and an alternative," Auk, vol. 119, no. 1, pp. 46–53, Jan. 2002.
[54] K. P. Burnham, D. R. Anderson, and J. L. Laake, "Line transect estimation of bird population density using a Fourier series," Stud. Avian Biol., vol. 6, pp. 466–482, Jan. 1981.
[55] A. Hiby and A. Ward, "Analysis of cue-counting and blow rate estimation experiments carried out during the 1984/85 IDCR minke whale assessment cruise," Int. Whaling Commission, Cambridge, U.K., Tech. Rep., vol. 36, 1986, pp. 473–476.
[56] S. T. Buckland, "Point-transect surveys for songbirds: Robust methodologies," Auk, vol. 123, no. 2, pp. 345–357, 2006.
[57] J. A. Royle, "N-mixture models for estimating population size from spatially replicated counts," Biometrics, vol. 60, no. 1, pp. 108–115, Mar. 2004.
[58] T. A. Marques, L. Thomas, S. W. Martin, D. K. Mellinger, J. A. Ward, D. J. Moretti, D. Harris, and P. L. Tyack, "Estimating animal population density using passive acoustics," Biol. Rev., vol. 88, no. 2, pp. 287–309, May 2013.
[59] J. Sueur, S. Pavoine, O. Hamerlynck, and S. Duvail, "Rapid acoustic survey for biodiversity appraisal," PLoS ONE, vol. 3, no. 12, p. e4065, 2008.


[60] M. Depraetere, S. Pavoine, F. Jiguet, A. Gasc, S. Duvail, and J. Sueur, ‘‘Monitoring animal diversity using acoustic indices: Implementation in a temperate woodland,’’ Ecol. Indicators, vol. 13, no. 1, pp. 46–54, Feb. 2012.

[61] M. W. Towsey, ‘‘The calculation of acoustic indices to characterise acoustic recordings of the environment,’’ Queensland Univ. Technol., Brisbane, QLD, Australia, Tech. Rep., 2012.

[62] M. Sankupellay, M. Towsey, A. Truskinger, and P. Roe, ‘‘Visual fingerprints of the acoustic environment: The use of acoustic indices to characterise natural habitats,’’ in Proc. Big Data Vis. Anal. (BDVA), Sep. 2015, pp. 1–8.

[63] M. Towsey, L. Zhang, M. Cottman-Fields, J. Wimmer, J. Zhang, and P. Roe, ‘‘Visualization of long-duration acoustic recordings of the environment,’’ Procedia Comput. Sci., vol. 29, pp. 703–712, Jan. 2014.

[64] A. Celis-Murillo, J. L. Deppe, and M. F. Allen, ‘‘Using soundscape recordings to estimate bird species abundance, richness, and composition,’’ J. Field Ornithol., vol. 80, no. 1, pp. 64–78, Mar. 2009.

[65] R. Bardeli, D. Wolff, F. Kurth, M. Koch, K.-H. Tauchert, and K.-H. Frommolt, ‘‘Detecting bird sounds in a complex acoustic environment and application to bioacoustic monitoring,’’ Pattern Recognit. Lett., vol. 31, no. 12, pp. 1524–1534, Sep. 2010.

[66] J. Wimmer, M. Towsey, P. Roe, and I. Williamson, ‘‘Sampling environmental acoustic recordings to determine bird species richness,’’ Ecol. Appl., vol. 23, no. 6, pp. 1419–1428, Sep. 2013.

[67] A. Truskinger, M. Cottman-Fields, P. Eichinski, M. Towsey, and P. Roe, ‘‘Practical analysis of big acoustic sensor data for environmental monitoring,’’ in Proc. IEEE 4th Int. Conf. Big Data Cloud Comput., Dec. 2014, pp. 91–98.

[68] V. B. Deecke and V. M. Janik, ‘‘Automated categorization of bioacoustic signals: Avoiding perceptual pitfalls,’’ J. Acoust. Soc. Amer., vol. 117, no. 4, p. 2470, Apr. 2005.

[69] C.-F. Juang and T.-M. Chen, ‘‘Birdsong recognition using prediction-based recurrent neural fuzzy networks,’’ Neurocomputing, vol. 71, nos. 1–3, pp. 121–130, Dec. 2007.

[70] A. L. McIlraith and H. C. Card, ‘‘Birdsong recognition using backpropagation and multivariate statistics,’’ IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2740–2748, Nov. 1997.

[71] J. D. Olden, J. J. Lawler, and N. L. Poff, ‘‘Machine learning methods without tears: A primer for ecologists,’’ Quart. Rev. Biol., vol. 83, no. 2, pp. 171–193, Jun. 2008.

[72] S. Kahl, T. Wilhelm-Stein, H. Klinck, D. Kowerko, and M. Eibl, ‘‘A baseline for large-scale bird species identification in field recordings,’’ Chemnitz Univ. Technol., Chemnitz, Germany, Tech. Rep., 2018.

[73] B. Fazekas, A. Schindler, T. Lidy, and A. Rauber, ‘‘A multi-modal deep neural network approach to bird-song identification,’’ 2018, arXiv:1811.04448. [Online]. Available: http://arxiv.org/abs/1811.04448

[74] M. Sankupellay and D. Konovalov, ‘‘Bird call recognition using deep convolutional neural network, ResNet-50,’’ in Proc. ACOUSTICS, vol. 7, 2018, pp. 1–8.

[75] X. Xia, R. Togneri, F. Sohel, Y. Zhao, and D. Huang, ‘‘A survey: Neural network-based deep learning for acoustic event detection,’’ Circuits, Syst., Signal Process., vol. 38, no. 8, pp. 3433–3453, Aug. 2019.

[76] C.-H. Lee, C.-C. Han, and C.-C. Chuang, ‘‘Automatic classification of bird species from their sounds using two-dimensional cepstral coefficients,’’ IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 8, pp. 1541–1550, Nov. 2008.

[77] M. Termansen, C. J. McClean, and C. D. Preston, ‘‘The use of genetic algorithms and Bayesian classification to model species distributions,’’ Ecol. Model., vol. 192, nos. 3–4, pp. 410–424, Feb. 2006.

[78] G. Vaca-Castano and D. Rodriguez, ‘‘Using syllabic mel cepstrum features and k-nearest neighbors to identify anurans and birds species,’’ in Proc. IEEE Workshop Signal Process. Syst., Oct. 2010, pp. 466–471.

[79] E. Vilches, I. A. Escobar, E. E. Vallejo, and C. E. Taylor, ‘‘Data mining applied to acoustic bird species recognition,’’ in Proc. 18th Int. Conf. Pattern Recognit. (ICPR), vol. 3, 2006, pp. 400–403.

[80] C. N. Silla and C. A. A. Kaestner, ‘‘Hierarchical classification of bird species using their audio recorded songs,’’ in Proc. IEEE Int. Conf. Syst., Man, Cybern. (SMC), Oct. 2013, pp. 1895–1900.

[81] M. A. Acevedo, C. J. Corrada-Bravo, H. Corrada-Bravo, L. J. Villanueva-Rivera, and T. M. Aide, ‘‘Automated classification of bird and amphibian calls using machine learning: A comparison of methods,’’ Ecol. Inform., vol. 4, no. 4, pp. 206–214, Sep. 2009.

[82] F. Briggs et al., ‘‘The 9th annual MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment,’’ in Proc. IEEE Int. Workshop Mach. Learn. Signal Process. (MLSP), Sep. 2013, pp. 1–8.

[83] D. Stowell and M. D. Plumbley, ‘‘Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning,’’ PeerJ, vol. 2, p. e488, Jul. 2014.

[84] S. Fagerlund, ‘‘Bird species recognition using support vector machines,’’ EURASIP J. Adv. Signal Process., vol. 2007, no. 1, p. 64, Dec. 2007.

[85] C. Kampichler, R. Wieland, S. Calmé, H. Weissenberger, and S. Arriaga-Weiss, ‘‘Classification in conservation biology: A comparison of five machine-learning methods,’’ Ecol. Inform., vol. 5, no. 6, pp. 441–450, Nov. 2010.

[86] M. T. Lopes, A. L. Koerich, C. N. Silla, and C. A. A. Kaestner, ‘‘Feature set comparison for automatic bird species identification,’’ in Proc. IEEE Int. Conf. Syst., Man, Cybern. (SMC), Oct. 2011, pp. 965–970.

[87] J. Cheng, B. Xie, C. Lin, and L. Ji, ‘‘A comparative study in birds: Call-type-independent species and individual recognition using four machine-learning methods and two acoustic features,’’ Bioacoustics, vol. 21, no. 2, pp. 157–171, Jun. 2012.

[88] N. Vaughan, G. Jones, and S. Harris, ‘‘Identification of British bat species by multivariate analysis of echolocation call parameters,’’ Bioacoustics, vol. 7, no. 3, pp. 189–207, Jan. 1997.

[89] M. J. O’Farrell, B. W. Miller, and W. L. Gannon, ‘‘Qualitative identification of free-flying bats using the Anabat detector,’’ J. Mammal., vol. 80, no. 1, pp. 11–23, Feb. 1999.

[90] D. R. Griffin, Listening in the Dark: The Acoustic Orientation of Bats and Men. New Haven, CT, USA: Yale Univ. Press, 1958.

[91] M. K. Obrist, R. Boesch, and P. F. Flückiger, ‘‘Variability in echolocation call design of 26 Swiss bat species: Consequences, limits and options for automated field identification with a synergetic pattern recognition approach,’’ Mammalia, vol. 68, no. 4, pp. 307–322, Jan. 2004.

[92] A. C. Hughes, C. Satasook, P. J. J. Bates, P. Soisook, T. Sritongchuay, G. Jones, and S. Bumrungsri, ‘‘Using echolocation calls to identify Thai bat species: Vespertilionidae, Emballonuridae, Nycteridae and Megadermatidae,’’ Acta Chiropterol., vol. 13, no. 2, pp. 447–455, Dec. 2011.

[93] D. G. Preatoni, M. Nodari, R. Chirichella, G. Tosi, L. A. Wauters, and A. Martinoli, ‘‘Identifying bats from time-expanded recordings of search calls: Comparing classification methods,’’ J. Wildlife Manage., vol. 69, no. 4, pp. 1601–1614, Oct. 2005.

[94] N. Jennings, S. Parsons, and M. J. O. Pocock, ‘‘Human vs. machine: Identification of bat species from their echolocation calls by humans and by artificial neural networks,’’ Can. J. Zool., vol. 86, no. 5, pp. 371–377, May 2008.

[95] R. Redgwell, J. Szewczak, G. Jones, and S. Parsons, ‘‘Classification of echolocation calls from 14 species of bat by support vector machines and ensembles of neural networks,’’ Algorithms, vol. 2, no. 3, pp. 907–924, 2009.

[96] C. L. Walters, R. Freeman, A. Collen, C. Dietz, M. B. Fenton, G. Jones, M. K. Obrist, S. J. Puechmaille, T. Sattler, B. M. Siemers, S. Parsons, and K. E. Jones, ‘‘A continental-scale tool for acoustic identification of European bats,’’ J. Appl. Ecol., vol. 49, no. 5, pp. 1064–1074, Oct. 2012.

[97] J. Xie, K. Indraswari, L. Schwarzkopf, M. Towsey, J. Zhang, and P. Roe, ‘‘Acoustic classification of frog within-species and species-specific calls,’’ Appl. Acoust., vol. 131, pp. 79–86, Feb. 2018.

[98] S. Yin and B. McCowan, ‘‘Barking in domestic dogs: Context specificity and individual identification,’’ Animal Behav., vol. 68, no. 2, pp. 343–355, Aug. 2004.

[99] P. J. Clemins, M. T. Johnson, K. M. Leong, and A. Savage, ‘‘Automatic classification and speaker identification of African elephant (Loxodonta africana) vocalizations,’’ J. Acoust. Soc. Amer., vol. 117, no. 2, pp. 956–963, Feb. 2005.

[100] E. D. Chesmore and C. Nellenbach, ‘‘Acoustic methods for the automated detection and identification of insects,’’ in Proc. 3rd Int. Symp. Sensors Horticulture, 1997, pp. 223–231.

[101] I. Potamitis, T. Ganchev, and N. Fakotakis, ‘‘Automatic acoustic identification of crickets and cicadas,’’ in Proc. 9th Int. Symp. Signal Process. Appl. (ISSPA), Feb. 2007, pp. 1–4.

[102] J. K. Horne, ‘‘Acoustic approaches to remote species identification: A review,’’ Fisheries Oceanogr., vol. 9, no. 4, pp. 356–371, Dec. 2000.

[103] J. N. Oswald, S. Rankin, and J. Barlow, ‘‘The effect of recording and analysis bandwidth on acoustic identification of delphinid species,’’ J. Acoust. Soc. Amer., vol. 116, no. 5, pp. 3178–3185, Nov. 2004.


[104] J. N. Oswald, S. Rankin, J. Barlow, and M. O. Lammers, ‘‘A tool for real-time acoustic species identification of delphinid whistles,’’ J. Acoust. Soc. Amer., vol. 122, no. 1, pp. 587–595, Jul. 2007.

[105] M. A. Roch, M. S. Soldevilla, J. C. Burtenshaw, E. E. Henderson, and J. A. Hildebrand, ‘‘Gaussian mixture model classification of odontocetes in the Southern California Bight and the Gulf of California,’’ J. Acoust. Soc. Amer., vol. 121, no. 3, pp. 1737–1748, Mar. 2007.

[106] M. Gridi-Papp. (2011). SoundRuler. [Online]. Available: http://soundruler.sourceforge.net

[107] M. A. Bee, ‘‘Equipment review,’’ Bioacoustics, vol. 14, no. 2, pp. 171–178, 2004.

[108] J. Sueur. (2019). Seewave: Sound Analysis and Synthesis. [Online]. Available: https://cran.r-project.org/web/packages/seewave

[109] (2020). Avisoft Bioacoustics. Accessed: Feb. 2017. [Online]. Available: http://www.avisoft.com

[110] Raven Pro, Lite, and Exhibit, Cornell Lab Ornithol., Ithaca, NY, USA, 2020.

[111] Wildlife Acoustics. (2020). Kaleidoscope Pro Analysis Software. [Online]. Available: https://www.wildlifeacoustics.com/products/kaleidoscope-pro

[112] J. Wimmer, M. Towsey, B. Planitz, P. Roe, and I. Williamson, ‘‘Scaling acoustic data analysis through collaboration and automation,’’ in Proc. IEEE 6th Int. Conf. e-Sci., Dec. 2010, pp. 308–315.

[113] T. M. Aide, C. Corrada-Bravo, M. Campos-Cerqueira, C. Milan, G. Vega, and R. Alvarez, ‘‘Real-time bioacoustics monitoring and automated species identification,’’ PeerJ, vol. 1, p. e103, Jul. 2013.

[114] J. Zhang, K. Huang, M. Cottman-Fields, A. Truskinger, P. Roe, S. Duan, X. Dong, M. Towsey, and J. Wimmer, ‘‘Managing and analysing big audio data for environmental monitoring,’’ in Proc. IEEE 16th Int. Conf. Comput. Sci. Eng., Dec. 2013, pp. 997–1004.

[115] A. Truskinger, M. Cottman-Fields, D. Johnson, and P. Roe, ‘‘Rapid scanning of spectrograms for efficient identification of bioacoustic events in big data,’’ in Proc. IEEE 9th Int. Conf. e-Sci., Oct. 2013, pp. 270–277.

[116] P. Dugan, J. Zollweg, H. Glotin, M. Popescu, D. Risch, Y. LeCun, and C. Clark, ‘‘High performance computer acoustic data accelerator (HPC-ADA): A new system for exploring marine mammal acoustics for big data applications,’’ in Proc. ICML Unsupervised Learn. Bioacoustics, 2014, pp. 1–7.

[117] A. K. Reyes and J. E. Camargo, ‘‘Visualization of audio records for automatic bird species identification,’’ in Proc. 20th Symp. Signal Process., Images Comput. Vis. (STSIVA), Sep. 2015, pp. 1–6.

[118] O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, and K. Taha, ‘‘Efficient machine learning for big data: A review,’’ Big Data Res., vol. 2, no. 3, pp. 87–93, Sep. 2015.

[119] S. Kelling, W. M. Hochachka, D. Fink, M. Riedewald, R. Caruana, G. Ballard, and G. Hooker, ‘‘Data-intensive science: A new paradigm for biodiversity studies,’’ BioScience, vol. 59, no. 7, pp. 613–620, Jul. 2009.

[120] E. P. Kasten, P. K. McKinley, and S. H. Gage, ‘‘Ensemble extraction for classification and detection of bird species,’’ Ecol. Inform., vol. 5, no. 3, pp. 153–166, May 2010.

[121] P. J. Dugan, H. Klinck, J. A. Zollweg, and C. W. Clark, ‘‘Data mining sound archives: A new scalable algorithm for parallel-distributing processing,’’ in Proc. IEEE Int. Conf. Data Mining Workshop (ICDMW), Nov. 2015, pp. 768–772.

[122] D. Stowell, M. Wood, Y. Stylianou, and H. Glotin, ‘‘Bird detection in audio: A survey and a challenge,’’ in Proc. IEEE 26th Int. Workshop Mach. Learn. Signal Process. (MLSP), Sep. 2016, pp. 1–6.

RAMA RAO KVSN received the master’s degree from Andhra University, India. His experience spans academic and industry domains. He is currently involved in research on bioacoustics with the University of Tasmania, Australia. His research interests include bioacoustics, machine learning, and stream processing.

JAMES MONTGOMERY (Member, IEEE) received the B.Inf.Tech. degree (Hons.) and the Ph.D. degree in computer science from Bond University, Australia, in 2000 and 2005, respectively. He has held postdoctoral appointments at the Swinburne University of Technology and The Australian National University. He is currently a Senior Lecturer with ICT, University of Tasmania, Australia. His research interests span evolutionary computation, bioacoustics, machine learning, and web services.

SAURABH GARG (Member, IEEE) received the Ph.D. degree from the University of Melbourne. He is currently a Senior Lecturer with the University of Tasmania, Australia. He has authored over 40 articles in highly cited journals and conferences. His research interests include resource management, scheduling, utility and grid computing, cloud computing, green computing, wireless networks, and ad hoc networks.

MICHAEL CHARLESTON is currently an Associate Professor with the University of Tasmania. He has held academic positions at Massey University, the University of Texas at Austin, TX, USA, Glasgow University, Oxford University, and the University of Sydney. He has authored or coauthored over 100 articles in international peer-reviewed journals. His research interests include bioinformatics methods, phylogenetics and cophylogenetic epidemiology of infectious disease, biological modeling, ecology, graph theory, and software development. He was awarded a Royal Society University Research Fellowship at Oxford University.
