

Deep Neural Network for Musical Instrument Recognition using MFCCs

Saranga Kingkor Mahanta1, Abdullah Faiz Ur Rahman Khilji2, Partha Pakray2

1 Department of Electronics and Communication Engineering, National Institute of Technology, Silchar, Assam, India

2 Department of Computer Science and Engineering, National Institute of Technology, Silchar, Assam, India

{saranga ug@ece, abdullah ug@cse, partha@cse}.nits.ac.in

Abstract. The task of efficient automatic music classification is of vital importance and forms the basis for various advanced applications of AI in the musical domain. Musical instrument recognition is the task of identifying an instrument from its audio; these sound vibrations are leveraged by the model to match against the instrument classes. In this paper, we use an artificial neural network (ANN) model trained to perform classification over twenty different classes of musical instruments, using only the mel-frequency cepstral coefficients (MFCCs) of the audio data. Our proposed model trains on the full London Philharmonic Orchestra dataset, which contains twenty classes of instruments belonging to four families, viz. woodwinds, brass, percussion, and strings. Based on experimental results, our model achieves state-of-the-art accuracy on the same.

Keywords. Musical Instrument Recognition, Artificial Neural Network, Deep Learning, Multi-class Classification

1 Introduction

Music is an essential part of daily life for a majority of the world's population. Music can be categorized by various parameters such as genre, performer, or composer, and machine learning techniques provide numerous ways to perform such categorization as needed. Automatic musical instrument recognition and classification is a non-trivial and practically valuable task, as it classifies music by the instrument being played faster and more cheaply than manual classification. Even seasoned musicians find it difficult to distinguish between two instruments belonging to the same family, so manual classification is also prone to errors. Automatic recognition of musical instruments forms the basis of more complex tasks such as melody extraction, music information retrieval, and recognizing the dominant instruments in polyphonic audio [5]. Hence, the classification task is vital for subsequent downstream tasks.

This paper proposes a deep artificial neural network model that efficiently distinguishes and recognizes 20 different classes of musical instruments, even across instruments belonging to the same family. Additionally, we examine the discriminative power of MFCCs for the classification task. MFCCs are widely used as the deciding feature for numerous cognition tasks, since they account for the perception of sound by the human hearing system by converting conventional frequency to the Mel scale. The results achieved with the simple proposed model on a highly imbalanced dataset, without data augmentation or explicit handling of the class imbalance, bolster the significance of MFCCs. The promising results can be claimed as a new state of the art on this particular dataset because previous works have considered different subsets of the dataset, while this work considers the entire dataset.

The organization of the paper is as follows. Section 2 surveys related work on musical instrument recognition; Section 3 describes the dataset; the pre-processing steps are explained in Section 4; Section 5 presents the system description; the setup and description of our model are explained in Section 6; the results and discussion are presented in Section 7; finally, Section 8 concludes the paper with future scope and potential improvements.

2 Related Work

Ample research has been undertaken to recognize and distinguish one musical instrument from another. Numerous feature extraction techniques [2] have been formulated, followed by various machine learning implementations, to perform this particular classification task accurately. In 1999, a study classified 8 different instruments with a 30% error rate using Gaussian mixture models and support vector machines (SVMs) [8]. Hidden Markov models have been used by [3] to classify audio streams among four instruments on a dataset containing 600 recordings. Extensive use of the K-nearest neighbour (K-NN) classifier to compare performance against other models has been observed as well. Gaussian and K-NN classifiers were implemented by [4] to classify instruments with a set of 43 input features. Another study applied K-NN and SVM to cepstral features of the audio data [6].

With the advent of advancing deep learning methods, recent studies have exploited the power of neural networks to attain excellent classification accuracies. Moreover, deep learning methods exempt researchers from the tedious task of manually extracting multiple features to be fed into a learning model. Convolutional neural networks (CNNs) have been used by [7], [12], and [11] on mel-spectrograms, visual representations of sound signals that encompass both time- and frequency-domain features.

A few studies have worked on the London Philharmonic Orchestra dataset and achieved notable results. ANNs were trained by [13] on 8 of the 20 instrument classes currently present in the dataset. They also conducted several comparative experiments on different characteristics of the musical instruments that might have impacted the resulting accuracy, and achieved a maximum accuracy of 93.5% in the base experiment. A stable precision and recall of 94% were achieved by [10] in classification over 18 classes of the same dataset. Another study achieved 99% accuracy using a CNN; however, that model was trained on only 6 classes [10].

As the number of classes increases, so does the complexity of the classification. Unlike the previously mentioned works, which use a subset of the dataset, our model is trained on all 20 currently available classes of the London Philharmonic dataset and achieves a commendable accuracy of 97%.

3 Dataset

The proposed model has been trained on all classes of the London Philharmonic Orchestra dataset¹. At the time of downloading the dataset, there were a total of 13,679 examples divided non-uniformly among 20 classes of musical instruments.

For each instrument class except 'percussion', the samples are musical notes ranging from A1 (note A of the first octave) to G7 (note G of the seventh octave). The tones have been played with varying dynamics, including forte, fortissimo, piano, and pianissimo, and with varying playing techniques and articulations, such as trills, sustains, staccato, staccatissimo, legato, vibrato, tremolo, pizzicato, and ponticello, thus providing diverse samples for a learning system to generalize over.

From Figure 1 it can be observed that the dataset is highly imbalanced, with 'violin' having 1,502 examples while 'banjo' has only 74.

Additionally, the class named 'percussion' has 39 sub-classes of percussion instruments, ranging from 'agogo bells' to 'woodblock'. These sub-classes have too few examples to be treated as individual classes; hence, they are consolidated into a single class, i.e., 'percussion'.

¹ London Philharmonic Orchestra all-instruments dataset, publicly available at https://philharmonia.co.uk/resources/sound-samples/



Fig. 1. Data distribution among the 20 classes


4 Pre-processing

The data was already noise-free and consisted of a single instrument tone per example corresponding to the respective class, thus relieving us from performing complex processing procedures. The pre-processing steps that were performed are described in detail in the following sections.

4.1 Normalization of Audio Files

The audio clips have durations ranging from 0.078 seconds to 77.06 seconds, as shown in Figure 2. In the proposed model, we use the MFCCs of the training examples as input features to the ANN. The input dimensions must be constant, since the input layer of the ANN has a fixed number of neurons; this implies that the MFCC matrices of all examples need to have a fixed dimensional size. To achieve this, each example must have a constant duration, resulting in a fixed number of samples when sampled at a constant sampling rate.

Out of the 13,679 examples, it was observed that 6,342, 11,097, 12,196, 12,550, and 12,701 examples had durations less than or equal to 1, 2, 3, 4, and 5 seconds respectively. Since a majority of the audio files had a duration of at most 3 seconds, a fixed duration of 3 seconds was chosen for all examples². Consequently, while loading the examples, those longer than 3 seconds (Figure 3) were trimmed and those shorter (Figure 4) were zero-padded, so that in the preprocessed data all 13,679 examples span 0 to 3 seconds.

It can be concluded that no promising information was lost from the trimmed samples, because the onsets and attacks of the sound in all of the audio files take place within the first 3 seconds, after which the sound either decays or sustains; i.e., the defining features of the audio clip mostly occur within the first 3 seconds of the signal, and most of the examples are almost periodic with minuscule periods. Another reason for not choosing a fixed length of more than 3 seconds is to limit the number of sparse values that result from padding and to reduce the dimensional size.

² 3 seconds correspond to 66,150 samples at the default sampling rate of 22,050 Hz.
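The fixed-length loading described above could be implemented along the lines of the sketch below with librosa; the library choice and function names are our assumption rather than the authors' published code, while the 3-second duration and 22,050 Hz sampling rate come from the paper.

```python
import librosa
import numpy as np

SR = 22050                # librosa's default sampling rate (see footnote 2)
FIXED_SAMPLES = 3 * SR    # 3 seconds -> 66,150 samples


def load_fixed_length(path: str) -> np.ndarray:
    """Load an audio file as mono at 22,050 Hz and trim/zero-pad it to 3 seconds."""
    signal, _ = librosa.load(path, sr=SR)
    # fix_length trims signals longer than FIXED_SAMPLES and zero-pads shorter ones
    return librosa.util.fix_length(signal, size=FIXED_SAMPLES)
```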


Fig. 2. Durations of all examples

Fig. 3. Longest audio clip (clarinet, piano; 1 min 17 s) trimmed to 3 seconds

Fig. 4. Shortest audio clip (viola, pianissimo; 0.078 s) padded to 3 seconds


4.2 Extracting Mel-Frequency Cepstral Coefficients

MFCCs are useful in identifying the formants and timbre of a sound [1], as described in Section 5.1. Thirteen MFCC coefficients were extracted from each frame of each audio file. A single frame contained 2048 samples, and a hop length of 512 samples was used for the framing window. The 3-second sound samples resulted in MFCC matrices of dimension 13×130. These matrices were then simply flattened and fed into our proposed ANN model.
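A minimal sketch of this extraction step with librosa is shown below; the exact library call used by the authors is not stated in the paper, so this is an illustrative assumption built around the reported parameters (13 coefficients, 2048-sample frames, 512-sample hop).

```python
import librosa
import numpy as np


def extract_mfcc_features(signal: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Compute 13 MFCCs per frame (frame size 2048, hop 512) and flatten them."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr,
                                n_mfcc=13, n_fft=2048, hop_length=512)
    # For a 3-second clip (66,150 samples) this yields a 13 x 130 matrix -> 1690 values
    return mfcc.flatten()
```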

5 System Description

Our approach consists of two principal steps: extracting the MFCC features from the constant-length examples, as shown in Figure 5, and feeding them into an ANN model to make predictions.

5.1 Mel-Frequency Cepstral Coefficients

For our approach, we are most interested in the model's ability to efficiently distinguish the timbres of tones belonging to different instruments. Timbre, or tone colour, is a characteristic of sound that can distinguish between two sounds possessing the same intensity, frequency, and duration. Since many examples belonging to different classes play the same note with similar intensity, timbre becomes a crucial factor in differentiating among them.

The information about the rate of change in the spectral bands of a signal is given by its cepstrum. It conveys the different values that construct the formants and timbre of a sound. The cepstral coefficients can be extracted using Equation 1.

$C(x(t)) = \mathcal{F}^{-1}\big(\log(\mathcal{F}[x(t)])\big)$   (1)


Fig. 5. Steps to extract Cepstral Coefficients from an audio signal

Peaks are observed at the periodic elements of the original signal when computing the log of the magnitude of the Fourier transform of the audio signal and then taking its spectrum via a cosine transformation. The resulting spectrum lies in the quefrency domain [9]. Humans perceive amplitude logarithmically; hence, conversion to the log-amplitude spectrum is perceptually relevant, as shown in Figure 5. Mel scaling is then performed by applying Equation 2 to frequencies measured in Hz.

$\mathrm{Mel}(f) = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)$   (2)

Human beings can distinguish small changes in speech at lower frequencies. The Mel scale captures these small differences, mirroring the human hearing process. In practice, a discrete cosine transformation is applied instead of the inverse Fourier transform shown in Equation 1, since the former yields real-valued coefficients and also decorrelates the energy in different mel bands.

For our model, the traditional 13 MFCC coefficients were chosen per frame of each example, since the lower end of the quefrency axis of the cepstrum contains the information most relevant to our task, viz. the formants, spectral envelope, and timbre. The higher end carries information related to the glottal pulse, which is not very defining for this task.
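To make Equations 1 and 2 concrete, the short numpy/scipy sketch below computes cepstral coefficients of a single frame as a DCT of its log-magnitude spectrum, together with the Hz-to-Mel mapping of Equation 2. Note that a full MFCC pipeline (handled by a standard library in this work) additionally pools the spectrum through a mel filterbank before the log and the DCT; this illustrative sketch omits that step.

```python
import numpy as np
from scipy.fftpack import dct


def hz_to_mel(f_hz: np.ndarray) -> np.ndarray:
    """Equation 2: convert frequency in Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)


def cepstral_coefficients(frame: np.ndarray, n_coeffs: int = 13) -> np.ndarray:
    """Equation 1 in spirit: DCT of the log-magnitude spectrum of one frame."""
    log_magnitude = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # avoid log(0)
    return dct(log_magnitude, type=2, norm='ortho')[:n_coeffs]
```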

5.2 The Artificial Neural Network

A deep ANN comprises numerous layers containing varying numbers of neurons, terminating in an output layer with as many neurons as there are classes in the classification task; in our case, 20. The feature values are multiplied by weights and added to bias terms as they traverse the neurons of subsequent layers. The weights are updated after each epoch, consisting of a forward and a backward propagation, with respect to a chosen loss function. The neurons also apply an activation function to the computed values to introduce non-linearity and selectivity into the network. These activation functions distinguish a neural network from a plain linear regression model.

In our model, we use the Rectified Linear Unit (ReLU) activation function for all hidden layers. It simply activates the neurons containing a positive value after the aforementioned computations, and is given by Equation 3.

$y = \max(0, x)$   (3)

Since the current problem is a multi-class classification, the Softmax function is used in the output layer. It provides the confidence score of each class using Equation 4. The scores add up to 1, and the class with the highest confidence score is the model's prediction for a particular set of input features.

$\sigma(z)_i = \dfrac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$   (4)
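For reference, minimal numpy versions of Equations 3 and 4 are given below; in practice these activations are provided directly by the deep learning framework.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    """Equation 3: element-wise max(0, x)."""
    return np.maximum(0.0, x)


def softmax(z: np.ndarray) -> np.ndarray:
    """Equation 4: confidence scores over the K classes, summing to 1."""
    exp_z = np.exp(z - np.max(z))  # subtracting the max improves numerical stability
    return exp_z / exp_z.sum()
```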


6 Experimental Setup

Our approach uses an ANN whose architecture is described in Figure 6. The initial flattening layer flattens a 13×130 MFCC matrix into a single-dimensional layer of 1690 input neurons, which are connected to the first dense hidden layer of 512 neurons followed by a ReLU activation function. The second and third hidden layers contain 1024 and 512 neurons respectively, both followed by the ReLU activation function. A dropout layer with a rate of 0.3 is then added to induce regularization and avoid overfitting. After the dropout layer, the values pass through two more hidden layers containing 128 and 64 neurons respectively, again with the ReLU activation function, followed by another dropout layer with a rate of 0.2. The final output layer, having 20 neurons (equal to the total number of classes), terminates the neural network. Because the problem is a multi-class classification, the output layer uses a Softmax function, which produces the final output probability of each class.
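The description above pins the architecture down exactly, so it could be reproduced with a Keras sketch along the following lines. The use of Keras/TensorFlow is our assumption, suggested by the h5 model file mentioned in Section 8, rather than something stated explicitly in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_model(num_classes: int = 20) -> tf.keras.Model:
    """ANN from Figure 6: flatten 13x130 MFCCs, five dense hidden layers, two dropouts."""
    return models.Sequential([
        layers.Flatten(input_shape=(13, 130)),      # 13 x 130 -> 1690 input neurons
        layers.Dense(512, activation="relu"),
        layers.Dense(1024, activation="relu"),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(num_classes, activation="softmax"),
    ])
```

Instantiating this model and calling its summary() reports 1,991,124 trainable parameters, matching the figure given in Section 6.2.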

6.1 Dataset Split

The dataset was divided into training and validation (test) sets in the ratio 8:2 using stratified splitting, so that the examples of each of the 20 classes were split proportionally between the two sets. Stratification was necessary because of the imbalanced classes, to avoid a disproportionate division of examples from classes with relatively few or many examples. After the split, the training set contained 10,943 examples and the test set 2,736 examples.
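A stratified 80/20 split of this kind is commonly done with scikit-learn; the sketch below assumes feature and label arrays X (shape 13,679 × 13 × 130) and y (integer labels 0–19) have already been built from the preprocessing steps, and the random seed is purely illustrative.

```python
from sklearn.model_selection import train_test_split

# X: MFCC features, shape (13679, 13, 130); y: integer class labels 0-19 (assumed built earlier)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
# stratify=y keeps the per-class proportions equal in both sets (10,943 / 2,736 examples)
```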

6.2 Model Training

The proposed architecture has a total of 1,991,124 trainable parameters. The Adam optimizer was used with an initial learning rate of 0.0001, and the training examples were further divided into mini-batches of size 32 to implement mini-batch gradient descent with respect to the sparse categorical cross-entropy loss function. Training took place over 100 epochs.
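Continuing the earlier Keras sketch, this training configuration could be expressed as follows; it is an assumed implementation that reuses build_model and the split from the previous sketches, and the file name is illustrative.

```python
import tensorflow as tf

model = build_model()                       # architecture sketch from Section 6
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    batch_size=32, epochs=100)

model.save("instrument_classifier.h5")      # Section 8 reports an h5 file of ~22.8 MB
```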

7 Results and Analysis

During model training, the training accuracy peaked at 0.9913 and the validation accuracy at 0.9726.

Although a high and stable validation accuracy was obtained, as shown in Figure 7, accuracy alone could not be considered the best metric for model evaluation, since the dataset has highly imbalanced classes, as described in Section 3. This often brings in the accuracy paradox [14], i.e., misclassifying minority-class examples yet achieving a high accuracy due to the correct classification of a relatively larger number of majority-class examples. Therefore, a confusion matrix was plotted and the F1 score of each class prediction was calculated, revealing the true evaluation of our model.

From the confusion matrix shown in Figure 8 and the F1 scores displayed in Table 1, it is evident that the minority-class test examples have also been predicted with minimal error. The class 'percussion' has a relatively lower F1 score, which is expected because, unlike the other classes, it consists of a very small number of examples drawn from 39 different percussion instruments, as explained in Section 3.

To further support the high accuracy of 97%, the AUC-ROC and Precision-Recall curves were also plotted, as shown in Figure 9. An AUC score of 0.996 was achieved, which further supports the results and permits the claim of commendable results without being influenced by the accuracy paradox, which commonly accompanies high accuracy on imbalanced-class problems. It can also be observed that 'percussion' has comparatively worse Precision-Recall and AUC-ROC curves.
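The per-class scores in Table 1, the confusion matrix, and a one-vs-rest AUC of this kind can be computed with scikit-learn. The snippet below is an assumed evaluation sketch reusing the trained model and test split from the earlier sketches; the exact AUC averaging scheme used by the authors is not specified in the paper.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

probs = model.predict(X_test)                  # (2736, 20) softmax scores
y_pred = np.argmax(probs, axis=1)

print(classification_report(y_test, y_pred))   # per-class precision/recall/F1 as in Table 1
cm = confusion_matrix(y_test, y_pred)          # counts behind Figure 8
auc = roc_auc_score(y_test, probs, multi_class="ovr")   # one-vs-rest, macro-averaged AUC
```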

8 Conclusion and Future Work

In this paper, we formulated a baseline model for musical instrument recognition. The model achieved a new state-of-the-art accuracy of 97% on the full dataset containing all 20 classes of musical instruments, despite the heavy class imbalance and the fact that most instruments belong to one of four families, viz. strings, woodwinds, brass, and percussion.


Fig. 6. Proposed model architecture

Fig. 7. Accuracy and Loss on the Train and Test sets

The experimental setup described in Section 6 was finalized after a considerable number of hyperparameter tuning iterations. Although a different configuration of layers and neurons resulted in a slightly better validation accuracy, this particular model yielded better and more uniform F1 scores across the classes, in addition to more stable accuracy curves, as shown in Figure 7. Additionally, the h5 file of the model consumes only 22.8 MB of memory; thus, the proposed model can be incorporated into web and mobile applications with modest memory requirements.

Nevertheless, there is considerable scope for future work that may yield even better performance on this dataset. Data augmentation measures may be adopted to deal with the imbalance problem. Different activation functions and optimizers with varying learning rates may be tried. MFCCs and mel-spectrograms provide excellent visual representations of sound, so CNNs may prove quite efficient. Many variables can also be tweaked during the pre-processing stages, such as choosing a longer example duration, a different frame size or hop length, or a larger number of MFCC coefficients. Expanding our target space to support the recognition of even more instruments, including, for instance, the piano or the ukulele, would be a notable improvement to our current model.


Fig. 8. Confusion Matrix

Fig. 9. Precision-Recall and AUC-ROC curves

Acknowledgements

We would like to thank the Department of Computer Science and Engineering and the Center for Natural Language Processing (CNLP) at the National Institute of Technology Silchar for providing the requisite support and infrastructure to execute this work.


Table 1. Precision, Recall and F1 Scores of each class

Class           Precision  Recall  F1-Score  Support
banjo              1.00      0.93     0.97       15
bass clarinet      0.99      0.98     0.98      189
bassoon            0.97      1.00     0.99      144
cello              0.99      0.94     0.97      178
clarinet           0.98      0.95     0.96      169
contrabassoon      0.99      0.96     0.98      142
cor anglais        0.99      0.99     0.99      138
double bass        0.96      0.99     0.97      170
flute              0.95      0.99     0.97      176
french horn        0.98      0.97     0.98      130
guitar             1.00      1.00     1.00       21
mandolin           0.89      1.00     0.94       16
oboe               0.96      0.96     0.96      119
percussion         0.86      0.66     0.75       29
saxophone          0.88      0.97     0.92      146
trombone           0.99      0.98     0.99      166
trumpet            0.96      0.92     0.94       97
tuba               0.99      0.99     0.99      195
viola              0.96      0.93     0.95      195
violin             0.97      0.99     0.98      301

accuracy                              0.97     2736
macro avg          0.96      0.96     0.96     2736
weighted avg       0.97      0.97     0.97     2736

References

1. Chakraborty, S. S. & Parekh, R. (2018). Improved musical instrument classification using cepstral coefficients and neural networks. In Methodologies and Application Issues of Contemporary Computing Framework, Springer, pp. 123–138.

2. Deng, J. D., Simmermacher, C., & Cranefield, S. (2008). A study on feature analysis for musical instrument classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Vol. 38, No. 2, pp. 429–438.

3. Eichner, M., Wolff, M., & Hoffmann, R. (2006). Instrument classification using hidden Markov models. System, Vol. 1, No. 2, pp. 3.

4. Eronen, A. & Klapuri, A. (2000). Musical instrument recognition using cepstral coefficients and temporal features. 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), Volume 2, IEEE, pp. II753–II756.

5. Essid, S., Richard, G., & David, B. (2005). Instrument recognition in polyphonic music based on automatic taxonomies. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 1, pp. 68–80.

6. Gulhane, S. R., Suresh, D. S., & Sanjay, S. B. (2018). Identification of musical instruments using MFCC features. International Conference on Computational Vision and Bio Inspired Computing, Springer, pp. 957–968.

7. Haidar-Ahmad, L. (2019). Music and instrument classification using deep learning technics. Recall, Vol. 67, No. 37.00, pp. 80–00.

8. Marques, J. & Moreno, P. J. (1999). A study of musical instrument classification using Gaussian mixture models and support vector machines. Cambridge Research Laboratory Technical Report Series CRL, Vol. 4, pp. 143.

9. Oppenheim, A. V. & Schafer, R. W. (2004). From frequency to quefrency: A history of the cepstrum. IEEE Signal Processing Magazine, Vol. 21, No. 5, pp. 95–106.

10. Siebert, X., Melot, H., & Hulshof, C. Study of the robustness of descriptors for musical instruments classification.

11. Singh, P., Bachhav, D., Joshi, O., & Patil, N. (2019). Implementing musical instrument recognition using CNN and SVM. International Research Journal of Engineering and Technology, pp. 1487–1493.

12. Solanki, A. & Pandey, S. (2019). Music instrument recognition using deep convolutional neural networks. International Journal of Information Technology, pp. 1–10.

13. Toghiani-Rizi, B. & Windmark, M. (2017). Musical instrument recognition using their distinctive characteristics in artificial neural networks. arXiv preprint arXiv:1705.04971.

14. Valverde-Albacete, F. J. & Pelaez-Moreno, C. (2014). 100% classification accuracy considered harmful: The normalized information transfer factor explains the accuracy paradox. PLoS ONE, Vol. 9, No. 1, pp. e84217.