LETTER OF APPROVAL
The undersigned hereby certify that they have read, and recommended to the Institute of
Engineering for acceptance, this project report entitled “Music Genre Classification”
submitted by Anjan Rai, Anju Maharjan, Dipendra Shrestha and Komal Kadmiya in partial
fulfilment of the requirements for the Bachelor's Degree in Computer Engineering.
_________________________________________
Internal Examiner
Dr. Sanjeev Prasad Pandey
Professor
Department of Electronics & Computer Engineering,
Institute of Engineering, Central Campus Pulchowk,
Tribhuvan University, Nepal
________________________________________
External Examiner
Saroj Shakya
Associate Professor
Nepal College of Information Technology,
Pokhara University, Nepal
________________________________________
Dr. Nanda Bikram Adhikari
Deputy Head
Department of Electronics & Computer Engineering,
Institute of Engineering, Central Campus Pulchowk,
Tribhuvan University, Nepal
__________________________________________
Dr. Dibakar Raj Pant
Head
Department of Electronics & Computer Engineering,
Institute of Engineering, Central Campus Pulchowk,
Tribhuvan University, Nepal
DATE OF APPROVAL:
________________________________________
Supervisor
Dr. Shashidhar Ram Joshi
Professor
Department of Electronics & Computer Engineering,
Institute of Engineering, Central Campus Pulchowk,
Tribhuvan University, Nepal
COPYRIGHT
The author has agreed that the Library, Department of Electronics and Computer
Engineering, Pulchowk Campus, Institute of Engineering may make this report freely
available for inspection. Moreover, the author has agreed that permission for extensive
copying of this project report for scholarly purpose may be granted by the supervisors who
supervised the project work recorded herein or, in their absence, by the Head of the
Department wherein the project report was done. It is understood that due recognition will be
given to the author of this report and to the Department of Electronics and Computer
Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this
project report. Copying or publication or other use of this report for financial gain without
the approval of the Department of Electronics and Computer Engineering, Pulchowk Campus,
Institute of Engineering and the author's written permission is prohibited.
Request for permission to copy or to make any other use of the material in this report in
whole or in part should be addressed to:
Dr. Dibakar Raj Pant
Head
Department of Electronics and Computer Engineering
Pulchowk Campus, Institute of Engineering
Lalitpur, Kathmandu
Nepal
ACKNOWLEDGEMENT
We express our sincere gratitude to the Department of Electronics and Computer
Engineering for providing us the opportunity to undertake this project. Likewise, we extend
our thanks to our project supervisor Prof. Dr. Shashidhar Ram Joshi for providing
the essential guidelines and support for understanding the feasibility and other technical
aspects of the project. Finally, we would also like to thank our friends, especially Mr. Bikram
Basnet and Mr. Pravesh Koirala, and our seniors, whose knowledge and experience
helped make this project better. They all helped us to understand and improve upon the
flexibility, flaws and limitations of the project.
- Anjan Rai (70803)
- Anju Maharjan (70804)
- Komal Kadmiya (70819)
- Dipendra Shrestha (70822)
ABSTRACT
People all over the world love music, but not the same kind of music. Different people have
different tastes in music. Some people love pop music while others like to listen to rock music.
These are different genres of music. Music can be divided into different genres in several
ways. The artistic nature of music means that these classifications are often arbitrary and
controversial, and some genres may overlap. Classification of musical genre from audio is a
well-researched area of music research. The creation of huge databases coming from both
restoration of existing analogue archives and new content is demanding fast and reliable tools
for content analysis and description, to be used for searches, content queries and interactive
access. In that context, musical genres are crucial descriptors since they have been widely
used for years to organize music catalogues, libraries and shops. By some counts, there are
126 different genres into which music can be classified, including pop, rock, jazz, trance,
hip hop, and so on. With the growing variety of music, distinguishing the genre of a piece
of music has become increasingly difficult. Through our project “Music Genre Recognition”, we have
simplified this task by automatically classifying a given set of music files on the basis
of the genre they belong to. Most automatic genre classification models rely on the low-level
temporal relationships between audio chunks when classifying audio signals in terms of their
genre; that is, models are generally based on investigating means of modelling short-term time
structures from context information in music segments, consolidating classification
consistency by reducing ambiguities. In our project, we have applied the technique of a
pattern recognition architecture, which encompasses the concepts of feature extraction from
chunks of the audio signal and classifying the features independently via different
classification techniques.
Keywords: Classification techniques, Feature extraction, Music genre.
Contents
ACKNOWLEDGEMENT ........................................................................................................ iv
ABSTRACT ............................................................................................................................... v
LIST OF ABBREVIATIONS ................................................................................................ viii
LIST OF FIGURES ................................................................................................................... x
LIST OF TABLES .................................................................................................................... xi
1 INTRODUCTION .............................................................................................................. 2
1.1. Background ................................................................................................................. 2
1.2. Motivation ................................................................................................................... 3
1.3. Problem Statement ...................................................................................................... 3
1.4. Objectives .................................................................................................................... 4
1.5. Scope of the work ........................................................................................................ 4
1.6. Overview of the project ............................................................................................... 5
2. LITERATURE REVIEW ................................................................................................... 7
2.1. Introduction ................................................................................................................. 7
2.2. A Study of Human Music Genre Classification .......................................................... 8
2.3. Related Works ............................................................................................................. 8
2.4. Training and Testing Data Sets ................................................................................... 9
2.5. Linear Discriminant Analysis.................................................................................... 10
2.5.1. Class-dependent transformation......................................................................... 11
2.5.2. Class-independent transformation ..................................................................... 11
2.6. Support Vector Machine ........................................................................................... 11
3. FEATURE EXTRACTION .............................................................................................. 13
3.1. Introduction ............................................................................................................... 13
3.2. Formal Notation ........................................................................................................ 14
3.3. Feature Extraction Process ........................................................................................ 14
3.4. Basic Features of an audio sample ............................................................................ 17
3.4.1. Beat and Meter: .................................................................................................. 17
3.4.2. Harmony: ........................................................................................................... 17
3.4.3. Pitch: .................................................................................................................. 18
3.5. Mel-Frequency Cepstral Coefficients ....................................................................... 18
4. CLASSIFICATION .......................................................................................................... 23
4.1. Introduction ............................................................................................................... 23
4.2. Domain Independence ............................................................................................... 23
4.3. Difficulties ................................................................................................................. 24
4.4. Training and Learning ............................................................................................... 25
4.5. Model development ................................................................................................... 25
4.5.1. Gaussian Mixture Model.................................................................................... 25
4.5.2. Parameter Estimation ......................................................................................... 27
5. REQUIREMENT ANALYSIS ......................................................................................... 30
5.1. Functional Requirements........................................................................................... 30
5.2. Non-functional Requirements ................................................................................... 30
6. METHODOLOGY ........................................................................................................... 33
6.1. Introduction ............................................................................................................... 33
6.2. Various system diagrams and descriptions: .............................................................. 34
6.3. Project Tools ............................................................................................................. 39
6.3.1. Why MATLAB and Python? ............................................................................. 39
6.3.2. Pycharm as IDE ................................................................................................. 40
7. OUTPUT .......................................................................................................................... 42
8. RESULT AND ANALYSIS ............................................................................................. 46
9. CONCLUSION AND FURTHER ENHANCEMENT .................................................... 50
9.1. Conclusion ................................................................................................................. 50
9.2. Limitations ................................................................................................................ 50
9.3. Further Enhancements ............................................................................................... 51
10. REFERENCE ................................................................................................................ 52
11. APPENDIX A. WINDOW FUNCTION AND WINDOWING ................................... 54
12. APPENDIX B. FILTERBANK .................................................................................... 57
13. APPENDIX C. SUPERVISED LEARNING ............................................................... 59
LIST OF ABBREVIATIONS
2D Two Dimensional
AI Artificial Intelligence
AMGC Automatic Music Genre Classification
CD Compact Disk
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
EM Expectation Maximization
FFT Fast Fourier Transform
GMM Gaussian Mixture Model
HMM Hidden Markov Model
IDE Integrated Development Environment
k-NN k-Nearest Neighbor
LDA Linear Discriminant Analysis
MAP Maximum A Posteriori
MATLAB Matrix Laboratory
MFCC Mel-Frequency Cepstral Coefficient
MIR Music Information Retrieval
ML Maximum Likelihood
MP3 MPEG-1 Audio Layer 3
PCA Principal Component Analysis
SVM Support Vector Machine
TV Television
VCS Version Control System
VQ Vector Quantizer
WT Wavelet Transform
LIST OF FIGURES
Figure.2.1 Figure showing data sets and test vectors in the original space ........................ 10
Figure.3.1 Generating a feature vector from an input data set .......................................... 13
Figure.3.2 Illustration of the traditional feature extraction process ................................. 15
Figure.3.3 Illustration of the frequency spectrum of a harmonic signal with a fundamental
and four overtones ........................................................................................................... 16
Figure.3.4 Beat Histograms for Classical (left) and Pop (right) ....................................... 17
Figure.3.5 Illustration of the calculation of the MFCCs ................................................... 19
Figure.3.6. Illustration of the filterbank/matrix ............................................................... 21
Figure.6.2 Flow chart of the system.................................................................................. 34
Figure 6.3a Activity Diagram of system training ............................................................. 37
Figure 6.3b Activity Diagram for testing .......................................................................... 38
Figure.7.1 Classification using 16 Gaussian component model of 90s train data for 5s test
data .................................................................................................................................. 44
Figure 8.1 Plot of accuracy obtained for different lengths of test data ............................. 47
Figure A.a Hamming Window ......................................................................................... 56
Figure A.b Hanning Window ............................................................................................ 56
Figure B.a One dimensional Filter Bank .......................................................................... 58
Figure B.b Two dimensional Filter Bank ......................................................................... 58
LIST OF TABLES
Table 7.1 8-component for 5 second test data .................................................................. 42
Table 7.2 8-component for 10 second test data ................................................................ 42
Table 7.3 16-component for 5 second test data ................................................................ 43
Table 7.4 16-component for 10 second test data ............................................................. 43
Table 8.1 Time taken for training and testing ................................................................... 48
CHAPTER 1
1 INTRODUCTION
1.1. Background
Distinguishing between musical genres is a herculean task for human
beings. A musical genre is a conventional category that identifies pieces of music as
belonging to a shared tradition or set of conventions. A few seconds of music usually
suffice to allow us to do a rough classification, such as identifying a song as rock or
classical music. The nebulous definitions and overlapping boundaries of genres make
reliable and consistent genre classification a non-trivial task for humans and computers
alike.
A musical genre is characterized by the common characteristics shared by its
members. These characteristics typically are related to the instrumentation, rhythmic
structure, and harmonic content of the music. Genre hierarchies are commonly used to
structure the large collections of music available on the web. Currently, musical genre
annotation is performed manually. Automatic music genre classification (AMGC) can
assist or replace the human user in this process and would be a valuable addition to
music information retrieval systems. In addition, AMGC provides a framework for
developing and evaluating features for any type of content-based analysis of musical
signals [1].
The need for an effective automatic means of classifying music is becoming
increasingly pressing as the number of recordings available continues to increase at a
rapid rate. It is estimated that 2,000 Compact Disks (CDs) a month are released for
wide distribution in Western countries alone. Software capable of performing
automatic classifications would be particularly useful to the administrators of the
exponentially growing networked music archives, as their success is heavily linked to
the ease with which the users can search for types of music on their sites. These sites
currently rely on manual genre classification techniques, a methodology that is slow
and inconsistent.
This project eases, as far as possible, the difficulty of classifying musical audio
pieces through an initial feature extraction stage followed by a classification
procedure, exploring both the variation of the parameters used as input and the classifier
architecture.
1.2. Motivation
Many factors make intelligent AMGC systems vital in the current scenario. The ease
of downloading and storing music files on computers, the huge availability of albums
on the internet, with free or paid downloading, peer-to-peer servers and the fact that
nowadays artists deliberately distribute their songs on their websites all make music
database management a must.
Another recent tendency is to consume music via streaming, raising the popularity of
on-line radio stations that play similar songs based on a genre preference. In addition,
browsing and searching by genre on the web, and smart playlist generation that chooses
specific tunes among gigabytes of songs on personal portable audio players, are
important tasks that facilitate music mining. As the demand for multimedia grows, the
development of information retrieval systems that include information about music is
of increasing concern. Radio stations and music television (TV) channels hold archives
of millions of music tapes. Gigabytes of music files are also spread over the web.
These facts make the manual classification of musical genres impractical.
End users are nonetheless already accustomed to browsing both physical and on-line
music collections by genre, and this approach is seemingly at least reasonably
effective, even without an automatic means of classification. The currently prevailing
manual procedures motivated us to develop an automatic and consistent system based on
feature extraction and classification techniques.
1.3. Problem Statement
As an improvement over the prevailing manual classification of musical genres, a system-
oriented approach has been applied to simplify the task. Still, the main challenge has
been temporal feature integration, the process of combining a time-series
of short-time feature vectors into a single feature vector on a larger time scale.
However, such an approach involves complex processes and often yields
inconsistent results.
In our project, we try to build a system that outputs the genre a music sample
belongs to by extracting features from the audio data, both to obtain more
meaningful information and to reduce the processing required by the classification task.
Systematic feature selection techniques are used so as to produce a system that is
robust, fast and consistent.
1.4. Objectives
Our primary objective is to develop a system that implements the automatic feature
extraction and learning / pattern classification techniques that have the important
benefit of being adaptable to a variety of other content-based (i.e. relating directly to
and only to music itself) musical analysis and classification tasks. Our objectives can
be further simplified as:
i. To develop a system that implements the machine learning algorithms for fast
and consistent classifications.
ii. To develop a system that can improve current applications that feature
music genre classification.
iii. To contribute to the creation of a more appropriate and specific music data
warehouse.
iv. To implement the principles and techniques of digital signal processing.
1.5. Scope of the work
In simple words, AMGC is the classification of a piece of music into its
corresponding genre by a computer. It is considered to be a cornerstone of the
research area Music Information Retrieval (MIR) and closely linked to the other areas
in MIR. MIR carries the scope of being a key element in the processing, searching
and retrieval of digital music in the near future.
The automatic classification of audio data according to music genres aids the creation
of music databases. It also allows the users to generate personal playlists on the fly,
where the user specifies a general description such as 80s Synth-Pop, and the software
does the actual file selection [2]. Furthermore, the features developed for automatic
music genre recognition are useful in related fields such as similarity-based searching.
1.6. Overview of the project
The first chapter of the report gives the introduction of the project which includes the
background related to the project, scope of the project, the factors that motivated us to
initiate the project as well as the objectives behind it. The second chapter of the report
deals with the literature review, which covers the details of related work done
earlier on such projects. The different theories and algorithms incorporated in the
completion of the project are dealt with in detail in chapters 3 and 4. The sixth chapter
of the report depicts the methodology behind the completion of the project. It includes
the different diagrams associated with the project, such as the Use Case Diagram, Flow
Diagram (or Flow Chart) and Activity Diagram. The eighth chapter consists of the
results and the output of the project, whereas the last chapter contains the
necessary conclusions regarding the project along with its limitations.
CHAPTER 2
2. LITERATURE REVIEW
2.1. Introduction
Music genre classification is not a new problem in the era of technological
development. Musical genre is used by retailers, libraries and people in general as a
primary means of organizing music. Anyone who has attempted to search through the
discount bins of a music store will have experienced the frustration of searching
through music that is not sorted by genre. Listeners use genres to find music that
they're looking for or to get a rough idea of whether they're likely to like a piece of
music before hearing it. The music industry, in contrast, uses genre as a key way of
defining and targeting different markets. The importance of genre in the mind of
listeners is exemplified by research indicating that the style in which a piece is
performed can influence listeners' liking for the piece of music [1, 3].
The types of features developed for a classification system could be adapted for other
types of analyses by musicologists and music theorists. Taken in conjunction with
genre classification results, the features could also provide valuable insights into the
particular attributes of different genres and what characteristics are important in
different cases. Automatic feature extraction and learning / pattern classification
techniques have the important benefit of being adaptable to a variety of other content-
based (i.e. relating directly to and only to the music itself) musical analysis and
classification tasks, such as similarity measurements in general or segmentation.
Systems could be constructed that, to give just a few examples, compare or classify
pieces based on compositional or performance style, group music based on
geographical / cultural origin or historical period, search for unknown music that a
user might like based on examples of what he or she is known to like already, sort
music based on perception of mood, or classify music based on when a user might
want to listen to it (e.g. while driving, while eating dinner, etc.). Music librarians and
database administrators could use these systems to classify recordings along whatever
lines they wished. Individual users could use such systems to sort their music
collections automatically as they grow and automatically generate play lists with
certain themes. It would also be possible for them to upload their own classification
parameters to search on-line databases equipped with the same classification software
[4].
2.2. A Study of Human Music Genre Classification
Humans are capable of performing music genre classification using the ear, the
auditory processing system and higher-level cognitive processes
in the brain. Musical genres are used among humans as a compact description which
facilitates sharing of information. For instance, the statements "I like heavy metal" or
"I can't stand classical music!" are often used to share information, and rely on
shared knowledge about the genres and their relation to society, history and musical
structure.
According to a study conducted by R. O. Gjerdingen and D. Perrott, human listeners
have a significant capability to recognize musical genres. They used ten different
genres of music, and eight sample songs for each genre were downloaded from the
web in the MPEG-1 Audio Layer 3 (MP3) format. Half of the eight songs for each style
contained vocals, and half were instrumental only. Excerpts were taken from each
song at several durations, including 475 ms, 400 ms, 325 ms and 250 ms.
The accuracy of the genre prediction for the 250 ms samples was around 40%, and the
agreement between the 250 ms and the 475 ms classifications was around
44%. The results of the study are especially interesting, since they show that it is
possible to accurately recognize musical genres without any higher-level abstractions.
But since the accuracy level is seemingly unsatisfactory, there remains considerable
room for an AMGC system [1, 4].
2.3. Related Works
Though unsupervised clustering of music collections based on similarity measures is
gaining more and more interest in the music information retrieval community, most
work related to the classification of music titles into genres is based on supervised
techniques. These methods suppose that a taxonomy of genres is given and try to
map a database of songs into it using machine learning algorithms.
Soltau et al. have compared a Hidden Markov Model (HMM) to a new classification
architecture, Explicit Time Modeling with Neural Networks, in a classification
experiment involving 360 songs distributed over 4 genres.
Tzanetakis and Cook and Li et al. have worked on a database of 1000 songs over 10
genres and have compared the use of different audio features (timbre features,
rhythmic features, pitch features, Wavelet Transform (WT)) and different classifiers
(Support Vector Machines (SVMs), Gaussian Mixtures, Linear Discriminant Analysis
(LDA), k-Nearest Neighbor (k-NN)) on time-independent chunks.
Panagakis and Kotropoulos proposed a musical genre classification framework that
considers the properties of the human auditory perception system, i.e. two-dimensional
(2D) auditory temporal modulations representing music, with genre classification based
on sparse representation.
It is observable that a lot of work is being done in the area, but most of the approaches
explore the timbre texture, the rhythmic content, the pitch content, or their
combinations.
2.4. Training and Testing Data Sets
A training set is a set of data used in various areas of information science to discover
potentially predictive relationships. Training sets are used in Artificial Intelligence
(AI), machine learning, genetic programming, intelligent systems, and statistics. In all
these fields, a training set has much the same role and is often used in conjunction
with a test set [5]. A test set is a set of data used in various areas of information
science to assess the strength and utility of a predictive relationship.
Separating data into training and testing sets is an important part of evaluating data
mining models. Typically when we separate a data set into a training set and testing
set, most of the data is used for training, and a smaller portion of the data is used for
testing. The data is sampled randomly to help ensure that the testing
and training sets are similar. By using similar data for training and testing, we can
minimize the effects of data discrepancies and better understand the characteristics of
the model. After the model has been processed by using the training set, we test the
model by making predictions against the test set. Because the data in the testing set
already contains known values for the attribute that we want to predict, it is easy to
determine whether the model's guesses are correct.
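The splitting procedure described above can be sketched in a few lines of Python. The helper below, the toy feature values and the genre labels are all illustrative assumptions made for this sketch, not part of the project's actual code; a real system would split a much larger feature database in the same way.

```python
# Hedged sketch of a random train/test split for evaluating a genre
# classifier; the feature vectors and labels here are invented toy data.
import random

def train_test_split(data, labels, test_fraction=0.2, seed=42):
    """Randomly partition (data, labels) into a training and a testing set."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(data[i], labels[i]) for i in train_idx]
    test = [(data[i], labels[i]) for i in test_idx]
    return train, test

# Toy feature vectors (two summary features per clip) and genre labels.
features = [[0.1, 2.3], [0.4, 1.9], [0.9, 0.2], [0.8, 0.3], [0.2, 2.1]]
genres = ["classical", "classical", "rock", "rock", "classical"]
train, test = train_test_split(features, genres, test_fraction=0.4)
print(len(train), len(test))  # prints: 3 2
```

Because the split is random, the training and testing sets tend to follow similar distributions, which is exactly the property the paragraph above relies on.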
2.5. Linear Discriminant Analysis
LDA and the related Fisher's linear discriminant are methods used in statistics, pattern
recognition and machine learning to find a linear combination of features which
characterizes or separates two or more classes of objects or events. The resulting
combination may be used as a linear classifier or, more commonly, for dimensionality
reduction before later classification. There are many possible techniques for
classification of data. LDA easily handles the case where the within-class frequencies
are unequal and their performances have been examined on randomly generated test
data. This method maximizes the ratio of between-class variance to the within-class
variance in any particular data set thereby guaranteeing maximal separability. The use
of LDA for data classification is applied to classification problem in speech
recognition. LDA does not change the location of the data but only tries to provide more
class separability and draw a decision region between the given classes. This method also
helps to better understand the distribution of the feature data. Figure 2.1 will be used
as an example to explain and illustrate the theory of LDA [4,5].
Figure 2.1. Figure showing data sets and test vectors in the original space
Data sets can be transformed and test vectors can be classified in the transformed
space by two different approaches.
2.5.1. Class-dependent transformation
This type of approach involves maximizing the ratio of between-class variance
to within-class variance. The main objective is to maximize this ratio so that
adequate class separability is obtained. This class-specific approach
involves using two optimizing criteria for transforming the data sets
independently.
2.5.2. Class-independent transformation
This approach involves maximizing the ratio of overall variance to within-class
variance. It uses only one optimizing criterion to transform
the data sets, and hence all data points, irrespective of their class
identity, are transformed using the same transform. In this type of LDA, each class
is considered as a separate class against all other classes.
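The variance-ratio criterion above can be illustrated with a minimal from-scratch sketch of Fisher's linear discriminant for two classes in two dimensions. The two toy point clouds, the class names "A" and "B", and the midpoint threshold are all assumptions made for this sketch; they are not drawn from the project's data.

```python
# From-scratch Fisher's linear discriminant for two 2-D classes, using
# only the standard library; the point clouds below are invented toy data.

def mean(vecs):
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(2)]

def scatter(vecs, m):
    # 2x2 within-class scatter: sum of (x - m)(x - m)^T over the class.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vecs:
        d = [v[0] - m[0], v[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

class_a = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2]]
class_b = [[3.0, 3.1], [2.8, 2.9], [3.2, 3.0], [2.9, 3.2]]

m_a, m_b = mean(class_a), mean(class_b)
s_a, s_b = scatter(class_a, m_a), scatter(class_b, m_b)
sw = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]

# Invert the 2x2 within-class scatter matrix.
det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
inv = [[sw[1][1] / det, -sw[0][1] / det],
       [-sw[1][0] / det, sw[0][0] / det]]

# Fisher direction w = Sw^-1 (m_a - m_b): projecting onto w maximizes the
# ratio of between-class to within-class variance.
dm = [m_a[0] - m_b[0], m_a[1] - m_b[1]]
w = [inv[0][0] * dm[0] + inv[0][1] * dm[1],
     inv[1][0] * dm[0] + inv[1][1] * dm[1]]

# Classify by projecting onto w and thresholding at the midpoint of the
# projected class means (a simple decision region between the classes).
mid = sum(w[i] * (m_a[i] + m_b[i]) / 2 for i in range(2))

def predict(x):
    return "A" if w[0] * x[0] + w[1] * x[1] > mid else "B"
```

Note how the data points themselves are never moved; only a projection direction and a decision threshold are computed, matching the description of LDA above.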
2.6. Support Vector Machine
In machine learning, SVMs are supervised learning models with associated
learning algorithms that analyze data and recognize patterns, and are used
for classification and regression analysis. Given a set of training examples, each
marked as belonging to one of two categories, an SVM training algorithm builds a
model that assigns new examples into one category or the other, making it a non-
probabilistic binary linear classifier [4]. An SVM model is a representation of the
examples as points in space, mapped so that the examples of the separate categories
are divided by a clear gap that is as wide as possible. New examples are then mapped
into that same space and predicted to belong to a category based on which side of the
gap they fall on.
In addition to performing linear classification, SVMs can efficiently perform a non-
linear classification using what is called the kernel trick, implicitly mapping their
inputs into high-dimensional feature spaces.
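The kernel trick described above can be illustrated with scikit-learn's `SVC`. This is a hedged sketch on synthetic XOR-style data (which no straight line can separate), not part of the project's implementation; the data, the `rbf` kernel choice, and `C=10.0` are assumptions of the example.

```python
# Minimal SVM sketch (illustrative). A radial-basis-function kernel lets the
# SVM separate classes that are not linearly separable in the input space.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# XOR-like data: label depends on the sign of the product of coordinates.
X = rng.uniform(-1, 1, (200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

clf = SVC(kernel="rbf", C=10.0)   # kernel trick: implicit high-dim mapping
clf.fit(X, y)
print(clf.score(X, y))            # training accuracy
```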
CHAPTER 3
3. FEATURE EXTRACTION
3.1. Introduction
One of the challenges in music genre recognition is to find out what it is that allows
us to differentiate between music styles. The problem is that we want to make
observations about the similarity or dissimilarity of two objects (in our case: music
clips) that are not directly comparable in many cases. To make comparison (and
therefore classification) possible, we must transform the data first in order to be able
to access the essential information contained in them, a process referred to as feature
extraction: computing a numerical representation that characterizes a segment of
audio [6,7].
Feature extraction is one of two commonly used preprocessing techniques in
classification; it means that new features are generated from the raw data by applying
one or more transformations. The other possible technique is feature selection – the
process of identifying a subset of features within the input data that can be used for
effective classification. Feature selection can be applied to the original data set or to
the output of a feature extraction process. A classification system might use both or
either of these techniques. Theoretically, it is also possible to use the raw data, if these
are already in a format suitable for classification. In reality, this is hardly ever the
case, though. The dimensionality of the datasets is often too high; the data contain a
lot of redundancy, or are generally not suited for direct comparison. This is especially
true in the area of audio signal classification, where we are dealing with long streams
of redundant, noisy signals. A schematic overview of the connection between features
selection and feature extraction is shown in Figure 3.1.
Figure 3.1. Generating a feature vector from an input data set.
3.2. Formal Notation
A feature vector (also referred to as a pattern or observation) x is a single data item
used by the classification algorithm, consisting of d measurements:
x = (x1, . . . , xd). The individual scalar components xi of the feature vector x are
called features or attributes, and the dimensionality of the feature space is denoted by
d. Each feature vector can be thought of as a point in the feature space. A pattern set
containing n elements is denoted as
X = {x1, . . . , xn}
and the i-th feature vector in X is written as
xi = (xi1, . . . , xid)
In most cases, a pattern set can be viewed as an n × d pattern matrix.
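The notation above maps directly onto a NumPy array. This small sketch (toy values, purely illustrative) shows the n × d pattern matrix, a row as one feature vector, and a single feature:

```python
# Notation sketch: n feature vectors of dimension d stored as an
# n x d pattern matrix (toy values for illustration only).
import numpy as np

n, d = 5, 3                                        # 5 patterns, 3 features each
X = np.arange(n * d, dtype=float).reshape(n, d)    # pattern matrix
x_i = X[1]                                         # the i-th feature vector (row)
x_ij = X[1, 2]                                     # single feature x_{i,j}
print(X.shape)                                     # (5, 3): n x d
```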
3.3. Feature Extraction Process
Mathematically, the feature vector xn at discrete time n can be calculated with the
function F on the signal as
xn = F(w0 sn−(N−1), ..., wN−1 sn) (3.1)
where w0, w1, ..., wN−1 are the coefficients of a window function and N denotes the
frame size. The frame size is a measure of the time scale of the feature. Normally, it is
not necessary to have xn for every value of n, so a hop size M is used between the
frames. The whole process is illustrated in Figure 3.2. In signal processing terms, the
use of a hop size amounts to downsampling the signal xn, which then only contains
the terms ..., xn−2M, xn−M, xn, xn+M, xn+2M, ....
The flow goes from the upper part of the figure to the lower part. The raw music
signal sn is shown in the first of the three subfigures. The second subfigure shows
how, at a specific time, a frame of N samples is extracted from the signal and
multiplied by the window function wn (a Hamming window). The resulting signal is
shown in the third subfigure. Notice that the resulting signal gradually decreases
towards the sides of the frame, which reduces the spectral leakage problem [8].
Figure 3.2. Illustration of the traditional feature extraction process.
Finally, F takes the resulting signal in the frame as input and returns the feature vector
xn. The function F could be e.g. the Discrete Fourier Transform (DFT) on the signal
followed by the magnitude operation on each Fourier coefficient to get the frequency
spectrum.
The window function is multiplied with the signal to avoid problems due to finite
frame size. The rectangular window with amplitude 1 corresponds to calculating the
features without a window, but has serious problems with the phenomenon of spectral
leakage and is rarely used [7,8].
In our project, the Hamming window is used for windowing. The Hamming
window has side lobes of much lower magnitude. Figure 3.3 shows the result of a
DFT on a signal with and without a Hamming window; the advantage of the
Hamming window is easily seen. The Hamming window is given by
wn = 0.54 − 0.46 cos(2πn / (N−1)) (3.2)
where n = 0, 1, 2, ..., N−1.
Figure 3.3. Illustration of the frequency spectrum of a harmonic signal with a
fundamental frequency and four overtones.
It is clearly advantageous to use a Hamming window compared to not using a window
(or a rectangular window) since it is less prone to spectral leakage.
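The framing-and-windowing process of equations (3.1) and (3.2) can be sketched in a few lines of NumPy. The test signal, frame size N and hop size M below are assumptions chosen for illustration, not the project's actual settings:

```python
# Sketch of framing with hop size M and multiplying each frame by a
# Hamming window (equation 3.2) to reduce spectral leakage.
import numpy as np

fs = 8000                                  # assumed sample rate (Hz)
t = np.arange(fs) / fs
s = np.sin(2 * np.pi * 440 * t)            # 1 s test tone

N, M = 256, 128                            # frame size and hop size
w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Hamming

frames = [w * s[i:i + N] for i in range(0, len(s) - N + 1, M)]
print(len(frames), frames[0].shape)        # number of frames, (256,)
```

The hand-written window matches NumPy's built-in `np.hamming(N)`, which a real implementation would use directly.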
A major part of the work in feature extraction for music, and especially speech
signals, is focused on short-time features. They are thought to capture essential
aspects of music such as loudness, pitch and timbre. Informally, short-time features
are extracted on a time scale of 10 to 40 ms, over which the signal is considered
(short-time) stationary.
3.4. Basic Features of an audio sample
3.4.1. Beat and Meter:
Beats give music its regular rhythmic pattern. Beats are grouped together in a
measure; the notes and rests correspond to a certain number of beats. Meter refers to
the rhythmic patterns produced by grouping together strong and weak beats.
Figure 3.4. Beat Histograms for Classical (left) and Pop (right)
3.4.2. Harmony:
In general, harmony refers to the combination of notes (or chords) played together and
the relationship between a series of chords.
3.4.3. Pitch:
The relative lowness or highness that we hear in a sound refers to its pitch. The pitch
of a sound is based on the frequency of vibration and the size of the vibrating object.
The slower the vibration and the bigger the vibrating object, the lower the pitch. For
example, the pitch of a double bass is lower than that of the violin because the double
bass has longer strings.
3.4.4. Rhythm:
It may be defined as the pattern or placements of sounds in time and beats in
music. It refers to the particular arrangement of note lengths in a piece of
music.
3.4.5. Timbre:
Timbre is generally defined as the quality which allows one to tell the
difference between sounds of the same pitch and loudness when made by
different musical instruments or voices. It depends on the spectrum, the sound
pressure, the frequency location of the spectrum, and the temporal
characteristics of the stimulus. In music, timbre is thought to be determined by
the number and relative strengths of the instrument's partials.
3.5. Mel-Frequency Cepstral Coefficients
Mel-Frequency Cepstral Coefficients (MFCCs) originate from automatic speech
recognition, where they have been used with great success. They have become
popular in the Music Information Retrieval (MIR) community, where they have been
used successfully for music genre classification and for categorization into
perceptually relevant groups such as moods and perceived complexity.
MFCCs are based on the spectral information of a sound, but are modeled to capture
the perceptually relevant parts of the auditory spectrum. The MFCCs are to some
extent created according to the principles of the human auditory system, but also to be
a compact representation of the amplitude spectrum and with considerations of the
computational complexity [9]. Existing music processing literature pointed us to
MFCCs as a way to represent time domain waveforms as just a few frequency domain
coefficients.
Figure 3.5. Illustration of the calculation of the MFCCs.
Figure 3.5 illustrates the construction of the MFCC features. The flowchart illustrates
the different steps in the calculation from raw audio signal to the final MFCC
features. There exist many variations of the MFCC implementation, but nearly all of
them follow this flowchart.
In accordance with equation (3.1), the feature extraction can be described as a
function F on a frame of the signal. After applying the Hamming window to the
frame, this function comprises the following four steps:
3.5.1. DFT
The first step is to perform the DFT on the frame. For a frame size of N, this
yields N (complex) Fourier coefficients, giving an N-dimensional spectral
representation of the frame.
3.5.2. Mel- scaling
Humans order sounds on a musical scale from low to high by the perceptual
attribute 'pitch'. The pitch of a sine tone is closely related to the physical
quantity of frequency, and to the fundamental frequency for a complex tone.
However, the pitch scale is not spaced like the frequency scale. The mel-scale
is an estimate of the relation between perceived pitch and frequency, obtained
by equating 1000 mels to a 1000 Hz sine tone at 40 dB. It is used in the
calculation of the MFCCs to transform the frequencies in the spectral
representation onto the perceptual pitch scale. Normally, the mel-scaling step
takes the form of a filterbank of (overlapping) triangular filters in the
frequency domain whose center frequencies are mel-spaced. The filterbank is
what makes MFCCs unique. It is constructed using 13 linearly spaced filters
and 27 log-spaced filters, following a common model of human auditory
perception. The distance between the centre frequencies of the linearly spaced
filters is 133.33 Hz; the log-spaced filters are separated by a factor of 1.071 in
frequency. A standard filterbank is illustrated in Figure 3.6. The mel-scaling
step thus also smooths the spectrum and reduces the dimensionality of the
feature vector.
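A common closed-form approximation of the mel-scale (one of several variants in the literature; the 13-linear plus 27-log filterbank in the text is a closely related construction) is mel = 2595 log10(1 + f/700). A small sketch, with the 40-filter span below chosen only for illustration:

```python
# Sketch of the mel-scale mapping used in the mel-scaling step.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Center frequencies for 40 triangular filters spanning 0..8000 Hz,
# equally spaced on the mel scale (assumed illustrative parameters).
centers_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 40)
centers_hz = mel_to_hz(centers_mel)
print(round(hz_to_mel(1000.0), 1))   # close to 1000 mels, by construction
```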
3.5.3. Log-scaling
Similar to pitch, humans order sound from soft to loud by the perceptual
attribute 'loudness'. Perceptual loudness corresponds quite closely to the
physical measure of intensity. Although other quantities, such as frequency,
bandwidth and duration, affect the perceived loudness, it is common to relate
loudness directly to intensity. The relation is often approximated as
L ∝ I^0.3
where L is the loudness and I is the intensity (Stevens' power law). It has been
argued that perceptual loudness can also be approximated by the logarithm of
the intensity, although this is not quite the same as the power law above. This
is a perceptual motivation for the log-scaling step in the MFCC extraction.
Another motivation for log-scaling in speech analysis is that it can be used to
deconvolve the slowly varying modulation from the rapid excitation at the
pitch period.
3.5.4. Discrete Cosine Transform
As the last step, the discrete cosine transform (DCT) is used as a
computationally inexpensive method to de-correlate the mel-spectral log-
scaled coefficients. The basis functions of the DCT have been found to be
quite similar to the eigenvectors of a Principal Component Analysis (PCA) on
music, which suggests that the DCT can indeed be used for de-correlation. As
illustrated in Figure 4.2, the assumption of de-correlated MFCCs is, however,
doubtful. Normally, only a subset of the DCT basis functions is used, and the
result is then an even lower dimensional feature vector of MFCCs [9,10].
Figure 3.6. Illustration of the filterbank/matrix which is used to convert the
linear frequency scale into the logarithmic mel-scale in the calculation of the
MFCCs. The filters are seen to be overlapping and have logarithmic increase
in bandwidth.
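The four steps above can be sketched end to end for a single frame. This is an illustrative reconstruction, not the project's code: the FFT length, the toy triangular filterbank, and the choice of 13 coefficients are assumptions of the sketch.

```python
# MFCC sketch for one frame: DFT -> mel filterbank -> log -> DCT.
import numpy as np
from scipy.fft import dct

def mfcc_frame(frame, filterbank, n_coeffs=13):
    spectrum = np.abs(np.fft.rfft(frame, n=1024))      # step 1: DFT magnitude
    mel_energies = filterbank @ spectrum               # step 2: mel scaling
    log_energies = np.log(mel_energies + 1e-10)        # step 3: log scaling
    return dct(log_energies, type=2, norm="ortho")[:n_coeffs]  # step 4: DCT

# Toy triangular filterbank: 32 filters over the 513 rfft bins (placeholder
# shape only; a real filterbank is mel-spaced as described above).
fb = np.maximum(0.0, 1.0 - np.abs(
    np.linspace(0, 31, 513)[None, :] - np.arange(32)[:, None]))

frame = np.hamming(400) * np.random.default_rng(2).normal(size=400)
coeffs = mfcc_frame(frame, fb)
print(coeffs.shape)   # (13,)
```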
CHAPTER 4
4. CLASSIFICATION
4.1. Introduction
The feature extractor, as discussed in Chapter 3, computes feature vectors
representing the data to be classified. These feature vectors are then used to assign
each object to a specific category. This is the classification part, which constitutes the
second basic building block of a music genre recognition system.
Classification is a subfield of decision theory. It relies on the basic assumption that
each observed pattern belongs to a category, which can be thought of as a prototype
for the pattern. Regardless of the differences between the individual patterns, there is
a set of features that are similar in patterns belonging to the same class, and different
between patterns from different classes. These features can be used to determine class
membership.
Music can be of arbitrary complexity, and songs from one genre differ in many ways.
Still, humans are able to categorize them easily. This seems to support our assumption
that there are certain fundamental properties shared by pieces belonging to one genre.
Classification can also be understood by approaching it in geometrical terms. As
stated before, the feature vectors can be regarded as points in feature space. The goal
of the classifier is to find decision boundaries that partition the feature space into
regions that correspond to the individual classes. New data items are then classified
based on what region they lie in. This depends on a feature representation of the data
in which feature vectors from the same category can easily be distinguished from
feature vectors from other categories [11].
4.2. Domain Independence
Finding a good feature representation requires in-depth knowledge of the data and
context; feature extractors must be adapted to the specific problem and are highly
domain-dependent. Classification techniques, on the other hand, are basically domain-
independent. This can be explained by the fact that feature extraction is also an
abstraction step, transforming domain-specific data into a more general numerical
representation that can be processed by a generic classifier.
The feature extraction part is where knowledge of music, psychoacoustics, signal
processing, and many other fields is required; it is an area that has only recently
started to receive the attention it deserves, and there is a limited basis of previous
work to build on. Classification, on the other hand, is an advanced field that has been
studied for many years, and that provides us with many fast, elegant and well-
understood solutions that can be adopted for use in music genre recognition.
4.3. Difficulties
The main difficulty in classification arises from the fact that in addition to the
dissimilarities caused by the different underlying models, the feature values for
objects belonging to the same category often also vary considerably. If all objects
from one class were perfectly equal, classification would be trivial, but such is not the
case. The classifier never sees the actual data, only the feature vectors. Therefore, the
following is equally true: A feature representation that extracts exactly the
information that differentiates the categories would also eliminate the need for a
complex classification step. Likewise, a perfect classifier would not need any feature
extraction at all, but would be able to uncover the true class membership from the raw
data. In reality, neither feature extractors nor classifiers are perfect, but may be
combined to produce working results.
The variation in patterns belonging to the same category can be due to two factors:
First, the underlying model might generate that complexity: A relatively simple model
can create seemingly random output, which cannot trivially be detected by an
observer who does not know the model. Secondly, considerable variation can be
caused by noise. Noise can be defined as any property of the pattern that is not due to
the true underlying model but instead to randomness in the world or the sensors. As is
obvious from this definition, noise is present in all objects in nature.
The challenge is to distinguish the two kinds of differences between feature values:
are they caused by different models, which means that the objects belong to different
categories, or are they due to noise or the complexity of the model, meaning that the
objects belong to the same category?
4.4. Training and Learning
Creating a classifier usually means specifying its general form, and estimating its
unknown parameters through training. Training can be defined as the process of using
sample data to determine the parameter settings of the classifier, and is essential in
virtually all real-world classification systems.
Classification is often called supervised learning. Supervised learning consists of
using labeled feature vectors to train classifiers that automatically assign class labels
to new feature vectors. Another variant of learning is unsupervised learning or
clustering, which does not use any label information; the system tries to form natural
groupings. Reinforcement learning, a third type, refers to a technique where the
feedback to the system is only right or wrong – no information about the correct result
is given in case of a wrong categorization [10,11].
4.5. Model development
Classification is a form of data analysis that extracts models describing important data
classes. Such models, called classifiers, predict categorical (discrete, unordered) class
labels. The first step in the classification task is to construct a classification
model; in the classification step, the model is then used to predict class
labels for the given data. Among the many prevailing classification schemes, we have
used the Gaussian Mixture Model (GMM) to construct the classifier for our project.
4.5.1. Gaussian Mixture Model
A GMM is a parametric probability density function represented as a weighted sum of
Gaussian component densities. A GMM is a probabilistic model that assumes all the
data points are generated from a mixture of a finite number of Gaussian distributions
with unknown parameters. GMMs are commonly used as a parametric model of the
probability distribution of continuous measurements or features in a biometric system,
such as vocal-tract related spectral features in a speaker recognition system. GMM
parameters are estimated from training data using the iterative Expectation-
Maximization (EM) algorithm or Maximum A Posteriori (MAP) estimation from a
well-trained prior model. The EM algorithm is used for fitting mixture-of-Gaussians
models; one can also draw confidence ellipsoids for multivariate models and compute
the Bayesian Information Criterion to assess the number of clusters in the data.
A Gaussian mixture model is a weighted sum of M component Gaussian densities, as
given by equation (4.1),
p(x|λ) = Σ (i=1..M) wi g(x|µi, Σi) (4.1)
where x is a D-dimensional continuous-valued data vector (i.e. measurements or
features), wi, i = 1, . . . , M, are the mixture weights, and g(x|µi, Σi), i = 1, . . . , M,
are the component Gaussian densities. Each component density is a D-variate
Gaussian function of the form as expressed by the equation 4.2,
g(x|µi, Σi) = (1 / ((2π)^(D/2) |Σi|^(1/2))) exp{ −(1/2) (x − µi)′ Σi^(−1) (x − µi) } (4.2)
with mean vector µi and covariance matrix Σi. The mixture weights satisfy the
constraint Σ (i=1..M) wi = 1.
The complete Gaussian mixture model is parameterized by the mean vectors,
covariance matrices and mixture weights of all component densities. These
parameters are collectively represented by the notation of equation 4.3,
λ = {wi, µi, Σi}, i = 1, . . . , M. (4.3)
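Equations (4.1) to (4.3) translate directly into a small density-evaluation routine. This is an illustrative sketch with toy parameters (M = 2 components, D = 2, diagonal covariances), not the project's implementation:

```python
# Sketch evaluating the GMM density p(x|lambda) of equations (4.1)-(4.3)
# for diagonal covariance matrices.
import numpy as np

def gmm_pdf(x, weights, means, variances):
    """p(x|lambda) = sum_i w_i g(x|mu_i, Sigma_i), diagonal Sigma_i."""
    D = x.shape[0]
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.prod(var)))
        expo = -0.5 * np.sum((x - mu) ** 2 / var)
        total += w * norm * np.exp(expo)
    return total

weights = np.array([0.4, 0.6])                 # must sum to 1
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])
print(gmm_pdf(np.zeros(2), weights, means, variances))
```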
There are several variants on the GMM shown in equation (4.3). The covariance
matrices, Σi, can be full rank or constrained to be diagonal. Additionally, parameters
can be shared, or tied, among the Gaussian components, such as having a common
covariance matrix for all components. The choice of model configuration (number of
components, full or diagonal covariance matrices, and parameter tying) is often
determined by the amount of data available for estimating the GMM parameters and
by how the GMM is used in a particular biometric application. It is also important to
note that because the component Gaussians act together to model the overall feature
density, full covariance matrices are not necessary even if the features are not
statistically independent. A linear combination of diagonal-covariance Gaussians is
capable of modeling the correlations between feature vector elements; the effect of a
set of M full-covariance Gaussians can be equally obtained with a larger set of
diagonal-covariance Gaussians. GMMs are often
used in biometric systems, most notably in speaker recognition systems, due to their
capability of representing a large class of sample distributions. One of the powerful
attributes of the GMM is its ability to form smooth approximations to arbitrarily
shaped densities. The classical uni-modal Gaussian model represents feature
distributions by a position (mean vector) and an elliptic shape (covariance matrix),
while a vector quantizer (VQ) or nearest-neighbor model represents a distribution by
a discrete set of characteristic templates. A GMM acts as a hybrid between these two
models by using a discrete set of Gaussian functions, each with its own mean and
covariance matrix, allowing better modeling capability [12].
The use of a GMM for representing feature distributions in a biometric system may
also be motivated by the intuitive notion that the individual component densities may
model some underlying set of hidden classes. For example, in speaker recognition, it
is reasonable to assume that the acoustic space of spectrally related features
corresponds to a speaker's broad phonetic events, such as vowels, nasals or fricatives.
These acoustic classes reflect general speaker-dependent vocal tract configurations
that are useful for characterizing speaker identity. The spectral shape of the i-th
acoustic class can in turn be represented by the mean µi of the i-th component
density, and variations of the average spectral shape by the covariance matrix Σi.
Because all the features used to train the GMM are unlabeled, the acoustic classes are
hidden, in that the class of an observation is unknown.
4.5.2. Parameter Estimation
In most classification problems, the conditional densities are not known. However, in
many cases, a reasonable assumption can be made about their general form. This
makes the problem significantly easier, since we need only estimate the parameters of
the functions, not the functions themselves. The unknown probability densities are
usually estimated in a training process, using sample data. For instance it might be
assumed that p(x|wi) is a normal density. We then need to find the values of the mean
µ and the covariance Σ.
Maximum-Likelihood Parameter Estimation
Given training vectors and a GMM configuration, we wish to estimate the parameters
of the GMM, λ, which in some sense best matches the distribution of the training
feature vectors. There are several techniques available for estimating the parameters
of a GMM. By far the most popular and well-established method is Maximum
Likelihood (ML) estimation, and we have adopted this method in our
project [11,12]. The aim of ML estimation is to find the model parameters which
maximize the likelihood of the GMM given the training data. For a sequence of T
training vectors X = {x1, . . . , xT}, the GMM likelihood, assuming independence
between the vectors, can be written as
p(X|λ) = Π (t=1..T) p(xt|λ) (4.4)
Unfortunately, this expression is a non-linear function of the parameters λ, and direct
maximization is not possible. However, ML parameter estimates can be obtained
iteratively using a special case of the EM algorithm.
On each EM iteration, the following re-estimation formulas are used, which guarantee
a monotonic increase in the model's likelihood value:
Mixture weights: wi = (1/T) Σ (t=1..T) Pr(i|xt, λ) (4.5)
Means: µi = (Σ (t=1..T) Pr(i|xt, λ) xt) / (Σ (t=1..T) Pr(i|xt, λ)) (4.6)
Variances: σi² = (Σ (t=1..T) Pr(i|xt, λ) xt²) / (Σ (t=1..T) Pr(i|xt, λ)) − µi² (4.7)
where Pr(i|xt, λ) is the a posteriori probability of component i given xt, and σi², xt,
and µi refer to arbitrary elements of the vectors σi², xt, and µi, respectively.
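One EM re-estimation pass can be sketched in NumPy. This is an illustrative diagonal-covariance implementation of the update step, not the project's code; the synthetic data, the starting parameters, and the variance floor are all assumptions of the sketch.

```python
# One EM re-estimation pass for a diagonal-covariance GMM: E-step computes
# the posteriors Pr(i|x_t, lambda), M-step applies the weight, mean and
# variance updates.
import numpy as np

def em_step(X, w, mu, var):
    """X: (T, D) data; w: (M,); mu, var: (M, D). Returns updated parameters."""
    T, D = X.shape
    # E-step: posterior of each component for each frame.
    log_g = (-0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2)
             - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1))
    resp = w * np.exp(log_g)
    resp /= resp.sum(axis=1, keepdims=True)          # (T, M)
    # M-step re-estimation.
    Nk = resp.sum(axis=0)                            # effective counts
    w_new = Nk / T
    mu_new = (resp.T @ X) / Nk[:, None]
    var_new = (resp.T @ (X ** 2)) / Nk[:, None] - mu_new ** 2
    return w_new, mu_new, np.maximum(var_new, 1e-6)  # floor the variances

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
w, mu, var = np.full(2, 0.5), np.array([[0., 0.], [4., 4.]]), np.ones((2, 2))
for _ in range(10):
    w, mu, var = em_step(X, w, mu, var)
print(np.round(mu, 1))   # component means move onto the two clusters
```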
CHAPTER 5
5. REQUIREMENT ANALYSIS
5.1. Functional Requirements
Functional requirements of the system are as follows:
The system should be able to add new genres for classification.
The system should be able to take training data (audio files) to train the system.
From the supplied valid training data, the system should be able to generate a proper
model.
The system should be able to classify the input test file into an appropriate genre.
The system should provide satisfactory results.
5.2. Non-functional Requirements
5.2.1. Performance
Since we have used Python for scripting, list processing and array
manipulation become extremely easy and fast. Also, the availability of the
NumPy and SciPy packages for Python makes vectorized computation
available, so array and matrix processing can be carried out faster.
5.2.2. Accuracy
Compared to other methods of model generation and testing, the GMM is
found to be relatively more reliable and accurate, and our implementation
also supports this observation [13].
5.2.3. Reliability
Over time, gradual change in every aspect is expected, including music.
Hence, to maintain the reliability of the system as musical tastes change,
the templates can be updated [13,14].
CHAPTER 6
6. METHODOLOGY
6.1. Introduction
Methodology is the analysis of the tasks to be done in order to obtain the desired
output. An appropriate methodology largely determines the success of a project, and
vice versa. For this system, a number of methodologies were considered and the most
efficient ones were used. This does not mean that one particular method was used;
the most appropriate methods were used in combination.
6.2. Various system diagrams and descriptions:
Flow Chart: Figure 6.2 illustrates the flow of control and the task division that
our project encompasses.
Figure 6.2. Flow chart of the system
Our system consists of three basic blocks which are explained briefly below.
1. Pre-processing block (vectorizing block): In the pre-processing block, input audio
files are vectorized. Initially, audio files are sampled at 44100 Hz, the standard
sampling frequency of a WAV audio file. The resulting samples are then framed
using a Hamming window. A window length of 160 was used for framing, and
successive frames overlapped the previous frame by 30%. This reduces the chance
of missing an important characteristic feature of a song. Finally, silent zones are
removed from the audio signal: frames with zero energy and frames with energy
below a threshold are discarded, and only frames with sufficient energy are passed
on for further processing. The threshold energy is determined by taking the median
of the energy values over the whole framed data. 160-dimensional vectors are
obtained as the output of this preprocessing block.
2. Feature extraction block: The output of the preprocessing block is the input to
this block. First, the 160-dimensional vectors are transformed to the frequency
domain using a Fast Fourier Transform (FFT) of length 1024. The data are then
passed through mel-filter banks; 32 mel-filter banks are used, generating (32, n)-
dimensional data, where n represents the number of observations (frames)
obtained from the pre-processing block. The resulting data are the MFCCs, which
are carried forward for further processing: for model generation in the training
phase, and for computation of the maximum-likelihood value in the testing phase.
3. Model generation block: This is an important block in our system. The extracted
feature vectors (MFCCs) are used for model generation. MFCCs obtained from a
number of music files of known genre are fit to 8- or 16-component GMMs, as per
requirement. For model generation, the parameters, i.e. the mixture weights,
means and covariance matrices, must first be initialized. This is done through
k-means clustering. According to the required number of mixture components,
k 32-dimensional means are randomly initialized (we have n 32-dimensional
observations) and clustering is carried out. The clustered data are then used to
compute the mixture weights and covariance matrices for parameter initialization
in the GMM. Starting from the initialized parameters, the values of all of these
parameters are updated in each iteration as the structure of the Gaussians changes.
After the complete optimization (maximization), models are generated for each
genre. The generated models are then stored and later used for testing. The
models are generated through the GMM model generation method via the EM
algorithm.
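The three blocks above can be sketched end to end. This is an illustrative reconstruction, not the project's actual code: the helper names (`preprocess`, `extract_mfcc`), the toy filterbank, and the random stand-in signal are assumptions, and scikit-learn's `GaussianMixture` (which also uses k-means initialization followed by EM) stands in for the project's own implementation. Parameters follow the text: window length 160, 30% overlap, FFT length 1024, 32 mel filters, 8 Gaussian components.

```python
# Illustrative end-to-end sketch of the three blocks described above.
import numpy as np
from sklearn.mixture import GaussianMixture  # k-means init + EM internally

def preprocess(signal, N=160, overlap=0.3):
    """Block 1: frame with a Hamming window and drop low-energy frames."""
    hop = int(N * (1 - overlap))
    w = np.hamming(N)
    frames = np.array([w * signal[i:i + N]
                       for i in range(0, len(signal) - N + 1, hop)])
    energy = np.sum(frames ** 2, axis=1)
    return frames[energy > np.median(energy)]        # silence removal

def extract_mfcc(frames, fb):
    """Block 2: FFT of length 1024, then a 32-filter mel filterbank."""
    spectra = np.abs(np.fft.rfft(frames, n=1024, axis=1))
    return np.log(spectra @ fb.T + 1e-10)            # (n_frames, 32)

# Block 3: fit one GMM per genre on the extracted features.
rng = np.random.default_rng(4)
fb = np.maximum(0.0, 1.0 - np.abs(                   # toy placeholder filterbank
    np.linspace(0, 31, 513)[None, :] - np.arange(32)[:, None]))
signal = rng.normal(size=44100)                      # stand-in for one song
feats = extract_mfcc(preprocess(signal), fb)
model = GaussianMixture(n_components=8, covariance_type="diag").fit(feats)
print(model.score(feats))                            # average log-likelihood
```

At test time, the genre whose stored model gives the highest likelihood for the test file's features would be chosen.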
Activity Diagram: The activity flow of the various components of our system is
depicted in Figures 6.3a and 6.3b.
Figure 6.3a. Activity diagram of system training
Figure 6.3b. Activity diagram for testing
Figure 6.3a illustrates the various activities carried out while the system is trained
using the training data set, whereas Figure 6.3b shows the activities related to testing
an audio file.
6.3. Project Tools
Programming Languages: Python 2.7, Matlab R2010a
Drawings and diagrams: Visio, Argo UML
Documentation: MS-WORD
Platform: Windows
IDE: Pycharm
6.3.1. Why MATLAB and Python?
Matrix Laboratory (MATLAB) is a multi-paradigm numerical
computing environment and fourth-generation programming language. MATLAB
allows matrix manipulations, plotting of functions and data, implementation of
algorithms, creation of user interfaces, and interfacing with programs written in other
languages, including C, C++, Java, and Fortran. It provides mathematical functions
for linear algebra, statistics, Fourier analysis, filtering, optimization, numerical
integration, and solving ordinary differential equations; built-in graphics for
visualizing data and tools for creating custom plots; and development tools for
improving code quality, maintainability and performance. The availability of these
built-in functions and libraries made it easier to carry out the simulation phase of our
project: we could devote the time saved from hand-coding to other research work,
and test results for models generated with different lengths and sizes of training data
could be determined quickly.
Python, on the other hand, provides similar features through the NumPy and SciPy
libraries. This pair of libraries provides array and matrix structures, linear algebra
routines, numerical optimization, random number generation, statistics routines,
differential equation modeling, Fourier transforms and signal processing, image
processing, sparse and masked arrays, spatial computation, and numerous other
mathematical routines. Together, they cover most of MATLAB's basic functionality
and parts of many of its toolkits, and include support for reading and writing
MATLAB files. Python also allows one to easily leverage object-oriented and
functional design patterns. Just as different
problems call for different ways of thinking, different problems call for different
programming paradigms. There is no doubt that a linear, procedural style is natural
for many scientific problems. However, an object-oriented style that builds on classes
with internal functionality and external behavior is a natural fit for others, and classes
in Python are full-featured and practical. Functional programming, which builds on
the power of iterators and functions-as-variables, makes many programming
solutions concise and intuitive. In Python, everything can be passed around as an
object, including functions, class definitions, and modules; iterators are a key
language component, and Python comes with a full-featured iterator library. While
Python does not go as far in any of these categories as flagship paradigm languages
such as Java, it does allow one to use some very practical tools from each. These
features combine to make the language very flexible for problem solving, one key
reason for its popularity. The ease of balancing high-level programming with
low-level optimization is a particular strong point of Python. However, as with most
high-level languages, we often sacrifice code speed for programming speed. In this
context, speeding code up means vectorizing an algorithm to work with arrays of
numbers instead of single numbers, thus reducing the overhead of the language by
relying on optimized array operations.
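The vectorization idea described above can be sketched in a few lines of NumPy. The array sizes and the windowing formula here are purely illustrative, not taken from the project's code:

```python
import numpy as np

# Hypothetical frame of audio samples
samples = np.linspace(-1.0, 1.0, 1000)
n = len(samples)

# Loop version: apply a Hamming-style taper sample by sample
tapered_loop = []
for i, s in enumerate(samples):
    w = 0.54 - 0.46 * np.cos(2 * np.pi * i / (n - 1))
    tapered_loop.append(s * w)
tapered_loop = np.array(tapered_loop)

# Vectorized version: one array expression over the whole frame
i = np.arange(n)
window = 0.54 - 0.46 * np.cos(2 * np.pi * i / (n - 1))
tapered_vec = samples * window

# Both produce the same values; the vectorized form pushes the
# per-element work into NumPy's optimized inner loops.
assert np.allclose(tapered_loop, tapered_vec)
```

The two results are numerically identical; the difference is only where the per-element loop runs, which is exactly the speed gap discussed above.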
6.3.2. Pycharm as IDE
PyCharm is an Integrated Development Environment (IDE) used for programming
in Python. It provides code analysis, a graphical debugger, an integrated unit tester,
Version Control System (VCS) integration, and support for web development
with Django. It is cross-platform, running on Windows, Mac OS X and Linux.
CHAPTER 7
7. OUTPUT
Testing carried out with 8 Gaussian components and 5-second test files produced the output
summarized in the table below. Different lengths of training data (30, 60 and 90 seconds)
were used to generate the model for each genre with 8 Gaussian components.
Training data length | Average accuracy (%) | Precision (%) | Recall (%) | Error (%)
30 seconds           | 88.57                | 73.43         | 69.37      | 11.43
60 seconds           | 88.00                | 72.25         | 68.07      | 12.00
90 seconds           | 90.15                | 73.32         | 75.96      |  9.85
Table 7.1. 8-components for 5 sec test data
The output of testing carried out with 8 Gaussian components and 10-second test data
is summarized in the table below.
Training data length | Average accuracy (%) | Precision (%) | Recall (%) | Error (%)
30 seconds           | 89.38                | 70.86         | 74.13      | 10.62
60 seconds           | 90.15                | 73.21         | 76.76      |  9.85
90 seconds           | 90.00                | 72.35         | 74.72      | 10.00
Table 7.2. 8-components for 10 sec test data
The output of testing carried out with 16 Gaussian components and 5-second test data
from each test file is summarized in the table below.
Training data length | Average accuracy (%) | Precision (%) | Recall (%) | Error (%)
30 seconds           | 89.54                | 71.50         | 75.63      | 10.46
60 seconds           | 91.08                | 75.95         | 77.85      |  8.92
90 seconds           | 91.23                | 76.47         | 78.40      |  8.77
Table 7.3. 16-components for 5 sec test data
The output of testing carried out with 16 Gaussian components and 10-second test data
is summarized in the table below.
Training data length | Average accuracy (%) | Precision (%) | Recall (%) | Error (%)
30 seconds           | 89.85                | 72.29         | 77.11      | 10.15
60 seconds           | 90.62                | 74.63         | 77.07      |  9.38
90 seconds           | 91.23                | 76.12         | 78.57      |  8.77
Table 7.4. 16-components for 10 sec test data
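The accuracy, precision and recall figures reported in the tables above can be derived from a confusion matrix. The following is a minimal sketch using a hypothetical 3-genre confusion matrix, not the project's actual counts:

```python
import numpy as np

# Hypothetical 3-genre confusion matrix (rows: true class, columns:
# predicted class). These counts are illustrative only.
cm = np.array([[18,  1,  1],
               [ 2, 15,  3],
               [ 1,  4, 15]], dtype=float)

accuracy  = np.trace(cm) / cm.sum()        # correct / total
precision = np.diag(cm) / cm.sum(axis=0)   # per predicted class
recall    = np.diag(cm) / cm.sum(axis=1)   # per true class
error     = 1.0 - accuracy

print("accuracy = %.4f" % accuracy)
print("macro precision = %.4f" % precision.mean())
print("macro recall    = %.4f" % recall.mean())
```

Macro-averaging (the unweighted mean over classes) is one common convention; whether the project's tables used macro or weighted averaging is not stated, so this sketch only shows the mechanics.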
Different lengths of training data were used to generate models. With different models used
for testing, variation in the output was expected, and such variation was indeed observed as
the model was changed. From the tables of results above, the models generated with 8
components from 90 seconds of training data proved more reliable than those built from 30
or 60 seconds of training data. Also, compared to the 8-component Gaussian model, the
16-component Gaussian model was found to be more effective, as illustrated by Figure 8.1.
Figure 7.1. Classification using 16 Gaussian component model of 90s train data for 5s
test data
Out of the five genres (rock, pop, hip-hop, classic, blues) undertaken for analysis, rock music
exhibited the most distinct features (musical timbre, rhythm), which made recognition of rock
music easier than that of the other genres. Tests for the other genres (blues, jazz, classic)
gave satisfactory results. In the case of pop music, however, classification accuracy was
below 60%, possibly because of the resemblance of pop music to rock music.
CHAPTER 8
8. RESULT AND ANALYSIS
From the results, it was found that higher accuracy was obtained with the
16-component Gaussian model generated from 60 seconds of training data than with
the other combinations. Also, test data 5 seconds long was determined to be
appropriate for testing: longer test data would consume more time without a
corresponding increase in classification accuracy.
MFCCs were found to be more effective than cepstral coefficients computed by other
methods, such as linear predictive coding, because MFCCs more closely model the
frequency response and hearing characteristics of the human ear.
GMM, on the other hand, is a soft clustering method based on a probabilistic model.
By assigning probability weights to the different components through the EM
algorithm, Gaussian models are generated that represent the different characteristics
of the music. Using the mixture weights, means, and covariance matrices computed
from the training data in the maximum-likelihood (ML) computation, relatively
accurate and reliable results can be obtained.
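The EM fitting step can be illustrated with a toy sketch: a one-dimensional, two-component mixture fitted to synthetic data. The real system fits multivariate mixtures over MFCC vectors, so everything below (the data, the initial values, the component count) is illustrative only:

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic 1-D "features" drawn from two Gaussians; stand-ins for
# MFCC values (the real system uses multivariate mixtures).
x = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(3.0, 1.0, 700)])

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Initial guesses for mixture weights, means and variances
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibility of each component for each sample
    dens = np.stack([w[k] * gauss(x, mu[k], var[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)
    # M-step: re-estimate weights, means and variances
    nk = resp.sum(axis=1)
    w = nk / len(x)
    mu = (resp * x).sum(axis=1) / nk
    var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk

print(sorted(np.round(mu, 1)))  # component means settle near the true -2.0 and 3.0
```

Classification then amounts to evaluating the fitted likelihood of a test feature sequence under each genre's mixture and picking the genre with the highest value.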
Figure 8.1 Plot of accuracy obtained for different length of test data with different
models
As discussed in the previous chapter, the 16-component Gaussian models generated
from 60-second and 90-second training data produced more accurate output than the
8-component models. However, generating and testing with the 16-component
models consumed more time: the additional components introduce more
computation, so the extra time consumption is to be expected. Models with 32
Gaussian components would make the computations more complex still and would
consume even more testing time. Considering time consumption, complexity and
accuracy together, building the system around 16-component models is more suitable
than introducing more components for marginally higher accuracy.
We performed training and testing separately in both MATLAB and Python, and the
following results were observed:
Process | MATLAB      | Python
Train   | 25 minutes  | 2 minutes
Test    | 193 seconds | 0.03 seconds
Table 8.1 Time taken for training and testing
The table above shows the observed training and testing times. Training a
16-component model from 90 seconds of training data taken from each of 20 songs
took about 25 minutes in MATLAB, whereas it took about 2 minutes in Python.
Likewise, testing a song from a 5-second clip took 193 seconds in MATLAB but
0.03 seconds in Python. The main reason behind this difference is likely the method
of computation: in Python, the computations were carried out by vectorizing the data
matrix.
CHAPTER 9
9. CONCLUSION AND FURTHER ENHANCEMENT
9.1. Conclusion
With all the effort invested in the project, there is reason to believe that by the end of
this semester the project is in much better shape and considerably closer to practical
acceptance than it was. We summarize the progress with respect to the main
objectives of the project, namely accuracy and consistency.
Accuracy: This was the main obstacle for the project. The papers we followed
considered 4 different genres (rock, pop, classic and jazz) for system
development and performance evaluation, and obtained about 80% average
classification accuracy. We were able to obtain a satisfactory result in the
recognition of 5 different genres (rock, pop, classic, jazz and blues).
Consistency: Consistency was also a challenging factor for this project. The
need to reduce inconsistent results made it difficult to balance accuracy
against consistency. However, through the data mining techniques used, we
were able to improve the consistency of the system's output.
9.2. Limitations
Our system has the following limitations.
- We have undertaken only five genres for classification.
- Music genre depends not only on rhythm but also on the way the instruments are
played and on how an artist sings a song. So, classification can never be one
hundred percent accurate.
- Models were generated using 30-, 60- and 90-second training data from each song,
with 20 songs used per model. Higher classification accuracy could be obtained by
using more training data for model generation; however, the computational
complexity and time consumption are major drawbacks.
- Music characteristics change over time: the tempo, rhythm and vocal
characteristics vary within a song. If the data is not chosen properly for testing,
the classification result may be incorrect.
- A longer clip could be considered for testing to obtain a more accurate result, but
again, the added computational complexity and time consumption cause a lag in
performance.
- We have not clustered music of different genres into their respective directories.
9.3. Further Enhancements
There are great opportunities to enhance this project in the future. A few of the
possible future enhancements are listed below.
- The GUI can be made more user-friendly and attractive.
- At present, the application classifies/recognizes music of only five genres. In the
future, the number of genres considered for classification can be increased.
- Clustering of music files into different directories can be carried out as a future
enhancement of the application.
- More effective classification models can be generated using more training data
(i.e. more than 20 songs) when a large amount of data is available.
With the passage of time, musical taste, and with it the characteristics of a genre, may
change gradually. To take this factor into account, the system can be modified to
update the templates, that is, the generated classification models, so as to maintain
the performance of the system and keep it operable.
10. REFERENCES
[1] Douglas A. Reynolds and Richard C. Rose, "Robust Text-Independent Speaker
Identification Using Gaussian Mixture Speaker Models," January 1995.
[2] Karin Koshina, "Music Genre Recognition," 2002.
[3] Michael Haggblade, Yang Hong and Kenny Kao, "Music Genre Classification."
[4] Tom Diethe, Gabi Teodoru, Nick Furl and John Shawe-Taylor, "Sparse Multiview
Methods for Classification of Musical Genre from Magnetoencephalography
Recordings."
[5] Cory McKay, "Issues in Automatic Musical Genre Classification."
[6] Mohit Rajani and Luke Ekkizogloy, "Supervised Learning in Genre Classification."
[7] Mandel and Ellis, "Song-Level Features and Support Vector Machines for Music
Classification."
[8] Muralidhar Talupur, Suman Nath and Hong Yan, "Classification of Music Genre."
[9] Antonio Jose Homsi Goulart, Rodrigo Capobianco Guido and Carlos Dias Maciel,
"Exploring Different Approaches for Music Genre Classification," March 2012.
[10] Pedro Domingos, "Structured Machine Learning: Ten Problems for the Next Ten
Years."
[11] Nicolas Scaringella and Giorgio Zoia, "On the Modeling of Time Information for
Automatic Genre Recognition Systems in Audio Signals."
[12] George Tzanetakis, Georg Essl and Perry Cook, "Automatic Musical Genre
Classification of Audio Signals."
[13] Sam Clark, Danny Park and Adrien Guerard, "Music Genre Classification Using
Machine Learning Techniques," May 2012.
[14] Shumeet Baluja, Vibhu O. Mittal and Rahul Sukthankar, "Applying Machine
Learning for High Performance Named-Entity Extraction," November 2000.
APPENDICES
11. APPENDIX A. WINDOW FUNCTION AND WINDOWING
In signal processing, a window function (also known as an apodization function or tapering
function) is a mathematical function that is zero-valued outside of some chosen interval. For
instance, a function that is constant inside the interval and zero elsewhere is called
a rectangular window, which describes the shape of its graphical representation. When
another function or waveform/data-sequence is multiplied by a window function, the product
is also zero-valued outside the interval: all that is left is the part where they overlap, the
"view through the window".
Applications of window functions include spectral analysis, filter design, and beamforming.
In typical applications, the window functions used are non-negative, smooth, "bell-shaped"
curves, though rectangular, triangular, and other functions can be used.
A more general definition of window functions does not require them to be identically zero
outside an interval, as long as the product of the window and its argument is square-
integrable and, more specifically, the function goes sufficiently rapidly toward zero.
Major applications of window functions include the design of finite impulse response
filters and spectral analysis.
SPECTRAL ANALYSIS
The Fourier transform of the function cos ωt is zero, except at frequency ±ω. However, many
other functions and waveforms do not have convenient closed form transforms. Alternatively,
one might be interested in their spectral content only during a certain time period.
In either case, the Fourier transform (or something similar) can be applied on one or more
finite intervals of the waveform. In general, the transform is applied to the product of the
waveform and a window function. Any window (including rectangular) affects the spectral
estimate computed by this method.
WINDOWING
Windowing of a simple waveform like cos ωt causes its Fourier transform to develop non-
zero values (commonly called spectral leakage) at frequencies other than ω. The leakage
tends to be worst (highest) near ω and least at frequencies farthest from ω.
If the waveform under analysis comprises two sinusoids of different frequencies, leakage can
interfere with the ability to distinguish them spectrally. If their frequencies are dissimilar and
one component is weaker, then leakage from the larger component can obscure the weaker
one's presence. But if the frequencies are similar, leakage can render them irresolvable even
when the sinusoids are of equal strength.
The rectangular window has excellent resolution characteristics for sinusoids of comparable
strength, but it is a poor choice for sinusoids of disparate amplitudes. This characteristic is
sometimes described as low-dynamic-range.
At the other extreme of dynamic range are the windows with the poorest resolution. These
high-dynamic-range, low-resolution windows are also poorest in terms of sensitivity; that is,
if the input waveform contains random noise close to the frequency of a sinusoid, the
response to the noise, compared to the sinusoid, will be higher than with a higher-resolution
window. In
other words, the ability to find weak sinusoids amidst the noise is diminished by a high-
dynamic-range window. High-dynamic-range windows are probably most often justified in
wideband applications, where the spectrum being analyzed is expected to contain many
different components of various amplitudes.
In between the extremes are moderate windows, such as Hamming and Hann. They are
commonly used in narrowband applications, such as the spectrum of a telephone channel. In
summary, spectral analysis involves a tradeoff between resolving comparable strength
components with similar frequencies and resolving disparate strength components with
dissimilar frequencies. That tradeoff occurs when the window function is chosen. These two
windows along with their corresponding Fourier transforms are illustrated in the Figures A.a
and A.b.
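The leakage behavior described above can be demonstrated numerically. The sketch below (all parameters are illustrative) compares the spectrum of an off-bin sinusoid under the implicit rectangular window and under a Hann window:

```python
import numpy as np

# A pure sinusoid whose frequency (10.5 bins) does not fall exactly on
# an FFT bin: the worst case for leakage with a rectangular window.
n = 256
t = np.arange(n)
x = np.cos(2 * np.pi * 10.5 * t / n)

spec_rect = np.abs(np.fft.rfft(x))                  # rectangular window
spec_hann = np.abs(np.fft.rfft(x * np.hanning(n)))  # Hann window

# Compare the leakage far from the peak: the Hann window's sidelobes
# fall off much faster than the rectangular window's.
far = np.arange(40, len(spec_rect))
print("rect leakage: %.4f" % spec_rect[far].max())
print("hann leakage: %.6f" % spec_hann[far].max())
```

The far-from-peak leakage under the Hann window is orders of magnitude below that of the rectangular window, at the cost of a wider main lobe, which is exactly the resolution/dynamic-range tradeoff discussed above.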
Figure A.a. Hamming Window Figure A.b. Hanning Window
12. APPENDIX B. FILTERBANK
In signal processing, a filter bank is an array of band-pass filters that separates the input
signal into multiple components, each one carrying a single frequency sub-band of the
original signal. One application of a filter bank is a graphic equalizer, which can attenuate the
components differently and recombine them into a modified version of the original signal.
The process of decomposition performed by the filter bank is called analysis (meaning
analysis of the signal in terms of its components in each sub-band); the output of analysis is
referred to as a sub-band signal, with as many sub-bands as there are filters in the filter bank.
The reconstruction process is called synthesis, meaning reconstitution of a complete signal
resulting from the filtering process.
In digital signal processing, the term filter bank is also commonly applied to a bank of
receivers. The difference is that receivers also down-convert the sub-bands to a low center
frequency that can be re-sampled at a reduced rate. The same result can sometimes be
achieved by undersampling the band-pass sub-bands.
Another application of filter banks is signal compression, when some frequencies are more
important than others. After decomposition, the important frequencies can be coded with a
fine resolution. Small differences at these frequencies are significant and a coding scheme
that preserves these differences must be used. On the other hand, less important frequencies
do not have to be exact. A coarser coding scheme can be used, even though some of the finer
(but less important) details will be lost in the coding.
The vocoder uses a filter bank to determine the amplitude information of the sub bands of a
modulator signal (such as a voice) and uses them to control the amplitude of the sub bands of
a carrier signal (such as the output of a guitar or synthesizer), thus imposing the dynamic
characteristics of the modulator on the carrier. Figures B.a and B.b show filter banks of
different dimensions.
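A minimal frequency-domain sketch of the analysis/synthesis idea follows. Note this toy version splits the FFT spectrum into contiguous sub-bands rather than using true time-domain band-pass filters, and the function name and parameters are hypothetical:

```python
import numpy as np

def fft_filter_bank(signal, n_bands):
    # Analysis: split the signal into frequency sub-bands by zeroing
    # out all FFT bins outside each band and transforming back.
    spec = np.fft.rfft(signal)
    edges = np.linspace(0, len(spec), n_bands + 1).astype(int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_spec = np.zeros_like(spec)
        band_spec[lo:hi] = spec[lo:hi]   # keep only this sub-band
        bands.append(np.fft.irfft(band_spec, n=len(signal)))
    return bands

rng = np.random.RandomState(1)
x = rng.randn(512)
bands = fft_filter_bank(x, 4)

# Synthesis: because the sub-bands partition the spectrum exactly,
# summing the sub-band signals reconstructs the original signal.
reconstructed = np.sum(bands, axis=0)
print(np.allclose(reconstructed, x))
```

Perfect reconstruction here follows from the linearity of the Fourier transform; practical filter banks built from real filters trade some of this exactness for causality and efficiency.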
Figure B.a. One dimensional Filter Bank Figure B.b. Two Dimensional Filter Bank
13. APPENDIX C. SUPERVISED LEARNING
Supervised learning is the machine learning task of inferring a function from labeled training
data. The training data consist of a set of training examples. In supervised learning, each
example is a pair consisting of an input object (typically a vector) and a desired output value
(also called the supervisory signal). A supervised learning algorithm analyzes the training
data and produces an inferred function, which can be used for mapping new examples. An
optimal scenario will allow for the algorithm to correctly determine the class labels for
unseen instances. This requires the learning algorithm to generalize from the training data to
unseen situations in a reasonable way.
Supervised learning accounts for a lot of research activity in machine learning and many
supervised learning techniques have found application in the processing of multimedia
content. The defining characteristic of supervised learning is the availability of annotated
training data. The name invokes the idea of a "supervisor" that instructs the learning system
on the labels to associate with training examples. Typically, these labels are class labels in
the
classification problems. Supervised learning algorithms induce models from these training
data and these models can be used to classify other unlabelled data.
Supervised learning entails learning a mapping between a set of input variables x and an
output y and applying this mapping to predict the outputs for unseen data. Supervised
learning is the most important methodology in machine learning and it also has a central
importance in the processing of multimedia data.
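As a minimal, self-contained illustration of the train-then-predict workflow described above, here is a nearest-centroid learner on made-up 2-D features (not the project's GMM classifier; all data and names are hypothetical):

```python
import numpy as np

# Labeled training examples: input vectors paired with class labels
train_x = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 1, 1])

# "Training": induce a model (one centroid per class) from the data
classes = np.unique(train_y)
centroids = np.array([train_x[train_y == c].mean(axis=0) for c in classes])

def predict(x):
    # Map an unseen example to the label of its nearest centroid
    d = np.linalg.norm(centroids - x, axis=1)
    return classes[np.argmin(d)]

print(predict(np.array([0.1, 0.0])))   # near the class-0 centroid
print(predict(np.array([1.0, 0.9])))   # near the class-1 centroid
```

The structure mirrors the definition above: labeled pairs go in, an inferred function (`predict`) comes out, and that function generalizes to inputs it has never seen.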