LETTER OF APPROVAL
The undersigned hereby certify that they have read, and recommended to the Institute of
Engineering for acceptance, this project report entitled “Music Genre Classification”
submitted by Anjan Rai, Anju Maharjan, Dipendra Shrestha and Komal Kadmiya in partial
fulfilment of the requirements for the Bachelor's Degree in Computer Engineering.
_________________________________________
Internal Examiner
Dr. Sanjeev Prasad Pandey
Professor
Department of Electronics & Computer Engineering,
Institute of Engineering, Central Campus Pulchowk,
Tribhuvan University, Nepal
________________________________________
External Examiner
Saroj Shakya
Associate Professor
Nepal College of Information Technology,
Pokhara University, Nepal
________________________________________
Dr. Nanda Bikram Adhikari
Deputy Head
Department of Electronics & Computer Engineering,
Institute of Engineering, Central Campus Pulchowk,
Tribhuvan University, Nepal
__________________________________________
Dr. Dibakar Raj Pant
Head
Department of Electronics & Computer Engineering,
Institute of Engineering, Central Campus Pulchowk,
Tribhuvan University, Nepal
DATE OF APPROVAL:
________________________________________
Supervisor
Dr. Shashidhar Ram Joshi
Professor
Department of Electronics & Computer Engineering,
Institute of Engineering, Central Campus Pulchowk,
Tribhuvan University, Nepal
COPYRIGHT
The author has agreed that the Library, Department of Electronics and Computer
Engineering, Pulchowk Campus, Institute of Engineering may make this report freely
available for inspection. Moreover, the author has agreed that permission for extensive
copying of this project report for scholarly purpose may be granted by the supervisors who
supervised the project work recorded herein or, in their absence, by the Head of the
Department wherein the project report was done. It is understood that due recognition will be
given to the author of this report and to the Department of Electronics and Computer
Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this
project report. Copying or publication or other use of this report for financial gain without
the approval of the Department of Electronics and Computer Engineering, Pulchowk Campus,
Institute of Engineering and the author's written permission is prohibited.
Request for permission to copy or to make any other use of the material in this report in
whole or in part should be addressed to:
Dr. Dibakar Raj Pant
Head
Department of Electronics and Computer Engineering
Pulchowk Campus, Institute of Engineering
Lalitpur, Kathmandu
Nepal
ACKNOWLEDGEMENT
We express our sincere gratitude to the Department of Electronics and Computer
Engineering for providing us the opportunity to undertake this project. Likewise, we extend
our thanks to our project supervisor Prof. Dr. Shashidhar Ram Joshi for providing
the essential guidelines and support for understanding the feasibility and other technical
aspects of the project. Finally, we would also like to thank our friends, especially Mr. Bikram
Basnet and Mr. Pravesh Koirala, and our seniors, whose knowledge and experience
helped make this project better. They all helped us to understand and improve upon the
flexibility, flaws and limitations of the project.
- Anjan Rai (70803)
- Anju Maharjan (70804)
- Komal Kadmiya (70819)
- Dipendra Shrestha (70822)
ABSTRACT
People all over the world love music, but not the same kind of music. Different people have
different tastes in music. Some people love pop music while others like to listen to rock music.
These are different genres of music. Music can be divided into different genres in several
ways. The artistic nature of music means that these classifications are often arbitrary and
controversial, and some genres may overlap. Classification of musical genre from audio is a
well-researched area of music research. The creation of huge databases coming from both
restoration of existing analogue archives and new content is demanding fast and reliable tools
for content analysis and description, to be used for searches, content queries and interactive
access. In that context, musical genres are crucial descriptors since they have been widely
used for years to organize music catalogues, libraries and shops. By some counts, there are
126 different genres into which music can be classified, including pop, rock, jazz, trance,
hip hop, and so on. With the growing variety of music, distinguishing the genre of a piece
of music has become increasingly difficult. Through our project “Music Genre Recognition”, we have
simplified this task by automatically classifying a given set of music files on the basis
of the genre they belong to. Most automatic genre classification models rely on the low-level
temporal relationships between audio chunks when classifying audio signals in terms of their
genre; that is, models are generally based on investigating means of modelling short-term time
structures from context information in music segments, consolidating classification
consistency by reducing ambiguities. In our project, we have applied the technique of a
pattern recognition architecture, which encompasses the concepts of feature extraction from
chunks of the audio signal and classifying the features independently via different
classification techniques.
Keywords: Classification techniques, Feature extraction, Music genre.
Contents
ACKNOWLEDGEMENT ........................................................................................................ iv
ABSTRACT ............................................................................................................................... v
LIST OF ABBREVIATIONS ................................................................................................ viii
LIST OF FIGURES ................................................................................................................... x
LIST OF TABLES .................................................................................................................... xi
1 INTRODUCTION .............................................................................................................. 2
1.1. Background ................................................................................................................. 2
1.2. Motivation ................................................................................................................... 3
1.3. Problem Statement ...................................................................................................... 3
1.4. Objectives .................................................................................................................... 4
1.5. Scope of the work ........................................................................................................ 4
1.6. Overview of the project ............................................................................................... 5
2. LITERATURE REVIEW ................................................................................................... 7
2.1. Introduction ................................................................................................................. 7
2.2. A Study of Human Music Genre Classification .......................................................... 8
2.3. Related Works ............................................................................................................. 8
2.4. Training and Testing Data Sets ................................................................................... 9
2.5. Linear Discriminant Analysis.................................................................................... 10
2.5.1. Class-dependent transformation......................................................................... 11
2.5.2. Class-independent transformation ..................................................................... 11
2.6. Support Vector Machine ........................................................................................... 11
3. FEATURE EXTRACTION .............................................................................................. 13
3.1. Introduction ............................................................................................................... 13
3.2. Formal Notation ........................................................................................................ 14
3.3. Feature Extraction Process ........................................................................................ 14
3.4. Basic Features of an audio sample ............................................................................ 17
3.4.1. Beat and Meter: .................................................................................................. 17
3.4.2. Harmony: ........................................................................................................... 17
3.4.3. Pitch: .................................................................................................................. 18
3.5. Mel-Frequency Cepstral Coefficients ....................................................................... 18
4. CLASSIFICATION .......................................................................................................... 23
4.1. Introduction ............................................................................................................... 23
4.2. Domain Independence ............................................................................................... 23
4.3. Difficulties ................................................................................................................. 24
4.4. Training and Learning ............................................................................................... 25
4.5. Model development ................................................................................................... 25
4.5.1. Gaussian Mixture Model.................................................................................... 25
4.5.2. Parameter Estimation ......................................................................................... 27
5. REQUIREMENT ANALYSIS ......................................................................................... 30
5.1. Functional Requirements........................................................................................... 30
5.2. Non-functional Requirements ................................................................................... 30
6. METHODOLOGY ........................................................................................................... 33
6.1. Introduction ............................................................................................................... 33
6.2. Various system diagrams and descriptions: .............................................................. 34
6.3. Project Tools ............................................................................................................. 39
6.3.1. Why MATLAB and Python? ............................................................................. 39
6.3.2. Pycharm as IDE ................................................................................................. 40
7. OUTPUT .......................................................................................................................... 42
8. RESULT AND ANALYSIS ............................................................................................. 46
9. CONCLUSION AND FURTHER ENHANCEMENT .................................................... 50
9.1. Conclusion ................................................................................................................. 50
9.2. Limitations ................................................................................................................ 50
9.3. Further Enhancements ............................................................................................... 51
10. REFERENCE ................................................................................................................ 52
11. APPENDIX A. WINDOW FUNCTION AND WINDOWING ................................... 54
12. APPENDIX B. FILTERBANK .................................................................................... 57
13. APPENDIX C. SUPERVISED LEARNING ............................................................... 59
LIST OF ABBREVIATIONS
2D Two Dimensional
AI Artificial Intelligence
AMGC Automatic Music Genre Classification
CD Compact Disk
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
EM Expectation Maximization
FFT Fast Fourier Transform
GMM Gaussian Mixture Model
HMM Hidden Markov Model
IDE Integrated Development Environment
k-NN k-Nearest Neighbor
LDA Linear Discriminant Analysis
MAP Maximum A Posteriori
MATLAB Matrix Laboratory
MFCC Mel-Frequency Cepstral Coefficient
MIR Music Information Retrieval
ML Maximum Likelihood
MP3 MPEG-1 Audio Layer 3
PCA Principal Component Analysis
SVM Support Vector Machine
TV Television
VCS Version Control System
VQ Vector Quantizer
WT Wavelet Transform
LIST OF FIGURES
Figure.2.1 Figure showing data sets and test vectors in the original space ........................ 10
Figure.3.1 Generating a feature vector from an input data set .......................................... 13
Figure.3.2 Illustration of the traditional feature extraction process ................................. 15
Figure.3.3 Illustration of the frequency spectrum of a harmonic signal with a fundamental
and four overtones ........................................................................................................... 16
Figure.3.4 Beat Histograms for Classical (left) and Pop (right) ....................................... 17
Figure.3.5 Illustration of the calculation of the MFCCs ................................................... 19
Figure.3.6. Illustration of the filterbank/matrix ............................................................... 21
Figure.6.2 Flow chart of the system.................................................................................. 34
Figure 6.3a Activity Diagram of system training ............................................................. 37
Figure 6.3b Activity Diagram for testing .......................................................................... 38
Figure.7.1 Classification using 16 Gaussian component model of 90s train data for 5s test
data .................................................................................................................................. 44
Figure 8.1 Plot of accuracy obtained for different lengths of test data ............................. 47
Figure A.a Hamming Window ......................................................................................... 56
Figure A.b Hanning Window ............................................................................................ 56
Figure B.a One dimensional Filter Bank .......................................................................... 58
Figure B.b Two dimensional Filter Bank ......................................................................... 58
LIST OF TABLES
Table 7.1 8-component for 5 second test data .................................................................. 42
Table 7.2 8-component for 10 second test data ................................................................ 42
Table 7.3 16-component for 5 second test data ................................................................ 43
Table 7.4 16-component for 10 second test data ............................................................. 43
Table 8.1 Time taken for training and testing ................................................................... 48
CHAPTER 1
1 INTRODUCTION
1.1. Background
Distinguishing between musical genres is a herculean task for human
beings. A musical genre is a conventional category that identifies pieces of music as
belonging to a shared tradition or set of conventions. A few seconds of music usually
suffice to allow us to do a rough classification, such as identifying a song as rock or
classical music. The nebulous definitions and overlapping boundaries of genres make
reliable and consistent genre classification a non-trivial task for humans and computers
alike.
A musical genre is characterized by the common characteristics shared by its
members. These characteristics typically are related to the instrumentation, rhythmic
structure, and harmonic content of the music. Genre hierarchies are commonly used to
structure the large collections of music available on the web. Currently, musical genre
annotation is performed manually. Automatic music genre classification (AMGC) can
assist or replace the human user in this process and would be a valuable addition to
music information retrieval systems. In addition, AMGC provides a framework for
developing and evaluating features for any type of content-based analysis of musical
signals [1].
The need for an effective automatic means of classifying music is becoming
increasingly pressing as the number of recordings available continues to increase at a
rapid rate. It is estimated that 2,000 Compact Disks (CDs) a month are released for
wide distribution in Western countries alone. Software capable of performing
automatic classifications would be particularly useful to the administrators of the
exponentially growing networked music archives, as their success is heavily linked to
the ease with which the users can search for types of music on their sites. These sites
currently rely on manual genre classification techniques, a methodology that is slow
and inconsistent.
This project eases, as far as possible, the difficulty of classifying musical audio
pieces through an initial feature extraction stage followed by a classification
procedure, exploring both the variation of the parameters used as input and the classifier
architecture.
1.2. Motivation
Many factors make intelligent AMGC systems vital in the current scenario. The ease
of downloading and storing music files on computers, the huge availability of albums
on the internet, with free or paid downloading, peer-to-peer servers and the fact that
nowadays artists deliberately distribute their songs on their websites all make music
database management a must.
Another recent tendency is to consume music via streaming, raising the popularity of
on-line radio stations that play similar songs based on a genre preference. In addition,
browsing and searching by genre on the web, and smart playlist generation that chooses
specific tunes among gigabytes of songs on personal portable audio players, are
important tasks that facilitate music mining. As the demand for multimedia grows, the
development of information retrieval systems that include information about music is
of increasing concern. Radio stations and music television (TV) channels hold archives
of millions of music tapes. Gigabytes of music files are also spread over the web.
These facts make the manual classification of musical genres impractical.
End users are nonetheless already accustomed to browsing both physical and on-line
music collections by genre, and this approach is seemingly at least reasonably
effective, even without an automatic means of classification. The currently prevailing
manual procedures motivated us to develop an automatic and consistent system based on
feature extraction and classification techniques.
1.3. Problem Statement
As an improvement over the prevailing manual classification of musical genres, a system-
oriented approach has been applied to simplify the task. Still, the main challenge has
been temporal feature integration, the process of combining a time-series
of short-time feature vectors into a single feature vector on a larger time scale.
However, such an approach involves complex processes and often yields
inconsistent results.
In our project, we try to build a system that outputs the genre a music sample
belongs to by extracting features from the audio data, both to obtain more
meaningful information and to reduce the processing required by the classification task.
Systematic feature selection techniques are used so as to produce a system that is
robust, fast and consistent.
1.4. Objectives
Our primary objective is to develop a system that implements the automatic feature
extraction and learning / pattern classification techniques that have the important
benefit of being adaptable to a variety of other content-based (i.e. relating directly to
and only to music itself) musical analysis and classification tasks. Our objectives can
be further simplified as:
i. To develop a system that implements the machine learning algorithms for fast
and consistent classifications.
ii. To develop a system that can improve current applications that feature
music genre classification.
iii. To contribute to the creation of a more appropriate and specific music data
warehouse.
iv. To implement the principles and techniques of digital signal processing.
1.5. Scope of the work
In simple words, AMGC is the classification of a piece of music into its
corresponding genre by a computer. It is considered to be a cornerstone of the
research area Music Information Retrieval (MIR) and closely linked to the other areas
in MIR. MIR carries the scope of being a key element in the processing, searching
and retrieval of digital music in the near future.
The automatic classification of audio data according to music genres aids the creation
of music databases. It also allows the users to generate personal playlists on the fly,
where the user specifies a general description such as 80s Synth-Pop, and the software
does the actual file selection [2]. Furthermore, the features developed for automatic
music genre recognition are useful in related fields such as similarity-based searching.
1.6. Overview of the project
The first chapter of the report gives the introduction of the project which includes the
background related to the project, scope of the project, the factors that motivated us to
initiate the project as well as the objectives behind it. The second chapter of the report
deals with the literature review, which covers the details of related work done
earlier on such projects. The different theories and algorithms incorporated in the
completion of the project are dealt with in detail in chapters 3 and 4. The sixth chapter
of the report depicts the methodology behind the completion of the project. It includes
the different diagrams associated with the project, such as the Use Case Diagram, Flow
Diagram (or Flow Chart) and Activity Diagram. The eighth chapter consists of the
results and the output of the project, whereas the last chapter contains the
necessary conclusions regarding the project along with its limitations.
CHAPTER 2
2. LITERATURE REVIEW
2.1. Introduction
Music genre classification is not a new problem in the era of technological
development. Musical genre is used by retailers, libraries and people in general as a
primary means of organizing music. Anyone who has attempted to search through the
discount bins of a music store will have experienced the frustration of searching
through music that is not sorted by genre. Listeners use genres to find music that
they're looking for or to get a rough idea of whether they're likely to like a piece of
music before hearing it. The music industry, in contrast, uses genre as a key way of
defining and targeting different markets. The importance of genre in the mind of
listeners is exemplified by research indicating that the style in which a piece is
performed can influence listeners' liking for the piece of music [1, 3].
The types of features developed for a classification system could be adapted for other
types of analyses by musicologists and music theorists. Taken in conjunction with
genre classification results, the features could also provide valuable insights into the
particular attributes of different genres and what characteristics are important in
different cases. Automatic feature extraction and learning / pattern classification
techniques have the important benefit of being adaptable to a variety of other content-
based (i.e. relating directly to and only to the music itself) musical analysis and
classification tasks, such as similarity measurements in general or segmentation.
Systems could be constructed that, to give just a few examples, compare or classify
pieces based on compositional or performance style, group music based on
geographical / cultural origin or historical period, search for unknown music that a
user might like based on examples of what he or she is known to like already, sort
music based on perception of mood, or classify music based on when a user might
want to listen to it (e.g. while driving, while eating dinner, etc.). Music librarians and
database administrators could use these systems to classify recordings along whatever
lines they wished. Individual users could use such systems to sort their music
collections automatically as they grow and automatically generate play lists with
certain themes. It would also be possible for them to upload their own classification
parameters to search on-line databases equipped with the same classification software
[4].
2.2. A Study of Human Music Genre Classification
Humans are capable of performing music genre classification using the ear, the
auditory processing system and higher-level cognitive processes
in the brain. Musical genres are used among humans as a compact description which
facilitates sharing of information. For instance, the statements "I like heavy metal" or
"I can't stand classical music!" are often used to share information, and rely on
shared knowledge about the genres and their relation to society, history and musical
structure.
According to a study conducted by R. O. Gjerdingen and D. Perrott, human listeners
have a significant capability to recognize musical genres. They used ten different
genres of music, and eight sample songs for each genre were downloaded from the
web in the MPEG-1 Audio Layer 3 (MP3) format. Half of the eight songs for each style
contained vocals, and half were instrumental only. Excerpts were taken from each
song at several durations, including 475 ms, 400 ms, 325 ms and 250 ms.
The accuracy of the genre prediction for the 250 ms samples was around 40%, and the
agreement between the 250 ms and the 475 ms classifications was around
44%. The results of the study are especially interesting, since they show that it is
possible to accurately recognize musical genres without any higher-level abstractions.
But since the accuracy level is seemingly unsatisfactory, there remains considerable
room for an AMGC system [1, 4].
2.3. Related Works
Though unsupervised clustering of music collections based on similarity measures is
gaining more and more interest in the music information retrieval community, most
work related to the classification of music titles into genres is based on supervised
techniques. These methods suppose that a taxonomy of genres is given and try to
map a database of songs into it using machine learning algorithms.
Soltau et al. have compared a Hidden Markov Model (HMM) to a new classification
architecture, Explicit Time Modeling with Neural Networks, in a classification
experiment involving 360 songs distributed over 4 genres.
Tzanetakis and Cook and Li et al. have worked on a database of 1000 songs over 10
genres and have compared the use of different audio features (timbre features,
rhythmic features, pitch features, Wavelet Transform (WT)) and different classifiers
(Support Vector Machines (SVMs), Gaussian Mixtures, Linear Discriminant Analysis
(LDA), k-Nearest Neighbor (k-NN)) on time-independent chunks.
Panagakis and Kotropoulos proposed a musical genre classification framework that
considers the properties of the human auditory perception system, i.e. two-dimensional
(2D) auditory temporal modulations representing music, with genre classification based
on sparse representation.
It is observable that a lot of work is being done in the area, but most of the approaches
explore the timbre texture, the rhythmic content, the pitch content, or their
combinations.
2.4. Training and Testing Data Sets
A training set is a set of data used in various areas of information science to discover
potentially predictive relationships. Training sets are used in Artificial Intelligence
(AI), machine learning, genetic programming, intelligent systems, and statistics. In all
these fields, a training set has much the same role and is often used in conjunction
with a test set [5]. A test set is a set of data used in various areas of information
science to assess the strength and utility of a predictive relationship.
Separating data into training and testing sets is an important part of evaluating data
mining models. Typically when we separate a data set into a training set and testing
set, most of the data is used for training, and a smaller portion of the data is used for
testing. The data is sampled randomly to help ensure that the testing
and training sets are similar. By using similar data for training and testing, we can
minimize the effects of data discrepancies and better understand the characteristics of
the model. After the model has been processed by using the training set, we test the
model by making predictions against the test set. Because the data in the testing set
already contains known values for the attribute that we want to predict, it is easy to
determine whether the model's guesses are correct.
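The splitting procedure described above can be sketched in a few lines of Python. The helper below, the toy feature values and the genre labels are all illustrative assumptions made for this sketch, not part of the project's actual code; a real system would split a much larger feature database in the same way.

```python
# Hedged sketch of a random train/test split for evaluating a genre
# classifier; the feature vectors and labels here are invented toy data.
import random

def train_test_split(data, labels, test_fraction=0.2, seed=42):
    """Randomly partition (data, labels) into a training and a testing set."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(data[i], labels[i]) for i in train_idx]
    test = [(data[i], labels[i]) for i in test_idx]
    return train, test

# Toy feature vectors (two summary features per clip) and genre labels.
features = [[0.1, 2.3], [0.4, 1.9], [0.9, 0.2], [0.8, 0.3], [0.2, 2.1]]
genres = ["classical", "classical", "rock", "rock", "classical"]
train, test = train_test_split(features, genres, test_fraction=0.4)
print(len(train), len(test))  # prints: 3 2
```

Because the split is random, the training and testing sets tend to follow similar distributions, which is exactly the property the paragraph above relies on.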
2.5. Linear Discriminant Analysis
LDA and the related Fisher's linear discriminant are methods used in statistics, pattern
recognition and machine learning to find a linear combination of features which
characterizes or separates two or more classes of objects or events. The resulting
combination may be used as a linear classifier or, more commonly, for dimensionality
reduction before later classification. There are many possible techniques for
classification of data. LDA easily handles the case where the within-class frequencies
are unequal and their performances have been examined on randomly generated test
data. This method maximizes the ratio of between-class variance to the within-class
variance in any particular data set thereby guaranteeing maximal separability. The use
of LDA for data classification is applied to classification problem in speech
recognition. LDA does not change the location of the data but only tries to provide more
class separability and draw a decision region between the given classes. This method also
helps to better understand the distribution of the feature data. Figure 2.1 will be used
as an example to explain and illustrate the theory of LDA [4,5].
Figure 2.1. Figure showing data sets and test vectors in the original space
Data sets can be transformed and test vectors can be classified in the transformed
space by two different approaches.
2.5.1. Class-dependent transformation
This type of approach involves maximizing the ratio of between-class variance
to within-class variance. The main objective is to maximize this ratio so that
adequate class separability is obtained. This class-specific approach
involves using two optimizing criteria for transforming the data sets
independently.
2.5.2. Class-independent transformation
This approach involves maximizing the ratio of overall variance to within-class
variance. It uses only one optimizing criterion to transform
the data sets, and hence all data points, irrespective of their class
identity, are transformed using the same transform. In this type of LDA, each class
is considered as a separate class against all other classes.
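The variance-ratio criterion above can be illustrated with a minimal from-scratch sketch of Fisher's linear discriminant for two classes in two dimensions. The two toy point clouds, the class names "A" and "B", and the midpoint threshold are all assumptions made for this sketch; they are not drawn from the project's data.

```python
# From-scratch Fisher's linear discriminant for two 2-D classes, using
# only the standard library; the point clouds below are invented toy data.

def mean(vecs):
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(2)]

def scatter(vecs, m):
    # 2x2 within-class scatter: sum of (x - m)(x - m)^T over the class.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vecs:
        d = [v[0] - m[0], v[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

class_a = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2]]
class_b = [[3.0, 3.1], [2.8, 2.9], [3.2, 3.0], [2.9, 3.2]]

m_a, m_b = mean(class_a), mean(class_b)
s_a, s_b = scatter(class_a, m_a), scatter(class_b, m_b)
sw = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]

# Invert the 2x2 within-class scatter matrix.
det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
inv = [[sw[1][1] / det, -sw[0][1] / det],
       [-sw[1][0] / det, sw[0][0] / det]]

# Fisher direction w = Sw^-1 (m_a - m_b): projecting onto w maximizes the
# ratio of between-class to within-class variance.
dm = [m_a[0] - m_b[0], m_a[1] - m_b[1]]
w = [inv[0][0] * dm[0] + inv[0][1] * dm[1],
     inv[1][0] * dm[0] + inv[1][1] * dm[1]]

# Classify by projecting onto w and thresholding at the midpoint of the
# projected class means (a simple decision region between the classes).
mid = sum(w[i] * (m_a[i] + m_b[i]) / 2 for i in range(2))

def predict(x):
    return "A" if w[0] * x[0] + w[1] * x[1] > mid else "B"
```

Note how the data points themselves are never moved; only a projection direction and a decision threshold are computed, matching the description of LDA above.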
2.6. Support Vector Machine
In machine learning, SVMs are supervised learning models with associated
learning algorithms that analyze data and recognize patterns, and are used
for classification and regression analysis. Given a set of training examples, each
marked as belonging to one of two categories, an SVM training algorithm builds a
model that assigns new examples into one category or the other, making it a non-
probabilistic binary linear classifier [4]. An SVM model is a representation of the
examples as points in space, mapped so that the examples of the separate categories
are divided by a clear gap that is as wide as possible. New examples are then mapped
into that same space and predicted to belong to a category based on which side of the
gap they fall on.
In addition to performing linear classification, SVMs can efficiently perform a non-
linear classification using what is called the kernel trick, implicitly mapping their
inputs into high-dimensional feature spaces.
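The kernel trick described above can be illustrated with scikit-learn's `SVC`. This is a hedged sketch on synthetic XOR-style data (which no straight line can separate), not part of the project's implementation; the data, the `rbf` kernel choice, and `C=10.0` are assumptions of the example.

```python
# Minimal SVM sketch (illustrative). A radial-basis-function kernel lets the
# SVM separate classes that are not linearly separable in the input space.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# XOR-like data: label depends on the sign of the product of coordinates.
X = rng.uniform(-1, 1, (200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

clf = SVC(kernel="rbf", C=10.0)   # kernel trick: implicit high-dim mapping
clf.fit(X, y)
print(clf.score(X, y))            # training accuracy
```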
CHAPTER 3
3. FEATURE EXTRACTION
3.1. Introduction
One of the challenges in music genre recognition is to find out what it is that allows
us to differentiate between music styles. The problem is that we want to make
observations about the similarity or dissimilarity of two objects (in our case: music
clips) that are not directly comparable in many cases. To make comparison (and
therefore classification) possible, we must transform the data first in order to be able
to access the essential information contained in them, a process referred to as feature
extraction: computing a numerical representation that characterizes a segment of
audio [6,7].
Feature extraction is one of two commonly used preprocessing techniques in
classification; it means that new features are generated from the raw data by applying
one or more transformations. The other possible technique is feature selection – the
process of identifying a subset of features within the input data that can be used for
effective classification. Feature selection can be applied to the original data set or to
the output of a feature extraction process. A classification system might use both or
either of these techniques. Theoretically, it is also possible to use the raw data, if these
are already in a format suitable for classification. In reality, this is hardly ever the
case, though. The dimensionality of the datasets is often too high; the data contain a
lot of redundancy, or are generally not suited for direct comparison. This is especially
true in the area of audio signal classification, where we are dealing with long streams
of redundant, noisy signals. A schematic overview of the connection between features
selection and feature extraction is shown in Figure 3.1.
Figure 3.1. Generating a feature vector from an input data set.
3.2. Formal Notation
A feature vector (also referred to as a pattern or observation) x is a single data item
used by the classification algorithm, consisting of d measurements:
x = (x1, . . . , xd). The individual scalar components xi of the feature vector x are
called features or attributes, and the dimensionality of the feature space is denoted by
d. Each feature vector can be thought of as a point in the feature space. A pattern set
containing n elements is denoted as
X = {x1, . . . , xn}
and the i-th feature vector in X is written as
xi = (xi1, . . . , xid)
In most cases, a pattern set can be viewed as an n × d pattern matrix.
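The notation above maps directly onto a NumPy array. This small sketch (toy values, purely illustrative) shows the n × d pattern matrix, a row as one feature vector, and a single feature:

```python
# Notation sketch: n feature vectors of dimension d stored as an
# n x d pattern matrix (toy values for illustration only).
import numpy as np

n, d = 5, 3                                        # 5 patterns, 3 features each
X = np.arange(n * d, dtype=float).reshape(n, d)    # pattern matrix
x_i = X[1]                                         # the i-th feature vector (row)
x_ij = X[1, 2]                                     # single feature x_{i,j}
print(X.shape)                                     # (5, 3): n x d
```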
3.3. Feature Extraction Process
Mathematically, the feature vector xn at discrete time n can be calculated with the
function F on the signal as
xn = F(w0 sn−(N−1), ..., wN−1 sn) (3.1)
where w0, w1, ..., wN−1 are the coefficients of a window function and N denotes the
frame size. The frame size is a measure of the time scale of the feature. Normally, it is
not necessary to have xn for every value of n, so a hop size M is used between the
frames. The whole process is illustrated in Figure 3.2. In signal processing terms, the
use of a hop size amounts to downsampling the signal xn, which then only contains
the terms ..., xn−2M, xn−M, xn, xn+M, xn+2M, ....
The flow goes from the upper part of the figure to the lower part. The raw music
signal sn is shown in the first of the three subfigures. The second subfigure shows
how, at a specific time, a frame of N samples is extracted from the signal and
multiplied by the window function wn (a Hamming window). The resulting signal is
shown in the third subfigure. Notice that the resulting signal gradually decreases
towards the sides of the frame, which reduces the spectral leakage problem [8].
Figure 3.2. Illustration of the traditional feature extraction process.
Finally, F takes the resulting signal in the frame as input and returns the feature vector
xn. The function F could be e.g. the Discrete Fourier Transform (DFT) on the signal
followed by the magnitude operation on each Fourier coefficient to get the frequency
spectrum.
The window function is multiplied with the signal to avoid problems due to finite
frame size. The rectangular window with amplitude 1 corresponds to calculating the
features without a window, but has serious problems with the phenomenon of spectral
leakage and is rarely used [7,8].
In our project, the Hamming window is used for windowing. The Hamming
window has side lobes of much lower magnitude. Figure 3.3 shows the result of a
DFT on a signal with and without a Hamming window; the advantage of the
Hamming window is easily seen. The Hamming window is given by
wn = 0.54 − 0.46 cos(2πn / (N−1)) (3.2)
where n = 0, 1, 2, ..., N−1.
Figure 3.3. Illustration of the frequency spectrum of a harmonic signal with a
fundamental frequency and four overtones.
It is clearly advantageous to use a Hamming window compared to not using a window
(or a rectangular window) since it is less prone to spectral leakage.
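The framing-and-windowing process of equations (3.1) and (3.2) can be sketched in a few lines of NumPy. The test signal, frame size N and hop size M below are assumptions chosen for illustration, not the project's actual settings:

```python
# Sketch of framing with hop size M and multiplying each frame by a
# Hamming window (equation 3.2) to reduce spectral leakage.
import numpy as np

fs = 8000                                  # assumed sample rate (Hz)
t = np.arange(fs) / fs
s = np.sin(2 * np.pi * 440 * t)            # 1 s test tone

N, M = 256, 128                            # frame size and hop size
w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Hamming

frames = [w * s[i:i + N] for i in range(0, len(s) - N + 1, M)]
print(len(frames), frames[0].shape)        # number of frames, (256,)
```

The hand-written window matches NumPy's built-in `np.hamming(N)`, which a real implementation would use directly.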
A major part of the work in feature extraction for music, and especially speech
signals, is focused on short-time features. They are thought to capture essential
aspects of music such as loudness, pitch and timbre. Informally, short-time features
are extracted on a time scale of 10 to 40 ms, over which the signal is considered
(short-time) stationary.
3.4. Basic Features of an audio sample
3.4.1. Beat and Meter:
Beats give music its regular rhythmic pattern. Beats are grouped together in a
measure; the notes and rests correspond to a certain number of beats. Meter refers to
the rhythmic patterns produced by grouping together strong and weak beats.
Figure 3.4. Beat Histograms for Classical (left) and Pop (right)
3.4.2. Harmony:
In general, harmony refers to the combination of notes (or chords) played together and
the relationship between a series of chords.
3.4.3. Pitch:
The relative lowness or highness that we hear in a sound refers to its pitch. The pitch
of a sound is based on the frequency of vibration and the size of the vibrating object.
The slower the vibration and the bigger the vibrating object, the lower the pitch. For
example, the pitch of a double bass is lower than that of the violin because the double
bass has longer strings.
3.4.4. Rhythm:
It may be defined as the pattern or placements of sounds in time and beats in
music. It refers to the particular arrangement of note lengths in a piece of
music.
3.4.5. Timbre:
Timbre is generally defined as the quality which allows one to tell the
difference between sounds of the same pitch and loudness when made by
different musical instruments or voices. It depends on the spectrum, the sound
pressure, the frequency location of the spectrum, and the temporal
characteristics of the stimulus. In music, timbre is thought to be determined by
the number and relative strengths of the instrument's partials.
3.5. Mel-Frequency Cepstral Coefficients
Mel-Frequency Cepstral Coefficients (MFCCs) originate from automatic speech
recognition, where they have been used with great success. They have become
popular in the Music Information Retrieval (MIR) community, where they have been
used successfully for music genre classification and for categorization into
perceptually relevant groups such as moods and perceived complexity.
MFCCs are based on the spectral information of a sound, but are modeled to capture
the perceptually relevant parts of the auditory spectrum. The MFCCs are to some
extent created according to the principles of the human auditory system, but also to be
a compact representation of the amplitude spectrum and with considerations of the
computational complexity [9]. Existing music processing literature pointed us to
MFCCs as a way to represent time domain waveforms as just a few frequency domain
coefficients.
Figure 3.5. Illustration of the calculation of the MFCCs.
Figure 3.5 illustrates the construction of the MFCC features. The flowchart illustrates
the different steps in the calculation from raw audio signal to the final MFCC
features. There exist many variations of the MFCC implementation, but nearly all of
them follow this flowchart.
In accordance with equation (3.1), the feature extraction can be described as a
function F on a frame of the signal. After applying the Hamming window to the
frame, this function comprises the following four steps:
3.5.1. DFT
The first step is to perform the DFT on the frame. For a frame size of N, this
yields N (complex) Fourier coefficients, giving an N-dimensional spectral
representation of the frame.
3.5.2. Mel- scaling
Humans order sounds on a musical scale from low to high by the perceptual
attribute 'pitch'. The pitch of a sine tone is closely related to the physical
quantity of frequency, and to the fundamental frequency for a complex tone.
However, the pitch scale is not spaced like the frequency scale. The mel-scale
is an estimate of the relation between perceived pitch and frequency, obtained
by equating 1000 mels to a 1000 Hz sine tone at 40 dB. It is used in the
calculation of the MFCCs to transform the frequencies in the spectral
representation onto the perceptual pitch scale. Normally, the mel-scaling step
takes the form of a filterbank of (overlapping) triangular filters in the
frequency domain whose center frequencies are mel-spaced. The filterbank is
what makes MFCCs unique. It is constructed using 13 linearly spaced filters
and 27 log-spaced filters, following a common model of human auditory
perception. The distance between the centre frequencies of the linearly spaced
filters is 133.33 Hz; the log-spaced filters are separated by a factor of 1.071 in
frequency. A standard filterbank is illustrated in Figure 3.6. The mel-scaling
step thus also smooths the spectrum and reduces the dimensionality of the
feature vector.
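A common closed-form approximation of the mel-scale (one of several variants in the literature; the 13-linear plus 27-log filterbank in the text is a closely related construction) is mel = 2595 log10(1 + f/700). A small sketch, with the 40-filter span below chosen only for illustration:

```python
# Sketch of the mel-scale mapping used in the mel-scaling step.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Center frequencies for 40 triangular filters spanning 0..8000 Hz,
# equally spaced on the mel scale (assumed illustrative parameters).
centers_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 40)
centers_hz = mel_to_hz(centers_mel)
print(round(hz_to_mel(1000.0), 1))   # close to 1000 mels, by construction
```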
3.5.3. Log-scaling
Similar to pitch, humans order sound from soft to loud by the perceptual
attribute 'loudness'. Perceptual loudness corresponds quite closely to the
physical measure of intensity. Although other quantities, such as frequency,
bandwidth and duration, affect the perceived loudness, it is common to relate
loudness directly to intensity. The relation is often approximated as
L ∝ I^0.3
where L is the loudness and I is the intensity (Stevens' power law). It has been
argued that perceptual loudness can also be approximated by the logarithm of
the intensity, although this is not quite the same as the power law above. This
is a perceptual motivation for the log-scaling step in the MFCC extraction.
Another motivation for log-scaling in speech analysis is that it can be used to
deconvolve the slowly varying modulation from the rapid excitation at the
pitch period.
3.5.4. Discrete Cosine Transform
As the last step, the discrete cosine transform (DCT) is used as a
computationally inexpensive method to de-correlate the mel-spectral log-
scaled coefficients. The basis functions of the DCT have been found to be
quite similar to the eigenvectors of a Principal Component Analysis (PCA) on
music, which suggests that the DCT can indeed be used for de-correlation. As
illustrated in Figure 4.2, the assumption of de-correlated MFCCs is, however,
doubtful. Normally, only a subset of the DCT basis functions is used, and the
result is then an even lower dimensional feature vector of MFCCs [9,10].
Figure 3.6. Illustration of the filterbank/matrix which is used to convert the
linear frequency scale into the logarithmic mel-scale in the calculation of the
MFCCs. The filters are seen to be overlapping and have logarithmic increase
in bandwidth.
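The four steps above can be sketched end to end for a single frame. This is an illustrative reconstruction, not the project's code: the FFT length, the toy triangular filterbank, and the choice of 13 coefficients are assumptions of the sketch.

```python
# MFCC sketch for one frame: DFT -> mel filterbank -> log -> DCT.
import numpy as np
from scipy.fft import dct

def mfcc_frame(frame, filterbank, n_coeffs=13):
    spectrum = np.abs(np.fft.rfft(frame, n=1024))      # step 1: DFT magnitude
    mel_energies = filterbank @ spectrum               # step 2: mel scaling
    log_energies = np.log(mel_energies + 1e-10)        # step 3: log scaling
    return dct(log_energies, type=2, norm="ortho")[:n_coeffs]  # step 4: DCT

# Toy triangular filterbank: 32 filters over the 513 rfft bins (placeholder
# shape only; a real filterbank is mel-spaced as described above).
fb = np.maximum(0.0, 1.0 - np.abs(
    np.linspace(0, 31, 513)[None, :] - np.arange(32)[:, None]))

frame = np.hamming(400) * np.random.default_rng(2).normal(size=400)
coeffs = mfcc_frame(frame, fb)
print(coeffs.shape)   # (13,)
```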
CHAPTER 4
4. CLASSIFICATION
4.1. Introduction
The feature extractor, as discussed in Chapter 3, computes feature vectors
representing the data to be classified. These feature vectors are then used to assign
each object to a specific category. This is the classification part, which constitutes the
second basic building block of a music genre recognition system.
Classification is a subfield of decision theory. It relies on the basic assumption that
each observed pattern belongs to a category, which can be thought of as a prototype
for the pattern. Regardless of the differences between the individual patterns, there is
a set of features that are similar in patterns belonging to the same class, and different
between patterns from different classes. These features can be used to determine class
membership.
Music can be of arbitrary complexity, and songs from one genre differ in many ways.
Still, humans are able to categorize them easily. This seems to support our assumption
that there are certain fundamental properties shared by pieces belonging to one genre.
Classification can also be understood by approaching it in geometrical terms. As
stated before, the feature vectors can be regarded as points in feature space. The goal
of the classifier is to find decision boundaries that partition the feature space into
regions that correspond to the individual classes. New data items are then classified
based on what region they lie in. This depends on a feature representation of the data
in which feature vectors from the same category can easily be distinguished from
feature vectors from other categories [11].
4.2. Domain Independence
Finding a good feature representation requires in-depth knowledge of the data and
context; feature extractors must be adapted to the specific problem and are highly
domain-dependent. Classification techniques, on the other hand, are basically domain-
independent. This can be explained by the fact that feature extraction is also an
abstraction step, transforming domain-specific data into a more general numerical
representation that can be processed by a generic classifier.
The feature extraction part is where knowledge of music, psychoacoustics, signal
processing, and many other fields is required; it is an area that has only recently
started to receive the attention it deserves, and there is a limited basis of previous
work to build on. Classification, on the other hand, is an advanced field that has been
studied for many years, and that provides us with many fast, elegant and well-
understood solutions that can be adopted for use in music genre recognition.
4.3. Difficulties
The main difficulty in classification arises from the fact that in addition to the
dissimilarities caused by the different underlying models, the feature values for
objects belonging to the same category often also vary considerably. If all objects
from one class were perfectly equal, classification would be trivial, but such is not the
case. The classifier never sees the actual data, only the feature vectors. Therefore, the
following is equally true: A feature representation that extracts exactly the
information that differentiates the categories would also eliminate the need for a
complex classification step. Likewise, a perfect classifier would not need any feature
extraction at all, but would be able to uncover the true class membership from the raw
data. In reality, neither feature extractors nor classifiers are perfect, but may be
combined to produce working results.
The variation in patterns belonging to the same category can be due to two factors:
First, the underlying model might generate that complexity: A relatively simple model
can create seemingly random output, which cannot trivially be detected by an
observer who does not know the model. Secondly, considerable variation can be
caused by noise. Noise can be defined as any property of the pattern that is not due to
the true underlying model but instead to randomness in the world or the sensors. As is
obvious from this definition, noise is present in all objects in nature.
The challenge is to distinguish the two kinds of differences between feature values:
are they caused by different models, which means that the objects belong to different
categories, or are they due to noise or the complexity of the model, meaning that the
objects belong to the same category?
4.4. Training and Learning
Creating a classifier usually means specifying its general form, and estimating its
unknown parameters through training. Training can be defined as the process of using
sample data to determine the parameter settings of the classifier, and is essential in
virtually all real-world classification systems.
Classification is often called supervised learning. Supervised learning consists of
using labeled feature vectors to train classifiers that automatically assign class labels
to new feature vectors. Another variant of learning is unsupervised learning or
clustering, which does not use any label information; the system tries to form natural
groupings. Reinforcement learning, a third type, refers to a technique where the
feedback to the system is only right or wrong – no information about the correct result
is given in case of a wrong categorization [10,11].
4.5. Model development
Classification is a form of data analysis that extracts models describing important data
classes. Such models, called classifiers, predict categorical (discrete, unordered) class
labels. The first step in the classification task is to construct a classification
model; in the classification step, the model is then used to predict class
labels for the given data. Among the many prevailing classification schemes, we have
used the Gaussian Mixture Model (GMM) to construct the classifier for our project.
4.5.1. Gaussian Mixture Model
A GMM is a parametric probability density function represented as a weighted sum of
Gaussian component densities. A GMM is a probabilistic model that assumes all the
data points are generated from a mixture of a finite number of Gaussian distributions
with unknown parameters. GMMs are commonly used as a parametric model of the
probability distribution of continuous measurements or features in a biometric system,
such as vocal-tract related spectral features in a speaker recognition system. GMM
parameters are estimated from training data using the iterative Expectation-
Maximization (EM) algorithm or Maximum A Posteriori (MAP) estimation from a
well-trained prior model. The EM algorithm is used for fitting mixture-of-Gaussians
models; one can also draw confidence ellipsoids for multivariate models and compute
the Bayesian Information Criterion to assess the number of clusters in the data.
A Gaussian mixture model is a weighted sum of M component Gaussian densities, as
given by equation (4.1),
p(x|λ) = Σ (i=1..M) wi g(x|µi, Σi) (4.1)
where x is a D-dimensional continuous-valued data vector (i.e. measurements or
features), wi, i = 1, . . . , M, are the mixture weights, and g(x|µi, Σi), i = 1, . . . , M,
are the component Gaussian densities. Each component density is a D-variate
Gaussian function of the form as expressed by the equation 4.2,
g(x|µi, Σi) = (1 / ((2π)^(D/2) |Σi|^(1/2))) exp{ −(1/2) (x − µi)′ Σi^(−1) (x − µi) } (4.2)
with mean vector µi and covariance matrix Σi. The mixture weights satisfy the
constraint Σ (i=1..M) wi = 1.
The complete Gaussian mixture model is parameterized by the mean vectors,
covariance matrices and mixture weights of all component densities. These
parameters are collectively represented by the notation of equation 4.3,
λ = {wi, µi, Σi}, i = 1, . . . , M. (4.3)
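Equations (4.1) to (4.3) translate directly into a small density-evaluation routine. This is an illustrative sketch with toy parameters (M = 2 components, D = 2, diagonal covariances), not the project's implementation:

```python
# Sketch evaluating the GMM density p(x|lambda) of equations (4.1)-(4.3)
# for diagonal covariance matrices.
import numpy as np

def gmm_pdf(x, weights, means, variances):
    """p(x|lambda) = sum_i w_i g(x|mu_i, Sigma_i), diagonal Sigma_i."""
    D = x.shape[0]
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.prod(var)))
        expo = -0.5 * np.sum((x - mu) ** 2 / var)
        total += w * norm * np.exp(expo)
    return total

weights = np.array([0.4, 0.6])                 # must sum to 1
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])
print(gmm_pdf(np.zeros(2), weights, means, variances))
```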
There are several variants on the GMM shown in equation (4.3). The covariance
matrices, Σi, can be full rank or constrained to be diagonal. Additionally, parameters
can be shared, or tied, among the Gaussian components, such as having a common
covariance matrix for all components. The choice of model configuration (number of
components, full or diagonal covariance matrices, and parameter tying) is often
determined by the amount of data available for estimating the GMM parameters and
by how the GMM is used in a particular biometric application. It is also important to
note that because the component Gaussians act together to model the overall feature
density, full covariance matrices are not necessary even if the features are not
statistically independent. A linear combination of diagonal-covariance Gaussians is
capable of modeling the correlations between feature vector elements; the effect of a
set of M full-covariance Gaussians can be equally obtained with a larger set of
diagonal-covariance Gaussians. GMMs are often
used in biometric systems, most notably in speaker recognition systems, due to their
capability of representing a large class of sample distributions. One of the powerful
attributes of the GMM is its ability to form smooth approximations to arbitrarily
shaped densities. The classical uni-modal Gaussian model represents feature
distributions by a position (mean vector) and an elliptic shape (covariance matrix),
while a vector quantizer (VQ) or nearest-neighbor model represents a distribution by
a discrete set of characteristic templates. A GMM acts as a hybrid between these two
models by using a discrete set of Gaussian functions, each with its own mean and
covariance matrix, allowing better modeling capability [12].
The use of a GMM for representing feature distributions in a biometric system may
also be motivated by the intuitive notion that the individual component densities may
model some underlying set of hidden classes. For example, in speaker recognition, it
is reasonable to assume that the acoustic space of spectrally related features
corresponds to a speaker's broad phonetic events, such as vowels, nasals or fricatives.
These acoustic classes reflect general speaker-dependent vocal tract configurations
that are useful for characterizing speaker identity. The spectral shape of the i-th
acoustic class can in turn be represented by the mean µi of the i-th component
density, and variations of the average spectral shape by the covariance matrix Σi.
Because all the features used to train the GMM are unlabeled, the acoustic classes are
hidden, in that the class of an observation is unknown.
4.5.2. Parameter Estimation
In most classification problems, the conditional densities are not known. However, in
many cases, a reasonable assumption can be made about their general form. This
makes the problem significantly easier, since we need only estimate the parameters of
the functions, not the functions themselves. The unknown probability densities are
usually estimated in a training process, using sample data. For instance it might be
assumed that p(x|wi) is a normal density. We then need to find the values of the mean
µ and the covariance Σ.
Maximum-Likelihood Parameter Estimation
Given training vectors and a GMM configuration, we wish to estimate the parameters
of the GMM, λ, which in some sense best matches the distribution of the training
feature vectors. There are several techniques available for estimating the parameters
of a GMM. By far the most popular and well-established method is Maximum
Likelihood (ML) estimation, and we have adopted this method in our
project [11,12]. The aim of ML estimation is to find the model parameters which
maximize the likelihood of the GMM given the training data. For a sequence of T
training vectors X = {x1, . . . , xT}, the GMM likelihood, assuming independence
between the vectors, can be written as
p(X|λ) = Π (t=1..T) p(xt|λ) (4.4)
Unfortunately, this expression is a non-linear function of the parameters λ, and direct
maximization is not possible. However, ML parameter estimates can be obtained
iteratively using a special case of the EM algorithm.
On each EM iteration, the following re-estimation formulas are used, which guarantee
a monotonic increase in the model's likelihood value:
Mixture weights: wi = (1/T) Σ (t=1..T) Pr(i|xt, λ) (4.5)
Means: µi = (Σ (t=1..T) Pr(i|xt, λ) xt) / (Σ (t=1..T) Pr(i|xt, λ)) (4.6)
Variances: σi² = (Σ (t=1..T) Pr(i|xt, λ) xt²) / (Σ (t=1..T) Pr(i|xt, λ)) − µi² (4.7)
where Pr(i|xt, λ) is the a posteriori probability of component i given xt, and σi², xt,
and µi refer to arbitrary elements of the vectors σi², xt, and µi, respectively.
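One EM re-estimation pass can be sketched in NumPy. This is an illustrative diagonal-covariance implementation of the update step, not the project's code; the synthetic data, the starting parameters, and the variance floor are all assumptions of the sketch.

```python
# One EM re-estimation pass for a diagonal-covariance GMM: E-step computes
# the posteriors Pr(i|x_t, lambda), M-step applies the weight, mean and
# variance updates.
import numpy as np

def em_step(X, w, mu, var):
    """X: (T, D) data; w: (M,); mu, var: (M, D). Returns updated parameters."""
    T, D = X.shape
    # E-step: posterior of each component for each frame.
    log_g = (-0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2)
             - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1))
    resp = w * np.exp(log_g)
    resp /= resp.sum(axis=1, keepdims=True)          # (T, M)
    # M-step re-estimation.
    Nk = resp.sum(axis=0)                            # effective counts
    w_new = Nk / T
    mu_new = (resp.T @ X) / Nk[:, None]
    var_new = (resp.T @ (X ** 2)) / Nk[:, None] - mu_new ** 2
    return w_new, mu_new, np.maximum(var_new, 1e-6)  # floor the variances

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
w, mu, var = np.full(2, 0.5), np.array([[0., 0.], [4., 4.]]), np.ones((2, 2))
for _ in range(10):
    w, mu, var = em_step(X, w, mu, var)
print(np.round(mu, 1))   # component means move onto the two clusters
```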
CHAPTER 5
5. REQUIREMENT ANALYSIS
5.1. Functional Requirements
Functional requirements of the system are as follows:
The system should be able to add new genres for classification.
The system should be able to take training data (audio files) to train the system.
From the supplied valid training data, the system should be able to generate a proper
model.
The system should be able to classify the input test file into an appropriate genre.
The system should provide satisfactory results.
5.2. Non-functional Requirements
5.2.1. Performance
Since we have used Python for scripting, list processing and array
manipulation become extremely easy and fast. Also, the availability of the
NumPy and SciPy packages for Python makes vectorized computation
available, so array and matrix processing can be carried out faster.
5.2.2. Accuracy
Compared to other methods of model generation and testing, the GMM is
found to be relatively more reliable and accurate, and our implementation
also supports this observation [13].
5.2.3. Reliability
Over time, gradual change in every aspect is expected, including music.
Hence, to maintain the reliability of the system as musical tastes change,
the templates can be updated [13,14].
CHAPTER 6
6. METHODOLOGY
6.1. Introduction
Methodology is the analysis of the tasks to be done in order to obtain the desired
output. An appropriate methodology largely determines the success of a project, and
vice versa. For this system, a number of methodologies were considered and the most
efficient ones were used. This does not mean that one particular method was used;
the most appropriate methods were used in combination.
6.2. Various system diagrams and descriptions:
Flow Chart: Figure 6.2 illustrates the flow of control and the task division that
our project encompasses.
Figure 6.2. Flow chart of the system
Our system consists of three basic blocks which are explained briefly below.
1. Pre-processing block (vectorizing block): In the pre-processing block, input audio
files are vectorized. Initially, audio files are sampled at 44100 Hz, the standard
sampling frequency of a WAV audio file. The resulting samples are then framed
using a Hamming window. A window length of 160 was used for framing, and
successive frames overlapped the previous frame by 30%. This reduces the chance
of missing an important characteristic feature of a song. Finally, silent zones are
removed from the audio signal: frames with zero energy and frames with energy
below a threshold are discarded, and only frames with sufficient energy are passed
on for further processing. The threshold energy is determined by taking the median
of the energy values over the whole framed data. 160-dimensional vectors are
obtained as the output of this preprocessing block.
2. Feature extraction block: The output of the preprocessing block is the input to
this block. First, the 160-dimensional vectors are transformed to the frequency
domain using a Fast Fourier Transform (FFT) of length 1024. The data are then
passed through mel-filter banks; 32 mel-filter banks are used, generating (32, n)-
dimensional data, where n represents the number of observations (frames)
obtained from the pre-processing block. The resulting data are the MFCCs, which
are carried forward for further processing: for model generation in the training
phase, and for computation of the maximum-likelihood value in the testing phase.
3. Model generation block: This is an important block in our system. The extracted
feature vectors (MFCCs) are used for model generation. MFCCs obtained from a
number of music files of known genre are fit to 8- or 16-component GMMs, as per
requirement. For model generation, the parameters, i.e. the mixture weights,
means and covariance matrices, must first be initialized. This is done through
k-means clustering. According to the required number of mixture components,
k 32-dimensional means are randomly initialized (we have n 32-dimensional
observations) and clustering is carried out. The clustered data are then used to
compute the mixture weights and covariance matrices for parameter initialization
in the GMM. Starting from the initialized parameters, the values of all of these
parameters are updated in each iteration as the structure of the Gaussians changes.
After the complete optimization (maximization), models are generated for each
genre. The generated models are then stored and later used for testing. The
models are generated through the GMM model generation method via the EM
algorithm.
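The three blocks above can be sketched end to end. This is an illustrative reconstruction, not the project's actual code: the helper names (`preprocess`, `extract_mfcc`), the toy filterbank, and the random stand-in signal are assumptions, and scikit-learn's `GaussianMixture` (which also uses k-means initialization followed by EM) stands in for the project's own implementation. Parameters follow the text: window length 160, 30% overlap, FFT length 1024, 32 mel filters, 8 Gaussian components.

```python
# Illustrative end-to-end sketch of the three blocks described above.
import numpy as np
from sklearn.mixture import GaussianMixture  # k-means init + EM internally

def preprocess(signal, N=160, overlap=0.3):
    """Block 1: frame with a Hamming window and drop low-energy frames."""
    hop = int(N * (1 - overlap))
    w = np.hamming(N)
    frames = np.array([w * signal[i:i + N]
                       for i in range(0, len(signal) - N + 1, hop)])
    energy = np.sum(frames ** 2, axis=1)
    return frames[energy > np.median(energy)]        # silence removal

def extract_mfcc(frames, fb):
    """Block 2: FFT of length 1024, then a 32-filter mel filterbank."""
    spectra = np.abs(np.fft.rfft(frames, n=1024, axis=1))
    return np.log(spectra @ fb.T + 1e-10)            # (n_frames, 32)

# Block 3: fit one GMM per genre on the extracted features.
rng = np.random.default_rng(4)
fb = np.maximum(0.0, 1.0 - np.abs(                   # toy placeholder filterbank
    np.linspace(0, 31, 513)[None, :] - np.arange(32)[:, None]))
signal = rng.normal(size=44100)                      # stand-in for one song
feats = extract_mfcc(preprocess(signal), fb)
model = GaussianMixture(n_components=8, covariance_type="diag").fit(feats)
print(model.score(feats))                            # average log-likelihood
```

At test time, the genre whose stored model gives the highest likelihood for the test file's features would be chosen.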
Activity Diagram: The activity flow of the various components of our system is
depicted in Figures 6.3a and 6.3b.
Figure 6.3a. Activity diagram of system training
Figure 6.3b. Activity diagram for testing
Figure 6.3a illustrates the various activities carried out while the system is trained
using the training data set, whereas Figure 6.3b shows the activities related to testing
an audio file.
6.3. Project Tools
Programming Languages: Python 2.7, Matlab R2010a
Drawings and diagrams: Visio, Argo UML
Documentation: MS-WORD
Platform: Windows
IDE: Pycharm
6.3.1. Why MATLAB and Python?
Matrix Laboratory (MATLAB) is a multi-paradigm numerical
computing environment and fourth-generation programming language. MATLAB
allows matrix manipulations, plotting of functions and data, implementation of
algorithms, creation of user interfaces, and interfacing with programs written in other
languages, including C, C++, Java, and Fortran. It provides mathematical functions
for linear algebra, statistics, Fourier analysis, filtering, optimization, numerical
integration, and solving ordinary differential equations; built-in graphics for
visualizing data and tools for creating custom plots; and development tools for
improving code quality, maintainability and performance. The availability of these
built-in functions and libraries made it easier to carry out the simulation phase of our
project: we could devote the time saved from hand-coding to other research work,
and test results for models generated with different lengths and sizes of training data
could be determined quickly.
Python, on the other hand, provides similar features through the NumPy and SciPy
libraries. This pair of libraries provides array and matrix structures, linear algebra
routines, numerical optimization, random number generation, statistics routines,
differential equation modeling, Fourier transforms and signal processing, image
processing, sparse and masked arrays, spatial computation, and numerous other
mathematical routines. Together, they cover most of MATLAB's basic functionality
and parts of many of its toolkits, and include support for reading and writing
MATLAB files. Python also allows one to easily leverage object-oriented and
functional design patterns. Just as different
problems call for different ways of thinking, different problems call for different
programming paradigms. There is no doubt that a linear, procedural style is natural
for many scientific problems. However, an object-oriented style that builds on classes
with internal functionality and external behavior is a natural fit for others, and classes
in Python are full-featured and practical. Functional programming, which builds on
the power of iterators and functions-as-variables, makes many programming
solutions concise and intuitive. In Python, everything can be passed around as an
object, including functions, class definitions, and modules; iterators are a key
language component, and Python comes with a full-featured iterator library. While
Python does not go as far in any of these categories as flagship paradigm languages
such as Java, it does allow one to use some very practical tools from each. These
features combine to make the language very flexible for problem solving, one key
reason for its popularity. The ease of balancing high-level programming with
low-level optimization is a particular strong point of Python. However, as with most
high-level languages, we often sacrifice code speed for programming speed. In this
context, speeding code up means vectorizing an algorithm to work with arrays of
numbers instead of single numbers, thus reducing the overhead of the language by
relying on optimized array operations.
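The vectorization idea described above can be sketched in a few lines of NumPy. The array sizes and the windowing formula here are purely illustrative, not taken from the project's code:

```python
import numpy as np

# Hypothetical frame of audio samples
samples = np.linspace(-1.0, 1.0, 1000)
n = len(samples)

# Loop version: apply a Hamming-style taper sample by sample
tapered_loop = []
for i, s in enumerate(samples):
    w = 0.54 - 0.46 * np.cos(2 * np.pi * i / (n - 1))
    tapered_loop.append(s * w)
tapered_loop = np.array(tapered_loop)

# Vectorized version: one array expression over the whole frame
i = np.arange(n)
window = 0.54 - 0.46 * np.cos(2 * np.pi * i / (n - 1))
tapered_vec = samples * window

# Both produce the same values; the vectorized form pushes the
# per-element work into NumPy's optimized inner loops.
assert np.allclose(tapered_loop, tapered_vec)
```

The two results are numerically identical; the difference is only where the per-element loop runs, which is exactly the speed gap discussed above.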
6.3.2. Pycharm as IDE
PyCharm is an Integrated Development Environment (IDE) used for programming
in Python. It provides code analysis, a graphical debugger, an integrated unit tester,
Version Control System (VCS) integration, and support for web development
with Django. It is cross-platform, running on Windows, Mac OS X and Linux.
CHAPTER 7
7. OUTPUT
Testing carried out with 8 Gaussian components and 5-second test files produced the output
summarized in the table below. Different lengths of training data (30, 60 and 90 seconds)
were used to generate the model for each genre with 8 Gaussian components.
Training data length | Average accuracy (%) | Precision (%) | Recall (%) | Error (%)
30 seconds           | 88.57                | 73.43         | 69.37      | 11.43
60 seconds           | 88.00                | 72.25         | 68.07      | 12.00
90 seconds           | 90.15                | 73.32         | 75.96      |  9.85
Table 7.1. 8-components for 5 sec test data
The output of testing carried out with 8 Gaussian components and 10-second test data
is summarized in the table below.
Training data length | Average accuracy (%) | Precision (%) | Recall (%) | Error (%)
30 seconds           | 89.38                | 70.86         | 74.13      | 10.62
60 seconds           | 90.15                | 73.21         | 76.76      |  9.85
90 seconds           | 90.00                | 72.35         | 74.72      | 10.00
Table 7.2. 8-components for 10 sec test data
The output of testing carried out with 16 Gaussian components and 5-second test data
from each test file is summarized in the table below.
Training data length | Average accuracy (%) | Precision (%) | Recall (%) | Error (%)
30 seconds           | 89.54                | 71.50         | 75.63      | 10.46
60 seconds           | 91.08                | 75.95         | 77.85      |  8.92
90 seconds           | 91.23                | 76.47         | 78.40      |  8.77
Table 7.3. 16-components for 5 sec test data
The output of testing carried out with 16 Gaussian components and 10-second test data
is summarized in the table below.
Training data length | Average accuracy (%) | Precision (%) | Recall (%) | Error (%)
30 seconds           | 89.85                | 72.29         | 77.11      | 10.15
60 seconds           | 90.62                | 74.63         | 77.07      |  9.38
90 seconds           | 91.23                | 76.12         | 78.57      |  8.77
Table 7.4. 16-components for 10 sec test data
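The accuracy, precision and recall figures reported in the tables above can be derived from a confusion matrix. The following is a minimal sketch using a hypothetical 3-genre confusion matrix, not the project's actual counts:

```python
import numpy as np

# Hypothetical 3-genre confusion matrix (rows: true class, columns:
# predicted class). These counts are illustrative only.
cm = np.array([[18,  1,  1],
               [ 2, 15,  3],
               [ 1,  4, 15]], dtype=float)

accuracy  = np.trace(cm) / cm.sum()        # correct / total
precision = np.diag(cm) / cm.sum(axis=0)   # per predicted class
recall    = np.diag(cm) / cm.sum(axis=1)   # per true class
error     = 1.0 - accuracy

print("accuracy = %.4f" % accuracy)
print("macro precision = %.4f" % precision.mean())
print("macro recall    = %.4f" % recall.mean())
```

Macro-averaging (the unweighted mean over classes) is one common convention; whether the project's tables used macro or weighted averaging is not stated, so this sketch only shows the mechanics.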
Different lengths of training data were used to generate models. With different models used
for testing, variation in the output was expected, and such variation was indeed observed as
the model was changed. From the tables of results above, the models generated with 8
components from 90 seconds of training data proved more reliable than those built from 30
or 60 seconds of training data. Also, compared to the 8-component Gaussian model, the
16-component Gaussian model was found to be more effective, as illustrated by Figure 8.1.
Figure 7.1. Classification using 16 Gaussian component model of 90s train data for 5s
test data
Out of the five genres (rock, pop, hip-hop, classic, blues) undertaken for analysis, rock music
exhibited the most distinct features (musical timbre, rhythm), which made recognition of rock
music easier than that of the other genres. Tests for the other genres (blues, jazz, classic)
gave satisfactory results. In the case of pop music, however, classification accuracy was
below 60%, possibly because of the resemblance of pop music to rock music.
CHAPTER 8
8. RESULT AND ANALYSIS
From the results, it was found that higher accuracy was obtained with the
16-component Gaussian model generated from 60 seconds of training data than with
the other combinations. Also, test data 5 seconds long was determined to be
appropriate for testing: longer test data would consume more time without a
corresponding increase in classification accuracy.
MFCCs were found to be more effective than cepstral coefficients computed by other
methods, such as linear predictive coding, because MFCCs more closely model the
frequency response and hearing characteristics of the human ear.
GMM, on the other hand, is a soft clustering method based on a probabilistic model.
By assigning probability weights to the different components through the EM
algorithm, Gaussian models are generated that represent the different characteristics
of the music. Using the mixture weights, means, and covariance matrices computed
from the training data in the maximum-likelihood (ML) computation, relatively
accurate and reliable results can be obtained.
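The EM fitting step can be illustrated with a toy sketch: a one-dimensional, two-component mixture fitted to synthetic data. The real system fits multivariate mixtures over MFCC vectors, so everything below (the data, the initial values, the component count) is illustrative only:

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic 1-D "features" drawn from two Gaussians; stand-ins for
# MFCC values (the real system uses multivariate mixtures).
x = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(3.0, 1.0, 700)])

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Initial guesses for mixture weights, means and variances
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibility of each component for each sample
    dens = np.stack([w[k] * gauss(x, mu[k], var[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)
    # M-step: re-estimate weights, means and variances
    nk = resp.sum(axis=1)
    w = nk / len(x)
    mu = (resp * x).sum(axis=1) / nk
    var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk

print(sorted(np.round(mu, 1)))  # component means settle near the true -2.0 and 3.0
```

Classification then amounts to evaluating the fitted likelihood of a test feature sequence under each genre's mixture and picking the genre with the highest value.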
Figure 8.1 Plot of accuracy obtained for different length of test data with different
models
As discussed in the previous chapter, the 16-component Gaussian models generated
from 60-second and 90-second training data produced more accurate output than the
8-component models. However, generating and testing with the 16-component
models consumed more time: the additional components introduce more
computation, so the extra time consumption is to be expected. Models with 32
Gaussian components would make the computations more complex still and would
consume even more testing time. Considering time consumption, complexity and
accuracy together, building the system around 16-component models is more suitable
than introducing more components for marginally higher accuracy.
We performed training and testing separately in both MATLAB and Python, and the
following results were observed:
Process | MATLAB      | Python
Train   | 25 minutes  | 2 minutes
Test    | 193 seconds | 0.03 seconds
Table 8.1 Time taken for training and testing
The table above shows the observed training and testing times. Training a
16-component model from 90 seconds of training data taken from each of 20 songs
took about 25 minutes in MATLAB, whereas it took about 2 minutes in Python.
Likewise, testing a song from a 5-second clip took 193 seconds in MATLAB but
0.03 seconds in Python. The main reason behind this difference is likely the method
of computation: in Python, the computations were carried out by vectorizing the data
matrix.
CHAPTER 9
9. CONCLUSION AND FURTHER ENHANCEMENT
9.1. Conclusion
With all the effort invested in the project, there is reason to believe that by the end of
this semester the project is in much better shape and considerably closer to practical
acceptance than it was. We summarize the progress with respect to the main
objectives of the project, namely accuracy and consistency.
Accuracy: This was the main obstacle for the project. The papers we followed
considered 4 different genres (rock, pop, classic and jazz) for system
development and performance evaluation, and obtained about 80% average
classification accuracy. We were able to obtain a satisfactory result in the
recognition of 5 different genres (rock, pop, classic, jazz and blues).
Consistency: Consistency was also a challenging factor for this project. The
need to reduce inconsistent results made it difficult to balance accuracy
against consistency. However, through the data mining techniques used, we
were able to improve the consistency of the system's output.
9.2. Limitations
Our system has the following limitations.
- We have undertaken only five genres for classification.
- Music genre depends not only on rhythm but also on the way the instruments are
played and on how an artist sings a song. So, classification can never be one
hundred percent accurate.
- Models were generated using 30-, 60- and 90-second training data from each song,
with 20 songs used per model. Higher classification accuracy could be obtained by
using more training data for model generation; however, the computational
complexity and time consumption are major drawbacks.
- Music characteristics change over time: the tempo, rhythm and vocal
characteristics vary within a song. If the data is not chosen properly for testing,
the classification result may be incorrect.
- A longer clip could be considered for testing to obtain a more accurate result, but
again, the added computational complexity and time consumption cause a lag in
performance.
- We have not clustered music of different genres into their respective directories.
9.3. Further Enhancements
There are great opportunities to enhance this project in the future. A few of the
possible future enhancements are listed below.
- The GUI can be made more user-friendly and attractive.
- At present, the application classifies/recognizes music of only five genres. In the
future, the number of genres considered for classification can be increased.
- Clustering of music files into different directories can be carried out as a future
enhancement of the application.
- More effective classification models can be generated using more training data
(i.e. more than 20 songs) when a large amount of data is available.
With the passage of time, musical taste, and with it the characteristics of a genre, may
change gradually. To take this factor into account, the system can be modified to
update the templates, that is, the generated classification models, so as to maintain
the performance of the system and keep it operable.
10. REFERENCES
[1] Douglas A. Reynolds and Richard C. Rose, "Robust Text-Independent Speaker
Identification Using Gaussian Mixture Speaker Models," January 1995.
[2] Karin Koshina, "Music Genre Recognition," 2002.
[3] Michael Haggblade, Yang Hong and Kenny Kao, "Music Genre Classification."
[4] Tom Diethe, Gabi Teodoru, Nick Furl and John Shawe-Taylor, "Sparse Multiview
Methods for Classification of Musical Genre from Magnetoencephalography
Recordings."
[5] Cory McKay, "Issues in Automatic Musical Genre Classification."
[6] Mohit Rajani and Luke Ekkizogloy, "Supervised Learning in Genre Classification."
[7] Mandel and Ellis, "Song-Level Features and Support Vector Machines for Music
Classification."
[8] Muralidhar Talupur, Suman Nath and Hong Yan, "Classification of Music Genre."
[9] Antonio Jose Homsi Goulart, Rodrigo Capobianco Guido and Carlos Dias Maciel,
"Exploring Different Approaches for Music Genre Classification," March 2012.
[10] Pedro Domingos, "Structured Machine Learning: Ten Problems for the Next Ten
Years."
[11] Nicolas Scaringella and Giorgio Zoia, "On the Modeling of Time Information for
Automatic Genre Recognition Systems in Audio Signals."
[12] George Tzanetakis, Georg Essl and Perry Cook, "Automatic Musical Genre
Classification of Audio Signals."
[13] Sam Clark, Danny Park and Adrien Guerard, "Music Genre Classification Using
Machine Learning Techniques," May 2012.
[14] Shumeet Baluja, Vibhu O. Mittal and Rahul Sukthankar, "Applying Machine
Learning for High Performance Named-Entity Extraction," November 2000.
APPENDICES
11. APPENDIX A. WINDOW FUNCTION AND WINDOWING
In signal processing, a window function (also known as an apodization function or tapering
function) is a mathematical function that is zero-valued outside of some chosen interval. For
instance, a function that is constant inside the interval and zero elsewhere is called
a rectangular window, which describes the shape of its graphical representation. When
another function or waveform/data-sequence is multiplied by a window function, the product
is also zero-valued outside the interval: all that is left is the part where they overlap, the
"view through the window".
Applications of window functions include spectral analysis, filter design, and beamforming.
In typical applications, the window functions used are non-negative, smooth, "bell-shaped"
curves, though rectangular, triangular, and other functions can be used.
A more general definition of window functions does not require them to be identically zero
outside an interval, as long as the product of the window and its argument is square-
integrable and, more specifically, the function goes sufficiently rapidly toward zero.
Major applications of window functions include the design of finite impulse response
filters and spectral analysis.
SPECTRAL ANALYSIS
The Fourier transform of the function cos ωt is zero, except at frequency ±ω. However, many
other functions and waveforms do not have convenient closed form transforms. Alternatively,
one might be interested in their spectral content only during a certain time period.
In either case, the Fourier transform (or something similar) can be applied on one or more
finite intervals of the waveform. In general, the transform is applied to the product of the
waveform and a window function. Any window (including rectangular) affects the spectral
estimate computed by this method.
WINDOWING
Windowing of a simple waveform like cos ωt causes its Fourier transform to develop non-
zero values (commonly called spectral leakage) at frequencies other than ω. The leakage
tends to be worst (highest) near ω and least at frequencies farthest from ω.
If the waveform under analysis comprises two sinusoids of different frequencies, leakage can
interfere with the ability to distinguish them spectrally. If their frequencies are dissimilar and
one component is weaker, then leakage from the larger component can obscure the weaker
one's presence. But if the frequencies are similar, leakage can render them irresolvable even
when the sinusoids are of equal strength.
The rectangular window has excellent resolution characteristics for sinusoids of comparable
strength, but it is a poor choice for sinusoids of disparate amplitudes. This characteristic is
sometimes described as low-dynamic-range.
At the other extreme of dynamic range are the windows with the poorest resolution. These
high-dynamic-range, low-resolution windows are also poorest in terms of sensitivity; that is,
if the input waveform contains random noise close to the frequency of a sinusoid, the
response to the noise, compared to the sinusoid, will be higher than with a higher-resolution
window. In
other words, the ability to find weak sinusoids amidst the noise is diminished by a high-
dynamic-range window. High-dynamic-range windows are probably most often justified in
wideband applications, where the spectrum being analyzed is expected to contain many
different components of various amplitudes.
In between the extremes are moderate windows, such as Hamming and Hann. They are
commonly used in narrowband applications, such as the spectrum of a telephone channel. In
summary, spectral analysis involves a tradeoff between resolving comparable strength
components with similar frequencies and resolving disparate strength components with
dissimilar frequencies. That tradeoff occurs when the window function is chosen. These two
windows along with their corresponding Fourier transforms are illustrated in the Figures A.a
and A.b.
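The leakage behavior described above can be demonstrated numerically. The sketch below (all parameters are illustrative) compares the spectrum of an off-bin sinusoid under the implicit rectangular window and under a Hann window:

```python
import numpy as np

# A pure sinusoid whose frequency (10.5 bins) does not fall exactly on
# an FFT bin: the worst case for leakage with a rectangular window.
n = 256
t = np.arange(n)
x = np.cos(2 * np.pi * 10.5 * t / n)

spec_rect = np.abs(np.fft.rfft(x))                  # rectangular window
spec_hann = np.abs(np.fft.rfft(x * np.hanning(n)))  # Hann window

# Compare the leakage far from the peak: the Hann window's sidelobes
# fall off much faster than the rectangular window's.
far = np.arange(40, len(spec_rect))
print("rect leakage: %.4f" % spec_rect[far].max())
print("hann leakage: %.6f" % spec_hann[far].max())
```

The far-from-peak leakage under the Hann window is orders of magnitude below that of the rectangular window, at the cost of a wider main lobe, which is exactly the resolution/dynamic-range tradeoff discussed above.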
Figure A.a. Hamming Window Figure A.b. Hanning Window
12. APPENDIX B. FILTERBANK
In signal processing, a filter bank is an array of band-pass filters that separates the input
signal into multiple components, each one carrying a single frequency sub-band of the
original signal. One application of a filter bank is a graphic equalizer, which can attenuate the
components differently and recombine them into a modified version of the original signal.
The process of decomposition performed by the filter bank is called analysis (meaning
analysis of the signal in terms of its components in each sub-band); the output of analysis is
referred to as a sub-band signal, with as many sub-bands as there are filters in the filter bank.
The reconstruction process is called synthesis, meaning reconstitution of a complete signal
resulting from the filtering process.
In digital signal processing, the term filter bank is also commonly applied to a bank of
receivers. The difference is that receivers also down-convert the sub-bands to a low center
frequency that can be re-sampled at a reduced rate. The same result can sometimes be
achieved by undersampling the band-pass sub-bands.
Another application of filter banks is signal compression, when some frequencies are more
important than others. After decomposition, the important frequencies can be coded with a
fine resolution. Small differences at these frequencies are significant and a coding scheme
that preserves these differences must be used. On the other hand, less important frequencies
do not have to be exact. A coarser coding scheme can be used, even though some of the finer
(but less important) details will be lost in the coding.
The vocoder uses a filter bank to determine the amplitude information of the sub bands of a
modulator signal (such as a voice) and uses them to control the amplitude of the sub bands of
a carrier signal (such as the output of a guitar or synthesizer), thus imposing the dynamic
characteristics of the modulator on the carrier. Figures B.a and B.b show filter banks of
different dimensions.
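A minimal frequency-domain sketch of the analysis/synthesis idea follows. Note this toy version splits the FFT spectrum into contiguous sub-bands rather than using true time-domain band-pass filters, and the function name and parameters are hypothetical:

```python
import numpy as np

def fft_filter_bank(signal, n_bands):
    # Analysis: split the signal into frequency sub-bands by zeroing
    # out all FFT bins outside each band and transforming back.
    spec = np.fft.rfft(signal)
    edges = np.linspace(0, len(spec), n_bands + 1).astype(int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_spec = np.zeros_like(spec)
        band_spec[lo:hi] = spec[lo:hi]   # keep only this sub-band
        bands.append(np.fft.irfft(band_spec, n=len(signal)))
    return bands

rng = np.random.RandomState(1)
x = rng.randn(512)
bands = fft_filter_bank(x, 4)

# Synthesis: because the sub-bands partition the spectrum exactly,
# summing the sub-band signals reconstructs the original signal.
reconstructed = np.sum(bands, axis=0)
print(np.allclose(reconstructed, x))
```

Perfect reconstruction here follows from the linearity of the Fourier transform; practical filter banks built from real filters trade some of this exactness for causality and efficiency.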
Figure B.a. One dimensional Filter Bank Figure B.b. Two Dimensional Filter Bank
13. APPENDIX C. SUPERVISED LEARNING
Supervised learning is the machine learning task of inferring a function from labeled training
data. The training data consist of a set of training examples. In supervised learning, each
example is a pair consisting of an input object (typically a vector) and a desired output value
(also called the supervisory signal). A supervised learning algorithm analyzes the training
data and produces an inferred function, which can be used for mapping new examples. An
optimal scenario will allow for the algorithm to correctly determine the class labels for
unseen instances. This requires the learning algorithm to generalize from the training data to
unseen situations in a reasonable way.
Supervised learning accounts for a lot of research activity in machine learning and many
supervised learning techniques have found application in the processing of multimedia
content. The defining characteristic of supervised learning is the availability of annotated
training data. The name invokes the idea of a "supervisor" that instructs the learning system
on the labels to associate with training examples. Typically, these labels are class labels in
the
classification problems. Supervised learning algorithms induce models from these training
data and these models can be used to classify other unlabelled data.
Supervised learning entails learning a mapping between a set of input variables x and an
output y and applying this mapping to predict the outputs for unseen data. Supervised
learning is the most important methodology in machine learning and it also has a central
importance in the processing of multimedia data.
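As a minimal, self-contained illustration of the train-then-predict workflow described above, here is a nearest-centroid learner on made-up 2-D features (not the project's GMM classifier; all data and names are hypothetical):

```python
import numpy as np

# Labeled training examples: input vectors paired with class labels
train_x = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 1, 1])

# "Training": induce a model (one centroid per class) from the data
classes = np.unique(train_y)
centroids = np.array([train_x[train_y == c].mean(axis=0) for c in classes])

def predict(x):
    # Map an unseen example to the label of its nearest centroid
    d = np.linalg.norm(centroids - x, axis=1)
    return classes[np.argmin(d)]

print(predict(np.array([0.1, 0.0])))   # near the class-0 centroid
print(predict(np.array([1.0, 0.9])))   # near the class-1 centroid
```

The structure mirrors the definition above: labeled pairs go in, an inferred function (`predict`) comes out, and that function generalizes to inputs it has never seen.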