Learning Techniques for Identifying Vocal Regions in Music Using the Wavelet
Transformation, Version 1.0
A Thesis
submitted to the Faculty of the
Graduate School of Arts and Sciences
of Georgetown University
in partial fulfillment of the requirements for the
degree of
Master of Science
in Computer Science
By
Michael J. Henry, B.S.
Washington, DC
February 14, 2011
Copyright © 2011 by Michael J. Henry
All Rights Reserved
Learning Techniques for Identifying Vocal Regions in Music Using the
Wavelet Transformation, Version 1.0
Michael J. Henry, B.S.
Thesis Advisor: Marcus A. Maloof
Abstract
In this research I present a machine learning method for the automatic detection of vocal regions
in music. I employ the wavelet transformation to extract wavelet coefficients, from which I build
feature sets capable of constructing a model that can distinguish between regions of a song that
contain vocals and those that are purely instrumental. Singing voice detection is an important
aspect of the broader field of Music Information Retrieval, and efficient vocal region detection
facilitates further research in other areas such as singer identification, genre classification and the
management of large music databases. As such, it is important for researchers to be able to detect
automatically and accurately which sections of music contain vocals and which do not. Previous
methods that used features such as the popular Mel-Frequency Cepstral Coefficients (MFCC) have
several disadvantages when analyzing signals in the time-frequency domain that the wavelet
transformation can overcome. The models constructed by using the wavelet transformation on a
windowed music signal produce a classification accuracy of 86.66%, 11% higher than models built
using MFCCs. Additionally, I show that applying a decision tree algorithm to the vocal region
detection problem produces a more accurate model than other, more widely applied learning
algorithms, such as Support Vector Machines.
Index words: Vocal Region Detection, Machine Learning, MFCC, Wavelet Transform,
SVM, Naïve Bayes, Decision Tree, Thesis (academic)
Dedication
For my wife, Stacy, in appreciation for her endless support and patience.
Acknowledgments
I would like to thank my thesis advisor, Dr. Marcus A. Maloof, for his tireless effort in helping
me complete this thesis, and for his support in encouraging me to pursue an interest of mine. This
research would not have been possible without his support over the past three years. Also, I would
like to thank my committee members, Dr. Lisa Singh and Dr. Ophir Frieder, for inspiring my interest
in Machine Learning and for offering continuous guidance and support, without which this would
not be possible. I would also like to thank Dr. Brian Blake for allowing me the opportunity to get
here.
Table of Contents
Acknowledgments
List of Figures
List of Tables

Chapter
1 Introduction
2 Background
  2.1 Music File Formats
  2.2 Audio Signal Transforms
  2.3 The Daubechies Wavelet Family
  2.4 Learning Algorithms
3 Review of Related Literature
4 A Method for Detecting Vocal Regions
5 Experiments
  5.1 Implementation Details
  5.2 Experiment 1: Measuring Overall Performance
  5.3 Experiment 2: Measuring Performance Across Different Artists
  5.4 Experiment 3: Measuring Performance Across Gender
  5.5 Experiment 4: Measuring Overall Performance Across Groups of Artists
  5.6 Experiment 5: Measuring Performance Across Different Groups of Artists
6 Conclusion
  6.1 Contributions
  6.2 Limitations
  6.3 Future Work
  6.4 Summary

Appendix
A Complete Experiment 1 Results
B Complete Experiment 2 Results
C Complete Experiment 3 Results
D Complete Experiment 4 Results
E Complete Experiment 5 Results

Bibliography
List of Figures
2.1 Spectrogram Progression of Alternating Vocal and Instrumental Regions in a Queen Song, from Left to Right Respectively and Top to Bottom
2.2 Binary Tree Representation of the Discrete Wavelet Transformation Filter Bank
2.3 Wavelet Coefficients Extracted from a Progression of Alternating Vocal and Instrumental Regions in a Queen Song, from Left to Right Respectively and Top to Bottom
2.4 Wavelet (Left) and Scaling (Right) Functions of the Four Daubechies Wavelets used in this Thesis. Magnitude is shown along the y-axis and support length is shown along the x-axis
4.1 The Training Phase of the Presented Methodology
5.1 Experimental Method
5.2 Transcribe! window with Aerosmith's "Dream On" open
List of Tables
5.1 Experiment 1 Results for Vocal Region Detection Using MFCC Features on Individual Frames from Single Artists
5.2 Experiment 1 Results for Vocal Region Detection Using MFDWC Features on Individual Frames from Single Artists
5.3 Experiment 1 Results for Vocal Region Detection Using Wavelet Energy Features on Individual Frames from Single Artists
5.4 Experiment 2 Results for Vocal Region Detection on Individual Frames from Different Individual Artists
5.5 Experiment 3 Results for Gender-Based Vocal Region Detection at a Sampling Rate of 20%
5.6 Experiment 4 Results for Vocal Region Detection on Frames From Multiple Artists at a Sampling Rate of 40%
5.7 Experiment 5 Results for Vocal Region Detection on Individual Frames from Multiple, Different Artists using a 32% Sampling Rate, Average Over 10 Runs
A.1 Complete Experiment 1 Results Using MFCC Features
A.2 Complete Experiment 1 Results Using MFDWC Features and Naïve Bayes
A.3 Complete Experiment 1 Results Using MFDWC Features and J48
A.4 Complete Experiment 1 Results Using MFDWC Features and SVM
A.5 Complete Experiment 1 Results Using MFDWC Features and JRip
A.6 Complete Experiment 1 Results Using MFDWC Features and IBk
A.7 Complete Experiment 1 Results Using Wavelet Energy Features and Naïve Bayes
A.8 Complete Experiment 1 Results Using Wavelet Energy Features and J48
A.9 Complete Experiment 1 Results Using Wavelet Energy Features and SVM
A.10 Complete Experiment 1 Results Using Wavelet Energy Features and JRip
A.11 Complete Experiment 1 Results Using Wavelet Energy Features and IBk
B.1 Complete Experiment 2 Results using MFCC Features
B.2 Complete Experiment 2 Results using MFDWC Features (Training set in rows; test set in columns)
B.3 Complete Experiment 2 Results using Wavelet Energy Features (Training set in rows; test set in columns)
C.1 Complete Experiment 3 Results using MFCC Features at 2% Incremental Sampling Rates
C.2 Complete Experiment 3 Results using MFDWC Features at 2% Incremental Sampling Rates
C.3 Complete Experiment 3 Results using Wavelet Energy Features at 2% Incremental Sampling Rates
D.1 Complete Experiment 4 Results using MFCC Features at 2% Incremental Sampling Rates
D.2 Complete Experiment 4 Results using MFDWC Features at 2% Incremental Sampling Rates
D.3 Complete Experiment 4 Results using Wavelet Energy Features at 2% Incremental Sampling Rates
E.1 Complete Experiment 5 Results at 2% Incremental Sampling Rates using MFCC Features
E.2 Complete Experiment 5 Results at 2% Incremental Sampling Rates using MFDWC Features
E.3 Complete Experiment 5 Results at 2% Incremental Sampling Rates using Wavelet Energy Features
Chapter 1
Introduction
The prevalence of digital music in the Internet Age has created a need for music distributors and
content creators to develop and deploy systems capable of managing enormous libraries of music.
That need contributed to the birth of a subfield of Information Retrieval, referred to as Music
Information Retrieval (MIR), established with the goal of finding innovative ways to store, manage
and retrieve information from music. MIR encompasses a broad range of research topics, including
singing voice identification [1, 16, 19, 39], source separation [29], query-by-humming [9], acoustic
fingerprinting [42] and copyright protection [40].
Western popular music is typically structured in a similar fashion across most genres. A primary
singing voice, occasionally accompanied by one or more backup or supplementary singers, is
typically featured over a number of instrument tracks, which can include such popular instruments
as guitar, bass, piano or drums. In many of the MIR research areas mentioned above, most notably
singing voice identification, any application that wishes to perform its stated task must first be able
to detect the singing voice before performing further analysis.
Modern digital music, however, makes that task difficult. Most digital music available is encoded
using the MPEG-1 Audio Layer 3 (MP3) format [13], or the increasingly popular MPEG-4 (MP4,
M4A) format [14]. The MP3 encoding format uses a patented lossy data compression algorithm [13].
Audio compression is used in digital music as a way to sharply reduce the storage space required by
a digital music file. For MIR researchers, however, this poses a problem. A song encoded using the
MP3 format has each of the instrument and vocal channels compressed down to two channels [13],
making it difficult to separate an individual contributor from the rest of the audio.
For instance, researchers interested in automatic singer identification must first isolate the vocal
regions in the song [1, 16, 19, 39]. In order to properly develop a singer identification method, the
researchers must first preprocess their data using a reliable vocal region detection system. These
vocal region detection systems present their own set of problems and considerations independent of
the underlying task. Fortunately, there are some parallels that can be drawn between singing voice
recognition research, which is relatively new, and speaking voice recognition, which is a more mature
area of research. But while those similarities do exist, there are special considerations that must be
taken into account when approaching a vocal identification task, as opposed to a speech identification
task. The singing voice and the speaking voice contain different levels of voiced components (sounds
generated by phonation), the singing voice covers a wider range of timbre and frequency variation,
and noise in the form of background instrumentation is a far more significant factor in vocal
region detection than it is in speech identification.
These differences are significant enough to motivate MIR research that branches away from
research done in the speech discrimination domain. This research most often focuses on exploiting
those stated differences between the singing voice and the speaking voice [16], as well as differences
between the human voice and instruments, which will enable a model to distinguish between the
two.
The approach most often found in today's literature involves utilizing the Short-Time
Fourier Transform (STFT) in the calculation of Mel-Frequency Cepstral Coefficients (MFCC), and
using that representation of a signal's power spectrum as the features in a vocal region detection
model. MFCCs have wide applicability in both the speech and music analysis domains, and are
frequently found in the literature for both research areas [11, 21, 26, 31, 34, 40, 45]. However, there
are some limitations to using MFCC features when building a vocal region model. The first is
that the Short-Time Fourier Transform is well adapted for stationary signals, signals that repeat
with the same periodicity to infinity, whereas music does not show the same periodic behavior.
Additionally, Fourier analysis only permits analysis at a single resolution, and once the Fourier
Transform is applied to a signal, good time resolution is lost. Therefore, applying the Fourier
Transformation to the entire signal produces a single resolution for the entire signal, and applying
the STFT produces a single resolution for each portion of the windowed signal. The MFCC
algorithm also utilizes the Discrete Cosine Transformation (DCT) in its calculation. While the DCT
has demonstrated a wide range of functionality in signal analysis, the coefficients produced as a
result of applying the DCT are not well localized and are thus susceptible to noise and corruption,
negatively affecting the performance of MFCC features in a model where the data is noisy.
The Wavelet Transformation offers a means of overcoming those limitations present in the current
research. It gives researchers the ability to analyze a non-stationary signal at a number of different
resolutions. In addition, it produces a series of localized wavelet coefficients that are representative
of the signal in the time-frequency domain. The wavelet family employed in this research, the
Daubechies family, provides minimum support for a given number of vanishing moments, which is
helpful in detecting abrupt transitions in the signal, producing coefficients that better capture the
distinguishing characteristics of the signal.
As I show in this thesis, the Wavelet Transformation can be incorporated into a number of
different feature vectors and can sharply outperform MFCC features when building a vocal region
detection system. I show that features built using the Discrete Wavelet Transformation outperform
MFCC features by up to 11%.
The purpose of this thesis is to first examine previous research in the area of vocal region detection
in an attempt to identify potential limitations that exist within the current body of research.
Mitigating those limitations will produce a better vocal region model, which in turn can be used in
other MIR applications to help produce better results.
In this research I make four contributions to vocal region detection research. First, I propose the
application to the vocal region detection problem of two different features with proven applicability
in other signal analysis research [8, 11]. The first is an energy feature vector calculated using the
Discrete Wavelet Transformation (DWT), employed in [8], which showed promise in building a more
accurate model than one built using MFCCs. The second feature vector substitutes the DWT for
the Discrete Cosine Transformation in the standard MFCC algorithm [11].
Second, in response to the wide variety of data sets used in the MIR literature, I annotate and
use a standard benchmark music corpus consisting of a wide variety of popular music songs in their
entirety. The annotations used in this research will be made available for other researchers to use
in vocal region detection tasks.
Third, I present a methodology based on the standard machine learning research framework,
adapted to utilize the Discrete Wavelet Transformation. This methodology can be generalized
to accommodate a number of different data sets, wavelet families and learning algorithms.
Finally, I evaluate these features on five machine learning algorithms in an attempt to determine
which algorithm performs best for the vocal region detection problem. My results are compared
to those generated by using the standard MFCC algorithm, and I show that MFCC features are
outperformed by feature vectors built using wavelet coefficients, particularly the features built
using the wavelet energy algorithms found in [8]. I show that for each of my five experiments, the
model built with Wavelet Energy features has a higher accuracy than a model built using MFCC
features. Additionally, my results show that a decision tree algorithm is the best learning method
to use when attempting to identify vocal regions in music, outperforming the other tested algorithms,
including Support Vector Machines, a popular classification algorithm in MIR applications.
This thesis is organized as follows. First, background information is presented on music file
formats, the audio transformations used in this research, the Daubechies wavelet family and the
learning algorithms employed in this thesis. Next is a discussion of the previous research in the
area of vocal region detection and the limitations found in that research. That is followed
by a description of the methodology developed for this research, including a discussion of the
preprocessing steps, feature extraction and utilization of learning algorithms. Following that is a
discussion of the experimental setup of this research, including a discussion of the data used, as well
as a description of each of the five tests that were run. Finally, results are presented, followed
by a discussion of the results and the conclusions that can be drawn, motivating follow-on research.
The unabridged results are presented in the Appendix.
Chapter 2
Background
Before discussing the details of this thesis it is important to give background information on the
music file format, audio signal transforms, machine learning algorithms and wavelet family used in
this thesis. Often overlooked, the format in which an audio recording is encoded can have a
significant impact on which audio information is available to analyze. Likewise, each signal trans-
formation has specific properties and offers different insight into the input signal. From a machine
learning standpoint, the choice of algorithm has a similar effect. Each machine learning
algorithm makes different assumptions and performs better on different types of data. Thus, each
algorithm is explored in an effort to better evaluate the algorithms against the data in this thesis.
Additionally, each wavelet family is designed to offer different attractive properties based on the
problems to which it is applied. Understanding the properties of the wavelet family used in this
research offers insight into its performance.
2.1 Music File Formats
The file format of a music file plays an important role in an audio analysis task. Raw music from an
audio CD has a sampling rate of 44.1 kHz with a bit rate of 1,411.2 kbps. The bit rate of an audio
file determines how much space on disk that audio file uses. Due to the large bit rate, CD-quality
music files are quite large and can be cumbersome to store, share or process.
These issues became the motivation behind the research that would eventually lead to the MPEG-1
standard.
While there exist a number of music file formats, including Microsoft's proprietary Windows
Media format, Vorbis and AAC (the latter two being open standards), MPEG-1 Part 3
(commonly referred to as MP3) has become very popular, as it was the file format of choice during
the early days of peer-to-peer file sharing, and is used by large online music retailers like Amazon.com.
Encoding files in the MP3 format allows an individual to compress a song down to a tenth or less of
its original size, making it much easier to store a large database of music files as well as share those
files across a network.
The ability to shrink a music file down to a fraction of its original size while retaining a close
approximation of the original sound quality is due to the compression algorithm used by the MP3
encoder. The bit rate of an MP3 encoder is not explicitly set by the MPEG-1 standard, but the
bit rate most commonly used is 128 kbps, which is high enough that the common user
will be unable to detect any degradation in quality while remaining small enough to yield a much
smaller file size. Comparing the common bit rate of an MP3 to the bit rate of a CD yields a ratio
of 128:1411.2, or approximately 1:11. In order to achieve such a ratio, the MP3 encoding algorithm
relies on a psychoacoustic model, a technique known as perceptual coding [13], in order to compress the file.
Perceptual coding involves discarding or reducing the components of an audio signal that the human
ear either cannot perceive or perceives only faintly. This allows the encoding algorithm to discard much
of the information in an audio file that is of little or no importance to the listener while keeping
the relevant information and encoding it efficiently. Additionally, the MPEG-1 standard allows
an encoder to take advantage of variable bit rates by performing bit rate switching on a frame-by-
frame basis. This allows for further compression by using a lower bit rate for less important data
and a higher bit rate, and thus more storage space, for more complicated and important audio
components.
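As a rough illustration of those figures, the arithmetic below (a sketch with nominal values, not data drawn from this thesis's corpus) converts the quoted bit rates into approximate file sizes for a four-minute song:

    # Back-of-the-envelope check of the quoted compression ratio:
    # file size (MB) = bit rate (kbps) * duration (s) / 8 / 1000.
    cd_kbps, mp3_kbps = 1411.2, 128
    seconds = 4 * 60
    cd_mb = cd_kbps * seconds / 8 / 1000    # ~42.3 MB of CD-quality audio
    mp3_mb = mp3_kbps * seconds / 8 / 1000  # ~3.8 MB at 128 kbps
    print(round(cd_mb, 1), round(mp3_mb, 1), round(cd_kbps / mp3_kbps, 1))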
While the compression algorithm used by MP3 encoders is very efficient and capable of storing
large audio files in a more compact fashion, there are some drawbacks from a research standpoint.
First, the encoding algorithms are patented, and the patent situation surrounding the MP3 format
is complicated. The MP3 format is part of an ISO standard, which is freely available for interested
parties to seek out and implement. Many have done so, leading to a variety of similar yet possibly
slightly different MP3 encoding implementations, so researchers must depend on the reliability of
the particular encoding/decoding software they choose to use. Numerous companies own patents
on certain MP3 encoding implementations, as well as on other pieces that surround the MP3
algorithm. The patent situation is a convoluted issue, and researchers must keep in mind all
applicable patent restrictions when conducting their research.
Secondly, the compression algorithm used by MP3 is lossy: the algorithm throws out data while
performing its compression, data that cannot be retrieved later. This presents a problem because
the information removed to shrink the size of the audio file, while irrelevant from the standpoint
of the listener, might be relevant from the standpoint of an algorithm attempting to deduce
identifying information from that audio signal. Additionally, whereas raw recorded audio typically
contains a separate channel for each input into the signal (guitars, bass, drums, vocals, etc.), those
channels are "mixed" together in formats such as MP3. This prevents a researcher from
automatically extracting the audio components that they are interested in, instead forcing them
to develop methods, statistical learning methods or otherwise, that allow them to isolate a
particular component of interest from the original signal.
Despite the mentioned drawbacks, it is important for audio researchers to use the MP3 file
format in their research due to its popularity. While newer media containers such as MP4, based
on Apple's QuickTime container format [14], have grown in popularity in recent years, MP3 remains
the de facto standard, as many digital music sellers, such as Amazon.com, continue to distribute
their digital music in the MP3 format.
2.2 Audio Signal Transforms
When analyzing audio, it is often useful to transform the signal from one domain to another in order
to extract information that is not readily accessible in its original form. Often a
transform is applied to obtain frequency information, as with the Fourier Transformation [5] and the
Wavelet Transformation [23], or in order to gain a compact representation of a signal for compression
purposes, such as the Discrete Cosine Transformation. In the following sections, I will discuss four
transformations of interest: the Fourier Transformation, the Discrete Cosine Transformation, the
Continuous Wavelet Transformation and the Discrete Wavelet Transformation.
2.2.1 The Fourier Transformation
The Fourier Transform (FT) is one of the most widely used transformations in signal analysis [5].
The FT is an operation that produces a frequency domain representation of an input signal that is
in the time domain.
The Fourier Transform is defined [23] as the following:

    F(\tau) = \int f(t)\, e^{-it\tau}\, dt    (2.1)

where \tau is the frequency. The inverse Fourier Transform is defined [23] as the following:

    f(t) = \frac{1}{2\pi} \int F(\tau)\, e^{it\tau}\, d\tau    (2.2)
The FT is derived from the Fourier series, where periodic functions can be written as a sum of
sines and cosines. Expanding the input signal into a series of sines and cosines gives the frequency
content, over all time, of the input signal.
The FT assumes a stationary signal that is the same throughout time, making it difficult to apply
the FT to a non-stationary, time-varying signal. As a means of overcoming that limitation, the Short-
Time Fourier Transform (STFT) was developed, introducing the concept of time to Fourier analysis.
The STFT works by first windowing an input signal so that the resulting output is zero outside of
the range permitted by the windowing function. The FT is then computed on that window. That
window is shifted along the length of the signal and the FT is calculated at each position.
From the STFT we can generate a spectrogram, which is a time-variant spectral image repre-
senting how the energy of a signal is distributed with frequency. Figure 2.1 shows the spectrogram
of six sequential segments (alternating vocal and non-vocal) of a Queen song, "Death on Two Legs".
These spectrograms were generated with a window size of 256 samples, with a 128-sample overlap.
The spectrogram shows time along the horizontal axis, frequency along the vertical axis and
intensity, the amplitude of a particular frequency at a given time, represented by a color scale.
As the figure shows, the frequency content differs between a vocal segment and a non-vocal
segment. The maximum amplitude of the frequency content for a given window is different for vocal
and non-vocal regions, as illustrated by the circles on the first two images in the series. Viewing the
spectrogram gives us a summary view of the frequency content of a signal, and how that frequency
content changes over time. This noticeable difference between vocal regions and non-vocal regions
is the motivation behind applying frequency analysis techniques in MIR research.
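For readers who wish to reproduce such a spectrogram, the following sketch (an illustration using SciPy; the thesis does not specify the tooling used to produce Figure 2.1) computes an STFT with the same 256-sample window and 128-sample overlap:

    import numpy as np
    from scipy.signal import stft

    fs = 44_100                      # CD-quality sampling rate (Hz)
    x = np.random.randn(fs)          # placeholder for one second of audio
    f, t, Z = stft(x, fs=fs, nperseg=256, noverlap=128)
    spectrogram = np.abs(Z) ** 2     # power at each (frequency, time) bin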
The STFT can be applied to mitigate the aforementioned limitations of the Fourier Transform,
but windowing an input signal does not completely solve the time resolution problem. The windows
used in the STFT are of fixed length, therefore the time resolution of STFT analysis is fixed as well.
Given the non-stationary property of music signals, a fixed-length window is insufficient to capture
all relevant information. Therefore, while a lot of important information can be gained from the
STFT, the fixed resolution is a limitation of STFT analysis, a shortcoming that can be overcome by
the wavelet transformation.
Figure 2.1: Spectrogram Progression of Alternating Vocal and Instrumental Regions in a Queen Song, from Left to Right Respectively and Top to Bottom
2.2.2 The Discrete Cosine Transformation
The Discrete Cosine Transformation (DCT) is employed in the calculation of Mel-Frequency
Cepstral Coefficients (MFCC), providing a compact representation of a cosine series expansion of the
filterbank energies [6]. The resulting cepstrum coefficients are the MFCC features that can be used
as features in an audio classification task.
The DCT is related to the Fourier series expansion utilized by the DFT, except that instead of
decomposing a signal into a series of both cosines and sines, the DCT only expresses a signal as a
sum of cosines [32]. This is because the DCT enforces different boundary conditions than
the DFT [32].
There are several variations of the DCT, but the most common is DCT-II, which is used in
a number of signal analysis applications, such as the calculation of MFCC [6]. In the calculation
of DCT-II, N real-valued numbers x_0, x_1, \ldots, x_{N-1} are decomposed into N real-valued numbers
X_0, X_1, \ldots, X_{N-1} by (2.3):

    X_k = \sum_{n=0}^{N-1} x_n \cos\!\left(\frac{\pi k}{N}(n + 0.5)\right), \quad k = 0, 1, \ldots, N-1    (2.3)
2.2.3 The Continuous Wavelet Transformation
The motivation behind the Wavelet Transform grew out of a need to analyze non-stationary signals,
a limitation of other analysis techniques, such as Fourier analysis [23]. Whereas the standard Fourier
Transform converts a time-domain signal to its frequency-domain representation, losing good time
resolution in the process, the STFT applies a fixed-length window to the input signal
and subsequently applies the Fourier Transform to the windowed signal in an attempt to compen-
sate for that limitation and track the frequency changes over time. While applying the STFT gives
researchers a way of introducing an element of time to their analysis, the fixed-length window that is
applied introduces a time/frequency resolution trade-off. A shorter window gives excellent time res-
olution, but the frequency resolution is degraded. Applying a longer window improves the frequency
resolution, but the time resolution suffers as a result. While the STFT has a fixed resolution, the
Wavelet Transform has variable resolution over a number of scales, thus giving good time resolution
for the high frequencies and good frequency resolution for the low frequencies [23].
In order to produce a time-frequency representation of a signal, the signal must be cut into
pieces and analyzed individually. However, the Heisenberg Uncertainty Principle stipulates that in
the domain of signal processing, it is impossible to know with absolute certainty the point on the
time-frequency axis where the signal should be mapped [23]. Therefore, it is very important to be
careful when determining how to slice up the signal in preparation for analysis.
The Wavelet Transformation solves the signal cutting problem by using a scaled window that is
shifted across the signal; for every point along the signal, its spectrum is calculated. That process
is then repeated with either shorter or longer windows. The end result is a set of time-frequency
representations of the signal, all at different resolutions. At this point we switch from speaking in
terms of the time-frequency representation and instead discuss the time-scale representation of a
signal, where scale is thought of as the opposite of frequency. A large scale presents the big-picture
representation of the signal with less detail than a small scale, which shows more detail across a
small time frame [2, 7, 11, 35, 43].
The Continuous Wavelet Transformation (CWT) is [23]:

    X(s, \tau) = \int f(t)\, \psi^{*}_{s,\tau}(t)\, dt    (2.4)

It can be seen from (2.4) that the input signal f(t) is decomposed into a series of basis functions,
or wavelets, denoted by \psi_{s,\tau}(t), where s represents the scale and \tau represents the translation.

The wavelet series \psi_{s,\tau}(t) is derived from a single basis wavelet, known as the mother wavelet,
\psi(t). The mother wavelet is used to generate every other wavelet in the wavelet series. Each wavelet
in the series can be extracted by scaling and translating the mother wavelet [23]:

    \psi_{s,\tau}(t) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{t - \tau}{s}\right)    (2.5)

where s is the scale factor and \tau is the translation factor. The wavelet function derived from the
mother wavelet, together with the scaling function, defines a wavelet.
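As an illustrative sketch of (2.4) and (2.5), the snippet below computes a CWT with PyWavelets. This is an assumption for demonstration only: PyWavelets restricts its CWT to certain mother wavelets (here the Morlet), not the Daubechies family used elsewhere in this thesis:

    import numpy as np
    import pywt

    t = np.linspace(0, 1, 1024)
    x = np.sin(2 * np.pi * 50 * t)             # placeholder 50 Hz tone
    scales = np.arange(1, 64)
    coef, freqs = pywt.cwt(x, scales, 'morl')  # one row of coefficients per scale
    print(coef.shape)                          # (63, 1024)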
2.2.4 The Discrete Wavelet Transformation
Because the CWT operates over every possible scale and translation, it is inefficient to
compute. The DWT, however, uses a subset of scale and translation values, which makes it more
efficient to compute. Rather than being continuously scalable and translatable like the CWT, the
DWT is scalable and translatable in discrete steps, resulting in a discrete sampling of the time-scale
representation. Thus, (2.5) becomes [23]:

    \psi_{j,k}(t) = \frac{1}{\sqrt{s_0^j}}\, \psi\!\left(\frac{t - k\tau_0 s_0^j}{s_0^j}\right)    (2.6)
where j and k represent the scale and shift, respectively, and s_0 defines the dilation step. Thus we
can describe [23] the DWT as follows:

    X(j, k) = \sum_{n} x(n)\, 2^{-j/2}\, \psi(2^{-j} n - k)    (2.7)
The DWT can be calculated by passing the signal through a filter bank, where the signal is
simultaneously passed through both a low pass filter and a high pass filter, which act as the scaling
and wavelet function, respectively. Repeating this process produces a binary tree representation of
the filter bank, where each node in the tree represents a different time/frequency localization, and
each level in the binary tree is a different decomposition level. This representation is shown in Figure
2.2. Passing the signal through the low pass filter produces a series of approximate
coefficients, while the detail coefficients are given by passing the signal through the high pass filter.
This can be seen [25] from the following equations:
    y_{high}[k] = \sum_{n} x[n]\, g[2k - n]    (2.8)

    y_{low}[k] = \sum_{n} x[n]\, h[2k - n]    (2.9)

where h[\cdot] and g[\cdot] are weighting factors that form the low pass and high pass filters, respectively. In
order to keep the output data rate equal to the input data rate, instead of it being doubled after
passing through each filter, we apply a factor of two to the high and low pass filters, denoted by
2k in (2.8) and (2.9), effectively subsampling the output of each filter by 2. This process describes
one iteration, or level, of the DWT calculation algorithm. The signal can be further decomposed by
repeating the process using the approximate coefficients as input to the next level of high and low
pass filters. The maximum number of wavelet decomposition levels that can be applied to a signal
is related to the length of the signal: N = 2^L, where N is the length of the signal and L is the
maximum decomposition level.
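A minimal sketch of the filter bank, assuming the PyWavelets library (the thesis does not prescribe a particular implementation): one call to dwt performs a single level of (2.8) and (2.9), and wavedec repeats the process on the approximate coefficients:

    import numpy as np
    import pywt

    x = np.random.randn(1024)                 # placeholder signal of length 2^10
    cA, cD = pywt.dwt(x, 'db4')               # one filter-bank level: approx + detail
    coeffs = pywt.wavedec(x, 'db4', level=4)  # recurse on cA for a 4-level tree
    print(len(cA), len(cD), [len(c) for c in coeffs])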
The extracted coefficients represent the decomposed signal at different resolutions: detailed infor-
mation is produced by the high pass filter at a low scale and a coarse representation is produced by
the low pass filter at a high scale. These coefficients, produced by the DWT, give a time-frequency
representation of the signal in a compressed fashion, representing how the energy of the signal is
distributed in the time-frequency domain. For comparison with the spectrogram shown in Figure 2.1,
Figure 2.3 graphs the wavelet coefficients for the same sequence of alternating vocal
and non-vocal frames, where time is given along the x axis, scale along the y axis and coefficient
magnitude along the z axis. The wavelet coefficients give a multiresolution representation of the
input signal and, as such, offer an opportunity to extract more information about the nature of the
signal.

Figure 2.2: Binary Tree Representation of the Discrete Wavelet Transformation Filter Bank
2.3 The Daubechies Wavelet Family
Wavelet families are typically characterized by the manner in which the scaling and wavelet func-
tions are defined. Daubechies wavelets, however, cannot be described in closed form [23]. Thus, the
Daubechies wavelet family is defined by its orthogonal basis and by having the maximal number of
vanishing moments for a given compact support.
Figure 2.3: Wavelet Coefficients Extracted from a Progression of Alternating Vocal and Instrumental Regions in a Queen Song, from Left to Right Respectively and Top to Bottom
Having an orthonormal basis is an important property of the Wavelet Transformation. L^2 space,
the space of square-integrable functions, is a Hilbert space. A family of vectors \{\psi_n\} in that
Hilbert space is said to be an orthogonal basis of the space if \langle \psi_n, \psi_p \rangle = 0 for
n \neq p. If f \in L^2 and \{\psi_n\} is an orthonormal basis of L^2, then the Plancherel formula
expresses the conservation of energy [23]:

    \|f\|^2 = \sum_{n=0}^{+\infty} |\langle f, \psi_n \rangle|^2    (2.10)

The conservation of energy asserts that the distances between two objects are not affected by
the Wavelet Transformation [18].
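The conservation property in (2.10) can be checked numerically; the sketch below (assuming PyWavelets, with the periodization mode that keeps the transform orthogonal on a finite signal) confirms that the coefficient energy matches the signal energy for db2:

    import numpy as np
    import pywt

    x = np.random.randn(1024)
    coeffs = pywt.wavedec(x, 'db2', mode='periodization')
    energy = sum(np.sum(c ** 2) for c in coeffs)
    assert np.isclose(energy, np.sum(x ** 2))  # Eq. 2.10 holds numerically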
Compact support is defined as the finite interval upon which the basis function is supported.
This means that when a wavelet is applied to a region of a signal, the parts of the signal that fall
outside the support of the wavelet are not affected by the wavelet function. This guarantees that the
wavelet coefficients produced by applying the wavelet transformation will be localized with respect
to that region of interest [18].
Vanishing moments describe the type of polynomial information that can be represented by the
wavelet [23]. The number of vanishing moments gives the order of the wavelet; for example, db3 has
three vanishing moments. As the order of the wavelet increases, that is, as the number of vanishing
moments grows, the wavelet becomes better able to approximate the signal. In some of the literature,
wavelet orders are represented by the length of the wavelet's support. In that case, the number of
vanishing moments is given by half of the support length. For the purposes of this thesis, the
subscript on the wavelet name will denote the number of vanishing moments.
The Daubechies wavelet family is popular in many research areas, such as dimensionality reduc-
tion and denoising [18], as well as privacy-preserving data stream identification [35], pattern recogni-
tion [7, 30], general sound and audio analysis [17, 41] and speech analysis [8, 11, 37, 43]. Its popularity
in a wide variety of research areas, both closely and tangentially related, motivates its use in this
thesis.
The wavelet and scaling functions, discussed in the previous section, for the four Daubechies
wavelets [25] used in this thesis are shown in Figure 2.4. As Figure 2.4 illustrates, the first-order
Daubechies wavelet, db1, is equivalent to the Haar wavelet [23].
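A small sketch (again assuming PyWavelets) verifies two of the facts stated above: the dbN filters have length 2N, tying support length to the number of vanishing moments, and db1 coincides with the Haar wavelet:

    import numpy as np
    import pywt

    for name in ['db1', 'db2', 'db3', 'db4']:
        w = pywt.Wavelet(name)
        print(name, 'filter length:', len(w.dec_lo))  # 2, 4, 6, 8

    assert np.allclose(pywt.Wavelet('db1').dec_lo, pywt.Wavelet('haar').dec_lo)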
Figure 2.4: Wavelet (Left) and Scaling (Right) Functions of the Four Daubechies Wavelets used in this Thesis. Magnitude is shown along the y-axis and support length is shown along the x-axis
2.4 Learning Algorithms
For the purpose of this thesis, models were built using the following learning algorithms: Support
Vector Machines, Naïve Bayes, JRip (RIPPER), J48 (C4.5) and IBk. Each of those five learning
algorithms represents a unique machine learning approach. In addition to discovering which features
work best with which learning algorithm, using a wide variety of algorithms allows me to discuss the
stability of the extracted features as learning features. If a feature set works very well with one
learning algorithm while performing poorly on all others, the reasons for that behavior deserve
discussion; a feature set that performs well on all or most learning algorithms provides a strong
degree of confidence in the usability of that feature set.
2.4.1 Support Vector Machines
One of the main appeals of Support Vector Machines (SVM) is that they work well with high-
dimensional data and are able to avoid the curse of dimensionality. The two main ideas
behind SVMs with regard to this research are linear separability and the hyperplane. The SVM
algorithm attempts to construct a hyperplane that divides a data set into n regions, where n is
the number of class labels in the data set. There are an infinite number of hyperplanes that can be
constructed to divide a data set, and as such, the SVM algorithm attempts to find the maximum
margin hyperplane with which to separate the data [38]. A hyperplane with the maximum margin
has the maximum distance between the boundaries of the training data for each class. This provides,
in general, better accuracy when testing the model with previously unseen data.
Let N denote the number of training instances in the data set. Each instance has a set of features
used to train the model, denoted by f_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,m}), i = 1, 2, \ldots, N. For a binary
classification task, the class label is denoted by l \in \{c_1, c_2\}, or more generally, l \in \{-1, +1\}. The
SVM algorithm then determines the decision boundary of a linear SVM classifier with the following
equation [38]:

    y \cdot f + b = 0    (2.11)

where y and b are the parameters of the SVM model. Those parameters are selected such that the
following two conditions are met [38]:

    y \cdot f_i + b \geq 1 \quad \text{for } l_i = 1    (2.12)

    y \cdot f_i + b \leq -1 \quad \text{for } l_i = -1    (2.13)

Furthermore, Support Vector Machines find the maximum margin hyperplane by minimizing [38]:

    f(y) = \frac{\|y\|^2}{2}    (2.14)
Satisfying those equations gives the model parameters, which can then be applied to previously
unseen data to determine the most appropriate class label.
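As a hedged sketch (the experiments in this thesis used WEKA; scikit-learn is substituted here purely for illustration), a linear SVM on toy data exposes the y and b of (2.11) directly:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
    l = np.array([-1, -1, 1, 1])          # labels in the -1/+1 convention
    clf = SVC(kernel='linear').fit(X, l)  # finds the maximum margin hyperplane
    print(clf.coef_, clf.intercept_)      # y and b in Eq. 2.11
    print(clf.predict([[2.5, 2.5]]))      # label for previously unseen data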
2.4.2 Naïve Bayes
Naïve Bayes (NB) is a simple probabilistic classifier that, despite its simplicity, has been shown to
produce good results on a variety of data sets [38]. NB relies heavily on the assumption of
independence for each of its training features. That is, NB assumes that the presence or absence of
one feature is not related to the presence or absence of any other feature in the model. More formally,
the conditional independence condition can be expressed [38] as:

    P(X \mid Y, Z) = P(X \mid Z)    (2.15)
where X,Y and Z are all random variables. The conditional independence assumption is important
because it allows us to consider the class-conditional probability for each training feature, given
a class label, as opposed to computing the class-conditional probability for every combination of
training features. Therefore, given a set of features X and a class label L, the posterior probability
for each class can be calculated as follows [38]:

    P(L \mid X) = P(L) \prod_{i=1}^{d} P(X_i \mid L)    (2.16)

From the training set labels, the prior probability P(L) can be easily calculated, which leaves
only the task of calculating the remaining model parameter for the NB classifier, P(X_i \mid L). This
too can be calculated directly from the features in the training set.
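An illustrative sketch of (2.16), assuming Gaussian class-conditional densities (a common choice for continuous features; the thesis's WEKA setup is not reproduced here):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    X = np.array([[0.1, 1.0], [0.2, 0.9], [0.9, 0.1], [1.0, 0.2]])
    L = np.array([0, 0, 1, 1])
    nb = GaussianNB().fit(X, L)              # estimates P(L) and P(X_i | L)
    print(nb.predict_proba([[0.15, 0.95]]))  # normalized posterior per Eq. 2.16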
2.4.3 JRip
JRip is an implementation of the Repeated Incremental Pruning to Produce Error Reduction
(RIPPER) algorithm [4]. This rule-induction algorithm is attractive because it scales linearly with
the number of training examples, handles situations where the class distribution is uneven, and
has been shown to work well with noisy data [38].
In the two-class case, JRip initializes by setting the majority class to be the default class; rules
are then learned to detect the minority class. To do so, the algorithm labels all examples of the
minority class as positive and all other examples as negative.
A rule is grown by examining the information gain of adding each new condition to the rule's
antecedent. Once the rule no longer covers any negative examples, no more conditions are added
and the algorithm moves to the pruning phase. The pruning metric for the JRip algorithm is
(p+1)/(p+n+2) [33], where p and n denote the number of positive and negative examples,
respectively. The rule is then added to the rule set provided that it does not cause the rule set to
exceed the minimum description length (defaulted to 64 bits in JRip [12, 33]) and that the rule's
error rate on the validation set does not reach 50%.
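The pruning metric itself is straightforward; a minimal sketch follows (illustrative only, since WEKA's JRip computes this internally):

    def prune_metric(p: int, n: int) -> float:
        # Laplace-style value of a rule covering p positive and n negative examples.
        return (p + 1) / (p + n + 2)

    print(prune_metric(30, 5))  # ~0.84: a mostly-positive rule scores high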
2.4.4 J48
J48 is a statistical classifier that implements the C4.5 decision tree algorithm. J48
works by constructing a tree based on the features and associated class labels, which is later pruned
to produce an efficient means of classifying a previously unseen feature set.
J48 uses information gain to determine which training feature most effectively splits the training
examples into subsets. Information gain, also referred to as Kullback-Leibler divergence, is a measure
of the reduction of uncertainty regarding the classification of an example in a training set. Given
a feature set X_i for a set of examples E_d and a corresponding class label vector L_d, J48 iterates
through X_i and calculates the information gain from splitting on each feature. The feature with
the highest information gain becomes a decision node that splits on that feature. The algorithm is
then applied recursively to each subsequent subset, until all the training examples in that subset
belong to the same class. At that point, a leaf node is created, indicating that the corresponding
class label should be selected if that leaf node is reached.
After the tree is constructed, a pruning algorithm can be applied to eliminate branches of the tree
which do not aid the classification task, replacing them with leaf nodes.
The use of information gain as the cost function helps produce small trees, which in conjunction
with the binary tree structure makes them easy to traverse when labeling previously unseen data.
Additionally, the pruning involved with J48 further reduces the size of the resulting decision tree. This
is an important feature of J48 to consider when dealing with a data set with as many dimensions as
the data used in this research.
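A rough analogue of this behavior (an assumption: scikit-learn's decision tree with the entropy criterion, standing in for WEKA's J48) shows information-gain splits plus a pruning step:

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)
    tree = DecisionTreeClassifier(criterion='entropy',  # information-gain splits
                                  ccp_alpha=0.01)       # prune low-value branches
    tree.fit(X, y)
    print(tree.get_depth(), tree.get_n_leaves())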
2.4.5 IBk
IBk is an implementation of the k-Nearest Neighbor algorithm in the software program WEKA [12].
It is an instance-based learner with a fixed neighborhood determined by k, where k indicates the
number of neighbors used, with a default of 1. The IBk algorithm can select an appropriate value
of k via leave-one-out cross-validation.
IBk classifies previously unseen examples based on their proximity to training data already
placed in the instance space. Because IBk is instance-based, it does not create a model using the
specified features like the previously discussed algorithms. Instead, it compares each new example of
unlabeled data to labeled data already encountered. Instance-based learning is an example of "lazy
learning", where training examples are not used to build a class model and are instead stored and
accessed when unlabeled data is encountered.
The primary design decision when implementing an IBk test is the selection of the distance
measure. The two most common distance measures are the Euclidean Distance, used for continuous
values, and the Hamming Distance, more suited for discrete attribute values [38]. In a feature space
consisting of two examples with feature sets E_1 and E_2, the Euclidean distance is simply the length
of the line segment connecting the two examples, or, more formally:

    d(E_1, E_2) = \sqrt{\sum_{i=1}^{n} (E_{1,i} - E_{2,i})^2}    (2.17)
The Hamming Distance can be thought of as the number of substitutions that would have to be
made in order to turn one feature set into another, or the number of positions at which two equal-
length feature vectors differ. The Hamming Distance works well in cases where features belong to
a discrete set of values, but is meaningless for continuous-valued feature sets.
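Minimal sketches of the two distance measures discussed above (illustrative helper functions, not part of IBk itself):

    import numpy as np

    def euclidean(e1: np.ndarray, e2: np.ndarray) -> float:
        # Length of the line segment between two feature vectors (Eq. 2.17).
        return float(np.sqrt(np.sum((e1 - e2) ** 2)))

    def hamming(e1, e2) -> int:
        # Number of positions at which two equal-length vectors differ.
        return sum(a != b for a, b in zip(e1, e2))

    print(euclidean(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 5.0
    print(hamming(['a', 'b', 'c'], ['a', 'x', 'c']))              # 1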
The selection of a value for k is also an important decision when setting up an IBk experiment.
As is the case with this research, a binary class problem often forces one to use an odd value of
k to prevent ties when doing a majority vote. An optimal value of k can be selected within the IBk
algorithm by using leave-one-out cross-validation. A larger value of k can help to reduce noise, but
it can blur the boundaries between classes, particularly when the number of classes is greater than
two.
The drawbacks of the IBk algorithm include the fact that if an example set contains an over-
abundance of a particular class, the labeling of new instances can be dominated by that class. This
can be overcome by weighting the votes of the selected nearest neighbors, often by a function related
to their distance from the unlabeled example [44].
IBk is resistant to noise and has demonstrable success in Information Retrieval research, such as
text processing [38, 44].
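A hedged sketch of the nearest-neighbor procedure (scikit-learn substituted for WEKA's IBk), including the distance weighting mentioned above as a counter to class imbalance:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
    y = np.array([0, 0, 0, 1, 1, 1])
    knn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)
    print(knn.predict([[0.15], [1.05]]))  # nearest labeled examples decide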
Chapter 3
Review of Related Literature
The problem of vocal region detection, like other Music Information Retrieval problems, is a
relatively new area of research, but it has been aided by research in the related area of speech
analysis. As such, there is a strong relationship between the work being done in speech analysis
and similar work done with music analysis [1, 3, 8, 11, 15, 16, 17, 19, 20, 27, 31, 40, 41, 43, 45].
The applicability of speech analysis techniques in music applications encourages research, while
the structural differences between speech and music signals should be noted and investigated in
an effort to improve the performance of those techniques in music analysis.
Tsai and Wang identified a singer in a song by recognizing that the singing voice has harmonic
features that are not present in other instruments [40]. Their task was to automatically recognize
a singer in a song, and they separated vocal and non-vocal regions by building a Gaussian Mixture
Model (GMM) classifier. They used Mel-scale frequency cepstral coefficients, calculated on a fixed-
length sliding window, as their feature vector. This was done to model common speech-recognition
tasks, which is a common technique in Music Information Retrieval. They performed their analysis
on the vocal region segments in order to identify the singer, using another GMM classifier and a
custom decision function. They achieved a vocal region detection accuracy of just over 82% for songs
with a solo vocalist. The data they used comprised three separate sets of songs: 242 solo tracks,
22 duet tracks and 174 instrument-only tracks. The vocal tracks featured 10 male and 10 female
Mandarin pop singers.
Ramona, Richard and David proposed a learning methodology for identifying vocal regions by
creating an extensive feature vector, to which they applied a classifier to reach their decision [31].
They built their feature vector from a range of song characteristics computed on successive frames
of a song. The feature vector contained 116 components, including 13th-order Mel-scale frequency
cepstral coefficients, linear predictive coding coefficients and zero-crossing rate. Those feature
vectors were fed into a Support Vector Machine classifier with a radial basis function, distinguishing
between the two classes: vocal and non-vocal. The results of the classifier were then smoothed using
a two-state Hidden Markov Model. They achieved a frame classification accuracy of around 82%
for a data set of 93 songs, all from different artists.
Bartsch and Wakefield proposed an algorithm for identifying the singing voice in a song with
no instrumentation using the spectral envelope of the signal [1]. They estimated the
spectral envelope by using a composite transfer function, which utilizes the instantaneous amplitude
and frequency of segments of the music signal. The computed features were then fed into a quadratic
classifier, which yielded an accuracy of roughly 95%. Their data set consisted of 12 classically
trained female singers vocalizing a series of five-note exercises. This high accuracy demonstrates the
importance of being able to properly isolate the vocal regions, since such good results were only
achievable because the singing voice appeared by itself, without any background music.
Mesaros, Virtanen and Klapuri evaluated methods for identifying singers in polyphonic music
using pattern recognition and vocal separation algorithms [26]. They took two approaches to their
singer identification task: in one, they extracted their model's feature vector directly from the music
signal, and in the other, they attempted to extract the vocal channel from the polyphonic music
signal. The extraction of the vocal line from the complete music signal required use of a previously
developed melody transcription system, the output of which was then fed to a sinusoidal modeling
re-synthesis system. They used an internally developed data set of 65 songs from 13 unique singers in
their singer identification task, and found that the models which were fed the extracted vocal channel,
as opposed to the complete music signal, fared much better, achieving up to 67% singer classification
accuracy as compared to 42% for music signals with the worst singer-to-accompaniment ratios. This
demonstrates that being able to effectively isolate the vocal regions has a noticeably positive impact
on tasks that require extracting information about the vocalist in a song.
Tzanetakis, Essl and Cook proposed a learning technique that uses the Discrete Wavelet Trans-
form to classify a variety of non-speech audio signals [41]. Their implementation made use of the
level-four Daubechies wavelet family to generate wavelet coefficients for their feature vector. For
comparison, they also built models from features extracted from the Short Time Fourier Transform
(STFT) as well as MFCC features. The data they used consisted of audio separated into three
classes: MusicSpeech (126 files), Voices (60 files) and Classical (320 files). They then trained a Gaus-
sian classifier on each of the three feature vector sets, and found that the Gaussian classifier trained
with DWT coefficients performed better than random classification and its performance was on par
with the classifiers trained on MFCC features and STFT features, producing classification accuracies
that were consistently within 10% of the best performing model.
Kronland-Martinet, Morlet and Grossmann explored how sound patterns can be analyzed by using the
Wavelet Transformation [17]. They built a real-time physical signal processor, through which they
fed a variety of audio signals, including chirps, spoken words and notes played on a clarinet. By doing
so, they were able to note that the Wavelet Transformation produced outputs that they believed
could be used in a variety of signal processing and pattern recognition research areas. With regard
to speech in particular, their results showed that the segmentation of speech sounds would be possible
using, in part, information gained by performing the Wavelet Transformation. Their promising
results motivated my work in exploring how the Wavelet Transformation could be used in identifying
vocal regions within music.
Gowdy and Tufekci proposed a new feature vector for speech recognition that adapted the MFCC
algorithm to leverage the desirable properties of the Discrete Wavelet Transformation [11]. They
proposed that the time and frequency localization property exhibited by the coefficients calculated
by performing the Discrete Wavelet Transformation would provide a more noise-resistant learning
feature when substituted for the Discrete Cosine Transformation in the MFCC algorithm. In their
research, they found that models built with these features, named Mel-Frequency Discrete Wavelet
Coefficients (MFDWC), outperformed similar models built with the standard MFCC features for
both clean and noisy speech, with a phoneme recognition rate increase of up to 10%. While they
performed their analysis on a variety of speech signals, the close relationship between speech and
music analysis suggests that it may be possible to apply the MFDWC features in the vocal region
detection domain and achieve similar results.
Didiot, Illina, Fohr and Mella also explored the applicability of the Wavelet Transformation in improving the standard MFCC algorithm [8]. In their research, two class/non-class classifiers were built: one for speech/non-speech and the other for music/non-music. For their models, they calculated the energy distribution for each frequency band using the detail wavelet coefficients extracted by applying three wavelet families: Daubechies, Symlet and Coiflet. For each energy band, they calculated the instantaneous energy, the Teager energy and the Hierarchical energy. When compared to models built with the standard MFCC features, the researchers saw a reduction in error rate of more than 30%. The corpus employed in this research consisted of two hours of audio (both instrumental music and songs) and over four hours of broadcast media taken from French radio, consisting of news, interview and musical performance segments. While the research presented in this thesis covers a narrower corpus, strictly music extracted from popular music CDs, the success of Didiot et al. motivates further study in the applicability of wavelet-based energy features in a vocal region detection model.
Other researchers have explored a variety of other techniques, including utilizing the Fourier Transform [20, 29], exploring the benefits of the harmonic and energy features of the singing voice [28], and employing various frequency coefficients as the basis for a feature vector used in classification [15, 19, 22, 27, 34].
Previous research, as noted above, has focused primarily on building a model of the singing voice either from feature vectors created by examining the characteristics of the song (such as zero-crossing rate, sharpness, spread or loudness, among others [31]) or by examining the frequency domain representation of that signal using the Short-Time Fourier Transform [20, 29], used in the calculation of the standard Mel-Frequency Cepstral Coefficients. While those features have in some cases been able to act as effective features for some vocal models, each has its own set of drawbacks. Many of the song characteristics that have been used as features are common in the speech analysis domain. However, due to the presence of background instrumentation, the utility of those features is reduced [31]. Additionally, singing is over 90% voiced (voiced sounds are those produced by vocal cord vibrations), compared to speech, which is roughly 60% voiced [16], a difference that forces researchers to add additional steps to account for the differences between the two types of signals, such as harmonic analysis [16]. Other researchers have decided to forgo analyzing a signal using that set of characteristics, and instead attempted to use the Short-Time Fourier Transform (STFT) to analyze the frequency components of the music signal [20, 29].
The fact that the frequency components of a music signal change over time should be a consideration when attempting to detect vocal regions in that signal. Whereas the Short-Time Fourier Transform is unable to adequately capture those frequency changes, the Wavelet Transform analyzes a signal at a multitude of resolutions, offering a more complete characterization of the signal for use in classification.
Chapter 4
A Method for Detecting Vocal Regions
This chapter discusses the vocal region detection methodology presented by this thesis, used in conjunction with both Wavelet Energy features and MFDWC features to offer an improvement on previous research. The method described herein can be implemented using a variety of machine learning algorithms and data sets; the following chapter will discuss my selections and my justifications for doing so.
This thesis evaluates a methodology for detecting vocal regions in music using Wavelet Energy and MFDWC features. As mentioned above, the implementation specifics can be tuned based on the particular application of the research, but the components chosen for this thesis will be discussed in detail in the next section. I have developed this method as a means of leveraging the Wavelet Transformation in the problem of vocal region detection.
The methodology evaluated in this thesis builds upon a common framework for conducting
machine learning / knowledge discovery and data mining research. In general, that framework
involves the following steps:
1. Gather training data
2. Extract features
3. Build model
4. Gather testing data
5. Extract features
6. Test against model
7. Report results
Figure 4.1: The Training Phase of the Presented Methodology
That general methodology is evaluated in this research, and is adapted to utilize the aforementioned features, calculated using the Discrete Wavelet Transformation, in the task of identifying vocal regions in music.
Vocal region detection is a two class problem, where a learning algorithm attempts to distinguish between frames of music that contain vocals (assigned the class label "vocal") and those which contain pure instrumentation (assigned the class label "inst"). This methodology has two phases: a training phase and an application phase. The training phase of this methodology is shown visually in Figure 4.1.
The training phase involves gathering digital music, independent of format, and first segmenting it into overlapping frames of a predetermined length with a predetermined overlap, such that the length is a power of two; for the purpose of this thesis, the window length was set at 1,024 samples with an overlap of 512 samples. Overlapping frames are used because they give the Wavelet Transformation more coverage with which to track the frequency changes of the signal over time. Additionally, the Discrete Wavelet Transformation operates by successively subsampling the signal by two; therefore, in order for the algorithm to be most efficient (without having to pad the frame with zeros to get it to the right size), the frame length must be a power of two.
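As an illustration, the following sketch shows one way such overlapping frames can be produced (Python with NumPy, a stand-in for the MATLAB processing used in this thesis; the function name and defaults are illustrative):

import numpy as np

def frame_signal(signal, frame_len=1024, hop=512):
    # Segment a mono signal into overlapping frames 1,024 samples long (a
    # power of two, as the DWT requires for efficiency) with a 512-sample
    # overlap; any trailing partial frame is dropped.
    n_frames = (len(signal) - frame_len) // hop + 1
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])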
After the frames are extracted, successive Discrete Wavelet Transformations are applied to each frame until a maximum level decomposition has been performed, and detail coefficients are extracted at each level up to the maximum level. Once the transformation coefficients are extracted, they are then used in the calculation of one of two features: MFDWC features and Wavelet Energy features.
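A minimal sketch of this decomposition step, using the open-source PyWavelets library as an analogue of the MATLAB Wavelet Toolbox employed in this thesis, might look as follows (db4 is shown only as an example wavelet):

import pywt

def detail_coefficients(frame, wavelet='db4'):
    # Decompose the frame to the maximum possible level and keep the detail
    # coefficients from every level; the final approximation coefficients
    # (coeffs[0]) are not used.
    max_level = pywt.dwt_max_level(len(frame), pywt.Wavelet(wavelet).dec_len)
    coeffs = pywt.wavedec(frame, wavelet, level=max_level)
    return coeffs[1:]  # ordered from the lowest to the highest frequency band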
Mel-Frequency Cepstral Coefficients are a common feature in vocal region detection research and in broader signal analysis applications. But while those features tend to perform well in vocal region related tasks, there exist possible areas of improvement. MFCCs rely on the fixed window size of the STFT and thus have poor time resolution. The Discrete Wavelet Transformation offers an opportunity to make up for those limitations by providing multiresolution analysis and good time and frequency resolution, as well as providing features in the form of wavelet coefficients, which are resistant to noise and, in the case of db4, have a high number of vanishing moments.
The first wavelet transformation features tested are Mel-Frequency Discrete Wavelet Coefficients (MFDWC), as described by Gowdy and Tufekci in [11]. In their research, Gowdy and Tufekci note that one of the advantages of the Discrete Wavelet Transformation (DWT) is its localization property in the time and frequency domains. Localization is important because different parts of a signal can contain different distinguishing information. A drawback of using MFCCs is the utilization of the Discrete Cosine Transformation (DCT). The DCT is computed over a fixed-length window, giving it the same resolution in time and frequency. But in a signal that has changes in time and frequency, like music, it can be helpful to track those changes at different resolutions. The multiresolution characteristic of the DWT makes it well adapted to tracking those changes, because the basis vectors of the DWT that capture the high frequency components of the signal have better time resolution than the basis vectors used to calculate the low frequency components. Conversely, the basis vectors used to calculate the low frequency components of the signal have better resolution in the frequency domain than those used to find the high frequency components. Therefore, excellent resolution in both the time and frequency domain is achieved by using the DWT, and thus the DWT coefficients are better localized in time and frequency than the DCT coefficients.
In order to take advantage of that localization property, the researchers in [11] replace the DCT in the MFCC algorithm with the DWT and name the resulting coefficients MFDWC, Mel-Frequency Discrete Wavelet Coefficients. I utilize that approach in my research with two changes. First, in their research, Gowdy and Tufekci used Cohen, Daubechies and Feauveau's wavelet family in order to take advantage of its compactly supported bi-orthogonal spline properties. For the purposes of my research, I used the Daubechies wavelet family, due to its high number of vanishing moments for a given support and its demonstrated usefulness in a wide variety of signal analysis domains [7, 8, 11, 43]. Second, the researchers in [11] used eight coefficients at scale four, four coefficients at scale eight, two coefficients at scale 16 and one coefficient at scale 32, whereas I perform a maximum-level decomposition of the frame and utilize all the coefficients. Additionally, Gowdy and Tufekci tested their coefficients on a speech data set, whereas I test the effectiveness of MFDWC features on a music data set.
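To make the substitution concrete, the sketch below assumes that the DWT replaces the DCT as the final step of the MFCC pipeline, operating on a precomputed vector of log mel-filterbank energies and keeping every coefficient from a maximum-level decomposition; it is an illustration of the approach, not the exact implementation used here:

import numpy as np
import pywt

def mfdwc(log_mel_energies, wavelet='db1'):
    # Final MFCC step with the DCT swapped for the DWT: decompose the log
    # mel-filterbank energies to the maximum level and concatenate every
    # resulting coefficient into the feature vector.
    max_level = pywt.dwt_max_level(len(log_mel_energies),
                                   pywt.Wavelet(wavelet).dec_len)
    coeffs = pywt.wavedec(log_mel_energies, wavelet, level=max_level)
    return np.concatenate(coeffs)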
The second feature set I utilize in this thesis that is built using the DWT is based on research conducted by Didiot, Illina, Fohr and Mella [8]. These researchers recognized the limitations of the Fourier Transformation in analyzing non-stationary signals. They used the DWT to calculate three different energy features: Instantaneous energy, Teager energy and Hierarchical energy. For the calculation of each energy feature, the wavelet coefficients are denoted by w_j(r), where j denotes the frequency band and r denotes the time. N_j is the length of the window in frequency band j, and J denotes the lowest frequency band. The equations [8] for Instantaneous, Teager and Hierarchical energy are shown in Equations 4.1, 4.2 and 4.3, respectively.
f_j = \log\left(\frac{1}{N_j}\sum_{r=1}^{N_j}(w_j(r))^2\right)    (4.1)

f_j = \log\left(\frac{1}{N_j}\sum_{r=1}^{N_j-1}\left|(w_j(r))^2 - w_j(r-1)\,w_j(r+1)\right|\right)    (4.2)

f_j = \log\left(\frac{1}{N_j}\sum_{r=(N_j-N_J)/2}^{(N_j+N_J)/2}(w_j(r))^2\right)    (4.3)
Instantaneous Energy gives the energy distribution in each frequency band. Teager energy has
been shown in a number of speech applications to give formant information and is resistant to noise
[8]. Whereas instantaneous energy is a measure of the amplitude of the signal at a point in time,
Teager energy shows the variations of the amplitude and frequency of the signal. Hierarchical energy
provides time resolution for a windowed signal and has demonstrated applications in speech analysis.
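The following sketch shows how Equations 4.1-4.3 might be computed from the per-band detail coefficients (Python with NumPy; the epsilon guard against log(0) and the handling of the boundary terms in the Teager sum are my assumptions, as [8] does not specify them):

import numpy as np

def energy_features(detail_coeffs):
    # detail_coeffs: one array of detail coefficients per frequency band j;
    # the shortest array is taken as the lowest frequency band J.
    eps = 1e-10                                  # guards against log(0)
    N_J = min(len(w) for w in detail_coeffs)     # window length of band J
    features = []
    for w in detail_coeffs:
        N = len(w)
        inst = np.log(np.sum(w ** 2) / N + eps)                 # Eq. 4.1
        # Teager energy over the interior samples (boundaries skipped).
        teager = np.log(np.sum(np.abs(w[1:-1] ** 2
                                      - w[:-2] * w[2:])) / N + eps)  # Eq. 4.2
        lo, hi = (N - N_J) // 2, (N + N_J) // 2  # centered sub-window
        hier = np.log(np.sum(w[lo:hi] ** 2) / N + eps)          # Eq. 4.3
        features.extend([inst, teager, hier])
    return np.array(features)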
My utilization of those energy values differs from that of the original researchers in two ways. First, the researchers in [8] used a variety of different decomposition levels, whereas I perform a maximum level decomposition of the frame. This was done because I ran a series of pilot studies on a subset of the data in which I varied the decomposition level and examined its effect on the classification accuracy of a vocal region detection model that used just the raw detail Discrete Wavelet coefficients as the input feature vector. Preliminary results from those pilot studies showed that the accuracy of the system increased as the number of decomposition levels increased. Also, I was unable to find anything in the literature that suggested a certain decomposition level should be used, thus motivating me to use every coefficient at every level. Additionally, the researchers tried a variety of different combinations of energy values for their feature vector, whereas I use every energy value produced in my feature vector. Also, the scope of each research project differs.
They were attempting to apply these features in a dual binary classification system, speech/non-speech and music/non-music, whereas this thesis looks to distinguish between vocal and non-vocal regions of a song. Additionally, they applied their features to a corpus consisting of 15-second-long segments of audio (20 files of speech and 21 of music), in addition to French news radio recordings and French entertainment radio broadcasting, which consisted of interviews and musical programs. While their data set is quite diverse, it has a relatively minor music component, which is the sole focus of my research.
The research in this thesis uses, for both the MFDWC and Wavelet Energy features, all detail coefficients extracted by performing a maximum-level DWT decomposition. The presented methodology can be generalized to use any number of coefficients from any decomposition level. It may be advantageous not to use every coefficient, for reasons such as the wavelet family used or the prohibitive cost of storing every coefficient.
Once the desired feature is calculated, the training phase is completed when a machine learning model is built. The specific algorithms I used in this thesis are described in general in the Background section, and with specifics to this research in the next section, but this method generalizes to any machine learning algorithm.
The second phase of the two-phase methodology described in this thesis is the application phase.
In the application phase, frames are produced from a set of unlabeled digital music and features are
extracted from those frames in the same manner as described above. Those resulting feature vectors are then applied to the model built in the training phase; the model determines whether each unlabeled frame contains vocalizations or pure instrumentation and assigns it the appropriate class label.
As discussed above, the only implementation-specific aspects of this methodology are the features: Wavelet Energy and MFDWC. For the other components, this method generalizes beyond the selections made in this thesis, meaning that any other digital music format (Vorbis, WMA, etc.), wavelet family (Symlets, Coiflets, Morlet, etc.) or learning algorithm (Hidden Markov Models, Gaussian Mixture Models, etc.) could be applied within this algorithmic architecture.
The novel contributions of this methodology lie in its adaptation of the standard machine learning / knowledge discovery and data mining research methodology to demonstrate the performance of the Discrete Wavelet Transformation on music data. The only details of the methodology that are pre-determined are the training features, MFDWC and Wavelet Energy. This methodology offers a means of replacing specific aspects of the experimental evaluation presented here with other components that a researcher may wish to test. Future research may wish to evaluate the impact of using a wavelet family other than Daubechies, an evaluation that can be performed using this methodology. The same is also true for those wishing to evaluate different machine learning techniques or input data. This methodology is novel because of the flexibility it allows in evaluating the performance of wavelet-related features in a time-series machine learning task.
In the next section I will discuss the specific component choices that were made for this thesis and my justification for choosing them, and I will present the experiments used to evaluate the performance of this methodology.
Chapter 5
Experiments
5.1 Implementation Details
The aim of the research presented in this thesis is to leverage the favorable properties of the wavelet
transformation to improve the accuracy of a vocal region detection system when compared to the
same models built using a standard musical signal feature, MFCCs. In the following sections I will
describe the implementation details of my experiments. That process is shown visually in Figure 5.1.
5.1.1 Data Preparation
One of the predominant issues facing researchers interested in studying singing voice detection and identification has been the lack of a standard music data set. Each of the research papers mentioned in this thesis details research performed on a set of songs unique to that particular research project. Each of those sets contains songs that vary in length, sampling rate and music type, and in some cases, the differences can be as large as sets that contain just vocals and sets that contain a vocal/instrumentation mix, for instance [1] and [16], respectively. These differences make it difficult to confidently compare the research results from one paper to the next.

Figure 5.1: Experimental Method
It is because of those issues that this thesis uses a pre-existing data set that has already been deployed in a research setting. Out of the numerous music data set options, I use the artist20 dataset [10], compiled by Dan Ellis at Columbia University. The artist20 set is comprised of 20 artists, each contributing six full albums, for a total of 1,413 songs. This set grew out of the artist identification work Dr. Ellis' research team has been performing. The songs are full-length 32 kbps mono-tracked MP3s. The songs have been down-sampled from the original 44.1 kHz stereo tracks to a sampling rate of 16 kHz, bandlimited to 7.2 kHz. In their research, the downsampling did not adversely affect their results [10], but the sound quality is degraded enough to avoid conflicts with the owners of the songs' copyrights.
For the purposes of this research, only one album per artist was used. This was done to strike a balance between having confidence in the results and the considerable amount of time required to annotate each song in the set by hand. In total, a subset of the artist20 data set consisting of 221 songs makes up this thesis' data set. The songs can be broadly classified as American pop music, and the set consists of four female artists and 16 male artists spanning eras from the 1970s to the 2000s.
In order to extract Wavelet Transformation coefficient (WTC) features from music frames and properly label those frames as belonging to either a vocal region or a non-vocal region, each song is annotated using the Transcribe! software program [36]. This was done by listening to each of the 221 songs and marking the vocal region boundaries by ear. The Transcribe! application allows the user to slow the song down to as little as 25% of the original playing speed, providing a means of more accurately determining the subtle point at which the singing voice trails off, leaving pure instrumentation. In some cases, that boundary is less clear-cut than what is ideal, particularly with artists such as Prince, Madonna and Tori Amos. While it may be ideal to use multiple people to annotate the training data, the presence of other annotators would not have altered the ground-truth data significantly. This is due to the fact that although there does exist some ambiguity around the vocal/non-vocal boundaries in some songs, by allowing the user to slow down a song and place markers along the waveform, Transcribe! offers a means of examining a song in such a way that the effect of those ambiguities is minimized as much as possible. The vocal boundaries can then be marked and Transcribe! can be used to split the song based on those markers, exporting the resulting song samples as .wav files for MATLAB to import and extract features from. A screen shot of Transcribe! with Aerosmith's "Dream On" open is shown in Figure 5.2.

Figure 5.2: Transcribe! window with Aerosmith's "Dream On" open
5.1.2 Feature Extraction
Features are extracted from the music clips and arranged in a feature vector suitable for a machine learning task. All the processing was done using MATLAB with the Wavelet Toolbox [25]. The song clips were broken down into overlapping frames 1,024 samples long with an overlap of 512 samples, in keeping with a convention used by many researchers [1, 19, 37, 40], while remaining a power of two in size, which is helpful when performing wavelet analysis [23]. The last frame in each segment was dropped, as it was unlikely to contain the correct number of samples and the vocal boundaries were often blurred at the edges.
Each of the three features used in this thesis was extracted on a per-frame basis. Those were MFCCs, MFDWCs and Wavelet Energy features. The MFCC algorithm is a well-established signal analysis technique, and it is quite simple to implement, making it an ideal control feature. The steps are as follows:
1. Take the Fourier Transform of a frame
2. Using a series of triangular overlapping windows, map the powers of the spectrum obtained in
step one to the mel scale
3. Take the log of the values obtained in step two
4. Take the Discrete Cosine Transformation of the values found in step three
5. Return the top 13 coe�cients obtained in step four
Thirteen coefficients were used for MFCCs because, after the 13th coefficient, the remaining values were very small and the variance between the remaining coefficients was also very small.
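For reference, the five steps above can be sketched as follows (Python with NumPy/SciPy; the filter count and the simple triangular filterbank construction are illustrative assumptions, as the thesis computed MFCCs in MATLAB):

import numpy as np
from scipy.fft import rfft, dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=16000, n_filters=26, n_coeffs=13):
    # Step 1: power spectrum of the frame via the Fourier transform.
    power = np.abs(rfft(frame)) ** 2
    n_bins = len(power)
    # Step 2: triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_bins - 1) * mel_to_hz(mel_pts) / (sr / 2.0)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # Step 3: log of the mel-band energies (epsilon avoids log(0)).
    log_energies = np.log(fbank @ power + 1e-10)
    # Steps 4 and 5: DCT, keeping the top 13 coefficients.
    return dct(log_energies, norm='ortho')[:n_coeffs]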
For the Wavelet Transformation features, MATLAB's Wavelet Toolbox was utilized. The Wavelet Toolbox offers a fast implementation of the Discrete Wavelet Transformation, which the feature extraction algorithms use to calculate the wavelet coefficients underlying their features.
Additionally, MATLAB produces a file with each frame's feature vector and associated class label ("vocal" vs. "inst") in .arff file format, a format used by machine learning software such as Weka [12] and Rapid Miner [33].
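For illustration, such a file follows the layout sketched below; the attribute names and values here are hypothetical, and a real feature vector carries one numeric attribute per coefficient or energy value:

@relation vocal_detection

@attribute instantaneous_energy_1 numeric
@attribute teager_energy_1 numeric
@attribute hierarchical_energy_1 numeric
@attribute class {vocal, inst}

@data
-3.1742, -2.8810, -3.4405, vocal
-4.0265, -3.5521, -4.1173, inst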
5.1.3 Classification
Those .arff files are fed to the Java-based machine learning library Weka, version 3.6 [12]. Weka is responsible for loading the .arff files, sampling the input as necessary and performing the required classification task for the current experiment. Sampling was performed on the input because the number of instances grew to a point where the available memory of the machine running the tasks was exhausted. For the experiments that required it, the input data was sampled at rates starting at 2% and increasing to 40% in 2% increments; by 40%, the performance of the classifier was no longer increasing with the number of instances used to train it. Each classifier was run with the default settings provided by Weka.
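The sampling step can be sketched as follows (Python with NumPy as a stand-in; Weka performed the actual sampling, and this function is illustrative only):

import numpy as np

def subsample(features, labels, rate, seed=0):
    # Randomly retain a fraction `rate` of the instances (swept from 0.02
    # to 0.40 in 0.02 steps) so the training set fits in available memory.
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(labels), size=int(len(labels) * rate), replace=False)
    return features[keep], labels[keep]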
5.1.4 Obtaining Results
Weka also generates accuracy and mean squared error results for each experiment. These
results were then compiled into the tables that can be viewed in the Results subsection under each
experiment described later in this chapter, and in their full form in the Appendix. Accuracy and
mean squared error were selected because they are common metrics used when discussing classification results.
This thesis consists of five experiments, each with a different scope, in an effort to test the robustness of the vocal region detection methodology. The performance of each feature vector during each experiment offers insight into the scope, robustness and utility of Wavelet Coefficient features in vocal region detection problems.
5.2 Experiment 1: Measuring Overall Performance
The first experiment is designed as a broad examination of the effectiveness of wavelet coefficient features in a vocal region detection task when compared to a control feature, MFCCs. Models are built using MFCC features, Wavelet Energy features using Daubechies (db) 1 through 4, and MFDWC features using db1-db4. The goal of this experiment is to discover which of the five aforementioned classifiers, run using their default settings, performs best in the vocal region detection task, and, with regard to wavelet coefficient features, which wavelet out of db1-db4 performs best. The best performing classifier and wavelets are then used for the remaining four experiments.
For Experiment 1, I evaluated the five classification methods using 10-fold cross-validation. All the feature vectors for a particular artist are grouped together and divided into 10 subsamples. A single subsample is retained as the validation data for testing the model, and the remaining nine subsamples are used as training data. This is repeated until each subsample has been used as validation data. The 10 results are averaged to determine a single value for the performance of the system.
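The following sketch mirrors that procedure (Python with scikit-learn as a stand-in for Weka; the CART-style DecisionTreeClassifier is used here in place of Weka's J48, so details differ):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def cross_validate(features, labels, folds=10):
    # Divide one artist's frames into 10 subsamples; each subsample serves
    # once as validation data while the remaining nine train the model.
    scores = []
    for train_idx, test_idx in KFold(n_splits=folds, shuffle=True).split(features):
        model = DecisionTreeClassifier().fit(features[train_idx], labels[train_idx])
        scores.append(model.score(features[test_idx], labels[test_idx]))
    return np.mean(scores)  # single averaged performance value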
5.2.1 Experiment 1 Results
As described above, Experiment 1 involves training a model using frames from a single artist, and validating that model using 10-fold cross-validation.
MFCC Features
Table 5.1 shows the accuracy and mean squared error results after running each of the five learning algorithms using MFCC features. The numbers shown in the table represent the average of the 10 validation runs. As you can see from the table, with the exception of Naïve Bayes, the learning algorithms all performed roughly the same, within 2% of each other, near 75%.

Table 5.1: Experiment 1 Results for Vocal Region Detection Using MFCC Features on Individual Frames from Single Artists

Algorithm       Accuracy   Mean Squared Error
Naïve Bayes     70.49      0.3487
JRip            75.27      0.3589
J48             75.55      0.3019
IBk             75.89      0.2948
SVM             73.67      0.2633
MFDWC Features
Table 5.2 shows the results of Experiment 1 using MFDWC features. Each of the five learning algorithms was trained using MFDWC features extracted by applying each of the four wavelets: db1, db2, db3 and db4. In this experiment, J48 distinguishes itself by having an accuracy above 80% for three of the four wavelets, while features extracted using db1 also tend to have the best performance.
Wavelet Energy Features
Table 5.3 shows the results of Experiment 1 using the Wavelet Energy features, where energy values for the frames are calculated using the detail coefficients that arise as a result of applying the Discrete Wavelet Transformation. Once again, J48 clearly performs the best out of the five algorithms, while for this feature set the db4 wavelet produces the highest accuracy scores.
5.2.2 Experiment 1 Discussion
The purpose of Experiment 1 is to introduce wavelet features to the vocal region detection problem, comparing the results to a well-known and widely-applied feature set, MFCCs. The data set used is the artist20 data set, which, as discussed in previous sections, is a more comprehensive data set than some of the others applied in the literature. It contains songs from 20 different artists, spanning decades, genres and vocalist genders. The best performing algorithm was IBk, with an accuracy of 75.89%. However, J48 and JRip were not far behind, posting accuracy scores of 75.55% and 75.27%, respectively.
Table 5.2: Experiment 1 Results for Vocal Region Detection Using MFDWC Features on Individual Frames from Single Artists

Algorithm     Wavelet   Accuracy   Mean Squared Error

Naïve Bayes   db1       70.26      0.3598
              db2       67.83      0.3679
              db3       68.75      0.3552
              db4       65.69      0.3893

JRip          db1       72.59      0.3784
              db2       69.86      0.4027
              db3       70.57      0.3968
              db4       66.82      0.4279

J48           db1       83.57      0.1837
              db2       81.40      0.2101
              db3       82.28      0.1987
              db4       75.76      0.2843

IBk           db1       78.26      0.2542
              db2       74.34      0.2984
              db3       74.75      0.2892
              db4       71.00      0.3249

SVM           db1       69.49      0.3056
              db2       68.49      0.3167
              db3       69.27      0.3073
              db4       65.96      0.3404
Table 5.3: Experiment 1 Results for Vocal Region Detection Using Wavelet Energy Features on Individual Frames from Single Artists

Algorithm     Wavelet   Accuracy   Mean Squared Error

Naïve Bayes   db1       59.51      0.4042
              db2       59.64      0.4032
              db3       59.81      0.4014
              db4       59.78      0.4017

JRip          db1       76.42      0.3413
              db2       77.25      0.3298
              db3       77.73      0.3272
              db4       77.97      0.3242

J48           db1       85.86      0.1566
              db2       86.23      0.1551
              db3       86.43      0.1551
              db4       86.66      0.1521

IBk           db1       75.95      0.2756
              db2       78.79      0.2459
              db3       80.53      0.2265
              db4       81.01      0.2215

SVM           db1       72.83      0.2717
              db2       73.42      0.2658
              db3       73.49      0.2651
              db4       73.51      0.2649
These values are lower than some results in the published literature [20, 31, 40]; however, the difficulty of the artist20 data set accounts for that difference: the number of songs in the artist20 data set is larger than the number of songs used in [20, 31, 40], and the songs are more diverse in terms of artist gender, genre and era. In addition to varying the learning algorithm, for the two features that involved the DWT, the wavelet was varied in order to find the best performing wavelet for that specific feature. As you can see from Table 5.2, MFDWC features perform below the level of MFCC features, with the lone exception of when MFDWC features were used with the J48 learning algorithm. In that case, MFDWC features outperformed MFCC features, and the db1 wavelet was found to be the best wavelet, posting an accuracy of 83.57%, roughly 8% higher than a J48 model built using MFCCs. As you can see from Table 5.3, this trend continues, with J48 performing the best. Additionally, the db4 wavelet provided the highest accuracy score, 86.66%, over 11% higher than the equivalent model built using MFCC features.
The results in Experiment 1 are encouraging enough to show that features calculated using the
DWT have the potential to outperform models built using standard MIR features, thus motivating
the research in the following four experiments. Additionally, I was able to determine that for MFDWC
features, applying the db1 wavelet produced the best results, while using the db4 wavelet with
Wavelet Energy features produced the best results. Also, J48 outperformed the other four learning
algorithms, an interesting result due to the fact that to the best of my knowledge, J48 does not
make an appearance in the vocal region detection literature. Most researchers favor Support Vector
Machines, which have a long history of use in the Speech/Music Analysis realm. My research,
however, suggests that it may not always be the best algorithm to apply. In order to validate
these results, I conducted a pilot study to determine if J48 would continue to outperform SVMs.
Experiments 2-5 were conducted on a subset of the data and SVMs were employed. In the pilot
study, SVMs were outperformed by J48 models. For example, in Experiment 3, SVMs only managed
an accuracy of 61.98% when a model was trained on Wavelet Energy features extracted from frames
from female artists and tested on frames from male artists. J48, however, in an identical experiment,
achieved an accuracy of 66.14%. These results added confidence in my utilization of J48 for the remainder of my tests. Thus, for Experiments 2-5, J48 was used as the learning algorithm, db1 was used in the calculation of MFDWC features, and db4 was used in the calculation of Wavelet Energy
features. The full results for Experiment 1 can be found in Appendix A.
5.3 Experiment 2: Measuring Performance Across Different Artists
Experiment 2 was designed to test how the model would perform when trained on frames from one artist and then tested on frames from a completely different artist. Performing well on Experiment 2 would show that wavelet features are able to generalize across artists. For Experiment 2, models are built using the MFCC features as the control features, and for the Wavelet Energy and MFDWC features, the best-performing wavelet from Experiment 1 is chosen. Models are built using features from one artist, and then each model is tested with frames from the remaining 19 artists, one by one. This is done for each of the 20 artists. After performing this experiment, a 20x20 matrix is produced showing the performance of a model trained on each artist when tested with frames from each of the remaining 19 artists. Each row was then averaged to determine a final number showing the average accuracy of a model built from a single artist.
5.3.1 Experiment 2 Results
The best performing setup from Experiment 1 is selected to carry forward to the remaining four experiments. For the MFDWC features, the db1 wavelet is chosen, while for the Wavelet Energy features db4 is selected. J48 was selected as the learning algorithm, due to how well it performed for each of the features in Experiment 1. For Experiment 2, a model was trained on features extracted from frames from one artist, and then tested on frames from a different artist. Table 5.4 summarizes the results by giving the average performance of a model built using features from the given artist on each of the remaining 19 artists for each of the three feature sets. To see how each artist performed on any other artist, please consult Appendix B.
5.3.2 Experiment 2 Discussion
Experiment 1 implied that wavelet features could be applied to the vocal region detection problem with better results than MFCCs. Experiment 2 was designed as a more rigorous test of that conclusion by building models with features from one artist and testing those models with features from a different artist. As you can see from Table 5.4, MFCC features show an improvement in comparative performance. MFDWCs clearly perform the worst out of the three features, and the difference between the average performance of MFCC and Wavelet Energy features never varies by more than 5%. While on average Wavelet Energy features do outperform MFCC features, the difference is not as stark as it was in Experiment 1.
Table 5.4: Experiment 2 Results for Vocal Region Detection on Individual Frames from Different Individual Artists

                        MFCC               MFDWC              Wavelet Energies
Training Artist         Accuracy   M.S.E.  Accuracy   M.S.E.  Accuracy   M.S.E.
Aerosmith 60.94 0.4204 56.42 0.4611 63.36 0.3919
Beatles 63.44 0.3901 56.03 0.4557 65.09 0.3756
CCR 59.03 0.4203 56.43 0.4483 61.89 0.3951
Cure 61.18 0.4096 58.74 0.4364 63.88 0.3829
Dave Matthews Band 64.95 0.3812 58.83 0.4614 66.00 0.3736
Depeche Mode 60.13 0.4245 56.97 0.4498 61.07 0.4054
Fleetwood Mac 64.62 0.3944 59.63 0.4426 65.68 0.3715
Garth Brooks 61.95 0.4024 56.82 0.4664 62.81 0.3942
Green Day 62.29 0.4074 57.42 0.4514 63.32 0.3849
Led Zeppelin 60.37 0.4254 57.78 0.4526 63.11 0.4037
Madonna 60.25 0.4217 54.72 0.4657 60.74 0.4028
Metallica 57.34 0.4208 56.97 0.4409 60.78 0.3947
Prince 58.78 0.4519 60.08 0.4532 59.25 0.4520
Queen 56.17 0.4483 55.04 0.4643 53.94 0.4741
Radiohead 60.56 0.4039 55.96 0.4564 59.89 0.4096
Roxette 64.81 0.3997 55.82 0.4555 66.08 0.3790
Steely Dan 60.67 0.4121 58.13 0.4327 62.24 0.3986
Suzanne Vega 58.68 0.4296 55.35 0.4653 63.56 0.3889
Tori Amos 61.67 0.4344 52.26 0.4819 64.54 0.3992
U2 62.72 0.4058 57.90 0.4500 62.68 0.3917
In general, the accuracy scores for Experiment 2 are lower than those in Experiment 1 across each feature, and the highest accuracy values across each of the five experiments were seen in Experiment 1. This is most likely due to the fact that in Experiment 1, models were both trained and tested on frames from a single artist. Artists were not isolated in the remaining four experiments, showing that a mixture of artists in vocal region detection research degrades the performance of the system.
5.4 Experiment 3: Measuring Performance Across Gender
Experiment 3 tests the generality of the features across gender. A model is trained on frames from male singers and tested on frames from female singers. Then a model is trained on frames from the four female singers and tested on frames from the remaining 16 male artists. The purpose of this experiment was to determine if DWT features were capable of generalizing across gender. As an additional consideration, this test introduces the possible hardware limitations of performing these tasks. Each artist produces between 50,000 and 70,000 frames. Combining artists into a single training or testing set quickly leads to an unmanageable number of instances, which the computer hardware had a difficult time handling. In order to compensate for that problem, the training and testing sets are sampled, starting at 2% and moving up to 40%, stepping up by 2% with each iteration. This is done for the tasks in Experiments 3-5. At each sampling interval, an accuracy score was produced, and it was shown that the accuracy results evened out before reaching the 40% upper limit. 40% was chosen as the upper limit based on empirical results from running a series of experiments with higher sampling rates, which showed no improvement in performance.
5.4.1 Experiment 3 Results
Experiment 3 examines the robustness of wavelet features in the presence of gender differences. The combination of data sets required to perform this experiment leads to unmanageably large training and testing data sets. In order to compensate for the additional complexity introduced in this experiment, the experiment is performed on a series of sampled training and testing sets. The sampling rate starts at 2%, and is increased by 2% up to 40%, where the accuracy values showed no significant increase in value. Table 5.5 shows the results of each of the three feature vectors at a sampling rate of 20%, which is when the accuracy values begin to level out. Full results from 2% to 40% are shown in Appendix C.
Table 5.5: Experiment 3 Results for Gender-Based Vocal Region Detection at a Sampling Rate of 20%
MFCC MFDWC Wavelet Energies
Training Set Test Set Accuracy M.S.E. Accuracy M.S.E. Accuracy M.S.E.
Female Male 62.98 0.4125 55.65 0.4698 66.14 0.3957
Male Female 62.16 0.4126 52.46 0.4779 66.43 0.3820
5.4.2 Experiment 3 Discussion
Experiment 3 was a further test of the performance of the wavelet features, separating the training and test sets based on the gender of the vocalist. As you can see from Table 5.5, MFDWC features continue to lag in performance behind MFCC and Wavelet Energy features, and that difference starts to become reasonably significant. Wavelet Energy features, however, regain a consistent advantage over MFCCs, averaging an accuracy improvement of roughly 4%. The calculation of MFDWC features differs from the calculation of MFCC features only in the utilization of the Discrete Wavelet Transformation instead of the Discrete Cosine Transformation. While the DWT coefficients are more localized in time and frequency than the DCT coefficients, making them more resistant to noise, the output of the DCT is orthogonal and is related to the Fourier Transform, which is used in the first step of the MFCC and MFDWC calculations. The shift that occurs when moving from the Fourier domain to the multiresolution DWT domain with MFDWCs could account for the degraded performance when compared to MFCCs, which remain consistently in the Fourier domain throughout their calculation. Additionally, it should be noted that the Fourier Transformation is a decomposition of a signal into a series of sines and cosines, the DCT is a decomposition into a series of cosines, and the DWT using the Daubechies wavelet family is neither.
5.5 Experiment 4: Measuring Overall Performance Across Groups of Artists
In Experiment 4, frames were taken from groups of artists, features were extracted and a model was built. That model was then evaluated using 10-fold cross-validation. This test was designed to determine if DWT features generalized to groups of artists; it mirrors Experiment 1, generalizing from single artists to multiple artists. Similar to Experiment 1, Experiment 4 is evaluated using 10-fold cross-validation.
Table 5.6: Experiment 4 Results for Vocal Region Detection on Frames from Multiple Artists at a Sampling Rate of 40%
Accuracy M.S.E. Accuracy M.S.E. Accuracy M.S.E.
72.12 0.3374 67.29 0.3795 73.19 0.3286
In Experiment 4, frames from all 20 artists are used, sampled at rates starting from 2% to 40% in 2% increments, to build and test a model. The sampling rate was increased up to 40% to show that the performance of the model would level out without having to introduce the added complexity of using the entire training set.
5.5.1 Experiment 4 Results
Experiment 4 mirrors the setup of Experiment 1, except models are trained using features from all artists and tested on features from all artists. The results shown in Table 5.6 are the average of the 10 runs during the validation step, and were obtained using a sampling rate of 40%, where no significant change in the accuracy values from previous runs is seen. Full results for each sampling rate can be found in Appendix D.
5.5.2 Experiment 4 Discussion
In Experiment 4, Experiment 1 was expanded to incorporate features from multiple artists, instead of just one. Each of the 20 artists in the data set contributed frames to the training and test sets in Experiment 4. As you can see in Table 5.6, MFDWC features lag behind MFCC features and Wavelet Energy features by roughly 5%. MFCC features trail the Wavelet Energy features by around 1%. This is not a significant difference, but it is in keeping with the results shown in the previous three experiments.
5.6 Experiment 5: Measuring Performance Across Different Groups of Artists
In Experiment 5, a model was trained on features extracted from frames from one group of artists, and was tested against features extracted from frames from another, distinct group of artists.
Table 5.7: Experiment 5 Results for Vocal Region Detection on Individual Frames from Multiple, Different Artists Using a 32% Sampling Rate, Averaged Over 10 Runs
MFCC MFDWC Wavelet Energies
Accuracy M.S.E. Accuracy M.S.E. Accuracy M.S.E.
65.25 0.3912 59.80 0.4413 66.43 0.3881
Experiment 5 was set up to determine if DWT features generalized to groups of multiple artists. A feature that performs well in Experiment 5 suggests that few, if any, constraints need to be placed on that feature, as it would have performed well regardless of the testing data, gender and training set makeup. Experiment 5 was designed to extend Experiment 2 to multiple artists. Frames from a set of 10 artists are used to train a model, which is then tested against frames from the remaining 10 artists. Artists are randomly assigned to either set, and 10 runs are conducted for each of the three features. The results of the 10 runs per feature are then averaged to get an overall accuracy score for groups of frames across multiple artists.
5.6.1 Experiment 5 Results
As Experiment 4 extends Experiment 1, Experiment 5 extends Experiment 2. In Experiment 5, 10 artists are chosen at random to be the training set, while the remaining 10 become the test set. In total, 10 runs are performed, with a new training and test set generated for each run. The results of the 10 runs are averaged and shown in Table 5.7 for a sampling rate of 32%. As in the previous experiments, the sampling rate was initially set at 2% and increased in 2% intervals up to 40% on subsequent runs. By 40%, the model's performance had leveled off, showing no improvement with the addition of training data, which allowed the complexity of the experiment to be kept down. The full results can be found in Appendix E.
5.6.2 Experiment 5 Discussion
Finally, Experiment 5, shown in Table 5.7, extends Experiment 2 by using 10 artists for both the training and testing sets. Once again, MFDWC features trail MFCC and Wavelet Energy features, and once again, Wavelet Energy features hold a performance advantage over MFCC features, this time by over 1%. The consistent performance of the Wavelet Energy features shows that they are stable features that offer improvements over standard MFCC features in a vocal region detection task.
Chapter 6
Conclusion
In the preceding chapter I gave results for each of the five experiments designed to evaluate the methodology described in Chapter 4. This chapter offers some conclusions that can be drawn from those results. First, I will discuss the four contributions of this thesis, then I will discuss some limitations of this research, before concluding with some ideas for the future direction of research in vocal region detection.
6.1 Contributions
This thesis offers contributions in four areas: feature sets used for vocal region detection, data set annotation and use in vocal region detection, an evaluation of a machine learning methodology for using the DWT in vocal region detection, and learning algorithm selection in the vocal region detection task.
6.1.1 Feature Set Selection for Vocal Region Detection
To the best of my knowledge, no other researchers have employed the Wavelet Transformation in the same manner as I have. The researchers who developed the MFDWC features [11] tested their features on a speech corpus, while those who developed the Wavelet Energy features [8] used a data set consisting of radio broadcasts that included a heavy focus on speech. Meanwhile, most of the researchers interested in vocal region detection in music and other music-related tasks used MFCC features. However, the performance of my wavelet-related features suggests alternatives to MFCCs. In the first experiment, MFDWC features outperformed MFCC features, suggesting that replacing the MFCC algorithm's Discrete Cosine Transformation can improve the performance of a vocal region detection system. While MFCCs outperformed the MFDWC features in the other four experiments, the difference in the first test supports the idea that the DWT's coefficient localization property does offer some benefits to researchers.
The fact that the Wavelet Energy features outperformed MFCC features in all five experiments further illustrates the importance of localized coefficients. The real power of the DWT is its multiresolution analysis approach, removing the fixed-window constraint imposed by the MFCC algorithm's use of the Short-Time Fourier Transform. This offers a better means of tracking the frequency changes in a signal over time, and that improvement is supported by my results, where I saw the Wavelet Energy features outperform MFCC features in all tasks. While MFCCs have a demonstrated place in the analysis of music signals, the fact that their limitations can be overcome by using the DWT offers an intriguing alternative for researchers wishing to further improve their algorithms.
6.1.2 Data Sets for Vocal Region Detection
As discussed in previous sections, the selection of which data set to use is a crucial first step in designing an experiment. The old adage, "garbage in equals garbage out," holds true when discussing the relevancy of results in any research task, and it is particularly important in the case of vocal region detection. Researchers must avoid data sets that are too fine-tuned to their experimental method. Any method developed and tested against one data set should ideally be equipped to handle a new data set without any meaningful degradation in performance. For my thesis, I chose to use the artist20 data set because it has been used in previous research [10] and represents a range of different artists, artist genders, musical genres and musical eras. The temptation to develop and test against my own data set was avoided in order to prevent biasing the system to the point where the results of my experiment could not be reasonably compared to other research. Unfortunately, there does not exist a standard data set around which all researchers interested in vocal region detection could build their experiments. Additionally, artist20 contains full songs, as opposed to some data sets which only contain clips of songs [26, 34, 41, 45]. The benefit of using the whole song, as opposed to a 30-second (or similar-length) clip, is that a significant amount of information is lost when only a small percentage of the whole signal is used. When using a clip, researchers must choose very carefully which portion of the song to use so as not to introduce a bias. Therefore, lacking a standard data set, I chose one, artist20, that had previously been used in other areas of Music Information Retrieval. While not a perfect data set selection, artist20 is attractive because it is larger than most data sets that I was aware of at the time of my research [20, 26, 27, 31, 41, 45], it covers a reasonably wide range of vocal features and styles, and it has previously been used and shown to be a useful selection.
Artist20 was selected so that future researchers looking to tie their research to a previously evaluated data set have one such data set available. Additionally, the subset of artist20 that was used in this thesis was annotated and will be made publicly available for use by future researchers interested in the vocal region detection task.
6.1.3 Experimental Evaluation
The presented methodology is based on the standard machine learning / knowledge discovery and data mining methodology: gather training data, extract features, build a model, gather testing data, extract features, test against the model and report the results. That standard approach is utilized to take advantage of the Discrete Wavelet Transformation in the task of automatically detecting vocal regions in music. The algorithm developed in this thesis is an extensible, comprehensive method for detecting vocal regions in music using the Wavelet Transformation. The fact that MFDWC features outperformed MFCC features in the first test but failed to replicate that success in subsequent tests suggests that while there might be a useful application of MFDWC features, they still come up short in broader vocal region detection research when compared to MFCC features. The fact that they come up short exposes a weakness inherent in having too limited an experimental scope when examining a new feature. Wavelet Energy features, however, demonstrated just the opposite. By outperforming MFCC features in all five experimental setups, Wavelet Energy features showed generality in their application to the vocal region detection problem. This generality shows that the Wavelet Energy features are a stable classification feature. In addition, the fact that the Wavelet Energy features for the most part performed well when applied to each of the selected learning algorithms further supports that generality statement. This suggests that the Discrete Wavelet Transformation can be applied in a number of related research areas and possibly continue to perform well. Having an extensive experimental methodology gives valuable context to the results being shown.
6.1.4 Learning Algorithm Selection
Perhaps the most interesting conclusion drawn from this research was the fact that a standard MIR learning algorithm, Support Vector Machines, was outperformed by a lesser-applied decision tree algorithm, J48. The fact that this was identified in my research is surprising, because SVMs are very common in MIR research. Going back to the experimental evaluation, this is an important conclusion drawn only by expanding the scope of my research. Five starkly different algorithms were chosen in Experiment 1. By having all three feature sets applied to each learning algorithm, I was able to demonstrate the stability of the wavelet features, as discussed in the previous section, while also showing that an uncommon approach can sometimes lead to interesting results. As such, it is interesting to see SVMs outperformed in an area where they are a dominant algorithm. Using an SVM classifier in the vocal region detection task is useful because SVMs are able to appropriately handle high dimension data, and SVMs have been used in previous research with consistently solid results [16, 20, 24, 31]. Despite these stated advantages of SVMs, they were consistently outperformed by the less frequently used decision tree algorithm, J48. There is no universally best learning algorithm; each comes with its pluses and minuses, and each works better on different data, attacking different problems. While SVMs have a strong hold in Music Information Retrieval specifically and signal analysis in general [16, 20, 21, 24, 31, 34], this thesis offers the suggestion that there might be other algorithms better adapted to MIR problems. Specifically, decision trees prove to be well adapted to Wavelet features and vocal region detection.
6.2 Limitations
One of the things to consider when evaluating this research is the choice to proceed with a decision tree algorithm for Experiments 2-5. SVMs are one of the learning algorithms of choice in MIR [16, 20, 21, 24, 31, 34], and decision trees have limitations. While each possible learning algorithm has its drawbacks, it is important to consider those drawbacks when attempting to draw conclusions from research. Decision tree algorithms are unstable, and when performing a literature review on the state of vocal region detection I did not come across a single paper that used decision trees. I attempted to mitigate those concerns by running a pilot study for Experiments 2-5 using a subset of my data and SVMs. In every case, J48 continued to outperform SVMs, which suggests that for the purposes of my experiments, decision trees are an acceptable choice. Also, while I selected five widely diverse learning algorithms, additional approaches do exist, such as Gaussian Mixture Models, Hidden Markov Models and Neural Networks, that could improve the performance of my system.
Another shortcoming is the fact that I chose to use only the Daubechies wavelet family. There exists a multitude of wavelet families, all with different features that make them attractive to researchers seeking to solve a number of problems. The conclusions I have drawn from my research suggest that it would be a worthwhile endeavor to attempt to find a better wavelet family for use in the vocal region detection problem. However, the scope of my research was to demonstrate that wavelets could be applied to this particular problem, which I believe I have, and to offer a jumping-off point for future research. Daubechies wavelets were chosen due to their demonstrated usability in the speech analysis domain, and due to the wide overlap of the two areas of speech and music analysis. Selecting Daubechies wavelets therefore appeared to be a sensible choice.
An additional shortcoming lies in the limitations of my feature extraction method. Many researchers further optimize their features in a number of ways, including pre-processing their data in a different manner, combining a larger set of features to make one diverse feature vector, or combining or chaining classifiers to improve performance. Certainly some performance gains could be achieved by including some or all of these methods. However, I set out to show that the DWT has a useful application in the vocal region detection domain, and I believe I have done that. Further research, to be discussed in the next section, could build on what I have shown.
Finally, a limitation of my research goes back to the selection of my data set. While the artist20 data set is an attractive data set to use, it is not a research standard, making it difficult to compare results. Additionally, further improvements could be made to the data to include even more diverse music, perhaps including rap, folk, or Eastern music.
6.3 Future Work
While I was able to successfully demonstrate the utility of the DWT in detecting vocal regions in music, there exist plenty of areas of research that could be built on what has been done here. As mentioned in the previous sections, other data sets could be used within this methodology to demonstrate how well it performs, as well as to tie in previous research. Also, more wavelet families, such as Symlets and Coiflets [8, 11, 43], could be used to extract features, addressing the limitation of using only Daubechies wavelets. Additionally, while MFCCs are a standard feature in MIR research, they are by no means the only features used. Wavelets used in conjunction with other features, such as pitch, timbre, zero-crossing rate and others, could further improve the performance of this system. Additionally, the gains seen in this research could be broadly expanded to cover more areas within MIR, such as singer identification, audio fingerprinting and genre identification, among many others. And finally, this research employed a classification approach to solve this problem. Other approaches, such as clustering or parameter estimation in a semi-supervised environment, have additional applicability and should be pursued further.
6.4 Summary
In this thesis, I evaluated a methodology for detecting vocal regions in music that uses the Discrete
Wavelet Transformation to calculate features that improve on traditional features in a classification
system. Results from a rigorous empirical evaluation suggest that the Wavelet Transformation's
coefficient localization, its multiresolution analysis, and its freedom from a fixed resolution
trade-off between time and frequency allow Mel-Frequency Discrete Wavelet Coefficient and Wavelet
Energy features, calculated from the Discrete Wavelet Transformation, to achieve higher
classification accuracy than a standard audio analysis feature, Mel-Frequency Cepstral Coefficients,
when distinguishing between vocal and non-vocal regions in music. Additionally, this thesis shows
that a commonly used classification algorithm, Support Vector Machines, is outperformed by the less
frequently applied decision tree classification algorithm. Reliable methods for detecting vocal
regions will lead to better analysis results for researchers interested in other areas of Music
Information Retrieval, such as singer identification, that depend on a reliable vocal region
detection system to achieve optimal results.
Appendix A
Complete Experiment 1 Results
Table A.1: Complete Experiment 1 Results Using MFCC Features
Artist Algorithm Accuracy Mean Squared Error
Naïve Bayes 67.68 0.3793
JRip 73.19 0.3765
Aerosmith J48 72.40 0.3342
IBk 72.85 0.3314
SVM 73.85 0.2615
Naïve Bayes 75.44 0.2752
JRip 78.51 0.3111
Beatles J48 78.64 0.2594
IBk 78.86 0.2665
SVM 72.27 0.2773
Naïve Bayes 81.93 0.2654
JRip 83.98 0.2551
CCR J48 83.41 0.2203
IBk 86.39 0.1844
SVM 81.68 0.1832
Naïve Bayes 71.90 0.3549
JRip 79.83 0.3118
Cure J48 79.96 0.2664
IBk 79.87 0.2598
SVM 78.14 0.2186
Naïve Bayes 71.27 0.3419
JRip 75.84 0.3529
Dave Matthews Band J48 76.00 0.2993
IBk 75.95 0.2988
SVM 75.10 0.2490
Naïve Bayes 68.06 0.3768
JRip 73.64 0.3734
Depeche Mode J48 72.89 0.3266
IBk 75.65 0.2984
SVM 73.63 0.2637
Naïve Bayes 66.69 0.3789
JRip 72.22 0.3897
Fleetwood Mac J48 73.42 0.3338
IBk 68.74 0.3636
SVM 70.60 0.2940
Naïve Bayes 74.02 0.3114
JRip 78.04 0.3223
Garth Brooks J48 78.39 0.2575
IBk 80.05 0.2514
SVM 77.18 0.2282
Naïve Bayes 66.22 0.3665
JRip 73.51 0.3714
Green Day J48 73.85 0.3251
IBk 72.30 0.3292
SVM 71.46 0.2854
Naïve Bayes 65.98 0.3865
JRip 73.42 0.3771
Led Zeppelin J48 74.10 0.3229
IBk 74.96 0.3014
SVM 71.37 0.2863
Naïve Bayes 71.13 0.3411
JRip 74.88 0.3605
Madonna J48 75.47 0.3053
IBk 74.95 0.3082
SVM 70.91 0.2909
Naïve Bayes 82.18 0.2492
JRip 85.33 0.2384
Metallica J48 85.03 0.2076
IBk 84.97 0.2082
SVM 82.30 0.1770
Naïve Bayes 65.13 0.4133
JRip 66.99 0.4352
Prince J48 67.98 0.3848
IBk 67.97 0.3694
SVM 66.70 0.3330
Naïve Bayes 67.93 0.3646
JRip 70.11 0.4018
Queen J48 72.07 0.3248
IBk 76.32 0.2871
SVM 71.01 0.2899
Naïve Bayes 70.98 0.3226
JRip 80.13 0.2975
Radiohead J48 80.16 0.2362
IBk 80.66 0.2426
SVM 77.75 0.2225
Naïve Bayes 65.52 0.3872
JRip 70.26 0.4083
Roxette J48 71.33 0.3514
IBk 69.20 0.3577
SVM 69.03 0.3097
Naïve Bayes 75.61 0.3210
JRip 78.17 0.3257
Steely Dan J48 78.05 0.2709
IBk 79.36 0.2636
SVM 77.42 0.2258
Naïve Bayes 68.37 0.3792
JRip 74.44 0.3700
Suzanne Vega J48 73.52 0.3212
IBk 75.34 0.3035
SVM 72.16 0.2784
Naïve Bayes 69.16 0.3636
JRip 73.36 0.3795
Tori Amos J48 74.53 0.3242
IBk 74.91 0.3050
SVM 70.58 0.2942
Naïve Bayes 64.66 0.3952
JRip 69.49 0.4187
U2 J48 69.75 0.3669
IBk 68.41 0.3656
SVM 70.32 0.2968
Table A.2: Complete Experiment 1 Results Using MFDWC Features and
Naïve Bayes
Artist Wavelet Accuracy Mean Squared Error
db1 69.53 0.3772
db2 67.42 0.3678
Aerosmith db3 68.03 0.3606
db4 65.61 0.3982
db1 71.46 0.3146
db2 69.59 0.3286
Beatles db3 71.77 0.3073
db4 67.63 0.3473
db1 82.40 0.2369
db2 80.84 0.2405
CCR db3 80.68 0.2339
db4 79.83 0.2426
db1 77.80 0.2995
db2 77.16 0.2829
Cure db3 77.60 0.2799
db4 74.96 0.3292
db1 67.20 0.4019
db2 64.44 0.4158
Dave Matthews Band db3 66.31 0.3961
db4 59.68 0.4508
db1 67.91 0.3790
db2 69.54 0.3500
Depeche Mode db3 68.35 0.3567
db4 65.01 0.4045
db1 63.42 0.4248
db2 58.78 0.4529
Fleetwood Mac db3 61.60 0.4242
db4 57.10 0.4636
db1 72.45 0.3250
db2 70.69 0.3112
Garth Brooks db3 70.09 0.3173
db4 62.76 0.3861
db1 70.11 0.3664
db2 65.60 0.3922
Green Day db3 69.50 0.3544
db4 61.38 0.4365
db1 69.30 0.3824
db2 67.33 0.4122
Led Zeppelin db3 67.93 0.3958
db4 67.16 0.4120
db1 67.14 0.3839
db2 64.92 0.4001
Madonna db3 65.31 0.3797
db4 63.02 0.4100
db1 82.76 0.2282
db2 80.27 0.2478
Metallica db3 81.06 0.2394
db4 80.96 0.2634
db1 62.74 0.4428
db2 60.47 0.4450
Prince db3 61.66 0.4324
db4 58.49 0.4646
db1 68.95 0.3560
db2 65.37 0.3665
Queen db3 66.28 0.3554
db4 64.14 0.3725
db1 71.49 0.3357
db2 68.14 0.3521
Radiohead db3 68.48 0.3450
db4 67.74 0.3483
db1 65.44 0.4049
db2 61.38 0.4225
Roxette db3 62.55 0.4086
db4 60.48 0.4282
db1 74.99 0.3317
db2 72.46 0.3334
Steely Dan db3 73.96 0.3176
db4 69.85 0.3755
db1 68.65 0.3929
db2 64.44 0.4171
Suzanne Vega db3 64.67 0.4117
db4 61.99 0.4316
db1 66.59 0.3997
db2 64.36 0.4099
Tori Amos db3 65.10 0.3909
db4 64.25 0.4067
db1 64.95 0.4132
db2 63.34 0.4104
U2 db3 64.09 0.3979
db4 61.74 0.4149
Table A.3: Complete Experiment 1 Results Using MFDWC Features and
J48
Artist Wavelet Accuracy Mean Squared Error
db1 82.82 0.1944
db2 81.71 0.2108
Aerosmith db3 82.60 0.1958
db4 77.09 0.2701
db1 83.95 0.1775
db2 82.02 0.1996
Beatles db3 82.89 0.1886
db4 77.33 0.2568
db1 89.76 0.1245
db2 88.29 0.1453
CCR db3 88.39 0.1416
db4 86.26 0.1830
db1 85.69 0.1721
db2 85.36 0.1807
Cure db3 85.29 0.1813
db4 80.92 0.2618
db1 81.58 0.2033
db2 79.67 0.2238
Dave Matthews Band db3 81.34 0.2031
db4 70.96 0.3218
db1 83.25 0.1848
db2 82.38 0.2004
Depeche Mode db3 82.06 0.1996
db4 74.47 0.3004
db1 79.15 0.2269
db2 75.18 0.2709
Fleetwood Mac db3 78.39 0.2347
db4 67.33 0.3597
db1 85.18 0.1621
db2 82.99 0.1880
Garth Brooks db3 83.03 0.1866
db4 74.34 0.2874
db1 83.36 0.1871
db2 80.63 0.2150
Green Day db3 82.48 0.1958
db4 74.21 0.2977
db1 84.03 0.1797
db2 81.04 0.2187
Led Zeppelin db3 81.85 0.2078
db4 76.64 0.2957
db1 82.49 0.1907
db2 80.47 0.2136
Madonna db3 81.23 0.2044
db4 74.35 0.2915
db1 89.22 0.1319
db2 88.38 0.1459
Metallica db3 88.87 0.1365
db4 85.88 0.1896
db1 78.21 0.2458
db2 75.31 0.2816
Prince db3 76.30 0.2690
db4 68.19 0.3639
db1 83.81 0.1767
db2 80.19 0.2190
Queen db3 81.72 0.2011
db4 74.88 0.2900
db1 86.61 0.1487
db2 83.98 0.1801
Radiohead db3 84.04 0.1779
db4 80.29 0.2328
db1 80.43 0.2128
db2 77.80 0.2411
Roxette db3 78.64 0.2328
db4 71.01 0.3263
db1 85.44 0.1668
db2 83.45 0.1927
Steely Dan db3 84.08 0.1832
db4 78.93 0.2634
db1 83.32 0.1843
db2 80.15 0.2228
Suzanne Vega db3 81.18 0.2096
db4 75.06 0.2909
db1 83.38 0.1827
db2 80.56 0.2135
Tori Amos db3 81.72 0.1995
db4 75.72 0.2777
db1 79.81 0.2203
db2 78.38 0.2383
U2 db3 79.50 0.2244
db4 71.25 0.3253
Table A.4: Complete Experiment 1 Results Using MFDWC Features and
SVM
Artist Wavelet Accuracy Mean Squared Error
db1 69.21 0.3079
db2 69.52 0.3049
Aerosmith db3 70.11 0.2989
db4 66.11 0.3389
db1 71.06 0.2984
db2 69.91 0.3009
Beatles db3 71.72 0.2828
db4 68.52 0.3148
db1 82.36 0.1764
db2 81.32 0.1868
CCR db3 80.49 0.1951
db4 79.89 0.2011
db1 75.99 0.2401
db2 75.99 0.2401
Cure db3 75.99 0.2401
db4 75.99 0.2401
db1 66.60 0.3340
db2 65.34 0.3466
Dave Matthews Band db3 67.63 0.3237
db4 60.86 0.3914
db1 69.48 0.3052
db2 70.63 0.2937
Depeche Mode db3 69.72 0.3028
db4 61.36 0.3864
db1 61.69 0.3831
db2 59.29 0.4071
Fleetwood Mac db3 62.75 0.3725
db4 57.92 0.4208
db1 72.48 0.2752
db2 72.25 0.2775
Garth Brooks db3 72.67 0.2733
db4 64.55 0.3545
db1 69.36 0.3064
db2 65.18 0.3482
Green Day db3 69.37 0.3063
db4 63.03 0.3697
db1 69.19 0.3081
db2 65.05 0.3495
Led Zeppelin db3 65.05 0.3495
db4 65.05 0.3495
db1 64.54 0.3546
db2 64.54 0.3546
Madonna db3 64.54 0.3546
db4 64.54 0.3546
db1 80.43 0.1957
db2 80.47 0.1953
Metallica db3 82.02 0.1798
db4 80.43 0.1957
db1 62.22 0.3778
db2 60.29 0.3971
Prince db3 61.28 0.3872
db4 55.94 0.4406
db1 71.19 0.2881
db2 69.09 0.3091
Queen db3 69.04 0.3096
db4 66.21 0.3379
db1 69.21 0.3079
db2 69.66 0.3034
Radiohead db3 69.49 0.3051
db4 68.51 0.3149
db1 63.57 0.3643
db2 63.44 0.3656
Roxette db3 63.88 0.3612
db4 61.72 0.3828
db1 73.25 0.2675
db2 73.21 0.2679
Steely Dan db3 73.48 0.2652
db4 69.94 0.3006
db1 68.63 0.3137
db2 65.08 0.3793
Suzanne Vega db3 66.08 0.3392
db4 61.69 0.3831
db1 64.80 0.3520
db2 64.80 0.3520
Tori Amos db3 64.80 0.3520
db4 64.80 0.3520
db1 64.45 0.3555
db2 64.65 0.3535
U2 db3 65.31 0.3469
db4 62.14 0.3786
Table A.5: Complete Experiment 1 Results Using MFDWC Features and
JRip
Artist Wavelet Accuracy Mean Squared Error
db1 71.62 0.3882
db2 70.95 0.3969
Aerosmith db3 71.19 0.3925
db4 66.80 0.4347
db1 74.37 0.3647
db2 70.98 0.3969
Beatles db3 72.28 0.3850
db4 68.52 0.4223
db1 85.07 0.2399
db2 83.31 0.2656
CCR db3 82.90 0.2722
db4 81.51 0.2900
db1 79.62 0.3167
db2 79.75 0.3163
Cure db3 79.82 0.3153
db4 77.17 0.3488
db1 66.62 0.4386
db2 64.11 0.4554
Dave Matthews Band db3 65.68 0.4464
db4 58.70 0.4824
db1 72.16 0.3875
db2 71.28 0.3972
Depeche Mode db3 70.84 0.4001
db4 66.34 0.4415
db1 63.18 0.4619
db2 58.41 0.4851
Fleetwood Mac db3 59.99 0.4781
db4 56.18 0.4920
db1 74.90 0.3603
db2 71.60 0.3958
Garth Brooks db3 71.19 0.3991
db4 63.45 0.4588
db1 72.91 0.3699
db2 68.68 0.4153
Green Day db3 71.23 0.3908
db4 62.32 0.4622
db1 73.45 0.3781
db2 70.01 0.4088
Led Zeppelin db3 70.50 0.4066
db4 67.87 0.4307
db1 70.29 0.4043
db2 66.35 0.4398
Madonna db3 66.95 0.4302
db4 65.72 0.4457
db1 84.89 0.2476
db2 84.07 0.2589
Metallica db3 84.36 0.2529
db4 82.19 0.2866
db1 64.34 0.4523
db2 61.18 0.4708
Prince db3 62.56 0.4638
db4 58.56 0.4827
db1 72.28 0.3876
db2 66.94 0.4378
Queen db3 68.12 0.4288
db4 65.36 0.4494
db1 76.17 0.3424
db2 73.32 0.3734
Radiohead db3 73.24 0.3709
db4 70.89 0.3965
db1 65.52 0.4459
db2 61.86 0.4669
Roxette db3 62.56 0.4622
db4 60.66 0.4729
db1 76.98 0.3420
db2 74.99 0.3636
Steely Dan db3 75.67 0.3567
db4 71.13 0.4051
db1 71.28 0.3911
db2 67.93 0.4203
Suzanne Vega db3 68.33 0.4193
db4 65.26 0.4452
db1 70.85 0.4040
db2 67.44 0.4324
Tori Amos db3 68.38 0.4241
db4 66.68 0.4383
db1 65.24 0.4446
db2 64.06 0.4560
U2 db3 65.69 0.4405
db4 61.15 0.4727
Table A.6: Complete Experiment 1 Results Using MFDWC Features and
IBk
Artist Wavelet Accuracy Mean Squared Error
db1 75.67 0.2789
db2 73.57 0.3052
Aerosmith db3 73.00 0.3063
db4 70.23 0.3355
db1 82.40 0.2193
db2 78.13 0.2597
Beatles db3 78.57 0.2632
db4 74.34 0.2994
db1 87.29 0.1552
db2 84.42 0.1833
CCR db3 84.02 0.1866
db4 81.41 0.2174
db1 80.74 0.2251
db2 79.70 0.2370
Cure db3 79.39 0.2396
db4 76.05 0.2750
db1 75.36 0.2848
db2 70.48 0.3337
Dave Matthews Band db3 71.73 0.3222
db4 66.35 0.3711
db1 78.22 0.2550
db2 75.22 0.2856
Depeche Mode db3 74.43 0.2935
db4 70.01 0.3362
db1 71.78 0.3249
db2 66.96 0.3674
Fleetwood Mac db3 68.11 0.3536
db4 64.16 0.3885
db1 81.53 0.2234
db2 76.82 0.2736
Garth Brooks db3 76.18 0.2742
db4 70.29 0.3313
db1 75.01 0.2869
db2 70.64 0.3305
Green Day db3 72.88 0.3093
db4 67.15 0.3626
db1 77.64 0.2546
db2 73.68 0.2993
Led Zeppelin db3 73.98 0.2948
db4 70.18 0.3296
db1 76.44 0.2771
db2 71.76 0.3233
Madonna db3 72.59 0.3161
db4 69.72 0.3416
db1 85.50 0.1718
db2 83.68 0.1913
Metallica db3 83.99 0.1871
db4 81.32 0.2185
db1 70.36 0.3351
db2 66.49 0.3682
Prince db3 67.60 0.3600
db4 64.91 0.3823
db1 79.52 0.2424
db2 73.71 0.3032
Queen db3 74.55 0.2934
db4 70.61 0.3298
db1 83.71 0.1952
db2 78.36 0.2523
Radiohead db3 79.01 0.2461
db4 74.68 0.2878
db1 73.55 0.3032
db2 69.55 0.4347
Roxette db3 69.97 0.3379
db4 67.12 0.3625
db1 81.33 0.2232
db2 76.86 0.2685
Steely Dan db3 77.13 0.2632
db4 73.45 0.3018
db1 78.48 0.2544
db2 73.09 0.3055
Suzanne Vega db3 73.32 0.3050
db4 69.35 0.3411
db1 78.62 0.2556
db2 73.97 0.3017
Tori Amos db3 74.88 0.2929
db4 71.36 0.3252
db1 72.02 0.3184
db2 69.66 0.3435
U2 db3 69.72 0.3385
db4 67.39 0.3599
Table A.7: Complete Experiment 1 Results Using Wavelet Energy Fea-
tures and Naïve Bayes
Artist Wavelet Accuracy Mean Squared Error
db1 56.30 0.4361
db2 57.17 0.4279
Aerosmith db3 57.56 0.4242
db4 57.38 0.4258
db1 74.80 0.2522
db2 74.25 0.2577
Beatles db3 74.04 0.2592
db4 73.91 0.2609
db1 44.87 0.5421
db2 44.74 0.5477
CCR db3 44.82 0.5465
db4 44.95 0.5446
db1 41.41 0.5855
db2 40.83 0.5914
Cure db3 40.85 0.5911
db4 40.82 0.5912
db1 66.64 0.3337
db2 66.85 0.3314
Dave Matthews Band db3 66.91 0.3308
db4 66.84 0.3316
db1 54.86 0.4511
db2 51.97 0.4801
Depeche Mode db3 52.37 0.4761
db4 51.94 0.4806
db1 62.88 0.3711
db2 63.12 0.3685
Fleetwood Mac db3 63.60 0.3641
db4 63.46 0.3650
db1 65.85 0.3415
db2 65.25 0.3475
Garth Brooks db3 64.96 0.3504
db4 64.75 0.3521
db1 55.60 0.4441
db2 56.05 0.4396
Green Day db3 56.26 0.4375
db4 56.36 0.4367
db1 52.15 0.4778
db2 52.06 0.4785
Led Zeppelin db3 52.09 0.4784
db4 51.92 0.4797
db1 68.52 0.3150
db2 68.84 0.3117
Madonna db3 68.96 0.3101
db4 68.98 0.3099
db1 57.30 0.4240
db2 56.90 0.4290
Metallica db3 56.20 0.4343
db4 56.03 0.4371
db1 59.48 0.4045
db2 60.07 0.3988
Prince db3 60.10 0.3981
db4 60.20 0.3970
db1 59.84 0.4015
db2 60.45 0.3959
Queen db3 60.79 0.3925
db4 60.86 0.3921
db1 69.02 0.3102
db2 70.95 0.2907
Radiohead db3 71.69 0.2831
db4 72.02 0.2802
db1 63.90 0.3611
db2 64.01 0.3599
Roxette db3 64.34 0.3572
db4 64.25 0.3579
db1 53.11 0.4684
db2 54.49 0.4550
Steely Dan db3 55.22 0.4481
db4 55.35 0.4461
db1 55.18 0.4482
db2 55.50 0.4452
Suzanne Vega db3 55.76 0.4430
db4 55.91 0.4411
db1 68.95 0.3109
db2 69.46 0.3054
Tori Amos db3 69.73 0.3030
db4 69.74 0.3036
db1 59.45 0.4055
db2 59.86 0.4013
U2 db3 60.02 0.4000
db4 60.01 0.4002
Table A.8: Complete Experiment 1 Results Using Wavelet Energy Fea-
tures and J48
Artist Wavelet Accuracy Mean Squared Error
db1 84.00 0.1778
db2 84.57 0.1734
Aerosmith db3 84.36 0.1790
db4 84.55 0.1757
db1 87.85 0.1365
db2 87.97 0.1357
Beatles db3 88.31 0.1343
db4 88.34 0.1329
db1 87.45 0.1447
db2 88.38 0.1366
CCR db3 89.12 0.1284
db4 89.14 0.1272
db1 87.11 0.1509
db2 87.59 0.1460
Cure db3 87.81 0.1464
db4 87.95 0.1434
db1 86.02 0.1539
db2 86.39 0.1526
Dave Matthews Band db3 86.43 0.1561
db4 86.55 0.1536
db1 83.89 0.1751
db2 84.57 0.1700
Depeche Mode db3 84.54 0.1736
db4 85.14 0.1663
db1 84.65 0.1716
db2 85.17 0.1674
Fleetwood Mac db3 85.48 0.1671
db4 85.81 0.1633
db1 88.23 0.1293
db2 88.70 0.1279
Garth Brooks db3 88.95 0.1258
db4 89.18 0.1236
db1 86.91 0.1441
db2 87.37 0.1420
Green Day db3 87.79 0.1403
db4 88.09 0.1369
db1 85.18 0.1640
db2 85.66 0.1603
Led Zeppelin db3 86.03 0.1591
db4 86.01 0.1606
db1 86.87 0.1444
db2 87.18 0.1453
Madonna db3 87.18 0.1462
db4 87.41 0.1422
db1 89.63 0.1207
db2 90.03 0.1163
Metallica db3 90.29 0.1170
db4 90.52 0.1129
db1 82.08 0.1980
db2 81.51 0.2096
Prince db3 81.36 0.2155
db4 81.38 0.2140
db1 83.19 0.1798
db2 83.96 0.1746
Queen db3 83.94 0.1770
db4 84.55 0.1702
db1 89.58 0.1153
db2 89.74 0.1150
Radiohead db3 90.25 0.1103
db4 90.50 0.1076
db1 84.95 0.1659
db2 84.74 0.1716
Roxette db3 84.58 0.1748
db4 84.89 0.1712
db1 86.54 0.1468
db2 86.78 0.1477
Steely Dan db3 87.40 0.1425
db4 87.45 0.1413
db1 84.90 0.1655
db2 85.75 0.1586
Suzanne Vega db3 86.04 0.1567
db4 86.55 0.1533
db1 84.84 0.1690
db2 84.96 0.1705
Tori Amos db3 85.40 0.1654
db4 85.86 0.1604
db1 83.42 0.1795
db2 83.61 0.1811
U2 db3 83.28 0.1860
db4 83.38 0.1853
Table A.9: Complete Experiment 1 Results Using Wavelet Energy Fea-
tures and SVM
Artist Wavelet Accuracy Mean Squared Error
db1 72.18 0.2782
db2 72.98 0.2702
Aerosmith db3 72.99 0.2701
db4 73.03 0.2697
db1 75.94 0.2406
db2 76.09 0.2391
Beatles db3 75.91 0.2409
db4 75.83 0.2417
db1 76.52 0.2348
db2 76.52 0.2348
CCR db3 76.52 0.2348
db4 76.52 0.2348
db1 75.99 0.2401
db2 75.99 0.2401
Cure db3 75.99 0.2401
db4 75.99 0.2401
db1 74.34 0.2566
db2 75.65 0.2435
Dave Matthews Band db3 75.74 0.2426
db4 75.89 0.2411
db1 68.33 0.3167
db2 69.44 0.3056
Depeche Mode db3 70.83 0.2917
db4 71.29 0.2871
db1 73.59 0.2641
db2 74.03 0.2597
Fleetwood Mac db3 74.10 0.2590
db4 74.27 0.2573
db1 75.19 0.2481
db2 75.51 0.2449
Garth Brooks db3 74.70 0.2530
db4 74.09 0.2591
db1 74.43 0.2557
db2 75.37 0.2463
Green Day db3 75.23 0.2477
db4 75.29 0.2471
db1 72.33 0.2767
db2 73.05 0.2695
Led Zeppelin db3 72.77 0.2723
db4 72.85 0.2715
db1 71.07 0.2893
db2 72.13 0.2787
Madonna db3 72.46 0.2754
db4 72.35 0.2765
db1 80.43 0.1957
db2 80.43 0.1957
Metallica db3 80.43 0.1957
db4 80.43 0.1957
db1 67.50 0.3250
db2 68.02 0.3198
Prince db3 67.90 0.3210
db4 67.72 0.3228
db1 62.96 0.3704
db2 63.37 0.3663
Queen db3 63.76 0.3624
db4 63.86 0.3614
db1 78.04 0.2196
db2 78.39 0.2161
Radiohead db3 78.24 0.2176
db4 78.16 0.2184
db1 72.53 0.2747
db2 72.36 0.2764
Roxette db3 71.73 0.2827
db4 71.63 0.2837
db1 75.03 0.2497
db2 75.62 0.2438
Steely Dan db3 76.01 0.2399
db4 76.43 0.2357
db1 71.44 0.2856
db2 72.77 0.2723
Suzanne Vega db3 73.19 0.2681
db4 73.19 0.2681
db1 70.70 0.2930
db2 71.73 0.2827
Tori Amos db3 71.80 0.2820
db4 71.82 0.2818
db1 67.99 0.3201
db2 69.00 0.3100
U2 db3 69.44 0.3056
db4 69.60 0.3040
Table A.10: Complete Experiment 1 Results Using Wavelet Energy Fea-
tures and JRip
Artist Wavelet Accuracy Mean Squared Error
db1 74.18 0.3640
db2 75.10 0.3596
Aerosmith db3 75.19 0.3566
db4 75.56 0.3526
db1 81.01 0.2884
db2 81.57 0.2812
Beatles db3 81.82 0.2772
db4 81.85 0.2766
db1 81.32 0.2942
db2 82.80 0.2733
CCR db3 83.34 0.2658
db4 83.54 0.2638
db1 80.71 0.3049
db2 81.54 0.2928
Cure db3 81.96 0.2870
db4 82.11 0.2862
db1 77.20 0.3352
db2 77.94 0.3287
Dave Matthews Band db3 78.56 0.3228
db4 78.47 0.3224
db1 72.21 0.3867
db2 73.46 0.3750
Depeche Mode db3 74.50 0.3632
db4 74.63 0.3630
db1 74.97 0.3581
db2 75.78 0.3525
Fleetwood Mac db3 76.31 0.3456
db4 76.42 0.3452
db1 80.32 0.2949
db2 81.07 0.2841
Garth Brooks db3 81.53 0.2803
db4 81.75 0.2748
db1 77.59 0.3218
db2 78.39 0.3133
Green Day db3 79.56 0.3026
db4 79.95 0.2971
db1 75.62 0.3560
db2 76.55 0.3442
Led Zeppelin db3 76.85 0.3405
db4 77.49 0.3356
db1 77.45 0.3307
db2 78.02 0.3265
Madonna db3 77.67 0.3292
db4 78.04 0.3263
db1 84.50 0.2483
db2 85.16 0.2396
Metallica db3 85.62 0.2331
db4 85.92 0.2272
db1 69.31 0.4188
db2 68.77 0.3665
Prince db3 69.65 0.4165
db4 70.08 0.4124
db1 68.76 0.4126
db2 69.40 0.4089
Queen db3 70.54 0.3991
db4 69.93 0.4084
db1 81.77 0.2752
db2 83.04 0.2590
Radiohead db3 83.54 0.2543
db4 84.24 0.2410
db1 73.62 0.3757
db2 74.04 0.3732
Roxette db3 74.20 0.3719
db4 74.38 0.3681
db1 77.66 0.3333
db2 78.70 0.3198
Steely Dan db3 79.29 0.3114
db4 80.16 0.3021
db1 74.56 0.3607
db2 76.56 0.3429
Suzanne Vega db3 77.01 0.3372
db4 77.74 0.3275
db1 75.38 0.3581
db2 75.64 0.3552
Tori Amos db3 75.66 0.3557
db4 76.01 0.3512
db1 70.17 0.4089
db2 71.43 0.3987
U2 db3 71.81 0.3945
db4 71.19 0.4021
Table A.11: Complete Experiment 1 Results Using Wavelet Energy Fea-
tures and IBk
Artist Wavelet Accuracy Mean Squared Error
db1 72.93 0.3075
db2 75.03 0.2845
Aerosmith db3 76.56 0.2685
db4 76.84 0.2643
db1 80.09 0.2350
db2 81.60 0.2167
Beatles db3 83.32 0.1974
db4 83.97 0.1918
db1 78.25 0.2524
db2 81.91 0.2138
CCR db3 83.64 0.1920
db4 84.11 0.1857
db1 79.62 0.2344
db2 80.74 0.2211
Cure db3 82.14 0.2070
db4 82.34 0.2049
db1 77.04 0.2677
db2 79.06 0.2464
Dave Matthews Band db3 80.58 0.2300
db4 80.40 0.2301
db1 72.60 0.3086
db2 76.41 0.2726
Depeche Mode db3 77.50 0.2568
db4 78.85 0.2451
db1 74.91 0.2843
db2 76.86 0.2675
Fleetwood Mac db3 78.43 0.2491
db4 79.55 0.2404
db1 78.55 0.2496
db2 82.23 0.2082
Garth Brooks db3 84.14 0.1860
db4 84.87 0.1796
db1 75.34 0.2867
db2 78.61 0.2484
Green Day db3 80.37 0.2285
db4 80.65 0.2261
db1 74.34 0.2918
db2 78.44 0.2481
Led Zeppelin db3 80.38 0.2291
db4 81.14 0.2220
db1 78.79 0.2456
db2 82.13 0.2112
Madonna db3 82.95 0.2004
db4 82.94 0.2009
db1 81.35 0.2111
db2 83.89 0.1876
Metallica db3 85.07 0.1749
db4 85.24 0.1716
db1 71.29 0.3240
db2 73.04 0.3055
Prince db3 74.22 0.2931
db4 74.51 0.2888
db1 70.69 0.3289
db2 74.98 0.2870
Queen db3 78.76 0.2491
db4 79.57 0.2381
db1 81.08 0.2234
db2 84.11 0.1886
Radiohead db3 86.07 0.1668
db4 86.88 0.1586
db1 73.77 0.3010
db2 77.26 0.2624
Roxette db3 79.34 0.2398
db4 80.29 0.2302
db1 76.24 0.2699
db2 78.96 0.2418
Steely Dan db3 81.28 0.2162
db4 81.71 0.2107
db1 74.60 0.2904
db2 77.55 0.2596
Suzanne Vega db3 80.27 0.2316
db4 80.17 0.2315
db1 76.12 0.2781
db2 78.76 0.2507
Tori Amos db3 80.30 0.2317
db4 80.60 0.2287
db1 71.49 0.3225
db2 74.19 0.2953
U2 db3 75.33 0.2826
db4 75.51 0.2806
Appendix B
Complete Experiment 2 Results
Table B.1: Complete Experiment 2 Results Using MFCC Features (training set in rows; test set in columns). Each cell gives accuracy ± standard deviation; "-" marks the same-artist train/test pair, for which no result is reported. Column order: Aerosmith, Beatles, CCR, Cure, Dave Matthews Band, Depeche Mode, Fleetwood Mac, Garth Brooks, Green Day, Led Zeppelin, Madonna, Metallica, Prince, Queen, Radiohead, Roxette, Steely Dan, Suzanne Vega, Tori Amos, U2.

Aerosmith: - 69.11±0.36 57.96±0.45 67.95±0.39 60.37±0.42 62.12±0.40 61.58±0.42 65.19±0.39 62.40±0.41 62.54±0.41 48.37±0.50 67.92±0.40 59.45±0.42 57.07±0.44 62.76±0.41 57.76±0.45 68.59±0.35 55.62±0.45 55.13±0.46 55.88±0.45
Beatles: 62.08±0.41 - 61.13±0.42 69.08±0.36 67.35±0.36 64.07±0.38 65.43±0.37 69.80±0.33 65.84±0.38 63.33±0.39 58.43±0.42 51.28±0.51 60.20±0.40 59.30±0.41 68.94±0.34 61.99±0.40 70.93±0.32 60.92±0.41 64.55±0.38 60.63±0.41
CCR: 55.31±0.45 57.60±0.43 - 73.23±0.31 58.65±0.42 58.88±0.42 57.10±0.44 60.10±0.41 60.65±0.41 64.29±0.38 48.00±0.51 69.41±0.36 57.53±0.43 55.25±0.45 60.98±0.40 53.26±0.47 68.12±0.34 57.73±0.43 48.84±0.50 56.71±0.44
Cure: 64.62±0.39 63.50±0.41 69.48±0.38 - 59.18±0.42 61.34±0.39 60.62±0.41 65.69±0.37 60.57±0.43 66.23±0.37 50.42±0.47 72.35±0.38 58.68±0.42 55.25±0.45 59.48±0.43 54.30±0.46 69.64±0.32 61.45±0.40 51.51±0.48 58.17±0.42
Dave Matthews Band: 61.17±0.41 73.75±0.30 63.81±0.42 66.37±0.38 - 63.43±0.39 67.41±0.36 70.56±0.32 66.59±0.38 63.05±0.39 64.25±0.38 66.71±0.41 60.09±0.41 54.28±0.46 63.79±0.39 63.51±0.39 71.83±0.32 65.07±0.37 66.90±0.36 61.51±0.40
Depeche Mode: 57.49±0.44 58.80±0.44 53.78±0.47 63.93±0.41 59.04±0.44 - 61.12±0.43 66.90±0.37 58.60±0.44 60.95±0.42 56.54±0.45 63.26±0.39 61.43±0.40 57.56±0.44 64.03±0.40 55.68±0.45 66.63±0.37 61.65±0.41 57.06±0.45 58.06±0.45
Fleetwood Mac: 59.66±0.44 70.22±0.36 66.35±0.41 70.92±0.36 66.62±0.38 63.72±0.38 - 66.63±0.37 64.69±0.42 65.76±0.38 65.83±0.39 57.49±0.46 59.83±0.41 56.45±0.44 63.83±0.38 65.51±0.40 71.36±0.32 64.84±0.38 65.93±0.40 62.12±0.41
Garth Brooks: 57.19±0.44 71.36±0.31 55.95±0.47 67.09±0.39 64.06±0.38 62.81±0.40 63.13±0.39 - 62.17±0.40 61.52±0.41 61.33±0.39 55.51±0.47 59.02±0.42 60.52±0.40 62.84±0.39 60.24±0.41 63.97±0.39 62.24±0.39 64.74±0.37 61.42±0.41
Green Day: 57.48±0.45 71.52±0.34 68.34±0.40 64.87±0.40 65.68±0.38 56.22±0.45 61.15±0.41 66.86±0.36 - 62.06±0.41 58.71±0.42 70.84±0.38 60.41±0.41 57.20±0.43 59.39±0.43 61.59±0.40 61.65±0.41 59.51±0.42 61.97±0.40 58.11±0.43
Led Zeppelin: 59.96±0.44 67.71±0.37 65.16±0.42 72.40±0.36 61.67±0.41 58.37±0.43 61.40±0.41 61.27±0.40 62.54±0.43 - 50.70±0.47 53.80±0.51 56.52±0.45 56.48±0.44 61.70±0.41 57.67±0.44 67.50±0.37 59.47±0.42 56.76±0.44 56.03±0.45
Madonna: 56.81±0.45 73.14±0.32 46.84±0.54 52.85±0.48 68.01±0.38 60.53±0.42 66.78±0.37 65.50±0.37 57.68±0.44 55.50±0.46 - 39.29±0.60 60.25±0.44 59.12±0.42 62.20±0.40 65.10±0.38 64.51±0.39 61.34±0.41 68.99±0.34 60.27±0.43
Metallica: 60.03±0.40 56.04±0.42 76.74±0.26 75.83±0.25 53.36±0.45 60.28±0.40 51.34±0.47 52.75±0.46 58.75±0.41 66.61±0.34 38.56±0.59 - 56.30±0.44 49.27±0.50 62.13±0.37 48.87±0.50 69.34±0.31 58.71±0.41 42.69±0.55 51.89±0.47
Prince: 53.88±0.48 65.79±0.41 53.06±0.50 50.87±0.49 61.19±0.44 62.08±0.43 60.36±0.44 65.34±0.41 62.44±0.45 56.66±0.45 62.11±0.43 47.56±0.53 - 60.11±0.43 57.43±0.46 60.20±0.44 60.71±0.43 58.87±0.45 59.52±0.45 58.72±0.45
Queen: 55.48±0.46 65.49±0.37 45.91±0.52 60.77±0.42 57.02±0.44 61.79±0.41 53.84±0.47 66.05±0.37 57.00±0.44 58.77±0.43 53.87±0.46 32.69±0.61 57.65±0.44 - 57.79±0.44 54.93±0.45 60.25±0.42 54.86±0.45 56.24±0.45 56.80±0.45
Radiohead: 57.06±0.44 61.14±0.40 70.09±0.34 72.17±0.32 60.72±0.39 63.80±0.37 54.62±0.45 62.64±0.38 61.30±0.40 64.67±0.37 47.35±0.50 69.15±0.36 56.75±0.43 50.84±0.49 - 55.04±0.45 70.23±0.31 60.95±0.39 57.63±0.42 54.51±0.45
Roxette: 62.96±0.42 70.59±0.35 64.07±0.43 67.20±0.40 65.35±0.40 62.34±0.41 67.80±0.39 68.67±0.36 65.44±0.40 61.08±0.42 65.00±0.40 73.50±0.38 58.21±0.43 54.31±0.45 65.27±0.38 - 69.58±0.36 62.68±0.41 67.95±0.37 59.31±0.44
Steely Dan: 60.55±0.43 56.35±0.46 73.17±0.34 71.36±0.36 56.58±0.44 64.10±0.38 61.89±0.41 60.45±0.41 56.20±0.44 65.14±0.38 50.60±0.47 80.43±0.26 61.06±0.40 51.69±0.48 62.90±0.39 54.34±0.45 - 62.26±0.39 48.39±0.49 55.36±0.45
Suzanne Vega: 51.82±0.48 56.43±0.45 53.59±0.47 62.62±0.41 57.07±0.43 63.22±0.40 59.26±0.42 64.26±0.38 55.63±0.45 59.57±0.42 61.40±0.41 55.61±0.45 59.01±0.42 53.46±0.47 59.59±0.42 57.03±0.44 67.18±0.37 - 61.57±0.41 56.69±0.45
Tori Amos: 59.68±0.44 72.32±0.36 54.38±0.46 59.90±0.44 66.25±0.40 61.55±0.41 64.47±0.41 68.70±0.36 62.59±0.43 58.53±0.44 63.57±0.40 49.00±0.51 59.28±0.42 56.42±0.45 65.35±0.39 63.93±0.41 64.92±0.39 61.20±0.41 - 59.71±0.74
U2: 55.18±0.46 70.55±0.37 51.38±0.47 66.21±0.38 67.89±0.38 63.13±0.39 66.37±0.39 68.80±0.36 62.51±0.41 63.62±0.39 68.84±0.38 45.25±0.52 61.02±0.41 59.08±0.43 64.43±0.39 62.05±0.42 68.38±0.35 62.04±0.40 65.04±0.40 -
Table B.2: Complete Experiment 2 Results Using MFDWC Features (training set in rows; test set in columns). Each cell gives accuracy ± standard deviation; "-" marks the same-artist train/test pair, for which no result is reported. Column order as in Table B.1.

Aerosmith: - 61.24±0.41 61.75±0.45 60.97±0.43 51.33±0.49 53.99±0.48 55.23±0.47 56.71±0.45 59.77±0.44 61.48±0.43 47.14±0.52 69.60±0.40 52.75±0.48 57.05±0.45 57.28±0.45 55.01±0.46 59.91±0.44 50.45±0.51 48.31±0.51 52.00±0.48
Beatles: 54.00±0.47 - 53.28±0.49 55.32±0.47 54.79±0.46 52.18±0.49 57.06±0.45 64.44±0.39 61.14±0.42 53.82±0.47 57.82±0.43 50.47±0.51 56.33±0.46 61.28±0.41 56.55±0.44 56.33±0.45 52.11±0.49 51.52±0.50 58.26±0.43 57.88±0.44
CCR: 56.78±0.45 49.75±0.51 - 70.02±0.34 51.39±0.49 58.15±0.43 55.21±0.46 60.28±0.41 56.93±0.44 61.37±0.41 47.84±0.52 65.36±0.40 54.62±0.47 57.44±0.43 59.03±0.42 51.37±0.49 63.96±0.39 52.48±0.48 46.06±0.53 54.04±0.46
Cure: 58.09±0.44 50.38±0.48 75.44±0.34 - 51.60±0.48 62.53±0.42 55.94±0.46 61.36±0.40 56.84±0.45 64.53±0.40 47.47±0.50 76.71±0.33 58.55±0.44 59.39±0.42 61.22±0.42 50.33±0.49 68.40±0.37 56.83±0.45 42.70±0.55 57.69±0.44
Dave Matthews Band: 51.43±0.49 60.94±0.41 63.18±0.44 56.44±0.47 - 52.64±0.49 57.30±0.45 62.73±0.41 61.60±0.43 56.00±0.47 54.31±0.47 63.51±0.42 55.66±0.47 52.12±0.48 53.91±0.48 49.13±0.50 53.26±0.49 53.23±0.48 50.74±0.49 57.08±0.45
Depeche Mode: 54.75±0.48 47.54±0.53 64.76±0.41 71.69±0.35 51.08±0.49 - 53.30±0.48 59.36±0.42 52.37±0.48 60.48±0.43 48.88±0.50 72.27±0.35 58.33±0.44 57.16±0.45 57.75±0.44 47.61±0.51 65.88±0.39 60.16±0.43 43.43±0.54 55.70±0.45
Fleetwood Mac: 54.90±0.48 59.75±0.43 69.27±0.39 66.44±0.42 55.90±0.47 58.62±0.45 - 63.73±0.42 59.02±0.45 60.13±0.45 57.48±0.45 69.25±0.39 58.88±0.45 55.52±0.46 56.96±0.46 57.05±0.45 63.18±0.42 56.54±0.46 53.76±0.47 56.62±0.46
Garth Brooks: 52.22±0.47 60.60±0.40 54.91±0.48 63.96±0.72 56.72±0.45 57.79±0.45 57.65±0.44 - 58.17±0.45 56.44±0.46 59.40±0.43 54.84±0.49 57.03±0.46 58.95±0.42 55.41±0.44 52.90±0.47 55.09±0.47 55.84±0.46 53.08±0.47 58.62±0.44
Green Day: 54.67±0.48 68.63±0.35 64.16±0.44 54.52±0.49 60.66±0.42 49.84±0.51 58.16±0.44 62.14±0.40 - 56.25±0.47 55.58±0.45 69.08±0.41 54.95±0.47 53.98±0.46 56.32±0.46 57.09±0.44 51.20±0.51 51.93±0.49 56.08±0.44 55.76±0.46
Led Zeppelin: 60.38±0.44 58.78±0.43 66.23±0.42 69.87±0.39 52.65±0.48 56.47±0.47 56.29±0.46 58.47±0.44 58.73±0.44 - 46.84±0.51 68.24±0.42 56.90±0.46 55.95±0.45 57.72±0.45 50.77±0.49 63.99±0.43 56.70±0.46 48.90±0.50 53.93±0.47
Madonna: 50.39±0.50 59.39±0.42 47.57±0.51 53.04±0.48 53.07±0.47 54.06±0.48 57.83±0.45 59.50±0.43 52.16±0.48 52.75±0.48 - 50.73±0.50 54.99±0.47 54.36±0.47 54.69±0.47 58.78±0.44 54.72±0.47 56.41±0.46 61.90±0.41 53.43±0.48
Metallica: 61.42±0.40 56.03±0.45 73.62±0.32 72.57±0.31 50.22±0.50 58.79±0.43 54.72±0.46 53.76±0.47 56.52±0.44 65.81±0.37 44.33±0.54 - 56.38±0.45 54.65±0.46 54.91±0.47 50.14±0.49 67.95±0.35 55.75±0.45 42.07±0.56 52.78±0.47
Prince: 55.59±0.47 61.66±0.43 66.64±0.44 66.58±0.43 56.07±0.47 62.07±0.45 58.41±0.46 63.77±0.42 57.99±0.49 61.35±0.45 54.79±0.46 73.37±0.42 - 58.23±0.45 60.50±0.44 53.12±0.47 62.95±0.45 57.18±0.47 53.73±0.47 57.55±0.46
Queen: 57.00±0.46 59.72±0.42 55.99±0.47 67.00±0.40 48.32±0.51 60.47±0.42 52.75±0.48 61.82±0.40 54.18±0.47 58.17±0.45 49.65±0.49 43.46±0.56 54.12±0.47 - 57.18±0.45 50.55±0.48 59.75±0.44 52.94±0.48 46.88±0.51 55.79±0.46
Radiohead: 53.13±0.48 52.22±0.51 63.60±0.40 57.13±0.45 52.22±0.48 56.18±0.46 54.25±0.47 58.40±0.44 54.59±0.47 58.02±0.44 52.46±0.49 69.88±0.34 57.17±0.45 50.61±0.50 - 54.02±0.47 58.01±0.44 53.14±0.48 52.52±0.48 55.62±0.46
Roxette: 55.41±0.46 60.70±0.42 52.32±0.47 49.89±0.48 50.09±0.50 53.04±0.48 56.95±0.45 60.17±0.43 58.20±0.45 52.66±0.46 63.30±0.42 68.05±0.38 52.96±0.48 53.31±0.47 54.99±0.44 - 52.44±0.47 51.54±0.48 60.58±0.43 53.89±0.47
Steely Dan: 58.71±0.44 52.65±0.48 75.00±0.31 74.31±0.33 50.60±0.49 61.65±0.41 56.17±0.46 55.55±0.45 55.31±0.45 64.11±0.39 45.00±0.52 80.52±0.23 58.59±0.43 52.90±0.48 59.33±0.42 49.63±0.49 - 58.59±0.43 41.87±0.56 54.04±0.46
Suzanne Vega: 51.07±0.51 49.86±0.50 58.45±0.46 61.23±0.43 49.71±0.49 58.66±0.45 54.30±0.47 54.80±0.46 51.86±0.49 57.86±0.46 56.53±0.44 66.36±0.41 57.91±0.45 50.64±0.50 55.10±0.46 51.00±0.49 61.18±0.43 - 52.81±0.47 52.39±0.48
Tori Amos: 49.61±0.50 61.27±0.41 43.58±0.54 41.06±0.56 53.05±0.48 47.98±0.51 54.33±0.47 57.36±0.44 56.11±0.46 50.57±0.49 63.83±0.40 48.53±0.53 53.35±0.48 51.71±0.48 53.98±0.47 58.25±0.44 43.90±0.53 51.58±0.49 - 52.86±0.48
U2: 54.07±0.47 63.14±0.42 57.32±0.46 63.74±0.42 55.28±0.47 56.20±0.46 57.73±0.46 65.43±0.39 58.84±0.45 58.15±0.44 57.51±0.45 59.47±0.45 56.62±0.46 60.28±0.43 61.00±0.41 54.18±0.47 57.35±0.45 51.42±0.49 52.45±0.48 -
Table B.3: Complete Experiment 2 Results Using Wavelet Energy Features (training set in rows; test set in columns). Each cell gives accuracy ± standard deviation; "-" marks the same-artist train/test pair, for which no result is reported. Column order as in Table B.1.

Aerosmith: - 70.97±0.33 65.61±0.38 68.68±0.37 64.84±0.38 60.11±0.41 67.13±0.37 62.66±0.39 67.93±0.36 66.47±0.36 52.92±0.47 66.51±0.38 61.83±0.40 55.80±0.44 60.03±0.42 63.36±0.39 70.55±0.33 60.13±0.42 61.22±0.41 57.01±0.44
Beatles: 63.23±0.40 - 64.90±0.39 70.74±0.35 69.52±0.34 62.20±0.39 67.31±0.37 70.86±0.32 69.26±0.35 64.16±0.38 59.77±0.41 66.70±0.40 57.63±0.43 52.21±0.46 68.56±0.32 66.65±0.36 71.00±0.32 63.76±0.38 69.25±0.33 59.05±0.42
CCR: 59.51±0.42 65.56±0.36 - 71.76±0.32 61.20±0.40 59.47±0.41 61.65±0.40 61.25±0.40 65.63±0.37 67.82±0.35 51.34±0.48 73.33±0.32 57.36±0.43 53.11±0.47 65.22±0.36 59.54±0.42 69.01±0.33 61.58±0.39 55.54±0.44 56.03±0.44
Cure: 63.85±0.39 68.14±0.33 63.81±0.41 - 63.90±0.38 60.14±0.40 67.92±0.35 71.62±0.31 65.16±0.39 68.42±0.35 59.03±0.41 60.43±0.45 57.02±0.43 52.66±0.46 64.50±0.37 64.08±0.38 69.54±0.33 66.82±0.35 65.69±0.36 61.00±0.40
Dave Matthews Band: 60.39±0.43 73.46±0.32 63.86±0.40 67.41±0.38 - 65.56±0.38 69.61±0.35 72.92±0.31 69.29±0.36 64.33±0.39 68.01±0.35 67.83±0.37 57.69±0.43 51.73±0.49 61.86±0.39 68.03±0.36 70.83±0.33 68.64±0.35 70.71±0.33 61.92±0.40
Depeche Mode: 60.02±0.42 54.94±0.46 57.99±0.43 57.28±0.43 63.02±0.40 - 69.00±0.34 63.06±0.38 54.04±0.46 60.70±0.40 67.79±0.36 59.94±0.41 61.12±0.40 53.80±0.47 60.41±0.40 60.35±0.42 67.96±0.35 66.88±0.36 60.56±0.41 61.48±0.40
Fleetwood Mac: 65.05±0.38 68.98±0.34 64.93±0.39 71.54±0.34 67.13±0.36 64.84±0.38 - 70.73±0.32 68.66±0.35 68.10±0.35 64.05±0.39 58.74±0.45 59.47±0.41 55.19±0.45 64.76±0.37 68.25±0.35 72.09±0.31 66.70±0.36 65.91±0.36 62.77±0.39
Garth Brooks: 55.92±0.46 72.94±0.29 56.38±0.46 70.46±0.36 66.60±0.36 61.73±0.41 67.79±0.36 - 61.91±0.40 62.62±0.40 62.91±0.38 55.96±0.46 56.30±0.44 53.02±0.47 60.27±0.40 65.53±0.37 65.46±0.38 68.52±0.35 69.54±0.32 59.54±0.42
Green Day: 59.83±0.42 76.36±0.26 63.61±0.39 56.11±0.46 69.49±0.33 56.59±0.45 63.02±0.39 67.82±0.33 - 59.48±0.42 63.75±0.37 70.01±0.36 59.54±0.41 52.48±0.47 62.78±0.38 67.11±0.35 62.35±0.40 63.04±0.39 69.72±0.32 60.07±0.41
Led Zeppelin: 62.60±0.41 70.81±0.32 61.16±0.40 72.23±0.34 64.91±0.38 59.61±0.42 65.89±0.38 66.01±0.37 62.33±0.39 - 53.26±0.46 72.71±0.59 57.88±0.43 56.80±0.44 63.06±0.39 61.25±0.40 65.78±0.37 61.61±0.40 62.34±0.39 58.85±0.43
Madonna: 59.37±0.41 73.29±0.28 48.38±0.51 50.04±0.49 68.40±0.35 59.68±0.43 65.80±0.35 66.74±0.34 61.73±0.39 57.88±0.43 - 39.52±0.58 58.00±0.44 57.88±0.42 64.06±0.37 65.79±0.35 63.70±0.39 61.55±0.40 69.78±0.32 62.48±0.40
Metallica: 63.05±0.37 66.09±0.35 73.75±0.28 76.21±0.24 59.05±0.41 59.12±0.41 56.74±0.43 57.50±0.43 66.23±0.35 67.42±0.33 42.30±0.57 - 57.00±0.43 49.40±0.50 67.09±0.33 57.51±0.43 69.63±0.31 59.09±0.41 54.67±0.45 52.99±0.46
Prince: 58.88±0.46 68.51±0.40 57.28±0.49 42.41±0.54 58.45±0.45 59.73±0.44 60.48±0.44 60.77±0.43 66.31±0.42 55.50±0.47 61.13±0.44 57.66±0.51 - 54.76±0.46 60.84±0.44 62.04±0.44 59.84±0.44 57.96±0.45 65.50±0.42 57.73±0.45
Queen: 52.54±0.48 60.05±0.40 43.96±0.53 53.18±0.47 57.46±0.43 58.38±0.43 52.84±0.47 56.29±0.44 53.34±0.45 57.68±0.43 56.20±0.44 28.57±0.64 55.60±0.45 - 57.34±0.43 55.80±0.44 58.43±0.43 57.45±0.44 55.73±0.44 54.08±0.46
Radiohead: 55.17±0.44 56.40±0.44 65.95±0.36 71.99±0.39 60.97±0.39 61.19±0.39 54.91±0.44 61.22±0.40 60.50±0.41 63.85±0.37 51.05±0.48 66.46±0.38 56.24±0.44 48.87±0.50 - 54.77±0.45 69.57±0.31 61.60±0.39 61.03±0.40 56.08±0.44
Roxette: 62.65±0.41 74.09±0.30 67.18±0.39 64.73±0.40 69.56±0.39 62.76±0.41 70.15±0.36 71.37±0.33 70.32±0.35 61.90±0.40 67.34±0.38 71.12±0.35 58.32±0.43 51.53±0.48 66.02±0.35 - 69.90±0.35 65.95±0.38 71.01±0.33 60.63±0.42
Steely Dan: 62.47±0.40 55.72±0.46 71.67±0.32 66.97±0.38 57.11±0.45 66.14±0.37 64.11±0.38 61.96±0.40 58.79±0.43 65.20±0.37 55.83±0.44 78.67±0.24 61.23±0.39 50.94±0.49 67.06±0.37 58.69±0.44 - 65.11±0.37 58.36±0.49 56.44±0.43
Suzanne Vega: 55.46±0.46 66.72±0.35 62.17±0.42 66.60±0.39 65.50±0.37 64.35±0.38 63.59±0.39 72.19±0.31 64.18±0.39 61.81±0.40 64.89±0.37 61.98±0.44 57.57±0.42 52.20±0.47 66.67±0.35 66.27±0.37 66.82±0.36 - 69.74±0.33 58.93±0.43
Tori Amos: 61.20±0.44 75.51±0.31 61.32±0.42 69.28±0.40 67.84±0.38 62.65±0.41 66.59±0.40 71.85±0.35 67.47±0.38 63.71±0.40 64.26±0.39 63.21±0.43 55.05±0.44 52.53±0.46 67.18±0.35 66.65±0.39 66.16±0.39 66.03±0.39 - 57.83±0.44
U2: 56.92±0.44 67.90±0.33 51.69±0.46 66.37±0.38 68.93±0.34 61.90±0.40 70.35±0.34 68.32±0.34 58.77±0.41 66.02±0.37 68.36±0.36 42.80±0.54 59.92±0.41 57.63±0.43 61.95±0.40 64.40±0.38 69.63±0.33 63.52±0.39 65.58±0.37 -
Appendix C
Complete Experiment 3 Results
Table C.1: Complete Experiment 3 Results Using MFCC Features at 2% Incremental Sampling Rates (columns are sampling rates from 2% to 40% in 2% steps; each block reports accuracy followed by mean squared error).

Training: Female; Testing: Male
Rate (%): 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
Accuracy: 62.21 62.16 61.72 63.21 63.30 62.73 63.20 62.87 63.59 62.98 62.93 63.12 63.36 62.84 63.26 63.14 62.73 62.56 62.94 62.81
MSE: 0.4297 0.4265 0.4263 0.4194 0.4160 0.4188 0.4167 0.4160 0.4125 0.4125 0.4138 0.4118 0.4130 0.4131 0.4111 0.4107 0.4100 0.4119 0.4102 0.4113

Training: Male; Testing: Female
Accuracy: 61.19 59.28 61.38 62.61 60.81 62.02 61.13 62.17 61.86 62.16 62.07 61.18 61.41 60.81 61.81 61.77 62.05 60.83 61.52 61.12
MSE: 0.4299 0.4256 0.4219 0.4179 0.4178 0.4139 0.4147 0.4136 0.4122 0.4126 0.4135 0.4130 0.4137 0.4137 0.4111 0.4105 0.4107 0.4109 0.4102 0.4108
Table C.2: Complete Experiment 3 Results Using MFDWC Features at 2% Incremental Sampling Rates (columns are sampling rates from 2% to 40% in 2% steps; each block reports accuracy followed by mean squared error).

Training: Female; Testing: Male
Rate (%): 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
Accuracy: 54.26 53.85 55.07 55.36 55.21 54.48 54.64 54.32 56.59 55.65 56.29 55.51 54.99 55.78 55.50 55.55 55.54 55.39 55.31 55.49
MSE: 0.4767 0.4780 0.4756 0.4731 0.4739 0.4748 0.4725 0.4709 0.4706 0.4698 0.4676 0.4673 0.4682 0.4668 0.4669 0.4676 0.4660 0.4659 0.4654 0.4669

Training: Male; Testing: Female
Accuracy: 51.33 50.98 51.69 52.14 51.86 52.35 52.73 52.63 52.48 52.46 52.71 52.93 52.91 53.02 52.92 52.97 53.00 53.21 52.90 52.92
MSE: 0.4907 0.4901 0.4849 0.4828 0.4854 0.4830 0.4793 0.4797 0.4793 0.4779 0.4750 0.4772 0.4750 0.4752 0.4758 0.4754 0.4743 0.4741 0.4735 0.4739
Table C.3: Complete Experiment 3 Results Using Wavelet Energy Features at 2% Incremental Sampling Rates (columns are sampling rates from 2% to 40% in 2% steps; each block reports accuracy followed by mean squared error).

Training: Female; Testing: Male
Rate (%): 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
Accuracy: 66.95 64.15 65.68 65.13 66.09 66.10 65.84 65.40 66.07 66.14 65.78 66.27 65.54 65.08 65.37 65.13 65.34 64.53 64.98 64.78
MSE: 0.4088 0.4091 0.4029 0.3989 0.4004 0.3985 0.3977 0.3972 0.3961 0.3957 0.3928 0.3900 0.3926 0.3926 0.3912 0.3927 0.3913 0.3923 0.3919 0.3918

Training: Male; Testing: Female
Accuracy: 66.17 66.78 66.76 67.32 66.23 67.06 66.64 66.00 66.58 66.43 66.50 66.73 66.78 66.80 66.28 66.11 66.30 65.84 66.05 65.64
MSE: 0.3976 0.3930 0.3929 0.3871 0.3862 0.3848 0.3862 0.3842 0.3852 0.3820 0.3811 0.3826 0.3813 0.3799 0.3799 0.3817 0.3783 0.3803 0.3775 0.3808
Appendix D
Complete Experiment 4 Results
Table D.1: Complete Experiment 4 Results Using MFCC Features at 2% Incremental Sampling Rates (columns are sampling rates from 2% to 40% in 2% steps).

Rate (%): 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
Accuracy: 67.41 67.77 68.38 68.75 69.28 69.48 69.44 69.70 69.93 70.05 70.36 70.47 70.66 70.83 71.11 71.24 71.39 71.65 71.90 72.12
MSE: 0.3976 0.3899 0.3861 0.3829 0.3773 0.3748 0.3727 0.37 0.3675 0.3657 0.3627 0.3597 0.3577 0.3548 0.3514 0.3493 0.3463 0.3434 0.3405 0.3374
Table D.2: Complete Experiment 4 Results Using MFDWC Features at 2% Incremental Sampling Rates (columns are sampling rates from 2% to 40% in 2% steps).

Rate (%): 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
Accuracy: 62.25 62.25 62.74 63.06 63.49 63.71 64.01 64.24 64.47 64.69 64.84 65.16 65.44 65.66 65.96 66.22 66.55 66.79 67.13 67.29
MSE: 0.4383 0.4367 0.4325 0.4288 0.4261 0.4227 0.4196 0.4164 0.4144 0.4109 0.409 0.4054 0.4023 0.3997 0.3959 0.3926 0.3889 0.3852 0.3821 0.3795
Table D.3: Complete Experiment 4 Results Using Wavelet Energy Features at 2% Incremental Sampling Rates (columns are sampling rates from 2% to 40% in 2% steps).

Rate (%): 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
Accuracy: 69.57 70.07 70.28 70.57 70.59 70.67 71.01 71.16 71.38 71.44 71.63 71.68 72.00 72.05 72.34 72.50 72.64 72.91 73.02 73.19
MSE: 0.3847 0.3782 0.3753 0.3699 0.3677 0.3663 0.3627 0.3603 0.3579 0.355 0.3526 0.3512 0.3471 0.3453 0.3419 0.3399 0.3374 0.3334 0.331 0.3286
Appendix E
Complete Experiment 5 Results
Table E.1: Complete Experiment 5 Results at 2% Incremental Sampling Rates Using MFCC Features (cells give accuracy ± standard deviation; columns are sampling rates from 2% to 40% in 2% steps).

Rate (%): 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
Run 1: 64.50±0.41 65.68±0.41 64.75±0.41 62.79±0.40 65.25±0.40 63.28±0.41 65.57±0.41 64.08±0.41 65.12±0.40 65.85±0.39 63.44±0.40 64.93±0.40 65.71±0.40 64.33±0.40 62.79±0.41 64.20±0.39 64.94±0.39 64.25±0.39 65.57±0.40 64.00±0.40
Run 2: 65.22±0.42 64.55±0.41 64.92±0.41 63.56±0.42 65.00±0.41 64.77±0.40 69.91±0.39 64.41±0.40 64.85±0.40 65.36±0.39 64.54±0.40 65.31±0.40 65.77±0.39 65.51±0.39 63.09±0.40 65.65±0.35 63.87±0.40 63.86±0.40 64.92±0.39 64.36±0.40
Run 3: 66.49±0.41 63.27±0.42 65.12±0.42 64.05±0.39 65.88±0.40 65.99±0.40 64.50±0.40 65.71±0.40 65.15±0.40 65.84±0.39 65.72±0.39 65.48±0.39 65.18±0.40 65.36±0.40 63.68±0.40 65.53±0.39 64.45±0.39 64.48±0.39 64.81±0.41 65.57±0.40
Run 4: 64.11±0.42 62.94±0.41 64.86±0.42 63.39±0.40 66.68±0.40 64.59±0.40 64.77±0.40 64.32±0.40 65.09±0.39 66.17±0.39 64.88±0.40 64.70±0.39 65.77±0.40 65.67±0.39 64.19±0.41 66.73±0.39 63.67±0.39 64.80±0.39 66.49±0.39 65.02±0.39
Run 5: 65.16±0.41 64.92±0.41 65.12±0.42 64.88±0.40 64.67±0.41 65.74±0.40 64.00±0.41 63.48±0.40 64.35±0.40 65.35±0.39 63.82±0.41 64.60±0.42 63.38±0.40 65.92±0.39 64.23±0.40 65.23±0.40 65.03±0.39 63.99±0.39 64.48±0.39 64.24±0.39
Run 6: 64.75±0.43 64.24±0.41 64.59±0.41 64.53±0.40 65.20±0.41 66.16±0.39 64.80±0.40 64.69±0.40 63.68±0.40 64.11±0.39 63.75±0.39 65.27±0.39 65.46±0.40 64.81±0.39 64.05±0.41 64.19±0.39 64.40±0.39 65.42±0.39 64.31±0.39 65.14±0.38
Run 7: 64.17±0.41 64.74±0.41 64.25±0.42 65.41±0.39 65.82±0.41 64.46±0.40 65.35±0.39 65.76±0.40 64.78±0.39 65.95±0.39 63.86±0.40 65.37±0.40 65.51±0.39 66.10±0.40 63.93±0.40 66.30±0.40 64.30±0.40 64.23±0.39 64.68±0.39 65.05±0.39
Run 8: 64.05±0.41 62.43±0.41 64.33±0.41 65.41±0.41 63.48±0.40 63.38±0.41 66.91±0.40 65.63±0.41 65.16±0.40 64.93±0.39 64.96±0.40 67.48±0.39 65.30±0.39 65.55±0.40 65.01±0.39 65.32±0.39 64.11±0.41 64.42±0.40 62.70±0.41 65.08±0.39
Run 9: 63.49±0.43 64.40±0.41 65.37±0.41 66.11±0.40 63.82±0.40 64.52±0.41 65.84±0.40 64.43±0.40 64.37±0.41 64.63±0.40 64.65±0.39 65.81±0.39 64.15±0.40 65.56±0.38 64.51±0.39 64.04±0.39 64.49±0.39 63.09±0.41 66.51±0.39 64.54±0.40
Run 10: 64.82±0.41 64.57±0.40 65.34±0.40 64.18±0.40 63.40±0.41 64.70±0.40 65.51±0.40 65.63±0.40 65.55±0.39 65.35±0.39 64.36±0.40 64.90±0.40 65.91±0.39 63.86±0.39 61.89±0.40 65.33±0.38 64.93±0.40 64.97±0.39 64.21±0.40 64.54±0.40
Table E.2: Complete Experiment 5 Results at 2% Incremental Sampling Rates Using MFDWC Features (cells give accuracy ± standard deviation; columns are sampling rates from 2% to 40% in 2% steps).

Rate (%): 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
Run 1: 57.70±0.46 60.41±0.46 54.49±0.47 60.89±0.45 59.17±0.45 58.82±0.45 59.01±0.45 61.66±0.44 59.30±0.45 60.21±0.45 59.64±0.44 58.84±0.45 60.90±0.44 59.23±0.44 59.82±0.45 60.13±0.44 59.73±0.44 59.54±0.44 59.71±0.44 59.72±0.44
Run 2: 58.95±0.46 60.41±0.45 58.92±0.45 59.35±0.45 58.45±0.45 60.45±0.45 59.95±0.45 60.09±0.45 60.03±0.44 60.73±0.44 61.52±0.44 58.81±0.44 61.69±0.44 58.59±0.45 58.48±0.45 60.20±0.44 58.39±0.45 59.33±0.44 61.00±0.44 60.11±0.44
Run 3: 56.97±0.46 60.36±0.46 61.18±0.45 59.71±0.46 57.27±0.46 59.08±0.45 59.85±0.45 60.78±0.45 59.99±0.45 59.15±0.45 59.27±0.44 58.92±0.45 59.70±0.44 59.98±0.44 60.89±0.44 59.12±0.44 58.98±0.45 59.00±0.44 59.38±0.44 61.59±0.44
Run 4: 58.38±0.46 60.57±0.45 60.17±0.45 57.41±0.46 59.44±0.45 59.09±0.45 60.19±0.45 59.13±0.44 59.26±0.45 59.48±0.44 60.57±0.44 56.79±0.45 60.28±0.44 59.78±0.44 60.30±0.44 58.70±0.44 58.18±0.44 61.36±0.44 59.29±0.44 61.47±0.43
Run 5: 58.15±0.45 59.45±0.46 59.62±0.45 59.42±0.46 58.88±0.45 59.09±0.45 59.91±0.45 60.93±0.44 58.68±0.45 60.85±0.44 60.66±0.45 60.67±0.44 59.24±0.45 56.31±0.45 60.46±0.44 60.36±0.44 58.69±0.44 59.51±0.44 59.62±0.44 59.25±0.44
Run 6: 60.40±0.46 60.03±0.46 59.48±0.46 59.19±0.45 59.75±0.45 59.06±0.45 59.46±0.45 61.22±0.43 58.77±0.45 59.34±0.45 59.50±0.44 58.82±0.45 61.70±0.44 62.65±0.43 59.26±0.44 60.89±0.43 57.57±0.45 59.59±0.44 58.98±0.44 61.09±0.44
Run 7: 59.00±0.46 60.03±0.46 59.99±0.45 59.54±0.45 59.79±0.45 59.46±0.45 59.45±0.45 60.78±0.45 60.04±0.45 62.30±0.44 58.87±0.45 59.59±0.45 59.76±0.44 60.46±0.44 60.15±0.44 59.32±0.44 58.33±0.45 58.60±0.44 59.91±0.44 60.51±0.44
Run 8: 57.65±0.46 59.93±0.46 60.55±0.45 57.49±0.45 57.64±0.46 60.04±0.45 57.17±0.45 60.72±0.44 59.39±0.45 59.03±0.45 59.33±0.44 59.31±0.44 60.79±0.44 59.94±0.44 59.55±0.44 59.17±0.44 57.47±0.45 59.61±0.45 60.71±0.44 59.32±0.44
Run 9: 60.51±0.46 60.54±0.45 59.19±0.45 58.92±0.45 59.47±0.45 59.02±0.45 60.67±0.45 61.81±0.44 59.11±0.45 59.65±0.45 60.23±0.44 59.71±0.45 60.09±0.44 58.51±0.44 60.26±0.44 59.90±0.44 59.13±0.44 59.73±0.44 59.95±0.44 58.76±0.45
Run 10: 58.96±0.45 60.55±0.46 57.34±0.45 58.89±0.45 59.68±0.45 59.57±0.45 59.67±0.45 59.77±0.45 59.29±0.45 61.07±0.45 61.66±0.44 59.02±0.45 59.77±0.44 60.30±0.44 60.96±0.44 60.25±0.44 59.82±0.44 60.13±0.44 58.61±0.44 60.08±0.44
Table E.3: Complete Experiment 5 Results at 2% Incremental Sampling Rates Using Wavelet Energy Features (cells give accuracy ± standard deviation; columns are sampling rates from 2% to 40% in 2% steps).

Rate (%): 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
Run 1: 69.75±0.40 66.55±0.40 66.07±0.39 68.39±0.38 66.70±0.39 66.48±0.39 66.12±0.38 66.47±0.39 65.96±0.40 65.21±0.39 65.54±0.39 66.67±0.38 65.84±0.38 67.66±0.38 65.40±0.39 67.67±0.38 66.82±0.38 65.44±0.39 65.72±0.39 64.56±0.38
Run 2: 63.25±0.40 67.72±0.40 65.96±0.39 68.07±0.38 66.82±0.40 67.12±0.38 67.62±0.40 65.77±0.41 65.50±0.39 66.16±0.38 65.30±0.39 66.34±0.39 66.37±0.38 66.74±0.38 66.59±0.39 66.82±0.38 65.52±0.39 66.05±0.39 65.09±0.38 65.44±0.39
Run 3: 66.81±0.40 64.21±0.42 66.21±0.40 68.04±0.39 66.09±0.39 66.20±0.40 67.14±0.38 66.69±0.38 66.19±0.39 66.13±0.38 65.72±0.39 66.65±0.38 64.57±0.39 65.71±0.38 66.79±0.39 66.14±0.40 66.78±0.39 65.37±0.39 65.26±0.39 66.92±0.39
Run 4: 68.79±0.41 67.31±0.40 68.03±0.41 65.99±0.39 66.41±0.39 64.45±0.39 67.45±0.38 65.44±0.39 67.09±0.38 66.82±0.40 65.16±0.39 66.17±0.39 66.22±0.40 65.51±0.38 64.48±0.39 66.36±0.39 64.99±0.39 64.71±0.38 65.97±0.38 65.24±0.39
Run 5: 65.23±0.40 67.19±0.40 65.55±0.41 66.56±0.39 66.32±0.39 66.69±0.38 65.14±0.39 65.75±0.38 67.18±0.39 66.43±0.39 65.06±0.39 66.87±0.38 67.03±0.38 65.53±0.39 65.15±0.39 66.93±0.38 66.38±0.39 65.78±0.38 64.39±0.38 66.02±0.38
Run 6: 66.94±0.41 68.85±0.40 66.21±0.40 66.37±0.40 67.01±0.38 66.46±0.39 66.12±0.39 65.59±0.38 65.77±0.39 67.16±0.39 66.81±0.39 64.79±0.40 66.62±0.39 65.83±0.39 67.06±0.38 65.91±0.39 65.30±0.39 65.64±0.39 67.33±0.38 64.76±0.41
Run 7: 66.95±0.39 67.22±0.40 67.09±0.39 66.47±0.39 66.03±0.39 65.95±0.40 65.94±0.39 65.55±0.39 65.55±0.40 66.77±0.39 67.29±0.38 65.33±0.39 66.80±0.39 67.12±0.38 65.23±0.40 66.08±0.40 66.31±0.39 66.04±0.39 65.54±0.38 65.63±0.38
Run 8: 66.71±0.39 65.69±0.40 66.75±0.39 67.08±0.38 67.30±0.39 65.27±0.40 66.12±0.40 66.83±0.39 67.50±0.38 66.82±0.40 65.36±0.40 66.01±0.40 64.85±0.40 64.35±0.40 65.77±0.38 65.82±0.39 65.26±0.38 65.45±0.38 65.00±0.38 65.84±0.39
Run 9: 66.10±0.41 67.70±0.41 66.61±0.41 66.55±0.38 66.69±0.39 66.25±0.40 66.30±0.39 67.16±0.40 65.71±0.38 65.04±0.42 65.43±0.39 65.91±0.39 64.93±0.40 66.56±0.38 66.40±0.39 66.32±0.39 66.00±0.38 66.42±0.39 65.74±0.39 65.35±0.39
Run 10: 66.77±0.40 67.63±0.40 65.95±0.39 67.65±0.39 66.72±0.38 67.13±0.39 66.74±0.38 65.61±0.38 67.58±0.37 66.53±0.38 67.48±0.38 65.57±0.39 65.48±0.39 66.47±0.38 64.81±0.40 66.30±0.39 64.32±0.39 66.82±0.37 65.67±0.38 65.27±0.40
Bibliography
[1] Mark A. Bartsch and Gregory Wakefield. Singing voice identification using spectral envelope estimation. IEEE Transactions on Speech and Audio Processing, 12(2):100–109, March 2004.
[2] Rich Caruana and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the International Conference on Machine Learning. ACM, 2006.
[3] Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 111(4):1917–1930, April 2002.
[4] William Cohen. Fast effective rule induction. In Proceedings of the 12th International Conference on Machine Learning, pages 115–123. Morgan Kaufmann, 1995.
[5] James W. Cooley, Peter A. Lewis, and Peter D. Welch. The fast Fourier transform and its applications. IEEE Transactions on Education, 12(1):27–34, March 1969.
[6] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4):357–366, August 1980.
[7] P. de Chazal, B.G. Celler, and R.B. Reilly. Using wavelet coefficients for the classification of the electrocardiogram. In Proceedings of the 22nd Annual EMBS International Conference. IEEE, 2000.
[8] E. Didiot, I. Illina, D. Fohr, and O. Mella. A wavelet-based parameterization for speech/music discrimination. Computer Speech and Language, 24(2):341–357, April 2010.
[9] Stephen Downie and Michael Nelson. Evaluation of a simple and effective music information retrieval method. In SIGIR '00: Proceedings of the 23rd Conference on Research and Development in Information Retrieval, pages 73–80. ACM, 2000.
[10] D. Ellis. Classifying music audio with timbral and chroma features. In Proceedings of the International Conference on Music Information Retrieval ISMIR-07. IEEE, September 2007.
[11] J.N. Gowdy and Z. Tufekci. Mel-scaled discrete wavelet coefficients for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2000.
[12] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, and Peter Reutemann. The WEKA data mining software: An update. SIGKDD Explorations, 11(1):10–18, July 2009.
[13] ISO/IEC 11172-3:1993. Information technology – Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s – Part 3: Audio. ISO, Geneva, Switzerland.
[14] ISO/IEC 14496-14:2003. Information technology – Coding of audio-visual objects – Part 14: MP4 file format. ISO, Geneva, Switzerland.
[15] Youngmoo E. Kim. Excitation codebook design for coding of the singing voice. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 21–24. IEEE, October 2001.
[16] Youngmoo E. Kim and Brian Whitman. Singer identification in popular music recordings using voice coding features. In Proceedings of the 3rd International Conference on Music Information Retrieval, pages 164–169. IEEE, 2002.
[17] R. Kronland-Martinet, J. Morlet, and A. Grossman. Analysis of sound patterns through wavelet transforms. International Journal of Pattern Recognition and Artificial Intelligence, 1(2):97–126, January 1987.
[18] Tao Li, Qi Li, Shenghuo Zhu, and Mitsunori Ogihara. A survey on wavelet applications in data mining. IEEE SIGKDD Explorations, 4(2):49–68, June 2007.
[19] Chih-Chin Liu and Chuan-Sung Huang. A singer identification technique for content-based classification of MP3 music objects. In CIKM '02: Proceedings of the Eleventh International Conference on Information and Knowledge Management, pages 438–445. ACM, 2002.
[20] Namunu Maddage, Kongwah Wan, Changsheng Xu, and Ye Wang. Singing voice detection using twice-iterated composite Fourier transform. In IEEE International Conference on Multimedia and Expo, pages 1347–1350. IEEE, June 2004.
[21] Namunu Maddage, Changsheng Xu, Mohan Kankanhalli, and Xi Shao. Content-based music structure analysis with applications to music semantics understanding. In ACM Conference on Multimedia. ACM, October 2004.
[22] Namunu C. Maddage. Automatic structure detection for popular music. IEEE Multimedia, 13(1):65–77, January 2006.
[23] Stéphane Mallat. A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, 2009.
[24] Janet Marques and Pedro J. Moreno. A study of musical instrument classification using Gaussian mixture models and support vector machines. Technical Report 4, Compaq Corporation, Cambridge Research Laboratory, June 1999.
[25] The MathWorks. MATLAB and Simulink. http://www.mathworks.com/, 1984–2008.
[26] Annamaria Mesaros, Tuomas Virtanen, and Anssi Klapuri. Singer identification in polyphonic music using vocal separation and pattern recognition methods. In Proceedings of the 8th International Conference on Music Information Retrieval, pages 375–378. International Society for Music Information Retrieval, 2007.
[27] Tin Lay Nwe and Haizhou Li. Exploring vibrato-motivated acoustic features for singer identification. IEEE Transactions on Audio, Speech, and Language Processing, 15(2):519–530, February 2007.
[28] Tin Lay Nwe, Arun Shenoy, and Ye Wang. Singing voice detection in popular music. In MULTIMEDIA '04: Proceedings of the 12th Annual ACM International Conference on Multimedia, pages 324–327. ACM, 2004.
[29] Alexey Ozerov, Pierrick Philippe, Rémi Gribonval, and Frédéric Bimbot. One microphone singing voice separation using source-adapted models. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 90–93, October 2005.
[30] Stefan Pittner and Sagar Kamarthi. Feature extraction from wavelet coefficients for pattern recognition tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(1):83–88, January 1999.
[31] Mathieu Ramona, G. Richard, and B. David. Vocal detection in music with support vector machines. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1885–1888. IEEE, April 2008.
[32] K.R. Rao and P. Yip. Discrete Cosine Transformation: Algorithms, Advantages, Applications. Academic Press, 1990.
[33] Rapid-I. RapidMiner. http://rapid-i.com/, 2006–2009.
[34] Jialie Shen, Bin Cui, John Shepherd, and Kian-Lee Tan. Towards efficient automated singer identification in large music databases. In SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 59–66. ACM, 2006.
[35] Lisa Singh and Mehmet Sayal. Privately detecting burst in streaming, distributed time series data. Data and Knowledge Engineering, 68(6):509–530, June 2009.
[36] SeventhString Software. Transcribe! http://www.seventhstring.com/, 1998–2009.
[37] J. Stegmann, G. Schroder, and K.A. Fischer. Robust classification of speech based on the dyadic wavelet transform with application to CELP coding. In Proceedings of the 1996 International Conference on Acoustics, Speech and Signal Processing, pages 546–549. IEEE, 1996.
[38] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson Addison Wesley, 2006.
[39] Wei-Ho Tsai. Automatic singer recognition of popular music recordings via estimation and modeling of solo vocal signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):330–341, January 2006.
[40] Wei-Ho Tsai and Hsin-Min Wang. On the extraction of vocal-related information to facilitate the management of popular music collections. In JCDL '05: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 197–206. ACM, 2005.
[41] George Tzanetakis, Georg Essl, and Perry Cook. Audio analysis using the discrete wavelet transform. In Proceedings of the 1st Conference on Acoustics and Music Theory Applications. World Scientific and Engineering Academy and Society, 2001.
[42] Avery Li-Chun Wang. An industrial-strength audio search algorithm. In 4th Symposium Conference on Music Information Retrieval. ISMIR, 2003.
[43] Hubert Wassner and Gérard Chollet. New cepstral representation using wavelet analysis and spectral transformation for robust speech recognition. In Proceedings of the International Conference on Spoken Language. IEEE, 1996.
[44] Jianping Zhang and Inderjeet Mani. kNN approach to unbalanced data distributions: A case study involving information extraction. In Proceedings of the International Conference on Machine Learning, pages 667–671. ACM, 2003.
[45] Tong Zhang. Automatic singer identification. In Proceedings of the International Conference on Multimedia and Expo, 2003, pages 33–36. IEEE, 2003.