Training and Testing An Artificial Neural
Network for Musical Note Classification
Adam Wroughton
Indiana Wesleyan University
December 10, 2014
Contents
1 Background
  1.1 Artificial Neural Networks
    1.1.1 Gradient Descent
    1.1.2 Single-Layer Delta Rule
    1.1.3 Backpropagation
2 Classifying Musical Notes
  2.1 Technique and Methodology
    2.1.1 The Artificial Neural Network
    2.1.2 Signal Processing
      2.1.2.1 Data Capture
      2.1.2.2 Feature Selection
      2.1.2.3 Cepstral Analysis
    2.1.3 Training Set
    2.1.4 Test Set
  2.2 Results
    2.2.1 Training Results
    2.2.2 Generalization Results
  2.3 Sources of Error
  2.4 Future Research
Abstract
This project trains an artificial neural network to correctly classify in-
dividual musical notes. Piano notes are recorded and processed by cepstral
analysis to form a training set and a test set. The network is trained and
tested on these sets. The network learns most successfully with learning
rates from input to hidden layer of 0.1 and hidden to output layer of 0.01.
The optimal initial weight range is [−0.1, 0.1] from the input to hidden layer
and [−0.25, 0.25] from the hidden to output layer. The optimal num-
ber of hidden nodes H for classification of K classes is 3K ≤ H ≤ 4K.
The network learns successfully up to the full seven note set. The network
generalizes for each number of classes K tested with success greater than
randomly guessing. Future goals are to determine a functional relation
for an optimal number of hidden nodes for generalization and to expand
the project to classify chords and sung speech.
1 Background
Supervised learning is a branch of machine learning in which a machine learns
to classify data with the help of a teacher. First, the machine is trained on a
set of data for which the correct classification is known. An input from this
set is presented to the machine and it outputs its classification of the input.
Next, the machine is presented with the correct output vector for the input. If there
is a discrepancy between the machine’s classification and the correct classifica-
tion, the machine adjusts its decision boundary in the direction of the correct
classification. This process is repeated until the machine learns the underlying
nature of the data. The machine then generalizes to classify data outside the
training set, for which the correct classification is unknown. Supervised learn-
ing is implemented with a learning machine called an artificial neural network
(ANN).[2]
1.1 Artificial Neural Networks
An artificial neural network’s goal is to classify data into distinct classes. Many
real world problems may be reduced to classification problems. Speech recogni-
tion classifies a sound as a particular word. Facial recognition classifies an image
as a particular person. A medical diagnosis classifies data from diagnostic tests
to indicate positive or negative results or specify a particular diagnosis. The
breadth of possible applications for ANNs encompasses any classification prob-
lem. A firm definition is needed to fully understand artificial neural networks.
A neural network is an interconnected assembly of simple processing
elements, units or nodes, whose functionality is loosely based on the
animal neuron. The processing ability of the network is stored in
the interunit connection strengths, or weights, obtained by a process
of adaptation to, or learning from, a set of training patterns.[1]
Figure 1: A depiction of a human neuron.
The node, or neuron, of an artifi-
cial neural network is modeled on the
human neuron. Human neurons re-
ceive incoming electrical signals from
connected neurons. The signals pass
through synapses forming the connec-
tion between neurons. These signals produce an excitatory or inhibitory
effect on the receiving neuron's tendency to fire its own response. The strength of
these effects is proportional to the strength of the corresponding synaptic weights.
These effects from
incoming signals are summed in the receiving neuron’s cell body. If this sum,
called the activation, is above a threshold value, the cell fires a high response
to other neurons. Otherwise it fires a low response. This process is depicted in
Figure 1.
Figure 2: A node of an artificial neural network. The magnitude of each weight is
represented by the area of its circle. The strength of the input into the weight, and
the corresponding effect after weighting, is represented by the thickness of the arrow.
This process is modeled artificially
as follows. A node has a correspond-
ing weight vector w. This vector is
the same dimensionality as the input
vector x to the node. Each com-
ponent of the input is multiplied by
the node’s corresponding weight com-
ponent and summed together by the
node to form the activation α. Math-
ematically activation is the dot prod-
uct α = w · x. If activation w · x ≥ θ
where θ is the threshold, then the node output y is high, usually 1. Otherwise
it outputs low, usually 0 or −1. To create a more expressive space to form a
decision boundary, the threshold is subtracted from each side of the inequality
so it is adjusted as if it were a weight with an input of −1. Thus for high output
w · x + (−1)θ ≥ 0. Let w and x be the augmented weight and input vectors
from the subtraction of the threshold. Then the node’s output is determined by
w · x ≥ 0 =⇒ y = 1
w · x < 0 =⇒ y = 0     (1)
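The node just described is straightforward to express in code. Below is a minimal sketch in Python; the function name, weights, and threshold values are illustrative and not taken from this project.

```python
import numpy as np

def node_output(w, x, theta):
    """Threshold node: fire 1 if the activation w . x reaches the threshold theta."""
    # Augment the input with a constant -1 so the threshold acts like an ordinary weight.
    w_aug = np.append(w, theta)
    x_aug = np.append(x, -1.0)
    activation = np.dot(w_aug, x_aug)      # alpha = w . x - theta
    return 1 if activation >= 0.0 else 0

# Two-input example: the node fires only when the weighted inputs reach the threshold of 0.5.
print(node_output(np.array([0.5, 0.5]), np.array([0.8, 0.4]), 0.5))   # 1
print(node_output(np.array([0.5, 0.5]), np.array([0.3, 0.1]), 0.5))   # 0
```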
1.1.1 Gradient Descent
To train a network, a measure of the difference between the network’s output
and the desired output is defined as an error function. The network adapts
its weights to minimize the total error by a process called gradient descent.[4]
Figure 3
Consider a situation in which the
function to be minimized forms the
parabola in Figure 3. Now consider a
point on this parabola. If the magni-
tude of the slope is large at the point,
the point is far along the parabola and
thus far from the minimum. A large step is made in the direction of the min-
imum without fear of overshooting it. As the point nears the minimum, the
magnitude of the slope at the point will decrease approaching 0. Thus as
the slope decreases, smaller steps are made in the direction of the minimum
because the point is likely near the minimum. This concept is generalized to
higher dimensions by replacing the function’s slope with the function’s gradient.
Gradient descent iteratively takes sequential steps in the direction of steepest
descent toward the minimum of a function, and the step size is proportional to
the magnitude of the gradient.
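As a concrete illustration of this step-size behavior, the following sketch minimizes the one-dimensional parabola E(w) = (w − 3)², stepping against the slope. The parabola, learning rate, and starting point are arbitrary choices for illustration only.

```python
def minimize_parabola(w=10.0, alpha=0.1, steps=50):
    """Minimize E(w) = (w - 3)^2 by repeatedly stepping against the slope dE/dw = 2(w - 3)."""
    for _ in range(steps):
        slope = 2.0 * (w - 3.0)
        w -= alpha * slope   # step size proportional to the slope: large far away, small near the minimum
    return w

print(minimize_parabola())   # converges toward the minimum at w = 3
```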
1.1.2 Single-Layer Delta Rule
Now to train a single-layer network the error measure is defined, and a training
rule is constructed which incorporates this error and gradient descent to adjust
the network’s weights.
Consider a single node artificial neural network. Let E be the total error of
the network. From gradient descent the node’s ith weight is adjusted by
∆w_i = −α ∂E/∂w_i     (2)
Now the overall error E is the average of the error ep of each input pattern.
Rather than use E for gradient descent, ep, the error of the current pattern, is
used to estimate E, and the weights are adapted according to gradient descent
on ep.[1]
Now, ep must be defined. For pattern p let tp be the desired output, yp be
the node’s output, and a be the node’s activation, the weighted sum of inputs.
An initial error measure is the square of the difference between desired and
actual output. However, for gradient descent, partial derivatives of a continuous
function must be computed, but yp relies on a discontinuous step function. If
activation is above a threshold yp is 1, otherwise it is 0. Thus, rather than use
yp, activation a is used. Additionally a factor of 1/2 is inserted to simplify the
partial derivatives. Thus far
e_p = ½(t_p − a_p)²     (3)
Figure 4
A more intuitive error measure
uses the node’s actual output yp in-
stead of activation, but to do this yp
must be defined so that it is continuous. To achieve continuity of yp, the
step function which uses activation to
output 1 or 0 is replaced by a contin-
uous squashing function such as the
sigmoid function σ(a) of Figure 4 centered about the threshold. A large activa-
tion results in a yp close to 1, whereas a low activation yields a yp very close to
0. If activation is at the threshold, yp = 1/2. [4] Now
e_p = ½(t_p − y_p)²     (4)
Now the weights are adjusted proportionally according to the partial deriva-
tive of the error, which is
∂e_p/∂w_i = −(t_p − y_p)x_pi     (5)
where x_pi is the ith component of pattern p.
Weights are then adjusted by gradient descent according to
∆w_i = α(t_p − y_p)x_pi     (6)
A final addition is made to the learning rule of equation 6. The derivative of
the sigmoid function is included. For high or low network output, the activation is
far from the threshold, where the sigmoid is fairly flat, so its derivative is minimal
(Figure 4). Thus a small weight change creates a small change in activation,
yielding only a small change in output.
This change in output results in a small change to ep. Near the threshold,
however, the sigmoid is changing rapidly. Thus a small change in weight results
in large changes in activation, network output, and error. Thus the weight
change is proportional to the sigmoid function’s derivative. The final learning
rule, called the delta rule, for a single node is
∆w_i = ασ′(a)(t_p − y_p)x_pi     (7)
This is expanded to a single-layer network of M nodes by summing the error over
the nodes. Then
e_p = ½ ∑_{j=1}^{M} (t_pj − y_pj)²     (8)
Then the delta rule for the ith weight of the jth node is
∆w_ji = ασ′(a_j)(t_pj − y_pj)x_pji     (9)
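A minimal sketch of one delta-rule update for a single sigmoid node, following equation 7, is given below. The learning rate and the toy pattern are illustrative; σ′(a) = σ(a)(1 − σ(a)) is the standard derivative of the sigmoid.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def delta_rule_step(w, x, t, alpha=0.1):
    """One delta-rule update for a single sigmoid node (equation 7).
    w and x are the augmented weight and input vectors; t is the desired output."""
    a = np.dot(w, x)                              # activation
    y = sigmoid(a)                                # continuous node output
    dw = alpha * y * (1.0 - y) * (t - y) * x      # sigma'(a) = sigma(a)(1 - sigma(a))
    return w + dw

# Toy example: repeatedly push the node's output toward 1 for one input pattern.
w = np.array([0.1, -0.2, 0.05])                   # last component plays the role of the threshold
x = np.array([0.7, 0.3, -1.0])
for _ in range(200):
    w = delta_rule_step(w, x, t=1.0)
print(sigmoid(np.dot(w, x)))                      # output has risen toward 1
```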
1.1.3 Backpropagation
Neural networks need an additional hidden layer of nodes to learn complex data
sets. This hidden layer receives the inputs, and then produces its own outputs
which it feeds to the network’s output layer. The information flow for a multiple
layer neural network is shown in Figure 5.
Figure 5
The single layer delta rule is ex-
panded for multiple layer networks.
The quantity (t_pj − y_pj) is directly ac-
cessible for the output layer. Thus
the output layer follows the delta rule
of equation 9. Hidden nodes also
contribute to the network’s error and
need to adapt, but t_pk, the desired output for hidden node k, is unknown. An
equivalent term to (t_pj − y_pj), designated δ, must be defined for hidden nodes.
Consider hidden
node k and output node j. Now the effect on total error from the jth output
node is affected by j's inherent error δ_j and by the influence, w_jk, that the kth hidden
node has on the output of j. The effect is combined by multiplying the two
contributing factors and the derivative of the sigmoid into the term σ′(a_k)δ_j w_jk.
Several output nodes receive input from hidden node k, so δ_k for hidden node k is
the sum of this effect over the set of output nodes I_k receiving
input from k. Then
δ_k = σ′(a_k)(t_pk − y_pk)              if k is an output node
δ_k = σ′(a_k) ∑_{j∈I_k} δ_j w_jk        if k is a hidden node     (10)
and weights for hidden or output node k are adjusted by
∆w_ki = αδ_k x_pki     (11)
This multi-layer delta rule is incorporated into a backpropagation algorithm
which first randomly initializes the network weights. Then, until the error is
acceptably low, the training set is looped through and the network is trained on
each pattern. A pattern is presented to the input layer. Then the hidden layer
computes its outputs, and feeds them to the output layer. After the output layer
evaluates its outputs, the target vector is applied to the output layer. Then the
δ for each output node is determined by equation 10 and the output layer is
trained by equation 11. Lastly the δ of each hidden node is calculated from
equation 10, and the hidden layer is trained by equation 11.[4]
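The backpropagation procedure described above can be sketched as a small Python class. This is a generic single-hidden-layer implementation of equations 10 and 11 with separate learning rates for the two weight layers (as this project uses later); the class and variable names are mine, not the project's code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class BackpropNetwork:
    """Fully connected network with one hidden layer, trained by the multi-layer delta rule."""

    def __init__(self, n_in, n_hidden, n_out, alpha_ih=0.1, alpha_ho=0.01,
                 init_ih=0.1, init_ho=0.25):
        rng = np.random.default_rng(0)
        # The extra column in each weight matrix holds the bias ("threshold") weight.
        self.w_ih = rng.uniform(-init_ih, init_ih, (n_hidden, n_in + 1))
        self.w_ho = rng.uniform(-init_ho, init_ho, (n_out, n_hidden + 1))
        self.alpha_ih, self.alpha_ho = alpha_ih, alpha_ho

    def forward(self, x):
        x_b = np.append(x, -1.0)                       # bias input of -1
        hidden = sigmoid(self.w_ih @ x_b)
        hidden_b = np.append(hidden, -1.0)
        output = sigmoid(self.w_ho @ hidden_b)
        return x_b, hidden_b, output

    def train_pattern(self, x, target):
        x_b, hidden_b, output = self.forward(x)
        # Output-layer delta (equation 10, output case).
        delta_out = output * (1 - output) * (target - output)
        # Hidden-layer delta (equation 10, hidden case); drop the bias column when propagating back.
        delta_hid = hidden_b[:-1] * (1 - hidden_b[:-1]) * (self.w_ho[:, :-1].T @ delta_out)
        # Weight updates (equation 11).
        self.w_ho += self.alpha_ho * np.outer(delta_out, hidden_b)
        self.w_ih += self.alpha_ih * np.outer(delta_hid, x_b)
```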
2 Classifying Musical Notes
2.1 Technique and Methodology
2.1.1 The Artificial Neural Network
The Artificial Neural Network used for this project is a multi-layer backprop-
agation network. The network has as many output nodes as there are classes
of input data. Thus, if there are K distinct musical note classes in the training
set, there are K output nodes in the network. Each node corresponds to a class.
Ideally, the corresponding output node for the class to which the input belongs
fires a 1 and all other output nodes fire a 0. K output nodes are used rather than
log₂ K nodes because there are more weights to adjust, allowing for more expres-
sive decision making. Additionally, this method supplies information about the
certainty of the network. For instance, if one output node fires a value of 0.9,
and another fires a value of 0.8, the network is uncertain about its classification
since both values are close.[3]
The network contains a single layer of hidden nodes, each of which is fully
connected to both the input layer and the output layer. The number of hidden
nodes is user-adjustable. A goal of the project is to determine a functional
relation between the number of hidden nodes and the network's success in learning
and generalizing.
The learning rate of the network is split into learning rates αIH and αHO
for the weights from the input layer to the hidden layer and from the hidden
layer to the output layer respectively.
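As a usage illustration only, the backpropagation sketch from Section 1.1.3 could be configured for this architecture roughly as follows; the specific numbers simply echo the settings reported later in the results, and nothing here is the project's actual code.

```python
# Hypothetical configuration for K note classes.
K = 7            # one output node per musical note class
H = 3 * K        # hidden nodes, within the 3K <= H <= 4K range found in the results
net = BackpropNetwork(n_in=40,        # 40-component compressed cepstral input (Section 2.1.3)
                      n_hidden=H,
                      n_out=K,
                      alpha_ih=0.1,   # learning rate, input to hidden layer
                      alpha_ho=0.01)  # learning rate, hidden to output layer

# Each training pattern would pair a 40-component cepstral vector with a one-hot
# target, e.g. the note class C mapping to [1, 0, 0, 0, 0, 0, 0].
```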
2.1.2 Signal Processing
Now the neural network for classifying musical notes is established, but its in-
puts remain to be determined. An initial idea for constructing a training set
might be to record a time-sampled sound signal of a note and use the N time
samples of sound energy as the components of a training vector. However, each
training vector would then be very large and computationally intensive for the network.
Additionally, better representations of the sound signal highlight relevant fea-
tures of the sound signal. For classifying musical notes the key feature is pitch.
Thus, for a training set a representation of the sound which provides fundamen-
tal and harmonic frequency information is important. To do this, sound signals
are first recorded, then relevant features are selected, and finally these features
are extracted and presented as input to the network.
2.1.2.1 Data Capture
Data capture is the first step of signal processing and is performed by sampling
the signal at regular time intervals. For this project sound signals of various
musical notes are created by striking individual keys on a Yamaha keyboard.
The sound energy is sampled with a microphone once every 2.5 × 10⁻⁵ seconds,
a 40 kHz sampling rate. This rate effectively avoids aliasing, by which a frequency
f > ½f_s, where f_s is the sampling rate, appears to be a frequency f′ = f − f_s. This
phenomenon is
shown in Figure 6 in which the sample points appear to generate the dashed
curve, but the actual signal is represented by the solid curve.[4] The signal is
recorded using AD Instruments software.
Figure 6
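The aliasing relation f′ = f − f_s can be checked numerically: a tone above the sampling rate produces exactly the same samples as the lower aliased tone. A small sketch with illustrative frequencies:

```python
import numpy as np

fs = 40_000.0                      # 40 kHz sampling rate (one sample every 2.5e-5 s)
n = np.arange(200)                 # sample indices
t = n / fs

f_true = 41_000.0                  # tone above the sampling rate
f_alias = f_true - fs              # apparent frequency f' = f - fs = 1 kHz

samples_true = np.sin(2 * np.pi * f_true * t)
samples_alias = np.sin(2 * np.pi * f_alias * t)

# The two sampled sequences are numerically indistinguishable: the 41 kHz tone aliases to 1 kHz.
print(np.allclose(samples_true, samples_alias))    # True
```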
2.1.2.2 Feature Selection
Once the data is recorded the relevant features are selected. Each sound signal
contains a wealth of information. Certain types of information are relevant to
a particular problem and irrelevant to others. For instance, a sound signal of
a spoken word contains information about the volume, speaker, and identity
of the word. Information relevant to speaker recognition is different from the
information relevant to word recognition. Thus it is important to select the
information relevant to the particular application.[4] For this project the pitch
of the signal is the relevant information. Cepstral analysis is a standard method
for extracting pitch information from speech. Since the future direction of the
project aims to expand to speech recognition of sung speech, cepstral analysis
is used to extract the pitch from the piano note sound signals.
2.1.2.3 Cepstral Analysis
Cepstral analysis is a signal processing technique that extracts the fundamental
and harmonic frequencies of a speech signal. A convolution of excitation and
vocal tract components forms a speech signal. The excitation component is
produced as air flows rapidly through the pharynx causing the vocal cords to
vibrate producing a sound at the frequency of the opening and closing of the
vocal cords. The mouth, throat, and nose form the vocal tract, which shapes
the sound into distinctions such as the "a" in hat or the "i" in hit.[4] Speech signals
are constructed by exciting the time varying characteristics of the system with
a periodic impulse sequence. Let s(t), e(t), and h(t) be the speech signal,
excitation sequence, and vocal tract filter sequence respectively. Then
s(t) = e(t) ∗ h(t) (12)
The ∗ represents a convolution in the time domain. It is often useful to trans-
form a signal from a time domain to a frequency domain. This is especially
useful when frequency related features dominate the problem such as in musical
note classification. Cepstral analysis deconvolves the excitation and vocal tract
components of a speech signal by transforming their product in the frequency
domain to a linear combination in the cepstral domain.[4] The flow of cepstral
analysis is illustrated by Figure 7.
Figure 7
An integral part of cepstral analysis is the Discrete Fourier Transform (DFT).
The DFT sends a time sampled signal to the frequency domain which reveals pe-
riodicities in the data and the energy of these periodicities. This shows energies
at various frequencies which is useful for pitch determination.
Time sampled signals transformed to the frequency domain by the DFT
result in large spikes at prevalent frequencies and smaller spikes at secondary
frequencies of the signal. The Discrete Fourier Transform is
F_n = ∑_{k=0}^{N−1} f_k e^{−2πikn/N}     (13)
where f_k is the kth time sample of the signal. This expression contains both real and
imaginary components; in cepstral analysis the magnitude is used.
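As a small illustration, numpy's FFT (an implementation of the DFT of equation 13) applied to a sampled tone produces a sharp spike at the tone's frequency; the tone and sample count below are illustrative.

```python
import numpy as np

fs = 40_000.0                                  # sampling rate
N = 2000                                       # number of time samples, as used in this project
t = np.arange(N) / fs
signal = np.sin(2 * np.pi * 440.0 * t)         # a 440 Hz test tone

magnitude = np.abs(np.fft.rfft(signal))        # |DFT| for a real-valued signal
freqs = np.fft.rfftfreq(N, d=1.0 / fs)

print(freqs[np.argmax(magnitude)])             # the peak bin sits at 440 Hz
```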
Figure 8
Now, cepstral analysis begins by
windowing a normalized time signal.
Applying a windowing function tapers the signal's endpoints, making it behave
as though it contained an integral number of cycles. This limits the leakage caused
by performing a DFT on a non-integer number of cycles. This leakage causes frequencies
at which large, sharp spikes should
appear to instead have smaller, more spread-out spikes (Figure 8). Cepstral
analysis applies a Hamming window, given by
w(n) = 0.54 − 0.46 cos(2πn/N),   0 ≤ n ≤ N     (14)
to correct leakage.
After applying the window, the magnitude of the DFT of the windowed
function is taken. The state of the signal is now a product of excitation and
vocal tract components in the frequency domain given by
|S(ω)| = |E(ω)| · |H(ω)| (15)
To transform the speech signal to a linear combination, the logarithm is taken
yielding
log|S(ω)| = log|E(ω)|+ log|H(ω)| (16)
The linearly combined excitation and vocal tract components are finally sepa-
rated by applying an Inverse Discrete Fourier Transform (IDFT) which trans-
forms the signal to the cepstral domain. In this domain excitation components,
which contain the pitch information, appear in the higher cepstral regions.
Thus cepstral analysis provides a method for extracting pitch information
from speech, and this project uses this method to extract pitch information of
musical notes.[4]
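The pipeline of Figure 7 (window, |DFT|, logarithm, IDFT) can be sketched in a few lines of Python. This is a generic real-cepstrum computation rather than the project's exact processing chain; the small constant added before the logarithm is my own guard against log(0).

```python
import numpy as np

def real_cepstrum(signal):
    """Window -> |DFT| -> log -> IDFT, following Figure 7."""
    N = len(signal)
    n = np.arange(N)
    hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / N)    # the window of equation 14
    windowed = signal * hamming

    spectrum = np.abs(np.fft.fft(windowed))               # |S(w)|, equation 15
    log_spectrum = np.log(spectrum + 1e-12)                # equation 16; small constant avoids log(0)
    return np.real(np.fft.ifft(log_spectrum))              # back to the cepstral domain

# Example: 2000 samples of a tone; excitation (pitch) structure appears in the higher cepstral regions.
fs = 40_000.0
t = np.arange(2000) / fs
cepstrum = real_cepstrum(np.sin(2 * np.pi * 440.0 * t))
print(cepstrum.shape)   # (2000,); the result is symmetric, so 1000 components carry the information
```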
2.1.3 Training Set
Sound signals of piano notes processed by cepstral analysis compose both train-
ing and test sets for the artificial neural network. The training set is used to
teach the network to classify the notes and is composed of seven classes of mu-
sical notes: C,D,E,F,G,A,B. The third, fourth, and fifth octaves are recorded
for each class. Thus, each class has three corresponding training vectors. Each
training pattern is a single piano note belonging to one of the classes.
2000 time samples of the sound signals are taken for cepstral analysis. Be-
cause cepstral analysis is symmetrical, this yields a 1000-component input for
each sound signal. An input size this large is computationally intensive for a
network, and therefore the input still needs to be compressed.[4] The input is
compressed from 1000 components to 40. Each of the 40 inputs is the average of a 25-
component block of the 1000-component cepstral coefficient vector. Information
is inherently lost in the data compression, but 40-component input vectors are
feasible for the artificial neural network. Finally a bias component is added to
each training vector. This completes the training vectors. Various training sets
are constructed from these training vectors. The network is trained on various
sets of 2,3,4, or 7 classes of musical notes.
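A minimal sketch of the compression step: average each consecutive 25-component block of the 1000-component cepstral vector and append a bias component. The bias value of −1 mirrors the threshold convention of Section 1.1; the project does not state the value it used, so treat that as an assumption.

```python
import numpy as np

def compress_cepstrum(cepstral_vector, block_size=25, bias=-1.0):
    """Average consecutive blocks of cepstral coefficients and append a bias component.
    1000 coefficients with block_size=25 give 40 averages plus the bias: a 41-component vector."""
    blocks = np.reshape(cepstral_vector, (-1, block_size))   # shape (40, 25) for a 1000-component input
    averaged = blocks.mean(axis=1)
    return np.append(averaged, bias)

# Example with a placeholder 1000-component cepstral vector.
training_vector = compress_cepstrum(np.arange(1000, dtype=float))
print(training_vector.shape)    # (41,)
```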
2.1.4 Test Set
The test set is a set of processed signals supplied to a trained network to test
its ability to generalize. These test signals are new recordings of the same piano
notes which compose the training set. An additional test vector is supplied for
each note which is a recording of that note in the fourth octave by a piano with
a different tone. The correct classification of these notes is known, but is not
supplied to the network. The network outputs its classification of the test vector
and its classification is manually compared with the correct classification to
determine the network’s ability to successfully classify data outside the training
set.
2.2 Results
2.2.1 Training Results
This project aims to determine optimal learning rates, initial weights, and num-
ber of hidden nodes for learning.
There are a vast number of choices for learning rates. Thus it is important
to find a good range of learning rates for successful learning. Various learning
rates were attempted when training the network on several training sets. When
there were two or three classes of input data, many learning rates resulted in
successful learning. The optimal learning rate from the input to the hidden
layer was functionally related to the learning rate from the hidden to output
layer. The network learned the training set most successfully when the following
functional relation held.
αIH ≈ 10 · αHO (17)
Now, for two or three classes the network learned successfully when
0.1 ≤ αIH ≤ 0.7
0.01 ≤ αHO ≤ 0.07
(18)
When the number of classes was greater than three the network learned the
training set when
αIH = 0.1
αHO = 0.01
(19)
The range of values into which the weights were randomly initialized also
affected the network's learning success. As with the learning rate, the range of
weight initialization was split into separate ranges for weights from the input
to hidden layer and from the hidden to output layer. The network’s success
was sensitive to initial conditions, but when the weights were initialized in the
following ranges, the network’s learning success was optimal.
Weight range for weights from input to hidden layer = [−0.1, 0.1]
Weight range for weights from hidden to output layer = [−0.25, 0.25]
(20)
Lastly, the number of hidden nodes had a significant impact on the network's
success in learning the training set. Lower and upper bounds were determined
for the optimal number of hidden nodes H needed to learn a training set of K
classes. An initial H was found which yielded minimal learning success, and H
was increased until the percentage of trials in which the training set was learned
was 100%.
Training sets of K = 2, K = 3, K = 4, and K = 7 classes were tested. Ten
trials were run on each training set. For K = 2, K = 3, and K = 4, a training
pattern was considered successfully classified if the correct output node fired
the highest value and its output was at least 0.2 greater than the second highest
output node’s output. A trial was considered completely successful if all of the
training vectors were classified correctly by the network and half successful if
one pattern was misclassified. For the training set consisting of all seven musical
note classes a pattern was considered successfully learned if the maximum output
corresponded to the correct output node regardless of the network’s certainty.
A trial was considered successful if fewer than three patterns were misclassified.
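These criteria can be expressed as a small check on the network's output vector. A sketch, with the margin of 0.2 applied only when K ≤ 4 as described above; the function name is my own.

```python
import numpy as np

def pattern_correct(outputs, correct_index, n_classes):
    """Success criterion used in training: the correct node must fire highest, and for
    K <= 4 it must also beat the runner-up by a margin of at least 0.2."""
    ranked = np.argsort(outputs)[::-1]            # node indices, highest output first
    if ranked[0] != correct_index:
        return False
    if n_classes <= 4:
        return outputs[ranked[0]] - outputs[ranked[1]] >= 0.2
    return True                                    # seven-class set: certainty is ignored

print(pattern_correct(np.array([0.9, 0.3, 0.1]), correct_index=0, n_classes=3))   # True
print(pattern_correct(np.array([0.9, 0.8, 0.1]), correct_index=0, n_classes=3))   # False: uncertain
```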
The network's success in learning various training sets is displayed in
Figure 9. For K = 2 the network learned with a high degree of success with
only H = 4 = 2K hidden nodes. For a training set of C and F notes, when
H = 6 = 3K the network always successfully learned the training set. With
the same number of hidden nodes, and a training set of C and D notes the
network learned the training set in 90% of the trials. This small difference may
be due to the small number of trials. Perhaps with a large enough number of trials, the success
of the network on training set {C,F} and {C,D} would be the same. Another
possibility is that the network was able to more easily distinguish between C
and F because the notes are more distant from each other than C and D. When
H = 8 = 4K the network always learned both {C,F} and {C,D}.
Three note sets were minimally successful with fewer than H = 3K hidden
nodes. The sets {C,E,G} and {G,A,B} were both successfully learned in 100% of
the trials when H = 12 = 4K. The network had greater success learning the set
{C,E,G} than learning {G,A,B} when H = 8 or H = 9. This could be due to the
greater distance between the notes in the set {C,E,G}. However, when H = 10
or H = 11 the set {G,A,B} was learned more often than {C,E,G}. Perhaps
as the number of hidden nodes is increased, the harmonics of the notes are
recognized more readily by the network. The set {C,E,G} has more similar harmonics
than {G,A,B} which could make {C,E,G} more difficult for the network to
learn. Successful learning for K = 4 classes was similar to the success of K = 3.
The full seven-class set was also learned in 100% of the trials with H = 3K hidden nodes.
Generally, given K classes the number of hidden nodes needed to obtain a high
degree of success in learning was
3K ≤ H ≤ 4K (21)
Figure 9
The network has a high degree of success learning difficult training sets.
Figures 10 and 11 are graphs of training vectors input to the network. Figure
10 shows the plots of C4, E4, and G4. The three signals have peaks in sim-
ilar regions and seem to have similar structures. It is not intuitively obvious
by looking at the input what qualities distinguish the three notes. Figure 11
displays the training vectors of C3, C4, and C5. These signals have peaks at
different points and are not visibly recognizable as the same class. However, the
network was able to classify these signals correctly. These two plots show the
nontrivial nature of the classification of the training signals. Yet, the network
learned these difficult sets with a high degree of success.
Figure 10
Figure 11
2.2.2 Generalization Results
After the network was trained its ability to generalize to new inputs was tested
on corresponding test sets. First, the network was trained on a particular train-
ing set. Then the trained network was tested with a test set consisting of the
notes of the same classes as the training set. For each training set, the network
was trained multiple times and tested across each training session.
Figure 12
The network was moderately successful at generalizing to classify new inputs.
The results of testing are displayed in Figure 12. For K = 2 classes, the network
classified test inputs most successfully with H = 8 hidden nodes. The network
classified 84% of the C and F test signals correctly and 94% of the C and D test
signals. With only H = 6 hidden nodes the network generalized well for the
{C,F} test set with 81% correct classification, but only 63% success for the set
{C,D}. This is still 13% greater success than randomly guessing should yield.
On sets of K = 3 classes success was more varied. A network with H = 11
hidden nodes successfully classified 85% of the test signals in {C,E,G}. This is
52% greater success than randomly guessing should yield. However, no number
of hidden nodes was able to classify test signals from {G,A,B} with more than
48% success, which is 15% greater than random guessing. Finally, the highest
success in classifying test signals for K = 4 classes was 44%. For K = 2, K = 3,
and K = 4 classes, a trained network was able to generalize to classify new input
with greater success than randomly guessing. A network trained on K = 7
classes of notes was unable to generalize to successfully classify new data.
The optimal number of hidden nodes for generalization varied as K varied.
For a constant number of classes, H varied as the notes composing the set
varied. A functional relationship for the optimal number of hidden nodes H for
classifying new data was indeterminate. However, intuitively as the number of
hidden nodes increased, the training set was learned more successfully. If too
many hidden nodes were used, the network was too tuned to the training set
and unable to generalize. If too few hidden nodes were used the underlying
structure of the input was not learned. The network experienced lower success
rates generalizing to sets consisting of notes that are closer together than those
consisting of notes that have a greater separation.
2.3 Sources of Error
There are several potential sources of error in the problem. Cepstral analy-
sis uses logarithms to separate the multiplication of excitation and vocal tract
components into a linear combination. However, if there is noise in the signal,
it appears as an additive term in the frequency domain. Thus taking the log-
arithm of the signal will no longer linearly separate the excitation and vocal
tract components.[4] The training set and test set were recorded on different
days. Perhaps the nature of the noise in the room on those days was different.
Perhaps the microphone was noisier one day than the other.
Error in learning from the training sets could result from sensitivity to initial
conditions of the network. Randomly initialized weights can lead the network
to settle on various local minima of the error function rather than at the global
minimum. A network settled at a local minimum of the error function may also
result in error upon generalizing to new inputs. Additional training patterns
may also increase the success of training and generalizing.
The network is also trained on a compressed vector consisting of averaged
blocks of the cepstral coefficients. Inherently in the compression of the cepstral
coefficient vector, information is lost. Some of the information lost may be
useful to the network. This loss could contribute to error.
2.4 Future Research
Future directions for research are broad. Determining the variables which af-
fect the optimal number of hidden nodes for network generalization is useful.
Improving generalization for notes that are close intervals is also beneficial.
It would be interesting to determine if the artificial neural network of this
project could learn chords with success. Finally, using the foundation estab-
lished by this and related projects, the concepts could be expanded to classify
sung speech.
References
[1] Gurney K (1997): An Introduction to Neural Networks. CRC Press. 1-6,
39-44.
[2] Haykin S (2009): Neural Networks and Learning Machines. PHI Learning.
[3] Parker L (2006): Notes on Multilayer, Feedforward Neural Networks.
University of Tennessee. http://web.eecs.utk.edu/~leparker/Courses/
CS594-spring06/handouts/Neural-net-notes.pdf
[4] Wroughton A (2014): Classifying Musical Notes Using Artificial Neural Net-
works. Indiana Wesleyan University.