Training and Testing An Artificial Neural
Network for Musical Note Classification
Adam Wroughton
Indiana Wesleyan University
December 10, 2014
Contents
1 Background
  1.1 Artificial Neural Networks
    1.1.1 Gradient Descent
    1.1.2 Single-Layer Delta Rule
    1.1.3 Backpropagation
2 Classifying Musical Notes
  2.1 Technique and Methodology
    2.1.1 The Artificial Neural Network
    2.1.2 Signal Processing
      2.1.2.1 Data Capture
      2.1.2.2 Feature Selection
      2.1.2.3 Cepstral Analysis
    2.1.3 Training Set
    2.1.4 Test Set
  2.2 Results
    2.2.1 Training Results
    2.2.2 Generalization Results
  2.3 Sources of Error
  2.4 Future Research
Abstract
This project trains an artificial neural network to correctly classify in-
dividual musical notes. Piano notes are recorded and processed by cepstral
analysis to form a training set and a test set. The network is trained and
tested on these sets. The network learns most successfully with learning
rates from input to hidden layer of 0.1 and hidden to output layer of 0.01.
The optimal initial weight range is [−0.1, 0.1] from the input to hidden layer
and [−0.25, 0.25] from the hidden to output layer. The optimal num-
ber of hidden nodes H for classification of K classes is 3K ≤ H ≤ 4K.
The network learns successfully up to the full seven note set. The network
generalizes for each number of classes K tested with success greater than
randomly guessing. Future goals are to determine a functional relation
for an optimal number of hidden nodes for generalization and to expand
the project to classify chords and sung speech.
1 Background
Supervised learning is a branch of machine learning in which a machine learns
to classify data with the help of a teacher. First, the machine is trained on a
set of data for which the correct classification is known. An input from this
set is presented to the machine and it outputs its classification of the input.
Next, the machine is presented with the correct output vector for the input. If there
is a discrepancy between the machine’s classification and the correct classifica-
tion, the machine adjusts its decision boundary in the direction of the correct
classification. This process is repeated until the machine learns the underlying
nature of the data. The machine then generalizes to classify data outside the
training set, for which the correct classification is unknown. Supervised learn-
ing is implemented with a learning machine called an artificial neural network
(ANN).[2]
1.1 Artificial Neural Networks
An artificial neural network’s goal is to classify data into distinct classes. Many
real world problems may be reduced to classification problems. Speech recogni-
tion classifies a sound as a particular word. Facial recognition classifies an image
as a particular person. A medical diagnosis classifies data from diagnostic tests
to indicate positive or negative results or specify a particular diagnosis. The
breadth of possible applications for ANNs encompasses any classification prob-
lem. A firm definition is needed to fully understand artificial neural networks.
A neural network is an interconnected assembly of simple processing
elements, units or nodes, whose functionality is loosely based on the
animal neuron. The processing ability of the network is stored in
the interunit connection strengths, or weights, obtained by a process
of adaptation to, or learning from, a set of training patterns.[1]
Figure 1: A depiction of a human neuron.
The node, or neuron, of an artifi-
cial neural network is modeled on the
human neuron. Human neurons re-
ceive incoming electrical signals from
connected neurons. The signals pass
through synapses forming the connec-
tion between neurons. These signals produce an excitatory or inhibitory
effect on the receiving neuron's tendency to fire its own response. The strength of
these effects is proportional to the strength of the corresponding synaptic weights.
These effects from
incoming signals are summed in the receiving neuron’s cell body. If this sum,
called the activation, is above a threshold value, the cell fires a high response
to other neurons. Otherwise it fires a low response. This process is depicted in
Figure 1.
Figure 2: A node of an artificial neural network. The magnitude of each weight is
represented by the area of its circle. The strength of the input into the weight, and
the corresponding effect after weighting, is represented by the thickness of the arrow.
This process is modeled artificially
as follows. A node has a correspond-
ing weight vector w. This vector is
the same dimensionality as the input
vector x to the node. Each com-
ponent of the input is multiplied by
the node’s corresponding weight com-
ponent and summed together by the
node to form the activation α. Math-
ematically activation is the dot prod-
uct α = w · x. If activation w · x ≥ θ
where θ is the threshold, then the node output y is high, usually 1. Otherwise
it outputs low, usually 0 or −1. To create a more expressive space to form a
decision boundary, the threshold is subtracted from each side of the inequality
so it is adjusted as if it were a weight with an input of −1. Thus for high output
w · x + (−1)θ ≥ 0. Let w and x be the augmented weight and input vectors
from the subtraction of the threshold. Then the node’s output is determined by
w · x ≥ 0 =⇒ y = 1
w · x < 0 =⇒ y = 0     (1)
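The node just described is straightforward to express in code. Below is a minimal sketch in Python; the function name, weights, and threshold values are illustrative and not taken from this project.

```python
import numpy as np

def node_output(w, x, theta):
    """Threshold node: fire 1 if the activation w . x reaches the threshold theta."""
    # Augment the input with a constant -1 so the threshold acts like an ordinary weight.
    w_aug = np.append(w, theta)
    x_aug = np.append(x, -1.0)
    activation = np.dot(w_aug, x_aug)      # alpha = w . x - theta
    return 1 if activation >= 0.0 else 0

# Two-input example: the node fires only when the weighted inputs reach the threshold of 0.5.
print(node_output(np.array([0.5, 0.5]), np.array([0.8, 0.4]), 0.5))   # 1
print(node_output(np.array([0.5, 0.5]), np.array([0.3, 0.1]), 0.5))   # 0
```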
1.1.1 Gradient Descent
To train a network, a measure of the difference between the network’s output
and the desired output is defined as an error function. The network adapts
its weights to minimize the total error by a process called gradient descent.[4]
Figure 3
Consider a situation in which the
function to be minimized forms the
parabola in Figure 3. Now consider a
point on this parabola. If the magni-
tude of the slope is large at the point,
the point is far along the parabola and
thus far from the minimum. A large step is made in the direction of the min-
imum without fear of overshooting it. As the point nears the minimum, the
magnitude of the slope at the point will decrease approaching 0. Thus as
the slope decreases, smaller steps are made in the direction of the minimum
because the point is likely near the minimum. This concept is generalized to
higher dimensions by replacing the function’s slope with the function’s gradient.
Gradient descent iteratively takes sequential steps in the direction of steepest
descent toward the minimum of a function, and the step size is proportional to
the magnitude of the gradient.
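As a concrete illustration of this step-size behavior, the following sketch minimizes the one-dimensional parabola E(w) = (w − 3)², stepping against the slope. The parabola, learning rate, and starting point are arbitrary choices for illustration only.

```python
def minimize_parabola(w=10.0, alpha=0.1, steps=50):
    """Minimize E(w) = (w - 3)^2 by repeatedly stepping against the slope dE/dw = 2(w - 3)."""
    for _ in range(steps):
        slope = 2.0 * (w - 3.0)
        w -= alpha * slope   # step size proportional to the slope: large far away, small near the minimum
    return w

print(minimize_parabola())   # converges toward the minimum at w = 3
```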
1.1.2 Single-Layer Delta Rule
Now to train a single-layer network the error measure is defined, and a training
rule is constructed which incorporates this error and gradient descent to adjust
the network’s weights.
Consider a single node artificial neural network. Let E be the total error of
the network. From gradient descent the node’s ith weight is adjusted by
∆w_i = −α ∂E/∂w_i     (2)
Now the overall error E is the average of the error ep of each input pattern.
Rather than use E for gradient descent, ep, the error of the current pattern, is
used to estimate E, and the weights are adapted according to gradient descent
on ep.[1]
Now, ep must be defined. For pattern p let tp be the desired output, yp be
the node’s output, and a be the node’s activation, the weighted sum of inputs.
An initial error measure is the square of the difference between desired and
actual output. However, for gradient descent, partial derivatives of a continuous
function must be computed, but yp relies on a discontinuous step function. If
activation is above a threshold yp is 1, otherwise it is 0. Thus, rather than use
yp, activation a is used. Additionally a factor of 1/2 is inserted to simplify the
partial derivatives. Thus far
e_p = ½(t_p − a_p)²     (3)
Figure 4
A more intuitive error measure
uses the node’s actual output yp in-
stead of activation, but to do this yp
must be defined so that it is continuous. To achieve continuity of yp, the
step function which uses activation to
output 1 or 0 is replaced by a contin-
uous squashing function such as the
sigmoid function σ(a) of Figure 4 centered about the threshold. A large activa-
tion results in a yp close to 1, whereas a low activation yields a yp very close to
0. If activation is at the threshold, yp = 1/2. [4] Now
e_p = ½(t_p − y_p)²     (4)
Now the weights are adjusted proportionally according to the partial deriva-
tive of the error, which is
∂e_p/∂w_i = −(t_p − y_p)x_pi     (5)
where x_pi is the ith component of pattern p.
Weights are then adjusted by gradient descent according to
∆w_i = α(t_p − y_p)x_pi     (6)
A final addition is made to the learning rule of equation 6. The derivative of
the sigmoid function is included. For high or low network output, the activation is
far from the threshold, where the sigmoid is fairly flat, so its derivative is minimal
(Figure 4). Thus a small weight change creates a small change in activation,
yielding only a small change in output.
This change in output results in a small change to ep. Near the threshold,
however, the sigmoid is changing rapidly. Thus a small change in weight results
in large changes in activation, network output, and error. Thus the weight
change is proportional to the sigmoid function’s derivative. The final learning
rule, called the delta rule, for a single node is
∆w_i = ασ′(a)(t_p − y_p)x_pi     (7)
This is expanded to a single-layer network of M nodes by summing the error over
the nodes. Then
e_p = ½ ∑_{j=1}^{M} (t_pj − y_pj)²     (8)
Then the delta rule for the ith weight of the jth node is
∆w_ji = ασ′(a_j)(t_pj − y_pj)x_pji     (9)
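A minimal sketch of one delta-rule update for a single sigmoid node, following equation 7, is given below. The learning rate and the toy pattern are illustrative; σ′(a) = σ(a)(1 − σ(a)) is the standard derivative of the sigmoid.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def delta_rule_step(w, x, t, alpha=0.1):
    """One delta-rule update for a single sigmoid node (equation 7).
    w and x are the augmented weight and input vectors; t is the desired output."""
    a = np.dot(w, x)                              # activation
    y = sigmoid(a)                                # continuous node output
    dw = alpha * y * (1.0 - y) * (t - y) * x      # sigma'(a) = sigma(a)(1 - sigma(a))
    return w + dw

# Toy example: repeatedly push the node's output toward 1 for one input pattern.
w = np.array([0.1, -0.2, 0.05])                   # last component plays the role of the threshold
x = np.array([0.7, 0.3, -1.0])
for _ in range(200):
    w = delta_rule_step(w, x, t=1.0)
print(sigmoid(np.dot(w, x)))                      # output has risen toward 1
```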
1.1.3 Backpropagation
Neural networks need an additional hidden layer of nodes to learn complex data
sets. This hidden layer receives the inputs, and then produces its own outputs
which it feeds to the network’s output layer. The information flow for a multiple
layer neural network is shown in Figure 5.
Figure 5
The single layer delta rule is ex-
panded for multiple layer networks.
The quantity (t_pj − y_pj) is directly ac-
cessible for the output layer. Thus
the output layer follows the delta rule
of equation 9. Hidden nodes also
contribute to the network’s error and
need to adapt, but t_pk, the desired output for hidden node k, is unknown. An
equivalent term to (t_pj − y_pj), designated δ, must be defined for hidden nodes.
Consider hidden
node k and output node j. Now the effect on total error from the jth output
node is affected by j's inherent error δ_j and by the influence, w_jk, that the kth hidden
node has on the output of j. The effect is combined by multiplying the two
contributing factors and the derivative of the sigmoid into the term σ′(a_k)δ_j w_jk.
Several output nodes receive input from hidden node k, so δ_k for hidden node k is
the sum of this effect over the set of output nodes I_k receiving
input from k. Then
δ_k = σ′(a_k)(t_pk − y_pk)              if k is an output node
δ_k = σ′(a_k) ∑_{j∈I_k} δ_j w_jk        if k is a hidden node     (10)
and weights for hidden or output node k are adjusted by
∆w_ki = αδ_k x_pki     (11)
This multi-layer delta rule is incorporated into a backpropagation algorithm
which first randomly initializes the network weights. Then, until the error is
acceptably low, the training set is looped through and the network is trained on
each pattern. A pattern is presented to the input layer. Then the hidden layer
computes its outputs, and feeds them to the output layer. After the output layer
evaluates its outputs, the target vector is applied to the output layer. Then the
δ for each output node is determined by equation 10 and the output layer is
trained by equation 11. Lastly the δ of each hidden node is calculated from
equation 10, and the hidden layer is trained by equation 11.[4]
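The backpropagation procedure described above can be sketched as a small Python class. This is a generic single-hidden-layer implementation of equations 10 and 11 with separate learning rates for the two weight layers (as this project uses later); the class and variable names are mine, not the project's code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class BackpropNetwork:
    """Fully connected network with one hidden layer, trained by the multi-layer delta rule."""

    def __init__(self, n_in, n_hidden, n_out, alpha_ih=0.1, alpha_ho=0.01,
                 init_ih=0.1, init_ho=0.25):
        rng = np.random.default_rng(0)
        # The extra column in each weight matrix holds the bias ("threshold") weight.
        self.w_ih = rng.uniform(-init_ih, init_ih, (n_hidden, n_in + 1))
        self.w_ho = rng.uniform(-init_ho, init_ho, (n_out, n_hidden + 1))
        self.alpha_ih, self.alpha_ho = alpha_ih, alpha_ho

    def forward(self, x):
        x_b = np.append(x, -1.0)                       # bias input of -1
        hidden = sigmoid(self.w_ih @ x_b)
        hidden_b = np.append(hidden, -1.0)
        output = sigmoid(self.w_ho @ hidden_b)
        return x_b, hidden_b, output

    def train_pattern(self, x, target):
        x_b, hidden_b, output = self.forward(x)
        # Output-layer delta (equation 10, output case).
        delta_out = output * (1 - output) * (target - output)
        # Hidden-layer delta (equation 10, hidden case); drop the bias column when propagating back.
        delta_hid = hidden_b[:-1] * (1 - hidden_b[:-1]) * (self.w_ho[:, :-1].T @ delta_out)
        # Weight updates (equation 11).
        self.w_ho += self.alpha_ho * np.outer(delta_out, hidden_b)
        self.w_ih += self.alpha_ih * np.outer(delta_hid, x_b)
```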
2 Classifying Musical Notes
2.1 Technique and Methodology
2.1.1 The Artificial Neural Network
The Artificial Neural Network used for this project is a multi-layer backprop-
agation network. The network has as many output nodes as there are classes
of input data. Thus, if there are K distinct musical note classes in the training
set, there are K output nodes in the network. Each node corresponds to a class.
Ideally, the corresponding output node for the class to which the input belongs
fires a 1 and all other output nodes fire a 0. K output nodes are used rather than
log₂ K nodes because there are more weights to adjust, allowing for more expres-
sive decision making. Additionally, this method supplies information about the
certainty of the network. For instance, if one output node fires a value of 0.9,
and another fires a value of 0.8, the network is uncertain about its classification
since both values are close.[3]
The network contains a single layer of hidden nodes, each of which is fully
connected to both the input layer and the output layer. The number of hidden
nodes is user-adjustable. A goal of the project is to determine a functional
relation between the number of hidden nodes and the network's success in learning
and generalizing.
The learning rate of the network is split into learning rates αIH and αHO
for the weights from the input layer to the hidden layer and from the hidden
layer to the output layer respectively.
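As a usage illustration only, the backpropagation sketch from Section 1.1.3 could be configured for this architecture roughly as follows; the specific numbers simply echo the settings reported later in the results, and nothing here is the project's actual code.

```python
# Hypothetical configuration for K note classes.
K = 7            # one output node per musical note class
H = 3 * K        # hidden nodes, within the 3K <= H <= 4K range found in the results
net = BackpropNetwork(n_in=40,        # 40-component compressed cepstral input (Section 2.1.3)
                      n_hidden=H,
                      n_out=K,
                      alpha_ih=0.1,   # learning rate, input to hidden layer
                      alpha_ho=0.01)  # learning rate, hidden to output layer

# Each training pattern would pair a 40-component cepstral vector with a one-hot
# target, e.g. the note class C mapping to [1, 0, 0, 0, 0, 0, 0].
```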
2.1.2 Signal Processing
Now the neural network for classifying musical notes is established, but its in-
puts remain to be determined. An initial idea for constructing a training set
might be to record a time-sampled sound signal of a note and use the N time
samples of sound energy as the components of a training vector. However, each
training vector would then be very large and computationally intensive for the network.
Additionally, better representations of the sound signal highlight relevant fea-
tures of the sound signal. For classifying musical notes the key feature is pitch.
Thus, for a training set a representation of the sound which provides fundamen-
tal and harmonic frequency information is important. To do this, sound signals
are first recorded, then relevant features are selected, and finally these features
are extracted and presented as input to the network.
2.1.2.1 Data Capture
Data capture is the first step of signal processing and is performed by sampling
the signal at regular time intervals. For this project sound signals of various
musical notes are created by striking individual keys on a Yamaha keyboard.
The sound energy is sampled with a microphone once every 2.5 × 10⁻⁵ seconds,
a 40 kHz sampling rate. This rate effectively avoids aliasing, by which a frequency
f > ½f_s, where f_s is the sampling rate, appears to be a frequency f′ = f − f_s. This
phenomenon is
shown in Figure 6 in which the sample points appear to generate the dashed
curve, but the actual signal is represented by the solid curve.[4] The signal is
recorded using AD Instruments software.
Figure 6
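The aliasing relation f′ = f − f_s can be checked numerically: a tone above the sampling rate produces exactly the same samples as the lower aliased tone. A small sketch with illustrative frequencies:

```python
import numpy as np

fs = 40_000.0                      # 40 kHz sampling rate (one sample every 2.5e-5 s)
n = np.arange(200)                 # sample indices
t = n / fs

f_true = 41_000.0                  # tone above the sampling rate
f_alias = f_true - fs              # apparent frequency f' = f - fs = 1 kHz

samples_true = np.sin(2 * np.pi * f_true * t)
samples_alias = np.sin(2 * np.pi * f_alias * t)

# The two sampled sequences are numerically indistinguishable: the 41 kHz tone aliases to 1 kHz.
print(np.allclose(samples_true, samples_alias))    # True
```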
2.1.2.2 Feature Selection
Once the data is recorded the relevant features are selected. Each sound signal
contains a wealth of information. Certain types of information are relevant to
a particular problem and irrelevant to others. For instance, a sound signal of
a spoken word contains information about the volume, speaker, and identity
of the word. Information relevant to speaker recognition is different from the
information relevant to word recognition. Thus it is important to select the
information relevant to the particular application.[4] For this project the pitch
of the signal is the relevant information. Cepstral analysis is a standard method
for extracting pitch information from speech. Since the future direction of the
project aims to expand to speech recognition of sung speech, cepstral analysis
is used to extract the pitch from the piano note sound signals.
2.1.2.3 Cepstral Analysis
Cepstral analysis is a signal processing technique that extracts the fundamental
and harmonic frequencies of a speech signal. A convolution of excitation and
vocal tract components forms a speech signal. The excitation component is
produced as air flows rapidly through the pharynx causing the vocal cords to
vibrate producing a sound at the frequency of the opening and closing of the
vocal cords. The mouth, throat, and nose form the vocal tract, which shapes
the sound into distinctions such as the "a" in hat or the "i" in hit.[4] Speech signals
are constructed by exciting the time varying characteristics of the system with
a periodic impulse sequence. Let s(t), e(t), and h(t) be the speech signal,
excitation sequence, and vocal tract filter sequence respectively. Then
s(t) = e(t) ∗ h(t) (12)
The ∗ represents a convolution in the time domain. It is often useful to trans-
form a signal from a time domain to a frequency domain. This is especially
useful when frequency related features dominate the problem such as in musical
note classification. Cepstral analysis deconvolves the excitation and vocal tract
components of a speech signal by transforming their product in the frequency
domain to a linear combination in the cepstral domain.[4] The flow of cepstral
analysis is illustrated by Figure 7.
Figure 7
An integral part of cepstral analysis is the Discrete Fourier Transform (DFT).
The DFT sends a time sampled signal to the frequency domain which reveals pe-
riodicities in the data and the energy of these periodicities. This shows energies
at various frequencies which is useful for pitch determination.
Time sampled signals transformed to the frequency domain by the DFT
result in large spikes at prevalent frequencies and smaller spikes at secondary
frequencies of the signal. The Discrete Fourier Transform is
F_n = ∑_{k=0}^{N−1} f_k e^{−2πikn/N}     (13)
where f_k is the kth time sample of the signal. This expression contains both real and
imaginary components; in cepstral analysis the magnitude is used.
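As a small illustration, numpy's FFT (an implementation of the DFT of equation 13) applied to a sampled tone produces a sharp spike at the tone's frequency; the tone and sample count below are illustrative.

```python
import numpy as np

fs = 40_000.0                                  # sampling rate
N = 2000                                       # number of time samples, as used in this project
t = np.arange(N) / fs
signal = np.sin(2 * np.pi * 440.0 * t)         # a 440 Hz test tone

magnitude = np.abs(np.fft.rfft(signal))        # |DFT| for a real-valued signal
freqs = np.fft.rfftfreq(N, d=1.0 / fs)

print(freqs[np.argmax(magnitude)])             # the peak bin sits at 440 Hz
```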
Figure 8
Now, cepstral analysis begins by
windowing a normalized time signal.
Applying a windowing function tapers the signal's endpoints, making it behave
as though it contained an integral number of cycles. This limits the leakage caused
by performing a DFT on a non-integer number of cycles. This leakage causes frequencies
at which large, sharp spikes should
appear to instead have smaller, more spread-out spikes (Figure 8). Cepstral
analysis applies a Hamming window, given by
w(n) = 0.54 − 0.46 cos(2πn/N),   0 ≤ n ≤ N     (14)
to correct leakage.
After applying the window, the magnitude of the DFT of the windowed
function is taken. The state of the signal is now a product of excitation and
vocal tract components in the frequency domain given by
|S(ω)| = |E(ω)| · |H(ω)| (15)
To transform the speech signal to a linear combination, the logarithm is taken
yielding
log|S(ω)| = log|E(ω)|+ log|H(ω)| (16)
The linearly combined excitation and vocal tract components are finally sepa-
rated by applying an Inverse Discrete Fourier Transform (IDFT) which trans-
forms the signal to the cepstral domain. In this domain excitation components,
which contain the pitch information, appear in the higher cepstral regions.
Thus cepstral analysis provides a method for extracting pitch information
from speech, and this project uses this method to extract pitch information of
musical notes.[4]
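The pipeline of Figure 7 (window, |DFT|, logarithm, IDFT) can be sketched in a few lines of Python. This is a generic real-cepstrum computation rather than the project's exact processing chain; the small constant added before the logarithm is my own guard against log(0).

```python
import numpy as np

def real_cepstrum(signal):
    """Window -> |DFT| -> log -> IDFT, following Figure 7."""
    N = len(signal)
    n = np.arange(N)
    hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / N)    # the window of equation 14
    windowed = signal * hamming

    spectrum = np.abs(np.fft.fft(windowed))               # |S(w)|, equation 15
    log_spectrum = np.log(spectrum + 1e-12)                # equation 16; small constant avoids log(0)
    return np.real(np.fft.ifft(log_spectrum))              # back to the cepstral domain

# Example: 2000 samples of a tone; excitation (pitch) structure appears in the higher cepstral regions.
fs = 40_000.0
t = np.arange(2000) / fs
cepstrum = real_cepstrum(np.sin(2 * np.pi * 440.0 * t))
print(cepstrum.shape)   # (2000,); the result is symmetric, so 1000 components carry the information
```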
2.1.3 Training Set
Sound signals of piano notes processed by cepstral analysis compose both train-
ing and test sets for the artificial neural network. The training set is used to
teach the network to classify the notes and is composed of seven classes of mu-
sical notes: C,D,E,F,G,A,B. The third, fourth, and fifth octaves are recorded
for each class. Thus, each class has three corresponding training vectors. Each
training pattern is a single piano note belonging to one of the classes.
2000 time samples of the sound signals are taken for cepstral analysis. Be-
cause cepstral analysis is symmetrical, this yields a 1000-component input for
each sound signal. An input size this large is computationally intensive for a
network, and therefore the input still needs to be compressed.[4] The input is
compressed from 1000 components to 40. Each of the 40 inputs is the average of a 25-
component block of the 1000-component cepstral coefficient vector. Information
is inherently lost in the data compression, but 40-component input vectors are
feasible for the artificial neural network. Finally a bias component is added to
each training vector. This completes the training vectors. Various training sets
are constructed from these training vectors. The network is trained on various
sets of 2,3,4, or 7 classes of musical notes.
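A minimal sketch of the compression step: average each consecutive 25-component block of the 1000-component cepstral vector and append a bias component. The bias value of −1 mirrors the threshold convention of Section 1.1; the project does not state the value it used, so treat that as an assumption.

```python
import numpy as np

def compress_cepstrum(cepstral_vector, block_size=25, bias=-1.0):
    """Average consecutive blocks of cepstral coefficients and append a bias component.
    1000 coefficients with block_size=25 give 40 averages plus the bias: a 41-component vector."""
    blocks = np.reshape(cepstral_vector, (-1, block_size))   # shape (40, 25) for a 1000-component input
    averaged = blocks.mean(axis=1)
    return np.append(averaged, bias)

# Example with a placeholder 1000-component cepstral vector.
training_vector = compress_cepstrum(np.arange(1000, dtype=float))
print(training_vector.shape)    # (41,)
```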
2.1.4 Test Set
The test set is a set of processed signals supplied to a trained network to test
its ability to generalize. These test signals are new recordings of the same piano
notes which compose the training set. An additional test vector is supplied for
each note which is a recording of that note in the fourth octave by a piano with
a different tone. The correct classification of these notes is known, but is not
supplied to the network. The network outputs its classification of the test vector
and its classification is manually compared with the correct classification to
determine the network’s ability to successfully classify data outside the training
set.
2.2 Results
2.2.1 Training Results
This project aims to determine optimal learning rates, initial weights, and num-
ber of hidden nodes for learning.
There are a vast number of choices for learning rates. Thus it is important
to find a good range of learning rates for successful learning. Various learning
rates were attempted when training the network on several training sets. When
there were two or three classes of input data, many learning rates resulted in
successful learning. The optimal learning rate from the input to the hidden
layer was functionally related to the learning rate from the hidden to output
layer. The network learned the training set most successfully when the following
functional relation held.
αIH ≈ 10 · αHO (17)
Now, for two or three classes the network learned successfully when
0.1 ≤ αIH ≤ 0.7
0.01 ≤ αHO ≤ 0.07
(18)
When the number of classes was greater than three the network learned the
training set when
αIH = 0.1
αHO = 0.01
(19)
The range of values into which the weights were randomly initialized also
affected the network's learning success. As with the learning rate, the range of
weight initialization was split into separate ranges for weights from the input
to hidden layer and from the hidden to output layer. The network’s success
was sensitive to initial conditions, but when the weights were initialized in the
following ranges, the network’s learning success was optimal.
Weight range for weights from input to hidden layer = [−0.1, 0.1]
Weight range for weights from hidden to output layer = [−0.25, 0.25]
(20)
Lastly, the number of hidden nodes had a significant impact on the network's
success in learning the training set. Lower and upper bounds were determined
for the optimal number of hidden nodes H needed to learn a training set of K
classes. An initial H was found which yielded minimal learning success, and H
was increased until the percentage of trials in which the training set was learned
was 100%.
Training sets of K = 2, K = 3, K = 4, and K = 7 classes were tested. Ten
trials were run on each training set. For K = 2, K = 3, and K = 4, a training
pattern was considered successfully classified if the correct output node fired
the highest value and its output was at least 0.2 greater than the second highest
output node’s output. A trial was considered completely successful if all of the
training vectors were classified correctly by the network and half successful if
one pattern was misclassified. For the training set consisting of all seven musical
note classes a pattern was considered successfully learned if the maximum output
corresponded to the correct output node regardless of the network’s certainty.
A trial was considered successful if fewer than three patterns were misclassified.
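These criteria can be expressed as a small check on the network's output vector. A sketch, with the margin of 0.2 applied only when K ≤ 4 as described above; the function name is my own.

```python
import numpy as np

def pattern_correct(outputs, correct_index, n_classes):
    """Success criterion used in training: the correct node must fire highest, and for
    K <= 4 it must also beat the runner-up by a margin of at least 0.2."""
    ranked = np.argsort(outputs)[::-1]            # node indices, highest output first
    if ranked[0] != correct_index:
        return False
    if n_classes <= 4:
        return outputs[ranked[0]] - outputs[ranked[1]] >= 0.2
    return True                                    # seven-class set: certainty is ignored

print(pattern_correct(np.array([0.9, 0.3, 0.1]), correct_index=0, n_classes=3))   # True
print(pattern_correct(np.array([0.9, 0.8, 0.1]), correct_index=0, n_classes=3))   # False: uncertain
```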
The network's success in learning various training sets is displayed in
Figure 9. For K = 2 the network learned with a high degree of success with
only H = 4 = 2K hidden nodes. For a training set of C and F notes, when
H = 6 = 3K the network always successfully learned the training set. With
the same number of hidden nodes, and a training set of C and D notes the
network learned the training set in 90% of the trials. This small difference may
be due to the small number of trials. Perhaps with a large enough number of trials, the success
of the network on training set {C,F} and {C,D} would be the same. Another
possibility is that the network was able to more easily distinguish between C
and F because the notes are more distant from each other than C and D. When
H = 8 = 4K the network always learned both {C,F} and {C,D}.
Three note sets were minimally successful with fewer than H = 3K hidden
nodes. The sets {C,E,G} and {G,A,B} were both successfully learned in 100% of
the trials when H = 12 = 4K. The network had greater success learning the set
{C,E,G} than learning {G,A,B} when H = 8 or H = 9. This could be due to the
greater distance between the notes in the set {C,E,G}. However, when H = 10
or H = 11 the set {G,A,B} was learned more often than {C,E,G}. Perhaps
as the number of hidden nodes is increased, the harmonics of the notes are
recognized more readily by the network. The set {C,E,G} has more similar harmonics
than {G,A,B} which could make {C,E,G} more difficult for the network to
learn. Successful learning for K = 4 classes was similar to the success of K = 3.
The full seven-class set was also learned in 100% of the trials with H = 3K hidden nodes.
Generally, given K classes the number of hidden nodes needed to obtain a high
degree of success in learning was
3K ≤ H ≤ 4K (21)
Figure 9
The network has a high degree of success learning difficult training sets.
Figures 10 and 11 are graphs of training vectors input to the network. Figure
10 shows the plots of C4, E4, and G4. The three signals have peaks in sim-
ilar regions and seem to have similar structures. It is not intuitively obvious
by looking at the input what qualities distinguish the three notes. Figure 11
displays the training vectors of C3, C4, and C5. These signals have peaks at
different points and are not visibly recognizable as the same class. However, the
network was able to classify these signals correctly. These two plots show the
nontrivial nature of the classification of the training signals. Yet, the network
learned these difficult sets with a high degree of success.
Figure 10
Figure 11
2.2.2 Generalization Results
After the network was trained its ability to generalize to new inputs was tested
on corresponding test sets. First, the network was trained on a particular train-
ing set. Then the trained network was tested with a test set consisting of the
notes of the same classes as the training set. For each training set, the network
was trained multiple times and tested across each training session.
Figure 12
The network was moderately successful at generalizing to classify new inputs.
The results of testing are displayed in Figure 12. For K = 2 classes, the network
classified test inputs most successfully with H = 8 hidden nodes. The network
classified 84% of the C and F test signals correctly and 94% of the C and D test
signals. With only H = 6 hidden nodes the network generalized well for the
{C,F} test set with 81% correct classification, but only 63% success for the set
{C,D}. This is still 13% greater success than randomly guessing should yield.
On sets of K = 3 classes success was more varied. A network with H = 11
hidden nodes successfully classified 85% of the test signals in {C,E,G}. This is
52% greater success than randomly guessing should yield. However, no number
of hidden nodes was able to classify test signals from {G,A,B} with more than
48% success, which is 15% greater than random guessing. Finally, the highest
success in classifying test signals for K = 4 classes was 44%. For K = 2, K = 3,
and K = 4 classes, a trained network was able to generalize to classify new input
with greater success than randomly guessing. A network trained on K = 7
classes of notes was unable to generalize to successfully classify new data.
The optimal number of hidden nodes for generalization varied as K varied.
For a constant number of classes, H varied as the notes composing the set
varied. A functional relationship for the optimal number of hidden nodes H for
classifying new data was indeterminate. However, intuitively as the number of
hidden nodes increased, the training set was learned more successfully. If too
many hidden nodes were used, the network was too tuned to the training set
and unable to generalize. If too few hidden nodes were used the underlying
structure of the input was not learned. The network experienced lower success
rates generalizing to sets consisting of notes that are closer together than those
consisting of notes that have a greater separation.
2.3 Sources of Error
There are several potential sources of error in the problem. Cepstral analy-
sis uses logarithms to separate the multiplication of excitation and vocal tract
components into a linear combination. However, if there is noise in the signal,
it appears as an additive term in the frequency domain. Thus taking the log-
arithm of the signal will no longer linearly separate the excitation and vocal
tract components.[4] The training set and test set were recorded on different
days. Perhaps the nature of the noise in the room on those days was different.
Perhaps the microphone was noisier one day than the other.
Error in learning from the training sets could result from sensitivity to initial
conditions of the network. Randomly initialized weights can lead the network
to settle on various local minima of the error function rather than at the global
minimum. A network settled at a local minimum of the error function may also
result in error upon generalizing to new inputs. Additional training patterns
may also increase the success of training and generalizing.
The network is also trained on a compressed vector consisting of averaged
blocks of the cepstral coefficients. Inherently in the compression of the cepstral
coefficient vector, information is lost. Some of the information lost may be
useful to the network. This loss could contribute to error.
2.4 Future Research
Future directions for research are broad. Determining the variables which af-
fect the optimal number of hidden nodes for network generalization is useful.
Improving generalization for notes that are close intervals is also beneficial.
It would be interesting to determine if the artificial neural network of this
project could learn chords with success. Finally, using the foundation estab-
lished by this and related projects, the concepts could be expanded to classify
sung speech.
References
[1] Gurney K (1997): An Introduction to Neural Networks. CRC Press. 1-6,
39-44.
[2] Haykin S (2009): Neural Networks and Learning Machines. PHI Learning.
[3] Parker L (2006): Notes on Multilayer, Feedforward Neural Networks.
University of Tennessee. http://web.eecs.utk.edu/~leparker/Courses/
CS594-spring06/handouts/Neural-net-notes.pdf
[4] Wroughton A (2014): Classifying Musical Notes Using Artificial Neural Net-
works. Indiana Wesleyan University.