long short-term memory projection recurrent neural network

Research ArticleLong Short-Term Memory Projection Recurrent Neural NetworkArchitectures for Pianorsquos Continuous Note Recognition

YuKang Jia1 ZhichengWu1 Yanyan Xu1 Dengfeng Ke2 and Kaile Su3

1School of Information Science and Technology Beijing Forestry University No 35 Qinghuadong Road Haidian DistrictBeijing 100083 China2National Laboratory of Pattern Recognition Institute of Automation Chinese Academy of Sciences No 95 Zhongguancundong RoadHaidian District Beijing 100190 China3College of Information Science and Technology Jinan University No 601 West Huangpu Avenue GuangzhouGuangdong 510632 China

Correspondence should be addressed to Yanyan Xu xuyanyanbjfueducn

Received 10 May 2017 Revised 30 July 2017 Accepted 6 August 2017 Published 12 September 2017

Academic Editor Keigo Watanabe

Copyright copy 2017 YuKang Jia et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Long Short-TermMemory (LSTM) is a kind of Recurrent Neural Networks (RNN) relating to time series which has achieved goodperformance in speech recogniton and image recognition Long Short-Term Memory Projection (LSTMP) is a variant of LSTMto further optimize speed and performance of LSTM by adding a projection layer As LSTM and LSTMP have performed well inpattern recognition in this paper we combine themwith Connectionist Temporal Classification (CTC) to study pianorsquos continuousnote recognition for robotics Based on the Beijing Forestry University music library we conduct experiments to show recognitionrates and numbers of iterations of LSTM with a single layer LSTMP with a single layer and Deep LSTM (DLSTM LSTM withmultilayers) As a result the single layer LSTMP proves performing much better than the single layer LSTM in both time and therecognition rate that is LSTMP has fewer parameters and therefore reduces the training time and moreover benefiting from theprojection layer LSTMP has better performance tooThe best recognition rate of LSTMP is 998 As for DLSTM the recognitionrate can reach 100 because of the effectiveness of the deep structure but compared with the single layer LSTMP DLSTM needsmore training time

1 Introduction

Pianorsquos continuous note recognition is important for a robotwhether it is a bionic robot a dance robot or a music robotThere have been companies researching on music robotsFor example Vadi produced by Vatti is able to identify avoiceprint

Most of the existing pianorsquos note recognition techniquesuse Hidden Markov Model (HMM) and Radial Basis Func-tion (RBF) to recognize musical notes with one musical noteat a time and therefore are not suitable for continuous noterecognition Fortunately in the field of pattern recognitionDeepNeuralNetworks (DNNs) have shown great advantagesDNNs are used to recognize features extracted from a largenumber of hidden nodes [1] and seek reverse partial guidancethrough the chain rule and at the same time make the neural

network weight matrix convergence through training dataiteratively and then achieve recognition [2] RNN adds atime series based on DNN [3] which makes features havetime continuity [4 5] However in experiments we find thatRNNrsquos time characteristics will disappear completely afterfour iterations [6] and a music note is generally longer thana frame [7] so RNN is not suitable for pianorsquos continuousnote recognition [8] Fortunately a variant of RNN namedLSTM is proposed [9ndash12] in which an input gate an outputgate and a forgotten gate are added to memorize a long-term cell state to maintain long-term memory [8 9 13ndash16]Furthermore LSTMP adds a projection layer to LSTM toincrease its efficiency and effectiveness

This paper studies LSTM and LSTMP for pianorsquos contin-uous note recognition and in order to solve the temporalclassification problem we combine LSTM and LSTMP with

HindawiJournal of RoboticsVolume 2017 Article ID 2061827 7 pageshttpsdoiorg10115520172061827

2 Journal of Robotics

Lastoutput

Input Newinput

Input_gate

Output_gate

Cell_value

Forget_gate

Last_cell_recent

Thisinput

Thisinput

This_cell_value Cell_memory OutputActive

ActiveActiveActive

1-input_gate

Figure 1 The LSTM network structure

a method named CTC [17] In experiments we test theperformance of a single layer LSTM Deep LSTM and asingle layer LSTMP with different parameters Comparedwith the traditional pianorsquos note recognition methods LSTMand LSTMP can recognize continuous notes that is somesimple piano music The experimental results show that asingle layer LSTMP can attain a recognition rate of 998 andDeep LSTM can reach 100 which proves that our methodsare quite effective

The rest of this paper is organised as follows In Section 2we first introduce the LSTM network architecture and thenDeep LSTM LSTMP is illustrated in Section 3 In Section 4we discuss CTC The experimental results are presented inSection 5 and finally in Section 6 we draw conclusions andgive our future work

2 LSTM

21 The LSTMNetwork Architecture LSTM is a kind of RNNwhich succeeds to keep memory for a period of time byadding a ldquomemory cellrdquoThememory cell ismainly controlledby ldquothe input gaterdquo ldquothe forgetting gaterdquo and ldquothe outputgaterdquo The input gate activates the input of information tothememory cell and the forgetting gate selectively obliteratessome information in the memory cell and activates thestorage to the next input [18] Finally the output gate decideswhat information will be outputted by the memory cell [19]

The LSTM network structure is illustrated in Figure 1Each box represents different data and the lines with arrowsmean data flow among these data From Figure 1 we canunderstand how LSTM stores memory for a long period oftime

The recognition procedure of LSTM begins with a set ofinput sequences 119909 = (1199091 1199092 119909119905) (119909119894 is a vector) and finally

outputs a set of 119910 = (1199101 1199102 119910119905) (119910119894 is also a vector) whichis calculated according to the following equations

119894119905 = 120590 (119882119894119909119909119905 +119882119894119898119898119905minus1 + 119887119894) (1)

119900119905 = 120590 (119882119900119909119909119905 +119882119900119898119898119905minus1 + 119887119900) (2)

119891119905 = 1 minus 119894119905 (3)

119905119888119905 = 119892 (119882119888119909119909119905 +119882119888119898119898119905minus1 + 119887119888) (4)

119888119905 = 119891119905 ⊙ 119888119905minus1 + 119894119905 ⊙ 119905119888119905 (5)

119898119905 = 119900119905 ⊙ ℎ119888119905 (6)

119910119905 = 120601 (119882119910119898119898119905 + 119887119910) (7)

In these equations 119894 means the input gate and 119900 and 119891are the output gate and the forget gate respectively 119905119888 is theinformation input to the memory cell and 119888 includes cellactivation vectors and119898 is the information the memory celloutputs 119882 represents weight matrices (eg 119882119894119909 representstheweightmatrix from input119909 to the input gate 119894) 119887 is the bias(119887119894 is the input gate bias vector) and 119892 and ℎ are the activationfunction of cell inputting and cell outputting respectivelyregarded as 119905119886119899ℎ and 119897119894119899119890119886119903 in most of the models and alsoin this paper ⊙ is the point multiplication in a matrix 120601 is theactivation function of the neural network output and we use119904119900119891119905119898119886119909 in this paper

After conducting some experiments we find that com-pared with the 119891119905 standard equation (3) is more simple andeasier to converge Not only does the training time becomeless but also the number of iterations becomes smallerTherefore in the neural networks in this paper we use (3) tocalculate 119891119905 instead of the 119891119905 standard equation

22 Deep LSTM In pianorsquos continuous note recognition wealso build a multilayer neural network to further increase the

Journal of Robotics 3

recognition rate Deep LSTM adds an LSTM after anotherand so on [10] The added LSTMs have the same structureas the original one Each layer regards the output fromthe last layer as the input of the next layer We hope thatthe neural networks in different LSTM layers will learndifferent characteristics so as to learn the various features ofmusical notes from different aspects and therefore improvethe recognition rate

3 LSTMP-LSTM with a Projection Layer

In LSTM there are a large number of calculations in thevarious gates calculating the number of parameters119873 in theneural network The weight matrix dimension input by theinput gate the output gate and the cell state at this time is119899119894 lowast 119899119888 and the weight matrix dimension at the last time is119899119888 lowast 119899119888 and the output matrix dimension connected to theoutput of the neural network is 119899119900lowast119899119888 where 119899119894 and 119899119900 are thedimensions of the input and the output respectively and 119899119888 isthe number of memory cells We can easily get the followingformula

119873LSTM = 3 lowast 119899119894 lowast 119899119888 + 3 lowast 119899119888 lowast 119899119888 + 119899119900 lowast 119899119888 (8)

that is

119873LSTM = 3 lowast 119899119894 lowast 119899119888 + 119899119888 lowast (3 lowast 119899119888 + 119899119900) (9)

As we increase 119899119888 119873LSTM grows in a square patternTherefore increasing the number of memory cells to increasethe amount of memory costs a lot but a smaller cellnumber will bring a lower recognition rate so we propose anarchitecture named LSTMP which can not only improve theaccuracy but also effectively reduce the computations

In the output layer of the neural network LSTM outputsa matrix of 1198991198882 lowast 119898119905 Then 119898119905 is sent into the output matrixto be outputted and also serves as the input to the neuralnetwork at the next time We add a 119901119903119900119895119890119888119905119894119900119899 layer to theLSTM architecture and after passing this layer 119898119905 becomesan 119899119888 lowast 119899119903matrix called 119903119905 which replaces 119898119905 as the input ofthe next neural network When the memory cell number ofthe neural network increases the number of parameters inthe neural network is

119873LSTMP = 3 lowast 119899119894 lowast 119899119888 + 3 lowast 119899119888 lowast 119899119903 + 119899119900 lowast 119899119903 + 119899119888

lowast 119899119903(10)

that is

119873LSTMP = 3 lowast 119899119894 lowast 119899119888 + 119899119903 lowast (4 lowast 119899119888 + 119899119900) (11)

Calculating119873LSTM minus 119873LSTMP we have

119873LSTM minus 119873LSTMP = 119899119888 lowast (3 lowast 119899119888 + 119899119900) minus 119899119903

lowast (4 lowast 119899119888 + 119899119900) (12)

Therefore in LSTMP the factor that affects the totalnumber of parameters changes from 119899119888 lowast 119899119888 to 119899119888 lowast 119899119903 Wecan change the value of 119899119888119899119903 to reduce the computational

complexity When 3 lowast 119899119888 gt 4 lowast 119899119903 LSTMP can speed up thetraining model Moreover with the projection layer LSTMPcan converge faster to ensure the convergence of the modelThe mathematical formulae of LSTMP are as follows

119894119905 = 120590 (119882119894119909119909119905 +119882119894119903119903119905minus1 + 119887119894)

119900119905 = 120590 (119882119900119909119909119905 +119882119900119903119903119905minus1 + 119887119900)

119891119905 = 1 minus 119894119905

119905119888119905 = 119892 (119882119888119909119909119905 +119882119888119903119903119905minus1 + 119887119888)

119888119905 = 119891119905 ⊙ 119888119905minus1 + 119894119905 ⊙ 119905119888119905119898119905 = 119900119905 ⊙ ℎ119888119905119903119905 = 119882119903119898119898119905

119910119905 = 120601 (119882119910119903119903119905 + 119887119910)

(13)

In these formulae 119903119905 represents the 119901119903119900119895119890119888119905119894119900119899 layer andthe other equations are the same as LSTM

Figure 2 is the structure of LSTMP and the part markedwith red dashed lines is the projection By comparing Figure 1with Figure 2 we can see that LSTMP is LSTM with aprojection layer

Algorithm 1 is the pseudocode of LSTMP 119908 is the inputweight matrix and 119906 is the weight matrix of the last result 119887is bias and 119903 is the projection matrix We put the extractedmusical notes features into the neural network and thealgorithm executes until we get an acceptable recognitionrate

4 CTC

Theoutput layer of our LSTM and LSTMP is called CTC [20]We use CTC because it does not need presegmented trainingdata or external postprocessing to extract label sequencesfrom the network outputs

To be the same as many latest neural networks CTChas forward and backward algorithms When it comes tothe forward algorithm the key point is to estimate thedistribution through probabilities Given the length 119879 theinput sequence 119909 and the training set 119878 at time 119905 theactivation 119910119905119896 of the output unit 119896 at time 119905 is interpreted asthe probability of observing label 119896 (119900119903 119887119897119886119899119896 119894119891 119896 = |119871| + 1)

119901 (120587 | 119909 119878) =119879

prod119905=1

119910119905120587119905 (14)

We refer to the elements 120587 isin 1198711015840119879 as paths where 1198711015840119879

is the set of the length 119879 sequences over the alphabet 1198711015840 =119871 cup 119887119897119886119899119896 Then we define a many-to-one map 120601 to removefirst the repeated labels and then the blanks from the pathsWith glance at the paths will find they aremutually exclusiveAccording to the characteristic the conditional probability ofsome labelling 119897 isin 119871⩽119879 can be calculated by summing theprobabilities of all the paths mapped onto it by 120601

119901 (119897 | 119909) = sum120587isin120601minus1(119897)

119901 (120587 | 119909) (15)


Lastoutput

Input Newinput

Input_gate

Output_gate

Cell_value

Forget_gate

Last_cell_recent

Thisinput

Thisinput

This_cell_value

Cell_memory Cell_recent Output

Active

Active

Active

1-input_gateProjection

Projection

Active

Figure 2 The LSTMP structure

Require 119894119899119901119906119905 119877119890119904119906119897119905119897119886119904119905 119862119890119897119897119897119886119904119905 119908 119906 119887 119903Ensure 119862119890119897119897V119886119897119906119890 119877119890119904119906119897119905while119882ℎ119890119899 ℎ119886V119894119899119892 119894119899119901119906119905 do119905119890119898119901119894119899119901119906119905 = (119877119890119904119906119897119905119897119886119904119905 lowast 119906119894119899119901119906119905) + (119894119899119901119906119905 lowast 119908119894119899119901119906119905) + 119887119905119890119898119901119900119906119905119901119906119905 = (119877119890119904119906119897119905119897119886119904119905 lowast 119906119900119906119905119901119906119905) + (119894119899119901119906119905 lowast 119908119900119906119905119901119906119905) + 119887119905119890119898119901119888119890119897119897 = (119877119890119904119906119897119905119897119886119904119905 lowast 119906119888119890119897119897) + (119894119899119901119906119905 lowast 119908119888119890119897119897) + 119887119894119899119901119906119905119892119886119905119890 = 119904119894119892119898119900119894119889(119905119890119898119901119894119899119901119906119905)119900119906119905119901119906119905119892119886119905119890 = 119904119894119892119898119900119894119889(119905119890119898119901119900119906119905119901119906119905)119891119900119903119892119890119905119892119886119905119890 = 1 minus 119894119899119901119906119905119892119886119905119890119862119890119897119897V119886119897119906119890 = 119886119888119905119894V119886119905119890(119905119890119898119901119888119890119897119897) lowast 119894119899119901119906119905119892119886119905119890 + 119862119890119897119897119897119886119904119905 lowast 119891119900119903119892119890119905119892119886119905119890119877119890119904119906119897119905 = 119900119906119905119901119906119905119892119886119905119890 lowast 119886119888119905119894V119886119905119890(119888119890119897119897V119886119897119906119890)119862119890119897119897V119886119897119906119890 = (119877119890119904119906119897119905 119903)

end while

Algorithm 1 LSTM-projection

After all these procedures CTCwill complete its classificationtask

5 Experiments

We conduct all our experiments on a server with 4 IntelXeon E5-2620 CPUs and 512GB memories A NVIDIA TeslaM2070-Q graphics card is used to train all the models Theprogramming language we use is python 35

We choose the piano as our instrument We record 445note sequences as our dataset and the length of each sequenceis around 8 seconds

In the extraction of features we carry out Hammingwindow processing and then take Fast Fourier Transform(FFT) for the real part and the imaginary part of eachwindowThen we let the FFT result to be orthogonal by adding thesquare of the real part and that of the imaginary part together

Apart from that we gain the log of the quadratic sum Finallythe normalization of the input data is performed

In the experiments the number of kinds of notes is 8 andthe number of input nodes is 9 We try different numbers ofcell units in our models from 20 to 320 The initial value ofthe neural network is set as a random value within [minus02 02]and the learning rate is 0001 In terms of the structures all theneural networks are connected to a single layer CTC As forthe dataset we choose 80 of the samples as the developmentset and 20 as the test set

51 Experimental Results Table 1 shows the recognition ratesand how many times LSTM DLSTM and LSTMP withdifferent parameters need to iterate until their recognitionrates are stable and the best results are in bold

In Table 1 ldquoLSTMP-80 to 20rdquo means the LSTMP modelprojecting 80 cell states to 20 cell states From Table 1 we see


Table 1 The number of iterations and the recognition rates of different models

Model Number of iterations Recognition rateLSTM-20 300 772LSTM-40 300 871LSTM-80 300 907LSTM-160 400 825DLSTM-two layers 61 857DLSTM-three layers 71 995DLSTM-four layers 109 1000DLSTM-five layers 76 975DLSTM-six layers 103 1000LSTMP-10 to 20 43 56LSTMP-30 to 20 35 942LSTMP-40 to 20 45 960LSTMP-40 to 30 30 940LSTMP-60 to 30 22 980LSTMP-80 to 20 58 998LSTMP-80 to 30 27 973LSTMP-80 to 40 29 973LSTMP-160 to 40 39 990LSTMP-160 to 80 50 950LSTMP-320 to 160 93 934

LSTMP with different parametersDLSTM with different layers

LSTMP (30 to 20)LSTMP (40 to 20)LSTMP (60 to 30)

LSTMP (80 to 20)LSTMP (160 to 40)

35302520151050

00

02

04

06

08

10

250200150100500

00

02

04

06

08

10

LSTM_layer 1Deep LSTM_layer 3

Deep LSTM_layer 4Deep LSTM_layer 6

300

Figure 3 The recognition rates of LSTMP with different parameters and DLSTM with different layers

that DLSTM and LSTMP perform much better than LSTMand their best recognition rates are almost the same whichare 100 and 998 respectively As for the numbers ofiterations LSTMP needs much less iterations than LSTMand DLSTM which makes LSTMP more suitable for pianorsquoscontinuous note recognition for robotics considering theefficiency

52 LSTMP and DLSTM with Different Parameters Figure 3illustrates LSTMP with different parameters and DLSTMwith different layers The 119909 axis means the number of

iterations and the 119910 axis means the recognition rate We seethat for LSTMP the model projecting 80 cell states to 20 cellstates has the best result but all LSTMP results are very closeAs for DLSTM we see clearly that Deep LSTM is much betterthan LSTM with only one layer

53 Comparisons of LSTM LSTMP and DLSTM We com-pare LSTM LSTMP and DLSTM in Figure 4 Given the sameparameters LSTMP performs much better than LSTM Asfor LSTMP and DLSTM we find that when the number of


LSTM and LSTMP LSTMP and DLSTM

Deep LSTM (4 layers)

6050403020100

00

02

04

06

08

10

LSTM (80)LSTMP (80 to 20)

LSTMP (80 to 20)

100806040200

00

02

04

06

08

10

Figure 4 Comparisons of LSTM LSTMP and DLSTM

iterations is small LSTMP has great advantages but as thenumber of iterations increases DLSTM becomes better

6 Conclusions and Future Work

In this paper we have used neural network structures calledLSTM with CTC to recognize continuous musical notes Onthe basis of LSTM we have also tried LSTMP and DLSTMAmong them LSTMP worked best when projecting 80 cellstates to 20 cell states which needed much less iterationsthan LSTM and DLSTM making it most suitable for pianorsquoscontinuous note recognition

In the future we will use LSTM LSTMP and DLSTMto recognize more complex continuous chord music suchas piano music violin pieces or even symphony which willgreatly improve the development of music robots

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

Thanks are due to Yanlin Yang for collecting the record-ing materials This work is supported by the FundamentalResearch Funds for the Central Universities (no 2016JX06)and the National Natural Science Foundation of China (no61472369)

References

[1] A K Jain J Mao and K M Mohiuddin ldquoArtificial neural net-works a tutorialrdquo IEEE Computational Science amp Engineeringvol 29 no 3 pp 31ndash44 1996

[2] J R Zhang T M Lok andM R Lyu ldquoA hybrid particle swarmoptimization-back-propagation algorithm for feedforwardneu-ral network trainingrdquo Applied Mathematics and Computationvol 185 no 2 pp 1026ndash1037 2007

[3] L Liang Y Tang L Bin and W Xiaohua Artificial neuralnetwork based universal time series May 11 2004 US Patent6735580

[4] R J Williams and D Zipser ldquoA learning algorithm forcontinually running fully recurrent neural networksrdquo NeuralComputation vol 1 no 2 pp 270ndash280 1989

[5] L R Medsker and L C Jain ldquoRecurrent neural networksrdquoDesign and Applications vol 5 2001

[6] G Shalini O Vinyals B Strope R Scott T Dean and LHeck ldquoContextual lstm (clstm)models for large scale nlp tasksrdquohttpsarxivorgabs160206291

[7] A Karpathy The unreasonable effectiveness of recurrentneural networks httpkarpathygithubio20150521rnn-effectiveness 2015

[8] D Turnbull and C Elkan ldquoFast recognition of musical genresusingRBFnetworksrdquo IEEETransactions onKnowledge andDataEngineering vol 17 no 4 pp 580ndash584 2005

[9] H Sak A Senior and F Beaufays ldquoLong short-term memoryrecurrent neural network architectures for large scale acousticmodelingrdquo in Proceedings of the 15th Annual Conference of theInternational Speech Communication Association Celebratingthe Diversity of Spoken Languages INTERSPEECH 2014 pp338ndash342 September 2014

[10] AGraves A-RMohamed andGHinton ldquoSpeech recognitionwith deep recurrent neural networksrdquo in Proceedings of the 38thIEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo13) pp 6645ndash6649 May 2013

[11] J Schmidhuber ldquoDeep learning in neural networks anoverviewrdquo Neural Networks vol 61 pp 85ndash117 2015

[12] K M Hermann T Kocisky E Grefenstette et al ldquoTeachingmachines to read and comprehendrdquo in Proceedings of the 29thAnnual Conference on Neural Information Processing SystemsNIPS 2015 pp 1693ndash1701 December 2015

[13] B Bakker ldquoReinforcement learning with long short-termmem-oryrdquo in In Advances in Neural Information Processing Systemspp 1475ndash1482

[14] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquoNeural Computation vol 9 no 8 pp 1735ndash1780 1997


[15] P Goelet V F Castellucci S Schacher and E R KandelldquoThe long and the short of long-term memory - A molecularframeworkrdquo Nature vol 322 no 6078 pp 419ndash422 1986

[16] Q Lyu Z Wu and J Zhu ldquoPolyphonic music modelling withLSTM-RTRBMrdquo in Proceedings of the 23rd ACM InternationalConference onMultimedia MM 2015 pp 991ndash994 aus October2015

[17] S Haim A Senior R Kanishka and F Beaufays ldquoFast andaccurate recurrent neural network acoustic models for speechrecognitionrdquo httpsarxivorgabs150706947

[18] F A Gers J Schmidhuber and F Cummins ldquoLearning toforget Continual prediction with LSTMrdquo Neural Computationvol 12 no 10 pp 2451ndash2471 2000

[19] F A Gers N N Schraudolph and J Schmidhuber ldquoLearningprecise timing with LSTM recurrent networksrdquo Journal ofMachine Learning Research (JMLR) vol 3 no 1 pp 115ndash1432003

[20] A Graves S Fernandez F Gomez and J SchmidhuberldquoConnectionist temporal classification Labelling unsegmentedsequence data with recurrent neural networksrdquo in Proceedingsof the 23rd International Conference onMachine Learning ICML2006 pp 369ndash376 Pittsburgh Pa USA June 2006

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Active and Passive Electronic Components

Control Scienceand Engineering

Journal of


International Journal of

RotatingMachinery


Hindawi Publishing Corporation httpwwwhindawicom

Journal of

Volume 201

Submit your manuscripts athttpswwwhindawicom

VLSI Design



Shock and Vibration


Civil EngineeringAdvances in

Acoustics and VibrationAdvances in



Electrical and Computer Engineering

Journal of

Advances inOptoElectronics


Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Chemical EngineeringInternational Journal of Antennas and

Propagation




Navigation and Observation



DistributedSensor Networks



Lastoutput

Input Newinput

Input_gate

Output_gate

Cell_value

Forget_gate

Last_cell_recent

Thisinput

Thisinput

This_cell_value Cell_memory OutputActive

ActiveActiveActive

1-input_gate

Figure 1 The LSTM network structure

a method named CTC [17] In experiments we test theperformance of a single layer LSTM Deep LSTM and asingle layer LSTMP with different parameters Comparedwith the traditional pianorsquos note recognition methods LSTMand LSTMP can recognize continuous notes that is somesimple piano music The experimental results show that asingle layer LSTMP can attain a recognition rate of 998 andDeep LSTM can reach 100 which proves that our methodsare quite effective

The rest of this paper is organised as follows In Section 2we first introduce the LSTM network architecture and thenDeep LSTM LSTMP is illustrated in Section 3 In Section 4we discuss CTC The experimental results are presented inSection 5 and finally in Section 6 we draw conclusions andgive our future work

2 LSTM

21 The LSTMNetwork Architecture LSTM is a kind of RNNwhich succeeds to keep memory for a period of time byadding a ldquomemory cellrdquoThememory cell ismainly controlledby ldquothe input gaterdquo ldquothe forgetting gaterdquo and ldquothe outputgaterdquo The input gate activates the input of information tothememory cell and the forgetting gate selectively obliteratessome information in the memory cell and activates thestorage to the next input [18] Finally the output gate decideswhat information will be outputted by the memory cell [19]

The LSTM network structure is illustrated in Figure 1Each box represents different data and the lines with arrowsmean data flow among these data From Figure 1 we canunderstand how LSTM stores memory for a long period oftime

The recognition procedure of LSTM begins with a set ofinput sequences 119909 = (1199091 1199092 119909119905) (119909119894 is a vector) and finally

outputs a set of 119910 = (1199101 1199102 119910119905) (119910119894 is also a vector) whichis calculated according to the following equations

119894119905 = 120590 (119882119894119909119909119905 +119882119894119898119898119905minus1 + 119887119894) (1)

119900119905 = 120590 (119882119900119909119909119905 +119882119900119898119898119905minus1 + 119887119900) (2)

119891119905 = 1 minus 119894119905 (3)

119905119888119905 = 119892 (119882119888119909119909119905 +119882119888119898119898119905minus1 + 119887119888) (4)

119888119905 = 119891119905 ⊙ 119888119905minus1 + 119894119905 ⊙ 119905119888119905 (5)

119898119905 = 119900119905 ⊙ ℎ119888119905 (6)

119910119905 = 120601 (119882119910119898119898119905 + 119887119910) (7)

In these equations 119894 means the input gate and 119900 and 119891are the output gate and the forget gate respectively 119905119888 is theinformation input to the memory cell and 119888 includes cellactivation vectors and119898 is the information the memory celloutputs 119882 represents weight matrices (eg 119882119894119909 representstheweightmatrix from input119909 to the input gate 119894) 119887 is the bias(119887119894 is the input gate bias vector) and 119892 and ℎ are the activationfunction of cell inputting and cell outputting respectivelyregarded as 119905119886119899ℎ and 119897119894119899119890119886119903 in most of the models and alsoin this paper ⊙ is the point multiplication in a matrix 120601 is theactivation function of the neural network output and we use119904119900119891119905119898119886119909 in this paper

After conducting some experiments we find that com-pared with the 119891119905 standard equation (3) is more simple andeasier to converge Not only does the training time becomeless but also the number of iterations becomes smallerTherefore in the neural networks in this paper we use (3) tocalculate 119891119905 instead of the 119891119905 standard equation

22 Deep LSTM In pianorsquos continuous note recognition wealso build a multilayer neural network to further increase the






that is





lowast 119899119903(10)

that is




lowast (4 lowast 119899119888 + 119899119900) (12)



119894119905 = 120590 (119882119894119909119909119905 +119882119894119903119903119905minus1 + 119887119894)

119900119905 = 120590 (119882119900119909119909119905 +119882119900119903119903119905minus1 + 119887119900)

119891119905 = 1 minus 119894119905

119905119888119905 = 119892 (119882119888119909119909119905 +119882119888119903119903119905minus1 + 119887119888)

119888119905 = 119891119905 ⊙ 119888119905minus1 + 119894119905 ⊙ 119905119888119905119898119905 = 119900119905 ⊙ ℎ119888119905119903119905 = 119882119903119898119898119905

119910119905 = 120601 (119882119910119903119903119905 + 119887119910)

(13)




4 CTC



119901 (120587 | 119909 119878) =119879

prod119905=1

119910119905120587119905 (14)



119901 (119897 | 119909) = sum120587isin120601minus1(119897)

119901 (120587 | 119909) (15)


Lastoutput

Input Newinput

Input_gate

Output_gate

Cell_value

Forget_gate

Last_cell_recent

Thisinput

Thisinput

This_cell_value


Active

Active

Active


Projection

Active



end while



5 Experiments














35302520151050

00

02

04

06

08

10

250200150100500

00

02

04

06

08

10



300









6050403020100

00

02

04

06

08

10


LSTMP (80 to 20)

100806040200

00

02

04

06

08

10








Acknowledgments


References






















RoboticsJournal of





Journal of



RotatingMachinery



Journal of

Volume 201


VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation














that is





lowast 119899119903(10)

that is




lowast (4 lowast 119899119888 + 119899119900) (12)



119894119905 = 120590 (119882119894119909119909119905 +119882119894119903119903119905minus1 + 119887119894)

119900119905 = 120590 (119882119900119909119909119905 +119882119900119903119903119905minus1 + 119887119900)

119891119905 = 1 minus 119894119905

119905119888119905 = 119892 (119882119888119909119909119905 +119882119888119903119903119905minus1 + 119887119888)

119888119905 = 119891119905 ⊙ 119888119905minus1 + 119894119905 ⊙ 119905119888119905119898119905 = 119900119905 ⊙ ℎ119888119905119903119905 = 119882119903119898119898119905

119910119905 = 120601 (119882119910119903119903119905 + 119887119910)

(13)




4 CTC



119901 (120587 | 119909 119878) =119879

prod119905=1

119910119905120587119905 (14)



119901 (119897 | 119909) = sum120587isin120601minus1(119897)

119901 (120587 | 119909) (15)


Lastoutput

Input Newinput

Input_gate

Output_gate

Cell_value

Forget_gate

Last_cell_recent

Thisinput

Thisinput

This_cell_value


Active

Active

Active


Projection

Active



end while



5 Experiments














35302520151050

00

02

04

06

08

10

250200150100500

00

02

04

06

08

10



300









6050403020100

00

02

04

06

08

10


LSTMP (80 to 20)

100806040200

00

02

04

06

08

10








Acknowledgments


References






















RoboticsJournal of





Journal of



RotatingMachinery



Journal of

Volume 201


VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










Lastoutput

Input Newinput

Input_gate

Output_gate

Cell_value

Forget_gate

Last_cell_recent

Thisinput

Thisinput

This_cell_value


Active

Active

Active


Projection

Active



end while



5 Experiments














35302520151050

00

02

04

06

08

10

250200150100500

00

02

04

06

08

10



300









6050403020100

00

02

04

06

08

10


LSTMP (80 to 20)

100806040200

00

02

04

06

08

10








Acknowledgments


References






















RoboticsJournal of





Journal of



RotatingMachinery



Journal of

Volume 201


VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation















35302520151050

00

02

04

06

08

10

250200150100500

00

02

04

06

08

10



300









6050403020100

00

02

04

06

08

10


LSTMP (80 to 20)

100806040200

00

02

04

06

08

10








Acknowledgments


References






















RoboticsJournal of





Journal of



RotatingMachinery



Journal of

Volume 201


VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation












6050403020100

00

02

04

06

08

10


LSTMP (80 to 20)

100806040200

00

02

04

06

08

10








Acknowledgments


References






















RoboticsJournal of





Journal of



RotatingMachinery



Journal of

Volume 201


VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation
















RoboticsJournal of





Journal of



RotatingMachinery



Journal of

Volume 201


VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation









long short-term memory projection recurrent neural network

Documents