ADVANCES in NATURAL and APPLIED SCIENCES
ISSN: 1995-0772 Published by AENSI Publication EISSN: 1998-1090 http://www.aensiweb.com/ANAS
2017 June 11(8): pages 476-484 Open Access Journal
To Cite This Article: A. Joe Virgin, Dr. S. Selva Nidhyananthan, FPGA Based Speech Recognition using Dynamic MFCC. Advances in Natural and Applied Sciences. 11(8); Pages: 476-484
FPGA Based Speech Recognition using Dynamic MFCC
1A. Joe Virgin, 2Dr. S. Selva Nidhyananthan
1PG Scholar, Department of ECE, Mepco Schlenk Engineering College, Sivakasi, 2Associate Professor, Department of ECE, Mepco Schlenk Engineering College, Sivakasi.
Received 28 March 2017; Accepted 7 June 2017; Available online 12 June 2017
Address For Correspondence: A. Joe Virgin, PG Scholar, Department of ECE, Mepco Schlenk Engineering College, Sivakasi,
Copyright © 2017 by authors and American-Eurasian Network for Scientific Information (AENSI Publication). This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/
ABSTRACT
Speech recognition is an emerging technology that enables computers to convert spoken language into text. In speech processing, an array of filter banks is used to separate the multiple components of the input speech signal. Existing speech recognition methods are difficult to implement because of their memory size, area, and power requirements. The proposed system uses an efficient feature extraction technique called Dynamic Mel Frequency Cepstral Coefficients (DMFCC). The Mel Frequency Cepstral Coefficient (MFCC) is the most widely used feature for speech and speaker recognition, but it is very sensitive to noise interference, which degrades the performance of the speech recognition system. The proposed system overcomes this by extracting DMFCC features through sub band processing, block truncation of the DCT, and filter banks applied to every frame. An array of band pass Mel filter banks is used to improve the recognition accuracy. The extracted features are stored in a ROM based memory module as a template, which acts as the reference template for the test speech signal. The Enhanced Code Word Reference Template (ECWRT) is used for template matching in order to reduce the memory size, with one to one mapping between the reference template and the test speech signal. The Automatic Speech Recognition (ASR) architecture was designed and simulated using Verilog HDL. The simulation results were verified with ModelSim Altera 6.4a Starter Edition, and the system was implemented on a Virtex-5 FPGA kit. The recognition accuracy with DMFCC is 13% higher than with MFCC.
KEYWORDS: Dynamic Mel Frequency Cepstral Coefficients (DMFCC), MFCC, Mel filter banks
INTRODUCTION
Digital signal processing is concerned with obtaining discrete representations of signals and with
processing signals in that discrete form. One of its main applications is speech processing, the study and
processing of speech signals represented in digital form. Speech recognition covers several
categories, such as speech and pattern recognition and voice recognition. It provides a
methodology that enables smart technologies to convert
speech into text, and it is also called Automatic Speech Recognition (ASR).
The most common feature used for speaker identification is the Mel Frequency Cepstral Coefficient (MFCC). Because
the human auditory system is most sensitive to the pitch frequency of the speaker, a feature that takes the
pitch frequency into account gives better output. In this paper, Dynamic Mel Frequency Cepstral Coefficients
(DMFCC) (2011) are used as features; they are formed from the Mel frequency spectrum, thereby
producing dynamic features. The extracted features are stored as a reference template. With respect to the
decoding procedure, this work investigates the factors of the local-path constraints that influence memory
usage and recognition accuracy, and proposes a new template-matching method, called the Enhanced Crosswords
Reference Template (ECWRT) (2014 & 2015), to reduce the memory requirements without decreasing accuracy.
ASR systems use a wide range of methods for recognizing speech, and many authors have proposed
algorithms in different fields to enhance speech recognition. Md. Sahidullah and Goutam Saha
(2015) proposed a speech recognition method that uses a buffering scheme named Ultra-Low Queue-Accumulator
Buffering (ULQAB) together with CWRT; this method shows a better accuracy rate. Chih-Hung Chou et al.
[2] proposed hardware that achieves a 4.3X speedup over a 2.4-GHz Intel Core 2 Duo processor running the CMU Sphinx
speech recognition software, using a decoding algorithm based on word-dependent N-best Viterbi beam search. However, the
design used a logic-on-memory approach and consumed 1.72 W of power. Ojas A. Bapat et al. [7]
developed a generic and scalable architecture for large acoustic model and large vocabulary speech recognition using
MFCC feature extraction. The work uses a phase information method, but the original phase information creates a
problem, and the reported error rate is only about 58.3%. Seiichi Nakagawa et al. [8] proposed a
Support Vector Machine (SVM) method that includes the Sequential Minimal Optimization (SMO) technique. In VLSI, power, area, and complexity
need to be considered, which creates problems for hardware implementation. Tse-Wei Chen [10] adopted different
clustering and classifying algorithms for pattern recognition during the training phase, such as general k-means, SMO,
and k-nearest neighbor (KNN), to develop a k-means-based clustering method for speaker modeling. The time
complexity of normal k-means is very high, so it is not suitable for hardware implementation. An ASR system based on
the first central Spectral Moment time-frequency distribution Augmented by low-order Cepstral coefficients
(SMAC) was developed by Pirros Tsiakoulis et al. (2011). However, factors such as the reduced frequency
resolution, modulation effects in the voiced regions, and the increased number of frames needed to calculate
derivative features introduce further complexity. Multicore and Multichannel (MCMC) and Synchronous and
Forward–Backward Scheduling (SFBS) technologies were proposed by Chih-Hsiang Peng et al. [2]. The total size of the shared memory for
storing Lagrange multipliers, prediction error, and hyperplane is 9 kB, although the memory cost is slightly
increased, by 5%.
Proposed Method:
The proposed method describes the step-by-step process of the ASR system. Fig. 1 shows the process
followed in an ASR system.
Fig. 1: Proposed ASR System.
1) Acquisition of input signal:
The input speech signal is a noisy signal taken from the database. The database contains both female and
male speakers. The sampling rate of the input signal is 8 kHz. Ten different speech sets from 10 speakers
were taken from the TIMIT and MEPCO Speech databases.
2) Wiener FIR filtering:
A Finite Impulse Response (FIR) filter has an impulse response of finite duration, which settles to zero in finite time. FIR filters
do not have any feedback. The Wiener FIR filter (2015) is a type of adaptive FIR filter. Wiener filters play an
important role in a wide range of applications such as echo cancellation, signal restoration, and channel identification.
The coefficients of a Wiener filter are chosen to minimize the average squared distance between the filter
output and the desired signal. The filter coefficients
are periodically recalculated for every block of N signal samples.
[Fig. 1 block labels: Speech Database, Pre-processing, Feature Extraction, ECWRT, Template Matching, Test Speech, Recognized Output]
The input-output relation of an FIR filter is given by
$$y(n) = \sum_{k=0}^{K-1} w_k \, x(n-k) \qquad (1)$$
where $y(n)$ is the output of the filter,
$K$ is the number of filter taps, and
$w_k$ is the $k$-th filter coefficient.
The Wiener FIR filter is used as a noise-removal filter in the proposed system. The six-tap Wiener FIR filter
designed for this purpose is shown in Fig. 2.
Fig. 2: Wiener FIR Filter.
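The FIR input-output relation given above can be illustrated in software. The following is a minimal Python sketch, not the authors' Verilog design: the function name `fir_filter` and the coefficient list `w_demo` are hypothetical, and the coefficients are a plain six-tap moving average chosen only for illustration, since the Wiener coefficients used in the paper are not listed.

```python
def fir_filter(x, w):
    """Direct-form FIR filter: y(n) = sum over k of w[k] * x(n - k).

    Samples before the start of the signal are assumed to be zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, wk in enumerate(w):
            if n - k >= 0:
                acc += wk * x[n - k]
        y.append(acc)
    return y


# Hypothetical six-tap coefficients (moving average), for illustration only.
w_demo = [1.0 / 6.0] * 6

# The response to a unit impulse reproduces the coefficients and then
# returns to zero, which is the finite-duration property of an FIR filter.
impulse = [1.0] + [0.0] * 9
response = fir_filter(impulse, w_demo)
```

Because the filter has no feedback, the impulse response is exactly as long as the coefficient list.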
3) Framing:
A speech signal is not stationary in nature, but it appears stationary over a short period of time,
about 20 to 30 ms, so each speech signal is segmented into frames. When framing a signal, the overlap between
consecutive frames is important: overlapping frames reduce the data lost in the small gaps between
frames.
The number of samples in a frame is obtained as
$$n = t_{st} \cdot f_s \qquad (2)$$
where $t_{st}$ is the time period of a frame and
$f_s$ is the sampling frequency of the signal.
The proposed system has the following framing specification:
Total number of samples = 40,000 samples.
Speech signal is approximately constant for about 30 ms.
Sampling frequency, FS = 8 kHz.
Total number of frames = 312.
Number of samples per frame, N = 256 samples.
Overlap samples, M = 128 samples.
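The framing specification can be checked numerically. The sketch below is a Python illustration under two assumptions that the text does not state: the frame advances by N - M = 128 samples, and the last partial frame is zero-padded. With those assumptions, 40,000 samples yield the 312 frames listed above. Note that 30 ms at 8 kHz gives 240 samples; the specification uses 256, and the reason for that choice is not given in the text. The function names are hypothetical.

```python
import math


def samples_per_frame(frame_time_s, fs_hz):
    """Number of samples in one frame: n = t_st * f_s."""
    return int(round(frame_time_s * fs_hz))


def frame_signal(x, frame_len=256, overlap=128):
    """Split x into overlapping frames, zero-padding the last frame.

    The hop between frame starts is frame_len - overlap (an assumption).
    """
    hop = frame_len - overlap
    if len(x) <= frame_len:
        n_frames = 1
    else:
        n_frames = 1 + math.ceil((len(x) - frame_len) / hop)
    padded_len = (n_frames - 1) * hop + frame_len
    padded = list(x) + [0.0] * (padded_len - len(x))
    return [padded[i * hop: i * hop + frame_len] for i in range(n_frames)]


frames = frame_signal([0.0] * 40000)
```

Here `samples_per_frame(0.030, 8000)` evaluates to 240 and `len(frames)` to 312.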
4) Pre- emphasis:
The higher frequencies in the speech signal need to be emphasized before further processing,
and this is done in the pre-emphasis stage. The pre-emphasis factor is α = 0.97.
The output of pre-emphasis is given by
$$y(n) = x(n) - \alpha \, x(n-1) \qquad (3)$$
The architecture of pre-emphasis is shown in Fig. 3.
Fig. 3: Pre-emphasis Architecture
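The pre-emphasis relation is a first-order difference and can be sketched in a few lines of Python. This is an illustration only, not the authors' hardware; the function name `pre_emphasis` is hypothetical, and the sample before the start of the signal is assumed to be zero.

```python
def pre_emphasis(x, alpha=0.97):
    """Apply y(n) = x(n) - alpha * x(n - 1), assuming x(-1) = 0."""
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - alpha * prev)
        prev = sample
    return y


# A constant (low-frequency) input is strongly attenuated after the first
# sample, while an alternating (high-frequency) input is amplified.
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

For the constant input every output after the first is about 0.03, while for the alternating input the magnitude grows to about 1.97, which shows the high-frequency boost.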
[Fig. 2 labels: input X(n), coefficients a0, a1, a2, ..., a7, unit delays z^-1, 32-bit data paths, output Y(n). Fig. 3 labels: input X(n), unit delay z^-1, α = 0.97, output Y(n).]
5) Windowing:
The final stage in pre-processing is windowing. The pre-emphasized output is windowed frame by
frame. The Hamming window is commonly used in speech processing. The Hamming window is defined
as
$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (4)$$
The windowed signal is given by
$$S_w(n) = y(n) \cdot w(n) \qquad (5)$$
where $y(n)$ is the pre-emphasized signal and
$w(n)$ is the window used.
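The Hamming window and its application to a frame can be sketched as follows. This is a minimal Python illustration with hypothetical function names; windowing is taken to be a sample-by-sample product of the frame and the window.

```python
import math


def hamming_window(N):
    """Hamming window w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), 0 <= n <= N - 1."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]


def apply_window(frame):
    """Multiply a frame sample by sample with a Hamming window of equal length."""
    w = hamming_window(len(frame))
    return [s * wn for s, wn in zip(frame, w)]


w256 = hamming_window(256)
windowed = apply_window([1.0] * 256)
```

The window tapers to 0.08 at both ends of the frame and is close to 1 in the middle, which reduces the discontinuities at the frame boundaries.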
B. Feature Extraction:
Feature extraction is the process of extracting the useful information from the speech signal. In
the proposed system, the Dynamic Mel Frequency Cepstral Coefficients (DMFCC) feature is used. The DMFCC
feature is extracted using pitch and mel frequency information.
1) Sub Band Processing:
Sub band processing is a technique that breaks a signal into a number of different frequency bands and
encodes each one independently. The sub band approach has also become popular in recent years in speech
recognition (2012). In this related area, the main motivation has been to achieve robust recognition in the face of
noise.
The sub band processing is done by the Discrete Wavelet Transform (DWT), and the samples obtained from
the DWT then undergo an FFT to convert them into the frequency domain. The flow of the DWT is shown in Fig. 4.
Fig. 4: Architecture of DWT
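A single DWT stage consists of a filter pair followed by downsampling by two. The text does not name the wavelet used, so the Python sketch below assumes the Haar wavelet purely for illustration; the function names and the number of levels are hypothetical.

```python
import math


def haar_dwt_stage(x):
    """One DWT stage with the Haar wavelet (an assumption, not from the paper).

    Returns (approximation, detail): the low-pass and high-pass outputs,
    each downsampled by two. An odd-length input is zero-padded.
    """
    x = list(x)
    if len(x) % 2 != 0:
        x.append(0.0)
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return approx, detail


def haar_subbands(x, levels):
    """Repeat the stage on the approximation to obtain levels + 1 sub bands."""
    bands = []
    current = list(x)
    for _ in range(levels):
        current, detail = haar_dwt_stage(current)
        bands.append(detail)
    bands.append(current)
    return bands


approx, detail = haar_dwt_stage([1.0, 1.0, 2.0, 2.0])
bands = haar_subbands([float(i) for i in range(8)], 3)
```

Each stage halves the number of samples per band, so three stages on eight samples give bands of length 4, 2, 1, and 1.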
Fig. 5 shows the overall block diagram of the feature extraction.
Fig. 5: Block Diagram of Dynamic MFCC.
[Fig. 5 block labels: Pre-processed signal; Subband 1, Subband 2, ..., Subband N; one Dynamic Mel Filter Bank per sub band; DCT 1 ... DCT n per filter bank; DMFCC; ECWRT]
2) DMFCC Feature Extraction:
The most commonly used feature for speech and speaker recognition, one that captures both speech and
speaker characteristics well, is the MFCC [14]. Because the human auditory system can sensitively perceive pitch changes
in speech, the pitch is combined with the MFCC: a set of
Mel filters is constructed dynamically according to the results of pitch detection.
The mel frequency is obtained by
$$Mel(f_p) = 2595 \log_{10}\left(1 + \frac{f_p}{700}\right) \qquad (6)$$
where $f_p$ is the pitch frequency.
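The mel mapping can be evaluated directly. The Python sketch below implements the conversion and its inverse; the function names are hypothetical, and the 150 Hz pitch value is an arbitrary example rather than a value from the paper. How the Mel filters are then placed from the detected pitch is not specified in the text and is not attempted here.

```python
import math


def hz_to_mel(f_hz):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse mapping: f = 700 * (10 ** (mel / 2595) - 1)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Hypothetical pitch frequency, for illustration only.
pitch_hz = 150.0
pitch_mel = hz_to_mel(pitch_hz)
```

The mapping is nearly linear at low frequencies and logarithmic above roughly 1 kHz, and 1000 Hz maps to approximately 1000 mel.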
3) Discrete Cosine Transform(DCT):
The DCT is performed on the Dynamic Mel Filter Log Energies (DMFLE) in order to decorrelate the features.
When a full-band DCT is applied to the log energies of the speech signal, noise in any band affects all the features,
which makes them unsuitable for speaker identification. To alleviate this problem, a block-based transformation is
performed: the filter log energies are divided into blocks and the DCT is performed on each block.
The filter bank log energies are decomposed into several blocks, unlike the standard full-band DCT
technique. In this work, the whole signal is divided into non-overlapping blocks, and the individual blocks are
processed independently. Therefore, the presence of narrowband noise in one block does not affect the other
blocks, because of truncation. The transformation matrix can be given as
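$$C = \begin{bmatrix} C_1 & 0 & \cdots & 0 \\ 0 & C_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & C_N \end{bmatrix} \qquad (7)$$
where $C_i$ is the DCT matrix applied to the $i$-th block and each $0$ is a zero matrix.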
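The effect of the block-based transformation can be demonstrated with a small Python sketch. It uses an unnormalized DCT-II and two blocks of four log energies; the block sizes, the data values, and the function names are hypothetical, and the truncation of coefficients mentioned in the text is not modeled.

```python
import math


def dct_ii(v):
    """Unnormalized DCT-II of a list of values."""
    n = len(v)
    return [
        sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def block_dct(log_energies, block_sizes):
    """Apply the DCT independently to consecutive non-overlapping blocks."""
    if sum(block_sizes) != len(log_energies):
        raise ValueError("block sizes must cover the whole vector")
    out = []
    start = 0
    for size in block_sizes:
        out.extend(dct_ii(log_energies[start:start + size]))
        start += size
    return out


# Hypothetical filter-bank log energies; one band in the second block is
# disturbed to imitate narrowband noise.
clean = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noisy = list(clean)
noisy[6] += 5.0

clean_blocks = block_dct(clean, [4, 4])
noisy_blocks = block_dct(noisy, [4, 4])
```

With the block-based transform, the first four coefficients are identical for the clean and noisy inputs, whereas a full-band `dct_ii` of the same vectors differs in every coefficient.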
C. Enhanced Code Word Reference Template (ECWRT):
Template-matching-based ASR systems developed the Crosswords Reference Template (CWRT) method
to improve the recognition accuracy by using the 27°–45°–63° local-path constraint. For memory-sensitive
applications, the 0°–45°–90° local-path constraint is used here to avoid the shortcomings of CWRT. To
achieve the required recognition accuracy, a new template-matching method called the Enhanced Crosswords
Reference Template (ECWRT) is used, which reduces the memory requirements without decreasing accuracy. The flow of
ECWRT is shown in Fig. 6.
Fig. 6: Code Word Reference Templates (ECWRT)
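The role of a 0°–45°–90° local-path constraint can be illustrated with generic dynamic time warping, in which each grid point may be reached by a horizontal, diagonal, or vertical step. The Python sketch below is not the ECWRT method, whose internal details are not given in the text; it is a plain template matcher under that step pattern, with a city-block local distance and template names that are hypothetical.

```python
def dtw_distance(test, ref):
    """Accumulated distance between two sequences of feature vectors.

    Allowed predecessor steps: (i-1, j), (i-1, j-1), (i, j-1), i.e. the
    horizontal, diagonal, and vertical moves of a 0-45-90 degree constraint.
    """
    def local(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))

    inf = float("inf")
    n, m = len(test), len(ref)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j], cost[i - 1][j - 1], cost[i][j - 1])
            cost[i][j] = local(test[i - 1], ref[j - 1]) + best_prev
    return cost[n][m]


def recognize(test, templates):
    """Return the name of the reference template closest to the test sequence."""
    return min(templates, key=lambda name: dtw_distance(test, templates[name]))


# Hypothetical one-dimensional feature sequences.
templates = {"yes": [[1.0], [2.0], [3.0]], "no": [[5.0], [5.0], [5.0]]}
stretched = [[1.0], [1.0], [2.0], [2.0], [3.0], [3.0]]
best = recognize(stretched, templates)
```

A time-stretched copy of a template still matches it with zero accumulated distance, which is the property that makes such constraints useful for speech spoken at different rates.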
[Fig. 6 block labels: Extracted DMFCC Feature; Template (mean); ECWRT ROM Module with entries addr 0, addr 1, ..., addr n]
Simulation Results:
The Verilog code for the proposed system was simulated and successfully verified using ModelSim 6.4a and
Xilinx ISE 14.1. The simulation results are described step by step below.
A. Data Set:
The raw speech signal is originally in ‘.wav’ format. To read the .wav file in ModelSim, it
is first converted to a ‘.txt’ file. The .txt file can be read in Verilog by using the system task “$readmemb()”. The
data are then processed further.
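The conversion from .wav to a text file of binary words can be sketched with the Python standard library. The paper does not say which tool was used for this step, so the following is only one possible approach; it assumes 16-bit mono PCM audio, writes each sample as a 16-bit two's-complement binary string (one word per line, the form that `$readmemb` reads), and uses hypothetical function names. The helper `make_demo_wav` exists only to build a small in-memory test file.

```python
import io
import struct
import wave


def wav_to_readmemb_lines(wav_file):
    """Read a 16-bit mono WAV (path or file object) and return binary words."""
    with wave.open(wav_file, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("this sketch assumes 16-bit mono PCM")
        raw = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    # Mask to 16 bits so negative samples appear in two's complement.
    return [format(s & 0xFFFF, "016b") for s in samples]


def make_demo_wav(samples, fs=8000):
    """Build a small in-memory WAV file from integer samples (demo helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(fs)
        wf.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buf.getvalue()


demo_lines = wav_to_readmemb_lines(io.BytesIO(make_demo_wav([0, 1, -1, 32767, -32768])))
```

Joining the returned strings with newlines and writing them to a .txt file gives a file that a Verilog testbench could load into a 16-bit-wide memory array with `$readmemb`.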
B. Pre-processing:
The signal in .txt format is pre-processed to make it ready for feature extraction.
1) Filtering:
Filtering is done with the Wiener FIR filter. The speech samples have to be free from noise, so they undergo
filtering. The filtering process needs filter coefficients, which can be generated using MATLAB.
Fig. 7: Simulation Result of Wiener Filtering
2) Framing:
The speech signal is not stationary and remains approximately constant only for 20 to 30 ms. With a 30 ms frame and
a sampling frequency of 8000 Hz, the speech signal is framed using the formula $n = t_{st} \cdot f_s$.
3) Pre-emphasis:
To improve the signal strength at high frequencies, pre-emphasis is applied to the speech signal. Pre-emphasis
is done for every frame with α = 0.97, as shown in Fig. 8.
Fig. 8: Simulation Result of Pre-emphasis
4) Windowing:
Windowing is essential for capturing the dynamic characteristics of the vocal tract system in the speech production
mechanism. The windows are 10 to 20 ms long. Windowing is done for every frame. The Hamming window is
used in this paper because it is the window commonly used in speech processing.
C. Feature Extraction:
The pre-processed data are used for feature extraction. The feature extracted from the speech is the
DMFCC, which is computed using the DWT, FFT, filter banks, and DCT.
1) Sub Band Processing:
Sub band processing is carried out by applying the DWT to the pre-processed input speech signal. The input
speech is split into different sub bands. The sub band processing is shown in Fig. 9.
Fig. 9: Simulation Result of Sub Band Processing.
2) DMFCC Feature Extraction:
The DMFCC filter bank consists of a bank of band pass filters, and the feature is finally extracted for each
frame. The simulation result of feature extraction is shown in Fig. 10.
Fig. 10: Simulation Result of Feature Extraction
Result:
The accuracy can be estimated by using the recognition accuracy formula. The recognition accuracy is
given by,
$$\text{Recognition Accuracy} = \frac{\text{No. of speeches correctly identified}}{\text{Total no. of speeches in the database}} \qquad (8)$$
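The accuracy formula is a simple ratio. The short Python sketch below expresses it as a percentage, which is how the results are reported in Table I; the function name and the example counts (8 correct out of 10) are hypothetical.

```python
def recognition_accuracy(num_correct, num_total):
    """Recognition accuracy in percent: correctly identified / total * 100."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return 100.0 * num_correct / num_total


# Hypothetical counts, for illustration only.
example = recognition_accuracy(8, 10)
```

With these counts the function returns 80.0.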
The recognition accuracy for each speech signal is listed in Table I.
Table I: Recognition accuracy of DMFCC and MFCC
Speech file    Speech Content                              Recognition Accuracy
                                                           MFCC     DMFCC
Speech 1       Clear Pronunciation is appreciated          80%      100%
Speech 2       Prevention is better than cure              70%      80%
Speech 3       Do you hear sleigh bell’s rings             50%      70%
Speech 4       The mango and papaya are in bowl            60%      50%
Speech 5       Add remaining ingredient’s                  40%      70%
Speech 6       He might say to do something foolish        60%      90%
Speech 7       An official deadline cannot be postponed    30%      50%
Speech 8       Academic aptitude guarantee suit diploma    60%      70%
Speech 9       John catch the big goose without help       80%      100%
Speech 10      First add milk to salty cheese              70%      80%
Recognition Accuracy                                       68%      81%
Table I shows that the recognition accuracy of the DMFCC feature is greater than that of the MFCC feature. The
performance chart is shown in Fig. 11.
[Fig. 11 chart: title "Performance of DMFCC and MFCC Feature"; x-axis: Speech (0 to 15); y-axis: Percentage (0 to 120); series: DMFCC, MFCC]
Fig. 11: Performance Comparison of DMFCC and MFCC.
Fig. 12 shows the RTL schematic of the sub band processing. This schematic shows the number of
registers and the blocks used in the process.
Fig. 12: (a). Filter in DWT. (b). Downsample in DWT for a single stage
Conclusion and future work:
The speech signal successfully underwent noise removal and the subsequent pre-processing stages. The
pre-processed output is passed to the feature extraction stage. The DMFCC feature is
extracted using the pitch frequency through the processes described in the simulation
results. The ECWRT reference template is designed to store all the extracted features. Template matching is then
performed between the reference template and the test signal, and the accuracy is calculated. The system is simulated
using ModelSim 6.4a Starter Edition and implemented on a Virtex-5 FPGA kit. The proposed work
shows that the extracted DMFCC feature gives more accurate recognition than the MFCC feature: the
recognition accuracy of the DMFCC feature is 13% higher than that of the MFCC feature. This work can be
extended by including speaker recognition using the calculated features.
REFERENCES
1. Chih-Hsiang Peng, Chih-Hung Chou, Ta-Wen Kuan, Po-Chuan Lin, Jhing-Fa Wang and Pen-Yuan Yu,
2014. “An Automatic Speaker-Speech Recognition System for Friendly HMI based on Binary Halved
Clustering”, IEEE.
2. Chih-Hsiang Peng, Ta-Wen Kuan, Po-Chuan Lin, Jhing-Fa Wang and Guo-Ji Wu, 2015. “Trainable and
Low-Cost SMO Pattern Classifier Implemented via MCMC and SFBS Technologies”, IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, 23(10).
3. Chih-Hung Chou, Ta-Wen Kuan, Shovan Barma, Bo-Wei Chen, Wen Ji, Chih-Hsiang Peng and Jhing-Fa
Wang, 2015. “A New Binary-Halved Clustering Method and ERT Processor for ASSR System”, IEEE
Transactions on Very Large Scale Integration (VLSI) Systems.
4. Chih-Hung Chou, Ta-Wen Kuan, Po-Chuan Lin, Bo-Wei Chen and Jhing-Fa Wang, 2015. “Memory-
efficient buffering method and enhanced reference template for embedded automatic speech recognition
system”, IET Comput. Digit. Tech., 9(3): 153-164.
5. Rajalakshmi, K., A. Kandaswamy, 2012. “VLSI Architecture of Digital Auditory Filter for Speech
Processor of Cochlear Implant”, International Journal of Computer Applications (0975-8887), 39(7).
6. Md. Sahidullah, Goutam Saha, 2011. “Design, analysis and experimental evaluation of block based
transformation in MFCC computation for speaker recognition”, Elsevier.
7. Ojas A. Bapat, Paul D. Franzon and Richard M. Fastow, 2014. “A Generic and Scalable Architecture for a
Large Acoustic Model and Large Vocabulary Speech Recognition Accelerator Using Logic on Memory”,
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22(12).
8. Seiichi Nakagawa, Longbiao Wang and Shinji Ohtsuka, 2012. “Speaker Identification and Verification by
Combining MFCC and Phase Information”, IEEE Transactions on Audio, Speech, and Language
Processing, 20(4).
9. Ta-Wen Kuan, Jhing-Fa Wang, Jia-Ching Wang, Po-Chuan Lin and Gaung-Hui Gu, 2012. “VLSI Design
of an SVM Learning Core on Sequential Minimal Optimization Algorithm”, IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, 20(4).
10. Tse-Wei Chen and Shao-Yi Chien, 2011. “Flexible Hardware Architecture of Hierarchical K-Means
Clustering for Large Cluster Number”, IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 19(8).
11. Yasodai, A., A. Ramprasad, 2015. “Noise degradation system using Wiener filter and CORDIC based
FFT/IFFT processor”, J. Cent. South Univ., 22: 3849-3859.