
ADVANCES in NATURAL and APPLIED SCIENCES

ISSN: 1995-0772 Published by AENSI Publication EISSN: 1998-1090 http://www.aensiweb.com/ANAS

2017 June 11(8): pages 476-484 Open Access Journal

To Cite This Article: A. Joe Virgin, Dr. S. Selva Nidhyananthan, FPGA Based Speech Recognition using Dynamic MFCC. Advances in Natural and Applied Sciences. 11(8); Pages: 476-484

FPGA Based Speech Recognition using Dynamic MFCC

1A. Joe Virgin, 2Dr. S. Selva Nidhyananthan

1PG Scholar, Department of ECE, Mepco Schlenk Engineering College, Sivakasi.
2Associate Professor, Department of ECE, Mepco Schlenk Engineering College, Sivakasi.

Received 28 March 2017; Accepted 7 June 2017; Available online 12 June 2017

Address For Correspondence: A. Joe Virgin, PG Scholar, Department of ECE, Mepco Schlenk Engineering College, Sivakasi,

Copyright © 2017 by authors and American-Eurasian Network for Scientific Information (AENSI Publication). This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/

ABSTRACT Speech recognition is an emerging technology that enables computers to convert spoken language into text. In speech processing, an array of filter banks is used to separate the multiple components of the input speech signal. Existing speech recognition methods face difficulties with memory size, area, and power when they are implemented. In the proposed system, speech recognition uses an efficient feature extraction technique called Dynamic Mel Frequency Cepstral Coefficients (DMFCC). The Mel Frequency Cepstral Coefficient (MFCC) is the most widely used feature for speech and speaker recognition. However, MFCC features are very sensitive to noise interference, which degrades the performance of the speech recognition system. The proposed system overcomes this by carrying out feature extraction with DMFCC. The feature extraction proceeds through sub band processing, block truncation of the DCT, and filter banks for every frame. An array of band-pass Mel filter banks is used to improve the recognition accuracy. The extracted features are stored in a ROM-based memory module as a template, which acts as the reference template for the test speech signal. The Enhanced Code Word Reference Template (ECWRT) is used for template matching in order to reduce the memory size. Template matching performs a one-to-one mapping between the reference template and the test speech signal. The Automatic Speech Recognition (ASR) architecture was designed and simulated using Verilog HDL. The simulation results are verified with ModelSim Altera 6.4a Starter Edition, and the system is implemented on a Virtex-5 FPGA kit. The recognition accuracy with DMFCC is 13% higher than with MFCC.

KEYWORDS: Dynamic Mel Frequency Cepstral Coefficients (DMFCC), MFCC, Mel filter banks

INTRODUCTION

Digital signal processing is concerned with obtaining discrete representations of signals and with processing signals in that discrete representation. One of the main applications of digital signal processing is speech processing, the study and processing of speech signals represented in digital form. Speech recognition covers several categories, such as speech and pattern recognition and voice recognition. Speech recognition provides a methodology that enables smart technologies to convert speech into text. It is also called Automatic Speech Recognition (ASR).

The most common feature used for speaker identification is Mel Frequency Cepstral Coefficients (MFCC). Because the human auditory system is most sensitive to the pitch frequency of the speaker, a feature that takes the pitch frequency into account gives better results. In this paper, Dynamic Mel Frequency Cepstral Coefficients (DMFCC) (2011) are used as features; they are formed from the Mel frequency spectrum in a way that produces dynamic features. The extracted features are stored as a reference template. With respect to the decoding procedure, this work investigates how the local-path constraints influence memory usage and recognition accuracy, and proposes a new template-matching method, called the enhanced crosswords reference template (ECWRT) (2014 & 2015), to reduce the memory requirements without decreasing accuracy.

ASR systems use a wide range of methods for recognizing speech, and many authors have proposed algorithms in different fields to enhance speech recognition. Md. Sahidullah and Goutam Saha (2015) proposed a speech recognition method based on a buffering scheme named Ultra-Low Queue-Accumulator Buffering (ULQAB) together with CWRT; this method shows a better accuracy rate. Chih-Hung Chou et al. [2] proposed hardware that achieves a 4.3X speedup over a 2.4-GHz Intel Core 2 Duo processor running the CMU Sphinx speech recognition software, using a decoding algorithm with word-dependent N-best Viterbi beam search. However, the design used a logic-on-memory approach and consumed 1.72 W of power. Ojas A. Bapat et al. [7] developed a generic and scalable architecture for a large acoustic model and large-vocabulary speech recognition using MFCC feature extraction. The approach uses a phase information method, but the original phase information creates a problem, and the stated error rate is only about 58.3%. Seiichi Nakagawa et al. [8] proposed a Support Vector Machine (SVM) method that includes the SMO technique. In VLSI design, power, area, and complexity all have to be considered, which creates problems in hardware implementation. Tse-Wei Chen [10] adopted different clustering and classification algorithms for pattern recognition during the training phase, such as general k-means, SMO, and k-nearest neighbor (KNN), to develop a k-means-based clustering method for speaker modeling. The time complexity of standard k-means is very high, which makes it unsuitable for hardware implementation. An ASR front end based on the first central Spectral Moment time-frequency distribution Augmented by low-order Cepstral coefficients (SMAC) was developed by Pirros Tsiakoulis et al. (2011). However, factors such as the reduced frequency resolution, modulation effects in the voiced regions, and the increased number of frames needed to calculate derivative features introduce further complexity. A Multicore and Multichannel (MCMC) and Synchronous and Forward–Backward Scheduling (SFBS) approach was proposed by Chih-Hsiang Peng et al. [2]. The total size of the shared memory for storing Lagrange multipliers, prediction error, and hyperplane is 9 kB, although the memory cost is slightly increased, by 5%.

Proposed Method:

This section describes the step-by-step process of the proposed ASR system. Fig. 1 shows the process followed in the ASR system.

Fig. 1: Proposed ASR System.

A. Pre-processing:

1) Acquisition of input signal:

The input speech signal is a noisy signal taken from the database. The database contains both female and male speakers. The sampling rate of the input signal is 8 kHz. A set of 10 different speech sets from 10 speakers was taken from the TIMIT and MEPCO Speech databases.

2) Wiener FIR filtering:

A Finite Impulse Response (FIR) filter has an impulse response of finite duration, that is, the response settles to zero in finite time. FIR filters do not have any feedback. The Wiener FIR filter (2015) is one type of FIR adaptive filter. Wiener filters play an important role in a wide range of applications such as echo cancellation, signal restoration, and channel identification. The coefficients of a Wiener filter are chosen to minimize the average squared distance between the filter output and the desired signal. The filter coefficients are periodically recalculated for every block of N signal samples.

[Fig. 1 blocks: Speech Database, Test Speech, Pre-processing, Feature Extraction, ECWRT, Template Matching, Recognized Output]

The input-output relation of an FIR filter is given by

$$y(n) = \sum_{k=0}^{P-1} w_k\, x(n-k)$$ (1)

where $y(n)$ is the output of the filter, $x(n)$ is the input, $P$ is the order of the filter, and $w_k$ is the $k$-th filter coefficient.
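The FIR input-output relation above can be illustrated with a minimal Python sketch. The function name `fir_filter` and the six coefficients are hypothetical placeholders (a plain moving average), not the Wiener coefficients of the proposed system, and samples before the start of the signal are assumed to be zero.

```python
def fir_filter(x, w):
    """Direct-form FIR filter: each output is the sum of w[k] * x[n - k].

    Samples before the start of the signal are treated as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, wk in enumerate(w):
            if n - k >= 0:
                acc += wk * x[n - k]
        y.append(acc)
    return y


# Hypothetical six-tap coefficients (moving average), for illustration only.
w = [1.0 / 6.0] * 6
x = [6.0] * 8          # constant test input
y = fir_filter(x, w)
print(y)               # output ramps up while the delay line fills
```

In hardware, the inner loop corresponds to the chain of unit delays, multipliers, and adders of a tapped delay line.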

The Wiener FIR filter is used as a noise-removing filter in the proposed system. A six-tap Wiener FIR filter designed for this purpose is shown in Fig. 2.

Fig. 2: Wiener FIR Filter.

3) Framing:

A speech signal is not stationary in nature, but it appears stationary over a short period of about 20-30 ms, so each speech signal is segmented into frames. When framing a signal, the overlap between adjacent frames is important: overlapping frames reduce the data that would otherwise be lost in the small gaps between frames.

The number of samples in a frame can be obtained as

$$n = t_{st} \times f_s$$ (2)

where $t_{st}$ is the time period of a frame and $f_s$ is the sampling frequency of the signal.

The proposed system uses the following framing specification:

Total number of samples = 40,000 samples.

Speech signal is approximately stationary for about 30 ms.

Sampling frequency, FS = 8 kHz.

Total number of frames = 312.

Number of samples per frame, N = 256 samples.

Overlap samples, M = 128 samples.
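The arithmetic behind this specification can be checked with a short Python sketch. The helper names are hypothetical, and two assumptions are made that the text does not state: the hop between frames is N − M = 128 samples, and the final partial frame is zero-padded. Under these assumptions, 40,000 samples give 312 frames. Note that 30 ms at 8 kHz corresponds to 240 samples, so the listed N = 256 is presumably rounded up to a power of two.

```python
import math


def samples_per_frame(frame_duration_s, fs_hz):
    """Number of samples in one frame: frame duration times sampling rate."""
    return round(frame_duration_s * fs_hz)


def frame_count(total_samples, frame_len, hop):
    """Frames needed to cover the signal when the last frame is zero-padded."""
    if total_samples <= frame_len:
        return 1
    return math.ceil((total_samples - frame_len) / hop) + 1


def split_frames(signal, frame_len, hop):
    """Split a signal into overlapping frames, zero-padding the tail."""
    n_frames = frame_count(len(signal), frame_len, hop)
    padded_len = (n_frames - 1) * hop + frame_len
    padded = list(signal) + [0] * (padded_len - len(signal))
    return [padded[i * hop:i * hop + frame_len] for i in range(n_frames)]


N, M = 256, 128                           # frame length and overlap from the text
print(samples_per_frame(0.030, 8000))     # samples in a 30 ms frame at 8 kHz
print(frame_count(40000, N, N - M))       # frames for the 40,000-sample signal
```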

4) Pre-emphasis:

The higher frequencies in the speech signal need to be emphasized before further processing, and this is done in the pre-emphasis stage. The pre-emphasis factor α is 0.97.

The output of pre-emphasis is given by

$$y(n) = x(n) - \alpha\, x(n-1)$$ (3)
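A minimal Python sketch of this first-order pre-emphasis filter follows. The function name `pre_emphasis` is hypothetical, and passing the first sample through unchanged (because it has no predecessor) is an assumption of the sketch.

```python
def pre_emphasis(x, alpha=0.97):
    """Pre-emphasis: subtract alpha times the previous sample from each sample."""
    if not x:
        return []
    y = [x[0]]  # assumption: the first sample has no predecessor and is kept
    for n in range(1, len(x)):
        y.append(x[n] - alpha * x[n - 1])
    return y


# A constant (low-frequency) input is strongly attenuated,
# while an alternating (high-frequency) input is boosted.
print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
print(pre_emphasis([1.0, -1.0, 1.0, -1.0]))
```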

The architecture of pre-emphasis is shown in Fig. 3.

Fig. 3: Pre-emphasis Architecture

[Fig. 2 labels: input X(n), output Y(n), coefficients a0 to a7, unit delays z^-1, 32-bit data paths]

[Fig. 3 labels: input X(n), output Y(n), unit delay z^-1, α = 0.97]

5) Windowing:

The final stage in pre-processing is windowing. The pre-emphasized output is divided into small frames, and each frame is shaped by a window. The Hamming window is commonly used in speech processing. It is defined as

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$ (4)

The windowed signal is given by

$$S_w(n) = y(n) \cdot w(n)$$ (5)

where $y(n)$ is the pre-emphasized signal and $w(n)$ is the window used.
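The window definition and the sample-by-sample product can be sketched in Python as follows. The function names are hypothetical, and the frame length of 256 is taken from the framing specification.

```python
import math


def hamming(N):
    """Hamming window of length N (N > 1): 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]


def apply_window(frame, window):
    """Multiply a pre-emphasized frame by the window, sample by sample."""
    return [s * w for s, w in zip(frame, window)]


w = hamming(256)
print(w[0], w[-1])  # the window tapers to about 0.08 at both ends
windowed = apply_window([1.0] * 256, w)
```

Tapering the frame edges in this way reduces the discontinuities at frame boundaries before the frequency analysis.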

B. Feature Extraction:

Feature extraction is the process of extracting the useful information from the speech signal. In the proposed system, the Dynamic Mel Frequency Cepstral Coefficients (DMFCC) feature is used. The DMFCC feature is extracted using pitch and Mel frequency information.

1) Sub Band Processing:

Sub band processing is a technique that breaks a signal into a number of different frequency bands and encodes each one independently. The sub band approach has also become popular in recent years in speech recognition (2012). In this related area, the main motivation has been to achieve robust recognition in the face of noise.

The sub band processing is done by the Discrete Wavelet Transform (DWT), and the samples obtained from the DWT then undergo an FFT to convert them into the frequency domain. The flow of the DWT is shown in Fig. 4.
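The following Python sketch illustrates one DWT stage (filtering followed by downsampling by two) and the subsequent transform to the frequency domain. The paper does not name the wavelet, so the Haar wavelet is assumed here purely for illustration. The naive DFT stands in for the FFT, and all function names are hypothetical.

```python
import cmath
import math


def haar_dwt_stage(x):
    """One Haar DWT stage: low-pass and high-pass filtering, then downsampling by 2.

    Returns (approximation, detail). A trailing odd sample is dropped.
    """
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x) - 1, 2)]
    return approx, detail


def dft_magnitude(x):
    """Magnitude spectrum via a naive DFT (an FFT gives the same values faster)."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
            for k in range(N)]


frame = [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, 1.0, -1.0]
low_band, high_band = haar_dwt_stage(frame)
print(dft_magnitude(low_band))
print(dft_magnitude(high_band))
```

Applying `haar_dwt_stage` again to the approximation output would give further sub bands, one stage at a time.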

Fig. 4: Architecture of DWT

Fig. 5 shows the overall block diagram of the feature extraction.

Fig. 5: Block Diagram of Dynamic MFCC.

[Fig. 5 blocks: pre-processed signal; sub bands 1 to N; one Dynamic Mel Filter Bank per sub band; DCT 1 to DCT n per sub band; DMFCC; ECWRT]

2) DMFCC Feature Extraction:

The most commonly used feature for speech and speaker recognition, one that captures both speech and speaker characteristics well, is MFCC [14]. Because the human auditory system perceives pitch changes in speech sensitively, the MFCC computation is combined with the pitch: a set of Mel filters is constructed dynamically according to the results of pitch detection.

The Mel frequency is obtained by

$$\mathrm{Mel}(f_p) = 2595 \log_{10}\left(1 + \frac{f_p}{700}\right)$$ (6)

where $f_p$ is the pitch frequency.
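The Mel mapping can be written as a small Python sketch. The function names are hypothetical, and the inverse mapping `mel_to_hz` is not given in the text; it is included only because filter-bank edges are usually placed on the Mel scale and then converted back to hertz.

```python
import math


def hz_to_mel(f_hz):
    """Mel value of a frequency in Hz: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel (an addition of this sketch, not taken from the text)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Hypothetical pitch frequencies in Hz, for illustration only.
for f_p in (100.0, 200.0, 400.0):
    print(f_p, hz_to_mel(f_p))
```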

3) Discrete Cosine Transform (DCT):

The DCT is performed on the Dynamic Mel Filter Log Energies (DMFLE) in order to decorrelate the features. When a full-band DCT is applied to the log energies of the speech signal, all the features are affected by the noise, which makes them unsuitable for speaker identification. To alleviate this problem, a block-based transformation is performed: the filter log energies are divided into blocks, and the DCT is performed on each block.

Unlike the standard full-band DCT technique, the filter bank log energies are decomposed into several blocks. In this work, the whole signal is divided into non-overlapping blocks, and the individual blocks are processed independently. Therefore, the presence of narrowband noise in one block will not affect the other blocks, because of truncation. The transformation matrix can be given as

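$$\mathbf{T} = \begin{bmatrix} \mathbf{C}_{1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_{2} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{C}_{N} \end{bmatrix}$$ (7)

where $\mathbf{C}_i$ is used here to denote the DCT matrix of the $i$-th block (of size $L_i$), $N$ is the number of blocks, and every off-diagonal block is zero.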
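The effect of the block-based transformation can be sketched in Python. The sketch applies an unnormalized DCT-II to each block independently and compares it with a full-band DCT. The block sizes, the energy values, and the function names are hypothetical, and the truncation of coefficients is not modeled.

```python
import math


def dct2(x):
    """Unnormalized DCT-II of a list of values."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
            for k in range(N)]


def block_dct(log_energies, block_sizes):
    """Apply the DCT to consecutive, non-overlapping blocks of the log energies."""
    if sum(block_sizes) != len(log_energies):
        raise ValueError("block sizes must cover the input exactly")
    out, start = [], 0
    for size in block_sizes:
        out.extend(dct2(log_energies[start:start + size]))
        start += size
    return out


clean = [1.0] * 8            # hypothetical filter bank log energies
noisy = list(clean)
noisy[6] += 5.0              # narrowband disturbance in the second block only

# Block-based: coefficients of the first block are unchanged by the disturbance.
print(block_dct(clean, [4, 4])[:4] == block_dct(noisy, [4, 4])[:4])
# Full-band: the same disturbance changes the coefficient vector as a whole.
print(dct2(clean) == dct2(noisy))
```

Processing each block separately is equivalent to multiplying the log-energy vector by a block-diagonal matrix with one DCT matrix per block.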

C. Enhanced Code Word Reference Template (ECWRT):

Template-matching-based ASR systems developed the Crosswords Reference Template (CWRT) method to improve the recognition accuracy through the 27°–45°–63° local-path constraint. For memory-sensitive applications, the 0°–45°–90° local-path constraint is used here to avoid the shortcomings of CWRT. To achieve the required recognition accuracy, a new template-matching method, called the Enhanced Crosswords Reference Template (ECWRT), is used to reduce the memory requirements without decreasing accuracy. The flow of ECWRT is shown in Fig. 6.
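The text does not spell out the ECWRT matching procedure, so the Python sketch below is not ECWRT itself. It only illustrates generic dynamic-time-warping template matching under a 0°–45°–90° local-path constraint, where each cell of the cost grid is reached by a horizontal, diagonal, or vertical step. The function names, the Euclidean frame distance, and the toy templates are assumptions.

```python
def dtw_distance(test, ref):
    """Accumulated DTW cost between two sequences of feature vectors.

    Each cell takes the cheapest of its horizontal, diagonal, and vertical
    predecessors (the 0-45-90 degree local paths).
    """
    inf = float("inf")
    n, m = len(test), len(ref)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames (an assumed local cost).
            cost = sum((a - b) ** 2 for a, b in zip(test[i - 1], ref[j - 1])) ** 0.5
            D[i][j] = cost + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[n][m]


def recognize(test, templates):
    """Return the label of the reference template closest to the test features."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label]))


# Toy one-dimensional "features", for illustration only.
templates = {"word_a": [[0.0], [1.0], [2.0]], "word_b": [[5.0], [5.0], [5.0]]}
test = [[0.0], [0.0], [1.0], [2.0]]
print(recognize(test, templates))
```

Only the previous and the current row of `D` are needed at any time, which is the kind of property that memory-oriented matching schemes exploit.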

Fig. 6: Code Word Reference Templates (ECWRT)

[Fig. 6 blocks: Extracted DMFCC Feature; Template (mean); ECWRT ROM Module with entries addr 0, addr 1, ..., addr n]

Simulation Results:

The Verilog code for the proposed system was simulated and successfully verified using ModelSim 6.4a and Xilinx ISE 14.1. The simulation results are described step by step below.

A. Data Set:

The raw speech signal is originally in ‘.wav’ format. To read the .wav file in ModelSim, it is first converted to a ‘.txt’ file. The .txt file can be read in Verilog using the system task “$readmemb()”. The data are then processed further.
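The conversion step can be sketched in Python using only the standard library. Since `$readmemb` expects one binary word per entry, each sample is written as a fixed-width two's-complement bit string. The function names, the assumption of 16-bit mono PCM input, and the 16-bit word width are hypothetical choices; the text does not state the sample format.

```python
import struct
import wave


def to_readmemb_lines(samples, width=16):
    """Format signed integer samples as two's-complement binary strings."""
    mask = (1 << width) - 1
    return [format(s & mask, "0{}b".format(width)) for s in samples]


def read_wav_mono16(path):
    """Read a 16-bit mono PCM .wav file into a list of signed integers."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("this sketch assumes 16-bit mono PCM")
        raw = wf.readframes(wf.getnframes())
    return list(struct.unpack("<{}h".format(len(raw) // 2), raw))


def wav_to_txt(wav_path, txt_path, width=16):
    """Write one binary word per line, ready to be loaded with $readmemb."""
    lines = to_readmemb_lines(read_wav_mono16(wav_path), width)
    with open(txt_path, "w") as f:
        f.write("\n".join(lines) + "\n")


print(to_readmemb_lines([0, 1, -1]))
```

On the Verilog side, the memory array loaded by `$readmemb` would need the same word width as the one chosen here.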

B. Pre-processing:

The signal in .txt format is pre-processed to make it ready for feature extraction.

1) Filtering:

Filtering is done with the Wiener FIR filter. The speech samples have to be free from noise, so they undergo filtering. The filtering process needs filter coefficients, which can be generated by means of MATLAB.

Fig. 7: Simulation Result of Wiener Filtering

2) Framing:

The speech signal is not stationary and remains approximately constant only for about 20-30 ms. For a frame duration of 30 ms at a sampling frequency of 8000 Hz, the speech signal is framed using the formula $n = t_{st} \times f_s$.

3) Pre-emphasis:

To improve the signal strength at high frequencies, pre-emphasis is applied to the speech signal. Pre-emphasis is done for every frame with α = 0.97, as shown in Fig. 8.

Fig. 8: Simulation Result of Pre- emphasis

4) Windowing:

Windowing is essential for capturing the dynamic characteristics of the vocal tract system in the speech production mechanism. The windows are 10-20 ms in length. Windowing is done for every frame. The Hamming window is used in this paper, as it is considered the best suited among the window types.

C. Feature Extraction:

The pre-processed data are used for feature extraction. The feature extracted from the speech is DMFCC. DMFCC extraction is carried out through the DWT, FFT, filter banks, and DCT.

1) Sub Band Processing:

Sub band processing is carried out by applying the DWT to the pre-processed input speech signal, which splits the input speech into different sub bands. The sub band processing is shown in Fig. 9.

Fig. 9: Simulation Result of Sub Band Processing.

2) DMFCC Feature Extraction:

The DMFCC filter bank consists of a bank of band-pass filters, and the feature is finally extracted for each frame. The simulation result of feature extraction is shown in Fig. 10.

Fig. 10: Simulation Result of Feature Extraction

Result:

The accuracy can be estimated using the recognition accuracy formula, which is given by

$$\text{Recognition Accuracy} = \frac{\text{No. of speeches correctly identified}}{\text{Total no. of speeches in the database}}$$ (8)
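The accuracy ratio can be written as a one-line Python function. The function name and the counts in the example are hypothetical, and the multiplication by 100 (to express the ratio as a percentage) is an assumption of the sketch.

```python
def recognition_accuracy(correct, total):
    """Share of correctly identified speeches, expressed as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Hypothetical counts: 8 of 10 utterances identified correctly.
print(recognition_accuracy(8, 10))
```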

The recognition accuracy for each speech signal is listed in Table I.

Table I: Recognition accuracy of DMFCC and MFCC

| Speech file | Speech content | Recognition accuracy (MFCC) | Recognition accuracy (DMFCC) |
|---|---|---|---|
| Speech 1 | Clear Pronunciation is appreciated | 80% | 100% |
| Speech 2 | Prevention is better than cure | 70% | 80% |
| Speech 3 | Do you hear sleigh bell’s rings | 50% | 70% |
| Speech 4 | The mango and papaya are in bowl | 60% | 50% |
| Speech 5 | Add remaining ingredient’s | 40% | 70% |
| Speech 6 | He might say to do something foolish | 60% | 90% |
| Speech 7 | An official deadline cannot be postponed | 30% | 50% |
| Speech 8 | Academic aptitude guarantee suit diploma | 60% | 70% |
| Speech 9 | John catch the big goose without help | 80% | 100% |
| Speech 10 | First add milk to salty cheese | 70% | 80% |
| Recognition Accuracy | | 68% | 81% |

Table I shows that the recognition accuracy of the DMFCC feature is greater than that of the MFCC feature. The performance chart is shown in Fig. 11.

[Fig. 11 chart: title "Performance of DMFCC and MFCC Feature"; x-axis: Speech (0 to 15); y-axis: Percentage (0 to 120); series: DMFCC and MFCC]

Fig. 11: Performance Comparison of DMFCC and MFCC.

Fig. 12 shows the RTL schematic of the sub band processing. This schematic shows the registers and blocks used in the process.

Fig. 12: (a). Filter in DWT. (b). Downsample in DWT for a single stage

Conclusion and future work:

The speech signal successfully underwent noise removal and the subsequent pre-processing stages. The pre-processed output is passed to the feature extraction stage. The DMFCC feature is extracted using the pitch frequency through the sequence of processes described in the simulation results. The ECWRT reference template is designed to store all the extracted features. Template matching is then performed between the reference template and the test signal, and the accuracy is calculated. The system is simulated using ModelSim 6.4a Starter Edition and implemented on a Virtex-5 FPGA kit. The proposed work shows that the extracted DMFCC feature gives more accurate recognition than the MFCC feature: the recognition accuracy of the DMFCC feature is 13% higher than that of the MFCC feature. This work can be extended by including speaker recognition based on the calculated features.


REFERENCES

1. Chih-Hsiang Peng, Chih-Hung Chou, Ta-Wen Kuan, Po-Chuan Lin, Jhing-Fa Wang and Pen-Yuan Yu, 2014. “An Automatic Speaker-Speech Recognition System for Friendly HMI based on Binary Halved Clustering”, IEEE.

2. Chih-Hsiang Peng, Ta-Wen Kuan, Po-Chuan Lin, Jhing-Fa Wang and Guo-Ji Wu, 2015. “Trainable and Low-Cost SMO Pattern Classifier Implemented via MCMC and SFBS Technologies”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 23(10).

3. Chih-Hung Chou, Ta-Wen Kuan, Shovan Barma, Bo-Wei Chen, Wen Ji, Chih-Hsiang Peng and Jhing-Fa Wang, 2015. “A New Binary-Halved Clustering Method and ERT Processor for ASSR System”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

4. Chih-Hung Chou, Ta-Wen Kuan, Po-Chuan Lin, Bo-Wei Chen and Jhing-Fa Wang, 2015. “Memory-efficient buffering method and enhanced reference template for embedded automatic speech recognition system”, IET Comput. Digit. Tech., 9(3): 153-164.

5. Rajalakshmi, K., A. Kandaswamy, 2012. “VLSI Architecture of Digital Auditory Filter for Speech Processor of Cochlear Implant”, International Journal of Computer Applications (0975-8887), 39(7).

6. Md. Sahidullah, Goutam Saha, 2011. “Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition”, Elsevier.

7. Ojas A. Bapat, Paul D. Franzon and Richard M. Fastow, 2014. “A Generic and Scalable Architecture for a Large Acoustic Model and Large Vocabulary Speech Recognition Accelerator Using Logic on Memory”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22(12).

8. Seiichi Nakagawa, Longbiao Wang and Shinji Ohtsuka, 2012. “Speaker Identification and Verification by Combining MFCC and Phase Information”, IEEE Transactions on Audio, Speech, and Language Processing, 20(4).

9. Ta-Wen Kuan, Jhing-Fa Wang, Jia-Ching Wang, Po-Chuan Lin and Gaung-Hui Gu, 2012. “VLSI Design of an SVM Learning Core on Sequential Minimal Optimization Algorithm”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 20(4).

10. Ta-Wen Kuan, Jhing-Fa Wang, Jia-Ching Wang, Po-Chuan Lin and Gaung-Hui Gu, 2012. “VLSI Design of an SVM Learning Core on Sequential Minimal Optimization Algorithm”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 20(4).

11. Tse-Wei Chen and Shao-Yi Chien, 2011. “Flexible Hardware Architecture of Hierarchical K-Means Clustering for Large Cluster Number”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(8).

12. Yasodai, A., A. Ramprasad, 2015. “Noise degradation system using Wiener filter and CORDIC based FFT/IFFT processor”, J. Cent. South Univ., 22: 3849-3859.
