
Improved DTW Speech Recognition Algorithm Based on the MEL Frequency Cepstral Coefficients

WEI Ming-zhe, LI Xi, REN Li-mian

Tangshan College, Department of Information Engineering

Email: [email protected]

Abstract: The combination of Mel cepstrum coefficients with the DTW algorithm is a relatively mature technique in speech recognition and achieves comparatively high recognition accuracy. However, as speech recognition technology develops and its application domain expands, requirements keep rising, and this method no longer fully satisfies them. This paper proposes an improved speech recognition algorithm based on improved Mel-frequency cepstral coefficients (MFCC), which serve as the feature template for the dynamic time warping (DTW) technique. The algorithm improves both the construction of the template and the matching of the extracted speech features, so that DTW can work efficiently and accurately. Compared with the classical method, in which the Mel-frequency cepstral coefficients are used directly as the template, the proposed method improves the recognition rate in word recognition experiments on a speaker's voice.

Keywords: MFCC; WFCC; DTW; Speech Recognition

1. Introduction and Ideas

The speech recognition rate depends on how accurately the unknown template matches the templates already stored in the database, so the template matching method determines the final recognition result.

DTW (Dynamic Time Warping) uses a dynamic time alignment algorithm to solve the problem of time-scale stretching caused by differences in speaking rate. Because the algorithm is simple and its hardware requirements are modest, it has been applied successfully to small-vocabulary isolated-word and small-vocabulary connected-word speech recognition systems, and it is a relatively mature speech recognition algorithm. MFCC (Mel-Frequency Cepstral Coefficients) take full account of the structure of the human auditory system and of the characteristics of the human voice, so MFCC features extracted from speech are robust. A speaker's individual characteristics are reflected in the shape of the vocal tract, that is, in the distribution of the formants. MFCC take human perceptual characteristics into account, but the formants deviate from the regions to which hearing is most sensitive.

In this paper, speech is recognized by combining weighted Mel frequency cepstral coefficients (WFCC) with the dynamic time warping algorithm. At the formants, the weighted Mel estimated spectrum approximates the modified periodogram of the speech more accurately. Weighted Mel frequency cepstral coefficients offer good recognition performance and noise immunity without requiring any additional assumptions.

2. Algorithm Description

2.1. Weighted Mel Cepstrum Analysis

Existing methods for determining the weighted Mel frequency cepstral coefficients are still relatively simple. In general, the coefficients are obtained by minimizing an error function of the spectrum estimated from the Mel frequency cepstral coefficients. MFCC characterize the frequency spectrum, and their analysis is based on the unbiased-estimator log-ratio method. In MFCC analysis, the optimization of the cepstral coefficients exploits the nonlinearity of human hearing and is based on the short-time spectrum of the sound. Given these two considerations, the error function of the Mel frequency cepstral coefficients consists of two main quantities: the Mel estimated spectrum $H(\omega)$ and the modified periodogram $I_N(\omega)$. The Mel estimated spectrum $H(\omega)$ is defined as follows:


Information and Communication Technology and Smart Grid

978-1-935068-23-5 © 2010 SciRes.

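$$H(\omega) = \exp \sum_{n=0}^{M} c_n \tilde{z}^{-n}, \quad \text{where} \quad z = e^{j\omega}, \quad \tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}$$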

where $\alpha$ is the Mel frequency transform coefficient, $M$ is the order of the Mel cepstrum estimate, and $c_n$ are the Mel frequency cepstral coefficients. For a frame of the speech signal $x(n)$, $n = 0, 1, \ldots, N-1$, the modified periodogram of the signal is defined as

$$I_N(\omega) = \frac{\left| \sum_{n=0}^{N-1} w(n)\, x(n)\, e^{-j\omega n} \right|^2}{\sum_{n=0}^{N-1} w^2(n)}$$

where $w(n)$ is the window function. The estimation error function is defined by the following formula, and the optimal Mel frequency cepstral coefficients are obtained by solving the corresponding minimization problem:

$$\varepsilon = \frac{1}{2\pi} \int_{0}^{2\pi} \left[ \frac{I_N(\omega)}{|H(\omega)|^2} - \log I_N(\omega) + \log |H(\omega)|^2 - 1 \right] d\omega$$
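As a minimal numerical sketch of the quantities in this subsection, the following Python code evaluates the modified periodogram, the Mel estimated spectrum, and the estimation error on a uniform frequency grid. The function names, the warping coefficient value `ALPHA = 0.35`, the grid size `K`, the small floor inside the logarithm, and the toy impulse frame are illustrative assumptions that do not come from the paper, and the integral is approximated by a Riemann sum.

```python
import cmath
import math

ALPHA = 0.35  # assumed Mel warping coefficient; the paper does not give a value


def modified_periodogram(x, w, omega):
    """I_N(omega) = |sum_n w(n) x(n) e^{-j omega n}|^2 / sum_n w(n)^2."""
    num = sum(wn * xn * cmath.exp(-1j * omega * n)
              for n, (wn, xn) in enumerate(zip(w, x)))
    den = sum(wn * wn for wn in w)
    return abs(num) ** 2 / den


def warped_z_inv(omega, alpha):
    """First-order all-pass warping: (z^-1 - alpha) / (1 - alpha z^-1), z = e^{j omega}."""
    z_inv = cmath.exp(-1j * omega)
    return (z_inv - alpha) / (1 - alpha * z_inv)


def mel_spectrum(c, omega, alpha):
    """H(omega) = exp(sum_n c_n * z_tilde^{-n}) for coefficients c_0..c_M."""
    zt = warped_z_inv(omega, alpha)
    return cmath.exp(sum(cn * zt ** n for n, cn in enumerate(c)))


def estimation_error(c, x, w, alpha, K=256):
    """Riemann-sum approximation of
    (1/2pi) * integral over [0, 2pi) of I/|H|^2 - log I + log|H|^2 - 1."""
    total = 0.0
    for k in range(K):
        omega = 2 * math.pi * k / K
        # Floor avoids log(0) at exact spectral zeros (an implementation assumption).
        I = max(modified_periodogram(x, w, omega), 1e-12)
        H2 = abs(mel_spectrum(c, omega, alpha)) ** 2
        total += I / H2 - math.log(I) + math.log(H2) - 1.0
    return total / K


# Toy frame: a unit impulse with a rectangular window has a flat periodogram 1/N.
N = 8
impulse = [1.0] + [0.0] * (N - 1)
rect = [1.0] * N

# |H|^2 = exp(2 * c_0) equals 1/N when c_0 = -0.5 * log(N), so the error vanishes.
err_match = estimation_error([-0.5 * math.log(N)], impulse, rect, ALPHA)
# A mismatched model (c_0 = 0, so |H|^2 = 1) gives a strictly positive error.
err_flat = estimation_error([0.0], impulse, rect, ALPHA)
```

The integrand has the form r - log r - 1 with r = I_N/|H|^2, which is zero only when the model spectrum equals the periodogram and positive otherwise; this is why the matched coefficient yields an error near zero in the toy case.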

Psychoacoustic perception experiments show that pure tones of different frequencies and different loudness contribute differently to human hearing. The different frequency components should therefore carry different weights in the estimation error. Accordingly, the weighted estimation error function based on the Mel cepstrum is defined as:

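$$\varepsilon_W = \frac{1}{2\pi} \int_{0}^{2\pi} W(\omega) \left[ \frac{I_N(\omega)}{|H(\omega)|^2} - \log I_N(\omega) + \log |H(\omega)|^2 - 1 \right] d\omega$$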

$W(\omega)$ is the non-negative perceptual weighting function. Note that when $W(\omega)$ is a constant function, weighted Mel cepstrum analysis reduces to ordinary Mel cepstrum analysis.
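The effect of the weighting can be illustrated with a small discrete sketch. The code below assumes that the periodogram, the squared model spectrum, and the weight have already been sampled on the same uniform grid over [0, 2π); the function name `weighted_error` and all sample values are hypothetical and serve only to show how a larger weight raises the cost of a mismatch in one frequency bin.

```python
import math


def weighted_error(I, H2, W):
    """Discrete approximation of (1/2pi) * integral of
    W * (I/|H|^2 - log I + log|H|^2 - 1) over K uniform frequency samples."""
    K = len(I)
    return sum(w * (i / h - math.log(i) + math.log(h) - 1.0)
               for i, h, w in zip(I, H2, W)) / K


# Hypothetical samples: the model matches the periodogram except in bin 2.
I_samples = [1.0, 2.0, 4.0, 2.0]
H2_samples = [1.0, 2.0, 2.0, 2.0]

# A constant weight reproduces the unweighted Mel cepstrum error.
flat = weighted_error(I_samples, H2_samples, [1.0, 1.0, 1.0, 1.0])
# Emphasising bin 2 (weights still average to 1) makes the same mismatch cost more.
emphasised = weighted_error(I_samples, H2_samples, [0.5, 0.5, 2.5, 0.5])
```

With the constant weight the result equals the unweighted error, in line with the remark that the weighted analysis reduces to Mel cepstrum analysis in that case.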

2.2. Choice of Perceptual Weighting Function

How well the perceptual weighting function $W(\omega)$ is chosen directly affects the ability of the weighted Mel frequency cepstral coefficients to describe spectral characteristics. Psychoacoustic models show that the sensitivity of human hearing is a nonlinear function of sound frequency and also depends on the absolute hearing threshold, the critical bandwidth, and the masking effect. In the psychoacoustic model, the signal-to-mask ratio (SMR) quantitatively reflects both the contribution of each frequency component to human hearing and the sensitivity of hearing to the estimation error in that component. We therefore use the SMR output by the psychoacoustic model as the auditory perceptual weight. Because the model outputs the SMR only at discrete values, the SMR must be interpolated to obtain a continuous perceptual weighting function $W(\omega)$. The initial weight function is determined from the interpolated continuous function and is denoted by $L(\omega)$, $0 \le \omega \le \pi$. The perceptual weighting function is then defined as

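$$W(\omega) = \begin{cases} \dfrac{L(\omega)}{2 \int_{0}^{\pi} L(\theta)\, d\theta}, & 0 \le \omega \le \pi \\[2ex] \dfrac{L(2\pi - \omega)}{2 \int_{0}^{\pi} L(\theta)\, d\theta}, & \pi < \omega \le 2\pi \end{cases}$$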
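The construction of the weighting function can be sketched as follows. The paper does not state how the SMR is interpolated or how the initial weight is derived from it, so this sketch assumes piecewise-linear interpolation, a small positive floor on the interpolated values, a mirrored extension to (π, 2π], and a normalisation that makes the weight integrate to one over [0, 2π]. The band frequencies and SMR values are hypothetical, and `K` is assumed to be even.

```python
import math


def interp_linear(xs, ys, x):
    """Piecewise-linear interpolation with flat extrapolation outside [xs[0], xs[-1]]."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for k in range(1, len(xs)):
        if x <= xs[k]:
            t = (x - xs[k - 1]) / (xs[k] - xs[k - 1])
            return ys[k - 1] + t * (ys[k] - ys[k - 1])


def perceptual_weight(band_omegas, band_smr, K=512):
    """Return W sampled at omega_k = 2*pi*k/K for k = 0..K-1 (K even)."""
    half = K // 2
    # Initial weight L on [0, pi]: interpolated SMR with a positive floor (assumption).
    L = [max(interp_linear(band_omegas, band_smr, math.pi * k / half), 1e-6)
         for k in range(half + 1)]
    # Trapezoidal estimate of the integral of L over [0, pi].
    step = math.pi / half
    area = step * (sum(L) - 0.5 * (L[0] + L[-1]))
    W = []
    for k in range(K):
        idx = k if k <= half else K - k  # mirror: L(2*pi - omega) on (pi, 2*pi)
        W.append(L[idx] / (2.0 * area))
    return W


# Hypothetical SMR values (dB) at hypothetical band-centre frequencies (rad/sample).
band_omegas = [0.2, 0.8, 1.6, 2.4, 3.0]
band_smr = [5.0, 12.0, 9.0, 3.0, 1.0]
W = perceptual_weight(band_omegas, band_smr)
```

The mirrored half keeps the weight symmetric about π, matching the symmetry of the spectrum of a real signal over the integration range of the error function.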

2.3. Dynamic Time Warping Technique

Speech signals are highly random: even when the same person utters the same sound at different times, the durations are never exactly the same, so time alignment is essential. The most mature technique for this problem is dynamic time warping (DTW). DTW is a nonlinear warping technique that combines time alignment with distance measure calculation, and it has been used successfully as a matching algorithm in speech recognition.

Using dynamic programming (DP), dynamic time warping converts a complex global optimization problem into many local optimization problems and makes the decision step by step. Suppose the feature vector sequence of the reference template is $X = \{x_1, x_2, \ldots, x_I\}$ and the feature vector sequence of the input speech is $Y = \{y_1, y_2, \ldots, y_J\}$, where in general $I \neq J$, as illustrated in Figure 1.

The DTW algorithm seeks an optimal time warping function that minimizes the total cumulative distortion and maps the timeline $j$ of the unknown test template non-linearly onto the timeline $i$ of the reference template.

Figure 1. Dynamic time warping process (horizontal axis: reference template timeline $i$; vertical axis: tested voice timeline $j$; the warping path runs from Start to End)

Suppose the time warping function is

$$C = \{c(1), c(2), \ldots, c(N)\}$$

where $N$ is the path length and $c(n) = (i(n), j(n))$ denotes the $n$-th matching point, formed by the $i(n)$-th feature vector of the reference template and the $j(n)$-th feature vector of the test template. The distance (or distortion value) $d(x_{i(n)}, y_{j(n)})$ between these two vectors is called the partial matching distance. Through local optimization, the DTW algorithm finds the minimum weighted sum of distances, that is

$$D = \min_{C} \frac{\sum_{n=1}^{N} d(x_{i(n)}, y_{j(n)})\, W_n}{\sum_{n=1}^{N} W_n}$$

In this formula, the choice of the weighting function $W_n$ should take two prerequisites into account. First, depending on the local path taken in the step before the $n$-th matching point, local paths in the 45-degree direction are penalized, in order to accommodate the case $I \neq J$. Second, different parts of the speech are given different weights to strengthen certain distinctive features. In addition, this paper imposes a third condition on the weighting function: $W_n$ is determined nonlinearly by the SMR values output by the psychoacoustic model, so that it corresponds to the perceptual weighting function $W(\omega)$ of the weighted Mel frequency cepstral coefficients. This condition ties the SMR of each frame, as used in the psychoacoustic model, to the DTW matching.
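The dynamic programming recursion behind the weighted distance can be sketched as follows. The paper does not specify its local path constraints or the exact SMR-based form of $W_n$, so this sketch substitutes the common symmetric step weights (1 for a horizontal or vertical step, 2 for a diagonal step), for which the sum of the weights equals I + J on every path and the normalisation can be applied after the recursion. The function names, the Euclidean local distance, and the toy sequences are assumptions.

```python
import math


def euclidean(a, b):
    """Local distance d(x_i, y_j) between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def dtw_distance(X, Y, dist=euclidean):
    """Normalised DTW distance between reference X (I frames) and test Y (J frames).

    Assumed step weights: 1 for horizontal/vertical moves, 2 for diagonal moves,
    so the weights sum to I + J along any admissible path.
    """
    I, J = len(X), len(Y)
    INF = float("inf")
    g = [[INF] * J for _ in range(I)]
    g[0][0] = 2.0 * dist(X[0], Y[0])
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            d = dist(X[i], Y[j])
            best = INF
            if i > 0:
                best = min(best, g[i - 1][j] + d)            # advance reference only
            if j > 0:
                best = min(best, g[i][j - 1] + d)            # advance test only
            if i > 0 and j > 0:
                best = min(best, g[i - 1][j - 1] + 2.0 * d)  # diagonal step
            g[i][j] = best
    return g[I - 1][J - 1] / (I + J)


# Toy one-dimensional feature sequences (hypothetical data).
reference = [[0.0], [1.0], [2.0]]
slow_copy = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]  # same word, half the rate
other = [[5.0], [5.0], [5.0]]

d_same = dtw_distance(reference, slow_copy)
d_other = dtw_distance(reference, other)
```

The time-stretched copy aligns with the reference at zero cost even though the two sequences differ in length, which is the situation DTW is meant to handle.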

Table 1. Recognition rate comparison

| Number of words to be recognized | MFCC recognition rate | WFCC recognition rate |
|---|---|---|
| 1 | 94.7% | 97.3% |
| 2 | 95.3% | 97.5% |
| 3 | 96.6% | 96.4% |
| 4 (Idiom) | 99.2% | 99.1% |

3. Experimental Analysis

The experimental feature database consists of the voices of 10 male and 10 female speakers, recorded with the recording software of a Windows PC in an environment with low-intensity noise. The experiment was divided into two processes, training and recognition. In the training process, each speaker recorded the words at a rate slower than normal speech; the recordings were saved as the original templates, and their WFCC features were extracted and archived. In the recognition process, the words were recorded at a normal speech rate, their WFCC features were extracted, and the features were matched against the templates using the DTW algorithm. The recognition rate was also compared with that obtained using MFCC; the results are given in Table 1.
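The recognition step and the recognition rate can be sketched as a nearest-template search. In the sketch below, `recognize` and `recognition_rate` are hypothetical names, the labels and feature values are invented, and `mean_abs_gap` is only a stand-in distance that keeps the example self-contained; in the experiment described here the distance would be the DTW distance between WFCC feature sequences.

```python
def recognize(test_features, templates, distance):
    """Return the label of the stored template closest to the test utterance."""
    return min(templates, key=lambda label: distance(templates[label], test_features))


def recognition_rate(trials, templates, distance):
    """Fraction of (true_label, features) trials whose recognized label is correct."""
    correct = sum(1 for label, feats in trials
                  if recognize(feats, templates, distance) == label)
    return correct / len(trials)


def mean_abs_gap(a, b):
    """Stand-in distance (NOT DTW): absolute difference of the sequence means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


# Hypothetical templates recorded during training, one per word.
templates = {"yes": [1.0, 1.2, 0.9], "no": [4.0, 4.1, 3.8, 4.2]}

# Hypothetical test utterances; the third one is deliberately confusable.
trials = [
    ("yes", [1.1, 1.0]),
    ("no", [3.9, 4.0, 4.3]),
    ("no", [1.0, 1.1, 0.9]),
]
rate = recognition_rate(trials, templates, mean_abs_gap)
```

Passing the distance as a parameter keeps the matching rule independent of the feature type, so the same loop could be run once with MFCC templates and once with WFCC templates to produce a comparison of recognition rates.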

The following can be concluded from Table 1: compared with MFCC, WFCC improves the recognition rate for the one-word and two-word items, whereas for trisyllabic words and idioms the two features give almost the same recognition rate. It follows that the DTW recognition algorithm combined with the improved Mel frequency cepstral coefficients achieves a higher recognition rate for recognition of isolated one-word and two-word items.

4. Conclusions

Starting from the usual practice of using MFCC features with DTW, this paper improves the MFCC for isolated word recognition and links the two algorithms organically through the choice of weights. The experimental data show that speech recognition combining the improved Mel frequency cepstral coefficients with DTW is more robust than MFCC-based speech recognition algorithms.


