Improved DTW Speech Recognition Algorithm Based on the MEL Frequency Cepstral Coefficients
WEI Ming-zhe, LI Xi, REN Li-mian
Department of Information Engineering, Tangshan College
Email: [email protected]
Abstract: Combining MEL cepstrum coefficients with the DTW algorithm is a relatively mature approach in speech recognition, and its recognition accuracy is comparatively high. However, as speech recognition technology develops and its application domains expand, requirements keep rising, and this method no longer fully satisfies them. This paper proposes an improved speech recognition algorithm based on improved MEL frequency cepstral coefficients. The MEL-frequency cepstral coefficients (MFCC) serve as the feature template used in the dynamic time warping (DTW) technique. The algorithm improves both the establishment of the template and the matching of the extracted voice signal features, so that the DTW algorithm can work efficiently and accurately. Compared with the classical method, which applies the MEL frequency cepstral coefficients directly as the template, this method improves the recognition rate in experiments on recognizing a speaker's words.
Keywords: MFCC; WFCC; DTW; Speech Recognition
1. Introduction and Ideas
The speech recognition rate depends on the matching accuracy between the unknown template and the templates stored in the database. The template matching method determines the final speech recognition result.

DTW (Dynamic Time Warping) uses dynamic time alignment to solve the problem of time stretching caused by differences in speaking rate. Because the algorithm is simple and its hardware requirements are modest, it has been applied successfully to small-vocabulary isolated word and small-vocabulary connected word speech recognition systems, and it is a relatively mature speech recognition algorithm. MFCC (Mel-Frequency Cepstral Coefficients) take full account of the structure of human hearing and the characteristics of the human voice, so MFCC features extracted from speech are robust. The speaker's individual characteristics are reflected in the shape of the vocal tract, that is, in the distribution of the formants. MFCC account for human perception characteristics, but the formants deviate from the regions to which hearing is most sensitive.

In this paper, speech is recognized by combining weighted Mel frequency cepstral coefficients with the dynamic time warping algorithm. At the formants, the weighted Mel estimated spectrum approximates the modified periodogram of the speech more accurately. Weighted Mel frequency cepstral coefficients offer good recognition performance and noise immunity without requiring any additional assumptions.
2. Algorithm Description
2.1. Weighted Mel Cepstrum Analysis
Existing methods for determining the weighted Mel frequency cepstral coefficients are still relatively simple. Generally, the coefficients are obtained by minimizing an error function of the spectrum estimated from the Mel frequency cepstral coefficients. MFCC characterize the frequency spectrum, and their analysis is based on the unbiased-estimator log-ratio method. In MFCC analysis, the optimization of the cepstral coefficients exploits the nonlinearity of human hearing and is based on the short-time spectrum of the sound. Based on these two considerations, the error function of the Mel frequency cepstral coefficients consists of two main quantities: the Mel estimated spectrum $H(\omega)$ and the modified periodogram $I_N(\omega)$. The Mel estimated spectrum $H(\omega)$ is defined as follows:
Information and Communication Technology and Smart Grid
978-1-935068-23-5 © 2010 SciRes.
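$$H(\omega) = \exp \sum_{n=0}^{M} c_n \tilde{z}^{-n},$$

where

$$\tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}, \qquad z = e^{j\omega}.$$

Here $\alpha$ is the Mel frequency transform coefficient, $M$ is the estimation order of the Mel cepstrum, and $c_n$ are the Mel frequency cepstral coefficients. For a frame of the speech signal $x(n)$, $n = 0, 1, \ldots, N-1$, the modified periodogram of the signal is defined as

$$I_N(\omega) = \frac{\left| \sum_{n=0}^{N-1} w(n)\, x(n)\, e^{-j\omega n} \right|^2}{\sum_{n=0}^{N-1} w(n)^2},$$

where $w(n)$ is the window function. The estimation error function is defined by the following formula, and the optimal Mel frequency cepstral coefficients are obtained by solving the corresponding minimization problem:

$$E = \frac{1}{2\pi} \int_0^{2\pi} \left[ \frac{I_N(\omega)}{|H(\omega)|^2} - \log I_N(\omega) + \log |H(\omega)|^2 - 1 \right] d\omega$$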
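The three quantities above can be checked numerically. The following is a minimal Python sketch, not the authors' implementation. It assumes that $H(\omega) = \exp \sum_n c_n \tilde{z}^{-n}$ with the first-order all-pass warping $\tilde{z}^{-1} = (z^{-1} - \alpha)/(1 - \alpha z^{-1})$, that the modified periodogram is the windowed spectrum power divided by the window energy, and that the error integral over $[0, 2\pi)$ can be approximated by a Riemann sum on a uniform grid. The function names, the rectangular window, the value of `alpha`, and the toy frame are all hypothetical.

```python
import cmath
import math


def mel_spectrum(c, alpha, omega):
    """H(omega) = exp(sum_n c[n] * zt^n), where zt is the warped delay
    zt = (z^-1 - alpha) / (1 - alpha * z^-1) and z = e^{j omega}."""
    z_inv = cmath.exp(-1j * omega)
    zt = (z_inv - alpha) / (1.0 - alpha * z_inv)
    return cmath.exp(sum(cn * zt ** n for n, cn in enumerate(c)))


def modified_periodogram(x, w, omega):
    """I_N(omega) = |sum_n w(n) x(n) e^{-j omega n}|^2 / sum_n w(n)^2."""
    spectrum = sum(wn * xn * cmath.exp(-1j * omega * n)
                   for n, (wn, xn) in enumerate(zip(w, x)))
    return abs(spectrum) ** 2 / sum(wn * wn for wn in w)


def spectral_error(x, w, c, alpha, grid=256):
    """Riemann-sum approximation of the error integral over [0, 2*pi).

    Assumes I_N(omega) > 0 at every grid point (log of zero is undefined).
    """
    total = 0.0
    for k in range(grid):
        omega = 2.0 * math.pi * k / grid
        i_n = modified_periodogram(x, w, omega)
        h2 = abs(mel_spectrum(c, alpha, omega)) ** 2
        total += i_n / h2 - math.log(i_n) + math.log(h2) - 1.0
    # (1 / 2*pi) * sum * (2*pi / grid) simplifies to sum / grid
    return total / grid


frame = [1.0, 0.5, 0.25]    # toy frame x(n), hypothetical values
window = [1.0, 1.0, 1.0]    # rectangular window w(n), an assumption
print(spectral_error(frame, window, c=[0.0], alpha=0.35))
```

Each term of the integrand has the form $e^{R} - R - 1$ with $R = \log I_N(\omega) - \log |H(\omega)|^2$, so the error is never negative and vanishes only where the estimated spectrum matches the periodogram.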
Psychological perception experiments show that pure tone signals of different frequencies and different loudness contribute differently to human hearing. In the estimation error function, the errors of different frequency components should therefore carry different weights. The weighted estimation error function based on the Mel cepstrum is defined as:

$$E_W = \frac{1}{2\pi} \int_0^{2\pi} \left[ \frac{I_N(\omega)}{|H(\omega)|^2} - \log I_N(\omega) + \log |H(\omega)|^2 - 1 \right] W(\omega)\, d\omega$$

$W(\omega)$ is the non-negative perceptual weighting function. Note that when $W(\omega)$ is a constant function, weighted Mel cepstrum analysis degenerates to ordinary Mel cepstrum analysis.
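The effect of the weighting can be illustrated on sampled spectra. The sketch below is illustrative only: the name `weighted_error`, the sample values, and the Riemann-sum approximation on a uniform grid over $[0, 2\pi)$ are assumptions. The integrand is taken to be $I_N/|H|^2 - \log I_N + \log |H|^2 - 1$ multiplied by $W(\omega)$.

```python
import math


def weighted_error(i_n, h2, weight):
    """Riemann-sum form of the weighted error on a uniform grid over [0, 2*pi).

    i_n, h2 and weight hold samples of I_N(omega), |H(omega)|^2 and W(omega)
    taken at the same frequencies; all three lists must have equal length.
    """
    if not (len(i_n) == len(h2) == len(weight)):
        raise ValueError("all sample lists must have the same length")
    total = 0.0
    for i_k, h_k, w_k in zip(i_n, h2, weight):
        ratio = i_k / h_k
        # ratio - log(ratio) - 1 equals I/|H|^2 - log I + log |H|^2 - 1
        total += (ratio - math.log(ratio) - 1.0) * w_k
    return total / len(i_n)


periodogram = [1.0, 2.0, 0.5, 1.0]   # hypothetical samples of I_N
model = [1.0, 1.0, 1.0, 1.0]         # hypothetical samples of |H|^2
flat = weighted_error(periodogram, model, [1.0, 1.0, 1.0, 1.0])
emphasised = weighted_error(periodogram, model, [1.0, 3.0, 1.0, 1.0])
print(flat, emphasised)
```

With a constant weight of 1 the function returns the unweighted error, and raising the weight at a frequency where the model and the periodogram disagree raises the error, so a minimizer is pushed to fit that frequency more closely.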
2.2. Choice of Perceptual Weighting Function
The suitability of the perceptual weighting function $W(\omega)$ directly affects how well the weighted Mel frequency cepstral coefficients describe spectral characteristics. Psychoacoustic models show that the relationship between human auditory sensitivity and sound frequency is nonlinear. The absolute hearing threshold, the critical bandwidth, and the masking effect are also related to it. In the psychoacoustic model, the signal-to-mask ratio (SMR) quantitatively reflects how much each frequency component contributes to human hearing and how sensitive hearing is to the estimation error arising from that component. We therefore use the SMR output by the psychoacoustic model as the auditory perceptual weight. However, the SMR output by the model is discrete, so it must be interpolated to obtain a continuous perceptual weighting function $W(\omega)$. The initial weight function is then determined from the interpolated continuous function and is denoted by $L(\omega)$, $0 \le \omega \le \pi$. The perceptual weight function is further defined as
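[Equation not recoverable from the extracted text: a piecewise definition of $W(\omega)$ on $0 \le \omega \le 2\pi$ in terms of $L(\omega)$ and $L(2\pi - \omega)$, each normalized by an integral of $L(\omega)$.]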
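One way to build a continuous weight from discrete SMR values is sketched below. The specifics are assumptions that the source does not confirm: piecewise-linear interpolation, initial weights that are already non-negative, mirroring about $\pi$ to cover $(\pi, 2\pi)$, and normalization to unit mean so that a flat SMR yields $W(\omega) = 1$. The names `interpolate_smr` and `perceptual_weights` are hypothetical.

```python
import math


def interpolate_smr(freqs, smr, omega):
    """Piecewise-linear interpolation of discrete values.

    freqs must be strictly increasing and lie in [0, pi]; values outside the
    covered range are held constant at the nearest end point.
    """
    if omega <= freqs[0]:
        return smr[0]
    for k in range(1, len(freqs)):
        if omega <= freqs[k]:
            t = (omega - freqs[k - 1]) / (freqs[k] - freqs[k - 1])
            return smr[k - 1] + t * (smr[k] - smr[k - 1])
    return smr[-1]


def perceptual_weights(freqs, smr, grid=64):
    """Sample a weight function on a uniform grid over [0, 2*pi).

    Frequencies above pi are folded back to 2*pi - omega (assumed symmetry),
    and the samples are scaled so that their mean is 1 (assumed normalization).
    """
    raw = []
    for k in range(grid):
        omega = 2.0 * math.pi * k / grid
        folded = omega if omega <= math.pi else 2.0 * math.pi - omega
        raw.append(interpolate_smr(freqs, smr, folded))
    mean = sum(raw) / grid
    if min(raw) < 0.0 or mean == 0.0:
        raise ValueError("initial weights must be non-negative and not all zero")
    return [value / mean for value in raw]


# Hypothetical initial weights at three frequencies between 0 and pi.
weights = perceptual_weights([0.0, math.pi / 2.0, math.pi], [1.0, 4.0, 2.0], grid=8)
print(weights)
```

Normalizing to unit mean keeps the weighted error on the same scale as the unweighted one, which makes the constant-weight case directly comparable.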
2.3. Dynamic Time Warping Technique
Speech signals are highly random: even when the same person utters the same sound at different times, the durations are never exactly the same, so time alignment is essential. The most mature technique for this problem is dynamic time warping (DTW). DTW is a nonlinear warping technique that combines time alignment with distance measure calculation, and it is also a successful matching algorithm in speech recognition.

Using dynamic programming (DP), dynamic time warping converts a complex global optimization problem into many local optimization problems and makes the decision step by step. Suppose the feature vector sequence of the reference template is $X = \{x_1, x_2, \ldots, x_I\}$ and the feature vector sequence of the input speech is $Y = \{y_1, y_2, \ldots, y_J\}$, with $I \ne J$. The process is shown in Figure 1.
The DTW algorithm finds an optimal time warping function that minimizes the total cumulative distortion and maps the timeline $j$ of the unknown template non-linearly onto the timeline $i$ of the reference template.

Figure 1. Dynamic Time Warping Process (axes: reference template timeline i and tested voice timeline j; the warping path runs from Start to End)

Suppose the time warping function is
$$C = \{c(1), c(2), \ldots, c(N)\},$$

where $N$ is the path length and $c(n) = (i(n), j(n))$ denotes the $n$-th matching point, formed by the $i(n)$-th feature vector of the reference template and the $j(n)$-th feature vector of the tested template. The distance (or distortion value) $d(x_{i(n)}, y_{j(n)})$ between these two vectors is called the partial matching distance. Using local optimization, the DTW algorithm finds the minimum weighted sum of distances, that is

$$D = \min_{C} \frac{\sum_{n=1}^{N} d(x_{i(n)}, y_{j(n)})\, W_n}{\sum_{n=1}^{N} W_n}$$
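The dynamic programming recursion behind this minimization can be sketched as follows. This is a minimal illustration rather than the paper's exact method: it assumes the classical symmetric step weights (1 for a horizontal or vertical step, 2 for a diagonal step), for which the weights sum to $I + J$ on every path, so the normalization can be applied once after the recursion. SMR-based weighting is not modeled, and the names `euclidean` and `dtw_distance` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dtw_distance(ref, test, dist=euclidean):
    """Normalized DTW distance between two sequences of feature vectors.

    g[i][j] is the minimum weighted cumulative distance of any path that
    ends by matching ref[i - 1] with test[j - 1].
    """
    n_ref, n_test = len(ref), len(test)
    g = [[math.inf] * (n_test + 1) for _ in range(n_ref + 1)]
    g[0][0] = 0.0
    for i in range(1, n_ref + 1):
        for j in range(1, n_test + 1):
            d = dist(ref[i - 1], test[j - 1])
            g[i][j] = min(g[i - 1][j] + d,            # vertical step, weight 1
                          g[i][j - 1] + d,            # horizontal step, weight 1
                          g[i - 1][j - 1] + 2.0 * d)  # diagonal step, weight 2
    # With these weights the denominator is the same for every path.
    return g[n_ref][n_test] / (n_ref + n_test)


slow = [[1.0], [2.0], [2.0], [3.0]]   # hypothetical one-dimensional features
fast = [[1.0], [2.0], [3.0]]
print(dtw_distance(slow, fast))
```

Because the denominator does not depend on the path under this weighting, minimizing the numerator by dynamic programming also minimizes the ratio. With path-dependent weight sums that shortcut would no longer be exact.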
In this formula, the selection of the weighting function $W_n$ should satisfy two prerequisites:
First, based on the local path step leading to the $n$-th matching point, penalize local paths in the 45-degree direction, to adapt to the situation $I \ne J$;
Second, give different weights to the various parts of the speech, to strengthen certain distinctive features.
In addition, this paper selects the weighting function according to a third condition: $W_n$ is determined nonlinearly by the SMR values output by the psychoacoustic model, which correspond to the perceptual weighting function $W(\omega)$ of the weighted Mel frequency cepstral coefficients. This condition links the frame-level SMR used in the psychoacoustic model with DTW.
Table 1. Recognition rate comparison

Number of words to be recognized    MFCC recognition rate    WFCC recognition rate
1                                   94.7%                    97.3%
2                                   95.3%                    97.5%
3                                   96.6%                    96.4%
4 (Idiom)                           99.2%                    99.1%
3. Experimental Analysis
The experimental feature database consists of the voices of 10 male and 10 female speakers, recorded with the recording software of a Windows PC in an environment with low-intensity noise. The experiment was divided into two processes, training and recognition. In the training process, each speaker recorded the words at a rate slower than normal speech; the recordings were saved as the original templates, and their WFCC features were extracted and archived. In the recognition process, the words were recorded at a normal speech rate, their WFCC features were extracted, and the features were matched against the templates using the DTW algorithm. The recognition rates were also compared with those obtained with MFCC; the statistics are given in Table 1.
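The training and recognition procedure amounts to nearest-template matching, which can be sketched as follows. This is an illustration under stated assumptions: scalar toy features stand in for WFCC vectors, the DTW uses classical symmetric step weights rather than the weighting described in Section 2.3, and the names `dtw`, `recognize`, and `recognition_rate` as well as the word labels are hypothetical.

```python
import math


def dtw(ref, test):
    """Normalized DTW distance between two sequences of scalar features."""
    rows, cols = len(ref), len(test)
    g = [[math.inf] * (cols + 1) for _ in range(rows + 1)]
    g[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = abs(ref[i - 1] - test[j - 1])
            g[i][j] = min(g[i - 1][j] + d, g[i][j - 1] + d,
                          g[i - 1][j - 1] + 2.0 * d)
    return g[rows][cols] / (rows + cols)


def recognize(templates, features):
    """Return the label of the stored template closest to the input."""
    return min(templates, key=lambda label: dtw(templates[label], features))


def recognition_rate(templates, labelled_inputs):
    """Fraction of (features, label) pairs that are recognized correctly."""
    hits = sum(1 for features, label in labelled_inputs
               if recognize(templates, features) == label)
    return hits / len(labelled_inputs)


# Templates recorded slowly (longer), inputs at normal rate (shorter).
templates = {"yes": [1.0, 3.0, 1.0, 1.0], "no": [2.0, 2.0, 5.0, 5.0]}
inputs = [([1.0, 3.0, 1.0], "yes"), ([2.0, 5.0], "no")]
print(recognition_rate(templates, inputs))
```

The length mismatch between the slow templates and the normal-rate inputs is exactly what the warping absorbs, which is why the templates and inputs need not be resampled to a common length.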
From Table 1 we can draw the following conclusion: compared with MFCC, WFCC improves the recognition rate for one-word and two-word items. For three-word items and four-word idioms, however, the two recognition rates are almost the same. The DTW recognition algorithm combined with the improved Mel frequency cepstral coefficients therefore achieves a higher recognition rate for one-word and two-word items.
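The size of each change can be read off Table 1 with a few lines of arithmetic. The snippet below only restates the published rates; the names `TABLE_1` and `rate_changes` are hypothetical.

```python
# Recognition rates in percent, copied from Table 1:
# number of words -> (MFCC rate, WFCC rate)
TABLE_1 = {1: (94.7, 97.3), 2: (95.3, 97.5), 3: (96.6, 96.4), 4: (99.2, 99.1)}


def rate_changes(table):
    """WFCC rate minus MFCC rate in percentage points, rounded to one decimal."""
    return {words: round(wfcc - mfcc, 1) for words, (mfcc, wfcc) in table.items()}


for words, change in rate_changes(TABLE_1).items():
    print(f"{words} word(s): {change:+.1f} percentage points")
```

The gains for one and two words are a few percentage points, while the differences for three and four words are a small fraction of a point in the other direction.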
4. Conclusions
Starting from the usual use of MFCC with DTW, this paper improves the MFCC-based algorithm for isolated word recognition and links the two algorithms organically through the selection of the weights. The experimental data show that speech recognition combining the improved Mel frequency cepstral coefficients with DTW is more robust than MFCC-based speech recognition algorithms.