Japan Advanced Institute of Science and Technology
JAIST Repository: https://dspace.jaist.ac.jp/
Title: Estimation of fundamental frequency of reverberant speech by utilizing complex cepstrum analysis
Author(s): Unoki, Masashi; Hosorogiya, Toshihiro
Citation: Journal of Signal Processing, 12(1): 31-44
Issue Date: 2008-01
Type: Journal Article
Text version: author
URL: http://hdl.handle.net/10119/7755
Rights: Copyright (C) 2008 信号処理学会. Masashi Unoki and Toshihiro Hosorogiya, Journal of Signal Processing, 12(1), 2008, 31-44.



PAPER

Estimation of fundamental frequency of reverberant speech by

utilizing complex cepstrum analysis

Masashi Unoki and Toshihiro Hosorogiya

School of Information Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan

E-mail: {unoki, t-hosoro}@jaist.ac.jp

Abstract This paper reports comparative evaluations of twelve typical methods of estimating fundamental frequency (F0) over huge speech-sound datasets in artificial reverberant environments. They involve several classic algorithms, such as Cepstrum, AMDF, LPC, and modified autocorrelation algorithms. Other methods involve a few modern instantaneous amplitude- and/or frequency-based algorithms, such as STRAIGHT-TEMPO, IFHC, and PHIA. The comparative results revealed that the percent correct rates and SNRs of the estimated F0s were reduced drastically as reverberation time increased. They also demonstrated that homomorphic (complex cepstrum) analysis and the concept of the source-filter model were relatively effective for estimating F0 from reverberant speech. This paper thus proposes a new method of robustly and accurately estimating F0s in reverberant environments by utilizing the modulation transfer function (MTF) concept and the source-filter model in complex cepstrum analysis. The MTF concept is used in this method to eliminate dominant reverberant characteristics from the observed reverberant speech. The source-filter model (liftering) is used to extract source information from the processed cepstrum. Finally, F0s are estimated from them by using the comb-filtering method. An additional comparative evaluation of the new approach against the other typical methods demonstrated that it was better than the previously reported techniques in terms of robustness and accuracy of F0 estimates in reverberant environments.

Keywords: F0 estimation, reverberant speech, complex cepstrum analysis, MTF concept, source-filter model

1. Introduction

The fundamental frequency (F0) as well as the fundamental period (T0) of speech can be utilized as significant features to represent the source information (glottal waveform or vocal-fold vibrations) of speech sounds in various speech-signal processes, for example, in speech analysis/synthesis systems, automatic speech recognition (ASR) systems, and speech-emphasis methods. In particular, a robust and accurate F0 can generally be used as a powerful cue to reduce the noise component in noisy speech and/or to remove the reverberation effect in reverberant speech. Therefore, robustly and accurately estimating the F0 of target speech in real environments, which amounts to extracting the F0 of the underlying noiseless speech, is a particularly important issue in these applications.

Many studies on extracting or estimating the F0 of target speech have been reported in the literature on speech-signal processing, and numerous methods have been proposed [1, 2, 3] over the last half-century. The traditional extraction/estimation methods can be divided into processing in the time domain, the frequency domain, or both. Most of these have made use of the periodic features of speech in the time domain (zero-crossing [4, 5], periodogram [6], peak-picking [4, 7], autocorrelation [4, 8], the average magnitude difference function (AMDF) [9], and maximum likelihood [10]) or harmonic features in the frequency domain (comb filtering [11, 12], autocorrelation [13], sub-harmonic summation (SHS) [14], and cepstrum [15]).

The aim of all these methods has been to extract the periodicity or harmonicity of source information from the observed speech. However, this still seems to be incompletely resolved because three main issues remain, i.e., (1) observability: the observed speech is an emitted sound passing through the mouth/nose so


that it is impossible to directly observe glottal vibrations from it without eliminating the effects of the vocal tract, (2) flexibility and irregularity: glottal vibrations are not completely periodic signals and the range of variation in their periods is relatively wide, and (3) robustness: the observed speech signals are affected by noise and reverberation so that the significant features for estimating F0 are also smeared.

Most studies have focused on the first two issues, implicitly assuming that all speech signals are observed in clean environments or that all observations are noiseless speech sounds. Various methods of estimating F0 have been proposed under this assumption to solve the first issue by suppressing the effects of the filter characteristics (vocal tract), based on the source-filter model, in the observed speech sounds. For example, typical approaches based on this idea have been homomorphic methods of analysis [15, 16] and linear prediction (LP)-based methods [17, 18, 19]. A few examples of inverse-filtering methods are lag-windowing [20] and simplified inverse filter tracking (SIFT) [21]. Center-clipping, band-limitation [22, 23], and multi-windowing [24] techniques have also been used in approaches based on the autocorrelation function.

A few approaches to precisely estimating the F0 of target noiseless speech have been established (e.g., STRAIGHT-TEMPO [25] and YIN [26]) by comparison with electro-glottal-graph (EGG) information. The stability of the instantaneous frequency of speech has been used in the STRAIGHT-TEMPO method (referred to as "TEMPO" hereafter) to accurately estimate F0s as significant features to resolve the first two issues. This method plays an important role in controlling "pitch"-related features in the STRAIGHT analysis/synthesis tools [27]. YIN, which combines autocorrelation functions and AMDF, has also been proposed to resolve these issues. It has been reported that both methods can estimate the F0 of target noiseless speech extremely precisely, so that the first two issues seem to have been resolved. However, it has not yet been clarified whether these methods can precisely estimate F0 in real environments. Hence, we need to investigate the last issue for realistic applications.

It is generally known that methods of estimating F0 using periodic and/or harmonic features (e.g., autocorrelation functions and comb filtering) are relatively robust against background noise, but that the estimated F0 is not very accurate [2, 28, 29]. It has also been reported that the comb-filtering-based method is more robust against background noise than the autocorrelation-based approach [29]. The cepstrum-based method is not as robust against background noise as either of these because it is based on homomorphic analysis, in which noise components are not clearly separated in the quefrency domain [29].

The time-frequency representation of speech obtained by time-frequency analysis can also adequately represent the periodic/harmonic components of speech. The instantaneous amplitude (IA) of speech signals has fine harmonic features that are robust against background noise, so comb-filtering of the instantaneous amplitude has been proposed [30] to construct a sound-segregation model. The instantaneous frequency (IF) of speech has also been used to accurately estimate F0s, but its stability as used in TEMPO is sensitive to noise. More robust methods using instantaneous frequency have been proposed by using bandwidth equations relating instantaneous amplitude and frequency with harmonicity [28, 31]. Other robust techniques using instantaneous amplitude- and frequency-related approaches have been proposed by using periodicity and harmonicity [29]. It has been reported that these are more robust than TEMPO and can precisely estimate the F0 in noisy environments.

All these methods have focused on noiseless to noisy conditions for estimating sufficiently accurate F0s of target speech. Thus, methods using instantaneous amplitude and frequency, or those with features robust against noise such as periodicity and harmonicity, have been regarded as able to accurately estimate F0s from noisy speech. The last issue seems to have been solved at this time; however, there have been no studies on robustness in reverberant environments.

It can easily be predicted that no typical method will work as well there and that the percent correct rates for F0 will be drastically reduced as reverberation time increases. If our prediction is correct, the last issue has not yet been completely solved and needs to be considered in reverberant environments and in noisy reverberant environments. We evaluated traditional methods of estimating F0 in terms of robustness and accuracy in reverberant environments to investigate this issue, and we discuss them in this paper. We then propose a method of estimating F0 from reverberant speech without measuring the impulse response of the room acoustics (i.e., a blind method of estimating F0) by taking the characteristics of reverberation into consideration.

This paper is organized as follows. Section 2 describes the mathematical setup and then defines the problem of estimating F0 from reverberant speech. We discuss our evaluations of the most typical methods of estimating F0 in reverberant environments in Section 3 and investigate what the best model is. Section 4 introduces complex cepstrum analysis and investigates what the significant features for robust estimates are. We then introduce the model concept (complex cepstrum analysis, the modulation transfer function (MTF) concept, and the source-filter model). We finally propose a method of estimating F0 in reverberant environments. We discuss our evaluations of the proposed method in Section 5 by comparing it with


other methods using the same simulations. Section 6 gives our conclusions and perspectives regarding future work.

2. Mathematical setup

2.1 Signal representation and STFT

A time-varying harmonic signal, x(t), can be represented as the analytic signal:

x(t) = Σ_{k∈K} a_k(t) exp(j(ω_k(t)t + θ_k(t)))   (1)

where a_k(t) is the instantaneous amplitude and θ_k(t) is the phase. Here, k denotes the harmonic index and K is the number of harmonics, so that ω_k(t) can be expressed as 2πkF0(t). The fundamental frequency F0(t) is an instantaneous frequency, so it should be extracted from x(t) using instantaneous cues.

The short-term Fourier transform (STFT) is usually used [32] to analyze x(t) in any given short-term segment (windowing processing):

X(ω, τ) = ∫ x(t) w(t − τ) exp(−jωt) dt   (2)
        = A(ω, τ) exp(jφ(ω, τ))   (3)
A(ω, τ) = |X(ω, τ)|   (4)
φ(ω, τ) = arctan(Im[X(ω, τ)] / Re[X(ω, τ)])   (5)

where w(t) is a window function and a short-term signal, x(t, τ), is defined as w(t − τ)x(t) for mathematical convenience. A(ω, τ) is the amplitude spectrum and φ(ω, τ) is the phase spectrum of X(ω, τ).

The task of extracting/estimating the fundamental frequency F0(t) in this formulation is, therefore, to estimate the F0 in each short-term segment using the harmonicity of X(ω, τ), or to estimate the segmental T0 = 1/F0 by using the periodicity of x(t, τ). Thus, traditional methods based on waveform processing (e.g., zero-crossing [4, 5], periodograms [6], peak-picking [4, 7], autocorrelation [4, 8], AMDF [9], maximum likelihood [10], STFT-based processes, and sub-harmonic summation (SHS) [14]) estimate F0(t) from x(t, τ) or X(ω, τ) by using periodicity or harmonicity.
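As a toy illustration of the periodicity-based family above (not any of the cited implementations; the frame length, F0 search range, and synthetic test signal are assumptions), a short-term autocorrelation picker for T0 can be sketched as:

```python
import numpy as np

def estimate_f0_autocorr(frame, fs, f0_min=60.0, f0_max=800.0):
    """Estimate F0 of one short-term frame x(t, tau) from its autocorrelation peak."""
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f0_max)              # shortest admissible period T0 (samples)
    lag_max = int(fs / f0_min)              # longest admissible period T0 (samples)
    lag = lag_min + np.argmax(r[lag_min:lag_max])
    return fs / lag                         # F0 = 1 / T0

fs = 16000
t = np.arange(int(0.040 * fs)) / fs         # one 40-ms frame
x = sum(np.cos(2 * np.pi * 200 * k * t) / k for k in (1, 2, 3))
f0 = estimate_f0_autocorr(x * np.hanning(len(x)), fs)
print(f0)                                   # close to 200 Hz
```

The window taper slightly biases the autocorrelation peak location, which is one reason the multi-windowing and clipping refinements cited above exist.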

2.2 Source-filter model

The source-filter model is a well-known concept for separately representing the glottal (source information) and vocal-tract (filter information) characteristics of speech production. Based on this concept, the observed clean speech signal, x(t), can be represented as

x(t) = e(t) ∗ v_τ(t)   (6)

where e(t) is the source signal related to glottal information and v_τ(t) is the impulse response of the filter

Fig. 1: Separated representations of source and filter characteristics in the quefrency domain. [Figure: amplitude cepstrum C_A(q, τ) vs. quefrency q; the cepstrum component of the filter characteristics (vocal tract), C_flt(q, τ), occupies the low quefrencies, the cepstrum component of the source (glottal vibration), C_src(q, τ), the higher quefrencies; a liftering window l(q) separates them.]

related to the vocal tract at time τ. The asterisk "∗" denotes convolution. Note that the emission effect has been omitted from this formulation. Thus, Eq. (2) can also be represented as

X(ω, τ) = S(ω, τ) · V (ω, τ) (7)

where S(ω, τ) is the STFT of s(t, τ) = w(t − τ)e(t) and V(ω, τ) is that of v(t, τ) = v_τ(t). V(ω, τ) represents the filter characteristics, so the separation of V(ω, τ) is usually exploited to estimate F0(t) from X(ω, τ). Some traditional estimation methods are inverse filtering with V⁻¹(ω, τ) [21], whitening of X(ω, τ) using |V(ω, τ)| (or lag windowing) [20], and subtraction in logarithmic processing, log X(ω, τ) = log S(ω, τ) + log V(ω, τ) [22, 23].

The linear prediction (LP) method is also one of the most powerful techniques for analyzing speech signals. The LP coefficients carry the filter characteristics (all-pole type) and the LP residue carries the source information. The LP coefficients of x(t, τ) can thus be used for inverse filtering V⁻¹(ω, τ) in the source-filter model [17, 21]. The LP residue can also be used as a short-term signal s(t, τ) [19]. Waveform processing and AMDF have also been incorporated [18].

2.3 Cepstrum representation

The cepstrum is also a well-known method of homomorphic analysis. The complex cepstrum of X(ω, τ) in Eq. (2) can be represented as

C(q, τ) = F⁻¹[log{|X(ω, τ)| exp(jφ(ω, τ))}]
        = F⁻¹[log A(ω, τ)] + F⁻¹[jφ(ω, τ)]
        = C_A(q, τ) + C_φ(q, τ)   (8)

where F⁻¹[·] is the inverse Fourier transform, C_A(q, τ) is the amplitude cepstrum, C_φ(q, τ) is the phase cepstrum of C(q, τ), and q denotes the quefrency (time domain). The complex cepstrum of X(ω, τ) in Eq. (7)


can also be represented as

C(q, τ) = F⁻¹[log S(ω, τ)] + F⁻¹[log V(ω, τ)] = C_src(q, τ) + C_flt(q, τ)   (9)

where C_src(q, τ) is the complex cepstrum of the source S(ω, τ) and C_flt(q, τ) is that of the filter V(ω, τ).

The amplitude cepstrum, C_A(q, τ), is generally used in the traditional method, so that C_A,src(q, τ) and C_A,flt(q, τ) are separately used for estimating F0(t) from C_A(q, τ). Figure 1 outlines the concept underlying the source-filter model in the quefrency domain. C_A,flt(q, τ) represents the dominant spectral envelope of X(ω, τ) (lower Fourier components in the quefrency domain), so it is compactly located at the lower quefrencies. In contrast, C_A,src(q, τ) represents the dominant fine structure of X(ω, τ), so it is compactly located at the higher quefrencies. Therefore, the task of estimating F0 with this concept is to find the dominant quefrency of C_A,src(q, τ), or to detect periodicity or harmonicity from C_A,src(q, τ) after eliminating C_A,flt(q, τ) from C_A(q, τ). The latter processing is referred to as "liftering". Typical approaches are Noll's original method [15] and his clipstrum method [16].
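A minimal sketch of this liftering idea (real amplitude cepstrum only, in the spirit of Noll's approach; the 3-ms cut-off, search range, and harmonic test tone are assumptions, not the paper's exact settings):

```python
import numpy as np

def cepstrum_f0(frame, fs, lifter_ms=3.0, f0_min=60.0, f0_max=800.0):
    """Pick F0 from the high-quefrency (source) part of the amplitude cepstrum."""
    spec = np.fft.fft(frame * np.hanning(len(frame)))
    c_a = np.fft.ifft(np.log(np.abs(spec) + 1e-12)).real   # amplitude cepstrum C_A(q)
    q_cut = int(lifter_ms * 1e-3 * fs)        # liftering: discard low-quefrency C_A,flt
    q_min = max(q_cut, int(fs / f0_max))
    q_max = int(fs / f0_min)
    q0 = q_min + np.argmax(c_a[q_min:q_max])  # dominant quefrency = T0 in samples
    return fs / q0

fs = 16000
t = np.arange(int(0.040 * fs)) / fs
x = sum(np.cos(2 * np.pi * 125 * k * t) for k in range(1, 8))  # 125-Hz harmonic tone
f0 = cepstrum_f0(x, fs)
print(f0)                                     # close to 125 Hz
```

The peak of the liftered cepstrum sits at the rahmonic q0 = T0 in samples, so F0 follows as fs/q0.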

2.4 Problem with estimating F0

The task of estimating F0 in reverberant environments is to extract F0(t) from the reverberant speech signal y(t) or its STFT Y(ω, τ):

y(t) = x(t) ∗ h(t) = e(t) ∗ v_τ(t) ∗ h(t)   (10)
Y(ω, τ) = S(ω, τ) V(ω, τ) H(ω, τ)   (11)

where h(t) is the impulse response and H(ω, τ) is the STFT of h(t) of the room acoustics (reverberation). Note that H(ω, τ) is actually required to represent all the characteristics (H(ω) = H(ω, τ)) by using a long-term Fourier transform (LTFT), so that the length of analysis (at each τ) should be longer than the reverberation time.

The task of estimating F0 in reverberant environments is thus to select periodicity and harmonicity from the convolved source signal, e(t), while that in noisy environments is to select them from the noisy (additive) source signal, e(t). If h(t) is a simplified echo or a minimum-phase impulse response, the cepstrum-based method can be used to adequately estimate F0 from the reverberant speech signal, y(t), because homomorphic analysis is a powerful tool for dealing with simplified echoes. Realistic impulse responses in room acoustics, however, generally have non-minimum-phase characteristics, and we therefore predicted that estimating F0 robustly and accurately would be more difficult than in noisy environments.

3. Evaluation of typical methods

3.1 Typical methods of estimating F0

Many methods of estimating F0 have been proposed in the literature on speech signal processing, as described in Section 1. The most comprehensive review remains that by Hess (1983) [1], and more recent reviews are those by Hess (1992) and de Cheveigné and Kawahara (2001) [2, 3]. A few examples of recent approaches are instantaneous-amplitude [30], instantaneous-frequency [28, 31], and fundamental-wave-filtering [33] methods. There are also comparative evaluations in Atake et al. (2000), Ishimoto et al. (2001), and Nakatani and Irino (2004) [2, 3, 28, 29, 31].

We evaluated twelve typical methods to investigate how robust estimates of F0 were in reverberant environments: ACMWL (AutoCorrelation through Multiple Window Lengths) [24], AMDF [9], STFT-ACorrLog (AutoCorrelation of the Log-amplitude spectrum on STFT) [22, 23, 13], STFT-ACorrLag (Lag windowing on STFT) [20], STFT-Comb (Comb filtering on STFT) [11, 12], SHS [14], Cepstrum [15], LPC-residue [19], VFWFF (Voice Fundamental Wave Filtering, Feed-forward type) [33], TEMPO [25], IFHC (Instantaneous Frequency of Harmonic Components) [28], and PHIA (Periodicity/Harmonicity using Instantaneous Amplitude) [29]. Although other methods have been proposed, we chose these twelve because they are commonly used in comparative evaluations and the others are merely modifications or heavy revisions of them.

The characteristics, parameters, and detailed conditions of these algorithms as used in this comparative evaluation are listed in Table 1. TEMPO was used as the complete original version. As the other methods were implemented by the researchers on the basis of their original papers, they correspond to the first or original versions. Several parameters were set to obtain appropriate results based on preliminary simulations. For more detailed information on the technical implementations, please refer to the original papers.

3.2 Sound dataset and evaluation measures

The sound dataset we used in this evaluation was the speech database of simultaneous recordings of speech and EGG by Atake et al. [28]. This dataset consisted of 30 short Japanese sentences uttered by 14 males and 14 females with voiced-unvoiced labels (a total of 840 utterances, sampling frequency of 16 kHz, and 16-bit quantization).

The reverberant speech sentences we used were created by convolving the original signals, x(t), with the following reverberant impulse responses, h(t), as a


Table 1 Characteristics of typical methods of estimating F0 and their parameter settings

Algorithm | Domain | Features | Parameter settings
ACMWL [24] | time | x(t, τ) | nine window candidates (15 to 60 ms in 5-ms steps)
AMDF [9] | time | x(t, τ) | 40-ms window, 2-ms shift
STFT auto-corr. [22, 23, 13] | freq. | log |X(ω, τ)| | 40-ms window, 2-ms shift, 1.5-kHz band-limitation, averaged clipping
Lag windowing [20] | freq. | |S(ω, τ)| | 3-ms lag windowing, 1.5-kHz band-limitation, averaged clipping
Comb filtering [11, 12] | freq. | |X(ω, τ)| | 10th-order comb filter
SHS [14] | freq. | log |X(ω, τ)| | 40-ms window, 5-ms shift, N = 15 (harmonic order), BL = 1.25 kHz (band limitation), weight function h_n = 0.84^(n−1)
Cepstrum [15] | quef. | C_A(q, τ) | 40-ms window, 2-ms shift, 3-ms liftering, Noll's method (h_w = 0.54 + 0.46 cos(2πf/600))
LPC residue [19] | time | s(t, τ) | 40-ms window, 2-ms shift, LP order of 12, 1/4 down-sampling
F0 filtering [33] | time | s(t, τ) | 80-ms window, 2-ms shift, 1/4 down-sampling, averaged clipping, 1.5-kHz band-limitation, "non-self-support mode"
TEMPO [25] | freq. | fixed-point analysis of instantaneous frequency (IF) | 1-ms F0 shift length, nvo = 24, heuristic factor "on"
IFHC [28] | freq. | harmonicity of IFs | Nm = 3 (harmonic order), 1/8 down-sampling, adaptive windowing based on the BW equation
PHIA [29] | time/freq. | instantaneous amplitude (IA) and Dempster's law | 400-channel CBFB, 64-channel CQFB, 1/4 down-sampling, averaged clipping

Notes: (1) A Hanning window function was used in STFT-based processing. (2) The lower and upper limits of the estimated F0 were set at 60 Hz and 800 Hz.

function of the reverberation time.

h(t) = a exp(−6.9t / T_R) n(t)   (12)

a = [1 / ∫_0^T exp(−13.8t / T_R) dt]^(1/2)   (13)

where a is a gain factor that normalizes the power of h(t), T_R is the reverberation time, and n(t) is white noise. This is the well-known stochastically approximated impulse response in room acoustics reported by Schroeder [34, 35], in which the response has an envelope of exponential decay and a white-noise carrier. This formulation of the impulse response has been used in the study of speech intelligibility in room acoustics [36] as general artificial reverberation and thus has non-minimum-phase components [34, 35]. Six reverberation conditions (T_R = 0.0, 0.1, 0.3, 0.5, 1.0, and 2.0 s) were used in this study. There was a total of 5,040 stimuli.
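Eqs. (12)-(13) translate directly into code; the sketch below (the sampling rate, the response length T, and the discrete approximation of the normalization integral are assumptions) generates one such artificial response:

```python
import numpy as np

def schroeder_ir(t_r, fs, length_s=None, seed=0):
    """Exponentially decaying white-noise impulse response (Schroeder model, Eq. 12)."""
    length_s = 2.0 * t_r if length_s is None else length_s   # assumed response length T
    t = np.arange(int(length_s * fs)) / fs
    envelope = np.exp(-6.9 * t / t_r)                        # exponential-decay envelope
    # gain a of Eq. (13): normalizes the expected power of h(t) to unity
    a = 1.0 / np.sqrt(np.sum(np.exp(-13.8 * t / t_r)) / fs)
    rng = np.random.default_rng(seed)
    return a * envelope * rng.standard_normal(len(t))        # white-noise carrier n(t)

h = schroeder_ir(t_r=0.5, fs=16000)
energy = np.sum(h ** 2) / 16000    # approximates the integral of h(t)^2 dt
print(energy)                      # close to 1 by construction of the gain a
```

Convolving a clean utterance with such a response (e.g., `np.convolve(x, h)`) reproduces the reverberant stimuli described above.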

Fine F0 error and gross F0 error within the voiced section have been used as measures in some comparative evaluations in noisy environments [2, 28, 29, 31]. These have concentrated on error analysis. Since we concentrated on evaluating the robustness and accuracy of F0 estimates, we used two similar, but not identical, measures for evaluation. The first was the percent correct rate (%) and the second was the SNR (in dB).

Correct rate_E = (N_F0,Est(E) / N_F0,Ref(E)) × 100   (14)

SNR = 10 log10 [∫ F0,Ref(t)² dt / ∫ (F0,Ref(t) − F0,Est(t))² dt]   (15)

where F0,Ref(t) and F0,Est(t) are the reference F0 and the estimated F0, and the integrals are taken over the voiced section (t). N_F0,Est(E) is the size of the correct region that satisfies

|F0,Ref(t) − F0,Est(t)| / F0,Ref(t) × 100 ≤ E (%)

within the voiced section (t), where E is the error margin (%). N_F0,Ref(E) is the size of the region of F0,Ref(t) in the voiced section at the error margin E. In this paper, the F0 estimated by TEMPO from the EGG signal is used as the correct F0 (reference F0, F0,Ref(t)). F0,Est(t) is the F0 estimated with each of the twelve methods from the reverberant speech signals. Two values of E (error margins of 5% and 10%) were used for the percent correct rate.

Since the gross F0 error is the ratio of the number of frames giving "incorrect" F0 values to the total number of frames, the percent correct rate approximately indicates the gross F0 error. Since the fine F0 error is the normalized root mean square error between F0,Ref(t) and F0,Est(t), the SNR indicates a similar measure in dB.
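Frame-wise versions of the two measures can be sketched as follows (illustrative names; arrays of voiced-frame F0 values stand in for the integrals over t, and the test values are invented):

```python
import numpy as np

def correct_rate(f0_ref, f0_est, margin_pct):
    """Percent of voiced frames whose relative F0 error is within the margin E."""
    rel_err_pct = np.abs(f0_ref - f0_est) / f0_ref * 100.0
    return float(np.mean(rel_err_pct <= margin_pct) * 100.0)

def f0_snr_db(f0_ref, f0_est):
    """Reference-to-error power ratio in dB (a fine-F0-error-style measure)."""
    err = np.sum((f0_ref - f0_est) ** 2)
    return float(10.0 * np.log10(np.sum(f0_ref ** 2) / err))

f0_ref = np.array([200.0, 210.0, 220.0, 230.0])   # reference F0 per voiced frame
f0_est = np.array([201.0, 240.0, 219.0, 231.0])   # one gross error at frame 2
print(correct_rate(f0_ref, f0_est, margin_pct=5))  # 75.0
print(f0_snr_db(f0_ref, f0_est))
```

The single 30-Hz outlier drops the correct rate to 75% while barely moving the SNR, illustrating why both measures are reported.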

3.3 Results

Figure 2 plots the results of the comparative evaluations of the twelve typical methods of estimating F0 from reverberant speech as a function of the reverberation time, T_R. The left panels (a), (c), and (e) plot the results for the first six methods and the right panels (b), (d), and (f) plot them for the last six. The top panels plot the percent correct rates for F0 estimates within an error margin of 5% and the middle panels plot those within


Fig. 2: Estimation results: (a)-(b) percent correct rate within an error margin of 5%, (c)-(d) percent correct rate within an error margin of 10%, and (e)-(f) SNR (s: original, n: error between original and estimated F0) of F0 estimates from reverberant speech using the twelve typical methods, as a function of reverberation time T_R. [Left panels: Cepstrum, ACMWL, AMDF, STFT-ACorrLog, STFT-ACorrLag, STFT-Comb; right panels: SHS, VFWFF, LPC-residual, IFHC, PHIA, TEMPO.]


an error margin of 10%. The bottom panels plot the SNRs. The correct rates and SNRs of all twelve methods are drastically reduced as the reverberation time increases. The correct rates within the 5% error margin for all methods were less than 50% and the SNRs were less than about 15 dB, especially when the reverberation time T_R was 2.0 s. Moreover, the correct rates within the 10% error margin, as an approximate evaluation, were also less than 70%. We hence concluded that none of these methods provided robust and accurate F0 estimates and that they all had drawbacks in estimating F0 from reverberant speech.

However, in doing this evaluation we found a few clues for improving these methods. We can see from Fig. 2 that the cepstrum method is the most accurate, excluding the clean condition (T_R = 0.0). Cepstrum analysis is homomorphic, so it can deal with convolution as additive (subtractive) processing. Although the impulse responses we used in the evaluations did not have minimum-phase characteristics, the cepstrum method seemed to reduce the effect of reverberation on F0 estimation since it can treat a direct sound and a reflected sound as the same signal. Therefore, the cepstrum method has the potential to estimate F0 from reverberant speech if it is not affected too much by reverberation. The comb-filtering method is slightly more robust against reverberation, as we can see from Figs. 2(c) and (e). Maximization of matched harmonicity may have the effect of tracking stationary fluctuations in harmonics that are less affected by reverberation.

4. Proposed method

4.1 Complex cepstrum analysis

Let us review the results in Subsection 3.3 by reconsidering the complex cepstrum representation of reverberant speech y(t). From Eqs. (9)-(11), the complex cepstrum of y(t) can be represented as

C_Y(q, τ) = C_src(q, τ) + C_flt(q, τ) + C_H(q, τ)   (16)

where C_H(q, τ) is the complex cepstrum of the reverberant impulse response, h(t). These cepstra can also be represented in terms of amplitude and phase cepstra (denoted by the subscripts "A" and "φ").

Complex cepstrum analysis, on the other hand, is usually used to separate minimum-phase and non-minimum-phase characteristics. The complex cepstrum, C(q, τ), can also be separately represented as

C(q, τ) = C_min(q, τ) + C_all(q, τ)
        = C_A,min(q, τ) + C_φ,min(q, τ) + C_A,all(q, τ) + C_φ,all(q, τ)   (17)

where the subscripts "min" and "all" indicate the minimum-phase and all-pass (non-minimum-phase) characteristics. Here,

the respective spectra can be represented as

X(ω, τ) = X_min(ω, τ) · X_all(ω, τ)
        = |X_min(ω, τ)| exp(jφ_min(ω, τ)) × |X_all(ω, τ)| exp(jφ_all(ω, τ))   (18)

where |X_all(ω, τ)| = 1 and C_A,all(q, τ) = 0. Hence, C_Y(q, τ) can be separately represented as

C_Y,A,min(q, τ) + C_Y,φ,min(q, τ) + C_Y,φ,all(q, τ)
  = C_src,A,min(q, τ) + C_src,φ,min(q, τ) + C_src,φ,all(q, τ)
  + C_flt,A,min(q, τ) + C_flt,φ,min(q, τ) + C_flt,φ,all(q, τ)
  + C_H,A,min(q, τ) + C_H,φ,min(q, τ) + C_H,φ,all(q, τ)   (19)

Note that the amplitude cepstra of the all-pass phase characteristics have been omitted from this equation.

According to Eq. (16), an optimal F0 estimate requires extracting only C_src(q, τ) from C_Y(q, τ), so as to deal with the periodicity/harmonicity of the source information as the filter and reverberation characteristics are eliminated. It is, however, very difficult to deal only with C_src(q, τ) in this estimation task without measuring h(t) or C_H(q, τ) (i.e., blind F0 estimation). In addition, a long-term C_H(q, τ), in which the length of analysis is longer than the reverberation time, is needed to accurately extract C_src(q, τ).

We did a preliminary investigation, using Eq. (19), into which component, C_H,min(q, τ) or C_H,all(q, τ), most affected dealing with C_src(q, τ) for estimating F0. Figure 3 shows the estimation process for one of the reverberant speech signals (/Tokushima-To-Ieba-Awa-Odori-Ga-Yuumei-Desu/, female speaker, reverberation time T_R of 2.0 s) used in the evaluations. The speech signals (clean x(t) and reverberant y(t)) are shown in Figs. 3(a) and (b). The reference F0 (F0,Ref(t), by TEMPO from the EGG signal) and the F0 (F0,Est(t)) estimated by the cepstrum method from y(t) correspond to the dashed and solid lines in Fig. 3(c). Note that the F0,Ref(t) obtained by TEMPO lies completely within the voiced part because TEMPO has an unvoiced/voiced (U/V) decision. As can be seen, the estimated F0 was not close to the reference in the voiced part. This method, however, can be used to accurately estimate F0 from y(t) by eliminating the effect of h(t) from y(t) on the complex cepstrum in the long-term Fourier transform, as plotted in Fig. 3(d). At the same time, two comparative F0s were obtained, as plotted in Figs. 3(e) and (f), by estimating F0 from y(t) after eliminating the minimum-phase or the all-pass phase component from y(t).

From these comparisons, the all-pass phase component of the reverberant impulse response, h(t), appears to have the dominant effect on robust and accurate F0 estimates. Although the same comparisons for all the other stimuli are not presented in this


paper, the same trends were observed. Hence, we concluded that eliminating the all-pass phase characteristics of h(t) enables effective estimation of F0 from reverberant speech y(t). In addition, the cepstrum method with the all-pass component eliminated raises the possibility of achieving robust and accurate F0 estimates, since homomorphic analysis can easily deal with minimum-phase characteristics such as simplified echoes.

Fig. 3 Examples: (a) original speech x(t), (b) reverberant speech y(t) (reverberation time of 2.0 s), (c) reference F0 using TEMPO from the EGG of x(t) (dashed line) and estimated F0 using the cepstrum method from y(t) (solid line), (d) estimated F0 from y(t) dereverberated using h−1(t), (e) F0 from y(t) with the minimum-phase characteristics eliminated, and (f) F0 from y(t) with the all-pass phase characteristics eliminated

4.2 Estimates of h(t) based on MTF concept

The MTF concept was proposed by Houtgast and Steeneken [36] to account for the relation between the transfer function of an enclosure, in terms of the envelopes of the input and output signals (x(t) and y(t)), and the characteristics of the enclosure such as reverberation. The concept was introduced as a measure in room acoustics to assess the effect an enclosure has on the intelligibility of speech [36]. The complex modulation transfer function, m(ω), is defined as

m(\omega) = \frac{\int_0^{\infty} h(t)^2 \exp(j\omega t)\, dt}{\int_0^{\infty} h(t)^2\, dt}    (20)

where h(t) is the impulse response of the room acoustics and ω is the radian frequency. This equation means that the Fourier transform of the squared impulse response is divided by its total energy.

When the reverberant impulse response h(t) defined in Eq. (12) is substituted into the equation above, the MTF, m(ω), is obtained as

m(\omega) = |m(\omega)| = \left[ 1 + \left( \frac{\omega T_R}{13.8} \right)^2 \right]^{-1/2}    (21)

where TR is the reverberation time, i.e., the time required for the power of h(t) to decay by 60 dB [36]. Figure 4 plots the MTF, m(ω), as a function of the modulation frequency, Fm (i.e., the dominant frequency in the temporal envelope). These theoretical curves were calculated by substituting five reverberation times (TR = 0.1, 0.3, 0.5, 1.0, and 2.0 s) and ω = 2πFm into Eq. (21). Here, m(ω) can also be regarded as the modulation index with respect to Fm. These curves reveal how much the modulation index of the envelope is reduced from 1.0 toward 0.0, depending on TR, at a specific Fm. In other words, TR can be predicted from a specific m(2πFm) at a specific Fm. Therefore, the temporal envelope of the reverberant impulse response, a exp(−6.9t/TR), can also be predicted using TR and the “a” in Eq. (13).
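The theoretical curves in Fig. 4 can be reproduced directly from Eq. (21). The following sketch (plain NumPy; the function name is ours) evaluates m(2πFm) for the five reverberation times; for example, TR = 0.5 s and Fm = 10 Hz give m ≈ 0.402, the value quoted in the Appendix.

```python
import numpy as np

def mtf(fm, tr):
    """Theoretical MTF of Eq. (21) with omega = 2*pi*Fm:
    m = [1 + (omega * TR / 13.8)**2] ** (-1/2)."""
    omega = 2.0 * np.pi * np.asarray(fm, dtype=float)
    return 1.0 / np.sqrt(1.0 + (omega * tr / 13.8) ** 2)

# Modulation index at Fm = 10 Hz for the five reverberation times in Fig. 4
for tr in (0.1, 0.3, 0.5, 1.0, 2.0):
    print(f"TR = {tr:3.1f} s: m(2*pi*10) = {float(mtf(10.0, tr)):.3f}")
```

Because m(ω) decreases monotonically in both Fm and TR, reading a measured modulation index off one of these curves at a known Fm determines TR, which is exactly how the inverse-MTF prediction below proceeds.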

Fig. 4 Theoretical curves representing the modulation transfer function m(2πFm) for TR = 0.1, 0.3, 0.5, 1.0, and 2.0 s

Based on the MTF concept, we can establish how much reverberation reduces m(ω), and we can then predict the characteristics of the room acoustics (TR) using the inverse MTF. MTF-based power-envelope inverse-filtering methods, which aim to restore the reduced MTF in the temporal envelope of the signal, have been proposed by the present authors [37, 38]. A technique for predicting TR from the temporal envelope of the observed signal y(t) using the inverse MTF has also been proposed [37, 38]; this was used as a constraint for deriving the optimal restoration of the power envelope of the original signal, based on the deconvolution relationship between the power envelopes of y(t) and h(t) (for details, see the Appendix):

T_R = \max\left( \mathop{\arg\min}_{T_{R,\min} \le T_R \le T_{R,\max}} \int_0^T \left| \min\left( e_{x,T_R}(t)^2,\, 0 \right) \right| dt \right)    (22)

where T is the signal duration and e_{x,TR}(t)^2 is the set of candidates for the restored power envelope of the clean signal x(t) obtained via the inverse MTF [37] as a function of TR. Note that the operation “max(arg min{·})” means that the maximum TR must be determined from the point where the negative area of e_{x,TR}(t)^2 approximately equals zero or reaches a particular minimum. Here, TR,min and TR,max are the lower and upper limits of TR [37].
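A minimal sketch of the search in Eq. (22), assuming the discrete form of the inverse-MTF filter from Eq. (28) in the Appendix (the function names and the grid resolution are our own choices, not the paper's implementation): for each candidate TR, the power envelope is inverse-filtered and the negative area is measured; candidates at or below the true TR leave the envelope non-negative, so the largest (near-)minimizer is returned.

```python
import numpy as np

def negative_area(env2):
    # Discrete form of the integral of |min(e_x^2(t), 0)| in Eq. (22)
    return float(np.sum(np.abs(np.minimum(env2, 0.0))))

def estimate_tr(ey2, fs, a=1.0, tr_grid=None):
    """Eq. (22) sketch: max(arg min) of the negative area over TR candidates.
    Inverse filtering follows Eq. (28): ex2[n] = (ey2[n] - k*ey2[n-1]) / a**2,
    with k = exp(-13.8 / (TR * fs))."""
    if tr_grid is None:
        tr_grid = np.arange(0.1, 2.005, 0.01)   # TR_min .. TR_max, 10-ms steps
    areas = []
    for tr in tr_grid:
        k = np.exp(-13.8 / (tr * fs))
        ex2 = np.empty_like(ey2)
        ex2[0] = ey2[0]
        ex2[1:] = ey2[1:] - k * ey2[:-1]
        areas.append(negative_area(ex2 / a**2))
    areas = np.asarray(areas)
    near_min = areas <= areas.min() + 1e-9 + 1e-6 * areas.max()
    return float(tr_grid[np.nonzero(near_min)[0].max()])   # max(arg min)
```

Over-estimating TR over-compensates the decay and drives the restored envelope negative wherever e_x^2(t) is small, which is why the negative area is an effective constraint.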

According to Eqs. (12), (13), and (22), h(t) can be estimated by utilizing a exp(−6.9t/TR) with simulated white noise n(t). This estimate is referred to as ĥ(t). In this case, the long-term CH(q, τ) can be directly obtained from ĥ(t). Although ĥ(t) does not completely equal the original h(t) used in the evaluation, the long-term amplitude cepstrum CH,A(q, τ) alone can be matched to the original. This is because the MTFs of h(t) and ĥ(t) are the same if TR is estimated exactly, and Eqs. (20) and (21) can be regarded as characterizing CH,A(q, τ), which can be indirectly obtained from F−1[log |m(ω)|] with the power factor on the LTFT. Therefore, it can easily be predicted that CH,A(q, τ) takes a cepstral shape of exponential decay with respect to quefrency (dominant at lower quefrencies).
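A sketch of constructing such an MTF-based estimate of h(t) (Eqs. (12) and (13)): an exponentially decaying envelope a exp(−6.9t/TR) modulating a white-noise carrier. The function name, duration, and seed are our own assumptions; note that the envelope value exp(−6.9) at t = TR corresponds to a power decay of about 60 dB, which is the definition of TR.

```python
import numpy as np

def mtf_impulse_response(tr, fs, a=1.0, dur=None, seed=0):
    """MTF-based stochastic impulse response (Eqs. (12), (13)):
    h(t) = a * exp(-6.9 * t / TR) * n(t), with white-noise carrier n(t)."""
    dur = 1.5 * tr if dur is None else dur
    t = np.arange(int(dur * fs)) / fs
    noise = np.random.default_rng(seed).standard_normal(t.size)
    return a * np.exp(-6.9 * t / tr) * noise

# Example estimate for TR = 0.5 s at fs = 16 kHz (0.75 s long by default)
h_hat = mtf_impulse_response(tr=0.5, fs=16000)
```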

In contrast, although it is difficult to obtain an exact value of the phase cepstrum CH,φ(q, τ), the long-term CH,φ,all(q, τ) can be estimated from ĥ(t) by using Eqs. (17) and (19). As explained in Sec. 4.1, the CH,φ,all(q, τ) estimated from ĥ(t) should be used to eliminate the all-pass phase component from the reverberant speech y(t) on the basis of the LTFT in order to estimate F0. Although the estimated CH,A,min(q, τ) can also be canceled out in Eq. (19) on the LTFT, eliminating the minimum-phase characteristics in Eq. (19) on the LTFT is not as effective as eliminating the all-pass phase characteristics, so this has not been used in this paper. The short-term CH,A,min(q, τ) and CH,φ,min(q, τ) to be canceled out in Eq. (19) on the STFT are considered in the next section.
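The minimum-phase/all-pass decomposition underlying this discussion can be sketched with the standard real-cepstrum "folding" construction: the minimum-phase factor is built from the amplitude cepstrum (its phase is the Hilbert transform of the log magnitude), and the all-pass factor is whatever remains. This is a generic sketch, not the paper's exact estimator; the function name and regularization constant are our own.

```python
import numpy as np

def min_phase_allpass_split(h, nfft=4096):
    """Split H(w) into minimum-phase and all-pass factors via the real
    cepstrum: c_min[0] = c[0], c_min[n] = 2*c[n] for 0 < n < nfft/2
    (the Hilbert-transform relationship between amplitude and phase)."""
    H = np.fft.fft(h, nfft)
    c = np.fft.ifft(np.log(np.abs(H) + 1e-12)).real  # real (amplitude) cepstrum
    fold = np.zeros(nfft)
    fold[0] = c[0]
    fold[1 : nfft // 2] = 2.0 * c[1 : nfft // 2]
    fold[nfft // 2] = c[nfft // 2]
    H_min = np.exp(np.fft.fft(fold))                 # minimum-phase factor
    H_all = H / H_min                                # all-pass factor, |H_all| = 1
    return H_min, H_all
```

For a non-minimum-phase test response such as h = [0.5, 1.0, 0.25] (one zero outside the unit circle), |H_all(ω)| is unity while |H_min(ω)| equals |H(ω)|, so all the magnitude information stays in the minimum-phase factor.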

4.3 Liftering on complex cepstrum

Because CH,φ,all(q, τ) is canceled out in Eq. (19) on the LTFT, as explained in the previous section, the remaining terms to be removed in order to extract Csrc(q, τ) are Cflt(q, τ) and CH,min(q, τ). Complex cepstrum analysis and the source-filter model are used to cancel out these remaining terms in Eq. (19) on the STFT, taking best advantage of homomorphic processing.

There is a Hilbert-transform relationship between CA,min(q, τ) and Cφ,min(q, τ), and the latter has the same characteristics in the positive quefrency domain, based on the minimum-phase characteristics. The short-term CH,A,min(q, τ) and CH,φ,min(q, τ) are not the same as the long-term versions when the STFT analysis is shorter than the reverberation time. However, the amplitude cepstrum CH,min(q, τ) in the lower quefrency parts is generally larger than in the higher parts, and it attenuates exponentially as the quefrency increases. Therefore, the minimum-phase characteristics, CH,min(q, τ), can be assumed to concentrate in the lower quefrency parts.

Based on the advantages of the source-filter model, the cepstrum components of the source characteristics are concentrated in the higher quefrency parts and those of the filter in the lower parts, as shown in Fig. 1. Therefore, if only the components in the low quefrency part are removed by liftering, the filter characteristics, as well as the dominant components of the minimum-phase characteristics of the reverberation, can be canceled out in Eq. (19). Thus, the following lifter, l(q), is used in this paper to cancel them out in Eq. (19):

l(q) = \begin{cases} 0, & q \le q_{\mathrm{lif}} \\ 1, & q > q_{\mathrm{lif}} \end{cases}    (23)

where qlif = 1.25 ms. This means that the upper limit of the estimated F0 is 800 Hz (= 1/qlif).
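A minimal sketch of the lifter in Eq. (23) applied to a short-term cepstrum. For brevity this uses the real cepstrum rather than the complex cepstrum of the actual method, and the frame length, window, and lower F0 bound are our own assumptions.

```python
import numpy as np

def lifter_f0(frame, fs, q_lif=1.25e-3, f0_min=60.0, f0_max=800.0):
    """Estimate F0 from one frame by cepstral liftering (Eq. (23) sketch):
    zero quefrencies q <= q_lif to remove the filter and low-quefrency
    (minimum-phase reverberation) components, then pick the cepstral peak."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    ceps = np.fft.irfft(np.log(np.abs(spec) + 1e-12))
    q = np.arange(len(ceps)) / fs          # quefrency axis in seconds
    ceps[q <= q_lif] = 0.0                 # l(q) = 0 for q <= 1.25 ms
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    peak = lo + np.argmax(ceps[lo:hi])     # search 800 Hz down to f0_min
    return fs / peak
```

Applied to a harmonic-rich frame, the surviving cepstral peak sits at the pitch period, so the returned value tracks F0 even though everything below 1.25 ms has been zeroed.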


Fig. 5 Algorithm for the proposed method: the reverberant speech y(t) is analyzed by the long-term Fourier transform (LTFT) and complex cepstrum analysis (CCA) to give CY(q); TR is estimated and used to estimate the MTF-based impulse response, whose CH,φ,all(q) is eliminated from CY(q); after the inverse LTFT, the short-term Fourier transform (STFT) and CCA with liftering l(q), based on the source-filter model, yield Csrc(q, t) and log Xsrc(ω, t); finally, comb filtering determines the fundamental frequency F0(t)

4.4 Proposed method of estimating F0

The algorithm for estimating F0 based on complex cepstrum analysis, the MTF concept, and the source-filter model is illustrated in Fig. 5. The method is composed of three main processes: (1) estimating the MTF-based reverberant impulse response and eliminating its effect from the reverberant speech, (2) extracting Xsrc(ω, τ) from the processed reverberant speech by liftering the complex cepstrum based on the source-filter model, and (3) estimating F0 from the result in a final decision block.

Comb filtering was employed in the final two blocks in Fig. 5. As comb filtering and autocorrelation functions are both commonly used in classical estimation methods, these blocks can be replaced by the autocorrelation function. In addition, since the proposed method treats a complex cepstrum, the restored short-term waveform s(t, τ) from Csrc(q, τ) can be used to estimate F0 with the autocorrelation function and/or the AMDF. As the aim of this paper is to propose a model concept for robustly estimating F0 in reverberant environments, such modifications of the processing are beyond the scope of this paper.
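As a sketch of such a replacement, an autocorrelation-based final decision on a restored short-term waveform might look as follows (the function name and search limits are our own assumptions, not the paper's implementation):

```python
import numpy as np

def f0_autocorr(s, fs, f0_min=60.0, f0_max=800.0):
    """F0 decision by the autocorrelation peak within the allowed lag range."""
    s = np.asarray(s, dtype=float)
    s = s - s.mean()
    r = np.correlate(s, s, mode="full")[s.size - 1:]   # non-negative lags
    lo, hi = int(fs / f0_max), int(fs / f0_min)        # lags for 800..60 Hz
    lag = lo + np.argmax(r[lo:hi])
    return fs / lag
```

The lag search window mirrors the lifter bound of Eq. (23): the shortest admissible lag corresponds to the 800-Hz upper limit on F0.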

5. Evaluation of proposed method

5.1 Method

We evaluated the proposed method with (labeled “Proposed(Est)”) and without (labeled “Proposed(Org)”) TR estimation, using the same procedure and sound dataset described in Section 3. The two versions were compared to find out how accurate the TR estimates were. We also compared them with TEMPO, the cepstrum method, and a modified complex cepstrum method based on the source-filter model (labeled “SrcFlt”). The SrcFlt method was used to find out how effectively CH,φ,all(q, τ) was eliminated on the LTFT by the proposed method.

5.2 Results and discussion

Figure 6 plots the results of the comparative evaluations. The correct rates within error margins of 5% and 10% for the proposed and the other methods are plotted in Figs. 6(a) and (b). Their SNRs are plotted in Fig. 6(c). The results for the cepstrum method indicate the baselines in the evaluations, while those for TEMPO (dashed line) indicate the lower limits.

Although the overall accuracy of the F0 estimates tended to decrease as the reverberation time increased, about a 10% improvement in the correct rates and about a 5 dB improvement in the SNR were obtained with the new method. There were few differences between the results for the proposed method with and without TR estimates, which means that the TR estimation works well. Since a correct rate of 60% within an error margin of 5%, a correct rate of 75% within an error margin of 10%, and an SNR of 17 dB at TR = 2.0 s were achieved with the proposed method, we concluded that MTF-based impulse responses can be precisely estimated by utilizing TR estimates. For example, the results of extracting F0 at TR = 2.0 s with the proposed method, with and without TR estimates, from the same reverberant speech (Fig. 3(b)) are plotted in Figs. 6(d) and (e).

The SrcFlt method results indicate a slight improvement (about 3% in the correct rate) over the cepstrum method. In contrast, there were about 7% and 5 dB improvements in the percent correct rate and in the SNR with the new method. We concluded that the use of complex cepstrum analysis with regard to non-minimum-phase characteristics effectively enabled F0 to be estimated in reverberant environments.

As shown in Figs. 3(b), 6(d), and 6(e), most F0-estimation methods, including the proposed method, do not have a U/V decision and thus output redundant information in the unvoiced and silent sections, while TEMPO does. An important issue for our next stage of research is how to determine U/V decision rules for speech-signal-processing applications. This issue should be resolvable by incorporating power discrimination, such as that used in TEMPO, into the U/V decision rules, but this is beyond the scope of this paper.

6. Conclusion and Future Perspectives

We evaluated the robustness and accuracy of twelve typical methods of estimating F0 (i.e., the classic ACMWL, AMDF, STFT-based, cepstrum, LPC, and SHS algorithms, and the modern IFHC, PHIA, and TEMPO algorithms) in artificial reverberant environments using huge speech datasets. The results revealed that none of these methods could accurately estimate F0 in reverberant environments and that their accuracy drastically decreased as the reverberation time increased. The results also demonstrated that the best method was cepstrum-based and that the worst was the instantaneous-frequency-based model. We found that exploiting periodicity and/or harmonicity on the complex cepstrum effectively enabled F0 to be estimated in reverberant environments.

We proposed a robust and accurate method of estimating F0 that is based on the source-filter model and the MTF concept in complex cepstrum analysis. This method includes (1) eliminating the dominant reverberation effect from the observed speech by estimating MTF-based reverberant impulse responses and (2) extracting the source information by subtracting the remaining cepstrum related to the filter characteristics and the remaining reverberation through liftering. Using the same comparative evaluations, we demonstrated that the new method is robust against reverberation and can accurately estimate F0 from observed reverberant speech.

Additional improvements may be possible by modifying the F0 determination block. Further evaluations using real reverberant impulse responses in room acoustics are required for real applications. Further improvements may also be possible by incorporating U/V decision rules into the proposed method and by considering its robustness in environments that are both noisy and reverberant, but these are beyond the scope of this paper.

In future work, we hope to incorporate our new robust and accurate F0-estimation approach into the MTF-based process of speech dereverberation [39, 40] and then to improve it toward a more complete blind-dereverberation method. This is because the current method, which consists of MTF-based power-envelope inverse filtering and carrier restoration using F0 information, can dereverberate reverberant speech in the power envelope as well as in the resynthesized waveform, assuming that F0 can be accurately estimated. As mentioned in the Introduction, robust and accurate F0 estimation is a significant feature of our new method, so it should contribute to resolving the problems we previously experienced with speech dereverberation [39, 40].

7. Acknowledgments

This work was supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Culture, Sports, Science, and Technology of Japan (No. 18680017). It was also partially supported by the Strategic Information and COmmunications R&D Promotion ProgrammE (SCOPE) (071705001) of the Ministry of Internal Affairs and Communications (MIC), Japan.

References

[1] W. J. Hess: Pitch Determination of Speech Signals, Springer-Verlag, New York, 1983.
[2] W. J. Hess: Pitch and voicing determination, in Advances in Speech Signal Processing, Eds. S. Furui and M. M. Sondhi, pp. 3–48, Marcel Dekker, Inc., New York, 1992.
[3] A. de Cheveigne and H. Kawahara: Comparative evaluation of F0 estimation algorithms, Proc. Eurospeech 2001, pp. 2451–2454, Sep. 2001.
[4] B. Gold and L. Rabiner: Parallel processing techniques for estimating pitch periods of speech in the time domain, J. Acoust. Soc. Am., Vol. 46, No. 2, pp. 442–448, Aug. 1969.
[5] N. C. Geckinli and D. Yavuz: Algorithm for pitch extraction using zero-crossing interval sequence, IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. ASSP-25, No. 6, pp. 559–564, Dec. 1977.
[6] M. R. Schroeder: Period histogram and product spectrum: new methods for fundamental frequency measurement, J. Acoust. Soc. Am., Vol. 43, No. 4, pp. 829–834, Apr. 1968.
[7] D. M. Howard: Peak-picking fundamental period estimation for hearing prostheses, J. Acoust. Soc. Am., Vol. 86, No. 3, pp. 902–910, Sep. 1989.
[8] M. M. Sondhi: New methods of pitch extraction, IEEE Trans. Audio and Electroacoustics, Vol. AU-16, No. 2, pp. 262–266, Jun. 1968.
[9] M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg and H. J. Manley: Average magnitude difference function pitch extractor, IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-22, No. 5, pp. 353–361, Oct. 1974.


Fig. 6 Evaluation results: (a) percent correct rate within an error margin of 5%, (b) percent correct rate within an error margin of 10%, (c) SNR of F0 estimates from reverberant speech using the proposed method, and examples of extracted F0 using the proposed model (d) without TR estimation and (e) with TR estimation

[10] J. D. Wise, J. R. Caprio and T. W. Parks: Maximum likelihood pitch estimation, IEEE Trans. Acoustics, Speech, Signal Processing, Vol. ASSP-24, No. 5, pp. 418–423, Oct. 1976.
[11] K. Nishi and S. Ando: An optimal comb filter for time-varying harmonics extraction, IEICE Trans. Fundamentals, Vol. E81-A, No. 8, pp. 1622–1627, Aug. 1998.
[12] T. Miwa, Y. Tadokoro and T. Saito: The pitch estimation of different musical instrument sounds using comb filters for transcription, IEICE Trans. D-II, Vol. J81-D-II, No. 9, pp. 1965–1974, Sep. 1998.
[13] T. Shimamura and H. Kobayashi: Weighted autocorrelation for pitch extraction of noisy speech, IEEE Trans. Speech and Audio Processing, Vol. 9, No. 7, pp. 727–730, Oct. 2001.
[14] D. J. Hermes: Measurement of pitch by subharmonic summation, J. Acoust. Soc. Am., Vol. 83, No. 1, pp. 257–264, Jan. 1988.
[15] A. M. Noll: Cepstrum pitch determination, J. Acoust. Soc. Am., Vol. 41, No. 2, pp. 293–309, Aug. 1966.
[16] A. M. Noll: Clipstrum pitch determination, J. Acoust. Soc. Am., Vol. 44, No. 6, pp. 1585–1591, Jul. 1968.
[17] L. R. Rabiner, M. J. Cheng, A. E. Rosenberg and A. McGonegal: A comparative study of several pitch detection algorithms, IEEE Trans. Acoustics, Speech, Signal Processing, Vol. ASSP-24, pp. 399–413, 1976.
[18] C. K. Un and S. C. Yang: A pitch extraction algorithm based on LPC inverse filtering and AMDF, IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-25, No. 6, pp. 565–572, Dec. 1977.
[19] T. V. Ananthapadmanabha and B. Yegnanarayana: Epoch extraction from linear prediction residual for identification of closed glottis interval, IEEE Trans. Acoustics, Speech, Signal Processing, Vol. ASSP-27, No. 4, pp. 309–319, Aug. 1979.
[20] H. Singer and S. Sagayama: Pitch dependent phone modeling for HMM-based speech recognition, J. Acoust. Soc. Jpn. (E), Vol. 15, No. 2, pp. 77–86, Mar. 1994.

[21] J. D. Markel: The SIFT algorithm for fundamental frequency estimation, IEEE Trans. Audio and Electroacoustics, Vol. AU-20, No. 5, pp. 367–377, Dec. 1972.
[22] N. Kunieda, T. Shimamura and J. Suzuki: Pitch extraction by using autocorrelation function on the log spectrum, IEICE Trans. A, Vol. J80-A, No. 3, pp. 435–443, Mar. 1997.
[23] H. Kobayashi and T. Shimamura: An extraction method of fundamental frequency using clipping and band limitation on log spectrum, IEICE Trans. A, Vol. J82-A, No. 7, pp. 1115–1122, Jul. 1999.
[24] T. Takagi, N. Seiyama and E. Miyasaka: A method for pitch extraction of speech signals using autocorrelation functions through multiple window lengths, IEICE Trans. A, Vol. J80-A, No. 9, pp. 1341–1350, Sep. 1997.
[25] H. Kawahara, H. Katayose, A. de Cheveigne and R. D. Patterson: Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity, Proc. Eurospeech 99, No. 6, pp. 2781–2784, Sep. 1999.
[26] A. de Cheveigne and H. Kawahara: YIN, a fundamental frequency estimator for speech and music, J. Acoust. Soc. Am., Vol. 111, No. 4, pp. 1917–1930, Apr. 2002.
[27] H. Kawahara, I. Masuda-Katsuse and A. de Cheveigne: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds, Speech Communication, Vol. 27, pp. 187–207, Apr. 1999.
[28] Y. Atake, T. Irino, H. Kawahara, J. Lu, S. Nakamura and K. Shikano: Robust fundamental frequency estimation using instantaneous frequencies of harmonic components, Proc. ICSLP 2000, Vol. 2, pp. 907–910, Oct. 2000.
[29] Y. Ishimoto, M. Unoki and M. Akagi: A fundamental frequency estimation method for noisy speech based on instantaneous amplitude and frequency, Proc. Eurospeech 2001, pp. 2439–2442, Sep. 2001.
[30] M. Unoki and M. Akagi: A method of signal extraction from noisy signal based on auditory scene analysis, Speech Communication, Vol. 27, No. 3, pp. 261–279, Apr. 1999.
[31] T. Nakatani and T. Irino: Robust and accurate fundamental frequency estimation based on dominant harmonic components, J. Acoust. Soc. Am., Vol. 116, No. 6, pp. 3690–3700, Dec. 2004.
[32] P. P. Vaidyanathan: Multirate Systems and Filter Banks, Prentice-Hall, New Jersey, 1993.
[33] H. Ohmura and K. Tanaka: Fine pitch contour extraction by voice fundamental wave filtering method, J. Acoust. Soc. Jpn., Vol. 51, No. 7, pp. 509–518, Jul. 1995.
[34] M. R. Schroeder: Modulation transfer functions: definition and measurement, Acustica, Vol. 49, pp. 179–182, 1981.
[35] H. Kuttruff: Room Acoustics, fourth edition, Taylor & Francis, London, 2000.
[36] T. Houtgast and H. J. M. Steeneken: The modulation transfer function in room acoustics as a predictor of speech intelligibility, Acustica, Vol. 28, pp. 66–73, 1973.

Fig. 7 Examples of the relationship between the power envelopes of the system based on the MTF concept: (a) power envelope e_x^2(t) of (b) original signal x(t), (c) power envelope e_h^2(t) of (d) impulse response h(t) (TR = 0.5 s), (e) power envelope e_y^2(t) derived from e_x^2(t) ∗ e_h^2(t), (f) reverberant signal y(t) derived from x(t) ∗ h(t), and (g) restored power envelope ê_x^2(t) for TR = 0.3, 0.5, and 1.0 s

[37] M. Unoki, M. Furukawa, K. Sakata and M. Akagi: An improved method based on the MTF concept for restoring the power envelope from a reverberant signal, Acoustical Science and Technology, Vol. 25, No. 4, pp. 232–242, Apr. 2004.
[38] M. Unoki, K. Sakata, M. Furukawa and M. Akagi: A speech dereverberation method based on the MTF concept in power envelope, Acoustical Science and Technology, Vol. 25, No. 4, pp. 243–254, Apr. 2004.
[39] M. Unoki, K. Sakata and M. Akagi: A speech dereverberation method based on the MTF concept, Proc. Eurospeech 2003, pp. 1417–1420, Sep. 2003.
[40] M. Unoki, M. Toi and M. Akagi: Development of the MTF-based speech dereverberation method using adaptive time-frequency division, Proc. Forum Acusticum 2005, pp. 51–56, Aug. 2005.

Appendix: MTF-based power-envelope inverse filtering

In the model of power-envelope inverse filtering [37], the observed reverberant signal, the original signal, and the stochastic idealized impulse response of the room acoustics are denoted as y(t), x(t), and h(t), respectively, and are modeled as

y(t) = x(t) \ast h(t)    (24)
x(t) = e_x(t) n_1(t)    (25)
h(t) = e_h(t) n_2(t) = a \exp(-6.9t/T_R) n_2(t)    (26)

where the asterisk “∗” denotes the convolution operation, e_x(t) and e_h(t) are the envelopes of x(t) and h(t), and n_1(t) and n_2(t) are mutually independent white-noise carriers, i.e., ⟨n_k(t), n_k(t − τ)⟩ = δ(τ). The parameters of the impulse response, a and T_R, correspond to a constant amplitude term and the reverberation time.

In this model, the power envelope of the reverberant signal, e_y^2(t), can be determined as

\langle y^2(t) \rangle = e_x^2(t) \ast e_h^2(t) = e_y^2(t)    (27)

where ⟨·⟩ is the ensemble-average operation. This equation shows that there is a significant relationship between the envelopes, i.e., the MTF concept. Based on this result, e_x^2(t) can be recovered by deconvolving e_y^2(t) with e_h^2(t). Here, the transmission functions of the power envelopes, E_x(z), E_h(z), and E_y(z), are assumed to correspond to the z-transforms of e_x^2(t), e_h^2(t), and e_y^2(t). Thus, the transmission function of the power envelope of the original signal, E_x(z), can be determined from

envelope of the original signal, Ex(z), can be deter-mined from

Ex(z) =Ey(z)

a2

{1 − exp

(− 13.8

TR · fs

)z−1

}(28)

where f_s is the sampling frequency. Finally, the power envelope e_x^2(t) can be obtained from the inverse z-transform of E_x(z).
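In discrete time, Eq. (28) reduces to a first-order FIR correction of the sampled power envelope, since E_h(z) = a^2/(1 − kz^{-1}) with k = exp(−13.8/(T_R f_s)). A sketch (the function name is ours):

```python
import numpy as np

def inverse_mtf_filter(ey2, tr, fs, a=1.0):
    """Eq. (28) in the time domain:
    ex2[n] = (ey2[n] - k * ey2[n-1]) / a**2, with k = exp(-13.8/(TR*fs))."""
    k = np.exp(-13.8 / (tr * fs))
    ex2 = np.empty_like(ey2)
    ex2[0] = ey2[0]
    ex2[1:] = ey2[1:] - k * ey2[:-1]
    return ex2 / a**2
```

Convolving a 10-Hz sinusoidal power envelope with the sampled e_h^2(t) for T_R = 0.5 s and then applying this filter recovers the original envelope, mirroring the demonstration in Fig. 7.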

Figure 7 shows an example of how power-envelope inverse filtering is related to the MTF concept. Figure 7(a) shows a sinusoidal power envelope as the original power envelope e_x^2(t) (= 0.5(1 + sin(2πF_m t)); the modulation frequency F_m was 10 Hz and the modulation index m was 1.0). Figure 7(b) shows the original signal x(t), calculated from e_x^2(t) and a white-noise carrier n_1(t) using Eq. (25). Figure 7(c) shows the power envelope e_h^2(t) calculated using Eq. (26) with T_R = 0.5 s. Figure 7(d) shows the impulse response h(t) of Eq. (12), calculated from e_h^2(t) and a white-noise carrier n_2(t). Figures 7(e) and (f) show the power envelope e_y^2(t), obtained from the convolution e_x^2(t) ∗ e_h^2(t), and the observed reverberant signal y(t), obtained from the convolution of x(t) with h(t), respectively. The left panels ((a), (c), and (e)) show the power envelopes of the signals and the right panels ((b), (d), and (f)) show the corresponding signals. In this figure, the modulation index decreased from 1.0 (in Fig. 7(a)) to 0.404 (the maximum deviation of the envelope between the dotted lines in Fig. 7(e) relative to that in Fig. 7(a)). Since the MTF concept gives the modulation index as a function of F_m and T_R, it can also be shown that this decreased modulation index agrees with m(2πF_m) = 0.402, obtained by substituting T_R = 0.5 s and F_m = 10 Hz into Eq. (21). The solid line in Fig. 7(g) indicates the restored power envelope ê_x^2(t), obtained from the reverberant power envelope e_y^2(t) (Fig. 7(e)) using Eq. (28) with T_R = 0.5 s. We can see that power-envelope inverse filtering can accurately restore the power envelope from a reverberant signal in terms of both shape and magnitude.

Masashi Unoki was born in Akita Pref., Japan, in 1969. He received his M.S. and Ph.D. (Information Science) degrees from the Japan Advanced Institute of Science and Technology (JAIST) in 1996 and 1999. His main research interests are auditory-motivated signal processing and the modeling of auditory systems. He was a JSPS research fellow from 1998 to 2001. He was associated with the ATR Human Information Processing Laboratories as a visiting researcher during 1999–2000, and from 2000 to 2001 he was a visiting research associate at CNBH in the Dept. of Physiology at the University of Cambridge. He has been on the faculty of the School of Information Science at JAIST since 2001 and is now an associate professor. He is a member of the Research Institute of Signal Processing (RISP), the Institute of Electrical and Electronics Engineers (IEEE), the Institute of Electronics, Information and Communication Engineers (IEICE) of Japan, the Acoustical Society of America (ASA), the Acoustical Society of Japan (ASJ), and the International Speech Communication Association (ISCA). Dr. Unoki received the Sato Prize for an Outstanding Paper from the ASJ in 1999 and the Yamashita Taro Prize for Young Researchers from the Yamashita Taro Research Foundation in 2005.

Toshihiro Hosorogiya was born in Ishikawa Pref., Japan, in 1980. He received his B.E. from Nagoya University in 2005 and his M.S. from the Japan Advanced Institute of Science and Technology in 2007. He is a member of the Research Institute of Signal Processing (RISP) and the Acoustical Society of Japan (ASJ).

(Received June 17, 2007; revised September 11, 2007)