

S4b.9

PITCH ESTIMATION AND VOICING DETECTION BASED ON A SINUSOIDAL SPEECH MODEL*

Robert J. McAulay and Thomas F. Quatieri

Lincoln Laboratory, MIT, Lexington, MA 02173-9108

ABSTRACT

A new technique for estimating the pitch of a speech waveform is developed that fits a harmonic set of sine waves to the input data using a mean-squared-error (MSE) criterion. By exploiting a sinusoidal model for the input speech waveform, a new pitch estimation criterion is derived that is inherently unambiguous, uses pitch-adaptive resolution, uses small-signal suppression to provide enhanced discrimination, and uses amplitude compression to eliminate the effects of pitch-formant interaction. The normalized minimum mean-squared-error proves to be a powerful discriminant for estimating the likelihood that a given frame of speech is voiced.

INTRODUCTION

An analysis/synthesis system has been developed based on a sinusoidal representation for speech that leads to synthetic speech that is essentially perceptually indistinguishable from the original [1]. The question arises as to whether the parameters of the sinusoidal model, the amplitudes, frequencies and phases of the underlying sine waves, can be coded at low data rates (2400-4800 b/s) and result in a high-quality speech compression system. Although straightforward coding of each of the parameters would lead to the most robust system, the attendant data rate would be too high, hence speech-specific properties must be introduced to reduce the size of the parameter set to be quantized. One of the fundamental models used in low-rate coding is the assumption that voiced speech is periodic, which suggests that perhaps the sine-wave frequencies could be coded in terms of a harmonic series.

In this paper an algorithm is derived that fits a harmonic set of sine waves to the measured set of sine waves for voiced and unvoiced speech. The accuracy of the harmonic fit becomes an indicator of the voicing state and is used to define the probability of voicing, which can be used to allow for mixed voiced/unvoiced excitations. The method has proven to be a powerful pitch estimation technique that has found wide application beyond the original low-rate coding problem.

*This work was sponsored by the Department of the Air Force.

PARAMETER ESTIMATION FOR THE HARMONIC SINE-WAVE MODEL

As a first step in the analysis procedure, it is assumed that a frame of the input speech waveform has already been analyzed in terms of its sinusoidal components using the technique described in [1]. The measured speech data, $s(n)$, can therefore be represented as

$$s(n) = \sum_{l=1}^{L} A_l \exp[j(n\omega_l + \theta_l)] \qquad (1)$$

where $\{A_l, \omega_l, \theta_l\}_{l=1}^{L}$ represent the amplitudes, frequencies, and phases of the $L$ measured sine waves. The goal is to try to represent this sinusoidal waveform by another for which all of the frequencies are harmonic. This latter waveform can be modeled as

$$\hat{s}(n; \omega_0, \phi) = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \exp[j(nk\omega_0 + \phi_k)] \qquad (2)$$

where $\omega_0 = 2\pi f_0/f_s$ is the fundamental frequency, $K(\omega_0)$ is the number of harmonics in the speech bandwidth, $A(\omega)$ is the vocal tract envelope, $\phi = \{\phi_1, \phi_2, \ldots, \phi_{K(\omega_0)}\}$ represents the phases of the harmonics, and $f_s$ is the rate at which the waveform is sampled. Henceforth, $\omega_0$ will be referred to as the "pitch", although during unvoiced speech this terminology is not meaningful in the usual sense. It is desired to estimate the pitch frequency $\omega_0$ and the phases $\{\phi_1, \phi_2, \ldots, \phi_{K(\omega_0)}\}$ such that $\hat{s}(n)$ is as "close as possible" to $s(n)$ according to some meaningful criterion.

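A reasonable estimation criterion is to seek the minimum of the mean-squared-error (MSE),

$$\epsilon(\omega_0, \phi) = \frac{1}{N+1} \sum_{n=-N/2}^{N/2} |s(n) - \hat{s}(n; \omega_0, \phi)|^2 \qquad (3)$$

over $\omega_0$ and $\phi$, since this at least ensures robustness against additive white Gaussian noise [2]. The MSE in Eq. (3) can be expanded as

$$\epsilon(\omega_0, \phi) = \frac{1}{N+1} \sum_{n=-N/2}^{N/2} \left\{ |s(n)|^2 - 2\,\mathrm{Re}[s(n)\hat{s}^*(n; \omega_0, \phi)] + |\hat{s}(n; \omega_0, \phi)|^2 \right\} \qquad (4)$$

If the sinusoidal representation for $s(n)$, Eq. (1), is used in the first term of Eq. (4), then the power in the measured signal can be defined as

$$P_s = \frac{1}{N+1} \sum_{n=-N/2}^{N/2} |s(n)|^2 \approx \sum_{l=1}^{L} A_l^2 \qquad (5)$$

Substituting Eq. (2) in the second term of Eq. (4) leads to the relation

$$\frac{1}{N+1} \sum_{n=-N/2}^{N/2} s(n)\hat{s}^*(n; \omega_0, \phi) = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \exp(-j\phi_k) \, \frac{1}{N+1} \sum_{n=-N/2}^{N/2} s(n) \exp(-jnk\omega_0) \qquad (6)$$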

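Finally, substituting Eq. (2) in the third term of Eq. (4) leads to the relation

$$\frac{1}{N+1} \sum_{n=-N/2}^{N/2} |\hat{s}(n; \omega_0, \phi)|^2 \approx \sum_{k=1}^{K(\omega_0)} A^2(k\omega_0) \qquad (7)$$

where the approximation is valid provided the analysis window satisfies the condition $(N+1) \gg 2\pi/\omega_0$, which is more or less assured by making the analysis window 2.5 times the average pitch period. Letting

$$S(\omega) = \frac{1}{N+1} \sum_{n=-N/2}^{N/2} s(n) \exp(-jn\omega) \qquad (8)$$

denote the short-time Fourier transform (STFT) of the input speech signal, and using this in Eq. (6), the MSE in Eq. (4) becomes

$$\epsilon(\omega_0, \phi) = P_s - 2\,\mathrm{Re} \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \exp(-j\phi_k) S(k\omega_0) + \sum_{k=1}^{K(\omega_0)} A^2(k\omega_0) \qquad (9)$$

CH2847-2/90/0000-0249 $1.00 © 1990 IEEE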

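Since the phase parameters $\{\phi_k\}_{k=1}^{K(\omega_0)}$ affect only the second term in Eq. (9), the MSE will be minimized by choosing

$$\hat{\phi}_k = \arg[S(k\omega_0)] \qquad (10)$$

and the resulting MSE will be given by

$$\epsilon(\omega_0) = P_s - 2 \sum_{k=1}^{K(\omega_0)} A(k\omega_0) |S(k\omega_0)| + \sum_{k=1}^{K(\omega_0)} A^2(k\omega_0) \qquad (11)$$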
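The step from Eq. (9) to Eqs. (10) and (11) can be spelled out for a single harmonic. Writing the STFT sample in polar form, $S(k\omega_0) = |S(k\omega_0)| e^{j\psi_k}$, where $\psi_k$ is shorthand introduced here and not a symbol from the paper, each term of the second sum in Eq. (9) is bounded as follows:

```latex
\mathrm{Re}\left\{ A(k\omega_0)\, e^{-j\phi_k}\, S(k\omega_0) \right\}
  = A(k\omega_0)\, |S(k\omega_0)|\, \cos(\psi_k - \phi_k)
  \le A(k\omega_0)\, |S(k\omega_0)|
```

Assuming the envelope $A(k\omega_0)$ is non-negative, equality holds when $\phi_k = \psi_k = \arg[S(k\omega_0)]$. Since the second term enters Eq. (9) with a minus sign, this choice makes it as negative as possible, which is how Eq. (11) follows.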

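The unknown pitch affects only the second and third terms in Eq. (11), and these can be combined by defining

$$\rho(\omega_0) = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \left[ |S(k\omega_0)| - \tfrac{1}{2} A(k\omega_0) \right] \qquad (12)$$

and the MSE can then be expressed as

$$\epsilon(\omega_0) = P_s - 2\rho(\omega_0) \qquad (13)$$

Since the first term is a known constant, the minimum-mean-squared-error (MMSE) is obtained by maximizing $\rho(\omega_0)$ over $\omega_0$.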

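It is useful to manipulate this metric further by making explicit use of the sinusoidal representation of the input speech waveform. Substituting the representation in Eq. (1) in the STFT defined in Eq. (8) leads to the expression

$$S(\omega) = \sum_{l=1}^{L} A_l \exp(j\theta_l)\, \mathrm{sinc}(\omega_l - \omega) \qquad (14)$$

where

$$\mathrm{sinc}(x) = \frac{1}{N+1} \sum_{n=-N/2}^{N/2} \exp(jnx) = \frac{\sin[(N+1)x/2]}{(N+1)\sin(x/2)} \qquad (15)$$

Since the sine waves are well-resolved, the magnitude of the STFT can then be approximated by

$$|S(\omega)| \approx \sum_{l=1}^{L} A_l D(\omega_l - \omega) \qquad (16)$$

where $D(x) = |\mathrm{sinc}(x)|$. The MSE criterion then becomes

$$\rho(\omega_0) = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \left[ \sum_{l=1}^{L} A_l D(\omega_l - k\omega_0) - \tfrac{1}{2} A(k\omega_0) \right] \qquad (17)$$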

To gain some insight into the meaning of this criterion, suppose that the input speech is periodic with pitch frequency $\omega^*$. Then $\omega_l = l\omega^*$, $A_l = A(l\omega^*)$ and

$$\rho(\omega_0) = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \left[ \sum_{l} A(l\omega^*) D(l\omega^* - k\omega_0) - \tfrac{1}{2} A(k\omega_0) \right] \qquad (18)$$

When $\omega_0$ corresponds to submultiples of the pitch, the first term in (17) remains unchanged, since $D(\omega_l - k\omega_0) = 0$ at the submultiples; but the second term, because it is an envelope and always non-zero, will increase at the submultiples of $\omega^*$. As a consequence

$$\rho\left(\frac{\omega^*}{m}\right) < \rho(\omega^*), \qquad m = 2, 3, \ldots \qquad (19)$$

which shows that the MSE criterion leads to unambiguous pitch estimates. This is possibly its most significant attribute, as it has been found through extensive experimentation that the usual problems with pitch period doubling do not occur with this metric. However, the frequency domain implementation can lead to additional processing advantages, the first of which is pitch-adaptive resolution.
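The inequality for pitch submultiples can be checked numerically from the sine-wave parameters alone. The sketch below evaluates the criterion in the form of Eq. (17), taking $D(x)$ to be the magnitude of the periodic sinc $\sin[(N+1)x/2] / [(N+1)\sin(x/2)]$ of a rectangular window. The flat envelope, the ten 150 Hz harmonics, the window length, and the function names are illustrative assumptions.

```python
import math

def dirichlet(x, n_plus_1):
    """D(x) = |sin((N+1) x / 2) / ((N+1) sin(x / 2))|, with D(0) = 1."""
    denom = n_plus_1 * math.sin(x / 2.0)
    if abs(denom) < 1e-12:
        return 1.0
    return abs(math.sin(n_plus_1 * x / 2.0) / denom)

def rho(f0, sines, envelope, fs, bandwidth, n_plus_1):
    """sum_k A(k w0) * (sum_l A_l D(w_l - k w0) - A(k w0) / 2)."""
    total = 0.0
    for k in range(1, int(bandwidth // f0) + 1):
        a_env = envelope(k * f0)
        match = sum(
            a * dirichlet(2.0 * math.pi * (f - k * f0) / fs, n_plus_1)
            for f, a in sines
        )
        total += a_env * (match - 0.5 * a_env)
    return total

def flat_envelope(freq_hz):
    """Hypothetical envelope, assumed known and flat."""
    return 1.0

fs, bandwidth, n_plus_1 = 8000.0, 1500.0, 401
# Measured sine waves as (frequency in Hz, amplitude): harmonics of a 150 Hz pitch.
sines = [(150.0 * l, 1.0) for l in range(1, 11)]
scores = {
    f0: rho(f0, sines, flat_envelope, fs, bandwidth, n_plus_1)
    for f0 in (150.0, 75.0, 50.0)  # true pitch and its submultiples m = 2, 3
}
for f0, value in scores.items():
    print(f0, round(value, 2))
```

At 75 Hz and 50 Hz the harmonics that coincide with the measured sine waves contribute as before, while the additional harmonics find almost nothing under $D$ and add only the negative $\tfrac{1}{2}A^2$ term, so the score drops.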

PITCH-ADAPTIVE RESOLUTION

In the above formulation it was implied that the analysis window was fixed at $N + 1$ samples. This would mean that the main lobe of the sinc function, which measures the distance of the measured sine-wave frequencies from the harmonic candidates (i.e., $\mathrm{sinc}(\omega_l - k\omega_0) \sim (\omega_l - k\omega_0)$ for $|\omega_l - k\omega_0|$ small), would be fixed for all pitch candidates. This is contrary to the fact that the ear is perceptually tolerant to larger errors in the pitch at high pitch frequencies than at lower pitch frequencies. Moreover, the sinc-function distance measure of the error is meaningful only over each harmonic lobe. These effects can be accounted for by defining the distance function $D(x)$ at the $k$'th harmonic lobe to be

$$D(\omega - k\omega_0) = \frac{\sin[2\pi(\omega - k\omega_0)/\omega_0]}{2\pi(\omega - k\omega_0)/\omega_0}, \qquad |\omega - k\omega_0| \le \frac{\omega_0}{2} \qquad (20)$$

and to be zero elsewhere. In this way the resolution becomes very sharp at low pitch values, and in contrast becomes quite broad at high values of the pitch. It is this expression which is used in (17) to compute the first revised mean-squared-error.
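A small sketch may help show what pitch-adaptive resolution means in practice. It assumes the distance function is a sinc main lobe scaled so that its first nulls fall half a harmonic interval on either side of each harmonic, $\sin[2\pi x/\omega_0] / [2\pi x/\omega_0]$ for $|x| \le \omega_0/2$ and zero elsewhere. That form is an assumption here, as is the function name. Because only the ratio $x/\omega_0$ matters, the code works directly in Hz.

```python
import math

def adaptive_distance(offset_hz, f0_hz):
    """Assumed pitch-adaptive lobe: sin(2*pi*x/f0) / (2*pi*x/f0) for |x| <= f0/2.

    offset_hz is the distance of a measured sine wave from the harmonic,
    f0_hz is the pitch candidate. The weight is zero outside the lobe.
    """
    if abs(offset_hz) > f0_hz / 2.0:
        return 0.0
    arg = 2.0 * math.pi * offset_hz / f0_hz
    if arg == 0.0:
        return 1.0
    return math.sin(arg) / arg

# The same 10 Hz frequency error is penalised more at a low pitch candidate
# than at a high one, because the lobe narrows with the candidate pitch.
for f0 in (80.0, 160.0, 320.0):
    print(f0, round(adaptive_distance(10.0, f0), 3))
```

The weight for a fixed 10 Hz error rises toward 1 as the candidate pitch increases, which is the broad resolution at high pitch described above.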

ENHANCED DISCRIMINATION

The MSE criterion is closely related to the design of a Gaussian classifier for which the classes, the pitch candidates, are assumed to be independent. It is desirable that the classification algorithm not only detect the correct class with high probability, but also suppress the likelihood that any other class might be detected. This feature, which in a neural net classifier is known as negative reinforcement [3], can be incorporated into the MSE pitch estimation algorithm by noting that if $\omega_0$ were the true pitch, then there would be at most one measured sine wave in each harmonic lobe tuned to $\omega_0$. Therefore, if there are more, then only the one that contributes most to the MSE should be computed. Since the lobes are determined by the pitch-adaptive sinc function in (20) and since each lobe spans one harmonic interval defined by the set

$$L(k\omega_0) = \left\{ \omega : k\omega_0 - \frac{\omega_0}{2} \le \omega < k\omega_0 + \frac{\omega_0}{2} \right\} \qquad (21)$$

then discrimination will be enhanced by allowing only the largest weighted sine wave for each harmonic lobe. The second revision to the MSE pitch estimation criterion becomes

$$\rho(\omega_0) = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \left\{ \max_{\omega_l \in L(k\omega_0)} \left[ A_l D(\omega_l - k\omega_0) \right] - \tfrac{1}{2} A(k\omega_0) \right\} \qquad (22)$$

In addition to providing greater robustness against additive noise (since the small peaks due to noise are ignored), the enhanced MSE criterion ensures that speech of low pitch will less likely be estimated as a high pitch. Moreover, if the above implementation is thought of as a form of small-signal suppression and if the harmonic lobe structure is thought of as an auditory critical band filter, then it is possible to speculate that enhanced discrimination is not unlike the effect of auditory masking of small tones by nearby large tones [4].
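The "largest weighted sine wave per lobe" rule can be sketched as follows. The code keeps, for each harmonic lobe, only the maximum of $A_l D(\omega_l - k\omega_0)$ before subtracting half the envelope value. The lobe weight is the assumed scaled-sinc form with nulls at $\pm\omega_0/2$, and the flat envelope, the spurious peaks, and the function names are hypothetical.

```python
import math

def lobe_weight(offset_hz, f0_hz):
    """Assumed pitch-adaptive lobe, zero outside |offset| <= f0 / 2."""
    if abs(offset_hz) > f0_hz / 2.0:
        return 0.0
    arg = 2.0 * math.pi * offset_hz / f0_hz
    return 1.0 if arg == 0.0 else math.sin(arg) / arg

def rho_enhanced(f0, sines, envelope, bandwidth):
    """Per lobe, keep only the largest weighted sine wave, then subtract A/2."""
    total = 0.0
    for k in range(1, int(bandwidth // f0) + 1):
        center = k * f0
        a_env = envelope(center)
        weighted = [
            a * lobe_weight(f - center, f0)
            for f, a in sines
            if center - f0 / 2.0 <= f < center + f0 / 2.0
        ]
        # An empty lobe contributes only the negative envelope term.
        total += a_env * (max(weighted, default=0.0) - 0.5 * a_env)
    return total

def flat_envelope(freq_hz):
    """Hypothetical envelope, assumed known and flat."""
    return 1.0

harmonics = [(100.0 * l, 1.0) for l in range(1, 11)]
noise_peaks = [(130.0, 0.05), (265.0, 0.04), (540.0, 0.05)]  # small spurious peaks
sines = harmonics + noise_peaks
scores = {f0: rho_enhanced(f0, sines, flat_envelope, 1000.0) for f0 in (50.0, 100.0, 200.0)}
best_f0 = max(scores, key=scores.get)
print(best_f0, {f0: round(v, 3) for f0, v in scores.items()})
```

At the 200 Hz candidate each lobe contains two true harmonics, but only one is allowed to count, which is how the rule discourages a low pitch from being estimated as a high pitch.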




THE FORMANT INTERACTION PROBLEM

One of the more important pitch estimation techniques in current use is based on the correlation function. In some respects it is the time-domain dual of the correlation implicit in the first term in (12). One problem with the time-domain correlation technique is that it is inherently ambiguous, which requires the use of some type of frame-to-frame pitch tracking. Another problem arises as a result of the interaction between the pitch and the first formant. If the formant bandwidth is narrow relative to the harmonic spacing, the correlation function reflects the formant frequency rather than the underlying pitch. Nonlinear time-domain processing techniques using various types of center-clipping have been developed to eliminate the problem [5].

The same effect manifests itself in the frequency domain, as the sine-wave amplitude near the formant frequency will tend to dominate the MSE criterion. This effect can be eliminated simply by reducing the dynamic range of all of the sine-wave amplitudes and, in turn, the amplitude envelope. One way to do this is to replace the measured sine-wave amplitudes by compressed amplitudes defined in terms of a compression factor $\gamma$ and the peak amplitude $A_{max} = \max_l \{A_l\}_{l=1}^{L}$. Since the MSE criterion leads to maximal robustness against additive white Gaussian noise, it was desirable to keep $\gamma$ as close to unity as possible, introducing just enough amplitude compression to eliminate the formant interaction problem. Too much compression causes the low-level peaks due to noise to distort the MSE criterion. Ultimately, the compression factor was chosen to be $\gamma = 0.5$, having been determined experimentally using a real-time system to process approximately two hours of speech for a variety of speech and noise conditions.
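The effect of the compression factor on dynamic range can be illustrated with a short sketch. The exact compression rule is not legible in this copy of the paper, so the form $A_{max}(A_l/A_{max})^{\gamma}$ used below is an assumption chosen only because it leaves the largest amplitude unchanged and reduces to no compression at $\gamma = 1$. The function names and sample amplitudes are hypothetical.

```python
import math

def compress(amps, gamma=0.5):
    """Assumed compression rule: A_max * (A / A_max) ** gamma.

    gamma = 1 leaves the amplitudes unchanged; gamma = 0.5 is square-root
    compression relative to the largest amplitude.
    """
    a_max = max(amps)
    if a_max <= 0.0:
        return list(amps)
    return [a_max * (a / a_max) ** gamma for a in amps]

def dynamic_range_db(amps):
    """Ratio of largest to smallest amplitude in dB."""
    return 20.0 * math.log10(max(amps) / min(amps))

measured = [1.0, 0.25, 0.01]  # e.g. a formant peak, a neighbour, a weak harmonic
compressed = compress(measured, gamma=0.5)
print(compressed)
print(dynamic_range_db(measured), dynamic_range_db(compressed))
```

With $\gamma = 0.5$ the dynamic range in dB is halved, so a strong harmonic near the first formant dominates the criterion less, at the cost of raising weak noise peaks relative to the strongest one.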

SINE-WAVE AMPLITUDE ENVELOPE ESTIMATION

It has been shown that if the envelope of the sine-wave amplitudes is known, then the MSE criterion can lead to unambiguous estimates of the pitch. While a number of methods might be used for estimating the envelope, using linear prediction or cepstral estimation techniques, for example, it was desirable to use a method that led to an envelope that passed through the measured sine-wave amplitudes. Such a technique has already been developed in the Spectral Envelope Estimation Vocoder (SEEVOC) [6]. The method depends on having an estimate of the average pitch, denoted here by $\bar{\omega}_0$. The first step is to search for the largest sine-wave amplitude in the frequency range $[\bar{\omega}_0/2, 3\bar{\omega}_0/2]$. Having found the amplitude and frequency of that peak, denoted here by $(A_1, \omega_1)$, then the interval $[\omega_1 + \bar{\omega}_0/2, \omega_1 + 3\bar{\omega}_0/2]$ is searched for its largest peak, $(A_2, \omega_2)$. The process is continued throughout the speech band. If no peak is found in a search bin, then the largest end-point of the STFT magnitude is used and placed at a frequency at the bin center.

In the original SEEVOC application the goal was to obtain an estimate of the vocal tract envelope for use in a low-rate vocoder. This was done by linearly interpolating between the successive log-amplitudes using the peaks determined by the above search procedure. In the application to MSE pitch estimation, however, the purpose of the envelope is mainly to eliminate pitch ambiguities. Since the linearly interpolated envelope could affect the fine structure of the MSE criterion through its interaction with the measured peaks in the correlation operation, better performance was obtained by using piecewise-constant interpolation between the SEEVOC peaks.
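The peak search described above can be sketched as a simple loop. This is an illustrative reading, not the SEEVOC code. It assumes the first search bin runs from half to one and a half times the average pitch, and that each later bin is offset by the same amounts from the previously selected peak. The piecewise-constant envelope is taken here as the amplitude of the nearest selected peak, which is one possible interpretation. `stft_mag` is a hypothetical callable returning the STFT magnitude at a frequency in Hz.

```python
def seevoc_peaks(peaks, avg_f0, bandwidth, stft_mag):
    """Pick one (frequency, amplitude) pair per search bin across the band.

    peaks: list of (freq_hz, amplitude) for the measured sine waves.
    avg_f0: average pitch estimate in Hz (must be positive).
    """
    selected = []
    lo, hi = 0.5 * avg_f0, 1.5 * avg_f0
    while lo < bandwidth:
        in_bin = [(f, a) for f, a in peaks if lo <= f <= hi]
        if in_bin:
            f_sel, a_sel = max(in_bin, key=lambda p: p[1])
        else:
            # No peak in the bin: use the larger end-point magnitude,
            # placed at the bin centre.
            f_sel = 0.5 * (lo + hi)
            a_sel = max(stft_mag(lo), stft_mag(hi))
        selected.append((f_sel, a_sel))
        lo, hi = f_sel + 0.5 * avg_f0, f_sel + 1.5 * avg_f0
    return selected

def piecewise_constant(selected):
    """Envelope equal to the amplitude of the nearest selected peak (assumed)."""
    def envelope(freq_hz):
        return min(selected, key=lambda p: abs(p[0] - freq_hz))[1]
    return envelope

peaks = [(100.0, 1.0), (140.0, 0.1), (200.0, 0.8), (300.0, 0.5), (330.0, 0.05), (500.0, 0.3)]
selected = seevoc_peaks(peaks, avg_f0=100.0, bandwidth=500.0, stft_mag=lambda f: 0.02)
envelope = piecewise_constant(selected)
print(selected)
print(envelope(120.0), envelope(260.0))
```

In the example the small peaks at 140 Hz and 330 Hz are skipped because a larger peak shares their bin, and the empty bin around 400 Hz is filled from the end-point magnitude.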

COARSE PITCH ESTIMATION

The MSE pitch extractor is predicated on the assumption that the input speech waveform has been represented in terms of the sinusoidal model. This implicitly assumes that the analysis has been performed using a Hamming window approximately 2.5 times the average pitch period. Moreover, the SEEVOC technique also assumes that an estimate of the average pitch is available. It seems, therefore, that the pitch has to be known in order to estimate the average pitch, in order to estimate the pitch. This circular dilemma can be broken by using some other method to estimate the average pitch based on a fixed window. Since only an average pitch value is needed, the estimation technique does not have to be accurate on every frame; hence, any of the well-known techniques can be used. In a future paper, a method using the sinusoidal model and the MSE criterion will be described that has the advantages of the present technique but which operates on a fixed analysis window and requires no amplitude estimate. It is not as reliable as the present method, but it is good enough to estimate the average pitch.

VOICING DETECTION

In the context of the sinusoidal model the degree to which a given frame of speech is voiced is determined by the degree to which the harmonic model fits the original sine-wave data. The accuracy of the harmonic fit can be related, in turn, to the signal-to-noise ratio (SNR) defined by

$$\mathrm{SNR} = \frac{\sum_{n=-N/2}^{N/2} |s(n)|^2}{\sum_{n=-N/2}^{N/2} |s(n) - \hat{s}(n; \omega_0)|^2} \qquad (24)$$

From (13) it follows that

$$\mathrm{SNR} = \frac{P_s}{P_s - 2\rho(\omega_0)} \qquad (25)$$

where now the input power, $P_s$, is computed for the compressed sine-wave amplitudes. If the SNR is large, then the MSE is small and the harmonic fit is very good, which indicates that the input speech is most likely voiced. For small SNR, on the other hand, the MSE is large and the harmonic fit is quite poor, which indicates that the input speech is more likely to be unvoiced. Therefore, the degree of voicing is functionally dependent on the SNR. Although the determination of the exact functional form is difficult, one that has proven useful in several speech applications is the following:

$$P_v = \begin{cases} 1, & \mathrm{SNR} > 10\ \mathrm{dB} \\ \tfrac{1}{6}(\mathrm{SNR} - 4), & 4\ \mathrm{dB} \le \mathrm{SNR} \le 10\ \mathrm{dB} \\ 0, & \mathrm{SNR} < 4\ \mathrm{dB} \end{cases} \qquad (26)$$

where $P_v$ represents the probability that speech is voiced, and the SNR is expressed in dB. The voicing probability concept has proven useful in a number of speech applications [7] and, in particular, has been used in the Sinusoidal Transform Coder [8] to provide a mixed voicing excitation.
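The mapping from harmonic fit to voicing probability is simple enough to state as code. The sketch assumes the SNR is the ratio of signal power to MSE, with the MSE taken as $P_s - 2\rho(\omega_0)$ following Eq. (13), and that the probability rises linearly from 0 at 4 dB to 1 at 10 dB as in Eq. (26). The function names and the example powers are hypothetical.

```python
import math

def snr_db(p_s, rho):
    """SNR in dB, assuming MSE = P_s - 2 * rho at the chosen pitch."""
    mse = p_s - 2.0 * rho
    if mse <= 0.0:
        return math.inf  # perfect (or numerically over-fitted) harmonic match
    return 10.0 * math.log10(p_s / mse)

def voicing_probability(snr):
    """Piecewise-linear voicing probability; snr is in dB."""
    if snr > 10.0:
        return 1.0
    if snr < 4.0:
        return 0.0
    return (snr - 4.0) / 6.0

# Hypothetical frames: (input power, best rho) pairs.
for p_s, rho in [(10.0, 4.5), (10.0, 3.5), (10.0, 1.0)]:
    snr = snr_db(p_s, rho)
    print(round(snr, 2), round(voicing_probability(snr), 2))
```

Frames with an intermediate SNR receive a fractional probability, which is what allows a mixed voiced/unvoiced excitation rather than a hard decision.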

IMPLEMENTATION

In one implementation of the MSE pitch extractor the speech was sampled at 10 kHz and Fourier analyzed using a 512-point FFT. The sine-wave amplitudes and frequencies were determined over a 1000 Hz bandwidth. In Figure 1(a), the measured amplitudes and frequencies are shown along with the piecewise-constant SEEVOC envelope. Square-root compression has been applied to the amplitude data. Figure 1(b) is a plot of the first term in (22) over a pitch range from 38 Hz to 400 Hz, and the inherent ambiguity of the correlator is apparent. It should be noted that "most of the time" the peak at the correct pitch is largest, but during steady vowels the ambiguous behavior illustrated in the figure commonly occurs. Figure 1(c) is a plot of the overall MSE criterion, and the manner in which the ambiguities are eliminated is clearly demonstrated. Figure 1(d) is an illustration of the voicing probability as a function of the SNR, and for this example the SNR is about 20 dB, indicating that the speech is clearly voiced.




HARMONIC SINE-WAVE RECONSTRUCTION

Validating the performance of a pitch extractor can be a time-consuming and laborious procedure, since it requires a comparison with hand-labeled data. The approach used in the present study was to reconstruct the speech using the harmonic sine-wave model and listen for pitch errors. The procedure is not quite so straightforward as Eq. (2) indicates, however, because during unvoiced speech meaningless pitch estimates are made, which can lead to perceptual artifacts whenever the pitch estimate is greater than about 150 Hz. This is due to the fact that, in these cases, there are too few sine waves to adequately synthesize a noiselike waveform. This problem has been eliminated by defaulting to a fixed low pitch ($\approx$ 100 Hz) during unvoiced speech whenever the pitch exceeds 100 Hz. The exact procedure for doing this is to first define a voicing-dependent cutoff frequency, $\omega_c$, as

$$\omega_c(P_v) = \pi P_v \qquad (27)$$

which is constrained to be no smaller than $2\pi(1000\ \mathrm{Hz}/f_s)$. If the actual pitch estimate is $\omega_0$ then the sine-wave frequencies used in the reconstruction are

$$\omega_k = \begin{cases} k\omega_0, & k \le k^* \\ k^*\omega_0 + (k - k^*)\omega_u, & k > k^* \end{cases} \qquad (28)$$

where $k^*$ is the largest value of $k$ for which $k^*\omega_0 \le \omega_c(P_v)$, and where $\omega_u$, the unvoiced pitch, corresponds to 100 Hz (i.e., $\omega_u = 2\pi(100/f_s)$). Note that if $\omega_0 < \omega_u$, then $\omega_k = k\omega_0$ for all $k$. The harmonic reconstruction then becomes

$$\hat{s}(n; \omega_0) = \sum_{k=1}^{K} A(\omega_k) \exp[j(n\omega_k + \phi_k)] \qquad (29)$$

where $\phi_k$ is the phase of the STFT at frequency $\omega_k$. Strictly speaking, this procedure is harmonic only during strongly voiced speech, since if the speech is a voiced/unvoiced mixture the frequencies above the cutoff, although equally spaced by $\omega_u$, are aharmonic, since they are themselves not multiples of a fundamental pitch.
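The frequency assignment used in the reconstruction can be sketched as below. The code works in Hz, so the cutoff $\pi P_v$ becomes $P_v f_s/2$, with a 1000 Hz floor. It assumes that harmonics of the pitch are used up to the cutoff and that frequencies above it continue from the last harmonic in 100 Hz steps. Stopping at a stated bandwidth is also an assumption, since the upper limit $K$ is not spelled out. The function and parameter names are hypothetical.

```python
def synthesis_frequencies(f0, p_v, fs, bandwidth, f_unvoiced=100.0, min_cutoff=1000.0):
    """Return the sine-wave frequencies (Hz) used for reconstruction.

    Below the voicing-dependent cutoff the frequencies are harmonics of f0;
    above it they are spaced by f_unvoiced, starting from the last harmonic.
    """
    cutoff = max(p_v * fs / 2.0, min_cutoff)
    if f0 < f_unvoiced:
        cutoff = bandwidth  # pitch already below 100 Hz: keep every harmonic
    k_star = int(cutoff // f0)  # largest k with k * f0 <= cutoff
    freqs = []
    k = 1
    while True:
        if k <= k_star:
            f = k * f0
        else:
            f = k_star * f0 + (k - k_star) * f_unvoiced
        if f > bandwidth:
            break
        freqs.append(f)
        k += 1
    return freqs

mixed = synthesis_frequencies(f0=220.0, p_v=0.4, fs=8000.0, bandwidth=4000.0)
voiced = synthesis_frequencies(f0=220.0, p_v=1.0, fs=8000.0, bandwidth=4000.0)
print(mixed[:9])
print(len(mixed), len(voiced))
```

For the mixed frame the first seven frequencies are harmonics of 220 Hz and the rest are spaced 100 Hz apart, so the upper band is filled densely enough to sound noiselike, while the fully voiced frame stays strictly harmonic.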

The synthetic speech produced by this model is of very high quality, almost perceptually equivalent to the original. Not only does this validate the performance of the MSE pitch extractor, but it also shows that if the amplitudes and phases of the harmonic representation could be efficiently coded, then only the pitch and voicing are needed to code the information in the sine-wave frequencies.

CONCLUSIONS

A new technique for estimating the pitch of a speech waveform has been developed that fits a harmonic set of sine waves to the input data using a mean-squared-error criterion. By exploiting a sinusoidal model for the input speech waveform, a new criterion was derived that was inherently unambiguous, had pitch-adaptive resolution, used small-signal suppression to provide enhanced discrimination, and used amplitude compression to eliminate the effects of pitch-formant interaction. It was found that the normalized MSE proved to be a powerful discriminant for estimating the likelihood that a given frame of speech is voiced. The new pitch estimator/voicing detector has proven to be useful for low-rate speech coding, speech enhancement and for time- and pitch-scale modification of speech.

References

[1] R.J. McAulay and T.F. Quatieri, "Speech Analysis/Synthesis Based on a Sinusoidal Representation," IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-34, No. 4, pp. 744-754, August 1986.

[2] H. Van Trees, Detection, Estimation and Modulation Theory, Part I (Wiley, New York) 1968.

[3] R.P. Lippmann, "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, pp. 4-22, April 1987.

[4] H. Duifhuis, L.F. Willems and R.J. Sluyter, "Measurement of Pitch in Speech: An Implementation of Goldstein's Theory of Pitch Perception," J. Acoust. Soc. Am., Vol. 71, No. 6, pp. 1568-1580, June 1982.

[5] L. Rabiner, "On the Use of Autocorrelation Analysis for Pitch Detection," IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-25, No. 1, pp. 24-33, February 1977.

[6] D.B. Paul, "The Spectral Envelope Estimation Vocoder," IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-29, pp. 786-794, 1981.

[7] T.F. Quatieri and R.J. McAulay, "Phase Coherence in Speech Reconstruction for Enhancement and Coding Applications," Proc. IEEE ICASSP'89, Glasgow, Scotland, pp. 207-209, May 1989.

[8] R.J. McAulay, T.M. Parks, T.F. Quatieri, and M. Sabin, "Sine-wave Amplitude Coding at Low Data Rates," IEEE Workshop on Speech Coding, Vancouver, B.C., Canada, September 1989.

[9] M.R. Schroeder, "Period Histogram and Product Spectrum: New Methods for Fundamental-Frequency Measurement," Journal of the Acoustical Society of America, Vol. 43, No. 4, pp. 829-834, 1968.

[10] R. Linggard and W. Millar, "Pitch Detection Using Harmonic Histograms," Speech Communication 1, pp. 113-124, 1982.

[11] S. Seneff, "Real-Time Harmonic Pitch Detector," IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-26, No. 4, pp. 358-365, August 1978.

Figure 1. Speech waveform (time axis in ms) with panels (a) amplitude envelopes, (b) correlator output, (c) mean-squared-error, and (d) voicing probability versus SNR; the frequency axes are labeled in kHz and Hz.
