
Signal Processing 24 (1991) 217-229. Elsevier.

A hybrid transform method for analysis/synthesis of speech

Andreas S. Spanias, Department of Electrical Engineering, Telecommunications Research Center, Arizona State University, Tempe, AZ 85287-5706, USA

Received 8 August 1990; Revised 13 March 1991

Abstract. In this paper, a hybrid transform method for analysis/synthesis and coding of speech signals is proposed. A finite number of Fourier functions is used to represent, frame by frame, the selective (narrow-band) components of speech. The broadband residual is characterized by a small number of Walsh functions. An iterative algorithm that determines the frequency and sequency components in a systematic manner is derived. The algorithm is employed for speech coding at 16 kbit/s and 9.6 kbit/s. The hybrid transform method produces natural-sounding speech which is indistinguishable from the original at 16 kbit/s and highly intelligible at 9.6 kbit/s.


Keywords. Speech, coding, transforms, Fourier, Walsh.

1. Introduction

Transform-based analysis/synthesis models have been used extensively for speech and waveform coding [5]. Most of these models achieve data compression by encoding a set of orthogonal components derived from unitary transforms. One class of coders, the zonal sampling transform coders [8], characterizes speech by a subset of transform components obtained by constraining the basis of an orthogonal transform. This method of signal modeling is referred to in this paper as constrained (basis) representation. In constrained representations the nature of the basis vectors determines the quality of the signal representation. It can be shown [1, 13] that an optimal basis for constrained representations is formed by the eigenvectors of the autocorrelation matrix of the signal. This basis forms the Karhunen-Loeve transform (KLT), which is statistically optimal but data-dependent and computationally expensive [8]. Thus, other suboptimal transforms can be used, such as the Fourier, the Walsh, etc.

The Fourier transform is a convenient method for signal characterization and provides intuition in a number of applications, since its basis functions are eigenfunctions of linear systems. Constrained representation with the Fourier basis entails signal characterization by a subset of narrow-band components, namely complex exponentials. A finite set of complex exponentials is capable of representing narrow-band signal components accurately; broadband components, however, are generally not characterized efficiently. For example, voiced speech segments possess narrow-band components that have a harmonic structure and can be represented by a small number of harmonic complex exponentials [2]. Unvoiced segments, however, are random-like and broadband, and thus require a large number of sinusoids to be represented. If a small number of sinusoids is used for synthesizing broadband speech, a tonal noise, often referred to as 'musical noise' [3], is produced, which is both undesirable and annoying.

Broadband signal components can be efficiently characterized by a broadband class of basis functions, such as the Walsh functions [1]. The Walsh set consists of periodic and aperiodic pulse-train functions with amplitude plus or minus one. These are characterized by their sequency, i.e., the number of zero crossings per unit time. Walsh functions have been used successfully for the representation of the residual signal in residual-excited linear prediction (RELP) [4].

In this paper, we introduce a general approach to transform-based speech analysis/synthesis that uses constrained sets of narrow-band and broadband basis functions. The narrow-band components of speech are characterized in the frequency domain by Fourier functions, while the broadband components are characterized in the sequency domain by Walsh functions. This approach is based on the fact that narrow-band and broadband components coexist even in voiced speech segments. In the latter, the waveform consists of strong narrow-band components due to vocal cord vibration, and broadband components [7] due to the turbulence created by the breathing-out process that accompanies voiced speech. The frequency and sequency components are determined frame-by-frame by an iterative algorithm. Synthesis using a small number of Fourier and Walsh components produces natural-sounding speech which is in some cases superior to that produced by Fourier components alone at the same information rate.

The constrained basis representation is discussed in the next section. An analysis algorithm that selects the basis functions in a systematic manner is developed in Section 3. Section 4 presents some implementation issues, as well as results that demonstrate the merit of the proposed technique. Section 5 gives speech coding implementations at 16 kbit/s and 9.6 kbit/s, and Section 6 gives our conclusion.

2. Constrained basis representation

Data compression can be effected using a constrained representation implemented in discrete time. A block diagram of a typical constrained representation scheme is shown in Fig. 1. The sampled signal is processed frame-by-frame, and the pth frame, denoted s(p), is transformed using an N-point orthogonal (or, in the complex case, unitary) transform. The N orthogonal basis vectors of the transform are constrained to n (n < N), and the inverse transform is taken to reconstruct the signal. It can be shown, using Parseval's theorem, that the n vectors are selected optimally by retaining the n dominant (magnitude-wise) transform components in S(p) and setting the rest to zero. The accuracy of the constrained representation depends on the orthogonal transform employed.

Fig. 1. Constrained basis representation.

The optimal transform is the KLT, which is, however, not practical. Thus, other unitary (or, in the real case, orthogonal) transforms can be used, such as the discrete Fourier transform (DFT), the Walsh-Hadamard transform (WHT), the discrete cosine transform (DCT), etc. In this paper, the DFT was selected as a representative of the class of transforms with a narrow-band basis, and not for any superiority over other such transforms. The WHT was chosen as a representative of those transforms with a broadband basis.

In the constrained DFT representation, speech segments are characterized by a linear combination of sampled complex exponentials. Since the speech sequence is real valued, the short-time spectrum is symmetric, and the frequency components are selected in complex conjugate pairs such that a real sequence is reconstructed. Thus a sinusoidal model for speech is implied, that is,

\hat{s}(m) = \sum_q A_q \cos(\omega_q m + \phi_q),   (1)

where the amplitude A_q and phase \phi_q are obtained from the magnitude and phase of the qth frequency component, respectively. The frequencies of the DFT components are those corresponding to the dominant DFT magnitudes; thus the energy in the reconstructed signal is maximized.
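To make the constrained DFT representation concrete, the following is a minimal NumPy sketch of the scheme of Fig. 1 with a Fourier basis; the frame content and the component count are arbitrary illustrative choices, not taken from the paper's experiments.

```python
import numpy as np

def constrained_dft(frame, n):
    """Constrained DFT representation: keep the n dominant components.

    Bins are retained in conjugate-symmetric pairs so that the inverse
    transform is real valued, as required by the sinusoidal model (1).
    """
    N = len(frame)
    S = np.fft.fft(frame)                     # N-point DFT of the frame s(p)
    keep = np.zeros(N, dtype=bool)
    count = 0
    for k in np.argsort(np.abs(S))[::-1]:     # bins ranked by magnitude
        if keep[k]:
            continue
        keep[k] = keep[(N - k) % N] = True    # bin and its conjugate mirror
        count += 1
        if count == n:
            break
    S[~keep] = 0.0                            # set the remaining bins to zero
    return np.fft.ifft(S).real                # inverse transform reconstructs

frame = np.random.randn(256)                  # stand-in for a speech frame
approx = constrained_dft(frame, 20)
```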

The Walsh-Hadamard transform characterizes the signal in the sequency domain using sampled pulse-train functions, and was found to be efficient [4] in the coding of random broadband sequences. In addition, a fast WHT (FWHT) is available [1] which lends itself well to real-time speech applications [15].

The performance of the DFT and the WHT in constrained representations is data-dependent. A comparative result for the constrained representation of a periodic sequence using the DFT and the WHT is shown in Fig. 2. The vertical axis shows the normalized error in dB and the horizontal axis shows the percentage of basis functions (n/N x 100) retained from each transform (N = 256). The results were averaged over 100 independent runs using periodic segments taken from a speech database.

Fig. 2. Comparisons of constrained basis representations for narrow-band signals.

As can be seen, the constrained DFT representation outperforms the constrained WHT representation. The same experiment was performed using broadband pseudo-random sequences (see Fig. 3). The results were again averaged over 100 broadband pseudo-random sequences, and in this case the WHT outperforms the DFT. These results are consistent with the notion that signals with narrow-band spectral properties are best characterized by a narrow-band basis, while broadband signals are best represented by a broadband basis.

Fig. 3. Comparisons of constrained basis representations for broadband signals.
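The experiments behind Figs. 2 and 3 can be reproduced in outline as follows. This is a sketch under stated assumptions: the test signals are synthetic stand-ins (a harmonic sum for the periodic case, white noise for the broadband case), since only the general description of the database is given above.

```python
import numpy as np
from scipy.linalg import hadamard

def constrained_error_db(x, T, n):
    """Normalized error (dB) when only the n dominant coefficients of the
    orthonormal transform matrix T are retained (zonal sampling)."""
    c = T @ x
    idx = np.argsort(np.abs(c))[::-1][:n]
    kept = np.zeros_like(c)
    kept[idx] = c[idx]
    xr = T.conj().T @ kept                    # inverse (unitary) transform
    return 10 * np.log10(np.sum(np.abs(x - xr) ** 2) / np.sum(np.abs(x) ** 2))

N = 256
F = np.fft.fft(np.eye(N)) / np.sqrt(N)        # normalized DFT matrix
W = hadamard(N) / np.sqrt(N)                  # normalized WHT matrix

m = np.arange(N)
periodic = sum(np.cos(2 * np.pi * 8 * (q + 1) * m / N + q) for q in range(5))
broadband = np.random.randn(N)

for name, x in (("periodic", periodic), ("broadband", broadband)):
    for frac in (0.25, 0.5, 0.75):
        n = int(frac * N)
        print(name, frac,
              constrained_error_db(x, F, n),  # Fourier error
              constrained_error_db(x, W, n))  # Walsh error
```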

3. Hybrid transform representation

The traditional source-system speech reproduction model uses periodic or noise-like excitation in conjunction with a time-varying vocal tract filter for synthesis. This model was exploited in linear predictive coding (LPC) [11] and in filter-bank vocoders [5]. Perhaps the most significant recent development in LPC was reported by Atal and Schroeder (CELP) [14], where the traditional voiced/unvoiced excitation was replaced by a more general stochastic excitation scheme. It was reported that CELP has the potential for high-quality speech at 4.8 kbit/s. In addition, Griffin and Lim [7] employed a more general excitation scheme in vocoders, where mixed voiced and unvoiced excitation is applied in a frequency-selective fashion. By adopting this mixed excitation model, Griffin and Lim reported high-quality speech at 8 kbit/s. A mixed narrow-band/broadband representation approach to speech synthesis is essentially implied in source-system methods [4, 7, 14]. For transform-based approaches, however, there is no explicit narrow-band/broadband model.

Transform coding techniques often use Fourier-based models [2, 12], which provide for the characterization of narrow-band speech components; broadband components, however, are not properly represented. In this paper, we attempt to characterize speech in the transform domain by narrow-band and broadband basis functions. A subset of Fourier components is used to represent the narrow-band components of speech. The broadband residual is represented by a small number of Walsh components.

In the following, we derive a general algorithm that selects, in a systematic fashion, basis functions from two different orthogonal sets. The algorithm is derived in a general and compact form for the continuous-time case and is then extended to the discrete-time case. The hybrid basis signal characterization considered in this paper represents the signal s(t) on the interval [t_1, t_2] by \hat{s}(t) such that

\hat{s}(t) = \sum_{i=1}^{n_F} S_F(i) F_i^*(t) + \sum_{i=1}^{n_W} S_W(i) W_i^*(t),   (2)

where the superscript * denotes complex conjugation, S_F(i) and S_W(i) are the n_F and n_W coefficients to be determined, and F_i(t), W_i(t) are basis functions with the following properties:

\langle F_i^*(t), F_j(t) \rangle = c_F \delta_{i-j},   (3)

\langle W_i^*(t), W_j(t) \rangle = c_W \delta_{i-j},   (4)

where \langle \cdot , \cdot \rangle denotes the inner product on the interval [t_1, t_2], \delta is the unit impulse, and c_F and c_W are constants. For convenience, and without loss of generality, we consider the basis functions in (3) and (4) to be normalized, i.e., c_F = c_W = 1. The two sets of functions are orthogonal in their own domains but not mutually orthogonal. The mixed-basis signal representation minimizes the mean square error

e = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} |e(t)|^2 \, dt,   (5)

with respect to the coefficients S_F(i), S_W(i), where

e(t) = s(t) - \hat{s}(t).   (6)

This minimization yields

S_F(j) = \langle s(t), F_j(t) \rangle - \sum_{i=1}^{n_W} S_W(i) \langle W_i^*(t), F_j(t) \rangle   for j = 1, ..., n_F,   (7)

S_W(j) = \langle s(t), W_j(t) \rangle - \sum_{i=1}^{n_F} S_F(i) \langle F_i^*(t), W_j(t) \rangle   for j = 1, ..., n_W.   (8)

Using (2), (7) and (8) can be recognized as

S_F(j) = \langle (s(t) - \hat{s}_W(t)), F_j(t) \rangle   for j = 1, ..., n_F,   (9)

S_W(j) = \langle (s(t) - \hat{s}_F(t)), W_j(t) \rangle   for j = 1, ..., n_W.   (10)

Equations (9) and (10) suggest that each transform operates on a residual, i.e., the transform with basis functions F_i(t) operates on the residual (s(t) - \hat{s}_W(t)), while the second transform operates on (s(t) - \hat{s}_F(t)). An iterative implementation is thus possible, that is,

S_F^{k+1}(j) = \langle (s(t) - \hat{s}_W^k(t)), F_j(t) \rangle   for j = 1, ..., n_F,   (11)

S_W^{k+1}(j) = \langle (s(t) - \hat{s}_F^{k+1}(t)), W_j(t) \rangle   for j = 1, ..., n_W,   (12)

where k is the iteration index and

\hat{s}_F^k(t) = \sum_{i=1}^{n_F} S_F^k(i) F_i^*(t),   (13)

\hat{s}_W^k(t) = \sum_{i=1}^{n_W} S_W^k(i) W_i^*(t).   (14)

Equations (11)-(14) give a general procedure for determining the coefficients of the hybrid transform representation (2). It can be shown (see Appendix A) that (11) and (12) are equivalent to a Gauss-Seidel iteration [18] for solving linear equations. The Fourier-Walsh representation considered in this paper can be accommodated by obtaining a discrete-time version of the algorithm for the interval NT, where T is the sampling period. The basis functions F_i(t), W_i(t) become N-point DFT basis vectors F_i and WHT basis vectors W_i, respectively. It must be pointed out, however, that the discrete-time hybrid transform representation is justified only if the sum of the DFT and WHT components (n_F + n_W) is less than N. This is because the N components of an N-point orthogonal transform represent s(p) exactly. The discrete-time form of (7) and (8) is

S_F(j) = \langle s(p), F_j \rangle - \sum_{i=1}^{n_W} S_W(i) \langle W_i, F_j \rangle   for j = 1, ..., n_F,   (15)

S_W(j) = \langle s(p), W_j \rangle - \sum_{i=1}^{n_F} S_F(i) \langle F_i^*, W_j \rangle   for j = 1, ..., n_W,   (16)

where complex conjugation has been removed from the Walsh vectors since they are real valued. It can be shown (see Appendix B) that the equations above have an orthogonalizing effect. Equation (16) implies that the jth member, W_j, of the second set of basis functions is made orthogonal to all the members of the first set, i.e., F_i, i = 1, ..., n_F. An analogous argument can be made for (15). The n_F basis vectors F_i and the n_W basis vectors W_i are chosen from the N orthogonal vectors of the DFT and the N orthogonal vectors of the WHT, respectively, using constrained representation. Thus, the iterative form of (15) and (16) becomes

\hat{S}_F^{k+1}(p) = \{select the n_F maximum components and conjugates; set the rest to zero\} \, F(s(p) - \hat{s}_W^k(p)),   (17)

where

\hat{s}_W^k(p) = W^{-1} \hat{S}_W^k(p),   (18)

and

\hat{S}_W^{k+1}(p) = \{select the n_W maximum components; set the rest to zero\} \, W(s(p) - \hat{s}_F^{k+1}(p)),   (19)

where

\hat{s}_F^{k+1}(p) = F^{-1} \hat{S}_F^{k+1}(p);   (20)

F is the N \times N normalized DFT matrix given by

F = \frac{1}{\sqrt{N}} \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & \zeta^{-1} & \cdots & \zeta^{-(N-1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \zeta^{-(N-1)} & \cdots & \zeta^{-(N-1)(N-1)} \end{bmatrix},   (21)

where \zeta is given by

\zeta^{-q} = \exp(-j 2 \pi q / N);   (22)

W is the N \times N normalized WHT matrix. The Walsh matrix can be generated by an order-recursive process, i.e.,

W_r = \frac{1}{\sqrt{2}} \begin{bmatrix} W_{r-1} & W_{r-1} \\ W_{r-1} & -W_{r-1} \end{bmatrix},   (23)

where r = \log_2 N and W_0 = 1. The matrix generated in (23) is a normalized Walsh-Hadamard matrix, and its basis vectors are Hadamard ordered. A sequency-ordered Walsh matrix may be generated by reordering the rows of W_r [1].
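The recursion (23) translates directly into code; a minimal sketch (Hadamard ordering, as in the text):

```python
import numpy as np

def walsh_matrix(r):
    """Normalized Walsh-Hadamard matrix W_r of order 2^r, generated by the
    order-recursive process (23); the rows are Hadamard ordered."""
    W = np.array([[1.0]])                     # W_0 = 1
    for _ in range(r):
        W = np.block([[W, W], [W, -W]]) / np.sqrt(2.0)
    return W

W = walsh_matrix(8)                           # 256 x 256, as used in the paper
assert np.allclose(W @ W.T, np.eye(256))      # rows form an orthonormal basis
```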


Equations (17)-(20) describe an iterative process that determines the DFT and WHT components to be used for synthesis. For each segment s(p), the algorithm iterates, and \hat{s}_F(p) as well as \hat{s}_W(p) are determined. The signal segment s(p) at iteration k is then represented by

\hat{s}^k(p) = \hat{s}_F^k(p) + \hat{s}_W^k(p).   (24)

The algorithm is initialized for each incoming frame as follows:

\hat{S}_F^0(p) = 0,   (25)

\hat{S}_W^0(p) = 0,   (26)

where 0 is the N \times 1 null vector.

4. Implementation and results

4.1. Implementation using the FFT and the FWHT

The iterative process (17)-(20) can be implemented using the FFT and the FWHT (Fig. 4). The FFT operates on the residual (s(p) - \hat{s}_W(p)), while the FWHT operates on the residual (s(p) - \hat{s}_F(p)). Ten iterations were found to be sufficient. The iterative algorithm requires that the frequencies and sequencies (the basis vectors) be selected first; otherwise a unique solution cannot be guaranteed. A peak-picking method was thus adopted for determining the frequencies and the sequencies of the DFT and the WHT components, respectively.

Fig. 4. Iterative algorithm for Fourier-Walsh representation.

Other parameters to be estimated are the number (n_F) of DFT components and the number (n_W) of WHT components. An optimal solution for this problem can be obtained using combinatorial methods, which are both cumbersome and impractical. A practical method for determining n_F can be devised by taking into account the harmonic structure of the short-time spectrum [16] of voiced speech. Voiced segments exhibit spectral peaks at integer multiples of the fundamental (pitch) frequency f_0. The number of DFT components can then be determined by the expression n_F = f_s / (2 f_0), where f_s is the sampling frequency. In general, n_F will be higher for male speakers (low pitch) than for female speakers (high pitch). Once n_F has been set, increasing n_W monotonically decreases the error energy; thus n_W can be determined by the constraint on the data rate. If n_F is to be determined in a time-varying (frame-by-frame) fashion, a pitch estimator must be integrated into the iterative algorithm, which would essentially increase the complexity of the hybrid transform approach. Instead, empirical values for n_F and n_W can be used. Twenty DFT components and six WHT components gave sufficiently good results for both voiced and unvoiced speech. Twenty DFT components correspond to a pitch frequency of 200 Hz (f_s = 8 kHz). Since the pitch period ranges from 3 ms to 12 ms [10], 200 Hz may be roughly regarded as an ensemble-average pitch for male and female speakers.
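Putting (17)-(26) together, the analysis loop of Fig. 4 can be sketched as follows. This is a minimal NumPy rendering under the empirical settings quoted above (N = 256, n_F = 20, n_W = 6, ten iterations); peak picking here is plain magnitude ranking, without the halfband constraint introduced later for coding.

```python
import numpy as np
from scipy.linalg import hadamard

def keep_pairs(S, n):
    """Zero all DFT bins except the n dominant ones (kept with their
    conjugate mirrors so the inverse transform stays real)."""
    N = len(S)
    keep = np.zeros(N, dtype=bool)
    count = 0
    for k in np.argsort(np.abs(S))[::-1]:
        if keep[k]:
            continue
        keep[k] = keep[(N - k) % N] = True
        count += 1
        if count == n:
            break
    out = S.copy()
    out[~keep] = 0.0
    return out

def hybrid_analysis(s, n_f=20, n_w=6, iters=10):
    """Iterative Fourier-Walsh analysis of one frame s(p), eqs. (17)-(26)."""
    N = len(s)
    W = hadamard(N) / np.sqrt(N)              # normalized WHT matrix, eq. (23)
    s_f = np.zeros(N)                         # \hat{s}_F^0(p) = 0, eq. (25)
    s_w = np.zeros(N)                         # \hat{s}_W^0(p) = 0, eq. (26)
    for _ in range(iters):
        # Eq. (17): DFT of the residual, keep n_f dominant conjugate pairs.
        S_f = keep_pairs(np.fft.fft(s - s_w), n_f)
        s_f = np.fft.ifft(S_f).real           # eq. (20)
        # Eq. (19): WHT of the residual, keep n_w dominant sequencies.
        S_w = W @ (s - s_f)
        idx = np.argsort(np.abs(S_w))[::-1][:n_w]
        mask = np.zeros(N, dtype=bool)
        mask[idx] = True
        S_w[~mask] = 0.0
        s_w = W.T @ S_w                       # eq. (18)
    return s_f + s_w                          # eq. (24): synthesized frame

frame = np.random.randn(256)                  # stand-in for a windowed frame
synth = hybrid_analysis(frame)
```

In a full implementation the matrix products would be replaced by the FWHT, and the FFT/IFFT pair would be pruned, as discussed in Section 5.4.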

4.2. Results

In the following, speech analysis/synthesis results are presented using the proposed method. Ten seconds of speech, bandlimited to 3.2 kHz and sampled at 8 kHz, was processed frame-by-frame using a 32 ms trapezoidal window and 256-point transforms. The trapezoidal windows were overlapped as in Fig. 5. The DFT and the WHT components were selected using the algorithm shown in Fig. 4, and speech was reconstructed using (24) with k = 10. The synthesis results were evaluated using the following performance measures. The signal-to-noise ratio (SNR) is given by

SNR = 10 \log_{10} \left[ \frac{\sum_{m=1}^{I} s^2(m)}{\sum_{m=1}^{I} (s(m) - \hat{s}(m))^2} \right],   (27)

where I is the total number of samples to be reconstructed. The segmental signal-to-noise ratio (SNRSG) is

SNRSG = \frac{1}{P} \sum_{p=1}^{P} SNRFR(p),   (28)

where P is the number of speech frames having energy exceeding a certain threshold, and SNRFR(p) is the signal-to-noise ratio for the pth frame, that is,

SNRFR(p) = 10 \log_{10} \left[ \frac{\| s(p) \|^2}{\| s(p) - \hat{s}(p) \|^2} \right].   (29)

Fig. 5. Overlapped trapezoidal windows.
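The window of Fig. 5 can be realized, for instance, as below; the 16 ms flat section and 8 ms ramps are read off the figure labels, so treat the exact proportions as an assumption. With a 24 ms frame advance, the ramps of adjacent windows add to one, which is consistent with the 41.67 Hz frame rate quoted in Section 5.

```python
import numpy as np

fs = 8000                                     # sampling frequency (Hz)
ramp = 8 * fs // 1000                         # 8 ms ramps (64 samples)
flat = 16 * fs // 1000                        # 16 ms flat section (128 samples)
up = (np.arange(ramp) + 0.5) / ramp           # chosen so up + reversed up = 1
window = np.concatenate([up, np.ones(flat), up[::-1]])   # 32 ms = 256 samples

hop = flat + ramp                             # 24 ms advance -> 41.67 frames/s
```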

In addition, the sample standard deviation (STD) was used to show the deviation of SNRFR(p) from SNRSG. STD provides a measure of performance consistency across the different speech segments. It is given by

STD = \left[ \frac{1}{P} \sum_{p=1}^{P} [SNRFR(p)]^2 - [SNRSG]^2 \right]^{1/2}.   (30)
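Measures (27)-(30) translate directly into code; a minimal sketch, where the frame energy threshold that selects the P frames is left as a free parameter:

```python
import numpy as np

def snr_db(s, s_hat):
    """Overall SNR, eq. (27)."""
    return 10 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))

def segmental_stats(s, s_hat, frame_len=256, energy_thresh=1e-3):
    """Segmental SNR, eq. (28), and its standard deviation, eq. (30),
    computed over frames whose energy exceeds energy_thresh."""
    vals = []
    for i in range(0, len(s) - frame_len + 1, frame_len):
        sp = s[i:i + frame_len]
        ep = sp - s_hat[i:i + frame_len]
        if np.sum(sp ** 2) > energy_thresh:
            vals.append(10 * np.log10(np.sum(sp ** 2) / np.sum(ep ** 2)))  # (29)
    vals = np.asarray(vals)
    snrsg = vals.mean()
    std = np.sqrt(np.mean(vals ** 2) - snrsg ** 2)
    return snrsg, std
```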

A comparative result that demonstrates the merit of the proposed method is given in the following. A ten-second speech record was processed using the constrained DFT approach (\hat{s} = \hat{s}_F) with n_F = 20 and produced an SNR (SNRSG/STD) of 24.5 dB (24.9/±5.1). By introducing six WHT components per frame (\hat{s} = \hat{s}_F + \hat{s}_W), that is, n_F = 20 and n_W = 6, an SNR of 26.2 dB (26.4/±4.7) was realized. On the other hand, by trading the six real WHT components for three complex DFT components (six independent parameters), i.e., using the constrained DFT representation (\hat{s} = \hat{s}_F) with the same overall number (n_F = 23) of components, the SNR was 24.92 dB (25.2/±4.9), which is smaller than the one obtained for the hybrid representation. Spectrograms are also given to illustrate the difference between the original and the reconstructed speech. In Fig. 6, the spectrogram of the first 5 seconds of the original speech record is shown. Figure 7 shows the spectrogram of speech reconstructed from twenty-three (n_F = 23) DFT components only. The last spectrogram, Fig. 8, is that of speech reconstructed using twenty (n_F = 20) DFT and six (n_W = 6) WHT components. Figures 6 and 8 are almost identical. Figure 7 shows that the high frequencies are absent, while Fig. 8 shows that the introduction of 6 WHT components produced the high frequencies that were absent in Fig. 7 and consequently enhanced the SNR. This improvement was also audible. Note that the results in Figs. 7 and 8 involve the same information rate.

Fig. 6. Spectrogram of the original speech record.

Fig. 7. Spectrogram of synthetic speech using the constrained Fourier representation with n_F = 23.

Fig. 8. Spectrogram of synthetic speech using the hybrid Fourier-Walsh method with n_F = 20 and n_W = 6.

4.3. Remarks

The frequency components in constrained DFT representation methods were mostly picked from the low-frequency band (i.e., within 2 kHz). Additionally, musical noise was detected in the reconstructed speech, which is due to the spectral gaps produced by the peak-picking approach. The introduction of sequency components, in addition to the frequency components, eliminated the musical noise and resulted in improved speech quality. This improvement was more prominent for 'low-pass' speech records. Also, the proposed method worked somewhat better with female speakers - a result that was expected.

5. Coding at 16 kbit/s and 9.6 kbit/s

5.1. Implementation

In this section, a potential application of the hybrid transform representation, namely speech coding, is presented. Coding is implemented and evaluated at 16 kbit/s and 9.6 kbit/s. The iterative algorithm of Fig. 4 is modified to include quantization of the frequency and sequency components, as shown in Fig. 9. The frequency and sequency components are encoded in each iteration; thus the two transforms operate on residuals that include quantization noise. The idea is to allow each constrained method to minimize the quantization noise embedded in the residual.

The encoding and decoding schemes are shown in Figs. 10 and 11, respectively. The signal is processed as shown in Fig. 5; thus the frame rate is 41.67 Hz. Encoding the frequency components involves coding their magnitudes, phases and locations. The DFT log-magnitudes are encoded using an adaptive quantizing step and a gain, while the DFT phases are encoded linearly. Coding the frequency locations of the principal DFT components directly (8 bits per location) requires a large number of bits. The number of bits can be reduced by using bit-map techniques, and by exploiting the fact that the high-energy frequency components lie, in most cases, within the halfband. Thus, the DFT peak-picking in Fig. 9 is constrained to the lower band, i.e., from 0 to 2 kHz for 16 kbit/s and from 0 to 1.5 kHz for 9.6 kbit/s. The sequency components are encoded in sign-magnitude form, and the sequency locations are encoded using eight bits per component.

Fig. 9. Modified iterative algorithm for Fourier-Walsh representation.

Fig. 10. Analysis and coding.

Fig. 11. Decoding and synthesis.

In the 16 kbit/s implementation, twenty (n_F = 20) frequency components and six (n_W = 6) sequency components are encoded. DFT log-magnitudes are encoded at five bits and the phases at six bits. In addition, the six WHT components are encoded at five bits each. Overhead information, including location bits, amounts to 134 bits per frame.

In the 9.6 kbit/s coding scheme, fourteen (n_F = 14) DFT and four (n_W = 4) WHT components are encoded. DFT log-magnitudes are encoded at four bits and phases at five bits. In addition, the four WHT components are encoded at four bits each. Overhead information, including location bits, amounts to 88 bits per frame.
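The two bit allocations can be checked against the nominal rates using only the numbers quoted above and the 41.67 Hz frame rate; a quick verification:

```python
frame_rate = 8000 / 192            # 41.67 frames per second (Fig. 5 overlap)

# 16 kbit/s: 20 DFT components (5-bit log-magnitudes, 6-bit phases),
# six 5-bit WHT components, plus 134 overhead/location bits per frame.
bits_16k = 20 * (5 + 6) + 6 * 5 + 134    # = 384 bits/frame
# 9.6 kbit/s: 14 DFT components (4-bit log-magnitudes, 5-bit phases),
# four 4-bit WHT components, plus 88 overhead/location bits per frame.
bits_9k6 = 14 * (4 + 5) + 4 * 4 + 88     # = 230 bits/frame

print(bits_16k * frame_rate)       # ~16000 bit/s
print(bits_9k6 * frame_rate)       # ~9583 bit/s
```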

5.2. Results

Experimental results are presented for three ten-second speech records: a female speaker, a male speaker, and a group of speakers speaking simultaneously. In addition, results are given for a ten-second music record for the purpose of testing the robustness (ability to handle non-speech signals) of the coding schemes. All the audio records were bandlimited to 3.2 kHz and sampled at 8 kHz. The encoding/decoding schemes were implemented non-real-time on a mainframe computer using single-precision floating-point arithmetic. Signal-to-noise ratio results are tabulated in Table 1.

Table 1. Performance of the hybrid transform coding at 16 and 9.6 kbit/s

Audio record (10 sec)   9.6 kbit/s SNR/SNRSG/STD (dB)   16 kbit/s SNR/SNRSG/STD (dB)
FEMALE                  16.65/17.61/±3.01               22.18/25.06/±4.32
MALE                    14.37/15.88/±3.07               17.82/22.17/±4.41
GROUP                   14.59/16.02/±3.63               18.05/22.04/±5.48
MUSIC                   11.71/12.96/±4.15               14.17/16.60/±5.98

5.3. Remarks

From Table 1 it is seen that both coding schemes perform better with the female speaker. This is because high-pitch speech can be characterized with fewer frequency components than low-pitch speech. Informal subjective listening tests of the processed speech reveal that 16 kbit/s yields high-quality speech which is indistinguishable from the original. At 16 kbit/s the quality of the processed music is good and noticeably free of artifacts. At 9.6 kbit/s speech is less clear, but highly intelligible. The quality of processed music at 9.6 kbit/s degrades but is still free of artifacts. It is important to note that at both rates the hybrid coding scheme was robust, in the sense that it performed well not only for single speakers but also for multiple speakers and music. Also, from the STD values in Table 1 it is seen that the performance was relatively consistent for all processed frames and there was no breakdown in signal synthesis. These observations were consistent not only for the above records, but also for other phonetically balanced speech records taken from the DARPA-TIMIT database [6].

5.4. Practical considerations

Practical considerations for the hybrid transform analysis/synthesis concern its robustness to transmission errors and its potential for real-time implementation. Although transform-based approaches are generally robust to transmission errors [8], in an actual implementation the perceptually important information must be encoded with some form of error protection. In this particular case, error control methods must be provided to protect the bits associated with the baseband FFT components¹ and their locations. Typical methods for error control include cyclic redundancy checks and convolutional codes [9].

¹ FFT components within 1 kHz are associated with the pitch structure of the voice waveform and hence are perceptually important.

The potential for real-time implementation depends on the computational complexity of the coding algorithm, i.e., the number of operations per processed speech frame. The iterative algorithm involves ten FFTs, ten IFFTs, ten FWHTs and ten IWHTs per processed frame. Even though a significant amount of arithmetic is involved in this process, there is also appreciable redundancy which can be exploited for efficient computation. Firstly, the 256-point speech frames are real valued, and hence they can be computed using 128-point FFTs. Secondly, the selection of frequency components in both coding schemes produces only n_F non-zero frequency components, which always lie within the halfband; therefore pruned FFTs and IFFTs [17] can be used to reduce the computational load.² Redundancies can also be exploited with the sequency components, where the matrix version (instead of the fast version) of the IWHT can be implemented such that it operates only on the non-zero sequencies. An additional consideration is to make use of the fact that the energy in the reconstruction error \| s(p) - \hat{s}^k(p) \|^2 is significantly reduced after the first two iterations. The average decrease in the SNR and the SNRSG values of Table 1, caused by iterating two instead of ten times, was approximately 0.37 dB and 0.42 dB, respectively, while the changes in the STD were insignificant.

Although real-time implementation was not undertaken, the potential for it can be examined by considering floating-point signal processors. A typical floating-point signal processor [19] is capable of performing 12.5 million floating-point operations per second, which corresponds to 299,976 operations per processed frame. For the 16 kbit/s scheme,³ and considering two iterations, the FFTs and IFFTs require a total of 2688 floating-point operations per frame, while the FWHTs and IWHTs require 7168 floating-point operations per frame. From this simple and conservative estimate it is immediately apparent that the rest of the signal processing operations involved (windowing, peak-picking, quantizing, etc.) can be easily accommodated, and therefore both coding schemes could be implemented in real time.

² In the 16 kbit/s scheme the number of arithmetic operations taken by the FFTs is reduced by half using pruned FFTs and IFFTs. The savings are even greater for 9.6 kbit/s.

³ Similar arguments can be made for the 9.6 kbit/s scheme.
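The headroom argument can be restated numerically with the figures from the paragraph above:

```python
budget = 12.5e6 / (8000 / 192)       # ~299,976 floating-point ops per frame
fft_ops = 2688                       # pruned FFTs/IFFTs, two iterations
wht_ops = 7168                       # FWHTs/IWHTs, two iterations
print((fft_ops + wht_ops) / budget)  # ~0.03: about 3% of the per-frame budget
```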

6. Conclusion

A new hybrid transform method for speech analysis/synthesis was presented in this paper. The method uses two transforms, namely the Fourier and the Walsh, for speech representation. An iterative algorithm to determine the frequency and sequency components in a systematic manner was derived. This was shown to have an orthogonalizing effect on the basis vectors, as well as convergence properties similar to those of a Gauss-Seidel iteration. An implementation of the algorithm using the FFT and the FWHT was also given. The Fourier-Walsh analysis/synthesis scheme was implemented and compared to the constrained Fourier scheme at the same information rate. For 'low-pass' speech, perceptual improvements as well as improvements in the signal-to-noise ratio were realized. To demonstrate a potential application, the transform components of the hybrid transform scheme were encoded at 16 kbit/s and 9.6 kbit/s. High-quality speech which was indistinguishable from the original was produced at 16 kbit/s. At 9.6 kbit/s speech was highly intelligible. The hybrid transform coding scheme proved robust in the sense that it performed well for non-speech signals, which is not generally the case with traditional source-system vocoders. Although the computational load of the proposed method is relatively high, real-time implementation on a floating-point signal processor is feasible.

Appendix A. Analysis of the iterative algorithm

The minimization of (5), given that the basis functions are selected, results in the set of equations given in (7) and (8). These can be written as

\bar{S}_{FW} = R_{FW} S_{FW},   (A.1)

where

\bar{S}_{FW} = [\langle s(t), F_1(t) \rangle \; \langle s(t), F_2(t) \rangle \; \cdots \; \langle s(t), F_{n_F}(t) \rangle \; \langle s(t), W_1(t) \rangle \; \langle s(t), W_2(t) \rangle \; \cdots \; \langle s(t), W_{n_W}(t) \rangle]^t,   (A.2)

S_{FW} = [S_F(1) \; S_F(2) \; \cdots \; S_F(n_F) \; S_W(1) \; S_W(2) \; \cdots \; S_W(n_W)]^t,   (A.3)

R_{FW} = \begin{bmatrix}
1 & \cdots & 0 & \langle W_1^*(t), F_1(t) \rangle & \cdots & \langle W_{n_W}^*(t), F_1(t) \rangle \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & 1 & \langle W_1^*(t), F_{n_F}(t) \rangle & \cdots & \langle W_{n_W}^*(t), F_{n_F}(t) \rangle \\
\langle F_1^*(t), W_1(t) \rangle & \cdots & \langle F_{n_F}^*(t), W_1(t) \rangle & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
\langle F_1^*(t), W_{n_W}(t) \rangle & \cdots & \langle F_{n_F}^*(t), W_{n_W}(t) \rangle & 0 & \cdots & 1
\end{bmatrix}.   (A.4)

R_{FW} is of the form

R_{FW} = \begin{bmatrix} I_{n_F} & C \\ C^H & I_{n_W} \end{bmatrix},   (A.5)

where I_{n_F} and I_{n_W} denote n_F \times n_F and n_W \times n_W identity matrices, respectively. Additionally, R_{FW} can be written as

R_{FW} = D + L + U,   (A.6)

where

D = I_{n_F + n_W},   (A.7)

L = \begin{bmatrix} 0 & 0 \\ C^H & 0 \end{bmatrix},   (A.8)

U = \begin{bmatrix} 0 & C \\ 0 & 0 \end{bmatrix},   (A.9)

and the 0 are null matrices of appropriate orders. Using (A.6) we can write

(D + L) S_{FW} = -U S_{FW} + \bar{S}_{FW},   (A.10)

and the iterative form of this equation is equivalent to the Gauss-Seidel method of solving the linear equations of (A.1). This is given by

(D + L) S_{FW}^{k+1} = -U S_{FW}^{k} + \bar{S}_{FW}.   (A.11)

The above equation is the matrix form of (11) and (12). Clearly R_{FW} is Hermitian and also non-negative definite. If R_{FW} is positive definite, the iterative algorithm (11)-(12) converges to

S_{FW} = R_{FW}^{-1} \bar{S}_{FW}   (A.12)

for any initial vector S_{FW}^0.
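The Gauss-Seidel equivalence can be checked numerically on a small example. The sketch below builds R_FW and the right-hand side from random-signal inner products with subsets of DFT and WHT basis vectors (skipping the shared all-ones vector so that R_FW stays positive definite) and verifies that the sweeps of (A.11) converge to the direct solution (A.12); the sizes are arbitrary.

```python
import numpy as np
from scipy.linalg import hadamard

N, n_f, n_w = 32, 5, 3
F = (np.fft.fft(np.eye(N)) / np.sqrt(N))[1:n_f + 1]   # n_f DFT basis vectors
Wm = (hadamard(N) / np.sqrt(N))[3:3 + n_w]            # n_w WHT basis vectors
B = np.vstack([F, Wm])                                # stacked basis vectors
s = np.random.randn(N)

R = B @ B.conj().T                  # R_FW: unit diagonal, cross-term blocks C
b = B @ s                           # \bar{S}_FW of eq. (A.2)

S = np.zeros(n_f + n_w, dtype=complex)                # S_FW^0 = 0
for _ in range(200):                                  # Gauss-Seidel sweeps (A.11)
    for j in range(len(S)):
        S[j] = b[j] - R[j] @ S + R[j, j] * S[j]       # uses updated entries
assert np.allclose(S, np.linalg.solve(R, b))          # agrees with (A.12)
```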

Appendix B. Orthogonalizing properties of the iterative algorithm

The iterative algorithm in (11)-(14) has an orthogonalizing effect. Equation (16) essentially forces each vector of the second set, namely W_j, to become orthogonal to all of the vectors of the first set, i.e., F_i, i = 1, ..., n_F. This argument can be proved easily by showing that (16) is essentially equivalent to the first step of a Gram-Schmidt (GS) process. An analogous argument can be proved for (15). The GS procedure obtains a set of orthogonal vectors (v_j) from a set of independent vectors (\alpha_j) using the following expression:

v_j = \alpha_j - \frac{v_1^H \alpha_j}{v_1^H v_1} v_1 - \cdots - \frac{v_{j-1}^H \alpha_j}{v_{j-1}^H v_{j-1}} v_{j-1}.   (B.1)

The first n_F vectors F_i, i = 1, ..., n_F, are orthonormal, thus

v_j = F_j^*, \quad j = 1, ..., n_F.   (B.2)

Continuing with the jth vector, W_j, of the second set, we get

v_{n_F + 1} = W_j - F_1^* F_1^t W_j - \cdots - F_{n_F}^* F_{n_F}^t W_j.   (B.3)

Taking the complex conjugate transpose of (B.3) and post-multiplying by the signal vector s(p), we get

S_W(j) = W_j^t s(p) - \sum_{i=1}^{n_F} W_j^t F_i^* F_i^t s(p),   (B.4)

which is equivalent to (16). A similar process can be followed to prove that (15) has an orthogonalizing effect.
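For completeness, (B.1) is the classical Gram-Schmidt step; a minimal sketch:

```python
import numpy as np

def gram_schmidt(columns):
    """Orthogonalize the columns a_j of a matrix by eq. (B.1):
    v_j = a_j - sum_i (v_i^H a_j / v_i^H v_i) v_i, for i < j."""
    A = np.asarray(columns, dtype=complex)
    V = np.zeros_like(A)
    for j in range(A.shape[1]):
        v = A[:, j].copy()
        for i in range(j):
            v -= (V[:, i].conj() @ A[:, j]) / (V[:, i].conj() @ V[:, i]) * V[:, i]
        V[:, j] = v
    return V

V = gram_schmidt(np.random.randn(8, 4))
# The orthogonalized columns have a diagonal Gram matrix.
G = V.conj().T @ V
assert np.allclose(G, np.diag(np.diag(G)))
```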

References

[1] N. Ahmed and K.R. Rao, Orthogonal Transforms for Digital Signal Processing, Springer, New York, 1975.

[2] L. Almeida and F.M. Silva, "Variable-frequency synthesis: An improved harmonic coding scheme", Proc. Internat. Conf. Acoust. Speech Signal Process. 84, San Diego, 1984, p. 37.6.1.

[3] M. Berouti et al., "Enhancement of speech corrupted by additive noise", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 79, April 1979, pp. 208-221.

[4] P.C. Ching et al., "Walsh transform coding of the speech residual in RELP coders", IEE Proc., Vol. 131, Pt. G, No. 1, 1984, pp. 29-34.

[5] J.L. Flanagan et al., "Speech coding", IEEE Trans. Commun., Vol. COM-27, No. 4, April 1979, pp. 710-737.

[6] J.S. Garofolo et al., "DARPA CD-ROM: An acoustic phonetic continuous speech database, Training set: 420 talkers, 4200 sentences", December 1988.

[7] D.W. Griffin and J.S. Lim, "Multiband excitation vocoder", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-36, No. 8, August 1988, pp. 1223-1235.

[8] N.S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice-Hall, Englewood Cliffs, NJ, 1984.

[9] S. Lin and D. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, Englewood Cliffs, NJ, 1983.

[10] John Makhoul et al., "Vector quantization in speech cod- ing", Proc. IEEE, Vol. 73, No. 11, November 1985, pp. 1551-1588.

[11] J.D. Markel and A.H. Gray, Jr., Linear Prediction of Speech, Communications and Cybernetics 12, Springer, New York, 1976.

[12] R.J. McAulay and T.F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-34, No. 4, August 1986, pp. 744-754.

[13] J. Pearl, "Basis restricted transformations and performance measures for spectral representation", IEEE Trans. Inform. Theory, Vol. IT-17, 1971, pp. 751-752.

[14] M.R. Schroeder and B.S. Atal, "Code-excited linear prediction (CELP): High quality speech at very low bit rates", Proc. Internat. Conf. Acoust. Speech Signal Process. 85, Tampa, Florida, April 1985, pp. 937-940.

[15] F. Shum et al., "Speech processing with Walsh-Hadamard transforms", IEEE Trans. Audio Electroacoust., Vol. AU-21, No. 3, June 1973, pp. 174-178.

[16] A.S. Spanias, "A hybrid model for speech synthesis", Proc. IEEE Internat. Symposium on Circuits and Systems (ISCAS-90), Vol. 2, New Orleans, May 1990, pp. 1521-1524.

[17] T.V. Sreenivas and P.V.S. Rao, "FFT algorithm for both input and output pruning", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-27, June 1979, p. 291.

[18] Gilbert Strang, Linear Algebra and its Applications, Academic Press, 1980.

[19] AT&T DSP-32 Digital Signal Processor Development System, User manual.
