
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 3, MARCH 2007 823

Multicomponent AM-FM Representations: An Asymptotically Exact Approach

Francesco Gianfelici, Giorgio Biagetti, Member, IEEE, Paolo Crippa, Member, IEEE, and Claudio Turchetti, Member, IEEE

Abstract: This paper presents, on the basis of a rigorous mathematical formulation, a multicomponent sinusoidal model that allows an asymptotically exact reconstruction of nonstationary speech signals, regardless of their duration and without any limitation in the modeling of voiced, unvoiced, and transitional segments. The proposed approach is based on the application of the Hilbert transform to obtain an amplitude signal from which an AM component is extracted by filtering, so that the residue can then be iteratively processed in the same way. This technique permits a multicomponent AM-FM model to be derived in which the number of components (iterations) may be arbitrarily chosen. Additionally, the instantaneous frequencies of these components can be calculated with a given accuracy by segmentation of the phase signals. The validity of the proposed approach has been verified through applications to both synthetic signals and natural speech. Several comparisons show that this approach almost always outperforms current best practices, and does not need the complex filter optimizations required by other techniques.

Index Terms: AM-FM speech model, envelope estimation, Gabor signal, Hilbert transform, multicomponent modeling, sinusoidal model.

I. INTRODUCTION

SINUSOIDAL models, as defined by McAulay and Quatieri in [1], are highly parametric representations of speech signals, based on physiologic properties of speech production and perception. This characterization can be assimilated to the joint action of both amplitude modulation (AM) and frequency modulation (FM), where neither the carriers nor the amplitude envelopes and the instantaneous frequencies (IFs) are known, and therefore need to be estimated. Parametric representations of the above kind can be classified on the basis of the number of components, as: 1) monocomponent or 2) multicomponent models. This classification directly affects the number of envelopes and IFs that need to be estimated, and finally the demodulation technique that has to be used.

The theory of the monocomponent representation is well established, and a large number of demodulation techniques based on different approaches have been developed in the last decade. The relevance assumed by the Teager-Kaiser operator [2] and the Hilbert transform [3] makes them the current best approaches. Both techniques use ad hoc (low-pass and bandpass) filters in order to regularize the large variations that affect estimation of both the envelope and the IF of the signal. In fact, these techniques perform well in stationary signal modeling (e.g., in the modeling of artificially synthesized signals) where parameter variations are small, but they are not adequate for nonstationary signals (such as speech signals), mainly because of the large and fast excursions of the pitch period (inverse of fundamental frequency) that typically occur in such signals. The performance of the aforementioned techniques can be enhanced, at least on some signal subparts, by accurately filtering and windowing the signal before the demodulation process. Nevertheless, the approximate nature of parameter extraction in sinusoidal modeling does not permit an exact reconstruction of the original signal. This aspect causes considerable difficulty in the modeling of signal subparts such as transitions between phonemes, where the nonstationary nature of signals determines large variations in signal dynamics (attacks and closures), variations that in turn produce some well-known undesired phenomena such as pre-echoes and distortions. Additionally, model parameters are highly sensitive to frame segmentation, so that when long frames are used, the time resolution is inadequate for capturing signal dynamics such as attack transients. On the other hand, when short frames are used, the degradation affects frequency resolution. In both cases, estimation of sinusoidal components becomes difficult, as stated by Goodwin in [4]. Therefore, the assumption on which the sinusoidal model is based, that is, that model parameters are slowly time-varying quantities, is difficult to satisfy in every frame or in transitions between adjacent ones [5].

Manuscript received August 25, 2005; revised August 30, 2006. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Rainer Martin.

The authors are with the Dipartimento di Elettronica, Intelligenza Artificiale e Telecomunicazioni (DEIT), Università Politecnica delle Marche, I-60131 Ancona, Italy.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASL.2006.889744

The above limitations have led to the development of models that generalize the Quatieri-McAulay model, usually based on mixed approaches such as the exponential sinusoidal model (ESM) [6]-[8], exponentially damped sinusoids (EDSs) [9], damped delayed sinusoids (DDSs) [10], [11], and partial damped and delayed sinusoids (PDDSs) [5]. In these cases, model parameters are estimated by means of approximate techniques, which allow the control of modeling error and, under adequate conditions, a multicomponent characterization of signals. The limitations of the Quatieri-McAulay model, previously described for monocomponents, also exist for multicomponents.

The great interest in the theory of multicomponent modeling and the absence of a rigorous closed-form formulation of this problem represent the starting point for the formalization and development of specific approaches and suitable techniques. An accurate description of the currently adopted best approaches is given in [12]. A recent development of one of these techniques can be found in [13], where the parameters of the sinusoidal components are estimated by means of likelihood maximization over the windowed signal. Moreover, Huang et al. [14] proposed a different analysis technique suited for nonlinear and nonstationary data, the empirical mode decomposition (EMD), which is based on the iterative extraction of intrinsic modes, followed by the application of the Hilbert transform to compute a spectrum from them. This iterative technique has been applied to many different fields, such as seismology, oceanography, and the processing of biological data, but its applicability to speech processing has not yet received much attention in the literature. Another interesting iterative approach [15] identifies pure sinusoids immersed in noise by means of iterated filtering.

1558-7916/$25.00 © 2007 IEEE

In this paper, we present an innovative iterative approach to compute an asymptotically exact multicomponent sinusoidal model of speech signals, based on the iterated application of the Hilbert transform to a filtered version of the amplitude envelope, and on the exact computation of these amplitudes and associated phases. The algorithm can be applied to signals without limitations on their duration, the number of components to be extracted, and the desired modeling accuracy both for stationary and transient signal portions. Finally, for the purpose of completing the FM decomposition, an a posteriori adaptive segmentation algorithm is used to extract arbitrarily accurate instantaneous frequencies from the phase signals previously obtained.

This paper is organized as follows. In Section II, a brief presentation of monocomponent and multicomponent sinusoidal models, their extensions, and their associated extraction techniques, is given. In Section III, the mathematical formulation of an iterative approach for accurate demodulation of multicomponent AM-FM signals, based on an iterative application of the Hilbert transform, is presented. In Section IV, an adaptive segmentation algorithm based on linear regression, for piecewise-constant IF calculation, is introduced. In Section V, the proposed technique is applied to synthetic signals in order to demonstrate its behavior and to compare its performance with several other methods. In Section VI, some examples of the application of the proposed technique to natural speech signals are shown. Finally, Section VII concludes this paper.

II. SINUSOIDAL MODELING

A sine wave, whose amplitude and instantaneous phase are time-varying quantities, can be considered as a monocomponent AM-FM signal. Although this representation is in principle able to represent arbitrary signals, it will be unsatisfactory when the source is known to contain a mixture of components. In this case, it is more suitable to consider a model composed of the superposition of signals of this kind. This gives rise to what is generally known as a multicomponent AM-FM signal.

In the following paragraphs, a brief résumé of both mono- and multicomponent modeling techniques is given.

A. Monocomponent Model

A monocomponent AM-FM signal is a sine wave defined by

$$s(t) = a(t)\cos\varphi(t) \tag{1}$$

where $a(t)$ and $\varphi(t)$ represent the amplitude and the instantaneous phase, respectively. It is worth noting that the derivative of $\varphi(t)$ represents the IF $\omega(t)$.

The best performing techniques for estimating the AM and the FM modulating signals are based on: 1) the Teager energy operator [16] and 2) the Hilbert transform [3].

The first technique, called the discrete energy separation algorithm (DESA), is a nonlinear differential approach best suited for narrowband signals, although it can be generalized to large frequency deviations as has recently been proposed [17].

Its mathematical formulation is defined in the discrete time-domain, according to the notation used in [17], as

(2)

where the derivative operation, which is part of the Teager-Kaiser energy operator, is approximated by the symmetric difference. In this case, the parameters of the discrete-time AM-FM signal, i.e., its envelope and its IF, are calculated as in [16]

(3)

and

(4)
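As an illustration of the energy-separation idea, the following is a minimal sketch assuming the DESA-2 variant, which is the member of the DESA family built on the symmetric difference. The function names `teager_kaiser` and `desa2` are hypothetical, and the exact form of (2)-(4) in [16], [17] may differ in notation; no smoothing or regularization of near-zero energies is included.

```python
import numpy as np

def teager_kaiser(x):
    """Discrete Teager-Kaiser energy psi[n] = x[n]^2 - x[n-1]*x[n+1] (interior samples only)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x):
    """DESA-2 style estimates (assumed variant) of the envelope and of the
    frequency in rad/sample, returned for samples n = 2 .. len(x) - 3."""
    x = np.asarray(x, dtype=float)
    y = x[2:] - x[:-2]                 # symmetric difference x[n+1] - x[n-1]
    psi_x = teager_kaiser(x)[1:-1]     # trimmed to share the index range of psi_y
    psi_y = teager_kaiser(y)
    # Clipping guards arccos against rounding; real data also needs a guard for psi_x ~ 0.
    omega = 0.5 * np.arccos(np.clip(1.0 - psi_y / (2.0 * psi_x), -1.0, 1.0))
    amplitude = 2.0 * psi_x / np.sqrt(psi_y)
    return amplitude, omega

# Usage: a pure sinusoid with amplitude 2 and frequency 0.3 rad/sample
n = np.arange(200)
amp, om = desa2(2.0 * np.cos(0.3 * n + 0.7))
```

For a constant-amplitude, constant-frequency sinusoid these formulas are exact up to rounding, which is why such a signal makes a convenient sanity check; on noisy or wideband signals the estimates fluctuate strongly, which is the motivation for the filtering discussed in the text.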

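The second technique is based on the Gabor analytic representation of the signal $s(t)$, which makes use of the Hilbert transform [3]. Namely, let $z(t)$ be the complex signal defined as

$$z(t) = s(t) + j\,\mathcal{H}[s(t)] \tag{5}$$

where the quadrature signal $\mathcal{H}[s(t)]$ is the Hilbert transform of $s(t)$. The signal $z(t)$ can be equivalently defined through the Fourier transform (FT) as

$$Z(f) = \begin{cases} 2S(f), & f > 0 \\ S(f), & f = 0 \\ 0, & f < 0 \end{cases} \tag{6}$$

with $Z(f)$ and $S(f)$ being the FTs of $z(t)$ and $s(t)$, respectively. In this case, the envelope $a(t)$ and the IF $\omega(t)$ are given by

$$a(t) = |z(t)| = \sqrt{s^2(t) + \mathcal{H}[s(t)]^2} \tag{7}$$

and

$$\omega(t) = \frac{d}{dt}\arg z(t) \tag{8}$$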
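The Hilbert-transform route to envelope and IF can be sketched numerically as follows. This is an illustrative implementation, not the authors' code: the analytic signal is built in the frequency domain by zeroing negative-frequency bins and doubling positive ones, and the IF is taken as a finite-difference derivative of the unwrapped phase. The helper names and the sampling rate `fs` are assumptions of this sketch.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic (Gabor) signal: keep DC (and Nyquist), double positive bins."""
    x = np.asarray(x, dtype=float)
    n = x.size
    gain = np.zeros(n)
    gain[0] = 1.0
    gain[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
    return np.fft.ifft(np.fft.fft(x) * gain)

def envelope_and_if(x, fs):
    """Return envelope |z|, unwrapped phase arg z, and IF in Hz."""
    z = analytic_signal(x)
    envelope = np.abs(z)
    phase = np.unwrap(np.angle(z))
    inst_freq = np.gradient(phase) * fs / (2.0 * np.pi)
    return envelope, phase, inst_freq

# Usage: a 100 Hz carrier with a slow 4 Hz amplitude modulation
fs = 1024
t = np.arange(fs) / fs
a = 1.0 + 0.5 * np.cos(2 * np.pi * 4 * t)
env, phase, f_inst = envelope_and_if(a * np.cos(2 * np.pi * 100 * t), fs)
```

The test signal is periodic in the analysis window and its modulation is much slower than its carrier, so the envelope and IF are recovered essentially exactly; speech signals do not satisfy these conditions, which is where the smoothing filters mentioned next come in.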

Both the above techniques require suitable low-pass and/or bandpass filters to reduce the large variations in the FM parameters that would arise from direct application of these parameter extraction algorithms. An accurate comparison between algorithms using these two approaches, such as the energy operator separation algorithm (EOSA), the smoothed energy operator separation algorithm (SEOSA), and the Hilbert transform separation algorithm (HTSA), can be found in [18].


B. Multicomponent Model

A multicomponent AM-FM sinusoidal model [12] of the signal $s(t)$ can be represented as

$$s(t) = \sum_{k=1}^{K} a_k(t)\cos\big(2\pi f_k t + \varphi_k(t)\big), \qquad 0 \le t \le T \tag{9}$$

where $T$ is the total signal duration, $K$ is the number of components, $a_k(t)$ is the amplitude envelope, $f_k$ is the center frequency (or frequency centroid), and $\varphi_k(t)$ is the instantaneous phase of the $k$th component, which is also called resonance [18]. Generally, $a_k(t)$ is required to be a slowly time-varying signal, and $f_k$ should be constant in the time domain. Because it is impractical to consider $f_k$ as constant over the full signal length, using a slowly varying or piecewise-constant function is generally accepted. For this purpose, the total time span $[0, T]$ is divided into $M$ intervals $I_i$, $i = 1, \ldots, M$, and a constant frequency centroid is defined in each of them, so that (9) can be rewritten as

$$s(t) = \sum_{k=1}^{K} a_k(t)\cos\big(2\pi f_k(t)\, t + \varphi_k(t)\big) \tag{10}$$

with

$$f_k(t) = f_{k,i}, \qquad t \in I_i \tag{11}$$

where $f_{k,i}$ are the center frequencies inside the intervals $I_i$. These intervals can generally have different lengths, and depending on how they are obtained (i.e., by fixed-length windowing, by segmentation algorithms, or with a combination of these techniques) different relationships exist between their lengths. As a direct consequence of this characterization, in nonstationary signals, such as speech signals, a smooth behavior of the model parameters in transitions between interval boundaries cannot be guaranteed.

The multicomponent sinusoidal models inherit the limitations of the monocomponent models, of which they can be considered a generalization. Additionally, the need to identify the components in the frequency domain in a systematic way makes the multicomponent models more complex. Generally, the effectiveness of these models is related to the type of signal, its properties, the required accuracy, and the validity of nonobvious assumptions.

According to the characterization proposed in [12], the vast majority of demodulation methodologies for a multicomponent AM-FM signal are based on the following techniques:

1) state space estimation;
2) Hankel and Toeplitz matrices;
3) linear prediction;
4) energy demodulation;
5) maximum-likelihood estimation.

Additionally, a structure based on a phase-locked loop (PLL) can be used to separate two components, as in the work of Bar-Ness et al. [19], where two pairs of in-phase and quadrature signals are used to extract the amplitude envelopes of the components, and feedback loops progressively correct each other's component estimate.

These techniques generally provide only an approximate reconstruction of the original signal. The main limitations derive from the nonexact nature of parameter extraction and the restrictions imposed by the assumptions on which the solutions are based. Letting $\hat{s}(t)$ be the estimation of the original signal $s(t)$, it is thus possible to write

$$\hat{s}(t) = \sum_{k=1}^{K} \hat{a}_k(t)\cos\big(2\pi \hat{f}_k t + \hat{\varphi}_k(t)\big) \tag{12}$$

where $\hat{a}_k(t)$, $\hat{f}_k$, and $\hat{\varphi}_k(t)$ are the estimated parameters. The residue, or the modeling error, is

$$e(t) = s(t) - \hat{s}(t) \tag{13}$$

It is generally well known that these modeling errors are critical as they cause the appearance of pre-echoes and distortions. On the one hand, $e(t)$ contains information about events that are localized in the time domain$^1$ and are critical in the modeling because they are not taken into account by the parameter extraction techniques. On the other hand, reducing the modeling error $e(t)$ requires the windowing to be done on a large timescale for stationary parts and a small timescale for transitions. These windows are determined before (and independently of) parameter extraction, and they are generally based on empirical techniques [4]. The empirical nature of these techniques does not guarantee the satisfaction of the basic assumption of the sinusoidal model, i.e., the slow variability inside frames and during transients. In order to overcome these limitations, generalizations of the sinusoidal model have been developed, such as the ESM, the exponentially damped sinusoids (EDS), the DDS [10], [11], and the PDDS [5]. These techniques, after performing a segmentation of the signal into very short time intervals, model the large variations in the signal dynamics by means of exponential functions of time. The increase in the number of parameters inevitably degrades the accuracy of their estimation. Therefore, the overall effectiveness of the model would be reduced for the sole purpose of improving parameter estimation during transitions.

III. MULTICOMPONENT ASYMPTOTICALLY EXACT AM-FM DECOMPOSITION

With the above considerations in mind, it can be stated that a multicomponent model ought to be extracted by using an approach that is able to determine the parameters step by step, allowing an exact signal reconstruction after each step, and without any constraint on the characteristics of the signal to be decomposed. In this section, we present an iterative decomposition method [21], developed to suit these requirements, and show how the modeling error rapidly vanishes as the iteration proceeds.

$^1$ It is well known that signal transformations in the time domain are very critical and that the arbitrary elimination of components, even those with low energy, could impair the intelligibility of the original signal [20].


Fig. 1. Terms $a_0(t)$ (thin line) and $\bar{a}_0(t)$ (thick line) for the sample speech signal, the Italian word "settimana," that will be used next. The term $\tilde{a}_0(t)$ can be obtained as the difference between $a_0(t)$ and $\bar{a}_0(t)$ according to the filter formulation.

A. Mathematical Formulation

Let $s(t)$ be a generic speech signal. By virtue of (5) it can be written as

$$s(t) = \mathrm{Re}\{z(t)\} = a(t)\cos\varphi(t) \tag{14}$$

where $\mathrm{Re}\{\cdot\}$ denotes the real part of a complex number, $a(t) = |z(t)|$, and $\varphi(t) = \arg z(t)$. Our aim here is to derive a multicomponent decomposition of $s(t)$ by means of iterated applications of representations like (14) to the amplitude component. Let us denote by the index $j$ the generic terms corresponding to those in (14) for the $j$th iteration of the decomposition procedure. In the case of speech signals, and in general for every signal that contains a mixture of sinusoids, the amplitude $a_j(t)$ of the Gabor signal $z_j(t)$ exhibits an oscillating behavior. Of course, since $a_j(t)$ is always nonnegative, these oscillations do not occur around the origin, and $a_j(t)$ is thus unsuited for direct treatment with further Hilbert transforms. Before performing another transform, it is in fact necessary to separate its trend $\bar{a}_j(t)$ [22] from the alternating component $\tilde{a}_j(t)$, by means of a suitable adaptive filtering algorithm acting upon $a_j(t)$ itself, so that $a_j(t) = \bar{a}_j(t) + \tilde{a}_j(t)$ (as is depicted in Fig. 1) and the residual $\tilde{a}_j(t)$ is a zero-mean oscillating signal that can thus be iteratively decomposed as in (14). The filter used to obtain the residual can be defined in several different ways, but to guarantee convergence it must behave as a high-pass filter designed so that only a prescribed fraction of the total signal energy is kept in the alternating component, as will be shown next.

Formally, starting with $j = 0$ as the first step of this iterative algorithm, we can write

$$s(t) = a_0(t)\cos\varphi_0(t) = \bar{a}_0(t)\cos\varphi_0(t) + \tilde{a}_0(t)\cos\varphi_0(t) \tag{15}$$

where $a_0(t) = \bar{a}_0(t) + \tilde{a}_0(t)$. By denoting with $z_1(t)$ the (complex) Gabor signal associated with the alternating component $\tilde{a}_0(t)$, it is possible to proceed with the decomposition by using the relations

$$a_1(t) = |z_1(t)|, \qquad \varphi_1(t) = \arg z_1(t) \tag{16}$$

so that

$$\tilde{a}_0(t) = a_1(t)\cos\varphi_1(t) \tag{17}$$

which, once placed inside (15), yields

$$s(t) = \bar{a}_0(t)\cos\varphi_0(t) + a_1(t)\cos\varphi_1(t)\cos\varphi_0(t) \tag{18}$$

Having made use of the Werner trigonometric formula for the cosine product, we thus obtain

(19)

where

(20)
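As an illustration of how the Werner formula turns the product of cosines into a sum of sinusoidal components, the first two iterations can be worked out explicitly. The notation is an assumption of this sketch and may differ from the symbols of the numbered equations: $\bar{a}_j$ denotes the low-pass trend and $\tilde{a}_j$ the alternating part of the amplitude $a_j$, with $a_j = \bar{a}_j + \tilde{a}_j$ and $\tilde{a}_j = a_{j+1}\cos\varphi_{j+1}$.

```latex
% Worked expansion under the assumed notation (time argument omitted).
\begin{align*}
s &= \bar{a}_0\cos\varphi_0 + a_1\cos\varphi_1\cos\varphi_0 \\
\cos\alpha\cos\beta &= \tfrac{1}{2}\left[\cos(\alpha+\beta) + \cos(\alpha-\beta)\right]
  && \text{(Werner formula)} \\
s &= \bar{a}_0\cos\varphi_0
   + \tfrac{\bar{a}_1}{2}\left[\cos(\varphi_1+\varphi_0) + \cos(\varphi_1-\varphi_0)\right]
   + \tilde{a}_1\cos\varphi_1\cos\varphi_0 \\
\tilde{a}_1\cos\varphi_1\cos\varphi_0 &= a_2\cos\varphi_2\cos\varphi_1\cos\varphi_0 \\
\cos\varphi_2\cos\varphi_1\cos\varphi_0 &= \tfrac{1}{4}\sum_{\sigma_0,\sigma_1=\pm 1}
   \cos(\varphi_2 + \sigma_1\varphi_1 + \sigma_0\varphi_0)
\end{align*}
```

Under this bookkeeping the envelope extracted at iteration $j$ contributes $2^j$ cosine terms (1, 2, 4, ...), which is one way to read the geometric growth of the number of components.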

A detailed description of the subsequent steps will be given in the Appendix, where it will be shown that the number of components increases geometrically with the number of iterations. Letting this latter number be $N$, and using the generalization of (17)

(21)

it results in

(22)

This is a generalized multicomponent sinusoidal model, in which the phases can be iteratively computed as

(23)
(24)

A remarkable property holds for this signal representation: as $N$ increases, the last term in (22)

(25)

rapidly vanishes. To prove this asymptotic behavior, note that the alternating component is the high-pass filtered counterpart of the amplitude from which it is extracted, so that by means of the Parseval equality

(26)

we can write

(27)

in terms of the Fourier transforms of the corresponding signals and of the transfer function of the high-pass filter. Let the filter energy loss be defined as

(28)

For a given energy loss, it is therefore possible to adaptively design the filter so that its transfer function retains a prescribed fraction of the signal energy. We can thus write

(29)

Since the Hilbert transform preserves energy, the Gabor signal energy will be twice that of the original signal, i.e.,

(30)

so that

(31)

hence, under the stated condition on the retained energy fraction, the energy of the alternating component tends to zero as the iteration proceeds. Finally, from (25)

(32)

it follows that the last term in (22) vanishes. Having shown that

(33)

(22) can be rewritten as

(34)

where

(35)

and

(36)

Equation (34) is, by virtue of (33), an asymptotically exact decomposition of the signal $s(t)$ in terms of amplitude and phase envelopes. In Section VI, the convergence behavior will be discussed in more depth, and some examples of the truncation error as a function of $N$, showing that good approximations are achieved even with low values of $N$, will be reported.
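The structure of the convergence argument can be sketched as follows. This is a reconstruction under assumed notation, not a transcription of (26)-(33): $E[x] = \int x^2(t)\,dt$ denotes signal energy, $\rho$ is the fraction of energy that the high-pass filter keeps in the alternating part (taken constant across iterations for simplicity), and $r_N$ is the remainder term left after $N$ iterations.

```latex
% Sketch of the energy argument under the assumptions stated above.
\begin{align*}
E[\tilde{a}_j] &= \rho\,E[a_j]
  && \text{(filter design, via Parseval)} \\
E[a_{j+1}] &= E[\,|z_{j+1}|\,] = 2\,E[\tilde{a}_j] = 2\rho\,E[a_j]
  && \text{(Hilbert transform preserves energy)} \\
E[a_j] &= (2\rho)^j\,E[a_0] \\
|r_N(t)| &= |\tilde{a}_{N-1}(t)|\prod_{i=0}^{N-1}|\cos\varphi_i(t)| \le |\tilde{a}_{N-1}(t)| \\
E[r_N] &\le \rho\,(2\rho)^{N-1}E[a_0] \;\longrightarrow\; 0
  \quad \text{as } N \to \infty, \text{ provided } \rho < \tfrac{1}{2}
\end{align*}
```

In this simplified bookkeeping the remainder energy decays geometrically, which is consistent with the requirement that the high-pass filter keep only a limited fraction of the energy in the alternating component.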

Fig. 2. Scheme of the complete envelope and phase extraction algorithm.

B. Algorithmic Formulation

A sketch of the algorithm flow for the implementation of the signal representation (34) is depicted in Fig. 2. The basic iteration computes amplitude and phase of the Gabor signal, obtained through Hilbert transformation, and then decomposes the amplitude by filtering, as the sum of amplitude envelope (low-pass) and amplitude residue (high-pass). The latter must be further decomposed iteratively. The extracted amplitude envelope is ready to enter the model, while the (elementary) phase needs to be combined with the previously extracted ones. Specifically, the last extracted component needs to be added to all the possible linear combinations with ±1 coefficients of the previously extracted components in order to reflect the composition rule for cosines stated in (23) and (24).
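The combination rule for the elementary phases can be illustrated with a short sketch. The function name `combined_phases` is hypothetical, and the code only demonstrates the underlying trigonometric identity: a product of cosines of the elementary phases equals a scaled sum of cosines of the last phase plus every ±1 combination of the earlier ones.

```python
import itertools
import math

def combined_phases(elementary):
    """Given elementary phases [phi_0, ..., phi_k] at one time instant, return the
    2**k combined phases phi_k + sum(sigma_i * phi_i), with sigma_i in {+1, -1}."""
    *previous, last = elementary
    return [
        last + sum(sign * phase for sign, phase in zip(signs, previous))
        for signs in itertools.product((1, -1), repeat=len(previous))
    ]

# Usage: check the product-to-sum identity for four elementary phases
phis = [0.3, 1.1, -0.7, 2.0]
thetas = combined_phases(phis)
product_of_cosines = math.prod(math.cos(p) for p in phis)
sum_of_cosines = sum(math.cos(th) for th in thetas) / 2 ** (len(phis) - 1)
```

Because only the elementary phases are needed to generate all combinations, storing one phase per iteration is sufficient, in line with the storage remark that follows.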

After $N$ iterations, the model is thus composed of a number of parameter pairs that grows geometrically with $N$, representing amplitudes and phases of the sinusoidal components. Of course, due to (23), (24), and (36), not all these parameters need to be separately computed or stored. The number of different amplitude envelopes is only $N$, since a single envelope is added after each iteration. Similarly, only $N$ elementary phases suffice to compute all the others.
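The basic iteration described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the adaptive filter is replaced by a fixed brick-wall low-pass in the FFT domain (an assumption of this sketch, controlled by the hypothetical parameter `keep_bins`), and the model is reconstructed in product-of-cosines form rather than as an explicit sum of sinusoids.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic (Gabor) signal of a real sequence."""
    n = x.size
    gain = np.zeros(n)
    gain[0] = 1.0
    gain[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
    return np.fft.ifft(np.fft.fft(x) * gain)

def lowpass(x, keep_bins):
    """Brick-wall low-pass: keep the first `keep_bins` FFT bins (stand-in for the adaptive filter)."""
    spectrum = np.fft.rfft(x)
    spectrum[keep_bins:] = 0.0
    return np.fft.irfft(spectrum, n=x.size)

def iht(s, n_iter, keep_bins):
    """Iterated Hilbert transform: returns per-iteration trends, elementary phases, and the last residue."""
    residual = np.asarray(s, dtype=float)
    trends, phases = [], []
    for _ in range(n_iter):
        z = analytic_signal(residual)
        amplitude = np.abs(z)
        phases.append(np.unwrap(np.angle(z)))
        trend = lowpass(amplitude, keep_bins)   # amplitude envelope (low-pass part)
        trends.append(trend)
        residual = amplitude - trend            # alternating part, decomposed next
    return trends, phases, residual

def reconstruct(trends, phases, residual=None):
    """Sum of trend_j * prod_{i<=j} cos(phi_i); adding the residue term makes it exact."""
    total = np.zeros_like(trends[0])
    carrier = np.ones_like(trends[0])
    for trend, phase in zip(trends, phases):
        carrier = carrier * np.cos(phase)
        total += trend * carrier
    if residual is not None:
        total += residual * carrier
    return total

# Usage: two components, the second one amplitude modulated
n = 2048
t = np.arange(n) / n
s = np.cos(2 * np.pi * 200 * t) + 0.5 * (1 + 0.3 * np.cos(2 * np.pi * 5 * t)) * np.cos(2 * np.pi * 230 * t)
trends, phases, residue = iht(s, n_iter=4, keep_bins=40)
model = reconstruct(trends, phases)             # truncated model, without the residue term
```

With the residue term included the reconstruction matches the input to rounding error after any number of iterations, which mirrors the exact-reconstruction-at-each-step property claimed for the method; how quickly the truncated model converges depends on the filter, which here is only a placeholder.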

IV. ADAPTIVE SEGMENTATION FOR INSTANTANEOUS FREQUENCY CALCULATION

This section shows how to obtain a decomposition in terms of instantaneous frequencies from the phase envelopes derived in Section III.

The model stated in (34)

(37)

is actually an amplitude-phase modulation which can easily be converted to an AM-FM model by letting

(38)

and

(39)

where the two quantities introduced denote the modeling error and the instantaneous frequency estimate, respectively.

Several methods have been proposed in the literature to estimate the IF, and most try to extract instantaneous frequencies that are constant in the time domain (stationary condition) or slowly time-varying (semistationary condition).

In order to better satisfy these conditions, the currently employed techniques split the signal into short intervals, to exploit the semistationary nature of speech frames. Nevertheless, most of these IF estimation techniques, such as short-time Fourier transform (STFT) [23], multiband demodulation analysis (time-varying Gabor filterbank) [24], peak tracking of short time spectra [1], matching pursuit technique [25], and instantaneous frequency attractors [26], are very sensitive to the segmentation method they adopt (windowing, frame division). To alleviate this problem, several adaptive segmentation techniques, operating a subdivision of the signal into intervals over which all the sinusoidal model parameters are to be estimated, have recently been developed [4], [27]-[29].

However, the absence of a direct connection between segmentation and IF extraction necessarily undermines the achievable accuracy.

In the approach presented here, amplitude and phase envelopes can be computed without the need for segmentation. Instead, segmentation is used to extract IFs from the phase envelopes, so that a much simpler algorithm, applied a posteriori to the unwrapped phases of the signal, suffices. We assume the IF to be piecewise constant in a set of time intervals which are adaptively estimated a posteriori, in order to satisfy an upper-bound error

(40)

set by the desired accuracy. From (40), the adaptive segmentation problem can be stated as the problem of finding a set of disjoint time spans that cover the whole interval, so that the IF is constant in each of them and (40) holds.

Fig. 3. Sketch of the adaptive segmentation algorithm for IF extraction.

The proposed algorithm is sketched in Fig. 3 and is based on the search for an appropriate segmentation that satisfies the above requirements. Since the phase is typically a noisy signal, the value that the piecewise-constant IF assumes in each interval can be estimated over finite intervals by linear regression of the phase data, for linear regression is known to be a robust technique for computing instantaneous frequencies from noisy phase signals.

The algorithm starts at the beginning of the signal and computes the derivative of the unwrapped phase by means of linear regression over an interval that begins at the current starting instant, extending its end point until the condition (40) is no longer met. The largest interval that satisfied the condition is recorded as one of the segments; the starting instant is then advanced to the last end point that satisfied the condition, and the procedure is iterated until the end of the signal is reached.

The result obtained in this way is an estimation of the IF, and the modeling error can be made arbitrarily small. The accuracy is of course directly related to the number of intervals produced, and the algorithm easily allows the introduction of a signal-dependent bound to take into account phenomena like pre-echoes and distortions.
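The greedy search described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: condition (40) is assumed here to bound the maximum absolute deviation (in radians) between the unwrapped phase and its regression line by a tolerance `delta`, and the function names are hypothetical.

```python
import numpy as np

def fit_line(t, phase):
    """Least-squares line through (t, phase); returns slope and maximum absolute residual."""
    slope, intercept = np.polyfit(t, phase, 1)
    residual = phase - (slope * t + intercept)
    return slope, float(np.max(np.abs(residual)))

def adaptive_segmentation(t, phase, delta):
    """Greedy segmentation: grow each interval while the linear fit stays within `delta`.
    Returns a list of (start_index, end_index_exclusive, frequency_in_Hz)."""
    n = len(t)
    segments = []
    start = 0
    while n - start >= 2:
        end = start + 2                                  # two samples always fit exactly
        slope, _ = fit_line(t[start:end], phase[start:end])
        while end < n:
            candidate_slope, err = fit_line(t[start:end + 1], phase[start:end + 1])
            if err > delta:
                break                                    # assumed form of (40) violated
            slope, end = candidate_slope, end + 1
        segments.append((start, end, slope / (2.0 * np.pi)))
        start = end
    if start < n and segments:                           # absorb a single leftover sample
        first, _, freq = segments[-1]
        segments[-1] = (first, n, freq)
    return segments

# Usage: a phase whose slope jumps from 50 Hz to 120 Hz after sample 100
t = np.arange(200) / 1000.0
idx = np.arange(200)
phase = np.where(idx <= 100, 2 * np.pi * 50 * t, 2 * np.pi * (5 + 120 * (t - 0.1)))
segments = adaptive_segmentation(t, phase, delta=1e-6)
```

Refitting from scratch at every extension keeps the sketch short but costs quadratic time per segment; a running-sums update of the regression would be the natural optimization. Tightening `delta` yields more and shorter intervals, which reflects the accuracy versus interval-count trade-off noted in the text.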

V. MODEL APPLICATION TO SYNTHETIC SIGNALS AND COMPARISON WITH THE STATE-OF-THE-ART

In this section, the behavior of the proposed modeling technique, based on the iterated Hilbert transform (IHT), is analyzed with applications to a few synthetic signals, which were chosen to validate its effectiveness. Several comparisons of IHT performance with the state of the art are also described in this section. Empirical mode decomposition (EMD), periodic algebraic separation and energy demodulation (PASED), and multiband energy separation algorithm (MESA) were considered for this purpose. The decomposition capabilities of these techniques have been investigated using the same synthetic signals, composed of two components, with and without additive Gaussian noise superposed on them. Moreover, the convergence properties and processing times of the first two techniques, i.e., IHT and EMD, which sequentially extract signal components by means of iterative algorithms, were also analyzed.

Fig. 4. (a) Synthetic signal given by (41) used for testing the decomposition algorithm. (b) The first two amplitude envelopes $\bar{a}_0(t)$ (solid line) and $\bar{a}_1(t)$ (dashed line). (c) The corresponding unwrapped phases $\varphi_0(t)$ (solid) and $\varphi_1(t)$ (dashed).

Let us consider the two-component AM synthetic signal

(41)

with

(42)

where the two carrier frequencies are $f_1 = 500$ Hz and $f_2 = 550$ Hz. This signal is shown in Fig. 4 along with the first two amplitude envelopes and the corresponding phases as extracted by the IHT algorithm. The two amplitude envelopes accurately match the original modulating signals. The phase curves reported in Fig. 4(c) correspond to $\varphi_0(t)$ (solid line) and $\varphi_1(t)$ (dashed line), and they have a mean slope of 500.09 and 49.97 Hz, respectively. The first slope closely matches the carrier frequency $f_1$, while the latter needs to be combined with the former (added, in this case) to obtain the carrier frequency $f_2$ (500.09 + 49.97 = 550.06 Hz), as already explained in Section III-A.

The frequency separation used in the above example is 10%. To highlight how the algorithm behaves as the frequencies of the components become closer, two crossing chirp signals were considered. The resulting synthetic signal is composed of two sinusoids whose frequencies vary linearly with time and cross each other, while their amplitudes are held constant. In formulas

(43)

with

(44)

(45)

where the parameters are the two constant amplitudes, the frequencies delimiting the sweep range, and the sweep duration.

Fig. 5. Amplitudes of the two demodulated chirp components.

Figs. 5 and 6 show the algorithm capabilities in separating amplitudes and IFs, respectively. As the component frequencies sweep (one increasing, the other decreasing) across the range, the 1-to-8 ratio (18 dB) between the two amplitudes is correctly recognized and the components are well separated, except for a small region near the crossing instant, as can be seen in Fig. 5. Fig. 6 shows the IFs of the demodulated components. Dashed straight lines represent the original frequencies of the two chirps. As can be seen, the demodulated components closely follow the chirps also during the crossover, although there is a discontinuity in the labeling at the intersection point. In fact, since these IFs were obtained by combining the elementary components, there is one degree of freedom in the labeling of the IFs, and the labels need to be switched from one IF to the other at the crossover point when trying to track the smaller of the two chirp components. It is important to note that this problem is also common to other algorithms. Indeed, among those considered for the comparison, only PASED does not have this problem and is able to track the components also across the intersection point.


TABLE I. PERFORMANCE COMPARISON OF VARIOUS AM-FM METHODS APPLIED TO NOISELESS SIGNALS ($f_1$ = 500 Hz, $f_2$ = 550 Hz; SIGNAL-TO-NOISE RATIOS SNR$_1$ AND SNR$_2$ ARE MEASURED IN DECIBELS)

TABLE II. PERFORMANCE COMPARISON OF VARIOUS AM-FM METHODS APPLIED TO NOISY SIGNALS ($f_1$ = 500 Hz, $f_2$ = 550 Hz; SIGNAL-TO-NOISE RATIOS SNR$_1$ AND SNR$_2$ ARE MEASURED IN DECIBELS)

Fig. 6. Frequencies of the demodulated chirp components. The dashed straight lines are the frequencies of the two chirps.

In order to better appreciate the validity of the proposed IHT-based approach, a comparison with the performance of EMD, MESA, and PASED, for noise-free and noisy signals, was carried out. As a reference implementation for EMD, Rilling's algorithm [30] was used, which is, to the best of the authors' knowledge, one of the best optimized implementations available. MESA was implemented by means of an ad hoc Gaussian filter bank followed by the standard DESA demodulator borrowed from the PASED algorithm. Finally, the implementation of the PASED algorithm itself was based on Santhanam's original code [12], with the addition of Gaussian filters used to smooth the signal between the algebraic decomposition and energy separation blocks.

The filters used in the MESA and PASED algorithms were centered around the known carrier frequencies, by providing the true values of these, and their bandwidths were selected so that their frequency responses cross each other at the half-peak height, in order to satisfy the optimality criteria described in [31]. It is worth noting that the EMD and IHT algorithms do not require this care in the choice of filters.

A series of two-component AM synthetic signals of the same kind as the one described in (41) was used for this comparison, obtained by varying the component amplitude between 0.1 and 500. For the two components, the signal-to-noise ratios SNR$_1$ and SNR$_2$, defined as

(46)

in terms of each original component and of the extracted component that corresponds to it, are reported in Table I for the four AM-FM methods IHT, EMD, MESA, and PASED, as functions of the amplitude. The same comparison was carried out for noise-corrupted signals, obtained by adding Gaussian noise with a power of one tenth of the signal power (-10 dB) to the second component. These results are shown in Table II.
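A signal-to-noise ratio of this kind can be computed as in the sketch below. The definition used here, the energy of the reference component divided by the energy of its difference from the extracted component and expressed in decibels, is a common convention and an assumption of this sketch; the exact form of (46) may differ.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB between a reference component and its estimate (assumed definition)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

# Usage: a 10% amplitude error on a constant envelope corresponds to about 20 dB
value = snr_db(np.ones(4), 0.9 * np.ones(4))
```

A perfect estimate makes the error energy zero and the ratio unbounded, so practical code would typically guard against a zero denominator.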

As can be seen from Tables I and II, the IHT-based modeling has higher performance than EMD, both in the extraction of AM components and in noise rejection. MESA and PASED have a lower performance than IHT when one component is much stronger than the other, in both noiseless and noisy signals.

Fig. 7. SNR of AM demodulated components as a function of the modulation index and of the amplitude ratio of the two components for the three algorithms IHT, MESA, and PASED.

Another series of tests using two-component AM-FM signals was considered in order to verify the FM demodulation capabilities and the influence of FM on AM component extraction. The signals used are defined as

(47)

with

(48)

(49)

where the amplitudes and carrier frequencies are the same as in (41) and (42), and the remaining parameters are the modulation index and the FM modulating frequency, which was held fixed. Both the AM SNR and the root mean square (rms) error of estimated IFs are shown in Figs. 7 and 8, for IHT, MESA, and PASED.

It is worth noting that IHT has a better SNR in AM-extracted components than that of the other techniques for every considered modulation index, except for the case where the amplitudes of the two components are comparable. In fact, it is well known that energy-based methods work better in this case. Analogously, IHT has a lower rms error in the estimation of the IFs of FM components, apart from the case of comparable amplitudes as previously stated.

Finally, Fig. 9 shows some direct comparisons between the two sequential iterative algorithms. IHT requires less computation time than EMD, with a processing time ratio between EMD and IHT that increases with signal length. Moreover, Fig. 10, where the residual energy is reported as a function of the number of iterations, clearly shows that the asymptotic convergence of IHT is faster than that of EMD, regardless of the number of components.

VI. MODEL APPLICATION TO SPEECH SIGNALS

This section presents a few applications of the IHT algorithm to speech signals of arbitrary length. The signals used are part of the Italian portion of the Multext Prosodic Database [32], which is an extract of the EUROM.1 speech corpus [33] and contains utterances from ten Italian speakers of different sex, age, and geographical origin, who recorded 15 sentences each in an anechoic room, amounting to nearly 7000 words.

Fig. 11 shows the elementary amplitude envelopes and phases obtained by applying the IHT algorithm with five components ($j = 0, \ldots, 4$) to the Italian word "settimana" (which is pronounced /settima:na/ and means "week"). The phases vary slowly, and their slopes appear quite similar, but it is easy to note that their derivatives (whose mean values, in isolated vowels, represent the speech formants) generate different center frequencies. In order to test the validity of mean-IF extraction from the slowly-varying phases, several isolated vowels extracted from the Multext Prosodic Database were considered. Experimental results show a good accuracy in formant estimation.

Fig. 8. RMS error of demodulated IF components as a function of the modulation index and of the amplitude ratio of the two components, for the three algorithms IHT, MESA, and PASED.

Fig. 9. Processing times of IHT versus EMD.

Fig. 10. Residual energy comparison: IHT versus EMD.

Fig. 11. Italian word "settimana". (a) Original signal. (b) Its elementary amplitude envelopes $a_j(t)$. (c) Its phases ($j = 0, \ldots, 4$).

Fig. 12. Results of the adaptive segmentation algorithm applied to the first extracted phase. Vertical bars are interval boundaries.

Fig. 12 shows the results of the adaptive segmentation algorithm applied to the first extracted phase. This figure clearly shows that the required accuracy determines a highly irregular segmentation of the time axis, in contrast with what would have happened with an a priori segmentation.
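The idea of accuracy-driven segmentation can be illustrated with a minimal sketch. The paper's own segmentation criterion is defined in an earlier section and is not reproduced here, so the greedy rule below (grow an interval until the worst-case error of a least-squares line exceeds a tolerance) is a hypothetical stand-in. The function names, the tolerance, and the piecewise-linear test phase are all assumptions made for illustration.

```python
import math


def fit_line(times, values):
    """Least-squares line through (times, values); returns (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def max_fit_error(times, values):
    """Largest absolute deviation of the samples from their least-squares line."""
    slope, intercept = fit_line(times, values)
    return max(abs(v - (slope * t + intercept)) for t, v in zip(times, values))


def segment_phase(times, phase, tolerance, min_len=8):
    """Greedy segmentation: grow each interval while the linear fit stays within tolerance.

    Returns (start_index, end_index, mean_if_hz) tuples, end_index exclusive.
    A trailing interval shorter than two samples is dropped.
    """
    segments = []
    start, n = 0, len(times)
    while start < n:
        end = min(start + min_len, n)
        while end < n and max_fit_error(times[start:end + 1], phase[start:end + 1]) <= tolerance:
            end += 1
        if end - start >= 2:
            slope, _ = fit_line(times[start:end], phase[start:end])
            segments.append((start, end, slope / (2.0 * math.pi)))  # slope in rad/s -> Hz
        start = end
    return segments


fs = 2000.0
times = [n / fs for n in range(400)]
# Continuous phase whose slope changes from 100 Hz to 250 Hz at t = 0.1 s.
phase = [2.0 * math.pi * 100.0 * t if t <= 0.1
         else 2.0 * math.pi * (100.0 * 0.1 + 250.0 * (t - 0.1)) for t in times]
segments = segment_phase(times, phase, tolerance=0.05)
for start, end, mean_if in segments:
    print(start, end, round(mean_if, 3))
```

Because the interval boundaries follow the data, the slope change falls on a boundary and each interval yields a clean mean IF. A fixed a priori grid would generally straddle the change and bias both estimates.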

The asymptotic exactness of the proposed method arises from (33), but Fig. 13 empirically shows that convergence is reached even with low values of $N$. Here, the relative energy of the residual, $\|r_N(t)\|/\|x(t)\|$, is plotted as a function of the iteration number $N$, and it can be seen that 20 iterations suffice to give an error comparable to round-off noise. Moreover, on the basis of a subjective listening test performed with 40 people, it is possible to state that, for the value of $N$ used in that test, the model can be deemed equivalent to the original signal, with a relative residual error on the order of -30 dB.

Fig. 13. Relative energy of the residual $\|r_N(t)\|/\|x(t)\|$ as a function of the iteration number $N$.
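As a worked reading of that figure, and assuming the decibel value is computed from the norm ratio plotted in Fig. 13 (the convention is not spelled out in this passage, and the subscript in $r_N$ is used here only to mark the residual after $N$ iterations), a residual 30 dB below the signal corresponds to:

```latex
20\log_{10}\frac{\|r_N(t)\|}{\|x(t)\|} = -30\ \mathrm{dB}
\;\Longleftrightarrow\;
\frac{\|r_N(t)\|}{\|x(t)\|} = 10^{-1.5} \approx 0.032,
\qquad
\frac{\|r_N(t)\|^2}{\|x(t)\|^2} = 10^{-3}.
```

In other words, the residual carries about one thousandth of the signal energy at the point where listeners judged the model equivalent to the original.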

In order to clarify the relation between the mathematical components and their physical meaning, a time-frequency analysis based on spectrograms is shown in Figs. 14 and 15. In particular, Fig. 14 shows the results for the Italian word "settimana", where it is easy to note that the time-frequency structure can be effectively approximated with a small number of AM-FM components, which progressively refine the spectrogram reconstruction. Nevertheless, the high-frequency components are not well reconstructed because of the heterodyning effects of IHT.

It is worth noting that signals segmented at word level do not have a direct and simple connection with speech resonances and formants, as happens, for example, in simple speech signals (such as vowels), because of complex coarticulation phenomena in speech production and phonation. Bearing this in mind, the time-frequency analysis based on spectrograms was also performed on a simpler signal to validate the applicability to speech-signal modeling. Fig. 15 depicts the spectrograms of the Italian sustained vowel /a/. As in the previous case, the progressive extraction of components refines the spectrogram structure, and the heterodyning effect of IHT causes higher power components to be reconstructed before the lower power ones. Moreover, in this case it is clear that the formant structure of the vowel is captured after the first few components.

Experimental results show the absence of limitations in terms of signal length, time-frequency distribution, and so on. Additionally, the power spectral density (PSD) of the Italian vowel /e/ and of the Italian word "settimana", rebuilt with a varying number of components, was considered. The Euclidean distances between these approximations and the original PSDs are shown in Figs. 16 and 17, thus empirically verifying the convergence. It is worth noting that a theoretical proof of the PSD convergence can easily be obtained by means of the Parseval equality and the property of asymptotic IHT convergence.
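One way such an argument might run is sketched below for finite-length discrete signals. The setting is an assumption, not the paper's own proof: $x[n]$ has $M$ samples, $x_N[n]$ is the signal rebuilt with $N$ components, $r_N = x - x_N$, $X[k]$ and $X_N[k]$ are unnormalized DFTs, and the PSD is taken as $|X[k]|^2$ up to a constant scale factor.

```latex
\begin{align}
\|X - X_N\|_2^2 &= M\,\|x - x_N\|_2^2 = M\,\|r_N\|_2^2
  && \text{(Parseval)} \\
\bigl\|\,|X|^2 - |X_N|^2\,\bigr\|_2
  &\le \sum_k \bigl|\,|X[k]| - |X_N[k]|\,\bigr|\,\bigl(|X[k]| + |X_N[k]|\bigr)
  && (\ell_2 \le \ell_1) \\
  &\le \|X - X_N\|_2\,\bigl(\|X\|_2 + \|X_N\|_2\bigr)
  && \text{(Cauchy--Schwarz)} \\
  &= M\,\|r_N\|_2\,\bigl(\|x\|_2 + \|x_N\|_2\bigr).
\end{align}
```

Since $\|x_N\|_2 \le \|x\|_2 + \|r_N\|_2$ stays bounded, the Euclidean distance between the PSDs vanishes whenever $\|r_N\|_2 \to 0$, which is the asymptotic IHT convergence property.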


Fig. 14. Italian word "settimana". (a) Spectrogram. (b)-(f) Spectrograms of rebuilt signals with $N = 1, \ldots, 5$, respectively.

VII. CONCLUSION

This paper presents an asymptotically exact multicomponent sinusoidal model that can be applied to implement an AM-FM decomposition of speech signals. The proposed approach is based on the iterated application of the Hilbert transform to amplitude envelopes obtained by adaptively low-pass filtering the Gabor signal amplitudes. Instantaneous frequencies were then obtained from the extracted phases by means of a simple linear regression over time intervals adaptively detected a posteriori.

Applications of the algorithm to synthetic signals and natural speech showed its effectiveness in both component extraction and speech modeling. A comparative evaluation with state-of-the-art techniques demonstrated the superiority of the proposed approach, without the need for complex optimizations like those required by other approaches.

APPENDIX

According to (14), it is possible to write the signal as

(50)

where is defined as in (5), and. Thanks to the filtering described in Section III-A, (50) can be rewritten as

(51)

where . Then reapplying to the equality expressed in (50) we have

(52)

where , and , thus obtaining

(53)

where

(54)
(55)


Subsequently, is further decomposed by filtering so as to obtain

(56)

i.e.,

(57)

Fig. 15. Italian sustained vowel /a/. (a) Original signal. (b) Its spectrogram. (c)-(f) Spectrograms of rebuilt signals with $N = 1, \ldots, 4$, respectively.

Fig. 16. PSD distance of the Italian vowel /e/ and its rebuilt signals as a function of the number of components.

Fig. 17. PSD distance of the Italian word "settimana" and its rebuilt signals as a function of the number of components.


Reapplying to the formulation proposed in (52)

(58)

where , and , we have

(59)

and expanding

(60)

By reusing trigonometric formulas, it is possible to write

(61)

that is, in compact form

(62)

where

(63)
(64)
(65)
(66)

With a further filtering operation we then have

(67)

and it is possible to generalize (52) and (58) as

(68)

where , , and. The generalization of (54), (55), and (63)-(66) is

(69)
(70)

for and . Iterating the formulation previously proposed, we obtain

(71)

which can be rewritten in the form

(72)

and regrouped

(73)

thus obtaining the formulation proposed in (22).
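The recursion followed in this appendix (Hilbert transform, filtering of the amplitude, Hilbert transform of the amplitude residue, and so on) can be checked numerically with a minimal sketch. Several choices here are assumptions made only to keep the example short and dependency-free: a naive DFT replaces an FFT library, a fixed circular moving average stands in for the adaptive low-pass filter of Section III-A, and the products of cosines are left unexpanded instead of being regrouped with trigonometric formulas. All function names are hypothetical.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) DFT, used only to keep the sketch dependency-free."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for m in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * m / n) for k, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def analytic_signal(x):
    """Gabor (analytic) signal via the one-sided spectrum; len(x) must be even."""
    n = len(x)
    gains = [1.0] + [2.0] * (n // 2 - 1) + [1.0] + [0.0] * (n // 2 - 1)
    return dft([s * g for s, g in zip(dft(x), gains)], inverse=True)


def moving_average(x, width):
    """Fixed circular moving-average low-pass (stand-in for the adaptive filter)."""
    n, half = len(x), width // 2
    return [sum(x[(i + d) % n] for d in range(-half, half + 1)) / (2 * half + 1)
            for i in range(n)]


def iht(x, iterations, width=9):
    """Return (components, residual) such that x == sum(components) + residual."""
    components = []
    carrier = [1.0] * len(x)  # running product of the cosines of the extracted phases
    target = list(x)          # signal fed to the Hilbert transform at this step
    for _ in range(iterations):
        z = analytic_signal(target)
        amplitude = [abs(v) for v in z]
        cos_phase = [(v.real / a) if a > 0.0 else 1.0 for v, a in zip(z, amplitude)]
        carrier = [c * p for c, p in zip(carrier, cos_phase)]
        smooth = moving_average(amplitude, width)             # low-pass part of the amplitude
        components.append([s * c for s, c in zip(smooth, carrier)])
        target = [a - s for a, s in zip(amplitude, smooth)]   # amplitude residue
    residual = [t * c for t, c in zip(target, carrier)]
    return components, residual


n = 128
x = [(1.0 + 0.5 * math.cos(2 * math.pi * 2 * i / n)) * math.cos(2 * math.pi * 20 * i / n)
     for i in range(n)]
components, residual = iht(x, iterations=4)
rebuilt = [sum(c[i] for c in components) + residual[i] for i in range(n)]
max_gap = max(abs(a - b) for a, b in zip(x, rebuilt))
print(max_gap)  # rounding-level gap between x and components + residual
```

The check confirms only that the signal equals the extracted components plus the residual at every iteration. How fast the residual decays depends on the filter, which is where the adaptive design of the paper comes in. In practice an FFT-based routine such as `scipy.signal.hilbert` would replace the naive transform.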

ACKNOWLEDGMENT

The authors would like to thank the associate editor and the anonymous reviewers for their valuable comments that helped improve this paper. They would also like to thank Prof. B. Santhanam for providing them with a reference implementation of his PASED demodulation algorithm.

REFERENCES

[1] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-34, no. 4, pp. 744-754, Aug. 1986.

[2] P. Maragos, J. F. Kaiser, and T. F. Quatieri, "Energy separation in signal modulations with application to speech analysis," IEEE Trans. Signal Process., vol. 41, no. 10, pp. 3024-3051, Oct. 1993.

[3] S. L. Hahn, Hilbert Transforms in Signal Processing. Boston, MA: Artech House, 1996.

[4] M. Goodwin, "Multiresolution sinusoidal modeling using adaptive segmentation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'98), Seattle, WA, May 1998, vol. 3, pp. 1525-1528.

[5] R. Boyer and K. Abed-Meraim, "Audio modeling based on delayed sinusoids," IEEE Trans. Speech Audio Process., vol. 12, no. 2, pp. 110-120, Mar. 2004.

[6] J. Jensen, S. H. Jensen, and E. Hansen, "Exponential sinusoidal modeling of transitional speech segments," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'99), Phoenix, AZ, Mar. 1999, vol. 1, pp. 473-476.

[7] P. Lemmerling, I. Dologlou, and S. Van Huffel, "Speech compression based on exact modeling and structured total least norm optimization," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'98), Seattle, WA, May 1998, vol. 1, pp. 353-356.

[8] S. Van Huffel, H. Park, and J. B. Rosen, "Formulation and solution of structured total least norm problems for parameter estimation," IEEE Trans. Signal Process., vol. 44, no. 10, pp. 2464-2474, Oct. 1996.

[9] K. Hermus, W. Verhelst, P. Lemmerling, P. Wambacq, and S. Van Huffel, "Perceptual audio modeling with exponentially damped sinusoids," Signal Process., vol. 85, no. 1, pp. 163-176, Jan. 2005.

[10] R. Boyer and K. Abed-Meraim, "Estimation of damped and delayed sinusoids: Algorithm and Cramér-Rao bound," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'03), Hong Kong, Apr. 2003, vol. 6, pp. 137-140.

[11] R. Boyer and K. Abed-Meraim, "Audio transients modeling by damped and delayed sinusoids (DDS)," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'02), Orlando, FL, May 2002, vol. 2, pp. 1729-1732.

[12] B. Santhanam and P. Maragos, "Multicomponent AM-FM demodulation via periodicity-based algebraic separation and energy-based demodulation," IEEE Trans. Commun., vol. 48, no. 3, pp. 473-490, Mar. 2000.

[13] S. Gazor and R. R. Far, "Adaptive maximum windowed likelihood multicomponent AM-FM signal decomposition," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 2, pp. 479-491, Mar. 2006.

[14] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N.-C. Yen, C. C. Tung, and H. H. Liu, "The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis," Proc. R. Soc. London A, vol. 454, no. 1971, pp. 903-995, Mar. 1998.

[15] T.-H. Li and B. Kedem, "Iterative filtering for multiple frequency estimation," IEEE Trans. Signal Process., vol. 42, no. 5, pp. 1120-1132, May 1994.

[16] P. Maragos, J. F. Kaiser, and T. F. Quatieri, "On separating amplitude from frequency modulations using energy operators," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'92), San Francisco, CA, Mar. 1992, vol. 2, pp. 1-4.

[17] B. Santhanam, "Generalized energy demodulation for large frequency deviations and wideband signals," IEEE Signal Process. Lett., vol. 11, no. 3, pp. 341-344, Mar. 2004.

[18] A. Potamianos and P. Maragos, "A comparison of energy operators and the Hilbert transform approach to signal and speech demodulation," Signal Process., vol. 37, no. 1, pp. 95-120, May 1994.

[19] Y. Bar-Ness, F. Cassara, H. Schachter, and R. DiFazio, "Cross-coupled phase-locked loop with closed loop amplitude control," IEEE Trans. Commun., vol. COM-32, no. 2, pp. 195-199, Feb. 1984.

[20] D. O'Shaughnessy, Speech Communications: Human and Machine, 2nd ed. Piscataway, NJ: IEEE Press, 2000.

[21] F. Gianfelici, G. Biagetti, P. Crippa, and C. Turchetti, "Asymptotically exact AM-FM decomposition based on iterated Hilbert transform," in Proc. Interspeech 2005 - Eurospeech, 9th Eur. Conf. Speech Commun. Technol., Lisbon, Portugal, Sep. 2005, pp. 1121-1124.

[22] P. J. Brockwell and R. A. Davis, Time Series: Theory and Methods. New York: Springer-Verlag, 1991.

[23] J. S. Marques and L. B. Almeida, "Frequency-varying modeling of speech," IEEE Trans. Acoust., Speech, Signal Process., vol. 39, no. 5, pp. 763-765, May 1989.

[24] A. Potamianos and P. Maragos, "Speech analysis and synthesis using an AM-FM modulation model," Speech Commun., vol. 28, no. 3, pp. 195-209, Jul. 1999.

[25] Ç. Ö. Etemoğlu and V. Cuperman, "Matching pursuits sinusoidal speech coding," IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 413-424, Sep. 2003.

[26] T. Abe and M. Honda, "Sinusoidal model based on instantaneous frequency attractors," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'03), Hong Kong, Apr. 2003, vol. 6, pp. 133-136.

[27] P. Prandoni, M. Goodwin, and M. Vetterli, "Optimal time segmentation for signal modeling and compression," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'97), Munich, Germany, Apr. 1997, vol. 3, pp. 2029-2032.

[28] M. M. Goodwin and J. Laroche, "Audio segmentation by feature-space clustering using linear discriminant analysis and dynamic programming," in IEEE Workshop Applicat. Signal Process. Audio Acoust., New Paltz, NY, Oct. 2003, vol. 1, pp. 131-134.

[29] R. Vafin, R. Heusdens, S. van de Par, and W. B. Kleijn, "Improved modeling of audio signals by modifying transient locations," in IEEE Workshop Applicat. Signal Process. Audio Acoust., New Paltz, NY, Oct. 2001, vol. 1, pp. 143-146.

[30] G. Rilling, P. Flandrin, and P. Gonçalvès, "On empirical mode decomposition and its algorithms," in Proc. IEEE EURASIP Workshop Nonlinear Signal Image Process., Grado, Italy, Jun. 2003.

[31] A. C. Bovik, P. Maragos, and T. F. Quatieri, "AM-FM energy detection and separation in noise using multiband energy operators," IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3245-3265, Dec. 1993.

[32] E. Campione and J. Véronis, "A multilingual prosodic database," in Proc. 5th Int. Conf. Spoken Lang. Process. (ICSLP'98), Sydney, Australia, Dec. 1998, vol. 7, pp. 3163-3166.

[33] D. Chan, A. Fourcin, D. Gibbon, B. Granström, M. Huckvale, G. Kokkinakis, K. Kvale, L. Lamel, B. Lindberg, A. Moreno, J. Mouropoulos, F. Senia, I. Trancoso, C. Veld, and J. Zeiliger, "EUROM - A spoken language resource for the EU," in Proc. ESCA, 4th Eur. Conf. Speech Commun. Technol. (Eurospeech'95), Madrid, Spain, Sep. 1995, vol. 1, pp. 867-870.

Francesco Gianfelici was born in 1979. He received the Laurea degree in electronics engineering from the Università Politecnica delle Marche, Ancona, Italy, in 2003. He is currently pursuing the Ph.D. degree in electronics, informatics, and telecommunications engineering in the Dipartimento di Elettronica, Intelligenza Artificiale e Telecomunicazioni (DEIT), Università Politecnica delle Marche.

He has been active in the areas of theoretical computer science and information theory. His current research interests include multicomponent speech modeling based on AM-FM parameters, signal and image processing, recognition algorithms, and neural networks.

Giorgio Biagetti (S'03-M'05) received the Laurea degree (summa cum laude) in electronics engineering from the Università degli Studi di Ancona, Ancona, Italy, in 2000, and the Ph.D. degree in electronics and telecommunications engineering from the Università Politecnica delle Marche, Ancona, in 2004.

He is currently a Research Assistant at the Dipartimento di Elettronica, Intelligenza Artificiale e Telecomunicazioni (DEIT), Università Politecnica delle Marche. His research interests include statistical and high-level simulation of analog integrated circuits, statistical modeling, coding, and synthesis of speech, and automatic speech recognition.

Paolo Crippa (M'02) received the Laurea degree in electronics engineering (summa cum laude) from the Università degli Studi di Ancona, Ancona, Italy, in 1994 and the Ph.D. degree in electronics engineering from the Polytechnic of Bari, Bari, Italy, in 1999.

From 1994 to 1999, he was a Research Fellow at the Department of Electronics, Università degli Studi di Ancona, where in 1999 he was appointed Research Assistant as a member of the Technical Staff. Since 2006, he has been with the Dipartimento di Elettronica, Intelligenza Artificiale e Telecomunicazioni (DEIT), Università Politecnica delle Marche, Ancona, as an Assistant Professor. His current research interests include statistical modeling and simulation of integrated circuits, mixed-signal and RF circuit design, neural networks, and areas of signal processing involving coding, synthesis, and automatic recognition of speech.

Claudio Turchetti (M'86) received the Laurea degree in electronics engineering from the Università degli Studi di Ancona, Ancona, Italy, in 1979.

He joined the Department of Electronics, Università degli Studi di Ancona in 1980. He is currently a Full Professor of applied electronics and integrated circuits design and the Head of the Dipartimento di Elettronica, Intelligenza Artificiale e Telecomunicazioni (DEIT), Università Politecnica delle Marche, Ancona. He has been active in the areas of device modeling, circuits simulation at the device level, and design of integrated circuits. His current research interests are also in analog neural networks, statistical analysis of integrated circuits for parametric yield optimization, statistical modeling, coding, and synthesis of speech, and automatic speech recognition.
