
IIT B

17th National Conference on Communications, 28-30 Jan. 2011, Bangalore, India. Sp Pr. 1, P7

Transformation of Short-Term Spectral Envelope of Speech Signal Using Multivariate Polynomial Modeling

P. K. Lehana, P. C. Pandey

{lehana, pcpandey}@ee.iitb.ac.in

EE Dept, IIT Bombay

30th January, 2011


PRESENTATION OUTLINE

1. Introduction

2. Multivariate Polynomial Modeling

3. Methodology

4. Results

5. Conclusion


1. INTRODUCTION

Speaker transformation

Modification of the speech signal of the source speaker to make it perceptually similar to that of the target speaker.

Processing steps in transformation

Estimation of mapping

▫ Estimation of the source and the target parameters

▫ Alignment of the parameters

▫ Estimation of the source-to-target transformation function(s)

Transformation of source speech

▫ Estimation of the source parameters

▫ Application of the transformation function(s) on the source parameters

▫ Generation of the transformed speech
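The slides do not name the method used for the alignment step. As an illustration only, the sketch below uses dynamic time warping (DTW), a common choice for pairing source and target frames of the same sentence; the function name `dtw_align` and the toy one-dimensional frames are assumptions, not the authors' implementation.

```python
import numpy as np

def dtw_align(source, target):
    """Return index pairs (i, j) pairing source frames with target frames.

    source: array of shape (n, d), target: array of shape (m, d).
    Plain DTW with Euclidean frame distance (an assumed choice).
    """
    n, m = len(source), len(target)
    # Pairwise Euclidean distances between all source and target frames.
    dist = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=2)
    # Accumulated cost, padded with a border of infinities.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the last frame pair to the first.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy example: the target repeats its first frame once.
source_frames = np.array([[0.0], [1.0], [2.0]])
target_frames = np.array([[0.0], [0.0], [1.0], [2.0]])
pairs = dtw_align(source_frames, target_frames)
```

The aligned pairs give the corresponding source and target parameter vectors from which a transformation function can then be estimated.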


Spectral parameters for transformation

▫ Formant frequencies
▫ Line spectral frequencies (LSFs)
▫ Cepstral coefficients
▫ Mel frequency cepstrum coefficients (MFCCs): robust w.r.t. noise, coefficients uncorrelated with each other and hence suitable for interpolation.

Transformation methods

▫ Vector quantization (Shikano, 86): degradation in the output speech quality due to discretization of the acoustic space.
▫ Statistical and ANN (Narendranath, 98; Stylianou, 98; Ye, 06): large set of training data and computation needed.
▫ Frequency warping and interpolation (Rinscheid, 96; Hashimoto, 96; Jian, 07; Masuda, 07; Valbret, 92): different transformation functions needed for different acoustic classes.


Research objective

Modification of spectral characteristics by modeling the source-target relationship using a single mapping applicable to all acoustic classes, by

▫ modeling each parameter of the target speech as a multivariate polynomial function of all the parameters of the source speech,
▫ harmonic plus noise model (HNM) based analysis-synthesis.


2. MULTIVARIATE POLYNOMIAL MODELING

Modeling

Approximation of an m-dimensional function g, known at q points (w_n), by a multivariate polynomial with p terms φ_k and error ε_n:

$$\sum_{k=0}^{p-1} c_k \, \phi_k(w_{1n}, w_{2n}, \ldots, w_{mn}) + \varepsilon_n = g_n, \qquad n = 0, 1, \ldots, q-1$$

Coefficients c_k obtained by minimizing the sum of squared errors.

Application

▫ Relationship between the parameters of the corresponding source and target frames obtained by modeling each parameter of the target speech as a multivariate polynomial function of all the parameters of the source speech.

▫ Each parameter of a target frame obtained as the corresponding function of all the parameters of the corresponding source frame.
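The least-squares fit described above can be sketched as follows. This is a minimal illustration, assuming the polynomial terms are all monomials up to a chosen total degree; the function names (`polynomial_terms`, `fit_coefficients`, `evaluate`) and the synthetic data are hypothetical and not taken from the slides.

```python
from itertools import combinations_with_replacement

import numpy as np

def polynomial_terms(w, degree):
    """Evaluate every monomial of total degree <= `degree` at each row of w.

    w has shape (q, m): q points, m source parameters.
    Returns a (q, p) matrix whose columns are the polynomial terms.
    """
    q, m = w.shape
    columns = [np.ones(q)]  # constant term
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(m), d):
            columns.append(np.prod(w[:, list(idx)], axis=1))
    return np.column_stack(columns)

def fit_coefficients(w, g, degree):
    """Coefficients minimizing the sum of squared errors between the polynomial and g."""
    phi = polynomial_terms(w, degree)
    coeffs, *_ = np.linalg.lstsq(phi, g, rcond=None)
    return coeffs

def evaluate(w, coeffs, degree):
    """Apply a fitted polynomial to new points."""
    return polynomial_terms(w, degree) @ coeffs

# Synthetic check: recover a known quadratic in m = 3 variables.
rng = np.random.default_rng(0)
w = rng.normal(size=(200, 3))
g = 1.0 + 2.0 * w[:, 0] - w[:, 1] * w[:, 2] + 0.5 * w[:, 2] ** 2
coeffs = fit_coefficients(w, g, degree=2)
g_hat = evaluate(w, coeffs, degree=2)
```

Passing a matrix `g` with one column per target parameter fits all target parameters at once, since `np.linalg.lstsq` solves each column independently.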


3. METHODOLOGY

Processing

HNM based analysis-synthesis as platform for transformation

▫ Harmonic band parameters: voicing, pitch, max. voiced frequency, harmonic magnitudes and phases.

▫ Noise band parameters: LP coefficients and energy.

Modification of parameters

▫ Harmonic magnitudes converted to MFCCs (20), transformed, & converted back to magnitudes; phases estimated by minimum-phase approximation.

▫ LP coeffs. (20) converted to LSFs, transformed, & converted back to LP coeffs. Different transformation fns. for the voiced and the unvoiced frames.

▫ Linear transformation for time and pitch scaling.
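The minimum-phase approximation mentioned above can be illustrated with the standard cepstral method. This is a sketch under the assumption that the magnitude envelope has been sampled on a full, even-length, uniform FFT grid; the slides work with magnitudes at pitch harmonics and do not say how the phases were actually computed. The function name is hypothetical.

```python
import numpy as np

def minimum_phase_from_magnitude(magnitude):
    """Phase (radians) of the minimum-phase spectrum with the given magnitude.

    `magnitude` holds |H[k]| for k = 0..N-1 on a full FFT grid, N even,
    with all values strictly positive.
    """
    n = len(magnitude)
    if n % 2:
        raise ValueError("expected an even number of frequency samples")
    # Real cepstrum of the log-magnitude spectrum.
    cepstrum = np.fft.ifft(np.log(magnitude)).real
    # Fold the anti-causal half onto the causal half.
    window = np.zeros(n)
    window[0] = 1.0
    window[1:n // 2] = 2.0
    window[n // 2] = 1.0
    # The imaginary part of the resulting log-spectrum is the phase.
    return np.fft.fft(cepstrum * window).imag

# Check on a filter known to be minimum phase (zero at z = -0.5).
h = np.array([1.0, 0.5])
spectrum = np.fft.fft(h, 512)
phase = minimum_phase_from_magnitude(np.abs(spectrum))
```

For this test filter the recovered phase matches `np.angle(spectrum)` closely, because the filter is itself minimum phase.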


Estimation of spectral transformation functions

Transformation of source speech

Transformation functions investigated

▫ Univariate linear (UL)
▫ Multivariate linear (ML)
▫ Multivariate quadratic (MQ)
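The three function types differ mainly in how many coefficients must be estimated for each target parameter. The sketch below counts them for m = 20 source parameters, assuming that UL maps each target parameter from the same-index source parameter only, and that MQ includes all squares and cross terms; the slides do not spell out the exact term sets, so both are assumptions.

```python
from math import comb

def terms_per_target_parameter(m, kind):
    """Number of polynomial terms for one target parameter (assumed term sets)."""
    if kind == "UL":
        return 2               # constant + one source parameter
    if kind == "ML":
        return 1 + m           # constant + all m source parameters
    if kind == "MQ":
        return comb(m + 2, 2)  # all monomials of total degree <= 2
    raise ValueError(f"unknown transformation type: {kind}")

counts = {kind: terms_per_target_parameter(20, kind) for kind in ("UL", "ML", "MQ")}
```

A unique least-squares solution needs at least as many aligned frame pairs as terms, so MQ needs considerably more training frames than UL or ML.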


Evaluation

Material

A Hindi story with 80 sentences (10 kHz, 16 bits) from 5 speakers (2 M, 3 F). 77 sentences used for training, 3 for testing.

Preliminary evaluation

▫ Unity transformation (same speaker as the source and the target): identity not disturbed, a small degradation in quality.
▫ Pitch modification: target identity not achieved, quality degradation similar to the unity transformation.
▫ Spectral modification: source identity changed towards target for the same gender transformation, slightly higher degradation in quality.
▫ Spectral modification along with pitch and time scaling: source identity close to the target for all the speaker pairs, quality same as in spectral modification.


Example: “Vah padne likhane men bahut achchha tha”

[Figure panels: F1-F2, F1-M2, M1-F2, M1-M2]


Objective evaluation

Mahalanobis distance between two sets of MFCC feature vectors (P, Q),

$$D_M(P, Q) = \sqrt{(P - Q)^T \, \Sigma^{-1} \, (P - Q)}, \qquad \Sigma = \text{covariance matrix},$$

where P corresponds to the target speech and Q corresponds to the source or the transformed speech.

Subjective evaluation

XAB and MOS test (automated administration)

▫ Source, target, or modified randomly presented as X. Source or target randomly presented as A or B.
▫ No. of subjects: 6
▫ Material: 2 sentences for each of the 4 speaker pairs
▫ No. of presentations for each stimulus: 3
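The Mahalanobis distance used for the objective evaluation can be sketched as below. The slides do not state how the two sets of feature vectors are reduced to P and Q or how the covariance matrix is estimated, so the function simply takes two vectors and a covariance matrix as inputs; the name `mahalanobis_distance` and the example values are illustrative assumptions.

```python
import numpy as np

def mahalanobis_distance(p, q, covariance):
    """Mahalanobis distance between vectors p and q for a given covariance matrix."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    # Solve a linear system instead of inverting the covariance explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(covariance, diff)))

# With an identity covariance the result is the Euclidean distance.
d_identity = mahalanobis_distance([1.0, 2.0], [4.0, 6.0], np.eye(2))

# A larger variance along the first axis down-weights differences on that axis.
d_scaled = mahalanobis_distance([1.0, 2.0], [4.0, 6.0], np.diag([4.0, 1.0]))
```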


4. RESULTS

• Mahalanobis distance of the target MFCCs

Transformation   F1-F2   F1-M1   M1-F2   M1-M2
Source           0.51    0.65    0.64    0.53
Tr_UL            0.68    0.65    0.61    0.64
Tr_ML            0.45    0.47    0.44    0.43
Tr_MQ            0.38    0.39    0.38    0.33

Highest reduction in the target-transformed distance for MQ based transformation.

• XAB score (2 sentences × 3 presentations × 6 listeners, averaged across the 4 speaker pairs). Transformed: Tr_MQ along with pitch modification and time scaling.

Stimulus    Source   Target   Pitch modified   Transformed
Score (%)   6        96       14               92

Identification errors: Source 6 %, Target 4 %, Transformed 8 %

• MOS score (2 sentences × 3 presentations × 6 listeners, averaged across the 4 speaker pairs)

Transformation   UL    ML    MQ
Score            1.7   2.8   3.1
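As a small worked example, the distance values reported above can be summarized as a relative reduction with respect to the untransformed source. Averaging across the four speaker pairs is an assumption made here for illustration; the slides report only the per-pair distances.

```python
# Mahalanobis distances to the target, one value per speaker pair (from the results table).
distances = {
    "Source": [0.51, 0.65, 0.64, 0.53],
    "Tr_UL": [0.68, 0.65, 0.61, 0.64],
    "Tr_ML": [0.45, 0.47, 0.44, 0.43],
    "Tr_MQ": [0.38, 0.39, 0.38, 0.33],
}

def mean(values):
    return sum(values) / len(values)

baseline = mean(distances["Source"])

# Percentage reduction of the mean distance relative to the source;
# a negative value means the distance increased.
reduction = {
    name: 100.0 * (1.0 - mean(values) / baseline)
    for name, values in distances.items()
    if name != "Source"
}

for name, percent in reduction.items():
    print(f"{name}: {percent:+.1f} % reduction in mean distance")
```

Under this summary, MQ gives the largest reduction, ML a smaller one, and UL slightly increases the mean distance.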


Demo

S: source, T: target, PM: pitch modified, SM: spectrum modified, TS: time scaled

UL: univariate linear, ML: multivariate linear, MQ: multivariate quadratic


5. CONCLUSION

Modification of spectral characteristics feasible by modeling the source-target relationship using multivariate polynomial functions for a single mapping applicable to all acoustic classes, without extensive training or labeling.

Methods investigated for transformation function: UL, ML, MQ. MQ resulted in satisfactory identity transformation and fair quality.

Further work

▫ Listening tests involving larger number of speaker pairs and listeners.

▫ Comparison with other transformation techniques.


Thank you
