17th National Conference on Communications, 28-30 Jan. 2011, Bangalore, India, Sp Pr. 1, P7
Transformation of Short-Term Spectral Envelope of Speech
Signal Using Multivariate Polynomial Modeling
P. K. Lehana, P. C. Pandey
{lehana, pcpandey}@ee.iitb.ac.in
EE Dept, IIT Bombay
30th January, 2011
PRESENTATION OUTLINE
1. Introduction
2. Multivariate Polynomial Modeling
3. Methodology
4. Results
5. Conclusion
1. INTRODUCTION
Speaker transformation
Modification of the speech signal of the source speaker to make it perceptually similar to that of the target speaker.
Processing steps in transformation
Estimation of mapping
▫ Estimation of the source and the target parameters
▫ Alignment of the parameters
▫ Estimation of the source-to-target transformation function(s)
Transformation of source speech
▫ Estimation of the source parameters
▫ Application of the transformation function(s) on the source parameters
▫ Generation of the transformed speech
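The slides do not say how the source and target parameters are aligned. A minimal sketch of one common choice, dynamic time warping (DTW) over frame-wise parameter vectors, is given below. The function name `dtw_align` and the Euclidean frame distance are assumptions made for illustration, not the authors' method.

```python
import numpy as np

def dtw_align(src, tgt):
    """Align two parameter sequences (frames x dims) with basic DTW.

    Hypothetical helper: returns a list of (source_frame, target_frame)
    index pairs along the minimum-cost warping path.
    """
    n, m = len(src), len(tgt)
    # Euclidean distance between every source frame and every target frame.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    # cost[i, j]: minimum accumulated distance aligning src[:i] with tgt[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrack from the last frame pair to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The returned index pairs select the corresponding source and target frames from which a transformation function can then be estimated.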
Spectral parameters for transformation
▫ Formant frequencies
▫ Line spectral frequencies (LSFs)
▫ Cepstral coefficients
▫ Mel frequency cepstrum coefficients (MFCCs): robust with respect to noise, coefficients uncorrelated with each other and hence suitable for interpolation.
Transformation methods
▫ Vector quantization (Shikano, 86): degradation in the output speech quality due to discretization of the acoustic space.
▫ Statistical and ANN (Narendranath, 98; Stylianou, 98; Ye, 06): large set of training data and computation needed.
▫ Frequency warping and interpolation (Rinscheid, 96; Hashimoto, 96; Jian, 07; Masuda, 07; Valbret, 92): different transformation functions needed for different acoustic classes.
Research objective
Modification of spectral characteristics by modeling the source-target relationship with a single mapping applicable to all acoustic classes, using
▫ modeling of each parameter of the target speech as a multivariate polynomial function of all the parameters of the source speech,
▫ harmonic plus noise model (HNM) based analysis-synthesis.
2. MULTIVARIATE POLYNOMIAL MODELING
Modeling
Approximation of an m-dimensional function g, known at q points (w_n), by a multivariate polynomial with terms \Phi_k and error \epsilon_n:
\[ g_n = \sum_{k=0}^{p-1} c_k \, \Phi_k(w_{1n}, w_{2n}, \ldots, w_{mn}) + \epsilon_n, \qquad n = 0, 1, \ldots, q-1 \]
Coefficients c_k obtained by minimizing the sum of squared errors.
Application
▫ Relationship between the parameters of the corresponding source and target frames obtained by modeling each parameter of the target speech as a multivariate polynomial function of all the parameters of the source speech.
▫ Each parameter of a target frame obtained as the corresponding function of all the parameters of the corresponding source frame.
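The least-squares fit described above can be sketched as follows. This is a minimal illustration, assuming that the polynomial terms are all monomials up to a chosen total degree (including cross terms). The names `poly_terms`, `fit_poly`, and `eval_poly` are hypothetical and not taken from the presentation.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_terms(W, degree):
    """Evaluate all monomials of total degree <= `degree` at each row of W.

    W has shape (q, m): q points, m variables. Column 0 of the result is the
    constant term; the remaining columns follow in order of increasing degree.
    """
    q, m = W.shape
    cols = [np.ones(q)]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(m), d):
            # Product of the selected variables, e.g. (0, 1) -> w0 * w1.
            cols.append(np.prod(W[:, list(idx)], axis=1))
    return np.column_stack(cols)

def fit_poly(W, g, degree):
    """Coefficients c_k minimizing the sum of squared errors over the q points."""
    Phi = poly_terms(W, degree)
    c, *_ = np.linalg.lstsq(Phi, g, rcond=None)
    return c

def eval_poly(W, c, degree):
    """Evaluate the fitted polynomial at new points."""
    return poly_terms(W, degree) @ c
```

In the application described above, one such fit would be made per target parameter, with the rows of `W` holding the source parameters of the aligned frames and `g` holding that target parameter.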
3. METHODOLOGY
Processing
HNM based analysis-synthesis as platform for transformation
▫ Harmonic band parameters: voicing, pitch, max. voiced frequency, harmonic magnitudes and phases.
▫ Noise band parameters: LP coefficients and energy.
Modification of parameters
▫ Harmonic magnitudes converted to MFCCs (20), transformed, and converted back to magnitudes; phases estimated by minimum-phase approximation.
▫ LP coefficients (20) converted to LSFs, transformed, and converted back to LP coefficients. Different transformation functions for the voiced and the unvoiced frames.
▫ Linear transformation for time and pitch scaling.
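The conversion between LP coefficients and LSFs mentioned above can be sketched with the standard sum and difference polynomial construction. This is a root-finding illustration, assuming a minimum-phase LP polynomial A(z) = 1 + a1 z^-1 + ... + ap z^-p. The function names are hypothetical, and the presentation does not state which conversion routine was used.

```python
import numpy as np

def lpc_to_lsf(a):
    """LP coefficients [1, a1, ..., ap] -> p LSFs in radians, ascending."""
    a = np.asarray(a, dtype=float)
    ext = np.concatenate([a, [0.0]])
    P = ext + ext[::-1]   # symmetric polynomial  A(z) + z^-(p+1) A(1/z)
    Q = ext - ext[::-1]   # antisymmetric polynomial A(z) - z^-(p+1) A(1/z)
    angles = np.angle(np.concatenate([np.roots(P), np.roots(Q)]))
    # Keep one root of each conjugate pair; drop the trivial roots at z = +1, -1.
    eps = 1e-6
    return np.sort(angles[(angles > eps) & (angles < np.pi - eps)])

def lsf_to_lpc(lsf):
    """Inverse conversion: ascending LSFs (radians) -> [1, a1, ..., ap]."""
    lsf = np.asarray(lsf, dtype=float)

    def build(freqs):
        poly = np.array([1.0])
        for w in freqs:
            # Each LSF contributes a conjugate root pair on the unit circle.
            poly = np.convolve(poly, [1.0, -2.0 * np.cos(w), 1.0])
        return poly

    P = build(lsf[0::2])  # 1st, 3rd, ... LSFs belong to the symmetric polynomial
    Q = build(lsf[1::2])  # 2nd, 4th, ... LSFs belong to the antisymmetric one
    if len(lsf) % 2 == 0:
        P = np.convolve(P, [1.0, 1.0])        # trivial root at z = -1
        Q = np.convolve(Q, [1.0, -1.0])       # trivial root at z = +1
    else:
        Q = np.convolve(Q, [1.0, 0.0, -1.0])  # trivial roots at z = +1 and -1
    return (0.5 * (P + Q))[:-1]
```

A transformed LSF vector must remain strictly ascending within (0, π) for the reconstructed LP filter to be stable, so a practical system would need to check or enforce that ordering after transformation.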
Estimation of spectral transformation functions
Transformation of source speech
Transformation functions investigated
▫ Univariate linear (UL)
▫ Multivariate linear (ML)
▫ Multivariate quadratic (MQ)
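The difference between the univariate and multivariate linear functions can be sketched as below. The interpretation is an assumption: UL is taken to map each target parameter from the same-index source parameter only, while ML uses all source parameters jointly. The names `fit_ul` and `fit_ml` are hypothetical.

```python
import numpy as np

def fit_ul(X, Y):
    """Univariate linear: y_i ~ a_i * x_i + b_i, fitted per parameter.

    X, Y: aligned source and target parameters, shape (frames, m).
    Returns slopes a and offsets b, each of shape (m,).
    """
    xm, ym = X.mean(axis=0), Y.mean(axis=0)
    xc, yc = X - xm, Y - ym
    a = (xc * yc).sum(axis=0) / (xc ** 2).sum(axis=0)
    b = ym - a * xm
    return a, b

def fit_ml(X, Y):
    """Multivariate linear: every target parameter from all source parameters.

    Returns B of shape (m + 1, m); row 0 holds the constant terms.
    """
    Phi = np.column_stack([np.ones(len(X)), X])
    B, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return B

def apply_ul(X, a, b):
    return X * a + b

def apply_ml(X, B):
    return np.column_stack([np.ones(len(X)), X]) @ B
```

Under this reading, with 20 parameters per frame, UL needs 2 coefficients per target parameter and ML needs 21. A quadratic function with all squared and cross terms would need 231, although the slides do not state whether MQ includes cross terms.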
Evaluation
Material
A Hindi story with 80 sentences (10 kHz, 16 bits) from 5 speakers (2 M, 3 F). 77 sentences used for training, 3 for testing.
Preliminary evaluation
▫ Unity transformation (same speaker as the source and the target): identity not disturbed, a small degradation in quality.
▫ Pitch modification: target identity not achieved, quality degradation similar to the unity transformation.
▫ Spectral modification: source identity changed towards target for the same gender transformation, slightly higher degradation in quality.
▫ Spectral modification along with pitch and time scaling: source identity close to the target for all the speaker pairs, quality same as in spectral modification.
Example: “Vah padne likhane men bahut achchha tha”
[Figure: panels for the speaker pairs F1-F2, F1-M2, M1-F2, and M1-M2, each showing S (source), T (target), Tr_UL, Tr_ML, and Tr_MQ]
Objective evaluation
Mahalanobis distance between two sets of MFCC feature vectors (P, Q),
\[ D_M(P, Q) = \sqrt{(P - Q)^T \, \Sigma^{-1} \, (P - Q)}, \qquad \Sigma = \text{covariance matrix} \]
where P corresponds to the target speech and Q corresponds to the source or the transformed speech.
Subjective evaluation
XAB and MOS test (automated administration)
▫ Source, target, or modified randomly presented as X. Source or target randomly presented as A or B.
▫ No. of subjects: 6
▫ Material: 2 sentences for each of the 4 speaker pairs
▫ No. of presentations for each stimulus: 3
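For the objective evaluation, the Mahalanobis distance between a pair of feature vectors can be sketched as below, using the usual definition with the square root. How the distance is pooled over frames and which data the covariance matrix is estimated from are not stated in the slides. The frame-averaging helper `mean_frame_distance` is therefore an assumption, and both function names are hypothetical.

```python
import numpy as np

def mahalanobis(p, q, cov):
    """sqrt((p - q)^T cov^{-1} (p - q)) for two feature vectors."""
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    # Solve cov @ x = d rather than forming the inverse explicitly.
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

def mean_frame_distance(P, Q, cov):
    """Assumed pooling: average distance over aligned frame pairs (rows of P, Q)."""
    return float(np.mean([mahalanobis(p, q, cov) for p, q in zip(P, Q)]))
```

With an identity covariance matrix the measure reduces to the Euclidean distance. A non-identity covariance scales each direction by the variability of the features along it.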
4. RESULTS
• Mahalanobis distance of the target MFCCs

                 Distance
Transformation   F1-F2   F1-M1   M1-F2   M1-M2
Source           0.51    0.65    0.64    0.53
Tr_UL            0.68    0.65    0.61    0.64
Tr_ML            0.45    0.47    0.44    0.43
Tr_MQ            0.38    0.39    0.38    0.33

Highest reduction in the target-transformed distance for MQ based transformation.

• XAB score (2 sentences × 3 presentations × 6 listeners, averaged across the 4 speaker pairs)
Transformed: Tr_MQ along with pitch modification and time scaling

Stimulus    Source   Target   Pitch modified   Transformed
Score (%)   6        96       14               92

Identification errors: Source 6 %, Target 4 %, Transformed 8 %.

• MOS score (2 sentences × 3 presentations × 6 listeners, averaged across the 4 speaker pairs)

Transformation   UL    ML    MQ
Score            1.7   2.8   3.1
Demo
S: source, T: target, PM: pitch modified, SM: spectrum modified, TS: time scaled
UL: univariate linear, ML: multivariate linear, MQ: multivariate quadratic
5. CONCLUSION
Modification of spectral characteristics feasible by modeling the source-target relationship using multivariate polynomial functions for a single mapping applicable to all acoustic classes, without extensive training or labeling.
Methods investigated for transformation function: UL, ML, MQ. MQ resulted in satisfactory identity transformation and fair quality.
Further work
▫ Listening tests involving larger number of speaker pairs and listeners.
▫ Comparison with other transformation techniques.
Thank you