[ieee 2013 national conference on communications (ncc) - new delhi, india (2013.2.15-2013.2.17)]...


Page 1: [IEEE 2013 National Conference on Communications (NCC) - New Delhi, India (2013.2.15-2013.2.17)] 2013 National Conference on Communications (NCC) - Unified pitch markers generation

Unified Pitch Markers Generation Method for Pitch and Duration Modification

S R M Prasanna and D Govind

Department of Electronics and Electrical Engineering

Indian Institute of Technology Guwahati

Email:{prasanna,dgovind}@iitg.ernet.in

Abstract—This paper proposes a modified pitch markers generation method that can be used for both pitch and duration modification. Except for changing some input parameters, the method remains common for both. The original pitch markers and the modification and scaling factors are the inputs to the method. The modified pitch markers, generated according to the given modification and scaling factors, are the output. This provides a simplified and modular approach for pitch and duration modification. The proposed method is illustrated for both static and dynamic pitch and duration modification cases. The experimental results indicate that the method can be used without any modification and with equal ease in both cases.

I. INTRODUCTION

The information in speech can be viewed at vocal tract

system, excitation source and prosodic levels. Apart from the

message part due to vocal tract component, the prosodic level

contains significant information exploited in human speech

communication. To mention a few, these include naturalness,

expressiveness, pitch contour, duration and intensity. Without

this prosodic information, the communication is not natural

and will not be expressive. Thus, during speech synthesis,

apart from synthesizing the message by exciting the vocal tract

system with excitation source, efforts also should be made to

incorporate prosodic level information. One approach followed

during speech synthesis is to synthesize the message part and

then incorporate the required prosodic information by prosody

modification [1], [2].

The prosodic modification involves mainly the modification

of pitch and duration information [3], [4]. This is achieved

by first extracting the original pitch markers from the given

speech signal and then obtaining the modified pitch markers

from them based on the modification factors. The modified

pitch markers are used as anchor points for the reconstruction

of prosody modified speech. Thus deriving the modified pitch

markers is an important step during prosody modification.

Traditionally, the modified pitch markers for pitch and duration modification are obtained by independent procedures [4]–[6]. This is because the pitch should remain constant during duration modification, and the duration should remain constant during pitch modification. However, if the modified pitch markers are viewed only in terms of their number, ignoring the duration they span, the pitch and duration cases are the same except for a scaling factor on duration. Thus a single procedure can be employed for deriving modified pitch markers suitable for both pitch and duration modification, simplifying the procedure and hence reducing the complexity of the prosody modification process.

If there are N original pitch markers and βi is the duration modification factor, then there will be βi × N modified pitch markers. For instance, if N = 50 and βi = 2, then there will be 100 modified pitch markers. This essentially doubles the duration of the prosody modified speech, whereas the pitch is kept unaltered. Alternatively, if αi is the pitch modification factor, then there will be N × (1/αi) modified pitch markers. For instance, if N = 50 and αi = 0.5, then there will be 100 modified pitch markers. This reduces the pitch period in the prosody modified speech by half, whereas the duration is kept unaltered. If we set aside what happens to pitch and duration and concentrate only on the number of pitch markers, both cases involve generating the same number of pitch markers. Thus we can define a modified pitch markers generation factor δi, with δi = βi for duration modification and δi = 1/αi for pitch modification, and generate the modified pitch markers. After this, the intervals between the modified pitch markers are scaled in duration by a factor ζi, where ζi = 1 if δi = βi and ζi = αi if δi = 1/αi. In this way the modified pitch markers for both pitch and duration modification can be obtained using the same procedure.
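The relation between the modification factors and the pair (δi, ζi) can be illustrated with a short sketch. The function name `generation_and_scaling` and its `mode` argument are illustrative assumptions and do not appear in the paper; only the relations δ = β, ζ = 1 (duration) and δ = 1/α, ζ = α (pitch) are taken from the text.

```python
def generation_and_scaling(mode, factor):
    """Map one modification factor to (delta, zeta).

    mode   -- "duration" (factor is beta) or "pitch" (factor is alpha);
              the mode strings are hypothetical names used for illustration.
    factor -- positive modification factor.
    """
    if factor <= 0:
        raise ValueError("modification factor must be positive")
    if mode == "duration":
        # Duration modification: delta = beta, no interval scaling.
        return factor, 1.0
    if mode == "pitch":
        # Pitch modification: delta = 1/alpha, intervals scaled by alpha.
        return 1.0 / factor, factor
    raise ValueError("mode must be 'duration' or 'pitch'")


N = 50  # number of original pitch markers, as in the example in the text
for mode, factor in (("duration", 2.0), ("pitch", 0.5)):
    delta, zeta = generation_and_scaling(mode, factor)
    print(mode, "-> markers:", round(delta * N), "interval scaling:", zeta)
```

Both cases in the text (β = 2 and α = 0.5) lead to the same generation factor δ = 2 and differ only in the scaling factor ζ.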

The rest of the paper is organized as follows: Section II describes the proposed modified pitch markers generation method. Pitch and duration modification using the proposed method is described in Section III. Section IV reports and discusses the experimental results. The summary and conclusions of the work, along with the scope for future work, are given in the last section.

II. UNIFIED MODIFIED PITCH MARKERS GENERATION

Let there be N original pitch markers given by

$P_O = \{p_{o1}, p_{o2}, \ldots, p_{on}, \ldots, p_{oN}\}$ (1)

Let δi be the modified pitch markers generation factor. Then there will be M = δi × N modified pitch markers given by

$P'_M = \{p'_{m1}, p'_{m2}, \ldots, p'_{mj}, \ldots, p'_{mM}\}$ (2)

P′M can be generated from PO by assuming that the modified pitch markers are for duration modification and using δi as the duration modification factor.

978-1-4673-5952-8/13/$31.00 © 2013 IEEE

The number of intervals between successive modified pitch markers will be M − 1, and the intervals are given by

$E'_M = \{e'_{m1}, e'_{m2}, \ldots, e'_{mj}, \ldots, e'_{m(M-1)}\}$ (3)

where

$e'_{mj} = p'_{m(j+1)} - p'_{mj}$ (4)

If the pitch markers are meant for duration modification, then nothing more needs to be done, as the original procedure employs the duration modification approach [4], [5]. Alternatively, if the pitch markers are meant for pitch modification, then the intervals E′M need to be scaled in duration to nullify the change in duration. Let ζi be the duration scaling factor.

For duration modification

δi = βi (5)

and hence

ζi = 1 (6)

Therefore, the intervals between the final modified pitch markers for duration modification are given by

$E_M = E'_M$ (7)

and the final pitch markers for duration modification are given by

$P_M = \{p'_{m1},\ p'_{m1} + e'_{m1},\ p'_{m1} + e'_{m1} + e'_{m2},\ \ldots\}$ (8)

For pitch modification

δi = 1/αi (9)

and hence

ζi = αi (10)

Therefore, the intervals between the final modified pitch markers for pitch modification are given by

$E_M = \zeta_i \times E'_M$ (11)

and the final pitch markers for pitch modification are given by

$P_M = \{p'_{m1},\ p'_{m1} + \zeta_i e'_{m1},\ p'_{m1} + \zeta_i e'_{m1} + \zeta_i e'_{m2},\ \ldots\}$ (12)
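The interval scaling in Eqs. (4), (7), (8), (11) and (12) amounts to a cumulative sum of scaled intervals. The following is a minimal sketch under the assumption that the intermediate markers P′M are already available as an array of sample positions; the function name `final_markers` and the example values are illustrative and not from the paper.

```python
import numpy as np


def final_markers(p_prime, zeta):
    """Scale the intervals of intermediate markers P'_M by zeta.

    p_prime -- intermediate modified pitch markers (sample positions).
    zeta    -- scalar, or an array with one value per interval (length M - 1).
    """
    p_prime = np.asarray(p_prime, dtype=float)
    e_prime = np.diff(p_prime)          # Eq. (4): intervals between markers
    e_final = zeta * e_prime            # Eq. (7) when zeta = 1, Eq. (11) otherwise
    # Eqs. (8)/(12): first marker plus the running sum of scaled intervals.
    return p_prime[0] + np.concatenate(([0.0], np.cumsum(e_final)))


p_prime = [10, 90, 170, 250]            # made-up intermediate markers
print(final_markers(p_prime, 1.0))      # duration modification: unchanged
print(final_markers(p_prime, 0.5))      # pitch modification with alpha = 0.5
```

With ζ = 1 the markers are returned unchanged, which corresponds to the duration modification case; with ζ = 0.5 every interval is halved while the first marker stays in place.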

Fig. 1 shows a pictorial comparison of the pitch markers for the duration and pitch modification cases. Figs. 1(a) and 1(d) contain the original pitch markers. Figs. 1(b) and 1(c) plot the modified pitch markers for duration modification factors > 1 and < 1, respectively. Figs. 1(e) and 1(f) plot the modified pitch markers for pitch modification factors > 1 and < 1, respectively. Comparing Fig. 1(b) with Fig. 1(f) (similarly, Fig. 1(c) with Fig. 1(e)) shows that the number of pitch markers is the same and only the duration differs. Thus the pitch markers of Fig. 1(f) can be generated from the pitch markers in Fig. 1(b) by scaling in duration. It is therefore possible to employ the same procedure for obtaining the modified pitch markers for pitch and duration modification.

Fig. 1: Comparison of pitch markers for duration and pitch modification. (a) The original pitch marks, (b) pitch marks for the duration modification factors βi = 2 and (c) βi = 0.5. (d) The original pitch marks, (e) pitch marks for the pitch modification factors αi = 2 and (f) αi = 0.5. [Figure not reproduced; horizontal axis: Time (Samples).]

The proposed unified modified pitch marker generation process can be represented as the block diagram shown in Fig. 2. The original pitch markers PO and the modification factor δi are used initially to generate the modified pitch markers P′M for duration modification by δi. The final modified pitch markers PM are generated by scaling the intervals between successive pitch markers of P′M by ζi. This module can be used at any stage of the prosody modification process, for duration modification, pitch modification or both, which makes the proposed method modular.
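The two stages of the module can be sketched as follows for the static case. The paper does not spell out how P′M is generated; the first stage below assumes one possible realization in which the original interval plot is stretched in time by δ and new markers are placed by stepping through the interpolated intervals. The function name `unified_pitch_markers` and this interpolation scheme are assumptions made for illustration only.

```python
import numpy as np


def unified_pitch_markers(p_o, delta, zeta):
    """Static-case sketch: generate P'_M for factor delta, then scale by zeta."""
    p_o = np.asarray(p_o, dtype=float)
    intervals = np.diff(p_o)
    if p_o.size < 2 or np.any(intervals <= 0):
        raise ValueError("need at least two strictly increasing markers")

    # Stage 1 (assumed realization): stretch the interval plot by delta and
    # place markers by repeatedly stepping one interpolated interval forward.
    anchors = (p_o[:-1] - p_o[0]) * delta    # stretched interval positions
    end = (p_o[-1] - p_o[0]) * delta         # stretched total span
    rel = [0.0]
    while True:
        step = float(np.interp(rel[-1], anchors, intervals))
        if rel[-1] + step > end + 1e-9:
            break
        rel.append(rel[-1] + step)

    # Stage 2: scale the intervals between successive markers by zeta.
    e_final = zeta * np.diff(rel)
    return p_o[0] + np.concatenate(([0.0], np.cumsum(e_final)))


p_o = np.arange(50) * 80.0                           # 50 markers, 80-sample period
duration_case = unified_pitch_markers(p_o, 2.0, 1.0) # beta = 2
pitch_case = unified_pitch_markers(p_o, 2.0, 0.5)    # alpha = 0.5
print(len(duration_case), len(pitch_case))
```

In this sketch both calls use the same δ and therefore produce the same number of markers (close to δ × N), while only the pitch case rescales the intervals so that the total span matches the original.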

III. PROSODY MODIFICATION USING UNIFIED PITCH MARKER GENERATION

The prosody modification process can be viewed as three modules, as shown in Fig. 3: original pitch markers generation, modified pitch markers generation and prosody modified waveform generation. The original pitch markers generation module extracts accurate pitch markers from the given speech signal. The epoch based approach has been shown to provide the most accurate pitch markers [7]. The modified pitch markers generation module takes the original pitch markers and the modification factors as input and generates the modified pitch markers. A widely used approach in epoch based prosody modification is to generate the epoch interval plot from the original pitch markers, interpolate and scale the plot, and then derive the modified pitch markers from this plot [4], [8]. The proposed work replaces this method in the second module of the prosody modification block diagram shown in Fig. 3. The waveform generation module takes the modified pitch markers and the original speech signal as input and constructs the prosody modified speech signal.

Fig. 2: Block diagram for the proposed unified pitch marker generation algorithm. [Figure not reproduced.]

Fig. 3: Schematic diagram for prosody modification. [Figure not reproduced.]

The perceptual quality of the synthesized speech depends on

the accuracy of the methods present in each of the modules.

The zero frequency filtering (ZFF) based epoch extraction

is demonstrated to provide the most accurate pitch markers

[7], [8] and hence the present work uses the same in the

first module. The proposed unified modified pitch markers generation method is used in the second module. After obtaining the modified pitch markers, the original pitch markers nearest to each modified pitch marker are located. The method described in [8] is then used to synthesize the prosody modified speech by copying waveform samples from the original pitch marker locations.

The prosody modification process can be viewed as either static or dynamic. In static prosody modification, all the pitch markers are modified by a constant factor. In dynamic prosody modification, each pitch marker is modified by a different factor. The proposed modified pitch markers generation method is designed with dynamic prosody modification in view, and can therefore also be used for static modification by treating it as a special case of dynamic modification.

A. Duration Modification

For static duration modification, every pitch marker in PO is assigned the fixed duration modification factor δi = β and scaling factor ζi = 1. For dynamic duration modification, every pitch marker in PO is assigned a varying duration modification factor δi = βi and scaling factor ζi = 1. The inputs to the pitch markers generation algorithm are PO, δi and ζi. The prosody modified speech is synthesized by copying the waveform samples from the original pitch marker locations to the nearest modified pitch marker locations. All the speech samples in the pitch period interval, from the original pitch marker location to the next pitch marker, are considered for copying.
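The copying step can be sketched for the static duration case as below. This is only an illustration under stated assumptions: the nearest original marker is found by mapping each modified marker back to the original time axis (dividing its offset by β), and the copied pitch periods are simply concatenated. The function name `synthesize_duration_modified` is hypothetical, and the actual procedure in [8] may differ in these details.

```python
import numpy as np


def synthesize_duration_modified(signal, p_o, p_m, beta):
    """Copy one pitch period per modified marker from the nearest original marker."""
    signal = np.asarray(signal, dtype=float)
    p_o = np.asarray(p_o, dtype=int)      # original markers (sample indices)
    p_m = np.asarray(p_m, dtype=float)    # modified markers for factor beta
    segments = []
    for pm in p_m[:-1]:
        # Assumption: map the modified marker back to the original time axis.
        t = p_o[0] + (pm - p_m[0]) / beta
        # Nearest original marker that still has a following marker.
        k = int(np.argmin(np.abs(p_o[:-1] - t)))
        # Copy all samples from this marker up to the next original marker.
        segments.append(signal[p_o[k]:p_o[k + 1]])
    return np.concatenate(segments)


period = 80
signal = np.sin(2 * np.pi * np.arange(400) / period)   # synthetic "voiced" signal
p_o = np.arange(0, 401, period)                        # 6 original markers
p_m = np.arange(0, 801, period)                        # markers for beta = 2
out = synthesize_duration_modified(signal, p_o, p_m, beta=2.0)
print(len(signal), len(out))
```

For this synthetic periodic signal the output is twice as long as the input while each copied period keeps its original length, so the pitch period is unchanged.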

B. Pitch Modification

For static pitch modification, every pitch marker in PO is assigned the fixed generation factor δi = 1/α and scaling factor ζi = α, where α is the pitch modification factor. For dynamic pitch modification, every pitch marker in PO is assigned a varying generation factor δi = 1/αi and scaling factor ζi = αi. The prosody modified speech is synthesized by copying the waveform samples from the original pitch marker locations to the modified locations. When the pitch period decreases, the waveform samples copied from the original pitch interval are overlap-added with the waveform samples in the overlap region. When the pitch period increases, the modified pitch period is filled by copying all the speech samples in the original pitch interval and resampling the 10% tail portion of the pitch interval.

C. Duration and Pitch Modification

For combined duration and pitch modification, the modified pitch markers for duration modification are obtained first using the proposed unified pitch markers generation method. These modified pitch markers are then passed through the same module once again with the pitch modification factor (αi) and scaling factor (ζi = αi). The resulting pitch markers, which reflect both the duration and the pitch modification, are used to synthesize the duration and pitch modified speech waveform.
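As a worked check of the two-pass procedure, consider static factors β and α and the approximate count M ≈ δN from Section II. The symbols N₁ and N₂ (marker counts after each pass) and T (mean original marker interval) are introduced here only for illustration.

```latex
\begin{align*}
\text{Pass 1 } (\delta = \beta,\ \zeta = 1):\quad
  & N_1 \approx \beta N, \quad \text{mean interval} \approx T, \quad
    \text{duration} \approx \beta N T \\
\text{Pass 2 } (\delta = 1/\alpha,\ \zeta = \alpha):\quad
  & N_2 \approx \frac{N_1}{\alpha} = \frac{\beta N}{\alpha}, \quad
    \text{mean interval} \approx \alpha T, \quad
    \text{duration} \approx N_2\,\alpha T = \beta N T
\end{align*}
```

For example, with N = 50, β = 2 and α = 0.5, the second pass yields about 200 markers with half the original interval, so the duration is doubled and the pitch period is halved.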

IV. EXPERIMENTAL RESULTS AND DISCUSSION

Fig. 4 compares static duration and pitch modification for the modification factors β = 2 and α = 0.5, respectively. Fig. 4(a) shows a segment of voiced speech, Fig. 4(b) its original pitch marks, Fig. 4(c) the modified pitch marks for β = 2 and Fig. 4(d) the synthesized speech for the duration modification. Fig. 4(e) shows the same segment of voiced speech used for duration modification, Fig. 4(f) its original pitch marks, Fig. 4(g) the modified pitch marks for α = 0.5 and Fig. 4(h) the speech synthesized according to the pitch modification factor. Comparing Figs. 4(c) and 4(g), it can be observed that both cases have the same number of modified pitch marks. Also, the time instants of the modified pitch markers in the pitch modified case are scaled by α compared to the duration modification case.

Fig. 4: (a) A voiced segment of speech, (b) corresponding original pitch marks, (c) modified pitch marks for the static duration modification factor β = 2 and (d) duration modified speech waveform. (e) A voiced segment of speech, (f) corresponding original pitch marks, (g) modified pitch marks for the static pitch period modification factor α = 0.5 and (h) synthesized waveform according to the pitch modification factor. [Figure not reproduced; horizontal axis: Time (ms).]

A subjective evaluation is performed for static pitch and duration modification, comparing the proposed unified pitch marker generation method with the existing conventional pitch markers generation method used in epoch based prosody modification [8], each placed in the second module of the prosody modification system. In addition, prosody modified speech synthesized by the epoch based method and by TD-PSOLA, both using the proposed unified pitch marker generation method, is subjectively evaluated. Four phonetically balanced utterances from 3 speakers (2 males and a female) of the CMU-Arctic database are selected for the subjective evaluation [9]. The files, initially sampled at 32 kHz, are downsampled to 8 kHz and used for the evaluation. Fifteen research scholars of the Electro Medical and Speech Technology (EMST) lab participated in the subjective evaluation. The subjects were asked to judge the quality of each speech file according to the distortions present in it. The significance of the scores used for the subjective evaluation is given in Table I.

Table II compares the mean opinion scores (MOS) obtained for epoch based prosody modification using the proposed modified pitch markers generation method and the conventional method. Table II also compares epoch based prosody modification and TD-PSOLA using the proposed unified pitch marker generation method. It can be observed from Table II that the MOS obtained for epoch based prosody modification using the unified pitch marker generation method are equal to or higher than those of the conventional method at every factor, with differences between 0 and 0.36 except for the pitch modification factor 0.6, where the difference is 1.03. The slight degradation in the conventional method is attributed not to the modified pitch markers generation but to the method employed for speech waveform generation in the third module. Table II also shows that, when the proposed unified pitch marker generation method is used, TD-PSOLA and epoch based prosody modification give almost the same results. These results indicate the effectiveness of the proposed method as an alternative to using two independent pitch marker generation methods for pitch and duration modification. The high MOS also suggest that the locations of the modified pitch markers generated by the proposed method are correct.

V. SUMMARY AND CONCLUSIONS

This work proposed a unified method for generating modified pitch markers for pitch and duration modification. The proposed method takes the given set of original pitch markers and generates intermediate modified pitch markers based on the modification factor. The intermediate modified pitch markers are then subjected to duration scaling to obtain the final modified pitch markers. Except for the choice of parameter values for modification and scaling, there are no other differences in obtaining the modified pitch markers for pitch and duration modification. The method was demonstrated in pitch and duration modification tasks.

TABLE I: Ranking used for judging the quality and distortion of the speech signal for different modification factors.

Rating   Speech Quality    Justification for the ranking
1        Unsatisfactory    Very annoying and objectionable
2        Poor              Annoying but not objectionable
3        Fair              Perceptible and slightly annoying
4        Good              Just perceptible but not annoying
5        Excellent         Imperceptible

TABLE II: Mean opinion scores for different duration and pitch modification factors. PPM stands for proposed pitch markers and EPM stands for existing pitch markers.

Duration modification factor
Method          0.5     1.5     2.5
PPM-Epoch       3.59    4.32    3.75
EPM-Epoch       3.23    3.98    3.52
PPM TD-PSOLA    3.64    4.28    3.64

Pitch modification factor
Method          0.6     1.5     2
PPM-Epoch       4.62    4.25    4.09
EPM-Epoch       3.59    4.03    4.09
PPM TD-PSOLA    4.40    4.15    4.09

The proposed method gives a procedure for the generation

of modified pitch markers and is independent of the prosody

modification method. Therefore the proposed method can

be employed in any of the existing prosody modification

methods, such as PSOLA and epoch based methods. Future

work should focus on incorporating this method in different

prosody modification methods and evaluating them.

VI. ACKNOWLEDGEMENTS

The present work is part of UKIERI project (2007-2011)

titled “Study of source features for speech synthesis and

speaker recognition” between IIT Guwahati, IIIT Hyderabad

and University of Edinburgh.

REFERENCES

[1] P. Taylor, Text to Speech Synthesis. Cambridge University Press, 2009.

[2] A. Hunt and A. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” in Proc. ICASSP, pp. 373–376, 1996.

[3] E. Moulines and J. Laroche, “Non-parametric techniques for pitch-scale and time-scale modification of speech,” Speech Commun., vol. 16, pp. 175–205, 1995.

[4] K. S. Rao and B. Yegnanarayana, “Prosody modification using instants of significant excitation,” IEEE Trans. Audio, Speech and Language Processing, vol. 14, pp. 972–980, May 2006.

[5] E. Moulines and F. Charpentier, “Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones,” Speech Commun., vol. 9, pp. 452–467, 1990.

[6] T. F. Quatieri and R. J. McAulay, “Shape invariant time scale and pitch modification of speech,” IEEE Trans. on Signal Process., vol. 40, no. 3, pp. 497–510, Mar. 1992.

[7] K. S. R. Murty and B. Yegnanarayana, “Epoch extraction from speech signals,” IEEE Trans. Audio, Speech and Language Process., vol. 16, no. 8, pp. 1602–1614, Nov. 2008.

[8] S. R. M. Prasanna, D. Govind, K. S. Rao, and B. Yegnanarayana, “Fast prosody modification using instants of significant excitation,” in Proc. Speech Prosody, May 2010.

[9] J. Kominek and A. Black, “CMU-Arctic speech databases,” in 5th ISCA Speech Synthesis Workshop, Pittsburgh, PA, 2004, pp. 223–224.