[ieee 2013 national conference on communications (ncc) - new delhi, india (2013.2.15-2013.2.17)]...
Unified Pitch Markers Generation Method for Pitch and Duration Modification
S R M Prasanna and D Govind
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati
Email:{prasanna,dgovind}@iitg.ernet.in
Abstract—This paper proposes a modified pitch markers generation method that can be used for both pitch and duration modification. Except for changing some input parameters, the method remains common for both. The original pitch markers and the modification and scaling factors are the inputs to the method. The modified pitch markers are the output, generated according to the given modification and scaling factors. This provides a simplified and modular approach for pitch and duration modification. The proposed method is illustrated for both static and dynamic pitch and duration modification cases. The experimental results indicate that the method can be used without any modification and with equal ease in both cases.
I. INTRODUCTION
The information in speech can be viewed at vocal tract
system, excitation source and prosodic levels. Apart from the
message part due to vocal tract component, the prosodic level
contains significant information exploited in human speech
communication. To mention a few, these include naturalness,
expressiveness, pitch contour, duration and intensity. Without
this prosodic information, the communication is not natural
and will not be expressive. Thus while doing speech synthe-
sis, apart from synthesizing message by exciting vocal tract
system with excitation source, efforts also should be made to
incorporate prosodic level information. One approach followed
during speech synthesis is to synthesize the message part and
then incorporate the required prosodic information by prosody
modification [1], [2].
The prosodic modification involves mainly the modification
of pitch and duration information [3], [4]. This is achieved
by first extracting the original pitch markers from the given
speech signal and then obtaining the modified pitch markers
from them based on the modification factors. The modified
pitch markers are used as anchor points for the reconstruction
of prosody modified speech. Thus deriving the modified pitch
markers is an important step during prosody modification.
Traditionally, the modified pitch markers for pitch and
duration modification are obtained by independent procedures
[4]–[6]. This is because the pitch should remain constant
during duration modification, and the duration should remain
constant during pitch modification. However, if we consider
only the number of modified pitch markers, setting aside the
duration information, the pitch and duration cases are the
same except for a scaling factor on duration. Thus a single
procedure can be employed for deriving modified pitch markers
suitable for both pitch and duration modification, simplifying
the procedure and hence reducing the complexity of the prosody
modification process.
If there are N original pitch markers and βi is the duration
modification factor, then there will be βi × N modified pitch
markers. For instance, if N = 50 and βi = 2, then there will
be 100 modified pitch markers. This essentially doubles the
duration of the prosody modified speech, whereas the pitch is
kept unaltered. Alternatively, if αi is the pitch modification
factor, then there will be N × (1/αi) modified pitch markers.
For instance, if N = 50 and αi = 0.5, then there will be 100
modified pitch markers. This halves the pitch period in the
prosody modified speech, whereas the duration is kept
unaltered. If we set aside what happens to pitch and duration
and concentrate only on the number of pitch markers, both
cases involve generating the same number of pitch markers.
Thus we can define a modified pitch markers generation factor
δi, with δi = βi for duration modification and δi = 1/αi for
pitch modification, and generate the modified pitch markers
from it. After this, the intervals between the modified pitch
markers are scaled in duration by a factor ζi, where ζi = 1 if
δi = βi and ζi = αi if δi = 1/αi. In this way, the modified
pitch markers for both pitch and duration modification can be
obtained using the same procedure.
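The mapping from a duration or pitch factor to the pair (δi, ζi) described above can be sketched in a few lines of Python. This is only an illustration: the function name `unified_factors` and the mode strings are hypothetical and do not come from the paper.

```python
def unified_factors(mode, factor):
    """Return (delta, zeta) for a duration factor beta or a pitch factor alpha.

    `mode` and this function name are illustrative assumptions.
    """
    if factor <= 0:
        raise ValueError("modification factor must be positive")
    if mode == "duration":
        # delta = beta, zeta = 1: intervals are left as generated
        return factor, 1.0
    if mode == "pitch":
        # delta = 1/alpha, zeta = alpha: intervals are rescaled by alpha
        return 1.0 / factor, factor
    raise ValueError("mode must be 'duration' or 'pitch'")


if __name__ == "__main__":
    n_original = 50  # N in the text
    for mode, factor in (("duration", 2.0), ("pitch", 0.5)):
        delta, zeta = unified_factors(mode, factor)
        print(mode, "markers:", round(delta * n_original), "zeta:", zeta)
```

Both calls give δi = 2 and therefore the same marker count for N = 50; only ζi differs between the two cases.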
The rest of the paper is organized as follows: Section II
describes the proposed modified pitch markers generation
method. Pitch and duration modification using the proposed
method is described in Section III. Section IV reports and
discusses the experimental results. The summary, conclusions
and scope of the reported work are given in the last section.
II. UNIFIED MODIFIED PITCH MARKERS GENERATION
Let there be N original pitch markers given by
$P_O = \{p_{o1}, p_{o2}, \ldots, p_{on}, \ldots, p_{oN}\}$ (1)
Let δi be the modified pitch markers generation factor. Then
there will be M = δi × N modified pitch markers given by
$P'_M = \{p'_{m1}, p'_{m2}, \ldots, p'_{mj}, \ldots, p'_{mM}\}$ (2)
P′M can be generated from PO by treating the modified pitch
markers as those for duration modification and using δi as the
duration modification factor.
978-1-4673-5952-8/13/$31.00 © 2013 IEEE
The number of intervals between successive modified pitch
markers will be M − 1, and the intervals are given by
$E'_M = \{e'_{m1}, e'_{m2}, \ldots, e'_{mj}, \ldots, e'_{m(M-1)}\}$ (3)
where
$e'_{mj} = p'_{m(j+1)} - p'_{mj}$ (4)
If the pitch markers are meant for duration modification, then
nothing more needs to be done, as the original procedure employs
the duration modification approach [4], [5]. Alternatively, if
the pitch markers are meant for pitch modification, then the
intervals E′M need to be scaled in duration to nullify the
change in duration. Let ζi be the duration scaling factor.
For duration modification
$\delta_i = \beta_i$ (5)
and hence
$\zeta_i = 1$ (6)
Therefore, the intervals between the final modified pitch
markers for duration modification are given by
$E_M = E'_M$ (7)
and the final pitch markers for duration modification are given
by
$P_M = \{p'_{m1},\; p'_{m1} + e'_{m1},\; p'_{m1} + e'_{m1} + e'_{m2},\; \ldots\}$ (8)
For pitch modification
$\delta_i = 1/\alpha_i$ (9)
and hence
$\zeta_i = \alpha_i$ (10)
Therefore, the intervals between the final modified pitch
markers for pitch modification are given by
$E_M = \zeta_i \times E'_M$ (11)
and the final pitch markers for pitch modification are given by
$P_M = \{p'_{m1},\; p'_{m1} + \zeta_i e'_{m1},\; p'_{m1} + \zeta_i e'_{m1} + \zeta_i e'_{m2},\; \ldots\}$ (12)
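The two steps in (1) to (12) can be illustrated with a minimal Python sketch for the static case (a single δ and ζ for all markers). The paper does not spell out how P′M is derived beyond citing [4], [5], so the first step below is an assumption: the original epoch intervals are looked up, with linear interpolation, on a time axis stretched by δ. The names `interval_at` and `unified_pitch_markers` are hypothetical.

```python
from bisect import bisect_right


def interval_at(t, markers, intervals):
    """Linearly interpolated original interval at original time t (assumed scheme)."""
    xs = markers[:-1]  # interval k starts at marker k
    if t <= xs[0]:
        return intervals[0]
    if t >= xs[-1]:
        return intervals[-1]
    k = bisect_right(xs, t) - 1
    frac = (t - xs[k]) / (xs[k + 1] - xs[k])
    return intervals[k] + frac * (intervals[k + 1] - intervals[k])


def unified_pitch_markers(p_o, delta, zeta):
    """Static-case sketch: generate P'_M with delta, then scale intervals by zeta."""
    e_o = [b - a for a, b in zip(p_o, p_o[1:])]
    if len(p_o) < 2 or delta <= 0 or zeta <= 0 or any(e <= 0 for e in e_o):
        raise ValueError("need increasing markers and positive factors")
    start = float(p_o[0])
    end = start + delta * (p_o[-1] - p_o[0])  # stretched time span
    # Step 1: intermediate markers P'_M, generated as for duration modification.
    p_prime = [start]
    while True:
        t_orig = p_o[0] + (p_prime[-1] - start) / delta  # back to original axis
        nxt = p_prime[-1] + interval_at(t_orig, p_o, e_o)
        if nxt > end + 1e-9:
            break
        p_prime.append(nxt)
    # Step 2: scale each interval e'_mj by zeta and accumulate, as in (8) and (12).
    p_m = [p_prime[0]]
    for a, b in zip(p_prime, p_prime[1:]):
        p_m.append(p_m[-1] + zeta * (b - a))
    return p_m


if __name__ == "__main__":
    p_o = [0, 100, 200, 300, 400]  # toy markers, in samples
    print(unified_pitch_markers(p_o, delta=2.0, zeta=1.0))  # duration, beta = 2
    print(unified_pitch_markers(p_o, delta=2.0, zeta=0.5))  # pitch, alpha = 0.5
```

In this toy setting both calls return the same number of markers; the second set spans the original duration with halved intervals. The count follows the number of intervals, so it matches M = δi × N only approximately.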
Fig. 1 shows a pictorial comparison of the pitch markers for
the duration and pitch modification cases. Figs. 1(a) and 1(d)
contain the original pitch markers. Figs. 1(b) and 1(c) plot
the modified pitch markers for duration modification factors
greater than 1 and less than 1, respectively. Figs. 1(e) and
1(f) plot the modified pitch markers for pitch modification
factors greater than 1 and less than 1, respectively. Comparing
Fig. 1(b) with Fig. 1(f) (similarly, Fig. 1(c) with Fig. 1(e))
shows that the number of pitch markers is the same and only
the duration differs. Thus the pitch markers of Fig. 1(f) can
be generated from the pitch markers in Fig. 1(b) by scaling in
duration. It is therefore indeed possible to employ the same
procedure for obtaining the modified pitch markers for pitch
and duration modification.
The proposed unified modified pitch marker generation
process can be represented as a block diagram shown in Fig. 2.
The original pitch markers PO and the modification factor δi
are used initially to generate the modified pitch markers P′M
for duration modification by δi. The final modified pitch
markers PM are generated by scaling the intervals between
successive pitch markers of P′M by ζi. This module can be used
at any stage of the prosody modification process, for duration
modification, pitch modification or both, which makes the
proposed method modular.
Fig. 1: Comparison of pitch markers for duration and pitch
modification. (a) The original pitch marks, (b) pitch marks for
the duration modification factors βi = 2 and (c) βi = 0.5. (d) The
original pitch marks, (e) pitch marks for the pitch modification
factors αi = 2 and (f) αi = 0.5. The horizontal axis of each
panel is time (samples).
III. PROSODY MODIFICATION USING UNIFIED PITCH
MARKER GENERATION
The prosody modification process can be viewed as three
modules, as shown in Fig. 3, namely, original pitch markers
generation, modified pitch markers generation and prosody
modified waveform generation. The original pitch markers
generation module extracts accurate pitch markers from the
given speech signal. The epoch based approach has been shown
to provide the most accurate pitch markers [7]. The modified
pitch markers generation module takes the original pitch
markers and the modification factors as input and generates
the modified pitch markers. One widely used approach in epoch
based prosody modification is to generate the epoch intervals
plot from the original pitch markers, interpolate and scale the
plot, and then derive the modified pitch markers from this
plot [4], [8]. The proposed work replaces this method in the
second module of the prosody modification block diagram shown
in Fig. 3. The waveform generation module takes the modified
pitch markers and the original speech signal as input and
constructs the prosody modified speech signal.
Fig. 2: Block diagram for the proposed unified pitch marker
generation algorithm.
Fig. 3: Schematic diagram for prosody modification.
The perceptual quality of the synthesized speech depends on
the accuracy of the methods present in each of the modules.
The zero frequency filtering (ZFF) based epoch extraction
is demonstrated to provide the most accurate pitch markers
[7], [8] and hence the present work uses the same in the
first module. The proposed unified modified pitch markers
generation method is used in the second module. After obtain-
ing the modified pitch markers, the locations of the original
pitch markers are found that are nearest to the modified pitch
markers. The method described in [8] is used to synthesize the
prosody modified speech by copying waveform samples from
original pitch marker locations.
The prosody modification process can be viewed as either
static or dynamic. In case of static prosody modification, all the
pitch markers are modified by a constant factor. Alternatively,
each pitch marker is modified by a different factor in case of
dynamic prosody modification. The proposed modified pitch
markers generation method is designed primarily with dynamic
prosody modification in view, and hence can also be used for
the static case by treating it as a special case of the dynamic
one.
A. Duration Modification
For static duration modification, every pitch marker in PO
is assigned the fixed duration modification factor δi = β
and the scaling factor ζi = 1. For dynamic duration modification,
every pitch marker in PO is assigned a varying duration
modification factor δi = βi and the scaling factor ζi = 1. The
inputs to the pitch markers generation algorithm are PO, δi and
ζi. The prosody modified speech is synthesized by copying the
waveform samples from the original pitch marker locations to
the nearest modified pitch marker locations. All the speech
samples in the pitch period interval, from an original pitch
marker location to the next pitch marker, are considered for
copying.
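The copying step for static duration modification might look like the following Python sketch. It rests on assumptions not stated in the paper: each modified marker is mapped back to the original time axis by dividing by δ before the nearest original marker is searched, ties go to the earlier marker, and the copied periods are simply concatenated. The function names are hypothetical.

```python
def nearest_original_index(t, p_o):
    """Index of the original marker closest to time t (ties go to the earlier one)."""
    return min(range(len(p_o)), key=lambda k: abs(p_o[k] - t))


def synthesize_duration(signal, p_o, p_m, delta):
    """Copy one original pitch period per modified marker (static duration case).

    p_o holds integer sample indices; p_m holds modified marker positions.
    """
    out = []
    for p in p_m[:-1]:  # the last marker has no period following it
        # Assumption: map the modified marker back onto the original time axis.
        t = p_o[0] + (p - p_m[0]) / delta
        # Clamp so that a full period [p_o[k], p_o[k+1]) always exists.
        k = min(nearest_original_index(t, p_o), len(p_o) - 2)
        out.extend(signal[p_o[k]:p_o[k + 1]])
    return out


if __name__ == "__main__":
    signal = list(range(400))          # toy "waveform": sample value = index
    p_o = [0, 100, 200, 300, 400]      # original markers (samples)
    p_m = [0, 100, 200, 300, 400, 500, 600, 700, 800]  # beta = 2
    stretched = synthesize_duration(signal, p_o, p_m, delta=2.0)
    print(len(signal), "->", len(stretched))
```

With β = 2 each original pitch period is copied twice in this toy example, so the output has twice as many samples while every period keeps its original length.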
B. Pitch Modification
For static pitch modification, every pitch marker in PO is
assigned the fixed generation factor δi = 1/α, derived from the
pitch modification factor α, and the scaling factor ζi = α, in
keeping with (9) and (10). For dynamic pitch modification, every
pitch marker in PO is assigned a varying generation factor
δi = 1/αi and scaling factor ζi = αi. The prosody modified speech
is synthesized by copying the waveform samples from the original
pitch marker locations to the modified locations. When the pitch
period decreases, the copied waveform samples of the original
pitch interval are overlap-added with the waveform samples in
the overlap region. When the pitch period increases, the
modified pitch period is filled by copying all the speech samples
in the original pitch interval and resampling the 10% tail
portion of the pitch interval.
C. Duration and Pitch Modification
For combined duration and pitch modification, the modified
pitch markers for duration modification are obtained first
using the proposed unified pitch markers generation method.
The modified pitch markers obtained are passed through the
same module once again using the pitch modification factor
(αi), that is, with δi = 1/αi and scaling factor ζi = αi. The
resulting modified pitch markers, which reflect both the
duration and the pitch modification, are then used to
synthesize the duration and pitch modified speech waveform.
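As a rough worked check of the two-pass scheme, assume static factors β and α and a mean original interval $\bar{e}$ (the symbol $\bar{e}$ and the counts $M_1$, $M_2$ are introduced here only for illustration). The marker counts and intervals then evolve approximately as:

```latex
\begin{align*}
\text{Pass 1 } (\delta = \beta,\ \zeta = 1):\quad
  & M_1 \approx \beta N, && \text{mean interval} \approx \bar{e} \\
\text{Pass 2 } (\delta = 1/\alpha,\ \zeta = \alpha):\quad
  & M_2 \approx \frac{M_1}{\alpha} = \frac{\beta}{\alpha} N, && \text{mean interval} \approx \alpha \bar{e} \\
\text{Total duration:}\quad
  & M_2 \cdot \alpha \bar{e} \approx \beta N \bar{e}
\end{align*}
```

For N = 50, β = 2 and α = 0.5 this gives about 100 markers after the first pass and about 200 after the second, with the pitch period halved and the overall duration doubled.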
IV. EXPERIMENTAL RESULTS AND DISCUSSION
Fig. 4 compares static duration and pitch modification
for the modification factors β=2 and α=0.5, respectively.
Fig. 4(a) shows a segment of voiced speech, Fig. 4(b) its
original pitch marks, Fig. 4(c) the modified pitch marks for
β=2 and Fig. 4(d) the synthesized speech for the duration
modification. Fig. 4(e) shows the same segment of voiced
speech used for duration modification, Fig. 4(f) its original
pitch marks, Fig. 4(g) the modified pitch marks for α=0.5
and Fig. 4(h) the speech synthesized according to the pitch
modification factor. Comparing Figs. 4(c) and 4(g), it can be
observed that both cases have the same number of modified
pitch marks. Also, the time instants of the modified pitch
markers in the pitch modified case are scaled by α compared
to the duration modification case.
Fig. 4: (a) A voiced segment of speech, (b) corresponding original pitch marks, (c) modified pitch marks for the static duration
modification factor β=2 and (d) duration modified speech waveform. (e) A voiced segment of speech, (f) corresponding original
pitch marks, (g) modified pitch marks for the static pitch period modification factor α=0.5 and (h) synthesized waveform
according to the pitch modification factor. The horizontal axis of each panel is time (ms).
A subjective evaluation is performed for static pitch and
duration modification, comparing the proposed unified pitch
marker generation method with the existing conventional pitch
markers generation method used in epoch based prosody
modification [8], each placed in the second module of the
prosody modification system. The prosody modified speech
synthesized by the epoch based method and by TD-PSOLA, both
using the proposed unified pitch marker generation method, is
also subjectively evaluated. Four phonetically balanced
utterances from 3 speakers (2 males and a female) of the
CMU-Arctic database are selected for the subjective evaluation
[9]. The files, initially sampled at 32 kHz, are downsampled to
8 kHz and used for the evaluation. Fifteen research scholars of
the Electro Medical and Speech Technology (EMST) lab
participated in the subjective evaluation. The subjects were
asked to judge the quality of each speech file according to the
distortions present in it. The meaning of the scores used for
the subjective evaluation is given in Table I.
Table II compares the mean opinion scores (MOS) obtained
for epoch based prosody modification using the proposed
modified pitch markers generation method and the conventional
method. Table II also compares epoch based prosody
modification and TD-PSOLA using the proposed unified pitch
marker generation method. It can be observed from Table II
that the MOS obtained for epoch based prosody modification
using the unified pitch marker generation method and the
conventional method are nearly the same. The slight
degradation in the conventional method is not due to the
modified pitch markers generation, but due to the method
employed for speech waveform generation in the third module.
Table II also shows that, with the proposed unified pitch
marker generation method, the results for TD-PSOLA and epoch
based prosody modification are almost the same. These
results indicate the effectiveness of the proposed method as an
alternative to using two independent methods of pitch marker
generation for pitch and duration modification. The high MOS
also indicate that the modified pitch markers are correctly
located by the proposed method.
TABLE I: Ranking used for judging the quality and distortion
of the speech signal for different modification factors.
Rating   Speech Quality   Justification for the ranking
1        Unsatisfactory   Very annoying and objectionable
2        Poor             Annoying but not objectionable
3        Fair             Perceptible and slightly annoying
4        Good             Just perceptible but not annoying
5        Excellent        Imperceptible
TABLE II: Mean opinion scores for different duration and
pitch modification factors. PPM stands for proposed pitch
markers and EPM stands for existing pitch markers.
Duration Modification
Method           0.5    1.5    2.5
PPM-Epoch        3.59   4.32   3.75
EPM-Epoch        3.23   3.98   3.52
PPM TD-PSOLA     3.64   4.28   3.64
Pitch Modification
Method           0.6    1.5    2
PPM-Epoch        4.62   4.25   4.09
EPM-Epoch        3.59   4.03   4.09
PPM TD-PSOLA     4.40   4.15   4.09
V. SUMMARY AND CONCLUSIONS
This work proposed a unified method for generating modified
pitch markers for pitch and duration modification. The
proposed method takes the given set of original pitch
markers and generates the intermediate modified pitch markers
based on the modification factor. The intermediate modified
pitch markers are then subjected to duration scaling to obtain
the final modified pitch markers. Except for the choice of
parameter values for modification and scaling, there are no
other differences in obtaining the modified pitch markers for
pitch and duration modification. The method was then
demonstrated in pitch and duration modification tasks.
The proposed method gives a procedure for the generation
of modified pitch markers and is independent of the prosody
modification method. Therefore the proposed method can
be employed in any of the existing prosody modification
methods, such as PSOLA and epoch based methods. Future
work should focus on incorporating this method in different
prosody modification methods and evaluating them.
VI. ACKNOWLEDGEMENTS
The present work is part of the UKIERI project (2007-2011)
titled “Study of source features for speech synthesis and
speaker recognition” between IIT Guwahati, IIIT Hyderabad
and University of Edinburgh.
REFERENCES
[1] P. Taylor, Text to Speech Synthesis. Cambridge University Press, 2009.
[2] A. Hunt and A. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” in Proc. ICASSP, pp. 373–376, 1996.
[3] E. Moulines and J. Laroche, “Non-parametric techniques for pitch-scale and time-scale modification of speech,” Speech Commun., vol. 16, pp. 175–205, 1995.
[4] K. S. Rao and B. Yegnanarayana, “Prosody modification using instants of significant excitation,” IEEE Trans. Audio, Speech and Language Processing, vol. 14, pp. 972–980, May 2006.
[5] E. Moulines and F. Charpentier, “Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones,” Speech Commun., vol. 9, pp. 452–467, 1990.
[6] T. F. Quatieri and R. J. McAulay, “Shape invariant time scale and pitch modification of speech,” IEEE Trans. on Signal Process., vol. 40, no. 3, pp. 497–510, Mar 1992.
[7] K. S. R. Murty and B. Yegnanarayana, “Epoch extraction from speech signals,” IEEE Trans. Audio, Speech and Language Process., vol. 16, no. 8, pp. 1602–1614, Nov. 2008.
[8] S. R. M. Prasanna, D. Govind, K. S. Rao, and B. Yegnanarayana, “Fast prosody modification using instants of significant excitation,” in Proc. Speech Prosody, May 2010.
[9] J. Kominek and A. Black, “CMU-Arctic speech databases,” in 5th ISCA Speech Synthesis Workshop, Pittsburgh, PA, 2004, pp. 223–224.