

音声研究 第21巻第3号 2017(平成29)年12月 63–73頁
Journal of the Phonetic Society of Japan, Vol. 21 No. 3
December 2017, pp. 63–73

特集論文 (Special Issue Paper)

Application of Time-frequency Representations of Aperiodicity and Instantaneous Frequency for Detailed Analysis of Filled Pauses

Hideki Kawahara*

非周期性と瞬時周波数の時間―周波数表現のFilled pauseの詳細な分析への応用について

SUMMARY: This paper introduces a new fine-grained voice source analysis method and its application to filled pause analysis in the CSJ (Corpus of Spontaneous Japanese). The new source analysis procedure is designed to provide annotation with reliable and precise descriptions of objective characteristics for items in large speech corpora. This design target led the new analysis method to provide far more accurate descriptions than existing methods. The new method provides the fundamental frequency estimate and the band-wise aperiodicity information simultaneously. It also provides an information-rich representation, a probability map of the fundamental component. This paper presents several analysis examples and discussions.

Key words: filled pause, fundamental frequency, aperiodicity, instantaneous frequency, error estimates

1. Introduction

Filled pauses are generated with different vocal tract shapes as well as different voice excitation source characteristics, such as vocal cord vibration and turbulent noise, from ordinarily spoken voices (Maekawa and Mori 2016). This paper focuses on the analysis of the source characteristics of filled pauses in the CSJ (Corpus of Spontaneous Japanese) (Maekawa 2003). The goal of this paper is to provide a calibrated objective measure to characterize filled pauses. It is important to provide dependable representations of the source characteristics in order to fully make use of rapidly advancing machine learning tools in the investigation of the intricate nature of filled pauses. To attain this goal, we introduce a new fine-grained voice source analysis method (Kawahara et al. 2017c). The method is an integrated procedure made from two periodicity analysis methods (Kawahara et al. 2016, Kawahara et al. 2017a) based on different aspects of deviations from periodicity.

This paper is organized as follows. First, we briefly review fundamental frequency estimation methods to explain the background for developing the new procedure. Next, we introduce the architecture and the principles of operation of each component procedure, followed by an introduction to visualization tools. Using these visualization tools, we illustrate several analysis examples of filled pauses. Finally, possible applications of the proposed procedure for analyzing the CSJ are discussed.

* Wakayama University (和歌山大学)


2. Background

Fundamental frequency (fo hereafter; Titze et al. 2015) is an important physical attribute of voiced sounds, and it is closely related to pitch, an important perceptual attribute of sounds (Moore 2012). Even before the emergence of digital signal processing, the VOCODER (Dudley 1939) extracted the periodic excitation of voiced sounds using analog circuits 80 years ago. Although detection and estimation of fo have been investigated since then and numerous algorithms have been proposed, there is no perfect method. Users of fo extractors have to select a relevant algorithm depending on the purpose of the application which uses the fo information. This situation was systematically reviewed in a comprehensive book on fo extraction (Hess 1983) and still holds true today.

Periodic sounds repeat the same waveform. The shortest repetition period defines the fundamental period. The reciprocal of the fundamental period is fo. This leads to three types of fo extractors. The first category directly measures this repetition period using waveform-based algorithms. The second category uses the harmonic structure in the frequency domain. The third category uses the peak location of the autocorrelation of the signal.
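As an illustration of the third category only (not of the proposed method), the following sketch estimates fo of one analysis frame from the highest peak of the autocorrelation function; the sampling frequency, search range, and test signal are assumptions made for this example.

    import numpy as np

    def autocorr_f0(frame, fs, f_lo=40.0, f_hi=1000.0):
        """Estimate fo of one frame from the highest autocorrelation peak (illustrative sketch)."""
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # non-negative lags
        lag_min, lag_max = int(fs / f_hi), int(fs / f_lo)              # search range in samples
        lag = lag_min + np.argmax(ac[lag_min:lag_max])                 # fundamental period estimate
        return fs / lag

    # example: a 150 Hz square-wave-like periodic signal, one 128 ms frame (assumed values)
    fs = 16000
    t = np.arange(2048) / fs
    x = np.sign(np.sin(2 * np.pi * 150 * t)) + 0.3 * np.sin(2 * np.pi * 150 * t)
    print(autocorr_f0(x, fs))   # approximately 150 Hz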


Figure 1 Residual calculation procedure for each periodicity detector. The instantaneous frequency calculation part in the original reference (Kawahara et al. 2016) is removed.

The current CSJ provides fo information using the “get_f0” program based on the RAPT algorithm (Talkin 1995). It combines autocorrelation-based candidate extraction and an optimal trajectory search based on a dynamic programming procedure. Although the program is robust and reliable, estimation errors and failures in detecting periodicity remain. It does not provide calibration data on the bias and variance of the fo estimates. Deviations from usual periodic voiced sounds, which are found more frequently in filled pauses in the CSJ, make them difficult to handle with “get_f0.” Because the primary application of the CSJ is to phonological aspects of linguistics, it is important to provide conceptually clear and quantitatively accurate fo descriptions.

Strictly speaking, the concept behind fo is not applicable to speech sounds, because they are not mathematically periodic. Prosodic information introduces temporal modification of the vocal fold vibration rate, the stochastic nature of the muscular drive introduces statistical variation, and bifurcation and chaotic states in vocal fold vibration are not uncommon (Titze 1994). These are the sources of deviation from pure periodicity. Commonly used fo extractors (for example, de Cheveigné and Kawahara 2002, Camacho and Harris 2008) approximate voiced sounds as periodic. Instead of introducing this periodicity approximation, it is more relevant to use extended concepts which are suitable for time-varying signals. They are instantaneous frequency (Flanagan and Golden 1966) and instantaneous amplitude. The time derivative of the phase defines the instantaneous frequency. The absolute value of a complex-valued signal provides the instantaneous amplitude. These concepts made it possible to derive two measures which represent deviations from mathematical periodicity objectively and quantitatively. One measure is based on the relative power of residuals (Kawahara et al. 2016). The other measure is based on fluctuations in derivatives of the phase of the filtered signal (Kawahara et al. 2017a). The combination of these two measures provides a reliable and calibrated measure of deviation from periodicity (Kawahara et al. 2017c). We use this measure for extracting fo and characterizing the source information.
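As a minimal illustration of these two concepts (not of the proposed detectors), the following sketch obtains the instantaneous amplitude and the instantaneous frequency of a test signal from its analytic signal; the use of scipy's Hilbert transform and the test chirp are assumptions made for this example.

    import numpy as np
    from scipy.signal import hilbert

    fs = 16000
    t = np.arange(int(0.1 * fs)) / fs
    x = np.sin(2 * np.pi * (120 * t + 10 * t ** 2))    # test tone with slowly rising frequency

    z = hilbert(x)                                      # analytic (complex-valued) signal
    inst_amplitude = np.abs(z)                          # absolute value -> instantaneous amplitude
    phase = np.unwrap(np.angle(z))                      # unwrapped phase
    inst_frequency = np.diff(phase) * fs / (2 * np.pi)  # time derivative of phase, in Hz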

3. Source Information Extraction Tool

The new source information extraction tool based on this measure has three subsystems: a) a periodicity detector, b) an fo tracker, and c) a visualization subsystem. This section mainly describes the first subsystem, and the following section describes the visualization subsystem. For the tracking subsystem, several alternative methods are being tested.

Deviation from periodicity is represented as an SNR (Signal-to-Noise Ratio). The signal represents the periodic component, which has a harmonic structure. The noise represents the components added to this harmonic structure. Reliable estimation of this SNR has been difficult because periodic components are usually significantly stronger than the noise.

3.1 Residual-based Periodicity Detector

The residual-based periodicity detector evaluates the deviation of a filter output from the most prominent sinusoidal component without requiring a priori knowledge of the frequency of the prominent component (Kawahara et al. 2016). Each filter has a pass-band which is narrow enough to isolate the fundamental component of a periodic signal and wide enough to cover both the first and the second harmonic component when centered at the second harmonic component. This requirement provides a set of band-pass filters which have the same shape on the logarithmic frequency axis.

Figure 1 shows how to calculate each residual from the input signal x[n]. The two filters of each detector have the same complex-valued impulse response. Subtracting the twice-filtered and absolute value normalized signal y2[n] from the once-filtered and absolute value normalized signal y1[n] yields the residual signal r[n]. The power of this residual signal is proportional to the deviation from periodicity. Note that the instantaneous frequency calculation part in the original reference (Kawahara et al. 2016) is removed from this residual-based detector because it is calculated by the instantaneous frequency-based detector.
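The following is a rough, simplified sketch of this idea rather than the exact procedure of Kawahara et al. (2016): a complex band-pass filter is applied once and twice, the twice-filtered output is rescaled by a least-squares complex gain (standing in here for the absolute value normalization of the original method), and the relative residual power serves as the deviation measure. The Gaussian-windowed filter shape and all parameter values are assumptions made for illustration.

    import numpy as np

    def complex_bandpass(fc, fs, n_periods=6):
        """Gaussian-windowed complex exponential impulse response centered at fc (illustrative)."""
        half = int(n_periods * fs / fc)
        n = np.arange(-half, half + 1)
        w = np.exp(-0.5 * (n / (half / 2.5)) ** 2)         # Gaussian envelope
        h = w * np.exp(1j * 2 * np.pi * fc * n / fs)
        return h / np.sum(np.abs(h))

    def residual_deviation(x, fc, fs):
        h = complex_bandpass(fc, fs)
        y1 = np.convolve(x, h, mode="same")                # once-filtered output
        y2 = np.convolve(y1, h, mode="same")               # twice-filtered output
        g = np.vdot(y2, y1) / np.vdot(y2, y2)              # complex gain aligning y2 to y1
        r = y1 - g * y2                                    # residual signal
        sl = slice(len(h), len(x) - len(h))                # discard filter edge transients
        return np.sum(np.abs(r[sl]) ** 2) / np.sum(np.abs(y1[sl]) ** 2)

    fs = 16000
    t = np.arange(fs) / fs
    pure = np.cos(2 * np.pi * 120 * t)
    noisy = pure + 0.3 * np.random.randn(len(t))
    print(residual_deviation(pure, 120, fs))    # close to zero: nearly periodic
    print(residual_deviation(noisy, 120, fs))   # larger: deviation from periodicity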


Figure 2 Calculation procedure of the phase derivative-based deviation measure and the instantaneous frequency. The instantaneous frequency calculation uses a simpler procedure than the original reference (Kawahara et al. 2017a).

3.2 Instantaneous Frequency-based Periodicity Detector

The instantaneous frequency-based periodicity detector evaluates the deviation of a filter output from a sinusoidal component using derivatives of the phase of the filter output (Kawahara et al. 2017a). This procedure does not require a priori knowledge about the frequency of the sinusoidal component. This detector also uses a set of band-pass filters which have the same shape on the logarithmic frequency axis. A new cosine series windowing function (Kawahara et al. 2017b) is used for designing the complex-valued impulse response of the filter.

Figure 2 shows the outline of the procedure for calculating the phase derivative-based deviation measure. The combination of the squared frequency derivative of the instantaneous frequency and the squared time-frequency derivative of the instantaneous frequency, with appropriate calibration, provides this phase derivative-based measure. This measure is also proportional to the deviation from periodicity. Note that Flanagan's instantaneous frequency equation (Flanagan and Golden 1966) used in the original proposal (Kawahara et al. 2017a) is replaced by a simpler equation using an inverse trigonometric function. This replacement makes use of the specialized instruction sets of modern CPUs, which calculate inverse trigonometric functions fast (for example, Intel 2017).
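The simpler instantaneous frequency calculation mentioned above can be sketched as follows: for a complex-valued filter output y[n], the angle of the product of a sample and the complex conjugate of the previous sample (equivalently, the angle of their ratio) gives the phase increment per sample and hence the instantaneous frequency. This is an illustrative sketch, not the calibrated implementation.

    import numpy as np

    def instantaneous_frequency(y, fs):
        """Instantaneous frequency (Hz) from successive samples of a complex filter output."""
        ratio = y[1:] * np.conj(y[:-1])         # same angle as the ratio y[n+1] / y[n]
        return np.angle(ratio) * fs / (2 * np.pi)

    # example: complex exponential at 123.4 Hz (assumed test values)
    fs = 16000
    n = np.arange(1000)
    y = np.exp(1j * 2 * np.pi * 123.4 * n / fs)
    print(instantaneous_frequency(y, fs)[:3])   # approximately [123.4, 123.4, 123.4]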

3.3 Integrated Periodicity Detector

The residual-based measure and the instantaneous frequency-based measure are approximately statistically independent. The combination of these two measures therefore provides a smaller estimation variance than each component alone (Kawahara et al. 2017c). The periodicity detector subsystem of the new source analysis method has a set of detectors using a logarithmically linear center frequency allocation. In the following examples, 12 detectors are allocated for each octave in a 40 Hz to 1000 Hz range. Each detector also calculates the instantaneous frequency of its filtered output by calculating the angle of the complex-valued ratio of two successive output samples.

This detector summarizes the results to generate a set of fo candidates. When a prominent sinusoidal component is mixed with wide-band random noise, the filter outputs whose pass-bands contain the prominent component have smaller deviation values than the other filter outputs. A weighted average of the instantaneous frequencies of the filter outputs which contain the same sinusoidal component provides a frequency estimate of the fo candidate, and a weighted harmonic mean of the deviation measure provides the estimate of its variance. A set of simulations was conducted to tune this detector to provide a calibrated fo variance estimate. These simulations indicated that the final periodicity detector provides SNR estimates that are linear to the true SNR over a 0 dB to 80 dB range (Kawahara et al. 2017c). The six-term cosine series window (Kawahara et al. 2017b) used in each component detector eliminates glitches which occur when conventional windowing functions are used.
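A minimal sketch of two ingredients described above, under illustrative assumptions: the logarithmically linear center frequency allocation (12 detectors per octave from 40 Hz to 1000 Hz) and the combination of two approximately independent deviation measures, here expressed as variances and combined by inverse-variance weighting so that the combined variance is smaller than either input.

    import numpy as np

    # logarithmically linear center frequency allocation: 12 detectors per octave, 40-1000 Hz
    f_lo, f_hi, per_octave = 40.0, 1000.0, 12
    n_ch = int(np.floor(per_octave * np.log2(f_hi / f_lo))) + 1
    fc = f_lo * 2.0 ** (np.arange(n_ch) / per_octave)

    def combine_variances(var_residual, var_phase):
        """Inverse-variance weighting: the combined variance is smaller than each input."""
        return 1.0 / (1.0 / var_residual + 1.0 / var_phase)

    print(fc[0], fc[-1])                   # 40.0 Hz up to close to 1000 Hz
    print(combine_variances(0.04, 0.09))   # about 0.028, smaller than both inputs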

4. Visualization of Source Periodicity

Three representations are used to visualize the output of the integrated periodicity detector. The first representation is a candidate map. The second one is a periodicity (probability) map. They are mainly used for software development and later analyses. The third representation is an integrated information display. This is used for visual feedback and interactive inspections. The fo trajectory and the confidence interval can also be used for later analyses.


Figure 3 Example of extracted frequencies of fo candidates. The top panel shows the waveform. The lower panel shows the extracted candidates. The darkness of the dots represents the strength of each candidate, which is calculated from the estimated error variance. The filled pause in this example spans from the beginning to 0.4 s. The voice quality of the other parts is modal.

4.1 Raw Candidate Map

The first display visualizes the raw output of fo candidates. Each candidate has two attributes. The first attribute is the weighted mean of the instantaneous frequency of the filtered output signals. The other attribute is the weighted mean of the estimated standard deviation of the filtered output signals. These raw candidates are used to calculate the probability map and used to help prepare the integrated source information display.

Figure 3 shows an example of this raw candidate map. This display shows an example of a filled pause spoken by a male talker with creaky voice quality. This example is an excerpt from the CSJ (Maekawa 2003). The top panel shows the original waveform. The segment from the beginning to 0.4 s corresponds to the filled pause /eH/. The following ordinary speech says /okane o herasi te demo ii/, spoken in the same voice quality. Each candidate is represented by a dot on a gray scale. A darker color represents a smaller error variance. In other words, a darker color indicates stronger periodicity. A pseudo-color coding is used for color display devices.

4.2 Probability Map

The raw candidate map yields a visualization of the probability of fo in each time-frequency bin, the probability map. For speeding up visualization, the calculation of the probability of each candidate uses an approximation. The probability assigned to each frequency bin represents the sum of the probability of all periodic components in that bin. The sum of the probability in all frequency bins in each frame may exceed one because we allow the signal to have multiple periodicities.

Figure 4 shows an example probability map of a filled pause /eH/ spoken with creaky voice quality. The sample is the same as that in Figure 3. This representation is more informative than Figure 3 because it intuitively shows the trade-off between the spread of the distribution and the corresponding probability. It is also because the frequency bins are finer than the filter allocation. In this example, 24 bins are allocated for each octave. The one-octave separation of the two visually salient trajectories indicates that the lower trajectory corresponds to fo. For example, in the filled pause part, the trajectory moving from 90 Hz to 60 Hz may correspond to fo.
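A rough sketch of how such a map could be assembled from the candidates (the approximation actually used for speeding up the calculation is not reproduced here): each candidate spreads a unit of probability over logarithmically spaced frequency bins as a Gaussian whose width is the candidate's estimated standard deviation, and the contributions of all candidates of a frame are summed bin by bin, so the frame total may exceed one. The bin spacing follows the 24 bins per octave used in the examples; everything else is an assumption for illustration.

    import numpy as np
    from scipy.stats import norm

    # logarithmic frequency bins: 24 bins per octave, 40-1000 Hz (values used in the examples)
    per_octave, f_lo, f_hi = 24, 40.0, 1000.0
    n_bins = int(np.floor(per_octave * np.log2(f_hi / f_lo))) + 1
    edges = f_lo * 2.0 ** (np.arange(n_bins + 1) / per_octave)    # bin edges on a log axis

    def frame_probability(candidates):
        """Sum per-bin probabilities of all fo candidates of one frame (illustrative sketch).

        candidates: list of (frequency_hz, standard_deviation_hz) pairs.
        The total over all bins can exceed one when several periodicities coexist.
        """
        p = np.zeros(n_bins)
        for f, sd in candidates:
            cdf = norm.cdf(edges, loc=f, scale=sd)    # Gaussian spread of one candidate
            p += np.diff(cdf)                         # probability mass falling in each bin
        return p

    p = frame_probability([(90.0, 3.0), (180.0, 6.0)])   # an fo candidate and its octave
    print(p.sum())                                        # about 2: two periodic components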

4.3 Source Information Display

The source information display provides an integrated view of the source information. The time-frequency locations of all candidates, the best candidates of fo and their confidence intervals, and the waveform are time-aligned in the integrated display.

Figure 5 shows the integrated display of an example of the filled pause with creaky voice quality.


Figure 4 Probability map of the creaky voice example from Figure 3. The top panel shows the waveform. The bottom panel shows the periodicity map. Higher probability corresponds to a darker image.

Figure 5 Integrated display of an example of a filled pause with creaky voice quality from Figures 3 and 4. The top panel shows the waveform. The bottom panel consists of the following information: the best fo candidates and the other candidates, the guide for selecting the best candidate, and the 95% confidence interval.

The lower panel of Figure 5 displays the following information using color, which is shown using a gray scale in this printed form. All fo candidates are shown using light gray (light green on a color display) dots. The best candidates of fo are shown using black dots. The procedure to select the best candidate is based on a simple Kalman filtering (Garner et al. 2013), assuming that only one fo exists at each instant of analysis. The output of the Kalman filtering, the latent variable, is shown using a thick light gray (cyan on a color display) line. The candidate which has the highest posterior probability is selected. The upper and lower 95% confidence limits are displayed using two dark gray (red on a color display) lines. In this case, the best fo component of the filled pause /eH/ appears modulated. This is observable by comparing the waveform and the fo value.
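The guiding trace can be sketched with a simple scalar Kalman filter on the candidate frequencies, a simplified stand-in for the procedure of Garner et al. (2013); at each frame the candidate with the highest (Gaussian) posterior likelihood under the prediction is selected here. The random-walk model, the process noise value, and the toy candidate lists are assumptions made for illustration.

    import numpy as np

    def kalman_guide(candidates_per_frame, q=4.0):
        """Track fo over frames with a scalar random-walk Kalman filter (illustrative sketch).

        candidates_per_frame: list of frames; each frame is a list of
        (frequency_hz, error_variance) candidates. q is the process noise variance (assumed).
        Returns the guiding trace and the selected best candidate per frame.
        """
        first = candidates_per_frame[0][0]
        mean, var = first[0], first[1]                    # initialize from the first candidate
        trace, best = [], []
        for cands in candidates_per_frame:
            var_pred = var + q                            # predict: random-walk model
            # choose the candidate with the highest posterior (Gaussian) likelihood
            scores = [np.exp(-0.5 * (f - mean) ** 2 / (var_pred + r)) / np.sqrt(var_pred + r)
                      for f, r in cands]
            f_obs, r_obs = cands[int(np.argmax(scores))]
            k = var_pred / (var_pred + r_obs)             # Kalman gain
            mean = mean + k * (f_obs - mean)              # update the latent fo
            var = (1.0 - k) * var_pred
            trace.append(mean)
            best.append(f_obs)
        return np.array(trace), np.array(best)

    frames = [[(90.0, 4.0), (180.0, 9.0)], [(88.0, 4.0), (176.0, 9.0)], [(85.0, 4.0)]]
    print(kalman_guide(frames))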


Figure 6 Probability map of filled pauses spoken with creaky voice quality. They span from 0.15 s to 0.4 s, and 0.95 s to 1.3 s.


5. Filled Pause Examples in the CSJ-core

This section provides other analysis examples of filled pauses. They are excerpted from the X-JToBI (Maekawa et al. 2002) annotated part of the Corpus of Spontaneous Japanese (Maekawa 2003), known as the CSJ-core.

5.1 Filled Pauses Spoken with Creaky Voice Quality

For other examples spoken with creaky voice quality, it is very difficult to determine fo as a unique value in each frame. Only probability maps are shown for the following examples because the Kalman filter-based guide traces were unreliable.

Figures 6, 7 and 8 show probability map examples of filled pauses spoken with creaky voice quality. Figure 6 has two filled pauses /eH/, from 0.15 s to 0.4 s and from 0.95 s to 1.3 s. The fo is clearly observed in the ordinary speech segment which starts from 1.3 s. The probability map of this segment shows a typical example of modal voice quality. It has only one outstanding probability peak in each frame, and the peaks are temporally contiguous. However, it is very difficult to determine fo reliably for the filled pauses. The many short, intermittent regions of strong periodicity around 500 Hz correspond to the first formant of /eH/.

Figure 7 has a filled pause from the beginning to 1 s. Around 0.25 s, the filled pause shows ordinary periodicity. The remaining part does not have a reliable fo. This example also shows short and intermittent strongly periodic regions which correspond to the first formant of /eH/.

Figure 8 shows a filled pause from 0.7 s to 1.1 s. The probability map shows a strong frequency modulation (FM) of the fo component around 220 Hz in the filled pause. The waveform also shows a strong amplitude modulation (AM) which is synchronized with the FM.

5.2 Filled Pauses Spoken with Breathy Voice Quality

Figure 9 shows an example probability map of a filled pause /eH/ spoken with breathy voice quality. The filled pause is from 0.3 s to 0.7 s. This example shows a reliable fo trace in the filled pause region, similar to the following ordinary voice region. Note that a notch filter preprocessing was used to suppress the 50 Hz induction noise in this example.
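Such preprocessing can be sketched, for example, with an IIR notch filter; the sampling frequency, the quality factor, and the test signal below are assumptions made for this example, not the settings used for the CSJ recordings.

    import numpy as np
    from scipy.signal import iirnotch, filtfilt

    fs = 16000                       # sampling frequency (assumed here)
    f_notch, q = 50.0, 30.0          # suppress 50 Hz induction (mains) noise; q is assumed
    b, a = iirnotch(f_notch, q, fs)

    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 120 * t) + 0.2 * np.sin(2 * np.pi * 50 * t)  # voice-like tone + hum
    x_clean = filtfilt(b, a, x)      # zero-phase filtering keeps the waveform alignment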

Figure 10 shows the integrated source information display of the same sample as in Figure 9. The 95% confidence intervals are invisible in the ordinary voice and filled pause regions because the estimates are highly reliable and the intervals are covered by the best candidates' dots.

Figure 11 shows a filled pause spoken with breathy voice quality. It consists of an unvoiced region and a voiced region. The beginning of the voiced region shows clear periodicity. The periodicity deteriorates from 0.7 s.


Figure 7 Probability map of a filled pause spoken with creaky voice quality. It spans from the beginning to 1 s.

Figure 8 Probability map of a filled pause spoken with creaky voice quality. It spans from 0.7 s to 1.1 s.


Figure 12 shows a filled pause spoken with breathy voice quality. This example has a clearly periodic fo trajectory in the beginning of the filled pause. It also deteriorates from 0.75 s.

In some cases, subharmonics are found in filled pauses. The following two examples use the probability map for visualization.

Figures 13 and 14 show subharmonic behavior in filled pauses. The lower limit of the frequency range is extended to 20 Hz to cover the subharmonic range. Figure 13 has the prominent periodicity around 90 Hz, but it also has periodicity around 45 Hz and 25 Hz from 0.5 s to 0.7 s. The latter two frequencies roughly correspond to 1/2 and 1/4 of the prominent frequency. The 1/2 subharmonic is also clearly seen in the waveform plot. Around 0.6 s, every other cycle has a peak.


Figure 9 Probability map of a breathy voice example. A filled pause spans from 0.3 s to 0.7 s. The top panel shows the waveform. The bottom panel shows the periodicity map. Higher probability corresponds to a darker image.

Figure 10 Integrated display of an example of a filled pause with breathy voice quality. A filled pause spans from 0.3 s to 0.7 s. The top panel shows the waveform. The bottom panel consists of the following information: candidates of fo shown using light gray dots, the best fo guiding trace (a thick light gray line) with thin gray lines indicating the confidence interval, and the selected best fo candidate in each frame.

The filled pause of Figure 14 is from 0.1 s to 0.6 s. The prominent periodicity is around 90 Hz, and the subharmonic components are around 1/3 and 2/3 of the prominent frequency. The 1/3 subharmonic is also seen in the waveform plot. Around 0.2 s, two cycles out of every three cycles of the 90 Hz repetition have peaks, which corresponds to the 1/3 subharmonic. Also, because two of the peaks point in the same direction, the asymmetry introduces the 2/3 subharmonic. The ordinary voice region also shows subharmonic components from 0.8 s to 1.2 s.


Figure 11 Integrated display of an example of a filled pause with breathy voice quality. A filled pause spans from 0.3 s to 0.9 s. The top panel shows the waveform. The bottom panel consists of the following information: the light gray dots represent candidates of fo, a thick gray line represents the best fo guiding trace, thin gray lines around the thick line indicate the confidence interval, and the black dots represent the best fo candidate in each frame.

Figure 12 Integrated display of an example of a filled pause with breathy voice quality. A filled pause spans from 0.2 s to 1 s. The top panel shows the waveform. The bottom panel consists of the following information: the light gray dots represent candidates of fo, a thick gray line represents the best fo guiding trace, thin gray lines around the thick line indicate the confidence interval, and the black dots represent the best fo candidate in each frame.

6. Discussion



Figure 13 Probability map of a breathy voice example of a filled pause. The filled pause spans from 0.35 s to 1.35 s. This example shows subharmonic behavior. The top panel shows the waveform. The bottom panel shows the periodicity map. Higher probability corresponds to a darker image.

Figure 14 Probability map of a breathy voice example of a filled pause. The filled pause spans from 0.1 s to 0.6 s. This example shows subharmonic behavior. The top panel shows the waveform. The bottom panel shows the periodicity map. Higher probability corresponds to a darker image.

This preliminary application of the new fine-grained fo extractor to filled pause analysis suggested that representing fo as a scalar value is not an appropriate concept for representing filled pause source characteristics. Ordinary speech and filled pauses with clear periodicity are well represented by fo. However, the strong deviations from periodicity usually found in filled pauses make fo an insufficient parameter for representing them. For subharmonics, the probability map is useful to characterize such behavior. Figure 8 shows strong FM in the probability map, and the waveform on top of the plot indicates AM which is synchronized with the FM. The irregular excitations found in Figures 6 and 7 are also visible using the probability map. Statistical parameters which are able to quantitatively characterize these deviations are to be investigated. In particular, the randomness in the last creaky case needs time domain-based parameters. Group delay-based methods (Murty and Yegnanarayana 2008, Kawahara et al. 2000), which are defined using the frequency derivative of the phase, will provide the key.
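As an illustration of that ingredient only (not of the cited methods themselves), the frequency derivative of the short-time phase, i.e. the group delay, can be computed without phase unwrapping from two windowed spectra: one using the analysis window and one using the time-weighted window. The frame length, window, and test tone below are assumptions made for illustration.

    import numpy as np

    def group_delay_spectrum(frame, window):
        """Group delay (in samples) of one frame: the negative frequency derivative of phase."""
        n = np.arange(len(frame)) - (len(frame) - 1) / 2.0    # time index centered on the frame
        X = np.fft.rfft(frame * window)
        Xt = np.fft.rfft(frame * window * n)                  # spectrum of the time-weighted frame
        return np.real(Xt * np.conj(X)) / (np.abs(X) ** 2 + 1e-12)

    fs = 16000
    t = np.arange(512) / fs
    frame = np.cos(2 * np.pi * 200 * t)                       # test tone centered in the frame
    gd = group_delay_spectrum(frame, np.hanning(512))         # near zero around the 200 Hz bins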



7. Conclusion

A new detailed analysis framework for the source information of voiced sounds was introduced for filled pause analyses. It is based on an objective and calibrated measure of deviation from pure periodicity and provides an estimate of the error variance of fo. The estimation results are visualized in three formats: the raw candidate map, the probability map, and the integrated display. For filled pauses, the probability map provides rich and relevant information because filled pauses usually show stronger deviations from pure periodicity than ordinary voices. Preliminary application of these displays to the CSJ indicated a need for a set of statistical parameters for representing the randomness found in filled pauses.

Acknowledgement

This work was supported by JSPS KAKENHI Grant Number 26284062.

References

Camacho, A. and J. G. Harris (2008) “A sawtooth waveform inspired pitch estimator for speech and music.” The Journal of the Acoustical Society of America 124(3), 1638–1652.

de Cheveigné, A. and H. Kawahara (2002) “YIN, a fundamental frequency estimator for speech and music.” The Journal of the Acoustical Society of America 111(4), 1917–1930.

Dudley, H. (1939) “Remaking speech.” The Journal of the Acoustical Society of America 11(2), 169–177.

Flanagan, J. L. and R. M. Golden (1966) “Phase vocoder.” Bell System Technical Journal 45(9), 1493–1509.

Garner, P. N., M. Cernak and P. Motlicek (2013) “A simple continuous pitch estimation algorithm.” IEEE Signal Processing Letters 20(1), 102–105.

Hess, W. (1983) Pitch determination of speech signals: Algorithms and devices. Springer-Verlag.

Intel Corporation (2017) “Intel Math Kernel Library 2017, Vector Mathematics (VM), Performance and Accuracy Data.” Documentation in Intel Software Developer Zone, https://software.intel.com/.

Kawahara, H., Y. Atake and P. Zolfaghari (2000) “Accurate vocal event detection method based on a fixed-point analysis of mapping from time to weighted average group delay.” Proc. ICSLP 2000 4, 664–667.

Kawahara, H., Y. Agiomyrgiannakis and H. Zen (2016) “Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis.” 9th ISCA Speech Synthesis Workshop, 221–228 (arXiv preprint arXiv:1605.07809).

Kawahara, H., K.-I. Sakakibara, M. Morise, H. Banno and T. Toda (2017a) “A modulation property of time-frequency derivatives of filtered phase and its application to aperiodicity and fo estimation.” Interspeech 2017, Stockholm, Aug. 2017, 424–428 (arXiv preprint arXiv:1706.02964).

Kawahara, H., K.-I. Sakakibara, M. Morise, H. Banno, T. Toda and T. Irino (2017b) “A new cosine series antialiasing function and its application to aliasing-free glottal source models for speech and singing synthesis.” Interspeech 2017, Stockholm, Aug. 2017, 1358–1362 (arXiv preprint arXiv:1702.06724).

Kawahara, H., K.-I. Sakakibara, H. Banno, M. Morise and T. Toda (2017c) “Accurate estimation of fo and aperiodicity based on periodicity detector residuals and deviations of phase derivatives.” APSIPA ASC 2017.

Maekawa, K., H. Kikuchi, Y. Igarashi and J. Venditti (2002) “X-JToBI: An extended J_ToBI for spontaneous speech.” Proceedings of ICSLP2002, Denver, 1545–1548.

Maekawa, K. (2003) “Corpus of Spontaneous Japanese: Its design and evaluation.” ISCA and IEEE Workshop on Spontaneous Speech Processing and Recognition, 7–12.

Maekawa, K. and H. Mori (2016) “Voice-quality difference between the vowels in filled pauses and ordinary lexical items.” Interspeech 2016, 3171–3175.

Moore, B. C. J. (2012) An introduction to the psychology of hearing. Brill.

Murty, K. S. R. and B. Yegnanarayana (2008) “Epoch extraction from speech signals.” IEEE Trans. Audio, Speech and Language Processing 16(8), 1602–1613.

Titze, I. R. (1994) Principles of voice production. Allyn & Bacon.

Titze, I. R. et al. (2015) “Toward a consensus on symbolic notation of harmonics, resonances, and formants in vocalization.” The Journal of the Acoustical Society of America 137(5), 3005–3007.

Talkin, D. (1995) “A robust algorithm for pitch tracking (RAPT).” In W. B. Kleijn and K. K. Paliwal (eds.) Speech coding and synthesis, 495–518.

(Received Sept. 2, 2017, Accepted Dec. 11, 2017)
