
Speech acoustics and phonetics

Louis C.W. Pols

Institute of Phonetic Sciences (IFA)

Amsterdam Center for Language and Communication (ACLC)

NATO-ASI “Dynamics of Speech Production and Perception”, Il Ciocco, Tuscany, Italy, July 1, 2002


Overview

Dynamics in speech acoustics
Contour modeling (mainly formants)
Aspects of spectral undershoot
Modeling V and C reduction
Phonetic knowledge from speech corpora
  IFA, CGN, TIMIT, found speech
Conclusions



Dynamics in speech acoustics

Dynamics is the norm, not stationarity
  articulatory efficiency
Dynamics is everywhere
  generally no word boundaries in speech
  deletion of words, syllables, phonemes; insertion
  within/between word coarticulation/assimilation
  vowel and consonant reduction
Acoustic manifestations
  segment duration, F0, loudness, spectral quality



Dynamics is the norm

The speaker speaks as sloppily as the listeners allow him to in communication
  communicative efficiency
Articulatory vs. perceptual efficiency
  do spectral transitions facilitate or hamper perception? -> see other presentation
Speaker flexibility; speaking style (clear vs. sloppy); speaking rate



Dynamics is everywhere

Deletion
  ‘bread and butter’ /brEmbY3/
  ‘Amsterdam’ (Du) /Amst@rdAm/ -> /Ams@dAm/
  ‘koninklijke’ (Du) /konIŋkl@k@/ -> /kol@k@/
Insertion
  homorganic glide insertion: ‘die een’ (Du) /dij@n/
Degemination
  ‘is zichtbaar’ (Du) /Is zIxtbar/ -> /IsIxbar/
Reduction, coarticulation, assimilation



Acoustic manifestations

pitch, loudness, formant, component contours
  contour stylization (e.g., pitch in Praat)
  contour modeling
    n-th degree curve fitting (D. van Bergem)
    Legendre polynomials (R. van Son)
    16 points per segment (R. van Son)
(phoneme) segmentation
  by hand (time consuming; inconsistent)
  automatically (via forced phoneme recognition and a pronunciation lexicon with alternatives; systematic errors)
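As an illustration of contour modeling with Legendre polynomials, the sketch below fits low-order Legendre coefficients to a contour sampled at equidistant points within a segment. It is a minimal least-squares version written for this text, not the procedure used in the cited work; the function names, the default order of 4 (five polynomials, orders 0 to 4) and the synthetic track are assumptions.

```python
def legendre_basis(x, order):
    """Return [P0(x), ..., P_order(x)] using Bonnet's recurrence."""
    values = [1.0, x]
    for n in range(1, order):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: order + 1]


def solve_linear(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * coeffs[k] for k in range(r + 1, n))
        coeffs[r] = (m[r][n] - tail) / m[r][r]
    return coeffs


def fit_legendre(track, order=4):
    """Least-squares Legendre coefficients of a contour sampled at equidistant points."""
    n = len(track)
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]  # map the segment onto [-1, 1]
    basis = [legendre_basis(x, order) for x in xs]
    size = order + 1
    # Normal equations: (B^T B) c = B^T y
    gram = [[sum(row[j] * row[k] for row in basis) for k in range(size)] for j in range(size)]
    rhs = [sum(row[j] * y for row, y in zip(basis, track)) for j in range(size)]
    return solve_linear(gram, rhs)


# Synthetic 16-point track (invented values): 500*P0 + 100*P1 - 50*P2
xs = [-1.0 + 2.0 * i / 15 for i in range(16)]
track = [500.0 + 100.0 * x - 25.0 * (3 * x * x - 1) for x in xs]
print([round(c, 3) for c in fit_legendre(track)])
```

In such a fit the zeroth-order coefficient approximates the mean level of the contour, the first-order coefficient its overall slope, and the second-order coefficient its curvature.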



Contour modeling

allows modeling of specific phenomena
  pitch accentuation (vs. vowel onset)
  reduction, centralization, undershoot
allows generation of stimuli for perc. expts.
  phoneme identification in extending context
  2-alternative forced-choice identif. of continua
  discrimination, RT
allows statistics on large speech corpora
  TIMIT, CGN, IFA-corpus, Switchboard



Static vs. dynamic V recogn.

see Weenink (2001) “Vowel normalizations with the TIMIT acoustic phonetic speech corpus”, IFA Proc. 24, 117-123
438 males, both train & test sent. of TIMIT
35,385 vowel segments, hand segmented
13 monophthongal vowel categories
1-Bark bandfilter anal. (18), intensity normal.
3 frames per segment: central and 25 ms L/R



Some results

Vowel classif. (%) with discriminant functions

Condition                # Items                       Static (1 frame)   Dynamic (3 frames)
Original                 35,385 (438 x 13 x (1…25))    59.3               66.9
  speaker normalized     35,385                        62.2               69.2
V centers per speaker    5,374 (438 x 13)              78.9               90.1
  speaker normalized     5,374                         87.9               94.5
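To make the static vs. dynamic comparison easier to read, the sketch below converts the classification scores above into percentage-point gains and relative error reductions. The relative-error-reduction measure is an addition for illustration and does not appear on the slide; the function name is hypothetical.

```python
def error_reduction(static_pct, dynamic_pct):
    """Fraction of the static (1-frame) errors removed by using 3 frames."""
    static_err = 100.0 - static_pct
    dynamic_err = 100.0 - dynamic_pct
    return (static_err - dynamic_err) / static_err


# Scores copied from the table: (static 1 frame, dynamic 3 frames)
results = {
    "original": (59.3, 66.9),
    "original, speaker normalized": (62.2, 69.2),
    "V centers per speaker": (78.9, 90.1),
    "V centers, speaker normalized": (87.9, 94.5),
}

for condition, (static, dynamic) in results.items():
    gain = dynamic - static
    reduction = 100.0 * error_reduction(static, dynamic)
    print(f"{condition}: +{gain:.1f} points, {reduction:.0f}% fewer errors")
```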



Formant tracks / speaking rate

Ph.D. thesis Rob van Son (1993) “Spectro-temporal features of vowel segments”
  see also Speech Comm. 13, 135-148 (Pols & vSon)
850-word text, read at normal and fast rate
hand segmentation of 7 most freq. V + schwa
formant tracks via 16 points per segm. or 5 Legendre polynomials
influence of rate, V-dur., context, sent. acc.
evidence for duration-controlled undershoot?
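Representing every vowel by 16 points removes duration from the comparison, so that track shapes at normal and fast rate can be laid over each other. A minimal sketch of such time normalization follows; linear interpolation and the function name are assumptions made here, since the slide does not say how the 16 points were obtained.

```python
def resample_track(track, n_points=16):
    """Linearly interpolate a contour onto n_points equidistant positions."""
    if len(track) < 2:
        raise ValueError("need at least two samples to interpolate")
    last = len(track) - 1
    resampled = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)   # fractional index into the track
        lo = min(int(pos), last - 1)      # left neighbour, clamped at the end
        frac = pos - lo
        resampled.append(track[lo] * (1.0 - frac) + track[lo + 1] * frac)
    return resampled


# Two invented F2 tracks with the same shape but different durations (frames)
slow = [1500.0 + 10.0 * k for k in range(31)]
fast = [1500.0 + 20.0 * k for k in range(16)]
print(resample_track(slow)[:3], resample_track(fast)[:3])
```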



Some results

no differences for F1/F2 in vowel center for normal- or fast-rate speech; only some overall rise in F1 for fast rate (irrespective of V)

same formant track shape (normalized to 16 points) for normal- or fast-rate speech

same results when using the more elaborate Legendre polynomials

Concl.: changes in V-duration do not change the amount of undershoot -> active control of articulation speed



Formant representations

[Figure: two panels comparing vowels at normal rate and fast rate.
Left panel: zeroth order Legendre polynomial coefficients (mean Fi in vowel segment); F1 axis 300 to 600, F2 axis 800 to 2000.
Right panel: second order Legendre polynomial coefficients (axes reversed); axes F1'' and F2'', with ticks from -250 to 250 and from -150 to 50.]



Modeling vowel reduction

Ph.D. thesis Dick van Bergem (1995) “Acoustic and lexical vowel reduction”
  see also Speech Communication 16, 329-358
lexical V reduction
  Fr /betõ/ vs. Du /b@tOn/
acoustic V reduction
  /banan, bAnan, b@nan/
  f(sent. acc., w. str., w. class): can-candy-canteen
coarticulatory effects on the schwa
  C1@C2V- and VC1@C2-type nonsense words
perceptual effects (full V or schwa, f.i. ‘ananas’)



Some results

The schwa is not just a centralized vowel but something that is completely assimilated with its phonemic context

[Figure panels: t-n, w-l]



Modeling consonant reduction

Sp. Comm. (1999) 28, 125-140 (vSon & Pols)
20 min. speech, both spontaneous and read
2 x 791 similar VCV; hand segmented
5 aspects of V and C reduction
  related to coarticulation:
    F2 slope differences at CV- vs. VC-boundaries
    F2 locus equations (F2 onset vs. F2 target)
  related to speaking effort:
    duration
    spectral COG (mean freq.)
    V-C sound energy differences
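Two of these measures are simple to state in code. The sketch below fits a locus equation as a straight line relating F2 at vowel onset to F2 at the vowel target, and computes a spectral centre of gravity as a weighted mean frequency. Ordinary least squares, power weighting for the COG, the function names and all numeric values are assumptions for illustration, not details taken from the cited paper.

```python
def fit_locus_equation(f2_targets, f2_onsets):
    """Least-squares line F2_onset = slope * F2_target + intercept (Hz)."""
    n = len(f2_targets)
    mean_t = sum(f2_targets) / n
    mean_o = sum(f2_onsets) / n
    sxx = sum((t - mean_t) ** 2 for t in f2_targets)
    sxy = sum((t - mean_t) * (o - mean_o) for t, o in zip(f2_targets, f2_onsets))
    slope = sxy / sxx
    intercept = mean_o - slope * mean_t
    return slope, intercept


def spectral_cog(freqs_hz, power):
    """Centre of gravity: mean frequency weighted by spectral power."""
    return sum(f * p for f, p in zip(freqs_hz, power)) / sum(power)


# Invented F2 values (Hz) for one consonant before several vowels
targets = [800.0, 1200.0, 1600.0, 2000.0]
onsets = [1300.0, 1500.0, 1700.0, 1900.0]
slope, intercept = fit_locus_equation(targets, onsets)
print(f"slope={slope:.2f}, intercept={intercept:.0f} Hz")

# Invented three-bin spectrum
print(spectral_cog([500.0, 1500.0, 2500.0], [1.0, 2.0, 1.0]))
```

A slope near 1 means the onset follows the vowel target closely (strong coarticulation); a slope near 0 means the onset stays at a fixed locus regardless of the vowel.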



Some results

V markedly reduced in spontaneous speech
lower F2-slope diff. in spontaneous speech -> decrease in articulation speed
no systematic effect on F2 locus equation; V onsets and targets change in concert -> any V reduction mirrored by comparable change in C
spont. sp.: V and C shorter; lower COG -> decrease in vocal and articulatory effort



Access to large corpora

more, and more realistic, data
phonetic knowledge via statistical analyses
f.i. highly accessible IFA-corpus (free, SQL)
  see “Structure and access of the open source IFA-corpus”, IFA Proc. 24, 15-26 (vSon & Pols)
  on-line http://www.fon.hum.uva.nl/IFAcorpus/
  4 M/4 F speakers, 5.5 hrs of speech
  from informal to read + sent., words, syllables
  ~ 50 Kwords segm. and labeled at phoneme level



Some results

speech + annot. + meta data: relational DB
realization of final n, f.i. Du ‘geven’ /xev@(n)/

Style              #wrds    /@n/    /@/      All    % /@n/   LF   HF
Informal           5,250       1    304      305    0.3
Retelling          6,229      13    236      249    5.2
Read:
  Narr. story     14,453     180    372      552    33       42   30
  Sentences       14,970     203    340      543    37
  Pseudo-sent      2,554      62     19       81    77
All               43,456     459  1,271    1,730    36
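The percentage column can be recomputed from the two count columns, which is the kind of check a relational database makes cheap. The sketch below does this for the per-style counts listed above; the dictionary layout and function name are hypothetical. Recomputed from the listed counts, the pooled value comes out near 27%, not the 36 shown in the All row; the slide does not say how that figure was derived.

```python
# (count realized as /@n/, count realized as /@/) per speaking style, from the table
counts = {
    "Informal": (1, 304),
    "Retelling": (13, 236),
    "Narr. story": (180, 372),
    "Sentences": (203, 340),
    "Pseudo-sent": (62, 19),
}


def percent_full(n_full, n_reduced):
    """Percentage of tokens in which the final n is realized."""
    return 100.0 * n_full / (n_full + n_reduced)


for style, (n_full, n_reduced) in counts.items():
    print(f"{style:12s} {percent_full(n_full, n_reduced):5.1f} % /@n/")

total_full = sum(n_full for n_full, _ in counts.values())
total_reduced = sum(n_reduced for _, n_reduced in counts.values())
print(f"{'All':12s} {percent_full(total_full, total_reduced):5.1f} % /@n/")
```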



Spoken Dutch Corpus (CGN)

10 M words, 1,000 hrs of speech
variety of styles, incl. telephone speech
adult Dutch and Flemish speakers
for linguistic and technological research
see various LREC and ICSLP papers (2002)
see also http://lands.let.kun.nl/cgn/home.htm
fully transcribed: orthogr., POS, lemmas
partly transcr.: phonemic, prosodic, syntactic



TIMIT

popular DB in acoustic phonetics and ASR
  also telephone version (NTIMIT)
hand segmented & labeled at phoneme level
438 males, 192 females (8 dialect regions)
10 sent./sp. (2 fixed, 1 phon. compact, 7 diverse)
  sa1: “She had her dark suit in greasy wash water all year”
includes separate test data (112 M, 56 F)
e.g. Ph.D thesis X. Wang (1997) “Incorporating knowledge on segmental duration in HMM-based continuous speech recognition”
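The speaker and sentence counts on this slide fit together arithmetically. The sketch below assumes that the test speakers are a subset of the 438 male and 192 female totals, and that the two fixed sa sentences are left out of the training material; under those assumptions it reproduces the 3,696 sx + si training sentences quoted with the durational-variability figure.

```python
all_speakers = {"M": 438, "F": 192}
test_speakers = {"M": 112, "F": 56}

# Assumption: the training set is everything that is not test data
train_speakers = {sex: all_speakers[sex] - test_speakers[sex] for sex in all_speakers}

SENTENCES_PER_SPEAKER = 10
FIXED_SA_SENTENCES = 2  # read by every speaker, so excluded here

train_sx_si = sum(train_speakers.values()) * (SENTENCES_PER_SPEAKER - FIXED_SA_SENTENCES)
print(train_speakers, train_sx_si)
```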



Useful info: durational variability

[Figure, adopted from Wang (1998).
Left panel: tree of /iy/ durations, giving count, mean and s.d. (ms) per factor level for the factors R, S, Lw and Lu; root /iy/: count 4626, mean 95, s.d. 39. Callouts: normal rate = 95, primary stress = 104, word final = 136, utterance final = 186; overall average = 95 ms.
Right panel: histogram (number of utterances) and utterance-averaged phone duration (ms) against relative utterance speaking rate, from fast rate (-0.76) to slow rate (1.54); all 3,696 training sent. (sx + si) of TIMIT training set.]

speaking rate $r = \frac{1}{N}\sum_{i=1}^{N} d_i$, with $d_i$ the normalized phone duration
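The figure relates a relative utterance speaking rate to normalized phone durations. Assuming the rate is the mean normalized duration over the N phones of an utterance, and assuming that "normalized" means a z-score against the mean and s.d. of the phone's own class (the slide does not spell this out), it can be sketched as follows. Only the /iy/ statistics (95 ms, 39 ms) come from the slide; the function names and the other values are invented.

```python
from statistics import mean


def normalized_durations(phones, stats):
    """z-score each phone duration (ms) against its class mean and s.d."""
    return [(dur - stats[label][0]) / stats[label][1] for label, dur in phones]


def relative_speaking_rate(phones, stats):
    """Mean normalized phone duration; positive values indicate slower speech."""
    return mean(normalized_durations(phones, stats))


# Per-phone (mean, s.d.) in ms; /iy/ from the slide, /s/ invented for the example
stats = {"iy": (95.0, 39.0), "s": (110.0, 30.0)}

# Invented utterance: (phone label, duration in ms)
utterance = [("iy", 134.0), ("s", 95.0), ("iy", 80.0)]
print(relative_speaking_rate(utterance, stats))
```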



‘found’ speech

DARPA-LVSR community rather ambitious
Broadcast News (BN), Sp.Comm. 37 (2002)

year and task:
  < ’95: WSJ NAB read sp.
  1995: Market place
  1996: F0-F5, FX partitioned
  1997: 3 hrs test unpartit.
  1998: + non Engl. speech; also < 10x RT
audio training data: 100 hrs; 10 hrs; 55 hrs; + 50 hrs; + 100 hrs
text (for LM): 430 K; 122 M; 540 M; > 900 M
best % WER on test set: 27.0 %; 27.1 % (1:46 hrs); 16.2 % (3 hrs); 13.5 -> 16.1 % (3 hrs, 10xRT)

For Proc. DARPA Workshops, see http://www.nist.gov/speech/proc/darpa99/index.htm



Articul.-acoustic features in ASR

“A Dutch treatment of an elitist approach to articulatory-acoustic feature classification”, Proc. Eurospeech-2001, 1729-1732 (M. Wester et al.)

“Integrating articulatory features into acoustic models for speech recognition”, Phonus 5, 73-86 (K. Kirchhoff, 2000)

“An overlapping-feature-based phonological model incorporating linguistic constraints: Applications to speech recognition”, JASA 111 (2), 1086-1101 (J. Sun & L. Deng, 2002)



Conclusions

examples of dynamics in speech acoustics
going from formal to informal speech:
  less dynamics, more reduction (artic. guided)
undershoot vs. speaking style
  sloppiness or articulatory limits?
functionality of dynamics? -> other paper
systematicity of dynamics?
  easing ASR, rules for TTS, acquiring knowledge?
