Speech acoustics and phonetics
Louis C.W. Pols
Institute of Phonetic Sciences (IFA)
Amsterdam Center for Language and Communication (ACLC)
NATO-ASI “Dynamics of Speech Production and Perception”, Il Ciocco, Tuscany, Italy, July 1, 2002
July 1st, 2002 Speech acoustics and phonetics, Il Ciocco
Overview
Dynamics in speech acoustics
Contour modeling (mainly formants)
Aspects of spectral undershoot
Modeling V and C reduction
Phonetic knowledge from speech corpora
  IFA, CGN, TIMIT, found speech
Conclusions
Dynamics in speech acoustics
Dynamics is the norm, not stationarity
  articulatory efficiency
Dynamics is everywhere
  generally no word boundaries in speech
  deletion of words, syllables, phonemes; insertion within/between words
  coarticulation/assimilation
  vowel and consonant reduction
Acoustic manifestations
  segment duration, F0, loudness, spectral quality
Dynamics is the norm
The speaker speaks as sloppily as the listeners allow him to in communication
  communicative efficiency
Articulatory vs. perceptual efficiency
  do spectral transitions facilitate or hamper perception? -> see other presentation
Speaker flexibility; speaking style (clear vs. sloppy); speaking rate
Dynamics is everywhere
Deletion
  ‘bread and butter’ /brEmbY3/
  ‘Amsterdam’ (Du) /Amst@rdAm/ -> /Ams@dAm/
  ‘koninklijke’ (Du) /konIŋkl@k@/ -> /kol@k@/
Insertion
  homorganic glide insertion: ‘die een’ (Du) /dij@n/
Degemination
  ‘is zichtbaar’ (Du) /Is zIxtbar/ -> /IsIxbar/
Reduction, coarticulation, assimilation
Acoustic manifestations
pitch, loudness, formant, component contours
  contour stylization (e.g., pitch in Praat)
  contour modeling
    n-th degree curve fitting (D. van Bergem)
    Legendre polynomials (R. van Son)
    16 points per segment (R. van Son)
(phoneme) segmentation
  by hand (time consuming; non-consistent)
  automatically (via forced phoneme recognition and a pronunciation lexicon with alternatives; systematic errors)
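As a rough illustration of contour modeling, the sketch below resamples one formant track to 16 points and fits Legendre polynomials to it on normalized time. It is a minimal sketch, not the procedure used in the cited work: the track values, the 10 ms sampling step, the linear resampling and the names `resample_track` and `legendre_coefficients` are all assumptions made for this example.

```python
import numpy as np
from numpy.polynomial import legendre


def resample_track(times, values, n_points=16):
    """Linearly resample a contour to n_points equally spaced samples."""
    grid = np.linspace(times[0], times[-1], n_points)
    return np.interp(grid, times, values)


def legendre_coefficients(track, order=4):
    """Least-squares Legendre fit on normalized time x in [-1, 1].

    order=4 yields 5 coefficients (orders 0 to 4).
    """
    x = np.linspace(-1.0, 1.0, len(track))
    return legendre.legfit(x, track, order)


# Hypothetical F2 track (Hz) of one vowel segment, sampled every 10 ms.
times = np.arange(0.0, 0.10, 0.01)
f2 = 1500.0 + 300.0 * np.sin(np.pi * times / times[-1])

track16 = resample_track(times, f2)      # duration-normalized contour
coeffs = legendre_coefficients(track16)  # compact contour description
print(np.round(coeffs, 1))
```

In this representation the zeroth-order coefficient roughly tracks the mean formant value over the segment, while the second-order coefficient reflects how strongly the track is curved.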
Contour modeling
allows modeling of specific phenomena
  pitch accentuation (vs. vowel onset)
  reduction, centralization, undershoot
allows generation of stimuli for perc. expts.
  phoneme identification in extending context
  2-alternatives forced choice identif. of continua
  discrimination, RT
allows statistics on large speech corpora
  TIMIT, CGN, IFA-corpus, Switchboard
Static vs. dynamic V recogn.
see Weenink (2001) “Vowel normalizations with the TIMIT acoustic phonetic speech corpus”, IFA Proc. 24, 117-123
  438 males, both train & test sent. of TIMIT
  35,385 vowel segments, hand segmented
  13 monophthongal vowel categories
  1-Bark bandfilter anal. (18), intensity normal.
  3 frames per segment: central and 25 ms L/R
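The difference between the static and the dynamic representation can be sketched as follows: the static vector is the central frame only, while the dynamic vector concatenates the frames 25 ms to the left and right of it. This is only an illustrative sketch; the 5 ms frame step, the clamping at segment edges and the function name `vowel_features` are assumptions, not details from Weenink (2001).

```python
import numpy as np


def vowel_features(frames, frame_step_ms, offset_ms=25.0, dynamic=True):
    """Build a static (1 frame) or dynamic (3 frames) feature vector.

    frames: array of shape (n_frames, n_bands) with band levels for one
    vowel segment. Offsets falling outside the segment are clamped to
    the first or last frame (an assumption of this sketch).
    """
    frames = np.asarray(frames, dtype=float)
    centre = len(frames) // 2
    if not dynamic:
        return frames[centre]
    shift = int(round(offset_ms / frame_step_ms))
    left = max(centre - shift, 0)
    right = min(centre + shift, len(frames) - 1)
    return np.concatenate([frames[left], frames[centre], frames[right]])


# Hypothetical segment: 11 frames of 18 band levels, 5 ms frame step.
segment = np.arange(11 * 18, dtype=float).reshape(11, 18)
static = vowel_features(segment, frame_step_ms=5.0, dynamic=False)
dynamic = vowel_features(segment, frame_step_ms=5.0, dynamic=True)
print(static.shape, dynamic.shape)
```

Either vector can then be passed to a discriminant classifier; the dynamic vector simply triples the number of features per vowel token.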
Some results
Vowel classif. (%) with discriminant functions

Condition               # Items                       Static (1 frame)   Dynamic (3 frames)
Original                35,385 (438 x 13 x (1…25))    59.3               66.9
  speaker normalized    35,385                        62.2               69.2
V centers per speaker   5,374 (438 x 13)              78.9               90.1
  speaker normalized    5,374                         87.9               94.5
Formant tracks / speaking rate
Ph.D. thesis Rob van Son (1993) “Spectro-temporal features of vowel segments”
  see also Speech Comm. 13, 135-148 (Pols & vSon)
850-words text, read at normal and fast rate
hand segmentation of 7 most freq. V + schwa
formant tracks via 16 points per segm. or 5 Legendre polynomials
influence of rate, V-dur., context, sent. acc.
evidence for duration-controlled undershoot?
Some results
no differences for F1/F2 in vowel center for normal- or fast-rate speech; only some overall rise in F1 for fast rate (irrespective of V)
same formant track shape (normalized to 16 points) for normal- or fast-rate speech
same results when using the more elaborate Legendre polynomials
Concl.: changes in V-duration do not change the amount of undershoot —> active control of articulation speed
Formant representations
[Figure: formant representations of vowels (labels include a, i, o, u, y) at normal rate vs. fast rate, as Legendre polynomial coefficients. Left panel, zeroth order (mean Fi in vowel segment): F1 (300 to 600) vs. F2 (800 to 2000). Right panel, second order (axes reversed): F1'' (-150 to 50) vs. F2'' (-250 to 250).]
Modeling vowel reduction
Ph.D. thesis Dick van Bergem (1995) “Acoustic and lexical vowel reduction”
  see also Speech Communication 16, 329-358
lexical V reduction: Fr /betõ/ vs. Du /b@tOn/
acoustic V reduction: /banan, bAnan, b@nan/
  f(sent. acc., w. str., w. class): can-candy-canteen
coarticulatory effects on the schwa
  C1@C2V- and VC1@C2-type nonsense words
perceptual effects (full V or schwa, e.g. ‘ananas’)
Some results
The schwa is not just a centralized vowel but something that is completely assimilated with its phonemic context
[Figure panels: t-n, w-l]
Modeling consonant reduction
Sp. Comm. (1999) 28, 125-140 (vSon & Pols)
20 min. speech, both spontaneous and read
2 x 791 similar VCV; hand segmented
5 aspects of V and C reduction
  related to coarticulation: F2 slope differences at CV- vs. VC-boundaries; F2 locus equations (F2 onset vs. F2 target)
  related to speaking effort: duration; spectral COG (mean freq.); V-C sound energy differences
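Two of these measures are easy to sketch in code: an F2 locus equation is a straight-line fit of F2 onset against F2 target, and the spectral centre of gravity (COG) is a mean frequency of the spectrum. The example below is a minimal sketch with invented numbers; weighting the COG by power and the function names are assumptions, not the definitions used in the cited paper.

```python
import numpy as np


def locus_equation(f2_target, f2_onset):
    """Least-squares fit of F2_onset = slope * F2_target + intercept."""
    slope, intercept = np.polyfit(f2_target, f2_onset, 1)
    return float(slope), float(intercept)


def spectral_cog(signal, sample_rate):
    """Power-weighted mean frequency (Hz) of a signal."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * power) / np.sum(power))


# Invented F2 targets (Hz) and onsets lying exactly on one line.
f2_target = np.array([900.0, 1200.0, 1500.0, 1800.0, 2100.0])
f2_onset = 0.6 * f2_target + 700.0
slope, intercept = locus_equation(f2_target, f2_onset)

# Invented test signal: a pure 1000 Hz tone, 0.1 s at 16 kHz.
sample_rate = 16000
t = np.arange(1600) / sample_rate
cog = spectral_cog(np.sin(2.0 * np.pi * 1000.0 * t), sample_rate)
print(round(slope, 2), round(intercept), round(cog))
```

A flatter locus-equation slope indicates an F2 onset that depends less on the vowel target, which is why the slope is commonly read as a measure of coarticulation.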
Some results
V markedly reduced in spontaneous speech
lower F2-slope diff. in spontaneous speech -> decrease in articulation speed
no systematic effect on F2 locus equation; V onsets and targets change in concert -> any V reduction mirrored by comparable change in C
spont. sp.: V and C shorter; lower COG -> decrease in vocal and articulatory effort
Access to large corpora
more, and more realistic, data
phonetic knowledge via statistical analyses
e.g. highly accessible IFA-corpus (free, SQL)
  see “Structure and access of the open source IFA-corpus”, IFA Proc. 24, 15-26 (vSon & Pols)
  on-line http://www.fon.hum.uva.nl/IFAcorpus/
  4 M/4 F speakers, 5.5 hrs of speech
  from informal to read + sent., words, syllables
  ~ 50K words segm. and labeled at phoneme level
Some results
speech + annot. + meta data: relational DB
realization of final n, e.g. Du ‘geven’ /xev@(n)/

Style                 #wrds    /@n/     /@/     All   % /@n/   LF   HF
Informal              5,250       1     304     305      0.3
Retelling             6,229      13     236     249      5.2
Narr. story (read)   14,453     180     372     552     33     42   30
Sentences (read)     14,970     203     340     543     37
Pseudo-sent (read)    2,554      62      19      81     77
All                  43,456     459   1,271   1,730     36
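With the annotations in a relational database, a table like this reduces to one grouped query. The sketch below shows the idea on an in-memory SQLite database filled with the /@n/ and /@/ counts of three styles from the table; the table name `final_n` and its columns are hypothetical and do not reflect the actual IFA-corpus schema.

```python
import sqlite3

# (/@n/ count, /@/ count) per style, taken from the table on this slide.
counts = {"Informal": (1, 304), "Retelling": (13, 236), "Pseudo-sent": (62, 19)}

con = sqlite3.connect(":memory:")
# Hypothetical schema: one row per realized word-final syllable.
con.execute("CREATE TABLE final_n (style TEXT, realization TEXT)")
for style, (with_n, without_n) in counts.items():
    con.executemany(
        "INSERT INTO final_n VALUES (?, ?)",
        [(style, "@n")] * with_n + [(style, "@")] * without_n,
    )

query = """
    SELECT style,
           COUNT(*) AS total,
           ROUND(100.0 * SUM(realization = '@n') / COUNT(*), 1) AS pct_n
    FROM final_n
    GROUP BY style
    ORDER BY pct_n
"""
rows = con.execute(query).fetchall()
for style, total, pct_n in rows:
    print(f"{style:12s} {total:5d} {pct_n:5.1f}")
```

The same pattern extends to further grouping variables, such as speaker or word frequency class, by adding columns to the `GROUP BY` clause.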
Spoken Dutch Corpus (CGN)
10 M words, 1,000 hrs of speech
variety of styles, incl. telephone speech
adult Dutch and Flemish speakers
for linguistic and technological research
see various LREC and ICSLP papers (2002)
see also http://lands.let.kun.nl/cgn/home.htm
fully transcribed: orthogr., POS, lemmas
partly transcr.: phonemic, prosodic, syntactic
TIMIT
popular DB in acoustic phonetics and ASR
  also telephone version (NTIMIT)
hand segmented & labeled at phoneme level
438 males, 192 females (8 dialect regions)
10 sent./sp. (2 fixed, 5 phon. compact, 3 diverse)
  sa1: “She had your dark suit in greasy wash water all year”
includes separate test data (112 M, 56 F)
e.g. Ph.D. thesis X. Wang (1997) “Incorporating knowledge on segmental duration in HMM-based continuous speech recognition”
Useful info: durational variability
Adopted from Wang (1998)
[Figure: tree of /iy/ durations split by the factors R, S, Lw and Lu; each node lists count, mean and s.d. per factor level. Root /iy/: count 4626, mean 95, s.d. 39. Highlighted means: normal rate = 95, primary stress = 104, word final = 136, utterance final = 186; overall average = 95 ms.]
[Figure: histogram of relative utterance speaking rate for all 3,696 training sent. (sx + si) of the TIMIT training set. x-axis: relative utterance speaking rate, from -0.76 (fast rate) to 1.54 (slow rate). y-axes: histogram count (number of utterances), 0 to 140; utterance-averaged phone duration (ms), 0 to 180.]
phone duration $d$; normalized phone duration $\tilde{d}$; speaking rate $r = \frac{1}{N}\sum_{i=1}^{N}\tilde{d}_i$
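The relative speaking rate of an utterance, taken here as the mean of its normalized phone durations, can be sketched as below. The slide does not say how phone durations are normalized; this sketch assumes a per-phone z-score (duration minus the phone's mean, divided by its s.d.), so positive values mean slower than average. The /iy/ statistics are the overall figures quoted for /iy/ earlier; the /s/ statistics and the function name are invented for illustration.

```python
import numpy as np


def relative_speaking_rate(durations_ms, phones, phone_stats):
    """Mean normalized phone duration over the N phones of one utterance.

    Normalization is assumed to be a per-phone z-score; positive values
    indicate a slow utterance, negative values a fast one.
    """
    z = [
        (d - phone_stats[p][0]) / phone_stats[p][1]
        for d, p in zip(durations_ms, phones)
    ]
    return float(np.mean(z))


# (mean, s.d.) in ms. /iy/: overall values quoted earlier; /s/: invented.
phone_stats = {"iy": (95.0, 39.0), "s": (110.0, 40.0)}

slow = relative_speaking_rate([150.0, 134.0], ["s", "iy"], phone_stats)
fast = relative_speaking_rate([70.0, 56.0], ["s", "iy"], phone_stats)
print(slow, fast)
```

Normalizing per phone before averaging keeps the measure from being dominated by utterances that happen to contain many intrinsically long or short phones.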
‘found’ speech
DARPA-LVSR community rather ambitious
Broadcast News (BN), Sp.Comm. 37 (2002)

Year                  < ’95              1995           1996                    1997                   1998
Task                  WSJ NAB read sp.   Market place   F0-F5, FX partitioned   3 hrs test unpartit.   + non Engl. speech; also < 10x RT
audio training data   100 hrs            10 hrs         55 hrs                  + 50 hrs               + 100 hrs

text (for LM): 430 K; 122 M; 540 M; > 900 M
best % WER on test set: 27.0 %; 27.1 % (1:46 hrs); 16.2 % (3 hrs); 13.5 -> 16.1 % (3 hrs; 10xRT)
For Proc. DARPA Workshops, see http://www.nist.gov/speech/proc/darpa99/index.htm
Articul.-acoustic features in ASR
“A Dutch treatment of an elitist approach to articulatory-acoustic feature classification”, Proc. Eurospeech-2001, 1729-1732 (M. Wester et al.)
“Integrating articulatory features into acoustic models for speech recognition”, Phonus 5, 73-86 (K. Kirchhoff, 2000)
“An overlapping-feature-based phonological model incorporating linguistic constraints: Applications to speech recognition”, JASA 111 (2), 1086-1101 (J. Sun & L. Deng, 2002)
Conclusions
examples of dynamics in speech acoustics
going from formal to informal speech: less dynamics, more reduction (artic. guided)
undershoot vs. speaking style
sloppiness or articulatory limits?
functionality of dynamics? -> other paper
systematicity of dynamics?
  easing ASR, rules for TTS, acquiring knowledge?