REC Meeting, Lisboa, May 8, 2000 - Børge Lindberg, CPK, Aalborg Univ., DK
Acoustic modeling on telephony speech corpora
for directory assistance systems applications
Børge Lindberg,
Center for PersonKommunikation (CPK),
Aalborg University
Denmark
Outline
Part 1 - Acoustic modeling
• Reference recogniser (COST 249)
Part 2 - Directory assistance
• NaNu - Names & Numbers (Tele Danmark)
• Acoustic model optimisation
• Project- and system details
COST 249
The COST 249 SpeechDat Multilingual Reference Recogniser
http://www.telenor.no/fou/prosjekter/taletek/refrec
• F.T. Johansen, N. Warakagoda (Telenor, Kjeller, Norway),
• B. Lindberg (CPK, Aalborg, Denmark),
• G. Lehtinen (ETH, Zürich, Switzerland),
• Z. Kacic, B. Imperl, A. Zgank (UMB, Maribor, Slovenia),
• B. Milner, D. Chaplin (British Telecom, Ipswich, UK),
• K. Elenius, G. Salvi (KTH, Stockholm, Sweden),
• E. Sanders, F. de Wet (KUN, Nijmegen, The Netherlands)
What is the reference recogniser?
• Phoneme based recogniser design procedure
• Language-independent
• Fully automatic, one script works straight from CDs
• Standardised database format: SpeechDat(II)
– Available in many languages world wide
– Oriented towards telephone applications
• Commonly available recogniser toolkit: HTK
Motivation
• A fast start for recognition research in new languages
– Share experience, avoid repeating the same mistakes
• Improve state-of-the-art
– Share research efforts
– Provide a benchmark for recogniser performance comparison across tasks and languages
• Facilitate true multilingual recognition research
Related Work
• COST 232
– Assumed TIMIT-like segmented database
• Reference verification systems
– CAVE, PICASSO
– COST 250
• GlobalPhone (Schultz & Waibel, ICSLP 98):
– Dictation type multilingual databases
– Language independent and -adaptive recognition
SpeechDat(II) databases
• 20 FDBs (fixed network), 5 MDBs (mobile networks)
• 500-5000 speakers, 4-8 minute recording sessions
• Telephone information and transaction services
• Compatible databases:
– SpeechDat(E): 5 Central and Eastern European languages
– SALA: 8 dialect zones in Latin America
– SpeechDat-Car: 9 languages, parallel GSM and in-car
– SpeechDat Australian English
Core Utterance Types in SpeechDat(II)
Number  Type                              Corpus code
1       isolated digit items              I
5       digit/number strings              B,C
1+      natural numbers                   N
1       money amounts                     M
2       yes/no questions                  Q
3+      dates                             D
2       times                             T
3       application keywords/keyphrases   A
1       word spotting phrase              E
5       directory assistance names        O
3       spellings                         L
4+      phonetically rich words           W
9       phonetically rich sentences       S
40+     In total
Recogniser design - version 0.95
• Standard HTK tutorial features (39-dimensional MFCC_0_D_A), no normalisation
• Word internal triphone HMMs, 3 states per model
• Decision-tree state clustering
• Trained from flat-start using only orthographic transcriptions and a SpeechDat lexicon
• Remove “difficult” utterances from the training set
• 1, 2, 4, 8, 16 and 32 diagonal covariance Gaussian mixtures
• Re-training on re-segmented material
MFCC_0_D_A - feature set
Pre-emphasis          0.97
Frame shift           10 ms
Analysis window       Hamming
Window length         25 ms
Spectrum type         FFT-magnitude
Filterbank type       Mel-scale
Filter shape          Triangular
Filterbank channels   26
Cepstral coefficients 12
Cepstral liftering    22
Energy feature        C0
Deltas                13
Delta-deltas          13
Total features        39
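
The arithmetic behind this table can be sketched in a few lines of Python. The 8 kHz sampling rate is an assumption (typical for telephone speech, but not stated on the slide), and the lifter formula is the sinusoidal lifter usually associated with HTK-style front ends; all names below are illustrative.

```python
import math

SAMPLE_RATE_HZ = 8000  # assumption: telephone-band sampling rate, not given on the slide

# Frame geometry in samples, from the 10 ms shift and 25 ms window in the table.
frame_shift = round(0.010 * SAMPLE_RATE_HZ)
window_len = round(0.025 * SAMPLE_RATE_HZ)

N_CEPSTRA = 12  # cepstral coefficients c1..c12
LIFTER = 22     # cepstral liftering parameter L


def lifter_weights(n_cepstra=N_CEPSTRA, lifter=LIFTER):
    """Sinusoidal lifter weights w_n = 1 + (L/2) * sin(pi * n / L) for n = 1..n_cepstra."""
    return [1.0 + (lifter / 2.0) * math.sin(math.pi * n / lifter)
            for n in range(1, n_cepstra + 1)]


# Static vector: 12 cepstra plus C0; deltas and delta-deltas triple the size.
static_dim = N_CEPSTRA + 1
total_dim = 3 * static_dim

print(frame_shift, window_len, static_dim, total_dim)
```

Under the 8 kHz assumption this gives 80-sample shifts and 200-sample windows, and the 13 + 13 + 13 split reproduces the 39 features listed above.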
Test design
• Common test suite on SpeechDat
– I-test: Isolated digit recognition (SVIP)
– Q-test: Yes/no recognition (SVIP)
– A-test: Recognition of 30 isolated application words (SVIP)
– BC-test: Unknown length connected digit string recognition (SVWL)
– O-test: City name recognition (MVIP)
– W-test: Recognition of phonetically rich words (MVIP)
• Two test procedures used
– SVIP: Small Vocabulary Isolated Phrase
– MVIP: Medium Vocabulary Isolated Phrase
– SVWL: Small Vocabulary Word Loop, NIST alignment
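
For the connected digit test, word errors are counted after aligning the recognised string with the reference. The sketch below uses a plain Levenshtein alignment with unit costs; it is not the NIST scoring tool, whose exact cost settings are not given here, and the function names are illustrative.

```python
def word_errors(ref, hyp):
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion of ref_word
                cur[j - 1] + 1,                       # insertion of hyp_word
                prev[j - 1] + (ref_word != hyp_word)  # match or substitution
            ))
        prev = cur
    return prev[-1]


def word_error_rate(pairs):
    """Word error rate in percent over (reference, hypothesis) string pairs."""
    errors = sum(word_errors(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return 100.0 * errors / words


pairs = [("one two three four", "one three three"), ("five six", "five six")]
print(round(word_error_rate(pairs), 1))
```

For isolated phrase tests (SVIP, MVIP) each utterance is a single item, so the same function reduces to counting misrecognised utterances.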
Results
• Six labs have completed the training procedure on the SpeechDat(II) databases
• KUN has converted the Dutch Polyphone to SpeechDat(II) format:
– trained only on phonetically rich sentences
– tested only on digit strings
• More details available on the web
Training Statistics
Language (database)         Train spkrs  Total uttr  Train uttr  Mono-phns  Tri-phns  Cluster reduc.
Danish FDB1000              800          34400       23216       71*        13056     7.3 %
Danish FDB4000              3500         150500      101100      71*        19032     11.5 %
Norwegian FDB1000           816          36720       20335       40*        7866      8.4 %
Slovenian FDB1000           800          34392       20548       39*        6613      10.8 %
Swedish FDB1000             800          38400       24827       46         10689     8.6 %
Swedish MDB1000             800          41600       34346       46         11876     7.8 %
Swiss German FDB1000        800          32580       17442       51*        12374     7.1 %
Dutch (polyph)**            4522         22602       20167       47         10194     13.0 %
British English MDB1000**   800          30917       26068       43         8368      12.0 %
* External information available (either session list, pronunciation lexicon or a phoneme mapping - see web-site)
** Results are for Refrec. v. 0.93
A typical training curve
[Figure: training curve on a vertical axis from 0 to 35, plotted across the training stages mini_1_4, mini_1_6, mini_2_1, mini_4_1, mini_8_1, mini_16_1, mini_32_1, mono_1_4, mono_1_6, mono_2_1, mono_4_1, mono_8_1, mono_16_1, mono_32_1, tri_1_1, tied_1_1, tied_2_1, tied_4_1, tied_8_1, tied_16_1, tied_32_1.]
Word error rates
Language (database)         I     Q    A    BC    O     W
Danish FDB1000              1.0   1.1  2.4  2.3   15.8  64.4
Danish FDB4000              0.6   1.1  2.4  2.7   14.0  64.1
Norwegian FDB1000           2.3   0.5  4.4  5.9   17.3  34.7
Slovenian FDB1000           4.2   0.9  4.9  6.1   9.3   19.3
Swedish FDB1000             1.0   0.0  1.2  2.5   12.4  35.2
Swedish MDB1000             10.5  1.1  4.0  14.2  18.6  52.4
Swiss German FDB1000        0.5   0.3  1.1  3.1   6.3   24.3
Dutch (polyph)*             -     -    -    5.0   -     -
British English MDB1000*    10.2  -    -    -     -     -
* Results are for Refrec. v. 0.93
Average number of phonemes in test vocabularies
Language      I/BC  Q    A
Danish        2.6   2.0  4.6
Norwegian     2.9   2.0  4.6
Slovenian     3.9   2.0  6.5
Swedish       3.3   2.5  6.2
Swiss German  3.7   2.5  6.7
Word error rates - cont.
[Figure: Word error rates for different tests. Error rate (0 to 16) plotted against phonemes per word (0 to 8) for the BC, I, Q and A tests.]
Word error rates - cont.
Database         Test  Error rate  Voc size  # phn
SwissGerman_FDB  O     6.3         684       12.6
Slovenian_FDB    O     9.3         597       10.4
Swedish_FDB      O     12.4        905       9.3
Danish_FDB       O     15.8        495       6.5
Norwegian_FDB    O     17.3        1182      7.3
Swedish_MDB      O     18.6        869       9.0
Slovenian_FDB    W     19.3        1491      6.8
SwissGerman_FDB  W     24.3        3274      7.9
Norwegian_FDB    W     34.7        3438      6.6
Swedish_FDB      W     35.2        3610      9.3
Swedish_MDB      W     52.4        3611      9.1
Danish_FDB       W     64.4        16934     8.8
Word error rates - cont.
[Figure: #phn and voc size for different error rates. The series "# Phn" and "3*log(size)", on a vertical axis from 6 to 14, plotted against error rate (0 to 80).]
Language independent considerations
• Performance probably below state-of-the-art systems
– No whole-word modelling, no cross-word context (especially needed for connected digits)
– A lot of training data with noise has been removed
– No speaker noise or filled pause model
– Feature analyser not robust enough
Language differences
• Mobile database has 3-5 times the error rate of FDBs
– more robust modeling needed
• Slovenian: high noise level on recordings
Conclusion - part 1
• Practical/logistic problems mostly solved
• Future work:
– Improve language and database coverage
– More speakers: Swedish 5000
– More challenging tests, large vocabularies
– More analyses
– Improved training procedure, clustering
Directory assistance
• Recognition of ‘Names & Numbers’
• In collaboration with Tele Danmark
• Auto attendant/directory assistance applications
• Large vocabulary - for the first time in Danish
• Exploiting the SpeechDat(II) database
NaNu
Børge Lindberg, Bo Nygaard Bai,
Tom Brøndsted, Jesper Ø. Olsen
Acoustic modeling - Decision trees
(Ref: HTK Book)
Acoustic modeling of Danish diphthongs
[Figure: chart on a vertical axis from 0 to 65 with the labels FDB1200, NoDiph, FDB4000, T500 and ReSeg.]
Type     MonoPhn  Tri-Phns  Clusters  Reduction
FDB1200  71       15239     3653      8.0%
NoDiph   42       12814     3660      9.5%
Acoustic modeling - CMN
Cityname task
[Figure: error rate (0 to 80) plotted across the training stages mini_1_4, mini_1_6, mini_2_1, mini_4_1, mini_8_1, mini_16_1, mini_32_1, mono_1_4, mono_1_6, mono_2_1, mono_4_1, mono_8_1, mono_16_1, mono_32_1, tri_1_1, tied_1_1, tied_2_1, tied_4_1, tied_8_1, tied_16_1, tied_32_1 for the series CMN_TR1000, CMN_TR3500, TR1000 and TR3500.]
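
The CMN series refer to cepstral mean normalisation. As a minimal sketch of the idea, assuming the mean is estimated per utterance (the slide does not say how it was computed), each coefficient has its utterance-level average subtracted. The function name is illustrative.

```python
def cepstral_mean_normalise(frames):
    """Subtract the per-utterance mean of each coefficient from every frame.

    frames: list of equal-length feature vectors belonging to one utterance.
    """
    if not frames:
        return []
    dim = len(frames[0])
    # Mean of each coefficient over all frames of the utterance.
    means = [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


utterance = [[1.0, 10.0], [3.0, 14.0]]
print(cepstral_mean_normalise(utterance))
```

Because a stationary channel adds a roughly constant offset in the cepstral domain, removing the mean is a common way to reduce channel mismatch between telephone lines.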
Acoustic modeling - decision trees
STATE-CLUSTERS
FDB4000 (Reducing 57231 clusters):
TB: Likelihood increase threshold
RO: State occupancy threshold
RO\TB  100          350         600        1000
100    15343/26.8%  6554/11.5%  4503/7.9%  3286/5.7%
200    11697/20.4%  6332/11.1%  4457/7.8%  3278/5.7%
300    9630/16.8%   6084/10.6%  4399/7.7%  3257/5.7%
400    8251/14.4%   5806/10.1%  4318/7.5%  3234/5.7%
600    6540/11.4%   5208/9.1%   4111/7.2%  3166/5.5%
800    5528/9.7%    4720/8.2%   3913/6.8%  3096/5.4%
1200   4310/7.5%    3953/6.9%   3485/6.1%  2886/5.0%
1600   3574/6.2%    3396/5.9%   3107/5.4%  2683/4.7%
FDB1200 (Reducing 46089 clusters):
RO\TB  100  350
100    3531/7.7%
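
Each table cell reads "clusters after tying / percentage of the initial count". The short check below reproduces a few of the percentages from the figures on the slide; the helper name is illustrative.

```python
def reduction_percent(clusters, initial):
    """Remaining clusters as a percentage of the initial count, to one decimal."""
    return round(100.0 * clusters / initial, 1)


FDB4000_INITIAL = 57231
FDB1200_INITIAL = 46089

print(reduction_percent(15343, FDB4000_INITIAL))  # lowest thresholds in the FDB4000 grid
print(reduction_percent(6554, FDB4000_INITIAL))   # the 6554-cluster configuration
print(reduction_percent(3531, FDB1200_INITIAL))   # the single FDB1200 entry
```

Raising either threshold merges more states, so the cluster count falls along both the rows and the columns of the grid.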
NaNu
Acoustic models
• SpeechDat - COST 249
• 20k+ tied-mixture tri-phones, 6554 clusters
• 16 mixture models - 100k+ mixture components
Database
• ¼ million subscribers (Århus and Næstved areas)
Vocabulary extracted from the database, restricted to entries for which:
• there are at least two occurrences
• a transcription exists (Onomastica)
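
The two selection criteria can be sketched as a simple filter. The data layout is an assumption: `entries` stands for one name per subscriber record, and `lexicon` for a mapping from names to transcriptions; both names and the toy data are hypothetical.

```python
from collections import Counter


def extract_vocabulary(entries, lexicon, min_count=2):
    """Keep names that occur at least min_count times and have a transcription."""
    counts = Counter(entries)
    return sorted(name for name, count in counts.items()
                  if count >= min_count and name in lexicon)


# Toy data for illustration only; transcriptions are placeholders.
entries = ["Jensen", "Jensen", "Nielsen", "Hansen", "Hansen"]
lexicon = {"Jensen": "<transcription>", "Nielsen": "<transcription>"}
print(extract_vocabulary(entries, lexicon))
```

Here "Nielsen" is dropped for occurring only once and "Hansen" for lacking a transcription.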
Vocabulary and Coverage
NaNu Vocabulary
Surname  Forename  Street  City/town
11291    3886      2598    26
# Unique database entries, Denmark (source Tele Danmark)
Coverage  Surname  Forename  Street  City/town
85%       14253    315       9691    1118
90%       28105    495       13379   1547
95%       65741    1243      19623   2371
100%      240606   49275     47057   9077
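
One plausible reading of the coverage table is "how many of the most frequent unique entries are needed to cover a given share of all occurrences". Under that assumption, the count can be computed as below; the function name and the toy frequencies are illustrative, not taken from the Tele Danmark data.

```python
def entries_needed(counts, target):
    """Number of most frequent entries needed to cover `target` (0..1) of all occurrences."""
    total = sum(counts)
    covered = 0
    for n, count in enumerate(sorted(counts, reverse=True), start=1):
        covered += count
        if covered >= target * total:
            return n
    return len(counts)


# Toy frequency list: five unique names with 100 occurrences in total.
toy_counts = [50, 30, 10, 5, 5]
for target in (0.5, 0.85, 1.0):
    print(target, entries_needed(toy_counts, target))
```

The long tail is what makes full coverage expensive: the last few percent of occurrences require most of the unique entries.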
SLANG
Recogniser - Spoken LANGuage
• Speech Recognition Research Platform
• For Dialogue Systems execution
• Modular design and implementation (C++)
• Frame synchronous operation
• Dynamic Tree Structured Decoder
• Optimised towards large vocabulary recognition (Gaussian mixture selection)
NLP
• N-Best lists are parsed into semantic frames and SQL queries are generated according to the following strategy:
1. simple 1-best match
2. full search in all N-best lists
3. underspecified match (street name and last name required to be contained in the N-best list)
• Output is “converted” to synthetic speech.
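
The three-step query strategy can be sketched with Python's built-in sqlite3 module standing in for the real database. The table layout, column names and toy record are hypothetical, and the third step is one interpretation of "underspecified": the first name is dropped, while last name and street must appear in their N-best lists.

```python
import sqlite3
from itertools import product

FIELDS = ("first_name", "last_name", "street")  # hypothetical column names


def find_subscribers(conn, nbest):
    """nbest maps each field to its list of hypotheses, best first."""
    select = "SELECT first_name, last_name, street, phone FROM subscribers WHERE "
    exact = " AND ".join(f"{field} = ?" for field in FIELDS)

    # 1. Simple 1-best match: top hypothesis for every field.
    rows = conn.execute(select + exact, [nbest[f][0] for f in FIELDS]).fetchall()
    if rows:
        return "1-best", rows

    # 2. Full search: try every combination drawn from the N-best lists.
    for combo in product(*(nbest[f] for f in FIELDS)):
        rows = conn.execute(select + exact, combo).fetchall()
        if rows:
            return "n-best", rows

    # 3. Underspecified: constrain only last name and street.
    last_marks = ",".join("?" * len(nbest["last_name"]))
    street_marks = ",".join("?" * len(nbest["street"]))
    where = f"last_name IN ({last_marks}) AND street IN ({street_marks})"
    params = list(nbest["last_name"]) + list(nbest["street"])
    return "underspecified", conn.execute(select + where, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscribers "
             "(first_name TEXT, last_name TEXT, street TEXT, phone TEXT)")
conn.execute("INSERT INTO subscribers VALUES ('Anna', 'Jensen', 'Nygade', '00000000')")

print(find_subscribers(conn, {"first_name": ["Anne", "Anna"],
                              "last_name": ["Jensen"],
                              "street": ["Nygade"]}))
```

Trying the cheap exact query first keeps the common case fast; the broader searches only run when the top hypotheses fail to match a record.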
Dialogue System
• Java implementation of the dialogue system and telephony server
• Uses the SLANG speech recognition library (C++)
• Connects to a public domain SQL database (MySQL)
• System-directed dialogue
• One word per turn - high perplexity
• Dynamic, parallel allocation of recognisers
Performance
Lack of test data - SpeechDat data were used (!)
Person names task:
• First name, optional middle name, last name
• 434 test utterances (speaker independent)
Results from a predecessor configuration (10646 last names, 2777 first/middle names):
• Recognition accuracy 1-best : 39.1 %
Conclusion - Part 2
A real system probably needs application-specific data, not to mention the dialogue aspect!
The effect of further acoustic model optimisation (on SpeechDat) may be marginal when N-best lists are used
Limited number of pronunciation variants available
Immediate steps are:
– test data!
– acoustic validation of retrieved candidates
Mixed initiative dialogue - CPK’s incentive to work on NaNu!