acoustic modeling on telephony speech corpora for directory assistance systems applications

REC Meeting, Lisboa, May 8, 2000

Børge Lindberg, CPK, Aalborg Univ., DK

Acoustic modeling on telephony speech corpora for directory assistance systems applications

Børge Lindberg,

Center for PersonKommunikation (CPK),

Aalborg University

Denmark



Outline

Part 1 - Acoustic modeling

• Reference recogniser (COST 249)

Part 2 - Directory assistance

• NaNu - Names & Numbers (Tele Danmark)

• Acoustic model optimisation

• Project and system details


COST 249

The COST 249 SpeechDat Multilingual Reference Recogniser

http://www.telenor.no/fou/prosjekter/taletek/refrec

• F.T. Johansen, N. Warakagoda (Telenor, Kjeller, Norway),

• B. Lindberg (CPK, Aalborg, Denmark),

• G. Lehtinen (ETH, Zürich, Switzerland),

• Z. Kacic, B. Imperl, A. Zgank (UMB, Maribor, Slovenia),

• B. Milner, D. Chaplin (British Telecom, Ipswich, UK),

• K. Elenius, G. Salvi (KTH, Stockholm, Sweden),

• E. Sanders, F. de Wet (KUN, Nijmegen, The Netherlands)


What is the reference recogniser?

• Phoneme based recogniser design procedure

• Language-independent

• Fully automatic, one script works straight from CDs

• Standardised database format: SpeechDat(II)

– Available in many languages world wide

– Oriented towards telephone applications

• Commonly available recogniser toolkit: HTK


Motivation

• A fast start for recognition research in new languages

– Share experience, avoid repeating the same mistakes

• Improve state-of-the-art

– Share research efforts

– Provide a benchmark for recogniser performance comparison across tasks and languages

• Facilitate true multilingual recognition research


Related Work

• COST 232

– Assumed TIMIT-like segmented database

• Reference verification systems

– CAVE, PICASSO

– COST 250

• GlobalPhone (Schultz & Waibel, ICSLP 98):

– Dictation type multilingual databases

– Language-independent and language-adaptive recognition


SpeechDat(II) databases

• 20 FDBs (fixed network), 5 MDBs (mobile networks)

• 500-5000 speakers, 4-8 minute recording sessions

• Telephone information and transaction services

• Compatible databases:

– SpeechDat(E): 5 Central and Eastern European languages

– SALA: 8 dialect zones in Latin America

– SpeechDat-Car: 9 languages, parallel GSM and in-car

– SpeechDat Australian English


Core Utterance Types in SpeechDat(II)

Number  Type                              Corpus code
1       isolated digit items              I
5       digit/number strings              B, C
1+      natural numbers                   N
1       money amounts                     M
2       yes/no questions                  Q
3+      dates                             D
2       times                             T
3       application keywords/keyphrases   A
1       word spotting phrase              E
5       directory assistance names        O
3       spellings                         L
4+      phonetically rich words           W
9       phonetically rich sentences       S
40+     In total


Recogniser design - version 0.95

• Standard HTK tutorial features (39-dimensional MFCC_0_D_A), no normalisation

• Word internal triphone HMMs, 3 states per model

• Decision-tree state clustering

• Trained from flat-start using only orthographic transcriptions and a SpeechDat lexicon

• Remove “difficult” utterances from the training set

• 1, 2, 4, 8, 16 and 32 diagonal covariance Gaussian mixtures

• Re-training on re-segmented material
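The "word internal triphone" design can be illustrated with a minimal sketch. It assumes the HTK naming convention left-centre+right. The function name and the three-phone example word are hypothetical and are not taken from the reference recogniser scripts or from a SpeechDat lexicon.

```python
def word_internal_triphones(phones):
    """Expand one word's phoneme list into word-internal context-dependent names.

    Context never crosses the word boundary, so the first and last phones
    of a word only get one-sided context (biphones).
    """
    names = []
    for i, centre in enumerate(phones):
        name = centre
        if i > 0:                      # left context exists inside the word
            name = f"{phones[i - 1]}-{name}"
        if i < len(phones) - 1:        # right context exists inside the word
            name = f"{name}+{phones[i + 1]}"
        names.append(name)
    return names


# Hypothetical three-phone word used only for illustration.
example = word_internal_triphones(["b", "ih", "t"])
print(example)
```

Because no context is shared across word boundaries, a single-phone word stays a plain monophone, which is one reason cross-word modelling is often preferred for connected digits.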


MFCC_0_D_A - feature set

Pre-emphasis 0.97

Frame shift 10 ms

Analysis window Hamming

Window length 25 ms

Spectrum type FFT-magnitude

Filterbank type Mel-scale

Filter shape Triangular

Filterbank channels 26

Cepstral coefficients 12

Cepstral liftering 22

Energy feature C0

Deltas 13

Delta-deltas 13

Total features 39
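As a rough sketch, the settings above can be written as HTK configuration parameters. The parameter names are standard HTK ones, but the mapping is an assumption and this is not the configuration file shipped with the reference recogniser. The helper names `ms_to_htk` and `feature_dimension` are hypothetical.

```python
def ms_to_htk(ms):
    """HTK expresses durations in units of 100 ns, so 1 ms = 10,000 units."""
    return ms * 10_000.0


# Assumed mapping of the slide's feature settings onto HTK parameter names.
config = {
    "TARGETKIND": "MFCC_0_D_A",   # MFCC + C0 + deltas + delta-deltas
    "TARGETRATE": ms_to_htk(10),  # frame shift 10 ms
    "WINDOWSIZE": ms_to_htk(25),  # window length 25 ms
    "USEHAMMING": "T",            # Hamming analysis window
    "PREEMCOEF": 0.97,            # pre-emphasis
    "NUMCHANS": 26,               # filterbank channels
    "NUMCEPS": 12,                # cepstral coefficients
    "CEPLIFTER": 22,              # cepstral liftering
}


def feature_dimension(num_ceps):
    """12 cepstra plus C0 give 13 static features; deltas and delta-deltas triple that."""
    static = num_ceps + 1
    return static * 3


config_text = "\n".join(f"{key} = {value}" for key, value in config.items())
print(config_text)
print("Total features:", feature_dimension(config["NUMCEPS"]))
```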


Test design

• Common test suite on SpeechDat

– I-test: Isolated digit recognition (SVIP)

– Q-test: Yes/no recognition (SVIP)

– A-test: Recognition of 30 isolated application words (SVIP)

– BC-test: Unknown length connected digit string recognition (SVWL)

– O-test: City name recognition (MVIP)

– W-test: Recognition of phonetically rich words (MVIP)

• Two test procedures used

– SVIP: Small Vocabulary Isolated Phrase

– MVIP: Medium Vocabulary Isolated Phrase

– SVWL: Small Vocabulary Word Loop, NIST alignment
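For the connected digit test, the word error rate is scored after aligning each recognised string with its reference. The sketch below assumes the usual definition, (substitutions + deletions + insertions) / reference words, and uses unit edit costs. It does not attempt to reproduce the NIST alignment used in the actual tests, and the digit strings are made up.

```python
def word_error_rate(reference, hypothesis):
    """Minimum edit distance between two word lists, divided by the reference length."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i                 # delete all reference words
    for j in range(cols):
        dist[0][j] = j                 # insert all hypothesis words
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1] / len(reference)


# Made-up digit strings: one deletion ("2") and one insertion ("5").
wer = word_error_rate(["1", "2", "3", "4"], ["1", "3", "4", "5"])
print(f"WER = {wer:.2f}")
```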


Results

• Six labs have completed the training procedure on the SpeechDat(II) databases

• KUN has converted the Dutch Polyphone to SpeechDat(II) format:

– trained only on phonetically rich sentences

– tested only on digit strings

• More details available on the web


Training Statistics

Language (database)         Train spkrs  Total uttr  Train uttr  Mono-phns  Tri-phns  Cluster reduc.
Danish FDB1000              800          34400       23216       71*        13056     7.3 %
Danish FDB4000              3500         150500      101100      71*        19032     11.5 %
Norwegian FDB1000           816          36720       20335       40*        7866      8.4 %
Slovenian FDB1000           800          34392       20548       39*        6613      10.8 %
Swedish FDB1000             800          38400       24827       46         10689     8.6 %
Swedish MDB1000             800          41600       34346       46         11876     7.8 %
Swiss German FDB1000        800          32580       17442       51*        12374     7.1 %
Dutch (polyph)**            4522         22602       20167       47         10194     13.0 %
British English MDB1000**   800          30917       26068       43         8368      12.0 %

* External information available (either session list, pronunciation lexicon or a phoneme mapping - see web-site)

** Results are for Refrec. v. 0.93


A typical training curve

[Figure: training curve. Vertical axis 0 to 35; horizontal axis lists the training stages mini_1_4, mini_1_6, mini_2_1, mini_4_1, mini_8_1, mini_16_1, mini_32_1, mono_1_4, mono_1_6, mono_2_1, mono_4_1, mono_8_1, mono_16_1, mono_32_1, tri_1_1, tied_1_1, tied_2_1, tied_4_1, tied_8_1, tied_16_1, tied_32_1.]


Word error rates

Language (database)        I      Q     A     BC     O      W
Danish FDB1000             1.0    1.1   2.4   2.3    15.8   64.4
Danish FDB4000             0.6    1.1   2.4   2.7    14.0   64.1
Norwegian FDB1000          2.3    0.5   4.4   5.9    17.3   34.7
Slovenian FDB1000          4.2    0.9   4.9   6.1    9.3    19.3
Swedish FDB1000            1.0    0.0   1.2   2.5    12.4   35.2
Swedish MDB1000            10.5   1.1   4.0   14.2   18.6   52.4
Swiss German FDB1000       0.5    0.3   1.1   3.1    6.3    24.3
Dutch (polyph)*            -      -     -     5.0    -      -
British English MDB1000*   10.2   -     -     -      -      -

* Results are for Refrec. v. 0.93

Average number of phonemes in test vocabularies

Language       I/BC   Q     A
Danish         2.6    2.0   4.6
Norwegian      2.9    2.0   4.6
Slovenian      3.9    2.0   6.5
Swedish        3.3    2.5   6.2
Swiss German   3.7    2.5   6.7


Word error rates - cont.

[Figure: "Word error rates for different tests". Error rate (0 to 16) plotted against phonemes per word (0 to 8) for the BC, I, Q and A tests.]



Database          Test   Error rate   Voc size   # phn
SwissGerman_FDB   O      6.3          684        12.6
Slovenian_FDB     O      9.3          597        10.4
Swedish_FDB       O      12.4         905        9.3
Danish_FDB        O      15.8         495        6.5
Norwegian_FDB     O      17.3         1182       7.3
Swedish_MDB       O      18.6         869        9.0
Slovenian_FDB     W      19.3         1491       6.8
SwissGerman_FDB   W      24.3         3274       7.9
Norwegian_FDB     W      34.7         3438       6.6
Swedish_FDB       W      35.2         3610       9.3
Swedish_MDB       W      52.4         3611       9.1
Danish_FDB        W      64.4         16934      8.8



#phn and voc size for different error rates

[Figure: two series, # Phn and 3*log(size), plotted on a vertical axis from 6 to 14 against error rate (horizontal axis, 0 to 80).]
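The 3*log(size) series can be recomputed from the vocabulary sizes of the O and W tests. The sketch below assumes a base-10 logarithm, which is a guess based on the 6 to 14 axis range (a natural logarithm would give values well above 14). The function name `scaled_log_size` is hypothetical.

```python
import math

# (database, test, error rate %, vocabulary size, average phonemes per word)
o_and_w_tests = [
    ("SwissGerman_FDB", "O", 6.3, 684, 12.6),
    ("Slovenian_FDB", "O", 9.3, 597, 10.4),
    ("Swedish_FDB", "O", 12.4, 905, 9.3),
    ("Danish_FDB", "O", 15.8, 495, 6.5),
    ("Norwegian_FDB", "O", 17.3, 1182, 7.3),
    ("Swedish_MDB", "O", 18.6, 869, 9.0),
    ("Slovenian_FDB", "W", 19.3, 1491, 6.8),
    ("SwissGerman_FDB", "W", 24.3, 3274, 7.9),
    ("Norwegian_FDB", "W", 34.7, 3438, 6.6),
    ("Swedish_FDB", "W", 35.2, 3610, 9.3),
    ("Swedish_MDB", "W", 52.4, 3611, 9.1),
    ("Danish_FDB", "W", 64.4, 16934, 8.8),
]


def scaled_log_size(voc_size):
    """3 * log10(vocabulary size); the base is an assumption, see the lead-in."""
    return 3 * math.log10(voc_size)


for database, test, error_rate, voc_size, phonemes in o_and_w_tests:
    print(f"{database:16s} {test}  err={error_rate:5.1f}  "
          f"#phn={phonemes:4.1f}  3*log(size)={scaled_log_size(voc_size):5.2f}")
```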


Language independent considerations

• Performance probably below state-of-the-art systems

– No whole-word modelling, no cross-word context (especially needed for connected digits)

– A lot of training data with noise has been removed

– No speaker noise or filled pause model

– Feature analyser not robust enough


Language differences

• Mobile database has 3-5 times the error rate of FDBs

– more robust modeling needed

• Slovenian: high noise level on recordings


Conclusion - part 1

• Practical/logistic problems mostly solved

• Future work:

– Improve language and database coverage

– More speakers: Swedish 5000

– More challenging tests, large vocabularies

– More analyses

– Improved training procedure, clustering


Directory assistance

• Recognition of ‘Names & Numbers’

• In collaboration with Tele Danmark

• Auto attendant/directory assistance applications

• Large vocabulary - for the first time in Danish

• Exploiting the SpeechDat(II) database

NaNu

Børge Lindberg, Bo Nygaard Bai, Tom Brøndsted, Jesper Ø. Olsen


Acoustic modeling - Decision trees

(Ref: HTK Book)


Acoustic modeling of Danish diphthongs

[Figure: bar chart, vertical axis 0 to 65, comparing the configurations FDB1200, NoDiph, FDB4000, T500 and ReSeg.]

Type      MonoPhn   Tri-Phns   Clusters   Reduction
FDB1200   71        15239      3653       8.0%
NoDiph    42        12814      3660       9.5%


Acoustic modeling - CMN

Cityname task

[Figure: error rate (0 to 80) over the training stages mini_1_4, mini_1_6, mini_2_1, mini_4_1, mini_8_1, mini_16_1, mini_32_1, mono_1_4, mono_1_6, mono_2_1, mono_4_1, mono_8_1, mono_16_1, mono_32_1, tri_1_1, tied_1_1, tied_2_1, tied_4_1, tied_8_1, tied_16_1, tied_32_1, for four systems: CMN_TR1000, CMN_TR3500, TR1000 and TR3500.]
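Cepstral mean normalisation (CMN) subtracts the per-utterance mean of each cepstral coefficient, which removes a constant channel offset. The sketch below shows the basic operation only. It is not the implementation used for the CMN_TR1000 and CMN_TR3500 systems, and the frame values are made up.

```python
def cepstral_mean_normalise(frames):
    """Subtract the utterance-level mean of every coefficient from each frame.

    `frames` is a list of equal-length feature vectors for one utterance.
    """
    frame_count = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / frame_count for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in frames]


# Made-up two-frame, two-coefficient utterance.
normalised = cepstral_mean_normalise([[1.0, 10.0], [3.0, 14.0]])
print(normalised)
```

After normalisation each coefficient has zero mean over the utterance, so a fixed channel offset no longer affects the features.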


Acoustic modeling - decision trees

STATE-CLUSTERS

TB: Likelihood increase threshold

R0: State occupancy threshold

FDB4000 (Reducing 57231 clusters):

R0\TB   100           350          600         1000
100     15343/26.8%   6554/11.5%   4503/7.9%   3286/5.7%
200     11697/20.4%   6332/11.1%   4457/7.8%   3278/5.7%
300     9630/16.8%    6084/10.6%   4399/7.7%   3257/5.7%
400     8251/14.4%    5806/10.1%   4318/7.5%   3234/5.7%
600     6540/11.4%    5208/9.1%    4111/7.2%   3166/5.5%
800     5528/9.7%     4720/8.2%    3913/6.8%   3096/5.4%
1200    4310/7.5%     3953/6.9%    3485/6.1%   2886/5.0%
1600    3574/6.2%     3396/5.9%    3107/5.4%   2683/4.7%

FDB1200 (Reducing 46089 clusters):

R0\TB 100 350
100 3531/7.7%
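The percentages in the table appear to be the number of clusters after tying divided by the number of states before tying. The sketch below checks that reading against a few table entries. The function name and the selection of entries are illustrative.

```python
STATES_BEFORE_TYING = {"FDB4000": 57231, "FDB1200": 46089}

# (R0, TB) -> clusters after tying, a few FDB4000 entries copied from the table.
fdb4000_clusters = {
    (100, 100): 15343,
    (100, 350): 6554,
    (100, 600): 4503,
    (100, 1000): 3286,
    (1600, 1000): 2683,
}


def reduction_percent(clusters, states_before):
    """Clusters remaining as a percentage of the original states, one decimal."""
    return round(100.0 * clusters / states_before, 1)


for (r0, tb), clusters in fdb4000_clusters.items():
    pct = reduction_percent(clusters, STATES_BEFORE_TYING["FDB4000"])
    print(f"R0={r0:5d} TB={tb:5d}: {clusters:6d} clusters = {pct}%")
```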


NaNu

Acoustic models

• SpeechDat - COST 249

• 20k+ tied-mixture tri-phones, 6554 clusters

• 16 mixture models - 100k+ mixture components

Database

• ¼ million subscribers (Århus and Næstved areas)

Vocabulary extracted from the database, limited to entries for which:

• there are at least two occurrences

• a transcription exists (Onomastica)
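The two vocabulary criteria can be sketched as a simple filter. The function name, the sample names and the placeholder transcriptions are hypothetical; the real Onomastica lexicon format is not shown in the slides.

```python
from collections import Counter


def extract_vocabulary(database_names, lexicon, min_count=2):
    """Keep names that occur at least `min_count` times and have a transcription."""
    counts = Counter(database_names)
    return sorted(
        name
        for name, count in counts.items()
        if count >= min_count and name in lexicon
    )


# Made-up subscriber surnames and a placeholder pronunciation lexicon.
names = ["Jensen", "Jensen", "Hansen", "Bai", "Bai"]
lexicon = {"Jensen": "<transcription>", "Hansen": "<transcription>"}

vocabulary = extract_vocabulary(names, lexicon)
print(vocabulary)
```

Here "Hansen" is dropped because it occurs only once, and "Bai" because it has no transcription.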


Vocabulary and Coverage

NaNu Vocabulary

Surname   Forename   Street   City/town
11291     3886       2598     26

# Unique database entries, Denmark (source Tele Danmark)

Coverage   Surname   Forename   Street   City/town
85%        14253     315        9691     1118
90%        28105     495        13379    1547
95%        65741     1243       19623    2371
100%       240606    49275      47057    9077
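One plausible reading of the coverage table is that each cell gives the number of most frequent unique entries needed to cover that share of all listings. The sketch below computes this under that assumption. The function name and the toy data are hypothetical, and the slides do not say how Tele Danmark derived the figures.

```python
from collections import Counter


def entries_needed(listings, target):
    """Count the most frequent unique entries needed to cover `target` (0..1) of listings."""
    counts = Counter(listings)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return rank
    return len(counts)


# Toy data: 10 listings over 3 unique names.
listings = ["A"] * 6 + ["B"] * 3 + ["C"]
for target in (0.5, 0.85, 1.0):
    print(f"{target:.0%} coverage needs {entries_needed(listings, target)} unique entries")
```

A long tail of rare names explains why the last few percent of coverage require most of the unique entries.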


SLANG

Recogniser - Spoken LANGuage

• Speech Recognition Research Platform

• For Dialogue Systems execution

• Modular design and implementation (C++)

• Frame synchronous operation

• Dynamic Tree Structured Decoder

• Optimised towards large vocabulary recognition (Gaussian mixture selection)


NLP

• N-Best lists are parsed into semantic frames and SQL queries are generated according to the following strategy:

1. simple 1-best match

2. full search in all N-best lists

3. underspecified (street name and last name required to be contained in the N-best list)

• Output is “converted” to synthetic speech.
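The three-step matching strategy can be sketched as a sequence of progressively looser SQL queries. This is a minimal illustration using an in-memory SQLite database. The table name `subscribers`, the column names, the sample rows and the N-best lists are all hypothetical, and the actual NaNu system used mySQL with a schema that the slides do not describe.

```python
import sqlite3

SLOTS = ("forename", "surname", "street", "city")


def build_query(nbest, slots):
    """Return (sql, params) requiring each slot to match one of its hypotheses."""
    clauses, params = [], []
    for slot in slots:
        hypotheses = nbest[slot]
        placeholders = ", ".join("?" * len(hypotheses))
        clauses.append(f"{slot} IN ({placeholders})")
        params.extend(hypotheses)
    sql = "SELECT forename, surname, street, city FROM subscribers WHERE "
    return sql + " AND ".join(clauses), params


def lookup(conn, nbest):
    """Try 1-best, then full N-best, then the underspecified query."""
    one_best = {slot: hyps[:1] for slot, hyps in nbest.items()}
    attempts = [
        ("1-best", one_best, SLOTS),
        ("n-best", nbest, SLOTS),
        ("underspecified", nbest, ("street", "surname")),
    ]
    for name, hypotheses, slots in attempts:
        sql, params = build_query(hypotheses, slots)
        rows = conn.execute(sql, params).fetchall()
        if rows:
            return name, rows
    return "no match", []


# Hypothetical subscriber table and recogniser output.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscribers (forename, surname, street, city)")
conn.executemany(
    "INSERT INTO subscribers VALUES (?, ?, ?, ?)",
    [("Anna", "Jensen", "Nygade", "Aalborg"), ("Anne", "Hansen", "Nygade", "Odense")],
)
nbest = {
    "forename": ["Anne", "Anna"],
    "surname": ["Jensen", "Hansen"],
    "street": ["Nygade"],
    "city": ["Aalborg"],
}
strategy, rows = lookup(conn, nbest)
print(strategy, rows)
```

In this toy case the 1-best combination matches no subscriber, so the search falls back to the full N-best lists.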


Dialogue System

• Java implementation of dialogue system and telephony server.

• uses SLANG speech recognition library in C++

• connects to public domain SQL database (mySQL)

• system directed dialogue

• one word per turn - high perplexity

• dynamic, parallel allocation of recognisers


Performance

Lack of test data - SpeechDat data were used (!)

Person names task:

• First name, optional middle name, last name

• 434 test utterances (speaker independent)

Results from predecessor configuration: (10646 last names, 2777 first/middle names):

• Recognition accuracy 1-best : 39.1 %


Conclusion - Part 2

A real system probably needs application-specific data, not to mention the dialogue aspect!

Effect of further acoustic model optimisation (on SpeechDat) may be marginal, when N-best lists are used

Limited number of pronunciation variants available

Immediate steps are:

– test data!

– acoustic validation of retrieved candidates

Mixed initiative dialogue - CPK’s incentive to work on NaNu !
