Acoustic modeling on telephony speech corpora for directory assistance systems applications

REC Meeting, Lisboa, May 8, 2000 Børge Lindberg, CPK, Aalborg Univ., DK Page 1 Acoustic modeling on telephony speech corpora for directory assistance systems applications Børge Lindberg, Center for PersonKommunikation (CPK), Aalborg University Denmark [[email protected]]



Acoustic modeling on telephony speech corpora

for directory assistance systems applications

Børge Lindberg,

Center for PersonKommunikation (CPK),

Aalborg University

Denmark

[[email protected]]


Outline

Part 1 - Acoustic modeling

• Reference recogniser (COST 249)

Part 2 - Directory assistance

• NaNu - Names & Numbers (Tele Danmark)

• Acoustic model optimisation

• Project and system details


COST 249

The COST 249 SpeechDat Multilingual Reference Recogniser

http://www.telenor.no/fou/prosjekter/taletek/refrec

• F.T. Johansen, N. Warakagoda (Telenor, Kjeller, Norway),

• B. Lindberg (CPK, Aalborg, Denmark),

• G. Lehtinen (ETH, Zürich, Switzerland),

• Z. Kacic, B. Imperl, A. Zgank (UMB, Maribor, Slovenia),

• B. Milner, D. Chaplin (British Telecom, Ipswich, UK),

• K. Elenius, G. Salvi (KTH, Stockholm, Sweden),

• E. Sanders, F. de Wet (KUN, Nijmegen, The Netherlands)


What is the reference recogniser?

• Phoneme based recogniser design procedure

• Language-independent

• Fully automatic, one script works straight from CDs

• Standardised database format: SpeechDat(II)

– Available in many languages world wide

– Oriented towards telephone applications

• Commonly available recogniser toolkit: HTK


Motivation

• A fast start for recognition research in new languages

– Share experience, avoid repeating the same mistakes

• Improve state-of-the-art

– Share research efforts

– Provide a benchmark for recogniser performance comparison across tasks and languages

• Facilitate true multilingual recognition research


Related Work

• COST 232

– Assumed TIMIT-like segmented database

• Reference verification systems

– CAVE, PICASSO

– COST 250

• GlobalPhone (Schultz & Waibel, ICSLP 98):

– Dictation type multilingual databases

– Language-independent and language-adaptive recognition


SpeechDat(II) databases

• 20 FDBs (fixed network), 5 MDBs (mobile networks)

• 500-5000 speakers, recording sessions of 4-8 minutes

• Telephone information and transaction services

• Compatible databases:

– SpeechDat(E): 5 Central and Eastern European languages

– SALA: 8 dialect zones in Latin America

– SpeechDat-Car: 9 languages, parallel GSM and in-car

– SpeechDat Australian English


Core Utterance Types in SpeechDat(II)

number  type                              corpus code
1       isolated digit items              I
5       digit/number strings              B,C
1+      natural numbers                   N
1       money amounts                     M
2       yes/no questions                  Q
3+      dates                             D
2       times                             T
3       application keywords/keyphrases   A
1       word spotting phrase              E
5       directory assistance names        O
3       spellings                         L
4+      phonetically rich words           W
9       phonetically rich sentences       S
40+     In total


Recogniser design - version 0.95

• Standard HTK tutorial features (39-dimensional MFCC_0_D_A), no normalisation

• Word internal triphone HMMs, 3 states per model

• Decision-tree state clustering

• Trained from flat-start using only orthographic transcriptions and a SpeechDat lexicon

• Remove “difficult” utterances from the training set

• 1, 2, 4, 8, 16 and 32 diagonal covariance Gaussian mixtures

• Re-training on re-segmented material


MFCC_0_D_A - feature set

Pre-emphasis 0.97

Frame shift 10 ms

Analysis window Hamming

Window length 25 ms

Spectrum type FFT-magnitude

Filterbank type Mel-scale

Filter shape Triangular

Filterbank channels 26

Cepstral coefficients 12

Cepstral liftering 22

Energy feature C0

Deltas 13

Delta-deltas 13

Total features 39
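
The settings above map fairly directly onto HTK configuration variables. The Python sketch below renders such a configuration and checks the 39-dimensional total. The dictionary `FRONT_END` and both helper functions are illustrative assumptions, not part of the reference recogniser scripts; the variable names follow HTK conventions, where times are given in units of 100 ns.

```python
# Sketch only: maps the slide's front-end settings onto HTK-style
# configuration variables. FRONT_END and the helper names are hypothetical.
FRONT_END = {
    "TARGETKIND": "MFCC_0_D_A",  # 12 cepstra + C0, with deltas and delta-deltas
    "TARGETRATE": 100000.0,      # 10 ms frame shift (100 ns units)
    "WINDOWSIZE": 250000.0,      # 25 ms analysis window
    "USEHAMMING": "T",
    "PREEMCOEF": 0.97,
    "NUMCHANS": 26,
    "NUMCEPS": 12,
    "CEPLIFTER": 22,
}


def feature_dimension(num_ceps: int, target_kind: str) -> int:
    """Static size (cepstra, plus C0 if '_0' is present) times 1, 2 or 3."""
    qualifiers = target_kind.split("_")[1:]
    static = num_ceps + (1 if "0" in qualifiers else 0)
    streams = 1 + ("D" in qualifiers) + ("A" in qualifiers)
    return static * streams


def render_config(settings: dict) -> str:
    """Render the settings as 'NAME = value' lines."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


dimension = feature_dimension(FRONT_END["NUMCEPS"], FRONT_END["TARGETKIND"])
print(render_config(FRONT_END))
print("feature dimension:", dimension)
```

With 12 cepstral coefficients plus C0, the static vector has 13 elements, and the delta and delta-delta streams triple it to 39.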


Test design

• Common test suite on SpeechDat

– I-test: Isolated digit recognition (SVIP)

– Q-test: Yes/no recognition (SVIP)

– A-test: Recognition of 30 isolated application words (SVIP)

– BC-test: Unknown length connected digit string recognition (SVWL)

– O-test: City name recognition (MVIP)

– W-test: Recognition of phonetically rich words (MVIP)

• Three test procedures used

– SVIP: Small Vocabulary Isolated Phrase

– MVIP: Medium Vocabulary Isolated Phrase

– SVWL: Small Vocabulary Word Loop, NIST alignment
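
Error rates for the connected digit test depend on aligning the recognised string with the reference. The sketch below uses a plain Levenshtein alignment as a stand-in for the NIST alignment named on the slide. That substitution is an assumption, and the function name and the Danish digit strings are illustrative only.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """(substitutions + deletions + insertions) / reference length, in percent."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # dist[i][j] = minimum number of edits turning reference[:i] into hypothesis[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[-1][-1] / len(reference)


# One deleted digit ("to") and one inserted digit ("fem") against four reference words.
wer = word_error_rate("en to tre fire".split(), "en tre fire fem".split())
print(wer)
```

For isolated phrase tests (SVIP, MVIP) each utterance is a single item, so the same function reduces to counting misrecognised utterances.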


Results

• Six labs have completed the training procedure on the SpeechDat(II) databases

• KUN has converted the Dutch Polyphone to SpeechDat(II) format:

– trained only on phonetically rich sentences

– tested only on digit strings

• More details available on the web


Training Statistics

Language (database)        Train spkrs  Total uttr  Train uttr  Mono-phns  Tri-phns  Cluster reduc.
Danish FDB1000                     800       34400       23216        71*     13056           7.3 %
Danish FDB4000                    3500      150500      101100        71*     19032          11.5 %
Norwegian FDB1000                  816       36720       20335        40*      7866           8.4 %
Slovenian FDB1000                  800       34392       20548        39*      6613          10.8 %
Swedish FDB1000                    800       38400       24827        46      10689           8.6 %
Swedish MDB1000                    800       41600       34346        46      11876           7.8 %
Swiss German FDB1000               800       32580       17442        51*     12374           7.1 %
Dutch (polyph)**                  4522       22602       20167        47      10194          13.0 %
British English MDB1000**          800       30917       26068        43       8368          12.0 %

* External information available (either session list, pronunciation lexicon or a phoneme mapping - see web-site)

** Results are for Refrec. v. 0.93


A typical training curve

[Figure: training curve over the successive model sets mini_1_4, mini_1_6, mini_2_1, mini_4_1, mini_8_1, mini_16_1, mini_32_1, mono_1_4, mono_1_6, mono_2_1, mono_4_1, mono_8_1, mono_16_1, mono_32_1, tri_1_1, tied_1_1, tied_2_1, tied_4_1, tied_8_1, tied_16_1, tied_32_1; y-axis 0 to 35]


Word error rates

Language (database)            I     Q     A    BC     O     W
Danish FDB1000               1,0   1,1   2,4   2,3  15,8  64,4
Danish FDB4000               0,6   1,1   2,4   2,7  14,0  64,1
Norwegian FDB1000            2,3   0,5   4,4   5,9  17,3  34,7
Slovenian FDB1000            4,2   0,9   4,9   6,1   9,3  19,3
Swedish FDB1000              1,0   0,0   1,2   2,5  12,4  35,2
Swedish MDB1000             10,5   1,1   4,0  14,2  18,6  52,4
Swiss German FDB1000         0,5   0,3   1,1   3,1   6,3  24,3
Dutch (polyph)*                -     -     -   5,0     -     -
British English MDB1000*    10,2     -     -     -     -     -

* Results are for Refrec. v. 0.93

Average number of phonemes in test vocabularies

Language       I/BC    Q     A
Danish          2,6   2,0   4,6
Norwegian       2,9   2,0   4,6
Slovenian       3,9   2,0   6,5
Swedish         3,3   2,5   6,2
Swiss German    3,7   2,5   6,7


Word error rates - cont.

[Figure: Word error rates for different tests. x-axis: phonemes per word (0,0 to 8,0); y-axis: error rate (0,0 to 16,0); series: BC, I, Q, A]


Word error rates - cont.

Database          Test  Error rate  Voc size  # phn
SwissGerman_FDB   O            6,3       684   12,6
Slovenian_FDB     O            9,3       597   10,4
Swedish_FDB       O           12,4       905    9,3
Danish_FDB        O           15,8       495    6,5
Norwegian_FDB     O           17,3      1182    7,3
Swedish_MDB       O           18,6       869    9,0
Slovenian_FDB     W           19,3      1491    6,8
SwissGerman_FDB   W           24,3      3274    7,9
Norwegian_FDB     W           34,7      3438    6,6
Swedish_FDB       W           35,2      3610    9,3
Swedish_MDB       W           52,4      3611    9,1
Danish_FDB        W           64,4     16934    8,8


Word error rates - cont.

[Figure: #phn and voc size for different error rates. x-axis: error rate (0,0 to 80,0); y-axis: #phn and 3*log(size) (6,0 to 14,0); series: # Phn, 3*log(size)]
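
The second series in the plot is a scaled logarithm of the vocabulary size, which lets it share an axis with the phoneme counts. The sketch below recomputes it for a few rows of the preceding table. It assumes a base-10 logarithm: the slide does not state the base, but base 10 keeps the values inside the plotted 6 to 14 range. The function name and the row selection are illustrative.

```python
import math

# (database, test, error rate %, vocabulary size, avg. phonemes per word)
ROWS = [
    ("SwissGerman_FDB", "O", 6.3, 684, 12.6),
    ("Danish_FDB", "O", 15.8, 495, 6.5),
    ("Norwegian_FDB", "W", 34.7, 3438, 6.6),
    ("Danish_FDB", "W", 64.4, 16934, 8.8),
]


def scaled_log_size(vocabulary_size: int) -> float:
    """3 * log10(vocabulary size); the base is an assumption."""
    return 3.0 * math.log10(vocabulary_size)


for name, test, error_rate, size, phonemes in ROWS:
    print(f"{name:16s} {test} {error_rate:5.1f} {phonemes:5.1f} {scaled_log_size(size):6.2f}")
```

Printing both quantities next to the error rate makes it easier to see whether vocabulary size or word length tracks the error rate more closely for a given test.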


Language independent considerations

• Performance probably below state-of-the-art systems

– No whole-word modelling, no cross-word context (especially needed for connected digits)

– A lot of training data with noise has been removed

– No speaker noise or filled pause model

– Feature analyser not robust enough


Language differences

• Mobile database has 3-5 times the error rate of FDBs

– more robust modeling needed

• Slovenian: high noise level on recordings


Conclusion - part 1

• Practical/logistic problems mostly solved

• Future work:

– Improve language and database coverage

– More speakers: Swedish 5000

– More challenging tests, large vocabularies

– More analyses

– Improved training procedure, clustering


Directory assistance

• Recognition of ‘Names & Numbers’

• In collaboration with Tele Danmark

• Auto attendant/directory assistance applications

• Large vocabulary - for the first time in Danish

• Exploiting the SpeechDat(II) database

NaNu

Børge Lindberg, Bo Nygaard Bai,

Tom Brøndsted, Jesper Ø. Olsen


Acoustic modeling - Decision trees

(Ref: HTK Book)


Acoustic modeling of Danish diphthongs

[Figure: comparison of the configurations FDB1200, NoDiph, FDB4000, T500 and ReSeg; y-axis 0,0 to 65,0]

Type      MonoPhn  Tri-Phns  Clusters  Reduction
FDB1200        71     15239      3653       8.0%
NoDiph         42     12814      3660       9.5%


Acoustic modeling - CMN

[Figure: Cityname task. Error rate (y-axis, 0,00 to 80,00) over the model sets mini_1_4, mini_1_6, mini_2_1, mini_4_1, mini_8_1, mini_16_1, mini_32_1, mono_1_4, mono_1_6, mono_2_1, mono_4_1, mono_8_1, mono_16_1, mono_32_1, tri_1_1, tied_1_1, tied_2_1, tied_4_1, tied_8_1, tied_16_1, tied_32_1; series: CMN_TR1000, CMN_TR3500, TR1000, TR3500]
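
CMN (cepstral mean normalisation) removes the long-term average of each cepstral coefficient, which compensates for stationary channel effects such as the telephone line. The slide does not say which variant was used; the sketch below assumes the common per-utterance form and works on plain Python lists, with toy values chosen for illustration.

```python
def cepstral_mean_normalise(frames: list[list[float]]) -> list[list[float]]:
    """Subtract the per-utterance mean of every coefficient from each frame."""
    if not frames:
        return []
    count = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / count for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in frames]


# Three toy frames with two coefficients each. The constant offset on the
# first coefficient (a stand-in for a channel effect) is removed.
utterance = [[11.0, 0.5], [12.0, -0.5], [13.0, 0.0]]
normalised = cepstral_mean_normalise(utterance)
print(normalised)
```

Because the mean is computed over the whole utterance, this form needs the complete recording before the first frame can be normalised.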


Acoustic modeling - decision trees

STATE-CLUSTERS

FDB4000 (Reducing 57231 clusters):

TB: Likelihood increase threshold

R0: State occupancy threshold

R0\TB          100          350          600         1000
100    15343/26.8%   6554/11.5%    4503/7.9%    3286/5.7%
200    11697/20.4%   6332/11.1%    4457/7.8%    3278/5.7%
300     9630/16.8%   6084/10.6%    4399/7.7%    3257/5.7%
400     8251/14.4%   5806/10.1%    4318/7.5%    3234/5.7%
600     6540/11.4%    5208/9.1%    4111/7.2%    3166/5.5%
800      5528/9.7%    4720/8.2%    3913/6.8%    3096/5.4%
1200     4310/7.5%    3953/6.9%    3485/6.1%    2886/5.0%
1600     3574/6.2%    3396/5.9%    3107/5.4%    2683/4.7%

FDB1200 (Reducing 46089 clusters):

R0\TB 100 350

100 3531/7.7%
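
The percentages in these tables appear to be the number of tied state clusters divided by the number of states before clustering. A minimal check under that assumption (the function name is illustrative):

```python
def cluster_reduction(clusters: int, untied_states: int) -> float:
    """Tied clusters as a percentage of the state count before clustering."""
    return round(100.0 * clusters / untied_states, 1)


# FDB4000: 57231 states before clustering
fdb4000_tb350 = cluster_reduction(6554, 57231)   # R0=100, TB=350
fdb4000_tb100 = cluster_reduction(15343, 57231)  # R0=100, TB=100
# FDB1200: 46089 states before clustering
fdb1200 = cluster_reduction(3531, 46089)

print(fdb4000_tb350, fdb4000_tb100, fdb1200)
```

The three values come out as 11.5, 26.8 and 7.7, matching the corresponding table cells.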


NaNu

Acoustic models

• SpeechDat - COST 249

• 20k+ tied-mixture tri-phones, 6554 clusters

• 16 mixture models - 100k+ mixture components

Database

• ¼ million subscribers (Århus and Næstved areas)

Vocabulary extracted from database, for which:

• there is a minimum of two occurrences

• transcription exists (Onomastica)
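
The two selection criteria can be sketched as a simple filter. This is an illustration under assumptions: the Onomastica lookup is represented by a plain set of names, and the function name and sample data are hypothetical.

```python
from collections import Counter


def extract_vocabulary(entries: list[str], transcribed: set[str],
                       min_count: int = 2) -> list[str]:
    """Keep names seen at least min_count times that also have a transcription."""
    counts = Counter(entries)
    return sorted(name for name, n in counts.items()
                  if n >= min_count and name in transcribed)


# Toy subscriber surnames and a toy set of names with a known pronunciation.
surnames = ["Jensen", "Nielsen", "Jensen", "Hansen", "Nielsen", "Bai"]
lexicon = {"Jensen", "Hansen", "Bai"}
vocabulary = extract_vocabulary(surnames, lexicon)
print(vocabulary)
```

Here only "Jensen" survives: "Nielsen" occurs twice but has no transcription, while "Hansen" and "Bai" occur only once.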


Vocabulary and Coverage

NaNu Vocabulary

Surname  Forename  Street  City/town
  11291      3886    2598         26

# Unique database entries, Denmark (source Tele Danmark)

Coverage  Surname  Forename  Street  City/town
85%         14253       315    9691       1118
90%         28105       495   13379       1547
95%         65741      1243   19623       2371
100%       240606     49275   47057       9077
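
A coverage figure of this kind can be computed by sorting the unique entries by frequency and counting how many are needed to reach a given share of all listings. The sketch below assumes that this is what "coverage" means in the table; the function name and the toy listings are illustrative.

```python
from collections import Counter


def entries_needed(names: list[str], coverage: float) -> int:
    """Number of most frequent unique names needed to cover the given
    fraction of all listings."""
    counts = sorted(Counter(names).values(), reverse=True)
    target = coverage * len(names)
    covered = 0
    for used, count in enumerate(counts, start=1):
        covered += count
        if covered >= target:
            return used
    return len(counts)


# Ten toy listings with four unique surnames.
listings = ["Jensen"] * 5 + ["Nielsen"] * 3 + ["Hansen"] + ["Bai"]
print(entries_needed(listings, 0.5), entries_needed(listings, 0.8),
      entries_needed(listings, 1.0))
```

The long tail is visible even in the toy data: half of the unique names are needed only for the last 20% of listings, which mirrors the steep growth between the 95% and 100% rows.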


SLANG

Recogniser - Spoken LANGuage

• Speech Recognition Research Platform

• For Dialogue Systems execution

• Modular design and implementation (C++)

• Frame synchronous operation

• Dynamic Tree Structured Decoder

• Optimised towards large vocabulary recognition (Gaussian mixture selection)


NLP

• N-Best lists are parsed into semantic frames and SQL queries are generated according to the following strategy:

1. simple 1-best match

2. full search in all N-best lists

3. under-specified (street name and last name required to be contained in the N-best list)

• Output is “converted” to synthetic speech.
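
The three-step query strategy can be sketched as a fallback chain. Everything concrete here is an assumption made for illustration: the `subscribers` table and its columns, the `lookup` function, the fictional sample data, and the use of Python's built-in `sqlite3` in place of the mySQL database named in the slides. The reading of step 2 as "any hypothesis from each field's N-best list may match" is also an interpretation.

```python
import sqlite3


def lookup(db: sqlite3.Connection, nbest: dict[str, list[str]]) -> tuple[int, list[tuple]]:
    """Try the three query strategies in order; return (strategy, rows)."""
    fields = ["first_name", "last_name", "street", "city"]
    select = "SELECT first_name, last_name, street, city, phone FROM subscribers WHERE "

    # 1. Simple 1-best match: the top hypothesis of every field must match.
    where = " AND ".join(f"{f} = ?" for f in fields)
    rows = db.execute(select + where, [nbest[f][0] for f in fields]).fetchall()
    if rows:
        return 1, rows

    def any_of(field_names: list[str]) -> tuple[str, list[str]]:
        """Build 'field IN (?, ?, ...)' clauses over the given N-best lists."""
        clause = " AND ".join(
            f"{f} IN ({', '.join('?' * len(nbest[f]))})" for f in field_names)
        params = [hyp for f in field_names for hyp in nbest[f]]
        return clause, params

    # 2. Full search: any hypothesis from each field's N-best list may match.
    where, params = any_of(fields)
    rows = db.execute(select + where, params).fetchall()
    if rows:
        return 2, rows

    # 3. Under-specified: only street and last name must be in the N-best lists.
    where, params = any_of(["street", "last_name"])
    return 3, db.execute(select + where, params).fetchall()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE subscribers (first_name, last_name, street, city, phone)")
db.executemany("INSERT INTO subscribers VALUES (?, ?, ?, ?, ?)", [
    ("Anne", "Jensen", "Nygade", "Aarhus", "555-0101"),
    ("Bo", "Nielsen", "Torvet", "Naestved", "555-0102"),
])
nbest = {
    "first_name": ["Anna", "Hanne"],   # correct first name missing from the list
    "last_name": ["Hansen", "Jensen"],
    "street": ["Nygade", "Nytorv"],
    "city": ["Aarhus"],
}
strategy, rows = lookup(db, nbest)
print(strategy, rows)
```

In this toy case the first name is never recognised correctly, so steps 1 and 2 fail and the entry is found by the under-specified query on street and last name.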


Dialogue System

• Java implementation of dialogue system and telephony server.

• uses SLANG speech recognition library in C++

• connects to public domain SQL database (mySQL)

• system directed dialogue

• one word per turn - high perplexity

• dynamic, parallel allocation of recognisers


Performance

Lack of test data - SpeechDat data were used (!)

Person names task:

• First name, optional middle name, last name

• 434 test utterances (speaker independent)

Results from predecessor configuration (10646 last names, 2777 first/middle names):

• Recognition accuracy 1-best : 39.1 %


Conclusion - Part 2

Real system probably needs application-specific data, not to mention the dialogue aspect!

Effect of further acoustic model optimisation (on SpeechDat) may be marginal when N-best lists are used

Limited number of pronunciation variants available

Immediate steps are:

– test data!

– acoustic validation of retrieved candidates

Mixed initiative dialogue - CPK’s incentive to work on NaNu!