
Articulatory Feature-based Speech Recognition: A Proposal for the 2006 JHU Summer Workshop on Language Engineering

Potential team members to date:

Karen Livescu (presenter), Simon King, Florian Metze, Jeff Bilmes, Mark Hasegawa-Johnson, Ozgur Cetin, Kate Saenko

November 12, 2005

[Figure: articulatory feature streams LIP-OP, TT-OPEN, TT-LOC, TB-OPEN, VELUM, GLOTTIS]

Motivations

• Why articulatory feature-based ASR?
  – Improved modeling of co-articulatory pronunciation phenomena
  – Take advantage of human perception and production knowledge
  – Application to audio-visual modeling
  – Application to multilingual ASR
  – Evidence of improved ASR performance with feature-based models
    * In noise [Kirchhoff et al. 2002]
    * For hyperarticulated speech [Soltau et al. 2002]
  – Potential savings in training data

• Why this workshop project?
  – Growing number of sites investigating complementary aspects of this idea; a non-exhaustive list:
    * U. Edinburgh (King et al.)
    * UIUC (Hasegawa-Johnson et al.)
    * MIT (Livescu, Glass, Saenko)
  – Recently developed tools (e.g. graphical models) for systematic exploration of the model space

The challenge of pronunciation variation

• Phonetic transcription of conversational pronunciations [Greenberg et al. '96]; counts of each observed variant are in parentheses:

  probably: (2) p r aa b iy; (1) p r ay; (1) p r aw l uh; (1) p r ah b iy; (1) p r aa l iy; (1) p r aa b uw; (1) p ow ih; (1) p aa iy; (1) p aa b uh b l iy; (1) p aa ah iy

  sense: (1) s eh n t s; (1) s ih t s

  everybody: (1) eh v r ax b ax d iy; (1) eh v er b ah d iy; (1) eh ux b ax iy; (1) eh r uw ay; (1) eh b ah iy

  don't: (37) d ow n; (16) d ow; (6) ow n; (4) d ow n t; (3) d ow t; (3) d ah n; (3) ow; (3) n ax; (2) d ax n; (2) ax; (1) n uw; (1) n; (1) t ow; (1) d ow ax n; ...

• Noted as an obstacle for recognition of conversational speech [McAllaster et al. '98, Saraçlar et al. '00]
  – Conversational speech is recognized at twice the error rate of read speech [Weintraub et al. '98]
  – Recognizer errors are correlated with reduced pronunciations [Fosler-Lussier '99]
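To make the spread of variants concrete, here is a minimal sketch that turns the listed counts for "don't" into relative frequencies. It covers only the variants explicitly listed on the slide (the elided "..." tail is left out), and the names `DONT_COUNTS` and `variant_shares` are illustrative, not from the original.

```python
# Counts for "don't" as listed on the slide; the elided "..." tail is NOT included,
# so all shares below are relative to the listed tokens only.
DONT_COUNTS = {
    "d ow n": 37, "d ow": 16, "ow n": 6, "d ow n t": 4, "d ow t": 3,
    "d ah n": 3, "ow": 3, "n ax": 3, "d ax n": 2, "ax": 2,
    "n uw": 1, "n": 1, "t ow": 1, "d ow ax n": 1,
}


def variant_shares(counts):
    """Return each variant's share of the listed tokens."""
    total = sum(counts.values())
    return {pron: count / total for pron, count in counts.items()}


if __name__ == "__main__":
    shares = variant_shares(DONT_COUNTS)
    # Print variants from most to least frequent.
    for pron, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{pron:10s} {share:.3f}")
```

Among the listed tokens, the full form "d ow n t" is a small minority, which is the point the slide makes about reduced pronunciations.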

Approach: Main Ideas

• Many ways to use articulatory features in ASR

• Approach for this project: Multiple streams of hidden articulatory states that can desynchronize and stray from target values
  – Inspired by linguistic theories, but simplified and cast in a probabilistic setting

[Figure: baseform dictionary + asynchrony + feature substitutions, illustrated for the word "everybody"]

Baseform dictionary entry for "everybody":

  index      0     1     2     3     …
  phone      eh    v     r     iy    …
  GLOT       V     V     V     V     …
  VEL        Off   Off   Off   Off   …
  LIP-OPEN   Wide  Crit  Wide  Wide  …

Asynchrony: per-frame index rows for ind GLOT, ind VEL and ind LIP-OPEN, for example:

  ind LIP-OPEN   0 0 0 0 1 1 1 2 2 2 2

Feature substitutions: underlying (U) and surface (S) values per frame:

  U LIP-OPEN   W W W W C C C C W W W W
  S LIP-OPEN   W W N N N C C C W W W W
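The baseform-plus-index idea on this slide can be sketched as a simple lookup: each feature stream carries its own index into the word's baseform, and the index selects that stream's target value in each frame. The sketch below uses the four baseform positions of "everybody" shown on the slide (eh, v, r, iy) and the `ind LIP-OPEN` row from the figure. The names `BASEFORM` and `underlying_values` are illustrative assumptions, and substitutions (surface values straying from targets) are not modeled here.

```python
# First four baseform positions of "everybody", as read from the slide's figure.
BASEFORM = [
    {"phone": "eh", "GLOT": "V", "VEL": "Off", "LIP-OPEN": "Wide"},
    {"phone": "v",  "GLOT": "V", "VEL": "Off", "LIP-OPEN": "Crit"},
    {"phone": "r",  "GLOT": "V", "VEL": "Off", "LIP-OPEN": "Wide"},
    {"phone": "iy", "GLOT": "V", "VEL": "Off", "LIP-OPEN": "Wide"},
]

# Per-frame baseform index for the LIP-OPEN stream, as shown in the figure.
IND_LIP_OPEN = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]


def underlying_values(feature, index_sequence, baseform=BASEFORM):
    """Map a per-frame index sequence to the feature's target value in each frame."""
    return [baseform[i][feature] for i in index_sequence]


if __name__ == "__main__":
    print(underlying_values("LIP-OPEN", IND_LIP_OPEN))
```

Because every stream has its own index sequence, two streams can point at different baseform positions in the same frame, which is how desynchronization is represented.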

Dynamic Bayesian network implementation: The context-independent case

Example DBN with 3 features:

[Figure: DBN unrolled over frames 0, 1, …, T. Each frame t contains word_t; the per-feature variables ind_t^j, U_t^j, S_t^j for j = 1, 2, 3; async_t^{1;2} with checkSync_t^{1;2} = 1; and async_t^{1,2;3} with checkSync_t^{1,2;3} = 1. One figure annotation reads "given by baseform pronunciations".]

$\Pr(\mathrm{async}^{1;2} = a) = \Pr(|\mathrm{ind}^1 - \mathrm{ind}^2| = a)$

$\mathrm{checkSync}^{1;2} = 1 \iff |\mathrm{ind}^1 - \mathrm{ind}^2| = \mathrm{async}^{1;2}$

Example conditional probability table from the figure (row and column labels as extracted):

        0    1    2    3    …
   0   .7   .2   .1    0    …
   1    0   .7   .2   .1    …
   2    0    0   .7   .2    …
   …    …    …    …    …    …
   4    0    0   .1    …
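The asynchrony mechanism on this slide can be sketched in a few lines: `async` is a random variable over degrees of asynchrony, and `checkSync` is a deterministic variable, observed to be 1, that is satisfied only when the absolute difference between two stream indices equals `async`. The distribution values in `ASYNC_DIST` below are hypothetical placeholders, not numbers from the proposal, and the function names are illustrative.

```python
# Hypothetical distribution over the degree of asynchrony between streams 1 and 2.
# These numbers are placeholders for illustration only.
ASYNC_DIST = {0: 0.7, 1: 0.2, 2: 0.1}


def check_sync(ind1, ind2, async_value):
    """Deterministic constraint: 1 iff |ind1 - ind2| equals the async value."""
    return 1 if abs(ind1 - ind2) == async_value else 0


def index_pair_weight(ind1, ind2, async_dist=ASYNC_DIST):
    """Weight of an index pair after summing out async with checkSync observed as 1."""
    return sum(p * check_sync(ind1, ind2, a) for a, p in async_dist.items())


if __name__ == "__main__":
    for pair in [(3, 3), (3, 4), (0, 3)]:
        print(pair, index_pair_weight(*pair))
```

Index pairs whose difference has zero probability under the async distribution get zero weight, so the distribution acts as a soft limit on how far the streams may drift apart.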

Recent related work

• Product observation models combining phones and features, $p(\mathrm{obs} \mid s) = p(\mathrm{obs} \mid ph_s) \prod_i p(\mathrm{obs} \mid f_s^i)$, improve ASR in some conditions
  – [Kirchhoff et al. 2002, Metze et al. 2002, Stueker et al. 2002]

• Lexical access from manual transcriptions of Switchboard words using DBN model above [Livescu & Glass 2004, 2005]
  – Improves over phone-based pronunciation models (~50% → ~25% error)
  – Preliminary result: Articulatory phonology features preferable to IPA-style (place/manner) features

• JHU WS'04 project [Hasegawa-Johnson et al. 2004]
  – Can combine landmarks + IPA-style features at acoustic level with articulatory phonology features at pronunciation level

• Articulatory recognition using DBN and ANN/DBN models [Wester et al. 2004, Frankel et al. 2005]
  – Modeling inter-feature dependencies useful, asynchrony may also be useful

• Lipreading using multistream DBN model + SVM feature detectors
  – Improves over viseme-based models in medium-vocabulary word ranking and realistic small-vocabulary task [Saenko et al. 2005]
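The product observation model in the first bullet is easiest to read in the log domain, where the product becomes a sum of per-stream log-likelihoods. The sketch below assumes the feature term is a product over feature streams i and takes the individual log-likelihoods as given. How they are computed, and whether streams are weighted, is outside this sketch, and the function name is illustrative.

```python
import math
from typing import Sequence


def product_log_score(phone_loglik: float, feature_logliks: Sequence[float]) -> float:
    """log p(obs|s) = log p(obs|ph_s) + sum_i log p(obs|f_s^i), assuming an unweighted product."""
    return phone_loglik + sum(feature_logliks)


if __name__ == "__main__":
    # Made-up likelihood values, purely to exercise the function.
    score = product_log_score(math.log(0.5), [math.log(0.5), math.log(0.2)])
    print(score, math.exp(score))
```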

Ongoing work: Audio-visual ASR

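[Figure: phoneme-viseme based model, with one audio state (phoneme) stream A and one visual state (viseme) stream V]

[Figure: articulatory feature-based model, with streams for lip features (L), tongue features (T) and glottis/velum (G), linked by asyncLT / checkSyncLT and asyncTG / checkSyncTG, with audio (A) and visual (V) observations]

Sample alignment from a prototype feature-based system:

[Figure: spectrogram and mouth images aligned with L phone, T phone and G phone tiers]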

A partial taxonomy of design issues

[Figure: decision tree over design choices; many leaves are marked "???"]

• Factored state (multistream structure)?
  – No: factored obs model?
    * Yes: obs model GM, SVM, NN [Metze '02, Kirchhoff '02, Juneja '04]
    * No: [Deng '97, Richardson '00]
  – Yes: state asynchrony
    * free within unit
    * soft asynchrony within word
    * coupled state transitions
    * cross-word soft asynchrony

• Each state asynchrony type branches further on fact. obs? (Y/N) and CD (Y/N)
  – Leaves named in the figure: [Kirchhoff '96, Wester et al. '04], CHMMs, FHMMs, [Livescu '04], [Livescu '05], [WS04]
  – Remaining leaves: ???

(Not to mention choice of feature sets... same in hidden structure and observation model?)

Goals for 2006 workshop

• To build complete articulatory feature-based ASR systems
  – Using multistream DBN structures
  – For both audio-only and audio-visual ASR

• To develop a thorough understanding of the design issues involved
  – Asynchrony modeling
  – Context modeling
  – Speaker dependency
  – Generative observation modeling vs. discriminative feature classification

Potential participants and contributors

• Local participants:
  – Karen Livescu, MIT
    * Feature-based ASR structures, graphical models, GMTK
  – Mark Hasegawa-Johnson, U. Illinois at Urbana-Champaign
    * Discriminative feature classification, JHU WS'04
  – Simon King, U. Edinburgh
    * Articulatory feature recognition, ANN/DBN structures
  – Ozgur Cetin, ICSI Berkeley
    * Multistream/multirate modeling, graphical models, GMTK
  – Florian Metze
    * Articulatory features in HMM framework
  – Jeff Bilmes, U. Washington
    * Graphical models, GMTK
  – Kate Saenko, MIT
    * Visual feature classification, AVSR
  – Others?

• Satellite/advisory contributors
  – Jim Glass, MIT
  – Katrin Kirchhoff, U. Washington

Resources

• Tools
  – GMTK
  – HTK
  – Intel AVCSR toolkit

• Data
  – Audio-only:
    * Svitchboard (CSTR Edinburgh): Small-vocab, continuous, conversational
    * PhoneBook: Medium-vocab, isolated-word, read
    * (Switchboard rescoring? LVCSR)
  – Audio-visual:
    * AVTIMIT (MIT): Medium-vocab, continuous, read, added noise
    * Digit strings database (MIT): Continuous, read, naturalistic setting (noise and video background)
  – Articulatory measurements:
    * X-ray microbeam database (U. Wisconsin): Many speakers, large-vocab, isolated-word and continuous
    * MOCHA (QMUC, Edinburgh): Few speakers, medium-vocab, continuous
    * Others?
  – Manual transcriptions: ICSI Berkeley Switchboard transcription project

Thanks!

Questions? Comments?