Articulatory Feature-based Speech Recognition: A Proposal for the 2006 JHU Summer Workshop on Language Engineering
November 12, 2005
Potential team members to date: Karen Livescu (presenter), Simon King, Florian Metze, Jeff Bilmes, Mark Hasegawa-Johnson, Ozgur Cetin, Kate Saenko
[Figure: articulatory feature streams LIP-OP, TT-OPEN, TT-LOC, TB-OPEN, VELUM, GLOTTIS]
Motivations
• Why articulatory feature-based ASR?
– Improved modeling of co-articulatory pronunciation phenomena
– Take advantage of human perception and production knowledge
– Application to audio-visual modeling
– Application to multilingual ASR
– Evidence of improved ASR performance with feature-based models
* In noise [Kirchhoff et al. 2002]
* For hyperarticulated speech [Soltau et al. 2002]
– Potential savings in training data
• Why this workshop project?
– Growing number of sites investigating complementary aspects of this idea; a non-exhaustive list:
* U. Edinburgh (King et al.)
* UIUC (Hasegawa-Johnson et al.)
* MIT (Livescu, Glass, Saenko)
– Recently developed tools (e.g. graphical models) for systematic exploration of the model space
The challenge of pronunciation variation
Observed pronunciations, with counts in parentheses:
probably:
(2) p r aa b iy
(1) p r ay
(1) p r aw l uh
(1) p r ah b iy
(1) p r aa l iy
(1) p r aa b uw
(1) p ow ih
(1) p aa iy
(1) p aa b uh b l iy
(1) p aa ah iy
sense:
(1) s eh n t s
(1) s ih t s
everybody:
(1) eh v r ax b ax d iy
(1) eh v er b ah d iy
(1) eh ux b ax iy
(1) eh r uw ay
(1) eh b ah iy
don’t:
(37) d ow n
(16) d ow
(6) ow n
(4) d ow n t
(3) d ow t
(3) d ah n
(3) ow
(3) n ax
(2) d ax n
(2) ax
(1) n uw
(1) n
(1) t ow
(1) d ow ax n
...
• Noted as an obstacle for recognition of conversational speech [McAllaster et al. ’98, Saraçlar et al. ’00]
– Conversational speech is recognized at twice the error rate of read speech [Weintraub et al. ’98]
– Recognizer errors are correlated with reduced pronunciations [Fosler-Lussier ’99]
• Phonetic transcription of conversational pronunciations [Greenberg et al. ’96]
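The spread of variants can be summarized numerically. The following is a minimal sketch, assuming the counts listed on the slide for "don’t"; the slide's list ends in "...", so the totals cover only the variants shown. The names `DONT_COUNTS` and `variant_stats` are illustrative and not from the original.

```python
import math

# Counts copied from the variants of "don't" listed on the slide.
# The slide's list is truncated ("..."), so unlisted variants are not included.
DONT_COUNTS = {
    "d ow n": 37, "d ow": 16, "ow n": 6, "d ow n t": 4, "d ow t": 3,
    "d ah n": 3, "ow": 3, "n ax": 3, "d ax n": 2, "ax": 2,
    "n uw": 1, "n": 1, "t ow": 1, "d ow ax n": 1,
}


def variant_stats(counts):
    """Return (total tokens, most frequent variant, its share, entropy in bits)."""
    total = sum(counts.values())
    probs = {variant: c / total for variant, c in counts.items()}
    entropy_bits = -sum(p * math.log2(p) for p in probs.values())
    top_variant = max(counts, key=counts.get)
    return total, top_variant, probs[top_variant], entropy_bits


total, top, top_share, entropy = variant_stats(DONT_COUNTS)
print(f"{total} tokens; most frequent variant {top!r} ({top_share:.1%}); "
      f"entropy {entropy:.2f} bits")
```

Among the listed tokens, the full form "d ow n t" accounts for only 4 occurrences, which is the kind of mismatch a single-pronunciation dictionary cannot capture.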
Approach: Main Ideas
[Figure: the baseform dictionary entry for "everybody" is expanded into per-feature index streams (ind GLOT, ind VEL, ind LIP-OPEN) with underlying (U) and surface (S) feature values. Panel labels: baseform dictionary + asynchrony + feature substitutions.]
Example streams from the figure:
ind LIP-OPEN: 0 0 0 0 1 1 1 2 2 2 2
U LIP-OPEN: W W W W C C C C W W W W
S LIP-OPEN: W W N N N C C C W W W W
Baseform dictionary entry for "everybody":
index     0     1     2     3     …
phone     eh    v     r     iy    …
GLOT      V     V     V     V     …
VEL       Off   Off   Off   Off   …
LIP-OPEN  Wide  Crit  Wide  Wide  …
…         …     …     …     …     …
• Many ways to use articulatory features in ASR
• Approach for this project: Multiple streams of hidden articulatory states that can desynchronize and stray from target values
– Inspired by linguistic theories, but simplified and cast in a probabilistic setting
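The multistream idea can be sketched in a few lines. This is an illustrative sketch only, assuming the first four phones of the slide's baseform entry for "everybody"; the per-frame index sequences, and the names `BASEFORM`, `underlying_values`, and `max_async`, are hypothetical. Surface substitutions (straying from target values) are not modeled here.

```python
# Baseform targets for the first four phones of "everybody", read off the
# slide's baseform dictionary table.
BASEFORM = {
    "phone":    ["eh", "v", "r", "iy"],
    "GLOT":     ["V", "V", "V", "V"],
    "VEL":      ["Off", "Off", "Off", "Off"],
    "LIP-OPEN": ["Wide", "Crit", "Wide", "Wide"],
}


def underlying_values(index_streams):
    """Map each feature's per-frame index sequence to its underlying targets."""
    return {
        feature: [BASEFORM[feature][i] for i in indices]
        for feature, indices in index_streams.items()
    }


def max_async(index_streams):
    """Largest per-frame index difference between any two feature streams."""
    frames = zip(*index_streams.values())
    return max(max(frame) - min(frame) for frame in frames)


# Hypothetical index sequences (not from the slide): LIP-OPEN reaches
# index 1 one frame before GLOT and VEL do.
streams = {
    "GLOT":     [0, 0, 0, 1, 1, 2, 3],
    "VEL":      [0, 0, 0, 1, 1, 2, 3],
    "LIP-OPEN": [0, 0, 1, 1, 1, 2, 3],
}

print(underlying_values(streams)["LIP-OPEN"])
print("max asynchrony (in index positions):", max_async(streams))
```

Because each feature keeps its own index into the baseform, one stream can run ahead of the others without any change to the dictionary entry itself.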
Dynamic Bayesian network implementation: The context-independent case
[Figure: example DBN with 3 features, shown for frames 0, 1, ..., t, ..., T. Each frame contains the variables word; ind^1, ind^2, ind^3 (per-feature indices); U^1, U^2, U^3 (underlying feature values, given by baseform pronunciations); S^1, S^2, S^3 (surface feature values); async^{1;2} and async^{1,2;3}; and checkSync^{1;2} and checkSync^{1,2;3} (abbreviated "sync" in frames 0, 1, and T), both observed with value 1.]
Relations shown in the figure:
$\Pr(\mathrm{async}^{1;2} = a) = \Pr(|\mathrm{ind}^1 - \mathrm{ind}^2| = a)$
$\mathrm{checkSync}^{1;2} = 1 \iff |\mathrm{ind}^1 - \mathrm{ind}^2| = \mathrm{async}^{1;2}$
Example probability table from the figure (row and column labels are value indices):
     0    1    2    3    4    …
0    .7   .2   .1   0    0    …
1    0    .7   .2   .1   0    …
2    0    0    .7   .2   .1   …
…    …    …    …    …    …    …
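The asynchrony mechanism can be illustrated with a small sketch. It assumes the relation the slide appears to define, namely that checkSync is 1 exactly when |ind^1 − ind^2| equals the async variable, so that observing checkSync = 1 weights each index pair by the probability of its degree of asynchrony. The distribution `ASYNC_PROBS` and the function names are hypothetical; this is not GMTK code.

```python
# Hypothetical distribution over the degree of asynchrony between two streams
# (illustrative values, not taken from the slides).
ASYNC_PROBS = {0: 0.6, 1: 0.3, 2: 0.1}


def check_sync(ind1, ind2, async_degree):
    """Return 1 if the index difference equals the asynchrony degree, else 0."""
    return int(abs(ind1 - ind2) == async_degree)


def sync_evidence_prob(ind1, ind2, async_probs=ASYNC_PROBS):
    """P(checkSync = 1 | ind1, ind2), summing over the hidden async variable."""
    return sum(p * check_sync(ind1, ind2, a) for a, p in async_probs.items())


for pair in [(3, 3), (2, 3), (0, 2), (0, 5)]:
    print(pair, sync_evidence_prob(*pair))
```

Index pairs whose difference has zero probability under the async distribution are ruled out entirely, which is how a structure of this kind bounds how far the streams can drift apart.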
Recent related work
• Product observation models combining phones and features, $p(obs \mid s) = p(obs \mid ph_s) \prod_i p(obs \mid f^i_s)$, improve ASR in some conditions
– [Kirchhoff et al. 2002, Metze et al. 2002, Stueker et al. 2002]
• Lexical access from manual transcriptions of Switchboard words using DBN model above [Livescu & Glass 2004, 2005]
– Improves over phone-based pronunciation models (~50% → ~25% error)
– Preliminary result: Articulatory phonology features preferable to IPA-style (place/manner) features
• JHU WS’04 project [Hasegawa-Johnson et al. 2004]
– Can combine landmarks + IPA-style features at acoustic level with articulatory phonology features at pronunciation level
• Articulatory recognition using DBN and ANN/DBN models [Wester et al. 2004, Frankel et al. 2005]
– Modeling inter-feature dependencies useful, asynchrony may also be useful
• Lipreading using multistream DBN model + SVM feature detectors
– Improves over viseme-based models in medium-vocabulary word ranking and realistic small-vocabulary task [Saenko et al. 2005]
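A product observation model of the kind named in the first bullet is usually evaluated in the log domain, where the product of a phone score and per-feature scores becomes a sum. The sketch below assumes that reading; the function name `product_log_likelihood` and the numeric values are hypothetical, and the cited systems may weight or normalize the streams differently.

```python
import math


def product_log_likelihood(phone_loglik, feature_logliks):
    """Log of p(obs | ph_s) times the product of per-feature p(obs | f^i_s).

    phone_loglik: log-likelihood of the observation under the phone model.
    feature_logliks: one log-likelihood per articulatory feature model.
    """
    return phone_loglik + sum(feature_logliks)


# Hypothetical scores for one frame and one state with two feature models.
phone_ll = math.log(0.5)
feature_lls = [math.log(0.2), math.log(0.1)]

score = product_log_likelihood(phone_ll, feature_lls)
print(f"combined log-likelihood: {score:.4f}")
```

Working with log-likelihoods avoids numerical underflow when many feature streams are multiplied together.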
Ongoing work: Audio-visual ASR
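[Figure: two model structures. Phoneme-viseme based: an audio state (phoneme) stream with audio observations A and a visual state (viseme) stream with visual observations V. Articulatory feature-based: streams for lip features (L), tongue features (T), and glottis/velum (G), coupled through asyncLT/checkSyncLT and asyncTG/checkSyncTG, with audio (A) and visual (V) observations.]
[Figure: sample alignment from a prototype feature-based system, showing the spectrogram, mouth images, and the G phone, T phone, and L phone tiers.]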
A partial taxonomy of design issues
[Figure: decision tree of design choices]
• factored state (multistream structure)?
– No: factored obs model?
* Yes: obs model: GM, SVM, NN [Metze ’02] [Kirchhoff ’02] [Juneja ’04]
* No: [Deng ’97, Richardson ’00]
– Yes: state asynchrony
* free within unit
* soft asynchrony within word
* coupled state transitions
* cross-word soft asynchrony
• Below each asynchrony type, the tree branches again on factored obs model (Y/N) and on CD (Y/N). Leaves labeled in the figure: [Livescu ’04], [Kirchhoff ’96, Wester et al. ’04], CHMMs, FHMMs, [Livescu ’05], [WS04]; the remaining leaves are marked "???".
(Not to mention choice of feature sets... same in hidden structure and observation model?)
Goals for 2006 workshop
• To build complete articulatory feature-based ASR systems
– Using multistream DBN structures
– For both audio-only and audio-visual ASR
• To develop a thorough understanding of the design issues involved
– Asynchrony modeling
– Context modeling
– Speaker dependency
– Generative observation modeling vs. discriminative feature classification
Potential participants and contributors
• Local participants:
– Karen Livescu, MIT
* Feature-based ASR structures, graphical models, GMTK
– Mark Hasegawa-Johnson, U. Illinois at Urbana-Champaign
* Discriminative feature classification, JHU WS’04
– Simon King, U. Edinburgh
* Articulatory feature recognition, ANN/DBN structures
– Ozgur Cetin, ICSI Berkeley
* Multistream/multirate modeling, graphical models, GMTK
– Florian Metze
* Articulatory features in HMM framework
– Jeff Bilmes, U. Washington
* Graphical models, GMTK
– Kate Saenko, MIT
* Visual feature classification, AVSR
– Others?
• Satellite/advisory contributors
– Jim Glass, MIT
– Katrin Kirchhoff, U. Washington
Resources
• Tools
– GMTK
– HTK
– Intel AVCSR toolkit
• Data
– Audio-only:
* Svitchboard (CSTR Edinburgh): Small-vocab, continuous, conversational
* PhoneBook: Medium-vocab, isolated-word, read
* (Switchboard rescoring? LVCSR)
– Audio-visual:
* AVTIMIT (MIT): Medium-vocab, continuous, read, added noise
* Digit strings database (MIT): Continuous, read, naturalistic setting (noise and video background)
– Articulatory measurements:
* X-ray microbeam database (U. Wisconsin): Many speakers, large-vocab, isolated-word and continuous
* MOCHA (QMUC, Edinburgh): Few speakers, medium-vocab, continuous
* Others?
– Manual transcriptions: ICSI Berkeley Switchboard transcription project
Thanks!
Questions? Comments?