
Page 1

Acoustic Modeling for Multi-Language, Multi-Style, Multi-Channel Automatic Speech Recognition

Mark Hasegawa-Johnson

Yuxiao Hu, Dennis Lin, Xiaodan Zhuang, Jui-Ting Huang, Xi Zhou, Zhen Li, and Thomas Huang

including also the research results of

Laehoon Kim and Harsh Sharma

University of Illinois

Page 2

Motivation: Applications in a Multilingual Society

News Hound: Find all TV news segments, in any language, mentioning “Barack Obama”

Language Learner: Transcribe the learner's accented speech; tell the learner which words sound accented

Broadcaster/Podcaster: Automatically transcribe “man on the street” interviews in a multilingual city (LA, Singapore)

Problems
Physical variability: noise, echo, talker
Imprecise categories: dependent on context
Content variability: language, topic, dialect, style

Page 3

Method: Transform and Infer (a ubiquitous methodology in ASR; see, e.g., Jelinek, 1976)

Signal transforms

Classifier transforms

Likelihood vector: b_i = p(observation_t | state_t = i)

Best label sequence = argmax p(label_1, ..., label_T | observation_1, ..., observation_T)

Inference algorithm: a parametric model of p(state_1, ..., state_T, label_1, ..., label_T)

Page 4

Signal Transforms
Transforms determined by a physical model of the signal

A good signal model tells you a lot:
Reverberation model (sketched below): y[n] = v[n] + Σ_m h[m] x[n−m]

x[n] produced by a human vocal tract, designed for efficient processing by a human auditory system

A good signal transform improves the accuracy of all classifiers:
Denoising: correct for additive noise
Dereverberation: correct for convolutional noise
Perceptual frequency warping: hear what humans hear
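Below is a minimal numerical sketch of the reverberation model above, assuming NumPy; the impulse response and noise are synthetic stand-ins, not measured data. It only illustrates how a clean signal x[n] becomes the observed y[n] that denoising and dereverberation then try to undo.

```python
# A minimal sketch of the signal model y[n] = v[n] + sum_m h[m] x[n-m]:
# clean speech x convolved with a room impulse response h, plus additive noise v.
import numpy as np

rng = np.random.default_rng(0)

fs = 16000                                  # sample rate (Hz)
x = rng.standard_normal(fs)                 # stand-in for 1 s of clean speech x[n]
h = np.zeros(2048)                          # synthetic room impulse response h[m]
h[0] = 1.0
h[400] = 0.6                                # a strong early reflection
h[1:] += 0.01 * rng.standard_normal(2047) * np.exp(-np.arange(1, 2048) / 500.0)
v = 0.05 * rng.standard_normal(fs + 2047)   # additive noise v[n]

y = np.convolve(x, h) + v                   # observed signal y[n]

# Denoising and dereverberation try to invert this model: estimate x[n] from y[n]
# so that downstream classifiers see something closer to the clean signal.
print(y.shape)
```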

Page 5

Denoising Example (Kim et al., 2006)

Page 6

Classifier Transforms
Compute a precise and accurate estimate of p(obs_t | state_t)

Robust Machine Learning
From a limited amount of training data, learn parameterized probability models that are as precise as possible, with a known upper bound on generalization error

Methods that trade off precision and generalization:
Decorrelate the signal measurements: PCA, DCT
Select the most informative features from an inventory: AdaBoost

Train a linear or nonlinear function z_t = f(y_t) that:
Discriminates among the training examples from different classes
Has known upper bounds on generalization error (SVM, ANN)

Train another nonlinear function p(z_t | state_t) with the same properties
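A small sketch of this recipe, assuming scikit-learn and synthetic data: decorrelate with PCA, train a bounded-capacity discriminant (a linear SVM), and calibrate its scores into per-state probabilities. Note that this yields posteriors p(state | z_t); a hybrid system would divide by the state priors to obtain the likelihoods p(z_t | state_t) an HMM needs.

```python
# A minimal sketch of the classifier-transform idea using scikit-learn:
# decorrelate features with PCA, train a discriminant function f(y_t) with a
# known-capacity model (a linear SVM), then calibrate its scores into per-state
# probability estimates. Data below is synthetic, for illustration only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_states = 3
Y = rng.standard_normal((600, 39))            # stand-in acoustic features y_t
states = rng.integers(0, n_states, size=600)  # stand-in state labels

model = make_pipeline(
    PCA(n_components=13),                      # decorrelate / reduce dimensionality
    CalibratedClassifierCV(LinearSVC()),       # discriminant + probability estimate
)
model.fit(Y, states)

# p(state | z_t): one probability per state for each frame
posteriors = model.predict_proba(Y[:5])
print(posteriors.shape)   # (5, n_states)
```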

Page 7

Classifier Transforms
Compute a precise and accurate estimate of p(obs_t | state_t)

Page 8

Inference
Integrate information to choose the best global label set

Labels = variables that matter globally
Speech Recognition: what words were spoken?
Information Retrieval: which segment best matches the query?
Language Learning: where is the error?

States = variables that can be classified locally

May be scalar, e.g., q_t = sub-phoneme

May be vector, e.g., q_t = [vector of articulatory states]

Inference algorithm = parametric model of p(states, labels)
Scalar states: Hidden Markov Model, Finite State Transducer
Vector states: Dynamic Bayesian Network, Conditional Random Field
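For the scalar-state case, the inference step is classical Viterbi decoding. The sketch below, with toy numbers, shows the argmax over state sequences given the frame likelihoods b_i and a transition model; it is illustrative only, not the deck's actual decoder.

```python
# A minimal Viterbi sketch for the scalar-state case (HMM): given frame
# log-likelihoods log_b[i, t] = log p(observation_t | state_t = i), a transition
# matrix A, and initial probabilities pi, find the most likely state sequence.
import numpy as np

def viterbi(log_b, log_A, log_pi):
    """log_b: (n_states, T) frame log-likelihoods; returns the best state path."""
    n_states, T = log_b.shape
    delta = np.full((n_states, T), -np.inf)   # best partial-path scores
    back = np.zeros((n_states, T), dtype=int) # backpointers
    delta[:, 0] = log_pi + log_b[:, 0]
    for t in range(1, T):
        scores = delta[:, t - 1][:, None] + log_A        # scores[from, to]
        back[:, t] = np.argmax(scores, axis=0)
        delta[:, t] = scores[back[:, t], np.arange(n_states)] + log_b[:, t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta[:, -1]))
    for t in range(T - 2, -1, -1):
        path[t] = back[path[t + 1], t + 1]
    return path

rng = np.random.default_rng(0)
log_b = np.log(rng.random((3, 10)))                      # synthetic likelihoods
log_A = np.log(np.full((3, 3), 1 / 3))                   # uniform transitions
log_pi = np.log(np.full(3, 1 / 3))
print(viterbi(log_b, log_A, log_pi))
```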

Page 9

Inference
Integrate information to choose the best global label set

Page 10

Example: Language-Independent Phone Recognition (Huang et al., in preparation)

Voice activity detection
Perceptual frequency warping
Gaussian mixtures

Likelihood vector: b_i = p(observation_t | state_t = i)

Best label sequence = argmax p(phone_1, ..., phone_T | observation_1, ..., observation_T)

Inference algorithm: Hidden Markov Model with token passing, modeling p(state_1, ..., state_T, phone_1, ..., phone_T)
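A minimal sketch of the “Gaussian mixtures” box in the pipeline above, assuming scikit-learn: fit one small GMM per (tied) state and evaluate each test frame under every state's model to obtain the likelihood vector b_i. A real system would train on PLP/MFCC frames with many more mixture components.

```python
# One GMM per state, producing the likelihood vector
# b_i = p(observation_t | state_t = i) that the HMM decoder consumes.
# Training data here is synthetic, for illustration only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_states, dim = 4, 39

# Fit one small GMM per state on that state's frames (synthetic here).
gmms = []
for i in range(n_states):
    frames = rng.standard_normal((200, dim)) + i      # stand-in per-state data
    gmms.append(GaussianMixture(n_components=2).fit(frames))

# For each test frame, b[i] is the per-state log-likelihood log p(obs_t | state_t = i).
obs = rng.standard_normal((10, dim))
log_b = np.stack([g.score_samples(obs) for g in gmms])  # shape (n_states, T)
print(log_b.shape)
```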

Page 11

A Language-Independent Phone Set (Consonants)

Plus secondary articulations (glottis, pharynx, palate, lips), sequences, and syllabics

Page 12

A Language-Independent Phone Set (Vowels)

Nasalized: ^~, >~, @~, &~, a~, A~, …
Long: >:, a:, A:, e:, E:, i:, I:, o:, u:, 3r:
Retroflexed: 3r, 4r, &r
Diphthongs: >i, aU, Au, ei, eI, Ei, eu, Eu, ia, ie, io, iu, oi, oU, ua, uax, ui, uo, aI, i>, iE, ue, uE (>u, AI, axI~, axU~)

Page 13

Training Data

10 languages, 11 corpora: Arabic, Croatian, English, Japanese, Mandarin, Portuguese, Russian, Spanish, Turkish, Urdu

95 hours of speech, sampled from a larger set of corpora

Mixed styles of speech: broadcast, read, and spontaneous

Page 14

Summary of Corpora

Page 15

Dictionaries (Hasegawa-Johnson and Fleck, http://www.isle.uiuc.edu/dict/)

Orthographic Transcriptions (Urdu: no vowels!)

Diacriticized version available on web? (No / Yes; examples: صاحب, صاعق)

Ruleset #1: q = ق, k = ک, g = گ, …
Ruleset #2: maps diacritics and ligatures to vowels (e.g., A, u), …

Phonetic Transcriptions: /sAh{SV}b{SV}/, /sA!iqƏ/

Page 16

Context-Dependent Phones

Triphones: when is a /t/ not a /t/?
“writer” /t/ is unusual; call it /aI-t+3r/
“a tree” /t/ is unusual; call it /&-t+r/
“that soup” /t/ is unusual; call it /ae-t+s/

Lexical stress: /i/ in “reek” is longer than in “recover”; call them /r-i+k'/ vs. /r-i+k/

Punctuation, an easy-to-transcribe proxy for prosody: /n/ in “I'm done.” is twice as long as /n/ in “Done yet?”; call them /^-n+{PERIOD}/ vs. /^-n+j/

Language, dialect, style: /o/ in “atone”, call it /t-o+n%eng/; /o/ in あとに (“ato ni”), call it /t-o+n%jap/

Gender: handled differently (speaker adaptation)

^’-A+b%eng, ^’-A+b’%eng, >-A+d%cmn, …
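The labels above follow a “left-center+right%language” naming scheme, with optional stress, punctuation, and language tags. The helper below is a hypothetical illustration of how such labels might be generated from a phone string; the exact placement of the stress mark (as in /r-i+k'/) is an assumption, not the deck's specification.

```python
# A hypothetical builder for context-dependent phone labels of the form
# "left-center+right%language", with an optional trailing stress mark.
def triphone_label(left, center, right, language, stressed=False):
    """Return a label like aI-t+3r%eng, or r-i+k'%eng when stressed."""
    stress = "'" if stressed else ""
    return f"{left}-{center}+{right}{stress}%{language}"

phones = ["r", "aI", "t", "3r"]   # "writer" in the slide's notation
labels = [
    triphone_label(phones[i - 1], phones[i], phones[i + 1], "eng")
    for i in range(1, len(phones) - 1)
]
print(labels)   # ['r-aI+t%eng', 'aI-t+3r%eng']
print(triphone_label("r", "i", "k", "eng", stressed=True))   # r-i+k'%eng
```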

Page 17

Decision Tree State Tying

Categories for decision tree questions:

Distinctive phone features (manner/place of articulation) of right or left context

Language identity
Dialect identity (L1 vs. L2)
Lexical stress
Punctuation mark

^’-A+b%engL2, ^’-A+b’%engL2, >-A+d%cmn, …

Each leaf node contains at least 3.5 seconds of training data

Page 18

Phone Recognition Experiment (Huang et al., in preparation)

Language-independent triphone bigram language model

Standard classifier transforms (PLP with delta and delta-delta features, continuous-density HMM, 11–17 Gaussians)

Vocabulary size: top 60K most frequent triphones (since 140K is too many!)
The remaining infrequent triphones are mapped back to their center monophones
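A small sketch of this back-off, assuming the label scheme shown earlier: keep the top-K most frequent triphones and map everything else to its center monophone. The helper names are hypothetical; a real system would apply this at the dictionary/HMM-list level.

```python
# Keep only the most frequent triphone labels as vocabulary entries and map the
# rest to their center monophone. Parsing assumes "left-center+right%lang" labels.
from collections import Counter

def center_phone(triphone):
    """'aI-t+3r%eng' -> 't'."""
    core = triphone.split("%")[0]
    return core.split("-")[1].split("+")[0]

def back_off(sequence, counts, top_k=60000):
    keep = {t for t, _ in counts.most_common(top_k)}
    return [t if t in keep else center_phone(t) for t in sequence]

counts = Counter({"aI-t+3r%eng": 900, "&-t+r%eng": 5})
print(back_off(["aI-t+3r%eng", "&-t+r%eng"], counts, top_k=1))
# ['aI-t+3r%eng', 't']
```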

Page 19

Recognition Results (Huang et al., in preparation)

Test set: 50 sentences per corpus

Page 20

Example: Language-Independent Speech Information Retrieval (Zhuang et al., in preparation)

Voice activity detection
Perceptual frequency warping
Gaussian mixtures

Likelihood vector: b_i = p(observation_t | state_t = i)

Retrieval ranking = E(count(query) | segment observations)

Inference algorithm: Finite State Transducer built from ASR lattices, computing E(count(query) | observations)

Page 21

Information Retrieval: Standard Methods

Task description: given a query, find the “most relevant” segments in a database

Published algorithms:
EXACT MATCH: segment = argmin d(query, segment). Fast
SUMMARY STATISTICS: segment = argmax p(query|segment), with no concept of “word order.” Good for text, e.g., Google, Yahoo, etc.
TRANSFORM AND INFER: segment = argmax p(query|segment) ≈ E(count(query)|segment); word order matters. Flexible, but slow…
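The expected-count ranking can be approximated without a full lattice FST. The sketch below, with made-up hypotheses and posteriors, computes E(count(query) | observations) from a posterior-weighted n-best list; it illustrates the idea rather than the actual lattice-based implementation.

```python
# Approximate E(count(query) | segment observations) from a posterior-weighted
# n-best list: count query occurrences in each hypothesis, weight by its posterior.
def expected_count(nbest, query):
    """nbest: list of (phone_sequence, posterior); query: tuple of phones."""
    q = len(query)
    total = 0.0
    for seq, post in nbest:
        hits = sum(1 for i in range(len(seq) - q + 1) if tuple(seq[i:i + q]) == tuple(query))
        total += post * hits
    return total

nbest = [
    (["b", "A", "r", "A", "k"], 0.6),   # hypothesis 1, posterior 0.6
    (["b", "a", "r", "A", "k"], 0.3),   # hypothesis 2
    (["p", "A", "r", "k"],      0.1),   # hypothesis 3
]
print(expected_count(nbest, ("r", "A", "k")))   # 0.6*1 + 0.3*1 + 0.1*0 = 0.9
```

Segments are then ranked by this expected count, so a query that appears in several competing hypotheses still contributes, which is why word order matters in this formulation.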

Page 22

Language-Independent IR: The Star Challenge

A Multi-Language Multi-Media Broadcast News Retrieval Competition, sponsored by A*STAR

Elimination rounds, June–August 2008
Three rounds, each of 48 hours duration
56 teams entered from around the world
5 teams selected for the Grand Finals

Grand Finals: 10/23/2008, Singapore

Page 23

Star Challenge Tasks

VT1, VT2: Given an image category (e.g., “crowd,” “sports,” “keyboard”), find examples
AT1: Given an IPA phoneme sequence (example: /ɻogutʃA/), find audio segments
AT2: Given a waveform containing a word or word sequence in any language, find audio segments containing the same word
AT1+VT2: Find a specified video class whose speech contains a given IPA sequence (e.g., “man monologue” + /groʊɵ/)

Page 24

Star Challenge: Simplified Results

Team (method)                                Round 1 Ranking   Round 3 Ranking   Grand Final Ranking
                                             (Verified)        (Hearsay)         (Verified)
National U. Singapore (EXACT MATCH)          ?                 5                 1
NII and IRISA (EXACT MATCH)                  ?                 2                 2
University of Illinois (TRANSFORM & INFER)   4                 2                 3
Beijing University (TRANSFORM & INFER)       ?                 1                 4

Rounds 1 and 3: 48,000 CPU hours
Round 1: English, 20 queries
Round 3: English and Mandarin, 3 queries each

Grand Final: 6 CPU hours
English, Mandarin, Malay, and Tamil, 2 queries each

Page 25

Open Research Areas

When does “Transform and Infer” help?

ROUND 3 (1000 CPUs, 48 hours): the best algorithms were “transform and infer”

GRAND FINAL (3 CPUs, 2 hours): the best algorithms were “exact match”

Open research area #1: complexity
“Inference algorithm”: user constraints → simplified classifier
Improved transforms and improved classifiers allow the use of a less-constrained user interface

Open research area #2: accuracy

Page 26

Existence Proof: ASR Can Beat Human Listeners (Sharma et al., in preparation)

Talker ID (M/F = gender)                        M09   M05   M06   F02   M07   F03   M04
Human listener accuracy
  (unfamiliar listeners, unlimited vocabulary)   86    58    39    29    28     6     2
ASR: Digits                                      85    90    93    94   100    74    46
ASR: Letters                                     97    77    77    70    86    42    19
ASR: 55 Words                                    90    63    72    73    81    40    14
ASR: 155 Words                                   47    50    36    43    44    22     6

The task: speech of talkers with gross motor disability (cerebral palsy)
Familiar listeners in familiar situations understand most of what they say…
ASR can also be talker-dependent and vocabulary-constrained

Page 27

Open Research Areas

Remove the constraints!
ASR can beat a human listener if the ASR knows more than the human (e.g., knows the talker and the vocabulary)

Better knowledge = better signal models, better classifiers, better inference

Page 28

Thank You!

Questions?

Page 29

Decision Tree State Tying (Odell, Woodland and Young, 1994)

1. Divide each IPA phone into three temporally sequential “states”: /i/ → /i/-onset, /i/-center, /i/-offset

2. Start with one model for each state. Create a statistical model p(acoustics|state) using training data

3. Ask yes-no questions about context variables: left phone, right phone, lexical stress, language ID

4. If p(acoustics | state, yes) ≠ p(acoustics | state, no), split the training data into two groups: the “yes” examples vs. the “no” examples. If many such questions exist, choose the best. Repeat this process as long as each group contains enough training data examples.
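A toy sketch of the splitting step, under a single-diagonal-Gaussian assumption: for each candidate question, compare the log-likelihood of one pooled Gaussian against two split Gaussians, keep the question with the best gain, and stop when a child would fall below a data floor. The questions, contexts, and frames below are synthetic placeholders, not the actual system's question set.

```python
# Greedy decision-tree splitting: pick the yes/no context question whose split
# most increases the (single-Gaussian) log-likelihood of the state's frames.
import numpy as np

def gauss_loglik(X):
    """Log-likelihood of frames X under a single diagonal Gaussian fit to X."""
    var = X.var(axis=0) + 1e-6
    n, d = X.shape
    return -0.5 * n * (d * np.log(2 * np.pi) + np.sum(np.log(var)) + d)

def best_split(frames, contexts, questions, min_frames=50):
    """frames: (N, dim); contexts: N context dicts; questions: name -> predicate."""
    base = gauss_loglik(frames)
    best = None
    for name, q in questions.items():
        yes = np.array([q(c) for c in contexts])
        if yes.sum() < min_frames or (~yes).sum() < min_frames:
            continue                                   # data floor: don't split
        gain = gauss_loglik(frames[yes]) + gauss_loglik(frames[~yes]) - base
        if best is None or gain > best[1]:
            best = (name, gain, yes)
    return best   # None means: stop splitting this node

rng = np.random.default_rng(0)
frames = rng.standard_normal((400, 13))
frames[:200] += 2.0                                    # e.g. English contexts differ
contexts = [{"lang": "eng" if i < 200 else "cmn", "right_nasal": i % 2 == 0}
            for i in range(400)]
questions = {
    "Is the language English?": lambda c: c["lang"] == "eng",
    "Is the right context a nasal?": lambda c: c["right_nasal"],
}
name, gain, split = best_split(frames, contexts, questions)
print(name, round(float(gain), 1))
```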