
Page 1

Acoustic Modeling for Multi-Language, Multi-Style, Multi-Channel Automatic Speech Recognition

Mark Hasegawa-Johnson

Yuxiao Hu, Dennis Lin, Xiaodan Zhuang, Jui-Ting Huang, Xi Zhou, Zhen Li, and Thomas Huang

including also the research results of

Laehoon Kim and Harsh Sharma

University of Illinois

Page 2

Motivation: Applications in a Multilingual Society

News Hound: Find all TV news segments, in any language, mentioning “Barack Obama”

Language Learner: Transcribe the learner's accented speech; tell the learner which words sound accented

Broadcaster/Podcaster: Automatically transcribe “man on the street” interviews in a multilingual city (LA, Singapore)

Problems
Physical variability: noise, echo, talker
Imprecise categories: dependent on context
Content variability: language, topic, dialect, style

Page 3

Method: Transform and Infer (a ubiquitous methodology in ASR; see, e.g., Jelinek, 1976)

Signal transforms

Classifier transforms

Likelihood vector: b_i = p(observation_t | state_t = i)

Best label sequence = argmax p(label_1, ..., label_T | observation_1, ..., observation_T)

Inference algorithm: a parametric model of p(state_1, ..., state_T, label_1, ..., label_T)

Page 4

Signal Transforms
Transforms determined by a physical model of the signal

A good signal model tells you a lot:
Reverberation model (sketched below): y[n] = v[n] + Σ_m h[m] x[n−m]

x[n] produced by a human vocal tract, designed for efficient processing by a human auditory system

A good signal transform improves the accuracy of all classifiers:
Denoising: correct for additive noise
Dereverberation: correct for convolutional noise
Perceptual frequency warping: hear what humans hear
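Below is a minimal numerical sketch of the reverberation model above, assuming NumPy; the impulse response and noise are synthetic stand-ins, not measured data. It only illustrates how a clean signal x[n] becomes the observed y[n] that denoising and dereverberation then try to undo.

```python
# A minimal sketch of the signal model y[n] = v[n] + sum_m h[m] x[n-m]:
# clean speech x convolved with a room impulse response h, plus additive noise v.
import numpy as np

rng = np.random.default_rng(0)

fs = 16000                                  # sample rate (Hz)
x = rng.standard_normal(fs)                 # stand-in for 1 s of clean speech x[n]
h = np.zeros(2048)                          # synthetic room impulse response h[m]
h[0] = 1.0
h[400] = 0.6                                # a strong early reflection
h[1:] += 0.01 * rng.standard_normal(2047) * np.exp(-np.arange(1, 2048) / 500.0)
v = 0.05 * rng.standard_normal(fs + 2047)   # additive noise v[n]

y = np.convolve(x, h) + v                   # observed signal y[n]

# Denoising and dereverberation try to invert this model: estimate x[n] from y[n]
# so that downstream classifiers see something closer to the clean signal.
print(y.shape)
```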

Page 5

Denoising Example (Kim et al., 2006)

Page 6

Classifier Transforms
Compute a precise and accurate estimate of p(obs_t | state_t)

Robust Machine Learning
From a limited amount of training data, learn parameterized probability models that are as precise as possible, with a known upper bound on generalization error

Methods that trade off precision and generalization:
Decorrelate the signal measurements: PCA, DCT
Select the most informative features from an inventory: AdaBoost

Train a linear or nonlinear function z_t = f(y_t) that:
Discriminates among the training examples from different classes
Has known upper bounds on generalization error (SVM, ANN)

Train another nonlinear function p(z_t | state_t) with the same properties
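A small sketch of this recipe, assuming scikit-learn and synthetic data: decorrelate with PCA, train a bounded-capacity discriminant (a linear SVM), and calibrate its scores into per-state probabilities. Note that this yields posteriors p(state | z_t); a hybrid system would divide by the state priors to obtain the likelihoods p(z_t | state_t) an HMM needs.

```python
# A minimal sketch of the classifier-transform idea using scikit-learn:
# decorrelate features with PCA, train a discriminant function f(y_t) with a
# known-capacity model (a linear SVM), then calibrate its scores into per-state
# probability estimates. Data below is synthetic, for illustration only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_states = 3
Y = rng.standard_normal((600, 39))            # stand-in acoustic features y_t
states = rng.integers(0, n_states, size=600)  # stand-in state labels

model = make_pipeline(
    PCA(n_components=13),                      # decorrelate / reduce dimensionality
    CalibratedClassifierCV(LinearSVC()),       # discriminant + probability estimate
)
model.fit(Y, states)

# p(state | z_t): one probability per state for each frame
posteriors = model.predict_proba(Y[:5])
print(posteriors.shape)   # (5, n_states)
```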

Page 7

Classifier Transforms
Compute a precise and accurate estimate of p(obs_t | state_t)

Page 8

Inference
Integrate information to choose the best global label set

Labels = variables that matter globally
Speech Recognition: what words were spoken?
Information Retrieval: which segment best matches the query?
Language Learning: where is the error?

States = variables that can be classified locally

May be scalar, e.g., q_t = sub-phoneme

May be vector, e.g., q_t = [vector of articulatory states]

Inference algorithm = parametric model of p(states, labels)
Scalar states: Hidden Markov Model, Finite State Transducer
Vector states: Dynamic Bayesian Network, Conditional Random Field
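For the scalar-state case, the inference step is classical Viterbi decoding. The sketch below, with toy numbers, shows the argmax over state sequences given the frame likelihoods b_i and a transition model; it is illustrative only, not the deck's actual decoder.

```python
# A minimal Viterbi sketch for the scalar-state case (HMM): given frame
# log-likelihoods log_b[i, t] = log p(observation_t | state_t = i), a transition
# matrix A, and initial probabilities pi, find the most likely state sequence.
import numpy as np

def viterbi(log_b, log_A, log_pi):
    """log_b: (n_states, T) frame log-likelihoods; returns the best state path."""
    n_states, T = log_b.shape
    delta = np.full((n_states, T), -np.inf)   # best partial-path scores
    back = np.zeros((n_states, T), dtype=int) # backpointers
    delta[:, 0] = log_pi + log_b[:, 0]
    for t in range(1, T):
        scores = delta[:, t - 1][:, None] + log_A        # scores[from, to]
        back[:, t] = np.argmax(scores, axis=0)
        delta[:, t] = scores[back[:, t], np.arange(n_states)] + log_b[:, t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta[:, -1]))
    for t in range(T - 2, -1, -1):
        path[t] = back[path[t + 1], t + 1]
    return path

rng = np.random.default_rng(0)
log_b = np.log(rng.random((3, 10)))                      # synthetic likelihoods
log_A = np.log(np.full((3, 3), 1 / 3))                   # uniform transitions
log_pi = np.log(np.full(3, 1 / 3))
print(viterbi(log_b, log_A, log_pi))
```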

Page 9

Inference
Integrate information to choose the best global label set

Page 10

Example: Language-Independent Phone Recognition (Huang et al., in preparation)

Voice activity detection
Perceptual frequency warping
Gaussian mixtures

Likelihood vector: b_i = p(observation_t | state_t = i)

Best label sequence = argmax p(phone_1, ..., phone_T | observation_1, ..., observation_T)

Inference algorithm: Hidden Markov Model with token passing, modeling p(state_1, ..., state_T, phone_1, ..., phone_T)
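A minimal sketch of the “Gaussian mixtures” box in the pipeline above, assuming scikit-learn: fit one small GMM per (tied) state and evaluate each test frame under every state's model to obtain the likelihood vector b_i. A real system would train on PLP/MFCC frames with many more mixture components.

```python
# One GMM per state, producing the likelihood vector
# b_i = p(observation_t | state_t = i) that the HMM decoder consumes.
# Training data here is synthetic, for illustration only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_states, dim = 4, 39

# Fit one small GMM per state on that state's frames (synthetic here).
gmms = []
for i in range(n_states):
    frames = rng.standard_normal((200, dim)) + i      # stand-in per-state data
    gmms.append(GaussianMixture(n_components=2).fit(frames))

# For each test frame, b[i] is the per-state log-likelihood log p(obs_t | state_t = i).
obs = rng.standard_normal((10, dim))
log_b = np.stack([g.score_samples(obs) for g in gmms])  # shape (n_states, T)
print(log_b.shape)
```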

Page 11

A Language-Independent Phone Set (Consonants)

Plus secondary articulations (glottis, pharynx, palate, lips), sequences, and syllabics

Page 12

A Language-Independent Phone Set (Vowels)

Nasalized: ^~, >~, @~, &~, a~, A~, …
Long: >:, a:, A:, e:, E:, i:, I:, o:, u:, 3r:
Retroflexed: 3r, 4r, &r
Diphthongs: >i, aU, Au, ei, eI, Ei, eu, Eu, ia, ie, io, iu, oi, oU, ua, uax, ui, uo, aI, i>, iE, ue, uE (>u, AI, axI~, axU~)

Page 13

Training Data

10 languages, 11 corpora: Arabic, Croatian, English, Japanese, Mandarin, Portuguese, Russian, Spanish, Turkish, Urdu

95 hours of speech, sampled from a larger set of corpora

Mixed styles of speech: broadcast, read, and spontaneous

Page 14

Summary of Corpora

Page 15

Dictionaries (Hasegawa-Johnson and Fleck, http://www.isle.uiuc.edu/dict/)

Orthographic Transcriptions (Urdu: no vowels!)

Diacriticized version available on web? (No / Yes; examples: صاحب, صاعق)

Ruleset #1: q = ق, k = ک, g = گ, …
Ruleset #2: maps diacritics and ligatures to vowels (e.g., A, u), …

Phonetic Transcriptions: /sAh{SV}b{SV}/, /sA!iqƏ/

Page 16

Context-Dependent Phones

Triphones: when is a /t/ not a /t/?
“writer” /t/ is unusual; call it /aI-t+3r/
“a tree” /t/ is unusual; call it /&-t+r/
“that soup” /t/ is unusual; call it /ae-t+s/

Lexical stress: /i/ in “reek” is longer than in “recover”; call them /r-i+k'/ vs. /r-i+k/

Punctuation, an easy-to-transcribe proxy for prosody: /n/ in “I'm done.” is twice as long as /n/ in “Done yet?”; call them /^-n+{PERIOD}/ vs. /^-n+j/

Language, dialect, style: /o/ in “atone”, call it /t-o+n%eng/; /o/ in あとに (“ato ni”), call it /t-o+n%jap/

Gender: handled differently (speaker adaptation)

^’-A+b%eng, ^’-A+b’%eng, >-A+d%cmn, …
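The labels above follow a “left-center+right%language” naming scheme, with optional stress, punctuation, and language tags. The helper below is a hypothetical illustration of how such labels might be generated from a phone string; the exact placement of the stress mark (as in /r-i+k'/) is an assumption, not the deck's specification.

```python
# A hypothetical builder for context-dependent phone labels of the form
# "left-center+right%language", with an optional trailing stress mark.
def triphone_label(left, center, right, language, stressed=False):
    """Return a label like aI-t+3r%eng, or r-i+k'%eng when stressed."""
    stress = "'" if stressed else ""
    return f"{left}-{center}+{right}{stress}%{language}"

phones = ["r", "aI", "t", "3r"]   # "writer" in the slide's notation
labels = [
    triphone_label(phones[i - 1], phones[i], phones[i + 1], "eng")
    for i in range(1, len(phones) - 1)
]
print(labels)   # ['r-aI+t%eng', 'aI-t+3r%eng']
print(triphone_label("r", "i", "k", "eng", stressed=True))   # r-i+k'%eng
```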

Page 17

Decision Tree State Tying

Categories for decision tree questions:

Distinctive phone features (manner/place of articulation) of right or left context

Language identity
Dialect identity (L1 vs. L2)
Lexical stress
Punctuation mark

^’-A+b%engL2, ^’-A+b’%engL2, >-A+d%cmn, …

Each leaf node contains at least 3.5 seconds of training data

Page 18

Phone Recognition Experiment (Huang et al., in preparation)

Language-independent triphone bigram language model

Standard classifier transforms (PLP with delta and delta-delta features, continuous-density HMM, 11–17 Gaussians)

Vocabulary size: top 60K most frequent triphones (since 140K is too many!)
The remaining infrequent triphones are mapped back to their center monophones
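A small sketch of this back-off, assuming the label scheme shown earlier: keep the top-K most frequent triphones and map everything else to its center monophone. The helper names are hypothetical; a real system would apply this at the dictionary/HMM-list level.

```python
# Keep only the most frequent triphone labels as vocabulary entries and map the
# rest to their center monophone. Parsing assumes "left-center+right%lang" labels.
from collections import Counter

def center_phone(triphone):
    """'aI-t+3r%eng' -> 't'."""
    core = triphone.split("%")[0]
    return core.split("-")[1].split("+")[0]

def back_off(sequence, counts, top_k=60000):
    keep = {t for t, _ in counts.most_common(top_k)}
    return [t if t in keep else center_phone(t) for t in sequence]

counts = Counter({"aI-t+3r%eng": 900, "&-t+r%eng": 5})
print(back_off(["aI-t+3r%eng", "&-t+r%eng"], counts, top_k=1))
# ['aI-t+3r%eng', 't']
```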

Page 19

Recognition Results (Huang et al., in preparation)

Test set: 50 sentences per corpus

Page 20

Example: Language-Independent Speech Information Retrieval (Zhuang et al., in preparation)

Voice activity detection
Perceptual frequency warping
Gaussian mixtures

Likelihood vector: b_i = p(observation_t | state_t = i)

Retrieval ranking = E(count(query) | segment observations)

Inference algorithm: Finite State Transducer built from ASR lattices, computing E(count(query) | observations)

Page 21

Information Retrieval: Standard Methods

Task description: given a query, find the “most relevant” segments in a database

Published algorithms:
EXACT MATCH: segment = argmin d(query, segment). Fast
SUMMARY STATISTICS: segment = argmax p(query|segment), with no concept of “word order.” Good for text, e.g., Google, Yahoo, etc.
TRANSFORM AND INFER: segment = argmax p(query|segment) ≈ E(count(query)|segment); word order matters. Flexible, but slow…
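The expected-count ranking can be approximated without a full lattice FST. The sketch below, with made-up hypotheses and posteriors, computes E(count(query) | observations) from a posterior-weighted n-best list; it illustrates the idea rather than the actual lattice-based implementation.

```python
# Approximate E(count(query) | segment observations) from a posterior-weighted
# n-best list: count query occurrences in each hypothesis, weight by its posterior.
def expected_count(nbest, query):
    """nbest: list of (phone_sequence, posterior); query: tuple of phones."""
    q = len(query)
    total = 0.0
    for seq, post in nbest:
        hits = sum(1 for i in range(len(seq) - q + 1) if tuple(seq[i:i + q]) == tuple(query))
        total += post * hits
    return total

nbest = [
    (["b", "A", "r", "A", "k"], 0.6),   # hypothesis 1, posterior 0.6
    (["b", "a", "r", "A", "k"], 0.3),   # hypothesis 2
    (["p", "A", "r", "k"],      0.1),   # hypothesis 3
]
print(expected_count(nbest, ("r", "A", "k")))   # 0.6*1 + 0.3*1 + 0.1*0 = 0.9
```

Segments are then ranked by this expected count, so a query that appears in several competing hypotheses still contributes, which is why word order matters in this formulation.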

Page 22

Language-Independent IR: The Star Challenge

A Multi-Language Multi-Media Broadcast News Retrieval Competition, sponsored by A*STAR

Elimination rounds, June–August 2008
Three rounds, each of 48 hours duration
56 teams entered from around the world
5 teams selected for the Grand Finals

Grand Finals: 10/23/2008, Singapore

Page 23

Star Challenge Tasks

VT1, VT2: Given an image category (e.g., “crowd,” “sports,” “keyboard”), find examples
AT1: Given an IPA phoneme sequence (example: /ɻogutʃA/), find audio segments
AT2: Given a waveform containing a word or word sequence in any language, find audio segments containing the same word
AT1+VT2: Find a specified video class whose speech contains a given IPA sequence (e.g., “man monologue” + /groʊɵ/)

Page 24

Star Challenge: Simplified Results

Team (method)                                Round 1 Ranking   Round 3 Ranking   Grand Final Ranking
                                             (Verified)        (Hearsay)         (Verified)
National U. Singapore (EXACT MATCH)          ?                 5                 1
NII and IRISA (EXACT MATCH)                  ?                 2                 2
University of Illinois (TRANSFORM & INFER)   4                 2                 3
Beijing University (TRANSFORM & INFER)       ?                 1                 4

Rounds 1 and 3: 48,000 CPU hours
Round 1: English, 20 queries
Round 3: English and Mandarin, 3 queries each

Grand Final: 6 CPU hours
English, Mandarin, Malay, and Tamil, 2 queries each

Page 25

Open Research Areas

When does “Transform and Infer” help?

ROUND 3 (1000 CPUs, 48 hours): the best algorithms were “transform and infer”

GRAND FINAL (3 CPUs, 2 hours): the best algorithms were “exact match”

Open research area #1: complexity
“Inference algorithm”: user constraints → simplified classifier
Improved transforms and improved classifiers allow the use of a less-constrained user interface

Open research area #2: accuracy

Page 26

Existence Proof: ASR Can Beat Human Listeners (Sharma et al., in preparation)

Talker ID (M/F = gender)                        M09   M05   M06   F02   M07   F03   M04
Human listener accuracy
  (unfamiliar listeners, unlimited vocabulary)   86    58    39    29    28     6     2
ASR: Digits                                      85    90    93    94   100    74    46
ASR: Letters                                     97    77    77    70    86    42    19
ASR: 55 Words                                    90    63    72    73    81    40    14
ASR: 155 Words                                   47    50    36    43    44    22     6

The task: speech of talkers with gross motor disability (cerebral palsy)
Familiar listeners in familiar situations understand most of what they say…
ASR can also be talker-dependent and vocabulary-constrained

Page 27

Open Research Areas

Remove the constraints!
ASR can beat a human listener if the ASR knows more than the human (e.g., knows the talker and the vocabulary)

Better knowledge = better signal models, better classifiers, better inference

Page 28

Thank You!

Questions?

Page 29

Decision Tree State Tying (Odell, Woodland and Young, 1994)

1. Divide each IPA phone into three temporally sequential “states”: /i/ → /i/-onset, /i/-center, /i/-offset

2. Start with one model for each state. Create a statistical model p(acoustics|state) using training data

3. Ask yes-no questions about context variables: left phone, right phone, lexical stress, language ID

4. If p(acoustics | state, yes) ≠ p(acoustics | state, no), split the training data into two groups: the “yes” examples vs. the “no” examples. If many such questions exist, choose the best. Repeat this process as long as each group contains enough training data examples.
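A toy sketch of the splitting step, under a single-diagonal-Gaussian assumption: for each candidate question, compare the log-likelihood of one pooled Gaussian against two split Gaussians, keep the question with the best gain, and stop when a child would fall below a data floor. The questions, contexts, and frames below are synthetic placeholders, not the actual system's question set.

```python
# Greedy decision-tree splitting: pick the yes/no context question whose split
# most increases the (single-Gaussian) log-likelihood of the state's frames.
import numpy as np

def gauss_loglik(X):
    """Log-likelihood of frames X under a single diagonal Gaussian fit to X."""
    var = X.var(axis=0) + 1e-6
    n, d = X.shape
    return -0.5 * n * (d * np.log(2 * np.pi) + np.sum(np.log(var)) + d)

def best_split(frames, contexts, questions, min_frames=50):
    """frames: (N, dim); contexts: N context dicts; questions: name -> predicate."""
    base = gauss_loglik(frames)
    best = None
    for name, q in questions.items():
        yes = np.array([q(c) for c in contexts])
        if yes.sum() < min_frames or (~yes).sum() < min_frames:
            continue                                   # data floor: don't split
        gain = gauss_loglik(frames[yes]) + gauss_loglik(frames[~yes]) - base
        if best is None or gain > best[1]:
            best = (name, gain, yes)
    return best   # None means: stop splitting this node

rng = np.random.default_rng(0)
frames = rng.standard_normal((400, 13))
frames[:200] += 2.0                                    # e.g. English contexts differ
contexts = [{"lang": "eng" if i < 200 else "cmn", "right_nasal": i % 2 == 0}
            for i in range(400)]
questions = {
    "Is the language English?": lambda c: c["lang"] == "eng",
    "Is the right context a nasal?": lambda c: c["right_nasal"],
}
name, gain, split = best_split(frames, contexts, questions)
print(name, round(float(gain), 1))
```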