
EE E6820: Speech & Audio Processing & Recognition

Lecture 9: Speech Recognition

Dan Ellis <dpwe@ee.columbia.edu>
Michael Mandel <mim@ee.columbia.edu>

Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820

April 7, 2009

Outline

1 Recognizing speech

2 Feature calculation

3 Sequence recognition

4 Large vocabulary, continuous speech recognition (LVCSR)


Recognizing speech

[Spectrogram, 0-3 s, 0-4 kHz: "So, I thought about that and I think it's still possible"]

What kind of information might we want from the speech signal?
- words
- phrasing, 'speech acts' (prosody)
- mood / emotion
- speaker identity

What kind of processing do we need to get at that information?
- time scale of feature extraction
- signal aspects to capture in features
- signal aspects to exclude from features

Speech recognition as Transcription

Transcription = "speech to text"
- find a word string to match the utterance

Gives a neat objective measure: word error rate (WER) %
- can be a sensitive measure of performance

Three kinds of errors (deletion, substitution, insertion):

    Reference:   THE  CAT  SAT  ON  THE  MAT
    Recognized:       CAT  SAT  AN  THE  MAT  A
                [del]          [sub]          [ins]

    WER = (S + D + I) / N
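As a concrete illustration, here is a minimal Python sketch (not from the slides) that computes WER via word-level edit distance; on the example above it finds the one deletion, one substitution, and one insertion.

    # Minimal word error rate (WER) sketch: count substitutions (S),
    # deletions (D), and insertions (I) via word-level edit distance.
    def wer(reference, recognized):
        ref, hyp = reference.split(), recognized.split()
        # cost[i][j] = edit distance between ref[:i] and hyp[:j]
        cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(1, len(ref) + 1):
            cost[i][0] = i                      # all deletions
        for j in range(1, len(hyp) + 1):
            cost[0][j] = j                      # all insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                cost[i][j] = min(sub,                # substitution / match
                                 cost[i - 1][j] + 1,  # deletion
                                 cost[i][j - 1] + 1)  # insertion
        return cost[-1][-1] / len(ref)          # (S + D + I) / N

    # The slide's example: one deletion, one substitution, one insertion
    print(wer("THE CAT SAT ON THE MAT", "CAT SAT AN THE MAT A"))  # 3/6 = 0.5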


Problems: Within-speaker variability

Timing variation
- word duration varies enormously

[Spectrogram, 0-4 kHz, with aligned word and phone labels: "SO I THOUGHT ABOUT THAT AND I THINK IT'S STILL POSSIBLE"]

- fast speech 'reduces' vowels

Speaking style variation
- careful/casual articulation
- soft/loud speech

Contextual effects
- speech sounds vary with context, role: "How do you do?"

Problems: Between-speaker variability

Accent variation
- regional / mother tongue

Voice quality variation
- gender, age, huskiness, nasality

Individual characteristics
- mannerisms, speed, prosody

[Spectrograms, 0-8 kHz, of the same utterance from two speakers: mbma0 and fjdm2]

Problems: Environment variability

Background noise
- fans, cars, doors, papers

Reverberation
- 'boxiness' in recordings

Microphone/channel
- huge effect on relative spectral gain

[Spectrograms, 0-4 kHz, of the same speech from a close mic and a tabletop mic]

How to recognize speech?

Cross-correlate templates?
- waveform?
- spectrogram?
- time-warp problems

Match short segments & handle time-warp later
- model with slices of ~10 ms
- pseudo-stationary model of words:

[Spectrogram, 0-4 kHz, with pseudo-stationary segment labels: sil g w eh n sil]

- other sources of variation...

Probabilistic formulation

Probability that a segment label is correct
- gives the standard form of speech recognizers

Feature calculation: s[n] → X_m (frame index m = n/H for hop size H)
- transforms the signal into an easily-classified domain

Acoustic classifier: p(q_i | X)
- calculates probabilities of each mutually-exclusive state q_i

'Finite state acceptor' (i.e. HMM):

    Q* = argmax_{q_0, q_1, ..., q_L} p(q_0, q_1, ..., q_L | X_0, X_1, ..., X_L)

- MAP match of an allowable sequence to the probabilities

[Trellis: feature frames X against states (q_0 = "ay", q_1, ...) over time frames 0, 1, 2, ...]

Standard speech recognizer structure

Fundamental equation of speech recognition:

    Q* = argmax_Q p(Q | X, Θ)
       = argmax_Q p(X | Q, Θ) p(Q | Θ)

- X = acoustic features
- p(X | Q, Θ) = acoustic model
- p(Q | Θ) = language model
- argmax_Q = search over sequences

Questions:
- what are the best features?
- how do we model them?
- how do we find/match the state sequence?


Feature Calculation

Goal: find a representational space most suitable for classification
- waveform: voluminous, redundant, variable
- spectrogram: better, still quite variable
- ...?

Pattern recognition: the representation is an upper bound on performance
- maybe we should use the waveform...
- or, maybe the representation can do all the work

Feature calculation is intimately bound to the classifier
- pragmatic strengths and weaknesses

Features develop by slow evolution
- current choices more historical than principled

Features (1): Spectrogram

Plain STFT as features, e.g.

    X_m[k] = S[mH, k] = Σ_n s[n + mH] w[n] e^{-j2πkn/N}
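A minimal numpy sketch of this framing; the frame length N, hop H, and Hann window are assumed free choices, not values from the slide.

    import numpy as np

    # Minimal STFT feature sketch following the slide's definition:
    # X_m[k] = sum_n s[n + mH] w[n] exp(-j 2 pi k n / N)
    def stft_frames(s, N=512, H=160):
        w = np.hanning(N)                       # analysis window w[n]
        n_frames = 1 + (len(s) - N) // H
        X = np.empty((n_frames, N // 2 + 1), dtype=complex)
        for m in range(n_frames):
            X[m] = np.fft.rfft(s[m * H : m * H + N] * w)
        return X                                # one feature vector per frame

    # Log-magnitude spectrogram as a (frames x bins) feature matrix
    s = np.random.randn(16000)                  # stand-in for 1 s of 16 kHz audio
    features = np.log(np.abs(stft_frames(s)) + 1e-10)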

Consider examples:

[Spectrograms, 0-8 kHz, of male and female utterances, with one feature-vector slice highlighted]

Similarities between corresponding segments
- but still large differences

Features (2): Cepstrum

Idea: decorrelate and summarize the spectral slices:

    X_m[ℓ] = IDFT{ log |S[mH, k]| }

- good for Gaussian models
- greatly reduces feature dimension (see the sketch below)
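A sketch of the cepstral computation above for one frame; the coefficient count is an assumption.

    import numpy as np

    # Cepstral features per the slide: X_m[l] = IDFT{ log |S[mH, k]| },
    # truncated to the first few quefrency bins to decorrelate and
    # reduce dimension. `mag_spectrum` is one frame's magnitude spectrum.
    def cepstrum(mag_spectrum, n_coeffs=13):
        log_mag = np.log(mag_spectrum + 1e-10)   # log |S[k]|
        c = np.fft.irfft(log_mag)                # IDFT of the log spectrum
        return c[:n_coeffs]                      # keep low-order coefficients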

[Spectrum and cepstrum, male and female examples, 0-8 kHz]

Features (3): Frequency axis warp

A linear frequency axis gives equal 'space' to 0-1 kHz and 3-4 kHz
- but their perceptual importance is very different

Warp the frequency axis closer to a perceptual axis
- mel, Bark, constant-Q ...

    X[c] = Σ_{k=ℓ_c}^{u_c} |S[k]|²
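A sketch of this band integration with edges spaced evenly on the mel scale; the edge placement and the 18-band count (read off the later figure's channel axis) are assumptions.

    import numpy as np

    def mel_band_energies(power_spectrum, sr=16000, n_bands=18):
        # Band edges equally spaced on the mel scale (edge placement is
        # an assumption; the slide only specifies the per-band sum).
        n_bins = len(power_spectrum)
        top_mel = 2595 * np.log10(1 + (sr / 2) / 700)
        mel_edges = np.linspace(0, top_mel, n_bands + 1)
        hz_edges = 700 * (10 ** (mel_edges / 2595) - 1)
        bins = np.round(hz_edges / (sr / 2) * (n_bins - 1)).astype(int)
        # X[c] = sum_{k = l_c}^{u_c} |S[k]|^2
        return np.array([power_spectrum[bins[c]:bins[c + 1] + 1].sum()
                         for c in range(n_bands)])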

[Spectrum vs. auditory spectrum ('audspec'), male and female examples]

Features (4): Spectral smoothing

Generalizing across different speakers is helped by smoothing (i.e. blurring) the spectrum

Truncated cepstrum is one way:
- MMSE approximation to log |S[k]|

LPC modeling is a little different:
- MMSE approximation to |S[k]| → prefers detail at spectral peaks

[Audspec vs. PLP-smoothed features, male example, plus one frame's level / dB across frequency channels 0-18]

Features (5): Normalization along time

Idea: capture feature variations, not absolute level

Hence: calculate the average level and subtract it:

    Y[n, k] = X[n, k] - mean_n{ X[n, k] }

Factors out a fixed channel frequency response:

    x[n] = h_c ∗ s[n]
    X[n, k] = log |X[n, k]| = log |H_c[k]| + log |S[n, k]|
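A one-line sketch of the normalization: in the log domain a fixed channel H_c[k] is an additive constant per dimension k, so subtracting the time average removes it.

    import numpy as np

    # Per-feature mean normalization along time, as on the slide:
    # Y[n, k] = X[n, k] - mean_n{ X[n, k] }.
    def mean_normalize(X):
        return X - X.mean(axis=0, keepdims=True)   # X: (frames x dims), log domain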

[PLP features before and after mean normalization, male and female examples]

Delta features

Want each segment to have 'static' feature values
- but some segments are intrinsically dynamic!
→ so calculate their derivatives (maybe steadier?)

Append dX/dt (+ d²X/dt²) to the feature vectors
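A sketch of regression-based deltas over a +/-2 frame window; the window width is an assumption (a simple first difference also works).

    import numpy as np

    # Delta features: append a local time-derivative estimate per frame,
    # d_n = sum_t t (x_{n+t} - x_{n-t}) / (2 sum_t t^2), t = 1..width.
    def deltas(X, width=2):
        pad = np.pad(X, ((width, width), (0, 0)), mode='edge')
        norm = 2 * sum(t * t for t in range(1, width + 1))
        d = np.zeros_like(X)
        for t in range(1, width + 1):
            d += t * (pad[width + t : width + t + len(X)]
                      - pad[width - t : width - t + len(X)])
        return d / norm

    feats = np.random.randn(100, 13)            # stand-in cepstral features
    full = np.hstack([feats, deltas(feats), deltas(deltas(feats))])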

[Mean/variance-normalized PLP features with deltas and double-deltas, male example]

Relates to onset sensitivity in humans?

Overall feature calculation

MFCCs and/or RASTA-PLP

MFCC pipeline:
  Sound → FFT X[k] → mel-scale freq. warp (spectra → audspec) → log|X[k]|
        → IFFT (cepstra) → truncate → subtract mean (CMN) → MFCC features

RASTA-PLP pipeline:
  Sound → FFT X[k] → Bark-scale freq. warp → log|X[k]|
        → RASTA band-pass (smoothed onsets) → LPC smooth (LPC spectra)
        → cepstral recursion → RASTA-PLP cepstral features

Key attributes:
- spectral, auditory scale
- decorrelation
- smoothed (spectral) detail
- normalization of levels
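For illustration, the earlier sketches chain into a minimal MFCC-like pipeline; frame sizes and coefficient counts remain assumptions, and this reuses stft_frames, mel_band_energies, cepstrum, mean_normalize, and deltas as defined above.

    import numpy as np

    # Minimal MFCC-like chain from the sketches above:
    # STFT -> mel band energies -> log-domain decorrelation (cepstrum)
    # -> truncation -> mean subtraction (CMN) -> appended deltas.
    def mfcc_like(s, sr=16000, n_coeffs=13):
        X = stft_frames(s)                                   # framewise STFT
        P = np.abs(X) ** 2                                   # power spectra
        M = np.array([mel_band_energies(p, sr) for p in P])  # auditory warp
        C = np.array([cepstrum(np.sqrt(m), n_coeffs) for m in M])  # decorrelate + truncate
        C = mean_normalize(C)                                # CMN
        return np.hstack([C, deltas(C), deltas(deltas(C))])  # append dynamics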

Features summary

[Male and female examples at each feature stage: spectrum, audspec, rasta, deltas]

Normalize same phones
Contrast different phones


Sequence recognition: Dynamic Time Warp (DTW)

Framewise comparison with stored templates:

[Distance matrix between test frames and reference templates ONE, TWO, THREE, FOUR, FIVE]

- distance metric?
- comparison across templates?

Dynamic Time Warp (2)

Find the lowest-cost constrained path:
- matrix d(i, j) of distances between input frame f_i and reference frame r_j
- allowable predecessors and transition costs T_xy

    D(i, j) = d(i, j) + min{ D(i-1, j) + T10,
                             D(i, j-1) + T01,
                             D(i-1, j-1) + T11 }

(d(i, j) is the local match cost; D(i, j) is the lowest cost to reach (i, j) via the best predecessor, including its transition cost)

Best path via traceback from the final state (see the sketch below)
- store the best predecessor for each (i, j)
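A direct transcription of this recursion as a Python sketch; the Euclidean local distance and zero transition costs are assumptions, and the predecessor traceback is omitted for brevity.

    import numpy as np

    # DTW following the slide's recursion:
    # D(i,j) = d(i,j) + min{ D(i-1,j)+T10, D(i,j-1)+T01, D(i-1,j-1)+T11 }
    def dtw(test, ref, T10=0.0, T01=0.0, T11=0.0):
        I, J = len(test), len(ref)
        # d[i, j]: Euclidean distance between frame vectors
        d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=2)
        D = np.full((I, J), np.inf)
        D[0, 0] = d[0, 0]
        for i in range(I):
            for j in range(J):
                if i == j == 0:
                    continue
                preds = []
                if i > 0: preds.append(D[i - 1, j] + T10)
                if j > 0: preds.append(D[i, j - 1] + T01)
                if i > 0 and j > 0: preds.append(D[i - 1, j - 1] + T11)
                D[i, j] = d[i, j] + min(preds)
        return D[-1, -1]        # traceback over stored predecessors omitted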

DTW-based recognition

Reference templates for each possible word

For isolated words:
- mark endpoints of the input word
- calculate scores through each template (+ prune)

[Distance matrix: input frames vs. reference templates ONE, TWO, THREE, FOUR]

- continuous speech: link together word ends

Successfully handles timing variation
- recognize speech at reasonable cost

Statistical sequence recognition

DTW is limited because it's hard to optimize
- learning from multiple observations
- interpretation of distance, transition costs?

Need a theoretical foundation: probability

Formulate recognition as a MAP choice among word sequences:

    Q* = argmax_Q p(Q | X, Θ)

- X = observed features
- Q = word sequences
- Θ = all current parameters

State-based modeling

Assume a discrete-state model for the speech:
- observations are divided up into time frames
- model → states → observations:

    Q_k:   q_1 q_2 q_3 q_4 q_5 q_6 ...   (states of model M_j)
    X_1^N: x_1 x_2 x_3 x_4 x_5 x_6 ...   (N observed feature vectors)

Probability of the observations given the model:

    p(X | Θ) = Σ_{all Q} p(X_1^N | Q, Θ) p(Q | Θ)

- sum over all possible state sequences Q

How do observations X_1^N depend on states Q?
How do state sequences Q depend on the model Θ?

HMM review

HMM is specified by parameters Θ:

[Example "k a t" model: states k, a, t plus an end state •, each with an emission distribution p(x|q); transition matrix rows (0.9, 0.1, 0.0, 0.0), (0.0, 0.9, 0.1, 0.0), (0.0, 0.0, 0.9, 0.1); initial state distribution (1.0, 0.0, 0.0, 0.0)]

- states q_i
- transition probabilities a_ij
- emission distributions b_i(x)
- (+ initial state probabilities π_i)

    a_ij ≡ p(q_n^j | q_{n-1}^i)    b_i(x) ≡ p(x | q^i)    π_i ≡ p(q_1^i)

HMM summary (1)

HMMs are a generative model: recognition is inference of p(Q | X)

During generation, the behavior of the model depends only on the current state q_n:
- transition probabilities p(q_{n+1} | q_n) = a_ij
- observation distributions p(x_n | q_n) = b_i(x)

Given states Q = {q_1, q_2, ..., q_N} and observations X = X_1^N = {x_1, x_2, ..., x_N}, the Markov assumption makes

    p(X, Q | Θ) = Π_n p(x_n | q_n) p(q_n | q_{n-1})

HMM summary (2)

Calculate p(X | Θ) via the forward recursion:

    p(X_1^n, q_n^j) = α_n(j) = [ Σ_{i=1}^{S} α_{n-1}(i) a_ij ] b_j(x_n)

Viterbi (best path) approximation:

    α*_n(j) = [ max_i { α*_{n-1}(i) a_ij } ] b_j(x_n)

- then backtrace...

    Q* = argmax_Q p(X, Q | Θ)

Pictorially: model M = M* assumed; states Q = {q_1, q_2, ..., q_n} hidden; X observed; Q*, M* inferred
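Both recursions in a minimal numpy sketch; the transition matrix a[i, j], initial probabilities pi, and per-frame observation likelihoods B[n, j] = b_j(x_n) are all assumed given.

    import numpy as np

    def forward(a, pi, B):
        alpha = pi * B[0]
        for n in range(1, len(B)):
            # alpha_n(j) = [ sum_i alpha_{n-1}(i) a_ij ] b_j(x_n)
            alpha = (alpha @ a) * B[n]
        return alpha.sum()                       # p(X | Theta)

    def viterbi(a, pi, B):
        delta = np.log(pi) + np.log(B[0])
        back = []
        for n in range(1, len(B)):
            scores = delta[:, None] + np.log(a)  # scores[i, j]: via predecessor i
            back.append(scores.argmax(axis=0))   # best predecessor per state j
            delta = scores.max(axis=0) + np.log(B[n])
        q = [int(delta.argmax())]                # backtrace from the final state
        for bp in reversed(back):
            q.append(int(bp[q[-1]]))
        return list(reversed(q))                 # Q* = argmax_Q p(X, Q | Theta)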


Recognition with HMMs

Isolated word
- choose best p(M | X) ∝ p(X | M) p(M)

[Input scored against each word model: p(X | M1)·p(M1), p(X | M2)·p(M2), p(X | M3)·p(M3), for models "w ah n", "t uw", "th r iy"]

Continuous speech
- Viterbi decoding of one large HMM gives the words

[One large HMM: a silence state (sil) connecting into word models "w ah n", "t uw", "th r iy" with entry priors p(M1), p(M2), p(M3)]

Training HMMs

The probabilistic foundation allows us to train HMMs to 'fit' training data
- i.e. estimate a_ij, b_i(x) given data
- better than DTW...

Algorithms to improve p(Θ | X) are key to the success of HMMs
- maximum likelihood of models...

State alignments Q for the training examples are generally unknown
- ...else estimating the parameters would be easy

Viterbi training
- 'forced alignment'
- choose 'best' labels (heuristic)

EM training
- 'fuzzy labels' (guaranteed local convergence)

Overall training procedure

[Diagram: labelled training data ("two one", "four three", "five") is aligned against concatenated word models (w ah n, t uw, th r iy, f ao, ...); fit models to data, re-estimate model parameters, repeat until convergence]

Language models

Recall the fundamental equation of speech recognition:

    Q* = argmax_Q p(Q | X, Θ)
       = argmax_Q p(X | Q, Θ_A) p(Q | Θ_L)

So far, we have looked at p(X | Q, Θ_A)

What about p(Q | Θ_L)?
- Q is a particular word sequence
- Θ_L are parameters related to the language

Two components:
- link state sequences to words: p(Q | w_i)
- priors on word sequences: p(w_i | M_j)

HMM Hierarchy

HMMs support composition
- can handle time dilation, pronunciation, and grammar all within the same framework

[Diagram: a word grammar (THE → CAT | DOG → SAT | ATE) expands into phone models (k ae t), which expand into sub-phone states (ae1 ae2 ae3)]

    p(q | M) = p(q, φ, w | M)
             = p(q | φ) · p(φ | w) · p(w_n | w_1^{n-1}, M)

Pronunciation models

Define states within each word: p(Q | w_i)

Can have unique states for each word ('whole-word' modeling), or...

Share (tie) subword units between words to reflect the underlying phonology
- more training examples for each unit
- generalizes to unseen words
- (or can do it automatically...)

Start e.g. from a pronunciation dictionary:

    ZERO(0.5)  z iy r ow
    ZERO(0.5)  z ih r ow
    ONE(1.0)   w ah n
    TWO(1.0)   tcl t uw

Learning pronunciations

A 'phone recognizer' transcribes the training data as phones
- align to 'canonical' pronunciations:

    Baseform phoneme string:  f ay v  y iy r ow l d
    Surface phone string:     f ah ay v  y uh r ow l

- infer modification rules
- predict other pronunciation variants

e.g. 'd deletion':

    d → ∅ | l stop   p = 0.9

Generate pronunciation variants; use forced alignment to find their weights

Grammar

Account for the different likelihoods of different words and word sequences: p(w_i | M_j)

'True' probabilities are very complex for LVCSR
- need parses, but speech is often agrammatic

→ Use n-grams:

    p(w_n | w_1^{n-1}) = p(w_n | w_{n-K}, ..., w_{n-1})

e.g. n-gram models of Shakespeare:

    n=1  To him swallowed confess hear both. Which. Of save on ...
    n=2  What means, sir. I confess she? then all sorts, he is trim, ...
    n=3  Sweet prince, Falstaff shall die. Harry of Monmouth's grave ...
    n=4  King Henry. What! I will go seek the traitor Gloucester. ...

Big win in recognizer WER
- raw recognition results are often highly ambiguous
- grammar guides toward 'reasonable' solutions

Smoothing LVCSR grammars

n-grams (n = 3 or 4) are estimated from large text corpora
- 100M+ words
- but: not like spoken language

A 100,000-word vocabulary → 10^15 trigrams!
- never see enough examples
- unobserved trigrams should NOT have Pr = 0!

Backoff to bigrams, unigrams
- p(w_n) as an approximation to p(w_n | w_{n-1}), etc.
- interpolate 1-gram, 2-gram, 3-gram with learned weights? (sketched below)
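A toy sketch of the interpolation idea; the mixing weights here are fixed by hand rather than learned (an assumption), but unseen bigrams never receive zero probability.

    from collections import Counter

    # Interpolated bigram: mix bigram, unigram, and uniform estimates.
    def train(corpus):
        words = corpus.split()
        return Counter(zip(words, words[1:])), Counter(words), len(words)

    def p_interp(w_prev, w, bigrams, unigrams, total, vocab_size,
                 lambdas=(0.6, 0.3, 0.1)):
        p_bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
        p_uni = unigrams[w] / total
        p_uniform = 1.0 / vocab_size              # floor for unseen words
        l2, l1, l0 = lambdas
        return l2 * p_bi + l1 * p_uni + l0 * p_uniform

    bigrams, unigrams, total = train("the cat sat on the mat the cat ate")
    print(p_interp("the", "cat", bigrams, unigrams, total,
                   vocab_size=len(unigrams)))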

Lots of other ideas, e.g. category grammars
- p(PLACE | "went", "to") · p(w_n | PLACE)
- how to define categories?
- how to tag words in the training corpus?

Decoding

How to find the MAP word sequence?

States, pronunciations, and words define one big HMM
- with 100,000+ individual states for LVCSR!

→ Exploit hierarchic structure
- phone states independent of word
- next word (semi-)independent of word history

[Lexical tree: phone prefixes shared from the root, branching to the words DO, DECOY, DECODE, DECODES, DECODER]

Decoder pruning

Searching 'all possible word sequences'?
- need to restrict the search to the most promising ones: beam search
- sort hypotheses by estimates of total probability
  = Pr(so far) + a lower-bound estimate of the remainder
- trade search errors for speed

Start-synchronous algorithm (sketched below):
- extract the top hypothesis from the queue:

      [P_n, {w_1, ..., w_k}, n]
      (probability so far, words, next time frame)

- find plausible words {w^i} starting at time n → new hypotheses:

      [P_n · p(X_n^{n+N-1} | w^i) · p(w^i | w_k, ...), {w_1, ..., w_k, w^i}, n + N]

- discard if too unlikely, or if the queue is too long
- else re-insert into the queue and repeat
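A stack-decoder sketch of this loop; `word_extensions` is a hypothetical stand-in for the combined acoustic and language-model scoring of plausible next words, and the beam width is an assumed parameter.

    import heapq

    # Start-synchronous stack decoder: hypotheses are
    # (log-probability so far, word sequence, next time frame).
    def decode(word_extensions, n_frames, beam_width=100):
        queue = [(0.0, [], 0)]                   # max-heap via negated log-probs
        best = (float('-inf'), [])
        while queue:
            neg_logp, words, n = heapq.heappop(queue)
            logp = -neg_logp
            if n >= n_frames:                    # hypothesis covers the input
                if logp > best[0]:
                    best = (logp, words)
                continue
            # word_extensions yields (word, log p(X|w) + log p(w|history), frames)
            for word, word_logp, dur in word_extensions(n, words):
                heapq.heappush(queue,
                               (-(logp + word_logp), words + [word], n + dur))
            if len(queue) > beam_width:          # prune: keep most promising
                queue = heapq.nsmallest(beam_width, queue)
                heapq.heapify(queue)
        return best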

Summary

The speech signal is highly variable
- need models that absorb variability
- hide what we can with robust features

Speech is modeled as a sequence of features
- need a temporal aspect to recognition
- best time-alignment of templates = DTW

Hidden Markov models are the rigorous solution
- self-loops allow temporal dilation
- exact, efficient likelihood calculations

Language modeling captures larger structure
- pronunciation, word sequences
- fits directly into the HMM state structure
- need to 'prune' the search space in decoding

Parting thought

Forward-backward trains to generate; can we train to discriminate?
