
DT2112 Speech Recognition by Computers

Giampiero Salvi

KTH/CSC/TMH giampi@kth.se

VT 2015


Motivation

- Natural way of communication (no training needed)
- Leaves hands and eyes free (good for functionally disabled users)
- Effective (higher data rate than typing)
- Can be transmitted/received inexpensively (phones)

A dream of Artificial Intelligence

2001: A Space Odyssey (1968)

ASR in a Broader Context

[Diagram: Automatic Speech Recognition, Spoken Language Understanding, Dialogue Manager, and Text to Speech as components of a spoken dialogue system.]

The ASR Scope

Convert speech into text:

Automatic Speech Recognition → "My name is . . . " [confidence score]

Not considered here:
- non-verbal signals
- prosody
- multi-modal interaction


A very long endeavour
1952, Bell Laboratories: isolated digit recognition, single speaker, hardware based [2]

[2] K. H. Davis, R. Biddulph, and S. Balashek. "Automatic Recognition of Spoken Digits". In: JASA 24.6 (1952), pp. 637–642

An underestimated challenge

For 60 years, many bold announcements.

http://www.itl.nist.gov/iad/mig/publications/ASRhistory/


Main variables in ASR

Speaking mode: isolated words vs continuous speech

Speaking style: read speech vs spontaneous speech

Speakers: speaker dependent vs speaker independent

Vocabulary: small (<20 words) vs large (>50 000 words)

Robustness: against background noise

Challenges — Variability

Between speakers:
- Age
- Gender
- Anatomy
- Dialect

Within speaker:
- Stress
- Emotion
- Health condition
- Read vs spontaneous
- Adaptation to environment (Lombard effect)
- Adaptation to listener

Environment:
- Noise
- Room acoustics
- Microphone distance
- Microphone, telephone
- Bandwidth

Listener:
- Age
- Mother tongue
- Hearing loss
- Known / unknown
- Human / machine

Applications today

Call centers:
- traffic information
- time-tables
- booking, ...

Accessibility:
- dictation
- hands-free control (TV, video, telephone)

Smart phones:
- Siri, Android, ...

Outline

Speech Signal Representations

Template Matching

Probabilistic Approach
Knowledge Modelling

Performance Measures

Robustness and Adaptation

Speaker Recognition

More details in DT2118: "Speech and Speaker Recognition"

Components of ASR System

Speech Signal → Spectral Analysis → Feature Extraction → Search and Match → Recognised Words

Representation: Spectral Analysis and Feature Extraction
Decoder: Search and Match, constrained by the knowledge sources: Acoustic Models, Lexical Models, Language Models

Outline

Speech Signal Representations

Template Matching

Probabilistic Approach
Knowledge Modelling

Performance Measures

Robustness and Adaptation

Speaker Recognition

Speech Signal Representations

Speech Signal → Spectral Analysis → Feature Extraction (the representation)

Goals:
- disregard irrelevant information
- optimise relevant information for modelling

Speech Signal Representations

Speech Signal → Spectral Analysis → Feature Extraction (the representation)

Means:
- try to model essential aspects of speech production
- imitate auditory processes
- consider properties of statistical modelling

Examples of Speech Sounds

http://www.speech.kth.se/wavesurfer/


Feature Extraction and Speech Production

[Figure: anatomy of speech production (lungs, trachea, larynx, pharyngeal cavity, oral and nasal cavities, soft palate/velum, hard palate, jaw, tongue, teeth, lips, nostrils, diaphragm) and the corresponding mechanical model: muscle force and relaxation drive the lungs and trachea, the vocal folds at the glottis excite the pharyngeal cavity, and the velum routes sound to the oral cavity (mouth output) and/or the nasal cavity (nose output).]

Source/Filter Model, General Case

Vowels

[Figure: vocal tract and tube model for vowels. Legend: source (periodic), front cavity, back cavity, back cavity (2nd approx.).]

Source/Filter Model, General Case

Fricatives (e.g. sh) or Plosives (e.g. k)

[Figure: vocal tract and tube model. Legend: source (noise or impulsive), front cavity, back cavity, back cavity (2nd approx.).]

Source/Filter Model, General Case

Fricatives (e.g. s) or Plosives (e.g. t)

[Figure: vocal tract and tube model. Legend: source (noise or impulsive), front cavity, back cavity, back cavity (2nd approx.).]

Source/Filter Model, General Case

Nasalised Vowels

[Figure: vocal tract and tube model with the velum open, coupling in the nasal cavity. Legend: source (periodic), front cavity, back cavity, back cavity (2nd approx.).]

Examples

[Figure: power spectra (dB vs frequency, 0–8 kHz) of a vocalic sound and of a fricative sound, with the corresponding tube-model configurations.]

Relevant vs Irrelevant Information

For the purpose of transcribing words:
- Relevant: vocal tract shape → spectral envelope
- Irrelevant: vocal fold vibration frequency (f0) → spectral details

[Figure: power spectrum (dB vs frequency, 0–8 kHz) of a vocalic sound.]

Exceptions:
- tonal languages (Chinese)
- pitch and prosody convey meaning


Linear Prediction Analysis

Attempt to model the vocal tract filter:

$x[n] = \sum_{k=1}^{p} a_k \, x[n-k]$

better match at spectral peaks than at valleys
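A minimal sketch of how such a model can be fitted (the autocorrelation method, solving the normal equations directly; the Hamming window and order p = 12 are assumptions, not from the slides):

import numpy as np

def lpc(frame, p=12):
    """Estimate predictor coefficients a_1..a_p for x[n] ~ sum_k a_k x[n-k]."""
    x = frame * np.hamming(len(frame))                        # analysis window
    r = np.correlate(x, x, mode="full")[len(x)-1 : len(x)+p]  # autocorrelation r[0..p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1 : p+1])                     # Yule-Walker equations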

Mel Frequency Cepstrum Coefficients

- imitate aspects of auditory processing
- de facto standard in ASR
- does not assume an all-pole model of the spectrum
- uncorrelated: easier to model statistically

MFCCs Calculation

Pipeline: FFT → Mel Filterbank → Cosine Transform

[Figure: mel filterbank spectrum of /a:/ (filterbank channels vs dB) and the resulting cepstrum coefficients C1–C4.]
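A compact numpy sketch of the pipeline (assumptions beyond the slides: 16 kHz input, 23 triangular mel filters, 13 coefficients, and the usual log compression of the filterbank outputs before the cosine transform):

import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc_frame(frame, fs=16000, n_filt=23, n_ceps=13):
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2   # FFT
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    edges = mel_to_hz(np.linspace(0, hz_to_mel(fs / 2), n_filt + 2))  # mel-spaced
    A = np.zeros(n_filt)
    for i in range(n_filt):                                  # triangular mel filters
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        w = np.clip(np.minimum((freqs - lo) / (mid - lo),
                               (hi - freqs) / (hi - mid)), 0, None)
        A[i] = w @ spec
    A = np.log(A + 1e-10)                                    # log compression
    j = np.arange(1, n_ceps + 1)[:, None]
    i = np.arange(1, n_filt + 1)[None, :]
    return np.sqrt(2 / n_filt) * np.cos(j * np.pi * (i - 0.5) / n_filt) @ A

The last line is the cosine transform given on the next slide, with A_i the (log) filterbank outputs.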

Cosine Transform

$C_j = \sqrt{\frac{2}{N}} \sum_{i=1}^{N} A_i \cos\!\left(\frac{j \pi (i - 0.5)}{N}\right)$

[Figure: the first four cosine basis functions W1–W4 over the filterbank channels; filterbank spectra A_i of /a:/ and /s:/ and the resulting cepstra C_j (C1–C4).]

MFCCs: typical values

- 12 coefficients C1–C12
- energy (could be C0)
- delta coefficients (derivatives in time)
- delta-delta (second order derivatives)
- total: 39 coefficients per frame (analysis window)

A time varying signal

- speech is time varying
- short segments are quasi-stationary
- use short-time analysis

Short-Time Fourier Analysis

[Figure: a windowed stretch of the waveform (amplitude vs sample index) and its power spectrum (dB, 0–8000 Hz), shown for successive positions of the analysis window.]

Windowing, typical values

- signal sampling frequency: 8–20 kHz
- analysis window: 10–50 ms
- frame interval: 10–25 ms (100–40 Hz)
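A minimal framing sketch with such values (25 ms windows every 10 ms at 16 kHz are assumed defaults; x is a 1-D numpy array):

import numpy as np

def frames(x, fs=16000, win_ms=25, step_ms=10):
    """Cut signal x into overlapping, Hamming-windowed analysis frames."""
    win, step = int(fs * win_ms / 1000), int(fs * step_ms / 1000)
    n = 1 + max(0, (len(x) - win) // step)
    idx = step * np.arange(n)[:, None] + np.arange(win)[None, :]
    return x[idx] * np.hamming(win)          # shape: (n_frames, win)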

Frame-Based Processing

[Figure: spectrogram (0–7 kHz, about 1 s) of the TIMIT utterance sx352, words "according to my ...", with time-aligned phone labels (h#, ix, kcl, k, ao, r, dx, ...), processed frame by frame.]

Comparing frames

- city block distance: $d(x, y) = \sum_i |x_i - y_i|$
- Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
- Mahalanobis distance: $d(x, y) = \sum_i (x_i - \mu_y)^2 / \sigma_y$
- probability function: $f(X = x \mid \mu, \Sigma) = N(x; \mu, \Sigma)$
- artificial neural networks: $d = f(\sum_i w_i x_i - \theta)$
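The first three are one-liners in numpy (diagonal-covariance form for the Mahalanobis case, as on the slide; a log-Gaussian score is added for comparison, ignoring the full-covariance case):

import numpy as np

def city_block(x, y):
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def mahalanobis_diag(x, mu, var):
    return np.sum((x - mu) ** 2 / var)

def log_gauss(x, mu, var):
    """log N(x; mu, diag(var))"""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)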

Comparing Utterances

In order to recognise speech we have to be able to compare different utterances:

"Va jobbaru me" vs "Vad jobbar du med"
(Swedish, spontaneous vs careful: "What is your occupation")

Fixed vs Variable Length Representation


Combining frame-wise scores into utterance scores

Template Matching:
- oldest technique
- simple comparison of template patterns
- compensates for varying speech rate (Dynamic Programming)

Hidden Markov Models (HMMs):
- most used technique
- models of the segmental structure of speech
- recognition by Viterbi search (Dynamic Programming)

Outline

Speech Signal Representations

Template Matching

Probabilistic Approach
Knowledge Modelling

Performance Measures

Robustness and Adaptation

Speaker Recognition

Template Matching


Dynamic Programming
- compare every possible alignment
- problem: exponential in H and K!

[Figure: alignment grid, unknown utterance (frames 1...H) against reference utterance (frames 1...K).]

Dynamic Programming
Dynamic Time Warping (DTW) algorithm:

for h = 1 to H do
  for k = 1 to K do
    AccD[h, k] = LocD[h, k] + min(AccD[h−1, k], AccD[h−1, k−1], AccD[h, k−1])

[Figure: the grid is filled cell by cell; the total alignment cost is AccD[H, K].]
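A runnable version of the recursion (a sketch: the boundary conditions are handled with a border of infinities, and LocD is assumed to be precomputed, e.g. from the frame distances above):

import numpy as np

def dtw(loc_d):
    """loc_d[h, k] = local distance between unknown frame h and reference frame k.
    Returns the accumulated distance AccD[H, K]."""
    H, K = loc_d.shape
    acc = np.full((H + 1, K + 1), np.inf)
    acc[0, 0] = 0.0
    for h in range(1, H + 1):
        for k in range(1, K + 1):
            acc[h, k] = loc_d[h-1, k-1] + min(acc[h-1, k],
                                              acc[h-1, k-1],
                                              acc[h, k-1])
    return acc[H, K]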

DP Example: Spelling

- observations are letters
- local distance: 0 (same letter), 1 (different letter)
- unknown utterance: ALLDRIG
- reference 1: ALDRIG
- reference 2: ALLTID
- problem: find the closest match

Distance char-by-char:
- ALLDRIG–ALDRIG = 5
- ALLDRIG–ALLTID = 4
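The same dtw function reproduces the example (0/1 local distances over letters, reusing the sketch above):

import numpy as np

def string_dtw(u, r):
    loc = np.array([[0.0 if a == b else 1.0 for b in r] for a in u])
    return dtw(loc)

print(string_dtw("ALLDRIG", "ALDRIG"))   # -> 0.0, as on the next slide
print(string_dtw("ALLDRIG", "ALLTID"))   # -> 3.0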

DP Example: Solution

LocD[h,k] =
    G  1 1 1 1 1 1 0
    I  1 1 1 1 1 0 1
    R  1 1 1 1 0 1 1
    D  1 1 1 0 1 1 1
    L  1 0 0 1 1 1 1
    A  0 1 1 1 1 1 1
       A L L D R I G

AccD[h,k] =
    G  5 4 4 3 2 1 0
    I  4 3 3 2 1 0 1
    R  3 2 2 1 0 1 2
    D  2 1 1 0 1 2 3
    L  1 0 0 1 2 3 4
    A  0 1 2 3 4 5 6
       A L L D R I G

Distance ALLDRIG–ALDRIG: AccD[H,K] = 0
Distance ALLDRIG–ALLTID? (5 min)


DP Example: Solution

LocD[h,k] =
    D  1 1 1 0 1 1 1
    I  1 1 1 1 1 0 1
    T  1 1 1 1 1 1 1
    L  1 0 0 1 1 1 1
    L  1 0 0 1 1 1 1
    A  0 1 1 1 1 1 1
       A L L D R I G

AccD[h,k] =
    D  5 3 3 2 3 3 3
    I  4 2 2 2 2 2 3
    T  3 1 1 1 2 3 4
    L  2 0 0 1 2 3 4
    L  1 0 0 1 2 3 4
    A  0 1 2 3 4 5 6
       A L L D R I G

Distance ALLDRIG–ALDRIG: AccD[H,K] = 0
Distance ALLDRIG–ALLTID: AccD[H,K] = 3

Best path: Backtracking

Sometimes we want to know the path itself:

1. at each point [h,k], remember the minimum-distance predecessor (back pointer)

2. at the end point [H,K], follow the back pointers until the start
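A sketch of the same idea in code (extends the dtw function above; the dictionary-based back-pointer table is my choice, not from the slides):

import numpy as np

def dtw_path(loc_d):
    """Returns the accumulated distance and the best alignment path."""
    H, K = loc_d.shape
    acc = np.full((H + 1, K + 1), np.inf)
    acc[0, 0] = 0.0
    bp = {}
    moves = [(-1, 0), (-1, -1), (0, -1)]
    for h in range(1, H + 1):
        for k in range(1, K + 1):
            dh, dk = min(moves, key=lambda m: acc[h + m[0], k + m[1]])
            acc[h, k] = loc_d[h-1, k-1] + acc[h + dh, k + dk]
            bp[h, k] = (h + dh, k + dk)          # remember the predecessor
    path, node = [], (H, K)
    while node != (0, 0):                        # follow back pointers to the start
        path.append(node)
        node = bp[node]
    return acc[H, K], path[::-1]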

Properties of Template Matching

Pros:
+ no need for phonetic transcriptions
+ within-word co-articulation for free
+ high time resolution

Cons:
– cross-word co-articulation not modelled
– requires recordings of every word
– not easy to model variation
– does not scale up with vocabulary size

Outline

Speech Signal Representations

Template Matching

Probabilistic Approach
Knowledge Modelling

Performance Measures

Robustness and Adaptation

Speaker Recognition

Components of ASR System

Speech Signal → Spectral Analysis → Feature Extraction → Search and Match → Recognised Words

Representation: Spectral Analysis and Feature Extraction
Decoder: Search and Match, constrained by the knowledge sources: Acoustic Models, Lexical Models, Language Models

A probabilistic perspective

1. compute the probability of a word sequence given the acoustic observation: P(words|sounds)

2. find the optimal word sequence by maximising the probability:

$\widehat{words} = \arg\max_{words} P(words|sounds)$

A probabilistic perspective: Bayes’ rule

$P(words|sounds) = \frac{P(sounds|words)\, P(words)}{P(sounds)}$

- P(sounds|words) can be estimated from training data and transcriptions
- P(words): a priori probability of the words (Language Model)
- P(sounds): a priori probability of the sounds (constant, can be ignored)

Components of ASR System

Speech Signal → Spectral Analysis → Feature Extraction → Search and Match → Recognised Words

Knowledge sources: Acoustic Models, Lexical Models, Language Models
Focus here: Acoustic Models

Probabilistic Modelling
Problem: How do we model P(sounds|words)?

[Figure: spectrogram of the TIMIT utterance sx352 ("according to my ...") with time-aligned phone labels.]

Every feature vector (observation at time t) is a continuous stochastic variable (e.g. MFCC)


Stationarity
Problem: speech is not stationary
- we need to model short segments independently
- the fundamental unit cannot be the word, but must be shorter
- usually we model three segments for each phoneme

Local probabilities (frame-wise)

If the segment is sufficiently short,

P(sounds|segment)

can be modelled with standard probability distributions, usually a Gaussian or a Gaussian mixture.
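A minimal sketch of such a frame-wise score (diagonal-covariance Gaussian mixture; numerical underflow is ignored for clarity):

import numpy as np

def gmm_loglik(x, w, mu, var):
    """log P(x | GMM) for one feature vector x.
    w: (M,) component weights; mu, var: (M, D) means and variances."""
    comp = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)
    return np.log(np.sum(w * np.exp(comp)))      # sum over mixture components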

Global Probabilities (utterance)

Problem: How do we combine the different P(sounds|segment) to form P(sounds|words)?

Answer: Hidden Markov Model (HMM)

[Figure: three-state models for the phones /w/ (w1 w2 w3), /ao/ (ao1 ao2 ao3) and /sh/ (sh1 sh2 sh3), concatenated into a model of the word "wash".]

Hidden Markov Models (HMMs)

[Figure: three states s1, s2, s3 with transition arcs.]

Elements:
- set of states: S = {s1, s2, s3}
- transition probabilities: T(sa, sb) = P(sb, t | sa, t−1)
- prior probabilities: π(sa) = P(sa, t0)
- state-to-observation probabilities: B(o, sa) = P(o|sa)


Hidden Markov Models (HMMs)

[Figure: trellis aligning the model states against the frames 1...H of the unknown utterance.]


HMM-questions

1. what is the probability that the model has generated the sequence of observations? (isolated word recognition): forward algorithm

2. what is the most likely state sequence given the observation sequence? (continuous speech recognition): Viterbi algorithm [5]

3. how can the model parameters be estimated from examples? (training): Baum-Welch [1]

[5] A. J. Viterbi. "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm". In: IEEE Trans. Inform. Theory IT-13 (Apr. 1967), pp. 260–269

[1] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains". In: Ann. Math. Statist. 41.1 (1970), pp. 164–171
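A compact log-domain Viterbi sketch (assumed inputs: prior, transition and observation log-probabilities as numpy arrays; this answers question 2):

import numpy as np

def viterbi(log_pi, log_T, log_B):
    """log_pi: (S,) priors; log_T: (S, S) transitions; log_B: (T, S) obs scores.
    Returns the most likely state sequence."""
    T_len, S = log_B.shape
    delta = log_pi + log_B[0]
    psi = np.zeros((T_len, S), dtype=int)
    for t in range(1, T_len):
        scores = delta[:, None] + log_T          # score of each predecessor
        psi[t] = np.argmax(scores, axis=0)       # best predecessor per state
        delta = scores[psi[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]
    for t in range(T_len - 1, 0, -1):            # backtrack through psi
        path.append(int(psi[t, path[-1]]))
    return path[::-1]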

Isolated Words Recognition

[Figure: the unknown utterance is scored against each word model hmm1, hmm2, hmm3, ..., hmmK in parallel.]

Compare likelihoods (forward-backward)

Continuous Speech Recognition

[Figure: the word models hmm1, hmm2, hmm3, ..., hmmK connected in a network; decoding finds the best path through them.]

Viterbi algorithm

Modelling Coarticulation

Example: peat /pi:t/ vs wheel /wi:l/

[Figure: waveforms and spectrograms (0–4000 Hz, about 0.3 s) of "peat" and "wheel"; the acoustics of /i:/ depend on the surrounding phones.]

Modelling Coarticulation

Context dependent models (CD-HMMs)

- duplicate each phoneme model depending on left and right context:
  - from an "a" monophone model
  - to "d−a+f", "d−a+g", "l−a+s", ... triphone models
- if there are N = 50 phonemes in the language, there are N³ = 125 000 potential triphones
- many of them are not exploited by the language

Amount of parameters

Example:

I a large vocabulary recogniser may have 60000triphone models

I each model has 3 states

I each state may have 32 mixture componentswith 1 + 39× 2 parameters each (weight,means, variances): 39× 32× 2 + 32 = 2528

Totally it is 60000× 3× 2528 = 455 millionparameters!

58 / 117

Similar Coarticulation

/ri:/ vs /wi:/

[Figure: waveforms and spectrograms (0–4000 Hz, about 0.4 s) of /ri:/ and /wi:/, showing similar coarticulation effects on /i:/.]

Tying to reduce complexity

Example: similar triphones d−a+m and t−a+m
- same right context, similar left context
- the 3rd state is expected to be very similar
- the 2nd state may also be similar

States (and their parameters) can be shared between models:
+ reduces complexity
+ more data to estimate each parameter
– fine detail may be lost

Done with CART tree methodology.


Components of ASR System

Speech Signal → Spectral Analysis → Feature Extraction → Search and Match → Recognised Words

Knowledge sources: Acoustic Models, Lexical Models, Language Models
Focus here: Lexical Models


Lexical Models

- in general, specify the sequence of phonemes for each word
- example: "dictionary"

         IPA                      X-SAMPA
  UK:    /d ɪ k ʃ ə n (ə) ɹ i/    /d I k S @ n (@) r i/
  USA:   /d ɪ k ʃ ə n ɛ ɹ i/      /d I k S @ n E r i/

- expensive resources
- include multiple pronunciations
- phonological rules (assimilation, deletion)

Pronunciation Network

Example: tomato

[Figure: pronunciation network with alternative arcs: /t/, then /@/ or /oU/, then /m/, then /eI/ or /A/, then /t/ or /R/ (flap), ending in /oU/; each arc carries a probability (0.2–0.8).]

Assimilation

did you → /d I dZ j @/
set you → /s E tS 3/
last year → /l æ s tS i: ɹ/
because you've → /b i: k @ Z u: v/

Deletion

find him → /f a I n I m/
around this → /@ ɹ aU n I s/
let me in → /l E m i: n/

Out of Vocabulary Words

- proper names are often not in the lexicon
- derive pronunciation automatically
- English has very complex grapheme-to-phoneme rules
- attempts to derive pronunciation from speech recordings

Components of ASR System

Speech Signal → Spectral Analysis → Feature Extraction → Search and Match → Recognised Words

Knowledge sources: Acoustic Models, Lexical Models, Language Models
Focus here: Language Models


Why do we need language models?

Bayes' rule:

$P(words|sounds) = \frac{P(sounds|words)\, P(words)}{P(sounds)}$

where P(words) is the a priori probability of the words (Language Model).

We could use non-informative priors (P(words) = 1/N), but...

Branching Factor

- if we have N words in the dictionary
- at every word boundary we have to consider N equally likely alternatives
- N can be in the order of millions

[Figure: from each word, the decoder can branch to word1, word2, ..., wordN.]

Ambiguity

"ice cream" vs "I scream"

/aI s k ɹ i: m/
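A toy numeric sketch of how the language model resolves such ambiguity (the numbers are invented for illustration; P(sounds) cancels in the arg max):

# Two word hypotheses for the same sounds /aI s k r i: m/
p_sounds_given = {"ice cream": 0.20, "I scream": 0.22}    # acoustic score
p_words        = {"ice cream": 0.010, "I scream": 0.001}  # language model
best = max(p_sounds_given, key=lambda w: p_sounds_given[w] * p_words[w])
print(best)   # -> ice cream: the prior outweighs the small acoustic difference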

Language Models

$P(words|sounds) = \frac{P(sounds|words)\, P(words)}{P(sounds)}$

Finite state networks (hand-made, see lab):
- formal language, e.g. traffic control

Statistical models (N-grams):
- unigrams: P(wi)
- bigrams: P(wi | wi−1)
- trigrams: P(wi | wi−1, wi−2)
- ...

Chomsky’s formal grammar

Noam Chomsky: linguist, philosopher, ...

G = (V, T, P, S)

where
- V: set of non-terminal constituents
- T: set of terminals (lexical items)
- P: set of production rules
- S: start symbol

Example
S = sentence
V = {NP (noun phrase), NP1, VP (verb phrase), NAME, ADJ, V (verb), N (noun)}
T = {Mary, person, loves, that, ...}
P = {
  S → NP VP
  NP → NAME
  NP → ADJ NP1
  NP1 → N
  VP → V NP
  NAME → Mary
  V → loves
  N → person
  ADJ → that
}

[Parse tree of "Mary loves that person": S splits into NP (NAME: Mary) and VP; VP splits into V (loves) and NP; that NP splits into ADJ (that) and NP1 (N: person).]


Formal Language Models

- only used for simple tasks
- hard to code by hand
- people do not speak following formal grammars

Statistical Grammar Models (N-grams)

Simply count co-occurrences of words in large text data sets:
- unigrams: P(wi)
- bigrams: P(wi | wi−1)
- trigrams: P(wi | wi−1, wi−2)
- ...
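Since estimation is essentially counting, a minimal bigram sketch fits in a few lines (the toy corpus and helper are invented for illustration; no smoothing):

from collections import Counter

corpus = "i scream you scream we all scream for ice cream".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    """Maximum-likelihood estimate of P(w | prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("scream", "i"))   # -> 1.0 in this tiny corpus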

Language Models: complexity

Increasing N in N-grams leads to:
1. more complex decoders
2. difficulties in training the LM parameters

Knowledge Models in ASR

Acoustic Models: trained on hours of annotated speech recordings (specially developed speech databases)

Lexical Model: usually produced by hand by experts (or generated by rules)

Language Models: trained on millions of words of text (often from newspapers)

Outline

Speech Signal Representations

Template Matching

Probabilistic Approach
Knowledge Modelling

Performance Measures

Robustness and Adaptation

Speaker Recognition

Word Accuracy

$A = 100 \cdot \frac{N - S - D - I}{N}$

where
- N: total number of reference words
- S: substitutions
- D: deletions
- I: insertions

Word Accuracy: example

Ref: I ---    wanted badly to   meet you
Rec: I really wanted ---   to   see  you
     corr ins corr   del   corr sub  corr

6 words, 1 substitution, 1 insertion, 1 deletion

$A = 100 \cdot \frac{6 - 1 - 1 - 1}{6} = 50\%$

The alignment requires dynamic programming.
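A sketch of that alignment count by dynamic programming (standard Levenshtein distance over words; the function name is mine):

def error_count(ref, rec):
    """Minimum S + D + I needed to align rec to ref."""
    R, C = len(ref) + 1, len(rec) + 1
    d = [[0] * C for _ in range(R)]
    for i in range(R):
        d[i][0] = i                      # all deletions
    for j in range(C):
        d[0][j] = j                      # all insertions
    for i in range(1, R):
        for j in range(1, C):
            sub = d[i-1][j-1] + (ref[i-1] != rec[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[-1][-1]

ref = "I wanted badly to meet you".split()
rec = "I really wanted to see you".split()
print(100 * (len(ref) - error_count(ref, rec)) / len(ref))   # -> 50.0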

Measure Difficulty

Language Perplexity

$B = 2^H, \qquad H = -\sum_{\forall W} P(W) \log_2 P(W)$

- P(W) is the probability of the word sequence (language model)
- H is called entropy
- B can be seen as a measure of the average number of words that can follow any given word
- example: equiprobable digit sequences give B = 10
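The digit example in a few lines of numpy (a sketch over a single distribution with nonzero entries):

import numpy as np

def perplexity(p):
    """B = 2^H for a probability distribution p."""
    p = np.asarray(p, dtype=float)
    return 2 ** (-np.sum(p * np.log2(p)))

print(perplexity([0.1] * 10))   # equiprobable digits -> 10.0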

Effect of adding features

Effect of adding training data (Switchboard data)

Effect of adding Gaussians

Effect of adding data for language models (x-axis: millions of sentences)

[Figures: four plots of recognition performance as each of these factors is scaled up.]

Some dictation systems

- vocabulary over 100 000 words
- many languages
- systems: Nuance NaturallySpeaking, Microsoft, (IBM ViaVoice), (Dragon Dictate)

New applications

- indexing of TV and radio programs (offline), Google
- real-time subtitling of TV programs (a re-speaker that summarises)
- voice search (Google)
- language learning
- smart phones

Outline

Speech Signal Representations

Template Matching

Probabilistic Approach
Knowledge Modelling

Performance Measures

Robustness and Adaptation

Speaker Recognition

Main variables in ASR

Speaking mode: isolated words vs continuous speech

Speaking style: read speech vs spontaneous speech

Speakers: speaker dependent vs speaker independent

Vocabulary: small (<20 words) vs large (>50 000 words)

Robustness: against background noise

http://www.itl.nist.gov/iad/mig/publications/ASRhistory/


Why is it so hard?

[Figure: the variability of the whole language is far larger than what the training database covers, which in turn only partially overlaps the application domain.]

Challenges — Variability

Between speakers:
- Age
- Gender
- Anatomy
- Dialect

Within speaker:
- Stress
- Emotion
- Health condition
- Read vs spontaneous
- Adaptation to environment (Lombard effect)
- Adaptation to listener

Environment:
- Noise
- Room acoustics
- Microphone distance
- Microphone, telephone
- Bandwidth

Listener:
- Age
- Mother tongue
- Hearing loss
- Known / unknown
- Human / machine

Sheep and Goats [3]

[3] G. Doddington, W. Liggett, A. Martin, M. Przybocki, and D. Reynolds. "SHEEP, GOATS, LAMBS and WOLVES: A Statistical Analysis of Speaker Performance in the NIST 1998 Speaker Recognition Evaluation". In: International Conference on Spoken Language Processing. 1998


Example: spontaneous vs hyper-articulated

"Va jobbaru me" vs "Vad jobbar du med"

"What is your occupation" (literally "What work you with")

Examples of reduced pronunciation

Spoken     Written        In English
Tesempel   Till exempel   for example
ahamba     och han bara   and he just
bafatt     bara för att   just because
javende    jag vet inte   I don't know

Microphone distance

[Audio examples: headset vs 2 m distance.]

How do we cope with variability?
Ideally: models that generalise beyond the training database.

How do we cope with variability?
Large companies use insane quantities of data.

How do we cope with variability?
Adaptation: adapt the models trained on the database towards the application.

Adaptation: Example

Enrolment in dictation systems:
- let the user read a small text before using the system

Beta versions of smartphone applications:
- the company has all the rights to the data generated

Limitations

- lack of context
- require huge amounts of training data

Adapted from Mikael Parkvall's Lingvistiska Samlarbilder, Nr. 96: "Problem med automatisk taligenkänning" ("Problems with automatic speech recognition")

[Cartoon: a word cloud of Swedish dish names (PYTTIPANNA, FALUKORV, KÖTTBULLAR, SILLSALLAD, NASI GORENG, TACOS, ...).]

Lack of Generalisation [4]

[Figure: word error rate (%) vs hours of training data (1 to 10,000,000, log scale) for broadcast news transcription, from less supervised to more supervised training, with a linear extrapolation.]

In order to reach a 10-year-old's performance, ASR needs 4 to 70 human lifetimes of exposure to speech!!

[4] R. Moore. "A Comparison of the Data Requirements of Automatic Speech Recognition Systems and Human Listeners". In: Proc. of Eurospeech. Geneva, Switzerland, 2003, pp. 2582–2584

New directions

- production inspired modelling
- study of children's speech acquisition
- modelling and decision techniques:
  - Eigenvoices
  - deep learning neural networks

Outline

Speech Signal Representations

Template Matching

Probabilistic Approach
Knowledge Modelling

Performance Measures

Robustness and Adaptation

Speaker Recognition

Speaker Recognition

Created by Håkan Melin

Person Identification

Methods rely on:
- something you possess: key, magnetic card, ...
- something you know: PIN code, password, ...
- something you are: physical attributes, behaviour (biometrics)

Recognition, Verification, Identification

Recognition: general term

Speaker verification:
- an identity is claimed and is verified by voice
- binary decision (accept/reject)
- performance independent of the number of users

Speaker identification:
- choose one of N speakers
- closed set: the voice belongs to one of the N speakers
- open set: any person can access the system
- problem difficulty increases with N

Text Dependence

Either fix the content or recognise it. Examples, from text dependent to text independent:
- fixed password (text dependent)
- user-specific password
- system prompts the text (prevents impostors from recording and playing back the password)
- any word is allowed (text independent)

Representations

Speech recognition:
- represent speech content
- disregard speaker identity

Speaker recognition:
- represent speaker identity
- disregard speech content

Surprisingly:
- MFCCs are used for both
- suggests that feature extraction could be improved


Speaker Verification

Registration (training, enrolment): training utterances from a new client go through spectral analysis and are used to train a speaker model.

Verification: the access utterance goes through spectral analysis and is matched against the model of the claimed identity; the output is Accept / Reject.

Problem: the matching score between the client model and the utterance is sensitive to distortion, utterance duration, etc.

Modelling Techniques

HMMs:
- text dependent systems
- the state sequence represents the allowed utterance

GMMs (Gaussian Mixture Models):
- text independent systems
- large number of Gaussian components
- sequential information not used

SVM (Support Vector Machines)

Combined models

Evaluation

Claimed identity | Decision: Accept  | Decision: Reject
True             | OK                | False Reject (FR)
False            | False Accept (FA) | OK

Score Distribution and Error Balance

[Figure: score distributions f(s | "true speaker") and f(s | "false speaker"); moving the decision threshold trades P("false accept") against P("false reject").]

Performance Measures

- False Rejection Rate (FR)
- False Acceptance Rate (FA)
- Half Total Error Rate (HTER = (FR + FA)/2)
- Equal Error Rate (EER)
- Detection Error Trade-off (DET) curve
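A small sketch computing these from trial scores (assumed inputs: one array of true-speaker scores and one of impostor scores; the EER is approximated by sweeping candidate thresholds):

import numpy as np

def fr_fa(true_scores, false_scores, threshold):
    fr = np.mean(true_scores < threshold)      # false rejections
    fa = np.mean(false_scores >= threshold)    # false acceptances
    return fr, fa

def eer(true_scores, false_scores):
    """Error rate at the threshold where FR and FA are (nearly) equal."""
    candidates = np.sort(np.concatenate([true_scores, false_scores]))
    rates = [fr_fa(true_scores, false_scores, t) for t in candidates]
    fr, fa = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (fr + fa) / 2   # at any other threshold, (FR + FA)/2 is the HTER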

PER vs Commercial System

[DET curves: False Reject Rate vs False Accept Rate (2–20%), PER eval02 (es=E2a_G8, ts=S2b_G8), comparing a commercial system with adaptation against CTT:combo, original and retrained.]

Retrained: background model trained on PER speech
Original: background model trained on telephone speech

More information and mathematical formulations in DT2118.