radboud university nijmegen

How to handle pronunciation variation in ASR: By storing episodes in memory?

Helmer Strik
Centre for Language and Speech Technology (CLST)
Radboud University Nijmegen, the Netherlands



Overview

Contents:
  Variation, invariance problem
  ASR: Automatic Speech Recognition
  HSR: Human Speech Recognition
  ESR: Episodic Speech Recognition


Invariance problem (1)

One of the main issues in speech recognition is the large amount of variability present in speech.
SRIV2006: ITRW on Speech Recognition and Intrinsic Variation

Invariance problem:
  Variation in stimuli, invariant percept
  Also visual, tactile, etc.
  Studied in many fields, no consensus

2 paradigms:
  Invariant
  Episodic



Example 1: Speech

Dutch word: “natuurlijk” (naturally, ‘of course’) [natyrlk] [natylk]… [tyk]

Multiword expressions (MWEs): a lot of reduction, many variants



Example 2: Writing (vision)

natuurlijk natuurlijk






Familiar ‘styles’ (fonts, handwriting) are recognized better


ASR - Paradigm

Invariant, symbolic approach:
  utterance
  -> sequence of words
  -> sequence of phonemes
  -> sequence of states
  -> parametric description: pdfs / ANN


ASR - Paradigm

Same paradigm (HMMs) since the 70s
Assumptions: incorrect, questionable
Insufficient performance

ASR vs. HSR: error rates 8-80x higher
Slow progress (ceiling effect?)
Simply using more and more data is not sufficient

(Moore, 2001)

A new paradigm is needed!
However, only a few attempts


HSR - Indexical information

Speech - 2 types of information :

1. Verbal info.: what, contents
2. Indexical info.: how, form

e.g. environmental and speaker-specific aspects (pitch, loudness, speech rate, voice quality)



Traditional ASR model:
  Verbal information is used
  Indexical information -> noise, disturbances
  Preprocessing:
    o Strip off
    o Normalization (VTLN, MLLR, etc.)

And in HSR?



HSR : Strip off indexical information?

No!

Familiar voices and accents: recognize and mimic

Indexical information is perceived and encoded



Verbal & indexical information: processed independently?

No!

Familiar ‘voices’ are recognized better

Facilitation, also with ‘similar’ speech


HSR - Indexical and detailed information

Experimental results: indexical information and fine phonetic detail (Hawkins et al.) influence perception

Difficult to explain / integrate in the traditional, invariant model

New models: episodic models, for auditory and visual perception


ESR - Basic idea

A new paradigm for ASR is needed:
An episodic model !!??

Training:
  Store trajectories - (representatives of) episodes

Recognition:
  Calculate the distance between X and sequences of stored trajectories (DTW)
  Take the one with minimum distance: the recognized word
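The basic idea can be sketched in a few lines. This is a minimal illustration, not the system described in the talk: it assumes whole-word episodes (rather than sequences of stored trajectories), a Euclidean frame distance, and the simplest DTW step pattern. The names `frame_distance`, `dtw_distance` and `recognize`, and the toy 1-dimensional data, are made up for the example.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors (frames)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def dtw_distance(x, y):
    """Cumulative DTW distance between trajectories x and y (lists of frames)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of x with the first j of y
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(x, episodes):
    """Return (word, distance) for the stored episode closest to x.

    episodes: list of (word_label, trajectory) pairs stored during 'training'.
    """
    scored = ((word, dtw_distance(x, traj)) for word, traj in episodes)
    return min(scored, key=lambda pair: pair[1])

# Toy 1-dimensional trajectories; a real system would store e.g. MFCC frames.
episodes = [
    ("rise", [[0.0], [1.0], [2.0], [3.0]]),
    ("fall", [[3.0], [2.0], [1.0], [0.0]]),
]
x = [[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]]  # a slower rendition of "rise"
word, dist = recognize(x, episodes)
```

Training here is nothing more than appending labelled trajectories to `episodes`; all the work happens at recognition time, which is where the complexity of the episodic approach sits.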


ESR – Invariant vs. episodic

                 phone-based HMM              ESR
-------------------------------------------------------------
Unit:            Phone                        Syllable, word, …
Representation:  States - pdfs or ANN         Trajectories
Compare:         Trajectory (X) & states      Trajectory (X) & trajectories

                 Parsimonious representation  Extensive representation
                 Complex mapping              Simple mapping
                 ‘Variation is noise’         Variation contains info.
                 Normalization                Use variation


Phone ‘aj’ from ‘nine’.

X = begin

3 parts: aj(, aj|, aj)

Representation: pdfs (Gaussians)

Much detail, dynamic information is lost

Trajectories: details
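A small illustration of why per-state statistics lose dynamic information. This is a sketch under simplifying assumptions: a 1-dimensional contour with invented values, three equal-length parts standing in for the three states, and one mean and variance per part standing in for a Gaussian. The helper `state_stats` is hypothetical.

```python
import statistics

def state_stats(trajectory, n_states=3):
    """Split a 1-D trajectory into n_states equal parts; return (mean, variance) per part.

    Frames left over when the length is not divisible by n_states are ignored.
    """
    size = len(trajectory) // n_states
    stats = []
    for s in range(n_states):
        part = trajectory[s * size:(s + 1) * size]
        stats.append((statistics.fmean(part), statistics.pvariance(part)))
    return stats

# Invented contour that rises steadily, as a diphthong's formant might.
rising = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
# Same values, but the order inside each third is reversed (a zig-zag contour).
scrambled = [3.0, 2.0, 1.0, 6.0, 5.0, 4.0, 9.0, 8.0, 7.0]

same_stats = state_stats(rising) == state_stats(scrambled)
same_trajectory = rising == scrambled
```

Both contours yield identical per-part means and variances, although their shapes differ; a stored trajectory keeps that difference.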


Unit: phone(me)

Switchboard (Greenberg et al.):
  deletion: 25% of the phones
  substitution: 30% of the phones
  together 55%!!

Difficult for a model based on ‘sequences of phones’.
Syllables: less than 1% deleted

Phonetic transcriptions and their evaluation:
  Large differences between humans
  What is the ‘golden reference’?
  Speech – a sequence of symbols?


Unit: Multiword expressions (MWEs)

MWEs (see poster):
  A lot of reduction; many phonemes deleted or substituted
  Many variants (= sequences of phonemes): more than 90 for 2 MWEs studied
  Difficult to handle in ASR systems with current methods for pronunciation variation modeling
  Reduction, e.g. for a MWE: 4 words with 7 syllables reduced to ‘1 entity’ with 2 syllables

What should be stored?
Units of various lengths?


An episodic approach for ASR

Advantages:
  More information during search: dynamic, indexical, fine phonetic detail
  Continuity constraints can be used (reduces the trajectory folding problem)
  Model is simpler

Disadvantage:
  More information during search: complexity

Brain: a lot of storage and ‘CPU’
Computers: more and more powerful


An episodic approach for ASR

Strik (2003) ITC-irst, Trento, Italy; ICPhS, Barcelona
De Wachter et al. (2003) Interspeech-2003
Axelrod & Maison (2004) ICASSP-2004
Maier & Moore (2005) Interspeech-2005
Aradilla, Vepa, Bourlard (2005) Interspeech-2005
Matton, De Wachter, et al. (2005) SPECOM-2005

Promising results
The computing power and memory that are needed to investigate the episodic approach to speech recognition are (becoming) available


The HSR-ASR gap

HSR & ASR – 2 different communities
Different people, departments, journals, terminology, goals, methodologies

Goals, evaluation:
  HSR: simulate experimental findings
  ASR: reduce WER


The HSR-ASR gap

Marr (1982) – 3 levels of modeling:
  1. Computational
  2. Algorithmic
  3. Implementational

HSR - (larger) differences at higher levels

ASR – implementations, end-to-end models using real speech signals as input
  Thousands of exp.: WER has been gradually reduced
  However, essentially the same model
  New model: performance (WER), funding, etc.


The HSR-ASR gap - bridge

Use same evaluation metric for HSR & ASR systems: reaction times (Cutler & Robinson, 1992)

Use knowledge or components from the other field (Scharenborg et al., 2003).

Use models that are suitable for HSR & ASR research
Evaluation from HSR & ASR point of view

S2S – Sound to Sense (Sarah Hawkins)
Marie Curie Research Training Network (MC-RTN)
Recently approved by the EU


Episodic speech recognition


ESRASA model

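[Figure: ESRASA network. Feature vector (F1, F2, ..., FN) with attention weights (T1, T2, ..., TN); episodes with episode activations (EA1, EA2, ..., EAE) and base activations (B1, B2, ..., BE); association weights (C11, C12, C22, ..., CE2, ..., CEW); words with word activations (WA1, WA2, ..., WAW) and base activations (B1, B2, ..., BW).]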


ESRASA model

ESRASA: Episodic Speech Recognition And Structure Acquisition

The ESRASA model is inspired by several previous models, especially the model described in Johnson (1997), WRAPSA (Jusczyk, 1993), and GCM (Nosofsky, 1986).

The ESRASA model is a feedforward neural network with two sets of weights: atTention weights Tn and assoCiation weights Cew. Besides these two sets of weights, words, episodes (for speech units), and their base activation levels (Bw and Be, respectively) will be stored in memory.
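The slides name the components of the network but not its activation functions, so the following is only a guess at how the pieces could fit together. It assumes a fixed-length feature vector and borrows the exponential similarity of exemplar models such as the GCM: episode activation falls off with an attention-weighted squared distance, and word activation is the base level plus the association-weighted sum of episode activations. The function names and all numbers are invented for illustration.

```python
import math

def episode_activations(f, episodes, T, B_e):
    """Activation of each stored episode for the input feature vector f.

    T   : attention weights, one per feature dimension (T_n)
    B_e : base activation level of each episode
    """
    acts = []
    for episode, base in zip(episodes, B_e):
        # Attention-weighted squared distance between input and stored episode
        dist = sum(t * (fn - en) ** 2 for t, fn, en in zip(T, f, episode))
        acts.append(base * math.exp(-dist))  # assumed similarity function
    return acts

def word_activations(ep_acts, C, B_w):
    """C[e][w] is the association weight from episode e to word w (C_ew)."""
    return [
        B_w[w] + sum(ep_acts[e] * C[e][w] for e in range(len(ep_acts)))
        for w in range(len(B_w))
    ]

def winner(f, episodes, T, B_e, C, B_w):
    """Index of the most active word for input f."""
    words = word_activations(episode_activations(f, episodes, T, B_e), C, B_w)
    return max(range(len(words)), key=words.__getitem__)

# Three stored episodes: two of word 0, one of word 1 (toy values).
episodes = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0]]
T = [1.0, 0.5]
B_e = [1.0, 1.0, 1.0]
C = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
B_w = [0.0, 0.0]

near_word0 = winner([0.1, 0.0], episodes, T, B_e, C, B_w)
near_word1 = winner([1.0, 1.0], episodes, T, B_e, C, B_w)
```

Raising an attention weight makes mismatches in that feature dimension count more heavily, which is one way such a model could learn which dimensions matter.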


ESR Recognition

[Figure: recognition of input X in two stages. L items in lexicon -> Preselection -> S items in subset -> Competition -> 1 item, the winner.]


ESR Preselection

Why preselection?
  Reduce CPU & memory
  Increase performance
  Also used in DTW-based pattern recognition applications
  Used in many HSR models


ESR Competition

Recognize unknown word X:
  Calculate the distance between X and sequences of stored episodes (DTW)
  Take the one with minimum distance: the recognized word

Use continuity constraints (as in TTS)
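One way to read "continuity constraints (as in TTS)" is the join cost of unit-selection synthesis: when episodes are concatenated, a penalty is added for discontinuities at the boundaries. The sketch below assumes that interpretation. It also assumes the match (target) costs per unit slot are already available, e.g. as DTW distances to the corresponding stretch of the input, and uses a plain Euclidean distance between boundary frames as the join cost. `join_cost` and `best_sequence` are hypothetical names.

```python
import math

def join_cost(prev_episode, next_episode):
    """Distance between the last frame of one episode and the first frame of the next."""
    return math.dist(prev_episode[-1], next_episode[0])

def best_sequence(candidates, target_costs, join_weight=1.0):
    """Pick one episode per unit slot, minimising target cost plus weighted join cost.

    candidates[s]   : list of candidate episodes (lists of frames) for slot s
    target_costs[s] : match cost of each candidate in slot s
    Returns (total_cost, chosen_indices).
    """
    # best[k] = (cost of the cheapest path ending in candidate k, that path)
    best = [(cost, [k]) for k, cost in enumerate(target_costs[0])]
    for s in range(1, len(candidates)):
        new_best = []
        for k, episode in enumerate(candidates[s]):
            cost, path = min(
                (prev_cost + join_weight * join_cost(candidates[s - 1][j], episode), prev_path)
                for j, (prev_cost, prev_path) in enumerate(best)
            )
            new_best.append((cost + target_costs[s][k], path + [k]))
        best = new_best
    return min(best)

# Two unit slots with two candidate episodes each (toy 1-D frames).
candidates = [
    [[[0.0], [1.0]], [[4.0], [5.0]]],   # slot 0: ends at 1.0 / ends at 5.0
    [[[1.0], [2.0]], [[5.0], [6.0]]],   # slot 1: starts at 1.0 / starts at 5.0
]
target_costs = [[0.5, 0.4], [0.3, 0.6]]

with_joins = best_sequence(candidates, target_costs)
without_joins = best_sequence(candidates, target_costs, join_weight=0.0)
```

Without the join cost the locally best candidates are chosen, even though they do not fit together; with it, the smoother sequence wins.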


ESR DTW: Dynamic Time Warping


ESR – Research: Preselection?

Best method?
Compare:
  kNN – k nearest neighbor
  Lower bound distance: Ddtw Dlb d
  Build an index for the lexicon

Is preselection needed?
Compare: with & without preselection
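A lower bound is useful because it is cheap and never exceeds the true DTW distance, so any candidate whose bound is already worse than the best distance found so far can be skipped. The sketch below is one possible instance, not necessarily the bound meant on the slide: since every warping path aligns first frame with first frame and last with last, the sum of those two frame distances is a valid bound for this DTW variant. All function names are made up for the example.

```python
import math

def dtw_distance(x, y):
    """Cumulative DTW distance with Euclidean frame distance."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def endpoint_lower_bound(x, y):
    """Cheap lower bound on dtw_distance, using only the first and last frames."""
    first = math.dist(x[0], y[0])
    if len(x) == 1 and len(y) == 1:
        return first  # the warping path is a single cell
    return first + math.dist(x[-1], y[-1])

def recognize_with_pruning(x, lexicon):
    """Exact nearest-neighbour search; returns (word, distance, full_dtw_runs)."""
    # Visit candidates in order of increasing lower bound.
    scored = sorted(
        ((endpoint_lower_bound(x, traj), word, traj) for word, traj in lexicon),
        key=lambda item: item[0],
    )
    best_word, best_dist, runs = None, float("inf"), 0
    for bound, word, traj in scored:
        if bound >= best_dist:
            break  # no remaining candidate can beat the current best
        runs += 1
        d = dtw_distance(x, traj)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word, best_dist, runs

lexicon = [
    ("a", [[0.0], [1.0], [2.0]]),
    ("b", [[5.0], [6.0], [7.0]]),
    ("c", [[9.0], [9.0], [9.0]]),
]
x = [[0.0], [1.0], [1.0], [2.0]]
result = recognize_with_pruning(x, lexicon)
```

In this toy lexicon only one full DTW computation is needed. Comparing the result with and without pruning is the kind of experiment the slide proposes.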


ESR – Research: Units for preselection?

Compare:
  Syllable
  Word
  Begin (window of fixed length)


ESR - Research: Units for competition?

Compare:
  Syllables
  Words
  In combination with multisyllables?

Multisyllables (reduction, resyllabification):
  Ik weet het niet (‘I don’t know’) -> kweeni
  Op een gegeven moment (‘at a certain moment’) -> pgeefment
  Zeven-en (‘seven-and’) -> ze-fnen


ESR - Research: Exemplars?

How to select exemplars:
  DTW distances + hierarchical clustering
  VQ: LVQ & K-means

Trade-off normalization & (size) lexicon
Compare normalization techniques:
  TDNR, MVN, HN
  VTLN
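The first selection method, DTW distances followed by hierarchical clustering, could look roughly as follows. This is a sketch under assumptions the slides do not state: average linkage, a fixed number of clusters, and the medoid of each cluster as its exemplar. The function names and the toy tokens are invented.

```python
import math

def dtw_distance(x, y):
    """Cumulative DTW distance with Euclidean frame distance."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def cluster(dist, n_clusters):
    """Average-linkage agglomerative clustering on a full distance matrix."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

def medoid(members, dist):
    """Member with the smallest summed distance to the rest of its cluster."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))

def select_exemplars(tokens, n_exemplars):
    """Return the indices of the tokens chosen as exemplars."""
    dist = [[dtw_distance(a, b) for b in tokens] for a in tokens]
    return sorted(medoid(c, dist) for c in cluster(dist, n_exemplars))

# Six toy tokens: three rising contours and three falling ones.
tokens = [
    [[0.0], [1.0], [2.0]], [[0.0], [1.2], [2.0]], [[0.0], [0.6], [2.0]],
    [[5.0], [4.0], [3.0]], [[5.0], [4.2], [3.0]], [[5.0], [3.6], [3.0]],
]
exemplars = select_exemplars(tokens, 2)
```

Fewer exemplars mean a smaller lexicon but less of the variation is kept, which is the trade-off the slide points at.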


ESR - Research: Features?

Compare:
  Spectral features: MFCC, PLP, LPC
  Articulatory features (ANN)
  Combine spectral & articulatory feat.

Different features for preselection & competition?


ESR - Research: Distance metrics?

Compare (frame-based metrics):
  Euclidean
  Mahalanobis
  Itakura (for LPC)
  Perceptually-based?

Distance metric for trajectories?
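The choice of frame-based metric can change which stored frame counts as nearest. A minimal sketch comparing Euclidean distance with a Mahalanobis distance, assuming a diagonal covariance matrix for simplicity; the variances and frames are invented values.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two frames."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def mahalanobis_diag(a, b, variances):
    """Mahalanobis distance under a diagonal covariance matrix.

    Each squared difference is divided by the variance of that dimension,
    so dimensions with a large spread contribute less.
    """
    return math.sqrt(sum((p - q) ** 2 / v for p, q, v in zip(a, b, variances)))

# Invented example: dimension 0 varies far more than dimension 1.
variances = [100.0, 1.0]
x = [0.0, 0.0]
y1 = [10.0, 0.0]  # differs from x in the high-variance dimension
y2 = [0.0, 3.0]   # differs from x in the low-variance dimension

euclid_prefers_y2 = euclidean(x, y2) < euclidean(x, y1)
mahal_prefers_y1 = mahalanobis_diag(x, y1, variances) < mahalanobis_diag(x, y2, variances)
```

Either function could serve as the local distance inside DTW, so the frame metric also shapes the trajectory-level distance.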


HMM-based ASR: Information sources

HMM-based ASR, roughly 3 ways:
  1. Class-specific HMMs
  2. Multistream
  3. 2-pass decoding

Disadvantages:
  1. Many classes
  2. Synchronization & recombination
  3. Pass 1: no / less knowledge


ESR - Research: Information sources

ESR: compare 2 trajectories
  All details are available during search, e.g. context & dynamic information
  Compare shape + timing of feat. contours
    F0 rise: early or final, half or complete
  Tags can be added to the lexicon
  + continuity constraints


HSR - Foreign English Examples

Conversation about Italy.

dropped / robbed

I was robbed in Milan.
By parachute?

[ FEE 1 ]



HSR: Strip off indexical information?
No!

Familiar voices and accents: recognize and mimic [ FEE 2 ]

Indexical information is perceived and encoded



Verbal & indexical information: processed independently? No!

Familiar ‘voices’ are recognized better [ FEE 3 ]

Facilitation, also with ‘similar’ speech [ FEE 4 ]


ASR - Pronunciation variation

SRIV2006: ITRW on Speech Recognition and Intrinsic Variation

Pronunciation variation modeling for ASR:
  Improvements, but generally small
  Current ASR paradigm: suitable?

Phonetic transcriptions and their evaluation:
  Large differences between humans
  What is the ‘golden reference’?
  Speech – a sequence of symbols?
