How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University Nijmegen, the Netherlands Radboud University Nijmegen

Page 1: Radboud University Nijmegen

How to handle pronunciation variation in ASR: By storing episodes in memory?

Helmer Strik
Centre for Language and Speech Technology (CLST)
Radboud University Nijmegen, the Netherlands

Radboud University Nijmegen

Page 2: Radboud University Nijmegen

Overview

Contents:
  Variation, invariance problem
  ASR: Automatic Speech Recognition
  HSR: Human Speech Recognition
  ESR: Episodic Speech Recognition

Page 3: Radboud University Nijmegen

Invariance problem (1)

One of the main issues in speech recognition is the large amount of variability present in speech.
SRIV2006: ITRW on Speech Recognition and Intrinsic Variation

Invariance problem:
  Variation in stimuli, invariant percept
  Also visual, tactile, etc.
  Studied in many fields, no consensus

2 paradigms:
  Invariant
  Episodic

Page 4: Radboud University Nijmegen

Invariance problem (1)

Example 1: Speech

Dutch word: “natuurlijk” (naturally, ‘of course’) [natyrlk] [natylk]… [tyk]

Multiword expressions (MWEs): a lot of reduction, many variants

Page 5: Radboud University Nijmegen

Invariance problem (2)

Example 2: Writing (vision)

[Figure: the word “natuurlijk” rendered twelve times in different fonts and handwriting styles]

Familiar ‘styles’ (fonts, handwriting) are recognized better

Page 6: Radboud University Nijmegen

ASR - Paradigm

Invariant, symbolic approach:
  utterance -> sequence of words -> sequence of phonemes -> sequence of states
  parametric description: pdf’s / ANN

Page 7: Radboud University Nijmegen

ASR - Paradigm

Same paradigm (HMMs) since the 70’s
Assumptions: incorrect, questionable
Insufficient performance

ASR vs. HSR: error rates 8-80x higher
Slow progress (ceiling effect?)
Simply using more and more data is not sufficient

(Moore, 2001)

A new paradigm is needed!
However, only a few attempts

Page 8: Radboud University Nijmegen

HSR - Indexical information

Speech - 2 types of information :

1. Verbal info.: what, contents
2. Indexical info.: how, form

e.g. environmental and speaker-specific aspects (pitch, loudness, speech rate, voice quality)

Page 9: Radboud University Nijmegen

HSR - Indexical information

Traditional ASR model:
  Verbal information is used
  Indexical information:
    Noise, disturbances
    Preprocessing:
      o Strip off
      o Normalization (VTLN, MLLR, etc.)

And in HSR?

Page 10: Radboud University Nijmegen

HSR - Indexical information

HSR : Strip off indexical information?

No!

Familiar voices and accents: recognize and mimic

Indexical information is perceived and encoded

Page 11: Radboud University Nijmegen

HSR - Indexical information

Verbal & indexical information: processed independently?

No!

Familiar ‘voices’ are recognized better

Facilitation, also with ‘similar’ speech

Page 12: Radboud University Nijmegen

HSR - Indexical and detailed information

Experimental results: indexical information and fine phonetic detail (Hawkins et al.) influence perception

Difficult to explain / integrate in the traditional, invariant model

New models: episodic models, for auditory and visual perception

Page 13: Radboud University Nijmegen

ESR - Basic idea

A new paradigm for ASR is needed: An episodic model !!??

Training:
  Store trajectories - (representatives of) episodes

Recognition:
  Calculate the distance between X and sequences of stored trajectories (DTW)
  Take the one with minimum distance: the recognized word
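The training and recognition steps above can be illustrated with a minimal sketch. It is not the system described in the talk: it assumes whole-word templates rather than sequences of shorter stored trajectories, a Euclidean frame distance, and no continuity constraints. The function names and the toy episode store are hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (frames)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(x, y):
    """Accumulated DTW cost between trajectories x and y (lists of frames)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of x
    # with the first j frames of y.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(x, episodes):
    """Return the label of the stored episode with minimum DTW distance to x."""
    return min(episodes, key=lambda ep: dtw_distance(x, ep[1]))[0]


# "Training" is just storage: (word label, trajectory of 1-dimensional frames).
episodes = [
    ("yes", [[0.0], [1.0], [2.0], [1.0]]),
    ("no", [[2.0], [2.0], [0.0], [0.0]]),
]

# A slower rendition of the "yes" contour; DTW absorbs the timing difference.
unknown = [[0.0], [0.0], [1.0], [2.0], [2.0], [1.0]]
print(recognize(unknown, episodes))
```

Because the stored trajectories are compared directly, nothing is estimated during training; all the work happens in the distance computation at recognition time.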

Page 14: Radboud University Nijmegen

ESR – Invariant vs. episodic

                  Phone-based HMM               ESR
-------------------------------------------------------------
Unit:             Phone                         Syllable, word, …
Representation:   States - pdf’s or ANN         Trajectories
Compare:          Trajectory (X) & states       Trajectory (X) & trajectories

                  Parsimonious representation   Extensive representation
                  Complex mapping               Simple mapping
                  ‘Variation is noise’          Variation contains info.
                  Normalization                 Use variation

Page 15: Radboud University Nijmegen

Phone ‘aj’ from ‘nine’.

X = begin

3 parts: aj(, aj|, aj)

Representation: pdf’s (Gaussians)

Much detail, dynamic information is lost

Trajectories: details
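A small sketch of why dynamic information is lost: within a single state, a Gaussian summarizes frames by mean and variance only, so the order of the frames no longer matters. The toy trajectories and helper names below are assumptions for illustration; a real HMM uses several states per phone, which preserves only a coarse ordering.

```python
import statistics

# Two toy one-dimensional feature trajectories: same frames, opposite order.
rising = [1.0, 2.0, 3.0, 4.0]
falling = [4.0, 3.0, 2.0, 1.0]


def gaussian_summary(frames):
    """Mean and variance of the frames, ignoring their order."""
    return statistics.mean(frames), statistics.pvariance(frames)


def framewise_distance(a, b):
    """Sum of absolute frame-by-frame differences (equal lengths assumed)."""
    return sum(abs(p - q) for p, q in zip(a, b))


# The Gaussian summaries are identical, the trajectories are not.
same_pdf = gaussian_summary(rising) == gaussian_summary(falling)
traj_dist = framewise_distance(rising, falling)
print(same_pdf, traj_dist)
```

A rising and a falling contour are indistinguishable to the single-state summary, while a direct trajectory comparison separates them.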

Page 16: Radboud University Nijmegen

Unit: phone(me)

Switchboard (Greenberg et al.):
  deletion: 25% of the phones
  substitution: 30% of the phones
  together 55%!!

Difficult for a model based on ‘sequences of phones’.
Syllables: less than 1% deleted

Phonetic transcriptions and their evaluation:
  Large differences between humans
  What is the ‘golden reference’?
  Speech – a sequence of symbols?

Page 17: Radboud University Nijmegen

Unit: Multiword expressions (MWEs)

MWEs (see poster):
  A lot of reduction; many phonemes deleted or substituted
  Many variants (= sequences of phonemes): more than 90 for 2 MWEs studied
  Difficult to handle in ASR systems with current methods for pronunciation variation modeling
  Reduction, e.g. for an MWE: 4 words with 7 syllables reduced to ‘1 entity’ with 2 syllables

What should be stored?
Units of various lengths?

Page 18: Radboud University Nijmegen

An episodic approach for ASR

Advantages:
  More information during search: dynamic, indexical, fine phonetic detail
  Continuity constraints can be used (reduces the trajectory folding problem)
  Model is simpler

Disadvantage:
  More information during search: complexity
    Brain: a lot of storage and ‘CPU’
    Computers: more and more powerful

Page 19: Radboud University Nijmegen

An episodic approach for ASR

Strik (2003) ITC-irst, Trento, Italy; ICPhS, Barcelona
De Wachter et al. (2003) Interspeech-2003
Axelrod & Maison (2004) ICASSP-2004
Maier & Moore (2005) Interspeech-2005
Aradilla, Vepa, Bourlard (2005) Interspeech-2005
Matton, De Wachter, et al. (2005) SPECOM-2005

Promising results
The computing power and memory that are needed to investigate the episodic approach to speech recognition are (becoming) available

Page 20: Radboud University Nijmegen

The HSR-ASR gap

HSR & ASR – 2 different communities
Different people, departments, journals, terminology, goals, methodologies

Goals, evaluation:
  HSR: simulate experimental findings
  ASR: reduce WER

Page 21: Radboud University Nijmegen

The HSR-ASR gap

Marr (1982) – 3 levels of modeling:
  1. Computational
  2. Algorithmic
  3. Implementational

HSR - (larger) differences at higher levels

ASR – implementations, end-to-end models using real speech signals as input
  Thousands of exp.: WER has been gradually reduced
  However, essentially the same model
  New model: performance (WER), funding, etc.

Page 22: Radboud University Nijmegen

The HSR-ASR gap - bridge

Use same evaluation metric for HSR & ASR systems: reaction times (Cutler & Robinson, 1992)

Use knowledge or components from the other field (Scharenborg et al., 2003).

Use models that are suitable for HSR & ASR research
Evaluation from HSR & ASR point of view

S2S – Sound to Sense (Sarah Hawkins)
Marie Curie Research Training Network (MC-RTN)
Recently approved by the EU

Page 23: Radboud University Nijmegen

Episodic speech recognition

Page 24: Radboud University Nijmegen

Page 25: Radboud University Nijmegen

ESRASA model

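[Figure: ESRASA network diagram. A feature vector (F1, F2, … FN) is weighted by attention weights (T1, T2, … TN) and feeds the episodes, which have episode activations (EA1, EA2, … EAE) and base activations (B1, B2, … BE). Association weights (C11, C12, C22, … CE2, … CEW) connect the episodes to the words, which have word activations (WA1, WA2, … WAW) and base activations (B1, B2, … BW).]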

Page 26: Radboud University Nijmegen

ESRASA model

ESRASA: Episodic Speech Recognition And Structure Acquisition

The ESRASA model is inspired by several previous models, especially the model described in Johnson (1997), WRAPSA (Jusczyk, 1993), and GCM (Nosofsky, 1986).

The ESRASA model is a feedforward neural network with two sets of weights: atTention weights Tn and assoCiation weights Cew. Besides these two sets of weights, words, episodes (for speech units), and their base activation levels (Bw and Be, respectively) will be stored in memory.
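The slides name the components of the network but not its activation functions. The sketch below therefore makes loud assumptions: episode activation uses an exponential similarity over an attention-weighted distance, as in exemplar models of the Generalized Context Model family, word activation is a base-scaled weighted sum, and episodes are simplified to fixed-length vectors instead of trajectories. All function names and values are hypothetical.

```python
import math


def episode_activation(features, episode, attention, base):
    """Assumed form: base * exp(-attention-weighted Euclidean distance)."""
    dist = math.sqrt(sum(t * (f - e) ** 2
                         for t, f, e in zip(attention, features, episode)))
    return base * math.exp(-dist)


def word_activations(features, episodes, attention, assoc, base_e, base_w):
    """Feed-forward pass: features -> episode activations -> word activations."""
    ea = [episode_activation(features, ep, attention, b)
          for ep, b in zip(episodes, base_e)]
    # assoc[e][w] plays the role of the association weight Cew.
    return [base_w[w] * sum(assoc[e][w] * ea[e] for e in range(len(ea)))
            for w in range(len(base_w))]


# Hypothetical toy memory: 3 stored episodes, 2 words, 2 features.
episodes = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]
assoc = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # episodes 0 and 1 -> word 0
attention = [1.0, 1.0]                        # attention weights Tn
base_e = [1.0, 1.0, 1.0]                      # base activations Be
base_w = [1.0, 1.0]                           # base activations Bw

wa = word_activations([0.05, 0.0], episodes, attention, assoc, base_e, base_w)
winner = max(range(len(wa)), key=wa.__getitem__)
print(winner, wa)
```

In this toy setup the input lies close to the two episodes associated with word 0, so their activations add up and word 0 wins.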

Page 27: Radboud University Nijmegen

ESR - Recognition

[Figure: unknown input X; L items in lexicon -> Preselection -> S items in subset -> Competition -> 1 item, the winner]

Page 28: Radboud University Nijmegen

ESR - Preselection

Why preselection?
  Reduce CPU & memory
  Increase performance
  Also used in DTW-based pattern recognition applications
  Used in many HSR models

Page 29: Radboud University Nijmegen

ESR - Competition

Recognize unknown word X:
  Calculate the distance between X and sequences of stored episodes (DTW)
  Take the one with minimum distance: the recognized word

Use continuity constraints (as in TTS)

Page 30: Radboud University Nijmegen

ESR - DTW: Dynamic Time Warping

Page 31: Radboud University Nijmegen

ESR – Research
Preselection?

Best method?
Compare:
  kNN – k nearest neighbor
  Lower bound distance: Ddtw ≥ Dlb > d
  Build an index for the lexicon

Is preselection needed?
Compare: with & without preselection
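One way to compare recognition with and without preselection is to count how many full DTW computations each needs. The sketch below uses a deliberately simple lower bound (the cost of matching the first and the last frames, which every warping path must include) rather than any specific bound from the literature. Scalar frames, the function names, and the toy lexicon are assumptions.

```python
def dtw(x, y):
    """Accumulated DTW cost between two sequences of scalar frames."""
    inf = float("inf")
    cost = [[inf] * (len(y) + 1) for _ in range(len(x) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(x)][len(y)]


def lower_bound(x, y):
    """Cheap bound: every warping path matches the first and the last frames."""
    first = abs(x[0] - y[0])
    if len(x) == 1 and len(y) == 1:
        return first  # a single cell, do not count it twice
    return first + abs(x[-1] - y[-1])


def recognize(x, lexicon, preselect=True):
    """Return (label, distance, number of full DTW computations)."""
    best_label, best_dist, n_dtw = None, float("inf"), 0
    for label, template in lexicon:
        if preselect and lower_bound(x, template) >= best_dist:
            continue  # Ddtw >= Dlb >= best so far, so this item cannot win
        n_dtw += 1
        dist = dtw(x, template)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist, n_dtw


lexicon = [
    ("a", [0.0, 1.0, 2.0, 1.0]),
    ("b", [5.0, 5.0, 6.0, 6.0]),
    ("c", [9.0, 8.0, 9.0, 9.0]),
]
x = [0.0, 0.0, 1.0, 2.0, 1.0]

with_pre = recognize(x, lexicon, preselect=True)
without_pre = recognize(x, lexicon, preselect=False)
print(with_pre, without_pre)
```

Both runs return the same winner, because the bound never exceeds the true DTW cost; only the number of full DTW computations differs.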

Page 32: Radboud University Nijmegen

ESR – Research
Units for preselection?

Compare:
  Syllable
  Word
  Begin (window of fixed length)

Page 33: Radboud University Nijmegen

ESR - Research
Units for competition?

Compare:
  Syllables
  Words
  In combination with multisyllables?

Multisyllables (reduction, resyllabification):
  Ik weet het niet (‘I don’t know’) -> kweeni
  Op een gegeven moment (‘at a given moment’) -> pgeefment
  Zeven-en (‘seven-and’) -> ze-fnen

Page 34: Radboud University Nijmegen

ESR - Research
Exemplars?

How to select exemplars:
  DTW distances + hierarchical clustering
  VQ: LVQ & K-means

Trade-off normalization & (size) lexicon
Compare normalization techniques:
  TDNR, MVN, HN
  VTLN
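The trade-off between normalization and lexicon size can be made concrete with MVN, taken here to mean per-utterance mean and variance normalization. In the sketch below two renditions that differ only in offset and scale collapse onto the same normalized trajectory, so one stored exemplar could cover both; the price is that the offset and scale information is gone. The function name and toy utterances are assumptions.

```python
import statistics


def mvn(frames):
    """Mean and variance normalization per feature dimension over one utterance."""
    dims = list(zip(*frames))
    means = [statistics.mean(d) for d in dims]
    # Guard against constant dimensions (standard deviation 0).
    stds = [statistics.pstdev(d) or 1.0 for d in dims]
    return [[(v - m) / s for v, m, s in zip(frame, means, stds)]
            for frame in frames]


utt_a = [[1.0], [2.0], [3.0]]
utt_b = [[12.0], [14.0], [16.0]]  # utt_a scaled by 2 and shifted by 10

norm_a, norm_b = mvn(utt_a), mvn(utt_b)
# Largest remaining difference between the two normalized trajectories.
max_diff = max(abs(p[0] - q[0]) for p, q in zip(norm_a, norm_b))
print(max_diff < 1e-9)
```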

Page 35: Radboud University Nijmegen

ESR - Research
Features?

Compare:
  Spectral features: MFCC, PLP, LPC
  Articulatory features (ANN)
  Combine spectral & articulatory feat.

Different features for preselection & competition?

Page 36: Radboud University Nijmegen

ESR - Research
Distance metrics?

Compare (frame-based metrics):
  Euclidean
  Mahalanobis
  Itakura (for LPC)
  Perceptually-based?

Distance metric for trajectories?
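Two of the frame-based metrics listed above can be contrasted in a few lines. The sketch assumes a diagonal covariance for the Mahalanobis distance (the general form uses the full inverse covariance matrix), and the variances and frames are invented for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance: every feature dimension counts equally."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mahalanobis_diag(a, b, variances):
    """Mahalanobis distance with an assumed diagonal covariance matrix."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))


reference = [0.0, 0.0]
frame_1 = [3.0, 0.0]     # deviates along the high-variance dimension
frame_2 = [0.0, 3.0]     # deviates along the low-variance dimension
variances = [9.0, 1.0]   # hypothetical per-dimension variances

print(euclidean(reference, frame_1), euclidean(reference, frame_2))
print(mahalanobis_diag(reference, frame_1, variances),
      mahalanobis_diag(reference, frame_2, variances))
```

Both frames are equally far from the reference in Euclidean terms, but the Mahalanobis distance treats a deviation along the high-variance dimension as less significant.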

Page 37: Radboud University Nijmegen

HMM-based ASR
Information sources

HMM-based ASR, roughly 3 ways:
  1. Class-specific HMMs
  2. Multistream
  3. 2-pass decoding

Disadvantages:
  1. Many classes
  2. Synchronization & recombination
  3. Pass 1: no / less knowledge

Page 38: Radboud University Nijmegen

ESR - Research
Information sources

ESR: compare 2 trajectories
All details are available during search, e.g. context & dynamic information
Compare shape + timing of feat. contours
  F0 rise: early or final, half or complete

Tags can be added to the lexicon
+ continuity constraints

Page 39: Radboud University Nijmegen

HSR - Foreign English Examples

Conversation about Italy.

dropped / robbed

I was robbed in Milan.
By parachute?

[ FEE 1 ]

Page 40: Radboud University Nijmegen

HSR - Indexical information

HSR: Strip off indexical information?
No!

Familiar voices and accents: recognize and mimic [ FEE 2 ]

Indexical information is perceived and encoded

Page 41: Radboud University Nijmegen

HSR - Indexical information

Verbal & indexical information: processed independently?
No!

Familiar ‘voices’ are recognized better [ FEE 3 ]

Facilitation, also with ‘similar’ speech [ FEE 4 ]

Page 42: Radboud University Nijmegen

ASR - Pronunciation variation

SRIV2006: ITRW on Speech Recognition and Intrinsic Variation

Pronunciation variation modeling for ASR:
  Improvements, but generally small
  Current ASR paradigm: suitable?

Phonetic transcriptions and their evaluation:
  Large differences between humans
  What is the ‘golden reference’?
  Speech – a sequence of symbols?