A brief overview of Speech Recognition and Spoken Language Processing Advanced NLP Guest Lecture August 31 Andrew Rosenberg


Welcome: Introduction and Overview

Speech and NLP: Communication in Natural Language

Text:
- Carefully prepared
- Grammatical
- Machine readable
- Typos; sometimes OCR or handwriting issues

Speech:
- Spontaneous
- Less grammatical
- Machine readable, with > 10% error using speech recognition

NLP Tasks
- Parsing
- Name Tagging
- Sentiment Analysis
- Entity Coreference
- Relation Extraction
- Machine Translation

Speech Tasks
- Parsing: speech isn't always grammatical
- Name Tagging: if a name isn't in vocabulary, what do you do?
- Sentiment Analysis: how the words are spoken helps
- Entity Coreference, Relation Extraction, Machine Translation: how can these handle misrecognition errors?

Speech Tasks
- Speech Synthesis
- Text Normalization
- Dialog Management
- Topic Segmentation
- Language Identification
- Speaker Identification and Verification (authorship and security)

No one really worries about these in text, with the exception of Speaker ID.

The traditional view
Text Processing System (here, a Named Entity Recognizer):
- Training: text documents
- Application: text documents
The simplest approach.

Text Processing System:
- Training: text documents
- Application: transcribed documents
What's the problem with this?

Speech is errorful text
Text Processing System:
- Training: transcribed documents
- Application: transcribed documents
One better: use speech for both training and testing. What's the potential problem here?

Speech signal can be used
Text Processing System:
- Training: transcribed documents plus signal features
- Application: transcribed documents plus signal features
One better: use transcribed speech AND signal features. What's the potential problem here?

Hybrid speech signal and text
Text Processing System:
- Training: transcribed documents, signal features, and text documents
- Application: transcribed documents plus signal features

Speech Recognition
Standard HMM speech recognition.

Front End -> Acoustic Model -> Pronunciation Model -> Language Model -> Decoding

Speech Recognition

Speech -> Front End -> acoustic feature vectors -> Acoustic Model -> phone likelihoods -> Pronunciation Model -> word likelihoods -> Language Model -> word sequence

Speech Recognition

- Front End: convert sounds into a sequence of observation vectors
- Language Model: calculate the probability of a sequence of words
- Pronunciation Model: the probability of a pronunciation given a word
- Acoustic Model: the probability of a set of observations given a phone label
Decoding ties these together: search for the word sequence W* = argmax_W P(X|W) P(W), where X is the observation sequence.

Front End
How do we convert a waveform into a useful representation? We are looking for a vector of numbers which describes the acoustic content. Assuming 22kHz, 16-bit sound, modeling the raw samples directly is not feasible.

Discrete Cosine Transform
Every wave can be decomposed into component sine or cosine waves.

The Fast Fourier Transform (FFT) is used to do this efficiently.

Overlapping frames
Spectrograms allow for visual inspection of spectral information. We are looking for a compact, numerical representation.

[Figure: overlapping analysis frames, spaced 10ms apart]

Single Frame of FFT
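To make the framing concrete, here is a minimal sketch assuming a 16kHz mono signal in a NumPy array. The 25ms frame length is a common choice I am assuming; the slide only specifies the 10ms spacing:

```python
import numpy as np

def frame_signal(signal, sr, frame_ms=25, hop_ms=10):
    """Slice a 1-D waveform into overlapping frames."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)      # 1 second of a 440 Hz tone

frames = frame_signal(signal, sr)         # shape: (n_frames, frame_len)
window = np.hamming(frames.shape[1])      # taper edges to reduce spectral leakage
spectrum = np.abs(np.fft.rfft(frames[0] * window))
print(frames.shape, spectrum.argmax())    # peak bin 11 corresponds to ~440 Hz
```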

[Figure: FFT analysis of an Australian male /i:/ from "heed", 12.8ms analysis window; source: http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html]

Example Spectrogram

[Figure: example spectrogram from Praat]

Standard Representation
Mel Frequency Cepstral Coefficients (MFCC)

Pipeline: Pre-Emphasis -> window -> FFT -> Mel-Filter Bank -> log -> inverse FFT -> Deltas, plus an energy term. Each frame yields 12 MFCCs and 1 energy value, stacked with their deltas.
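In practice this whole pipeline is usually a library call. A minimal sketch assuming the librosa toolkit (my choice; the lecture names no toolkit) and a hypothetical file "utterance.wav". The exact stacking of deltas on the slide is ambiguous, so this shows the common 12 + deltas + delta-deltas layout:

```python
import librosa
import numpy as np

# Load audio at 16 kHz.
y, sr = librosa.load("utterance.wav", sr=16000)

# 12 MFCCs per frame; hop_length=160 samples is a 10ms step at 16 kHz.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, hop_length=160)

# First- and second-order deltas approximate local feature trajectories.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, delta, delta2])  # shape: (36, n_frames)
```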

Speech Recognition
(Recap: Front End converts sounds into observation vectors; the Language Model scores word sequences; the Pronunciation Model scores pronunciations given words; the Acoustic Model scores observations given phone labels.)

Language Model
What is the probability of a sequence of words?

Assume you have a vocabulary of V words. How many possible sequences of N words are there? V^N, far too many to estimate directly.

N-gram Language Modeling
Simplify the calculation. The big simplifying assumption: each word depends only on the previous N-1 words, so P(w_1 ... w_m) is approximated by the product of P(w_i | w_{i-N+1} ... w_{i-1}) terms.
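As a concrete illustration, a minimal bigram (N=2) model estimated from raw counts. The toy corpus and the add-one smoothing are my own illustrative choices, not from the lecture:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
vocab = set(corpus)

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word, alpha=1.0):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(vocab))

# "the cat" occurs twice among three occurrences of "the".
print(bigram_prob("the", "cat"))   # (2+1)/(3+6) = 0.333...
```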

N-gram Language Modeling
Same question: assume a V-word vocabulary and an N-word sequence. How many counts are necessary? V^N, one for each possible N-gram, which is why N is kept small in practice.

General Language Modeling
Any probability calculation can be used here: class-based language models, or e.g. recurrent neural networks.

Speech Recognition

(Recap: Front End, Acoustic Model, Pronunciation Model, Language Model, as above.)

Pronunciation Modeling
Identify the likelihood of a phone sequence given a word sequence. There are many simplifying assumptions in pronunciation modeling, e.g. that the pronunciation of each word is independent of the previous and following words.

Dictionary as Pronunciation Model
Assume each word has a single pronunciation:

  I        AY
  CAT      K AE T
  THE      DH AH
  HAD      HH AE D
  ABSURD   AH B S ER D
  YOU      Y UH

Weighted Dictionary as Pronunciation Model
Allow multiple pronunciations and weight each by its likelihood:

  I        AY       .4
  I        IH       .6
  THE      DH AH    .7
  THE      DH IY    .3
  YOU      Y UH     .5
  YOU      Y UW     .5
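A weighted dictionary is naturally a map from words to weighted pronunciation variants. A minimal sketch; the data structure and the sampling helper are illustrative choices, not from the lecture:

```python
import random

# word -> list of (phone sequence, probability) pairs
pron_dict = {
    "I":   [(["AY"], 0.4), (["IH"], 0.6)],
    "THE": [(["DH", "AH"], 0.7), (["DH", "IY"], 0.3)],
    "YOU": [(["Y", "UH"], 0.5), (["Y", "UW"], 0.5)],
}

def pron_prob(word, phones):
    """P(pronunciation | word); 0.0 for unlisted variants."""
    for variant, prob in pron_dict.get(word, []):
        if variant == phones:
            return prob
    return 0.0

def sample_pron(word):
    """Draw a pronunciation according to the variant weights."""
    variants, weights = zip(*pron_dict[word])
    return random.choices(variants, weights=weights, k=1)[0]

print(pron_prob("THE", ["DH", "AH"]))  # 0.7
```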

Grapheme to Phoneme conversion
What about words that you have never seen before? What if you don't think you've seen every possible pronunciation?

How do you pronounce: McKayla? or Zoomba?

Try to learn the phonetics of the language.

Letter to Sound Rules
Manually written rules that convert one or more letters to one or more sounds.

T -> /t/
H -> /h/
TH -> /dh/
E -> /e/

These rules can get complicated based on the surrounding context, e.g. K is silent when word-initial and followed by N.

Automatic learning of Letter to Sound rules
First: generate an alignment of letters and sounds ('-' marks an inserted gap):

  T  E  X  -  T
  T  EH K  S  T

Second: try to learn the mapping automatically. Generate features from the letter sequence and use these features to predict sounds. Almost any machine learning technique can be used; we'll use decision trees as an example (see the sketch after this example).

Decision Trees example
Context features for a letter 'p': the two letters to its left (L1, L2) and the two to its right (R1, R2).

  R1 = h?
  - Yes: L1 = o?
    - Yes: /p/ (loophole)
    - No: /f/ (physics, telephone, graph, photo)
  - No: R1 = consonant?
    - Yes: silent (psycho, pterodactyl, pneumonia)
    - No: /p/ (peanut, pay, apple)

Try PARIS. Now try GOPHER.
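A minimal sketch of the learning step using scikit-learn (my choice of toolkit, not the lecture's): encode the letter context around each 'p' and train a decision tree to predict its sound. The tiny training set mirrors the slide's examples:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# (word, index of the 'p', sound) triples from the slide; "-" marks a silent p.
examples = [
    ("loophole", 3, "p"), ("physics", 0, "f"), ("telephone", 4, "f"),
    ("graph", 3, "f"), ("photo", 0, "f"), ("peanut", 0, "p"),
    ("pay", 0, "p"), ("apple", 1, "p"), ("psycho", 0, "-"),
    ("pterodactyl", 0, "-"), ("pneumonia", 0, "-"),
]

def context(word, i):
    """L2, L1, R1, R2 letter-context features; '#' pads the word edges."""
    padded = "##" + word + "##"
    i += 2
    return {"L2": padded[i - 2], "L1": padded[i - 1],
            "R1": padded[i + 1], "R2": padded[i + 2]}

vec = DictVectorizer(sparse=False)
X = vec.fit_transform([context(w, i) for w, i, _ in examples])
y = [sound for _, _, sound in examples]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

for word in ["paris", "gopher"]:
    print(word, "->", tree.predict(vec.transform([context(word, word.index("p"))]))[0])
# GOPHER looks like "loophole" in this context window (L1='o', R1='h'),
# so a tree trained on so little data is likely to predict /p/ even
# though the correct answer is /f/: exactly the slide's point.
```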

(Recap: Front End, Acoustic Model, Pronunciation Model, Language Model, as above.)

Acoustic Modeling
Hidden Markov Model: used to model the relationship between two sequences.

Hidden Markov Model
In a Hidden Markov Model the state sequence is unobserved; only an observation sequence is available.

[Diagram: hidden states q1, q2, q3 emitting observations x1, x2, x3]

Hidden Markov Model
Observations are MFCC vectors; states are phone labels. Each state (phone) has an associated GMM modeling the MFCC likelihood.

Training acoustic models
- TIMIT: close, manual phonetic transcription; 2342 sentences
- Extract MFCC vectors from each frame within each phone
- For each phone, train a GMM using Expectation Maximization
- These GMMs are the Acoustic Model (see the sketch after the next slide)
- Common to use 8 or 16 Gaussian mixture components

Gaussian Mixture Model
[Figure: Gaussian mixture density]
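A minimal sketch of per-phone GMM training with scikit-learn (an assumed toolkit; GaussianMixture runs EM internally). frames_for_phone is a hypothetical stand-in for the TIMIT frame extraction, which is not shown:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in for MFCC frames aligned to each phone in TIMIT.
frames_for_phone = {
    "AE": np.random.randn(500, 13),         # 500 frames of 13-dim features
    "IY": np.random.randn(400, 13) + 1.0,
}

acoustic_model = {}
for phone, frames in frames_for_phone.items():
    # 8 diagonal-covariance components, fit with Expectation Maximization.
    gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(frames)
    acoustic_model[phone] = gmm

# Log-likelihood of a new frame under each phone model.
frame = np.random.randn(1, 13)
scores = {p: g.score(frame) for p, g in acoustic_model.items()}
print(max(scores, key=scores.get))  # most likely phone for this frame
```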

HMM Topology for Training
Rather than having one GMM per phone, it is common for acoustic models to represent each phone with three states, typically in triphone context.

[Diagram: HMM topology with states S1 through S5]
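The recognizer ultimately decodes by searching over these HMM states. A minimal Viterbi sketch over a toy two-state HMM, in log space; this is my own illustration of the standard algorithm, not code from the lecture:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence for an HMM.
    log_init: (S,), log_trans: (S, S), log_emit: (T, S)."""
    T, S = log_emit.shape
    score = log_init + log_emit[0]       # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers
    for t in range(1, T):
        cand = score[:, None] + log_trans     # cand[i, j]: best path i -> j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two toy states with "sticky" transitions.
log_init = np.log([0.6, 0.4])
log_trans = np.log([[0.9, 0.1], [0.1, 0.9]])
log_emit = np.log([[0.8, 0.2], [0.7, 0.3], [0.2, 0.8]])
print(viterbi(log_init, log_trans, log_emit))
# -> [0, 0, 0]: the sticky transitions outweigh the last frame's emission.
```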

[Diagram: HMM topology for /r/]

Speech in Natural Language Processing
ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHATS THE STATION NAME DOWNTOWN CROSSING UM AND THATLL GET YOU BACK TO THE RED LINE JUST AS EASILY

This is how much of spoken language processing and NLP treat speech.

There are transcription errors. There is no punctuation. There is no segmentation.

There are grammatical issues. Word choice is different, structure is different.

Speech in Natural Language Processing
Also, from the North Station...

(I think the Orange Line runs by there too so you can also catch the Orange Line... )

And then instead of transferring

(um I- you know, the map is really obvious about this but)

Instead of transferring at Park Street, you can transfer at (uh what's the station name) Downtown Crossing and (um) that'll get you back to the Red Line just as easily.

This is how much of spoken language processing and NLP treat speech.

There are transcription errors. There is no punctuation. There is no segmentation.

There are grammatical issues. Word choice is different, structure is different.

Spoken Language Processing
Speech Recognition -> NLP system (IR, IE, QA, Summarization, Topic Modeling)
This looks fundamentally appealing.

Step 1: turn speech into text. Step 2: process the text.

Spoken Language Processing
[the raw ASR transcript above] -> NLP system (IR, IE, QA, Summarization, Topic Modeling)
This text is incomplete. It doesn't contain all of the components of text that NLP systems expect: grammaticality, disfluencies, segmentation.

Dealing with Speech Errors
[the raw ASR transcript above] -> Robust NLP system (IR, IE, QA, Summarization, Topic Modeling)
Robust NLP systems:
1. Aggressive smoothing
2. Partial parsing
3. Weighting by confidence scores (see the sketch below)
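As one illustration of point 3, a hypothetical sketch that keeps only ASR tokens whose confidence clears a threshold before handing them to downstream NLP. The token/confidence format here is invented for illustration; real systems more often down-weight than discard:

```python
# Hypothetical (word, ASR confidence) pairs for one utterance.
tokens = [("ALSO", 0.95), ("FROM", 0.91), ("NORTH", 0.88),
          ("STATION", 0.90), ("UM", 0.45), ("I", 0.52)]

def confident_tokens(tokens, threshold=0.7):
    """Drop low-confidence words so downstream NLP sees cleaner input."""
    return [word for word, conf in tokens if conf >= threshold]

print(confident_tokens(tokens))  # ['ALSO', 'FROM', 'NORTH', 'STATION']
```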

Automatic Speech Recognition Assumption
ASR produces a transcript of speech. Assumption: all the information that you need is in the transcript.

Automatic Speech Recognition Assumption: Rich Transcription
Also, from the North Station...

(I think the Orange Line runs by there too so you can also catch the Orange Line... )

And then instead of transferring

(um I- you know, the map is really obvious about this but)

Instead of transferring at Park Street, you can transfer at (uh what's the station name) Downtown Crossing and (um) that'll get you back to the Red Line just as easily.

ASR produces a transcript of speech. Even rich transcription is missing information, but it's getting closer.

And it requires prosodic analysis.

Speech as Noisy Text: decrease WER, increase robustness
Speech Recognition -> Robust NLP system (IR, IE, QA, Summarization, Topic Modeling)
The common approach to improving performance on speech data.

Non-grammaticality, disfluencies, neologisms, and out-of-domain errors: inconsistencies between the systems' training data and the speech input.

Other directions for improvement
Speech Recognition (using lattices or N-best lists) + Prosodic Analysis -> Robust NLP system (IR, IE, QA, Summarization, Topic Modeling)
Prosody captures a lot of what's missing.

Also speaker ID, paralinguistics, etc.

Prosody
Variation in production properties that leads to changes in intended interpretation (a feature-extraction sketch follows the task list below):

- Pitch
- Intensity
- Duration, rhythm, speaking rate
- Spectral emphasis
- Pausing

Tasks that can use prosody
- Part of Speech Tagging [Eidelman et al. 2010]
- Parsing [Huang et al. 2010]
- Language Modeling [Su & Jelinek 2008]
- Pronunciation Modeling [Rosenberg 2012]
- Acoustic Modeling [Chen et al. 2006]

- Emotion Recognition [Lee et al. 2009]
- Topic Segmentation [Rosenberg & Hirschberg 2006; Rosenberg et al. 2007]
- Speaker Identification/Verification [Leung et al. 2008]
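A minimal sketch of extracting two of the prosodic cues above, pitch and intensity, assuming the librosa toolkit (my choice, not the lecture's) and a hypothetical "utterance.wav":

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)

# Pitch (F0) contour via probabilistic YIN; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)

# Intensity proxy: per-frame RMS energy.
rms = librosa.feature.rms(y=y)[0]

# Crude utterance-level prosodic summary features.
print("mean F0 (Hz):", np.nanmean(f0))
print("F0 range (Hz):", np.nanmax(f0) - np.nanmin(f0))
print("mean RMS:", rms.mean())
print("voiced fraction:", np.mean(voiced_flag))
```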

Processing Speech
Processing speech is difficult:
- There are errors in transcripts
- It is not grammatical
- The style (genre) of speech is different from the available (text) training data
Processing speech is easy:
- Speaker information
- Intention (sarcasm, certainty, emotion, etc.)
- Segmentation

Questions & Comments
- What topic was clearest? Murkiest?
- What was the most interesting? Least interesting?

[email protected]
http://speech.cs.qc.cuny.edu
http://eniac.cs.qc.cuny.edu/andrew
