
Dealing with Unknown Unknowns (in Speech Recognition)

Hynek Hermansky

Processing speech in multiple parallel streams, which attend to different parts of the signal space and use different strengths of prior top-down knowledge, is proposed as a way of dealing with unexpected signal distortions and unexpected lexical items. Some preliminary results in machine recognition of speech are presented.


There are things we do not know we don't know.

Donald Rumsfeld

“Funding artificial intelligence is real stupidity”

“After growing wildly for years, the field of computing appears to be reaching its infancy.”

Research field of “mad inventors or untrustworthy engineers”

• supervised the Bell Labs team which built the first transistor
• President’s Science Advisory Committee
• developed the concept of pulse code modulation
• designed and launched the first active communications satellite

Letter to the Editor, J. Acoust. Soc. Am.

… should people continue work towards speech recognition by machine? Perhaps it is for people in the field to decide.

Why am I working in this field?

Problems faced in machine recognition of speech reveal basic limitations of all information technology!

Why did I climb Mt. Everest? Because it is there!

-Sir Edmund Hillary

Spoken language is one of the most amazing accomplishments of the human race.

access to information

• voice interactions with machines
• extracting information from speech data!

production, perception, cognition,..

knowledge

We speak in order to be heard, in order to be understood.

-Roman Jakobson

data

Speech recognition…a problem of maximum likelihood decoding

-Frederick Jelinek

Hidden Markov Model

$$\hat{W} = \arg\max_{W} \; p(x \mid W)\, P(W)$$

$\hat{W}$: estimated speech utterance

$p(x \mid W)$: likelihood of the acoustic input $x$ under the acoustic models of speech sounds; the models are derived by training on very large amounts of speech data

$P(W)$: prior probability of the speech utterance (language model); the model is estimated from large amounts of data (typically text)
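The decoding rule can be illustrated with a minimal sketch over a small set of candidate utterances. The candidate strings and all scores below are hypothetical, and a real recognizer searches a far larger hypothesis space (typically with Viterbi decoding) rather than enumerating candidates. The sketch only shows how the acoustic likelihood and the language-model prior combine, and why an utterance with zero prior can never be selected.

```python
import math

def log_prior(p):
    """Log of a prior probability; a zero prior maps to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")

def decode(acoustic_loglik, lm_prior):
    """Return the candidate W that maximizes log p(x|W) + log P(W)."""
    return max(
        acoustic_loglik,
        key=lambda w: acoustic_loglik[w] + log_prior(lm_prior.get(w, 0.0)),
    )

# Hypothetical scores for one utterance x (illustrative numbers only).
acoustic_loglik = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
lm_prior = {"recognize speech": 0.02, "wreck a nice beach": 0.0001}

best = decode(acoustic_loglik, lm_prior)
print(best)
```

Here the second candidate fits the acoustics slightly better, but the language model makes the first one win. A candidate that the language model does not expect at all (prior of zero) loses regardless of its acoustic score.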

Stochastic recognition of speech

“Unknown unknowns” in machine recognition of speech

• distortions not seen in the training data of the acoustic model
• words that are not expected by the language model

One possible way of dealing with unknown unknowns

• Parallel information-providing streams, each carrying different redundant dimensions of a given target.
• A strategy for comparing the streams.
• A strategy for selecting “reliable” streams.

Stream formation

• Different perceptual modalities
• Different processing channels within each modality
• Bottom-up and top-down dominated channels

[Diagram labels: signal, information fusion, decision]

Comparing the streams ?

• various correlation (distance) measures
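One possible distance measure, sketched here as an assumption rather than as the measure used in the talk, is the symmetric Kullback-Leibler divergence between the posterior probability vectors that two streams produce for the same frame. The stream names and posterior values are hypothetical.

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence, KL(p||q) + KL(q||p), between two posteriors."""
    total = 0.0
    for pi, qi in zip(p, q):
        # Floor the probabilities so that the logarithm stays finite.
        pi, qi = max(pi, eps), max(qi, eps)
        total += (pi - qi) * math.log(pi / qi)
    return total

# Hypothetical posteriors over three speech sounds from three streams.
stream_a = [0.7, 0.2, 0.1]
stream_b = [0.6, 0.3, 0.1]
stream_c = [0.1, 0.1, 0.8]

d_ab = symmetric_kl(stream_a, stream_b)  # streams that agree: small distance
d_ac = symmetric_kl(stream_a, stream_c)  # streams that disagree: large distance
print(d_ab < d_ac)
```

A large distance between streams that normally agree is one way to flag that at least one of them is seeing something unexpected.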

Selecting reliable streams?

Information in speech is coded in many redundant dimensions. Not all dimensions get corrupted at the same time.

Fletcher et al.

The probability of error in recognizing full-band speech is given by the product of the probabilities of error in the individual subbands.

Boothroyd and Nittrouer

The probability of error in recognizing speech in context is given by the product of the probability of error without context and the probability of error in the channel that provides information about the context.

The final error is dominated by the channel with the smallest error!
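The product-of-errors rule is easy to check numerically. The sketch below uses hypothetical error probabilities; it shows that the combined error can never exceed the error of the best channel, which is why a single clean channel is enough to keep the overall error low.

```python
from math import prod

def combined_error(channel_errors):
    """Product-of-errors rule: the overall error is the product of channel errors."""
    return prod(channel_errors)

# Hypothetical error probabilities in three channels; the third is nearly clean.
channel_errors = [0.5, 0.4, 0.1]
total = combined_error(channel_errors)

# The combined error (about 0.02) is below the smallest channel error (0.1).
print(total, min(channel_errors))
```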

Perceptual Data

A large number of parallel processing streams

• Different carrier frequencies

• Different carrier bandwidths

• Different spectral and temporal resolutions

• Different modalities

• Different prior biases

Processing streams

different carrier frequencies

different temporal resolutions

different spectral resolutions

Auditory cortical receptive fields

Evidence for different processing strategies

[Figure: time [s] vs. frequency; from N. Mesgarani]

Evidence for equally powerful bottom-up and top-down streams?

From the subjective point of view, there is nothing special that would differentiate between the top-down and bottom-up dominated processing streams. All streams provide information for a decision. When all streams provide non-conflicting information, all this information is used for the decision. When the context allows for multiple interpretations of the sensory input, the bottom-up processing stream dominates. When the sensory input gets corrupted by noise, the top-down dominated stream fills in for the corrupted bottom-up input.

Hermansky 2013

Monitoring Performance

$P_1$, $P_2$: probabilities of correct recognition in two parallel streams

$$P_{\text{miss}} = (1 - P_1)(1 - P_2)$$

Could it be that we know when we know ?

observer - false positives and negatives are possible

$$P_{\text{miss, observed}} \neq (1 - P_1)(1 - P_2)$$

Knowing when one knows !
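The gap between the ideal and the observed miss probability can be made concrete with a small sketch. The observer model here is an assumption introduced only for illustration: when exactly one of the two streams is correct, the observer picks the correct one with probability `q` (a hypothetical reliability parameter). With `q = 1` the observer always knows when it knows, and the product rule is recovered.

```python
def miss_ideal(p1, p2):
    """Miss probability when a correct stream is always identified."""
    return (1 - p1) * (1 - p2)

def miss_observed(p1, p2, q):
    """Miss probability with an imperfect observer (assumed model).

    If exactly one stream is correct, the observer selects it with
    probability q and selects the wrong stream otherwise.
    """
    exactly_one_correct = p1 * (1 - p2) + (1 - p1) * p2
    return miss_ideal(p1, p2) + (1 - q) * exactly_one_correct

# Hypothetical stream accuracies and observer reliabilities.
print(miss_ideal(0.8, 0.7))
print(miss_observed(0.8, 0.7, q=1.0))  # perfect observer: same as the ideal
print(miss_observed(0.8, 0.7, q=0.9))  # imperfect observer: more misses
```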

Performance Monitoring in Sensory Perception

[Figure: human judgment (0 % to 100 %) as a function of picture density (low to high), with responses “sparse”, “dense”, and “not sure”; adapted from Smith et al. 2003]

similar data available for monkeys, dolphins, rats,…

[Diagram: training data → classifier → model of the output; testing data → classifier → model of the output; compare models → update]

Machine ?
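A minimal sketch of how a machine might compare a model of its output on training data with a model of its output on test data. Everything specific here is an assumption for illustration: the "model of the output" is reduced to the mean per-frame entropy of the classifier posteriors, and `tolerance` is a hypothetical threshold. The talk does not state which statistics its performance monitor uses.

```python
import math

def mean_entropy(posteriors):
    """Average per-frame entropy of the classifier posteriors (in nats)."""
    total = 0.0
    for frame in posteriors:
        total -= sum(p * math.log(p) for p in frame if p > 0)
    return total / len(posteriors)

def looks_unexpected(train_posteriors, test_posteriors, tolerance=0.2):
    """Flag test data whose output statistics drift away from training."""
    drift = abs(mean_entropy(test_posteriors) - mean_entropy(train_posteriors))
    return drift > tolerance

# Hypothetical posteriors over three classes, two frames each.
train_out = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05]]
clean_out = [[0.85, 0.10, 0.05], [0.10, 0.05, 0.85]]
noisy_out = [[0.40, 0.30, 0.30], [0.35, 0.35, 0.30]]

print(looks_unexpected(train_out, clean_out))  # confident output, like training
print(looks_unexpected(train_out, noisy_out))  # flat output, unlike training
```

The check needs no labels for the test data, which is the point: the monitor only asks whether the classifier behaves as it did on data it was trained on.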

[Figure: spectrogram (time vs. frequency) → data preprocessing → artificial neural network, trained on large amounts of labeled data → posteriogram]

[Figure: signal segments of up to 1 s → ANN fusion → phoneme posteriors]

Fusion of streams of different carrier frequencies [Hermansky et al. 1996, Li et al. 2013]

Preliminary results using multi-stream speech recognition on noisy TIMIT data

• Processing is done in multiple parallel streams
• Signal corruption affects only some streams
• Performance monitor selects N best streams for further processing
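The selection step can be sketched as follows. The system diagram for this experiment shows five subbands and 31 processing streams; one reading consistent with those numbers, assumed here, is that each stream corresponds to one non-empty combination of subbands (2^5 - 1 = 31). The monitor scores and posteriors in the usage example are hypothetical, and averaging the selected posteriors before decoding follows the "Average" block in the diagram.

```python
from itertools import combinations

def form_streams(n_subbands):
    """All non-empty combinations of subbands, one stream per combination."""
    bands = range(1, n_subbands + 1)
    return [combo
            for size in range(1, n_subbands + 1)
            for combo in combinations(bands, size)]

def select_n_best(scores, n):
    """Keep the n streams with the highest performance-monitor score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def average_posteriors(posteriors_by_stream, selected):
    """Average the posterior vectors of the selected streams."""
    n_classes = len(posteriors_by_stream[selected[0]])
    return [sum(posteriors_by_stream[s][k] for s in selected) / len(selected)
            for k in range(n_classes)]

streams = form_streams(5)
print(len(streams))  # 31

# Hypothetical monitor scores and two-class posteriors for three streams.
scores = {(1,): 0.2, (2,): 0.9, (1, 2): 0.7}
posteriors = {(1,): [0.3, 0.7], (2,): [0.8, 0.2], (1, 2): [0.6, 0.4]}
selected = select_n_best(scores, 2)
fused = average_posteriors(posteriors, selected)
```

In this toy example the stream that uses only subband 1 gets the lowest score and is dropped, so a corruption confined to that subband does not reach the decoder.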

[Diagram: speech signal → filterbank → Subband 1, Subband 2, …, Subband 5 → one ANN per subband → ANN fusion forming 31 processing streams → performance monitor selecting the N best streams → average → Viterbi decoder → phone sequence]

environment       conventional   proposed   best by hand
clean             31 %           28 %       25 %
car at 0 dB SNR   54 %           38 %       35 %

Phoneme recognition error rates on noisy TIMIT data

[Figure: “long, wide and deep” net: up to 1000 ms of signal; high-, mid-, and low-frequency components, each passed through many processing layers; “smart” fusion → (transformed) posterior probabilities of speech sounds]

[Figure: conventional “deep” net: up to 100 ms of signal; all available frequency components → many processing layers → (transformed) posterior probabilities of speech sounds]

[Diagram labels: time; getinfo 1, …, getinfo i, …, getinfo N]

Conclusions we would eventually like to make

• Recognition should be done in parallel processing streams, each attending to a particular aspect of the signal and using different levels of top-down expectations

• Discrepancy among the streams indicates an unexpected signal

• Suppressing corrupted streams can increase robustness to unexpected inputs

Machine Emulation of Human Speech Communication

… devise clear, simple, definitive experiments. So a science of speech can grow, certain step by certain step.

John Pierce

human communication, speech production, perception, neuroscience, cognitive science,..

We speak, in order to be heard, in order to be understood

Roman Jakobson

Speech recognition…a problem of maximum likelihood decoding

information and communication theory, machine learning, large data,….

Fred Jelinek

The complexity for minimum component costs has increased at a rate of roughly a factor of two per year…

Gordon Moore

tools

Also John Pierce: (Speech recognition is so far (1969) a field of) mad inventors or untrustworthy engineers (because a machine needs) intelligence and knowledge of language comparable to those of a native speaker.

Sounds like a good goal to aim at !

THANKS!

Nima Mesgarani, Samuel Thomas, Feipeng Li, Ehsan Variani, Vijay Peddinti, Jont Allen, Harish Mallidi, Misha Pavel, Hamed Ketabdar
