“Automated” Speech Recognition
September 1, 2017
Professor Meteer
+ Speech Recognition
- Technology
  - What does it do?
    - Speech → text
    - Speech → “semantics”
  - How well does it work?
    - Human standards?
    - Where does it work well?
    - Where does it fail?
- Applications
  - What problems can it solve?
  - Do I need those problems solved by technology?
  - What is the benefit? Who benefits?
  - Is the performance adequate to solve the problem?
+ Feature extraction, acoustic modeling, language modeling, adaptation
- Feature extraction: sampling, signal processing, Mel-frequency cepstral coefficients (MFCC)
- Acoustic modeling: hidden Markov models, multivariate Gaussian mixture models, maximum entropy
- Language modeling: N-grams, multiwords, class grammars
- Adaptation: vocal tract length normalization, minimum phone error
+ Course outline
- Course work
  - Architecture and algorithms in speech recognizers
  - Audio signal to features
  - Acoustic models of phonemes
  - Statistical models of word sequences
  - Adaptation to dialects and idiolects
- Project (groups of ~3)
  - Learning how to use available tools, etc., on your own
  - Present to class on both process and result
  - “Experiential learning”, team building
- Choose one of:
  - Create a mobile application that uses speech recognition
    - Use the technology to solve a problem / assist in a task
    - Design dialogs so that the performance is up to the task
  - Build a new/better recognizer with Kaldi
    - For a particular language/domain
    - Show baseline and improvement
    - Document all steps
+ Course Overview
- Syllabus and schedule available at cs136a.mmeteer.com/
- Overview and history of speech
- Speech Applications
- Speech Algorithms
  - Phonetics
  - Algorithms
  - Acoustic modeling
  - Language modeling
- Speech Synthesis
- Dialog
+ Requirements
- Programming projects (60%)
  - Using speech toolkits: Watson & Mashup, Kaldi
  - Develop a speech application
    - Platforms: Android, Apple, AT&T, Google, Microsoft
    - Win a trip to California or cash prizes
  - Improve speech performance: acoustic modeling, language modeling, dictionary
- Quizzes and presentations (30%)
  - Quizzes on class material and readings
    - Mostly “take home” (via Latte)
    - Max word length on the answers
  - Presentations based on readings, groups and individuals
- Participation (10%)
  - Attendance, involvement
  - Posting on class forums, both specific assignments and comments
+ The AVIOS Student Speech Application Contest 2016-2017
- Demonstrate your creativity and programming skills in voice-enabled and multimodal applications by entering the AVIOS Speech Application Development Contest organized by the Applied Voice Input Output Society.
- Develop a speech-mediated application and win cash and prizes, as well as world-wide recognition on the AVIOS web site and other public announcements.
  - “Intent to submit” email by December 15, 2014
  - Application submitted in January 2016
- http://www.avios.org/
+ Recreating the Speech Chain
[Figure: the speech chain, from dialog, semantics, syntax, lexicon, morphology, and phonetics through the vocal-tract articulators to the inner ear and acoustic nerve, paired with the technologies that recreate it: speech recognition, spoken language understanding, dialog management, and speech synthesis]
(Thanks to Julia Hirschberg, Roberto Pieraccini and Francis Ganong for some slides)
+ 1969 – Whither Speech Recognition?
General-purpose speech recognition seems far away. Special-purpose speech recognition is severely limited. It would seem appropriate for people to ask themselves why they are working in the field and what they can expect to accomplish…
It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. That is a necessary but not sufficient condition. We are safe in asserting that speech recognition is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon. One doesn’t attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamour…
Most recognizers behave, not like scientists, but like mad inventors or untrustworthy engineers. The typical recognizer gets it into his head that he can solve “the problem.” The basis for this is either individual inspiration (the “mad inventor” source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach).
The Journal of the Acoustical Society of America, June 1969
J. R. Pierce, Executive Director, Bell Laboratories
1971-1976: The ARPA SUR project
- Despite the anti-speech-recognition campaign led by the Pierce Commission, ARPA launches the five-year Speech Understanding Research program
- Goal: 1000-word vocabulary, 90% understanding rate, near real time on a 100 MIPS machine
- Four systems built by the end of the program
  - SDC (24%)
  - BBN’s HWIM (44%)
  - CMU’s Hearsay II (74%)
  - CMU’s HARPY (95% -- but 80 times real time!)
- Rule-based systems except for Harpy
  - Engineering approach: search a network of all the possible utterances
Raj Reddy -- CMU
LESSON LEARNED: Hand-built knowledge does not scale up. Need for a global “optimization” criterion.
+ BBN 1982 Technical Report
HWIM: “Hear what I mean”
Bill Woods, Bonnie Webber, Scot Fertig, Andee Rubin, Ron Brachman, Marilyn Adams, Phil Cohen, Allen Collins, Marty Ringle
One of the original ARPA contracts on the Speech Understanding Program
+ Result
- Lack of clear evaluation criteria
  - ARPA felt the systems had failed
  - Project not extended
- Speech understanding: too early for its time
- Need for a standard evaluation method so that progress within and across systems could be measured
1980s -- The Statistical Approach
- Based on work on hidden Markov models done by Leonard Baum at IDA, Princeton in the late 1960s
- Purely statistical approach pursued by Fred Jelinek and Jim Baker, IBM T.J. Watson Research
$\hat{W} = \arg\max_{W} P(A \mid W)\,P(W)$
Fred Jelinek
[Figure: “Acoustic HMMs, Word Tri-grams” -- a three-state acoustic HMM (states S1, S2, S3 with transition probabilities a11, a12, a22, a23, a33) and a word trigram language model, $P(w_t \mid w_{t-1}, w_{t-2})$]
- “No data like more data”
- “Whenever I fire a linguist, our system performance improves” (1988)
- “Some of my best friends are linguists” (2004)
Jim Baker
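Not on the slides, but a minimal sketch of the HMM machinery this statistical approach relies on: the forward algorithm computes the likelihood of an observation sequence given a model, i.e. the $P(A \mid W)$ term in the equation above. The three-state model, its transition and emission probabilities, and the observation sequence below are all invented toy values, not trained parameters.

```python
import numpy as np

# Toy 3-state HMM (states S1, S2, S3), mirroring the figure above.
# All numbers are made-up illustrations, not trained values.
trans = np.array([[0.6, 0.4, 0.0],    # S1: self-loop a11, forward a12
                  [0.0, 0.7, 0.3],    # S2: self-loop a22, forward a23
                  [0.0, 0.0, 1.0]])   # S3: self-loop a33
initial = np.array([1.0, 0.0, 0.0])   # always start in S1

# Emission probabilities over a tiny discrete observation alphabet {0, 1};
# real systems emit continuous feature vectors via Gaussian mixtures.
emit = np.array([[0.9, 0.1],
                 [0.4, 0.6],
                 [0.2, 0.8]])

def forward_likelihood(obs):
    """P(observations | model) via the forward algorithm."""
    alpha = initial * emit[:, obs[0]]          # initialize with the first frame
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]   # propagate, then re-weight by emission
    return alpha.sum()

print(forward_likelihood([0, 0, 1, 1]))  # likelihood of one observation sequence
```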
1980-1990 – The statistical approach becomes ubiquitous
- Lawrence Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
+ Speech Technology Milestones, a la BBN
[Timeline figure, 1976-2005, with milestones including:]
- Early continuous speech recognizer using natural language understanding
- Early adopter of statistical hidden Markov models
- Introduced context-dependent phonetic units
- Pioneered statistical language understanding and data extraction
- BBN Hark
- First software-only, real-time, large-vocabulary, speaker-independent, continuous speech recognizer
- First 40,000-word real-time speech recognizer
- Rough’n’Ready prototype system for browsing audio
- AVOKE STX 1.0 introduced
- AVOKE STX 2.0 with Domain Development Tools
- Audio Indexer System – 1st generation
- Call Director deployed in Verizon
- DARPA EARS Program Award
- Exceeded DARPA EARS targets
+ A Decade of Progress in Speech Recognition
[Chart: word error rate (%), on a scale from 1 to 90, versus year (1987-1998) for tasks including Connected Digits and speaker-dependent Resource Management]
+ DARPA EARS for ASR Performance
BBN’s 2003 performance exceeds word error rate goals
[Chart: word error rate versus year (2002-2007), showing ceiling and floor targets for broadcast news and telephony and BBN’s 2003 result] (© 2002 BBN Technologies)
+ The evolution of speech recognition (from Roberto Pieraccini, Head of ICSI, at Mobile Voice 2012)
The core components were the same in 1992 and in 2012:
- Feature extraction: frame-based measures (a code sketch follows this slide)
  - Mel-frequency cepstral coefficients (MFCC)
  - Perceptual linear prediction (PLP)
  - Delta cepstra (and delta-delta, etc.)
- Acoustic modeling: hidden Markov models (HMMs) representing context-dependent phoneme-like units
- Language modeling: statistical language models representing context-dependent words
What changed from 1992 to 2012: MORE DATA, FASTER CPUs; normalization, adaptation, combination of different systems, ...
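As a concrete illustration of the frame-based front end above (not from the slides): a minimal sketch using the librosa library to compute MFCCs plus delta and delta-delta coefficients for a 16 kHz recording. The file name is hypothetical, and the frame sizes are typical choices rather than anything prescribed here.

```python
import librosa
import numpy as np

# Load audio at 16 kHz; "utterance.wav" is a hypothetical file name.
signal, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame: 25 ms windows (400 samples) every 10 ms (160 samples).
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# Delta and delta-delta coefficients capture how the spectrum changes over time.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

# Stack into one 39-dimensional feature vector per frame.
features = np.vstack([mfcc, delta, delta2])
print(features.shape)  # (39, number_of_frames)
```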
+ The evolution of language understanding (from Roberto Pieraccini, Head of ICSI, at Mobile Voice 2012)
The core components were the same in 1992 and in 2012:
- Handcrafted grammar-based semantic parsing
  - Context-free grammar tagging
  - Robust parsing
- Data-driven statistical models of semantic attributes
  - Concepts
  - Semantic classification
What changed from 1992 to 2012: MORE DATA, FASTER CPUs; standards (SRGS), tools, ...
+ 2011 Siri
- What is Siri? (from Roberto Pieraccini, Head of ICSI, at Mobile Voice 2012)
  - Practically infinite vocabulary
  - Contextual language understanding – ANSWERS ... NOT LINKS
  - Voice access to calendar and contacts; helps make reservations; gives answers on lots of things, including the meaning of life
  - Integrated within the iPhone, freely available to everyone (who buys an iPhone)
+ Siri: The Story So Far (from the Angel.com Mobile Voice presentation, citing sirinotes.com)
- July 2003: DARPA selects SRI to lead the Personalized Assistant that Learns (PAL) program. SRI and a team of 20 leading research organizations are awarded $22M over five years to develop software, dubbed CALO, to revolutionize how computers support military and other decision-makers. It is considered to be the largest artificial intelligence project in history.
- December 2007: SRI spins off Siri, Inc. to bring the technology to consumers.
- October 2008: Siri, Inc. announces it has raised an $8.5 million Series A financing round, led by Menlo Ventures and Morgenthaler Ventures.
- November 2009: Siri, Inc. raises a $15.5 million Series B financing round from the same investors as in the previous round, plus Hong Kong billionaire Li Ka-shing.
- February 2010: Siri, Inc. launches its Virtual Personal Assistant app for the iPhone 3GS.
- April 2010: Apple acquires Siri, Inc. from SRI.
- July 2013: Apple opens a lab in Boston to work on speech recognition and Siri.
+ More on Siri from Roberto
- Why is Siri successful?
  - Perception of intelligence
  - Fun to use; witty, catchy personality
  - iPhone design and Apple marketing
  - Works relatively well for a certain number of tasks
  - Improves with time
- So ... is speech recognition a solved problem?
  - NO ... and language understanding is even less solved
  - Fails where humans don’t
  - Little basic science
  - More data, more improvements ... but the rate of improvement is diminishing
  - Looks like we are hitting the intrinsic limitations of the underlying models
  - Each new task requires almost the same new level of effort
+ An Early Vision
- Apple’s Knowledge Navigator
  - The Knowledge Navigator video was made for Apple CEO John Sculley’s EDUCOM 1987 keynote, in six weeks on a $60,000 budget
  - https://www.youtube.com/watch?v=umJsITGzXd0
- Is the Knowledge Navigator Siri?
  - Based on the dates mentioned in the Knowledge Navigator video, it takes place on September 16, 2011.
  - The date on the professor’s calendar is September 16, and he’s looking for a 2006 paper written “about five years ago,” setting the year as 2011.
  - And in October 2011, at the iPhone keynote, Apple announced Siri, a natural-language voice assistant, would be built into iOS 5 and a core part of the new iPhone 4S.
  - So, 24 years earlier, Apple predicted a complex natural-language voice assistant built into a touchscreen Apple device, and was less than a month off.
+ The Noisy Channel Model
- Search through the space of all possible sentences.
- Pick the one that is most probable given the waveform.
Thanks to Dan Jurafsky for these slides
+ Design Intuition
- Build a statistical model of the speech sounds-to-words mapping
  - Collect lots and lots of speech, and transcribe all the words
  - Train the model on the labeled speech
- Create a “search space” of all the possible word combinations
  - Write a grammar
  - “Learn” likely sequences from lots of text
- Paradigm: supervised machine learning + search
  - Use the model to find the best sequence of words given the input sounds
+ Speech Recognition Pipeline
[Diagram: Feature Extraction → Segmentation → Decoding → Confidence Estimation, producing words, times, scores, and answers; knowledge sources feeding the pipeline: speech/non-speech model, acoustic model, grammar or language model, dictionary. A code sketch of this flow follows.]
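A sketch of how these stages might be wired together in code. Everything here (function names, types, return values) is hypothetical scaffolding to show the data flow on the diagram, not the API of any real toolkit; the four stage functions are stand-ins for a real front end, speech/non-speech detector, decoder, and confidence estimator.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Hypothesis:
    words: List[str]       # recognized words
    times: List[float]     # word start times (seconds)
    scores: List[float]    # per-word confidence scores

# Placeholder stage implementations, so the sketch actually runs.
def extract_features(audio: List[float]) -> List[List[float]]:
    return [[sum(audio)]]                      # stand-in "frame"

def segment(frames) -> List[list]:
    return [frames]                            # treat everything as one speech segment

def decode(seg, acoustic_model, language_model, dictionary) -> Tuple[List[str], List[float]]:
    return ["hello"], [0.0]                    # stand-in best word sequence

def estimate_confidence(seg, words) -> List[float]:
    return [0.5 for _ in words]                # stand-in scores

def recognize(audio, acoustic_model, language_model, dictionary):
    """Hypothetical pipeline mirroring the diagram above."""
    frames = extract_features(audio)
    segments = segment(frames)
    hyps = []
    for seg in segments:
        words, times = decode(seg, acoustic_model, language_model, dictionary)
        hyps.append(Hypothesis(words, times, estimate_confidence(seg, words)))
    return hyps

print(recognize([0.1, -0.2, 0.05], None, None, None))
```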
+ Speech Recognition Knowledge Sources
- Speech/non-speech: model of speech, pauses, coughs, and other sounds
- Acoustic model: models the relationship between the sounds and the phonemes; specific to a language (English or Spanish) and a channel (telephony or broadcast)
- Dictionary: list of all the words and their pronunciations, the “phonetic spelling” (a toy example follows)
- Grammar or SLM: a grammar lists all possible sentences that can be understood; a statistical language model (SLM) captures the likelihood of sequences of words
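For instance, a fragment of such a pronunciation dictionary might look like the following toy sketch. The entries use ARPAbet-style phoneme symbols and are chosen to match the examples on later slides; a real lexicon would contain tens of thousands of entries and alternate pronunciations.

```python
# Toy pronunciation dictionary: word -> list of phonemes (ARPAbet-style).
lexicon = {
    "nine":  ["N", "AY", "N"],
    "night": ["N", "AY", "T"],
    "disc":  ["D", "IH", "S", "K"],   # homophones: same sounds,
    "disk":  ["D", "IH", "S", "K"],   # different spellings
}

# The acoustic model can only tell us the sounds; the language model
# must decide between "disc" and "disk" from context.
for word, phones in lexicon.items():
    print(word, "->", " ".join(phones))
```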
+ Front End
[Figure: observations and state sequence]
+ Acoustic Modeling Training
Words:     nine                  night
Phonemes:  N-AY-N                N-AY-T
Triphones: <S>NAY, NAYN, AYN<S>  <S>NAY, NAYT, AYT<S>

Training learns the relationship between feature vectors and triphones (a sketch of the triphone expansion follows).
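A sketch of how the phoneme strings above expand into context-dependent triphones (left context, center phone, right context), with <S> marking the boundary as on the slide. The helper function and its output format are hypothetical, for illustration only.

```python
def to_triphones(phones, boundary="<S>"):
    """Expand a phoneme sequence into (left, center, right) triphones."""
    padded = [boundary] + list(phones) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

print(to_triphones(["N", "AY", "N"]))   # "nine"
# [('<S>', 'N', 'AY'), ('N', 'AY', 'N'), ('AY', 'N', '<S>')]
print(to_triphones(["N", "AY", "T"]))   # "night"
# [('<S>', 'N', 'AY'), ('N', 'AY', 'T'), ('AY', 'T', '<S>')]
```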
+ Embedded Training
+ Statistical Language Model (SLM)
- Captures how words are used in a particular domain
  - Specific to dialect (“in the hospital” vs. “in hospital”)
  - Specific to domain (frequency of different words)
  - Disambiguates homophones (“disc” vs. “disk”)
- But allows any word order
- Trigram model
  - Probability of a word given the previous two words (see the sketch after this slide)
What is the missing word? “I love _______”
NOW what is the missing word? “My favorite TV show was I love _______”
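A toy sketch of the idea: a trigram model estimates P(word | previous two words) from counts, which is how context like the slide’s example helps pin down a missing word. The tiny “corpus” below is invented for illustration; real models are trained on very large amounts of relevant text.

```python
from collections import Counter, defaultdict

# Invented toy corpus, tokenized into words.
corpus = ("my favorite tv show was i love lucy . "
          "i love pizza . i love music . i love lucy reruns .").split()

# Count trigrams: (w1, w2) -> counts of the word that follows.
trigram_counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w1, w2)][w3] += 1

def p_next(w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2)."""
    counts = trigram_counts[(w1, w2)]
    return counts[w3] / sum(counts.values()) if counts else 0.0

print(p_next("i", "love", "lucy"))     # one plausible continuation among several
print(p_next("i", "love", "pizza"))    # another; the counts decide between them
```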
+ Language (or Domain) Model: Statistical Grammar
- Word order is guided by statistics
- Data is used to count the likelihood of a word given the previous one (bigram)
  - “in the hospital”
  - “the in hospital”
  - “the hospital in”
- Creates a lattice that is searched
- Training data must be relevant!
+ Decoding: The Noisy Channel Model
- Search through the space of all possible sentences.
- Pick the one that is most probable given the waveform.
+ The Noisy Channel Model (Preview)
- What is the most likely sentence out of all sentences in the language L given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations: O = o1, o2, o3, ..., ot
- Define a sentence as a sequence of words: W = w1, w2, w3, ..., wn
+ Noisy Channel Model (Preview)
- Probabilistic implication: pick the highest-probability sentence W:
  $\hat{W} = \arg\max_{W \in L} P(W \mid O)$
- We can use Bayes’ rule to rewrite this:
  $\hat{W} = \arg\max_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)}$
- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
  $\hat{W} = \arg\max_{W \in L} P(O \mid W)\,P(W)$
+ Noisy channel model
$\hat{W} = \arg\max_{W \in L} \underbrace{P(O \mid W)}_{\text{likelihood}}\;\underbrace{P(W)}_{\text{prior}}$
+ Speech Architecture meets Noisy Channel
$\hat{W} = \arg\max_{W \in L} \underbrace{P(O \mid W)}_{\text{likelihood}}\;\underbrace{P(W)}_{\text{prior}}$
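A toy sketch of this decoding rule (not from the slides): score a handful of candidate word sequences by combining log acoustic likelihoods with log language-model priors and pick the best. The candidates and all the numbers are invented for illustration; a real decoder searches an enormous lattice of hypotheses rather than a short list.

```python
# Hypothetical candidate transcriptions for one utterance, with invented scores.
#   acoustic: log P(O | W) from the acoustic model
#   prior:    log P(W) from the language model
candidates = [
    {"words": "recognize speech",   "acoustic": -310.2, "prior": -11.4},
    {"words": "wreck a nice beach", "acoustic": -308.9, "prior": -17.8},
    {"words": "recognized speech",  "acoustic": -315.0, "prior": -12.1},
]

def total_log_prob(c, lm_weight=1.0):
    # log P(O | W) + log P(W); lm_weight lets us rebalance the two models.
    return c["acoustic"] + lm_weight * c["prior"]

best = max(candidates, key=total_log_prob)
print(best["words"])   # the W with the highest combined score
```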
+ Speech Recognition Architecture
[Figure: speech recognition architecture diagram]
+ Assignment 1: “State of the Art in Speech Recognition” by Makhoul and Schwartz
- Read and reflect on what you know and don’t know.
- Highlight areas you think are important and that fall into the following 3 categories. I suggest using highlighters, but use whatever works for you:
  - (green) This is something I’m familiar with, enough that I could explain it to someone else.
  - (yellow) I recognize the phrases and know I’ve studied it in the past, but couldn’t actually explain it.
  - (pink) This is like a different language. I can’t readily connect this with anything I know.
- You don’t need to highlight every paragraph, just those that strike you as important.
- Save this out and submit it EoD Monday, September 4. We will be discussing it in class.