“Automated” Speech Recognition
September 1, 2017
Professor Meteer
+ Speech Recognition
- Technology
  - What does it do?
    - Speech → text
    - Speech → “semantics”
  - How well does it work?
    - Human standards?
    - Where does it work well?
    - Where does it fail?
- Applications
  - What problems can it solve?
  - Do I need those problems solved by technology?
  - What is the benefit? Who benefits?
  - Is the performance adequate to solve the problem?
+ Feature extraction, acoustic modeling, language modeling, adaptation
- Feature extraction: sampling, signal processing, Mel-frequency cepstral coefficients (MFCC)
- Acoustic modeling: hidden Markov models, multivariate Gaussian mixture models, maximum entropy
- Language modeling: N-grams, multiwords, class grammars
- Adaptation: vocal tract length normalization, minimum phone error
+ Course outline
- Course work
  - Architecture and algorithms in speech recognizers
  - Audio signal to features
  - Acoustic models of phonemes
  - Statistical models of word sequences
  - Adaptation to dialects and idiolects
- Project (groups of ~3)
  - Learning how to use available tools, etc., on your own
  - Present to class on both process and result
  - “Experiential learning”, team building
- Choose one of:
  - Create a mobile application that uses speech recognition
    - Use the technology to solve a problem / assist in a task
    - Design dialogs so that the performance is up to the task
  - Build a new/better recognizer with Kaldi
    - For a particular language/domain
    - Show baseline and improvement
    - Document all steps
+ Course Overview
- Syllabus and schedule available at cs136a.mmeteer.com/
- Overview and history of speech
- Speech Applications
- Speech Algorithms
  - Phonetics
  - Algorithms
  - Acoustic modeling
  - Language modeling
- Speech Synthesis
- Dialog
+ Requirements
- Programming projects (60%)
  - Using speech toolkits: Watson & Mashup, Kaldi
  - Develop a speech application
    - Platforms: Android, Apple, AT&T, Google, Microsoft
    - Win a trip to California or cash prizes
  - Improve speech performance: acoustic modeling, language modeling, dictionary
- Quizzes and presentations (30%)
  - Quizzes on class material and readings
    - Mostly “take home” (via Latte)
    - Max word length on the answers
  - Presentations based on readings, groups and individuals
- Participation (10%)
  - Attendance, involvement
  - Posting on class forums, both specific assignments and comments
+ The AVIOS Student Speech Application Contest 2016-2017
- Demonstrate your creativity and programming skills in voice-enabled and multimodal applications by entering the AVIOS Speech Application Development Contest organized by the Applied Voice Input Output Society.
- Develop a speech-mediated application and win cash and prizes, as well as world-wide recognition on the AVIOS web site and other public announcements.
  - “Intent to submit” email by December 15, 2014
  - Application submitted in January 2016
- http://www.avios.org/
+ Recreating the Speech Chain
[Figure: the speech chain, from dialog, semantics, syntax, lexicon, morphology, and phonetics through the vocal-tract articulators to the inner ear and acoustic nerve, paired with the technologies that recreate it: speech recognition, spoken language understanding, dialog management, and speech synthesis]
(Thanks to Julia Hirschberg, Roberto Pieraccini and Francis Ganong for some slides)
+ 1969 – Whither Speech Recognition?
General-purpose speech recognition seems far away. Special-purpose speech recognition is severely limited. It would seem appropriate for people to ask themselves why they are working in the field and what they can expect to accomplish…
It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. That is a necessary but not sufficient condition. We are safe in asserting that speech recognition is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon. One doesn’t attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamour…
Most recognizers behave, not like scientists, but like mad inventors or untrustworthy engineers. The typical recognizer gets it into his head that he can solve “the problem.” The basis for this is either individual inspiration (the “mad inventor” source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach).
The Journal of the Acoustical Society of America, June 1969
J. R. Pierce, Executive Director, Bell Laboratories
1971-1976: The ARPA SUR project
- Despite the anti-speech-recognition campaign led by the Pierce Commission, ARPA launches the five-year Speech Understanding Research program
- Goal: 1000-word vocabulary, 90% understanding rate, near real time on a 100 MIPS machine
- Four systems built by the end of the program
  - SDC (24%)
  - BBN’s HWIM (44%)
  - CMU’s Hearsay II (74%)
  - CMU’s HARPY (95% -- but 80 times real time!)
- Rule-based systems except for Harpy
  - Engineering approach: search a network of all the possible utterances
Raj Reddy -- CMU
LESSON LEARNED: Hand-built knowledge does not scale up. Need for a global “optimization” criterion.
+ BBN 1982 Technical Report
HWIM: “Hear what I mean”
Bill Woods, Bonnie Webber, Scot Fertig, Andee Rubin, Ron Brachman, Marilyn Adams, Phil Cohen, Allen Collins, Marty Ringle
One of the original ARPA contracts on the Speech Understanding Program
+ Result
- Lack of clear evaluation criteria
  - ARPA felt the systems had failed
  - Project not extended
- Speech understanding: too early for its time
- Need for a standard evaluation method so that progress within and across systems could be measured
1980s -- The Statistical Approach
- Based on work on hidden Markov models done by Leonard Baum at IDA, Princeton in the late 1960s
- Purely statistical approach pursued by Fred Jelinek and Jim Baker, IBM T.J. Watson Research
$\hat{W} = \arg\max_{W} P(A \mid W)\,P(W)$
Fred Jelinek
[Figure: “Acoustic HMMs, Word Tri-grams” -- a three-state acoustic HMM (states S1, S2, S3 with transition probabilities a11, a12, a22, a23, a33) and a word trigram language model, $P(w_t \mid w_{t-1}, w_{t-2})$]
- “No data like more data”
- “Whenever I fire a linguist, our system performance improves” (1988)
- “Some of my best friends are linguists” (2004)
Jim Baker
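Not on the slides, but a minimal sketch of the HMM machinery this statistical approach relies on: the forward algorithm computes the likelihood of an observation sequence given a model, i.e. the $P(A \mid W)$ term in the equation above. The three-state model, its transition and emission probabilities, and the observation sequence below are all invented toy values, not trained parameters.

```python
import numpy as np

# Toy 3-state HMM (states S1, S2, S3), mirroring the figure above.
# All numbers are made-up illustrations, not trained values.
trans = np.array([[0.6, 0.4, 0.0],    # S1: self-loop a11, forward a12
                  [0.0, 0.7, 0.3],    # S2: self-loop a22, forward a23
                  [0.0, 0.0, 1.0]])   # S3: self-loop a33
initial = np.array([1.0, 0.0, 0.0])   # always start in S1

# Emission probabilities over a tiny discrete observation alphabet {0, 1};
# real systems emit continuous feature vectors via Gaussian mixtures.
emit = np.array([[0.9, 0.1],
                 [0.4, 0.6],
                 [0.2, 0.8]])

def forward_likelihood(obs):
    """P(observations | model) via the forward algorithm."""
    alpha = initial * emit[:, obs[0]]          # initialize with the first frame
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]   # propagate, then re-weight by emission
    return alpha.sum()

print(forward_likelihood([0, 0, 1, 1]))  # likelihood of one observation sequence
```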
1980-1990 – The statistical approach becomes ubiquitous
- Lawrence Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
+ Speech Technology Milestones, a la BBN
[Timeline figure, 1976-2005, with milestones including:]
- Early continuous speech recognizer using natural language understanding
- Early adopter of statistical hidden Markov models
- Introduced context-dependent phonetic units
- Pioneered statistical language understanding and data extraction
- BBN Hark
- First software-only, real-time, large-vocabulary, speaker-independent, continuous speech recognizer
- First 40,000-word real-time speech recognizer
- Rough’n’Ready prototype system for browsing audio
- AVOKE STX 1.0 introduced
- AVOKE STX 2.0 with Domain Development Tools
- Audio Indexer System – 1st generation
- Call Director deployed in Verizon
- DARPA EARS Program Award
- Exceeded DARPA EARS targets
+ A Decade of Progress in Speech Recognition
[Chart: word error rate (%), on a scale from 1 to 90, versus year (1987-1998) for tasks including Connected Digits and speaker-dependent Resource Management]
+ DARPA EARS for ASR Performance
BBN’s 2003 performance exceeds word error rate goals
[Chart: word error rate versus year (2002-2007), showing ceiling and floor targets for broadcast news and telephony and BBN’s 2003 result] (© 2002 BBN Technologies)
+ The evolution of speech recognition (from Roberto Pieraccini, Head of ICSI, at Mobile Voice 2012)
The core components were the same in 1992 and in 2012:
- Feature extraction: frame-based measures (a code sketch follows this slide)
  - Mel-frequency cepstral coefficients (MFCC)
  - Perceptual linear prediction (PLP)
  - Delta cepstra (and delta-delta, etc.)
- Acoustic modeling: hidden Markov models (HMMs) representing context-dependent phoneme-like units
- Language modeling: statistical language models representing context-dependent words
What changed from 1992 to 2012: MORE DATA, FASTER CPUs; normalization, adaptation, combination of different systems, ...
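As a concrete illustration of the frame-based front end above (not from the slides): a minimal sketch using the librosa library to compute MFCCs plus delta and delta-delta coefficients for a 16 kHz recording. The file name is hypothetical, and the frame sizes are typical choices rather than anything prescribed here.

```python
import librosa
import numpy as np

# Load audio at 16 kHz; "utterance.wav" is a hypothetical file name.
signal, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame: 25 ms windows (400 samples) every 10 ms (160 samples).
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# Delta and delta-delta coefficients capture how the spectrum changes over time.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

# Stack into one 39-dimensional feature vector per frame.
features = np.vstack([mfcc, delta, delta2])
print(features.shape)  # (39, number_of_frames)
```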
+ The evolution of language understanding (from Roberto Pieraccini, Head of ICSI, at Mobile Voice 2012)
The core components were the same in 1992 and in 2012:
- Handcrafted grammar-based semantic parsing
  - Context-free grammar tagging
  - Robust parsing
- Data-driven statistical models of semantic attributes
  - Concepts
  - Semantic classification
What changed from 1992 to 2012: MORE DATA, FASTER CPUs; standards (SRGS), tools, ...
+ 2011 Siri
- What is Siri? (from Roberto Pieraccini, Head of ICSI, at Mobile Voice 2012)
  - Practically infinite vocabulary
  - Contextual language understanding – ANSWERS ... NOT LINKS
  - Voice access to calendar and contacts; helps make reservations; gives answers on lots of things, including the meaning of life
  - Integrated within the iPhone, freely available to everyone (who buys an iPhone)
+ Siri: The Story So Far (from the Angel.com Mobile Voice presentation, citing sirinotes.com)
- July 2003: DARPA selects SRI to lead the Personalized Assistant that Learns (PAL) program. SRI and a team of 20 leading research organizations are awarded $22M over five years to develop software, dubbed CALO, to revolutionize how computers support military and other decision-makers. It is considered to be the largest artificial intelligence project in history.
- December 2007: SRI spins off Siri, Inc. to bring the technology to consumers.
- October 2008: Siri, Inc. announces it has raised an $8.5 million Series A financing round, led by Menlo Ventures and Morgenthaler Ventures.
- November 2009: Siri, Inc. raises a $15.5 million Series B financing round from the same investors as in the previous round, plus Hong Kong billionaire Li Ka-shing.
- February 2010: Siri, Inc. launches its Virtual Personal Assistant app for the iPhone 3GS.
- April 2010: Apple acquires Siri, Inc. from SRI.
- July 2013: Apple opens a lab in Boston to work on speech recognition and Siri.
+ More on Siri from Roberto
- Why is Siri successful?
  - Perception of intelligence
  - Fun to use; witty, catchy personality
  - iPhone design and Apple marketing
  - Works relatively well for a certain number of tasks
  - Improves with time
- So ... is speech recognition a solved problem?
  - NO ... and language understanding is even less solved
  - Fails where humans don’t
  - Little basic science
  - More data, more improvements ... but the rate of improvement is diminishing
  - Looks like we are hitting the intrinsic limitations of the underlying models
  - Each new task requires almost the same new level of effort
+ An Early Vision
- Apple’s Knowledge Navigator
  - The Knowledge Navigator video was made for Apple CEO John Sculley’s EDUCOM 1987 keynote, in six weeks on a $60,000 budget
  - https://www.youtube.com/watch?v=umJsITGzXd0
- Is the Knowledge Navigator Siri?
  - Based on the dates mentioned in the Knowledge Navigator video, it takes place on September 16, 2011.
  - The date on the professor’s calendar is September 16, and he’s looking for a 2006 paper written “about five years ago,” setting the year as 2011.
  - And in October 2011, at the iPhone keynote, Apple announced Siri, a natural-language voice assistant, would be built into iOS 5 and a core part of the new iPhone 4S.
  - So, 24 years earlier, Apple predicted a complex natural-language voice assistant built into a touchscreen Apple device, and was less than a month off.
+ The Noisy Channel Model
- Search through the space of all possible sentences.
- Pick the one that is most probable given the waveform.
Thanks to Dan Jurafsky for these slides
+ Design Intuition
- Build a statistical model of the speech sounds-to-words mapping
  - Collect lots and lots of speech, and transcribe all the words
  - Train the model on the labeled speech
- Create a “search space” of all the possible word combinations
  - Write a grammar
  - “Learn” likely sequences from lots of text
- Paradigm: supervised machine learning + search
  - Use the model to find the best sequence of words given the input sounds
+ Speech Recognition Pipeline
[Diagram: Feature Extraction → Segmentation → Decoding → Confidence Estimation, producing words, times, scores, and answers; knowledge sources feeding the pipeline: speech/non-speech model, acoustic model, grammar or language model, dictionary. A code sketch of this flow follows.]
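A sketch of how these stages might be wired together in code. Everything here (function names, types, return values) is hypothetical scaffolding to show the data flow on the diagram, not the API of any real toolkit; the four stage functions are stand-ins for a real front end, speech/non-speech detector, decoder, and confidence estimator.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Hypothesis:
    words: List[str]       # recognized words
    times: List[float]     # word start times (seconds)
    scores: List[float]    # per-word confidence scores

# Placeholder stage implementations, so the sketch actually runs.
def extract_features(audio: List[float]) -> List[List[float]]:
    return [[sum(audio)]]                      # stand-in "frame"

def segment(frames) -> List[list]:
    return [frames]                            # treat everything as one speech segment

def decode(seg, acoustic_model, language_model, dictionary) -> Tuple[List[str], List[float]]:
    return ["hello"], [0.0]                    # stand-in best word sequence

def estimate_confidence(seg, words) -> List[float]:
    return [0.5 for _ in words]                # stand-in scores

def recognize(audio, acoustic_model, language_model, dictionary):
    """Hypothetical pipeline mirroring the diagram above."""
    frames = extract_features(audio)
    segments = segment(frames)
    hyps = []
    for seg in segments:
        words, times = decode(seg, acoustic_model, language_model, dictionary)
        hyps.append(Hypothesis(words, times, estimate_confidence(seg, words)))
    return hyps

print(recognize([0.1, -0.2, 0.05], None, None, None))
```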
+ Speech Recognition Knowledge Sources
- Speech/non-speech: model of speech, pauses, coughs, and other sounds
- Acoustic model: models the relationship between the sounds and the phonemes; specific to a language (English or Spanish) and a channel (telephony or broadcast)
- Dictionary: list of all the words and their pronunciations, the “phonetic spelling” (a toy example follows)
- Grammar or SLM: a grammar lists all possible sentences that can be understood; a statistical language model (SLM) captures the likelihood of sequences of words
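For instance, a fragment of such a pronunciation dictionary might look like the following toy sketch. The entries use ARPAbet-style phoneme symbols and are chosen to match the examples on later slides; a real lexicon would contain tens of thousands of entries and alternate pronunciations.

```python
# Toy pronunciation dictionary: word -> list of phonemes (ARPAbet-style).
lexicon = {
    "nine":  ["N", "AY", "N"],
    "night": ["N", "AY", "T"],
    "disc":  ["D", "IH", "S", "K"],   # homophones: same sounds,
    "disk":  ["D", "IH", "S", "K"],   # different spellings
}

# The acoustic model can only tell us the sounds; the language model
# must decide between "disc" and "disk" from context.
for word, phones in lexicon.items():
    print(word, "->", " ".join(phones))
```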
+ Front End
[Figure: observations and state sequence]
+ Acoustic Modeling Training
Words:     nine                  night
Phonemes:  N-AY-N                N-AY-T
Triphones: <S>NAY, NAYN, AYN<S>  <S>NAY, NAYT, AYT<S>

Training learns the relationship between feature vectors and triphones (a sketch of the triphone expansion follows).
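A sketch of how the phoneme strings above expand into context-dependent triphones (left context, center phone, right context), with <S> marking the boundary as on the slide. The helper function and its output format are hypothetical, for illustration only.

```python
def to_triphones(phones, boundary="<S>"):
    """Expand a phoneme sequence into (left, center, right) triphones."""
    padded = [boundary] + list(phones) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

print(to_triphones(["N", "AY", "N"]))   # "nine"
# [('<S>', 'N', 'AY'), ('N', 'AY', 'N'), ('AY', 'N', '<S>')]
print(to_triphones(["N", "AY", "T"]))   # "night"
# [('<S>', 'N', 'AY'), ('N', 'AY', 'T'), ('AY', 'T', '<S>')]
```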
+ Embedded Training
+ Statistical Language Model (SLM)
- Captures how words are used in a particular domain
  - Specific to dialect (“in the hospital” vs. “in hospital”)
  - Specific to domain (frequency of different words)
  - Disambiguates homophones (“disc” vs. “disk”)
- But allows any word order
- Trigram model
  - Probability of a word given the previous two words (see the sketch after this slide)
What is the missing word? “I love _______”
NOW what is the missing word? “My favorite TV show was I love _______”
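A toy sketch of the idea: a trigram model estimates P(word | previous two words) from counts, which is how context like the slide’s example helps pin down a missing word. The tiny “corpus” below is invented for illustration; real models are trained on very large amounts of relevant text.

```python
from collections import Counter, defaultdict

# Invented toy corpus, tokenized into words.
corpus = ("my favorite tv show was i love lucy . "
          "i love pizza . i love music . i love lucy reruns .").split()

# Count trigrams: (w1, w2) -> counts of the word that follows.
trigram_counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w1, w2)][w3] += 1

def p_next(w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2)."""
    counts = trigram_counts[(w1, w2)]
    return counts[w3] / sum(counts.values()) if counts else 0.0

print(p_next("i", "love", "lucy"))     # one plausible continuation among several
print(p_next("i", "love", "pizza"))    # another; the counts decide between them
```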
+ Language (or Domain) Model: Statistical Grammar
- Word order is guided by statistics
- Data is used to count the likelihood of a word given the previous one (bigram)
  - “in the hospital”
  - “the in hospital”
  - “the hospital in”
- Creates a lattice that is searched
- Training data must be relevant!
+ Decoding: The Noisy Channel Model
- Search through the space of all possible sentences.
- Pick the one that is most probable given the waveform.
+ The Noisy Channel Model (Preview)
- What is the most likely sentence out of all sentences in the language L given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations: O = o1, o2, o3, ..., ot
- Define a sentence as a sequence of words: W = w1, w2, w3, ..., wn
+ Noisy Channel Model (Preview)
- Probabilistic implication: pick the highest-probability sentence W:
  $\hat{W} = \arg\max_{W \in L} P(W \mid O)$
- We can use Bayes’ rule to rewrite this:
  $\hat{W} = \arg\max_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)}$
- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
  $\hat{W} = \arg\max_{W \in L} P(O \mid W)\,P(W)$
+ Noisy channel model
$\hat{W} = \arg\max_{W \in L} \underbrace{P(O \mid W)}_{\text{likelihood}}\;\underbrace{P(W)}_{\text{prior}}$
+ Speech Architecture meets Noisy Channel
$\hat{W} = \arg\max_{W \in L} \underbrace{P(O \mid W)}_{\text{likelihood}}\;\underbrace{P(W)}_{\text{prior}}$
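A toy sketch of this decoding rule (not from the slides): score a handful of candidate word sequences by combining log acoustic likelihoods with log language-model priors and pick the best. The candidates and all the numbers are invented for illustration; a real decoder searches an enormous lattice of hypotheses rather than a short list.

```python
# Hypothetical candidate transcriptions for one utterance, with invented scores.
#   acoustic: log P(O | W) from the acoustic model
#   prior:    log P(W) from the language model
candidates = [
    {"words": "recognize speech",   "acoustic": -310.2, "prior": -11.4},
    {"words": "wreck a nice beach", "acoustic": -308.9, "prior": -17.8},
    {"words": "recognized speech",  "acoustic": -315.0, "prior": -12.1},
]

def total_log_prob(c, lm_weight=1.0):
    # log P(O | W) + log P(W); lm_weight lets us rebalance the two models.
    return c["acoustic"] + lm_weight * c["prior"]

best = max(candidates, key=total_log_prob)
print(best["words"])   # the W with the highest combined score
```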
+ Speech Recognition Architecture
[Figure: speech recognition architecture diagram]
+ Assignment 1: “State of the Art in Speech Recognition” by Makhoul and Schwartz
- Read and reflect on what you know and don’t know.
- Highlight areas you think are important and that fall into the following 3 categories. I suggest using highlighters, but use whatever works for you:
  - (green) This is something I’m familiar with, enough that I could explain it to someone else.
  - (yellow) I recognize the phrases and know I’ve studied it in the past, but couldn’t actually explain it.
  - (pink) This is like a different language. I can’t readily connect this with anything I know.
- You don’t need to highlight every paragraph, just those that strike you as important.
- Save this out and submit it EoD Monday, September 4. We will be discussing it in class.