Introduction to Computer Speech Processing
Alex Acero, Research Area Manager, Microsoft Research. Outline: grand challenges in speech and language; vision videos; products today; prototypes; the role of speech; technology introduction.
![Page 1: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/1.jpg)
Introduction to Computer Speech Processing
Alex Acero
Research Area Manager
Microsoft Research
![Page 2: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/2.jpg)
Outline
• Grand challenges in Speech and Language
• Vision videos
• Products today
• Prototypes
• The role of speech
• Technology Introduction
![Page 4: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/4.jpg)
User Expectations for Speech
![Page 5: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/5.jpg)
The Turing Test
• Imitation Game:
– Judge, man, and a woman
– All chat via email.
– Man pretends to be a woman.
– Man lies, woman tries to help judge.
– Judge must identify man after 5 minutes.
• Turing Test
– Replace man or woman with a computer.
– Fool judge 30% of the time.
Thanks to Jim Gray for material
![Page 6: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/6.jpg)
What Turing Said
“I believe that in about fifty years' time it will be possible, to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning. The original question, "Can machines think?" I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.”
Alan M. Turing, 1950. “Computing Machinery and Intelligence.” Mind, Vol. LIX, 433-460
![Page 7: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/7.jpg)
Prediction 59 Years Later
• Turing’s technology forecast was great!
– Gigabyte memory is common
• Computer beat world chess champion
– with some help from its programming staff!
• Computers help design most things today
![Page 8: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/8.jpg)
Prediction 59 Years Later
• Intelligence forecast was optimistic
– Several internet sites offer Turing Test chatterbots.
– None pass (yet) http://www.loebner.net/Prizef/loebner-prize.html
• But I believe it will not be long:
– less than 50 years, more than 10 years
• Turing test still stands as a long-term challenge
![Page 9: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/9.jpg)
Challenges Implicit in the Turing Test
1. Read and understand as well as a human
2. Think and write as well as a human
3. Hear as well as a native speaker: Speech Recognition (speech to text)
4. Speak as well as a native speaker: Speech Synthesis (text to speech)
5. Remember what is heard and quickly return it on request.
![Page 10: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/10.jpg)
Moore’s law (1965)
• Gordon Moore: “The number of transistors per chip will double every 18 months”: 100x per decade
• Progress in next 18 months = ALL previous progress
– New storage = sum of all old storage (ever)
– New processing = sum of all old processing.
15 years ago
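The two numerical claims on this slide can be checked with a few lines of arithmetic. The sketch below assumes a clean doubling every 18 months, as the slide states; the function names are made up for illustration.

```python
def growth_factor(months, doubling_months=18.0):
    """Growth multiple after `months` if capacity doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)

# Ten years of doubling every 18 months: 2 ** (120 / 18), roughly 100x.
decade = growth_factor(120)

# Capacity added in each of the first 10 doubling periods: 1, 2, 4, ..., 512.
period_outputs = [2 ** k for k in range(10)]
all_previous = sum(period_outputs)   # equals 2 ** 10 - 1
next_period = 2 ** 10                # the next period alone

print(round(decade, 1), all_previous, next_period)
```

The geometric series explains the "new = sum of all old" bullet: each new doubling period adds one unit more than every earlier period combined.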
![Page 11: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/11.jpg)
Making Chips Smaller
• Advances in Lithography: science of "drawing" circuits on chips
• Impact of Moore’s law:
– Short distances => smaller processing time
– Smaller size => lower cost per transistor
– Amount of memory is increased
• But, it is not a law of physics: a mere self-fulfilling prophecy.
![Page 12: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/12.jpg)
Moore’s law not applicable to Machine Intelligence
• Speech technology benefited from Moore’s Law in the 1990s.
• In the 21st century, faster chips mean recognition errors appear faster
• New algorithmic advances needed to pass the Turing Test
• Error rate halves approx. every 7 years
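To get a feel for what "error rate halves approximately every 7 years" implies, here is a minimal sketch of the exponential decay. The 20% starting error rate and 5% target are hypothetical numbers chosen for illustration, not figures from the talk.

```python
import math

def error_rate(initial_rate, years, halving_years=7.0):
    """Error rate after `years` if it halves every `halving_years`."""
    return initial_rate * 0.5 ** (years / halving_years)

def years_to_reach(initial_rate, target_rate, halving_years=7.0):
    """Years needed for the error rate to fall from initial_rate to target_rate."""
    return halving_years * math.log2(initial_rate / target_rate)

# Hypothetical example: start at 20% word error rate.
after_14_years = error_rate(0.20, 14)      # two halvings
wait_for_5_percent = years_to_reach(0.20, 0.05)

print(after_14_years, wait_for_5_percent)
```

Over the same 7 years, a chip that doubles every 18 months would improve by a factor of about 25, which is the contrast the slide draws.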
![Page 13: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/13.jpg)
Grand Challenges
“Within 10 years speech will be in every device. Things like speech and ink are so natural, when they get the right quality level they will be in everything. As technical hurdles such as background noise and context are overcome, major adoption of speech technology will arrive. Soon, dictating to PCs and giving commands to cell phones will be basic modes of interacting with technology”
Bill Gates, March 2004
![Page 14: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/14.jpg)
Outline
• Grand challenges in Speech and Language
• Vision videos
• Products today
• Prototypes
• The role of speech
• Technology Introduction
![Page 15: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/15.jpg)
Speech in Mobile devices
![Page 16: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/16.jpg)
Speech for Students
![Page 17: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/17.jpg)
Speech in cars
![Page 18: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/18.jpg)
Soccer Mom in car
![Page 19: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/19.jpg)
Insurance Agent driving
![Page 20: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/20.jpg)
Outline
• Grand challenges in Speech and Language
• Vision videos
• Products today
• Prototypes
• The role of speech
• Technology Introduction
![Page 21: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/21.jpg)
Japanese dictation
![Page 22: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/22.jpg)
Telephony: Response point
![Page 23: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/23.jpg)
Directory Assistance
• Automatic generation of robust grammars
– Users say “Calabria” or “Calabria restaurant”
• Nearby cities
– Is “Calabria restaurant” in Redmond or Kirkland?
• Some people say the address too
– “Pizza Hut on 3rd Avenue” in New York, New York
• Automatic normalization
– Acronyms, compound words, homonyms, misspelled words
![Page 24: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/24.jpg)
Multimodal voice search
![Page 25: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/25.jpg)
Click-Driven Automated Feedback
Acoustic Model
Language Model
![Page 26: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/26.jpg)
Outline
• Grand challenges in Speech and Language
• Vision videos
• Products today
• Prototypes
• The role of speech
• Technology Introduction
![Page 27: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/27.jpg)
CommuteUX
![Page 28: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/28.jpg)
Speech in Education
![Page 29: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/29.jpg)
VerbalMath
![Page 30: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/30.jpg)
Virtual Receptionist
![Page 32: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/32.jpg)
Browsing a Video (Milind Mahajan & Patrick Nguyen)
![Page 33: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/33.jpg)
Podcast authoring (Patrick Nguyen)
![Page 34: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/34.jpg)
Outline
• Grand challenges in Speech and Language
• Vision videos
• Products today
• Prototypes
• The role of speech
• Technology Introduction
![Page 35: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/35.jpg)
Role of Speech in Different Devices
[Figure: devices (Phone, Screen Phone, Car, PDA, Internet TV, Tablet PC, PC) placed on two axes: ease of text input (keyboard/pen), low to high, and ease of GUI (screen/pointer), up to high.]
![Page 36: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/36.jpg)
A Roadmap for Speech
[Figure: the same devices (Phone, Screen Phone, Car, PDA, Internet TV, Tablet PC, PC) on the same axes, ease of text input (keyboard/pen) and ease of GUI (screen/pointer), with three regions marked: Speech-Only Telephony, Dictation, and Multimodal Command/Control.]
![Page 37: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/37.jpg)
Speech Technology
[Figure labels: Market Opportunity; Technology Readiness; Customer Need; Poor Alternative. Application areas shown: Meeting / Voicemail Transcription, Mobile Devices / Cars, Telephony / Call Center, Accessibility, Desktop Dictation, Desktop Command & Control.]
![Page 38: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/38.jpg)
Outline
• Grand challenges in Speech and Language
• Vision videos
• Products today
• Prototypes
• The role of speech
• Technology Introduction
![Page 39: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/39.jpg)
Voice-enabled System Technology Components
• ASR (Automatic Speech Recognition): speech => words
• SLU (Spoken Language Understanding): words => meaning
• DM (Dialog Management): meaning => action, using data and rules
• SLG (Spoken Language Generation): action => words
• TTS (Text-to-Speech Synthesis): words => speech
![Page 41: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/41.jpg)
Basic Formulation
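• Basic equation of speech recognition is

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)$$

X = X1, X2, …, Xn is the acoustic observation
W is the word sequence
P(X|W) is the acoustic model
P(W) is the language model
The second equality follows from Bayes' rule; P(X) is dropped because it does not depend on W.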
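As a rough illustration of how the acoustic model and language model combine, the sketch below picks the best of two candidate word sequences by maximizing log P(X|W) + log P(W). All scores are invented for illustration; a real recognizer computes them from trained models.

```python
import math

# Hypothetical scores, invented for illustration only.
acoustic_log_likelihood = {          # log P(X | W)
    "recognize speech": -12.0,
    "wreck a nice beach": -11.5,
}
language_model_prob = {              # P(W)
    "recognize speech": 1e-4,
    "wreck a nice beach": 1e-7,
}

def decode(acoustic, language):
    """Return the W that maximizes log P(X|W) + log P(W)."""
    return max(acoustic, key=lambda w: acoustic[w] + math.log(language[w]))

best_acoustic_only = max(acoustic_log_likelihood, key=acoustic_log_likelihood.get)
best = decode(acoustic_log_likelihood, language_model_prob)

print(best_acoustic_only, "|", best)
```

In this toy setup the acoustically closer hypothesis loses once the language model prior is added, which is the reason both terms appear in the equation.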
![Page 42: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/42.jpg)
Speech Recognition
[Block diagram: Input Speech => Feature Extraction => Pattern Classification (Decoding, Search) => Confidence Scoring => “Hello World” (0.9) (0.8). Pattern Classification draws on the Acoustic Model, Word Lexicon, and Language Model.]
![Page 43: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/43.jpg)
Feature Extraction
Goal: Extract robust features (information) from the speech that are relevant for ASR.
Method: Spectral analysis through either a bank of filters or Linear Predictive Coding, followed by a non-linearity and normalization.
Result: Signal compression: for each window of speech samples, 30 or so features are extracted (64,000 b/s -> 5,200 b/s).
Challenges: Robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise and echo.
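The method above (spectral analysis, non-linearity, normalization) can be sketched in pure Python. This is a simplified stand-in under stated assumptions: equal-width bands instead of the perceptually spaced filters commonly used in practice, a logarithm as the non-linearity, per-utterance mean subtraction as the normalization, and a synthetic tone instead of real speech. All function names and parameter values are illustrative.

```python
import cmath
import math

def frames(signal, frame_len, hop):
    """Split the signal into overlapping windows of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Apply a Hamming window and return |DFT|^2 for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))) ** 2
            for k in range(n // 2 + 1)]

def log_band_energies(frame, num_bands):
    """Bank of filters: sum power in equal-width bands, then take the log."""
    spec = power_spectrum(frame)
    width = len(spec) // num_bands   # the last leftover bin is ignored
    return [math.log(sum(spec[b * width:(b + 1) * width]) + 1e-10)
            for b in range(num_bands)]

def extract_features(signal, frame_len=64, hop=32, num_bands=4):
    feats = [log_band_energies(f, num_bands)
             for f in frames(signal, frame_len, hop)]
    # Normalization: subtract the per-band mean over the whole utterance.
    means = [sum(col) / len(feats) for col in zip(*feats)]
    return [[v - m for v, m in zip(row, means)] for row in feats]

# Synthetic 1 kHz tone sampled at 8 kHz (stands in for real speech).
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(256)]
features = extract_features(tone)
print(len(features), "frames x", len(features[0]), "features")
```

Each frame of 64 samples is reduced to 4 numbers here; the slide's "30 or so features" per window follows the same idea at a larger scale.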
![Page 44: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/44.jpg)
Acoustic Modeling
Goal: Model the probability of acoustic features for each phone model, i.e. p(X | /ae/)
Method: Hidden Markov Models (HMM), trained through maximum likelihood (EM) or discriminative methods
Challenges/variability:
• Background noise: Cocktail Party Effect
• Dialect/accent
• Speaker
• Phonetic context: “Italy” vs “Italian”
• No spaces in speech: “Recognize speech” vs “Wreck a nice beach”
[Figure: HMM state diagram with states 0, 1, 2]
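A quantity such as p(X | /ae/) can be computed for an HMM with the forward algorithm. The sketch below uses a three-state left-to-right HMM with discrete output symbols; the transition and emission probabilities are hypothetical, and systems of this kind typically use continuous densities over feature vectors rather than a two-symbol alphabet.

```python
def forward_likelihood(observations, initial, transitions, emissions):
    """p(X | model) for a discrete-output HMM via the forward algorithm."""
    num_states = len(initial)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [initial[s] * emissions[s][observations[0]]
             for s in range(num_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * transitions[p][s] for p in range(num_states))
            * emissions[s][obs]
            for s in range(num_states)
        ]
    return sum(alpha)

# Hypothetical three-state left-to-right phone model (states 0, 1, 2).
initial = [1.0, 0.0, 0.0]
transitions = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
emissions = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.5, "b": 0.5},
    {"a": 0.1, "b": 0.9},
]

likelihood = forward_likelihood(["a", "a", "b"], initial, transitions, emissions)
print(likelihood)
```

The forward pass sums over every state path, so its cost grows linearly with the number of frames instead of exponentially.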
![Page 45: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/45.jpg)
Word Lexicon
Goal: Map legal phone sequences into words according to phonotactic rules:
David /d/ /ey/ /v/ /ih/ /d/
Multiple pronunciations: Several words may have multiple pronunciations:
Data /d/ /ae/ /t/ /ax/
Data /d/ /ey/ /t/ /ax/
Challenges:
• How do you generate a word lexicon automatically?
• Letter-to-sound (LTS) rules can be automatically trained with decision trees (CART) with less than 8% errors, but proper nouns are hard!
• How do you add new variant dialects and word pronunciations?
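One simple way to represent such a lexicon is a mapping from each word to a list of phone sequences. This is a minimal sketch using only the two entries from the slide; the data structure and function names are illustrative assumptions, not the format of any particular recognizer.

```python
# Each word maps to one or more pronunciations (lists of phones).
LEXICON = {
    "david": [["d", "ey", "v", "ih", "d"]],
    "data": [["d", "ae", "t", "ax"], ["d", "ey", "t", "ax"]],
}

def pronunciations(word):
    """All known pronunciations of a word (empty list if out of vocabulary)."""
    return LEXICON.get(word.lower(), [])

def words_for(phones):
    """Words that have the given phone sequence as one of their pronunciations."""
    return sorted(w for w, prons in LEXICON.items() if list(phones) in prons)

print(pronunciations("Data"))
print(words_for(["d", "ey", "t", "ax"]))
print(pronunciations("Redmond"))   # out of vocabulary in this toy lexicon
```

An empty result for an unseen word is where letter-to-sound rules would take over.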
![Page 46: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/46.jpg)
Pattern Classification
Goal: Find the “optimal” word sequence by combining information (probabilities) from:
• Acoustic model
• Word lexicon
• Language model
Method: The decoder searches through all possible recognition choices using a Viterbi decoding algorithm.
Challenge: Search through a large network space is computationally expensive for large-vocabulary ASR; efficient techniques include beam search and weighted finite-state transducers (WFST).
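The Viterbi algorithm named above can be sketched on a tiny HMM. The model parameters below are hypothetical, and the example decodes a state sequence rather than words; a full decoder applies the same recursion over a network built from the acoustic model, lexicon, and language model.

```python
import math

def log(p):
    """Natural log that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations, states, initial, transitions, emissions):
    """Return (best_log_prob, best_state_path) for the observation sequence."""
    best = {s: (log(initial[s]) + log(emissions[s][observations[0]]), [s])
            for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Keep only the best-scoring way of arriving in state s.
            score, path = max(
                (best[p][0] + log(transitions[p][s]), best[p][1])
                for p in states
            )
            step[s] = (score + log(emissions[s][obs]), path + [s])
        best = step
    return max(best.values())

# Hypothetical three-state left-to-right model.
states = [0, 1, 2]
initial = [1.0, 0.0, 0.0]
transitions = [
    [0.6, 0.4, 0.0],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
]
emissions = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.5, "b": 0.5},
    {"a": 0.1, "b": 0.9},
]

score, path = viterbi(["a", "a", "b"], states, initial, transitions, emissions)
print(path, math.exp(score))
```

Beam search, mentioned in the challenge above, would prune entries of `best` whose score falls too far below the current maximum before each step.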
![Page 47: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/47.jpg)
Confidence Scoring
Goal: Identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU and DM.
Method: A confidence score based on a hypothesis likelihood ratio test is associated with each recognized word:
Label: credit please
Recognized: credit fees
Confidence: (0.9) (0.3)
Command-and-control: false rejection and false acceptance => ROC curves
Challenges: Rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.
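The trade-off between false acceptance and false rejection can be illustrated by thresholding per-word confidence scores. The sketch below uses the slide's "credit fees" example plus a small hypothetical set of scored words; it does not compute the likelihood ratio itself, only what happens once scores are available.

```python
def accept_words(words, confidences, threshold):
    """Pair each recognized word with an accept (True) or reject (False) decision."""
    return [(w, c >= threshold) for w, c in zip(words, confidences)]

def error_rates(scored, threshold):
    """scored: list of (confidence, is_correct) pairs.

    Returns (false_accept_rate, false_reject_rate) at the given threshold.
    """
    wrong = [c for c, ok in scored if not ok]
    right = [c for c, ok in scored if ok]
    false_accept = sum(c >= threshold for c in wrong) / len(wrong)
    false_reject = sum(c < threshold for c in right) / len(right)
    return false_accept, false_reject

# The slide's example: "credit" is right (0.9), "fees" is a misrecognition (0.3).
decisions = accept_words(["credit", "fees"], [0.9, 0.3], threshold=0.5)

# Hypothetical development set of (confidence, was the word correct?).
scored = [(0.9, True), (0.3, False), (0.8, True),
          (0.6, False), (0.4, True), (0.2, False)]

# Sweeping the threshold traces out the points of an ROC curve.
roc_points = [error_rates(scored, t) for t in (0.5, 0.7)]
print(decisions, roc_points)
```

Raising the threshold lowers false acceptances at the cost of rejecting more valid input, which is the tension the challenge above describes.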
![Page 48: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/48.jpg)
Voice-enabled System Technology Components
![Page 49: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/49.jpg)
Text-to-Speech Systems
TTS Engine:
• Input: raw text or tagged text
• Text Analysis (document structure detection, text normalization, linguistic analysis) => tagged text
• Phonetic Analysis (homograph disambiguation, grapheme-to-phoneme conversion) => tagged phones
• Prosodic Analysis (pitch & duration attachment) => controls
• Speech Synthesis (voice rendering) => speech audio out
![Page 50: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/50.jpg)
Multimedia Customer Care (Courtesy of AT&T)
![Page 51: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/51.jpg)
Voice-enabled System Technology Components
![Page 52: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/52.jpg)
Language Understanding
• Application Schema (XML for semantic entities) defines the application status
• A Semantic Context Free Grammar (CFG) parses an English sentence and fills in slots of the application schema.
![Page 53: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/53.jpg)
Application Schema
<itinerary>
  <origin>
    <city></city>
    <state></state>
  </origin>
  <destination>
    <city></city>
    <state></state>
  </destination>
  <date></date>
</itinerary>
![Page 54: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/54.jpg)
Semantic CFG
<rule name="itinerary">
  Show me flights from <ruleref name="origin"/>
  to <ruleref name="destination"/>
</rule>
<rule name="origin">
  <ruleref name="city"/>
</rule>
<rule name="destination">
  <ruleref name="city"/>
</rule>
<rule name="city">
  Seattle | San Francisco | New York
</rule>
![Page 55: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/55.jpg)
An example sentence
“Show me flights from Seattle to New York”
would populate the application schema as
<itinerary>
<origin>
<city>Seattle</city>
<state></state>
</origin>
<destination>
<city>New York</city>
<state></state>
</destination>
<date></date>
</itinerary>
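The slot filling shown above can be sketched with a tiny recursive matcher over the same four rules. This is an illustrative toy, assuming the grammar is written as Python lists with `<name>` standing for a rule reference; it is not the parser used in the talk, and it returns the filled schema as a dictionary instead of XML.

```python
# The four rules of the semantic CFG; "<name>" marks a rule reference.
GRAMMAR = {
    "itinerary": [["show", "me", "flights", "from", "<origin>",
                   "to", "<destination>"]],
    "origin": [["<city>"]],
    "destination": [["<city>"]],
    "city": [["seattle"], ["san", "francisco"], ["new", "york"]],
}

def match(rule, tokens, pos):
    """Return (end, slots) if `rule` matches tokens starting at pos, else None."""
    for alternative in GRAMMAR[rule]:
        end, slots = pos, {}
        for symbol in alternative:
            if symbol.startswith("<"):
                result = match(symbol[1:-1], tokens, end)
                if result is None:
                    break
                end, sub_slots = result
                slots.update(sub_slots)
            elif end < len(tokens) and tokens[end] == symbol:
                end += 1
            else:
                break
        else:
            # Every symbol matched: record the text covered by this rule.
            slots[rule] = " ".join(tokens[pos:end])
            return end, slots
    return None

def parse(sentence):
    """Fill the itinerary schema from a sentence, or return None if it does not parse."""
    tokens = sentence.lower().split()
    result = match("itinerary", tokens, 0)
    if result is None or result[0] != len(tokens):
        return None
    slots = result[1]
    return {
        "origin": {"city": slots["origin"].title(), "state": ""},
        "destination": {"city": slots["destination"].title(), "state": ""},
        "date": "",
    }

itinerary = parse("Show me flights from Seattle to New York")
print(itinerary)
```

Sentences outside the grammar, such as one naming a city that is not in the city rule, return `None`, which is the kind of brittleness the later slides on natural language address.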
![Page 56: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/56.jpg)
Voice-enabled System Technology Components
![Page 57: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/57.jpg)
Who manages the Dialog?
Directed Dialog
– “Who would you like to contact?”
– Finite State Machine
– Simple CFG
– MSConnect
User Initiative Dialog
– “What can I do for you?”
– N-grams
– Windows Airlines
[Figure labels: Initiative; Reservations, Flight Status, Baggage Claim, Special Announcements]
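A directed dialog driven by a finite state machine can be sketched as a table of states, each with a prompt and the phrases it accepts. The states, prompts, and menu wording below are hypothetical; they reuse the four destinations from the slide only as example menu choices.

```python
# Each state has a prompt and a map from recognized phrase to the next state.
DIALOG = {
    "main": {
        "prompt": "Say reservations, flight status, baggage claim, "
                  "or special announcements.",
        "next": {
            "reservations": "reservations",
            "flight status": "flight_status",
            "baggage claim": "baggage_claim",
            "special announcements": "announcements",
        },
    },
    "reservations": {"prompt": "Transferring you to reservations.", "next": {}},
    "flight_status": {"prompt": "What is the flight number?", "next": {}},
    "baggage_claim": {"prompt": "Transferring you to baggage claim.", "next": {}},
    "announcements": {"prompt": "Here are today's announcements.", "next": {}},
}

def step(state, heard):
    """Advance the state machine; stay in place (reprompt) if out of grammar."""
    return DIALOG[state]["next"].get(heard.strip().lower(), state)

def run(utterances, state="main"):
    """Feed a sequence of recognized utterances through the dialog."""
    prompts = [DIALOG[state]["prompt"]]
    for heard in utterances:
        state = step(state, heard)
        prompts.append(DIALOG[state]["prompt"])
    return state, prompts

final_state, prompts = run(["um, hello", "Flight Status"])
print(final_state)
```

Because the system only accepts the phrases listed for the current state, the grammar stays small, but the caller has to follow the menu one step at a time.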
![Page 58: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/58.jpg)
Problems with directed dialogs
![Page 59: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/59.jpg)
User-initiative dialogs
• Pros:
– Can result in a shorter call
– Can feel more natural
– Useful when too many choices
• Cons:
– Requires expensive expertise
– Could lead to user frustration: system appears human but caller can’t use full natural language
![Page 60: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/60.jpg)
NLU Dialog Module
• Drag-and-drop Dialog Flow Designer
• Developer specifies:
– Destination branches
– Example sentences per branch
– Prompts (initial, mumble, no speech, etc.)
• Module generates a statistical language model (SLM) and a classifier
• It handles confirmation, reprompt, etc.
![Page 61: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/61.jpg)
Natural Language
![Page 62: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/62.jpg)
Multimodal System Technology Components
[Diagram: the voice-enabled components (ASR, SLU, DM, SLG, TTS, with Data/Rules) extended with Visual, Pen, and Gesture modalities.]
![Page 63: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/63.jpg)
MiPad
• Multimodal Interactive Pad
• MiPad
– Tap and Talk combines speech and pen
– Use context to simplify recognition
– Dictation allows complex command entry
• Usability studies show double throughput for English
• Speech is mostly useful in cases with lots of alternatives
![Page 64: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/64.jpg)
Speech-centric Multimodal
![Page 65: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/65.jpg)
Multimodality Benefits
• Compared to speech-only:
– User sees system response more quickly
– User sees what system understood
– User can know what system expects
• Compared to GUI-only:
– Faster entry
– Better use of small screen
![Page 66: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/66.jpg)
![Page 67: Introduction to Computer Speech Processing](https://reader036.vdocuments.net/reader036/viewer/2022062301/5681587a550346895dc5da58/html5/thumbnails/67.jpg)
But general language understanding is hard