Bridging the Gap Between Speech Recognition and Business Logic
Wolf Paulus (@wolfpaulus)
History
Timeline, 1975 to 2015 (slide images not reproduced; only the labels survive):
1977
1981 (4 years later)
1984 (3 years later)
1995 (11 years later)
2007 (12 years later)
2011 (4 years later)
2013 (2 years later): Emotion and Voice enabled Avatar
2014 (30 years)
2015: Amazon Alexa powers Echo and is designed around your voice. It’s always on - just ask for information, news, weather, and more.
38 Years
Standalone text based ..
.. globally connected conversational Voice User Interface
Classification (chart axes: Customer Benefit and Integration Complexity)
Recognize who is speaking
Recognize what is said and write it down
Execute simple commands
Intelligently respond to natural language input
Classification (chart axes: Customer Benefit and Integration Complexity)
Bouncer
Secretary
Gopher
Assistant
?
Speech Input (encoded compressed sound file)
Speech Recognition returns a text
Mapping recognized words to pre-programmed actions
Digital art is an artistic work, using digital technology as an essential part of the creative process
When your phone’s screen is a canvas, your finger becomes a paint brush
All paintings were created with Artist on Android
Tool
Style
Action Bar
Color
Tasks: about, explore, publish, share, print, save, reset …
Tools: pen, brush, roll
Styles: emboss, blur, blend, erase, normal
Colors: 216 named colors …
Diagram (labels only): a Task, or Tool, Style, and Color in varying combinations and orders.
GrXML - Speech Recognition Grammar
http://www.w3.org/TR/speech-grammar/

<?xml version="1.0"?>
<grammar version="1.0" xml:lang="en-US"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2001/06/grammar http://www.w3.org/TR/speech-grammar/grammar.xsd"
    xmlns="http://www.w3.org/2001/06/grammar">
  ..
  <rule id="tool">
    <one-of>
      <item>pen</item>
      <item>brush</item>
      <item>roll</item>
    </one-of>
  </rule>
  ..
  <rule id="command">
    <one-of> <ruleref uri="#tool"/> </one-of>
    <one-of> <ruleref uri="#color"/> </one-of>
    <one-of> <ruleref uri="#style"/> </one-of>
    <one-of>
      <ruleref uri="#color"/>
      <ruleref uri="#tool"/>
    </one-of>
    ..
  </rule>
  ..
Slide annotations beside the grammar (layout not reproduced): multipliers 2x, 6x, 216x, 3x, 5x, 8x, with braces labeled "values" and "variables".
private void startVoiceRecognitionActivity() {
    final Intent intent = new Intent( RecognizerIntent.ACTION_RECOGNIZE_SPEECH );
    // Specify the calling package to identify your application.
    intent.putExtra( RecognizerIntent.EXTRA_CALLING_PACKAGE, getClass().getPackage().getName() );
    // Display a hint to the user about what to say.
    intent.putExtra( RecognizerIntent.EXTRA_PROMPT, getResources().getString( R.string.speakPROMPT ) );
    // Give the recognizer a hint about what the user is going to say.
    intent.putExtra( RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM );
    // Specify how many results you want to receive. Results are sorted
    // so that the first one has the highest confidence.
    intent.putExtra( RecognizerIntent.EXTRA_MAX_RESULTS, 1 );
    intent.putExtra( RecognizerIntent.EXTRA_LANGUAGE, new Locale("en").getLanguage() );
    startActivityForResult( intent, SPEECH_RECOGNITION_REQUEST_CODE );
}
Request …
… Response
@Override
protected void onActivityResult( final int requestCode, final int resultCode, final Intent data ) {
    switch (requestCode) {
        case SPEECH_RECOGNITION_REQUEST_CODE:
            if (resultCode == Activity.RESULT_OK && data != null) {
                final ArrayList<String> matches = data.getStringArrayListExtra( RecognizerIntent.EXTRA_RESULTS );
                // Guard against a missing or empty result list instead of relying on assert.
                if (matches != null && !matches.isEmpty()) {
                    do_something_with_the_result( matches.get( 0 ) );
                }
            } else {
                mTV_STT.setText( R.string.tapScreen );
            }
            break;
        default:
            mTV_STT.setText("");
            break;
    }
}
Speech Input (encoded compressed sound file)
Speech Recognition returns a text, mainly based on statistical models that prioritize frequently used words and words that are frequently used together. As a result, an utterance like “blue blur brush” is unlikely to be recognized correctly.
Mapping recognized words to actions. E.g. “What’s my schedule for tomorrow” opens the calendar app
Bobby "Blue" Bland January 27, 1930 – June 23, 2013
blue blend ‣ blue bland
powder blue brush ‣ powder blue bra
turquoise blur brush ‣ turquoise player price
cyan brush ‣ Diane Brush
sky blue emboss pen ‣ sky blue in Boston
Homophones: pronounced the same, differ in meaning, and may differ in spelling. E.g.: to, too, two, and there, their, they’re
Synophones: similar pronunciations, different meanings. E.g.: sheep, Jeep, cheap
Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions, or substitutions) required to change one word into the other.
Dr. Vladimir Levenshtein
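The definition above can be sketched in a few lines of Java. This is a minimal illustration using the common two-row dynamic programming formulation; the class name `Levenshtein` and the method `distance` are made up for this example and are not taken from the talk's repository.

```java
// Hypothetical helper for illustration; not the implementation used in the talk.
final class Levenshtein {

    /** Minimum number of single-character insertions, deletions, or substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning an empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions turn the first i characters of a into ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("bland", "blend")); // one substitution
        System.out.println(distance("bra", "brush"));   // one substitution, two insertions
    }
}
```

On the misrecognitions shown earlier, "bland" is one edit away from "blend", while "bra" is three edits away from "brush", which hints at why plain edit distance on spelling is not always enough.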
Phonetic Algorithms
Soundex is a phonetic algorithm for indexing names by their pronunciation.
For instance, both "Robert" and "Rupert" return the same encoding, but not "Rubin". Also, "Ashcraft" and "Ashcroft" return the same encoding.
Soundex Algorithm
1. Retain the first letter.
2. Drop all occurrences of a, e, i, o, u, h, w, y.
3. Replace consonants with digits as follows:
   • b, f, p, v → 1
   • c, g, j, k, q, s, x, z → 2
   • d, t → 3
   • l → 4
   • m, n → 5
   • r → 6
4. Remove duplicates:
   1. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter.
   2. Two letters with the same number separated by 'h' or 'w' are coded as a single number.
5. Continue until you have one letter and three numbers; if there are fewer than three numbers, append ‘0’.

Soundex Coding Guide
Number  Represents the Letters
1       b, f, p, v
2       c, g, j, k, q, s, x, z
3       d, t
4       l
5       m, n
6       r
Disregard the letters: a, e, i, o, u, h, w, y
Example: “Robert”
1. Retain the first letter: Robert
2. Drop all occurrences of a, e, i, o, u, h, w, y: Rbrt
3. Replace consonants with digits (b, f, p, v → 1; c, g, j, k, q, s, x, z → 2; d, t → 3; l → 4; m, n → 5; r → 6): R163
4. Remove duplicates
5. Continue until you have one letter and three numbers, or append ‘0’

Robert : R163
Rupert : R163
Rubin : R150
Ashcraft : A261
Ashcroft : A261
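The coding rules above can be sketched as follows. This is a minimal, self-written illustration of the rules as listed on the slide, not the Apache Commons Codec implementation; the class name `SimpleSoundex` is an assumption made for this example.

```java
import java.util.Locale;

// Hypothetical helper for illustration; follows the rules listed on the slide.
final class SimpleSoundex {

    // Digit for each letter A..Z; '0' marks the disregarded letters a, e, i, o, u, h, w, y.
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        String s = name.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (s.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder().append(s.charAt(0)); // retain the first letter
        char last = CODES.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // h and w do not separate letters that share a number
            }
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // a vowel resets 'last', so equal numbers around a vowel are both kept
        }
        while (out.length() < 4) {
            out.append('0'); // pad to one letter and three digits
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String n : new String[] {"Robert", "Rupert", "Rubin", "Ashcraft", "Ashcroft"}) {
            System.out.println(n + " : " + encode(n));
        }
    }
}
```

Run on the five names from the slide, the sketch reproduces the codes listed above (R163, R163, R150, A261, A261).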
Language Encoders
Caverphone 1.0: Encodes a string into a Caverphone 1.0 value.
Caverphone 2.0: Encodes a string into a Caverphone 2.0 value.
Cologne Phonetic: Encodes a string into a Cologne Phonetic value.
Double Metaphone: Encodes a string into a double metaphone value.
Metaphone: Encodes a string into a Metaphone value.
Refined Soundex: Encodes a string into a Refined Soundex value.
Soundex: Encodes a string into a Soundex value.
…
Rank  Algorithm                        Correction Rate
1     RefinedSoundex                   64%
2     Soundex                          55%
3     StringUtils.LevensteinDistance   46%
4     DoubleMetaphone                  37%
5     Metaphone                        28%
6     Nysiis                           28%
7     StringUtils.FuzzyDistance        28%
8     StringUtils.JaroWinklerDistance  28%
9     MatchRating                      19%
10    Caverphone1                      19%
11    Caverphone2                      10%
12    BeiderMorse                      0%
13    String.compareTo                 0%

Vocabulary: 126 Words

final String[][] mTuples = new String[][]{
    {"Bland", "blend"}, {"Bra", "brush"}, {"Player", "blur"}, {"Price", "brush"},
    {"Diane", "cyan"}, {"Boston", "emboss"}, {"in Boston", "emboss pen"},
    {"sheep", "jeep"}, {"cheap", "jeep"}, {"their", "there"}, {"they are", "there"}};

https://github.com/wolfpaulus/phonetic-alg-compare.git
Apache Commons Codec - Language Encoders
http://commons.apache.org/proper/commons-codec/
Phonetic Algorithms
Rank  Algorithm       Correction Rate
1     RefinedSoundex  64%
2     Soundex         55%

Soundex Coding Guide
Number  Represents the Letters
1       b, f, p, v
2       c, g, j, k, q, s, x, z
3       d, t
4       l
5       m, n
6       r
Disregard the letters: a, e, i, o, u, h, w, y

Refined Soundex Guide
Number  Represents the Letters
1       b, p
2       f, v
3       c, k, s
4       g, j
5       q, x, z
6       d, t
7       l
8       m, n
9       r
Disregard the letters: a, e, i, o, u, h, w, y
1. Build a Dictionary of Words/Sentences and Actions; phonetically encode the Words in the Dictionary
2. Capture Speech Input (encoded sound file)
3. Speech Recognition returns a text
4. Phonetically encode the recognition result; find the best match in the encoded dictionary
5. Launch the action mapped to the best match
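The post-recognition steps above could look roughly like the following sketch. Everything here is an assumption made for illustration: the class `PhoneticCommandMatcher`, the action names, the self-written Soundex-style encoder, and the choice of edit distance between encoded phrases as the "best match" criterion. The talk's own comparison code lives in the repository linked earlier and uses Apache Commons Codec.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: maps recognized text to an action via phonetic encodings.
final class PhoneticCommandMatcher {
    // Soundex digit for each letter A..Z; '0' marks the disregarded letters.
    private static final String CODES = "01230120022455012623010202";
    // Encoded phrase -> action name.
    private final Map<String, String> encodedDictionary = new LinkedHashMap<>();

    static String soundex(String word) {
        String s = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";
        StringBuilder out = new StringBuilder().append(s.charAt(0));
        char last = CODES.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c == 'H' || c == 'W') continue;
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != last) out.append(code);
            last = code;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    /** Encodes each word of a phrase and joins the codes with single spaces. */
    static String encodePhrase(String phrase) {
        StringBuilder sb = new StringBuilder();
        for (String w : phrase.trim().split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(soundex(w));
        }
        return sb.toString();
    }

    /** Levenshtein distance, used here to compare encoded phrases. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Step 1: add a phrase to the dictionary under its phonetic encoding. */
    void register(String phrase, String action) {
        encodedDictionary.put(encodePhrase(phrase), action);
    }

    /** Steps 4 and 5: encode the recognition result and return the closest action, or null if the dictionary is empty. */
    String bestAction(String recognizedText) {
        String key = encodePhrase(recognizedText);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, String> e : encodedDictionary.entrySet()) {
            int d = distance(key, e.getKey());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        PhoneticCommandMatcher m = new PhoneticCommandMatcher();
        m.register("blue blend", "SET_BLUE_BLEND");
        m.register("cyan brush", "SET_CYAN_BRUSH");
        System.out.println(m.bestAction("blue bland"));
        System.out.println(m.bestAction("Diane Brush"));
    }
}
```

Note that this sketch always returns the nearest entry; a real application would probably want a distance threshold so that unrelated input is rejected rather than mapped to some action.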
Classification (chart axes: Customer Benefit and Integration Complexity)
Recognize who is speaking
Recognize what is said and record it
Execute simple commands
Intelligently respond to natural language input
Wit.ai
A web service for building apps and devices that users can talk or text to.
Provides an open and extensible natural language platform.
“Learns” from every interaction and leverages the community: what’s learned is shared across developers/apps.
declare
teach
Request
curl -H 'Authorization: Bearer 4FPMGMHNMWDFV3E2BM5QQ4XVQA3A3W23' 'https://api.wit.ai/message?v=20150406&q=blue+blur+brush'

Response
{
  "msg_id" : "befcd276-e6e0-4f53-a075-d26c991786e4",
  "_text" : "blue blur brush",
  "outcomes" : [ {
    "_text" : "blue blur brush",
    "intent" : "setup",
    "entities" : {
      "style" : [ { "value" : "blur" } ],
      "color" : [ { "value" : "blue" } ],
      "tool" : [ { "value" : "brush" } ]
    },
    "confidence" : 0.515
  } ]
}
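As a small illustration of how the request above is put together, the following sketch builds the same URL and header value in Java. The class `WitRequest` and its method names are hypothetical; the endpoint and the `v` and `q` parameters are copied from the curl call on the slide and may not reflect the current Wit.ai API. No network call is made.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; only assembles the strings shown in the curl call.
final class WitRequest {

    /** Builds the message URL; URLEncoder turns spaces into '+', as in the slide's request. */
    static String messageUrl(String version, String utterance) {
        return "https://api.wit.ai/message?v="
                + URLEncoder.encode(version, StandardCharsets.UTF_8)
                + "&q=" + URLEncoder.encode(utterance, StandardCharsets.UTF_8);
    }

    /** Value for the Authorization header; the token is supplied by the caller. */
    static String authorizationHeader(String token) {
        return "Bearer " + token;
    }

    public static void main(String[] args) {
        System.out.println(messageUrl("20150406", "blue blur brush"));
    }
}
```

Keeping the token out of the source and passing it in at runtime (for example from configuration) avoids publishing credentials along with the code.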
Alexa, ask Mindy, What is the Balance of my Bank of America checking account
• Intent: GetAccountBalance
• Institution: Bank of America
• Account Type: Checking

Intent Schema
{
  "intents": [
    {
      "intent": "GetAccountBalance",
      "slots": [
        { "name": "Bank", "type": "LITERAL" },
        { "name": "Type", "type": "LITERAL" }
      ]
    }
  ],
  ..
}
TYPES: LITERAL, NUMBER, DATE, TIME, DURATION
Sample Utterances …provide as many as you can
Format: intent …. {value|name} …. {value|name} .…

GetAccountBalance What is my balance

GetAccountBalance balance of my {Bank of America|Bank} account
GetAccountBalance balance of my {Wells Fargo|Bank} account
GetAccountBalance balance of my {PayPal|Bank} account
...

GetAccountBalance balance of my {savings|Type} account
GetAccountBalance balance of my {bank|Type} account
GetAccountBalance balance of my {credit|Type} account
GetAccountBalance balance of my {investment|Type} account

GetAccountBalance balance of my {Bank of America|Bank} {investment|Type} account
GetAccountBalance balance of my {PayPal|Bank} {checking|Type} account

GetAccountBalance balance of my {savings|Type} account at {Citi|Bank}
GetAccountBalance balance of my {credit|Type} account at {Credit Union|Bank}
...
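To make the `intent … {value|name} …` format concrete, here is a minimal sketch that splits one sample-utterance line into its intent and its slots. The class `SampleUtterance` is hypothetical and only mirrors the line format shown above; it is not part of any Amazon SDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for one sample-utterance line of the form shown on the slide.
final class SampleUtterance {
    // Matches {example value|slot name}.
    private static final Pattern SLOT = Pattern.compile("\\{([^|{}]+)\\|([^{}]+)\\}");

    final String intent;
    // Slot name -> example value, in the order the slots appear.
    final Map<String, String> slots = new LinkedHashMap<>();

    SampleUtterance(String line) {
        // The first token is the intent name; the rest is the utterance text.
        String[] parts = line.trim().split("\\s+", 2);
        intent = parts[0];
        Matcher m = SLOT.matcher(parts.length > 1 ? parts[1] : "");
        while (m.find()) {
            slots.put(m.group(2), m.group(1));
        }
    }

    public static void main(String[] args) {
        SampleUtterance u = new SampleUtterance(
                "GetAccountBalance balance of my {Bank of America|Bank} {investment|Type} account");
        System.out.println(u.intent + " " + u.slots);
    }
}
```

The slot names recovered this way ("Bank", "Type") are the same names declared in the intent schema, which is what ties the two files together.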
AIML (Artificial Intelligence Markup Language) is an XML dialect for creating natural language software agents.
Dr. Richard Wallace

<?xml version="1.0" encoding="utf-8" ?>
<aiml version="2.0">
  <category>
    <pattern>What is *</pattern>
    <template>I don’t know what <star/> is</template>
  </category>
</aiml>
<?xml version="1.0" encoding="utf-8" ?>
<aiml version="2.0">
  <category>
    <pattern>What is *</pattern>
    <template>I don’t know what <star/> is</template>
  </category>
</aiml>

What is cheese? ‣ I don’t know what cheese is.
What is chocolate ? ‣ I don’t know what chocolate is.

<?xml version="1.0" encoding="utf-8" ?>
<aiml version="2.0">
  <category>
    <pattern>I like *</pattern>
    <template>I see, you like <person><star/></person></template>
  </category>
</aiml>

I like coffee. ‣ I see, you like coffee.
I like your voice. ‣ I see, you like my voice.
I like her dress. ‣ I see, you like her dress.
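The wildcard idea behind these categories can be imitated in a few lines of Java. This is only a rough analogue under stated assumptions, not an AIML interpreter: the class `TinyPatternBot` is made up for this example, it supports a single `*` wildcard and the `<star/>` placeholder, and it does not implement the `<person>` pronoun swap.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, heavily simplified stand-in for AIML categories with one '*' wildcard.
final class TinyPatternBot {
    // Compiled pattern -> response template, checked in insertion order.
    private final Map<Pattern, String> categories = new LinkedHashMap<>();

    void addCategory(String pattern, String template) {
        // Quote literal words and turn '*' into a capturing group for one or more characters.
        StringBuilder regex = new StringBuilder();
        for (String token : pattern.trim().split("\\s+")) {
            if (regex.length() > 0) {
                regex.append("\\s+");
            }
            regex.append(token.equals("*") ? "(.+?)" : Pattern.quote(token));
        }
        categories.put(Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE), template);
    }

    /** Returns the filled-in template of the first matching category, or null if none matches. */
    String respond(String input) {
        // Strip trailing sentence punctuation before matching.
        String normalized = input.trim().replaceAll("[.?!]+$", "").trim();
        for (Map.Entry<Pattern, String> e : categories.entrySet()) {
            Matcher m = e.getKey().matcher(normalized);
            if (m.matches()) {
                String star = m.groupCount() > 0 ? m.group(1) : "";
                return e.getValue().replace("<star/>", star);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TinyPatternBot bot = new TinyPatternBot();
        bot.addCategory("What is *", "I don't know what <star/> is");
        bot.addCategory("I like *", "I see, you like <star/>");
        System.out.println(bot.respond("What is cheese?"));
        System.out.println(bot.respond("I like coffee."));
    }
}
```

Because `<person>` is not implemented, "I like your voice." would come back as "I see, you like your voice" in this sketch, unlike the AIML output shown above.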
<?xml version="1.0" encoding="utf-8" ?>
<aiml version="2.0">
  <category>
    <pattern><set>colors</set> <set>styles</set> <set>tools</set></pattern>
    <template>Color=<star/> Style=<star index="2"/> Tool=<star index="3"/></template>
  </category>
  <category>
    <pattern><set>colors</set> <set>styles</set> ^</pattern>
    <template>Color=<star/> Style=<star index="2"/></template>
  </category>
  <category>
    <pattern><set>colors</set> <set>tools</set> ^</pattern>
    <template>Color=<star/> Tool=<star index="2"/></template>
  </category>
  <category>
    <pattern><set>colors</set> ^</pattern>
    <template>Color=<star/></template>
  </category>
  <category>
    <pattern><set>styles</set></pattern>
    <template>Style=<star/></template>
  </category>
  <category>
    <pattern><set>tools</set></pattern>
    <template>Tool=<star/></template>
  </category>
</aiml>
TOOLS: [["Roll"], ["Brush"], ["Pen"]]
STYLES: [["Blur"], ["Emboss"], ["Blend"]]
COLORS: .. ["Dark", "blue"], ["Dark", "brown"], ["Dark", "byzantium"], ["Dark", "cerulean"], ["Dark", "chestnut"], ["Dark", "coral"], ["Dark", "cyan"], ["Dark", "goldenrod"], ["Dark", "gray"], ["Dark", "green"], ["Dark", "khaki"], ["Dark", "lava"], ["Dark", "lavender"], ["Dark", "magenta"], …
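The role of the sets can be illustrated with a loose Java analogue. This sketch is an assumption-laden simplification, not AIML: the class `SetMatcher` is hypothetical, it accepts set members in any order (the AIML patterns above fix the order color, style, tool), and it resolves multi-word entries such as "Dark blue" by always trying the longest run of words first.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: assigns words of an utterance to named sets such as colors, styles, tools.
final class SetMatcher {
    // Set name -> lower-cased members (members may consist of several words).
    private final Map<String, Set<String>> sets = new LinkedHashMap<>();

    void addSet(String name, String... entries) {
        Set<String> members = new HashSet<>();
        for (String e : entries) {
            members.add(e.toLowerCase(Locale.ROOT));
        }
        sets.put(name, members);
    }

    /** Scans left to right, taking the longest run of words that is a member of some set. */
    Map<String, String> match(String utterance) {
        String[] words = utterance.trim().toLowerCase(Locale.ROOT).split("\\s+");
        Map<String, String> result = new LinkedHashMap<>();
        int i = 0;
        while (i < words.length) {
            boolean found = false;
            for (int end = words.length; end > i && !found; end--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, end));
                for (Map.Entry<String, Set<String>> e : sets.entrySet()) {
                    if (e.getValue().contains(candidate)) {
                        result.put(e.getKey(), candidate);
                        i = end;      // continue after the matched run
                        found = true;
                        break;
                    }
                }
            }
            if (!found) {
                i++;                  // skip a word that belongs to no set
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SetMatcher matcher = new SetMatcher();
        matcher.addSet("colors", "Dark blue", "blue");
        matcher.addSet("styles", "Blur", "Emboss", "Blend");
        matcher.addSet("tools", "Roll", "Brush", "Pen");
        System.out.println(matcher.match("Dark blue blur brush"));
    }
}
```

Trying the longest run first is what keeps "Dark blue" from being split into an unknown word "dark" followed by the color "blue".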
pandorabots
Timeline, 1975 to 2015 (slide image not reproduced); labeled events:
Wit.ai
FB acq. Wit.ai
SpeakToIt
SpeakToIt Api.ai
AIML 1.0
AIML 2.0
GrXML 1.0
PandoraBots
AMZN acq. Ivona
AMZN acq. Yap
Emotional Prosody
Wolf Paulus
https://speakerdeck.com/wolfpaulus/emotional-prosody
Emotion and Voice enabled Avatar, Emotional Prosody
20⍰⍰
Summary
The adoption of context-free recognizers may benefit from post-recognition processing, using text-to-sound language encoders, dictionaries, etc.
Gopher-VUIs are still hard to find, while Apple, Google, Amazon, Microsoft, and Facebook are building Assistant-VUIs. For domain-specific information, or where information is protected, those Assistants will require third-party substitutes. It will be interesting to see how much control (i.e. personality) those substitutes will be granted.
Basic NLU solutions from Wit.ai, Api.ai, and Amazon/Alexa reach their limits fast and are hardly powerful enough to implement those substitute assistants. AIML is the oldest, but still seems to be the most powerful declarative approach for simple VUI-related tasks.
Thanks for listening