NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing


Upload: seungyeop-han

Post on 10-May-2015


TRANSCRIPT

Page 1: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Seungyeop Han, U. of Washington; Matthai Philipose and Yun-Cheng Ju, Microsoft

Page 2: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Speech-Based UIs are Here

Ubicomp 2013

Today: Siri, …

Today: Hey Glass, …

Tomorrow: Hey Microwave, …

Page 3: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Keyphrases Don't Scale


Use Spoken Natural Language

App1: What time is it?
App2: Next bus to Seattle
App3: Tomorrow's weather
App26: When is the next meeting / "What time is the next meeting" …
…
App50: …

Keyphrase Hell

Page 4: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Spoken Natural Language (SNL) Today: First-Party Applications

"Hey, Siri. Do you love me?"


• Personal assistant model
• Large speech engine (20-600 GB)
• Experts mapping speech to a few domains

Speech Recognition

Language Processing

Text: "Hey Siri…" … "I'm not allowed, Seungyeop"

Page 5: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

NLify: Scaling Spoken NL Interfaces

1st-party app (e.g., Xbox, Siri): multiple PhDs, 10s of developers (~10 apps)

3rd-party app (e.g., Intuit, Spotify): 0 PhDs, 1-3 developers (~10,000 apps)

end-user macro (e.g., …): 0 PhDs, 0 developers (~10,000,000 apps)


Page 6: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Goal  

Make programming spoken natural language interfaces as easy and robust as programming graphical user interfaces.


Page 7: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Outline  

• Motivation / Goal
• System Design
• Demonstration
• Evaluation
• Conclusion


Page 8: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Challenges  

• Developers are not SNL experts

• Applications are developed independently

• Cloud-based SNL does not scale as UI
  – UI capability must not rely on connectivity
  – UI events must have minimal cost


Page 9: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Specifying  GUIs  

Intuitive definition of UI; handler linking to code

Page 10: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Specifying Spoken Keyphrase UIs

<CommandPrefix>Magic Memo</CommandPrefix>
<Command Name="newMemo">
  <ListenFor>Enter [a] [new] memo</ListenFor>
  <ListenFor>Make [a] [new] memo</ListenFor>
  <ListenFor>Start [a] [new] memo</ListenFor>
  <Feedback>Entering a new memo</Feedback>
  <Navigate Target="/Newmemo.xaml" />
</Command>
...

How  does  natural  language  differ  from  keyphrases?  


Page 11: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Difference 1: Local Variation

Base: When is the next meeting?

• Missing words: When is next meeting?
• Repeated words: When is the next… next meeting?
• Re-arranged words: When the next meeting is?
• New combinations of phrases: What time is the next meeting?


Page 12: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Difference 2: Paraphrases

show me the current time, what is the time, time, what is the current time, may i know the time, please give time, show me the time, show me the clock, tell me what time it is, what is time, current time, tell what time it is, list the time, what time,

what time it is now, show current time, what time please, show time, what is the time now, current time please, say the time, find the current time please, what time is it, what is current time, what time is it, tell me time, current, what's the time, tell current time,

what time is it now, what time is it currently, check time, the time now, tell me the current time, what's time, time now, tell me the time, can you please tell me what time it is, tell me current time, give me the time, time please, show me the time now


Page 13: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Specifying SNL Systems


Speech Recognition → Language Processing → whattime() ("what time is it?")

Lots of rules, little data:
• Encode local variation in grammar
• Encode domain knowledge on paraphrases in models, e.g., CRFs

Few rules, lots of data:
• Use statistical language models that require little anticipation of local noise
• Use data-driven models that require little domain knowledge

Page 14: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Exhaustive Paraphrasing by Automated Crowdsourcing


Examples from developers:

Handler: whattime()
Description: When you want to know the time
Examples: What time is it now; What's the time; Tell me the time

An automatically generated crowdsourcing task (directions, description, example) amplifies these into:

Handler: whattime()
Description: When you want to know the time
Examples: What time is it now; What's the time; Tell me the time; Current time; Find the current time please; Time now; Give me time; …

Page 15: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Compiling SNL Models

dev time:
  Seed Examples (.What is the date @d / .Tell me the date @d / …)
  → amplify via Internet crowdsourcing service
  → Amplified Examples (.What is the date @d / .Tell me the date @d / .What date is it @d / .Give me the date @d / .@d is what date / …)

install time:
  compile → Nearest-neighbor model + SLM (statistical models)

run time:
  "Tell me when it's @T=20 min …" → SAPI → TFIDF + NN → NLNotifyEvent e → nlwidget

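The TFIDF + NN stage in the pipeline matches a recognized utterance to its nearest intent by TF-IDF similarity over the amplified examples. A minimal sketch of that idea, not NLify's actual implementation; the intent names and example templates here are hypothetical stand-ins:

```python
from collections import Counter
from math import log, sqrt

# Hypothetical amplified examples per intent (stand-ins for the slide's templates).
TEMPLATES = {
    "FindTime": ["what time is it", "tell me the time", "current time"],
    "FindDate": ["what is the date", "tell me the date", "what date is it"],
}

def cosine(a, b):
    # cosine similarity between two sparse word-weight dicts
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class IntentMatcher:
    def __init__(self, templates):
        # flatten to (intent, tokens) pairs
        self.examples = [(intent, s.split())
                         for intent, sents in templates.items() for s in sents]
        self.n = len(self.examples)
        self.df = Counter()  # document frequency per word
        for _, toks in self.examples:
            self.df.update(set(toks))
        self.vecs = [self._vec(toks) for _, toks in self.examples]

    def _vec(self, toks):
        # smoothed TF-IDF weights so no weight is exactly zero
        tf = Counter(toks)
        return {t: c * log((1 + self.n) / (1 + self.df[t])) for t, c in tf.items()}

    def match(self, utterance):
        # nearest neighbor under cosine similarity over TF-IDF vectors
        q = self._vec(utterance.lower().split())
        best = max(range(self.n), key=lambda i: cosine(q, self.vecs[i]))
        return self.examples[best][0]
```

For example, `IntentMatcher(TEMPLATES).match("tell me what time it is")` resolves to `"FindTime"` even though that exact phrasing is not among the templates, which is the point of matching against amplified examples rather than exact keyphrases.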

Page 16: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

SNL Models for Multiple Apps

Application 1: .What is the date @d / .Tell me the date @d / .What date is it @d / .Give me the date @d / .@d is what date / …
Application 2: .How much is @com / .Get me quote for @com / .What's the price for @com / …
…
Application N

install time: compile → Nearest-neighbor model + SLM (statistical models)
run time: "Tell me when it's @T=20 min …" → SAPI → TFIDF + NN → NLNotifyEvent e → nlwidget

• Apps developed separately => "late assembly" of models
• Limited time for learning at install time => simple (e.g., NN) models
• Users no longer say anything but what they have installed => "natural language shortcut" mental model
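The slot-tagged templates (@d, @com, @T) can be parsed deterministically by turning each placeholder into a capture group. A simplified sketch of that bottom-up slot parse, under assumed template strings and intent names that are illustrative rather than NLify's actual API:

```python
import re

# Hypothetical slot-tagged templates in the spirit of the slide's "@d" / "@com"
# placeholders; intent names are illustrative.
TEMPLATES = [
    ("FindNextBus",    "when is the next @route to @dest"),
    ("FindStockPrice", "how much is @company stock"),
]

def compile_template(template):
    # turn each @slot into a named capture group; literal words must match exactly
    parts = []
    for tok in template.split():
        if tok.startswith("@"):
            parts.append("(?P<%s>.+?)" % tok[1:])
        else:
            parts.append(re.escape(tok))
    return re.compile(r"^" + r"\s+".join(parts) + r"$", re.IGNORECASE)

def parse(utterance, templates):
    """Deterministic bottom-up slot parse: first template that matches wins."""
    for intent, template in templates:
        m = compile_template(template).match(utterance.strip())
        if m:
            return intent, m.groupdict()
    return None, {}
```

Here `parse("when is the next 545 to Seattle", TEMPLATES)` yields the intent plus slot bindings `{"route": "545", "dest": "Seattle"}`; the all-or-nothing nature of this parse is exactly why the evaluation later pairs it with a statistical language model.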

Page 17: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Outline  

• Motivation / Goal
• System Design
• Demo: SNL interfaces in 4 easy steps
• Evaluation
• Conclusion


Page 18: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing


1.  Add  NLify  DLL  

Page 19: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

2.  Providing  Examples  


Page 20: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

3. Writing a Handler


Page 21: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

4.  Adding  a  GUI  Element  


Page 22: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing


Enjoy ☺

Page 23: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Outline  

• Motivation / Goal
• System Design
• Demonstration
• Evaluation
• Conclusion


Page 24: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Evaluation

• How good are SNL recognition rates?
• How does performance scale with commands?
• How do design decisions impact recognition?
• How practical is on-phone implementation?
• What is the developer experience?


Page 25: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Evaluation Dataset


Domain     Intent & Slots                   Example
Clock      FindTime()                       What time is it?
Clock      FindDate(day)                    What's the date today?
Calendar   CheckNextMtg()                   What's my next meeting?
Bus        FindNextBus(route, dest)         When is the next 20 to Seattle?
Finance    FindStockPrice(company)          How much is Microsoft stock?
Finance    CalculateTip(Money, NumPeople)   How much is the tip for $20 for three people?
Condition  FindWeather(day)                 How is the weather tomorrow?
Contacts   FindOfficeLocation(person)       Where is Janet Smith's office?
Contacts   FindGroup(person)                Which group does Matthai work in?
…

Across 27 different commands, collected 1612 paraphrases and 3505 audio samples.

Page 26: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Evaluation Dataset


• Seed: 5 paraphrases/intent, by the authors (training)
• Crowd: ~60 paraphrases/intent, amplified via crowdsourcing at $.03/paraphrase (training)
• Audio: 130 utterances/intent, by 20 subjects, asked "What would you say to the phone to do the described task" with an example (testing)

Page 27: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Overall Recognition Performance


• Absolute recognition rate is good (avg: 85%, std: 7%)
• Significant relative improvement from Seed alone (69%)

Page 28: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Performance Scales Well with Number of Commands


Page 29: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Design Decisions Impact Recognition Rates


• The more exhaustive the paraphrasing, the better
• Statistical model improves recognition rate by 16% vs. deterministic model

(Chart: recognition rate, 0-100%, vs. fraction of training set used, 20-100%)

Page 30: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Feasibility of Running on Mobiles

• NLify is competitive with a large-vocabulary model
• Memory usage is acceptable: maximum memory for 27 intents was 32 MB
• Power consumption is very close to the listening loop


Figure 5. Scaling with number of commands.

Figure 6. Incremental benefit from templates.

…surprising, since both the SLM and TF-IDF algorithms that identify intents compete across intents. Third, slot recognition does not vary monotonically with the number of competitors; in fact, the particular competitors seem to make a big difference, leading to high variance for each N. On closer examination we determined that the identity of the competitors matters: when certain challenging functions (e.g., 11, 12 and 19) are included, recognition rate for the subset plummets. Larger values of n will likely give a smoother average line. Overall, since slot recognition is performed deterministically bottom-up, it does not compete at the language-model level with other commands.

Impact of NLify Features

NLify uses two main techniques to generalize from the seeds provided by the developers to the variety of SNL. To capture broad variation, it supports template amplification as per the UHRS dataset. To tolerate small local noise (e.g., words dropped in the speech engine), it advocates a statistical approach even when the models are run locally on the phone (in contrast, e.g., to recent production systems [5]).

We saw earlier that using the Seed set instead of Seed + UHRS (where Seed has 5 templates per command and UHRS averages 60) lowers recognition from 85% to 69%. Thus UHRS-added templates contribute significantly. To evaluate the incremental value of templates, we measured recognition rates when f = 20, 40, 60 and 80% of all templates were used. We picked the templates arbitrarily for this experiment. The corresponding average recognition rates (across all functions) were 66, 75, 80 and 83%. Figure 6 shows the breakout per function. Three factors stand out: recognition rates improve noticeably between the 80 and 100% configurations, indicating that rates have likely not topped out; improvement is spread across many functions, indicating that more templates are broadly beneficial; and there is a big difference between the 20% and the 80% mark. The last point indicates that even had the developer added an additional dozen seeds, crowdsourcing would still have been beneficial.

(a) intent recognition (b) slot recognition

Figure 7. Benefit of statistical modeling.

Figure 8. Comparison to a large vocabulary model.

Given that templates may provide good coverage across paraphrases for a command, it is reasonable to ask whether a deterministic model that incorporates all these paraphrases would perform comparably to a statistical one. Given template amplification, is a statistical model really necessary? In the spirit of the Windows Phone 8 Voice Command [5], we created a deterministic grammar for each intent. For robustness to accidentally omitted words, we made the common words {is, me, the, it, please, this, to, you, for, now} optional in every sentence. We compared recognition performance of this deterministic system with the SLM, both trained on the Seed + UHRS data. Figure 7 shows the results for both intent and slot recognition. Two points are significant. First, statistical modeling does add a substantial boost for both intent (16% incremental) and slot recognition (19%). Second, even though slots are parsed deterministically, their recognition rates improve substantially with SLMs. This is because deterministic parsing is all-or-nothing: the most common failure mode by far is that the incoming sentence does not parse, affecting both slot and intent recognition rates.
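The optional-common-word baseline described above can be approximated in a few lines. A simplified sketch (not the actual Voice Command grammar engine): treating the listed common words as droppable on both sides and requiring an exact match on everything else reproduces the all-or-nothing behavior:

```python
# Simplified sketch of a deterministic grammar with optional common words.
# This approximates, not reproduces, the Voice Command-style grammar in the
# text: a parse succeeds only on an exact match after the common words below
# are stripped from both the template and the utterance.
OPTIONAL = {"is", "me", "the", "it", "please", "this", "to", "you", "for", "now"}

def strip_optional(words):
    # drop the words the grammar marks optional
    return [w for w in words if w not in OPTIONAL]

def matches(utterance, template):
    """All-or-nothing match: any other missing, extra, or reordered word fails."""
    return strip_optional(utterance.lower().split()) == \
           strip_optional(template.lower().split())
```

So `matches("what time is it now", "what time is it")` succeeds because only optional words differ, but any reordering or unanticipated paraphrase fails outright, which is the failure mode the statistical model avoids.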

The experiments thus far assumed that no query was garbage. In practice, users may speak out-of-grammar commands. NLify's parallel garbage model architecture is set up to catch these cases. Without the garbage model, the existing SLM would still reject commands that are egregiously out-of-grammar.

[Average] SLM: 85%, LV: 80%

Page 31: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Developer Study w/ 5 Devs

Developers were asked to add NLify to their existing programs.


Description                          Sample commands                  Original LOC   Time Taken
Control a night light                "turn off the light"             200            30 mins
Get sentiment on Twitter             "review this"                    2000           30 mins
Query, control location disclosure   "where is Alice?"                2800           40 mins
Query weather                        "weather tomorrow?"              3800           70 mins
Query bus service                    "when is next 545 to Seattle?"   8300           3 days

(+) How well did NLify's capabilities match your needs?
(-) Did the cost/benefit of NLify scale?
(-) How long do you think you can afford to wait for crowdsourcing?

Page 32: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

Conclusions  

It is feasible to build mobile SNL systems where:
• Developers are not SNL experts
• Applications are developed independently
• All UI processing happens on the phone

Fast, compact, automatically generated models enabled by exhaustive paraphrasing are the key.


Page 33: NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing

For Data and Code

Check Matthai's homepage: http://research.microsoft.com/en-us/people/matthaip/

Or e-mail the authors on/after October 1.
