Spoken Dialogue Systems Prof. Alexandros Potamianos Dept. of Electrical & Computer Engineering Technical University of Crete, Greece May 2003


Outline

Discourse
  Research Issues
Spoken Dialogue Systems
  Pragmatics (dialogue acts)
  Dialogue management
Multimodal Systems
Examples

Definitions

Discourse
  Monologue
  Dialogue

Discourse: Research Issues

Reference resolution, e.g., “That was a lie”
Anaphora, e.g., “John left …. He was bored.”
Co-reference, e.g., “John” and “He” refer to the same entity
Text coherence, e.g.,
  Coherence: “John left early. He was tired”
  Incoherence: “John left early. He likes spinach”

Spoken Dialogue Systems: Concepts

Turn-taking
Dialogue Segmentation
Grounding
  Backchannel, e.g., ‘Mm Hmm’
  Acknowledgment
  Explicit/implicit confirmation
Implicature
  “What time are you flying?” “Well, I have a meeting at three”
Initiative
  “What time are you flying?” “Don’t feel like booking the flight right now. Let’s look at hotels”

Speech, Dialogue and Application Acts

Speech Acts (Austin 1962, Searle 1975)
  Assertive (conclude), Directive (ask, order), Commissive (promise), Expressive (apologize, thank), Declarations
Dialogue Acts
  Statement, Info-Request, Wh-Question, Yes-No Question, Opening, Closing, Open-Option, Action-Directive, Offer, Commit, Agree, etc.
Application Acts
  Domain specific but general, e.g., Info-Request into system’s semantic state, Info-Request into database, Info-Request into database results

Dialogue/Application Act Classification

Semantic parsing followed by deterministic rules, e.g.,
  ‘what’, ‘when’, ‘where’, ‘who’ starts a Wh-Question
Bayesian Formulation
  Given a sentence W, the most probable dialogue act Â is
    Â = argmax_A P(A|W) = argmax_A P(W|A) P(A)
  (by Bayes’ rule; P(W) does not depend on A and can be dropped)
  P(W|A) can be an n-gram language model, one for each dialogue act
  P(A) can also be an n-gram model over sequences of dialogue acts
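The Bayesian formulation above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: P(W|A) is a unigram model with add-one smoothing (the simplest n-gram), P(A) is a plain relative-frequency prior rather than an n-gram over act sequences, and the training utterances and the names `TRAIN`, `train`, and `classify` are invented for the example.

```python
import math
from collections import Counter, defaultdict

# Toy training data (hypothetical, for illustration only): (utterance, dialogue act).
TRAIN = [
    ("what time does the flight leave", "Wh-Question"),
    ("where are you flying to", "Wh-Question"),
    ("is there a morning flight", "Yes-No Question"),
    ("do you have a window seat", "Yes-No Question"),
    ("i am flying to boston", "Statement"),
    ("i have a meeting at three", "Statement"),
]


def train(examples):
    """Estimate act counts (for P(A)) and per-act word counts (for P(w|A))."""
    act_counts = Counter(act for _, act in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, act in examples:
        words = text.split()
        word_counts[act].update(words)
        vocab.update(words)
    return act_counts, word_counts, vocab


def classify(text, act_counts, word_counts, vocab):
    """Return argmax_A [log P(A) + sum_i log P(w_i|A)], add-one smoothed."""
    total_acts = sum(act_counts.values())
    v = len(vocab) + 1  # one extra slot for unseen words
    best_act, best_score = None, -math.inf
    for act, n_act in act_counts.items():
        n_words = sum(word_counts[act].values())
        score = math.log(n_act / total_acts)  # log prior P(A)
        for w in text.split():
            score += math.log((word_counts[act][w] + 1) / (n_words + v))
        if score > best_score:
            best_act, best_score = act, score
    return best_act


MODEL = train(TRAIN)
```

With this toy data, `classify("what time are you flying", *MODEL)` returns `"Wh-Question"`. Replacing the unigram likelihood with a higher-order n-gram per act, or the prior with a model conditioned on the previous acts, changes only the two scoring lines.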

Dialogue Management 1

Frame-based, e.g.,
  DeptCity  “From what city are you leaving?”   GRM_CITY
  ArrCity   “Where are you flying to?”          GRM_CITY
  DeptTime  “What time would you like to fly?”  GRM_TIME
  DeptDate  “When are you flying?”              GRM_DATETIME
Finite state machine dialogue manager
Mostly system-initiated dialogue
VXML-like dialogue structure (forms and frames)
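A frame of this kind might be driven by a loop like the one below. It is a minimal sketch, not the system described on the slide: the slot names, prompts, and grammar names come from the slide, while `Slot`, `run_frame`, and `scripted_recognizer` are hypothetical names, and `recognize` stands in for the whole ASR-plus-parser step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Slot:
    name: str
    prompt: str
    grammar: str  # grammar name as listed on the slide, e.g. "GRM_CITY"


# Slots, prompts, and grammar names taken from the slide.
FRAME: List[Slot] = [
    Slot("DeptCity", "From what city are you leaving?", "GRM_CITY"),
    Slot("ArrCity", "Where are you flying to?", "GRM_CITY"),
    Slot("DeptTime", "What time would you like to fly?", "GRM_TIME"),
    Slot("DeptDate", "When are you flying?", "GRM_DATETIME"),
]

Recognizer = Callable[[str, str], Optional[str]]


def run_frame(frame: List[Slot], recognize: Recognizer) -> Dict[str, str]:
    """System-initiated dialogue: prompt for each slot in a fixed order.

    `recognize(prompt, grammar)` is a hypothetical stand-in for ASR and
    parsing; it returns a value, or None when nothing was understood.
    """
    values: Dict[str, str] = {}
    for slot in frame:
        value = None
        while value is None:  # re-prompt until the slot is filled
            value = recognize(slot.prompt, slot.grammar)
        values[slot.name] = value
    return values


def scripted_recognizer(answers: List[Optional[str]]) -> Recognizer:
    """Test double that replays canned answers instead of calling ASR."""
    it = iter(answers)

    def recognize(prompt: str, grammar: str) -> Optional[str]:
        return next(it)

    return recognize
```

For example, `run_frame(FRAME, scripted_recognizer(["Boston", None, "Newark", "morning", "August 5"]))` re-prompts once for the arrival city and returns all four slots filled. The fixed slot order is what makes the dialogue system-initiated.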

Dialogue Management 2

Application independent
Flow chart structure
Generic dialogue/application manager (really this is more like a controller)

Dialogue Management 3

Generalized Finite State Machine Dialogue Management

Application Dependent but General Dialogue Superstates

Fill: adaptive dialogue module, uses dynamic e-forms to elicit AV pairs from the user; resolves value and tree-position ambiguities

Navigate: presents database results and lets the user select the appropriate ones

[Figure: flow chart of the dialogue superstates Fill, Verify, Create Query, and Navigate, with Yes/No decision nodes “Is Full” and “Is Correct”.]
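One way to read the superstate flow chart is as the small state machine sketched below. The transitions are an assumption, since the slide lists only the node names: Fill repeats until the form is full, Verify returns to Fill when the values are not correct, and Create Query leads to Navigate. The function name `run_superstates` and the four callbacks are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Form = Dict[str, Optional[str]]


def run_superstates(
    form: Form,
    required: List[str],
    fill: Callable[[Form], None],
    verify: Callable[[Form], bool],
    query: Callable[[Form], List[str]],
    navigate: Callable[[List[str]], None],
) -> List[str]:
    """Walk Fill -> Verify -> Create Query -> Navigate; return the states visited."""
    trace: List[str] = []
    results: List[str] = []
    state = "Fill"
    while state != "Done":
        trace.append(state)
        if state == "Fill":
            fill(form)  # elicit attribute-value pairs from the user
            is_full = all(form.get(key) is not None for key in required)
            state = "Verify" if is_full else "Fill"
        elif state == "Verify":
            is_correct = verify(form)  # e.g. an explicit confirmation turn
            state = "Create Query" if is_correct else "Fill"
        elif state == "Create Query":
            results = query(form)
            state = "Navigate"
        else:  # "Navigate": present results and let the user select
            navigate(results)
            state = "Done"
    return trace
```

If the user supplies a city, then a date, rejects the confirmation, and corrects the city, the trace is Fill, Fill, Verify, Fill, Verify, Create Query, Navigate.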

Advanced Dialogue Systems

Mixed Initiative:
  Allow user to say anything (global grammar active at all states), e.g., “What date are you flying?” “I am flying next Tuesday in the morning”
  Allow user to navigate the system’s state machine, e.g., “I would like to look at hotels first”
  Open prompts, give user the initiative, e.g., “What next?”
Advanced dialogue features
  Corrections, e.g., “No not Boston, Atlanta”
  Negation, e.g., “Anything but Olympic”
  Complex semantic expressions, e.g., “tomorrow evening or Sunday morning”
  Ambiguity resolution and representation, e.g., “next Tuesday”
  Persistent Semantics, e.g., “Info about his organization”

Ambiguity Representation

[Figure: semantic tree for a trip (Trip with Flight and Car branches; Flight with Leg 1 and Leg 2, each with Departure and Arrival nodes) holding the values City Atlanta, City New York, and Date June 1. Callouts mark a position ambiguity and a value ambiguity.]

Error Correction Mechanisms

Sys: How can I help you?
Usr: I’d like to fly from Austin to Newark on August fifth in the morning
Asr: I’d like to fly from Boston to Newark on August fifth in the morning

Attribute                          Value         Score
.trip.flight.leg1.departure.city   BOS           0.5
.leg1.departure.date               8/5/2003      0.5
.leg1.departure.time               6:00 - 11:59  0.5
.leg1.arrival.city                 EWR           0.5

Sys: I've got you leaving Boston on Sunday, August fifth in the morning and arriving in Newark airport. What is your preferred airline?
Usr: Leaving from Austin, Texas

Attribute                          Value         Score
.leg1.departure.city               BOS           0.44
                                   AUS           0.44

Sys: Sorry …, are you leaving from Austin, Texas, or from Boston?
Usr1: Austin, Texas

Attribute                          Value         Score
.leg1.departure.city               AUS           0.72
                                   BOS           0.38

Sys: Leaving from Austin, Texas.

Usr2: Change the departure city to Austin, Texas

Attribute                          Value         Score
.leg1.departure.city               AUS           0.6

Alternate: use error correction
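The system's behaviour in this example (asking "Austin or Boston?" when the two scores tie, confirming Austin once it clearly leads) is consistent with a decision rule on the score margin, sketched below. The rule and the 0.2 threshold are assumptions made for illustration; the slides do not say how the scores are updated or compared, and `next_action` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def next_action(candidates: Dict[str, float], margin: float = 0.2) -> Tuple[str, List[str]]:
    """Pick the system's next move for one attribute from its scored values.

    `margin` is an assumed threshold: when the top two scores are closer
    than this, the system asks the user to choose between them.
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return ("ask", [])  # nothing known yet: prompt for the attribute
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return ("confirm", [ranked[0][0]])  # e.g. "Leaving from Austin, Texas."
    # Scores too close to call: offer the two best values to the user.
    return ("disambiguate", [value for value, _ in ranked[:2]])
```

Applied to the tables above, `next_action({"BOS": 0.44, "AUS": 0.44})` yields a disambiguation between BOS and AUS, while `next_action({"AUS": 0.72, "BOS": 0.38})` confirms AUS.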

Spoken Dialogue System Architecture

[Figure: architecture block diagram with the components Controller, Database, Parser, TTS, ASR, and Telephony (labelled “Platform”), Generation, App. Controller, DM/Initiative, Interpreter/Context Tr., and AI.]

System Architecture and Portability

[Figure: block diagram grouping modules into Semantics, Pragmatics, and Generation, and marking each as application dependent or application independent. Modules shown: Parser, Semantic Interpreter, Context Tracker, Pragmatic Interpreter, Expert/Domain Knowledge, Dialogue Manager, Initiative Tracking, Utterance Planner, Surface Realizer, Controller. Callouts: ambiguity representation, pragmatic confidence scores.]

Advantages of application-centric system design:
  Increased modularity.
  Flexible multi-stage data collection.
  Extensible to multi-modal input (universal access).

Multimodal Systems

Definition
Input Modalities/Output Media
Research Issues
  User Interface Design
  Semantic Module
Examples

Input Modalities/Output Media

Unimodal:
  Speech input/Speech output.
Multimodal:
  Speech+DTMF input/Speech output.
  Speech input/Speech and GUI output.
  Speech and pen input/Speech and GUI output.
Definitions:
  Pen input: buttons, pull-down menus, graffiti, pen gestures.
  GUI output: text and graphics

[Figure: grid of input modalities (S, D, P, S+D, S+P) against output media (S, G, S+G), where S = speech, D = DTMF, P = pen, G = GUI.]

Issues

Semantic/Pragmatic Module:
  Merging semantic information from different modalities, e.g., “Draw a line from here to there”
  Ambiguity representation and resolution
User Interface:
  Synergies between input modalities
  Turn-taking and appropriate mix of modalities
  Maintain interface consistency
  Focus/context visualization
System issues:
  Synchronization and latency

Semantic and Pragmatic Module

[Figure: the same attribute entered through two modalities, speech (“July fifth”) and GUI (“7/10”), traced through the processing stages.]

Speech input: “July fifth”
  NL Parser: <date> = <month> “July”, <day> <number> “fifth”
  NL Interpreter: {“date”, “Jul 5, 2002”}
  Context Tracking: {“travel.flight.leg1.departure.date”, “Jul 5, 2002”}
  Pragmatic Analysis: {“travel.flight.leg1.departure.date”, “Jul 5, 2002”, 0.4}

GUI input: “7/10”
  GUI Parser: <date> = <month> <number> “7”, “/”, <day> <number> “10”
  GUI Interpreter: {“date”, “Jul 10, 2002”}
  Context Tracking: {“travel.flight.leg1.departure.date”, “Jul 10, 2002”}
  Pragmatic Analysis: {“travel.flight.leg1.departure.date”, “Jul 10, 2002”, 0.9}

Update Semantic Tree & Pragmatic Scores:
  travel
    flight
      leg 1
        departure
          city: {“BOS”, 0.5}
          date: {“Jul 5, 2002”, 0.4}, {“Jul 10, 2002”, 0.9}
        arrival
          city: {“NYC”, 0.5}
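The final "update semantic tree" step can be sketched as follows: each modality contributes (attribute path, value, score) triples, and competing values are kept side by side under the same tree position. This is a minimal illustration using nested dictionaries; the function names and the choice to keep the higher score when a value is repeated are assumptions, not details taken from the slides.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, float]  # (attribute path, value, pragmatic score)


def update_tree(tree: Dict, triples: Iterable[Triple]) -> Dict:
    """Insert scored values into a nested-dict semantic tree, in place."""
    for path, value, score in triples:
        *parents, leaf = path.split(".")
        node = tree
        for key in parents:  # walk or create travel -> flight -> leg1 -> ...
            node = node.setdefault(key, {})
        candidates = node.setdefault(leaf, {})  # value -> score at this position
        # Assumption: a repeated value keeps its highest score so far.
        candidates[value] = max(score, candidates.get(value, 0.0))
    return tree


def best_value(tree: Dict, path: str) -> str:
    """Return the highest-scoring candidate value stored at `path`."""
    node = tree
    for key in path.split("."):
        node = node[key]
    return max(node, key=node.get)


DATE = "travel.flight.leg1.departure.date"
tree: Dict = {}
update_tree(tree, [(DATE, "Jul 5, 2002", 0.4)])   # from speech: "July fifth"
update_tree(tree, [(DATE, "Jul 10, 2002", 0.9)])  # from the GUI: "7/10"
```

After both updates the date position holds two candidates, and `best_value(tree, DATE)` returns `"Jul 10, 2002"`, the GUI entry with the higher pragmatic score.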

Multi-Modal User Interface

Emphasis on synergies between modalities:
  Value(s) of attributes are displayed graphically
  Erroneous values can be easily corrected via the GUI
  Focus (aka context) of speech modality is highlighted
  Position and value ambiguity are shown (and typically resolved) via the GUI
  Voice prompts are significantly shorter and mostly used to emphasize information that is already displayed graphically
  GUI takes full advantage of intelligence of voice UI, e.g., ‘round trip’ speech input will ‘gray out’ the third leg button in the GUI
  Seamless integration of semantics from the two modalities using modality-specific pragmatic scores

Example 1: Flight First Leg
  ASR: I want to fly from Boston to New York on September 6th.
  [Screenshot callouts: new focus, field disabled, navigation buttons]

Example 2: Flight Second Leg
  ASR: round trip
  [Screenshot callouts: value induction, button disabled]

Example 3: Car Rental
  ASR: I want a compact car from AVIS
  GUI: “rental” button pressed

Example 4: Ambiguity and Errors

Mixing the Modalities: Turn-Taking

“Click to talk” vs “Open Mike”
  “Click to talk” can be restrictive
  “Open mike” can be confusing (falling out of turn)
  Both have limitations
Often there is a dominant modality based on
  Type of input, e.g., “select from menu” vs enter free text
  Recent input history
  User preferences
System automatically selects the dominant modality and the user can click to change it
  Dominant modality selection algorithm is adaptive
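The slides do not describe the selection algorithm itself, so the following is only a guess at how the three cues listed above (type of input, recent input history, user preference) and the click-to-change override could be combined. The class name, the weights, and the field-type labels are all hypothetical.

```python
from collections import Counter, deque
from typing import Optional

MODALITIES = ("speech", "gui")


class ModalitySelector:
    """Hypothetical adaptive chooser of the dominant input modality."""

    def __init__(self, preference: str = "speech", history_size: int = 5):
        self.preference = preference  # stated user preference
        self.history = deque(maxlen=history_size)  # recent input modalities
        self.override: Optional[str] = None  # set when the user clicks to change

    def record_input(self, modality: str) -> None:
        self.history.append(modality)

    def user_click(self, modality: str) -> None:
        self.override = modality  # applies to the next selection only

    def select(self, field_type: str) -> str:
        if self.override is not None:
            choice, self.override = self.override, None
            return choice
        scores = Counter()
        # Cue 1: type of input ("menu" suits the GUI, free text suits speech).
        scores["gui" if field_type == "menu" else "speech"] += 1.0
        # Cue 2: user preference (assumed weight 0.5).
        scores[self.preference] += 0.5
        # Cue 3: recent input history (assumed total weight 1.0).
        for modality in self.history:
            scores[modality] += 1.0 / self.history.maxlen
        return max(MODALITIES, key=lambda m: scores[m])
```

With the default settings a menu field starts out GUI-dominant, but after five consecutive spoken inputs the same call to `select("menu")` returns `"speech"`, which is the sense in which this sketch adapts.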