
Page 1: Nl201609

Evaluating Helpdesk Dialogues: Initial Considerations from An

Information Access Perspective

Tetsuya Sakai (Waseda University)

Zhaohao Zeng (Waseda University)

Cheng Luo (Tsinghua University/Waseda University)

[email protected]

September 29, 2016 @ IPSJ SIGNL (unrefereed), Osaka.

Page 2: Nl201609

TALK OUTLINE

1. Motivation

2. Related Work

3. Project Overview

4. Pilot Test Collection: Progress Report

5. Pilot Evaluation Measures

6. Summary and Future Work

Page 3: Nl201609

Motivation (1)

Page 4: Nl201609

Motivation (2)

We want to evaluate task-oriented multi-turn dialogues

Page 5: Nl201609

Motivation (3)

• We cannot conduct a subjective evaluation for every dialogue that we want to evaluate. We want an automatic evaluation method that approximates subjective evaluation.

• Build a human-human helpdesk dialogue test collection with both subjective annotations (target variables) and clues for automatic evaluation (explanatory variables).

• Using the test collection, design and verify automatic evaluation measures that approximate subjective evaluation.

• One step beyond: human-system dialogue evaluation based on the human-human dialogue test collection.

Page 6: Nl201609

TALK OUTLINE

1. Motivation

2. Related Work

3. Project Overview

4. Pilot Test Collection: Progress Report

5. Pilot Evaluation Measures

6. Summary and Future Work

Page 7: Nl201609

Evaluating non-task-oriented dialogues (1)

• Evaluating conversational responses

Discriminative BLEU [Galley+15] extends the machine-translation measure BLEU to incorporate +/- weights for human references (=gold responses) to reflect different subjective views.

• Dialogue Breakdown Detection Challenge [Higashinaka+16]

Find the point in a dialogue at which it becomes impossible to continue due to the system’s inappropriate utterances.

The system outputs a probability distribution over NB (not a breakdown), PB (possible breakdown) and B (breakdown), which is compared against a gold distribution.
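Since both the system output and the gold standard are distributions over {NB, PB, B}, they can be compared with a distribution-level divergence. Below is a minimal sketch of one such comparison, Jensen-Shannon divergence; the example distributions are made up, and the challenge's official metric set may differ from this.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for distributions given as dicts."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same labels."""
    m = {x: 0.5 * (p[x] + q[x]) for x in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical system output vs. gold distribution over NB/PB/B.
system = {"NB": 0.2, "PB": 0.5, "B": 0.3}
gold   = {"NB": 0.1, "PB": 0.3, "B": 0.6}
print(js_divergence(system, gold))  # smaller = closer to the gold distribution
```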

Page 8: Nl201609

Evaluating non-task-oriented dialogues (2)

• Evaluating the Short Text Conversation Task [Sakai+15AIRS, Shang+16]

Human-system single-turn dialogues produced by searching a repository of past tweets; the ranked lists of retrieved comments are evaluated with information retrieval measures (see the sketch after the figure below).

[Figure: a repository of old post-comment pairs, plus training and test sets of new posts. For each new post, the system retrieves and ranks old comments; each comment has a graded relevance label (L0-L2).]
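Because the retrieved comments carry graded labels (L0-L2), any graded-relevance IR measure can be applied to the ranked list. A minimal sketch of one standard such measure, nDCG, follows; the label-to-gain mapping and the example ranking are my own illustrations, and the task's official measures (see [Shang+16]) may differ.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains, cutoff=10):
    """nDCG@cutoff: DCG of the run divided by the DCG of an ideal reordering."""
    ideal = sorted(gains, reverse=True)
    denom = dcg(ideal[:cutoff])
    return dcg(gains[:cutoff]) / denom if denom > 0 else 0.0

# Map graded labels L0/L1/L2 to gains 0/1/3 (an assumption, not the official mapping).
labels = ["L2", "L0", "L1", "L0", "L1"]        # labels of the ranked comments
gain_of = {"L0": 0, "L1": 1, "L2": 3}
print(ndcg([gain_of[l] for l in labels]))
```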

Page 9: Nl201609

Evaluating task-oriented dialogues (1)

• PARADISE [Walker+97]

Task: train timetable lookup. User satisfaction = f(task success, cost).

Attribute-value matrix (depart-city=?, arrival-city=?, depart-time=?, ...). (A sketch of this combination appears after this list.)

• Spoken Dialogue Challenge [Black+09]

Task: Bus timetable lookup

Live evaluation by calling systems on the phone

• Dialogue State Tracking Challenge [Williams+13, Kim+16]

Task: bus timetable lookup

Evaluation: at each time t, the system outputs a probability distribution over possible dialogue states (e.g. different bus routes), which is compared with a gold label.

Closed-domain, slot-filling tasks
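The PARADISE combination mentioned above estimates user satisfaction from a task-success term (a kappa statistic over the attribute-value matrix) minus weighted cost terms such as the number of turns. A minimal sketch under that reading of [Walker+97]; the weights and cost features below are made up, whereas the original learns the weights by linear regression over normalised features.

```python
def paradise_satisfaction(kappa, costs, alpha=0.5, weights=None):
    """PARADISE-style estimate: satisfaction = alpha * kappa - sum_i w_i * cost_i.
    kappa: task-success measure over the attribute-value matrix (assumed normalised).
    costs: dict of cost measures (e.g. number of turns), assumed normalised here."""
    weights = weights or {name: 0.1 for name in costs}  # made-up default weights
    return alpha * kappa - sum(weights[name] * value for name, value in costs.items())

# Hypothetical dialogue: high task success, moderate cost.
print(paradise_satisfaction(kappa=0.9, costs={"turns": 0.4, "asr_errors": 0.2}))
```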

Page 10: Nl201609

Evaluating task-oriented dialogues (2)

• Subjective Assessment of Speech System Interfaces (SASSI) [Hone+00]

Task: In-car speech interface

Factor analysis of questionnaires revealed the following as key factors for subjective assessment:

- system response accuracy

- likeability

- cognitive demand

- annoyance

- habitability

- speed

• SERVQUAL [Hartikainen+04]

Task: Phone-based email application

Closed-domain, slot-filling tasks

Page 11: Nl201609

Evaluating task-oriented dialogues (3)

• Response Selection [Lowe+15]

Ubuntu corpus containing “artificial” dyadic dialogues.

Task: Ubuntu Q&A. This is the setting most similar to ours, with no pre-defined slot-filling scheme.

Response selection task: each instance pairs a previous dialogue context with one candidate response.

- Previous dialogue context + correct response from the original dialogue
- Previous dialogue context + incorrect response taken from another dialogue
- Previous dialogue context + incorrect response taken from another dialogue
- ...

Given the context, can the system choose the correct response from 10 choices?
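A minimal sketch of how such 1-in-10 response-selection instances can be assembled from dyadic dialogues: the true next utterance is the positive candidate and responses sampled from other dialogues act as distractors. The data structures are my own illustration; the exact construction in [Lowe+15] differs in detail.

```python
import random

def build_instance(dialogues, idx, num_candidates=10, rng=random):
    """Build one response-selection instance from dialogue number idx.
    Each dialogue is a list of utterance strings; the last utterance is the response."""
    context, correct = dialogues[idx][:-1], dialogues[idx][-1]
    distractors = []
    while len(distractors) < num_candidates - 1:
        other = rng.randrange(len(dialogues))
        if other != idx:
            distractors.append(dialogues[other][-1])   # response from another dialogue
    candidates = [correct] + distractors
    rng.shuffle(candidates)
    return {"context": context, "candidates": candidates, "answer": correct}

dialogues = [["my wifi driver broke", "which ubuntu version?", "16.04", "try reinstalling bcmwl"],
             ["how do I mount an iso?", "use mount -o loop file.iso /mnt"],
             ["apt is locked", "kill the other apt process first"]]
print(build_instance(dialogues, 0, num_candidates=3))
```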

Page 12: Nl201609

Evaluating textual information access (1) [Sakai15book]

• ROUGE for summarisation evaluation [Lin04]

Recall and F-measure based on n-grams and skip bigrams.

Requires multiple reference summaries.

• Nugget pyramids and POURPRE for QA [Lin+06]

• Nugget definition at TREC QA: “a fact for which the assessor could make a binary decision as to whether a response contained that nugget.”

Nugget recall, allowance-based nugget precision, nugget F-measure.

POURPRE: replaces manual nugget matching with automatic nugget matching based on unigrams.

Text is regarded as a set of small textual units
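A minimal sketch of the nugget scores mentioned above, as I understand them from [Lin+06]: recall over matched nuggets, a length allowance of a fixed number of characters per matched nugget, and a recall-oriented F-measure. The allowance constant (100 characters) and beta value (3) are the commonly reported settings and should be read as assumptions here.

```python
def nugget_scores(matched, total_nuggets, response_length,
                  allowance_per_nugget=100, beta=3.0):
    """Nugget recall, allowance-based precision, and nugget F-measure.
    matched: number of gold nuggets found in the response (unweighted version).
    response_length: length of the response in non-whitespace characters (assumption)."""
    recall = matched / total_nuggets if total_nuggets else 0.0
    allowance = allowance_per_nugget * matched
    if response_length <= allowance:
        precision = 1.0                   # short enough: no length penalty
    else:
        precision = 1.0 - (response_length - allowance) / response_length
    if precision + recall == 0:
        return recall, precision, 0.0
    f = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return recall, precision, f

print(nugget_scores(matched=3, total_nuggets=5, response_length=450))
```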

Page 13: Nl201609

Evaluating textual information access (2) [Sakai15book]

• S-measure [Sakai+11CIKM]

A measure for query-focussed summaries, introduces a decay function over text, just as nDCG uses a decay function over ranks.

• T-measure [Sakai+12AIRS]

Nugget-precision that can handle different allowances for different nuggets.

• U-measure [Sakai+13SIGIR]

A generalisation of S, which works for any textual information access task, including web search, summaries, sessions, etc.

Page 14: Nl201609

Building trailtexts for U-measure (1)

Trailtext: <Sentence A> <Sentence Z>

Trailtext: <Rank 1 snippet> <Rank 2 snippet> <Rank 2 full text> <Rank 1 full text>

(Nonlinear traversal: the user may read documents in an order different from the ranking.)

Page 15: Nl201609

Building trailtexts for U-measure (2)

Trailtext: <News 1> <Ad 2> <Blog 1>

Trailtext: <Rank 1 snippet> <Rank 2 snippet> <Rank 1’ snippet> <Rank 1’ full text>
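In the helpdesk setting the trailtext is simply the dyadic dialogue read in order. A minimal sketch of deriving a trailtext and per-post reading positions; measuring the amount read in raw characters is an assumption (other text units are possible), and the example posts are made up.

```python
def build_trailtext(posts):
    """Concatenate posts in reading order into a trailtext and record, for each post,
    the offset (in characters) at which the user finishes reading it."""
    trailtext, end_positions = "", []
    for text in posts:
        trailtext += text
        end_positions.append(len(trailtext))
    return trailtext, end_positions

posts = ["my phone won't boot", "which model do you have?",
         "model X", "hold power + volume-down for 10s"]
print(build_trailtext(posts)[1])   # cumulative reading positions
```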

Page 16: Nl201609

U-measure

U = Σ_pos g(pos) · D(pos), where D(pos) = max(0, 1 - pos/L)

g(pos): gain at position pos
D(pos): decay function that discounts the gain
pos: position in the trailtext (how much text the user has read)
L: the maximum amount of text the user is willing to read
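A minimal sketch of the computation above, assuming each nugget is represented by the trailtext position at which the user finishes reading it; the gains and the patience parameter L below are illustrative.

```python
def u_measure(nuggets, L):
    """U-measure: sum of nugget gains discounted by how much text precedes each nugget.
    nuggets: list of (pos, gain) pairs, where pos is the character offset of the
             nugget's end within the trailtext.
    L: the maximum amount of text the user is willing to read."""
    def decay(pos):
        return max(0.0, 1.0 - pos / L)
    return sum(gain * decay(pos) for pos, gain in nuggets)

# A nugget of gain 2 read after 300 characters and one of gain 1 after 900 characters.
print(u_measure([(300, 2), (900, 1)], L=1000))   # = 2*0.7 + 1*0.1 = 1.5
```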

Page 17: Nl201609

Advertisement: http://sigir.org/sigir2017/

Jan 17: full paper abstracts due

Jan 24: full papers due

Feb 28: short papers and demo proposals due

Aug 7: tutorials and doctoral consortium

Aug 8-10: main conference

Aug 11: workshops

The first ever SIGIR in Japan!

Page 18: Nl201609

TALK OUTLINE

1. Motivation

2. Related Work

3. Project Overview

4. Pilot Test Collection: Progress Report

5. Pilot Evaluation Measures

6. Summary and Future Work

Page 19: Nl201609

Project overview

(1) Construct a pilot Chinese human-human dialogue test collection with subjective labels, nuggets and English translations.

(2) Design nugget-based evaluation measures and investigate the correlation with subjective measures.

(3) Revise criteria for subjective and nugget annotations, as well as the measures.

(4) Construct a larger test collection with subjective labels, nuggets and English translations. Re-investigate the correlations.

(5) Release the finalised test collection with code for computing the measures.

(φ1 and φ2 denote the two phases of the project.)

Page 20: Nl201609

Subjective labels = target variables

Possible axes for subjective annotation:

- Is the task clearly stated, and is it actually accomplished?

- How efficiently is the task accomplished through the dialogue?

- Is Customer satisfied with the dialogue, and to what degree?

Interlocutor viewpoints:

Customer’s viewpoint: Solve my problem efficiently, but I’m giving you minimal information about it.

Helpdesk’s viewpoint: Solve Customer’s problem efficiently, as time is money for the company.

The two viewpoints may be weighted depending on practical needs.

Page 21: Nl201609

Why nuggets?

• Subjective labels tell us about the quality of the entire dialogue, but not about why.

• Helpdesk dialogues lack pre-defined slot filling schemes.

• Subjective scores (gold standard) = f(nuggets) ?

• Parts-Make-The-Whole Hypothesis: The overall quality of a helpdesk dialogue is governed by the quality of its parts.

[Figure: a Customer (C) / Helpdesk (H) dialogue; overall quality (subjective) ≈ f(nuggets).]

Page 22: Nl201609

Nugget annotation vs subjective annotation

• Consistency Hypothesis: Nugget annotation achieves higher inter-annotator consistency. (Smaller units reduce subjectivity and variation in the annotation procedure.)

• Sensitivity Hypothesis: Nugget annotation enables finer distinctions among different dialogues. (Nuggets = details)

• Reusability Hypothesis: Nugget annotation enables us to predict the quality of unannotated dialogues more accurately.

[Figure: two C/H dialogues for the same task, one WITH annotations and one WITHOUT; nuggets from the annotated dialogue are reused to evaluate the unannotated one.]

Page 23: Nl201609

Unique features of nuggets for dialogue evaluation

• A dialogue involves Customer and Helpdesk (not a single search engine user) – two types of nuggets

• Within each nugget type, nuggets are not homogeneous

- Special nuggets that identify the task (trigger nuggets)

- Special nuggets that accomplish the task (goal nuggets)

- Regular nuggets

Page 24: Nl201609

Customer’s states and the role of nuggets

Page 25: Nl201609

Possible requirements for nugget-based evaluation measures (1)

(a) Highly correlated with subjective labels. (Validates the Parts-Make-The-Whole Hypothesis.)
(b) Easy to compute and to interpret.
(c) Accommodate Customer's and Helpdesk's viewpoints, and change the balance if required.
(d) Accommodate nugget weights (i.e., importance).
(e) For a given task, prefer a dialogue that accomplishes the task over one that does not.
(f) Given two dialogues containing the same set of nuggets for the same task, prefer the shorter one.
(g) Given two dialogues that accomplish the same task, prefer the one that reaches task accomplishment more quickly.

[Figure: two C/H dialogues for the same task; the one containing a goal nugget is preferred (>) over the one with no goal nuggets, illustrating requirement (e).]

Page 26: Nl201609

Possible requirements for nugget-based evaluation measures (2)

(Same requirements (a)-(g) as above.)

[Figure: two C/H dialogues of different lengths for the same task; the shorter one is preferred (>), illustrating requirement (f).]

Page 27: Nl201609

Possible requirements for nugget-based evaluation measures (3)

(Same requirements (a)-(g) as above.)

[Figure: two C/H dialogues for the same task, each containing a goal nugget; the one that reaches its goal nugget earlier is preferred (>), illustrating requirement (g).]

Page 28: Nl201609

Possible requirements for nugget-based evaluation measures (4)

(Same requirements (a)-(g) as above.)

Page 29: Nl201609

After completing the project: evaluating human-system dialogues

[Figure: proposed human-system evaluation setup.]

• Utilise the human-human dialogue test collection (with subjective and nugget annotations) as an unstructured knowledge base.

• Sample an annotated C/H dialogue and take its task; a participant initiates a human-system dialogue for the same task, using the participant's own expressions.

• The participant terminates the dialogue as soon as they receive an incoherent or breakdown-causing utterance from the system.

• Can the system still provide the goal nuggets? How does human-system UCH compare with human-human UCH?

Page 30: Nl201609

TALK OUTLINE

1. Motivation

2. Related Work

3. Project Overview

4. Pilot Test Collection: Progress Report

5. Pilot Evaluation Measures

6. Summary and Future Work

Page 31: Nl201609

Dialogue mining (100% done)

Pilot data containing 234 Customer-Helpdesk dialogues obtained as follows:

1. Collect an initial set of Weibo accounts A0 by searching account names with keywords such as assistant and helper (in Chinese).

2. For each account in A0, crawl the 200 most recent posts that mention that account using “@”. Filter out accounts that did not respond to more than 50% of these posts, and let the remaining set of “active” accounts be A.

3. For each account in A, crawl the 2,000 most recent posts that mention that account, and then extract the dialogues with at least 5 Customer posts AND at least 5 Helpdesk posts.
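A minimal sketch of the account-filtering step (step 2), assuming the crawled posts are already available as simple records. The helper `crawl_recent_mentions` is hypothetical, not a real Weibo API call, and the record fields are my own illustration.

```python
def active_accounts(accounts, crawl_recent_mentions, min_response_rate=0.5, sample_size=200):
    """Keep accounts that responded to more than min_response_rate of their recent mentions.
    crawl_recent_mentions(account, n): hypothetical crawler returning a list of dicts
    like {"got_reply": True/False} for the n most recent posts mentioning the account."""
    active = []
    for account in accounts:
        posts = crawl_recent_mentions(account, sample_size)
        if not posts:
            continue
        response_rate = sum(1 for p in posts if p["got_reply"]) / len(posts)
        if response_rate > min_response_rate:
            active.append(account)
    return active

# Toy demonstration with a fake crawler.
fake_data = {"helper_a": [{"got_reply": True}] * 150 + [{"got_reply": False}] * 50,
             "helper_b": [{"got_reply": True}] * 40 + [{"got_reply": False}] * 160}
print(active_accounts(fake_data, lambda acc, n: fake_data[acc][:n]))  # -> ['helper_a']
```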

Page 32: Nl201609

Subjective annotation criteria

Low inter-annotator agreement (next slide)

Page 33: Nl201609

Subjective annotation (100% done)

See [Randolph05], [Sakai15book].
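Given the [Randolph05] citation, the agreement figures (not shown in this transcript) are presumably reported with Randolph's free-marginal multirater kappa. A minimal sketch of that statistic under my understanding of its definition; the example ratings are made up.

```python
def randolph_kappa(ratings, num_categories):
    """Randolph's free-marginal multirater kappa.
    ratings: one dict per item, mapping category -> number of raters who chose it.
    All items are assumed to be judged by the same number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0].values())
    # Observed agreement: proportion of agreeing rater pairs, averaged over items.
    p_o = sum(c * (c - 1) for item in ratings for c in item.values())
    p_o /= n_items * n_raters * (n_raters - 1)
    p_e = 1.0 / num_categories            # free-marginal chance agreement
    return (p_o - p_e) / (1.0 - p_e)

# Two annotators labelling four dialogues on a 3-point scale (toy data).
ratings = [{"good": 2}, {"good": 1, "bad": 1}, {"ok": 2}, {"bad": 1, "ok": 1}]
print(randolph_kappa(ratings, num_categories=3))
```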

Page 34: Nl201609

Nugget definition (for annotators)

• A post: a piece of text input by Customer/Helpdesk who presses ENTER to upload it on Weibo.

• A nugget:

(I) is a post, or a sequence of consecutive posts by the same interlocutor.

(II) can neither partially nor wholly overlap with another nugget.

(III) should be minimal: it should not contain irrelevant posts at the start, end, or middle.

(IV) helps Customer transition from Current State towards Target State.
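Constraints (I) and (II) are structural and can be checked mechanically, whereas (III) and (IV) need human judgement. A minimal sketch of such a check, assuming posts and nugget spans are represented by simple indices; this representation is my own, not the project's annotation format.

```python
def validate_nuggets(posts, nuggets):
    """Check structural constraints on nugget annotations.
    posts: list of (post_index, interlocutor) pairs describing the dialogue.
    nuggets: list of (start_index, end_index) pairs, inclusive, over post indices."""
    interlocutor = dict(posts)
    covered = set()
    for start, end in nuggets:
        span = range(start, end + 1)
        # (I) a post, or consecutive posts by the same interlocutor
        assert len({interlocutor[i] for i in span}) == 1, "nugget mixes interlocutors"
        # (II) no partial or whole overlap with another nugget
        assert covered.isdisjoint(span), "nugget overlaps another nugget"
        covered.update(span)

posts = [(0, "C"), (1, "H"), (2, "H"), (3, "C")]
validate_nuggets(posts, [(0, 0), (1, 2)])   # passes: one C nugget, one H nugget
```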

Page 35: Nl201609

Nugget types (for annotators)

CNUG0: Customer trigger nuggets. Define Customer’s initial problem.

CNUG: Customer regular nuggets.

HNUG: Helpdesk regular nuggets.

CNUG*: Customer goal nuggets. Customer tells Helpdesk that the problem has been solved.

HNUG*: Helpdesk goal nuggets. Helpdesk provides customer with a solution to the problem.

Nuggets have been annotated for 40/234 = 17% of the dialogues so far.

Page 36: Nl201609

TALK OUTLINE

1. Motivation

2. Related Work

3. Project Overview

4. Pilot Test Collection: Progress Report

5. Pilot Evaluation Measures

6. Summary and Future Work

Page 37: Nl201609

Pilot measures for dialogue evaluation

• U-measure [Sakai+13SIGIR]

Trailtext = concatenation of all texts that the search engine user has read

• UCH (U computed based on Customer’s and Helpdesk’s nuggets)

Trailtext = dyadic dialogue

- UC = U computed based on Customer’s nuggets (Helpdesk’s viewpoint)

- UH = U computed based on Helpdesk’s nuggets (Customer’s viewpoint)

UCH = (1-α) UC + α UH

Page 38: Nl201609

Nugget positions (1)

[Figure: positions of a trigger nugget and regular nuggets within the trailtext.]

Page 39: Nl201609

Nugget positions (2)

[Figure: positions of regular nuggets and goal nuggets within the trailtext.]

Page 40: Nl201609

UCH = (1-α) UC + α UH. When α = 0.5, UCH is the U-measure that places the two graphs (UC and UH) on top of each other.

Design notes:

- The weight of a goal nugget can be set higher than the sum of the other nuggets' weights.

- Normalisation? Unnecessary if score standardisation is applied [Sakai16ICTIR, Sakai16AIRS].

- L: the maximum tolerable dialogue length (used in the decay function).
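A minimal sketch of UCH built on the same decay as the U-measure above, assuming each nugget is tagged with the interlocutor who posted it. The α value, positions and weights (including the heavier goal nugget) are illustrative.

```python
def uch(nuggets, L, alpha=0.5):
    """UCH = (1 - alpha) * UC + alpha * UH.
    nuggets: list of (interlocutor, pos, gain) triples, interlocutor in {"C", "H"}.
    UC uses Customer's nuggets (Helpdesk's viewpoint); UH uses Helpdesk's nuggets
    (Customer's viewpoint)."""
    def u(selected):
        return sum(gain * max(0.0, 1.0 - pos / L) for _, pos, gain in selected)
    uc = u([n for n in nuggets if n[0] == "C"])
    uh = u([n for n in nuggets if n[0] == "H"])
    return (1 - alpha) * uc + alpha * uh

# Trigger nugget (C), a regular Helpdesk nugget, and a heavily weighted Helpdesk goal nugget.
dialogue = [("C", 100, 1), ("H", 400, 1), ("H", 800, 4)]
print(uch(dialogue, L=1000, alpha=0.5))
```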

Page 41: Nl201609

Possible variants

• Use different decay functions for Customer and Helpdesk.

• Use time rather than trailtext position as the basis for discounting, as in Time-Biased Gain [Smucker+12].

Pro: the gap between the timestamps of two posts can be quantified.

Con/Pro: the amount of information conveyed in each post in a particular language cannot be quantified, but this also brings language independence.

But remember Requirement (b): measures should be easy to compute and to interpret.

(The time-based variant would need a maximum tolerable dialogue duration instead of a maximum length.)
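For the time-based variant, one option is the exponential decay used in Time-Biased Gain, parameterised by a half-life. A minimal sketch below; the half-life value and nugget timestamps are illustrative, and whether such a decay would actually be adopted for UCH is, as the slide notes, an open design choice.

```python
import math

def time_decayed_gain(nuggets, half_life_seconds):
    """Sum of nugget gains with an exponential time decay (Time-Biased-Gain style).
    nuggets: list of (seconds_elapsed, gain) pairs, measured from the dialogue start."""
    return sum(gain * math.exp(-t * math.log(2) / half_life_seconds)
               for t, gain in nuggets)

# A nugget after 1 minute and a goal nugget after 10 minutes, half-life of 10 minutes.
print(time_decayed_gain([(60, 1), (600, 3)], half_life_seconds=600))
```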

Page 42: Nl201609

A sneak peek (40 annotated dialogues)

[Figure: pilot results plotted against subjective annotation criterion Q3.]
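The planned analysis correlates measures such as UCH with the subjective criteria (e.g. Q3) over the annotated dialogues. A minimal sketch with made-up scores; the pilot study may well use a different correlation coefficient.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up UCH scores and Q3 subjective labels for five dialogues.
uch_scores = [1.2, 0.4, 0.9, 1.5, 0.2]
q3_labels  = [2,   1,   2,   2,   0]
print(pearson(uch_scores, q3_labels))
```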

Page 43: Nl201609

TALK OUTLINE

1. Motivation

2. Related Work

3. Project Overview

4. Pilot Test Collection: Progress Report

5. Pilot Evaluation Measures

6. Summary and Future Work

Page 44: Nl201609

Conclusions and future work (1)

(1) Construct a pilot Chinese human-human dialogue test collection with subjective labels, nuggets and English translations.

(2) Design nugget-based evaluation measures and investigate the correlation with subjective measures.

(3) Revise criteria for subjective and nugget annotations, as well as the measures.

(4) Construct a larger test collection with subjective labels, nuggets and English translations. Re-investigate the correlations.

(5) Release the finalised test collection with code for computing the measures.

(φ1 and φ2 denote the two project phases; “Done” marks the parts completed so far.)

Page 45: Nl201609

Conclusions and future work (2)

After φ2: evaluate human-system dialogues against the human-human dialogue test collection, following the setup outlined earlier (Page 29): utilise the collection as an unstructured knowledge base, let a participant initiate a human-system dialogue for a sampled task, and check whether the system can still provide the goal nuggets and how human-system UCH compares with human-human UCH.

Page 46: Nl201609

Advertisement

Short Text Conversation @ NTCIR-13: single-turn human-system dialogues

http://ntcirstc.noahlab.com.hk/STC2/stc-cn.htm

We Want Web @ NTCIR-13: improving ad hoc web search over 4.5 years

http://www.thuir.cn/ntcirwww/

Page 47: Nl201609

Selected References (1)

[Black+09] The Spoken Dialogue Challenge, Proceedings of SIGDIAL 2009

[Galley+15] ΔBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets, Proceedings of ACL 2015.

[Higashinaka+16] The Dialogue Breakdown Detection Challenge: Task Description, Datasets, and Evaluation Metrics, Proceedings of LREC 2016.

[Hartikainen+04] Subjective Evaluation of Spoken Dialogue Systems Using SERVQUAL Method, Proceedings of INTERSPEECH 2004-ICSLP.

[Hone+00] Towards a Tool for the Subjective Assessment of Speech System Interfaces (SASSI), Natural Language Engineering, 6(3-4), 2000.

[Kim+16] The Fourth Dialog State Tracking Challenge, Proceedings of IWSDS 2016.

[Lin04] ROUGE: A Package for Automatic Evaluation of Summaries, Proceedings of the Workshop on Text Summarization Branches Out, 2004.

[Lin+06] Will Pyramids Built of Nuggets Topple Over? Proceedings of HLT/NAACL 2006.

[Lowe+15] The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, Proceedings of SIGDIAL 2015.

Page 48: Nl201609

Selected References (2)

[Sakai+11CIKM] Click the Search Button and Be Happy: Evaluating Direct and Immediate Information Access, Proceedings of ACM CIKM 2011.

[Sakai+12AIRS] One Click One Revisited: Enhancing Evaluation based on Information Units, Proceedings of AIRS 2012.

[Sakai+13SIGIR] Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation, Proceedings of ACM SIGIR 2013.

[Sakai+15AIRS] Topic Set Size Design with the Evaluation Measures for Short Text Conversation, Proceedings of AIRS 2015.

[Sakai15book] 情報アクセス評価方法論: 検索エンジンの進歩のために (Information Access Evaluation Methodology: For the Progress of Search Engines), Corona Publishing (コロナ社), 2015.

[Sakai16AIRS] The Effect of Score Standardisation on Topic Set Size Design, Proceedings of AIRS 2016, to appear.

[Sakai16ICTIR] A Simple and Effective Approach to Score Standardisation, Proceedings of ACM ICTIR 2016.

[Sakai16SIGIR] Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015, Proceedings of ACM SIGIR 2016.

Page 49: Nl201609

Selected References (3)

[Shang+16] Overview of the NTCIR-12 Short Text Conversation Task, Proceedings of NTCIR-12, 2016.

[Smucker+12] Time-Based Calibration of Effectiveness Measures, Proceedings of ACM SIGIR 2012.

[Walker+97] PARADISE: A Framework for Evaluating Spoken Dialogue Agents, Proceedings of ACL 1997.

[Williams+13] The Dialog State Tracking Challenge, Proceedings of SIGDIAL 2013.

Page 50: Nl201609

Acknowledgements

• We thank Hang Li and Lifeng Shang (Huawei Noah's Ark Lab) for helpful discussions and continued support; and Guan Jun, Lingtao Li and Yimeng Fan (Waseda University) for helping us construct the pilot test collection.

• We also thank Ryuichiro Higashinaka (NTT Media Intelligence Laboratories) for providing us with valuable information related to the evaluation of non-task-oriented dialogues.