
Proceedings of the

Interdisciplinary Workshop on

Feedback Behaviors in Dialog

September 7-8, 2012

Skamania Lodge

Stevenson, Washington, USA

Author Index

Papers

Organizers

Program


OVERVIEW

Feedback skills are important for people (or machines) wishing to be able to function as supportive, cooperative listeners. The production and comprehension of back-channels and related phenomena, including response tokens, reactive tokens, minimal responses, continuers and acknowledgments, are also of scientific interest, as possibly the most accessible example of the real-time responsiveness that underpins many successful interpersonal interactions. This workshop provides a venue for an interdisciplinary examination of these phenomena and feedback behaviors in dialog.

ORGANIZERS

Nigel Ward, University of Texas at El Paso

David Novick, University of Texas at El Paso

Louis-Philippe Morency, University of Southern California

Tatsuya Kawahara, Kyoto University

Dirk Heylen, University of Twente

Jens Edlund, KTH


PROGRAM

SCHEDULE

FRIDAY SEPTEMBER 7, 2012

12:00 lunch

12:50 welcome

1:00 When do we say 'Mhmm'? Backchannel Feedback in Dialogue, Julia Hirschberg

2:00 A testbed for examining the timing of feedback using a Map Task, Gabriel Skantze

2:30 Teaser Talks for posters and demos

3:00 coffee and snacks

3:30 Poster Session 1

4:45 Beyond Back-channels: A Three-step Model of Grounding in Face-to-face Dialogue,

Janet Beavin Bavelas, Peter De Jong, Harry Korman, Sara Smock Jordan.

5:30 Data Analysis Breakouts / Discussion

6:00 Summary Discussion

6:15 dinner

SATURDAY SEPTEMBER 8, 2012

7:30 breakfast

8:30 Machines Don't Listen (But Neither Do People), Graham Bodie

9:30 Poster Session 2

10:30 coffee break

10:45 organization into topical groups

11:00 Topical Break-Out Sessions

12:00 Reporting back, discussions

12:15 lunch (buffet or box lunches)

1:00 Poster Session 3

2:15 group hike

3:30 coffee and snacks

3:45 Topical Break-Out Sessions

5:00 Reporting Back

5:45 Closing Words and Activity Planning

6:15 dinner

conference room open for after-dinner discussions


BREAK-OUT SESSIONS

Break-out sessions will be in groups of 5-7 participants. Attendees will express topic preferences, choosing from the list below plus any other suggestions, and the organizers will facilitate clustering into groups.

Some Possible Topics:

back-channeling behavior under conditions of noise and load

generalizing beyond single-domain models of feedback

formal models of feedback, especially incorporating timing

pathways to improving feedback behaviors in commercial dialog systems

implications of findings about feedback behaviors for speech science more generally

understanding individual differences in feedback behaviors, or abstracting away from them

how to evaluate the quality of feedback prediction approaches

beyond timing prediction: selecting the type and form of feedback

multimodal feedback

spoken dialog in the post-Siri age


PAPERS

KEYNOTE

When do we say 'Mhmm'? Backchannel feedback in dialogue.

Julia Hirschberg.

Machines don't listen (But neither do people).

Graham Bodie.

Feedback in adaptive interactive storytelling.

Timo Baumann.

Beyond back-channels: A three-step model of grounding in face-to-face dialogue.

Janet Beavin Bavelas, Peter De Jong, Harry Korman, Sara Smock Jordan.

Adapting language production to listener feedback behaviour.

Hendrik Buschmeier, Stefan Kopp.

Effect of linguistic contents on human estimation of internal state of dialog system users.

Yuya Chiba, Masashi Ito, Akinori Ito.

A survey on evaluation metrics for backchannel prediction models.

Iwan de Kok, Dirk Heylen.

Third party observer gaze during backchannels.

Jens Edlund, Mattias Heldner, Anna Hjalmarsson.

Feedback and activity in dialogue: signals or symptoms?

Andres Gargett.

Listener's responses during storytelling in French conversation.

Mathilde Guardiola, Roxane Bertrand, Robert Espesser, Stéphane Rauzy.

Crowdsourcing backchannel feedback: Understanding the individual variability from the crowds.

Lixing Huang, Jonathan Gratch.


Can we predict who in the audience will ask what kind of questions with their feedback behaviors in poster conversation?

Tatsuya Kawahara, Takuma Iwatate, Takanori Tsuchiya, Katsuya Takanashi.

Evaluating a minimally invasive laboratory architecture for recording multimodal conversational data.

Spyros Kousidis, Thies Pfeiffer, Zofia Malisz, Petra Wagner, David Schlangen.

The temporal relationship between feedback and pauses: a pilot study.

Kristina Lundholm Fors.

Cues to perceived functions of acted and spontaneous feedback expressions.

Daniel Neiberg, Joakim Gustafson.

Exploring the implications for feedback of a neurocognitive theory of overlapped speech.

Daniel Neiberg, Joakim Gustafson.

Paralinguistic behaviors in dialog as a continuous process.

David Novick.

Empathy and feedback in conversations about felt experience.

Nicola Plant, Pat Healey.

CoFee - Toward a multidimensional analysis of conversational feedback, the case of French language.

Laurent Prévot, Roxane Bertrand.

Investigating the influence of pause fillers for automatic backchannel prediction.

Stefan Scherer, Derya Ozkan, Louis-Philippe Morency.

A testbed for examining the timing of feedback using a Map Task.

Gabriel Skantze.

Clarification questions with feedback.

Svetlana Stoyanchev, Alex Liu, Julia Hirschberg.


Acoustic, morphological, and functional aspects of "yeah/ja" in Dutch, English and German.

Jürgen Trouvain, Khiet P. Truong.

Possible lexical cues for backchannel responses.

Nigel G. Ward.

Visualizations supporting the discovery of prosodic contours related to turn-taking.

Nigel G. Ward, Joshua L. McCartney.

Where in dialog space does uh-huh occur?

Nigel G. Ward, David G. Novick, Alejandro Vega.

Listener head gestures and verbal feedback expressions in a distraction task.

Marcin Wlodarczak, Hendrik Buschmeier, Zofia Malisz, Stefan Kopp, Petra Wagner.


When do we say ‘Mhmm’?: Backchannel Feedback in Dialogue

Julia Hirschberg
Columbia University

In human-human dialogues, speakers regularly indicate that they are attending to their conversational partner by producing oral backchannels like 'mhmm' and by facial and other gestures. One intriguing question for students of conversation is how speakers decide when to produce such signals. A number of us at Columbia, UPenn, the University of Buenos Aires, and Constantine the Philosopher University have been studying the acoustic-prosodic behavior of speakers that precedes the production of oral backchannels by their partners, in order to identify backchannel-preceding cues, which may give us some insight into partners' decisions to backchannel or not. We have also been examining how backchannel behavior on the part of one partner is reflected in the backchannel behavior of the other partner in conversation, as speakers entrain on each other's backchannel-preceding cues. I will present results of these investigations and explore their consequences for Spoken Dialogue Systems.


Machines Don't Listen (But Neither Do People)
A Keynote Address prepared for the Interdisciplinary Workshop on Feedback Behaviors in Dialog

Graham D. Bodie, Ph.D.

Assistant Professor, Louisiana State University

Listening is the capacity to discern the underlying habitual character and attitudes of people with whom we communicate. It goes beyond perception and sensation of sound and is more than the mere comprehension of another's utterance (Bodie, Worthington, Imhof, & Cooper, 2008). At its best, listening brings about a sense of shared experience and mutual understanding through the co-creating of rules based on sharing of meaningful and conscientious dialogue (Bodie & Crick, 2012). As such, it is something humans do innately but not necessarily something we all do well. When we are "listened to" we experience a range of positive outcomes from feeling better about ourselves to improved immunological function and better psychological well-being (Bodie, 2012). When we feel misunderstood or otherwise ignored, however, our health and relationships suffer.

The importance of listening is something that we all know on an intuitive level. Perhaps that is why self-help gurus and academics alike hold a central place for listening in their advice for how to improve at a wide range of life tasks. Unfortunately, it is far easier to praise listening than to articulate a clear idea of just what listening is or to detail just what listeners do in order to be perceived as competent and to engender the myriad positive associated outcomes. Being a good listener is important to parenting, marital relationships, salesperson performance, customer satisfaction, and healthcare provision; and the list could go on. Good listeners can enhance others' ability to cope with and remember events; they are more liked and garner more trust than those less proficient; and they have higher academic achievement, better socio-emotional development, and a higher likelihood of upward mobility in the workplace (for review see Bodie, 2012).

But what specific messages and behaviors lead to impressions of individuals as good listeners? This question has been largely ignored in the extant literature. My colleagues and I have begun to answer this important question by building an empirical database of the attributes (what listening is) and behaviors (what listeners do) associated with effective listening in two contexts, initial interactions (Bodie et al., 2012; Bodie, St. Cyr, Pence, Rold, & Honeycutt, 2012) and supportive conversations (Bodie & Jones, in press; Bodie, Jones, & Vickery, 2012; Bodie, Vickery, & Gearhart, in press). This talk will outline those behaviors most important to perceiving others as "good" listeners in order to spur discussion about how to apply our work to contexts outside of interpersonal interaction and about the inherently interdisciplinary future of listening research. This talk will additionally posit that although research is underway by several to create "humanlike" machines, machines do not and never will, in fact, "listen."

References:

Bodie, G. D. (2012). Listening as positive communication. In T. Socha & M. Pitts (Eds.), The positive side of interpersonal communication (pp. 109-125). New York: Peter Lang.

Bodie, G. D., & Crick, N. (2012). Making listening clear: Charles Sanders Peirce and the phenomenological foundations of communication. Unpublished manuscript submitted for publication. Department of Communication Studies. Louisiana State University. Baton Rouge, LA.

Bodie, G. D., Worthington, D. L., Imhof, M., & Cooper, L. (2008). What would a unified field of listening look like? A proposal linking past perspectives and future endeavors. International Journal of Listening, 22, 103-122. doi: 10.1080/10904010802174867


Feedback in Adaptive Interactive Storytelling

Timo Baumann

Department for Informatics, University of Hamburg, Germany

[email protected]

Abstract

Telling stories is different from reading out text: a speaker responds to the listener's feedback and incorporates this into the ongoing talk. However, current computer systems are unable to do this and instead just non-attentively read out a text, disregarding all feedback (or the absence thereof). I propose and discuss an idea for a small research project and a plan for how an attentive listening storyteller can be built.

Index Terms: Incrementality, Feedback, Storytelling, Adaptation, Prosody

1. Introduction

Interactive storytelling [1] deals with adaptations to stories based on listener input to change the content of the story (making this area somewhat similar to computer games). So far, such adaptations happen at a relatively coarse granularity and user input is integrated only with some delay. Recent advances in incremental speech processing technology [2, 3] enable adaptations to happen at a much finer granularity. Thus, it is not only possible to react to a user's verbalized requests by changing the content of the story, but also by adapting the delivery of the story, based on reactions to concurrent feedback utterances (or the absence thereof).

I consider interactive storytelling (or more to the point: interactive story delivery) an ideal testbed for fine-granular, micro-temporal turn-taking behaviour, as the content and types of adaptations are fully controlled by the system, far more than for task-driven dialogue systems.

2. Research and Development Goals

I believe that micro-temporal interactive storytelling can help in several areas of feedback research.

• Develop techniques to elicit feedback in an interactive system, which involves building speech synthesis systems with conversational abilities. (Work on eliciting feedback utterances for spoken dialogue systems [4] nonetheless will also apply to interactive storytelling.)

• Advance ASR and VAD technology to reliably recognize feedback utterances in an incremental fashion to allow for timely adaptation; this includes recognizing para-linguistic phenomena such as inhalation, lip-smacks, and the like.

• Develop techniques to incorporate feedback into the ongoing speech.

Once these basic technological questions have been addressed, such a system can be used to further study feedback behaviour in a controlled way by deliberately manipulating the feedback (and turn-taking) behaviour and analyzing the users' reactions, for example studying entrainment phenomena.

3. Plan and System Architecture

At first, a system would be conceived that tells a short, cyclical story and reacts to feedback utterances (depending on whether they occur in expected places) by changing its emotional state (which, in turn, changes the system's speech delivery). The overall architecture of the system is depicted in Figure 1 and a short utterance plan [5] for the beginning of a story is depicted in Figure 2. As can be seen, depending on the system's affective state, it might (somewhat aggressively) require a feedback utterance before continuing. User feedback is analyzed for prosody, content, and expectancy, and results in changes to the system's affective state.

Figure 1: Overview of system modules and the hierarchic structure of incremental units describing an example utterance as it is being produced and adapted during delivery.


Figure 2: Possible changes to an ongoing utterance for a listening storyteller (which might be in a slightly annoyed mood).

The system will be implemented in INPROTK [6], an incremental dialog processing toolkit that makes it possible to interconnect modules that exchange their information as incremental units [7].

Following Schröder [8], the affective state can be modelled along the two dimensions activation and evaluation, which are also the basis for the rules in the speech synthesis component emoSpeak [9], which could be re-used. As a first shot, and without referring to the wealth of literature on affect recognition (e.g. [10]), we would use some simple rules of thumb: every listener feedback increases activation, while the magnitude is determined by the type of feedback (or words in the utterance) and by the user's pitch range. Pitch range and pitch variability are also the influencing factors for the evaluation parameter. A gravitational pull for the evaluation dimension towards 0 and for activation towards −∞ will lead the system to forget about positive or negative emotions and to eventually become inactive if not given any feedback in the long term.
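The rules of thumb above can be made concrete with a small sketch. The following Python fragment is purely illustrative: the function names, feedback-type weights and decay constants are assumptions, not values taken from the paper or from INPROTK.

```python
from dataclasses import dataclass

@dataclass
class AffectiveState:
    activation: float = 0.0   # how "energised" the storyteller currently is
    evaluation: float = 0.0   # positive/negative colouring of the delivery

# Hypothetical weights for different feedback types.
FEEDBACK_WEIGHT = {"continuer": 0.2, "assessment": 0.5, "exclamation": 0.8}

def on_feedback(state: AffectiveState, fb_type: str,
                pitch_range: float, pitch_variability: float) -> AffectiveState:
    """Every listener feedback raises activation; the magnitude depends on the
    feedback type and the user's pitch range. Pitch range and variability also
    nudge the evaluation dimension (toy linear rule)."""
    state.activation += FEEDBACK_WEIGHT.get(fb_type, 0.2) * (1.0 + pitch_range)
    state.evaluation += 0.3 * pitch_variability * (1.0 if pitch_range > 0.5 else -0.5)
    return state

def decay(state: AffectiveState, dt: float,
          pull_eval: float = 0.1, pull_act: float = 0.05) -> AffectiveState:
    """'Gravitational pull': evaluation drifts back towards 0 and activation
    drifts downwards, so the system gradually forgets its emotions and becomes
    inactive when no feedback arrives for a long time."""
    state.evaluation -= pull_eval * state.evaluation * dt
    state.activation -= pull_act * dt
    return state

# Example: one piece of feedback, then ten seconds of silence.
s = AffectiveState()
s = on_feedback(s, "assessment", pitch_range=0.7, pitch_variability=0.4)
s = decay(s, dt=10.0)
print(s)
```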

Extensions of this first system would use improved techniques to elicit feedback utterances, include real story interactivity (i.e., allowing the user some freedom to change parts of the story), and support for deviations from ideal behaviour as outlined above.

4. Conclusions

Adaptive interactive storytelling is a promising 'micro-domain' [11] to understand and study feedback and micro-turn-taking behaviours, that should (following the idea behind micro-domains) later be applicable in full-blown conversational dialogue systems. Interactive storytelling has all limitations necessary for a micro-domain while at the same time offering a high flexibility to study feedback utterances and micro-temporal behaviour in dialogue. Interactive storytelling is especially interesting as it is a purely system-driven domain (and with the system doing the majority of the talking), which ensures a higher proportion of feedback utterances from the user than in typical task-driven dialogue systems.

The author is grateful for fruitful discussions on the topics presented and hopes to get, of course, feedback on the ideas presented.

5. References

[1] A. Glassner, Interactive storytelling: Techniques for 21st century fiction. AK Peters, 2004.

[2] T. Baumann and D. Schlangen, "Predicting the Micro-Timing of User Input for an Incremental Spoken Dialogue System that Completes a User's Ongoing Turn," in Proceedings of SigDial 2011, Portland, USA, 2011.

[3] H. Buschmeier, T. Baumann, B. Dosch, S. Kopp, and D. Schlangen, "Combining incremental language generation and incremental speech synthesis for adaptive information presentation," in Proceedings of SIGdial, Seoul, South Korea, 2012, pp. 295–303.

[4] T. Misu, E. Mizukami, Y. Shiga, S. Kawamoto, H. Kawai, and S. Nakamura, "Toward construction of spoken dialogue system that evokes users' spontaneous backchannels," in Proceedings of the SIGDIAL 2011 Conference, Portland, Oregon: Association for Computational Linguistics, June 2011, pp. 259–265. [Online]. Available: http://www.aclweb.org/anthology/W/W11/W11-2028

[5] G. Skantze and A. Hjalmarsson, "Towards incremental speech generation in dialogue systems," in Proceedings of SIGdial, Tokyo, Japan, September 2010.

[6] T. Baumann and D. Schlangen, "The INPROTK 2012 release," in Proceedings of SDCTD, Montréal, Canada, 2012.

[7] D. Schlangen and G. Skantze, "A General, Abstract Model of Incremental Dialogue Processing," in Proceedings of the EACL, Athens, Greece, 2009, pp. 710–718.

[8] M. Schröder, "Dimensional emotion representation as a basis for speech synthesis with non-extreme emotions," in Proc. Workshop on Affective Dialogue Systems, 2004, pp. 209–220.

[9] M. Schröder, R. Cowie, E. Douglas-Cowie, M. Westerdijk, and S. Gielen, "Acoustic correlates of emotion dimensions in view of speech synthesis," in Proceedings of Interspeech, 2001, pp. 87–90.

[10] Z. Zeng, M. Pantic, G. Roisman, and T. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39–58, 2009.

[11] J. Edlund, J. Gustafson, M. Heldner, and A. Hjalmarsson, "Towards human-like spoken dialogue systems," Speech Communication, vol. 50, pp. 630–645, 2008.


System Description for Demonstration at the Interdisciplinary Workshop on Feedback Behaviors in Dialog

September 7-8, 2012, Stevenson, WA

Beyond Back-channels: A Three-step Model of Grounding in Face-to-face Dialogue

Janet Beavin Bavelas1, Peter De Jong2, Harry Korman3, Sara Smock Jordan4

1 Department of Psychology, University of Victoria, Victoria, BC, Canada
2 Department of Social Work, Calvin College, Grand Rapids, Michigan, USA
3 Harry Korman, Skabersjö Institut för KorttidsTerapi (SIKT), Malmö, Sweden
4 Sara A. Smock Jordan, Applied and Professional Studies, Texas Tech University, Lubbock, TX, USA

[email protected]; [email protected]; [email protected]; [email protected]

Abstract

Feedback is not an individual behavior or skill; it is part of the collaborative process of grounding in which speaker and addressee coordinate their contributions to ensure mutual understanding. Based on our microanalysis of psychotherapy and experimental videos, we propose that grounding is a three-step process of observable behaviors, with traditional back-channels in the middle: The speaker presents information; the addressee displays understanding (or not understanding), and the speaker acknowledges (or corrects) the addressee's display.

Proposal

The unit of analysis for communication has evolved from Shannon and Weaver's sender-focused model to include the receiver's back-channels [1], which can be continuers or assessors [2] and generic or specific [3]. However, these refinements still fit within an implicitly unilateral, two-step model in which communication flows from a speaker to an addressee. Moreover, the two-step unit of analysis is embedded in a traditional turn-taking model in which the roles of speaker and addressee are presumed to alternate regularly and smoothly.

There are at least three related problems with fitting observations of actual face-to-face dialogue into this model: First, spontaneous natural dialogues do not follow alternating turns; e.g., [1, 4]. In particular, the addressee's feedback (e.g., "Yeah" or nodding) often occurs completely within the speaker's turn, and the participants do not treat these overlapping contributions as either a turn or an interruption. Second, the addressee's feedback is often visible rather than audible, e.g., nodding, smiling [5], or motor mimicry [3, 6]. Therefore, accurate analysis requires video recordings in which both participants are visible and audible at all times. Finally and most important, the two-step model is not a feedback model in the cybernetic sense because it does not include the speaker's response to the addressee's feedback. The default assumption seems to be that the effect of the addressee's feedback on the speaker is ordinarily purely cognitive, that is, the speaker simply notices that the addressee understands and goes on talking. We propose that the speaker's response is an influential and observable behavior. We agree with the proposal by Clark and Schaefer [7, 8, 9] that grounding is the fundamental, moment-by-moment conversational process by which speaker and addressee are

constantly establishing mutual understanding. Grounding is a coordinated and collaborative sequence of behaviors occurring at every moment in the dialogue, whether the information is trivial or important. Most versions of grounding describe a presentation of information by the speaker followed by the addressee’s acceptance. The acceptance phase encompasses much more than traditional back-channels, e.g., it can be a paraphrase of what the speaker has said or even new information in answer to the speaker’s question. There is also the possibility of a side-sequence for repair when the addressee does not indicate understanding. We have expanded on an implicit possibility in the grounding model [9, pp. 229-230] by adding the speaker’s acknowledgment as an essential and observable third step that concludes the grounding sequence:

1. The speaker presents information.

2. The addressee displays that he or she has understood the information (or has not understood or is not certain).

3. The speaker acknowledges that the addressee has understood (or not).

In the third step, the speaker provides feedback to the addressee, e.g., by acknowledging the addressee’s correct understanding and completing a successful grounding sequence. They have “grounded” on their understanding of what the speaker had presented. (Ordinarily, grounding goes smoothly, but it is also an error-detection system. Steps 2 and 3 include the opportunity to detect and repair a misunderstanding on the spot.) We propose that the minimum unit of analysis for dialogue is a three-step grounding sequence. That is, the utterances that form the grounding sequence only make sense in terms of their functional relationship to each other. Grounding is the rhythm of dialogue; every utterance and back-channel is part of a grounding sequence. However, with the addition of the speaker’s acknowledgement, the sequence is no longer a linear one that ends by simply confirming what the speaker had originally presented. If the addressee’s display introduces a subtle change (e.g., a paraphrase) and the speaker acknowledges the display as an acceptable understanding of the original presentation, then the addressee’s modification is what they have grounded on--not what the speaker originally presented. Similarly, when the speaker asks a question: the addressee may answer a different question, and when the speaker acknowledges the answer, then


the speaker’s question becomes what the addressee answered. This is one way that therapists influence the therapeutic discourse while “just listening.” Thus, our model of feedback is one of reciprocal influence or, in more contemporary terms, of co-construction [10]. We have been microanalyzing data from psychotherapy sessions as well as lab experiments, using ELAN (http://www.lat-mpi.eu/tools/elan) and video that captures both participants. The three-step model became necessary in order to fit the observed details of dialogue—details that were previously unaccounted for. This includes the observable instances where the display and acknowledgement steps introduce changes to the original presentation. There are also variations on the simple pattern. For example, the addressee may display that he or she has not understood the presentation; this initiates a repair sequence. The addressee’s answer to a question often presents new information, which starts a new, overlapping grounding sequence. The patterns also differ when both participants can contribute, compared to asymmetrical dialogues in which the speaker presents all the information. Based on preliminary data, the absence of an acknowledgement in step 3 may mark or lead to a misunderstanding in which common ground is not established.
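To make the proposed unit of analysis concrete, here is a minimal Python sketch that labels one candidate sequence of observable steps as grounded, needing repair, or incomplete. The step labels and the decision rules are illustrative assumptions, not the authors' microanalysis coding scheme.

```python
def classify_grounding_sequence(steps):
    """Label one candidate three-step grounding unit.

    steps -- ordered labels of observable behaviours, e.g.
             ["present", "display", "acknowledge"].
    Returns 'grounded' for the full three-step pattern, 'repair' when the
    addressee's display signals non-understanding, and 'incomplete' when the
    step-3 acknowledgement (or another step) is missing."""
    if len(steps) >= 2 and steps[0] == "present" and steps[1] == "display_not_understood":
        return "repair"      # a side-sequence is needed before grounding
    if steps == ["present", "display", "acknowledge"]:
        return "grounded"    # speaker and addressee have grounded
    return "incomplete"      # e.g. the speaker's acknowledgement is absent

print(classify_grounding_sequence(["present", "display", "acknowledge"]))  # grounded
print(classify_grounding_sequence(["present", "display"]))                 # incomplete
```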

Conclusion

The data have led us to expand the minimum unit of analysis for dialogue to three closely related exchanges between speaker and addressee. They also change addressee feedback from a passive, reactive function to part of a reciprocal sequence in which both speaker and addressee determine the meaning of what was said.

References

[1] Yngve, V. H. "On getting a word in edgewise." Papers from the Sixth Regional Meeting of the Chicago Linguistic Society, 567-578. Chicago Linguistic Society, 1970.

[2] Goodwin, C. "Between and within: Alternative sequential treatments of continuers and assessments." Human Studies, 9:205-217, 1986.

[3] Bavelas, J. B., Coates, L., and Johnson, T. “Listeners as co-narrators.” Journal of Personality and Social Psychology, 79(6):941-952, 2000.

[4] O’Connell, D. C., Kowal, S., and Kaltenbacher, E., “Turn-taking: A critical analysis of the research tradition,” Journal of Psycholinguistic Research, 19(6):345-373, 1990.

[5] Brunner, L. J. “Smiles can be back-channels.” Journal of Personality and Social Psychology, 37:728-734, 1979.

[6] Bavelas, J. B., “Face-to-face dialogue as a micro-social context. The example of motor mimicry,” in S. D. Duncan, J. Cassell, and E. T. Levy [Eds.], Gesture and the dynamic dimension of language, 134-136, Benjamins, 2007.

[7] Clark, H.H., & Schaefer, E.F., “Collaborating on contributions to conversations,” Language and Cognitive Processes, 2(1), 19-41, 1987.

[8] Clark, H.H., & Schaefer, E.F., "Contributing to discourse," Cognitive Science, 13:259-294, 1989.

[9] Clark, H.H., Using language, Ch. 8, Cambridge University Press, 1996.

[10] De Jong, P., Bavelas, J. B., and Korman, H., Using microanalysis to observe co-construction in psychotherapy. Article under review.

Index Terms: grounding, face-to-face dialogue, feedback from speaker.


Adapting Language Production to Listener Feedback Behaviour

Hendrik Buschmeier, Stefan Kopp

Sociable Agents Group, CITEC and Faculty of Technology, Bielefeld University
PO-Box 10 01 31, 33501 Bielefeld, Germany
{hbuschme, skopp}@uni-bielefeld.de

Abstract

Listeners use linguistic feedback to provide evidence of understanding to speakers. Speakers, in turn, use it to reason about listeners' mental states, to determine the groundedness of communicated information and to adapt subsequent utterances to the listeners' needs. We describe a probabilistic model for the interpretation of listener feedback in its dialogue context that enables a speaker to evaluate the listener's mental state and gauge common ground. We then discuss levels and mechanisms of adaptation that speakers commonly use in reaction to listener feedback.

Index Terms: communicative feedback; Bayesian listener state; adaptation mechanisms

1. Introduction

Cooperative dialogue partners continuously show evidence of perception, understanding, acceptance and agreement of and with each other's utterances. Such 'evidence of understanding' [1] is provided in the form of verbal-vocal feedback signals, head gestures and facial expressions, as well as through appropriate follow-up contributions.

A listener's feedback signals can reflect his or her mental state quite accurately. In the case of verbal-vocal feedback, for example, listeners use a variety of quasi-lexical forms and modify them prosodically (through lengthening, intonation, intensity, voice quality) and structurally (through repetition or transformations) to express subtle differences in meaning [2]. A comparably rich mapping between form and function can also be found in head gestures and facial expressions.

In addition to the complexity of the feedback signal itself, the dialogue context may interact with it such that the resulting meaning is the opposite of the signal's 'context-free meaning' [3]. Because listeners' feedback signals are responses to what a speaker has said, they need to be analysed with this context in mind. Speakers trying to interpret the listener's evidence of understanding do exactly this.

Having perceived and interpreted a listener's feedback signal, speakers do not typically ignore it, but instead tend to respond immediately. If they sense that the listener has a specific or general need, they adapt their ongoing and subsequent utterances to address it. In this way, listener feedback fulfils a function in the original cybernetics sense of the word 'feed back' [4]: the listener's feedback signal modifies the speaker's language production – at least in cooperative situations. Both interaction partners benefit from this process, as it often results in better understanding and greater agreement.

In this paper, we (1) present a Bayesian network model for context-sensitive interpretation of listener feedback in its dialogue context; and (2) describe and discuss the levels and mechanisms by which speakers adapt to their interlocutors' needs as communicated through their feedback. Both the model of the listener and the adaptation mechanisms will be useful in creating 'attentive speaker agents' [5, 6] that are able to attend and to adapt to communicative user feedback.

2. A Bayesian model of the listener

Kopp and colleagues [7] proposed a computational model of feedback generation for an embodied conversational agent. Its focus, in contrast to other feedback generation models, is not so much on the timing of feedback but rather on the choice of which feedback signal to produce. Following Allwood and colleagues' hypothesis [3] that linguistic feedback performs four basic communicative functions (contact, perception, understanding and other attitudinal reactions), the feedback production model bases the decision of when and how to give feedback on the virtual agent's perception, understanding and appraisal processes. These feed into a simple concept named 'listener state', which represents the current estimates of the agent's perception, understanding as well as acceptance and agreement (being the two major attitudinal reactions) as a simple tuple (C, P, U, A). The feedback generation module monitors this listener state and probabilistically triggers feedback signals that express the current state.

We [6] adopted this concept of listener state for a model in which an attentive speaker agent attributes to its user a Theory of Mind representation that emulates the user's listener state. Depending on the user's feedback signals, the agent is able to estimate this 'attributed listener state' (ALS), and use it to adapt its own behaviour in such a way that listeners can perceive and understand better. Changes to the ALS were calculated similarly to [7]: upon detecting a feedback signal, the ALS was updated by increasing or decreasing the corresponding and entailed variables.

Here, we present an enhanced approach to attributed listener state (a more detailed description can be found in [8]), where it is modelled probabilistically in the framework of Bayesian networks. This (1) allows managing the uncertainties inherent in the mapping between feedback signal and meaning, (2) enables inference about and potentially also learning of listener behaviour, and (3) gives us a natural way of interpreting feedback in a dialogue context that includes other multimodal signals of the listener, the speaker's utterance and aspects of the dialogue situation and domain.

As in the previous model, the notions of contact, perception, understanding, acceptance and agreement are modelled with one variable each. Here, however, they appear as random variables so that the values C, P, U, AC and AG can be interpreted in terms of degrees of belief instead of in terms of strengths. The latter is instead modelled via the states of the random variables.

Influences between ALS variables are modelled after Allwood's hierarchy of feedback functions [3], i.e., perception subsumes contact, understanding subsumes perception and contact, and acceptance and agreement subsume understanding, perception and contact. This means, for instance, that if understanding is assumed, perception and contact can be assumed as well. A lack of perception, on the other hand, usually implies that understanding cannot be assumed. Thus, the influences in the Bayesian model of ALS are the following: C influences P, P influences U, and U influences AC and AG (see Figure 1 for a graphical depiction of the model and these influences).

Figure 1: Structure of the Bayesian model of the listener. The attributed listener state consists of five random variables C, P, U, AC and AG. These are influenced by variables representing the dialogue context and the user's behaviour. The ALS variables in turn influence the grounding status of the speaker's utterance.

Each of the ALS variables can take the states low, medium and high. Taking, for example, the case of understanding, low means that the listener's estimated level of understanding is low (i.e., the listener did not understand the speaker's utterance). The state high means that the listener understands the speaker's utterance very well, and medium represents a level of understanding that lies in between the two.
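As an illustration of how degrees of belief propagate along this subsumption chain, here is a minimal Python sketch that marginalises over C and P to obtain a belief over U. The probability tables are invented toy values and only cover the chain C -> P -> U; the actual model additionally conditions on the context variables described below.

```python
# Toy sketch of the ALS subsumption chain C -> P -> U with three ordinal
# states each. All probability tables are invented illustrative values.

STATES = ("low", "medium", "high")

# Prior over contact C.
P_C = {"low": 0.1, "medium": 0.3, "high": 0.6}

# P(P | C): good contact makes good perception likely (toy numbers).
P_P_given_C = {
    "low":    {"low": 0.70, "medium": 0.20, "high": 0.10},
    "medium": {"low": 0.30, "medium": 0.50, "high": 0.20},
    "high":   {"low": 0.05, "medium": 0.20, "high": 0.75},
}

# P(U | P): understanding subsumes perception (toy numbers).
P_U_given_P = {
    "low":    {"low": 0.80, "medium": 0.15, "high": 0.05},
    "medium": {"low": 0.30, "medium": 0.50, "high": 0.20},
    "high":   {"low": 0.10, "medium": 0.30, "high": 0.60},
}

def belief_over_U():
    """Marginal degree of belief over understanding U, obtained by exactly
    summing out C and P along the chain."""
    belief = {u: 0.0 for u in STATES}
    for c in STATES:
        for p in STATES:
            for u in STATES:
                belief[u] += P_C[c] * P_P_given_C[c][p] * P_U_given_P[p][u]
    return belief

print(belief_over_U())  # a distribution over {'low', 'medium', 'high'}
```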

The most important information for inferring the ALS is most probably the listener's verbal-vocal feedback signal. Thus, if it is, for example, recognised as having the communicative function 'understanding', there is a positive influence on the variables C, P and especially U. Variables AC and AG, on the other hand, are negatively influenced, as speakers usually signal feedback of the highest function possible [3, 1].

To take into account the context-sensitivity of feedback signals, features of the speaker's utterance need to be considered in ALS estimation as well. If, for example, the speaker's utterance is simple, the degree of belief in the listener's successful understanding of the utterance should be high – even if explicit positive feedback is absent. This is modelled with the variable Difficulty, which also takes the states low, medium and high. Contributing factors are its length, the novelty of its informational content (i.e., whether it is new or old information), and whether the utterance can be expected by the listener or will come as a surprise.

A further influence on the ALS variables is how certain or uncertain the listener seems to be about his mental state. A feedback signal can imply that a listener is still in the process of evaluating the speaker's statement and is not yet sure whether he agrees with it. This is often shown by lengthening the signal or by hesitancy in its production [2]. We model this with the variable Uncertainty, which again takes the states low, medium, and high. Uncertainty is derived from the user's feedback behaviour. Giving feedback in both modalities simultaneously, for example, conveys a higher degree of certainty than providing just a head nod. In the verbal-vocal domain, lengthening of feedback signals often marks the progressiveness of the evaluation or appraisal process. Taking a stance in the feedback signal itself (being positive or negative) also conveys a higher degree of certainty than does a feedback signal with neutral polarity.

Finally, situation-specific influences and those of a speaker's expectations about the listener's behaviour are often connected to the dialogue domain and to known preferences of the listener. This is modelled with the domain-dependent variable Trade-off, which is closely tied to the domain we are working with (calendar and appointment scheduling). If the speaker proposes an appointment and knows that there is already another appointment with a similar priority at that point in time, the variable can predict that the user may have to make a significant trade-off. This variable also takes the states low, medium, and high.

The ALS mediates between the contextual factors described above and the information state. This makes the grounding status of the objects in the information state conditionally independent from the multitude of possible influencing factors and reduces the model's complexity.

Each ALS variable influences the grounding status of information associated with the current utterance to a different degree. Believing that the listener is in full contact but neither perceives nor understands what the speaker utters, for example, should not lead to a high degree of belief in the groundedness of the object. Assuming the listener to be in an average state of understanding, on the other hand, does not render impossible a high degree of belief in the object being grounded. The information state is currently modelled with a single variable Grounding that can take the states low, low-medium, medium, medium-high and high and is associated with the current utterance.

Whether a context variable conditionally influences an ALS variable can also be seen in Figure 1. The strength of the influence is modelled with structured representations, with which the conditional probability tables for each variable are derived automatically [8]. It is thus not necessary to specify the enormous number of probabilities needed for this network manually, but only a much smaller number of parameters that control the derivation by approximating the shape of the probability distributions. Since the states of many of the variables of the network have an ordinal relationship (such as low, medium, high), a definition in this way is easily possible.
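The idea of deriving whole probability tables from a handful of shape parameters can be sketched as follows; the softmax-over-distance rule and its parameters are assumptions chosen for illustration, not the parameterisation actually used in [8].

```python
import math

STATES = ("low", "medium", "high")

def ordinal_distribution(peak: int, sharpness: float = 2.0):
    """Return a distribution over the ordinal STATES that peaks at index
    `peak` (0 = low, 1 = medium, 2 = high); `sharpness` controls how
    concentrated the probability mass is around the peak."""
    weights = [math.exp(-sharpness * abs(i - peak)) for i in range(len(STATES))]
    total = sum(weights)
    return {state: w / total for state, w in zip(STATES, weights)}

# e.g. one column of a CPT such as P(U | FB-function = 'understanding'):
# belief concentrated on 'high', specified by two parameters instead of three
# hand-written probabilities.
print(ordinal_distribution(peak=2, sharpness=2.0))
```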

When applying the model to the analysis of a certain communicative situation, it suffices to set the known variables. The states of the remaining variables can then be calculated with Bayesian network inference algorithms. The result of this process is a belief state for each variable, i.e., a probability distribution over the variable's states, representing the speaker's belief about the listener's mental and grounding state.

3. Levels and mechanisms of adaptation

Based on the attributed listener and grounding state, a speaker may then decide if it is necessary or helpful to accommodate the listener by changing aspects of their language production behaviour. This section describes a first investigation into manners of adaptation based on findings from the literature and a qualitative analysis of dialogues from a human-human dialogue study we conducted. The key question of how to adapt in a given situation will remain unanswered for now as it requires a more detailed analysis of the speaker's feedback-preceding utterances.

The different needs of a listener need to be addressed on different levels and with different adaptation mechanisms. For example, a problem in perception might be resolved by simply repeating the utterance or the problematic phrase or word. If the speaker notices, however, that the listener has built up a completely different situation model and is stuck in this incorrect conceptualisation of what the speaker means, starting anew from a different perspective might be the right way for the speaker to resolve the situation. Table 1 gives an overview of different levels of adaptation along with a choice of mechanisms that operate on each level.

Table 1: Levels of adaptation, from the lowest level 'realisation' to the highest level 'perspective'.

Levels                Mechanisms
Perspective           perspective change; provide missing information
Rhetorical structure  elaboration; explanation; repetition; summary; pragmatic explicitness
Surface form          verbosity; redundancy; focus/stress; vocabulary
Realisation           hyper- and hypo-articulation; speech rate; volume

The lowest level of adaptation is the realisation level, i.e., how an utterance is articulated and presented. Adaptation on this level might happen automatically during articulation along the hyper-hypo continuum [9]. A speaker might choose to hyper-articulate when the listener has difficulties perceiving the speaker's speech (e.g., due to noise in the environment, hearing impairment, importance of the message or possible ambiguities). On the other hand, if the listener perceives well and the message is not overly important, the speaker might choose to conserve energy through hypo-articulation. The realisation level is also where speakers may choose to adapt their speech rate or volume.

If adapting the realisation is insufficient to accommodate the listener's needs, the utterance's content itself can be adapted. This is possible on all of the higher adaptation levels. The simplest way of adapting utterance content is to change the surface form, keeping the utterance's semantic content fixed. A speaker may choose to be more 'verbose,' i.e., use more words to communicate the same semantic content. Although the additional words and phrases might not add semantic content, they can nevertheless serve important communicative functions. Using signpost language and other cue phrases, for example, helps in drawing the listener's attention to a specific aspect of an utterance. It might also be used to make the speaker's underlying intentions more explicit and to reveal the rhetorical structure of the speaker's argument [10]. Verbosity also has the simple property of giving the listeners more time to process the important meaning-bearing parts of an utterance.

Speakers may also use different degrees of redundancy to adapt surface form. Similarly to verbosity, redundancy usually does not introduce novel semantic objects, but highlights important information and increases the probability of the message being understood [11]. Redundancy is also a frequent mechanism used to repair misunderstanding [12].

Another mechanism that operates on the surface structure is stress and focus. The speaker might put stress on the important parts of an utterance with the help of prosodic cues as well as by using different syntactic constructions that distribute weight differently (e.g., active vs. passive voice). Furthermore, the speaker can choose a different vocabulary, thereby accommodating the listener's level of expertise.

Figure 2: The virtual conversational agent 'Billie' together with a visualisation of the belief states of the variables C, P, U, AC, AG and Grounding.

Adaptation at higher levels requires more than a change of packaging for semantic content, producing instead a different message. 'Rhetorical structure' is the level of adaptation most easily identified and often found in the analysis of our corpus. Speakers often adapt to listener feedback by changing the amount of information they provide. They commonly elaborate on an utterance by providing more information or giving explanations. Another mechanism is to repeat the previous utterance or to summarise several utterances. On this level, speakers also adapt by making previously implicit information pragmatically explicit.

Finally, when speakers notice that the listener's conceptualisation of the dialogue's content deviates from their own, they adapt on the level of 'perspective'. They adjust their own perspective to be closer to that of the listener, or track back to a point in the dialogue where they assume the conceptualisation to have still been consistent. Speakers might also provide further background information that they had previously assumed was already a part of common ground.

It should be noted that adaptation can take place at multiple levels simultaneously. A speaker might very well choose to communicate more clearly by combining several mechanisms. Furthermore, the function of adaptation is not limited to accommodating the listener's problems in perception, understanding, and so forth. It also serves to modify dialogue when communication is going 'too well'. For example, if a speaker notices that a listener is already ahead in her thinking, he might skip planned parts of his utterance. Similarly, if there are no problems in perception and understanding, the speaker can be more relaxed in his or her articulation.
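A toy decision rule makes the mapping from estimated listener state to adaptation level concrete. The thresholds, the dictionary keys and the one-deficit-at-a-time policy below are illustrative assumptions rather than the authors' actual decision procedure.

```python
def choose_adaptation(als, grounding):
    """als: degrees of belief (0..1) that perception/understanding/acceptance/
    agreement are high; grounding: belief that the current utterance is
    grounded. Returns a (level, mechanism) pair from Table 1."""
    if als["perception"] < 0.4:
        return ("realisation", "hyper-articulation, slower speech rate, louder")
    if als["understanding"] < 0.4:
        return ("rhetorical structure", "elaboration, explanation or repetition")
    if als["acceptance"] < 0.4 or als["agreement"] < 0.4:
        return ("perspective", "change perspective, provide missing information")
    if grounding > 0.8 and als["understanding"] > 0.8:
        return ("realisation", "hypo-articulation; skip planned redundancy")
    return ("surface form", "keep going; adjust verbosity/focus only if needed")

print(choose_adaptation(
    {"perception": 0.9, "understanding": 0.3, "acceptance": 0.5, "agreement": 0.5},
    grounding=0.4))  # -> a rhetorical-structure adaptation
```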

4. Conclusion

In this paper, we discussed linguistic feedback from the perspective of an attentive speaker. We first presented an enhanced representation of 'attributed listener state' [6, 8] that builds on principles of probabilistic reasoning. Using the framework of Bayesian networks makes it possible to seamlessly integrate the representations of the listener's assumed cognitive state with dialogue context and features of the listener's feedback signal. Moreover, it is also possible to easily integrate an information state representation into the model, and to then reason about the grounding status of information in the speaker's utterances. The model enables a speaker to estimate how well utterances are perceived and understood by a listener, and to evaluate acceptance of and agreement with message content.

We further discussed how speakers accommodate to the needs a listener expresses through feedback behaviour. We presented four levels of adaptation and a number of adaptation mechanisms commonly used by speakers, as supported by the initial results of a dialogue study and the literature.

In sum, it appears that our Bayesian model supports the claim that the attributed listener state as well as the estimation of the grounding of the current utterance content are important factors in deciding whether and how to adapt to the listener's needs, and which action to take next. We are currently creating a virtual conversational agent platform that will allow us to explore and evaluate the model and its interplay with different adaptation strategies in more detail. For this, the Bayesian model has been integrated into the agent 'Billie' (see Figure 2), where the ALS variables as well as the estimated state of groundedness are used to adapt its incrementally generated language ([13]; so far only on the level of surface form) as well as to make choices in dialogue management.

Acknowledgements: This research is supported by the Deutsche Forschungsgemeinschaft (DFG) in the Center of Excellence EXC 277 in 'Cognitive Interaction Technology' (CITEC).

5. References

[1] H. H. Clark, Using Language. Cambridge, UK: Cambridge University Press, 1996.

[2] N. Ward, "Non-lexical conversational sounds in American English," Pragm. & Cogn., vol. 14, pp. 129–182, 2006.

[3] J. Allwood, J. Nivre, and E. Ahlsén, "On the semantics and pragmatics of linguistic feedback," Journal of Semantics, vol. 9, pp. 1–26, 1992.

[4] N. Wiener, Cybernetics: or Control and Communication in the Animal and the Machine, 2nd ed. Cambridge, MA: The MIT Press, 1948/1961.

[5] D. Reidsma, I. de Kok, D. Neiberg, S. Pammi, B. van Straalen, K. Truong, and H. van Welbergen, "Continuous interaction with a virtual human," Journal on Multimodal User Interfaces, vol. 4, pp. 97–118, 2011.

[6] H. Buschmeier and S. Kopp, "Towards conversational agents that attend to and adapt to communicative user feedback," in Proceedings of the 11th International Conference on Intelligent Virtual Agents, Reykjavik, Iceland, 2011, pp. 169–182.

[7] S. Kopp, J. Allwood, K. Grammer, E. Ahlsén, and T. Stocksmeier, "Modeling embodied feedback with virtual humans," in Modeling Communication with Robots and Virtual Humans, I. Wachsmuth and G. Knoblich, Eds. Berlin, Germany: Springer-Verlag, 2008, pp. 18–37.

[8] H. Buschmeier and S. Kopp, "Using a Bayesian model of the listener to unveil the dialogue information state," in SemDial 2012: Proceedings of the 16th Workshop on the Semantics and Pragmatics of Dialogue, Paris, France, to appear.

[9] B. Lindblom, "Explaining phonetic variation: A sketch of the H&H theory," in Speech Production and Speech Modelling, W. J. Hardcastle and A. Marchal, Eds. Dordrecht, NL: Kluwer Academic Publishers, 1990, pp. 403–439.

[10] B. J. Grosz and C. L. Sidner, "Attention, intentions and the structure of discourse," Computational Linguistics, vol. 12, pp. 175–204, 1986.

[11] E. Reiter and S. Sripada, "Human variation and lexical choice," Computational Linguistics, vol. 28, pp. 545–553, 2002.

[12] R. Baker, A. Gill, and J. Cassell, "Reactive redundancy and listener comprehension in direction-giving," in Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, Columbus, OH, 2008, pp. 37–45.

[13] H. Buschmeier, T. Baumann, B. Dosch, S. Kopp, and D. Schlangen, "Combining incremental language generation and incremental speech synthesis for adaptive information presentation," in Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Seoul, South Korea, 2012, pp. 295–303.


Effect of Linguistic Contents on Human Estimation of Internal State of Dialog System Users

Yuya Chiba1, Masashi Ito2, Akinori Ito1

1 Graduate School of Engineering, Tohoku University, Sendai, Japan
2 Department of Electronics and Intelligent System, Tohoku Institute of Technology, Sendai, Japan

[email protected], [email protected], [email protected]

Abstract

We have studied the estimation of dialog system users' internal state before the input utterance. In practical use of a dialogue-based system, a user is often perplexed by the prompt. An ordinary system provides more detailed information to a user who takes time to respond, but such help is intrusive for a user who is merely considering how to answer the prompt. To make an appropriate response, a spoken dialogue system therefore needs to consider the user's internal state before the user's input. In our previous paper, we proposed a method for estimating the internal state using multi-modal cues; however, we did not separate the effects of several factors (e.g. linguistic information of the prompt, visual information, and acoustic information). Thus, it was not clear which factor affects the evaluation of the dialog session, and to what extent. In this paper, we carried out a more detailed evaluation by human evaluators, separating the linguistic contents (the system's prompt utterance and the user's reply utterance) from the other non-verbal behavior, and assessed the effect of the linguistic contents on the estimation of the user's internal state.

Index Terms: multi-modal interface, user modeling, non-verbal information

1. Introduction

A spoken dialog system should respond to a user flexibly. User modeling and estimation of the user's state [1] are important issues for realizing flexible spoken dialog systems. There has been much work on this issue so far, most of it focusing on estimating internal states during the dialog [2, 3] or before the dialog [4].

These studies assume that the user answers immediately when the dialog system gives a prompt message. However, users in real environments sometimes do not provide any input. For instance, a user could abandon the session without uttering a word if he or she did not understand the meaning of the system's prompt, or could take time to answer while considering how to respond to the prompt.

In our previous work, we focused on the user's behavior after listening to the system's prompt and before answering the prompt [5]. User modeling at this phase (before the user's first utterance) is important because it lets us recognize users who have difficulty understanding the system's prompt and formulating an answer. In that work, we exploited audio-visual features of the user's behavior such as the duration from the prompt to the user's answer, the length of the user's filled pauses and silences, and the user's face orientation.

All the features examined in the previous work were extracted only from the observation of the user's behavior before the answer utterance. However, when labeling the dialog data, human annotators watched the whole dialog session from the system's prompt to the end of the user's answer, which included additional linguistic information that was not used for automatic estimation, such as the system's prompt and the user's input utterance. Therefore, it is not clear how linguistic information affected the annotators' judgments.

In this paper, we conduct a more detailed evaluation experiment. We created video clips of four different conditions that contained different information, and asked evaluators to judge the user's internal state by watching the video. We then compared judgments across the different conditions to assess the importance of the linguistic information included in the system prompt and the user's input utterance, as well as of the audio information before the user's input utterance.

2. Internal states of a user before the first utterance

In a human-human dialogue, interlocutors converse while more or less estimating the dialog partner's internal state. Here, we define three internal states of a user who is about to answer a dialog system [5]. In the first one (state A), the user does not know how to answer the prompt. In the second one (state B), the user is taking time to consider the answer. In the third one (state C), the user has no difficulty in answering the system. Estimation of these internal states will help the system generate an additional prompt when the user does not reply to the system within a certain duration.


Human estimation of a dialog partner's internal state is based on the feeling that the other person knows an answer to the question (in other words, whether other interlocutors could respond to his/her utterance or not). This is called the "Feeling of Another's Knowing" (FOAK) [6]. A correlation between audio-visual cues and FOAK has also been investigated [7]. That work used linguistic information from the user's input utterance, which means that the method cannot be used for the above purpose because the estimation requires the user's utterance.

3. Linguistic contents and estimation of the internal state

As stated above, the relationship between linguistic contents and estimation of the internal state is unclear. Therefore, we addressed the following three questions:

(Q1) How does the answer from the user affect the decision of the evaluators?

(Q2) How does the content of the question (the system's prompt) affect the decision of the evaluators?

(Q3) Is audio information really useful for the decision?

We used video clips of users who were conversing with a dialog system. One clip contained one session, in which the system gave a prompt utterance and the user answered. We split each clip into audio and visual parts, and temporally divided it into three parts, as follows.

1. Audio (A1) and video (V1) of the system's prompt utterance given to the user

2. Audio (A2) and video (V2) after the prompt and before the user's answer

3. Audio (A3) and video (V3) of the user's answering utterance

To investigate the above three questions, we created the following four kinds of clips, and carried out experiments to compare the judgments made with the different kinds of clips.

Clips A: Clips with all information (V1, V2, V3, A1, A2, A3)

Clips B: Clips without the answer from the user (V1, V2, A1, A2)

Clips C: The system's prompt utterance of Clips B was replaced with a tone signal (V1, V2, A2)

Clips D: The audio signal of Clips C was removed (V1, V2)

We can investigate (Q1) by comparing judgments for Clips A and B. We investigate (Q2) by comparing the results of Clips A and B with those of Clips C. Finally, we compare these results with the results of Clips D to answer (Q3).
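To make the design concrete, here is a minimal sketch (ours, not the authors') that encodes the four clip conditions and the three planned comparisons as plain data structures; the segment labels follow the A1–A3/V1–V3 notation above.

```python
# Sketch of the four clip conditions and the comparisons built on them.
# Segment labels follow the paper: A1-A3 are audio parts, V1-V3 video parts.
CLIP_CONDITIONS = {
    "A": {"V1", "V2", "V3", "A1", "A2", "A3"},  # full session
    "B": {"V1", "V2", "A1", "A2"},              # user's answer removed
    "C": {"V1", "V2", "A2"},                    # prompt audio replaced by a tone
    "D": {"V1", "V2"},                          # all audio removed
}

# Each question is answered by contrasting judgments for two sets of conditions.
COMPARISONS = {
    "Q1 (effect of the user's answer)": (["A"], ["B"]),
    "Q2 (effect of prompt content)":    (["A", "B"], ["C"]),
    "Q3 (usefulness of audio)":         (["C"], ["D"]),
}

for name, segments in CLIP_CONDITIONS.items():
    print(f"Clips {name}: {sorted(segments)}")
for question, (left, right) in COMPARISONS.items():
    print(f"{question}: {' + '.join(left)} vs {' + '.join(right)}")
```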

4. Collection of dialog data

We collected the dialog data in a Wizard-of-Oz (WOZ) setting, where subjects converse with a dialog system controlled by a human operator. We prepared a "question-and-answer" task for the dialog, in which the system asks the subject a question and the subject answers it. We prepared 44 patterns of system questions.

We displayed an agent on an LCD monitor to keep the subjects' attention. The agent is a simple cartoon-like face, which was controlled by the operator. The operator gave the system's prompt and reply to the subject using a speech synthesizer.

The experiment was conducted in a sound-proof chamber. The system utterance was played through a loudspeaker connected to the PC. The operator stayed outside the chamber and controlled the agent remotely. The subjects wore a lapel microphone. A CCD camera was installed above the monitor to record the frontal face of the subject during a dialog. The operator could monitor both the speech and the video of the subject from outside the chamber.

We employed 16 subjects (14 males and 2 females). The audio signal was recorded in PCM format at 16 kHz sampling and 16-bit quantization. The video was stored as AVI files with 24-bit color depth at 30 frames/s.

5. Subjective evaluation

We conducted subjective evaluation experiments to investigate the effect of various kinds of information on the human estimation of the subject's internal state. The information included the system's prompt, the subject's non-verbal behavior, and the subject's answer utterance.

5.1. Sessions

We split the recorded video and speech into sessions, each of which included one system prompt and the subject's answer to that prompt. When the subject did not answer, we regarded the section from the beginning of the system prompt to just before the next prompt as a session. As a result, we obtained 793 sessions from the recorded video.

5.2. Clips for subjective evaluation

As mentioned above, we prepared four kinds of clips (Clips A, B, C and D) for each of the sessions.

In the majority of the 793 collected sessions, the subjects immediately answered the question, which should be classified as state C (the user had no problem answering the question) [5]. The main interest of this work is how to discriminate between users in states A and B. Therefore, we excluded the sessions in which the subject answered the question within 5 s. As a result, we used 255 sessions for the evaluation experiment.


5.3. Evaluation procedure

We employed 18 evaluators (13 males and 5 females) who had not participated in the dialog recordings. We split the evaluators into four groups, Group A, B, C and D, to each of which Clips A, B, C and D were presented, respectively. After watching one session, the evaluator was asked to choose one of the following three answers to the question:

Q: How do you evaluate the behavior of the subject with respect to the system's question?

1) The subject did not understand the question (state A).
2) The subject understood the question and took time to prepare an answer (state B).
3) The subject understood the question and answered it immediately (state C).

6. Experimental result

6.1. Analysis of the result of Group A

First, we investigated the consistency of the judgments made by the evaluators in Group A. We used Cohen's κ to assess the degree of agreement between evaluators. The resulting κ values ranged from 0.46 to 0.55, which indicates moderate agreement between the evaluators.
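Agreement figures of this kind can be reproduced with standard tooling; the following minimal sketch computes pairwise Cohen's κ with scikit-learn on made-up labels (the ratings shown are illustrative, not the study's data).

```python
# Minimal sketch of pairwise Cohen's kappa between evaluators (illustrative labels only).
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

# One label (1 = state A, 2 = state B, 3 = state C) per session, per evaluator.
ratings = {
    "eval1": [1, 2, 2, 1, 3, 2, 1, 2],
    "eval2": [1, 2, 1, 1, 3, 2, 2, 2],
    "eval3": [2, 2, 2, 1, 3, 1, 1, 2],
}

for a, b in combinations(ratings, 2):
    kappa = cohen_kappa_score(ratings[a], ratings[b])
    print(f"kappa({a}, {b}) = {kappa:.2f}")
```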

6.2. Comparison between Group A and B

We can assess the effect of the subject's answer on the judgment by comparing the results from Groups A and B. We classified each of the sessions by a majority vote within each of Group A and Group B, and observed the agreement between the judgments of the two groups. As a result, we obtained κ = 0.59, which is almost the same as the agreement between evaluators within Group A. This result suggests that the effect of the subject's answer on the evaluators' judgment is small.

We also investigated the examples where evaluators of Groups A and B gave different judgments. One typical example is a subject who answered "I don't know" after a long silence. In this case, evaluators of Group A tended to regard that example as choice 1 (state A), while those of Group B (who did not hear the subject's answer) tended to judge it as choice 2 (state B). In addition, the evaluators of Group A seemed to use the pitch and power of the answer utterance as cues to the subject's confidence in the answer.

6.3. Comparison between Groups A+B and C

We investigated the effect of the content of the system prompt on the evaluators' judgment by comparing the results of Group C with those of Groups A and B. Agreement of the majority-vote results of Groups B and C was not good (κ = 0.30), which suggests that the content of the system prompt had some effect on the evaluators' decision.

Figure 1: Tendency of judgment by evaluators: (a) by evaluators in Group A and B; (b) by evaluators in Group C.

Next, we made a question-by-question analysis of the results. In this analysis, we observe the tendency of judgments for a specific question. Let n_qj(G) be the number of evaluations with value j ∈ {1, 2, 3} for question q by the evaluators in Group G. Here, the value of an evaluation corresponds to the choices described in section 5.3. Then we calculate

R_q(G) = (r_q1(G), r_q2(G), r_q3(G))    (1)

r_qj(G) = n_qj(G) / Σ_{j'∈{1,2,3}} n_qj'(G)    (2)

R_q(G) is a vector that reflects the tendency of judgments made by evaluators in Group G for question q.

Figure 1 shows the tendency of judgment by evaluators. Figure 1(a) shows R_q(G_A ∪ G_B) and Figure 1(b) shows R_q(G_C), where G_A, G_B and G_C are the sets of evaluators in Groups A, B and C, respectively. Note that, although we prepared 44 questions for the experiment, no responses that took more than 5 s were observed for 8 questions; therefore, we used only 36 questions for the evaluation.

From Figure 1, we can see that most questions were judged as either 1 or 2, as intended (most state-C sessions had a short duration from the prompt to the answer, and thus were excluded from the evaluation). Another observation is that the ratio of value 1 for each question shows a similar tendency in both sets. To confirm this, we investigated the correlation between r_q1(G_A ∪ G_B) and r_q1(G_C) over all questions. Figure 2 shows the scattergram. The correlation coefficient between r_q1(G_A ∪ G_B) and r_q1(G_C) is 0.77, which shows that the judgments by the evaluators of Group A+B and Group C have some similarity.
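For illustration, a minimal sketch of equations (1) and (2) and of the correlation of the value-1 ratios between two groups, using toy counts rather than the experiment's data:

```python
# Sketch: per-question tendency vectors R_q(G) and the correlation of the
# ratio of value 1 between two evaluator groups (toy counts, not the paper's data).
import numpy as np

def tendency(counts):
    """counts: array of shape (num_questions, 3) with n_qj(G); returns r_qj(G)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

# Toy judgment counts for 4 questions, values 1/2/3, for Groups A+B and C.
n_AB = [[6, 3, 1], [2, 7, 1], [8, 1, 1], [1, 8, 1]]
n_C  = [[5, 4, 1], [3, 6, 1], [7, 2, 1], [2, 7, 1]]

r_AB = tendency(n_AB)
r_C = tendency(n_C)

# Correlation of the ratio of value 1 (column 0) across questions.
corr = np.corrcoef(r_AB[:, 0], r_C[:, 0])[0, 1]
print("r_q1 correlation between Group A+B and Group C:", round(corr, 2))
```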

This observation suggests that the tendency of judgment for a specific question is similar with and without the linguistic information of the response, which justifies our approach of estimating the user's internal state using non-verbal information [5]. However, it is also true that the linguistic information of the response affects the judgment.


Figure 2: Scatterplot of the ratio of value 1 (Group A+B vs. Group C).

Figure 3: Tendency of judgment by Group D.

In the judgments of Group C, the ratio of value 1 is larger than that of Group A+B (the average value of r_q1(G_A ∪ G_B) is 0.24 and that of r_q1(G_C) is 0.32), which suggests that the judgment became more difficult without observing the subjects' responses, and thus the judgments by Group C were more random than those by Groups A and B.

6.4. Comparison between Group C and D

In this section, we analyze the effect of audio information on the decision about the internal state by comparing the evaluation results of Groups C and D.

First, we calculated the agreement of the majority votes of Groups C and D. The result was κ = 0.41, which showed moderate agreement. Compared with the evaluations by Group C, those of Group D tended toward value 2 (state B), as shown in Figure 3. One observation was that there were responses in which the users moved their lips but said nothing. When the speech was missing, the evaluators tended to judge such responses as value 2 (thinking), while the evaluation was value 1 (being perplexed) when speech was presented. Other examples were filled pauses and fillers; subjects with filled pauses and fillers were judged as 2 when speech was presented, but the fillers had no effect when speech was omitted.

7. Conclusions

In this paper, we focused on the user's internal state after the system's prompt and before the user's first utterance. In our previous work, the effects of the contents of the system prompt and the user's utterance were unclear. From the analysis presented in this paper, we can draw the following three conclusions:

1. The user’s answers had a small effect on judgmentsof the evaluators.

2. The content of the system's prompt utterance had a considerable effect on the judgments (probably it became more difficult to judge without the prompt), but the tendency of the judgments was still similar.

3. Audio information also had a large effect, but the evaluators could still judge with only the visual information. Audio-visual synchronization (such as lip motion and speech) had an effect on the judgment.

In future work, we will investigate methods to determine the user's internal state automatically in real time by using audio and visual information, and implement them in the dialog system.

8. Acknowledgment

This work is a part of the project "Experimental challenges for dynamic virtualized networking resource control over an evolved mobile core network: a new approach to reduce massive traffic congestion after a devastating disaster," supported by the Ministry of Internal Affairs and Communications, Japan.

9. References

[1] A. Kobsa. User modeling in dialog systems: Potentials and hazards. AI & Society, 4:214–231, 1990.

[2] A. N. Pargellis, H.-K. J. Kuo, and C.-H. Lee. An automatic dialogue generation platform for personalized dialogue applications. Speech Communication, 42:329–351, 2004.

[3] R. Gajsek, V. Struc, S. Dobrisek, and F. Mihelic. Emotion recognition using linear transformations in combination with video. In Proc. Interspeech, pages 1967–1970, 2009.

[4] S. Hudson, J. Fogarty, C. Atkeson, D. Avrahami, J. Forlizzi, S. Kiesler, J. Lee, and J. Yang. Predicting human interruptibility with sensors: a wizard of oz feasibility study. In Proc. Conf. on Human Factors in Computing Systems, pages 257–264, 2003.

[5] Y. Chiba, M. Ito, and A. Ito. Estimation of user's internal state before the user's first utterance using acoustic features and face orientation. In Proc. HSI, 2012.

[6] S. E. Brennan and M. Williams. The feeling of another's knowing: Prosody and filled pauses as cues to listeners about the metacognitive states of speakers. J. Memory and Language, 34(3):383–398, 1995.

[7] M. Swerts and E. Krahmer. Audiovisual prosody and feeling of knowing. J. Memory and Language, 53(1):81–94, 2005.


A Survey on Evaluation Metrics for Backchannel Prediction Models

Iwan de Kok1, Dirk Heylen1

1 Human Media Interaction, University of Twente, Enschede, The Netherlands
[email protected], [email protected]

Abstract

In this paper we give an overview of the evaluation metrics used to measure the performance of backchannel prediction models. Both objective and subjective evaluation metrics are discussed. The survey shows that almost every backchannel prediction model is evaluated with a different evaluation metric. This makes comparisons between the developed models unreliable, even apart from the other variables in play, such as different corpora, languages, conversational settings, amounts of data, and definitions of the term backchannel.

Index Terms: backchannel, machine learning, evaluation metrics

1. Introduction

One of the aspects of nonverbal behavior that has been a subject of computational modeling for many years is backchanneling behavior. Originally these backchannel prediction models were developed for spoken dialog systems for telecommunication purposes, but nowadays these models are aimed at virtual humans and robots.

In this paper we give an overview of the metrics and methods used to evaluate the backchannel prediction models developed so far. Table 1 gives an overview of the backchannel prediction models and their evaluation methods. As the table shows, there are almost as many evaluation methods as there are backchannel prediction models. This makes a comparison between the different approaches very difficult.

The evaluation methods used can be divided into two categories: objective evaluation and subjective evaluation. The paper is organized to discuss these two evaluation methods separately.

With objective evaluation, the performance of the model is compared to (another part of) the corpus that is used for development. The evaluation analyzes how good the model is at reproducing the backchanneling behavior of the recorded listener. This type of evaluation has the challenge that people differ in their backchanneling behavior. The responses given by the recorded listener are not the only moments in the conversation where a backchannel is possible or required; predictions at other times might be just as good. In Section 2 the different measurements and approaches used to objectively evaluate the developed backchannel prediction models, and to deal with this challenge, are presented in more detail.

With subjective evaluation, observers are used to judge the generated backchanneling behavior of the model. The evaluation analyzes the capability of the model to produce correct and natural backchanneling behavior as perceived by humans. This type of evaluation circumvents the challenges of objective evaluations, but it is more time-consuming to perform and is thus unsuited for validating settings of the models and/or rapid prototyping. In Section 3 the different measurements and approaches used to subjectively evaluate the backchannel prediction models are presented in more detail.

The paper is concluded with our final thoughts on the subject and recommendations for the future.

2. Objective Evaluations

In objective evaluations of backchannel prediction models, the backchannel predictions made by the models are compared to the ground truth. A measure is selected which quantifies the comparison. Measures that are used to report objective evaluations include the cross-correlation coefficient [1], precision and recall [2, 3, 4, 7, 16], or F1 (the harmonic mean of precision and recall) [5, 8, 10, 11, 14, 15, 17, 18]. Most authors opt for a measure based on precision and recall, but in three areas differences between measures remain, namely ground truth selection, segmentation, and margin of error.

2.1. Ground Truth Selection

The majority of evaluations of backchannel prediction models are performed by comparing the predictions made by the model with the listener in the corpus [2, 4, 5, 7, 8, 11, 14, 15, 16, 17, 18]. As Ward and Tsukahara [4] have noted, this is not ideal. When analyzing the performance of their predictive rule, they conclude that 44% of the incorrect predictions were cases where a backchannel could naturally have appeared, as judged by one of the authors, but in the corpus there was silence or, more rarely, the start of a turn. Cathcart et al. [5] dealt with this problem by only using data with a high backchannel rate as test data, in order to minimize false negatives.

Others have dealt with this problem by collecting multiple perspectives on appropriate times to provide a backchannel.


Authors                      | Subjective | Objective | Objective Metric        | Ground Truth        | Segmentation        | Margin of Error
Watanabe & Yuuki (1989) [1]  |            | X         | Cross-Correlation Coef. | Multiple (Nodding)  | Continuous          | -
Okato et al. (1996) [2]      |            | X         | Precision/Recall        | Single              | Continuous          | -100/500 ms
Noguchi et al. (1998) [3]    |            | X         | Precision/Recall        | Multiple (Keyboard) | Pause-Bound Phrases | -
Ward & Tsukahara (2000) [4]  |            | X         | Precision/Recall        | Single              | Continuous          | -500/500 ms
Cathcart (2003) [5]          |            | X         | F1                      | Single              | Words               | -
Fujie et al. (2004) [6]      | X          |           | -                       | -                   | -                   | -
Takeuchi (2004) [7]          | X          | X         | Precision/Recall        | Single              | 100ms Pause Frames  | -
Kitaoka et al. (2005) [8]    | X          | X         | F1                      | Multiple (Keyboard) | 100ms Pause Frames  | -
Nishimura et al. (2007) [9]  | X          |           | -                       | -                   | -                   | -
Morency et al. (2008) [10]   |            | X         | F1                      | Single              | Continuous          | 0/1000 ms
De Kok et al. (2010) [11]    |            | X         | F1/FConsensus           | Multiple (Parallel) | Continuous          | -500/500 ms
Huang et al. (2010) [12]     | X          |           | -                       | -                   | -                   | -
Huang et al. (2010) [13]     | X          |           | -                       | -                   | -                   | -
Ozkan & Morency (2010) [14]  |            | X         | F1                      | Single              | Continuous          | 0/1000 ms
Ozkan & Morency (2010) [15]  |            | X         | F1                      | Single              | Continuous          | 0/1000 ms
Poppe et al. (2010) [16]     |            | X         | F1                      | Single              | Continuous          | -200/200 ms
De Kok et al. (2012) [17]    | X          | X         | F1                      | Single              | Continuous          | -500/500 ms
Ozkan & Morency (2012) [18]  |            | X         | F1/UPA                  | Single              | Continuous          | 0/1000 ms

Table 1: Overview of the corpus-based backchannel prediction models developed so far.


This was done either by asking multiple people to press a key on a keyboard at times they would give a backchannel in reaction to a recorded speaker [3, 12, 13], by asking multiple people to intentionally nod [1], or by recording multiple listeners in parallel who were each led to believe they were the only listener [11].

Recently two measures have been proposed that are specifically aimed at such multiple-perspective data: FConsensus [11] and User-Adaptive Prediction Accuracy [18].

De Kok et al. [11] recorded 3 listeners in parallel interaction with the same speaker. Each listener was unaware of the other two listeners. By combining the three 'versions' of the ground truth, moments are identified where one, two or three listeners responded. They proposed the FConsensus metric, following the reasoning that moments where more listeners performed a backchannel are more important for a model to predict, while a prediction should only be regarded as false if it occurs at a moment where none of the listeners performed a backchannel. In this metric, precision is calculated using all the moments where any listener performed a backchannel as ground truth, while recall is calculated using only the moments where the majority of listeners performed a backchannel as ground truth. The weighted harmonic mean is taken as the final performance measure.
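The following is a minimal sketch of our reading of FConsensus, not code from [11]: a prediction is matched to an annotated moment when it falls within a fixed margin (here ±500 ms, an assumption), precision is computed against the moments of any listener, and recall against the majority (consensus) moments.

```python
# Sketch of the FConsensus idea: precision against backchannel moments of *any*
# listener, recall against moments where a *majority* of listeners responded.
def f_consensus(predictions, listener_moments, margin=0.5):
    """predictions: list of predicted onset times (s);
    listener_moments: one list of onset times per parallel listener."""
    def matched(t, moments):
        return any(abs(t - m) <= margin for m in moments)

    any_moments = sorted(t for moments in listener_moments for t in moments)
    majority = len(listener_moments) // 2 + 1
    consensus_moments = [t for t in any_moments
                         if sum(matched(t, m) for m in listener_moments) >= majority]

    precision = (sum(matched(p, any_moments) for p in predictions) / len(predictions)
                 if predictions else 0.0)
    recall = (sum(matched(t, predictions) for t in consensus_moments) / len(consensus_moments)
              if consensus_moments else 0.0)
    return (2 * precision * recall / (precision + recall)) if precision + recall else 0.0

# Example: 3 parallel listeners and a handful of predicted onsets.
listeners = [[1.2, 4.8, 9.1], [1.3, 9.0], [4.9, 9.2, 12.0]]
print(round(f_consensus([1.25, 9.05, 20.0], listeners), 2))
```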

Ozkan and Morency [18] have proposed User-Adaptive Prediction Accuracy as an evaluation metric for backchannel prediction models. For this measure the model is asked for the n most likely backchannel moments in reaction to a speaker, where n is the number of backchannels given by the ground-truth listener. This measure allows evaluation of the ability of the model to adapt to different listeners: some listeners may backchannel frequently, while others backchannel only a limited number of times.
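The distinctive step of this measure is that the model may only return as many moments as the ground-truth listener produced; a minimal sketch of that selection step (illustrative only, not the authors' implementation):

```python
# Sketch: keep only the n highest-scoring predicted moments, where n is the
# number of backchannels the ground-truth listener actually produced.
def top_n_predictions(scored_moments, ground_truth_onsets):
    """scored_moments: list of (time_in_s, model_score); returns n selected times."""
    n = len(ground_truth_onsets)
    best = sorted(scored_moments, key=lambda ts: ts[1], reverse=True)[:n]
    return sorted(t for t, _ in best)

scored = [(0.8, 0.2), (2.1, 0.9), (5.4, 0.7), (7.3, 0.4), (9.9, 0.85)]
print(top_n_predictions(scored, ground_truth_onsets=[2.0, 9.8, 14.0]))  # -> [2.1, 5.4, 9.9]
```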

2.2. Segmentation

With regard to segmentation, the majority of models are evaluated on continuous data [1, 2, 4, 10, 11, 14, 15, 16, 17, 18]. This means that a prediction for a backchannel can be made at any time during the interaction, usually at a 10 ms interval. However, some models have limitations that segment the interaction into bigger chunks of data.

Noguchi and Den [3] use pre-delimited pause-bounded phrases as data. The proposed backchannel prediction model predicts for each such segment whether it is followed by a backchannel or not. Cathcart et al. [5] make a similar decision after each word.

Both Takeuchi et al. [7] and Kitaoka et al. [8] have proposed models that classify frames with no speech from the speaker. These pauses were split into segments of 100 ms. Each of these segments was classified as either 'making a backchannel', 'taking the turn', 'waiting for the speaker to continue' or 'waiting to make a backchannel or take the turn'.

2.3. Margin of Error

For the models evaluated using precision- and recall-based measures on continuous data, another discriminating factor applies, namely the margin of error. Precision- and recall-based measures rely on evaluating whether a prediction is 'at the same time' as the ground truth. The definition of 'at the same time' differs between evaluations. Okato et al. [2] use a margin of error of -100 ms to +300 ms from the onset of the ground-truth backchannel, Ward and Tsukahara [4] and De Kok et al. [11, 17] use a margin of error of -500 ms to +500 ms, Poppe et al. [16] use a margin of -200 ms to +200 ms, and Morency et al. [10] and Ozkan et al. [14, 15, 18] use a margin of error of 0 ms to +1000 ms.
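To make the role of the margin concrete, here is a minimal sketch (not tied to any particular cited system) of precision, recall and F1 in which a predicted onset counts as correct when it falls within an asymmetric window around a ground-truth onset; the window sizes are parameters.

```python
# Sketch: precision/recall/F1 for predicted backchannel onsets with an
# asymmetric margin of error around each ground-truth onset (times in seconds).
def margin_prf(predictions, ground_truth, before=0.5, after=0.5):
    def hit(p):
        return any(g - before <= p <= g + after for g in ground_truth)

    def covered(g):
        return any(g - before <= p <= g + after for p in predictions)

    precision = sum(hit(p) for p in predictions) / len(predictions) if predictions else 0.0
    recall = sum(covered(g) for g in ground_truth) / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example with a symmetric -500/+500 ms margin.
print(margin_prf([1.4, 6.0, 9.0], [1.0, 5.2, 12.3], before=0.5, after=0.5))
```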

3. Subjective Evaluations

When it comes to subjective measures, several strategies have been used to establish the performance of the models. The approaches used so far evaluate either a general impression of the backchannel behavior or individual backchannels.

Fujie et al. [6] made a pair-wise comparison between models in which the general impression of the backchannel behavior is measured. A subject interacted twice with a conversation robot system whose backchanneling behavior was driven by two different models. After these interactions the subject was asked, on a 5-point scale, which system they preferred, with 1 being system A, 5 being system B and 3 being no preference.

Huang et al. [12] also evaluated the general impression of the backchannel behavior. They generated virtual listeners in response to recorded speakers and presented these interactions to 17 subjects. Similar to Fujie et al. [6], the subjects were presented with three different virtual listeners, each driven by a different backchannel prediction model. After each interaction the subject was asked 7 questions about their perceived experience with regard to the timing of backchannels. On a 7-point Likert scale the subjects rated the virtual listeners on 'closeness', 'engrossment', 'rapport', 'attention', 'amount of inappropriate backchannels', 'amount of missed opportunities' and 'naturalness'.

Poppe et al. [16] also let participants evaluate virtual listeners in interaction with recorded speakers. They asked participants for each fragment "How likely do you think the listener's backchannel behavior has been performed by a human listener?". The participants made their judgement by setting a slider that corresponded to a value between 0 and 100.

Kitaoka et al. [8] had 5 subjects rate each generated backchannel individually. The data presented to the subjects were 16 to 18 samples of single sentences followed by a backchannel.


Each generated backchannel was rated on a 5-point scale ranging from 'early' to 'late', with an extra option for 'outlier'. They did this both for backchannels generated at times predicted by their model and at times as found in the corpus. They accumulated the counts of the 5 subjects and reported the percentage of ratings in the "good" category (rating 3). The same approach was used by Nishimura et al. [9].

De Kok et al. [17] evaluated their models in a similar fashion to Kitaoka et al. [8]. Subjects judged individual backchannels on their appropriateness. Contrary to Kitaoka et al., the process was done in real time and over the course of multiple conversational moves, keeping the backchannels in context. Subjects would hit the spacebar on a keyboard when they saw an inappropriately timed backchannel. As an evaluation metric they presented the percentage of backchannels that were not judged as inappropriate by any of the judges.

4. Conclusion

As this survey has shown, a wide variety of evaluation metrics have been used in the past. This makes comparing different methods in terms of performance even more complicated than it already is. Most models are trained and tested on different corpora, which differ in language, type of conversation, amount of data and exact definition of backchannel. This already makes a comparison between reported values unreliable. On top of these differences, the evaluation methods used also differ from each other. Some evaluation measures are used often (such as F1), but even then a direct comparison is not always fair because of differences in segmentation or margin of error. Differences also exist among the subjective evaluations.

Development of backchannel prediction models would benefit from a unified way to evaluate performance. It would give more insight into the performance of a model in comparison to previous work. A benchmark corpus would be ideal for this purpose, but a unified evaluation metric would be a start.

5. References

[1] T. Watanabe and N. Yuuki, "A Voice Reaction System with a Visualized Response Equivalent to Nodding," in Proceedings of the Third International Conference on Human-Computer Interaction, Vol. 1: Work with computers: organizational, management, stress and health aspects, 1989, pp. 396–403.

[2] Y. Okato, K. Kato, M. Kamamoto, and S. Itahashi, "Insertion of interjectory response based on prosodic information," in Proceedings of IVTTA '96, Workshop on Interactive Voice Technology for Telecommunications Applications, pp. 85–88, 1996.

[3] H. Noguchi and Y. Den, "Prosody-based detection of the context of backchannel responses," in Fifth International Conference on Spoken Language Processing, 1998.

[4] N. Ward and W. Tsukahara, "Prosodic features which cue back-channel responses in English and Japanese," Journal of Pragmatics, vol. 32, no. 8, pp. 1177–1207, 2000.

[5] N. Cathcart, J. Carletta, and E. Klein, "A shallow model of backchannel continuers in spoken dialogue," European ACL, pp. 51–58, 2003.

[6] S. Fujie, K. Fukushima, and T. Kobayashi, "A conversation robot with back-channel feedback function based on linguistic and nonlinguistic information," in Proc. Int. Conference on Autonomous Robots and Agents, 2004, pp. 379–384.

[7] M. Takeuchi, N. Kitaoka, and S. Nakagawa, "Timing detection for realtime dialog systems using prosodic and linguistic information," International Conference on Speech Prosody, pp. 529–532, 2004.

[8] N. Kitaoka, M. Takeuchi, R. Nishimura, and S. Nakagawa, "Response Timing Detection Using Prosodic and Linguistic Information for Human-friendly Spoken Dialog Systems," Transactions of the Japanese Society for Artificial Intelligence, vol. 20, pp. 220–228, 2005.

[9] R. Nishimura, N. Kitaoka, and S. Nakagawa, "A spoken dialog system for chat-like conversations considering response timing," in Proceedings of the 10th International Conference on Text, Speech and Dialogue, Springer, 2007, pp. 599–606.

[10] L.-P. Morency, I. de Kok, and J. Gratch, "Predicting Listener Backchannels: A Probabilistic Multimodal Approach," in Intelligent Virtual Agents, 2008, pp. 176–190.

[11] I. de Kok, D. Ozkan, D. Heylen, and L.-P. Morency, "Learning and Evaluating Response Prediction Models using Parallel Listener Consensus," in Proceedings of the International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, 2010.

[12] L. Huang, L.-P. Morency, and J. Gratch, "Parasocial Consensus Sampling: Combining Multiple Perspectives to Learn Virtual Human Behavior," in Proceedings of Autonomous Agents and Multi-Agent Systems, Toronto, Canada, 2010, pp. 1265–1272.

[13] ——, "Learning Backchannel Prediction Model from Parasocial Consensus Sampling: A Subjective Evaluation," in Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2010, pp. 159–172.

[14] D. Ozkan and L.-P. Morency, "Concensus of Self-Features for Nonverbal Behavior Analysis," in Human Behavior Understanding, 2010.

[15] D. Ozkan, K. Sagae, and L.-P. Morency, "Latent Mixture of Discriminative Experts for Multimodal Prediction Modeling," in Proceedings of the 23rd International Conference on Computational Linguistics, Association for Computational Linguistics, 2010, pp. 860–868.

[16] R. Poppe, K. P. Truong, D. Reidsma, and D. Heylen, "Backchannel Strategies for Artificial Listeners," in Intelligent Virtual Agents, Philadelphia, Pennsylvania, USA, 2010, pp. 146–158.

[17] I. de Kok, R. Poppe, and D. Heylen, "Iterative Perceptual Learning for Social Behavior Synthesis," Centre for Telematics and Information Technology, University of Twente, Tech. Rep., 2012.

[18] D. Ozkan and L.-P. Morency, "Latent Mixture of Discriminative Experts," accepted for publication in ACM Transactions on Multimedia.


3rd party observer gaze during backchannels

Jens Edlund1, Mattias Heldner2, Anna Hjalmarsson1

1 KTH Speech, Music and Hearing, Stockholm, Sweden
2 Linguistics, Stockholm University, Stockholm, Sweden

[email protected], [email protected], [email protected]

Abstract

This paper describes a study of how the gazes of 3rd party observers of dialogue move when a speaker is taking the turn and producing a backchannel, respectively. The data is collected and basic processing is complete, but the results section for the paper is not yet in place. It will be in time for the workshop, however, and will be presented there, should this paper outline be accepted.

Index Terms: speech synthesis, unit selection, joint costs

1. Introduction

The most common view of face-to-face communication, strongly influenced by [1], is that interlocutors take turns speaking. The principles that guide this turn-taking have been a well-studied topic in spoken communication research and application development alike for decades. And although there are different opinions as to how precise and mandatory turn-taking is, there is no doubt that the most common observation in spoken dialogue is one speaker at a time, interspersed by transitions from the vocalizations of one speaker to those of another. In this paper, we focus on a subset of these transitions, namely when the incoming speaker gives a backchannel. Backchannels, as coined by [2], are brief feedback utterances generally described as being somehow produced in the background and often not taken to constitute a speaking turn or to claim the floor. While carrying little propositional information and being unobtrusive in character, it has been shown that these short interjections play a significant role in the collaborative processes of dialogue [e.g. 3]. Analyses of the segments preceding backchannels further show that there is a versatile set of multimodal behaviours that affects the probability of a backchannel [e.g. 4]. The motivation of the current work is to better understand backchannel behaviour in dialogue. More specifically, we aim to learn more about the timing and the conspicuousness of these events by analyzing the gaze patterns of 3rd party observers.

2. Background and related work

The flow of interaction in face-to-face communication is a multifaceted process that involves a complex set of behaviours and different modalities [1]. Many researchers approach this by first identifying appropriate places to take the turn. One way to do this is to pick those places where speaker changes in fact occurred [e.g. 5]. This method results in an objective and repeatable selection, particularly if automatic speech activity detection is used to decide when participants are speaking and when they are silent. An inherent problem with the method, however, is that it only captures actual speaker changes; never possible but unrealized speaker changes, or potential transition relevance places (TRPs) in the terminology of [1]. Another common method is to have one or more judges subjectively identify places where a speaker change could occur [e.g. 6; 7]. This method has advantages. It potentially captures not only places where real speaker changes occurred, but also places where speaker changes might have occurred without harm to the flow of the interaction, but did not. The method might also leave out those places where inappropriate speaker changes actually occurred. An objection (possibly the strongest objection) to the method is its lack of ecological validity. It is debatable whether people do the same thing when asked, for example, to press a button while listening to a dialogue as they would do when they contribute their voices as participants in conversation.

In the present study, we explore a novel method of identifying places where a speaker could have entered the conversation. The method is based on [8, 9], who use gaze patterns and gaze shifts of non-participating listeners to study turn-boundary projection. The method relies on the intuition that 3rd party observers of a conversation tend to direct their gaze at the current speaker in the conversation [e.g. 10]. One end goal of this effort is to be able to judge, for each frame or segment of a dialogue, how appropriate it is for another speaker to start speaking.

2.1. 3rd party gaze

Gaze patterns of speakers and their addressees are a relatively well-explored research area [10, 11]. For example, it has been shown that listeners gaze almost twice as much at speakers in dyadic dialogue than vice versa [12], and the interactive gaze patterns between listeners and speakers play a significant role in controlling the flow of interaction [10].

In the present study, we use the gaze behaviour of 3rd party observers (overhearers) of a dialogue. The motivation of this method is to obtain a fine-grained measure of listeners' ongoing focus of attention which is directly time-aligned with events in the dialogue. The term 3rd party observers is used to refer to listeners that are not directly addressed by the speaker. Consequently, when a listener becomes an active party of the ongoing conversation, that person is by definition no longer a 3rd party observer. Based on the hypothesis that dialogue is a collaborative process and that the degree of participation affects comprehension, it has been shown that the processes of understanding differ between addressees and overhearers [3]. The 3rd party observers in the present study, however, are not co-present, but attending to a pre-recorded video of a dialogue, making their role as overhearers static. While the behaviour of 3rd party observers and their role in the dialogue may not be representative of a co-present active listener, we have previously


shown that 3rd party observers of videos of pre-recorded dialogues largely look at the same thing: the speaker [13].

2.2. Backchannel feedback

A large number of vocalizations in everyday conversation are traditionally not regarded as part of the information exchange, but have important communicative and interactive functions. Examples include confirmations such as "yeah" and "ok" as well as traditionally non-lexical items, such as "uh-huh", "um", and "hmm". Vocalizations like these have been grouped in different constellations and called different names, for example backchannels (i.e. back-channel activity, [2]), continuers [14], feedback and grunts, and attempts at formalizing their function and meaning have been made [e.g. 15]. We follow [16], who argue that the term backchannel feedback is relatively neutral, and henceforth use the term backchannel.

In the present study, we investigate backchannels by analysing to what extent 3rd party observers gaze at speakers who produce backchannels and when this gaze shift is made relative to the offset of the previous speaker's turn. It has previously been shown that 3rd party observers occasionally appear to anticipate speaker changes, shifting their gaze to the other speaker before the new turn is initiated, sometimes even before the end of the original speaker's turn [9]. This finding supports the claim that listeners can to some extent anticipate the ends of speaker turns. In the current work, we focus on speaker changes where the incoming speaker gives a backchannel. By analysing the gaze patterns of 3rd party observers, we will be able to make in-depth analyses of the nature of these events, that is, whether backchannels are events to which 3rd party observers pay little attention, or whether these events can be anticipated in advance and are attended to by listeners to a similar extent as other types of speaker changes.

3. The Spontal corpus

3.1. Corpus description

The Spontal corpus contains in excess of 60 hours of dialogue: 120 nominal half-hour sessions (the duration of each dialogue is minimally 30 minutes). The subjects are all native speakers of Swedish. The subjects were balanced (1) as to whether the interlocutors are of the same or opposing gender and (2) as to whether they know each other or not. The recordings contain high-quality audio and video. Spontal subjects were allowed to talk about anything they wanted at any point in the session, including meta-comments on the recording environment. Four segments of five minutes each were randomly chosen from the development set of the most recent Spontal recordings (Spontal IDs 09-20, 09-28, 09-30, 09-36), but in such a manner that they were taken from different balance groups: Spontal dialogues are balanced for same/different gender and for whether or not the participants knew each other before the recording. The segments included one known and one unknown same-gender (male) pair, as well as one known and one unknown opposing-gender pair. Each segment consisted of the first five minutes of the dialogue, that is, the first five minutes of the official recording following the moment when the recording assistant told the participants that the recording had started. The segments were manipulated such that the front-facing videos of both participants were displayed simultaneously next to each other, as seen in Figure 1.

Figure 1. Still-image from one of the front-facing videos of both participants.

3.2. Speech/non-speech decisions

The analyses presented here were based on an operationally defined model of interaction. This interaction model is computationally simple yet powerful and uses boundaries in the conversation flow, defined by the relative timing of speech from the participants in the conversation, as the only source of information. In particular, we annotate every instant in a dialogue with an explicit interaction state label; states describe the joint vocal activity of both speakers, building on a tradition of computational models of interaction [17].

As a basis for the interaction model, we first performed automatic speech activity detection (SAD) (for a detailed description of this procedure see [18]). The SAD produced a segmentation of each speaker state sequence into TALKSPURTS and PAUSES. TALKSPURTS were defined as a minimum of two contiguous speech frames (i.e. 200 ms, as enforced by the decoding topology) by one party that were preceded and followed by a minimum of two contiguous silence frames from the speaker. Similarly, PAUSES were defined as a minimum of two contiguous silence frames from that speaker. Based on these segments, we extract speaker changes (SC): those places where one solitary speaker speaks, followed by solitary speech from another.
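A minimal sketch of this kind of segmentation, assuming a 100 ms frame step and a boolean speech/non-speech decision per frame (a simplification of the actual SAD decoding used for Spontal):

```python
# Sketch: group frame-level speech activity (100 ms frames) into TALKSPURTS
# and PAUSES, requiring at least two contiguous frames (>= 200 ms) per segment.
from itertools import groupby

def segment(speech_frames, min_frames=2):
    """speech_frames: sequence of booleans, one per 100 ms frame.
    Returns a list of (label, start_frame, end_frame_exclusive)."""
    segments, start = [], 0
    for is_speech, run in groupby(speech_frames):
        length = len(list(run))
        if length >= min_frames:
            label = "TALKSPURT" if is_speech else "PAUSE"
            segments.append((label, start, start + length))
        start += length
    return segments

vad = [False, False, True, True, True, False, False, True, False, True, True]
print(segment(vad))
# Runs shorter than two frames (the single True at index 7, the single False at 8)
# are simply left unlabelled in this simplified version.
```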

3.3. 3rd party Gaze annotation

Eight subjects participated in the third-party observer gaze data collection. Each subject was placed, in a sound-proofed studio, in front of a monitor on which the side-by-side videos of Spontal dialogues could be shown. Sound was replayed through stereo loudspeakers. Throughout each session, a Tobii T120 gaze tracker was used to determine where the subjects were looking. In order to motivate the subjects to pay close attention to the interactions, they were told that their task was to analyze the personalities of the participants in each dialogue. They were given a questionnaire with questions about the topic of the conversation and about the "big five" personality traits of each participant. After each of the three five-minute dialogue segments, they filled in a questionnaire. Although the participants were aware that their gaze was being tracked, they had no knowledge of the purpose of this tracking, nor were they instructed at any point to pay special attention to the person speaking.

Gaze data is processed in a simple but robust manner. We used the fixation point data delivered by the system, rather than the raw data. For each frame, we count the number of subjects


whose fixation point rests on the left half and the right half of the monitor, respectively, and normalize this to a number between -1 and 1, where -1 means that every subject whose gaze was captured looked at the left half of the monitor, and 1 means that they all looked at the right half. More details on the collection of third-party observer gaze data are presented in [13].
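For concreteness, a minimal sketch of this per-frame normalization; the fixation coordinates and the monitor width are invented, and frames with lost tracking are simply dropped, which is an assumption rather than the study's exact procedure:

```python
# Sketch: per video frame, turn the observers' fixation points into a single
# number in [-1, 1]: -1 = all captured gazes on the left half, 1 = all on the right.
def frame_gaze_score(fixation_xs, monitor_width=1920):
    """fixation_xs: horizontal fixation coordinates of the subjects whose gaze
    was captured in this frame (subjects with lost tracking are simply absent)."""
    if not fixation_xs:
        return 0.0  # no captured gaze this frame
    right = sum(x >= monitor_width / 2 for x in fixation_xs)
    left = len(fixation_xs) - right
    return (right - left) / len(fixation_xs)

print(frame_gaze_score([300, 400, 1500, 1700]))  # 0.0: split evenly
print(frame_gaze_score([1200, 1500, 1800]))      # 1.0: everyone on the right half
```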

The timing of their shifting their gaze from a previous speaker to a next speaker has been shown to vary, and occasionally their gaze will shift only to shift back again when no speaker change occurs. By averaging the gaze target (speaker A, speaker B, elsewhere) over a number of 3rd party observers and normalizing the results, we get a number from -1 (everybody looks at speaker A) to 1 (everybody looks at speaker B). The number reflects who the 3rd party observers think is going to be the speaker in the near future and, plotted over time, provides insight about actual speaker changes, with which it is highly correlated, but also about moments in time where some or many observers expected a speaker change.

4. Method

4.1. Backchannel annotation

As a basis for further analysis, the Spontal dialogues used in the gaze data collection were manually annotated for verbal backchannels. The annotation was done at the talkspurt level, where a segment was considered to be a backchannel if that segment's (only) function was to provide feedback to the other interlocutor's speech, without providing any new propositional information. Using this guideline as the principle for the annotation, two annotators labelled the three dialogues independently with high annotator agreement. In total there were 5 disagreements between the annotators, but all were resolved in agreement after discussion.

In addition to the manual annotation of backchannels, the talkspurts were subdivided into very short utterances (VSUs) and their complement (NONVSUs) based on their duration. Talkspurts between 2 and 10 frames in duration (i.e. 200 ms to 1000 ms) were labelled VSUs and those longer than 10 frames (i.e. ≥ 1100 ms) were labelled NONVSUs [19].

4.2. Selection and alignment

For this investigation, we chose to look at the onsets of talkspurts: the transitions between silence and vocalization in one speaker's channel. We characterize these transitions based on whether the new talkspurt is a BACKCHANNEL or a NONBACKCHANNEL and whether the transition begins in OVERLAP, after a GAP, or (perceptually) with NOGAPNOOVERLAP. We also include the onsets of CONTINUING talkspurts, where the same speaker was the last to speak before a preceding silence (a pause). The resulting 8 combinations and their respective frequencies are shown in Table 1.

Table 1. Frequencies of different types of transitions from one speaker to another.

Talkspurt type | Transition type | Frequency
Backchannel    | Overlap         |
Backchannel    | NoGapNoOverlap  |
Backchannel    | Gap             |
Backchannel    | Continuing      |
NoBackchannel  | Overlap         |
NoBackchannel  | NoGapNoOverlap  |
NoBackchannel  | Gap             |
NoBackchannel  | Continuing      |

We then calculate the gaze distribution (the number of 3rd-party observers watching the incoming speaker vs. the number watching the other speaker) for each 100 ms frame up to ten frames before and after the talkspurt begins. We sum all of these distributions so that we get the average gaze distribution at T, for T = -1 s to T = 1 s, in relation to talkspurt beginnings. By splitting this data on the categories defined above, we hope to see not only to what extent 3rd-party observers look at incoming speakers under different conditions, but also how quickly and robustly they are attracted to the new speaker.
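A minimal sketch of the onset-aligned averaging, using the 100 ms frame step and the ±1 s window described above but with randomly generated gaze scores in place of the real data:

```python
# Sketch: average the per-frame gaze score in a window of +/- 10 frames (100 ms
# each) around every talkspurt onset, giving one mean curve per onset category.
import numpy as np

def onset_aligned_average(gaze_scores, onset_frames, half_window=10):
    """gaze_scores: 1-D array of per-frame scores in [-1, 1];
    onset_frames: frame indices of talkspurt onsets of one category."""
    windows = [gaze_scores[i - half_window:i + half_window + 1]
               for i in onset_frames
               if i - half_window >= 0 and i + half_window < len(gaze_scores)]
    return np.mean(windows, axis=0) if windows else None

gaze = np.clip(np.cumsum(np.random.uniform(-0.1, 0.1, size=600)), -1, 1)
curve = onset_aligned_average(gaze, onset_frames=[50, 120, 300, 455])
print(curve.shape)  # (21,) -> one value per 100 ms step from -1 s to +1 s
```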

4.3. Grouping of categories

The backchannels were subsequently automatically categorized as overlapping or non-overlapping. The overlap categorization was based on whether the VAD (described in section 3.2) had detected speech in both channels in at least two adjacent frames. The minimum criterion of two frames of overlap is used since [20] shows that about 130 milliseconds of simultaneous speech is needed for speech to be perceived as overlapping.

5. Results (pending)

Report for: BC/non-BC, non-BC after pause; overlap, gap, no-gap-no-overlap; perceptual gap/overlap/no-gap-no-overlap.

5.1. Descriptive statistics of categories

Pending.

5.2. Gaze targets overall

Pending.

5.3. Timing of gaze shift

Pending.

6. Discussion

Pending.

7. Acknowledgements

The work was supported by the Riksbankens Jubileumsfond (RJ) project P09-0064:1-E Prosody in conversation, the EU project Get Home Safe, and the Swedish Research Council (VR) project 2011-6152.


8. References

[1] Sacks, H., Schegloff, E., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 50, 696-735.

[2] Yngve, V. H. (1970). On getting a word in edgewise. In Papers from the sixth regional meeting of the Chicago Linguistic Society (pp. 567-578). Chicago.

[3] Schober, M., & Clark, H. (1989). Understanding by addressees and overhearers. Cognitive Psychology, 21(2), 211-232.

[4] Gravano, A., & Hirschberg, J. (2009). Backchannel-inviting cues in task-oriented dialogue. In Proceedings of Interspeech 2009 (pp. 1019-1022). Brighton, U.K.

[5] Duncan, S. (1972). Some Signals and Rules for Taking Speaking Turns in Conversations. Journal of Personality and Social Psychology, 23(2), 283-292.

[6] Heldner, M., Edlund, J., & Carlson, R. (2006). Interruption impossible. In Bruce, G., & Horne, M. (Eds.), Nordic Prosody, Proceedings of the IXth Conference, Lund 2004 (pp. 97-105). Frankfurt am Main, Germany.

[7] de Ruiter, J. P., Mitterer, H., & Enfield, N. J. (2006). Projecting the end of a speaker's turn: a cognitive cornerstone of conversation. Language, 82(3), 515-535.

[8] Tice, M., & Henetz, T. (2011). The eye gaze of 3rd party observers reflects turn-end boundary projection. In Procs. of the 15th Workshop on the Semantics and Pragmatics of Dialogue (SEMDIAL 2011/Los Angelogue) (pp. 204-205). Los Angeles, CA, US.

[9] Tice, M., & Henetz, T. (2011). Turn-boundary projection: Looking ahead. In Proceedings of the 33rd Annual Meeting of the Cognitive Science Society. Boston, Massachusetts, USA.

[10] Kendon, A. (1967). Some functions of gaze direction in social interaction. Acta Psychologica, 26, 22-63.

[11] Bavelas, J., Coates, L., & Johnson, T. (2002). Listener Responses as a Collaborative Process: The Role of Gaze. Journal of Communication, 52(3), 566-580.

[12] Argyle, M., & Cook, M. (1976). Gaze and mutual gaze. Cambridge University Press.

[13] Edlund, J., Alexandersson, S., Beskow, J., Gustavsson, L., Heldner, M., Hjalmarsson, A., Kallionen, P., & Marklund, E. (2012). 3rd party observer gaze as a continuous measure of dialogue flow. In Proc. of LREC 2012. Istanbul, Turkey.

[14] Schegloff, E. (1982). Discourse as an interactional achievement: Some uses of 'uh huh' and other things that come between sentences. In Tannen, D. (Ed.), Analyzing Discourse: Text and Talk (pp. 71-93). Washington, D.C., USA: Georgetown University Press.

[15] Ward, N. (2004). Pragmatic functions of prosodic features in non-lexical utterances. In Proceedings of Speech Prosody (pp. 325-328).

[16] Ward, N., & Tsukahara, W. (2000). Prosodic features which cue back-channel responses in English and Japanese. Journal of Pragmatics, 32(8), 1177-1207.

[17] Norwine, A. C., & Murphy, O. J. (1938). Characteristic time intervals in telephone conversation. The Bell System Technical Journal, 17, 281-291.

[18] Heldner, M., Edlund, J., Hjalmarsson, A., & Laskowski, K. (2011). Very short utterances and timing in turn-taking. In Proceedings of Interspeech 2011 (pp. 2837-2840). Florence, Italy.

[19] Edlund, J., Heldner, M., & Pelcé, A. (2009). Prosodic features of very short utterances in dialogue. In Vainio, M., Aulanko, R., & Aaltonen, O. (Eds.), Nordic Prosody - Proceedings of the Xth Conference (pp. 57-68). Frankfurt am Main: Peter Lang.

[20] Heldner, M. (2011). Detection thresholds for gaps, overlaps and no-gap-no-overlaps. Journal of the Acoustical Society of America, 130(1), 508-513.


Feedback and activity in dialogue: signals or symptoms?

Andrew Gargett

Linguistics Department, UAE University, Al Ain
[email protected]

Abstract

This paper presents new approaches to modelling both linguistic and non-linguistic feedback during instruction giving in a virtual domain. Our approach enables fine-grained investigation of how language and actions are conditioned by task-level and domain-level features of dialogue. In a preliminary study, we examine the interaction between pauses in linguistic and non-linguistic activity. As far as we know, ours is the first analysis of pauses across modalities. In the longer term, we aim to use these techniques as a window on the underlying processes conditioning feedback, and for such applications as the generation of situated forms of listening, such as instruction following.

Index Terms: feedback, linguistic and actional pauses, virtual worlds

1. Introduction

During instruction giving, backchanneling is a normal part of feedback behaviour. However, an instruction giver may well take all kinds of instruction-following behaviour as feedback; e.g., should an instruction giver consider an instruction follower who has stopped talking or moving to be indicating understanding or lack of understanding? For their part, an instruction follower can deliberately signal problems arising from lack of understanding, by stopping when faced with difficulties, or even waiting for further clarification.

Given that such silence and inaction is ubiquitous in everyday conversation (e.g. [5]), instruction givers clearly need to be very good at detecting and dealing with such evidence about instruction follower behaviour. Instruction followers, for their part, should impart the correct signals when necessary. However, such phenomena as silence and inaction lack content (quite literally), which raises the question: how do we in fact construe the meaning of pauses in both action and language?1 And given the apparent involvement of pausing in feedback behaviour, how does such lack of action or language affect other feedback channels?

The long-term aim of our work is to answer such questions by developing a method for gathering fine-grained information about interactive language behaviour in multi-modal settings.

1 While pauses lack referential content, there is certainly some sense that can be made of them, albeit wholly gained from the context.

To this end, in this paper we present some preliminary work on a multi-modal corpus, the SCARE corpus ([3]), to see if, using this method, we are able to discover interesting interactions between pausing in action and language and other forms of feedback, specifically backchannels. We hope that by examining the interaction between such cross-modal phenomena, insights can be gained into the "meaning" of such interactional phenomena.

2. Previous work

The literature on pausing behaviour is well-established, but typically does not consider actional alongside linguistic pausing. Pauses involve unusually lowered levels of activity in the production systems of a single language user, in our case lowered activity levels in language and/or actions. How low such levels must go for a pause (to be perceived) to have occurred is a difficult question, and some useful progress has been made toward answering this (e.g. [1]). Pauses have recently been of some interest in research on interaction. Heldner and Edlund ([5]), in particular, provide a thorough typology from different speakers' perspectives, echoing [8], in which gaps are between-speaker silences while pauses are within-speaker silences. We adopt this definition, and extend it to define pauses in actions to be within-actor inactivity, and gaps in action to be between-actor inactivity.
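As a concrete reading of this definition, an interval of silence or inactivity can be classified by comparing who produced the activity on either side of it. The following is an illustrative sketch (Python, with a hypothetical labelling of producers; not tied to any particular corpus tooling):

# Classify an interval of silence/inactivity as a pause (same producer on both
# sides) or a gap (different producers), for speech or for actions alike.
def classify_interval(prev_producer, next_producer):
    """Return 'pause' for within-producer intervals, 'gap' for between-producer ones."""
    return "pause" if prev_producer == next_producer else "gap"

print(classify_interval("IF", "IF"))   # within-actor inactivity  -> 'pause'
print(classify_interval("IG", "IF"))   # between-speaker silence  -> 'gap'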

In line with a growing body of research, we take it that pauses provide a window on language production and cognitive processing (e.g. [10]). However, such work has until now been largely linguistically oriented. We seek to extend this to other production systems, in a way which is in line with previous literature on linguistic pauses within interaction (e.g. [5]).

Now, numerous established corpora of instruction giving dialogues (e.g. TRAINS2) are strictly text based. This has led to a paucity of information regarding the use of situational features made by interlocutors. Extending models of interaction to incorporate such information may provide qualitatively distinct accounts of what is going on in dialogue. In this paper, we will offer a preliminary proof-of-concept study, suggesting the usefulness of such information.

2 http://www.cs.rochester.edu/research/speech/trains.html


Corpus collections of multimodal dialogues, like the SCARE3 ([3]) and GIVE-24 ([4]) corpora, are crucial for approaches of the kind we are proposing. The availability of such corpora provides an opportunity to make fine-grained investigations of the situational features conditioning interaction.

Our approach is, as far as we know, the first to empirically model the interaction of different forms of feedback across modalities. This enables us to make uniquely cross-modal comparisons of dialogue phenomena.

3. Method

3.1. Data

We used data from the SCARE corpus ([3]), a collection of instruction giving dialogues in a virtual world (created using Quake II gaming software) made up of two levels, each with between 7 and 9 rooms, these rooms having buttons for opening cabinets that contained objects to be retrieved (see Figure (1) for screenshots). The corpus consists of 15 sessions, with interlocutors taking the roles of either instruction giver (IG) or instruction follower (IF). They had to complete a series of 5 simple tasks (retrieving objects), with the IG verbally guiding the IF through the world, but only the IG having access to a map of the world and a list of the tasks to be completed. The 19 male and 11 female participants had an average age of 30, and identified as native speakers of North American English. Sessions ranged from 10 minutes in length to over half an hour.

From this corpus, we collected, from 12 of the 15 SCARE sessions, objective correlates of pauses in actions (i.e. complete cessation of physical activity),5 while for pauses in language we relied on the judgements made by the original annotators of the linguistic aspects of the SCARE corpus. Accompanying the SCARE corpus were detailed recordings of information about the fifteen game sessions, including the position and orientation of the IF, as well as the locations of objects in the SCARE world, such as buttons, cabinets and doors.6 From the data streams recorded in these log files, information about events could be extracted, such as whether the instruction follower was or was not moving or turning; we took pauses in actions to be those periods between when the follower was turning and/or moving, as well as the context of such activity or inactivity. Note that due to the way the SCARE corpus is recorded, only the instruction follower both moves and talks, while the instruction giver simply talks.

3 http://slate.cse.ohio-state.edu/quake-corpora/scare/
4 www.give-challenge.org/research/page.php?id=give-2-corpus
5 Ignoring sessions 1, 5, and 15, which present various problems for such data collection.
6 Thanks to Alexander Koller and Krystof Drys for making available code, some of which was adapted in the data retrieval process. The use we made of this modified code is of course our own responsibility.
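To illustrate the extraction step described above, here is a minimal sketch (Python, with a hypothetical event-log format; the actual SCARE logs and the adapted code differ) of collecting action pauses as the intervals between the follower's movement and turning events:

# Derive "pauses in actions" as intervals between the instruction follower's
# movement/turning events, given a time-ordered, non-overlapping event list.
def action_pauses(activity_intervals, min_duration=0.0):
    """Return (start, end) intervals during which the follower neither moves nor turns."""
    pauses = []
    for (_, prev_end), (next_start, _) in zip(activity_intervals, activity_intervals[1:]):
        if next_start - prev_end > min_duration:
            pauses.append((prev_end, next_start))
    return pauses

# Hypothetical follower activity: two movements and one turn, in seconds.
activity = [(0.0, 4.2), (6.0, 9.5), (9.7, 15.0)]
print(action_pauses(activity, min_duration=0.5))   # -> [(4.2, 6.0)]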

We developed a Scala tool for re-building the SCARE corpus as a stand-off corpus using the NITE NXT toolkit ([7]). The NITE NXT approach is particularly useful for us due to its rich structuring of data, including a data set model for structuring a corpus in terms of (i) observations, (ii) agents, (iii) the interaction, as well as (iv) the signal. In particular, the observations can be multi-layered, either directly aligned to the timing level, or else symbolically linked to other levels (e.g. annotations of dialogue acts can be linked to actual utterances, which in turn can be directly aligned with the timing of the original audio and video signal). Aside from allowing us to adequately model the information contained in the SCARE dialogues, this also allowed access to a very useful library of Java classes bundled with the toolkit (e.g. for searching NXT-formatted corpus files).

For backchannels, we examined acknowledgements like “ok” or “yeah”, tacit agreements like “mhm”, and fuller expressions of agreement like “yep” or “alright”, as well as the interjections “um” and “uh” (e.g. [9]). We will not consider here the role of “gaps”, as defined by [5] (although it would be interesting to pursue this further in the future).

3.2. Procedure

As a proof-of-concept of our approach to modelling multimodal elements of feedback in interaction, we carried out the four studies reported in Table (1), comparing instruction giver (IG) and instruction follower (IF) behaviour.

In each case, we attempt to determine whether such forms of feedback, in both instruction givers and followers, are independent or not of the activity of the instruction follower. In the context of instruction giving, the inactivity of the instruction follower is a direct indication of trouble in completing the task. For the instruction giver, given the need to finish the task as quickly as possible, the instruction follower's lack of movement is likely a symptom of misunderstanding (where inaction = inability to act), while the instruction follower could well intentionally signal their lack of understanding in this way (cf. “Sorry, what was that?”).

3.3. Results

3.3.1. Studies 1 & 2

Table (2) reports the counts of backchannel tokens as used by speakers in each role, interjections being reserved for the fourth study (see Section (3.3.3) below).7 We carried out a Pearson chi-squared test on this data, with result χ2 = 106.5 (p < 0.01, df = 3), suggesting that the use of backchannels by Instruction Giver and Instruction Follower does not have the same distribution, in other words that the use of backchannels is not independent of the role of the speaker.

7 Recall, IG = instruction giver, IF = instruction follower.


Figure 1: Screenshots of rooms within the SCARE world

Study 1: In the context of pause in actions, what IG backchannels are most likely?
Purpose: Evaluating overlap of backchannels of instruction givers with the activity of instruction followers

Study 2: In the context of pause in actions, what IF backchannels are most likely?
Purpose: Evaluating overlap of backchannels of instruction followers with their own activity

Study 3: In the context of pause in actions and words, what is the likely backchannel to be used, and who is most likely to use it?
Purpose: Evaluating overlap of backchannels and cessation of both actions and language in instruction givers and followers

Study 4: In the context of pause in actions and words, how does the choice between backchannels "um" vs. "uh" affect word pause duration?
Purpose: Evaluating overlap of interjections with cessation of both actions and language

Table 1: Proof-of-concept studies

Note that Table (2) also reports standardised residuals, which are useful in contrasting how the actual pattern of use of backchannel tokens differs from one where the use of such tokens would be independent of the role of the speaker (i.e. the null hypothesis case).

Token            IG           IF           Row totals
alright          77 (.96)     30 (-1.3)    107
mhm              4 (-5.9)     61 (7.9)     65
ok               381 (.63)    191 (-.84)   572
yeah             165 (1.5)    63 (-2.0)    228
Column totals    627          345          972

Table 2: Backchannels in the context of IF inactivity (including standardised residuals)
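To make the reported test concrete, here is a minimal sketch (assuming Python with numpy/scipy, not the authors' actual tooling) of the chi-squared test on the counts in Table 2 and of the standardised (Pearson) residuals shown in parentheses:

# Chi-squared test of independence for backchannel token counts by speaker role,
# plus standardised (Pearson) residuals: (observed - expected) / sqrt(expected).
import numpy as np
from scipy.stats import chi2_contingency

tokens = ["alright", "mhm", "ok", "yeah"]
counts = np.array([[77, 30],      # IG, IF counts per token
                   [4, 61],
                   [381, 191],
                   [165, 63]])

chi2, p, df, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, df = {df}, p = {p:.3g}")

residuals = (counts - expected) / np.sqrt(expected)
for token, row in zip(tokens, residuals):
    print(token, np.round(row, 2))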

3.3.2. Study 3

Two investigations were conducted here, one that examined what happened in the context of linguistic pauses by each speaker preceding action pauses, and the other in the context of linguistic pauses following action pauses. It turns out that only the tokens "ok" and "yeah" occur with enough frequency for results to be significant. The comparisons in Table (3) yield results of far less significance, with χ2 = 1.67 (p = .80, df = 4). The data in Table (4) is similar, with χ2 = 2.05 (p = .73, df = 4). In neither case can independence of the use of tokens from speaker role be refuted.

Token            IG          IF          Row totals
ok               9 (-.55)    22 (.40)    31
yeah             6 (.89)     6 (-.65)    24
Column totals    30          56          172

Table 3: Use of backchannel tokens by Instruction Giver vs. Instruction Follower (word pause preceding action pause, including standardised residuals)


Token            IG           IF          Row totals
ok               14 (-.34)    23 (.29)    37
yeah             3 (1.0)      1 (-.88)    4
Column totals    17           24          82

Table 4: Use of backchannel tokens by Instruction Giver vs. Instruction Follower (action pause preceding word pause, including standardised residuals)

3.3.3. Study 4

Our reason for focusing here on the interjections "um" vs. "uh" is that, while their interaction with linguistic pauses is established ([11]), with "um" projecting a longer pause in words than "uh", it would be interesting to consider both in the context of actional pauses. Indeed, we were able to confirm the distinction in projected word duration for those interjections overall, with two-sample t-test results t(147) = 3.49, p < .01, indicating that the null hypothesis of no distinction between word duration after different interjections can be rejected (word duration after "um" having M = .56, SD = .68 seconds; "uh" having M = .30, SD = .39 seconds). However, testing the distinction between subsequent word durations following these interjections that occur during action pauses, we obtained two-sample t-test results t(28) = 1.76, p = .08, indicating that the same null hypothesis cannot be rejected in the context of action pauses (word duration after "um" having M = .78, SD = .76 seconds; "uh" having M = .48, SD = .55 seconds).8
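The comparison reported here amounts to a two-sample t-test on word-pause durations grouped by the preceding interjection; a minimal sketch (Python/scipy, with placeholder duration arrays rather than the SCARE measurements) would be:

# Two-sample t-test on the duration of the word pause following each interjection.
import numpy as np
from scipy.stats import ttest_ind

after_um = np.array([0.9, 0.4, 0.7, 1.2, 0.3])   # placeholder durations in seconds
after_uh = np.array([0.2, 0.5, 0.3, 0.4, 0.1])

t, p = ttest_ind(after_um, after_uh)
print(f"t = {t:.2f}, p = {p:.3f}, um mean = {after_um.mean():.2f}, uh mean = {after_uh.mean():.2f}")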

4. Summary and future work

This paper reports results of what are essentially proof-of-concept studies, presenting a novel consideration of feedback across modalities, and thereby demonstrating the viability of our method for investigating situated dialogue. Our results show, first, that in the context of inactivity by an instruction follower, a range of forms of feedback become available for use, and that indeed the use of backchannels is dependent on role: instruction givers are far more likely to use "ok" in such contexts than instruction followers, while instruction followers are far more likely to say "mhm". However, during complete silence and inactivity, the use of specific backchannels becomes independent of speaker role. Finally, employing our approach, a tentative initial analysis suggests that an established distinction between the backchannels "um" and "uh" may vanish in the context of action pauses.

8 However, a mixed effects account of this data is possible, with action pauses as fixed factors, and subjects and interjections as random factors. A preliminary ANOVA on by-subject vs. by-item means of word duration suggests a significant effect for action pauses for the by-subjects analysis (F(1,26)=6.27, p<.05), but not for the by-items analysis. Typically, significance in both analyses is required to show overall significance; we are further examining this line of inquiry.

For future work, we will broaden our investigation into this corpus, outside of the narrow range of pauses in instruction follower activity. For example, an important factor in the use of backchannels which we are planning to look at is their relationship with intonation contour ([12]), but also in the context of instruction follower inactivity. Further, given the highly adaptable means whereby virtual domains can be installed on mobile devices such as laptops, we are currently planning an Arabic version of the SCARE corpus with the aim of cross-linguistic investigation across modalities of the kind of phenomena explored by [12] and others. Finally, a key aim of our work is to develop from such studies more natural and effective generation of listening behaviour on the part of artificial instruction following agents.

5. Acknowledgements

Many thanks to the “Feedback behaviours in dialogue” workshop organisers for their efforts. Background to this paper is ongoing research with Magda Wolska (Saarland University) on using virtual worlds to investigate dialogue. Of course, all errors, etc., here remain my own.

6. References

[1] Anna Danielewicz-Betz, "Silence and pauses in discourse and music", PhD thesis, School of Philosophy, Saarland University, 1998.
[2] Patrick Ye, "Natural language understanding in controlled virtual environments", PhD thesis, Department of Computer Science and Software Engineering, The University of Melbourne, 2009.
[3] Laura Stoia, Darla Magdalene Shockley, Donna K. Byron and Eric Fosler-Lussier, "SCARE: A Situated Corpus with Annotated Referring Expressions", Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), 2008.
[4] Andrew Gargett, Konstantina Garoufi, Alexander Koller and Kristina Striegnitz, "The GIVE-2 Corpus of Giving Instructions in Virtual Environments", Proceedings of the 7th International Language Resources and Evaluation Conference (LREC 2010), 2010.
[5] Mattias Heldner and Jens Edlund, "Pauses, gaps and overlaps in conversations", Journal of Phonetics, 38:555-568, 2010.
[6] Brigitte Zellner, "Pauses and the Temporal Structure of Speech", in E. Keller (Ed.), Fundamentals of speech synthesis and speech recognition, Chichester: John Wiley, pp. 41-62, 1994.
[7] Jean Carletta, Stefan Evert, Jonathan Kilgour, Craig Nicol, Dennis Reidsma, Judy Robertson and Holger Voormann, "Documentation for the NITE XML Toolkit", Online: http://groups.inf.ed.ac.uk/nxt/documentation.shtml.
[8] Harvey Sacks, Emanuel Schegloff and Gail Jefferson, "A simplest systematics for the organization of turn-taking for conversation", Language, 50:696-735, 1974.
[9] Stefan Benus, Agustín Gravano and Julia Hirschberg, "The prosody of backchannels in American English", in ICPhS, 2007.
[10] Louis ten Bosch, Nelleke Oostdijk and L. Boves, "On temporal aspects of turn taking in conversational dialogues", Speech Communication, 47:80-86, 2005.
[11] Herbert H. Clark and Jean E. Fox Tree, "Using uh and um in spontaneous speaking", Cognition, 84:73-111, 2002.
[12] Nigel G. Ward, Rafael Escalante, Yaffa Al Bayyari and Thamar Solorio, "Learning to Show You're Listening", Computer Assisted Language Learning, 20:385-407, 2007.


Listener's responses during storytelling in French conversation

Mathilde Guardiola1, Roxane Bertrand1, Robert Espesser1, Stéphane Rauzy1

1Laboratoire Parole et Langage, Université Aix-Marseille, Aix-en-Provence, France

[email protected], [email protected], [email protected], [email protected]

Abstract

This study concerns the evolution of listeners' productions during narratives in a French conversational corpus. Using the method of Conversation Analysis in the first part of the study, we show that listeners use different discursive devices throughout the narrative. In the second part, we attempt to estimate this behavior in a systematic way by measuring the richness of morphosyntactic categories. We confirm the presence of specific discursive devices such as repetition or reported speech produced by listeners at the end of the narrative, while only a slight tendency is observed concerning the increasing density of the richer morphosyntactic categories.

Index Terms: back-channel, conversation analysis, storytelling, French, convergence.

1. Introduction

Interactional co-construction in human-human conversation is minimally based on the production of simple backchannel signals, i.e. short utterances produced by the listener to signal sustained attention to the speaker while the latter is talking [1]. This cooperative process is a necessary requirement for a successful interaction [2]. We claim that at some points of the interaction, co-construction can progressively evolve into real alignment, especially thanks to the production of responses specifically adapted to that point of the interaction. Ratification of the response by the main speaker (by repeating it or integrating it into his own discourse) can lead to a particular sequence of high interactional convergence [3], but this is beyond the scope of this paper.

Our study focuses on storytelling, which is a very frequent activity in conversation. Many studies have shown the role of backchannel signals in turn-taking organization, such as [4] and [5] among others. However, to our knowledge, few studies have examined the role of the listener in storytelling [6], and fewer still in a systematic way, such as [7]. In this latter study, the authors showed under experimental conditions that listeners become co-narrators and thereby improve the quality of the story by using adapted responses. These responses are what they called generic and specific responses. Generic responses correspond to simple backchannel signals (such as mh, yeah, ok, and so on), while specific responses are specific to the current narrative and cannot be produced in another context. To produce specific responses, listeners need to have enough information about the situation described. This means that they depend on the state of shared knowledge. Since this shared knowledge increases throughout the narrative, listeners produce more specific responses as the narrative progresses.

The present study aims to confirm these results on more conversational data in French. In a first stage, we conduct a sequential analysis, inspired by the Conversation Analysis framework [8], of the forms and functions of the listener responses throughout the narrative. More particularly, we observe the typical responses produced around the end of the narrative. In a second stage, we attempt to show in a more systematic way that these specific responses are indeed produced later than generic ones. We hypothesize that the different types of responses throughout the narratives could be reflected by measuring the richness of the morphosyntactic categories produced by listeners.

2. Corpus & Methodology

In this study, we considered a subset of the Corpus of Interactional Data (CID) [9], i.e. two one-hour long French-speaking dialogues, involving two male participants for the first one (AG-YM) and two female participants for the other one (AB-CM). In these interactions, participants were told to tell unusual stories. This instruction makes storytelling a privileged activity in the corpus. Although the setting was experimental (recordings were made in an anechoic sound-proof room to obtain high-quality speech), the interactions are spontaneous (unprepared speech). The participants were not given a particular role in the interaction a priori, and they managed the turn-taking organization themselves, so that the interactions are very similar to ordinary conversation.

The Corpus of Interactional Data has been annotated from a multi-level perspective (OTIM Project [10]). All the annotations have been aligned to the signal. Among others, narratives have been identified. In the present preliminary study, we use only the morphosyntactic [11] and narrative levels.

We conduct a study combining a double approach (qualitative and quantitative analysis). On the one hand, the qualitative analysis consists of a sequential analysis of the interaction. On the other hand, the quantitative analysis is in line with corpus linguistics and computational approaches (see [12] for a similar approach to prosody and feedback). We attempt to measure the production of listener responses during the storytelling in a more systematic way, by using a weight that reflects the richness of morphosyntactic categories.

3. Qualitative analysis: generic and specific responses to the narrative

A precise sequential analysis has highlighted that during a narrative the listener produces responses which have different functions in the interaction (continuers, confirmation requests, and assessments, among others). These appropriate responses can be generic or specific [13]. Generic responses can be produced in any narrative ("ouais/yeah", "mh", laughter, etc.). Specific ones are specific to the current narrative; they consist of several devices: questions, other-repetitions, reformulations, comments and completions of the narrative. Within a study of interactional convergence in general, we showed that some particular phenomena (such as direct reported speech and other-repetition) appear mostly in the middle or at the end of the narrative, when a certain common ground is already established [14].

Most of the specific responses appear in the latest phases of the narrative, often provoked by the apex (culminative point) of the narrative, according to the formal model of narrative based on [15].

Among specific responses, other-repetitions are used by the listener in order to show his participation in the interaction. More particularly, in example 1, this repetition has a savoring function as defined by [16]. By repeating, the listener shows his appreciation of what has been previously said by the narrator:

Example 1.
AP et je galérais un peu sur la sur le bouchon
AP et si j'étais là je bloquais un peu sur la table
AP et je vois une activité animale sur la table
LJ @ une activité animale
AP tu sais j'ai vu enfin
AP dans mon champ visuel y a eu quelque chose tu vois ça s'est mis à bouger oh

AP and it was hard with the cork
AP and if I was like this I was looking at the table
AP and I see an animal activity on the table
LJ @ an animal activity
AP you know I saw
AP in my field of vision there was something you see it started to move oh

In example 2, AB is the narrator and the listener CM produces a complex back-channel [17]: "oh fuck, excellent". Since the shared knowledge is already constituted, and the narrative is almost finished, CM is able to produce a very specific response to the narrative, showing her understanding of the situation described. She can even produce direct reported speech as a completion of the narrative (in echo [14]), and punctually takes the place of the main speaker:

Example 2.
AB il était à moitié allongé par terre a(v)ec sa jambe comme ça en disant oh j'ai mal j'ai mal j'ai mal
CM oh p(u)tain excellent
AB on a dit on s'en fout on se barre et tout a(l)ors il a quand même réussi
CM tu peux crever

AB he was half lying on the floor with his leg like this saying oh it hurts it hurts it hurts +
CM oh fuck excellent
AB we said we don't care we leave and so on so he still managed to
CM you can die

This type of response (a subtype of completion) has not yet been described as reported speech in other studies [14]. We would consider such responses as complex specific back-channels, since they are strongly adapted to the precise situation.

Conversely, at the beginning of the narrative, listeners can nevertheless produce a type of response similar to what [7] consider a specific response. The latter corresponds mostly to a confirmation or clarification request (about a character, a place or an event involved in the narrative) that precisely helps participants to build the common ground. Example 3 is an illustration of this case:

Example 3.
CM on était complètement euh complètement désynchronisés en fait ouais c'est ça de euh
AB et c'était quoi c'était un bar avec euh un un écran ou
CM c'était un bar mais tu sais un truc vachement moderne alors que tu es dans euh un patelin euh complètement euh

CM we were totally er totally desynchronized in fact yeah that's it er
AB and what was it it was a pub with a a screen or
CM it was a bar but you know something really modern whereas it was in er a small village er totally

4. Quantitative analysis: morphosyntactic richness of responses

We analyzed two dialogs (2 female speakers, 2 male speakers), with a total of 50 narratives. Each participant produced between 9 and 16 narratives.

Figure 1 shows the narrative durations according to their rank of appearance for each speaker.

Figure 1: duration of narratives for each speaker


The quantitative analysis allows us to check the systematicity of our manual observations. For the quantitative analysis, we consider every response from the participant during the progress of the narrative. For this purpose, we take into account every token produced by the listener during a narrative by a main speaker. Time was normalized by the total length of each narrative, so that time is expressed in fractions of the total length of the narrative in which the token appears. Consequently, we can compare the temporal localization of tokens across the narratives. A token consists of a word or a vocal signal (excluding laughter, since we cannot assign a morphosyntactic category to laughter, which is a meta-communicative phenomenon [6]). There were 2534 tokens.
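As an illustration, a listener token's start time can be normalized to the length of the narrative it occurs in as follows (a minimal Python sketch with hypothetical tuples for the annotation data, not the authors' OTIM/NXT pipeline):

# Hypothetical data: narrative boundaries and listener tokens, in seconds.
narratives = {"narr_01": (12.0, 95.5)}                       # narrative_id -> (start, end)
tokens = [("narr_01", "ouais", 30.2), ("narr_01", "ah ouais", 80.0)]

def normalized_time(narrative_id, token_start):
    """Express a token's start time as a fraction of its narrative's duration."""
    start, end = narratives[narrative_id]
    return (token_start - start) / (end - start)

for narr_id, word, t in tokens:
    print(word, round(normalized_time(narr_id, t), 3))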

For the other types of tokens, we consider their morphosyntactic category. The CID morphosyntactic annotations were obtained in two steps. In a first stage, the enriched orthographic transcription (adopted in the OTIM project) was filtered of information to which we cannot assign a morphosyntactic category, such as laughter or disfluencies, in order to form the input for a modified version of the syntactic parser for written French text StP1 [11]. StP1 has been modified in order to account for the specificities of speech analysis.

Two levels of hierarchy were introduced in the syntactic treatment, corresponding to the strong punctuation marks (such as the full stop or exclamation mark) and the weak or soft punctuation marks (such as the comma) that can be found in written text. Lexical entries have also been modified for words playing specific functions in speech in interaction, such as vocal back-channels, discourse markers, etc. We found it convenient to label these tokens as interjections.

In the second stage, the output of the parser was manually corrected for the totality of the Corpus of Interactional Data. Morphosyntactic information is given following the Multext tagset features, which contain ten main categories (determiner, adjective, noun, pronoun, preposition, conjunction, auxiliary, verb, adverb and interjection). To account for morphosyntactic richness, we assigned a "weight" to each token, depending on its grammatical category.

Table 1. Number of tokens by category for the 2 dialogs.

Weight          Morpho-syntactic category
Low (973)       Interjection (973)
Medium (797)    Auxiliary (37), Conjunction (126), Determiner (132), Preposition (109), Pronoun (393)
High (764)      Adjective (82), Adverb (200), Noun (167), Verb (315)

We assume that the "low" category corresponds to the simplest responses, whereas the "high" categories are used in the morphologically richest responses.
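For concreteness, the weighting in Table 1 amounts to a simple mapping from Multext main categories to weight levels; the following is an illustrative sketch (Python, with hypothetical tag spellings), not the authors' implementation:

# Map Multext main categories to the three weight levels of Table 1 (sketch).
WEIGHT_BY_CATEGORY = {
    "interjection": "low",
    "auxiliary": "medium", "conjunction": "medium", "determiner": "medium",
    "preposition": "medium", "pronoun": "medium",
    "adjective": "high", "adverb": "high", "noun": "high", "verb": "high",
}

def token_weight(pos_tag):
    """Return the morphosyntactic weight ('low', 'medium' or 'high') of a token."""
    return WEIGHT_BY_CATEGORY[pos_tag.lower()]

print(token_weight("Verb"))   # -> 'high'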

In a first step, Figure 2 shows the production of tokens for each weight by the 4 pooled speakers, as narrators (top) and as listeners (bottom), according to the normalized narrative time.

Figure 2: Production of tokens for each weight by the 4 speakers

As expected, the production was roughly stable for narrators, with high counts for the medium and high weights. On the contrary, the listeners' production increased throughout the narrative. This is in line with the previous section, this increase reflecting the production of more complex or specific responses.

Figure 3 presents the boxplots of the start time of every token for each listener, according to their syntactic weight. The distribution of start times seems to be ordered from earlier to later times with increasing syntactic weight.

Figure 3: boxplots of start times of every token for each listener, according to their syntactic weight.


A preliminary statistical analysis was run, based on a linear mixed model (package lme4 [18], R software [19]). The dependent variable was the normalized start time of the tokens; the predictors were the syntactic weight (as a 3-level ordered factor), the speaker (as a 4-level factor) and their interaction. A random intercept was added to account for the variability across the 50 combinations of speakers and narratives. There were 2534 tokens involved. The results show a small but significant time shift, i.e. the higher the morphosyntactic weight, the later the start time.
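The authors fit this model with lme4 in R; purely to illustrate the model structure, here is a rough Python analogue using statsmodels, with synthetic stand-in data and the weight treated as a plain categorical factor rather than an ordered one:

# Rough analogue of the reported model: normalized start time predicted by
# syntactic weight, speaker and their interaction, with a random intercept
# per speaker-by-narrative combination (synthetic data, not the CID tokens).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
weights = rng.choice(["low", "medium", "high"], size=n)
speakers = rng.choice(["AB", "CM", "AG", "YM"], size=n)
narratives = rng.integers(1, 14, size=n)
shift = np.where(weights == "low", 0.0, np.where(weights == "medium", 0.05, 0.10))

tokens = pd.DataFrame({
    "norm_start": np.clip(rng.uniform(0, 1, size=n) * 0.8 + shift, 0, 1),
    "weight": weights,
    "speaker": speakers,
    "group": [f"{s}_{m}" for s, m in zip(speakers, narratives)],   # speaker x narrative
})

model = smf.mixedlm("norm_start ~ C(weight) * C(speaker)", data=tokens, groups=tokens["group"])
print(model.fit().summary())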

Nevertheless, the results were not robust enough to firmly establish the effect. As expected, the shift effect was more robust when only two morphosyntactic weight categories were defined instead of three (i.e. interjection vs. other).

The durations of the narratives seem to play an important role: the three longest narratives in the dialogue AB-CM accounted for 41% of the 1748 tokens of this dialogue, and these 3 narratives appear responsible for the presence of the shift effect. For the other dialogue (AG-YM) the results were more robust. This effect is restricted to these 4 subjects, and the effect size is small. This weakness can be attributed to (a) the nature of the measurement, in which morphosyntactic category is a weak measure of syntactic richness, and (b) the automatic procedures of time alignment and morphosyntactic tagging leading to errors.

5. Discussion & Conclusion

This study suggests that listener responses become more complex, at both the discursive and morphosyntactic levels, throughout the narrative in French conversational dialogs. A qualitative sequential analysis shows that the listener produces discursively more complex responses at the end of the narrative. Conversely, in the first parts of the narrative, listeners produce mostly simple responses such as backchannels, which give way to more specific types of responses such as reported speech or other-repetitions (among others). A quantitative, systematic analysis of responses suggests that the morphosyntactic richness of the listeners' responses increases during the narrative. Results show only a slight tendency, but we could improve this richness measurement by taking into account only some of the formal phases of the narrative. Indeed, we have shown in example 3 that confirmation requests can appear in the first phases of narratives, in order to improve the shared knowledge between participants. This kind of production (rich in terms of morphosyntactic categories) at the beginning of narratives could explain why the tendency is so tenuous. The next step will be to improve the measurement by taking this point into account. For instance, we will remove some phases, such as the "parenthesis" phases of the narrative, which lead to digressions in the narrative structure (as defined by [15]) but are very frequent in conversational data, in which participants can more easily take the floor (even during narratives). Then we will use the available multi-level annotations in order to better take into account the evolution of listeners' responses within the formal phases (orientation, complication, evaluation/resolution) of the narrative, and also to investigate prosodic cues (intensity crescendo). Finally, we plan to increase the number of speakers and narratives analyzed, while making the syntactic criteria more accurate.

6. References

[1] Schegloff, E. 1982. Discourse as an interactional achievement: some uses of "uh huh" and other things that come between sequences, Text and Talk, 71-93.
[2] Clark, H.H. 1996. Using Language, Cambridge: Cambridge University Press.
[3] Guardiola, M. in progress. "Contribution multimodale à l'étude de phénomènes de convergence dans l'interaction", PhD dissertation, Aix-Marseille Université, France.
[4] Ward, N. & Tsukahara, W. 2000. Prosodic features which cue backchannel responses in English and Japanese, Journal of Pragmatics, 32: 1177-1207.
[5] Gravano, A., Hirschberg, J., Benus, S. 2012. Affirmative cue words in task-oriented dialogue, Computational Linguistics, 38(1): 1-39.
[6] Þórunn, B. 2005. Feedback in Conversational Storytelling, in Feedback in Spoken Interaction, Nordtalk, Gothenburg Papers in Theoretical Linguistics, 1-17.
[7] Bavelas, J. B., Coates, L., & Johnson, T. 2000. "Listeners as co-narrators", Journal of Personality and Social Psychology, 79, 941-952.
[8] Couper-Kuhlen, E., Selting, M. 1996. Prosody in conversation, Cambridge University Press.
[9] Bertrand, R., Blache, P., Espesser, R., Ferré, G., Meunier, C., Priego-Valverde, B., Rauzy, S. 2008. "Le CID - Corpus of Interactional Data - Annotation et Exploitation Multimodale de Parole Conversationnelle", Traitement Automatique des Langues, 49(3), 105-134.
[10] Blache, P., Bertrand, R., Ferré, G. 2009. Creating and Exploiting Multimodal Annotated Corpora: The ToMA Project, in M. Kipp et al. (Eds.), Multimodal Corpora, From Models of Natural Interaction to Systems and Applications, Springer-Verlag, Berlin, Heidelberg, 38-53.
[11] Blache, P., Rauzy, S. 2008. Influence de la qualité de l'étiquetage sur le chunking : une corrélation dépendant de la taille des chunks, Proceedings of the TALN conference, 290-299, Avignon, France.
[12] Benus, S., Gravano, A., Hirschberg, J. 2011. Pragmatic aspects of temporal accommodation in turn-taking, Journal of Pragmatics, 43, 3001-3027.
[13] Bertrand, R., Priego-Valverde, B., Guardiola, M. 2010. The prosodic cues of humorous reported speech in conversation, AAAL Conference, Atlanta, USA.
[14] Guardiola, M., Bertrand, R. 2011. Mise en évidence de discours rapportés « en écho » dans la conversation, Actes Rencontres Jeunes Chercheurs en Parole, Grenoble, France.
[15] Labov, W., Waletzky, J. 1966. Narrative analysis: oral versions of personal experience, in J. Helm (ed.), Essays on the verbal and visual arts: Proceedings of the Annual Spring Meeting of the American Ethnological Society, Seattle: University of Washington Press, 12-44.
[16] Tannen, D. 1989/2007. Talking Voices: Repetition, Dialogue, and Imagery in Conversational Discourse, Cambridge University Press, Cambridge.
[17] Laforest, M. 1992. Le back-channel en situation d'entrevue. Québec: CIRAL/Recherches sociolinguistiques, 2.
[18] Bates, D., Maechler, M., Bolker, B. 2011. lme4: Linear mixed-effects models using S4 classes. R package version 0.999375-42. http://CRAN.R-project.org/package=lme4.
[19] R Development Core Team. 2011. R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, http://www.R-project.org/.


Crowdsourcing Backchannel Feedback: Understanding the Individual Variability from the Crowds

Lixing Huang, Jonathan Gratch

Institute for Creative Technologies, University of Southern California 12015 Waterfront Drive, Playa Vista, California, 90094

[email protected], [email protected]

Abstract

During conversation, listeners often provide so-called backchannel feedback (e.g., nods and filled pauses) during their partner's speech, and these behaviors serve important interactional functions. For example, the presence of backchannels has been shown to cause increased rapport, speech fluency and speech intimacy, even when produced by computer-generated listeners. Prior work by us and others has shown that specific acoustic and visual features predict when backchannels are likely to occur, but there is also considerable individual variability not explained by such models. Here we explore a data collection framework known as Parasocial Consensus Sampling (PCS) to examine and characterize some of this individual variability. Our results indicate that common personality traits can capture much of this variability. This suggests we can build models that capture individual differences in backchannel "style" and possibly identify individual traits from observations of backchannel behavior.

Index Terms: backchannel, crowdsourcing

1. Introduction

Face-to-face interaction is a cooperative process. While the speaker is talking, the listener's nonverbal reactions, such as head nods, paraverbals and facial expressions, also provide moment-to-moment feedback that can alter and serve to co-construct subsequent speech. These nonverbal behaviors are called backchannels, and they play an important role in efficient social interactions [1,7,8].

Several research efforts have attempted to study and model the characteristics of a speaker's behavior that predict listener backchannel feedback [3-5,9]. Increasingly, these efforts employ data-driven approaches that automatically learn such models from large amounts of annotated face-to-face interaction data [3,9]. Although face-to-face interaction data is traditionally considered the gold standard, it presents several drawbacks. First, there is considerable variability in human behavior, and not all human data should be considered as positive examples of the behavior that we want to model. For example, if we want to find out how backchannel feedback helps establish the feeling of rapport, it is important to realize that many face-to-face interactions fail in this regard. Ideally, such data should be separated into good and bad instances of the target behavior, but it's not obvious how to make this separation. Second, face-to-face interactions are co-constructed in that the behaviors of individuals not only depend on their own specific characteristics, but also on their contingent reactions to the behavior of the other party in the interaction. For example, even in a monolog, a speaker will often attend to the reactions of his listeners and adjust his behavior accordingly [7]. This mutually contingent nature of social interactions amplifies the underlying variability of human behavior but also makes it difficult to tease apart causality (i.e. is this person a non-engaging speaker, or is he reacting to a disengaged listener). These issues are not insurmountable, but they imply that we need to collect large amounts of data to surmount them, which brings us to the third problem: the traditional way of recording face-to-face interaction data is expensive and time-consuming. It can take months to recruit participants, followed by an extensive period of recording and annotating the data.

To address these issues, we have previously proposed a data collection technology called Parasocial Consensus Sampling (PCS) [5]. It is inspired by the theory of parasocial interaction introduced by Horton and Wohl [2], who argued that people exhibit a natural tendency to interact with media representations (e.g. video recordings) of people as if they were interacting with the actual person face-to-face. The basic idea of PCS is to have multiple independent participants experience the same social situation parasocially (i.e. act "as if" they were in a real dyadic interaction) in order to gain insight into the typicality (i.e. consensus view) of how individuals would behave within face-to-face interactions. This approach helps address the three issues of traditional face-to-face interaction data.

Variability: By having different people experience the same social situation, we can aggregate their behaviors and estimate the probability of each possible behavior from the consensus view, which represents how likely the behavior is to happen.

Contingency: The use of media representations of people breaks down the contingency in face-to-face interaction, because we hold the behavior of one participant constant (e.g. a pre-recorded speaker cannot react to listener feedback). This can help to unpack the bidirectional causal influences that naturally occur in conversations, but it might destroy the very phenomenon we wish to study (i.e., by preventing speakers from reacting to listener feedback, it might change the nature of this feedback). Fortunately, this hasn't occurred in practice, at least for modeling backchannel and turn taking behaviors, as the learned models are similar to those learned with face-to-face data and produce similar social effects when used to drive conversational behaviors with virtual humans [5,6].

Efficiency: By using media representations, we can parallelize the data collection process (i.e. different people interact with the same social situation simultaneously), and the parasocial interaction also enables us to use a more efficient way to measure human behavior. For example, in a previous study [12], 9 participants interacted with 45 speaker videos parasocially in just one day. They were asked to press a button whenever they felt like giving backchannel feedback, so that we could record the time automatically instead of having coders annotate when the backchannel feedback happened.

In this paper, we apply Parasocial Consensus Sampling on a much larger scale. Specifically, we crowdsource backchannel feedback using Amazon Mechanical Turk (AMT). This allows us to collect hundreds of responses to each video in a much faster and less expensive way, compared with traditional approaches. Due to the large amount of data we have, it is now possible to analyze and explain individual variability in backchannel feedback. The following section describes the data collection procedure and the visualization tool for exploring the crowdsourcing dataset. Section 3 illustrates the variability in backchannel production and presents our preliminary results in explaining such variability. Section 4 concludes the paper.

2. Data Collection and Visualization.

2.1. Data Collection

Amazon Mechanical Turk is a web service created by Amazon, which provides a marketplace for those (i.e. requesters) who want to post tasks to be completed and specify prices for completing them. The idea is to utilize people's (i.e. workers') small chunks of time, typically from 5 minutes to 1 hour, to finish trivial tasks, such as image tagging. The price of each task is often on the order of a few cents. Therefore, it is possible to have many workers repeat the same task. Although the individual worker is usually not an expert at the task, one can often achieve expert-level results by relying on the wisdom of crowds [10].

We implemented a web application and integrated it with AMT. Workers can find our tasks on the marketplace and follow a link to our website. First, they finish a 90-item questionnaire that assesses several individual differences that we expect to influence backchannel behavior (listed in Section 3.3). They next watch an example video illustrating the process of interacting with a human speaker parasocially. Next, they watch 8 videos in sequence, each about 2 to 3 minutes long. Each video features a human speaker telling two stories. Coders are instructed to pretend that they are very interested in what the speaker says. Whenever they think it is a good time to provide feedback, such as nodding or uttering "uh-hum", to the speaker, they press a button. The timestamp of each press is recorded and sent to our server using JavaScript. After interacting with each video, coders answer a 6-item questionnaire regarding their parasocial experience [11]. At the end of the task, they leave comments about the study. Coders need to finish the study within 90 minutes in order to get paid, and we pay 4 dollars for their work. Following this procedure, we initially constructed a dataset of 350 coders providing backchannel data for 8 videos. To better understand speaker variability, we subsequently coded an additional 16 videos (in two rounds) using 100 coders each. For the analysis that follows, we collapse these three data collections into a single dataset of 24 videos.

2.2. Visualization

For each video, we have N sets of parasocial responses (T1, T2, ..., TN) from N coders. Each set of parasocial responses Ti contains the timestamps Ti = {t1, t2, ...} representing when the coder gave a response. Each timestamp can be viewed as a window of opportunity where backchannel feedback is likely. We create a one-second time window centered around each timestamp, and the timeline is sampled at a frame rate of 10 Hz. In this way, we convert the timestamps {t1, t2, ...} into time series data, as shown in Figure 1, where 1 indicates that backchannel feedback occurs and 0 indicates that no backchannel feedback occurs.

Figure 1: Example timeline of feedback from a single coder. Each line indicates a point where backchannel feedback occurs.

A consensus view is created by summing together the data from each coder for a specific video. The result is equivalent to a histogram that indicates how many coders felt a particular point in the speech merited feedback. This is illustrated in Figure 2. Peaks in the consensus view indicate time points where there is high agreement for providing feedback.
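As a concrete illustration of this conversion and of the consensus view, here is a minimal sketch (Python/numpy, with hypothetical press times; not the authors' implementation):

# Turn button-press timestamps into 10 Hz binary time series, then sum the
# series over coders to obtain the consensus view for one video.
import numpy as np

FRAME_RATE = 10          # frames per second
HALF_WINDOW = 0.5        # one-second window centered on each press

def to_time_series(press_times, video_duration):
    """Return a 0/1 array sampled at 10 Hz, with 1 inside any feedback window."""
    n_frames = int(video_duration * FRAME_RATE)
    series = np.zeros(n_frames, dtype=int)
    for t in press_times:
        start = max(0, int((t - HALF_WINDOW) * FRAME_RATE))
        end = min(n_frames, int((t + HALF_WINDOW) * FRAME_RATE))
        series[start:end] = 1
    return series

# Hypothetical data: three coders responding to a 120-second video.
coders = [[3.2, 10.5, 44.0], [3.4, 45.1], [10.2, 44.3, 90.0]]
consensus = sum(to_time_series(presses, 120.0) for presses in coders)
print(consensus.max(), "coders agree at the highest peak")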

Figure 2 also shows a visualization tool that helps explore the PCS data. By selecting a video ID from the video table (Part 4), the corresponding video (Part 1) and coders (Part 5) show up; the consensus view (Part 2) of that video will be computed and also shown. By selecting a coder ID from the coder table (Part 5), the parasocial response of that coder (as shown in Figure 1) will show up; if multiple coders are selected, a histogram will be computed using the responses from those coders (Part 3). As described in Section 2.1, we measure several personality attributes of each coder using standard questionnaires. By selecting an attribute (Part 7), the coder table (Part 5) will be populated with the corresponding values of all coders, and a histogram (Part 6) will be displayed, indicating the distribution of coders along the selected attribute. This helps us group coders and investigate how individual differences (e.g. personality traits) affect backchannel feedback. As the video plays, a timeline (the red vertical line) moves correspondingly so that we can compare the consensus view with the human speaker's behavior.

Figure 2: The interface of the visualization tool. (2) represents the consensus of all coders for video (1), and the two histograms below (2) represent the consensus of two subgroups of coders: those that are least and most agreeable, respectively.

3. Analysis of Backchannel Consensus Data

3.1. Variability

PCS provides a unique tool to characterize some of the observed variability in backchannel production. Given that many coders are responding to the same speaker, we can ask whether this variance arises from characteristics of speakers or aspects of listeners. At one extreme, all the variance might reside in the listener: for example, coders might be providing feedback or their behavior could be governed by individually varying personality traits. If so, we should expect a uniform distribution of feedback across the speaker's story. At the other extreme, feedback might be solely determined by characteristics of the speaker, in which case we should expect perfect consensus amongst coders but variance across speakers. Figure 2 (which illustrates the consensus view) indicates that the answer is somewhere in the middle: there are many points of high coder agreement but also considerable variability within and across speakers.

A quick examination of this data suggests that peaks in the consensus data correspond to speaker behaviors that have previously been suggested as backchannel elicitors. For example, peaks often co-occur with speaker non-filled pauses. Peaks also correspond to semantically significant events in the story. For example, in Figure 2, the highest peak corresponds to a climactic moment in the narrative.

However, there is also considerable variability across coders. For example, Figure 3 gives a histogram illustrating the amount of feedback provided by different coders for a given video. Feedback varied from no feedback at all to 64 responses by the most prolific coder. We now turn to a more formal analysis of listener variability.

Figure 3: We group coders into 7 bins (0-9, 10-19, ..., 50-59, >60) according to the amount of feedback they provided. The x-axis represents the bins, and the y-axis represents the number of coders in the corresponding bin.

3.2. Speaker Nonverbal Features

Clearly, some of the variance in this data is driven by features of the speaker's verbal and nonverbal behavior. We leave verbal analysis to future work but here analyze which speaker nonverbal features trigger backchannel feedback. Our analysis is based on the frequency of co-occurrence between speaker features and listener backchannel feedback, for several features previously implicated in backchannel production. For each speaker feature, we have the start time t_s and end time t_e that have been labeled by human coders. For listener backchannel feedback, we record the time t_b when the coder pressed the button. We count it as a match if

t_s ≤ t_b ≤ t_e + window

that is, if the backchannel feedback occurs within the speaker feature or right after it, the feature is considered as triggering the feedback. Inspired by the idea of an encoding dictionary [9], we add the variable "window" to count cases where backchannel feedback is triggered by a speaker feature but with a certain delay. In this study, window is set to 500 ms.

For each coder, we count the co-occurrences between the backchannel feedback and each of the speaker features. If a speaker feature always co-occurs with backchannel feedback, it is considered an important feature that the coder relies on. We measure the importance of a feature as follows:

P = (# of co-occurrences) / (# of occurrences of the feature)

R = (# of co-occurrences) / (# of backchannel feedbacks)

The importance I is then calculated as the harmonic mean of P and R. By ranking the speaker features on the importance measure I, we find that 99% of coders depend on "pause" and "speaker eye gaze" to provide backchannel feedback, and 73% of coders depend on "pause", "speaker eye gaze" and "speaker head nod". The result suggests that coders use almost the same subset of speaker features to decide when to give feedback, which therefore cannot explain the individual variability in backchannel feedback.
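As an illustration of this matching rule and importance measure, here is a minimal sketch (Python, with hypothetical feature intervals and press times; not the authors' implementation):

# Feature importance as the harmonic mean of P (co-occurrences per feature
# occurrence) and R (co-occurrences per backchannel press); a press matches a
# feature if it falls inside the feature interval or within "window" seconds after it.
WINDOW = 0.5  # seconds

def importance(feature_intervals, press_times, window=WINDOW):
    matches = sum(
        any(ts <= tb <= te + window for (ts, te) in feature_intervals)
        for tb in press_times
    )
    if matches == 0:
        return 0.0
    p = matches / len(feature_intervals)
    r = matches / len(press_times)
    return 2 * p * r / (p + r)

# Hypothetical example: speaker pauses and one coder's button presses.
pauses = [(2.0, 2.6), (14.1, 14.9), (30.0, 30.4)]
presses = [2.4, 15.2, 40.0]
print(round(importance(pauses, presses), 2))   # -> 0.67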

3.3. Individual Differences and Backchannel Feedback

Only some of the observed variance can be explained by speaker behavior, and here we examine how personality traits might impact backchannel production. Table 1 lists several individual traits of coders that we are currently investigating.

Table 1. The attributes of each coder we measured in Section 2.1

Big Five Personality Traits: Extroversion, Agreeableness, Conscientiousness, Neuroticism, Openness
Self-Consciousness: Self-directed, Other-directed
Parasocial Experience: Parasocial experience scale [11]
Other: Shyness, Self-monitoring, Gender

Except gender, every other attribute is measured using standard psychometric scales. In this way, each coder can be characterized by a set of values. For each attribute, we group the coders whose values are the lowest into the low_group, and those with the highest values into the high_group1. We compute three numbers to represent each group as follows:

(1) For each video, a consensus is computed by using the data from all coders as described in Section 2.2;

(2) For each group, a histogram is computed by using the data from the coders in the corresponding group;

(3) We sum up the histogram computed in step (2) to get the total amount of backchannel feedback. The average amount of feedback is calculated by dividing the total by the number of coders in the corresponding group;

(4) We calculate the correlation coefficient between the consensus and the histogram of each group;

(5) We compute the entropy of the histogram of each group. This can be considered as a measurement of agreement among coders.

Finally, we have three numbers for each group: the average amount of feedback, the correlation coefficient, and the entropy. These three numbers are computed for every video. A t-test is used to find whether there is a significant difference between the high_group and the low_group. The results are summarized as follows.

1 We calculate the mean μ and standard deviation σ from all coders. The low_group has coders whose values are less than μ − σ, while the high_group has coders whose values are larger than μ + σ.
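To make the procedure concrete, here is a minimal sketch (Python with numpy/scipy and hypothetical arrays; not the authors' code) of the three per-group numbers and of the t-test applied across videos:

# Per-group statistics: average amount of feedback, correlation of the group
# histogram with the overall consensus, and entropy of the group histogram.
import numpy as np
from scipy.stats import entropy, pearsonr, ttest_ind

def group_stats(group_series, consensus):
    """group_series: list of per-coder 0/1 arrays; consensus: sum over all coders."""
    hist = np.sum(group_series, axis=0)
    avg_feedback = hist.sum() / len(group_series)   # average amount of feedback per coder
    corr = pearsonr(hist, consensus)[0]             # agreement with the overall consensus
    agreement = entropy(hist)                       # entropy of the group histogram
    return avg_feedback, corr, agreement

# Hypothetical per-video "average feedback" values for the high and low groups
# of one attribute; the same t-test is applied to the correlation and entropy.
high_group = np.array([190.7, 205.1, 181.3, 198.0])
low_group = np.array([166.6, 172.0, 158.9, 170.2])
print(ttest_ind(high_group, low_group))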


Table 2. The average amount of feedback

Attribute              high      low       t-test
Extroversion           173.51    172.61    p=0.88
Agreeableness          190.72    166.59    p<0.01
Conscientiousness      207.72    178.76    p<0.01
Neuroticism            180.12    182.02    p=0.67
Openness               203.58    171.52    p<0.01
Self-consciousness     210.17    173.71    p<0.01
Other-consciousness    205.76    171.04    p<0.05
Shyness                175.51    178.26    p=0.66
Self-monitor           206.03    166.97    p<0.01
PSI                    203.06    163.50    p<0.01
Gender (F/M)           182.71    181.73    p=0.69

Table 3. Correlation Coefficient

Attribute              high      low       t-test
Extroversion           0.90      0.91      p=0.42
Agreeableness          0.89      0.87      p<0.05
Conscientiousness      0.90      0.88      p<0.05
Neuroticism            0.91      0.89      p<0.01
Openness               0.91      0.89      p<0.05
Self-consciousness     0.88      0.89      p=0.18
Other-consciousness    0.88      0.85      p<0.01
Shyness                0.91      0.88      p<0.05
Self-monitor           0.87      0.89      p<0.01
PSI                    0.88      0.89      p<0.05
Gender (F/M)           0.98      0.97      p=0.08

Table 4. Entropy

Attribute              high      low       t-test
Extroversion           6.68      6.61      p<0.05
Agreeableness          6.65      6.68      p<0.05
Conscientiousness      6.74      6.73      p=0.62
Neuroticism            6.63      6.67      p=0.12
Openness               6.69      6.64      p=0.13
Self-consciousness     6.67      6.64      p=0.19
Other-consciousness    6.72      6.66      p<0.05
Shyness                6.61      6.60      p=0.48
Self-monitor           6.69      6.64      p<0.05
PSI                    6.72      6.61      p<0.01
Gender (F/M)           6.77      6.80      p<0.01

Table 2 shows the difference in the average amount of feedback between the high_group and the low_group for each attribute, Table 3 shows the difference in correlation coefficient, and Table 4 shows the difference in entropy. It is clear that individual differences have significant influences on backchannel feedback. For example, the high_group for agreeableness tends to provide more backchannel feedback than the low_group, and its coders have more agreement among each other than the coders in the low_group; the coders who have a good parasocial experience tend to provide more backchannel feedback but have less agreement than the coders who have a bad parasocial experience; and there is no significant difference between male and female coders, except that female coders tend to have more agreement.

3.4. Discussion

There is significant individual variability in listener backchannel feedback. However, from the feature analysis (Section 3.2), we find that coders depend on almost the same subset of speaker features to provide backchannel feedback, indicating that there may be consistent backchannel feedback rules. The correlation coefficient (≈0.9) between the histograms of the groups of coders and the consensus also suggests the same thing. Our preliminary analysis (Section 3.3) suggests that the reason underlying the significant individual variability may be individual differences among coders, such as personality traits and parasocial experience.

4. Conclusions

In this paper, we presented our work on analyzing individual variability in backchannel feedback. Under the Parasocial Consensus Sampling (PCS) framework, we applied crowdsourcing to collect, from the web, backchannel feedback from hundreds of "listeners" to one human speaker. The results showed that there is significant individual variability in backchannel feedback; however, people depend on almost the same subset of the human speaker's features to provide feedback. Our preliminary analysis suggests that the reason underlying such individual variability lies in individual differences among coders, such as personality traits and parasocial experience. Our work also demonstrates the advantage of the Parasocial Consensus Sampling framework: by breaking down the contingency of face-to-face interaction, it becomes possible to run analyses that cannot be done with traditional datasets.

5. References

[1] Tickle-Degnen, L. and Rosenthal, R., "The Nature of Rapport and its Nonverbal Correlates", Psychological Inquiry 1(4): 285-293, 1990.

[2] Horton, D. and Wohl, R.R. Mass communication and para-social interaction: Observation on intimacy at a distance. Psychiatry 19: 215-229, 1956.

[3] Ward, N. and Tsukahara, W. Prosodic features which cue back-channel responses in English and Japanese. Journal of Pragmatics, 32:1177-1207, 2000.

[4] de Kok, I.A. and Heylen, D.K.J. Appropriate and Inappropriate Timing of Listener Responses from Multiple Perspectives. Proceedings of 10th IVA, 248-254, 2011.

[5] Huang, L., Morency, L.-P. and Gratch, J. Parasocial Consensus Sampling: combining multiple perspectives to learn virtual human behavior. Proceedings of 9th AAMAS, 1265-1272, 2010.

[6] Huang, L., Morency, L.-P. and Gratch, J. A Multimodal end-of-turn Prediction Model: Learning from Parasocial Consensus Sampling. Proceedings of 10th AAMAS, 1289-1290, 2011.

[7] Bavelas, J.B., Coates, L. and Johnson, T. Listeners as co-narrators. Journal of Personality and Social Psychology, 79: 941-952, 2000.

[8] Bavelas, J.B. and Gerwing, J. The listener as addressee in face-to-face dialogue. International Journal of Listening, 2011.

[9] Morency, L.-P., de Kok, I. and Gratch, J. Predicting listener backchannels: A probabilistic multimodal approach. Proceedings of 8th IVA, 176-190, 2008.

[10] Surowiecki, J. The wisdom of crowds: Why the many are smarter than the few and how collective wisdom shapes business, economics, societies, and nations. Doubleday Books. 2004.

[11] Hartmann, T. and Goldhoorn, C. Horton and Wohl revisited: Ex-ploring viewer’s experience of parasocial interactions. Annual meeting of the International Communication Association. 2010.

[12] Huang, L., Morency, L.-P. and Gratch, J. Learning backchannel prediction model from parasocial consensus sampling: a subjective evaluation. Proceedings of 10th IVA, 159-172, 2010.



Can We Predict Who in the Audience will Ask What Kind of Questions with their Feedback Behaviors in Poster Conversation?

Tatsuya Kawahara, Takuma Iwatate, Takanori Tsuchiya, Katsuya Takanashi

School of Informatics, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan

Abstract

We investigate feedback behaviors in conversations in poster sessions, specifically whether it is possible to predict who in the audience will ask questions, and also what kind of questions. We focus on verbal backchannels and non-verbal noddings by the audience as well as joint eye-gaze events by the presenter and the audience. We first show how these patterns are correlated with turn-taking by the audience. Then, questions made by the audience are classified into two kinds: confirming questions and substantive questions. It is suggested that only verbal backchannels are useful for distinguishing them.

1. Introduction

Feedback behaviors are important cues in analyzing presentation-style conversations. We can guess whether the audience is attracted to the presentation by observing their feedback behaviors. This characteristic is more prominent when the audience is smaller; the audience can give not only non-verbal feedback such as nodding, but also verbal backchannels. We have been collecting and analyzing poster conversations, in which a researcher makes an academic presentation to a couple of persons using a poster. In our previous work [1], we demonstrated that non-lexical kinds of verbal backchannels, referred to as reactive tokens, are a good indicator of the audience's interest level.

Poster sessions have become a norm in many academic conventions because of their interactive characteristics. The audience can ask questions even during the presentation. We can also guess whether the presentation is understood or liked by the audience by observing the quantity and quality of their questions. It is also known that turn-taking behavior is related to backchannel and gaze patterns [2, 3, 4]. The goal of this work is to investigate the relationship of these feedback behaviors with turn-taking by the audience and also with the kind of questions they ask. We classify the audience's questions into two kinds: confirming questions and substantive questions. We expect that these analyses reveal how the audience appreciates the presentation and the quality of the poster conversation.

2. Multi-modal Corpus of Poster Conversations

We have recorded a number of poster conversations designed for multi-modal data collection [5]. In this study, we use four poster sessions, in which the presenters and audiences are different from each other. In each session, one presenter (labeled as "A") prepared a poster on his/her own academic research, and there was an audience of two persons (labeled as "B" and "C"), standing in front of the poster and listening to the presentation. They were not familiar with the presenter and had not heard the presentation before. The duration of each session was 20-30 minutes.

All speech data were segmented into IPUs (Inter-Pausal Units) with time and speaker labels, and transcribed according to the guidelines of the Corpus of Spontaneous Japanese (CSJ) [6]. We also manually annotated fillers and verbal backchannels.

The recording environment was equipped with multi-modal sensing devices such as cameras and a motion capturing system, and every participant wore an eye-tracking recorder and an accelerometer attached to a cap. Noddings are detected with the accelerometer. Eye-gaze information is derived from the eye-tracking recorder and the motion capturing system by matching the gaze vector against the positions of the other participants and the poster.

An overview of the session recording setup is given in Figure 1.

3. Relationship between Feedback Behaviors and Turn-Taking

First, we investigate statistics of eye-gaze and backchannel events and their relationship with turn-taking by the audience.

3.1. Duration of Eye-gaze

We identify the object of the eye-gaze of all participants at the end of the presenter's utterances. The target object can be either the poster or other participants. Then, we measure the duration of the eye-gaze within the segment of 2.5 seconds before the end of the presenter's utterances, because the majority of the IPUs are shorter than 2.5 seconds.



Figure 1: Overview of poster session recording

Table 1: Duration (sec.) of eye-gaze and its relationship with turn-taking

                turn held by    turn taken by audience
                presenter A     B         C
A gazed at B    0.220           0.589     0.299
A gazed at C    0.387           0.391     0.791
B gazed at A    0.161           0.205     0.078
C gazed at A    0.308           0.215     0.355

The durations are listed in Table 1 in relation to the turn-taking events. We can see that the presenter gazed significantly longer at the person to whom the turn was about to be yielded than in other cases. However, there is no significant difference in the duration of the audience's eye-gaze according to the turn-taking events.
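
For clarity, the following sketch (ours, not the authors' code) shows how the gaze duration within the 2.5-second window before the end of a presenter IPU can be accumulated per gaze target; the interval layout is an assumption about how the annotations are stored:

    def gaze_duration_before_ipu_end(gaze_intervals, ipu_end, window=2.5):
        # gaze_intervals: list of (target, start, end) in seconds;
        # returns seconds of gaze per target within [ipu_end - window, ipu_end]
        win_start = ipu_end - window
        durations = {}
        for target, start, end in gaze_intervals:
            overlap = min(end, ipu_end) - max(start, win_start)
            if overlap > 0:
                durations[target] = durations.get(target, 0.0) + overlap
        return durations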

3.2. Joint Eye-gaze Events

Next, we define joint eye-gaze events by the presenter and the audience as shown in Table 2. In this table, we use the notation "audience", but these events are actually defined for each person in the audience. Thus, "Ii" means mutual gaze between the presenter and a particular person in the audience, and "Pp" means joint attention to the poster.
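
The mapping of Table 2 can be made explicit with a small sketch (our illustration; the target labels are assumptions, not the corpus annotation scheme):

    def joint_gaze_event(presenter_target, audience_target):
        # The presenter gazes at the audience member (I) or the poster (P);
        # the audience member gazes at the presenter (i) or the poster (p).
        first = "I" if presenter_target == "audience" else "P"
        second = "i" if audience_target == "presenter" else "p"
        return first + second   # "Ii", "Ip", "Pi" or "Pp"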

Statistics of these events at the end of the presenter's utterances are summarized in Table 3. Here, the counts of the events are summed over the two persons in the audience. They are classified according to the turn-taking events, and turn-taking by the audience is classified into two cases: the person involved in the eye-gaze event actually took the turn (self), and the other person took the turn (other). The mutual gaze ("Ii") is expected to be related to turn-taking, but its frequency is not so high. The frequency of "Pi" is not high, either. The most potentially useful event is "Ip", in which the presenter gazes at the person in the audience before giving the turn.

Table 2: Definition of joint eye-gaze events by presenter and audience

                                 presenter gazes at
                                 audience (I)    poster (P)
audience     presenter (i)       Ii              Pi
gazes at     poster (p)          Ip              Pp

Table 3: Statistics of joint eye-gaze events by presenter and audience in relation with turn-taking

        #turn held by    #turn taken by audience       total
        presenter A      (self)        (other)
Ii      125              17            3               145
Ip      320              71            26              417
Pi      190              11            9               210
Pp      2974             147           145             3266

This is consistent with the observation in the previous subsection.

3.3. Backchannels

Verbal backchannels, typically "hai" in Japanese and "yeah" or "okay" in English, indicate that the listener understands what is being said. They also suggest the listener's interest level [7, 1] and activate the interaction. Nodding is regarded as a non-verbal backchannel, and it is more frequently observed in poster conversations than in simple spoken dialogues.

The occurrence frequencies of these events are counted within the segment of 2.5 seconds before the end of the presenter's utterances. They are shown in Figure 2 according to the joint eye-gaze events. It is observed that the person in the audience who takes the turn (the turn-taker) made more backchannels in both verbal and non-verbal manners, and the tendency is more apparent in the particular eye-gaze events "Ii" and "Ip", which are closely related to the turn-taking events.

3.4. Discussions

It is shown that the most relevant feature among the eye-gaze information is the presenter's gazing at the person to whom the turn is to be yielded. This is presumably affected by the characteristics of poster conversation, in which the presenter takes the major role in the conversation. The backchannel information from the audience may also be effective for turn-taking. The feedback not only indicates the audience's reaction, but also attracts the presenter's attention, triggering his/her gazing and then turn-yielding.



Figure 2: Statistics of backchannels and their relationship with turn-taking

4. Relationship between Feedback Behaviors and Kind of Questions

Next, we investigate the relationship between the feedback behaviors of the audience and the kind of questions they ask after taking a turn. In this work, questions are classified into confirming questions and substantive questions. Confirming questions are asked to make sure of the understanding of the current explanation; thus they can be answered simply by "Yes" or "No".1 Substantive questions, on the other hand, ask about what was not explained by the presenter; thus they cannot be answered by "Yes" or "No" alone, and an additional explanation is needed.

This annotation, together with the identification of the preceding explanation segment, is not so straightforward once the conversation enters the QA phase after the presenter has gone through the entire poster presentation. Thus, we exclude the QA phase and focus on the questions made during the explanation phase. In this section, we analyze the behaviors during the explanation segment that precedes the question, formed by merging all consecutive IPUs of the presenter. This is a reasonable assumption once turn-taking is predicted as in the previous section. These are the major differences from the analysis of the previous section.
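
As a rough sketch of this analysis step (ours, not the original tooling; it ignores possible intervening utterances by other participants), the explanation segment preceding a question can be formed from the presenter's IPUs and then used to normalize the feedback counts:

    def explanation_segment(presenter_ipus, question_start):
        # presenter_ipus: time-ordered list of (start, end) in seconds;
        # merge the presenter IPUs preceding the question into one segment
        preceding = [(s, e) for s, e in presenter_ipus if e <= question_start]
        return (preceding[0][0], preceding[-1][1]) if preceding else None

    def feedback_rate(event_times, segment):
        # frequency per second of backchannels or noddings within the segment
        start, end = segment
        count = sum(1 for t in event_times if start <= t <= end)
        return count / (end - start)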

4.1. Backchannels

The occurrence frequencies of verbal backchannels and non-verbal noddings, normalized by the duration of the explanation segment (in seconds), are listed according to question type in Tables 4 and 5. In these tables, the statistics of the person who actually asked the question are compared with those of the person who did not. We can observe that the turn-taker made significantly more verbal backchannels when asking substantive questions. On the other hand, there is no significant difference in the frequency of non-verbal noddings among the audience or among the question types.

1 This does not mean the presenter actually answered simply by "Yes" or "No".

Table 4: Frequencies (per sec.) of verbal backchannels and their relationship with question type

                 confirming    substantive
turn-taker       0.034         0.063
non-turn-taker   0.041         0.038

Table 5: Frequencies (per sec.) of non-verbal noddings and their relationship with question type

                 confirming    substantive
turn-taker       0.111         0.127
non-turn-taker   0.109         0.132

Table 6: Duration (ratio) of joint eye-gaze events and their relationship with question type

      confirming    substantive
Ii    0.053         0.015
Ip    0.116         0.081
Pi    0.060         0.035
Pp    0.657         0.818


4.2. Eye-gaze Events

We also investigate the relationship between eye-gaze events and the question type. Among the several parameterizations introduced in the previous section, we observe a significant tendency in the duration of the joint eye-gaze events, normalized by the duration of the presenter's explanation segment. It is summarized in Table 6. We can see an increase of "Ip" (and a corresponding decrease of "Pp") for confirming questions. Combining this with the analysis in the previous section, we can infer that the majority of the turn-taking signaled by the presenter's gazing is attributed to confirmation.

4.3. Discussions

When the audience asks substantive questions, the presentation and its understanding should have reached a deep level. Thus, both the presenter and the audience are focused on the poster. The verbal backchannels might signal the person's confidence and interest level, but we need to investigate the syllabic and prosodic patterns of these reactive tokens as in previous work [8].

5. Conclusions

We have investigated the relationship between the feedback behaviors of the audience and the succeeding events of asking questions in poster conversations. It is confirmed that gaze information plays an important role in turn-taking, but that it is more relevant to confirming questions than to substantive questions.



Both verbal backchannels and non-verbal noddings are also correlated with the turn-taking events, but verbal backchannels are more relevant to substantive questions. We presume that these different feedback behaviors are related to the understanding or interest level of the audience.

Based on these findings, we plan to design a smart posterboard which can control cameras and a microphone array to record the sessions and annotate the audience's reactions, which are critically important in poster conversations. These findings will also be useful for an intelligent conversational agent that makes an autonomous presentation.

Acknowledgement: This work was supported by JST CREST and JSPS Grant-in-Aid for Scientific Research.

6. References

[1] T. Kawahara, K. Sumi, Z.Q. Chang, and K. Takanashi. Detection of hot spots in poster conversations based on reactive tokens of audience. In Proc. INTERSPEECH, pages 3042–3045, 2010.

[2] Nigel G. Ward and Yaffa Al Bayyari. A Case Study in the Identification of Prosodic Cues to Turn-Taking: Back-Channeling in Arabic. In Proc. INTERSPEECH, pages 2018–2021, 2006.

[3] Bo Xiao, Viktor Rozgic, Athanasios Katsamanis, Brian R. Baucom, Panayiotis G. Georgiou, and Shrikanth Narayanan. Acoustic and Visual Cues of Turn-Taking Dynamics in Dyadic Interactions. In Proc. INTERSPEECH, pages 2441–2444, 2011.

[4] K. Jokinen, K. Harada, M. Nishida, and S. Yamamoto. Turn-alignment using eye-gaze and speech in conversational interaction. In Proc. INTERSPEECH, pages 2018–2021, 2011.

[5] T. Kawahara, H. Setoguchi, K. Takanashi, K. Ishizuka, and S. Araki. Multi-modal recording, analysis and indexing of poster sessions. In Proc. INTERSPEECH, pages 1622–1625, 2008.

[6] K. Maekawa. Corpus of Spontaneous Japanese: Its design and evaluation. In Proc. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, pages 7–12, 2003.

[7] T. Kawahara, M. Toyokura, T. Misu, and C. Hori. Detection of feeling through back-channels in spoken dialogue. In Proc. INTERSPEECH, page 1696, 2008.

[8] T. Kawahara, Z.Q. Chang, and K. Takanashi. Analysis on prosodic features of Japanese reactive tokens in poster conversations. In Proc. Int'l Conf. Speech Prosody, 2010.



Evaluating a minimally invasive laboratory architecture for recording multimodal conversational data

Spyros Kousidis1, Thies Pfeiffer2, Zofia Malisz3, Petra Wagner3, David Schlangen1

1Dialogue Systems Group, 2A.I. Group, Faculty of Technology, 3Phonetics and Phonology Group, Bielefeld University, Germany

[email protected], [email protected]

Abstract

This paper presents ongoing work on the design, deployment and evaluation of a multimodal data acquisition architecture which utilises minimally invasive motion, head, eye and gaze tracking alongside high-quality audiovisual recording of human interactions. The different data streams are centrally collected and visualised at a single point and in real time by means of integration in a virtual reality (VR) environment. The overall aim of this endeavour is the implementation of a multimodal data acquisition facility for the purpose of studying non-verbal phenomena such as feedback gestures, hand and pointing gestures and multi-modal alignment. In the first part of this work, described here, a series of tests were performed in order to evaluate the feasibility of tracking feedback head gestures using the proposed architecture.
Index Terms: Multimodal interaction, feedback, virtual reality

1. Introduction

The acquisition of annotated multimodal conversational data is nowadays considered essential for the better understanding of human discourse [1], but also in the context of interaction between humans and ECAs [2]. However, scientific interest in multimodal corpora extends beyond computational linguistics into the fields of behavioural and social sciences, while the problems that arise in constructing, maintaining and reusing such databases have become the subject of research in computer science [3].

Two major issues that often arise when designing multimodal corpora are the inhibition of natural discourse by the presence of sensory equipment (a problem that also exists in traditional, audio-only corpora [4]), and the lack of standardisation in storing, annotating and querying the data. In addition, the use of visual data is also non-standard, as the angle and zoom of the camera(s) are often chosen to serve specific purposes, thus limiting re-usability of the content. Finally, annotation of the additional signal streams is costly, often limiting the size of the corpus and introducing compromises that may also limit the usefulness of the acquired content [3]. The data collection architecture described here addresses these issues by using minimally invasive motion tracking sensors and a VR environment which is used both as a collection point of all sensory data and as an additional annotation tool. The purpose of this work is to collect multimodal conversational data in order to study various interaction phenomena. One type of non-verbal behaviour that is of particular interest is visual feedback gestures, which are deemed essential in interaction management, complementary to spoken feedback dialogue acts such as backchannels [2]. Visual feedback gestures include eye and head movements, facial expressions, hand gestures and body posture [5]. The ability to automatically detect and model such gestures is highly desirable in ECA design [6].

Another planned use is the study of alignment between interlocutors, which has previously been studied in a number of modalities, including posture and gaze [7]. However, few studies have looked at more than one modality at a time (e.g. [8]), perhaps due to the lack of sufficient data aggregation and synchronisation. The proposed architecture also addresses this issue by exploiting the immersive capabilities of VR.

Technology for the real-time assessment of multimodal human actions has long been a corner-stone of VR research. Together with the capabilities of simulating cognitive models of communication in virtual agents, the combination of VR and linguistic research is very promising [9]. In previous work, assessment of human pointing behaviour has been achieved through the implementation of an experimental-simulative loop using VR technology with the tool IADE [10]. A study on human-human interactions, in which both participants' gestures and speech were tracked [11], was re-simulated in VR in order to aggregate and review all collected and annotated data in one place. The use of VR technology allowed experimenters and annotators to immerse themselves in the recorded setting and to be situated right within the original interaction context. Later work also included the tracking of gaze and the real-time identification of the objects of interest [12]. Although tracking technology has often proved inhibiting to natural behaviour on the part of participants [13] in the past, technology has recently become less obtrusive, and remote sensing capabilities for eye gaze and 3D gestures are commercially available. The following section describes our data collection architecture, which utilises these technologies in order to capture visual feedback gestures.

2. Data collection architecture

The laboratory setup is shown in Figure 1. The data stream from each sensor is independently transmitted to the VR environment via LAN. This allows immersive viewing of the recorded interaction from any angle, including a real-time updating first-person perspective view of tracked subjects. Logging is also performed at this central point, ensuring synchronisation of the sensory components described below.

2.1 Motion tracking

Motion tracking is performed by the Microsoft Kinect1, an interaction device based on a depth camera produced by PrimeSense2. The Kinect does not require any attachments to the tracked subject, but projects a structured light pattern and then uses its distortion to create a depth image. As a second step, the provided software frameworks, Microsoft Kinect SDK1 or OpenNI3, extract skeletal information.

1 MSDN 2010 Microsoft Kinect SDK http://www.microsoft.com/en-us/kinectforwindows/
2 PrimeSense Ltd, http://www.primesense.com/
3 OpenNI, http://www.openni.org/



This skeleton model is still rather coarse and does not contain fingers or the orientation of the head. This technology is quite novel, so more precise versions of Kinect-like systems and better software frameworks for skeleton extraction can be expected.

2.2 Head/eye/gaze tracking

Head, eye and gaze tracking is performed using Facelab 5 by Seeingmachines4, which is a set of two (or more) eye-tracking cameras and an infrared light that is projected onto the face. Facelab uses the reflection of this light to track the position and orientation of the subject's head, the direction of gaze, the motion of several facial features, and several derived measurements such as the percentage of eye closure, fatigue, blinks, or the vergence point of the two gaze vectors of the subject's eyes. Additional components such as zoom lenses allow for a number of different positioning configurations, moving the cameras away from the tracked subject.

2.3 Audiovisual recording

Traditional audiovisual recording is performed by a set of three synchronized Canon XHG1S HD cameras and a choice of either directional or close-contact Sennheiser microphones. For a dyadic interaction, two cameras provide close-up front views of the interactants, while the third camera provides a panoramic view of the interaction.

2.4 Virtual Reality

The different devices are connected using InstantReality5, a VR framework and the underlying InstantIO network-transparent technology. Specific implementations of InstantIO modules for the Kinect and FaceLab were developed, along with an XML-based data format to log the events from all connected devices in an integrated fashion. The logging process is managed by a custom-built software tool, which is part of an effort to create a complete, publicly available tool chain for manual and semi-automatic recording and annotation of multimodal experiments.

3. Evaluation test procedures

The evaluation plan of the system consisted of two parts: in the first part – described in this section – a number of procedures were designed in order to acquire gold standard data against which to assess the accuracy of the tracking sensors. The second part – described in the next section – comprised data acquisition in real conditions. All procedures were performed by one female and two male subjects.

4 seeingmachines, http://www.seeingmachines.com/product/facelab/
5 IGD Fraunhofer, 2010, Instant reality, http://www.instantreality.org

3.1 Head position and orientation accuracy

A laser-pointing device with a precision of ±2 mm was used to measure distances from a person's head to flat panels placed around the person. Rotations of the head both in up-down (similar to nodding) and side-ways (tilting) directions were measured with a pitch-angle measuring device with a precision of 1 degree. The devices were fixed on a lightweight helmet that could be firmly strapped to the person's head. With the assistance of a lab technician, subjects moved or rotated their head in the tracked 3D space and measurements from both the laser-pointing device and the eye-tracking sensor were taken at 36 random points for each subject. Left-to-right rotation (yaw) angles were inferred using the difference in distance from the subject's head to a flat panel in front of them, as the head rotates.
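
The paper does not spell out the geometry behind this yaw inference; one simple reconstruction, assuming the rotation axis passes through the laser emitter and the panel is perpendicular to the beam at zero yaw, is sketched below (the function name and example values are illustrative, not the authors' procedure):

    import math

    def yaw_from_distances(d0, d):
        # d0: distance to the panel at zero yaw; d: distance measured after rotation.
        # Under the stated assumptions the beam length grows as d0 / cos(yaw).
        return math.degrees(math.acos(min(1.0, d0 / d)))

    print(yaw_from_distances(1.00, 1.06))   # roughly 19 degrees under these assumptions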

3.2 Head position tracking range

The eye-tracker allows the subject to be seated in a range of distances from the tracking cameras, depending on the configuration of zoom lens provided. This distance, theoretically at least, has an effect on the range of movements tracked persons can perform before they move out of range and tracking is lost. In order to measure this range, subjects were instructed to perform movements around the tracked space, reaching the limits in each direction. Each of the three subjects was placed at three different distances from the tracker: near (~.75m), mid (~.90m), and far (~1.05m). These positions represent the extremes and mid-point of distances allowed by the focus calibration range of the tracking cameras.

3.3 Gazed object detection

The gaze-tracking function of the eye-tracking sensor, combined with a VR model of objects in the subject's field-of-view allows for detection of the object the subject is gazing at. Because this detection is based on whether a 'gaze vector' coming out of the subject's eyes intersects the modelled objects, the distance of the person from the sensors can theoretically affect accuracy. As in the previous procedure, subjects were placed at three different distances, while gazing at a set of five 40x40mm coloured cubes that were fixed on a flat table. A score (0-3) was given for each object, depending on whether the gaze vector pointed to the modelled object itself or one of a set of progressively larger proxy objects at the same location in the VR model. An overall success rate was calculated as a percentage of acquired points over the maximum possible points for all objects combined.
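
A minimal sketch of this scoring scheme follows (our illustration, approximating the proxy objects as spheres; the scale factors and names are assumptions):

    import numpy as np

    def ray_hits_sphere(origin, direction, centre, radius):
        # True if the ray origin + t*direction (t >= 0) passes within 'radius' of 'centre'
        d = direction / np.linalg.norm(direction)
        t = np.dot(centre - origin, d)          # closest approach along the ray
        if t < 0:
            return False
        closest = origin + t * d
        return np.linalg.norm(centre - closest) <= radius

    def gaze_score(eye_pos, gaze_dir, cube_centre, cube_size=0.04,
                   proxy_scales=(1.0, 2.0, 3.0)):
        # 3 points if the tightest volume is hit, 1 for the loosest proxy, 0 otherwise
        for score, scale in zip((3, 2, 1), proxy_scales):
            if ray_hits_sphere(np.asarray(eye_pos, float), np.asarray(gaze_dir, float),
                               np.asarray(cube_centre, float), scale * cube_size / 2):
                return score
        return 0

The overall success rate then corresponds to the sum of the scores over all objects divided by three times the number of objects.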

3.4 Body limb and hand position accuracy

The motion tracking sensor precision was measured by placing subjects in three predefined positions marked on the lab floor. A snapshot of the skeleton tracking data was taken at each position. The procedure was repeated three times, moving the motion sensor to a new position each time. The accuracy was assessed by comparing the calculated distances between the three positions. A similar procedure was followed for detecting whether subjects were holding a specific object. Three 80mm-diameter spheres were positioned in the tracked area and subjects were asked to hold them in their hands. Again, a comparison between the actual and the calculated distance between the spheres (derived from the tracked hand positions) yields a measure of the tracking accuracy.

Figure 1: Schematic of data collection architecture. The eye trackers, Kinect sensors and video cameras for participants A and B feed into the VR environment for visualisation and logging.



4. Tracking feedback gestures

A pilot experiment was performed in order to evaluate the ability of the architecture to track non-verbal feedback gestures. Three subjects, one female and two male (different from the three in the previous section), participated in this test. The setting that was used for feedback gesture elicitation is a simplified version of the one used in [14]. Briefly, one of the subjects (the speaker) is asked to narrate a story from their own experience, such as a holiday story, while the other subject (the listener) is encouraged to actively listen, paying attention to detail, with a hint that they might be asked questions at the end of the narration. In each interactive session, both participants were video-taped and recorded with contact microphones, while the listener was also tracked with the eye-tracking sensor (the motion-capture sensor was not used in this test). The data from the sensor was read in real time into the VR model and logged using the logging software described in section 2.4. Each of the three sessions lasted for about 10 minutes. The annotation of the head gestures was performed by two expert annotators following the schema in [14], using the video only (audio was muted). The annotation labels consist of a gesture category (nod, tilt, jerk, shake etc.) and the number of cycles, i.e. the number of repetitions of the gesture. For example, a 'nod-2' denotes a nod gesture with two cycles (a "double nod"). In total, 210 gestures were found and visually compared to the tracking data, as a first assessment of the detail captured with the described method.

5. Results and discussion

Table 1 shows the results from the accuracy evaluation procedures. For head tracking, the highest accuracy is obtained for the Z axis (towards or away from the tracking cameras) and the lowest for the Y axis (raising-lowering the head). The difference in error is quite large; however, this may be attributed to the fact that moving the subject's head without simultaneously rotating it is progressively more difficult in the reverse order of the error magnitudes (Z, X, Y).

Subject                                       Male 1   Female   Male 2   All
Head tracking position accuracy (cm)     X    2.15     2.28     1.16     1.86
                                         Y    3.65     2.75     2.22     2.87
                                         Z    0.15     2.40     0.15     0.90
Head tracking orientation accuracy    Pitch   1.87     1.96     1.71     1.85
(DEG)                                   Yaw   2.75     3.62     3.41     3.26
                                       Roll   0.78     1.76     1.34     1.29
Motion tracking position accuracy (cm)        1.37     2.83     3.12     1.98
Motion tracking object holding (cm)           3.16     5.40     3.40     2.93

Table 1: Mean absolute error of tracking sensors

Similarly, an error margin of ~2° is common for two of the three angles (pitch and roll), which were measured with the pitch-angle measuring device, while the left-right head-direction angle (yaw), which was inferred rather than directly measured, shows a higher error margin (3.26°). Thus, a value of ~2° is a better estimate of the error margin for the angles. These errors are larger than those specified by the equipment vendor (positional and angular accuracy of ±1 mm and ±1° respectively). The difference is most likely due to the fact that the real-time data stream was read from the tracker instead of the more accurate one that comes with a latency of 2.5 seconds.

Mean tracking ranges and object detection success rates for each subject are shown in Table 2, while Figure 2 shows the average effect of subjects' distance from the eye-tracking sensor. As predicted, increasing this distance also increases the effective tracking space, allowing for more freedom in subjects' movements: the "far" position yields a 25% larger tracking space compared to the "near" position. There is a similar 20% increase ("far" vs "near") in the range of head rotations.

Subject                          Male 1   Female   Male 2   All
Head position range (mm)    X    368      358      336      354
                            Y    288      238      261      262
                            Z    592      593      688      624
Head rotation range (DEG)  Pitch 162      144      176      161
                            Yaw  65       84       81       76
Detected objects (%)             64       54       45       54

Table 2: Mean tracking ranges and object detection success rates

On the other hand, the distance between subject and eye-tracker does not have an obvious effect on gazed object detection accuracy. The effect is balanced by the fact that a sharper angle is required to gaze at the objects at the near position compared to the far position. Objects with an area of at least 6x6cm facing the viewer at a distance of 1m can be consistently detected.

The motion-capture sensor yielded a comparable error margin of ±2.44 cm (see Table 1) when comparing the positions of ankles or shoulders in a standing posture, while the position of the hands showed a larger error margin (±3.98 cm). This position is not well-defined (a hand can hold an object in various ways) and therefore may have differed significantly between subjects. Applications of detecting the position of a subject's hand in real time include hand gesture detection or monitoring whether a subject's gaze follows a displayed object (by combining motion and eye-tracking data). The results suggest that this is feasible provided that the objects are large enough to ensure a high success rate for gazed object detection and to allow for hand position error. The error margins for the Kinect sensor are larger than those reported in [15], but the latter reported using special apparatus to hold subjects in place, while the focus of the work reported here has been more towards "real" conditions with naïve, unconstrained subjects.

Tracking during the pilot experiment proved reliably robust, as the signal from the eye-tracker had no break-ups in the first two sessions, while tracking was lost 3.27% of the time in the third session.

Figure 2: Effect of tracker distance on tracking ranges and object detection success rate



This loss was mainly due to a bad position of the subject relative to the tracker, causing his head to move out of range during a few extreme movements. A similar result has been reported for the Kinect sensor [15], i.e. tracking is never lost unless subjects move out of range.

The results from the pilot experiment also suggest that successful head gesture detection may be expected. Figure 3 shows head tracking data (approximately 0.8 seconds) for which the corresponding video was labeled as a nod with two cycles. These cycles can clearly be discerned in the plot at 500 and 800 ms, where the slope of the change in pitch angle is the steepest. Importantly, this nod is visually very subtle according to the annotators.

Similarly, the interval (2 sec) shown in Figure 4 was annotated as a complex gesture (nod + 2-cycle shake). Again, the plot shows a 'dip' around 900ms on the top line (pitch) which corresponds to the nod, and two more dips at 1600 and 1800 ms on the bottom line (yaw) which are the left-to-right head shakes. The simultaneous dip at the top line at the time of the second shake shows that the head rotated diagonally both down and to the right.
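
Such dips suggest that nod cycles could be counted automatically from the logged pitch stream, for example with a simple peak-picking pass (a sketch under assumed thresholds and sampling rate, not part of the reported system):

    import numpy as np
    from scipy.signal import find_peaks

    def count_nod_cycles(pitch_deg, fs=60.0, min_depth_deg=2.0, min_gap_s=0.15):
        # pitch samples in degrees; a nod cycle appears as a downward dip in pitch
        inverted = -np.asarray(pitch_deg, dtype=float)     # dips become peaks
        peaks, _ = find_peaks(inverted,
                              prominence=min_depth_deg,
                              distance=int(min_gap_s * fs))
        return len(peaks)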

It would be possible to use the VR environment to play back the tracked head movements on a 3D head, offering advantages such as zooming/navigating freely around it and thus providing more viewing angles in comparison to traditional annotation using the video image. This was not performed, because the logging tool is currently work in progress and has limited playback/seeking capabilities. However, this type of functionality is an expected outcome of the work described here. Another planned improvement is integration of the VR environment with widely-used annotation tools such as ELAN [16]. Finally, further improvements are expected in the upcoming releases of the MS Kinect SDK1, with more detailed skeleton and head tracking, making this sensor even more attractive to use due to its low cost and minimal disturbance to the interaction setting.

6. Conclusions & Future work

We have presented a multimodal data collection architecture that uses minimally invasive sensors and virtual reality as a central connecting point of the data streams, opening possibilities to study multimodal phenomena in a well-designed environment. In an evaluation of the setup both under ideal and under realistic conditions we found the accuracy of the collected data to be adequate for capturing multimodal behaviour such as feedback head gestures. Our aim is to further explore the combination of the immersive capabilities of VR with real-time motion-tracking data, towards a fully integrated multimodal annotation environment. These tools will eventually be released under an open source license.

7. Acknowledgements

This research is partly supported by the Deutsche Forschungsgemeinschaft (DFG) in the CRC 673 "Alignment in Communication". The authors would like to thank Felix Hülsmann, Florian Hofmann, Joanna Skubisz, Casey Kennington and Michael Bartholdt.

8. References

[1] P. Paggio, J. Allwood, E. Ahlsen, K. Jokinen, and C. Navarretta, "The NOMCO multimodal Nordic resource - goals and characteristics," LREC 2010, Valletta, Malta, 2010.

[2] M. Boholm and J. Allwood, "Repeated head movements, their function and relation to speech," LREC 2010, Valletta, Malta, 2010.

[3] D. Knight, "The future of multimodal corpora," Revista Brasileira de Linguistica Aplicada, vol. 11, pp. 391-415, 2011.

[4] N. Campbell, "Databases of emotional speech," in ISCA Workshop on Speech and Emotion, Northern Ireland, 2000, pp. 34–38.

[5] K. Jokinen, "Gaze and Gesture Activity in Communication Universal Access in Human-Computer Interaction. Intelligent and Ubiquitous Interaction Environments.", C. Stephanidis, Ed., Springer Berlin / Heidelberg, 2009, pp. 537-546.

[6] I. Mlakar and M. Rojc, "Towards ECA’s Animation of Expressive Complex Behaviour," in Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues, A. Esposito, A. Vinciarelli, K. Vicsi, C. Pelachaud, and A. Nijholt, Eds., Springer Berlin / Heidelberg, 2011, pp. 185-198.

[7] D. C. Richardson, R. Dale, and K. Shockley, "Synchrony and swing in conversation: Coordination, temporal dynamics, and communication," in Embodied Communication, G. Knoblich, Ed., ed: Oxford University Press, 2008.

[8] N. Campbell, "An Audio-Visual Approach to Measuring Discourse Synchrony in Multimodal Conversation Data,", Interspeech 2009, Brighton ,UK, 2009.

[9] T. Pfeiffer, "Using virtual reality technology in linguistic research," IEEE Virtual Reality Workshops (VR), 2012, pp. 83-84.

[10] T. Pfeiffer, A. Kranstedt, and A. Lücking, "Sprach-Gestik Experimente mit IADE, dem Interactive Augmented Data Explorer," in Dritter Workshop Virtuelle und Erweiterte Realität der GI-Fachgruppe VR/AR, 2006, pp. 61-72.

[11] A. Kranstedt, A. Lücking, T. Pfeiffer, H. Rieser, and I. Wachsmuth, "Deictic object reference in task-oriented dialogue," in Situated Communication, ed Berlin: Mouton de Gruyter, 2006, pp. 155-207.

[12] T. Pfeiffer, "Understanding multimodal deixis with gaze and gesture in conversational interfaces," PhD, Bielefeld, Bielefeld, 2010.

[13] K. Jokinen, M. Nishida, and S. Yamamoto, "Eye-gaze experiments for conversation monitoring," 3rd International Universal Communication Symposium, Tokyo, Japan, 2009.

[14] M. Włodarczak, H. Buschmeier, Z. Malisz, S. Kopp, and P. Wagner, "Listener head gestures and verbal feedback expressions in a distraction task," this volume, 2012.

[15] M. A. Livingston, J. Sebastian, Z. Ai, and J. W. Decker, "Performance Measurements for the Microsoft Kinect Skeleton," IEEE Virtual Reality, Orange County, CA, USA, 2012.

[16] H. Brugman, A. Russel, and X. Nijmegen, "Annotating multi-media / multimodal resources with ELAN," LREC 2004, Lisbon, Portugal, 2004, pp. 2065-2068.

Figure 4: Nod and subsequent shake (2 cycles) head gesture

Figure 3: Nodding gesture (2 cycles)



The temporal relationship between feedback and pauses: a pilot study

Kristina Lundholm Fors

Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg, [email protected]

Abstract

In this pilot study we investigated the temporal relationship between pauses and feedback. We found that the majority of feedback items occur in the proximity of trp-pauses (pauses that occur at a trp, within a speaker's turn), but also that most intraturn pauses do not coincide with feedback units. This suggests that when modeling feedback in human-computer interaction, a method to identify trp-pauses will also provide suitable places for feedback.
Index Terms: feedback, back-channels, pauses

1. Introduction

Both feedback and pauses are closely linked to turntaking. Pauses can occur when a speaker is trying to determine whether the other speaker wants the turn, or if she/he may continue speaking, and feedback can be used to signal that the other speaker is free to continue.

In this study we will explore the temporal relationship between pauses and feedback items. Pauses in general do not seem to be sufficient predictors for feedback [1], but our hypothesis is that some subtypes of pauses are more closely related to feedback than others. Since we are using a rather small sample of dialogue, we will not be able to draw general conclusions; rather, we will use this pilot study to examine whether the relationship between feedback and different types of pauses might yield results that could be useful in feedback modeling.

1.1. Feedback

Feedback and related phenomena have also been referred to as, for example, backchannels and continuers. One of the defining characteristics of feedback is that it is not uttered in an attempt to claim the turn. Further, feedback can be produced at the same time as someone else is speaking, but it is not perceived as an interruption. Not all feedback items have the same function. Allwood et al. suggest a model based on four basic functions of feedback: to signal willingness and ability to continue the conversation, to signal perception of the message, to signal understanding of the message, and to convey attitude, specifically acceptance or rejection, towards the message [2].

Feedback tends to be preceded by certain cues, such as differences in intonation, duration and voice quality, but these cues vary between individual speakers [3]. Regions of low pitch may be good predictors of upcoming feedback [1]. The amount and type of feedback given also depends on the cultural background of the speakers [4].

Feedback does not have to be verbal; smiles and nodding are common feedback signals. However, in this study we will focus on verbal feedback behavior.

1.2. Pauses

Sacks et al. drew a distinction between pauses and gaps, where pauses are silent intervals that occur within a speaker's turn, and gaps occur when a speaker has stopped speaking and no one else has been nominated or has taken the turn [5]. Previous work has shown that speakers tend to vary their pause lengths in synchrony, which means that when one speaker is lengthening his/her pauses, so will the other speaker [6]. In this study we concentrated on pauses, and subdivided pauses into two groups: pauses that occurred at a transition relevance place (TRP) and pauses that occurred elsewhere within a speaker's turn. These pauses are referred to as trp-pauses and ntrp-pauses respectively. When categorizing the pauses, only the activity of the current speaker was taken into account. A trp-pause was defined as a silent interval within a speaker's turn where the speaker could have finished and yielded the turn. Pauses that occurred where the speaker did not seem finished were categorized as ntrp-pauses. Hjalmarsson used a similar method when judging semantic completeness, and found that interrater agreement for this measure was high [7].

1.3. Feedback in human-computer interaction

When modeling feedback in human-computer interaction, timing is highly important. Numerous different models have been developed to identify feedback places, based on, for example, prosody, POS-tagging and pause duration [8], pitch variations [1] and multimodal output features [9].

2. Method and material

The material used was an approximately 10 minute long, spontaneous dialogue between two Swedish female speakers. The speakers were recorded in a recording studio and were free to discuss any topic.



The recordings were transcribed orthographically and analyzed in Praat [10]. Pauses were identified manually, based on the acoustic signal. After identification, pauses were categorized as trp-pauses and ntrp-pauses respectively, as outlined in section 1.2. Feedback items were operationalized as short, isolated utterances that could, but did not have to, be produced by one speaker while the other speaker had the turn.

3. Results

The dialogue contained 75 feedback items (the distribution is shown in Table 1), and "mm" was the most common feedback unit. The mean length of the feedback units was 312 ms (SD 124 ms).

Type of backchannel    Frequency
mm                     51
aa                     8
ja                     5
jaa                    4

Table 1: Feedback items that occurred more than once in the material

The distribution of the pauses found in the material is shown in Table 2.

             Feedback present    No feedback
Trp-pause    33                  30
Ntrp-pause   7                   98
Sum          40                  128

Table 2: Trp-pauses and ntrp-pauses

76% of the pauses did not overlap with feedback. Of the 75 feedback items, 40 occurred completely or partially during a pause. The majority of the pauses that coincided with feedback were trp-pauses: 33 of the 40 feedback units occurred at a trp-pause.

We also examined the distance from the beginning of each feedback unit to the beginning of the nearest pause (either preceding or succeeding the feedback). 55 feedback items (79%) were closest to a trp-pause, and the mean difference between the beginning of the feedback unit and the beginning of the trp-pause was 15 ms (SD 669 ms). 15 feedback items were closer to an ntrp-pause, with a mean distance of -490 ms (SD 1252 ms).
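
The two measures above (overlap with a pause and signed distance to the nearest pause onset) can be computed roughly as in the following sketch, assuming feedback items and pauses are available as (start, end) intervals in seconds; the sign convention is our assumption:

    def overlaps(feedback, pause):
        # True if the feedback item occurs completely or partially during the pause
        return feedback[0] < pause[1] and pause[0] < feedback[1]

    def nearest_pause(feedback_start, pauses):
        # pauses: list of (start, end, kind) with kind 'trp' or 'ntrp';
        # returns the signed distance from feedback onset to pause onset, and the kind
        start, end, kind = min(pauses, key=lambda p: abs(feedback_start - p[0]))
        return feedback_start - start, kind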

As can be seen in Figure 1, the majority of feedback units began within ±1 second of the beginning of a trp-pause.

Figure 1: Distance between the start of feedback items and the start of the closest trp-pause

4. Discussion

Feedback may coincide with pauses, but pauses in general are not sufficient indicators of possible feedback locations. In our sample, 76% of pauses within a speaker's turn did not contain any feedback from the other speaker. However, the majority of feedback occurred at or in the proximity of trp-pauses, which was in line with our hypothesis. This suggests that the distance to the nearest trp-pause is a more useful indicator of a suitable feedback place than, for example, pause duration.

Feedback items are likely to occur at the same time as a trp-pause, or slightly before such a pause. This raises the question of whether the feedback has an effect on pause length, that is, whether trp-pauses preceded by feedback are shorter than other trp-pauses. It could be argued that if the feedback is produced shortly before the TRP, the interlocutor would take this as a signal to continue and might shorten or even eliminate the upcoming trp-pause.

This pilot study of a short dialogue has shown clear tendencies for feedback to occur close to trp-pauses. We plan to investigate this further to see if these findings will hold in other dialogues as well.

5. References

[1] N. Ward and W. Tsukahara, "Prosodic features which cue back-channel responses in English and Japanese," Journal of Pragmatics, vol. 32, no. 8, pp. 1177–1207, 2000.

[2] J. Allwood, J. Nivre, and E. Ahlsen, "On the semantics and pragmatics of linguistic feedback," Journal of Semantics, vol. 9, no. 1, pp. 1–26, 1992.

[3] A. Gravano and J. Hirschberg, "Backchannel-inviting cues in task-oriented dialogue," in Tenth Annual Conference of the International Speech Communication Association, 2009.

[4] M. Stubbe, "Are you listening? Cultural influences on the use of supportive verbal feedback in conversation," Journal of Pragmatics, vol. 29, no. 3, pp. 257–289, 1998.




[5] H. Sacks, E. Schegloff, and G. Jefferson, "A simplest systematics for the organization of turn-taking for conversation," Language, vol. 50, no. 4, pp. 696–735, 1974.

[6] K. Lundholm Fors, "Pause length variations within and between speakers over time," in Proceedings of the 15th Workshop on the Semantics and Pragmatics of Dialogue, Los Angeles, USA, 2011.

[7] A. Hjalmarsson, "Human interaction as a model for spoken dialogue system behaviour," Ph.D. dissertation, Royal Institute of Technology (KTH), Sweden, 2010.

[8] N. Cathcart, J. Carletta, and E. Klein, "A shallow model of backchannel continuers in spoken dialogue," in Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1. Association for Computational Linguistics, 2003, pp. 51–58.

[9] L. Morency, I. de Kok, and J. Gratch, "A probabilistic multimodal approach for predicting listener backchannels," Autonomous Agents and Multi-Agent Systems, vol. 20, no. 1, pp. 70–84, 2010.

[10] P. Boersma and D. Weenink, "Praat: doing phonetics by computer [computer program]," 2012. [Online]. Available: http://www.praat.org/



Cues to perceived functions of acted and spontaneous feedback expressions

D. Neiberg, J. Gustafson

Department of Speech, Music and Hearing, Royal Institute of Technology (KTH), Sweden
[neiberg,jocke]@speech.kth.se

Abstract

We present a two-step study where the first part aims to determine the phonemic prior bias (conditioned on "ah", "m-hm", "m-m", "n-hn", "oh", "okay", "u-hu", "yeah" and "yes") in subjects' perception of six feedback functions (acknowledgment, continuer, disagreement, surprise, enthusiasm and uncertainty). The results showed a clear phonemic prior bias for some tokens, e.g. "ah" and "oh" are commonly interpreted as surprise but "yeah" and "yes" less so. The second part aims to examine determinants of judged typicality, or graded structure, within the six functions of "okay". Typicality was correlated with four determinants: prosodic central tendency within the function (CT); phonemic prior bias as an approximation to frequency instantiation (FI); the posterior, i.e. CT x FI; and judged ideality (ID), i.e. similarity to ideals associated with the goals served by its function. The results tentatively suggest that acted expressions are more effectively communicated and that the functions of feedback to a greater extent constitute goal-based categories determined by ideals and to a lesser extent a taxonomy determined by CT and FI. However, it is possible to automatically predict typicality with a correlation of r = 0.52 via the posterior.
Index Terms: feedback, functions of feedback, goal driven categories, taxonomy

1. Introduction

The nature of the communicative functions of feedback, the proposed categorizations, the cues and the terminology are a subject of considerable debate. The functions of feedback have been proposed to support the grounding process [1], in which participants continuously work at establishing a common ground by signaling perception, understanding, acceptance and a number of emotional states. The opinions on the cues to these functions range from the standpoint of [2], who claims that prosody is decisive, i.e. the tokens are merely carriers of intonation, to [3], who puts more emphasis on instant and incremental phonemic-to-meaning mapping. This leads us to the first part of the current study: to what degree do phonemic realization and prosody contribute to listeners' categorization of feedback?

Moreover, if the goal of feedback is to communicate the state of the grounding process and emotional states, this raises the question of how to synthesize these in a dialogue system. For example, one may feed the speech synthesis using cues derived from acted or spontaneous data. While acted stimuli may be perceived more clearly, spontaneous stimuli may be perceived as more natural. We intend to explore this question using methodology from cognitive psychology on categorization [4].

This methodology attempts to access the mechanisms of how humans perceive categories by determining correlates of the typicality of each stimulus. For example, a sparrow is perceived as more typical of birds than an ostrich [5]. Thus, the continuum within categories ranges from the most typical to the most atypical members. Three determinants of typicality stand out: 1) the member's similarity to the central tendency (CT), as measured by co-occurring correlates among members (e.g. feathers, color, shape, number of legs, etc.); 2) the member's frequency instantiation (FI), i.e. how often it occurs as a member of its category; 3) the member's similarity to ideals (ID) associated with the goals served by its category.
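
As a rough illustration (ours; the paper's exact operationalization of CT, FI and the posterior is described later in the text), such determinants can be turned into numeric predictors and correlated with typicality ratings:

    import numpy as np
    from scipy.stats import pearsonr

    def central_tendency(features, category_mean):
        # a Gaussian-style similarity of a stimulus to the category's prosodic mean
        d = np.linalg.norm(np.asarray(features, float) - np.asarray(category_mean, float))
        return float(np.exp(-0.5 * d ** 2))

    def posterior(ct_scores, fi_scores):
        # posterior determinant as CT x FI, per stimulus
        return np.asarray(ct_scores) * np.asarray(fi_scores)

    # correlation between a determinant and the judged typicality of the stimuli:
    # r, p = pearsonr(posterior(ct_scores, fi_scores), typicality_ratings)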

Categories can be divided into taxonomies (e.g. birds, apples, beers) and goal-driven categories (e.g. things to bring on a picnic). The typicality of members in taxonomies is primarily determined by their frequency instantiation, secondly by their central tendency and thirdly by ideals. Goal-driven categories are primarily determined by ideality, secondly by frequency instantiation and thirdly by central tendency. As mentioned, ideals can be thought of as those characteristics that a category member should possess in order to best serve the goals associated with its category. According to the theory of grounding, the functions of feedback also serve the goal of communication. This means that the ideal for feedback expressions is to communicate these functions as efficiently as possible. Previous studies have identified emotional facial expressions [6] and vocal expressions [7] as goal-driven categories. The latter study also showed that acted expressions are perceived as closer to one's ideals than spontaneous expressions. Thus, acted expressions are communicated more clearly than spontaneous expressions. Finally, if the affective function of feedback is predominant, this would imply that feedback functions are goal-driven categories. On the one hand, it would not be surprising to find that functions of feedback are emergent processes that support an evolutionarily shared goal, or abstract constructs formed via cultural ideals. On the other hand, there is evidence from statistical analyses (which often make central tendency assumptions) of prosodic cues to feedback functions [8, 9], as well as from automatic classification experiments [10, 11], which supports the view that the functions form a taxonomy.

The current study is divided into two parts: 1) the first part examines the interaction between phonemic realization and prosody by determining the phonemic prior bias for different functions; 2) the second part is a pilot study which aims to explore determinants of the graded structure within functions.

2. Data

The current dataset was produced in a cooperative project between the speech group at KTH and the speech synthesis company Cereproc (http://www.cereproc.com/). The aim was to record feedback tokens such as "ah", "m-hm", "m-m", "n-hn", "oh", "okay", "u-hu", "yeah" and "yes" expressing functions like acknowledgment, continuer, disagreement, surprise, enthusiasm and uncertainty. These functions partly correspond to the functions derived from a survey in which subjects were asked to judge the functional similarity of spontaneously occurring feedback tokens, without being given any directions on which functions to use [9]. In a first recording session, every token was given a dialogue context to act out, which helped the professional voice-over artist (Paul Hamilton, http://www.pajh.org/acting/index.html) produce each feedback token with each expression, for example:

GL: Continue for about three blocks and pass the opera.

PH: Yeah. *uncertain*

This gave a combination of 54 types of expressions (9 tokens x 6 functions), each uttered three times, for a total of 162 vocal expressions.

In a second phase the two speakers were recorded while playing chess, and in a third recording they engaged in socializing conversation. The latter two phases elicited feedback tokens with expressions that partly overlapped those recorded in the scripted session. The feedback tokens were identified and annotated according to their function. Since we wanted to use acoustic measurements for determining the graded structure, we opted to select only one type of token, to avoid bias in the measurements from differences in phonemic realization. We selected "okay" because of its abundant occurrence in the spontaneous expressions, because it had a decent spread among the categories, and because it felt rather neutral as a carrier.

Table 1: Categories of "Okay".

Value            Acted  Spontaneous  Total
acknowledgment       3            3      6
continuer            3            5      8
disagreement         3            0      3
enthusiastic         3            0      3
surprise             3            1      4
uncertainty          3            3      6
Total               18           12     30

2.1. Prosodic analysis

A common metaphor for studying the communicative aspects of emotion is the Brunswikian lens model [12]. It describes a process which starts with an encoding stage, in which emotional expression changes a number of acoustic features of the voice, the distal indicators, for example fundamental frequency (F0). In the receiving party these are decoded as proximal percepts, i.e. F0 is perceived as pitch, and an emotional "gestalt" is then formed in the cerebral cortex.

Our previous study of the encoding stage showed that the functions were expressed with rather contrastive prosody [13]. Enthusiasm and surprise showed a higher average F0 as well as a shorter duration, the latter also being characteristic of acknowledgment. The continuer function showed a rising F0, which contrasted with all other functions. The least contrastive functions, disagreement and uncertainty, differed only in mean F0, while surprise and enthusiasm differed only in spectral center of gravity (CoG). These results are promising since they indicate that the actor was successful in encoding these functions. The present study focuses on the decoding stage.

3. Listeners' Decoding Ability

This part aims to examine the interaction between phonemic realization and prosody by determining the phonemic prior bias for different functions.


3.1. Method

Ten subjects (4 female, 6 male; age M = 36.6, SD = 12.1) rated all acted stimuli in a forced-choice task by answering the question: "Which category is this? (acknowledgment / continuer / disagreement / enthusiasm / surprise / uncertainty)". The stimuli were presented in randomized order, 15 per page, and subjects could change their decisions within each page before submitting the data.

3.2. Result

The results of judging the acted stimuli in the forced-choice task are shown as a confusion matrix in Table 2. The recall rates per category are found along the diagonal and range between 2 and 3.8 times the chance level of 17%. There are two main confusion patterns: 1) surprise is often detected as enthusiasm; 2) uncertainty is often detected as disagreement. The recall rates per token, decomposed into the contributions of the different functions, are shown in Figure 1. The recall rates range from 2.2 to 3.4 times the chance level. However, not all functions are equally well decoded for different kinds of tokens. "Yes" and "yeah" are not likely to be decoded as surprise. "Ah" and "oh" are over-interpreted as carrying surprise, but are not likely to be decoded as uncertainty or enthusiasm. Ranked by the evenness of the spread of function contributions (in terms of entropy, in descending order), the tokens are "m-m", followed by "u-hu", "okay", "m-hm", "yes", "yeah", "n-hn", "ah" and "oh".

Table 2: Decoders' confusion matrix (values in %). The functions are abbreviated as sur: surprise, unc: uncertainty, dis: disagreement, con: continuer, ent: enthusiastic and ack: acknowledgment.

True \ Detected   ack   con   dis   ent   sur   unc
ack                64    10     7     3    12     4
con                21    56     4     1     5    12
dis                22    13    41     1     4    20
ent                16     4     1    52    26     0
sur                21     6     1    35    35     1
unc                19    17    24     0     6    34
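To make the reported numbers concrete, the following sketch (using the values from Table 2; the evenness helper is illustrative and not the authors' code) shows how per-category recall relative to chance and an entropy-based ranking of per-token recall contributions can be computed:

# Per-category recall (relative to chance) from the confusion matrix in Table 2,
# plus a helper for the entropy-based "evenness" ranking used for the tokens.
import numpy as np

functions = ["ack", "con", "dis", "ent", "sur", "unc"]
confusion = np.array([            # rows = true, columns = detected (percent)
    [64, 10,  7,  3, 12,  4],
    [21, 56,  4,  1,  5, 12],
    [22, 13, 41,  1,  4, 20],
    [16,  4,  1, 52, 26,  0],
    [21,  6,  1, 35, 35,  1],
    [19, 17, 24,  0,  6, 34],
])
recall = np.diag(confusion) / confusion.sum(axis=1)
chance = 1.0 / len(functions)     # about 17%
print({f: round(r / chance, 1) for f, r in zip(functions, recall)})

def evenness(contributions):
    """Shannon entropy of a token's per-function recall contributions."""
    p = np.asarray(contributions, dtype=float)
    p = p[p > 0] / p[p > 0].sum()
    return float(-(p * np.log2(p)).sum())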

Figure 1: Recall rate per token type (Ah, M-hm, Mm, N-hn, Oh, Okay, Uh-huh, Yeah, Yes; y-axis: unweighted average recall rate). The contribution of the different functions to the recall rate is given within each bar. The functions are abbreviated as sur: surprise, unc: uncertainty, dis: disagreement, con: continuer, ent: enthusiastic and ack: acknowledgment.


3.3. Discussion

The results indicate that subjects had difficulty discriminating surprise from enthusiasm and, to some extent, uncertainty from disagreement. This was expected, since our prosodic analysis of this material showed smaller prosodic differences between the confused functions. The present study complements the previous results by showing the existence of a phonemic prior bias: "ah" and "oh" tend to be strong carriers of surprise but not of enthusiasm or uncertainty. Similarly, "yeah" and "yes" are weak carriers of surprise, "m-m" is the most neutral token, and the others fall somewhere in between. This points towards an interaction between the phonemic surface realization and prosody in the sound-to-meaning mapping. The phonemic prior bias might arise from the subjects' experience of how frequently a certain token is used to express a particular function. The phonemic prior bias is thus related to frequency of instantiation, which has been shown to be an important determinant of typicality within categories (cf. [4]). At this stage we hold the standpoint that the prior is at least one component of FI.

4. Determinants of the Graded Structure

This part is a pilot study which aims to explore determinants of the graded structure of the functions of "okay".

4.1. Method

In the second part, all stimuli were presented in random order on a single page. For each stimulus, the subjects were asked the following questions (cf. [7]):

Typicality: How typical is the expression for [category] in feedback? (1-10)

Ideality: If someone wants to express [category] in feedback, how effectively would this vocalization express [category]? (1-10)

Condition: Is this expression acted or spontaneous? (acted / spontaneous)

where [category] is the associated function. The ratings were obtained from the same subjects as in the first study.

Instead of obtaining the central tendency from the time-consuming process of pairwise judgments, we compute it from prosodic measurements. We use the ESPS pitch tracker and the logarithmic power function in the SNACK toolkit with default parameters, which gives a 10 ms frame rate. The F0 values are converted to semitones, and log power is referred to as intensity. Any unvoiced frames between voiced frames are interpolated over using splines. The F0 and intensity trajectories are parameterized using a type-II DCT, modified by dividing the coefficients by the duration of the token (estimated from the first to the last voiced frame). There are two main reasons for using this time-varying parameterization: 1) the DCT basis functions are periodic, which allows good interpolation of syllabic rhythm in speech; 2) the length-invariance provides a normalization for duration or speaking rate, which makes it possible to consider duration separately in the analysis. This parameterization has been used successfully for classification [14, 15] and visualization [16], and has shown a modest correlation with judged similarity [9]. For this task, a time resolution of 4 coefficients is used. The final feature vector is composed of 4 DCT coefficients of F0, 4 DCT coefficients of intensity, token duration and spectral center of gravity. To obtain appropriate weightings of the dimensions, the pairwise distances are computed in a space rotated via linear discriminant analysis (LDA), where the priors are set to a uniform distribution to avoid correlation with FI. Since the rotated space maximizes the distances between categories, the central-tendency measurements will be sensitive to variation that separates the categories but not to variation that does not distinguish them. CT is then computed as the average distance to all other members of the category. As mentioned, using acoustic measurements instead of pairwise judgments has the advantage of being less time consuming and more objective, but on the other hand there is no guarantee that no relevant prosodic variables remain hidden from the measurements. Obtaining frequency of instantiation (FI) is not without problems. In previous studies, this determinant was obtained by letting subjects judge FI directly by relying on their experience. As pointed out by [7], such subjective judgments may not reflect the actual frequencies. Instead, we use the recall rates for the functions, as transmitted by "okay", determined in the first study.
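As a rough illustration of this pipeline (not the authors' code: it assumes F0 contours in semitones and intensity contours have already been extracted at a 10 ms frame rate, with unvoiced gaps interpolated), the following sketch computes duration-normalized DCT features and an LDA-rotated central-tendency score:

# Sketch: duration-normalized DCT features and an LDA-rotated central-tendency
# (CT) score, i.e. the average distance from a token to the other members of
# its function category. Feature extraction itself is assumed to be done.
import numpy as np
from scipy.fftpack import dct
from scipy.spatial.distance import cdist
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def token_features(f0, intensity, duration, cog, n_coef=4):
    """4 DCT-II coefficients of F0 and of intensity (divided by token duration),
    plus token duration and spectral center of gravity."""
    c_f0 = dct(np.asarray(f0), type=2, norm='ortho')[:n_coef] / duration
    c_int = dct(np.asarray(intensity), type=2, norm='ortho')[:n_coef] / duration
    return np.concatenate([c_f0, c_int, [duration, cog]])

def central_tendency(X, labels):
    """Average distance to the other members of the same category, measured in
    a space rotated by LDA with uniform class priors."""
    classes = sorted(set(labels))
    lda = LinearDiscriminantAnalysis(priors=np.full(len(classes), 1.0 / len(classes)))
    Z = lda.fit_transform(X, labels)
    ct = np.empty(len(Z))
    for i, lab in enumerate(labels):
        others = Z[[j for j, l in enumerate(labels) if l == lab and j != i]]
        ct[i] = cdist(Z[i:i + 1], others).mean() if len(others) else 0.0
    return ct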

Bayes' theorem states that the posterior probability is proportional to the prior of a parameter (e.g. the frequencies with which the different feedback functions are conveyed by "okay") multiplied by the likelihood (e.g. a function of CT). Bayes' formula is commonly used in statistical classifiers (e.g. Naive Bayes, LDA or Hidden Markov Models) and formalizes the relation between FI and CT. By transforming CT into an approximation of a likelihood, l(CT | function) = exp(-(CT)^2), one can compute an approximation of the posterior probability. This determinant, the posterior, is important from an affective computing perspective, since it gives an indication of how well methods used in machine learning can predict typicality.
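A small worked example of this posterior approximation (the numbers are placeholders, not the study's values; the prior stands in for the recall rates of "okay" from the first study):

# Posterior approximation: prior (FI, here approximated by hypothetical recall
# rates of "okay") times the likelihood l(CT | function) = exp(-(CT)^2).
import numpy as np

functions = ["ack", "con", "dis", "ent", "sur", "unc"]
prior = np.array([0.30, 0.25, 0.10, 0.10, 0.10, 0.15])   # placeholder FI proxy
ct = np.array([0.8, 1.1, 1.5, 1.3, 1.2, 0.9])            # placeholder CT distances

likelihood = np.exp(-ct ** 2)
posterior = prior * likelihood
posterior /= posterior.sum()          # normalize so the posterior sums to one

for f, p in zip(functions, posterior):
    print(f"{f}: {p:.3f}")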

4.2. Result

The recall rate for correctly judging stimuli as acted was 53% and for spontaneous 55%, which is only slightly above the chance level of 50%. The ICC(C,k) [17] (i.e. Cronbach's alpha) was 0.67 for typicality and 0.78 for ideality. The average values were saved for the subsequent analysis and are shown in Table 3. The ratings of typicality were higher for acted than for spontaneous expressions, as were the ratings of ideality (t-tests, p < 0.05), while there was no significant difference between typicality and ideality within either the acted or the spontaneous condition. The Pearson correlations with the determinants of typicality and the cross-correlations between them are shown in Table 4. Due to the sparseness of the data, we do not present separate correlations for the acted and spontaneous conditions.

Table 3: Average ratings of typicality and ideality for the acted and spontaneous conditions (scale 1-10).

               Typicality   Ideality
Acted              7.04       6.98
Spontaneous        5.78       5.69
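For reference, the consistency measure reported above, ICC(C,k), is equivalent to Cronbach's alpha and can be computed from a stimuli-by-raters matrix of ratings as in the following sketch (the random matrix is a placeholder for the actual ratings):

# Cronbach's alpha (ICC(C,k)) over a stimuli x raters matrix of ratings.
import numpy as np

def cronbach_alpha(ratings):
    """ratings: 2-D array with rows = stimuli and columns = raters."""
    k = ratings.shape[1]
    rater_vars = ratings.var(axis=0, ddof=1)        # variance per rater
    total_var = ratings.sum(axis=1).var(ddof=1)     # variance of summed scores
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

rng = np.random.default_rng(0)
demo = rng.integers(1, 11, size=(30, 10)).astype(float)  # 30 stimuli, 10 raters
print(round(cronbach_alpha(demo), 2))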

Table 4: Pearson correlations between mean ratings of typicality (TP), ratings of ideality (ID), prosodic measurements of central tendency (CT), the phonemic prior bias (Prior) and the posterior approximation of prior x likelihood (Posterior). All correlations are significant at the p < 0.02 level.

            TP     ID     CT   Prior  Posterior
TP        1.00   0.96  -0.49   0.45     0.52
ID               1.00  -0.49   0.50     0.55
CT                      1.00  -0.54    -0.77
Prior                          1.00     0.95
Posterior                               1.00

4.3. Discussion

The typicality of the functions of "okay" was found to be best predicted by their suitability for expressing a certain function, i.e. their similarity to ideals, and secondarily by prosodic central tendency and phonemic prior bias. This indicates that feedback functions are to a greater extent goal-driven categories and to a lesser extent form a taxonomy. While subjects could barely determine directly whether an expression was acted or spontaneous, the results were more contrastive when ideality and typicality were judged. This shows that acted expressions are more effectively communicated. The higher ideality of acted expressions suggests that what corresponds to one's ideals is less often found in spontaneous speech. As an analogy, consider that the ideal of a low-calorie diet is zero calories, yet zero-calorie food is not very common even in low-calorie diets. The slightly higher correlation for the posterior determinant suggests that a simple statistical classifier which weights FI and CT according to Bayes' theorem is a decent, but not perfect, choice for applications that attempt to mimic human cognitive processing of the communicative functions. Overall, these results follow [7], although the present study finds a higher correlation for CT and a lower one for FI. The former difference may be due to the objectively measured CT, while the latter difference may be due to using the phonemic prior bias as an approximation of FI. However, the results of this part of the study must be taken with caution. The number of stimuli is limited and the results are derived only from the token "okay". Although the proposed method for determining CT is objective, one cannot exclude the presence of hidden prosodic variables. Approximating FI by the phonemic prior bias has the advantage of showing more precisely what FI consists of; however, there might be other components to FI.

5. Conclusions

The present study examines the decoding stage of the Brunswikian lens model [12] for feedback functions and complements our previous study of the encoding stage on the same material [13]. That study showed that similarity in prosodic realization makes it hard to distinguish enthusiasm from surprise and uncertainty from disagreement. However, the current study shows that much of this confusion can be avoided by making use of the phonemic prior bias. When a system is supposed to convey surprise it should make use of "ah" and "oh" feedback tokens, and when it needs to communicate enthusiasm it should use "yeah" and "yes" tokens. Similarly, "yeah" has a better chance of being recognized as uncertain, while "oh" more often gets recognized as disagreement. If the system wants the feedback function to be more vague, it should make use of the more neutral feedback tokens "m-m" and "okay".

The examination of the graded structure within the functions of "okay" in the present study tentatively suggests that feedback functions are to a greater extent goal-driven categories and to a lesser extent form a taxonomy. However, it is still possible to automatically predict typicality with a correlation of r = 0.52 via the posterior. Finally, it was found that acted expressions are more effectively communicated. Depending on the situation, a dialogue system might need to be more or less clear in its feedback. In some situations it might be sure of what it wants to communicate; in these situations it should opt for acted feedback tokens with a strong bias, like "oh" and "yes". In more unclear situations the system might want to keep a straight face by producing a feedback token with a less clear function; in these situations the system should make use of tokens like "m-m" and "okay", preferably taken from real interactions rather than acted ones.

6. Acknowledgements

Funding was provided by the Swedish Research Council (VR) projects 2009-4291 and 2009-4599.

7. References

[1] J. Allwood, J. Nivre, and E. Ahlsen, "On the Semantics and Pragmatics of Linguistic Feedback," Journal of Semantics, vol. 9, no. 1, pp. 1–26, 1992.

[2] D. Bolinger, Intonation and its uses: Melody in grammar and discourse. London: Arnold, 1989.

[3] N. Ward, "Non-lexical conversational sounds in American English," Pragmatics and Cognition, vol. 14, no. 1, pp. 129–182, 2006.

[4] L. W. Barsalou, "Ideals, central tendency, and frequency of instantiation as determinants of graded structure in categories," Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 11, no. 4, pp. 629–654, 1985.

[5] E. Rosch and C. B. Mervis, "Family resemblances: Studies in the internal structure of categories," Cognitive Psychology, vol. 7, no. 4, pp. 573–605, 1975.

[6] G. Horstmann, "Facial expressions of emotion: does the prototype represent central tendency, frequency of instantiation, or an ideal?" Emotion, vol. 2, no. 3, pp. 297–305, 2002.

[7] P. Laukka, N. Audibert, and V. Auberge, "Exploring the determinants of the graded structure of vocal emotion expressions," Cognition & Emotion, pp. 37–41, 2011.

[8] S. Benus, A. Gravano, and J. Hirschberg, "The prosody of backchannels in American English," in Proceedings of the 16th International Congress of Phonetic Sciences, 2007, pp. 1065–1068.

[9] D. Neiberg, J. Gustafson, and S. Giampero, "Semi-supervised methods for exploring the acoustics of simple productive feedback in Swedish," Speech Communication, submitted.

[10] A. Gravano, S. Benus, J. Hirschberg, S. Mitchell, and I. Vovsha, "Classification of discourse functions of affirmative words in spoken dialogue," in Interspeech, Antwerp, 2007, pp. 1613–1616.

[11] D. Neiberg and J. Gustafson, "The prosody of Swedish conversational grunts," in INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, Sep. 2010, pp. 2562–2565.

[12] K. R. Scherer, "Personality inference from voice quality: The loud voice of extroversion," Eur. J. Soc. Psychol., vol. 8, pp. 467–487, 1978.

[13] D. Neiberg and J. Gustafson, "Towards letting machines humming in the right way - prosodic analysis of six functions of short feedback tokens in English," in Fonetik 2012, Gothenburg, Sweden, Jun. 2012.

[14] D. Neiberg and J. Gustafson, "Predicting speaker changes and listener responses with and without eye-contact," in INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, Sep. 2011.

[15] D. Reidsma, I. de Kok, D. Neiberg, S. Pammi, B. van Straalen, K. Truong, and H. van Welbergen, "Continuous interaction with a virtual human," Journal on Multimodal User Interfaces, vol. 4, no. 2, pp. 97–118, Jul. 2011.

[16] J. Gustafson and D. Neiberg, "Prosodic cues to engagement in non-lexical response tokens in Swedish," in DiSS-LPSS Joint Workshop 2010, Tokyo, Japan, Sep. 2010.

[17] K. O. McGraw and S. P. Wong, "Forming inferences about some intraclass correlation coefficients: Correction," Psychological Methods, vol. 1, no. 4, p. 390, 1996.


Exploring the implications for feedback of a neurocognitive theory of overlapped speech

D. Neiberg, J. Gustafson

Centre for Speech Technology (CTT), TMH, CSC, KTH, Stockholm, Sweden
[neiberg,jocke]@speech.kth.se

Abstract

Neurocognitive evidence suggests that the cognitive load caused by decoding the interlocutor's speech while one is talking oneself depends on two factors: the type of incoming speech, i.e. non-lexical feedback, lexical feedback or non-feedback; and the duration of the speech segment. This predicts that the fraction of overlap should be high for non-lexical feedback, medium for lexical feedback and low for non-feedback, and that short segments have a higher fraction of overlapped speech than long segments. By normalizing for duration, it is indeed shown that the fraction of overlap is 32% for non-lexical feedback, 27% for lexical feedback and 12% for non-feedback. Investigating non-feedback tokens with respect to the durational factor shows that the fraction of overlap can be modeled by linear regression on the logarithm of duration, giving R2 = 0.57 (p < 0.01, F-test) and a slope b = -0.04 (p < 0.01, T-test). However, it is not enough to take duration into account when modeling overlap in feedback tokens.

Index Terms: neurocognitive theory, feedback, lexical feedback, non-lexical feedback

1. Introduction

Perhaps the simplest observation about human conversation is that, overwhelmingly, one person speaks at a time. This observation can be considered the first principle of turn-taking theory [1], but other observations include the presence of short feedback in overlapped speech [2] and the fact that overlaps of up to one second at speaker shifts are quite common [3]. This indicates that there are exceptions to the one-at-a-time principle which might be related to the neurocognitive decoding of feedback and short phrases. There are relatively few neurocognitive theories of turn-taking. The literature includes a hybrid computational model combining the Ymir Turn-Taking Model (YTTM) and Augmented Competitive Queuing (ACQ) [4], an oscillator model of timing phenomena [5] and work on the role of mirror neurons and the motor cortex [6]. This paper aims to formulate a neurocognitive theory of the occurrence of feedback and short phrases in overlap and to verify its predictions.

The most basic observation of conversation, the one-at-a-time principle, can be hypothesized to be caused by overlapping phonological or semantic processing systems for speech recognition and production. Two candidates for this overlap are Wernicke's area (Brodmann Area 22), located in the posterior superior temporal gyrus (STG), and Broca's area (Brodmann Area 44), located in the posterior inferior frontal gyrus. Both areas have shown highly correlated activation in functional magnetic resonance imaging (fMRI) experiments during speech perception and production [7]. The authors point towards evidence indicating that the overlapping regions are also involved in tasks which put load on the phonological loop in working memory. The model of working memory consists of a phonological loop, a visuo-spatial sketchpad and a central executive [8]. Cognitive load is the term for the general effort, or the interactions, that must be processed simultaneously in working memory [9]. The contribution of Broca's area to sentence processing has been further examined by [10] using fMRI under a baseline condition with no secondary task, during a concurrent speech-articulation task and during a concurrent finger-tapping task. It was found that concurrent articulation significantly reduces the ability to comprehend object-relative clause sentences compared to subject-relative sentences and compared to comprehension during the finger-tapping task. In the baseline condition, they found greater activation in Broca's area for sentences with object relatives than with subject relatives, but during concurrent speech articulation this effect was found only in a part of Broca's area. They interpreted these findings as: "Under high processing load conditions, such as sentences with object-extracted relative clauses, verbal working memory can be recruited to assist comprehension".

By assuming a reasonable degree of cooperation in conversation and the existence of an inverse relationship between cognitive load and the fraction of overlapped speech, it seems reasonable to conclude that the partial overlap of the production and recognition systems contributes to the cognitive load of speaking and listening at the same time, to the extent that the one-at-a-time pattern can be considered the first principle of turn-taking.

Working memory can handle two or three novel interacting elements, while its capacity is higher for non-novel information [11]. What can be remembered is also inversely related to word length, and the total span can be predicted on the basis of the number of words that the subject can read in approximately 2 seconds [12]. If this duration effect still holds while one is decoding the interlocutor's speech, then very short utterances may still be acceptable to comprehend during production, since they are less likely to cause excessive cognitive load. This exception to the one-at-a-time principle can explain the common presence of short overlaps at speaker shifts [3].

In everyday conversation, interlocutors usually give brief vocal feedback like "uhu", "okey" and "yeah, that's right" while the other is talking. Yngve [2] noticed that feedback is common in overlapped speech. He put forward the idea of a main half-duplex channel in conversation, which meant that overlapped speech, including feedback, has to be transmitted in a back-channel. The frequent occurrence of feedback in overlapped speech has caught the interest of researchers in turn-taking [13], and feedback is sometimes defined as those utterances which do not take the floor or are not full turns [14]. Empirically, feedback has indeed shown an over-representation in overlapped speech for English [15] and Swedish [16]. This cross-speaker context, that feedback often occurs in overlapped speech, is a distinct characteristic of feedback.

Very short utterances may still be acceptable to comprehend during production since they are less likely to cause excessive cognitive load. Since feedback segments have short durations, this may explain the over-representation of feedback in overlap. If so, short phrases in general should also be acceptable and should show the same over-representation in overlap as feedback. Empirically, very short utterances (< 1 sec.) [17] as well as manually labeled feedback [18] have both shown an over-representation in overlap.

Another contribution to overlap can be derived from the main function of feedback. The primary function of feedback is to convey affect and attitudes via prosody [19]. Vocal sounds are processed along the auditory "what" processing stream, reaching from the auditory cortex to the lateral STG and to the superior temporal sulcus (STS), where an emotional "gestalt" is formed. For each consecutive processing step there is an increasing lateralization to the right hemisphere, which processes pitch and segments on a wider temporal scale, and an increasing lateralization to the left hemisphere for the finer temporal processing suitable for decoding phonemic structure [20]. Thus, affective-prosodic decoding partially involves different areas of the brain than those used for linguistic decoding. Indeed, decoding of verbal interjections modulated by affective prosody has shown involvement of areas which are primarily used for affective decoding [21]. These parallel mechanisms for affective and linguistic decoding may explain why it is not problematic to decode feedback while one is talking. What is likely to pass through this back-channel, to use Yngve's terminology, is non-lexical feedback, i.e. feedback which has low linguistic content, slowly varying spectral flux and high affective content, like "uhu". Thus, the proportion of overlapped speech should be highest for non-lexical feedback, lower for lexical feedback and lowest for syntactic structures of words, excluding laughter and other extra-linguistic sounds. This effect should be additive to the effect of duration. We place the expected proportion of overlap for lexical feedback above that for non-feedback due to the expected affective loading, the lack of syntactic structure (in the case of single words) and the idiomatic, non-novel structure (in the case of short phrases).

The predictions are verified here in terms of the fraction of overlapped speech, computed with and without normalizing for duration, as well as by examining duration on its own (Section 3). For this we utilize the DEAL corpus, described in Section 2, and the findings are summarized in Section 4.

2. The DEAL corpus

This study uses data from the DEAL corpus [22]. It consists of eight role-playing dialogs recorded as informal, human-human, face-to-face conversations. The data collection was made with 6 subjects (4 male and 2 female), 2 posing as shop-keepers and 4 as potential buyers. Each customer interacted with the same shop-keeper twice, in two different scenarios. The customers were given a task: to buy items at a flea market at the best possible price.

All vocal activity in the DEAL corpus was segmented into Inter-Pausal Units (IPUs), defined as connected segments of vocalization bounded by a minimally perceivable pause, set to 200 ms. All dialogs were transcribed orthographically, including non-lexical entities such as laughter and audible breathing. Filled pauses, repetitions, corrections, restarts and cue phrases were labeled manually. The corpus is rich in feedback tokens.

The feedback tokens were generally single words or non-lexical tokens and appeared in similar dialog contexts (i.e. as responses to assertions). We divided the feedback tokens into non-lexical feedback, postulated to be those tokens which consist only of sonorants [23], and lexical feedback, which mainly consists of "okej" (okay), "precis" (exactly), other affirmative words and short phrases. The token counts for the two classes are shown in Table 1. It can be seen that the non-lexical category contains tokens with a more slowly varying spectral flux, suitable for processing in the right hemisphere, while the lexical category contains phonemes with a more spiky spectral flux, such as plosives or trills.

Table 1: Token counts for non-lexical and lexical feedback in the DEAL corpus.

non-lexical
token     count   percentage (%)
ja          448     39.5
m           163     14.4
a            91      8.0
nej          50      4.4
na           47      4.1
hm           24      2.1
mm           24      2.1
mhm          18      1.6
jaha         15      1.3
jo           14      1.2
Other        62      5.4

lexical
token     count   percentage (%)
det          26      2.3
precis       25      2.2
okej         25      2.2
just         17      1.5
jag           7      0.6
forstar       6      0.5
hur           6      0.5
eller         6      0.5
ar            5      0.4
da            4      0.4
Other        50      4.4

3. Investigations

The goal is to compute the proportions of overlapped speech for different types of speech segments (Inter-Pausal Units). To do this, the provided annotation is quantized into frames. This choice not only simplifies computation but also makes investigation 2 possible (see Section 3.2), where segments of equal length are compared. It also makes the results directly applicable to stochastic models for detection, segmentation and turn-taking which operate at the frame level [16, 24]. The frame size is chosen to be 50 ms.
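As a rough sketch of this frame-based computation (not the authors' code; IPUs are assumed to be given as (start, end) times in seconds per speaker channel):

# Sketch: quantize IPUs into 50 ms frames and compute the fraction of a
# segment's frames that overlap speech activity in the other channel.
FRAME = 0.05  # 50 ms

def to_frames(segment):
    start, end = segment
    return set(range(round(start / FRAME), round(end / FRAME)))

def fraction_overlap(segment, other_channel_segments):
    frames = to_frames(segment)
    other = set()
    for s in other_channel_segments:
        other |= to_frames(s)
    return len(frames & other) / len(frames) if frames else 0.0

# Example: a 0.4 s feedback token against the interlocutor's speech activity.
print(fraction_overlap((10.0, 10.4), [(9.8, 10.2), (11.0, 12.5)]))  # 0.5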

3.1. Investigation 1

The first investigation aims to get a rough picture of the interaction between the factors duration and feedback vs. non-feedback, where feedback is further split into lexical and non-lexical feedback. To do this, we compute the fraction of overlap and the average duration for Non-Lexical Feedback (NLF); Lexical Feedback (LF); Short Non-Feedback segments (SNF), i.e. segments with a duration shorter than 1 second (as in [17]); Non-Feedback segments (NF); and Long Non-Feedback segments (LNF), i.e. segments with a duration longer than 1 second. The fraction-of-overlap metric excludes silence and extra-linguistic sounds, e.g. laughter. The result is shown in Table 2.

3.1.1. Discussion

To interpret the results, the interaction between duration and fraction of overlap has to be considered. As predicted, non-lexical feedback has the highest proportion of overlap, but it also has the shortest duration. Lexical feedback and short non-feedback segments have almost the same average duration, while lexical feedback has a much higher fraction of overlap than short non-feedback segments. Non-feedback segments and long non-feedback segments have almost the same fraction of overlap, while long non-feedback segments have a much longer average duration.

Examining the proportion metric makes it clear that feedback is over-represented in overlap. Specifically, non-lexical feedback is more over-represented in overlap than lexical feedback. Short non-feedback segments are also more common in overlap than longer non-feedback segments, but less so than feedback. If feedback is excluded from the short segments there is still an over-representation in overlap, and it is higher for short segments than for longer segments.

As predicted, these results suggest an interaction between the continuous duration factor and the categorical feedback vs. non-feedback factor. However, it is not clear whether non-lexical and lexical feedback constitute separate factors. The effect of duration seems to diminish after 1.0 second. This calls for two follow-up investigations, one in which the duration factor is normalized and one in which the duration factor is examined independently.

Table 2: Fraction of overlap, average duration (s) and number of frames for NLF: Non-Lexical Feedback; LF: Lexical Feedback; SNF: Short Non-Feedback (IPU ≤ 1 s); NF: Non-Feedback segments; LNF: Long Non-Feedback (IPU > 1 s).

                  NLF     LF     SNF     NF     LNF
Ovl. (fraction)  0.33   0.27    0.18   0.12    0.11
Avg. Dur.        0.36   0.51    0.55   1.16    1.85
N                8002   1343   73633  17577   56056

3.2. Investigation 2

In the second investigation the fraction of overlap is compared for segments of non-lexical feedback, lexical feedback and non-feedback with identical duration. Given that the maximum duration of all segments is N frames, the segments are grouped by creating sets d = 1 ... N, each containing all segments with a duration equal to d. Then, for each set d, a new set is created by collecting an equal number of segments from each of the three classes. The segments are picked in chronological order of appearance, which means that most segments are picked from the first dialogs. Every resulting set now contains an equal number of segments of equal length. This procedure collected 127 segments, comprising 1328 frames, from each class. The fraction of overlap is shown in Table 3.

Table 3: The fraction of overlap computed over an equal number of segments of equal duration for each of Non-Lexical Feedback (NLF), Lexical Feedback (LF) and Non-Feedback (NF). There are 127 segments with 1328 frames for each class.

                  NLF    LF     NF
Ovl. (fraction)  0.32   0.27   0.12
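A sketch of this duration-matching procedure (not the authors' code; segments are assumed to be chronologically ordered records of (frame count, class label, overlapped-frame count)):

# Sketch: build duration-matched subsets so each class contributes the same
# number of segments at every duration, then compare overlap fractions.
from collections import defaultdict

def duration_matched_overlap(segments, classes=("NLF", "LF", "NF")):
    """segments: chronologically ordered (n_frames, label, n_overlap_frames)."""
    by_dur = defaultdict(lambda: defaultdict(list))
    for n_frames, label, n_ovl in segments:
        by_dur[n_frames][label].append((n_frames, n_ovl))

    totals = {c: [0, 0] for c in classes}   # [overlapped frames, total frames]
    for per_class in by_dur.values():
        n = min(len(per_class[c]) for c in classes)   # equal count per class
        for c in classes:
            for n_frames, n_ovl in per_class[c][:n]:  # earliest segments first
                totals[c][0] += n_ovl
                totals[c][1] += n_frames
    return {c: (ovl / tot if tot else 0.0) for c, (ovl, tot) in totals.items()}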

3.2.1. Discussion

The results show that the fraction of overlap is highest for non-lexical feedback, somewhat lower for lexical feedback and lowest for non-feedback. This ordering was predicted by the neurocognitive theory. However, while the difference between feedback and non-feedback was rather large, the difference between non-lexical and lexical feedback was smaller.

3.3. Investigation 3

In the third investigation the duration factor is examined separately for non-feedback tokens and for feedback tokens only. The total fraction of overlap for each segment is parameterized as a function of the total segment duration in bins of 100 ms. This gives multiple points per tick on the x-axis of the histogram. The fraction of overlap is hypothesized to be smaller for longer segments, which led us to choose weighted linear regression as a model. The weights are chosen as the inverse variance per bin, computed via the normal approximation of the binomial distribution. This gives a small weight to x-ticks which have a small number of observations or a large spread among the observations, and a high weight when neither of these conditions holds. Two scales for the x-axis are tested: linear and logarithmic. The goodness-of-fit metrics and slopes for the models are shown in Table 4.

Table 4: Goodness-of-fit metrics and slopes for modeling the fraction of overlap via linear regression with duration as an explanatory variable.

Linear x-scale
Type    R2    p(F-test)  p(T-test)  slope
NF     0.39     0.001      0.000    -0.02
F      0.64     0.005      0.987     0.15

Logarithmic x-scale
NF     0.57     0.000      0.000    -0.04
F      0.72     0.002      0.995     0.09
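A sketch of this weighted fit (not the authors' code; the per-bin counts below are placeholders, and the inverse-variance weights use the binomial approximation p(1-p)/n):

# Sketch: weighted least-squares fit of overlap fraction vs. log duration.
# Bin data are placeholders: (bin duration in s, overlapped frames, total frames).
import numpy as np
import statsmodels.api as sm

bins = [(0.1, 40, 120), (0.2, 55, 200), (0.4, 60, 260), (0.8, 50, 300),
        (1.6, 45, 340), (3.2, 30, 330)]

dur = np.array([b[0] for b in bins])
p = np.array([b[1] / b[2] for b in bins])          # fraction of overlap per bin
n = np.array([b[2] for b in bins], dtype=float)
var = p * (1 - p) / n                              # binomial variance approximation
weights = 1.0 / var                                # inverse-variance weights

X = sm.add_constant(np.log10(dur))                 # logarithmic x-scale
fit = sm.WLS(p, X, weights=weights).fit()
print(fit.rsquared, fit.params[1], fit.pvalues[1]) # R^2, slope, slope T-test p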

3.3.1. Discussion

The results of the regression analysis show that a logarithmic transformation of segment duration gives a better fit in terms of R2. The F-test was significant for all conditions, but the T-test showed that the slope was not significantly different from zero for feedback, while it was significant for non-feedback. The former finding may be due to a lack of data points for feedback, since most feedback tokens are shorter than one second, or to the possibility that duration simply does not matter for feedback. The regression line for non-feedback is shown in Figure 1. It can be seen that the fraction of overlap decreases with increasing token duration.

Figure 1: Explaining the fraction of overlap for non-feedback segments by linear regression as a function of logarithmically transformed segment duration (x-axis: segment duration in seconds; y-axis: fraction of frames in overlap). The size of the markers is proportional to the counts for each point.

4. Conclusion

Based on neurocognitive evidence it is argued that the cognitive load caused by decoding the interlocutor's speech while one is talking oneself depends on two factors: the type of speech, i.e. non-lexical feedback, lexical feedback or other speech, and the duration of the speech segment (Inter-Pausal Unit). By assuming an inverse relationship between cognitive load and the fraction of overlapped speech, it is predicted that the fraction of overlap is high for non-lexical feedback, medium for lexical feedback and low for non-feedback (excluding extra-linguistic sounds like laughter). In addition, constraints on working memory predicted that short segments have a higher fraction of overlapped speech than long segments.

By separating the continuous duration factor from the categorical non-lexical/lexical feedback vs. non-feedback factor, it is shown that the fraction of overlap is 32% for non-lexical feedback, 27% for lexical feedback and 12% for non-feedback. The fraction of overlap for non-feedback can be modeled quite accurately by linear regression on the logarithm of duration, giving R2 = 0.57 (p < 0.01, F-test) and a slope b = -0.04 (p < 0.01, T-test). However, the fraction of overlap for feedback tokens could not be explained by duration, since the T-test for the slope was not significant.

Since the computations are made at the frame level, here chosen to be 50 ms, the results are directly applicable to stochastic models for detection, segmentation and turn-taking which operate at the frame level [16, 24]. For example, the computed proportions of overlap for the categorical factor correspond to maximum-likelihood estimates of finding or predicting different types of segments in overlap, and the continuous durational factor corresponds to modeling the probability of overlap with exponentially decaying functions.

For turn-taking theory in general, the results give a neurocognitive motivation for excluding feedback, especially non-lexical feedback, and short segments from the one-at-a-time principle. One possible extension of this work would be to investigate the impact of novel versus non-novel stimuli, since cognitive load is proportional to the novelty of verbal stimuli.

5. Acknowledgements

The authors would like to thank Petri Laukka for discussions. Funding was provided by the Swedish Research Council (VR) projects 2009-4291 and 2009-4599.

6. References

[1] H. Sacks, E. Schegloff, and G. Jefferson, "A simplest systematics for the organization of turn-taking for conversation," Language, vol. 50, pp. 696–735, 1974.

[2] V. H. Yngve, "On getting a word in edgewise," Papers from the Sixth Regional Meeting of the Chicago Linguistic Society, pp. 567–577, 1970.

[3] M. Heldner and J. Edlund, "Pauses, gaps and overlaps in conversations," Journal of Phonetics, vol. 38, no. 4, pp. 555–568, 2010.

[4] J. Bonaiuto and K. Thorisson, "Towards a neurocognitive model of turn taking in multimodal dialog," in Embodied communication in humans and machines, M. L. I. Wachsmuth and G. Knoblich, Eds. New York: Oxford University Press, 2008, pp. 451–483.

[5] M. Wilson and T. P. Wilson, "An oscillator model of the timing of turn-taking," Psychonomic Bulletin & Review, vol. 12, no. 6, pp. 957–968, 2005.

[6] S. K. Scott, C. McGettigan, and F. Eisner, "A little more conversation, a little less action - candidate roles for the motor cortex in speech perception," Nature Reviews Neuroscience, vol. 10, no. 4, pp. 295–302, March 2009.

[7] B. R. Buchsbaum, G. Hickok, and C. Humphries, "Role of left posterior superior temporal gyrus in phonological processing for speech perception and production," Cognitive Science, vol. 25, no. 5, pp. 663–678, 2001.

[8] A. Baddeley and G. Hitch, "Working memory," in The Psychology of Learning and Motivation, G. Bower, Ed. Academic Press, 1974, pp. 48–79.

[9] J. Sweller, "Cognitive load during problem solving: Effects on learning," Cognitive Science, vol. 12, no. 2, pp. 257–285, 1988.

[10] C. Rogalsky, W. Matchin, and G. Hickok, "Broca's area, sentence comprehension, and working memory: An fMRI study," Frontiers in Human Neuroscience, vol. 2, p. 13, 2008.

[11] F. Paas, A. Renkl, and J. Sweller, "Cognitive load theory and instructional design: Recent developments," Educational Psychologist, vol. 38, no. 1, pp. 1–4, 2003.

[12] A. D. Baddeley, N. Thomson, and M. Buchanan, "Word length and the structure of short-term memory," Journal of Verbal Learning and Verbal Behavior, vol. 14, no. 6, pp. 575–589, 1975.

[13] E. Schegloff, "Overlapping talk and the organization of turn-taking for conversation," Language in Society, vol. 29, pp. 1–63, 2000.

[14] N. Ward and W. Tsukahara, "Prosodic features which cue back-channel responses in English and Japanese," Journal of Pragmatics, vol. 32, no. 8, pp. 1177–1207, 2000.

[15] O. Cetin and E. Shriberg, "Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: Insights for automatic speech recognition," in Proc. ICSLP, Pittsburgh, 2006, pp. 293–296.

[16] D. Neiberg and J. Gustafson, "A dual channel coupled decoder for fillers and feedback," in INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, Sep. 2011.

[17] M. Heldner, J. Edlund, A. Hjalmarsson, and K. Laskowski, "Very short utterances and timing in turn-taking," in Proc. of Interspeech 2011, Florence, Italy, 2011.

[18] D. Reidsma, I. de Kok, D. Neiberg, S. Pammi, B. van Straalen, K. Truong, and H. van Welbergen, "Continuous interaction with a virtual human," Journal on Multimodal User Interfaces, vol. 4, no. 2, pp. 97–118, Jul. 2011.

[19] J. Allwood, J. Nivre, and E. Ahlsen, "On the semantics and pragmatics of linguistic feedback," Journal of Semantics, vol. 9, no. 1, pp. 1–26, 1992.

[20] A. Schirmer and S. A. Kotz, "Beyond the right hemisphere: brain mechanisms mediating vocal emotional processing," Trends in Cognitive Sciences, vol. 10, no. 1, pp. 24–30, 2006.

[21] S. Dietrich, I. Hertrich, K. Alter, A. Ischebeck, and H. Ackermann, "Understanding the emotional expression of verbal interjections: a functional MRI study," Neuroreport, vol. 19, no. 18, pp. 1751–1755, 2008.

[22] A. Hjalmarsson, "Speaking without knowing what to say... or when to end," in Proceedings of SIGDial 2008, Columbus, Ohio, USA, Jun. 2008.

[23] N. Ward, "The challenge of non-lexical speech sounds," in International Conference on Spoken Language Processing, 2000.

[24] K. Laskowski, J. Edlund, and M. Heldner, "A single-port non-parametric model of turn-taking in multi-party conversation," in Proc. of ICASSP 2011, Prague, Czech Republic, May 2011, pp. 5600–5603.


Paralinguistic Behaviors in Dialog as a Continuous Process

David Novick

Department of Computer Science, The University of Texas at El Paso, El Paso, TX, USA

[email protected]

Abstract

Prior research on gaze, turn-taking, and backchannels suggests that the speaker's gaze cues the listener's paralinguistic responses, including feedback behaviors. To explore how conversants use feedback cues and responses, I studied a corpus of face-to-face conversational interaction, primarily using a conversation-analytic approach. Analysis of the dialogs suggests that paralinguistic behaviors express meaning at a level of granularity often smaller than dialog control acts. Behaviors such as gaze and nodding can be seen as continuous rather than discrete actions. Moreover, speaker gaze shift toward the listener is a polysemous expression that can cue a range of behaviors in the listener, including continued attention, head nods as backchannels, utterances as backchannels, and turn-taking. The analysis also suggests that gaze, from both speakers and listeners, can express a state rather than a discrete act.

Index Terms: dialog, grounding, feedback, gaze, nod

1. Introduction

Humans and embodied conversational agents appear to converse more effectively when the agents appear to sense and produce paralinguistic behaviors such as gaze shifts and head nods. Conversants use nods more often when an agent's feedback indicates that it perceives the nods [1]. Agents using human-like patterns of interaction are better appreciated by human conversants and contribute to more efficient interaction [2]. Consequently, more natural models of feedback behaviors should lead to even better interaction.

The dialog functions of paralinguistic behaviors, such as gaze and nods, can be expressed in terms of dialog control acts analogous to speech acts. Both David Traum [2] and I [4] have described act-based models that include dialog control acts such as “take turn.” These models’ discreteness makes them useful for computational representation and implementation, and they can be applied to action at a sub-utterance level. For example, gaze can be modeled as grounding at the level of intonation phrases, where speakers actively monitor for positive evidence of understanding [5].

Models of conversation, such as Suchman's model of joint action, have, all along, described these processes as continuous:

Closer analyses of face-to-face communication indicate that conversation is not so much an alternating series of actions and reactions between individuals as it is a joint action accomplished through the participants’ continuous engagement in speaking and listening [references omitted]. [6, p. 71]

Suchman’s model, though, was continuous at the level of the conversants’ contributions—a succession of discrete, interacting verbal responses rather than a moment-by-moment interplay of verbal and non-verbal. Successful embodied conversational agents will have to possess the ability to perceive, understand, and communicate through genuinely continuous processes that reflect the fine-grained dynamics of actual conversation and the moment-by-moment judgments of speakers about listeners’ understanding.

An incremental approach to interaction has been implemented at least on the generation side [7]. That is, the agent displays multimodal paralinguistic behaviors even though it cannot sense these behaviors in the human conversant. But it is the listener who moderates the speaker’s production, using nonverbal means. As Heylen pointed out,

[T]he behaviors displayed by auditors is an essential determinant of the way in which conversations proceed. By showing displays of attention, interest, understanding, compassion, or the reverse, the auditor/listener, determines to an important extent the flow of conversation, providing feedback on several levels. [8, p. 82]

The research on the relationships between gaze, turn-taking, and backchannels suggests that the speaker’s gaze cues the listener’s paralinguistic responses, including feedback behaviors. Speakers often use gaze to cue turn-exchanges by shifting gaze to the listener [9, 10], and speakers can use gaze shifts to cue backchannels, both verbal and nonverbal [11].

To improve my understanding of feedback cues and responses as a continuous process, I studied a corpus of face-to-face conversation. Because I was interested in exploring the micro-components of paralinguistics, my analysis was more in the tradition of conversation analysis than discourse analysis, complemented by reviewing all of the conversations between Americans in the entire corpus. My analysis of these interactions suggests that:
• Conversants vary widely with respect to feedback behaviors.
• Both speakers and listeners can produce multiple paralinguistic behaviors within single intonation phrases.
• While listeners sometimes nod while the speaker is looking away, they typically nod when or shortly after the speaker looks at the listener.
• Listener gaze aversion (i.e., the end of continued attention) can signal understanding.
• Speaker gaze shift toward the listener is a polysemous expression that can cue a range of behaviors in the listener, including continued attention, head nods as backchannels, utterances as backchannels, and turn-taking.
• Gaze, from both speakers and listeners, can express a state rather than a discrete act.

Moreover, if paralinguistic behaviors are really expressing states rather than acts, and if the behaviors are still to be viewed in the perspective of speech acts (along the lines of meta-acts or dialog control acts), then speech-act theory will have to accommodate expression of being. In the balance of this paper, I present the evidence—mostly conversation analytic—for these conclusions and discuss their implication for conversation-act models.

2. Observations

To explore how conversants use feedback cues and responses as a continuous process, I turned to the UTEP-CIFA corpus [12] of face-to-face conversational interaction. These conversations were recorded as part of a study of proxemics and trust, comparing behaviors between native speakers of American English and native speakers of Iraqi Arabic. For the purposes of this research, I limited my study to the twelve dialogs conducted by the eight American conversants. Each dialog was about four minutes long, for a total of about 48 minutes of conversation.

2.1. Differences among conversants

When the UTEP-CIFA corpus was collected, our research team annotated the dialogs for gaze, hand movements, and head nods. Analysis of the annotations indicates that American conversants produced, on average, 6.90 nods per minute, with a standard deviation of 2.17 nods per minute. As the standard deviation would suggest, the variation in nod rates among the dialogs was high, with two dialogs having fewer than 4 nods per minute and three dialogs having more than 10 nods per minute. The variation among dialogs reflects variation among the individual conversants, each of whom participated in three dialogs. The mean, minimum and maximum rates of nods per minute across the conversants were 6.90, 4.84 and 10.07, respectively.

Analysis of the annotations also disclosed similarly wide differences with respect to the amount of time that the conversants gazed at their conversational partner. The mean amount of gaze time per minute was 16.64 seconds, but the standard deviation was 7.66 seconds, and the minimum and maximum gaze time per minute across all of the dialogs were 4.47 seconds and 32.42 seconds, respectively. In other words, there were dialogs where one of the conversants almost never looked at the other conversant, and there were dialogs where one of the conversants looked at the other conversant about half the time. Again, the differences among the dialogues reflect differences among the individual conversants, whose average gaze per minute varied from a minimum of 9.08 seconds to a maximum of 2.74 seconds.

These differences among conversants were immediately apparent when viewing the corpus. Some conversants were animated listeners, nodding more or less continuously; others were impassive, rarely nodding, even after the speaker shifted gaze. Some conversants engaged with gaze much of the time; others steadfastly kept their gaze away.

2.2. Multiple head gestures within single intonation phrases

I turn now from discourse analysis to something more along the lines of conversation analysis. I focused on a segment of about 30 seconds in dialog P5 of the corpus; I transcribed the verbal and nonverbal actions of the segment by hand, viewing each moment of the conversation perhaps a dozen times. Figure 1 shows my transcript of this dialog segment.

Figure 1. Partial transcript of dialog P5.

The transcript, especially at 00:24:00-00:25:00 and 00:30:15-00:31:10, shows complex bursts of paralinguistic behaviors from both conversants. Moreover, my transcription does a poor job of conveying the continuous, animated quality of the interaction from the speaker more or less all the time, and from the listener when—aside from the case I discuss in the next subsection—the speaker's gaze is directed at him. In any event, these combinations of activity, within a single intonation phrase unit, include behaviors such as shifting gaze, tilting the head to the side away from the other speaker, and, untranscribed because the actions are rather subtle, the suggestion of a couple of nods—all within about a second. On the part of the listener, the combinations are less complex but typically include successions of small nods, or nodding plus gaze aversion.

2.3. Gaze shift as a cue for nodding

Consistent with the behaviors described in [10, 13], the listeners in the transcribed segment and in the overall corpus generally nodded when the speaker shifted gaze to the listener. While this was subject to the variation among conversants with respect to overall frequency of nodding, when listeners did nod it was almost always just after a gaze shift toward the listener, and rarely otherwise. For example, when the speaker shifts his gaze to the listener at 00:19:10, the listener immediately produces a succession of small nods.

This pattern has a plausible, if prosaic, explanation: if the speaker is not looking at you, it does not do much for you to nod because the speaker may not (cf. peripherally) see your action. So if you want to signal grounding via head nods, your first opportunity to do so is when the speaker shifts gaze to you.

Moreover, if you want the speaker to continue but do not want to or cannot signal grounding of the speaker’s preceding speech, then you really should not nod. This gives the speaker the opportunity to elaborate or clarify, after which the listener can then nod if he or she wants to signal grounding.

Figure 2 shows the dialog at exactly this sort of point. At 00:24:15, Conversant A, on the left, has just shifted his gaze to Conversant B, and Conversant B is still looking at A without nodding (in contrast to the immediate nod responding to the speaker's toward-listener gaze shift at 00:19:10). My interpretation of the dialog at 00:24:15 is that the speaker's fragmentary utterances ("the group um and when I heard the word group I used to uh I just finished a six-year tour with the") have left the listener in a position where he is struggling to understand the speaker's meaning. So when the speaker shifts his gaze to the listener, the listener does not immediately nod. Rather, the speaker continues production of the utterance ("nine-oh-second military intelligence"). This apparently helps the listener grasp the speaker's meaning, and the listener then, two seconds after the speaker's gaze shift, nods.

Figure 2. Conversant A (on left) shifts gaze to Conversant B, who waits about two seconds before nodding.

Actually, the listener not only nods, but he averts his gaze as he does so, as shown in Figure 3. This behavior has a logic to it. The listener’s non-nodding continued attention was signaling non-understanding, which is not the usual case for continued attention, which Clark and Schaefer [14] listed as a weak form of acceptance. So to signal understanding the listener has to change his gaze behavior and thus averts his gaze while nodding. In other words, the lack of initial nod transforms the listener’s continued attention into lack of acceptance, and so to signal acceptance the listener has to end his continued attention. In fact, this pattern occurred across different pairs of conversants.

2.4. Gaze shift as a polysemous cue

In the corpus, I observed the speaker’s gaze shift toward the listener cue backchannel nodding. But I also observed the same sort of gaze shift lead to a range of listener paralinguistics: continued attention (as in Section 2.3), nodding as backchannels (also as in Section 2.3), verbal backchannels, and turn exchanges. That is to say, the speaker’s gaze shift toward the listener is a polysemous cue, in that it can cue any one of these four behaviors in the listener. This suggests that gaze shift is an action rather than an act: it is a nonverbal behavior that has meaning as a dialog control act in the context of the interaction and of the conversants’ respective intents, much in the same way that an individual word or expression is not an act in itself but rather becomes an act when interpreted in context.

Part of the context for assigning meaning to gaze shifts consists of the speaker’s prosody, which may differentiate the nonverbal action into more specific acts through, for example, prosodic patterns for backchannel cues (see, e.g., [15]). Another part of the context involves the conversants’, and especially the listener’s, state of mind with respect to acceptance and grounding: no matter how clear the speaker’s cue, a listener who is not understanding the speaker would usually be ill-served by signaling acceptance. And part of the context involves the actual content of the dialog: if the speaker has apparently completed a contribution to the conversation, the listener can take the turn.

Figure 3. Conversant A continues to gaze to Conversant B, who averts his gaze and nods twice.

2.5. Gaze as an expression of state

The meta-act or dialog-control-act model of interpreting paralinguistic behaviors still has both utility and intuitive appeal, as it explains what conversants are doing. At the same time, though, even the 30-second segment of the dialog corpus analyzed here leads to questions about the discreteness of the model:

• If a listener is continuously gazing at the speaker, what is or was the listener’s act? Did the act occur when the speaker first gazed at the listener? Is there still an act some seconds later when the listener remains gazing at the speaker?

• If a speaker shifts gaze to the listener and holds this gaze, what is or was the speaker’s act? Did the speaker produce an “invite backchannel” act when he or she shifted gaze? How can the invitation still be in force as the speaker continues to gaze at the listener, as at 00:24:15 of dialog P5?

• If a listener, after not following through on an invitation to backchannel, continues to gaze at the speaker, presumably responding with an invitation for the speaker to clarify or elaborate, what is or was the listener’s act? Does the act still continue as long as the listener holds his gaze under these circumstances?

In light of these questions I suggest that gaze, and probably other paralinguistic behaviors such as continuous nodding, can be understood as expressing a state of being rather than expressing an act. The state of the continuously gazing, accepting listener is something like “I am following what you are saying and invite you to continue.” This is a continuous rather than a discrete phenomenon; the state has a beginning and an end, but for its duration it is a continuing proposition. Similarly, the state of the speaker who, having shifted gaze to the listener, holds his gaze toward the listener (perhaps in search of feedback) is something like “I am speaking and would like to see from you a positive signal of understanding.” Again, the speaker’s proposition is a continuous one. Finally, the state of the continuously gazing, non-accepting listener is something like “I am hearing you but not yet able to ground your current contribution.” This, too, is a state of being rather than an act.

For representation of paralinguistic behaviors as dialogue control acts, then, the act model will have to be extended to include continuous states of being. In other words, at each moment that a listener is normally gazing at a speaker, the listener’s state is not just the ongoing action of “continued attention” but rather a state given meaning by the context, for example, “I invite you to continue speaking.”
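One way to make such an extension concrete is to represent dialog-control information as intervals with an onset and a possibly still-open offset, rather than as point-like acts. The minimal sketch below uses assumed names (DialogState, active_states) and is not a claim about any existing dialog-act annotation scheme or toolkit.

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class DialogState:
    holder: str                      # "speaker" or "listener"
    meaning: str                     # e.g. "I am following you and invite you to continue"
    behaviors: List[str]             # observable carriers, e.g. ["gaze-at-partner", "continuous nodding"]
    onset: float                     # seconds into the dialog
    offset: Optional[float] = None   # None while the state is still in force

def active_states(states: List[DialogState], t: float) -> List[DialogState]:
    """Return every state in force at time t; a discrete act would instead be a single timestamp."""
    return [s for s in states if s.onset <= t and (s.offset is None or t < s.offset)]
```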

2.6. Implications for speech-act theory

If, as I suggested above, dialog acts should be extended to encompass states of being, in addition to discrete acts, then perhaps traditional speech acts should be extended similarly. For example, if a speaker says “I am hereby ready to sign the contract,” it is the case that the speaker has not just commented on his or her own status but actually stands ready to sign. In other words, the speaker is now in a state of being willing to sign the contract, and this state remains in effect until ended by another act from the speaker or some relevant change in circumstances. Or, even more to the point, imagine that the speaker says “I am hereby ready to sign the contract” and extends a hand while holding a pen. While the hand remains extended, the speaker appears to be in a state of willingness to sign. The speaker’s state has a sort of continuing illocution. When the hand is withdrawn, the state of willingness to sign appears to end, and the illocution ends with it.

3. Conclusions

Analysis of face-to-face dialogs suggests that paralinguistic behaviors express meaning at a level of granularity often smaller than dialog control acts such as “take turn.” Behaviors such as gaze and nodding can be seen as continuous rather than discrete actions.

While it is true that conversants vary widely with respect to the extent they use feedback behaviors such as gazing and nodding, both speakers and listeners can produce multiple paralinguistic behaviors within single intonation phrases.

While listeners sometimes nod while the speaker is looking away, they typically nod when or shortly after the speaker looks at the listener. But when continued gaze without nodding means that the listener is not accepting the speaker’s current contribution, gaze aversion by the listener (i.e., the end of continued attention) can be part of the listener’s signaling of understanding.

As is apparent from the interaction in the corpus, speaker gaze shift toward the listener can cue a range of behaviors in the listener, including continued attention, head nods as backchannels, utterances as backchannels, and turn-taking. In other words, gaze shift can have multiple meanings and effects, and these meanings and effects depend on prosody, context, and intention.

Gaze, and probably other paralinguistic behaviors such as continuous nodding, from both speakers and listeners, can be understood as expressing a state of being rather than expressing a discrete act. The model of dialog control acts, and perhaps speech act theory more generally, may have to be extended to accommodate expression of state of being.

In future work, we plan to extend the analysis of the corpus—and probably other available corpora that record naturally occurring interaction—to verify more systematically the observations of this paper that arose from a conversation-analytic approach. We also plan to test the finer-grained or continuous model of dialog control through experiments with embodied conversational agents.

4. References

[1] Morency, L.-P., “Context-based visual feedback recognition,” Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2006-075, Massachusetts Institute of Technology, 2006.

[2] Heylen, D.K.J., van Es, I., Nijholt, A., and van Dijk, E.M.A.G., “Controlling the gaze of conversational agents,” Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, Kluwer Academic Publishers, 245-262, 2005.

[3] Traum, D. R. and Hinkelman, E. A., “Conversation acts in task-oriented spoken dialogue,” Computational Intelligence, 8(3): 575-599, 1992.

[4] Novick, D., “Controlling interaction with meta-acts,” Conference on Human Factors in Computing Systems (CHI 91), New Orleans, LA, May 1991, 495, 1991.

[5] Nakano, Y.I., Reinstein, G., Stocky, T., and Cassell, J., “Towards a model of face-to-face grounding,” Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL ’03), Volume 1, Stroudsburg, PA, USA, 553-561, 2003.

[6] Suchman, L. A., Plans and Situated Actions, Cambridge: Cambridge University Press, 1987.

[7] Kopp, S., Stocksmeier, T., and Gibbon, D., “Incremental multimodal feedback for conversational agents,” in Pelachaud, C. et al. (eds.), Intelligent Virtual Agents ’07, LNAI 4722, Springer-Verlag, 139-146, 2007.

[8] Heylen, D., “Multimodal backchannel generation for conversational agents,” Workshop on Multimodal Output Generation (MOG 2007), Aberdeen, Scotland, January 25-26, 2007, 81-92, 2007.

[9] Novick, D., Hansen, B., and Ward, K., “Coordinating turn-taking with gaze,” Proceedings of ICSLP-96, Philadelphia, PA, October 1996, 3, 1888-1891, 1996.

[10] van Es, I., Heylen, D., van Dijk, B., and Nijholt, A., “Making agents gaze naturally - Does it work?” Proceedings of AVI 2002: Advanced Visual Interfaces, Trento, Italy, May 2002, 357-358, 2002.

[11] Timmerman, A., “Backchannels must be seen,” 13th Twente Student Conference on IT, Enschede, The Netherlands, June 21, 2010.

[12] Flecha-Garcia, M., Novick, D., and Ward, N., “Differences between Americans and Arabs in the production and interpretation of verbal and non-verbal dialogue behaviour,” Speech and Face-to-Face Communication Workshop, Grenoble, France, October 27-29, 2008, 47-48, 2008.

[13] Huang, L., Morency, L.-P., and Gratch, J., “A multimodal end-of-turn prediction model: Learning from parasocial consensus sampling,” Tenth International Conference on Autonomous Agents and Multiagent Systems, May 2011.

[14] Clark, H., and Schaefer, E., “Contributing to discourse,” Cognitive Science, 13, 259-294, 1989.

[15] Rivera, A., and Ward, N., “Prosodic features that lead to back-channel feedback in Northern Mexican Spanish,” Proceedings of the Seventh Annual High Desert Linguistics Society Conference, Albuquerque, NM, 19-26, 2008.


Empathy and Feedback in Conversations About Felt Experience

Nicola Plant, Patrick G.T. Healey

Queen Mary University of London, Interaction, Media and Communication Research Group, School of Electronic Engineering and Computer Science,

London, [email protected], [email protected]

Abstract

When we talk about felt experiences, such as physical pains and pleasures, we normally expect our conversational partners to provide empathetic feedback of some kind. Some models of human interaction predict that this feedback should be similar in form to our original production; the gestures, expressions and other non-verbal signals we use to explain our experience should be mirrored in the empathic displays of our conversational partners. Here, we test this idea using data from a corpus of interactions in which people describe experiences that vary in their degree of unpleasantness. Speakers in this situation produce more gestures when describing more unpleasant experiences. In contrast to this, their listeners provide less non-verbal feedback and use more verbal feedback as the expressed experience becomes more negative. These findings suggest a socially strategic use of empathic feedback that is not explained by the operation of an automatic perception-behaviour link.

Index Terms: empathetic feedback, motor mimicry, perception-behaviour link, imitation

1. Introduction

We have the capacity to empathise with each other's experience; however, the particular mechanisms behind empathy are still disputed and unclear [1]. Much of this debate concerns the in-principle (im)possibility of knowing another's experience and how feedback behaviours could be used to demonstrate understanding of the experience of another in conversation. Here we are concerned with the empirical question of how people actually show understanding of another's experiences in conversation. We are particularly interested in the use of motor mimicry as a form of feedback to display understanding of felt experience: the performance of the expected expressive behaviour associated with an experience, translated into the perspective of another, as a way to communicate the message that ‘I am like you’ [2].

Chartrand and Bargh (1999) propose that in conversation non-conscious non-verbal mimicry occurs by default. They draw on James’ principle of ideomotor action, which held that merely thinking about a behaviour increases the tendency to engage in that behaviour. Termed the perception-behaviour link, this phenomenon was proposed as a mechanism for non-conscious mimicry through an automatic connection between the perception and production of a behaviour. For example, if we see someone grimace we will also grimace, and this helps to show that we understand what they are expressing. Chartrand and Bargh suggest that the imitation of postures, gestures and expressions is a continual source of information throughout a social interaction, communicating understanding and attention. They claim that individuals use behaviour mimicry as a communicative tool on a completely non-conscious level and that this overt behavioural mimicry underpins emotional convergence [3].

How well does this model characterise what people do in conversation when someone is describing a physical experience? These are situations in which a speaker can take advantage of their own embodiment to produce a non-verbal display of the experience they are describing, for example, wincing to describe a pain or holding their sides to describe a belly laugh. How do attentive, cooperative listeners normally respond to these displays?

To address this question we present a corpus of speech, video and body movement data in which participants describe to each other recalled experiences that invoke significant elements of embodied experience, for example a toothache or a yawn, that could provoke empathetic responses. Plant and Healey (2012) show that in this corpus speakers produce gestures more frequently and for longer durations for descriptions of more negative experiences. Here we focus on the character of the listeners’ feedback responses to the expression of these experiences. If the perception-behaviour link model of empathic communication is correct, listeners should tend to match speakers by producing forms of non-verbal feedback that are congruent with the forms chosen by speakers and that match the increase in speakers’ gestures for more negative experiences. The assumption is that hearers should respond with stronger empathetic understanding by engaging in increasing levels of behaviour mimicry for more negative or unpleasant experiences. For example, the listener would mimic a reaction appropriate to the speaker’s described situation, like performing a wince at a description of pain, in order to communicate an understanding of the felt experience of the speaker’s pain.

2. Feedback Mechanisms

The occurrence of listener feedback or back-channels is thought to facilitate the incremental process of a conversation as a joint activity [5, 6, 7, 8, 9]. Research in the area of listener feedback has found that different functions can be distinguished for feedback. For the analysis below we distinguish between three broad categories of listener feedback: Contact and Perception, Comprehension, and Attitudinal and Emotional.

2.1. Contact and Perception

Contact and perception feedback shows a continuation of contact and the presence of the listener, and the listener's perception that there is a message being put across. This is usually in the form of back-channels that do not interrupt or require acknowledgement from the speaker, although without them the speaker would question whether the listener was paying attention; for example, generic nodding or vocalisations such as ‘yeah’ or ‘mmhmm’. Loredana Cerrato (2002) classifies feedback that functions as indicating contact and perception as a subtype of back-channel feedback expressions, otherwise known as continuers, since clear cases of such feedback serve to continue the speaker's utterance. These share the following features:

• respond directly to the content of an utterance of the other

• are optional

• do not require acknowledgement by the other

This definition rules out post-completion vocalisations, rules out feedback that occurs just after the speaker's utterance and could arise from some reflection or cogitation, and rules out answers to questions and listener questions. Back-channels do not take the floor or the turn, but can sometimes seek continuation as a way of avoiding the floor [9].

2.2. Comprehension

Another function of feedback is to acknowledge understanding of a message. Comprehension feedback is sometimes difficult to distinguish from contact and perception feedback. The clearest cases are when the feedback is in the form of a question relating to the content of the speaker's message, or a direct reference to the listener's understanding, for example ‘I see’, ‘Aaaah’, ‘Oh right’.

2.3. Attitudinal or Emotional

Another form of feedback is attitudinal or emotional, expressing a point of view or attitude towards the speaker's message. Schroder, Heylen and Poggi (2006) identified a subtype of listener responses displaying attitudinal or emotional feedback to speakers' utterances, called affect bursts. Affect bursts are very brief, discrete, nonverbal expressions of affect in both face and voice, triggered by clearly identifiable events [10]. Their experiments collecting recognition ratings of vocalisations of such phenomena indicated that affect bursts serve to display emotions that are gratifying for the speaker, or show empathy toward the speaker, but generally never express a negative attitude or emotion toward the speaker [11].

Similarly, Bavelas et al. (1987) classify empathetic listener responses as motor mimicry. Motor mimicry is defined as the mimicry of an expressive behaviour, or the performance of the expected expressive behaviour of an occurrence from the perspective of another. Conceptualised as primitive empathy, motor mimicry is described as an automatic reflex to conditioned cues based on one's own prior experience. Bavelas and her colleagues suggest that motor mimicry serves as an expression of the perceived emotion, an interpersonal act to put across, in their words, ‘I feel as you do’ [12].

Both affect bursts and motor mimicry are emotional or attitudinal responses that occur simultaneously with the speaker's utterance. It would be expected that descriptions of sensory experience would provoke empathetic responses like motor mimicry, especially during descriptions of pain. Moreover, empathetic responses should be most likely to occur when the listener has a good understanding of the sensation.

3. Methods

A corpus of natural interactions between two participants describing experiences they have had to each other was captured on audio, video and motion capture equipment in the Performance Laboratory at QMUL. The aim was to elicit natural descriptions of people's recalled experiences in an open, unscripted interaction.

24 naive participants were recruited. Participants' ages ranged from 18 to 60; there were 12 females and 12 males, placed in 12 random sex pairs. They were told the study was investigating how people communicate common experiences, and no specific mention was made of gesture. Participants were given written instructions outlining the entire study procedure, in which they were asked to recall some experiences and talk about them to each other.

The experiences to be described were written on sets of cards placed on a small table next to where the participants stood. Each participant was allotted a stack of cards and asked to take turns selecting one card at a time. When it was their turn, each participant described to their partner the details of a recalled instance of the sensation written on the card, for no longer than two or three minutes. Emphasis was placed on describing the particular sensation they felt at the time of the experience. During each description the listening participant was encouraged to talk and ask questions at any time; the process was described in the instructions as an exchange. Video footage was taken of each session, giving a full-body, face-on view of each participant for the duration of the study.

Two sessions were excluded from the data because the participants did not follow the instructions as requested, and two further sessions were excluded because of incomplete data. For the coding process, each description of an experience was treated as a separate item. All listener feedback was annotated: first the feedback was separated by modality, either verbal or non-verbal, and then every instance of feedback was coded according to one of the following functions (a small counting sketch follows the list):

• CP - indicating listener contact with and perception of the message.

• C - indicating listener comprehension or understanding of the message.

• A/E - indicating an attitudinal or emotional response, which could range from simply agreeing with the speaker to showing shock at the speaker's message.
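As a purely illustrative sketch of the counting step behind the figures and tables that follow, the coded instances can be tallied per item, modality and function as below; the record layout is an assumption made for the example and does not reproduce the project's annotation files.

```python
from collections import Counter

# Each annotated feedback instance: (pair, item, modality, function)
# modality: "verbal" | "non-verbal"; function: "CP" | "C" | "A/E"
annotations = [
    ("pair01", "Toothache", "verbal", "CP"),
    ("pair01", "Toothache", "non-verbal", "A/E"),
    ("pair01", "Laugh", "non-verbal", "A/E"),
    # ... one row per coded feedback instance
]

# Count occurrences of each feedback function per item and modality,
# which is the quantity summarised in the figures and tables below.
counts = Counter((item, modality, function) for _, item, modality, function in annotations)
for (item, modality, function), n in sorted(counts.items()):
    print(f"{item:10s} {modality:11s} {function:4s} {n}")
```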

4. Results

We report data for 9 pairs of participants and for four target items: Toothache, Backache, Yawn and Laugh. For analysis we ranked them on an intuitive basis as follows: 1 Laugh, 2 Yawn, 3 Backache, 4 Toothache, to provide a scale from positive to negative experience. Figures 1 and 2 show the mean occurrence of each feedback type per item over the valence of the experience, ranging from most pleasant to most unpleasant, as denoted above.

The frequency of occurrence of verbal empathic responses by the Non-Card Holder was analysed using Generalised Estimating Equations (GEE) with a Tweedie distribution and an identity link. Participants were entered as a subject variable, with Valence (1-4), Annotation Type (Attitudinal/Emotional, Comprehension, Contact and Perception) and Valence by Annotation Type as an interaction. As Figure 2 suggests, there is an overall main effect of Valence (Wald Chi-Square(3) = 15.5, p = 0.00), no overall main effect of Annotation Type (Wald Chi-Square(2) = 0.7, p = 0.70) and no interaction (Wald Chi-Square(6) = 11.7, p = 0.07). Linear trend contrasts for Valence show that there is a consistent increase in verbal empathic responses as the unpleasantness of the described experience increases (Wald Chi-Square = 11.8, p = 0.00). The marginal means for average occurrences at levels 1-4 are: 1.8, 1.9, 2.5 and 2.8 respectively.

Figure 1: Pattern of Non-Verbal Responses by Non-Card Holder.

Figure 2: Pattern of Verbal Responses by Non-Card Holder.

The parallel analysis for the non-verbal empathic responses (GEE, Tweedie distribution, identity link, with Valence (1-4), Annotation Type (Attitudinal/Emotional, Comprehension, Contact and Perception) and the Valence by Annotation Type interaction) shows a more complex pattern. There is a main effect of Valence (Wald Chi-Square(3) = 9.4, p = 0.02), a main effect of Annotation (Wald Chi-Square(2) = 38.8, p = 0.00) and a reliable Annotation by Valence interaction (Wald Chi-Square(6) = 14.9, p = 0.02).
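A rough equivalent of this kind of analysis can be sketched in Python with statsmodels; the CSV file and the column names (count, valence, annotation, participant) are assumptions made for the example rather than the project's actual data layout, and family/link options may be spelled slightly differently across statsmodels versions.

```python
import pandas as pd
import statsmodels.api as sm

# Assumed long-format table: one row per participant x valence x annotation type,
# with the number of feedback responses in the `count` column.
df = pd.read_csv("feedback_counts.csv")

model = sm.GEE.from_formula(
    "count ~ C(valence) * C(annotation)",       # main effects plus their interaction
    groups="participant",                       # repeated measures within participant
    data=df,
    cov_struct=sm.cov_struct.Exchangeable(),
    family=sm.families.Tweedie(var_power=1.5),  # the paper reports a Tweedie distribution
)
result = model.fit()
print(result.summary())
# Wald tests on groups of coefficients (result.wald_test on the relevant contrasts)
# give chi-square statistics analogous to those reported in the text.
```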

As the marginal means in Table 2 show, the main effect of Annotation is that non-verbal contact and perception signals are more common, across all strengths of expressed experience, than either attitudinal/emotional responses or responses showing comprehension. Table 1 breaks down the interaction between Annotation Type and Valence. The highest level of attitudinal/emotional feedback occurs with the least unpleasant experience. Feedback showing comprehension and contact and perception, by contrast, tends to increase as the unpleasantness of the experience increases.

Valence  Annotation (Non-Verbal)   Mean   Std. Error   95% Wald CI (Lower, Upper)
1        Attitudinal/Emotional     3.16   .457         2.26, 4.05
1        Comprehension              .81   .246          .33, 1.29
1        Contact and Perception    4.87   1.296        2.33, 7.42
2        Attitudinal/Emotional     2.32   .308         1.72, 2.93
2        Comprehension              .79   .226          .35, 1.24
2        Contact and Perception    3.94   .555         2.85, 5.03
3        Attitudinal/Emotional     2.11   .492         1.15, 3.07
3        Comprehension             1.08   .209          .67, 1.49
3        Contact and Perception    6.44   1.131        4.23, 8.66
4        Attitudinal/Emotional     2.19   .316         1.57, 2.81
4        Comprehension             1.59   .405          .80, 2.39
4        Contact and Perception    6.06   1.070        3.97, 8.16

Table 1: Estimated marginal means of non-verbal responses by Valence and Annotation Type.

Annotation (Non-Verbal)   Mean   Std. Error   95% Wald CI (Lower, Upper)
Attitudinal/Emotional     2.44   .206         2.04, 2.85
Comprehension             1.07   .202          .68, 1.47
Contact and Perception    5.33   .681         4.00, 6.67

Table 2: Estimated marginal means of non-verbal responses by Annotation Type.

5. Discussion

From this analysis we can determine that the level of verbal feedback (of all types) increases as the level of unpleasantness of the described experience increases, with the highest occurrence of verbal attitudinal or emotional feedback in response to the most unpleasant experience. This shows the predicted higher level of engagement and understanding being communicated the more unpleasant the described experience. Analysis of the non-verbal feedback shows a different pattern. Similar to the verbal feedback pattern, contact/perception and comprehension feedback types increase as the unpleasantness of the described experience increases. However, attitudinal and emotional feedback, which would include all non-verbal empathetic feedback, decreases. This is contrary to expectation: we predicted that more unpleasant described experiences would provoke more non-verbal empathetic responses, such as motor mimicry, to communicate understanding and mutual recognition of the speaker's experience. This is incompatible with an explanation of empathic communication based on the automatic production of non-verbal feedback of the kind described by Chartrand and Bargh.

Our findings suggest that descriptions of unpleasant experiences do elicit higher levels of engagement through verbal feedback and generic non-verbal feedback, but do not increase the tendency to engage in the embodied behaviour associated with the experience to communicate understanding. This suggests that listeners are sensitive to the character of the experience described by a speaker, but that they dynamically adapt the feedback they produce. We speculate that this adaptation is related to strategic social goals such as politeness and to a preference for using non-verbal communication to address the manifest concrete particulars of a described event rather than the speaker's embodied experience.

6. Acknowledgements

This research is supported by an EPSRC digital economies grant.

7. References

[1] S. D. Preston and F. B. M. de Waal, “Empathy: Its ultimate and proximate bases,” The Behavioral and Brain Sciences, vol. 25, no. 1, pp. 1-20; discussion 20-71, Mar. 2002.

[2] J. Bavelas, A. Black, and N. Chovil, “Form and function in motor mimicry: Topographic evidence that the primary function is communicative,” Communication, vol. 14, no. 3, pp. 275-299, 1988.

[3] T. Chartrand and J. Bargh, “The chameleon effect: The perception-behavior link and social interaction,” Journal of Personality and Social Psychology, vol. 76, no. 6, pp. 893-910, 1999.

[4] N. Plant and P. Healey, “The use of gesture to communicate about felt experiences,” The 16th Workshop on the Semantics and Pragmatics of Dialogue, in press.

[5] D. Heylen, E. Bevacqua, and C. Pelachaud, “Generating listening behaviour,” Emotion-Oriented, 2011.

[6] N. Ward and W. Tsukahara, “Prosodic features which cue back-channel responses in English and Japanese,” Journal of Pragmatics, vol. 32, no. 8, pp. 1177-1207, Jul. 2000.

[7] J. Allwood, L. Cerrato, and K. Jokinen, “The MUMIN coding scheme for the annotation of feedback, turn management and sequencing phenomena,” Language Resources, 2007.

[8] J. Allwood, “A study of gestural feedback expressions,” First Nordic Symposium on Multimodal, 2003.

[9] L. Cerrato, “Some characteristics of feedback expressions in Swedish,” Proceedings of Fonetik, TMH-QPSR, vol. 44, no. 1, pp. 101-104, 2002.

[10] M. Schroder, “Experimental study of affect bursts,” Speech Communication, 2003.

[11] D. Heylen, “Perception of non-verbal emotional listener feedback,” Phenomenology and the Cognitive, 2006.

[12] J. Bavelas, A. Black, and C. Lemery, “Motor mimicry as primitive empathy,” in Empathy and its Development, N. Eisenberg and J. Strayer, Eds. Cambridge: Cambridge University Press, 1987, ch. 14.


CoFee - Toward a multidimensional analysis of conversational feedback, the case of French language

Laurent Prevot, Roxane Bertrand

Laboratoire Parole et Langage, CNRS & Aix-Marseille Universite, Aix-en-Provence, [email protected], [email protected]

Abstract

Conversational feedback is mostly performed through short utterances such as yeah, mhmm, okay, produced not by the main speaker but by one of the other participants of a conversation. Such utterances are among the most frequent in conversational data. They have also been described in psycho-linguistic models of communication as a crucial communicative tool for achieving coordination or alignment in dialogue. The newly funded project described in this paper addresses this issue from a linguistic viewpoint by combining fine-grained corpus linguistic analyses of semi-controlled data with formal and statistical modeling. The impoverished nature of the linguistic material present in these utterances allows for a truly multidimensional analysis that can explain how different linguistic domains combine to convey meaning and achieve communicative goals.

Index Terms: Feedback, Backchannel, Semantics, Pragmatics, French Language

1. Objectives

The general objective of the CoFee¹ project is to propose a fine-grained model of the form/function relationship concerning feedback behaviors in conversation. To succeed, we need to achieve:

• a fine-grained analysis of the different dimensions involved (prosody, lexical markers, acoustic non-verbal signals, facial expressions, head movements, gaze);

• a fine-grained analysis of the communicative functions related to feedback;

• a rich characterization of two crucial contextual parameters: discourse context and production context;

• the integration of these ingredients into a general model.

¹ CoFee (Conversational Feedback: Multi-dimensional Analysis and Modeling) is a newly funded ANR (Agence Nationale pour la Recherche) 3-year project [2012-2015].

Figure 1: Sketch of the CoFee model

We consider that the truly multi-dimensional nature of the proposed analysis is an important and ambitious step for linguistic studies. Most existing related work either focuses on one domain and only marginally integrates the other dimensions, or constitutes a very shallow surface-based analysis grounded on a few features. Moreover, the integration of different situations of communication in such a precise study is also new and will allow us to account for communicative-situation variability in a more theoretical and experimental way than is usually done.

The present paper is structured as follows. We start by better defining the object of our study, feedback items, in Section 2. We then discuss some related work (Section 3) before presenting the data in some detail (Section 4) as well as the planned analysis and modeling (Section 5). Finally, we briefly describe the current work and what will be presented at the workshop.

2. Definitions

CoFee is a study of positive feedback items. This section is an attempt at clarification, at least for the purposes of the project.


• Feedback: In a dialogue or conversation context, it can be associated with any evaluative communicative action about previously introduced material.

• Backchannel: Intuitively, backchannels are productions made by the participant holding the listener role.

• Acknowledgment: A positive feedback item. Polarity is functional here, since negative items can have a positive evaluation function.

2.1. Backchannels vs. acknowledgments

Acknowledgements and backchannels have sometimes been used as synonyms. Although these phenomena frequently co-occur, they constitute different aspects of verbal interaction.

The term back-channel was introduced by [1] and included a broad range of linguistic phenomena such as questions and short comments. The notion was broadened later to include other items such as verbalized signals, sentence completions, brief restatements, clarification requests, and so on; almost any communicative event can be a back-channel. Indeed, backchannels are sometimes described as productions by the listener, which moves the definitional issue to the speaker/listener distinction. This is, however, not as straightforward as it seems, since listeners are commonly said to produce signals in the course of the communication. While [2] argues that participants tend not to overlap the productions of their interlocutors thanks to an efficient turn-taking rule system, [3] shows that even if the turn-taking system is efficient it is not rare for participants' speech to overlap.

The speaker/listener distinction, and therefore the definition of backchannels, combines both form and content issues. A participant who is not willing to take the turn should not produce utterances signaling a willingness to do so. Backchannels are typically brief, low in intensity, and may exhibit specific prosodic contours. Moreover, even if the listener desires to take the initiative, social rules (politeness) force him to conform to turn-taking rules and therefore to remain more or less in his listener role until the speaker yields the turn. At the content level, many productions can be back-channeled, and only a few communicative acts (such as questions) tend to switch the speaker/listener roles systematically.

Traditionally, backchannels are divided between continuers and assessments [4]. Tottie [5] gives continuers a regulative function and assessments a supportive function. The former regulate the coming contribution of the interlocutor, while the latter bring a supportive reaction to a previous contribution. Feedback is more clearly associated with the latter, but it is difficult to distinguish them systematically, and therefore most empirical studies propose to work on the phenomena as a whole.

2.2. Back-channel feedback

The difficulties mentioned above led [3] to propose the notion of back-channel feedback. Back-channel feedback:

• (i) responds directly to the content of an utterance of the other participant;

• (ii) is optional;

• (iii) does not require acknowledgment by the other participant.

However, these criteria concern only back-channelled feedback, not feedback occurring as part of a turn. Such feedback is rather common, particularly in task-oriented dialogues that require detailed grounding of the information transmitted.

2.3. Sum-up

To sum up, in CoFee we are examining positive feedback behaviors (mainly verbal behavior, but also laughter and other communicative grunts [6]). Most of them are also back-channels, but we do not exclude feedback items that are not back-channeled. The latter may be produced as an answer to a question, but are still optional, contrary to answers. Based on earlier studies of French feedback [7, 8], the list of lexical items we are including in our study is: oui (yes), ouais (yeah), mhmm, ok, d'accord (right), voila (that's it), c'est ca (that's it), ah, bon (well).
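As a trivial illustration of how such a closed list can be used to pull candidate feedback tokens out of a time-aligned transcript, consider the sketch below; the token format (speaker, start, end, word) is assumed for the example and is not the CID/OTIM annotation format, and multi-word items such as c'est ca would need a short look-ahead.

```python
import unicodedata

# Closed list of French positive-feedback forms studied in CoFee (accent-stripped).
FEEDBACK_ITEMS = {"oui", "ouais", "mhmm", "ok", "d'accord", "voila", "c'est ca", "ah", "bon"}

def strip_accents(s: str) -> str:
    """Remove diacritics so that e.g. 'voilà' matches 'voila'."""
    return "".join(c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn")

def candidate_feedback(tokens):
    """tokens: iterable of (speaker, start_s, end_s, word) tuples (assumed format).
    Yields tokens whose normalised form is in the closed list; multi-word items are omitted here."""
    for speaker, start, end, word in tokens:
        form = strip_accents(word.lower())
        if form in FEEDBACK_ITEMS:
            yield speaker, start, end, form
```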

3. Related work

Among the more recent works, [9] proposed a broad study of the form/function relation for feedback. They use various features, including acoustic and discourse ones. However, their “discourse” features are shallower than the ones we are planning to use (basically based on the size of Inter-Pausal Units (IPUs) and on the position of the item in the IPU). Moreover, we are also planning a more linguistic way of extracting the speech parameters than purely acoustic measurements. Nevertheless, we will attempt to replicate many aspects of their study on our French corpora.

[10, 11] have a multi-dimensional model of communicative functions dealing with feedback behaviors. Here the modeling framework is very rich, but, as in Gravano and colleagues' study, the discourse and linguistic features used are very shallow, since the goal was not to focus on feedback but on the identification of all communicative functions.

Formal semantics and pragmatics until recently remained away from feedback behaviors. Fortunately, in recent years this field has started to look more carefully at this issue. This movement can be traced back to [12], at least as a plea for a move in this direction. More complete frameworks allowing work on feedback mechanisms in formal pragmatics are presented in [13] and [14].


4. Data and annotations

4.1. Corpora

Three corpora will be used in the course of the project:

• The Corpus of Interactional Data (CID), recorded by Roxane Bertrand and Beatrice Priego-Valverde [15], is an 8-hour (110K tokens) corpus composed of 8 conversations of 1 hour each. It features a nearly free conversational style, with only a single theme proposed to the participants at the beginning of the experiment. This corpus is fully transcribed and forced-aligned with the signal at the phone level. Moreover, it has been annotated with various linguistic information (prosodic phrasing, discourse units, syntactic tags, ...) during the OTIM project [15]. (Visible at sldr.org/sldr000720/en)

• A 3h30 French MapTask created by Corine Astesano and Ellen Bard [16]. It has been recorded according to the original MapTask methodology. This corpus has been transcribed and aligned manually at the utterance level. We are now planning an automatic phone alignment with the same methodology used in the previous project. (Visible at sldr.org/sldr000732/en)

• A French Negotiation Game Corpus that is currently under construction and consists of negotiation games played by four participants. We are targeting a bigger corpus than the two others, but not a fully transcribed one. We plan to transcribe only the speech neighbouring feedback items, which are less frequent in this setting than in the two previous ones. (Preview visible at sldr.org/sldr000773/en)

These three corpora constitute very different communicative situations and therefore cover an interesting range of functions feedback can play in dialogue.

4.2. Feedback Annotations

There are some rich annotation frameworks that include feedback aspects, such as [17, 11]. However, given our focus on a few restricted forms, we will only use part of these comprehensive frameworks. Moreover, we inherit from previous annotation efforts. Namely, for the CID corpus some back-channel annotations have already been performed. The categories annotated were: continuer (minimally takes note), understanding (understands), assessment (agrees with what has been said), and evaluation (evaluates and displays an attitude about what has been said). Orthogonally, turn-initiating and turn-ending features have been added. Following another study [7], we also want to include (i) aspects related to the confirmation nature² of some feedback items and (ii) their discourse-structuring functions, such as closing the current discourse topic.

² Related to allo-feedback in the DIT scheme [11], for example.

Perhaps the most original part of our annotation scheme is the annotation of feedback scope. In [7], we identified 3 relevant scopes: last utterance, last pair, or wide scope. In the corpus we used, this scope was annotated reasonably reliably (κ = 0.6) and allowed us to specify the functions of some of the lexical items studied without having to rely on finer-grained functions.
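Agreement figures of this kind can be recomputed directly from two annotators' labels; the sketch below uses scikit-learn's Cohen's kappa on made-up scope labels, not the project's data.

```python
from sklearn.metrics import cohen_kappa_score

# Scope labels assigned by two annotators to the same feedback items (toy data).
annotator_1 = ["last utterance", "last pair", "wide", "last utterance", "wide"]
annotator_2 = ["last utterance", "wide",      "wide", "last utterance", "last pair"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa = {kappa:.2f}")   # the paper reports kappa of about 0.6 for scope on its corpus
```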

5. Analysis and Modeling

The model we are aiming at combines a detailed multidimensional analysis of the forms involved with a deep modeling of the meaning of these forms and of how these meanings are used to reach communicative goals.

5.1. Analysis of the forms

Regarding the first aspect, we will perform both linguistic analysis (including, in particular, systematic prosodic analysis) and more acoustic measurements. For the prosodic aspects, one track we will follow is Functional Data Analysis (FDA), as proposed in [18]. We therefore adopt a truly data-driven approach, but one guided by linguistic analysis. FDA indeed requires some minimal hypotheses about the shapes of the contours before starting the purely statistical analysis that will distinguish several clusters of contour instances. At this level the goal is to delineate as precisely as possible the formal categories in all the dimensions considered (at least lexical item, prosodic contour, and acoustic parameters).
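Proper FDA fits smooth basis functions to each contour; as a crude and clearly simplified stand-in that conveys the idea of grouping contour shapes, one can resample each f0 contour to a fixed length and cluster the resulting vectors. The sketch below assumes contours are already available as (time, f0) arrays and is not the analysis planned in the project.

```python
import numpy as np
from sklearn.cluster import KMeans

def resample(times, f0, n_points=20):
    """Linearly resample one f0 contour onto n_points equally spaced positions."""
    grid = np.linspace(times[0], times[-1], n_points)
    return np.interp(grid, times, f0)

def cluster_contours(contours, n_clusters=4, n_points=20):
    """contours: list of (times, f0) arrays, one per feedback token (assumed input).
    Returns a cluster label per contour; FDA proper would work on fitted smooth curves instead."""
    X = np.vstack([resample(t, f, n_points) for t, f in contours])
    X = X - X.mean(axis=1, keepdims=True)   # remove per-token register (mean f0), keep the shape
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
```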

5.2. Model of the functions

Concerning the functions, we consider that simply having a list of categories is not enough. To have a function, one should be able to model its effects on context. Moreover, we are also interested in the meaning (if any) of the forms considered and in how it is exploited by the participants to perform communicative actions. The set of communicative actions comes from the literature on these issues, in particular from the DIT+ framework [10]. However, we would like to go a step deeper by looking at how formal theories of dialogue [13, 10, 14] handle these phenomena.

Despite differences with regard to primitives and representation tools, it is possible to list a few properties any semantic/pragmatic theory should feature in order to deal with feedback items:

• Radical context dependence: Given the range of communicative functions a simple word like ‘yeah’ can have in a conversation, it is clear that the theory has to be a theory of how the meaning of a new utterance is interpreted (and resolved) in a given context;

• Rich ontology of communicative objects: the context in which utterances are resolved cannot simply be a flat representation of the actual world. Feedback has a meta-level nature; it is information about the information exchange, not directly about the content exchanged. Moreover, feedback is also about the processing of information by the speaker (cognitive realm) and about the conventional rules of the exchange (social realm).

Dynamic semantics and further work grounded in this paradigm all feature the first point, while the second one is present in most of the works that have looked seriously at dialogue. For the formal modeling aspect of our work, we will focus on two semantic theories that have put dialogue on their agenda: SDRT (Segmented Discourse Representation Theory) from [13] and KOS from [14].

6. Current work

Our current work consists in building the data sets from the corpora and in finishing the recording of the third corpus. By the time of the workshop, we will have conducted some preliminary studies on a data subset. The study will include an FDA analysis for at least two French lexical items in the CID corpus: ouais (yeah) and voila (that's it). For this data subset, we will complete the annotation of the functions in order to have a small-scale picture of our project to present during the workshop.

7. References

[1] V. H. Yngve, “On getting a word in edgewise,” in Papers from the Sixth Regional Meeting of the Chicago Linguistic Society, 1970, pp. 567-578.

[2] H. Sacks, E. A. Schegloff, and G. Jefferson, “A simplest systematics for the organisation of turn-taking for conversation,” Language, vol. 50, pp. 696-735, 1974.

[3] N. Ward and W. Tsukahara, “Prosodic features which cue back-channel responses in English and Japanese,” Journal of Pragmatics, vol. 32, no. 8, pp. 1177-1207, 2000.

[4] E. Schegloff, “Discourse as an interactional achievement: Some uses of ‘uh-huh’ and other things that come between sentences,” Georgetown University Round Table on Languages and Linguistics, Analyzing Discourse: Text and Talk, pp. 71-93, 1982.

[5] G. Tottie, “Conversational style in British and American English: The case of backchannels,” in English Corpus Linguistics: Studies in Honour of Jan Svartvik, London & New York: Longman, pp. 254-335, 1991.

[6] N. Ward, “Non-lexical conversational sounds in American English,” Pragmatics & Cognition, vol. 14, no. 1, pp. 129-182, 2006.

[7] P. Muller and L. Prevot, “An empirical study of acknowledgement structures,” in Proceedings of DiaBruck, 7th Workshop on Semantics and Pragmatics of Dialogue, Saarbrucken, 2003.

[8] R. Bertrand, G. Ferre, P. Blache, R. Espesser, and S. Rauzy, “Backchannels revisited from a multimodal perspective,” in Proceedings of Auditory-Visual Speech Processing, 2007.

[9] A. Gravano, J. Hirschberg, and S. Benus, “Affirmative cue words in task-oriented dialogue,” Computational Linguistics, vol. 38, no. 1, pp. 1-39, 2012.

[10] H. Bunt, “Multifunctionality in dialogue,” Computer Speech & Language, 2011.

[11] V. Petukhova and H. Bunt, “Towards an integrated scheme for semantic annotation of multimodal dialogue data,” in Proceedings of the Seventh International Conference on Language Resources and Evaluation, 2010, pp. 2556-2563.

[12] J. Allwood, J. Nivre, and E. Ahlsen, “On the semantics and pragmatics of linguistic feedback,” Journal of Semantics, vol. 9, 1992.

[13] A. Lascarides and N. Asher, “Grounding and correcting commitments in dialogue,” Journal of Semantics, 2009.

[14] J. Ginzburg, The Interactive Stance: Meaning for Conversation. Oxford University Press, 2012.

[15] P. Blache, R. Bertrand, and G. Ferre, “Creating and exploiting multimodal annotated corpora: the ToMA project,” Multimodal Corpora, pp. 38-53, 2009.

[16] C. Astesano, E. Bard, and A. Turk, “Structural influences on initial accent placement in French,” Language and Speech, vol. 50, no. 3, pp. 423-446, 2007.

[17] J. Allwood, L. Cerrato, K. Jokinen, C. Navarretta, and P. Paggio, “The MUMIN coding scheme for the annotation of feedback, turn management and sequencing phenomena,” Language Resources and Evaluation, vol. 41, no. 3, pp. 273-287, 2007.

[18] M. Gubian, L. Boves, and F. Cangemi, “Joint analysis of F0 and speech rate with functional data analysis,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, IEEE, 2011, pp. 4972-4975.


Investigating the influence of pause fillers for automatic backchannel prediction

Stefan Scherer1, Derya Ozkan1, Louis-Philippe Morency1

¹ Institute of Creative Technologies, University of Southern California, United States, [email protected]

1. Introduction

Hesitations and pause fillers (e.g. “um”, “uh”) occur frequently in everyday conversations or monologues. They can be observed for a wide range of reasons, including lexical access, structuring of utterances, and requesting feedback from the listener [1]. In this study we investigate the usefulness of pause fillers as a feature for the prediction of backchannels using conditional random fields (CRFs) [2] within a large corpus of interactions.

Backchannel feedback (i.e. the nods and paraverbals such as “uh-hu” and “mm-hmm” that listeners produce as someone is speaking) plays a significant role in determining the nature of a social exchange by showing rapport and engagement [3]. When these signals are coordinated and reciprocated, they can lead to feelings of rapport and promote beneficial outcomes in diverse areas such as negotiations and conflict resolution [4], psychotherapeutic effectiveness [5], improved test performance in classrooms [6] and improved quality of child care [7]. Therefore, the prediction of backchannel feedback can play a significant role in a range of applications. For virtual human systems, for example, correct timing of backchannels could be used to signal active listening or interest in the conversation with the human interlocutor. Additionally, one could provide such systems with a stronger sense of rapport.

The remainder of the paper is organized as follows: in Section 2 we introduce the dataset utilized in the study. Section 2.1 statistically evaluates the relation between backchannel feedback and hesitations, giving a rough sense of the applicability of hesitations to the prediction of backchannels. Section 3 reports the backchannel prediction experiments conducted and the results achieved. Finally, Section 4 discusses the results and provides an outlook for further investigations.

2. Dataset

In this study we utilized a large dataset of 43 unique interactions¹. The data was recorded in human-human interactions with two unique interlocutors in each conversation [8]. One participant was instructed to be the listener while the other person narrated a video clip taken from a sexual harassment awareness video by Edge Training Systems.

Synchronized multimodal data from each participant, including voice and video, were collected. Both the speaker and listener wore a lightweight headset with microphone. The average signal-to-noise ratio is very low at about 11.95 dB, indicating a relatively high level of noise within the data.

Human coders manually annotated the narratives, including pauses and hesitations, i.e. filled pauses (e.g. “um”, “uh”), as well as incomplete and prolonged words; the transcriptions were double-checked by a second transcriber.

¹ http://rapport.ict.usc.edu

[Figure 1: bar chart; x-axis: Coder ID (coders 1-9 and actual listeners A); y-axis: Overlap (%); three bars per coder: no tolerance, 0.5 sec tolerance, 1 sec tolerance.]

In the present study we focused on the annotated hesitations and backchannels. The vocabulary of hesitations includes the following words: “um”, “uh”, “er” and “ah”. In total we found 470 such annotations in the dataset, with an average length of 0.32 seconds (0.12 standard deviation), uttered by 50 unique speakers. The rest of the words within the dataset have an average length of 0.29 seconds (0.17 standard deviation).

In total we observed 690 backchannels within the conversations, but the feedback behavior of individual listeners varied a lot. In order to provide the automatic prediction model with more homogeneous training data, we employed the parasocial consensus sampling (PCS) paradigm [9], which enables efficient label acquisition from multiple coders. PCS is applied by having participants watch pre-recorded videos drawn from the RAPPORT dataset. In [9], nine participants were recruited and told to pretend they were active listeners and to press a key whenever they felt like providing backchannel feedback. This provides us with responses from multiple listeners all interacting with the same speaker.

2.1. The statistical relation between hesitations and backchannels

In this section we investigate the statistical relation between backchannels and hesitations. As mentioned above, we found 690 backchannels produced by the actual listeners and 319 hesitations uttered by the speakers in 43 unique interactions. The nine additional coders provided on average 644.7 backchannels. In Figure 1 the percentage of overlapping hesitations and backchannels is listed for all the coders and for the actual backchannels. The percentage is calculated with respect to the total number of hesitations. Additionally, we show varying so-called “tolerance” levels: level 0 means that the hesitation has to overlap with the backchannel, level 0.5 indicates that the hesitation can follow or precede the backchannel by up to 0.5 seconds, and level 1.0 means that the hesitation can follow or precede the backchannel by up to 1 second.
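The overlap measure can be stated concretely as follows; the interval and event formats (start/end in seconds for hesitations, a single time point per backchannel) are assumptions for the example, not the authors' scoring script.

```python
def overlap_percentage(hesitations, backchannels, tolerance=0.0):
    """hesitations: list of (start, end) intervals in seconds.
    backchannels: list of backchannel times in seconds (e.g. PCS key presses).
    Returns the percentage of hesitations that have at least one backchannel
    within [start - tolerance, end + tolerance]."""
    if not hesitations:
        return 0.0
    hits = sum(
        any(start - tolerance <= t <= end + tolerance for t in backchannels)
        for start, end in hesitations
    )
    return 100.0 * hits / len(hesitations)

# Tolerance levels as in Figure 1:
# for tol in (0.0, 0.5, 1.0):
#     print(tol, overlap_percentage(hesitation_intervals, backchannel_times, tolerance=tol))
```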

It can be seen that for several coders, as well as for the actual backchannel timings, the overlap with hesitations is substantial. Coders 4 through 6 have high overlap percentages, and improvements in the backchannel prediction experiments are expected for those coders. Coders with very little overlap, such as Coder 2, probably do not take hesitations into account when providing backchannel feedback; therefore, no improvement is to be expected for those coders' backchannel predictions.

3. Backchannel prediction experiments

The experiments are based on the CRF approach found in [2]. We combined the multimodal features into one large feature vector for the CRF model, along with the hesitation timings. To be precise, the utilized multimodal features included the following: eye gaze, lowness (i.e. low pitch values), head nods, pause timings and smiles.
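The shape of the data such a model expects can be sketched as follows. This is not the CRF implementation used in the paper; it swaps in the sklearn-crfsuite package and assumes frame-synchronous binary features, purely to illustrate how the multimodal features and the hesitation feature would be combined per frame.

```python
import sklearn_crfsuite

def frame_features(frame):
    """frame: dict of multimodal observations for one time step (assumed layout)."""
    return {
        "eye_gaze": frame["eye_gaze"],       # speaker gazing at listener (0/1)
        "lowness": frame["lowness"],         # low-pitch region detected (0/1)
        "head_nod": frame["head_nod"],
        "pause": frame["pause"],
        "smile": frame["smile"],
        "hesitation": frame["hesitation"],   # the additional pause-filler feature studied here
    }

def train_crf(sequences, labels):
    """sequences: one list of frame dicts per interaction; labels: "BC"/"NO_BC" per frame."""
    X = [[frame_features(f) for f in seq] for seq in sequences]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, labels)
    return crf
```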

We performed hold-out testing on a randomly selected subset of ten interactions. The training set contains the remaining 33 interactions. Model parameters were validated using a three-fold cross-validation strategy on the training set.

3.1. Experimental results

In the experiments, the CRF needs to decide for each input frame whether a backchannel will follow or not. We evaluate the performance of the CRF using a slightly modified version of the F1 measure, the weighted harmonic mean of precision and recall. Precision is the probability that predicted backchannels correspond to actual listener behavior. Recall is the probability that a backchannel produced by a listener in our test set was predicted by the model.

We first find all the “peaks” (i.e., local maxima) in the output probabilities. If a peak coincides with an actual backchannel, it is counted as a hit. If a peak falls outside the boundaries of any backchannel, it is counted as a false alarm, and if no peak is found within the borders of a backchannel, that backchannel counts as a miss. We compare the performance of the CRF utilizing the previously mentioned feature set with and without hesitations as an additional feature.
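The peak-matching evaluation can be written out explicitly; the sketch below assumes a per-frame probability array and (start, end) intervals for actual backchannels, and uses scipy's peak finder, so it illustrates the scoring logic rather than reproducing the authors' exact implementation.

```python
import numpy as np
from scipy.signal import find_peaks

def peak_f1(probs, backchannels, frame_rate=30.0):
    """probs: per-frame backchannel probabilities from the model.
    backchannels: list of (start_s, end_s) intervals of actual listener feedback."""
    probs = np.asarray(probs)
    peaks, _ = find_peaks(probs)                 # local maxima of the output probabilities
    peak_times = peaks / frame_rate
    hits = sum(any(s <= t <= e for s, e in backchannels) for t in peak_times)
    false_alarms = len(peak_times) - hits
    misses = sum(not any(s <= t <= e for t in peak_times) for s, e in backchannels)
    precision = hits / (hits + false_alarms) if hits + false_alarms else 0.0
    recall = (len(backchannels) - misses) / len(backchannels) if backchannels else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```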

Figure 2 summarizes the performance of the experiments for the different coders. For about half of the coders the performance improved, whereas for the other half it declined. It is worth noting that the best performing models, i.e. those for coders 4 (F1 without hesitations: 0.317; with hesitations: 0.321), 5 (without: 0.442; with: 0.449) and 6 (without: 0.352; with: 0.361), showed slight improvements, so hesitations allowed for an improvement over the present baseline. However, the improvements are not significant.

[Figure 2: bar chart; x-axis: Coder ID; y-axis: F-measure; two bars per coder: with hesitations, without hesitations.]

Figure 2: F1 scores for the conditional random field backchan-nel feedback prediction with and without the additional featureof hesitation for the nine individual coders.

4. Summary

In this study we investigated the influence of hesitations on the automatic prediction of backchannels using CRF models. We compared the performance of the models with and without hesitations as an additional feature for the prediction. We saw improvements for several coders; however, no clear trend could be found. We can confirm that different listeners utilize different cues when deciding when to provide a backchannel. The variation in listener behavior should be investigated further.

5. References
[1] R. Carlson, K. Gustafson, and E. Strangert, "Modelling hesitation for synthesis of spontaneous speech," in Proceedings of Speech Prosody, Dresden, Germany, 2006.
[2] L.-P. Morency, I. de Kok, and J. Gratch, "Predicting listener backchannels: A probabilistic multimodal approach," in Conference on Intelligent Virtual Agents (IVA), 2008.
[3] J. Gratch, N. Wang, J. Gerten, and E. Fast, "Creating rapport with virtual agents," in Intelligent Virtual Agents (IVA), 2007.
[4] A. L. Drolet and M. W. Morris, "Rapport in conflict resolution: Accounting for how face-to-face contact fosters mutual cooperation in mixed-motive conflicts," Journal of Experimental Social Psychology, vol. 36, no. 1, pp. 26-50, 2000.
[5] P. Tsui and G. Schultz, "Failure of rapport: Why psychotherapeutic engagement fails in the treatment of Asian clients," American Journal of Orthopsychiatry, vol. 55, pp. 561-569, 1985.
[6] D. Fuchs, "Examiner familiarity effects on test performance: Implications for training and practice," Topics in Early Childhood Special Education, vol. 7, pp. 90-104, 1987.
[7] M. Burns, "Rapport and relationships: The basis of child care," Journal of Child Care, vol. 2, pp. 47-57, 1984.
[8] L.-P. Morency, I. de Kok, and J. Gratch, "Predicting listener backchannels: A probabilistic multimodal approach," in Proceedings of the 8th International Conference on Intelligent Virtual Agents (IVA '08), Berlin, Heidelberg: Springer-Verlag, 2008, pp. 176-190.
[9] L. Huang, L.-P. Morency, and J. Gratch, "Parasocial consensus sampling: Combining multiple perspectives to learn virtual human behavior," in International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2010.


A Testbed for Examining the Timing of Feedback using a Map Task

Gabriel Skantze

Department of Speech Music and Hearing, KTH, Stockholm, Sweden [email protected]

Abstract
In this paper, we present a fully automated spoken dialogue system that can perform the Map Task with a user. By implementing a trick, the system can convincingly act as an attentive listener without any speech recognition. An initial study is presented where we let users interact with the system and recorded the interactions. Using this data, we have trained a Support Vector Machine on the task of identifying appropriate locations to give feedback, based on automatically extractable prosodic and contextual features. At 200 ms after the end of the user's speech, the model can identify response locations with an accuracy of 75%, compared to a baseline of 56.3%.

1. Introduction
Spoken dialogue systems have traditionally rested on a very simplistic model of the interaction when it comes to turn-taking and feedback. Typically, a silence threshold has been used to detect when the user has finished speaking, after which the system starts to process the user's utterance and produce a response. Silence, however, is not a very good indicator: sometimes a speaker just hesitates and no turn change is intended, sometimes the turn changes after barely any silence [1]. Human interlocutors appear to use several knowledge sources, such as prosody, syntax and semantics, to detect or even project suitable places to give feedback or take the turn [2]. Feedback may often be given in the middle of the interlocutor's speech in the form of backchannels – short utterances such as "mhm" or "yeah" that are produced without the intention of claiming the floor [3]. Recently, there has been a lot of interest in developing spoken dialogue systems that model this behaviour. An example of this is the Numbers system [4] – a completely incremental dialogue system that could give rapid feedback as the user was speaking, with a very short latency of around 200 ms, partly using prosodic information. However, to make the task feasible, the domain was limited to number dictation.

In this paper, we present a dialogue system that can perform the Map Task [5]. Map Task is a common experimental paradigm for studying human-human dialogue, where one subject (the information giver) is given the task of describing a route on a map to another subject (the information follower). In our case, the user acts as the giver and the system as the follower. The choice of Map Task is motivated partly because the system may allow the user to keep the initiative during the whole dialogue, and thus only produce responses that are not intended to take the initiative, most often some kind of feedback. Thus, the system might be described as an attentive listener. Implementing a Map Task dialogue system with full speech understanding would indeed be a challenging task, given the state of the art in automatic recognition of conversational speech. In order to make the task feasible, we have implemented a trick: the user is presented with a map on a screen (see Figure 1) and instructed to move the mouse cursor along the route as it is being described. The user is told that this is for logging purposes, but the real reason for this is that the system tracks the mouse position and thus knows what the user is currently talking about. It is thereby possible to produce a coherent system behaviour without any speech recognition at all, only basic speech detection. This often results in a very realistic interaction, compared to what users are typically used to when interacting with dialogue systems – in our experiments, several users first thought that there was a hidden operator behind it. An example video can be seen at http://www.youtube.com/watch?v=MzL-B9pVbOE.

We think that this system provides an excellent testbed for doing experiments on turn-taking and feedback in an interactive setting. In our initial study presented here, we focus on the task of finding suitable places to give feedback as the user is speaking. The study can be regarded as a first step in a "bootstrapping" procedure, where we have started by implementing a first iteration of the system and then allowed users to interact with it. A classifier has then been trained on automatically extractable features. This setup will allow us to test the derived model in interaction with users, using exactly the same setting.

Figure 1: The user interface, showing the map.

2. Timing of feedback
There are many studies which investigate the cues that may help humans determine where it is appropriate to give feedback, and which thus could be useful for a dialogue system. A common procedure is to build and test a statistical model or classifier on a corpus of human-human interactions, trying to predict the behavior of one of the interlocutors [2,6,7,8]. The cues that turn out to be important are often related to prosody or syntax, but some studies also look into other modalities such as gaze [8]. Prosodic cues typically involve a final falling or rising pitch or final low/high pitch levels, but duration and energy may also play a role. Besides prosody, a very strong cue is syntactic or semantic completeness, where non-completeness (e.g., "Then you turn around the...") obviously indicates that it is not appropriate to take the turn or give a backchannel. A common feature to use for this is n-gram part-of-speech models [7,9]. A common finding is also that the combination of different types of features tends to improve the model [2,9].

One should be aware, however, that it might be problematic to use a corpus of human-human dialogue as a basis for implementing a dialogue system component. One problem is the interactive nature of the task. If the classifier produces a slightly different behaviour than what was found in the original data, this would likely result in a different behaviour in the interlocutor, which is never evaluated. Another problem is that it is hard to know how well such a model would work in a dialogue system, since humans are likely to behave differently towards a system as compared to another human (even if a more human-like behaviour is being modelled). Yet another problem is that much dialogue behaviour is optional, which makes the actual behaviour hard to use as a gold standard. For example, there are many places where a human may take the turn or produce backchannels, but which are never realised. Indeed, many studies on identifying backchannel cues based on human-human interactions report a relatively poor accuracy of about 20-35% [6,7,8]. It is also possible that a lot of human behaviour that is "natural" is not necessarily preferable for a dialogue system to reproduce, depending on the purpose of the dialogue system.

A common approach for experimenting with human-computer dialogue in an interactive setting without a speech recognizer is to use a Wizard-of-Oz setup, where a hidden operator replaces parts of the system. This might be hard to do, however, when the issue under investigation is time-critical behaviours such as backchannels. We therefore think that the bootstrapping approach presented here is an interesting alternative. A problem here is how to know where the system should have reacted when training the model. While several sophisticated methods for such annotation have been suggested [10], we here rely on manual offline annotation.

In the Map Task dialogue system we have implemented, we have not only used backchannels, but also other types of feedback, such as clarification requests. A general distinction is often made in the literature between the timing of backchannels and other types of responses. It is not entirely clear, however, to which of these categories the different types of active listener responses we explore here belong (do they claim the floor or not?). Thus, we make no such distinction in this study – the task is simply to find suitable places for an active listener to respond, regardless of whether a backchannel or a clarification request is deemed appropriate (a choice that should be made depending on the system's level of understanding).

Many of the studies cited above use a combination of manually annotated and automatically extractable features. In this study, we want to restrict the model to only use automatically extractable features found in the left context (i.e., available for incremental processing), in order to be able to test the derived model online in an interactive setting. Given that we currently use no speech recognition, we cannot use any syntactic or semantic features. Thus, we will mainly look at prosodic features. However, unlike most other studies mentioned above, we will also examine the use of contextual features that involve the interlocutor's (i.e., the system's) behaviour.

3. Dialogue system components
The basic components of the system can be seen in Figure 2. Dashed lines indicate components that were not part of the first iteration of the system, but which we have explored offline (as described further down) and which we will use in the next iteration. The system uses a simple energy-based speech detector to chunk the user's speech into inter-pausal units (IPUs), that is, periods of speech that contain no sequence of silence longer than 200 ms. Such a short threshold allows the system to give backchannels (seemingly) while the user is speaking or to take the turn with barely any gap. Similarly to [9] and [2], we define the end of an IPU as a candidate for the Response Location Detector (RLD) to identify as a Response Location (RL). We will use the term turn to refer to a sequence of IPUs which do not have any responses between them.

Figure 2: The basic components of the system.
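The IPU chunking step can be sketched as follows; the 10 ms frame size, the energy threshold, and the offline array interface are assumptions (the actual system operates incrementally on a live audio stream).

import numpy as np

def chunk_into_ipus(frame_energies, frame_ms=10, energy_threshold=-50.0, max_silence_ms=200):
    """Group speech frames into inter-pausal units (IPUs).

    frame_energies : per-frame log energy (dB) at frame_ms resolution.
    An IPU is closed once more than max_silence_ms of consecutive
    sub-threshold frames have been seen. Returns (start_ms, end_ms) tuples.
    """
    max_silence_frames = max_silence_ms // frame_ms
    ipus, start, end, silence_run = [], None, None, 0
    for i, e in enumerate(frame_energies):
        if e >= energy_threshold:          # speech frame
            if start is None:
                start = i
            end = i
            silence_run = 0
        elif start is not None:            # silence after speech
            silence_run += 1
            if silence_run > max_silence_frames:
                ipus.append((start * frame_ms, (end + 1) * frame_ms))
                start, silence_run = None, 0
    if start is not None:
        ipus.append((start * frame_ms, (end + 1) * frame_ms))
    return ipus

# Toy example: 400 ms of speech, a 300 ms pause, then 300 ms of speech.
energies = np.array([-30.0] * 40 + [-70.0] * 30 + [-30.0] * 30)
print(chunk_into_ipus(energies))           # [(0, 400), (700, 1000)]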

Each time the RLD detected an RL, the dialogue manager produced a Response, depending on the current state of the dialogue and the position of the mouse cursor. Table 1 shows the different types of responses the system could produce. The dialogue manager always started with an Introduction and ended with an Ending, once the mouse cursor had reached the destination. Between these, it selected from the other responses, partly randomly, but also depending on the length of the last user turn and the current mouse location. Longer turns often led to Restart or Repetition Requests, thus discouraging longer sequences of speech that did not invite the system to respond. If the system detected that the mouse had been at the same place for a longer time, it pushed the task forward by making a Guess response. We also wanted to explore other kinds of feedback than just backchannels, and therefore added short Reprise Fragments and Clarification Requests (see for example [14] for a discussion of these).

Table 1: Different responses from the system.

Introduction: "Could you help me to find my way to the train station?"
Backchannel: "Yeah", "Mhm", "Okay", "Uhu"
Reprise Fragment: "A station, yeah"
Clarification Request: "A station?"
Restart: "Eh, I think I lost you at the hotel, how should I continue from there?"
Repetition Request: "Sorry, could you take that again?"
Guess: "Should I continue above the traffic lights?"
Ending: "Okay, thanks a lot."

For speech synthesis, we use the CereVoice unit selection synthesizer developed by CereProc (www.cereproc.com). Since conversational speech (such as backchannels and fragmentary utterances) typically does not come out very well from off-the-shelf speech synthesizers, CereProc was contracted to complement the voice with recordings of a range of backchannel sounds, as well as Reprise Fragments and Clarification Requests containing the landmarks that were used on the maps in the experiment.

4. Data collection and processing
In this study, we want to explore how to improve the Response Location Detector by training it on data collected from users interacting with a first iteration of the system. Since we initially did not have any sophisticated model for the RLD, it was simply set to wait for a random period between 0 and 800 ms after an IPU ended. If no new IPUs were initiated during this period, an RL was detected, resulting in random response delays between 200 and 1000 ms.

4.1. Data collection and annotation

Ten subjects participated in the data collection. They were seated in front of the display showing the map, wearing a headset. The instructor told them that they were supposed to describe a route to the computer and that they should imagine another person having a picture similar to the one on the screen, but without the route. Each subject did five consecutive tasks with five different maps, resulting in a total of 50 dialogues.

The users' speech was recorded and all events in the system were logged. Each IPU was then manually annotated into three categories: Hold (a response would be inappropriate), Respond (a response is expected) and Optional (a response would not be inappropriate, but it is perfectly fine not to respond). The annotator was given a tool with which the dialogue was played up to the end of the IPU and then paused, so that the annotation could be made based on the left context only. To check the reliability of this coding, one dialogue from each subject (i.e., 20% of the material) was annotated by a second person. Over the three categories, the kappa score was 0.68 – a substantial agreement. In only 6.7% of the instances did one annotator select Hold and the other Respond. We then picked out all instances where the first annotator had selected one of these two categories, in order to train a classifier to discriminate between them, thus removing all Optional IPUs (about 15%) from the data set (whether an Optional IPU is classified as Hold or Respond should not matter much). In total, this dataset contained 1780 IPUs. 56.3% of these were of the class Respond, which constitutes our majority-class baseline (i.e., the accuracy of the RLD if it were to produce a Response Location for each IPU). It should be noted that the current model does not allow for feedback within an IPU (as in [6]). It is as yet unclear how problematic this limitation is; none of the annotators felt the need to mark RLs at other locations than at the end of IPUs.
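For reference, the agreement figure corresponds to Cohen's kappa over the two annotators' Hold/Respond/Optional labels; a small sketch with made-up label sequences (only the formula is standard).

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical IPU labels from two annotators.
a = ["Respond", "Hold", "Respond", "Optional", "Hold", "Respond"]
b = ["Respond", "Hold", "Hold", "Optional", "Hold", "Respond"]
print(round(cohens_kappa(a, b), 2))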

4.2. Extracting features

Next, a set of features was extracted for all IPUs. As stated above, we wanted to test two types of features: Prosodic and Contextual. To extract the Prosodic features, a pitch tracker based on the YIN algorithm [11] was used. The pitch was transformed to a log scale and z-normalized for each user. The last 200 ms voiced region was then identified for each IPU. For this region, the mean pitch and the slope of the pitch (using linear regression) were used as features, as well as their absolute values. The mean energy (again on a log scale, z-normalized) was also computed for this region. As Contextual features, we used the last system response, as well as the length of the current IPU and the length of the current turn.
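A sketch of the per-IPU feature computation, assuming pitch and energy have already been tracked at a 10 ms frame rate and z-normalized per user; the frame rate, variable names and handling of unvoiced frames are assumptions.

import numpy as np

FRAME_S = 0.01   # assumed 10 ms analysis frames

def prosodic_features(f0, energy, voiced, region_s=0.2):
    """Features over the last 200 ms of voiced speech in an IPU.

    f0, energy : per-frame log-scale, z-normalized pitch and energy.
    voiced     : per-frame boolean voicing decisions.
    """
    idx = np.where(voiced)[0]
    if len(idx) == 0:
        return None
    region = idx[-int(region_s / FRAME_S):]     # last voiced frames, up to 200 ms worth
    t, pitch = region * FRAME_S, f0[region]
    slope = np.polyfit(t, pitch, 1)[0] if len(region) > 1 else 0.0
    return {"pitch_mean": float(pitch.mean()),
            "pitch_slope": float(slope),
            "abs_pitch_mean": abs(float(pitch.mean())),
            "abs_pitch_slope": abs(float(slope)),
            "energy_mean": float(energy[region].mean())}

def contextual_features(last_system_response, ipu_length_s, turn_length_s):
    """Contextual features: last system response type and the two length measures."""
    return {"last_sys_utt": last_system_response,
            "ipu_length": ipu_length_s,
            "turn_length": turn_length_s}

# Toy usage with random frame-level tracks for a 1.2 s IPU.
f0 = np.random.randn(120)
energy = np.random.randn(120)
voiced = np.ones(120, dtype=bool)
print(prosodic_features(f0, energy, voiced))
print(contextual_features("Backchannel", ipu_length_s=1.2, turn_length_s=3.4))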

5. Results

5.1. Algorithms and feature sets

The WEKA machine learning software suite [12] was used for the classification task. Two different machine learning algorithms were tested (with the default WEKA parameters): CART (a decision tree) and Support Vector Machines (SVM). The classifiers were evaluated using 10-fold cross-validation. The accuracy (percent correct classifications) for the different feature sets is shown in Table 2. As can be seen, the best result (75%) is achieved with SVM on the full feature set. All results are significantly better than the baseline of 56.3% (t-test; p < 0.05).

Table 2: The accuracy of the different algorithms. Significant differences are indicated with "<" (t-test; p < 0.05).

                    CART       SVM
Context             66.1%      66.0%
                      ˄          ˄
Prosody             69.4%      69.6%
                      ˄          ˄
Prosody + Context   72.6%   <  75.0%
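The paper uses WEKA with default parameters; purely as an illustration, a comparable setup can be sketched with scikit-learn as a stand-in (the feature matrix is random placeholder data, the categorical last-system-response feature would need one-hot encoding, and scikit-learn's defaults differ from WEKA's).

import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical features: [pitch_mean, pitch_slope, energy_mean, ipu_length, turn_length]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)     # 1 = Respond, 0 = Hold

for name, clf in [("CART-like tree", DecisionTreeClassifier()), ("SVM", SVC())]:
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: mean 10-fold accuracy = {acc:.3f}")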

5.2. Effect of response delay

The classification above, as well as the baseline, is based on the assumption that the system should be able to respond in just 200 ms. This is a much shorter delay than what is most often used in spoken dialogue systems (typically 500-1000 ms), but might be necessary if responses like backchannels are to be produced "in the middle" of the user's speech. However, by delaying the response, a lot of false positives may be avoided (often short hesitations), as the onset of new IPUs might be detected during this delay and stop the system from responding. While it will also introduce some false negatives (making the system wait too long and miss an RL), this number is much smaller. Figure 3 shows how a longer response delay affects the performance of the best classifier (the SVM), as well as the baseline. As can be seen, the relative improvement of the SVM classifier is not as big as the relative improvement of the baseline. Thus, while a naive system would clearly benefit from delaying the response, this is not as beneficial for the SVM classifier. Another way of looking at this is that the SVM classifier can produce a similar performance after just 200 ms as a naive system that would simply wait for 1000 ms after each IPU before giving a response.
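This filtering effect can be simulated offline: a candidate response location is only kept if no new IPU from the user starts within the delay window. A small sketch follows; the data structures and times are hypothetical.

def filter_by_delay(ipu_ends, ipu_starts, delay_s):
    """Keep only IPU endings not followed by new user speech within delay_s.

    ipu_ends   : times (s) at which IPUs end, i.e. candidate response locations.
    ipu_starts : times (s) at which IPUs start.
    Returns the times at which the system would actually respond.
    """
    kept = []
    for end in ipu_ends:
        if not any(end < s <= end + delay_s for s in ipu_starts):
            kept.append(end + delay_s)     # no new speech during the delay: respond
    return kept

# Hypothetical boundaries: the IPU ending at 3.1 s is a short hesitation,
# since new speech starts again at 3.4 s and suppresses the response.
ipu_starts = [0.0, 3.4, 7.2]
ipu_ends = [3.1, 6.5, 9.2]
print(filter_by_delay(ipu_ends, ipu_starts, delay_s=0.5))   # [7.0, 9.7]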

Figure 3: Effect of response delay on the accuracy.

[Plot of accuracy (%) against response delay (0.2-1.0 s) for the Prosody+Context, Prosody, and Context classifiers and the baseline.]


5.3. Looking into the selected features

While the CART classifier doesn't show the same performance as the SVM classifier, it is interesting to look into the decision tree that is produced, to get an understanding of how the features contribute to the classification. This is illustrated in Figure 4. The initial split is made between a relatively flat pitch (left branch) and a rising or falling pitch (right branch). The latter generally leads to Respond, as typically found in related studies. However, there is an exception for very short IPUs with a moderate slope that don't follow a Clarification Request (CR) (which typically triggers a simple "yes"). On the left side, we can again see that IPUs following system utterances that often trigger short user responses are labelled as Respond. Interestingly, for longer turns, an invitation to respond seems to be associated with a low pitch region, while shorter turns end with a high pitch region. This nicely illustrates how the contextual and prosodic features are combined. Another interesting finding is that the algorithm apparently has clustered Intro, Guess and CR as utterances that typically trigger very short responses like "yes" (compare with Table 1). It is especially interesting to see that Reprise Fragment is not found in this category, despite the apparent similarity to the Clarification Requests. The difference in the realisation of these was mainly prosodic – a rising pitch at the end of a Clarification Request and a falling pitch at the end of a Reprise Fragment (similar to the patterns described in [13]), which obviously had an effect on the users' behaviour. Pragmatically, these can be compared to "explicit" and "implicit" verification requests in traditional dialogue systems [14]. Thus, a Clarification Request should always require some kind of response, whereas a Reprise Fragment should not need a response if it is correct.

Figure 4: CART tree for the full feature set; solid line = true; dashed line = false.

6. Conclusions and Future work
The best classifier, SVM, was able to correctly identify Response Locations after 75% of all IPUs, using contextual and prosodic features, resulting in a response time of about 200 ms. This is similar to the performance of a naive system that would wait for 1000 ms before responding. As stated above, the next step is to use the model in the Response Location Detector in the system (as illustrated in Figure 2) and test it with users. We may then see how much the performance actually improves in an interactive setting, using both objective and subjective measures.

The two human annotators agreed on 93.3% of the instances, which may be regarded as a kind of maximal performance. In the current study, the performance of the SVM classifier peaks at about 80%, even if a response delay is introduced. To further improve the classification, other kinds of features related to syntax and semantics are probably needed, as indicated by related studies. A possible extension would be to use an ASR in the system to extract such features. Even if the results were unreliable, they could possibly help to improve the performance to some extent.

We think that the system presented here provides an excellent testbed for doing experiments on turn-taking and feedback in an interactive setting. While we think the Map Task domain in itself provides valuable insights into feedback behaviour, it is also similar to many practical dialogue systems, where the system needs to understand longer instructions and act as an active listener.

7. Acknowledgements
This work is partly supported by the European Commission project IURO (grant no. 248314) and the Swedish research council (VR) project Incremental processing in multimodal conversational systems (#2011-6237). Thanks also to Raveesh Meena for help with the annotation, and to Joakim Gustafson and Anna Hjalmarsson for helpful discussions.

8. References
[1] Sacks, H., Schegloff, E., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 50, 696-735.
[2] Koiso, H., Horiuchi, Y., Tutiya, S., Ichikawa, A., & Den, Y. (1998). An analysis of turn-taking and backchannels based on prosodic and syntactic features in Japanese Map Task dialogs. Language and Speech, 41, 295-321.
[3] Yngve, V. H. (1970). On getting a word in edgewise. In Papers from the sixth regional meeting of the Chicago Linguistic Society (pp. 567-578). Chicago.
[4] Skantze, G., & Schlangen, D. (2009). Incremental dialogue processing in a micro-domain. In Proceedings of EACL. Athens, Greece.
[5] Anderson, A., Bader, M., Bard, E., Boyle, E., Doherty, G., Garrod, S., Isard, S., Kowtko, J., McAllister, J., Miller, J., Sotillo, C., Thompson, H., & Weinert, R. (1991). The HCRC Map Task corpus. Language and Speech, 34(4), 351-366.
[6] Ward, N., & Tsukahara, W. (2000). Prosodic features which cue back-channel responses in English and Japanese. Journal of Pragmatics, 32(8), 1177-1207.
[7] Cathcart, N., Carletta, J., & Klein, E. (2003). A shallow model of backchannel continuers in spoken dialogue. In Proceedings of EACL. Budapest.
[8] Morency, L. P., de Kok, I., & Gratch, J. (2008). Predicting listener backchannels: A probabilistic multimodal approach. In Proceedings of IVA (pp. 176-190). Tokyo, Japan.
[9] Gravano, A., & Hirschberg, J. (2009). Backchannel-inviting cues in task-oriented dialogue. In Proceedings of Interspeech 2009 (pp. 1019-1022). Brighton, U.K.
[10] de Kok, I. A., & Heylen, D. K. J. (2012). Observations on listener responses from multiple perspectives. In Proceedings of the 3rd Nordic Symposium on Multimodal Communication (pp. 27-28). Helsinki, Finland.
[11] de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917-1930.
[12] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1).
[13] Skantze, G., House, D., & Edlund, J. (2006). User responses to prosodic variation in fragmentary grounding utterances in dialogue. In Proceedings of Interspeech 2006 (pp. 2002-2005). PA, USA.
[14] Skantze, G. (2007). Error Handling in Spoken Dialogue Systems - Managing Uncertainty, Grounding and Miscommunication. Doctoral dissertation, KTH, Department of Speech, Music and Hearing.

[Figure 4 tree nodes: Abs Pitch Slope < 0.295; Last Sys Utt = [Intro|Guess|CR]; IPU Length < 0.375; Abs Pitch Slope < 1.14; Turn Length < 1.575; Pitch Mean < 1.22; Pitch Mean < -0.93; Last Sys Utt = [CR]; leaves labelled Respond or Hold.]


Clarification Questions with Feedback

Svetlana Stoyanchev, Alex Liu, Julia Hirschberg

Computer Science, Columbia University, New York, NY, USA

[email protected], [email protected], [email protected]

Abstract
In this paper, we investigate how people construct clarification questions. Our goal is to develop similar strategies for handling errors in automatic spoken dialogue systems in order to make error recovery strategies more efficient. Using a crowd-sourcing tool [7], we collect a dataset of user responses to clarification questions when presented with sentences in which some words are missing. We find that, in over 60% of cases, users choose to continue the conversation without asking a clarification question. However, when users do ask a question, our findings support earlier research showing that users are more likely to ask a targeted clarification question than a generic question. Using the dataset we have collected, we are exploring machine learning approaches for determining which system responses are most appropriate in different contexts and developing strategies for constructing clarification questions.1

Index Terms: clarification, question

1. Introduction
1.1. Clarifications in Human Dialogue

Clarification questions are common in human-human dialogue. They help dialogue participants maintain dialogue flow and resolve misunderstandings. A clarification question may be asked by a listener who fails to hear or understand part of an utterance. Requesting information is not the only role of a clarification question; it also helps ground communication by providing feedback indicating which information is known and understood.

In the following example [5], Speaker B has failed to hear the word toast and so constructs a clarification question using a portion of the correctly understood utterance — the word some — to query the portion of the utterance B has failed to understand:

A: Can I have some toast please?
B: Some?
A: Toast.

Such targeted clarification questions signal the location of the recognition error to the hearer. In this case, Speaker A is then able to respond with a minimal answer to the question — filling in only the missing information.

1 This work was partially funded by DARPA HR0011-12-C-0016 as a Columbia University subcontract to SRI International.

Human speakers employ diverse clarification strategies in dialogue. Examining human clarification strategies, Purver [5] distinguishes two types of clarification questions: reprise and non-reprise questions. Reprise questions are questions like B's query above, in which a portion of the interlocutor's utterance believed to have been recognized correctly is repeated as context for the portion believed to have been misrecognized or simply unheard. Non-reprise questions are simply generic requests for a repeat or rephrase of a previous utterance, such as What did you say? or Please repeat. Such questions do not include contextual information from the previous utterance. Reprise clarification questions, on the other hand, ask a targeted question about the part of an utterance that was misheard or misunderstood, using portions of the misunderstood utterance which are thought to be correctly recognized.

In human-human dialogues, reprise clarifications are much more common than non-reprise questions, which explicitly signal an error without providing information about its location. However, spoken dialogue systems predominantly use non-reprise strategies to indicate recognition errors to their users, typically requesting that the user repeat or rephrase their utterance [2]. Constructing non-reprise questions is significantly simpler than creating reprise questions and can easily be hard-coded in the system, since they do not include contextual information. However, to construct a reprise clarification question, a system must first determine which part of an utterance it believes contains an error. It must then construct an appropriate question based upon information in the correctly recognized part of the utterance.

In this paper we describe the collection of a corpus of clarification questions from Mechanical Turk [7] workers who were asked to indicate how they would respond to an utterance containing some unknown words. Such utterances were created from a set of misrecognized utterances in which blanks were substituted for recognition errors. We describe these annotators' recovery strategies, including the type of question asked or request made to recover missing information from an utterance. Our ultimate goal is to learn the relationship between clarification strategies and features of misrecognized utterances in order to develop automated methods for developing better error recovery strategies in spoken dialogue systems. We are currently developing such a process for a speech-to-speech (S2S) translation system in which the Dialogue Manager can query users about hypothesized misrecognitions, out-of-vocabulary (OOV) items, and translation errors before a translation is presented to the interlocutor.

1.2. Clarification in Speech-to-Speech Translation Systems

In an S2S translation system, two speakers communicate orally in two different languages through two ASR systems and two Machine Translation (MT) systems. Such a system takes speech in one language as input, recognizes it using an Automatic Speech Recognition (ASR) system, translates the recognized input into text in another language, and finally produces synthetic speech output from the translation for the conversational partner. In the S2S application we target, speakers converse freely about topics which may be pre-specified in very general terms. When an ASR error is hypothesized in a speaker utterance, the clarification component of the system seeks to clarify the error with the speaker before passing a corrected ASR transcription on to the MT component. In this way, the clarification component attempts to intercept speech recognition errors early in the dialogue to avoid translating poorly recognized utterances. In parallel research we have also developed a method for localized ASR error detection in the output of the speech recognizer of an S2S translation system.

The ability to produce reprise clarification questions in S2S translation systems is especially important. While in a form-filling dialogue system clarification questions can be designed around a set of specific domain concepts, in an open-domain system such information is not available. For example, if a user of a closed-domain system, such as an airline reservation system, mentions a departure location which the system misrecognizes, the system may construct a predefined clarification question Leaving from where?. However, an open-domain translation system must accept input on a variety of topics and cannot rely upon users mentioning a particular set of domain concepts. Reprise clarification questions constructed by such systems must be generated dynamically. In our experiment, we collect questions for English utterances containing errors from an open-domain S2S translation system. Our motivation is to develop a reprise clarification strategy containing feedback and grounding information. We hypothesize that a system capable of asking clarification questions that are more similar to the types of questions that humans ask will be more natural and lead to more efficient error recovery.

In Section 2, we describe previous research on user responses to errors in spoken dialogue. In Section 3, we describe the data collection experiment and analyze our results. We conclude in Section 4 with our plan for the use of the described dataset for learning strategies in a dialogue system.

2. Related work
A number of researchers in spoken dialogue have studied user responses to errors in dialogue. For example, Skantze [6] collected and analyzed user responses to speech recognition errors in a direction-giving domain in Swedish, using a speech recognizer to corrupt human-human speech communication in one direction. Williams and Young [9] performed a Wizard-of-Oz study in a tourist information dialogue system in which recognition errors were systematically controlled. Koulouri and Lauria [4] performed another Wizard-of-Oz study in a human-robot instructions domain, with the "wizard" playing the role of a robot with restricted communication capabilities. In all of these studies, results indicate that, when subjects encounter speech recognition problems, they tend to ask task-related questions providing feedback to the other speaker and confirming their hypothesis about the situation. These studies also find that speakers rarely give a direct indication of their misunderstanding to the system, irrespective of the system's word error rate. Williams and Young's findings suggest that, at moderate speech recognition levels, asking task-related questions appears to be a more successful strategy for recovering from error than directly signaling the error itself.

In our study, we collect a (text) corpus of human responses to missing information in ASR transcriptions. We will use this corpus in future research to improve our dialogue clarification strategy by automatically creating targeted reprise clarification questions in response to errors in an open-domain S2S translation system. However, we believe this strategy will also be relevant to other open-domain spoken dialogue systems.

3. Experiment
3.1. Dataset

We perform our experiments on data from SRI's IraqComm speech-to-speech translation system [1]. The data were collected by NIST during seven months of evaluation exercises performed between 2005 and 2008 [8]. The corpus contains acted dialogues between English and Arabic speakers. Table 1 shows a sample dialogue from the dataset, with correct English translations for the Arabic utterances. The dataset is manually transcribed. We tag the manual transcript of the dataset with part-of-speech (POS) tags using the Stanford POS tagger [3]. We identify the POS tags of misrecognized words by aligning the ASR output with the transcript.
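The alignment step can be sketched with a simple word-level alignment, here using Python's difflib as a stand-in for whatever alignment tool was actually used; the example sentence and word are taken from Table 3, while the POS tag strings are illustrative.

import difflib

def misrecognized_words(reference, hypothesis, ref_pos_tags):
    """Return (word, POS) pairs of reference words that the ASR got wrong.

    reference    : manually transcribed words.
    hypothesis   : ASR output words.
    ref_pos_tags : POS tag for each reference word (e.g. from a POS tagger).
    """
    matcher = difflib.SequenceMatcher(a=reference, b=hypothesis)
    errors = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):     # substituted or deleted reference words
            errors.extend((reference[i], ref_pos_tags[i]) for i in range(i1, i2))
    return errors

ref = "do you own a hardhat".split()
hyp = "do you own a".split()               # "hardhat" was misrecognized/dropped
tags = ["VBP", "PRP", "VB", "DT", "NN"]    # illustrative tags
print(misrecognized_words(ref, hyp, tags)) # [('hardhat', 'NN')]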

English: good morning
Arabic: good morning
English: may i speak to the head of the household
Arabic: i'm the owner of the family and i can speak with you
English: may i speak to you about problems with your utilities
Arabic: yes i have problems with the utilities

Table 1: Sample dialogue from the IraqComm Corpus.

In our data collection, we use 475 English utterances from the dataset.2 Each utterance we present to an annotator contains exactly one ASR error. We use a crowd-sourcing resource, Amazon Mechanical Turk (AMT) [7], to obtain human judgments about error recovery strategies for these utterances.

3.2. Method

The experiment is text-based. We gave each AMT worker an original user utterance from the dataset's manual transcript. The words misrecognized by the automatic speech recognizer were replaced by "XXX" to indicate a recognition error.3 This is intended to simulate a dialogue system's automatic detection of misrecognized words in an utterance. We ask the AMT workers to answer a set of questions about their perception of the misrecognized utterance and then ask them how they would try to recover the missing information for the sentence. Table 2 shows a sample sentence and the questions presented to the participants. Each sentence was presented to three AMT workers.

Original user utterance with an ASR error:
how many XXX doors does this garage have

Questions to participants:
1. Is the meaning of the sentence clear to you despite the missing word?
2. What do you think the missing word could be? If you're not sure, you may leave this space blank.
3. What type of information do you think was missing?
4. If you heard this sentence in a conversation, would you continue with the conversation or would you stop the other person to ask what the missing word is?
5. If you answered "stop to ask what the missing word is", what question would you ask?

Table 2: Questions given to annotators.

From this annotation we are able to investigate human strategies for 1) the choice of action: continue the dialogue or engage in clarification; 2) the type of clarification question (reprise vs. non-reprise); and 3) the grammatical structure of the reprise questions they produce. Below we discuss results from an initial analysis of this corpus.

2 This is an ongoing study and we are continuing to collect more data.

3 In the current dataset each error contains exactly one misrecognized word. We are now collecting data where multiple words may have been misrecognized.

3.3. Results

For each input sentence, the annotators had to decide first whether they would continue the conversation without interruption or ask a question about the missing information. If they chose to ask a question, they were prompted to construct an appropriate question. Table 3 shows examples of annotator decisions and clarification questions for several sample sentences. In Example 1, a noun at the end of the sentence is missing; two of the annotators chose to ask a reprise clarification question, while one annotator chose to continue without clarification. In Example 2, a verb at the beginning of the sentence is missing; two of the annotators chose to continue while one chose to ask a generic clarification question. In Example 5, one of the annotators asks a clarification question — erroneously assuming that the missing word is an adjective.

POS tag        num/% in dataset   Correct POS   Correct word
noun           101 (21%)          70%           10%
verb           133 (28%)          50%           48%
pronoun        25 (5%)            73%           48%
adjective      34 (7%)            55%           22%
adverb         8 (2%)             29%           4%
preposition    34 (7%)            69%           51%
wh-question    48 (10%)           75%           64%
other          92 (19%)           -             31%
overall                           49%           39%

Table 4: Percentage of correctly hypothesized POS tags/words.

Annotators were also asked to guess the identity of the missing word and its POS tag. When guessing POS tags, annotators were given a closed set of tags: name/place, noun, verb, pronoun, adjective, adverb, preposition, wh-question, other. They were also given examples for each tag. Table 4 shows the distribution of POS tags among misrecognized words as well as annotator accuracy in guessing the correct word and tag. Overall accuracy for POS tag hypotheses in our dataset is 49% and accuracy of word identification is 39%. These results indicate that humans are indeed sometimes able to fill in missing content. This suggests that, to recover from a speech recognition error, a system should first attempt to hypothesize the misrecognized word before asking a clarification question. Our results show that, when a missing word is a verb or a closed-class word, such as a pronoun, a wh-word, or a preposition, a human is especially likely to guess correctly. In our data, annotators guess the POS of 73% of pronouns, but the actual word identity only 48% of the time. The percentages of correctly guessed verb POS tags and actual verbs are very close (50% / 48%), indicating that most annotators who can guess that a missing word is a verb can also guess the word itself. In our dataset, most misrecognized verbs are auxiliary verbs ("to be", "to do", "to have"), which may be easier to guess than other verbs. Noun POS tags, on the other hand, were correctly guessed in 70% of cases but the nouns themselves were rarely identified correctly (10% of cases), indicating, not surprisingly, that a clarification question for nouns is desirable in open-domain systems.

behind the vehiclenoun door RepriseQ(3) Close the what?/What needs to be

closed?/What behind the vehicle needsto be closed?

4. how long have the villagersXXX on the farm for

verb lived Continue (3) -

5. XXX signs on the road are veryimportant

verb having Continue(2), Reprise(1) I’m sorry what type of signs?

Table 3: Sample annotator responses.

“to do”, “to have”, which may be easier to guess thanother verbs. Noun POS tags, on the other hand, were cor-rectly guessed in 70% of cases but the nouns themselveswere rarely identified correctly (10% of cases) indicating,not surprisingly, that a clarification question for nouns isdesirable in open domain systems.

POS Hyp.     Continue no Q   Generic Q   Conf. Q   Repr. Q
name/place   23%             5%          5%        68%
noun         27%             11%         4%        58%
verb         62%             6%          2%        30%
pronoun      69%             3%          5%        23%
adjective    24%             4%          4%        45%
adverb       68%             5%          7%        20%
prep         85%             2%          4%        8%
wh-q         86%             5%          2%        6%
other        61%             13%         1%        25%
overall      60%             7%          3%        30%

Table 5: Annotator responses to missing data.

Table 5 shows the distribution of annotator responses to missing information in our dataset. Overall, in 60% of cases annotators choose to continue without a clarification question; in 30%, they ask a reprise clarification question; in 7%, they ask a generic clarification (e.g., "Please repeat."); and in 3% of cases they ask a confirmation question (e.g., "Did you say..."). The distribution of each decision type varies for different annotator hypotheses about a word's POS tag. Reprise clarification questions are asked in 58% of cases where an annotator guesses the POS tag to be a noun, but only in 6% of cases where an annotator guesses the POS tag to be a wh-word.

4. Conclusions and Future Work
In this study we have presented a preliminary analysis of a corpus of utterances containing ASR errors, annotated by Amazon Mechanical Turk workers for the POS and identity of the misrecognized word, as well as the annotators' likely response to such errors: continue without clarification; a generic request to repeat, rephrase, or confirm; or a reprise clarification question. In over 60% of cases, annotators choose to continue the dialogue without asking for clarification. For some categories of errors (auxiliary verbs and function words), annotators could hypothesize the missing words with good accuracy. This suggests that spoken dialogue systems might avoid sometimes risky clarification subdialogues by making use of syntactic information to also hypothesize misrecognized words. Similarly to previous studies, we found targeted reprise clarifications to be the most common kind of clarification question. However, we also found that humans are much more likely to propose a reprise clarification question when they believe the missing word to be a noun than another POS, suggesting that systems should focus their strategies for constructing such questions on that category.

In future work, we will use these annotations to train statistical models for identifying when a dialogue system should or should not engage in a clarification dialogue and what type of clarification question should be presented to a user. Features we think will be important in this modeling are POS as well as semantic and dependency parse information. We will incorporate this classifier into an automatic clarification question generation tool to construct natural clarification questions. Our immediate application for this tool is to improve the clarification engine of a speech-to-speech translation system.

5. References
[1] M. Akbacak et al. Recent advances in SRI's IraqComm Iraqi Arabic-English speech-to-speech translation system. In ICASSP, pages 4809-4812, 2009.
[2] D. Bohus and A. Rudnicky. Sorry, I didn't catch that! - An investigation of non-understanding errors and recovery strategies. In Proceedings of the 6th SIGdial Workshop on Discourse and Dialog, 2005.
[3] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423-430, 2003.
[4] T. Koulouri and S. Lauria. Exploring miscommunication and collaborative behaviour in human-robot interaction. In SIGDIAL Conference, pages 111-119, 2009.
[5] M. Purver. The Theory and Use of Clarification Requests in Dialogue. PhD thesis, King's College, University of London, 2004.
[6] G. Skantze. Exploring human error recovery strategies: Implications for spoken dialogue systems. Speech Communication, 45(2-3):325-341, 2005.
[7] Amazon Mechanical Turk. http://aws.amazon.com/mturk/, accessed on 28 May, 2012.
[8] B. A. Weiss et al. Performance evaluation of speech translation systems. In LREC, 2008.
[9] J. D. Williams and S. Young. Characterizing task-oriented dialog using a simulated ASR channel. In Proceedings of the ICSLP, Jeju, South Korea, 2004.


Acoustic, Morphological, and Functional Aspects of "yeah/ja" in Dutch, English and German

Jürgen Trouvain1 and Khiet P. Truong2

1 Phonetics, Saarland University, Saarbrücken, Germany
2 Human Media Interaction, University of Twente, Enschede, The Netherlands

[email protected], [email protected]

Abstract
We explore different forms and functions of one of the most common feedback expressions in Dutch, English, and German, namely "yeah/ja", which is known for its multi-functionality and ambiguous usage in dialog. For example, it can be used as a yes-answer, as a pure continuer, or as a way to show agreement. In addition, "yeah/ja" can be used in its single form, but it can also be combined with other particles, forming multi-word expressions, especially in Dutch and German. We have found substantial differences on the morpho-lexical level between the three related languages, which enhances the ambiguous character of "yeah/ja". An explorative analysis of the prosodic features of "yeah/ja" has shown that mainly a higher intensity is used to signal speaker incipiency across the inspected languages.
Index Terms: feedback, yeah, ja, dialog act, prosody, cross-linguistic, speaker incipiency

1. Introduction
One of the most typical and frequent feedback expressions in English is "yeah" (e.g. [1] [2] [3] [4] [5] [6]), which also has corresponding expressions in other languages, usually with a different spelling, such as "ja" in Dutch and in German. Especially in Dutch and German, "ja" also frequently occurs in reduplicated forms such as "jaja" or in multi-word expressions such as "ja genau" in German or "nou ja" in Dutch. In addition, there is a huge diversity of possible meanings and functions of "yeah/ja", which is further enhanced by its morpho-lexical variability as explained above. This variability in meanings and functions may also affect the possible phonetic productions of "yeah/ja" (e.g., [7]). All together, these aspects make "yeah/ja" a highly ambiguous and complex feedback expression that is interesting to study from cross-linguistic, dialog-interactive, and phonetic points of view. The current study investigates the highly frequent feedback expression "yeah/ja" in conversational speech corpora of three languages (Dutch, English, German), with a special interest (i) in morpho-lexical variability and (ii) in prosodic differences between "yeah/ja" tokens showing speaker incipiency (i.e., the intention to commence speakership) and those showing passive recipiency.

Phonetically, "yeah/ja" is usually an opening diphthong, starting with a palatal glide and ending in the area between an open, unrounded, central vowel and an open-mid vowel. The phonetic make-up and hence the spelling of "yeah" and "ja" in Dutch, English, and German can be seen as standardized.

In addition to its morpho-lexical variability, the production of "yeah/ja" can also differ across languages. In Swedish, for example, "yeah/ja" can occur in various reduplicated forms such as "jaja" or "jajaja", similar to Dutch and German. However, the airstream mechanism differs, since in Swedish "ja" is often produced with an ingressive airstream [8].

It has been argued that multiple sayings of "ja" uttered in the same intonation phrase are not just intensifications of a single "ja" [9]. Additionally, multiple sayings can bear different meanings depending on the intonation contour used (cf. [10] and [11] for German). Thus, the morpho-lexical variability, functional variability, and phonetic variability of "yeah/ja" are all related to and affect each other.

Jurafsky et al. [4] point out that "yeah/ja" is highly ambiguous in terms of function in dialog. "Yeah/ja" can be used in the backchannel [12] as a continuer, additionally to signal agreement (with a yes-answer as a particular case), and it can be used to provide assessment. Although the multi-functionality of "yeah/ja" is acknowledged, there is no generally accepted standard set of functions (or dialogue acts) of "yeah/ja" in dialogues. Table 1 lists four similar but only partially compatible approaches to labeling the multi-functionality of response tokens such as "yeah/ja" in English and German. It should be clear that these are just four out of several labeling schemes. Obviously there is no standard labeling scheme, as the variability in labeling functions of response tokens in Table 1 illustrates. It also illustrates the range of possible ambiguity of "yeah/ja".

As shown in Table 1, it has been suggested for English that "yeah", apart from its function as a continuer, also signals a certain level of speaker incipiency, i.e. starting a longer discourse unit with "yeah" ([1] [2]). In contrast to backchannel utterances featuring neutral nasal consonants (often transcribed as "m" or "hm"), "yeah" can indicate that the speaker is prepared to shift from recipiency to incipiency [1]. This pivotal mechanism makes the change from active listener to active speaker easier and thus conversations more fluent. In order to process this fluency in time, we would expect that speaker incipiency is also prosodically marked beyond syntax. This could be done by a higher intensity at the turn beginning, signalling the planning of a longer stretch of speech to follow (e.g. [13]).

The cross-linguistic aspect not only plays a role in the morpho-lexical and phonetic variability, it also comes into play when we look at the function of "yeah/ja" in dialog. In German, for example, "ja" can have the lexical meaning of "yes" and it can be used as a modal particle signalling common ground (e.g. [16]). In contrast to other cross-linguistic studies on feedback signals which focused on their frequency of occurrence and the prosody of phrases preceding the feedback signal, such as Levow et al. [17], we concentrate on only one token which mostly, but not exclusively, is used as a feedback token. Unlike [17], where productions of Chinese, English and Spanish were investigated, we are dealing here with differences and parallels of closely related languages, which in our case all belong to the Western branch of the Germanic languages.


Table 1: Possible functions of response tokens taken from four labeling schemes.

Jurafsky et al. [4] / Benus et al. [5] / AMI [14] / Buschmeier et al. [15]
continuer (backchannel) / backchannel / backchannel / "please continue"
agreement / acknowledgement/agreement / assessment / "I understand", "I agree"
assessment / "I disagree", attitude
yes-answer
incipient speaker (incl. pivot/latching) / beginning discourse segment / new discourse segment
pivot: acknowledgement + beginning disc. segment
ending discourse segment / ends discourse segment
acknowledgement + ending disc. segment
question / "What are you saying?", "I do not understand"
stall/filler / stall
literal modifier
back from a task
cannot decide / unresolved
inform
fragment
other


Summarizing, "yeah/ja" is a multi-faceted feedback expression. We aim to explore the morpho-lexical variability of "yeah/ja", and possible different phonetic realizations of speaker-incipient and passive-recipient "yeah/ja" in Dutch, German, and English. Section 2 presents the data and our methods for analysis. The results are shown in Section 3 and we discuss our findings in Section 4.

2. Method

2.1. Data

For our analysis, we used three conversational speech corpora: the Lindenstrasse corpus [18] for German, the Diapix Lucid corpus [19] for English, and the Spoken Dutch Corpus (CGN) [20] for Dutch. We expected to see annotations of "yeah/ja" which are possibly not consistent within one corpus and/or are not comparable between the different corpora (cp. [21]). For none of the corpora was a clear function for "yeah/ja" annotated. We decided to manually re-annotate selected functions (see below) and the segmentation boundaries of the annotated tokens in question. Manual re-labeling was performed by both authors independently of each other. For this reason we restricted ourselves to 100 "yeah/ja" tokens per corpus. For each corpus, 3 female-female and 3 male-male conversations were randomly selected. From these 6 conversations per corpus, 100 yeah-tokens (50 from female-female and 50 from male-male conversations) were selected in a random manner. Among the selected tokens we concentrated on single and turn-initial tokens of "yeah/ja", and thus excluded those cases where "yeah/ja" occurs in combination, e.g. German "naja" or "jaja", or where it occurs in a medial or final turn position.

2.2. Labeling

After reviewing previous work on the ambiguous functions of "yeah/ja" (e.g., [4]) and after several attempts to label these, we decided to focus on the labeling of speaker incipiency (SI) vs. passive recipiency (PR) according to the set of categories shown in Table 2. The operationalization of speaker incipiency can take several forms. In Drummond and Hopper [2], a speaker-incipient "yeah/ja" was initially defined as a "yeah/ja" token that is immediately (< 200 ms) followed by same-speaker speech. In Truong and Heylen [6], speaker incipiency was (automatically) defined as the number of 'conversational states' that have passed until the current speaker starts a new full turn. Their definition takes into account the preparedness aspect of speaker incipiency (see Jefferson [1]). For the current study, we use a similar operationalization of speaker incipiency as described in [2]. Since it is imaginable that "yeah/ja" can be followed by same-speaker speech without constituting a bid for speakership, a distinction between a minimal turn (label A2) and a full turn (label B) was made (following [2]). Although A2 could be considered to have a higher (gradual) level of speaker incipiency, we make a binary distinction and consider both A1 and A2 as forms of passive recipiency and B as a form of speaker incipiency.

Table 2: Labeling of "yeah/ja"

Label   SI/PR   Description: "Yeah/ja" is ...
A1      PR      freestanding
A2      PR      the first part of a minimal turn
B       SI      the first part of a full turn
C       N/A     none of the above (for example, not in turn-initial position)
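To make the operationalization concrete, the following is a minimal sketch, not the authors' annotation tool, of how labels A1, A2, and B could be assigned from time-aligned transcripts. It assumes a hypothetical token list with speaker, start, and end times; the 200 ms gap follows [2], while the three-word cutoff for a "minimal" turn is an arbitrary illustrative choice.

```python
from dataclasses import dataclass

@dataclass
class Token:
    speaker: str
    word: str
    start: float  # seconds
    end: float    # seconds

def label_yeah(tokens, i, gap_threshold=0.200, minimal_turn_words=3):
    """Assign A1/A2/B to the turn-initial 'yeah/ja' at index i (illustrative sketch)."""
    tok = tokens[i]
    # Collect same-speaker speech that follows within gap_threshold of the last offset.
    follow = []
    t_end = tok.end
    for nxt in tokens[i + 1:]:
        if nxt.speaker != tok.speaker:
            continue
        if nxt.start - t_end > gap_threshold:
            break
        follow.append(nxt)
        t_end = nxt.end
    if not follow:
        return "A1"  # freestanding: passive recipiency
    if len(follow) < minimal_turn_words:
        return "A2"  # first part of a minimal turn: still passive recipiency
    return "B"       # first part of a full turn: speaker incipiency
```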

In addition to labeling speaker incipiency, we also marked whether "yeah/ja" was used in a single form or as a multi-word expression. Only single tokens of "yeah/ja" were taken for subsequent acoustic analysis.

2.3. Acoustic analysis

For each token we automatically measured its duration, mean intensity, mean F0, and F0 range (F0max − F0min) using Praat [22]. All measurements were transformed to z-scores (z = (x − µ)/σ) per speaker, where the mean (µ) and standard deviation (σ) were taken over all single "yeah/ja" tokens uttered by that speaker. The main expectation is that speaker incipient "yeah/ja" have a higher intensity than passive recipient "yeah/ja" (cf. [13]).
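A minimal sketch of this per-speaker z-score normalization, assuming the Praat measurements have already been exported to a table (the field names and the pandas-based layout are illustrative, not the authors' actual pipeline):

```python
import pandas as pd

# Hypothetical per-token measurements exported from Praat, one row per single "yeah/ja" token.
df = pd.DataFrame([
    {"speaker": "du_f01", "dur": 0.21, "f0": 182.0, "f0_range": 41.0, "intensity": 62.1},
    {"speaker": "du_f01", "dur": 0.35, "f0": 201.0, "f0_range": 77.0, "intensity": 70.3},
    {"speaker": "de_m02", "dur": 0.28, "f0": 110.0, "f0_range": 25.0, "intensity": 65.4},
    {"speaker": "de_m02", "dur": 0.19, "f0": 121.0, "f0_range": 33.0, "intensity": 59.8},
    # ... one row per token ...
])

features = ["dur", "f0", "f0_range", "intensity"]

# z = (x - mu) / sigma, with mu and sigma taken per speaker over all of that
# speaker's single "yeah/ja" tokens.
zscores = df.groupby("speaker")[features].transform(
    lambda col: (col - col.mean()) / col.std(ddof=0)
)
df = df.join(zscores.add_suffix("_z"))
```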

3. Results

3.1. Morpho-lexical variability of "yeah/ja"

We counted each occurrence of "yeah/ja" and looked at whether it was used as a single token or in combination with the same or other 'particles', creating new multi-word expressions. Figure 1 shows that there are substantial differences in the morpho-lexical variability across the three languages. Although the single form of "yeah/ja" is the predominant form in all languages, it accounts for only 65% of tokens in Dutch, in contrast to 89% in English (which is in line with the numbers in [4] for English).

There are also differences regarding the possibility of combinatory forms, including multiple sayings and other multi-word expressions such as "ja genau" (= "yeah exactly") in German and "uh ja" and "oh ja" in Dutch. For Dutch we count more than 60 combinatory forms, compared to around 20 combinations in English and in German.

Summarising, it can be noted that for Dutch and German there is a substantial degree of morpho-lexical variability, i.e. the usage of new combinatory forms such as "jaja" or "nou ja"/"naja". This finding was not expected when taking English as the baseline. This large lexical variability may in turn increase the variability in function: some of these new multi-word expressions are used more as idiomatic expressions, such as "jaja" or "nouja" in Dutch, and some may carry an affective meaning.

3.2. Functional variability of “yeah/ja”

As stated above, "yeah/ja" in Dutch and German is not exclusively used as a feedback utterance, be it a simple continuer or a continuer including some assessment or further additional functions. In the German data, the usage of "ja" as a modal particle rather than a discourse particle lies around 5%. Additionally, there were several unclear cases of "ja" where it was used in indirect speech or as a self-comment. Likewise, among the Dutch combinations "jaja", "uh ja", "oh ja", "nou ja", or "maar ja", not all are used as feedback signals but as fillers.

Thus, the ambiguity of "yeah/ja" illustrated in [4] for English is compounded by further meanings in Dutch and in German which go beyond 'pure' feedback.

3.3. Acoustic variability

The acoustic analysis of the single tokens among the selection of 100 "yeah/ja" per corpus revealed that for all three languages intensity plays an important role for distinguishing speaker incipiency from passive recipiency, see Table 3. Although there was a tendency for all three languages to have a higher mean fundamental frequency and a narrower F0 range for speaker incipiency, the differences between recipiency and incipiency found for fundamental frequency (mean and range)

Figure 1: Wording of “yeah/ja” in Dutch, English, and German.

were not statistically significant. The duration of tokens of "yeah/ja" was longer when used for recipiency. However, this difference was only significant for German.

As expected, speaker incipiency is acoustically signalled mainly by intensity (see also [13]) but, in general, has no prosodically marked form.

4. Discussion and conclusions

In contrast to English, "yeah/ja" shows substantial variability in its morpho-lexical forms in German and particularly in Dutch. In English, "yeah/ja" is mostly used in its single form, while in Dutch and German it is more often used in multi-word expressions that may have different dialog functions. This makes clear that a common feedback expression such as "yeah/ja" can have different functions in


Table 3: Averaged acoustic measurements given in z-scores. Significant differences with p-values below .05, tested with t-tests, are marked with an asterisk.

          Feature    A1+A2    B       signific.
Dutch     Dur        -0.02   -0.37
          F0         -0.26    0.05
          F0range    -0.16   -0.18
          Intens     -0.08    0.46    *
English   Dur         0.09    0.00
          F0          0.40    0.55
          F0range     0.41    0.22
          Intens     -0.10    0.56    *
German    Dur         0.07   -0.37    *
          F0         -0.20    0.17
          F0range    -0.07   -0.28
          Intens     -0.13    0.66    *

different languages, even if these languages are closely related, such as the three West-Germanic languages analysed here. Given the growing amount of cross-cultural human-human and human-machine communication, more attention should be paid to these cross-linguistic aspects of feedback expression. Future work has to show, for instance, how far multi-word expressions based on "ja" differ in function and meaning from single tokens of "ja", especially regarding their intonation structure.

What is also clear is that there is a large variety of dialog act labeling approaches to feedback expressions, which reflects the multi-functionality of feedback expressions. For future research, it would be interesting to perform a more thorough meta-analysis of possible functions and meanings of "yeah/ja" and other feedback expressions. It would also be an asset to work with speech data from various languages elicited via the same task, as done by Levow et al. [17].

Finally, although we did not find a clear prosodically marked form for speaker incipient "yeah/ja", prosodic measurements, as illustrated in previous work, can and should be used to help disambiguate (other) dialog act functions of feedback expressions. In this connection, prosody should also include features of voice quality such as creaky voice, which has been shown to signal passive recipiency [23]. Exploring the fine phonetic detail of these functions across languages remains a task for the future, as does the automatic processing of these functions, which will help make human-machine interactions more fluent.

5. Acknowledgements

Thanks to Eva Lasarcyk and one anonymous reviewer for their feedback. This work was partly supported by the EU 7th Framework Programme (FP7/2007-2013) under grant agreement no. 231287 (SSPNet) and the UT Aspasia Fund.

6. References

[1] G. Jefferson, "Notes on a systematic deployment of the acknowledgement tokens 'Yeah' and 'Mm hm'," Tilburg Papers in Language and Literature, 1984.

[2] K. Drummond and R. Hopper, "Back channels revisited: acknowledgement tokens and speakership incipiency," Research on Language and Social Interaction, vol. 26, pp. 157–177, 1993.

[3] ——, "Some uses of yeah," Research on Language and Social Interaction, vol. 26, pp. 203–212, 1993.

[4] D. Jurafsky, E. Shriberg, B. Fox, and T. Curl, "Lexical, prosodic, and syntactic cues for dialog acts," in Proceedings of the ACL/COLING Workshop on Discourse Relations and Discourse Markers, 1998, pp. 114–120.

[5] S. Benus, A. Gravano, and J. Hirschberg, "The prosody of backchannels in American English," in Proceedings of the 16th International Congress of the Phonetic Sciences (ICPhS), 2007, pp. 1065–1068.

[6] K. P. Truong and D. Heylen, "Disambiguating the functions of conversational sounds with prosody: the case of yeah," in Proceedings of Interspeech, 2010, pp. 2554–2557.

[7] T. Stocksmeier, S. Kopp, and D. Gibbon, "Synthesis of prosodic attitudinal variants in German backchannel 'ja'," in Proceedings of Interspeech, 2007, pp. 1290–1293.

[8] R. Eklund, "Pulmonic ingressive phonation: Diachronic and synchronic characteristics, distribution and function in animal and human sound production and in human speech," Journal of the International Phonetic Association, vol. 38, pp. 235–325, 2008.

[9] T. Stivers, "'No no no' and other types of multiple sayings in social interaction," Human Communication Research, vol. 30, pp. 260–293, 2004.

[10] A. Golato and Z. Fagyal, "Comparing single and double sayings of the German response token 'ja' and the role of prosody: A conversation analytic perspective," Research on Language and Social Interaction, vol. 41, pp. 241–270, 2008.

[11] D. Barth-Weingarten, "Response tokens in interaction – prosody, phonetics and a visual aspect of German JAJA," Gesprächsforschung, vol. 12, pp. 301–370, 2011.

[12] V. H. Yngve, "On getting a word in edgewise," in Papers from the Sixth Regional Meeting of the Chicago Linguistic Society. Chicago Linguistic Society, 1970, pp. 567–577.

[13] A. Hjalmarsson, "The vocal intensity of turn-initial cue phrases and filled pauses in dialogue," in Proceedings of SIGdial, 2010, pp. 225–228.

[14] J. Carletta, "Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus," Language Resources and Evaluation, vol. 41, pp. 181–190, 2007.

[15] H. Buschmeier, Z. Malisz, M. Wlodarczak, S. Kopp, and P. Wagner, "'Are you sure you're paying attention?' – 'Uh-huh' communicating understanding as a marker of attentiveness."

[16] E. Karagjosova, "Modal particles and the common ground: meaning and functions of German 'ja', 'doch', 'eben'/'halt' and 'auch'," in Perspectives on Dialogue in the New Millennium, P. Kühnlein, H. Rieser, and H. Zeevat, Eds. Amsterdam: John Benjamins, 2003, pp. 2–11.

[17] G.-A. Levow, S. Duncan, and E. King, "Cross-cultural investigation of prosody in verbal feedback in interactional rapport," in Proceedings of Interspeech, 2010, pp. 286–289.

[18] IPDS, Video Task Scenario: LINDENSTRASSE – The Kiel Corpus of Spontaneous Speech, Volume 4, DVD. Institut für Phonetik und Digitale Sprachsignalverarbeitung, Universität Kiel, 2006.

[19] R. Baker and V. Hazan, "DiapixUK: task materials for the elicitation of multiple spontaneous speech dialogs," Behavior Research Methods, vol. 43, pp. 761–770, 2011.

[20] N. Oostdijk, "The Spoken Dutch Corpus. Overview and first evaluation," in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2000), 2000, pp. 887–894.

[21] J. Trouvain and K. P. Truong, "Comparing non-verbal vocalisations in conversational speech corpora," in Proceedings of the 4th International Workshop on Corpora for Research on Emotion Sentiment & Social Signals, 2012, pp. 36–39.

[22] P. Boersma and D. Weenink, "Praat, a system for doing phonetics by computer," Glot International, vol. 5, no. 9/10, pp. 341–345, 2001.

[23] T. Grivicic and C. Nilep, "When phonation matters: The use and function of 'yeah' and creaky voice," Colorado Research in Linguistics, vol. 17, 2004.


Possible Lexical Cues for Backchannel Responses

Nigel G. Ward

Department of Computer Science, University of Texas at El Paso, El Paso, Texas, United States
[email protected]

Abstract

Looking for words that might cue backchannel feedback, I did a statistical analysis of the interlocutors' words preceding 3363 instances of uh-huh in the Switchboard corpus. No clear cueing words were found, but collateral findings include the existence of semantic classes that slightly increase the likelihood of an upcoming uh-huh, the fact that different classes have their effects at different time lags, and the existence of words which strongly counter-indicate a subsequent uh-huh.

Index Terms: uh-huh, feedback, elicitors, temporal distributional analysis, dialog dynamics

1. Background

Regarding the question of when and why listeners backchannel, members of the general public often think that they respond to specific cue phrases. Certainly responses can be elicited, for example by ending any statement with you know what I mean?, but the resulting responses are scarcely optional, and thus not strictly backchannels. Today, for true backchannels, the research on cues focuses elsewhere, namely on non-verbal features, notably features of prosody and gaze [1].

But perhaps lexical cues also do have a role. In the literature there is one relevant suggestion, Allwood's "extremely tentative" identification of the apparent elicitors of feedback in several languages, including for English the words eh and right [2]; but it seems that no one has ever followed up on this. The question of the existence of lexical cues to backchannels is also of more general interest, as it relates to larger issues regarding the extent of automaticity and responsiveness in dialog.

This paper presents an exploratory study, looking for words that may cue backchannels.

2. Method

Specifically, this paper examines the contexts preceding 3363 occurrences of the most typical backchannel token, uh-huh, found in a 650K word subset of Switchboard, a corpus of unstructured two-party telephone conversations among strangers [3]. Uh-huh was chosen as the most typical backchannel [4] and also as a word which is almost invariably a backchannel. The method was Temporal Distributional Analysis [5]: this section summarizes this technique and its use for uh-huh.

A frequently operative cue word should, by definition, occur frequently in the speech of the interlocutor just before the uh-huh. Thus I compiled statistics on the words which commonly preceded uh-huh. Lacking foreknowledge of where exactly cues might occur, I compiled statistics at various offsets, as measured from the onset of the context word to the onset of the uh-huh. For convenience the offsets were discretized into buckets, thus for example an occurrence of the word know starting 1.8 seconds before a uh-huh was counted in the 1–2 second bucket.

From the counts over the whole corpus, I computed the degree to which each context word x is characteristic of each bucket t. In particular, I did this by comparing the in-bucket probability to the overall (unigram) probability for x. For example, we can compute the ratio of the probability of know appearing in the 1–2 second bucket to the probability of know appearing anywhere in the corpus. This we call the R ratio. Specifically, the probability of each word in each bucket, the "bucket probability," is given by its count in the bucket for t divided by the total in that bucket,

\[ P_{tb}(w_i@t) = \frac{\mathrm{count}(w_i@t)}{\sum_j \mathrm{count}(w_j@t)} \qquad (1) \]

We can then compute the ratio of this to the standard unigram probability:

\[ R(w_i@t) = \frac{P_{tb}(w_i@t)}{P_{\mathrm{unigram}}(w_i)} \qquad (2) \]

If R is 1.0 there is no connection and no mutual information; larger values of R indicate positive cooccurrence relations, and lower values of R indicate words that are rare in a given context position. To test whether an R-ratio is significantly different from 1.0, I apply the chi-square test, where the null hypothesis is that the context word occurs in a certain bucket as often as expected from the unigram probability of the word and the total number of words in that bucket, where the sample population is relative to all occurrences of uh-huh in the corpus.
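As a rough sketch of this computation (not the paper's code), assuming per-bucket and whole-corpus word counts have already been tallied into dictionaries (the dictionary names are illustrative), the R-ratio and the chi-square test against the unigram expectation could be computed as follows:

```python
from scipy.stats import chisquare

def r_ratio_and_p(word, bucket_counts, unigram_counts):
    """R-ratio of `word` in one offset bucket, plus a chi-square p-value.

    bucket_counts:  dict word -> count of that word in this bucket
    unigram_counts: dict word -> count of that word in the whole corpus
    (both hypothetical data structures for this sketch)
    """
    n_bucket = sum(bucket_counts.values())
    n_corpus = sum(unigram_counts.values())

    p_bucket = bucket_counts.get(word, 0) / n_bucket        # Eq. (1)
    p_unigram = unigram_counts[word] / n_corpus
    r = p_bucket / p_unigram                                 # Eq. (2)

    # Null hypothesis: the word occurs in the bucket as often as its
    # unigram probability and the bucket size would predict.
    obs = bucket_counts.get(word, 0)
    observed = [obs, n_bucket - obs]
    expected = [p_unigram * n_bucket, (1 - p_unigram) * n_bucket]
    _, p_value = chisquare(f_obs=observed, f_exp=expected)
    return r, p_value
```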

As my purpose was exploratory, I didn't want to be overwhelmed with possibilities, so I limited attention to the most frequent 5000 words in the corpus, with


less frequent words counted as belonging to the out-of-vocabulary class. I also limited attention to words whose R-ratio was significantly high or low, with p < .001. Due to the large number of words examined, this does not guarantee that all candidate cues identified are truly significant, and thus the findings are only tentative.

3. Observations

Table 1 shows the result. Within each cell, the words are ordered by extremeness of the R values: above the double line, most frequent first; below it, least frequent first. I make seven observations:

1. The word right is a counter-indication to a subsequent uh-huh, contrary to Allwood's conjecture, and so are the component words of the phrase you know, again contrary to common expectation.

2. Indeed, all discourse markers counterindicate uh-huh, except for um 2–8 seconds earlier.

3. No word is a truly strong backchannel cue; the highest R-ratio is 3.7 for time just before a uh-huh. (Note that, by Bayes law, the implications go both ways: since time is 3.7 times more common just before uh-huh, that means that an observation of the word time implies that a backchannel is 3.7 times more likely than usual to occur within a half second.)

4. However there are words which very strongly indicate that a backchannel will not occur soon. For example, a mm-hm by the interlocutor reduces the likelihood of a backchannel 1–2 seconds later by a factor of 53.

5. Some words do appear somewhat more frequently before uh-huh, and these mostly fall in a few categories:

- The deictics here, there and now are common in the half-second before uh-huh.

- Some pronouns are common at different offsets: them, one, and it just before uh-huh; and I, we, he, and she 2–8 seconds before.

- Some verbs, notably take, took, went, watch, use, made, and had, are common 1–6 seconds before.

- Some prepositions are common at various different offsets.

- Number words are common before uh-huh starting about 6 seconds before, as are other expressions of quantity such as some, little, bit, much and more.

- Temporal expressions are common at different offsets: now just before, and ago, years, old, and times 4–8 seconds before.

6. The specific offset makes a difference. For example, the words I, we and a are positive indicators of an upcoming backchannel after certain delays, but they counter-indicate a backchannel right away.

7. The counter-indicating words ("anti-cues") also fall into a few common classes:

- listenership indicators such as mm-hum and uh-huh, unsurprisingly

- out of vocabulary words ([OOV] in the table), that is, the less frequent words, which include most names, somewhat surprisingly given that some accounts associate uh-huh with grounding of new referents

- cues to starting something new (that's, it's, I think, yeah, well, oh, uh, um, a and the); after these uh-huh is inhibited for a second or so.

4. Discussion

To consider a word to be a cue, it should meet two criteria: it should strongly evoke the backchannel response, and it should have a direct causal effect.

Regarding the first criterion, the ratios seen are not particularly high, especially compared to the strength of gaze and prosodic cues [1], so none of the words identified can be considered a strong cue, if indeed a cue at all.

The second criterion is harder to apply. Determining definitively whether one or more of these words has causal efficacy would require detailed analysis and perhaps controlled experiments. However, a quick examination of a dozen instances of the strongest candidate for cue word status, time, when it preceded a uh-huh, was not promising: these mostly occurred as part of telling a narrative about some past event, as in he took a car battery one time . . . and in one time we used it to pay our rent. The word time in these cases never seemed to bear any special discourse function. More generally, looking over the words above the line in the table, none seems likely to be a cue. One would expect cues to be discourse markers and/or phonetically distinct so that they could be quickly processed and responded to, but these words look like just normal words, bearing their normal meanings and doing their normal functions.

Thus there appears to be no reason to think that these words are really cues for uh-huh. By extension it seems unlikely that there are specific lexical cues for backchannels.

This interpretation, however, raises an interesting question at a deeper level, that of the pragmatic and semantic events that can cue uh-huh. The fact that particular words correlate with subsequent backchannels (and others anti-correlate) provides us with clues to what those pragmatic and semantic events might be. Future research


R        8–6 s | 6–4 s | 4–2 s | 2–1 s | 1–.5 s | .5–0 s

> 2.8    three, try, used | two | time, here, them
> 2.0    place, old, years | took, went, used, ago, came, started, he's, thought, I'll, home, made, four, far, we've, times | husband, college, take, watch, went | I'll, five, through, never, use, had, school, he's | very, from, real, at, them, in, on, out, into, pretty, their | work, there, now, too, up
> 1.4    she, last, went, family, he's, year, never, here, because, over, my, little, our, those, um, we, were, was, all, mean, would, had | use, she, three, better, my, come, never, other, into, years, only, because, little, um, time, was, one, really, at, we, then, like, I | I'd, my, enough, only, I've, our, maybe, bit, actually, put, she, two, work, been, because, then, little, when, one, we, from, out, he, very, like, was, something, had, get, them, some, I'm, um | doing, around, go, her, these, little, two, an, like, into, work, get, they're, from, as, a, more, been, have, to, up, with, for, on, my, we | little, for, an, much, my, a, your, of, the | one, out, or, it
< .71    uh-huh, as | right, oh | oh, right, yeah | that's | I, that's, so, uh, know, you | was, a, I, [OOV], the
< .50    [laughter], okay | oh | but, think, [OOV], don't | it's, just, [laughter], have
< .35    [OOV], mm-hm | [OOV], mm-hm, uh-huh | [OOV] | okay, yeah, [OOV], [laughter] | if, mean, oh | yeah, we, well, think, they
< .25    mm-hm, uh-huh | yeah, [laughter], well | oh, uh, um
< .18    if
< .12    uh-huh | [vocalized-noise], um | got, would, uh-huh, because
< .09    uh-huh | that's
< .06    mm-hm
< .04    mm-hm
< .03
< .02    mm-hm

Figure 1: Interlocutor words that are notably frequent and infrequent, as judged by R-ratios, in six regions of time before uh-huh.


should look for them. Doing so would be interesting in many ways, including the fact that, given the varying offsets, this might reveal something about the time constants of the mental processing required to digest information of various kinds and decide to produce a minimal response. That is, these observations could provide an entry to the study of the semantic aspects of dialog dynamics [6, 7], complementing existing work on prosodically- and gaze-cued response patterns.

5. Applications

Quite apart from the possible broader implications suggested above, our results may have practical value as-is.

On the one hand, for the sake of improving the performance of dialog systems that show attention and generate rapport by backchanneling, the findings above, especially regarding the counter-indicators to backchanneling, could be useful.

On the other hand, to improve systems which elicit backchannels using various cues [8], and those which interpret backchannels based on the details of their timing [9], the patterns of co-occurrence could again be useful.

6. Summary

In this study I explored which words tended to precede an uh-huh by the interlocutor, using a new statistical analysis method. The results cast doubt on the existence of lexical cues to backchannels; however, they do reveal tendencies that suggest new hypotheses about the dynamics of interaction in dialog.

7. References

[1] L.-P. Morency, I. de Kok, and J. Gratch, "A probabilistic multimodal approach for predicting listener backchannels," Autonomous Agents and Multi-Agent Systems, vol. 20, pp. 70–84, 2010.

[2] J. Allwood, "Feedback in second language acquisition," in Adult Language Acquisition: Cross Linguistic Perspectives, II: The Results (C. Perdue, ed.), pp. 196–235, Cambridge University Press, 1993.

[3] ISIP, "Manually corrected Switchboard word alignments." Mississippi State University. Retrieved 2007 from http://www.ece.msstate.edu/research/isip/projects/switchboard/, 2003.

[4] E. A. Schegloff, "Discourse as an interactional achievement: Some uses of "Uh huh" and other things that come between sentences," in Analyzing Discourse: Text and Talk (D. Tannen, ed.), pp. 71–93, Georgetown University Press, 1982.

[5] N. G. Ward, "Temporal distributional analysis," in SemDial, 2011.

[6] L.-P. Morency, "Modeling human communication dynamics," IEEE Signal Processing Magazine, vol. 27, 2010.

[7] N. G. Ward, "The challenge of modeling dialog dynamics," in Workshop on Modeling Human Communication Dynamics, at Neural Information Processing Systems, 2010.

[8] T. Misu, E. Mizukami, Y. Shiga, S. Kawamoto, H. Kawai, and S. Nakamura, "Toward construction of spoken dialogue system that evokes users' spontaneous backchannels," in Proceedings of the SIGDIAL 2011 Conference, pp. 259–265, 2011.

[9] T. Kawahara, M. Toyokura, T. Misu, and C. Hori, "Detection of feeling through back-channels in spoken dialogue," in Interspeech, 2008.


Visualizations Supporting the Discovery of Prosodic Contours Related to Turn-Taking

Nigel G. Ward, Joshua L. McCartney

Department of Computer Science, University of Texas at El Paso
[email protected], [email protected]

Abstract

Some meaningful prosodic patterns can be usefully represented with pitch contours; however, developing such descriptions is a labor-intensive process. To assist in the discovery of contour representations, visualization tools may be helpful. Edlund et al. [1] proposed the superimposition of hundreds of pitch curves from a corpus to reveal the general patterns. In this paper we refine and extend this method, and illustrate its utility in the discovery of a prosodic cue for back-channels in Chinese.

Index Terms: prosodic cue, tune, turn-taking, back-channel, Chinese, bitmap cluster, overlay, superimpose

1. Why Contours?

In human dialog, turn-taking is largely managed by means of prosodic signals, or cues, exchanged by the participants. A dialog system that can correctly recognize and respond to these cues may be able to make the user experience more efficient and more comfortable [2, 3, 4]. These cues often seem to involve pitch contours, or tunes: specific patterns of ups and downs over time. Figure 1 shows three examples, diagrammed in various styles.

However, in the spoken dialog systems community, those working on turn-taking generally do not use contours, either explicitly or implicitly. Rather, the "direct approach" [5] has

Figure 1: Examples of pitch contours: a contradiction contour (after [6], pg 246), a nonfinal contour followed by a final contour (after [7], pg 183), and a back-channel cuing contour for Spanish (from [8], building on [9], see also Figure 6).

become mainstream. In this method, numerous low-level prosodic features are computed and fed into a classifier trained on the decision of interest, for example, whether to initiate a turn or wait. This method has seen many successes.

However contours also have their merits. A description in terms of a contour can be concise and may possess more explanatory power than a complex classifier. A contour-based description may apply more generally to other dialog types and other domains of discourse, whereas a complex classifier may perform well only for the corpus it was trained on. In some ways a contour may be a more natural description of a prosodic pattern. For one thing, describing a pattern in terms of low-level features presents some choices which may lack real significance: for example the two descriptions "pitch rise" and "low pitch followed by a high pitch" refer to different mid-level features that may not actually differ in realization; however if drawn as contours their similarity is obvious. As another example, when describing a pattern in terms of features the temporal dependencies may be obscure, as in a rule which requires "low at t – 700" and "high at t – 400", but with contours, the sequencing and timing of the components is immediately clear.

Another advantage of contours is that people who need to know the effective prosodic patterns of a language, for example second language learners, can understand such diagrams fairly quickly. It is even conceivable that contours approximate the true nature of these prosodic patterns as they exist in the human mind. The notion of cue strength [10, 11] may have a natural implementation in terms of contours: the similarity between an input pitch curve and the cue contour may be an easy way to estimate cue strength. Contour-based descriptions can be used not only for recognition but also for production. And finally, contours and the parameters describing them may serve as useful higher-level features for classifiers.

2. The Difficulty of Discovering Contours

Despite the attractions, contours have one great disadvantage: the difficulty of discovering them. In contrast to the direct method, where, as long as one has the necessary resources and properly prepared data, the hard work can be entrusted to the machine learning algorithm, the discovery of a new prosodic contour can be a time-consuming process. Of course any specific utterance has a pitch contour, but going from examples to a general rule is not straightforward.

In particular, elicitation and instrumental techniques that work for monolog phenomena are hard to apply to dialog-specific phenomena such as the prosody of attitude, information-state, speaker and interlocutor cognitive state, and turn-taking. While some tools and methods are designed to support the discovery of dialog-relevant prosodic patterns [12, 13], the process is generally still labor-intensive.


Figure 2: from Figure 6 of Edlund et al. [1], by permission

In 2009, however, Edlund, Heldner and Pelce proposed the use of "bitmap clusters" [1], a visualization method based on superimposing many individual pitch contours to reveal general patterns:

by plotting the contours with partially transparent dots, the visualizations give an indication of the distribution of different patterns, with darker bands for concentrations of patterns

They used this method to visualize the contexts preceding various utterance types in the Swedish Map Task Corpus. Figure 2, from their paper, shows the contexts preceding 859 talkspurts which were tagged as "very short utterances" and as having "low propositional content", which were probably mostly acknowledgments. The possibly visible red rectangle was added by hand to mark the frequent occurrence of a region of low pitch found "860 to 700ms prior to the talkspurts", which they identified with a back-channel cue previously noted in the literature.

Bitmap clusters are, however, still only suggestive, and so far have not been shown useful for new discoveries. This paper builds on this foundation to create visualization methods that are.

3. Visualization Improvements

We made several improvements to [1].

First, we chose the pitch regions to overlap in a different way. Edlund et al. aligned the ends of the talkspurts at the right edge of the display, presumably assuming that the prosodic cues of interest occur at, and are aligned with, utterance ends. While possibly valid for some dialog types, this is not suitable for, say, back-channels in dialog, which often overlap the continuing speech of the interlocutor. We therefore aligned based on the start of the response of interest, as was also done by [14] for speaking fraction and gaze. Thus our right edge, the 0 point, is always the onset of the response.

Second, we normalized the pitch differently. Edlund et al. vertically aligned the contours so that the "median of the first three voice frames in each contour fell on the midpoint of the y-axis," providing a form of per-utterance normalization. We chose instead to normalize per-speaker, based on our experience that normalization with respect to longer time spans can improve identification of cues [15], probably because turn-taking signals, unlike some other prosodic phenomena, are not tightly bound to utterances, but are relative to the speaker's overall behavior. Among the various possible normalization schemes, we chose a non-parametric approach, representing each pitch point as a percentile of the overall distribution for that speaker. Compared to approaches which explicitly estimate parameters, such as pitch range or standard deviation, using assumptions about the distribution, we felt this likely to be more robust.

Third, we chose to display an additional feature, energy, again normalized by speaker and expressed in percentiles. This was for two reasons. First, the pattern of speaking versus silence

Figure 3: Overlaid Pitch, Energy, Delta Pitch and Delta Energy for Japanese

is also important for turn-taking, and we wanted to represent and model this explicitly, rather than leave it to some generic utterance-delimiting pre-processing phase. Second, energy is important in identifying stressed syllables, fillers and so on.

Fourth, we included the deltas: delta pitch and delta energy. Delta pitch may reveal upslopes, downslopes, and flat regions, and delta energy may reveal lengthened syllables and slow speaking rate. For pitch, if two adjacent pitch points are valid we plot the difference between the previous pitch value and the current pitch value, both expressed as percentiles.

Fifth, we extended the displays out to 2 seconds of past context, to look for longer-term patterns.

Sixth, we did without pitch smoothing, not wanting to risk losing information.

Seventh, since what we really want to see is not the distributions before the events of interest, but how those distributions differ from the general distributions, we added a sharpening step. This was done by subtracting out the global average distribution from each offset. In other words, we subtracted the mean for each percentile, using means estimated from a fairly large random sample over the dialogs. Before doing this the diagrams were blurry and hard to interpret; afterwards they were much sharper, although somewhat more blotchy.

Henceforth we will refer to diagrams made in this way as "overlaid prosodic displays". Each point represents the count of times that value occurred at that time, normalized so that the highest-count point is pure black.
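The core of this construction can be sketched as follows. This is not the authors' code; it assumes hypothetical inputs (a per-speaker pitch track sampled every 10 ms and a list of response onset times) and shows the percentile normalization, the accumulation of an offset-by-percentile count matrix, and the sharpening step that subtracts a baseline built from random time points.

```python
import numpy as np

FRAME = 0.01      # 10 ms frame step
WINDOW = 200      # 2 s of past context = 200 frames

def to_percentiles(track):
    """Map each valid (non-NaN) pitch value to its percentile (0-100) within this speaker's track."""
    out = np.full(len(track), np.nan)
    valid = ~np.isnan(track)
    ranks = np.argsort(np.argsort(track[valid]))
    out[valid] = 100.0 * ranks / max(len(ranks) - 1, 1)
    return out

def overlay_counts(track_pct, onset_times):
    """Accumulate a (time-offset x percentile-bin) count matrix for frames preceding each onset."""
    counts = np.zeros((WINDOW, 101))
    for t in onset_times:
        end = int(round(t / FRAME))
        for k in range(WINDOW):
            idx = end - WINDOW + k
            if 0 <= idx < len(track_pct) and not np.isnan(track_pct[idx]):
                counts[k, int(track_pct[idx])] += 1
    return counts

def sharpen(counts, track_pct, n_random=2000, seed=0):
    """Subtract the distribution seen at random points, so only deviations from the norm remain."""
    rng = np.random.default_rng(seed)
    random_onsets = rng.uniform(WINDOW * FRAME, len(track_pct) * FRAME, size=n_random)
    baseline = overlay_counts(track_pct, random_onsets)
    return counts / counts.sum() - baseline / baseline.sum()
```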

4. Initial Validation

We developed these refinements as we tried to better visualize the contexts preceding back-channels in several languages.


We chose to look at back-channeling because it is a classic issue in turn-taking, and because previous research suggests that, among all turn-taking phenomena, back-channeling may be the one where the behavior of one speaker is most strongly influenced by the immediately preceding prosody of the other. The result for Japanese is seen in Figure 3, showing the overlaid prosodic displays for the speech of the interlocutor in the contexts immediately preceding 873 back-channels in casual conversation [15].

While no contour is directly visible, some useful features are: in the second or so preceding the back-channel, the interlocutor's pitch tends to be low, around the 25th percentile, starting around 200ms before the back-channel, and stable; and the energy tends to be high starting about 1000ms before the back-channel, but never very loud in the final 200ms. This roughly matches what we know: that the primary cue to back-channels in Japanese is a region of low pitch, with the interlocutor usually continuing speaking at least until the back-channel response starts. While the optimal prediction rule we found earlier looks somewhat different (requiring a region of pitch lower than the 26th percentile 460 to 350 ms before the back-channel onset [15]), this visualization could clearly be a useful clue to the discovery of such a rule. Applying this method to English, Egyptian Arabic, and Iraqi Arabic data also revealed patterns which matched what we know from previous work [8].

5. Utility for Cue Discovery

Of course, interpreting a diagram is easy when you already know what you expect to see. As a fairer test of the utility of this visualization, we applied it to a language which we had not previously examined, Chinese.

Using 18 dialogs from the Callhome corpus of telephone speech [16], 90 minutes in total, we had two native speakers independently identify all back-channels according to the criteria of [15]. One identified 528 and the other 467. We then took the intersection of the two sets, reasoning that working with unambiguous cases would make it easier to see the normal pattern. This gave us 404 back-channels.

Digressing briefly to comment on back-channeling in Chinese, contrary to what is sometimes reported, back-channels were quite frequent: at over 4 per minute, almost as common as in English. This however may be due in part to the fact that at least one participant in each dialog was resident in North America. Also, although not important for current purposes, we had the annotators label the back-channels. As they were not phonetically sophisticated, we let them use whatever letter sequences they liked. The fifteen most frequent labels of one annotator were uh, oh, dui, uh-huh, em, shima, hmmm, ok, yeah, huh, duia, uhuh, shia, hmmmm, and good, similar to those seen in other corpora [17].

The task we set ourselves was that of discovering what prosodic pattern in the interlocutor's speech was serving to cue back-channel responses. We formalized this in a standard way [15, 2, 18], requiring a predictor able to process the dialog incrementally and, every 10 milliseconds, predict whether or not a back-channel would occur in the next instant, based on information in the interlocutor's track so far. The second author, armed with the visualizations seen in Figure 4 and software infrastructure previously developed for extracting prosodic features and making similar decisions for other languages, but with no knowledge of Chinese, got to work.

He immediately noted that the pitch tends to go extremely low from about –500 to –100 milliseconds, and that the energy

Figure 4: Overlaid Pitch, Energy, Delta Pitch and Delta Energy for Chinese

went low starting at about –200 milliseconds, although not necessarily to silence. The deltas indicated that the pitch tended to be flat from –500 to –200ms, and that the energy also tended to be stable from –600 to 0. Before long he came up with a predictive rule: in Chinese, respond with a back-channel if the interlocutor's speech contains:

• a low pitch region, below the 15th percentile and lasting at least 220 ms, followed by

• a pause of at least 150ms

This predicted back-channel occurrences with 25% coverage and 9% accuracy. Improvement is certainly possible, but the performance is well above random guessing, which gives 15% coverage and 2% accuracy.
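A minimal sketch of how such a rule can be applied frame by frame, assuming a hypothetical array of per-speaker pitch percentiles sampled every 10 ms (NaN where unvoiced); the function and parameter names are illustrative, not the authors' implementation, and treating every unvoiced frame as part of a pause is a simplification.

```python
import numpy as np

FRAME = 0.01  # 10 ms

def predict_backchannel_points(pitch_pct, low_pct=15, low_min_s=0.22, pause_min_s=0.15):
    """Return frame indices at which the low-pitch-then-pause rule fires.

    pitch_pct: per-frame pitch percentiles for the interlocutor (NaN = unvoiced,
    used here as a stand-in for a pause).
    """
    low_needed = int(low_min_s / FRAME)
    pause_needed = int(pause_min_s / FRAME)
    predictions = []
    low_run, pause_run = 0, 0
    for i, p in enumerate(pitch_pct):
        if np.isnan(p):                       # unvoiced frame: possibly part of a pause
            pause_run += 1
            if low_run >= low_needed and pause_run == pause_needed:
                predictions.append(i)         # low region followed by a sufficient pause
        else:
            if pause_run > 0:                 # voicing resumed: the previous low region is over
                low_run = 0
            pause_run = 0
            low_run = low_run + 1 if p < low_pct else 0
    return predictions
```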

6. Future Work

Thus we conclude that overlaid prosodic displays are a visualization method with value. However, there is clearly room for improvement. Consider Figure 5, displaying the contexts of 152 back-channels in Spanish [9]. While it suggests some features likely to be components of a contour, including some fuzzy pitch tendencies, a pause in the last quarter second, and possibly a tendency for flat pitch from –1500 to –1000 ms, as seen in the deltas, there is not much else, even knowing the pattern we expect to find (Figures 1 and 6).

There are several possibilities for improving these visualizations. One could explicitly display also duration or rate. One could make the features more robust. One might apply a thinning algorithm to visually accentuate the tendencies: to turn cloudy streaks into nice curves. One could improve the way the horizontal alignment is done in generating the overlays. In particular, as reaction times vary, the time from the prosodic cue,


Figure 5: Overlaid Pitch, Energy, Delta Pitch and Delta Energy for Spanish

whatever it may be, to the response will not be constant, so one might devise an expectation-maximization algorithm, where the horizontal alignments are iteratively adjusted to make the pitch contours align better. Finally one could use parametric methods [19] to force the visualizations to really look like contours.

We do not propose contours as a panacea. Among other weaknesses, they do not naturally represent the degree to which their component features may stretch internally or relative to each other; for that purpose they need annotations (as in Figure 1) or a sibling description (as in Figure 6). However contour-based descriptions can be useful, and this new tool can help with their discovery.

Predict a back-channel starting 350 ms after all of the following conditions hold:

A. a low-pitch region, < 26th percentile and 50–500 ms in length

B. a high-pitch region for at least one moment, starting > 75th percentile and never < the 26th

C. a lengthened vowel of duration >= 100 ms
D. a pause >= 100 ms

Where:B closely follows A:

30-750 ms from end of low to start of highC closely follow B:

0-100ms from start of high to start of lengthened vowelD closely follows C:

0-60ms from end of lengthened vowel to start of pause

Figure 6: A rule for predicting back-channel opportunities in Spanish, from [8]

Acknowledgment: This work was supported in part by the NSF as Project No. 0415150 and by RDECOM via USC ICT. We thank Tatsuya Kawahara for comments.

7. References

[1] J. Edlund, M. Heldner, and A. Pelce, "Prosodic features of very short utterances in dialogue," in Nordic Prosody – Proceedings of the Xth Conference, pp. 56–68, 2009.

[2] J. Gratch, N. Wang, A. Okhmatovskaia, F. Lamothe, M. Morales, R. van der Werf, and L.-P. Morency, "Can Virtual Humans Be More Engaging Than Real Ones?," Lecture Notes in Computer Science, vol. 4552, pp. 286–297, 2007.

[3] A. Raux and M. Eskenazi, "A finite-state turn-taking model for spoken dialog systems," in NAACL HLT, 2009.

[4] G. Skantze and D. Schlangen, "Incremental dialogue processing in a micro-domain," in EACL, pp. 745–753, 2009.

[5] E. E. Shriberg and A. Stolcke, "Direct modeling of prosody: An overview of applications in automatic speech processing," in Proceedings of the International Conference on Speech Prosody, pp. 575–582, 2004.

[6] D. Bolinger, Intonation and Its Parts. Stanford University Press, 1986.

[7] M. H. Cohen, J. P. Giangola, and J. Balogh, Voice User Interface Design. Addison-Wesley, 2004.

[8] N. G. Ward and J. L. McCartney, "Visualization to support the discovery of prosodic contours related to turn-taking," Tech. Rep. UTEP-CS-10-24, University of Texas at El Paso, 2010.

[9] A. G. Rivera and N. Ward, "Prosodic cues that lead to back-channel feedback in Northern Mexican Spanish," in Proceedings of the Seventh Annual High Desert Linguistics Society Conference, University of New Mexico, 2008.

[10] A. Gravano and J. Hirschberg, "Turn-taking cues in task-oriented dialogue," Computer Speech and Language, vol. 25, pp. 601–634, 2011.

[11] L. Huang, L.-P. Morency, and J. Gratch, "Parasocial consensus sampling: Combining multiple perspectives to learn virtual human behavior," in 9th Int'l Conf. on Autonomous Agents and Multi-Agent Systems, 2010.

[12] N. Ward and Y. Al Bayyari, "A case study in the identification of prosodic cues to turn-taking: Back-channeling in Arabic," in Interspeech 2006 Proceedings, 2006.

[13] T. K. Hollingsed and N. G. Ward, "A combined method for discovering short-term affect-based response rules for spoken tutorial dialog," in Workshop on Speech and Language Technology in Education (SLaTE), 2007.

[14] K. P. Truong, R. Poppe, I. de Kok, and D. Heylen, "A multimodal analysis of vocal and visual backchannels in spontaneous dialogs," in Interspeech, pp. 2973–2976, 2011.

[15] N. Ward and W. Tsukahara, "Prosodic features which cue back-channel responses in English and Japanese," Journal of Pragmatics, vol. 32, pp. 1177–1207, 2000.

[16] A. Canavan and G. Zipperlen, CALLHOME Mandarin Chinese Speech. Linguistic Data Consortium, 1996. LDC Catalog No. LDC96S34, ISBN: 1-58563-080-2.

[17] D. Xudong, "The use of listener responses in Mandarin Chinese and Australian English conversations," Pragmatics, vol. 18, pp. 303–328, 2008.

[18] I. de Kok and D. Heylen, "A survey on evaluation metrics for backchannel prediction models," in Interdisciplinary Workshop on Feedback Behaviors in Dialog, 2012.

[19] D. Neiberg, "Visualizing prosodic densities and contours: Forming one from many," TMH-QPSR (KTH), vol. 51, pp. 57–60, 2011.


Where in Dialog Space does Uh-huh Occur?

Nigel G. Ward, David G. Novick, Alejandro Vega

Department of Computer Science, University of Texas at El Paso, El Paso, Texas, United States
[email protected], [email protected], [email protected]

Abstract

In what dialog situations and contexts do backchannels commonly occur? This paper examines this question using a newly developed notion of dialog space, defined by orthogonal, prosody-derived dimensions. Taking 3363 instances of uh-huh, found in the Switchboard corpus, we examine where in this space they tend to occur. While the results largely agree with previous descriptions and observations, we find several novel aspects, relating to rhythm, polarity, and the details of the low-pitch cue.

Index Terms: backchannels, feedback, prosody, context, principal component analysis, dimensions, dialog activities

1. The Contexts of Backchannels

Among the interactive phenomena of dialog, backchanneling is one of the most prototypical and among the most studied. A key question of interest is when backchannels occur. Some aspects of this question have been intensively investigated, for example regarding the prosodic contexts that cue backchannels and similar feedback [1, 2].

While we know some things about the micro-contexts of backchannels in certain situations, we lack a good understanding of the more general dialog situations in which backchannels occur. Indeed, descriptions at this level tend to come not from empirical study but from definitions, theoretical frameworks, qualitative studies, and impressionistic observations. Aspects of situations often thought to be relevant to backchanneling include having only one person holding the floor, giving a narrative or explanation, having one person being in a listening role, having that person supportive and maybe even agreeing, and being at points of new information or where grounding needs to be done [3, 4]. However such listings of factors have lacked empirical verification, may include properties that are rarely important, and may omit properties that are vital in practice.

This is a problem for efforts to build and deploy responsive systems. It is possible to backchannel naturally and effectively in dialogs with naive humans [5, 6], but so far only when the user's role is tightly constrained, for example to retell a story or solve a simple problem. To make backchanneling behavior (and ultimately other types of rapid response behaviors) more robust and useful in freer dialog settings, we need a better understanding of the dialog contexts and activities in which they occur.

Thus we here undertake an empirical, statistical exploration of the dialog contexts of occurrence of backchannels.

2. Dialog Dimensions

To understand the typical dialog situations where backchannels occur, we need to start with a way to describe dialog situations. While there are many taxonomic systems to choose from, here we use a new, empirical method [7]. Reasoning that the local prosody is a good indicator of dialog activities and states, we started with 76 local prosodic features, consisting of pitch height, pitch range, speaking rate, and volume, computed over different regions of a 6 second window, and computed for both participants in the dialog. We computed these features every 10 milliseconds throughout the corpus. We then applied Principal Component Analysis to these values. This gave a list of 76 dimensions, ordered by how much of the variation in the prosodic features they explain.
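A minimal sketch of this kind of pipeline, using scikit-learn on a stand-in feature matrix (one row per 10 ms frame, 76 prosodic-feature columns); this is an illustration, not the authors' actual code:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the real feature matrix: one row per 10 ms frame, 76 prosodic
# features (pitch height, pitch range, rate, volume over sub-windows, both speakers).
features = np.random.default_rng(0).normal(size=(100_000, 76))

scaler = StandardScaler()
pca = PCA(n_components=76)
dimensions = pca.fit_transform(scaler.fit_transform(features))  # shape (n_frames, 76)

# dimensions[:, k] is the value of dimension k+1 at each frame;
# pca.explained_variance_ratio_ gives how much variation each dimension explains.
```

Any moment of interest, such as the onset of an uh-huh, can then be characterized by the corresponding row of dimension values.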

Upon examination [7, 8], most of the top dimensions turned out to align with aspects of dialog. These aspects were diverse, including dialog situations, transient dialog states, cooperative dialog acts, simpler dialog actions, apparent mental states, and some prosodic behaviors.

Since these are truly dimensions, they are continuously valued. Thus, a given moment in dialog might have a value –0.74 on dimension 1, +.03 on dimension 2, and so on. Any specific instance of a backchannel occurs at a point in this 76-dimensional dialog space, and thus we can gather statistics on where backchannels tend to occur.

3. Uh-Huh and the Dimensions

To determine the typical dialog contexts of backchannels, we examined the patterns of occurrence of uh-huh in the Switchboard corpus. We chose Switchboard because it comprises unstructured dialogs and includes a wide variety of dialog activities. We chose uh-huh as a proxy for backchannels because uh-huh is almost always a backchannel and is one of the most common typical backchannel forms (along with uh, yeah, [laughter], oh,


and um-hum). We gathered statistics over a 600K word subset of Switchboard, which includes 3363 instances of uh-huh. These tokens are not limited to phonetically precise uh-huh tokens, as the labelers' guide enjoins transcribers to "use "uh-huh" or "um-hum" (yes) . . . for anything remotely resembling these sounds of assent" [9] (although in practice the degree of yes-ness did not seem to much affect what ended up in the transcripts).

If backchannels were mostly about reacting to content or mostly about reacting to a few specific cues, we would expect them to be distributed evenly across most of the dimensions. However in fact many of the distributions were strongly asymmetric: for twelve of the dimensions, 75% or more of the occurrences of uh-huh were on one side or the other (positive or negative) of the dimension, as seen in the left columns of Table 1, and all of these asymmetries are significant (by the chi-square test, p < .0001). Thus backchannels do relate to multiple dimensions of dialog.
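As an illustration of this kind of asymmetry test, the following is a sketch under the simplifying assumption of a 50/50 null split (not necessarily the exact null distribution used in the paper), given the dimension values at each uh-huh onset:

```python
import numpy as np
from scipy.stats import chisquare

def dimension_asymmetries(uhhuh_dims, threshold=0.75):
    """uhhuh_dims: array of shape (n_uhhuhs, n_dimensions), the dialog-space
    coordinates at each uh-huh onset (hypothetical input)."""
    n, n_dims = uhhuh_dims.shape
    results = []
    for k in range(n_dims):
        pos = int(np.sum(uhhuh_dims[:, k] > 0))
        share = max(pos, n - pos) / n
        # Null hypothesis (assumed here): uh-huh is equally likely on either side.
        _, p = chisquare([pos, n - pos], f_exp=[n / 2, n / 2])
        if share >= threshold and p < 1e-4:
            results.append((k + 1, share, p))   # dimension number, skew, p-value
    return results
```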

4. Interpretations

While the statistical information in Table 1 is adequate as a practical, operational answer to the question of where backchannels occur, we wanted to go further, to develop a real understanding of the reasons for and significance of the associations with these regions of dialog space.

For this, our initial exploration, we examined each dimension individually. We generally started with a previous description of the dimension [7, 8] and tried to understand how uh-huh was similar to the other dialog activities that were common in those particular contexts. To do this we listened to examples of speech in the vicinity of uh-huh as it appeared in those contexts and to examples of other words in similar contexts. While exploratory inductive studies of this sort risk merely confirming the observers' prejudices, that was not the case here. Indeed, we were repeatedly surprised to see connections to constructs that we had not previously thought relevant, notably transition relevant places and dialog rhythms.

4.1. The 12 Most-Related Dimensions

From the twelve dimensions where the distribution of uh-huh was most skewed, we infer connections to:

Turn grabbing

91% of the occurrences of uh-huh were in contexts that were low on dimension 5. In other situations low on this dimension the speaker is starting a turn. (Situations high on this dimension were mostly turn yields.) This implies that uh-huh can sometimes take the turn and also may function to decline to take a turn at a point when that opportunity was available [4].

Pushing for a new perspective

89% of the cases of uh-huh occurred in contexts on the lower half of dimension 17. Other typical dialog actions low on dimension 17 were short questions or other swift bids to slightly change the topic while the interlocutor is monopolizing the floor. (In situations high on this dimension the speaker and/or interlocutor were generally engaged in elaborating a feeling or mood.) This suggests that uh-huh can be a way to move the conversation forward, a facet which the common term "continuer" highlights.

Quick thinking

89% were on the high side of dimension 11. Typical utterances high on dimension 11 were very swift echoes and confirmations. (Situations on the low side were typically low in confidence and/or content, for example when ending an utterance with I guess.) This suggests that uh-huh can indicate attention and quick understanding.

Expressing sympathy

86% were high on dimension 18. In other typical dialog situations high on this dimension the speaker is expressing pity or sympathy for someone in a bad situation the interlocutor has just described. (Situations low on this dimension were frequently descriptions of people or happenings meriting sympathy, and thus soliciting an expression of sympathy.) This suggests that uh-huh can convey sympathy.

Expressing empathy

86% were high on dimension 6; this region typically included expressions of empathy. To clarify, this dimension differs from the previous one in that it includes positive emotions and evaluations. In terms of prosodic contexts, expressions of empathy typically respond to a phrase or word produced in high pitch by the interlocutor (Arizona's beautiful), whereas sympathetic responses usually respond to phrases produced in low volume and reduced pitch range. (Situations low on this dimension were typically emotional expressions and evaluations, which often invited an expression of empathy.) Thus uh-huh patterns with other expressions of empathy.

Other speaker talking

85% were high on dimension 1. In other situations high on this dimension the interlocutor was talking almost constantly while the speaker of interest was mostly quiet. (The low end of this dimension was the exact opposite.) Interestingly, here the simple percentage does not tell the whole story: in fact 67% of the uh-huhs were in the 3rd quartile on this dimension. This indicates, unsurprisingly, that uh-huh occurs when it is mostly the interlocutor who is speaking, but not in an extreme monolog context. This facet is naturally the one which the term "backchannel" highlights.

Rambling

82% of the cases of uh-huh occurred in contexts that were on the lower half of dimension 14. Other typical dialog situations low on this dimension were where the speaker has low interest in what he himself is saying, but


seems to feel the need to say something anyway. (On the high side, the speaker was usually speaking clearly, even emphatically, in a bright tone.) This suggests that uh-huh can convey low interest and a lack of anything specific to say, a facet which the common term "minimal vocalization" highlights.

Signaling an upcoming point of interest

79% were high on dimension 26. At points with high values on this dimension, the speaker often seems to be signaling that the dialog is about to take off in some way. Prosodically, this is characterized by a moderately high volume for a few seconds that then turns low and is accompanied by a slower speaking rate and a region of low pitch for a hundred milliseconds or so; after this comes the point of interest, and then in the near future typically both speakers have some speaking role, both with higher than average pitch height. (At points with low values on this dimension, the speaker is typically involved in a narrative and speaking with low volume, and appears to be downplaying the importance of what he's saying, for example in situations where he needs to indicate that what he's saying is just background to an upcoming main point.) Thus uh-huh can be cued by a prosodic context including a region of low pitch, which elaborates a well-known result [10].

Deploring something

78% were high on dimension 37. At other points with high value on this dimension, the speaker is often describing something deplorable, as in if the legislature has their way about it they're going to raise the tuition and double and in the straw that broke the camel's back, with something of a sing-song unstressed-stressed alternation. (Times with low values on this dimension often fall near the point where the speaker starts to reveal that a situation also has a silver lining.) This suggests that an uh-huh can serve to share a complaint.

Not delivering confidently

76% were low on dimension 72. At other points low on this dimension the speaker's delivery was often weak, in the extreme including false starts or disfluencies, and the interaction between the speakers, if any, was awkward. (At points high on this dimension the speaker had established something of a rhythm of speaking although, unlike the previous dimension, typically with several unstressed syllables between each stressed syllable. If the listener was saying anything at all, his words tended to fit in smoothly where the speaker would have put unstressed syllables. Pragmatically, this seems common in cases where the speaker really knows what he wants to say.) Thus uh-huh patterns with speech that lacks full confidence and a clear delivery.

Agreeing and preparing to move on

76% were high on dimension 24. In other situations high on this dimension, the speaker was expressing agreement with or sharing the other's thought or feeling, preparatory to moving the focus to a new aspect of the topic. (In situations low on this dimension both speakers were focusing for some time on the same shared referent.) Thus uh-huh patterns with agreeing, closing out, and bidding to move on.

Low focus

75% were low on dimension 29. In general this was seen in contexts where there was an unstressed or somehow deemphasized word. While we found no consistent pragmatic or dialog function for these, sometimes they co-occurred with taking a personal stance. (On the high side of this dimension there was a stressed word in the context, and this often occurred where the focus was on establishing the facts, and where the speaker had knowledge that the interlocutor clearly lacked.) Thus uh-huh patterns with a lack of stress, a relative lack of knowledge of the topic, and with taking a stance that is personal, rather than fact-oriented.
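A minimal sketch of how such per-dimension skew percentages could be computed, assuming each uh-huh context has a coordinate on every principal-component dimension and that "high" and "low" simply mean above or below the corpus-wide median for that dimension (the exact thresholding is not restated here); the names uhhuh_scores and corpus_scores, and the toy data, are hypothetical:

import numpy as np

def skew_table(uhhuh_scores, corpus_scores):
    """For each dialog-space dimension, report what fraction of uh-huh contexts
    lies on the high side, defined here (as an assumption) as being above the
    corpus-wide median for that dimension.

    uhhuh_scores:  array (n_uhhuhs, n_dims), PC coordinates at uh-huh times
    corpus_scores: array (n_frames, n_dims), PC coordinates over the whole corpus
    """
    medians = np.median(corpus_scores, axis=0)
    frac_high = (uhhuh_scores > medians).mean(axis=0)
    rows = []
    for dim, f in enumerate(frac_high, start=1):
        side, pct = ('hi', f) if f >= 0.5 else ('lo', 1 - f)
        rows.append((dim, side, 100 * float(pct)))
    # Sort by asymmetry, most skewed dimensions first (as in Table 1).
    return sorted(rows, key=lambda r: r[2], reverse=True)

# Toy data: 40 uh-huh contexts and 10000 corpus frames in a 12-dimensional space,
# with the uh-huh contexts artificially shifted toward the high side.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(10000, 12))
uhhuhs = rng.normal(loc=0.8, size=(40, 12))
for dim, side, pct in skew_table(uhhuhs, corpus)[:5]:
    print(f"PC {dim}: {side} {pct:.0f}%")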

4.2. Other Dimensions

Table 2 lists others among the top two dozen dimensions for which the distribution of uh-huh is strongly asymmetric; discussion of each dimension appears elsewhere [7, 8]. While some are not surprising (the correlations with dimensions 8, 12, and 19), dimension 10 is more interesting: the preponderance of low values indicates that uh-huh patterns with thinking of something to say next, rather than with being disengaged from the dialog. This is perhaps why uh-huh can work for feigning attention, as Yngve humorously observed [3]. Dimension 13 suggests that backchannels are similar in some ways to the beginnings of contrasts, and dimension 21 to actions done to mitigate potential face threats. These aspects should be looked at more closely.

4.3. Discussion

First, we note that our new analysis method, indirect though it is, appears to work: most of the aspects of dialog that uh-huh co-occurs with are things that could be expected from one or another of the common descriptions of backchannels. Yet the results were not without novelty, for example in the connections with the dialog aspects involved in dimensions 14, 37, 72, 29, 13, and 21.

Second, we note the wide variety of factors involved in backchannel behavior. Although many studies of these phenomena approach them in ways that limit the findings to just one type of context, or one type of cue, or one type of effect, in fact backchannels are richly multi-faceted and multifunctional.

5. Future Work

Tracking the dialog situation using the dimensions identified here may enable future dialog systems to perform backchanneling more robustly, even with uncontrolled users, with the likelihood that backchannels will occur only in suitable contexts.

Dimension   Skew     Interpretation (Abbreviated)
PC 5        lo 91%   turn grab (vs. turn yield)
PC 17       lo 89%   pushing for a new perspective (vs. elaborating current feeling)
PC 11       hi 89%   attentive and quick-thinking (vs. low confidence)
PC 18       hi 86%   expressing sympathy (vs. seeking sympathy)
PC 6        hi 86%   expressing empathy (vs. seeking empathy)
PC 1        hi 85%   other speaker talking (vs. this speaker talking)
PC 14       lo 82%   rambling (vs. placing emphasis)
PC 26       hi 79%   signaling interestingness (vs. downplaying things)
PC 37       hi 78%   deploring something (vs. also planning to talk about the good side)
PC 72       lo 76%   speaker awkward (vs. speaking with a clear delivery)
PC 24       hi 76%   agreeing and preparing to move on (vs. jointly focusing)
PC 29       lo 75%   no recently stressed word (vs. stressed word present)

Table 1: The 12 dimensions with the most asymmetric distributions of uh-huh, with interpretations of the uh-huh-rich side (vs. the opposite side).

Dimension   Skew     Interpretation (Abbreviated)
PC 8        hi 66%   ending crisply (vs. petering out)
PC 10       lo 73%   engaging in lexical or memory access (vs. disengaged)
PC 12       lo 69%   floor yielding (vs. floor asserting)
PC 13       lo 66%   starting a contrasting statement (vs. reiterating)
PC 19       lo 75%   solicitous (vs. controlling)
PC 21       lo 71%   mitigating a potential face threat (vs. agreeing, with humor)

Table 2: Seven other high-variance dimensions with significant asymmetry.


While this exploration of dialog space has identified some general areas where uh-huhs occur, considering each dimension independently, we would like to characterize these regions more tightly. In particular, we would like to explore whether uh-huh has distinct subpopulations and, if so, whether these have distinctive phonetic and prosodic properties.
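One way such subpopulations might be explored, offered only as a sketch and not as a method the paper commits to, is to cluster the uh-huh tokens by their dialog-space coordinates and then compare a phonetic property across clusters; all names and data below are hypothetical, and the choice of three clusters is arbitrary:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical inputs: dialog-space coordinates and one phonetic property
# (duration) for each uh-huh token; the real data are not reproduced here.
rng = np.random.default_rng(1)
uhhuh_coords = rng.normal(size=(200, 12))      # positions of uh-huh tokens in dialog space
uhhuh_duration_ms = rng.normal(300, 80, 200)   # durations, in milliseconds

# Cluster the tokens; the number of clusters is a free choice that would need
# validation (e.g. silhouette scores), not something taken from the paper.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(uhhuh_coords)

# If distinct subpopulations exist, they may differ in phonetic or prosodic properties.
for k in range(3):
    durs = uhhuh_duration_ms[labels == k]
    print(f"cluster {k}: n={durs.size}, mean duration = {durs.mean():.0f} ms")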

More generally, this new analysis method could be used to help discover and characterize the roles and typical contexts of other dialog-relevant markers and behaviors.

6. Acknowledgments

This work was supported in part by NSF Award IIS-0914868. We thank Tatsuya Kawahara for comments.

7. References

[1] L.-P. Morency, I. de Kok, and J. Gratch, "A probabilistic multimodal approach for predicting listener backchannels," Autonomous Agents and Multi-Agent Systems, vol. 20, pp. 70–84, 2010.

[2] A. Gravano and J. Hirschberg, "Turn-taking cues in task-oriented dialogue," Computer Speech and Language, vol. 25, pp. 601–634, 2011.

[3] V. Yngve, "On getting a word in edgewise," in Papers from the Sixth Regional Meeting of the Chicago Linguistic Society, pp. 567–577, 1970.

[4] E. A. Schegloff, "Discourse as an interactional achievement: Some uses of 'uh huh' and other things that come between sentences," in Analyzing Discourse: Text and Talk (D. Tannen, ed.), pp. 71–93, Georgetown University Press, 1982.

[5] J. Gratch, N. Wang, A. Okhmatovskaia, F. Lamothe, M. Morales, R. van der Werf, and L.-P. Morency, "Can virtual humans be more engaging than real ones?," Lecture Notes in Computer Science, vol. 4552, pp. 286–297, 2007.

[6] M. Schroder, E. Bevacqua, R. Cowie, F. Eyben, H. Gunes, D. Heylen, M. ter Maat, G. McKeown, S. Pammi, M. Pantic, C. Pelachaud, B. Schuller, E. de Sevin, M. Valstar, and M. Wollmer, "Building autonomous sensitive artificial listeners," IEEE Transactions on Affective Computing, to appear.

[7] N. G. Ward and A. Vega, "A bottom-up exploration of the dimensions of dialog state in spoken interaction," in SIGdial, 2012.

[8] N. G. Ward and A. Vega, "Towards empirical dialog-state modeling and its use in language modeling," in Interspeech, 2012, submitted.

[9] J. Hamaker, Y. Zeng, and J. Picone, "Rules and guidelines for transcription and segmentation of the Switchboard large vocabulary conversational speech recognition corpus, version 7.1," tech. rep., Institute for Signal and Information Processing, Mississippi State University, 1998.

[10] N. Ward and W. Tsukahara, "Prosodic features which cue back-channel responses in English and Japanese," Journal of Pragmatics, vol. 32, pp. 1177–1207, 2000.

Listener head gestures and verbal feedback expressions in a distraction task

Marcin Włodarczak1, Hendrik Buschmeier2, Zofia Malisz1, Stefan Kopp2, Petra Wagner1

1Faculty of Linguistics and Literary Studies, 2Sociable Agents Group, CITEC and Faculty of Technology, Bielefeld University, Bielefeld, Germany

{zofia.malisz,petra.wagner,mwlodarczak}@uni-bielefeld.de, {hbuschme,skopp}@techfak.uni-bielefeld.de

* The first three authors contributed to the paper equally.

Abstract

We report on the functional and timing relations between head movements and the overlapping verbal-vocal feedback expressions. We investigate the effect of a distraction task on head gesture behaviour and the co-occurring verbal feedback. The results show that head movements overlapping with verbal expressions in a distraction task differ in several features from a default, non-perturbed conversational situation, e.g. frequency and type of movement and verbal-to-nonverbal display ratios.

Index Terms: communicative feedback; head gestures; dialogue; attentiveness; distraction task

1. Introduction

Head gestures "are both an integral part of language expression and function to regulate interaction" [1]. Often the resulting structure involves interactional synchrony, where head movements between speakers are aligned in a rhythmic or quasi-rhythmic way [2]. Such temporal coordination of communicative actions on many levels and in many modalities facilitates turn-taking [3] and enhances communicative attention [4, 5]. Also, as [6] notes, feedback is an essential part of the grounding process where common ground is shared and achieved as a result of joint conversational activity. Head gestures are involved in updating the information status (grounding) and in establishing rapport.

In order to describe the form of a head gesture, a couple of features need to be taken into account: head orientation, speed and amplitude of movement [7]. Several different inventories of gesture forms were devised in the past by, inter alia, [8, 9]. Research on general head gesture kinematics was pioneered by [10] and [11]. [11] distinguished between linear and cyclic kinematic forms, equivalent to e.g. single and multiple nodding bouts, and associated them with turn-taking signals and responses to questions respectively. Moreover, phrasing and prominence information can be carried by head nodding along with other visual modalities [12, 13]. [11] noted that floor-grabbing cues are usually expressed by wide, linear head movements (e.g. high-amplitude single nods), while synchronisation with pitch-accented syllables in the interlocutors' speech occurred in the case of narrow, linear head gestures, e.g. low-amplitude single nods. More importantly, the tendency of "yes" and "no" movements to be cyclic (multiple nods) was uniform and robust across speakers. In [9] feedback categories defined as "recognition-success" and "contents-affirmation", corresponding to backchannels and other affirmative responses respectively, were found to occur with "vertical head movements" of both large and small amplitude.

Claims were made by some of the above authors as to how the physical properties of head gestures relate to their communicative use. The function of a head gesture can be independent within the nonverbal modality or co-expressive with the accompanying linguistic content. [14] enumerates the criteria that are necessary to disentangle the meaning of nods. Additionally, she makes a distinction between how the meaning of a nod can be modified by the co-occurring linguistic context (such as preceding or overlapping feedback expressions) and/or the simultaneous multimodal context (co-occurring facial displays, gaze behaviour or hand gestures [15]). In [16] it was shown that head nods of a listening agent were interpreted as "agree" and "understand" by participants; however, when combined with a smile they were interpreted as "like" and "accept". This and similar examples show that the exact level of evaluation and grounding can be modified by several modalities at once and that head gestures need to be interpreted in their multimodal context.

In our study we concentrate on the functional and timing relation between head movements and the overlapping spoken feedback expressions, leaving the remaining co-occurring multimodal context, which is certainly able to modify the resulting function, to a later study. Additionally, we investigate the effect of a distraction task on head gesture behaviour and the co-occurring verbal feedback. We also briefly look at the timing relations within sequences of nods.

2. Study design

In order to analyse feedback behaviour, we carried out a face-to-face dialogue study in which one of the dialogue partners (the 'storyteller') told two holiday stories to the other participant (the 'listener'), who was instructed to listen actively, make remarks and ask questions. Furthermore, similar to [17], the listeners were distracted during one of the stories by an ancillary task. They were instructed to press a button on a hidden remote control every time their dialogue partner uttered a word starting with the letter 's' (the second most common German word-initial letter). Participants also had to count the total number of 's-words' they heard. Storytellers told two different holiday stories and listeners only engaged in the distraction task for either the first (in even-numbered sessions) or the second story (in odd-numbered sessions). Participants were seated approximately three metres apart to minimise crosstalk. Interactions were recorded from three camera perspectives: medium shots showing the storyteller and the listener and a long shot showing the whole scene.

Figure 1: The ratio between head gesture units (grey bars) and verbal feedback expressions (white bars) across 20 dialogue sessions and two experimental conditions (D: distracted; ND: non-distracted).

3. Multimodal annotation

3.1. Verbal feedback

Feedback utterances and head gesture units were segmented and transcribed for 20 sessions in the corpus. A feedback function annotation scheme was devised (see [18] for full description) in which feedback levels largely correspond to definitions by [19]. Our category P1 corresponds to backchannels understood as 'continuers', category P2 signals successful interpretation (understanding) of the message, and category P3 indicates acceptance, belief and agreement. These levels can be treated as a hierarchy with increasing value of judgement, "cognitive involvement" or "depth" of grounding. Feedback expressions were labeled according to German orthographic conventions. Feedback functions were annotated independently by three annotators taking communicative context into account. Majority labels between annotators were then calculated automatically and problematic cases (185; ca. 9%) were discussed and resolved.
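As a small illustration of that label-merging step, the sketch below takes a majority vote over three annotators' P1/P2/P3 labels and flags three-way disagreements for discussion; it is an assumed reconstruction, not the project's actual tooling:

from collections import Counter

def merge_annotations(labels_per_annotator):
    """Majority vote over three annotators' feedback-function labels (P1/P2/P3).
    Returns the merged labels and the indices of unresolved (tied) cases,
    which in the study were settled by discussion."""
    merged, unresolved = [], []
    for i, labels in enumerate(zip(*labels_per_annotator)):
        (label, count), = Counter(labels).most_common(1)
        if count >= 2:
            merged.append(label)
        else:                      # three-way disagreement: no majority
            merged.append(None)
            unresolved.append(i)
    return merged, unresolved

# Toy example with three annotators and four feedback expressions.
a1 = ['P1', 'P2', 'P3', 'P1']
a2 = ['P1', 'P2', 'P1', 'P2']
a3 = ['P1', 'P3', 'P2', 'P3']
print(merge_annotations([a1, a2, a3]))  # (['P1', 'P2', None, None], [2, 3])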

3.2. Head gestures

Head gesture annotation was based on head gesture units (HGUs). We defined an HGU as a perceptually coherent and continuous movement sequence. Any perceived pauses either before a rest (no movement) or between units were marked as unit boundaries. The exact onset and offset of an HGU was determined by close inspection of the video in ELAN. Each HGU was annotated for movement types (nod, jerk, tilt, turn, protrusion and retraction) and the number of movement cycles. The movement type inventory was arrived at incrementally while inspecting the dataset. In the case of nods, one "down-up" movement was counted as one cycle; for jerks, one "up-down" movement was counted as one cycle.

The following features were extracted for each gestural phrase: duration, complexity (the number of successive gesture types in the phrase), cycles (the total number of cycles of all gestures in the phrase) and frequency (the number of cycles divided by the duration of the unit). For example, the label "Nod-2+Tilt-1-Right+Pro-1" has a complexity degree of 3 (nod, tilt-right, protrusion) and its total number of cycles equals 4 (2 nod cycles + 1 tilt-right cycle + 1 protrusion cycle).
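A short sketch of how such features could be derived from a label of this form; the label grammar (Type-cycles[-Direction] components joined by '+') is inferred from the single example above and may not cover every case in the annotation:

def parse_hgu_label(label, duration_s):
    """Parse an HGU label such as 'Nod-2+Tilt-1-Right+Pro-1' and derive the
    features described above (complexity, total cycles, frequency)."""
    gestures = []
    for part in label.split('+'):
        fields = part.split('-')
        gtype = fields[0]                       # e.g. 'Nod', 'Tilt', 'Pro'
        cycles = int(fields[1])                 # number of movement cycles
        direction = fields[2] if len(fields) > 2 else None
        gestures.append((gtype, cycles, direction))
    complexity = len(gestures)                  # number of successive gesture types
    total_cycles = sum(c for _, c, _ in gestures)
    frequency = total_cycles / duration_s       # cycles per second over the unit
    return {'complexity': complexity, 'cycles': total_cycles, 'frequency': frequency}

# The paper's example gives complexity 3 and cycles 4; the duration here is made up.
print(parse_hgu_label('Nod-2+Tilt-1-Right+Pro-1', duration_s=1.6))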

Additionally, for phrases overlapping with short verbal feedback expressions, the exact function (P1, P2, P3) of the expression, the overlap onset (the time between the beginning of the gestural phrase and the feedback expression), and the movement types of the head gesture were recorded.

4. Results and discussion

4.1. Verbal and nonverbal feedback

The proportion of all HGUs compared to verbal feedback will be examined first. To perform the analysis we excluded head movements coinciding with longer utterances not marked as feedback. It is not possible to determine how long a gap between multimodal expressions can be in order to be perceived as a functional unit without conducting a separate study on timing relations. Therefore, barely non-overlapping verbal and HGU relations were included in the 'non-overlapping' category. Figure 1 presents the total number of HGUs (both overlapping and non-overlapping) related to the total number of verbal feedback expressions. The results are presented as a ratio between the two variables and are split into the two experimental conditions (distracted vs. non-distracted) within single dyad sessions.

4.2. Verbal and nonverbal feedback per condition

Overall, there is more nonverbal than verbal feedback in both conditions. Consequently, listeners use the nonverbal channel to signal feedback more often. It has been noted that head movement is present almost incessantly in human interactive communication. Moderately involved, polite listener behaviour, however, can be hypothesised to feature less speech and manual gesture but lots of eye contact and head movement. In our setting, a comparison between the "default" non-distracted condition and the distracted condition provides a platform for studying levels of involvement and attention in the feedback-giving context in listeners. Indeed, Figure 1 suggests a tendency for 17 out of 20 listeners to produce more HGUs when distracted experimentally. Overall, we observed 65% nonverbal to 35% verbal signals in the distracted condition and 57% nonverbal to 43% verbal signals in the non-distracted condition (χ2 = 22.3, p < 0.001). As shown in [18], less verbal feedback was displayed by distracted than non-distracted listeners. This was corroborated in the present study, where six more sessions from the same corpus were added to the analysed dataset.
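For readers who want to reproduce this kind of condition comparison, the sketch below runs a χ2 test on a 2x2 table of nonverbal vs. verbal feedback counts per condition; the counts are invented to match the reported proportions, since the raw numbers are not given here:

from scipy.stats import chi2_contingency

# Hypothetical counts (nonverbal, verbal) per condition; only the proportions
# (65%/35% distracted, 57%/43% non-distracted) are reported in the text.
table = [[650, 350],    # distracted
         [570, 430]]    # non-distracted

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2g}")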

No significant differences between conditions in the proportion of time spent gesturing with the head and in the number of HGUs were found. Overall, subjects spent 17.5% of the time gesturing (19% in the distracted condition, 16% in the non-distracted). Similarly, no evidence of a distraction effect on absolute HGU counts (also normalised by the dialogue duration) was found. The effect seems to be evident only in the interaction between the verbal and nonverbal channels.

Figure 2: Conditional probabilities of three head gesture types (nod, jerk, tilt) given the function of the overlapping verbal expression (P1: backchannel, P2: understanding, P3: agreement/acceptance) in the two experimental conditions (left panel: non-distracted; right panel: distracted).

4.3. Gesture types across dialogue act categories

We assumed that HGUs overlapping with verbal feedback expressions share the same feedback function. Consequently, gesture units overlapping with more than one feedback expression ("multiple overlaps") were excluded, because the functional relation between head movements and verbal feedback could not be determined for these cases. Figure 2 presents the conditional probability of the most frequent gesture types given the function of the overlapping verbal expression (P1, P2, P3). The left panel corresponds to the non-distracted condition.

Nods predominate both when overlapping with feedback expressions and on their own (81.5% of non-overlapping cases). The probability of nods decreases as one moves up the feedback function hierarchy, while other head movement types become more likely. Specifically, the probability of the tilt correlates positively with the feedback function. For example, tilts are twice as probable in the "acceptance/agreement" function (P3) as in the "understanding" function (P2), and three times as probable as in backchanneling (P1). We also observe that the probability of the jerk occurring in P2 is four times as high as in P1 and more than two times higher than in P3. Jerks are characteristic as displays of understanding and surprise, especially with the meaning of "I have finally understood", e.g. after check questions [20].
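The quantities plotted in Figure 2 are conditional probabilities of the form P(gesture type | feedback function), which can be tabulated from the overlap annotations roughly as follows; the example records are invented placeholders, not the study's data:

from collections import Counter, defaultdict

def gesture_given_function(overlaps):
    """overlaps: iterable of (gesture_type, feedback_function) pairs, one per
    HGU that overlaps exactly one verbal feedback expression.
    Returns P(gesture_type | feedback_function)."""
    by_function = defaultdict(Counter)
    for gesture, function in overlaps:
        by_function[function][gesture] += 1
    return {
        function: {g: n / sum(counts.values()) for g, n in counts.items()}
        for function, counts in by_function.items()
    }

# Invented example records.
records = [('nod', 'P1')] * 18 + [('tilt', 'P1')] * 2 + \
          [('nod', 'P2')] * 12 + [('jerk', 'P2')] * 6 + [('tilt', 'P2')] * 2 + \
          [('nod', 'P3')] * 10 + [('tilt', 'P3')] * 6 + [('jerk', 'P3')] * 4
print(gesture_given_function(records)['P2'])   # {'nod': 0.6, 'jerk': 0.3, 'tilt': 0.1}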

German speakers tend to produce mostly repeated nods across feedback categories (P1 = 74.5%, P2 = 81%, P3 = 86%). Results in [6] (for Swedish speakers) are therefore corroborated on a larger dataset. Insofar as our feedback function inventory corresponds to the category of "yes and no responses" in [11], our result is also in agreement with their conclusion that those involve cyclic (multiple-cycle) movements. However, the backchannel function P1 (comparable to "ContinuationYouGoOn" used by [6]) exhibits a lower percentage of multiple nods than the higher feedback functions (P2 and P3).

For HGUs composed of head gestures other than nods only, complexity and frequency tend to fall with feedback function level; a significant difference was found between P1 and P3 (Mann-Whitney test, p < 0.05 and p < 0.01 respectively) for this data subset.

4.4. Gesture types per dialogue act category and condition

For HGUs overlapping with verbal feedback in the distracted condition, the probability of a nod co-occurring with an expression bearing the P2 function is closer to the probability for overlaps with the backchannel function (see Figure 2, right panel). In the default, non-distracted conversational situation, on the other hand, the probability of nods comes closer to the probability for overlaps with the "acceptance/agreement" function (see Figure 2, left panel). Also, while bearing in mind the low number of annotated jerks overlapping with the "understanding" function (35 instances), we observe a tendency to decrease the use of jerks in the distracted condition when expressing understanding verbally. It is possible that the two phenomena are related: the characteristic "I understand" nonverbal expression, the jerk, is replaced with the nod, a more minimal, "default" response.

81.5% of the non-overlapping head gestures in the whole dataset (N = 1328) were nods (N = 1083). In the non-overlapping category we find no significant differences between distracted and non-distracted listeners in the number of cycles, the proportion of multiple vs. single nods, and the duration of the nods.

Significant differences between the distracted and non-distracted conditions in the number of movement cycles and in frequency were observed for P2 (p < 0.05 and p < 0.01 respectively). The result indicates that more intense movement in the time domain is characteristic of distracted listeners expressing understanding.

4.5. Movement timing

We analysed overlaps between HGUs and single feedback expressions. The overlap onset is negative, i.e., the HGU onset precedes the feedback expression onset. [21] found that nods in listeners preceded the corresponding speech by 175 ms. Most HGU onsets in our data were close in time to the overlapping feedback expression onset (median, non-distracted = 202 ms, SD = 380 ms), with a clear tendency for the HGU onset to precede the verbal expression.

Additionally, a regression analysis was performed on head nod durations in order to determine whether HGUs with multiple nod cycles show a linear trend as the number of cycles increases. It turned out that, beyond one nod cycle, the duration of the HGU increases by 320 ms with each consecutive nod (adjusted R2 = 0.6). A non-zero intercept indicates that as new nod cycles are added, the duration of the HGU increases non-cumulatively.
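A sketch of how such a fit could be computed with ordinary least squares; the (cycles, duration) pairs below are invented for illustration and merely mimic a slope of roughly 320 ms per additional cycle with a non-zero intercept:

import numpy as np

# Hypothetical (cycle count, HGU duration in ms) pairs for multi-cycle nods;
# the real annotation data are not reproduced in the paper.
cycles    = np.array([2, 2, 3, 3, 4, 4, 5, 6])
durations = np.array([980, 1040, 1300, 1360, 1610, 1700, 1950, 2260])

# Ordinary least-squares line: duration ~ slope * cycles + intercept.
slope, intercept = np.polyfit(cycles, durations, deg=1)

# R^2 and adjusted R^2 for the simple linear fit.
pred = slope * cycles + intercept
ss_res = np.sum((durations - pred) ** 2)
ss_tot = np.sum((durations - durations.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
n, p = len(cycles), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"slope = {slope:.0f} ms per cycle, intercept = {intercept:.0f} ms, adj. R^2 = {adj_r2:.2f}")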

The trend can be explained by the dynamics of head motion, which is continuously oscillating within a multiple-nod phrase: adding nod cycles within an uninterrupted phrase takes less time than producing separate single nods. The oscillatory nature of the process facilitates the integration of kinetic energy, so that speakers can use the momentum produced by a previous nod to produce the next one within the same phrase (at least this might hold for the contrast between single nods and multiple ones).

Consequently, a significant nonlinear trend was evidenced when single nods were added to the regression analysis. We know from the nature of this biological system that there might be some visible damping towards the end of a nodding bout with multiple cycles. [10, 11] showed that movement amplitude decreases as its frequency increases, and also that the variability in amplitude within HGUs is high, so the damping is not monotonic. Single head nods were described as linear and multiple head nods as cyclical in their study, which corresponds to the difference in the regression trend.

5. Conclusions and future work

Our results showed a significant difference between conditions, with the ratio of nonverbal to verbal feedback higher in the distracted condition. Among HGUs overlapping with verbal feedback expressions, nods, especially multiple ones, predominated. Additionally, our results suggest that the tilt is more characteristic of higher feedback categories and that the jerk expresses understanding. The variation found here in the use of the jerk between experimental conditions is in accordance with our earlier result [18] that communicating 'understanding' (as in P2) is a marker of attentiveness.

The visual modality, as mentioned earlier, can influence and modify the interpretation of the feedback function. Perceptual evaluation of feedback functions including additional visual modalities needs to be conducted in the future in order to shed more light on the complex interaction of the verbal and nonverbal cues to feedback functions. Movement timing information will be used for the study of interactional synchrony that embodies attention processes and grounding.

Acknowledgements – This research is supported by the Deutsche Forschungsgemeinschaft (DFG) at the Center of Excellence in 'Cognitive Interaction Technology' (CITEC) as well as at the Collaborative Research Center 673 'Alignment in Communication'. We would also like to thank Joanna Skubisz for help with annotation.

6. References

[1] E. McClave, "Linguistic functions of head movements in the context of speech," Journal of Pragmatics, vol. 32, pp. 855–878, 2000.

[2] F. J. Bernieri and R. Rosenthal, Interpersonal Coordination: Behavior Matching and Interactional Synchrony. Cambridge, UK: Cambridge University Press, 1991.

[3] M. Wilson and T. P. Wilson, "An oscillator model of the timing of turn taking," Psychonomic Bulletin and Review, vol. 12, pp. 957–968, 2005.

[4] W. S. Condon and W. D. Ogston, "Speech and body motion synchrony of the speaker-hearer," in Perception of Language, D. L. Horton and J. J. Jenkins, Eds. Columbus, Ohio: Merrill, 1971.

[5] A. Kendon, "Movement coordination in social interaction: Some examples described," Acta Psychologica, vol. 32, pp. 100–125, 1970.

[6] L. Cerrato, "Investigating communicative feedback phenomena across languages and modalities," Ph.D. dissertation, KTH Computer Science and Communication, Department of Speech, Music and Hearing, Stockholm, Sweden, 2007.

[7] D. Heylen, "Challenges ahead: Head movements and other social acts in conversations," in Proceedings of AISB 2005, 2005, pp. 45–52.

[8] R. L. Birdwhistell, Kinesics and Context: Essays on Body Motion Communication. Philadelphia, PA: University of Pennsylvania Press, 1970.

[9] Y. Iwano, S. Kageyama, E. Morikawa, S. Nakazato, and K. Shirai, "Analysis of head movements and its role in spoken dialogue," in Proceedings of ICSLP '96, 1996, pp. 2167–2170.

[10] U. Hadar, T. J. Steiner, E. C. Grant, and F. C. Rose, "Kinematics of head movements accompanying speech during conversation," Human Movement Science, vol. 2, pp. 35–46, 1983.

[11] U. Hadar, T. Steiner, and C. F. Rose, "Head movement during listening turns in conversation," Journal of Nonverbal Behavior, vol. 9, pp. 214–228, 1985.

[12] D. House, J. Beskow, and B. Granström, "Interaction of visual cues for prominence," Lund Working Papers in Linguistics, vol. 49, pp. 62–65, 2001.

[13] M. Sargin, O. Aran, A. Karpov, F. Ofli, Y. Yasinnik, S. Wilson, E. Erzin, Y. Yemez, and M. A. Tekalp, "Combined gesture-speech analysis and speech driven gesture synthesis," in Proceedings of the IEEE International Conference on Multimedia and Expo, Toronto, Canada, 2006, pp. 893–896.

[14] I. Poggi, F. D'Errico, and L. Vincze, "Types of nods. The polysemy of a social signal," in Proceedings of the 7th International Conference on Language Resources and Evaluation, 2010, pp. 17–23.

[15] H. M. Rosenfeld and M. Hancks, "The nonverbal context of verbal listener responses," in The Relationship of Verbal and Nonverbal Communication, M. R. Key, Ed. The Hague, The Netherlands: Mouton Publishers, 1980, pp. 193–206.

[16] E. Bevacqua, "Computational model of listener behavior for embodied conversational agents," Ph.D. dissertation, Université Paris 8, Paris, France, 2009.

[17] J. B. Bavelas, L. Coates, and T. Johnson, "Listeners as co-narrators," Journal of Personality and Social Psychology, vol. 79, pp. 941–952, 2000.

[18] H. Buschmeier, Z. Malisz, M. Włodarczak, S. Kopp, and P. Wagner, "'Are you sure you're paying attention?' – 'Uh-huh'. Communicating understanding as a marker of attentiveness," in Proceedings of INTERSPEECH 2011, Florence, Italy, 2011, pp. 2057–2060.

[19] S. Kopp, J. Allwood, K. Grammer, E. Ahlsén, and T. Stocksmeier, "Modeling embodied feedback with virtual humans," in Modeling Communication with Robots and Virtual Humans, I. Wachsmuth and G. Knoblich, Eds. Berlin: Springer-Verlag, 2008, pp. 18–37.

[20] J. Allwood and L. Cerrato, "A study of gestural feedback expressions," in First Nordic Symposium on Multimodal Communication, Copenhagen, Denmark, 2003, pp. 7–22.

[21] A. Dittmann and L. Llewellyn, "Relationship between vocalizations and head nods as listener responses," Journal of Personality and Social Psychology, vol. 9, p. 79, 1968.

Proceedings of Workshop on Feedback Behaviors in Dialog

AUTHOR INDEX

B

Baumann, Timo

Feedback in adaptive interactive storytelling.

Bavelas, Janet Beavin

Beyond back-channels: A three-step model of grounding in face-to-face dialogue.

Bertrand, Roxane

CoFee - Toward a multidimensional analysis of conversational feedback, the case of French language.

Listener's responses during storytelling in French conversation.

Bodie, Graham

Machines don't listen (But neither do people).

Buschmeier, Hendrik

Adapting language production to listener feedback behaviour.

Listener head gestures and verbal feedback expressions in a distraction task.

C

Chiba, Yuya

Effect of linguistic contents on human estimation of internal state of dialog system users.

D

De Jong, Peter

Beyond back-channels: A three-step model of grounding in face-to-face dialogue.

De Kok, Iwan

A survey on evaluation metrics for backchannel prediction models.

E

Edlund, Jens

Third party observer gaze during backchannels.

Espesser, Robert

Listener's responses during storytelling in French conversation.

G

Gargett, Andres

Feedback and activity in dialogue: signals or symptoms?

Gratch, Jonathan

Crowdsourcing backchannel feedback: Understanding the individual variability from the crowds.

Guardiola, Mathilde

Listener's responses during storytelling in French conversation.

Gustafson, Joakim

Cues to perceived functions of acted and spontaneous feedback expressions.

Exploring the implications for feedback of a neurocognitive theory of overlapped speech.

H

Healey, Pat

Empathy and feedback in conversations about felt experience.

Heldner, Mattias

Third party observer gaze during backchannels.

Heylen, Dirk

A survey on evaluation metrics for backchannel prediction models.

Hirschberg, Julia

Clarification questions with feedback.

When do we say 'Mhmm'? Backchannel feedback in dialogue.

Hjalmarsson, Anna

Third party observer gaze during backchannels.

Huang, Lixing

Crowdsourcing backchannel feedback: Understanding the individual variability from the crowds.

I

Ito, Akinori

Effect of linguistic contents on human estimation of internal state of dialog system users.

Ito, Masashi

Effect of linguistic contents on human estimation of internal state of dialog system users.

Iwatate, Takuma

Can we predict who in the audience will ask what kind of questions with their feedback behaviors in poster conversation?

J

Jordan, Sara Smock

Beyond back-channels: A three-step model of grounding in face-to-face dialogue.

K

Kawahara, Tatsuya

Can we predict who in the audience will ask what kind of questions with their feedback behaviors in poster conversation?

Kopp, Stefan

Adapting language production to listener feedback behaviour.

Listener head gestures and verbal feedback expressions in a distraction task.

Korman, Harry

Beyond back-channels: A three-step model of grounding in face-to-face dialogue.

Kousidis, Spyros

Evaluating a minimally invasive laboratory architecture for recording multimodal conversational data.

L

Liu, Alex

Clarification questions with feedback.

Lundholm Fors, Kristina

The temporal relationship between feedback and pauses: a pilot study.

M

Malisz, Zofia

Evaluating a minimally invasive laboratory architecture for recording multimodal conversational data.

Listener head gestures and verbal feedback expressions in a distraction task.

McCartney, Joshua L.

Visualizations supporting the discovery of prosodic contours related to turn-taking.

Morency, Louis-Philippe

Investigating the influence of pause fillers for automatic backchannel prediction.

N

Neiberg, Daniel

Cues to perceived functions of acted and spontaneous feedback expressions.

Exploring the implications for feedback of a neurocognitive theory of overlapped speech.

Novick, David

Paralinguistic behaviors in dialog as a continuous process.

Where in dialog space does uh-huh occur?

O

Ozkan, Derya

Investigating the influence of pause fillers for automatic backchannel prediction.

P

Pfeiffer, Thies

Evaluating a minimally invasive laboratory architecture for recording multimodal conversational data.

Plant, Nicola

Empathy and feedback in conversations about felt experience.

Prevot, Laurent

CoFee - Toward a multidimensional analysis of conversational feedback, the case of French language.

R

Rauzy, Stephane

Listener's responses during storytelling in French conversation.

S

Scherer, Stefan

Investigating the influence of pause fillers for automatic backchannel prediction.

Schlangen, David

Evaluating a minimally invasive laboratory architecture for recording multimodal conversational data.

Skantze, Gabriel

A testbed for examining the timing of feedback using a Map Task.

Stoyanchev, Svetlana

Clarification questions with feedback.

T

Takanashi, Katsuya

Can we predict who in the audience will ask what kind of questions with their feedback behaviors in poster conversation?

Trouvain, Jurgen

Acoustic, morphological, and functional aspects of "yeah/ja" in Dutch, English and German.

Truong, Khiet P.

Acoustic, morphological, and functional aspects of "yeah/ja" in Dutch, English and German.

Tsuchiya, Takanori

Can we predict who in the audience will ask what kind of questions with their feedback behaviors in poster conversation?

V

Vega, Alejandro

Where in dialog space does uh-huh occur?

W

Wagner, Petra

Evaluating a minimally invasive laboratory architecture for recording multimodal conversational data.

Listener head gestures and verbal feedback expressions in a distraction task.

Ward, Nigel G.

Possible lexical cues for backchannel responses.

Visualizations supporting the discovery of prosodic contours related to turn-taking.

Where in dialog space does uh-huh occur?

Wlodarczak, Marcin

Listener head gestures and verbal feedback expressions in a distraction task.
