
Generative Modeling and Classification of Dialogs by Low-Level Features

Marco Cristani, Anna Pesarin, Alessandro Tavano, Carlo Drioli, Alessandro Perina, Vittorio Murino


Summary

• Goal

• Introduction

• Our approach

• Experiments

• Conclusions

Goal

• To model and classify dyadic conversational audio situations
• The situations are characterized by:
  – the kind of subjects involved (adults, children)
  – a predominant mood (flat or arguing discussion)
• Examples: samples 1, 2, 3

Goal (2)

• Our guidelines for the modeling are:
  – to exploit the conversational turn-taking
  – not to model the content of the conversations (too difficult)
• Our contribution
  – A novel kind of feature (the Steady Conversational Periods, SCP) plus a very simple generative framework
• In practice…
  – We are able to finely characterize the turn-taking, also encoding the timing of the turns

Introduction – Social signalling

• Our aim can be cast as a social signalling problem
• Social signals [Vinciarelli et al. 2008]
  – the expression of one’s attitude towards social situations and interplay
  – manifested through a multiplicity of non-verbal behavioural cues (facial expressions, gestures, and vocal outbursts)
• Social signalling
  – recent formalization

[Figure: Social Psychology, Pattern Recognition, Social Signalling]

Introduction (2) – social signals

• Bricks for social signals [Vinciarelli et al. 2008]

[Figure: bricks for social signals, with one part marked OUR FOCUS]

Introduction (3) - Definitions

• A taxonomy for the social signals
  – behavioural/social cues (or thin slice of behavior)
    • a set of temporal changes in neuromuscular and physiological activity that last for short intervals of time (milliseconds to minutes)
  – social signals (or social behaviours)
    • multiple behavioural cues
    • attitudes towards others or specific social situations that can last minutes to hours

Introduction (5) – Turn taking

• Turn taking
  – includes the regulation of the conversations, and the coordination (or the lack of it) during the speaker transitions

Introduction (6) – Turn taking examples

• Turn-taking
  – coordination
  – timed coordination
    • more interesting

[Figure: two turn-taking examples, labelled Yes and No]

Our approach - preliminaries

• Turn taking in a statistical way: Markov chaining
• Ergodic Markov model of states

[Figure: Markov chain of states over T time steps]

Our approach (2) - Markov structures

• Markov chaining for multiple agents: connections
• The core of the model is the transition probability (c,d=1,2)

[Figure: the two chains over T time steps; single process states vs joint process states]

• Problem: computational burden
  – for C processes, the joint states give transition matrices of $O(N^C \times N^C)$, where N is the number of states for the single processes
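To make the burden concrete, the following minimal sketch counts the entries of a joint transition matrix for C chains with N states each. The function name and the example value N = 4 are illustrative assumptions, not figures taken from the experiments.

```python
def joint_transition_entries(n_states: int, n_chains: int) -> int:
    """Number of entries of the full joint transition matrix, (N^C) x (N^C)."""
    joint_states = n_states ** n_chains
    return joint_states * joint_states


# Growth of the table for N = 4 states per chain (hypothetical value).
for n_chains in (1, 2, 3, 4):
    print(n_chains, joint_transition_entries(4, n_chains))
```

Even with only a handful of states per chain, the table grows exponentially with the number of chains, which is what motivates the relaxations that follow.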

Our approach (3) – Markov relaxations

• High-order Markov models [Meyn 2005]
• each single process chooses its next state independently of the other single process(es)
  – reasonable!
  – $O(N^C \times N)$ space complexity, still hard to deal with

[Figure: graphical model of the two chains]

Our approach (4) – Influence model

• Mixed Memory processes, (Observed) Influence Model (OIM) [Saul et al. 99, Asavathiratham 2000]
  – each single process chooses its next state without considering the choral effect of the system at the previous time step
  – instead, pairwise state dependencies plus influence factors {θ} are introduced

[Figure: graphical model of the two chains with pairwise dependencies]

Our approach (5) – Influence model

• We have a weighted convex combination of probabilities
  – intra-chain transition (weighted by the self-influence)
  – inter-chain transition (weighted by the other’s influence)
• Transition tables of $O(CN^2)$ + influence matrix θ of $O(C^2)$
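The weighted convex combination can be sketched as follows, assuming the usual mixed-memory form in which the next-state distribution of chain c is the θ-weighted sum of the pairwise transition rows selected by each chain's previous state. All names and numeric values below are hypothetical and only illustrate the mechanics.

```python
def next_state_distribution(prev_states, transitions, theta, c):
    """Distribution of chain c's next state given the previous state of every chain.

    prev_states[d]    : previous state index of chain d
    transitions[d][c] : N x N row-stochastic table, P(c's next | d's previous)
    theta[c][d]       : influence of chain d on chain c (each row sums to 1)
    """
    n = len(transitions[c][c])
    dist = [0.0] * n
    for d, s_prev in enumerate(prev_states):
        row = transitions[d][c][s_prev]
        for j in range(n):
            dist[j] += theta[c][d] * row[j]
    return dist


# Hypothetical two-chain, two-state example.
intra = [[0.9, 0.1], [0.2, 0.8]]   # chain's own previous state -> next state
inter = [[0.5, 0.5], [0.3, 0.7]]   # other chain's previous state -> next state
transitions = [[intra, inter], [inter, intra]]
theta = [[0.8, 0.2], [0.3, 0.7]]   # self-influence on the diagonal

dist = next_state_distribution((0, 1), transitions, theta, c=0)
print(dist)
```

Because every row of θ sums to one and every transition row is a distribution, the result is again a valid distribution.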

Our approach (6) - Setting

• We focused on two-person conversations
  – The conversation originates a pair of synchronized audio signals sampled at 44100 Hz
  – NO source separation issues (see later)
  – short-term energies of the speech signals were computed on frames of 10 msec
  – speech (T)/silence (S) classification via k-means

[Figure: the two signals as sequences of T/S labels, one label per 10 msec frame]
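The framing and labelling steps can be sketched as below. This is a minimal illustration under stated assumptions: energy is taken as the mean squared amplitude of each non-overlapping 10 msec frame, and k-means is run in one dimension with k = 2 on the raw energies; the slides do not specify these details. The synthetic signal is invented for the example.

```python
import math


def frame_energies(samples, sample_rate=44100, frame_ms=10):
    """Short-term energy (mean squared amplitude) of consecutive frames."""
    frame_len = sample_rate * frame_ms // 1000   # 441 samples at 44100 Hz
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def two_means_labels(energies, iterations=20):
    """1-D k-means with k=2; the higher-energy cluster is labelled T (speech)."""
    lo, hi = min(energies), max(energies)
    for _ in range(iterations):
        low = [e for e in energies if abs(e - lo) <= abs(e - hi)]
        high = [e for e in energies if abs(e - lo) > abs(e - hi)]
        if low:
            lo = sum(low) / len(low)
        if high:
            hi = sum(high) / len(high)
    return "".join("S" if abs(e - lo) <= abs(e - hi) else "T" for e in energies)


# Synthetic signal: 3 loud frames (a 200 Hz tone) followed by 3 near-silent frames.
loud = [0.5 * math.sin(2 * math.pi * 200 * n / 44100) for n in range(3 * 441)]
quiet = [0.001 for _ in range(3 * 441)]
labels = two_means_labels(frame_energies(loud + quiet))
print(labels)
```

Running the same procedure on each of the two synchronized signals yields the pair of T/S frame sequences used in the rest of the approach.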

• How to instantiate the (Observed) Influence Model?
  – at each frame (10 msec) (no inter-chain transitions are depicted for clarity)
  – OUTPUT
    • we have more autotransitions than effective changes
    • the parameters of the Markov chains are not informative (highly diagonal)
    • the length of the speech/silence segments is lost due to the first-order dependence
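A small sketch shows why frame-level chains are uninformative: counting first-order transitions over 10 msec frame labels puts almost all the probability mass on the diagonal. The frame sequence below is hypothetical.

```python
def transition_matrix(labels, symbols="ST"):
    """Row-normalized first-order transition counts over a label sequence."""
    index = {s: i for i, s in enumerate(symbols)}
    counts = [[0] * len(symbols) for _ in symbols]
    for prev, cur in zip(labels, labels[1:]):
        counts[index[prev]][index[cur]] += 1
    return [
        [c / sum(row) if sum(row) else 0.0 for c in row]
        for row in counts
    ]


# Hypothetical frame labels: 2 s of speech, 1 s of silence, 1.5 s of speech.
frames = "T" * 200 + "S" * 100 + "T" * 150
matrix = transition_matrix(frames)
print(matrix)   # rows and columns are ordered S, T
```

Both diagonal entries are close to 1 regardless of how long the segments actually are, so the matrix says little about the timing of the turns.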

Our approach (7) – Choose a strategy

[Figure: the two frame-level T/S sequences, segmented wherever either of them changes]

• Whenever a change in the system occurs, a novel SCP begins, for each chain/process
  – OUTPUT
    • we have features that address the system’s changes
    • we introduce a synchronization
    • each SCP is associated with two pieces of information:
      1. the SPEECH (T) – SILENCE (S) label
      2. the time length
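The segmentation rule above can be sketched as follows: a new SCP starts, in both chains at once, whenever either chain changes label. The function name, the choice of measuring lengths in frames, and the two example sequences are assumptions made for illustration.

```python
def extract_scps(chain1, chain2):
    """Split two synchronized T/S frame sequences into Steady Conversational Periods.

    A new SCP starts whenever either chain changes label. Returns one list per
    chain of (length_in_frames, label) pairs; both lists have the same length.
    """
    if len(chain1) != len(chain2):
        raise ValueError("the two chains must be synchronized")
    scps1, scps2 = [], []
    start = 0
    for t in range(1, len(chain1) + 1):
        at_end = t == len(chain1)
        if at_end or chain1[t] != chain1[t - 1] or chain2[t] != chain2[t - 1]:
            scps1.append((t - start, chain1[start]))
            scps2.append((t - start, chain2[start]))
            start = t
    return scps1, scps2


a = "TTTTSSSSSTT"
b = "SSTTTTSSSSS"
scps_a, scps_b = extract_scps(a, b)
print(scps_a)
print(scps_b)
```

Note that a chain can produce two consecutive SCPs with the same label: the cut is caused by a change in the other chain, which is exactly the synchronization the SCPs introduce.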

Our approach (8) – Steady Conversational Periods

[Figure: the two frame-level T/S sequences segmented into SCPs 1 to 5 along the frame axis; each SCP is described by its label and time length]

• How to exploit SCPs for a Markov modelling?
  – By addressing a state renaming
    • <1,S> → 1 | <1,T> → 2 | <2,S> → 3 | ….
  – Training an OIM: STATE SPACE EXPLOSION, SPARSITY!!!

[Figure: example SCP sequences and their renamed states: <8,S> → 15, <4,S> → 7, <5,S> → 9, <3,S> → 5, <8,T> → 16, <5,T> → 10, <3,T> → 6, <9,T> → 18]
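The renaming can be written in closed form. The formula below is inferred from the examples on the slide (<1,S> → 1, <1,T> → 2, <2,S> → 3, and so on) rather than stated by the authors; the function name is hypothetical.

```python
def scp_to_state(length, label):
    """Rename <length, label> to a single integer: <1,S> -> 1, <1,T> -> 2, <2,S> -> 3, ..."""
    if length < 1 or label not in ("S", "T"):
        raise ValueError("expected length >= 1 and label 'S' or 'T'")
    return 2 * (length - 1) + (1 if label == "S" else 2)


states = [scp_to_state(n, lab) for n, lab in [(8, "S"), (4, "S"), (5, "T"), (9, "T")]]
print(states)
```

Since every distinct duration creates two new states, the state space grows with the longest SCP and most states are rarely visited, which is the explosion and sparsity problem noted above.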

Our approach (9) – Steady Conversational Periods

• We consider SCP histograms
  – Gaussian clustering
  – Maximum Likelihood (ML) labeling

Our approach (9) – SCP exploitation

• The state space decreases in size

[Figure: renamed SCP states mapped to Gaussian cluster labels: 5, 7, 9 → 1; 15 → 2; 6, 10 → 3; 16, 18 → 4]
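The Maximum Likelihood labelling step can be sketched as below. It assumes separate Gaussian clusters for silence and speech durations, two per label, as the example mapping suggests; the means and standard deviations are invented for illustration and are not the values learned in the experiments.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


# Hypothetical codebook: (cluster label, mean length, standard deviation).
CODEBOOK = {
    "S": [(1, 4.0, 1.0), (2, 8.0, 1.5)],   # short / long silence
    "T": [(3, 4.0, 1.0), (4, 8.5, 1.5)],   # short / long speech
}


def ml_label(length, label, codebook=CODEBOOK):
    """Return the cluster whose Gaussian gives the SCP length the highest likelihood."""
    return max(codebook[label], key=lambda g: gaussian_log_pdf(length, g[1], g[2]))[0]


scps = [(8, "S"), (4, "S"), (5, "S"), (3, "S"), (8, "T"), (5, "T"), (3, "T"), (9, "T")]
print([ml_label(n, lab) for n, lab in scps])
```

With these hypothetical parameters the eight example SCPs collapse onto four states, so the OIM is trained on a small, densely populated state space.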

Our approach (10) – SCP exploitation

Our approach (11) – Classification

• At this point the pair of SCP state sequences is used to train the OIM λ, obtaining:
  – Two intra-chain matrices (by counting state occurrences): they tell how each agent produces a set of SCP states
  – Two inter-chain matrices (by counting state occurrences): they tell how each SCP state of one chain is conditioned on each state of the other chain
  – An influence matrix (by gradient ascent): it tells how the two chains influence each other
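The counting step for the intra-chain and inter-chain matrices can be sketched as follows. The two state sequences are hypothetical, the uniform fallback for unseen rows is an assumption of this sketch, and the gradient-ascent estimation of the influence matrix is not shown.

```python
def count_transitions(source, target, n_states):
    """Estimate P(target[t] = j | source[t-1] = i) by counting; states are 1..n_states."""
    counts = [[0] * n_states for _ in range(n_states)]
    for prev, cur in zip(source, target[1:]):
        counts[prev - 1][cur - 1] += 1
    return [
        # Rows never observed fall back to a uniform distribution (assumption).
        [c / sum(row) if sum(row) else 1.0 / n_states for c in row]
        for row in counts
    ]


# Hypothetical synchronized SCP state sequences for the two agents.
chain1 = [2, 1, 1, 1, 3, 3, 4]
chain2 = [1, 1, 1, 4, 3, 4, 4]

intra_1 = count_transitions(chain1, chain1, 4)        # agent 1 on itself
intra_2 = count_transitions(chain2, chain2, 4)        # agent 2 on itself
inter_1_to_2 = count_transitions(chain1, chain2, 4)   # agent 1's state -> agent 2's next
inter_2_to_1 = count_transitions(chain2, chain1, 4)   # agent 2's state -> agent 1's next
print(intra_1)
```

The same function serves both cases: passing one chain twice gives an intra-chain matrix, passing two different chains gives an inter-chain matrix.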

Our approach (12) – Remarks

• IMPORTANT: the order in which the two sequences are evaluated by the system

[Figure: influence matrix between Agent 1 and Agent 2, with entries 0.8, 0.2, 0.7, 0.3]

• Given an OIM, we can evaluate the likelihood

Our approach (13) - Classification

• Once a model Ψ={ϴ,λ} and a test dialog I (an ordered pair of arrays O1 and O2 composed of {S,T} symbols) are provided, we want the likelihood P(I | Ψ) = P(O1, O2 | Ψ)
  1. SCPs are extracted
  2. SCP Gaussian labels are estimated from ϴ, which acts as a codebook, originating the two state sequences
  3. With the OIM λ, the final likelihood is estimated
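The last step can be sketched as follows, assuming the standard observed-influence-model factorization in which, at each step, every chain's next state is drawn from the θ-weighted mixture of pairwise transitions. The initial-state term is ignored, and the maximum-likelihood decision rule in `classify` is an assumption of this sketch; both toy models are invented.

```python
import math


def oim_log_likelihood(seq1, seq2, transitions, theta):
    """log P(seq1, seq2 | OIM), ignoring the initial-state term.

    transitions[d][c][i][j] = P(chain c moves to state j | chain d was in state i)
    theta[c][d]             = influence of chain d on chain c
    States are 0-based indices.
    """
    chains = (seq1, seq2)
    total = 0.0
    for t in range(1, len(seq1)):
        for c in range(2):
            p = sum(
                theta[c][d] * transitions[d][c][chains[d][t - 1]][chains[c][t]]
                for d in range(2)
            )
            total += math.log(p) if p > 0 else float("-inf")
    return total


def classify(seq1, seq2, models):
    """Pick the class whose model gives the test dialog the highest likelihood."""
    return max(models, key=lambda name: oim_log_likelihood(seq1, seq2, *models[name]))


# Two toy class models over 2 states (hypothetical parameters).
sticky = [[0.9, 0.1], [0.1, 0.9]]
switchy = [[0.1, 0.9], [0.9, 0.1]]
theta = [[0.5, 0.5], [0.5, 0.5]]
models = {
    "sticky": ([[sticky, sticky], [sticky, sticky]], theta),
    "switchy": ([[switchy, switchy], [switchy, switchy]], theta),
}
print(classify([0, 0, 0, 0], [0, 0, 0, 0], models))
```

Because the roles of the two chains are not symmetric in θ and in the inter-chain tables, swapping `seq1` and `seq2` generally changes the likelihood.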

Experiments - preliminaries

• Twofold aim:
  1. how the statistical signature explains turn-taking
  2. how effective our model is in the classification task

1. Analysis of the model parameters: restricted dataset
  – 27 healthy subjects (10 males, 17 females)
  – two age groups:
    • 14 preschool children ranging from 4 to 6 years (so, 14 dialogs)
    • 13 adults ranging from 22 to 40 years (13 dialogs)
  – semi-structured dialogs (lasting about 10 minutes): an adult human operator asks the subject (child or adult) to talk about predetermined topics (school time/work, hobbies, friends, food, family)

Experiments (2) – Influence factors

• High self-influence:
  – different intra-chain sequences of speech/silence SCP states characterize the subjects
  – such sequences occur independently
• Low self-influence:
  – different intra-chain sequences of speech/silence SCP states characterize the subjects
  – such sequences occur coordinated in time

[Figure: influences between the two chains and example SCP state sequences for the two cases]

Experiments (3) | adult-child conv.

INTRA-CHAIN MATRICES

– The child shows a high tendency to converge to a short silence state

– The moderator alternates from a state of silence to a speech state, either long or short, with high probability

Experiments (4) | adult-child conv.

INTER-CHAIN MATRICES

– the child utters a sentence whenever the moderator speaks for a long time (he gets bored of the moderator…)

– the moderator utters a sentence whenever the child remains silent for a long time (he encourages the child…)

Experiments (5) | adult-adult conv.

INTRA-CHAIN MATRICES

– The subject tends to speak continuously

– The moderator alternates from a state of silence to a speech state, either long or short, with high probability

Experiments (6) | adult-adult conv.

INTER-CHAIN MATRICES

– the moderator interacts with the subject mostly by talking to him (whether to ask questions or to stop him)

Experiments (7) - Classification

• Restricted extended dataset:
  – We add conversations
    • 5 flat non-structured conversations
    • 9 disputes between adults (an operator pushed for fighting, the other subject naturally reacted)
  – We gather three categories of dialogs
    1. Flat dialog between adults (18 samples)
    2. Flat dialog between a child and an adult (14 samples)
    3. Dispute (9 samples, only between adults)
  – We instantiate 4 classification tasks
    (A) flat vs dispute - (cat:1 vs cat:3);
    (B) flat vs dispute, general - ((cat:1 U cat:2) vs cat:3);
    (C) with vs without child - (cat:2 vs cat:1);
    (D) all vs all;

• Comparative strategies
  – SCP histograms (SCP)
    • normalized histogram of the SCPs (silence, speech) as signature
    • Bhattacharyya distance for the classification
  – Turn taking influence model (TTIM)
    • In practice, it is as if we had “SCPs” that all have the same duration [Basu et al. 01]
  – Mixture of Gaussians classifier on a set of acoustic cues (MOG) [Shriberg 98] [Fernandez et al. 02]:
    • pitch range measure (for the intonation)
    • “enrate” speech rate (articulation velocity)
    • spectral flatness measure (SFM)
    • drop-off of spectral energy above 1000 Hz (DO1000) for the emotion modelling
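The SCP-histogram baseline can be sketched as follows: each dialog is summarized by a normalized histogram of its SCP states, and two signatures are compared with the Bhattacharyya distance. How the distances are turned into a class decision is not specified in the slides, so only the signature and the distance are shown; names and data are hypothetical.

```python
import math
from collections import Counter


def normalized_histogram(states, n_bins):
    """Relative frequency of each state 1..n_bins in a sequence."""
    counts = Counter(states)
    total = len(states)
    return [counts.get(k, 0) / total for k in range(1, n_bins + 1)]


def bhattacharyya_distance(p, q):
    """-ln of the Bhattacharyya coefficient sum_i sqrt(p_i * q_i)."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return float("inf") if coefficient == 0 else -math.log(coefficient)


p = normalized_histogram([1, 1, 2, 4], 4)
q = normalized_histogram([3, 3, 3, 3], 4)
print(bhattacharyya_distance(p, p), bhattacharyya_distance(p, q))
```

Identical signatures are at distance zero, while signatures with no state in common are infinitely far apart. Unlike the OIM, this signature discards the order in which the SCPs occur.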

Experiments (8) – Classification

Experiments (9) – Classification

• Results:
  (A) flat vs dispute - (cat:1 vs cat:3);
  (B) flat vs dispute, general - ((cat:1 U cat:2) vs cat:3);
  (C) with vs without child - (cat:2 vs cat:1);
  (D) all vs all;
• lower accuracy in task A
  – some flat conversations are misclassified
  – sometimes the timing of flat conversations is built by subjects who utter very short sentences, similar to a dispute
  – this behavior is captured by our model and disregarded by TTIM
  – SOLUTION: augment the features, not only SCPs!

Conclusions

• A novel way to model dialogs has been proposed
• The main contributions are
  – Steady Conversational Periods (SCP), as a way to synchronize a dialog, making a first-order Markov treatment feasible
  – The embedding of SCPs in an Observed Influence Model, resulting in a detailed way to describe the turn taking of a conversation
• Future improvements
  – From a methodological point of view
    • Inserting uncertainty in the SCP states, i.e., moving to a full Influence Model
    • Enriching the model with different prosodic features
  – From a practical point of view
    • Enlarging the data set
    • Trying novel situations

Publications

• A. Pesarin, M. Cristani, V. Murino, C. Drioli and A. Perina, A statistical signature for automatic dialogue classification. In Proceedings of the International Conference on Pattern Recognition (ICPR 2008), Tampa, Florida.

• M. Cristani, A. Pesarin, C. Drioli, A. Tavano, A. Perina, V. Murino, Auditory Dialog Analysis and Understanding by Generative Modelling of Interactional Dynamics. In Proceedings of the Second IEEE Workshop on CVPR 2009 for Human Communicative Behavior Analysis.

• M. Cristani, A. Tavano, A. Pesarin, C. Drioli, A. Perina, V. Murino, Generative Modeling and Classification of Dialogs by Low-Level Features, submitted to Systems, Man, and Cybernetics: Part B (under review)

References

• [Vinciarelli et al. 2008] A. Vinciarelli, M. Pantic, H. Bourlard, and A. Pentland, “Social signal processing: state-of-the-art and future perspectives of an emerging domain,” in Proc. of the 16th ACM International Conference on Multimedia (MM '08), 2008.

• [Choudhury et al. 2004] T. Choudhury and S. Basu, “Modeling conversational dynamics as a mixed memory Markov process,” in Proc. NIPS, 2004.

• [Meyn 2005] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, 2005. Second edition to appear, Cambridge University Press, 2008.

• [Asavathiratham 2000] C. Asavathiratham, “A tractable representation for the dynamics of networked Markov chain,” Ph.D. dissertation, Dept. of ECS, MIT, 2000.

• [Saul et al. 99] L. Saul and M. Jordan, “Mixed memory Markov models: Decomposing complex stochastic processes as mixtures of simpler ones,” Machine Learning, vol. 37, no. 1, pp. 75–87, 1999.

• [Basu et al. 01] S. Basu, T. Choudhury, B. Clarkson, and A. Pentland, “Learning human interaction with the influence model,” MIT MediaLab, Tech. Rep. 539, 2001.

• [Shriberg 98] E. Shriberg, “Can prosody aid the automatic classification of dialog acts in conversational speech?” Language and Speech, vol. 41, no. 4, pp. 439–487, 1998.

• [Fernandez et al. 02] R. Fernandez and R. Picard, “Dialog act classification from prosodic features using support vector machines,” in Proc. of Speech Prosody, 2002.

Thanks!!!
