Getting to the Heart of the Matter; (or, “Speech is more than just the expression of text or language”)
Nick Campbell, ATR Network Informatics Labs, Keihanna Science City, Kyoto, Japan
[email protected]
LREC 2004


Page 1

Getting to the Heart of the Matter;

(or, “Speech is more than just the expression of text or language”)

Nick Campbell
ATR Network Informatics Labs, Keihanna Science City, Kyoto, Japan
[email protected]
LREC 2004

Page 2

Overview …
• This talk addresses the current needs for so-called ‘emotion’ in speech, but points out that the issue is better described as the expression of ‘relationships’ and ‘attitudes’ rather than by the currently held raw (or big-six) emotional states.

Page 3

Comment
– For decades now, we have been producing and improving methods for the input and output of speech signals by computer, but the market seems very slow to take up these technologies.

– In spite of the early promises for human-computer voice-based interactions, the man or woman in the street has yet to make much use of this technology in their daily lives.

– There are already applications where computers mediate in human spoken communications, but in only a few limited domains.

– Our technology appears to have fallen short of its earlier promises!

Page 4

The latest buzz-word in speech technology research: ‘emotion’
• Why is it that the latest promises make so much of the word ‘emotion’?
• Perhaps because the current technology is based so much upon written text as the core for its processing?
• Speech recognition is evaluated by the extent to which it can ‘accurately’ transliterate a spoken utterance; and speech synthesis is driven, in the majority of cases, just from the input text alone.

Page 5

Real interactive speech (cf. read speech)
• “spontaneous speech is ill-formed and often includes redundant information such as disfluencies, fillers, repetitions, repairs, and word fragments” – S. Furui 2003 (and many others)
• But we don’t just talk text!
  – natural speech is interactive, so we show relationships as much as we give information …
• And we don’t just talk sentences …
  – grunts are common!

Page 6

Example Dialogue (a person talking to a robot)
• The human speaks
• The robot employs speech recognition
  – (and presumably some form of processing) then replies using speech synthesis
  – (which the human supposedly understands)
• The interaction is ‘successful’ if the robot responds in an intended manner

Page 7

Example dialogue 1
• Excuse me
• Yes, can I help you?
• Errm, I’d like you to come here and take a look at this …
• Certainly, wait a minute please.
• Can you see the problem?
• No, I’m afraid I don’t understand what you are trying to show me.
• But look at this, here …
• Oh yes, I see what you mean!

Page 8

Example dialogue 2
• Oi!
• Uh?
• Here!
• Oh …
• Okay?
• Eh?
• Look!
• Ah!

Page 9

Which do we want?
• As engineers:
  – The former – we can do it now
• As humans:
  – The latter – it’s what we are used to
• And the robots?
  – They should behave in the least obtrusive way – naturally!

Page 10

How should we talk with robots?
• First, let’s take a look at how we talk with each other …
  – not using actors, but real people
  – in everyday conversational situations …
• Labov: the Observer’s Paradox
  – interactions lose their naturalness when an observer intrudes!

Page 11

Overcoming the Observer’s Paradox

– analysis of a very large corpus of spoken interaction
• The JST/CREST ESP project

Page 12

JST/CREST ESP Project: expressive speaking styles
Nick Campbell, ATR Human Information Science Laboratories
Principal Investigator
“A computer processing system for expressive speech” – Japan Science and Technology Corporation report No. 131: “Information technology for daily life in an advanced media society”

Page 13

Project Goals
• Speech technology
  – Speech synthesis with ‘feeling’
  – Speaking-style feature analysis/detection
• Corpus of spontaneous speech
  – 1000 hours of natural speech
• Scientific contribution
  – Paralinguistics & communication

Page 14

Progress to date
• More than 1000 hours recorded
• 500 hours speech collected
• 250 hours transcribed
• 75 hours labelled
  – 25 voices
• Interfaces & specs are evolving
• We foresee some very new unit-selection techniques being developed

Page 15

The ‘Pirelli-calendar’ approach
In 1970 a team of photographers took 1000 rolls of 36-exposure film on location to an island in the Pacific in order to produce a calendar of twelve (glamour) images.
-> Similarly, if we record an ‘almost infinite’ corpus of speech, and develop techniques to extract the interesting portions, then we will produce data which is both representative and sufficient for studying the full range of speaking-styles used in ordinary human communication.

Page 16

Long-term recordings: daily interactive speech
• MD & small head-mounted lavalier mic
• conversations with parents/husband/friends/colleagues/clinic/others
• Japanese native-language speakers, both sexes, mixed ages, mixed scenarios
• Recording over a continuing period,
  – speaking-style correlates of changes in familiarity/interlocutor to be studied.

Page 17

Proposal

Problem

Solution

Page 18

Transcription
(anonymised)

Page 19

Labelling emotion: scales along dimensions of ‘speaking style’
• bright / positive  vs.  dark / negative
• strong / active  vs.  weak / passive
• described with a single adjective or verb, e.g.:
  – ‘joy’, ‘anger’, ‘acceptance’, ‘sadness’
• free input
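For illustration, the two scales above can be treated as a 2-D coordinate space into which free-input labels are placed. In the sketch below the coordinates are my own illustrative guesses, not values from the ESP corpus:

```python
# (evaluation, activation) coordinates in the bright/dark x strong/weak space.
# evaluation: -1 = dark/negative .. +1 = bright/positive
# activation: -1 = weak/passive  .. +1 = strong/active
# The four labels are the slide's examples; the numbers are hypothetical.
EMOTION_SPACE = {
    "joy":        (+0.8, +0.6),   # bright and fairly active
    "anger":      (-0.7, +0.8),   # dark but strongly active
    "acceptance": (+0.5, -0.4),   # mildly bright, passive
    "sadness":    (-0.6, -0.7),   # dark and passive
}

def nearest_label(evaluation: float, activation: float) -> str:
    """Map a point in the space back to the closest labelled emotion."""
    return min(
        EMOTION_SPACE,
        key=lambda name: (EMOTION_SPACE[name][0] - evaluation) ** 2
                       + (EMOTION_SPACE[name][1] - activation) ** 2,
    )
```

Free-input labels can then be compared or clustered by their position in the space rather than by string identity.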

Page 20

Page 21

Discourse Act Labelling
a greeting
b closing
c introduce-self
d introduce-topic
e give-information
f give-opinion
g affirm
h negate
i accept
j reject
k acknowledge
l interject (back-channel)
m thank
n apologize
o argue
p suggest, offer
q notice
r request-action
s connector
t complain
u flatter
w talking-to-self
x disfluency
y acting
z repeat

r* request (a ~ z)
v* verify (a ~ z)
*? for unclear cases

(see LREC 2004, 09-SE Wednesday 4pm)
Page 22

Page 23

Acoustic Analysis / Visualisation tool
[Figure: screenshot of the tool, displaying per utterance:]
• quasi-syllable boundaries and nuclei
• F0 contour and sonorant energy contour
• (a) variance in delta-cepstrum, (b) formant / FFT cepstral distance, and a composite (a & b) measure of reliability
• glottal AQ (pressed ~ breathy)
• estimated vocal-tract area-functions
• phonetic labels (if available)

Page 24

Voice Quality & Affect
• 13,604 conversational utterances
• 1 female Japanese speaker (age 32)
• listener / speech-act / emotion labels
• Utterances per interlocutor:
  Child 139 | Family 3623 | Friends 9044 | Others 632 | Self 116

Page 25

Listener relations
Talking to:
• child
• family
• friends
• others
• self

Page 26

NAQ & F0 by family member:
m1 – mother
m2 – father
m3 – baby girl
m4 – husband
m5 – big sister
m6 – nephew
m8 – aunt

Page 27

Meaningful speech is a uniquely human characteristic, but …
• Apes use gestural communication, but not for communicating propositional content.
• Birds and seals can mimic human sounds, but their tunes don’t contain semantic meaning.
• Bees can communicate precise geographical locations with their dances …
• African wild dogs show a high degree of social organisation, and they use body postures and the prosody of their barks to guide the hunt and keep the pack together.

Page 28

Human language development

• It is likely that early humans used their voices in similar ways to the hunting dogs, and that the use of voice to complement or replace face-to-face communication (and touch) for social interaction and reassurance pre-dated propositional communication.

– In this case, prosody would have been a precursor to meaningful speech, which developed later.

Page 29

Language as Distal Communication
• The ‘park or ride’ hypothesis (Ross, 2001) – the development of language in humans.
  “Human mothers would have had to put down their helpless but heavy babies (who had difficulty in clinging on by themselves) in order to forage for food, but they maintained contact with each other through voice, or tone-of-voice” (my italics)
  – This distal communication would have reassured both mother and child that all was well, even though they might actually be out of direct sight of each other.

Page 30

Non-linguistic speech
“it is all too tempting to think of language as consisting of a set (infinite, of course) of independent meaning-form pairs. This way of thinking has become habitual in modern linguistics” (Hurford 1999)
– But part of being human, and of taking one’s place in a social network, also involves making inferences about the feelings of others and having an empathy for those feelings. (me)

Page 31

“Motherese”

“If the origins of human language, or distal communication, can be traced back to the music of motherese, or infant-directed prosody, then it is easy to speculate that the sounds of the human voice replaced the vision of the face (and body) for the identification of social and security-related information.”

(Falk, 2003, my italics).

Page 32

Prosody and Cognitive Neurology

“Just as stereoscopic vision yields more than the simple sum of input from the two eyes alone, so binaural listening probably gives us more than just the sum of the text and its linguistic prosody alone” (Auchlin 2003)

– Language may be predominantly processed in the left brain, but much of its prosody is processed in the right.

Page 33

Right-brain prosody
• Several studies have confirmed that understanding of propositional content activates the prefrontal cortex bilaterally, more on the left than on the right, and that, in contrast, responding to emotional prosody activates the right prefrontal cortex more.
• “the right frontal lobe is perhaps particularly critical, maybe because of its central role in the neural network, for social cognition, including inferences about feelings of others and empathy for those feelings.” (Pandya et al., 1996)
• See also Monrad-Krohn (1945~), etc …

Page 34

Binaural Speech Processing (an extreme view!)
• Information coming into the right ear and the left ear is processed separately in the brain before being perceived as speech.
• Speculation:
  – if the left brain (right ear) is more tuned for linguistic processing, and the right brain (left ear) more tuned for affective processing, then it is likely that the separate activation of the two hemispheres gives an extra-linguistic ‘depth’ to an utterance. (but cf. telephones!)

Page 35

A two-tiered view of speech communication
Two types of utterance:
• I-type express linguistic information
• A-type express affect
– The former can be well described by their text alone; but the latter also need prosodic info.
– Any utterance may contain both I-type and A-type information, but is primarily of one type or the other.
– The expressivity of an utterance is realised through a socially-motivated framework that determines its physical characteristics.

Page 36

A framework for utterance specification: Self + Other + Event
• An utterance is realised as an event (=E*) taking place within the framework of mood and interest (=S) and friend and friendly (=O) constraints
• mood & interest, friend & friendly:
  – If motivation or interest in the content of the utterance is high, then the speech is typically more expressive. If the speaker is in a good mood, then more so …
  – If the listener (other) is a friend, then the speech is more relaxed, and in a friendly situation, then even more so

* I-type or A-type
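The Self + Other + Event constraints could be encoded as a simple utterance specification. This is my illustrative rendering, not the project’s actual data format; the field names and the two toy heuristics are invented:

```python
from dataclasses import dataclass

@dataclass
class UtteranceSpec:
    """Self + Other + Event specification for one utterance (illustrative)."""
    # S: speaker state
    mood: float          # 0 = bad mood .. 1 = good mood
    interest: float      # 0 = no interest in the content .. 1 = high interest
    # O: listener constraints
    friend: bool         # is the listener a friend?
    friendly: bool       # is the situation friendly?
    # E: the event itself
    text: str
    utterance_type: str  # "I-type" (information) or "A-type" (affect)

    def expressiveness(self) -> float:
        """Toy heuristic: interest drives expressivity, boosted by good mood."""
        return self.interest * (0.5 + 0.5 * self.mood)

    def relaxedness(self) -> float:
        """Toy heuristic: a friend raises relaxedness; a friendly setting more so."""
        return 0.5 * self.friend + 0.5 * (self.friend and self.friendly)
```

A synthesiser built on this view would then choose prosodic and voice-quality settings from the S and O fields rather than from the text alone.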

Page 37

Realising Conversational Speech Utterances

Page 38

Discussion
• Our analysis of a very large corpus of natural spontaneous conversational speech indicates that both Information & Affect may be realised in parallel in speech, for both social and historical reasons
• Speech synthesis (and recognition) should soon start to take these two different types of communication into consideration
  – i.e., not emotion, but function & interaction

Page 39

Conclusion
• Speech conveys multiple tiers of information, not all of which are considered in present linguistic or speech technology research.
• Prosody has an important and extra-linguistic communicative function which can be explained by language evolution and cognitive neurology.
• If speech technology is to consider ‘language-in-action’ as well as ‘language-as-system’, then those levels of information which cannot be accurately portrayed by a transcription of the speech alone must be taken into consideration.

Page 40

Thank you

Page 41

[Figure: Venn diagram relating ‘language’, ‘speech’, ‘expression’, and ‘noise’]

Page 42

Monrad-Krohn

• uses of speech prosody categorised into four main groups:

– i) Intrinsic prosody, for the intonation contours which distinguish e.g., a declarative from an interrogative sentence,

– ii) Intellectual prosody, for the intonation which gives a sentence its particular situated meaning by placing emphasis on certain words rather than others,

– iii) Emotional prosody, for expressing anger, joy, and the other emotions, and

– iv) Inarticulate prosody, which consists of grunts or sighs and conveys approval or hesitation

Page 43

Definition of the Glottal AQ (Amplitude Quotient)
– figures taken from Alku et al. (JASA, August 2002) –
[Figure: stylised, triangular glottal-flow waveform; glottal-flow waveform and glottal-flow derivative; estimated glottal-flow waveforms for (1) breathy and (2) pressed phonation]

AQ = f_ac / d_peak = T2 ~ “effective decay time” (Fant et al., 1994)

With thanks to Parham Mokhtari

Page 44

Normalised Amplitude Quotient (NAQ) – Alku et al. (2002)

NAQ = AQ / T0 = AQ · F0

• Normalises the approximately inverse relationship with F0
• Yields a parameter more closely associated with phonation quality
• NAQ is closely related to the glottal Closing Quotient (CQ), but is more reliably measured than CQ (Alku et al., 2002)!

[Figure: NAQ distributions for Speaker FIA and Speaker FAN]
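The AQ and NAQ definitions above are straightforward to compute once a glottal-flow waveform has been estimated. A minimal sketch, assuming one cycle of flow has already been obtained (e.g. by inverse filtering):

```python
import numpy as np

def amplitude_quotients(glottal_flow, f0, sample_rate):
    """AQ and NAQ for one glottal cycle, following Alku et al. (2002).

    glottal_flow : samples of one period of the estimated glottal-flow waveform
    f0           : fundamental frequency of that cycle in Hz
    sample_rate  : sampling rate in Hz
    """
    flow = np.asarray(glottal_flow, dtype=float)
    f_ac = flow.max() - flow.min()              # peak-to-peak flow amplitude
    derivative = np.diff(flow) * sample_rate    # flow derivative (per second)
    d_peak = -derivative.min()                  # magnitude of its negative peak
    aq = f_ac / d_peak                          # seconds: "effective decay time"
    naq = aq * f0                               # dimensionless: AQ / T0 = AQ * F0
    return aq, naq
```

For the stylised triangular pulse on the previous slide, AQ reduces to the closing-phase duration T2; a 100 Hz cycle whose closing phase lasts 2 ms therefore has NAQ = 0.2.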