
Page 1: More than Words: Advancing Prosodic Analysis

More than Words Advancing Prosodic Analysis

Andrew Rosenberg City Tech Colloquium

February 5, 2015

Page 2: More than Words: Advancing Prosodic Analysis

Speech Technology

Page 3: More than Words: Advancing Prosodic Analysis

Prosody

Syntax Semantics Pragmatics Paralinguistics

Mary knows; you can do it. Mary knows you can do it.

Bill doesn’t drink because he’s unhappy

Going to Boston. Going to Boston?

Three Hundred Twelve. Three Thousand Twelve.

3

Page 4: More than Words: Advancing Prosodic Analysis

Prosody in Text

ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS

REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK

TO THE RED LINE JUST AS EASILY

4

Page 5: More than Words: Advancing Prosodic Analysis

Also, from the North Station...

(I think the Orange Line runs by there too so you can also catch the Orange Line... )

And then instead of transferring

(um I- you know, the map is really obvious about this but)

Instead of transferring at Park Street, you can transfer at (uh what’s the station name) Downtown Crossing and (um) that’ll get you back to the Red Line just as easily.

Prosody in Text

Page 6: More than Words: Advancing Prosodic Analysis

Prosody in Text

I sooo hate you right now :-)

mondays :,(

Conner Thiele @St04hoEs: Madison people are so funny #sarcasm

Dodie Clark @doddleoddle: RePlAcEmEnT bus SerVicEs are mY fAvOURITE #sARcASM.

Michelle Lee @mlee418: finding someone who loves makeup just as much as me makes me feel warm inside #notkidding

6

Page 7: More than Words: Advancing Prosodic Analysis

Prosody in Spoken Language Processing

• Recognizing Emotions: Frustration and Anger in Call Centers.

• Inserting punctuation in speech transcripts. Notably, not in mobile voice input yet…

• Speaker Recognition

• Speaking Style Recognition

• Recognizing Native Language, Gender, Speaker Roles

• Improving performance of other spoken language processing tasks: Parsing, Discourse Structure, Intent Recognition. Today: Identifying (possibly misrecognized) names in speech.

7

Page 8: More than Words: Advancing Prosodic Analysis

Dimensions of Prosodic Variation

(figure: waveform with pitch in blue and intensity in green)

• Pitch
• Intensity
• Duration of words/syllables
• Presence of silence
• Spectral qualities

8

Page 9: More than Words: Advancing Prosodic Analysis

ToBI

• High-level dimensions of prosodic variation.

• Tones and Break Indices.

• High and Low tones describe prosodic events: pitch accent and phrasing.

• Break indices describe the degree of disjuncture between words.

• Two hierarchical levels of phrasing: intermediate and intonational.

9

Page 10: More than Words: Advancing Prosodic Analysis

ToBI Example - Praat

Page 11: More than Words: Advancing Prosodic Analysis

Dimensions of Prosodic Variation

Prominence (bold word): pitch accents H*, L*, L*+H, L+H*, H+!H*

• Mother Theresa
• Give me the brown one
• is that Mariana's money?
• do you really think it's that one? (x2)

Phrasing (end of phrase): boundary tones L-L%, L-H%, H-H%, H-L%, !H-L%

• get on the Harvard Square T stop
• leave the Government Center T stop
• we will go through Central
• through Boylston
• go from Harvard Square

11

Page 12: More than Words: Advancing Prosodic Analysis

How is prosody used?

Symbolic: • Modular • Linguistically Meaningful • Reduced Dimensionality
  Acoustic Features (D = 100s-1000s) → Symbolic Analysis (D = 10-20) → Task Specific

Direct: • Task-Appropriate • Lower information loss (general) • High Dimensionality
  Acoustic Features (D = 100s-1000s) → Task Specific

Learned Representations: • Modular • Task-Appropriate • Linguistically Meaningful • Low information loss • Reduced Dimensionality
  Acoustic Features (D = 100s-1000s) → Learned Representation (D = 10-20) → Task Specific

Goal: compact, consistent, universal

12

Page 13: More than Words: Advancing Prosodic Analysis

Direct Modeling

• Topic and Sentence Segmentation [Liu et al. 2008, Rosenberg et al. 2006, Ostendorf et al. 2008, etc.]

  • Lexical: n-grams, POS tags, TextTiling, Lexical Chains and other coherence measures

  • Prosodic: measures of acoustic "reset" across candidate boundaries

• Question Recognition for Spoken Dialog Systems [Liscombe et al. 2006]

  • Lexical: n-grams, POS tags, filled pauses

  • Prosodic: pitch slope in the last 200 ms, pausing, loudness

13
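The "pitch slope in the last 200 ms" feature above can be sketched as a least-squares line fit over the final stretch of voiced F0 frames. This is an illustrative reconstruction, not Liscombe et al.'s code; the function name and parameters are my own.

```python
import numpy as np

def pitch_slope_last_200ms(times, f0, window=0.2):
    """Slope (Hz/s) of a least-squares line fit to the F0 points in the
    final `window` seconds of the utterance. `times` and `f0` are
    parallel arrays of voiced-frame timestamps and pitch values."""
    times, f0 = np.asarray(times, float), np.asarray(f0, float)
    mask = times >= times[-1] - window
    if mask.sum() < 2:
        return 0.0
    slope, _intercept = np.polyfit(times[mask], f0[mask], 1)
    return slope

# A rising terminal contour (question-like) yields a positive slope:
t = np.linspace(0.0, 1.0, 101)
f0 = np.where(t < 0.8, 120.0, 120.0 + 200.0 * (t - 0.8))
print(round(pitch_slope_last_200ms(t, f0), 1))  # → 200.0
```

A flat or falling final contour would give a slope near zero or negative, which is the cue separating statements from rising questions.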

Page 14: More than Words: Advancing Prosodic Analysis

Contour Modeling

Pitch in Blue Intensity in Green

14

Page 15: More than Words: Advancing Prosodic Analysis

TILT [Taylor 1998]

• Describes an F0 excursion as a single parameter.

• Compact representation of an excursion based on the position of the maximum.

Contour Modeling

tilt_amp = (|amp_rise| - |amp_fall|) / (|amp_rise| + |amp_fall|)

tilt_dur = (dur_rise - dur_fall) / (dur_rise + dur_fall)

tilt = (tilt_amp + tilt_dur) / 2

15
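The TILT equations above translate directly into code; a minimal sketch (the function name is mine):

```python
def tilt_parameters(amp_rise, amp_fall, dur_rise, dur_fall):
    """Taylor's (1998) TILT parameters for an F0 excursion:
    amplitude tilt, duration tilt, and their mean. tilt = +1 is a
    pure rise, -1 a pure fall, 0 a symmetric rise-fall."""
    tilt_amp = (abs(amp_rise) - abs(amp_fall)) / (abs(amp_rise) + abs(amp_fall))
    tilt_dur = (dur_rise - dur_fall) / (dur_rise + dur_fall)
    return tilt_amp, tilt_dur, (tilt_amp + tilt_dur) / 2.0

# A symmetric rise-fall excursion has tilt 0; a pure rise has tilt +1.
print(tilt_parameters(30.0, 30.0, 0.1, 0.1))  # → (0.0, 0.0, 0.0)
print(tilt_parameters(30.0, 0.0, 0.1, 0.0))   # → (1.0, 1.0, 1.0)
```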

Page 16: More than Words: Advancing Prosodic Analysis

Quantized Contour Modeling [Rosenberg 2010]

• Each syllabic contour is laid onto an N-by-M grid with normalized time and range. This results in an M-element vector over an N-sized vocabulary.

• This allows a simple classification strategy:

Contour Modeling

type* = argmax_type p(type) ∏_{i=1}^{M} p(C_i | type, i)

type* = argmax_type p(type) ∏_{i=1}^{M} p(C_i | C_{i-1}, type, i)

(figure: quantized contour grids for L-L% and L-H%)

16
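The grid quantization described above can be sketched as follows. The bin counts (N = 4, M = 6) and the column-mean quantization rule are illustrative choices of mine, not necessarily the paper's exact procedure.

```python
import numpy as np

def quantize_contour(f0, n_bins=4, m_cols=6):
    """Quantized Contour Modeling sketch: normalize a syllable's F0
    contour in time and range, lay it on an N-by-M grid, and return
    the M-element vector of row indices (vocabulary size N)."""
    f0 = np.asarray(f0, float)
    lo, hi = f0.min(), f0.max()
    norm = np.zeros_like(f0) if hi == lo else (f0 - lo) / (hi - lo)
    # Split the contour into M equal time columns; quantize each
    # column's mean value into one of N range bins.
    cols = np.array_split(norm, m_cols)
    return [min(int(c.mean() * n_bins), n_bins - 1) for c in cols]

# A steadily rising contour climbs through the grid rows:
print(quantize_contour(np.linspace(100, 200, 60), n_bins=4, m_cols=6))
# → [0, 0, 1, 2, 3, 3]
```

The resulting discrete vector is what the class-conditional models p(C_i | type, i) are trained over.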

Page 17: More than Words: Advancing Prosodic Analysis

Approximate Curve Fitting

• Polynomial fitting: x̃(t) = Σ_{i=0}^{k} a_i t^i

• Legendre polynomials (orthogonal bases): x̃(t) = Σ_{i=0}^{k} a_i L_i(t)

• The coefficients become the representation: f(x) = a

L_0 = 1;  L_1 = x;  L_2 = (1/2)(3x^2 - 1)

L_n = (1/2^n) Σ_{k=0}^{n} (n choose k)^2 (x - 1)^{n-k} (x + 1)^k

Contour Modeling

(Legendre polynomial plot from Wikipedia)

17
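Fitting low-order Legendre coefficients to a contour is a one-liner with NumPy's polynomial module; the toy contour here is made up for illustration.

```python
import numpy as np
from numpy.polynomial import legendre

# Fit a contour with low-order Legendre coefficients; the
# coefficients a_i become the compact prosodic features.
t = np.linspace(-1, 1, 100)              # normalized time on [-1, 1]
contour = 5.0 + 2.0 * t + np.square(t)   # toy rise-fall pitch contour

coeffs = legendre.legfit(t, contour, deg=2)
# Since x^2 = (2*L2 + L0)/3, the fit recovers a0 = 5 + 1/3,
# a1 = 2, a2 = 2/3 exactly (the contour is itself a quadratic).
print(np.round(coeffs, 3))  # → [5.333 2.    0.667]
```

Orthogonality is the draw: each coefficient captures an independent shape dimension (level, slope, curvature), so truncating the series loses information gracefully.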

Page 18: More than Words: Advancing Prosodic Analysis

Interactions

• Most shape representations ignore the interaction between different information streams.

• Pitch is assumed to be the most relevant dimension of intonation.

• Combined pitch and energy contour: can be viewed as weighting the importance of pitch values by the energy.

• Energy and duration (area under the contour): a very simple feature that improves pitch accent detection by >3% absolute.

18
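Both interaction features above are simple to compute; a sketch under assumed inputs (parallel per-frame pitch and energy arrays, with function names and the frame shift being my own choices):

```python
import numpy as np

def energy_weighted_pitch(f0, energy):
    """Combined pitch/energy view from the slide: each pitch value is
    weighted by its frame energy, so loud frames dominate the summary."""
    f0, energy = np.asarray(f0, float), np.asarray(energy, float)
    return float(np.sum(f0 * energy) / np.sum(energy))

def area_under_contour(energy, frame_shift=0.01):
    """Energy-and-duration feature: area under the energy contour,
    i.e. the sum of frame energies times the frame shift in seconds."""
    return float(np.sum(energy) * frame_shift)

f0 = [100.0, 200.0]
energy = [1.0, 3.0]
print(energy_weighted_pitch(f0, energy))  # (100*1 + 200*3) / 4 = 175.0
print(area_under_contour(energy))         # 4.0 * 0.01 = 0.04
```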

Page 19: More than Words: Advancing Prosodic Analysis

Symbolic Modeling: AuToBI

• Automatic ToBI labeling toolkit.

• Unified feature extraction and ToBI label prediction.

• Released under Apache 2.0.

• Extensible feature extraction framework.

• Low-level digital signal processing: pitch, spectrum, intensity, FFV.

• Unique features: automatic syllabification, shape modeling, context-sensitive features.

• Applied to English, German, Spanish, Portuguese, Mandarin, French.

Acoustic Features (D = 100s-1000s) → Symbolic Analysis (D = 10-20) → Task Specific

19

Page 20: More than Words: Advancing Prosodic Analysis

Feature Extraction in AuToBI

Requested Features:
mean[context[norm[log[F0]],A]]
mean[context[norm[log[F0]],B]]
mean[context[norm[log[F0]],C]]

(figure: extraction DAG, in which the shared steps F0 → log F0 → normalized log F0 feed the context windows A, B, C, each followed by a Mean node, so common intermediate results are computed once and reused)

20
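The nested feature names above suggest a composable, memoized extraction scheme. The sketch below illustrates the idea in miniature; the extractor names and one-argument composition are simplifications of mine, not AuToBI's actual (Java) API, and the `context[...,A]` two-argument form is omitted.

```python
import numpy as np

# Each requested name, e.g. "mean[norm[log[f0]]]", is a composition of
# registered extractors; a shared cache means intermediate results
# (log f0, normalization) are computed once and reused across requests.
EXTRACTORS = {
    "f0":   lambda feats: feats["raw_f0"],
    "log":  lambda feats, x: np.log(x),
    "norm": lambda feats, x: (x - x.mean()) / x.std(),
    "mean": lambda feats, x: float(x.mean()),
}

def extract(name, feats, cache):
    """Resolve a nested feature name recursively, memoizing results."""
    if name in cache:
        return cache[name]
    if "[" in name:
        outer, inner = name.split("[", 1)
        arg = extract(inner[:-1], feats, cache)   # strip trailing "]"
        value = EXTRACTORS[outer](feats, arg)
    else:
        value = EXTRACTORS[name](feats)
    cache[name] = value
    return value

feats = {"raw_f0": np.array([100.0, 200.0, 400.0])}
cache = {}
print(extract("mean[log[f0]]", feats, cache))
print(sorted(cache))  # intermediate "f0" and "log[f0]" were cached
```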

Page 21: More than Words: Advancing Prosodic Analysis

Correcting Classifiers for Prominence Detection

• Examine the predictive power of intensity drawn from 210 different spectral regions. [Rosenberg & Hirschberg 2006, 2007]

(figure: band-limited intensity contours for the utterance "My name is Randy Keller")

21

Page 22: More than Words: Advancing Prosodic Analysis

Correcting Classifiers

• For each ensemble member, train an additional correcting classifier using pitch and duration features.

• Predict whether an ensemble member will be correct or incorrect.

• Invert the prediction if the correcting classifier predicts incorrect.

score(A) = Σ_{i=1}^{N} [ θ(A | x_i) · ψ(C | y_i) + (1 − θ(¬A | x_i)) · (1 − ψ(¬C | y_i)) ]

(diagram: Energy Classifier paired with Correcting Classifier)

22
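The invert-on-predicted-error rule above is tiny in code. This is a sketch of the control flow only; both models here are stand-in callables with made-up decision rules, not the actual trained classifiers.

```python
# An energy-based detector's label is flipped whenever a second
# classifier (trained on pitch and duration features) predicts the
# first one will be wrong on this example.
def corrected_predict(energy_clf, corrector, x_energy, x_pitch_dur):
    label = energy_clf(x_energy)
    will_be_wrong = corrector(x_pitch_dur)
    return (not label) if will_be_wrong else label

# Toy stand-ins: energy says "accented" iff energy > 0.5; the corrector
# has "learned" that high-pitch, low-energy frames fool the energy model.
energy_clf = lambda e: e > 0.5
corrector = lambda pd: pd["pitch"] > 250 and pd["energy"] < 0.5

print(corrected_predict(energy_clf, corrector, 0.3, {"pitch": 300, "energy": 0.3}))  # flipped to True
print(corrected_predict(energy_clf, corrector, 0.8, {"pitch": 120, "energy": 0.8}))  # kept as True
```

The appeal of the scheme is modularity: the corrector only needs to model where the base classifier fails, not the full prominence decision.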

Page 23: More than Words: Advancing Prosodic Analysis

Correcting Classifier Diagram

(diagram: a bank of Energy Classifiers, each followed by a Corrector, feeding Filters and an Aggregator)

23

Page 24: More than Words: Advancing Prosodic Analysis

Correcting Classifier Performance

Corpus     Unfiltered   Energy Voting   Corrected Voting   Change
BDC-read   79.80        79.87           84.38              +4.51
BDC-spon   79.12        80.67           83.20              +2.53
BURNC      82.90        83.18           85.51              +2.33

Speaker-Dependent Performance

24

Page 25: More than Words: Advancing Prosodic Analysis

Learning Representations

• Find redundancy in the data.

• Correlated dimensions — like PCA

• Irrelevant dimensions — L1 or L0 regularization

• Goal here: learn discrete categories, with no discriminative labels (as in MDS or LDA)

• Clustering or Codebook learning

25

Page 26: More than Words: Advancing Prosodic Analysis

Clustering as a Representation

x ∈ R^2
f(x) ∈ {A, B, C}   (hard cluster assignment)
g(x) ∈ R^3         (soft assignment over the three clusters)

26
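The two mappings above can be sketched with made-up cluster centers; the softmax-of-negative-distance soft assignment is one illustrative choice among many (GMM posteriors would be another).

```python
import numpy as np

# f(x) maps a 2-D point to one of three cluster labels (hard);
# g(x) maps it to R^3 of cluster responsibilities (soft).
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])  # A, B, C

def f(x):
    """Hard assignment: label of the nearest center."""
    d = np.linalg.norm(centers - x, axis=1)
    return "ABC"[int(np.argmin(d))]

def g(x):
    """Soft assignment: softmax of negative distances, summing to 1."""
    d = np.linalg.norm(centers - x, axis=1)
    e = np.exp(-d)
    return e / e.sum()

x = np.array([4.0, 1.0])
print(f(x))               # → B
print(np.round(g(x), 3))  # B gets most of the mass
```

As a representation, f(x) gives the discrete symbol used by the sequence models later in the talk, while g(x) preserves graded distance information.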

Page 27: More than Words: Advancing Prosodic Analysis

Learning Representations

• Neural net representations

• Autoencoder: x ∈ R^D, representation g(x) ∈ R^k

g(x) = s(W1 s(W2 x))

(diagram: x encoded through W2 and decoded through W1 back toward x)

27
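A minimal forward pass in the slide's notation, assuming s is a sigmoid; the weights here are random placeholders (training would minimize reconstruction error), and the sizes D = 8, k = 3 are arbitrary.

```python
import numpy as np

# W2 encodes x in R^D down to R^k; W1 decodes back to R^D.
rng = np.random.default_rng(0)
D, k = 8, 3
W2 = rng.normal(scale=0.1, size=(k, D))   # encoder weights
W1 = rng.normal(scale=0.1, size=(D, k))   # decoder weights
s = lambda z: 1.0 / (1.0 + np.exp(-z))    # sigmoid nonlinearity

x = rng.normal(size=D)
h = s(W2 @ x)        # the learned k-dim representation
x_hat = s(W1 @ h)    # reconstruction the training loss compares to x

print(h.shape, x_hat.shape)  # → (3,) (8,)
```

The k-dimensional hidden activation h is the compact prosodic representation; forcing k < D is what makes the network discover redundancy.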

Page 28: More than Words: Advancing Prosodic Analysis

Learning Representations

• Neural net representations

• Bottleneck layer: x ∈ R^D, representation g(x) ∈ R^k

g(x) = s(W1 s(W2 x))

(diagram: x passed through W1 and W2 toward a prediction target t, with the narrow bottleneck layer's activations used as the representation)

28

Page 29: More than Words: Advancing Prosodic Analysis

Applications of Prosodic Representations

• Candidate representations:

  • Manual ToBI labels
  • Automatically hypothesized ToBI labels
  • Codebooks/clusters of acoustic features (k-means, DPGMM)

• Named Entity Tagging

• Sarcasm

• Prosody Sequence Modeling: Speaking Style, Nativeness, Speaker

29

Page 30: More than Words: Advancing Prosodic Analysis

Name Tagging

• Names: Persons, Geopolitical Entities (Places), Organizations.

• These are often misrecognized, and sometimes completely unknown.

• (Most) speech recognition systems will never recognize a word they have never heard before: the "out-of-vocabulary" problem.

• Goal: use prosody to help identify which words in a transcript are actually names, despite this.

work with Denys Katerenchuk

30

Page 31: More than Words: Advancing Prosodic Analysis

Approach

• CRF-based tagger from Heng Ji's (RPI) group.

• Lexical features: n-grams, POS, Brown clusters, syntactic chunking, known dictionaries (place names, etc.)

• Prosodic features:

  • AuToBI hypotheses: 6 features.
  • K-means codebooks of the input features used by AuToBI, with k = 2-10: 8 features.

Name Tagging

Page 32: More than Words: Advancing Prosodic Analysis

Results

• Prosody helps; it is likely approximating punctuation.

• AuToBI features are robust even at worse ASR performance (still higher WER!).

Name Tagging

(bar chart: F1 scores of 39.38, 39.94, 44.34, and 45.02 across four conditions: Text Features, +Prosodic Clusters, +AuToBI Features, and +Prosodic Clusters & AuToBI Features)

WER: 49.13%

Ground Truth: marines battling for control of the bridges in the southern city of Nasiriyah

Hypothesis: marines battling for control the bridges in the southern city of non <GPE> sir </GPE> re f

32

Page 33: More than Words: Advancing Prosodic Analysis

Recognizing Sarcasm

• Sarcasm: the use of irony to indicate scorn or disdain.

• Clips from Daria.

• Rated by 165 participants as sarcastic or sincere.

• Features:

  • Baseline: mean pitch, pitch range, pitch standard deviation, mean intensity, intensity range, speaking rate.
  • Prosodic representations: k = 3 clustering of order-2 Legendre polynomial coefficients based on pitch and intensity.
  • Unigram and bigram rates of both the pitch and intensity representations.

work with Rachel Rakov

33

Page 34: More than Words: Advancing Prosodic Analysis

Results

• Learned representations:

  • Pitch: Fast Rise, Slow Rise, Fast Fall
  • Intensity: Fast Rise, Stable, Moderate Fall

Recognizing Sarcasm

Feature Set                             Accuracy
Chance Baseline                         55.26
Standard Acoustic                       65.78
+Unigram Features                       78.31
+Unigram Features +Intensity Bigrams    81.57
+Unigram Features +Both Bigrams         76.31

(Logistic Regression)

34

Page 35: More than Words: Advancing Prosodic Analysis

Modeling Prosodic Sequences

• Prosodic recognition of:

  • Speaking Style: Read, Spontaneous, Dialog, News

  • Speaker: 4 speakers, all spontaneous speech

  • Nativeness: Native vs. non-native American English speakers, reading the same material

35

Page 36: More than Words: Advancing Prosodic Analysis

Prosodic Sequence Modeling

• 3-gram model with backoff.
• Clusters trained over all material.
• Sequence models trained on training splits.
• Automatic syllabification.
• Only 7 acoustic features: mean pitch and intensity and their deltas, duration, preceding/following silence.

C* = argmax_C p(x_0 | C) p(x_1 | x_0, C) ∏_{i=2}^{N} p(x_i | x_{i-1}, x_{i-2}, C)

Prosodic Sequences
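The class-conditional trigram decision rule above can be sketched as follows. The training sequences are invented, and add-one smoothing stands in for the backoff mentioned on the slide.

```python
import math
from collections import defaultdict

# Per-class trigram models over discrete prosodic cluster labels;
# classification picks C* = argmax_C of the sequence log-likelihood.
def train_trigrams(sequences, vocab):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        padded = ["<s>", "<s>"] + list(seq)
        for i in range(2, len(padded)):
            counts[(padded[i - 2], padded[i - 1])][padded[i]] += 1

    def logprob(seq):
        padded = ["<s>", "<s>"] + list(seq)
        total = 0.0
        for i in range(2, len(padded)):
            ctx = counts[(padded[i - 2], padded[i - 1])]
            # Add-one smoothing in place of backoff (simplification).
            total += math.log((ctx[padded[i]] + 1) / (sum(ctx.values()) + len(vocab)))
        return total

    return logprob

vocab = ["a", "b"]
models = {
    "read":        train_trigrams(["aabb", "aabb", "abab"], vocab),
    "spontaneous": train_trigrams(["bbba", "babb", "bbab"], vocab),
}
best = max(models, key=lambda c: models[c]("aab"))
print(best)  # → read
```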

Page 37: More than Words: Advancing Prosodic Analysis

Dirichlet Process GMMs

G | {α, G0} ~ DP(α, G0)
θ_n | G ~ G
X_n | θ_n ~ p(x_n | θ_n)

p(x) = Σ_{n=1}^{∞} π_n N(x; μ_n, Σ_n)

• Non-parametric infinite mixture model.
• No need to specify the number of clusters.
• Need a prior over π: the Dirichlet process.
• And a prior over the component Gaussians: zero-mean.
• Still need to set the hyperparameters α and G0.
• Stick-breaking and Chinese Restaurant Process metaphors.
• Blei and Jordan 2005: variational inference.
• "Rich get richer."

(plate notation from M. Jordan's 2005 NIPS tutorial)

Prosodic Sequences
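The stick-breaking metaphor for the mixture weights π_n is easy to simulate; a sketch (the truncation at 20 components and the function name are my own choices):

```python
import random

# Stick-breaking construction of DP mixture weights: repeatedly break
# off a Beta(1, alpha)-distributed fraction of the remaining stick.
# Truncating at a fixed number of components approximates the
# infinite sum; small alpha concentrates mass on a few clusters.
def stick_breaking(alpha, n_components=20, seed=0):
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(n_components):
        frac = rng.betavariate(1.0, alpha)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    return weights

w = stick_breaking(alpha=1.0)
print(round(sum(w), 4))  # close to 1; the truncated tail holds the rest
```

Early breaks tend to claim most of the stick, which is the "rich get richer" behavior noted on the slide.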

Page 38: More than Words: Advancing Prosodic Analysis

Results

• K-means is a clear winner on all tasks.

• DPGMMs here fail to find effective representations.

(charts: accuracy for Speaking Style (of 4), Nativeness (of 2), and Speaker (of 6), comparing ToBI, K-means, and DPGMM representations over variable-length sequences with repetition)

Prosodic Sequences

38

Page 39: More than Words: Advancing Prosodic Analysis

Common Representations

• Previous experiments generated representations from a wide range of material (3 corpora: spontaneous/read, dialog, news).

• Here: we repeat these experiments with representations learned from material from a single corpus (news only).

• Also include AuToBI hypotheses; clusters are based on the full feature set (compared to 7 before).

Prosodic Sequences

Page 40: More than Words: Advancing Prosodic Analysis

Results

• K-means provides a robust representation of prosody.

• All speaker material is unknown during representation generation.

(charts: K-means accuracy for Speaking Style (of 4) and Speaker (of 12))

Prosodic Sequences

40

Page 41: More than Words: Advancing Prosodic Analysis

Next Problems

• Hunting for language universals.

• Additional applications.

• Automatically identifying the unit of analysis.

  • Too short: low information. Too long: low generalization.
  • Unify with representation learning.

• Identifying "discriminative" prosodic events.

  • In emotion, deception, and foreign accent recognition, the important signal is rare.
  • Discriminative modeling.
  • Anomaly detection (one-class modeling).

41

Page 42: More than Words: Advancing Prosodic Analysis

Thanks

Denys Katerenchuk, Rachel Rakov

Adam Goodkind, Ali Raza Syed, David Guy Brizan, Felix Grezes, Guozhen An, Michelle Morales, Min Ma, Justin Richards, Syed Reza

[email protected]

speech.cs.qc.cuny.edu eniac.cs.qc.cuny.edu/andrew

Questions?