Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Arlindo Veiga, Dirce Celorico, Jorge Proença, Sara Candeias, Fernando Perdigão. Prosodic and Phonetic Features for Speaking Styles Classification and Detection. IberSPEECH 2012 - VII Jornadas en Tecnología del Habla & III Iberian SLTech Workshop, November 21-23 2012, Universidad Autónoma de Madrid, Madrid, SPAIN.



Page 1: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Arlindo Veiga, Dirce Celorico, Jorge Proença, Sara Candeias, Fernando Perdigão

Prosodic and Phonetic Features for Speaking Styles Classification and Detection

IberSPEECH 2012 - VII Jornadas en Tecnología del Habla & III Iberian SLTech Workshop

November 21-23 2012, Universidad Autónoma de Madrid, Madrid, SPAIN

Page 2: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Summary

Objective

Characterization of the corpus

Features

Methods
• Automatic segmentation
• Classification

Results
• Segmentation
• Automatic detection: speech versus non-speech, read versus spontaneous
• Classification: speech versus non-speech, read versus spontaneous

Conclusions and future work

Page 3: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Objective

Automatic detection of speaking styles for segmentation purposes of multimedia data

Style of a speech segment?

• Segment broadcast news documents into the two most evident classes: read versus spontaneous speech (prepared and unprepared speech)
• Using a combination of phonetic and prosodic features
• Also explore speech/non-speech segmentation

slow, fast, clear, informal, casual, planned, prepared, spontaneous, unprepared, …

Page 4: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Characterization of the corpus

Broadcast News audio corpus
• TV Broadcast News MP4 podcasts
• Daily download
• Extract audio stream and downsample from 44.1 kHz to 16 kHz

30 daily news programs (~27 hours) were manually segmented and annotated in 4 levels:

Level 1 – dominant signal: speech, noise, music, silence, clapping, …

For speech:
Level 2 – acoustical environment: clean, music, road, crowd, …
Level 3 – speech style: prepared speech, Lombard speech and 3 levels of unprepared speech (as a function of spontaneity)
Level 4 – speaker info: BN anchor, gender, public figures, …

Page 5: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Characterization of the corpus

From Level 1 – speech versus non-speech
From Level 3 – read speech (prepared) versus spontaneous speech

Type of segment | Number of segments | Average duration ± std deviation (s)
Speech | 7971 | 11.0 ± 9.4
Non-Speech | 2529 | 4.1 ± 5.3
Read Speech | 4989 | 10.6 ± 8.5
Spontaneous Speech | 1738 | 12.0 ± 10.4

For each segment, a vector of 322 features (214 phonetic features and 108 prosodic features) is computed
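As a quick consistency check on the figures above, the following sketch multiplies segment counts by average durations and sums the two feature-group sizes. The variable and function names are illustrative only; the numbers are copied from the slide.

```python
# Segment counts and average durations (seconds), copied from the table above.
segments = {
    "speech": (7971, 11.0),
    "non-speech": (2529, 4.1),
    "read": (4989, 10.6),
    "spontaneous": (1738, 12.0),
}

def total_hours(count, mean_duration_s):
    """Approximate total duration in hours as count x average duration."""
    return count * mean_duration_s / 3600.0

speech_h = total_hours(*segments["speech"])
non_speech_h = total_hours(*segments["non-speech"])
corpus_h = speech_h + non_speech_h

# Feature vector size: phonetic plus prosodic features.
n_features = 214 + 108

print(f"speech + non-speech: {corpus_h:.1f} h")
print(f"features per segment: {n_features}")
```

The speech and non-speech totals add up to roughly 27 hours, which agrees with the corpus size quoted for the 30 annotated programs.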

Page 6: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Features

Phonetic (size of parameter vector for each segment: 214)
• Based on the results of a free phone loop speech recognition
• Phone duration and recognized log-likelihood: 5 statistical functions (mean, median, maximum, minimum and standard deviation)
• Silence and speech rate
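The five statistical functions listed above can be sketched as follows. This is only an illustration: the phone list is made-up data, the `sil` label for silence is hypothetical, and the definitions of silence rate (fraction of time in silence) and speech rate (phones per second) are assumptions, since the slide does not define them. It does not reproduce the full 214-dimensional vector.

```python
import statistics

def five_stats(values):
    """Mean, median, maximum, minimum and standard deviation of a list."""
    return [
        statistics.fmean(values),
        statistics.median(values),
        max(values),
        min(values),
        statistics.pstdev(values),  # population form; an assumption
    ]

# Hypothetical phone-loop output: (phone label, duration in s, log-likelihood).
phones = [
    ("sil", 0.30, -5.0),
    ("a", 0.08, -60.2),
    ("t", 0.05, -41.0),
    ("u", 0.12, -75.5),
    ("sil", 0.45, -6.1),
]

spoken = [p for p in phones if p[0] != "sil"]
total_s = sum(duration for _, duration, _ in phones)

duration_stats = five_stats([duration for _, duration, _ in spoken])
loglik_stats = five_stats([loglik for _, _, loglik in spoken])
# Assumed definitions: fraction of time in silence, and phones per second.
silence_rate = sum(d for label, d, _ in phones if label == "sil") / total_s
speech_rate = len(spoken) / total_s

print(duration_stats, loglik_stats, silence_rate, speech_rate)
```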

Prosodic (size of parameter vector for each segment: 108)
• Based on the pitch (F0) and harmonic to noise ratio (HNR) envelope
• First and second order statistics
• Polynomial fit of first and second order
• Reset rate (rate of voiced portions)
• Voiced and unvoiced duration rates
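A few of the prosodic measures above can be illustrated on a toy F0 track. This is a sketch under stated assumptions, not the authors' extraction code: the F0 values are invented, unvoiced frames are assumed to be marked with 0, the frame step is assumed to be 10 ms, and only the first-order polynomial fit is shown.

```python
import statistics

FRAME_STEP_S = 0.01  # assumed 10 ms step between F0 values

def voiced_runs(f0):
    """Split an F0 track into runs of consecutive voiced frames (F0 > 0)."""
    runs, current = [], []
    for value in f0:
        if value > 0:
            current.append(value)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs

def linear_fit(values):
    """Least-squares line through (0, v0), (1, v1), ...: (slope, intercept)."""
    n = len(values)
    if n < 2:
        return 0.0, float(values[0])
    mean_x = (n - 1) / 2
    mean_y = statistics.fmean(values)
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

f0 = [0, 0, 100, 110, 120, 0, 0, 0, 200, 190, 180, 170, 0]  # hypothetical Hz
runs = voiced_runs(f0)
voiced = [v for run in runs for v in run]
duration_s = len(f0) * FRAME_STEP_S

features = {
    "f0_mean": statistics.fmean(voiced),           # first order statistic
    "f0_std": statistics.pstdev(voiced),           # second order statistic
    "slope_first_run": linear_fit(runs[0])[0],     # first-order polynomial fit
    "reset_rate": len(runs) / duration_s,          # voiced portions per second
    "voiced_rate": len(voiced) / len(f0),          # fraction of voiced frames
}
print(features)
```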

Page 7: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Methods

Automatic detection

Implies automatic segmentation and automatic classification

Automatic segmentation based on modified BIC (Bayesian Information Criterion) - DISTBIC

Binary classification: SVM classifiers

Page 8: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Methods

Automatic segmentation
DISTBIC - uses a distance (Kullback-Leibler) in the first step and delta BIC (ΔBIC) to validate marks

[Figure: consecutive segments s_{i-1}, s_i, s_{i+1}, s_{i+2}, with candidate marks labelled ΔBIC<0 and ΔBIC>0]

Parameters:
• Acoustic vector: 16 Mel-Frequency Cepstral Coefficients (MFCCs) and logarithm of energy (windows 25 ms, step 10 ms)
• A threshold of 0.6 in the distance standard deviation was used to select significant local maxima; window size: 2000 ms, step 100 ms
• Silence segments with duration above 0.5 seconds are detected and removed for the DISTBIC process
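The selection of significant local maxima in the first DISTBIC step can be sketched as below. This is one possible reading of the 0.6 threshold, assumed here rather than taken from the slides: a local maximum of the distance curve is kept when it exceeds the nearest minimum on each side by more than 0.6 times the standard deviation of the curve. The distance values are invented.

```python
import statistics

def significant_maxima(distances, alpha=0.6):
    """Indices of local maxima exceeding both neighbouring minima by alpha * std."""
    threshold = alpha * statistics.pstdev(distances)
    peaks = []
    for i in range(1, len(distances) - 1):
        if not (distances[i - 1] < distances[i] >= distances[i + 1]):
            continue
        # Walk downhill on each side to the nearest local minimum.
        left = i
        while left > 0 and distances[left - 1] < distances[left]:
            left -= 1
        right = i
        while right < len(distances) - 1 and distances[right + 1] <= distances[right]:
            right += 1
        if (distances[i] - distances[left] > threshold
                and distances[i] - distances[right] > threshold):
            peaks.append(i)
    return peaks

# Hypothetical distance curve, one value per 100 ms analysis step.
curve = [0.1, 0.2, 0.9, 0.3, 0.25, 0.3, 0.2, 1.2, 0.4, 0.1]
candidates = significant_maxima(curve)
print(candidates)
```

In this toy curve the small bump at index 5 is discarded, while the two pronounced peaks survive as candidate marks for the ΔBIC validation step.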

Page 9: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Methods

Classification
SVM classifiers (WEKA tool – SMO, linear kernel, C=14):
• speech / non-speech
• read / spontaneous

2-step classification approach

[Diagram: the speech / non-speech classification outputs non-speech or speech; speech segments go on to the read / spontaneous classification, which outputs read or spontaneous]
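The two-step approach can be sketched as a simple cascade. The two decision functions below are hypothetical stand-ins (toy threshold rules) for the trained SVM classifiers; they are not the WEKA models and are only there to make the example runnable.

```python
from typing import Callable, Sequence

Features = Sequence[float]

def classify_segment(
    features: Features,
    is_speech: Callable[[Features], bool],
    is_read: Callable[[Features], bool],
) -> str:
    """Two-step decision: speech/non-speech first, then read/spontaneous."""
    if not is_speech(features):
        return "non-speech"
    return "read" if is_read(features) else "spontaneous"

# Hypothetical stand-ins for the two trained classifiers.
def toy_is_speech(features: Features) -> bool:
    return features[0] > 0.5

def toy_is_read(features: Features) -> bool:
    return features[1] > 0.0

for vector in ([0.2, 1.0], [0.9, 1.0], [0.9, -1.0]):
    print(vector, classify_segment(vector, toy_is_speech, toy_is_read))
```

Only segments accepted as speech by the first step reach the second classifier, so errors in the first step propagate to the read / spontaneous decision.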

Page 10: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Results

Performance measure

Segmentation only:
• Collar (detection tolerance) range 0.5 s to 2.0 s
• A detected mark is counted as correct if there is a reference mark closer than the “collar”
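Collar-based scoring of segmentation marks can be sketched as follows. The mark times are invented, and the matching rule is a simplifying assumption: a mark counts as matched whenever any mark of the other set lies within the collar, without enforcing one-to-one pairing.

```python
def collar_scores(detected, reference, collar):
    """Precision, recall and F1 for boundary marks, given a tolerance in seconds."""
    def has_match(mark, others):
        return any(abs(mark - other) < collar for other in others)

    correct_detected = sum(has_match(d, reference) for d in detected)
    found_reference = sum(has_match(r, detected) for r in reference)
    precision = correct_detected / len(detected) if detected else 0.0
    recall = found_reference / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

reference_marks = [10.0, 25.0, 40.0]       # hypothetical reference marks (s)
detected_marks = [10.3, 26.2, 33.0, 40.1]  # hypothetical detected marks (s)

for collar in (0.5, 1.0, 1.5, 2.0):
    p, r, f1 = collar_scores(detected_marks, reference_marks, collar)
    print(f"collar {collar:.1f} s: precision {p:.2f}, recall {r:.2f}, F1 {f1:.2f}")
```

Widening the collar can only keep or increase the number of matched marks, which is why scores are reported over a range of collar values.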

Automatic detection: “AT” – agreement time = % of frames correctly classified

Classification only: accuracy (“Acc.”)
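Agreement time can be sketched as a frame-by-frame comparison of reference and automatic labels. The segment lists are invented and the 10 ms frame step is an assumption for illustration.

```python
FRAME_STEP_S = 0.01  # assumed 10 ms frames

def frame_labels(segments, n_frames):
    """Expand (start_s, end_s, label) segments into one label per frame."""
    labels = [None] * n_frames
    for start, end, label in segments:
        first = round(start / FRAME_STEP_S)
        last = min(round(end / FRAME_STEP_S), n_frames)
        for i in range(first, last):
            labels[i] = label
    return labels

def agreement_time(reference, hypothesis, total_s):
    """Percentage of frames whose automatic label equals the reference label."""
    n_frames = round(total_s / FRAME_STEP_S)
    ref = frame_labels(reference, n_frames)
    hyp = frame_labels(hypothesis, n_frames)
    agree = sum(r == h for r, h in zip(ref, hyp))
    return 100.0 * agree / n_frames

# Hypothetical 10 s excerpt: the automatic boundary is 1 s late.
reference_segments = [(0.0, 4.0, "read"), (4.0, 10.0, "spontaneous")]
automatic_segments = [(0.0, 5.0, "read"), (5.0, 10.0, "spontaneous")]

at = agreement_time(reference_segments, automatic_segments, total_s=10.0)
print(f"AT = {at:.1f}%")
```

Because the measure is computed per frame, it penalizes both misplaced boundaries and wrong class labels in proportion to the time they affect.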

Page 11: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Results

Segmentation performance:

[Figure: F1-score versus collar (seconds), collar range 0.5 s to 2.0 s; F1-score axis from 0.3 to 0.8]

Page 12: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Results

Segmentation performance:

[Figure: Recall versus collar (seconds), collar range 0.5 s to 2.0 s; vertical axis labelled Accuracy, from 0.5 to 1.0]

Page 13: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Results

Automatic detection

Speech / non-speech detection

Type of features | AT | Speech | Non-speech
Phonetic | 91.5% | 94.9% | 62.2%
Prosodic | 93.2% | 97.0% | 61.0%
Combination | 93.3% | 96.6% | 64.9%

Read / spontaneous detection

Type of features | AT | Read | Spontaneous
Phonetic | 76.7% | 91.9% | 38.6%
Prosodic | 81.1% | 93.0% | 51.2%
Combination | 83.3% | 92.7% | 59.6%

“AT” – agreement time = % of frames correctly classified

Page 14: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Results

Classification only (using given manual segmentation)

Speech / non-speech classifier

Type of features | Acc. | Speech | Non-speech
Phonetic | 93.8% | 96.7% | 82.0%
Prosodic | 93.8% | 97.5% | 81.9%
Combination | 94.4% | 97.6% | 84.0%

Read / spontaneous classifier

Type of features | Acc. | Read | Spontaneous
Phonetic | 83.2% | 92.8% | 55.4%
Prosodic | 86.4% | 95.0% | 61.6%
Combination | 87.4% | 93.7% | 69.5%

“Acc.” – Accuracy

Page 15: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Conclusions and future work

Read speech can be differentiated from spontaneous speech with reasonable accuracy.

Good results were obtained with only a few simple measures of the speech signal.

A combination of phonetic and prosodic features provided the best results (both seem to carry important and complementary information).

We have already implemented several important features, such as hesitation detection, aspiration detection using word spotting techniques, speaker identification using GMM and jingle detection based on audio fingerprint.

We intend to automatically segment all audio genres and speaking styles.

Page 16: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

THANK YOU

Page 17: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Appendix – BIC

BIC (Bayesian Information Criterion)
Dissimilarity measure between 2 consecutive segments

[Figure: window X split into two consecutive segments X1 and X2]

Two hypotheses:
H0 – No change of signal characteristics. Model: 1 Gaussian:
$X \sim N(x; \mu_X, \Sigma_X)$
H1 – Change of characteristics. 2 Gaussians:
$X_1 \sim N(x; \mu_{X_1}, \Sigma_{X_1}); \quad X_2 \sim N(x; \mu_{X_2}, \Sigma_{X_2})$

μ – mean vector; Σ – covariance matrix

Maximum likelihood ratio between H0 and H1:
$R(i) = \frac{N_X}{2} \log|\Sigma_X| - \frac{N_{X_1}}{2} \log|\Sigma_{X_1}| - \frac{N_{X_2}}{2} \log|\Sigma_{X_2}|$

Page 18: Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Appendix – BIC

$\Delta BIC(i) = R(i) - \lambda P$

P – complexity penalization
λ – penalization factor (ideal 1.0)

Change if:
$\Delta BIC(i^{*}) > 0$

Parameters used in this work: p=16; λ=1.3; frame rate = 100; N=200; M=10
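The ΔBIC test can be illustrated with a one-dimensional simplification. This sketch makes several assumptions that go beyond the slides: it uses scalar features (p = 1) instead of the 16-dimensional vectors, it takes the usual BIC penalty P = ½ (p + ½ p (p + 1)) log N, and it assumes the sign convention ΔBIC = R − λP with a change declared when the value is positive. The data are invented.

```python
import math
import statistics

def delta_bic_1d(x1, x2, lam=1.3):
    """Delta BIC for splitting x1 + x2, with one-dimensional Gaussians (p = 1)."""
    x = x1 + x2
    n, n1, n2 = len(x), len(x1), len(x2)
    # Likelihood ratio term R: log-variances weighted by half the segment sizes.
    r = ((n / 2) * math.log(statistics.pvariance(x))
         - (n1 / 2) * math.log(statistics.pvariance(x1))
         - (n2 / 2) * math.log(statistics.pvariance(x2)))
    # Usual BIC penalty 0.5 * (p + p(p+1)/2) * log(n), which is log(n) for p = 1.
    p = 1
    penalty = 0.5 * (p + p * (p + 1) / 2) * math.log(n)
    return r - lam * penalty

# Hypothetical scalar features: one homogeneous stretch and one shifted copy.
base = [0.0, 0.1, -0.1, 0.2, -0.2] * 4
shifted = [value + 10.0 for value in base]

change = delta_bic_1d(base, shifted)   # segments with different means
no_change = delta_bic_1d(base, base)   # two halves of the same process

print(f"different segments: {change:.1f}")
print(f"similar segments:   {no_change:.1f}")
```

With these toy values the first call is positive (the mark would be kept) and the second is negative (the mark would be discarded), which is the behaviour the validation step relies on.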