9th Conference on Telecommunications – Conftele 2013
Castelo Branco, Portugal, May 8-10, 2013
Sara Candeias 1
Dirce Celorico 1
Jorge Proença 1
Arlindo Veiga 1,2
Fernando Perdigão 1,2
1 Instituto de Telecomunicações, Polo de Coimbra, Portugal
2 Universidade de Coimbra, DEEC, Portugal
Automatically distinguishing Styles of Speech
Summary
Objective
Characterization of the corpus
Automatic segmentation
  Method
  Performance
Automatic classification
  Features
  Classification method
  Results
    Speech versus Non-speech
    Read versus Spontaneous
Conclusions and future work
| Conftele 2013 - Castelo Branco, Portugal - May 8-10 2013
Objective
Automatic detection of styles of speech for segmentation of multimedia data
Speech - Who? What? How?
Style of a speech segment?
Segment broadcast news samples into the two most evident classes: read versus spontaneous speech (prepared and unprepared speech)
Using a combination of phonetic and prosodic features
First explore a speech/non-speech segmentation
Styles of speech: slow, fast, clear, informal, casual, planned, prepared, spontaneous, unprepared, …
Characterization of the corpus
Broadcast News audio corpus
TV Broadcast News MP4 podcasts
Daily download
Extract audio stream and downsample from 44.1 kHz to 16 kHz
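The downsampling step can be illustrated with a short sketch. The two sample rates come from the slide; the use of a rational (polyphase) resampler and all names below are assumptions, since the slides do not say which tool was used.

```python
from fractions import Fraction

SOURCE_RATE_HZ = 44_100  # sample rate of the podcast audio (from the slide)
TARGET_RATE_HZ = 16_000  # sample rate used for analysis (from the slide)

# Reduce the rate ratio to the smallest integer up/down factors.
ratio = Fraction(TARGET_RATE_HZ, SOURCE_RATE_HZ)
up, down = ratio.numerator, ratio.denominator


def resampled_length(n_samples: int) -> int:
    """Number of output samples after resampling by up/down (rounded up)."""
    return -(-n_samples * up // down)


print(up, down)                       # 160 441
print(resampled_length(44_100 * 60))  # one minute of audio -> 960000
```

These factors are what a polyphase routine such as `scipy.signal.resample_poly(x, up, down)` would take, if SciPy is the tool of choice.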
30 daily news programs (~27 hours) were manually segmented and annotated in 4 levels:
Level 1 – dominant signal: speech, noise, music, silence, clapping, …
For speech:
Level 2 – acoustical environment: clean, music, road, crowd, …
Level 3 – speech style: prepared speech, Lombard speech and 3 levels of unprepared speech (as a function of spontaneity)
Level 4 – speaker info: BN anchor, gender, public figures, …
Characterization of the corpus
From Level 1 – speech versus non-speech
From Level 3 – read speech (prepared) versus spontaneous speech
Type of segment      Number of segments   Average duration (± std deviation) (s)
Speech               7971                 11.0 (± 9.4)
Non-Speech           2529                 4.1 (± 5.3)
Read Speech          4989                 10.6 (± 8.5)
Spontaneous Speech   1738                 12.0 (± 10.4)
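As a quick consistency check, the total duration per class can be estimated as number of segments times average duration. The figures are copied from the table; the dictionary and function names are illustrative only.

```python
# (number of segments, average duration in seconds), copied from the table
segments = {
    "Speech": (7971, 11.0),
    "Non-Speech": (2529, 4.1),
    "Read Speech": (4989, 10.6),
    "Spontaneous Speech": (1738, 12.0),
}


def total_hours(name: str) -> float:
    """Estimated total duration of one class, in hours."""
    count, mean_duration_s = segments[name]
    return count * mean_duration_s / 3600.0


corpus_hours = total_hours("Speech") + total_hours("Non-Speech")
for name in segments:
    print(f"{name}: {total_hours(name):.1f} h")
print(f"Speech + Non-Speech: {corpus_hours:.1f} h")
```

Speech plus non-speech comes to roughly 27.2 hours, in line with the ~27 hours of annotated programs.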
Methods
Automatic Detection
1. Automatic Segmentation
(find/mark different segments on the audio signal)
2. Automatic Classification (classify the segments)
Methods
1. Automatic segmentation
Based on modified BIC (Bayesian Information Criterion):
DISTBIC – uses a distance (Kullback-Leibler) in the first step and delta BIC (DBIC) to validate marks
[Diagram: consecutive segments s(i-1), s(i), s(i+1), s(i+2), with candidate marks labelled DBIC<0 and DBIC>0]
Parameters:
Acoustic vector: 16 Mel-Frequency Cepstral Coefficients (MFCCs) and logarithm of energy (window 25 ms, step 10 ms)
A threshold of 0.6 times the standard deviation of the distance was used to select significant local maxima; window size: 2000 ms, step 100 ms
Silence segments with duration above 0.5 seconds are detected and removed before the DISTBIC process
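The selection of significant local maxima on the distance curve can be sketched as follows. The 0.6 factor comes from the slide; the exact significance rule is an assumption here (a peak is kept when it exceeds the nearest minimum on each side by more than 0.6 times the standard deviation of the curve), and the function name is hypothetical.

```python
import statistics


def significant_maxima(dist, alpha=0.6):
    """Indices of local maxima that stand out from both neighbouring minima.

    Assumed rule: the drop from the peak to the nearest local minimum on
    each side must exceed alpha * std(dist).
    """
    threshold = alpha * statistics.pstdev(dist)
    peaks = []
    for i in range(1, len(dist) - 1):
        if not (dist[i] > dist[i - 1] and dist[i] > dist[i + 1]):
            continue  # not a strict local maximum
        left = i
        while left > 0 and dist[left - 1] < dist[left]:
            left -= 1  # walk downhill to the left minimum
        right = i
        while right < len(dist) - 1 and dist[right + 1] < dist[right]:
            right += 1  # walk downhill to the right minimum
        if dist[i] - dist[left] > threshold and dist[i] - dist[right] > threshold:
            peaks.append(i)
    return peaks


# Toy distance curve: one sharp peak (index 2) and one shallow bump (index 5).
curve = [0.0, 0.1, 1.0, 0.1, 0.2, 0.25, 0.2, 0.0]
print(significant_maxima(curve))  # [2]
```

Each retained index would then be a candidate mark to validate with DBIC.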
Results
Performance measure
Automatic Segmentation:
Collar (detection tolerance) range 0.5 s to 2.0 s
A detected mark is assigned as correct if there is one reference mark inside the collar allowed interval
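$\text{Recall} = \dfrac{\#\ \text{correctly detected marks}}{\#\ \text{reference marks}}$
$\text{Precision} = \dfrac{\#\ \text{correctly detected marks}}{\#\ \text{correctly detected marks} + \#\ \text{unexpected marks}}$
$\text{F1-score} = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$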
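A minimal sketch of collar-based scoring, assuming marks are given as times in seconds. The one-to-one matching (each reference mark can validate at most one detected mark) is an assumption, and the function name is hypothetical.

```python
def score_marks(detected, reference, collar):
    """Return (precision, recall, f1) for detected segment marks.

    A detected mark counts as correct when an unused reference mark lies
    within +/- collar seconds of it (one-to-one matching is assumed).
    """
    used = [False] * len(reference)
    correct = 0
    for mark in sorted(detected):
        for j, ref in enumerate(reference):
            if not used[j] and abs(mark - ref) <= collar:
                used[j] = True
                correct += 1
                break
    unexpected = len(detected) - correct
    recall = correct / len(reference) if reference else 0.0
    precision = correct / (correct + unexpected) if detected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


reference_marks = [10.0, 20.0, 30.0, 40.0]
detected_marks = [10.3, 21.2, 35.0]
for collar in (0.5, 1.5):
    p, r, f1 = score_marks(detected_marks, reference_marks, collar)
    print(f"collar {collar} s: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this toy example, widening the collar from 0.5 s to 1.5 s accepts the mark at 21.2 s, so all three scores rise.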
Results
Segmentation performance
[Figure: F1-score versus collar (seconds), collar range 0.5 s to 2.0 s; vertical axis from 0.3 to 0.8]
Results
Segmentation performance
[Figure: Recall versus collar (seconds), collar range 0.5 s to 2.0 s; vertical axis labelled "Accuracy", from 0.5 to 1.0]
Methods
2. Automatic Classification – Features
A vector of 322 features is computed for each segment
Phonetic (size of parameter vector for each segment: 214)
• Based on the results of a free phone loop speech recognition
• Phone duration and recognized log likelihood: 5 statistical functions (mean, median, maximum, minimum and standard deviation)
• Silence and speech rate
Prosodic (size of parameter vector for each segment: 108)
• Based on the pitch (F0) and harmonic to noise ratio (HNR) envelope
• First and second order statistics
• Polynomial fit of first and second order
• Reset rate (rate of voiced portions)
• Voiced and unvoiced duration rates
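A few of these segment-level measures can be sketched in plain Python. This is illustrative only: the function names are hypothetical, the choice of population standard deviation is an assumption, unvoiced frames are assumed to be marked by F0 = 0, and "reset rate" is interpreted here as voiced portions per second.

```python
import statistics


def five_stats(values):
    """Mean, median, maximum, minimum and standard deviation of a sequence."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
        "std": statistics.pstdev(values),  # population std (assumption)
    }


def voicing_rates(f0, frame_step_s=0.01):
    """Voiced/unvoiced duration rates and voiced portions per second.

    f0 holds one value per frame; 0.0 marks an unvoiced frame (assumption).
    """
    voiced = [value > 0.0 for value in f0]
    n_voiced = sum(voiced)
    # A voiced portion starts where a voiced frame follows an unvoiced one.
    n_portions = sum(
        1 for i, v in enumerate(voiced) if v and (i == 0 or not voiced[i - 1])
    )
    return {
        "voiced_rate": n_voiced / len(f0),
        "unvoiced_rate": 1.0 - n_voiced / len(f0),
        "reset_rate": n_portions / (len(f0) * frame_step_s),
    }


def linear_slope(values):
    """Slope of a first-order least-squares fit against the frame index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = statistics.fmean(values)
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


phone_durations_s = [0.05, 0.10, 0.08, 0.12, 0.05]            # made-up values
f0_hz = [0.0, 0.0, 120.0, 125.0, 130.0, 0.0, 0.0, 110.0, 105.0, 0.0]
print(five_stats(phone_durations_s))
print(voicing_rates(f0_hz))
print(linear_slope([120.0, 125.0, 130.0]))
```

Concatenating such measures over phone durations, log likelihoods, F0 and HNR is one way a fixed-length vector per segment could be assembled.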
Methods
Classification
SVM (Support Vector Machine) classifiers (WEKA tool, linear kernel, C=14):
• speech / non-speech
• read / spontaneous
2-step classification approach:
Step 1: speech / non-speech classification → non-speech or speech
Step 2 (segments classified as speech): read / spontaneous classification → read or spontaneous
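The two-step approach can be sketched as a cascade of two linear decision functions. The weights below are made up for a toy 2-dimensional input (the real vector has 322 features and the models were trained in WEKA), and all names are hypothetical.

```python
def linear_decision(weights, bias, features):
    """Signed score of a linear classifier: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify_segment(features, speech_model, style_model):
    """Two-step cascade: speech/non-speech first, then read/spontaneous."""
    weights, bias = speech_model
    if linear_decision(weights, bias, features) < 0:
        return "non-speech"  # step 2 is skipped for non-speech segments
    weights, bias = style_model
    if linear_decision(weights, bias, features) >= 0:
        return "read"
    return "spontaneous"


# Toy models with made-up weights, not trained on any data.
SPEECH_MODEL = ([1.0, 0.0], -0.5)
STYLE_MODEL = ([0.0, 1.0], -0.5)

print(classify_segment([0.2, 0.9], SPEECH_MODEL, STYLE_MODEL))  # non-speech
print(classify_segment([0.8, 0.9], SPEECH_MODEL, STYLE_MODEL))  # read
print(classify_segment([0.8, 0.1], SPEECH_MODEL, STYLE_MODEL))  # spontaneous
```

One consequence of this design is that an error in step 1 cannot be recovered in step 2.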
Results
Automatic detection (automatic segmentation + classification)
Agreement time = % of frames correctly classified
Speech / non-speech detection
Type of features   All     Speech   Non-speech
Phonetic           91.5%   94.9%    62.2%
Prosodic           93.2%   97.0%    61.0%
Combination        93.3%   96.6%    64.9%

Read / spontaneous detection
Type of features   All     Read     Spontaneous
Phonetic           76.7%   91.9%    38.6%
Prosodic           81.1%   93.0%    51.2%
Combination        83.3%   92.7%    59.6%
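Agreement time can be sketched as a frame-by-frame comparison of reference and automatic labels. The 10 ms frame step and the (start, end, label) segment representation are assumptions, and the function names are hypothetical.

```python
def frame_labels(segments, n_frames, step_s=0.01):
    """Expand (start_s, end_s, label) segments into one label per frame."""
    labels = [None] * n_frames
    for start_s, end_s, label in segments:
        first = round(start_s / step_s)
        last = min(round(end_s / step_s), n_frames)
        for i in range(first, last):
            labels[i] = label
    return labels


def agreement_time(reference, hypothesis, duration_s, step_s=0.01):
    """Overall and per-class fraction of frames that are correctly labelled."""
    n_frames = round(duration_s / step_s)
    ref = frame_labels(reference, n_frames, step_s)
    hyp = frame_labels(hypothesis, n_frames, step_s)
    counts = {}  # reference label -> (frames, correctly labelled frames)
    for r, h in zip(ref, hyp):
        total, ok = counts.get(r, (0, 0))
        counts[r] = (total + 1, ok + (r == h))
    overall = sum(ok for _, ok in counts.values()) / n_frames
    per_class = {label: ok / total for label, (total, ok) in counts.items()}
    return overall, per_class


reference = [(0.0, 8.0, "speech"), (8.0, 10.0, "non-speech")]
hypothesis = [(0.0, 9.0, "speech"), (9.0, 10.0, "non-speech")]
print(agreement_time(reference, hypothesis, duration_s=10.0))
```

In this toy case the overall agreement is 0.9 while non-speech reaches only 0.5, which shows how the "All" figure is dominated by the longer class.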
Results
Classification only (using given manual segmentation)
Values are accuracy (%)

Speech / non-speech classifier
Type of features   All     Speech   Non-speech
Phonetic           93.8%   96.7%    82.0%
Prosodic           93.8%   97.5%    81.9%
Combination        94.4%   97.6%    84.0%

Read / spontaneous classifier
Type of features   All     Read     Spontaneous
Phonetic           83.2%   92.8%    55.4%
Prosodic           86.4%   95.0%    61.6%
Combination        87.4%   93.7%    69.5%
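The relation between the per-class and "All" columns can be explored with a weighted average. This sketch assumes accuracy is counted per segment over the segment counts given in the corpus characterization, which the slides do not state; only the "Combination" rows are used.

```python
# Segment counts from the corpus table; accuracies (%) from the
# "Combination" rows of the classification-only results.
tasks = {
    "speech / non-speech": {
        "counts": (7971, 2529),
        "accuracy": (97.6, 84.0),
        "reported_all": 94.4,
    },
    "read / spontaneous": {
        "counts": (4989, 1738),
        "accuracy": (93.7, 69.5),
        "reported_all": 87.4,
    },
}


def weighted_accuracy(counts, accuracy):
    """Average of per-class accuracies weighted by the class segment counts."""
    return sum(n * a for n, a in zip(counts, accuracy)) / sum(counts)


for name, task in tasks.items():
    estimate = weighted_accuracy(task["counts"], task["accuracy"])
    print(f"{name}: weighted {estimate:.1f}% vs reported {task['reported_all']}%")
```

Under this assumption the weighted values land within about 0.1 percentage points of the reported "All" figures for these two rows.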
Conclusions and future work
Read speech can be distinguished from spontaneous speech with reasonable accuracy.
Results were obtained with only a few simple measures of the speech signal.
A combination of phonetic and prosodic features provided the best results (both seem to carry important and complementary information).
We have already implemented several important features, such as hesitation detection, aspiration detection using word spotting techniques, speaker identification using GMM, and jingle detection based on audio fingerprinting.
We intend to automatically segment all audio genres and speaking styles.
Appendix – BIC
BIC (Bayesian Information Criterion): dissimilarity measure between 2 consecutive segments
A segment X is split into two consecutive segments X1 and X2
Two hypotheses:
H0 – No change of signal characteristics. Model: 1 Gaussian: $X \sim N(x; \mu_X, \Sigma_X)$
H1 – Change of characteristics. 2 Gaussians: $X_1 \sim N(x; \mu_{X_1}, \Sigma_{X_1})$; $X_2 \sim N(x; \mu_{X_2}, \Sigma_{X_2})$
μ – mean vector; Σ – covariance matrix
Maximum likelihood ratio between H0 and H1:
$R(i) = \frac{N_X}{2}\log|\Sigma_X| - \frac{N_{X_1}}{2}\log|\Sigma_{X_1}| - \frac{N_{X_2}}{2}\log|\Sigma_{X_2}|$
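The likelihood ratio can be sketched for the simplest case of one-dimensional features, where the determinant of the covariance matrix reduces to the variance and N is the number of frames in a segment. The delta BIC function uses the usual penalty form from the BIC literature, which is not shown on the slides and is therefore an assumption; all names are hypothetical.

```python
import math
import statistics


def log_likelihood_ratio(x1, x2):
    """R(i) for 1-D features: one Gaussian for X versus one each for X1, X2."""
    n1, n2 = len(x1), len(x2)
    x = list(x1) + list(x2)
    var = statistics.pvariance  # maximum-likelihood variance estimate
    return (
        (n1 + n2) / 2 * math.log(var(x))
        - n1 / 2 * math.log(var(x1))
        - n2 / 2 * math.log(var(x2))
    )


def delta_bic(x1, x2, penalty_weight=1.0):
    """Assumed form: -R(i) + lambda * P, with P the model-size penalty."""
    n = len(x1) + len(x2)
    d = 1  # feature dimension in this 1-D sketch
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * math.log(n)
    return -log_likelihood_ratio(x1, x2) + penalty_weight * penalty


quiet = [0.0, 0.1, -0.1, 0.05, -0.05] * 4   # made-up frames around 0
loud = [5.0, 5.1, 4.9, 5.05, 4.95] * 4      # made-up frames around 5
print(delta_bic(quiet, loud))   # strongly negative: two Gaussians fit better
print(delta_bic(quiet, quiet))  # positive: no evidence of a change
```

With this sign convention a negative value supports H1 (a change between the two segments).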