ELIS-DSSP, Sint-Pietersnieuwstraat 41, B-9000 Gent
CAIR Twente (10/10/2003)
Audio Indexing as a first step in an Audio Information Retrieval System
Jean-Pierre Martens
An Vandecatseye
Frederik Stouten
Information retrieval from audio
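• General scheme
  – audio signal → audio indexing → speech transcription → information querying → info
  – output of audio indexing: time stamps, audio labels
  – output of speech transcription: time stamps, audio labels, text (summary), topic labels
  – audio indexing: this talk; later stages: talks of Steve & Roeland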
Why audio indexing?
• Extract extra-linguistic information
  – commercial, intro, football report, etc.
• Save time
  – let the speech recognizer process only the parts that are expected to contain speech
• Raise speech transcription accuracy
  – allow the speech recognizer to select the right models at the right time
Audio indexing in ATRANOS project
• Project name
  – ATRANOS
• Main project objectives
  – Automatic segmentation/labeling of audio files
  – Automatic transcription of the speech parts
  – Conversion (normalization) of transcriptions for an application (captioning = test vehicle in this project)
• Partners
  – ESAT+CCL/KULeuven, ELIS/UGent, CNTS/UIA
• Status
  – entering its final year
Audio indexing in ATRANOS project
• Mark parts which need no transcription
  – speech / non-speech segmentation
• Detect important change points in speech
  – change of speaker or acoustics (BW, background)
  – segment between change points = speaker turn
• Assign speaker label to each turn
  – all frames of one speaker get same label
• Assign speech mode to each turn
  – prepared versus spontaneous speech
Audio indexing in ATRANOS
• Additional design goals
  – aim for continuous input processing (stream-based)
  – restrict computational load (real-time on PC)
  – restrict maximum delay (memory)
  – aim for language independence
• Evaluation data
  – American Broadcast News database (LDC)
  – Pan-European Broadcast News database (COST278)
  – Spoken Dutch Corpus (CGN)
1. Speech / non-speech segmentation
• Approach
  – construct statistical models (GMMs) for typical situations
  – let these models score individual audio frames
  – group the frames on the basis of these scores
• Which models to build?
  – one clean speech model
  – some common background models (e.g. music)
  – corresponding speech + common background models
  – a garbage model for all the rest
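The frame-scoring step can be illustrated with a minimal sketch. It assumes diagonal-covariance GMMs and uses hypothetical toy models on 2-dimensional features (`MODELS`, `score_frames` and all parameter values are illustrative, not taken from the talk); a real system would train the models on acoustic features such as MFCCs.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of frame x under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical toy models (assumption): one speech, one background, one garbage model.
MODELS = {
    "speech": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "music": [(1.0, [5.0, 5.0], [1.0, 1.0])],
    "garbage": [(1.0, [0.0, 0.0], [25.0, 25.0])],
}

def score_frames(frames, models=MODELS):
    """Return, per frame, a dict mapping model name to log-likelihood."""
    return [{name: gmm_log_likelihood(x, g) for name, g in models.items()}
            for x in frames]

scores = score_frames([[0.2, 0.1], [5.1, 4.9]])
best = [max(s, key=s.get) for s in scores]  # best-scoring model per frame
```

Picking the best model per frame, as in the last line, is only a naive baseline; the talk groups frames with a loop model instead of deciding frame by frame.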
1. Speech / non-speech segmentation
• How to group frames?
  – put models (colored) in a loop model (transition penalty)
  – compute best state sequence (on-line Viterbi-algorithm with forced decisions)
  – perform some post-processing on output sequence
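The effect of the transition penalty can be sketched as follows. This is a plain off-line Viterbi search over a fully connected loop of states, not the on-line variant with forced decisions mentioned on the slide; the function name `viterbi_segment` and the toy scores are assumptions for illustration.

```python
def viterbi_segment(frame_scores, penalty):
    """Best state sequence through a loop model.

    frame_scores: list of dicts {state: log-likelihood}, one per frame.
    penalty: positive cost subtracted whenever the state changes.
    """
    states = list(frame_scores[0])
    best = {s: frame_scores[0][s] for s in states}
    backptr = []
    for scores in frame_scores[1:]:
        new_best, ptr = {}, {}
        for s in states:
            # Staying in s is free; entering s from another state costs `penalty`.
            prev = max(states, key=lambda p: best[p] - (0.0 if p == s else penalty))
            new_best[s] = best[prev] - (0.0 if prev == s else penalty) + scores[s]
            ptr[s] = prev
        best = new_best
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Toy scores: speech (S) throughout, with one weak non-speech (N) glitch.
toy = [{"S": -1, "N": -3}, {"S": -1, "N": -3}, {"S": -2, "N": -1.5},
       {"S": -1, "N": -3}, {"S": -1, "N": -3}]
smoothed = viterbi_segment(toy, penalty=2.0)
unsmoothed = viterbi_segment(toy, penalty=0.0)
```

With a penalty of 2 the isolated glitch is absorbed into the speech segment, whereas with no penalty the search follows the per-frame maximum.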
1. Speech / non-speech segmentation
• Evaluation results (7 data sets)
  – training and parameter setting on Am BN
  – [Bar chart: percentage of speech (S) found and non-speech (NS) found, scale 0 to 100, for data sets Am, SL, SK, PT, CZ, GA, BE; annotation: football reports]
  – performance degrades for unseen situations
2. Speaker segmentation
• Objective
  – detect changes in speaker/acoustics
• Approach
  – identify change points by comparing properties of observations in two intervals at both sides of this point
  – advantage: self-organizing (no speaker models)
2. Speaker segmentation
• Step 1: potential change position detection
  – select positions on a grid (CPU-time)
  – determine fixed length left/right context
  – build 3 models for the data: M(both), M(left), M(right)
  – retain significant maxima in LLR(n; two vs one)
  – [Diagram: candidate position (n) on a grid of blocks of 10 frames, with a left context, a right context, and both contexts together]
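A minimal sketch of this step, assuming each of the three models is a single diagonal-covariance Gaussian fitted by maximum likelihood (the slide does not state the model type). The context length, grid step and threshold below are illustrative values, and `candidate_changes` is a hypothetical helper name.

```python
import math

def gauss_loglik(frames):
    """Log-likelihood of frames under their own ML diagonal Gaussian."""
    n, dim = len(frames), len(frames[0])
    total = 0.0
    for d in range(dim):
        col = [f[d] for f in frames]
        mean = sum(col) / n
        ss = sum((c - mean) ** 2 for c in col)
        var = max(ss / n, 1e-6)  # variance floor to avoid log(0)
        total += -0.5 * (n * math.log(2.0 * math.pi * var) + ss / var)
    return total

def llr(frames, n, context):
    """LLR(n; two vs one): gain of modelling left and right context separately."""
    left, right = frames[n - context:n], frames[n:n + context]
    return gauss_loglik(left) + gauss_loglik(right) - gauss_loglik(left + right)

def candidate_changes(frames, context=20, step=10, threshold=5.0):
    """Grid positions where the LLR is a local maximum above the threshold."""
    grid = list(range(context, len(frames) - context + 1, step))
    vals = [llr(frames, n, context) for n in grid]
    picks = []
    for i, (n, v) in enumerate(zip(grid, vals)):
        left_ok = i == 0 or v > vals[i - 1]
        right_ok = i == len(vals) - 1 or v >= vals[i + 1]
        if v > threshold and left_ok and right_ok:
            picks.append(n)
    return picks

# Toy 1-dimensional signal whose statistics change at frame 50.
frames = [[(i % 5) * 0.1 + (0.0 if i < 50 else 5.0)] for i in range(100)]
picks = candidate_changes(frames)
```

Evaluating only grid positions keeps the cost low, at the price of a boundary resolution equal to the grid step.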
2. Speaker segmentation
• Step 2: boundary elimination
  – pool all boundaries in speech part: Tmax or until EO-S
  – evaluate variable length context of n using BIC
    ΔBIC(n) = LL(M2) - LL(M1) - λ [#par(M2) - #par(M1)] log N(n)
  – select n with minimal ΔBIC(n)
    • if ΔBIC(n) < 0 : eliminate n and reiterate
    • if ΔBIC(n) ≥ 0 : move to the next speech part
  – [Diagram: a speech part of at most Tmax between two non-speech (NS) parts, with the left and right context of boundary n]
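The elimination loop can be sketched as below, using the ΔBIC formula exactly as written on the slide. The sketch assumes single diagonal-covariance Gaussians for M1 (one model) and M2 (two models), so that M2 has one extra mean and variance vector; the helper names and the toy data are illustrative.

```python
import math

def gauss_loglik(frames):
    """Log-likelihood of frames under their own ML diagonal Gaussian."""
    n, dim = len(frames), len(frames[0])
    total = 0.0
    for d in range(dim):
        col = [f[d] for f in frames]
        mean = sum(col) / n
        ss = sum((c - mean) ** 2 for c in col)
        var = max(ss / n, 1e-6)  # variance floor to avoid log(0)
        total += -0.5 * (n * math.log(2.0 * math.pi * var) + ss / var)
    return total

def delta_bic(frames, start, n, end, lam=1.0):
    """delta-BIC of boundary n with context [start, n) and [n, end)."""
    left, right = frames[start:n], frames[n:end]
    extra_params = 2 * len(frames[0])  # #par(M2) - #par(M1), assumed model type
    gain = gauss_loglik(left) + gauss_loglik(right) - gauss_loglik(left + right)
    return gain - lam * extra_params * math.log(end - start)

def eliminate_boundaries(frames, boundaries, lam=1.0):
    """Repeatedly drop the boundary with minimal delta-BIC while it is negative."""
    bounds = sorted(boundaries)
    while bounds:
        edges = [0] + bounds + [len(frames)]
        scored = [
            (delta_bic(frames, edges[i], edges[i + 1], edges[i + 2], lam), edges[i + 1])
            for i in range(len(bounds))
        ]
        worst, n = min(scored)
        if worst >= 0:
            break  # all remaining boundaries are supported by the data
        bounds.remove(n)
    return bounds

# Toy speech part with one true change at frame 50 and two spurious candidates.
frames = [[(i % 5) * 0.1 + (0.0 if i < 50 else 5.0)] for i in range(100)]
kept = eliminate_boundaries(frames, [25, 50, 80])
```

Because each boundary is scored against its neighbouring boundaries, removing one boundary lengthens the context of the others, which is why the scores are recomputed on every iteration.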
2. Speaker segmentation
• Evaluation (7 data sets)
  – recall: how many real changes detected?
  – precision: how many detected changes are real?
  – [Scatter plot "Segmentation in speaker turns": precision (50 to 90) versus recall (50 to 100) for data sets BE, SL, SK, PT, CZ, GA, Am; annotation: 5 out of 7]
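As an illustration of the two measures, the sketch below counts a detected change as real when it lies within a tolerance of a reference change, with each detection matched at most once. The tolerance value and the greedy matching are assumptions; the slides do not specify how detections were matched to reference boundaries.

```python
def match_changes(reference, detected, tolerance):
    """Greedy one-to-one matching of detected to reference change points."""
    unused = sorted(detected)
    hits = 0
    for ref in sorted(reference):
        close = [d for d in unused if abs(d - ref) <= tolerance]
        if close:
            # Consume the nearest detection so it cannot match twice.
            unused.remove(min(close, key=lambda d: abs(d - ref)))
            hits += 1
    return hits

def recall_precision(reference, detected, tolerance=0.5):
    """Recall: share of real changes found. Precision: share of detections that are real."""
    hits = match_changes(reference, detected, tolerance)
    recall = hits / len(reference) if reference else 0.0
    precision = hits / len(detected) if detected else 0.0
    return recall, precision

# Change times in seconds (made-up values).
recall, precision = recall_precision(
    [10.0, 20.0, 30.0, 40.0], [10.2, 19.9, 25.0, 40.4, 41.0]
)
```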
3. Speaker labeling
• Objective
  – assign same label to all turns of the same speaker
• Approach
  – on-line clustering fully integrated in segmentation
  – BIC as decision criterion
• Clustering strategy
  – for all turns in a speech part: compute ΔBIC between turn and ‘closest’ cluster center
  – select turn with maximal ΔBIC:
    • if ΔBIC > 0 take turn as a new cluster
    • else take turn with smallest ΔBIC and add it to closest cluster
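One possible reading of this strategy is sketched below. The ΔBIC computation and the cluster update are passed in as functions, and the toy stand-ins (`toy_delta`, `toy_merge`, turns reduced to a single number) are purely hypothetical. Starting the first cluster from the first turn when no cluster exists yet is an assumption, since the slide does not cover that case.

```python
def cluster_turns(turns, clusters, delta_bic, merge):
    """Assign every turn of a speech part to a new or an existing cluster.

    delta_bic(turn, cluster) is > 0 when separate models are preferred.
    merge(cluster, turn) returns the updated cluster (cluster may be None).
    Returns a list of (turn, cluster_index); `clusters` is updated in place.
    """
    pending = list(turns)
    labels = []
    while pending:
        if not clusters:  # assumption: the first turn seeds the first cluster
            turn = pending.pop(0)
            clusters.append(merge(None, turn))
            labels.append((turn, 0))
            continue
        # Closest cluster (smallest delta-BIC) for every pending turn.
        closest = []
        for turn in pending:
            score, idx = min((delta_bic(turn, c), i) for i, c in enumerate(clusters))
            closest.append((score, idx, turn))
        score, idx, turn = max(closest, key=lambda t: t[0])
        if score > 0:
            clusters.append(merge(None, turn))  # turn starts a new cluster
            idx = len(clusters) - 1
        else:
            score, idx, turn = min(closest, key=lambda t: t[0])
            clusters[idx] = merge(clusters[idx], turn)
        pending.remove(turn)
        labels.append((turn, idx))
    return labels

# Toy stand-ins: a turn is one number, a cluster is a list of such numbers.
def toy_delta(turn, cluster):
    return abs(turn - sum(cluster) / len(cluster)) - 1.0

def toy_merge(cluster, turn):
    return [turn] if cluster is None else cluster + [turn]

clusters = []
labels = dict(cluster_turns([0.0, 5.0, 0.2, 5.1], clusters, toy_delta, toy_merge))
```

Keeping `clusters` as a caller-owned list lets the same clusters carry over from one speech part to the next, in line with the on-line setting described above.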
3. Speaker labeling
• Evaluation methodology
  – step 1: assign official speaker label to each cluster
  – step 2: cluster purity = % frames with correct label
  – step 3: ideal cluster purity: purity for ideal clustering
    • per speaker: 1 cluster with label of that speaker
    • per frame in turn: select label of dominant speaker in turn
  – [Diagram: official labels (A B ABA) versus computed labels (A B AAB), with error zones where they differ]
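Steps 1 and 2 can be sketched as follows, assuming that the official label given to a cluster is that of the speaker who contributes most frames to it (the slide does not say how the label is chosen). The function name and the toy label sequences are illustrative.

```python
from collections import Counter

def cluster_purity(official, computed):
    """Percentage of frames whose cluster carries their official speaker label.

    official, computed: per-frame label sequences of equal length.
    Each cluster gets the label of its dominant official speaker (assumption).
    """
    per_cluster = {}
    for ref, clu in zip(official, computed):
        per_cluster.setdefault(clu, Counter())[ref] += 1
    # Frames of the dominant speaker in each cluster count as correct.
    correct = sum(c.most_common(1)[0][1] for c in per_cluster.values())
    return 100.0 * correct / len(official)

official = list("AAAABBBBAA")              # true speaker per frame
computed = [1, 1, 1, 1, 2, 2, 2, 1, 1, 1]  # cluster index per frame
purity = cluster_purity(official, computed)
```

In this toy case one frame of speaker B ends up in the cluster dominated by speaker A, so 9 of the 10 frames are counted as correct.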
3. Speaker labeling
• Evaluation results (7 data sets)
  – training and parameter setting on Am BN
  – [Bar chart: cluster purity (clus) and ideal cluster purity (ideal clus), scale 0 to 100, for data sets Am, BE, SL, SK, PT, CZ, GA]
  – still room for improvement (nr of clusters also > ideal)
demonstration
4. Speech mode labeling
• Objective
  – spontaneous versus prepared speech
  – how: presence of disfluencies (prior to recognition)
• Disfluencies
  – filled pauses (uh’s, abnormally lengthened sounds)
  – repetitions of words or word groups
  – abbreviations of words
• At present
  – no speech mode labeling results yet
  – therefore ….
4. Disfluency detection
• Feature identification
  – CGN (Spoken Dutch Corpus): conversational speech
  – bootstrap data set (11h)
    • 3255 annotated uh’s
    • manual word alignments available (location of uh’s)
• Approach
  – perform segmentation into phoneme-sized parts on the basis of a cepstral difference measure
  – identify features revealing the FP/NFP nature of these parts (FP = filled pause, NFP = not a filled pause)
  – supply these features to a statistical classifier
  – keep everything stream-based (to fit with the rest)
4. Disfluency detection
• Feature detection on bootstrap data
  – segment duration
  – [Plot: prob (0 to 1) versus segment duration in # frames (0 to 100), curves for NFP and FP]
4. Disfluency detection
• Feature detection on bootstrap data
  – segment duration
  – spectral stability
  – [Plot: prob (0 to 1) versus dstab (0 to 25), curves for NFP and FP]
4. Disfluency detection
• Feature detection on bootstrap data
  – segment duration
  – spectral stability
  – stable interval durations
  – [Plot: E[Sid] (0 to 20) versus T (0 to 20), curves for FP and NFP]
4. Disfluency detection
• Feature detection on bootstrap data
  – segment duration
  – spectral stability
  – stable interval durations
  – silence present

                    sil before   no sil before   total
    sil after          1189           773         1962
    no sil after        700           593         1293
    total              1889          1366         3255
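The counts in the table can be turned into proportions with a few lines of arithmetic. The sketch below only re-uses the numbers from the slide; the variable names are illustrative.

```python
# Counts from the slide: (silence before?, silence after?) -> number of uh's
counts = {
    (True, True): 1189,
    (False, True): 773,
    (True, False): 700,
    (False, False): 593,
}
total = sum(counts.values())                     # all annotated uh's
with_silence = total - counts[(False, False)]    # silence on at least one side
share_adjacent = 100.0 * with_silence / total    # about 81.8 %
share_both = 100.0 * counts[(True, True)] / total  # about 36.5 %
```

Roughly four out of five annotated uh's are adjacent to a silence on at least one side, which is what makes the presence of silence a usable feature.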
4. Disfluency detection
• Feature detection on bootstrap data
  – segment duration
  – spectral stability
  – stable interval durations
  – silence present
  – center of gravity
  – [Plot: prob (0 to 1) versus spectral centre of gravity (0 to 24), curves for NFP and FP]
4. Disfluency detection
• Feature detection on bootstrap data
  – segment duration
  – spectral stability
  – stable interval durations
  – silence present
  – center of gravity
  – output of simple spectral FP-model (GMM) (12 mfccs)
• Result: 12 useful features identified in total
4. Disfluency detection
• Statistical classifier
  – MLP to estimate P(FP|x) (x = 12 features + 12 mfccs)
  – problem: very low P(FP) (order of 1 %)
  – therefore: design filter to eliminate most certain NFP
• GMM-based filter
  – two GMMs P(x|FP) and P(x|NFP) (x = 12 features)
  – prior probability P(FP) = 0.01 → P(FP|x)
  – retain segment if P(FP|x) > threshold
  – results: 90 % of NFP and < 10 % of FP removed, P(FP) raised from 1 to 10 %
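The filter rests on Bayes' rule, and the reported rise of P(FP) follows from the removal rates. A small sketch of both calculations is given below; the function names are illustrative, and the removal rates plugged in are the bounds quoted on the slide (90 % of NFP, at most 10 % of FP), so the result is only indicative.

```python
def posterior_fp(lik_fp, lik_nfp, prior_fp=0.01):
    """Bayes' rule: P(FP|x) from the likelihoods P(x|FP) and P(x|NFP)."""
    num = lik_fp * prior_fp
    return num / (num + lik_nfp * (1.0 - prior_fp))

def prior_after_filter(prior_fp, fp_removed, nfp_removed):
    """P(FP) among the segments that survive the filter."""
    fp_left = prior_fp * (1.0 - fp_removed)
    nfp_left = (1.0 - prior_fp) * (1.0 - nfp_removed)
    return fp_left / (fp_left + nfp_left)

# With 10 % of FP and 90 % of NFP removed, the prior rises from 1 % to about 8 %;
# with no FP removed at all it would be about 9 %.
after_worst = prior_after_filter(0.01, 0.10, 0.90)
after_best = prior_after_filter(0.01, 0.00, 0.90)
```

Both values are of the order of the 10 % quoted on the slide. Note also that, with a 1 % prior, P(x|FP) must exceed P(x|NFP) by a factor of 99 before P(FP|x) reaches 0.5, which shows why the low prior is a problem for the classifier.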
4. Disfluency detection
• Evaluation on independent test set
  – size: 47 min containing 415 FP
  – available information
    • all uh’s (including word internal ones) were annotated
    • all abnormal sound lengthenings were annotated
    • all corresponding time intervals were manually checked
4. Disfluency detection
• Evaluation on test data
  – Recall-precision (ROC) curves
  – [Plot: precision (%) (30 to 100) versus recall (%) (25 to 95); legend: Supervised Training, Embedded Training, GMM (12 feat. + 12 MFCC), MLP (12 feat. + 12 MFCC)]
  – Embedded training (15h unlabeled data) does not help
  – Our method: R = 75 %, P = 85 %
  – Gabrea method: R = 60 %, P = 65 %
4. Disfluency detection and ASR
• Baseline system
  – 40K lexicon + uh (FP), trigram LM
  – WER = 51.3 % (spontaneous dialogues CGN, uh excluded)
• Cheating experiment
  – remove manually labeled FP-segments from the input
  – equivalent to: recognize FPs, ignore them in LM context
  – equivalent to: remove correct FPs from input stream
  – WER = 47.6 % (7.5 % relative gain, 1.25 word corrections/FP)
• First real experiment
  – remove detected FP-segments from the input
  – WER = 49.4 % (3.7 % relative gain, 0.62 word corrections/FP)
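The relative gains can be recomputed from the word error rates as a quick check. The sketch below uses only the rounded WER values shown on the slide; `relative_gain` is an illustrative helper.

```python
def relative_gain(baseline_wer, new_wer):
    """Relative WER reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

cheating = relative_gain(51.3, 47.6)  # manually labeled FPs removed
real = relative_gain(51.3, 49.4)      # detected FPs removed
```

From the rounded values, the first real experiment reproduces the quoted 3.7 %, while the cheating experiment comes out near 7.2 % rather than the quoted 7.5 %; the figures on the slide may have been computed from unrounded error rates.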
Conclusions
• There exist good audio indexing techniques
  – speech / non-speech segmentation
  – speaker turn segmentation
  – speaker identity labeling
  – filled-pause detection
• These techniques can be used
  – to extract extra-linguistic information for AIR
  – to guide the speech transcription module