
Page 1: Audio Indexing as a first step in an Audio Information Retrieval System

ELIS-DSSP, Sint-Pietersnieuwstraat 41, B-9000 Gent

CAIR Twente (10/10/2003)

Audio Indexing as a first step in an Audio Information Retrieval System

Jean-Pierre Martens

An Vandecatseye

Frederik Stouten

ELIS-DSSP, Sint-Pietersnieuwstraat 41, B-9000 Gent

Page 2: Audio Indexing as a first step in an Audio Information Retrieval System


Information retrieval from audio

General scheme

audio signal → audio indexing → [time stamps, audio labels] → speech transcription → [time stamps, audio labels, text (summary), topic labels] → information querying → info

This talk covers the audio indexing step; the talks of Steve & Roeland cover speech transcription and information querying. (A mock-up of this data flow in code follows below.)
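As a rough illustration only, the sketch below mocks up the three stages and the information passed between them in Python. The type and function names (AudioSegment, index_audio, transcribe, query) are purely illustrative and not taken from the ATRANOS software.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AudioSegment:
    start: float        # time stamp (s)
    end: float          # time stamp (s)
    audio_label: str    # e.g. "speech", "music", speaker label

@dataclass
class TranscribedSegment(AudioSegment):
    text: str           # transcription / summary
    topic_label: str    # topic assigned after transcription

def index_audio(signal) -> List[AudioSegment]:
    """Audio indexing: segments with time stamps and audio labels (this talk)."""
    ...

def transcribe(segments: List[AudioSegment]) -> List[TranscribedSegment]:
    """Speech transcription of the speech segments (talks of Steve & Roeland)."""
    ...

def query(segments: List[TranscribedSegment], request: str) -> List[TranscribedSegment]:
    """Information querying over the indexed and transcribed audio."""
    ...
```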

Page 3: Audio Indexing as a first step in an Audio Information Retrieval System


Why audio indexing?

• Extract extra-linguistic information: commercial, intro, football report, etc.

• Save time: let the speech recognizer only process parts that are expected to contain speech

• Raise speech transcription accuracy: allow the speech recognizer to select the right models at the right time

Page 4: Audio Indexing as a first step in an Audio Information Retrieval System


Audio indexing in ATRANOS project

• Project name
• Main project objectives
  – Automatic segmentation/labeling of audio files
  – Automatic transcription of the speech parts
  – Conversion (normalization) of transcriptions for an application (captioning = test vehicle in this project)
• Partners: ESAT+CCL/KULeuven, ELIS/UGent, CNTS/UIA
• Status: entering its final year


Page 6: Audio Indexing as a first step in an Audio Information Retrieval System


Audio indexing in ATRANOS project

• Mark parts which need no transcription
  – speech / non-speech segmentation
• Detect important change points in speech
  – change of speaker or acoustics (bandwidth, background)
  – segment between change points = speaker turn
• Assign a speaker label to each turn
  – all frames of one speaker get the same label
• Assign a speech mode to each turn
  – prepared versus spontaneous speech

Page 7: Audio Indexing as a first step in an Audio Information Retrieval System


Audio indexing in ATRANOS

• Additional design goals
  – aim for continuous input processing (stream-based)
  – restrict computational load (real-time on a PC)
  – restrict the maximum delay (memory)
  – aim for language independence

• Evaluation data
  – American Broadcast News database (LDC)
  – Pan-European Broadcast News database (COST278)
  – Spoken Dutch Corpus (CGN)

Page 8: Audio Indexing as a first step in an Audio Information Retrieval System


1. Speech / non-speech segmentation

• Approach (a minimal sketch follows below)
  – construct statistical models (GMMs) for typical situations
  – let these models score individual audio frames
  – group the frames on the basis of these scores

• Which models to build?
  – one clean speech model
  – some common background models (e.g. music)
  – corresponding speech + common background models
  – a garbage model for all the rest
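The following Python fragment is a minimal sketch of this approach using scikit-learn, assuming MFCC-like feature frames are already available. The class list, the number of mixture components and the diagonal covariances are assumptions, not the settings of the actual system.

```python
# Sketch of the GMM scoring step: one GaussianMixture per acoustic class,
# each scoring individual feature frames.
import numpy as np
from sklearn.mixture import GaussianMixture

CLASSES = ["clean_speech", "music", "speech+music", "garbage"]  # illustrative class set

def train_models(train_frames: dict, n_components: int = 16) -> dict:
    """train_frames maps a class name to an (N, D) array of feature frames."""
    models = {}
    for name in CLASSES:
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(train_frames[name])
        models[name] = gmm
    return models

def frame_scores(models: dict, frames: np.ndarray) -> np.ndarray:
    """Per-frame log-likelihood of every class model, shape (N, n_classes)."""
    return np.column_stack([models[name].score_samples(frames) for name in CLASSES])
```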

Page 9: Audio Indexing as a first step in an Audio Information Retrieval System


1. Speech / non-speech segmentation

• How to group frames? (see the decoding sketch below)
  – put the models (colored states) in a loop model with a transition penalty
  – compute the best state sequence (on-line Viterbi algorithm with forced decisions)
  – perform some post-processing on the output sequence

[Figure: loop model with a begin state B, an end state E, one state per GMM (1–4) and a transition penalty Pt on switching between them]
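A plain, off-line version of the decoding step could look as follows; the slides use an on-line Viterbi variant with forced decisions, so this sketch only illustrates the loop model with a fixed switch penalty on the per-frame scores from the previous snippet.

```python
import numpy as np

def viterbi_grouping(loglik: np.ndarray, switch_penalty: float = 100.0) -> np.ndarray:
    """loglik: (T, K) per-frame log-likelihoods of K models; returns a model index per frame."""
    T, K = loglik.shape
    # staying in the same model is free, switching to another one costs the penalty
    trans = np.full((K, K), -switch_penalty)
    np.fill_diagonal(trans, 0.0)
    delta = loglik[0].copy()              # best score of a path ending in model k
    back = np.zeros((T, K), dtype=int)    # best predecessor per frame and model
    for t in range(1, T):
        scores = delta[:, None] + trans   # scores[i, j]: come from model i, move to model j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    path = np.empty(T, dtype=int)         # backtrack the best state sequence
    path[-1] = int(delta.argmax())
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```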

Page 10: Audio Indexing as a first step in an Audio Information Retrieval System


1. Speech / non-speech segmentation

• Evaluation results (7 data sets)
  – training and parameter setting on American BN

[Bar chart: percentage of speech (S found) and non-speech (NS found) correctly detected per data set (Am, SL, SK, PT, CZ, GA, BE); an annotation singles out the football reports]

– performance degrades for unseen situations

Page 11: Audio Indexing as a first step in an Audio Information Retrieval System


2. Speaker segmentation

• Objective: detect changes in speaker/acoustics

• Approach
  – identify change points by comparing properties of the observations in two intervals at both sides of this point
  – advantage: self-organizing (no speaker models)

Page 12: Audio Indexing as a first step in an Audio Information Retrieval System


2. Speaker segmentation

• Step 1: potential change position detection (an LLR sketch follows below)
  – select positions on a grid (to limit CPU time)
  – determine a fixed-length left/right context

[Figure: candidate position (n) with a left and a right context built from blocks of 10 frames; "both" denotes the union of the two contexts]

  – build 3 models for the data: M(both), M(left), M(right)
  – retain significant maxima in LLR(n; two models vs one)
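A minimal sketch of the LLR computation, modelling each interval with a single full-covariance Gaussian; the window sizes, the grid step and the significance test on the maxima are left out.

```python
import numpy as np

def gauss_loglik(x: np.ndarray) -> float:
    """Log-likelihood of the frames x (N, D) under one ML-fitted full-covariance Gaussian."""
    n, d = x.shape
    cov = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(d)  # small ridge for stability
    _, logdet = np.linalg.slogdet(cov)
    # for an ML-fitted Gaussian the summed Mahalanobis term equals n*d
    return -0.5 * n * (d * np.log(2.0 * np.pi) + logdet + d)

def llr_two_vs_one(left: np.ndarray, right: np.ndarray) -> float:
    """LLR(n): gain of modelling the left and right contexts separately versus together."""
    both = np.vstack([left, right])
    return gauss_loglik(left) + gauss_loglik(right) - gauss_loglik(both)
```

Significant local maxima of llr_two_vs_one over the candidate grid are then retained as potential change positions.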

Page 13: Audio Indexing as a first step in an Audio Information Retrieval System


2. Speaker segmentation

• Step 2: boundary elimination (a ΔBIC sketch follows below)
  – pool all boundaries in a speech part: up to Tmax or until the end of the speech part (EO-S)
  – evaluate a variable-length context of n using BIC:
    ΔBIC(n) = LL(M2) - LL(M1) - λ [#par(M2) - #par(M1)] log N(n)
  – select the n with minimal ΔBIC(n)
    • if ΔBIC(n) < 0: eliminate n and reiterate
    • if ΔBIC(n) ≥ 0: move to the next speech part

[Figure: a speech part bounded by non-speech (NS) on both sides; a window of at most Tmax with a variable-length left and right context around a candidate boundary n]
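The ΔBIC test itself can be sketched as follows, reusing gauss_loglik from the previous snippet; λ is a tuning constant and the variable-length context handling is omitted.

```python
import numpy as np

def n_params(d: int) -> int:
    """Free parameters of one full-covariance Gaussian in d dimensions."""
    return d + d * (d + 1) // 2

def delta_bic(left: np.ndarray, right: np.ndarray, lam: float = 1.0) -> float:
    """ΔBIC(n) = LL(M2) - LL(M1) - λ [#par(M2) - #par(M1)] log N(n)."""
    both = np.vstack([left, right])
    n, d = both.shape
    ll_m2 = gauss_loglik(left) + gauss_loglik(right)   # two-speaker hypothesis M2
    ll_m1 = gauss_loglik(both)                         # one-speaker hypothesis M1
    # M2 has twice the parameters of M1, so the difference in #par is n_params(d)
    return ll_m2 - ll_m1 - lam * n_params(d) * np.log(n)
```

A negative ΔBIC means one model explains the data as well as two separate ones, so the candidate boundary n is eliminated.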

Page 14: Audio Indexing as a first step in an Audio Information Retrieval System


2. Speaker segmentation

• Evaluation (7 data sets)
  – recall: how many real changes were detected?
  – precision: how many detected changes are real?

[Scatter plot "Segmentation in speaker turns": precision (%) versus recall (%), both from 50 to 100, one point per data set (BE, SL, SK, PT, CZ, GA, Am); an annotation groups 5 out of 7 data sets]

Page 15: Audio Indexing as a first step in an Audio Information Retrieval System


3. Speaker labeling

• Objective: assign the same label to all turns of the same speaker

• Approach
  – on-line clustering, fully integrated in the segmentation
  – BIC as decision criterion

• Clustering strategy (a rough sketch follows below)
  – for all turns in a speech part: compute ΔBIC between the turn and the 'closest' cluster center
  – select the turn with maximal ΔBIC:
    • if ΔBIC > 0: take the turn as a new cluster
    • else: take the turn with the smallest ΔBIC and add it to its closest cluster
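The decision rule can be sketched as below, reusing delta_bic from the ΔBIC snippet. Note that this simplified version decides turn by turn, whereas the slides rank all turns of a speech part before deciding.

```python
import numpy as np

def label_turns(turns: list) -> list:
    """turns: list of (N_i, D) frame arrays of one speech part; returns a cluster index per turn."""
    clusters, labels = [], []
    for turn in turns:
        if not clusters:
            clusters.append([turn]); labels.append(0); continue
        # 'closest' cluster = the one with the smallest ΔBIC against this turn
        scores = [delta_bic(np.vstack(c), turn) for c in clusters]
        best = int(np.argmin(scores))
        if scores[best] > 0:      # even the closest cluster is too different: new speaker
            clusters.append([turn]); labels.append(len(clusters) - 1)
        else:                     # same speaker: merge into the closest cluster
            clusters[best].append(turn); labels.append(best)
    return labels
```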

Page 16: Audio Indexing as a first step in an Audio Information Retrieval System


3. Speaker labeling

• Evaluation methodology (a purity sketch follows below)
  – step 1: assign an official speaker label to each cluster
  – step 2: cluster purity = % of frames with the correct label
  – step 3: ideal cluster purity = purity of an ideal clustering
    • per speaker: 1 cluster with the label of that speaker
    • per frame in a turn: select the label of the dominant speaker in the turn

[Figure: an official label sequence (e.g. A B A B A) compared with the computed sequence (e.g. A B A A B); the mismatching regions are the error zones]
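A small sketch of the purity computation, assuming per-frame official speaker labels and computed cluster labels are available:

```python
import numpy as np
from collections import Counter

def cluster_purity(official: np.ndarray, computed: np.ndarray) -> float:
    """official / computed: per-frame speaker and cluster labels of equal length."""
    correct = 0
    for cluster in np.unique(computed):
        frames = official[computed == cluster]
        # the cluster inherits the label of its dominant official speaker (step 1)
        correct += Counter(frames.tolist()).most_common(1)[0][1]
    return correct / len(official)   # fraction of frames with the correct label (step 2)
```

The ideal cluster purity follows by applying the same function to the ideal clustering described above.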

Page 17: Audio Indexing as a first step in an Audio Information Retrieval System


3. Speaker labeling

• Evaluation results (7 data sets)
  – training and parameter setting on American BN

[Bar chart: obtained cluster purity ("clus") versus ideal cluster purity ("ideal clus") per data set (Am, BE, SL, SK, PT, CZ, GA)]

– still room for improvement (the number of clusters is also larger than the ideal number)

Page 18: Audio Indexing as a first step in an Audio Information Retrieval System


Demonstration

Page 19: Audio Indexing as a first step in an Audio Information Retrieval System


4. Speech mode labeling

• Objective
  – spontaneous versus prepared speech
  – how: presence of disfluencies (prior to recognition)

• Disfluencies
  – filled pauses (uh's, abnormally lengthened sounds)
  – repetitions of words or word groups
  – abbreviations of words

• At present
  – no speech mode labeling results yet
  – therefore ….


Page 21: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature identification
  – CGN (Spoken Dutch Corpus): conversational speech
  – bootstrap data set (11 h)
    • 3255 annotated uh's
    • manual word alignments available (location of the uh's)

• Approach (a segmentation sketch follows below)
  – perform a segmentation into phoneme-sized parts on the basis of a cepstral difference measure
  – identify features revealing the FP/NFP (filled pause / non-filled pause) nature of these parts
  – supply these features to a statistical classifier
  – keep everything stream-based (to fit with the rest)
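A possible sketch of the cepstral-difference segmentation, with simple peak picking and an arbitrary threshold; the actual system's difference measure and settings may differ.

```python
import numpy as np

def segment_boundaries(mfcc: np.ndarray, threshold: float = 1.0) -> np.ndarray:
    """mfcc: (T, 12) cepstral frames; returns indices of phoneme-sized segment boundaries."""
    diff = np.linalg.norm(np.diff(mfcc, axis=0), axis=1)     # frame-to-frame cepstral difference
    # retain local maxima of the difference curve that exceed the threshold
    peaks = (diff[1:-1] > diff[:-2]) & (diff[1:-1] > diff[2:]) & (diff[1:-1] > threshold)
    return np.where(peaks)[0] + 1
```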

Page 22: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature detection on bootstrap data

– segment duration

[Plot: probability (prob) versus segment duration in frames (# frames, 0–100), separate curves for NFP and FP segments]

Page 23: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature detection on bootstrap data

– segment duration
– spectral stability

[Plot: probability (prob) versus the spectral stability measure (dstab, 0–25), separate curves for NFP and FP segments]

Page 24: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature detection on bootstrap data

– segment duration
– spectral stability
– stable interval durations

[Plot: expected stable interval duration E[Sid] versus T (both 0–20), separate curves for FP and NFP segments]

Page 25: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature detection on bootstrap data

– segment duration
– spectral stability
– stable interval durations
– silence present

                   sil before   no sil before   total
    sil after          1189            773       1962
    no sil after        700            593       1293
    total              1889           1366       3255

Page 26: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature detection on bootstrap data

– segment duration
– spectral stability
– stable interval durations
– silence present
– center of gravity

[Plot: probability (prob) versus spectral centre of gravity (0–24), separate curves for NFP and FP segments]

Page 27: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature detection on bootstrap data

– segment duration
– spectral stability
– stable interval durations
– silence present
– center of gravity
– output of a simple spectral FP model (GMM on 12 MFCCs)

Result: 12 useful features identified in total

Page 28: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Statistical classifier
  – MLP to estimate P(FP|x) (x = 12 features + 12 MFCCs)
  – problem: very low P(FP) (order of 1 %)
  – therefore: design a filter to eliminate the most certain NFP segments first

• GMM-based filter (a posterior sketch follows below)
  – two GMMs, P(x|FP) and P(x|NFP) (x = 12 features)
  – prior probability P(FP) = 0.01 → posterior P(FP|x)
  – retain a segment if P(FP|x) > threshold
  – result: 90 % of the NFP and < 10 % of the FP segments removed, and P(FP) raised from 1 to 10 %
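The filter's decision can be sketched as follows; gmm_fp and gmm_nfp stand for the two GMMs mentioned above, and the threshold value is an assumption.

```python
import numpy as np

P_FP = 0.01   # prior probability of a filled pause

def fp_posterior(gmm_fp, gmm_nfp, x: np.ndarray) -> np.ndarray:
    """x: (N, 12) feature vectors; returns P(FP|x) per segment via Bayes' rule."""
    log_fp = gmm_fp.score_samples(x) + np.log(P_FP)
    log_nfp = gmm_nfp.score_samples(x) + np.log(1.0 - P_FP)
    m = np.maximum(log_fp, log_nfp)                      # avoid overflow in exp
    return np.exp(log_fp - m) / (np.exp(log_fp - m) + np.exp(log_nfp - m))

def passes_filter(gmm_fp, gmm_nfp, x: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Boolean mask of segments that are kept and passed on to the MLP classifier."""
    return fp_posterior(gmm_fp, gmm_nfp, x) > threshold
```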

Page 29: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Evaluation on an independent test set
  – size: 47 min containing 415 FPs
  – available information
    • all uh's (including word-internal ones) were annotated
    • all abnormal sound lengthenings were annotated
    • all corresponding time intervals were manually checked

Page 30: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Evaluation on the test data
  – recall-precision (ROC) curves
  – embedded training (15 h of unlabeled data) does not help

[Plot: precision (%) versus recall (%) curves for supervised training, embedded training, a GMM (12 feat. + 12 MFCC) and an MLP (12 feat. + 12 MFCC)]

  – our method: R = 75 %, P = 85 %
  – Gabrea method: R = 60 %, P = 65 %

Page 31: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection and ASR

• Baseline system
  – 40K lexicon + uh (FP), trigram LM
  – WER = 51.3 % (spontaneous dialogues from CGN, uh's excluded)

• Cheating experiment
  – remove the manually labeled FP segments from the input
  – equivalent to: recognize the FPs and ignore them in the LM context
  – equivalent to: remove the correct FPs from the input stream
  – WER = 47.6 % (7.5 % relative gain, 1.25 word corrections per FP)

• First real experiment
  – remove the detected FP segments from the input
  – WER = 49.4 % (3.7 % relative gain, 0.62 word corrections per FP)

Page 32: Audio Indexing as a first step in an Audio Information Retrieval System


Conclusions

• There exist good audio indexing techniques
  – speech / non-speech segmentation
  – speaker turn segmentation
  – speaker identity labeling
  – filled-pause detection

• These techniques can be used
  – to extract extra-linguistic information for AIR
  – to guide the speech transcription module