
Page 1: Audio Indexing as a first step in an Audio Information Retrieval System

ELIS-DSSP, Sint-Pietersnieuwstraat 41, B-9000 Gent

CAIR Twente (10/10/2003)

Audio Indexing as a first step in an Audio Information Retrieval System

Jean-Pierre Martens

An Vandecatseye

Frederik Stouten

ELIS-DSSP, Sint-Pietersnieuwstraat 41, B-9000 Gent

Page 2: Audio Indexing as a first step in an Audio Information Retrieval System


Information retrieval from audio

General scheme

audio signal → audio indexing → [time stamps, audio labels] → speech transcription → [time stamps, audio labels, text (summary), topic labels] → information querying → info

This talk covers the audio indexing step; the talks of Steve & Roeland cover speech transcription and information querying. (A mock-up of this data flow in code follows below.)
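As a rough illustration only, the sketch below mocks up the three stages and the information passed between them in Python. The type and function names (AudioSegment, index_audio, transcribe, query) are purely illustrative and not taken from the ATRANOS software.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AudioSegment:
    start: float        # time stamp (s)
    end: float          # time stamp (s)
    audio_label: str    # e.g. "speech", "music", speaker label

@dataclass
class TranscribedSegment(AudioSegment):
    text: str           # transcription / summary
    topic_label: str    # topic assigned after transcription

def index_audio(signal) -> List[AudioSegment]:
    """Audio indexing: segments with time stamps and audio labels (this talk)."""
    ...

def transcribe(segments: List[AudioSegment]) -> List[TranscribedSegment]:
    """Speech transcription of the speech segments (talks of Steve & Roeland)."""
    ...

def query(segments: List[TranscribedSegment], request: str) -> List[TranscribedSegment]:
    """Information querying over the indexed and transcribed audio."""
    ...
```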

Page 3: Audio Indexing as a first step in an Audio Information Retrieval System


Why audio indexing?

• Extract extra-linguistic information: commercial, intro, football report, etc.

• Save time: let the speech recognizer only process parts that are expected to contain speech

• Raise speech transcription accuracy: allow the speech recognizer to select the right models at the right time

Page 4: Audio Indexing as a first step in an Audio Information Retrieval System


Audio indexing in ATRANOS project

• Project name
• Main project objectives
  – Automatic segmentation/labeling of audio files
  – Automatic transcription of the speech parts
  – Conversion (normalization) of transcriptions for an application (captioning = test vehicle in this project)
• Partners: ESAT+CCL/KULeuven, ELIS/UGent, CNTS/UIA
• Status: entering its final year


Page 6: Audio Indexing as a first step in an Audio Information Retrieval System


Audio indexing in ATRANOS project

• Mark parts which need no transcription
  – speech / non-speech segmentation
• Detect important change points in speech
  – change of speaker or acoustics (bandwidth, background)
  – segment between change points = speaker turn
• Assign a speaker label to each turn
  – all frames of one speaker get the same label
• Assign a speech mode to each turn
  – prepared versus spontaneous speech

Page 7: Audio Indexing as a first step in an Audio Information Retrieval System


Audio indexing in ATRANOS

• Additional design goals
  – aim for continuous input processing (stream-based)
  – restrict computational load (real-time on a PC)
  – restrict the maximum delay (memory)
  – aim for language independence

• Evaluation data
  – American Broadcast News database (LDC)
  – Pan-European Broadcast News database (COST278)
  – Spoken Dutch Corpus (CGN)

Page 8: Audio Indexing as a first step in an Audio Information Retrieval System


1. Speech / non-speech segmentation

• Approach (a minimal sketch follows below)
  – construct statistical models (GMMs) for typical situations
  – let these models score individual audio frames
  – group the frames on the basis of these scores

• Which models to build?
  – one clean speech model
  – some common background models (e.g. music)
  – corresponding speech + common background models
  – a garbage model for all the rest
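The following Python fragment is a minimal sketch of this approach using scikit-learn, assuming MFCC-like feature frames are already available. The class list, the number of mixture components and the diagonal covariances are assumptions, not the settings of the actual system.

```python
# Sketch of the GMM scoring step: one GaussianMixture per acoustic class,
# each scoring individual feature frames.
import numpy as np
from sklearn.mixture import GaussianMixture

CLASSES = ["clean_speech", "music", "speech+music", "garbage"]  # illustrative class set

def train_models(train_frames: dict, n_components: int = 16) -> dict:
    """train_frames maps a class name to an (N, D) array of feature frames."""
    models = {}
    for name in CLASSES:
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(train_frames[name])
        models[name] = gmm
    return models

def frame_scores(models: dict, frames: np.ndarray) -> np.ndarray:
    """Per-frame log-likelihood of every class model, shape (N, n_classes)."""
    return np.column_stack([models[name].score_samples(frames) for name in CLASSES])
```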

Page 9: Audio Indexing as a first step in an Audio Information Retrieval System


1. Speech / non-speech segmentation

• How to group frames? (see the decoding sketch below)
  – put the models (colored states) in a loop model with a transition penalty
  – compute the best state sequence (on-line Viterbi algorithm with forced decisions)
  – perform some post-processing on the output sequence

[Figure: loop model with a begin state B, an end state E, one state per GMM (1–4) and a transition penalty Pt on switching between them]
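A plain, off-line version of the decoding step could look as follows; the slides use an on-line Viterbi variant with forced decisions, so this sketch only illustrates the loop model with a fixed switch penalty on the per-frame scores from the previous snippet.

```python
import numpy as np

def viterbi_grouping(loglik: np.ndarray, switch_penalty: float = 100.0) -> np.ndarray:
    """loglik: (T, K) per-frame log-likelihoods of K models; returns a model index per frame."""
    T, K = loglik.shape
    # staying in the same model is free, switching to another one costs the penalty
    trans = np.full((K, K), -switch_penalty)
    np.fill_diagonal(trans, 0.0)
    delta = loglik[0].copy()              # best score of a path ending in model k
    back = np.zeros((T, K), dtype=int)    # best predecessor per frame and model
    for t in range(1, T):
        scores = delta[:, None] + trans   # scores[i, j]: come from model i, move to model j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    path = np.empty(T, dtype=int)         # backtrack the best state sequence
    path[-1] = int(delta.argmax())
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```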

Page 10: Audio Indexing as a first step in an Audio Information Retrieval System


1. Speech / non-speech segmentation

• Evaluation results (7 data sets)
  – training and parameter setting on American BN

[Bar chart: percentage of speech (S found) and non-speech (NS found) correctly detected per data set (Am, SL, SK, PT, CZ, GA, BE); an annotation singles out the football reports]

– performance degrades for unseen situations

Page 11: Audio Indexing as a first step in an Audio Information Retrieval System


2. Speaker segmentation

• Objective: detect changes in speaker/acoustics

• Approach
  – identify change points by comparing properties of the observations in two intervals at both sides of this point
  – advantage: self-organizing (no speaker models)

Page 12: Audio Indexing as a first step in an Audio Information Retrieval System


2. Speaker segmentation

• Step 1: potential change position detection (an LLR sketch follows below)
  – select positions on a grid (to limit CPU time)
  – determine a fixed-length left/right context

[Figure: candidate position (n) with a left and a right context built from blocks of 10 frames; "both" denotes the union of the two contexts]

  – build 3 models for the data: M(both), M(left), M(right)
  – retain significant maxima in LLR(n; two models vs one)
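A minimal sketch of the LLR computation, modelling each interval with a single full-covariance Gaussian; the window sizes, the grid step and the significance test on the maxima are left out.

```python
import numpy as np

def gauss_loglik(x: np.ndarray) -> float:
    """Log-likelihood of the frames x (N, D) under one ML-fitted full-covariance Gaussian."""
    n, d = x.shape
    cov = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(d)  # small ridge for stability
    _, logdet = np.linalg.slogdet(cov)
    # for an ML-fitted Gaussian the summed Mahalanobis term equals n*d
    return -0.5 * n * (d * np.log(2.0 * np.pi) + logdet + d)

def llr_two_vs_one(left: np.ndarray, right: np.ndarray) -> float:
    """LLR(n): gain of modelling the left and right contexts separately versus together."""
    both = np.vstack([left, right])
    return gauss_loglik(left) + gauss_loglik(right) - gauss_loglik(both)
```

Significant local maxima of llr_two_vs_one over the candidate grid are then retained as potential change positions.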

Page 13: Audio Indexing as a first step in an Audio Information Retrieval System


2. Speaker segmentation

• Step 2: boundary elimination (a ΔBIC sketch follows below)
  – pool all boundaries in a speech part: up to Tmax or until the end of the speech part (EO-S)
  – evaluate a variable-length context of n using BIC:
    ΔBIC(n) = LL(M2) - LL(M1) - λ [#par(M2) - #par(M1)] log N(n)
  – select the n with minimal ΔBIC(n)
    • if ΔBIC(n) < 0: eliminate n and reiterate
    • if ΔBIC(n) ≥ 0: move to the next speech part

[Figure: a speech part bounded by non-speech (NS) on both sides; a window of at most Tmax with a variable-length left and right context around a candidate boundary n]
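The ΔBIC test itself can be sketched as follows, reusing gauss_loglik from the previous snippet; λ is a tuning constant and the variable-length context handling is omitted.

```python
import numpy as np

def n_params(d: int) -> int:
    """Free parameters of one full-covariance Gaussian in d dimensions."""
    return d + d * (d + 1) // 2

def delta_bic(left: np.ndarray, right: np.ndarray, lam: float = 1.0) -> float:
    """ΔBIC(n) = LL(M2) - LL(M1) - λ [#par(M2) - #par(M1)] log N(n)."""
    both = np.vstack([left, right])
    n, d = both.shape
    ll_m2 = gauss_loglik(left) + gauss_loglik(right)   # two-speaker hypothesis M2
    ll_m1 = gauss_loglik(both)                         # one-speaker hypothesis M1
    # M2 has twice the parameters of M1, so the difference in #par is n_params(d)
    return ll_m2 - ll_m1 - lam * n_params(d) * np.log(n)
```

A negative ΔBIC means one model explains the data as well as two separate ones, so the candidate boundary n is eliminated.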

Page 14: Audio Indexing as a first step in an Audio Information Retrieval System


2. Speaker segmentation

• Evaluation (7 data sets)
  – recall: how many real changes were detected?
  – precision: how many detected changes are real?

[Scatter plot "Segmentation in speaker turns": precision (%) versus recall (%), both from 50 to 100, one point per data set (BE, SL, SK, PT, CZ, GA, Am); an annotation groups 5 out of 7 data sets]

Page 15: Audio Indexing as a first step in an Audio Information Retrieval System


3. Speaker labeling

• Objective: assign the same label to all turns of the same speaker

• Approach
  – on-line clustering, fully integrated in the segmentation
  – BIC as decision criterion

• Clustering strategy (a rough sketch follows below)
  – for all turns in a speech part: compute ΔBIC between the turn and the 'closest' cluster center
  – select the turn with maximal ΔBIC:
    • if ΔBIC > 0: take the turn as a new cluster
    • else: take the turn with the smallest ΔBIC and add it to its closest cluster
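The decision rule can be sketched as below, reusing delta_bic from the ΔBIC snippet. Note that this simplified version decides turn by turn, whereas the slides rank all turns of a speech part before deciding.

```python
import numpy as np

def label_turns(turns: list) -> list:
    """turns: list of (N_i, D) frame arrays of one speech part; returns a cluster index per turn."""
    clusters, labels = [], []
    for turn in turns:
        if not clusters:
            clusters.append([turn]); labels.append(0); continue
        # 'closest' cluster = the one with the smallest ΔBIC against this turn
        scores = [delta_bic(np.vstack(c), turn) for c in clusters]
        best = int(np.argmin(scores))
        if scores[best] > 0:      # even the closest cluster is too different: new speaker
            clusters.append([turn]); labels.append(len(clusters) - 1)
        else:                     # same speaker: merge into the closest cluster
            clusters[best].append(turn); labels.append(best)
    return labels
```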

Page 16: Audio Indexing as a first step in an Audio Information Retrieval System


3. Speaker labeling

• Evaluation methodology (a purity sketch follows below)
  – step 1: assign an official speaker label to each cluster
  – step 2: cluster purity = % of frames with the correct label
  – step 3: ideal cluster purity = purity of an ideal clustering
    • per speaker: 1 cluster with the label of that speaker
    • per frame in a turn: select the label of the dominant speaker in the turn

[Figure: an official label sequence (e.g. A B A B A) compared with the computed sequence (e.g. A B A A B); the mismatching regions are the error zones]
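A small sketch of the purity computation, assuming per-frame official speaker labels and computed cluster labels are available:

```python
import numpy as np
from collections import Counter

def cluster_purity(official: np.ndarray, computed: np.ndarray) -> float:
    """official / computed: per-frame speaker and cluster labels of equal length."""
    correct = 0
    for cluster in np.unique(computed):
        frames = official[computed == cluster]
        # the cluster inherits the label of its dominant official speaker (step 1)
        correct += Counter(frames.tolist()).most_common(1)[0][1]
    return correct / len(official)   # fraction of frames with the correct label (step 2)
```

The ideal cluster purity follows by applying the same function to the ideal clustering described above.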

Page 17: Audio Indexing as a first step in an Audio Information Retrieval System


3. Speaker labeling

• Evaluation results (7 data sets)
  – training and parameter setting on American BN

[Bar chart: obtained cluster purity ("clus") versus ideal cluster purity ("ideal clus") per data set (Am, BE, SL, SK, PT, CZ, GA)]

– still room for improvement (the number of clusters is also larger than the ideal number)

Page 18: Audio Indexing as a first step in an Audio Information Retrieval System


Demonstration

Page 19: Audio Indexing as a first step in an Audio Information Retrieval System


4. Speech mode labeling

• Objective
  – spontaneous versus prepared speech
  – how: presence of disfluencies (prior to recognition)

• Disfluencies
  – filled pauses (uh's, abnormally lengthened sounds)
  – repetitions of words or word groups
  – abbreviations of words

• At present
  – no speech mode labeling results yet
  – therefore ….


Page 21: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature identification
  – CGN (Spoken Dutch Corpus): conversational speech
  – bootstrap data set (11 h)
    • 3255 annotated uh's
    • manual word alignments available (location of the uh's)

• Approach (a segmentation sketch follows below)
  – perform a segmentation into phoneme-sized parts on the basis of a cepstral difference measure
  – identify features revealing the FP/NFP (filled pause / non-filled pause) nature of these parts
  – supply these features to a statistical classifier
  – keep everything stream-based (to fit with the rest)
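A possible sketch of the cepstral-difference segmentation, with simple peak picking and an arbitrary threshold; the actual system's difference measure and settings may differ.

```python
import numpy as np

def segment_boundaries(mfcc: np.ndarray, threshold: float = 1.0) -> np.ndarray:
    """mfcc: (T, 12) cepstral frames; returns indices of phoneme-sized segment boundaries."""
    diff = np.linalg.norm(np.diff(mfcc, axis=0), axis=1)     # frame-to-frame cepstral difference
    # retain local maxima of the difference curve that exceed the threshold
    peaks = (diff[1:-1] > diff[:-2]) & (diff[1:-1] > diff[2:]) & (diff[1:-1] > threshold)
    return np.where(peaks)[0] + 1
```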

Page 22: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature detection on bootstrap data

– segment duration

[Plot: probability (prob) versus segment duration in frames (# frames, 0–100), separate curves for NFP and FP segments]

Page 23: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature detection on bootstrap data

– segment duration
– spectral stability

[Plot: probability (prob) versus the spectral stability measure (dstab, 0–25), separate curves for NFP and FP segments]

Page 24: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature detection on bootstrap data

– segment duration
– spectral stability
– stable interval durations

[Plot: expected stable interval duration E[Sid] versus T (both 0–20), separate curves for FP and NFP segments]

Page 25: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature detection on bootstrap data

– segment duration
– spectral stability
– stable interval durations
– silence present

                   sil before   no sil before   total
    sil after          1189            773       1962
    no sil after        700            593       1293
    total              1889           1366       3255

Page 26: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature detection on bootstrap data

– segment duration
– spectral stability
– stable interval durations
– silence present
– center of gravity

[Plot: probability (prob) versus spectral centre of gravity (0–24), separate curves for NFP and FP segments]

Page 27: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Feature detection on bootstrap data

– segment duration
– spectral stability
– stable interval durations
– silence present
– center of gravity
– output of a simple spectral FP model (GMM on 12 MFCCs)

Result: 12 useful features identified in total

Page 28: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Statistical classifier
  – MLP to estimate P(FP|x) (x = 12 features + 12 MFCCs)
  – problem: very low P(FP) (order of 1 %)
  – therefore: design a filter to eliminate the most certain NFP segments first

• GMM-based filter (a posterior sketch follows below)
  – two GMMs, P(x|FP) and P(x|NFP) (x = 12 features)
  – prior probability P(FP) = 0.01 → posterior P(FP|x)
  – retain a segment if P(FP|x) > threshold
  – result: 90 % of the NFP and < 10 % of the FP segments removed, and P(FP) raised from 1 to 10 %
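The filter's decision can be sketched as follows; gmm_fp and gmm_nfp stand for the two GMMs mentioned above, and the threshold value is an assumption.

```python
import numpy as np

P_FP = 0.01   # prior probability of a filled pause

def fp_posterior(gmm_fp, gmm_nfp, x: np.ndarray) -> np.ndarray:
    """x: (N, 12) feature vectors; returns P(FP|x) per segment via Bayes' rule."""
    log_fp = gmm_fp.score_samples(x) + np.log(P_FP)
    log_nfp = gmm_nfp.score_samples(x) + np.log(1.0 - P_FP)
    m = np.maximum(log_fp, log_nfp)                      # avoid overflow in exp
    return np.exp(log_fp - m) / (np.exp(log_fp - m) + np.exp(log_nfp - m))

def passes_filter(gmm_fp, gmm_nfp, x: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Boolean mask of segments that are kept and passed on to the MLP classifier."""
    return fp_posterior(gmm_fp, gmm_nfp, x) > threshold
```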

Page 29: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Evaluation on an independent test set
  – size: 47 min containing 415 FPs
  – available information
    • all uh's (including word-internal ones) were annotated
    • all abnormal sound lengthenings were annotated
    • all corresponding time intervals were manually checked

Page 30: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection

• Evaluation on the test data
  – recall-precision (ROC) curves
  – embedded training (15 h of unlabeled data) does not help

[Plot: precision (%) versus recall (%) curves for supervised training, embedded training, a GMM (12 feat. + 12 MFCC) and an MLP (12 feat. + 12 MFCC)]

  – our method: R = 75 %, P = 85 %
  – Gabrea method: R = 60 %, P = 65 %

Page 31: Audio Indexing as a first step in an Audio Information Retrieval System


4. Disfluency detection and ASR

• Baseline system
  – 40K lexicon + uh (FP), trigram LM
  – WER = 51.3 % (spontaneous dialogues from CGN, uh's excluded)

• Cheating experiment
  – remove the manually labeled FP segments from the input
  – equivalent to: recognize the FPs and ignore them in the LM context
  – equivalent to: remove the correct FPs from the input stream
  – WER = 47.6 % (7.5 % relative gain, 1.25 word corrections per FP)

• First real experiment
  – remove the detected FP segments from the input
  – WER = 49.4 % (3.7 % relative gain, 0.62 word corrections per FP)

Page 32: Audio Indexing as a first step in an Audio Information Retrieval System


Conclusions

• There exist good audio indexing techniques
  – speech / non-speech segmentation
  – speaker turn segmentation
  – speaker identity labeling
  – filled-pause detection

• These techniques can be used
  – to extract extra-linguistic information for AIR
  – to guide the speech transcription module