cuhk system 14oct


CUHK System for QUESST Task of MediaEval 2014

Presenter: Cheung-Chi Leung

on behalf of

Haipeng Wang and Tan Lee

Department of Electronic Engineering
The Chinese University of Hong Kong

MediaEval Workshop, 16-17 October 2014


Outline

1 Overview

2 System Description

3 System Performance

4 Summary


Overview

This is the third year we have participated in this exciting evaluation.
- In 2012, we used multiple tokenizers with DTW matrix combination [Wang et al., 2013b].
- In 2013, we used a single ASM tokenizer built with Gaussian component clustering [Wang et al., 2013a].

Overview of our 2014 system:
- Focuses on type I query matching.
- Follows the posteriorgram-based DTW detection framework [Hazen et al., 2009].
- Uses DTW matrix combination to fuse multiple tokenizers.
- One new thing: a new ASM tokenizer construction approach.


System Framework

[Figure: system framework. The query example and the test utterance are passed through tokenizers 1 to N, giving query posteriorgrams and test posteriorgrams for each tokenizer. Each tokenizer n yields a DTW distance matrix Dn; D1, ..., DN are combined into a single DTW distance matrix D. Detection by DTW on D gives a raw detection score, followed by score normalization.]

6 tokenizers are used:
- 1 GMM tokenizer (1024 mixtures)
- 5 phoneme recognizers: three BUT phoneme recognizers (Czech, Hungarian and Russian), one Mandarin phoneme recognizer and one English phoneme recognizer
- 1 ASM tokenizer (introduced later)


System Framework


DTW local distance: log inner product of posteriorgrams

DTW matrix combination: linear fusion with equal weights

DTW search: DTW distance computation with a sliding window

Score normalization: mean and variance normalization per query


ASM Tokenizer Construction

Acoustic segment modeling (ASM) [Lee et al., 1988] is a way to build acoustic models from unlabeled speech data.

ASM involves three steps: initial segmentation, segment labeling, and iterative modeling.

[Figure: ASM flowchart. Raw acoustic observations -> initial segmentation -> segment labeling -> iterative modeling (supervised model training -> HMM decoding -> converge? if N, repeat; if Y, stop) -> acoustic models and token sequences.]



- Initial segmentation: break utterances into short-time segments.
- Segment labeling: perform clustering on segments.
- Iterative modeling: iterative HMM training and decoding.

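[Figure: one example utterance shown at four stages: raw acoustic observations, after initial segmentation, after segment labeling, and after iterative modeling. Label sequences shown in the figure: A A C B C C and A A C B C B.]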


Segment Labeling

Segment labeling provides initializations to iterative modeling.

Our work on segment labeling:
- GMM labeling [Wang et al., 2012];
- Gaussian component clustering [Wang et al., 2014a];
- multiview spectral clustering [Wang et al., 2014b].

Key factors: segment representation and clustering algorithms.

Our segment representation:

[Figure: speech frames are passed through a tokenizer to give posteriorgrams, and through initial segmentation to give segment boundaries; the two are combined into a class-by-segment matrix with M classes and N segments.]

Different tokenizers lead to different representations, e.g., Gaussian-by-segment, phoneme-by-segment, state-by-segment.
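The slide does not say how frame posteriors are pooled inside a segment, so the following is only a minimal sketch that assumes simple averaging. The function name `class_by_segment` and the boundary format (start frames plus the final end frame) are hypothetical choices made for illustration.

```python
import numpy as np

def class_by_segment(posteriorgram, boundaries):
    """Build an (M classes x N segments) matrix.

    posteriorgram: array of shape (num_frames, M), one posterior vector per frame.
    boundaries: frame indices [b0, b1, ..., bN]; segment n covers frames b_n .. b_{n+1}-1.
    Pooling by the mean is an assumption, not something stated on the slide.
    """
    columns = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        columns.append(posteriorgram[start:end].mean(axis=0))
    return np.stack(columns, axis=1)

# Toy posteriorgram: 5 frames, 2 classes, split into 2 segments.
post = np.array([[0.9, 0.1],
                 [0.7, 0.3],
                 [0.2, 0.8],
                 [0.4, 0.6],
                 [0.0, 1.0]])
X = class_by_segment(post, [0, 2, 5])
```

With a GMM tokenizer the rows of `X` would be Gaussian components, with a phoneme recognizer they would be phonemes, which is how the different representations named above arise.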


Segment Labeling

Segment labeling with spectral clustering

[Figure: speech frames are passed through a tokenizer to give posteriorgrams, and through initial segmentation to give segment boundaries, forming the class-by-segment matrix (M classes, N segments). From this data matrix, compute the similarity matrix W, solve for Y, and perform k-means on the M row vectors of Y.]

- Standard normalized cut
- Similarity between segments: inner product
- Clustering on the normalized graph Laplacian

Only one class-by-segment matrix is used.
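As a rough illustration of the steps named on this slide, the sketch below builds the inner-product similarity, forms the normalized graph Laplacian, and runs k-means on the spectral embedding. Several details are assumptions rather than the authors' choices: the row normalization of Y, the small deterministic k-means, and the function names. It also forms W explicitly, which is fine for a toy example only.

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Plain Lloyd k-means with a deterministic farthest-point start (assumed)."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def spectral_segment_labels(X, k):
    """X: class-by-segment matrix (classes x segments). Returns one label per segment."""
    W = X.T @ X                                   # inner-product similarity between segments
    d = W.sum(axis=1)                             # node degrees
    S = W / np.sqrt(np.outer(d, d))               # D^{-1/2} W D^{-1/2}
    L = np.eye(len(d)) - S                        # normalized graph Laplacian
    _, vecs = np.linalg.eigh(L)                   # eigenvalues come back in ascending order
    Y = vecs[:, :k]                               # eigenvectors of the k smallest eigenvalues
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)  # row normalization (assumed)
    return kmeans(Y, k)

# Toy data: 2 classes, 6 segments; segments 0-2 and 3-5 form two groups.
X = np.array([[0.90, 0.80, 0.95, 0.10, 0.05, 0.20],
              [0.10, 0.20, 0.05, 0.90, 0.95, 0.80]])
labels = spectral_segment_labels(X, k=2)
```

The resulting cluster labels play the role of initial segment transcriptions.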



Segment Labeling

Segment labeling with multiview spectral clustering (MSC)

[Figure: MSC pipeline. A matrix is constructed for each class-by-segment view, a constrained problem ("s.t.") is solved for Y, and k-means is performed on the M row vectors of Y.]

- Multiple class-by-segment matrices are used together.
- Linear fusion on the Laplacian matrices [Xia et al., 2010].
- Promising gains on SWS2012 data [Wang et al., 2014b].
- An efficient implementation is designed to avoid the explicit computation of similarity matrices and Laplacian matrices.
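The idea of fusing Laplacians linearly can be sketched as below. This is not the authors' implementation: the view weights are fixed and equal purely as an assumption (the slide does not say how they are set), the Laplacians are formed explicitly, which the efficient implementation mentioned above avoids, and the final k-means step on the rows of Y is left out. All function names are hypothetical.

```python
import numpy as np

def normalized_laplacian(X):
    """Normalized graph Laplacian from a class-by-segment matrix (classes x segments)."""
    W = X.T @ X                                   # inner-product similarity between segments
    d = W.sum(axis=1)
    return np.eye(len(d)) - W / np.sqrt(np.outer(d, d))

def multiview_embedding(views, k, weights=None):
    """Fuse one Laplacian per view linearly and return the k-dimensional embedding Y."""
    if weights is None:
        weights = np.full(len(views), 1.0 / len(views))  # equal weights (assumed)
    L = sum(w * normalized_laplacian(X) for w, X in zip(weights, views))
    _, vecs = np.linalg.eigh(L)                   # ascending eigenvalues
    return vecs[:, :k]

# Two toy views of the same 6 segments (2 classes and 3 classes).
view1 = np.array([[0.90, 0.80, 0.95, 0.10, 0.05, 0.20],
                  [0.10, 0.20, 0.05, 0.90, 0.95, 0.80]])
view2 = np.array([[0.7, 0.6, 0.8, 0.1, 0.1, 0.1],
                  [0.2, 0.3, 0.1, 0.2, 0.1, 0.3],
                  [0.1, 0.1, 0.1, 0.7, 0.8, 0.6]])
Y = multiview_embedding([view1, view2], k=2)
```

In this toy case the second column of `Y` already separates segments 0-2 from segments 3-5 by sign, so k-means on the rows of `Y` would recover the two groups.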



Segment cluster labels are used as initial transcriptions.

An iterative training procedure follows [Lee et al., 1988]:
- Use the current transcriptions to train acoustic models.
- Use the current acoustic models to decode the speech utterances to get new transcriptions.
- Repeat the above two steps until convergence.

ASM posteriorgrams are formed by the monophone state posterior probabilities.


DTW Detection

A sliding window is used to determine a region in which DTW detection is performed.

DTW distance:

$$D = -\sum_{n} \log\left(Q_n^{\top} T_n\right), \tag{1}$$

where $Q_n$ and $T_n$ are the query posteriorgram and test posteriorgram generated by the $n$th tokenizer.

For the $t$th region $D_t$, the alignment distance is

$$d_t = \min_{L,\, i(l),\, j(l)} \frac{1}{L} \sum_{l=1}^{L} D_t\big(i(l), j(l)\big), \tag{2}$$

where $i(l)$ and $j(l)$ are the coordinates of the $l$th step of the alignment path, and $L$ is the length of the alignment path.
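Equations (1) and (2) can be sketched in a few lines. The slides do not give the step pattern, the path endpoints, or the window length and shift, so the choices below are assumptions: steps of (1,0), (0,1) and (1,1), a path running from one corner of the region to the opposite corner, a window as long as the query, and a small `eps` to avoid log(0). The length-normalized minimum is computed exactly by tracking the path length in the dynamic program, which is slow but easy to check.

```python
import numpy as np

def fused_distance_matrix(query_posts, test_posts, eps=1e-10):
    """Eq. (1): D = -sum_n log(Q_n^T T_n); each posteriorgram is (classes, frames)."""
    D = 0.0
    for Q, T in zip(query_posts, test_posts):
        D = D - np.log(Q.T @ T + eps)             # eps is an assumed numerical guard
    return D                                      # shape: (query_frames, test_frames)

def normalized_dtw(D):
    """Eq. (2): minimum over alignment paths of the average local distance."""
    I, J = D.shape
    max_len = I + J - 1
    # best[l, i, j]: smallest summed cost of a path with l+1 steps ending at (i, j)
    best = np.full((max_len, I, J), np.inf)
    best[0, 0, 0] = D[0, 0]
    for l in range(1, max_len):
        for i in range(I):
            for j in range(J):
                prev = min(
                    best[l - 1, i - 1, j] if i > 0 else np.inf,
                    best[l - 1, i, j - 1] if j > 0 else np.inf,
                    best[l - 1, i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                )
                best[l, i, j] = prev + D[i, j]
    lengths = np.arange(1, max_len + 1)
    return float(np.min(best[:, -1, -1] / lengths))

def sliding_dtw(query_posts, test_posts, shift=1):
    """Alignment distance d_t for each window position t (window = query length, assumed)."""
    D = fused_distance_matrix(query_posts, test_posts)
    I, J = D.shape
    return [normalized_dtw(D[:, t:t + I]) for t in range(0, J - I + 1, shift)]

# One tokenizer, 2 classes: a 2-frame query and a 4-frame test utterance.
Q = np.array([[0.9, 0.1],
              [0.1, 0.9]])
T = np.array([[0.5, 0.9, 0.1, 0.5],
              [0.5, 0.1, 0.9, 0.5]])
dists = sliding_dtw([Q], [T])
```

In this toy example the smallest distance falls at window position 1, where the test frames match the query.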


Score Processing

Score transformation. An exponential function is used to transform the DTW distance $d_t$ to a raw detection score,

$$s_t = \exp(-d_t/\beta), \tag{3}$$

where $\beta$ is set to 0.6 in our system.

Score normalization. A simple mean and variance normalization,

$$\hat{s}_t = (s_t - \mu)/\delta, \tag{4}$$

where $\mu$ and $\delta^2$ are the mean and variance of the raw scores.
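A minimal sketch of Eqs. (3) and (4), assuming the statistics are computed over all raw scores of one query (the framework slide describes the normalization as per query). The function names and the zero-variance guard are illustrative assumptions.

```python
import numpy as np

def raw_scores(distances, beta=0.6):
    """Eq. (3): s_t = exp(-d_t / beta), with beta = 0.6 as on the slide."""
    return np.exp(-np.asarray(distances, dtype=float) / beta)

def normalize_per_query(scores):
    """Eq. (4): subtract the mean and divide by the standard deviation of one query's scores."""
    scores = np.asarray(scores, dtype=float)
    mu = scores.mean()
    delta = scores.std()
    if delta == 0.0:                              # guard for constant scores (assumed behavior)
        return np.zeros_like(scores)
    return (scores - mu) / delta

# Alignment distances of one query against three regions.
s = raw_scores([0.2, 0.9, 0.9])
s_hat = normalize_per_query(s)
```

Smaller distances map to larger scores, so the first region ends up with the highest normalized score.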


System Performance

Table: Performances on all the queries.

System No.   actCnxe   minCnxe   ATWV    MTWV
1            0.682     0.659     0.412   0.413
2            0.638     0.585     0.412   0.413

Table: Performances on the type I queries.


- In system 1, only the promising scores higher than a threshold are used for evaluation. The others are replaced by a negative constant (-0.5).
- In system 2, all the scores are used for evaluation.
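The difference between the two systems amounts to one post-processing step, sketched below. The threshold value is not given on the slide, so `threshold` is a hypothetical parameter, as is the function name; only the constant -0.5 comes from the slide.

```python
import numpy as np

def keep_promising(scores, threshold, floor=-0.5):
    """System 1 style: keep scores above the threshold, replace the rest by a constant."""
    scores = np.asarray(scores, dtype=float)
    return np.where(scores > threshold, scores, floor)

# The threshold 0.5 is an arbitrary illustrative value.
system1 = keep_promising([1.2, 0.1, -0.3], threshold=0.5)
```

System 2 would simply submit the normalized scores unchanged.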



ATWV is affected only by the promising scores, while Cnxe is affected by all the scores.


Summary

The system is built following the posteriorgram-based DTW detection framework.

Only type I queries are considered in the system development.

DTW matrix combination is used.

A new ASM tokenizer construction method is used.


Thank you!


References

Hazen, T., Shen, W., and White, C. (2009). Query-by-example spoken term detection using phonetic posteriorgram templates. In ASRU, pages 421–426.

Lee, C., Soong, F., and Juang, B. (1988). A segment model based approach to speech recognition. In ICASSP.

Wang, H., Lee, T., Leung, C.-C., Ma, B., and Li, H. (2013a). Unsupervised mining of acoustic subword units with segment-level Gaussian posteriorgrams. In Interspeech, pages 2297–2301.

Wang, H., Lee, T., Leung, C.-C., Ma, B., and Li, H. (2014a). A graph-based Gaussian component clustering approach to unsupervised acoustic modeling. In Interspeech.

Wang, H., Lee, T., Leung, C.-C., Ma, B., and Li, H. (2013b). Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection. In ICASSP.

Wang, H., Lee, T., Leung, C.-C., Ma, B., and Li, H. (2014b). Acoustic segment modeling with spectral clustering methods. In submission to IEEE/ACM TASLP.

Wang, H., Leung, C.-C., Lee, T., Ma, B., and Li, H. (2012). An acoustic segment modeling approach to query-by-example spoken term detection. In ICASSP, pages 5157–5160.

Xia, T., Tao, D., Mei, T., and Zhang, Y. (2010). Multiview spectral embedding. IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, 40(6):1438–1446.
