cuhk system 14oct
Overview System Description System Performance Summary
CUHK System for QUESST Task of MediaEval 2014
Presenter: Cheung-Chi Leung
on behalf of
Haipeng Wang and Tan Lee
Department of Electronic Engineering
The Chinese University of Hong Kong
MediaEval Workshop, 16-17 October 2014
Outline
1 Overview
2 System Description
3 System Performance
4 Summary
Overview
This is the third year we have participated in this exciting evaluation.
In 2012, we used multiple tokenizers with DTW matrix combination [Wang et al., 2013b].
In 2013, we used a single ASM tokenizer built with Gaussian component clustering [Wang et al., 2013a].
Overview of our 2014 system:
Focusing on type I query matching.
Following the posteriorgram-based DTW detection framework [Hazen et al., 2009].
DTW matrix combination for fusing multiple tokenizers.
One new thing: a new ASM tokenizer construction approach.
System Framework
[Figure: system framework. The query example and the test utterance are passed through tokenizers 1..N, giving query posteriorgrams 1..N and test posteriorgrams 1..N. Tokenizer n yields a DTW distance matrix Dn; the matrices D1..DN are combined into a single DTW distance matrix D, on which detection by DTW produces a raw detection score, followed by score normalization.]
6 tokenizers are used:
1 GMM tokenizer (1024 mixtures)
5 phoneme recognizers, including three BUT phoneme recognizers (Czech, Hungarian and Russian), one Mandarin and one English phoneme recognizer
1 ASM tokenizer (introduced later)
DTW local distance: log inner product of posteriorgrams
DTW matrix combination: linear fusion with equal weights
DTW search: DTW distance computation with a sliding window
Score normalization: mean and variance normalization per query
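The local distance and the matrix combination can be illustrated with a minimal sketch. It assumes each posteriorgram is a list of per-frame probability vectors and that all tokenizers produce the same number of frames; the function names and the `floor` guard against log(0) are illustrative assumptions, not part of the original system.

```python
import math

def local_distance_matrix(query_post, test_post, floor=1e-10):
    """Return D[i][j] = -log(q_i . t_j) for one tokenizer.

    `floor` is a hypothetical guard that keeps the log finite when
    the inner product is zero.
    """
    return [
        [-math.log(max(sum(q * t for q, t in zip(q_vec, t_vec)), floor))
         for t_vec in test_post]
        for q_vec in query_post
    ]

def fuse_distance_matrices(matrices):
    """Combine per-tokenizer matrices with equal weights (element-wise sum)."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) for j in range(cols)]
            for i in range(rows)]

# Toy example: two tokenizers, 2 query frames, 3 test frames.
query = [[0.5, 0.5], [1.0, 0.0]]
test = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
d1 = local_distance_matrix(query, test)
fused = fuse_distance_matrices([d1, d1])
```

The element-wise sum follows the combined distance as written in Eq. (1); dividing by the number of tokenizers would only rescale the matrix.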
ASM Tokenizer Construction
Acoustic segment modeling (ASM) [Lee et al., 1988] is a way to build acoustic models from unlabeled speech data.
ASM involves three steps: initial segmentation, segment labeling, and iterative modeling.
[Figure: ASM flowchart. Raw acoustic observations -> initial segmentation -> segment labeling -> iterative modeling (supervised model training and HMM decoding, repeated while the "Converge?" check is N) -> acoustic models and token sequences once the check is Y.]
ASM Tokenizer Construction
Initial segmentation: break utterances into short-time segments.
Segment labeling: perform clustering on segments.
Iterative modeling: iterative HMM training and decoding.
[Figure: one utterance shown as raw acoustic observations, after initial segmentation, after segment labeling, and after iterative modeling, with example token label sequences A A C B C C and A A C B C B.]
Segment Labeling
Segment labeling provides initializations to iterative modeling.
Our work on segment labeling:
GMM labeling [Wang et al., 2012];
Gaussian component clustering [Wang et al., 2014a];
multiview spectral clustering [Wang et al., 2014b].
Key factors: segment representation and clustering algorithms.
Our segment representation:
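[Figure: speech frames are passed through a tokenizer to give posteriorgrams, and through initial segmentation to give segment boundaries; the two are combined into a class-by-segment matrix with M classes (rows) and N segments (columns).]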
Different tokenizers lead to different representations, e.g., Gaussian-by-segment, phoneme-by-segment, state-by-segment.
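A small sketch of how a class-by-segment matrix might be assembled. The slide does not state how frame posteriors are pooled within a segment; averaging them is an assumption made here for illustration, and the function name and the (start, end) boundary format are hypothetical.

```python
def class_by_segment_matrix(posteriorgram, boundaries):
    """Build an M x N matrix from frame posteriors and segment boundaries.

    posteriorgram: list of frames, each a list of M class posteriors.
    boundaries: list of (start, end) frame indices, end exclusive.
    Column n holds the mean posterior vector of segment n
    (mean pooling is an assumption, not taken from the slides).
    """
    num_classes = len(posteriorgram[0])
    columns = []
    for start, end in boundaries:
        frames = posteriorgram[start:end]
        columns.append([sum(f[c] for f in frames) / len(frames)
                        for c in range(num_classes)])
    # Transpose the N x M list of columns into an M x N matrix.
    return [list(row) for row in zip(*columns)]

# Toy example: 4 frames, M = 2 classes, N = 2 segments.
post = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]
matrix = class_by_segment_matrix(post, [(0, 2), (2, 4)])
```

Swapping the tokenizer changes only the rows of the matrix (Gaussians, phonemes, or states); the segment boundaries, and therefore the columns, stay the same.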
Segment Labeling
Segment labeling with spectral clustering
[Figure: the class-by-segment matrix (data matrix) is built from posteriorgrams and segment boundaries as before; compute the similarity matrix W, solve for Y, then perform k-means on the M row vectors of Y.]
Standard normalized cut:
similarity between segments: inner product
clustering on normalized graph Laplacian
Only one class-by-segment matrix is used.
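The normalized-cut labeling can be sketched as follows, assuming NumPy is available. This is a generic textbook version (inner-product similarity, symmetric normalized Laplacian, k-means on the spectral embedding), not the authors' efficient implementation; the function name, the farthest-point k-means initialization, and the iteration count are assumptions.

```python
import numpy as np

def spectral_cluster_segments(X, k, n_iter=50):
    """Cluster the N columns (segments) of an M x N class-by-segment matrix."""
    X = np.asarray(X, dtype=float)
    W = X.T @ X                                   # inner-product similarity, N x N
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    lap = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)                 # eigenvalues in ascending order
    Y = vecs[:, :k]                               # eigenvectors of the k smallest
    Y = Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
    # k-means with a deterministic farthest-point initialization (assumption)
    centers = [Y[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Y[labels == c].mean(axis=0)
    return labels

# Toy example: segments 0-1 favor class 0, segments 2-3 favor class 1.
X = [[0.9, 0.8, 0.1, 0.2],
     [0.1, 0.2, 0.9, 0.8]]
labels = spectral_cluster_segments(X, k=2)
```

The sketch forms W and the Laplacian explicitly, which is only practical for small N.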
Segment Labeling
Segment labeling with multiview spectral clustering (MSC)
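[Figure: MSC pipeline. Three "Construct" steps, one per class-by-segment matrix, feed a constrained optimization ("s.t.") that yields Y; k-means is then performed on the M row vectors of Y.]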
Multiple class-by-segment matrices are used together.
Linear fusion on the Laplacian matrices [Xia et al., 2010].
Promising gains on SWS2012 data [Wang et al., 2014b].
Efficient implementation is designed to avoid the explicit computation of similarity matrices and Laplacian matrices.
ASM Tokenizer Construction
Segment cluster labels are used as initial transcriptions.
An iterative training procedure follows [Lee et al., 1988]:
Use the current transcriptions to train acoustic models.
Use the current acoustic models to decode the speech utterances to get new transcriptions.
Repeat the above two steps until convergence.
ASM posteriorgrams are formed by the monophone state posterior probabilities.
DTW Detection
A sliding window is used to determine a region in which DTW detection is performed.
DTW distance:
$$D = -\sum_{n} \log\left(Q_n^{\top} T_n\right), \qquad (1)$$
where $Q_n$ and $T_n$ are the query posteriorgram and test posteriorgram generated by the $n$th tokenizer.
For the $t$th region $D_t$, the alignment distance is
$$d_t = \min_{L,\, i(l),\, j(l)} \frac{1}{L} \sum_{l=1}^{L} D_t\big(i(l), j(l)\big), \qquad (2)$$
where $i(l)$ and $j(l)$ are the coordinates of the $l$th step of the alignment path, and $L$ is the length of the alignment path.
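A minimal sketch of Eq. (2) on a small matrix. It assumes the alignment path runs from the first to the last cell of the region with steps (1,0), (0,1), and (1,1); the slides do not specify the step pattern, window length, or shift, so those (and the function names) are illustrative assumptions. To minimize the length-normalized cost exactly, the dynamic program keeps one accumulated cost per path length.

```python
import math

def normalized_dtw_distance(D):
    """Minimum average local distance over paths from (0,0) to (I-1,J-1)."""
    I, J = len(D), len(D[0])
    max_len = I + J - 1                      # longest possible path, in cells
    # best[i][j][l]: minimum accumulated cost reaching (i, j) in exactly l cells
    best = [[[math.inf] * (max_len + 1) for _ in range(J)] for _ in range(I)]
    best[0][0][1] = D[0][0]
    for i in range(I):
        for j in range(J):
            for pi, pj in ((i - 1, j), (i, j - 1), (i - 1, j - 1)):
                if pi < 0 or pj < 0:
                    continue
                for l in range(1, max_len):
                    cand = best[pi][pj][l] + D[i][j]
                    if cand < best[i][j][l + 1]:
                        best[i][j][l + 1] = cand
    return min(best[I - 1][J - 1][l] / l for l in range(1, max_len + 1))

def sliding_window_distances(D, window, shift):
    """Apply the alignment to each region D_t (a block of `window` test frames)."""
    J = len(D[0])
    return [normalized_dtw_distance([row[s:s + window] for row in D])
            for s in range(0, J - window + 1, shift)]

# Toy example: 2 query frames against 3 test frames, window of 2 frames.
distances = sliding_window_distances([[0, 1, 1], [1, 0, 1]], window=2, shift=1)
```

Tracking the path length makes the cost grow with the matrix size; practical systems often approximate the normalization instead.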
Score Processing
Score transformation.
An exponential function is used to transform the DTW distance $d_t$ to a raw detection score,
$$s_t = \exp(-d_t/\beta), \qquad (3)$$
where $\beta$ is set to 0.6 in our system.
Score normalization.
A simple mean and variance normalization,
$$\hat{s}_t = (s_t - \mu)/\delta, \qquad (4)$$
where $\mu$ and $\delta^2$ are the mean and variance of the raw scores, and $\hat{s}_t$ is the normalized score.
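Eqs. (3) and (4) can be sketched in a few lines. The sketch assumes the normalization statistics are computed over the raw scores of a single query (the framework slide says normalization is per query) and uses the population variance, which the slides do not specify; the function names and the zero-variance guard are illustrative.

```python
import math
import statistics

BETA = 0.6  # value stated on the slide

def raw_score(distance, beta=BETA):
    """Eq. (3): map a DTW alignment distance to a raw detection score."""
    return math.exp(-distance / beta)

def normalize_scores(scores):
    """Eq. (4): mean and variance normalization over one query's raw scores."""
    mu = statistics.mean(scores)
    delta = statistics.pstdev(scores)   # population standard deviation (assumption)
    if delta == 0:
        return [0.0 for _ in scores]    # guard added for the sketch
    return [(s - mu) / delta for s in scores]

# Toy example: three alignment distances for one query.
raw = [raw_score(d) for d in (0.0, 0.6, 1.2)]
normalized = normalize_scores(raw)
```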
System Performance
Table: Performance on all queries.
System No.  actCnxe  minCnxe  ATWV   MTWV
1           0.682    0.659    0.412  0.413
2           0.638    0.585    0.412  0.413

Table: Performance on the type I queries.
System No.  actCnxe  minCnxe  ATWV   MTWV
1           0.526    0.486    0.611  0.613
2           0.508    0.420    0.611  0.613
In system 1, only those promising scores higher than a threshold are used for evaluation. Others are replaced by a negative constant (-0.5).
In system 2, all the scores are used for evaluation.
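The difference between the two systems amounts to a simple post-processing step, sketched below. The threshold value is not given on the slide, so the one used in the example is purely hypothetical, as is the function name.

```python
def keep_promising(scores, threshold, fill=-0.5):
    """System 1: keep scores above `threshold`, replace the rest with `fill`."""
    return [s if s > threshold else fill for s in scores]

# Toy example with a hypothetical threshold of 0.5.
system2_scores = [1.2, 0.1, -0.3]                              # system 2: all scores kept
system1_scores = keep_promising(system2_scores, threshold=0.5)
```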
ATWV is only affected by those promising scores, while Cnxe is affected by all the scores.
Summary
The system is built following the posteriorgram-based DTW detection framework.
Only type I queries are considered in the system development.
DTW matrix combination is used.
A new ASM tokenizer construction method is used.
Thank you!
References
Hazen, T., Shen, W., and White, C. (2009). Query-by-example spoken term detection using phonetic posteriorgram templates. In ASRU, pages 421–426.
Lee, C., Soong, F., and Juang, B. (1988). A segment model based approach to speech recognition. In ICASSP.
Wang, H., Lee, T., Leung, C., Ma, B., and Li, H. (2013a). Unsupervised mining of acoustic subword units with segment-level Gaussian posteriorgrams. In Interspeech, pages 2297–2301.
Wang, H., Lee, T., Leung, C., Ma, B., and Li, H. (2014a). A graph-based Gaussian component clustering approach to unsupervised acoustic modeling. In Interspeech.
Wang, H., Lee, T., Leung, C.-C., Ma, B., and Li, H. (2013b). Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection. In ICASSP.
Wang, H., Lee, T., Leung, C.-C., Ma, B., and Li, H. (2014b). Acoustic segment modeling with spectral clustering methods. In submission to IEEE/ACM TASLP.
Wang, H., Leung, C., Lee, T., Ma, B., and Li, H. (2012). An acoustic segment modeling approach to query-by-example spoken term detection. In ICASSP, pages 5157–5160.
Xia, T., Tao, D., Mei, T., and Zhang, Y. (2010). Multiview spectral embedding. IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, 40(6):1438–1446.