TRANSCRIPT
JHU WS’08 RSR Team
Robust speaker recognition over varying channels
Niko Brummer, Lukas Burget, William Campbell, Fabio Castaldo, Najim Dehak, Reda Dehak, Ondrej Glembek, Valiantsina Hubeika, Sachin Kajarekar,
Zahi Karam, Patrick Kenny, Jason Pelecanos, Douglas Reynolds, Nicolas Scheffer, Robbie Vogt
NIST SRE2008 - Interview speech
• The same microphone in training and test: < 1% EER
• Different microphone in training and test: about 3% EER
Intersession Variability
• Variability refers to changes in channel effects between training and successive detection attempts
• Channel/session variability encompasses several factors:
– The microphones: carbon-button, electret, hands-free, array, etc.
– The acoustic environment: office, car, airport, etc.
– The transmission channel: landline, cellular, VoIP, etc.
– Differences in the speaker's voice: aging, mood, spoken language, etc.
The largest challenge to practical use of speaker detection systems is channel/session variability.
Tools to fight unwanted variability
Joint Factor Analysis
M = m + Vy + Dz + Ux
System based on Joint Factor Analysis
Baseline System
[DET plot: miss probability vs. false alarm probability]
NIST SRE08
short2-short3
Telephone Speech in Training and Test
NIST SRE Evaluations
• Annual NIST evaluations of speaker verification technology (since 1995) using a common paradigm for comparing technologies
• All the team members participated in the recent 2008 NIST evaluation
• The JHU workshop provided a great opportunity to:
– do common post-evaluation analysis of our systems
– combine and improve techniques developed by individual sites
• Thanks to the NIST evaluations we have:
– identified some of the current problems that we worked on
– a well-defined setup and evaluation framework
– baseline systems that we were trying to extend and improve during the workshop
Subgroups
• Diarization using JFA
• Factor Analysis Conditioning
• SVM – JFA and fast scoring
• Discriminative System Optimization
Diarization using JFA
Problem Statement
– Diarization is an important upstream process for real-world multi-speaker speech
– At one level, diarization depends on accurate speaker discrimination for change detection and clustering
– JFA and Bayesian methods have the promise of providing improvements to speaker diarization
Goals
– Apply diarization systems to summed telephone speech and interview microphone speech:
   Baseline segmentation-agglomerative clustering
   Streaming system using speaker-factor features
   New variational Bayes approach using eigenvoices
– Measure performance in terms of DER and effect on speaker detection
Factor Analysis Conditioning
Problem Statement
– A single FA model is sub-optimal across different conditions
– E.g.: different durations, phonetic content and recording scenarios
Goals
– Explore two approaches:
   - Build FA models specific to each condition and robustly combine the multiple models
   - Extend the FA model to explicitly model the condition as another source of variability
SVM - JFA
Problem Statement
– The Support Vector Machine is a discriminative recognizer which has proved useful for SRE
– Parameters of generative GMM speaker models are used as features for linear SVMs (sequence kernels)
– We know Joint Factor Analysis provides higher-quality GMMs, but using these as-is in SVMs has not been very successful
Goals
– Analysis of the problem
– Redefinition of SVM kernels based on JFA?
– Application of JFA vectors to recently proposed and closely related bilinear scoring techniques which do not use SVMs
Discriminative System Optimization
Problem Statement
– Discriminative training has proved very useful in speech and language recognition, but has not been investigated in depth for speaker recognition
– In both speech and language recognition, the classes (phones, languages) are modeled with generative models, which can be trained with copious quantities of data
– But in speaker recognition, our speaker GMMs have at best a few minutes of training, typically of only one recording of the speaker
Goals
– Reformulate the speaker recognition problem as binary discrimination between pairs of recordings which can be (i) of the same speaker, or (ii) of two different speakers
– We now have lots of training data for these two classes, and we can afford to train complex discriminative recognizers
Relevance MAP adaptation
[Toy figure: UBM, target speaker model and test data]
• 2-D features
• Single-Gaussian model
• Only the mean vector(s) are adapted
Intersession variability
[Figure: directions of high inter-session and high speaker variability around the UBM and the target speaker model]
Intersession variability
[Figure: the same directions, now with test data and the decision boundary]
Intersession compensation
For recognition, move both models along the high inter-session variability direction(s) to fit the test data well (e.g. in the ML sense)
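For a single Gaussian with diagonal covariance, the ML shift along the inter-session directions has a closed form: x̂ = (Uᵀ Σ⁻¹ U)⁻¹ Uᵀ Σ⁻¹ (x̄ − m). A sketch under those assumptions; the helper name is hypothetical:

```python
import numpy as np

def ml_channel_shift(model_mean, U, Sigma_diag, test_frames):
    """Shift the model mean along the columns of U (the high inter-session
    variability directions) to best fit the test data in the ML sense,
    for a single Gaussian with diagonal covariance Sigma."""
    test_frames = np.atleast_2d(np.asarray(test_frames, dtype=float))
    resid = test_frames.mean(axis=0) - model_mean
    Ui = U / Sigma_diag[:, None]                      # Sigma^{-1} U
    x_hat = np.linalg.solve(U.T @ Ui, Ui.T @ resid)   # ML channel factor
    return model_mean + U @ x_hat
```

If the test data happens to lie exactly in the span of U around the model mean, the shifted mean reaches it exactly; otherwise the shift is the least-squares projection onto those directions.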
• Probabilistic model proposed by Patrick Kenny
• Speaker model represented by the mean supervector M = m + Vy + Dz + Ux
– U: subspace with high intersession/channel variability (eigenchannels)
– V: subspace with high speaker variability (eigenvoices)
– D: diagonal matrix describing the remaining speaker variability not covered by V
– Gaussian priors are assumed for the speaker factors y, z and the channel factors x
Joint Factor Analysis model
[Figure: the supervector m in the 3-D space of model parameters (e.g. a 3-component GMM with 1-D features)]
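The decomposition M = m + Vy + Dz + Ux can be illustrated numerically. A minimal NumPy sketch with toy dimensions; all matrices and factors are random stand-ins, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 3-component GMM with 1-D features gives a 3-D
# supervector (real systems use e.g. 512+ Gaussians with ~39-D features).
sv_dim, n_eigenvoices, n_eigenchannels = 3, 2, 1

m = rng.normal(size=sv_dim)                      # UBM mean supervector
V = rng.normal(size=(sv_dim, n_eigenvoices))     # eigenvoices
U = rng.normal(size=(sv_dim, n_eigenchannels))   # eigenchannels
D = np.diag(rng.uniform(0.1, 1.0, size=sv_dim))  # residual speaker variability

# Factors with standard-normal priors, as the JFA model assumes
y = rng.normal(size=n_eigenvoices)    # speaker factors
z = rng.normal(size=sv_dim)           # residual speaker factors
x = rng.normal(size=n_eigenchannels)  # channel factors

# Speaker- and channel-dependent mean supervector
M = m + V @ y + D @ z + U @ x
```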
Working with JFA
• Enrolling a speaker model:
– Given enrollment data and the hyperparameters m, Σ, V, D and U, obtain MAP point estimates (or posterior distributions) of all factors x, y, z
– Most of the speaker information is in the low-dimensional vector y; less is in the high-dimensional vector z; x should contain only channel-related information
• Test:
– Given the fixed (distributions of) speaker-dependent factors y and z, obtain new estimates of the channel factors x for the test data
– The score for a test utterance is the log-likelihood ratio between the UBM and the speaker model defined by the factors x, y, z
• Training hyperparameters:
– The hyperparameters m, Σ, V, D and U can be estimated from training data using the EM algorithm
– Posterior distributions of the "hidden" factors x, y, z and the hyperparameters are alternately estimated to maximize the likelihood of the training data
– The distributions of the speaker factors y, z are constrained to be the same for all segments of the same speaker, while the channel factors x may differ for every segment
Flavors of JFA
• Relevance MAP adaptation: M = m + Dz with D² = Σ/τ, where Σ is the matrix with the UBM variance supervector on its diagonal
• Eigenchannel adaptation (SDV, BUT)
– Relevance MAP for enrolling the speaker model
– Adapt the speaker model to the test utterance using U estimated by PCA
• JFA without V, with D² = Σ/τ (QUT, LIA)
• JFA without V, with D trained from data (CRIM)
– Can be seen as training a different τ for each supervector coefficient; the effective relevance factor is τ_ef = trace(Σ)/trace(D²)
• JFA with V (CRIM)
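The effective relevance factor τ_ef = trace(Σ)/trace(D²) is easy to verify numerically. A small sketch with hypothetical UBM variances, confirming that setting D² = Σ/τ recovers τ exactly:

```python
import numpy as np

def effective_relevance_factor(Sigma_diag, D_diag):
    """tau_ef = trace(Sigma) / trace(D^2) for diagonal Sigma and D."""
    Sigma_diag = np.asarray(Sigma_diag, dtype=float)
    D_diag = np.asarray(D_diag, dtype=float)
    return float(np.sum(Sigma_diag) / np.sum(D_diag ** 2))

# Sanity check: with D^2 = Sigma / tau the effective factor recovers tau.
tau = 14.0
Sigma_diag = np.array([1.0, 2.0, 0.5, 4.0])  # hypothetical UBM variances
D_diag = np.sqrt(Sigma_diag / tau)
```

When D is trained from data instead, τ_ef summarizes how strongly the residual term is shrunk on average.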
Flavors of JFA
Full JFA significantly outperforms the other JFA configurations.
[DET plot, SRE 2006 (all trials, det1): No JFA; eigenchannel adaptation; JFA with D² = Σ/τ; JFA with D trained on data; JFA with eigenvoices]
Subgroups
• Diarization based on JFA
• Factor Analysis Conditioning
• SVM – JFA and fast scoring
• Discriminative System Optimization
Diarization Group
Douglas Reynolds, Patrick Kenny, Fabio Castaldo, Ciprian Costin
Roadmap
• Introduction
– Problem definition
– Experiment setup
• Diarization Systems
– Variational Bayes System
– Streaming and Hybrid Systems
• Analysis and Conclusions
Diarization: Segmentation and Clustering
• Determine when speaker change has occurred in speech signal (segmentation)
• Group together speech segments from same speaker (clustering)
• Prior speaker information may or may not be available
[Figure: two-speaker waveform with segments labeled Speaker A and Speaker B]
Where are speaker changes? Which segments are from the same speaker?
Diarization Applications
• Diarization is used as a pre-process for other downstream applications
• Human consumption
– Annotate the transcript with speaker changes/labels
– Provide an overview of speaker activity
• Algorithm consumption
– Adaptation of a speech recognition system
– Application to speaker detection with multi-speaker speech
[Diagram: speaker diarization output feeds per-speaker single-speaker (1sp) detectors; a MAX over their scores gives the utterance score]
Diarization Error Measures
• Direct Measure – Diarization Error Rate (DER)
– Optimal alignment of reference and hypothesized diarizations
– Error is the sum of:
   miss (speaker in reference but not in hypothesis)
   false alarm (speaker in hypothesis but not in reference)
   speaker error (mapped reference speaker is not the same as the hypothesized speaker)
– Time-weighted measure: emphasizes talkative speakers
[Figure: reference vs. hypothesis timelines showing miss, false alarm and speaker-error regions]
• Consumer Measure – Effect on speaker detection system
– Determine the speaker detection error rate when using different diarization outputs
– Focus on NIST SRE 2008 data with a fixed detection system (JFA GMM-UBM)
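The DER definition above reduces to a one-line computation over time-weighted error components. A minimal sketch with made-up durations:

```python
def diarization_error_rate(miss, false_alarm, speaker_error, scored_time):
    """DER: time-weighted sum of miss, false alarm and speaker error,
    divided by the total scored speaker time."""
    return (miss + false_alarm + speaker_error) / scored_time

# E.g. 5 s missed, 3 s false alarm, 12 s wrong speaker in 400 s of scored speech
der = diarization_error_rate(5.0, 3.0, 12.0, 400.0)  # 0.05, i.e. 5% DER
```

The time weighting is what makes the measure emphasize talkative speakers: a minute of error from a dominant speaker costs more than a second from a quiet one.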
[DET plot: probability of miss (in %) vs. probability of false alarm (in %), with the EER point marked]
Diarization Experiment Data
• Summed-channel telephone speech
– Use summed-channel data for test only (avoids the complication of extra clustering in training)
– We can derive a reference for DER scoring using ASR transcripts from the separate channels (no-score for silence and speaker overlap)
– Compare the use of diarization to two extremes: best case (use reference diarization) and worst case (no diarization)
• Interview microphone speech
– Single microphone recording capturing both interviewee (target) and interviewer
– Avoid use of the unrealistic side information about the location of interviewee speech provided in the NIST eval
– Reference for DER scoring from lavalier-microphone ASR transcripts
Roadmap
• Introduction
– Problem definition
– Experiment setup
• Diarization Systems
– Variational Bayes System
– Streaming and Hybrid Systems
• Analysis and Conclusions
Baseline System
[Diagram: speaker change detection, agglomerative clustering, GMM training and Viterbi decoding, final diarization]
• Three stages in the baseline system:
– BIC-based speaker change detection
– Full-covariance agglomerative clustering with a BIC stopping criterion
– Iterative re-segmentation with GMM Viterbi decoding
Factor Analysis Applied to Diarization
• State-of-the-art speaker recognition systems use hundreds of speaker and channel factors
– Processing requires entire utterances; can't be implemented incrementally
• State-of-the-art diarization systems require lots of local decisions
– Very short (~1 s) speech segments
– Speaker segmentation: is this frame a speaker change point?
– Agglomerative clustering: given two short segments, is the speaker the same?
• Proposed solution: Variational Bayes (VB)
– Fabio Valente, Variational Bayesian Methods for Audio Indexing, PhD dissertation, Eurecom, 2005
Advantages of a Bayesian Approach
• EM-like convergence guarantees
• No premature hard decisions as in agglomerative clustering
– This suggested a 'soft clustering' heuristic which reduced the diarization error rate of the baseline system by almost 50%
• In theory at least, Bayesian methods are not subject to the over-fitting that maximum likelihood methods are prone to
– Bayesian model selection is a quantitative version of Occam's razor (David MacKay)
– It ought to be possible to determine the number of speakers in a file without resorting to BIC-like fudge factors (Fabio Valente)
Eigenvoice Speaker Model
• A supervector s is the concatenation of the mean vectors in a speaker-dependent Gaussian mixture model
• The supervector m is speaker independent
• The matrix V is of low rank
– The columns of V are the eigenvoices
– The entries of y are the speaker factors
• A highly informative prior on speaker-dependent GMMs
• Adding eigenchannels doesn't help in diarization (so far)
• For diarization we use only the eigenvoice component of factor analysis:
     s = m + Vy,   y ~ N(0, I)
Variational Bayes Diarization
• Assume 2 speakers and uniformly segment the file into 1-second intervals
– This restriction can be removed in a second pass
• Alternate between estimating two types of posterior distribution until convergence
– Segment posteriors (soft clustering)
– Speaker posteriors (location of the speakers in the space of speaker factors)
• Construct GMMs for each speaker and re-segment the data
– Iterate as needed
Variational Bayes Diarization: Details
• Begin
– Extract Baum-Welch statistics from each segment
• On each iteration
– For each speaker:
   Synthesize Baum-Welch statistics by weighting the statistics for each segment by the corresponding segment posterior
   Update the posterior distribution of the speaker factors
– For each segment:
   Update the segment posteriors for each speaker
• End
– Construct GMMs for each speaker
– Re-segment the data
– Iterate
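The alternation above can be caricatured with point estimates instead of full variational posteriors. A toy sketch: synthetic per-segment "speaker factor" observations stand in for Baum-Welch statistics, and the soft-EM updates below are a deliberate simplification of the real VB speaker/segment updates:

```python
import numpy as np

rng = np.random.default_rng(1)

n_segments, n_speakers, n_factors = 20, 2, 3

# Synthetic per-segment observations (stand-ins for Baum-Welch statistics
# projected into the eigenvoice space)
true_locs = 3.0 * rng.normal(size=(n_speakers, n_factors))
labels = rng.integers(0, n_speakers, size=n_segments)
obs = true_locs[labels] + 0.3 * rng.normal(size=(n_segments, n_factors))

# Segment posteriors q[t, s]: soft assignment of segment t to speaker s
q = rng.dirichlet(np.ones(n_speakers), size=n_segments)

for _ in range(30):
    # "Speaker update": posterior-weighted mean of the segment statistics
    locs = (q.T @ obs) / q.sum(axis=0)[:, None]
    # "Segment update": soft re-assignment by distance to each speaker
    d2 = ((obs[:, None, :] - locs[None, :, :]) ** 2).sum(axis=-1)
    q = np.exp(-0.5 * (d2 - d2.min(axis=1, keepdims=True)))
    q /= q.sum(axis=1, keepdims=True)
```

The key structural point survives the simplification: no hard decision is ever taken inside the loop; segments keep fractional ownership until the final re-segmentation.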
Experiment Configuration
• Features used for Variational Bayes
– 39-dimensional feature set optimized by Brno for speaker recognition
– Cepstral coefficients c0, ..., c12 + first-, second- and third-order derivatives + Gaussianization + HLDA
• Features used in the baseline system and in the re-segmentation phase of Variational Bayes
– Un-normalized cepstral coefficients c0, ..., c12
– Including c0 was a lucky bug
• Factor analysis configuration for Variational Bayes
– Universal background model with 512 Gaussians
– 200 speaker factors, no channel factors
– V matrix scaled by 0.6
• Test set: the summed-channel telephone data provided by NIST in the 2008 speaker recognition evaluation
– 2215 files (~200 hours)
• NIST Diarization Error Rate used to measure performance
– Ground-truth diarization is available
Experiment Results: NIST 2008 Summed-Channel Telephone Speech
• VB = Variational Bayes
• BW = Baum-Welch training of speaker GMMs
• Viterbi = re-segmentation with speaker GMMs
• The second pass in VB uses a non-uniform segmentation provided by the first pass
• Compared to the baseline, soft clustering achieves a 50% reduction in error rates
[Bar chart: DER (%) for the five diarization systems]

Diarization System                     Mean DER (%)   Std DER (%)
1  Baseline BW Viterbi                 6.8            12.3
2  VB                                  9.1            11.9
3  VB BW Viterbi                       4.5            8.5
4  VB BW Viterbi, 2nd pass             3.8            7.6
5  Baseline soft-cluster BW Viterbi    3.5            8.0
Roadmap
• Introduction
– Problem definition
– Experiment setup
• Diarization Systems
– Variational Bayes System
– Streaming and Hybrid Systems
• Analysis and Conclusions
Streaming System: LPT Diarization System*
• Main ideas
– Use the eigenvoice model to create a stream of speaker factors y_t computed on a sliding window
– Perform segmentation and clustering with these new features
• Eigenvoice model: s = m + Vy,   y ~ N(0, I)
* Based on Castaldo, F.; Colibro, D.; Dalmasso, E.; Laface, P.; Vair, C., "Stream-based speaker segmentation using speaker factors and eigenvoices", ICASSP 2008
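A heavily simplified sketch of the idea: compute one vector per sliding window and treat the resulting stream as features for segmentation. A real system extracts y_t from Baum-Welch statistics via the eigenvoice model; here a plain projection of the window mean stands in, and the function name is hypothetical:

```python
import numpy as np

def sliding_speaker_factors(features, proj, win=100, hop=50):
    """Stream of per-window vectors: project each sliding window's mean
    through `proj` (a stand-in for eigenvoice-based extraction of y_t)."""
    ys = []
    for start in range(0, len(features) - win + 1, hop):
        ys.append(proj @ features[start:start + win].mean(axis=0))
    return np.array(ys)
```

The payoff is that speaker changes show up as jumps in a low-dimensional stream, where change detection and clustering are cheap.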
Streaming System
[Diagram: audio, feature extraction, slices x1...x9, streaming factor analysis, slice clustering into GMM 1 / GMM 2]
Streaming System: Stream Factor Analysis
[Diagram: feature extraction over frames x1...x12, slices, speaker-factor streams Y1 and Y2, clustering, GMM creation, Viterbi segmentation]
Streaming System: Stream Factor Analysis
[Plot: first 2 dimensions of the y stream over time]
Streaming System: Slice Clustering
• A GMM is created for each slice
• The last step is clustering the GMMs created for the slices
• The system decides whether GMMs come from the same or different speakers using an approximation of the Kullback-Leibler divergence between GMMs
• Large KL divergence => new speaker
[Diagram: each new 60 s slice yields a new GMM; the minimum KL divergence to the existing GMMs is compared with a threshold λ: below λ the existing model is adapted, above λ a new speaker model is created]
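The threshold decision can be sketched for single diagonal-covariance Gaussians; the slide's GMMs would use a matched-component approximation of the KL divergence, so this is a simplified stand-in:

```python
import numpy as np

def kl_diag_gauss(mu1, var1, mu2, var2):
    """KL divergence between two diagonal-covariance Gaussians."""
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def same_speaker(mu1, var1, mu2, var2, threshold):
    """Symmetrised KL below the threshold: adapt the existing model.
    Above it: create a new speaker model."""
    sym_kl = (kl_diag_gauss(mu1, var1, mu2, var2) +
              kl_diag_gauss(mu2, var2, mu1, var1))
    return bool(sym_kl < threshold)
```

Symmetrising the divergence avoids the decision depending on which slice's model is taken as the reference.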
Hybrid Clustering
• Speaker factors work well in the streaming diarization system
• Experiments done during the workshop showed that the cosine distance between speaker factors produces low speaker detection errors
• We modified the baseline system using these new ideas
• Hybrid clustering
– Replace the classical clustering with one based on speaker factors and cosine distance
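The cosine-based merging rule can be sketched directly. A minimal version; the exhaustive pair search over cluster factors is an illustrative simplification:

```python
import numpy as np

def cosine_similarity(y1, y2):
    """Cosine similarity between two speaker-factor vectors."""
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    return float(np.dot(y1, y2) / (np.linalg.norm(y1) * np.linalg.norm(y2)))

def merge_candidates(factors):
    """Return the pair of clusters with maximum cosine similarity."""
    best, pair = -2.0, None
    for i in range(len(factors)):
        for j in range(i + 1, len(factors)):
            s = cosine_similarity(factors[i], factors[j])
            if s > best:
                best, pair = s, (i, j)
    return pair, best
```

Iterating this merge until two clusters remain is exactly the stopping rule used by both hybrid approaches below.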
Hybrid Clustering: Different Approaches
• First approach: Level Cutting
– Stop the agglomerative clustering at a certain level and compute speaker factors for each cluster
– Merge the clusters that have the maximum similarity with respect to the cosine distance
– Iterate until only two clusters remain
• Second approach: Tree Searching
– Build the agglomerative clustering up to the top level
– Select the nodes that have a number of frames above a threshold
– Merge the clusters that have the maximum similarity with respect to the cosine distance
– Iterate until only two clusters remain
Hybrid Clustering: Level Cutting
[Diagram: agglomerative clustering tree cut at a fixed level; speaker factors Y1...Y5 are computed per cluster and merged by cosine similarity until two clusters remain]
Hybrid Clustering: Tree Searching
[Diagram: clustering tree with per-node frame counts; nodes with more than Threshold = 100 frames are selected as clusters (Y)]
Results On Summed Telephone Speech
System (2213 audio files)            Min DER (%)  Max DER (%)  Std (%)  Avg DER (%)
Streaming System                     0.0          53.2         8.8      4.6
Baseline Diarization System          0.0          57.2         12.3     6.8
Hybrid System 1 (Level Cutting)      0.0          67.0         14.6     17.1
Hybrid System 2 (Tree Search)        0.0          63.2         13.6     6.8
Roadmap
• Introduction
– Problem definition
– Experiment setup
• Diarization Systems
– Variational Bayes System
– Streaming and Hybrid Systems
• Analysis and Conclusions
DER vs EER: Summed Telephone Speech
• Some correlation of DER with EER
• Systems with DER < 10% have comparable EERs
• No clear knee in the curve
– Still have EER gains (over doing nothing) with a relatively poor DER = 20% system
[Scatter plot: EER (%) vs. DER (%)]
DER vs EER: Summed Telephone Speech
• Unclear trends among the low-DER systems
– VB + 2nd pass and BL + soft cluster
• DER may be too coarse a measure for effects on EER
[Zoomed scatter plot: EER (%) vs. DER (%) for Ref, LPT, VB, VB + 2nd pass, BL, BL + soft cluster and Hybrid]
Interview Speech
• Interview speech differs from telephone speech in two main aspects
– Audio quality is much more variable across microphones
– Conversations are dominated by the interviewee
• DER for do-nothing diarization (single speaker for all time)
– Telephone: 35%
– Interview: 11%
• The next challenge is to apply the diarization systems to the new domain
• Avoid the idealistic assumptions and knowledge used in the NIST eval
– No diarization in train or test: EER = 10.9%
– Ideal diarization in train and test: EER = 5.4%
Conclusion
• Implemented a variational Bayes diarization system using both segment and speaker posterior optimization
• Used the speaker factor model in three speaker diarization systems
– Streaming, VB, and hybrid
• Demonstrated the effectiveness of soft clustering for improving speaker diarization
• Produced low diarization error rates (3.5-4.5%) for telephone speech
• New challenges await in the interview speech domain
– Microphones
– Conversational patterns
Speaker Recognition: Factor Analysis Conditioning
(13 August 2008)
Sub-Team:
Sachin Kajarekar (SRI), Elly Na (GMU), Jason Pelecanos (IBM), Nicolas Scheffer (SRI), Robbie Vogt (QUT)
Overview
• Introduction
• A Phonetic Analysis
• Combination Strategies
• Within Session Variability Modeling
• Hierarchical Factor Analysis
• Review
Introduction
Problem Statement
– A single FA model is sub-optimal across different conditions
   - E.g.: different durations, phonetic content and recording scenarios
Goals
– Explore two approaches:
   - Build FA models specific to each condition and robustly combine the multiple models
   - Extend the FA model to explicitly model the condition as another source of variability
Results and Outcomes
– A conditioned FA model can provide improved performance
   - But score-level combination may not be the best way
– Including within-session factors in an FA model can reduce the sensitivity to utterance duration and phonetic content variability
– Stacking factors across conditions or data subsets can provide additional robustness
– Hierarchical modeling for Factor Analysis shows promise
– Applicability to other condition types: languages, microphones, …
Introduction: Speech Partitioning Overview
[Figure: train data ('w', 'ah1', 'n') and test data ('w', 'ow', 'd') partitioned into phonemes, each phoneme region mapped into the feature space]
Overview
• Introduction
• A Phonetic Analysis
• Combination Strategies
• Within Session Variability Modeling
• Hierarchical Factor Analysis
• Review
A Phonetic Analysis: Effect of Phonetic Mismatch

EER / Min. DCF     Enroll: Vowel     Enroll: Consonant
Test: Vowel        4.50% / 0.0208    12.47% / 0.0537
Test: Consonant    10.72% / 0.0521   7.03% / 0.0336

– How does the difference between the content in enrollment and test change the resulting performance?
– This result, albeit an extreme example, demonstrates the challenge of mismatched phonetic content
– This phenomenon is especially present for short-duration utterances
A Phonetic Analysis: Performance vs. % of Speech

Phoneme  Type  % of speech  DET 1 EER (%)  DET 1 DCF  DET 3 EER (%)  DET 3 DCF
E vowel 18.93 12.16 0.0567 8.62 0.0419
O vowel 10.71 14.57 0.0645 12.30 0.0558
i vowel 6.85 16.73 0.0749 15.49 0.0696
A: vowel 5.89 23.31 0.0876 21.79 0.0852
n nonvowel 5.44 19.08 0.0779 17.23 0.073
e: vowel 4.73 25.31 0.0917 22.92 0.0866
k stop 4.49 25.56 0.0926 22.26 0.0868
z sibilant 4.25 29.73 0.098 28.22 0.0971
o vowel 3.01 25.53 0.0924 25.24 0.0926
t stop 2.76 27.04 0.0956 24.92 0.0936
s sibilant 2.74 30.73 0.0965 27.63 0.0908
f sibilant 2.41 34.43 0.0998 31.42 0.0984
j nonvowel 2.38 25.00 0.0918 22.41 0.0862
v sibilant 2.35 33.66 0.1 30.78 0.0992
m nonvowel 2.29 21.18 0.0835 18.63 0.0782
S sibilant 2.21 31.97 0.0959 31.74 0.0981
l nonvowel 1.99 30.05 0.0974 29.91 0.0955
A Phonetic Analysis: Performance vs. % of Speech
[Scatter plot: EER (%) vs. % of speech for individual phonemes, grouped by class (vowel, nonvowel, sibilant, stop)]
A Phonetic Analysis: Fusion Analysis
Vowel with Others Vowel with Vowel
Phonemes % of Speech EER (%) Phonemes % of Speech EER (%)
E n 24.37 7.96 E O 29.64 7.04
E k 23.42 8.4 E i 25.78 7.58
E z 23.18 8.35 E A: 24.82 8.94
E t 21.69 8.72 E e: 23.66 8.55
E l 20.92 8.56 E o 21.94 8.29
O n 16.15 9.64 O i 17.56 9.42
O k 15.2 10.89 O A: 16.6 11.32
O z 14.96 11.76 O e: 15.44 11.32
O t 13.47 10.85 O o 13.72 11.38
i n 12.29 11.93 i A: 12.74 13.05
i z 11.1 14.41 i e: 11.58 13.91
n e: 10.17 14.4 A: e: 10.62 17.23
i s 9.59 14.35 i o 9.86 14.07
A: t 8.65 17.6 A: o 8.9 17.59
n j 7.82 13.71 e: o 7.74 18.46
z s 6.99 23.03
o t 5.77 19.34
t s 5.5 21.88
Fuse All 83.43 5.68
f v 4.76 26.72
S l 4.2 25.63
A Phonetic Analysis: Fusion Analysis
[Plot: fused-pair EER, vowel-with-others vs. vowel-with-vowel]
Overview
• Introduction
• A Phonetic Analysis
• Combination Strategies
• Within Session Variability Modeling
• Hierarchical Factor Analysis
• Review
Combination Strategies: Context
• Conditioned factor analysis
– Multiple systems for multiple conditions
– Multiple subspaces (e.g. microphones)
• Current solution
– Select the best system for each condition
– Perform score-level combination (our baseline)
• How to robustly gather information from these systems?
• Exploring combination strategies in the model space
• Candidate for the study: broad phone classes
– Work in the speaker space instead of the channel space
– Small set of events
• FA conditioning: smaller system configuration (512 Gaussians, 120 eigenvoices, 60 eigenchannels)
Combination Strategies: Baseline Results
Table of Results for Different Phone Sets (DET 1, SRE'06)

                                   % Data  EER (%)  Min. DCF
Vowels                             60      6.17     0.296
Consonants                         40      7.91     0.391
NonVowels                          15      10.7     0.502
Sibilants                          15      14.14    0.647
Stops                              10      15.27    0.685
Vow. + NV + Si + St. (4 classes)   100     5.42     0.272
Vow. + Cons. (2 classes)           100     5.20     0.262
Baseline                           100     5.12     0.241

Thanks to S. Kajarekar, C. Richey, SRI International
Combination Strategies: Stacked Eigenvectors
• In training, estimate different subspaces modeling the same kind of variability
– E.g. different utterance lengths, different sets of microphones
• In practice:
– Merge the supervectors generated by each subspace
– The new rank is the sum of the subspace ranks
– Can generate very large (and redundant) subspaces
• Advantages:
– No retraining at enrollment / recognition time
– No need for labeled data for system selection
– Increased robustness of the system in both scenarios (correlation between the two subspaces)
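Stacking amounts to column concatenation of the subspaces. A NumPy sketch with hypothetical ranks:

```python
import numpy as np

rng = np.random.default_rng(2)
sv_dim = 10

# Two eigenvoice subspaces trained on different conditions (hypothetical ranks)
V1 = rng.normal(size=(sv_dim, 4))
V2 = rng.normal(size=(sv_dim, 3))

# Stacking: concatenate the columns; the new rank is the sum of the ranks
V_stacked = np.hstack([V1, V2])

# A speaker offset V_stacked @ y decomposes as V1 @ y1 + V2 @ y2
y = rng.normal(size=V_stacked.shape[1])
offset = V_stacked @ y
```

The decomposition makes the redundancy concern concrete: correlated columns of V1 and V2 give overlapping directions, so the stacked subspace can be much larger than its effective rank.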
Combination Strategies: Combining U and/or V?
• Stacking U’s (channel): successfully demonstrated (at NIST) for a large set of microphones
• Stacking V’s (speaker): Suitable for phone conditioning as:– Phonetic models can represent the speaker– Precedents are P-GMM, MLLR systems, …
Table of 2-Class Stacking Results (DET 1, SRE'06)

               EER (%)  Min. DCF
Baseline       5.20     0.262
Stacked Us     5.72     0.279
Stacked Vs     5.14     0.243
Stacked Us+Vs  5.31     0.258
Combination Strategies: Augmented Eigenvectors
• Again, train several subspaces on the same kind of variability
• In practice:
– The subspace rank is unchanged
– Increases the model size
– Need to retrain the joint variability
– Not extendable to more than 2 or 3 classes
• Close to tied factor analysis
– Produces a single y, independent of the class
Combination Strategies: Factor Analysis Un-tying
• Augmented eigenvectors produce a common y for all conditions
• In practice:
– There is always a between-class error
– The error is averaged out by the ML algorithm
• Keep each speaker factor (y) from each class, with the error
– More parameters to describe a speaker
– Feed this input to a classifier
– Experiments with Gaussians as classes are promising
Combination Strategies: Results
Table of Results for Different Factor Configurations

# of Classes  Method         Configuration      EER    DCF
1             Baseline       Single system      5.12   0.241
1             Baseline       (x2) EV            4.83   0.239
2             Baseline       2-sys. fusion      5.20   0.262
4             Baseline       4-sys. fusion      5.42   0.272
2             Stacked V      2-sys. fusion      5.09   0.247
4             Stacked V      4-sys. fusion      5.03   0.250
2             Stacked V      Single system      5.14   0.243
4             Stacked V      Single system      4.76   0.234
2             Augmented      Single system      13.4   0.573
2             Augmented      Retrained (Tied)   5.39   0.266
16            Un-tied        Gaussian           4.54   0.233
Overview
• Introduction
• A Phonetic Analysis
• Combination Strategies
• Within Session Variability Modeling
• Hierarchical Factor Analysis
• Review
Within-session Variability Modeling
• The characteristics of inter-session variability are dependent on session duration
• This doesn't fit well with the JFA model
– Capturing more than channel!
– Speech content (phonetic information) averages out for long utterances but becomes significant for short utterances
79
Within-session Variability Modeling
• Proposed solution: model within-session variability as well
  – Break utterances into smaller segments, each described by:
      M = m + Vy + U_I x + U_W w + Dz
  – U is split into inter-session (U_I) and within-session (U_W) parts
• x is held constant for a whole utterance,
• but we have many w's!
• In this work we chose to align our segments with OLPR transcripts:
  – i.e. one w per phonetic event
  – Approx. 10 per second
  – Approx. 1000 in a NIST conversation side
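The extended model above can be sketched in a few lines of numpy. This is an illustrative toy, not the workshop code: all dimensions are made up (real systems use supervectors of roughly 2048 Gaussians times 60 features), and the factor values are random stand-ins. The point is the structure: one y per speaker, one x per utterance, and a separate w per phonetic-event segment.

```python
# Toy sketch of the extended JFA model M = m + V y + U_I x + U_W w + D z.
# All dimensions and values are illustrative assumptions, not trained quantities.
import numpy as np

rng = np.random.default_rng(0)
sv_dim = 120                      # supervector dimension (toy)
r_spk, r_inter, r_within = 10, 5, 3

m = rng.normal(size=sv_dim)                  # UBM mean supervector
V = rng.normal(size=(sv_dim, r_spk))         # eigenvoices
U_I = rng.normal(size=(sv_dim, r_inter))     # inter-session subspace
U_W = rng.normal(size=(sv_dim, r_within))    # within-session subspace
D = np.diag(rng.normal(size=sv_dim))         # diagonal residual matrix

y = rng.normal(size=r_spk)     # speaker factors: one per speaker
x = rng.normal(size=r_inter)   # inter-session factors: one per utterance
z = rng.normal(size=sv_dim)    # common (residual) factors

def segment_supervector(w):
    """Mean supervector for one phonetic-event segment; only w varies."""
    return m + V @ y + U_I @ x + U_W @ w + D @ z

# ~10 phonetic events per second -> many w's per utterance, but a single x.
segments = [segment_supervector(rng.normal(size=r_within)) for _ in range(4)]
```

Segments of the same utterance differ only through the low-rank within-session term U_W w, which is what lets a single model stay flexible across utterance lengths.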
80
(Animation frames: the previous slide repeated, building up the V, U_I and U_W subspaces in the figure.)
81
82
83
Within-session Variability Modeling: Single Phonetic Event
(Plot: variability contribution per feature dimension.)
84
85
Within-session Variability Modeling: Contribution Over Varying Utterance Lengths
(Plot: variability contribution per feature dimension, for several utterance lengths.)
86
Within-session Variability Modeling: Results
• Similar performance for full conversations
• Modest gains with reduced utterance lengths, mostly in EER
  – Better than matching U to the utterance length in most cases
  – Good flexibility across utterance lengths for a single model!

JFA Model      Subspace Dims  1conv EER / MinDCF  20 sec EER / MinDCF  10 sec EER / MinDCF
U+V+D          50             3.10% / .0159       12.79% / .0561       20.21% / .0819
U+V+D          60             3.03% / .0156       13.01% / .0562       20.31% / .0820
U_Matched+V+D  50             3.10% / .0159       12.20% / .0531       19.71% / .0814
U_I+U_W+V+D    50I + 10W      2.97% / .0170       11.98% / .0541       19.67% / .0807
87
Overview
• Introduction
• A Phonetic Analysis
• Combination Strategies
• Within Session Variability Modeling
• Hierarchical Factor Analysis
• Review
88
Hierarchical Factor Analysis
(Diagram: spectrum from a low-complexity, coarse-grain model to a high-complexity, fine-grain model.)
89
Hierarchical Factor Analysis: Multi-grained Hybrid Model
Such a model may compensate for session effects that cause both large regional variability and localized distortions.
A multi-grained model may be structured so that the nuisance kernel subspace has reduced complexity (fewer parameters) while preserving its compensation impact.
(Diagram: raw features → extract session-variability statistics → FA statistics for the L1 "Global" model and the L2 "Broad Phone" model → compensated features/statistics, used in the kernel or the model.)
90
Hierarchical Factor Analysis: Multi-grained GMM/Phone Model

Task                                      Condition 7  Condition 8
Base System with NAP                      0.179        0.182
Base System with Multigrained NAP         0.175        0.166
Broad Phone System with NAP               0.212        0.209
Broad Phone System with Multigrained NAP  0.206        0.190

Table of NIST 2008 Minimum DCF Results
Thanks to Jiri Navratil (IBM) for the phonetic results.
91
Hierarchical Factor Analysis: Multi-stage FA Broad Phone Model
Table of NIST 2006 Minimum DCF/EER Results

System        Phone Type  DET 3 - Base (Min DCF / EER)  DET 3 - ZTNorm (Min DCF / EER)
Baseline      NonVowel    0.0888 / 24.04%               0.0413 /  9.05%
Baseline      Sibilant    0.0988 / 30.28%               0.0584 / 13.05%
Baseline      Stop        0.0993 / 33.33%               0.0631 / 13.81%
Baseline      Vowel       0.0604 / 11.26%               0.0201 /  3.97%
Hierarchical  NonVowel    0.0852 / 23.24%               0.0420 /  9.53%
Hierarchical  Sibilant    0.0994 / 28.93%               0.0585 / 14.20%
Hierarchical  Stop        0.0991 / 33.27%               0.0655 / 14.63%
Hierarchical  Vowel       0.0482 / 10.29%               0.0206 /  3.91%
Baseline      Consonant   0.0839 / 20.48%               0.0323 /  6.28%
Baseline      Vowel       0.0604 / 11.26%               0.0201 /  3.97%
Hierarchical  Consonant   0.0777 / 18.26%               0.0312 /  6.45%
Hierarchical  Vowel       0.0482 / 10.29%               0.0206 /  3.91%
Fusion of hierarchical systems with baseline system gives modest improvements.
92
Review
Results and Outcomes
– A conditioned FA model can provide improved performance
  - But score-level combination may not be the best way
  - Automatic system selection may not be feasible
– Including within-session factors in an FA model can reduce sensitivity to utterance duration and phonetic content variability
– Stacking factors across conditions or data subsets can provide additional robustness
– Hierarchical modeling for factor analysis shows promise
– Applicability to other condition types: languages, microphones, …
94
Support Vector Machines and
Joint Factor Analysis
Najim DEHAK, Reda DEHAK, Zahi KARAM, and John NOECKER Jr.
95
Outline
• Introduction
• SVM-JFA: GMM supervector space
• SVM-JFA: Speaker factor space
• Intersession compensation in the speaker factor space
– Within Class Covariance Normalization
– Handling variability with SVMs
• SVM-JFA : Combined factor space
• Importance of speaker and channel factors
• Conclusion
96
Introduction
• Joint Factor Analysis is the state of the art in speaker verification.
• Combine discriminative and generative models.
• SVM-JFA:
  – Speaker GMM supervector space.
  – Speaker factor space.
  – Combination of factors.
• Intersession variability compensation in the speaker factor space.
97
SVM in GMM Supervector Space
s = m + Vy + Dz
98
SVM-JFA: Supervector Space
• Project each utterance into a high-dimensional space by stacking the mean vectors of the JFA-adapted GMM
• Use a KL-divergence-based kernel:
    K(g_a, g_b) = Σ_i ( √w_i Σ_i^(-1/2) μ_i^a )ᵀ ( √w_i Σ_i^(-1/2) μ_i^b )
  where w_i, Σ_i and μ_i are the weight, covariance and mean of Gaussian i.
• Our starting point: this kernel between two GMM supervectors (Campbell 06).
(Diagram: UBM → JFA adaptation of target and non-target utterances → SVM training → speaker SVM → SVM scoring → detection score.)
Thanks to Douglas Reynolds for this slide
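The KL-based supervector kernel on the slide can be written out directly. The sketch below uses toy dimensions and random stand-ins for the UBM weights and diagonal covariances (in a real system these come from the trained UBM); the function name and variables are ours, not from the workshop code.

```python
# Sketch of the KL-divergence-based linear kernel between two GMM mean
# supervectors (after Campbell '06). UBM weights and diagonal covariances
# here are random stand-ins; dimensions are toy values.
import numpy as np

rng = np.random.default_rng(1)
n_comp, feat_dim = 8, 5
weights = rng.dirichlet(np.ones(n_comp))                 # UBM mixture weights
sigma = rng.uniform(0.5, 2.0, size=(n_comp, feat_dim))   # diagonal covariances

def kl_kernel(means_a, means_b):
    """K(g_a, g_b) = sum_i (sqrt(w_i) Sigma_i^{-1/2} mu_i^a)^T
                            (sqrt(w_i) Sigma_i^{-1/2} mu_i^b)."""
    k = 0.0
    for i in range(n_comp):
        scale = np.sqrt(weights[i]) / np.sqrt(sigma[i])  # per-dim scaling
        k += (scale * means_a[i]) @ (scale * means_b[i])
    return k

mu_a = rng.normal(size=(n_comp, feat_dim))   # adapted means, utterance a
mu_b = rng.normal(size=(n_comp, feat_dim))   # adapted means, utterance b
k_ab = kl_kernel(mu_a, mu_b)
```

Because the scaling is applied identically to both arguments, this is an ordinary inner product in a weighted feature space, so a linear SVM can use it directly.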
99
JFA Configuration
• Gender-independent JFA.
• 2048 Gaussians, 60-dimensional features
  – 19 Gaussianized MFCCs + energy + delta + double delta
• 300 speaker factors.
• 100 channel factors for telephone speech.
• Decoupled estimation of the eigenvoice matrix and the diagonal matrix (D).
• JFA hyper-parameters are trained on the MIXER and Switchboard databases.
100
SVM-JFA: Supervector Space
• Results on NIST 2006 and 2008 SRE, core condition, telephone-telephone data (EER):

                        NIST 2006 SRE          NIST 2008 SRE
                        English   All trials   English   All trials
JFA: s = m+Vy           1.95%     3.01%        2.81%     5.58%
JFA: s = m+Vy+Dz        1.80%     2.96%        2.81%     5.69%
SVM-JFA: s = m+Vy       4.24%     4.98%        5.10%     7.92%
SVM-JFA: s = m+Vy+Dz    4.23%     4.92%        5.27%     8.13%

• JFA scoring is computed frame by frame.
101
SVM in JFA-Speaker Factors Space
s = m + V y
102
SVM-JFA: Speaker Factor Space
• Use the speaker factors y rather than the GMM supervector s:
  – s = m + Vy
  – c = Ux
• Low-dimensional space
• We can perform quick experiments.
• First we used only eigenvoice adaptation (D = 0).
103
SVM-JFA: Speaker Factor Space
• s = m + Vy
• Inner-product (linear) kernel:
    k(y1, y2) = ⟨y1, y2⟩
• Gaussian kernel:
    k(y1, y2) = exp( -‖y1 - y2‖² / (2σ²) )
• Cosine kernel:
    k(y1, y2) = ⟨y1, y2⟩ / ( ‖y1‖ ‖y2‖ )
104
SVM-JFA: Speaker Factor Space
• Results on NIST 2006 SRE, core condition (EER):

                          English                     All trials
                          No-norm  T-norm  Zt-norm   No-norm  T-norm  Zt-norm
JFA                       -        -       1.95%     -        -       3.01%
KL kernel (supervectors)  -        4.24%   -         -        4.98%   -
Linear kernel             3.47%    2.93%   -         4.64%    4.04%   -
Gauss kernel              3.03%    2.98%   -         4.59%    4.46%   -
Cosine kernel             3.08%    2.92%   -         4.18%    4.15%   -
105
SVM-JFA: Speaker Factor Space
• Results on NIST 2008 SRE, core condition, telephone-telephone data (EER):

                          All trials
                          No-norm  T-norm  Zt-norm
JFA                       -        -       5.58%
KL kernel (supervectors)  -        7.92%   -
Linear kernel             7.06%    7.10%   -
Gauss kernel              7.84%    7.42%   -
Cosine kernel             7.24%    7.24%   -
106
Intersession Compensation in Speaker Factors Space
Within Class Covariance Normalization in SVM speaker factors space.
107
Within Class Covariance Normalization (WCCN)
• Within-class covariance:

    W = (1/S) Σ_{s=1..S} (1/n_s) Σ_{i=1..n_s} (y_i^s - ȳ_s)(y_i^s - ȳ_s)ᵀ

  – ȳ_s = (1/n_s) Σ_i y_i^s : mean of the utterances of speaker s
  – S : number of speakers
  – n_s : number of utterances for speaker s
  – The within-class covariance is computed on the MIXER and Switchboard databases.
108
SVM-JFA: Speaker Factor Space (WCCN)
• s = m + Vy
• Linear kernel:
    k(y1, y2) = y1ᵀ W⁻¹ y2
• Cosine kernel:
    k(y1, y2) = y1ᵀ W⁻¹ y2 / ( √(y1ᵀ W⁻¹ y1) √(y2ᵀ W⁻¹ y2) )
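The WCCN kernels above amount to swapping the Euclidean inner product for y1ᵀ W⁻¹ y2, or equivalently mapping y → chol(W⁻¹)ᵀ y and then using the plain kernels. The sketch below uses a random SPD matrix as a stand-in for the trained within-class covariance.

```python
# WCCN kernels: replace the Euclidean inner product with y1^T W^{-1} y2.
# W here is a toy SPD matrix standing in for the trained within-class cov.
import numpy as np

rng = np.random.default_rng(4)
dim = 6
A = rng.normal(size=(dim, dim))
W = A @ A.T + dim * np.eye(dim)        # stand-in SPD within-class covariance
W_inv = np.linalg.inv(W)
L = np.linalg.cholesky(W_inv)          # W^{-1} = L L^T

def wccn_linear(y1, y2):
    return float(y1 @ W_inv @ y2)

def wccn_cosine(y1, y2):
    num = y1 @ W_inv @ y2
    return float(num / np.sqrt((y1 @ W_inv @ y1) * (y2 @ W_inv @ y2)))

y1, y2 = rng.normal(size=dim), rng.normal(size=dim)
# The Cholesky mapping y -> L^T y reproduces the same linear kernel value.
same = np.isclose(wccn_linear(y1, y2), (L.T @ y1) @ (L.T @ y2))
```

The Cholesky form is what makes WCCN cheap in practice: vectors can be transformed once at enrollment time, after which ordinary dot products suffice.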
109
SVM-JFA: Speaker Factor Space (WCCN)
• Results on NIST 2006 SRE, core condition, English trials (EER):

                 Without WCCN        With WCCN
                 No-norm  T-norm     No-norm  T-norm
Linear kernel    3.47%    2.93%      2.87%    2.44%
Cosine kernel    3.03%    2.98%      2.60%    2.45%
JFA scoring      1.95%

• WCCN gives roughly a 17% relative improvement.
110
Intersession Compensation in Speaker Factors Space
Handling variability with SVMs
111
Handling Variability with SVMs: Good and Bad Variability
• Types of variability:
  – Good variability (inter-speaker): speaker information
  – Bad variability (nuisance): session, channel
• Handling variability:
  – Joint Factor Analysis
  – GMM + LFA
  – SVM + NAP
  – SVM + WCCN
112
Handling Variability with SVMs: Motivation (Handling Nuisance)
(Figure: nuisance estimate Uw along the principal nuisance dimension.)
114
Handling Variability with SVMs: SVM Formulation
(Figure: SVM slack variables ξ = 0, 1, 2, …, ∞ relative to the margin, with the nuisance direction Uw.)
116
Handling Variability with SVMs: Results
• Using only 300 speaker factors, s = m + Vy; nuisance subspace dimension = 50
(Plot: 11% and 18% relative improvements.)
117
Handling Variability with SVMs: Future Work
• Beyond nuisance compensation:
  – Bias towards using inter-speaker variability
  – Handle all variability: bias towards inter-speaker, bias away from nuisance
• Extend the formulation to full supervectors
118
SVM-JFA : Speaker and Common Factors Space
s = m + Vy + Dz
119
SVM-JFA: Speaker and Common Factor Space
• Full Joint Factor Analysis: s = m + Vy + Dz
  – y : speaker factors
  – z : common factors (EER 6.23% on NIST 2006 SRE, English trials)
• How to use speaker and common factors with SVMs?
  – Score fusion
  – Kernel combination
120
Scores Fusion vs. Kernels Combination
• Score fusion: linear weighted score
    S_F(x) = w_0 + Σ_{l=1..M} w_l S_l(x)
  – Weights are computed on a development score dataset
• Kernel combination: linear weighted kernel
    k_F(x, y) = Σ_{l=1..M} β_l k_l(x, y)
  – No development dataset needed for weight estimation
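Both formulas fit in a few lines. The sketch below uses made-up component kernels and weights purely to show the structure: score fusion combines the outputs of complete systems, while kernel combination builds one SVM from a weighted sum of kernels.

```python
# The two combination formulas side by side, with made-up components:
# score fusion  S_F(x) = w0 + sum_l w_l S_l(x)
# kernel comb.  k_F(x, y) = sum_l beta_l k_l(x, y)
import numpy as np

def fuse_scores(weights, bias, component_scores):
    return bias + float(np.dot(weights, component_scores))

def combine_kernels(betas, kernels, x, y):
    return sum(b * k(x, y) for b, k in zip(betas, kernels))

# Illustrative component kernels: cosine on one factor set, dot on another.
k_y = lambda a, b: float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
k_z = lambda a, b: float(a @ b)

rng = np.random.default_rng(6)
x, y = rng.normal(size=4), rng.normal(size=4)
combined = combine_kernels([0.7, 0.3], [k_y, k_z], x, y)
fused = fuse_scores(np.array([0.7, 0.3]), -0.1, np.array([1.2, -0.4]))
```

A sum of valid kernels with non-negative weights is itself a valid kernel, which is why the combined function can be handed straight to an SVM trainer.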
121
Kernel Combination Space
(Diagram: a kernel in the speaker factor space (y) and a kernel in the common factor space (z) are combined into a new kernel function.)
122
Kernel Combination Training
• Linear kernel combination: learn a linear combination
    k_F(x, y) = Σ_{l=1..M} β_l k_l(x, y)
• Large-margin classifier (SVM): maximize the margin
• Multiple-kernel SVM training: find the β_l that maximize the margin
123
Results

                      NIST 2006 SRE        NIST 2008 SRE
                      English  All trials  English  All trials
Cosine kernel on y    2.34%    3.59%       3.86%    6.55%
Cosine kernel on z    6.26%    8.68%       10.34%   13.45%
Linear score fusion   2.11%    3.62%       3.23%    6.86%
Kernel combination    2.08%    3.62%       3.20%    6.60%
124
Importance of Speaker and Channel Factors
125
Importance of Speaker and Channel Factors
• Gender-dependent JFA (female part).
• 2048 Gaussians, 60-dimensional features
  – 19 Gaussianized MFCCs + energy + delta + double delta
• 300 speaker factors.
• 0 common factors.
• 100 channel factors for telephone speech.
• JFA hyper-parameters are trained on the MIXER and Switchboard databases.
126
Importance of Speaker and Channel Factors
M = m + Vy + Dz + Ux
• Applying intersession compensation in the speaker factor space rather than the supervector space gave EER = 20%.
• Oops: the channel factors contain information about the speaker.
127
Importance of Speaker and Channel Factors
• Three systems:
  – M = m + Vy + Ux
  – M = m + Vy
  – M = m + Tt
• V : low-rank matrix, eigenvoices (speaker variability, 300 dims)
• U : low-rank matrix, eigenchannels (channel variability, 100 dims)
• T : low-rank matrix containing the total variability (speaker and channel, 400 dims)

                   EER
M = m + Vy + Ux    2.56%
M = m + Vy         2.74%
M = m + Tt         2.19%

Results on NIST 2006 SRE, core condition, English trials, female part (using the cosine kernel).
128
Conclusion
• SVM scoring in the speaker factor space rather than the GMM supervector space:
  – Reasonably good linear separation in the speaker factor space
  – Improves over SVM-JFA on supervectors
  – Performance comparable to other scoring methods
  – Allows for faster scoring
• Generalized using combinations of factors
• Further improvement using intersession compensation in the speaker factor space
• JFA as feature extraction
130
SVM vs. Cosine Distance (CD) Scoring
• SVMs find a linear separation in the speaker factor space.
• The score can be computed as a nearest-neighbor distance.
• Can we omit the SVM and use a different distance?
131
SVM vs. CD Scoring
• The idea is to compute the trial score as the cosine distance between the enrollment and test speaker models:

    score = ( y_trainᵀ y_test ) / ( ‖y_train‖ ‖y_test‖ )

• Inspired by the SVM, normalization by the within-class covariance (WCCN) can be applied, mapping y → chol(W⁻¹)ᵀ y before taking the cosine distance.
132
SVM vs. CD Scoring: Results in the y Space

                                  2006 det3       2006 det1       2008 det6
                                  EER    DCF      EER    DCF      EER    DCF
SVM w. cosine kernel, WCCN        2.38   1.27     3.62   1.80     6.59   3.23
SVM w. cosine kernel              2.98   1.55     4.09   2.02     7.20   3.24
Cosine dist w. WCCN, ZT-norm      2.00   1.19     3.82   1.93     6.43   3.60
Cosine dist w. ZT-norm            2.82   1.50     4.07   2.11     6.95   3.55

• SVM: does not need any score normalization
• CD scoring: applies gender-dependent ZT-norm
133
SVM vs. CD Scoring: Extending to the x, y, z Space
• Motivation:
  – z carries residual speaker information, which we want to use.
  – Ideally there should be no speaker information in the x vector, since it expresses the channel shift.
  – We tried to substitute the y vectors by x and z.
  – z was much worse than y, but still gave a reasonable result (approx. 9% EER on 2006 det1).
  – Surprisingly, using x gave around 25% EER on 2006 det1.
  – So let's train a linear fusion of all these systems.
134
SVM vs. CD Scoring: Results in the x, y, z Space

                                  2006 det3       2006 det1       2008 det6
                                  EER    DCF      EER    DCF      EER    DCF
SVM w. cosine kernel              2.98   1.55     4.09   2.02     7.20   3.24
SVM w. cosine kernel - y,z        2.08   1.27     3.62   2.00     6.60   3.41
Cosine dist w. ZT-norm            2.82   1.50     4.07   2.11     6.95   3.55
Cosine dist w. ZT-norm - x,y,z    2.11   1.26     3.62   1.87     6.24   3.29

• SVM: kernel combination; uses y, z only (no improvement using x)
• CD scoring: applies gender-dependent ZT-norm; uses gender-dependent linear logistic regression (for 2006, trained on 2008 and vice versa)
135
SVM vs. CD Scoring: Conclusion
CD Scoring
• Positives:
  – The scoring problem is symmetric
  – No training steps
• Negatives:
  – ZT-norm is needed
  – Poorer relative improvement on all trials (det1); possibly needs calibration
SVM
• Positives:
  – Generalizes well on all trials (det1)
  – No need for score normalization
• Negatives:
  – SVM training procedure
137
Discriminative Optimization of Speaker Recognition Systems
Lukas Burget & Niko Brummer, with lots of help from other team members, especially
Ondrej Glembek, Najim Dehak and Valja Hubeika.
138
Discriminative Training: What is new here?
• Discriminative training of speaker models has been around for more than a decade,
• and SVM speaker modeling has been a constant feature at the NIST SRE evaluations since 2003.
• So what is new in this work?
139
Discriminative Training: What is new here?
• We propose to discriminatively optimize the whole speaker recognition system, rather than individual speaker models.
140
Traditional Discriminative Training
(Diagram: enrollment and test speech → feature extraction → model estimation and matching → score; discriminative optimization is applied to the speaker model.)
141
Current State-of-the-Art
(Diagram: the same pipeline, with system hyper-parameters obtained by generative modeling via Joint Factor Analysis with ML optimization.)
142
Proposed Discriminative System Optimization
(Diagram: the same pipeline, with the system hyper-parameters themselves trained by discriminative optimization of the output score.)
143
Proposed Discriminative System Optimization
(Diagram: discriminative optimization applied to the system hyper-parameters.)
This methodology directly measures and optimizes the quality of the output of the whole system.
144
Discriminative Training: What is new here?
• Typically we have only a small amount of enrollment data per target speaker, which precludes the use of standard discriminative techniques
• We need to consider inter-session variability, an important problem in SRE
• Only recent data collections, with the same speaker recorded over various channels, allowed us to start work in this direction
145
Discriminative Training: What is new here?
(Figure: UBM, target speaker model, and test data under high inter-session and high speaker variability; the generative decision boundary vs. a discriminative decision boundary.)
146
Discriminative System Optimization
• Motivation
• Envisioned advantages
• Challenges
• A few techniques to address these challenges
• Some preliminary experiments
147
Discriminative System Optimization: Motivation
• Several participants of this workshop have previous successful experience with similar training:
  – Discriminative training of weighted linear combinations of the outputs of multiple sub-systems (a.k.a. fusion) has been very successful in the last few NIST Speaker Recognition Evaluations (neural networks, SVM, logistic regression)
  – Lukas and BUT were very successful with discriminative (MMI) training of GMMs in the similar task of language recognition in the last two NIST LREs
148
Discriminative System Optimization: Envisioned Advantages
• Discriminative training can compensate for unrealistic generative modeling assumptions.
– Could find hyperparameter estimates that give better accuracy than ML estimates
• Discriminative training can optimize smaller, simpler, faster systems to rival the accuracy of larger generatively trained systems
– In this workshop, we concentrated on this aspect, with a few encouraging results
149
Discriminative System Optimization: Challenges
• This is a difficult problem!
  – In large LVCSR systems it took years for discriminative methods to catch up with generative ones
• Challenges include:
  – complexity and scale of implementation
  – overtraining
150
Discriminative System Optimization: Challenges
• Complexity:
  – Computing derivatives (gradient, Hessian) for the optimization of complex systems
  – Finding and coding good numerical optimization algorithms
• Scale (CPU, memory):
  – Our current state-of-the-art systems can have tens of millions of parameters
  – 1500 hours of training speech, or 250 million training examples
• Overtraining (up to millions of parameters)
151
Techniques: Computing Derivatives
• We tried a coding technique that automatically implements the chain rule for partial derivatives of function compositions
  – Similar to back-propagation in neural networks
  – Computationally equivalent to reverse-mode automatic differentiation
  – Did not scale well for our problem: it involved multiplying several Jacobian matrices of very large dimension
• Our solution was to restrict our efforts to very simple system designs, for which the derivatives could be hand-coded and optimized
152
Derivatives: Hand-Optimized
• Lukas hand-optimized a gradient calculation of 6 million components over 440 million training examples to run in 15 minutes on a single machine
• This was made possible by:
  – Replacing the GMM log-likelihood calculation with a linear approximation (without significant performance loss)
  – Not doing ZT-norm (at some performance loss)
153
Techniques: Optimization Algorithms
• We investigated stochastic gradient descent (after an inspiring invited talk here at JHU by Yann LeCun)
  – Did not scale well in our computing environment
  – Difficult to set hyperparameters
  – Not obvious how to parallelize over machines
154
Techniques: Optimization Algorithms
• We investigated MATLAB's optimization toolbox
  – Tried the 'large scale' trust-region optimization algorithm
  – Did not scale well in time and space
  – Needs further investigation ...
155
Techniques: Optimization Algorithms
• Lukas was successful in his experiments with the Extended Baum-Welch algorithm
• I was successful in my experiments with the RPROP1 algorithm
• In both cases, we coded our own optimization algorithms in MATLAB for economy of scale
1. See http://en.wikipedia.org/wiki/Rprop
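For reference, the RPROP rule mentioned above fits in a few lines: each parameter keeps its own step size, which grows while the gradient sign is stable and shrinks when it flips, and only the sign of the gradient is used. This is a minimal sketch on a toy quadratic, with the standard growth/shrink constants, not the workshop's MATLAB implementation.

```python
# Minimal RPROP update rule: per-parameter step sizes adapted from gradient
# sign changes. Toy run minimizing a quadratic; eta+/eta- are the usual
# textbook constants, not values from the workshop experiments.
import numpy as np

def rprop_minimize(grad_fn, x0, steps=100, eta_plus=1.2, eta_minus=0.5):
    x = np.asarray(x0, dtype=float)
    delta = np.full_like(x, 0.1)          # per-parameter step sizes
    prev_grad = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        sign_change = g * prev_grad
        delta = np.where(sign_change > 0, delta * eta_plus, delta)   # stable sign
        delta = np.where(sign_change < 0, delta * eta_minus, delta)  # sign flipped
        x = x - np.sign(g) * delta        # only the gradient's sign is used
        prev_grad = g
    return x

# Minimize f(x) = sum((x - 3)^2); gradient is 2 (x - 3).
x_opt = rprop_minimize(lambda x: 2.0 * (x - 3.0), np.zeros(5))
```

Ignoring gradient magnitudes is exactly what makes RPROP attractive here: it is insensitive to the very different scales of the millions of hyper-parameters being trained.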
156
Objective Function
• Our discriminative optimization objective function has many names: Maximum Mutual Information (MMI), minimum cross-entropy, logistic regression, ...
  – This criterion optimizes classification error rates over wide ranges of priors and cost functions
  – For linear systems, it gives a nice convex optimization objective
  – It gives some protection against overtraining
  – It has been very successfully applied to the fusion of sub-systems in NIST SRE evaluations
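In its logistic-regression form, the objective is just the binary cross-entropy of calibrated scores against target/non-target labels. The sketch below evaluates it on synthetic scores; the affine calibration a·s + b and the Gaussian score distributions are illustrative assumptions.

```python
# The objective in its logistic-regression form: binary cross-entropy of
# affinely calibrated scores a*s + b over target/non-target trials.
# All data here is synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(8)
targets = rng.normal(2.0, 1.0, size=500)         # scores of target trials
nontargets = rng.normal(-2.0, 1.0, size=5000)    # scores of non-target trials
scores = np.concatenate([targets, nontargets])
labels = np.concatenate([np.ones(500), np.zeros(5000)])

def cross_entropy(a, b):
    """Mean binary cross-entropy of sigmoid(a*s + b) against the labels."""
    p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
    eps = 1e-12                                   # numerical safety
    return -np.mean(labels * np.log(p + eps)
                    + (1 - labels) * np.log(1 - p + eps))

ce_informative = cross_entropy(1.0, 0.0)
ce_flat = cross_entropy(0.0, 0.0)   # uninformative system: log 2
```

Minimizing this in (a, b) is convex, which is part of why the same criterion has worked so well for score fusion.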
157
Overtraining
• I was optimizing 90 000 parameters and Lukas 6 million
• This allows the training to learn irrelevant detail of the training data (even though we used hundreds of millions of training examples)
• We both managed to optimize EER << 1% on the development data (Switchboard, SRE'04+05) if we allowed the training to go too far
• These overtrained systems did not generalize to good performance on independent test data (SRE'06+08)
158
Regularization to Combat Overtraining
• We used early stopping to combat overtraining
  – just stop training when performance on a cross-validation set stops improving
• We hope to apply more principled approaches in the future:
  – adding SVM-style regularization penalties, or
  – more general Bayesian methods with appropriate priors on the hyperparameters
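The early-stopping rule above is simple enough to state as code. This sketch tracks a cross-validation metric per epoch and stops after a fixed patience without improvement; the error sequence is made up to mimic a run that starts to overtrain.

```python
# Early stopping: keep the epoch with the best cross-validation error and
# stop once it has not improved for `patience` epochs. Errors are made up.
def early_stopping_index(val_errors, patience=3):
    best, best_epoch = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch       # new best checkpoint
        elif epoch - best_epoch >= patience:
            break                               # stalled: stop training
    return best_epoch                           # parameters kept from here

# Validation error improves, then degrades as overtraining sets in.
errors = [5.0, 3.1, 2.4, 2.2, 2.3, 2.5, 2.9, 3.6]
stop_at = early_stopping_index(errors)
```

The returned index is the checkpoint whose parameters are kept, which is what makes early stopping a (crude) regularizer rather than just a stopping rule.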
159
Proof of Concept Experiments
• Niko: smaller-scale experiment using 300-dimensional y-vectors for train and test, training 90 000 parameters
• Lukas: larger-scale experiments using 300-dimensional y-vectors for train and a 20 000-dimensional statistic for test, training 6 million parameters
160
Small Scale Experiment
• Within-class covariance-normalized dot product between y-vectors for train and test
• The generative (ML) covariance estimate gives, on a subset (English females) of SRE 2006: EER = 2.61%
• Discriminative retraining of the covariance gave an 11% relative improvement: EER = 2.33%
161
Large Scale Experiment 1
• Pure eigenvoice system (only V; no U and D)
• GMM with 512 components; 39D features
• V matrix trained discriminatively (300x20k parameters)
• Fixed original speaker factors y
EER [%]                                      No norm   ZT-norm
Generative V                                 15.44     11.42
Discriminative V                             7.19      5.06
Discriminative V with channel-compensated y  6.80      4.81
Generative V and U                           6.99      4.07
SRE 2006 all trials (det1)
162
Large Scale Experiment 2
• Channel compensated system (V and U; no D)
• Only V matrix trained discriminatively
• Fixed original speaker factors y
EER [%]                          No norm   ZT-norm
Generative V and U               6.99      4.07
Discriminative V, generative U   6.00      3.87
SRE 2006 all trials (det1)
163
Next Steps
• Re-estimation of other hyperparameters (e.g. U)
• Iterative re-estimation of both hyperparameters and factors
• Direct optimization of ZT-normalized system (derivatives difficult to compute)
164
Conclusion
• This is a large and difficult problem.
• But it has the potential for worthwhile gains:
  – the possibility of more accurate, yet faster and smaller, systems.
• We have managed to show some proof of concept, but so far without improving on the state of the art.
• The remaining problems are both practical and theoretical:
  – complexity of optimization
  – principled methods for combating overtraining
166
Robust Speaker Recognition: Summary
Diarization
• Examined application of JFA and Bayesian methods to diarization
• Produced 3-4% DER on summed telephone speech
• Working on challenging interview speech

Factor Analysis Conditioning
• Explored ways to use JFA to account for non-session variability (phone)
• Showed robustness using within-session, stacking and hierarchical modeling

SVM-JFA
• Developed techniques to use JFA elements in SVM classifiers
• Results comparable to the full JFA system, but with fast scoring and no score normalization
• Better performance using all JFA factors

Discriminative System Optimization
• Focused on means to discriminatively optimize the whole speaker recognition system
• Demonstrated proof-of-concept experiments
167
Robust Speaker Recognition
• Extremely productive and enjoyable workshop
• Aim is to continue collaboration in problem areas going forward
• Cross-site, joint efforts will provide big gains in future speaker recognition evaluations and experiments
• Possible special session at ICASSP on team workshop efforts