Robust speaker recognition over varying channels

JHU WS'08 RSR Team: Niko Brummer, Lukas Burget, William Campbell, Fabio Castaldo, Najim Dehak, Reda Dehak, Ondrej Glembek, Valiantsina Hubeika, Sachin Kajarekar, Zahi Karam, Patrick Kenny, Jason Pelecanos, Douglas Reynolds, Nicolas Scheffer, Robbie Vogt


Intersession Variability

• Variability refers to changes in channel effects between training and successive detection attempts

• Channel/session variability encompasses several factors:
– The microphone: carbon-button, electret, hands-free, array, etc.
– The acoustic environment: office, car, airport, etc.
– The transmission channel: landline, cellular, VoIP, etc.
– The speaker's voice: aging, mood, spoken language, etc.

• Example (NIST SRE2008, interview speech): with the same microphone in training and test, < 1% EER; with different microphones in training and test, about 3% EER

The largest challenge to practical use of speaker detection systems is channel/session variability.


Tools to fight unwanted variability


M = m + Vy + Dz + Ux

Joint Factor Analysis


System based on Joint Factor Analysis

Baseline System

[DET plot: Miss Probability vs. False Alarm Probability — NIST SRE08, short2-short3, telephone speech in training and test]

NIST SRE Evaluations

• Annual NIST evaluations of speaker verification technology (since 1995) using a common paradigm for comparing technologies

• All the team members participated in the recent 2008 NIST evaluation

• The JHU workshop provided a great opportunity to:
– do common post-evaluation analysis of our systems
– combine and improve techniques developed by individual sites

• Thanks to the NIST evaluations we have:
– identified some of the current problems that we worked on
– a well defined setup and evaluation framework
– baseline systems that we were trying to extend and improve during the workshop

Subgroups

• Diarization using JFA
• Factor Analysis Conditioning
• SVM – JFA and fast scoring
• Discriminative System Optimization

Diarization using JFA

Problem Statement
– Diarization is an important upstream process for real-world multi-speaker speech
– At one level, diarization depends on accurate speaker discrimination for change detection and clustering
– JFA and Bayesian methods have the promise of providing improvements to speaker diarization

Goals
– Apply diarization systems to summed telephone speech and interview microphone speech:
  • Baseline segmentation–agglomerative clustering
  • Streaming system using speaker-factor features
  • New variational Bayes approach using eigenvoices
– Measure performance in terms of DER and effect on speaker detection

Factor Analysis Conditioning

Problem Statement
– A single FA model is sub-optimal across different conditions
– E.g.: different durations, phonetic content and recording scenarios

Goals
– Explore two approaches:
  • Build FA models specific to each condition and robustly combine multiple models
  • Extend the FA model to explicitly model the condition as another source of variability

SVM – JFA

Problem Statement
– The Support Vector Machine is a discriminative recognizer which has proved to be useful for SRE
– Parameters of generative GMM speaker models are used as features for a linear SVM (sequence kernels)
– We know Joint Factor Analysis provides higher-quality GMMs, but using these as-is in SVMs has not been so successful

Goals
– Analysis of the problem
– Redefinition of SVM kernels based on JFA?
– Application of JFA vectors to recently proposed and closely related bilinear scoring techniques which do not use SVMs

Discriminative System Optimization

Problem Statement
– Discriminative training has proved very useful in speech and language recognition, but has not been investigated in depth for speaker recognition
– In both speech and language recognition, the classes (phones, languages) are modeled with generative models, which can be trained with copious quantities of data
– But in speaker recognition, our speaker GMMs have at best a few minutes of training data, typically from only one recording of the speaker

Goals
– Reformulate the speaker recognition problem as binary discrimination between pairs of recordings, which can be (i) of the same speaker, or (ii) of two different speakers
– We now have lots of training data for these two classes and we can afford to train complex discriminative recognizers

Relevance MAP adaptation

[Illustration: UBM, target speaker model and test data; 2-D features, a single-Gaussian model, and only the mean vector(s) are adapted]

Intersession variability

[Illustration: directions of high inter-session variability and high speaker variability around the UBM and the target speaker model]

Intersession variability

[Illustration: as above, with test data and the decision boundary between the UBM and the target speaker model]

Intersession compensation

For recognition, move both models along the high inter-session variability direction(s) to fit the test data well (e.g. in the ML sense).

[Illustration: UBM, target speaker model and test data shifted along the high inter-session variability direction]

Joint Factor Analysis model

• Probabilistic model proposed by Patrick Kenny

• Speaker model represented by the mean supervector M = m + Vy + Dz + Ux
– U – subspace with high intersession/channel variability (eigenchannels)
– V – subspace with high speaker variability (eigenvoices)
– D – diagonal matrix describing the remaining speaker variability not covered by V
– Gaussian priors are assumed for the speaker factors y, z and the channel factors x (a toy illustration of the decomposition follows)
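To make the roles of m, V, D, U and the hidden factors concrete, here is a minimal numpy sketch of how a session-dependent mean supervector is composed under M = m + Vy + Dz + Ux; all dimensions and matrices are illustrative placeholders, not the workshop configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

CF, NG = 3, 4            # toy setup: 3-dim features, 4-component GMM
SV = CF * NG             # supervector dimension
R_V, R_U = 2, 1          # illustrative ranks of the speaker and channel subspaces

m = rng.normal(size=SV)          # UBM mean supervector
V = rng.normal(size=(SV, R_V))   # eigenvoices (speaker subspace)
U = rng.normal(size=(SV, R_U))   # eigenchannels (channel subspace)
D = np.diag(rng.uniform(0.1, 1.0, size=SV))  # residual speaker variability

# Standard-normal priors on the hidden factors
y = rng.normal(size=R_V)   # speaker factors (shared across a speaker's sessions)
z = rng.normal(size=SV)    # common/residual speaker factors
x = rng.normal(size=R_U)   # channel factors (one per session)

M = m + V @ y + D @ z + U @ x   # session-dependent mean supervector
print(M.shape)                  # (12,)
```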

[Illustration: 3-D space of model parameters (e.g. a 3-component GMM with 1-D features)]


Working with JFA

• Enrolling a speaker model:
– Given enrollment data and the hyperparameters m, Σ, V, D and U, obtain MAP point estimates (or posterior distributions) of all factors x, y, z
– Most of the speaker information is in the low-dimensional vector y; less is in the high-dimensional vector z; x should contain only channel-related information

• Test:
– Given fixed (distributions of) speaker-dependent factors y and z, obtain new estimates of the channel factors x for the test data
– The score for a test utterance is the log likelihood ratio between the UBM and the speaker model defined by the factors x, y, z

• Training hyperparameters:
– The hyperparameters m, Σ, V, D and U can be estimated from training data using an EM algorithm
– Posterior distributions of the "hidden" factors x, y, z and the hyperparameters are alternately estimated to maximize the likelihood of the training data
– The distributions of the speaker factors y, z are constrained to be the same for all segments of the same speaker, while the channel factors x may be different for every segment

Flavors of JFA

• Relevance MAP adaptation: M = m + Dz with D² = Σ/τ, where Σ is the matrix with the UBM variance supervector on its diagonal

• Eigenchannel adaptation (SDV, BUT)
– Relevance MAP for enrolling the speaker model
– Adapt the speaker model to the test utterance using U estimated by PCA

• JFA without V, with D² = Σ/τ (QUT, LIA)

• JFA without V, with D trained from data (CRIM)
– Can be seen as training a different τ for each supervector coefficient
– Effective relevance factor: τ_eff = trace(Σ)/trace(D²) (see the sketch after this list)

• JFA with V (CRIM)
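As a quick check on the relation above, the effective relevance factor can be computed directly from Σ and a diagonal D; the values below are random stand-ins, not trained hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(1)
sv_dim = 8

sigma = rng.uniform(0.5, 2.0, size=sv_dim)   # UBM variance supervector (diagonal of Sigma)
tau = 16.0                                   # classical relevance factor

# Relevance-MAP flavour: D^2 = Sigma / tau (D is diagonal)
d_map = np.sqrt(sigma / tau)
print(np.sum(sigma) / np.sum(d_map**2))      # recovers 16.0

# Data-trained flavour: an arbitrary diagonal D implies one tau per coefficient
d_trained = rng.uniform(0.1, 0.5, size=sv_dim)
tau_eff = np.sum(sigma) / np.sum(d_trained**2)   # trace(Sigma)/trace(D^2)
print(tau_eff)
```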

Flavors of JFA

Full JFA significantly outperforms the other JFA configurations.

[DET plot, SRE 2006 (all trials, det1): No JFA; eigenchannel adaptation; JFA with D² = Σ/τ; JFA with D trained on data; JFA with eigenvoices]

Subgroups

• Diarization based on JFA
• Factor Analysis Conditioning
• SVM – JFA and fast scoring
• Discriminative System Optimization


Diarization Group

Douglas Reynolds, Patrick Kenny, Fabio Castaldo, Ciprian Costin

Roadmap

• Introduction
– Problem definition
– Experiment setup

• Diarization Systems
– Variational Bayes System
– Streaming and Hybrid Systems

• Analysis and Conclusions

Diarization: Segmentation and Clustering

• Determine when a speaker change has occurred in the speech signal (segmentation)

• Group together speech segments from the same speaker (clustering)

• Prior speaker information may or may not be available

[Illustration: segments from Speaker A and Speaker B — where are the speaker changes? which segments are from the same speaker?]

Diarization Applications

• Diarization is used as a pre-process for other downstream applications

• Human consumption
– Annotate a transcript with speaker changes/labels
– Provide an overview of speaker activity

• Algorithm consumption
– Adaptation of a speech recognition system
– Application to speaker detection with multi-speaker speech

[Diagram: speaker diarization feeding single-speaker (1sp) detectors whose scores are combined with a MAX to give the utterance score]

Diarization Error Measures

• Direct measure – Diarization Error Rate (DER)
– Optimal alignment of reference and hypothesized diarizations
– Error is the sum of:
  • miss (speaker in reference but not in hypothesis)
  • false alarm (speaker in hypothesis but not in reference)
  • speaker error (mapped reference speaker is not the same as the hypothesized speaker)
– Time-weighted measure: emphasizes talkative speakers (a frame-level sketch appears at the end of this slide)

[Illustration: reference vs. hypothesis timelines showing miss, false alarm and speaker-error regions]

• Consumer measure – effect on the speaker detection system
– Determine the speaker detection error rate when using different diarization outputs
– Focus on NIST SRE 2008 data with a fixed detection system (JFA GMM-UBM system)

[DET plot: probability of miss vs. probability of false alarm, with the EER point marked]
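To illustrate how DER combines miss, false alarm and speaker error, the frame-level sketch below scores a hypothesis against a reference under a fixed speaker mapping; real NIST scoring additionally searches for the optimal speaker mapping and applies forgiveness collars, which are omitted here.

```python
import numpy as np

def der(ref, hyp):
    """Frame-level diarization error; labels are speaker ids, 0 = silence.
    Assumes the reference-to-hypothesis speaker mapping is already fixed."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    scored = ref != 0                          # only frames with a reference speaker
    miss = np.sum(scored & (hyp == 0))         # speaker in ref, none in hyp
    fa = np.sum((ref == 0) & (hyp != 0))       # speaker in hyp, none in ref
    spk_err = np.sum(scored & (hyp != 0) & (ref != hyp))
    return (miss + fa + spk_err) / np.sum(scored)

ref = [1, 1, 1, 2, 2, 0, 0, 2]
hyp = [1, 1, 2, 2, 0, 0, 1, 2]
print(f"DER = {der(ref, hyp):.2%}")   # 3 errors over 6 scored frames = 50%
```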

Diarization Experiment Data

• Summed-channel telephone speech
– Use summed-channel data for test only (avoids the complication of extra clustering in training)
– We can derive a reference for DER scoring using ASR transcripts from the separate channels (no-score for silence and speaker overlap)
– Compare the use of diarization to two extremes:
  • Best case: use the reference diarization
  • Worst case: no diarization

• Interview microphone speech
– Single microphone recording capturing both the interviewee (target) and the interviewer
– Avoid the use of unrealistic side information about the location of interviewee speech provided in the NIST eval
– Reference for DER scoring from the lavalier-microphone ASR transcripts

Roadmap

• Introduction
– Problem definition
– Experiment setup

• Diarization Systems
– Variational Bayes System
– Streaming and Hybrid Systems

• Analysis and Conclusions

Baseline System

• Three stages in the baseline system:
– BIC-based speaker change detection
– Full-covariance agglomerative clustering with a BIC stopping criterion
– Iterative re-segmentation with GMM Viterbi decoding

[Diagram: initial speaker data → speaker change detection → agglomerative clustering → GMM training and Viterbi re-decoding on refined speaker data → final diarization]

Factor Analysis Applied to Diarization

• State-of-the-art speaker recognition systems use hundreds of speaker and channel factors
– Processing requires entire utterances – it can't be implemented incrementally

• State-of-the-art diarization systems require lots of local decisions
– Very short (~1 sec) speech segments
– Speaker segmentation: is this frame a speaker change point?
– Agglomerative clustering: given two short segments, is the speaker the same?

• Proposed solution: Variational Bayes (VB)
– Fabio Valente, Variational Bayesian Methods for Audio Indexing, PhD Dissertation, Eurecom, 2005

Advantages of a Bayesian Approach

• EM-like convergence guarantees

• No premature hard decisions as in agglomerative clustering
– This suggested a 'soft clustering' heuristic which reduced the diarization error rate of the baseline system by almost 50%

• In theory at least, Bayesian methods are not subject to the over-fitting which maximum-likelihood methods are prone to
– Bayesian model selection is a quantitative version of Occam's razor (David MacKay)
– It ought to be possible to determine the number of speakers in a file without resorting to BIC-like fudge factors (Fabio Valente)

Eigenvoice Speaker Model

• A supervector s is the concatenation of the mean vectors in a speaker-dependent Gaussian mixture model

• The supervector m is speaker independent

• The matrix V is of low rank
– The columns of V are the eigenvoices
– The entries of y are the speaker factors

s = m + Vy,  y ~ N(0, I)

• This is a highly informative prior on speaker-dependent GMMs

• For diarization we use only the eigenvoice component of factor analysis
– Adding eigenchannels doesn't help in diarization (so far)

Variational Bayes Diarization

• Assume 2 speakers and uniformly segment the file into 1-second intervals
– This restriction can be removed in a second pass

• Alternate between estimating two types of posterior distribution until convergence:
– Segment posteriors (soft clustering)
– Speaker posteriors (location of the speakers in the space of speaker factors)

• Construct GMMs for each speaker and re-segment the data
– Iterate as needed


Segment Posteriors


Speaker Posteriors

Variational Bayes Diarization: Details

• Begin
– Extract Baum-Welch statistics from each segment

• On each iteration:
– For each speaker:
  • Synthesize Baum-Welch statistics by weighting the Baum-Welch statistics for each segment by the corresponding posterior
  • Update the posterior distribution of the speaker factors
– For each segment:
  • Update the segment posteriors for each speaker

• End
– Construct GMMs for each speaker
– Re-segment the data
– Iterate (a schematic sketch of this alternation follows)
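The alternation above can be sketched schematically. The toy code below only mirrors the structure of the loop (segment posteriors vs. speaker posteriors) with simple Gaussian soft-assignment updates in speaker-factor space; it is not the actual variational update equations used in the workshop system.

```python
import numpy as np

def vb_like_two_speaker_clustering(Y, n_iter=20, seed=0):
    """Y holds one speaker-factor vector per ~1 s segment.  Alternate between
    'speaker posteriors' (speaker locations in factor space) and
    'segment posteriors' (soft assignment of each segment to a speaker)."""
    rng = np.random.default_rng(seed)
    n_seg, _ = Y.shape
    q = rng.dirichlet([1.0, 1.0], size=n_seg)        # segment posteriors (2 speakers)
    for _ in range(n_iter):
        # speaker posteriors: posterior-weighted mean location per speaker
        mu = (q.T @ Y) / q.sum(axis=0, keepdims=True).T
        # segment posteriors: soft re-assignment from distances to the speakers
        d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        logp = -0.5 * d2
        q = np.exp(logp - logp.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
    return q, mu

# toy stream: two speakers, 3-dim "speaker factors" per segment
rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(4, 1, (30, 3))])
q, mu = vb_like_two_speaker_clustering(Y)
print(q.argmax(axis=1)[:5], q.argmax(axis=1)[-5:])
```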

Experiment Configuration

• Features used for Variational Bayes
– 39-dimensional feature set optimized by Brno for speaker recognition
– Cepstral coefficients c0, ..., c12 + first, second and third order derivatives + Gaussianization + HLDA

• Features used in the baseline system and in the re-segmentation phase of Variational Bayes
– Un-normalized cepstral coefficients c0, ..., c12
– Including c0 was a lucky bug

• Factor analysis configuration for Variational Bayes
– Universal background model with 512 Gaussians
– 200 speaker factors, no channel factors
– V matrix scaled by 0.6

• Test set: the summed-channel telephone data provided by NIST in the 2008 speaker recognition evaluation
– 2215 files (~200 hours)

• NIST Diarization Error used to measure performance
– Ground-truth diarization is available

Experiment Results — NIST 2008 Summed Channel Telephone Speech

• VB = Variational Bayes
• BW = Baum-Welch training of speaker GMMs
• Viterbi = re-segmentation with speaker GMMs
• The second pass in VB uses a non-uniform segmentation provided by the first pass
• Compared to the baseline, soft clustering achieves a 50% reduction in error rates

Diarization System | Mean DER (%) | Std DER (%)
1. Baseline, BW, Viterbi | 6.8 | 12.3
2. VB | 9.1 | 11.9
3. VB, BW, Viterbi | 4.5 | 8.5
4. VB, BW, Viterbi, 2nd pass | 3.8 | 7.6
5. Baseline soft-cluster, BW, Viterbi | 3.5 | 8.0

Roadmap

• Introduction
– Problem definition
– Experiment setup

• Diarization Systems
– Variational Bayes System
– Streaming and Hybrid Systems

• Analysis and Conclusions

Streaming System: LPT Diarization System*

• Main ideas
– Use the eigenvoice model to create a stream of speaker factors y_t computed on sliding windows
– Perform segmentation and clustering with these new features

• Eigenvoice model: s = m + Vy,  y ~ N(0, I)

* Based on Castaldo, F.; Colibro, D.; Dalmasso, E.; Laface, P.; Vair, C., "Stream-based speaker segmentation using speaker factors and eigenvoices," ICASSP 2008

Streaming System

[Diagram: audio → feature extraction → streaming factor analysis over slices x1 … x9 → slice clustering into GMM 1 / GMM 2]

Streaming System: Stream Factor Analysis

[Diagram: feature extraction over slices x1 … x12 → speaker-factor streams Y1, Y2 → clustering → creating GMMs (GMM 1, GMM 2) → Viterbi segmentation]

Streaming System: Stream Factor Analysis

[Plot: first 2 dimensions of the y stream]

Streaming System: Slice Clustering

• A GMM model is created for each slice

• Last step: clustering the GMMs created in each slice

• The system decides whether GMMs come from the same or different speakers by using an approximation of the Kullback-Leibler divergence between GMMs

• Large KL divergence => new speaker (a sketch of one such approximation follows)
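As an illustration, one common approximation for GMMs that share the UBM's weights and covariances (mean-only adaptation) is a weight- and variance-normalized distance between the component means; the sketch below uses that form with made-up models, and the threshold λ is an arbitrary placeholder.

```python
import numpy as np

def approx_kl_divergence(mu1, mu2, weights, variances):
    """Approximate KL divergence between two mean-only-adapted GMMs that share
    weights and diagonal covariances: a weighted Mahalanobis distance between
    the component means (symmetric by construction)."""
    diff = mu1 - mu2                                   # (n_components, feat_dim)
    return 0.5 * np.sum(weights[:, None] * diff**2 / variances)

rng = np.random.default_rng(0)
n_comp, dim = 8, 4
w = np.full(n_comp, 1.0 / n_comp)            # UBM weights
var = rng.uniform(0.5, 2.0, (n_comp, dim))   # UBM diagonal covariances
mu_a = rng.normal(size=(n_comp, dim))        # slice-GMM means (speaker A)
mu_b = mu_a + 0.1 * rng.normal(size=(n_comp, dim))   # a similar model
mu_c = mu_a + 2.0 * rng.normal(size=(n_comp, dim))   # a very different model

print(f"similar models:   {approx_kl_divergence(mu_a, mu_b, w, var):.2f}")
print(f"different models: {approx_kl_divergence(mu_a, mu_c, w, var):.2f}")
# a larger divergence than the threshold lambda would trigger a new speaker model
```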

[Diagram: for each new 60 s slice, build a new GMM, compute the minimum KL divergence to the existing GMMs, and either adapt the closest model or, if the divergence exceeds λ, create a new model]

Hybrid Clustering

• Speaker factors work well in the streaming diarization system

• Experiments done during the workshop showed that the cosine distance between speaker factors produces low speaker detection errors

• Modify the baseline system using these new ideas

• Hybrid clustering
– Replace the classical clustering with clustering based on speaker factors and the cosine distance

Hybrid Clustering: Different Approaches

• First approach: level cutting
– Stop the agglomerative clustering at a certain level and compute speaker factors for each cluster
– Merge the clusters that have the maximum similarity with respect to the cosine distance
– Iterate until only two clusters remain

• Second approach: tree searching
– Build the agglomerative clustering up to the top level
– Select the nodes that have a number of frames above a threshold
– Merge the clusters that have the maximum similarity with respect to the cosine distance
– Iterate until only two clusters remain

Hybrid Clustering: Level Cutting

[Diagram: the agglomerative clustering tree is cut at a chosen level; speaker factors Y1 … Y5 are computed for the resulting clusters, which are then merged by cosine similarity until two clusters remain]

Hybrid Clustering: Tree Searching

[Diagram: the full agglomerative tree is built; nodes with more frames than a threshold (here 100) are selected as clusters (Y) and then merged by cosine similarity]

Results on Summed Telephone Speech (2213 audio files)

System | Min DER (%) | Max DER (%) | Std Dev (%) | Avg DER (%)
Streaming System | 0.0 | 53.2 | 8.8 | 4.6
Baseline Diarization System | 0.0 | 57.2 | 12.3 | 6.8
Hybrid System 1 (Level Cutting) | 0.0 | 67.0 | 14.6 | 17.1
Hybrid System 2 (Tree Search) | 0.0 | 63.2 | 13.6 | 6.8

Roadmap

• Introduction
– Problem definition
– Experiment setup

• Diarization Systems
– Variational Bayes System
– Streaming and Hybrid Systems

• Analysis and Conclusions

DER vs. EER — Summed Telephone Speech

• Some correlation of DER with EER
• Systems with DER < 10% have comparable EERs
• No clear knee in the curve
– We still get EER gains (over doing nothing) with a relatively poor DER = 20% system

[Plot: EER (%) vs. DER (%), DER from 0 to 40%]

DER vs. EER — Summed Telephone Speech

• Unclear trends among the low-DER systems
– VB + 2nd pass and BL + soft_cluster
• DER may be too coarse a measure for effects on EER

[Plot: EER (%) vs. DER (%) for the Ref, LPT, VB + 2nd pass, BL + soft cluster, Hybrid, BL and VB systems]

Interview Speech

• Interview speech differs from telephone speech in two main aspects
– Audio quality is much more variable across the various microphones
– Conversations are dominated by the interviewee

• DER for do-nothing diarization (a single speaker for all time)
– Telephone: 35%
– Interview: 11%

• The next challenge is to apply the diarization systems to the new domain
• Avoid the idealistic assumptions and knowledge used in the NIST eval

• No diarization in train or test: EER = 10.9%
• Ideal diarization in train and test: EER = 5.4%

Conclusion

• Implemented a variational Bayes diarization system using both segment and speaker posterior optimization
• Used the speaker-factor model in three speaker diarization systems
– Streaming, VB, and hybrid
• Demonstrated the effectiveness of soft clustering for improving speaker diarization
• Produced low diarization error rates (3.5–4.5%) for telephone speech
• New challenges await in the interview speech domain
– Microphones
– Conversational patterns


Speaker Recognition: Factor Analysis Conditioning

(13 August 2008)

Sub-Team:

Sachin Kajarekar (SRI), Elly Na (GMU), Jason Pelecanos (IBM), Nicolas Scheffer (SRI), Robbie Vogt (QUT)


Overview

• Introduction

• A Phonetic Analysis

• Combination Strategies

• Within Session Variability Modeling

• Hierarchical Factor Analysis

• Review

Introduction

Problem Statement
– A single FA model is sub-optimal across different conditions
– E.g.: different durations, phonetic content and recording scenarios

Goals
– Explore two approaches:
  • Build FA models specific to each condition and robustly combine multiple models
  • Extend the FA model to explicitly model the condition as another source of variability

Results and Outcomes
– A conditioned FA model can provide improved performance
  • But score-level combination may not be the best way
– Including within-session factors in an FA model can reduce the sensitivity to utterance duration and phonetic content variability
– Stacking factors across conditions or data subsets can provide additional robustness
– Hierarchical modeling for Factor Analysis shows promise
– Applicability to other condition types: languages, microphones, …

Introduction: Speech Partitioning — An Overview

[Illustration: train and test feature spaces partitioned by phoneme; e.g. the train utterance contains phonemes 'w', 'ah1', 'n' and the test utterance contains 'w', 'ow', 'd']

Overview

• Introduction
• A Phonetic Analysis
• Combination Strategies
• Within Session Variability Modeling
• Hierarchical Factor Analysis
• Review

A Phonetic Analysis: Effect of Phonetic Mismatch

EER / Min. DCF for vowel and consonant content in enrollment and test (matched conditions on the diagonal):

 | Vowel | Consonant
Vowel | 4.50% / 0.0208 | 12.47% / 0.0537
Consonant | 10.72% / 0.0521 | 7.03% / 0.0336

– How does the difference between the content in enrollment and test change the resulting performance?
– This result, albeit an extreme example, demonstrates the challenge of mismatched phonetic content
– This phenomenon is especially present for short-duration utterances

A Phonetic Analysis: Performance vs. % of Speech

Phoneme | Type | % of speech | DET 1 EER (%) | DET 1 DCF | DET 3 EER (%) | DET 3 DCF
E | vowel | 18.93 | 12.16 | 0.0567 | 8.62 | 0.0419
O | vowel | 10.71 | 14.57 | 0.0645 | 12.30 | 0.0558
i | vowel | 6.85 | 16.73 | 0.0749 | 15.49 | 0.0696
A: | vowel | 5.89 | 23.31 | 0.0876 | 21.79 | 0.0852
n | nonvowel | 5.44 | 19.08 | 0.0779 | 17.23 | 0.073
e: | vowel | 4.73 | 25.31 | 0.0917 | 22.92 | 0.0866
k | stop | 4.49 | 25.56 | 0.0926 | 22.26 | 0.0868
z | sibilant | 4.25 | 29.73 | 0.098 | 28.22 | 0.0971
o | vowel | 3.01 | 25.53 | 0.0924 | 25.24 | 0.0926
t | stop | 2.76 | 27.04 | 0.0956 | 24.92 | 0.0936
s | sibilant | 2.74 | 30.73 | 0.0965 | 27.63 | 0.0908
f | sibilant | 2.41 | 34.43 | 0.0998 | 31.42 | 0.0984
j | nonvowel | 2.38 | 25.00 | 0.0918 | 22.41 | 0.0862
v | sibilant | 2.35 | 33.66 | 0.1 | 30.78 | 0.0992
m | nonvowel | 2.29 | 21.18 | 0.0835 | 18.63 | 0.0782
S | sibilant | 2.21 | 31.97 | 0.0959 | 31.74 | 0.0981
l | nonvowel | 1.99 | 30.05 | 0.0974 | 29.91 | 0.0955

A Phonetic Analysis: Performance vs. % of Speech

[Plot: EER (%) vs. % of speech, by phoneme class (vowel, nonvowel, sibilant, stop)]

A Phonetic Analysis: Fusion Analysis

Vowel with others:

Phonemes | % of Speech | EER (%)
E n | 24.37 | 7.96
E k | 23.42 | 8.4
E z | 23.18 | 8.35
E t | 21.69 | 8.72
E l | 20.92 | 8.56
O n | 16.15 | 9.64
O k | 15.2 | 10.89
O z | 14.96 | 11.76
O t | 13.47 | 10.85
i n | 12.29 | 11.93
i z | 11.1 | 14.41
n e: | 10.17 | 14.4
i s | 9.59 | 14.35
A: t | 8.65 | 17.6
n j | 7.82 | 13.71
z s | 6.99 | 23.03
o t | 5.77 | 19.34
t s | 5.5 | 21.88
f v | 4.76 | 26.72
S l | 4.2 | 25.63

Vowel with vowel:

Phonemes | % of Speech | EER (%)
E O | 29.64 | 7.04
E i | 25.78 | 7.58
E A: | 24.82 | 8.94
E e: | 23.66 | 8.55
E o | 21.94 | 8.29
O i | 17.56 | 9.42
O A: | 16.6 | 11.32
O e: | 15.44 | 11.32
O o | 13.72 | 11.38
i A: | 12.74 | 13.05
i e: | 11.58 | 13.91
A: e: | 10.62 | 17.23
i o | 9.86 | 14.07
A: o | 8.9 | 17.59
e: o | 7.74 | 18.46

Fuse All | 83.43 | 5.68

A Phonetic Analysis: Fusion Analysis

[Plot: fusion EER for "vowel with others" vs. "vowel with vowel" pairs]

Overview

• Introduction
• A Phonetic Analysis
• Combination Strategies
• Within Session Variability Modeling
• Hierarchical Factor Analysis
• Review

Combination Strategies: Context

• Conditioned factor analysis
– Multiple systems for multiple conditions
– Multiple subspaces (e.g. microphones)

• Current solution
– Select the best system for each condition
– Perform score-level combination (our baseline)

• How do we robustly gather information from these systems?

• Explore combination strategies in the model space

• Candidate for the study: broad-phone classes
– Work in the speaker space instead of the channel space
– Small set of events
– FA conditioning with a smaller system configuration (512 Gaussians, 120 eigenvoices, 60 eigenchannels)

Combination Strategies: Baseline Results

Table of results for different phone sets (DET 1, SRE'06):

Phone set | % Data | EER (%) | Min. DCF
Vowels | 60 | 6.17 | 0.296
Consonants | 40 | 7.91 | 0.391
NonVowels | 15 | 10.7 | 0.502
Sibilants | 15 | 14.14 | 0.647
Stops | 10 | 15.27 | 0.685
Vow. + NV + Si + St. (4 classes) | 100 | 5.42 | 0.272
Vow. + Cons. (2 classes) | 100 | 5.20 | 0.262
Baseline | 100 | 5.12 | 0.241

Thanks to S. Kajarekar, C. Richey, SRI International

Combination Strategies: Stacked Eigenvectors

• In training, estimate different subspaces modeling the same kind of variability
– E.g. different utterance lengths, different sets of microphones

• In practice:
– Merge the supervectors generated by each subspace
– The new rank is the sum of the individual subspace ranks
– Can generate very large (and redundant) subspaces

• Advantages (see the sketch after this list):
– No retraining during enrollment / recognition time
– No need for labeled data for system selection
– Increased robustness of the system in both scenarios (correlation between the two subspaces)
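In code, stacking amounts to concatenating the subspace matrices column-wise; the ranks add and the stacked factor vector is the concatenation of the per-subspace factors. The matrices below are random placeholders, not trained subspaces.

```python
import numpy as np

rng = np.random.default_rng(0)
sv_dim = 20

V1 = rng.normal(size=(sv_dim, 5))    # subspace trained on condition/class 1
V2 = rng.normal(size=(sv_dim, 8))    # subspace trained on condition/class 2

V_stacked = np.hstack([V1, V2])      # new rank = 5 + 8 = 13
print(V_stacked.shape)               # (20, 13)

# A single stacked factor vector y = [y1; y2] then describes the speaker offset
y = rng.normal(size=V_stacked.shape[1])
offset = V_stacked @ y               # equals V1 @ y[:5] + V2 @ y[5:]
print(np.allclose(offset, V1 @ y[:5] + V2 @ y[5:]))   # True
```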


Combination Strategies: Combining U and/or V?

• Stacking U's (channel): successfully demonstrated (at NIST) for a large set of microphones

• Stacking V's (speaker): suitable for phone conditioning, as:
– Phonetic models can represent the speaker
– Precedents are P-GMM, MLLR systems, …

Table of 2-class stacking results (DET 1, SRE'06):

System | EER (%) | Min. DCF
Baseline | 5.20 | 0.262
Stacked Us | 5.72 | 0.279
Stacked Vs | 5.14 | 0.243
Stacked Us + Vs | 5.31 | 0.258

Combination Strategies: Augmented Eigenvectors

• Again, train several subspaces on the same kind of variability

• In practice:
– The value of the subspace rank is unchanged
– Increases the model size
– Need to retrain the joint variability
– Not extendable to more than 2–3 classes

• Close to tied factor analysis
– Produces a single y, independent of the class


Combination Strategies: Factor Analysis Un-tying

• Augmented eigenvectors produce a common y for all conditions

• In practice:
– There is always a between-class error
– The error is averaged out by the ML algorithm

• Keep each speaker factor (y) from each class, together with the error
– More parameters to describe a speaker
– Feed this input to a classifier
– Experiments with Gaussians as classifiers are promising


Combination Strategies: Results

Table of results for different factor configurations:

# of Classes | Configuration | Method | EER | DCF
1 | Baseline | Single system | 5.12 | 0.241
1 | Baseline (x2) | EV | 4.83 | 0.239
2 | Baseline | 2-system fusion | 5.20 | 0.262
4 | Baseline | 4-system fusion | 5.42 | 0.272
2 | Stacked V | 2-system fusion | 5.09 | 0.247
4 | Stacked V | 4-system fusion | 5.03 | 0.250
2 | Stacked V | Single system | 5.14 | 0.243
4 | Stacked V | Single system | 4.76 | 0.234
2 | Augmented | Single system | 13.4 | 0.573
2 | Augmented | Retrained (Tied) | 5.39 | 0.266
16 | Un-tied | Gaussian | 4.54 | 0.233


Overview

• Introduction

• A Phonetic Analysis

• Combination Strategies

• Within Session Variability Modeling

• Hierarchical Factor Analysis

• Review

Within-session Variability Modeling

• The characteristics of inter-session variability are dependent on session duration

• This doesn't fit well with the JFA model
– It is capturing more than channel effects!
– Speech content (phonetic information) averages out for long utterances but becomes significant for short utterances

Within-session Variability Modeling

• Proposed solution: model within-session variability as well
– Break utterances into smaller segments, each described by:

M = m + Vy + U_I x + U_W w + Dz

– U is split into inter-session (U_I) and within-session (U_W) parts
– x is held constant for a whole utterance,
– but we have many w's!

• In this work we chose to align our segments with OLPR transcripts (a toy illustration follows)
– i.e. one w per phonetic event
– Approx. 10 per second
– Approx. 1000 in a NIST conversation side
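A toy illustration of the decomposition: per-segment supervector means are composed with the speaker, inter-session and residual terms held fixed for the utterance while one within-session factor vector w is drawn per (hypothetical) phonetic event. Dimensions and matrices are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
sv_dim, r_v, r_i, r_w = 24, 4, 3, 2

m = rng.normal(size=sv_dim)
V = rng.normal(size=(sv_dim, r_v))     # speaker subspace
U_I = rng.normal(size=(sv_dim, r_i))   # inter-session (channel) subspace
U_W = rng.normal(size=(sv_dim, r_w))   # within-session subspace
D = np.diag(rng.uniform(0.1, 0.5, size=sv_dim))

y = rng.normal(size=r_v)    # speaker factors: fixed for the speaker
z = rng.normal(size=sv_dim) # residual speaker factors: fixed for the speaker
x = rng.normal(size=r_i)    # channel factors: fixed for the whole utterance

n_events = 10               # one w per phonetic event (~10 per second on the slides)
W = rng.normal(size=(n_events, r_w))

# Per-segment supervector means: only the U_W w term changes within the session
M_segments = m + V @ y + U_I @ x + D @ z + (U_W @ W.T).T
print(M_segments.shape)     # (10, 24)
```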

Within-session Variability Modeling

[Illustration: the decomposition M = m + Vy + U_I x + U_W w + Dz, highlighting the speaker subspace V (factors y) and the inter-session subspace U_I (factors x)]

[Illustration: the same decomposition, build step highlighting V and U_I]

[Illustration: the same decomposition, build step adding the within-session subspace U_W]

Within-session Variability Modeling: Single Phonetic Event

[Plot (x-axis: dimension)]

Within-session Variability Modeling: Single Phonetic Event

[Plot (x-axis: dimension)]

Within-session Variability Modeling: Contribution Over Varying Utterance Lengths

[Plot (x-axis: dimension)]

Within-session Variability Modeling: Within-session Results

• Similar performance for full conversations
• Modest gains with reduced utterance lengths, mostly for the EER
– Better than matching U to the utterance length in most cases
– Good flexibility across utterance lengths for a single model!

JFA Model | Subspace Dims | 1conv EER / Min DCF | 20 sec EER / Min DCF | 10 sec EER / Min DCF
U+V+D | 50 | 3.10% / .0159 | 12.79% / .0561 | 20.21% / .0819
U+V+D | 60 | 3.03% / .0156 | 13.01% / .0562 | 20.31% / .0820
U_Matched+V+D | 50 | 3.10% / .0159 | 12.20% / .0531 | 19.71% / .0814
U_I+U_W+V+D | 50I + 10W | 2.97% / .0170 | 11.98% / .0541 | 19.67% / .0807

Overview

• Introduction
• A Phonetic Analysis
• Combination Strategies
• Within Session Variability Modeling
• Hierarchical Factor Analysis
• Review

Hierarchical Factor Analysis

[Illustration: a hierarchy from a low-complexity, coarse-grained model to a high-complexity, fine-grained model]

Hierarchical Factor Analysis: Multi-grained Hybrid Model

Such a model may compensate for session effects that cause large regional variability as well as localized distortions.

A multi-grained model may be structured so that the nuisance kernel subspace has reduced complexity (a reduced number of parameters) while preserving compensation impact.

[Diagram: raw features → extract session-variability statistics → find FA statistics with the L1 "global" model → extract session-variability statistics → find FA statistics with the L2 "broad phone" model → use the FA-compensated features/statistics in the kernel or model]

Hierarchical Factor Analysis: Multi-grained GMM/Phone Model

Table of NIST 2008 minimum DCF results:

Task | Condition 7 | Condition 8
Base system with NAP | 0.179 | 0.182
Base system with multigrained NAP | 0.175 | 0.166
Broad-phone system with NAP | 0.212 | 0.209
Broad-phone system with multigrained NAP | 0.206 | 0.190

Thanks to Jiri Navratil (IBM) for the phonetic results.

Hierarchical Factor Analysis: Multi-stage FA Broad Phone Model

Table of NIST 2006 minimum DCF / EER results:

System | Phone Type | DET 3 Base Min DCF | DET 3 Base EER | DET 3 ZTNorm Min DCF | DET 3 ZTNorm EER
Baseline | NonVowel | 0.0888 | 24.04% | 0.0413 | 9.05%
Baseline | Sibilant | 0.0988 | 30.28% | 0.0584 | 13.05%
Baseline | Stop | 0.0993 | 33.33% | 0.0631 | 13.81%
Baseline | Vowel | 0.0604 | 11.26% | 0.0201 | 3.97%
Hierarchical | NonVowel | 0.0852 | 23.24% | 0.042 | 9.53%
Hierarchical | Sibilant | 0.0994 | 28.93% | 0.0585 | 14.20%
Hierarchical | Stop | 0.0991 | 33.27% | 0.0655 | 14.63%
Hierarchical | Vowel | 0.0482 | 10.29% | 0.0206 | 3.91%
Baseline | Consonant | 0.0839 | 20.48% | 0.0323 | 6.28%
Baseline | Vowel | 0.0604 | 11.26% | 0.0201 | 3.97%
Hierarchical | Consonant | 0.0777 | 18.26% | 0.0312 | 6.45%
Hierarchical | Vowel | 0.0482 | 10.29% | 0.0206 | 3.91%

Fusion of the hierarchical systems with the baseline system gives modest improvements.

Review

Results and Outcomes
– A conditioned FA model can provide improved performance
  • But score-level combination may not be the best way
  • Automatic system selection may not be feasible
– Including within-session factors in an FA model can reduce the sensitivity to utterance duration and phonetic content variability
– Stacking factors across conditions or data subsets can provide additional robustness
– Hierarchical modeling for Factor Analysis shows promise
– Applicability to other condition types: languages, microphones, …


Support Vector Machines and Joint Factor Analysis

Najim DEHAK, Reda DEHAK, Zahi KARAM, and John NOECKER Jr.


Outline

• Introduction

• SVM-JFA : GMM supervectors space

• SVM-JFA : Speaker factors space

• Intersession compensation in speaker factors space

– Within Class Covariance Normalization

– Handling variability with SVMs

• SVM-JFA : Combined factor space

• Importance of speaker and channel factors

• Conclusion

Introduction

• Joint Factor Analysis is the state of the art in speaker verification

• Combine discriminative and generative models

• SVM – JFA
– Speaker GMM supervector space
– Speaker factor space
– Combination of factors

• Intersession variability compensation in the speaker factor space


SVM in GMM Supervector Space

s = m + Vy + Dz

SVM-JFA: Supervector Space

• Our starting point: a kernel between two GMM supervectors (Campbell '06)

• Project each utterance into a high-dimensional space – stack the mean vectors from the GMM

• Use a KL-divergence-based kernel

[Diagram: target and non-target utterances are JFA-adapted from the UBM to GMM mean supervectors; an SVM is trained on them, and test supervectors are scored against the speaker SVM to give the detection score]

Thanks to Douglas Reynolds for this slide.

JFA Configuration

• Gender-independent JFA

• 2048 Gaussians, 60-dimensional features
– 19 Gaussianized MFCCs + energy + delta + double delta

• 300 speaker factors

• 100 channel factors for telephone speech

• Decoupled estimation of the eigenvoice and diagonal matrices (V and D)

• JFA hyper-parameters are estimated on the MIXER and Switchboard databases

SVM-JFA: Supervector Space

Results on NIST 2006 and 2008 SRE, core condition, telephone–telephone data (EER):

System | NIST 2006 English | NIST 2006 all trials | NIST 2008 English | NIST 2008 all trials
JFA: s = m + Vy | 1.95% | 3.01% | 2.81% | 5.58%
JFA: s = m + Vy + Dz | 1.80% | 2.96% | 2.81% | 5.69%
SVM-JFA: s = m + Vy | 4.24% | 4.98% | 5.10% | 7.92%
SVM-JFA: s = m + Vy + Dz | 4.23% | 4.92% | 5.27% | 8.13%

• JFA scoring is done frame by frame.


SVM in JFA-Speaker Factors Space

s = m + V y

SVM-JFA: Speaker Factor Space

• Use the speaker factors y rather than the GMM supervectors s
– s = m + Vy
– channel offset: c = Ux

• Low-dimensional space

• We can perform quick experiments

• First we used only eigenvoice adaptation (D = 0)

SVM-JFA: Speaker Factor Space

• s = m + Vy

• Inner-product (linear) kernel: k(y1, y2) = <y1, y2>

• Gaussian kernel: k(y1, y2) = exp(−‖y1 − y2‖² / (2σ²))

• Cosine kernel: k(y1, y2) = <y1, y2> / (‖y1‖ ‖y2‖)

(a minimal sketch of these three kernels follows)
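A minimal numpy rendering of the three kernels on speaker-factor vectors; the Gaussian width σ is an arbitrary placeholder rather than a tuned value.

```python
import numpy as np

def linear_kernel(y1, y2):
    return float(y1 @ y2)

def gaussian_kernel(y1, y2, sigma=1.0):   # sigma is an illustrative width
    return float(np.exp(-np.sum((y1 - y2) ** 2) / (2.0 * sigma ** 2)))

def cosine_kernel(y1, y2):
    return float(y1 @ y2 / (np.linalg.norm(y1) * np.linalg.norm(y2)))

rng = np.random.default_rng(0)
y_a, y_b = rng.normal(size=300), rng.normal(size=300)   # 300 speaker factors
print(linear_kernel(y_a, y_b), gaussian_kernel(y_a, y_b), cosine_kernel(y_a, y_b))
```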

SVM-JFA: Speaker Factor Space

Results on NIST 2006 SRE, core condition (EER):

System | English no-norm | English t-norm | English zt-norm | All trials no-norm | All trials t-norm | All trials zt-norm
JFA | – | – | 1.95% | – | – | 3.01%
KL kernel, supervectors | – | 4.24% | – | – | 4.98% | –
Linear kernel | 3.47% | 2.93% | – | 4.64% | 4.04% | –
Gaussian kernel | 3.03% | 2.98% | – | 4.59% | 4.46% | –
Cosine kernel | 3.08% | 2.92% | – | 4.18% | 4.15% | –

SVM-JFA: Speaker Factor Space

Results on NIST 2008 SRE, core condition, telephone–telephone data, all trials (EER):

System | No-norm | T-norm | Zt-norm
JFA | – | – | 5.58%
KL kernel, supervectors | – | 7.92% | –
Linear kernel | 7.06% | 7.10% | –
Gaussian kernel | 7.84% | 7.42% | –
Cosine kernel | 7.24% | 7.24% | –


Intersession Compensation in Speaker Factors Space

Within Class Covariance Normalization in SVM speaker factors space.

Within Class Covariance Normalization (WCCN)

• Within-class covariance:

W = (1/S) Σ_{s=1}^{S} (1/n_s) Σ_{i=1}^{n_s} (y_i^s − ȳ_s)(y_i^s − ȳ_s)^T,   with ȳ_s = (1/n_s) Σ_{i=1}^{n_s} y_i^s

– ȳ_s – mean of the utterances of speaker s
– S – number of speakers
– n_s – number of utterances for speaker s
– The within-class covariance is computed on the MIXER and Switchboard databases (a minimal sketch of this computation follows)
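A minimal sketch of the computation of W from per-speaker collections of factor vectors, together with the Cholesky factor of W⁻¹ that WCCN uses as a whitening transform; the data below are random placeholders rather than MIXER/Switchboard factors.

```python
import numpy as np

def within_class_covariance(factors_by_speaker):
    """Average, over speakers, of the covariance of that speaker's factor
    vectors around the speaker's own mean (the W defined above)."""
    covs = []
    for Y in factors_by_speaker:                 # Y: (n_utterances, factor_dim)
        centered = Y - Y.mean(axis=0, keepdims=True)
        covs.append(centered.T @ centered / Y.shape[0])
    return np.mean(covs, axis=0)

rng = np.random.default_rng(0)
# toy data: 5 speakers, several utterances each, 10 speaker factors
data = [rng.normal(size=(rng.integers(5, 10), 10)) for _ in range(5)]
W = within_class_covariance(data)
print(W.shape)                              # (10, 10)
B = np.linalg.cholesky(np.linalg.inv(W))    # whitening transform used by WCCN
print(B.shape)
```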

SVM-JFA: Speaker Factor Space (WCCN)

• s = m + Vy

• Linear kernel: k(y1, y2) = y1^T W⁻¹ y2

• Cosine kernel: k(y1, y2) = (y1^T W⁻¹ y2) / sqrt((y1^T W⁻¹ y1)(y2^T W⁻¹ y2))

SVM-JFA: Speaker Factor Space (WCCN)

Results on NIST 2006 SRE, core condition, English trials (EER):

System | No WCCN, no-norm | No WCCN, t-norm | WCCN, no-norm | WCCN, t-norm
Linear kernel | 3.47% | 2.93% | 2.87% | 2.44%
Cosine kernel | 3.03% | 2.98% | 2.60% | 2.45%

JFA scoring: 1.95%. WCCN gives roughly a 17% relative improvement over the uncompensated kernels.


Intersession Compensation in Speaker Factors Space

Handling variability with SVMs

Handling Variability with SVMs: Good and Bad Variability

• Types of variability:
– Good variability (inter-speaker): speaker information
– Bad variability (nuisance): session, channel

• Handling variability:
– Joint Factor Analysis
– GMM + LFA
– SVM + NAP
– SVM + WCCN

Handling Variability with SVMs: Motivation — Handling Nuisance

[Illustration: a supervector decomposed along the principal nuisance dimension Uw, with w the nuisance estimate]

Handling Variability with SVMs: SVM Formulation

[Illustration: the SVM margin with slack values ξ = 0, 1, 2, ∞ relative to the nuisance direction Uw]

Handling Variability with SVMs: Results

Using only 300 speaker factors (s = m + Vy); dimension of the nuisance subspace = 50.

[Chart showing relative improvements of 11% and 18%]

Handling Variability with SVMs: Future Work

• Beyond nuisance compensation:
– Bias towards using inter-speaker variability
– Handle all variability: bias towards inter-speaker variability, bias away from nuisance

• Extend the formulation to full supervectors


SVM-JFA : Speaker and Common Factors Space

s = m + Vy + Dz

SVM-JFA: Speaker and Common Factors Space

• Full Joint Factor Analysis: s = m + Vy + Dz
– y: speaker factors
– z: common factors (EER 6.23% on NIST 2006 SRE, English trials)

• How do we use speaker and common factors with SVMs?
– Score fusion
– Kernel combination

Score Fusion vs. Kernel Combination

• Score fusion: a linearly weighted score

S_F(x) = w_0 + Σ_{l=1}^{M} w_l S_l(x)

– The weights are computed using a development score dataset

• Kernel combination: a linearly weighted kernel

k_F(x, y) = Σ_{l=1}^{M} β_l k_l(x, y)

– No development dataset is needed for weight estimation
(a minimal sketch of both follows)
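A minimal sketch contrasting the two combinations; the weights, the factor-space dimensions and the use of the cosine kernel for both spaces are illustrative choices, not the trained values.

```python
import numpy as np

def fused_score(subsystem_scores, w0, w):
    """Score fusion: weighted sum of per-subsystem scores (weights fitted on
    a development set)."""
    return w0 + float(np.dot(w, subsystem_scores))

def combined_kernel(x_parts, y_parts, kernels, beta):
    """Kernel combination: weighted sum of per-space kernels (weights found
    while training the SVM itself)."""
    return sum(b * k(xp, yp) for b, k, xp, yp in zip(beta, kernels, x_parts, y_parts))

cosine = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
y_trial = (rng.normal(size=300), rng.normal(size=200))   # (speaker factors, common factors)
y_model = (rng.normal(size=300), rng.normal(size=200))

print(fused_score(np.array([1.2, -0.4]), w0=0.1, w=np.array([0.7, 0.3])))
print(combined_kernel(y_model, y_trial, [cosine, cosine], beta=[0.8, 0.2]))
```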

Kernel Combination Space

[Illustration: a kernel in the speaker-factor space (y) and a kernel in the common-factor space (z) are combined into a new kernel function]

Kernel Combination Training

• Linear kernel combination: learn a linear combination

k_F(x, y) = Σ_{l=1}^{M} β_l k_l(x, y)

• Large-margin classifier (SVM): maximize the margin

• Multiple-kernel SVM training: find the β_l that maximize the margin

Results

System | NIST 2006 English | NIST 2006 all | NIST 2008 English | NIST 2008 all
Cosine kernel on y | 2.34% | 3.59% | 3.86% | 6.55%
Cosine kernel on z | 6.26% | 8.68% | 10.34% | 13.45%
Linear score fusion | 2.11% | 3.62% | 3.23% | 6.86%
Kernel combination | 2.08% | 3.62% | 3.20% | 6.60%


Importance of Speaker and Channel Factors

Importance of Speaker and Channel Factors

• Gender-dependent JFA (female part)

• 2048 Gaussians, 60-dimensional features
– 19 Gaussianized MFCCs + energy + delta + double delta

• 300 speaker factors

• 0 common factors

• 100 channel factors for telephone speech

• JFA hyper-parameters are estimated on the MIXER and Switchboard databases

Importance of Speaker and Channel Factors

M = m + Vy + Dz + Ux

• Apply intersession compensation in the speaker-factor space rather than the supervector space
• EER = 20% — oops: the channel factors contain information about the speaker

Importance of Speaker and Channel Factors

• Three systems:
– M = m + Vy + Ux
– M = m + Vy
– M = m + Tt

• V: low-rank matrix, eigenvoices (speaker variability, 300 dims)
• U: low-rank matrix, eigenchannels (channel variability, 100 dims)
• T: low-rank matrix containing the total variability (speaker and channel variability, 400 dims)

Results on NIST 2006 SRE, core condition, English trials, female part (using the cosine kernel):

System | EER
M = m + Vy + Ux | 2.56%
M = m + Vy | 2.74%
M = m + Tt | 2.19%

Conclusion

• SVM scoring in the speaker-factor space rather than the GMM supervector space:
– Quite good linear separation in the speaker-factor space
– Improves over JFA-SVM on supervectors
– Performance comparable to other scoring methods
– Allows for faster scoring

• Generalized using combinations of factors

• Further improvement using intersession compensation in the speaker-factor space

• JFA as feature extraction

SVM vs. Cosine Distance (CD) Scoring

• SVMs find a linear separation in the speaker-factor space

• The score can be computed as a nearest-neighbor distance

• Can we omit the SVM and use a different distance?

SVM vs. CD Scoring

• The idea is to compute the trial score as the cosine distance between the enrollment and test speaker models:

score = <y_train, y_test> / (‖y_train‖ ‖y_test‖)

• Inspired by the SVM, normalization by the within-class covariance (WCCN) can be applied:

score = <A^T y_train, A^T y_test> / (‖A^T y_train‖ ‖A^T y_test‖),   with A = chol(W⁻¹)

(a minimal sketch follows)
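A minimal sketch of cosine-distance scoring with optional WCCN whitening; the within-class covariance here is a random placeholder rather than one estimated on MIXER/Switchboard.

```python
import numpy as np

def cd_score(y_train, y_test, W=None):
    """Cosine-distance scoring between enrollment and test speaker factors,
    optionally preceded by WCCN whitening with A = chol(W^{-1})."""
    if W is not None:
        A = np.linalg.cholesky(np.linalg.inv(W))
        y_train, y_test = A.T @ y_train, A.T @ y_test
    return float(y_train @ y_test /
                 (np.linalg.norm(y_train) * np.linalg.norm(y_test)))

rng = np.random.default_rng(0)
dim = 300
W = np.cov(rng.normal(size=(1000, dim)), rowvar=False) + 0.1 * np.eye(dim)
y_enroll, y_test = rng.normal(size=dim), rng.normal(size=dim)
print(cd_score(y_enroll, y_test))        # plain cosine
print(cd_score(y_enroll, y_test, W=W))   # WCCN-normalized cosine
```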

SVM vs. CD Scoring: Results in the y Space

• SVM – does not need any score normalization
• CD scoring – applies gender-dependent ZT-norm

System | 2006 det3 EER / DCF | 2006 det1 EER / DCF | 2008 det6 EER / DCF
SVM w. cosine kernel, WCCN | 2.38 / 1.27 | 3.62 / 1.80 | 6.59 / 3.23
SVM w. cosine kernel | 2.98 / 1.55 | 4.09 / 2.02 | 7.20 / 3.24
Cosine dist w. WCCN, ZT-norm | 2.00 / 1.19 | 3.82 / 1.93 | 6.43 / 3.60
Cosine dist w. ZT-norm | 2.82 / 1.50 | 4.07 / 2.11 | 6.95 / 3.55

SVM vs. CD Scoring: Extending to the x, y, z Space

• Motivation
– z carries residual speaker information, which we want to use
– Ideally, there should be no speaker information in the x vector, as it expresses the channel shift
– We tried to substitute the y vectors by x and z
– z was much worse than y, but still gave a reasonable result (approx. 9% EER on 2006 det1)
– Surprisingly, using x gave around 25% EER on 2006 det1
– So let's train a linear fusion of all these systems

SVM vs. CD Scoring: Results in the x, y, z Space

• SVM – kernel combination; uses y, z only (no improvement from using x)
• CD scoring – applies gender-dependent ZT-norm; uses gender-dependent linear logistic regression (for 2006, trained on 2008 and vice versa)

System | 2006 det3 EER / DCF | 2006 det1 EER / DCF | 2008 det6 EER / DCF
SVM w. cosine kernel | 2.98 / 1.55 | 4.09 / 2.02 | 7.20 / 3.24
SVM w. cosine kernel – y, z | 2.08 / 1.27 | 3.62 / 2.00 | 6.60 / 3.41
Cosine dist w. ZT-norm | 2.82 / 1.50 | 4.07 / 2.11 | 6.95 / 3.55
Cosine dist w. ZT-norm – x, y, z | 2.11 / 1.26 | 3.62 / 1.87 | 6.24 / 3.29

SVM vs. CD Scoring: Conclusion

CD scoring
• Positives:
– The scoring problem is symmetric
– No training steps
• Negatives:
– ZT-norm is needed
– Poorer relative improvement on all trials (det1)
– Possibly needs calibration

SVM
• Positives:
– Generalizes well to all trials (det1)
– No need for score normalization
• Negatives:
– SVM training procedure


Discriminative Optimization of Speaker Recognition Systems

Lukas Burget & Niko Brummer, with lots of help from other team members, especially

Ondrej Glembek, Najim Dehak and Valja Hubeika.

Discriminative Training: What is new here?

• Discriminative training of speaker models has been around for more than a decade,

• and SVM speaker modeling has been a constant feature at the NIST SRE evaluations since 2003.

• So what is new in this work?

Discriminative Training: What is new here?

• We propose to discriminatively optimize the whole speaker recognition system, rather than individual speaker models.

Traditional Discriminative Training

[Diagram: enrollment and test speech → feature extraction → estimate model / match → score; discriminative optimization is applied to the speaker model]

Current State-of-the-Art

[Diagram: the same pipeline, with the system hyper-parameters estimated by generative modeling via Joint Factor Analysis (ML optimization)]

Proposed Discriminative System Optimization

[Diagram: the same pipeline, with the system hyper-parameters trained by discriminative optimization of the output score]

Proposed Discriminative System Optimization

[Diagram as above]

This methodology directly measures and optimizes the quality of the output of the whole system.

Discriminative Training: What is new here?

• Typically we have a small amount of enrollment data for the target speaker, which disallows the use of standard discriminative techniques

• We need to consider inter-session variability – an important problem in SRE

• Only recent data collections, with the same speaker recorded over various channels, allowed us to start work in this direction

Discriminative Training: What is new here?

[Illustration: UBM, target speaker model and test data with the original decision boundary and a discriminative decision boundary, relative to the high inter-session and high speaker variability directions]


Discriminative System Optimization

• Motivation

• Envisioned advantages

• Challenges

• A few techniques to address these challenges

• Some preliminary experiments

Discriminative System Optimization: Motivation

• Several participants of this workshop have previous successful experience with similar training:
– Discriminative training of weighted linear combinations of the outputs of multiple sub-systems (a.k.a. fusion) has been very successful in the last few NIST Speaker Recognition Evaluations (neural networks, SVM, logistic regression)
– Lukas and BUT were very successful with discriminative (MMI) training of GMMs in the similar task of language recognition in the last two NIST LREs

Discriminative System Optimization: Envisioned Advantages

• Discriminative training can compensate for unrealistic generative modeling assumptions
– It could find hyperparameter estimates that give better accuracy than ML estimates

• Discriminative training can optimize smaller, simpler, faster systems to rival the accuracy of larger, generatively trained systems
– In this workshop we concentrated on this aspect, with a few encouraging results

Discriminative System Optimization: Challenges

• This is a difficult problem!
– In large LVCSR systems it took years for discriminative methods to catch up with generative ones

• Challenges include:
– complexity and scale of implementation
– overtraining

Discriminative System Optimization: Challenges

• Complexity:
– Computation of derivatives for optimization (gradient, Hessian) of complex systems
– Finding and coding good numerical optimization algorithms

• Scale (CPU, memory):
– Our current state-of-the-art systems can have tens of millions of parameters
– 1500 hours of training speech, or 250 million training examples

• Overtraining (up to millions of parameters)

151


Techniques: Computing Derivatives

• We tried a coding technique that automatically implements the chain rule for partial derivatives of composed functions
– Similar to back-propagation in neural networks
– Computationally equivalent to reverse-mode automatic differentiation
– Did not scale well for our problem: it involved multiplication of multiple Jacobian matrices of very large dimension (see the sketch below)

• Our solution was to restrict our efforts to very simple system designs, for which the derivatives could be hand-coded and optimized
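To make the Jacobian issue concrete, here is a small illustrative sketch (invented dimensions and layer structure, not the workshop code): for a scalar score produced by a chain of linear maps, forming the full composite Jacobian requires matrix-matrix products, while reverse-mode accumulation (back-propagation) only ever needs vector-matrix products.

import numpy as np

rng = np.random.default_rng(0)
d = 200                              # toy dimension; real systems are far larger
W1 = rng.standard_normal((d, d))     # first linear "stage" of the system
W2 = rng.standard_normal((d, d))     # second linear "stage"
w3 = rng.standard_normal(d)          # final stage mapping to a scalar score
x = rng.standard_normal(d)

# Forward pass (nonlinearities omitted for brevity): score = w3 . (W2 (W1 x))
score = w3 @ (W2 @ (W1 @ x))

# Chain rule with explicit Jacobians: build the full d x d composite Jacobian
# first. This kind of Jacobian-times-Jacobian product is what scales poorly.
J_inner = W2 @ W1                    # O(d^3) time, O(d^2) memory
grad_naive = w3 @ J_inner

# Reverse-mode accumulation: push the scalar's gradient backwards one stage at
# a time using only vector-matrix products, O(d^2) each.
grad_reverse = (w3 @ W2) @ W1

assert np.allclose(grad_naive, grad_reverse)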

152


Derivatives: Hand-Optimized

• Lukas hand-optimized a gradient calculation of 6 million components over 440 million training examples to run in 15 minutes on a single machine

• This was made possible by:
– Replacing the GMM log-likelihood calculation with a linear approximation (without significant performance loss; a sketch follows below)
– Not doing ZT-norm (at some performance loss)
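The slide does not spell out the linear approximation; one common choice, and the assumption in the sketch below (in the spirit of "linear scoring" for JFA), replaces the GMM log-likelihood with a dot product between the speaker's offset from the UBM and the centered first-order statistics of the test utterance. Here $\mathbf{F}_{\mathcal{X}}$ and $\mathbf{N}_{\mathcal{X}}$ are the first- and zeroth-order Baum-Welch statistics of test utterance $\mathcal{X}$, $\boldsymbol{\Sigma}$ the UBM covariances, $\mathbf{m}$ the UBM mean supervector, and $\mathbf{V}\mathbf{y}_s$ the speaker offset:

\[
  \mathrm{score}(s,\mathcal{X}) \;\approx\;
  (\mathbf{V}\mathbf{y}_s)^{\top}\, \boldsymbol{\Sigma}^{-1}
  \bigl( \mathbf{F}_{\mathcal{X}} - \mathbf{N}_{\mathcal{X}}\,\mathbf{m} \bigr)
\]

Because this is linear in both the speaker offset and the test statistics, its derivatives with respect to the hyper-parameters are cheap to hand-code, which is what makes a hand-optimized gradient pass of this scale feasible.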

153


Techniques: Optimization Algorithms

• We investigated stochastic gradient descent (after an inspiring invited talk here at JHU by Yann LeCun)
– Did not scale well in our computing environment
– Difficult to set hyperparameters
– Not obvious how to parallelize over machines

154


Techniques: Optimization Algorithms

• We investigated MATLAB's Optimization Toolbox
– Tried the 'large scale' trust-region optimization algorithm
– Did not scale well in time and space
– Needs further investigation ...

155


Techniques: Optimization Algorithms

• Lukas was successful in his experiments with the Extended Baum-Welch algorithm

• I was successful in my experiments with the RPROP algorithm [1] (a minimal sketch follows below)

• In both cases, we coded our own optimization algorithms in MATLAB for economy of scale

[1] See http://en.wikipedia.org/wiki/Rprop
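For reference, here is a minimal sketch of the basic RPROP update, written in Python for illustration (the workshop implementations were in MATLAB); the step-size constants, the no-backtracking variant, and the helper names are assumptions, not the exact algorithm used:

import numpy as np

def rprop_minimize(grad_fn, w0, n_iter=200,
                   step0=0.01, step_min=1e-6, step_max=1.0,
                   eta_plus=1.2, eta_minus=0.5):
    # RPROP adapts a separate step size per parameter from the SIGN of
    # successive gradients; the gradient magnitude is never used.
    w = np.array(w0, dtype=float)
    step = np.full_like(w, step0)
    prev_grad = np.zeros_like(w)
    for _ in range(n_iter):
        g = grad_fn(w)
        agreement = prev_grad * g
        # Same sign as last step: keep going and grow the step size.
        step = np.where(agreement > 0, np.minimum(step * eta_plus, step_max), step)
        # Sign flipped: we overshot along this coordinate, so shrink the step.
        step = np.where(agreement < 0, np.maximum(step * eta_minus, step_min), step)
        w = w - np.sign(g) * step
        prev_grad = g
    return w

# Tiny usage example: minimize sum((w - 3)^2), whose gradient is 2 (w - 3);
# the result is close to a vector of threes.
w_opt = rprop_minimize(lambda w: 2.0 * (w - 3.0), np.zeros(5))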

156


Objective Function

• Our discriminative optimization objective function goes by many names: Maximum Mutual Information (MMI), minimum cross-entropy, logistic regression, ... (written out below)

– This criterion optimizes classification error rates over wide ranges of priors and cost functions
– For linear systems, it gives a nice convex optimization objective
– Gives some protection against over-training
– Has been applied very successfully to the fusion of sub-systems in NIST SRE evaluations
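The slides do not write the criterion down; as a hedged sketch, the prior-weighted binary cross-entropy usually meant by these names treats each score $s$ as a log-likelihood ratio, chooses an effective target prior $\pi$, and averages the logistic losses separately over the $N_{\mathrm{tar}}$ target and $N_{\mathrm{non}}$ non-target trials:

\[
  C(\pi) \;=\;
  \frac{\pi}{N_{\mathrm{tar}}} \sum_{i \in \mathrm{tar}}
    \log\!\bigl(1 + e^{-s_i - \operatorname{logit}\pi}\bigr)
  \;+\;
  \frac{1-\pi}{N_{\mathrm{non}}} \sum_{j \in \mathrm{non}}
    \log\!\bigl(1 + e^{\,s_j + \operatorname{logit}\pi}\bigr)
\]

With $\pi = 0.5$ and normalization by $\log 2$ this is the familiar $C_{\mathrm{llr}}$ measure; when the scores are affine functions of the parameters, $C(\pi)$ is convex, which is the convexity claimed above.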

157


Overtraining

• I was optimizing 90 000 parameters and Lukas 6 million

• This allows the training to learn irrelevant detail of the training data (even though we used hundreds of millions of training examples)

• We both managed to push the EER well below 1% on the development data (Switchboard, SRE'04+05) when we allowed the training to go too far

• These overtrained systems did not generalize to good performance on independent test data (SRE'06+08)

158


Regularization to Combat Overtraining

• We used early stopping to combat overtraining
– just stop training when performance on a cross-validation set stops improving (see the sketch below)

• We hope to be able to apply more principled approaches in the future
– adding SVM-style regularization penalties, or
– more general Bayesian methods with appropriate priors on the hyperparameters
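A minimal sketch of the early-stopping loop described above; the helper names `update_fn` and `cv_loss_fn` and the patience-based stopping rule are illustrative assumptions, not the exact procedure used in the workshop:

import copy

def train_with_early_stopping(params, update_fn, cv_loss_fn,
                              max_epochs=100, patience=3):
    # Keep the parameters that scored best on a held-out cross-validation set
    # and stop once that score has not improved for `patience` epochs in a row.
    best_params = copy.deepcopy(params)
    best_loss = cv_loss_fn(params)
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        params = update_fn(params)       # one discriminative update pass
        loss = cv_loss_fn(params)        # e.g. cross-entropy on the CV set
        if loss < best_loss:
            best_loss = loss
            best_params = copy.deepcopy(params)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                    # CV score stopped improving
    return best_params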

159


Proof of Concept Experiments

• Niko: smaller-scale experiment using 300-dimensional y-vectors for both train and test, training 90 000 parameters

• Lukas: larger-scale experiments using a 300-dimensional y-vector for train and a 20 000-dimensional statistic for test, training 6 million parameters

160


Small Scale Experiment

• Within-class covariance-normalized dot product between y-vectors for train and test (see the sketch below)

• The generative (ML) covariance estimate gives, on a subset (English females) of SRE 2006: EER = 2.61%

• Discriminative retraining of the covariance gave an 11% relative improvement: EER = 2.33%
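As a rough illustration of the generative baseline in this experiment (array shapes and function names are assumptions), within-class covariance normalization estimates the average within-speaker covariance of the y-vectors and scores a trial as a dot product in the corresponding whitened space:

import numpy as np

def wccn_transform(y_by_speaker):
    # y_by_speaker: list of (n_sessions_i, dim) arrays, one per training speaker;
    # assumes enough sessions in total for W to be full rank.
    dim = y_by_speaker[0].shape[1]
    W = np.zeros((dim, dim))
    n_total = 0
    for Y in y_by_speaker:
        Yc = Y - Y.mean(axis=0, keepdims=True)   # remove each speaker's own mean
        W += Yc.T @ Yc
        n_total += Y.shape[0]
    W /= n_total                                 # average within-speaker covariance
    # Return B such that B @ B.T = W^{-1} (Cholesky factor of the inverse).
    return np.linalg.cholesky(np.linalg.inv(W))

def wccn_score(B, y_enroll, y_test):
    # Equivalent to y_enroll^T W^{-1} y_test: a dot product after whitening.
    return float((B.T @ y_enroll) @ (B.T @ y_test))

With 300-dimensional y-vectors, treating the entries of this 300 x 300 normalization matrix as free parameters gives the 90 000 parameters mentioned earlier; the discriminative retraining presumably optimizes them under the cross-entropy objective instead of using the ML estimate.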

161


Large Scale Experiment 1

• Pure eigenvoice system (only V; no U and D)

• GMM with 512 components; 39D features

• V matrix trained discriminatively (300 x 20k parameters; a gradient sketch follows the results table)

• Fixed original speaker factors y

SRE 2006, all trials (det1):

EER [%]                                       No norm   ZT-norm
Generative V                                    15.44     11.42
Discriminative V                                 7.19      5.06
Discriminative V with channel-compensated y      6.80      4.81
Generative V and U                               6.99      4.07
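The slides do not show how the gradient with respect to V is formed. Assuming the linear-scoring approximation and the prior-weighted cross-entropy sketched earlier, it reduces to accumulating one outer product per trial; the following numpy sketch (trial layout, names, and shapes are illustrative assumptions, not the workshop code) shows the idea. The 300 x 20k entries of V are the 6 million trained parameters quoted above.

import numpy as np

def grad_wrt_V(V, trials, pi=0.5):
    # V: (supervector_dim, n_speaker_factors) eigenvoice matrix.
    # trials: list of (y, a, label) where y holds the enrollment speaker factors,
    #   a = Sigma^{-1} (F - N m) is the whitened, centered first-order statistic
    #   of the test utterance, and label is 1 for target trials, 0 otherwise.
    labels = [lab for _, _, lab in trials]
    n_tar = max(1, sum(labels))
    n_non = max(1, len(labels) - sum(labels))
    offset = np.log(pi / (1.0 - pi))
    grad = np.zeros_like(V)
    for y, a, lab in trials:
        s = float(a @ (V @ y)) + offset          # linear score plus prior offset
        p = 1.0 / (1.0 + np.exp(-s))             # posterior probability of "target"
        weight = (pi / n_tar) if lab else ((1.0 - pi) / n_non)
        grad += weight * (p - lab) * np.outer(a, y)   # d(score)/dV = a y^T
    return grad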

162


Large Scale Experiment 2

• Channel compensated system (V and U; no D)

• Only V matrix trained discriminatively

• Fixed original speaker factors y

SRE 2006, all trials (det1):

EER [%]                          No norm   ZT-norm
Generative V and U                 6.99      4.07
Discriminative V, generative U     6.00      3.87

163


Next Steps

• Re-estimation of other hyperparameters (e.g. U)

• Iterative re-estimation of both hyperparameters and factors

• Direct optimization of ZT-normalized system (derivatives difficult to compute)

164


Conclusion

• This is a large and difficult problem.

• But it has the potential for worthwhile gains:
– the possibility of systems that are more accurate, yet faster and smaller

• We have managed to show some proof of concept, but so far without improving on the state-of-the-art.

• Remaining problems are practical and theoretical:
– complexity of optimization
– principled methods for combating overtraining

166


Robust Speaker Recognition: Summary


Diarization
• Examined application of JFA and Bayesian methods to diarization
• Produced 3-4% DER on summed telephone speech
• Working on challenging interview speech

Factor Analysis Conditioning
• Explored ways to use JFA to account for non-session variability (phone)
• Showed robustness using within-session, stacking and hierarchical modeling

SVM-JFA
• Developed techniques to use JFA elements in SVM classifiers
• Results comparable to full JFA system but with fast scoring and no score normalization
• Better performance using all JFA factors

Discriminative System Optimization
• Focused on means to discriminatively optimize the whole speaker recognition system
• Demonstrated proof-of-concept experiments

167


Robust Speaker Recognition

• Extremely productive and enjoyable workshop

• Aim is to continue collaboration in problem areas going forward

• Cross-site, joint efforts will provide big gains in future speaker recognition evaluations and experiments

• Possible special session at ICASSP on team workshop efforts