
Page 1: Machine Learning for Speaker Recognition

INTERSPEECH 2016 Tutorial: Machine Learning for Speaker Recognition

Man-Wai Mak† and Jen-Tzung Chien‡

†The Hong Kong Polytechnic University, Hong Kong
‡National Chiao Tung University, Taiwan

September 8, 2016

1 / 274

Page 2: Machine Learning for Speaker Recognition

Table of Contents

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies

6 Future Direction

2 / 274

Page 3: Machine Learning for Speaker Recognition

Outline

1 Introduction
  1.1. Fundamentals of speaker recognition
  1.2. Feature extraction and scoring
  1.3. Modern speaker recognition approaches

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies

6 Future Direction

3 / 274

Page 4: Machine Learning for Speaker Recognition

Fundamentals of speaker recognition

Speaker recognition is a technique to recognize the identity of a speaker from a speech utterance.

Speaker recognition

Speaker identification

Speaker verification

Speaker diarization

Text dependent

Text independent

Open set

Close set

4 / 274

Page 5: Machine Learning for Speaker Recognition

Speaker identification

Determine whether an unknown speaker matches one of a set of known speakers
One-to-many mapping
Often assumed that the unknown voice must come from a set of known speakers – referred to as closed-set identification
Adding a "none of the above" option to closed-set identification gives open-set identification

5 / 274

Page 6: Machine Learning for Speaker Recognition

Speaker verification

Determine whether an unknown speaker matches a specific speaker
One-to-one mapping
Closed-set verification: the population of clients is fixed
Open-set verification: new clients can be added without having to redesign the system

6 / 274

Page 7: Machine Learning for Speaker Recognition

Speaker diarization

Determine when a speaker change has occurred in the speech signal (segmentation)
Group together speech segments corresponding to the same speaker (clustering)
Prior speaker information may or may not be available

7 / 274

Page 8: Machine Learning for Speaker Recognition

Input mode

Text-dependent
− Recognition system knows the text spoken by the person
− Fixed phrases or prompted phrases
− Used for applications with strong control over user input, e.g., biometric authentication
− Speech recognition can be used for checking the spoken text to improve system performance
− Sentences typically very short

Text-independent
− No restriction on the text, typically conversational speech
− Used for applications with less control over user input, e.g., forensic speaker ID
− More flexible, but recognition is more difficult
− Speech recognition can be used for extracting high-level features to boost performance
− Sentences typically very long

8 / 274

Page 9: Machine Learning for Speaker Recognition

Outline

1 Introduction
  1.1. Fundamentals of speaker recognition
  1.2. Feature extraction and scoring
  1.3. Modern speaker recognition approaches

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies

6 Future Direction

9 / 274

Page 10: Machine Learning for Speaker Recognition

Feature extraction

Speech is a time-varying signal conveying multiple layers of information
− Words
− Speaker
− Language
− Emotion

Information in speech is observed in the time and frequency domains

Acoustic features
− Speech is a continuous evolution of the vocal tract
− Need to extract a sequence of spectra or a sequence of spectral coefficients
− Use a sliding window: 25 ms window, 10 ms shift

Figure: windowed frame → log|X(ω)| → DCT → MFCC

10 / 274

Page 11: Machine Learning for Speaker Recognition

Feature extraction from speech

Feature extraction consists of transforming the speech signal into a set of feature vectors. Most feature extraction used in speaker recognition systems relies on a cepstral representation of speech.

Figure: Modular representation of MFCC feature extractor

11 / 274

Page 12: Machine Learning for Speaker Recognition

Computing MFCC

Mel-frequency cepstral coefficients (MFCCs) are computed from one frame of speech s(n) as follows.

DFT of the frame:

S(k) = \sum_{n=0}^{N-1} s(n)\, e^{-j 2\pi n k / N}

Log mel filter-bank energies, using triangular filters H_m(k), 1 \le m \le M:

X(m) = \ln\left[ \sum_{k=0}^{N-1} |S(k)|^2 H_m(k) \right]

DCT of the log filter-bank energies:

\mathrm{MFCC}_n = c_n = \sum_{m=1}^{M} X(m) \cos\left[ n \left( m - \tfrac{1}{2} \right) \tfrac{\pi}{M} \right] = d_n^T X, \quad 0 \le n \le P

\Rightarrow \text{MFCC vector} = [c_0\ c_1\ \cdots\ c_P]^T = D X

D: DCT transformation matrix [P × M]
M: no. of triangular filters in the filter bank, typically 20–30
P: no. of cepstral coefficients, typically 12
c_0: logarithm of the energy of the current frame
H_{15}(k): the 15th triangular filter of the filter bank

Figure: Computing MFCC from one frame of speech

12 / 274
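As a concrete illustration, here is a minimal NumPy sketch of the pipeline above (windowed frame → power spectrum → mel filter bank → log → DCT). The filter-bank construction is simplified, the frame is synthetic, and c_0 here is the 0th DCT coefficient rather than the frame log-energy, so treat it as an illustration of the equations rather than a production MFCC extractor.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(M, nfft, fs):
    """M triangular filters H_m(k) on an FFT grid of size nfft."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), M + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    H = np.zeros((M, nfft // 2 + 1))
    for m in range(1, M + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        H[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        H[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return H

def mfcc_one_frame(frame, fs, M=26, P=12):
    nfft = len(frame)
    S = np.fft.rfft(frame * np.hamming(nfft))                # S(k)
    power = np.abs(S) ** 2                                   # |S(k)|^2
    X = np.log(mel_filterbank(M, nfft, fs) @ power + 1e-12)  # X(m)
    c = dct(X, type=2, norm='ortho')                         # DCT -> cepstral coeffs
    return c[:P + 1]                                         # [c_0, ..., c_P]

# Toy usage: one 25-ms frame at 16 kHz
fs = 16000
frame = np.random.randn(400)       # stand-in for a windowed speech frame
print(mfcc_one_frame(frame, fs).shape)   # (13,)
```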

Page 13: Machine Learning for Speaker Recognition

Modeling sequence of features

For most recognition tasks, we need to model the distribution of feature vector sequences

In practice, we often use Gaussian mixture models (GMMs)

13 / 274

Page 14: Machine Learning for Speaker Recognition

GMM–UBM speaker verification

A Gaussian mixture model, namely the universal background model (UBM), is trained to represent the speech of the general population:

p(x|\text{UBM}) = p(x|\Lambda^{ubm}) = \sum_{c=1}^{C} \pi_c^{ubm}\, \mathcal{N}(x \,|\, \mu_c^{ubm}, \Sigma_c^{ubm})

The UBM parameters \Lambda^{ubm} = \{\pi_c^{ubm}, \mu_c^{ubm}, \Sigma_c^{ubm}\}_{c=1}^{C} are estimated by the expectation-maximization algorithm using the speech of many speakers.

14 / 274

Page 15: Machine Learning for Speaker Recognition

Expectation maximization (EM)

Denote the acoustic vectors from a large population as X = \{x_t;\ t = 1, \ldots, T\}

Expectation step: conditional distribution (responsibility) of mixture component c:

\gamma_t(c) = p(c|x_t) = \frac{\pi_c^{ubm}\, \mathcal{N}(x_t | \mu_c^{ubm}, \Sigma_c^{ubm})}{\sum_{c'=1}^{C} \pi_{c'}^{ubm}\, \mathcal{N}(x_t | \mu_{c'}^{ubm}, \Sigma_{c'}^{ubm})}

Maximization step:

Mixture weights: \pi_c^{ubm} = \frac{1}{T} \sum_{t=1}^{T} \gamma_t(c)

Mean vectors: \mu_c^{ubm} = \frac{\sum_{t=1}^{T} \gamma_t(c)\, x_t}{\sum_{t=1}^{T} \gamma_t(c)}

Covariance matrices: \Sigma_c^{ubm} = \frac{\sum_{t=1}^{T} \gamma_t(c)\, x_t x_t^T}{\sum_{t=1}^{T} \gamma_t(c)} - \mu_c^{ubm} (\mu_c^{ubm})^T
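A minimal NumPy sketch of one EM iteration implementing these updates; as a simplifying assumption the covariances are diagonal (stored as variance vectors), and the variable names (pi, mu, var, gamma) are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, var):
    """One EM iteration for a diagonal-covariance GMM/UBM.
    X: (T, D) acoustic vectors; pi: (C,); mu, var: (C, D)."""
    T, D = X.shape
    C = len(pi)
    # E-step: responsibilities gamma_t(c)
    log_lik = np.stack([
        multivariate_normal.logpdf(X, mean=mu[c], cov=np.diag(var[c]))
        for c in range(C)], axis=1)                      # (T, C)
    log_post = np.log(pi) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)      # numerical stability
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)            # (T, C)
    # M-step: re-estimate weights, means, (diagonal) covariances
    Nc = gamma.sum(axis=0)                                # sum_t gamma_t(c)
    pi_new = Nc / T
    mu_new = (gamma.T @ X) / Nc[:, None]
    var_new = (gamma.T @ (X ** 2)) / Nc[:, None] - mu_new ** 2
    return pi_new, mu_new, np.maximum(var_new, 1e-6)
```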

15 / 274

Page 16: Machine Learning for Speaker Recognition

Target-speakers’ GMMs

Each target speaker is represented by a Gaussian mixture model:

p(x|\text{Spk } s) = p(x|\Lambda^{(s)}) = \sum_{c=1}^{C} \pi_c^{(s)}\, \mathcal{N}(x \,|\, \mu_c^{(s)}, \Sigma_c^{(s)})

where \Lambda^{(s)} = \{\pi_c^{(s)}, \mu_c^{(s)}, \Sigma_c^{(s)}\}_{c=1}^{C} are learned by maximum a posteriori (MAP) adaptation [Reynolds et al., 2000].

16 / 274

Page 17: Machine Learning for Speaker Recognition

Maximum a posteriori (MAP)

The MAP algorithm finds the parameters of the target-speaker's GMM given the UBM parameters \Lambda^{ubm} = \{\pi_c^{ubm}, \mu_c^{ubm}, \Sigma_c^{ubm}\}_{c=1}^{C}

The first step is the same as EM. Given T_s acoustic vectors X^{(s)} = \{x_1, \ldots, x_{T_s}\} from speaker s, we compute the statistics:

n_c = \sum_{t=1}^{T_s} \gamma_t(c) \quad \text{and} \quad E_c(X^{(s)}) = \frac{1}{n_c} \sum_{t=1}^{T_s} \gamma_t(c)\, x_t

Adapt the UBM parameters by

\mu_c^{(s)} = \alpha_c E_c(X^{(s)}) + (1 - \alpha_c)\, \mu_c^{ubm}

where

\alpha_c = \frac{n_c}{n_c + r}

and r is called the relevance factor. Typically, r = 16.
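Continuing the NumPy sketch (the em_step example above), MAP mean adaptation of a UBM to one speaker can be written in a few lines; gamma here is assumed to be the responsibility matrix of the enrollment data computed against the UBM, and r = 16 follows the slide.

```python
import numpy as np

def map_adapt_means(X_s, gamma, mu_ubm, r=16.0):
    """MAP-adapt UBM mean vectors to a target speaker.
    X_s: (Ts, D) enrollment vectors; gamma: (Ts, C) UBM responsibilities;
    mu_ubm: (C, D) UBM means; r: relevance factor."""
    n_c = gamma.sum(axis=0)                                    # n_c = sum_t gamma_t(c)
    E_c = (gamma.T @ X_s) / np.maximum(n_c, 1e-10)[:, None]    # E_c(X^(s))
    alpha = n_c / (n_c + r)                                    # data-dependent adaptation factor
    return alpha[:, None] * E_c + (1.0 - alpha)[:, None] * mu_ubm
```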

17 / 274

Page 18: Machine Learning for Speaker Recognition

MAP adaptation

Adapt the UBM model to each speaker using the MAP algorithm:¹

\mu_c^{(s)} = \alpha_c E_c(X^{(s)}) + (1 - \alpha_c)\, \mu_c^{ubm}

\alpha_c \to 1 when X^{(s)} comprises lots of data, and \alpha_c \to 0 otherwise.

¹Source: J. H. L. Hansen and T. Hasan, IEEE Signal Processing Magazine, 2015.

18 / 274

Page 19: Machine Learning for Speaker Recognition

MAP adaptation

In practice, only the mean vectors will be adapted:

19 / 274

Page 20: Machine Learning for Speaker Recognition

GMM-UBM scoring

Given the acoustic vectors X^{(t)} from a test speaker and a claimed identity s, speaker verification can be formulated as a 2-class hypothesis problem:

H0: X^{(t)} comes from the true speaker s
H1: X^{(t)} comes from an impostor

The verification score is a log-likelihood ratio:

S_{LR}(X^{(t)} | \Lambda^{(s)}, \Lambda^{ubm}) = \log p(X^{(t)} | \Lambda^{(s)}) - \log p(X^{(t)} | \Lambda^{ubm})
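A hedged sketch of GMM–UBM scoring, with the GMM log-likelihood computed frame by frame and averaged; the gmm_loglik helper is hypothetical and simply evaluates the mixture density from the (pi, mu, var) parameter format used in the earlier sketches.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_loglik(X, pi, mu, var):
    """Average per-frame log-likelihood of X under a diagonal-covariance GMM."""
    log_comp = np.stack([
        np.log(pi[c]) + multivariate_normal.logpdf(X, mean=mu[c], cov=np.diag(var[c]))
        for c in range(len(pi))], axis=1)
    return logsumexp(log_comp, axis=1).mean()

def llr_score(X_test, spk_params, ubm_params):
    """S_LR = log p(X | speaker GMM) - log p(X | UBM)."""
    return gmm_loglik(X_test, *spk_params) - gmm_loglik(X_test, *ubm_params)
```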

20 / 274

Page 21: Machine Learning for Speaker Recognition

Sources of variability

21 / 274

Page 22: Machine Learning for Speaker Recognition

How to account for variability

GMM-SVM [Campbell et al., 2006]:
− Create supervectors from target-speaker GMMs.
− Project the supervectors to a subspace in which inter-speaker variability is maximized and nuisance variability is minimized.
− Perform SVM classification in the projected subspace.

Joint factor analysis:
− Speaker and session variabilities are represented by latent variables (speaker factors and channel factors) in a factor analysis model.
− During scoring, session variabilities are accounted for by integrating over the latent variables, e.g., the channel factors as in [Kenny et al., 2007a].

I-vector + PLDA:²
− Utterances are represented by the posterior means of latent factors, called the i-vectors [Dehak et al., 2011].
− I-vectors capture both speaker and channel information.
− During scoring, the unwanted channel variability is removed by LDA projection or by integrating out the latent factors in the PLDA model.

²For the relationship between JFA and i-vectors and their derivations, see http://www.eie.polyu.edu.hk/∼mwmak/papers/FA-Ivector.pdf

22 / 274

Page 23: Machine Learning for Speaker Recognition

Performance measures

For speaker identification

Recognition Rate = \frac{\text{No. of correct recognitions}}{\text{Total no. of trials}}

For speaker verification

False Rejection Rate (FRR) = Miss probability = \frac{\text{No. of true speakers rejected}}{\text{Total no. of true-speaker trials}}

False Acceptance Rate (FAR) = False alarm probability = \frac{\text{No. of impostors accepted}}{\text{Total no. of impostor attempts}}

The equal error rate (EER) corresponds to the operating point at which FAR = FRR
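A small sketch of how FAR, FRR, and the EER can be estimated from verification scores; the simple threshold sweep below is illustrative, not the ROC-convex-hull EER used by some evaluation toolkits.

```python
import numpy as np

def far_frr_eer(target_scores, impostor_scores):
    """Sweep a threshold over all scores and return FAR, FRR, and an EER estimate."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))        # operating point where FAR ~ FRR
    return far, frr, (far[i] + frr[i]) / 2

# Toy usage with synthetic scores
tar = np.random.normal(2.0, 1.0, 1000)      # true-speaker trials
non = np.random.normal(0.0, 1.0, 10000)     # impostor trials
print("EER ~", far_frr_eer(tar, non)[2])
```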

23 / 274

Page 24: Machine Learning for Speaker Recognition

Performance measures for speaker verification

Detection error tradeoff (DET) curves are similar to receiver operating characteristic curves but with nonlinear x- and y-axes.

24 / 274

Page 25: Machine Learning for Speaker Recognition

Detection cost functions

The detection cost function (DCF) is a weighted sum of the FRR (P_{Miss|Target}) and FAR (P_{FalseAlarm|Nontarget}):

C_{Det}(\theta) = C_{Miss} \times P_{Miss|Target}(\theta) \times P_{Target} + C_{FalseAlarm} \times P_{FalseAlarm|Nontarget}(\theta) \times (1 - P_{Target})

where \theta is a decision threshold.

Normalized cost:

C_{Norm} = C_{Det}(\theta) / C_{Default}, \quad \text{where } C_{Default} = \min\{ C_{Miss} \times P_{Target},\ C_{FalseAlarm} \times (1 - P_{Target}) \}

NIST 2008 SRE and earlier: C_{Miss} = 10; C_{FalseAlarm} = 1; P_{Target} = 0.01

NIST 2010 SRE: C_{Miss} = 1; C_{FalseAlarm} = 1; P_{Target} = 0.001
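A sketch of the normalized DCF computed from the FAR/FRR arrays of the earlier EER example; the default parameter values follow the NIST 2010 setting quoted above.

```python
import numpy as np

def normalized_dcf(far, frr, c_miss=1.0, c_fa=1.0, p_target=0.001):
    """Minimum normalized detection cost over a range of thresholds."""
    c_det = c_miss * frr * p_target + c_fa * far * (1.0 - p_target)
    c_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return (c_det / c_default).min()     # oracle-threshold (min) normalized DCF
```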

25 / 274

Page 26: Machine Learning for Speaker Recognition

Detection cost functions

Detection cost function for NIST 2012 SRE:

C_{Det}(\theta) = C_{Miss} \times P_{Miss|Target}(\theta) \times P_{Target} + C_{FalseAlarm} \times (1 - P_{Target}) \times \left[ P_{FalseAlarm|KnownNontarget}(\theta) \times P_{Known} + P_{FalseAlarm|UnknownNontarget}(\theta) \times (1 - P_{Known}) \right]

C_{Norm}(\theta) = C_{Det}(\theta) / (C_{Miss} \times P_{Target})

Parameters for core test conditions: C_{Miss} = 1; C_{FalseAlarm} = 1; P_{Target1} = 0.01; P_{Target2} = 0.001; P_{Known} = 0.5

P_{Target1} \to C_{Norm1}(\theta_1) and P_{Target2} \to C_{Norm2}(\theta_2)

Primary cost:

C_{Primary} = \frac{C_{Norm1}(\theta_1) + C_{Norm2}(\theta_2)}{2}

26 / 274

Page 27: Machine Learning for Speaker Recognition

Detection cost functions

Detection cost function for NIST 2016 SRE:

C_{Det}(\theta) = C_{Miss} \times P_{Miss|Target}(\theta) \times P_{Target} + C_{FalseAlarm} \times P_{FalseAlarm|Nontarget}(\theta) \times (1 - P_{Target})

C_{Norm}(\theta) = C_{Det}(\theta) / (C_{Miss} \times P_{Target})

Parameters for core test conditions: C_{Miss} = 1; C_{FalseAlarm} = 1; P_{Target1} = 0.01; P_{Target2} = 0.005

P_{Target1} \to C_{Norm1}(\theta_1) and P_{Target2} \to C_{Norm2}(\theta_2)

Primary cost:

C_{Primary} = \frac{C_{Norm1}(\theta_1) + C_{Norm2}(\theta_2)}{2}

27 / 274

Page 28: Machine Learning for Speaker Recognition

Outline

1 Introduction
  1.1. Fundamentals of speaker recognition
  1.2. Feature extraction and scoring
  1.3. Modern speaker recognition approaches

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies

6 Future Direction

28 / 274

Page 29: Machine Learning for Speaker Recognition

Main approach

Figure: taxonomy of speaker recognition approaches, split into parametric and nonparametric models; the methods covered include the artificial neural network, Gaussian mixture model, joint factor analysis, i-vector, auto-encoder, deep neural network, support vector machine, probabilistic linear discriminant analysis, and restricted Boltzmann machine.

29 / 274

Page 30: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms
  2.1. Machine learning
  2.2. EM algorithm
  2.3. Approximate inference
  2.4. Bayesian learning

3 Learning Models

4 Deep Learning

5 Case Studies

6 Future Direction

30 / 274

Page 31: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms
  2.1. Machine learning
  2.2. EM algorithm
  2.3. Approximate inference
  2.4. Bayesian learning

3 Learning Models

4 Deep Learning

5 Case Studies

6 Future Direction

31 / 274

Page 32: Machine Learning for Speaker Recognition

Model-based method

Machine learning provides a wide range of model-based approaches for speaker recognition
The model-based approach aims to incorporate the physical phenomena, measurements, uncertainties and noises in the form of mathematical models
This approach is developed in a unified manner through different algorithms, examples, applications, and case studies
Main-stream methods are based on statistical models
Latent variable models in speaker recognition include
− joint factor analysis (JFA)
− probabilistic linear discriminant analysis (PLDA)
− Gaussian mixture model (GMM)
− mixture of PLDA

32 / 274

Page 33: Machine Learning for Speaker Recognition

Neural network

Deep structured/hierarchical learning
Rapidly developed and widely applied in many applications
Multiple layers of nonlinear processing units
High-level abstraction

Figure: example deep network producing high-level action labels (e.g., "Run", "Jump")

33 / 274

Page 34: Machine Learning for Speaker Recognition

Model-based method vs. neural network

                          Model-based method    Neural network
  Structure               Top-down              Bottom-up
  Representation          Intuitive             Distributed
  Interpretation          Easy                  Harder
  Semi/unsupervised       Easier                Harder
  Incorp. domain knowl.   Easy                  Hard
  Incorp. constraint      Easy                  Hard
  Incorp. uncertainty     Easy                  Hard
  Learning                Many algorithms       Back-propagation
  Inference/decode        Harder                Easier
  Evaluation on           ELBO                  End performance

34 / 274


Page 38: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms
  2.1. Machine learning
  2.2. EM algorithm
  2.3. Approximate inference
  2.4. Bayesian learning

3 Learning Models

4 Deep Learning

5 Case Studies

6 Future Direction

38 / 274

Page 39: Machine Learning for Speaker Recognition

Parameter estimation

Assume we have a collection of acoustic frames X = \{x_t\}_{t=1}^{T} for estimation of the model parameters \theta

Maximum likelihood (ML) estimation:

\theta_{ML} = \arg\max_{\theta}\ p(X|\theta)

Maximum a posteriori (MAP) estimation:

\theta_{MAP} = \arg\max_{\theta}\ p(\theta|X) = \arg\max_{\theta}\ p(X|\theta)\, p(\theta)

where p(\theta) denotes the prior distribution of \theta

39 / 274

Page 40: Machine Learning for Speaker Recognition

Expectation-maximization algorithm

Likelihood function for observations x in a latent variable model with latent variable z:

p(x|\theta) = \sum_{z} p(x, z|\theta)

Expectation (E) step: calculate the auxiliary function

Q(\theta, \theta^{old}) = E_z[\log p(x, z|\theta) \mid x, \theta^{old}]

Maximization (M) step: find a new estimate \theta^{new} via

\theta^{new} = \arg\max_{\theta}\ Q(\theta, \theta^{old})

The EM algorithm [Dempster et al., 1977] for ML can be extended to MAP

40 / 274

Page 41: Machine Learning for Speaker Recognition

Lower bound & KL divergence

Introduce an approximate or variational distribution q(z) and apply Jensen's inequality for the convex function -\log(\cdot) to obtain

\log p(x|\theta) = \log \sum_{z} q(z) \frac{p(x, z|\theta)}{q(z)} = \log E_q\!\left[\frac{p(x, z|\theta)}{q(z)}\right] \ge E_q\!\left[\log \frac{p(x, z|\theta)}{q(z)}\right] \triangleq \mathcal{L}(q, \theta)

\sum_{z} q(z) \log p(x|\theta) - \mathcal{L}(q, \theta) = -\sum_{z} q(z) \log \frac{p(z|x, \theta)}{q(z)} \triangleq KL(q \| p)

Evidence decomposition:

\log p(x|\theta) = KL(q \| p) + \mathcal{L}(q, \theta)

41 / 274

Page 42: Machine Learning for Speaker Recognition

Maximum Likelihood

KL(q \| p) = -E_q[\log p(z|x, \theta)] - H_q[z]

\mathcal{L}(q, \theta) = E_q[\log p(x, z|\theta)] + H_q[z]

Maximizing p(x|\theta) is equivalent to first setting KL(q \| p) = 0, i.e., approximating (E-step)

q(z) = p(z|x, \theta^{old})

and then maximizing the resulting lower bound (M-step)

\mathcal{L}(q, \theta) \triangleq Q(\theta, \theta^{old}) + \text{const}

where Q(\theta, \theta^{old}) \triangleq E_q[\log p(x, z|\theta) \mid x, \theta^{old}] is a concave function

42 / 274

Page 43: Machine Learning for Speaker Recognition

EM algorithm

Figure: decomposition of \log p(x|\theta) into the lower bound \mathcal{L}(q, \theta) and the gap KL(q \| p)

43 / 274

Page 44: Machine Learning for Speaker Recognition

EM algorithm: E-step

Figure: E-step — setting q(z) = p(z|x, \theta^{old}) makes KL(q \| p) = 0, so \mathcal{L}(q, \theta^{old}) = \log p(x|\theta^{old})

44 / 274

Page 45: Machine Learning for Speaker Recognition

EM algorithm: M-step

Figure: M-step — maximizing \mathcal{L}(q, \theta) over \theta gives \theta^{new}; since KL(q \| p) \ge 0, \log p(x|\theta^{new}) increases

45 / 274

Page 46: Machine Learning for Speaker Recognition

EM algorithm: lower bound

Figure: the lower bound \mathcal{L}(q, \theta) touches \log p(x|\theta) at \theta^{old} and is maximized at \theta^{new}

46 / 274

Page 47: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms
  2.1. Machine learning
  2.2. EM algorithm
  2.3. Approximate inference
  2.4. Bayesian learning

3 Learning Models

4 Deep Learning

5 Case Studies

6 Future Direction

47 / 274

Page 48: Machine Learning for Speaker Recognition

Why approximate inference?

There are a number of latent variables in model-based speaker recognition
− i-vectors
− common factors
− variability matrix
− mixture labels
− channel, speaker and noise information

The posterior distribution of the latent variables should be analytical and factorizable

Evolution of inference algorithms
− maximum likelihood
− maximum a posteriori
− variational Bayesian
− Gibbs sampling

48 / 274

Page 49: Machine Learning for Speaker Recognition

Posterior distribution

The posterior is obtained from the likelihood, the prior, and the marginal likelihood (model evidence) p(x):

p(z|x) = \frac{p(x|z)\, p(z)}{\int p(x|z)\, p(z)\, dz} = \frac{p(x|z)\, p(z)}{p(x)}

The latent variables and parameters z = \{z_1, \ldots, z_m\} are coupled

49 / 274

Page 50: Machine Learning for Speaker Recognition

Approximate posterior

Figure: a proxy distribution q(z) and the true posterior p(z|x), with the divergence KL(q \| p) between them

Find an approximate distribution q(z) that is factorizable and maximally similar to the true posterior p(z|x)

50 / 274

Page 51: Machine Learning for Speaker Recognition

Variational Bayesian inference

Factorized (mean-field) variational distribution:

q(z_{1:m} | \nu_{1:m}) = \prod_{j=1}^{m} q(z_j | \nu_j)

Variational calculus turns inference into an optimization problem over the functional \mathcal{L}(q): q \mapsto \mathcal{L}(q):

\max_{q} \mathcal{L}(q) \quad \text{s.t.} \int_{z} q(dz) = 1

Evidence decomposition:

\log p(x) = KL(q \| p) + \mathcal{L}(q)

where KL(q \| p) = -E_q[\log p(z|x)] - H_q[z] and \mathcal{L}(q) = E_q[\log p(x, z)] + H_q[z]

\mathcal{L}(q) is the Evidence Lower BOund (ELBO)

51 / 274

Page 52: Machine Learning for Speaker Recognition

Figure: \log p(x) decomposed into \mathcal{L}(q) and KL(q \| p) before and after updating q

Estimation of the variational distribution:

\max_{q(z)}\ E_q[\log p(x, z)] + H_q[z] \quad \text{s.t.} \int_{z} q(dz) = 1

The optimal factor for each z_j has the mean-field form

q(z_j | \nu_j) = \frac{\exp(E_{i \ne j}[\log p(x, z | \nu)])}{\int \exp(E_{i \ne j}[\log p(x, z | \nu)])\, dz_j}

52 / 274

Page 53: Machine Learning for Speaker Recognition

Variational Bayesian (VB) inference is implemented via a doubly-looped algorithm

VB-EM algorithm
− VB-E step: calculate the variational distribution q(z) in the inner loop

  q(z) = \arg\max_{q(z)} \mathcal{L}(q, \theta)

− VB-M step: calculate the model parameters \theta in the outer loop

  \theta = \arg\max_{\theta} \mathcal{L}(q, \theta)

Convex optimization is performed in each step
The VB-EM steps converge after a number of iterations

53 / 274

Page 54: Machine Learning for Speaker Recognition

Gibbs sampling algorithm

Initialize z(1), where z = z1:m

for τ ← 1 to T − 1 do

for j ← 1 to m do

Sample z(τ+1)j ∼ p(zj |z(τ+1)

1:(j−1), z(τ)j+1:m)

end for

end for
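A minimal sketch of such a sampler for the two-component Gaussian mixture illustrated on the following slides: it alternates between sampling each component label given the current means and sampling the means given the labels (conjugate normal prior, known variance and weights). All priors and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_gmm(x, n_iter=200, sigma2=1.0, tau2=100.0, pi=(0.5, 0.5)):
    """Gibbs sampler for a 1-D, 2-component GMM with known variance and weights.
    Alternates z | mu and mu | z, as in the generic algorithm above."""
    mu = rng.normal(0.0, 1.0, size=2)           # initial component means
    z = rng.integers(0, 2, size=len(x))         # initial labels z^{(1)}
    for _ in range(n_iter):
        # Sample each label from its conditional given the current means
        log_p = np.log(pi) - 0.5 * (x[:, None] - mu[None, :]) ** 2 / sigma2
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = (rng.random(len(x)) < p[:, 1]).astype(int)
        # Sample each mean from its conditional given the labels (prior N(0, tau2))
        for k in range(2):
            xk = x[z == k]
            post_var = 1.0 / (len(xk) / sigma2 + 1.0 / tau2)
            post_mean = post_var * xk.sum() / sigma2
            mu[k] = rng.normal(post_mean, np.sqrt(post_var))
    return z, mu

# Toy data from two Gaussians
x = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
labels, means = gibbs_gmm(x)
print("sampled means:", np.sort(means))
```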

54 / 274

Page 55: Machine Learning for Speaker Recognition

Gibbs sampling

Two dimensional Gaussian mixture model with two mixture components

55 / 274

Page 56: Machine Learning for Speaker Recognition

Gibbs sampling

z_j \sim p(\,\cdot \mid z_{-j}, x)

Randomly assign mixture component for each sample j

56 / 274

Page 57: Machine Learning for Speaker Recognition

Gibbs sampling

z_j \sim p(\,\cdot \mid z_{-j}, x)

Extract one sample and compute the conditional distribution

57 / 274

Page 58: Machine Learning for Speaker Recognition

Gibbs sampling

z_j \sim p(\,\cdot \mid z_{-j}, x)

Sample a mixture component from the conditional distribution; these two steps are repeated for every sample over many sweeps

58 / 274


Page 73: Machine Learning for Speaker Recognition

Gibbs sampling

z_j \sim p(\,\cdot \mid z_{-j}, x)

Finally obtain an appropriate clustering result

73 / 274

Page 74: Machine Learning for Speaker Recognition

Variational Bayes
− deterministic approximation
− find an analytical proxy q(z) that is maximally similar to p(z|x)
− inspect distribution statistics
− never generates exact results
− fast
− often hard work to derive
− convergence guarantees
− needs a specific parametric form

Gibbs sampling
− stochastic approximation
− design an algorithm that draws samples z^{(1)}, \ldots, z^{(\tau)} from p(z|x)
− inspect sample statistics
− asymptotically exact
− computationally expensive
− tricky engineering concerns
− no convergence guarantees
− no need for a parametric form

74 / 274

Page 75: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms
  2.1. Machine learning
  2.2. EM algorithm
  2.3. Approximate inference
  2.4. Bayesian learning

3 Learning Models

4 Deep Learning

5 Case Studies

6 Future Direction

75 / 274

Page 76: Machine Learning for Speaker Recognition

Challenges in model-based approach

Thomas Bayes (1701–1761)

We are facing the challenges of big data
An enormous amount of multimedia data is available on the internet, containing speech, text, images, music, video, social networks and specialized technical data
The collected data are usually noisy, non-labeled, non-aligned, mismatched, and ill-posed
Probabilistic models may be improperly assumed, over-estimated, or under-estimated

76 / 274

Page 77: Machine Learning for Speaker Recognition

Uncertainty modeling

We need tools for modeling, analyzing, searching, recognizing and understanding real-world data
Our modeling tools should
− faithfully represent uncertainty in model structure and parameters
− reflect noise conditions in the observed data
− be automated and adaptive
− assure robustness
− scale to large data sets

Uncertainty can be properly expressed by a prior distribution or process

77 / 274

Page 78: Machine Learning for Speaker Recognition

Model regularization

Regularization refers to the process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting
Occam's razor is imposed to deal with the issue of model selection
Scalable modeling

78 / 274

Page 79: Machine Learning for Speaker Recognition

Bayesian speaker recognition

Real-world speaker recognition
− unsupervised learning
− number of factors is unknown
− very short enrollment utterances
− high inter/intra-speaker variabilities
− variabilities from channel and noise

Why Bayesian? [Watanabe and Chien, 2015]
− exploration of latent variables
− model regularization
− uncertainty modeling
− approximate Bayesian inference
− better prediction

79 / 274

Page 80: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models
  3.1. GMM-UBM system
  3.2. Joint factor analysis
  3.3. Probabilistic linear discriminant analysis
  3.4. Support vector machine
  3.5. Restricted Boltzmann machine

4 Deep Learning

5 Case Studies

6 Future Direction

80 / 274

Page 81: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models
  3.1. GMM-UBM system
  3.2. Joint factor analysis
  3.3. Probabilistic linear discriminant analysis
  3.4. Support vector machine
  3.5. Restricted Boltzmann machine

4 Deep Learning

5 Case Studies

6 Future Direction

81 / 274

Page 82: Machine Learning for Speaker Recognition

GMM distribution from three Gaussians

Figure: a one-dimensional GMM density p(x) formed from three Gaussian components

82 / 274

Page 83: Machine Learning for Speaker Recognition

Gaussian mixture model

A Gaussian mixture model (GMM) is a weighted sum of Gaussians:

p(x|\theta) = \sum_{i=1}^{M} \pi_i\, b_i(x), \quad \theta = \{\pi_i, u_i, \Sigma_i\}

\pi_i: mixture weight
u_i: mixture mean vector
\Sigma_i: mixture covariance matrix

b_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - u_i)^{\top} \Sigma_i^{-1} (x - u_i) \right)

The mixture indicator z_i is a latent variable which is either zero or one

83 / 274

Page 84: Machine Learning for Speaker Recognition

Maximum likelihood

In ML estimation, we need to
− compute the likelihood of a sequence of features given a GMM
− estimate the parameters of the GMM given a set of feature vectors

Assuming independence between features in a sequence, we have

p(X|\theta) = p(x_1, \ldots, x_T | \theta) = \prod_{t=1}^{T} p(x_t | \theta)

ML estimation is performed by

\theta_{ML} = \arg\max_{\theta} p(X|\theta) = \arg\max_{\theta} \sum_{t=1}^{T} \log\left[ \sum_{i=1}^{M} \pi_i\, b_i(x_t) \right]

84 / 274

Page 85: Machine Learning for Speaker Recognition

Parameter estimation

The E-step calculates the auxiliary function

Q(\theta, \theta^{old}) = E_z[\log p(X, z|\theta) \mid X, \theta^{old}] = \sum_{t=1}^{T} \sum_{i=1}^{M} p(z_{ti} = 1 | X, \theta^{old}) \log p(x_t, z_{ti} = 1 | \theta)

The ML estimates are obtained in the M-step as

\pi_i^{new} = \frac{T_i}{T}

\mu_i^{new} = \frac{1}{T_i} \sum_{t=1}^{T} \gamma(z_{ti})\, x_t = \frac{1}{T_i} E_i[x]

\Sigma_i^{new} = \frac{1}{T_i} \sum_{t=1}^{T} \gamma(z_{ti}) (x_t - \mu_i)(x_t - \mu_i)^{\top} = \frac{1}{T_i} E_i[x x^{\top}] - \mu_i \mu_i^{\top}

where T_i = \sum_t \gamma(z_{ti}) and \gamma(z_{ti}) = p(z_{ti} = 1 | X, \theta^{old})

85 / 274

Page 86: Machine Learning for Speaker Recognition

E-step

Probabilistically align samples to each mixture:

p(z_i = 1 | x) = \frac{\pi_i\, b_i(x)}{\sum_{j=1}^{M} \pi_j\, b_j(x)}

Accumulate sufficient statistics:

T_i = \sum_{t=1}^{T} p(z_{ti} = 1 | X)

E_i(x) = \sum_{t=1}^{T} p(z_{ti} = 1 | X)\, x_t

E_i(x x^{\top}) = \sum_{t=1}^{T} p(z_{ti} = 1 | X)\, x_t x_t^{\top}

86 / 274

Page 87: Machine Learning for Speaker Recognition

M-step

Update the GMM parameters:

\pi_i^{new} = \frac{T_i}{T}, \quad \mu_i^{new} = \frac{1}{T_i} E_i[x], \quad \Sigma_i^{new} = \frac{1}{T_i} E_i[x x^{\top}] - \mu_i \mu_i^{\top}

87 / 274

Page 88: Machine Learning for Speaker Recognition

Speaker verification

Realization of the log-likelihood ratio test from signal detection theory:

S_{LR}(X | \theta_{target}, \theta_{ubm}) = \log p(X | \theta_{target}) - \log p(X | \theta_{ubm})

Figure: the test utterance x_1, \ldots, x_T is passed through the feature extractor and scored against the target model (+) and the background model (−)

GMMs are used for both target and background models
− target model trained using enrollment speech
− universal background model trained using speech from many speakers

88 / 274

Page 89: Machine Learning for Speaker Recognition

Target model & UBM

The target model is adapted from the universal background model (UBM)
− good with limited target training data

Maximum a posteriori (MAP) adaptation
− align target training vectors to the UBM
− accumulate sufficient statistics
− update target model parameters with smoothing to the UBM parameters

Adaptation only for the parameters of seen acoustic events
− sparse regions of the feature space are filled in by the UBM parameters

Side benefits
− keeps correspondence between target and UBM mixtures
− allows fast scoring when using many target models (top-M scoring)

89 / 274

Page 90: Machine Learning for Speaker Recognition

Maximum a posteriori adaptation

A prior density for the GMM mean vectors \mu = \{\mu_i\} is introduced:

p(\mu_i) = \mathcal{N}(\mu_i | \mu_i^{ubm}, \sigma^2 I)

MAP estimation [Gauvain and Lee, 1994] is performed using the enrollment data X_s = \{x_t\}_{t=1}^{T_s} from a target speaker s

− E-step: calculate

Q(\mu_i, \mu_i^{old}) = \sum_{t=1}^{T_s} \gamma(z_{ti}) \log p(x_t | z_{ti} = 1, \mu_i) + \log p(\mu_i | \mu_i^{ubm}, \sigma^2 I)

− M-step: maximize Q(\mu_i, \mu_i^{old}) to find

\mu_i^{new} = \frac{\sum_t \gamma(z_{ti})\, x_t}{\sum_t \gamma(z_{ti}) + r} + \frac{r\, \mu_i^{ubm}}{\sum_t \gamma(z_{ti}) + r} = \alpha_i E_i[x] + (1 - \alpha_i)\, \mu_i^{ubm}

where \alpha_i = \frac{\sum_t \gamma(z_{ti})}{\sum_t \gamma(z_{ti}) + r}

90 / 274

Page 91: Machine Learning for Speaker Recognition

MAP adaptation for GMM-UBM

The UBM is a GMM trained with the EM algorithm
The speaker GMM is established by adjusting the UBM to the enrollment data using MAP adaptation

Figure: training data → EM → UBM; UBM + enrollment data → MAP → speaker GMM

91 / 274

Page 92: Machine Learning for Speaker Recognition

Speaker recognition procedure

Figure: enrollment utterances → feature extractor → modeling → GMM-UBM; test utterance → feature extractor → scoring → results

92 / 274

Page 93: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models
  3.1. GMM-UBM system
  3.2. Joint factor analysis
  3.3. Probabilistic linear discriminant analysis
  3.4. Support vector machine
  3.5. Restricted Boltzmann machine

4 Deep Learning

5 Case Studies

6 Future Direction

93 / 274

Page 94: Machine Learning for Speaker Recognition

Joint factor analysis

Factor analysis is a statistical method used to describe the variability among observed variables in terms of a potentially lower number of unobserved variables called factors
Factor analysis is a latent variable model for feature extraction
Joint factor analysis (JFA) was the initial paradigm for speaker recognition:

u = m + Vy + Ux + Dz

where u is the speaker supervector, m is the UBM supervector, y contains the speaker factors, x the channel factors, and z the residual factors

94 / 274

Page 95: Machine Learning for Speaker Recognition

Intuition & interpretation

A supervector for a speaker should be decomposable into speaker-independent, speaker-dependent, channel-dependent, and residual components
Each component is represented by low-dimensional factors, which operate along the principal dimensions of the corresponding component
The speaker-dependent component, known as the eigenvoice, and the corresponding factors:

V \cdot y = \begin{bmatrix} | & | & & | \\ v_1 & v_2 & \cdots & v_N \\ | & | & & | \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}

V is the eigenvoice matrix and y the low-dimensional eigenvoice factors; each speaker factor controls one eigendimension of the eigenvoice matrix

95 / 274

Page 96: Machine Learning for Speaker Recognition

Factor decomposition

The GMM supervector u for a speaker can be decomposed as

u = m + Vy + Ux + Dz

(speaker-independent + speaker-dependent + channel-dependent + speaker-dependent residual components)

where
− m is a speaker-independent supervector from the UBM
− V is the eigenvoice matrix
− y ~ N(0, I) is the speaker factor vector
− U is the eigenchannel matrix
− x ~ N(0, I) is the channel factor vector
− D is the residual matrix, and is diagonal
− z ~ N(0, I) is the speaker-specific residual factor vector

96 / 274

Page 97: Machine Learning for Speaker Recognition

Dimensionality

For a 512-mixture GMM-UBM system, the dimensions of each JFA component are typically as follows
− V: 20,000 × 300 (300 eigenvoices)
− y: 300 × 1 (300 speaker factors)
− U: 20,000 × 100 (100 eigenchannels)
− x: 100 × 1 (100 channel factors)
− D: 20,000 × 20,000 (20,000 residuals)
− z: 20,000 × 1 (20,000 speaker-specific residuals)

These dimensions have been empirically determined to produce the best results
Bayesian model selection can help
Judge by the marginal likelihood over the latent components under different dimensions

97 / 274

Page 98: Machine Learning for Speaker Recognition

Training procedure

We train the JFA matrices in the following order [Kenny et al., 2007a]
1. Train the eigenvoice matrix V, assuming that U and D are zero
2. Train the eigenchannel matrix U given the estimate of V, assuming that D is zero
3. Train the residual matrix D given the estimates of V and U

Using these matrices, we compute y for the speaker factors, x for the channel factors, and z for the residual factors
We compute the final score by using these matrices and factors

98 / 274

Page 99: Machine Learning for Speaker Recognition

Total variability

The subspaces U and V are not completely independent
A combined total variability space was therefore used [Dehak et al., 2011]

JFA: u = m + Vy + Ux + Dz   (UBM supervector m, speaker factors y, channel factors x, residual factors z)

Total variability model:

u = m + Tw

where u is the speaker supervector, m the UBM supervector, T the total variability matrix, and w the intermediate/identity vector (i-vector)

99 / 274

Page 100: Machine Learning for Speaker Recognition

Training total variability space

The rank of T is set prior to training
T and w are latent variables
The EM algorithm is used
Training the total variability matrix T is similar to training V, except that T is trained using all utterances from a given speaker as if they were produced by different speakers
Random initialization for T
Each o_t has dimension D; the number of Gaussian components is M; the dimension of the supervector is M · D
A UBM diagonal covariance matrix \Sigma (MD × MD) is introduced to model the residual variability not captured by T

100 / 274

Page 101: Machine Learning for Speaker Recognition

Sufficient statistics

0th-order statistics of an utterance u: N_c(u) = \sum_t \gamma_c(o_t)

1st-order statistics: F_c(u) = \sum_t \gamma_c(o_t)\, o_t

2nd-order statistics: S_c(u) = \mathrm{diag}\left( \sum_t \gamma_c(o_t)\, o_t o_t^{\top} \right)

where

\gamma_c(o_t) = p(c | o_t, \theta_{ubm}) = \frac{\pi_c\, p(o_t | m_c, \Sigma_c)}{\sum_{j=1}^{M} \pi_j\, p(o_t | m_j, \Sigma_j)}

Centralized 1st- and 2nd-order statistics:

F_c(u) = \sum_{t=1}^{T} \gamma_c(o_t)(o_t - m_c)

S_c(u) = \mathrm{diag}\left( \sum_{t=1}^{T} \gamma_c(o_t)(o_t - m_c)(o_t - m_c)^{\top} \right)

where m_c is the subvector corresponding to mixture component c

101 / 274
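A hedged NumPy sketch of accumulating the zeroth- and centralized first-order statistics from frame posteriors; `frame_posteriors` (the T × M responsibilities γ_c(o_t) from the UBM) and the UBM means are assumed to be available, e.g., from the earlier GMM sketches.

```python
import numpy as np

def baum_welch_stats(O, frame_posteriors, ubm_means):
    """Zeroth- and centralized first-order statistics for one utterance.
    O: (T, D) frames; frame_posteriors: (T, M) gamma_c(o_t); ubm_means: (M, D)."""
    N = frame_posteriors.sum(axis=0)                 # N_c(u), shape (M,)
    F = frame_posteriors.T @ O                       # sum_t gamma_c(o_t) o_t, (M, D)
    F_centered = F - N[:, None] * ubm_means          # sum_t gamma_c(o_t)(o_t - m_c)
    return N, F_centered
```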

Page 102: Machine Learning for Speaker Recognition

EM algorithm

Sufficient statistics arranged in matrix form:

N(u) = \begin{bmatrix} N_1(u) \cdot I_{D\times D} & 0 & \cdots & 0 \\ 0 & N_2(u) \cdot I_{D\times D} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & N_M(u) \cdot I_{D\times D} \end{bmatrix}, \qquad F(u) = \begin{bmatrix} F_1(u) \\ F_2(u) \\ \vdots \\ F_M(u) \end{bmatrix}

EM algorithm [Kenny et al., 2005]
− Initialize m, \Sigma and T
− E-step: for each utterance u, calculate the parameters of the posterior distribution of w(u) using the current estimates of m, \Sigma, T
− M-step: update T and \Sigma by solving a set of linear equations in which the w(u)'s play the role of explanatory variables
− Iterate until the data likelihood given the estimated parameters converges

102 / 274

Page 103: Machine Learning for Speaker Recognition

E-step: posterior distribution of w(u)

For each utterance u, we calculate the matrix

L(u) = I + T^{\top} \Sigma^{-1} N(u)\, T

The posterior distribution of w(u) conditioned on the acoustic observations of utterance u is Gaussian with mean

E[w(u)] = L^{-1}(u)\, T^{\top} \Sigma^{-1} F(u)

and covariance matrix

\mathrm{Cov}(w(u), w(u)) = L^{-1}(u)

Variational Bayesian JFA was developed for speaker verification [Zhao and Dong, 2012]
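A minimal sketch of this E-step (i-vector extraction) given the statistics from the previous sketch; T_matrix and Sigma_diag are assumed to come from an already-trained total variability model, with the UBM covariances stored as a diagonal vector of length M·D.

```python
import numpy as np

def extract_ivector(N, F_centered, T_matrix, Sigma_diag):
    """Posterior mean of w(u): L(u)^{-1} T' Sigma^{-1} F(u).
    N: (M,) zeroth-order stats; F_centered: (M, D) centralized first-order stats;
    T_matrix: (M*D, R) total variability matrix; Sigma_diag: (M*D,) UBM variances."""
    M, D = F_centered.shape
    R = T_matrix.shape[1]
    F_vec = F_centered.reshape(M * D)
    N_vec = np.repeat(N, D)                            # diagonal of N(u)
    T_over_Sigma = T_matrix / Sigma_diag[:, None]      # Sigma^{-1} T
    L = np.eye(R) + T_over_Sigma.T @ (N_vec[:, None] * T_matrix)   # I + T' Sigma^{-1} N T
    w = np.linalg.solve(L, T_over_Sigma.T @ F_vec)     # posterior mean = the i-vector
    return w
```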

103 / 274

Page 104: Machine Learning for Speaker Recognition

Linear discriminant analysis

I-vectors from the JFA model are used in linear discriminant analysis (LDA):

u = m + Tw    (factor analysis)
W = Aw        (linear discriminant analysis)

Both methods are used to reduce the dimensionality of the speaker model
A is chosen such that the within-speaker variability S_w is minimized and the between-speaker variability S_b is maximized within the projected space
A is found by an eigenvalue method via maximizing

J(A) = \mathrm{Tr}\{ S_w^{-1} S_b \}
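A sketch of estimating the LDA projection from labelled i-vectors by solving the generalized eigenvalue problem for S_w^{-1} S_b; the array names, regularization term, and the number of retained dimensions are illustrative choices.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(ivectors, speaker_ids, n_dims=150):
    """Return the LDA projection matrix A maximizing Tr{S_w^{-1} S_b}."""
    mean_global = ivectors.mean(axis=0)
    D = ivectors.shape[1]
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for spk in np.unique(speaker_ids):
        X = ivectors[speaker_ids == spk]
        mean_spk = X.mean(axis=0)
        Sw += (X - mean_spk).T @ (X - mean_spk)          # within-speaker scatter
        diff = (mean_spk - mean_global)[:, None]
        Sb += len(X) * (diff @ diff.T)                   # between-speaker scatter
    # Generalized eigenvalue problem S_b a = lambda S_w a; keep leading eigenvectors
    eigvals, eigvecs = eigh(Sb, Sw + 1e-6 * np.eye(D))
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_dims]]                    # columns of A
```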

104 / 274

Page 105: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models
  3.1. GMM-UBM system
  3.2. Joint factor analysis
  3.3. Probabilistic linear discriminant analysis
  3.4. Support vector machine
  3.5. Restricted Boltzmann machine

4 Deep Learning

5 Case Studies

6 Future Direction

105 / 274

Page 106: Machine Learning for Speaker Recognition

Factor analysis

Assume a factor analysis model of the i-vectors of the form

w = u + Fh + \epsilon

where w is the i-vector, u is the mean of the i-vectors, and h ~ N(0, I) contains the latent factors
First compute the maximum-likelihood estimate of the factor loading matrix F, also known as the eigenvoice subspace
The full covariance of the residual noise \epsilon explains the variability not captured through the latent variables

PLDA
Under the Gaussian assumption, this model is known in face recognition as PLDA [Prince and Elder, 2007]

106 / 274

Page 107: Machine Learning for Speaker Recognition

Gaussian PLDA

Assume that there are low-dimensional, normally distributed hidden variables x_1 and x_{2r} such that

D_r = \underbrace{m + U_1 x_1}_{S} + \underbrace{U_2 x_{2r} + \epsilon_r}_{C_r}

The residual \epsilon_r is normally distributed with mean 0 and precision matrix \Lambda
m is the center of the acoustic space and x_1 contains the speaker factors
The columns of U_1 are the eigenvoices:

\mathrm{Cov}(S, S) = U_1 U_1^{\top}

x_{2r} varies from one recording to another (channel factors)
The columns of U_2 are the eigenchannels:

\mathrm{Cov}(C_r, C_r) = \Lambda^{-1} + U_2 U_2^{\top}

107 / 274

Page 108: Machine Learning for Speaker Recognition

Graphical representation

Figure: graphical model with the speaker factor x_1 shared across recordings r = 1, 2, \ldots, R, channel factors x_{2r}, observations D_r, residuals \epsilon_r with precision \Lambda, and center m

Including x_{2r} enables the decomposition of speaker and channel factors
x_{2r} can always be eliminated at recognition time
Between-speaker covariance matrix \mathrm{Cov}(S, S) and within-speaker covariance matrix \mathrm{Cov}(C_r, C_r)
These matrices cannot be treated as full rank

108 / 274

Page 109: Machine Learning for Speaker Recognition

PLDA speaker recognition

Given two i-vectors D_1 and D_2, we would like to perform the hypothesis test

H1: the speakers are the same
H0: the speakers are different

The likelihood ratio is calculated as

\frac{p(D_1, D_2 | H_1)}{p(D_1 | H_0)\, p(D_2 | H_0)}

This likelihood ratio applies to any type of speaker recognition or speaker clustering problem
The evidence integral must be calculated:

\int p(D, z)\, dz
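A hedged sketch of this likelihood ratio for Gaussian PLDA viewed as a two-covariance model: with between-speaker covariance B = U_1 U_1^T and within-speaker covariance W = \Lambda^{-1} + U_2 U_2^T, each i-vector is marginally N(m, B + W), and a same-speaker pair is jointly Gaussian with cross-covariance B. The matrices m, B, W are assumed to be already estimated.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(d1, d2, m, B, W):
    """log p(d1, d2 | same speaker) - log p(d1) - log p(d2) under Gaussian PLDA."""
    tot = B + W
    # Joint covariance of a same-speaker pair: the shared speaker factor gives cross-cov B
    joint_cov = np.block([[tot, B], [B, tot]])
    joint_mean = np.concatenate([m, m])
    log_same = multivariate_normal.logpdf(np.concatenate([d1, d2]), joint_mean, joint_cov)
    log_diff = (multivariate_normal.logpdf(d1, m, tot)
                + multivariate_normal.logpdf(d2, m, tot))
    return log_same - log_diff
```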

109 / 274

Page 110: Machine Learning for Speaker Recognition

Model assumption

Assume that
− we have succeeded in estimating the model parameters \theta = \{m, U_1, U_2, \Lambda\}
− given a collection D = (D_1, \ldots, D_R) of i-vectors associated with a speaker, we have figured out how to evaluate the marginal likelihood or evidence

p(D) = \int p(D, z)\, dz = \int p(D|z)\, p(z)\, dz

− z = \{x_1, x_{2r}\} are the hidden variables associated with the speaker

We show how to do speaker recognition in this situation and how both problems are tackled by using variational Bayes to approximate the posterior distribution p(z|D)

110 / 274

Page 111: Machine Learning for Speaker Recognition

Variational approximation

The evidence p(D) can be evaluated exactly in the Gaussian case, but this involves inverting large sparse block matrices
If q(z) is any distribution on z, a variational lower bound is obtained:

\mathcal{L} \triangleq E_q\!\left[ \log \frac{p(D, z)}{q(z)} \right]

where \log p(D) \ge \mathcal{L} with equality iff q(z) = p(z|D)
Variational Bayes provides a principled way to find a good approximation q(z) to p(z|D)
The model parameters \theta = \{m, U_1, U_2, \Lambda\} are estimated by maximizing the evidence lower bound (ELBO) \mathcal{L}, which is calculated over all of the speakers in a training set

111 / 274

Page 112: Machine Learning for Speaker Recognition

Bayesian speaker recognition

A full Bayesian treatment avoids point estimates of the model parameters
The coupling of multiple latent variables is tackled in VB [Villalba and Lleida, 2014]
Uncertainties are compensated for model regularization in speaker recognition
Prior densities p(U_1) and p(U_2) can be flexibly incorporated
Selection of the number of speaker factors or channel factors
Manual tuning of unknown variables is avoided
Analogous to the treatment of the number of mixture components in Bayesian estimation of a GMM
Bayesian mixture of PLDA [Mak et al., 2016] was recently developed

112 / 274

Page 113: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models
  3.1. GMM-UBM system
  3.2. Joint factor analysis
  3.3. Probabilistic linear discriminant analysis
  3.4. Support vector machine
  3.5. Restricted Boltzmann machine

4 Deep Learning

5 Case Studies

6 Future Direction

113 / 274

Page 114: Machine Learning for Speaker Recognition

What is a good decision boundary?

Consider a two-class, linearlyseparable classification problemMany decision boundaries!− perceptron algorithm can be

used to find such a boundary− different algorithms have been

proposedAre all decision boundariesequally good? Class 1

Class 2

114 / 274

Page 115: Machine Learning for Speaker Recognition

Large-margin decision boundary

The decision boundary should be as far away from the data of both classes as possible [Vapnik, 2013]
− we should maximize the margin m
− the distance between the origin and the line w^{\top}x = k is k / \|w\|

Figure: separating hyperplane w^{\top}x + b = 0 with margin boundaries w^{\top}x + b = \pm 1; the margin is

m = \frac{2}{\|w\|}

115 / 274

Page 116: Machine Learning for Speaker Recognition

Finding the decision boundary

\{x_1, \ldots, x_n\} is the data set and y_i \in \{1, -1\} is the class label of x_i

The decision boundary should classify all points correctly:

y_i (w^{\top} x_i + b) \ge 1, \quad i = 1, \ldots, n

The decision boundary can be found by solving the constrained optimization problem

\min\ \frac{1}{2} \|w\|^2 \quad \text{subject to } y_i (w^{\top} x_i + b) \ge 1,\ i = 1, \ldots, n

116 / 274

Page 117: Machine Learning for Speaker Recognition

Constrained optimization

Original problem:

\min\ \frac{1}{2} \|w\|^2 \quad \text{subject to } y_i (w^{\top} x_i + b) \ge 1,\ i = 1, \ldots, n

We introduce Lagrange multipliers \alpha_i \ge 0 to form the Lagrangian

L = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (w^{\top} x_i + b) - 1 \right)

Setting the gradient of L w.r.t. w and b to zero, we have

w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0

Page 118: Machine Learning for Speaker Recognition

Dual problem

The new objective function is expressed in terms of the \alpha_i only
If we know w, we know all \alpha_i; if we know all \alpha_i, we know w
The original problem is known as the primal problem
A quadratic programming problem is formed:

\max_{\alpha}\ L(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^{\top} x_j \quad \text{subject to } \alpha_i \ge 0,\ \sum_{i=1}^{n} \alpha_i y_i = 0

The global maximum over the \alpha_i can be found

118 / 274

Page 119: Machine Learning for Speaker Recognition

Characteristics of the solution

Many of the \alpha_i are zero
− w is a linear combination of a small number of data points
− this sparse representation can be viewed as data compression, as in the construction of a KNN classifier

x_i with non-zero \alpha_i are called support vectors
− the decision boundary is determined only by the support vectors
− let t_j, j = 1, \ldots, s, be the indices of the s support vectors; then

w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}

For testing with a new data point z
− compute f = w^{\top} z + b = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}^{\top} z + b and classify z as class 1 if the sum is positive, and class 2 otherwise

119 / 274

Page 120: Machine Learning for Speaker Recognition

Non-separable problem

We allow error \xi_i in classification, based on the output of the discriminant function w^{\top}x + b
\xi_i approximates the number of misclassified samples

Figure: soft-margin case with slack variables \xi_i and \xi_j for points x_i and x_j on the wrong side of the margin boundaries w^{\top}x + b = \pm 1

120 / 274

Page 121: Machine Learning for Speaker Recognition

Soft margin hyperplane

If we minimize \sum_i \xi_i, the \xi_i can be computed from

w^{\top} x_i + b \ge 1 - \xi_i \quad (y_i = 1)
w^{\top} x_i + b \le -1 + \xi_i \quad (y_i = -1)
\xi_i \ge 0 \quad \forall i

− \xi_i are slack variables in the optimization
− \xi_i = 0 if there is no error for x_i
− \xi_i is an upper bound on the number of errors

We want to minimize \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
− C is a tradeoff parameter between error and margin

The optimization problem becomes

\min\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to } y_i (w^{\top} x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0

121 / 274

Page 122: Machine Learning for Speaker Recognition

Dual problem

The dual of this new constrained optimization problem is

\max_{\alpha}\ L(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^{\top} x_j \quad \text{subject to } 0 \le \alpha_i \le C,\ \sum_{i=1}^{n} \alpha_i y_i = 0

w is recovered as w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}
This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on the \alpha_i
Once again, a quadratic programming solver can be used to find the \alpha_i

122 / 274

Page 123: Machine Learning for Speaker Recognition

Non-linear decision boundary

So far, we have only considered large-margin classifiers with a linear decision boundary
How to generalize it to become non-linear?
Key idea: transform x_i to a higher-dimensional space to make life easier
− input space: the space where the points x_i are located
− feature space: the space of \phi(x_i) after transformation

Why transform?
− linear operation in the feature space is equivalent to non-linear operation in the input space
− classification can become easier with a proper transformation
− in the XOR problem, for example, adding a new feature makes the problem linearly separable

123 / 274

Page 124: Machine Learning for Speaker Recognition

Dimensionality in feature space

Figure: mapping \phi(\cdot) from the input space to the feature space

Computation in the feature space can be costly because it is high-dimensional
− the feature space is typically infinite-dimensional!
The kernel trick comes to the rescue

124 / 274

Page 125: Machine Learning for Speaker Recognition

Kernel trick

\max_{\alpha}\ L(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^{\top} x_j \quad \text{subject to } 0 \le \alpha_i \le C,\ \sum_{i=1}^{n} \alpha_i y_i = 0

x_i^{\top} x_j is an inner product
As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly
Many common geometric operations, e.g. angles and distances, can be expressed by inner products
Define the kernel function K by

K(x_i, x_j) = \phi(x_i)^{\top} \phi(x_j)

125 / 274

Page 126: Machine Learning for Speaker Recognition

Kernel function in SVM

Change all inner products to kernel functions
For training
− original:

\max_{\alpha}\ L(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^{\top} x_j \quad \text{subject to } 0 \le \alpha_i \le C,\ \sum_{i=1}^{n} \alpha_i y_i = 0

− with kernel function:

\max_{\alpha}\ L(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to } 0 \le \alpha_i \le C,\ \sum_{i=1}^{n} \alpha_i y_i = 0

126 / 274

Page 127: Machine Learning for Speaker Recognition

Kernel function in SVM

For testing, the new data z is classified as class 1 if f \ge 0 and as class 2 if f < 0
− original:

w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}, \qquad f = w^{\top} z + b = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}^{\top} z + b

− with kernel function:

w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} \phi(x_{t_j}), \qquad f = \langle w, \phi(z) \rangle + b = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} K(x_{t_j}, z) + b

127 / 274

Page 128: Machine Learning for Speaker Recognition

Similarity measure

Since the training of an SVM only requires the values K(x_i, x_j), there is no restriction on the form of x_i and x_j
− x_i can be a sequence or a tree instead of a feature vector
K(x_i, x_j) is just a similarity measure comparing x_i and x_j
The kernel function needs to satisfy Mercer's condition, i.e., the kernel function is positive definite
For a test object z, the discriminant function is essentially a weighted sum of the similarities between z and a pre-selected set of objects, also called the support vectors:

f(z) = \sum_{x_i \in S} \alpha_i y_i K(z, x_i) + b

where S denotes the set of support vectors
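As a usage sketch, here is how a kernel SVM decision function of this form can be trained and inspected with scikit-learn; the data are synthetic and the RBF kernel and hyperparameter values are only example choices.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = SVC(kernel='rbf', C=1.0, gamma=0.5).fit(X, y)

# decision_function(z) evaluates sum_i alpha_i y_i K(z, x_i) + b over the support vectors
z = np.array([[0.3, -0.2]])
print(clf.decision_function(z), clf.predict(z))
print("number of support vectors per class:", clf.n_support_)
```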

128 / 274

Page 129: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models
  3.1. GMM-UBM system
  3.2. Joint factor analysis
  3.3. Probabilistic linear discriminant analysis
  3.4. Support vector machine
  3.5. Restricted Boltzmann machine

4 Deep Learning

5 Case Studies

6 Future Direction

129 / 274

Page 130: Machine Learning for Speaker Recognition

Restricted Boltzmann machine

A bipartite undirected graphical model with visible variables v and hidden variables h
Building block for deep belief networks and deep Boltzmann machines
RBMs are generative models of v based on the marginal distribution
The joint distribution of (v, h) is an exponential family
Discriminative fine-tuning can be applied
Variables are typically binary; however, no such restriction is required
Bidirectional graphical model

130 / 274

Page 131: Machine Learning for Speaker Recognition

Graphical model

Figure: RBM with hidden layer h connected to visible layer v through weights W

No connections between nodes of the same layer (i.e., sparsity)
Allows fast training (blocked Gibbs sampling)
Correlations between nodes in v are still present in the marginal p(v|W)
The hidden variables h capture higher-level information

131 / 274

Page 132: Machine Learning for Speaker Recognition

Joint distribution

The energy-based distribution is defined by

p(v, h | \theta) = Z(\theta)^{-1} p^{*}(v, h | \theta)

where

p^{*}(v, h | \theta) = \exp\Big( \underbrace{\textstyle\sum_i v_i b_i + \sum_j h_j a_j + \sum_{i,j} v_i h_j W_{ij}}_{-E(v, h)} \Big)

and \theta = \{W, b, a\}
b, a denote the biases and are usually assumed to be zero for compact notation
Z(\theta) = \sum_{v,h} p^{*}(v, h | \theta) is the partition function

132 / 274

Page 133: Machine Learning for Speaker Recognition

Individual distribution

Consider binary (v, h) with zero biases b, a:

p(h|v) = \prod_j p(h_j | v), \quad \text{where } p(h_j | v) = \frac{1}{1 + \exp(-\sum_i v_i W_{ij})}

p(v|h) = \prod_i p(v_i | h), \quad \text{where } p(v_i | h) = \frac{1}{1 + \exp(-\sum_j h_j W_{ij})}

The product form is due to the restricted (bipartite) structure between h and v

133 / 274

Page 134: Machine Learning for Speaker Recognition

Learning with RBM

Maximize the likelihood of \theta given the training samples \{v^n\}:

p(\{v^n\}) = \prod_n Z^{-1} \sum_{h} \exp\Big( \sum_{ij} v_i^n h_j W_{ij} \Big)

We obtain

\frac{\partial \log p(v^n)}{\partial W_{ij}} = E_{p_{data}}[v_i h_j] - E_{p_{model}}[v_i h_j]

where p_{data}(v^n, h^n) = p(h^n | v^n)\, p(v^n)
p(h^n | v^n) is an easy and exact calculation for an RBM
p(v^n) is an empirical distribution
\frac{\partial \log Z(\theta)}{\partial W_{ij}} = E_{p_{model}}[v_i h_j] is hard to compute
Learning using contrastive divergence with mini-batches is performed

Page 135: Machine Learning for Speaker Recognition

135 / 274

Page 136: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning
  4.1. Deep neural network
  4.2. Deep belief network
  4.3. Stacking auto-encoder
  4.4. Variational auto-encoder
  4.5. Deep transfer learning

5 Case Studies

6 Future Direction

136 / 274

Page 137: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning
  4.1. Deep neural network
  4.2. Deep belief network
  4.3. Stacking auto-encoder
  4.4. Variational auto-encoder
  4.5. Deep transfer learning

5 Case Studies

6 Future Direction

137 / 274

Page 138: Machine Learning for Speaker Recognition

Neural network

Figure: multilayer perceptron with inputs x_{t0}, \ldots, x_{tD}, hidden units z_{t0}, \ldots, z_{tM}, outputs y_{t1}, \ldots, y_{tK}, and weight matrices w^{(1)} (D \times M) and w^{(2)} (M \times K)

Multilayer perceptron

138 / 274

Page 139: Machine Learning for Speaker Recognition

Nonlinear activation

y_{tk} = y_k(x_t, w) = f\Big( \sum_{j=0}^{M} w^{(2)}_{jk}\, f\Big( \sum_{i=0}^{D} w^{(1)}_{ij}\, x_{ti} \Big) \Big)

Figure: common activation functions f(\cdot): ReLU, sigmoid, tanh
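A tiny NumPy sketch of this forward computation for one hidden layer, using a tanh nonlinearity and a softmax output as a simplified variant (the slide uses the same f for both layers and folds biases into the index-0 terms); layer sizes are arbitrary for illustration.

```python
import numpy as np

def forward(x, W1, W2):
    """y_k = softmax_k( sum_j W2[j,k] * tanh( sum_i W1[i,j] * x_i ) )."""
    z = np.tanh(x @ W1)               # hidden activations
    a = z @ W2                        # output pre-activations
    a -= a.max()                      # numerical stability for the softmax
    return np.exp(a) / np.exp(a).sum()

rng = np.random.default_rng(0)
x = rng.normal(size=39)               # e.g., one acoustic feature frame
W1 = rng.normal(scale=0.1, size=(39, 64))
W2 = rng.normal(scale=0.1, size=(64, 10))
print(forward(x, W1, W2).sum())       # probabilities sum to 1
```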

139 / 274

Page 140: Machine Learning for Speaker Recognition

Deep learning

Deep belief networks (DBNs) obtained great results due to good initialization and deep model structure
− pre-train each layer from the bottom up
− each pair of layers is a restricted Boltzmann machine
− jointly fine-tune all layers using back-propagation

Deep neural network (DNN)
− discriminative model that works for classification tasks
− empirically works well for image recognition, speech recognition, information retrieval and many other tasks
− no theoretical guarantee

140 / 274

Page 141: Machine Learning for Speaker Recognition

DBN-DNN training

Figure: DBN-DNN training — a stack of RBMs (a GRBM at the bottom, then RBMs) with weights w_1, \ldots, w_4 is pre-trained layer by layer on the input o, producing hidden layers z_1, z_2, z_3; the stack is then used to initialize a DNN with output y

141 / 274

Page 142: Machine Learning for Speaker Recognition

Why go deep?

A deep architecture can be representationally efficient
− fewer computational units for the same function

A deep representation might allow for a hierarchical representation
− allows non-local generalization
− comprehensibility

Multiple levels of latent variables allow combinatorial sharing of statistical strength
Deep architectures work well for representations of vision, audio, NLP, music and many other technical data

142 / 274

Page 143: Machine Learning for Speaker Recognition

Different level of abstraction

Hierarchical learning
− natural progression from low-level to high-level structure, as seen in natural complexity
− easier to monitor what is being learnt and to guide the machine to better subspaces
− a good lower-level representation can be used in different tasks

143 / 274

Page 144: Machine Learning for Speaker Recognition

Trainable feature hierarchy

Hierarchy of representations with increasing level of abstraction
Each stage is a kind of trainable feature transform
Image
− pixel → edge → texton → motif → part → object
Text
− character → word → word group → clause → sentence → story
Speech
− sample → spectral band → sound → ... → phone → phoneme → word

144 / 274

Page 145: Machine Learning for Speaker Recognition

Deep architecture

Feed-forward: multilayer neural nets, convolutional nets

Feed-back: stacked sparse coding, deconvolutional nets

Bi-directional: deep Boltzmann machines, stacked auto-encoders

145 / 274

Page 146: Machine Learning for Speaker Recognition

Training strategy

Purely supervised
− initialize parameters randomly
− train in supervised mode, typically with SGD, using backprop to compute gradients
− used in most practical systems for speech and image recognition

Unsupervised, layerwise + supervised classifier on top
− train each layer unsupervised, one after the other
− train a supervised classifier on top, keeping the other layers fixed
− good when very few labeled samples are available

Unsupervised, layerwise + global supervised fine-tuning
− train each layer unsupervised, one after the other
− add a classifier layer, and retrain the whole thing supervised
− good when the label set is poor

Unsupervised pre-training often uses regularized auto-encoders

146 / 274

Page 147: Machine Learning for Speaker Recognition

Generalizable learning

Shared representation
− multi-task learning
− unsupervised training

Figure: raw input → shared intermediate representation → task 1 output, task 2 output, task 3 output

Partial feature sharing
− mixed mode learning
− composition of functions

Figure: low-level features → high-level features → task 1 output y_1, ..., task N output y_N

147 / 274

Page 148: Machine Learning for Speaker Recognition

Forward & backward passes

Forward propagation
− sum inputs, produce activations, feed forward

Training: back-propagation of error
− calculate the total error at the top
− calculate the contributions to the error at each step going backwards

Figure: small network with inputs x_1, x_2, hidden units z_1, z_2, z_3, and outputs y_1, y_2, shown for the forward and backward passes

148 / 274

Page 149: Machine Learning for Speaker Recognition

Deep neural network

Simple to construct
− sigmoid nonlinearity for the hidden layers
− softmax for the output layer

Backpropagation does not work well if randomly initialized [Bengio et al., 2007]
− deep networks trained without unsupervised pretraining perform worse than shallow networks (Bengio et al., NIPS 2007)

149 / 274

Page 150: Machine Learning for Speaker Recognition

Problems and solvers with back propagation

The gradient becomes progressively more dilute
− below the top few layers, the correction signal is minimal

Gets stuck in local minima
− random initialization may start out far from good regions

In the usual setting, we can use only labeled data
− almost all data are unlabeled
− the brain can learn from unlabeled data

Use unsupervised learning via greedy layer-wise training
− allows abstraction to develop naturally from one layer to another
− helps the network initialize with good parameters

Perform supervised top-down training as the final step
− refine the features in the intermediate layers to be more relevant for the task

150 / 274

Page 151: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning
  4.1. Deep neural network
  4.2. Deep belief network
  4.3. Stacking auto-encoder
  4.4. Variational auto-encoder
  4.5. Deep transfer learning

5 Case Studies

6 Future Direction

151 / 274

Page 152: Machine Learning for Speaker Recognition

Deep belief network

Deep belief network (DBN) is a probabilistic generative model
Deep architecture with multiple hidden layers
Unsupervised pre-learning provides a good initialization
− maximizing the lower bound of the log-likelihood of the data

Supervised fine-tuning
− generative: up-down algorithm
− discriminative: back propagation

152 / 274

Page 153: Machine Learning for Speaker Recognition

Model structure

[Figure: a stack of a visible layer v and hidden layers h1, h2, h3; the top pair of layers forms an RBM, and the lower layers form directed belief nets]

p(v, h1, h2, . . . , hl) = p(v|h1) p(h1|h2) . . . p(hl−2|hl−1) p(hl−1, hl)

153 / 274

Page 154: Machine Learning for Speaker Recognition

Greedy training

First step:
− construct an RBM with an input layer v and a hidden layer h
− train the RBM
[Figure: RBM with visible layer v, hidden layer h, and weights W1]

154 / 274

Page 155: Machine Learning for Speaker Recognition

Greedy training

Second step:
− stack another hidden layer on top of the RBM to form a new RBM
− fix W1, sample h1 from q(h1|v) as input, and train W2 as an RBM
[Figure: layers v, h1, h2 with weights W1, W2; h1 is sampled from q(h1|v)]

155 / 274

Page 156: Machine Learning for Speaker Recognition

Greedy training

Third step:
− continue to stack layers on top of the network, training each as in the previous step, with samples drawn from q(h2|h1)
− and so on . . .
[Figure: layers v, h1, h2, h3 with weights W1, W2, W3; h1 sampled from q(h1|v) and h2 from q(h2|h1)]
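A compact NumPy sketch of this greedy procedure under simplifying assumptions (binary RBMs trained with one-step contrastive divergence, CD-1, each layer's hidden probabilities feeding the next); layer sizes, learning rate, and the toy data are placeholders, not values from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_rbm(data, n_hidden, epochs=10, lr=0.05):
    """Train a binary RBM with one-step contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        ph0 = sigmoid(v0 @ W + c)                      # positive phase
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)                    # reconstruction
        ph1 = sigmoid(pv1 @ W + c)                     # negative phase
        W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(data)
        b += lr * (v0 - pv1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, c

# Greedy stacking: train W1 on v, then W2 on q(h1|v), and so on
v = (rng.random((500, 64)) < 0.3).astype(float)        # toy binary data
layers, inputs = [], v
for n_hidden in (32, 16, 8):
    W, c = train_rbm(inputs, n_hidden)
    layers.append((W, c))
    inputs = sigmoid(inputs @ W + c)                   # q(h | previous layer)
```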

156 / 274

Page 157: Machine Learning for Speaker Recognition

Deep Boltzmann machine

p(v) = (1/Z) Σ_{h1,h2,h3} exp[ vᵀW1h1 + (h1)ᵀW2h2 + (h2)ᵀW3h3 ]

[Figure: layers v, h1, h2, h3 connected by undirected weights W1, W2, W3]

[Salakhutdinov and Hinton, 2009]

Undirected connections between all layers; no connections between the nodes in the same layer
High-level representations are built from unlabeled inputs; labeled data is used only to slightly fine-tune the model

157 / 274

Page 158: Machine Learning for Speaker Recognition

Training procedure

Pre-training
− initialize from stacked RBMs

Generative fine-tuning
− positive phase: variational or mean-field approximation
− negative phase: persistent chain & stochastic approximation

Discriminative fine-tuning
− back-propagation

[Figure: deep Boltzmann machine with layers v, h1, h2, h3 and weights W1, W2, W3]

158 / 274

Page 159: Machine Learning for Speaker Recognition

Why greedy layer-wise training works

Regularization hypothesis
− pre-training constrains the parameters to a region relevant to the unsupervised dataset
− better generalization: representations that better describe unlabeled data are more discriminative for labeled data

Optimization hypothesis
− unsupervised training initializes the lower-level parameters near localities of better minima than random initialization can

159 / 274

Page 160: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning
4.1. Deep neural network
4.2. Deep belief network
4.3. Stacking auto-encoder
4.4. Variational auto-encoder
4.5. Deep transfer learning

5 Case Studies

6 Future Direction

160 / 274

Page 161: Machine Learning for Speaker Recognition

Denoising auto-encoder

[Figure: the raw input is corrupted, encoded into a hidden code (representation), and decoded into a reconstruction; the KL(reconstruction | raw input) loss compares the reconstruction with the raw input]

[Vincent et al., 2008]

Corrupt the input, e.g. set 25% of the inputs to 0
Reconstruct the uncorrupted input
Use the uncorrupted encoding as input to the next level
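Below is a minimal NumPy sketch of the idea (not from the slides): corrupt about 25% of the inputs, then train a tied-weight encoder/decoder to reconstruct the clean input, here with a squared-error loss instead of the KL term shown in the figure; dimensions, learning rate, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

X = rng.random((1000, 20))                  # raw inputs
D, H, lr = X.shape[1], 8, 0.1
W = 0.1 * rng.standard_normal((D, H))       # tied weights: decoder uses W.T
b, c = np.zeros(H), np.zeros(D)

for _ in range(50):
    mask = rng.random(X.shape) > 0.25       # set ~25% of the inputs to 0
    X_tilde = X * mask
    h = sigmoid(X_tilde @ W + b)            # hidden code from corrupted input
    X_hat = h @ W.T + c                     # reconstruction
    err = X_hat - X                         # compare against the *clean* input
    grad_h = err @ W * h * (1 - h)
    W -= lr * (X_tilde.T @ grad_h + (h.T @ err).T) / len(X)
    b -= lr * grad_h.mean(axis=0)
    c -= lr * err.mean(axis=0)

codes = sigmoid(X @ W + b)                  # uncorrupted encoding for the next level
```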

161 / 274

Page 162: Machine Learning for Speaker Recognition

Manifold learning perspective

Learn a vector field pointing towards higher-probability regions
Minimize the variational lower bound on a generative model
Correspond to regularized score matching on an RBM

162 / 274

Page 163: Machine Learning for Speaker Recognition

Stacked denoising auto-encoders

[Figure: a denoising auto-encoder with weights W1 is trained on x; its hidden code h1 then trains a second auto-encoder with weights W2; finally a supervised layer U is stacked on h2 to predict y]

163 / 274

Page 164: Machine Learning for Speaker Recognition

Greedy layer-wise learning

Start with the lowest level and stack upwards
Train each layer of the auto-encoder using the intermediate codes or features from the layer below
The top layer can have a different output, e.g. a softmax non-linearity, to provide an output for classification

[Figure: auto-encoder layers with weights W1, W2 trained greedily; a classification layer U maps the top code h2 to the output y]

164 / 274

Page 165: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning
4.1. Deep neural network
4.2. Deep belief network
4.3. Stacking auto-encoder
4.4. Variational auto-encoder
4.5. Deep transfer learning

5 Case Studies

6 Future Direction

165 / 274

Page 166: Machine Learning for Speaker Recognition

Auto-encoder

[Figure: an auto-encoder maps the input x to a code z through an encoder (parameters φ) and reconstructs x through a decoder (parameters θ)]

166 / 274

Page 167: Machine Learning for Speaker Recognition

Variational auto-encoder

[Figure: the encoder qφ(z|x) maps x to the latent code z by sampling, and the decoder pθ(x|z) reconstructs x; φ and θ are the encoder and decoder parameters]

167 / 274

Page 168: Machine Learning for Speaker Recognition

Graphical model

[Figure: graphical model with the recognition model qφ(z|x) and the generative model pθ(x|z) linking x and z]

[Kingma and Welling, 2014]
The mean-field approach requires analytical solutions of E_q, which are intractable in the case of a neural network
Use a neural network and sample the latent variables z from the variational posterior

168 / 274

Page 169: Machine Learning for Speaker Recognition

Variational inference

Variational Bayesian inference aims to find a variational distribution q(z|x) that is maximally close to the original true posterior distribution p(z|x)

According to the evidence decomposition, we have

log p(x) = L(q) + KL(q‖p)
L(q) = E_q[log p(x, z)] + H_q[z]
KL(q‖p) = −E_q[log p(z|x)] − H_q[z]

169 / 274

Page 170: Machine Learning for Speaker Recognition

Mean field variational inference

Assume that q(z|x) can be factorized into the product of individual probability distributions

q(z|x) = Π_{n=1}^{N} q(zn|xn)

We can perform coordinate ascent for each factorized variational distribution by

q(zj|xj) ∝ exp( E_{q(z_i, i≠j)}[log p(x, z)] )

170 / 274

Page 171: Machine Learning for Speaker Recognition

Variational lower bound

Model parameters are learned by maximizing the variational lower bound

log p(x) ≥ E_{qφ(z|x)}[log pθ(x|z)] − KL(qφ(z|x) ‖ pω(z))
         = E_{qφ(z|x)}[log pθ(x, z) − log qφ(z|x)] ≜ E_{qφ(z|x)}[fΘ(x, z)] ≜ LΘ

where Θ = {θ, φ, ω}

171 / 274

Page 172: Machine Learning for Speaker Recognition

Stochastic backpropagation

Objective: LΘ = E_{qφ(z|x)}[fΘ(x, z)]

Step 1: sample z(l) from qφ(z|x)
Step 2: LΘ ≈ fΘ(x, z(l))
Step 3 (gradient): ∇Θ LΘ ≈ ∇Θ fΘ(x, z(l))

Problem: high variance when directly sampling z [Rezende et al., 2014]

172 / 274

Page 173: Machine Learning for Speaker Recognition

Stochastic gradient variational Bayes

Objective: LΘ = E_{qφ(z|x)}[fΘ(x, z)]

Step 1: sample ε(l) from N(0, I)
Step 2 (reparameterization): z(l) = µz + σz ⊙ ε(l)
Step 3: LΘ ≈ fΘ(x, z(l))
Step 4 (gradient): ∇Θ LΘ ≈ ∇Θ fΘ(x, z(l))

Reduces the variance caused by directly sampling z
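A small sketch of the reparameterisation step used in SGVB, assuming the encoder has already produced (µz, σz) for one input and the prior on z is standard normal; the decoder log-likelihood `log_p_x_given_z` and all values are placeholders, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(mu_z, sigma_z, log_p_x_given_z, n_samples=1):
    """Monte-Carlo SGVB estimate of the variational lower bound."""
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu_z.shape)     # eps ~ N(0, I)
        z = mu_z + sigma_z * eps                  # reparameterised sample
        total += log_p_x_given_z(z)               # decoder term, differentiable in z
    # Analytic KL( q(z|x) || N(0, I) ) for a diagonal Gaussian posterior
    kl = 0.5 * np.sum(mu_z**2 + sigma_z**2 - 1.0 - np.log(sigma_z**2))
    return total / n_samples - kl

# Toy usage: a fixed "decoder" that scores z under N(0, I)
log_p = lambda z: -0.5 * np.sum(z**2)
print(elbo_estimate(np.zeros(5), np.ones(5), log_p))
```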

173 / 274

Page 174: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning
4.1. Deep neural network
4.2. Deep belief network
4.3. Stacking auto-encoder
4.4. Variational auto-encoder
4.5. Deep transfer learning

5 Case Studies

6 Future Direction

174 / 274

Page 175: Machine Learning for Speaker Recognition

Why transfer learning?

Mismatch between training and test data in speaker recognition always exists
Traditional machine learning works well under the assumption that training and test data follow the same distribution
− real-world data may not follow this assumption

Feature-based domain adaptation is a common approach
− allow knowledge to be transferred across domains through learning a good feature representation
Co-train the feature representation and the speaker recognizer without labeling in the target domain

175 / 274

Page 176: Machine Learning for Speaker Recognition

Transfer learning

Let D = {X, p(X)} denote a domain
− feature space X
− marginal probability distribution p(X)
− X = {x1, · · · , xT} ⊂ X

Let T = {Y, f(·)} denote a task
− label space Y
− objective predictive function f(·), which can be written as p(Y|X)

Assumptions in transfer learning
− source and target domains are different: DS ≠ DT
− source and target tasks are different: TS ≠ TT

176 / 274

Page 177: Machine Learning for Speaker Recognition

Multi-task learning

min_θ ℓ(D, θ) + λΩ(θ)

[Figure: joint learning of a main task (e.g. "apple, not apple") and an auxiliary task (e.g. "pear, not pear") from shared input and targets]

177 / 274

Page 178: Machine Learning for Speaker Recognition

Multi-task neural network learning

min_θ ℓ(D, θ) + λΩ(θ)

[Figure: a neural network with a shared feature representation z1, . . . , zJ in the hidden layer; the output layer contains the main task Ym and the auxiliary tasks Y1, . . . , YK]

178 / 274

Page 179: Machine Learning for Speaker Recognition

Learning strategy and task

[Figure: during training, a shared feature extractor and classifier are learned from labeled data in the source domain (main task) and unlabeled data in the target domain (auxiliary task, distribution matching); at test time, the main-task classifier is applied to the target domain]

Semi-supervised learning is conducted under multiple objectives

179 / 274

Page 180: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies
5.1. Heavy-Tailed PLDA
5.2. SNR-Invariant PLDA
5.3. Mixture of PLDA
5.4. DNN I-vectors
5.5. PLDA with RBM

6 Future Direction

180 / 274

Page 181: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies
5.1. Heavy-Tailed PLDA
5.2. SNR-Invariant PLDA
5.3. Mixture of PLDA
5.4. DNN I-vectors
5.5. PLDA with RBM

6 Future Direction

181 / 274

Page 182: Machine Learning for Speaker Recognition

Motivation

Motivation of i-vectors:
− Insufficiency of joint factor analysis (JFA) in distinguishing between speaker and channel information, as channel factors were shown to contain speaker information.
− Better to use a two-step approach: (1) use low-dimensional vectors (called i-vectors) that comprise both speaker and channel information to represent utterances; and (2) model the speaker and channel variabilities of the i-vectors during scoring.

Motivation of heavy-tailed PLDA:
− JFA assumes that the speaker and channel components follow Gaussian distributions.
− The Gaussian assumption prohibits large deviations from the mean.
− But speaker effects (e.g., non-native speakers) and channel effects (gross channel distortion) could cause large deviations.
− Use heavy-tailed distributions instead of Gaussians for modeling the speaker and channel components of i-vectors [Kenny, 2010].

182 / 274

Page 183: Machine Learning for Speaker Recognition

Generative model with heavy-tailed priors

Assuming that we have Hi i-vectors Xi = {xij, j = 1, . . . , Hi} from speaker i, the generative model is

xij = m + Vhi + Grij + εij

where V and G represent the speaker and channel subspaces, respectively.
In heavy-tailed PLDA,

hi ∼ N(0, u1⁻¹ I),        u1 ∼ G(n1/2, n1/2)
rij ∼ N(0, u2j⁻¹ I),      u2j ∼ G(n2/2, n2/2)
εij ∼ N(0, (vjΛ)⁻¹),      vj ∼ G(νj/2, νj/2)

By integrating out the hyperparameters (u1, u2j, and vj), one can show [Eq. 2.161 of Bishop (2006)] that the priors of hi, rij, and εij follow Student's t. So, xij also follows Student's t.
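A generative sketch of drawing one i-vector from this model (not from the slides); the dimensions, degrees of freedom, and loading matrices below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, Q = 100, 20, 10            # i-vector, speaker-subspace, channel-subspace dims
m = np.zeros(D)
V = 0.1 * rng.standard_normal((D, P))
G = 0.1 * rng.standard_normal((D, Q))
Lambda = np.eye(D)               # precision matrix of the residual
n1, n2, nu = 4.0, 4.0, 4.0       # degrees of freedom of the Gamma priors

u1 = rng.gamma(n1 / 2, 2 / n1)                   # u1  ~ G(n1/2, n1/2)
h = rng.normal(0, 1 / np.sqrt(u1), size=P)       # hi  ~ N(0, u1^-1 I)
u2 = rng.gamma(n2 / 2, 2 / n2)                   # u2j ~ G(n2/2, n2/2)
r = rng.normal(0, 1 / np.sqrt(u2), size=Q)       # rij ~ N(0, u2j^-1 I)
v_j = rng.gamma(nu / 2, 2 / nu)                  # vj  ~ G(nu/2, nu/2)
eps = rng.multivariate_normal(np.zeros(D), np.linalg.inv(v_j * Lambda))

x = m + V @ h + G @ r + eps                      # heavy-tailed i-vector sample
```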

183 / 274

Page 184: Machine Learning for Speaker Recognition

Performance on NIST 2008 SRE

Telephone speech, without score normalization

Microphone speech, with score normalization

[Kenny, 2010]

184 / 274

Page 185: Machine Learning for Speaker Recognition

From HT-PLDA to Gaussian PLDA

In 2011, [Garcia-Romero and Espy-Wilson, 2011] discovered that Gaussian PLDA performs as well as HT-PLDA provided that the i-vectors have been subjected to the following pre-processing steps:

Whitening + length normalization

These steps have the effect of making the i-vectors more Gaussian.
As Gaussian PLDA is computationally much simpler than HT-PLDA, the former has been extensively used in speaker verification.
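A sketch of these two pre-processing steps, with the whitening transform estimated from a development set of i-vectors; the function names, dimensions, and data are illustrative, not from the tutorial.

```python
import numpy as np

def train_whitener(dev_ivecs):
    """Estimate a whitening transform from development i-vectors."""
    mu = dev_ivecs.mean(axis=0)
    cov = np.cov(dev_ivecs - mu, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T   # cov^(-1/2)
    return mu, W

def whiten_and_length_norm(x, mu, W):
    y = W @ (x - mu)                       # whitening
    return y / np.linalg.norm(y)           # length normalisation (unit sphere)

rng = np.random.default_rng(0)
dev = rng.standard_normal((2000, 500))     # toy development i-vectors
mu, W = train_whitener(dev)
x_norm = whiten_and_length_norm(rng.standard_normal(500), mu, W)
```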

185 / 274

Page 186: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies
5.1. Heavy-Tailed PLDA
5.2. SNR-Invariant PLDA
5.3. Mixture of PLDA
5.4. DNN I-vectors
5.5. PLDA with RBM

6 Future Direction

186 / 274

Page 187: Machine Learning for Speaker Recognition

Motivation

While i-vector extraction followed by PLDA is very effective in addressing channel variability, performance degrades rapidly in the presence of background noise with a wide range of signal-to-noise ratios (SNR)
Classical approach: multi-condition training, where i-vectors from various background noise levels are pooled to train a PLDA model

187 / 274

Page 188: Machine Learning for Speaker Recognition

Motivation

We argue that the variation caused by SNR variability can be modeled by an SNR subspace, and utterances falling within a narrow SNR range should share the same set of SNR factors
SNR-specific information is separated from speaker-specific information by marginalizing out the SNR factors during scoring

188 / 274

Page 189: Machine Learning for Speaker Recognition

Motivation

I-vectors derived from utterances of similar SNR will be mapped to a small region in the SNR subspace.

189 / 274

Page 190: Machine Learning for Speaker Recognition

SNR-Invariant PLDA

Classical PLDA: xij = m + Vhi + εij

By adding an SNR factor to the conventional PLDA, we have SNR-invariant PLDA [Li and Mak, 2015]:

xᵏij = m + Vhi + Uwk + εᵏij,   k = 1, . . . , K

where U denotes the SNR subspace, wk is an SNR factor, and hi is the speaker (identity) factor for speaker i.
Note that it is not the same as PLDA with a channel subspace:

xij = m + Vhi + Grij + εij,

where G defines the channel subspace and rij represents the channel factors.

190 / 274

Page 191: Machine Learning for Speaker Recognition

SNR-invariant PLDA

Generative model:

xᵏij = m + Vhi + Uwk + εᵏij,   k = 1, . . . , K

− hi is the speaker factor with prior distribution N(0, I)
− xᵏij is the j-th i-vector from speaker i in the k-th SNR subgroup
− V is the eigenvoice matrix
− U defines the SNR subspace
− wk is the SNR factor with prior distribution N(0, I)
− εᵏij is a residual term with prior distribution N(0, Σ); Σ is a full covariance matrix aiming to model the channel variability

191 / 274

Page 192: Machine Learning for Speaker Recognition

SNR-invariant PLDA

Training utterances are divided into K groups according to their SNR

192 / 274

Page 193: Machine Learning for Speaker Recognition

PLDA vs. SNR-invariant PLDA

Comparing the use of training i-vectors with conventional PLDA

193 / 274

Page 194: Machine Learning for Speaker Recognition

PLDA vs. SNR-invariant PLDA

Comparing generative models:

PLDA:
  xij = m + Vhi + εij
  x ∼ N(x | m, VVᵀ + Σ)
  θ = {m, V, Σ}

SNR-Invariant PLDA:
  xᵏij = m + Vhi + Uwk + εᵏij
  x ∼ N(x | m, VVᵀ + UUᵀ + Σ)
  θ = {m, V, U, Σ}

194 / 274

Page 195: Machine Learning for Speaker Recognition

Auxiliary function for SNR-invariant PLDA

The parameters θ = {m, V, U, Σ} can be learned from a training set X using maximum likelihood estimation.
X = {xᵏij ; i = 1, . . . , S; j = 1, . . . , Hi(k); k = 1, . . . , K}
− S: no. of training speakers
− K: no. of SNR groups
− Hi(k): no. of utterances from speaker i in the k-th SNR group

Given an initial value θ, we aim to find a new estimate θ′ that maximizes the auxiliary function:

Q(θ′|θ) = E_{h,w}[ Σikj ln( p(xᵏij | hi, wk, θ′) p(hi, wk) ) | X, θ ]
        = E_{h,w}[ Σikj ( ln N(xᵏij | m + Vhi + Uwk, Σ) + ln p(hi, wk) ) | X, θ ]

195 / 274

Page 196: Machine Learning for Speaker Recognition

Posterior distributions of latent variables

We show 3 ways to compute the posteriors:
1 Computing the posteriors of hi and wk separately.
2 Computing the posterior of hi while fixing wk using the Gauss-Seidel method.
3 Computing the joint posterior of hi and wk using variational Bayes.

196 / 274

Page 197: Machine Learning for Speaker Recognition

Method 1: Computing posteriors separately

Given an i-vector xᵏij, the posterior density of hi has the form:

p(hi | xᵏij, θ) ∝ p(xᵏij | hi, θ) p(hi)
  = ∫ p(xᵏij, wk | hi, θ) p(hi) dwk
  = ∫ p(xᵏij | hi, wk, θ) p(wk) p(hi) dwk
  = ∫ N(xᵏij | m + Vhi + Uwk, Σ) N(wk | 0, I) N(hi | 0, I) dwk
  = N(xᵏij | m + Vhi, Φ) N(hi | 0, I)
  ∝ exp{ hiᵀVᵀΦ⁻¹(xᵏij − m) − ½ hiᵀ(I + VᵀΦ⁻¹V) hi }

where Φ = UUᵀ + Σ.

197 / 274

Page 198: Machine Learning for Speaker Recognition

Method 1: Computing posteriors separately

If all of the i-vectors of speaker i, say Xi, are given,

p(hi | {xᵏij ∀j and k}, θ) ∝ Πk Πj p(xᵏij | hi, θ) p(hi)
  ∝ exp{ hiᵀVᵀΦ⁻¹ Σk Σj (xᵏij − m) − ½ hiᵀ( I + Σk Hi(k) VᵀΦ⁻¹V ) hi }

This is a Gaussian with mean and 2nd-order (uncentralized) moment

〈hi|Xi〉 = ( I + Σk Hi(k) VᵀΦ⁻¹V )⁻¹ VᵀΦ⁻¹ Σk Σj (xᵏij − m)
〈hihiᵀ|Xi〉 = ( I + Σk Hi(k) VᵀΦ⁻¹V )⁻¹ + 〈hi|Xi〉〈hi|Xi〉ᵀ        (1)

where the sums run over k = 1, . . . , K and j = 1, . . . , Hi(k).

Recall that for a Gaussian,

N(h | µh, Ch) ∝ exp{ −½ (h − µh)ᵀCh⁻¹(h − µh) } ∝ exp{ hᵀCh⁻¹µh − ½ hᵀCh⁻¹h }
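A direct NumPy transcription of Eq. 1 for one speaker, assuming the model parameters m, V, and Φ = UUᵀ + Σ and the speaker's i-vectors (grouped by SNR) are already available; all inputs below are synthetic placeholders.

```python
import numpy as np

def posterior_h(X_by_group, m, V, Phi):
    """Posterior mean and 2nd-order moment of h_i given all i-vectors of speaker i (Eq. 1)."""
    Phi_inv = np.linalg.inv(Phi)
    n_total = sum(len(X) for X in X_by_group)             # sum_k H_i(k)
    L = np.eye(V.shape[1]) + n_total * V.T @ Phi_inv @ V
    s = sum((X - m).sum(axis=0) for X in X_by_group)      # sum_kj (x_ij^k - m)
    h_mean = np.linalg.solve(L, V.T @ Phi_inv @ s)
    h_mom = np.linalg.inv(L) + np.outer(h_mean, h_mean)
    return h_mean, h_mom

rng = np.random.default_rng(0)
D, P = 200, 50
m, V = np.zeros(D), 0.1 * rng.standard_normal((D, P))
Phi = np.eye(D)
X_by_group = [rng.standard_normal((3, D)), rng.standard_normal((5, D))]  # K = 2 SNR groups
h_mean, h_mom = posterior_h(X_by_group, m, V, Phi)
```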

198 / 274

Page 199: Machine Learning for Speaker Recognition

Method 1: Computing posteriors separately

Similarly, to compute the posterior expectations of wk, we marginalize over the hi's. Thus, the posterior density of wk is

p(wk | xᵏij, θ) ∝ ∫ p(xᵏij | hi, wk, θ) p(hi) p(wk) dhi
  = ∫ N(xᵏij | m + Vhi + Uwk, Σ) N(hi | 0, I) N(wk | 0, I) dhi
  = N(xᵏij | m + Uwk, Ψ) N(wk | 0, I)
  ∝ exp{ wkᵀUᵀΨ⁻¹(xᵏij − m) − ½ wkᵀ(I + UᵀΨ⁻¹U) wk }

199 / 274

Page 200: Machine Learning for Speaker Recognition

Method 1: Computing posteriors separately

Given all of the i-vectors (Xᵏ) from the k-th SNR group, we can compute the posterior expectations as follows:

〈wk|Xᵏ〉 = ( I + Σi Hi(k) UᵀΨ⁻¹U )⁻¹ UᵀΨ⁻¹ Σi Σj (xᵏij − m)
〈wkwkᵀ|Xᵏ〉 = ( I + Σi Hi(k) UᵀΨ⁻¹U )⁻¹ + 〈wk|Xᵏ〉〈wk|Xᵏ〉ᵀ        (2)

where Ψ = VVᵀ + Σ and the sums run over i = 1, . . . , S and j = 1, . . . , Hi(k)

200 / 274

Page 201: Machine Learning for Speaker Recognition

Method 2: Computing posteriors by Gauss-Seidel method

Another approach to computing p(hi|Xi) is to assume that the wk's are fixed for all k = 1, . . . , K.
This is called the Gauss-Seidel method [Barrett et al., 1994]
We fix wk to its posterior mean: w∗k ≡ 〈wk|Xᵏ〉
The posterior density of hi becomes:

p(hi | Xi, {w∗k}, θ) ∝ Πk Πj p(xᵏij | hi, w∗k, θ) p(hi)
  = Πk Πj N(xᵏij | m + Vhi + Uw∗k, Σ) N(hi | 0, I)
  ∝ exp{ hiᵀVᵀΣ⁻¹ Σk Σj (xᵏij − m − Uw∗k) − ½ hiᵀ( I + Σk Hi(k) VᵀΣ⁻¹V ) hi }

201 / 274

Page 202: Machine Learning for Speaker Recognition

Method 2: Computing posteriors by Gauss-Seidel method

Comparing this posterior density with a standard Gaussian, we have

〈hi|Xi〉 = ( L(1)i )⁻¹ VᵀΣ⁻¹ Σk Σj (xᵏij − m − Uw∗k)
〈hihiᵀ|Xi〉 = ( L(1)i )⁻¹ + 〈hi|Xi〉〈hi|Xi〉ᵀ        (3)

where L(1)i ≡ I + Σk Hi(k) VᵀΣ⁻¹V

Note that this formulation is similar to the JFA model estimation in [Vogt and Sridharan, 2008].

202 / 274

Page 203: Machine Learning for Speaker Recognition

Method 2: Computing posteriors by Gauss-Seidel method

Applying the same approach to computing the posterior density of wk, we have

〈wk|Xᵏ〉 = ( L(2)k )⁻¹ UᵀΣ⁻¹ Σi Σj (xᵏij − m − Vh∗i)
〈wkwkᵀ|Xᵏ〉 = ( L(2)k )⁻¹ + 〈wk|Xᵏ〉〈wk|Xᵏ〉ᵀ        (4)

where L(2)k = I + Σi Hi(k) UᵀΣ⁻¹U and h∗i ≡ 〈hi|Xi〉

203 / 274

Page 204: Machine Learning for Speaker Recognition

Method 3: Computing posteriors by variational Bayes

Denote w = [w1, . . . , wK] and h = [h1, . . . , hS]
In variational Bayes [Bishop, 2006, Kenny, 2010], we factorize the joint posterior as follows:

ln p(h, w|X) ≈ ln q(h) + ln q(w) = Σi ln q(hi) + Σk ln q(wk)

where

ln q(h) = E_w{ln p(h, w, X)} + const
ln q(w) = E_h{ln p(h, w, X)} + const

and E_w means taking the expectation with respect to w.

204 / 274

Page 205: Machine Learning for Speaker Recognition

Method 3: Computing posteriors by variational Bayes

Consider ln q(h):

ln q(h) = E_w{ln p(h, w, X)} + const
  = 〈ln p(X|h, w)〉_w + 〈ln p(h, w)〉_w + const
  = Σijr 〈ln N(xʳij | m + Vhi + Uwr, Σ)〉_wr + Σi 〈ln N(hi|0, I)〉_w + Σr 〈ln N(wr|0, I)〉_w + const
  = −½ Σijr (xʳij − m − Vhi − Uw∗r)ᵀΣ⁻¹(xʳij − m − Vhi − Uw∗r) − ½ Σi hiᵀhi + const        (5)
  = Σi [ hiᵀVᵀΣ⁻¹ Σjr (xʳij − m − Uw∗r) − ½ hiᵀ( I + Σjr VᵀΣ⁻¹V ) hi ] + const

q(hi) is a Gaussian with mean and precision identical to Eq. 3:

〈hi|Xi〉 = ( L(1)i )⁻¹ VᵀΣ⁻¹ Σjr (xʳij − m − Uw∗r)
L(1)i = I + Σjr VᵀΣ⁻¹V        (6)

205 / 274

Page 206: Machine Learning for Speaker Recognition

Method 3: Computing posteriors by variational Bayes

ln q(w) = 〈ln p(X|h, w)〉_h + 〈ln p(h, w)〉_h + const
  = Σijk 〈ln N(xᵏij | m + Vhi + Uwk, Σ)〉_hi + Σi 〈ln N(hi|0, I)〉_hi + Σk 〈ln N(wk|0, I)〉_h + const
  = −½ Σijk (xᵏij − m − Vh∗i − Uwk)ᵀΣ⁻¹(xᵏij − m − Vh∗i − Uwk) − ½ Σk wkᵀwk + const
  = Σk [ wkᵀUᵀΣ⁻¹ Σij (xᵏij − m − Vh∗i) − ½ wkᵀ( I + Σij UᵀΣ⁻¹U ) wk ] + const

q(wk) is a Gaussian with mean and precision identical to Eq. 4:

〈wk|Xᵏ〉 = ( L(2)k )⁻¹ UᵀΣ⁻¹ Σij (xᵏij − m − Vh∗i)
L(2)k = I + Σij UᵀΣ⁻¹U        (7)

Note: 〈ln N(hi|0, I)〉_hi is the differential entropy of a normal distribution and is independent of wk; see [Norwich, 1993] (Ch. 8).

206 / 274

Page 207: Machine Learning for Speaker Recognition

Computing posterior moment

The exact posterior moment 〈wkhiᵀ|X〉 is complicated because hi and wk are correlated in the posterior.
If the Gauss-Seidel method is used, we may approximate the posterior moments by (Kenny 2010, p. 6)

〈wkhiᵀ|X〉 ≈ 〈wk|Xᵏ〉(h∗i)ᵀ
〈hiwkᵀ|X〉 ≈ 〈hi|Xi〉(w∗k)ᵀ

where h∗i and w∗k are the most up-to-date posterior means in the EM iterations.
Alternatively, we may compute the exact joint posterior.3 But it will be computationally intensive.

3http://www.eie.polyu.edu.hk/∼mwmak/papers/si-plda.pdf

207 / 274

Page 208: Machine Learning for Speaker Recognition

Computing posterior moment

A better approach is to use variational Bayes:

p(hi, wk|X) ≈ q(hi) q(wk)        (8)

Note that both q(hi) and q(wk) are Gaussians. Based on the law of total expectation,4 the factorization in Eq. 8 gives

〈wkhiᵀ|X〉 ≈ 〈wk|Xᵏ〉〈hi|Xi〉ᵀ
〈hiwkᵀ|X〉 ≈ 〈hi|Xi〉〈wk|Xᵏ〉ᵀ

4https://en.wikipedia.org/wiki/Product distribution

208 / 274

Page 209: Machine Learning for Speaker Recognition

Maximization Step

In the M-step, we maximize the auxiliary function:

Q(θ) = E_{h,w}[ Σijk ln{ N(xᵏij | m + Vhi + Uwk, Σ) p(hi, wk) } | X, θ ]
     = Σijk E_{h,w}[ −½ log|Σ| − ½ (xᵏij − m − Vhi − Uwk)ᵀΣ⁻¹(xᵏij − m − Vhi − Uwk) + ln p(hi, wk) | X, θ ]

As p(hi, wk) is independent of the model parameters V, U, and Σ, it can be taken out of Q(θ) in the M-step [Prince and Elder, 2007].

209 / 274

Page 210: Machine Learning for Speaker Recognition

Maximization Step

Differentiating Q(θ) with respect to V, U, and Σ and setting the results to 0, we obtain

V = [ Σijk ( (xᵏij − m)〈hi|Xi〉ᵀ − U〈wkhiᵀ|X〉 ) ] [ Σijk 〈hihiᵀ|X〉 ]⁻¹

U = [ Σijk ( (xᵏij − m)〈wk|Xᵏ〉ᵀ − V〈hiwkᵀ|X〉 ) ] [ Σijk 〈wkwkᵀ|X〉 ]⁻¹

Σ = (1/N) Σijk [ (xᵏij − m)(xᵏij − m)ᵀ − V〈hi|Xi〉(xᵏij − m)ᵀ − U〈wk|Xᵏ〉(xᵏij − m)ᵀ ]

210 / 274

Page 211: Machine Learning for Speaker Recognition

Likelihood Ratio Scores

Given the target-speaker's i-vector xs and the test-speaker's i-vector xt.
If xs and xt are from the same speaker, they should share the same speaker factor h:

xs = m + Vh + Uws + εs,   xt = m + Vh + Uwt + εt   =⇒   x̄st = m̄ + Āz̄st + ε̄st

where x̄st = [xs; xt], m̄ = [m; m], z̄st = [h; ws; wt], Ā = [V U 0; V 0 U], and ε̄st = [εs; εt].

Same-speaker likelihood:

p(x̄st | same-speaker) = ∫ p(x̄st | z̄st) p(z̄st) dz̄st
  = ∫ N(x̄st | m̄ + Āz̄st, Σ̄) N(z̄st | 0, I) dz̄st
  = N(x̄st | m̄, ĀĀᵀ + Σ̄)
  = N( [xs; xt] | [m; m], [Σtot Σac; Σac Σtot] )

where Σ̄ = diag{Σ, Σ}, Σtot = VVᵀ + UUᵀ + Σ, and Σac = VVᵀ

211 / 274

Page 212: Machine Learning for Speaker Recognition

Likelihood Ratio Scores

If xs and xt are from different speakers, they each have their own speaker factor (hs, ht):

xs = m + Vhs + Uws + εs,   xt = m + Vht + Uwt + εt   =⇒   x̄st = m̄ + Āz̄st + ε̄st

where z̄st = [hs; ht; ws; wt] and Ā = [V 0 U 0; 0 V 0 U].

Different-speaker likelihood:

p(x̄st | diff-speaker) = ∫ p(x̄st | z̄st) p(z̄st) dz̄st
  = ∫ N(x̄st | m̄ + Āz̄st, Σ̄) N(z̄st | 0, I) dz̄st
  = N(x̄st | m̄, ĀĀᵀ + Σ̄)
  = N( [xs; xt] | [m; m], [Σtot 0; 0 Σtot] )

212 / 274

Page 213: Machine Learning for Speaker Recognition

Likelihood Ratio Scores

Log-likelihood ratio score (assuming the i-vectors have been mean subtracted, x ← x − m):

SLR(xs, xt) = log [ p(xs, xt | same-speaker) / p(xs, xt | diff-speaker) ]
  = log N( [xs; xt] | [0; 0], [Σtot Σac; Σac Σtot] ) − log N( [xs; xt] | [0; 0], [Σtot 0; 0 Σtot] )
  = ½ [ xsᵀQxs + 2xsᵀPxt + xtᵀQxt ] + const        (9)

where

Q = Σtot⁻¹ − (Σtot − ΣacΣtot⁻¹Σac)⁻¹
P = Σtot⁻¹Σac(Σtot − ΣacΣtot⁻¹Σac)⁻¹
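A sketch of the batch-mode score of Eq. 9, assuming mean-subtracted i-vectors and given Σtot and Σac (the constant term is omitted); all matrices and vectors below are synthetic placeholders.

```python
import numpy as np

def plda_llr(xs, xt, Sigma_tot, Sigma_ac):
    """Log-likelihood ratio of Eq. 9 (constant term omitted)."""
    tot_inv = np.linalg.inv(Sigma_tot)
    M = np.linalg.inv(Sigma_tot - Sigma_ac @ tot_inv @ Sigma_ac)
    Q = tot_inv - M
    P = tot_inv @ Sigma_ac @ M
    return 0.5 * (xs @ Q @ xs + 2 * xs @ P @ xt + xt @ Q @ xt)

rng = np.random.default_rng(0)
D = 200
V = 0.1 * rng.standard_normal((D, 50))
U = 0.1 * rng.standard_normal((D, 10))
Sigma = np.eye(D)
Sigma_ac = V @ V.T                       # across-class (speaker) covariance
Sigma_tot = Sigma_ac + U @ U.T + Sigma   # total covariance
score = plda_llr(rng.standard_normal(D), rng.standard_normal(D), Sigma_tot, Sigma_ac)
```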

213 / 274

Page 214: Machine Learning for Speaker Recognition

Likelihood Ratio Scores

The LLR in Eq. 9 assumes that the SNRs of both the target-speaker's utterance and the test utterance are unknown.
If both SNRs (ℓs, ℓt) are known, we may compute the score as follows:

SLR(xs, xt | ℓs, ℓt) = log [ p(xs, xt | same-speaker, ℓs, ℓt) / p(xs, xt | diff-speaker, ℓs, ℓt) ]

214 / 274

Page 215: Machine Learning for Speaker Recognition

Likelihood Ratio Scores

Given an i-vector x and utterance SNR ℓ, the likelihood of x is

p(x|ℓ) = ∫∫ p(x|h, w, ℓ) p(h, w|ℓ) dh dw
       = ∫∫ p(x|h, w, ℓ) p(h|w, ℓ) p(w|ℓ) dh dw
       = ∫∫ p(x|h, w, ℓ) p(h) dh p(w|ℓ) dw

where we have assumed that h is a priori independent of w and ℓ.
Note that if ℓ ∈ k-th SNR group, we have w = w∗k ≡ 〈wk|Xᵏ〉, so

p(x | ℓ ∈ k-th SNR group) = ∫ p(x|h, w∗k) p(h) dh
  = ∫ N(x | m + Vh + Uw∗k, Σ) N(h|0, I) dh
  = N(x | m + Uw∗k, VVᵀ + Σ)

215 / 274

Page 216: Machine Learning for Speaker Recognition

Likelihood Ratio Scores

SLR(xs, xt | ℓs, ℓt) = log [ p(xs, xt | same-speaker, ℓs, ℓt) / p(xs, xt | diff-speaker, ℓs, ℓt) ]
  = log N( [xs; xt] | [m + Uw∗ks; m + Uw∗kt], [Ψ Σac; Σac Ψ] ) − log N( [xs; xt] | [m + Uw∗ks; m + Uw∗kt], [Ψ 0; 0 Ψ] )
  = ½ [ x̃sᵀQx̃s + 2x̃sᵀPx̃t + x̃tᵀQx̃t ] + const        (10)

where

x̃s = xs − m − Uw∗ks,   x̃t = xt − m − Uw∗kt
Q = Ψ⁻¹ − (Ψ − ΣacΨ⁻¹Σac)⁻¹
P = Ψ⁻¹Σac(Ψ − ΣacΨ⁻¹Σac)⁻¹
Ψ = VVᵀ + Σ;   Σac = VVᵀ

216 / 274

Page 217: Machine Learning for Speaker Recognition

Compare with scoring in JFA

Scoring in JFA is based on the sequential mode [Kenny et al., 2007b]:

SJFA-LR(Os, Ot) = PΛ(s)(Ot) / PΛ(Ot)

where Λ(s) denotes the adapted speaker model based on the enrollment speech Os from speaker s.
Computing PΛ(s)(Ot) requires the posterior density of the speaker factors [y(s) and z(s) in Kenny 2007], which are posteriorly correlated.
The scoring function in Eq. 9 is based on the batch mode.
Batch mode is similar to speaker comparison in which no model adaptation is performed. So, the posterior correlation between speaker factors and SNR factors does not occur in Eq. 9.

217 / 274

Page 218: Machine Learning for Speaker Recognition

Scoring based on sequential mode

The batch-mode scoring (Eq. 10) requires inverting a big matrix if the target speaker has a large number of enrollment utterances.
The sequential-mode scoring can mitigate this problem.
For notational simplicity, we assume that the target speaker only has one enrollment utterance with i-vector xs:

SLR(xs, xt | ℓs, ℓt) = p(xs, xt | ℓs, ℓt) / [ p(xs|ℓs) p(xt|ℓt) ]
  = p(xt | xs, ℓs, ℓt) p(xs|ℓs) / [ p(xs|ℓs) p(xt|ℓt) ]
  = p(xt | xs, ℓs, ℓt) / p(xt|ℓt)

218 / 274

Page 219: Machine Learning for Speaker Recognition

Scoring based on sequential mode

For simplicity, we omit ℓs and ℓt from now on.

p(xt|xs) = ∫∫ p(xt|h, w) p(h, w|xs) dh dw

As h and w are posteriorly dependent, we use variational Bayes to approximate the joint posterior:

p(xt|xs) ≈ ∫∫ p(xt|h, w) q(h) q(w) dh dw
  = ∫∫ N(xt | m + Vh + Uw, Σ) N(h | µhs, Σhs) N(w | µws, Σws) dh dw        (11)

where µhs, Σhs, µws, and Σws are the posterior means and posterior covariances.

219 / 274

Page 220: Machine Learning for Speaker Recognition

Scoring based on sequential mode

Eq. 11 is a convolution of Gaussians:

p(xt|xs) ≈ ∫ [ ∫ N(xt | m + Vh + Uw, Σ) N(h | µhs, Σhs) dh ] N(w | µws, Σws) dw
  = ∫ N(xt | m + Vµhs + Uw, VΣhsVᵀ + Σ) N(w | µws, Σws) dw
  = N(xt | m + Vµhs + Uµws, VΣhsVᵀ + UΣwsUᵀ + Σ)

If ℓs falls in the k-th SNR group, we may replace µws by w∗k ≡ 〈wk|Xᵏ〉 and assume that Σw∗k → 0:

p(xt|xs) = N(xt | m + Vµhs + Uw∗k, VΣhsVᵀ + Σ)

p(xt) is a marginal density:

p(xt) = ∫ p(xt|h, w) p(h, w) dh dw
  = ∫ N(xt | m + Vh + Uw, Σ) N(h|0, I) N(w|0, I) dh dw
  = N(xt | m, VVᵀ + UUᵀ + Σ)

220 / 274

Page 221: Machine Learning for Speaker Recognition

Scoring based on sequential mode

The posterior means and covariances can be obtained from Eq. 6 and Eq. 7 by considering a single utterance from target speaker s:

µhs = 〈hs|xs〉 = ΣhsVᵀΣ⁻¹(xs − m − Uµws)
µws = 〈ws|xs〉 = ΣwsUᵀΣ⁻¹(xs − m − Vµhs)
Σhs = (I + VᵀΣ⁻¹V)⁻¹
Σws = (I + UᵀΣ⁻¹U)⁻¹

Note that µhs and µws depend on each other, meaning that they should be found iteratively.

221 / 274

Page 222: Machine Learning for Speaker Recognition

Experiments on SRE12

Evaluation dataset: Common evaluation conditions 1 and 4 of the NIST SRE 2012 core set
Parameterization: 19 MFCCs together with energy plus their 1st and 2nd derivatives =⇒ 60-dim acoustic vectors
UBM: gender-dependent, mic+tel, 1024 mixtures
Total variability matrix: gender-dependent, mic+tel, 500 total factors
I-vector preprocessing: whitening by WCCN, then length normalization, followed by non-parametric feature analysis (NFA)5 and WCCN (500-dim → 200-dim)

5Z. Li, D. Lin, and X. Tang, "Nonparametric discriminant analysis for face recognition," IEEE Trans. on PAMI, 2009.

222 / 274

Page 223: Machine Learning for Speaker Recognition

Prepare training speech files

223 / 274

Page 224: Machine Learning for Speaker Recognition

SNR distributions

SNR Distribution of training and test utterances in CC4

224 / 274

Page 225: Machine Learning for Speaker Recognition

Performance on SRE12

225 / 274

Page 226: Machine Learning for Speaker Recognition

Performance on SRE12

226 / 274

Page 227: Machine Learning for Speaker Recognition

Performance on SRE12

227 / 274

Page 228: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies
5.1. Heavy-Tailed PLDA
5.2. SNR-Invariant PLDA
5.3. Mixture of PLDA
5.4. DNN I-vectors
5.5. PLDA with RBM

6 Future Direction

228 / 274

Page 229: Machine Learning for Speaker Recognition

Mixture of PLDA: Motivation

Conventional i-vector/PLDA systems use a single PLDA model to handle all SNR conditions

229 / 274

Page 230: Machine Learning for Speaker Recognition

Mixture of PLDA: Motivation

We argue that a PLDA model should focus on a small range of SNR.

230 / 274

Page 231: Machine Learning for Speaker Recognition

Distribution of SNR

231 / 274

Page 232: Machine Learning for Speaker Recognition

Proposed solution

The full spectrum of SNRs is handled by a mixture of PLDA in which the posteriors of the indicator variables depend on the utterance's SNR.
Verification scores depend not only on the same-speaker and different-speaker likelihoods but also on the posterior probabilities of the SNR.

232 / 274

Page 233: Machine Learning for Speaker Recognition

Mixture of PLDA [Mak et al., 2016]

Model parameters:

θ = {π, µ, σ, m, V, Σ} = { πk, µk, σk (modeling SNR),  mk, Vk, Σk (speaker subspaces) },  k = 1, . . . , K

Generative model:

xij ∼ Σ_{k=1}^{K} P(yijk = 1 | ℓij) N(xij | mk, VkVkᵀ + Σk)

where

P(yijk = 1 | ℓij) = πk N(ℓij | µk, σk²) / Σ_{k′=1}^{K} πk′ N(ℓij | µk′, σk′²)

and ℓij is the SNR of utterance j from speaker i.
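A sketch of the SNR-dependent mixture posterior P(yijk = 1 | ℓij) that aligns an utterance to a PLDA mixture component; the mixture weights, means, and standard deviations below are illustrative placeholders, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

def snr_posteriors(snr, pi, mu, sigma):
    """P(y_k = 1 | SNR) for each mixture component k."""
    lik = pi * norm.pdf(snr, loc=mu, scale=sigma)
    return lik / lik.sum()

# Three hypothetical PLDA mixtures focusing on low, medium, and high SNR
pi = np.array([1 / 3, 1 / 3, 1 / 3])
mu = np.array([6.0, 15.0, 30.0])       # dB
sigma = np.array([3.0, 4.0, 6.0])

print(snr_posteriors(8.0, pi, mu, sigma))    # utterance at 8 dB: mostly mixture 1
print(snr_posteriors(28.0, pi, mu, sigma))   # utterance at 28 dB: mostly mixture 3
```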

233 / 274

Page 234: Machine Learning for Speaker Recognition

Mixture of PLDA

Graphical model:

234 / 274

Page 235: Machine Learning for Speaker Recognition

PLDA vs. Mixture of PLDA

Graphical models:

[Figure: graphical models of PLDA (latent zi generating xij with parameters m, V, Σ) and of mixture of PLDA (an additional indicator yijk driven by the SNR ℓij, with SNR parameters µ, σ and per-mixture m, V, Σ)]

235 / 274

Page 236: Machine Learning for Speaker Recognition

Experiments on SRE12

Evaluation dataset: Common evaluation conditions 1 and 4 of the NIST SRE 2012 core set
Parameterization: 19 MFCCs together with energy plus their 1st and 2nd derivatives =⇒ 60-dim acoustic vectors
UBM: gender-dependent, mic+tel, 1024 mixtures
Total variability matrix: gender-dependent, mic+tel, 500 total factors
I-vector preprocessing: whitening by WCCN, then length normalization, followed by LDA and WCCN (500-dim → 200-dim)

236 / 274

Page 237: Machine Learning for Speaker Recognition

Experiments on SRE12

In NIST 2012 SRE, training utterances from telephone channels are clean, but some of the test utterances are noisy.
We used the FaNT tool to add babble noise to the clean training utterances.

237 / 274

Page 238: Machine Learning for Speaker Recognition

DET Performance

Train on tel+mic speech and test on noisy tel speech (CC4).

238 / 274

Page 239: Machine Learning for Speaker Recognition

I-vector Cluster Alignment

[Figure: alignment of i-vector clusters to mixtures as a function of SNR (dB). Without using SNR, clusters are aligned only to mixtures 1 and 3; with SNR as guidance, clusters are aligned to mixtures 1, 2, and 3 according to their SNR.]

239 / 274

Page 240: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies
5.1. Heavy-Tailed PLDA
5.2. SNR-Invariant PLDA
5.3. Mixture of PLDA
5.4. DNN I-vectors
5.5. PLDA with RBM

6 Future Direction

240 / 274

Page 241: Machine Learning for Speaker Recognition

DNN for speaker recognition

Main idea: replace the universal background model (UBM) with a phonetically-aware DNN for computing the frame posterior probabilities.
This is the most successful application of DNNs to speaker recognition [Lei et al., 2014, Ferrer et al., 2016, Richardson et al., 2015]

Source: Richardson, F., et al., IEEE Signal Processing Letters, 2015

241 / 274

Page 242: Machine Learning for Speaker Recognition

DNN I-vector extraction

Source: Ferrer, L., et al., IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.

242 / 274

Page 243: Machine Learning for Speaker Recognition

UBM I-vectors

Factor analysis model for UBM i-vectors:

µc = µ(b)c + Tcw,   c = 1, . . . , C

Given the MFCC vectors of an utterance O = {o1, . . . , oT}, its i-vector is the posterior mean of w:

x ≡ 〈w|O〉 = L⁻¹ Σ_{c=1}^{C} Tcᵀ (Σ(b)c)⁻¹ Σ_{t=1}^{T} γc(ot)(ot − µ(b)c)

where

L = I + Σ_{c=1}^{C} Σ_{t=1}^{T} γc(ot) Tcᵀ (Σ(b)c)⁻¹ Tc

γc(ot) ≡ Pr(Mixture = c | ot) = λ(b)c N(ot | µ(b)c, Σ(b)c) / Σ_{j=1}^{C} λ(b)j N(ot | µ(b)j, Σ(b)j)
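A NumPy transcription of the formula above, assuming the per-mixture sufficient statistics (zeroth-order Nc = Σt γc(ot) and centred first-order fc = Σt γc(ot)(ot − µ(b)c)) have already been accumulated and the UBM covariances are diagonal; all sizes and inputs are illustrative.

```python
import numpy as np

def extract_ivector(N, F, T_mats, Sigma_b):
    """i-vector = posterior mean of w given zeroth/first-order statistics.

    N:       (C,)      zeroth-order stats, N_c = sum_t gamma_c(o_t)
    F:       (C, D)    centred first-order stats, f_c = sum_t gamma_c(o_t)(o_t - mu_c)
    T_mats:  (C, D, M) per-mixture factor-loading blocks T_c
    Sigma_b: (C, D)    diagonal UBM covariances
    """
    C, D, M = T_mats.shape
    L = np.eye(M)
    rhs = np.zeros(M)
    for c in range(C):
        TSi = T_mats[c].T / Sigma_b[c]          # T_c^T Sigma_c^-1 (diagonal covariance)
        L += N[c] * TSi @ T_mats[c]
        rhs += TSi @ F[c]
    return np.linalg.solve(L, rhs)

rng = np.random.default_rng(0)
C, D, M = 64, 60, 100                            # mixtures, feature dim, i-vector dim
x = extract_ivector(rng.random(C) * 10,
                    rng.standard_normal((C, D)),
                    0.1 * rng.standard_normal((C, D, M)),
                    np.ones((C, D)))
```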

243 / 274

Page 244: Machine Learning for Speaker Recognition

DNN I-vectors

Replace γc(ot) by the DNN output γDNNc(at)

The DNN is trained to produce posterior probabilities of senones, given multiple frames of acoustic features, at, as input.

The acoustic features used for speech recognition in the DNN are not necessarily the same as the features used by the i-vector extractor.

244 / 274

Page 245: Machine Learning for Speaker Recognition

DNN I-vectors

Given MFCC or bottleneck (BN) feature vectors O = {o1, . . . , oT}, the DNN i-vector is6

x ≡ 〈w|O〉 = L⁻¹ Σ_{c=1}^{C} Tcᵀ (ΣDNNc)⁻¹ Σ_{t=1}^{T} γDNNc(at)(ot − µDNNc)

where

µDNNc = [ Σ_{i=1}^{N} Σ_{t=1}^{Ti} γDNNc(ait) oit ] / [ Σ_{i=1}^{N} Σ_{t=1}^{Ti} γDNNc(ait) ]

ΣDNNc = [ Σ_{i=1}^{N} Σ_{t=1}^{Ti} γDNNc(ait) oit oitᵀ ] / [ Σ_{i=1}^{N} Σ_{t=1}^{Ti} γDNNc(ait) ] − µDNNc (µDNNc)ᵀ

6For a full set of formulae, see http://www.eie.polyu.edu.hk/∼mwmak/papers/FA-Ivector.pdf

245 / 274

Page 246: Machine Learning for Speaker Recognition

Denoising deep classifier [Tan et al., 2016]

246 / 274

Page 247: Machine Learning for Speaker Recognition

DNN I-vectors from denoising deep classifier

247 / 274

Page 248: Machine Learning for Speaker Recognition

Performance on NIST 2012 SRE

Performance on CC4 with test utterances contaminated with babble noise.

248 / 274

Page 249: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies
5.1. Heavy-Tailed PLDA
5.2. SNR-Invariant PLDA
5.3. Mixture of PLDA
5.4. DNN I-vectors
5.5. PLDA with RBM

6 Future Direction

249 / 274

Page 250: Machine Learning for Speaker Recognition

PLDA–RBM [Stafylakis et al., 2012]

Main idea:
− Use i-vectors as input to the Gaussian visible layer of an RBM
− Divide the RBM weights into two parts: speaker and channel
− Consider the RBM weights as analogues of PLDA's loading matrices
− Divide the Gaussian hidden layer into two parts: speaker and channel
− Hidden nodes are considered as latent factors

[Figure: an RBM and its RBM–PLDA variant with speaker and channel partitions]

250 / 274

Page 251: Machine Learning for Speaker Recognition

PLDA vs. PLDA–RBM

PLDA (omitting the global mean):

v = Vs + Uc + ε

where v is an i-vector, and s and c are the speaker and channel factors.

RBM-PLDA:

vn = σv ⊙ [ Ws (sst/σs) + Wc (cst/σc) ]

where vn is the expected value of the visible layer in the negative phase of CD-1 sampling, and sst and cst are the states of the Gaussian hidden nodes of the RBM.

251 / 274

Page 252: Machine Learning for Speaker Recognition

Scoring in PLDA–RBM

Given two i-vectors vi, i = 1, 2, compute si = Wsᵀvi.
The log-likelihood ratio is

LLR = −½ (s1 − s2)ᵀ(s1 − s2) + const

If ‖si‖ = 1, the model is similar to cosine-distance scoring.

252 / 274

Page 253: Machine Learning for Speaker Recognition

Results on NIST 2010 SRE

NIST’10, female, core-extended:

253 / 274

Page 254: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies

6 Future Direction
6.1. Domain adaptation
6.2. Short-utterance speaker verification
6.3. Text-dependent speaker recognition

254 / 274

Page 255: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies

6 Future Direction
6.1. Domain adaptation
6.2. Short-utterance speaker verification
6.3. Text-dependent speaker recognition

255 / 274

Page 256: Machine Learning for Speaker Recognition

Domain adaptation

Domain adaptation aims to adapt a system from a resource-rich domain to a resource-limited domain.
Domain mismatch: systems perform very well in the domain (environment) for which they are trained; however, their performance suffers when users use the systems in another domain.
The Domain Adaptation Challenge (DAC13) was introduced in 2013.
The effect of domain mismatch on the PLDA parameters is more pronounced.
Popular approaches:
− Bayesian adaptation of PLDA models [Villalba and Lleida, 2014]
− Unsupervised clustering of i-vectors for adapting the covariance matrices of PLDA models [Shum et al., 2014, Garcia-Romero et al., 2014]
− Inter-dataset variability compensation [Aronowitz, 2014, Kanagasundaram et al., 2015]

256 / 274

Page 257: Machine Learning for Speaker Recognition

Bayesian Adaptation of PLDA (Villalba and Lleida, 2014)

Use a generative model to generate out-of-domain labelled data (labels θdj) and in-domain (target) unlabelled data (labels θj).
The unknown labels θj are modelled as latent variables.

257 / 274

Page 258: Machine Learning for Speaker Recognition

Unsupervised clustering of i-vectors (Shum et al., 2014)

Step 1: Use the within-class and between-class covariance matrices of the out-of-domain data to compute a pairwise affinity matrix on the unlabelled in-domain data.
Step 2: Use the affinity matrix to obtain hypothesized speaker clusters of the in-domain data.
Step 3: Linearly interpolate the in-domain and out-of-domain covariance matrices to obtain the adapted matrices:

Σadapt = αwc Σin + (1 − αwc) Σout
Φadapt = αac Φin + (1 − αac) Φout

258 / 274

Page 259: Machine Learning for Speaker Recognition

Inter-dataset variability compensation (IDVC)

IDVC aims at explicitly modeling dataset-shift variability in the i-vector space and compensating for it as a pre-processing cleanup step.
There is no need to use in-domain data to adapt the PLDA model.

Step 1: Divide the development data into different subsets according to their sources, e.g., different LDC distributions of Switchboard.
Step 2: Estimate an inter-dataset variability subspace with the largest variability across the mean i-vectors of these subsets.
Step 3: Remove the variability of the i-vectors in this subspace via nuisance attribute projection (NAP).

259 / 274

Page 260: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies

6 Future Direction
6.1. Domain adaptation
6.2. Short-utterance speaker verification
6.3. Text-dependent speaker recognition

260 / 274

Page 261: Machine Learning for Speaker Recognition

Short-utterance speaker verification

The performance of i-vector/PLDA systems degrades rapidly when the systems are presented with short utterances or utterances with varying durations.
These systems consider long and short utterances as equally reliable.
However, the i-vectors of short utterances have much bigger posterior covariances.
Popular approaches:
− Uncertainty propagation: propagate the posterior covariances of the i-vectors to PLDA [Kenny et al., 2013]
− Full posterior distribution PLDA [Cumani et al., 2014]

261 / 274

Page 262: Machine Learning for Speaker Recognition

Uncertainty propagation (Kenny et al., 2013)

In i-vector extraction, besides the posterior mean of the latent variable (the i-vector), we also have the posterior covariance matrix, which reflects the uncertainty of the i-vector estimate.

262 / 274

Page 263: Machine Learning for Speaker Recognition

Uncertainty propagation (Kenny et al., 2013)

wij = m + Vhi + Uijzij + εij

Uij is the Cholesky decomposition of the posterior covariance matrix (L⁻¹ij) of the j-th i-vector of the i-th speaker.
The intra-speaker covariance matrix becomes:

cov(wij, wij | hi) = UijUijᵀ + Σ

where UijUijᵀ changes from utterance to utterance, thus reflecting the reliability of the i-vector wij.
Scoring is computationally expensive; see [Lin and Mak, 2016] for a fast scoring algorithm.
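A small sketch of how the per-utterance uncertainty enters the intra-speaker covariance: take the Cholesky factor of the i-vector posterior covariance and add its outer product to Σ; all matrices below are synthetic placeholders, not estimates from any system.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 200
Sigma = np.eye(D)                                   # PLDA residual covariance

# Posterior covariance L_ij^-1 of one i-vector (larger for shorter utterances)
A = 0.05 * rng.standard_normal((D, D))
post_cov = A @ A.T + 0.1 * np.eye(D)

U_ij = np.linalg.cholesky(post_cov)                 # post_cov = U_ij U_ij^T
intra_cov = U_ij @ U_ij.T + Sigma                   # cov(w_ij, w_ij | h_i)
```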

263 / 274

Page 264: Machine Learning for Speaker Recognition

Outline

1 Introduction

2 Learning Algorithms

3 Learning Models

4 Deep Learning

5 Case Studies

6 Future Direction
6.1. Domain adaptation
6.2. Short-utterance speaker verification
6.3. Text-dependent speaker recognition

264 / 274

Page 265: Machine Learning for Speaker Recognition

Text-dependent SV using short utterances

Very difficult for short-utterance text-dependent SV.
Results on text-dependent tasks with the uncertainty-propagation version of i-vector/PLDA were unsatisfactory [Stafylakis et al., 2013].
It is more natural to use HMMs rather than GMMs for text-dependent tasks. But HMMs require local hidden variables, which are difficult to handle because of data fragmentation.
Emergent approaches:
− Content matching [Scheffer and Lei, 2014]
− Exploiting out-of-domain data via domain adaptation [Aronowitz and Rendel, 2014, Kenny et al., 2014]
− y-vector and JFA backend [Kenny et al., 2015b]
− I-vector backend [Kenny et al., 2015a]
− Hidden supervector backend [Kenny et al., 2016]
− Use DNN/RNN to extract utterance-level features [Heigold et al., 2016, Bhattacharya et al., 2016, Zeinali et al., 2016]

265 / 274

Page 266: Machine Learning for Speaker Recognition

References I

Aronowitz, H. (2014). Compensating inter-dataset variability in PLDA hyper-parameters for robust speaker recognition. In Odyssey: The Speaker and Language Recognition Workshop.

Aronowitz, H. and Rendel, A. (2014). Domain adaptation for text dependent speaker verification. In Interspeech, pages 1337–1341.

Barrett, R., Berry, M. W., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., and Van der Vorst, H. (1994). Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, volume 43. SIAM.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Scholkopf, B., Platt, J. C., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press.

Bhattacharya, G., Kenny, P., Alam, J., and Stafylakis, T. (2016). Deep neural network based text-dependent speaker verification: Preliminary results. In Odyssey 2016: The Speaker and Language Recognition Workshop, pages 9–15, Bilbao, Spain.

Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer, New York.

Campbell, W. M., Sturim, D. E., and Reynolds, D. A. (2006). Support vector machines using GMM supervectors for speaker verification. IEEE Signal Processing Letters, 13(5):308–311.

266 / 274

Page 267: Machine Learning for Speaker Recognition

References II

Cumani, S., Plchot, O., and Laface, P. (2014). On the use of i-vector posterior distributions in probabilistic linear discriminant analysis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):846–857.

Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., and Ouellet, P. (2011). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38.

Ferrer, L., Lei, Y., McLaren, M., and Scheffer, N. (2016). Study of senone-based deep neural network approaches for spoken language recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(1):105–116.

Garcia-Romero, D. and Espy-Wilson, C. (2011). Analysis of i-vector length normalization in speaker recognition systems. In Proc. of Interspeech, pages 249–252.

Garcia-Romero, D., Zhang, X., McCree, A., and Povey, D. (2014). Improving speaker recognition performance in the domain adaptation challenge using deep neural networks. In Proc. of IEEE Spoken Language Technology Workshop (SLT), pages 378–383.

267 / 274

Page 268: Machine Learning for Speaker Recognition

References III

Gauvain, J. L. and Lee, C.-H. (1994). Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2):291–298.

Heigold, G., Moreno, I., Bengio, S., and Shazeer, N. (2016). End-to-end text-dependent speaker verification. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5115–5119.

Kanagasundaram, A., Dean, D., and Sridharan, S. (2015). Improving out-domain PLDA speaker verification using unsupervised inter-dataset variability compensation approach. In Proc. of ICASSP, pages 4654–4658.

Kenny, P. (2010). Bayesian speaker verification with heavy-tailed priors. In Proc. of Odyssey: Speaker and Language Recognition Workshop, Brno, Czech Republic.

Kenny, P., Boulianne, G., and Dumouchel, P. (2005). Eigenvoice modeling with sparse training data. IEEE Transactions on Speech and Audio Processing, 13(3):345–354.

Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P. (2007a). Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech and Language Processing, 15(4):1435–1447.

268 / 274

Page 269: Machine Learning for Speaker Recognition

References IV

Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P. (2007b). Speaker and session variability in GMM-based speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(4):1448–1460.

Kenny, P., Stafylakis, T., Alam, J., Gupta, V., and Kockmann, M. (2016). Uncertainty modeling without subspace methods for text-dependent speaker recognition. In Odyssey: The Speaker and Language Recognition Workshop, pages 16–23, Bilbao, Spain.

Kenny, P., Stafylakis, T., Alam, J., and Kockmann, M. (2015a). An i-vector backend for speaker verification. In Proc. of INTERSPEECH, pages 2307–2310.

Kenny, P., Stafylakis, T., Alam, J., and Kockmann, M. (2015b). JFA modeling with left-to-right structure and a new backend for text-dependent speaker recognition. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4689–4693.

Kenny, P., Stafylakis, T., Alam, M. J., Ouellet, P., and Kockmann, M. (2014). In-domain versus out-of-domain training for text-dependent JFA. In INTERSPEECH, pages 1332–1336.

Kenny, P., Stafylakis, T., Ouellet, P., Alam, M. J., and Dumouchel, P. (2013). PLDA for speaker verification with utterances of arbitrary duration. In Proc. ICASSP, pages 7649–7653.

269 / 274

Page 270: Machine Learning for Speaker Recognition

References V

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proc. of International Conference on Learning Representations.

Lei, Y., Scheffer, N., Ferrer, L., and McLaren, M. (2014). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In Proc. of ICASSP.

Li, N. and Mak, M. W. (2015). SNR-invariant PLDA modeling in nonparametric subspace for robust speaker verification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(10):1648–1659.

Lin, W. W. and Mak, M. W. (2016). Fast scoring for PLDA with uncertainty propagation. In Odyssey 2016: The Speaker and Language Recognition Workshop, pages 31–38, Bilbao, Spain.

Mak, M.-W., Pang, X., and Chien, J.-T. (2016). Mixture of PLDA for noise robust i-vector speaker verification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(1):130–142.

Norwich, K. H. (1993). Information, Sensation, and Perception. Academic Press, San Diego.

Prince, S. and Elder, J. (2007). Probabilistic linear discriminant analysis for inferences about identity. In Proc. of IEEE International Conference on Computer Vision, pages 1–8.

270 / 274

Page 271: Machine Learning for Speaker Recognition

References VI

Reynolds, D. A., Quatieri, T. F., and Dunn, R. B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1–3):19–41.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proc. of International Conference on Machine Learning, pages 1278–1286.

Richardson, F., Reynolds, D., and Dehak, N. (2015). Deep neural network approaches to speaker and language recognition. IEEE Signal Processing Letters, 22(10):1671–1675.

Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In Proc. of International Conference on Artificial Intelligence and Statistics, page 3.

Scheffer, N. and Lei, Y. (2014). Content matching for short duration speaker recognition. In Proc. of INTERSPEECH, pages 1317–1321.

Shum, S., Reynolds, D. A., Garcia-Romero, D., and McCree, A. (2014). Unsupervised clustering approaches for domain adaptation in speaker recognition systems. In Odyssey: The Speaker and Language Recognition Workshop, pages 266–272.

Stafylakis, T., Kenny, P., Ouellet, P., Perez, J., Kockmann, M., and Dumouchel, P. (2013). Text-dependent speaker recognition using PLDA with uncertainty propagation. In Proc. of INTERSPEECH.

271 / 274

Page 272: Machine Learning for Speaker Recognition

References VII

Stafylakis, T., Kenny, P., Senoussaoui, M., and Dumouchel, P. (2012). PLDA using Gaussian restricted Boltzmann machines with application to speaker verification. In Proc. of INTERSPEECH, pages 1692–1695.

Tan, Z. L., Zhu, Y. K., Mak, M. W., and Mak, B. (2016). Senone i-vectors for robust speaker verification. In ISCSLP'16.

Vapnik, V. (2013). The Nature of Statistical Learning Theory. Springer Science & Business Media.

Villalba, J. and Lleida, E. (2014). Unsupervised adaptation of PLDA by using variational Bayes methods. In Proc. ICASSP, pages 744–748.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proc. of International Conference on Machine Learning, pages 1096–1103.

Vogt, R. and Sridharan, S. (2008). Explicit modelling of session variability for speaker verification. Computer Speech & Language, 22(1):17–38.

Watanabe, S. and Chien, J.-T. (2015). Bayesian Speech and Language Processing. Cambridge University Press.

272 / 274

Page 273: Machine Learning for Speaker Recognition

References VIII

Zeinali, H., Burget, L., Sameti, H., Glembek, O., and Plchot, O. (2016). Deep neural networks and hidden Markov models in i-vector-based text-dependent speaker verification. In Odyssey: The Speaker and Language Recognition Workshop, pages 24–30, Bilbao, Spain.

Zhao, X. and Dong, Y. (2012). Variational Bayesian joint factor analysis models for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 20(3):1032–1042.

273 / 274

Page 274: Machine Learning for Speaker Recognition

Contact information

Man-Wai Mak
The Hong Kong Polytechnic University
[email protected]
http://www.eie.polyu.edu.hk/∼mwmak/

Jen-Tzung Chien
National Chiao Tung University
[email protected]
http://chien.cm.nctu.edu.tw

274 / 274