Brno University Of Technology Speech@FIT
Lukáš Burget, Michal Fapšo, Valiantsina Hubeika, Ondřej Glembek, Martin Karafiát, Marcel Kockmann, Pavel Matějka, Petr Schwarz and Honza Černocký
NIST Speaker Recognition Workshop 2008
MOBIO
NIST SRE 2008 2/24
Outline
• Submitted systems
• Factor Analysis systems
• SVM-MLLR system
• Side information based calibration and fusion
• Contribution of subsystems in fusion
• Analysis of FA system
  – Gender dependent vs. independent
  – Flavors of FA system
  – Sensitivity to number of eigenchannels
  – Importance of ZT-norm
  – Optimization for microphone data
• Techniques that did not make it to the submission
• Conclusion
Submitted systems
• BUT01 – primary (3 systems)
  – Channel and language side information in fusion
  – FA-MFCC1339
  – FA-MFCC2060
  – SVM-MLLR
• BUT02 (3 systems)
  – The same as BUT01, but no side information in fusion
• BUT03 (2 systems)
  – Channel and language side information in fusion
  – FA-MFCC1339
  – FA-MFCC2060
FA-MFCC1339 system
• MAP-adapted UBM with 2048 Gaussian components
  – Single UBM trained on Switchboard and NIST 2004, 2005 data
• 12 MFCC + C0 (20 ms window, 10 ms shift)
• Short-time Gaussianization
  – The rank of the current frame coefficient within a 3 s window is transformed by the inverse Gaussian cumulative distribution function.
• Delta + double-delta + triple-delta coefficients
  – Together 52 coefficients, 12 frames of context
• HLDA (dimensionality reduction from 52 to 39)
• Factor Analysis model – gender independent
  – 300 eigenvoices (Switchboard, NIST 2004, 2005)
  – 100 eigenchannels for telephone speech (NIST 2004, 2005 telephone data)
  – 100 eigenchannels for microphone speech (NIST 2005 microphone data)
• ZT-norm – gender dependent
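The short-time Gaussianization step can be sketched as follows. This is a minimal pure-Python illustration, assuming a 300-frame window (3 s at a 10 ms shift) and simple tie handling; it is not the exact BUT implementation.

```python
from statistics import NormalDist

def gaussianize(coeffs, win=300):
    """Short-time Gaussianization sketch: replace each coefficient by
    the inverse Gaussian CDF of its rank within a surrounding window
    (300 frames is ~3 s at a 10 ms frame shift)."""
    inv_cdf = NormalDist().inv_cdf
    half = win // 2
    out = []
    for i, c in enumerate(coeffs):
        window = coeffs[max(0, i - half):i + half + 1]
        # average rank of c among the window values (ties split evenly)
        rank = sum(v < c for v in window) + 0.5 * sum(v == c for v in window)
        out.append(inv_cdf((rank + 0.5) / (len(window) + 1)))
    return out
```

The transform is monotone within each window, so it preserves the ordering of coefficient values while forcing their short-time distribution toward a standard normal.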
FA-MFCC2060 system
• The same as FA-MFCC1339 with the following differences:
  – 60-dimensional features: 19 MFCC + energy + deltas + double deltas (no HLDA)
  – Two gender-dependent Factor Analysis models
SVM – MLLR system
• Linear kernels
• Rank normalization
• LibSVM C++ library [Chang2001]
• Pre-computed Gram matrices
• Features are MLLR transformations adapting an LVCSR system (developed for the AMI project) to the speaker of a given speech segment
• Estimation of the MLLR transformations makes use of the ASR transcripts provided by NIST
SVM – MLLR system
• Cascade of CMLLR and MLLR
  – 2 CMLLR transformations (silence and speech)
  – 3 MLLR transformations (silence and 2 phoneme clusters)
• Silence transformations are discarded for SRE
• Supervector = 1 CMLLR + 2 MLLR = 3·39² + 3·39 = 4680 dimensions
• Impostors: NIST 2004 + microphone data from NIST 2005
• ZT-norm: speakers from NIST 2004
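The supervector dimensionality can be verified by stacking the transforms explicitly: each kept transform is a 39×39 matrix plus a 39-dimensional bias. The helper below is a toy sketch, not the submission's code.

```python
def stack_supervector(transforms):
    """Concatenate affine transforms (A, b) into a single supervector:
    each is a 39x39 matrix A plus a 39-dim bias b, so
    1 CMLLR + 2 MLLR gives 3*39**2 + 3*39 = 4680 numbers."""
    sv = []
    for A, b in transforms:
        for row in A:
            sv.extend(row)
        sv.extend(b)
    return sv

A = [[0.0] * 39 for _ in range(39)]   # dummy 39x39 transform matrix
b = [0.0] * 39                        # dummy bias vector
sv = stack_supervector([(A, b)] * 3)  # 1 CMLLR + 2 MLLR
assert len(sv) == 4680
```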
Side info based calibration and fusion

• Side information for each trial is given by its hard assignment to classes:
  – Trial channel condition provided by NIST: tel-tel, tel-mic, mic-tel, mic-mic
  – English/non-English decision given by our LID system
• Side information is used as follows:
  – For each system:
    • Split trials by channel condition and calibrate scores using linear logistic regression (LLR) in each split separately
    • Split trials according to the English/non-English decision and calibrate scores using LLR in each split separately
  – Fuse the calibrated scores of all subsystems using LLR without making use of any side information
• For convenience, the FoCal Bilinear toolkit by Niko Brummer was used, although we did not make use of its extensions over standard LLR.
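Per-condition LLR calibration can be sketched as fitting an affine score map s → a·s + b by logistic regression within each side-information split. This is a minimal gradient-descent illustration; FoCal additionally weights the objective by an effective target prior, which is omitted here.

```python
import math

def train_llr_calibration(scores, labels, lr=0.5, iters=2000):
    """Fit the affine map s -> a*s + b by logistic regression with
    plain gradient descent; labels are 1 (target) / 0 (non-target)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s / n
            grad_b += (p - y) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Split trials by side information first, then calibrate each split
# separately (toy scores and labels, for illustration only).
splits = {"tel-tel": ([2.0, 1.5, -1.0, -2.0], [1, 1, 0, 0])}
calib = {cond: train_llr_calibration(s, y) for cond, (s, y) in splits.items()}
```

At test time, each trial's score is passed through the (a, b) pair of its own condition before fusion.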
Side info based calibration and fusion – tel-tel trials

[DET plots: SRE 2006 (all trials, det1) and SRE 2008 (all trials, det6); curves: no side information (BUT02), channel cond. + language by NIST, channel cond. + language by LID (BUT01 – primary system)]

• Use of side information is helpful
• Some improvement can be obtained by relying on the language information provided by NIST instead of the more realistic LID system
Side info based calibration and fusion – mic-mic trials

[DET plots: SRE 2006 (trial list defined by MIT-LL) and SRE 2008 (det1); curves: no side information, channel cond. + language by NIST, channel cond. + language by LID]

• Use of side information allowed the same unchanged subsystems to be used for all channels
Subsystems and fusion – tel-tel trials

[DET plots: SRE 2006 (all trials, det1) and SRE 2008 (all trials, det6); curves: SVM-MLLR, FA-MFCC1339, FA-MFCC2060, fusion of 2 × FA, fusion of all 3]

• For tel-tel trials, the single FA-MFCC2060 system performs almost as well as the fusion
Subsystems and fusion – mic-mic trials

[DET plots: SRE 2006 (trial list defined by MIT-LL) and SRE 2008 (det1); curves: SVM-MLLR, FA-MFCC1339, FA-MFCC2060, fusion of 2 × FA, fusion of all 3]

• For microphone conditions, FA-MFCC2060 fails to perform well and fusion is beneficial.
• FA-MFCC1339 outperforms FA-MFCC2060, which has 3× more parameters and is possibly over-trained to the telephone data primarily used for FA model training.
Gender dependent vs. gender independent FA system

[DET plots: SRE 2006 tel-tel and SRE 2006 mic-mic; curves: FA-MFCC1339 GI, FA-MFCC2060 GD, FA-MFCC2060 GI]

• Halving the number of parameters of the FA-MFCC2060 system by making it gender independent degrades performance on the telephone condition and improves it on the microphone condition
Flavors of FA

The general model is M = m + vy + dz + ux, where m is the UBM mean supervector, v is the matrix of eigenvoices, d is a diagonal matrix, u is the matrix of eigenchannels, y and z are speaker-specific factors, and x are session-specific factors. All hyperparameters can be trained from data using EM.

• Relevance MAP adaptation: M = m + dz with d² = Σ/τ, where Σ is the diagonal matrix with the UBM variance supervector on its diagonal
• Eigenchannel adaptation (SDV, BUT):
  – Relevance MAP for enrolling the speaker model
  – Adapt the speaker model to the test utterance using eigenchannels estimated by PCA
• FA without eigenvoices, with d² = Σ/τ (QUT, LIA)
• FA without eigenvoices, with d trained from data (CRIM)
  – Can be seen as training a different τ for each supervector coefficient
  – Effective relevance factor: τ_ef = trace(Σ)/trace(d²)
• FA with eigenvoices (CRIM)
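The decomposition M = m + vy + dz + ux can be made concrete with toy dimensions. All numbers below are synthetic; the real system's supervector has 2048 × 39 dimensions, and the factors are estimated by EM rather than sampled.

```python
import random

random.seed(0)
SV, R_V, R_U = 12, 3, 2   # toy supervector dim, #eigenvoices, #eigenchannels

rnd = lambda n: [random.gauss(0.0, 1.0) for _ in range(n)]

m = rnd(SV)                        # UBM mean supervector
v = [rnd(SV) for _ in range(R_V)]  # eigenvoices (rows)
u = [rnd(SV) for _ in range(R_U)]  # eigenchannels (rows)
d = [0.1] * SV                     # diagonal of the residual matrix d

y = rnd(R_V)   # speaker factors
z = rnd(SV)    # residual speaker factors
x = rnd(R_U)   # session (channel) factors

# M = m + v*y + d*z + u*x, evaluated per supervector coefficient
M = [m[i]
     + sum(y[k] * v[k][i] for k in range(R_V))
     + d[i] * z[i]
     + sum(x[k] * u[k][i] for k in range(R_U))
     for i in range(SV)]
```

Setting v = 0 and u = 0 recovers relevance MAP; setting d = 0 and keeping v gives the eigenvoice model.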
Flavors of FA – results

[DET plots on SRE 2006 (all trials, det1) for MFCC1339 and MFCC2060 features; curves: eigenchannel adaptation, FA with d² = Σ/τ, FA with d trained on data, FA with eigenvoices; τ_ef = 236.1 for MFCC1339, τ_ef = 81.2 for MFCC2060]

• Without eigenvoices, simple eigenchannel adaptation seems to be more robust than FA.
• FA with trained d fails for MFCC1339 features. Too high τ_ef? Caused by HLDA?
• FA with eigenvoices significantly outperforms the other FA configurations.
Sensitivity of FA to number of eigenchannels

[DET plots on MFCC2060 features; curves: eigenchannel adaptation, FA with d trained on data, FA with eigenvoices; 50 vs. 100 eigenchannels]

• FA systems without eigenvoices do not seem able to robustly estimate an increased number of eigenchannels
• However, once the speaker variability is explained by eigenvoices, we benefit significantly from more eigenchannels
Importance of ZT-norm

• ZT-norm is essential for good performance of FA with eigenvoices.

[DET plots on MFCC1339 features]
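ZT-norm combines z-norm (normalizing a score by the target model's score distribution over impostor utterances) with t-norm applied on top (normalizing against z-normed scores of cohort models on the same test utterance). The sketch below is a minimal illustration, not the exact BUT implementation.

```python
from statistics import mean, stdev

def znorm(raw, impostor_scores):
    """Z-norm: normalize a raw score by the distribution of the target
    model's scores against a set of impostor utterances."""
    return (raw - mean(impostor_scores)) / stdev(impostor_scores)

def ztnorm(raw, model_impostor_scores, cohort_raw, cohort_impostor_scores):
    """ZT-norm: z-norm the trial score, then t-norm it against the
    z-normed scores of cohort models on the same test utterance."""
    z = znorm(raw, model_impostor_scores)
    cohort_z = [znorm(s, imp)
                for s, imp in zip(cohort_raw, cohort_impostor_scores)]
    return (z - mean(cohort_z)) / stdev(cohort_z)
```

The model-side statistics can be precomputed at enrollment; the cohort-side statistics are computed per test utterance.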
Training eigenchannels for different channel conditions

[DET plots for FA-MFCC1339; SRE 2006 trials: tel-tel (det1), tel-mic, mic-tel, mic-mic; SRE 2008 trials: tel-tel (det6), tel-mic (det5), mic-tel (det4), mic-mic (det1); 100 EC trained on SRE04 and SRE05 telephone data + 100 EC trained on SRE05 microphone data]

• Negligible degradation on the tel-tel condition and a huge improvement, particularly on the mic-mic condition, is obtained after adding eigenchannels trained on microphone data to those trained on telephone data.
Training additional eigenchannels on SRE08 dev data

[DET plots for FA-MFCC1339; SRE 2008 trials: tel-tel (det6), tel-mic (det5), mic-tel (det4), mic-mic (det1); 100 EC trained on SRE04 and SRE05 telephone data + 100 EC trained on SRE05 microphone data + 20 EC trained on SRE08 dev data]

• Significant improvement is obtained on the microphone condition for eval data after adding eigenchannels trained on SRE08 dev data (all spontaneous speech from the 6 interviewees).
Training additional eigenchannels on SRE08 dev data – primary fusion

[Results of the primary submission + 20 EC trained on SRE08 dev data over conditions det1: int-int, det2: int-int same mic, det3: int-int diff. mic, det4: int-phn, det5: phn-mic, det6: tel-tel, det7: tel-tel Eng., det8: tel-tel native]

• 20 eigenchannels trained on SRE08 dev data are added to the two FA subsystems
• The improvement also generalizes to non-interview microphone data (det5)
Other techniques that did not make it to the submission

• The following techniques were also tried but did not improve the fusion performance:
  – GMM with eigenchannel adaptation
  – SVM-GMM 2048 + NAP
  – SVM-GMM 2048 + ISV (Inter-Session Variability modeling)
  – SVM-GMM 2048 + ISV derivative (Fisher) kernel
  – SVM-GMM 2048 + ISV based on FA-MFCC1339
  – FA modeling prosodic and cepstral contours
  – SVM on phonotactics – counts from binary decision trees
  – SVM on soft bigram statistics collected on accumulated posteriograms (a matrix of posterior probabilities of phonemes for each frame)
• See our poster for more details
Conclusions

• FA systems built according to the recipe from [Kenny2008] perform excellently, though there is still some mystery to be solved.
• It was hard to find another complementary system that would contribute to the fusion of our two FA systems.
• Although our system was primarily trained on and tuned for telephone data, the FA subsystems can simply be augmented with eigenchannels trained on microphone data (as also proposed in [Kenny2008]), which makes the system perform well on microphone conditions too.
• Another significant improvement was obtained by training additional eigenchannels on data with matching channel conditions, even though there was a very limited amount of such data provided by NIST.
Thanks

• To Patrick Kenny for the [Kenny2008] recipe for building an FA system that really works, and for providing the list of files for training the FA system
• To MIT-LL for creating and sharing the trial lists based on SRE06 data, which we used for system development
• To Niko Brummer for FoCal Bilinear, which allowed us to start playing with the fusion just the last day before the submission deadline
References
[Kenny2008] P. Kenny et al.: A Study of Inter-Speaker Variability in Speaker Verification, IEEE TASLP, July 2008.
[Brummer2008] N. Brummer: FoCal Bilinear: Tools for Detector Fusion and Calibration, with Use of Side-Information, http://niko.brummer.googlepages.com/focalbilinear
[Chang2001] C. Chang et al.: LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm
[Stolcke2005/6] A. Stolcke et al.: MLLR Transforms as Features in Speaker Recognition, Eurospeech 2005 and Odyssey 2006.
[Hain2005] T. Hain et al.: The 2005 AMI System for the Transcription of Speech in Meetings, Meeting Recognition Evaluation Workshop, Edinburgh, July 2005.
Inter-session variability

[Figure: UBM and target speaker model, with the directions of high inter-session and high inter-speaker variability indicated]
Inter-session compensation

[Figure: UBM, target speaker model, and test data, with the directions of high inter-session and high inter-speaker variability indicated]

For recognition, move both models along the high inter-session variability direction(s) to fit the test data well.
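Shifting a model along the inter-session directions to fit the test data can be sketched as a least-squares projection, assuming orthonormal eigenchannel directions. In the real system the channel factors are estimated by MAP from the UBM occupation statistics; this is only an illustration.

```python
def compensate(model, test_mean, directions):
    """Shift a model supervector along inter-session directions so it
    best fits the test data: least-squares projection of the
    model-to-test offset onto each (assumed orthonormal) direction."""
    shifted = list(model)
    for u in directions:
        # coefficient of the model-to-test offset along this direction
        x = sum((t - m) * ui for t, m, ui in zip(test_mean, model, u))
        shifted = [s + x * ui for s, ui in zip(shifted, u)]
    return shifted
```

Applying the same shift to both the target model and the UBM keeps the likelihood-ratio comparison fair, since only channel variability, not speaker identity, is absorbed.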
LVCSR for SVM-MLLR system

• The LVCSR system is adapted to the speaker (the VTLN factor and (C)MLLR transformations are estimated) using the ASR transcriptions provided by NIST
• AMI 2005(6) LVCSR – a state-of-the-art system for the recognition of spontaneous speech [Hain2005]:
  – 50k-word dictionary (pronunciations of OOVs were generated by grapheme-to-phoneme conversion based on rules trained from data)
  – PLP features, HLDA
  – CD-HMM with 7500 tied states, each modeled by 18 Gaussians
  – Discriminatively trained using MPE
  – Adapted to the speaker: VTLN, SAT based on CMLLR, MLLR