Brno University Of Technology Speech@FIT
Lukáš Burget, Michal Fapšo, Valiantsina Hubeika, Ondřej Glembek, Martin Karafiát, Marcel Kockmann, Pavel Matějka, Petr Schwarz and Honza Černocký
NIST Speaker Recognition Workshop 2008
MOBIO
NIST SRE 2008 2/24
Outline
• Submitted systems
• Factor Analysis systems
• SVM-MLLR system
• Side information based calibration and fusion
• Contribution of subsystems in fusion
• Analysis of FA system
  – Gender dependent vs. independent
  – Flavors of FA system
  – Sensitivity to number of eigenchannels
  – Importance of ZT-norm
  – Optimization for microphone data
• Techniques that did not make it to the submission
• Conclusion
Submitted systems
• BUT01 – primary (3 systems)
  – Channel and language side information in fusion
  – FA-MFCC1339
  – FA-MFCC2060
  – SVM-MLLR
• BUT02 (3 systems)
  – The same as BUT01, but no side information in fusion
• BUT03 (2 systems)
  – Channel and language side information in fusion
  – FA-MFCC1339
  – FA-MFCC2060
FA-MFCC1339 system
• MAP-adapted UBM with 2048 Gaussian components
  – Single UBM trained on Switchboard and NIST 2004, 2005 data
• 12 MFCC + C0 (20 ms window, 10 ms shift)
• Short-time Gaussianization
  – The rank of the current frame coefficient within a 3 s window is transformed by the inverse Gaussian cumulative distribution function.
• Delta + double-delta + triple-delta coefficients
  – Together 52 coefficients, 12 frames of context
• HLDA (dimensionality reduction from 52 to 39)
• Factor Analysis model – gender independent
  – 300 eigenvoices (Switchboard, NIST 2004, 2005)
  – 100 eigenchannels for telephone speech (NIST 2004, 2005 telephone data)
  – 100 eigenchannels for microphone speech (NIST 2005 microphone data)
• ZT-norm – gender dependent
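The short-time Gaussianization step can be sketched as follows. This is a minimal pure-Python illustration, assuming a 300-frame window (3 s at a 10 ms shift) and simple tie handling; it is not the exact BUT implementation.

```python
from statistics import NormalDist

def gaussianize(coeffs, win=300):
    """Short-time Gaussianization sketch: replace each coefficient by
    the inverse Gaussian CDF of its rank within a surrounding window
    (300 frames is ~3 s at a 10 ms frame shift)."""
    inv_cdf = NormalDist().inv_cdf
    half = win // 2
    out = []
    for i, c in enumerate(coeffs):
        window = coeffs[max(0, i - half):i + half + 1]
        # average rank of c among the window values (ties split evenly)
        rank = sum(v < c for v in window) + 0.5 * sum(v == c for v in window)
        out.append(inv_cdf((rank + 0.5) / (len(window) + 1)))
    return out
```

The transform is monotone within each window, so it preserves the ordering of coefficient values while forcing their short-time distribution toward a standard normal.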
FA-MFCC2060 system
• The same as FA-MFCC1339 with the following differences:
  – 60-dimensional features: 19 MFCC + energy + deltas + double deltas (no HLDA)
  – Two gender-dependent Factor Analysis models
SVM – MLLR system
• Linear kernels
• Rank normalization
• LibSVM C++ library [Chang2001]
• Pre-computed Gram matrices
• Features are MLLR transformations adapting an LVCSR system (developed for the AMI project) to the speaker of a given speech segment
• Estimation of the MLLR transformations makes use of the ASR transcripts provided by NIST
SVM – MLLR system
• Cascade of CMLLR and MLLR
  – 2 CMLLR transformations (silence and speech)
  – 3 MLLR transformations (silence and 2 phoneme clusters)
• Silence transformations are discarded for SRE
• Supervector = 1 CMLLR + 2 MLLR = 3·39² + 3·39 = 4680 dimensions
• Impostors: NIST 2004 + microphone data from NIST 2005
• ZT-norm: speakers from NIST 2004
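The supervector dimensionality can be verified by stacking the transforms explicitly: each kept transform is a 39×39 matrix plus a 39-dimensional bias. The helper below is a toy sketch, not the submission's code.

```python
def stack_supervector(transforms):
    """Concatenate affine transforms (A, b) into a single supervector:
    each is a 39x39 matrix A plus a 39-dim bias b, so
    1 CMLLR + 2 MLLR gives 3*39**2 + 3*39 = 4680 numbers."""
    sv = []
    for A, b in transforms:
        for row in A:
            sv.extend(row)
        sv.extend(b)
    return sv

A = [[0.0] * 39 for _ in range(39)]   # dummy 39x39 transform matrix
b = [0.0] * 39                        # dummy bias vector
sv = stack_supervector([(A, b)] * 3)  # 1 CMLLR + 2 MLLR
assert len(sv) == 4680
```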
Side info based calibration and fusion

• Side information for each trial is given by its hard assignment to classes:
  – Trial channel condition provided by NIST: tel-tel, tel-mic, mic-tel, mic-mic
  – English/non-English decision given by our LID system
• Side information is used as follows:
  – For each system:
    • Split trials by channel condition and calibrate scores using linear logistic regression (LLR) in each split separately
    • Split trials according to the English/non-English decision and calibrate scores using LLR in each split separately
  – Fuse the calibrated scores of all subsystems using LLR without making use of any side information
• For convenience, the FoCal Bilinear toolkit by Niko Brummer was used, although we did not make use of its extensions over standard LLR.
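Per-condition LLR calibration can be sketched as fitting an affine score map s → a·s + b by logistic regression within each side-information split. This is a minimal gradient-descent illustration; FoCal additionally weights the objective by an effective target prior, which is omitted here.

```python
import math

def train_llr_calibration(scores, labels, lr=0.5, iters=2000):
    """Fit the affine map s -> a*s + b by logistic regression with
    plain gradient descent; labels are 1 (target) / 0 (non-target)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s / n
            grad_b += (p - y) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Split trials by side information first, then calibrate each split
# separately (toy scores and labels, for illustration only).
splits = {"tel-tel": ([2.0, 1.5, -1.0, -2.0], [1, 1, 0, 0])}
calib = {cond: train_llr_calibration(s, y) for cond, (s, y) in splits.items()}
```

At test time, each trial's score is passed through the (a, b) pair of its own condition before fusion.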
Side info based calibration and fusion – tel-tel trials

[DET plots: SRE 2006 (all trials, det1) and SRE 2008 (all trials, det6); curves: no side information (BUT02), channel cond. + language by NIST, channel cond. + language by LID (BUT01 – primary system)]

• Use of side information is helpful
• Some improvement can be obtained by relying on the language information provided by NIST instead of the more realistic LID system
Side info based calibration and fusion – mic-mic trials

[DET plots: SRE 2006 (trial list defined by MIT-LL) and SRE 2008 (det1); curves: no side information, channel cond. + language by NIST, channel cond. + language by LID]

• Use of side information allowed the same unchanged subsystems to be used for all channels
Subsystems and fusion – tel-tel trials

[DET plots: SRE 2006 (all trials, det1) and SRE 2008 (all trials, det6); curves: SVM-MLLR, FA-MFCC1339, FA-MFCC2060, fusion of 2 × FA, fusion of all 3]

• For tel-tel trials, the single FA-MFCC2060 system performs almost as well as the fusion
Subsystems and fusion – mic-mic trials

[DET plots: SRE 2006 (trial list defined by MIT-LL) and SRE 2008 (det1); curves: SVM-MLLR, FA-MFCC1339, FA-MFCC2060, fusion of 2 × FA, fusion of all 3]

• For microphone conditions, FA-MFCC2060 fails to perform well and fusion is beneficial.
• FA-MFCC1339 outperforms FA-MFCC2060, which has 3× more parameters and is possibly over-trained to the telephone data primarily used for FA model training.
Gender dependent vs. gender independent FA system

[DET plots: SRE 2006 tel-tel and SRE 2006 mic-mic; curves: FA-MFCC1339 GI, FA-MFCC2060 GD, FA-MFCC2060 GI]

• Halving the number of parameters of the FA-MFCC2060 system by making it gender independent degrades performance on the telephone condition and improves it on the microphone condition
Flavors of FA

The general model is M = m + vy + dz + ux, where m is the UBM mean supervector, v is the matrix of eigenvoices, d is a diagonal matrix, u is the matrix of eigenchannels, y and z are speaker-specific factors, and x are session-specific factors. All hyperparameters can be trained from data using EM.

• Relevance MAP adaptation: M = m + dz with d² = Σ/τ, where Σ is the diagonal matrix with the UBM variance supervector on its diagonal
• Eigenchannel adaptation (SDV, BUT):
  – Relevance MAP for enrolling the speaker model
  – Adapt the speaker model to the test utterance using eigenchannels estimated by PCA
• FA without eigenvoices, with d² = Σ/τ (QUT, LIA)
• FA without eigenvoices, with d trained from data (CRIM)
  – Can be seen as training a different τ for each supervector coefficient
  – Effective relevance factor: τ_ef = trace(Σ)/trace(d²)
• FA with eigenvoices (CRIM)
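The decomposition M = m + vy + dz + ux can be made concrete with toy dimensions. All numbers below are synthetic; the real system's supervector has 2048 × 39 dimensions, and the factors are estimated by EM rather than sampled.

```python
import random

random.seed(0)
SV, R_V, R_U = 12, 3, 2   # toy supervector dim, #eigenvoices, #eigenchannels

rnd = lambda n: [random.gauss(0.0, 1.0) for _ in range(n)]

m = rnd(SV)                        # UBM mean supervector
v = [rnd(SV) for _ in range(R_V)]  # eigenvoices (rows)
u = [rnd(SV) for _ in range(R_U)]  # eigenchannels (rows)
d = [0.1] * SV                     # diagonal of the residual matrix d

y = rnd(R_V)   # speaker factors
z = rnd(SV)    # residual speaker factors
x = rnd(R_U)   # session (channel) factors

# M = m + v*y + d*z + u*x, evaluated per supervector coefficient
M = [m[i]
     + sum(y[k] * v[k][i] for k in range(R_V))
     + d[i] * z[i]
     + sum(x[k] * u[k][i] for k in range(R_U))
     for i in range(SV)]
```

Setting v = 0 and u = 0 recovers relevance MAP; setting d = 0 and keeping v gives the eigenvoice model.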
Flavors of FA – results

[DET plots on SRE 2006 (all trials, det1) for MFCC1339 and MFCC2060 features; curves: eigenchannel adaptation, FA with d² = Σ/τ, FA with d trained on data, FA with eigenvoices; τ_ef = 236.1 for MFCC1339, τ_ef = 81.2 for MFCC2060]

• Without eigenvoices, simple eigenchannel adaptation seems to be more robust than FA.
• FA with trained d fails for MFCC1339 features. Too high τ_ef? Caused by HLDA?
• FA with eigenvoices significantly outperforms the other FA configurations.
Sensitivity of FA to number of eigenchannels

[DET plots on MFCC2060 features; curves: eigenchannel adaptation, FA with d trained on data, FA with eigenvoices; 50 vs. 100 eigenchannels]

• FA systems without eigenvoices do not seem able to robustly estimate an increased number of eigenchannels
• However, once the speaker variability is explained by eigenvoices, we benefit significantly from more eigenchannels
Importance of ZT-norm

• ZT-norm is essential for good performance of FA with eigenvoices.

[DET plots on MFCC1339 features]
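ZT-norm combines z-norm (normalizing a score by the target model's score distribution over impostor utterances) with t-norm applied on top (normalizing against z-normed scores of cohort models on the same test utterance). The sketch below is a minimal illustration, not the exact BUT implementation.

```python
from statistics import mean, stdev

def znorm(raw, impostor_scores):
    """Z-norm: normalize a raw score by the distribution of the target
    model's scores against a set of impostor utterances."""
    return (raw - mean(impostor_scores)) / stdev(impostor_scores)

def ztnorm(raw, model_impostor_scores, cohort_raw, cohort_impostor_scores):
    """ZT-norm: z-norm the trial score, then t-norm it against the
    z-normed scores of cohort models on the same test utterance."""
    z = znorm(raw, model_impostor_scores)
    cohort_z = [znorm(s, imp)
                for s, imp in zip(cohort_raw, cohort_impostor_scores)]
    return (z - mean(cohort_z)) / stdev(cohort_z)
```

The model-side statistics can be precomputed at enrollment; the cohort-side statistics are computed per test utterance.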
Training eigenchannels for different channel conditions

[DET plots for FA-MFCC1339; SRE 2006 trials: tel-tel (det1), tel-mic, mic-tel, mic-mic; SRE 2008 trials: tel-tel (det6), tel-mic (det5), mic-tel (det4), mic-mic (det1); 100 EC trained on SRE04 and SRE05 telephone data + 100 EC trained on SRE05 microphone data]

• Negligible degradation on the tel-tel condition and a huge improvement, particularly on the mic-mic condition, is obtained after adding eigenchannels trained on microphone data to those trained on telephone data.
Training additional eigenchannels on SRE08 dev data

[DET plots for FA-MFCC1339; SRE 2008 trials: tel-tel (det6), tel-mic (det5), mic-tel (det4), mic-mic (det1); 100 EC trained on SRE04 and SRE05 telephone data + 100 EC trained on SRE05 microphone data + 20 EC trained on SRE08 dev data]

• Significant improvement is obtained on the microphone condition for eval data after adding eigenchannels trained on SRE08 dev data (all spontaneous speech from the 6 interviewees).
Training additional eigenchannels on SRE08 dev data – primary fusion

[Results of the primary submission + 20 EC trained on SRE08 dev data over conditions det1: int-int, det2: int-int same mic, det3: int-int diff. mic, det4: int-phn, det5: phn-mic, det6: tel-tel, det7: tel-tel Eng., det8: tel-tel native]

• 20 eigenchannels trained on SRE08 dev data are added to the two FA subsystems
• The improvement also generalizes to non-interview microphone data (det5)
Other techniques that did not make it to the submission

• The following techniques were also tried but did not improve the fusion performance:
  – GMM with eigenchannel adaptation
  – SVM-GMM 2048 + NAP
  – SVM-GMM 2048 + ISV (Inter-Session Variability modeling)
  – SVM-GMM 2048 + ISV derivative (Fisher) kernel
  – SVM-GMM 2048 + ISV based on FA-MFCC1339
  – FA modeling prosodic and cepstral contours
  – SVM on phonotactics – counts from binary decision trees
  – SVM on soft bigram statistics collected on accumulated posteriograms (a matrix of posterior probabilities of phonemes for each frame)
• See our poster for more details
Conclusions

• FA systems built according to the recipe from [Kenny2008] perform excellently, though there is still some mystery to be solved.
• It was hard to find another complementary system that would contribute to the fusion of our two FA systems.
• Although our system was primarily trained on and tuned for telephone data, the FA subsystems can simply be augmented with eigenchannels trained on microphone data (as also proposed in [Kenny2008]), which makes the system perform well on microphone conditions too.
• Another significant improvement was obtained by training additional eigenchannels on data with matching channel conditions, even though there was a very limited amount of such data provided by NIST.
Thanks

• To Patrick Kenny for the [Kenny2008] recipe for building an FA system that really works, and for providing the list of files for training the FA system
• To MIT-LL for creating and sharing the trial lists based on SRE06 data, which we used for system development
• To Niko Brummer for FoCal Bilinear, which allowed us to start playing with the fusion just the last day before the submission deadline
References
[Kenny2008] P. Kenny et al.: A Study of Inter-Speaker Variability in Speaker Verification, IEEE TASLP, July 2008.
[Brummer2008] N. Brummer: FoCal Bilinear: Tools for Detector Fusion and Calibration, with Use of Side-Information, http://niko.brummer.googlepages.com/focalbilinear
[Chang2001] C. Chang et al.: LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm
[Stolcke2005/6] A. Stolcke et al.: MLLR Transforms as Features in Speaker Recognition, Eurospeech 2005 and Odyssey 2006.
[Hain2005] T. Hain et al.: The 2005 AMI System for the Transcription of Speech in Meetings, Meeting Recognition Evaluation Workshop, Edinburgh, July 2005.
Inter-session variability

[Figure: UBM and target speaker model, with the directions of high inter-session and high inter-speaker variability indicated]
Inter-session compensation

[Figure: UBM, target speaker model, and test data, with the directions of high inter-session and high inter-speaker variability indicated]

For recognition, move both models along the high inter-session variability direction(s) to fit the test data well.
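Shifting a model along the inter-session directions to fit the test data can be sketched as a least-squares projection, assuming orthonormal eigenchannel directions. In the real system the channel factors are estimated by MAP from the UBM occupation statistics; this is only an illustration.

```python
def compensate(model, test_mean, directions):
    """Shift a model supervector along inter-session directions so it
    best fits the test data: least-squares projection of the
    model-to-test offset onto each (assumed orthonormal) direction."""
    shifted = list(model)
    for u in directions:
        # coefficient of the model-to-test offset along this direction
        x = sum((t - m) * ui for t, m, ui in zip(test_mean, model, u))
        shifted = [s + x * ui for s, ui in zip(shifted, u)]
    return shifted
```

Applying the same shift to both the target model and the UBM keeps the likelihood-ratio comparison fair, since only channel variability, not speaker identity, is absorbed.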
LVCSR for SVM-MLLR system

• The LVCSR system is adapted to the speaker (the VTLN factor and (C)MLLR transformations are estimated) using the ASR transcriptions provided by NIST
• AMI 2005(6) LVCSR – a state-of-the-art system for the recognition of spontaneous speech [Hain2005]:
  – 50k-word dictionary (pronunciations of OOVs were generated by grapheme-to-phoneme conversion based on rules trained from data)
  – PLP features, HLDA
  – CD-HMM with 7500 tied states, each modeled by 18 Gaussians
  – Discriminatively trained using MPE
  – Adapted to the speaker: VTLN, SAT based on CMLLR, MLLR