Standalone Training and Context-Independent Initialisations of Context-Dependent Deep Neural Networks
Chao Zhang & Phil Woodland
University of Cambridge
20 May 2013
Improving Standard CD-DNN Training
• Std. CD-DNN-HMM training relies on GMM-HMMs in two ways:
  ◦ Training labels — state-to-frame alignments
  ◦ Tied CD state targets — GMM-HMM based decision tree state tying
• Can we build CD-DNN-HMMs independently of GMM-HMMs?
• Training CD-DNN-HMMs independently from any GMM-HMMs: Standalone training
  ◦ Alignments — by CI-DNN-HMMs trained in a standalone fashion (see the sketch after this list)
    • Training started with a flat start
    • Refine initial alignments in an iterative fashion
    • Train CI-DNN-HMMs using discriminative pre-training with realignment and std. fine-tuning
  ◦ Targets — by DNN-HMM based decision tree target clustering
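A rough sketch of this interleaved parameter/label update loop follows; flat_start_align, train_one_layer, realign and fine_tune are hypothetical placeholder helpers, not QuickNet or HTK functions:

```python
# Sketch of standalone CI-DNN-HMM training: model parameters and
# reference labels are updated in an interleaved fashion, with no
# GMM-HMM involved at any point.

def standalone_ci_training(features, transcripts, n_hidden_layers=5):
    # Flat start: segment each utterance uniformly to obtain initial
    # CI state-to-frame alignments.
    labels = flat_start_align(features, transcripts)

    model = None
    for _ in range(n_hidden_layers):
        # Discriminative pre-training: add one hidden layer and train
        # it against the current reference labels.
        model = train_one_layer(model, features, labels)
        # Realignment: refine the reference labels with the improved model.
        labels = realign(model, features, transcripts)

    # Standard fine-tuning of the full network on the final alignments.
    return fine_tune(model, features, labels)
```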
DNN-HMM based Target Clustering
• Assume the output distribution for each target is Gaussian with a common covariance matrix, i.e., p(z|C_k) = N(z; μ_k, Σ); then

$$p(C_k \mid z) = \frac{\exp\{\mu_k^T \Sigma^{-1} z - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \ln P(C_k)\}}{\sum_{k'} \exp\{\mu_{k'}^T \Sigma^{-1} z - \frac{1}{2}\mu_{k'}^T \Sigma^{-1} \mu_{k'} + \ln P(C_{k'})\}}$$
• According to the softmax output activation function,

$$p(C_k \mid z) = \frac{\exp\{w_k^T z + b_k\}}{\sum_{k'} \exp\{w_{k'}^T z + b_{k'}\}}$$
• We can therefore convert the Gaussians into DNN output layer parameters by matching terms: $w_k = \Sigma^{-1}\mu_k$ and $b_k = -\frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln P(C_k)$
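A minimal numerical check of this conversion in NumPy; the helper name gaussians_to_output_layer and the random test data are illustrative only, not part of the original system:

```python
import numpy as np

def gaussians_to_output_layer(means, cov, priors):
    """Map class Gaussians N(z; mu_k, Sigma) with a shared covariance to
    softmax parameters: w_k = Sigma^{-1} mu_k and
    b_k = -0.5 mu_k^T Sigma^{-1} mu_k + ln P(C_k)."""
    W = means @ np.linalg.inv(cov)   # row k holds w_k^T (Sigma^{-1} symmetric)
    b = -0.5 * np.einsum('kd,kd->k', W, means) + np.log(priors)
    return W, b

# Verify on random data that the converted softmax reproduces the Gaussian
# class posterior (the shared -0.5 z^T Sigma^{-1} z term cancels out).
rng = np.random.default_rng(0)
K, D = 4, 3
means = rng.normal(size=(K, D))
A = rng.normal(size=(D, D))
cov = A @ A.T + D * np.eye(D)        # SPD shared covariance
priors = np.full(K, 1.0 / K)

W, b = gaussians_to_output_layer(means, cov, priors)
z = rng.normal(size=D)
logits = W @ z + b
softmax = np.exp(logits - logits.max())
softmax /= softmax.sum()

quad = np.array([-0.5 * (z - m) @ np.linalg.solve(cov, z - m) for m in means])
posterior = np.exp(quad) * priors
posterior /= posterior.sum()
assert np.allclose(softmax, posterior)
```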
Procedure of Building CD-DNN-HMMs
[Diagram: procedure of building CD-DNN-HMMs]
Experiments
• Training set: Wall Street Journal training set WSJ0+1 (SI-284)
• Testing sets: 1994 H1-dev (Dev) and Nov'94 H1-eval (Eval)
  ◦ 65k dictionary and trigram LM
• MPE GMM-HMMs: ((13PLP)_D_A_T_Z)_HLDA features; 5981 states, 12 Gaussians/state
• DNN models were trained and tested using an extended version of QuickNet
  ◦ Cross-entropy criterion, sigmoid/softmax hidden/output activation functions
• DNN-HMMs: 9 × (13PLP)_D_A_Z input (see the frame-stacking sketch below); 5 × 1000 hidden layers; 6000 output targets
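The 351-dimensional input quoted in the table captions below comes from stacking 9 consecutive frames of 39-dimensional features (13 PLP plus deltas and accelerations, 9 × 39 = 351). A minimal sketch of such frame stacking; the helper name and the edge-replication padding are assumptions:

```python
import numpy as np

def stack_context(frames, context=4):
    """Stack each frame with its +/-context neighbours (edge frames
    replicated), turning a (T, 39) feature sequence into (T, 351)
    DNN inputs, since 9 x 39 = 351."""
    T, _ = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

feats = np.random.randn(100, 39)   # e.g. 13 PLP + deltas + accelerations
inputs = stack_context(feats)
print(inputs.shape)                # (100, 351)
```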
CI-DNN-HMM Results
ID  Type          DNN Alignments  Dev WER%  Eval WER%
G2  MPE GMM-HMMs  —               8.0       8.7
I1  CI-DNN-HMMs   G2              10.5      12.0

Baseline GMM-HMM and CI-DNN-HMM Results (351 × 1000^5 × 138).
ID  Training Route             Dev WER%  Eval WER%
I3  Realigned                  12.2      14.3
I4  Realigned+Conventional     11.7      13.8
I5  Conventional               12.2      15.0
I6  Conventional+Conventional  12.0      14.6

Different CI-DNN-HMMs trained in a standalone fashion.
CD-DNN-HMM Results
• Baseline CD-DNN-HMMs (D1) were trained with G2 alignments. The WERs on Dev and Eval were 6.7% and 8.0%, respectively
• CD-DNN-HMMs with different clustered targets are listed in the table below. The hidden layers and alignments were from I4
ID  Clustering  BP Layers    Dev WER%  Eval WER%
G3  GMM-HMM     Final Layer  7.6       9.0
G4  GMM-HMM     All Layers   6.8       7.9
D2  DNN-HMM     Final Layer  7.7       8.7
D3  DNN-HMM     All Layers   6.8       7.8

CD-DNN-HMM based state tying results (351 × 1000^5 × 6000).
• The CD-DNN-HMM (D3) trained without relying on any GMM-HMMs is comparable to the baseline D1
Conclusion of Standalone Training
• Accomplished training of CD-DNN-HMMs without relying on any pre-existing system
  ◦ train CI-DNN-HMMs by updating the model parameters and the reference labels in an interleaved fashion
  ◦ decision tree tying in the sigmoidal activation vector space of the CI-DNN
• The experiments on WSJ SI-284 have shown
  ◦ the proposed training procedure gives comparable performance
  ◦ the methods are very efficient
CI Discriminative Pre-training of CD-DNNs
• Weakness of standard CD-DNN pre-training:
  ◦ RBM based generative pre-training
    • Weight values are not directly optimised for classification purposes
    • Usually uses different settings from fine-tuning
  ◦ Traditional (CD) discriminative pre-training
    • Lower layers are over-specific to a particular set of CD states: not generic enough for modelling low-level acoustic features
    • Training speed can be very slow when the target set is big
• Propose CI discriminative pre-training (see the sketch after this list)
  ◦ Initialise CD-DNNs with parameters discriminatively trained for classifying CI states
    • Improves CD-DNN performance
    • Can be much faster than CD discriminative pre-training
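A minimal sketch of this initialisation under stated assumptions: build_dnn and train are hypothetical helpers (not QuickNet or HTK APIs), and the layer sizes follow the WSJ configurations above:

```python
# Sketch of CI discriminative pre-training: hidden layers are first
# trained to classify the 138 CI states, then reused under a fresh
# CD output layer before standard CD fine-tuning.

def ci_discriminative_pretrain(features, ci_labels, cd_labels,
                               n_ci=138, n_cd=6000):
    # Discriminatively train the hidden layers against CI state targets
    # (optionally including CI fine-tuning, cf. the daggered systems below).
    ci_dnn = build_dnn(input_dim=351, hidden=[1000] * 5, outputs=n_ci)
    ci_dnn = train(ci_dnn, features, ci_labels)

    # Keep the (now generic) hidden layers; replace only the output layer
    # with a freshly initialised n_cd-way CD softmax layer.
    cd_dnn = build_dnn(input_dim=351, hidden=[1000] * 5, outputs=n_cd)
    cd_dnn.hidden_layers = ci_dnn.hidden_layers

    # Standard CD fine-tuning of the whole network.
    return train(cd_dnn, features, cd_labels)
```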
CI Discriminative Pre-training
[Diagram: CI discriminative pre-training procedure]
Experiments
• All resulting DNNs were evaluated as hybrid acoustic models
• Training sets: WSJ0 (SI-84) and WSJ0+1 (SI-284)
• WSJ0 MPE GMM-HMMs: 3007 tied states; 8 Gaussians per state
• WSJ0 CI-/CD-DNN structures: 351 × 1000^5 × 138/3007
• The remaining configuration was the same as before
• All experiments were conducted using an extended version of HTK,which supports DNNs
WSJ0 DNN-HMM Performance
ID   Pre-training        Dev WER%  Eval WER%  CI State CV Acc%
S01  Discriminative      14.6      16.6       67.2
S02  Generative          9.4       10.9       68.9
S03  CD Discriminative   9.6       11.3       68.7
S04  CI Discriminative†  8.9       10.3       69.7
S05  CI Discriminative   8.4       10.0       70.2

WSJ0 DNN-HMM system results. † means CI-DNN fine-tuning is not included. S01 is a CI model; S02-S05 are CD models.
• S05 vs S02: 9.4% relative WER reduction (computed as in the sketch below)
• S05 vs S03: 12.0% relative WER reduction
• S04 vs S02 (same number of epochs): 5.7% relative WER reduction
• S04 vs S03 (same number of epochs): 8.1% relative WER reduction
• CI state accuracies are consistent with the WERs
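For reference, these figures follow the standard relative-reduction formula; the deck does not state its exact averaging or rounding convention, so treating each number as the Dev/Eval average is an assumption, though it reproduces the S05 vs S02 figure:

```python
def rel_reduction(baseline, new):
    # Relative WER reduction in percent: (baseline - new) / baseline.
    return 100.0 * (baseline - new) / baseline

# S05 vs S02, averaged over Dev and Eval (assumed convention):
dev = rel_reduction(9.4, 8.4)     # ~10.6%
evl = rel_reduction(10.9, 10.0)   # ~8.3%
print(round((dev + evl) / 2, 1))  # 9.4, matching the figure quoted above
```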
WSJ0+1 DNN-HMM Performance
ID   Pre-training        Dev WER%  Eval WER%  CI State CV Acc%
S11  Discriminative      11.1      12.6       70.5
S12  Generative          6.9       8.1        73.4
S13  CD Discriminative   6.7       8.1        72.5
S14  CI Discriminative†  6.3       7.4        73.4
S15  CI Discriminative   6.3       7.4        72.9

WSJ0+1 DNN-HMM system results. † means CI-DNN fine-tuning is not included in pre-training. S11 is a CI model; S12-S15 are CD models.
• S14 vs S12: 8.7% relative WER reduction
• S15 vs S13: 7.4% relative WER reduction
• If sufficient data are available, CI-DNN fine-tuning is less important
• S14 pre-training is 5 times faster than S13 pre-training (on a single K20c GPU)
Conclusion of CI-DNN Pre-training
• We introduce an alternative discriminative pre-training methodthat intialises CD-DNNs using a DNN with context independentstate targets◦ Resulting CD-DNN hybrid systems reduced the WER by 9.1% and
9.7% relative over the baselines with generative and CD discriminativepre-training
◦ Also reduced training time by a factor of five compared to CDdiscriminative pre-training with 6000 CD state targets
• A way of evaluating CD classification results on CI level is used tofacilitate frame level DNN comparisons with different targets◦ Frame error CV accuracies correlate well with final WERs in hybrid
system