hiwire progress report chania, may 2007 presenter: prof. alex potamianos technical university of...


Post on 20-Dec-2015






  • Slide 1
  • HIWIRE Progress Report. Chania, May 2007. Presenter: Prof. Alex Potamianos, Technical University of Crete
  • Slide 2
  • Outline: Audio-Visual Processing (WP1); VTLN (WP2); Segment Models (WP1); Recognition on BSS (WP1); Bayes Optimal Adaptation (WP2)
  • Slide 3
  • Outline: Audio-Visual Processing (WP1); VTLN (WP2); Segment Models (WP1); Recognition on BSS (WP1); Bayes Optimal Adaptation (WP2)
  • Slide 4
  • Motivation: Combining several sources of information can improve performance. Unfortunately, across different environments and noise conditions, not all sources of information are equally reliable, and there is mismatch between training and test conditions. Goal: propose estimators of the optimal stream weights s_i that can be computed in an unsupervised manner.
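As a minimal sketch of how stream weights enter a multi-stream classifier, the snippet below combines per-stream log-likelihoods as a weighted sum with weights s_i. This is a common formulation that the slides do not spell out, so treat it as an assumption; the function names, weight values, and scores are all illustrative.

```python
def combine_streams(stream_loglikes, weights):
    """Weighted sum of per-stream log-likelihoods for each class.

    stream_loglikes: one dict per stream, mapping class -> log p(x_stream | class)
    weights: stream weights s_i (assumed non-negative and summing to 1)
    """
    classes = stream_loglikes[0].keys()
    return {
        c: sum(w * ll[c] for w, ll in zip(weights, stream_loglikes))
        for c in classes
    }


def classify(stream_loglikes, weights):
    """Return the class with the highest combined score."""
    scores = combine_streams(stream_loglikes, weights)
    return max(scores, key=scores.get)


# Hypothetical scores: the audio stream prefers "two", the visual stream "one".
audio = {"one": -12.0, "two": -11.0}
video = {"one": -5.0, "two": -9.0}

decision_audio_heavy = classify([audio, video], [0.9, 0.1])
decision_balanced = classify([audio, video], [0.5, 0.5])
```

With these made-up numbers the decision flips as weight moves from the audio to the visual stream, which is why the choice of s_i matters when one stream is unreliable.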
  • Slide 5
  • Optimal Stream Weights: Equal error rate in single-stream classifiers. Equal estimation error variance in each stream.
  • Slide 6
  • Experimental Results: Subset of the CUAVE database used: 36 speakers (30 training, 6 testing), 5 sequences of 10 connected digits per speaker. Training set: 1500 digits (30x5x10). Test set: 300 digits (6x5x10). Features: audio, 39 features (MFCC_D_A); visual, 39 features (ROIDCT_D_A, odd columns). Multi-stream HMM models: 8-state, left-to-right, whole-digit HMMs with a single Gaussian mixture. The AV-HMM uses separate audio and video feature streams.
  • Slide 7
  • Results (classification): Two-class anti-models. Class membership. Inter- and intra-class distances.
  • Slide 8
  • Results (recognition): Generalization of the inter- and intra-class distance measure: the inter-class distance is computed among all the classes.
  • Slide 9
  • Conclusions: Stream weight computation for a multi-class classification task, based on theoretical results for two-class classification and on the use of an anti-model technique. We use only the test utterance and the information contained in the trained models. Generalization towards unsupervised estimation of stream weights for multi-stream classification and recognition problems.
  • Slide 10
  • Outline: Audio-Visual Processing (WP1); VTLN (WP2); Segment Models (WP1); Recognition on BSS (WP1); Bayes Optimal Adaptation (WP2)
  • Slide 11
  • Vocal Tract Length Normalization. Dependence between warping and phonemes. Frame Segmentation into Regions. Warping Factor and Function Estimation. VTLN in Recognition. Evaluation.
  • Slide 12
  • Dependence between warping and phonemes [1]. Examining the similarity between two frames before and after the warping: For each phoneme and speaker, the average spectral envelope is computed for the middle frame of the utterance. An optimal warping factor is computed (for each phoneme's utterance) so that the MSE between the warped spectrum and the corresponding unwarped spectrum is minimized. Optimization is achieved by a full search over warping factors ranging from 0.8 to 1.2, where 1 corresponds to no warping. The mapped spectrum is warped according to this optimal warping factor.
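The full search described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the warp is taken to be a linear rescaling of the frequency axis with linear interpolation, the step of 0.02 is borrowed from the next slide, and the spectra are synthetic envelopes.

```python
import math


def warp_spectrum(spectrum, alpha):
    """Linearly warp the frequency axis: output bin i reads input position alpha * i."""
    n = len(spectrum)
    warped = []
    for i in range(n):
        pos = min(alpha * i, n - 1)  # clamp to the last bin
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        warped.append((1.0 - frac) * spectrum[lo] + frac * spectrum[hi])
    return warped


def mse(a, b):
    """Mean squared error between two equally long spectra."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_warping_factor(mapped, reference, lo=0.80, hi=1.20, step=0.02):
    """Full search over candidate factors; returns the one minimizing the MSE."""
    n_steps = int(round((hi - lo) / step))
    candidates = [round(lo + k * step, 2) for k in range(n_steps + 1)]
    return min(candidates, key=lambda a: mse(warp_spectrum(mapped, a), reference))


# Synthetic envelopes: `mapped` is `reference` with its frequency axis stretched
# by 1.1, so warping `mapped` with a factor of 1.1 should recover `reference`.
reference = [math.exp(-((i - 40) / 8.0) ** 2) for i in range(128)]
mapped = [math.exp(-((j / 1.1 - 40) / 8.0) ** 2) for j in range(128)]
best = best_warping_factor(mapped, reference)
```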
  • Slide 13
  • Dependence between warping and phonemes [2]. Bi-Parametric Warping Function (2pts): Different warping factors are evaluated for the low (f < 3 kHz) and high (f ≥ 3 kHz) frequencies, respectively. Constraints: a bounded range of warping factors and a step of 0.02. A full search over the 25 (5^2) candidate warping functions provides the optimal pair of warping factors. Four-Parametric Warping Function (4pts): Different warping factors are evaluated for the frequency ranges 0-1.5, 1.5-3, 3-4.5 and 4.5-8 kHz. The constraints and step remain the same as in the bi-parametric case. Full search over the 625 (5^4) different candidate warping functions. Bias addition before the warping process: Based on the ML algorithm, we evaluate a linear bias that minimizes the spectral distance between the reference and mapped spectra. The extracted linear bias is added to the unwarped mapped spectrum.
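The candidate counts of 25 and 625 are consistent with five candidate factors per frequency band (5^2 and 5^4). The sketch below enumerates such grids. The specific factor values are hypothetical, since the actual range is not recoverable from the slide, and `factor_for_frequency` is an illustrative helper, not something named in the source.

```python
from itertools import product

# Hypothetical per-band candidates: five factors with step 0.02.
# The real range used by the authors is not given in the surviving text.
band_factors = [0.96, 0.98, 1.00, 1.02, 1.04]

two_band = list(product(band_factors, repeat=2))   # bi-parametric (2pts)
four_band = list(product(band_factors, repeat=4))  # four-parametric (4pts)


def factor_for_frequency(freq_khz, edges_khz, factors):
    """Pick the warping factor of the band that contains freq_khz.

    edges_khz holds the upper edges of all bands except the last one.
    """
    for edge, factor in zip(edges_khz, factors):
        if freq_khz < edge:
            return factor
    return factors[-1]


# Band edges taken from the slide: 3 kHz (2pts) and 1.5, 3, 4.5 kHz (4pts).
low_factor = factor_for_frequency(2.0, [3.0], (0.98, 1.02))
top_factor = factor_for_frequency(5.0, [1.5, 3.0, 4.5], (0.96, 0.98, 1.00, 1.02))
```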
  • Slide 14
  • Results (over all speakers) after bias addition.
  • Slide 15
  • Frame Segmentation into Regions. Based on the unsupervised K-Means algorithm, the sequence of a testing utterance's frames, of length M, is divided into a number of regions that we specify. The algorithm's output is a function F between the frames m and the corresponding region index c. As an additional constraint, median filtering is applied to the sequence of region indices. This constraint has the effect of smoothing the sequence of indices so as to reflect a more physiologically plausible rate of region transition between successive frames.
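A minimal sketch of this segmentation step, assuming plain k-means with squared Euclidean distance followed by a median filter over the region indices. The initialization, the window width of 3, and the toy one-dimensional frames are assumptions made for illustration. Note that a median over region indices treats them as ordered, which is unproblematic for two regions but a simplification for more.

```python
from statistics import median_low


def kmeans_labels(frames, k, n_iter=20):
    """Plain k-means; centroids start at evenly spaced frames (an arbitrary choice)."""
    m = len(frames)
    centroids = [list(frames[(i * (m - 1)) // max(k - 1, 1)]) for i in range(k)]
    labels = [0] * m
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(f, centroids[c])))
            for f in frames
        ]
        # Update step: mean of the frames assigned to each centroid.
        for c in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def median_smooth(labels, width=3):
    """Median-filter the region index sequence; the window is truncated at the edges."""
    half = width // 2
    return [
        median_low(labels[max(0, i - half): i + half + 1])
        for i in range(len(labels))
    ]


# Toy 1-D "frames": one frame is an outlier inside the low-valued stretch.
frames = [[0.1], [0.0], [0.2], [5.0], [0.1], [5.1], [4.9], [5.2], [5.0]]
raw = kmeans_labels(frames, k=2)
smoothed = median_smooth(raw, width=3)
```

In this toy case the raw labels switch region twice around the outlier, while the smoothed sequence contains a single region transition.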
  • Slide 16
  • Warping Factor and Function Estimation. After the division of frames into regions, an optimal factor and function for each region is obtained by maximizing the likelihood of the warped vectors with respect to the transcription from the first pass and the unnormalized Hidden Markov Model. Here the warped sequence is the testing utterance in which every frame, after its categorization into region c, is warped according to one of the R candidate factors and one of the N candidate functions; the model is the Hidden Markov Model trained with unnormalized training vectors; and W is the transcription obtained by the first pass. The optimum warping factor for each region is obtained by searching over values between 0.88 and 1.12 with step 0.02.
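The per-region search can be sketched as a grid search over (factor, function) pairs. The scoring callable below is a toy stand-in for the HMM likelihood, and the function labels "linear" and "2pts" are placeholders; only the factor range of 0.88 to 1.12 with step 0.02 comes from the slide.

```python
from itertools import product


def candidate_factors(lo=0.88, hi=1.12, step=0.02):
    """Warping factors from lo to hi inclusive (13 values for the defaults)."""
    n = int(round((hi - lo) / step))
    return [round(lo + i * step, 2) for i in range(n + 1)]


def best_warp_per_region(regions, factors, functions, loglike):
    """For each region, return the (factor, function) pair with the highest score.

    loglike(region, factor, function) is a hypothetical callable standing in for
    scoring the warped frames of a region against the unnormalized HMM and the
    first-pass transcription.
    """
    best = {}
    for c in regions:
        best[c] = max(
            product(factors, functions),
            key=lambda pair: loglike(c, pair[0], pair[1]),
        )
    return best


# Toy score: peaks at a different factor in each region and mildly prefers
# the "2pts" function. Purely illustrative.
targets = {0: 0.94, 1: 1.06}


def toy_loglike(region, factor, function):
    penalty = 0.0 if function == "2pts" else 0.1
    return -((factor - targets[region]) ** 2) - penalty


choice = best_warp_per_region([0, 1], candidate_factors(), ["linear", "2pts"], toy_loglike)
```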
  • Slide 17
  • VTLN in Recognition. During recognition, since a preliminary transcription for testing utterances is not given, a multiple-pass strategy is introduced: (1) a preliminary transcription W is obtained through a first-pass recognition using the unwarped sequence of cepstral vectors X and the unnormalized model; (2) the utterance's frames are categorized into c regions; (3) for each region c, an optimal warping factor and function is evaluated through a multi-dimensional grid search; (4) after the vectors are warped with the optimal per-region factor and function, the optimally warped sequence is decoded to obtain the final recognition result.
  • Slide 18
  • Results, WER (%) by number of utterances (1 / 5). Baseline: 50.83. Li & Rose (2 pass): 43.79 / 43.48. 2 regions: 41.73 (+4.7%) / 42.79 (+1.60). 3 regions: 43.11 (+1.56) / 43.66 (-0.46).
  • Slide 19
  • Outline: Audio-Visual Processing (WP1); VTLN (WP2); Segment Models (WP1); Recognition on BSS (WP1); Bayes Optimal Adaptation (WP2)
  • Slide 20
  • The Linear Dynamic Model (LDM). Discrete-time linear dynamical systems efficiently model the evolution of spectral dynamics. An observation y_k is produced at each time step. The state process is first-order Markov. The initial state is Gaussian. The state and observation noises w_k, v_k are uncorrelated, temporally white, and zero-mean Gaussian distributed.
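The slide's equations did not survive extraction. A scalar simulation under the standard state-space form x_{k+1} = F x_k + w_k, y_k = H x_k + v_k (assumed here, with noise variances q and r as illustrative parameter names) might look like this:

```python
import random


def simulate_ldm(F, H, q, r, x0_mean, x0_var, n_steps, seed=0):
    """Simulate a scalar LDM: x_{k+1} = F x_k + w_k, y_k = H x_k + v_k."""
    rng = random.Random(seed)
    x = rng.gauss(x0_mean, x0_var ** 0.5)     # Gaussian initial state
    states, observations = [], []
    for _ in range(n_steps):
        y = H * x + rng.gauss(0.0, r ** 0.5)  # observation noise v_k ~ N(0, r)
        states.append(x)
        observations.append(y)
        x = F * x + rng.gauss(0.0, q ** 0.5)  # state noise w_k ~ N(0, q)
    return states, observations


# Noise-free run: the state decays geometrically and observations scale it by H.
clean_states, clean_obs = simulate_ldm(0.5, 2.0, 0.0, 0.0, 4.0, 0.0, n_steps=3)

# Noisy run with arbitrary illustrative variances.
noisy_states, noisy_obs = simulate_ldm(0.9, 1.0, 0.01, 0.1, 0.0, 1.0, n_steps=50)
```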
  • Slide 21
  • Generalized canonical form of LDM: Noise covariances are not constrained. Matrices F, H have canonical forms. The canonical form is identifiable if it is also controllable (Ljung).
  • Slide 22
  • Experimental Setup. Training set: Aurora 2 clean database, 3800 training sentences. Test set: Aurora 2, test A, subway sentences; 1000 test sentences; different levels of noise (clean, SNR: 20, 15, 10, 5 dB). Front-end: HTK standard front-end, extracting 14-dimensional static features. 2 feature configurations: 12 cepstral coefficients + C0 + energy; the same + first- and second-order derivatives (Δ, ΔΔ).
  • Slide 23
  • Model Training on Speech Data. Word models with different numbers of segments, based on the phonetic transcription. Segment alignments produced using HTK. Segments per model: 2 segments: oh; 4 segments: two, eight; 6 segments: one, three, four, five, six, nine, zero; 8 segments: seven.
  • Slide 24
  • Classification process. Keep true word boundaries fixed: digit-level alignments are produced by an HMM. Apply a suboptimal search and pruning algorithm: keep the 11 most probable word histories for each word in the sentence. Classification is based on maximizing the likelihood.
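The pruning step can be sketched as keeping the N best partial word histories after each word position. The scores below are made up, the beam is reduced to 2 for readability (the slide keeps 11), and per-position scores are assumed to be independent of the history, which is a simplification.

```python
import heapq


def extend_and_prune(histories, word_scores, beam=11):
    """Extend each (log_prob, words) history by every candidate word, keep the best `beam`."""
    extended = [
        (lp + score, words + (w,))
        for lp, words in histories
        for w, score in word_scores.items()
    ]
    return heapq.nlargest(beam, extended, key=lambda h: h[0])


# Hypothetical per-position word log-likelihoods for a two-word sentence.
per_position_scores = [
    {"one": -1.0, "two": -2.0, "oh": -5.0},
    {"one": -3.0, "two": -0.5, "oh": -4.0},
]

histories = [(0.0, ())]
for word_scores in per_position_scores:
    histories = extend_and_prune(histories, word_scores, beam=2)

# The surviving histories are sorted best first; the top one is the decision.
best_log_prob, best_words = histories[0]
```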
  • Slide 25
  • Classification results: Comparison of LDM segment models and HTK HMM classification (% accuracy). Same front-end configuration, same alignments. Both models trained on clean training data. AURORA Subway, columns: HMM (HTK) MFCC, E / HMM (HTK) + Δ + ΔΔ / LDMs MFCC, E / LDMs + Δ + ΔΔ. Clean: 97.19 / 97.57 / 97.53 / 97.61. SNR 20: 90.91 / 95.71 / 93.23 / 95.12. SNR 15: 80.09 / 91.76 / 87.91 / 91.13. SNR 10: 57.68 / 81.93 / 76.29 / 82.69. SNR 5: 36.01 / 64.24 / 54.87 / 63.56.
  • Slide 26
  • Classification results Performance Comparison (MFCCs)
  • Slide 27
  • Classification results Performance Comparison (MFCCs + Δ + ΔΔ)
  • Slide 28
  • Sub-optimal Viterbi decoding (SOVD). We use a Viterbi-like decoding algorithm for speech classification. The HMM state equivalent in LDMs is [x_k, s_i]. It is applied among the segments of each word model. It provides segment alignments based on the likelihood of the LDM, estimated with a Kalman filter. It allows decoding at each time k using possible histories leading to a different [x_k, s_i] combination at several depth levels.
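As a hedged illustration of how an LDM likelihood can be estimated with a Kalman filter, the scalar sketch below accumulates the Gaussian log-density of each innovation. It assumes the standard state-space form x_{k+1} = F x_k + w_k, y_k = H x_k + v_k with noise variances q and r; the parameter values are arbitrary, and this is not the SOVD algorithm itself.

```python
import math


def kalman_loglike(ys, F, H, q, r, x0_mean, x0_var):
    """Log-likelihood of scalar observations under a scalar LDM via the Kalman filter."""
    x_pred, p_pred = x0_mean, x0_var
    total = 0.0
    for y in ys:
        # Innovation and its variance.
        e = y - H * x_pred
        s = H * p_pred * H + r
        total += -0.5 * (math.log(2.0 * math.pi * s) + e * e / s)
        # Measurement update.
        k = p_pred * H / s
        x_filt = x_pred + k * e
        p_filt = (1.0 - k * H) * p_pred
        # Time update.
        x_pred = F * x_filt
        p_pred = F * p_filt * F + q
    return total


# A decaying trajectory scores higher under a model with matching dynamics.
ys = [1.0, 0.5, 0.25, 0.125]
ll_matched = kalman_loglike(ys, F=0.5, H=1.0, q=0.01, r=0.01, x0_mean=1.0, x0_var=0.01)
ll_mismatched = kalman_loglike(ys, F=-0.5, H=1.0, q=0.01, r=0.01, x0_mean=1.0, x0_var=0.01)
```

Classification by maximum likelihood then amounts to scoring the same observations under each candidate model and picking the highest value.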
  • Slide 29
  • SOVD Steps
  • Slide 30
  • Sub-Optimal Viterbi-like Search [Figure: trellis of segments S1 to S4 over time (frames) t1 to t5, with nodes labeled F_i x_k]
  • Slide 31
  • Visualization of Model Predictions: Trajectories of true and predicted observations for c_1, c_3
