
Signal Processing Advances for the MUTE sEMG-Based Silent Speech Recognition System

Yunbin Deng, Glen Colby, James T. Heaton, and Geoffrey S. Meltzner

Abstract— Military speech communication often needs to be conducted in very high noise environments. In addition, there are scenarios, such as special-ops missions, for which it is beneficial to have covert voice communications. To enable both capabilities, we have developed the MUTE (Mouthed-speech Understanding and Transcription Engine) system, which bypasses the limitations of traditional acoustic speech communication by measuring and interpreting muscle activity of the facial and neck musculature involved in silent speech production. This article details our recent progress on automatic surface electromyography (sEMG) speech activity detection, feature parameterization, multi-task sEMG corpus development, context dependent sub-word sEMG modeling, discriminative phoneme model training, and flexible vocabulary continuous sEMG silent speech recognition. Our current system achieved recognition accuracy at developable levels for a pre-defined special ops task. We further propose research directions in adaptive sEMG feature parameterization and data driven decision question generation for context-dependent sEMG phoneme modeling.

Index Terms— sEMG speech detection, sEMG speech recognition, silent speech communication, sub-word sEMG model, speaker adaptive sEMG feature extraction.

I. INTRODUCTION

Recent work in the field of non-acoustic automatic speech recognition (ASR) has been motivated by the desire to mitigate two significant weaknesses of standard, acoustic ASR: (1) severe performance degradation in the presence of ambient noise and (2) a limited ability to maintain privacy/secrecy because of the requirement of using audible speech. These non-acoustic ASR studies have investigated alternative modalities, such as ultrasound [1] or sEMG [2-9], that can capture sufficient speech information while overcoming the aforementioned deficiencies of acoustic ASR systems. sEMG-based speech recognition, also known as subvocal speech recognition, operates on signals recorded from a set of sEMG sensors that are strategically located on the neck and face to measure muscle activity associated with the phonation, resonation and articulation of speech.

Manuscript received April 30, 2012. This study was sponsored by the United States Defense Advanced Research Projects Agency (DARPA) Information Innovation Office (I2O) Program, “Advanced Speech Encoding.” The views and conclusions in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the U.S. Government.

Y. Deng, G. S. Meltzner, and G. Colby are with BAE Systems, Inc., Burlington, MA 01803 USA. Corresponding author is Yunbin Deng (phone: 781-262-4393; fax: 781-273-9345; e-mail: [email protected]).

J.T. Heaton is with the Massachusetts General Hospital Department of Surgery, Boston, MA 02114 USA.

Because the signals are a direct measurement of articulatory muscle activity, there is no need for acoustic excitation of the vocal tract, making it possible to recognize silent, mouthed speech. Moreover, because sEMG signals are effectively decoupled from acoustic signals, they are immune to acoustic noise corruption. Although these two properties have made sEMG an attractive alternative modality for speech recognition, sEMG-based speech recognition performance lags that of standard ASR systems. Moreover, this alternative modality introduces challenges of its own, especially for operating in a continuous speech recognition mode. This paper will discuss some approaches we have taken to address these challenges in our work to expand the continuous recognition capability for the MUTE (Mouthed-speech Understanding and Transcription Engine) silent speech recognition system. Our key new contributions are discriminative sEMG phoneme model training and a working sEMG speech communication system on portable devices. We further identify two new research directions to enhance sEMG speech recognition accuracy: 1) adaptive sEMG feature parameterization, and 2) automatic decision tree question generation for context dependent sEMG phoneme model clustering.

II. BACKGROUND

A. Prior Studies

Although the breadth and depth of silent speech recognition research do not yet rival those of standard ASR research, several studies have been conducted in this domain. Chan et al. [3] obtained a 93% recognition rate on a vocabulary of 10 digits (zero through nine) using 5 sEMG channels on the face and neck for two subjects who produced vocalized (normally spoken) speech. Betts and Jorgensen [4] conducted a similar study on a single speaker but achieved only a 74% recognition rate, albeit on a larger vocabulary of 15 voiced words. Jou et al. [5] further extended the vocabulary size to 108 words, but at the cost of reduced recognition accuracy. Other studies demonstrated improved recognition rates for somewhat smaller vocabularies. For example, Lee [6] achieved a mean 87% recognition rate on 60 vocalized words for 8 male Korean speakers. Similarly, we reported recognition rates of 92.1% and 86.7% on vocalized and mouthed speech, respectively, for a group of 9 American English speakers on a 65-word vocabulary [2]. More recently, Wand and Schultz have pushed sEMG-based recognition towards continuous speech recognition [7], ultimately achieving a word error rate (WER) of 15.7% on a vocabulary of 108 words [8].

B. The MUTE System

The MUTE silent speech recognition system has been developed over the past 5 years as part of the DARPA Advanced Speech Encoding (ASE) program. The goal of the MUTE program has been to develop a portable, practical silent speech recognition system that exploits the advantages of sEMG-based speech recognition in order to provide new communication capabilities for both warfighters and civilians. The initial research used a set of wired Delsys DE 2.1 electrodes placed at locations on the face and neck that target the muscles involved in speech articulation. The resulting performance was an average recognition rate of 86.7% on a 65 isolated-word vocabulary for 9 speakers [2]. Recognizing the impracticality of requiring 11 sensors, further research revealed that reducing the sensor set to 6 sensors led to a recognition performance drop of only 3% [10]. Subsequent research efforts concentrated on both designing more practical portable MUTE hardware (see Figure 1) and expanding recognition capability. These efforts have resulted in the current version of the system, the components of which are shown in Figure 2. The system recognizes a vocabulary of 200 words in continuous mode with an accuracy of 96.9% [11].

Figure 1. The MUTE system being demonstrated in the field. The system is portable and flexible enough to be used in outdoor environments while the operator is situated in a non-ideal position. Note that in this case, the MUTE system is running on an Android tablet rather than the smartphone shown in Figure 2.

Figure 2. The MUTE system consists of 4 components: (1) four pairs of sEMG sensors custom designed for placement on the face and neck; (2) a wireless sEMG transmitter module that transmits the raw sEMG data; (3) a wireless receiver module that receives the raw sEMG signals, and (4) an Android smartphone or tablet that acquires the raw data from the wireless receiver via USB, processes the data and ultimately conducts the recognition.

III. DATA COLLECTION AND PROCESSING

A. Corpus Design

Whereas most of our previous reports on sEMG-based speech recognition focused on isolated word recognition [2][10][12], this study focuses on speaker-dependent, flexible and unlimited vocabulary continuous recognition tasks. For practical military communication applications, state-of-the-art sEMG processing techniques can only achieve high recognition accuracy on a well-defined task, i.e., a small vocabulary with grammar constraints. However, we would prefer an unlimited vocabulary for flexible deployment, which may require only a single instance of training data collection. To enable this capability, we designed a data collection corpus that covers commonly used English words (about 2000) and balances the frequency of phoneme combinations for unforeseen testing words. The text for our corpus contains five subsets:

1. The SX part of the TIMIT data, which contains a total of 450 utterances with about 1800 unique words. The original TIMIT corpus was designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems [14]. This portion of the data was selected to balance the frequency of phoneme combinations and is used as a training set.

2. The special operations data set contains 202 unique words, including the 26 NATO phonetic alphabet code words, some common phrases, and a set of sign-language signals and commands [15]. These arm-and-hand signal terms can be used for a military special-operations silent speech communication proof of concept.

3. Commonly used phrases. This subset contains the 100 most commonly used English phrases [16].

4. A set of frequently used text messages extracted from [17], which is used to demonstrate sEMG silent text messaging for commercial applications.

5. A hand-crafted data set covering phone calls, numbers, and dates. This set was developed to enhance the accuracy of digit and number recognition.

The overall data collection corpus contained a total of 1200 utterances, which could be collected within about two hours for each subject. The pronunciation dictionary contained 2,517 words and 3,046 possible pronunciations. We then conducted testing on each subset with models trained on the other subsets.

B. Data Collection

For the text corpus described above, six adult native speakers of English were recruited (two male, four female).

Page 3: [IEEE MILCOM 2012 - 2012 IEEE Military Communications Conference - Orlando, FL, USA (2012.10.29-2012.11.1)] MILCOM 2012 - 2012 IEEE Military Communications Conference - Signal processing

The 1200 utterances were randomly sorted and prompted at a user-controlled pace. The data collection protocol for this continuous mouthing mode was similar to that of the isolated word data collection and is detailed in [2]. The key difference was that a set of only 8 prototype wireless sEMG sensors was used instead of 11 wired sensors. The sensors are a customized version of the Trigno™ wireless sensors (Delsys Inc.) and are attached to the skin with double-stick adhesive strips. Note that these sensors were the precursors of the ones shown in Figures 1 and 2, which are used in the current MUTE system. The 8 sensor locations are depicted in Figure 3. The face sensors are placed in relation to a template pattern (printed on overhead transparency film) that uses the corners of the mouth and eye as reference points. The two ventral neck sensors are placed with their medial edge 5 mm from the midline, with the more inferior sensor (#4) above the cricothyroid membrane and the superior sensor (#3) just inferior to the submental surface. The submental sensors (#1-2) are centered on the submental surface with medial edges 10 and 40 mm lateral to the midline.

C. sEMG Speech Activity Detection (SAD)

Our continuous sEMG silent speech communication system automatically determines the start and end of speech using our customized SAD algorithm. It incorporates multi-channel decision logic that takes advantage of the fact that speech production typically involves multiple muscles at the same time, and is thus able to ignore noise in any single channel. To balance the trade-off between simplicity (desired for real-time implementation) and robustness, our current SAD algorithm is based on these principles: 1) using a 50 ms windowed signal to compute local statistics; 2) online noise and signal level tracking on each channel; and 3) a global decision based on the five best sEMG channels. The SAD algorithm operates on two levels of finite state machines. The first level consists of a finite state machine for each channel, which determines that channel's speech state. The higher-level machine combines the per-channel states to make the final start-of-speech (SOS) and end-of-speech (EOS) decision. The details of this algorithm are provided in [11]. On isolated word recognition tasks, this algorithm was shown to perform well: recognition accuracy based on this automatic segmentation was comparable to that obtained with hand-created segmentation. However, its performance in rejecting non-speech sEMG activity, such as smiling and throat clearing, remains to be investigated.
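To make the two-level decision logic concrete, the following is a minimal sketch in Python. It is not the MUTE implementation: the noise-tracking constants, the hysteresis thresholds, and the three-channel voting rule are illustrative assumptions (the deployed algorithm votes over the five best channels and is detailed in [11]).

import numpy as np


class ChannelSAD:
    """Per-channel finite state machine with simple noise-floor tracking."""

    def __init__(self, alpha=0.05, ratio=3.0):
        self.noise = None      # running noise-floor estimate (RMS)
        self.alpha = alpha     # adaptation rate for the noise tracker (assumed)
        self.ratio = ratio     # signal-to-noise ratio needed to call "speech" (assumed)
        self.active = False    # current state of this channel

    def update(self, frame):
        # frame: one 50 ms window of raw samples from this channel
        rms = np.sqrt(np.mean(frame ** 2))
        if self.noise is None:
            self.noise = rms
        # Track the noise floor only while the channel looks inactive.
        if not self.active:
            self.noise = (1 - self.alpha) * self.noise + self.alpha * rms
        self.active = rms > self.ratio * self.noise
        return self.active


class MultiChannelSAD:
    """Global FSM: declares speech when enough channels agree."""

    def __init__(self, n_channels=8, min_votes=3):
        self.channels = [ChannelSAD() for _ in range(n_channels)]
        self.min_votes = min_votes   # assumed voting rule for this sketch
        self.in_speech = False

    def step(self, frames):
        # frames: one 50 ms window per channel, all time-aligned
        votes = sum(ch.update(f) for ch, f in zip(self.channels, frames))
        if not self.in_speech and votes >= self.min_votes:
            self.in_speech = True    # start of speech (SOS)
        elif self.in_speech and votes == 0:
            self.in_speech = False   # end of speech (EOS)
        return self.in_speech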

D. sEMG Feature Parameterization

A number of researchers have reported differing levels of success using a variety of features, including wavelets [3], wavelet packets [4] and customized sets of time-domain features [5,7-9]. Our work in this area commenced with an investigation into a number of potential parameterization schemes, including Mel-frequency cepstral coefficients (MFCCs), as well as a set of features derived specifically to process sEMG signals for the recognition of gross motor movements [2]. Using a round-robin technique that compared the recognition performance obtained for all possible feature combinations, we determined that the combination of our customized MFCCs (i.e., the frequency range and number of cepstral components customized to sEMG signals) and muscle co-activation levels (the quantified amount of simultaneous firing activity between all possible pairs of sEMG channels) produced the best recognition performance. Because the co-activation features are computationally expensive and their performance gain is small, only MFCCs (with mean and variance normalization) are applied in our smartphone silent speech communication system [2]. The fact that different features were found to be optimal by different research groups may be due to their adoption of different sEMG sensors, which can result in different signal characteristics. sEMG signals are quasi-stationary over small time windows and have higher energy at lower frequencies, making MFCCs a reasonable parameterization choice. We have also investigated filter-bank schemes other than the Mel scale, but found no performance improvement over MFCCs. Our research has also demonstrated that mean and variance normalization of the MFCCs greatly enhances session robustness [11].

Our experience with sEMG-based speech recognition has suggested that 1) a subject's mouthing speed is quite different between isolated and continuous modes; and 2) different speakers use very different continuous mouthing speeds. Our previous work on isolated word recognition used a uniform window size of 50 ms and frame rate of 25 ms for all speakers. In this study, we sought to determine whether better accuracy can be achieved for speaker-dependent sEMG speech recognition by adapting these two parameters to a specific speaker.
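As a concrete illustration of this per-channel feature pipeline, the sketch below computes MFCC-style features with mean and variance normalization using the librosa library. The sampling rate, filter-bank size, and upper frequency limit are assumptions chosen for the example (MUTE uses a frequency range and cepstral order customized to sEMG), while the 14 coefficients per channel and the 50 ms / 25 ms framing follow the settings reported in this paper.

import numpy as np
import librosa  # used here only for its MFCC implementation

SR = 2000              # assumed sEMG sampling rate in Hz (not specified in the paper)
WIN_MS, HOP_MS = 50, 25   # 50 ms analysis window, 25 ms shift, as in the isolated-word work
N_COEF = 14            # 14 coefficients per channel -> 112-dim vector for 8 channels


def semg_features(channels):
    """channels: list of equal-length 1-D numpy arrays, one raw signal per sEMG channel."""
    win_length = int(SR * WIN_MS / 1000)   # 100 samples
    hop = int(SR * HOP_MS / 1000)          # 50 samples
    feats = []
    for sig in channels:
        # MFCC-style analysis; fmax and n_mels restrict the filter bank to the
        # low-frequency band where most sEMG energy lies (illustrative values,
        # not the customized MUTE filter bank).
        c = librosa.feature.mfcc(y=sig.astype(float), sr=SR, n_mfcc=N_COEF,
                                 n_fft=256, win_length=win_length,
                                 hop_length=hop, n_mels=20, fmax=500)
        feats.append(c)
    feats = np.vstack(feats)               # (8 * 14) x n_frames
    # Per-session mean and variance normalization of each feature dimension.
    mu = feats.mean(axis=1, keepdims=True)
    sd = feats.std(axis=1, keepdims=True) + 1e-8
    return ((feats - mu) / sd).T           # n_frames x 112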

Figure 3. sEMG sensor locations. A) The template for placement of sensors 5-8 is shown, with blue pen marks on the skin at the inside edge of each sensor outline corner. Template lines extend from the corner of the mouth vertically downward, and angled upward to cross the corner of the eye. Black pen marks on the neck show sensor electrode bar locations. B) Sensors on the submental neck (1-2), ventromedial neck (3-4), supralabial face (5-6) and infralabial face (7-8).

Page 4: [IEEE MILCOM 2012 - 2012 IEEE Military Communications Conference - Orlando, FL, USA (2012.10.29-2012.11.1)] MILCOM 2012 - 2012 IEEE Military Communications Conference - Signal processing

E. sEMG Phoneme Model

To build an sEMG speech recognition system for unlimited-vocabulary tasks, sub-word models must be developed. These include phoneme-based models, such as bundled phonetic features based on predefined phonetic questions [18], and syllable-based models, which have been applied to isolated Spanish syllable recognition [19]. As English has a very large set of syllables, a syllable-based approach would require collecting a very large training data set from each subject. We therefore took the phoneme-based approach, i.e., the traditional 39 phonemes plus models for silence and short pauses. Our current system uses context-independent mono-phone models. Because the sentences were continuously mouthed, it was not feasible to manually label the data at the phoneme or even word level. The phoneme models were initialized with the global mean and variance, followed by a few iterations of the expectation-maximization (EM) algorithm. For words with multiple possible pronunciations, we applied alignment during training to automatically determine the most likely phoneme realization for each word instance.

Additionally, we have been working toward implementing context-dependent models. To that end, we have implemented a data-driven method to automatically generate the "phonetic" questions from mouthed sEMG speech features, instead of using predefined phonetic questions that were meant for traditional vocalized speech. Which questions to ask during sEMG decision-tree building is still an open question. We adopted a data-driven, bottom-up and top-down hybrid clustering approach to automatically generate these questions [20]. This approach has the advantage of being automatic and does not have to be phonetically driven. It may also have the extra benefit of generating speaker-specific questions. The input of this algorithm is a pre-trained context-independent phoneme model. The outputs are linguistic questions, i.e., does the left (or right) context belong to a given phoneme group? Assuming the phoneme HMMs have three states, the first (left) state is used to generate questions for the right context, and the third (right) state is used to generate questions for the left context. The algorithm is implemented in four steps:

1. Iteratively combine phonemes into groups until a pre-defined final set size is reached. In each step, choose the combination that results in the minimum likelihood reduction among all possible combinations.

2. Combine the final sets in Step 1 into two maximally separated groups.

3. Recursively apply Steps 1 and 2 separately to the two groups obtained from Step 2, until no more than two of the original phonemes remain united in either resulting cluster.

4. Prune the resulting tree to K leaves (the final number of questions). This is done in a similar fashion to Step 1.

These automatically generated questions can be used to build traditional tied-state tri-phone models, or an improved version of the bundled phonetic features model; a minimal sketch of the clustering procedure is given below.
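The greedy bottom-up grouping of Step 1 can be sketched as follows. This is a simplified illustration rather than the implementation of [20]: each phoneme state is summarized here by a single diagonal-covariance Gaussian with an occupation count, and the "likelihood reduction" of a merge is approximated by the log-likelihood loss incurred by pooling the two groups' statistics.

import numpy as np


def pooled_loglik_loss(g1, g2):
    """Approximate log-likelihood loss of merging two Gaussian groups.

    Each group is (count, mean, var) with diagonal covariance; the loss is the
    increase in  sum_i 0.5 * count_i * sum(log var_i)  when the groups are pooled.
    """
    n1, m1, v1 = g1
    n2, m2, v2 = g2
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    v = (n1 * (v1 + (m1 - m) ** 2) + n2 * (v2 + (m2 - m) ** 2)) / n
    before = 0.5 * (n1 * np.sum(np.log(v1)) + n2 * np.sum(np.log(v2)))
    after = 0.5 * n * np.sum(np.log(v))
    return after - before, (n, m, v)


def bottom_up_group(phone_stats, target_size):
    """phone_stats: {phone: (count, mean, var)}.  Greedily merge phonemes until
    target_size groups remain, always taking the cheapest merge (Step 1)."""
    groups = {(p,): s for p, s in phone_stats.items()}
    while len(groups) > target_size:
        best = None
        for a in groups:
            for b in groups:
                if a < b:
                    loss, merged = pooled_loglik_loss(groups[a], groups[b])
                    if best is None or loss < best[0]:
                        best = (loss, a, b, merged)
        _, a, b, merged = best
        groups[tuple(sorted(a + b))] = merged   # record the merged group
        del groups[a], groups[b]
    return list(groups.keys())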

Finally, we have applied discriminative feature dimension reduction and minimum phone error (MPE) rate HMM training after maximum likelihood (ML) model training [21]. We used the Hidden Markov Model Toolkit (HTK) [22] to conduct the MPE model training. Specifically, the MPE training procedure starts from an ML-trained HMM and can be summarized in the following seven steps:

1. Build a statistical bi-gram language model (LM) on the training corpus using the CMU-CAM LM tools.

2. Convert the bi-gram LM into a word network using the HTK HBuild tool.

3. Align the training data with the ML-trained model and save the aligned results, with timing information and acoustic scores, in HTK lattice format.

4. Apply the LM scores to the lattice and prune it to keep only the best phone sequence using the HTK HLRescore tool.

5. Run recognition on the training data with the bi-gram word network, keeping the 10 best hypotheses, using the HTK HVite tool. Save all timing and acoustic score information in HTK lattice format.

6. Apply the HTK HLRescore tool to make the recognition lattices deterministic (prune redundancy).

7. Choose the training files that successfully generated lattice files and apply the HMMIRest tool for discriminative training.
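For reference, the criterion maximized by the discriminative training in step 7 is the MPE objective of [21], which weights each hypothesis sentence s for utterance r by its raw phone accuracy against the reference transcription s_r:

\mathcal{F}_{\mathrm{MPE}}(\lambda) = \sum_{r=1}^{R} \frac{\sum_{s} p_{\lambda}(O_r \mid s)^{\kappa}\, P(s)\, \mathrm{RawPhoneAcc}(s, s_r)}{\sum_{s'} p_{\lambda}(O_r \mid s')^{\kappa}\, P(s')}

where O_r is the sEMG observation sequence for utterance r, P(s) is the language model probability, and \kappa is an acoustic scaling factor; HMMIRest accumulates the required statistics from the numerator (alignment) and denominator (recognition) lattices produced in steps 3-6.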

IV. RESULTS

This section details our experimental results on minimum phone error training, the effect of analysis window size and frame rate, and automatic phoneme question generation, all based on continuous, mouthed sEMG data.

A. Minimum Phone Error (MPE) Training

This experiment was conducted on the special operations data set (Subset 2 described in Section III), collected from three subjects. Each subject produced 1200 mouthed utterances, with a vocabulary size of 202. The features were MFCCs (customized for sEMG) with mean and variance normalization. The original feature vector dimension was 112, with 14 dimensions from each of the 8 sEMG channels. Three-state phoneme models were built on 90% of the data and tested on the remaining 10%. Each state was modeled by a mixture of 16 Gaussians. A grammar specifying random alphabet combinations and a predefined set of commands was used during recognition. Table 1 summarizes the recognition accuracy for each subject and the average word error rate (WER) across subjects, with baseline ML and MPE phoneme model training. MPE training resulted in a relative WER reduction of 21.4%.

Table 1: MPE vs. ML model training (recognition accuracy per speaker and average WER).

System      Spk 1    Spk 2    Spk 3    Average WER
Baseline    93.7%    87.6%    99.1%    6.5%
+MPE        95.9%    88.9%    99.1%    5.1%

B. Effect of Window Size and Frame Rate

To investigate the effect of window size and frame rate, a comparative study was carried out on our flexible vocabulary recognition subsets, where testing on each subset used models trained on the data from all other subsets plus the SX portion of the TIMIT corpus. The context-independent phoneme models were used. Note that a 50 ms window size with a 25 ms window shift was found to be optimal in our previous study on isolated sEMG word recognition. Tables 2 and 3 summarize the recognition accuracy for the two window size and frame rate settings. We note that the WERs of Subjects 2 and 5 were reduced by 50% or more with the smaller window and higher frame rate, while the other subjects showed smaller improvements or negative effects. These mixed results suggest that the optimal window size and frame rate can be speaker dependent or even utterance dependent. We are investigating advanced algorithms for adaptive sEMG feature parameterization.

Table 2: Recognition accuracy (%) per task and average WER, based on a 30 ms analysis window size and 15 ms window shift.

Subject   Special Ops   Common phrases   Text message   Digits   Average WER
1         92.3          95.7             89.3           88.7     8.5%
2         84.7          82.7             79.4           71.9     20.3%
3         71.7          80.1             76.6           62.9     27.2%
4         79.0          81.1             65.4           63.1     27.8%
5         95.5          95.4             94.3           86.4     7.0%
6         75.2          70.8             49.0           55.4     37.2%

Table 3: Recognition accuracy (%) per task and average WER, based on a 50 ms analysis window size and 25 ms window shift.

Subject   Special Ops   Common phrases   Text message   Digits   Average WER
1         94.7          97.6             94.8           88.3     6.1%
2         65.2          58.3             64.4           53.4     39.5%
3         72.3          75.4             78.9           55.6     27.3%
4         76.7          76.8             57.9           66.7     30.5%
5         88.2          79.7             82.7           80.1     17.3%
6         66.1          49.6             52.8           49.5     45.5%

C. Preliminary Results in Context Dependent sEMG Phoneme Question Generation

We applied the phoneme question generation algorithm to our continuous mouthed sEMG speech data and generated 20 questions for both left and right context. Some example questions generated for Subject 1 are:

QUESTION0_1_R: D K L N S T Z NG
QUESTION0_2_R: S Z
QUESTION0_3_R: D K L N T NG
QUESTION0_4_R: D K N T NG
QUESTION0_5_R: N NG
QUESTION0_6_R: D K T
QUESTION2_1_L: CH JH OY SH UH ZH
QUESTION2_2_L: OY
QUESTION2_3_L: CH JH SH UH ZH
QUESTION2_4_L: CH JH SH
QUESTION2_5_L: CH JH

For example, "QUESTION2_5_L: CH JH" stands for the question "Does the left context belong to the group {CH JH}?" In this specific case, the generated question is identical to the phonetic question "Is this an affricate?" The question "QUESTION0_2_R: S Z" groups the fricatives S and Z together even though, according to traditional phonetic rules, they are unvoiced and voiced fricatives respectively, possibly because their realizations are similar enough in the mouthed speaking mode. As a reference, some commonly used phonetic groups for traditional tri-phone acoustic decision-tree questions are:

Front vowels: IY IH EH AE
Middle vowels: AA ER AH AO
Back vowels: UW UH OW
Voiced stops: B D G
Unvoiced stops: P T K
Voiced fricatives: V DH ZH Z
Unvoiced fricatives: F SH TH S
Affricates: JH CH

As such, our automatically generated "phonetic" questions have their own meanings and yield some interesting insights into sEMG speech recognition. Our future work will apply these automatically generated questions to cluster tri-phone states for tied-state sEMG phoneme models or bundled phonetic features models [18] and compare them with the traditional predefined linguistic questions.

V. DISCUSSION

For a well-defined task, i.e., when the training and testing data are from the same domain, our sEMG speech recognition achieved a useful level of recognition accuracy, as shown in Table 1. For an untrained testing domain, the sEMG recognition accuracy is usable only for a subset of speakers, as shown in Tables 2 and 3. Thus, deploying sEMG silent speech communication for unconstrained tasks and users remains a challenging problem that requires further advances in signal processing, feature parameterization, and modeling. Our results suggest that some degree of improvement may be achieved by adapting the analysis frame size and rate to each individual speaker. However, instead of using trial and error to establish optimal values for these parameters, it is preferable to use information measures to identify quasi-optimal values. This approach appears to be feasible, as it has been done for standard acoustic speech recognition tasks [23][24]. These two parameters can be optimized as follows:

A. Optimal window size. The standard MFCC analysis starts by segmenting the input signal into equal-length frames, followed by a Fast Fourier Transform (FFT) on each frame. This process assumes that the incoming signal is composed of piecewise stationary segments of a pre-defined equal length, which does not seem to hold for speech and sEMG signals, as previous results showed. The idea of multi-scale Fourier transform analysis is to apply a few different window sizes and compute the FFT for each [23]. The resulting power spectra are then normalized by energy, so that each normalized power spectrum can be considered a probability density. The entropy is computed on this probability density and further normalized by the log of the window length. The window size which gives minimum entropy, i.e., the sharpest time-frequency distribution, is chosen as the optimal window size for MFCC analysis.
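A minimal sketch of this selection rule, assuming a small set of candidate window lengths (the candidate sizes below are illustrative) and a magnitude-squared FFT of the signal segment under each candidate window:

import numpy as np


def optimal_window(segment, candidate_sizes=(256, 512, 1024)):
    """Pick the window length whose normalized spectral entropy is smallest.

    segment: 1-D array long enough to cover the largest candidate window.
    Returns the chosen window length in samples.
    """
    best_size, best_entropy = None, np.inf
    for n in candidate_sizes:
        spectrum = np.abs(np.fft.rfft(segment[:n])) ** 2
        p = spectrum / (spectrum.sum() + 1e-12)       # power spectrum as a density
        entropy = -np.sum(p * np.log(p + 1e-12))
        entropy /= np.log(n)                          # normalize by log window length
        if entropy < best_entropy:
            best_size, best_entropy = n, entropy
    return best_size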

B. Adaptive frame rate. The standard speech/sEMG MFCC analysis also assumes a constant frame rate, typically chosen to be half the window length, during feature extraction. This is inconsistent with the fact that formant transition regions contain more discriminative information than the steady-state part of a vowel. The idea behind variable frame rate feature extraction is to start by oversampling in the feature domain, i.e., with a very fast frame rate [24]. A feature-domain information measure is then used to keep only feature frames with entropy above a certain threshold. This idea has been shown to improve the performance of standard MFCC features on noisy acoustic speech recognition tasks.
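The frame-selection idea can be sketched as follows, assuming features have already been extracted at an oversampled frame rate; the per-frame entropy measure and the median-based threshold are simplifications assumed for illustration, not the tuned criterion of [24].

import numpy as np


def select_frames(features, keep_threshold=0.9):
    """Keep only oversampled feature frames whose entropy exceeds a threshold.

    features: n_frames x n_dims array produced with a very short hop (oversampled).
    keep_threshold: fraction of the median frame entropy required to keep a frame
                    (an assumed heuristic used here in place of [24]'s thresholds).
    """
    # Turn each frame into a pseudo-density so an entropy can be computed.
    mags = np.abs(features) + 1e-12
    p = mags / mags.sum(axis=1, keepdims=True)
    entropies = -np.sum(p * np.log(p), axis=1)
    keep = entropies >= keep_threshold * np.median(entropies)
    return features[keep]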

Finally, it is clear that the automatic question generation algorithm for sEMG signals produces significantly different results than the traditionally defined questions for acoustic speech signals. This suggests that adapting the sub-word modeling algorithms to better suit the characteristics of sEMG-based speech signals may yield recognition performance gains.

ACKNOWLEDGMENT

The authors would like to thank Carlo J. De Luca, Gianluca De Luca, Don Gilmore, and Serge Roy of Delsys, Inc. for their assistance in designing and running the sEMG data collection experiments and for their work in developing and providing the sEMG sensors and associated hardware.

REFERENCES

[1] T. Hueber, G. Chollet, B. Denby, M. Stone, “Acquisition of ultrasound, video and acoustic speech data for a silent-speech interface application,” International Seminar on Speech Production, pp. 365-369, Strasbourg, France, 2008.

[2] G.S. Meltzner, J. Sroka, J.T. Heaton, L.D. Gilmore, G. Colby, S. Roy, N. Chen, and C.J. De Luca, “Speech Recognition for Vocalized and Subvocal Modes of Production using Surface EMG Signals from the Neck and Face,” INTERSPEECH 2008, Australia, 2008.

[3] A.D.C. Chan, K. Englehart, B. Hudgins, and D.F. Lovely, "Myoelectric Signals to Augment Speech Recognition," Medical and Biological Engineering & Computing, vol. 39, pp. 500-506, 2001.

[4] B. Betts and C. Jorgensen, "Small Vocabulary Recognition Using Surface Electromyography in an Acoustically Harsh Environment." NASA TM-2005-21347, 2005.

[5] S.C. Jou, L. Maier-Hein, T. Schultz, and A. Waibel, “Articulatory feature classification using surface electromyography,” in Proc. ICASSP 2006, pp 606-608, 2006.

[6] K.-S. Lee, "EMG-Based Speech Recognition Using Hidden Markov Models With Global Control Variables," IEEE Trans. on Biomed. Eng., vol. 55, pp. 930-940, 2008.

[7] T. Schultz and M. Wand, "Modeling Coarticulation in EMG-based Continuous Speech Recognition," Speech Comm., vol. 52, 2010.

[8] M. Wand and T. Schultz, "Session-Independent EMG-based Speech Recognition," Proc. International Conference on Bio-inspired Systems and Signal Processing, 2011.

[9] L. Maier-Hein, F. Metze, T. Schultz, and A. Waibel, "Session Independent Non-Audible Speech Recognition Using Surface Electromyography," IEEE Automatic Speech Recognition and Understanding Workshop, pp. 331-336, 2005.

[10] G. Colby, J.T. Heaton, L.D. Gilmore, J. Sroka, Y. Deng, J. Cabrera, S. Roy, C.J. De Luca, and G.S. Meltzner, "Sensor Subset Selection for Surface Electromyography Based Speech Recognition," Proc. ICASSP 2009.

[11] G.S. Meltzner, G. Colby, Y. Deng, and J.T. Heaton, "Signal acquisition and processing techniques for sEMG based silent speech recognition," Proc. 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 4848-4851, 2011.

[12] Y. Deng, R. Patel, J.T. Heaton, C. Colby, L.D. Gilmore, J. Cabrera, S.H. Roy, C.J. De Luca, and G.S. Meltzner, "Disordered Speech Recognition Using Acoustic and sEMG Signals," INTERSPEECH 2009, United Kingdom, 2009.

[13] M.S. Cheng, “Monitoring Functional Activities in Patients With Stroke.” Sc.D. Dissertation. Boston University, Department of Biomedical Engineering, 2005.

[14] J.S. Garofolo, L.F. Lamel, W. M. Fisher, J.G. Fiscus, D.S. Pallett, N. L. Dahlgren, and V. Zue, “TIMIT Acoustic-Phonetic Continuous Speech Corpus”, Linguistic Data Consortium, 1993.

[15] Special Ops Hand Signals, http://www.specialoperations.com/Focus/Tactics/Hand_Signals/default.htm

[16] Most common 1000 English Phrases, http://www.englishspeak.com/english-phrases.cfm

[17] http://www.netlingo.com/acronyms.php

[18] T. Schultz and M. Wand, "Modeling Coarticulation in EMG-based Continuous Speech Recognition," Speech Communication, vol. 52, no. 4, pp. 341-353, April 2010.

[19] E. Lopez-Larraz, O.M. Ozos, J.M. Antelis, J. Minguez, “Syllable-Based Speech Recognition Using EMG”, IEEE International Conference of the Engineering in Medicine and Biology Society (EMBC), 2010.

[20] R. Singh, B. Raj, and R.M. Stern, "Automatic clustering and generation of contextual questions for tied states in hidden Markov models," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 1999.

[21] D. Povey and P.C. Woodland, "Minimum Phone Error and I-Smoothing for Improved Discriminative Training," Proc. ICASSP 2002, Orlando, FL.

[22] S. Young, G. Evermann, et al., The HTK Book, 2006.

[23] V. Tyagi and H. Bourlard, "On Multi-scale Fourier Transform Analysis of Speech Signals," IDIAP Research Report, June 2003.

[24] H. You, Q. Zhu, and A. Alwan, "Entropy-based variable frame rate analysis of speech signals and its application to ASR," Proc. ICASSP 2004, pp. 549-552.