

Page 1: Multiple Reduced Phoneme Sets for Second Language …interspeech2015.org/wp-content/uploads/2015/09/workshop/stream1/... · Multiple Reduced Phoneme Sets for Second Language Speech

Multiple Reduced Phoneme Sets for Second Language Speech Recognition

Xiaoyun Wang

Graduate School of Science and Engineering, Doshisha University, Kyoto, Japan
[email protected]

Abstract

This paper describes a novel method to improve the performance of second language speech recognition when the mother tongue of users is known. Considering that second language speech usually includes less fluent pronunciation and more frequent pronunciation mistakes, I propose using a reduced phoneme set generated by a phonetic decision tree (PDT)-based top-down sequential splitting method instead of the canonical one of the second language. However, because the proficiency of second language speakers varies widely, the optimal phoneme set depends on the proficiency of each speaker. In this work, I first verify the efficacy of the reduced phoneme set, and then examine the relation between the proficiency of speakers and a reduced phoneme set customized for them. On the basis of the results of this investigation, I propose a novel speech recognition method using multiple reduced phoneme sets to further improve recognition performance considering the various proficiency levels of second language speakers. Experimental results demonstrate the validity of the proposed method.

Index Terms: phonetic decision tree (PDT), reduced phoneme set (RPS), second language speech recognition, language proficiency, multiple reduced phoneme sets

1. Introduction

The rapid progress in transportation systems and information technologies has increased the opportunities for worldwide communication, and many people now have more opportunities to speak in a foreign language. Non-native speakers have a limited vocabulary and an incomplete knowledge of the grammatical structures of the target language. This limited vocabulary forces speakers to express themselves in basic words, making their speech sound unnatural to native speakers. In addition, non-native speech usually includes less fluent pronunciation and mispronunciations even when it is well composed. Human listeners can nevertheless come to understand non-native speech quite easily, because after a while the listener gets used to the style of the talker, i.e., the various insertions, deletions, and substitutions of phonemes or the incorrect grammar.

More problematic is when non-native pronunciation becomes an issue for speech dialogue systems that target tourists, such as travel assistance systems, hotel reservation systems, and systems in which consumers purchase goods through a network. The vocabulary and grammar of non-native speakers are often limited and therefore basic, but a speech recognizer takes little or no advantage of this and is confused by the different phonetics [1].

In order to improve speech recognition accuracy for non-native speech, various methodologies have been proposed for adapting to the acoustic features of non-native speech, including a speaker adaptation method for second language speech recognition [2], a method using state-tying of acoustic modeling (AM) for second language speech with variant phonetic units obtained by analyzing the variability of second language pronunciation [3], AM interpolation with both native and non-native acoustic models [4], and others.

In this work, I first propose a novel speech recognition method that uses a reduced phoneme set to improve recognition accuracy for second language speech. The reduced phoneme sets were created with a phonetic decision tree (PDT)-based top-down sequential splitting method that utilizes not only the phonological knowledge of the mother and target languages and their phonetic features but also the occurrence distribution of the phonemes of the target language produced by second language speakers. I apply this method to the recognition of English utterances spoken by Japanese speakers of various English proficiencies and demonstrate the effect of the reduced phoneme set in comparison with the canonical phoneme set conventionally used in English ASR. Because the overall speech quality of second language speakers depends on their proficiency level in the second language [5], [6], and there are different accent patterns among inexperienced, moderately experienced, and experienced speakers [7], [8], I then investigate the relation between the reduced phoneme set and the English proficiency level of speakers. On the basis of the results of this investigation, I propose a multiple-pass decoding method using a lexicon represented by multiple reduced phoneme sets to further improve the recognition performance of second language speech considering the various proficiency levels of second language speakers.

This paper is structured as follows. Section 2 briefly describes the construction of the reduced phoneme set and the experimental results. In Section 3, the relation between the reduced phoneme set and the proficiency level is discussed. In Section 4, I propose multiple-pass decoding with multiple reduced phoneme sets and describe the implementation of a language model suitable for it. Section 5 presents the experimental results using the multiple reduced phoneme sets. Section 6 concludes with a brief summary and future work.

2. Reduced phoneme set

The reduced phoneme set is created with a phonetic decision tree based top-down sequential splitting method. The phonetic decision tree is a binary tree in which discrimination rules are attached to each splitting step. As discrimination rules, I use knowledge of the phonetic relations between the Japanese and English languages and the actual pronunciation tendencies of English utterances by Japanese speakers. The splitting process uses the phonetic acoustic features of speech by second language speakers and the occurrence distributions of each phoneme as the splitting criterion.


2.1. Criterion of designing phoneme set

I used as the splitting criterion the log likelihood (1), defined by the logarithm of the probability distribution function of an acoustic model generating the second language speech observation data:

L(P_m) ≈ Σ_{t=1}^{T} log P(O_t) · γ_m,   (1)

where P_m represents the joint node pdf of the m-th phoneme or phoneme cluster, O_t denotes the observation data [O_1, O_2, . . . , O_T], and γ_m is the posterior probability of the model generating O_t, which is a good prediction of the occupancy frequency of the canonical phonemes used in typical Japanese-English speech utterances.

To carry out the preliminary splitting, 166 discrimination rules were designed to categorize each phoneme on the basis of phonetic features such as the manner and position of articulation. In this splitting method, all phonemes listed in each discrimination rule based on other phonetic features exhibit similar phonological characteristics and are candidates to be merged into a cluster.
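As a concrete illustration of criterion (1), the cluster log likelihood can be approximated from per-phoneme occupancy counts and sufficient statistics under a single shared diagonal Gaussian per cluster. This is a sketch only: the paper does not spell out the pdf form, so the single-Gaussian pooling below is an assumption borrowed from standard decision-tree clustering practice, and all names are illustrative.

```python
import math

def cluster_log_likelihood(stats, members, dim):
    """Approximate L(S) of eq. (1) for a phoneme cluster S.

    Assumption (not from the paper): the cluster pdf is a single
    diagonal Gaussian fit by maximum likelihood to the pooled data,
    as in standard PDT-based state clustering.

    stats[p] = (gamma, sum_x, sum_xx): occupancy count and per-dimension
    first/second-order sufficient statistics for phoneme p.
    """
    gamma = sum(stats[p][0] for p in members)
    if gamma <= 0.0:
        return 0.0
    log_det = 0.0
    for i in range(dim):
        sx = sum(stats[p][1][i] for p in members)
        sxx = sum(stats[p][2][i] for p in members)
        # ML variance of the pooled data, floored for numerical stability
        var = max(sxx / gamma - (sx / gamma) ** 2, 1e-8)
        log_det += math.log(var)
    # log likelihood of the pooled data under the pooled Gaussian
    return -0.5 * gamma * (dim * math.log(2.0 * math.pi) + log_det + dim)
```

Under this approximation, splitting a cluster whose member phonemes have very different acoustics yields a large likelihood gain, which is exactly what the discrimination rules are selected to exploit.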

2.2. Procedure to design phoneme set

I used the following 4-step procedure to design a reduced phoneme set:

1. Set all merging phonemes as a root cluster at the initial state. Calculate the increase in log likelihood ΔL_R according to equations (1) and (2), assuming that cluster S is partitioned into S_y(R) and S_n(R) by a discrimination rule R:

ΔL_R = L(S_y(R)) + L(S_n(R)) − L(S),   (2)

where ΔL_R is the increase in log likelihood of the phoneme cluster, which is calculated for all discrimination rules applicable to every cluster.

2. Select the rule with the maximum log likelihood increase compared with before splitting:

R* = argmax_{all R} ΔL_R.   (3)

The rule R* causing the maximum increase is chosen as the splitting rule.

3. Split the phoneme cluster according to the selected discrimination rule R*.

4. Check whether the increase in log likelihood is less than the threshold. If it is, output the final reduced phoneme set; if not, repeat the splitting process.
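The 4-step procedure above can be sketched as a greedy loop over clusters and rules. This is a minimal sketch, not the paper's implementation: the function names, rule representation, and the toy likelihood function used in any test are illustrative assumptions.

```python
def grow_reduced_phoneme_set(root_members, rules, L, threshold):
    """Greedy top-down PDT splitting (steps 1-4), sketched.

    root_members: all phonemes, starting as one root cluster (step 1).
    rules: list of (name, phoneme_subset); rule R partitions a cluster S
           into S_y(R) = S & subset and S_n(R) = S - S_y(R).
    L: function mapping a cluster (frozenset of phonemes) to its log
       likelihood, approximating eq. (1).
    threshold: stop when the best gain Delta L_R falls below this value
               (step 4). All names here are illustrative.
    """
    clusters = [frozenset(root_members)]
    while True:
        best = None  # (gain, cluster_index, s_yes, s_no)
        for idx, s in enumerate(clusters):
            for _, subset in rules:
                s_yes = s & frozenset(subset)
                s_no = s - s_yes
                if not s_yes or not s_no:
                    continue  # rule does not actually split this cluster
                gain = L(s_yes) + L(s_no) - L(s)      # eq. (2)
                if best is None or gain > best[0]:
                    best = (gain, idx, s_yes, s_no)   # argmax, eq. (3)
        if best is None or best[0] < threshold:
            return clusters  # final reduced phoneme set
        _, idx, s_yes, s_no = best
        clusters[idx:idx + 1] = [s_yes, s_no]  # step 3: apply rule R*
```

The number of clusters returned corresponds to the size of the reduced phoneme set, so varying the threshold traces out the 38- to 25-phoneme sets examined in the experiments.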

There are two reasons the reduced phoneme set can provide suitable phonological decoding when the mother tongue of speakers is known. The first is that it can be designed to characterize the acoustic features of Japanese-accented English more correctly. The second is that there is more speech data for training the acoustic model of each phoneme in the reduced phoneme set than in the canonical one, so more reliable parameter estimates can be obtained for the acoustic models.

Figure 1: Word accuracies for phoneme sets of various sizes.

2.3. Experiment with the reduced phoneme set

2.3.1. Phoneme set and modeling

The phonemic symbols of the TIMIT database were used as the canonical phoneme set. The initial phoneme set had 41 phonemes, consisting of 17 vowels and 24 consonants. After iterative splitting, I chose various phoneme sets with sizes ranging from 38 down to 25 for the recognition experiment. An English as a Second Language (ESL) speech database comprising 80,409 utterances read by 200 Japanese students [9] was used to train context-dependent state-tying triphone HMM acoustic models for the various phoneme sets. I developed a bigram language model from 5,000 transcribed utterances by 55 Japanese university students. The pronunciation lexicon included about 35,000 vocabulary words related to conversation about travel abroad.

2.3.2. Evaluation data

The participants were 20 Japanese students whose English communication levels were measured with the Test of English for International Communication (TOEIC). Their scores ranged from 380 to 910 (990 being the highest attainable score). As evaluation data, 1,420 orally translated English utterances (semi-spontaneous) were collected using a dialogue-based CALL system [10].

2.3.3. Experimental results

Figure 1 shows the word accuracies for the canonical phoneme set and the various reduced phoneme sets, whose sizes range from 38 down to 25. As shown, all reduced phoneme sets provided better word accuracy than the canonical one, and the 28-phoneme set obtained the best performance for the average proficiency level of second language speakers.

3. Relation between optimal phoneme set and English proficiency level

Phoneme mismatches resulting in mis-recognition of second language speech can be reduced by reducing the number of phonemes. The tendency toward mispronunciation depends on the proficiency level of speakers. It is generally expected that phoneme mismatch is more frequent in speech by those with a lower level of proficiency and that the optimal number of reduced phonemes may vary depending on speaker proficiency. In order to examine this hypothesis, I conducted an experiment to determine the relation between the reduced phoneme set and the proficiencies of speakers under the experimental conditions presented in the next subsections.


Figure 2: Relative error reduction for speech by participants in different TOEIC score ranges.

3.1. Evaluation data

I also collected 3,150 orally translated speech items using a dialogue-based CALL system [10] as evaluation data. The speaking styles range widely from ones similar to conversation to ones closer to read speech. There were a total of 45 participants aged 18 to 24 who had acquired Japanese as their mother tongue and learned English as their second language. In this study, the Test of English for International Communication (TOEIC) score was used for measuring the English proficiency of the speakers.

The participants' English proficiencies ranged from 300 to 910. I divided their scores into 5 levels based on TOEIC score range: lower than 500, 500–600, 600–700, 700–800, and higher than 800, with 10, 10, 10, 8, and 7 participants in each respective score range.

3.2. Recognition performance in different proficiencies

I used the HTK toolkit [11] to compare ASR performance using the canonical and reduced phoneme sets for speech by participants in each of the 5 TOEIC score ranges. Figure 2 shows the relative error reduction of the various reduced phoneme sets compared with the canonical one for speech by participants in each of the 5 TOEIC score ranges. It shows that:

• All reduced phoneme sets achieved error reduction compared with the canonical phoneme set for all TOEIC score ranges.

• The optimal number of phonemes in the reduced phoneme sets varies depending on the English proficiency level of the speakers.

• The recognition performance of the reduced phoneme set improved more for speech by participants with lower-level proficiency than for those with higher-level proficiency.

• The optimal number of phonemes for speakers with lower-level proficiency was smaller than that for those with higher-level proficiency.

The results in Fig. 2 suggest that the optimal reduced phoneme set corresponding to the English proficiency of speakers is a 25-phoneme set for speakers with a TOEIC score of less than 500, a 28-phoneme set for those with a 500–700 score, and a 32-phoneme set for those with scores higher than 700. This shows that the effect of the reduced phoneme set differs depending on the proficiency of second language speakers.

4. ASR system with multiple reduced phoneme sets

Because the proficiency of second language speakers varies widely, it seems inadequate to use a single phoneme set to recognize input speech by all second language speakers. In this section, I propose an ASR system using multiple reduced phoneme sets to further improve the recognition performance of second language speech considering the various proficiency levels of second language speakers. On the basis of the experimental results in Section 3.2, I selected the 25-, 28-, and 32-phoneme sets as the components of the multiple reduced phoneme sets to capture the various proficiency levels of second language speakers.

I assumed that one of the three phoneme sets, not a mixture, is used to recognize the input speech of a single second language speaker, in order to resolve the problem caused by Viterbi decoding. To fulfill this constraint, I developed a lexicon in which the pronunciation of each lexical item is represented by the multiple reduced phoneme sets, and a language model implementing the constraint, as described in the following.

4.1. Lexicon

Some phonemes in the canonical phoneme set are distributed differently among the 25-phoneme, 28-phoneme, and 32-phoneme sets. Specifically, some lexical items are represented by a single phoneme sequence, consisting only of phonemes that are either not merged or merged into the same clusters by the phonetic decision tree (PDT) in all three reduced phoneme sets. The other lexical items in the lexicon are represented with three different phoneme sequences in the multiple reduced phoneme sets.

In the lexicon used for the experiment, 62.9% of the lexical items have a single phoneme sequence and 37.1% have multiple phoneme sequences. I added a symbol that differentiates words with a single phoneme sequence from those with multiple phoneme sequences.
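The single/multiple distinction can be sketched as a small lexicon-building pass: a word whose three pronunciations collapse to one sequence is tagged as single, otherwise its variants are kept per phoneme set. The data layout, tag names, and the merged-phoneme labels in any example are illustrative assumptions, not the paper's actual lexicon format.

```python
def build_tagged_lexicon(pronunciations):
    """Tag each word by whether its pronunciations in the 25-, 28-, and
    32-phoneme sets collapse to a single sequence.

    pronunciations[word] = {25: seq, 28: seq, 32: seq}, where each seq
    is a list of (possibly cluster-merged) phoneme labels. The layout
    and tags ("SINGLE"/"MULTI") are illustrative.
    """
    lexicon = {}
    for word, variants in pronunciations.items():
        unique = {tuple(seq) for seq in variants.values()}
        if len(unique) == 1:
            # one shared sequence; no per-set distinction needed
            lexicon[word] = {"tag": "SINGLE", "prons": {None: next(iter(unique))}}
        else:
            # keep one pronunciation per reduced phoneme set
            lexicon[word] = {"tag": "MULTI",
                             "prons": {k: tuple(v) for k, v in variants.items()}}
    return lexicon
```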

4.2. Language model

A simple method for training a stochastic language model is to train it independently of the structure of the lexicon. This can be done simply by counting word occurrences in the training corpus, assuming that words represented with multiple phoneme sequences have multiple pronunciations.

Since the probabilities leaving the start arc of each word must add up to 1.0, each of the pronunciation paths through such a multiple-pronunciation HMM word model has a smaller probability than the path through a word with only a single pronunciation path. A Viterbi decoder can follow only one of these pronunciation paths and may ignore a word with many pronunciations in favor of an incorrect word with only one pronunciation path. It is well known that the Viterbi approximation penalizes words with many pronunciations [12].

In order to resolve the degradation of speech recognition performance stemming from multiple pronunciations, the language model only permits transitions between words represented by the same reduced phoneme set and words with a single pronunciation, while inhibiting transitions among words represented by different reduced phoneme sets, as shown in Fig. 3.

Page 4: Multiple Reduced Phoneme Sets for Second Language …interspeech2015.org/wp-content/uploads/2015/09/workshop/stream1/... · Multiple Reduced Phoneme Sets for Second Language Speech

Figure 3: Schematic diagram of language model for words rep-resented by 25-, 28-, and 32-phoneme sets and for words withsingle phoneme set sequence. Arcs depict transition between

words represented by the same reduced phoneme set andwords of single pronunciation. The transition among wordsrepresented. by different reduced phoneme sets is inhibited.

4.3. Multiple-pass decoding

In order to take complete advantage of language model imple-menting the lexical items with increased multiple pronuncia-tions, we use a multiple-pass decoding algorithm that modifiesthe Viterbi algorithm to return the N-best word sequences fora given speech input. The bigram grammar mentioned in Sec-tion 4.2 is used in the first pass with an N-best Viterbi algo-rithm to return the 50 most highly probable sentences. The 50-best list is used to create a more sophisticated LM that allowstransition among the words with multiple reduced phoneme setsequences. This LM allows transition among the words withdifferent reduced phoneme set calculates likelihood of concate-nating words with different reduced phoneme set in the secondpass. Sentences are thus re-scored and re-ranked according tothis more sophisticated probability [12].

5. Experimental results using multiplereduced phoneme sets

To evaluate the proposed method, I compared the performanceon ASR implementing the proposed method with that of thecanonical phoneme set and three single reduced phoneme sets.I also compared the multiple-pass decoding with a mixturetransition that allows transitions among all words representedwith multiple reduced phoneme set sequences in the first-pass.Figure 4 shows the word accuracy of the canonical phonemeset, various single reduced phoneme sets, mixture transition,and multiple-pass decoding with the multiple reduced phonemesets. It shows that:

• Multiple-pass decoding with multiple reduced phonemesets had a better performance than the canonicalphoneme set, all of the single reduced phoneme sets, andthe mixture transition.

• There were significant differences between the word ac-curacy of multiple-pass decoding with multiple reducedphoneme sets and the canonical phoneme set, the 32-phoneme set, the 28-phoneme set, the 25-phoneme set,and the mixture transition (paired t-test, t(44) = 2.02, p<0.01).

Figure 4: Word accuracy of canonical phoneme set, three singlereduced phoneme sets, mixture transition and multiple-pass

decoding with multiple reduced phoneme sets.

6. Conclusion and future workIn this study, I presented a novel method of designing a reducedphoneme set for recognizing English spoken by Japanese. Thespeech recognition results obtained for second language speechshowed that the reduced phoneme set with the proposed methodachieved a better improvement of speech recognition than thecanonical one. I also clarified the relation between the secondlanguage speakers and an optimal reduced phoneme set. Onthe basis of this analysis, I proposed a speech recognition tech-nique using multiple reduced phoneme sets to capture the vari-ous proficiency levels of second language speakers. The methodusing multiple-pass decoding with multiple reduced phonemesets was able to further improve the recognition performancefor second language speech collected with a translation gametype dialogue-based CALL system for each proficiency levelcompared with the canonical phoneme set and single reducedphoneme sets.

I plan to investigate the relation between the features ofspeech by each speaker and the optimal reduced phoneme setin future.

7. References[1] R. E. Gruhn, W. Minker, and S. Nakamura, Statistical pronuncia-

tion modeling for non-native speech processing, Berlin Heidelberg:Springer, 2011.

[2] D. Luo, Y. Qiao, N. Minematsu, Y. Yamauchi & K. Hirose,"Regularized-MLLR speaker adaptation for computer-assisted lan-guage learning system." Proc. ISCA, Chiba, Japan, pp. 594-597,Sept. 2010.

[3] Y. Oh, J. Yoon & H. Kim, "Acoustic model adaptation based onpronunciation variability analysis for non-native speech recogni-tion." Speech Communication, vol. 49, issue. 1, pp. 59-70, Jan.2007.

[4] K. Livescu, "Analysis and modeling of non-native speech for auto-matic speech recognition." Ph.D Thesis, Massachusetts Institute ofTechnology, 1999.

[5] K. Osaki, N. Minematsu, and K. Hirose, "Speech recognition ofJapanese English using Japanese specific pronunciation habits," IE-ICE Technical report, vol. 102, no. 749, pp. 7–12, 2003.

[6] J. Van Doremalen, C. Cucchiarini, and H. Strik, "Optimizing auto-matic speech recognition for low-proficient non-native speakers,"EURASIP Journal on Audio, Speech, and Music Processing, 2010.

Page 5: Multiple Reduced Phoneme Sets for Second Language …interspeech2015.org/wp-content/uploads/2015/09/workshop/stream1/... · Multiple Reduced Phoneme Sets for Second Language Speech

[7] P. Trofimovich and W. Baker, "Learning second language supraseg-mentals: Effect of L2 experience on prosody and fluency char-acteristics of L2 speech," Studies in Second Language Acquisi-tion(SSLA), vol. 28, no. 01, pp. 1–30, 2006.

[8] J.E. Flege, "Factors affecting degree of perceived foreign accent inEnglish sentences," The Journal of the Acoustical Society of Amer-ica, vol. 84, no. 1, pp. 70–79, 1988.

[9] N. Minematsu, Y. Tomiyama, K. Yoshimoto, K. Shimizu, S. Nak-agawa, M. Dantsuji, and S. Makino, "Development of Englishspeech database read by Japanese to support CALL research," Proc.ICA. vol. 1, pp. 557–560, 2004.

[10] X. Wang, J. Zhang, M. Nishida, and S. Yamamoto, "Phoneme SetDesign Using English Speech Database by Japanese for Dialogue-based English CALL Systems," Proc. LREC, Reykjavik, Iceland,pp. 3948–3951, 2014.

[11] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu,G. Moore, J. Odell, D. Ollason, and D. Povey, HTK Speech Recog-nition Toolkit version 3.4, Cambridge University Engineering De-partment, 2006.

[12] D. Jurafsky and J.H. Martin, "Speech & language processing,"Pearson Education India, 2000