
ARCHIVES OF ACOUSTICS, DOI: 10.2478/v10168-010-0027-z, Arch. Acoust., 35, 3, 309–329 (2010)

The Use of Speech Technology in Foreign Language Pronunciation Training

Grażyna DEMENKO (1),(2), Agnieszka WAGNER (1), Natalia CYLWIK (1)

(1) Adam Mickiewicz University, Institute of Linguistics, Department of Phonetics, Al. Niepodległości 4, 61-874 Poznań, Poland, e-mail: {lin, wagner, nataliac}@amu.edu.pl

(2) Poznań Supercomputing and Networking Center, Z. Noskowskiego 12/14, 61-704 Poznań, Poland

(received April 14, 2010; accepted July 20, 2010)

In recent years the application of computer software to the learning process has proven to be an effective tool supporting traditional teaching methods. Particular focus has been put on the application of techniques based on speech and language processing to second language learning. Most commercial self-study programs, however, do not allow the teacher to introduce an individualized learning course, and they concentrate on segmental features only. The paper discusses the use of speech technology in the training of foreign language pronunciation and prosody and defines pedagogical requirements for effective training with CAPT systems. In this context, steps taken in the development of the intelligent tutoring system AzAR3.0 (German: 'automat for accent reduction') in the scope of the Euronounce project (Cylwik et al., 2008) are described, with a focus on the creation of the linguistic content. In response to the European Union's call for promoting less widely spoken languages, the project focuses on German as a target language for native speakers of Polish, Slovak, Czech, and Russian, and vice versa. The paper presents the design of the speech corpus for the purposes of the tutoring system and the analysis of pronunciation errors. The results of the latter provide information which is important for Automatic Speech Recognition (ASR) training on the one hand, and for automatic error detection and feedback generation on the other. Finally, the Pitch Line software for implementation in the prosody visualization and training module of the AzAR3.0 tutoring system is described.

Keywords: speech technology, pronunciation training, feedback, phonetic-acoustic databases, multimedia tools.


1. Introduction

The increasing use of speech technology can especially be seen in the area of foreign language education, which has led to the development of a new discipline known as Computer-Assisted Language Learning (CALL). The literature on CALL mentions a number of its potential advantages: elimination of time limitations and of dependence on the teacher; the possibility to work at the user's own tempo and to store his/her profile in order to monitor progress; constant access to a number of additional materials such as visualizations, recordings and animations; as well as elimination of the stress related to the fact that the learner is being listened to by his/her classmates, the last of which is particularly important while acquiring L2 (second language) pronunciation and prosody. Computer-assisted language learning affords a creative approach in the field of language teaching, e.g. didactical constructivism in a multimedia environment based on project-oriented collaboration of students and teachers. Additionally, the implementation of information technology in a multilingual language tutoring system increases the interest of young people who are familiar with computers and communication devices.

For many years, pronunciation was neglected in favor of grammar and vocabulary in language learning, because it was generally believed that training had no significant impact on pronunciation or could even have a detrimental effect (Krashen, Terrell, 1983; Scovel, 1988). However, the acknowledgement that a reasonably intelligible pronunciation constitutes an essential component of communicative competence, and the results of later research which showed that appropriate training could significantly improve pronunciation (e.g. Bongaerts, 1999), led to an increasing interest in pronunciation learning. One of the results was the development of the first CAPT (Computer-Assisted Pronunciation Training) systems (for an overview see Subsec. 2.2).

Due to their inherent complexity, the lack of knowledge of adequate prosody processing for both linguistic and technological needs, and the ensuing difficulty in their acquisition, intonation and other prosodic phenomena like rhythm and voice quality were ignored in language teaching for many years. There appear to be several reasons for the generally growing interest in intonation which we have been witnessing in recent years. First, there have been important new advances in the theory of intonation, its functions and forms, aided by the growing accessibility of acoustic signal analysis, processing and interpretation. Second, the expansion of the analytical domains of traditional linguistics from sounds and words to larger units of inquiry such as phrases, discourse and interactions has drawn attention to such subfields as pragmatics, discourse analysis, and conversation analysis. Third, applied linguistics has grown to give priority to the communicative function of prosody rather than its linguistic form.

The current research in the field of CALL focuses on a more effective integration of computer technology with the learning and teaching of languages and on including the prosodic factor. The main goal of this paper is, therefore, to integrate both the segmental and suprasegmental aspects, especially in discourse and interaction, and to suggest a complex framework for studying foreign language pronunciation with the AzAR3.0 software (Jokisch et al., 2005; Cylwik et al., 2008).

The core of AzAR3.0 is a pronunciation trainer with integrated multilingual speech recognition and audio-visual feedback functionality. The system provides speech signal analysis of the user's speech input, which allows the user to compare his/her pronunciation with that of the tutor and to receive selective information on how to improve articulation.

AzAR3.0 includes PC-based and PDA-based learning programs, a web-based tool for providing the learning content (an 'authoring system') and a reference speech database for a given language pair (German as a target language and Russian, Polish, Czech or Slovak as source languages, and vice versa). In order to ensure the effectiveness of the courseware, the system was designed according to strict pedagogical guidelines.

2. Challenges in computer-assisted language training

2.1. Pedagogical requirements for CAPT

Some of the factors affecting pronunciation learning cannot be manipulated, e.g. the first language and the length of exposure to L2, whereas others can be controlled and thus effectively used in CAPT systems. The latter include input, output and feedback (Neri et al., 2002b). First of all, the learner should have access to a large number of L2 speech samples from multiple speakers, in order to improve the perception of novel contrasts and to build models of general phonetic categories. Apart from the auditory channel, the visual one should be available (visualizations of the articulatory movements), as the combination of the two channels improves the production and perception of L2 contrasts. It is important that the input is meaningful to the learner and that the learning context is relevant and realistic (e.g. by including scenarios resembling real-life situations) in order to stimulate the learner's motivation.

The second factor, i.e. the output, refers to speech production. Output is essential for pronunciation improvement – the strong accent of some long-term residents suggests that mere exposure to L2 (input) is not sufficient for that purpose. The production of their own output gives the learners a chance to test their hypotheses on the L2 sounds. By comparing their output with the input model, learners can form correct L2 representations. The exercises for speech production should be varied, realistic and engaging so as to boost learners' motivation. They should not be limited to listening to and repeating minimal pairs or isolated sentences, because such controlled practice does not necessarily lead to an improvement in a real conversation. Interaction with the system, e.g. in role-plays with characters or interactive dialogues, is a more effective method, but it can be applied only if ASR technology specially tuned for the recognition of non-native speech is available. What could also be useful for the learner from the pedagogical point of view is the possibility to self-monitor the progress.

The third essential factor in pronunciation training is the feedback. Explicit and detailed feedback is necessary to make the learner aware of the discrepancies between his/her production (output) and that of the model speaker (input). Interferences from L1, which can be considered the main source of pronunciation errors, can sometimes prevent learners from perceiving such discrepancies. As shown in (Lyster, 1998), immediate feedback is essential – so, ideally, it should be provided by means of ASR technology. It should involve the aural and visual channels and provide assessment of those errors which are crucial for intelligibility, and thus it should take into account both the segmental and suprasegmental features. However, besides scoring learners' production (ideally, on the basis of some hierarchy of mispronunciations with regard to intelligibility), the feedback should provide information on the type and location of an error and instruct the learner how to correct it.

The tutoring system AzAR3.0 presented in this paper meets the pedagogical requirements defined here. It provides the learner with an extensive input which consists of speech produced by two model speakers per language (the reference database, see Subsec. 3.2). Apart from the aural channel, AzAR offers a visual mode including an animated visualization of the vocal tract (lip area and articulator movements) as well as a formant graph for specific phones and a display of the speech signal under the transcribed and phonemically segmented reference utterances. For the output, a comprehensive curriculum was provided. It was created on the basis of extensive analyses (Subsec. 3.3) of non-native speech databases (see Subsec. 3.2) and thus it concentrates on those segmental and suprasegmental aspects which are crucial for intelligibility and which are typical and common among L2 learners with a specific L1. Finally, the application of ASR technology in AzAR3.0 makes it possible to generate individualized and immediate corrective feedback (details are given in Sec. 4).

2.2. Audio and visual training

The application of multimedia tools for audiovisual feedback to detect deviations from standard articulation in the target language has shown an especially high effectiveness in PC-based pronunciation learning systems. Although prosody visualization seems to be more complex, speech analysis has been used for teaching L2 intonation patterns since the 1970s, e.g. (Abberton, Fourcin, 1975; de Bot, Mailfert, 1982). The main principle is that the sound waveform or pitch contour of the student's utterance is visually displayed alongside that of the teacher. An example of a program that displays visual pitch curves is a product of Kay Elemetrics called Visi-Pitch, which has been available for a number of years for DOS-based personal computers. With Visi-Pitch, the students are able to see the reference speaker's and, simultaneously, their own intonational curve. A highly valued system which provides clear and intelligible feedback on intonation, stress and rhythm is Better Accent Tutor, developed by Kommissarchik (2000) for teaching American English prosody to non-native speakers of English.

The main shortcomings of the hardware and software currently used for prosody training can be summarized as follows:

• Technical aspects:
– Low speech signal energy;
– No extrapolation for voiceless sounds;
– Not entirely correct/reliable F0 extraction;
– Lack of voice quality visualization.

• Methodological shortcomings:
– Lack of user-friendliness, i.e. learners do not know how to interpret displays and to evaluate results;
– Examples and exercises are focused on word- and sentence-level intonation;
– Lack of integration of such prosodic features as tone, duration, loudness;
– Lack of voice quality analysis – even when a learner can produce individual sound segments which are very similar to those produced by the teacher, they may still sound 'wrong' due to overall voice quality.

It should also be noted that the importance of auditory and/or visual feedback with regard to prosody is difficult to assess, because computer programs providing feedback require the learners to monitor and evaluate themselves critically. Apart from the visual display, no further feedback is provided and there is a lack of objective assessment.

2.3. Speech technology in language learning

Although the majority of recent studies have demonstrated the effectiveness of audio and visual training in improving learners' perception, several problems remain unsolved.

One of the problems found in some of the earlier software programs was the lack of feedback processing: for instance, the learner's pitch could be measured and instantly displayed, but interruptions in the intonation contour during voiceless parts of the utterance and the inclusion of perceptually irrelevant pitch variations made it difficult for the learner to interpret the feedback.

Speech synthesis is not widely used in CALL systems. As it often sounds artificial, at present most developers seem to prefer recordings of natural voices. However, there have been attempts to investigate the possibility of using synthetic stimuli in language teaching. The potential for using speech synthesis in CALL applications lies in text-to-speech synthesis, and especially in integrating speech synthesis with visual models of the face, mouth and vocal tract.

A number of studies, e.g. (Dalby, Kewley-Port, 1999; Anderson, Kewley-Port, 1995), on the other hand, showed the usefulness of automatic speech recognition in pronunciation training. Nevertheless, ASR systems perform poorly when confronted with non-native speakers (Morgan, 2004), and various methods have been proposed to enhance their performance with non-native speech, e.g. (Goronzy, 2002; Bouselmi et al., 2007). The effectiveness of CALL systems based on speech recognition is determined not only by the capabilities of the speech recognizer but also by (a) the type of feedback and teaching method and (b) the inclusion of repair strategies to safeguard against recognizer errors. Unfortunately, at present speech recognition systems are poor at handling information contained in the speaker's prosody. The limitations of the technology imply that the learner's utterances have to be predictable and that the detection of errors is only possible with a limited degree of detail, which makes it difficult to give the learner corrective feedback. However, speech recognition systems can be used to measure the speed at which the learners speak, and the rate of speech has been shown to correlate with L2 learners' proficiency.
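Such a rate-of-speech measure is straightforward to derive from a recognizer's word-level time alignment. The sketch below is only an illustration under an assumed (word, start, end) tuple format; real ASR toolkits expose equivalent timing information in their own data structures.

```python
# Hedged sketch: estimating speaking rate from a recognizer's word-level
# alignment. The (word, start_s, end_s) tuple format is hypothetical.
def words_per_second(alignment: list[tuple[str, float, float]]) -> float:
    """Words per second over the span from first word onset to last word offset."""
    if not alignment:
        return 0.0
    total = alignment[-1][2] - alignment[0][1]  # speech span in seconds
    return len(alignment) / total

align = [("ich", 0.10, 0.35), ("lerne", 0.40, 0.85), ("Deutsch", 0.90, 1.50)]
print(words_per_second(align))  # ~2.1 words per second
```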

Advanced PC-based learning systems (e.g. Pronunciation Power, American Sounds, Phonics Tutor, Eyespeak) include (see http://www.learningvillage.com/html/guide.html or https://calico.org/CALICO%20Review/): 1) speech-analyzing windows or frames, 2) Internet-based features like email answering, online help and chat sessions with human tutors, 3) animated visualizations of the articulatory mechanics, video clips showing jaw, lip and tongue movements, and waveform patterns of sound samples. Users are able to record sound files and to acoustically compare a graphical representation of their utterances to those of the instructor. A few systems (e.g. Fonix iSpeak 3.0, Pro-Nunciation) include synthesized speech or TTS solutions. During the last decade, speech recognition technology has been implemented in innovative interactive systems like Istra and Pronto (Dalby, Kewley-Port, 1999) and in the European research project Interactive Spoken Language Education, ISLE (http://nats-www.informatik.uni-hamburg.de/~isle/). The ISLE system targets German and Italian learners of English and aims at providing feedback focusing mainly on word-level errors, i.e. it points to mispronunciations of specific sounds and lexical-stress errors. While this feedback design seems satisfactory, the system yields poor performance results. Non-native speech databases have been created, and some researchers have investigated the possibility of improving the performance of speech recognizers and tried various probabilistic models to produce pronunciation scores from the phonetic alignments generated by HMM-based acoustic models (Teixeira et al., 2000). In the FLUENCY project (Eskenazi, Hansma, 1998), a speech recognizer was used to detect foreign speakers' pronunciation errors for L2 training. The research also involved prosody, and the correlation between pronunciation and prosody errors was investigated. However, neither information on the placement of the intonation errors nor suggestions on how to improve the intonation were provided, leaving the comparison task to the users. A well-known software package, Tell Me More by Auralog, improved error detection and feedback for pronunciation practice by pointing out erroneous phonemes and showing a 3D animation to visualize the 'standard' articulation. Through ASR, the computer recognizes the student's utterance and moves on to an appropriate conversational exchange. A similar method is used by U.S. Army researchers and by the developers of TraciTalk to design game-like programs to teach L2 (Neri et al., 2002a). In this case, the student orally asks the computer to perform a task such as 'put the book on the table'. If the computer understands the utterance, it will perform the action required by the student. This type of feedback is very effective in reinforcing correct pronunciation behavior. However, all these programs are unable to offer help if a student cannot make him/herself intelligible because, for instance, he/she cannot correctly pronounce a sound. Additionally, their technology for suprasegmentals (concerning only intonational aspects) is very limited.

3. Towards optimal technology for L2 pronunciation training

3.1. Euronounce project

The Intelligent Language Tutoring System with Multimodal Feedback Functions (acronym: Euronounce) is a project within the framework of the European Commission's Lifelong Learning Programme, which aims at creating L2 pronunciation and prosody teaching software. The Euronounce project was preceded by two earlier projects carried out by the Euronounce coordinator, TU Dresden, between 2004 and 2007. As a result, the audio-visual systems AzAR and AzAR2.0, aimed at teaching German pronunciation to Russian learners, were created (Jokisch et al., 2005). Following the baseline developed in these projects, the Euronounce project aims at creating software for German as a source language (L1) and Polish, Slovak and Czech as target languages (L2), and vice versa, adding suprasegmental exercises beside the segmental ones. In accordance with the current emphasis on communicative and socio-cultural competence, more attention is paid to discourse-level communication and to cross-cultural differences in pitch. In order to achieve these goals, a new version of AzAR (AzAR3.0), developed in the scope of the Euronounce project, will also improve the underlying speech and visualization technology, e.g. by a more in-depth and more specific analysis of prosodic features.

3.2. Speech databases and speaker selection

It seems clear that in order for pronunciation tutors to be successful, not only the target but also the source language needs to be taken into account (Neri et al., 2002a; Eskenazi, 1999). This is understandable if we keep in mind that most errors result from L1 and L2 interference and consist primarily in transferring allophonic and phonotactic rules from the mother tongue to the target language and in replacing L2 phonemes with their most similar L1 counterparts (Flege, 1995). Taking only L2 into account is one of the main flaws of ASR-based pronunciation tutors, as they mostly fail to recognize non-native speech. For that reason, in the development of AzAR3.0 for the language pairs Polish-German and German-Polish (as well as for other languages), three speech databases were created:

1. Reference database – target-language speech by professional speakers of the target language;
2. Non-native speech database – target-language speech by non-native speakers;
3. Source-language accent database – source-language speech by native speakers of the source language.

The reference database consists of a set of reference utterances for a given L1–L2 pair, produced by two native speakers (one male and one female), which serve as template utterances for exercises designed in a way that allows practicing production as well as perception at the phonemic and prosodic levels in isolated words, simple phrases, complex phrases and continuous speech.

The non-native database consists of recordings of non-native speech produced by 36 speakers per language pair (18 speakers per language direction). The speakers were recruited from among students of the target language with different degrees of proficiency specified according to the Common European Framework of Reference for Languages, i.e. levels A1-A2, B1-B2, C1-C2. The database was balanced with respect to the sex and proficiency of the speakers. The proficiency ratings were based partly on the information provided by the students in a questionnaire which also included a self-judgment of their language skills.

The non-native database contains recordings of six tests including different types of texts:

Accent test – a collection of 125 sentences containing those Polish (and other target languages') sounds and phonetic phenomena which are considered difficult from the point of view of a German learner, e.g. Polish [x] in words such as 'ich' (Eng. 'their'), which Germans might pronounce as [C].

Dialectological test – 124 sentences containing words with alternative pronunciations depending on the dialect spoken, and a full range of Polish phonemes in different contexts and word and sentence positions: vowels in minimal pairs, e.g. tik:tak (Eng. 'tick':'yes'), consonants in voiceless vs. voiced oppositions, e.g. pić:bić (Eng. 'drink':'hit'), vowels in stressed vs. unstressed positions, e.g. ma:mama (Eng. 'has':'mum'), etc.

Spontaneous speech test – addressed only to more advanced students. It consists of four simple tasks such as finishing a sentence, e.g. 'My hobby is. . . ', and explaining the meaning of a proverb or idiomatic expression commonly known both in Poland and Germany. The purpose of the test was to collect speech samples for the final assessment of the learners' proficiency level.

Continuous speech test – three passages (72 sentences altogether) taken from stories by H.Ch. Andersen and the Brothers Grimm, two of which were addressed to upper-intermediate and advanced students only.

Phondat corpus – three sets of phonetically rich and balanced sentences (341 altogether) for the purposes of ASR training and testing, and for collecting mispronunciations of consonant clusters. Polish is an exceptionally consonantal language allowing for sequences of even four or five consonants in a word, e.g. drgnąć (Eng. 'to quiver'), which is often a source of pronunciation errors for foreign learners.

Prosody test – a set of 59 sentences designed so as to collect evidence of those prosodic errors that are most easily detectable and most crucial for comprehension, e.g. erroneous stress placement or non-native-like vowel duration. The purpose of the test was to investigate the realization of prosodic/intonational features by advanced L2 learners and L1 interferences in the domain of prosody. Details concerning the structure of the test are given in (Cylwik et al., 2009).

Except for the spontaneous speech test, the recorded speech material was automatically transcribed and aligned on the phoneme level. Speech samples in Polish were annotated using a modified version of Polish SAMPA (Demenko et al., 2003) and non-native German speech data were annotated according to German SAMPA (www.bas.uni-muenchen.de/Bas/BasSAMPA).

The source-language accent database has been collected for training a general-purpose speech recognizer. It consists of at least 50 hours of speech provided by more than 100 speakers.

The recordings for the speech databases were conducted using the WiGErec software, in a simple studio with low noise and reverberation. The basic quality requirements were: a sampling frequency of 44.1 kHz, a minimal resolution of 16 bit, and a minimal SNR of 35 dB. The recordings for the reference speech database were conducted in a professional studio to ensure the highest quality.
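The SNR requirement above lends itself to an automatic check during recording sessions. A minimal sketch, assuming the speech and silence portions of a recording have already been identified (the segmentation itself is outside the scope of this illustration), follows.

```python
# Hedged sketch of verifying the 35 dB minimal SNR requirement stated above,
# given arrays of speech samples and of background-noise (silence) samples.
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """SNR in dB as the ratio of mean speech power to mean noise power."""
    p_speech = np.mean(speech.astype(float) ** 2)
    p_noise = np.mean(noise.astype(float) ** 2)
    return 10.0 * np.log10(p_speech / p_noise)

# e.g. flag a recording whose silence portions are too noisy:
# recording_ok = snr_db(speech_samples, silence_samples) >= 35.0
```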

3.3. Statistical and linguistic analysis of non-native Polish and German speech databases

The analyses were carried out on the basis of a subset of the Polish and German non-native databases. The whole speech material was annotated at the segmental and suprasegmental (lexical and sentence stress) levels. Guidelines were defined for the labeling of pronunciation errors (classified as substitutions, insertions, deletions) and other phenomena occurring in the non-native speech (e.g. different types of acoustic and non-linguistic interferences, stress shifting, word-internal pause insertions). Firstly, a trained phonetician – a native speaker of the target language – verified and corrected the canonical transcription, which was then used to produce a phonetic segmentation. Subsequently, the annotator identified the portions of the signal perceived as mispronounced and marked deviations from the canonical pronunciation and accentuation. As expected, the analyses of the annotated speech material from the accent and prosody tests, provided by fifteen Polish and nine German native speakers (representing four different proficiency levels), showed that the number of pronunciation errors decreased with the proficiency level of the learner: beginners made significantly more errors (42.67% of all pronunciation errors) than intermediate (31.35%) or advanced learners, who contributed the least to the overall number of pronunciation errors (26%; all results normalized with respect to the number of learners in each group). The distribution of different types of errors (substitutions, insertions and deletions) is similar across proficiency groups (Fig. 1).

Fig. 1. Frequency and distribution of pronunciation errors in different proficiency groups.

As regards pronunciation, German learners of Polish have most difficulty in:
a) pronunciation of Polish palatal and non-palatal fricatives (/s′/, /z′/, /S/) and affricates (/d∧z/, /d∧z′/, /t∧s/, /t∧s′/), especially if they occur in clusters,
b) pronunciation of palatal consonants (/n′/, /s′/, /z′/, /t∧s′/, /d∧z′/), which are most often realized as their non-palatal counterparts (/n/, /s/, /z/, /t∧s/, /d∧z/),
c) realization of the voiced–voiceless contrast, which is the main basis of the phonological opposition between consonants in Polish, but not in German, where the lenis–fortis contrast plays the major role; as a result, German learners tend to devoice phonologically voiced consonants or to realize them similarly to German lenis consonants (perceptually salient partial devoicing),
d) pronunciation of the Polish graphemes [ą] and [ę], which are mapped to sequences consisting of the vowel /o/ or /e/ followed by nasalized /w/ (in the case of [ą]) or /j/ (in the case of [ę]) or /n/, /m/, /n′/ depending on the following context; German speakers pronounce [ą] and [ę] as monophthongs, nasal /E∼/ or /O∼/, or use Polish GTP (grapheme-to-phoneme) rules, but in inappropriate contexts,
e) pronunciation of sequences of vowels followed by semi-vowels (/j/, /w/): they are substituted by German diphthongs (e.g. /aj/ replaced by /aI/) or tense long vowels (e.g. /ej/ pronounced as /e:/),
f) avoiding reduction of final unstressed vowels (there is no /@/ in Polish).

The most frequent pronunciation errors found in the Polish–German corpus (recordings of Polish speakers reading speech material in German) concern:
a) vowel production: about 78% of tense long vowels were mispronounced by the Polish learners (see Fig. 2 below) – the realization of short vowels (both lax and tense) was significantly less problematic, as they are more similar to Polish vowels with respect to both acoustic and articulatory features,
b) pronunciation of the German diphthongs /aI/, /OY/, /aU/, which are realized as sequences of a vowel followed by a glide (/aj/, /oj/, /aw/),
c) production of consonants absent in Polish: /h/ and /C/, which were most often pronounced as /x/,
d) realization of /N/ without velar burst in words such as fängt, jüngerer, jungen.

Fig. 2. Frequency and distribution of pronunciation errors in different proficiency groups.

The results discussed here were used to create a curriculum of production and perception exercises for comprehensive L2 pronunciation and prosody training in AzAR. Apart from that, the knowledge of typical errors made by L2 learners was used to define mispronunciation hypotheses, which are used to detect and assess pronunciation errors in AzAR.
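The paper does not publish the form of these mispronunciation hypotheses; a plausible minimal sketch is an L1-specific substitution table, drawn from the error analysis above, that expands a canonical pronunciation into the variants an ASR lexicon can score. The table content and the helper function below are illustrative assumptions only.

```python
# Hedged sketch of the "mispronunciation hypotheses" idea: expected L1-driven
# substitutions (taken from the error analysis above) expanded into alternative
# pronunciations for each word of the lexicon.
SUBSTITUTIONS_DE_PL = {  # canonical German phone -> likely Polish realizations
    "aI": ["aj"], "OY": ["oj"], "aU": ["aw"],
    "h": ["x"], "C": ["x"],
}

def pronunciation_variants(canonical: list[str]) -> list[list[str]]:
    """Canonical pronunciation plus one variant per hypothesized substitution."""
    variants = [canonical]
    for i, phone in enumerate(canonical):
        for sub in SUBSTITUTIONS_DE_PL.get(phone, []):
            variants.append(canonical[:i] + [sub] + canonical[i + 1:])
    return variants

# German "Haus" /h aU s/: canonical plus /x aU s/ and /h aw s/
print(pronunciation_variants(["h", "aU", "s"]))
```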


4. Feedback system

4.1. System performance

The lack of proper (or any) feedback is often named as the most serious flaw in educational software, e.g. (Dalby, Kewley-Port, 1999; Engwall et al., 2004). Good software should not only assess the degree of pronunciation correctness but also instruct on how to improve it, show exactly where the error has been made, e.g. which phone has been produced erroneously, and offer feedback that is easy to interpret. To answer these needs, the AzAR3.0 software provides multimodal feedback – it includes visual and audio modules in the form of curriculum recordings by a reference voice and the visualization of the speech signal under the transcribed and phonemically segmented reference utterances. The Euronounce project combines the objective features of an ASR system with the knowledge of human experts. The system cannot detect all errors which a human teacher could find and, in contrast to other approaches, does not try to do so. Nevertheless, it is able to reliably detect errors which the human expert has identified for a given corpus of test utterances.

The software uses HMM-based speech recognition and speech signal analysis on the learner's input, which makes possible the visual and aural comparison of the user's own performance with that of the reference voice. Even though ANN and SVM classifiers could give better recognition results at the phone level, an HMM-based classifier is currently used because of the reliable methodology for building such classifiers that was developed in previous projects. Testing ANN and SVM classifiers is planned for the future.

From a technical point of view, designing a voice-interactive pronunciation tutor goes beyond the state of the art required by commercial dictation systems. While the grammar and vocabulary of a pronunciation tutor are relatively simple, the underlying speech processing technology tends to be quite complex, since it must be customized to recognize the halting speech of language learners. Acoustic models are generalized so as to accept and recognize correctly a wide range of different accents and pronunciations. In general terms, the procedure consists in building native pronunciation models and then measuring the non-native responses against the native models. For the preliminary evaluation of Polish acoustic speech models, 116 h of orthographically annotated speech produced by 321 speakers were used. The speakers were split into 5 cross-validation sets in such a way that the number of speakers and the number of sentences were approximately equal across the different cross-validation 'folds'. Each set serves as a testing corpus once, with the other four used for building the models. The stochastic acoustic speech models for Polish were trained using HTK tools (Young et al., 1997). The number of Gaussian mixtures in each state was experimentally set to 24. The Polish phonetic alphabet consisting of 39 phonemes (Demenko et al., 2003) was used as the default setup. For the present experiment, it was decided to divide each of the six Polish vowels (/i/, /y/, /e/, /a/, /o/, /u/) into two separate monophones, one representing lexically stressed instances and the other representing their unstressed counterparts. The list of contextual questions was modified accordingly to distinguish stressed and unstressed phonemes. An extension of the above method consisted in not only dividing vowel models but also distinguishing sonorants placed within stressed syllables from those located inside unstressed syllables. The results of the acoustic modeling (without any linguistic models) are summarized in Table 1.

Table 1. Acoustic modeling results for the different phoneme-inventory setups. 'mu' denotes mean word-level accuracy [%], 'sig' denotes the standard deviation across 5 cross-validation folds.

Setup                                 Number of phonemes   mu      sig
Standard                              39                   45.82   1.27
With stressed vowels                  45                   49.10   1.18
With stressed vowels and sonorants    52                   49.62   1.52

The difference between the standard (39 phonemes) and the vowel-distinguishing setup is statistically highly significant (the matched-pairs difference is 3.28% with a standard deviation of 0.65%). A further extension of the phonetic alphabet, with two monophones for each sonorant, does not give any significant boost when tested by the one-tailed t-test (a mean fold difference of 0.52% with a standard deviation of 0.96% gives a P-value of 14.7%).
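Both significance figures can be reproduced from the reported per-fold summary statistics alone. The short check below assumes a standard matched-pairs, one-tailed t-test across the 5 folds, which is consistent with the numbers quoted in the text.

```python
# Reproducing the reported significance tests from the per-fold summaries:
# a one-tailed matched-pairs t-test over 5 cross-validation folds.
from math import sqrt
from scipy.stats import t

def one_tailed_p(mean_diff: float, sd_diff: float, n_folds: int) -> float:
    """P-value for H1: mean fold difference > 0."""
    t_stat = mean_diff / (sd_diff / sqrt(n_folds))
    return t.sf(t_stat, df=n_folds - 1)

print(one_tailed_p(3.28, 0.65, 5))  # < 0.001: 39 phonemes vs. stressed vowels
print(one_tailed_p(0.52, 0.96, 5))  # ~0.147, i.e. the 14.7% reported above
```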

For the auditory feedback, a German phoneme recognizer was trained and adapted to the specific German speech data (recorded in the reference database). The alignment to generate the spoken transcription for the educational speech material was carried out by a mixed German and Polish phoneme recognizer, considering a lexicon with several alternatives for each word.

The best-scored phone sequence wins over the alternative sequences. The dynamic matching between the canonical and the aligned phone sequence provides a warping function of temporal information and information about the best-matching areas of the speech signal. The system marks the potential areas of wrong pronunciation in the utterance and suggests suitable exercises to reduce the speaker-specific pronunciation errors. All uttered phones are marked using a color scale from red for mispronounced phones to green for those pronounced correctly. The user can listen to and play back the reference voice as well as see the speech signal for a given utterance, record and listen to his/her own utterance and see the speech signal for this utterance, and finally get feedback on his/her own pronunciation. An additional visual mode includes an animated visualization of the vocal tract (lip area and articulator movements) and a formant graph for particular phones. A typical AzAR template for an exemplary minimal pair is shown in Fig. 3 and Fig. 4. From top to bottom, the panel containing the text to be produced is shown together with the formants/articulation graph and the oscillogram/spectrogram of the user's and the reference speaker's utterances; below them the transcription and segmentation panel can be seen.
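The matching algorithm itself is not given in the paper; the sketch below illustrates the general idea with a plain Levenshtein-style alignment of the canonical phone sequence against the recognizer output, labeling each canonical phone 'green' (matched) or 'red' (substituted or deleted). AzAR's actual scoring is acoustic and finer-grained than this binary illustration.

```python
# Hedged sketch: dynamic-programming alignment of canonical vs. recognized
# phones, assigning a red/green label per canonical phone.
def align_phones(canonical: list[str], recognized: list[str]):
    n, m = len(canonical), len(recognized)
    # edit-distance table
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == recognized[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    # trace back to label each canonical phone
    labels, i, j = [], n, m
    while i > 0:
        if j > 0 and d[i][j] == d[i - 1][j - 1] + (canonical[i - 1] != recognized[j - 1]):
            ok = canonical[i - 1] == recognized[j - 1]
            labels.append((canonical[i - 1], "green" if ok else "red"))
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:  # deletion: phone not realized
            labels.append((canonical[i - 1], "red"))
            i -= 1
        else:                             # insertion in the learner's speech
            j -= 1
    return list(reversed(labels))

# German "bieten" /b i: t @ n/ realized with a short vowel, as in "bitten":
print(align_phones(["b", "i:", "t", "@", "n"], ["b", "I", "t", "@", "n"]))
```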

Fig. 3. AzAR template for pronunciation assessment of an exemplary minimal pair "bitten – bieten" (DE). In the right top corner an animated visualization of the vocal tract (lip area and articulator movements) is displayed.

Pronunciation quality is visualized by colors. The segments that appeared problematic in this example (marked in light and mid-grey) are the German vowels /i:/, /I/, /a/, /E/, /6/ and the consonants /r/, /N/ – all of them have a different articulation from the corresponding Polish phonemes (additionally, there is no /6/ in Polish), so substitutions can be expected.

All parts of the display are easily customizable by the learners to fit their individual needs.

For pronunciation training, traditional instruction is also recommended (Chun, 1998), since visuals can be too difficult for the user to interpret, and listening drill is not enough when one keeps in mind that L2 learners tend to perceptually associate foreign sounds with more familiar L1 sounds. Therefore, beside audio-visual feedback, the AzAR3.0 software also includes a text tutorial on articulatory and basic acoustic phonetics with a glossary, phoneme descriptions and classification, anatomic information, etc.


Fig. 4. AzAR template for the pronunciation assessment of the word pair "Bach – Bauch". In the right top corner a formant graph for particular phones is displayed.

4.2. Suprasegmental feedback

In spoken conversation, intonation and stress information not only help the listeners to identify phrase boundaries and word emphasis, but also convey the pragmatic thrust of the utterance (e.g. interrogative vs. declarative). One of the main acoustic correlates of stress and intonation is the fundamental frequency (F0); other acoustic features are loudness, duration, and tempo. In order to provide effective feedback on prosody, training software should visualize the "relevant" intonation pattern of a given utterance as realized by an L2 student and a native speaker. Apart from that, it should draw attention to the acoustic features involved in the realization of intonation. For example, the software could (a) instruct learners to compare the steepness of their falling or rising pitch movement to that of the native speaker, and/or (b) provide a quantitative measurement of the actual pitch slopes of both the native speaker and the learner. Effective feedback of this kind requires the implementation of some kind of pitch stylization and normalization. The Pitch Line program (Demenko, Wagner, 2007), designed for the approximation and parameterization of intonation contours, responds to these needs and therefore could be useful in the AzAR3.0 environment.

The stylization method adopted in Pitch Line is based on the assumption that intonational tunes can be regarded as strings of events (pitch accents, boundary tones) associated with the segmental structure of the utterance. The events are modeled as rising, falling or rising-falling pitch movements. They are delimited by target points in the contour (F0 minima and maxima) which define their start, peak and end; some of the targets effectively correspond to phonological tones (H, L). The parts of the F0 contour corresponding to the events are approximated by the functions given below (γ denotes the degree of curvature):

y = x^γ for 0 < x < 1,

y = 2 − (2 − x)^γ for 1 < x < 2.
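Read together, the two branches describe a single smooth rise over a normalized interval x ∈ [0, 2] (a fall is the time-reversed curve). A minimal sketch of sampling such an event and mapping it onto an F0 range follows; the linear scaling onto Hz values is our assumption, as the paper only defines the normalized shape.

```python
# Hedged sketch of a stylized pitch event built from the two branches above.
import numpy as np

def event_curve(n: int, gamma: float, f0_start: float, f0_end: float) -> np.ndarray:
    """Stylized rising (f0_end > f0_start) or falling event over n samples."""
    x = np.linspace(0.0, 2.0, n)
    # sigmoid-like normalized shape, y in [0, 2], inflection at (1, 1)
    y = np.where(x < 1.0, x**gamma, 2.0 - (2.0 - x)**gamma)
    return f0_start + (f0_end - f0_start) * y / 2.0  # assumed linear Hz scaling

def connection(n: int, f0_start: float, f0_end: float) -> np.ndarray:
    """Straight-line stretch between two events."""
    return np.linspace(f0_start, f0_end, n)

# e.g. a rising accent from 120 Hz to 180 Hz followed by a near-level connection
contour = np.concatenate([event_curve(50, 2.0, 120, 180), connection(30, 180, 175)])
```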

The stretches of contour between subsequent events are called connections and are approximated by straight lines. In Pitch Line the approximation is carried out semi-automatically: the choice of the approximation function, i.e. R (rising), F (falling), or C (connection) (cf. Taylor, 2000), and the alignment of the function with the segmental string depend on the human labeler and are decided upon by clicking in the appropriate location on the approximation panel. It is assumed that the start and end of the approximation functions have to be aligned with some segmental landmark located on the pre-accented, accented or post-accented syllable. During the approximation, the normalized mean square error can be monitored: it is displayed on the approximation panel. Similarly to other stylization methods (e.g. Momel or PaIntE), Pitch Line represents intonation contours on a fundamental frequency scale. F0 provides useful information about the acoustic properties of the speech signal, but it is not the most accurate representation of the intonation contour as perceived by human listeners. Therefore, it is planned to apply a more perception-oriented scale in semitones (ST) in the future and to compare the effect of different representations of intonation on the effectiveness of prosody training.

At the output, Pitch Line provides a file containing the values of the stylized F0 curve and another file with parameters describing the events: the slope (describing the steepness of the F0 curve), Fp (the F0 value at the point of alignment of the approximation function), the amplitude of the pitch movement and the shape coefficient of the curve.

Figure 5 illustrates the editing window of Pitch Line. The upper panel contains the waveform; the mid panel shows the SAMPA transcription of the utterance ocenił w mig sytuację (Eng. 'he judged the situation in a flash'). The bottom panel presents the original F0 contour (dotted line), the stylized contour (solid line), the approximation functions (R, F, C) used for stylization of the intonation events, and the NMSE. The vertical lines show approximate phoneme boundaries.


Fig. 5. The editing window and stylization of an intonation contour with the Pitch Line program.

The potential application of Pitch Line to prosody training in AzAR3.0 could be useful for several reasons. First of all, the stylized contours represent the macroprosodic component of F0 curves, which carries linguistically relevant information but is perceptually equivalent to the raw F0 contours, which additionally contain a microprosodic component caused by the nature of the individual phonematic segments of the utterance. The perceptual equivalence of the original and Pitch Line-stylized pitch contours was shown in a perception study (for details see: Demenko, Wagner, 2007): the general impression of the listeners was that the phrases resynthesized with the stylized F0 contours sounded very natural. Apart from that, a quantitative evaluation was carried out – it consisted in measuring the NMSE value between the original and stylized F0 contours of 1000 phrases from a novel passage read by two speakers (male and female). The average NMSE value for the two speakers was 0.003, which indicates that the proposed method provides an accurate approximation of F0 contours.
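For reference, a minimal NMSE computation between an original and a stylized contour might look as follows; the exact normalization used in the cited evaluation is not spelled out here, so normalizing by the variance of the original contour is an assumption.

```python
# Hedged sketch of a normalized mean square error between F0 contours;
# normalization by the variance of the original contour is assumed.
import numpy as np

def nmse(f0_orig: np.ndarray, f0_styl: np.ndarray) -> float:
    err = np.mean((f0_orig - f0_styl) ** 2)
    return float(err / np.var(f0_orig))
```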

Secondly, at the output the program provides quantitative information which can be used to instruct the learner how to improve his or her prosodic realizations (e.g. "try to move the peak of the accent higher"). From this quantitative information, a more abstract qualitative representation can be derived and used in the evaluation of the learner's realization (e.g. realization of a low tone instead of a high one is considered a more severe error than realization of a different slope of a high pitch accent) and in feedback generation. In the learning process it is important to provide the student with a linguistically relevant representation, and Pitch Line could be used for that purpose.


5. Discussion

The tutoring system AzAR3.0 developed in the scope of the Euronounce project was designed according to specific pedagogical guidelines. They identify the input, output and feedback as the essential factors for pronunciation improvement. The AzAR3.0 functionality provides several audio-visual modes of user feedback, e.g. showing animated articulatory organs to correct wrong movements of the tongue, lips, etc., or playing back reference utterances, but the core function is marking mispronounced phones within the spoken utterance using a color scale from red ("bad") to green ("good"). The marking of mispronounced parts of the user's utterances is based on different phonetic-phonological and prosodic distance measures identifying typical cross-lingual influences of the source language on the target language, such as (a simple sketch of a duration measure follows the list):

• Confusion of specific phoneme classes;
• Wrong phoneme duration;
• Articulation mistakes, e.g. voicing of voiceless phonemes.
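As an illustration of the duration measure, one simple possibility is a z-score of the learner's phone duration against native statistics. The numbers and threshold below are hypothetical; the paper does not specify its distance measures.

```python
# Hedged illustration of one possible "wrong phoneme duration" check:
# flag phones whose duration deviates strongly from (hypothetical) native
# mean/standard-deviation statistics.
NATIVE_STATS = {"i:": (0.110, 0.025), "I": (0.055, 0.015)}  # phone: (mean s, sd s)

def duration_flag(phone: str, duration_s: float, threshold: float = 2.0) -> bool:
    mean, sd = NATIVE_STATS[phone]
    return abs(duration_s - mean) / sd > threshold

print(duration_flag("i:", 0.05))  # True: /i:/ realized far too short
```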

Fig. 6. The AzAR 2 demonstrator.

AzAR3.0 is an "extended" version of the system developed in the previous projects – it adopts the speech processing components (such as HMM-based ASR) of AzAR2.0. AzAR3.0 includes improved audio and visual algorithms (like phoneme recognition or region-of-interest recognition) and quality assessment methods, developed on the basis of the analyses of large multilingual speech databases, and offers training of prosodic features.

The AzAR2.0 system (Fig. 6) was tested and optimized for Russians learning German and runs on PC and PDA. Preliminary qualitative evaluations were conducted to gauge the effectiveness of using the AzAR2.0 system to teach German pronunciation to Polish speakers. Fifteen native speakers of Polish, who were intermediate learners of German, were asked to subjectively evaluate the effectiveness of a 2-hour pronunciation training session using AzAR2.0. Thirteen subjects considered it a good tool for learning pronunciation individually, and 2 users expressed willingness to use it in consultation with a teacher. The ASR and prosodic modules for the Polish language were subjected to detailed testing. The present results concerning ASR for educational needs should be regarded as a preliminary verification of the development of acoustic models for Polish. Although they are promising, further experiments are indispensable to improve the obtained acoustic models, especially for accented syllables, and to provide an outcome that would be practically useful in the designed speech recognition system.

As far as the use of the Pitch Line software in AzAR3.0 is concerned, work is in progress to develop automatic pitch target detection so that a fully automatic stylization is possible. However, for the multilingual implementation of the prosody module in AzAR3.0, and in order to address the multiple levels of communicative competence such as grammatical, attitudinal, discourse, or sociolinguistic, it is necessary to distinguish the intonational features with regard to four aspects of pitch change (t'Hart et al., 1990) – its direction, range, speed and alignment with a particular syllable in the utterance – and to link them to tonal categories which constitute a higher-level representation of the utterance's intonation.

In order to further optimize AzAR3.0 for both pronunciation and prosody training, the software will be tested in a real learning environment.

Acknowledgments

This project has been funded with support from the European Commission within the Lifelong Learning Programme (project 135379-LLP-1-2007-1-DE-KA2-KA2MP). This publication reflects the views of the authors only, and the Commission cannot be held responsible for any use which may be made of the information contained therein. The project homepage is located at http://www.euronounce.net.

This research was partially supported by the Polish Ministry of Scientific Research and Information Technology, projects Nos. R00 035 02 and OR00006707 (http://www.man.poznan.pl/online/pl/projekty/105/Laboratorium−Jezyka−i−Mowy.html).

References

1. Abberton E., Fourcin A.J. (1975), Visual feedback and the acquisition of intonation, [in:] Foundations of Language Development, Lenneberg E.H., Lenneberg E. [Eds.], pp. 157–165, Academic Press, New York.

2. Anderson S., Kewley-Port D. (1995), Evaluation of speech recognizers for speech training applications, IEEE Transactions on Speech and Audio Processing, 3, (4), 229–241.

3. Bongaerts T. (1999), Ultimate attainment in L2 pronunciation: The case of very advanced late learners, [in:] The Critical Period Hypothesis and Second Language Acquisition, Birdsong D. [Ed.], Lawrence Erlbaum, Mahwah, NJ.

4. de Bot K., Mailfert K. (1982), The teaching of intonation: Fundamental research and classroom applications, TESOL Quarterly, 16, 71–77.

5. Bouselmi G., Fohr D., Illina I. (2007), Combined acoustic and pronunciation modelling for non-native speech recognition, Proceedings of INTERSPEECH, pp. 1449–1452, Antwerp.

6. Chun D.M. (1998), Signal analysis software for teaching discourse intonation, Language Learning & Technology, 2, (1), 61–77.

7. Cylwik N., Demenko G., Jokisch O., Jäckel R., Rusko M., Hoffmann R., Ronzhin A., Hirschfeld D., Koloska U., Hanisch L. (2008), The use of CALL in acquiring foreign language pronunciation and prosody – general specifications for the Euronounce Project, Proceedings of Speech Analysis, Synthesis and Recognition (SASR), Piechowice.

8. Cylwik N., Wagner A., Demenko G. (2009), The EURONOUNCE corpus of non-native Polish for an ASR-based pronunciation tutoring system, Proceedings of the SLaTE Workshop on Speech and Language Technology in Education, Wroxall Abbey Estate, Warwickshire.

9. Dalby J., Kewley-Port D. (1999), Explicit pronunciation training using automatic speech recognition technology, CALICO Journal, 16, (3), 425–445.

10. Demenko G., Wagner A. (2007), Prosody annotation for unit selection text-to-speech synthesis, Archives of Acoustics, 32, (1), 25–40.

11. Demenko G., Wypych M., Baranowska E. (2003), Implementation of grapheme-to-phoneme rules and extended SAMPA alphabet in Polish, Speech and Language Technology, 7, 79–95.

12. Engwall O., Wik P., Beskow J., Granström G. (2004), Design strategies for a virtual language tutor, Proceedings of the 8th ICSLP, pp. 1693–1696, Jeju Island.

13. Eskenazi M., Hansma S. (1998), The Fluency pronunciation trainer, Proceedings of Speech Technology in Language Learning, pp. 77–81, Marholmen.

14. Eskenazi M. (1999), Using automatic speech processing for foreign language pronunciation tutoring: some issues and a prototype, Language Learning & Technology, 2, (2), 62–76.

15. Flege J.E. (1995), Second-language speech learning: Findings and problems, [in:] Speech Perception and Linguistic Experience: Theoretical and Methodological Issues, Strange W. [Ed.], pp. 233–273, York Press, Timonium, MD.

16. Goronzy S. (2002), Robust Adaptation to Non-native Accents in Automatic Speech Recognition, Springer Verlag.

17. t'Hart J., Collier R., Cohen A. (1990), A Perceptual Study of Intonation, Cambridge University Press, Cambridge.

18. Jokisch O., Koloska U., Hirschfeld D., Hoffmann R. (2005), Pronunciation learning and foreign accent reduction by an audiovisual feedback system, Proceedings of the 1st Intern. Conf. on Affective Computing and Intelligent Interaction (ACII), pp. 419–425, Beijing.

19. Kommissarchik J., Kommissarchik E. (2000), Better Accent Tutor – Analysis and visualization of speech prosody, Proceedings of InSTIL 2000, pp. 86–89, Dundee.

20. Krashen S.D., Terrell T.D. (1983), The Natural Approach: Language Acquisition in the Classroom, Pergamon Press, Oxford.

21. Lyster R. (1998), Negotiation of form, recasts, and explicit correction in relation to error types and learner repair in immersion classrooms, Language Learning, 48, 183–218.

22. Morgan J. (2004), Making a speech recognizer tolerate non-native speech through Gaussian mixture merging, Proceedings of InSTIL/ICALL 2004, pp. 213–216, Venice.

23. Neri A., Cucchiarini C., Strik H. (2002a), Feedback in Computer Assisted Pronunciation Training: When technology meets pedagogy, Proceedings of the 10th Int. CALL Conference on "CALL professionals and the future of CALL research", pp. 179–188, Antwerp.

24. Neri A., Cucchiarini C., Strik H., Boves L. (2002b), The pedagogy-technology interface in Computer Assisted Pronunciation Training, Computer Assisted Language Learning, 15, (5), 441–467.

25. Taylor P. (2000), Analysis and synthesis of intonation using the tilt model, J. Acoust. Soc. Am., 107, (3), 1697–1714.

26. Teixeira C., Franco H., Shriberg E., Precoda K., Sönmez K. (2000), Prosodic features for automatic text-independent evaluation of degree of nativeness for language learners, Proceedings of the 6th ICSLP, pp. 187–190, Beijing.

27. Scovel T. (1988), A Time to Speak. A Psycholinguistic Inquiry into the Critical Period for Human Speech, Newbury House Publishers, Cambridge.

28. Young S., Odell J., Ollason D., Valtchev V., Woodland P. (1997), The HTK Book (for HTK Version 2.1).