
Available online at www.sciencedirect.com

www.elsevier.com/locate/specom

Speech Communication 55 (2013) 71–83

A Mandarin edutainment system integrated virtual learning environments ☆

Yue Ming a,⇑, Qiuqi Ruan a, Guodong Gao b

a Institute of Information Science, Beijing JiaoTong University, Beijing 100044, PR China
b Beijing Traffic Control Technology CO. Ltd., Beijing 100044, PR China

Received 3 July 2010; received in revised form 1 June 2012; accepted 28 June 2012
Available online 10 July 2012

Abstract

In this paper, a novel Mandarin edutainment system is developed for learning Mandarin in immersive, interactive Virtual Learning Environments (VLE). Our system mainly comprises two parts: speech technology support and virtual 3D game design. First, 3D face recognition technology is introduced to distinguish between learners and provide personalized learning services based on the characteristics of each individual. Then, a Mandarin pronunciation recognition and assessment scheme is constructed using state-of-the-art speech processing technology. Because Mandarin rhythm differs distinctively from that of Western languages, we integrate prosodic parameters into the recognition and evaluation model to highlight Mandarin characteristics and improve the evaluation performance. To promote the engagement of foreign learners, we embed our technology framework into a Virtual Reality (VR) game environment. The character design reflects traditional Chinese culture, and the plots balance pronunciation learning with learners’ interest while providing scoring feedback simultaneously. In the experimental design, we first test the correlation of recognition results and machine scores with the different errors and human scores. Then, we evaluate the usability, likeability, and knowledgeability of the whole VLE system. We divide the learners into three categories according to their Mandarin levels, and they provide feedback via a questionnaire. The results show that our system can effectively promote foreign learners’ engagement and improve their Mandarin level.
© 2012 Elsevier B.V. All rights reserved.

Keywords: Mandarin learning; Pronunciation evaluation; Virtual Reality (VR); Edutainment; Virtual learning environment (VLE); 3D face recognition

1. Introduction

Computer-assisted language learning (CALL) (Amory et al., 1998; Conati and Zhou, 2002; Kearney, 2004) is a continuously developing topic. As communication among different countries has become increasingly frequent, more and more people urgently need to master one or more foreign languages. Mandarin, as one of the most widely spoken languages, is being given greater attention. With the rapid development of speech processing technology, automatic pronunciation recognition and evaluation have become well suited to learners studying the language by themselves. However, although traditional speech-based classes have laid a solid foundation for identifying the levels of the learners and correcting incorrect pronunciation, it is easy to become bored with extended learning without any interaction. To better simulate real-life communication and provide personalized services, a virtual learning environment (VLE) is merged with the speech technology support, and 3D face recognition engages the learners in live communication according to individual needs. Immersive, interactive 3D virtual games can be designed to promote high levels of motivation by providing “free form” storylines based on the learners’ ages, native languages, and cultural backgrounds. In our system, the learner takes the role of a commander as soon as he or she has been identified by 3D face recognition technology. The learner controls the gestures and behaviors of the virtual characters that he or she has built according to personal preference. Learners can also communicate with virtual humans via interactive dialogs and actions. In this paper, the complete virtual Mandarin learning environment is constructed, composed of real-time pronunciation processing and interactive scene design.

0167-6393/$ - see front matter © 2012 Elsevier B.V. All rights reserved.

http://dx.doi.org/10.1016/j.specom.2012.06.007

☆ This work is supported by the National Natural Science Foundation (60973060), the Research Fund for the Doctoral Program (20080004001), the Beijing Program (YB20081000401) and the Fundamental Research Funds for the Central Universities (2009YJS025). The Informedia digital video understanding lab at Carnegie Mellon University provided portions of the experimental materials and environments. The authors would also like to thank the Associate Editor and the anonymous reviewers for their helpful comments.
⇑ Corresponding author. Tel.: +86 10 51682936

E-mail address: [email protected] (Y. Ming).

First, pronunciation recognition and assessment is an indispensable component of real-time language learning. In the last several decades, considerable research has been devoted to this area, providing a solid technical foundation for our system. The SRI speech group (Franco et al., 1997; Neumeyer et al., 2000; Franco et al., 2000; Neumeyer et al., 1996) first proposed a scheme that can verify the overall quality of the learner’s pronunciation. The speech groups of Cambridge University and the AI lab of MIT (Witt and Young, 2000; Witt, 1999) carried out joint research on a CALL system. Their work can be used to effectively identify pronunciation errors in Western languages, and the evaluation was based on phone-level rather than syllable-level pronunciation. The VICK system (Cucchiarini et al., 1998; Cucchiarini et al., 1997), constructed by the University of Nijmegen, extended the evaluation to the prosodic level. They summarized the relationship between human scoring and the effect of prosodic information, including fluency, segmental quality and stress. The results of the investigation showed that prosody is crucial for smooth pronunciation and communication. Research from Tokyo University and Kyoto University (Raux and Kawahara, 2002; Tsubota et al., 2004) analyzed the degree of importance of different phonemes during language learning. The different kinds of pronunciation errors were also investigated in terms of proficiency. More recently, structural representation has been introduced to help pronunciation assessment, which can effectively reflect the structure and high-level language semantics spoken by non-native learners (Minematsu, 2004; Asakawa et al., 2005). de Wet et al. (2009) explored a scheme for large-scale oral pronunciation assessment. Their research focused on proficiency and listening comprehension for fairly advanced students of English as a second language.
However, these systems, which are based on Western languages, cannot be used directly for Mandarin learning because Mandarin differs distinctively from those languages. Our broad investigation of the pronunciation differences between Mandarin and Western alphabetic languages indicates that prosodic and tone features play an important role in accurate recognition and evaluation. In our system, we incorporate a prosodic model into the traditional speech recognition framework. In the experimental section, we analyze in detail the detection and evaluation of Mandarin pronunciation in a real-time audio-interaction system.

Our surveys demonstrate that simple audio-interaction for Mandarin learning lacks intelligence and interest. For example, learners using a traditional learning system cannot perceive real interactive environments and take the corresponding actions according to their understanding of language knowledge. From the ecological perspective (Hodges, 2009), language generation originates from the interaction between persons and their lived environments. In ecolinguistics, an agent’s actions reflect the interrelational transactions between the learners and their simulated environments and help explain how culture and policies sustain a learner’s behavior. Speech technology alone provides only a reduced view of the notion of emergence in terms of precision, rigor and success. The ecological approach (Van Lier, 2000) asserts that the perception and personal behaviors of the learners, and particularly the interaction in which the learners participate, are critical to the conceptual understanding of Mandarin utterances. Immersing the learners in an interactive environment can effectively help them acquire language skills while interacting within the environment.

There have already been a number of research efforts in language education that aim to promote the motivation of learners (Young, 2004; Young et al., 2000). In one prominent example, Lewis Johnson et al. at Alelo, Inc. (Johnson et al., xxxx) have embraced a game-based method for language learning for many years. In their system, they construct a real-life community by integrating scenario dialogs and cultural common sense, and their technological and pedagogical innovations have proven quick and economical. However, for Mandarin, they have not developed an immersive, interactive VLE that effectively reflects Mandarin characteristics and cultural backgrounds. In this paper, we focus on building a complete Mandarin edutainment system by incorporating Mandarin pronunciation recognition and evaluation, and immersing the student in interactive virtual learning games involving Chinese history and culture.

The rest of this paper is organized as follows. In Section 2, we introduce the outline of our system and summarize its main contributions. A real-time Mandarin recognition and pronunciation evaluation scheme is described in Section 3. In Section 4, we discuss the construction of the VLE and present our Mandarin educational virtual game. In Section 5, we evaluate the performance of our system in terms of the accuracy of pronunciation recognition and assessment and the usability, likeability, and knowledgeability of our VLE game design. Finally, we conclude in Section 6.

2. System framework and main contributions

Building on previous research (Amory et al., 1998; Conati and Zhou, 2002; Kearney, 2004; Hodges, 2009; Van Lier, 2000; Young, 2004; Young et al., 2000), we focus on the major challenges affecting real-time interactive Mandarin pronunciation learning and present a new system framework in a Virtual Reality (VR) game environment. (Virtual Reality refers to computer-simulated environments that create a lifelike experience, simulating physical presence in places in the real world as well as in imaginary worlds, with sound delivered through speakers (Hodges, 2009).) First, 3D face recognition is used to identify the different learners and provide personalized learning materials according to their preferences (Ming and Ruan, 2012). Then, the Mandarin learning system consists of four important elements: speech recognition interface, pronunciation evaluation, virtual game environment, and plot development, as shown in Fig. 1. In the past few decades, considerable effort has been devoted to speech recognition, and this technology has become quite mature for standard Mandarin. The speech recognition interface identifies what the foreign learners say and displays the recognition results on the screen for reference. Based on the distinctive characteristics of Mandarin, the pronunciation model in our proposed system is combined with prosodic parameters. Then, the confidence of pronunciation detection, used as the evaluation standard, is converted to the score of the foreign speaker’s Mandarin pronunciation. The results in the experimental section show that the machine scores of pronunciation evaluation in our system are quite close to the human scores.

In our system, the learners are exposed to a VR game environment built around a young man’s Eastern adventure. At the beginning of the scenario, each learner can design an intelligent virtual character according to personal preference. The character is placed in the virtual environment and communicates in real time based on the learner’s Mandarin pronunciation. Once the character appears, a magic box is prepared for it, which can take the character to an island of the learner’s choosing. There is a guardian angel ruling the island, and the angel has a conversation with the character in Mandarin.

Fig. 1. The flow chart of our proposed Mandarin edutainment system.

The score of the pronunciation evaluation contributes to the plot development. The system processes the learner’s Mandarin utterance and compares the recognition results with the standard answers. If the answer spoken by the learner is completely illogical in relation to the content of the angel’s question, the angel will repeat the question, and the system will set the output score to 0 for this attempt. Otherwise, if the answer is reasonable, the system will assign different scenarios based on the evaluation score. A series of exciting Eastern adventures then begins, combining Mandarin pronunciation and cultural background in a VR game environment. For beginners, the game is relatively easy, and winning is quite simple. As Mandarin pronunciation improves, the difficulty level of the game gradually increases. The pronunciation score threshold can also be updated according to the learner’s progress, and corresponding rewards are provided. If no learning progress is being made, the punishment system begins to work. If consecutive errors, including low pronunciation scores and unreasonable answers, exceed five, a severe punishment sends the virtual character back to the beginning of the game or to the last checkpoint. If learners speak and behave well, a set of incentive mechanisms provides special magic cards for reducing the difficulty of the conversation, skipping questions or even entering more interesting scenes. Both the punishment and incentive strategies are used to increase playability and fun in our Mandarin learning edutainment system.
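The answer-handling rules described above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the authors’ implementation: the names `LearnerState` and `handle_answer`, the 0 to 100 score scale, and the default threshold of 60 are hypothetical. Only the zero score for unreasonable answers and the limit of five consecutive errors are taken from the text.

```python
from dataclasses import dataclass

# From the text: more than five consecutive errors triggers the severe punishment.
MAX_CONSECUTIVE_ERRORS = 5


@dataclass
class LearnerState:
    threshold: float = 60.0       # hypothetical passing score (assumed 0-100 scale)
    consecutive_errors: int = 0   # low scores and unreasonable answers in a row
    scene: int = 0                # index of the current scene
    checkpoint: int = 0           # scene to return to after the punishment


def handle_answer(state, is_reasonable, machine_score):
    """Update the state for one spoken answer and return (action, score)."""
    # An unreasonable answer is always scored 0, whatever its pronunciation.
    score = machine_score if is_reasonable else 0.0

    if is_reasonable and score >= state.threshold:
        state.consecutive_errors = 0
        state.scene += 1
        return "advance", score

    state.consecutive_errors += 1
    if state.consecutive_errors > MAX_CONSECUTIVE_ERRORS:
        # Severe punishment: back to the last checkpoint (or the start).
        state.consecutive_errors = 0
        state.scene = state.checkpoint
        return "return_to_checkpoint", score

    # Reasonable but poorly pronounced answers are practised again;
    # unreasonable answers make the angel repeat the question.
    return ("practice_again" if is_reasonable else "repeat_question"), score
```

Adapting the threshold to the learner’s progress and awarding magic cards could be layered on top of this state in the same way, but those rules are not specified precisely enough in the text to sketch here.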

In our framework, a novel scheme is proposed to improve learners’ Mandarin pronunciation level in a VR game environment. The main contributions of our scheme can be summarized as follows:

1. Accuracy: Our extensive investigation of the differences between Mandarin and Western languages shows that prosodic features are the cornerstone of good pronunciation, especially for the four Mandarin tones. A syllable takes its meaning from both its sound and its tone, which is difficult for foreign learners. In our system, we combine the prosodic model with the classical speech processing framework, which provides promising results for Mandarin recognition and evaluation. A detailed technical analysis is given in the following section, and the experimental results also verify this point. Once the learners have mastered tones and rhythm, everything else will fall into place and great progress will be made.

2. Usability: In our system, learners can communicate in real time simply by way of speech and simple manipulation. Speech recognition and evaluation are optimized to understand the utterances spoken by native and non-native speakers and provide feedback to the system. Tutorial feedback on learners’ pronunciation facilitates the assignment of course content. Interactive feedback that is encouraging and sensitive to the learner’s sense of self-esteem leads to a better learning state and atmosphere than simply telling the learner that his/her responses are right or wrong.

3. Likeability: Increasing the learner’s motivation and engagement is an essential component of learning progress. A VLE with appropriate game design can effectively enhance intelligence and stimulate the learner’s understanding based on special events or scenarios. Furthermore, learners can be completely immersed in a happy mood by playing the virtual game. Likeability helps raise learning interest, explore new concepts and build an immersive, imaginative virtual learning space.

4. Knowledgeability: Since Chinese culture plays an important role in the learning of Mandarin, all virtual game scenarios contain Chinese history and culture. The storylines are based on an Eastern adventure. A variety of characters originate from classical Chinese mythology, and the scenario design is based on famous scenery. During the learning process, learners not only practice their Mandarin pronunciation, but also enhance their understanding of Chinese culture.

3. Real-time Mandarin speech recognition and pronunciation evaluation

In this section, we focus on the technology supporting Mandarin speech recognition and pronunciation evaluation. The flowchart of the speech processing is illustrated in Fig. 2. A detailed analysis is given in the following subsections.

3.1. Mandarin speech recognition with prosodic modeling

Maximum a posteriori (MAP) estimation is used in classical speech recognition, and the word sequence with the maximum posterior probability is taken as the recognition result (Huang and Lee, 2006; Rabiner and Juang, 1993):

$$W^{*} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(W)\,P(X \mid W) \qquad (1)$$

where $W = \{w_1, w_2, \ldots, w_n\}$ is the word sequence and $w_j$ denotes the jth lexical word. $X$ is the sequence of acoustic feature vectors, composed of 12 MFCCs (Mel-Frequency Cepstral Coefficients), a log energy, and their corresponding delta and double-delta values. $P(W)$ is the prior distribution given by a language model, and $P(X \mid W)$ is calculated using an HMM (Hidden Markov Model) acoustic model. In speech processing, a Hidden Markov Model treats the pronunciation sequence as a statistical process, modeled as a Markov sequence with unobserved states (Rabiner and Juang, 1993).

Fig. 2. The framework of our speech recognition and pronunciation evaluation system.

Mandarin has distinctive tonal characteristics, and accurate prosodic pronunciation is closely related to speaking ability. A set of prosodic feature vectors $F = \{f_1, f_2, \ldots, f_n\}$, derived from pitch, duration and energy, has been applied to speech recognition (Zhang et al., 2008; Huang and Lee, 2006). The above equation can then be extended to the following formulation:

$$W^{*} = \arg\max_{W} P(W \mid X, F) = \arg\max_{W} P(W)\,P(X, F \mid W) \qquad (2)$$

Here, we assume the acoustic and prosodic features are mutually independent given the word sequence $W$. For each lexical word, the prosodic feature $f_j$ is assumed to be independent of the neighboring words $w_{j-1}$ and $w_{j+1}$ and to depend only on the current word $w_j$. With simple algebra, we get the following equation (Huang and Lee, 2006):

$$W^{*} \cong \arg\max_{W} P(W)\,P(X \mid W) \prod_{j=1}^{N} P(f_j \mid w_j) \qquad (3)$$

Then, we introduce a two-pass decoding process with a rescoring stage for effective recognition (Huang and Lee, 2006). In the first pass, a word graph of appropriate size is built based on the traditional acoustic and language models. Then, during the second pass, every word arc is rescored by integrating the prosodic model features $P(f_j \mid w_j)$. The rescoring equation can be derived from decision tree and maximum likelihood theories:

$$S(W) = \lambda_X P(X \mid W) + \lambda_W P(W) + \lambda_P P(F \mid W) \qquad (4)$$

where $S(W)$ is the final score, and $\lambda_X$, $\lambda_W$, $\lambda_P$ are the weights for the likelihoods of the acoustic model, language model, and prosodic model. The probability $P(f_j \mid w_j)$ effectively reflects intonation, rhythm, and stress, which makes it suited to assessing the pronunciation of learners whose native language is a Western language. Mandarin uses four tones to distinguish the meanings of words: high level, rising, falling-rising and falling. Most foreign learners cannot accurately differentiate the tones from each other. For example, “yue (falling)” is quite a simple pronunciation for native Mandarin speakers, but most foreigners have difficulty pronouncing it correctly. Thus, prosodic recognition and detection play an important role in correcting pronunciation errors. We analyze Mandarin pronunciation on two levels, the syllable level and the prosodic level, which can effectively identify and recognize errors when foreigners learn Chinese.
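The second-pass rescoring can be illustrated with a short sketch. It makes two assumptions that are not stated in the paper: the three terms of Eq. (4) are combined in the log domain, as is common in practice, and each hypothesis from the first pass is represented by a hypothetical dictionary holding its acoustic log-likelihood, language-model log-probability, and per-word prosodic probabilities $P(f_j \mid w_j)$. The function name and the default weights are illustrative only.

```python
import math


def rescore_hypotheses(hyps, lam_x=1.0, lam_w=1.0, lam_p=1.0):
    """Return (best_score, best_words) after prosodic rescoring.

    Each hypothesis is a dict with (hypothetical) keys:
      'words'         - list of recognised words
      'log_acoustic'  - log P(X | W) from the acoustic model
      'log_lm'        - log P(W) from the language model
      'prosody_probs' - list of P(f_j | w_j), one per word
    """
    best = None
    for h in hyps:
        # Eq. (3): the prosodic likelihood factorises over words,
        # so its logarithm is a sum of per-word log-probabilities.
        log_prosody = sum(math.log(p) for p in h["prosody_probs"])
        # Log-domain analogue of Eq. (4), with one weight per model.
        score = (lam_x * h["log_acoustic"]
                 + lam_w * h["log_lm"]
                 + lam_p * log_prosody)
        if best is None or score > best[0]:
            best = (score, h["words"])
    return best
```

With `lam_p=0.0` the function reduces to the first-pass ranking, which makes it easy to see when the prosodic model changes the recognition result, for instance when two hypotheses differ only in tone.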

3.2. Mandarin pronunciation detection

In this subsection, we introduce a confidence measure to identify whether a pronunciation is correct or not (Williams and Renals, 1999). A confidence measure quantifies how well the model matches the pronunciation units, where the values indicate the similarity across the whole utterance. Here, a 2 × 2 confusion matrix of the confidence values is used to estimate the unconditional error rate (UER) in Eq. (5):

$$\hat{P}(\mathrm{error}) = \frac{N(\mathrm{reject}(H_0), \mathrm{true}(H_0)) + N(\mathrm{accept}(H_0), \mathrm{false}(H_0))}{N(H)} \qquad (5)$$

where $\hat{P}$ is the estimated error probability and $N(H)$ denotes the total number of hypotheses tested. The conditional error rate (CER) can then be divided into two types:

$$\hat{P}(\mathrm{type\ I\ error}) = \hat{P}(\mathrm{reject}(H_0) \mid \mathrm{true}(H_0)) = \frac{N(\mathrm{reject}(H_0), \mathrm{true}(H_0))}{N(\mathrm{true}(H_0))} \qquad (6)$$

$$\hat{P}(\mathrm{type\ II\ error}) = \hat{P}(\mathrm{accept}(H_0) \mid \mathrm{false}(H_0)) = \frac{N(\mathrm{accept}(H_0), \mathrm{false}(H_0))}{N(\mathrm{false}(H_0))} \qquad (7)$$
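As a worked illustration of Eqs. (5)–(7), the three error rates can be computed directly from the four cells of the 2 × 2 confusion matrix. The function and argument names below are hypothetical, and the counts in the usage note are invented for illustration only.

```python
def error_rates(n_reject_true, n_accept_true, n_reject_false, n_accept_false):
    """Return (UER, type I error rate, type II error rate).

    The four arguments are the cells of the 2 x 2 confusion matrix:
    decision (reject / accept) against truth of the hypothesis H0.
    """
    n_true = n_reject_true + n_accept_true      # N(true(H0))
    n_false = n_reject_false + n_accept_false   # N(false(H0))
    n_total = n_true + n_false                  # N(H)

    uer = (n_reject_true + n_accept_false) / n_total   # Eq. (5)
    type_1 = n_reject_true / n_true                    # Eq. (6)
    type_2 = n_accept_false / n_false                  # Eq. (7)
    return uer, type_1, type_2
```

For example, with 10 correct pronunciations rejected, 90 accepted, 80 incorrect pronunciations rejected and 20 accepted, the unconditional error rate is 30/200 = 0.15, the type I error rate is 10/100 = 0.10, and the type II error rate is 20/100 = 0.20.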

Here, the posterior probability is chosen as our confidence measure (Williams and Renals, 1999), which can be estimated by the following acceptor acoustic model, where $q_k$ denotes the kth state sequence:

$$PP(q_k) = \sum_{n=n_s}^{n_e} \log\big(p(q_k \mid X_1^{n})\big) \qquad (8)$$

Duration normalization is then used to balance utterances of different lengths:

$$nPP(q_k) = \frac{1}{D}\,PP(q_k) \qquad (9)$$

where $n_s$ is the start time of the utterance, $n_e$ is the end time, and $D$ is the corresponding duration.
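A minimal sketch of Eqs. (8) and (9) follows. It assumes that the per-frame posteriors between $n_s$ and $n_e$ are already available as a list, and that $D$ is the number of frames, $n_e - n_s + 1$. The function name is hypothetical.

```python
import math


def normalized_log_posterior(frame_posteriors):
    """Duration-normalised log posterior nPP(q_k).

    frame_posteriors: posterior probabilities, one per frame
    from n_s to n_e inclusive (assumed to be precomputed).
    """
    # Eq. (8): sum of log posteriors over the frames of the unit.
    pp = sum(math.log(p) for p in frame_posteriors)
    # Eq. (9): divide by the duration D, taken here as the frame count.
    duration = len(frame_posteriors)
    return pp / duration
```

Because of the normalization, a unit that is matched equally well in every frame receives the same confidence value regardless of how long it lasts, which is what allows utterances of different lengths to be compared against a single threshold.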

The separability of the two utterances can be estimated by the symmetric Kullback–Leibler distance (KLD), which sums the divergence between the utterances evaluated in both directions (Williams and Renals, 1999):

$$d_{KL2} = -\sum_{m=1}^{M} p\big(nPP(x_j^{m}) \mid \mathrm{true}(H_0)\big) \log\left\{\frac{p\big(nPP(x_j^{m}) \mid \mathrm{false}(H_0)\big)}{p\big(nPP(x_j^{m}) \mid \mathrm{true}(H_0)\big)}\right\} - \sum_{m=1}^{M} p\big(nPP(x_j^{m}) \mid \mathrm{false}(H_0)\big) \log\left\{\frac{p\big(nPP(x_j^{m}) \mid \mathrm{true}(H_0)\big)}{p\big(nPP(x_j^{m}) \mid \mathrm{false}(H_0)\big)}\right\} \qquad (10)$$

where $nPP(x_j^{m})$ denotes the confidence value for the mth word decoding. Once the value of the KLD is larger than the preset threshold, the corresponding utterance is considered a pronunciation error.
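The symmetric distance of Eq. (10) can be sketched as follows, under the assumption that the two conditional distributions of the confidence value are given as discrete probabilities over the same $M$ points (for example, histogram bins). The function name and the small floor `eps`, which guards against taking the logarithm of zero, are additions for illustration and are not part of the paper.

```python
import math


def symmetric_kl(p_true, p_false, eps=1e-12):
    """Symmetric Kullback-Leibler distance between two discrete distributions.

    p_true[m]  ~ p(nPP | true(H0)) at the m-th point
    p_false[m] ~ p(nPP | false(H0)) at the m-th point
    """
    distance = 0.0
    for pt, pf in zip(p_true, p_false):
        pt = max(pt, eps)   # floor to avoid log(0); an added safeguard
        pf = max(pf, eps)
        distance -= pt * math.log(pf / pt)   # first sum of Eq. (10)
        distance -= pf * math.log(pt / pf)   # second sum of Eq. (10)
    return distance
```

The result is zero for identical distributions, grows as the two distributions separate, and does not depend on the order of the arguments.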

3.3. Mandarin scoring based on the acoustic and prosodic parameters

The machine score must be reasonable and well correlated with the human scores. Given the distinctive characteristics of Mandarin, information from both the acoustic model and prosody must be considered in scoring.

3.3.1. HMM log-likelihood scores

The HMM can effectively describe the acoustic likelihoods of spectral observations as scores (Neumeyer et al., 2000). The log-likelihood $l_i$ of the ith phonetic segment is estimated by summing over its short-time frames:

$$l_i = \sum_{t=s_i}^{s_{i+1}-1} \log\big(p(s_t \mid s_{t-1})\,p(x_t \mid s_t)\big) \qquad (11)$$

where $s_i$ is the start time of the ith phonetic segment, $x_t$ is the observed spectral vector, and $s_t$ denotes the HMM state at time $t$. The different lengths of the utterances have a severe influence on the log-likelihood scores. Normalization is introduced to balance the effect of duration and compensate for shorter utterances. The local average HMM log-likelihood score $L$ can be calculated as follows:

$$L = \frac{1}{N} \sum_{i=1}^{N} \frac{l_i}{d_i} \qquad (12)$$

where $d_i$ is the duration of the ith segment in frames and $N$ is the number of segments.
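Eqs. (11) and (12) can be illustrated with a short sketch. It assumes that the per-frame log term $\log(p(s_t \mid s_{t-1})\,p(x_t \mid s_t))$ has already been obtained from a forced alignment, and that segment boundaries are given as a list of start frames with one extra entry marking the end of the last segment. The function and argument names are hypothetical.

```python
def local_average_loglik(frame_loglik, boundaries):
    """Local average HMM log-likelihood score L.

    frame_loglik: per-frame values log(p(s_t | s_{t-1}) * p(x_t | s_t)).
    boundaries:   start frames s_1, ..., s_N followed by the end frame,
                  so segment i covers frames boundaries[i] .. boundaries[i+1]-1.
    """
    n_segments = len(boundaries) - 1
    total = 0.0
    for i in range(n_segments):
        start, end = boundaries[i], boundaries[i + 1]
        l_i = sum(frame_loglik[start:end])   # Eq. (11): segment log-likelihood
        d_i = end - start                    # segment duration in frames
        total += l_i / d_i                   # duration-normalised segment score
    return total / n_segments                # Eq. (12): average over segments
```

Dividing each $l_i$ by $d_i$ before averaging gives every segment the same weight regardless of its length, so long vowels do not dominate the utterance score.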

Fig. 3. An example of Mandarin speech recognition and pronunciation evaluation system.

Fig. 4. The interactive 3D content creation by Virtools.

3.3.2. Magnitude scores

The magnitude score is a key indicator of prosodic information and is closely related to the assessment. It is computed by averaging over short-time speech frames:

$$\mathrm{aveMag}(n) = \frac{1}{M} \sum_{m=0}^{M-1} |S_n(m)|, \quad n = 0, 1, \ldots, N-1 \qquad (13)$$

where $S_n(m)$ denotes the mth sample of the nth frame. In our system, interpolation is used to normalize the different lengths of feature vectors, and linear scaling is introduced to compensate for environmental diversity. The Euclidean distance is used to evaluate the similarity between the standard and input utterances.
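The magnitude comparison described above can be sketched in three steps: the average magnitude of Eq. (13), interpolation to a common length, and a Euclidean distance after scaling. The paper does not specify the interpolation or scaling method, so linear interpolation and division by the peak value are assumptions here, as are all function names and the default contour length.

```python
import math


def average_magnitude(frames):
    """Eq. (13): mean absolute sample value of each frame."""
    return [sum(abs(s) for s in frame) / len(frame) for frame in frames]


def resample_linear(values, length):
    """Linearly interpolate a contour to a fixed number of points."""
    if length == 1 or len(values) == 1:
        return [values[0]] * length
    out = []
    for k in range(length):
        pos = k * (len(values) - 1) / (length - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return out


def magnitude_distance(ref_frames, in_frames, length=50):
    """Euclidean distance between two scaled, length-normalised contours."""
    contours = []
    for frames in (ref_frames, in_frames):
        contour = resample_linear(average_magnitude(frames), length)
        peak = max(contour) or 1.0          # assumed scaling: divide by the peak
        contours.append([v / peak for v in contour])
    ref, inp = contours
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, inp)))
```

With this choice of scaling, an utterance recorded at a different overall loudness but with the same magnitude contour yields a distance of zero, which is one way to read the compensation for environmental diversity.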

3.3.3. Segment duration scores

The duration parameter can be used to evaluate the rate of speech (ROS) across different individuals (Neumeyer et al., 2000). The segment duration score is calculated from the Viterbi phonetic alignment. Its value is obtained by averaging the log-probability of the normalized duration over the segments:

$$D = \frac{1}{N} \sum_{i=1}^{N} \log\big(p(f(d_i) \mid q_i)\big) \qquad (14)$$

where $f(d_i) = d_i \cdot \mathrm{ROS}_S$ is the normalized duration of the ith segment, $\mathrm{ROS}_S$ is the rate-of-speech estimate for the particular speaker $S$, and $q_i$ is the phone corresponding to the ith segment.
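A sketch of Eq. (14) is given below. The paper does not state the form of the duration distribution $p(f(d_i) \mid q_i)$, so a Gaussian density per phone is assumed purely for illustration. The function name and the `stats` table of per-phone means and standard deviations are hypothetical.

```python
import math


def duration_score(durations, phones, ros, stats):
    """Average log-probability of the normalised segment durations.

    durations: segment durations d_i from the phonetic alignment
    phones:    phone label q_i of each segment
    ros:       rate-of-speech estimate ROS_S for the speaker
    stats:     {phone: (mean, std)} of normalised durations
               (assumed Gaussian; hypothetical values)
    """
    total = 0.0
    for d_i, q_i in zip(durations, phones):
        mean, std = stats[q_i]
        f = d_i * ros   # normalised duration f(d_i) = d_i * ROS_S
        # Log of the Gaussian density N(f; mean, std^2).
        log_p = (-0.5 * math.log(2.0 * math.pi * std ** 2)
                 - (f - mean) ** 2 / (2.0 * std ** 2))
        total += log_p
    return total / len(durations)   # Eq. (14)
```

Segments whose normalized durations lie close to the expected value for their phone contribute higher log-probabilities, so a learner who stretches or clips phones relative to his or her own speaking rate receives a lower score.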

3.3.4. Timing scores

Empirical study shows that non-native speakers cannot speak as quickly as natives; speaking rate effectively reflects fluency and tends to be used as a scoring indicator (Neumeyer et al., 2000). The Euclidean distance based on DTW, denoted dist, evaluates the similarity between the input speech sequence and the standard utterance and can be converted into a score as follows:

$$\mathrm{score}_{pho} = \frac{100}{1 + a\,(\mathrm{dist})^{b}}, \quad a, b > 0 \qquad (15)$$

Then, we calculate the weighted score based on the four types of parameters:

$$\mathrm{score}_{sen} = w_1 \cdot \mathrm{score}_{fea1} + w_2 \cdot \mathrm{score}_{fea2} + w_3 \cdot \mathrm{score}_{fea3} + w_4 \cdot \mathrm{score}_{fea4} \qquad (16)$$

where fea1, fea2, fea3, and fea4 denote the HMM log-likelihood, magnitude, segment duration, and timing scores, respectively. The values of the weights are estimated by a simple downhill search. Finally, we integrate the machine scoring scheme with the recognition and detection system. The system shows the recognition result of the input and simultaneously provides the pronunciation score on screen, as shown in Fig. 3. A detailed analysis of the performance is given in the experimental section.
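Eqs. (15) and (16) translate directly into code. In the sketch below, the function names and the default values of $a$ and $b$ are hypothetical, and the weights in the usage note are invented for illustration; in the paper they are estimated by a downhill search.

```python
def distance_to_score(dist, a=1.0, b=1.0):
    """Eq. (15): map a non-negative distance to a score in (0, 100]."""
    return 100.0 / (1.0 + a * dist ** b)


def sentence_score(feature_scores, weights):
    """Eq. (16): weighted sum of the four feature scores.

    feature_scores: HMM log-likelihood, magnitude, segment duration
                    and timing scores, in that order.
    weights:        w1..w4 (illustrative values, not from the paper).
    """
    return sum(w * s for w, s in zip(weights, feature_scores))
```

A distance of zero maps to the full score of 100, and the score decays monotonically as the distance grows, with $a$ and $b$ controlling how quickly. For example, feature scores of 80, 60, 70 and 90 with weights 0.4, 0.2, 0.2 and 0.2 give a sentence score of 32 + 12 + 14 + 18 = 76.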

4. Virtual 3D game design

In this section, we describe how to construct the VLE system effectively and how to design the Mandarin edutainment game.

4.1. Construction of VLE system

To achieve full color, a wide field of view, and a realistic display, Virtools (Virvou and Katsionis, 2008; Pan et al., 2006), a development and deployment platform, is introduced into our system to construct an immersive 3D game scenario. The interactive 3D content creation is shown in Fig. 4; it supports a variety of 3D file storage formats. Thus, the tools are well suited to our application and make it easier to create and edit virtual environments and 3D objects. The construction of the Mandarin edutainment system can be divided into the following three components, as shown in Fig. 4.

1. 3D Layout (top left): The layout window displays the project currently being edited in real time and can be used to create, select, and manipulate the different 3D Entities. The navigation tools drive the motion of 3D objects.

2. Building Blocks and Data Resource (top right): The Building Blocks Resource is a group of classes containing useful functions, especially the behavior Building Blocks (BBs). The Data Resource is the storage area for all the media files, including 2D sprites, 3D entities, 3D sprites, Behavior Graphs, Characters, Materials, Audios, and Videos. The user can easily create and manipulate a Data Resource or add a media file to it.

3. Schematic (bottom): In this area, scripts can be used to view, edit, and debug the scenarios and the behaviors of characters. The schematic is simply a visual representation of a particular behavior attached to a character.

Fig. 5. Correct answer interface with satisfactory speech content and pronunciation score.

Virtools allows users to create rich, interactive scripts in a VR environment and to manipulate them in real time. It not only allows users to drag and drop BBs into the Schematic, but also provides a way to create special BBs (http://a2.media.3ds.com/products/3dvia/3dvia-virtools/; Li et al., 2007). We create our novel Mandarin edutainment system with the Virtools SDK on the VC++ 6.0 platform and integrate the speech recognition and pronunciation evaluation technology. We use Virtools to build our VLE system, which puts the learners in a happy mood as they interact with the virtual environment. The VLE system can effectively enhance learners’ interest in certain behaviors or pronunciations, especially those for which traditional methods have encountered obstacles or difficulties.

4.2. The Mandarin edutainment VR game design

Virtual Reality (VR) is a technological breakthrough for game design. Research on and application of VR technology can effectively improve learning efficiency and arouse interest in language learning compared with existing instructional strategies. Our edutainment curriculum consists of 15 modules with 11 units focusing on Mandarin word pronunciation, and each module lasts approximately two hours with audio-interaction.

The main 3D scenario of our Mandarin edutainment game is a desert island, surrounded by a river, swamp, sky and so on. The major roles include the virtual human created by the learners based on their personal preferences, four great classical roles of Chinese mythology (Monkey King, Monk Sha, Kwan-yin, and Bodhisatta), and a guardian angel. The learner can navigate through the scenarios by simple voice commands and achieve the aims of pronunciation exercise and exploration step by step.

At the beginning of the game plot, the learners encounter the guardian angel, gods or spirits at each game checkpoint. At this point, pronunciation practice is based on the questions raised by the angel, and answering them is the only way to access the next scene. Each answer is given a score by the speech recognition and pronunciation evaluation module embedded in the system. When the score of a valid answer is higher than the preset threshold, the next question or scene becomes available. We show an example in Fig. 5. In this case, the angel asks the virtual human “What is the scene behind me?” The learner answers “he liu” (meaning “river” in English) by audio-interaction with a satisfactory pronunciation. Then, our system shows the recognition result and machine score at the bottom of the screen, as illustrated in Fig. 5. Otherwise, the learner needs to practice the same pronunciation again. If the learner gives an unreasonable answer, the angel may repeat the question, set the output score to 0 and give hints to make the problem easier.

As the plot develops, the content of our game may incorporate many adventure elements (Virvou and Katsionis, 2008). The criterion for winning the game is to improve one’s Mandarin utterances and ultimately find the four Great Classical roles of Chinese mythology. To achieve these aims, the learners have to pass all the hurdles guarded by the angels and accumulate a specific number of points. For example, the virtual human, created by the learner, encounters a swampy farmland in the virtual world, as shown in Fig. 5. The guardian angel presents a problem related to Chinese culture or to the objects or landscape surrounding the virtual human. The virtual human is required to answer the problem in fluent Mandarin, supplied by the learner through audio-interaction. Then, the system matches the learner’s pronunciation against the standard utterance and gives a score based on the technology discussed in Section 3. If the score reaches or exceeds the predefined threshold, the learner receives award points for this hurdle, and the angel allows him/her to access the next hurdle, which has harder questions but is worth more points, leading him/her toward the ultimate goal.

Apart from the special scenes, the learners may also meet objects that they can click on or manipulate. These objects appear randomly and may give hints or present barriers as the learners go through the hurdles. For example, if the learners encounter difficulty with particular Mandarin pronunciations, hints can provide words with similar or identical pronunciation, or provide scene and meaning tips to facilitate their Mandarin learning. However, these hints are not immediately available to the learners, since they need to accumulate sufficient points in the previous steps to open the door of another scene. Hence, the learners have to remember as much of the pronunciation and Mandarin knowledge as they can, so that they obtain more points when they meet the objects containing hints or barriers. In this way, these objects can efficiently motivate the students to memorize the important parts of the language they are learning.

Fig. 6. Kwan-yin appeared.

Table 1. The relation between the error rate and machine scores.

Machine scores   0–0.60    0.60–0.90   0.90–1
Disparateness    54.17%    12.34%      0
Substitution     5.87%     11.18%      5.72%
Insertion        4.15%     9.34%       3.15%

As a part of the adventure game, special bonuses play an indispensable role in progressing through the game and defeating the enemies. The bonuses may involve weapons and related services. In our educational system, if a learner achieves 10 consecutive correct pronunciations, he/she has an opportunity to get a key for a guarded door. If he/she can provide the correct Mandarin pronunciation and a reasonable answer to the questions raised by the keepers, he/she can enter the door with the key and pick up the bonuses. The size of the bonuses depends on the questions’ difficulty level. There are also penalties that accrue if the learner’s pronunciation makes no progress during a certain amount of time.
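The key-earning rule (10 consecutive correct pronunciations) amounts to a streak counter. The sketch below is a minimal illustration; the class name `BonusTracker` and the decision to restart the streak after a key is granted are assumptions not stated in the text.

```python
class BonusTracker:
    """Tracks consecutive correct pronunciations and the keys earned."""

    def __init__(self, streak_for_key: int = 10):
        self.streak_for_key = streak_for_key  # 10 in the described system
        self.streak = 0
        self.keys = 0

    def record(self, correct: bool) -> bool:
        """Record one attempt; return True if this attempt earns a key."""
        if not correct:
            # Any incorrect pronunciation breaks the run.
            self.streak = 0
            return False
        self.streak += 1
        if self.streak == self.streak_for_key:
            self.keys += 1
            self.streak = 0  # assumption: the streak restarts after a key
            return True
        return False
```

Calling `record(True)` ten times in a row returns `True` only on the tenth call; a single `record(False)` in between resets the count.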

In the learning process, the learners can be completely immersed in the virtual world, using only the mouse and audio interaction. Our system also provides a 2D map to facilitate the learners’ navigation. As a learner progresses through certain learning units, he/she will have the opportunity to see the desert island panorama. The higher the unit is, the closer the learner is to winning the game. When the learners solve all the problems in Mandarin pronunciation, a new character will take them to the next scene, as demonstrated in Fig. 6. Based on the above design, our humanized system can foster and motivate the interest of foreign learners to a high extent. From the perspective of VLE, our edutainment system not only enriches the form of language learning, but also provides a well-rounded, multi-angle view of the teaching materials, which enriches the learning experience and helps learners analyze pronunciation characteristics and grasp new knowledge about Chinese culture. All kinds of learners can share a public virtual learning space to discuss their understanding of, and obstacles with, certain events or pronunciations.

Compared with current instructional strategies without any VR game, our system shows obvious improvement in the learners’ communication skills. We evaluate the performance of our system from four aspects in the following section.

5. Experiments

In this section, we evaluate the performance of our proposed Mandarin edutainment system based on four aspects: the accuracy of speech recognition and pronunciation evaluation, usability and likeability with and without the VLE system, and knowledgeability from the perspective of linguistics and culture. Our edutainment course is composed of 11 units, each with a certain number of Chinese phrases for practicing Mandarin pronunciation. The system records the pronunciation scores in real time. The learners are divided into three levels according to their pronunciation scores: novice, intermediate, and experienced. To quantify the system performance, a questionnaire is used to record the corresponding experimental results.

5.1. Accuracy of speech recognition and pronunciation evaluation

In this subsection, we examine the accuracy of our speech processing technology in terms of recognition results and evaluation scores. First, the recognition errors can be classified into disparateness, substitution, and insertion. Here, disparateness describes the case in which a whole phrase is recognized completely incorrectly. Substitution denotes that only a fraction of a phrase is recognized correctly, and insertion describes new words introduced by the recognition system that are not contained in the original pronunciation spoken by the learner. In Table 1, we list the error rates for different ranges of the machine scores. The results indicate that the output of our system is consistently close to the ground truth, especially for disparateness. The substitution and insertion errors are closely related to the learners’ language background. In future research, we will explore a personalized speech recognition system based on the learners’ language habits. From Table 1, we can conclude that the machine scores are related to the recognition results. As the scores increase, the disparateness errors correspondingly decrease. In the score range of 0.6–0.9, parts of phrases may contain mistakes. If the score reaches 0.9, the errors decrease significantly.
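A breakdown of this kind (error rate per error type within each machine-score range) could be tallied as in the following sketch. The input format, the function names, and the choice of denominator (all phrases whose score falls in the range) are assumptions for illustration; the paper does not specify how the rates were normalized.

```python
from collections import Counter

# Score ranges taken from Table 1; each range is [low, high), except the last,
# which also includes a score of exactly 1.0 (an assumption about the edges).
SCORE_BINS = [(0.0, 0.6), (0.6, 0.9), (0.9, 1.0)]
ERROR_TYPES = ("disparateness", "substitution", "insertion")


def bin_for(score):
    """Return the score range that contains `score`."""
    for low, high in SCORE_BINS:
        if low <= score < high:
            return (low, high)
    if score == SCORE_BINS[-1][1]:
        return SCORE_BINS[-1]
    raise ValueError(f"score out of range: {score}")


def error_rates(records):
    """records: iterable of (machine_score, error_type or None), one per phrase."""
    totals = Counter()
    errors = Counter()
    for score, error_type in records:
        score_bin = bin_for(score)
        totals[score_bin] += 1
        if error_type is not None:
            errors[(score_bin, error_type)] += 1
    # Rate = phrases with that error / all phrases scored in the same range.
    return {
        score_bin: {e: errors[(score_bin, e)] / totals[score_bin] for e in ERROR_TYPES}
        for score_bin in totals
    }
```

Each phrase contributes one record, with `None` as the error type when it was recognized correctly.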


Table 2. The correlation of the assessment results between the machine scores and human scores.

Phones   Syllables   Words   Phrases
0.75     0.78        0.83    0.81


Second, we evaluate the correlation between the machine scores and human scores, which is an important indicator for verifying the performance of the pronunciation assessment. We randomly choose $N$ phrases pronounced by different learners. The correlation between the machine scores $(A_1, \cdots, A_N)$ and human scores $(B_1, \cdots, B_N)$ can be calculated by the following equation:

$$\mathrm{Corr}_{A,B} = \frac{\left|\sum_{i=1}^{N}(A_i-\bar{A})(B_i-\bar{B})\right|}{\sqrt{\sum_{i=1}^{N}(A_i-\bar{A})^2\,\sum_{i=1}^{N}(B_i-\bar{B})^2}} \tag{17}$$

where $\bar{A} = \frac{1}{N}\sum_{i=1}^{N}A_i$ and $\bar{B} = \frac{1}{N}\sum_{i=1}^{N}B_i$.
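As a minimal sketch, Eq. (17) can be computed directly from two equal-length score lists. The function name `score_correlation` and the error handling are illustrative choices, not part of the original system.

```python
import math


def score_correlation(machine, human):
    """Absolute correlation between machine and human scores, as in Eq. (17)."""
    if len(machine) != len(human) or len(machine) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(machine)
    mean_a = sum(machine) / n
    mean_b = sum(human) / n
    # Numerator: absolute value of the summed co-deviations.
    numerator = abs(sum((a - mean_a) * (b - mean_b) for a, b in zip(machine, human)))
    # Denominator: square root of the product of the summed squared deviations.
    denominator = math.sqrt(
        sum((a - mean_a) ** 2 for a in machine) * sum((b - mean_b) ** 2 for b in human)
    )
    if denominator == 0:
        raise ValueError("correlation is undefined when one score list is constant")
    return numerator / denominator
```

Because of the absolute value in the numerator, the result lies in [0, 1], so perfectly opposed score lists also yield 1.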

In Table 2, we list the correlation at different levels of Mandarin pronunciation, including phonemes, syllables, words, and phrases. The results show that our evaluation scheme obtains a satisfactory correlation with the human scores, especially at the word level. It can be seen that combining the prosodic parameters for pronunciation evaluation can effectively reflect the characteristics of Mandarin and improve the evaluation performance. In addition, more evaluation factors should be taken into consideration to further improve the performance, for example, learners’ language background, cultural habits, and Mandarin linguistic properties.

5.2. Usability

In this subsection, we test the usability of our proposed Mandarin edutainment system. The evaluation is based on pronunciation improvement compared with traditional 2D learning systems that have neither a VLE nor a virtual 3D learning game. For the learners’ Mandarin accent, we focus on fluency, comprehensibility, grammar, and vocabulary. In the case of VLE game scenarios, three interactive features need to be considered: learner interface acquaintance, VR navigation, and VLE distractions (Virvou and Katsionis, 2008).

First, we test the pronunciation improvement of our system compared with a similar application without any 3D VR game. The evaluation indicator is the rate of accuracy improvement, which we define as follows:

$$\text{accuracy improvement} = \frac{(N - D_{\text{after}} - S_{\text{after}} - I_{\text{after}})/N}{(N - D_{\text{before}} - S_{\text{before}} - I_{\text{before}})/N} \tag{18}$$

where $D_{\text{before}}, S_{\text{before}}, I_{\text{before}}$ and $D_{\text{after}}, S_{\text{after}}, I_{\text{after}}$ are the total numbers of disparateness, substitution, and insertion errors before and after using our learning system, respectively, and $N$ is the total number of labels in the whole course. We show the results in Table 3 for the three types of learners.

Table 3. Accuracy improvement based on the different level learners.

Accuracy           Novice (%)   Intermediate (%)   Experienced (%)
Traditional ones   59.17        42.41              34.19
Proposed ones      71.45        59.12              43.29
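The ratio in Eq. (18) can be worked through with a short sketch. The function names and the example counts (200 labels, 80 errors before, 40 errors after) are invented purely for illustration and are not data from the experiments.

```python
def accuracy(n_labels, disparateness, substitution, insertion):
    """Fraction of labels left after subtracting the three error counts."""
    return (n_labels - disparateness - substitution - insertion) / n_labels


def accuracy_improvement(n_labels, before, after):
    """Eq. (18): accuracy after using the system divided by accuracy before.

    `before` and `after` are (disparateness, substitution, insertion) counts.
    """
    acc_before = accuracy(n_labels, *before)
    if acc_before == 0:
        raise ValueError("accuracy before is zero, so the ratio is undefined")
    return accuracy(n_labels, *after) / acc_before


# Hypothetical counts: accuracy rises from 0.6 to 0.8, a ratio of about 1.33.
ratio = accuracy_improvement(200, before=(40, 20, 20), after=(20, 10, 10))
```

A ratio above 1 indicates fewer recognition errors after using the system; a ratio of exactly 1 indicates no change.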

As shown in Table 3, our proposed system achieves significant improvement of the learners’ accent skills across all levels, especially for novice learners. Traditional 2D education software has a simple user interface, with hypertext for displaying domain theory and pronunciation results through forms, dialogue boxes, buttons, drop-down menus, etc., which easily bores the learners after a short period of study. Empirical study shows that the VR environment and animated speaking agents can effectively simulate the interaction between persons and their environment. The significant pronunciation improvement of novice learners in Table 3 demonstrates the excellent performance of our proposed system compared with the traditional 2D systems. From the ecological perspective, our 3D VR system reveals a greater potential for learning. 3D face recognition technology can identify the learners for personalized learning services (Ming and Ruan, 2012), and 3D motion analysis can adjust the learning materials in real time to improve learning efficiency (Ming et al., 2012). In our Mandarin courses, we design relatively simple interactive practices for the learner-specified agents within the VR environment. As learning skills and interactive ability progress, the proposed system continuously improves the interactive VR scenarios and provides increasingly difficult and suitable learning materials to satisfy the learners’ personalized demands. The Mandarin skill improvement in our system relies heavily on this ever-refining perception cycle, which is distinctly different from the traditional 2D system that only provides the pronunciation result and evaluation scores.

The accuracy ratings correlate highly with fluency, comprehensibility, grammar, and vocabulary. Combining the immersive, interactive, and imaginative advantages of our system, every study unit and game is designed to match the personalized preferences of the learners and incorporates numerous innovations spanning interactive simulations, an intelligent navigation system, speech processing, and so on. The learners tend to reach extensive levels of engagement and practice by driving the virtual humans to exhibit behaviors that are appropriate to Chinese culture, which is well suited to language learning and pronunciation improvement.

Next, we evaluate the usability of our VR Mandarin educational game based on the interactive features for our language instruction goals. Learner interface acquaintance refers to the skill of operating the VLE system for learners of different levels (Virvou and Katsionis, 2008). VR navigation is concerned with the learner’s behaviors and tests the time spent looking for the correct way. The last indicator is VLE distraction, which is related to the learner’s attention and to whether the learners focus on educational goals or amusement.

Table 4. The performance of user interface acquaintance. In this comparison, Group 1, 2, and 3 report the percentage of use of the inventory, map, and hints over the whole learning process, respectively. WMT 1, 2, and 3 denote the time wasted when not using the inventory, map, and hints, respectively. TT 1, 2, and 3 denote the total time for each learner without using the inventory, map, and hints, respectively, and Improved 1, 2, and 3 denote the improvement of Mandarin skills with the inventory, map, and hints, respectively, comparing the first hour of playing with the second hour.

People       Novices     Intermediate   Experienced
Group 1      65.17%      42.47%         12.32%
WMT 1        75 s        52 s           32 s
TT 1         7.97 min    5.36 min       2.18 min
Improved 1   47.57%      23.62%         11.15%
Group 2      54.18%      32.52%         21.84%
WMT 2        72 s        59 s           32 s
TT 2         12.41 min   8.53 min       3.82 min
Improved 2   45.36%      9.78%          5.34%
Group 3      48.31%      23.15%         11.08%
WMT 3        254 s       148 s          107 s
TT 3         7.25 min    5.02 min       3.46 min
Improved 3   19.72%      11.75%         5.23%

Table 5. The performance of VR navigation. In this comparison, Group 1 and 2 give the number of occurrences of losing the way and of inability to move, respectively. WMT 1 and 2 denote the wasted time in the two cases, TT 1 and 2 denote the total time in the two cases, and Improved 1 and 2 denote the improvement of Mandarin skills in the two cases.

People       Novices     Intermediate   Experienced
Group 1      132 times   79 times       25 times
WMT 1        41 s        22 s           15 s
TT 1         6.35 min    2.74 min       1.51 min
Improved 1   25 times    14 times       5 times
Group 2      207 times   152 times      97 times
WMT 2        35 s        24 s           13 s
TT 2         8.76 min    6.54 min       2.35 min
Improved 2   22 times    14 times       6 times

Table 6. The performance of VLE distraction. In this comparison, DOT denotes the number of occurrences of distraction and TT denotes the total delay time due to distraction.

Times   Novice       Intermediate   Experienced
DOT     6.5 times    11.5 times     3.5 times
TT      4.57 times   7.18 times     5.93 times

In our system, we introduce some assistant features to help the learners, such as an inventory, a map, and hints. We record the time spent on these features, and the improvement in the second hour of use compared with the first hour, to evaluate the learner’s skill in navigating the interface. The more time is wasted, the lower the usability. Some of the collected data are displayed in Table 4. More experience clearly leads to less time spent seeking assistance, and therefore experienced learners tend to use fewer assisting features than the other two types of learners. The data show that experienced learners can easily find their way without frequently using the assisting features. In addition, in the second hour spent in the system, there is a significant improvement for novice learners once they are more acquainted with the interface, which indicates that our system is quite easy to grasp.

Second, the performance of VR navigation is measured in two ways: losing the way and inability to move (Virvou and Katsionis, 2008). The first describes the case in which the learners have difficulty moving around the virtual environment. The second is that the learners may encounter obstructions which prevent forward movement, such as virtual objects and walls. The wasted time, as the measurement of a VR navigational problem, is shown in Table 5. We can summarize that the experienced learners have fewer difficulties with navigation than the novice learners, which gives them more time to read the related Mandarin theory and consolidate their pronunciation. As the learners advance in the game, the time spent on these problems clearly decreases for the novice and intermediate learners. They can then benefit more from the educational content of the VR game. The improvements reflect the power of our system to guide the learners out of problematic circumstances.

The ultimate goal of our system is to help learners make great progress in Mandarin pronunciation. However, the VR environment might distract the learner from acquiring Mandarin. In this part, we test how to balance education and entertainment to form edutainment. Statistical information on VLE distraction problems is recorded during two hours of learning time. These data indicate the average number of distractions for each learner and the total delay time due to distraction. As shown in Table 6, the experienced learners have the fewest distractions and the intermediate learners have the most. One cause of distraction is that the learners may be attracted by the VR scenes, such as the movements of animated agents, virtual objects, and storytelling, resulting in absent-mindedness. In the future, we will add more personalized options for customizing the virtual environment components to reduce distractions.


Table 7. Time spent on the different language learning systems.

Time (min)      Novice   Intermediate   Experienced
Non-game        125      367            593
Non-education   325      402            617
Ours            546      737            813

Table 8. Percentage of learners in each of three learner categories who chose each topic specified in the left-most column when filling out the questionnaire.

Topic          Novices (%)   Intermediate (%)   Experienced (%)
Consonant      43            31                 26
Vowels         35            23                 56
Tones          23            31                 42
Rhythm         31            42                 29
Intonation     36            25                 47
Stress         52            34                 29
Society        57            32                 46
History        21            25                 42
Culture        26            19                 35
Words          59            43                 32
Phrases        46            54                 48
Sentences      25            32                 54
In class       52            57                 41
At home        61            43                 37
In community   49            45                 62

5.3. Likeability

How to increase the learners’ motivation and engagement is an important factor for the Mandarin edutainment system. The likeability of our system is measured by two comparative studies (Virvou and Katsionis, 2008): the first compares it with non-game language learning software, and the second compares it with non-education VR game software. The assessment is based on the amount of time learners spend on each application, which shows their preferences. We provide the different applications to all learners for about 900 min and record how much time they spend on each application. The results in Table 7 show a significant preference for our proposed system by all learners in comparison with the other applications, especially for the experienced learners. The overall results demonstrate that our system has achieved the aim of being more attractive and motivating than the non-game educational application, and more instructive and knowledgeable than the non-education game application. Novice learners encounter some difficulties getting started, such as Mandarin pronunciation and system manipulation, resulting in a slightly bored and irritable mood. However, as their pronunciation skills improve, learners become more and more attracted to our system and more frequently immersed in the Mandarin learning environment.

After about two hours of using our system, the learners generally reported that the VLE system was more intelligent and enhanced their understanding of certain concepts or pronunciations, especially in the areas for which the traditional methods have proven inappropriate or difficult. In combination with knowledge of Chinese culture, this can effectively enrich their ability to analyze problems from the perspective of history and culture. A sharable virtual learning space can motivate the learners to be more interactive and can make their learning more engaging and adventurous.
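The time-based likeability measure can be illustrated with the minutes reported in Table 7. The sketch below only divides the time spent on the proposed system by the time spent on a baseline application; the ratio itself and the names `TIME_MIN` and `preference_ratio` are illustrative assumptions, not a metric defined in the paper.

```python
# Minutes spent per application and learner level, copied from Table 7.
TIME_MIN = {
    "Non-game": {"Novice": 125, "Intermediate": 367, "Experienced": 593},
    "Non-education": {"Novice": 325, "Intermediate": 402, "Experienced": 617},
    "Ours": {"Novice": 546, "Intermediate": 737, "Experienced": 813},
}


def preference_ratio(level, baseline):
    """Time spent on the proposed system relative to a baseline application."""
    return TIME_MIN["Ours"][level] / TIME_MIN[baseline][level]


# Print the ratio against the non-game software for each learner level.
for level in ("Novice", "Intermediate", "Experienced"):
    print(level, round(preference_ratio(level, "Non-game"), 2))
```

Under this reading, experienced learners spend the most absolute time on the proposed system, while the ratio against the non-game software is largest for novices.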

5.4. Knowledgeability

In this subsection, we provide the results of a questionnaire concerning the influence of the different Mandarin language elements and cultural knowledge on the learning process. A series of questions about our Mandarin edutainment system address the views of learners at different levels. The major questions about knowledgeability asked in the questionnaire are as follows:

1. Which pronunciation is more motivating? Consonant, vowels, or tones?

2. Which has the most influence on your fluency? Rhythm, intonation, or stress?

3. Which lesson content do you prefer? Society, history, or culture?

4. Which unit in our system is more appealing? Words, phrases, or sentences?

5. Where do you think the knowledge will be helpful? In class, at home, or in the community?

6. Do you have any other suggestions for our Mandarin edutainment system?

The answers to the questionnaire provided a lot of useful data for evaluating the knowledgeability of our proposed system. In the aspect of Mandarin pronunciation, we mainly evaluate the motivation characteristics for the learners based on the different pronunciation elements, namely consonants, vowels, and tones. We show the results of the questionnaire in Table 8. Based on the survey of the different aspects of the Mandarin learning system, we can further analyze the preferences of the learners and alter the course assignment to facilitate more efficient learning. Our immersive simulations in the VR game are based on interactive social communication involving spoken dialogs and cultural protocols, and are built upon ecological psychology and technological support. Interactive dialog can boost the learners’ communication and social skills in the communities and successfully transfer separate Mandarin pronunciations into fluent verbal and non-verbal communication. Cultural protocols involve cultural knowledge, historical background, and artistic accomplishments, which are critical for successfully communicating with native speakers and acquiring a deep understanding of idioms.

There were also a large number of constructive comments collected by analyzing the personal suggestions of learners. First, in order to enhance motivation more effectively, most of the learners indicated that the VLE system would be better if it involved more VR objects, more cultural background, and more adventure schemes like those of commercial games. Second, some learners pointed out that it was not challenging enough and that they wanted it to contain more Chinese idioms, adages, and poems in the dialogs. These suggestions came, to a large extent, from the experienced learners, who want to know more about Mandarin and its native use. Additionally, some learners suggested that they would like to take the courses at home during their leisure time. According to the above comments, we will continue to develop our system to be more comprehensive, entertaining, and flexible.

The results from the evaluations discussed above show that the learners are indeed fascinated by the idea of a VR game for learning Mandarin and that they are more enthusiastic about our VLE system than about traditional non-game learning software, especially learners with poor academic performance. A vital question is how knowledgeability for foreign learners can effectively be incorporated into the system’s usability and likeability interaction. Learners prefer systems which are less difficult to manipulate and have interesting and challenging game scenarios. The length of time learners spend on each unit is also important; an appropriate length will improve learning efficiency.

6. Conclusion

In this paper, we provide an overview of a technique for a Mandarin edutainment system. First, 3D face recognition and emotion analysis technologies are introduced to identify the learner and adjust the learning materials according to the learner’s personal preferences and emotional states. Then, we build a Mandarin pronunciation recognition and assessment system combined with prosodic parameters. Next, the VLE Mandarin learning system is constructed by incorporating VR games, in which highly motivating scenarios effectively stimulate interest during the learning process. Combining Chinese cultural and historical knowledge makes the storyline more fascinating. The experiment is based on four aspects: accuracy, usability, likeability, and knowledgeability. The results demonstrate that our system has distinctive advantages compared with traditional non-game learning software. The open, shared, and interactive properties can enhance the learners’ communication ability and create a real-life social community that responds to the learners’ speech, intent, gestures, and behaviors.

Although our system evaluates the common Mandarin learning issues well, it may produce incorrect results in the presence of unclear articulation and confusing VR scenarios. These are the main obstacles encountered in our existing system. In the future, we will continue to develop our system. First, we need to review the latest speech processing technologies and perform a detailed analysis of the Mandarin characteristics to improve the performance of the speech support module. Second, we will update the current VLE game system and make it easier to manipulate. We will develop more characters and storylines, and adventures will be introduced to the proposed system, which can help the learners to compare their Mandarin pronunciation to that of native speakers. Our ultimate goal is to help the learners master Mandarin more quickly in a satisfying and engaging environment.

References

Amory, A., Naicker, K., Vincent, J., Adams, C., 1998. Computer games as a learning resource. In: Proceedings World Conference on Educational Multimedia, Hypermedia and Telecommunications, pp. 50–55.

Asakawa, S., Minematsu, N., Isei-Jaakkola, T., Hirose, K., 2005. Structural representation of the non-native pronunciation. In: Proceedings of EuroSpeech, pp. 165–168.

Conati, C., Zhou, Xiaoming, 2002. Modeling students’ emotions from cognitive appraisal in educational games. In: Proceedings The 6th International Conference on Intelligent Tutoring Systems, pp. 944–954.

Cucchiarini, C., Strik, H., Boves, L., 1997. Automatic evaluation of Dutch pronunciation by using speech recognition technology. In: Proceedings of IEEE Workshop ASRU, Santa Barbara, pp. 622–625.

Cucchiarini, C., Wet, F.D., Strik, H., Boves, L., 1998. Assessment of Dutch pronunciation by means of automatic speech recognition technology. In: Proceedings of ICSLP, pp. 1738–1741.

de Wet, F., Van der Walt, C., Niesler, T.R., 2009. Automatic assessment of oral language proficiency and listening comprehension. Speech Communication 51, 864–874.

Franco, H.L., Neumeyer, L., Kim, Y., Ronen, O., 1997. Automatic pronunciation scoring for language instruction. In: Proceedings IEEE International Conference on Acoustic, Speech, and Signal Processing, pp. 1465–1469.

Franco, H.L., Neumeyer, L., Digalakis, V., Ronen, O., 2000. Combination of machine scores for automatic grading of pronunciation quality. Speech Communication 30, 121–130.

Hodges, B., 2009. Ecological pragmatics: values, dialogical arrays, complexity and caring. Pragmatics and Cognition 17, 628–652.

http://a2.media.3ds.com/products/3dvia/3dvia-virtools/.

Huang, Jui-Ting, Lee, Lin-shan, 2006. Improved large vocabulary Mandarin speech recognition using prosodic features, Speech Prosody.

Johnson, Lewis et al. at Alelo, Inc, http://www.alelo.com/index.html.

Kearney, P., 2004. Engaging young minds - using computer game programming to enhance learning. In: Proceeding World Conference on Educational Multimedia, Hypermedia and Telecommunications, pp. 3915–3920.

Li, Xunxiang, Chen, Dingfang, Wang, Le, Li, Anding, 2007. A Development Framework for Virtools-Based DVR Driving System. Springer, pp. 188–196.

Minematsu, N., 2004. Pronunciation assessment based upon the compatibility between a learner’s pronunciation structure and the target language’s lexical structure. In: Proceedings of ICSLP, pp. 1317–1320.

Ming, Yue, Ruan, Qiuqi, 2012. Robust sparse bounding sphere for 3D face recognition. Image and Vision Computing, in press.

Ming, Yue, Ruan, Qiuqi, Hauptmann, Alex, 2012. Activity recognition from Kinect with 3D local spatio-temporal features. In: IEEE International Conference on Multimedia and Expo (ICME 2012).

Neumeyer, L., Franco, H., Digalakis, V., Weintraub, M., 2000. Automatic scoring of pronunciation quality. Speech Communication 30, 83–93.

Neumeyer, L., Franco, H.L., Weintraub, M., Price, P., 1996. Automatic text-independent pronunciation scoring of foreign language student speech. In: Proceedings of ICSLP, pp. 217–220.

Pan, Zhigeng, Cheok, Adrian David, Yang, Hongwei, Zhu, Jiejie, Shi, Jiaoying, 2006. Virtual reality and mixed reality for virtual learning environments. Computers and Graphics 30, 20–28.


Rabiner, L., Juang, B.H., 1993. Fundamentals of Speech Recognition. Prentice Hall PTR, Upper Saddle River, New Jersey.

Raux, A., Kawahara, T., 2002. Automatic intelligibility assessment and diagnosis of critical pronunciation errors for computer-assisted pronunciation learning. In: Proceedings of ICSLP, pp. 737–740.

Tsubota, Y., Kawahara, T., Dantsuji, M., 2004. Practical use of English pronunciation system for Japanese students in the CALL class. In: Proceedings of ICSLP, pp. 849–852.

Van Lier, V., 2000. From input to affordance: socio-interactive learning from an ecological perspective. In: J.P. Lantolf (Ed.), Sociocultural theory and second language learning, Oxford University Press, Oxford, pp. 245–259.

Virvou, M., Katsionis, G., 2008. On the usability and likeability of virtual reality games for education: the case of VR-ENGAGE. Computers and Education 50, 154–178.

Williams, Gethin, Renals, Steve, 1999. Confidence measures from local posterior probability estimates. Computer Speech and Language 13, 395–411.

Witt, S.M., 1999. Use of speech recognition in computer-assisted language learning, PhD Dissertation.

Witt, S.M., Young, S.J., 2000. Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication 30, 95–108.

Young, M.F., 2004. An ecological psychology of instructional design: learning and thinking by perceiving-acting systems. In: D.H. Jonassen (Ed.), Handbook of Research for Educational Communications and Technology, second ed. Mahwah, NJ: Erlbaum.

Young, M.F., Barab, S., Garrett, S., 2000. Agent as detector: an ecological psychology perspective on learning by perceiving-acting systems. In: D.H. Jonassen, S.M. Land (Eds.), Theoretical foundations of learning environments, Mahwah, NJ: Erlbaum, pp. 147–172.

Zhang, Yan-Bin, Chu, Min, Huang, Chao, Liang, Man-Gui, 2008. Detection tone errors in continuous Mandarin speech. In: Proceedings IEEE International Conference on Acoustic, Speech, and Signal Processing, pp. 5065–5068.

