
Vis Comput
DOI 10.1007/s00371-013-0841-1

ORIGINAL ARTICLE

Audio-visual speech recognition techniques in augmented reality environments

Mohammad Reza Mirzaei · Seyed Ghorshi · Mohammad Mortazavi

© Springer-Verlag Berlin Heidelberg 2013

Abstract Many recent studies show that Augmented Reality (AR) and Automatic Speech Recognition (ASR) technologies can be used to help people with disabilities. Many of these studies have been performed only within their specialized field. Audio-Visual Speech Recognition (AVSR) is one of the advances in ASR technology that combines audio, video, and facial expressions to capture a narrator's voice. In this paper, we combine AR and AVSR technologies to build a new system that helps deaf and hard-of-hearing people. The proposed system captures a narrator's speech instantly, converts it into readable text, and shows the text directly on an AR display. Deaf people can therefore read the narrator's speech easily, and hearing people do not need to learn sign language to communicate with them. The evaluation results show that the system has a lower word error rate than ASR and VSR under different noisy conditions. Furthermore, the results of using AVSR techniques show that the recognition accuracy of the system is improved in noisy places. Also, the results of a survey conducted with 100 deaf people show that more than 80 % of deaf people are very interested in using our system on portable devices as an assistant for communicating with people.

M.R. Mirzaei (✉) · S. Ghorshi · M. Mortazavi
School of Science and Engineering, Sharif University of Technology, International Campus, Kish Island, Iran
e-mail: [email protected]

S. Ghorshi
e-mail: [email protected]

M. Mortazavi
e-mail: [email protected]

Keywords Augmented reality · Audio-visual speech recognition · Augmented reality environments · Communication · Deaf people

1 Introduction

Today, the use of new technologies to help people with disabilities is receiving considerable attention, and much research in this area is underway. The key issue is how to combine these technologies and use them together to make them more applicable.

Augmented Reality (AR) is a technology that has attracted a lot of interest recently. AR is a low-cost technology built on 3D computer graphics and one special feature known as the marker [1]. The ultimate goal of AR is to display virtual objects in a real environment so convincingly that viewers feel they are seeing the real world [2]. AR can be used to enrich a real environment with virtual objects rather than to make the environment completely virtual. Therefore, AR actually sits between the real world and Virtual Reality (VR) as the "Middle Ground" [3]. AR has penetrated the world of mobile devices and has become practical in recent years; recent advances in mobile technologies have made AR a viable technology on mobile devices [4]. AR has widespread applications in different fields, and many toolkits have been prepared for it. AR has also been used in many portable devices that show AR environments, such as AR Head-Mounted Displays (HMDs), AR glasses, and AR bio-lenses for the eye [5]. Several studies have shown that AR and VR can be used to help people with disabilities [6, 7]. AR gives disabled people the possibility to control and manage information and adapt it easily to the desired form to improve their interactions with other people [8, 9].


Another technology that can be used to help disabled people is Automatic Speech Recognition (ASR). An ASR, or speech-to-text, system converts the speech signal into text or commands [10]. ASR technology gives disabled people the possibility to control and use computers by voice commands, e.g., to control robotic arms [11]. In addition, ASR allows deaf or hard-of-hearing people to participate in conference and lecture rooms by automatically generating closed captions of conversations [11]. ASR has also been used in deaf telephony, such as voicemail-to-text, relay services, and captioned telephone [12].

In a noisy environment, human listeners have difficulty understanding what a narrator says; however, they can use lip-reading to recognize the narrator's voice. Noise is therefore the most important issue in speech recognition systems, especially when ASR systems are used outdoors [13]. Recent advances in ASR technology combine audio, video, and facial expressions to capture the narrator's voice [14, 15]. This technique, called "Audio-Visual Speech Recognition (AVSR)," improves speech recognition efficiency significantly in conditions where an ASR system provides poor results [16]. Also, some new studies have shown that building a real-time AVSR system is now possible [17].

Communication with people is a basic need for everyone, but some people, such as deaf people, are not able to communicate well. Deaf people communicate visually and physically rather than audibly, so they face problems in their relationships with other people. Usually, deaf people learn and use sign language to communicate with each other, but hearing people usually have no desire to learn it. For this reason, many people feel awkward or become frustrated trying to communicate with deaf people, especially when no interpreter is available.

In this paper, a new system is proposed that combines AR and AVSR technologies. The system has multiple features, but helping deaf people communicate with other people is its main goal. It uses the narrator's audio, video, and facial expressions to make the narrator's speech visible to deaf people on an AR display.

The rest of this paper is organized as follows. In Sect. 2, related work in the AR and AVSR areas is presented. In Sect. 3, the proposed system is explained in terms of structure, system design, and system requirements. In Sect. 4, the experimental results are presented, and finally, in Sect. 5, we summarize and conclude the paper.

2 Related work

Recently, much research has been carried out in the AR and AVSR technology areas. Some of it shows that these technologies can be used to help people with disabilities and have many benefits for this group of people. However, many of these studies have been performed only in their specialized field, and few tend to use a combination of these technologies. This section provides a short overview of some important studies on AR and AVSR technologies, and shows the advancement and impact of these technologies on our proposed system.

In the AR field, Zainuddin et al. [18] used AR to make an AR book called the "Augmented Reality Book Science in Deaf (AR-SiD)," which contains 3D modeled objects based on markers and sign-language symbols to improve the learning process of deaf students. They showed that AR is very useful for visualizing abstract concepts in science learning to help deaf people. Also, in the ASR field, Lopez-Ludena et al. [19] developed an advanced speech communication system with a visual user interface and a 3D avatar module for deaf people, and showed the effect of using ASR to help people with disabilities.

Some researchers have worked on using AR and ASR technologies together and have reported the benefits of combining them. Irawati et al. [20] used a combination of AR and ASR technologies in AR design applications to show the possibility of using speech to arrange virtual objects in virtual environments. Hanlon et al. [21], in a similar work, used speech interfaces as an attractive solution to the problem of using a keyboard and mouse in AR design applications. Kaiser et al. [22] used AR and ASR technologies to build a 3D-gesture multimodal AR interface application with speech recognition and object identification capabilities. In addition, AR and ASR technologies have been used in AR industrial maintenance applications. Goose et al. [23] built a multimodal AR interface framework in the SEAR (Speech-Enabled AR) project with a context-sensitive speech dialogue for maintenance technicians checking factory components.

Recent studies have also shown that ASR performance is very sensitive to noise, and the presence of background noise in a signal can deteriorate the recognition performance [13]. Researchers have therefore worked on AVSR techniques to improve ASR performance. Chin et al. [24] showed that an AVSR system can use the narrator's facial expressions, such as the lips, tongue, and jaw, to capture speech, especially in noisy environments. Figure 1 shows the performance of ASR, VSR, and AVSR at various acoustic SNRs. It is clear from Fig. 1 that the AVSR performance is much better than Audio-only Speech Recognition (ASR) or Visual-only Speech Recognition (VSR) at various acoustic Signal-to-Noise Ratios (SNRs).

It is observed from Fig. 1 that VSR is not sensitive to noise, as its recognition rate is retained at about 50 %.


It can also be noted that the performance of ASR decreases as the level of noise increases, although its recognition rate is good in clean environments. Figure 1 also shows that the recognition rate of AVSR is better than that of ASR and VSR in noisy environments [24].

Compared to our work, the systems described above are used by speakers, and the speech is used either to specify a command for arranging virtual objects in AR environments or as a parameter for identifying objects without using input devices. Furthermore, all of these systems use AR markers to detect and show virtual objects in an AR environment. Our system captures the narrator's speech and uses the AVSR engine to convert the speech to text, so that the AR engine can show the text directly in an AR environment. Also, the main user of our system is a deaf person who does not talk in the system's scenario. In addition, our system is a markerless AR application and uses the narrator's facial expressions as a marker to show the virtual objects in an AR environment.

Fig. 1 Recognition rate under various acoustic SNRs [24]

3 The proposed system

Our proposed system includes a variety of technologies. It consists of the AR engine, the AVSR engine, the Joiner algorithm, audio-visual content, automated process scripts, and face detection techniques. The system uses a combination of these technologies to make speech visible to deaf people. Figure 2 shows the overall view of our proposed system with an Ultra-Mobile PC (UMPC). Deaf people can also use mobile phones or an AR-HMD to see AR environments. Since the system is developed as a cross-platform application, it can be used on many portable devices.

In our system's scenario, the AVSR engine collects the speech and facial expressions of the detected narrator, and the AR engine realizes the scenario. The Joiner algorithm is used to make the AR and AVSR engines work together. To achieve some goals, it is necessary to write and use some automated process scripts that will be integrated into the system in the future. The system uses the AVSR engine to recognize the narrator's speech and convert it to text, and it uses the AR engine to display the text as a dynamic object in AR environments. Also, the system uses the built-in camera and microphone of a UMPC (or mobile phone) or an AR-HMD to capture the narrator's video and speech, and uses their displays to show the objects in AR environments. A deaf person can use our system anywhere without carrying AR markers because of the following important features of the system:

1. The face detection techniques are used instead of markers to detect the narrator in the environment.

2. The AVSR techniques are used instead of ASR techniques to capture the narrator's voice.

3.1 System structure

In this section, the structure of our proposed system is briefly explained in terms of the components and technologies used in the system. Figure 3 shows the structure of the proposed system. It consists of two main parts: hardware and software.

Fig. 2 Overall view of our proposed system


Fig. 3 Our proposed system structure

In the hardware part, some hardware requirements, such as a camera, a microphone, and a display, are needed; in the software part, the core of the system is developed, consisting of the AR engine, the AVSR engine, the Joiner algorithm, and the Auto-Save script. All these parts can be brought together in an integrated system.

In this system, the deaf person points the camera at the narrator. This camera can be a web camera connected to or embedded in a computer or mobile phone, or mounted on an AR-HMD. The camera captures the video, and the built-in or external microphone captures the narrator's speech. These two parameters (video and speech) are considered key parameters in the system, because both are processed separately. When video frames are delivered by the camera, the AR engine identifies the narrator's face and uses it as a marker. Simultaneously, the AVSR engine captures the speech and video of the detected narrator and converts his or her speech to a text file. This output text file is not completely dynamic and cannot be used by the Joiner algorithm, since it changes frequently over time without being saved. To make it completely dynamic and usable for the Joiner algorithm, an Auto-Save script is proposed that saves the text file automatically at specified times. This text file is called the "Dynamic Text File" and is used by the Joiner algorithm as its text string database. Therefore, words are saved automatically in the text file every time the narrator says something. The Joiner algorithm module loads the dynamic text file strings onto a speech bubble image that is used by the AR engine. The AR engine places the output of the Joiner algorithm according to the face position information and augments it on the narrator's video frames in an AR environment. Finally, deaf people can see and read the narrator's speech easily on the AR display, near the narrator's face.

3.2 System design

The system is designed to be portable so it can be used easily by deaf people. The AR engine is developed on the Adobe Flash Platform [25] to make the system cross-platform; therefore, the system can be used by deaf people on many portable devices. Also, the system can be developed with either an internal or an external AVSR engine. For initial testing of the system, an open-source AVSR engine called "Audio-Visual Continuous Speech Recognition (AVCSR)," developed by an Intel research group [26], is used as an external engine.

3.2.1 Hardware requirements

To run the system in its basic state, very powerful hardware is not necessary, as the system runs as a very lightweight application. In general, some specific hardware requirements are needed to run the system on a portable device with the Windows Operating System (OS). These hardware requirements are shown in Table 1.

The power of ASR or AVSR engines is different in portable devices and depends on hardware specifications.


Nevertheless, the system runs easily on new mobile phones, thanks to recent advances in mobile hardware. For initial testing of the system, we used the Intel AVCSR application on a UMPC with the Microsoft Windows XP OS and specific hardware: an Intel Dual Core 1.8 GHz processor, 4 GB of memory, and a 5-megapixel High-Definition (HD) external web camera with a built-in noise-canceling microphone. The system works fine with this particular hardware and provides very good results in the Windows XP OS. The only limitation is that the Intel AVCSR application cannot be run on another OS, such as Mac OS, and has been tested only on Windows OS.

3.2.2 Software design

The system has three main software parts: the AVSR engine, the Joiner algorithm, and the AR engine. A group of libraries and toolkits is used in each part for face detection, audio and video recognition, tracking, text object loading, and rendering.

The AVSR engine The proposed system uses a powerful external AVSR engine so that different results can be obtained in the experimental work. We found the Intel AVCSR application package to be the best choice for our purpose.

Table 1 Hardware requirements

Processor     Intel Mobile or Core Duo, 1.0 GHz
Memory        1 GB
Camera        640 × 480 VGA sensor or better, compatible with computers that have either USB 2.0 or USB 1.1, or a built-in camera
Microphone    Any type of external microphone, or a built-in microphone

It was developed by the Intel research group to test and evaluate audio and visual speech recognition [26]. The Intel AVCSR uses a set of audio and visual features to increase the accuracy of speech recognition in noisy environments. It uses OpenCV (Open Source Computer Vision by Intel) [27] and contains a set of Visual C++ functions for visual feature extraction, as well as binary code for audio feature extraction and the fusion model [26]. Figure 4 shows the AVSR engine's working process. The AVSR engine processes the input narrator's audio, video, and facial expressions and converts them into recognized text strings [26]. In general, most AVSR engines use the speech recognition method shown in Fig. 4. The result of this process is a set of recognized text strings that are written by the engine into a text file. Our system uses the dynamic version of this output text file as the input to the Joiner algorithm.

First, the AVSR engine detects and tracks the narrator's face and mouth in consecutive video frames. Then it extracts a set of visual features from the mouth region using Principal Component Analysis (PCA) followed by Linear Discriminant Analysis (LDA) over N concatenated frames [28]. In addition, it obtains acoustic features from the audio channel in the form of Mel-Frequency Cepstral Coefficients (MFCCs) [28]. Finally, it models the audio and visual observation sequences jointly using a Coupled Hidden Markov Model (CHMM) [28].
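To make the two feature streams concrete, the following is a minimal Python sketch, not taken from the Intel AVCSR package: it computes MFCCs from an audio signal with librosa and applies PCA followed by LDA to flattened mouth-region patches with scikit-learn. The synthetic data, the numbers of components, and the eight classes are illustrative assumptions, and the CHMM fusion step is omitted.

```python
# Hedged sketch of the two feature streams (acoustic MFCCs, visual PCA+LDA).
# Synthetic stand-in data is used; the real engine works on speech and mouth ROIs.
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Acoustic features: 13 MFCCs per frame from a 1-second waveform.
sr = 16000
audio = np.random.randn(sr).astype(np.float32)           # stand-in for recorded speech
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

# Visual features: PCA for dimensionality reduction, then LDA for class separation.
n_frames, roi_size = 200, 32 * 16                        # flattened mouth-region patches
mouth_rois = np.random.rand(n_frames, roi_size)
labels = np.arange(n_frames) % 8                         # e.g. 8 word/viseme classes (assumed)

pca_feats = PCA(n_components=30).fit_transform(mouth_rois)
visual_feats = LinearDiscriminantAnalysis(n_components=7).fit_transform(pca_feats, labels)

print(mfcc.shape, visual_feats.shape)                    # (13, frames) and (200, 7)
```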

The Intel AVCSR application is an offline system in which recorded audio-visual data is loaded and processed from an Audio Video Interleave (AVI) file. The audio speech recognition decoder (ASR), the visual or lip-reading decoder (VSR), and the AVCSR decoder [26] are used separately by the Intel AVCSR application to process the input .AVI file. The system therefore needs to supply the test data in an appropriate format to compare the performance of the different engines.

Fig. 4 Block diagram of the AVSR engine


Fig. 5 Using the video recorder in the AVSR engine

Fig. 6 The Intel AVCSR and Auto-Save script working process

As mentioned earlier, the Intel AVCSR application is an offline system and cannot work in real time. Therefore, the system needs to record the narrator's video in an .AVI file to be processed by the Intel AVCSR. The system also cannot use the Intel AVCSR results directly; it is necessary to change the Intel AVCSR application source code to use its results in the system. Since the Intel AVCSR is an open-source package, it is possible to change some of its code in order to retrieve the recognized text strings and write them, in order, into a .TXT file serving as the system's text database. For initial testing of the system, a .TXT file is used to save the recognized text strings. Alternatively, the system can be developed with an internal temporary text array to hold the recognized text strings.

Furthermore, a small video recorder application is used by the system to record the narrator's video (in .AVI file format) and send it to the Intel AVCSR application, as shown in Fig. 5. This video recorder was developed in the Visual Basic programming language. To avoid an ever-growing .AVI file, the video recorder is limited to recording a 100 MB video file, and it deletes each processed .AVI file automatically. Certainly, the video recorder would not be needed if a real-time AVSR engine were used [17].

Every time the .AVI file is opened and processed by the Intel AVCSR, the recognized words are analyzed by its ASR, VSR, and AVSR engines. These text strings are then written by the Intel AVCSR into the text file, but the file is not saved by the Intel AVCSR. The Joiner algorithm needs the updated version of the text file; that is, every word the Intel AVCSR recognizes as a text string must be saved in the file. The Auto-Save script was proposed to perform this saving operation automatically every 1 s while the narrator is speaking. This script provides the updated version of the text file as the "Dynamic Text File," which is then used by the Joiner algorithm as its input text, as shown in Fig. 6. The Auto-Save script is a small automated process script written in VBScript (Visual Basic Scripting Edition) for the Microsoft Windows OS; it could also be developed in other programming languages or as internal code in the system. To avoid filling the text file with useless text strings, the system has been programmed to clear the text file completely every time it is run by the user.
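As an illustration of the Auto-Save step, the sketch below mimics its behavior in Python (the actual script is a VBScript running under Windows): the dynamic text file is cleared at start-up, and newly recognized words are appended and saved roughly once per second. The file name and the simulated recognizer output are assumptions.

```python
# Hedged sketch of the Auto-Save behavior: clear the file at start-up,
# then flush each newly recognized word to disk about once per second.
import time
from pathlib import Path

DYNAMIC_TEXT = Path("dynamic_text.txt")             # hypothetical file name
DYNAMIC_TEXT.write_text("")                         # the system clears the file on each run

recognized_words = ["hello", "how", "are", "you"]   # stand-in for AVSR engine output

for word in recognized_words:
    with DYNAMIC_TEXT.open("a", encoding="utf-8") as f:
        f.write(word + "\n")                        # append and save the new word
    time.sleep(1.0)                                 # save interval of roughly 1 s
```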

The joiner algorithm In this section, the "Joiner algorithm" module is proposed to connect the AR and AVSR engines. This module enables the system to combine the AR and AVSR engines: the output of the AVSR engine is the input to the AR engine. The Joiner algorithm is a Flash application written in ActionScript 3.0 (AS3) [29] that loads the dynamic text file strings onto a speech bubble image. The speech bubble is then used by the AR engine, as shown in Fig. 7.

The Joiner algorithm was developed to load only the last word in the text file. This choice is optional, and developers can easily change it to show more words (such as a full sentence).


Fig. 7 The Joiner algorithm working process

With this design, the Joiner algorithm always looks for the last word in the text file, and when it finds a new word, it loads that word onto the speech bubble image. The output of the Joiner algorithm is a .SWF application that is displayed by the AR engine in an AR environment.
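The reading side of the Joiner algorithm can be sketched as follows. This is a hedged Python illustration of the polling logic only, not the authors' AS3 module that renders onto the speech bubble; the file name, polling interval, and number of cycles are assumptions.

```python
# Hedged sketch of the Joiner algorithm's polling logic: watch the dynamic
# text file and report only the most recently saved word when it changes.
import time
from pathlib import Path

DYNAMIC_TEXT = Path("dynamic_text.txt")        # written by the Auto-Save step
DYNAMIC_TEXT.touch(exist_ok=True)              # make the sketch runnable on its own

def last_word(path: Path) -> str:
    words = path.read_text(encoding="utf-8").split()
    return words[-1] if words else ""

shown = ""
for _ in range(5):                             # a few polling cycles for illustration
    current = last_word(DYNAMIC_TEXT)
    if current and current != shown:           # a new word has arrived
        shown = current
        print(f"load '{shown}' onto the speech bubble")
    time.sleep(0.5)
```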

The AR engine In the AR engine, FLARManager [30] and OpenCV (with AS3) are used. In addition, the FLARToolKit library [31] is used within the FLARManager framework to make AR applications with Flash platform support. FLARManager is compatible with many 3D frameworks [32], such as Papervision3D [33]. FLARManager also supports multiple-marker detection, which will enable the system to detect more than one narrator in the future. The purpose of using these toolkits is their ability to export the system as a .SWF file, which makes it possible to build the system as a cross-platform application. These libraries also enable the AR engine to show the output of the Joiner algorithm, which is a .SWF file, in AR environments.

As mentioned earlier, our system uses face detection instead of marker detection. Most AR applications use markers to detect and identify objects. If markers had been used, we would have had to design and create specific markers for the system, and users would have had to carry them. Therefore, the system was developed as a markerless AR application. In this way, some marker detection challenges, such as marker choice and marker creation, which would make the system more complex, are avoided.

In the system, OpenCV is used to detect the narrator's face as a marker. OpenCV is a library of programming functions for real-time computer vision and is a powerful platform [27]. Our system uses a ported AS3 version of OpenCV called "Marilena Object Detection" for facial recognition and motion tracking. Marilena is a set of AS3 library classes for object detection and decent facial recognition [34]. It has the best performance among the AS3 libraries for face detection, and it does not impose a heavy processing load on the AR engine. The face detection part of OpenCV was ported to AS3 by Ohtsuka Masakazu of the Spark project team (Libspark) [34]. Figure 8 shows the Marilena library's face detection results for different angles and modes of the face.

We combine the Marilena library with the FLARToolkit library to develop the AR engine.

Fig. 8 Marilena library face detection

To play the narrator's video on the AR display, SWF (.FLV) video playback is used by FLARToolkit in the AR engine. The block diagram of the AR engine is shown in Fig. 9.

In the AR engine, after the narrator's video frames are captured by the camera, the face detection module identifies the face of the current narrator in the environment and sends this information to the Position and Orientation module. For initial testing of the system, the face detection module was developed to detect and use only one person's face in the environment as the narrator's face; the system therefore knows where the narrator's face is located in the video frames. Simultaneously, the Joiner algorithm sends its output, a .SWF file, to the Position and Orientation module. The Position and Orientation module then places the speech bubble .SWF file according to the narrator's face information and sends it to the Render module. Finally, the Render module augments the output of the Position and Orientation module onto the AR display, near the narrator's face. Figure 10 shows the narrator's speech made visible on the AR display.
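The following is a rough Python/OpenCV sketch of the same detect-position-render flow, using a Haar cascade in place of the AS3 Marilena port and simple cv2 drawing calls in place of the FLARToolkit render module. The webcam index, the bubble geometry, and the placeholder word are assumptions, not the authors' implementation.

```python
# Hedged sketch: detect the first face, treat it as the narrator, and draw the
# latest recognized word in a bubble next to it, scaled with the face size.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)                      # assumes a webcam at index 0
word = "hello"                                 # stand-in for the Joiner algorithm output

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if len(faces) > 0:
        x, y, w, h = faces[0]                  # only one face is treated as the narrator
        scale = max(0.5, w / 200.0)            # bubble grows and shrinks with the face
        x2, y2 = x + w, y                      # anchor the bubble beside the face
        cv2.rectangle(frame, (x2, y2), (x2 + int(160 * scale), y2 + int(50 * scale)),
                      (255, 255, 255), -1)
        cv2.putText(frame, word, (x2 + 5, y2 + int(35 * scale)),
                    cv2.FONT_HERSHEY_SIMPLEX, scale, (0, 0, 0), 2)
    cv2.imshow("AR display (sketch)", frame)
    if cv2.waitKey(1) & 0xFF == 27:            # press Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```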

On the AR display, the speech bubble, which is placed near the narrator's face, is sensitive to movement. When the narrator moves his or her face, the speech bubble also moves. In addition, when the narrator moves his or her face forward or backward, the speech bubble is zoomed in or out without any delay. It is also possible to make the speech bubble sensitive to the narrator's face rotations, but we disabled this feature by default, because the text would fade and become unreadable when the speech bubble rotates.


Fig. 9 Block diagram of the AR engine

Fig. 10 The AR environment in our proposed system

Figure 11 illustrates the proposed system; it can be seen from Fig. 11 that the narrator's speech is visible on the AR display.

4 Evaluation results

We evaluated and tested the system in three different noisy environments with three different people, using the Intel AVCSR application. In the Intel AVCSR, it is possible to add noise to the input video file at various acoustic SNRs, simulating different noisy environments for the project. To test the system, the video recording must be completely clean; in other words, there should be no noise in the recording. Also, the video files were recorded from three different people so that the AVSR engine could be tested with different facial conditions. Figure 12 shows the output of the Intel AVCSR for different facial expressions; it is clear from Fig. 12 that three different people were used with the Intel AVCSR.
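For reference, mixing white noise into a clean signal at a chosen SNR can be sketched as below. This is a generic Python illustration of the simulation idea, not the Intel AVCSR's internal routine; the synthetic test tone is an assumption.

```python
# Hedged sketch: add white noise to a clean signal at a target SNR (in dB).
import numpy as np

def add_noise(clean: np.ndarray, snr_db: float) -> np.ndarray:
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10.0))       # SNR = 10*log10(Ps/Pn)
    noise = np.random.randn(len(clean)) * np.sqrt(noise_power)
    return clean + noise

clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)     # stand-in 440 Hz tone
for snr in (30, 20, 10, 0):
    noisy = add_noise(clean, snr)
    print(f"{snr:>2} dB SNR -> noisy RMS = {np.sqrt(np.mean(noisy ** 2)):.3f}")
```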

For initial testing of the system, the conditions were chosen to reflect the performance of the system in different places with different facial expressions, in terms of recognition accuracy. To get better results, an external microphone with a noise-canceling feature and a powerful digital camera were used to capture the narrator's speech and video. It is also assumed that the distance between the narrator and the camera is only 1 meter and that the narrator speaks slowly with a clear English accent.

4.1 Classification of tests

We classified the tests according to different conditions in which a deaf person may find himself or herself. The tests are classified as follows: Test 1 for noiseless environments, Test 2 for different SNRs (30, 20, and 10 dB), and Test 3 for very noisy environments with 0 dB SNR. Since the Intel AVCSR application has some limitations in word recognition, a specific text with 8 different words is used in each test as our own database.
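The word error rate used in the comparisons below is a standard word-level edit distance; a minimal Python sketch is given here for clarity. The 8-word reference sentence and the hypothesis are made up for illustration and are not the actual test text.

```python
# Hedged sketch: word error rate as edit distance (substitutions, insertions,
# deletions) between reference and hypothesis word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical 8-word reference and a recognizer output with two errors.
reference = "open the door and turn on the light"
hypothesis = "open a door and turn on light"
print(f"WER = {wer(reference, hypothesis):.2%}")   # 2 errors / 8 words = 25.00 %
```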

4.1.1 Test 1: noiseless environments

In Test 1, it is assumed that the system works in noiseless environments. Figure 13 shows the comparison of the word error rate between the Intel AVCSR engines (ASR, VSR, and AVSR). The results show that the AVSR engine has a much lower word error rate on average than ASR and VSR.


Fig. 11 Illustration of the proposed system

Fig. 12 Illustration of Intel AVCSR on different facial expressions

Fig. 13 Comparison of word error rate in a noiseless environment

Also, it is noted that differences in facial expressions have a direct impact on the results of the VSR and AVSR engines. Therefore, it is clear from Fig. 13 that the system works with its best performance and accuracy in noiseless environments when using the Intel AVCSR engines.

4.1.2 Test 2: noisy environments with different SNRs

In Test 2, a specific noise with different SNR values of 10, 20, and 30 dB is added to the recorded video files by the Intel AVCSR application, and the Intel AVCSR engines process the video files separately. These SNR values take the system from a high-noise condition to a low-noise condition. Figure 14 shows the comparison of the word error rate between the ASR, VSR, and AVSR engines in noisy environments. The results in Fig. 14 show that the average word error rate of the AVSR engine is much lower at the different SNR values than that of ASR and VSR. It is also obvious from Fig. 14 that the difference in average word error rate between AVSR and ASR becomes larger as the SNR decreases from 30 dB to 20 and 10 dB. Therefore, it is clear from Fig. 14 that the ASR engine is very sensitive to noise, while the sensitivity of the AVSR engine to the noise level is much lower. It is also noted that noise does not have a direct impact on the VSR engine.

4.1.3 Test 3: very noisy environments

To examine recognition accuracy in a very noisy environment, noise at 0 dB SNR was added to the recorded video files. The results in Fig. 15 show that the average word error rate of the AVSR engine is 22 %, while the average word error rate of the ASR engine is above 50 %. The results also indicate that, in terms of recognition accuracy, the AVSR engine works better in the proposed system in very noisy environments.


Fig. 14 (a) Word error rate in 30 dB SNR. (b) Word error rate in 20 dB SNR. (c) Word error rate in 10 dB SNR

Fig. 15 Comparison of word error rate in a noisy environment (0 dB SNR)

4.2 Comparison of test results

In this section, we compare the average word error rates of the Intel AVCSR engines across the tests. It is clear from the tests that the Intel AVCSR engines provide different results depending on many parameters, such as hardware specifications, noise, the narrator's English accent, and facial expressions. Figure 16 shows the comparison of the average word error rate of each test in the Intel AVCSR application. The results in Fig. 16 show that the AVSR engine has a lower word error rate at 0 dB, 10 dB, 20 dB, and 30 dB SNR and in noiseless environments (without noise) in the Intel AVCSR application, compared to the ASR and VSR engines. Therefore, it is clear from Fig. 16 that the Intel AVCSR application works well in our proposed system as an external AVSR engine, in terms of performance and recognition accuracy.

Fig. 16 Comparison of word error rate in the Intel AVCSR engines

4.3 Processing time of the system

In this section, we assume that the system works in real time, meaning that every time the text file is changed by the AVSR engine, the result appears immediately on the AR display. The processing time for recognizing a word and displaying it on the AR display therefore depends directly on the power of the AVSR engine and the hardware specifications.


Fig. 17 The processing time of the Intel AVCSR engines on average

Fig. 18 Statistical results of our survey with 100 deaf people

Figure 17 shows the processing time of the Intel AVCSR engines in four different steps. In each step, 8 random words were captured and recognized by the Intel AVCSR engines. It is noted from Fig. 17 that the average processing time of the Intel AVCSR engines is less than 10 s, which is a reasonable result for an offline AVSR engine, but still not good enough. Nevertheless, using a real-time AVSR engine would bring the average processing time below 2 s, which would be a very good result for this particular system [17].

4.4 Survey about the system

We conducted a survey with 100 deaf people to obtain their viewpoints about the system. Since we did not have the necessary hardware, such as an AR-HMD, to deploy the system, and the participants were not able to test it, the survey was conducted only with questionnaires, and the deaf participants could choose several answers simultaneously. In addition, we provided a manual for the system, containing its structure and working process with images, to familiarize the deaf participants with it.

The multiple-choice question "Do you intend to use such a system in the future?" helped us clarify our objectives for the survey. Figure 18 shows the statistical results of this survey. It is clear from Fig. 18 that more than 80 % of the deaf participants are very interested in using our system, and more than 95 % would like to use it on portable devices (or mobile phones) or in advanced AR glasses as an assistant. It is also noted from Fig. 18 that almost 10 % of the deaf participants would prefer not to see the narrator's face on the AR display while talking and would rather talk to the narrator face-to-face. However, almost all of the deaf people who participated in the survey believe that the system can be used as an assistant in many places.

5 Conclusions

This paper proposed a new system to help deaf people communicate with other people and vice versa. We found the text string to be the common factor between the AR and AVSR technologies when combining them. This is the first time that AR and AVSR technologies have been used together to make speech visible to deaf people. The results of testing the system in different environments showed that it performs very well in many situations in which a deaf person might find himself or herself. The comparison of word error rates across the Intel AVCSR engines showed that the AVSR engine has a lower word error rate at 0 dB, 10 dB, 20 dB, and 30 dB SNR and in noiseless environments (without noise) than the ASR and VSR engines, and that the Intel AVCSR engines work well in our proposed system in terms of performance and recognition accuracy. Also, the processing time results indicated that the average time for recognizing a word and displaying it on the AR display is less than 10 s using the Intel AVCSR application, which is reasonable for an offline AVSR engine; with a real-time AVSR engine, the average processing time would drop by about 8 s. Furthermore, the survey results showed that almost all of the deaf participants would like to use this system as an assistant in the future. We hope this system becomes an alternative tool for deaf people to improve their communication skills.

References

1. Sherman, W.R., Craig, A.B.: Understanding Virtual Reality. Morgan Kaufmann, San Mateo (2003)


2. Cawood, S., Falia, M.: Augmented Reality: A Practical Guide. Pragmatic Bookshelf (2008)

3. Arusoaie, A., Cristei, A.I., Livadariu, M.A., Manea, V., Iftene, A.: Augmented reality. In: Proc. of the 12th Int. Symposium on Symbolic and Numeric Algorithms for Scientific Computing, pp. 502–509. IEEE Comput. Soc., Los Alamitos (2010)

4. Schmalstieg, D., Wagner, D.: Experiences with handheld augmented reality. In: Proc. of the 6th Int. Symposium on Mixed and Augmented Reality, Japan, pp. 3–15. IEEE Press/ACM, New York (2007)

5. Silva, R., Oliveira, J.C., Giraldi, G.A.: Introduction to Augmented Reality. National Laboratory for Scientific Computation, LNCC research report No. 25, Brazil (2003)

6. Lange, B.S., Requejo, P., Flynn, S.M., Rizzo, A.A., Cuevas, F.J., Baker, L., Winstein, C.: The potential of virtual reality and gaming to assist successful aging with disability. J. Phys. Med. Rehabil. Clin. N. Am. 21(2), 339–356 (2010)

7. Zainuddin, N.M., Zaman, H.B.: Augmented reality in science education for deaf students: preliminary analysis. Presented at Regional Conf. on Special Needs Education, Faculty of Education, Malaya Univ. (2009)

8. Zayed, H.S., Sharawy, M.I.: ARSC: an augmented reality solution for the education field. Int. J. Comput. Educ. 56, 1045–1061 (2010)

9. Passig, D., Eden, S.: Improving flexible thinking in deaf and hard of hearing children with virtual reality technology. Am. Ann. Deaf 145(3), 286–291 (2000)

10. Kalra, A., Singh, S., Singh, S.: Speech recognition. Int. J. Comput. Sci. Netw. Secur. 10(6), 216–221 (2010)

11. Mosbah, B.B.: Speech recognition for disabilities people. In: Proc. of the 2nd Information and Communication Technologies (ICTTA), Syria, pp. 864–869 (2006)

12. Mihelic, F., Zibert, J.: Speech Recognition, Technologies and Applications. InTech Open Access Publisher (2008)

13. Bailly, G., Vatikiotis, E., Perrier, P.: Issues in Visual and Audio-Visual Speech Processing. MIT Press, Cambridge (2004)

14. Lipovic, I.: Speech and Language Technologies. InTech Open Access Publisher (2011)

15. Dupont, S., Luettin, J.: Audio-visual speech modeling for continuous speech recognition. IEEE Trans. Multimed. 2(3), 141–151 (2000)

16. Navarathna, R., Lucey, P., Dean, D., Fookes, C., Sridharan, S.: Lip detection for audio-visual speech recognition in-car environment. In: Proc. of the 10th Int. Conf. on Information Science, Signal Processing and Their Applications, pp. 598–601 (2010)

17. Shen, P., Tamura, S., Hayamizu, S.: Evaluation of real-time audio-visual speech recognition. Presented at Int. Conf. on Audio-Visual Speech Processing, Japan (2010)

18. Zainuddin, N.M.M., Zaman, H.B., Ahmad, A.: Developing augmented reality book for deaf in science: the determining factors. In: Proc. of the Int. Symposium in Information Technology (ITSim), pp. 1–4. IEEE Press, Los Alamitos (2010)

19. Lopez-Ludena, V., San-Segundo, R., Martin, R., Sanchez, D., Garcia, A.: Evaluating a speech communication system for deaf people. J. Latin Am. Trans. 9(4), 556–570 (2011)

20. Irawati, S., Green, S., Billinghurst, M., Duenser, A., Ko, H.: Move the couch where? Developing an augmented reality multimodal interface. In: Proc. of 5th Int. Symposium on Mixed and Augmented Reality, pp. 183–186. IEEE/ACM, Los Alamitos (2006)

21. Hanlon, N., Namee, B.M., Kelleher, J.D.: Just Say It: an evaluation of speech interfaces for augmented reality design applications. In: Proc. of the 20th Irish Conf. on Artificial and Cognitive Science (AICS), pp. 134–143 (2009)

22. Kaiser, E., Olwal, A., McGee, D., Benko, H., Corradini, A., Li, X., Cohen, P., Feiner, S.: Mutual disambiguation of 3D multimodal interaction in augmented and virtual reality. In: Proc. of the 5th Int. Conf. on Multimodal Interfaces, Vancouver, BC, Canada, pp. 12–19. ACM, New York (2003)

23. Goose, S., Sudarsky, S., Zhang, X., Navab, N.: Speech-enabled augmented reality supporting mobile industrial maintenance. Int. J. Pervasive Comput. Commun. 2(1), 65–70 (2003)

24. Chin, S.W., Ang, L.M., Seng, K.P.: Lips detection for audio-visual speech recognition system. Presented at Int. Symposium on Intelligent Signal Processing and Communication Systems, Thailand (2008)

25. Adobe Systems Inc.: Adobe flash builder (2011). http://www.adobe.com/products/flash-builder.html. Accessed 10 June 2011

26. Open Computer Vision Library: Open AVSR Alpha 1 (2011). http://sourceforge.net/projects/opencvlibrary/files/obsolete/. Accessed 12 May 2011

27. Bradski, G., Kaehler, A.: Learning OpenCV: computer vision with the OpenCV library. O'Reilly Media, Sebastopol (2008)

28. Liu, X.X., Zhao, Y., Pi, X., Liang, L.H., Nefian, A.V.: Audio-visual continuous speech recognition using a coupled hidden Markov model. In: Proc. of the 7th Int. Conf. on Spoken Language Processing, Denver, CO, pp. 213–216 (2002)

29. Braunstein, R., Wright, M.H., Noble, J.J.: ActionScript 3.0 Bible. Wiley, New York (2007)

30. Transmote: FLARManager: augmented reality in flash (2011). http://words.transmote.com/wp/flarmanager/. Accessed 8 May 2011

31. Spark Project Team: FLARToolKit (2011). http://www.libspark.org/wiki/saqoosha/FLARToolKit/en. Accessed 8 May 2011

32. Hohl, W.: Interactive environment with open-source software: 3D walkthrough and augmented reality for architects with Blender 2.43, DART 3.0 and ARToolkit 2.72. Springer Vienna Architecture (2008)

33. Hello Enjoy Company: Papervision 3D (2011). http://blog.papervision3d.org/. Accessed 8 May 2011

34. Spark Project Team: Marilena face detection (2011). http://www.libspark.org/wiki/mash/Marilena. Accessed 14 June 2011

Mohammad Reza Mirzaei is an M.Sc. student in Information Technology Engineering (Network and Communication Systems) at Sharif University of Technology, International Campus, Kish Island, Iran. He received his B.Sc. degree in Information Technology Engineering from the Iran University of Science and Technology (IUST), Tehran, Iran, in 2009. His major research interests are in mixed and augmented reality, virtual reality, computer vision, computer graphics, visualization, and human-computer interaction (HCI).


Seyed Ghorshi is an assistant professor of Communication Signal Processing at the Sharif University of Technology, International Campus, Kish Island, Iran. He received his B.Eng. in Electronic and Communications Engineering from the University of North London in 2002 and his Ph.D. in Communication Signal Processing from Brunel University in 2007. His interests lie mostly in the application of signal processing and information theory, including speech processing and image processing.

Mohammad Mortazavi is an assistant professor of Digital Electronics at the Sharif University of Technology, International Campus, Kish Island, Iran. He received his B.Sc., M.Sc., and Ph.D., all in Electrical Engineering, from the State University of New York (SUNY). His areas of expertise and interest are electronic design automation, CAD for VLSI, IC design and verification, and FPGA.